Update build system to ensure dirs.py is created when it is a
dependency for a build target. Also, update setup.py to
check for that dependency.
Fixes: 943c4a325045 ("python: set ovs.dirs variables with build system values")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
Currently all functions of the type *_is_new() always return
'false'. This patch resolves this issue by using the
'OVSDB_IDL_CHANGE_INSERT' 'change_seqno' instead of the
'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno' to determine if a row
is new and by resetting the 'OVSDB_IDL_CHANGE_INSERT'
'change_seqno' on clear.
Further to this, the code is also updated to match the following
behaviour:
When a row is inserted, the 'OVSDB_IDL_CHANGE_INSERT'
'change_seqno' is updated to match the new database
change_seqno. The 'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno'
is not set for inserted rows (only for updated rows).
At the end of a run, ovsdb_idl_db_track_clear() should be
called to clear all tracking information; this includes
resetting each row's 'change_seqno' to zero. This ensures
that subsequent runs will not see a previously 'new' row.
add_tracked_change_for_references() is updated to only
track rows that reference the current row.
Also, update the unit tests in order to exercise the
*_is_new() and *_is_deleted() functions.
Suggested-by: Dumitru Ceara <dceara@redhat.com>
Reported-at: https://bugzilla.redhat.com/1883562
Fixes: ca545a787ac0 ("ovsdb-idl.c: Increase seqno for change-tracking of table references.")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
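For reference, a minimal sketch of the corrected check using the public C
IDL accessor (the real per-table *_is_new() functions are generated by
ovsdb-idlc; the helper name here is illustrative):

    #include "ovsdb-idl.h"

    /* A row is "new" when an INSERT 'change_seqno' was recorded for it in
     * the current tracking run; MODIFY is set only for updated rows. */
    static bool
    example_row_is_new(const struct ovsdb_idl_row *row)
    {
        return ovsdb_idl_row_get_seqno(row, OVSDB_IDL_CHANGE_INSERT) > 0;
    }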
|
ovsdb_idl_set_condition() returns a sequence number that can be used to
check if the requested conditions are acknowledged by the server.
However, database bindings do not return this value to the user, making
it impossible to check if the conditions are accepted.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
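A sketch of how a C IDL client might use the returned value ('table_class'
and 'cond' stand for the monitored table and the requested condition; exact
integration varies by application):

    #include "ovsdb-idl.h"

    /* Request a new condition and remember the sequence number at which
     * it will be acknowledged. */
    unsigned int req_seqno = ovsdb_idl_set_condition(idl, table_class, &cond);

    /* Later, after ovsdb_idl_run(): the condition is in effect once the
     * acked sequence number catches up. */
    if (ovsdb_idl_get_condition_seqno(idl) >= req_seqno) {
        /* Server replies now reflect the requested condition. */
    }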
|
Currently, when an ovsdb *.db file is created by ovsdb-tool, it grants read
permission to others. This may raise security concerns: for example,
IPsec pre-shared keys are stored in ovs-vswitchd.conf.db.
This patch addresses the concern by removing the permission for others.
Reported-by: Antonin Bas <abas@vmware.com>
Acked-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
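The essence of the change, sketched (the mode bits follow the commit text;
'db_file_name' is illustrative, and the real code goes through OVS's file
helpers rather than a bare open()):

    #include <fcntl.h>
    #include <sys/stat.h>

    /* Create the database file with rw for owner and group only (0660)
     * instead of 0666, so "other" gets no read access. */
    int fd = open(db_file_name, O_RDWR | O_CREAT,
                  S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);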
|
Add a new appctl command, 'cluster/set-backlog-threshold', to configure
thresholds on the backlog of raft jsonrpc connections. It can be useful,
for example, in extreme conditions where the size of a database is
expected to be very large, i.e. comparable with the default 4GB threshold.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
RAFT messages can be fairly big. If something abnormal happens to
one of the servers in a cluster, it may not be able to process all the
incoming messages in a timely manner. This results in jsonrpc backlog
growth on the sender's side, for example if a follower gets many new
clients at once that it needs to serve, or if it decides to take a
snapshot during a period of many database changes.
If the backlog grows large enough, it becomes harder and harder for the
follower to process incoming raft messages; it sends outdated replies
and starts receiving snapshots and the whole raft log from the leader.
Sometimes the backlog grows too high (60GB in this example):
jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:<ip>,
num of msgs: 15370, backlog: 61731060773.
In this case the OS might actually decide to kill the sender to free
some memory. In any case, it could take a lot of time for such a server
to catch up with the rest of the cluster if it has so much data to
receive and process.
This patch introduces backlog thresholds for jsonrpc connections. If the
sending backlog exceeds particular values (500 messages or 4GB in size),
the connection is dropped and re-created. This discards the accumulated
backlog and starts over, increasing the chances of cluster recovery.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
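A self-contained sketch of the threshold check described above (the default
values come from the commit text; the function and parameter names are
illustrative, not the exact ovsdb-server code):

    #include <stdbool.h>
    #include <stdint.h>

    #define DEFAULT_MAX_BACKLOG_MSGS  500                         /* messages */
    #define DEFAULT_MAX_BACKLOG_BYTES (4ULL * 1024 * 1024 * 1024) /* 4GB */

    /* Returns true if a connection's send backlog exceeds either
     * threshold, in which case the caller drops and re-creates the
     * connection. */
    static bool
    backlog_exceeds_threshold(uint64_t n_msgs, uint64_t n_bytes)
    {
        return n_msgs > DEFAULT_MAX_BACKLOG_MSGS
               || n_bytes > DEFAULT_MAX_BACKLOG_BYTES;
    }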
|
The previous commit 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
took a "safe" approach and only avoided sending snapshot installation
requests that are exactly identical. However, it doesn't make much sense
to send more than one snapshot at a time. If an obsolete snapshot is
installed, the leader will re-send the most recent one.
With this change the leader keeps only one snapshot in flight per
connection. This reduces backlogs on raft connections in case a new
snapshot is created while an 'install_snapshot_request' is in progress,
or if the election timer changes in that period.
Also, not tracking the exact 'install_snapshot_request' we've sent
allows the code to be simplified.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Fixes: 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
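Conceptually, the tracking reduces to a single per-connection flag. A
sketch (the field name follows the commit's description; the exact
struct raft_conn layout is elided):

    /* Instead of remembering the last request sent, remember only
     * whether a snapshot is in flight on this connection. */
    if (!conn->install_snapshot_request_in_progress) {
        raft_send_install_snapshot_request(raft, conn);
        conn->install_snapshot_request_in_progress = true;
    }
    /* The flag is cleared when the install_snapshot_reply arrives; if
     * the installed snapshot is already obsolete, the most recent one
     * is sent next. */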
|
Compaction happens at most once in 10 minutes. That is a long time
interval for a heavily loaded ovsdb-server in cluster mode.
In 10 minutes raft logs can grow to tens of thousands of entries
with tens of gigabytes in total size.
While compaction cleans up raft log entries, in many cases the memory
is not returned to the system, but kept in the heap of the running
ovsdb-server process, and it can stay in this condition for a really
long time. In the end one performance spike can lead to fast growth
of the raft log, and this memory will never (or not for a really long
time) be released to the system even if the database is empty.
A simple example of how to reproduce this with the OVN sandbox:
1. make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'
2. Run following script that creates 1 port group, adds 4000 acls and
removes all of that in the end:
# cat ../memory-test.sh
pg_name=my_port_group
export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
ovn-nbctl pg-add $pg_name
for i in $(seq 1 4000); do
echo "Iteration: $i"
ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
done
ovn-nbctl acl-del $pg_name
ovn-nbctl pg-del $pg_name
ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
ovn-appctl -t ovn-nbctl exit
---
3. Stop one of the Northbound DB servers:
ovs-appctl -t $(pwd)/sandbox/nb1 exit
Make sure that ovsdb-server didn't compact the database before
it was stopped. Now we have a db file on disk that contains
4000 fairly big transactions.
4. Try to start the same ovsdb-server with this file.
# cd sandbox && ovsdb-server <...> nb1.db
At this point ovsdb-server reads all the transactions from the db
file and performs them one by one as fast as it can. When it
finishes, the raft log contains 4000 entries and ovsdb-server
consumes (on my system) ~13GB of memory while the database is
empty. And libc will likely never return this memory back to the
system, or, at least, will hold it for a really long time.
This patch adds a new command, 'ovsdb-server/memory-trim-on-compaction'.
It's disabled by default, but once enabled, ovsdb-server will call
'malloc_trim(0)' after every successful compaction to try to return
unused heap memory back to the system. This is glibc-specific, so we
need to detect the function's availability at build time.
It is disabled by default since it adds from 1% to 30% (depending on
the current state) to the snapshot creation time and, also, subsequent
memory allocations will likely require requests to the kernel, which
might be slower. It could be enabled by default later if considered
broadly beneficial.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
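The core of the change, sketched (the HAVE_MALLOC_TRIM guard stands in
for the configure-time detection mentioned above):

    #ifdef HAVE_MALLOC_TRIM
    #include <malloc.h>
    #endif

    /* After a successful compaction, ask glibc to return free heap
     * pages to the kernel.  On libcs without malloc_trim() this
     * compiles to a no-op. */
    static void
    trim_memory_if_enabled(bool trim_enabled)
    {
    #ifdef HAVE_MALLOC_TRIM
        if (trim_enabled) {
            malloc_trim(0);
        }
    #else
        (void) trim_enabled;
    #endif
    }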
|
In many cases a big part of the memory consumed by the ovsdb-server
process is the raft log, so it's important to add its length to the
memory report.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
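In essence, this is one more counter in the report (a sketch; the raft
struct's log_start/log_end indexes bound the in-memory log):

    /* The in-memory raft log holds entries [log_start, log_end), so
     * its length is the difference. */
    simap_increase(usage, "raft-log", raft->log_end - raft->log_start);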
|
If debug logs are enabled, "raft_is_connected: true" is printed on every
call to raft_is_connected(), which is far too frequent.
These messages are not very informative and only litter the log.
Let's log only the disconnected state, in a rate-limited way, and log
the positive case only once, at the moment the cluster becomes connected.
Fixes: 923f01cad678 ("raft.c: Set candidate_retrying if no leader elected since last election.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
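A sketch of the logging pattern using OVS's rate-limiting vlog macros
(the 'connected_now' and 'was_connected' names are illustrative; the
surrounding raft code is elided):

    #include "openvswitch/vlog.h"

    if (!connected_now) {
        /* Complain rarely while disconnected. */
        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
        VLOG_DBG_RL(&rl, "raft_is_connected: false");
    } else if (!was_connected) {
        /* Announce the healthy transition exactly once. */
        VLOG_DBG("raft_is_connected: true");
        was_connected = true;
    }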
|
The error should be destroyed before returning.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
While sending snapshots, the backlog on raft connections can quickly grow
over 4GB, which overflows the raft-backlog counter.
Let's report it in kB instead. (Using kB and not KB to match the
ru_maxrss counter reported by the kernel.)
Fixes: 3423cd97f88f ("ovsdb: Add raft memory usage to memory report.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
There is one remaining use under datapath. That change should happen
upstream in Linux first according to our usual policy.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
|
If a database enters an error state, e.g., in the RAFT case, when
applying the RAFT records read from the DB file triggers constraint
violations, there is no way to detect this unless a client generates a
write transaction. Such write transactions would fail with "ovsdb-error:
inconsistent data".
This commit adds a new command to show the status of the storage that's
backing a database.
Example, on an inconsistent database:
$ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
status: ovsdb error: inconsistent data
Example, on a consistent database:
$ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
status: ok
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
There are some occurrences where the database ends up in an inconsistent
state. This happened in ovn-k8s and is described in [0].
Here we are adding a supported way to check that a given db is consistent,
which is less error-prone than checking the logs.
Tested against both a valid db and a corrupted db attached to the
above bug [1]. Also tested with a fresh db that had not yet taken a
snapshot.
[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23
[1]: https://bugzilla.redhat.com/attachment.cgi?id=1697595
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Suggested-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
Builds on RHEL 8.2 systems are failing due to this issue.
See [1] for why this is necessary.
I used the following command to identify files that need this fix:
find . -type f -executable | /usr/lib/rpm/redhat/brp-mangle-shebangs
I also updated the copyright notices as needed.
1. https://fedoraproject.org/wiki/Changes/Make_ambiguous_python_shebangs_error
Fixes: 1ca0323e7c29 ("Require Python 3 and remove support for Python 2.")
Signed-off-by: Greg Rose <gvrose8192@gmail.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
Every time a follower has to install a snapshot received from the
leader, it should also replace the data in memory. Right now this only
happens when snapshots are installed that also change the schema.
This can lead to inconsistent DB data on follower nodes and the snapshot
may fail to get applied.
Fixes: bda1f6b60588 ("ovsdb-server: Don't disconnect clients after raft install_snapshot.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
These should belong to the OVN project, if not moved there already.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
- snapshot unit test has been added for "change-election-timer" related
patches.
- 100% CPU problem was addressed by:
2cd62f75c1 ("ovsdb raft: Precheck prereq before proposing commit.")
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
According to RFC 7047, 'timeout' is an integer field:
5.2.6. Wait
The "wait" object contains the following members:
"op": "wait" required
"timeout": <integer> optional
...
For some reason, the initial implementation treated it as a real number.
This causes a build issue with clang, which complains that LLONG_MAX
cannot be represented exactly as a double:
ovsdb/execution.c:733:32: error: implicit conversion from 'long long'
to 'double' changes value from
9223372036854775807 to 9223372036854775808
timeout_msec = MIN(LLONG_MAX, json_real(timeout));
~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/sys/limits.h:69:19: note: expanded from macro 'LLONG_MAX'
#define LLONG_MAX __LLONG_MAX /* max for a long long */
^~~~~~~~~~~
/usr/include/x86/_limits.h:74:21: note: expanded from macro '__LLONG_MAX'
#define __LLONG_MAX 0x7fffffffffffffffLL /* max value for a long long */
^~~~~~~~~~~~~~~~~~~~
./lib/util.h:90:21: note: expanded from macro 'MIN'
#define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
^ ~
Fix that by changing the parser to treat 'timeout' as an integer.
This fixes the clang build on FreeBSD 12.1 in CirrusCI.
Fixes: f85f8ebbfac9 ("Initial implementation of OVSDB.")
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
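The fix, in essence (json_integer() is the integer accessor in OVS's
json library; the exact expression in ovsdb/execution.c may differ
slightly):

    /* Before: timeout_msec = MIN(LLONG_MAX, json_real(timeout));
     * After: read the RFC 7047 <integer> directly, so no long long ->
     * double conversion is involved. */
    timeout_msec = json_integer(timeout);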
|
Snapshots are huge. In some cases we can receive several outdated append
replies from a remote server. This can happen at high scale if the
remote server is overloaded and not able to process all the raft
requests in time. In response to each outdated append reply we send a
full database snapshot. While the remote server is already overloaded,
those snapshots get stuck in the jsonrpc backlog for a long time, making
it grow up to a few GB. Since the remote server wasn't able to process
incoming messages in time, it will likely be unable to process the
snapshots either, leading to the same situation with little chance of
recovery. The remote server will likely get stuck in the 'candidate'
state, and other servers will grow their memory consumption due to
growing jsonrpc backlogs:
jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.16.0.3:6644,
num of msgs: 3795, backlog: 8838994624.
This patch avoids that situation by not sending identical snapshot
installation requests. This helps maintain reasonable memory consumption
and allows the cluster to recover at larger scale.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
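A sketch of the suppression (the field and helper names are illustrative;
the real code remembers the last request sent on each connection and
compares its fields):

    /* Skip sending if an identical install_snapshot_request is already
     * queued on this connection. */
    if (conn->last_install_snapshot_request
        && install_snapshot_request_equals(
               rq, conn->last_install_snapshot_request)) {
        return;
    }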
|
The parse_txn() function doesn't always take ownership of the 'schema'
passed to it. So, if the schema of the clustered db has the same version
as the one already in use, parse_txn() will not use it, resulting in a
memory leak:
7,827 (56 direct, 7,771 indirect) bytes in 1 blocks are definitely lost
at 0x483BB1A: calloc (vg_replace_malloc.c:762)
by 0x44AD02: xcalloc (util.c:121)
by 0x40E70E: ovsdb_schema_create (ovsdb.c:41)
by 0x40EA6D: ovsdb_schema_from_json (ovsdb.c:217)
by 0x415EDD: ovsdb_storage_read (storage.c:280)
by 0x408968: read_db (ovsdb-server.c:607)
by 0x40733D: main_loop (ovsdb-server.c:227)
by 0x40733D: main (ovsdb-server.c:469)
While we could put ovsdb_schema_destroy() in a few places inside
parse_txn(), from the user's point of view it seems better to have a
constant argument and just clone the 'schema' when needed. The caller
is then responsible for destroying the 'schema' it owns.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
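A sketch of the ownership contract after the fix (signatures abbreviated;
ovsdb_schema_clone(), ovsdb_create(), ovsdb_replace(), and
ovsdb_schema_destroy() are the existing ovsdb helpers):

    /* Inside parse_txn(), which now takes
     * 'const struct ovsdb_schema *schema': install a clone rather than
     * taking ownership of the caller's copy. */
    ovsdb_replace(db, ovsdb_create(ovsdb_schema_clone(schema), NULL));

    /* And in the caller, after parse_txn() returns, unconditionally: */
    ovsdb_schema_destroy(schema);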
|
Memory reports can be found in the logs or obtained with the
'memory/show' appctl command. For ovsdb-server they include information
about db cells, monitor connections with their backlog size, etc. But
they don't contain any information about the memory consumed by raft.
Backlogs of raft connections can be insanely large because of snapshot
installation requests that simply contain the whole database. In
unhealthy clusters, where one of the ovsdb servers is not able to handle
all the incoming raft traffic in time, the backlog on the sender's side
can cause significant memory consumption issues.
Add new 'raft-connections' and 'raft-backlog' counters to the
memory report to better track such conditions.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
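A sketch of the accounting hook (close to what such a reporting function
looks like; the list-member names are abbreviated):

    void
    raft_get_memory_usage(const struct raft *raft, struct simap *usage)
    {
        struct raft_conn *conn;
        uint64_t backlog = 0;
        int cnt = 0;

        LIST_FOR_EACH (conn, list_node, &raft->conns) {
            backlog += jsonrpc_session_get_backlog(conn->js);
            cnt++;
        }
        simap_increase(usage, "raft-backlog", backlog);
        simap_increase(usage, "raft-connections", cnt);
    }

Note that a plain byte counter overflows once the backlog passes 4GB,
which is what the "report it in kB" commit earlier in this log addresses.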
|
We had a user express confusion about the state of a cluster after using
cluster/leave. The user had a three server cluster and used
cluster/leave to remove two servers from the cluster. The user expected
that the single server left would not function since the quorum of two
servers for a three server cluster was not met.
In actuality, cluster/leave removes the server from the cluster and
alters the cluster size in the process. Thus the single remaining server
continued to function since quorum was reached.
This documentation change makes it a bit more explicit that
cluster/leave alters the size of the cluster and cites the three server
down to one server case as an example.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1798158
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
In a scale test with a 640-node k8s cluster, the raft DB nodes' jsonrpc
sessions got closed due to the default 5-second probe timeout, which
disturbs the raft cluster. Since RAFT already has its own heartbeat,
just disable the inactivity probe between the servers to avoid the
unnecessary jsonrpc probe.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Zhen Wang <zhewang@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
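The gist, sketched (jsonrpc_session_set_probe_interval() is the existing
per-session knob; an interval of 0 disables the inactivity probe):

    /* Raft supplies its own heartbeats, so the generic jsonrpc
     * inactivity probe is redundant on inter-server connections. */
    jsonrpc_session_set_probe_interval(conn->js, 0);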
|
An example of the pattern is shown below:
void
ovsrec_ct_zone_index_set_external_ids(const struct ovsrec_ct_zone...
{
// 1. alloc_fn: Storage is returned from allocation function xmalloc.
// 2. var_assign: Assigning: datum = storage returned from xmalloc(24UL).
struct ovsdb_datum *datum = xmalloc(sizeof(struct ovsdb_datum));
// 3. Condition external_ids, taking false branch.
if (external_ids) {
...
} else {
// 4. noescape: Resource datum is not freed or pointed-to in ovsdb_datum_init_empty.
ovsdb_datum_init_empty(datum);
}
// 5. noescape: Resource datum is not freed or pointed-to in ovsdb_idl_index_write.
ovsdb_idl_index_write(CONST_CAST(struct ovsdb_idl_row *, &row->header_),
&ovsrec_ct_zone_columns[OVSREC_CT_ZONE_COL_EXTERNAL_IDS],
datum,
&ovsrec_table_classes[OVSREC_TABLE_CT_ZONE]);
// CID 1420856 (#1 of 1): Resource leak (RESOURCE_LEAK)
// 6. leaked_storage: Variable datum going out of scope leaks the storage
//    it points to.
Fix it by freeing the datum.
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
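The fix, in essence, is a one-liner after the write in the generated code
(a sketch; the datum's contents are consumed by the write, so only the
heap container itself was leaking):

    ovsdb_idl_index_write(CONST_CAST(struct ovsdb_idl_row *, &row->header_),
                          &ovsrec_ct_zone_columns[OVSREC_CT_ZONE_COL_EXTERNAL_IDS],
                          datum,
                          &ovsrec_table_classes[OVSREC_TABLE_CT_ZONE]);
    free(datum);    /* Release the container after the write. */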
|
Coverity shows the following memory leak in this code pattern:
void
ovsrec_ipfix_index_set_obs_domain_id(...
{
struct ovsdb_datum datum;
// 1. alloc_fn: Storage is returned from allocation function xmalloc.
// 2. var_assign: Assigning: key = storage returned from xmalloc(16UL).
union ovsdb_atom *key = xmalloc(sizeof(union ovsdb_atom));
// 3. Condition n_obs_domain_id, taking false branch.
if (n_obs_domain_id) {
datum.n = 1;
datum.keys = key;
key->integer = *obs_domain_id;
} else {
datum.n = 0;
datum.keys = NULL;
}
datum.values = NULL;
ovsdb_idl_index_write(CONST_CAST(struct ovsdb_idl_row *, &row->head...
// CID 1420891 (#1 of 1): Resource leak (RESOURCE_LEAK)
Fixed it by moving the xmalloc to the true branch.
Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
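The corrected pattern, per the description above (the allocation now
happens only on the branch that actually uses it):

    if (n_obs_domain_id) {
        union ovsdb_atom *key = xmalloc(sizeof *key);
        datum.n = 1;
        datum.keys = key;
        key->integer = *obs_domain_id;
    } else {
        datum.n = 0;
        datum.keys = NULL;
    }
    datum.values = NULL;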
|
The raft_command_initiate() function returns a correctly referenced
command instance. 'n_ref' equals 1 for complete commands and 2 for
incomplete commands, because one more reference is held in the
raft->commands list.
raft_handle_execute_command_request__() leaks the reference by not
returning the pointer anywhere and not unreferencing incomplete commands.
792 bytes in 11 blocks are definitely lost in loss record 258 of 262
at 0x483BB1A: calloc (vg_replace_malloc.c:762)
by 0x44BA32: xcalloc (util.c:121)
by 0x422E5F: raft_command_create_incomplete (raft.c:2038)
by 0x422E5F: raft_command_initiate (raft.c:2061)
by 0x428651: raft_handle_execute_command_request__ (raft.c:4161)
by 0x428651: raft_handle_execute_command_request (raft.c:4177)
by 0x428651: raft_handle_rpc (raft.c:4230)
by 0x428651: raft_conn_run (raft.c:1445)
by 0x428DEA: raft_run (raft.c:1803)
by 0x407392: main_loop (ovsdb-server.c:226)
by 0x407392: main (ovsdb-server.c:469)
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: William Tu <u9012063@gmail.com>
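The fix, in essence (argument list abbreviated; for incomplete commands
the raft->commands list keeps its own reference, so the local one can be
dropped immediately):

    struct raft_command *cmd = raft_command_initiate(raft, data,
                                                     servers, &eid);
    raft_command_unref(cmd);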
|
ovsdb_function_from_string() and ovsdb_function_to_string() are defined,
with identical definitions, in both ovsdb/condition.c and
lib/ovsdb-condition.c. Remove the copies in ovsdb/condition.c to avoid
the duplication. This also resolves the following bazel build error:
./libopenvswitch.lo(ovsdb-condition.pic.o): In function `ovsdb_function_from_string':
/lib/ovsdb-condition.c:24: multiple definition of `ovsdb_function_from_string'
./libovsdb.a(condition.pic.o):/proc/self/cwd/external/openvswitch_repo/ovsdb/condition.c:34: first defined here
./libopenvswitch.lo(ovsdb-condition.pic.o): In function `ovsdb_function_from_string':
./lib/ovsdb-condition.c:24: multiple definition of `ovsdb_function_to_string'
./libovsdb.a(condition.pic.o):/proc/self/cwd/external/openvswitch_repo/ovsdb/condition.c:335
Reported-by: Harold Lim <haroldl@vmware.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
|
We do not check metadata (mtime, atime) anywhere, so we do not need to
update it every time we sync the log. If the system supports it, the
log update should be data-only.
Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: William Tu <u9012063@gmail.com>
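The idea, sketched (the HAVE_FDATASYNC guard stands in for a platform
check; fdatasync() flushes file data without the metadata update):

    #include <errno.h>
    #include <unistd.h>

    /* Sync only the file data: mtime/atime are metadata we never read
     * back, so a full fsync() is unnecessary where fdatasync() exists. */
    static int
    sync_log_data(int fd)
    {
    #ifdef HAVE_FDATASYNC
        return fdatasync(fd) ? errno : 0;
    #else
        return fsync(fd) ? errno : 0;
    #endif
    }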
|
During an election there shouldn't be any leader. This change makes sure
that a server in the candidate role always reports the leader as
"unknown".
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Sometimes a server can stay in the candidate role forever, even if the
server already sees the new leader and handles append-requests normally.
However, because of the wrong role, it appears as disconnected from the
cluster and so its clients are disconnected.
This problem happens when 2 servers become candidates in the same
term, and one of them is elected as leader in that term. It can be
reproduced by the test cases added in this patch.
The root cause is that the current implementation only changes the role
to follower when a bigger term is observed (in raft_receive_term__()).
According to the RAFT paper, if another candidate becomes leader in the
same term, the candidate should change to follower.
This patch fixes it by changing the role to follower when leader
is being updated in raft_update_leader().
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
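The fix, in essence (a sketch of the role change added inside
raft_update_leader()):

    /* RAFT paper, section 5.2: if another candidate wins the election
     * in our term, step down to follower upon learning of the leader. */
    if (raft->role == RAFT_CANDIDATE) {
        raft->role = RAFT_FOLLOWER;
    }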
|
When a leader handles an install_snapshot reply, the next_index for the
follower should be log_start instead of log_end, because there can be
new entries added to the leader's log after initiating the
install_snapshot procedure. Also, it should send all the accumulated
entries to the follower in the following append-request message, instead
of sending 0 entries, to speed up convergence.
Without this fix there is no functional problem, but it takes
unnecessary extra rounds of append-requests answered with
"inconsistency" by the follower before finally converging.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
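The change, sketched (names as in ovsdb/raft.c; raft_send_append_request()
takes the number of entries to include):

    /* On a successful install_snapshot reply: resume the follower right
     * after the snapshot and ship everything accumulated since then. */
    s->next_index = raft->log_start;
    raft_send_append_request(raft, s, raft->log_end - s->next_index,
                             "snapshot installed");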
|
When a follower needs to "catch up", the leader can send N entries in
a single append_request instead of only one entry per message.
The function raft_send_append_request() already supports this, so
this patch just calculates the correct "n" and uses it.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
When a node becomes leader, it sends out a heartbeat to all followers
and then immediately sends out another append-request to all followers
for a no-op command execution. This causes two consecutive
append-requests to be sent to each follower, and the first heartbeat
append-request is unnecessary. This patch removes the heartbeat.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
When a server doesn't see a leader yet, e.g. during leader re-election,
a transaction coming from a client will cause a 100% CPU busy loop.
With debug logs enabled it looks like this:
2020-02-28T04:04:35.631Z|00059|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00062|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00065|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00068|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00071|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00074|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
2020-02-28T04:04:35.631Z|00077|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
...
The problem is that in ovsdb_trigger_try(), all cluster errors are
treated as temporary errors and retried immediately. This patch fixes it
by introducing 'run_triggers_now', which tells whether a retry is needed
immediately. When the cluster error carries the detail 'not leader', we
don't retry immediately but wait for the next poll event to trigger the
retry. When the 'not leader' status changes, there must be an event,
i.e. a raft RPC that changes the status, so the trigger is guaranteed to
fire, without a busy loop.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
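A sketch of the retry gating (the error inspection and field placement
are simplified relative to ovsdb/trigger.c; 'is_not_leader' stands in
for the detail check):

    if (!strcmp(ovsdb_error_get_tag(error), "cluster error")) {
        /* "not leader" cannot clear until some raft RPC arrives, so
         * defer the retry to the next poll event instead of spinning. */
        db->run_triggers = true;
        db->run_triggers_now = !is_not_leader;
    } else {
        db->run_triggers_now = db->run_triggers = true;
    }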
|
If no leader was ever known to the current server, its status should be
"disconnected" from the cluster. Without this patch, when a server in a
cluster is restarted, it appears as connected before it successfully
connects back to the cluster, which is wrong.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
When "schema" field is found in read_db(), there can be two cases:
1. There is a schema change in clustered DB and the "schema" is the new one.
2. There is a install_snapshot RPC happened, which caused log compaction on the
server and the next log is just the snapshot, which always constains "schema"
field, even though the schema hasn't been changed.
The current implementation doesn't handle case 2), and always assume the schema
is changed hence disconnect all clients of the server. It can cause stability
problem when there are big number of clients connected when this happens in
a large scale environment.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Acked-by: Han Zhou <hzhou@ovn.org>
Requested-by: Leonid Ryzhyk <lryzhyk@vmware.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
When ovsdb-server is in backup mode and connects to the active
ovsdb-server for replication, and it takes more than 5 seconds to
get the dump of the whole database, it will drop the connection
soon after, as the default probe interval is 5 seconds. This
results in a snowball effect of reconnections to the active
ovsdb-server.
This patch mitigates the issue by setting the default probe interval
to 60 seconds and providing an option to configure this value via a
unixctl command. Another option would be to increase
'RECONNECT_DEFAULT_PROBE_INTERVAL' to a higher value.
Acked-by: Mark Michelson <mmichels@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
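A sketch of the mitigation (the probe interval is a per-session value in
milliseconds; the 'session' variable is illustrative):

    /* Give the backup server a minute to receive the initial database
     * dump before the inactivity probe may declare the connection
     * dead. */
    jsonrpc_session_set_probe_interval(session, 60 * 1000);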
|
The cluster doesn't work after all nodes are restarted after DB
compaction, unless there is some transaction between the DB compaction
and the restart. The error log looks like:
raft|ERR|internal error: deferred vote_request message completed but not ready
to send because message index 9 is past last synced index 0: s2 vote_request:
term=6 last_log_index=9 last_log_term=4
The root cause is that the log_synced member is not initialized when
reading the raft header. This patch fixes it and removes the XXX
from the test case.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Commit a76ba825 took care of saving and restoring the election timer in
the file header snapshot, but it didn't handle parsing of the election
timer in the install_snapshot_request/reply RPCs, which results in
problems, e.g. when the election timer change log is compacted into a
snapshot and then a new node joins the cluster, the new node uses the
default timer instead of the new value. This patch fixes it by parsing
the election timer in the snapshot RPCs.
At the same time the patch updates the test case to cover the DB
compaction and join scenario. The test reveals another two problems
related to clustered DB compaction, as commented in the test case's XXX,
which need to be addressed separately.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
A memory leak was detected by valgrind during execution
of the "database commands -- positive checks" test.
The leaked memory was allocated in the ovsdb_execute_mutate() function
while parsing mutations from the incoming json entity:
==19563== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==19563== by 0x4652D0: xmalloc (util.c:138)
==19563== by 0x46539E: xmemdup0 (util.c:168)
==19563== by 0x4653F7: xstrdup (util.c:177)
==19563== by 0x450379: ovsdb_base_type_clone (ovsdb-types.c:208)
==19563== by 0x450F8D: ovsdb_type_clone (ovsdb-types.c:550)
==19563== by 0x428C3F: ovsdb_mutation_from_json (mutation.c:108)
==19563== by 0x428F6B: ovsdb_mutation_set_from_json (mutation.c:187)
==19563== by 0x42578D: ovsdb_execute_mutate (execution.c:573)
==19563== by 0x4246B0: ovsdb_execute_compose (execution.c:171)
==19563== by 0x41CDE5: ovsdb_trigger_try (trigger.c:204)
==19563== by 0x41C8DF: ovsdb_trigger_init (trigger.c:61)
==19563== by 0x40E93C: ovsdb_jsonrpc_trigger_create (jsonrpc-server.c:1135)
==19563== by 0x40E20C: ovsdb_jsonrpc_session_got_request (jsonrpc-server.c:1002)
==19563== by 0x40D1C2: ovsdb_jsonrpc_session_run (jsonrpc-server.c:561)
==19563== by 0x40D31E: ovsdb_jsonrpc_session_run_all (jsonrpc-server.c:591)
==19563== by 0x40CD6E: ovsdb_jsonrpc_server_run (jsonrpc-server.c:406)
==19563== by 0x40627E: main_loop (ovsdb-server.c:209)
==19563== by 0x406E66: main (ovsdb-server.c:460)
This memory is usually freed at the end of ovsdb_execute_mutate();
however, in the aforementioned test case this does not happen. Namely,
in the case of the delete mutator, and in the case of an error while
calling ovsdb_datum_from_json(), the mutation in question was marked as
invalid, which prevented the problematic memory from being freed.
The memory leak can be reproduced quickly with the following command
sequence:
ovs-vsctl --no-wait -vreconnect:emer add-zone-tp netdev zone=1 icmp_first=1 icmp_reply=2
ovs-vsctl --no-wait -vreconnect:emer del-zone-tp netdev zone=1
Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
A memory leak happens while converting an existing database into a new
database according to the specified schema (ovsdb-client convert
new-schema). The leak was detected by valgrind while executing the
functional test "schema conversion online - clustered":
==16202== 96 bytes in 6 blocks are definitely lost in loss record 326 of 399
==16202== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16202== by 0x44A5D4: xmalloc (util.c:138)
==16202== by 0x4377A6: alloc_default_atoms (ovsdb-data.c:315)
==16202== by 0x437F18: ovsdb_datum_init_default (ovsdb-data.c:918)
==16202== by 0x413D82: ovsdb_row_create (row.c:59)
==16202== by 0x40AA53: ovsdb_convert_table (file.c:220)
==16202== by 0x40AA53: ovsdb_convert (file.c:275)
==16202== by 0x416BE1: ovsdb_trigger_try (trigger.c:255)
==16202== by 0x40D29E: ovsdb_jsonrpc_trigger_create (jsonrpc-server.c:1119)
==16202== by 0x40D29E: ovsdb_jsonrpc_session_got_request (jsonrpc-server.c:986)
==16202== by 0x40D29E: ovsdb_jsonrpc_session_run (jsonrpc-server.c:556)
==16202== by 0x40D29E: ovsdb_jsonrpc_session_run_all (jsonrpc-server.c:586)
==16202== by 0x40D29E: ovsdb_jsonrpc_server_run (jsonrpc-server.c:401)
==16202== by 0x40682E: main_loop (ovsdb-server.c:209)
==16202== by 0x40682E: main (ovsdb-server.c:460)
The problem was in the ovsdb_datum_convert() function, which overrides
pointers to datum memory allocated in the ovsdb_row_create() function.
The fix frees this memory before ovsdb_datum_convert() is called.
Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Presently, replication is not allowed if there is a schema version
mismatch between the schema returned by the active ovsdb-server and the
local db schema. This is causing failures in OVN DB HA deployments
during upgrades.
In the case of an OpenStack tripleo deployment with OVN, the OVN DB
ovsdb-servers are deployed on a multi-node controller cluster in
active/standby mode. During minor updates or major upgrades, the nodes
in the cluster are updated one at a time. If node A is running the
active OVN DB ovsdb-servers and it is updated, another node B becomes
active. After the update, when the OVN DB ovsdb-servers on A are
started, they fail to replicate from the active server if there is a
schema version mismatch.
This patch addresses the issue by allowing replication even if there is
a schema version mismatch, but only if all the active db schema's tables
and their columns are present in the local db schema.
This should not result in any data loss.
Signed-off-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Commit [1] force drops all connections when the db read/write status
changes. Prior to commit [1], when there was a read/write status change,
existing jsonrpc sessions with 'db_change_aware' set to true were not
updated with the changed 'read_only' value. If the db status was changed
to 'standby', the existing clients could still write to the db.
In the case of pacemaker OVN HA, the OVN OCF script's 'start' action
starts the ovsdb-servers in read-only state and later sets them to
read-write in the 'promote' action. We have observed that if some
ovn-controllers connect to the SB ovsdb-server (in read-only state) just
before the 'promote' action, the connection is not always reset, and
these ovn-controllers remain connected to the SB ovsdb-server in
read-only state indefinitely. Even though commit [1] calls
'ovsdb_jsonrpc_server_reconnect()' with the 'forced' flag set to true
when the db read/write status changes, somehow the FSM misses resetting
the connections of these ovn-controllers.
I think this needs to be addressed in the FSM. This patch doesn't
address that FSM issue. Instead, it changes the behavior of
'ovsdb_jsonrpc_server_set_read_only()' to set the 'read_only' flag of
all the jsonrpc sessions instead of forcefully resetting the
connections.
I think there is no need to reset the connections. In large scale
production deployments with OVN, resetting results in an unnecessary
waste of CPU cycles, as ovn-controllers will have to connect twice -
once during the 'start' action and again during 'promote'.
[1] - 2a9679e3b2c6("ovsdb-server: drop all connections on read/write status change")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|
Valgrind detects this leak while running the functional test
"ovsdb-client backup and restore":
==25401== 1,068 (240 direct, 828 indirect) bytes in 6 blocks are definitely lost in loss record 22 of 22
==25401== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25401== by 0x449DA4: xmalloc (util.c:138)
==25401== by 0x43012D: json_create (json.c:1451)
==25401== by 0x43012D: json_array_create_empty (json.c:186)
==25401== by 0x43012D: json_parser_push_array (json.c:1279)
==25401== by 0x4303CF: json_parser_input (json.c:1407)
==25401== by 0x4312F1: json_lex_input (json.c:991)
==25401== by 0x43193B: json_parser_feed (json.c:1149)
==25401== by 0x4329FA: jsonrpc_recv.part.7 (jsonrpc.c:332)
==25401== by 0x432D3B: jsonrpc_recv (jsonrpc.c:297)
==25401== by 0x432D3B: jsonrpc_recv_block (jsonrpc.c:402)
==25401== by 0x4330EB: jsonrpc_transact_block (jsonrpc.c:436)
==25401== by 0x409246: do_backup (ovsdb-client.c:2008)
==25401== by 0x405F76: main (ovsdb-client.c:282)
The problem was in the do_backup() function, where the _uuid json node
was detached from its parent "row" json node but never destroyed
afterwards.
Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
|