path: root/ovsdb
Commit log (most recent first): subject (author, date; files changed, lines -removed/+added)
* python: Update build system to ensure dirs.py is created. (Mark Gray, 2020-11-26; 1 file, -1/+1)
  Update build system to ensure dirs.py is created when it is a dependency for a build target. Also, update setup.py to check for that dependency.
  Fixes: 943c4a325045 ("python: set ovs.dirs variables with build system values")
  Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idl: Fix *_is_new() IDL functions. (Mark Gray, 2020-11-16; 1 file, -3/+19)
  Currently all functions of the type *_is_new() always return 'false'. This patch resolves the issue by using the 'OVSDB_IDL_CHANGE_INSERT' 'change_seqno' instead of the 'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno' to determine if a row is new, and by resetting the 'OVSDB_IDL_CHANGE_INSERT' 'change_seqno' on clear.
  Further to this, the code is also updated to match the following behaviour: when a row is inserted, the 'OVSDB_IDL_CHANGE_INSERT' 'change_seqno' is updated to match the new database change_seqno. The 'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno' is not set for inserted rows (only for updated rows). At the end of a run, ovsdb_idl_db_track_clear() should be called to clear all tracking information; this includes resetting all row 'change_seqno' to zero, which ensures that subsequent runs will not see a previously 'new' row. add_tracked_change_for_references() is updated to only track rows that reference the current row.
  Also, update unit tests in order to test the *_is_new() and *_is_deleted() functions.
  Suggested-by: Dumitru Ceara <dceara@redhat.com>
  Reported-at: https://bugzilla.redhat.com/1883562
  Fixes: ca545a787ac0 ("ovsdb-idl.c: Increase seqno for change-tracking of table references.")
  Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
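  As a minimal sketch (not part of the commit) of how the fixed tracking behaviour is typically consumed; it assumes bindings generated by ovsdb-idlc with the usual 'ovsrec_' prefix and a 'Bridge' table, so the per-table names below are illustrative:
    #include "ovsdb-idl.h"
    #include "vswitch-idl.h"

    static void
    process_tracked_bridges(struct ovsdb_idl *idl)
    {
        const struct ovsrec_bridge *br;

        OVSREC_BRIDGE_FOR_EACH_TRACKED (br, idl) {
            if (ovsrec_bridge_is_deleted(br)) {
                /* Row removed since the last track_clear(). */
            } else if (ovsrec_bridge_is_new(br)) {
                /* With this fix, an inserted row shows up as 'new' here
                 * exactly once per run. */
            } else {
                /* Existing row that was modified. */
            }
        }

        /* Reset tracking state so the next run does not report a
         * previously 'new' row again. */
        ovsdb_idl_track_clear(idl);
    }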
* ovsdb-idlc: Return expected sequence number while setting conditions. (Ilya Maximets, 2020-11-16; 1 file, -3/+3)
  ovsdb_idl_set_condition() returns a sequence number that can be used to check if the requested conditions are acknowledged by the server. However, the generated database bindings do not return this value to the user, making it impossible to check if the conditions are accepted.
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
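  As an illustration only (not part of the commit), this is how a returned condition sequence number is typically used with the core C IDL; the generated per-table wrappers fixed here are expected to follow the same pattern, and 'ovsrec_table_bridge' is the usual generated table class name, used purely as an example:
    #include "ovsdb-idl.h"
    #include "vswitch-idl.h"

    static unsigned int
    request_condition(struct ovsdb_idl *idl)
    {
        struct ovsdb_idl_condition cond;
        unsigned int req_seqno;

        ovsdb_idl_condition_init(&cond);
        ovsdb_idl_condition_add_clause_true(&cond);  /* Match every row. */
        req_seqno = ovsdb_idl_set_condition(idl, &ovsrec_table_bridge, &cond);
        ovsdb_idl_condition_destroy(&cond);

        /* Later, after ovsdb_idl_run(), the request is acknowledged once
         * ovsdb_idl_get_condition_seqno(idl) reaches req_seqno. */
        return req_seqno;
    }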
* ovsdb: Remove read permission of *.db from others. (Yi-Hung Wei, 2020-11-10; 1 file, -1/+1)
  Currently, when an ovsdb *.db file is created by ovsdb-tool, it grants read permission to others. This may incur security concerns, for example, IPsec pre-shared keys are stored in ovs-vswitchd.conf.db. This patch addresses the concerns by removing the permission for others.
  Reported-by: Antonin Bas <abas@vmware.com>
  Acked-by: Mark Gray <mark.d.gray@redhat.com>
  Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Make backlog thresholds configurable. (Ilya Maximets, 2020-11-10; 2 files, -5/+55)
  New appctl command 'cluster/set-backlog-threshold' to configure thresholds on the backlog of raft jsonrpc connections. Could be used, for example, in some extreme conditions where the size of a database is expected to be very large, i.e. comparable with the default 4GB threshold.
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Set threshold on backlog for raft connections. (Ilya Maximets, 2020-11-10; 1 file, -0/+5)
  RAFT messages could be fairly big. If something abnormal happens to one of the servers in a cluster, it may not be able to process all the incoming messages in a timely manner. This results in jsonrpc backlog growth on the sender's side, for example if a follower gets many new clients at once that it needs to serve, or if it decides to take a snapshot during a period with a high number of database changes. If the backlog grows large enough, it becomes harder and harder for the follower to process incoming raft messages; it sends outdated replies and starts receiving snapshots and the whole raft log from the leader. Sometimes the backlog grows too high (60GB in this example):
    jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:<ip>, num of msgs: 15370, backlog: 61731060773.
  In this case the OS might actually decide to kill the sender to free some memory. In any case, it could take a lot of time for such a server to catch up with the rest of the cluster if it has so much data to receive and process.
  This patch introduces backlog thresholds for jsonrpc connections. If the sending backlog exceeds particular values (500 messages or 4GB in size), the connection is dropped and re-created. This drops all the current backlog and starts over, increasing the chances of cluster recovery.
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
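  A minimal sketch of the idea (not the actual patch): if the send backlog of a raft jsonrpc session exceeds a threshold, drop and re-create the connection rather than letting the backlog grow unbounded. The threshold constant and helper name below are illustrative assumptions; only the jsonrpc_session_* calls are real library functions:
    #include "jsonrpc.h"

    #define RAFT_MAX_BACKLOG_BYTES (4ULL << 30)  /* Assumed 4GB default. */

    static void
    check_raft_conn_backlog(struct jsonrpc_session *js)
    {
        /* jsonrpc_session_get_backlog() reports the bytes queued but not
         * yet sent on this session. */
        if (jsonrpc_session_get_backlog(js) > RAFT_MAX_BACKLOG_BYTES) {
            /* Dropping the connection discards the queued messages; raft's
             * own retry logic re-sends whatever is still needed once the
             * session reconnects. */
            jsonrpc_session_force_reconnect(js);
        }
    }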
* raft: Avoid having more than one snapshot in-flight. (Ilya Maximets, 2020-11-03; 3 files, -29/+18)
  The previous commit 8c2c503bdb0d ("raft: Avoid sending equal snapshots.") took a "safe" approach of only avoiding sending exactly equal snapshot installation requests. However, it doesn't make much sense to send more than one snapshot at a time. If an obsolete snapshot was installed, the leader will re-send the most recent one.
  With this change the leader will have only one snapshot in-flight per connection. This reduces backlogs on raft connections in case a new snapshot is created while an 'install_snapshot_request' is in progress, or if the election timer changed in that period. Also, not tracking the exact 'install_snapshot_request' we've sent allows the code to be simplified.
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
  Fixes: 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Reclaim heap memory after compaction. (Ilya Maximets, 2020-11-03; 4 files, -4/+56)
  Compaction happens at most once in 10 minutes. That is a big time interval for a heavily loaded ovsdb-server in cluster mode. In 10 minutes raft logs could grow up to tens of thousands of entries with tens of gigabytes in total size. While compaction cleans up raft log entries, in many cases the memory is not returned to the system, but kept in the heap of the running ovsdb-server process, and it could stay in this condition for a really long time. In the end one performance spike could lead to fast growth of the raft log, and this memory will never (or not for a really long time) be released to the system even if the database is empty.
  A simple example of how to reproduce this with the OVN sandbox:
  1. make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'
  2. Run the following script that creates 1 port group, adds 4000 acls and removes all of that in the end:
       # cat ../memory-test.sh
       pg_name=my_port_group
       export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
       ovn-nbctl pg-add $pg_name
       for i in $(seq 1 4000); do
           echo "Iteration: $i"
           ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
       done
       ovn-nbctl acl-del $pg_name
       ovn-nbctl pg-del $pg_name
       ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
       ovn-appctl -t ovn-nbctl exit
       ---
  3. Stop one of the Northbound DB servers:
       ovs-appctl -t $(pwd)/sandbox/nb1 exit
     Make sure that ovsdb-server didn't compact the database before it was stopped. Now we have a db file on disk that contains 4000 fairly big transactions inside.
  4. Try to start the same ovsdb-server with this file:
       # cd sandbox && ovsdb-server <...> nb1.db
     At this point ovsdb-server reads all the transactions from the db file and performs them as fast as it can one by one. When it finishes, the raft log contains 4000 entries and ovsdb-server consumes (on my system) ~13GB of memory while the database is empty. And libc will likely never return this memory back to the system, or, at least, will hold it for a really long time.
  This patch adds a new command 'ovsdb-server/memory-trim-on-compaction'. It is disabled by default, but once enabled, ovsdb-server will call 'malloc_trim(0)' after every successful compaction to try to return unused heap memory back to the system. This is glibc-specific, so we need to detect function availability at build time. It is disabled by default since it adds from 1% to 30% (depending on the current state) to the snapshot creation time and, also, the next memory allocations will likely require requests to the kernel and that might be slower. It could be enabled by default later if considered broadly beneficial.
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
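  A minimal sketch of the trim-on-compaction idea (not the exact patch); the HAVE_MALLOC_TRIM guard and the 'trim_memory' flag are assumptions about how such a knob would typically be wired up:
    #include <stdbool.h>
    #ifdef HAVE_MALLOC_TRIM
    #include <malloc.h>
    #endif

    static bool trim_memory = false;  /* Toggled by the new appctl command. */

    static void
    maybe_trim_heap_after_compaction(bool compaction_succeeded)
    {
    #ifdef HAVE_MALLOC_TRIM
        if (compaction_succeeded && trim_memory) {
            /* Ask glibc to return free heap pages to the kernel.  A pad of
             * 0 trims as much as possible from the top of the heap. */
            malloc_trim(0);
        }
    #else
        (void) compaction_succeeded;  /* Not available on non-glibc systems. */
    #endif
    }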
* raft: Add log length to the memory report. (Ilya Maximets, 2020-11-03; 1 file, -0/+1)
  In many cases a big part of the memory consumed by the ovsdb-server process is the raft log, so it's important to add its length to the memory report.
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Avoid annoying debug logs if raft is connected. (Ilya Maximets, 2020-10-27; 1 file, -1/+10)
  If debug logs are enabled, "raft_is_connected: true" is printed on every call to raft_is_connected(), which is way too frequent. These messages are not very informative and only litter the log. Let's log only the disconnected state in a rate-limited way, and log the positive case only once, at the moment the cluster becomes connected.
  Fixes: 923f01cad678 ("raft.c: Set candidate_retrying if no leader elected since last election.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Fix error leak on failure while saving snapshot. (Ilya Maximets, 2020-10-27; 1 file, -1/+1)
  Error should be destroyed before return.
  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Report jsonrpc backlog in kilobytes. (Ilya Maximets, 2020-10-25; 1 file, -2/+3)
  While sending snapshots, the backlog on raft connections could quickly grow over 4GB and overflow the raft-backlog counter. Let's report it in kB instead. (Using kB and not KB to match the ru_maxrss counter reported by the kernel.)
  Fixes: 3423cd97f88f ("ovsdb: Add raft memory usage to memory report.")
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Eliminate "whitelist" and "blacklist" terms.Ben Pfaff2020-10-163-45/+45
| | | | | | | | There is one remaining use under datapath. That change should happen upstream in Linux first according to our usual policy. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
* ovsdb: Add unixctl command to show storage status. (Dumitru Ceara, 2020-09-16; 3 files, -0/+50)
  If a database enters an error state, e.g., in the case of RAFT, when applying the RAFT records read from the DB file triggers constraint violations, there's no way to determine this unless a client generates a write transaction. Such write transactions would fail with "ovsdb-error: inconsistent data".
  This commit adds a new command to show the status of the storage that's backing a database.
  Example, on an inconsistent database:
    $ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
    status: ovsdb error: inconsistent data
  Example, on a consistent database:
    $ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
    status: ok
  Signed-off-by: Dumitru Ceara <dceara@redhat.com>
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-tool: Add a db consistency check to the ovsdb-tool check-cluster command. (Federico Paolinelli, 2020-09-16; 1 file, -0/+38)
  There are some occurrences where the database ends up in an inconsistent state. This happened in ovn-k8s and is described in [0]. Here we are adding a supported way to check that a given db is consistent, which is less error-prone than checking the logs.
  Tested against both a valid db and a corrupted db attached to the above bug [1]. Also tested with a fresh db that did not do a snapshot.
  [0]: https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23
  [1]: https://bugzilla.redhat.com/attachment.cgi?id=1697595
  Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
  Suggested-by: Dumitru Ceara <dceara@redhat.com>
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* python: Fixup python shebangs to python3. (Greg Rose, 2020-08-26; 2 files, -4/+4)
  Builds on RHEL 8.2 systems are failing due to this issue. See [1] as to why this is necessary.
  I used the following command to identify files that need this fix:
    find . -type f -executable | /usr/lib/rpm/redhat/brp-mangle-shebangs
  I also updated the copyright notices as needed.
  1. https://fedoraproject.org/wiki/Changes/Make_ambiguous_python_shebangs_error
  Fixes: 1ca0323e7c29 ("Require Python 3 and remove support for Python 2.")
  Signed-off-by: Greg Rose <gvrose8192@gmail.com>
  Acked-by: Aaron Conole <aconole@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Replace in-memory DB contents at raft install_snapshot. (Dumitru Ceara, 2020-08-06; 1 file, -8/+13)
  Every time a follower has to install a snapshot received from the leader, it should also replace the data in memory. Right now this only happens when snapshots are installed that also change the schema. This can lead to inconsistent DB data on follower nodes, and the snapshot may fail to get applied.
  Fixes: bda1f6b60588 ("ovsdb-server: Don't disconnect clients after raft install_snapshot.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb/TODO.rst: Remove OVN specific items. (Han Zhou, 2020-07-07; 1 file, -4/+0)
  These should belong to the OVN project, if still not done yet.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb/TODO.rst: Remove completed items. (Han Zhou, 2020-07-07; 1 file, -4/+0)
  - A snapshot unit test has been added for the "change-election-timer" related patches.
  - The 100% CPU problem was addressed by: 2cd62f75c1 ("ovsdb raft: Precheck prereq before proposing commit.")
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Remove duplicated include. (Yunjian Wang, 2020-07-07; 2 files, -2/+0)
  Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Fix timeout type for wait operation. (Ilya Maximets, 2020-06-01; 1 file, -2/+2)
  According to RFC 7047, 'timeout' is an integer field:
    5.2.6. Wait
    The "wait" object contains the following members:
      "op": "wait"                 required
      "timeout": <integer>         optional
      ...
  For some reason the initial implementation treated it as a real number. This causes a build issue with clang that complains that LLONG_MAX could not be represented as double:
    ovsdb/execution.c:733:32: error: implicit conversion from 'long long' to 'double'
        changes value from 9223372036854775807 to 9223372036854775808
        timeout_msec = MIN(LLONG_MAX, json_real(timeout));
                       ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    /usr/include/sys/limits.h:69:19: note: expanded from macro 'LLONG_MAX'
    #define LLONG_MAX __LLONG_MAX /* max for a long long */
                      ^~~~~~~~~~~
    /usr/include/x86/_limits.h:74:21: note: expanded from macro '__LLONG_MAX'
    #define __LLONG_MAX 0x7fffffffffffffffLL /* max value for a long long */
                        ^~~~~~~~~~~~~~~~~~~~
    ./lib/util.h:90:21: note: expanded from macro 'MIN'
    #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
                           ^ ~
  Fix that by changing the parser to treat 'timeout' as an integer. Fixes the clang build on FreeBSD 12.1 in CirrusCI.
  Fixes: f85f8ebbfac9 ("Initial implementation of OVSDB.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Acked-by: Numan Siddique <numans@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
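  A minimal sketch of the corrected parsing (variable and helper names are illustrative, not the actual ovsdb_execute_wait() code):
    #include <limits.h>
    #include "openvswitch/json.h"

    static long long int
    wait_timeout_msec(const struct json *timeout)
    {
        if (timeout && timeout->type == JSON_INTEGER) {
            /* RFC 7047 defines "timeout" as an integer, so read it
             * directly instead of going through a double and losing
             * precision near LLONG_MAX. */
            return json_integer(timeout);
        }
        return LLONG_MAX;  /* No timeout given: wait indefinitely. */
    }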
* raft: Avoid sending equal snapshots. (Ilya Maximets, 2020-05-28; 3 files, -1/+43)
  Snapshots are huge. In some cases we could receive several outdated append replies from the remote server. This could happen in high-scale cases if the remote server is overloaded and not able to process all the raft requests in time. As an action to each outdated append reply we're sending a full database snapshot. While the remote server is already overloaded, those snapshots will get stuck in the jsonrpc backlog for a long time, making it grow up to a few GB. Since the remote server wasn't able to timely process incoming messages, it will likely not be able to process the snapshots either, leading to the same situation with low chances to recover. The remote server will likely be stuck in 'candidate' state, and other servers will grow their memory consumption due to growing jsonrpc backlogs:
    jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.16.0.3:6644, num of msgs: 3795, backlog: 8838994624.
  This patch tries to avoid that situation by avoiding sending equal snapshot install requests. This helps maintain reasonable memory consumption and allows the cluster to recover on a larger scale.
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Fix schema leak while reading db. (Ilya Maximets, 2020-05-28; 1 file, -2/+3)
  The parse_txn() function doesn't always take ownership of the 'schema' passed to it. So, if the schema of the clustered db has the same version as the one already in use, parse_txn() will not use it, resulting in a memory leak:
    7,827 (56 direct, 7,771 indirect) bytes in 1 blocks are definitely lost
       at 0x483BB1A: calloc (vg_replace_malloc.c:762)
       by 0x44AD02: xcalloc (util.c:121)
       by 0x40E70E: ovsdb_schema_create (ovsdb.c:41)
       by 0x40EA6D: ovsdb_schema_from_json (ovsdb.c:217)
       by 0x415EDD: ovsdb_storage_read (storage.c:280)
       by 0x408968: read_db (ovsdb-server.c:607)
       by 0x40733D: main_loop (ovsdb-server.c:227)
       by 0x40733D: main (ovsdb-server.c:469)
  While we could put ovsdb_schema_destroy() in a few places inside parse_txn(), from the user's point of view it seems better to have a constant argument and just clone the 'schema' if needed. The caller will be responsible for destroying the 'schema' it owns.
  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Add raft memory usage to memory report. (Ilya Maximets, 2020-05-25; 5 files, -0/+35)
  Memory reports can be found in logs or by calling the 'memory/show' appctl command. For ovsdb-server they include information about db cells, monitor connections with their backlog size, etc., but nothing about the memory consumed by raft. Backlogs of raft connections could be insanely large because of snapshot installation requests that simply contain the whole database. In not-that-healthy clusters where one of the ovsdb servers is not able to timely handle all the incoming raft traffic, the backlog on a sender's side could cause significant memory consumption issues. Adding new 'raft-connections' and 'raft-backlog' counters to the memory report to better track such conditions.
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* RAFT: Add clarifying note for cluster/leave operation. (Mark Michelson, 2020-05-12; 1 file, -0/+5)
  We had a user express confusion about the state of a cluster after using cluster/leave. The user had a three-server cluster and used cluster/leave to remove two servers from the cluster. The user expected that the single remaining server would not function, since the quorum of two servers for a three-server cluster was not met. In actuality, cluster/leave removes the server from the cluster and alters the cluster size in the process. Thus the single remaining server continued to function since quorum was reached.
  This documentation change makes it a bit more explicit that cluster/leave alters the size of the cluster and cites the three-servers-down-to-one case as an example.
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1798158
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Mark Michelson <mmichels@redhat.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Disable RAFT jsonrpc inactivity probe. (Zhen Wang, 2020-05-12; 1 file, -0/+1)
  With a scale test of a 640-node k8s cluster, the raft DB nodes' jsonrpc sessions got closed due to the timeout of the default 5-second probe. This caused disturbance of the raft cluster. Since we already have the heartbeat for RAFT, just disable the probe between the servers to avoid the unnecessary jsonrpc inactivity probe.
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Zhen Wang <zhewang@nvidia.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
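  A minimal sketch of the idea (the surrounding connection structure is an illustrative assumption; only jsonrpc_session_set_probe_interval() is the real library call):
    #include "jsonrpc.h"

    /* Illustrative stand-in for raft's per-connection state. */
    struct raft_conn_sketch {
        struct jsonrpc_session *js;
    };

    static void
    disable_inactivity_probe(struct raft_conn_sketch *conn)
    {
        /* A probe interval of 0 turns the periodic inactivity probe off;
         * raft's own heartbeats already detect dead peers, so the probe
         * only risks dropping healthy but busy sessions. */
        jsonrpc_session_set_probe_interval(conn->js, 0);
    }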
* ovsdb-idlc: Fix memory leak reported by Coverity. (William Tu, 2020-05-12; 1 file, -0/+1)
  An example of the pattern is shown below:
    void ovsrec_ct_zone_index_set_external_ids(const struct ovsrec_ct_zone...
    {
        // 1. alloc_fn: Storage is returned from allocation function xmalloc.
        // 2. var_assign: Assigning: datum = storage returned from xmalloc(24UL).
        struct ovsdb_datum *datum = xmalloc(sizeof(struct ovsdb_datum));

        // 3. Condition external_ids, taking false branch.
        if (external_ids) {
            ...
        } else {
            // 4. noescape: Resource datum is not freed or pointed-to in ovsdb_datum_init_empty.
            ovsdb_datum_init_empty(datum);
        }
        // 5. noescape: Resource datum is not freed or pointed-to in ovsdb_idl_index_write.
        ovsdb_idl_index_write(CONST_CAST(struct ovsdb_idl_row *, &row->header_),
                              &ovsrec_ct_zone_columns[OVSREC_CT_ZONE_COL_EXTERNAL_IDS],
                              datum, &ovsrec_table_classes[OVSREC_TABLE_CT_ZONE]);
        // CID 1420856 (#1 of 1): Resource leak (RESOURCE_LEAK)
        // 6. leaked_storage: Variable datum going out of scope leaks the storage it points to.
  Fix it by freeing the datum.
  Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
  Signed-off-by: William Tu <u9012063@gmail.com>
* ovsdb-idlc: Fix memory leak reported by Coverity. (William Tu, 2020-05-12; 1 file, -2/+4)
  Coverity shows the following memory leak in this code pattern:
    void ovsrec_ipfix_index_set_obs_domain_id(...
    {
        struct ovsdb_datum datum;
        // 1. alloc_fn: Storage is returned from allocation function xmalloc.
        // 2. var_assign: Assigning: key = storage returned from xmalloc(16UL).
        union ovsdb_atom *key = xmalloc(sizeof(union ovsdb_atom));

        // 3. Condition n_obs_domain_id, taking false branch.
        if (n_obs_domain_id) {
            datum.n = 1;
            datum.keys = key;
            key->integer = *obs_domain_id;
        } else {
            datum.n = 0;
            datum.keys = NULL;
        }
        datum.values = NULL;
        ovsdb_idl_index_write(CONST_CAST(struct ovsdb_idl_row *, &row->head...
        // CID 1420891 (#1 of 1): Resource leak (RESOURCE_LEAK)
  Fixed it by moving the xmalloc to the true branch.
  Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
  Signed-off-by: William Tu <u9012063@gmail.com>
* raft: Fix leak of the incomplete command. (Ilya Maximets, 2020-05-04; 1 file, -3/+1)
  The function raft_command_initiate() returns a correctly referenced command instance. 'n_ref' equals 1 for complete commands and 2 for incomplete commands, because one more reference is in the raft->commands list. raft_handle_execute_command_request__() leaks the reference by not returning the pointer anywhere and not unreferencing incomplete commands.
    792 bytes in 11 blocks are definitely lost in loss record 258 of 262
       at 0x483BB1A: calloc (vg_replace_malloc.c:762)
       by 0x44BA32: xcalloc (util.c:121)
       by 0x422E5F: raft_command_create_incomplete (raft.c:2038)
       by 0x422E5F: raft_command_initiate (raft.c:2061)
       by 0x428651: raft_handle_execute_command_request__ (raft.c:4161)
       by 0x428651: raft_handle_execute_command_request (raft.c:4177)
       by 0x428651: raft_handle_rpc (raft.c:4230)
       by 0x428651: raft_conn_run (raft.c:1445)
       by 0x428DEA: raft_run (raft.c:1803)
       by 0x407392: main_loop (ovsdb-server.c:226)
       by 0x407392: main (ovsdb-server.c:469)
  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: William Tu <u9012063@gmail.com>
* ovsdb: Remove duplicated function definitions. (Yi-Hung Wei, 2020-04-27; 1 file, -27/+0)
  ovsdb_function_from_string() and ovsdb_function_to_string() are defined both in ovsdb/condition.c and lib/ovsdb-condition.c with the same function definition. Remove the one in ovsdb/condition.c to avoid duplication.
  This also resolves the following bazel build error:
    ./libopenvswitch.lo(ovsdb-condition.pic.o): In function `ovsdb_function_from_string':
    /lib/ovsdb-condition.c:24: multiple definition of `ovsdb_function_from_string'
    ./libovsdb.a(condition.pic.o):/proc/self/cwd/external/openvswitch_repo/ovsdb/condition.c:34: first defined here
    ./libopenvswitch.lo(ovsdb-condition.pic.o): In function `ovsdb_function_from_string':
    ./lib/ovsdb-condition.c:24: multiple definition of `ovsdb_function_to_string'
    ./libovsdb.a(condition.pic.o):/proc/self/cwd/external/openvswitch_repo/ovsdb/condition.c:335
  Reported-by: Harold Lim <haroldl@vmware.com>
  Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
  Signed-off-by: William Tu <u9012063@gmail.com>
* ovsdb: Switch ovsdb log fsync to data only. (Anton Ivanov, 2020-04-27; 1 file, -0/+9)
  We do not check metadata (mtime, atime) anywhere, so we do not need to update it every time we sync the log. If the system supports it, the log update should sync data only.
  Signed-off-by: Anton Ivanov <anton.ivanov@cambridgegreys.com>
  Signed-off-by: William Tu <u9012063@gmail.com>
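  A minimal sketch of the data-only sync idea (the HAVE_FDATASYNC guard is an illustrative assumption; the real change lives in ovsdb's log code and may be structured differently):
    #include <unistd.h>

    static int
    sync_log_fd(int fd)
    {
    #ifdef HAVE_FDATASYNC
        /* fdatasync() flushes the file data and only the metadata needed
         * to read it back, skipping pure metadata such as mtime/atime
         * that ovsdb never relies on. */
        return fdatasync(fd);
    #else
        return fsync(fd);
    #endif
    }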
* raft: Unset leader when starting election. (Han Zhou, 2020-03-06; 1 file, -0/+1)
  During an election there shouldn't be any leader. This change makes sure that a server in the candidate role always reports the leader as "unknown".
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Fix the problem of stuck in candidate role forever. (Han Zhou, 2020-03-06; 1 file, -2/+17)
  Sometimes a server can stay in candidate role forever, even if the server already sees the new leader and handles append-requests normally. However, because of the wrong role, it appears as disconnected from the cluster and so its clients are disconnected.
  This problem happens when 2 servers become candidates in the same term, and one of them is elected as leader in that term. It can be reproduced by the test cases added in this patch.
  The root cause is that the current implementation only changes the role to follower when a bigger term is observed (in raft_receive_term__()). According to the RAFT paper, if another candidate becomes leader with the same term, the candidate should change to follower. This patch fixes it by changing the role to follower when the leader is being updated in raft_update_leader().
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Fix next_index in install_snapshot reply handling. (Han Zhou, 2020-03-06; 1 file, -2/+3)
  When a leader handles an install_snapshot reply, the next_index for the follower should be log_start instead of log_end, because there can be new entries added to the leader's log after initiating the install_snapshot procedure. Also, it should send all the accumulated entries to the follower in the following append-request message, instead of sending 0 entries, to speed up convergence.
  Without this fix there is no functional problem, but it takes unnecessary extra rounds of append-requests answered with "inconsistency" by the follower, although it finally converges.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Send all missing logs in one single append_request. (Han Zhou, 2020-03-06; 1 file, -1/+1)
  When a follower needs to "catch up", the leader can send N entries in a single append_request instead of only one entry per message. The function raft_send_append_request() already supports this, so this patch just calculates the correct "n" and uses it.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Avoid sending unnecessary heartbeat when becoming leader. (Han Zhou, 2020-03-06; 1 file, -1/+0)
  When a node becomes leader, it sends out a heartbeat to all followers and then immediately sends out another append-request for a no-op command execution to all followers again. This causes two consecutive append-requests to be sent to each follower, and the first heartbeat append-request is unnecessary. This patch removes the heartbeat.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Avoid busy loop during leader election. (Han Zhou, 2020-03-06; 4 files, -4/+12)
  When a server doesn't see a leader yet, e.g. during leader re-election, a transaction coming from a client will cause a 100% CPU busy loop. With debug logs enabled it looks like:
    2020-02-28T04:04:35.631Z|00059|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
    2020-02-28T04:04:35.631Z|00062|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
    2020-02-28T04:04:35.631Z|00065|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
    2020-02-28T04:04:35.631Z|00068|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
    2020-02-28T04:04:35.631Z|00071|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
    2020-02-28T04:04:35.631Z|00074|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
    2020-02-28T04:04:35.631Z|00077|poll_loop|DBG|wakeup due to 0-ms timeout at ../ovsdb/trigger.c:164
    ...
  The problem is that in ovsdb_trigger_try(), all cluster errors are treated as temporary errors and retried immediately. This patch fixes it by introducing 'run_triggers_now', which tells whether a retry is needed immediately. When the cluster error carries the detail 'not leader', we don't immediately retry, but wait for the next poll event to trigger the retry. When the 'not leader' status changes, there must be an event, i.e. a raft RPC that changes the status, so the trigger is guaranteed to be run, without a busy loop.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Fix raft_is_connected() when there is no leader yet. (Han Zhou, 2020-03-06; 1 file, -2/+8)
  If no leader has ever been known by the current server, its status should be "disconnected" from the cluster. Without this patch, when a server in the cluster is restarted, it appears as connected before it successfully connects back to the cluster, which is wrong.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-server: Don't disconnect clients after raft install_snapshot. (Han Zhou, 2020-03-06; 1 file, -1/+2)
  When a "schema" field is found in read_db(), there can be two cases:
  1. There is a schema change in the clustered DB and the "schema" is the new one.
  2. An install_snapshot RPC happened, which caused log compaction on the server, and the next log entry is just the snapshot, which always contains a "schema" field, even though the schema hasn't been changed.
  The current implementation doesn't handle case 2) and always assumes the schema has changed, hence disconnecting all clients of the server. This can cause stability problems when a big number of clients are connected at the moment this happens in a large-scale environment.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft-rpc: Fix message format. (Han Zhou, 2020-03-06; 1 file, -1/+1)
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-server: Allow OVSDB clients to specify the UUID for inserted rows. (Ben Pfaff, 2020-01-16; 3 files, -6/+47)
  Acked-by: Han Zhou <hzhou@ovn.org>
  Requested-by: Leonid Ryzhyk <lryzhyk@vmware.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb replication: Provide option to configure probe interval. (Numan Siddique, 2020-01-07; 4 files, -9/+62)
  When ovsdb-server is in backup mode and connects to the active ovsdb-server for replication, and if it takes more than 5 seconds to get the dump of the whole database, it will drop the connection soon after, as the default probe interval is 5 seconds. This results in a snowball effect of reconnections to the active ovsdb-server.
  This patch handles or mitigates this issue by setting the default probe interval value to 60 seconds and providing the option to configure this value via a unixctl command. Another option could be to increase the value of 'RECONNECT_DEFAULT_PROBE_INTERVAL' to a higher value.
  Acked-by: Mark Michelson <mmichels@redhat.com>
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Numan Siddique <numans@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Fix the problem when cluster restarted after DB compaction. (Han Zhou, 2019-12-20; 1 file, -1/+1)
  The cluster doesn't work after all nodes are restarted after DB compaction, unless there is a transaction after the DB compaction and before the restart. The error log looks like:
    raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 9 is past last synced index 0: s2 vote_request: term=6 last_log_index=9 last_log_term=4
  The root cause is that the log_synced member is not initialized when reading the raft header. This patch fixes it and removes the XXX from the test case.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Fix election timer parsing in snapshot RPC. (Han Zhou, 2019-11-21; 2 files, -1/+6)
  Commit a76ba825 took care of saving and restoring the election timer in the file header snapshot, but it didn't handle the parsing of the election timer in the install_snapshot_request/reply RPC. This results in problems, e.g. when the election-timer-change log entry is compacted into a snapshot and then a new node joins the cluster, the new node will use the default timer instead of the new value. This patch fixes it by parsing the election timer in the snapshot RPC.
  At the same time the patch updates the test case to cover the DB compaction and join scenario. The test reveals another 2 problems related to clustered DB compaction, as commented in the test case's XXX, which need to be addressed separately.
  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-server: fix memory leak while deleting zone (Damijan Skvarc, 2019-11-20; 1 file, -0/+2)
  A memory leak was detected by valgrind during execution of the "database commands -- positive checks" test. The leaked memory was allocated in the ovsdb_execute_mutate() function while parsing mutations from the apparent json entity:
    ==19563== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==19563== by 0x4652D0: xmalloc (util.c:138)
    ==19563== by 0x46539E: xmemdup0 (util.c:168)
    ==19563== by 0x4653F7: xstrdup (util.c:177)
    ==19563== by 0x450379: ovsdb_base_type_clone (ovsdb-types.c:208)
    ==19563== by 0x450F8D: ovsdb_type_clone (ovsdb-types.c:550)
    ==19563== by 0x428C3F: ovsdb_mutation_from_json (mutation.c:108)
    ==19563== by 0x428F6B: ovsdb_mutation_set_from_json (mutation.c:187)
    ==19563== by 0x42578D: ovsdb_execute_mutate (execution.c:573)
    ==19563== by 0x4246B0: ovsdb_execute_compose (execution.c:171)
    ==19563== by 0x41CDE5: ovsdb_trigger_try (trigger.c:204)
    ==19563== by 0x41C8DF: ovsdb_trigger_init (trigger.c:61)
    ==19563== by 0x40E93C: ovsdb_jsonrpc_trigger_create (jsonrpc-server.c:1135)
    ==19563== by 0x40E20C: ovsdb_jsonrpc_session_got_request (jsonrpc-server.c:1002)
    ==19563== by 0x40D1C2: ovsdb_jsonrpc_session_run (jsonrpc-server.c:561)
    ==19563== by 0x40D31E: ovsdb_jsonrpc_session_run_all (jsonrpc-server.c:591)
    ==19563== by 0x40CD6E: ovsdb_jsonrpc_server_run (jsonrpc-server.c:406)
    ==19563== by 0x40627E: main_loop (ovsdb-server.c:209)
    ==19563== by 0x406E66: main (ovsdb-server.c:460)
  This memory is usually freed at the end of ovsdb_execute_mutate(), however in the aforementioned test case this does not happen. Namely, in the case of the delete mutator, and in case of an error while calling ovsdb_datum_from_json(), the mutation was marked as invalid, which prevents freeing the problematic memory.
  The memory leak can be reproduced quickly with the following command sequence:
    ovs-vsctl --no-wait -vreconnect:emer add-zone-tp netdev zone=1 icmp_first=1 icmp_reply=2
    ovs-vsctl --no-wait -vreconnect:emer del-zone-tp netdev zone=1
  Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-execute: Remove unused variable from ovsdb_execute_mutate(). (Damijan Skvarc, 2019-10-30; 1 file, -2/+0)
  Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-server: fix memory leak while converting database (Damijan Skvarc, 2019-10-25; 1 file, -0/+4)
  A memory leak happens while converting an existing database into a new database according to the specified schema (ovsdb-client convert new-schema). The leak was detected by valgrind while executing the functional test "schema conversion online - clustered":
    ==16202== 96 bytes in 6 blocks are definitely lost in loss record 326 of 399
    ==16202== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==16202== by 0x44A5D4: xmalloc (util.c:138)
    ==16202== by 0x4377A6: alloc_default_atoms (ovsdb-data.c:315)
    ==16202== by 0x437F18: ovsdb_datum_init_default (ovsdb-data.c:918)
    ==16202== by 0x413D82: ovsdb_row_create (row.c:59)
    ==16202== by 0x40AA53: ovsdb_convert_table (file.c:220)
    ==16202== by 0x40AA53: ovsdb_convert (file.c:275)
    ==16202== by 0x416BE1: ovsdb_trigger_try (trigger.c:255)
    ==16202== by 0x40D29E: ovsdb_jsonrpc_trigger_create (jsonrpc-server.c:1119)
    ==16202== by 0x40D29E: ovsdb_jsonrpc_session_got_request (jsonrpc-server.c:986)
    ==16202== by 0x40D29E: ovsdb_jsonrpc_session_run (jsonrpc-server.c:556)
    ==16202== by 0x40D29E: ovsdb_jsonrpc_session_run_all (jsonrpc-server.c:586)
    ==16202== by 0x40D29E: ovsdb_jsonrpc_server_run (jsonrpc-server.c:401)
    ==16202== by 0x40682E: main_loop (ovsdb-server.c:209)
    ==16202== by 0x40682E: main (ovsdb-server.c:460)
  The problem was in the ovsdb_datum_convert() function, which overrides pointers to datum memory allocated in the ovsdb_row_create() function. The fix frees this memory before ovsdb_datum_convert() is called.
  Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-server: Allow replication from older schema version servers. (Numan Siddique, 2019-10-24; 1 file, -25/+127)
  Presently, replication is not allowed if there is a schema version mismatch between the schema returned by the active ovsdb-server and the local db schema. This is causing failures in OVN DB HA deployments during upgrades.
  In the case of an OpenStack TripleO deployment with OVN, the OVN DB ovsdb-servers are deployed on a multi-node controller cluster in active/standby mode. During minor updates or major upgrades, the cluster is updated one node at a time. If a node A is running the active OVN DB ovsdb-servers and it is updated, another node B becomes active. After the update, when the OVN DB ovsdb-servers on A are started, these ovsdb-servers fail to replicate from the active if there is a schema version mismatch.
  This patch addresses the issue by allowing replication even if there is a schema version mismatch, but only if all the active db schema tables and their columns are present in the local db schema. This should not result in any data loss.
  Signed-off-by: Numan Siddique <numans@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-server: Don't drop all connections on read/write status change. (Numan Siddique, 2019-10-14; 1 file, -4/+20)
  Commit [1] force-drops all connections when the db read/write status changes. Prior to commit [1], when there was a read/write status change, the existing jsonrpc sessions with 'db_change_aware' set to true were not updated with the changed 'read_only' value. If the db status was changed to 'standby', the existing clients could still write to the db.
  In the case of pacemaker OVN HA, the OVN OCF script's 'start' action starts the ovsdb-servers in read-only state and later sets them to read-write in the 'promote' action. We have observed that if some ovn-controllers connect to the SB ovsdb-server (in read-only state) just before the 'promote' action, the connection is not always reset, and these ovn-controllers remain connected to the SB ovsdb-server in read-only state all the time.
  Even though commit [1] calls 'ovsdb_jsonrpc_server_reconnect()' with the 'forced' flag set to true when the db read/write status changes, somehow the FSM misses resetting the connections of these ovn-controllers. I think this needs to be addressed in the FSM. This patch doesn't address that FSM issue. Instead it changes the behavior of 'ovsdb_jsonrpc_server_set_read_only()' by setting the 'read_only' flag of all the jsonrpc sessions instead of forcefully resetting the connection.
  I think there is no need to reset the connection. In large-scale production deployments with OVN, resetting results in unnecessary waste of CPU cycles, as ovn-controllers will have to connect twice - once during the 'start' action and again during 'promote'.
  [1] - 2a9679e3b2c6 ("ovsdb-server: drop all connections on read/write status change")
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-client: fix memory leak while executing database backup (Damijan Skvarc, 2019-10-09; 1 file, -0/+1)
  valgrind detects this leak while running the functional test "ovsdb-client backup and restore":
    ==25401== 1,068 (240 direct, 828 indirect) bytes in 6 blocks are definitely lost in loss record 22 of 22
    ==25401== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==25401== by 0x449DA4: xmalloc (util.c:138)
    ==25401== by 0x43012D: json_create (json.c:1451)
    ==25401== by 0x43012D: json_array_create_empty (json.c:186)
    ==25401== by 0x43012D: json_parser_push_array (json.c:1279)
    ==25401== by 0x4303CF: json_parser_input (json.c:1407)
    ==25401== by 0x4312F1: json_lex_input (json.c:991)
    ==25401== by 0x43193B: json_parser_feed (json.c:1149)
    ==25401== by 0x4329FA: jsonrpc_recv.part.7 (jsonrpc.c:332)
    ==25401== by 0x432D3B: jsonrpc_recv (jsonrpc.c:297)
    ==25401== by 0x432D3B: jsonrpc_recv_block (jsonrpc.c:402)
    ==25401== by 0x4330EB: jsonrpc_transact_block (jsonrpc.c:436)
    ==25401== by 0x409246: do_backup (ovsdb-client.c:2008)
    ==25401== by 0x405F76: main (ovsdb-client.c:282)
  The problem was in the do_backup() function, where the _uuid json node was detached from its parent "row" json node but never destroyed afterwards.
  Signed-off-by: Damijan Skvarc <damjan.skvarc@gmail.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>