path: root/ovsdb
* ovsdb-data: Deduplicate string atoms. (Ilya Maximets, 2021-09-24, 4 files, -72/+67)

ovsdb-server spends a lot of time cloning atoms for various reasons,
e.g. to create a diff of two rows or to clone a row to the
transaction. All atoms except strings contain a simple value that can
be copied efficiently, but duplicating strings every time has a
significant performance impact.

Introduce a new reference-counted structure, 'ovsdb_atom_string', so
that strings no longer need to be copied every time; only a reference
counter is increased.

This change increases transaction throughput in benchmarks by up to 2x
for standalone databases and 3x for clustered databases, i.e. the
number of transactions that ovsdb-server can handle per second. It
also noticeably reduces the memory consumption of ovsdb-server.

The next step will be to consolidate this structure with json strings,
so strings will not need to be duplicated while converting database
objects to json and back.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
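A minimal sketch of the reference-counting idea, assuming the field
names and the flexible-array layout (the real 'ovsdb_atom_string'
definition may differ):

    #include <stddef.h>
    #include <stdlib.h>

    /* Reference-counted string atom (layout assumed for illustration). */
    struct ovsdb_atom_string {
        size_t n_refs;   /* Number of atoms sharing this string. */
        char string[];   /* The text itself, allocated in-line. */
    };

    /* "Cloning" becomes a counter increment instead of a strdup(). */
    static struct ovsdb_atom_string *
    atom_string_ref(struct ovsdb_atom_string *s)
    {
        s->n_refs++;
        return s;
    }

    static void
    atom_string_unref(struct ovsdb_atom_string *s)
    {
        if (s && !--s->n_refs) {
            free(s);
        }
    }

With such a layout a row diff or a transaction clone touches only the
counters, which is what makes the 2x-3x throughput gain possible.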
* ovsdb-data: Add function to apply diff in-place. (Ilya Maximets, 2021-09-24, 1 file, -6/+4)

ovsdb_datum_apply_diff() is heavily used in ovsdb transactions, but
it's linear in terms of the number of comparisons, and it also clones
all the atoms along the way. In most cases the size of a diff is much
smaller than the size of the original datum, so the same operation can
be performed in place with only O(diff->n * log2(old->n)) comparisons
and O(old->n + diff->n) memory copies with memcpy.

Using this function while applying diffs read from the storage gives a
significant performance boost and allows many more transactions per
second to be executed.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
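A sketch of the in-place approach on a simplified model (a sorted array
of ints standing in for a sorted datum of atoms), under the assumption
that a set diff lists atoms whose membership should be toggled; all
names here are illustrative, not the OVS implementation:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Simplified stand-in for a sorted ovsdb datum. */
    struct datum {
        int *keys;          /* Sorted, unique. */
        size_t n;           /* Number of keys in use. */
        size_t allocated;   /* Capacity of 'keys'. */
    };

    /* Binary search: index of 'key' if present, otherwise its insertion
     * point; '*found' reports which case occurred. */
    static size_t
    datum_search(const struct datum *d, int key, bool *found)
    {
        size_t low = 0, high = d->n;
        while (low < high) {
            size_t mid = low + (high - low) / 2;
            if (d->keys[mid] < key) {
                low = mid + 1;
            } else if (d->keys[mid] > key) {
                high = mid;
            } else {
                *found = true;
                return mid;
            }
        }
        *found = false;
        return low;
    }

    /* Applies 'diff' to 'd' in place: O(diff->n * log2(d->n)) comparisons
     * and memmove()s instead of a full clone.  Assumes the caller ensured
     * capacity for the insertions. */
    static void
    datum_apply_diff_in_place(struct datum *d, const struct datum *diff)
    {
        for (size_t i = 0; i < diff->n; i++) {
            bool found;
            size_t pos = datum_search(d, diff->keys[i], &found);
            if (found) {    /* Present: the diff removes this atom. */
                memmove(&d->keys[pos], &d->keys[pos + 1],
                        (d->n - pos - 1) * sizeof d->keys[0]);
                d->n--;
            } else {        /* Absent: the diff inserts it, keeping order. */
                memmove(&d->keys[pos + 1], &d->keys[pos],
                        (d->n - pos) * sizeof d->keys[0]);
                d->keys[pos] = diff->keys[i];
                d->n++;
            }
        }
    }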
* ovsdb-data: Optimize union of sets. (Ilya Maximets, 2021-09-24, 1 file, -1/+1)

The current algorithm of ovsdb_datum_union looks like this:

    for-each atom in b:
        if not bin_search(a, atom):
            push(a, clone(atom))
    quicksort(a)

So, the complexity looks like this:

    Nb * log2(Na)  +  Nb      +  (Na + Nb) * log2(Na + Nb)
    comparisons       clones     comparisons for quicksort
    for search

ovsdb_datum_union() is heavily used in database transactions when a
new element is added to a set, for example when a new logical switch
port is added to a logical switch in OVN. This is a very common use
case where the CMS adds one new port to an existing switch that
already has, let's say, 100 ports. For this case ovsdb-server will
have to perform:

    1 * log2(100)  +  1 clone  +  101 * log2(101)
    comparisons                   comparisons for
    for search                    quicksort
    ~7                            ~707

Roughly 714 comparisons of atoms and 1 clone.

Since the binary search gives us, for free, the position where the new
atom should go (it's the 'low' index after the search completes), the
logic can be reworked like this:

    copied = 0
    for-each atom in b:
        desired_position = bin_search(a, atom)
        push(result, a[copied : desired_position - 1])
        copied = desired_position
        push(result, clone(atom))
    push(result, a[copied : Na])
    swap(a, result)

The complexity of this scheme:

    Nb * log2(Na)  +  Nb      +  Na
    comparisons       clones     memory copies on push
    for search

'swap' is just a swap of a few pointers. 'push' is not a 'clone', but
a simple memory copy of 'union ovsdb_atom'.

In general, this scheme substitutes the complexity of a quicksort with
the complexity of a memory copy of Na atom structures, where we're not
even copying the strings that these atoms point to. The complexity in
the example above goes down from 714 comparisons to 7 comparisons and
a memcpy of 100 * sizeof(union ovsdb_atom) bytes.

The general complexity of a memory copy should always be lower than
the complexity of a quicksort, especially because these copies are
usually performed in bulk, so this new scheme should work faster for
any input.

All in all, this change allows several times more transactions per
second to be executed for transactions that add new entries to sets.

Alternatively, union could be implemented as a linear merge of two
sorted arrays, but that would result in O(Na) comparisons, which is
more than Nb * log2(Na) in the common case, since Na is usually far
bigger than Nb. A linear merge would also mean per-atom memory copies
instead of copying in bulk.

The 'replace' functionality of ovsdb_datum_union() had no users, so it
is simply removed. It can easily be added back if needed in the
future.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
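A sketch of the reworked union on the same sorted-int-array model (and
datum_search() helper) from the in-place diff sketch above; the names
and capacity handling are illustrative:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Union 'b' into 'a', bulk-copying runs of 'a' between the insertion
     * points found by binary search, instead of clone-then-quicksort. */
    static void
    datum_union_sketch(struct datum *a, const struct datum *b)
    {
        size_t cap = a->n + b->n;
        int *result = malloc(cap * sizeof *result);
        size_t n = 0, copied = 0;

        for (size_t i = 0; i < b->n; i++) {
            bool found;
            size_t pos = datum_search(a, b->keys[i], &found);
            if (found) {
                continue;   /* Already in the set: nothing to insert. */
            }
            /* Bulk-copy the run of 'a' up to the insertion point... */
            memcpy(&result[n], &a->keys[copied],
                   (pos - copied) * sizeof *result);
            n += pos - copied;
            copied = pos;
            /* ...then the new atom itself. */
            result[n++] = b->keys[i];
        }
        /* Copy the remaining tail of 'a'. */
        memcpy(&result[n], &a->keys[copied],
               (a->n - copied) * sizeof *result);
        n += a->n - copied;

        free(a->keys);
        a->keys = result;
        a->n = n;
        a->allocated = cap;
    }

Because both inputs are sorted, the insertion points are nondecreasing,
so the output stays sorted without any final quicksort pass.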
* ovsdb: transaction: Use diffs for strong reference counting. (Ilya Maximets, 2021-09-23, 1 file, -7/+32)

Currently, even if only one reference is added to the set of strong
references or removed from it, ovsdb-server walks through the whole
set and re-counts references to other rows. These referenced rows are
also added to the transaction in order to re-count their references.
For example, every time a Logical Switch Port is added to a Logical
Switch, the OVN Northbound database server walks through all ports of
this Logical Switch, clones their rows, and re-counts references. This
is not very efficient. Instead, it can increase reference counters
only for added references and decrease them only for removed ones. In
many cases this affects only one row in the Logical_Switch_Port table.

Introduce a new function that generates a diff of two datum objects
but stores added and removed atoms separately, so they can be used to
increase or decrease row reference counters accordingly.

This change allows several times more transactions per second that add
or remove strong references to/from sets, because ovsdb-server no
longer clones and re-counts rows that are irrelevant to the current
transaction.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: monitor: Store serialized json in a json cache. (Ilya Maximets, 2021-09-01, 1 file, -0/+9)

The same json from the json cache is typically sent to all the
clients, e.g., in case of an OVN deployment with ovn-monitor-all=true.
There could be hundreds or thousands of connected clients, and ovsdb
serializes the same json object for each of them before sending.
Serialize it once before storing it into the json cache to speed up
processing.

This change saves a lot of CPU cycles and a bit of memory, since only
a string and not the full json object needs to be kept in memory.

Testing with ovn-heater on 120 nodes using the density-heavy scenario
shows a reduction of the total CPU time used by Southbound DB
processes from 256 minutes to 147. The duration of unreasonably long
poll intervals also dropped dramatically, from 7 to 2 seconds:

            Count   Min    Max    Median   Mean     95 percentile
    -------------------------------------------------------------
    Before  1934    1012   7480   4302.5   4875.3   7034.3
    After   1909    1004   2730   1453.0   1532.5   2053.6

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
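A hedged sketch of the serialize-once pattern: json_to_string() is the
existing OVS serializer, while the cache-entry layout below is an
assumption for illustration:

    #include "openvswitch/json.h"

    /* Illustrative cache entry: parsed update plus its lazy string form. */
    struct json_cache_entry {
        struct json *update;   /* Parsed monitor update. */
        char *serialized;      /* Created once, reused for every client. */
    };

    static const char *
    cache_entry_serialize(struct json_cache_entry *e)
    {
        if (!e->serialized) {
            /* Serialize exactly once; hundreds of clients then reuse
             * the same string instead of re-serializing per send. */
            e->serialized = json_to_string(e->update, 0);
        }
        return e->serialized;
    }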
* raft: Don't keep full json objects in memory if no longer needed. (Ilya Maximets, 2021-09-01, 6 files, -64/+160)

Raft log entries (and the raft database snapshot) contain json objects
of the data. A follower receives append requests with data that gets
parsed and added to the raft log. The leader receives execution
requests, parses the data out of them and adds it to the log. In both
cases, ovsdb-server later reads the log with ovsdb_storage_read(),
constructs a transaction and updates the database. On followers these
json objects are, in the common case, never used again. The leader may
use them to send append requests or snapshot installation requests to
followers. However, all these operations (except for
ovsdb_storage_read()) just serialize the json in order to send it over
the network.

Json objects are significantly larger than their serialized string
representation. For example, the snapshot of the database from one of
the ovn-heater scale tests takes 270 MB as a string, but 1.6 GB as a
json object, out of the total 3.8 GB consumed by the ovsdb-server
process.

ovsdb_storage_read() for a given raft entry happens only once in its
lifetime, so after this call we can serialize the json object, store
the string representation and free the actual json object that ovsdb
will never need again. This can save a lot of memory and can also save
serialization time, because each raft entry for append requests and
snapshot installation requests is serialized only once instead of
every time such a request needs to be sent.

JSON_SERIALIZED_OBJECT can be used in order to seamlessly integrate
pre-serialized data into raft_header and similar json objects.

One major special case is the creation of a database snapshot. A
snapshot installation request received over the network will be parsed
and read by ovsdb-server just like any other raft log entry. However,
snapshots created locally with raft_store_snapshot() will never be
read back, because they reflect the current state of the database and
hence are already applied. For this case we can free the json object
right after writing the snapshot to disk.

Tests performed with ovn-heater on a 60-node density-light scenario,
where the on-disk database grows up to 97 MB, show that the average
memory consumption of ovsdb-server Southbound DB processes decreased
by 58% (from 602 MB to 256 MB per process) and peak memory consumption
decreased by 40% (from 1288 MB to 771 MB).

A test with 120 nodes on the density-heavy scenario, with a 270 MB
on-disk database, shows the expected 1.5 GB decrease in memory
consumption. Also, the total CPU time consumed by the Southbound DB
process was reduced from 296 to 256 minutes, and the number of
unreasonably long poll intervals dropped from 2896 to 1934.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idl: Add APIs to query if a table and a column are present. (Numan Siddique, 2021-08-28, 1 file, -1/+27)

This patch adds 2 new APIs to the ovsdb-idl client library,
ovsdb_idl_server_has_table() and ovsdb_idl_server_has_column(), to
query whether a table and a column are present in the IDL or not. It
also adds IDL helper functions, auto-generated from the schema, which
make this easier for clients.

These APIs are required for scenarios where the server schema is old
and missing a table or column, and a client built with a newer schema
version does a transaction with the missing table or column. This
results in a continuous loop of transaction failures.

Related-Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1992705
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
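A sketch of how a client might guard a write with these APIs; the exact
signatures (taking the IDL plus the generated table/column class
pointers) are an assumption here, and the nbrec_* identifiers are
hypothetical generated-schema names:

    #include "ovsdb-idl.h"

    /* Hypothetical identifiers generated from an IDL schema. */
    extern struct ovsdb_idl_table_class nbrec_table_example;
    extern struct ovsdb_idl_column nbrec_example_col_new_option;

    static void
    maybe_set_new_option(struct ovsdb_idl *idl)
    {
        /* Skip the write entirely if the server schema is too old,
         * instead of looping forever on transaction failures. */
        if (!ovsdb_idl_server_has_table(idl, &nbrec_table_example)
            || !ovsdb_idl_server_has_column(
                    idl, &nbrec_example_col_new_option)) {
            return;
        }
        /* ...safe to include the table/column in a transaction... */
    }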
* ovsdb: transaction-forward: Fix initialization of the 'sent' hmap node. (Ilya Maximets, 2021-08-05, 1 file, -0/+1)

'sent_node' is initialized to all zeroes by xzalloc(), but
HMAP_NODE_NULL is not all zeroes. hmap_node_is_null() is used to
detect if the node is valid, but it will fail and cause a segmentation
fault on an attempt to remove the non-existent node from the hash map.
This can happen if a client disconnects while the transaction has not
yet been forwarded to the relay source:

    Program terminated with signal 11, Segmentation fault.
    0  in hmap_remove at include/openvswitch/hmap.h:293
    293         while (*bucket != node) {
    (gdb) bt
    0   hmap_remove at include/openvswitch/hmap.h:293
    1   ovsdb_txn_forward_unlist at ovsdb/transaction-forward.c:67
    2   ovsdb_txn_forward_destroy at ovsdb/transaction-forward.c:79
    3   ovsdb_trigger_destroy at ovsdb/trigger.c:70
    4   ovsdb_jsonrpc_trigger_complete at ovsdb/jsonrpc-server.c:1192
    5   ovsdb_jsonrpc_trigger_remove__ at ovsdb/jsonrpc-server.c:1204
    6   ovsdb_jsonrpc_trigger_complete_all at ovsdb/jsonrpc-server.c:1223
    7   ovsdb_jsonrpc_session_run at ovsdb/jsonrpc-server.c:546
    8   ovsdb_jsonrpc_session_run_all at ovsdb/jsonrpc-server.c:591
    9   ovsdb_jsonrpc_server_run at ovsdb/jsonrpc-server.c:406
    10  main_loop

    (gdb) print db->txn_forward_sent
    $20 = {buckets = 0x..., one = 0x0, mask = 63, n = 0}
    (gdb) print txn_fwd->sent_node
    $24 = {hash = 0, next = 0x0}

Fix that by correctly initializing 'sent_node'.

Reported-by: Wentao Jia <wentao.jia@easystack.cn>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-August/051354.html
Fixes: 7964ffe7d2bf ("ovsdb: relay: Add support for transaction forwarding.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
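A sketch of the pattern behind the fix; hmap_node_nullify() and
HMAP_NODE_NULL are the existing hmap helpers, while the structure shape
and constructor are assumed for illustration:

    #include "openvswitch/hmap.h"
    #include "util.h"   /* xzalloc() */

    /* Illustrative shape of the forwarded-transaction record. */
    struct ovsdb_txn_forward {
        struct hmap_node sent_node;
        /* ... */
    };

    static struct ovsdb_txn_forward *
    txn_forward_create(void)
    {
        struct ovsdb_txn_forward *txn_fwd = xzalloc(sizeof *txn_fwd);

        /* xzalloc() leaves 'sent_node' all zeroes, but HMAP_NODE_NULL
         * is not all zeroes, so hmap_node_is_null() would wrongly
         * report a valid node.  Nullify it explicitly. */
        hmap_node_nullify(&txn_fwd->sent_node);
        return txn_fwd;
    }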
* docs: Add documentation for ovsdb relay mode. (Ilya Maximets, 2021-07-15, 1 file, -10/+17)

Main documentation for the service model, plus a tutorial with a use
case and configuration examples.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Make clients aware of relay service model. (Ilya Maximets, 2021-07-15, 1 file, -1/+1)

Clients need to re-connect from a relay that has no connection with
the database source. Also, from a consistency point of view, a relay
acts similarly to a follower in the clustered model, so it's not
suitable for leader-only connections.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: relay: Reflect connection status in _Server database. (Ilya Maximets, 2021-07-15, 4 files, -9/+49)

It might be important for clients to know that a relay has lost its
connection with the relay remote, so that they can re-connect to
another relay.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: relay: Add support for transaction forwarding. (Ilya Maximets, 2021-07-15, 9 files, -36/+329)

The current version of ovsdb relay makes it possible to scale out
read-only access to the primary database. However, many clients are
not read-only but read-mostly, for example ovn-controller. In order to
scale out database access for this case, ovsdb-server needs to process
transactions that are not read-only. A relay is not allowed to do
that, i.e. not allowed to modify the database, but it can act as a
proxy: forward transactions that include database modifications to the
primary server and forward the replies back to the client. At the same
time it can serve read-only transactions and monitor requests by
itself, greatly reducing the load on the primary server.

This configuration will slightly increase transaction latency, but
that is not very important for read-mostly use cases.

Implementation details: with this change, instead of creating a
trigger to commit the transaction, ovsdb-server creates a trigger for
transaction forwarding. Later, ovsdb_relay_run() sends all new
transactions to the relay source. Once a transaction reply is received
from the relay source, the ovsdb-relay module updates the state of the
transaction forwarding with the reply. After that, trigger_run()
completes the trigger and jsonrpc_server_run() sends the reply back to
the client. Since the transaction reply from the relay source is
received after all the updates, the client receives all the updates
before receiving the transaction reply, as in the normal scenario with
other database models.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: New ovsdb 'relay' service model. (Ilya Maximets, 2021-07-15, 9 files, -39/+474)

A new database service model, 'relay', that is needed to scale out
read-mostly database access, e.g. ovn-controller connections to
OVN_Southbound. In this service model ovsdb-server connects to an
existing OVSDB server and maintains an in-memory copy of the database.
It serves read-only transactions and monitor requests on its own, but
forwards write transactions to the relay source.

Key differences from active-backup replication:

    - Support for "write" transactions (next commit).
    - No on-disk storage (hence, probably, faster operation).
    - Support for multiple remotes (connect to the clustered db).
    - Doesn't try to keep the connection alive as long as possible,
      but quickly re-connects to other remotes to avoid missing
      updates.
    - No need to know the complete database schema beforehand, only
      the schema name.
    - Can be used along with other standalone and clustered databases
      by the same ovsdb-server process (doesn't turn the whole jsonrpc
      server into read-only mode).
    - Supports the modern version of monitors (monitor_cond_since),
      because it is based on ovsdb-cs.
    - Can be chained, i.e. multiple relays can be connected one to
      another in a row or in a tree-like form.
    - Doesn't increase availability.
    - Cannot be converted to other service models or become a main
      active server.

Some performance test results can be found here:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385825.html

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: row: Add support for xor-based row updates. (Ilya Maximets, 2021-07-15, 6 files, -15/+39)

This will be used to apply update3-type updates to ovsdb tables while
processing updates for the future ovsdb 'relay' service model.

'ovsdb_datum_apply_diff' is allowed to fail, so support is added for
returning this error.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: table: Expose functions to execute operations on ovsdb tables. (Ilya Maximets, 2021-07-15, 3 files, -76/+90)

These functions will be used later for the ovsdb 'relay' service
model, so they are moved to common code.

Warnings are translated to ovsdb errors; the caller in replication.c
only printed inconsistency warnings but mostly ignored them, so the
same logic is implemented by checking the error tag.

Also, ovsdb_execute_insert() previously printed an incorrect warning
about a duplicate row when the problem was actually a syntax error in
the json. Fix that by actually checking for the duplicate and
reporting the correct ovsdb error.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: storage: Allow setting the name for the unbacked storage. (Ilya Maximets, 2021-07-15, 4 files, -6/+13)

ovsdb_create() requires the schema or the storage to be nonnull; in
practice it requires a schema name or a storage name to use as the
database name. Only clustered storage has a name, which means that
only a clustered database can be created without a schema. Change that
by allowing unbacked storage to have a name, so that a database with
unbacked storage can be created without a schema. This will be used in
the next commits to create the database for the ovsdb 'relay' service
model.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* jsonrpc-server: Wake up jsonrpc session if there are completed triggers. (Ilya Maximets, 2021-07-15, 1 file, -1/+2)

If there are completed triggers, the jsonrpc server should wake up and
update clients with the new data, but there is no such condition in
ovsdb_jsonrpc_session_wait(). For some reason this doesn't result in
any processing delays with the current code, probably because in this
case there are always some other types of events that wake the ovsdb
server up. But it will become a problem in the upcoming ovsdb 'relay'
service model, because triggers can be completed from a different
place, i.e. after receiving a transaction reply from the relay source.

Fix that by waking up ovsdb-server when there are completed triggers
that need to be handled.

Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Fix memleak when failing to read storage. (Dumitru Ceara, 2021-07-15, 1 file, -5/+3)

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-tool: Fix memory leak in "check-cluster" command. (lin huang, 2021-07-09, 1 file, -2/+6)

The json returned by raft_entry_to_json() must be freed. Found by
Coverity.

Signed-off-by: linhuang <linhuang@ruijie.com.cn>
Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovs: fix wrong quote (Yunjian Wang, 2021-07-06, 1 file, -12/+12)

Remove the comma character by using the ' and " characters instead.

Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
* Remove Python 2 leftovers. (Rosemarie O'Riorden, 2021-06-22, 1 file, -1/+0)

Fixes: 1ca0323e7c29 ("Require Python 3 and remove support for Python 2.")
Reported at: https://bugzilla.redhat.com/show_bug.cgi?id=1949875
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: print local server ID when opening RAFT database (Dan Williams, 2021-06-11, 1 file, -0/+2)

Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-client: Integrate record/replay functionality. (Ilya Maximets, 2021-06-07, 2 files, -0/+7)

This is primarily to be able to test recording of client connections.
A unit test is added accordingly.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb-server: Don't update manager status if replay engine is active. (Ilya Maximets, 2021-06-07, 1 file, -3/+7)

The current version of the replay engine doesn't correctly handle
internal time-based events that end up in stream events. For example,
updates of the database status, which happen every 2.5 seconds, result
in updates on client monitors. Disable these updates for now if the
replay engine is active. The very first update is kept, to store the
initial information about the server.

The proper solution would be to record time and replay it, probably
with time warping or in some other way.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb-server: Integrate stream replay engine. (Ilya Maximets, 2021-06-07, 2 files, -0/+8)

This change adds support for stream record/replay functionality to
ovsdb-server. Since the current replay engine doesn't work well with
time-based events generated locally, it will work only with standalone
databases for now (raft heavily depends on time).

To use this functionality, run:

Recording:

    # create a directory for replay files.
    mkdir replay_dir
    # copy the current db for later use by replay
    cp my_db ./replay_dir/my_db
    ovsdb-server --record=./replay_dir <OVSDB_ARGS> my_db
    # connect some clients and run some ovsdb transactions
    ovs-appctl -t ovsdb-server exit

Replay:

    # restore the db from the copy
    cp ./replay_dir/my_db my_db.for_replay
    ovsdb-server --replay=./replay_dir <OVSDB_ARGS> my_db.for_replay

At this point ovsdb-server should execute all the same commands and
transactions. Since the last command was 'exit' via unixctl,
ovsdb-server will exit in the end.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb-tool: add --election-timer=ms option to 'create-cluster' (Dan Williams, 2021-05-27, 4 files, -12/+90)

After creating the new clustered database, write a raft entry that
sets the desired election timer. This allows CMSes to set the election
timer at cluster start and avoid an error-prone election-timer
modification process after the cluster is up.

Reported-at: https://bugzilla.redhat.com/1831778
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Transfer leadership before creating snapshots. (Ilya Maximets, 2021-05-14, 4 files, -7/+41)

With a big database, writing a snapshot can take a lot of time; for
example, on one of the systems, compaction of a 300MB database takes
about 10 seconds to complete. For a clustered database, 40% of this
time is taken by conversion of the database to the file transaction
json format; the rest of the time is spent formatting a string and
writing to disk. Of course, this highly depends on the disk and CPU
speeds. 300MB is a very possible database size for the OVN Southbound
DB, and it might be even bigger than that.

During compaction the database is not available, and ovsdb-server
doesn't do any other tasks. If the leader spends 10-15 seconds writing
a snapshot, the cluster is not functional for that time period. The
leader also likely has some monitors to serve, so a single poll
interval may end up 15-20 seconds long. Systems with such big
databases typically have very high election timers configured (16
seconds), so followers start an election only after this significant
amount of time. Once the leader is back to the operational state, it
will re-connect and try to re-join the cluster. In some cases, this
might also trigger 'connected' state flapping on the old leader,
triggering re-connection of clients.

This issue has been observed with large-scale OVN deployments.

One of the methods to improve the situation is to transfer leadership
before compacting. This keeps the cluster functional while one of the
servers writes a snapshot.

Additionally, the time spent on compaction is now logged if it was
longer than 1 second. This adds a bit of visibility into 'unreasonably
long poll interval's.

Reported-at: https://bugzilla.redhat.com/1960391
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
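A sketch of the decision point before writing a snapshot:
raft_transfer_leadership() is an existing raft call, while the leader
check, write_snapshot() and the module setup here are assumed helpers
for illustration:

    #include "openvswitch/vlog.h"
    #include "raft.h"
    #include "timeval.h"

    VLOG_DEFINE_THIS_MODULE(snapshot_sketch);

    static void
    snapshot_with_leadership_transfer(struct raft *raft)
    {
        /* Writing a big snapshot can block this server for many
         * seconds, so hand leadership off first; the cluster then
         * stays functional while we compact. */
        if (raft_is_current_leader(raft)) {             /* assumed */
            raft_transfer_leadership(
                raft, "transferring leadership before compacting");
        }

        long long int start = time_msec();
        write_snapshot(raft);                           /* assumed */
        long long int elapsed = time_msec() - start;
        if (elapsed > 1000) {
            /* Visibility into unreasonably long poll intervals. */
            VLOG_INFO("snapshot took %lld ms", elapsed);
        }
    }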
* raft: Add 'stop-raft-rpc' failure test command. (Ilya Maximets, 2021-03-01, 1 file, -10/+21)

This command stops sending and receiving any RAFT-related traffic and
stops accepting new connections. It is useful to simulate network
problems between cluster members. There is no unit test that uses it
yet, but it's convenient for manual testing.

Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Report disconnected in cluster/status if candidate retries election. (Ilya Maximets, 2021-03-01, 1 file, -0/+2)

If an election times out for a server in the 'candidate' role, it sets
the 'candidate_retrying' flag, which signals that the storage is
disconnected and the client should re-connect. However, the
cluster/status command reports 'Status: cluster member', which is
misleading. Report "disconnected from the cluster (election timeout)"
instead.

Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Reintroduce jsonrpc inactivity probes. (Ilya Maximets, 2021-03-01, 1 file, -1/+31)

It's not enough to just have heartbeats. RAFT heartbeats are
unidirectional, i.e. the leader sends them to followers but not the
other way around. Missing heartbeats provoke followers to start an
election, but if the leader receives no replies it will not do
anything as long as there is a quorum, i.e. there are enough other
servers to make decisions.

This leads to a situation where, while the TCP connection is
established, the leader continues to blindly send messages over it. In
our case this leads to a growing send backlog. The connection will be
terminated eventually due to an excessive send backlog, but this might
take a lot of time and waste process memory. At the same time the
'candidate' continues to send vote requests to the dead connection on
its side.

To fix that, we need to reintroduce inactivity probes that drop the
connection if there was no incoming traffic for a long time and the
remote server doesn't reply to an "echo" request. The probe interval
might be chosen based on the election timeout to avoid the issues
described in commit db5a066c17bd.

Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
Fixes: db5a066c17bd ("raft: Disable RAFT jsonrpc inactivity probe.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
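A sketch of tying the probe interval to the election timer;
jsonrpc_session_set_probe_interval() is the existing jsonrpc helper,
while the connection wrapper shape and the heuristic are assumptions:

    #include <stdint.h>
    #include "jsonrpc.h"
    #include "util.h"   /* MAX() */

    /* Assumed shape of a raft connection wrapper. */
    struct raft_conn_sketch {
        struct jsonrpc_session *js;
    };

    /* Derive the inactivity probe interval from the election timer,
     * so probes never fire faster than elections and cannot cause
     * the spurious disconnects that led to disabling them before. */
    static void
    raft_conn_set_probe(struct raft_conn_sketch *conn,
                        uint64_t election_timer_ms)
    {
        int interval = MAX(election_timer_ms / 2, 1000);  /* heuristic */
        jsonrpc_session_set_probe_interval(conn->js, interval);
    }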
* ovsdb-client: Fix needs-conversion when SERVER is explicitly specified. (Alexey Roytman, 2021-02-19, 1 file, -2/+3)

When you specify `SERVER` to the 'ovsdb-client needs-conversion'
command, it interprets the `SERVER` parameter as the path to the
schema and returns an error. This PR fixes it.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Submitted-at: https://github.com/openvswitch/ovs/pull/347
Signed-off-by: Alexey Roytman <roytman@il.ibm.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-doc: Add build dependency on dirs.py. (Ilya Maximets, 2021-01-29, 1 file, -0/+1)

ovsdb-doc includes python code that requires dirs.py to exist. This
change fixes the broken 'make manpage-check' target:

    # make manpage-check
    Traceback (most recent call last):
      File "./ovsdb/ovsdb-doc", line 25, in <module>
        import ovs.db.schema
      File "/root/ovs/python/ovs/db/schema.py", line 19, in <module>
        import ovs.db.types
      File "/root/ovs/python/ovs/db/types.py", line 18, in <module>
        import ovs.db.data
      File "/root/ovs/python/ovs/db/data.py", line 22, in <module>
        import ovs.jsonrpc
      File "/root/ovs/python/ovs/jsonrpc.py", line 21, in <module>
        import ovs.poller
      File "/root/ovs/python/ovs/poller.py", line 23, in <module>
        import ovs.vlog
      File "/root/ovs/python/ovs/vlog.py", line 25, in <module>
        import ovs.dirs
    ModuleNotFoundError: No module named 'ovs.dirs'

Fixes: 943c4a325045 ("python: set ovs.dirs variables with build system values")
Acked-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Use column diffs for ovsdb and raft log entries. (Ilya Maximets, 2021-01-15, 4 files, -8/+92)

Currently, ovsdb-server stores the complete value of a column in the
database file and in the raft log whenever that column changes. This
means that a transaction that adds, for example, one new acl to a port
group creates a log entry with all the UUIDs of all existing acls plus
one new one. The same goes for ports in logical switches and routers
and other columns with sets in the Northbound DB.

There could be thousands of acls in one port group or thousands of
ports in a single logical switch, and the typical use case is to add
one new one when starting a new service/VM/container or adding one new
node in a kubernetes or OpenStack cluster. This generates a huge
amount of traffic within the ovsdb raft cluster, grows overall memory
consumption and hurts performance, since all these UUIDs are parsed
and formatted to/from json several times and stored on disk. And the
more values there are in a set, the more space a single log entry
occupies and the more time it takes to process by ovsdb-server cluster
members.

Simple test:

1. Start the OVN sandbox with clustered DBs:

    # make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'

2. Run a script that creates one port group and adds 4000 acls into
   it:

    # cat ../memory-test.sh
    pg_name=my_port_group
    export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
    ovn-nbctl pg-add $pg_name
    for i in $(seq 1 4000); do
        echo "Iteration: $i"
        ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
    done
    ovn-nbctl acl-del $pg_name
    ovn-nbctl pg-del $pg_name
    ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
    ovn-appctl -t ovn-nbctl exit
    ---

4. Check the current memory consumption of ovsdb-server processes and
   the space occupied by database files:

    # ls sandbox/[ns]b*.db -alh
    # ps -eo vsz,rss,comm,cmd | egrep '=[ns]b[123].pid'

Test results with the current ovsdb log format:

    On-disk Nb DB size     : ~369 MB
    RSS of Nb ovsdb-servers: ~2.7 GB
    Time to finish the test: ~2m

In order to mitigate the memory consumption issues and reduce the
computational load on ovsdb-servers, let's store the diff between the
old and new values instead. This makes the size of each log entry that
adds a single acl to a port group (or a port to a logical switch, or
anything else like that) very small and independent of the number of
already existing acls (ports, etc.).

A new marker, '_is_diff', is added to a file transaction to specify
that this transaction contains diffs instead of replacements for the
existing data.

One side effect is that this change actually increases the size of a
file transaction that removes more than half of the entries from a
set, because the diff is larger than the resulting new value. However,
such operations are rare.

Test results with the change applied:

    On-disk Nb DB size     : ~2.7 MB   ---> reduced by 99%
    RSS of Nb ovsdb-servers: ~580 MB   ---> reduced by 78%
    Time to finish the test: ~1m27s    ---> reduced by 27%

After this change a new ovsdb-server is still able to read old
databases, but an old ovsdb-server will not be able to read new ones.
Since new servers can join an ovsdb cluster dynamically, it's hard to
implement any runtime mechanism to handle cases where different
versions of ovsdb-server join the cluster. However, we still need to
handle cluster upgrades. For this case a special command line argument
is added to disable the new functionality. Documentation is updated
with the recommended way to upgrade an ovsdb cluster.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-tool: Fix datum leak in the show-log command. (Ilya Maximets, 2020-12-21, 1 file, -0/+1)

Fixes: 4e92542cefb7 ("ovsdb-tool: Make "show-log" convert raw JSON to easier-to-read syntax.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
* raft: Add some debugging information to cluster/status command. (Lorenzo Bianconi, 2020-12-21, 2 files, -0/+37)

Introduce the following info, useful for cluster debugging, to the
cluster/status command:

    - time elapsed since the last started/completed election
    - election trigger (e.g. timeout)
    - number of disconnections
    - time elapsed since the last raft message was received

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* python: Update build system to ensure dirs.py is created. (Mark Gray, 2020-11-26, 1 file, -1/+1)

Update the build system to ensure dirs.py is created when it is a
dependency for a build target. Also, update setup.py to check for that
dependency.

Fixes: 943c4a325045 ("python: set ovs.dirs variables with build system values")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idl: Fix *_is_new() IDL functions. (Mark Gray, 2020-11-16, 1 file, -3/+19)

Currently all functions of the type *_is_new() always return 'false'.
This patch resolves this issue by using the 'OVSDB_IDL_CHANGE_INSERT'
'change_seqno' instead of the 'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno'
to determine if a row is new, and by resetting the
'OVSDB_IDL_CHANGE_INSERT' 'change_seqno' on clear.

Further to this, the code is also updated to match the following
behaviour: when a row is inserted, the 'OVSDB_IDL_CHANGE_INSERT'
'change_seqno' is updated to match the new database change_seqno. The
'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno' is not set for inserted rows
(only for updated rows).

At the end of a run, ovsdb_idl_db_track_clear() should be called to
clear all tracking information; this includes resetting every row's
'change_seqno' to zero. This ensures that subsequent runs will not see
a previously 'new' row. add_tracked_change_for_references() is updated
to only track rows that reference the current row.

Also, the unit tests are updated to test the *_is_new() and
*_is_delete() functions.

Suggested-by: Dumitru Ceara <dceara@redhat.com>
Reported-at: https://bugzilla.redhat.com/1883562
Fixes: ca545a787ac0 ("ovsdb-idl.c: Increase seqno for change-tracking of table references.")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idlc: Return expected sequence number while setting conditions. (Ilya Maximets, 2020-11-16, 1 file, -3/+3)

ovsdb_idl_set_condition() returns a sequence number that can be used
to check whether the requested conditions have been acknowledged by
the server. However, the database bindings do not return this value to
the user, making it impossible to check whether the conditions were
accepted.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Remove read permission of *.db from others. (Yi-Hung Wei, 2020-11-10, 1 file, -1/+1)

Currently, when an ovsdb *.db file is created by ovsdb-tool, it grants
read permission to others. This may raise security concerns; for
example, IPsec pre-shared keys are stored in ovs-vswitchd.conf.db.
This patch addresses the concern by removing the read permission for
others.

Reported-by: Antonin Bas <abas@vmware.com>
Acked-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Make backlog thresholds configurable. (Ilya Maximets, 2020-11-10, 2 files, -5/+55)

A new appctl command, 'cluster/set-backlog-threshold', to configure
the thresholds on the backlog of raft jsonrpc connections. It could be
used, for example, in some extreme conditions where the size of a
database is expected to be very large, i.e. comparable with the
default 4GB threshold.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Set threshold on backlog for raft connections. (Ilya Maximets, 2020-11-10, 1 file, -0/+5)

RAFT messages can be fairly big. If something abnormal happens to one
of the servers in a cluster, it may not be able to process all the
incoming messages in a timely manner. This results in jsonrpc backlog
growth on the sender's side. For example, a follower may get many new
clients at once that it needs to serve, or it may decide to take a
snapshot during a period of many database changes. If the backlog
grows large enough, it becomes harder and harder for the follower to
process incoming raft messages; it sends outdated replies and starts
receiving snapshots and the whole raft log from the leader. Sometimes
the backlog grows too high (60GB in this example):

    jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:<ip>,
    num of msgs: 15370, backlog: 61731060773.

In this case the OS might actually decide to kill the sender to free
some memory. In any case, it could take a lot of time for such a
server to catch up with the rest of the cluster if it has so much data
to receive and process.

Introduce backlog thresholds for jsonrpc connections. If the sending
backlog exceeds particular values (500 messages or 4GB in size), the
connection is dropped and re-created. This drops all the current
backlog and starts over, increasing the chances of cluster recovery.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
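A sketch of the threshold check on a raft connection;
jsonrpc_session_get_backlog() and jsonrpc_session_force_reconnect()
are existing jsonrpc helpers, while the message-count parameter is an
assumption standing in for internal bookkeeping:

    #include <stdint.h>
    #include "jsonrpc.h"

    #define MAX_BACKLOG_BYTES ((uint64_t) 4 * 1024 * 1024 * 1024) /* 4GB */
    #define MAX_BACKLOG_MSGS  500

    /* Drop and re-create the connection once the send backlog crosses
     * either threshold, so a struggling peer starts from a clean slate
     * instead of chewing through gigabytes of stale messages. */
    static void
    check_backlog(struct jsonrpc_session *js, size_t n_msgs /* assumed */)
    {
        if (jsonrpc_session_get_backlog(js) >= MAX_BACKLOG_BYTES
            || n_msgs >= MAX_BACKLOG_MSGS) {
            jsonrpc_session_force_reconnect(js);
        }
    }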
* raft: Avoid having more than one snapshot in-flight. (Ilya Maximets, 2020-11-03, 3 files, -29/+18)

The previous commit 8c2c503bdb0d ("raft: Avoid sending equal
snapshots.") took a "safe" approach of avoiding only exactly equal
snapshot installation requests. However, it doesn't make much sense to
send more than one snapshot at a time. If an obsolete snapshot is
installed, the leader will re-send the most recent one.

With this change the leader will have only 1 snapshot in-flight per
connection. This reduces backlogs on raft connections in case a new
snapshot is created while an 'install_snapshot_request' is in
progress, or if the election timer changes in that period.

Also, not tracking the exact 'install_snapshot_request' we've sent
allows the code to be simplified.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Fixes: 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Reclaim heap memory after compaction. (Ilya Maximets, 2020-11-03, 4 files, -4/+56)

Compaction happens at most once in 10 minutes. That is a long interval
for a heavily loaded ovsdb-server in cluster mode. In 10 minutes raft
logs could grow to tens of thousands of entries with tens of gigabytes
in total size. While compaction cleans up raft log entries, the memory
in many cases is not returned to the system but kept in the heap of
the running ovsdb-server process, and it could stay in this condition
for a really long time. In the end, one performance spike could lead
to fast growth of the raft log, and this memory will never (for a
really long time) be released to the system, even if the database is
empty.

Simple example how to reproduce with the OVN sandbox:

1. Start the sandbox with clustered DBs:

    make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'

2. Run the following script that creates 1 port group, adds 4000 acls
   and removes all of that in the end:

    # cat ../memory-test.sh
    pg_name=my_port_group
    export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
    ovn-nbctl pg-add $pg_name
    for i in $(seq 1 4000); do
        echo "Iteration: $i"
        ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
    done
    ovn-nbctl acl-del $pg_name
    ovn-nbctl pg-del $pg_name
    ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
    ovn-appctl -t ovn-nbctl exit
    ---

3. Stop one of the Northbound DB servers:

    ovs-appctl -t $(pwd)/sandbox/nb1 exit

   Make sure that ovsdb-server didn't compact the database before it
   was stopped. Now we have a db file on disk that contains 4000
   fairly big transactions inside.

4. Try to start the same ovsdb-server with this file:

    # cd sandbox && ovsdb-server <...> nb1.db

At this point ovsdb-server reads all the transactions from the db file
and performs them as fast as it can, one by one. When it finishes, the
raft log contains 4000 entries and ovsdb-server consumes (on my
system) ~13GB of memory while the database is empty. And libc will
likely never return this memory back to the system, or, at least, will
hold it for a really long time.

This patch adds a new command, 'ovsdb-server/memory-trim-on-compaction'.
It's disabled by default, but once enabled, ovsdb-server will call
'malloc_trim(0)' after every successful compaction to try to return
unused heap memory back to the system. This is glibc-specific, so the
function's availability needs to be detected at build time. It is
disabled by default because it adds from 1% to 30% (depending on the
current state) to the snapshot creation time, and also because
subsequent memory allocations will likely require requests to the
kernel, which might be slower. It could be enabled by default later if
considered broadly beneficial.

Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
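A sketch of the trim-after-compaction hook: malloc_trim() is the real
glibc call, HAVE_MALLOC_TRIM stands in for a configure-time probe, and
the surrounding function is an assumed call site:

    #include <config.h>     /* would define HAVE_MALLOC_TRIM */
    #include <stdbool.h>
    #ifdef HAVE_MALLOC_TRIM
    #include <malloc.h>
    #endif

    /* Called after a successful database compaction. */
    static void
    maybe_trim_heap(bool trim_enabled)
    {
    #ifdef HAVE_MALLOC_TRIM
        if (trim_enabled) {
            /* Ask glibc to return free heap pages to the kernel.  This
             * can add noticeable time to compaction, hence the opt-in
             * knob. */
            malloc_trim(0);
        }
    #else
        (void) trim_enabled;    /* Not glibc: nothing we can do. */
    #endif
    }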
* raft: Add log length to the memory report. (Ilya Maximets, 2020-11-03, 1 file, -0/+1)

In many cases a big part of the memory consumed by the ovsdb-server
process is the raft log, so it's important to add its length to the
memory report.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Avoid annoying debug logs if raft is connected. (Ilya Maximets, 2020-10-27, 1 file, -1/+10)

If debug logs are enabled, "raft_is_connected: true" is printed on
every call to raft_is_connected(), which is way too frequent. These
messages are not very informative and only litter the log. Let's log
only the disconnected state, in a rate-limited way, and log the
positive case only once, at the moment the cluster becomes connected.

Fixes: 923f01cad678 ("raft.c: Set candidate_retrying if no leader elected since last election.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Fix error leak on failure while saving snapshot. (Ilya Maximets, 2020-10-27, 1 file, -1/+1)

The error should be destroyed before returning.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Report jsonrpc backlog in kilobytes. (Ilya Maximets, 2020-10-25, 1 file, -2/+3)

While sending snapshots, the backlog on raft connections can quickly
grow over 4GB, which overflows the raft-backlog counter. Let's report
it in kB instead. (Using kB and not KB to match the ru_maxrss counter
reported by the kernel.)

Fixes: 3423cd97f88f ("ovsdb: Add raft memory usage to memory report.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* Eliminate "whitelist" and "blacklist" terms.Ben Pfaff2020-10-163-45/+45
| | | | | | | | There is one remaining use under datapath. That change should happen upstream in Linux first according to our usual policy. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
* ovsdb: Add unixctl command to show storage status. (Dumitru Ceara, 2020-09-16, 3 files, -0/+50)

If a database enters an error state, e.g., in the case of RAFT, when
applying the RAFT records read from the DB file triggers constraint
violations, there's no way to determine this unless a client generates
a write transaction. Such write transactions would fail with
"ovsdb-error: inconsistent data".

This commit adds a new command to show the status of the storage
that's backing a database.

Example, on an inconsistent database:

    $ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
    status: ovsdb error: inconsistent data

Example, on a consistent database:

    $ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
    status: ok

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-tool: Add a db consistency check to the ovsdb-tool check-cluster command. (Federico Paolinelli, 2020-09-16, 1 file, -0/+38)

There are some occurrences where the database ends up in an
inconsistent state. This happened in ovn-k8s and is described in [0].
Here we add a supported way to check that a given db is consistent,
which is less error-prone than checking the logs.

Tested against both a valid db and a corrupted db attached to the
above bug [1]. Also tested with a fresh db that did not do a snapshot.

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23
[1]: https://bugzilla.redhat.com/attachment.cgi?id=1697595

Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Suggested-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>