path: root/ovsdb
* ovsdb: Monitor: Keep and maintain the initial change set.
  Ilya Maximets | 2023-04-24 | 1 file changed, -1/+4

Change sets in OVSDB monitor are storing all the changes that happened between a particular transaction ID and now.  Initial change set basically contains all the data.

On each monitor request a new initial change set is created by creating an empty change set and adding all the database rows.  Then it is converted into JSON reply and immediately untracked and destroyed.  This is causing significant performance issues if many clients are requesting new monitors at the same time.  For example, that is happening after database schema conversion, because conversion triggers cancellation of all monitors.  After cancellation, every client sends a new monitor request.  The server then creates a new initial change set, sends a reply, destroys initial change set and repeats that for each client.  On a system with 200 MB database and 500 clients, cluster of 3 servers spends 20 minutes replying to all the clients (200 MB x 500 = 100 GB):

    timeval|WARN|Unreasonably long 1201525ms poll interval

Of course, all the clients are already disconnected due to inactivity at this point.  When they are re-connecting back, server accepts new connections one at a time, so inactivity probes will not be triggered anymore, but it still takes another 20 minutes to handle all the incoming connections.

Let's keep the initial change set around for as long as the monitor itself exists.  This will allow us to not construct a new change set on each new monitor request and even utilize the JSON cache in some cases.  All that at a relatively small maintenance cost, since we'll need to commit changes to one extra change set on every transaction.  Measured memory usage increase due to keeping around a shallow copy of a database is about 10%.  Measured CPU usage difference during normal operation is negligible.

With this change it takes only 30 seconds to send out all the monitor replies in the example above.  So, it's a 40x performance improvement.  On a more reasonable setup with 250 nodes, the process takes up to 8-10 seconds instead of 4-5 minutes.

Conditional monitoring will benefit from this change as well, however results might be less impressive due to lack of JSON cache.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Avoid converting database twice on an initiator.
  Ilya Maximets | 2023-04-24 | 7 files changed, -20/+65

The cluster member that initiates the schema conversion converts the database twice: the first time while verifying the possibility of the conversion, and the second time after reading the conversion request back from the storage.

Keep the converted database from the first time around and use it after reading the request back from the storage.  This cuts the conversion CPU cost in half.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Perform conversion with no data for clustered databases.
  Ilya Maximets | 2023-04-24 | 4 files changed, -3/+52

Currently, database schema conversion in case of clustered database produces a transaction record with both new schema and converted database data.  So, the sequence of events is following:

  1. Get the new schema.
  2. Convert the database to a new schema.
  3. Translate the newly converted database into JSON.
  4. Write the schema + data JSON to the storage.
  5. Destroy converted version of a database.
  6. Read schema + data JSON from the storage and parse.
  7. Create a new database from a parsed database data.
  8. Replace current database with the new one.

Most of these steps are very computationally expensive.  Also, conversion to/from JSON is much more expensive than direct database conversion with ovsdb_convert() that can make use of shallow data copies.

Instead of doing all that, let's make use of previously introduced ability to not write the converted data into the storage.  The process will look like this then:

  1. Get the new schema.
  2. Convert the database to a new schema (to verify that it is possible).
  3. Write the schema to the storage.
  4. Destroy converted version of a database.
  5. Read the new schema from the storage and parse.
  6. Convert the database to a new schema.
  7. Replace current database with the new one.

Most of the operations here are performed on the small schema object, instead of the actual database data.  Two remaining data operations (actual conversion) are noticeably faster than conversion to/from JSON due to reference counting and shallow data copies.

Steps 4-6 can be optimized later to not convert twice on the process that initiates the conversion.

The change results in following performance improvements in conversion of OVN_Southbound database schema from version 20.23.0 to 20.27.0 (measured on a single-server RAFT cluster with no clients):

          |        Before               |        After
          +---------+-------------------+---------+-------------------
  DB size | Total   | Max poll interval | Total   | Max poll interval
  --------+---------+-------------------+---------+-------------------
  542 MB  | 47 sec. | 26 sec.           | 15 sec. | 10 sec.
  225 MB  | 19 sec. | 10 sec.           |  6 sec. | 4.5 sec.

542 MB database had 19.5 M atoms, 225 MB database had 7.5 M atoms.  Overall performance improvement is about 3x.  Also, note that before this change database conversion basically doubles the database file on disk.  Now it only writes a small schema JSON.

Since the change requires backward-incompatible database file format changes, documentation is updated on how to perform an upgrade.  Handled the same way as we did for the previous incompatible format change in 2.15 (column diffs).

Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052140.html
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Allow conversion records with no data in a clustered storage.
  Ilya Maximets | 2023-04-24 | 4 files changed, -36/+93

If the schema with no data was read from the clustered storage, it should mean a database conversion request.  In general, we can get:

  1. Just data     --> Transaction record.
  2. Schema + Data --> Database conversion or raft snapshot install.
  3. Just schema   --> New: database conversion request.

We cannot distinguish between conversion and snapshot installation request in the current implementation, so we will keep handling conversion with data in the same way as before, i.e. if data is provided, we should use it.

ovsdb-tool is updated to handle this record type as well while converting cluster to standalone.

This change doesn't introduce a way for such records to appear in the database.  That will be added in the future commits targeting conversion speed increase.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Check for ephemeral columns before writing a new schema.
  Ilya Maximets | 2023-04-24 | 5 files changed, -12/+24

Clustered databases do not support ephemeral columns, but ovsdb-server checks for them after the conversion result is read from the storage.  It's much easier to recover if this constraint is checked before writing to the storage instead.

It's not a big problem, because the check is always performed by the native ovsdb clients before sending a conversion request.  But the server, in general, should not trust clients to do the right thing.

Check in the update_schema() remains, because we shouldn't blindly trust the storage.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-tool: Fix cluster-to-standalone for DB conversion records.
  Ilya Maximets | 2023-04-24 | 3 files changed, -0/+38

If database conversion happens, both schema and the new data are present in the database record.  However, the schema is just silently ignored by ovsdb-tool cluster-to-standalone.  This creates data inconsistency if the new data contains new columns, for example, so the resulting database file will not be readable, or data will be lost.

Fix that by re-setting the database whenever a conversion record is found and actually writing a new schema that will match the actual data.  The database file will not be that similar to the original, but there is no way to represent conversion in a standalone database file format otherwise.

Fixes: 00de46f9ee42 ("ovsdb-tool: Convert clustered db to standalone db.")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* dpdk: Allow retaining CAP_SYS_RAWIO privileges.
  Aaron Conole | 2023-03-22 | 2 files changed, -5/+5

Open vSwitch generally tries to let the underlying operating system manage the low level details of hardware, for example DMA mapping, bus arbitration, etc.  However, when using DPDK, the underlying operating system yields control of many of these details to userspace for management.

In the case of some DPDK port drivers, configuring rte_flow or even allocating resources may require access to iopl/ioperm calls, which are guarded by the CAP_SYS_RAWIO privilege on linux systems.  These calls are dangerous, and can allow a process to completely compromise a system.  However, they are needed in the case of some userspace driver code which manages the hardware (for example, the mlx implementation of backend support for rte_flow).

Here, we create an opt-in flag passed to the command line to allow this access.  We need to do this before ever accessing the database, because we want to drop all privileges asap, and cannot wait for a connection to the database to be established and functional before dropping.  There may be distribution specific ways to do capability management as well (using for example, systemd), but they are not as universal to the vswitchd as a flag.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Fix handling of DNS name for listener configuration.
  Frode Nordahl | 2023-02-10 | 1 file changed, -16/+27

Commit 08e9e5337383 fixed proper initialization of the dns-resolve module, and made DNS resolution asynchronous.

A side effect of that change revealed a long standing logic bug which broke ovsdb-server listener configuration using DNS names.  Previously this worked because the DNS resolution would block; now that DNS resolution is asynchronous, the code before this change would assume the error from jsonrpc_pstream_open meant the remote was a specification for an active outgoing connection, even when that was not the case.

To fix this, a couple of changes were made to socket-util:

  1) Pass optional result of dns resolution from inet_parse_passive.  When (re-)configuring listeners that use DNS names, we may need to know whether the provided connection string is invalid or if the provided DNS name has finished resolving.

  2) Check dns resolution status in inet_open_passive.  If the connection string is valid, and contains a DNS name, inet_open_passive will now return -EAGAIN if dns resolution failed.  DNS resolution failure may either mean the asynchronous resolver has not completed yet, or that the name does not resolve.

Reported-at: https://bugs.launchpad.net/bugs/1998781
Fixes: 08e9e5337383 ("ovsdb: raft: Fix inability to read the database with DNS host names.")
Fixes: 771680d96fb6 ("DNS: Add basic support for asynchronous DNS resolving")
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Don't convert unchanged columns during database conversion.
  Ilya Maximets | 2023-01-27 | 1 file changed, -11/+48

Column conversion involves converting it to json and back.  These are heavy operations and completely unnecessary if the column type didn't change.  Most of the time schema changes only add new columns/tables without changing existing ones at all.  Clone the column instead to save some time.  This will also save time while destroying the original database since we will only need to reduce reference counters on unchanged datum objects that were cloned instead of actually freeing them.

Additionally, moving the column lookup into a separate loop, so we don't perform a shash lookup for each column of each row.

Testing with 440 MB OVN_Southbound database shows 70% speed up of the ovsdb_convert() function.  Execution time reduced from 15 to 4.4 seconds, 3.5 of which is a post-conversion transaction replay.  Overall time required for the online database conversion reduced from 37 to 25 seconds.

Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Fix database statistics during the database replacement.
  Ilya Maximets | 2023-01-18 | 1 file changed, -0/+3

The counter for the number of atoms has to be re-set to the number from the new database, otherwise the value will be incorrect.  For example, this is causing the atom counter doubling after online conversion of a clustered database.

Miscounting may also lead to increased memory consumption by the transaction history or otherwise too aggressive transaction history sweep.

Fixes: 317b1bfd7dd3 ("ovsdb: Don't let transaction history grow larger than the database.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Don't log when memory-trim-on-compaction doesn't change.
  Dan Williams | 2022-12-21 | 1 file changed, -2/+7

But log at least once even if the value hasn't changed, for informational purposes.

Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Count weak reference objects.
  Ilya Maximets | 2022-12-06 | 4 files changed, -1/+14

OVSDB creates a separate object for each weak reference in order to track them and there could be a significant amount of these objects in the database.  We also had problems with number of these objects growing out of bounds recently.  So, adding them to a memory report seems to be a good thing.

Counting them globally to cover all the copied instances in transactions and the transaction history (even though there should be none).  It's also hard to count them per-database, because weak references are stored on destination rows and can be destroyed either while destroying the destination row or while removing the reference from the source row.  Also, not all the involved functions have direct access to the database object.  So, there is no single clear place where counters should be updated.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-idl: Add the support to specify the uuid for row insert.
  Numan Siddique | 2022-11-30 | 1 file changed, -0/+15

ovsdb-server allows the OVSDB clients to specify the uuid for the row inserts [1].  Both the C IDL client library and Python IDL are missing this feature.  This patch adds this support.

In C IDL, for each schema table, a new function is generated - <schema_table>insert_persistent_uuid(txn, uuid) - which can be used by clients to persist the uuid.  ovs-vsctl and other derivatives of ctl now support the same in the generic 'create' command with the option "--id=<UUID>".

In Python IDL, the uuid to persist can be specified in the Transaction.insert() function.

[1] a529e3cd1f ("ovsdb-server: Allow OVSDB clients to specify the UUID for inserted rows.")

Acked-by: Adrian Moreno <amorenoz@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: transaction: Fix weak reference leak.
  Han Zhou | 2022-11-04 | 1 file changed, -2/+10

When a row is deleted, if the row has weak references to other rows, the weak reference nodes attached to the destination rows (through weak->dst_node hmap) are not destroyed.

Deleting weak references is properly handled when a row is modified.  The removed references are taken care of by:

  1. assess_weak_refs() figures out the deleted references from the row and adds them to txn_row->deleted_refs.
  2. Before commit, in ovsdb_txn_update_weak_refs() it finds the destination row for each item in txn_row->deleted_refs (from step 1), and destroys the corresponding weak references of the destination row.

However, when the row is deleted, the step 1 in assess_weak_refs() is missing.  It directly returns without adding the deleted references to txn_row->deleted_refs.  So, the destination nodes will keep those weak references although the source side of the references are already deleted.  When such rows originating weak references are created and deleted, more and more such useless weak reference structures accumulate in the memory, and can stay there until the destination rows are deleted.  It is possible that the destination row is never deleted, and in such case the ovsdb-server memory keeps growing (although it is not strictly a memory leak, because the structures are still referenced).  This problem has an impact on applications like OVN SB DB - the memory grows very fast in long-running deployments and finally causes OOM.

This patch fixes it by generating deleted_refs for deleted rows in assess_weak_refs().

Fixes: 4dbff9f0a685 ("ovsdb: transaction: Incremental reassessment of weak refs.")
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: transaction: Refactor assess_weak_refs.
  Han Zhou | 2022-11-04 | 1 file changed, -42/+36

The loops for adding weak refs are quite similar.  Abstract to a function, which will be used in one more case later.  The patch also changes the txn_row arg to the source row.

Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Fix race for datum JSON string reference counter.
  Ilya Maximets | 2022-10-11 | 5 files changed, -13/+38

The compaction thread is supposed to not change anything in the database it is working on, since the same data can be accessed by the main thread at the same time.  However, while converting database rows to JSON objects, strings in the datum will be cloned using json_clone(), which is a shallow copy, and that will change the reference counter for the JSON string object.  If both the main thread and the compaction thread clone/destroy the same object at the same time, we may end up with a broken reference counter leading to a memory leak or use-after-free.

Adding a new argument to the database-to-JSON conversion to prevent use of shallow copies from the compaction thread.  This way all the database operations will be truly read-only, avoiding the race.

'ovsdb_atom_to_json' and 'ovsdb_datum_to_json' are more widely used, so creating a separate variant for these functions instead of adding a new argument, to avoid changing a lot of existing code.

Another solution might be to use atomic reference counters, but that would require an API/ABI break, because the counter is exposed in public headers.  Also, we can not easily expose atomic functions, so we'd need to un-inline reference counting with the associated performance cost.

Fixes: 3cd2cbd684e0 ("ovsdb: Prepare snapshot JSON in a separate thread.")
Reported-at: https://bugzilla.redhat.com/2133431
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Fix unnecessary periodic compactions.
  Ilya Maximets | 2022-08-30 | 1 file changed, -1/+1

While creating a new database file and storing a new snapshot into it, raft module by mistake updates the base offset for the old file.  So, the base offset of a new file remains zero.  Then the old file is getting replaced with the new one, copying new offsets as well.  In the end, after a full compaction, base offset is always zero.  And any offset is twice as large as zero.  That triggers a new compaction again at the earliest scheduled time.

In practice this issue triggers compaction every 10-20 minutes regardless of the database load, after the first one is triggered by the actual file growth or by the 24h maximum limit.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-August/051977.html
Reported-by: Oleksandr Mykhalskyi <oleksandr.mykhalskyi@netcracker.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Don't store rows that didn't change in transaction history.
  Ilya Maximets | 2022-08-30 | 1 file changed, -1/+17

Transaction history is used to construct database updates for clients.  But if the row didn't change it will never be used for monitor updates, because ovsdb_monitor_changes_classify() will always return OVSDB_CHANGES_NO_EFFECT.  So, ovsdb_monitor_history_change_cb() will never add it to the update.

This condition is very common for rows with references.  While processing strong references in ovsdb_txn_adjust_atom_refs() the whole destination row will be cloned into transaction just to update the reference counter.  If this row will not be changed later in the transaction, it will just stay in that state and will be added to the transaction history.  Since the data didn't change, both 'old' and 'new' datums will be the same and equal to one in the database.  So, we're keeping 2 copies of the same row in memory and we are never using them.

In this case, we should just not add them to the transaction history in the first place.  This change should save some space in the transaction history in case of transactions with rows with big number of strong references.  This should also speed up the processing since we will not clone these rows for transaction history and will not count their atoms.

Testing shows about 5-10% performance improvement in ovn-heater test scenarios.

'n_atoms' counter for transaction adjusted to count only changed rows, so we will have accurate value for a number of atoms in the history.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Fix copying weak references into transaction history.
  Ilya Maximets | 2022-08-12 | 1 file changed, -4/+4

Transaction history is used only to construct row data updates for clients, it's not used for checking data integrity, hence it doesn't need a copy of weak references.  Not copying this data saves a lot of CPU cycles and memory in some cases.

For example, in 250-node density-heavy scenario in ovn-heater these references can take up to 70% of RSS, which is about 8 GB of essentially wasted memory as reported by valgrind massif:

  -------------------------------------------------------------------------------
    n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
  -------------------------------------------------------------------------------
   20 1,011,495,832,314   11,610,557,104  10,217,785,620  1,392,771,484         0

  88.00% (10,217,785,620B) (heap allocation functions) malloc/new/new[]
  ->70.47% (8,181,819,064B) 0x455372: xcalloc__ (util.c:121)
    ->70.07% (8,135,785,424B) 0x41609D: ovsdb_weak_ref_clone (row.c:66)
      ->70.07% (8,135,785,424B) 0x41609D: ovsdb_row_clone (row.c:151)
        ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_clone (transaction.c:1124)
        | ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_add_to_history (transaction.c:1163)
        |   ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_replay_commit (transaction.c:1198)
        |     ->34.74% (4,034,041,440B) 0x408C35: parse_txn (ovsdb-server.c:633)
        |       ->34.74% (4,034,041,440B) 0x408C35: read_db (ovsdb-server.c:663)
        |         ->34.74% (4,034,041,440B) 0x406C9D: main_loop (ovsdb-server.c:238)
        |           ->34.74% (4,034,041,440B) 0x406C9D: main (ovsdb-server.c:500)
        |
        ->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_clone (transaction.c:1125)
          ->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_add_to_history (transaction.c:1163)
            ->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_replay_commit (transaction.c:1198)
              ->34.74% (4,034,041,440B) 0x408C35: parse_txn (ovsdb-server.c:633)
                ->34.74% (4,034,041,440B) 0x408C35: read_db (ovsdb-server.c:663)
                  ->34.74% (4,034,041,440B) 0x406C9D: main_loop (ovsdb-server.c:238)
                    ->34.74% (4,034,041,440B) 0x406C9D: main (ovsdb-server.c:500)

Replacing ovsdb_row_clone() with ovsdb_row_datum_clone() to avoid cloning unnecessary metadata.  The ovsdb_txn_clone() function re-named to avoid issues if it will be re-used in the future for some other use-case.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* rhel: Stop installing internal headers.
  Ilya Maximets | 2022-07-29 | 1 file changed, -1/+1

Currently, openvswitch-devel installs following header tree:

  /usr/include
      /openflow/*.h
      /openvswitch
          /*.h
          /openflow/*.h
          /openvswitch/*.h
          /sparse/*.h
          /lib/*.h

A few issues with that:

  1. openflow and openvswitch headers are installed twice.  Once in the main /usr/include and second time in the /usr/include/openvswitch/.
  2. For some reason internal headers such as lib/*.h and fairly useless headers such as sparse/*.h are installed as well.

One more issue is that current pkg-config files don't work with builds installed with 'make install', because 'make install' doesn't create this weird header tree.

While double install of same headers is not a huge problem, it doesn't seem right.  Installation of the internal headers is a bigger issue.  They are not part of API/ABI and we do not provide any stability guarantees for them.  We are making incompatible changes constantly in minor updates, so users should not rely on these headers.  If it's necessary for some external application to use them, this external application should not link with libopenvswitch dynamically and also it can't expect the static library to not break these API/ABI, hence there is no real point installing them.  Application should use OVS as a submodule like OVN does or compile itself by obtaining required version of OVS sources otherwise.  Another option is to properly export and install required headers.

pkg-config configuration files updated as necessary.

Fixes: 4886d4d2495b ("debian, rhel: Ship ovs shared libraries and header files")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Remove extra make target dependency for local-config.5.
  Ilya Maximets | 2022-07-19 | 1 file changed, -1/+1

ovsdb/ directory should not be a dependency, otherwise the man page is getting re-built every time unrelated files are changed.

Fixes: 6f24c2bc769a ("ovsdb: Add Local_Config schema.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb/TODO: Update the list of tasks.
  Ilya Maximets | 2022-07-14 | 1 file changed, -13/+33

Some of the work is already done, e.g. 'diff' file format and DNS support.  Added more items collected over time including relay and local_config items.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* man: Fix various typos across manual pages.
  Frode Nordahl | 2022-07-14 | 1 file changed, -3/+3

As reported by Debian lintian.

Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Add missing ovs-thread include.
  Ilya Maximets | 2022-07-14 | 1 file changed, -0/+1

MSVC doesn't have pthread_t defined by default as other compilers, so the build fails without the header.

Fixes: 3cd2cbd684e0 ("ovsdb: Prepare snapshot JSON in a separate thread.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb: Prepare snapshot JSON in a separate thread.
  Ilya Maximets | 2022-07-13 | 9 files changed, -24/+204

Conversion of the database data into JSON object, serialization and destruction of that object are the most heavy operations during the database compaction.  If these operations are moved to a separate thread, the main thread can continue processing database requests in the meantime.

With this change, the compaction is split in 3 phases:

  1. Initialization:
     - Create a copy of the database.
     - Remember current database index.
     - Start a separate thread to convert a copy of the database into serialized JSON object.

  2. Wait:
     - Continue normal operation until compaction thread is done.
     - Meanwhile, compaction thread:
       * Convert database copy to JSON.
       * Serialize resulting JSON.
       * Destroy original JSON object.

  3. Finish:
     - Destroy the database copy.
     - Take the snapshot created by the thread.
     - Write on disk.

The key for this scheme to be fast is the ability to create a shallow copy of the database.  This doesn't take too much time, allowing the thread to do most of the work.  Database copy is created and destroyed only by the main thread, so there is no need for synchronization.

This solution reduces the time the main thread is blocked by compaction by 80-90%.  For example, in ovn-heater tests with a 120 node density-heavy scenario, where compaction normally takes 5-6 seconds at the end of a test, measured compaction times were all below 1 second with the change applied.  Also, note that these measured times are the sum of phases 1 and 3, so actual poll intervals are about half a second in this case.

Only implemented for raft storage for now.  The implementation for standalone databases can be added later by using a file offset as a database index and copying newly added changes from the old file to a new one during ovsdb_log_replace().

Reported-at: https://bugzilla.redhat.com/2069108
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
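As an aside, the offload pattern this commit describes (clone cheaply, serialize in a worker thread, pick the result up later) can be illustrated with a minimal pthread sketch.  All names below are made up for illustration, and the real ovsdb-server keeps polling instead of blocking in a join:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct snapshot_task {
        char *db_copy;      /* Cheap copy made by the main thread (phase 1). */
        char *serialized;   /* Result produced by the worker (phase 2).      */
    };

    static void *
    snapshot_worker(void *arg)
    {
        struct snapshot_task *task = arg;
        size_t len = strlen(task->db_copy) + sizeof "snapshot:";

        /* Phase 2: serialize the copy while the main thread keeps serving. */
        task->serialized = malloc(len);
        snprintf(task->serialized, len, "snapshot:%s", task->db_copy);
        return NULL;
    }

    int
    main(void)
    {
        struct snapshot_task task = { .db_copy = strdup("db-contents") };
        pthread_t tid;

        pthread_create(&tid, NULL, snapshot_worker, &task);  /* Phase 1. */
        /* ... a real server would keep serving requests here ...        */
        pthread_join(tid, NULL);                              /* Phase 3. */
        printf("writing '%s' to disk\n", task.serialized);

        free(task.db_copy);
        free(task.serialized);
        return 0;
    }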
* ovsdb: Add lazy-copy support for ovsdb_datum objects.
  Ilya Maximets | 2022-07-13 | 7 files changed, -34/+34

Currently ovsdb-server is using shallow copies of some JSON objects by keeping a reference counter.  JSON string objects are also used directly as ovsdb atoms in database rows to avoid extra copies.

Taking this approach one step further, ovsdb_datum objects can also be mostly deduplicated by postponing the copy until it is actually needed.  The datum object itself contains a type and 2 pointers to data arrays.  By adding one more pointer to a reference counter, we may create a shallow copy of the datum by simply copying type and pointers and increasing the reference counter.  Before modifying the datum, a special function needs to be called to perform an actual copy of the object, a.k.a. unshare it.

Most of the datum modifications are performed inside the special functions in ovsdb-data.c, so that is not very hard to track.  A few places like ovsdb-server.c and column mutations are accessing and changing the data directly, so a few extra unshare() calls have to be added there.

This change doesn't affect the maximum memory consumption too much, because most of the copies are short-lived.  However, not actually performing these copies saves up to 40% of CPU time on operations with large sets.

Reported-at: https://bugzilla.redhat.com/2069089
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
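The copy-on-write idea behind this commit can be sketched generically; the structure and function names below are illustrative only, not the actual ovsdb_datum API:

    #include <stdlib.h>
    #include <string.h>

    /* A shallow copy just bumps the shared reference counter; the data is
     * "unshared" (actually copied) only right before the first write. */
    struct cow_array {
        unsigned int *refcount;   /* Shared by all shallow copies. */
        int *data;
        size_t n;
    };

    static struct cow_array
    cow_clone(struct cow_array a)             /* Cheap shallow copy. */
    {
        ++*a.refcount;
        return a;
    }

    static void
    cow_unshare(struct cow_array *a)          /* Call before any write. */
    {
        if (*a->refcount > 1) {
            int *copy = malloc(a->n * sizeof *copy);

            memcpy(copy, a->data, a->n * sizeof *copy);
            --*a->refcount;
            a->refcount = malloc(sizeof *a->refcount);
            *a->refcount = 1;
            a->data = copy;
        }
    }

    static void
    cow_set(struct cow_array *a, size_t i, int value)
    {
        cow_unshare(a);                       /* Now safe to modify. */
        a->data[i] = value;
    }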
* ovsdb: Enable memory trimming after compaction by default.
  Ilya Maximets | 2022-07-12 | 1 file changed, -1/+1

Memory trimming was introduced in OVS 2.15 and hasn't caused any issues in production environments since then, while allowing ovsdb-server to consume a lot less memory in high scale OVN deployments.  Enabling by default to make it easier to use.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Add Local_Config schema.
  Terry Wilson | 2022-06-30 | 4 files changed, -0/+362

The only way to configure settings on a remote (e.g. inactivity_probe) is via --remote=db:DB,table,row.  There is no way to do this via the existing CLI options.

For a clustered DB with multiple servers listening on unique addresses there is no way to store these entries in the DB as the DB is shared.  For example, three servers listening on 1.1.1.1, 1.1.1.2, and 1.1.1.3 respectively would require a Manager/Connection row each, but then all three servers would try to listen on all three addresses.

It is possible for ovsdb-server to serve multiple databases.  This means that we can have a local "config" database in addition to the main database we are serving (Open_vSwitch, OVN_Southbound, etc.) and this patch adds a Local_Config schema that currently just mirrors the Connection table and a Config table with a 'connections' row that stores each Connection.

Signed-off-by: Terry Wilson <twilson@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-server: Log database transactions for user requested tables.
  Dumitru Ceara | 2022-06-28 | 7 files changed, -0/+172

Add a new command, 'ovsdb-server/tlog-set DB:TABLE on|off', which allows the user to enable/disable transaction logging for specific databases and tables.  By default, logging is disabled.  Once enabled, logs are generated with level INFO and are also rate limited.

If used with care, this command can be useful in analyzing production deployment performance issues, allowing the user to pin point bottlenecks without the need to enable wider debug logs, e.g., jsonrpc.

A command to inspect the logging state is also added: 'ovsdb-server/tlog-list'.

Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Fix memory leak on error path in ovsdb_file_read__().
  Yunjian Wang | 2022-06-28 | 1 file changed, -0/+1

Found by Coverity.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Don't use HMAP_FOR_EACH_SAFE when logging commands.
  Dumitru Ceara | 2022-05-30 | 1 file changed, -1/+5

There's no need to do that because we're not changing the hmap.  Also, if DEBUG logging is disabled there's no need to iterate at all.

Fixes: 5a9b53a51ec9 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: raft: Fix transaction double commit due to lost leadership.
  Ilya Maximets | 2022-05-26 | 1 file changed, -55/+78

While becoming a follower, the leader aborts all the current 'in-flight' commands, so the higher layers can re-try corresponding transactions when the new leader is elected.  However, most of these commands are already sent to followers as append requests, hence they will actually be committed by the majority of the cluster members, i.e. will be treated as committed by the new leader, unless there is an actual network problem between servers.  However, the old leader will decline append replies, since it's not the leader anymore and commands are already completed with RAFT_CMD_LOST_LEADERSHIP status.

New leader will replicate the commit index back to the old leader.  Old leader will re-try the previously "failed" transaction, because "cluster error"s are temporary.

If a transaction had some prerequisites that didn't allow double committing or there are other database constraints (like indexes) that will not allow a transaction to be committed twice, the server will reply to the client with a false-negative transaction result.  If there are no prerequisites or additional database constraints, the server will execute the same transaction again, but as a follower.  E.g. in the OVN case, this may result in creation of duplicated logical switches / routers / load balancers.  I.e. resources with the same non-indexed name.  That may cause issues later where ovn-nbctl will not be able to add ports to these switches / routers.

Suggested solution is to not complete (abort) the commands, but allow them to be completed with the commit index update from a new leader.  This is similar behavior to what we do in order to complete commands in a backward scenario when the follower becomes a leader.  That scenario was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated transaction execution when leader failover.").

Code paths for leader and follower inside the raft_update_commit_index were very similar, so they were refactored into one, since we also needed an ability to complete more than one command for a follower.

Failure test added to exercise scenario of a leadership transfer.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
Reported-at: https://bugzilla.redhat.com/2046340
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* sha1: Use implementation from openssl if available.
  Ilya Maximets | 2022-05-26 | 1 file changed, -0/+1

Implementation of SHA1 in OpenSSL library is much faster and optimized for all available CPU architectures and instruction sets.  OVS should use it instead of internal implementation if possible.

Depending on compiler options OpenSSL's version finishes our sha1 unit tests from 3 to 12 times faster.  Performance of OpenSSL's version is constant, but OVS's implementation highly depends on compiler.  Interestingly, default build with '-g -O2' works faster than optimized '-march=native -Ofast'.

Tests with ovsdb-server on big databases shows ~5-10% improvement of the time needed for database compaction (sha1 is only a part of this operation), depending on compiler options.

We still need internal implementation, because OpenSSL can be not available on some platforms.  Tests enhanced to check both versions of API.

Reviewed-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
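A rough sketch of this kind of compile-time selection follows; HAVE_OPENSSL and internal_sha1() here only stand in for whatever the build system and the internal implementation actually provide:

    #include <stddef.h>

    #ifdef HAVE_OPENSSL
    #include <openssl/sha.h>

    static void
    compute_sha1(const void *data, size_t n, unsigned char digest[20])
    {
        SHA1(data, n, digest);          /* Optimized per-architecture code. */
    }
    #else
    void internal_sha1(const void *, size_t, unsigned char *);  /* Fallback. */

    static void
    compute_sha1(const void *data, size_t n, unsigned char digest[20])
    {
        internal_sha1(data, n, digest);
    }
    #endif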
* idlc: support short version of SAFE macros.
  Adrian Moreno | 2022-03-30 | 1 file changed, -2/+17

In order to be consistent with the rest of the SAFE loop macros, overload each of the generated *_SAFE macros with a SHORT version that does not require the user to provide the NEXT variable.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* hmap: use short version of safe loops if possible.
  Adrian Moreno | 2022-03-30 | 13 files changed, -59/+57

Using SHORT version of the *_SAFE loops makes the code cleaner and less error prone.  So, use the SHORT version and remove the extra variable when possible for hmap and all its derived types.

In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
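Selecting a macro variant by argument count is a generic C preprocessor technique; a self-contained illustration with made-up loop names (not the actual OVS macros) is:

    #include <stdio.h>

    /* A helper macro picks the implementation based on how many arguments
     * the caller passed, so long and short forms can share one name. */
    #define COUNT_SELECT(_1, _2, _3, NAME, ...) NAME

    #define ARRAY_FOR_EACH_LONG(VAR, NEXT, ARR) \
        for (int VAR = 0, NEXT = 1;             \
             VAR < (int) (sizeof (ARR) / sizeof *(ARR)); \
             VAR = NEXT, NEXT = VAR + 1)

    #define ARRAY_FOR_EACH_SHORT(VAR, ARR) \
        ARRAY_FOR_EACH_LONG(VAR, VAR##_next, ARR)

    #define ARRAY_FOR_EACH(...)                                             \
        COUNT_SELECT(__VA_ARGS__, ARRAY_FOR_EACH_LONG, ARRAY_FOR_EACH_SHORT, \
                     ignore)(__VA_ARGS__)

    int
    main(void)
    {
        int a[] = { 1, 2, 3 };

        ARRAY_FOR_EACH(i, a) {          /* Short form: no NEXT variable. */
            printf("%d\n", a[i]);
        }
        return 0;
    }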
* list: use short version of safe loops if possible.
  Adrian Moreno | 2022-03-30 | 7 files changed, -49/+48

Using the SHORT version of the *_SAFE loops makes the code cleaner and less error-prone.  So, use the SHORT version and remove the extra variable when possible.

In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments.

Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: raft: Fix inability to read the database with DNS host names.
  Ilya Maximets | 2022-03-30 | 2 files changed, -1/+7

Clustered OVSDB allows to use DNS names as addresses of raft members.  However, if DNS resolution fails during the initial database read, this causes a fatal failure and exit of the ovsdb-server process.  Also, if DNS name of a joining server is not resolvable for one of the followers, this follower will reject append requests for a new server to join until the name is successfully resolved.  This makes a follower effectively non-functional while DNS is unavailable.

To fix the problem relax the address verification, allowing validation to pass if only name resolution failed and the address is valid otherwise.  This will allow addresses to be added to the database, so connections could be established later when the DNS is available.

Additionally fixing missed initialization of the dns-resolve module.  Without it, DNS requests are blocking.  This causes unexpected delays in runtime.

Fixes: 771680d96fb6 ("DNS: Add basic support for asynchronous DNS resolving")
Reported-at: https://bugzilla.redhat.com/2055097
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: row: Optimize row updates by applying diffs in-place.
  Ilya Maximets | 2022-03-03 | 1 file changed, -11/+5

ovsdb_datum_apply_diff_in_place() is much faster than the usual ovsdb_datum_apply_diff() in most cases, because it doesn't clone or compare unnecessary data.  Since the original destination datum is destroyed anyway, we might use the faster function here to speed up transaction processing.

ovsdb_row_update_columns() with xor is mainly used by relay databases.  So, this change should improve their performance.

Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: relay: Add transaction history support.
  Ilya Maximets | 2022-03-03 | 3 files changed, -12/+33

Even though relays can be scaled to a big number of servers to handle a lot more clients, lack of transaction history may cause significant load if clients are re-connecting.  E.g. in case of the upgrade of a large-scale OVN deployment, relays can be taken down one by one forcing all the clients of one relay to jump to other ones.  And all these clients will download the database from scratch from a new relay.

Since relay itself supports monitor_cond_since connection to the main cluster, it receives the last transaction id along with each update.  Since these transaction ids are 'eid's of actual transactions, they can be used by relay for a transaction history.

Relay may not receive all the transaction ids, because the main cluster may combine several changes into a single monitor update.  However, all relays will, likely, receive same updates with the same transaction ids, so the case where transaction id can not be found after re-connection between relays should not be very common.  If some id is missing on the relay (i.e. this update was merged with some other update and newer id was used) the client will just re-download the database as if there was a normal transaction history miss.

OVSDB client synchronization module updated to provide the last transaction id along with the update.  Relay module updated to use these ids as a transaction id.  If ids are zero, relay decides that the main server doesn't support transaction ids and disables the transaction history accordingly.

Using ovsdb_txn_replay_commit() instead of ovsdb_txn_propose_commit_block(), so transactions are added to the history.  This can be done, because relays have no file storage, so there is no need to write anything.

Relay tests modified to test both standalone and clustered database as a main server.  Checks added to ensure that all servers receive the same transaction ids in monitor updates.

Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: raft: Fix inability to join the cluster after interrupted attempt.
  Ilya Maximets | 2022-02-25 | 1 file changed, -7/+31

If the joining server re-connects while catching up (e.g. if it crashed or connection got closed due to inactivity), the data we sent might be lost, so the server will never reply to append request or a snapshot installation request.  At the same time, leader will decline all the subsequent requests to join from that server with the 'in progress' resolution.  At this point the new server will never be able to join the cluster, because it will never receive the raft log while leader thinks that it was already sent.

This happened in practice when one of the servers got preempted for a few seconds, so the leader closed connection due to inactivity.

Destroying the joining server if disconnection detected.  This will allow to start the joining from scratch when the server re-connects and sends the new join request.

We can't track re-connection in the raft_conn_run(), because it's incoming connection and the jsonrpc will not keep it alive or try to reconnect.  Next time the server re-connects it will be an entirely new raft conn.

Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Reported-at: https://bugzilla.redhat.com/2033514
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb-idlc: Avoid accessing member within NULL idl index cursors.
  Dumitru Ceara | 2022-02-14 | 1 file changed, -1/+1

Reported by UndefinedBehaviorSanitizer:

  tests/idltest.c:3602:12: runtime error: member access within null pointer of type 'const struct idltest_simple'
      #0 0x4295af in idltest_simple_cursor_first_ge tests/idltest.c:3602
      #1 0x41c81b in test_idl_compound_index_single_column tests/test-ovsdb.c:3128
      #2 0x41e035 in do_idl_compound_index tests/test-ovsdb.c:3277
      #3 0x4cf640 in ovs_cmdl_run_command__ lib/command-line.c:247
      #4 0x4cf79f in ovs_cmdl_run_command lib/command-line.c:278
      #5 0x4072f7 in main tests/test-ovsdb.c:79
      #6 0x7fa858675b74 in __libc_start_main (/lib64/libc.so.6+0x27b74)
      #7 0x4060ed in _start (/root/ovs/tests/test-ovsdb+0x4060ed)

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* treewide: Don't pass NULL to library functions that expect non-NULL.
  Dumitru Ceara | 2022-02-14 | 1 file changed, -0/+4

It's actually undefined behavior to pass NULL to standard library functions that manipulate arrays (e.g., qsort, memcpy, memcmp), even if the passed number of items is 0.

UB Sanitizer reports:

  ovsdb/monitor.c:408:9: runtime error: null pointer passed as argument 1, which is declared to never be null
      #0 0x406ae1 in ovsdb_monitor_columns_sort ovsdb/monitor.c:408
      #1 0x406ae1 in ovsdb_monitor_add ovsdb/monitor.c:1683
      [...]

  lib/ovsdb-data.c:1970:5: runtime error: null pointer passed as argument 2, which is declared to never be null
      #0 0x4071c8 in ovsdb_datum_push_unsafe lib/ovsdb-data.c:1970
      #1 0x471cd0 in ovsdb_datum_apply_diff_in_place lib/ovsdb-data.c:2345
      [...]

  ofproto/ofproto-dpif-rid.c:159:17: runtime error: null pointer passed as argument 1, which is declared to never be null
      #0 0x4df5d8 in frozen_state_equal ofproto/ofproto-dpif-rid.c:159
      #1 0x4dfd27 in recirc_find_equal ofproto/ofproto-dpif-rid.c:179
      [...]

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
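A minimal illustration of the kind of guard such fixes add (not the exact OVS patch) is simply to skip the library call when the count is zero:

    #include <stdlib.h>
    #include <string.h>

    static void
    copy_items(int *dst, const int *src, size_t n)
    {
        if (n) {                 /* src/dst may legitimately be NULL when n == 0. */
            memcpy(dst, src, n * sizeof *src);
        }
    }

    static int
    cmp_int(const void *a_, const void *b_)
    {
        const int *a = a_, *b = b_;
        return (*a > *b) - (*a < *b);
    }

    static void
    sort_items(int *items, size_t n)
    {
        if (n) {                 /* Avoid qsort(NULL, 0, ...), which is UB. */
            qsort(items, n, sizeof *items, cmp_int);
        }
    }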
* ovsdb: transaction: Keep one entry in the transaction history.
  Ilya Maximets | 2022-01-31 | 1 file changed, -2/+4

If a single transaction exceeds the size of the whole database (e.g., a lot of rows got removed and new ones added), transaction history will be drained.  This leads to sending UUID_ZERO to the clients as the last transaction id in the next monitor update, because monitor doesn't know what was the actual last transaction id.  In case of a re-connect that will cause re-downloading of the whole database, since the client's last_id will be out of sync.

One solution would be to store the last transaction ID separately from the actual transactions, but that will require a careful management in cases where database gets reset and the history needs to be cleared.  Keeping the one last transaction instead to avoid the problem.

That should not be a big concern in terms of memory consumption, because this last transaction will be removed from the history once the next transaction appeared.

This is also not a concern for a fast re-sync, because this last transaction will not be used for the monitor reply; it's either client already has it, so no need to send, or it's a history miss.

The test updated to not check the number of atoms if there is only one transaction in the history.

Fixes: 317b1bfd7dd3 ("ovsdb: Don't let transaction history grow larger than the database.")
Reported-at: https://bugzilla.redhat.com/2044621
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: storage: Randomize should_snapshot checks when the minimum time passed.
  Ilya Maximets | 2021-12-13 | 2 files changed, -3/+16

Snapshots are scheduled for every 10-20 minutes.  It's a random value in this interval for each server.  Once the time is up, but the maximum time (24 hours) not reached yet, ovsdb will start checking if the log grew a lot on every iteration.  Once the growth is detected, compaction is triggered.

OTOH, it's very common for an OVSDB cluster to not have the log growing very fast.  If the log didn't grow 2x in 20 minutes, the randomness of the initial scheduled time is gone and all the servers are checking if they need to create snapshot on every iteration.  And since all of them are part of the same cluster, their logs are growing with the same speed.  Once the critical mass is reached, all the servers will start creating snapshots at the same time.  If the database is big enough, that might leave the cluster unresponsive for an extended period of time (e.g. 10-15 seconds for OVN_Southbound database in a larger scale OVN deployment) until the compaction completed.

Fix that by re-scheduling a quick retry if the minimal time already passed.  Effectively, this will work as a randomized 1-2 min delay between checks, so the servers will not synchronize.

Scheduling function updated to not change the upper limit on quick reschedules to avoid delaying the snapshot creation indefinitely.  Currently quick re-schedules are only used for the error cases, and there is always a 'slow' re-schedule after the successful compaction.  So, the change of a scheduling function doesn't change the current behavior much.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
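The scheduling policy described above can be sketched roughly as follows; names, time handling, and the random source are simplified placeholders, not the actual ovsdb code:

    #include <stdbool.h>
    #include <stdlib.h>
    #include <time.h>

    /* Normal checks are spread over a random 10-20 minute window; "quick"
     * re-schedules retry in 1-2 minutes but leave the hard deadline alone
     * so a snapshot cannot be postponed forever. */
    enum { MIN_MS = 10 * 60 * 1000, SPAN_MS = 10 * 60 * 1000,
           QUICK_MS = 60 * 1000, QUICK_SPAN_MS = 60 * 1000 };

    struct snapshot_schedule {
        long long next_check;   /* Earliest time to consider a snapshot. */
        long long hard_limit;   /* Time by which we snapshot regardless. */
    };

    static long long
    now_ms(void)
    {
        return (long long) time(NULL) * 1000;
    }

    static void
    schedule_next(struct snapshot_schedule *s, bool quick)
    {
        long long t = now_ms();

        if (quick) {
            /* Randomized 1-2 min retry; hard_limit stays untouched. */
            s->next_check = t + QUICK_MS + rand() % QUICK_SPAN_MS;
        } else {
            s->next_check = t + MIN_MS + rand() % SPAN_MS;
            s->hard_limit = t + 24LL * 60 * 60 * 1000;
        }
    }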
* raft: Only allow followers to snapshot.
  Dumitru Ceara | 2021-12-13 | 1 file changed, -1/+1

Commit 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.") made it such that raft leaders transfer leadership before snapshotting.  However, there's still the case when the next leader to be is in the process of snapshotting.  To avoid delays in that case too, we now explicitly allow snapshots only on followers.  Cluster members will have to wait until the current election is settled before snapshotting.

Given the following logs taken from an OVN_Southbound 3-server cluster during a scale test:

  S1 (old leader):
    19:07:51.226Z|raft|INFO|Transferring leadership to write a snapshot.
    19:08:03.830Z|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
    19:08:03.940Z|raft|INFO|server 8b8d is leader for term 43

  S2 (follower):
    19:08:00.870Z|raft|INFO|server 8b8d is leader for term 43

  S3 (new leader):
    19:07:51.242Z|raft|INFO|received leadership transfer from f5c9 in term 42
    19:07:51.244Z|raft|INFO|term 43: starting election
    19:08:00.805Z|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
    19:08:00.869Z|raft|INFO|term 43: elected leader by 2+ of 3 servers

We see that the leader to be (S3) receives the leadership transfer, initiates the election and immediately after starts a snapshot that takes ~9.5 seconds.  During this time, S2 votes for S3 electing it as cluster leader but S3 doesn't effectively become leader until it finishes snapshotting, essentially keeping the cluster without a leader for up to ~9.5 seconds.

With the current change, S3 will delay compaction and snapshotting until the election is finished.  The only exception is the case of single-node clusters for which we allow the node to snapshot regardless of role.

Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb-data: Consolidate ovsdb atom and json strings.
  Ilya Maximets | 2021-11-30 | 3 files changed, -9/+11

ovsdb_atom_string and json_string are basically the same data structure and ovsdb-server frequently needs to convert one to another.  We can avoid that by using json_string from the beginning for all ovsdb strings.  So, the conversion turns into simple json_clone(), i.e. increment of a reference counter.

This change gives a moderate performance boost in some scenarios, improves the code clarity and may be useful for future development.

Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Don't let transaction history grow larger than the database.
  Ilya Maximets | 2021-11-05 | 3 files changed, -2/+62

If user frequently changes a lot of rows in a database, transaction history could grow way larger than the database itself.  This wastes a lot of memory and also makes monitor_cond_since slower than usual monitor_cond if the transaction id is old enough, because re-construction of the changes from a history is slower than just creation of initial database snapshot.  This is also the case if user deleted a lot of data, so transaction history still holds all of it while the database itself doesn't.

In case of current lb-per-service model in ovn-kubernetes, each load-balancer is added to every logical switch/router.  Such a transaction touches more than a half of a OVN_Northbound database.  And each of these transactions is added to the transaction history.  Since transaction history depth is 100, in worst case scenario, it will hold 100 copies of a database increasing memory consumption dramatically.  In tests with 3000 LBs and 120 LSs, memory goes up to 3 GB, while holding at 30 MB if transaction history disabled in the code.

Fixing that by keeping count of the number of ovsdb_atom's in the database and not allowing the total number of atoms in transaction history to grow larger than this value.  Counting atoms is fairly cheap because we don't need to iterate over them, so it doesn't have significant performance impact.  It would be ideal to measure the size of individual atoms, but that will hit the performance.  Counting cells instead of atoms is not sufficient, because OVN users are adding hundreds or thousands of atoms to a single cell, so they are largely different in size.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
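A rough sketch of such an atom-count-bounded eviction policy follows; the types and names are invented for illustration and do not match the ovsdb-server internals:

    #include <stddef.h>
    #include <stdlib.h>

    /* Drop the oldest history entries until the total number of atoms they
     * hold no longer exceeds the number of atoms in the live database. */
    struct history_entry {
        struct history_entry *next;   /* Singly linked, oldest first. */
        size_t n_atoms;
    };

    struct history {
        struct history_entry *oldest;
        size_t total_atoms;           /* Sum over all entries.        */
    };

    static void
    history_trim(struct history *h, size_t db_atoms)
    {
        while (h->oldest && h->total_atoms > db_atoms) {
            struct history_entry *e = h->oldest;

            h->oldest = e->next;
            h->total_atoms -= e->n_atoms;
            free(e);
        }
    }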
* ovsdb: transaction: Incremental reassessment of weak refs.
  Ilya Maximets | 2021-11-04 | 3 files changed, -100/+293

The main idea is to not store the list of weak references in the source row, so they all don't need to be re-checked/updated on every modification of that source row.  The point is that the source row already knows UUIDs of all destination rows stored in the data, so there is not much profit in storing this information somewhere else.  If needed, the destination row can be looked up and the reference can be looked up in the destination row.  For the fast lookup, the destination row now stores references in a hash map.

Weak reference structure now contains the table and uuid of a source row instead of a direct pointer.  This allows to replace/update the source row without breaking any weak references stored in destination rows.

Structure also now contains the key-value pair of atoms that triggered creation of this reference.  These atoms can be used to quickly subtract removed references from a source row.  During reassessment, ovsdb now only needs to care about newly added or removed atoms, and atoms that got removed due to removal of the destination rows, but these are marked for reassessment by the destination row.

ovsdb_datum_subtract() is used to remove atoms that point to removed or incorrect rows, so there is no need to re-sort datum in the end.

Results of an OVN load-balancer benchmark that adds 3K load-balancers to each of 120 logical switches and 120 logical routers in the OVN sandbox with clustered Northbound database and then removes them:

  Before:
    %CPU  CPU Time  CMD
    86.8  00:16:05  ovsdb-server nb1.db
    44.1  00:08:11  ovsdb-server nb2.db
    43.2  00:08:00  ovsdb-server nb3.db

  After:
    %CPU  CPU Time  CMD
    54.9  00:02:58  ovsdb-server nb1.db
    33.3  00:01:48  ovsdb-server nb2.db
    32.2  00:01:44  ovsdb-server nb3.db

So, on a cluster leader the processing time dropped by 5.4x, on followers - by 4.5x.  The more load-balancers, the larger the performance difference.

There is a slight increase of memory usage, because the new reference structure is larger, but the difference is not significant.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
* ovsdb-data: Deduplicate string atoms.
  Ilya Maximets | 2021-09-24 | 4 files changed, -72/+67

ovsdb-server spends a lot of time cloning atoms for various reasons, e.g. to create a diff of two rows or to clone a row to the transaction.  All atoms, except for strings, contain a simple value that could be copied in efficient way, but duplicating strings every time has a significant performance impact.

Introducing a new reference-counted structure 'ovsdb_atom_string' that allows to not copy strings every time, but just increase a reference counter.

This change allows to increase transaction throughput in benchmarks up to 2x for standalone databases and 3x for clustered databases, i.e. number of transactions that ovsdb-server can handle per second.  It also noticeably reduces memory consumption of ovsdb-server.

Next step will be to consolidate this structure with json strings, so we will not need to duplicate strings while converting database objects to json and back.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
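A minimal sketch of a reference-counted string of this kind (illustrative only, not the actual 'ovsdb_atom_string' definition) shows why cloning becomes cheap:

    #include <stdlib.h>
    #include <string.h>

    struct refcounted_string {
        unsigned int n_refs;
        char s[];                    /* Flexible array member holds the text. */
    };

    static struct refcounted_string *
    rstr_create(const char *s)
    {
        size_t len = strlen(s) + 1;
        struct refcounted_string *r = malloc(sizeof *r + len);

        r->n_refs = 1;
        memcpy(r->s, s, len);
        return r;
    }

    static struct refcounted_string *
    rstr_clone(struct refcounted_string *r)
    {
        r->n_refs++;                 /* O(1) instead of copying the bytes. */
        return r;
    }

    static void
    rstr_unref(struct refcounted_string *r)
    {
        if (r && !--r->n_refs) {
            free(r);
        }
    }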
* ovsdb-data: Add function to apply diff in-place.
  Ilya Maximets | 2021-09-24 | 1 file changed, -6/+4

ovsdb_datum_apply_diff() is heavily used in ovsdb transactions, but it's linear in terms of number of comparisons.  And it also clones all the atoms along the way.  In most cases size of a diff is much smaller than the size of the original datum; this allows to perform the same operation in-place with only O(diff->n * log2(old->n)) comparisons and O(old->n + diff->n) memory copies with memcpy.

Using this function while applying diffs read from the storage gives a significant performance boost and allows to execute much more transactions per second.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>