delta/openvswitch.git - github.com: openvswitch/ovs.git

	Commit message	Author	Age	Files	Lines
*	ovsdb: Avoid converting database twice on an initiator.	Ilya Maximets	2023-04-24	1	-7/+15
	The cluster member that initiates the schema conversion converts the database twice: first while verifying that the conversion is possible, and a second time after reading the conversion request back from the storage.

	Keep the converted database from the first pass and use it after reading the request back from the storage. This cuts the conversion CPU cost in half.

	Reviewed-by: Simon Horman <simon.horman@corigine.com>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb: Perform conversion with no data for clustered databases.	Ilya Maximets	2023-04-24	1	-0/+7
	Currently, database schema conversion for a clustered database produces a transaction record with both the new schema and the converted database data. So, the sequence of events is the following:

	  1. Get the new schema.
	  2. Convert the database to a new schema.
	  3. Translate the newly converted database into JSON.
	  4. Write the schema + data JSON to the storage.
	  5. Destroy converted version of a database.
	  6. Read schema + data JSON from the storage and parse.
	  7. Create a new database from a parsed database data.
	  8. Replace current database with the new one.

	Most of these steps are very computationally expensive. Also, conversion to/from JSON is much more expensive than direct database conversion with ovsdb_convert() that can make use of shallow data copies.

	Instead of doing all that, let's make use of the previously introduced ability to not write the converted data into the storage. The process will then look like this:

	  1. Get the new schema.
	  2. Convert the database to a new schema (to verify that it is possible).
	  3. Write the schema to the storage.
	  4. Destroy converted version of a database.
	  5. Read the new schema from the storage and parse.
	  6. Convert the database to a new schema.
	  7. Replace current database with the new one.

	Most of the operations here are performed on the small schema object, instead of the actual database data. The two remaining data operations (actual conversion) are noticeably faster than conversion to/from JSON due to reference counting and shallow data copies.

	Steps 4-6 can be optimized later to not convert twice on the process that initiates the conversion.

	The change results in the following performance improvements in conversion of the OVN_Southbound database schema from version 20.23.0 to 20.27.0 (measured on a single-server RAFT cluster with no clients):

	          |           Before            |            After
	          +---------+-------------------+---------+-------------------
	  DB size | Total   | Max poll interval | Total   | Max poll interval
	  --------+---------+-------------------+---------+-------------------
	  542 MB  | 47 sec. | 26 sec.           | 15 sec. | 10 sec.
	  225 MB  | 19 sec. | 10 sec.           | 6 sec.  | 4.5 sec.

	542 MB database had 19.5 M atoms, 225 MB database had 7.5 M atoms.

	Overall performance improvement is about 3x.

	Also, note that before this change database conversion basically doubles the database file on disk. Now it only writes a small schema JSON.

	Since the change requires backward-incompatible database file format changes, documentation is updated on how to perform an upgrade. Handled the same way as we did for the previous incompatible format change in 2.15 (column diffs).

	Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052140.html
	Reviewed-by: Simon Horman <simon.horman@corigine.com>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb: Allow conversion records with no data in a clustered storage.	Ilya Maximets	2023-04-24	1	-21/+44
	If the schema with no data was read from the clustered storage, it should mean a database conversion request. In general, we can get:

	  1. Just data --> Transaction record.
	  2. Schema + Data --> Database conversion or raft snapshot install.
	  3. Just schema --> New. Database conversion request.

	We cannot distinguish between conversion and snapshot installation request in the current implementation, so we will keep handling conversion with data in the same way as before, i.e. if data is provided, we should use it.

	ovsdb-tool is updated to handle this record type as well while converting cluster to standalone.

	This change doesn't introduce a way for such records to appear in the database. That will be added in the future commits targeting conversion speed increase.

	Reviewed-by: Simon Horman <simon.horman@corigine.com>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
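The three record shapes listed in this commit message can be sketched as a small classifier. This is only an illustrative Python model, not the C code from the commit: `RecordKind` and `classify_record` are hypothetical names, and a record is reduced to two optional values.

```python
from enum import Enum
from typing import Optional


class RecordKind(Enum):
    TRANSACTION = "transaction record"
    CONVERSION_OR_SNAPSHOT = "conversion or raft snapshot install"
    CONVERSION_REQUEST = "conversion request (schema only)"
    EMPTY = "nothing to apply"


def classify_record(schema: Optional[dict], data: Optional[dict]) -> RecordKind:
    """Classify a record read from clustered storage (hypothetical helper)."""
    if schema is None and data is None:
        return RecordKind.EMPTY
    if schema is None:
        # 1. Just data: an ordinary transaction.
        return RecordKind.TRANSACTION
    if data is not None:
        # 2. Schema + data: cannot tell conversion from snapshot install,
        #    so the provided data is used as before.
        return RecordKind.CONVERSION_OR_SNAPSHOT
    # 3. Just schema: the new record type, a conversion request.
    return RecordKind.CONVERSION_REQUEST
```

The ordering of the checks mirrors the rule stated above: whenever data is present alongside a schema, it is used.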
*	dpdk: Allow retaining CAP_SYS_RAWIO privileges.	Aaron Conole	2023-03-22	1	-2/+2
	Open vSwitch generally tries to let the underlying operating system manage the low level details of hardware, for example DMA mapping, bus arbitration, etc. However, when using DPDK, the underlying operating system yields control of many of these details to userspace for management.

	In the case of some DPDK port drivers, configuring rte_flow or even allocating resources may require access to iopl/ioperm calls, which are guarded by the CAP_SYS_RAWIO privilege on Linux systems. These calls are dangerous, and can allow a process to completely compromise a system. However, they are needed in the case of some userspace driver code which manages the hardware (for example, the mlx implementation of backend support for rte_flow).

	Here, we create an opt-in flag passed to the command line to allow this access. We need to do this before ever accessing the database, because we want to drop all privileges asap, and cannot wait for a connection to the database to be established and functional before dropping. There may be distribution specific ways to do capability management as well (using for example, systemd), but they are not as universal to the vswitchd as a flag.

	Reviewed-by: Simon Horman <simon.horman@corigine.com>
	Signed-off-by: Aaron Conole <aconole@redhat.com>
	Acked-by: Flavio Leitner <fbl@sysclose.org>
	Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb-server: Don't log when memory-trim-on-compaction doesn't change.	Dan Williams	2022-12-21	1	-2/+7
	But log at least once even if the value hasn't changed, for informational purposes.

	Signed-off-by: Dan Williams <dcbw@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
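The logging rule described here (stay quiet when the setting is unchanged, but log at least once) could look roughly like the Python sketch below. The class and attribute names are hypothetical and the real change lives in C inside ovsdb-server; the sketch only shows the control flow.

```python
class TrimOnCompactionSetting:
    """Hypothetical model of a boolean setting with 'log once' behaviour."""

    def __init__(self) -> None:
        self.enabled = False
        self.logged_once = False
        self.messages: list[str] = []  # stands in for the log output

    def set(self, enabled: bool) -> None:
        changed = enabled != self.enabled
        self.enabled = enabled
        # Log on a real change, or the very first time for information.
        if changed or not self.logged_once:
            state = "enabled" if enabled else "disabled"
            self.messages.append(f"memory trimming after compaction {state}")
            self.logged_once = True
```

Calling `set(False)` twice on a fresh object produces a single message; a later `set(True)` produces one more.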
*	ovsdb: Prepare snapshot JSON in a separate thread.	Ilya Maximets	2022-07-13	1	-3/+15
	Conversion of the database data into a JSON object, serialization and destruction of that object are the heaviest operations during the database compaction. If these operations are moved to a separate thread, the main thread can continue processing database requests in the meantime.

	With this change, the compaction is split in 3 phases:

	  1. Initialization:
	     - Create a copy of the database.
	     - Remember current database index.
	     - Start a separate thread to convert a copy of the database into serialized JSON object.

	  2. Wait:
	     - Continue normal operation until compaction thread is done.
	     - Meanwhile, compaction thread:
	       * Convert database copy to JSON.
	       * Serialize resulted JSON.
	       * Destroy original JSON object.

	  3. Finish:
	     - Destroy the database copy.
	     - Take the snapshot created by the thread.
	     - Write on disk.

	The key for this scheme to be fast is the ability to create a shallow copy of the database. This doesn't take too much time, allowing the thread to do most of the work.

	Database copy is created and destroyed only by the main thread, so there is no need for synchronization.

	This solution reduces the time the main thread is blocked by compaction by 80-90%. For example, in ovn-heater tests with the 120 node density-heavy scenario, where compaction normally takes 5-6 seconds at the end of a test, measured compaction times were all below 1 second with the change applied. Also, note that these measured times are the sum of phases 1 and 3, so actual poll intervals are about half a second in this case.

	Only implemented for raft storage for now. The implementation for standalone databases can be added later by using a file offset as a database index and copying newly added changes from the old file to a new one during ovsdb_log_replace().

	Reported-at: https://bugzilla.redhat.com/2069108
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
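The three phases can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the ovsdb implementation: the database is a plain dict of tables, `CompactionSketch` is a hypothetical name, and the main thread is assumed to replace tables rather than mutate them in place, which is what makes the shallow copy safe to serialize concurrently.

```python
import copy
import json
import threading


class CompactionSketch:
    """Hypothetical three-phase compaction: init, wait, finish."""

    def __init__(self, db: dict) -> None:
        self.db = db      # table name -> table contents
        self.index = 0    # index of the last applied transaction

    def start(self) -> None:
        # Phase 1: shallow copy, remember the index, start the worker.
        self._copy = copy.copy(self.db)
        self._snapshot_index = self.index
        self._result = None
        self._thread = threading.Thread(target=self._serialize)
        self._thread.start()

    def _serialize(self) -> None:
        # Phase 2 (worker thread): convert the copy to serialized JSON.
        self._result = json.dumps(self._copy, sort_keys=True)

    def finish(self) -> tuple:
        # Phase 3: wait for the worker, drop the copy, take the snapshot.
        self._thread.join()
        self._copy = None
        return self._snapshot_index, self._result
```

Between `start()` and `finish()` the main thread keeps applying transactions to `self.db`; the snapshot still reflects the state at the remembered index.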
*	ovsdb: Enable memory trimming after compaction by default.	Ilya Maximets	2022-07-12	1	-1/+1
	Memory trimming was introduced in OVS 2.15 and hasn't caused any issues in production environments since then, while allowing ovsdb-server to consume a lot less memory in high scale OVN deployments. Enabling by default to make it easier to use.

	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb-server: Log database transactions for user requested tables.	Dumitru Ceara	2022-06-28	1	-0/+87
	Add a new command, 'ovsdb-server/tlog-set DB:TABLE on|off', which allows the user to enable/disable transaction logging for specific databases and tables.

	By default, logging is disabled. Once enabled, logs are generated with level INFO and are also rate limited.

	If used with care, this command can be useful in analyzing production deployment performance issues, allowing the user to pinpoint bottlenecks without the need to enable wider debug logs, e.g., jsonrpc.

	A command to inspect the logging state is also added: 'ovsdb-server/tlog-list'.

	Signed-off-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	hmap: use short version of safe loops if possible.	Adrian Moreno	2022-03-30	1	-6/+5
	Using SHORT version of the *_SAFE loops makes the code cleaner and less error prone. So, use the SHORT version and remove the extra variable when possible for hmap and all its derived types.

	In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments.

	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Acked-by: Eelco Chaudron <echaudro@redhat.com>
	Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb: raft: Fix inability to read the database with DNS host names.	Ilya Maximets	2022-03-30	1	-0/+3
	Clustered OVSDB allows DNS names to be used as addresses of raft members. However, if DNS resolution fails during the initial database read, this causes a fatal failure and exit of the ovsdb-server process.

	Also, if the DNS name of a joining server is not resolvable for one of the followers, this follower will reject append requests for a new server to join until the name is successfully resolved. This makes a follower effectively non-functional while DNS is unavailable.

	To fix the problem, relax the address verification, allowing validation to pass if only name resolution failed and the address is otherwise valid. This will allow addresses to be added to the database, so connections could be established later when the DNS is available.

	Additionally fixing missed initialization of the dns-resolve module. Without it, DNS requests are blocking. This causes unexpected delays in runtime.

	Fixes: 771680d96fb6 ("DNS: Add basic support for asynchronous DNS resolving")
	Reported-at: https://bugzilla.redhat.com/2055097
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb: relay: Add transaction history support.	Ilya Maximets	2022-03-03	1	-3/+5
	Even though relays can be scaled to a large number of servers to handle a lot more clients, lack of transaction history may cause significant load if clients are re-connecting. E.g. in case of the upgrade of a large-scale OVN deployment, relays can be taken down one by one forcing all the clients of one relay to jump to other ones. And all these clients will download the database from scratch from a new relay.

	Since the relay itself supports monitor_cond_since connection to the main cluster, it receives the last transaction id along with each update. Since these transaction ids are 'eid's of actual transactions, they can be used by the relay for a transaction history.

	A relay may not receive all the transaction ids, because the main cluster may combine several changes into a single monitor update. However, all relays will, likely, receive the same updates with the same transaction ids, so the case where a transaction id can not be found after re-connection between relays should not be very common. If some id is missing on the relay (i.e. this update was merged with some other update and a newer id was used) the client will just re-download the database as if there was a normal transaction history miss.

	OVSDB client synchronization module updated to provide the last transaction id along with the update. Relay module updated to use these ids as a transaction id. If ids are zero, the relay decides that the main server doesn't support transaction ids and disables the transaction history accordingly.

	Using ovsdb_txn_replay_commit() instead of ovsdb_txn_propose_commit_block(), so transactions are added to the history. This can be done because relays have no file storage, so there is no need to write anything.

	Relay tests modified to test both standalone and clustered database as a main server. Checks added to ensure that all servers receive the same transaction ids in monitor updates.

	Acked-by: Mike Pattrick <mkp@redhat.com>
	Acked-by: Han Zhou <hzhou@ovn.org>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
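The behaviour described above (record the id that arrives with each update, treat an unknown id as a history miss, and disable history when the main server sends zero ids) can be sketched in Python. `RelayTxnHistory`, `ZERO_ID`, and the size limit are hypothetical and chosen for illustration; the real relay code is C and keeps actual transactions, not strings.

```python
from collections import OrderedDict

# Assumption: an all-zero UUID means "main server has no transaction ids".
ZERO_ID = "00000000-0000-0000-0000-000000000000"


class RelayTxnHistory:
    """Hypothetical bounded history keyed by transaction id."""

    def __init__(self, limit: int = 100) -> None:
        self.enabled = True
        self.limit = limit
        self._entries: OrderedDict = OrderedDict()

    def add(self, txn_id: str, update: str) -> None:
        if txn_id == ZERO_ID:
            # No ids from the main server: disable the history.
            self.enabled = False
            self._entries.clear()
            return
        if not self.enabled:
            return
        self._entries[txn_id] = update
        while len(self._entries) > self.limit:
            self._entries.popitem(last=False)  # drop the oldest entry

    def updates_since(self, last_id: str):
        """Return updates after last_id, or None on a history miss."""
        if not self.enabled or last_id not in self._entries:
            return None
        ids = list(self._entries)
        return [self._entries[i] for i in ids[ids.index(last_id) + 1:]]
```

A `None` result corresponds to the case where the client re-downloads the whole database, for example when its last id was merged into a later update and never reached this relay.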
*	ovsdb-data: Consolidate ovsdb atom and json strings.	Ilya Maximets	2021-11-30	1	-3/+4
	ovsdb_atom_string and json_string are basically the same data structure and ovsdb-server frequently needs to convert one to another. We can avoid that by using json_string from the beginning for all ovsdb strings. So, the conversion turns into simple json_clone(), i.e. increment of a reference counter. This change gives a moderate performance boost in some scenarios, improves the code clarity and may be useful for future development.

	Acked-by: Mike Pattrick <mkp@redhat.com>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb-data: Deduplicate string atoms.	Ilya Maximets	2021-09-24	1	-3/+3
	ovsdb-server spends a lot of time cloning atoms for various reasons, e.g. to create a diff of two rows or to clone a row to the transaction. All atoms, except for strings, contain a simple value that can be copied efficiently, but duplicating strings every time has a significant performance impact.

	Introducing a new reference-counted structure 'ovsdb_atom_string' that avoids copying strings every time and just increases a reference counter instead.

	This change increases transaction throughput in benchmarks by up to 2x for standalone databases and 3x for clustered databases, i.e. the number of transactions that ovsdb-server can handle per second. It also noticeably reduces memory consumption of ovsdb-server.

	Next step will be to consolidate this structure with json strings, so we will not need to duplicate strings while converting database objects to json and back.

	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
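The idea of cloning by incrementing a counter instead of copying the bytes can be illustrated with a small Python model. Only the structure name 'ovsdb_atom_string' comes from the commit message; the field and function names below are assumptions made for the sketch, and Python itself already shares string objects, so this only demonstrates the bookkeeping.

```python
class AtomString:
    """Hypothetical stand-in for a reference-counted string atom."""

    def __init__(self, value: str) -> None:
        self.value = value
        self.n_refs = 1


def atom_string_clone(atom: AtomString) -> AtomString:
    # Cloning shares the same object and bumps the counter; no copy is made.
    atom.n_refs += 1
    return atom


def atom_string_unref(atom: AtomString) -> bool:
    """Drop one reference; return True when the last one is gone."""
    atom.n_refs -= 1
    return atom.n_refs == 0
```

Cloning a row with many string columns then costs one increment per string rather than one allocation and copy per string.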
*	ovsdb: relay: Reflect connection status in _Server database.	Ilya Maximets	2021-07-15	1	-1/+2
	It might be important for clients to know that relay lost connection with the relay remote, so they could re-connect to other relay.

	Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb: New ovsdb 'relay' service model.	Ilya Maximets	2021-07-15	1	-29/+70
	New database service model 'relay' that is needed to scale out read-mostly database access, e.g. ovn-controller connections to OVN_Southbound.

	In this service model ovsdb-server connects to existing OVSDB server and maintains in-memory copy of the database. It serves read-only transactions and monitor requests by its own, but forwards write transactions to the relay source.

	Key differences from the active-backup replication:
	  - support for "write" transactions (next commit).
	  - no on-disk storage. (probably, faster operation)
	  - support for multiple remotes (connect to the clustered db).
	  - doesn't try to keep connection as long as possible, but faster reconnects to other remotes to avoid missing updates.
	  - No need to know the complete database schema beforehand, only the schema name.
	  - can be used along with other standalone and clustered databases by the same ovsdb-server process. (doesn't turn the whole jsonrpc server to read-only mode)
	  - supports modern version of monitors (monitor_cond_since), because based on ovsdb-cs.
	  - could be chained, i.e. multiple relays could be connected one to another in a row or in a tree-like form.
	  - doesn't increase availability.
	  - cannot be converted to other service models or become a main active server.

	Some performance test results can be found here: https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385825.html

	Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb: storage: Allow setting the name for the unbacked storage.	Ilya Maximets	2021-07-15	1	-1/+1
	ovsdb_create() requires schema or storage to be nonnull, but in practice it requires a schema name or a storage name to use as a database name. Only clustered storage has a name. This means that only a clustered database can be created without a schema. Change that by allowing unbacked storage to have a name. This way we can create a database with unbacked storage without a schema. Will be used in next commits to create database for ovsdb 'relay' service model.

	Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb-server: Fix memleak when failing to read storage.	Dumitru Ceara	2021-07-15	1	-5/+3
	Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
	Signed-off-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb-server: Don't update manager status if replay engine is active.	Ilya Maximets	2021-06-07	1	-3/+7
	The current version of the replay engine doesn't correctly handle internal time-based events that end up in stream events. For example, updates of a database status that happen every 2.5 seconds result in updates on client monitors.

	Disable updates for now if replay engine is active. The very first update is kept to store the initial information about the server.

	The proper solution would be to record time and replay it, probably, with time warping or in some other way.

	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
*	ovsdb-server: Integrate stream replay engine.	Ilya Maximets	2021-06-07	1	-0/+6
	This change adds support of stream record/replay functionality to ovsdb-server.

	Since current replay engine doesn't work well with time-based events generated locally, it will work only with standalone databases for now (raft heavily depends on time).

	To use this functionality run:

	  Recording:

	    # create a directory for replay files.
	    mkdir replay_dir
	    # copy current db for later use by replay
	    cp my_db ./replay_dir/my_db
	    ovsdb-server --record=./replay_dir <OVSDB_ARGS> my_db
	    # connect some clients and run some ovsdb transactions
	    ovs-appctl -t ovsdb-server exit

	  Replay:

	    # restore db from the copy
	    cp ./replay_dir/my_db my_db.for_replay
	    ovsdb-server --replay=./replay_dir <OVSDB_ARGS> my_db.for_replay

	At this point ovsdb-server should execute all the same commands and transactions. Since the last command was 'exit' via unixctl, ovsdb-server will exit in the end.

	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
	Acked-by: Dumitru Ceara <dceara@redhat.com>
*	ovsdb: Use column diffs for ovsdb and raft log entries.	Ilya Maximets	2021-01-15	1	-0/+8
	Currently, ovsdb-server stores complete value for the column in a database file and in a raft log in case this column changed. This means that transaction that adds, for example, one new acl to a port group creates a log entry with all UUIDs of all existing acls + one new. Same for ports in logical switches and routers and more other columns with sets in Northbound DB.

	There could be thousands of acls in one port group or thousands of ports in a single logical switch. And the typical use case is to add one new if we're starting a new service/VM/container or adding one new node in a kubernetes or OpenStack cluster. This generates huge amount of traffic within ovsdb raft cluster, grows overall memory consumption and hurts performance since all these UUIDs are parsed and formatted to/from json several times and stored on disks. And more values we have in a set - more space a single log entry will occupy and more time it will take to process by ovsdb-server cluster members.

	Simple test:

	1. Start OVN sandbox with clustered DBs:

	     # make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'

	2. Run a script that creates one port group and adds 4000 acls into it:

	     # cat ../memory-test.sh
	     pg_name=my_port_group
	     export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
	     ovn-nbctl pg-add $pg_name
	     for i in $(seq 1 4000); do
	       echo "Iteration: $i"
	       ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
	     done
	     ovn-nbctl acl-del $pg_name
	     ovn-nbctl pg-del $pg_name
	     ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
	     ovn-appctl -t ovn-nbctl exit
	     ---

	4. Check the current memory consumption of ovsdb-server processes and space occupied by database files:

	     # ls sandbox/[ns]b*.db -alh
	     # ps -eo vsz,rss,comm,cmd | egrep '=[ns]b[123].pid'

	Test results with current ovsdb log format:

	  On-disk Nb DB size     : ~369 MB
	  RSS of Nb ovsdb-servers: ~2.7 GB
	  Time to finish the test: ~2m

	In order to mitigate memory consumption issues and reduce computational load on ovsdb-servers let's store diff between old and new values instead. This will make size of each log entry that adds single acl to port group (or port to logical switch or anything else like that) very small and independent from the number of already existing acls (ports, etc.).

	Added a new marker '_is_diff' into a file transaction to specify that this transaction contains diffs instead of replacements for the existing data.

	One side effect is that this change will actually increase the size of file transaction that removes more than a half of entries from the set, because diff will be larger than the resulted new value. However, such operations are rare.

	Test results with change applied:

	  On-disk Nb DB size     : ~2.7 MB   ---> reduced by 99%
	  RSS of Nb ovsdb-servers: ~580 MB   ---> reduced by 78%
	  Time to finish the test: ~1m27s    ---> reduced by 27%

	After this change new ovsdb-server is still able to read old databases, but old ovsdb-server will not be able to read new ones. Since new servers could join ovsdb cluster dynamically it's hard to implement any runtime mechanism to handle cases where different versions of ovsdb-server joins the cluster. However we still need to handle cluster upgrades. For this case added special command line argument to disable new functionality. Documentation updated with the recommended way to upgrade the ovsdb cluster.

	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
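The size argument in this commit message can be illustrated with plain Python sets. The sketch assumes that a set-column diff behaves like a symmetric difference (elements present in exactly one of the old and new values), so applying the same diff to the old value yields the new one; the function names are hypothetical and the on-disk format is not modelled.

```python
def column_diff(old: set, new: set) -> set:
    """Elements added or removed between old and new (assumed diff form)."""
    return old ^ new


def apply_diff(old: set, diff: set) -> set:
    """Re-create the new value from the old value and a diff."""
    return old ^ diff


# Adding one acl to a port group that already has 4000 of them:
old_acls = set(range(4000))
new_acls = old_acls | {4000}
small_diff = column_diff(old_acls, new_acls)   # one element, not 4001

# Removing more than half of the entries: the diff outgrows the new value.
big_removal = column_diff({1, 2, 3, 4, 5}, {1})  # four elements vs one
```

Under this assumption the log entry size depends on how much changed rather than on how large the set already is, which matches both the headline saving and the side effect noted for large removals.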
*	ovsdb-server: Reclaim heap memory after compaction.	Ilya Maximets	2020-11-03	1	-2/+39
	Compaction happens at most once in 10 minutes. That is a big time interval for a heavy loaded ovsdb-server in cluster mode. In 10 minutes raft logs could grow up to tens of thousands of entries with tens of gigabytes in total size. While compaction cleans up raft log entries, the memory in many cases is not returned to the system, but kept in the heap of running ovsdb-server process, and it could stay in this condition for a really long time. In the end one performance spike could lead to a fast growth of the raft log and this memory will never (for a really long time) be released to the system even if the database is empty.

	Simple example how to reproduce with OVN sandbox:

	1. make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'

	2. Run following script that creates 1 port group, adds 4000 acls and removes all of that in the end:

	     # cat ../memory-test.sh
	     pg_name=my_port_group
	     export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
	     ovn-nbctl pg-add $pg_name
	     for i in $(seq 1 4000); do
	       echo "Iteration: $i"
	       ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
	     done
	     ovn-nbctl acl-del $pg_name
	     ovn-nbctl pg-del $pg_name
	     ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
	     ovn-appctl -t ovn-nbctl exit
	     ---

	3. Stopping one of Northbound DB servers:

	     ovs-appctl -t $(pwd)/sandbox/nb1 exit

	   Make sure that ovsdb-server didn't compact the database before it was stopped. Now we have a db file on disk that contains 4000 fairly big transactions inside.

	4. Trying to start same ovsdb-server with this file.

	     # cd sandbox && ovsdb-server <...> nb1.db

	   At this point ovsdb-server reads all the transactions from db file and performs all of them as fast as it can one by one. When it finishes this, raft log contains 4000 entries and ovsdb-server consumes (on my system) ~13GB of memory while database is empty. And libc will likely never return this memory back to system, or, at least, will hold it for a really long time.

	This patch adds a new command 'ovsdb-server/memory-trim-on-compaction'. It's disabled by default, but once enabled, ovsdb-server will call 'malloc_trim(0)' after every successful compaction to try to return unused heap memory back to system. This is glibc-specific, so we need to detect function availability at build time.

	Disabled by default since it adds from 1% to 30% (depending on the current state) to the snapshot creation time and, also, next memory allocations will likely require requests to kernel and that might be slower. Could be enabled by default later if considered broadly beneficial.

	Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
	Acked-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	Eliminate "whitelist" and "blacklist" terms.	Ben Pfaff	2020-10-16	1	-4/+4
	There is one remaining use under datapath. That change should happen upstream in Linux first according to our usual policy.

	Signed-off-by: Ben Pfaff <blp@ovn.org>
	Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
*	ovsdb: Add unixctl command to show storage status.	Dumitru Ceara	2020-09-16	1	-0/+39
	If a database enters an error state, e.g., in case of RAFT when reading the DB file contents if applying the RAFT records triggers constraint violations, there's no way to determine this unless a client generates a write transaction. Such write transactions would fail with "ovsdb-error: inconsistent data".

	This commit adds a new command to show the status of the storage that's backing a database.

	Example, on an inconsistent database:

	  $ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
	  status: ovsdb error: inconsistent data

	Example, on a consistent database:

	  $ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
	  status: ok

	Signed-off-by: Dumitru Ceara <dceara@redhat.com>
	Acked-by: Han Zhou <hzhou@ovn.org>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb-server: Replace in-memory DB contents at raft install_snapshot.	Dumitru Ceara	2020-08-06	1	-8/+13
	Every time a follower has to install a snapshot received from the leader, it should also replace the data in memory. Right now this only happens when snapshots are installed that also change the schema.

	This can lead to inconsistent DB data on follower nodes and the snapshot may fail to get applied.

	Fixes: bda1f6b60588 ("ovsdb-server: Don't disconnect clients after raft install_snapshot.")
	Acked-by: Han Zhou <hzhou@ovn.org>
	Signed-off-by: Dumitru Ceara <dceara@redhat.com>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb-server: Fix schema leak while reading db.	Ilya Maximets	2020-05-28	1	-2/+3
	parse_txn() function doesn't always take ownership of the 'schema' passed. So, if the schema of the clustered db has the same version as the one that is already in use, parse_txn() will not use it, resulting in a memory leak:

	  7,827 (56 direct, 7,771 indirect) bytes in 1 blocks are definitely lost
	    at 0x483BB1A: calloc (vg_replace_malloc.c:762)
	    by 0x44AD02: xcalloc (util.c:121)
	    by 0x40E70E: ovsdb_schema_create (ovsdb.c:41)
	    by 0x40EA6D: ovsdb_schema_from_json (ovsdb.c:217)
	    by 0x415EDD: ovsdb_storage_read (storage.c:280)
	    by 0x408968: read_db (ovsdb-server.c:607)
	    by 0x40733D: main_loop (ovsdb-server.c:227)
	    by 0x40733D: main (ovsdb-server.c:469)

	While we could put ovsdb_schema_destroy() in a few places inside 'parse_txn()', from the users' point of view it seems better to have a constant argument and just clone the 'schema' if needed. The caller will be responsible for destroying the 'schema' it owns.

	Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
	Acked-by: Han Zhou <hzhou@ovn.org>
	Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
*	ovsdb-server: Don't disconnect clients after raft install_snapshot.	Han Zhou	2020-03-06	1	-1/+2
	When the "schema" field is found in read_db(), there can be two cases:

	  1. There is a schema change in the clustered DB and the "schema" is the new one.
	  2. An install_snapshot RPC happened, which caused log compaction on the server, and the next log is just the snapshot, which always contains the "schema" field, even though the schema hasn't been changed.

	The current implementation doesn't handle case 2), and always assumes the schema has changed, hence disconnects all clients of the server. This can cause stability problems when a large number of clients are connected when this happens in a large scale environment.

	Signed-off-by: Han Zhou <hzhou@ovn.org>
	Signed-off-by: Ben Pfaff <blp@ovn.org>
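A minimal way to picture the distinction between the two cases is a version comparison, which is the criterion the 2020-05-28 schema-leak commit message mentions for parse_txn(). The Python below is a sketch with hypothetical names, not the code from this commit; schemas are modelled as dicts with a "version" key.

```python
from typing import Optional


def schema_changed(current_version: Optional[str], record_schema: dict) -> bool:
    """Return True only when the record carries a different schema version.

    Assumption: a snapshot installed via install_snapshot repeats the
    schema already in use, so its version matches and clients stay connected.
    """
    if current_version is None:
        # No schema loaded yet: any schema counts as new.
        return True
    return record_schema["version"] != current_version
```

With this check, a snapshot that repeats version "20.23.0" on a server already running "20.23.0" does not trigger a client disconnect, while a record carrying "20.27.0" does.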
*	ovsdb replication: Provide option to configure probe interval.	Numan Siddique	2020-01-07	1	-7/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When ovsdb-server is in backup mode and connects to the active ovsdb-server for replication, and it takes more than 5 seconds to get the dump of the whole database, it will drop the connection soon after, as the default probe interval is 5 seconds. This results in a snowball effect of reconnections to the active ovsdb-server. This patch mitigates the issue by setting the default probe interval to 60 seconds and providing an option to configure this value from the unixctl command. Another option could be to increase the value of 'RECONNECT_DEFAULT_PROBE_INTERVAL'. Acked-by: Mark Michelson <mmichels@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Numan Siddique <numans@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb monitor: Fix crash when using non-zero last-id with standalone DB.	Han Zhou	2019-08-21	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a client uses monitor-cond-since with a non-zero last-id but the server is not in cluster mode for the DB being monitored, it leads to a segmentation fault because the txn_history list is not initialized in this case. Program terminated with signal SIGSEGV, Segmentation fault. 1536 struct ovsdb_txn *txn = h_node->txn; (gdb) bt 0 ovsdb_monitor_get_changes_after (txn_uuid=txn_uuid@entry=0x7ffe8605b7e0, dbmon=0x17c1b40, p_mcs=p_mcs@entry=0x17c4900) at ovsdb/monitor.c:1536 1 0x000000000040da2d in ovsdb_jsonrpc_monitor_create (request_id=0x1804630, version=<optimized out>, params=0x17ad330, db=0x18015b0, s=<optimized out>) at ovsdb/jsonrpc-server.c:1469 2 ovsdb_jsonrpc_session_got_request (request=0x17ad520, s=<optimized out>) at ovsdb/jsonrpc-server.c:1002 3 ovsdb_jsonrpc_session_run (s=<optimized out>) at ovsdb/jsonrpc-server.c:556 ... Although it doesn't happen in normal use cases, no one can prevent a client from sending this on purpose, or in a corner case where a client first connected to a clustered DB but the server was later restarted with a non-clustered DB. This patch fixes it by always initializing the txn_history list to avoid the undefined behavior in this case. It adds a test case to cover it, too. Fixes: 695e815 ("ovsdb-server: Transaction history tracking.") Reported-by: Aliasgar Ginwala <aginwala@ebay.com> Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb: Move trigger_run after storage_run and read_db.	Han Zhou	2019-03-04	1	-2/+4
\| \| \| \| \| \| \| \|	Run triggers after storage_run and read_db to make sure new raft updates are utilized in current iteration. Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb-server: Transaction history tracking.	Han Zhou	2019-02-28	1	-0/+9
\| \| \| \| \| \| \| \| \|	Maintain the last N (N = 100) transactions in memory. Future patches will use them to generate monitor data from any point within these N transactions. Signed-off-by: Han Zhou <hzhou8@ebay.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
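As a rough illustration of keeping only the last N transactions, the sketch below uses a fixed-size ring buffer of transaction ids. This is an assumption made for brevity, not the actual data structure (a later message in this log refers to a txn_history list); 'struct txn_history_sketch', history_add() and history_count_after() are hypothetical names.

```c
#include <stddef.h>

#define HISTORY_MAX 100         /* N from the commit message. */

/* Hypothetical fixed-size history of transaction ids. */
struct txn_history_sketch {
    unsigned int ids[HISTORY_MAX];
    size_t start;               /* Index of the oldest entry. */
    size_t count;               /* Number of valid entries. */
};

/* Records 'id' as the newest transaction, dropping the oldest one once
 * HISTORY_MAX entries are stored. */
static void
history_add(struct txn_history_sketch *h, unsigned int id)
{
    if (h->count < HISTORY_MAX) {
        h->ids[(h->start + h->count) % HISTORY_MAX] = id;
        h->count++;
    } else {
        /* Full: overwrite the oldest entry and advance 'start'. */
        h->ids[h->start] = id;
        h->start = (h->start + 1) % HISTORY_MAX;
    }
}

/* Returns the number of transactions recorded after 'id', or -1 if 'id' is
 * no longer (or never was) in the history. */
static int
history_count_after(const struct txn_history_sketch *h, unsigned int id)
{
    for (size_t i = 0; i < h->count; i++) {
        if (h->ids[(h->start + i) % HISTORY_MAX] == id) {
            return (int) (h->count - i - 1);
        }
    }
    return -1;
}
```

A result of -1 corresponds to the case where a client's last known transaction has already aged out, so incremental data cannot be generated from that point.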
*	ovsdb-server: Correct json-rpc comment for "disable-monitor-cond".	Justin Pettit	2019-01-16	1	-1/+1
\| \| \| \| \|	Signed-off-by: Justin Pettit <jpettit@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
*	ovsdb-server: Don't log closing session at program termination.	Ben Pfaff	2018-08-15	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When ovsdb-server closes a remote connection, it logs a message about it that includes the reason. Until now this has included sessions that it closes when it exits. That meant that, when --run was used, there was a race between noticing that the subprocess exited and noticing that the session that that subprocess (presumably) had open had been closed. If it noticed the latter first, nothing was logged (because it didn't log anything if a session was closed in the ordinary way by the client). If it noticed the former first, it logged a message about closing the session itself. This is a benign race that causes no real problems--except that the tests didn't expect to see the log message from the former case and fail with errors like the following: 1826. ovsdb-server.at:92: testing truncating database log with bad transaction ... ./ovsdb-server.at:96: ovsdb-tool create db schema stderr: stdout: ./ovsdb-server.at:104: ovsdb-server --remote=punix:socket db --run="sh txnfile" --- /dev/null 2018-04-24 08:50:58.769000000 +0000 +++ /root/openvswitch-2.9.2/rpm/rpmbuild/BUILD/openvswitch-2.9.2/tests/testsuite.dir/at-groups/1826/stderr 2018-05-29 14:29:56.529257295 +0000 @@ -0,0 +1,2 @@ +2018-05-29T14:29:56Z\|00001\|ovsdb_jsonrpc_server\|INFO\|unix#0: disconnecting (removing ordinals database due to server termination) +2018-05-29T14:29:56Z\|00002\|ovsdb_jsonrpc_server\|INFO\|unix#0: disconnecting (removing _Server database due to server termination) This fixes the race. This particular log message isn't too useful since it's pretty obvious that ovsdb-server is closing those sessions, since after all it's exiting! Reported-by: Sanket Sudake <sanket@infracloud.io> Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2018-May/046840.html Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Numan Siddique <nusiddiq@redhat.com>
*	Embrace anonymous unions.	Ben Pfaff	2018-05-25	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Several OVS structs contain embedded named unions, like this: struct { ... union { ... } u; }; C11 standardized a feature that many compilers already implemented anyway, where an embedded union may be unnamed, like this: struct { ... union { ... }; }; This is more convenient because it allows the programmer to omit "u." in many places. OVS already used this feature in several places. This commit embraces it in several others. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org> Tested-by: Alin Gabriel Serdean <aserdean@ovn.org> Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
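The difference between the two forms can be seen in a small C11 sketch. The types and functions here ('enum value_type', 'struct named_value', 'struct anon_value' and the two getters) are hypothetical and are not taken from the OVS tree.

```c
#include <stdint.h>

/* Hypothetical value type, for illustration only (not an OVS struct). */
enum value_type { VALUE_INT, VALUE_REAL };

/* Older style: the embedded union is named 'u'. */
struct named_value {
    enum value_type type;
    union {
        int64_t integer;
        double real;
    } u;
};

/* C11 style: the embedded union is anonymous. */
struct anon_value {
    enum value_type type;
    union {
        int64_t integer;
        double real;
    };
};

static int64_t
named_get_integer(const struct named_value *v)
{
    return v->u.integer;        /* Needs the "u." prefix. */
}

static int64_t
anon_get_integer(const struct anon_value *v)
{
    return v->integer;          /* Union members are accessed directly. */
}
```

Both structs have the same members; only the access path changes, which is why such a conversion mostly amounts to deleting "u." at each use.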
*	ovsdb: Introduce experimental support for clustered databases.	Ben Pfaff	2018-03-24	1	-90/+274
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit adds support for OVSDB clustering via Raft. Please read ovsdb(7) for information on how to set up a clustered database. It is simple and boils down to running "ovsdb-tool create-cluster" on one server and "ovsdb-tool join-cluster" on each of the others and then starting ovsdb-server in the usual way on all of them. Once you have a clustered database, you configure ovn-controller and ovn-northd to use it by pointing them to all of the servers, e.g. where previously you might have said "tcp:1.2.3.4" was the database server, now you say that it is "tcp:1.2.3.4,tcp:5.6.7.8,tcp:9.10.11.12". This also adds support for database clustering to ovs-sandbox. Acked-by: Justin Pettit <jpettit@ovn.org> Tested-by: aginwala <aginwala@asu.edu> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb: Add support for online schema conversion.	Ben Pfaff	2018-03-24	1	-25/+31
\| \| \| \| \| \| \| \| \| \| \|	With this change, "ovsdb-client convert" can be used to convert a database from one schema to another without taking the database offline. This can be useful to minimize downtime for a database during a software upgrade. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org>
*	ovsdb-server: Add new RPC "set_db_change_aware".	Ben Pfaff	2018-03-24	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	The _Server database recently added to ovsdb-server can be used to dump out information about databases, but monitoring updates to _Server is not yet very useful because for historical reasons ovsdb-server drops all of its OVSDB connections whenever databases are added or removed or otherwise change in some major way. It is not a good idea to change this behavior for all clients, because some of them rely on it, but this commit introduces a new RPC that allows clients that understand _Server to suppress the connection-closing behavior. Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb-server: Add support for a built-in _Server database.	Ben Pfaff	2018-03-24	1	-6/+125
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The _Server database is valuable primarily because it provides database clients a way to find out the details of changes to databases, schemas, etc. in a granular, natural way. Until now, the only way that the server could notify clients about these kinds of changes was to close the session; when the client reconnects, it is expected to reassess the server's state. One way to provide this kind of granular information would be to add specific JSON-RPC requests to obtain notifications for different kinds of changes, but since ovsdb-server already provides granular and flexible notification support for databases, using a database for the purpose is convenient and avoids duplicating functionality. Initially this database only reports databases' names and schemas, but when clustering support is added in a later commit it will also report important aspects of clustering and cluster status. Thus, this database also reduces the need to add JSON-RPC calls to retrieve information about new features. Signed-off-by: Ben Pfaff <blp@ovn.org>
*	jsonrpc-server: Separate changing read_only status from reconnecting.	Ben Pfaff	2018-03-24	1	-12/+3
\| \| \| \| \| \| \| \| \| \| \|	The code in jsonrpc-server conflated two different kinds of functionality. It makes sense for the client to be able to change whether a particular server is read-only. It also makes sense for the client to tell a server to reconnect. The code in jsonrpc-server only provided a single function that does both, which is weird. This commit breaks these apart. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org>
*	util: Document and rely on ovs_assert() always evaluating its argument.	Ben Pfaff	2018-02-01	1	-3/+1
\| \| \| \| \| \| \| \| \| \|	The ovs_assert() macro always evaluates its argument, even when NDEBUG is defined so that failure is ignored. This behavior wasn't documented, and thus a lot of code didn't rely on it. This commit documents the behavior and simplifies bits of code that heretofore didn't rely on it. Signed-off-by: Ben Pfaff <blp@ovn.org> Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
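The kind of simplification this contract allows can be sketched with a stand-in macro. The macro below is hypothetical and is not the OVS definition of ovs_assert(); it only mimics the documented behavior, namely that the argument is evaluated even when NDEBUG is defined. take_item() and consume_one() are likewise made up for the example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for an always-evaluating assertion macro.  With
 * NDEBUG defined, the condition is still evaluated but a failure is
 * ignored. */
#ifdef NDEBUG
#define always_eval_assert(CONDITION) ((void) (CONDITION))
#else
#define always_eval_assert(CONDITION)                                   \
    do {                                                                \
        if (!(CONDITION)) {                                             \
            fprintf(stderr, "assertion %s failed\n", #CONDITION);       \
            abort();                                                    \
        }                                                               \
    } while (0)
#endif

/* Hypothetical helper with a side effect: removes one item from a counter
 * and reports whether an item was available. */
static int
take_item(int *n_items)
{
    if (*n_items > 0) {
        --*n_items;
        return 1;
    }
    return 0;
}

/* Because the macro always evaluates its argument, the call with a side
 * effect can sit directly inside it, with no temporary variable. */
static void
consume_one(int *n_items)
{
    always_eval_assert(take_item(n_items));
}
```

With the standard assert() from <assert.h>, the same code would silently skip the take_item() call in NDEBUG builds, which is why code that cannot rely on evaluation has to store the result in a variable first.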
*	ovsdb-server: Forbid user-specified databases with reserved names.	Ben Pfaff	2017-12-22	1	-13/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Names that begin with "_" are reserved, but ovsdb-server didn't previously enforce this. At the same time, make ovsdb-client ignore databases with reserved names for the purpose of selecting a default database to work on. This is in preparation for ovsdb-server starting to serve a new database, full of meta-information, called _Server. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Justin Pettit <jpettit@ovn.org>
*	ovsdb-server: Drop 'txn' member from struct db.	Ben Pfaff	2017-12-19	1	-24/+16
\| \| \| \| \| \| \|	This member was only used in one particular code path, so this commit adds code to pass it around as a function parameter instead. Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb-error: New function ovsdb_error_to_string_free().	Ben Pfaff	2017-12-13	1	-6/+3
\| \| \| \| \| \| \| \|	This allows slight code simplifications across the tree. Signed-off-by: Ben Pfaff <blp@ovn.org> Tested-by: Yifeng Sun <pkusunyifeng@gmail.com> Reviewed-by: Yifeng Sun <pkusunyifeng@gmail.com>
*	lib: Move lib/poll-loop.h to include/openvswitch	Xiao Liang	2017-11-03	1	-1/+1
\| \| \| \| \| \| \| \|	Poll-loop is the core to implement main loop. It should be available in libopenvswitch. Signed-off-by: Xiao Liang <shaw.leon@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb-server: Fix memory leak	Yifeng Sun	2017-11-02	1	-1/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Valgrind testcase 2349 (ovn -- DSCP marking check) reports the leak below: 21 bytes in 21 blocks are definitely lost in loss record 24 of 362 at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) by 0x436FD4: xmalloc (util.c:120) by 0x437044: xmemdup0 (util.c:150) by 0x408C97: add_manager_options (ovsdb-server.c:709) by 0x408C97: query_db_remotes (ovsdb-server.c:765) by 0x408C97: reconfigure_remotes (ovsdb-server.c:926) by 0x406273: main_loop (ovsdb-server.c:194) by 0x406273: main (ovsdb-server.c:434) When options are freed, options->role needs to be freed explicitly. Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb: add support for role-based access controls	Lance Richardson	2017-06-08	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for ovsdb RBAC (role-based access control). This includes: - Support for "RBAC_Role" table. A db schema containing a table by this name will enable role-based access controls using this table for RBAC role configuration. The "RBAC_Role" table has one row per role, with each row having a "name" column (role name) and a "permissions" column (map of table name to UUID of row in separate permission table.) The permission table has one row per access control configuration, with the following columns: "name" - name of table to which this row applies "authorization" - set of column names and column:key pairs to be compared against client ID to determine authorization status "insert_delete" - boolean, true if insertions and authorized deletions are allowed. "update" - Set of columns and column:key pairs for which authorized updates are allowed. - Support for a new "role" column in the remote configuration table. - Logic for applying the RBAC role and permission tables, in combination with session role from the remote connection table and client id, to determine whether operations modifying database contents should be permitted. - Support for specifying RBAC role string as a command-line option to ovsdb-tool (Ben Pfaff). Signed-off-by: Lance Richardson <lrichard@redhat.com> Co-authored-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb: refactor utility functions into separate file	Lance Richardson	2017-05-04	1	-176/+11
\| \| \| \| \| \| \| \| \|	Move local db access functions to a new file and give them global scope so they can be included in the ovsdb library and used by other ovsdb library functions. Signed-off-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: Ben Pfaff <blp@ovn.org>
*	ovsdb-server: Drop unnecessary find_db() function.	Ben Pfaff	2017-03-29	1	-25/+5
\| \| \| \| \| \| \| \| \| \| \|	'all_dbs' maps from a schema name to its struct db, so there's no need to iterate the whole thing to find a database by schema name; instead, just use the shash in the usual way. Also, a few related but simpler changes elsewhere. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Andy Zhou <azhou@ovn.org>
*	ovsdb-server: Fix memory leak in update_remote_status() error path.	Ben Pfaff	2017-03-29	1	-3/+4
\| \| \| \| \| \| \|	Found by inspection. Signed-off-by: Ben Pfaff <blp@ovn.org> Acked-by: Andy Zhou <azhou@ovn.org>
*	ovsdb: Prevent OVSDB server from replicating itself.	Andy Zhou	2017-02-13	1	-7/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An OVSDB server replicating from itself is usually caused by configuration errors. Such configuration errors can lead to OVSDB server data loss. See "reported-at" for more details. This patch adds logic that prevents an OVSDB server from replicating itself. Reported-by: Guishuai Li <ligs@dtdream.com> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/326963.html Suggested-by: Ben Pfaff <blp@ovn.org> Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>
*	ovsdb: Gracefully handle replication errors.	Andy Zhou	2017-02-13	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Sometimes a replication session can fail, mostly due to the replication configuration, e.g. replicating from a database with a different version of the schema. Currently, those errors are treated as fatal and stop the OVSDB server. A better way to handle those errors may be to stop only the replication session and leave the OVSDB server up, so that replication can be restarted later, maybe with a different configuration. Signed-off-by: Andy Zhou <azhou@ovn.org> Acked-by: Ben Pfaff <blp@ovn.org>