Change sets in the OVSDB monitor store all the changes that happened
between a particular transaction ID and now. The initial change set
basically contains all the data.
On each monitor request a new initial change set is created by creating
an empty change set and adding all the database rows. Then it is
converted into a JSON reply and immediately untracked and destroyed.
This is causing significant performance issues if many clients are
requesting new monitors at the same time. For example, that is
happening after database schema conversion, because conversion triggers
cancellation of all monitors. After cancellation, every client sends
a new monitor request. The server then creates a new initial change
set, sends a reply, destroys initial change set and repeats that for
each client. On a system with 200 MB database and 500 clients,
cluster of 3 servers spends 20 minutes replying to all the clients
(200 MB x 500 = 100 GB):
timeval|WARN|Unreasonably long 1201525ms poll interval
Of course, all the clients are already disconnected due to inactivity
at this point. When they are re-connecting back, server accepts new
connections one at a time, so inactivity probes will not be triggered
anymore, but it still takes another 20 minutes to handle all the
incoming connections.
Let's keep the initial change set around for as long as the monitor
itself exists. This will allow us to not construct a new change set
on each new monitor request and even utilize the JSON cache in some
cases. All that at a relatively small maintenance cost, since we'll
need to commit changes to one extra change set on every transaction.
Measured memory usage increase due to keeping around a shallow copy
of a database is about 10%. Measured CPU usage difference during
normal operation is negligible.
With this change it takes only 30 seconds to send out all the monitor
replies in the example above. So, it's a 40x performance improvement.
On a more reasonable setup with 250 nodes, the process takes up to
8-10 seconds instead of 4-5 minutes.
Conditional monitoring will benefit from this change as well, however
the results might be less impressive due to the lack of a JSON cache.
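The core of the change can be sketched with a tiny refcounting model (hypothetical, simplified C; names like monitor_get_initial_cs are illustrative, not the actual OVSDB monitor API): the initial change set is built once and each subsequent monitor request just takes another reference.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified, hypothetical stand-in for an OVSDB monitor change set;
 * the real structures and function names differ. */
struct change_set {
    int refcount;
    int n_rows;              /* rows captured in this set */
};

static int total_builds = 0; /* how many times the set was constructed */

/* Build the initial change set once; later monitor requests just take
 * another reference instead of re-reading the whole database. */
static struct change_set *
monitor_get_initial_cs(struct change_set **cached, int db_rows)
{
    if (!*cached) {
        struct change_set *cs = malloc(sizeof *cs);
        cs->refcount = 1;    /* reference held by the monitor itself */
        cs->n_rows = db_rows;
        total_builds++;
        *cached = cs;
    }
    (*cached)->refcount++;   /* reference for this monitor request */
    return *cached;
}

static void
change_set_unref(struct change_set **cached, struct change_set *cs)
{
    if (--cs->refcount == 0) {
        if (*cached == cs) {
            *cached = NULL;
        }
        free(cs);
    }
}
```

With 500 clients this turns 500 full database scans into one scan plus 500 reference bumps, which is the essence of the reported 40x improvement.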
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The cluster member that initiates the schema conversion converts the
database twice: the first time while verifying the possibility of the
conversion, and the second time after reading the conversion request
back from the storage.
Keep the converted database from the first time around and use it
after reading the request back from the storage. This cuts the
conversion CPU cost in half.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, database schema conversion in the case of a clustered
database produces a transaction record with both the new schema and
the converted database data. So, the sequence of events is the
following:
1. Get the new schema.
2. Convert the database to a new schema.
3. Translate the newly converted database into JSON.
4. Write the schema + data JSON to the storage.
5. Destroy converted version of a database.
6. Read schema + data JSON from the storage and parse.
7. Create a new database from a parsed database data.
8. Replace current database with the new one.
Most of these steps are very computationally expensive. Also,
conversion to/from JSON is much more expensive than direct database
conversion with ovsdb_convert() that can make use of shallow data
copies.
Instead of doing all that, let's make use of the previously introduced
ability to not write the converted data into the storage. The process
will then look like this:
1. Get the new schema.
2. Convert the database to a new schema
(to verify that it is possible).
3. Write the schema to the storage.
4. Destroy converted version of a database.
5. Read the new schema from the storage and parse.
6. Convert the database to a new schema.
7. Replace current database with the new one.
Most of the operations here are performed on the small schema object,
instead of the actual database data. Two remaining data operations
(actual conversion) are noticeably faster than conversion to/from
JSON due to reference counting and shallow data copies.
Steps 4-6 can be optimized later to not convert twice on the
process that initiates the conversion.
The change results in the following performance improvements in the
conversion of the OVN_Southbound database schema from version 20.23.0
to 20.27.0 (measured on a single-server RAFT cluster with no clients):
        |           Before            |           After
--------+---------+-------------------+---------+------------------
DB size | Total   | Max poll interval | Total   | Max poll interval
--------+---------+-------------------+---------+------------------
542 MB  | 47 sec. | 26 sec.           | 15 sec. | 10 sec.
225 MB  | 19 sec. | 10 sec.           |  6 sec. | 4.5 sec.
542 MB database had 19.5 M atoms, 225 MB database had 7.5 M atoms.
Overall performance improvement is about 3x.
Also, note that before this change database conversion basically
doubles the database file on disk. Now it only writes a small
schema JSON.
Since the change requires backward-incompatible database file format
changes, documentation is updated on how to perform an upgrade.
It is handled the same way as we did for the previous incompatible
format change in 2.15 (column diffs).
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-December/052140.html
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If a schema with no data is read from the clustered storage, it
should mean a database conversion request. In general, we can get:
1. Just data --> Transaction record.
2. Schema + Data --> Database conversion or raft snapshot install.
3. Just schema --> Database conversion request (new record type).
We cannot distinguish between a conversion and a snapshot installation
request in the current implementation, so we will keep handling
conversion with data in the same way as before, i.e. if data is
provided, we should use it.
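The three record kinds can be sketched as a small classifier (hypothetical names; the real code inspects the parsed JSON record, not booleans):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classifier for the three record kinds listed above;
 * the real code works on parsed JSON storage records. */
enum record_kind {
    RECORD_TXN,                /* just data: a regular transaction */
    RECORD_CONVERSION,         /* schema + data: conversion or snapshot */
    RECORD_CONVERSION_REQUEST, /* just schema: conversion request */
    RECORD_EMPTY,
};

static enum record_kind
classify_record(bool has_schema, bool has_data)
{
    if (has_schema) {
        return has_data ? RECORD_CONVERSION : RECORD_CONVERSION_REQUEST;
    }
    return has_data ? RECORD_TXN : RECORD_EMPTY;
}
```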
ovsdb-tool is updated to handle this record type as well while
converting a cluster to standalone.
This change doesn't introduce a way for such records to appear in
the database. That will be added in future commits targeting a
conversion speed increase.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Clustered databases do not support ephemeral columns, but ovsdb-server
checks for them after the conversion result is read from the storage.
It's much easier to recover if this constraint is checked before writing
to the storage instead.
It's not a big problem, because the check is always performed by the
native ovsdb clients before sending a conversion request. But the
server, in general, should not trust clients to do the right thing.
The check in update_schema() remains, because we shouldn't blindly
trust the storage.
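A minimal sketch of such a pre-write check (hypothetical, simplified structures; the real code walks the parsed ovsdb_schema tables and their column definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified column descriptor. */
struct column_def {
    const char *name;
    bool persistent;
};

/* Reject a schema containing ephemeral columns before it is written
 * to the clustered storage, returning the offending column name. */
static const char *
find_ephemeral_column(const struct column_def *cols, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (!cols[i].persistent) {
            return cols[i].name;
        }
    }
    return NULL;
}
```

Failing fast here means the bad schema never reaches the raft log, so no recovery from storage is needed.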
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If database conversion happens, both schema and the new data are
present in the database record. However, the schema is just silently
ignored by ovsdb-tool cluster-to-standalone. This creates data
inconsistency if the new data contains new columns, for example, so
the resulting database file will not be readable, or data will be lost.
Fix that by re-setting the database whenever a conversion record is
found and actually writing a new schema that will match the actual
data. The database file will not be that similar to the original,
but there is no way to represent conversion in a standalone database
file format otherwise.
Fixes: 00de46f9ee42 ("ovsdb-tool: Convert clustered db to standalone db.")
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Open vSwitch generally tries to let the underlying operating system
manage the low-level details of hardware, for example DMA mapping,
bus arbitration, etc. However, when using DPDK, the underlying
operating system yields control of many of these details to userspace
for management.
In the case of some DPDK port drivers, configuring rte_flow or even
allocating resources may require access to iopl/ioperm calls, which
are guarded by the CAP_SYS_RAWIO privilege on Linux systems. These
calls are dangerous, and can allow a process to completely compromise
a system. However, they are needed in the case of some userspace
driver code which manages the hardware (for example, the mlx
implementation of backend support for rte_flow).
Here, we create an opt-in flag passed to the command line to allow
this access. We need to do this before ever accessing the database,
because we want to drop all privileges asap, and cannot wait for
a connection to the database to be established and functional before
dropping. There may be distribution specific ways to do capability
management as well (using for example, systemd), but they are not
as universal to the vswitchd as a flag.
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Aaron Conole <aconole@redhat.com>
Acked-by: Flavio Leitner <fbl@sysclose.org>
Acked-by: Gaetan Rivet <gaetanr@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Commit 08e9e5337383 fixed proper initialization of the dns-resolve
module, and made DNS resolution asynchronous.
A side effect of that change revealed a long-standing logic bug
which broke ovsdb-server listener configuration using DNS names.
Previously this worked because the DNS resolution would block,
now that DNS resolution is asynchronous the code before this
change would assume the error from jsonrpc_pstream_open meant
the remote was a specification for an active outgoing
connection, even when that was not the case.
To fix this, a couple of changes were made to socket-util:
1) Pass optional result of dns resolution from inet_parse_passive.
When (re-)configuring listeners that use DNS names, we may need
to know whether the provided connection string is invalid or if
the provided DNS name has finished resolving.
2) Check dns resolution status in inet_open_passive.
If the connection string is valid, and contains a DNS name,
inet_open_passive will now return -EAGAIN if dns resolution
failed. DNS resolution failure may either mean the asynchronous
resolver has not completed yet, or that the name does not resolve.
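The resulting inet_open_passive() behavior can be modeled roughly like this (hypothetical simplification; the real function takes a connection string and socket options):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the outcome described above: a valid
 * connection string whose DNS name has not resolved (yet) now yields
 * -EAGAIN instead of being mistaken for an active connection spec. */
static int
passive_open_status(bool valid_spec, bool has_dns_name, bool resolved)
{
    if (!valid_spec) {
        return -EINVAL;   /* truly malformed remote specification */
    }
    if (has_dns_name && !resolved) {
        return -EAGAIN;   /* retry when the async resolver completes */
    }
    return 0;             /* ready to listen */
}
```

The caller can now tell "invalid remote" apart from "resolution pending" and simply retry the latter on the next reconfiguration pass.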
Reported-at: https://bugs.launchpad.net/bugs/1998781
Fixes: 08e9e5337383 ("ovsdb: raft: Fix inability to read the database with DNS host names.")
Fixes: 771680d96fb6 ("DNS: Add basic support for asynchronous DNS resolving")
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Converting a column involves converting it to JSON and back. These
are heavy operations and completely unnecessary if the column type
didn't change. Most of the time schema changes only add new
columns/tables without changing existing ones at all. Clone the
column instead to save some time.
This will also save time while destroying the original database since
we will only need to reduce reference counters on unchanged datum
objects that were cloned instead of actually freeing them.
Additionally, the column lookup is moved into a separate loop, so we
don't perform a shash lookup for each column of each row.
Testing with 440 MB OVN_Southbound database shows 70% speed up of the
ovsdb_convert() function. Execution time reduced from 15 to 4.4
seconds, 3.5 of which is a post-conversion transaction replay. Overall
time required for the online database conversion reduced from 37 to 25
seconds.
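The fast path can be sketched as follows (hypothetical, simplified datum; the real code clones ovsdb column data with reference counting):

```c
#include <assert.h>

/* Hypothetical sketch: when a column's type did not change between
 * schemas, take a shallow clone (reference bump) instead of
 * converting through JSON. */
struct datum_sketch {
    int refcount;
    int type_id;
};

static int json_roundtrips = 0;

static void
convert_column(struct datum_sketch *d, int new_type_id)
{
    if (d->type_id == new_type_id) {
        d->refcount++;       /* shallow clone: no JSON involved */
        return;
    }
    json_roundtrips++;       /* slow path: to JSON and back */
    d->type_id = new_type_id;
}
```

Since most schema updates leave existing columns untouched, nearly every column takes the refcount path, which is also why destroying the original database gets cheaper.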
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The counter for the number of atoms has to be reset to the number from
the new database, otherwise the value will be incorrect. For example,
this was causing the atom counter to double after online conversion of
a clustered database.
Miscounting may also lead to increased memory consumption by the
transaction history or otherwise too aggressive transaction history
sweep.
Fixes: 317b1bfd7dd3 ("ovsdb: Don't let transaction history grow larger than the database.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
But log at least once even if the value hasn't changed, for
informational purposes.
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
OVSDB creates a separate object for each weak reference in order to
track them, and there can be a significant number of these objects
in the database.
We also had problems with the number of these objects growing out of
bounds recently. So, adding them to the memory report seems to be
a good thing.
Counting them globally to cover all the copied instances in transactions
and the transaction history (even though there should be none).
It's also hard to count them per-database, because weak references
are stored on destination rows and can be destroyed either while
destroying the destination row or while removing the reference from
the source row. Also, not all the involved functions have direct
access to the database object. So, there is no single clear place
where counters should be updated.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb-server allows OVSDB clients to specify the uuid for
row inserts [1]. Both the C IDL client library and the Python
IDL were missing this feature. This patch adds this support.
In C IDL, for each schema table, a new function is generated -
<schema_table>insert_persistent_uuid(txn, uuid) - which can
be used by the clients to persist the uuid.
ovs-vsctl and other derivatives of ctl now support the same
in the generic 'create' command with the option "--id=<UUID>".
In Python IDL, the uuid to persist can be specified in
the Transaction.insert() function.
[1] - a529e3cd1f ("ovsdb-server: Allow OVSDB clients to specify the UUID for inserted rows.")
Acked-by: Adrian Moreno <amorenoz@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Terry Wilson <twilson@redhat.com>
Signed-off-by: Numan Siddique <numans@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When a row is deleted, if the row has weak references to other rows, the
weak reference nodes attached to the destination rows (through
weak->dst_node hmap) are not destroyed.
Deleting weak references is properly handled when a row is modified. The
removed references are taken care of by:
1. assess_weak_refs() figures out the deleted references from the row
and adds them to txn_row->deleted_refs.
2. before commit, ovsdb_txn_update_weak_refs() finds the
destination row for each item in txn_row->deleted_refs (from step 1)
and destroys the corresponding weak references of the destination row.
However, when the row is deleted, step 1 in assess_weak_refs() is
missing. It directly returns without adding the deleted references to
txn_row->deleted_refs. So, the destination nodes will keep those weak
references although the source side of the references is already
deleted. When rows that originate weak references are created
and deleted, more and more such useless weak reference structures
accumulate in memory, and can stay there until the destination rows
are deleted. It is possible that the destination row is never deleted,
and in such a case the ovsdb-server memory keeps growing (although it is
not strictly a memory leak, because the structures are still referenced).
This problem has an impact on applications like the OVN SB DB - the memory
grows very fast in long-running deployments and finally causes OOM.
This patch fixes it by generating deleted_refs for deleted rows in
assess_weak_refs().
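The fix can be sketched like this (hypothetical, simplified structures; the real assess_weak_refs() walks the row's reference lists, not plain counters):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the fix: a deleted row must queue all of its
 * outgoing weak references for destination-side cleanup, just like
 * the modify path does for removed references. */
struct txn_row_sketch {
    size_t n_weak_refs;    /* outgoing weak refs on the old row */
    size_t n_deleted_refs; /* refs queued for destination cleanup */
    int deleted;           /* row removed in this transaction */
};

static void
assess_weak_refs(struct txn_row_sketch *tr)
{
    if (tr->deleted) {
        /* Before the fix this path returned early, so the refs
         * lingered on the destination rows forever. */
        tr->n_deleted_refs += tr->n_weak_refs;
        return;
    }
    /* ... modify path: diff old vs. new references ... */
}
```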
Fixes: 4dbff9f0a685 ("ovsdb: transaction: Incremental reassessment of weak refs.")
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The loops for adding weak refs are quite similar. Abstract them into a
function, which will be used by one more case later. The patch also
changes the txn_row argument to the source row.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The compaction thread is not supposed to change anything in the
database it is working on, since the same data can be accessed by the
main thread at the same time. However, while converting database rows
to JSON objects, strings in the datum will be cloned using
json_clone(), which is a shallow copy, and that will change the
reference counter of the JSON string object. If both the main
thread and the compaction thread clone/destroy the same object
at the same time, we may end up with a broken reference counter
leading to a memory leak or use-after-free.
A new argument is added to the database-to-JSON conversion to prevent
the use of shallow copies from the compaction thread. This way all
the database operations will be truly read-only, avoiding the race.
'ovsdb_atom_to_json' and 'ovsdb_datum_to_json' are more widely used,
so separate variants of these functions are created instead of adding
a new argument, to avoid changing a lot of existing code.
Another solution might be to use atomic reference counters, but that
would require an API/ABI break, because the counter is exposed in
public headers. Also, we cannot easily expose atomic functions, so
we'd need to un-inline reference counting with the associated
performance cost.
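The two conversion modes can be sketched as follows (hypothetical, simplified string object; the real code deals with json structures and their refcounts):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: the main thread may share a string (non-atomic
 * refcount bump), while the compaction thread must deep-copy so it
 * never touches that counter concurrently. */
struct json_str {
    int refcount;
    char *s;
};

static struct json_str *
str_to_json(struct json_str *src, bool allow_shallow)
{
    if (allow_shallow) {
        src->refcount++;      /* shared storage, main thread only */
        return src;
    }
    struct json_str *copy = malloc(sizeof *copy);
    copy->refcount = 1;       /* private storage, safe in the thread */
    size_t len = strlen(src->s) + 1;
    copy->s = malloc(len);
    memcpy(copy->s, src->s, len);
    return copy;
}
```

The deep-copy path costs extra allocations, but it makes the compaction thread genuinely read-only with respect to shared objects.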
Fixes: 3cd2cbd684e0 ("ovsdb: Prepare snapshot JSON in a separate thread.")
Reported-at: https://bugzilla.redhat.com/2133431
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
While creating a new database file and storing a new snapshot
into it, the raft module by mistake updates the base offset for the
old file. So, the base offset of the new file remains zero. Then
the old file gets replaced with the new one, copying the new
offsets as well. In the end, after a full compaction, the base offset
is always zero. And any offset is twice as large as zero. That
triggers a new compaction again at the earliest scheduled time.
In practice this issue triggers compaction every 10-20 minutes
regardless of the database load, after the first one is triggered
by the actual file growth or by the 24h maximum limit.
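The effect of the stale base offset can be sketched with the doubling heuristic (a hypothetical simplification of the raft compaction condition):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the "file has doubled" check: with
 * base_offset mistakenly left at zero, any current offset passes it,
 * so compaction re-triggers at every scheduled opportunity. */
static bool
should_compact(long long base_offset, long long cur_offset)
{
    return cur_offset >= 2 * base_offset;
}
```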
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-August/051977.html
Reported-by: Oleksandr Mykhalskyi <oleksandr.mykhalskyi@netcracker.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Transaction history is used to construct database updates for clients.
But if the row didn't change it will never be used for monitor updates,
because ovsdb_monitor_changes_classify() will always return
OVSDB_CHANGES_NO_EFFECT. So, ovsdb_monitor_history_change_cb()
will never add it to the update.
This condition is very common for rows with references. While
processing strong references in ovsdb_txn_adjust_atom_refs(), the
whole destination row will be cloned into the transaction just to
update the reference counter. If this row is not changed later in
the transaction, it will just stay in that state and will be added
to the transaction history. Since the data didn't change, both 'old'
and 'new' datums will be the same and equal to the one in the database.
So, we're keeping 2 copies of the same row in memory and never
using them. In this case, we should just not add them to the
transaction history in the first place.
This change should save some space in the transaction history in the
case of transactions on rows with a big number of strong references.
This should also speed up the processing since we will not clone
these rows for the transaction history and will not count their atoms.
Testing shows about 5-10% performance improvement in ovn-heater
test scenarios.
The 'n_atoms' counter for the transaction is adjusted to count only
changed rows, so we will have an accurate value for the number of
atoms in the history.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Transaction history is used only to construct row data updates for
clients, it's not used for checking data integrity, hence it doesn't
need a copy of weak references.
Not copying this data saves a lot of CPU cycles and memory in some
cases. For example, in 250-node density-heavy scenario in ovn-heater
these references can take up to 70% of RSS, which is about 8 GB of
essentially wasted memory as reported by valgrind massif:
-------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
-------------------------------------------------------------------------------
20 1,011,495,832,314 11,610,557,104 10,217,785,620 1,392,771,484 0
88.00% (10,217,785,620B) (heap allocation functions) malloc/new/new[]
->70.47% (8,181,819,064B) 0x455372: xcalloc__ (util.c:121)
->70.07% (8,135,785,424B) 0x41609D: ovsdb_weak_ref_clone (row.c:66)
->70.07% (8,135,785,424B) 0x41609D: ovsdb_row_clone (row.c:151)
->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_clone (transaction.c:1124)
| ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_add_to_history (transaction.c:1163)
| ->34.74% (4,034,041,440B) 0x41B7C9: ovsdb_txn_replay_commit (transaction.c:1198)
| ->34.74% (4,034,041,440B) 0x408C35: parse_txn (ovsdb-server.c:633)
| ->34.74% (4,034,041,440B) 0x408C35: read_db (ovsdb-server.c:663)
| ->34.74% (4,034,041,440B) 0x406C9D: main_loop (ovsdb-server.c:238)
| ->34.74% (4,034,041,440B) 0x406C9D: main (ovsdb-server.c:500)
|
->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_clone (transaction.c:1125)
->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_add_to_history (transaction.c:1163)
->34.74% (4,034,041,440B) 0x41B7DE: ovsdb_txn_replay_commit (transaction.c:1198)
->34.74% (4,034,041,440B) 0x408C35: parse_txn (ovsdb-server.c:633)
->34.74% (4,034,041,440B) 0x408C35: read_db (ovsdb-server.c:663)
->34.74% (4,034,041,440B) 0x406C9D: main_loop (ovsdb-server.c:238)
->34.74% (4,034,041,440B) 0x406C9D: main (ovsdb-server.c:500)
Replacing ovsdb_row_clone() with ovsdb_row_datum_clone() to avoid
cloning unnecessary metadata. The ovsdb_txn_clone() function is
renamed to avoid issues if it is re-used in the future for some other
use case.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, openvswitch-devel installs the following header tree:
/usr/include
    /openflow/*.h
    /openvswitch
        /*.h
        /openflow/*.h
        /openvswitch/*.h
        /sparse/*.h
        /lib/*.h
A few issues with that:
1. openflow and openvswitch headers are installed twice. Once in the
main /usr/include and second time in the /usr/include/openvswitch/.
2. For some reason internal headers such as lib/*.h and fairly useless
headers such as sparse/*.h are installed as well.
One more issue is that the current pkg-config files don't work with
builds installed with 'make install', because 'make install' doesn't
create this weird header tree.
While the double install of the same headers is not a huge problem, it
doesn't seem right. Installation of the internal headers is a bigger
issue.
They are not part of API/ABI and we do not provide any stability
guarantees for them. We are making incompatible changes constantly in
minor updates, so users should not rely on these headers.
If it's necessary for some external application to use them, that
application should not link with libopenvswitch dynamically,
and it also can't expect the static library to keep these API/ABI
stable, hence there is no real point in installing them. Such an
application should use OVS as a submodule like OVN does, or otherwise
compile itself against the required version of the OVS sources.
Another option is to properly export and install the required headers.
pkg-config configuration files updated as necessary.
Fixes: 4886d4d2495b ("debian, rhel: Ship ovs shared libraries and header files")
Reviewed-by: David Marchand <david.marchand@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The ovsdb/ directory should not be a dependency, otherwise the man
page gets re-built every time unrelated files are changed.
Fixes: 6f24c2bc769a ("ovsdb: Add Local_Config schema.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Some of the work is already done, e.g. 'diff' file format
and DNS support.
Added more items collected over time including relay and
local_config items.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
As reported by Debian lintian.
Signed-off-by: Frode Nordahl <frode.nordahl@canonical.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Unlike other compilers, MSVC doesn't have pthread_t defined by
default, so the build fails without the header.
Fixes: 3cd2cbd684e0 ("ovsdb: Prepare snapshot JSON in a separate thread.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Conversion of the database data into a JSON object, serialization
and destruction of that object are the heaviest operations
during the database compaction. If these operations are moved
to a separate thread, the main thread can continue processing
database requests in the meantime.
With this change, the compaction is split in 3 phases:
1. Initialization:
- Create a copy of the database.
- Remember current database index.
- Start a separate thread to convert a copy of the database
into serialized JSON object.
2. Wait:
- Continue normal operation until compaction thread is done.
- Meanwhile, compaction thread:
* Convert database copy to JSON.
* Serialize resulted JSON.
* Destroy original JSON object.
3. Finish:
- Destroy the database copy.
- Take the snapshot created by the thread.
- Write on disk.
The key for this scheme to be fast is the ability to create
a shallow copy of the database. This doesn't take too much
time, allowing the thread to do most of the work.
The database copy is created and destroyed only by the main thread,
so there is no need for synchronization.
This solution reduces the time the main thread is blocked
by compaction by 80-90%. For example, in ovn-heater tests
with a 120 node density-heavy scenario, where compaction normally
takes 5-6 seconds at the end of a test, measured compaction
times were all below 1 second with the change applied. Also,
note that these measured times are the sum of phases 1 and 3,
so actual poll intervals are about half a second in this case.
Only implemented for raft storage for now. The implementation
for standalone databases can be added later by using a file
offset as a database index and copying newly added changes
from the old file to a new one during ovsdb_log_replace().
Reported-at: https://bugzilla.redhat.com/2069108
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently ovsdb-server is using shallow copies of some JSON objects
by keeping a reference counter. JSON string objects are also used
directly as ovsdb atoms in database rows to avoid extra copies.
Taking this approach one step further, ovsdb_datum objects can also
be mostly deduplicated by postponing the copy until it is actually
needed. The datum object itself contains a type and 2 pointers to
data arrays. By adding one more pointer to a reference counter,
we may create a shallow copy of the datum by simply copying the type
and pointers and increasing the reference counter.
Before modifying the datum, a special function needs to be called
to perform an actual copy of the object, a.k.a. unshare it.
Most of the datum modifications are performed inside the special
functions in ovsdb-data.c, so that is not very hard to track.
A few places like ovsdb-server.c and column mutations are accessing
and changing the data directly, so a few extra unshare() calls
have to be added there.
This change doesn't affect the maximum memory consumption too much,
because most of the copies are short-lived. However, not actually
performing these copies saves up to 40% of CPU time on operations
with large sets.
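The copy-on-write scheme can be sketched as follows (hypothetical, simplified datum; the real ovsdb_datum holds keys and values arrays):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical copy-on-write sketch: shallow copies share the data
 * array behind a reference counter, and unshare() makes a private
 * copy only right before a modification. */
struct cow_datum {
    int *keys;
    size_t n;
    unsigned int *refcount;   /* shared by all shallow copies */
};

static struct cow_datum
cow_datum_clone(struct cow_datum *d)
{
    ++*d->refcount;           /* same array, one more owner */
    return *d;
}

static void
cow_datum_unshare(struct cow_datum *d)
{
    if (*d->refcount == 1) {
        return;               /* sole owner: modify in place */
    }
    --*d->refcount;
    int *keys = malloc(d->n * sizeof *keys);
    memcpy(keys, d->keys, d->n * sizeof *keys);
    d->keys = keys;
    d->refcount = malloc(sizeof *d->refcount);
    *d->refcount = 1;
}
```

Since most copies are read-only and short-lived, unshare() rarely runs, which is where the CPU savings come from.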
Reported-at: https://bugzilla.redhat.com/2069089
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Memory trimming was introduced in OVS 2.15 and hasn't caused any
issues in production environments since then, while allowing
ovsdb-server to consume a lot less memory in high-scale OVN
deployments. Enable it by default to make it easier to use.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The only way to configure settings on a remote (e.g. inactivity_probe)
is via --remote=db:DB,table,row. There is no way to do this via the
existing CLI options.
For a clustered DB with multiple servers listening on unique addresses
there is no way to store these entries in the DB as the DB is shared.
For example, three servers listening on 1.1.1.1, 1.1.1.2, and 1.1.1.3
respectively would require a Manager/Connection row each, but then
all three servers would try to listen on all three addresses.
It is possible for ovsdb-server to serve multiple databases. This
means that we can have a local "config" database in addition to
the main database we are serving (Open_vSwitch, OVN_Southbound, etc.)
and this patch adds a Local_Config schema that currently just mirrors
the Connection table and a Config table with a 'connections' row that
stores each Connection.
Signed-off-by: Terry Wilson <twilson@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Add a new command, 'ovsdb-server/tlog-set DB:TABLE on|off', which
allows the user to enable/disable transaction logging for specific
databases and tables.
By default, logging is disabled. Once enabled, logs are generated
with level INFO and are also rate limited.
If used with care, this command can be useful in analyzing production
deployment performance issues, allowing the user to pinpoint
bottlenecks without the need to enable wider debug logs, e.g., jsonrpc.
A command to inspect the logging state is also added:
'ovsdb-server/tlog-list'.
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
| |
Found by Coverity.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Acked-by: Mike Pattrick <mkp@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
| |
There's no need to do that because we're not changing the hmap.
Also, if DEBUG logging is disabled there's no need to iterate at
all.
Fixes: 5a9b53a51ec9 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While becoming a follower, the leader aborts all the current
'in-flight' commands, so the higher layers can re-try corresponding
transactions when the new leader is elected. However, most of these
commands are already sent to followers as append requests, hence they
will actually be committed by the majority of the cluster members,
i.e. will be treated as committed by the new leader, unless there is
an actual network problem between servers. Meanwhile, the old leader
will decline append replies, since it's not the leader anymore and
the commands were already completed with RAFT_CMD_LOST_LEADERSHIP
status. The new leader will replicate the commit index back to the
old leader, which will then re-try the previously "failed"
transaction, because "cluster errors" are considered temporary.
If a transaction had some prerequisites that didn't allow double
committing or there are other database constraints (like indexes) that
will not allow a transaction to be committed twice, the server will
reply to the client with a false-negative transaction result.
If there are no prerequisites or additional database constraints,
the server will execute the same transaction again, but as a follower.
E.g. in the OVN case, this may result in creation of duplicated logical
switches / routers / load balancers. I.e. resources with the same
non-indexed name. That may cause issues later where ovn-nbctl will
not be able to add ports to these switches / routers.
The suggested solution is to not complete (abort) the commands, but
to allow them to be completed with the commit index update from the
new leader. This is similar to what we do to complete commands in
the reverse scenario, when a follower becomes the leader. That
scenario was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix
duplicated transaction execution when leader failover.").
Code paths for leader and follower inside raft_update_commit_index
were very similar, so they were refactored into one, since we also
needed the ability to complete more than one command for a follower.
A failure test was added to exercise the leadership transfer
scenario.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
Reported-at: https://bugzilla.redhat.com/2046340
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The implementation of SHA1 in the OpenSSL library is much faster and
optimized for all available CPU architectures and instruction sets.
OVS should use it instead of the internal implementation if possible.
Depending on compiler options, OpenSSL's version finishes our sha1
unit tests 3 to 12 times faster. Performance of OpenSSL's version is
constant, while OVS's implementation highly depends on the compiler.
Interestingly, a default build with '-g -O2' works faster than an
optimized '-march=native -Ofast'.
Tests with ovsdb-server on big databases show a ~5-10% improvement
in the time needed for database compaction (sha1 is only a part of
this operation), depending on compiler options.
We still need the internal implementation, because OpenSSL may not
be available on some platforms. Tests were enhanced to check both
versions of the API.
Reviewed-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
| |
In order to be consistent with the rest of the SAFE loop macros,
overload each of the generated *_SAFE macros with a SHORT version
that does not require the user to provide the NEXT variable.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible for hmap and all its derived types.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
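The argument-count dispatch described above can be sketched as
follows. This is a minimal standalone illustration on a toy linked
list, not the actual OVS macros; all names here are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* LONG variant: the caller provides the NEXT variable. */
#define LIST_FOR_EACH_SAFE_LONG(VAR, NEXT, HEAD)                        \
    for ((VAR) = (HEAD);                                                \
         (VAR) && ((NEXT) = (VAR)->next, 1);                            \
         (VAR) = (NEXT))

/* SHORT variant: declares its own hidden iterator, so VAR may be
 * freed inside the loop body without any extra caller variable. */
#define LIST_FOR_EACH_SAFE_SHORT(VAR, HEAD)                             \
    for (struct node *next_ = (HEAD);                                   \
         ((VAR) = next_) && (next_ = (VAR)->next, 1);)

/* Overload: expand to the LONG or SHORT variant depending on whether
 * the caller passed 3 or 2 arguments.  With 3 arguments, NAME binds
 * to the LONG variant; with 2, it binds to the SHORT one. */
#define SAFE_MACRO_PICK(_1, _2, _3, NAME, ...) NAME
#define LIST_FOR_EACH_SAFE(...)                                         \
    SAFE_MACRO_PICK(__VA_ARGS__,                                        \
                    LIST_FOR_EACH_SAFE_LONG,                            \
                    LIST_FOR_EACH_SAFE_SHORT)(__VA_ARGS__)

static struct node *
push(struct node *head, int value)
{
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = head;
    return n;
}

/* Sums all values while freeing every node, using the SHORT form:
 * no 'next' variable needs to be declared by the caller. */
static int
sum_and_free(struct node *head)
{
    struct node *n;
    int total = 0;

    LIST_FOR_EACH_SAFE (n, head) {
        total += n->value;
        free(n);
    }
    return total;
}
```

The same trick lets both `LIST_FOR_EACH_SAFE(n, next, head)` and
`LIST_FOR_EACH_SAFE(n, head)` coexist under one macro name, so
existing callers don't need to change.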
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clustered OVSDB allows using DNS names as addresses of raft members.
However, if DNS resolution fails during the initial database read,
this causes a fatal failure and exit of the ovsdb-server process.
Also, if DNS name of a joining server is not resolvable for one of the
followers, this follower will reject append requests for a new server
to join until the name is successfully resolved. This makes a follower
effectively non-functional while DNS is unavailable.
To fix the problem, relax the address verification, allowing
validation to pass if only name resolution failed and the address is
valid otherwise. This will allow addresses to be added to the
database, so connections can be established later, when DNS is
available.
Additionally, fix the missed initialization of the dns-resolve
module. Without it, DNS requests are blocking, which causes
unexpected delays at runtime.
Fixes: 771680d96fb6 ("DNS: Add basic support for asynchronous DNS resolving")
Reported-at: https://bugzilla.redhat.com/2055097
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ovsdb_datum_apply_diff_in_place() is much faster than the usual
ovsdb_datum_apply_diff() in most cases, because it doesn't clone or
compare unnecessary data. Since the original destination datum is
destroyed anyway, we might use the faster function here to speed up
transaction processing.
ovsdb_row_update_columns() with xor is mainly used by relay databases.
So, this change should improve their performance.
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Even though relays can be scaled to a big number of servers to
handle a lot more clients, the lack of transaction history may cause
significant load if clients are re-connecting.
E.g. in case of the upgrade of a large-scale OVN deployment, relays
can be taken down one by one forcing all the clients of one relay to
jump to other ones. And all these clients will download the database
from scratch from a new relay.
Since relay itself supports monitor_cond_since connection to the
main cluster, it receives the last transaction id along with each
update. Since these transaction ids are 'eid's of actual transactions,
they can be used by relay for a transaction history.
A relay may not receive all the transaction ids, because the main
cluster may combine several changes into a single monitor update.
However, all relays will likely receive the same updates with the
same transaction ids, so the case where a transaction id cannot be
found after re-connection between relays should not be very common.
If some id is missing on the relay (i.e. the update was merged with
some other update and a newer id was used), the client will just
re-download the database as if there was a normal transaction
history miss.
OVSDB client synchronization module updated to provide the last
transaction id along with the update. Relay module updated to use
these ids as transaction ids. If the ids are zero, the relay decides
that the main server doesn't support transaction ids and disables
the transaction history accordingly.
Using ovsdb_txn_replay_commit() instead of ovsdb_txn_propose_commit_block(),
so transactions are added to the history. This can be done because
relays have no file storage, so there is no need to write anything.
Relay tests modified to test both standalone and clustered database
as a main server. Checks added to ensure that all servers receive the
same transaction ids in monitor updates.
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the joining server re-connects while catching up (e.g. if it crashed
or connection got closed due to inactivity), the data we sent might be
lost, so the server will never reply to append request or a snapshot
installation request. At the same time, leader will decline all the
subsequent requests to join from that server with the 'in progress'
resolution. At this point the new server will never be able to join
the cluster, because it will never receive the raft log while leader
thinks that it was already sent.
This happened in practice when one of the servers got preempted for a
few seconds, so the leader closed connection due to inactivity.
Destroying the joining server if a disconnection is detected. This
will allow the join to start from scratch when the server
re-connects and sends a new join request.
We can't track re-connection in raft_conn_run(), because it's an
incoming connection and the jsonrpc will not keep it alive or try to
reconnect. Next time the server re-connects, it will be an entirely
new raft conn.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Reported-at: https://bugzilla.redhat.com/2033514
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Reported by UndefinedBehaviorSanitizer:
tests/idltest.c:3602:12:
runtime error: member access within null pointer of type
'const struct idltest_simple'
#0 0x4295af in idltest_simple_cursor_first_ge tests/idltest.c:3602
#1 0x41c81b in test_idl_compound_index_single_column tests/test-ovsdb.c:3128
#2 0x41e035 in do_idl_compound_index tests/test-ovsdb.c:3277
#3 0x4cf640 in ovs_cmdl_run_command__ lib/command-line.c:247
#4 0x4cf79f in ovs_cmdl_run_command lib/command-line.c:278
#5 0x4072f7 in main tests/test-ovsdb.c:79
#6 0x7fa858675b74 in __libc_start_main (/lib64/libc.so.6+0x27b74)
#7 0x4060ed in _start (/root/ovs/tests/test-ovsdb+0x4060ed)
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It's actually undefined behavior to pass NULL to standard library
functions that manipulate arrays (e.g., qsort, memcpy, memcmp), even if
the passed number of items is 0.
UB Sanitizer reports:
ovsdb/monitor.c:408:9: runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x406ae1 in ovsdb_monitor_columns_sort ovsdb/monitor.c:408
#1 0x406ae1 in ovsdb_monitor_add ovsdb/monitor.c:1683
[...]
lib/ovsdb-data.c:1970:5: runtime error: null pointer passed as argument 2,
which is declared to never be null
#0 0x4071c8 in ovsdb_datum_push_unsafe lib/ovsdb-data.c:1970
#1 0x471cd0 in ovsdb_datum_apply_diff_in_place lib/ovsdb-data.c:2345
[...]
ofproto/ofproto-dpif-rid.c:159:17:
runtime error: null pointer passed as argument 1,
which is declared to never be null
#0 0x4df5d8 in frozen_state_equal ofproto/ofproto-dpif-rid.c:159
#1 0x4dfd27 in recirc_find_equal ofproto/ofproto-dpif-rid.c:179
[...]
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Paolo Valerio <pvalerio@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
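The class of fix described above boils down to guarding array calls
against empty inputs. A minimal sketch (hypothetical helper, not the
actual OVS code):

```c
#include <string.h>
#include <stddef.h>

/* Even with n == 0, passing NULL to memcpy/memcmp/qsort is undefined
 * behavior per the C standard, so skip the call entirely when there
 * is nothing to copy. */
static void
copy_items(int *dst, const int *src, size_t n)
{
    if (n) {
        memcpy(dst, src, n * sizeof *src);
    }
}
```

Without the guard, `copy_items(NULL, NULL, 0)` would be UB even
though no bytes are touched, which is exactly what the sanitizer
flags.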
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a single transaction exceeds the size of the whole database (e.g.,
a lot of rows got removed and new ones added), transaction history will
be drained. This leads to sending UUID_ZERO to the clients as the last
transaction id in the next monitor update, because monitor doesn't
know what was the actual last transaction id. In case of a re-connect
that will cause re-downloading of the whole database, since the
client's last_id will be out of sync.
One solution would be to store the last transaction ID separately
from the actual transactions, but that will require a careful
management in cases where database gets reset and the history needs
to be cleared. Keeping the one last transaction instead to avoid
the problem. That should not be a big concern in terms of memory
consumption, because this last transaction will be removed from the
history once the next transaction appears. This is also not a
concern for a fast re-sync, because this last transaction will not
be used for the monitor reply: either the client already has it, so
there is no need to send it, or it's a history miss.
The test updated to not check the number of atoms if there is only
one transaction in the history.
Fixes: 317b1bfd7dd3 ("ovsdb: Don't let transaction history grow larger than the database.")
Reported-at: https://bugzilla.redhat.com/2044621
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
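The resulting policy (trim the history by total atom count, but
always keep the newest entry so the last transaction id survives)
can be sketched like this. The structure and function are
hypothetical simplifications, not the actual ovsdb-server code:

```c
#include <stddef.h>

/* Simplified history entry: a transaction id and the number of
 * ovsdb atoms the transaction holds. */
struct txn_entry {
    unsigned long long id;
    size_t n_atoms;
};

/* Entries are ordered oldest first.  Drop oldest entries while the
 * total atom count exceeds the database size (db_atoms), but always
 * keep at least the most recent entry.  Returns the new start index
 * into the array. */
static size_t
history_trim(const struct txn_entry *h, size_t n, size_t db_atoms)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        total += h[i].n_atoms;
    }

    size_t start = 0;
    while (n - start > 1 && total > db_atoms) {
        total -= h[start].n_atoms;
        start++;
    }
    return start;
}
```

Even when a single huge transaction exceeds db_atoms on its own, the
`n - start > 1` condition keeps it, so the last transaction id is
never lost.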
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Snapshots are scheduled for every 10-20 minutes. It's a random value
in this interval for each server. Once the time is up, but the
maximum time (24 hours) is not reached yet, ovsdb will start
checking on every iteration whether the log grew a lot. Once the
growth is detected, compaction is triggered.
OTOH, it's very common for an OVSDB cluster to not have the log
growing very fast. If the log didn't grow 2x in 20 minutes, the
randomness of the initial scheduled time is gone and all the servers
are checking if they need to create a snapshot on every iteration.
And since all of them
are part of the same cluster, their logs are growing with the same
speed. Once the critical mass is reached, all the servers will start
creating snapshots at the same time. If the database is big enough,
that might leave the cluster unresponsive for an extended period of
time (e.g. 10-15 seconds for OVN_Southbound database in a larger scale
OVN deployment) until the compaction completed.
Fix that by re-scheduling a quick retry if the minimal time already
passed. Effectively, this will work as a randomized 1-2 min delay
between checks, so the servers will not synchronize.
Scheduling function updated to not change the upper limit on quick
reschedules to avoid delaying the snapshot creation indefinitely.
Currently quick re-schedules are only used for the error cases, and
there is always a 'slow' re-schedule after the successful compaction.
So, the change of a scheduling function doesn't change the current
behavior much.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
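The scheduling logic described above can be sketched as follows.
Names, constants, and the structure are assumptions for illustration,
not the actual OVS code; the key property is that a "quick"
reschedule re-randomizes the next check without moving the hard
24-hour deadline:

```c
#include <stdbool.h>

struct snapshot_sched {
    long long next_check;  /* When to check the log size again (ms). */
    long long hard_limit;  /* Absolute deadline for a snapshot (ms). */
};

#define SLOW_BASE_MS  (10LL * 60 * 1000)       /* 10 minutes. */
#define QUICK_BASE_MS (60LL * 1000)            /* 1 minute.   */
#define HARD_LIMIT_MS (24LL * 60 * 60 * 1000)  /* 24 hours.   */

/* Pick the next check time in [base, 2 * base) using a caller-provided
 * random value, so cluster members drift apart again after every
 * reschedule.  Only a slow (post-snapshot) reschedule resets the hard
 * limit; a quick reschedule after a skipped snapshot must not, or the
 * snapshot could be delayed indefinitely. */
static void
schedule_next(struct snapshot_sched *s, long long now,
              long long random_ms, bool quick)
{
    long long base = quick ? QUICK_BASE_MS : SLOW_BASE_MS;

    s->next_check = now + base + random_ms % base;
    if (!quick) {
        s->hard_limit = now + HARD_LIMIT_MS;
    }
}
```

With quick reschedules landing 1-2 minutes apart per server, the
servers stop checking in lockstep and compactions no longer line up
across the cluster.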
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 3c2d6274bcee ("raft: Transfer leadership before creating
snapshots.") made it such that raft leaders transfer leadership before
snapshotting. However, there's still the case when the next leader to
be is in the process of snapshotting. To avoid delays in that case too,
we now explicitly allow snapshots only on followers. Cluster members
will have to wait until the current election is settled before
snapshotting.
Given the following logs taken from an OVN_Southbound 3-server cluster
during a scale test:
S1 (old leader):
19:07:51.226Z|raft|INFO|Transferring leadership to write a snapshot.
19:08:03.830Z|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
19:08:03.940Z|raft|INFO|server 8b8d is leader for term 43
S2 (follower):
19:08:00.870Z|raft|INFO|server 8b8d is leader for term 43
S3 (new leader):
19:07:51.242Z|raft|INFO|received leadership transfer from f5c9 in term 42
19:07:51.244Z|raft|INFO|term 43: starting election
19:08:00.805Z|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
19:08:00.869Z|raft|INFO|term 43: elected leader by 2+ of 3 servers
We see that the leader to be (S3) receives the leadership transfer,
initiates the election and immediately after starts a snapshot that
takes ~9.5 seconds. During this time, S2 votes for S3 electing it
as cluster leader but S3 doesn't effectively become leader until it
finishes snapshotting, essentially keeping the cluster without a
leader for up to ~9.5 seconds.
With the current change, S3 will delay compaction and snapshotting until
the election is finished.
The only exception is the case of single-node clusters for which we
allow the node to snapshot regardless of role.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ovsdb_atom_string and json_string are basically the same data structure
and ovsdb-server frequently needs to convert one to another. We can
avoid that by using json_string from the beginning for all ovsdb
strings. So, the conversion turns into simple json_clone(), i.e.
increment of a reference counter. This change gives a moderate
performance boost in some scenarios, improves the code clarity and
may be useful for future development.
Acked-by: Mike Pattrick <mkp@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If user frequently changes a lot of rows in a database, transaction
history could grow way larger than the database itself. This wastes
a lot of memory and also makes monitor_cond_since slower than
usual monitor_cond if the transaction id is old enough, because
re-construction of the changes from a history is slower than just
creation of initial database snapshot. This is also the case if
user deleted a lot of data, so transaction history still holds all of
it while the database itself doesn't.
In case of current lb-per-service model in ovn-kubernetes, each
load-balancer is added to every logical switch/router. Such a
transaction touches more than half of an OVN_Northbound database.
And each of these transactions is added to the transaction history.
Since transaction history depth is 100, in worst case scenario,
it will hold 100 copies of a database increasing memory consumption
dramatically. In tests with 3000 LBs and 120 LSs, memory goes up
to 3 GB, while holding at 30 MB if transaction history is disabled
in the code.
Fixing that by keeping count of the number of ovsdb_atom's in the
database and not allowing the total number of atoms in transaction
history to grow larger than this value. Counting atoms is fairly
cheap because we don't need to iterate over them, so it doesn't have
significant performance impact. It would be ideal to measure the
size of individual atoms, but that would hurt performance.
Counting cells instead of atoms is not sufficient, because OVN
users are adding hundreds or thousands of atoms to a single cell,
so they are largely different in size.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The main idea is to not store the list of weak references in the
source row, so they don't all need to be re-checked/updated on every
modification of that source row. The point is that the source row
already knows the UUIDs of all destination rows stored in the data,
so there is not much benefit in storing this information somewhere
else. If needed, the destination row can be looked up, and the
reference can be looked up in the destination row. For fast lookup,
the destination row now stores references in a hash map.
The weak reference structure now contains the table and uuid of a
source row instead of a direct pointer. This allows replacing or
updating the source row without breaking any weak references stored
in destination rows.
The structure now also contains the key-value pair of atoms that
triggered creation of this reference. These atoms can be used to
quickly subtract removed references from a source row. During
reassessment, ovsdb now only needs to care about newly added or
removed atoms, and about atoms that got removed due to removal of
the destination rows, but these are marked for reassessment by the
destination row.
ovsdb_datum_subtract() is used to remove atoms that point to removed
or incorrect rows, so there is no need to re-sort the datum in the
end.
Results of an OVN load-balancer benchmark that adds 3K load-balancers
to each of 120 logical switches and 120 logical routers in the OVN
sandbox with clustered Northbound database and then removes them:
Before:
%CPU CPU Time CMD
86.8 00:16:05 ovsdb-server nb1.db
44.1 00:08:11 ovsdb-server nb2.db
43.2 00:08:00 ovsdb-server nb3.db
After:
%CPU CPU Time CMD
54.9 00:02:58 ovsdb-server nb1.db
33.3 00:01:48 ovsdb-server nb2.db
32.2 00:01:44 ovsdb-server nb3.db
So, on a cluster leader the processing time dropped by 5.4x, and on
followers by 4.5x. The more load-balancers, the larger the
performance difference. There is a slight increase in memory usage,
because the new reference structure is larger, but the difference is
not significant.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
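The shape of the reversed reference tracking can be sketched as
follows. These are hypothetical simplified structures, not the real
OVS ones (which use real 128-bit uuids and a hash map instead of a
linear scan):

```c
#include <string.h>
#include <stddef.h>

/* A weak reference lives in the *destination* row and identifies its
 * source row by table name + row uuid instead of a direct pointer,
 * so the source row can be replaced without breaking the reference.
 * It also remembers the atom that created it, so removed references
 * can be quickly subtracted from the source datum. */
struct weak_ref {
    const char *src_table;
    unsigned long long src_uuid;  /* Stand-in for a real 128-bit uuid. */
    int key_atom;                 /* Atom that triggered this ref. */
};

/* Find the incoming reference from a given source row; the real code
 * would use a hash map keyed on (table, uuid) for O(1) lookup. */
static const struct weak_ref *
find_incoming_ref(const struct weak_ref *refs, size_t n,
                  const char *table, unsigned long long uuid)
{
    for (size_t i = 0; i < n; i++) {
        if (refs[i].src_uuid == uuid
            && !strcmp(refs[i].src_table, table)) {
            return &refs[i];
        }
    }
    return NULL;
}
```

Because the source row is identified by (table, uuid) rather than a
pointer, replacing the source row during a transaction leaves every
stored reference valid.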
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ovsdb-server spends a lot of time cloning atoms for various reasons,
e.g. to create a diff of two rows or to clone a row to the transaction.
All atoms, except for strings, contain a simple value that can be
copied efficiently, but duplicating strings every time has a
significant performance impact.
Introducing a new reference-counted structure 'ovsdb_atom_string'
that avoids copying strings every time by just increasing a
reference counter.
This change increases transaction throughput in benchmarks up to 2x
for standalone databases and 3x for clustered databases, i.e., the
number of transactions that ovsdb-server can handle per second.
It also noticeably reduces memory consumption of ovsdb-server.
Next step will be to consolidate this structure with json strings,
so we will not need to duplicate strings while converting database
objects to json and back.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
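A reference-counted string in the spirit of the structure described
above can be sketched as follows; the field and function names are
assumptions for illustration, not the actual OVS definitions:

```c
#include <stdlib.h>
#include <string.h>

struct refstr {
    size_t n_refs;   /* How many owners share this string. */
    char *string;    /* The shared, heap-allocated contents. */
};

static struct refstr *
refstr_create(const char *s)
{
    struct refstr *rs = malloc(sizeof *rs);
    size_t len = strlen(s) + 1;

    rs->n_refs = 1;
    rs->string = malloc(len);
    memcpy(rs->string, s, len);
    return rs;
}

/* "Cloning" is just bumping the counter: no string copy at all. */
static struct refstr *
refstr_clone(struct refstr *rs)
{
    rs->n_refs++;
    return rs;
}

/* The string is freed only when the last owner lets go. */
static void
refstr_unref(struct refstr *rs)
{
    if (!--rs->n_refs) {
        free(rs->string);
        free(rs);
    }
}
```

Cloning a row's string atoms then costs one increment per atom
instead of one allocation plus copy, which is where the throughput
gain comes from.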
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ovsdb_datum_apply_diff() is heavily used in ovsdb transactions, but
it's linear in terms of the number of comparisons, and it also
clones all the atoms along the way. In most cases the size of a diff
is much smaller than the size of the original datum, which allows
performing the same operation in place with only
O(diff->n * log2(old->n)) comparisons and O(old->n + diff->n) memory
copies with memcpy.
Using this function while applying diffs read from the storage gives
a significant performance boost and allows executing many more
transactions per second.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
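The idea can be sketched on sorted int arrays (the real code works on
ovsdb atoms and datums, this standalone function is only an
illustration): each diff element toggles membership in the sorted
array, located by binary search, with memmove shifting elements
instead of rebuilding the whole array:

```c
#include <string.h>
#include <stddef.h>

/* Applies a sorted 'diff' to sorted 'data' in place: each diff value
 * is removed if present, inserted if absent (set-xor semantics).
 * The caller must ensure 'data' has room for n + n_diff elements.
 * Returns the new element count. */
static size_t
apply_diff_in_place(int *data, size_t n, const int *diff, size_t n_diff)
{
    for (size_t d = 0; d < n_diff; d++) {
        size_t lo = 0, hi = n;

        while (lo < hi) {                 /* Binary search: O(log n). */
            size_t mid = (lo + hi) / 2;
            if (data[mid] < diff[d]) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        if (lo < n && data[lo] == diff[d]) {
            /* Present: the diff removes it. */
            memmove(&data[lo], &data[lo + 1],
                    (n - lo - 1) * sizeof *data);
            n--;
        } else {
            /* Absent: the diff inserts it, keeping the array sorted. */
            memmove(&data[lo + 1], &data[lo], (n - lo) * sizeof *data);
            data[lo] = diff[d];
            n++;
        }
    }
    return n;
}
```

Compared to cloning the datum, merging it with the diff, and sorting
the result, this touches only the elements after each insertion or
removal point and performs no per-atom allocation.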
|