ovsdb-server spends a lot of time cloning atoms for various reasons,
e.g. to create a diff of two rows or to clone a row to the transaction.
All atoms, except for strings, contain a simple value that can be
copied efficiently, but duplicating strings every time has a
significant performance impact.

Introduce a new reference-counted structure, 'ovsdb_atom_string', so
that strings no longer need to be copied every time; only a reference
counter is increased.

This change increases transaction throughput in benchmarks by up to
2x for standalone databases and 3x for clustered databases, i.e. the
number of transactions that ovsdb-server can handle per second.
It also noticeably reduces the memory consumption of ovsdb-server.

The next step will be to consolidate this structure with json strings,
so strings will not need to be duplicated while converting database
objects to json and back.
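
A minimal C sketch of the idea (the actual layout and helper names in
lib/ovsdb-data.h may differ):

    #include <stdlib.h>

    struct ovsdb_atom_string {
        char *string;     /* The string itself. */
        size_t n_refs;    /* Reference count. */
    };

    /* "Cloning" becomes a counter increment instead of an xstrdup(). */
    static struct ovsdb_atom_string *
    ovsdb_atom_string_clone(struct ovsdb_atom_string *s)
    {
        s->n_refs++;
        return s;
    }

    /* The string is freed only when the last reference goes away. */
    static void
    ovsdb_atom_string_unref(struct ovsdb_atom_string *s)
    {
        if (s && !--s->n_refs) {
            free(s->string);
            free(s);
        }
    }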
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
ovsdb_datum_apply_diff() is heavily used in ovsdb transactions, but
it is linear in the number of comparisons, and it also clones all the
atoms along the way. In most cases the size of a diff is much smaller
than the size of the original datum, so the same operation can be
performed in place with only O(diff->n * log2(old->n)) comparisons
and O(old->n + diff->n) memory copies with memcpy.

Using this function while applying diffs read from the storage gives
a significant performance boost and allows many more transactions per
second to be executed.
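
A simplified, self-contained C model of the in-place idea, using
sorted int arrays and OVSDB's symmetric-difference semantics for set
diffs (an element present in both 'a' and 'diff' is removed, an
element only in 'diff' is inserted); the real code in lib/ovsdb-data.c
works on 'union ovsdb_atom' and handles maps as well:

    #include <stdlib.h>
    #include <string.h>

    static size_t
    apply_diff(const int *a, size_t n_a,
               const int *diff, size_t n_diff, int *out)
    {
        size_t i = 0, n_out = 0;

        for (size_t j = 0; j < n_diff; j++) {
            /* Binary search for diff[j] in the remaining part of 'a':
             * O(log2(n_a)) comparisons per diff atom. */
            size_t lo = i, hi = n_a;
            while (lo < hi) {
                size_t mid = (lo + hi) / 2;
                if (a[mid] < diff[j]) {
                    lo = mid + 1;
                } else {
                    hi = mid;
                }
            }
            /* Bulk-copy everything before the found position. */
            memcpy(out + n_out, a + i, (lo - i) * sizeof *a);
            n_out += lo - i;
            i = lo;
            if (i < n_a && a[i] == diff[j]) {
                i++;                        /* In both: remove. */
            } else {
                out[n_out++] = diff[j];     /* Only in diff: insert. */
            }
        }
        memcpy(out + n_out, a + i, (n_a - i) * sizeof *a);   /* Tail. */
        return n_out + (n_a - i);
    }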
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
The current algorithm of ovsdb_datum_union() looks like this:

    for-each atom in b:
        if not bin_search(a, atom):
            push(a, clone(atom))
    quicksort(a)

So the complexity looks like this:

    Nb * log2(Na)  +  Nb      +  (Na + Nb) * log2(Na + Nb)
    comparisons       clones     comparisons for quicksort
    for search
ovsdb_datum_union() is heavily used in database transactions when a
new element is added to a set, for example, when a new logical switch
port is added to a logical switch in OVN. This is a very common use
case where a CMS adds one new port to an existing switch that already
has, let's say, 100 ports. For this case ovsdb-server will have to
perform:

    1 * log2(100)   +  1 clone  +  101 * log2(101)
    ~7 comparisons                 ~707 comparisons
    for search                     for quicksort

Roughly 714 comparisons of atoms and 1 clone.
Since binary search can give us the position where the new atom
should go (it's the 'low' index after the search completes) for free,
the logic can be re-worked like this:

    copied = 0
    for-each atom in b:
        desired_position = bin_search(a, atom)
        push(result, a[ copied : desired_position - 1 ])
        copied = desired_position
        push(result, clone(atom))
    push(result, a[ copied : Na ])
    swap(a, result)

The complexity of this scheme:

    Nb * log2(Na)  +  Nb      +  Na
    comparisons       clones     memory copies on push
    for search

'swap' is just a swap of a few pointers.  'push' is not a 'clone',
but a simple memory copy of 'union ovsdb_atom'.
In general, this scheme substitutes the complexity of a quicksort
with the complexity of a memory copy of Na atom structures, where
we're not even copying the strings that these atoms point to.
The complexity in the example above goes down from 714 comparisons
to 7 comparisons and a memcpy of 100 * sizeof(union ovsdb_atom) bytes.

The complexity of a memory copy should always be lower than the
complexity of a quicksort, especially because these copies are
usually performed in bulk, so this new scheme should work faster for
any input.

All in all, this change allows several times more transactions per
second to be executed for transactions that add new entries to sets.

Alternatively, the union could be implemented as a linear merge of
two sorted arrays, but this would result in O(Na) comparisons, which
is more than Nb * log2(Na) in the common case, since Na is usually
far bigger than Nb.  A linear merge would also mean per-atom memory
copies instead of copying in bulk.

The 'replace' functionality of ovsdb_datum_union() had no users, so
it is simply removed.  It can easily be added back if needed in the
future.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Currently, even if only one reference is added to the set of strong
references or removed from it, ovsdb-server will walk through the
whole set and re-count references to other rows. These referenced
rows will also be added to the transaction in order to re-count their
references. For example, every time a Logical Switch Port is added to
a Logical Switch, the OVN Northbound database server will walk
through all the ports of this Logical Switch, clone their rows, and
re-count references. This is not very efficient. Instead, it could
simply increase the reference counters for added references and
decrease them for removed ones. In many cases only one row in the
Logical_Switch_Port table will be affected.

Introduce a new function that generates a diff of two datum objects,
but stores added and removed atoms separately, so they can be used to
increase or decrease row reference counters accordingly.

This change allows several times more transactions per second that
add or remove strong references to/from sets, because ovsdb-server no
longer clones and re-counts rows that are irrelevant to the current
transaction.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The same json from the json cache is typically sent to all clients,
e.g., in case of an OVN deployment with ovn-monitor-all=true.
There could be hundreds or thousands of connected clients, and ovsdb
will serialize the same json object for each of them before sending.

Serialize it once before storing it into the json cache to speed up
processing.

This change saves a lot of CPU cycles and a bit of memory, since only
a string and not the full json object needs to be kept in memory.
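
A small C sketch of the idea, using the real json_to_string() and
json_destroy() calls from the json library (compose_monitor_update()
is a hypothetical stand-in for however the cached update is built):

    /* Build the update once, render it once, drop the json tree;
     * the cache now holds only the string. */
    struct json *update = compose_monitor_update();
    char *serialized = json_to_string(update, 0);
    json_destroy(update);

    /* Later, for each of the N connected clients, send 'serialized'
     * as-is instead of serializing 'update' N times. */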
Testing with ovn-heater on 120 nodes using the density-heavy scenario
shows a reduction of the total CPU time used by Southbound DB
processes from 256 to 147 minutes. The duration of unreasonably long
poll intervals was also reduced dramatically, from 7 to 2 seconds
(values below in milliseconds):
             Count   Min    Max    Median   Mean     95 percentile
    --------------------------------------------------------------
    Before   1934    1012   7480   4302.5   4875.3   7034.3
    After    1909    1004   2730   1453.0   1532.5   2053.6
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Raft log entries (and raft database snapshots) contain json objects
of the data. A follower receives append requests with data that gets
parsed and added to the raft log. A leader receives execution
requests, parses the data out of them, and adds it to the log. In
both cases, ovsdb-server later reads the log with
ovsdb_storage_read(), constructs a transaction, and updates the
database. On followers these json objects are, in the common case,
never used again. A leader may use them to send append requests or
snapshot installation requests to followers. However, all these
operations (except for ovsdb_storage_read()) just serialize the json
in order to send it over the network.

Json objects are significantly larger than their serialized string
representation. For example, the snapshot of the database from one of
the ovn-heater scale tests takes 270 MB as a string, but 1.6 GB as a
json object, out of the total 3.8 GB consumed by the ovsdb-server
process.

ovsdb_storage_read() for a given raft entry happens only once in its
lifetime, so after this call we can serialize the json object, store
the string representation, and free the actual json object that ovsdb
will never need again. This can save a lot of memory and can also
save serialization time, because each raft entry for append requests
and snapshot installation requests is serialized only once instead of
every time such a request needs to be sent.

JSON_SERIALIZED_OBJECT can be used in order to seamlessly integrate
pre-serialized data into raft_header and similar json objects.
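
A sketch of how a raft entry could be converted after its only read
(json_serialized_object_create() is assumed to be the constructor
behind JSON_SERIALIZED_OBJECT, and 'e->data' is an illustrative field
name for the entry's json):

    /* After ovsdb_storage_read() has consumed the entry, keep only
     * the compact pre-serialized form. */
    struct json *serialized = json_serialized_object_create(e->data);
    json_destroy(e->data);    /* Free the big parsed json tree. */
    e->data = serialized;     /* Reused as-is for append requests. */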
One major special case is the creation of a database snapshot.
A snapshot installation request received over the network will be
parsed and read by ovsdb-server just like any other raft log entry.
However, snapshots created locally with raft_store_snapshot() will
never be read back, because they reflect the current state of the
database and hence are already applied. For this case we can free the
json object right after writing the snapshot to disk.

Tests performed with ovn-heater on a 60-node density-light scenario,
where the on-disk database grows up to 97 MB, show that the average
memory consumption of ovsdb-server Southbound DB processes decreased
by 58% (from 602 MB to 256 MB per process) and peak memory
consumption decreased by 40% (from 1288 MB to 771 MB).

A test with 120 nodes on the density-heavy scenario with a 270 MB
on-disk database shows the expected 1.5 GB decrease in memory
consumption. Also, the total CPU time consumed by the Southbound DB
process was reduced from 296 to 256 minutes, and the number of
unreasonably long poll intervals dropped from 2896 to 1934.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This patch adds two new APIs to the ovsdb-idl client library,
ovsdb_idl_server_has_table() and ovsdb_idl_server_has_column(), to
query whether a table or a column is present in the IDL. This patch
also adds IDL helper functions, auto-generated from the schema, which
make this easier for clients.

These APIs are required for scenarios where the server schema is old
and missing a table or column, and a client (built with a new schema
version) issues a transaction against the missing table or column.
This results in a continuous loop of transaction failures.
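
A hedged usage sketch; the exact signatures are assumed to take the
IDL and the generated table/column metadata, and the 'foorec' names
are placeholders for generated schema code:

    /* Skip a transaction that the server cannot possibly accept,
     * instead of retrying it forever. */
    if (!ovsdb_idl_server_has_table(idl, &foorec_table_bar)
        || !ovsdb_idl_server_has_column(idl, &foorec_bar_col_baz)) {
        return;   /* Server schema is too old; don't build the txn. */
    }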
Related-Bug: https://bugzilla.redhat.com/show_bug.cgi?id=1992705
Signed-off-by: Numan Siddique <nusiddiq@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
'sent_node' is initialized to all zeroes by xzalloc(), but
HMAP_NODE_NULL is not all zeroes. hmap_node_is_null() is used to
detect whether the node is valid, but on such a node this check gives
the wrong answer, causing a segmentation fault on the attempt to
remove the non-existent node from the hash map. This can happen if a
client disconnects while the transaction has not yet been forwarded
to the relay source:
    Program terminated with signal 11, Segmentation fault.
    #0  hmap_remove at include/openvswitch/hmap.h:293
    293        while (*bucket != node) {
    (gdb) bt
    #0  hmap_remove at include/openvswitch/hmap.h:293
    #1  ovsdb_txn_forward_unlist at ovsdb/transaction-forward.c:67
    #2  ovsdb_txn_forward_destroy at ovsdb/transaction-forward.c:79
    #3  ovsdb_trigger_destroy at ovsdb/trigger.c:70
    #4  ovsdb_jsonrpc_trigger_complete at ovsdb/jsonrpc-server.c:1192
    #5  ovsdb_jsonrpc_trigger_remove__ at ovsdb/jsonrpc-server.c:1204
    #6  ovsdb_jsonrpc_trigger_complete_all at ovsdb/jsonrpc-server.c:1223
    #7  ovsdb_jsonrpc_session_run at ovsdb/jsonrpc-server.c:546
    #8  ovsdb_jsonrpc_session_run_all at ovsdb/jsonrpc-server.c:591
    #9  ovsdb_jsonrpc_server_run at ovsdb/jsonrpc-server.c:406
    #10 main_loop
    (gdb) print db->txn_forward_sent
    $20 = {buckets = 0x..., one = 0x0, mask = 63, n = 0}
    (gdb) print txn_fwd->sent_node
    $24 = {hash = 0, next = 0x0}
Fix that by correctly initializing 'sent_node'.
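
A sketch of the fix using the real helpers from openvswitch/hmap.h
('txn_fwd', 'sent_node', and 'txn_forward_sent' are the names from
the backtrace above):

    /* xzalloc() leaves 'sent_node' all-zeroes, which is NOT
     * HMAP_NODE_NULL, so nullify it explicitly on creation. */
    struct ovsdb_txn_forward *txn_fwd = xzalloc(sizeof *txn_fwd);
    hmap_node_nullify(&txn_fwd->sent_node);

    /* On destroy, remove the node only if it was actually inserted. */
    if (!hmap_node_is_null(&txn_fwd->sent_node)) {
        hmap_remove(&db->txn_forward_sent, &txn_fwd->sent_node);
    }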
Reported-by: Wentao Jia <wentao.jia@easystack.cn>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2021-August/051354.html
Fixes: 7964ffe7d2bf ("ovsdb: relay: Add support for transaction forwarding.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Main documentation for the service model, and a tutorial with the use
case and configuration examples.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Clients need to re-connect from a relay that has no connection to the
database source. Also, from the consistency point of view, a relay
acts similarly to a follower in the clustered model, so it is not
suitable for leader-only connections.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It might be important for clients to know that a relay lost its
connection to the relay remote, so that they can re-connect to
another relay.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The current version of ovsdb relay makes it possible to scale out
read-only access to the primary database. However, many clients, for
example ovn-controller, are not read-only but read-mostly.

In order to scale out database access for this case, ovsdb-server
needs to process transactions that are not read-only. A relay is not
allowed to do that, i.e. not allowed to modify the database, but it
can act like a proxy: forward transactions that include database
modifications to the primary server and forward the replies back to
the client. At the same time it may serve read-only transactions and
monitor requests by itself, greatly reducing the load on the primary
server.

This configuration will slightly increase transaction latency, but
that is not very important for read-mostly use cases.

Implementation details:
With this change, instead of creating a trigger to commit the
transaction, ovsdb-server will create a trigger for transaction
forwarding. Later, ovsdb_relay_run() will send all new transactions
to the relay source. Once a transaction reply is received from the
relay source, the ovsdb-relay module will update the state of the
transaction forwarding with the reply. After that, trigger_run() will
complete the trigger and jsonrpc_server_run() will send the reply
back to the client. Since the transaction reply from the relay source
is received after all the updates, the client will receive all the
updates before the transaction reply, as in a normal scenario with
other database models.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A new database service model, 'relay', that is needed to scale out
read-mostly database access, e.g. ovn-controller connections to
OVN_Southbound.

In this service model, ovsdb-server connects to an existing OVSDB
server and maintains an in-memory copy of the database. It serves
read-only transactions and monitor requests on its own, but forwards
write transactions to the relay source.

Key differences from active-backup replication:
- Support for "write" transactions (next commit).
- No on-disk storage (probably faster operation).
- Support for multiple remotes (connect to a clustered db).
- Doesn't try to keep the connection alive as long as possible, but
  quickly reconnects to other remotes to avoid missing updates.
- No need to know the complete database schema beforehand, only the
  schema name.
- Can be used along with other standalone and clustered databases by
  the same ovsdb-server process (doesn't turn the whole jsonrpc
  server into read-only mode).
- Supports the modern version of monitors (monitor_cond_since),
  because it is based on ovsdb-cs.
- Can be chained, i.e. multiple relays can be connected one to
  another in a row or in a tree-like form.
- Doesn't increase availability.
- Cannot be converted to other service models or become a main
  active server.

Some performance test results can be found here:
https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385825.html
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
This will be used to apply update3-type updates to ovsdb tables while
processing updates for the future ovsdb 'relay' service model.

'ovsdb_datum_apply_diff' is allowed to fail, so add support for
returning this error.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
These functions will be used later for the ovsdb 'relay' service
model, so move them to common code.

Warnings are translated to ovsdb errors; the caller in replication.c
only printed inconsistency warnings and mostly ignored them. The same
logic is implemented by checking the error tag.

Also, ovsdb_execute_insert() previously printed an incorrect warning
about a duplicate row when the actual problem was a syntax error in
the json. Fix that by actually checking for a duplicate and reporting
the correct ovsdb error.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb_create() requires a schema or storage to be nonnull; in
practice, it requires a schema name or a storage name to use as the
database name. Only clustered storage has a name, which means that
only a clustered database can be created without a schema. Change
that by allowing unbacked storage to have a name, so that a database
with unbacked storage can be created without a schema. This will be
used in the next commits to create databases for the ovsdb 'relay'
service model.
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If there are completed triggers, the jsonrpc server should wake up
and update clients with the new data, but there is no such condition
in ovsdb_jsonrpc_session_wait(). For some reason this doesn't result
in any processing delays in the current code, probably because in
this case there are always some other types of events that wake the
ovsdb server up. But it will become a problem in the upcoming ovsdb
'relay' service model, because triggers can be completed from a
different place, i.e. after receiving a transaction reply from the
relay source.

Fix that by waking up ovsdb-server in case there are completed
triggers that need to be handled.
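
A minimal sketch of the fix ('s->up.completions' is assumed to be the
session's list of completed triggers, as elsewhere in ovsdb/;
poll_immediate_wake() is the real poll-loop call):

    static void
    ovsdb_jsonrpc_session_wait(struct ovsdb_jsonrpc_session *s)
    {
        /* ... existing jsonrpc and stream wait calls ... */

        /* Wake up immediately if completed triggers are pending, so
         * their replies are sent without waiting for another event. */
        if (!ovs_list_is_empty(&s->up.completions)) {
            poll_immediate_wake();
        }
    }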
Acked-by: Mark D. Gray <mark.d.gray@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
The json returned by raft_entry_to_json() must be freed.
Found by Coverity.
Signed-off-by: linhuang <linhuang@ruijie.com.cn>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Remove the comma character by using the ' and " characters.
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
Fixes: 1ca0323e7c29 ("Require Python 3 and remove support for Python 2.")
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1949875
Signed-off-by: Rosemarie O'Riorden <roriorde@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
This is primarily to be able to test the recording of client
connections. A unit test is added accordingly.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
The current version of the replay engine doesn't correctly handle
internal time-based events that end up in stream events. For example,
updates of the database status happen every 2.5 seconds and result in
updates on client monitors. Disable these updates for now if the
replay engine is active. The very first update is kept to store the
initial information about the server.

The proper solution would be to record time and replay it, probably
with time warping or in some other way.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
| |
This change adds support of stream record/replay functionality to
ovsdb-server.
Since current replay engine doesn't work well with time-based
events generated locally, it will work only with standalone databases
for now (raft heavily depends on time).
To use this functionality run:
Recording:
# create a directory for replay files.
mkdir replay_dir
# copy current db for later use by replay
cp my_db ./replay_dir/my_db
ovsdb-server --record=./replay_dir <OVSDB_ARGS> my_db
# connect some clients and run some ovsdb transactions
ovs-appctl -t ovsdb-server exit
Replay:
# restore db from the copy
cp ./replay_dir/my_db my_db.for_replay
ovsdb-server --replay=./replay_dir <OVSDB_ARGS> my_db.for_replay
At this point ovsdb-server should execute all the same commands
and transactions. Since the last command was 'exit' via unixctl,
ovsdb-server will exit in the end.
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
After creating the new clustered database, write a raft entry that
sets the desired election timer. This allows CMSes to set the
election timer at cluster start and avoid an error-prone election
timer modification process after the cluster is up.
Reported-at: https://bugzilla.redhat.com/1831778
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
With a big database, writing a snapshot can take a lot of time; for
example, on one of the systems, compaction of a 300 MB database takes
about 10 seconds to complete. For a clustered database, 40% of this
time is taken by conversion of the database to the file transaction
json format, and the rest is formatting a string and writing to disk.
Of course, this highly depends on disk and CPU speeds. 300 MB is a
very possible database size for the OVN Southbound DB, and it might
be even bigger than that.

During compaction the database is not available and ovsdb-server
doesn't do any other tasks. If the leader spends 10-15 seconds
writing a snapshot, the cluster is not functional for that time
period. The leader also likely has some monitors to serve, so one
poll interval may end up being 15-20 seconds long. Systems with such
big databases typically have very high election timers configured
(16 seconds), so followers will start an election only after this
significant amount of time. Once the leader is back to an operational
state, it will re-connect and try to rejoin the cluster. In some
cases, this might also trigger 'connected' state flapping on the old
leader, triggering a re-connection of clients. This issue has been
observed with large-scale OVN deployments.

One of the methods to improve the situation is to transfer leadership
before compacting. This keeps the cluster functional while one of the
servers writes a snapshot.
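
A sketch of the approach (raft_transfer_leadership() exists in
ovsdb/raft.h; the surrounding role check and reason string are
illustrative):

    /* Hand leadership over before the long synchronous snapshot
     * write, so the rest of the cluster keeps making progress. */
    if (server_is_leader) {
        raft_transfer_leadership(storage->raft,
                                 "it's time to write a snapshot");
    }
    /* ... proceed with compaction on this server ... */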
Additionally, the time spent on compaction is now logged if it was
longer than 1 second. This adds a bit of visibility into
'unreasonably long poll interval' messages.
Reported-at: https://bugzilla.redhat.com/1960391
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
This command stops sending and receiving any RAFT-related traffic and
stops accepting new connections. It is useful to simulate network
problems between cluster members.

There is no unit test that uses it yet, but it's convenient for
manual testing.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If an election times out for a server in the 'candidate' role, it
sets the 'candidate_retrying' flag, which indicates that the storage
is disconnected and the client should re-connect. However, the
cluster/status command reports 'Status: cluster member', which is
misleading. Report "disconnected from the cluster (election timeout)"
instead.
Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
It's not enough to just have heartbeats.

RAFT heartbeats are unidirectional, i.e. the leader sends them to
followers but not the other way around. Missing heartbeats provoke
followers to start an election, but if the leader doesn't receive any
replies it will not do anything as long as there is a quorum, i.e.
there are enough other servers to make decisions.

This leads to a situation where, as long as the TCP connection is
established, the leader will continue to blindly send messages over
it. In our case this leads to a growing send backlog. The connection
will eventually be terminated due to the excessive send backlog, but
this might take a lot of time and waste process memory. At the same
time, the 'candidate' will continue to send vote requests over the
dead connection on its side.

To fix that, we need to reintroduce inactivity probes that drop the
connection if there was no incoming traffic for a long time and the
remote server doesn't reply to the "echo" request. The probe interval
might be chosen based on the election timeout to avoid the issues
described in commit db5a066c17bd.
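
A sketch of the idea ('conn->js' and 'raft->election_timer' are
illustrative names, and so is the '/ 2' relation;
jsonrpc_session_set_probe_interval() is the real lib/jsonrpc.h
setter):

    /* Re-enable the inactivity probe, scaled to the election timer
     * so the probe doesn't interfere with normal election behavior. */
    jsonrpc_session_set_probe_interval(conn->js,
                                       raft->election_timer / 2);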
Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
Fixes: db5a066c17bd ("raft: Disable RAFT jsonrpc inactivity probe.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
When you try to specify `SERVER` for the 'ovsdb-client
needs-conversion' command, it interprets the `SERVER` parameter as
the path to the schema and returns an error. This PR fixes that.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Submitted-at: https://github.com/openvswitch/ovs/pull/347
Signed-off-by: Alexey Roytman <roytman@il.ibm.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb-doc includes python code that requires dirs.py to exist.
This change fixes the broken 'make manpage-check' target:
# make manpage-check
Traceback (most recent call last):
File "./ovsdb/ovsdb-doc", line 25, in <module>
import ovs.db.schema
File "/root/ovs/python/ovs/db/schema.py", line 19, in <module>
import ovs.db.types
File "/root/ovs/python/ovs/db/types.py", line 18, in <module>
import ovs.db.data
File "/root/ovs/python/ovs/db/data.py", line 22, in <module>
import ovs.jsonrpc
File "/root/ovs/python/ovs/jsonrpc.py", line 21, in <module>
import ovs.poller
File "/root/ovs/python/ovs/poller.py", line 23, in <module>
import ovs.vlog
File "/root/ovs/python/ovs/vlog.py", line 25, in <module>
import ovs.dirs
ModuleNotFoundError: No module named 'ovs.dirs'
Fixes: 943c4a325045 ("python: set ovs.dirs variables with build system values")
Acked-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, ovsdb-server stores the complete value of a column in the
database file and in the raft log whenever that column changes. This
means that a transaction that adds, for example, one new acl to a
port group creates a log entry with the UUIDs of all existing acls
plus the new one. The same goes for ports in logical switches and
routers and for other columns with sets in the Northbound DB.

There could be thousands of acls in one port group or thousands of
ports in a single logical switch, and the typical use case is to add
one new entry, e.g. when starting a new service/VM/container or
adding one new node to a kubernetes or OpenStack cluster. This
generates a huge amount of traffic within the ovsdb raft cluster,
grows overall memory consumption, and hurts performance, since all
these UUIDs are parsed and formatted to/from json several times and
stored on disk. And the more values a set has, the more space a
single log entry occupies and the more time it takes to process by
ovsdb-server cluster members.
Simple test:
1. Start the OVN sandbox with clustered DBs:
   # make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'
2. Run a script that creates one port group and adds 4000 acls to it:
   # cat ../memory-test.sh
   pg_name=my_port_group
   export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
   ovn-nbctl pg-add $pg_name
   for i in $(seq 1 4000); do
     echo "Iteration: $i"
     ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
   done
   ovn-nbctl acl-del $pg_name
   ovn-nbctl pg-del $pg_name
   ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
   ovn-appctl -t ovn-nbctl exit
   ---
3. Check the current memory consumption of the ovsdb-server processes
   and the space occupied by the database files:
   # ls sandbox/[ns]b*.db -alh
   # ps -eo vsz,rss,comm,cmd | egrep '=[ns]b[123].pid'
Test results with current ovsdb log format:
On-disk Nb DB size : ~369 MB
RSS of Nb ovsdb-servers: ~2.7 GB
Time to finish the test: ~2m
In order to mitigate the memory consumption issues and reduce the
computational load on ovsdb-servers, let's store the diff between the
old and new values instead. This makes the size of each log entry
that adds a single acl to a port group (or a port to a logical
switch, or anything else like that) very small and independent of the
number of already existing acls (ports, etc.).

A new marker, '_is_diff', is added to the file transaction format to
specify that the transaction contains diffs instead of replacements
for the existing data.

One side effect is that this change will actually increase the size
of a file transaction that removes more than half of the entries from
a set, because the diff will be larger than the resulting new value.
However, such operations are rare.
Test results with change applied:
On-disk Nb DB size : ~2.7 MB ---> reduced by 99%
RSS of Nb ovsdb-servers: ~580 MB ---> reduced by 78%
Time to finish the test: ~1m27s ---> reduced by 27%
After this change, a new ovsdb-server is still able to read old
databases, but an old ovsdb-server will not be able to read new ones.

Since new servers can join an ovsdb cluster dynamically, it's hard to
implement any runtime mechanism to handle cases where different
versions of ovsdb-server join the cluster. However, we still need to
handle cluster upgrades. For this case, a special command line
argument is added to disable the new functionality. The documentation
is updated with the recommended way to upgrade an ovsdb cluster.
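
A hedged round-trip sketch with the ovsdb-data API (the signatures
are assumed from lib/ovsdb-data.h as described in this series): the
diff is what gets written to the log, and applying it to the old
value must reconstruct the new one.

    #include "ovsdb-data.h"
    #include "ovsdb-error.h"
    #include "openvswitch/util.h"

    static void
    check_diff_round_trip(const struct ovsdb_datum *old,
                          const struct ovsdb_datum *new,
                          const struct ovsdb_type *type)
    {
        struct ovsdb_datum diff, restored;
        struct ovsdb_error *error;

        ovsdb_datum_diff(&diff, old, new, type);   /* What gets logged. */
        error = ovsdb_datum_apply_diff(&restored, old, &diff, type);
        ovs_assert(!error);
        ovs_assert(ovsdb_datum_equals(&restored, new, type));

        ovsdb_datum_destroy(&diff, type);
        ovsdb_datum_destroy(&restored, type);
    }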
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Fixes: 4e92542cefb7 ("ovsdb-tool: Make "show-log" convert raw JSON to easier-to-read syntax.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Introduce the following information, useful for cluster debugging, to
the cluster/status command:
- time elapsed since the last election started/completed
- election trigger (e.g. timeout)
- number of disconnections
- time elapsed since the last raft message was received
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Update build system to ensure dirs.py is created when it is a
dependency for a build target. Also, update setup.py to
check for that dependency.
Fixes: 943c4a325045 ("python: set ovs.dirs variables with build system values")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, all functions of the type *_is_new() always return
'false'. This patch resolves the issue by using the
'OVSDB_IDL_CHANGE_INSERT' 'change_seqno' instead of the
'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno' to determine if a row is
new, and by resetting the 'OVSDB_IDL_CHANGE_INSERT' 'change_seqno'
on clear.

Further to this, the code is also updated to match the following
behaviour:

When a row is inserted, the 'OVSDB_IDL_CHANGE_INSERT' 'change_seqno'
is updated to match the new database change_seqno. The
'OVSDB_IDL_CHANGE_MODIFY' 'change_seqno' is not set for inserted rows
(only for updated rows).

At the end of a run, ovsdb_idl_db_track_clear() should be called to
clear all tracking information; this includes resetting all row
'change_seqno' to zero. This ensures that subsequent runs will not
see a previously 'new' row.

add_tracked_change_for_references() is updated to only track rows
that reference the current row.

Also, the unit tests are updated to test the *_is_new() and
*_is_deleted() functions.
Suggested-by: Dumitru Ceara <dceara@redhat.com>
Reported-at: https://bugzilla.redhat.com/1883562
Fixes: ca545a787ac0 ("ovsdb-idl.c: Increase seqno for change-tracking of table references.")
Signed-off-by: Mark Gray <mark.d.gray@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
ovsdb_idl_set_condition() returns a sequence number that can be used to
check if the requested conditions are acknowledged by the server.
However, database bindings do not return this value to the user, making
it impossible to check if the conditions are accepted.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Currently, when an ovsdb *.db file is created by ovsdb-tool, it
grants read permission to others. This may raise security concerns;
for example, IPsec pre-shared keys are stored in
ovs-vswitchd.conf.db. This patch addresses the concern by removing
the permissions for others.
Reported-by: Antonin Bas <abas@vmware.com>
Acked-by: Mark Gray <mark.d.gray@redhat.com>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
A new appctl command, 'cluster/set-backlog-threshold', configures
thresholds on the backlog of raft jsonrpc connections. It could be
used, for example, in extreme conditions where the size of a database
is expected to be very large, i.e. comparable with the default 4 GB
threshold.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
RAFT messages can be fairly big. If something abnormal happens to one
of the servers in a cluster, it may not be able to process all the
incoming messages in a timely manner. This results in jsonrpc backlog
growth on the sender's side, for example, if a follower gets many new
clients at once that it needs to serve, or if it decides to take a
snapshot during a period with many database changes.

If the backlog grows large enough, it becomes harder and harder for
the follower to process incoming raft messages; it sends outdated
replies and starts receiving snapshots and the whole raft log from
the leader. Sometimes the backlog grows too high (60 GB in this
example):

    jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:<ip>,
                 num of msgs: 15370, backlog: 61731060773.

In this case the OS might actually decide to kill the sender to free
some memory. In any case, it could take a lot of time for such a
server to catch up with the rest of the cluster if it has so much
data to receive and process.

Introduce backlog thresholds for jsonrpc connections. If the send
backlog exceeds particular values (500 messages or 4 GB in size), the
connection is dropped and re-created. This allows all the current
backlog to be dropped and the connection to start over, increasing
the chances of cluster recovery.
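
A minimal sketch of the policy, not the actual implementation
(jsonrpc_get_backlog() and jsonrpc_session_force_reconnect() are real
lib/jsonrpc.h calls; the queued-message counter is an assumed
parameter):

    #include <stdint.h>

    static void
    maybe_drop_backlogged_session(struct jsonrpc_session *js,
                                  const struct jsonrpc *rpc,
                                  size_t n_queued_msgs)
    {
        enum { MAX_BACKLOG_MSGS = 500 };
        const uint64_t max_backlog_bytes =
            UINT64_C(4) * 1024 * 1024 * 1024;   /* 4 GB. */

        if (jsonrpc_get_backlog(rpc) > max_backlog_bytes
            || n_queued_msgs > MAX_BACKLOG_MSGS) {
            /* Drop and re-create the connection to flush the backlog. */
            jsonrpc_session_force_reconnect(js);
        }
    }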
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The previous commit 8c2c503bdb0d ("raft: Avoid sending equal
snapshots.") took a "safe" approach of only avoiding sending exactly
the same snapshot installation request twice. However, it doesn't
make much sense to send more than one snapshot at a time. If an
obsolete snapshot is installed, the leader will re-send the most
recent one.

With this change, the leader will have only one snapshot in flight
per connection. This reduces backlog on raft connections in case a
new snapshot is created while an 'install_snapshot_request' is in
progress, or if the election timer changed in that period.

Also, not tracking the exact 'install_snapshot_request' that was sent
allows the code to be simplified.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Fixes: 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Compaction happens at most once in 10 minutes. That is a big time
interval for a heavily loaded ovsdb-server in cluster mode. In 10
minutes, raft logs can grow up to tens of thousands of entries with
tens of gigabytes in total size.

While compaction cleans up raft log entries, the memory in many cases
is not returned to the system but kept in the heap of the running
ovsdb-server process, and it can stay in this condition for a really
long time. In the end, one performance spike can lead to fast growth
of the raft log, and this memory will never (or not for a really long
time) be released to the system, even if the database is empty.

A simple example of how to reproduce this with the OVN sandbox:
1. make sandbox SANDBOXFLAGS='--nbdb-model=clustered --sbdb-model=clustered'
2. Run the following script that creates 1 port group, adds 4000
   acls, and removes all of that in the end:
   # cat ../memory-test.sh
   pg_name=my_port_group
   export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach --log-file -vsocket_util:off)
   ovn-nbctl pg-add $pg_name
   for i in $(seq 1 4000); do
     echo "Iteration: $i"
     ovn-nbctl --log acl-add $pg_name from-lport $i udp drop
   done
   ovn-nbctl acl-del $pg_name
   ovn-nbctl pg-del $pg_name
   ovs-appctl -t $(pwd)/sandbox/nb1 memory/show
   ovn-appctl -t ovn-nbctl exit
   ---
3. Stop one of the Northbound DB servers:
   ovs-appctl -t $(pwd)/sandbox/nb1 exit
   Make sure that ovsdb-server didn't compact the database before it
   was stopped. Now we have a db file on disk that contains 4000
   fairly big transactions.
4. Try to start the same ovsdb-server with this file:
   # cd sandbox && ovsdb-server <...> nb1.db
   At this point ovsdb-server reads all the transactions from the db
   file and performs them one by one as fast as it can. When it
   finishes, the raft log contains 4000 entries and ovsdb-server
   consumes (on my system) ~13 GB of memory, while the database is
   empty. And libc will likely never return this memory back to the
   system, or, at least, will hold it for a really long time.
This patch adds a new command,
'ovsdb-server/memory-trim-on-compaction'. It's disabled by default,
but once enabled, ovsdb-server will call 'malloc_trim(0)' after every
successful compaction to try to return unused heap memory back to the
system. This is glibc-specific, so its availability needs to be
detected at build time.

It is disabled by default since it adds from 1% to 30% (depending on
the current state) to the snapshot creation time, and also because
subsequent memory allocations will likely require requests to the
kernel, which might be slower. It could be enabled by default later
if considered broadly beneficial.
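
A sketch of the guarded call (the configure-time macro name is an
assumption; malloc_trim() itself is the real glibc call):

    #include <stdbool.h>
    #ifdef HAVE_MALLOC_TRIM
    #include <malloc.h>
    #endif

    static void
    trim_memory_after_compaction(bool trim_enabled)
    {
    #ifdef HAVE_MALLOC_TRIM
        if (trim_enabled) {
            /* Ask glibc to return free heap pages to the kernel. */
            malloc_trim(0);
        }
    #endif
    }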
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
In many cases, a big part of the memory consumed by the ovsdb-server
process is the raft log, so it's important to add its length to the
memory report.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
If debug logs are enabled, "raft_is_connected: true" is printed on
every call to raft_is_connected(), which is way too frequent. These
messages are not very informative and only litter the log. Log only
the disconnected state, in a rate-limited way, and log the positive
case only once, at the moment the cluster becomes connected.
Fixes: 923f01cad678 ("raft.c: Set candidate_retrying if no leader elected since last election.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
The error should be destroyed before returning.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
While sending snapshots, the backlog on raft connections can quickly
grow over 4 GB, which overflows the raft-backlog counter. Report it
in kB instead. (Using kB and not KB to match the ru_maxrss counter
reported by the kernel.)
Fixes: 3423cd97f88f ("ovsdb: Add raft memory usage to memory report.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There is one remaining use under datapath. That change should happen
upstream in Linux first according to our usual policy.
Signed-off-by: Ben Pfaff <blp@ovn.org>
Acked-by: Alin Gabriel Serdean <aserdean@ovn.org>
If a database enters an error state, e.g., in the RAFT case, when
applying the RAFT records read from the DB file triggers constraint
violations, there's no way to determine this unless a client
generates a write transaction. Such write transactions would fail
with "ovsdb-error: inconsistent data".
This commit adds a new command to show the status of the storage that's
backing a database.
Example, on an inconsistent database:
$ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
status: ovsdb error: inconsistent data
Example, on a consistent database:
$ ovs-appctl -t /tmp/test.ctl ovsdb-server/get-db-storage-status DB
status: ok
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
There are some occurrences where the database ends up in an
inconsistent state. This happened in ovn-k8s and is described in [0].
Here we are adding a supported way to check that a given db is
consistent, which is less error-prone than checking the logs.

Tested against both a valid db and a corrupted db attached to the
above bug [1]. Also tested with a fresh db that had not done a
snapshot.
[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1837953#c23
[1]: https://bugzilla.redhat.com/attachment.cgi?id=1697595
Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Suggested-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>