| Commit message | Author | Age | Files | Lines |
| |
While creating a new database file and storing a new snapshot
into it, the raft module mistakenly updates the base offset of the
old file, so the base offset of the new file remains zero. The old
file is then replaced with the new one, copying the new offsets as
well. In the end, after a full compaction, the base offset is always
zero, and any offset is at least twice as large as zero. That
triggers a new compaction again at the earliest scheduled time.
In practice this issue triggers compaction every 10-20 minutes
regardless of the database load, after the first one is triggered
by the actual file growth or by the 24h maximum limit.
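For illustration, the growth check implied above might look roughly
like this (hypothetical names, not the actual OVS code):

/* Compact once the file has at least doubled in size since the last
 * snapshot.  With base_offset mistakenly left at zero, any current
 * offset passes the check, so compaction keeps re-triggering at the
 * earliest scheduled time. */
static bool
should_compact(uint64_t base_offset, uint64_t current_offset)
{
    return current_offset >= 2 * base_offset;
}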
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-August/051977.html
Reported-by: Oleksandr Mykhalskyi <oleksandr.mykhalskyi@netcracker.com>
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Converting the database data into a JSON object, serializing that
object, and destroying it are the heaviest operations during database
compaction. If these operations are moved to a separate thread, the
main thread can continue processing database requests in the meantime.
With this change, the compaction is split into 3 phases:
1. Initialization:
- Create a copy of the database.
- Remember the current database index.
- Start a separate thread to convert the copy of the database
into a serialized JSON object.
2. Wait:
- Continue normal operation until the compaction thread is done.
- Meanwhile, the compaction thread:
* Converts the database copy to JSON.
* Serializes the resulting JSON.
* Destroys the original JSON object.
3. Finish:
- Destroy the database copy.
- Take the snapshot created by the thread.
- Write it to disk.
The key to making this scheme fast is the ability to create a shallow
copy of the database. This doesn't take much time, allowing the
thread to do most of the work.
The database copy is created and destroyed only by the main thread,
so there is no need for synchronization.
This solution reduces the time the main thread is blocked by
compaction by 80-90%. For example, in ovn-heater tests with a
120-node density-heavy scenario, where compaction normally takes
5-6 seconds at the end of a test, measured compaction times were all
below 1 second with the change applied. Also, note that these
measured times are the sum of phases 1 and 3, so actual poll
intervals are about half a second in this case.
This is only implemented for raft storage for now. The implementation
for standalone databases can be added later by using a file offset as
a database index and copying newly added changes from the old file to
a new one during ovsdb_log_replace().
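A rough sketch of the worker-thread part of this flow, assuming
hypothetical helpers (db_to_json() and the snapshot_task structure
are illustrative, not the actual ovsdb-server API):

/* Phase 2 as a unit of work: convert the shallow database copy to
 * JSON, serialize it, and free the JSON tree, while the main thread
 * keeps serving requests. */
struct snapshot_task {
    struct db *db_copy;    /* Shallow copy made by the main thread (phase 1). */
    char *serialized;      /* Result picked up by the main thread (phase 3).  */
};

static void *
compaction_worker(void *arg)
{
    struct snapshot_task *task = arg;
    struct json *json = db_to_json(task->db_copy);  /* Convert.   */
    task->serialized = json_to_string(json, 0);     /* Serialize. */
    json_destroy(json);                             /* Destroy.   */
    return NULL;
}

The main thread would start this with pthread_create() in phase 1 and
collect task->serialized in phase 3 before writing it to disk.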
Reported-at: https://bugzilla.redhat.com/2069108
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
There's no need to do that because we're not changing the hmap.
Also, if DEBUG logging is disabled there's no need to iterate at
all.
Fixes: 5a9b53a51ec9 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
While becoming a follower, the leader aborts all the current
'in-flight' commands, so the higher layers can re-try the
corresponding transactions when the new leader is elected. However,
most of these commands have already been sent to followers as append
requests, hence they will actually be committed by the majority of
the cluster members, i.e. they will be treated as committed by the
new leader, unless there is an actual network problem between
servers. The old leader, however, will decline the append replies,
since it's not the leader anymore and the commands have already been
completed with RAFT_CMD_LOST_LEADERSHIP status. The new leader will
replicate the commit index back to the old leader, and the old leader
will re-try the previously "failed" transaction, because "cluster
error"s are temporary.
If the transaction had prerequisites that don't allow double
committing, or there are other database constraints (like indexes)
that will not allow a transaction to be committed twice, the server
will reply to the client with a false-negative transaction result.
If there are no prerequisites or additional database constraints, the
server will execute the same transaction again, but as a follower.
E.g. in the OVN case, this may result in creation of duplicated
logical switches / routers / load balancers, i.e. resources with the
same non-indexed name. That may cause issues later, where ovn-nbctl
will not be able to add ports to these switches / routers.
The suggested solution is to not complete (abort) the commands, but
to allow them to be completed by the commit index update from the new
leader. This is similar to the behavior we use to complete commands
in the opposite scenario, when a follower becomes the leader. That
scenario was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated
transaction execution when leader failover.").
The code paths for leader and follower inside raft_update_commit_index
were very similar, so they were refactored into one, since we also
need the ability to complete more than one command for a follower.
A failure test is added to exercise the leadership transfer scenario.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
Reported-at: https://bugzilla.redhat.com/2046340
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible for hmap and all its derived types.
In order to be able to use both long and short versions without
changing the name of the macro for all the clients, overload the
existing name and select the appropriate version depending on the
number of arguments.
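A minimal sketch of that argument-count overloading trick (macro
names are illustrative, not the exact OVS definitions):

/* Pick the 4-argument (long) or 3-argument (short) variant based on
 * how many arguments the caller passed. */
#define FOR_EACH_SAFE_SELECT(A, B, C, D, NAME, ...) NAME
#define HMAP_FOR_EACH_SAFE(...)                                 \
    FOR_EACH_SAFE_SELECT(__VA_ARGS__,                           \
                         HMAP_FOR_EACH_SAFE_LONG,               \
                         HMAP_FOR_EACH_SAFE_SHORT)(__VA_ARGS__)

With four caller arguments the selector resolves to the LONG variant;
with three it resolves to the SHORT one.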
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Using the SHORT version of the *_SAFE loops makes the code cleaner
and less error-prone. So, use the SHORT version and remove the extra
variable when possible.
In order to be able to use both long and short versions without changing
the name of the macro for all the clients, overload the existing name
and select the appropriate version depending on the number of arguments.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
If the joining server re-connects while catching up (e.g. if it
crashed or the connection got closed due to inactivity), the data we
sent might be lost, so the server will never reply to the append
request or the snapshot installation request. At the same time, the
leader will decline all subsequent requests to join from that server
with the 'in progress' resolution. At this point the new server will
never be able to join the cluster, because it will never receive the
raft log, while the leader thinks that it was already sent.
This happened in practice when one of the servers got preempted for a
few seconds, so the leader closed the connection due to inactivity.
Destroy the joining server if a disconnection is detected. This
allows the join to start from scratch when the server re-connects and
sends a new join request.
We can't track re-connection in raft_conn_run(), because it's an
incoming connection and jsonrpc will not keep it alive or try to
reconnect. Next time the server re-connects it will be an entirely
new raft conn.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Reported-at: https://bugzilla.redhat.com/2033514
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
| |
Commit 3c2d6274bcee ("raft: Transfer leadership before creating
snapshots.") made it such that raft leaders transfer leadership before
snapshotting. However, there's still the case when the next leader to
be is in the process of snapshotting. To avoid delays in that case too,
we now explicitly allow snapshots only on followers. Cluster members
will have to wait until the current election is settled before
snapshotting.
Given the following logs taken from an OVN_Southbound 3-server cluster
during a scale test:
S1 (old leader):
19:07:51.226Z|raft|INFO|Transferring leadership to write a snapshot.
19:08:03.830Z|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
19:08:03.940Z|raft|INFO|server 8b8d is leader for term 43
S2 (follower):
19:08:00.870Z|raft|INFO|server 8b8d is leader for term 43
S3 (new leader):
19:07:51.242Z|raft|INFO|received leadership transfer from f5c9 in term 42
19:07:51.244Z|raft|INFO|term 43: starting election
19:08:00.805Z|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
19:08:00.869Z|raft|INFO|term 43: elected leader by 2+ of 3 servers
We see that the leader-to-be (S3) receives the leadership transfer,
initiates the election and immediately afterwards starts a snapshot
that takes ~9.5 seconds. During this time, S2 votes for S3, electing
it as cluster leader, but S3 doesn't effectively become leader until
it finishes snapshotting, essentially leaving the cluster without a
leader for up to ~9.5 seconds.
With this change, S3 will delay compaction and snapshotting until
the election is finished.
The only exception is the case of single-node clusters for which we
allow the node to snapshot regardless of role.
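A hypothetical sketch of the resulting policy (the helper and enum
names are illustrative):

/* Only snapshot on followers, except in single-node clusters where
 * there is no election to wait for. */
static bool
may_snapshot(enum raft_role role, size_t cluster_size)
{
    return role == RAFT_FOLLOWER || cluster_size == 1;
}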
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Raft log entries (and the raft database snapshot) contain json
objects of the data. A follower receives append requests with data
that gets parsed and added to the raft log. The leader receives
execution requests, parses the data out of them and adds it to the
log. In both cases, ovsdb-server later reads the log with
ovsdb_storage_read(), constructs a transaction and updates the
database. On followers these json objects are, in the common case,
never used again. The leader may use them to send append requests or
snapshot installation requests to followers. However, all these
operations (except for ovsdb_storage_read()) just serialize the json
in order to send it over the network.
Json objects are significantly larger than their serialized string
representation. For example, the snapshot of the database from one of
the ovn-heater scale tests takes 270 MB as a string, but 1.6 GB as a
json object, out of the total 3.8 GB consumed by the ovsdb-server
process.
ovsdb_storage_read() for a given raft entry happens only once in its
lifetime, so after this call we can serialize the json object, store
the string representation and free the actual json object that ovsdb
will never need again. This can save a lot of memory and can also
save serialization time, because each raft entry for append requests
and snapshot installation requests is serialized only once instead of
every time such a request needs to be sent.
JSON_SERIALIZED_OBJECT can be used in order to seamlessly integrate
pre-serialized data into raft_header and similar json objects.
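A rough sketch of that one-time conversion (apart from
json_to_string() and json_destroy(), the helper name is illustrative):

/* After the one-time ovsdb_storage_read(), keep only the serialized
 * string and drop the parsed json tree. */
static struct json *
entry_to_serialized(struct json *entry_json)
{
    char *s = json_to_string(entry_json, 0);
    json_destroy(entry_json);
    return json_serialized_object_create_nocopy(s);
}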
One major special case is the creation of a database snapshot.
A snapshot installation request received over the network will be
parsed and read by ovsdb-server just like any other raft log entry.
However, snapshots created locally with raft_store_snapshot() will
never be read back, because they reflect the current state of the
database and hence are already applied. For this case we can free
the json object right after writing the snapshot to disk.
Tests performed with ovn-heater on a 60-node density-light scenario,
where the on-disk database grows up to 97 MB, show that the average
memory consumption of the ovsdb-server Southbound DB processes
decreased by 58% (from 602 MB to 256 MB per process) and peak memory
consumption decreased by 40% (from 1288 MB to 771 MB).
A test with 120 nodes on the density-heavy scenario with a 270 MB
on-disk database shows a 1.5 GB memory consumption decrease, as
expected. Also, the total CPU time consumed by the Southbound DB
process went down from 296 to 256 minutes, and the number of
unreasonably long poll intervals dropped from 2896 to 1934.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Remove the comma character by using the ' and " characters.
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
After creating the new clustered database, write a raft entry that
sets the desired election timer. This allows CMSes to set the
election timer at cluster start and avoids an error-prone election
timer modification process after the cluster is up.
Reported-at: https://bugzilla.redhat.com/1831778
Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
With a big database, writing a snapshot can take a lot of time; for
example, on one of the systems, compaction of a 300MB database takes
about 10 seconds to complete. For the clustered database, 40% of this
time is spent converting the database to the file transaction json
format, and the rest is spent formatting a string and writing to
disk. Of course, this highly depends on the disk and CPU speeds.
300MB is a very plausible database size for the OVN Southbound DB,
and it might be even bigger than that.
During compaction the database is not available and ovsdb-server
doesn't do any other tasks. If the leader spends 10-15 seconds
writing a snapshot, the cluster is not functional for that time
period. The leader also likely has some monitors to serve, so a
single poll interval may end up being 15-20 seconds long. Systems
with such big databases typically have very high election timers
configured (16 seconds), so followers will start an election only
after this significant amount of time. Once the leader is back to
the operational state, it will re-connect and try to join the
cluster back. In some cases, this might also trigger 'connected'
state flapping on the old leader, triggering a re-connection of
clients. This issue has been observed in large-scale OVN deployments.
One method to improve the situation is to transfer leadership before
compacting. This keeps the cluster functional while one of the
servers writes a snapshot.
Additionally, log the time spent on compaction if it was longer than
1 second. This adds a bit of visibility into 'unreasonably long poll
interval's.
Reported-at: https://bugzilla.redhat.com/1960391
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Dumitru Ceara <dceara@redhat.com>
| |
This command makes the server stop sending and receiving any
RAFT-related traffic and stop accepting new connections. It is useful
to simulate network problems between cluster members.
There is no unit test that uses it yet, but it's convenient for
manual testing.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
If the election times out for a server in the 'candidate' role, it
sets the 'candidate_retrying' flag, which signals that the storage is
disconnected and the client should re-connect. However, the
cluster/status command reports 'Status: cluster member', which is
misleading. Report "disconnected from the cluster (election timeout)"
instead.
Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
It's not enough to just have heartbeats.
RAFT heartbeats are unidirectional, i.e. the leader sends them to
followers but not the other way around. Missing heartbeats provoke
followers to start an election, but if the leader doesn't receive any
replies it will not do anything as long as there is a quorum, i.e.
there are enough other servers to make decisions.
This leads to a situation where, while the TCP connection is
established, the leader will continue to blindly send messages to it.
In our case this leads to a growing send backlog. The connection will
eventually be terminated due to the excessive send backlog, but this
might take a lot of time and waste process memory. At the same time,
the 'candidate' will continue to send vote requests to the dead
connection on its side.
To fix that we need to reintroduce inactivity probes that will drop
the connection if there was no incoming traffic for a long time and
the remote server doesn't reply to the "echo" request. The probe
interval might be chosen based on the election timeout to avoid the
issues described in commit db5a066c17bd.
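For illustration, with the probe interval derived from the election
timer, the per-connection setup might look roughly like this (a
sketch; the field names follow raft.c conventions but are not
guaranteed to match the patch exactly):

/* Re-enable the jsonrpc inactivity probe on raft connections,
 * scaling it with the configured election timer. */
jsonrpc_session_set_probe_interval(conn->js, raft->election_timer);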
Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
Fixes: db5a066c17bd ("raft: Disable RAFT jsonrpc inactivity probe.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Introduce the following information, useful for cluster debugging,
to the cluster/status command:
- time elapsed since the last election started/completed
- election trigger (e.g. timeout)
- number of disconnections
- time elapsed since the last raft message was received
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
New appctl command 'cluster/set-backlog-threshold' to configure
thresholds on the backlog of raft jsonrpc connections. It could be
used, for example, in some extreme conditions where the size of a
database is expected to be very large, i.e. comparable with the
default 4GB threshold.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
RAFT messages can be fairly big. If something abnormal happens to
one of the servers in a cluster, it may not be able to process all
the incoming messages in a timely manner. This results in jsonrpc
backlog growth on the sender's side, for example if a follower gets
many new clients at once that it needs to serve, or if it decides to
take a snapshot during a period with a high number of database
changes. If the backlog grows large enough, it becomes harder and
harder for the follower to process incoming raft messages; it sends
outdated replies and starts receiving snapshots and the whole raft
log from the leader. Sometimes the backlog grows too high (60GB in
this example):
jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:<ip>,
num of msgs: 15370, backlog: 61731060773.
In this case the OS might actually decide to kill the sender to free
some memory. In any case, it could take a lot of time for such a
server to catch up with the rest of the cluster if it has so much
data to receive and process.
Introduce backlog thresholds for jsonrpc connections. If the sending
backlog exceeds particular values (500 messages or 4GB in size), the
connection will be dropped and re-created. This allows dropping all
the current backlog and starting over, increasing the chances of
cluster recovery.
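A hypothetical sketch of the check (the message and byte counters
are illustrative; the reconnect call is the usual way to drop and
re-create a jsonrpc session):

#define MAX_BACKLOG_MSGS   500
#define MAX_BACKLOG_BYTES  (4ULL * 1024 * 1024 * 1024)

static void
check_backlog(struct jsonrpc_session *js, size_t n_msgs, uint64_t n_bytes)
{
    if (n_msgs > MAX_BACKLOG_MSGS || n_bytes > MAX_BACKLOG_BYTES) {
        /* Drop and re-create the connection to flush the backlog. */
        jsonrpc_session_force_reconnect(js);
    }
}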
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Previous commit 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
took a "safe" approach of only avoiding exactly identical snapshot
installation requests. However, it doesn't make much sense to send
more than one snapshot at a time: if an obsolete snapshot gets
installed, the leader will re-send the most recent one.
With this change the leader will have only 1 snapshot in flight per
connection. This will reduce backlogs on raft connections in case a
new snapshot is created while an 'install_snapshot_request' is in
progress, or if the election timer is changed in that period.
Also, not tracking the exact 'install_snapshot_request' we've sent
allows the code to be simplified.
Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
Fixes: 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
In many cases a big part of the memory consumed by the ovsdb-server
process is the raft log, so it's important to add its length to the
memory report.
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
If debug logs are enabled, "raft_is_connected: true" is printed on
every call to raft_is_connected(), which is way too frequent. These
messages are not very informative and only clutter the log. Let's log
only the disconnected state, in a rate-limited way, and log the
positive case only once, at the moment the cluster becomes connected.
Fixes: 923f01cad678 ("raft.c: Set candidate_retrying if no leader elected since last election.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
The error should be destroyed before returning.
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
While sending snapshots, the backlog on raft connections can quickly
grow over 4GB, which overflows the raft-backlog counter. Let's report
it in kB instead. (Using kB and not KB to match the ru_maxrss counter
reported by the kernel.)
Fixes: 3423cd97f88f ("ovsdb: Add raft memory usage to memory report.")
Acked-by: Dumitru Ceara <dceara@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Snapshots are huge. In some cases we could receive several outdated
append replies from the remote server. This can happen in high scale
cases if the remote server is overloaded and not able to process all
the raft requests in time. As a reaction to each outdated append
reply we send a full database snapshot. While the remote server is
already overloaded, those snapshots will be stuck in the jsonrpc
backlog for a long time, making it grow up to a few GB. Since the
remote server wasn't able to process incoming messages in time, it
will likely not be able to process the snapshots either, leading to
the same situation with low chances to recover. The remote server
will likely get stuck in the 'candidate' state, and other servers
will grow their memory consumption due to growing jsonrpc backlogs:
jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.16.0.3:6644,
num of msgs: 3795, backlog: 8838994624.
This patch tries to avoid that situation by not sending identical
snapshot install requests. This helps maintain reasonable memory
consumption and allows the cluster to recover on a larger scale.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Memory reports can be found in the logs or obtained by calling the
'memory/show' appctl command. For ovsdb-server they include
information about db cells, monitor connections with their backlog
size, etc. But they don't contain any information about the memory
consumed by raft.
Backlogs of raft connections can be insanely large because snapshot
installation requests simply contain the whole database. In
not-so-healthy clusters, where one of the ovsdb servers is not able
to handle all the incoming raft traffic in time, the backlog on the
sender's side can cause significant memory consumption issues.
Add new 'raft-connections' and 'raft-backlog' counters to the memory
report to better track such conditions.
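For illustration, the counters could be filled in along these lines
(variable names are hypothetical; simap is the container ovsdb-server
uses for its memory-usage counters):

/* Add raft counters to the memory usage report. */
simap_put(usage, "raft-connections", n_raft_connections);
simap_put(usage, "raft-backlog", raft_backlog);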
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
In a scale test of a 640-node k8s cluster, the raft DB nodes' jsonrpc
sessions got closed due to the timeout of the default 5-second probe,
causing disturbance of the raft cluster. Since we already have the
heartbeat for RAFT, just disable the probe between the servers to
avoid the unnecessary jsonrpc inactivity probe.
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Zhen Wang <zhewang@nvidia.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
| |
Function raft_command_initiate() returns a correctly referenced
command instance. 'n_ref' equals 1 for complete commands and 2 for
incomplete commands, because one more reference is held in the
raft->commands list. raft_handle_execute_command_request__() leaks
the reference by not returning the pointer anywhere and not
unreferencing incomplete commands.
792 bytes in 11 blocks are definitely lost in loss record 258 of 262
at 0x483BB1A: calloc (vg_replace_malloc.c:762)
by 0x44BA32: xcalloc (util.c:121)
by 0x422E5F: raft_command_create_incomplete (raft.c:2038)
by 0x422E5F: raft_command_initiate (raft.c:2061)
by 0x428651: raft_handle_execute_command_request__ (raft.c:4161)
by 0x428651: raft_handle_execute_command_request (raft.c:4177)
by 0x428651: raft_handle_rpc (raft.c:4230)
by 0x428651: raft_conn_run (raft.c:1445)
by 0x428DEA: raft_run (raft.c:1803)
by 0x407392: main_loop (ovsdb-server.c:226)
by 0x407392: main (ovsdb-server.c:469)
Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Acked-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: William Tu <u9012063@gmail.com>
| |
During an election, there shouldn't be any leader. This change makes
sure that a server in the candidate role always reports the leader as
"unknown".
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
Sometimes a server can stay in the candidate role forever, even if
the server already sees the new leader and handles append-requests
normally. However, because of the wrong role, it appears as
disconnected from the cluster and so its clients are disconnected.
This problem happens when 2 servers become candidates in the same
term, and one of them is elected as leader in that term. It can be
reproduced by the test cases added in this patch.
The root cause is that the current implementation only changes the
role to follower when a bigger term is observed (in
raft_receive_term__()). According to the RAFT paper, if another
candidate becomes leader with the same term, the candidate should
change to follower.
This patch fixes it by changing the role to follower when the leader
is being updated in raft_update_leader().
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
When a leader handles an install_snapshot reply, the next_index for
the follower should be log_start instead of log_end, because there
can be new entries added to the leader's log after initiating the
install_snapshot procedure. Also, it should send all the accumulated
entries to the follower in the following append-request message,
instead of sending 0 entries, to speed up convergence.
Without this fix there is no functional problem, but it takes
unnecessary extra rounds of append-requests answered with
"inconsistency" by the follower, although it will eventually
converge.
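A sketch consistent with the description above (field and function
names follow the usual raft.c conventions but are not guaranteed to
match the patch exactly):

/* On an install_snapshot reply, resume from log_start and include
 * everything accumulated since the snapshot was initiated. */
s->next_index = raft->log_start;
raft_send_append_request(raft, s, raft->log_end - s->next_index,
                         "installed snapshot");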
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
When a follower needs to "catch up", the leader can send N entries
in a single append_request instead of only one entry per message.
The function raft_send_append_request() already supports this, so
this patch just calculates the correct "n" and uses it.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
When a node becomes leader, it sends out a heartbeat to all
followers and then immediately sends out another append-request for
a no-op command execution to all followers. This causes 2 consecutive
append-requests to be sent to each follower, and the first heartbeat
append-request is unnecessary. This patch removes the heartbeat.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
If no leader has ever been known to the current server, its status
should be "disconnected" from the cluster. Without this patch, when a
server in a cluster is restarted, it appears as connected before it
successfully connects back to the cluster, which is wrong.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
The cluster doesn't work after all nodes are restarted after a DB
compaction, unless there is some transaction between the DB
compaction and the restart.
The error log looks like:
raft|ERR|internal error: deferred vote_request message completed but not ready
to send because message index 9 is past last synced index 0: s2 vote_request:
term=6 last_log_index=9 last_log_term=4
The root cause is that the log_synced member is not initialized when
reading the raft header. This patch fixes it and removes the XXX from
the test case.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
Commit a76ba825 took care of saving and restoring the election timer
in the file header snapshot, but it didn't handle the parsing of the
election timer in the install_snapshot_request/reply RPC. This
results in problems, e.g. when an election timer change log is
compacted into a snapshot and then a new node joins the cluster, the
new node will use the default timer instead of the new value. This
patch fixes it by parsing the election timer in the snapshot RPC.
At the same time the patch updates the test case to cover the DB
compact and join scenario. The test reveals another 2 problems
related to clustered DB compaction, as commented in the test case's
XXX, which need to be addressed separately.
Signed-off-by: Han Zhou <hzhou@ovn.org>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
Valgrind reported:
1924: compacting online - cluster
==29312== 2,886 (240 direct, 2,646 indirect) bytes in 6 blocks are definitely lost in loss record 406 of 413
==29312== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==29312== by 0x44A5F4: xmalloc (util.c:138)
==29312== by 0x4308EA: json_create (json.c:1451)
==29312== by 0x4308EA: json_object_create (json.c:254)
==29312== by 0x430ED0: json_parser_push_object (json.c:1273)
==29312== by 0x430ED0: json_parser_input (json.c:1371)
==29312== by 0x431CF1: json_lex_input (json.c:991)
==29312== by 0x43233B: json_parser_feed (json.c:1149)
==29312== by 0x41D87F: parse_body.isra.0 (log.c:411)
==29312== by 0x41E141: ovsdb_log_read (log.c:476)
==29312== by 0x42646D: raft_read_log (raft.c:866)
==29312== by 0x42646D: raft_open (raft.c:951)
==29312== by 0x4151AF: ovsdb_storage_open__ (storage.c:81)
==29312== by 0x408FFC: open_db (ovsdb-server.c:642)
==29312== by 0x40657F: main (ovsdb-server.c:358)
This patch fixes it.
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
After the election timer is changed through the
cluster/change-election-timer command, if a server restarts, it first
initializes the timer with the default value and uses it to reset the
timer. Although it reads the latest timer value from the log later,
the first timeout may be much shorter than expected by the other
servers that use the latest timeout, and it would start an election
before it receives the first heartbeat from the leader.
This patch fixes it by changing the order of reading the log and
resetting the timer, so that the latest value is read from the log
before the initial reset of the timer.
Fixes: commit 8e35461 ("ovsdb raft: Support leader election time change online.")
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
A new unixctl command, cluster/change-election-timer, is implemented
to change the leader election timeout base value according to scale
needs. The change takes effect upon consensus of the cluster,
implemented through the append-request RPC. A new field
"election-timer" is added to the raft log entry for this purpose.
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
candidate_retrying is used to determine if the current node is
disconnected from the cluster when the node is in the candidate role.
However, a node can flap between the candidate and follower roles
before a leader is elected when the majority of the cluster is down,
so is_connected() will flap too, which confuses clients.
This patch avoids the flapping with the help of a new member,
had_leader, so that if no leader has been elected since the last
election, we know we are still retrying and stay reported as
disconnected from the cluster.
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
As mentioned in the RAFT paper, section 6.2:
Leaders: A server might be in the leader state, but if it isn’t the current
leader, it could be needlessly delaying client requests. For example, suppose a
leader is partitioned from the rest of the cluster, but it can still
communicate with a particular client. Without additional mechanism, it could
delay a request from that client forever, being unable to replicate a log entry
to any other servers. Meanwhile, there might be another leader of a newer term
that is able to communicate with a majority of the cluster and would be able to
commit the client’s request. Thus, a leader in Raft steps down if an election
timeout elapses without a successful round of heartbeats to a majority of its
cluster; this allows clients to retry their requests with another server.
Reported-by: Aliasgar Ginwala <aginwala@ebay.com>
Tested-by: Aliasgar Ginwala <aginwala@ebay.com>
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
Fixes: commit 5a9b53a5 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
If a server claims itself as "disconnected", all clients connected
to that server will try to reconnect to a new server in the cluster.
However, currently a server claims itself as disconnected even when
it is the candidate trying to become the new leader (which it most
likely will be), and all its clients reconnect to another node.
During a leader fail-over (e.g. due to a leader failure), it is
expected that all clients of the old leader will have to reconnect to
other nodes in the cluster, but it is unnecessary for all the clients
of a healthy node to reconnect, which could cause more disturbance in
a large scale environment.
This patch fixes the problem by slightly changing the condition under
which a server regards itself as disconnected: if its role is
candidate, it is regarded as disconnected only if the election didn't
succeed at the first attempt. Related failure test cases are also
unskipped and all pass with this patch.
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
When a transaction is submitted from a client connected to a
follower, and the leader crashes after receiving the
execute_command_request from the follower and sending out the append
request to the majority of followers, but before sending the
execute_command_reply to the follower, the transaction will
eventually be committed by the new leader. However, with the current
implementation the transaction gets committed twice.
For the root cause, there are two cases:
Case 1, the connected follower becomes the new leader. In this case,
the pending command of the follower will be cancelled during its role
change to leader, so the trigger for the transaction will be retried.
Case 2, another follower becomes the new leader. In this case, since
there is no execute_command_reply from the original leader (which has
crashed), the command will eventually time out, causing the trigger
for the transaction to be retried.
In both cases, the transaction will be retried by the server node's
trigger retrying logic. This patch fixes the problem with the
following changes:
1) A pending command can be completed not only by
execute_command_reply, but also when the eid is committed, if the
execute_command_reply never came.
2) Instead of cancelling all pending commands during a role change,
let the commands continue waiting to be completed when the eid is
committed. The timer is increased to twice the election base time, so
that the command has a chance to be completed when the leader
crashes.
This patch fixes the two raft failure test cases previously disabled.
See the test cases for details of how to reproduce the problem.
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
raft_command's eid should always be non-null in all 3 cases. Fix the
comment, and also replace the if condition with an assert.
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
Added unixctl commands cluster/... for ovsdb raft, which will be
used in a future patch to test more fine-grained failure scenarios.
The commands either cause a node to crash at a certain point or
manipulate the election timer so that we can control the election
process and elect the new leader desired by the test cases.
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
When an update is requested from a follower, the leader sends an
AppendRequest to all followers and waits until an AppendReply is
received from the majority, and then it updates the commit index -
the new entry is regarded as committed in the raft log. However, this
commit will not be notified to the followers (including the one that
initiated the request) until the next heartbeat (ping timeout), if
there are no other pending requests. This results in long latency for
updates made through followers, especially when a batch of updates is
requested through the same follower.
$ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
real 0m34.154s
user 0m0.083s
sys 0m0.250s
This patch solves the problem by sending a heartbeat as soon as the
commit index is updated on the leader. It also avoids unnecessary
heartbeats by resetting the ping timer whenever an AppendRequest is
broadcast. With this patch the performance is improved more than 50
times in the same test:
$ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done
real 0m0.564s
user 0m0.080s
sys 0m0.199s
Torture test cases are also updated, because otherwise the tests
would all be skipped due to the improved performance.
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
In the current OVSDB Raft design, when there are multiple
transactions pending, either from the same server node or from
different nodes in the cluster, only the first one can succeed at
once, and the following ones will fail the prerequisite check on the
leader node, because the first one updates the expected prerequisite
eid on the leader node, and the prerequisite used for proposing a
commit has to be a committed eid. So it is not possible for a node to
use the latest prerequisite expected by the leader to propose a
commit until the latest transaction is committed by the leader and
the committed_index is updated on the node.
The current implementation proposes the commit as soon as the
transaction is requested by the client, which results in continuous
retries that cause high CPU load and waste.
In particular, even if all clients use leader_only to connect only to
the leader, the prereq check failures still happen a lot when a batch
of transactions is pending on the leader node - the leader node
proposes a batch of commits using the same committed eid as
prerequisite and updates the expected prereq as soon as the first one
is in progress, but it needs time to append to followers and wait
until the majority replies to update the committed_index, which
results in continuous useless retries of the following transactions
proposed by the leader itself.
This patch doesn't change the design but simply pre-checks whether
the current eid is the same as the prereq before proposing the
commit, to avoid wasting CPU cycles, for both leader and followers.
When clients use leader_only mode, this patch completely eliminates
the prereq check failures.
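A hypothetical sketch of the pre-check (the field and helper names
are illustrative, not the actual raft.c code):

/* Skip proposing the commit while our prerequisite eid is already
 * stale; the trigger will be retried once the committed_index
 * catches up. */
if (prereq && !uuid_equals(prereq, raft_current_eid(raft))) {
    return false;
}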
In a scale test of OVN with 1k HVs, creating and binding 10k lports,
the patch resulted in a 90% CPU cost reduction on the leader and >80%
CPU cost reduction on the followers. (The test was run with the
leader election base time set to 10000ms, because otherwise the test
couldn't complete due to the frequent leader re-elections.)
This is just one of the related performance problems of the prereq
checking mechanism discussed at:
https://mail.openvswitch.org/pipermail/ovs-discuss/2019-February/048243.html
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
Signed-off-by: Han Zhou <hzhou8@ebay.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>
| |
When server A sends the leader a request to remove server B from the
cluster, where A != B, the leader sends both A and B a notification
when the removal is complete. Until now, however, the notification
(which is a raft_remove_server_reply message) did not say which
server had been removed, and the receiver did not check. Instead, the
receiver assumed that it had been removed. The result was that B was
removed and A stopped serving out the database even though it was
still part of the cluster. This commit fixes the problem.
Reported-by: ramteja tadishetti <ramtejatadishetti@gmail.com>
Acked-by: Mark Michelson <mmichels@redhat.com>
Signed-off-by: Ben Pfaff <blp@ovn.org>