path: root/ovsdb/raft.c
* raft: Fix unnecessary periodic compactions. (Ilya Maximets, 2022-08-30; 1 file, -1/+1)

  While creating a new database file and storing a new snapshot into it, the raft module mistakenly updates the base offset for the old file, so the base offset of the new file remains zero. The old file is then replaced with the new one, copying the new offsets as well. In the end, after a full compaction, the base offset is always zero, and any offset is twice as large as zero. That triggers a new compaction again at the earliest scheduled time.

  In practice this issue triggers compaction every 10-20 minutes regardless of the database load, after the first one is triggered by the actual file growth or by the 24h maximum limit.

  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-August/051977.html
  Reported-by: Oleksandr Mykhalskyi <oleksandr.mykhalskyi@netcracker.com>
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Prepare snapshot JSON in a separate thread. (Ilya Maximets, 2022-07-13; 1 file, -3/+5)

  Conversion of the database data into a JSON object, serialization and destruction of that object are the heaviest operations during database compaction. If these operations are moved to a separate thread, the main thread can continue processing database requests in the meantime.

  With this change, the compaction is split into 3 phases:

  1. Initialization:
     - Create a copy of the database.
     - Remember the current database index.
     - Start a separate thread to convert the copy of the database into a serialized JSON object.

  2. Wait:
     - Continue normal operation until the compaction thread is done.
     - Meanwhile, the compaction thread:
       * Converts the database copy to JSON.
       * Serializes the resulting JSON.
       * Destroys the original JSON object.

  3. Finish:
     - Destroy the database copy.
     - Take the snapshot created by the thread.
     - Write it to disk.

  The key to making this scheme fast is the ability to create a shallow copy of the database. This doesn't take much time, allowing the thread to do most of the work. The database copy is created and destroyed only by the main thread, so there is no need for synchronization.

  This solution reduces the time the main thread is blocked by compaction by 80-90%. For example, in ovn-heater tests with a 120-node density-heavy scenario, where compaction normally takes 5-6 seconds at the end of a test, measured compaction times were all below 1 second with the change applied. Also, note that these measured times are the sum of phases 1 and 3, so actual poll intervals are about half a second in this case.

  Only implemented for raft storage for now. The implementation for standalone databases can be added later by using a file offset as a database index and copying newly added changes from the old file to the new one during ovsdb_log_replace().

  Reported-at: https://bugzilla.redhat.com/2069108
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
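  The copy / serialize-in-a-thread / finish pattern described above can be illustrated with a minimal, self-contained C sketch using plain pthreads. All names below are hypothetical stand-ins, not the actual ovsdb-server interfaces.

    /* Minimal sketch of the three-phase compaction pattern: shallow copy,
     * serialize in a worker thread, then join and take the result.
     * Hypothetical names, not the real ovsdb-server code. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct db_copy {             /* Shallow copy of the database (phase 1). */
        const char *data;
    };

    struct snapshot_task {
        struct db_copy copy;
        char *serialized;        /* Result produced by the worker thread. */
        pthread_t thread;
        bool started;
    };

    static void *
    snapshot_worker(void *task_)
    {
        struct snapshot_task *task = task_;
        /* Phase 2 (worker): convert the copy to JSON and serialize it.
         * Duplicating the string stands in for that work here. */
        task->serialized = strdup(task->copy.data);
        return NULL;
    }

    static void
    snapshot_start(struct snapshot_task *task, const char *db_data)
    {
        task->copy.data = db_data;   /* Phase 1: cheap shallow copy. */
        task->serialized = NULL;
        task->started = true;
        pthread_create(&task->thread, NULL, snapshot_worker, task);
    }

    static char *
    snapshot_finish(struct snapshot_task *task)
    {
        /* Phase 3: join the worker and take the prepared snapshot; the
         * caller then writes it to disk. */
        pthread_join(task->thread, NULL);
        task->started = false;
        return task->serialized;
    }

    int
    main(void)
    {
        struct snapshot_task task;

        snapshot_start(&task, "{\"db\":\"contents\"}");
        /* ... the main loop keeps serving requests here (phase 2) ... */
        char *snap = snapshot_finish(&task);
        printf("snapshot ready: %s\n", snap);
        free(snap);
        return 0;
    }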
* raft: Don't use HMAP_FOR_EACH_SAFE when logging commands. (Dumitru Ceara, 2022-05-30; 1 file, -1/+5)

  There's no need to do that because we're not changing the hmap. Also, if DEBUG logging is disabled there's no need to iterate at all.

  Fixes: 5a9b53a51ec9 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
  Signed-off-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
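  Both points generalize beyond hmap: plain iteration is enough when the loop removes nothing, and the loop can be skipped entirely when debug logging is off. A generic sketch with a hypothetical list and flag, not the OVS hmap/VLOG code:

    #include <stdbool.h>
    #include <stdio.h>

    struct cmd { const char *name; struct cmd *next; };

    static bool debug_enabled = false;

    static void
    log_pending_commands(const struct cmd *head)
    {
        if (!debug_enabled) {
            return;                  /* Avoid iterating at all. */
        }
        /* Plain (non-_SAFE) iteration is fine: no element is removed. */
        for (const struct cmd *c = head; c; c = c->next) {
            printf("pending command: %s\n", c->name);
        }
    }

    int
    main(void)
    {
        struct cmd b = { "commit-index", NULL };
        struct cmd a = { "execute-command", &b };

        debug_enabled = true;
        log_pending_commands(&a);
        return 0;
    }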
* ovsdb: raft: Fix transaction double commit due to lost leadership. (Ilya Maximets, 2022-05-26; 1 file, -55/+78)

  While becoming a follower, the leader aborts all the current 'in-flight' commands, so the higher layers can re-try the corresponding transactions when the new leader is elected. However, most of these commands have already been sent to followers as append requests, hence they will actually be committed by the majority of the cluster members, i.e. will be treated as committed by the new leader, unless there is an actual network problem between the servers. However, the old leader will decline append replies, since it's not the leader anymore and the commands are already completed with RAFT_CMD_LOST_LEADERSHIP status.

  The new leader will replicate the commit index back to the old leader. The old leader will re-try the previously "failed" transaction, because "cluster error"s are temporary.

  If a transaction had prerequisites that don't allow double committing, or there are other database constraints (like indexes) that will not allow a transaction to be committed twice, the server will reply to the client with a false-negative transaction result. If there are no prerequisites or additional database constraints, the server will execute the same transaction again, but as a follower. E.g. in the OVN case, this may result in creation of duplicated logical switches / routers / load balancers, i.e. resources with the same non-indexed name. That may cause issues later where ovn-nbctl will not be able to add ports to these switches / routers.

  The suggested solution is to not complete (abort) the commands, but allow them to be completed with the commit index update from the new leader. This is similar to what we do in order to complete commands in the reverse scenario, when a follower becomes the leader. That scenario was fixed by commit 5a9b53a51ec9 ("ovsdb raft: Fix duplicated transaction execution when leader failover.").

  Code paths for leader and follower inside raft_update_commit_index were very similar, so they were refactored into one, since we also needed the ability to complete more than one command for a follower.

  A failure test was added to exercise the scenario of a leadership transfer.

  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Fixes: 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.")
  Reported-at: https://bugzilla.redhat.com/2046340
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* hmap: use short version of safe loops if possible. (Adrian Moreno, 2022-03-30; 1 file, -9/+9)

  Using the SHORT version of the *_SAFE loops makes the code cleaner and less error-prone. So, use the SHORT version and remove the extra variable when possible for hmap and all its derived types.

  In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments.

  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
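  The "overload by number of arguments" trick mentioned above can be sketched with simplified list macros; these are illustrative stand-ins, not the actual OVS macros.

    #include <stdio.h>

    struct node {
        int value;
        struct node *next;
    };

    /* Long form: the caller supplies the temporary NEXT variable. */
    #define LIST_FOR_EACH_SAFE_LONG(ITER, NEXT, HEAD)                   \
        for ((ITER) = (HEAD); (ITER) && ((NEXT) = (ITER)->next, 1);     \
             (ITER) = (NEXT))

    /* Short form: the temporary lives inside the loop header. */
    #define LIST_FOR_EACH_SAFE_SHORT(ITER, HEAD)                        \
        for (struct node *next__ = NULL, *ITER = (HEAD);                \
             ITER && (next__ = ITER->next, 1); ITER = next__)

    /* Select the 4th argument, which ends up being the right variant
     * depending on whether 2 or 3 arguments were passed. */
    #define SAFE_MACRO_CHOOSER(_1, _2, _3, NAME, ...) NAME
    #define LIST_FOR_EACH_SAFE(...)                                     \
        SAFE_MACRO_CHOOSER(__VA_ARGS__, LIST_FOR_EACH_SAFE_LONG,        \
                           LIST_FOR_EACH_SAFE_SHORT, 0)(__VA_ARGS__)

    int
    main(void)
    {
        struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };

        /* Short version: no extra variable needed at the call site. */
        LIST_FOR_EACH_SAFE (n, &a) {
            printf("short: %d\n", n->value);
        }

        /* Long version: same macro name, one more argument. */
        struct node *n, *next;
        LIST_FOR_EACH_SAFE (n, next, &a) {
            printf("long: %d\n", n->value);
        }
        return 0;
    }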
* list: use short version of safe loops if possible. (Adrian Moreno, 2022-03-30; 1 file, -8/+7)

  Using the SHORT version of the *_SAFE loops makes the code cleaner and less error-prone. So, use the SHORT version and remove the extra variable when possible.

  In order to be able to use both long and short versions without changing the name of the macro for all the clients, overload the existing name and select the appropriate version depending on the number of arguments.

  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Acked-by: Eelco Chaudron <echaudro@redhat.com>
  Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: raft: Fix inability to join the cluster after interrupted attempt. (Ilya Maximets, 2022-02-25; 1 file, -7/+31)

  If the joining server re-connects while catching up (e.g. if it crashed or the connection got closed due to inactivity), the data we sent might be lost, so the server will never reply to the append request or the snapshot installation request. At the same time, the leader will decline all subsequent requests to join from that server with the 'in progress' resolution. At this point the new server will never be able to join the cluster, because it will never receive the raft log while the leader thinks that it was already sent.

  This happened in practice when one of the servers got preempted for a few seconds, so the leader closed the connection due to inactivity.

  Destroy the joining server if a disconnection is detected. This allows the join to start from scratch when the server re-connects and sends a new join request.

  We can't track re-connection in raft_conn_run(), because it's an incoming connection and the jsonrpc will not keep it alive or try to reconnect. Next time the server re-connects it will be an entirely new raft conn.

  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Reported-at: https://bugzilla.redhat.com/2033514
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
  Acked-by: Dumitru Ceara <dceara@redhat.com>
* raft: Only allow followers to snapshot. (Dumitru Ceara, 2021-12-13; 1 file, -1/+1)

  Commit 3c2d6274bcee ("raft: Transfer leadership before creating snapshots.") made it such that raft leaders transfer leadership before snapshotting. However, there's still the case when the next leader-to-be is in the process of snapshotting. To avoid delays in that case too, we now explicitly allow snapshots only on followers. Cluster members will have to wait until the current election is settled before snapshotting.

  Consider the following logs taken from an OVN_Southbound 3-server cluster during a scale test:

    S1 (old leader):
      19:07:51.226Z|raft|INFO|Transferring leadership to write a snapshot.
      19:08:03.830Z|ovsdb|INFO|OVN_Southbound: Database compaction took 12601ms
      19:08:03.940Z|raft|INFO|server 8b8d is leader for term 43

    S2 (follower):
      19:08:00.870Z|raft|INFO|server 8b8d is leader for term 43

    S3 (new leader):
      19:07:51.242Z|raft|INFO|received leadership transfer from f5c9 in term 42
      19:07:51.244Z|raft|INFO|term 43: starting election
      19:08:00.805Z|ovsdb|INFO|OVN_Southbound: Database compaction took 9559ms
      19:08:00.869Z|raft|INFO|term 43: elected leader by 2+ of 3 servers

  We see that the leader-to-be (S3) receives the leadership transfer, initiates the election and immediately afterwards starts a snapshot that takes ~9.5 seconds. During this time, S2 votes for S3, electing it as cluster leader, but S3 doesn't effectively become leader until it finishes snapshotting, essentially keeping the cluster without a leader for up to ~9.5 seconds.

  With the current change, S3 will delay compaction and snapshotting until the election is finished. The only exception is the case of single-node clusters, for which we allow the node to snapshot regardless of role.

  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Don't keep full json objects in memory if no longer needed. (Ilya Maximets, 2021-09-01; 1 file, -42/+56)

  Raft log entries (and the raft database snapshot) contain json objects of the data. A follower receives append requests with data that gets parsed and added to the raft log. The leader receives execution requests, parses the data out of them and adds it to the log. In both cases, ovsdb-server later reads the log with ovsdb_storage_read(), constructs a transaction and updates the database. On followers these json objects are, in the common case, never used again. The leader may use them to send append requests or snapshot installation requests to followers. However, all these operations (except for ovsdb_storage_read()) just serialize the json in order to send it over the network.

  Json objects are significantly larger than their serialized string representation. For example, the snapshot of the database from one of the ovn-heater scale tests takes 270 MB as a string, but 1.6 GB as a json object out of the total 3.8 GB consumed by the ovsdb-server process.

  ovsdb_storage_read() for a given raft entry happens only once in its lifetime, so after this call we can serialize the json object, store the string representation and free the actual json object that ovsdb will never need again. This can save a lot of memory and can also save serialization time, because each raft entry for append requests and snapshot installation requests is serialized only once instead of every time such a request needs to be sent.

  JSON_SERIALIZED_OBJECT can be used in order to seamlessly integrate pre-serialized data into raft_header and similar json objects.

  One major special case is creation of a database snapshot. A snapshot installation request received over the network will be parsed and read by ovsdb-server just like any other raft log entry. However, snapshots created locally with raft_store_snapshot() will never be read back, because they reflect the current state of the database, hence already applied. For this case we can free the json object right after writing the snapshot to disk.

  Tests performed with ovn-heater on a 60-node density-light scenario, where the on-disk database goes up to 97 MB, show average memory consumption of ovsdb-server Southbound DB processes decreased by 58% (from 602 MB to 256 MB per process) and peak memory consumption decreased by 40% (from 1288 MB to 771 MB).

  A test with 120 nodes on a density-heavy scenario with a 270 MB on-disk database shows a 1.5 GB memory consumption decrease, as expected. Also, total CPU time consumed by the Southbound DB process reduced from 296 to 256 minutes. The number of unreasonably long poll intervals reduced from 2896 down to 1934.

  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
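  The "serialize once, then drop the parsed object" idea reduces to caching the compact string form and freeing the structured form after its single use. A hedged sketch with a stand-in entry type and serializer (standing in for something like the real json serialization, not reproducing it):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct log_entry {
        void *json;          /* Parsed representation; large. */
        char *serialized;    /* Compact string form; what gets re-sent. */
    };

    /* Hypothetical serializer; a real one would walk the json tree. */
    static char *
    serialize(const void *json)
    {
        (void) json;
        return strdup("{\"op\":\"noop\"}");
    }

    static void
    entry_finish_reading(struct log_entry *e)
    {
        /* After the entry has been applied once, keep only the compact
         * serialized form and drop the (much larger) parsed object. */
        if (e->json) {
            e->serialized = serialize(e->json);
            free(e->json);
            e->json = NULL;
        }
    }

    int
    main(void)
    {
        struct log_entry e = { .json = malloc(64), .serialized = NULL };

        entry_finish_reading(&e);
        printf("cached: %s\n", e.serialized);
        free(e.serialized);
        return 0;
    }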
* ovs: fix wrong quote (Yunjian Wang, 2021-07-06; 1 file, -12/+12)

  Remove the comma character by using the ' and " characters.

  Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: print local server ID when opening RAFT database (Dan Williams, 2021-06-11; 1 file, -0/+2)

  Signed-off-by: Dan Williams <dcbw@redhat.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb-tool: add --election-timer=ms option to 'create-cluster' (Dan Williams, 2021-05-27; 1 file, -6/+60)

  After creating the new clustered database, write a raft entry that sets the desired election timer. This allows CMSes to set the election timer at cluster start and avoid an error-prone election timer modification process after the cluster is up.

  Reported-at: https://bugzilla.redhat.com/1831778
  Signed-off-by: Dan Williams <dcbw@redhat.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Transfer leadership before creating snapshots. (Ilya Maximets, 2021-05-14; 1 file, -0/+15)

  With a big database, writing a snapshot can take a lot of time; for example, on one system, compaction of a 300MB database takes about 10 seconds to complete. For the clustered database, 40% of this time is spent converting the database to the file transaction json format, and the rest is formatting a string and writing to disk. Of course, this highly depends on disk and CPU speeds. 300MB is a very possible database size for the OVN Southbound DB, and it might be even bigger than that.

  During compaction the database is not available and ovsdb-server doesn't do any other tasks. If the leader spends 10-15 seconds writing a snapshot, the cluster is not functional for that time period. The leader also likely has some monitors to serve, so one poll interval may end up being 15-20 seconds long. Systems with such big databases typically have very high election timers configured (16 seconds), so followers will start an election only after this significant amount of time. Once the leader is back to the operational state, it will re-connect and try to join the cluster back. In some cases, this might also trigger 'connected' state flapping on the old leader, triggering a re-connection of clients.

  This issue has been observed with large-scale OVN deployments.

  One of the methods to improve the situation is to transfer leadership before compacting. This keeps the cluster functional while one of the servers writes a snapshot.

  Additionally, log the time spent on compaction if it was longer than 1 second. This adds a bit of visibility into 'unreasonably long poll interval's.

  Reported-at: https://bugzilla.redhat.com/1960391
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
  Acked-by: Dumitru Ceara <dceara@redhat.com>
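  The "log the compaction time if it exceeded one second" part is a plain timing check. A minimal sketch using POSIX timing rather than OVS's own time helpers; the function names are hypothetical:

    #include <stdio.h>
    #include <time.h>

    static long long
    now_msec(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    static void
    compact_database(void)
    {
        /* ... write the snapshot; stands in for the real compaction ... */
    }

    int
    main(void)
    {
        long long start = now_msec();

        compact_database();

        long long elapsed = now_msec() - start;
        if (elapsed > 1000) {
            /* Only worth logging when it contributes to a long poll
             * interval. */
            printf("Database compaction took %lldms\n", elapsed);
        }
        return 0;
    }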
* raft: Add 'stop-raft-rpc' failure test command. (Ilya Maximets, 2021-03-01; 1 file, -10/+21)

  This command will stop sending and receiving any RAFT-related traffic and stop accepting new connections. Useful to simulate network problems between cluster members. There is no unit test that uses it yet, but it's convenient for manual testing.

  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Report disconnected in cluster/status if candidate retries election. (Ilya Maximets, 2021-03-01; 1 file, -0/+2)

  If the election times out for a server in the 'candidate' role, it sets the 'candidate_retrying' flag, which indicates that the storage is disconnected and the client should re-connect. However, the cluster/status command still reports 'Status: cluster member', which is misleading. Report "disconnected from the cluster (election timeout)" instead.

  Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Reintroduce jsonrpc inactivity probes. (Ilya Maximets, 2021-03-01; 1 file, -1/+31)

  It's not enough to just have heartbeats. RAFT heartbeats are unidirectional, i.e. the leader sends them to followers but not the other way around. Missing heartbeats provoke followers to start an election, but if the leader receives no replies it will not do anything as long as there is a quorum, i.e. there are enough other servers to make decisions.

  This leads to a situation where, while the TCP connection is established, the leader will continue to blindly send messages over it. In our case this leads to a growing send backlog. The connection will be terminated eventually due to the excessive send backlog, but this might take a lot of time and waste process memory. At the same time the 'candidate' will continue to send vote requests to the dead connection on its side.

  To fix that we need to reintroduce inactivity probes that will drop the connection if there was no incoming traffic for a long time and the remote server doesn't reply to the "echo" request. The probe interval might be chosen based on an election timeout to avoid issues described in commit db5a066c17bd.

  Reported-by: Carlos Goncalves <cgoncalves@redhat.com>
  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1929690
  Fixes: db5a066c17bd ("raft: Disable RAFT jsonrpc inactivity probe.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
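  Deriving the probe interval from the election timeout can be sketched as follows; the halving is purely an illustrative assumption, not the exact relation used by OVS:

    #include <stdio.h>

    /* Pick a jsonrpc inactivity probe interval from the election timeout.
     * Probing at half the election timeout (an assumption made here only
     * for illustration) notices a dead peer well before elections fire. */
    static int
    choose_probe_interval_ms(int election_timer_ms)
    {
        int interval = election_timer_ms / 2;

        return interval > 0 ? interval : 1;
    }

    int
    main(void)
    {
        printf("probe interval: %d ms\n", choose_probe_interval_ms(16000));
        return 0;
    }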
* raft: Add some debugging information to cluster/status command. (Lorenzo Bianconi, 2020-12-21; 1 file, -0/+35)

  Introduce the following info, useful for cluster debugging, to the cluster/status command:
  - time elapsed since the last election started/completed
  - election trigger (e.g. timeout)
  - number of disconnections
  - time elapsed since the last raft message was received

  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Make backlog thresholds configurable. (Ilya Maximets, 2020-11-10; 1 file, -5/+50)

  New appctl command 'cluster/set-backlog-threshold' to configure thresholds on the backlog of raft jsonrpc connections. Could be used, for example, in some extreme conditions where the size of a database is expected to be very large, i.e. comparable with the default 4GB threshold.

  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Set threshold on backlog for raft connections. (Ilya Maximets, 2020-11-10; 1 file, -0/+5)

  RAFT messages can be fairly big. If something abnormal happens to one of the servers in a cluster, it may not be able to process all the incoming messages in a timely manner. This results in jsonrpc backlog growth on the sender's side, for example if a follower gets many new clients at once that it needs to serve, or if it decides to take a snapshot during a period with a high number of database changes. If the backlog grows large enough it becomes harder and harder for the follower to process incoming raft messages; it sends outdated replies and starts receiving snapshots and the whole raft log from the leader. Sometimes the backlog grows too high (60GB in this example):

    jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:<ip>, num of msgs: 15370, backlog: 61731060773.

  In this case the OS might actually decide to kill the sender to free some memory. In any case, it could take a lot of time for such a server to catch up with the rest of the cluster if it has so much data to receive and process.

  Introduce backlog thresholds for jsonrpc connections: if the sending backlog exceeds particular values (500 messages or 4GB in size), the connection is dropped and re-created. This drops the accumulated backlog and starts over, increasing the chances of cluster recovery.

  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
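  The threshold check itself is simple; a hedged sketch with hypothetical structures rather than the actual raft.c backlog accounting:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Thresholds matching the values named in the commit message. */
    #define MAX_BACKLOG_MSGS   500
    #define MAX_BACKLOG_BYTES  (4ULL * 1024 * 1024 * 1024)   /* 4GB */

    struct conn_backlog {
        uint64_t n_msgs;    /* Messages queued but not yet sent. */
        uint64_t n_bytes;   /* Total size of the queued messages. */
    };

    /* Returns true if the connection should be dropped and re-created so
     * that it starts over with an empty backlog. */
    static bool
    backlog_exceeded(const struct conn_backlog *b)
    {
        return b->n_msgs > MAX_BACKLOG_MSGS
               || b->n_bytes > MAX_BACKLOG_BYTES;
    }

    int
    main(void)
    {
        struct conn_backlog b = { .n_msgs = 15370,
                                  .n_bytes = 61731060773ULL };

        printf("drop connection: %s\n", backlog_exceeded(&b) ? "yes" : "no");
        return 0;
    }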
* raft: Avoid having more than one snapshot in-flight. (Ilya Maximets, 2020-11-03; 1 file, -26/+16)

  The previous commit 8c2c503bdb0d ("raft: Avoid sending equal snapshots.") took a "safe" approach of only avoiding snapshot installation requests that are exactly the same. However, it doesn't make much sense to send more than one snapshot at a time. If an obsolete snapshot was installed, the leader will re-send the most recent one.

  With this change the leader will have only 1 snapshot in-flight per connection. This will reduce backlogs on raft connections in case a new snapshot is created while an 'install_snapshot_request' is in progress or if the election timer changed in that period.

  Also, not tracking the exact 'install_snapshot_request' we've sent allows the code to be simplified.

  Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1888829
  Fixes: 8c2c503bdb0d ("raft: Avoid sending equal snapshots.")
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Add log length to the memory report. (Ilya Maximets, 2020-11-03; 1 file, -0/+1)

  In many cases a big part of the memory consumed by the ovsdb-server process is the raft log, so it's important to add its length to the memory report.

  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Avoid annoying debug logs if raft is connected. (Ilya Maximets, 2020-10-27; 1 file, -1/+10)

  If debug logs are enabled, "raft_is_connected: true" is printed on every call to raft_is_connected(), which is way too frequent. These messages are not very informative and only litter the log. Let's log only the disconnected state in a rate-limited way, and log the positive case only once, at the moment the cluster becomes connected.

  Fixes: 923f01cad678 ("raft.c: Set candidate_retrying if no leader elected since last election.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Fix error leak on failure while saving snapshot. (Ilya Maximets, 2020-10-27; 1 file, -1/+1)

  The error should be destroyed before returning.

  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Report jsonrpc backlog in kilobytes. (Ilya Maximets, 2020-10-25; 1 file, -2/+3)

  While sending snapshots, the backlog on raft connections can quickly grow over 4GB, which overflows the raft-backlog counter. Let's report it in kB instead. (Using kB and not KB to match the ru_maxrss counter reported by the kernel.)

  Fixes: 3423cd97f88f ("ovsdb: Add raft memory usage to memory report.")
  Acked-by: Dumitru Ceara <dceara@redhat.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
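  A tiny sketch of why the unit change matters: dividing down to kilobytes before the value reaches a 32-bit counter keeps even the ~60GB backlog seen earlier well within range (stand-in variables, not the actual memory-report code):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint64_t backlog_bytes = 61731060773ULL;   /* ~60GB, > UINT32_MAX. */
        unsigned int backlog_kb = backlog_bytes / 1024;   /* ~60,284,239 kB:
                                                           * fits easily. */

        printf("raft-backlog-kB: %u\n", backlog_kb);
        return 0;
    }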
* raft: Avoid sending equal snapshots. (Ilya Maximets, 2020-05-28; 1 file, -1/+38)

  Snapshots are huge. In some cases we could receive several outdated append replies from the remote server. This can happen in high-scale cases if the remote server is overloaded and unable to process all the raft requests in time. In response to each outdated append reply we send a full database snapshot. While the remote server is already overloaded, those snapshots sit in the jsonrpc backlog for a long time, making it grow up to a few GB. Since the remote server wasn't able to process incoming messages in time, it will likely not be able to process the snapshots either, leading to the same situation with low chances to recover. The remote server will likely get stuck in the 'candidate' state, and other servers will grow their memory consumption due to growing jsonrpc backlogs:

    jsonrpc|INFO|excessive sending backlog, jsonrpc: ssl:192.16.0.3:6644, num of msgs: 3795, backlog: 8838994624.

  This patch avoids that situation by not sending equal snapshot installation requests. This helps maintain reasonable memory consumption and allows the cluster to recover at larger scale.

  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* ovsdb: Add raft memory usage to memory report. (Ilya Maximets, 2020-05-25; 1 file, -0/+16)

  Memory reports can be found in logs or by calling the 'memory/show' appctl command. For ovsdb-server they include information about db cells, monitor connections with their backlog size, etc., but they don't contain any information about memory consumed by raft.

  Backlogs of raft connections could be insanely large because of snapshot installation requests that simply contain the whole database. In less healthy clusters, where one of the ovsdb servers is not able to handle all the incoming raft traffic in time, the backlog on the sender's side could cause significant memory consumption issues.

  Add new 'raft-connections' and 'raft-backlog' counters to the memory report to better track such conditions.

  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Disable RAFT jsonrpc inactivity probe. (Zhen Wang, 2020-05-12; 1 file, -0/+1)

  In a scale test with a 640-node k8s cluster, the raft DB nodes' jsonrpc sessions got closed due to the default 5-second probe timeout. This disturbs the raft cluster. Since we already have heartbeats for RAFT, just disable the probe between the servers to avoid the unnecessary jsonrpc inactivity probe.

  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Zhen Wang <zhewang@nvidia.com>
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
* raft: Fix leak of the incomplete command. (Ilya Maximets, 2020-05-04; 1 file, -3/+1)

  Function raft_command_initiate() returns a correctly referenced command instance. 'n_ref' equals 1 for complete commands and 2 for incomplete commands, because one more reference is in the raft->commands list. raft_handle_execute_command_request__() leaks the reference by not returning the pointer anywhere and not unreferencing incomplete commands.

    792 bytes in 11 blocks are definitely lost in loss record 258 of 262
       at 0x483BB1A: calloc (vg_replace_malloc.c:762)
       by 0x44BA32: xcalloc (util.c:121)
       by 0x422E5F: raft_command_create_incomplete (raft.c:2038)
       by 0x422E5F: raft_command_initiate (raft.c:2061)
       by 0x428651: raft_handle_execute_command_request__ (raft.c:4161)
       by 0x428651: raft_handle_execute_command_request (raft.c:4177)
       by 0x428651: raft_handle_rpc (raft.c:4230)
       by 0x428651: raft_conn_run (raft.c:1445)
       by 0x428DEA: raft_run (raft.c:1803)
       by 0x407392: main_loop (ovsdb-server.c:226)
       by 0x407392: main (ovsdb-server.c:469)

  Fixes: 1b1d2e6daa56 ("ovsdb: Introduce experimental support for clustered databases.")
  Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
  Acked-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: William Tu <u9012063@gmail.com>
* raft: Unset leader when starting election. (Han Zhou, 2020-03-06; 1 file, -0/+1)

  During an election, there shouldn't be any leader. This change makes sure that a server in the candidate role always reports the leader as "unknown".

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Fix the problem of stuck in candidate role forever. (Han Zhou, 2020-03-06; 1 file, -2/+17)

  Sometimes a server can stay in the candidate role forever, even if the server already sees the new leader and handles append-requests normally. However, because of the wrong role, it appears as disconnected from the cluster and so its clients are disconnected.

  This problem happens when 2 servers become candidates in the same term, and one of them is elected as leader in that term. It can be reproduced by the test cases added in this patch.

  The root cause is that the current implementation only changes the role to follower when a bigger term is observed (in raft_receive_term__()). According to the RAFT paper, if another candidate becomes leader with the same term, the candidate should change to follower.

  This patch fixes it by changing the role to follower when the leader is being updated in raft_update_leader().

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Fix next_index in install_snapshot reply handling. (Han Zhou, 2020-03-06; 1 file, -2/+3)

  When a leader handles an install_snapshot reply, the next_index for the follower should be log_start instead of log_end, because there can be new entries added to the leader's log after initiating the install_snapshot procedure. Also, it should send all the accumulated entries to the follower in the following append-request message, instead of sending 0 entries, to speed up convergence.

  Without this fix there is no functional problem, but it takes unnecessary extra rounds of append-requests answered with "inconsistency" by the follower, although it will finally converge.

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Send all missing logs in one single append_request. (Han Zhou, 2020-03-06; 1 file, -1/+1)

  When a follower needs to "catch up", the leader can send N entries in a single append_request instead of only one entry per message. The function raft_send_append_request() already supports this, so this patch just calculates the correct "n" and uses it.

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Avoid sending unnecessary heartbeat when becoming leader. (Han Zhou, 2020-03-06; 1 file, -1/+0)

  When a node becomes leader, it sends out a heartbeat to all followers and then immediately sends out another append-request for a no-op command execution to all followers again. This causes two consecutive append-requests to be sent to each follower, and the first heartbeat append-request is unnecessary. This patch removes the heartbeat.

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Fix raft_is_connected() when there is no leader yet. (Han Zhou, 2020-03-06; 1 file, -2/+8)

  If no leader has ever been known by the current server, its status should be "disconnected" from the cluster. Without this patch, when a server in the cluster is restarted, it appears as connected before it successfully connects back to the cluster, which is wrong.

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Fix the problem when cluster restarted after DB compaction. (Han Zhou, 2019-12-20; 1 file, -1/+1)

  The cluster doesn't work after all nodes are restarted after DB compaction, unless there was a transaction after the compaction and before the restart.

  The error log looks like:

    raft|ERR|internal error: deferred vote_request message completed but not ready to send because message index 9 is past last synced index 0: s2 vote_request: term=6 last_log_index=9 last_log_term=4

  The root cause is that the log_synced member is not initialized when reading the raft header. This patch fixes it and removes the XXX from the test case.

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Fix election timer parsing in snapshot RPC. (Han Zhou, 2019-11-21; 1 file, -1/+1)

  Commit a76ba825 took care of saving and restoring the election timer in the file header snapshot, but it didn't handle parsing of the election timer in the install_snapshot_request/reply RPC. This results in problems, e.g. when the election timer change log is compacted into a snapshot and then a new node joins the cluster, the new node will use the default timer instead of the new value.

  This patch fixes it by parsing the election timer in the snapshot RPC. At the same time it updates the test case to cover the DB compact and join scenario. The test reveals another 2 problems related to clustered DB compaction, as commented in the test case's XXX, which need to be addressed separately.

  Signed-off-by: Han Zhou <hzhou@ovn.org>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Free leaked json data (Yifeng Sun, 2019-09-19; 1 file, -0/+1)

  Valgrind reported:

    1924: compacting online - cluster
    ==29312== 2,886 (240 direct, 2,646 indirect) bytes in 6 blocks are definitely lost in loss record 406 of 413
    ==29312==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
    ==29312==    by 0x44A5F4: xmalloc (util.c:138)
    ==29312==    by 0x4308EA: json_create (json.c:1451)
    ==29312==    by 0x4308EA: json_object_create (json.c:254)
    ==29312==    by 0x430ED0: json_parser_push_object (json.c:1273)
    ==29312==    by 0x430ED0: json_parser_input (json.c:1371)
    ==29312==    by 0x431CF1: json_lex_input (json.c:991)
    ==29312==    by 0x43233B: json_parser_feed (json.c:1149)
    ==29312==    by 0x41D87F: parse_body.isra.0 (log.c:411)
    ==29312==    by 0x41E141: ovsdb_log_read (log.c:476)
    ==29312==    by 0x42646D: raft_read_log (raft.c:866)
    ==29312==    by 0x42646D: raft_open (raft.c:951)
    ==29312==    by 0x4151AF: ovsdb_storage_open__ (storage.c:81)
    ==29312==    by 0x408FFC: open_db (ovsdb-server.c:642)
    ==29312==    by 0x40657F: main (ovsdb-server.c:358)

  This patch fixes it.

  Acked-by: William Tu <u9012063@gmail.com>
  Signed-off-by: Yifeng Sun <pkusunyifeng@gmail.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft.c: Election timer initial reset with value from log. (Han Zhou, 2019-08-23; 1 file, -2/+3)

  After the election timer is changed through the cluster/change-election-timer command, if a server restarts, it first initializes with the default value and uses it to reset the timer. Although it reads the latest timer value from the log later, the first timeout may be much shorter than expected by the other servers that use the latest timeout, and it would start an election before it receives the first heartbeat from the leader.

  This patch fixes it by changing the order of reading the log and resetting the timer, so that the latest value is read from the log before the initial reset of the timer.

  Fixes: commit 8e35461 ("ovsdb raft: Support leader election time change online.")
  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Support leader election time change online. (Han Zhou, 2019-08-21; 1 file, -30/+157)

  A new unixctl command cluster/change-election-timer is implemented to change the leader election timeout base value according to scale needs.

  The change takes effect upon consensus of the cluster, implemented through the append-request RPC. A new field "election-timer" is added to the raft log entry for this purpose.

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft.c: Set candidate_retrying if no leader elected since last election. (Han Zhou, 2019-08-21; 1 file, -6/+25)

  candidate_retrying is used to determine if the current node is disconnected from the cluster when the node is in the candidate role. However, a node can flap between candidate and follower roles before a leader is elected when the majority of the cluster is down, so is_connected() will flap, too, which confuses clients.

  This patch avoids the flapping with the help of a new member, had_leader, so that if no leader was elected since the last election, we know we are still retrying and remain disconnected from the cluster.

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft.c: Stale leader should disconnect from cluster. (Han Zhou, 2019-08-21; 1 file, -3/+40)

  As mentioned in the RAFT paper, section 6.2 (Leaders):

    A server might be in the leader state, but if it isn’t the current leader, it could be needlessly delaying client requests. For example, suppose a leader is partitioned from the rest of the cluster, but it can still communicate with a particular client. Without additional mechanism, it could delay a request from that client forever, being unable to replicate a log entry to any other servers. Meanwhile, there might be another leader of a newer term that is able to communicate with a majority of the cluster and would be able to commit the client’s request. Thus, a leader in Raft steps down if an election timeout elapses without a successful round of heartbeats to a majority of its cluster; this allows clients to retry their requests with another server.

  Reported-by: Aliasgar Ginwala <aginwala@ebay.com>
  Tested-by: Aliasgar Ginwala <aginwala@ebay.com>
  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
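  The step-down rule quoted above reduces to a timeout check against the last successful heartbeat round. A hedged sketch with hypothetical fields, not the actual raft.c bookkeeping (which tracks replies per server):

    #include <stdio.h>

    enum role { FOLLOWER, CANDIDATE, LEADER };

    struct server {
        enum role role;
        long long last_majority_ack_ms;  /* Last successful heartbeat round. */
        long long election_timeout_ms;
    };

    /* Step down if no round of heartbeats reached a majority within an
     * election timeout, so clients can retry with another server. */
    static void
    maybe_step_down(struct server *s, long long now_ms)
    {
        if (s->role == LEADER
            && now_ms - s->last_majority_ack_ms > s->election_timeout_ms) {
            s->role = FOLLOWER;
        }
    }

    int
    main(void)
    {
        struct server s = { LEADER, 1000, 16000 };

        maybe_step_down(&s, 20000);
        printf("role: %s\n", s.role == LEADER ? "leader" : "not leader");
        return 0;
    }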
* raft: Move raft_reset_ping_timer() out of the loop. (Han Zhou, 2019-08-21; 1 file, -1/+1)

  Fixes: commit 5a9b53a5 ("ovsdb raft: Fix duplicated transaction execution when leader failover.")
  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Avoid unnecessary reconnecting during leader election. (Han Zhou, 2019-04-22; 1 file, -2/+11)

  If a server claims itself as "disconnected", all clients connected to that server will try to reconnect to a new server in the cluster. However, currently a server claims itself as disconnected even when it is itself the candidate trying to become the new leader (which it most likely will be), and all its clients will reconnect to another node. During a leader fail-over (e.g. due to a leader failure), it is expected that all clients of the old leader will have to reconnect to other nodes in the cluster, but it is unnecessary for all the clients of a healthy node to reconnect, which could cause more disturbance in a large-scale environment.

  This patch fixes the problem by slightly changing the condition under which a server regards itself as disconnected: if its role is candidate, it is regarded as disconnected only if the election didn't succeed at the first attempt. Related failure test cases are also unskipped and all pass with this patch.

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Fix duplicated transaction execution when leader failover. (Han Zhou, 2019-04-15; 1 file, -27/+49)

  When a transaction is submitted from a client connected to a follower, and the leader crashes after receiving the execute_command_request from the follower and sending out the append request to the majority of followers, but before sending the execute_command_reply to the follower, the transaction will finally be committed by the new leader. However, with the current implementation the transaction is committed twice.

  For the root cause, there are two cases:

  Case 1: the connected follower becomes the new leader. In this case, the pending command of the follower will be cancelled during its role change to leader, so the trigger for the transaction will be retried.

  Case 2: another follower becomes the new leader. In this case, since there is no execute_command_reply from the original leader (which has crashed), the command will finally time out, causing the trigger for the transaction to be retried.

  In both cases, the transaction will be retried by the server node's trigger retrying logic.

  This patch fixes the problem with the changes below:

  1) A pending command can be completed not only by an execute_command_reply, but also when the eid is committed, if the execute_command_reply never came.

  2) Instead of cancelling all pending commands during role change, let the commands continue waiting to be completed when the eid is committed. The timer is increased to be twice the election base time, so that it has the chance to be completed when the leader crashes.

  This patch fixes the two raft failure test cases previously disabled. See the test cases for details of how to reproduce the problem.

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: cmd->eid should always be non-null. (Han Zhou, 2019-04-15; 1 file, -6/+3)

  raft_command's eid should always be non-null in all 3 cases. Fix the comment, and also replace the if condition with an assert.

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Support commands that are required for testing failure scenarios. (Han Zhou, 2019-04-15; 1 file, -0/+83)

  Added unix commands cluster/... for ovsdb raft, which will be used in a future patch to test more fine-grained failure scenarios. The commands either cause a node to crash at a certain point, or manipulate the election timer so that we can control the election process and elect the new leader desired for the test cases.

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* ovsdb raft: Sync commit index to followers without delay. (Han Zhou, 2019-04-15; 1 file, -14/+29)

  When an update is requested from a follower, the leader sends an AppendRequest to all followers and waits until AppendReply is received from the majority, and then it updates the commit index - the new entry is regarded as committed in the raft log. However, this commit will not be notified to the followers (including the one that initiated the request) until the next heartbeat (ping timeout), if there are no other pending requests. This results in long latency for updates made through followers, especially when a batch of updates is requested through the same follower:

    $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done

    real    0m34.154s
    user    0m0.083s
    sys     0m0.250s

  This patch solves the problem by sending a heartbeat as soon as the commit index is updated on the leader. It also avoids unnecessary heartbeats by resetting the ping timer whenever an AppendRequest is broadcast. With this patch the performance is improved more than 50 times in the same test:

    $ time for i in `seq 1 100`; do ovn-nbctl ls-add ls$i; done

    real    0m0.564s
    user    0m0.080s
    sys     0m0.199s

  Torture test cases are also updated, because otherwise the tests would all be skipped due to the improved performance.

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
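  The two changes (notify followers as soon as the commit index advances, and treat any append broadcast as a heartbeat by pushing the ping timer back) can be sketched as follows; the names are hypothetical, not the raft.c functions:

    #include <stdint.h>
    #include <stdio.h>

    struct leader {
        uint64_t commit_index;
        long long next_ping_ms;
    };

    static void
    broadcast_append_request(struct leader *l, long long now_ms,
                             long long ping_interval_ms)
    {
        /* ... send AppendEntries (possibly empty, i.e. a heartbeat) ... */
        printf("append request, commit_index=%llu\n",
               (unsigned long long) l->commit_index);
        /* A broadcast doubles as a heartbeat, so push the ping timer back
         * instead of sending another one soon after. */
        l->next_ping_ms = now_ms + ping_interval_ms;
    }

    static void
    update_commit_index(struct leader *l, uint64_t new_index,
                        long long now_ms, long long ping_interval_ms)
    {
        if (new_index > l->commit_index) {
            l->commit_index = new_index;
            /* Tell followers right away instead of waiting for the next
             * ping timeout. */
            broadcast_append_request(l, now_ms, ping_interval_ms);
        }
    }

    int
    main(void)
    {
        struct leader l = { .commit_index = 10, .next_ping_ms = 0 };

        update_commit_index(&l, 11, 1000, 500);
        return 0;
    }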
* ovsdb raft: Precheck prereq before proposing commit. (Han Zhou, 2019-03-07; 1 file, -1/+1)

  In the current OVSDB Raft design, when there are multiple transactions pending, either from the same server node or from different nodes in the cluster, only the first one can succeed at once; the following ones will fail the prerequisite check on the leader node. This is because the first one updates the expected prerequisite eid on the leader node, while the prerequisite used for proposing a commit has to be a committed eid, so a node cannot use the latest prerequisite expected by the leader to propose a commit until the latest transaction is committed by the leader and the committed_index is updated on the node.

  The current implementation proposes the commit as soon as the transaction is requested by the client, which results in continuous retries that cause high CPU load and waste. In particular, even if all clients use leader_only to connect only to the leader, the prereq check failures still happen a lot when a batch of transactions is pending on the leader node - the leader node proposes a batch of commits using the same committed eid as prerequisite and updates the expected prereq as soon as the first one is in progress, but it needs time to append to followers and wait until the majority replies to update the committed_index, which results in continuous useless retries of the following transactions proposed by the leader itself.

  This patch doesn't change the design; it simply pre-checks whether the current eid is the same as the prereq before proposing the commit, to avoid wasting CPU cycles, for both leader and followers. When clients use leader_only mode, this patch completely eliminates the prereq check failures.

  In a scale test of OVN with 1k HVs and creating and binding 10k lports, the patch resulted in a 90% CPU cost reduction on the leader and >80% CPU cost reduction on followers. (The test was run with the leader election base time set to 10000ms, because otherwise the test couldn't complete due to frequent leader re-elections.)

  This is just one of the related performance problems of the prereq checking mechanism discussed at: https://mail.openvswitch.org/pipermail/ovs-discuss/2019-February/048243.html

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
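  The pre-check boils down to an equality test on the expected eid before proposing. A minimal sketch with a stand-in 128-bit id type rather than OVS's struct uuid:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct eid { uint8_t bytes[16]; };

    static bool
    eid_equals(const struct eid *a, const struct eid *b)
    {
        return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
    }

    /* If the prerequisite is already stale, proposing now would only be
     * rejected by the leader; wait for the commit index to catch up. */
    static bool
    may_propose_commit(const struct eid *current, const struct eid *prereq)
    {
        return eid_equals(current, prereq);
    }

    int
    main(void)
    {
        struct eid current = { { 1 } }, prereq = { { 2 } };

        printf("propose now: %s\n",
               may_propose_commit(&current, &prereq) ? "yes" : "no");
        return 0;
    }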
* raft.c: Remove noisy INFO log (Han Zhou, 2019-01-28; 1 file, -1/+0)

  Signed-off-by: Han Zhou <hzhou8@ebay.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>
* raft: Fix notifications when a server leaves the cluster. (Ben Pfaff, 2018-11-19; 1 file, -12/+38)

  When server A sends the leader a request to remove server B from the cluster, where A != B, the leader sends both A and B a notification when the removal is complete. Until now, however, the notification (which is a raft_remove_server_reply message) did not say which server had been removed, and the receiver did not check. Instead, the receiver assumed that it had been removed itself. The result was that B was removed and A stopped serving out the database even though it was still part of the cluster. This commit fixes the problem.

  Reported-by: ramteja tadishetti <ramtejatadishetti@gmail.com>
  Acked-by: Mark Michelson <mmichels@redhat.com>
  Signed-off-by: Ben Pfaff <blp@ovn.org>