Fix replication inconsistency on modules that uses key space notifications (#10969)

Fix replication inconsistency on modules that uses key space notifications. ### The Problem In general, key space notifications are invoked after the command logic was executed (this is not always the case, we will discuss later about specific command that do not follow this rules). For example, the `set x 1` will trigger a `set` notification that will be invoked after the `set` logic was performed, so if the notification logic will try to fetch `x`, it will see the new data that was written. Consider the scenario on which the notification logic performs some write commands. for example, the notification logic increase some counter, `incr x{counter}`, indicating how many times `x` was changed. The logical order by which the logic was executed is has follow: ``` set x 1 incr x{counter} ``` The issue is that the `set x 1` command is added to the replication buffer at the end of the command invocation (specifically after the key space notification logic was invoked and performed the `incr` command). The replication/aof sees the commands in the wrong order: ``` incr x{counter} set x 1 ``` In this specific example the order is less important. But if, for example, the notification would have deleted `x` then we would end up with primary-replica inconsistency. ### The Solution Put the command that cause the notification in its rightful place. In the above example, the `set x 1` command logic was executed before the notification logic, so it should be added to the replication buffer before the commands that is invoked by the notification logic. To achieve this, without a major code refactoring, we save a placeholder in the replication buffer, when finishing invoking the command logic we check if the command need to be replicated, and if it does, we use the placeholder to add it to the replication buffer instead of appending it to the end. To be efficient and not allocating memory on each command to save the placeholder, the replication buffer array was modified to reuse memory (instead of allocating it each time we want to replicate commands). Also, to avoid saving a placeholder when not needed, we do it only for WRITE or MAY_REPLICATE commands. #### Additional Fixes * Expire and Eviction notifications: * Expire/Eviction logical order was to first perform the Expire/Eviction and then the notification logic. The replication buffer got this in the other way around (first notification effect and then the `del` command). The PR fixes this issue. * The notification effect and the `del` command was not wrap with `multi-exec` (if needed). The PR also fix this issue. * SPOP command: * On spop, the `spop` notification was fired before the command logic was executed. The change in this PR would have cause the replication order to be change (first `spop` command and then notification `logic`) although the logical order is first the notification logic and then the `spop` logic. The right fix would have been to move the notification to be fired after the command was executed (like all the other commands), but this can be considered a breaking change. To overcome this, the PR keeps the current behavior and changes the `spop` code to keep the right logical order when pushing commands to the replication buffer. Another PR will follow to fix the SPOP properly and match it to the other command (we split it to 2 separate PR's so it will be easy to cherry-pick this PR to 7.0 if we chose to). #### Unhanded Known Limitations * key miss event: * On key miss event, if a module performed some write command on the event (using `RM_Call`), the `dirty` counter would increase and the read command that cause the key miss event would be replicated to the replication and aof. This problem can also happened on a write command that open some keys but eventually decides not to perform any action. We decided not to handle this problem on this PR because the solution is complex and will cause additional risks in case we will want to cherry-pick this PR. We should decide if we want to handle it in future PR's. For now, modules writers is advice not to perform any write commands on key miss event. #### Testing * We already have tests to cover cases where a notification is invoking write commands that are also added to the replication buffer, the tests was modified to verify that the replica gets the command in the correct logical order. * Test was added to verify that `spop` behavior was kept unchanged. * Test was added to verify key miss event behave as expected. * Test was added to verify the changes do not break lazy expiration. #### Additional Changes * `propagateNow` function can accept a special dbid, -1, indicating not to replicate `select`. We use this to replicate `multi/exec` on `propagatePendingCommands` function. The side effect of this change is that now the `select` command will appear inside the `multi/exec` block on the replication stream (instead of outside of the `multi/exec` block). Tests was modified to match this new behavior.
author: Meir Shpilraien (Spielrein) <meir@redis.com> 2022-08-18 10:16:32 +0300
committer: GitHub <noreply@github.com> 2022-08-18 10:16:32 +0300
commit: 508a138885b33666923ab92720c8c3263dc5bd56 (patch)
tree: b0b34980b578984589a0c64bb57898d5685b87be /src/server.h
parent: 6a9cc20d9429e0ce1d6bfca5a40b681b1665cdec (diff)
download: redis-508a138885b33666923ab92720c8c3263dc5bd56.tar.gz
1 files changed, 10 insertions, 5 deletions
diff --git a/src/server.h b/src/server.h
index acb5bfd44..ebf7037e3 100644
--- a/src/server.h
+++ b/src/server.h
@@ -1281,6 +1281,7 @@ extern clientBufferLimitsConfig clientBufferLimitsDefaults[CLIENT_TYPE_OBUF_COUN
  * after the propagation of the executed command. */
 typedef struct redisOp {
     robj **argv;
+    /* target=0 means the operation should not be propagate (unused placeholder), for more info look at redisOpArray */
     int argc, dbid, target;
 } redisOp;
 
@@ -1292,9 +1293,14 @@ typedef struct redisOp {
  * redisOpArrayFree();
  */
 typedef struct redisOpArray {
-    redisOp *ops;
-    int numops;
-    int capacity;
+    redisOp *ops; /* The array of operation to replicated */
+    int numops;   /* The actual number of operations in the array */
+    int used;     /* Spots that is used in the ops array, we need to
+                     differenced between the actual number of operations
+                     and the used spots because there might be spots
+                     that was saved as a placeholder for future command
+                     but was never actually used */
+    int capacity; /* The ops array capacity */
 } redisOpArray;
 
 /* This structure is returned by the getMemoryOverheadData() function in
@@ -1485,7 +1491,6 @@ struct redisServer {
     int busy_module_yield_flags;         /* Are we inside a busy module? (triggered by RM_Yield). see BUSY_MODULE_YIELD_ flags. */
     const char *busy_module_yield_reply; /* When non-null, we are inside RM_Yield. */
     int core_propagates;        /* Is the core (in oppose to the module subsystem) is in charge of calling propagatePendingCommands? */
-    int propagate_no_multi;     /* True if propagatePendingCommands should avoid wrapping command in MULTI/EXEC */
     int module_ctx_nesting;     /* moduleCreateContext() nesting level */
     char *ignore_warnings;      /* Config: warnings that should be ignored. */
     int client_pause_in_transaction; /* Was a client pause executed during this Exec? */
@@ -2899,7 +2904,7 @@ int incrCommandStatsOnError(struct redisCommand *cmd, int flags);
 void call(client *c, int flags);
 void alsoPropagate(int dbid, robj **argv, int argc, int target);
 void propagatePendingCommands();
-void redisOpArrayInit(redisOpArray *oa);
+void redisOpArrayReset(redisOpArray *oa);
 void redisOpArrayFree(redisOpArray *oa);
 void forceCommandPropagation(client *c, int flags);
 void preventCommandPropagation(client *c);
author	Meir Shpilraien (Spielrein) <meir@redis.com>	2022-08-18 10:16:32 +0300
committer	GitHub <noreply@github.com>	2022-08-18 10:16:32 +0300
commit	508a138885b33666923ab92720c8c3263dc5bd56 (patch)
tree	b0b34980b578984589a0c64bb57898d5685b87be /src/server.h
parent	6a9cc20d9429e0ce1d6bfca5a40b681b1665cdec (diff)
download	redis-508a138885b33666923ab92720c8c3263dc5bd56.tar.gz