diff options
author | Meir Shpilraien (Spielrein) <meir@redis.com> | 2022-08-18 10:16:32 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2022-08-18 10:16:32 +0300 |
commit | 508a138885b33666923ab92720c8c3263dc5bd56 (patch) | |
tree | b0b34980b578984589a0c64bb57898d5685b87be /src/server.h | |
parent | 6a9cc20d9429e0ce1d6bfca5a40b681b1665cdec (diff) | |
download | redis-508a138885b33666923ab92720c8c3263dc5bd56.tar.gz |
Fix replication inconsistency on modules that uses key space notifications (#10969)
Fix replication inconsistency on modules that uses key space notifications.
### The Problem
In general, key space notifications are invoked after the command logic was
executed (this is not always the case, we will discuss later about specific
command that do not follow this rules). For example, the `set x 1` will trigger
a `set` notification that will be invoked after the `set` logic was performed, so
if the notification logic will try to fetch `x`, it will see the new data that was written.
Consider the scenario on which the notification logic performs some write
commands. for example, the notification logic increase some counter,
`incr x{counter}`, indicating how many times `x` was changed.
The logical order by which the logic was executed is has follow:
```
set x 1
incr x{counter}
```
The issue is that the `set x 1` command is added to the replication buffer
at the end of the command invocation (specifically after the key space
notification logic was invoked and performed the `incr` command).
The replication/aof sees the commands in the wrong order:
```
incr x{counter}
set x 1
```
In this specific example the order is less important.
But if, for example, the notification would have deleted `x` then we would
end up with primary-replica inconsistency.
### The Solution
Put the command that cause the notification in its rightful place. In the
above example, the `set x 1` command logic was executed before the
notification logic, so it should be added to the replication buffer before
the commands that is invoked by the notification logic. To achieve this,
without a major code refactoring, we save a placeholder in the replication
buffer, when finishing invoking the command logic we check if the command
need to be replicated, and if it does, we use the placeholder to add it to the
replication buffer instead of appending it to the end.
To be efficient and not allocating memory on each command to save the
placeholder, the replication buffer array was modified to reuse memory
(instead of allocating it each time we want to replicate commands).
Also, to avoid saving a placeholder when not needed, we do it only for
WRITE or MAY_REPLICATE commands.
#### Additional Fixes
* Expire and Eviction notifications:
* Expire/Eviction logical order was to first perform the Expire/Eviction
and then the notification logic. The replication buffer got this in the
other way around (first notification effect and then the `del` command).
The PR fixes this issue.
* The notification effect and the `del` command was not wrap with
`multi-exec` (if needed). The PR also fix this issue.
* SPOP command:
* On spop, the `spop` notification was fired before the command logic
was executed. The change in this PR would have cause the replication
order to be change (first `spop` command and then notification `logic`)
although the logical order is first the notification logic and then the
`spop` logic. The right fix would have been to move the notification to
be fired after the command was executed (like all the other commands),
but this can be considered a breaking change. To overcome this, the PR
keeps the current behavior and changes the `spop` code to keep the right
logical order when pushing commands to the replication buffer. Another PR
will follow to fix the SPOP properly and match it to the other command (we
split it to 2 separate PR's so it will be easy to cherry-pick this PR to 7.0 if
we chose to).
#### Unhanded Known Limitations
* key miss event:
* On key miss event, if a module performed some write command on the
event (using `RM_Call`), the `dirty` counter would increase and the read
command that cause the key miss event would be replicated to the replication
and aof. This problem can also happened on a write command that open
some keys but eventually decides not to perform any action. We decided
not to handle this problem on this PR because the solution is complex
and will cause additional risks in case we will want to cherry-pick this PR.
We should decide if we want to handle it in future PR's. For now, modules
writers is advice not to perform any write commands on key miss event.
#### Testing
* We already have tests to cover cases where a notification is invoking write
commands that are also added to the replication buffer, the tests was modified
to verify that the replica gets the command in the correct logical order.
* Test was added to verify that `spop` behavior was kept unchanged.
* Test was added to verify key miss event behave as expected.
* Test was added to verify the changes do not break lazy expiration.
#### Additional Changes
* `propagateNow` function can accept a special dbid, -1, indicating not
to replicate `select`. We use this to replicate `multi/exec` on `propagatePendingCommands`
function. The side effect of this change is that now the `select` command
will appear inside the `multi/exec` block on the replication stream (instead of
outside of the `multi/exec` block). Tests was modified to match this new behavior.
Diffstat (limited to 'src/server.h')
-rw-r--r-- | src/server.h | 15 |
1 files changed, 10 insertions, 5 deletions
diff --git a/src/server.h b/src/server.h index acb5bfd44..ebf7037e3 100644 --- a/src/server.h +++ b/src/server.h @@ -1281,6 +1281,7 @@ extern clientBufferLimitsConfig clientBufferLimitsDefaults[CLIENT_TYPE_OBUF_COUN * after the propagation of the executed command. */ typedef struct redisOp { robj **argv; + /* target=0 means the operation should not be propagate (unused placeholder), for more info look at redisOpArray */ int argc, dbid, target; } redisOp; @@ -1292,9 +1293,14 @@ typedef struct redisOp { * redisOpArrayFree(); */ typedef struct redisOpArray { - redisOp *ops; - int numops; - int capacity; + redisOp *ops; /* The array of operation to replicated */ + int numops; /* The actual number of operations in the array */ + int used; /* Spots that is used in the ops array, we need to + differenced between the actual number of operations + and the used spots because there might be spots + that was saved as a placeholder for future command + but was never actually used */ + int capacity; /* The ops array capacity */ } redisOpArray; /* This structure is returned by the getMemoryOverheadData() function in @@ -1485,7 +1491,6 @@ struct redisServer { int busy_module_yield_flags; /* Are we inside a busy module? (triggered by RM_Yield). see BUSY_MODULE_YIELD_ flags. */ const char *busy_module_yield_reply; /* When non-null, we are inside RM_Yield. */ int core_propagates; /* Is the core (in oppose to the module subsystem) is in charge of calling propagatePendingCommands? */ - int propagate_no_multi; /* True if propagatePendingCommands should avoid wrapping command in MULTI/EXEC */ int module_ctx_nesting; /* moduleCreateContext() nesting level */ char *ignore_warnings; /* Config: warnings that should be ignored. */ int client_pause_in_transaction; /* Was a client pause executed during this Exec? */ @@ -2899,7 +2904,7 @@ int incrCommandStatsOnError(struct redisCommand *cmd, int flags); void call(client *c, int flags); void alsoPropagate(int dbid, robj **argv, int argc, int target); void propagatePendingCommands(); -void redisOpArrayInit(redisOpArray *oa); +void redisOpArrayReset(redisOpArray *oa); void redisOpArrayFree(redisOpArray *oa); void forceCommandPropagation(client *c, int flags); void preventCommandPropagation(client *c); |