delta/redis.git - github.com: antirez/redis.git

	Commit message (Collapse)	Author	Age	Files	Lines
*	Always compact nodes in stream listpacks after creating new nodes (#11885)	Madelyn Olson	2023-03-07	1	-2/+4
\| \| \| \| \|	This change attempts to alleviate a minor memory usage degradation for Redis 6.2 and onwards when using rather large objects (~2k) in streams. Introduced in #6281, we pre-allocate the head nodes of a stream to be 4kb, to limit the amount of unnecessary initial reallocations that are done. However, if we only ever allocate one object because 2 objects exceeds the max_stream_entry_size, we never actually shrink it to fit the single item. This can lead to a lot of excessive memory usage. For smaller item sizes this becomes less of an issue, as the overhead decreases as the items become smaller in size. This commit also changes the MEMORY USAGE of streams, since it was reporting the lpBytes instead of the allocated size. This introduced an observability issue when diagnosing the memory issue, since Redis reported the same amount of used bytes pre and post change, even though the new implementation allocated more memory.
*	Try to trim strings only when applicable (#11817)	uriyage	2023-02-28	1	-17/+16
\| \| \| \| \| \| \| \| \| \|	As `sdsRemoveFreeSpace` have an impact on performance even if it is a no-op (see details at #11508). Only call the function when there is a possibility that the string contains free space. * For strings coming from the network, it's only if they're bigger than PROTO_MBULK_BIG_ARG * For strings coming from scripts, it's only if they're smaller than LUA_CMD_OBJCACHE_MAX_LEN * For strings coming from modules, it could be anything. Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: sundb <sundbcn@gmail.com>
*	Optimization: sdsRemoveFreeSpace to avoid realloc on noop (#11766)	uriyage	2023-01-31	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In #7875 (Redis 6.2), we changed the sds alloc to be the usable allocation size in order to: > reduce the need for realloc calls by making the sds implicitly take over the internal fragmentation This change was done most sds functions, excluding `sdsRemoveFreeSpace` and `sdsResize`, the reason is that in some places (e.g. clientsCronResizeQueryBuffer) we call sdsRemoveFreeSpace when we see excessive free space and want to trim it. so if we don't trim it exactly to size, the caller may still see excessive free space and call it again and again. However, this resulted in some excessive calls to realloc, even when there's no need and it's gonna be a no-op (e.g. when reducing 15 bytes allocation to 13). It turns out that a call for realloc with jemalloc can be expensive even if it ends up doing nothing, so this PR adds a check using `je_nallocx`, which is cheap to avoid the call for realloc. in addition to that this PR unifies sdsResize and sdsRemoveFreeSpace into common code. the difference between them was that sdsResize would avoid using SDS_TYPE_5, since it want to keep the string ready to be resized again, while sdsRemoveFreeSpace would permit using SDS_TYPE_5 and get an optimal memory consumption. now both methods take a `would_regrow` argument that makes it more explicit. the only actual impact of that is that in clientsCronResizeQueryBuffer we call both sdsResize and sdsRemoveFreeSpace for in different cases, and we now prevent the use of SDS_TYPE_5 in both. The new test that was added to cover this concern used to pass before this PR as well, this PR is just a performance optimization and cleanup. Benchmark: `redis-benchmark -c 100 -t set -d 512 -P 10 -n 100000000` on i7-9850H with jemalloc, shows improvement from 1021k ops/sec to 1067k (average of 3 runs). some 4.5% improvement. Co-authored-by: Oran Agra <oran@redislabs.com>
*	Remove the bucket-cb from dictScan and move dictEntry defrag to dictScanDefrag	Viktor Söderqvist	2023-01-11	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This change deletes the dictGetNext and dictGetNextRef functions, so the dict API doesn't expose the next field at all. The bucket function in dictScan is deleted. A separate dictScanDefrag function is added which takes a defrag alloc function to defrag-reallocate the dict entries. "Dirty" code accessing the dict internals in active defrag is removed. An 'afterReplaceEntry' is added to dictType, which allows the dict user to keep the dictEntry metadata up to date after reallocation/defrag/move. Additionally, for updating the cluster slot-to-key mapping, after a dictEntry has been reallocated, we need to know which db a dict belongs to, so we store a pointer to the db in a new metadata section in the dict struct, which is a new mechanism similar to dictEntry metadata. This adds some complexity but provides better isolation.
*	Make dictEntry opaque	Viktor Söderqvist	2023-01-11	1	-8/+6
\| \| \| \| \|	Use functions for all accesses to dictEntry (except in dict.c). Dict abuses e.g. in defrag.c have been replaced by support functions provided by dict.
*	Fix issues with listpack encoded set (#11685)	Oran Agra	2023-01-05	1	-0/+2
\| \| \| \| \| \| \| \|	PR #11290 added listpack encoding for sets, but was missing two things: 1. Correct handling of MEMORY USAGE (leading to an assertion). 2. Had an uncontrolled scratch buffer size in SRANDMEMBER leading to OOM panic (reported in #11668). Fixed by copying logic from ZRANDMEMBER. note that both issues didn't exist in any redis release.
*	Fix zuiFind crash / RM_ScanKey hang on SET object listpack encoding (#11581)	Binbin	2022-12-09	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In #11290, we added listpack encoding for SET object. But forgot to support it in zuiFind, causes ZINTER, ZINTERSTORE, ZINTERCARD, ZIDFF, ZDIFFSTORE to crash. And forgot to support it in RM_ScanKey, causes it hang. This PR add support SET listpack in zuiFind, and in RM_ScanKey. And add tests for related commands to cover this case. Other changes: - There is no reason for zuiFind to go into the internals of the SET. It can simply use setTypeIsMember and don't care about encoding. - Remove the `#include "intset.h"` from server.h reduce the chance of accidental intset API use. - Move setTypeAddAux, setTypeRemoveAux and setTypeIsMemberAux interfaces to the header. - In scanGenericCommand, use setTypeInitIterator and setTypeNext to handle OBJ_SET scan. - In RM_ScanKey, improve hash scan mode, use lpGetValue like zset, they can share code and better performance. The zuiFind part fixes #11578 Co-authored-by: Oran Agra <oran@redislabs.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
*	Optimize client memory usage tracking operation while client eviction is ↵	Harkrishn Patro	2022-12-07	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	disabled (#11348) ## Issue During the client input/output buffer processing, the memory usage is incrementally updated to keep track of clients going beyond a certain threshold `maxmemory-clients` to be evicted. However, this additional tracking activity leads to unnecessary CPU cycles wasted when no client-eviction is required. It is applicable in two cases. * `maxmemory-clients` is set to `0` which equates to no client eviction (applicable to all clients) * `CLIENT NO-EVICT` flag is set to `ON` which equates to a particular client not applicable for eviction. ## Solution * Disable client memory usage tracking during the read/write flow when `maxmemory-clients` is set to `0` or `client no-evict` is `on`. The memory usage is tracked only during the `clientCron` i.e. it gets periodically updated. * Cleanup the clients from the memory usage bucket when client eviction is disabled. * When the maxmemory-clients config is enabled or disabled at runtime, we immediately update the memory usage buckets for all clients (tested scanning 80000 took some 20ms) Benchmark shown that this can improve performance by about 5% in certain situations. Co-authored-by: Oran Agra <oran@redislabs.com>
*	Add listpack encoding for list (#11303)	sundb	2022-11-16	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Improve memory efficiency of list keys ## Description of the feature The new listpack encoding uses the old `list-max-listpack-size` config to perform the conversion, which we can think it of as a node inside a quicklist, but without 80 bytes overhead (internal fragmentation included) of quicklist and quicklistNode structs. For example, a list key with 5 items of 10 chars each, now takes 128 bytes instead of 208 it used to take. ## Conversion rules * Convert listpack to quicklist When the listpack length or size reaches the `list-max-listpack-size` limit, it will be converted to a quicklist. * Convert quicklist to listpack When a quicklist has only one node, and its length or size is reduced to half of the `list-max-listpack-size` limit, it will be converted to a listpack. This is done to avoid frequent conversions when we add or remove at the bounding size or length. ## Interface changes 1. add list entry param to listTypeSetIteratorDirection When list encoding is listpack, `listTypeIterator->lpi` points to the next entry of current entry, so when changing the direction, we need to use the current node (listTypeEntry->p) to update `listTypeIterator->lpi` to the next node in the reverse direction. ## Benchmark ### Listpack VS Quicklist with one node * LPUSH - roughly 0.3% improvement * LRANGE - roughly 13% improvement ### Both are quicklist * LRANGE - roughly 3% improvement * LRANGE without pipeline - roughly 3% improvement From the benchmark, as we can see from the results 1. When list is quicklist encoding, LRANGE improves performance by <5%. 2. When list is listpack encoding, LRANGE improves performance by ~13%, the main enhancement is brought by `addListListpackRangeReply()`. ## Memory usage 1M lists(key:0~key:1000000) with 5 items of 10 chars ("hellohello") each. shows memory usage down by 35.49%, from 214MB to 138MB. ## Note 1. Add conversion callback to support doing some work before conversion Since the quicklist iterator decompresses the current node when it is released, we can no longer decompress the quicklist after we convert the list.
*	Listpack encoding for sets (#11290)	Viktor Söderqvist	2022-11-09	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Small sets with not only integer elements are listpack encoded, by default up to 128 elements, max 64 bytes per element, new config `set-max-listpack-entries` and `set-max-listpack-value`. This saves memory for small sets compared to using a hashtable. Sets with only integers, even very small sets, are still intset encoded (up to 1G limit, etc.). Larger sets are hashtable encoded. This PR increments the RDB version, and has an effect on OBJECT ENCODING Possible conversions when elements are added: intset -> listpack listpack -> hashtable intset -> hashtable Note: No conversion happens when elements are deleted. If all elements are deleted and then added again, the set is deleted and recreated, thus implicitly converted to a smaller encoding.
*	Remove redundant calls to update refcount of shared integers (#11479)	Hanif Ariffin	2022-11-06	1	-2/+0
\| \| \| \| \|	Since they're created with makeObjectShared, then incrRefCount on them is a NOP. Signed-off-by: Hanif Bin Ariffin <hanif.ariffin.4326@gmail.com>
*	Bump codespell from 2.1.0 to 2.2.1 in /.codespell (#11184)	Binbin	2022-08-24	1	-1/+1
\| \| \|	add a few terms to the white list, and fix a few newly detected typos
*	Set RM_StringCompare input args as const (#11010)	Binbin	2022-07-19	1	-3/+3
\| \| \| \| \| \|	Following #10996, it forgot to modify RM_StringCompare in module.c Modified RM_StringCompare, compareStringObjectsWithFlags, compareStringObjects and collateStringObjects.
*	Remove ziplist dead code in object.c (#10751)	Binbin	2022-05-22	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \|	Remove some dead code in object.c, ziplist is no longer used in 7.0 Some backgrounds: zipmap - hash: replaced by ziplist in #285 ziplist - hash: replaced by listpack in #8887 ziplist - zset: replaced by listpack in #9366 ziplist - list: replaced by quicklist (listpack) in #2143 / #9740 Moved the location of ziplist.h in the server.c
*	Add RM_MallocSizeString, RM_MallocSizeDict (#10542)	guybe7	2022-04-17	1	-1/+1
\| \| \| \| \| \| \|	Add APIs to allow modules to compute the memory consumption of opaque objects owned by redis. Without these, the mem_usage callbacks of module data types are useless in many cases. Other changes: Fix streamRadixTreeMemoryUsage to include the size of the rax structure itself
*	show cluster.links in MEMORY STATS (#10302)	zhaozhao.zz	2022-03-29	1	-1/+4
\|
*	Implement Multi Part AOF mechanism to avoid AOFRW overheads. (#9788)	chenyang8094	2022-01-03	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement Multi-Part AOF mechanism to avoid overheads during AOFRW. Introducing a folder with multiple AOF files tracked by a manifest file. The main issues with the the original AOFRW mechanism are: * buffering of commands that are processed during rewrite (consuming a lot of RAM) * freezes of the main process when the AOFRW completes to drain the remaining part of the buffer and fsync it. * double disk IO for the data that arrives during AOFRW (had to be written to both the old and new AOF files) The main modifications of this PR: 1. Remove the AOF rewrite buffer and related code. 2. Divide the AOF into multiple files, they are classified as two types, one is the the `BASE` type, it represents the full amount of data (Maybe AOF or RDB format) after each AOFRW, there is only one `BASE` file at most. The second is `INCR` type, may have more than one. They represent the incremental commands since the last AOFRW. 3. Use a AOF manifest file to record and manage these AOF files mentioned above. 4. The original configuration of `appendfilename` will be the base part of the new file name, for example: `appendonly.aof.1.base.rdb` and `appendonly.aof.2.incr.aof` 5. Add manifest-related TCL tests, and modified some existing tests that depend on the `appendfilename` 6. Remove the `aof_rewrite_buffer_length` field in info. 7. Add `aof-disable-auto-gc` configuration. By default we're automatically deleting HISTORY type AOFs. It also gives users the opportunity to preserve the history AOFs. just for testing use now. 8. Add AOFRW limiting measure. When the AOFRW failures reaches the threshold (3 times now), we will delay the execution of the next AOFRW by 1 minute. If the next AOFRW also fails, it will be delayed by 2 minutes. The next is 4, 8, 16, the maximum delay is 60 minutes (1 hour). During the limit period, we can still use the 'bgrewriteaof' command to execute AOFRW immediately. 9. Support upgrade (load) data from old version redis. 10. Add `appenddirname` configuration, as the directory name of the append only files. All AOF files and manifest file will be placed in this directory. 11. Only the last AOF file (BASE or INCR) can be truncated. Otherwise redis will exit even if `aof-load-truncated` is enabled. Co-authored-by: Oran Agra <oran@redislabs.com>
*	Report slot to keys map size in MEMORY STATS in cluster mode (#10017)	Joey from AWS	2022-01-02	1	-1/+11
\| \| \| \|	Report slot to keys map size in MEMORY STATS in cluster mode Report dictMetadataSize in MEMORY USAGE command as well
*	Remove EVAL script verbatim replication, propagation, and deterministic ↵	zhugezy	2021-12-21	1	-6/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	execution logic (#9812) # Background The main goal of this PR is to remove relevant logics on Lua script verbatim replication, only keeping effects replication logic, which has been set as default since Redis 5.0. As a result, Lua in Redis 7.0 would be acting the same as Redis 6.0 with default configuration from users' point of view. There are lots of reasons to remove verbatim replication. Antirez has listed some of the benefits in Issue #5292: >1. No longer need to explain to users side effects into scripts. They can do whatever they want. >2. No need for a cache about scripts that we sent or not to the slaves. >3. No need to sort the output of certain commands inside scripts (SMEMBERS and others): this both simplifies and gains speed. >4. No need to store scripts inside the RDB file in order to startup correctly. >5. No problems about evicting keys during the script execution. When looking back at Redis 5.0, antirez and core team decided to set the config `lua-replicate-commands yes` by default instead of removing verbatim replication directly, in case some bad situations happened. 3 years later now before Redis 7.0, it's time to remove it formally. # Changes - configuration for lua-replicate-commands removed - created config file stub for backward compatibility - Replication script cache removed - this is useless under script effects replication - relevant statistics also removed - script persistence in RDB files is also removed - Propagation of SCRIPT LOAD and SCRIPT FLUSH to replica / AOF removed - Deterministic execution logic in scripts removed (i.e. don't run write commands after random ones, and sorting output of commands with random order) - the flags indicating which commands have non-deterministic results are kept as hints to clients. - `redis.replicate_commands()` & `redis.set_repl()` changed - now `redis.replicate_commands()` does nothing and return an 1 - ...and then `redis.set_repl()` can be issued before `redis.replicate_commands()` now - Relevant TCL cases adjusted - DEBUG lua-always-replicate-commands removed # Other changes - Fix a recent bug comparing CLIENT_ID_AOF to original_client->flags instead of id. (introduced in #9780) Co-authored-by: Oran Agra <oran@redislabs.com>
*	Introduce memory management on cluster link buffers (#9774)	ny0312	2021-12-16	1	-0/+3
\| \| \| \| \| \| \|	Introduce memory management on cluster link buffers: * Introduce a new `cluster-link-sendbuf-limit` config that caps memory usage of cluster bus link send buffers. * Introduce a new `CLUSTER LINKS` command that displays current TCP links to/from peers. * Introduce a new `mem_cluster_links` field under `INFO` command output, which displays the overall memory usage by all current cluster links. * Introduce a new `total_cluster_links_buffer_limit_exceeded` field under `CLUSTER INFO` command output, which displays the accumulated count of cluster links freed due to `cluster-link-sendbuf-limit`.
*	Redis Functions - Added redis function unit and Lua engine	meir@redislabs.com	2021-12-02	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Redis function unit is located inside functions.c and contains Redis Function implementation: 1. FUNCTION commands: * FUNCTION CREATE * FCALL * FCALL_RO * FUNCTION DELETE * FUNCTION KILL * FUNCTION INFO 2. Register engine In addition, this commit introduce the first engine that uses the Redis Function capabilities, the Lua engine.
*	Redis Functions - Move Lua related variable into luaCtx struct	meir@redislabs.com	2021-12-01	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The following variable was renamed: 1. lua_caller -> script_caller 2. lua_time_limit -> script_time_limit 3. lua_timedout -> script_timedout 4. lua_oom -> script_oom 5. lua_disable_deny_script -> script_disable_deny_script 6. in_eval -> in_script The following variables was moved to lctx under eval.c 1. lua 2. lua_client 3. lua_cur_script 4. lua_scripts 5. lua_scripts_mem 6. lua_replicate_commands 7. lua_write_dirty 8. lua_random_dirty 9. lua_multi_emitted 10. lua_repl 11. lua_kill 12. lua_time_start 13. lua_time_snapshot This commit is in a low risk of introducing any issues and it is just moving varibales around and not changing any logic.
*	Replace ziplist with listpack in quicklist (#9740)	sundb	2021-11-24	1	-7/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Part three of implementing #8702, following #8887 and #9366 . ## Description of the feature 1. Replace the ziplist container of quicklist with listpack. 2. Convert existing quicklist ziplists on RDB loading time. an O(n) operation. ## Interface changes 1. New `list-max-listpack-size` config is an alias for `list-max-ziplist-size`. 2. Replace `debug ziplist` command with `debug listpack`. ## Internal changes 1. Add `lpMerge` to merge two listpacks . (same as `ziplistMerge`) 2. Add `lpRepr` to print info of listpack which is used in debugCommand and `quicklistRepr`. (same as `ziplistRepr`) 3. Replace `QUICKLIST_NODE_CONTAINER_ZIPLIST` with `QUICKLIST_NODE_CONTAINER_PACKED`(following #9357 ). It represent that a quicklistNode is a packed node, as opposed to a plain node. 4. Remove `createZiplistObject` method, which is never used. 5. Calculate listpack entry size using overhead overestimation in `quicklistAllowInsert`. We prefer an overestimation, which would at worse lead to a few bytes below the lowest limit of 4k. ## Improvements 1. Calling `lpShrinkToFit` after converting Ziplist to listpack, which was missed at #9366. 2. Optimize `quicklistAppendPlainNode` to avoid memcpy data. ## Bugfix 1. Fix crash in `quicklistRepr` when ziplist is compressed, introduced from #9366. ## Test 1. Add unittest for `lpMerge`. 2. Modify the old quicklist ziplist corrupt dump test. Co-authored-by: Oran Agra <oran@redislabs.com>
*	Add support for list type to store elements larger than 4GB (#9357)	perryitay	2021-11-03	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Redis lists are stored in quicklist, which is currently a linked list of ziplists. Ziplists are limited to storing elements no larger than 4GB, so when bigger items are added they're getting truncated. This PR changes quicklists so that they're capable of storing large items in quicklist nodes that are plain string buffers rather than ziplist. As part of the PR there were few other changes in redis: 1. new DEBUG sub-commands: - QUICKLIST-PACKED-THRESHOLD - set the threshold of for the node type to be plan or ziplist. default (1GB) - QUICKLIST <key> - Shows low level info about the quicklist encoding of <key> 2. rdb format change: - A new type was added - RDB_TYPE_LIST_QUICKLIST_2 . - container type (packed / plain) was added to the beginning of the rdb object (before the actual node list). 3. testing: - Tests that requires over 100MB will be by default skipped. a new flag was added to 'runtest' to run the large memory tests (not used by default) Co-authored-by: sundb <sundbcn@gmail.com> Co-authored-by: Oran Agra <oran@redislabs.com>
*	Replication backlog and replicas use one global shared replication buffer ↵	Wang Yuan	2021-10-25	1	-8/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	(#9166) ## Background For redis master, one replica uses one copy of replication buffer, that is a big waste of memory, more replicas more waste, and allocate/free memory for every reply list also cost much. If we set client-output-buffer-limit small and write traffic is heavy, master may disconnect with replicas and can't finish synchronization with replica. If we set client-output-buffer-limit big, master may be OOM when there are many replicas that separately keep much memory. Because replication buffers of different replica client are the same, one simple idea is that all replicas only use one replication buffer, that will effectively save memory. Since replication backlog content is the same as replicas' output buffer, now we can discard replication backlog memory and use global shared replication buffer to implement replication backlog mechanism. ## Implementation I create one global "replication buffer" which contains content of replication stream. The structure of "replication buffer" is similar to the reply list that exists in every client. But the node of list is `replBufBlock`, which has `id, repl_offset, refcount` fields. ```c /* Replication buffer blocks is the list of replBufBlock. * * +--------------+ +--------------+ +--------------+ * \| refcount = 1 \| ... \| refcount = 0 \| ... \| refcount = 2 \| * +--------------+ +--------------+ +--------------+ * \| / \ * \| / \ * \| / \ * Repl Backlog Replia_A Replia_B * * Each replica or replication backlog increments only the refcount of the * 'ref_repl_buf_node' which it points to. So when replica walks to the next * node, it should first increase the next node's refcount, and when we trim * the replication buffer nodes, we remove node always from the head node which * refcount is 0. If the refcount of the head node is not 0, we must stop * trimming and never iterate the next node. / / Similar with 'clientReplyBlock', it is used for shared buffers between * all replica clients and replication backlog. / typedef struct replBufBlock { int refcount; / Number of replicas or repl backlog using. / long long id; / The unique incremental number. / long long repl_offset; / Start replication offset of the block. */ size_t size, used; char buf[]; } replBufBlock; ``` So now when we feed replication stream into replication backlog and all replicas, we only need to feed stream into replication buffer `feedReplicationBuffer`. In this function, we set some fields of replication backlog and replicas to references of the global replication buffer blocks. And we also need to check replicas' output buffer limit to free if exceeding `client-output-buffer-limit`, and trim replication backlog if exceeding `repl-backlog-size`. When sending reply to replicas, we also need to iterate replication buffer blocks and send its content, when totally sending one block for replica, we decrease current node count and increase the next current node count, and then free the block which reference is 0 from the head of replication buffer blocks. Since now we use linked list to manage replication backlog, it may cost much time for iterating all linked list nodes to find corresponding replication buffer node. So we create a rax tree to store some nodes for index, but to avoid rax tree occupying too much memory, i record one per 64 nodes for index. Currently, to make partial resynchronization as possible as much, we always let replication backlog as the last reference of replication buffer blocks, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks, and this method doesn't increase memory usage since they share replication buffer. To avoid freezing server for freeing unreferenced replication buffer blocks when we need to trim backlog for exceeding backlog size setting, we trim backlog incrementally (free 64 blocks per call now), and make it faster in `beforeSleep` (free 640 blocks). ### Other changes - `mem_total_replication_buffers`: we add this field in INFO command, it means the total memory of replication buffers used. - `mem_clients_slaves`: now even replica is slow to replicate, and its output buffer memory is not 0, but it still may be 0, since replication backlog and replicas share one global replication buffer, only if replication buffer memory is more than the repl backlog setting size, we consider the excess as replicas' memory. Otherwise, we think replication buffer memory is the consumption of repl backlog. - Key eviction Since all replicas and replication backlog share global replication buffer, we think only the part of exceeding backlog size the extra separate consumption of replicas. Because we trim backlog incrementally in the background, backlog size may exceeds our setting if slow replicas that reference vast replication buffer blocks disconnect. To avoid massive eviction loop, we don't count the delayed freed replication backlog into used memory even if there are no replicas, i.e. we also regard this memory as replicas's memory. - `client-output-buffer-limit` check for replica clients It doesn't make sense to set the replica clients output buffer limit lower than the repl-backlog-size config (partial sync will succeed and then replica will get disconnected). Such a configuration is ignored (the size of repl-backlog-size will be used). This doesn't have memory consumption implications since the replica client will share the backlog buffers memory. - Drop replication backlog after loading data if needed We always create replication backlog if server is a master, we need it because we put DELs in it when loading expired keys in RDB, but if RDB doesn't have replication info or there is no rdb, it is not possible to support partial resynchronization, to avoid extra memory of replication backlog, we drop it. - Multi IO threads Since all replicas and replication backlog use global replication buffer, if I/O threads are enabled, to guarantee data accessing thread safe, we must let main thread handle sending the output buffer to all replicas. But before, other IO threads could handle sending output buffer of all replicas. ## Other optimizations This solution resolve some other problem: - When replicas disconnect with master since of out of output buffer limit, releasing the output buffer of replicas may freeze server if we set big `client-output-buffer-limit` for replicas, but now, it doesn't cause freezing. - This implementation may mitigate reply list copy cost time(also freezes server) when one replication has huge reply buffer and another replica can copy buffer for full synchronization. now, we just copy reference info, it is very light. - If we set replication backlog size big, it also may cost much time to copy replication backlog into replica's output buffer. But this commit eliminates this problem. - Resizing replication backlog size doesn't empty current replication backlog content.
*	Modify mem_usage2 module callback to enable to take sample_size argument (#9612)	Hanna Fadida	2021-10-17	1	-2/+2
\| \| \| \|	This is useful for approximating size computation of complex module types. Note that the mem_usage2 callback is new and has not been released yet, which is why we can modify it.
*	Client eviction (#8687)	yoav-steinberg	2021-09-23	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	### Description A mechanism for disconnecting clients when the sum of all connected clients is above a configured limit. This prevents eviction or OOM caused by accumulated used memory between all clients. It's a complimentary mechanism to the `client-output-buffer-limit` mechanism which takes into account not only a single client and not only output buffers but rather all memory used by all clients. #### Design The general design is as following: * We track memory usage of each client, taking into account all memory used by the client (query buffer, output buffer, parsed arguments, etc...). This is kept up to date after reading from the socket, after processing commands and after writing to the socket. * Based on the used memory we sort all clients into buckets. Each bucket contains all clients using up up to x2 memory of the clients in the bucket below it. For example up to 1m clients, up to 2m clients, up to 4m clients, ... * Before processing a command and before sleep we check if we're over the configured limit. If we are we start disconnecting clients from larger buckets downwards until we're under the limit. #### Config `maxmemory-clients` max memory all clients are allowed to consume, above this threshold we disconnect clients. This config can either be set to 0 (meaning no limit), a size in bytes (possibly with MB/GB suffix), or as a percentage of `maxmemory` by using the `%` suffix (e.g. setting it to `10%` would mean 10% of `maxmemory`). #### Important code changes * During the development I encountered yet more situations where our io-threads access global vars. And needed to fix them. I also had to handle keeps the clients sorted into the memory buckets (which are global) while their memory usage changes in the io-thread. To achieve this I decided to simplify how we check if we're in an io-thread and make it much more explicit. I removed the `CLIENT_PENDING_READ` flag used for checking if the client is in an io-thread (it wasn't used for anything else) and just used the global `io_threads_op` variable the same way to check during writes. * I optimized the cleanup of the client from the `clients_pending_read` list on client freeing. We now store a pointer in the `client` struct to this list so we don't need to search in it (`pending_read_list_node`). * Added `evicted_clients` stat to `INFO` command. * Added `CLIENT NO-EVICT ON\|OFF` sub command to exclude a specific client from the client eviction mechanism. Added corrosponding 'e' flag in the client info string. * Added `multi-mem` field in the client info string to show how much memory is used up by buffered multi commands. * Client `tot-mem` now accounts for buffered multi-commands, pubsub patterns and channels (partially), tracking prefixes (partially). * CLIENT_CLOSE_ASAP flag is now handled in a new `beforeNextClient()` function so clients will be disconnected between processing different clients and not only before sleep. This new function can be used in the future for work we want to do outside the command processing loop but don't want to wait for all clients to be processed before we get to it. Specifically I wanted to handle output-buffer-limit related closing before we process client eviction in case the two race with each other. * Added a `DEBUG CLIENT-EVICTION` command to print out info about the client eviction buckets. * Each client now holds a pointer to the client eviction memory usage bucket it belongs to and listNode to itself in that bucket for quick removal. * Global `io_threads_op` variable now can contain a `IO_THREADS_OP_IDLE` value indicating no io-threading is currently being executed. * In order to track memory used by each clients in real-time we can't rely on updating these stats in `clientsCron()` alone anymore. So now I call `updateClientMemUsage()` (used to be `clientsCronTrackClientsMemUsage()`) after command processing, after writing data to pubsub clients, after writing the output buffer and after reading from the socket (and maybe other places too). The function is written to be fast. * Clients are evicted if needed (with appropriate log line) in `beforeSleep()` and before processing a command (before performing oom-checks and key-eviction). * All clients memory usage buckets are grouped as follows: * All clients using less than 64k. * 64K..128K * 128K..256K * ... * 2G..4G * All clients using 4g and up. * Added client-eviction.tcl with a bunch of tests for the new mechanism. * Extended maxmemory.tcl to test the interaction between maxmemory and maxmemory-clients settings. * Added an option to flag a numeric configuration variable as a "percent", this means that if we encounter a '%' after the number in the config file (or config set command) we consider it as valid. Such a number is store internally as a negative value. This way an integer value can be interpreted as either a percent (negative) or absolute value (positive). This is useful for example if some numeric configuration can optionally be set to a percentage of something else. Co-authored-by: Oran Agra <oran@redislabs.com>
*	Replace all usage of ziplist with listpack for t_zset (#9366)	sundb	2021-09-09	1	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Part two of implementing #8702 (zset), after #8887. ## Description of the feature Replaced all uses of ziplist with listpack in t_zset, and optimized some of the code to optimize performance. ## Rdb format changes New `RDB_TYPE_ZSET_LISTPACK` rdb type. ## Rdb loading improvements: 1) Pre-expansion of dict for validation of duplicate data for listpack and ziplist. 2) Simplifying the release of empty key objects when RDB loading. 3) Unify ziplist and listpack data verify methods for zset and hash, and move code to rdb.c. ## Interface changes 1) New `zset-max-listpack-entries` config is an alias for `zset-max-ziplist-entries` (same with `zset-max-listpack-value`). 2) OBJECT ENCODING will return listpack instead of ziplist. ## Listpack improvements: 1) Add `lpDeleteRange` and `lpDeleteRangeWithEntry` functions to delete a range of entries from listpack. 2) Improve the performance of `lpCompare`, converting from string to integer is faster than converting from integer to string. 3) Replace `snprintf` with `ll2string` to improve performance in converting numbers to strings in `lpGet()`. ## Zset improvements: 1) Improve the performance of `zzlFind` method, use `lpFind` instead of `lpCompare` in a loop. 2) Use `lpDeleteRangeWithEntry` instead of `lpDelete` twice to delete a element of zset. ## Tests 1) Add some unittests for `lpDeleteRange` and `lpDeleteRangeWithEntry` function. 2) Add zset RDB loading test. 3) Add benchmark test for `lpCompare` and `ziplsitCompare`. 4) Add empty listpack zset corrupt dump test.
*	Fix missing dismiss hash listpack memory due to ziplist->listpack migration ↵	sundb	2021-08-10	1	-2/+10
\| \| \| \|	(#9353)
*	fix a compilation error around madvise when make with jemalloc on MacOS (#9350)	DarrenJiang13	2021-08-10	1	-2/+2
\| \| \|	We only use MADV_DONTNEED on Linux, that's were it was tested.
*	Format fixes and naming. SentReplyOnKeyMiss -> addReplyOrErrorObject (#9346)	Meir Shpilraien (Spielrein)	2021-08-10	1	-1/+1
\| \| \| \|	Following the comments on #8659, this PR fix some formatting and naming issues.
*	Replace all usage of ziplist with listpack for t_hash (#8887)	sundb	2021-08-10	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Part one of implementing #8702 (taking hashes first before other types) ## Description of the feature 1. Change ziplist encoded hash objects to listpack encoding. 2. Convert existing ziplists on RDB loading time. an O(n) operation. ## Rdb format changes 1. Add RDB_TYPE_HASH_LISTPACK rdb type. 2. Bump RDB_VERSION to 10 ## Interface changes 1. New `hash-max-listpack-entries` config is an alias for `hash-max-ziplist-entries` (same with `hash-max-listpack-value`) 2. OBJECT ENCODING will return `listpack` instead of `ziplist` ## Listpack improvements: 1. Support direct insert, replace integer element (rather than convert back and forth from string) 3. Add more listpack capabilities to match the ziplist ones (like `lpFind`, `lpRandomPairs` and such) 4. Optimize element length fetching, avoid multiple calculations 5. Use inline to avoid function call overhead. ## Tests 1. Add a new test to the RDB load time conversion 2. Adding the listpack unit tests. (based on the one in ziplist.c) 3. Add a few "corrupt payload: fuzzer findings" tests, and slightly modify existing ones. Co-authored-by: Oran Agra <oran@redislabs.com>
*	[BUGFIX] Add some missed error statistics (#9328)	DarrenJiang13	2021-08-06	1	-2/+1
\| \| \|	add error counting for some missed behaviors.
*	Improvements to corrupt payload sanitization (#9321)	Oran Agra	2021-08-05	1	-0/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recently we found two issues in the fuzzer tester: #9302 #9285 After fixing them, more problems surfaced and this PR (as well as #9297) aims to fix them. Here's a list of the fixes - Prevent an overflow when allocating a dict hashtable - Prevent OOM when attempting to allocate a huge string - Prevent a few invalid accesses in listpack - Improve sanitization of listpack first entry - Validate integrity of stream consumer groups PEL - Validate integrity of stream listpack entry IDs - Validate ziplist tail followed by extra data which start with 0xff Co-authored-by: sundb <sundbcn@gmail.com>
*	fix dict access broken by #9228 (#9319)	yoav-steinberg	2021-08-05	1	-6/+6
\|
*	dict struct memory optimizations (#9228)	yoav-steinberg	2021-08-05	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Reduce dict struct memory overhead on 64bit dict size goes down from jemalloc's 96 byte bin to its 56 byte bin. summary of changes: - Remove `privdata` from callbacks and dict creation. (this affects many files, see "Interface change" below). - Meld `dictht` struct into the `dict` struct to eliminate struct padding. (this affects just dict.c and defrag.c) - Eliminate the `sizemask` field, can be calculated from size when needed. - Convert the `size` field into `size_exp` (exponent), utilizes one byte instead of 8. Interface change: pass dict pointer to dict type call back functions. This is instead of passing the removed privdata field. In the future if we'd like to have private data in the callbacks we can extract it from the dict type. We can extend dictType to include a custom dict struct allocator and use it to allocate more data at the end of the dict struct. This data can then be used to store private data later acccessed by the callbacks.
*	Use madvise(MADV_DONTNEED) to release memory to reduce COW (#8974)	Wang Yuan	2021-08-04	1	-0/+163
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## Backgroud As we know, after `fork`, one process will copy pages when writing data to these pages(CoW), and another process still keep old pages, they totally cost more memory. For redis, we suffered that redis consumed much memory when the fork child is serializing key/values, even that maybe cause OOM. But actually we find, in redis fork child process, the child process don't need to keep some memory and parent process may write or update that, for example, child process will never access the key-value that is serialized but users may update it in parent process. So we think it may reduce COW if the child process release memory that it is not needed. ## Implementation For releasing key value in child process, we may think we call `decrRefCount` to free memory, but i find the fork child process still use much memory when we don't write any data to redis, and it costs much more time that slows down bgsave. Maybe because memory allocator doesn't really release memory to OS, and it may modify some inner data for this free operation, especially when we free small objects. Moreover, CoW is based on pages, so it is a easy way that we only free the memory bulk that is not less than kernel page size. madvise(MADV_DONTNEED) can quickly release specified region pages to OS bypassing memory allocator, and allocator still consider that this memory still is used and don't change its inner data. There are some buffers we can release in the fork child process: - Serialized key-values the fork child process never access serialized key-values, so we try to free them. Because we only can release big bulk memory, and it is time consumed to iterate all items/members/fields/entries of complex data type. So we decide to iterate them and try to release them only when their average size of item/member/field/entry is more than page size of OS. - Replication backlog Because replication backlog is a cycle buffer, it will be changed quickly if redis has heavy write traffic, but in fork child process, we don't need to access that. - Client buffers If clients have requests during having the fork child process, clients' buffer also be changed frequently. The memory includes client query buffer, output buffer, and client struct used memory. To get child process peak private dirty memory, we need to count peak memory instead of last used memory, because the child process may continue to release memory (since COW used to only grow till now, the last was equivalent to the peak). Also we're adding a new `current_cow_peak` info variable (to complement the existing `current_cow_size`) Co-authored-by: Oran Agra <oran@redislabs.com>
*	Fix LRU blue moon bug in RESTORE, RDB loading, module API (#9279)	Oran Agra	2021-07-29	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The `lru_clock` and `lru` bits in `robj` save the least significant 24 bits of the unixtime (seconds since 1/1/1970), and wrap around every 194 days. The `objectSetLRUOrLFU` function, which is used in RESTORE with IDLETIME argument, and also in replica or master loading an RDB that contains LRU, and by a module API had a bug that's triggered when that happens. The scenario was that the idle time that came from the user, let's say RESTORE command is about 1000 seconds (e.g. in the `RESTORE can set LRU` test we have), and the current `lru_clock` just wrapped around and is less than 1000 (i.e. a period of 1000 seconds once in some 6 months), the expression in that function would produce a negative value and the code (and comment) specified that the best way to solve that is push the idle time backwards into the past by 3 months. i.e. an idle time of 3 months instead of 1000 seconds. instead, the right thing to do is to unwrap it, and put it near LRU_CLOCK_MAX. since now `lru_clock` is smaller than `obj->lru` it will be unwrapped again by `estimateObjectIdleTime`. bug was introduced by 052e03495f, but the code before it also seemed wrong.
*	Add missing comma in memory/xgroup command help message. (#9210)	Binbin	2021-07-13	1	-1/+1
\| \| \|	This would have resulted in missing newline in the help message
*	Include sizeof(struct stream) in objectComputeSize (#9164)	guybe7	2021-06-29	1	-1/+1
\| \| \|	Affects MEMORY USAGE
*	Improve objectComputeSize() with allocator fragmentaiton. (#9095)	DarrenJiang13	2021-06-17	1	-8/+7
\| \| \| \| \| \|	This commit improve MEMORY USAGE command to include internal fragmentation overheads of: 1. EMBSTR encoded strings 2. ziplist encoded zsets and hashes 3. List type nodes
*	Enhance mem_usage/free_effort/unlink/copy callbacks and add GetDbFromIO api. ↵	chenyang8094	2021-06-16	1	-9/+3
\| \| \| \| \| \| \|	(#8999) Create new module type enhanced callbacks: mem_usage2, free_effort2, unlink2, copy2. These will be given a context point from which the module can obtain the key name and database id. In addition the digest and defrag context can now be used to obtain the key name and database id.
*	Fixed some typos, add a spell check ci and others minor fix (#8890)	Binbin	2021-06-10	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR adds a spell checker CI action that will fail future PRs if they introduce typos and spelling mistakes. This spell checker is based on blacklist of common spelling mistakes, so it will not catch everything, but at least it is also unlikely to cause false positives. Besides that, the PR also fixes many spelling mistakes and types, not all are a result of the spell checker we use. Here's a summary of other changes: 1. Scanned the entire source code and fixes all sorts of typos and spelling mistakes (including missing or extra spaces). 2. Outdated function / variable / argument names in comments 3. Fix outdated keyspace masks error log when we check `config.notify-keyspace-events` in loadServerConfigFromString. 4. Trim the white space at the end of line in `module.c`. Check: https://github.com/redis/redis/pull/7751 5. Some outdated https link URLs. 6. Fix some outdated comment. Such as: - In README: about the rdb, we used to said create a `thread`, change to `process` - dbRandomKey function coment (about the dictGetRandomKey, change to dictGetFairRandomKey) - notifyKeyspaceEvent fucntion comment (add type arg) - Some others minor fix in comment (Most of them are incorrectly quoted by variable names) 7. Modified the error log so that users can easily distinguish between TCP and TLS in `changeBindAddr`
*	Make full use of aofrwblock's buf (#8975)	Wang Yuan	2021-05-30	1	-1/+1
\| \| \| \| \| \|	Make aof rewrite buffer memory size more accurate, before, there may be 20% deviation with its real memory usage. The implication are both lower memory usage, and also a more accurate INFO.
*	Resolve nonsense static analysis warnings	Oran Agra	2021-05-03	1	-1/+1
\|
*	Fix out of range confusing error messages (XAUTOCLAIM, RPOP count) (#8746)	Yang Bodong	2021-04-07	1	-1/+5
\| \| \| \| \|	Fix out of range error messages to be clearer (avoid mentioning 9223372036854775807) * Fix XAUTOCLAIM COUNT option confusing error msg * Fix other RPOP and alike error message to mention positive
*	Moved most static strings into the shared structure (#8411)	Madelyn Olson	2021-02-09	1	-17/+0
\| \| \|	Moved most static strings into the shared structure
*	HELP subcommand, continued (#5531)	Itamar Haber	2021-01-04	1	-11/+25
\| \| \| \| \| \| \| \| \| \| \| \| \|	* man-like consistent long formatting * Uppercases commands, subcommands and options * Adds 'HELP' to HELP for all * Lexicographical order * Uses value notation and other .md likeness * Moves const char help to top Keeps it under 80 chars * Misc help typos, consistent conjuctioning (i.e return and not returns) * Uses addReplySubcommandSyntaxError(c) all over Signed-off-by: Itamar Haber <itamar@redislabs.com>
*	Flow through the error handling path for most errors (#8226)	Madelyn Olson	2020-12-23	1	-3/+3
\| \| \|	Properly throw errors for invalid replication stream and support https://github.com/redis/redis/pull/8217
*	Improve dbid range check for SELECT, MOVE, COPY (#8085)	sundb	2020-12-01	1	-0/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SELECT used to read the index into a `long` variable, and then pass it to a function that takes an `int`, possibly causing an overflow before the range check. Now all these commands use better and cleaner range check, and that also results in a slight change of the error response in case of an invalid database index. SELECT: in the past it would have returned either `-ERR invalid DB index` (if not a number), or `-ERR DB index is out of range` (if not between 1..16 or alike). now it'll return either `-ERR value is out of range` (if not a number), or `-ERR value is out of range, value must between -2147483648 and 2147483647` (if not in the range for an int), or `-ERR DB index is out of range` (if not between 0..16 or alike) MOVE: in the past it would only fail with `-ERR index out of range` no matter the reason. now return the same errors as the new ones for SELECT mentioned above. (i.e. unlike for SELECT even for a value like 17 we changed the error message) COPY: doesn't really matter how it behaved in the past (new command), new behavior is like the above two.