path: root/src/zmalloc.c
Commit message [Author, Age, Files, Lines]
* Add warning for suspected slow system clocksource setting (#10636) [yoav-steinberg, 2022-05-22, 1 file, -17/+41]
| This PR does 2 main things:
| 1) Add a warning for a suspected slow system clocksource setting. This is Linux specific.
| 2) Add a `--check-system` argument to redis which runs all system checks and prints a report.
|
| ## System checks
| Add a command line option `--check-system` which runs all known system checks and reports to stdout which system checks have failed, with details on how to reconfigure the system for optimized redis performance. The `--check-system` mode exits with an appropriate error code after running all the checks.
|
| ## Slow clocksource details
| We check the system's clocksource performance by running `clock_gettime()` in a loop and then checking how much time was spent in system calls (via `getrusage()`). If more than 10% of the time is spent in the kernel, we print a warning. I verified that we get this warning with the slow clock sources `acpi_pm` (~90% in the kernel on my laptop) and `xen` (~30% in the kernel on an ec2 `m4.large`).
| The check runs for 5 system ticks, so we can detect time spent in the kernel in 20% jumps (0%, 20%, 40%...). Anything more accurate would require the test to run longer. Typically 5 ticks are 50ms, which means running the test on startup would delay startup by 50ms. To avoid this, the test is only executed in `--check-system` mode.
| For a quick startup check, we specifically warn if we see that the system is using the `xen` clocksource, which we know has bad performance and isn't recommended (at least on ec2). In such a case the user should manually run redis with `--check-system` to force the slower clocksource test described above.
|
| ## Other changes in the PR
| * All the system checks are now implemented as functions in _syscheck.c_. They are implemented using a standard interface (see details in _syscheck.c_). To do this I moved the checking functions `linuxOvercommitMemoryValue()`, `THPIsEnabled()`, `linuxMadvFreeForkBugCheck()` out of _server.c_ and _latency.c_ and into the new _syscheck.c_. When moving these functions I made sure they don't depend on other functionality provided in _server.c_ and made them use a standard "check functions" interface. Specifically:
|   * I removed all logging from `linuxMadvFreeForkBugCheck()`. If there's an unexpected error during the check, it aborts as before, but without any logging. It returns 0, meaning the check did not complete.
|   * All these functions now return 1 on success, -1 on failure, and 0 if the check itself cannot be completed.
|   * The `linuxMadvFreeForkBugCheck()` function now internally calls `exit()` and not `exitFromChild()`, because the latter is only available in _server.c_ and I wanted to remove that dependency. This isn't an issue because we don't need to worry about the child process created by the test doing anything related to the rdb/aof files, which is why `exitFromChild()` was created.
| * This also fixes parsing of other /proc/\<pid\>/stat fields to correctly handle spaces in the process name and be more robust in general. Note that before this fix the rss info in `INFO memory` was corrupt in case of spaces in the process name. To recreate, just rename `redis-server` to `redis server`, start it, and run `INFO memory`.
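The /proc/\<pid\>/stat fix in the last bullet can be illustrated with a minimal sketch. It is not the code from the commit: the helper name `stat_field_ll` is hypothetical, and it assumes the usual Linux layout in which the process name is the second field, wrapped in parentheses. Searching for the last `)` instead of splitting on spaces is what makes a name like `redis server` harmless.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the commit): read the 1-based numeric
 * field `field` (field >= 4) from the text of /proc/<pid>/stat.
 * Returns 1 on success, 0 if the field is missing or not a number. */
int stat_field_ll(const char *line, int field, long long *out) {
    assert(line != NULL && out != NULL);
    if (field < 4) return 0; /* pid, comm and state are not handled here */

    /* The process name (field 2) may contain spaces and ')' characters,
     * so resume parsing after the LAST ')' in the line. */
    const char *p = strrchr(line, ')');
    if (p == NULL) return 0;

    /* p is at the end of field 2; each space we cross starts a new field. */
    for (int cur = 2; cur < field; cur++) {
        p = strchr(p, ' ');
        if (p == NULL) return 0;
        p++;
    }

    char *end;
    errno = 0;
    long long v = strtoll(p, &end, 10);
    if (end == p || errno == ERANGE) return 0;
    *out = v;
    return 1;
}
```

On Linux, field 24 of this file is the resident set size in pages, so a caller interested in bytes would still multiply the result by the page size.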
* zmalloc_get_rss implementation for haiku. (#10687) [David CARLIER, 2022-05-08, 1 file, -0/+17]
| Also fixes an "already defined constants" build warning while at it.
| Co-authored-by: Oran Agra <oran@redislabs.com>
* zmalloc_get_rss openbsd implementation (#10149) [David CARLIER, 2022-01-19, 1 file, -1/+7]
| Add support for getting the RSS in OpenBSD
* zmalloc_get_rss netbsd impl fix proposal. (#10116) [David CARLIER, 2022-01-16, 1 file, -2/+2]
| The previous implementation appears to have been broken (always returning 0). Since kinfo_proc2 is used, the KERN_PROC2 sysctl OID is more appropriate; the query's length was also not necessarily accurate (6 here).
* Added INFO LATENCYSTATS section: latency by percentile distribution/latency by cumulative distribution of latencies (#9462) [filipe oliveira, 2022-01-05, 1 file, -0/+14]
| # Short description
| The Redis extended latency stats track per-command latencies and enable:
| - exporting the per-command percentile distribution via the `INFO LATENCYSTATS` command. **(The percentile distribution is not mergeable between cluster nodes.)**
| - exporting the per-command cumulative latency distributions via the `LATENCY HISTOGRAM` command. Using the cumulative distribution of latencies we can merge several stats from different cluster nodes to calculate aggregate metrics.
|
| By default, the extended latency monitoring is enabled since the overhead of keeping track of the command latency is very small. If you don't want to track extended latency metrics, you can easily disable it at runtime using the command:
| - `CONFIG SET latency-tracking no`
|
| By default, the exported latency percentiles are the p50, p99, and p999. You can alter them at runtime using the command:
| - `CONFIG SET latency-tracking-info-percentiles "0.0 50.0 100.0"`
|
| ## Some details:
| - The total size per histogram should sit around 40 KiB. We only allocate those 40 KiB when a command is called for the first time.
| - With regard to the WRITE overhead: as seen below, there is no measurable overhead on the achievable ops/sec or full latency spectrum on the client. Including also the measured redis-benchmark for unstable vs this branch.
| - We track from 1 nanosecond to 1 second (everything above 1 second is considered +Inf).
|
| ## `INFO LATENCYSTATS` exposition format
| - Format: `latency_percentiles_usec_<CMDNAME>:p0=XX,p50....`
|
| ## `LATENCY HISTOGRAM [command ...]` exposition format
| Return a cumulative distribution of latencies in the format of a histogram for the specified command names. The histogram is composed of a map of time buckets:
| - Each representing a latency range, between 1 nanosecond and roughly 1 second.
| - Each bucket covers twice the previous bucket's range.
| - Empty buckets are not printed.
| - Everything above 1 sec is considered +Inf.
| - At max there will be log2(1000000000)=30 buckets
|
| We reply a map for each command in the format: `<command name> : { `calls`: <total command calls> , `histogram` : { <bucket 1> : latency , < bucket 2> : latency, ... } }`
|
| Co-authored-by: Oran Agra <oran@redislabs.com>
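The bucket arithmetic in the message above can be checked with a short sketch. It only illustrates power-of-two bucketing and makes no claim about how the histogram code used by redis lays out its buckets; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: smallest power of two that is >= ns, i.e. the upper
 * bound of the doubling bucket a latency of `ns` nanoseconds falls in. */
uint64_t bucket_upper_bound_ns(uint64_t ns) {
    assert(ns <= (UINT64_C(1) << 63)); /* keep the shift from overflowing */
    uint64_t bound = 1;
    while (bound < ns) bound <<= 1;
    return bound;
}

/* Hypothetical: how many doublings it takes, starting from 1, to reach
 * or pass `limit`. For limit >= 1 this is ceil(log2(limit)). */
int doublings_to_reach(uint64_t limit) {
    assert(limit <= (UINT64_C(1) << 63));
    int n = 0;
    for (uint64_t bound = 1; bound < limit; bound <<= 1) n++;
    return n;
}
```

`doublings_to_reach(1000000000)` evaluates to 30, which agrees with the log2(1000000000)=30 figure quoted in the message (log2 of 10^9 is about 29.9, rounded up).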
* Add --large-memory flag for REDIS_TEST to enable tests that consume more than 100mb (#9784) [sundb, 2021-11-16, 1 file, -2/+2]
| This is a preparation step in order to add a new test in quicklist.c, see #9776
* fix a compilation error around madvise when make with jemalloc on MacOS (#9350) [DarrenJiang13, 2021-08-10, 1 file, -1/+1]
| We only use MADV_DONTNEED on Linux, that's where it was tested.
* Use madvise(MADV_DONTNEED) to release memory to reduce COW (#8974) [Wang Yuan, 2021-08-04, 1 file, -1/+33]
| ## Background
| As we know, after `fork`, a process copies pages when it writes to them (CoW) while the other process still keeps the old pages, so together they cost more memory. For redis, we suffered from redis consuming a lot of memory while the fork child is serializing keys/values, which may even cause OOM.
| But in the redis fork child there is memory that the child doesn't need to keep and that the parent may write to or update. For example, the child will never access a key-value again once it has been serialized, but users may update it in the parent process. So we think it may reduce CoW if the child process releases memory that it no longer needs.
|
| ## Implementation
| To release a key-value in the child process, we could call `decrRefCount` to free the memory, but I found that the fork child still uses a lot of memory even when we don't write any data to redis, and that it takes much more time, which slows down bgsave. This is probably because the memory allocator doesn't really release memory to the OS, and it may modify some internal data for the free operation, especially when we free small objects.
| Moreover, CoW is page based, so a simple approach is to free only memory blocks that are at least the kernel page size. madvise(MADV_DONTNEED) can quickly release the pages of a specified region to the OS, bypassing the memory allocator; the allocator still considers this memory used and doesn't change its internal data.
|
| There are some buffers we can release in the fork child process:
| - **Serialized key-values**: the fork child process never accesses serialized key-values again, so we try to free them. Because we can only release big blocks of memory, and it is time consuming to iterate over all items/members/fields/entries of complex data types, we iterate over them and try to release them only when the average size of an item/member/field/entry is more than the OS page size.
| - **Replication backlog**: because the replication backlog is a circular buffer, it changes quickly if redis has heavy write traffic, but the fork child process doesn't need to access it.
| - **Client buffers**: if clients send requests while the fork child process exists, the clients' buffers also change frequently. The memory includes the client query buffer, the output buffer, and the memory used by the client struct.
|
| To get the child process's peak private dirty memory, we need to count peak memory instead of last used memory, because the child process may continue to release memory (since COW used to only grow till now, the last was equivalent to the peak). We're also adding a new `current_cow_peak` info variable (to complement the existing `current_cow_size`).
|
| Co-authored-by: Oran Agra <oran@redislabs.com>
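The page-granularity constraint described in this message can be sketched as follows. This is an illustration under stated assumptions, not the function added by the commit: the names `whole_pages_inside` and `release_whole_pages` are hypothetical, and the code simply shrinks a block to the whole pages it contains before calling `madvise()`.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* expose madvise() and MADV_DONTNEED on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: find the largest run of whole pages inside
 * [addr, addr+len). Stores its start in *start and returns its length
 * in bytes, or 0 when the block does not contain a full page.
 * `page` must be a power of two. */
size_t whole_pages_inside(uintptr_t addr, size_t len, size_t page,
                          uintptr_t *start) {
    assert(page != 0 && (page & (page - 1)) == 0);
    uintptr_t first = (addr + page - 1) & ~(uintptr_t)(page - 1); /* round up */
    uintptr_t last = (addr + len) & ~(uintptr_t)(page - 1);       /* round down */
    if (last <= first) return 0;
    *start = first;
    return (size_t)(last - first);
}

/* Hypothetical: hand the whole pages of a block back to the OS without
 * involving the allocator. Returns the number of bytes released. */
size_t release_whole_pages(void *ptr, size_t len) {
#if defined(__linux__) && defined(MADV_DONTNEED)
    uintptr_t start = 0;
    size_t span = whole_pages_inside((uintptr_t)ptr, len,
                                     (size_t)sysconf(_SC_PAGESIZE), &start);
    if (span == 0) return 0;
    if (madvise((void *)start, span, MADV_DONTNEED) != 0) return 0;
    return span;
#else
    (void)ptr;
    (void)len;
    return 0; /* this sketch only uses MADV_DONTNEED on Linux */
#endif
}
```

On Linux, anonymous private pages released this way read back as zeros on the next access, so a helper like this is only safe for data the calling process will never read again, which is the situation of the fork child described above.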
* Fix slowdown due to child reporting CoW. (#8645) [Yossi Gottlieb, 2021-03-22, 1 file, -0/+5]
| Reading CoW from /proc/<pid>/smaps can be slow with large processes on some platforms.
| This measures the time it takes to read CoW info and limits the duty cycle of future updates to roughly 1/100.
| As current_cow_size no longer represents a current, fixed interval value there is also a new current_cow_size_age field that provides information about the age of the size value, in seconds.
* Add run all test support with define REDIS_TEST (#8570) [sundb, 2021-03-10, 1 file, -1/+2]
| 1. Add `redis-server test all` support to run all tests.
| 2. Add redis test to daily ci.
| 3. Add `--accurate` option to run slow tests for more iterations (so that by default we run fewer cycles: shorter time, and fewer prints).
| 4. Move dict benchmark to REDIS_TEST.
| 5. fix some leaks in tests
| 6. make quicklist tests run on a specific fill set of options rather than huge ranges
| 7. move some prints in quicklist test outside their loops to reduce prints
| 8. removing sds.h from dict.c since it is now used in both redis-server and redis-cli (uses hiredis sds)
* Fix memory info on FreeBSD. (#8620) [Yossi Gottlieb, 2021-03-09, 1 file, -3/+3]
| The obtained process_rss was incorrect (the OS reports pages, not bytes), resulting in many other fields getting corrupted.
| This has been tested on FreeBSD but not other platforms.
* Cleanup usage of malloc_usable_size. (#8554) [Yossi Gottlieb, 2021-02-25, 1 file, -2/+8]
| * Add better control of malloc_usable_size() usage.
| * Use malloc_usable_size on alpine libc daily job.
| * Add no-malloc-usable-size daily jobs.
| * Fix zmalloc(0) when HAVE_MALLOC_SIZE is undefined. In order to align with the jemalloc behavior, this should never return NULL or OOM panic.
* Fix compile errors with no HAVE_MALLOC_SIZE. (#8533) [Yossi Gottlieb, 2021-02-23, 1 file, -5/+2]
| Also adds a new daily CI test, relying on the fact that we don't use malloc_size() on alpine libmusl.
| Fixes #8531
* Fix integer overflow (CVE-2021-21309). (#8522) [Yossi Gottlieb, 2021-02-22, 1 file, -0/+10]
| On 32-bit systems, setting the proto-max-bulk-len config parameter to a high value may result in an integer overflow and a subsequent heap overflow when parsing an input bulk (CVE-2021-21309).
| This fix has two parts:
| - Set a reasonable limit to the config parameter.
| - Add additional checks to prevent the problem in other potential but unknown code paths.
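The class of bug described here, where a requested length plus a small header wraps around `size_t`, can be guarded against as sketched below. This is a generic illustration and not the check added by the commit; `PREFIX_SIZE`, `padded_size` and `checked_alloc` are names assumed for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PREFIX_SIZE sizeof(size_t) /* assumed size of a per-block header */

/* Hypothetical: compute size + PREFIX_SIZE, refusing to wrap around.
 * Returns 1 and stores the sum in *total, or 0 on overflow. */
int padded_size(size_t size, size_t *total) {
    assert(total != NULL);
    if (size > SIZE_MAX - PREFIX_SIZE) return 0;
    *total = size + PREFIX_SIZE;
    return 1;
}

/* Hypothetical allocator front end: returns NULL instead of silently
 * allocating a tiny block when the padded request would overflow. */
void *checked_alloc(size_t size) {
    size_t total;
    if (!padded_size(size, &total)) return NULL;
    return malloc(total);
}
```

On a 32-bit build `SIZE_MAX` is about 4 GiB, so without the comparison a length just below that limit would wrap to a few bytes and the later copy would write past the end of the block.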
* Fix last COW INFO report, Skip test on non-linux platforms (#8301) [Oran Agra, 2021-01-08, 1 file, -8/+11]
| - the last COW report wasn't always read from the pipe (receiveLastChildInfo wasn't used)
| - but in fact, there's no reason we won't always try to drain that pipe so i'm unifying receiveLastChildInfo with receiveChildInfo
| - adjust threshold of the COW test when run in accurate mode
| - add some prints in case this test fails again
| - fix indentation, page size, and PID! in MacOS proc info
| p.s. it seems that pri_pages_dirtied is always 0
* Several (mostly Solaris-related) cleanups (#8171) [Yossi Gottlieb, 2020-12-13, 1 file, -8/+3]
| * Allow runtest-moduleapi to use a different 'make', for systems where GNU Make is 'gmake'.
| * Fix issue with builds on Solaris re-building everything from scratch due to CFLAGS/LDFLAGS not stored.
| * Fix compile failure on Solaris due to atomicvar and a bunch of warnings.
| * Fix garbled log timestamps on Solaris.
* Solaris based system rss size report. (#8138) [David CARLIER, 2020-12-06, 1 file, -0/+21]
|
* Sanitize dump payload: fail RESTORE if memory allocation fails [Oran Agra, 2020-12-06, 1 file, -64/+88]
| When RDB input attempts to make a huge memory allocation that fails, RESTORE should fail gracefully rather than die with panic
* DragonFlyBSD resident memory amount (almost) similar as FreeBSD. (#8023) [David CARLIER, 2020-11-08, 1 file, -1/+5]
|
* Fix wrong zmalloc_size() assumption. (#7963) [Yossi Gottlieb, 2020-10-26, 1 file, -3/+0]
| When using a system with no malloc_usable_size(), zmalloc_size() assumed that the heap allocator always returns blocks that are long-padded.
| This may not always be the case, and will result in zmalloc_size() returning a size that is bigger than allocated. At least in one case this leads to an out of bounds write, process crash and a potential security vulnerability.
| Effectively this does not affect the vast majority of users, who use jemalloc or glibc.
| This problem along with a (different) fix was reported by Drew DeVault.
* performance and memory reporting improvement - sds take control of its internal frag (#7875) [Oran Agra, 2020-10-02, 1 file, -1/+85]
| This commit has two aspects:
| 1) improve memory reporting for all the places that use sdsAllocSize to compute memory used by a string, in this case it'll include the internal fragmentation.
| 2) reduce the need for realloc calls by making the sds implicitly take over the internal fragmentation of the block it allocated.
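The second aspect, letting a caller use the slack the allocator hands out anyway, can be sketched with glibc's `malloc_usable_size()`. The wrapper name `alloc_usable` is hypothetical and the fallback branch is an assumption for platforms where that call is unavailable; it is not the interface added by the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#if defined(__GLIBC__)
#include <malloc.h> /* malloc_usable_size() */
#endif

/* Hypothetical: allocate `size` bytes and report in *usable how many
 * bytes the caller may actually use in the returned block. */
void *alloc_usable(size_t size, size_t *usable) {
    assert(usable != NULL);
    void *p = malloc(size);
    if (p == NULL) return NULL;
#if defined(__GLIBC__)
    *usable = malloc_usable_size(p); /* may exceed the requested size */
#else
    *usable = size; /* no way to ask: assume exactly what was requested */
#endif
    return p;
}
```

A string type could store `usable` as its capacity, so that later appends which fit in the slack need no `realloc` call and memory reports include the internal fragmentation.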
* getting rss size implementation for netbsd (#7293) [David CARLIER, 2020-09-29, 1 file, -0/+20]
|
* Implement redisAtomic to replace _Atomic C11 builtin (#7707) [Wang Yuan, 2020-09-17, 1 file, -2/+1]
| Redis 6.0 introduces I/O threads, which are cool and efficient, and we use C11 _Atomic to establish inter-thread synchronization without a mutex. But this means only compilers that support C11 _Atomic can compile redis, which brings a lot of inconvenience since some common platforms, such as CentOS7, don't support it by default. So we want to implement a redis atomic type to make it more portable.
|
| We had already implemented our own atomic variable for redis in src/atomicvar.h, but it only had 'relaxed' operations, so we implement some 'sequentially-consistent' operations, matching the default behavior of C11 _Atomic, which can establish inter-thread synchronization. And we replace all uses of C11 _Atomic with the redis atomic variable.
|
| Our implementation of the redis atomic variable uses C11 _Atomic, __atomic or __sync macros if available; it supports most common platforms, and we detect automatically which feature to use. In the Makefile we use a dummy file to detect whether the compiler supports C11 _Atomic. For gcc, redis can now theoretically be compiled if your gcc version is not less than 4.1.2 (which starts to support __sync_xxx operations). We also remove the mutex fallback implementation of the redis atomic variable, for performance and testing reasons. You will get compile errors if your compiler supports none of the features above.
|
| To cover the redis atomic variable in tests, we add CI jobs that build redis on CentOS6 and CentOS7, and daily workflow jobs that run the tests on them. For these, we just install the default gcc in order to cover different compiler versions: gcc is 4.4.7 by default on CentOS6 and 4.8.5 on CentOS7.
|
| We restore the ability to test redis with Helgrind to find data race errors. But you need to install Valgrind in the default path configuration before running your tests, since we use macros in helgrind.h to explicitly tell Helgrind about inter-thread happens-before relationships and avoid false positives. Please open an issue on github if you find data race errors related to this commit.
|
| Unrelated:
| - Fix redefinition of typedef 'RedisModuleUserChangedFunc'. Some old compiler versions report errors or warnings if we re-define a function type.
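The three-way feature detection described above can be sketched with the preprocessor. The macro and type names below (`counter_t`, `counter_incr`, `counter_get`) are invented for this example and are not the ones defined in src/atomicvar.h; the sketch also detects the features in the header itself rather than through the Makefile.

```c
#include <assert.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
/* Option 1: C11 atomics. */
#include <stdatomic.h>
typedef _Atomic long counter_t;
#define counter_incr(c, n) \
    atomic_fetch_add_explicit(&(c), (n), memory_order_seq_cst)
#define counter_get(c) atomic_load_explicit(&(c), memory_order_seq_cst)
#elif defined(__ATOMIC_SEQ_CST)
/* Option 2: GCC/Clang __atomic builtins. */
typedef long counter_t;
#define counter_incr(c, n) __atomic_fetch_add(&(c), (n), __ATOMIC_SEQ_CST)
#define counter_get(c) __atomic_load_n(&(c), __ATOMIC_SEQ_CST)
#else
/* Option 3: legacy __sync builtins (full barrier). */
typedef long counter_t;
#define counter_incr(c, n) __sync_fetch_and_add(&(c), (n))
#define counter_get(c) __sync_fetch_and_add(&(c), 0)
#endif

static counter_t used_memory = 0;

/* Add delta to the shared counter; returns the value after the update.
 * Every variant of counter_incr yields the value before the addition. */
long counter_add(long delta) {
    long before = counter_incr(used_memory, delta);
    return before + delta;
}

/* Read the shared counter with sequentially consistent ordering. */
long counter_read(void) {
    return counter_get(used_memory);
}
```

Callers only see `counter_add` and `counter_read`, so which of the three back ends was selected stays an implementation detail of the header.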
* Remove dead code from update_zmalloc_stat_alloc (#7589) [Oran Agra, 2020-07-31, 1 file, -11/+2]
| This seems like a leftover from before 6eb51bf
* Avoid collision with MacOS LIST_HEAD macro after #6384. [antirez, 2019-12-02, 1 file, -0/+8]
|
* Merge pull request #6384 from devnexen/apple_smaps_impl [Salvatore Sanfilippo, 2019-12-02, 1 file, -0/+21]
|\ | | | | Getting region date per process in Darwin
| * Adding AnonHugePages case + comments [David Carlier, 2019-09-20, 1 file, -2/+11]
| |
| * Getting region date per process in Darwin [David Carlier, 2019-09-15, 1 file, -0/+12]
| |
* | Merge remote-tracking branch 'antirez/unstable' into jemalloc_purge_bg [Oran Agra, 2019-10-04, 1 file, -0/+20]
|\ \ | |/
| * Updating resident memory request impl on FreeBSD. [David Carlier, 2019-07-28, 1 file, -0/+20]
| |
* | RED-31295 - redis: avoid race between dlopen and thread creation [Oran Agra, 2019-10-02, 1 file, -4/+2]
| | It seems that since I added the creation of the jemalloc thread redis sometimes fails to start with the following error:
| |
| | Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
| |
| | This seems to be due to a race bug in ld.so, in which TLS creation on the thread collides with dlopen.
| |
| | Move the creation of BIO and jemalloc threads to after modules are loaded.
| |
| | plus small bugfix when trying to disable the jemalloc thread at runtime
* | make redis purge jemalloc after flush, and enable background purging thread [Oran Agra, 2019-06-02, 1 file, -0/+34]
|/
| jemalloc 5 doesn't immediately release memory back to the OS, instead there's a decaying mechanism, which doesn't work when there's no traffic (no allocations). this is most evident if there's no traffic after flushdb, the RSS will remain high.
| 1) enable jemalloc background purging
| 2) explicitly purge in flushdb
* Alter coding style in #4696 to conform to Redis code base. [antirez, 2019-03-21, 1 file, -1/+1]
|
* Merge pull request #4696 from oranagra/zrealloc_fix [Salvatore Sanfilippo, 2019-03-21, 1 file, -0/+4]
|\ | | | | Fix zrealloc to behave similarly to je_realloc when size is 0
| * Fix zrealloc to behave similarly to je_realloc when size is 0 [Oran Agra, 2018-02-21, 1 file, -0/+4]
| | According to C11, the behavior of realloc with size 0 is now deprecated. It can either behave as free(ptr) and return NULL, or return a valid pointer. But in zmalloc it can lead to zmalloc_oom_handler and panic, and that can affect modules that use it.
| |
| | It looks like both the glibc allocator and jemalloc behave like so:
| |   realloc(malloc(32),0) returns NULL
| |   realloc(NULL,0) returns a valid pointer
| |
| | This commit changes zmalloc to behave the same
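The behavior described in this message can be sketched with a small wrapper. It is an illustration rather than the patched `zrealloc`: the name `xrealloc` is hypothetical, `abort()` stands in for an out-of-memory handler, and asking for one byte when both arguments are empty is a choice made here so the sketch does not depend on what the platform's `realloc(NULL, 0)` returns.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical wrapper that aborts on out-of-memory, like an allocator
 * with an OOM handler, but treats a zero-size request on a live pointer
 * as a free rather than as a failed allocation. */
void *xrealloc(void *ptr, size_t size) {
    if (size == 0 && ptr != NULL) {
        free(ptr);
        return NULL; /* not an error: the block was released */
    }
    /* realloc(NULL, 0) may legally return NULL, so request one byte to
     * always hand back a valid pointer in that case. */
    void *p = realloc(ptr, size ? size : 1);
    if (p == NULL) {
        fprintf(stderr, "out of memory allocating %zu bytes\n", size);
        abort();
    }
    return p;
}
```

The early return is the important part: without it, a NULL result from a zero-size `realloc` is indistinguishable from an allocation failure and would trigger the panic path.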
* | Fix incorrect memory usage accounting in zrealloc [Bruce Merry, 2018-09-30, 1 file, -2/+18]
| | When HAVE_MALLOC_SIZE is false, each call to zrealloc causes used_memory to increase by PREFIX_SIZE more than it should, due to mismatched accounting between the original zmalloc (which includes PREFIX_SIZE in its increment) and zrealloc (which misses it from its decrement).
| |
| | I've also supplied a command-line test to easily demonstrate the problem. It's not wired into the test framework, because I don't know TCL so I'm not sure how to automate it.
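The accounting rule at stake, that every increment and decrement must include the same header size, can be shown with a toy prefix allocator. All names here (`pmalloc`, `prealloc`, `pfree`, `pused`) are hypothetical, the counter is not thread safe, and overflow and alignment concerns are ignored to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PREFIX_SIZE sizeof(size_t) /* header storing the requested size */

static size_t used_memory = 0; /* not thread safe; illustration only */

void *pmalloc(size_t size) {
    size_t *hdr = malloc(size + PREFIX_SIZE);
    if (hdr == NULL) return NULL;
    *hdr = size;
    used_memory += size + PREFIX_SIZE;
    return (char *)hdr + PREFIX_SIZE;
}

void *prealloc(void *ptr, size_t size) {
    if (ptr == NULL) return pmalloc(size);
    size_t *old_hdr = (size_t *)((char *)ptr - PREFIX_SIZE);
    size_t old_size = *old_hdr;
    size_t *hdr = realloc(old_hdr, size + PREFIX_SIZE);
    if (hdr == NULL) return NULL;
    *hdr = size;
    /* The decrement includes PREFIX_SIZE, mirroring pmalloc's increment.
     * Dropping it here is the kind of mismatch the commit describes. */
    used_memory -= old_size + PREFIX_SIZE;
    used_memory += size + PREFIX_SIZE;
    return (char *)hdr + PREFIX_SIZE;
}

void pfree(void *ptr) {
    if (ptr == NULL) return;
    size_t *hdr = (size_t *)((char *)ptr - PREFIX_SIZE);
    used_memory -= *hdr + PREFIX_SIZE;
    free(hdr);
}

size_t pused(void) { return used_memory; }
```

A quick sanity check for any such scheme is that an allocate, resize, free sequence brings the counter back to its starting value.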
* | fix recursion typo in zmalloc_usable [Oran Agra, 2018-07-22, 1 file, -1/+1]
| |
* | slave buffers were wasteful and incorrectly counted causing eviction [Oran Agra, 2018-07-16, 1 file, -0/+3]
| | A) slave buffers didn't count internal fragmentation and sds unused space, this caused them to induce eviction although we didn't mean for it.
| | B) slave buffers were consuming about twice the memory of what they actually needed.
| | - this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time but networking.c not storing more than 16k (partially fixed recently in 237a38737).
| | - besides it wasn't able to store half of the new string into one buffer and the other half into the next (so the above mentioned fix helped mainly for small items).
| | - lastly, the sds buffers had up to 30% internal fragmentation that was wasted, consumed but not used.
| | C) inefficient performance due to starting from a small string and reallocing many times.
| |
| | what i changed:
| | - creating dedicated buffers for reply list, counting their size with zmalloc_size
| | - when creating a new reply node, preallocate it to at least 16k.
| | - when appending a new reply to the buffer, first fill all the unused space of the previous node before starting a new one.
| |
| | other changes:
| | - expose mem_not_counted_for_evict info field for the benefit of the test suite
| | - add a test to make sure slave buffers are counted correctly and that they don't cause eviction
* | Fix typo [Jack Drogon, 2018-07-03, 1 file, -1/+1]
| |
* | Fix update_zmalloc_stat_alloc in zrealloc [Fuxin Hao, 2018-06-14, 1 file, -1/+1]
| |
* | Merge pull request #4901 from KFilipek/zmalloc_typo_fix [Salvatore Sanfilippo, 2018-06-11, 1 file, -1/+1]
|\ \ | | | | | | HW_PHYSMEM typo in preprocessor condition
| * | Typo in preprocessor condition [Krzysztof Filipek, 2018-05-06, 1 file, -1/+1]
| | |
* | | include stdint.h for uint64_t definition [Remi Collet, 2018-05-30, 1 file, -0/+1]
| | |
* | | Active defrag fixes for 32bit builds [Oran Agra, 2018-05-17, 1 file, -1/+4]
|/ /
| | problems fixed:
| | * failing to read fragmentation information from jemalloc
| | * overflow in jemalloc fragmentation hint to the defragger
| | * test suite not triggering eviction after population
* | Adding real allocator fragmentation to INFO and MEMORY command + active defrag test [Oran Agra, 2018-03-12, 1 file, -3/+26]
|/
| other fixes / improvements:
| - LUA script memory isn't taken from zmalloc (taken from libc malloc) so it can cause high fragmentation ratio to be displayed (which is false)
| - there was a problem with "fragmentation" info being calculated from RSS and used_memory sampled at different times (now sampling them together)
|
| other details:
| - adding a few more allocator info fields to INFO and MEMORY commands
| - improve defrag test to measure defrag latency of big keys
| - increasing the accuracy of the defrag test (by looking at real frag info) this way we can use an even lower threshold and still avoid false positives
| - keep the old (total) "fragmentation" field unchanged, but add new ones for specific things
| - add these to the MEMORY DOCTOR command
| - deduct LUA memory from the rss in case of non jemalloc allocator (one for which we don't have "allocator active/used")
| - reduce sampling rate of the rss and allocator info
* zmalloc.c: remove thread safe mode, it's the default way. [antirez, 2017-05-09, 1 file, -21/+3]
|
* Simplify atomicvar.h usage by having the mutex name implicit. [antirez, 2017-05-04, 1 file, -3/+3]
|
* Fix preprocessor if/else chain broken in order to fix #3927. [antirez, 2017-04-11, 1 file, -0/+3]
|
* Fix zmalloc_get_memory_size() ifdefs to actually use the else branch. [antirez, 2017-04-11, 1 file, -2/+0]
| Close #3927.
* Defrag: activate it only if running modified version of Jemalloc. [antirez, 2017-01-10, 1 file, -2/+2]
| This commit also includes minor aesthetic changes like removal of trailing spaces.