| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR does 2 main things:
1) Add warning for suspected slow system clocksource setting. This is Linux specific.
2) Add a `--check-system` argument to redis which runs all system checks and prints a report.
## System checks
Add a command line option `--check-system` which runs all known system checks and provides
a report to stdout of which systems checks have failed with details on how to reconfigure the
system for optimized redis performance.
The `--system-check` mode exists with an appropriate error code after running all the checks.
## Slow clocksource details
We check the system's clocksource performance by running `clock_gettime()` in a loop and then
checking how much time was spent in a system call (via `getrusage()`). If we spend more than
10% of the time in the kernel then we print a warning. I verified that using the slow clock sources:
`acpi_pm` (~90% in the kernel on my laptop) and `xen` (~30% in the kernel on an ec2 `m4.large`)
we get this warning.
The check runs 5 system ticks so we can detect time spent in kernel at 20% jumps (0%,20%,40%...).
Anything more accurate will require the test to run longer. Typically 5 ticks are 50ms. This means
running the test on startup will delay startup by 50ms. To avoid this we make sure the test is only
executed in the `--check-system` mode.
For a quick startup check, we specifically warn if the we see the system is using the `xen` clocksource
which we know has bad performance and isn't recommended (at least on ec2). In such a case the
user should manually rung redis with `--check-system` to force the slower clocksource test described
above.
## Other changes in the PR
* All the system checks are now implemented as functions in _syscheck.c_.
They are implemented using a standard interface (see details in _syscheck.c_).
To do this I moved the checking functions `linuxOvercommitMemoryValue()`,
`THPIsEnabled()`, `linuxMadvFreeForkBugCheck()` out of _server.c_ and _latency.c_
and into the new _syscheck.c_. When moving these functions I made sure they don't
depend on other functionality provided in _server.c_ and made them use a standard
"check functions" interface. Specifically:
* I removed all logging out of `linuxMadvFreeForkBugCheck()`. In case there's some
unexpected error during the check aborts as before, but without any logging.
It returns an error code 0 meaning the check didn't not complete.
* All these functions now return 1 on success, -1 on failure, 0 in case the check itself
cannot be completed.
* The `linuxMadvFreeForkBugCheck()` function now internally calls `exit()` and not
`exitFromChild()` because the latter is only available in _server.c_ and I wanted to
remove that dependency. This isn't an because we don't need to worry about the
child process created by the test doing anything related to the rdb/aof files which
is why `exitFromChild()` was created.
* This also fixes parsing of other /proc/\<pid\>/stat fields to correctly handle spaces
in the process name and be more robust in general. Not that before this fix the rss
info in `INFO memory` was corrupt in case of spaces in the process name. To
recreate just rename `redis-server` to `redis server`, start it, and run `INFO memory`.
|
|
|
|
|
| |
also fixing already defined constants build warning while at it.
Co-authored-by: Oran Agra <oran@redislabs.com>
|
|
|
| |
Add support for getting the RSS in OpenBSD
|
|
|
|
|
|
| |
Seems like the previous implementation was broken (always returning 0)
since kinfo_proc2 is used the KERN_PROC2 sysctl oid is more appropriate
and also the query's length was not necessarily accurate (6 here).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
by cumulative distribution of latencies (#9462)
# Short description
The Redis extended latency stats track per command latencies and enables:
- exporting the per-command percentile distribution via the `INFO LATENCYSTATS` command.
**( percentile distribution is not mergeable between cluster nodes ).**
- exporting the per-command cumulative latency distributions via the `LATENCY HISTOGRAM` command.
Using the cumulative distribution of latencies we can merge several stats from different cluster nodes
to calculate aggregate metrics .
By default, the extended latency monitoring is enabled since the overhead of keeping track of the
command latency is very small.
If you don't want to track extended latency metrics, you can easily disable it at runtime using the command:
- `CONFIG SET latency-tracking no`
By default, the exported latency percentiles are the p50, p99, and p999.
You can alter them at runtime using the command:
- `CONFIG SET latency-tracking-info-percentiles "0.0 50.0 100.0"`
## Some details:
- The total size per histogram should sit around 40 KiB. We only allocate those 40KiB when a command
was called for the first time.
- With regards to the WRITE overhead As seen below, there is no measurable overhead on the achievable
ops/sec or full latency spectrum on the client. Including also the measured redis-benchmark for unstable
vs this branch.
- We track from 1 nanosecond to 1 second ( everything above 1 second is considered +Inf )
## `INFO LATENCYSTATS` exposition format
- Format: `latency_percentiles_usec_<CMDNAME>:p0=XX,p50....`
## `LATENCY HISTOGRAM [command ...]` exposition format
Return a cumulative distribution of latencies in the format of a histogram for the specified command names.
The histogram is composed of a map of time buckets:
- Each representing a latency range, between 1 nanosecond and roughly 1 second.
- Each bucket covers twice the previous bucket's range.
- Empty buckets are not printed.
- Everything above 1 sec is considered +Inf.
- At max there will be log2(1000000000)=30 buckets
We reply a map for each command in the format:
`<command name> : { `calls`: <total command calls> , `histogram` : { <bucket 1> : latency , < bucket 2> : latency, ... } }`
Co-authored-by: Oran Agra <oran@redislabs.com>
|
|
|
|
|
| |
than 100mb (#9784)
This is a preparation step in order to add a new test in quicklist.c see #9776
|
|
|
| |
We only use MADV_DONTNEED on Linux, that's were it was tested.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## Backgroud
As we know, after `fork`, one process will copy pages when writing data to these
pages(CoW), and another process still keep old pages, they totally cost more memory.
For redis, we suffered that redis consumed much memory when the fork child is serializing
key/values, even that maybe cause OOM.
But actually we find, in redis fork child process, the child process don't need to keep some
memory and parent process may write or update that, for example, child process will never
access the key-value that is serialized but users may update it in parent process.
So we think it may reduce COW if the child process release memory that it is not needed.
## Implementation
For releasing key value in child process, we may think we call `decrRefCount` to free memory,
but i find the fork child process still use much memory when we don't write any data to redis,
and it costs much more time that slows down bgsave. Maybe because memory allocator doesn't
really release memory to OS, and it may modify some inner data for this free operation, especially
when we free small objects.
Moreover, CoW is based on pages, so it is a easy way that we only free the memory bulk that is
not less than kernel page size. madvise(MADV_DONTNEED) can quickly release specified region
pages to OS bypassing memory allocator, and allocator still consider that this memory still is used
and don't change its inner data.
There are some buffers we can release in the fork child process:
- **Serialized key-values**
the fork child process never access serialized key-values, so we try to free them.
Because we only can release big bulk memory, and it is time consumed to iterate all
items/members/fields/entries of complex data type. So we decide to iterate them and
try to release them only when their average size of item/member/field/entry is more
than page size of OS.
- **Replication backlog**
Because replication backlog is a cycle buffer, it will be changed quickly if redis has heavy
write traffic, but in fork child process, we don't need to access that.
- **Client buffers**
If clients have requests during having the fork child process, clients' buffer also be changed
frequently. The memory includes client query buffer, output buffer, and client struct used memory.
To get child process peak private dirty memory, we need to count peak memory instead
of last used memory, because the child process may continue to release memory (since
COW used to only grow till now, the last was equivalent to the peak).
Also we're adding a new `current_cow_peak` info variable (to complement the existing
`current_cow_size`)
Co-authored-by: Oran Agra <oran@redislabs.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Reading CoW from /proc/<pid>/smaps can be slow with large processes on
some platforms.
This measures the time it takes to read CoW info and limits the duty
cycle of future updates to roughly 1/100.
As current_cow_size no longer represnets a current, fixed interval value
there is also a new current_cow_size_age field that provides information
about the age of the size value, in seconds.
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Add `redis-server test all` support to run all tests.
2. Add redis test to daily ci.
3. Add `--accurate` option to run slow tests for more iterations (so that
by default we run less cycles (shorter time, and less prints).
4. Move dict benchmark to REDIS_TEST.
5. fix some leaks in tests
6. make quicklist tests run on a specific fill set of options rather than huge ranges
7. move some prints in quicklist test outside their loops to reduce prints
8. removing sds.h from dict.c since it is now used in both redis-server and
redis-cli (uses hiredis sds)
|
|
|
|
|
|
| |
The obtained process_rss was incorrect (the OS reports pages, not
bytes), resulting with many other fields getting corrupted.
This has been tested on FreeBSD but not other platforms.
|
|
|
|
|
|
|
|
|
| |
* Add better control of malloc_usable_size() usage.
* Use malloc_usable_size on alpine libc daily job.
* Add no-malloc-usable-size daily jobs.
* Fix zmalloc(0) when HAVE_MALLOC_SIZE is undefined.
In order to align with the jemalloc behavior, this should never return
NULL or OOM panic.
|
|
|
|
|
| |
Also adds a new daily CI test, relying on the fact that we don't use malloc_size() on alpine libmusl.
Fixes #8531
|
|
|
|
|
|
|
|
| |
On 32-bit systems, setting the proto-max-bulk-len config parameter to a high value may result with integer overflow and a subsequent heap overflow when parsing an input bulk (CVE-2021-21309).
This fix has two parts:
Set a reasonable limit to the config parameter.
Add additional checks to prevent the problem in other potential but unknown code paths.
|
|
|
|
|
|
|
|
|
|
|
| |
- the last COW report wasn't always read from the pipe
(receiveLastChildInfo wasn't used)
- but in fact, there's no reason we won't always try to drain that pipe
so i'm unifying receiveLastChildInfo with receiveChildInfo
- adjust threshold of the COW test when run in accurate mode
- add some prints in case this test fails again
- fix indentation, page size, and PID! in MacOS proc info
p.s. it seems that pri_pages_dirtied is always 0
|
|
|
|
|
|
| |
* Allow runtest-moduleapi use a different 'make', for systems where GNU Make is 'gmake'.
* Fix issue with builds on Solaris re-building everything from scratch due to CFLAGS/LDFLAGS not stored.
* Fix compile failure on Solaris due to atomicvar and a bunch of warnings.
* Fix garbled log timestamps on Solaris.
|
| |
|
|
|
|
|
| |
When RDB input attempts to make a huge memory allocation that fails,
RESTORE should fail gracefully rather than die with panic
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When using a system with no malloc_usable_size(), zmalloc_size() assumed
that the heap allocator always returns blocks that are long-padded.
This may not always be the case, and will result with zmalloc_size()
returning a size that is bigger than allocated. At least in one case
this leads to out of bound write, process crash and a potential security
vulnerability.
Effectively this does not affect the vast majority of users, who use
jemalloc or glibc.
This problem along with a (different) fix was reported by Drew DeVault.
|
|
|
|
|
|
|
|
|
| |
internal frag (#7875)
This commit has two aspects:
1) improve memory reporting for all the places that use sdsAllocSize to compute
memory used by a string, in this case it'll include the internal fragmentation.
2) reduce the need for realloc calls by making the sds implicitly take over
the internal fragmentation of the block it allocated.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Redis 6.0 introduces I/O threads, it is so cool and efficient, we use C11
_Atomic to establish inter-thread synchronization without mutex. But the
compiler that must supports C11 _Atomic can compile redis code, that brings a
lot of inconvenience since some common platforms can't support by default such
as CentOS7, so we want to implement redis atomic type to make it more portable.
We have implemented our atomic variable for redis that only has 'relaxed'
operations in src/atomicvar.h, so we implement some operations with
'sequentially-consistent', just like the default behavior of C11 _Atomic that
can establish inter-thread synchronization. And we replace all uses of C11
_Atomic with redis atomic variable.
Our implementation of redis atomic variable uses C11 _Atomic, __atomic or
__sync macros if available, it supports most common platforms, and we will
detect automatically which feature we use. In Makefile we use a dummy file to
detect if the compiler supports C11 _Atomic. Now for gcc, we can compile redis
code theoretically if your gcc version is not less than 4.1.2(starts to support
__sync_xxx operations). Otherwise, we remove use mutex fallback to implement
redis atomic variable for performance and test. You will get compiling errors
if your compiler doesn't support all features of above.
For cover redis atomic variable tests, we add other CI jobs that build redis on
CentOS6 and CentOS7 and workflow daily jobs that run the tests on them.
For them, we just install gcc by default in order to cover different compiler
versions, gcc is 4.4.7 by default installation on CentOS6 and 4.8.5 on CentOS7.
We restore the feature that we can test redis with Helgrind to find data race
errors. But you need install Valgrind in the default path configuration firstly
before running your tests, since we use macros in helgrind.h to tell Helgrind
inter-thread happens-before relationship explicitly for avoiding false positives.
Please open an issue on github if you find data race errors relate to this commit.
Unrelated:
- Fix redefinition of typedef 'RedisModuleUserChangedFunc'
For some old version compilers, they will report errors or warnings, if we
re-define function type.
|
|
|
| |
this seems like leftover from before 6eb51bf
|
| |
|
|\
| |
| | |
Getting region date per process in Darwin
|
| | |
|
| | |
|
|\ \
| |/ |
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
It seeems that since I added the creation of the jemalloc thread redis
sometimes fails to start with the following error:
Inconsistency detected by ld.so: dl-tls.c: 493: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
This seems to be due to a race bug in ld.so, in which TLS creation on the
thread, collide with dlopen.
Move the creation of BIO and jemalloc threads to after modules are loaded.
plus small bugfix when trying to disable the jemalloc thread at runtime
|
|/
|
|
|
|
|
|
|
| |
jemalloc 5 doesn't immediately release memory back to the OS, instead there's a decaying
mechanism, which doesn't work when there's no traffic (no allocations).
this is most evident if there's no traffic after flushdb, the RSS will remain high.
1) enable jemalloc background purging
2) explicitly purge in flushdb
|
| |
|
|\
| |
| | |
Fix zrealloc to behave similarly to je_realloc when size is 0
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
According to C11, the behavior of realloc with size 0 is now deprecated.
it can either behave as free(ptr) and return NULL, or return a valid pointer.
but in zmalloc it can lead to zmalloc_oom_handler and panic.
and that can affect modules that use it.
It looks like both glibc allocator and jemalloc behave like so:
realloc(malloc(32),0) returns NULL
realloc(NULL,0) returns a valid pointer
This commit changes zmalloc to behave the same
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When HAVE_MALLOC_SIZE is false, each call to zrealloc causes used_memory
to increase by PREFIX_SIZE more than it should, due to mis-matched
accounting between the original zmalloc (which includes PREFIX size in
its increment) and zrealloc (which misses it from its decrement).
I've also supplied a command-line test to easily demonstrate the
problem. It's not wired into the test framework, because I don't know
TCL so I'm not sure how to automate it.
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
A) slave buffers didn't count internal fragmentation and sds unused space,
this caused them to induce eviction although we didn't mean for it.
B) slave buffers were consuming about twice the memory of what they actually needed.
- this was mainly due to sdsMakeRoomFor growing to twice as much as needed each time
but networking.c not storing more than 16k (partially fixed recently in 237a38737).
- besides it wasn't able to store half of the new string into one buffer and the
other half into the next (so the above mentioned fix helped mainly for small items).
- lastly, the sds buffers had up to 30% internal fragmentation that was wasted,
consumed but not used.
C) inefficient performance due to starting from a small string and reallocing many times.
what i changed:
- creating dedicated buffers for reply list, counting their size with zmalloc_size
- when creating a new reply node from, preallocate it to at least 16k.
- when appending a new reply to the buffer, first fill all the unused space of the
previous node before starting a new one.
other changes:
- expose mem_not_counted_for_evict info field for the benefit of the test suite
- add a test to make sure slave buffers are counted correctly and that they don't cause eviction
|
| | |
|
| | |
|
|\ \
| | |
| | | |
HW_PHYSMEM typo in preprocessor condition
|
| | | |
|
| | | |
|
|/ /
| |
| |
| |
| |
| |
| | |
problems fixed:
* failing to read fragmentation information from jemalloc
* overflow in jemalloc fragmentation hint to the defragger
* test suite not triggering eviction after population
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
defrag test
other fixes / improvements:
- LUA script memory isn't taken from zmalloc (taken from libc malloc)
so it can cause high fragmentation ratio to be displayed (which is false)
- there was a problem with "fragmentation" info being calculated from
RSS and used_memory sampled at different times (now sampling them together)
other details:
- adding a few more allocator info fields to INFO and MEMORY commands
- improve defrag test to measure defrag latency of big keys
- increasing the accuracy of the defrag test (by looking at real grag info)
this way we can use an even lower threshold and still avoid false positives
- keep the old (total) "fragmentation" field unchanged, but add new ones for spcific things
- add these the MEMORY DOCTOR command
- deduct LUA memory from the rss in case of non jemalloc allocator (one for which we don't "allocator active/used")
- reduce sampling rate of the rss and allocator info
|
| |
|
| |
|
| |
|
|
|
|
| |
Close #3927.
|
|
|
|
|
| |
This commit also includes minor aesthetic changes like removal of
trailing spaces.
|