path: root/src/latency.h
Commit message | Author | Age | Files | Lines
* Add basic eventloop latency measurement. (#11963) | Chen Tianjie | 2023-05-12 | 1 | -0/+16

  The measured latency (duration) includes the fields listed below, which are shown by `INFO STATS`.

  ```
  eventloop_cycles                        // ever increasing counter
  eventloop_duration_sum                  // cumulative duration of eventloop in microseconds
  eventloop_duration_cmd_sum              // cumulative duration of executing commands in microseconds
  instantaneous_eventloop_cycles_per_sec  // average eventloop count per second in recent 1.6s
  instantaneous_eventloop_duration_usec   // average single eventloop duration in recent 1.6s
  ```

  Also added some experimental metrics, which are shown only when `INFO DEBUG` is called. This section isn't included in the default INFO, or even in `INFO ALL`, and the fields in this section can change in the future without considering backwards compatibility.

  ```
  eventloop_duration_aof_sum              // cumulative duration of writing AOF
  eventloop_duration_cron_sum             // cumulative duration of cron jobs (serverCron, beforeSleep excluding IO and AOF)
  eventloop_cmd_per_cycle_max             // max number of commands executed in one eventloop
  eventloop_duration_max                  // max duration of one eventloop
  ```

  All of these are reset by CONFIG RESETSTAT.
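The two cumulative counters above can be combined into a lifetime average duration per eventloop cycle. The following is a minimal sketch, assuming the `INFO STATS` reply has already been fetched into a string of `name:value` lines; the helper names `info_field` and `avg_eventloop_usec` are hypothetical and not part of Redis.

```c
#include <stdlib.h>
#include <string.h>

/* Look up "name:value" at the start of a line in INFO-style text.
 * Returns 1 and stores the value on success, 0 if the field is missing.
 * (Hypothetical helper, not a Redis function.) */
static int info_field(const char *info, const char *name, long long *out) {
    size_t len = strlen(name);
    const char *line = info;
    while (line && *line) {
        /* Require ':' right after the name so that a shorter name never
         * matches a longer field that merely starts with it. */
        if (strncmp(line, name, len) == 0 && line[len] == ':') {
            *out = strtoll(line + len + 1, NULL, 10);
            return 1;
        }
        line = strchr(line, '\n');
        if (line) line++; /* step past the newline to the next line */
    }
    return 0;
}

/* Lifetime average duration of one eventloop cycle in microseconds,
 * or -1 if a field is missing or no cycle has been counted yet. */
static long long avg_eventloop_usec(const char *info) {
    long long cycles, sum;
    if (!info_field(info, "eventloop_cycles", &cycles)) return -1;
    if (!info_field(info, "eventloop_duration_sum", &sum)) return -1;
    if (cycles <= 0) return -1;
    return sum / cycles;
}
```

Because both counters are reset by CONFIG RESETSTAT, the average only covers the period since the last reset; sampling the counters twice and dividing the deltas would give an average over a chosen window instead.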
* Add warning for suspected slow system clocksource setting (#10636) | yoav-steinberg | 2022-05-22 | 1 | -2/+0

  This PR does 2 main things:
  1) Add a warning for a suspected slow system clocksource setting. This is Linux specific.
  2) Add a `--check-system` argument to redis which runs all system checks and prints a report.

  ## System checks
  Add a command line option `--check-system` which runs all known system checks and prints a report to stdout of which system checks have failed, with details on how to reconfigure the system for optimized redis performance. The `--check-system` mode exits with an appropriate error code after running all the checks.

  ## Slow clocksource details
  We check the system's clocksource performance by running `clock_gettime()` in a loop and then checking how much time was spent in a system call (via `getrusage()`). If we spend more than 10% of the time in the kernel then we print a warning. I verified that using the slow clock sources `acpi_pm` (~90% in the kernel on my laptop) and `xen` (~30% in the kernel on an ec2 `m4.large`) we get this warning. The check runs for 5 system ticks, so we can detect time spent in the kernel in 20% jumps (0%, 20%, 40%...). Anything more accurate would require the test to run longer. Typically 5 ticks are 50ms. This means running the test on startup would delay startup by 50ms. To avoid this we make sure the test is only executed in the `--check-system` mode. For a quick startup check, we specifically warn if we see the system is using the `xen` clocksource, which we know has bad performance and isn't recommended (at least on ec2). In such a case the user should manually run redis with `--check-system` to force the slower clocksource test described above.

  ## Other changes in the PR
  * All the system checks are now implemented as functions in _syscheck.c_. They are implemented using a standard interface (see details in _syscheck.c_). To do this I moved the checking functions `linuxOvercommitMemoryValue()`, `THPIsEnabled()`, `linuxMadvFreeForkBugCheck()` out of _server.c_ and _latency.c_ and into the new _syscheck.c_. When moving these functions I made sure they don't depend on other functionality provided in _server.c_ and made them use a standard "check functions" interface. Specifically:
    * I removed all logging from `linuxMadvFreeForkBugCheck()`. If an unexpected error occurs during the check, it aborts as before, but without any logging. It returns 0, meaning the check did not complete.
    * All these functions now return 1 on success, -1 on failure, 0 in case the check itself cannot be completed.
    * The `linuxMadvFreeForkBugCheck()` function now internally calls `exit()` and not `exitFromChild()` because the latter is only available in _server.c_ and I wanted to remove that dependency. This isn't an issue because we don't need to worry about the child process created by the test doing anything related to the rdb/aof files, which is why `exitFromChild()` was created.
  * This also fixes parsing of other /proc/\<pid\>/stat fields to correctly handle spaces in the process name and be more robust in general. Note that before this fix the rss info in `INFO memory` was corrupt in case of spaces in the process name. To recreate, just rename `redis-server` to `redis server`, start it, and run `INFO memory`.
* Disable THP if enabled (#7381) | zhenwei pi | 2020-10-27 | 1 | -0/+1

  In case redis starts and finds that THP is enabled ("always"), instead of printing a log message, which might go unnoticed, redis will try to disable it (just for the redis process).

  Note: it looks like on self-built kernels THP is likely to be set to "always" by default.

  Some discussion of the THP side effect on Linux: according to http://www.antirez.com/news/84, redis latency spikes are caused by the Linux kernel THP feature. I have tested on E3-2650 v3, and found that a 2M huge page costs about 0.25ms to fix a COW page fault.

  Add a new config 'disable-thp'. The recommended setting is 'yes' (the default), in which case redis tries to disable THP via the prctl syscall. Users who really want THP can set it to "no".

  Thanks to Oran & Yossi for suggestions.

  Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
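Detecting the "always" mode and disabling THP for the current process can be sketched as below. This is a minimal illustration, not the code added by this commit: the function names are hypothetical, and the sketch assumes the caller has already read the contents of `/sys/kernel/mm/transparent_hugepage/enabled`, where the kernel marks the active mode with square brackets. The per-process switch uses `prctl()` with `PR_SET_THP_DISABLE`, which is only available on Linux kernels that define it.

```c
#include <string.h>
#if defined(__linux__)
#include <sys/prctl.h>
#endif

/* The sysfs file lists all modes and brackets the active one,
 * e.g. "[always] madvise never". */
static int thp_mode_is_always(const char *sysfs_text) {
    return strstr(sysfs_text, "[always]") != NULL;
}

/* Ask the kernel to disable THP for this process only.
 * Returns 1 on success, 0 if the request failed or is unsupported here.
 * (Hypothetical wrapper, named for illustration only.) */
static int thp_disable_for_process(void) {
#if defined(__linux__) && defined(PR_SET_THP_DISABLE)
    return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) == 0;
#else
    return 0;
#endif
}
```

A caller following the behavior described above would check the mode at startup and, when the config allows it, call the wrapper instead of only logging a warning.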
* Module API for LatencyAddSample | Oran Agra | 2019-10-24 | 1 | -1/+1
* Separate latency monitoring of eviction loop and eviction DELs. | antirez | 2015-02-11 | 1 | -0/+4
* Check THP support at startup and warn about it. | antirez | 2014-11-12 | 1 | -0/+1
* LATENCY DOCTOR: initial draft and events summary output. | antirez | 2014-07-08 | 1 | -0/+1
* Latency: low level time series analysis implemented. | antirez | 2014-07-07 | 1 | -0/+10
* latencyStartMonitor() modified to avoid warnings. | antirez | 2014-07-02 | 1 | -0/+2
* latencyTimeSeries structure max field type fixed. | antirez | 2014-07-02 | 1 | -1/+1
* License added to latency.h. | antirez | 2014-07-02 | 1 | -0/+33
* Latency monitor: command duration is in useconds. Convert. | antirez | 2014-07-01 | 1 | -2/+2
* Latency monitor: collect slow commands. | antirez | 2014-07-01 | 1 | -1/+1

  We introduce the distinction between slow and fast commands since those are two different sources of latency. An O(1) or O(log N) command without side effects (one that can't trigger deletion of large objects as a side effect of its execution), if delayed, is a symptom of inherent latency of the system. A non-fast command (one that may run large O(N) computations), if delayed, may just mean that the user is executing slow operations. The advice LATENCY should provide in these two cases is different, so we log the two classes of commands separately.
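The split between the two command classes can be illustrated with a small sketch. Everything here is hypothetical and only mirrors the idea in the message: a delayed command is recorded under one of two separate classes depending on whether it is flagged as fast, and only when its duration reaches the monitoring threshold. The struct, the function names, and the millisecond threshold are assumptions for illustration; the microsecond-to-millisecond conversion reflects the "command duration is in useconds" commit above.

```c
/* Hypothetical recorder state: one sample counter per command class. */
struct latency_counts {
    int command;      /* delayed non-fast commands, possibly O(N) work */
    int fast_command; /* delayed O(1) / O(log N) commands */
};

/* Command durations are measured in microseconds; the threshold in this
 * sketch is assumed to be in milliseconds, so convert before comparing. */
static long long usec_to_msec(long long usec) {
    return usec / 1000;
}

/* Record a sample in the matching class if the command was slow enough.
 * Returns 1 if a sample was recorded, 0 otherwise.
 * A threshold of 0 or less is treated as "monitoring disabled". */
static int record_if_slow(struct latency_counts *c, int is_fast,
                          long long duration_us, long long threshold_ms) {
    if (threshold_ms <= 0) return 0;
    if (usec_to_msec(duration_us) < threshold_ms) return 0;
    if (is_fast)
        c->fast_command++; /* points at inherent system latency */
    else
        c->command++;      /* may just be an expensive operation */
    return 1;
}
```

Keeping the two counters apart is what lets a later report give different advice: many delayed fast commands suggest a system-level problem, while delayed slow commands may only reflect the workload.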
* Latency monitor: basic samples collection. | antirez | 2014-07-01 | 1 | -0/+42