diff options
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 35 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/index.rst | 1 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/lru_sort.rst | 294 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/reclaim.rst | 6 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/usage.rst | 2 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/index.rst | 1 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/shrinker_debugfs.rst | 135 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/userfaultfd.rst | 40 | ||||
-rw-r--r-- | Documentation/admin-guide/sysctl/vm.rst | 8 |
9 files changed, 499 insertions, 23 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3ebc22e524b0..c408070f30a3 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1667,6 +1667,19 @@ hlt [BUGS=ARM,SH] + hostname= [KNL] Set the hostname (aka UTS nodename). + Format: <string> + This allows setting the system's hostname during early + startup. This sets the name returned by gethostname. + Using this parameter to set the hostname makes it + possible to ensure the hostname is correctly set before + any userspace processes run, avoiding the possibility + that a process may call gethostname before the hostname + has been explicitly set, resulting in the calling + process getting an incorrect result. The string must + not exceed the maximum allowed hostname length (usually + 64 characters) and will be truncated otherwise. + hpet= [X86-32,HPET] option to control HPET usage Format: { enable (default) | disable | force | verbose } @@ -1722,9 +1735,11 @@ Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y, the default is on. - This is not compatible with memory_hotplug.memmap_on_memory. - If both parameters are enabled, hugetlb_free_vmemmap takes - precedence over memory_hotplug.memmap_on_memory. + Note that the vmemmap pages may be allocated from the added + memory block itself when memory_hotplug.memmap_on_memory is + enabled, those vmemmap pages cannot be optimized even if this + feature is enabled. Other vmemmap pages not allocated from + the added memory block itself do not be affected. hung_task_panic= [KNL] Should the hung task detector generate panics. @@ -3083,10 +3098,12 @@ [KNL,X86,ARM] Boolean flag to enable this feature. Format: {on | off (default)} When enabled, runtime hotplugged memory will - allocate its internal metadata (struct pages) - from the hotadded memory which will allow to - hotadd a lot of memory without requiring - additional memory to do so. + allocate its internal metadata (struct pages, + those vmemmap pages cannot be optimized even + if hugetlb_free_vmemmap is enabled) from the + hotadded memory which will allow to hotadd a + lot of memory without requiring additional + memory to do so. This feature is disabled by default because it has some implication on large (e.g. GB) allocations in some configurations (e.g. small @@ -3096,10 +3113,6 @@ Note that even when enabled, there are a few cases where the feature is not effective. - This is not compatible with hugetlb_free_vmemmap. If - both parameters are enabled, hugetlb_free_vmemmap takes - precedence over memory_hotplug.memmap_on_memory. - memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest Format: <integer> default : 0 <disable> diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index c4681fa69b9c..05500042f777 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -14,3 +14,4 @@ optimize those. start usage reclaim + lru_sort diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst new file mode 100644 index 000000000000..c09cace80651 --- /dev/null +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -0,0 +1,294 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +DAMON-based LRU-lists Sorting +============================= + +DAMON-based LRU-lists Sorting (DAMON_LRU_SORT) is a static kernel module that +aimed to be used for proactive and lightweight data access pattern based +(de)prioritization of pages on their LRU-lists for making LRU-lists a more +trusworthy data access pattern source. + +Where Proactive LRU-lists Sorting is Required? +============================================== + +As page-granularity access checking overhead could be significant on huge +systems, LRU lists are normally not proactively sorted but partially and +reactively sorted for special events including specific user requests, system +calls and memory pressure. As a result, LRU lists are sometimes not so +perfectly prepared to be used as a trustworthy access pattern source for some +situations including reclamation target pages selection under sudden memory +pressure. + +Because DAMON can identify access patterns of best-effort accuracy while +inducing only user-specified range of overhead, proactively running +DAMON_LRU_SORT could be helpful for making LRU lists more trustworthy access +pattern source with low and controlled overhead. + +How It Works? +============= + +DAMON_LRU_SORT finds hot pages (pages of memory regions that showing access +rates that higher than a user-specified threshold) and cold pages (pages of +memory regions that showing no access for a time that longer than a +user-specified threshold) using DAMON, and prioritizes hot pages while +deprioritizing cold pages on their LRU-lists. To avoid it consuming too much +CPU for the prioritizations, a CPU time usage limit can be configured. Under +the limit, it prioritizes and deprioritizes more hot and cold pages first, +respectively. System administrators can also configure under what situation +this scheme should automatically activated and deactivated with three memory +pressure watermarks. + +Its default parameters for hotness/coldness thresholds and CPU quota limit are +conservatively chosen. That is, the module under its default parameters could +be widely used without harm for common situations while providing a level of +benefits for systems having clear hot/cold access patterns under memory +pressure while consuming only a limited small portion of CPU time. + +Interface: Module Parameters +============================ + +To use this feature, you should first ensure your system is running on a kernel +that is built with ``CONFIG_DAMON_LRU_SORT=y``. + +To let sysadmins enable or disable it and tune for the given system, +DAMON_LRU_SORT utilizes module parameters. That is, you can put +``damon_lru_sort.<parameter>=<value>`` on the kernel boot command line or write +proper values to ``/sys/modules/damon_lru_sort/parameters/<parameter>`` files. + +Below are the description of each parameter. + +enabled +------- + +Enable or disable DAMON_LRU_SORT. + +You can enable DAMON_LRU_SORT by setting the value of this parameter as ``Y``. +Setting it as ``N`` disables DAMON_LRU_SORT. Note that DAMON_LRU_SORT could do +no real monitoring and LRU-lists sorting due to the watermarks-based activation +condition. Refer to below descriptions for the watermarks parameter for this. + +commit_inputs +------------- + +Make DAMON_LRU_SORT reads the input parameters again, except ``enabled``. + +Input parameters that updated while DAMON_LRU_SORT is running are not applied +by default. Once this parameter is set as ``Y``, DAMON_LRU_SORT reads values +of parametrs except ``enabled`` again. Once the re-reading is done, this +parameter is set as ``N``. If invalid parameters are found while the +re-reading, DAMON_LRU_SORT will be disabled. + +hot_thres_access_freq +--------------------- + +Access frequency threshold for hot memory regions identification in permil. + +If a memory region is accessed in frequency of this or higher, DAMON_LRU_SORT +identifies the region as hot, and mark it as accessed on the LRU list, so that +it could not be reclaimed under memory pressure. 50% by default. + +cold_min_age +------------ + +Time threshold for cold memory regions identification in microseconds. + +If a memory region is not accessed for this or longer time, DAMON_LRU_SORT +identifies the region as cold, and mark it as unaccessed on the LRU list, so +that it could be reclaimed first under memory pressure. 120 seconds by +default. + +quota_ms +-------- + +Limit of time for trying the LRU lists sorting in milliseconds. + +DAMON_LRU_SORT tries to use only up to this time within a time window +(quota_reset_interval_ms) for trying LRU lists sorting. This can be used +for limiting CPU consumption of DAMON_LRU_SORT. If the value is zero, the +limit is disabled. + +10 ms by default. + +quota_reset_interval_ms +----------------------- + +The time quota charge reset interval in milliseconds. + +The charge reset interval for the quota of time (quota_ms). That is, +DAMON_LRU_SORT does not try LRU-lists sorting for more than quota_ms +milliseconds or quota_sz bytes within quota_reset_interval_ms milliseconds. + +1 second by default. + +wmarks_interval +--------------- + +The watermarks check time interval in microseconds. + +Minimal time to wait before checking the watermarks, when DAMON_LRU_SORT is +enabled but inactive due to its watermarks rule. 5 seconds by default. + +wmarks_high +----------- + +Free memory rate (per thousand) for the high watermark. + +If free memory of the system in bytes per thousand bytes is higher than this, +DAMON_LRU_SORT becomes inactive, so it does nothing but periodically checks the +watermarks. 200 (20%) by default. + +wmarks_mid +---------- + +Free memory rate (per thousand) for the middle watermark. + +If free memory of the system in bytes per thousand bytes is between this and +the low watermark, DAMON_LRU_SORT becomes active, so starts the monitoring and +the LRU-lists sorting. 150 (15%) by default. + +wmarks_low +---------- + +Free memory rate (per thousand) for the low watermark. + +If free memory of the system in bytes per thousand bytes is lower than this, +DAMON_LRU_SORT becomes inactive, so it does nothing but periodically checks the +watermarks. 50 (5%) by default. + +sample_interval +--------------- + +Sampling interval for the monitoring in microseconds. + +The sampling interval of DAMON for the cold memory monitoring. Please refer to +the DAMON documentation (:doc:`usage`) for more detail. 5ms by default. + +aggr_interval +------------- + +Aggregation interval for the monitoring in microseconds. + +The aggregation interval of DAMON for the cold memory monitoring. Please +refer to the DAMON documentation (:doc:`usage`) for more detail. 100ms by +default. + +min_nr_regions +-------------- + +Minimum number of monitoring regions. + +The minimal number of monitoring regions of DAMON for the cold memory +monitoring. This can be used to set lower-bound of the monitoring quality. +But, setting this too high could result in increased monitoring overhead. +Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by +default. + +max_nr_regions +-------------- + +Maximum number of monitoring regions. + +The maximum number of monitoring regions of DAMON for the cold memory +monitoring. This can be used to set upper-bound of the monitoring overhead. +However, setting this too low could result in bad monitoring quality. Please +refer to the DAMON documentation (:doc:`usage`) for more detail. 1000 by +defaults. + +monitor_region_start +-------------------- + +Start of target memory region in physical address. + +The start physical address of memory region that DAMON_LRU_SORT will do work +against. By default, biggest System RAM is used as the region. + +monitor_region_end +------------------ + +End of target memory region in physical address. + +The end physical address of memory region that DAMON_LRU_SORT will do work +against. By default, biggest System RAM is used as the region. + +kdamond_pid +----------- + +PID of the DAMON thread. + +If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread. Else, +-1. + +nr_lru_sort_tried_hot_regions +----------------------------- + +Number of hot memory regions that tried to be LRU-sorted. + +bytes_lru_sort_tried_hot_regions +-------------------------------- + +Total bytes of hot memory regions that tried to be LRU-sorted. + +nr_lru_sorted_hot_regions +------------------------- + +Number of hot memory regions that successfully be LRU-sorted. + +bytes_lru_sorted_hot_regions +---------------------------- + +Total bytes of hot memory regions that successfully be LRU-sorted. + +nr_hot_quota_exceeds +-------------------- + +Number of times that the time quota limit for hot regions have exceeded. + +nr_lru_sort_tried_cold_regions +------------------------------ + +Number of cold memory regions that tried to be LRU-sorted. + +bytes_lru_sort_tried_cold_regions +--------------------------------- + +Total bytes of cold memory regions that tried to be LRU-sorted. + +nr_lru_sorted_cold_regions +-------------------------- + +Number of cold memory regions that successfully be LRU-sorted. + +bytes_lru_sorted_cold_regions +----------------------------- + +Total bytes of cold memory regions that successfully be LRU-sorted. + +nr_cold_quota_exceeds +--------------------- + +Number of times that the time quota limit for cold regions have exceeded. + +Example +======= + +Below runtime example commands make DAMON_LRU_SORT to find memory regions +having >=50% access frequency and LRU-prioritize while LRU-deprioritizing +memory regions that not accessed for 120 seconds. The prioritization and +deprioritization is limited to be done using only up to 1% CPU time to avoid +DAMON_LRU_SORT consuming too much CPU time for the (de)prioritization. It also +asks DAMON_LRU_SORT to do nothing if the system's free memory rate is more than +50%, but start the real works if it becomes lower than 40%. If DAMON_RECLAIM +doesn't make progress and therefore the free memory rate becomes lower than +20%, it asks DAMON_LRU_SORT to do nothing again, so that we can fall back to +the LRU-list based page granularity reclamation. :: + + # cd /sys/modules/damon_lru_sort/parameters + # echo 500 > hot_thres_access_freq + # echo 120000000 > cold_min_age + # echo 10 > quota_ms + # echo 1000 > quota_reset_interval_ms + # echo 500 > wmarks_high + # echo 400 > wmarks_mid + # echo 200 > wmarks_low + # echo Y > enabled diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index a8bd3bd29959..4f1479a11e63 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -48,12 +48,6 @@ DAMON_RECLAIM utilizes module parameters. That is, you can put ``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write proper values to ``/sys/modules/damon_reclaim/parameters/<parameter>`` files. -Note that the parameter values except ``enabled`` are applied only when -DAMON_RECLAIM starts. Therefore, if you want to apply new parameter values in -runtime and DAMON_RECLAIM is already enabled, you should disable and re-enable -it via ``enabled`` parameter file. Writing of the new values to proper -parameter values should be done before the re-enablement. - Below are the description of each parameter. enabled diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 5540a3a40fc9..d52f572a9029 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -264,6 +264,8 @@ that can be written to and read from the file and their meaning are as below. - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT`` - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE`` - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE`` + - ``lru_prio``: Prioritize the region on its LRU lists. + - ``lru_deprio``: Deprioritize the region on its LRU lists. - ``stat``: Do nothing but count the statistics schemes/<N>/access_pattern/ diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c21b5823f126..1bd11118dfb1 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -36,6 +36,7 @@ the Linux memory management. numa_memory_policy numaperf pagemap + shrinker_debugfs soft-dirty swap_numa transhuge diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst new file mode 100644 index 000000000000..3887f0b294fe --- /dev/null +++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst @@ -0,0 +1,135 @@ +.. _shrinker_debugfs: + +========================== +Shrinker Debugfs Interface +========================== + +Shrinker debugfs interface provides a visibility into the kernel memory +shrinkers subsystem and allows to get information about individual shrinkers +and interact with them. + +For each shrinker registered in the system a directory in **<debugfs>/shrinker/** +is created. The directory's name is composed from the shrinker's name and an +unique id: e.g. *kfree_rcu-0* or *sb-xfs:vda1-36*. + +Each shrinker directory contains **count** and **scan** files, which allow to +trigger *count_objects()* and *scan_objects()* callbacks for each memcg and +numa node (if applicable). + +Usage: +------ + +1. *List registered shrinkers* + + :: + + $ cd /sys/kernel/debug/shrinker/ + $ ls + dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 + mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 + mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 + rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 + sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 + sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 + sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 + sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 + sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 + sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 + sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 + sb-dax-11 sb-proc-45 sb-tmpfs-35 + sb-debugfs-7 sb-proc-46 sb-tmpfs-40 + +2. *Get information about a specific shrinker* + + :: + + $ cd sb-btrfs\:vda2-24/ + $ ls + count scan + +3. *Count objects* + + Each line in the output has the following format:: + + <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ... + <cgroup inode id> <nr of objects on node 0> <nr of objects on node 1> ... + ... + + If there are no objects on all numa nodes, a line is omitted. If there + are no objects at all, the output might be empty. + + If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed + as cgroup inode id. If the shrinker is not numa-aware, 0's are printed + for all nodes except the first one. + :: + + $ cat count + 1 224 2 + 21 98 0 + 55 818 10 + 2367 2 0 + 2401 30 0 + 225 13 0 + 599 35 0 + 939 124 0 + 1041 3 0 + 1075 1 0 + 1109 1 0 + 1279 60 0 + 1313 7 0 + 1347 39 0 + 1381 3 0 + 1449 14 0 + 1483 63 0 + 1517 53 0 + 1551 6 0 + 1585 1 0 + 1619 6 0 + 1653 40 0 + 1687 11 0 + 1721 8 0 + 1755 4 0 + 1789 52 0 + 1823 888 0 + 1857 1 0 + 1925 2 0 + 1959 32 0 + 2027 22 0 + 2061 9 0 + 2469 799 0 + 2537 861 0 + 2639 1 0 + 2707 70 0 + 2775 4 0 + 2877 84 0 + 293 1 0 + 735 8 0 + +4. *Scan objects* + + The expected input format:: + + <cgroup inode id> <numa id> <number of objects to scan> + + For a non-memcg-aware shrinker or on a system with no memory + cgrups **0** should be passed as cgroup id. + :: + + $ cd /sys/kernel/debug/shrinker/ + $ cd sb-btrfs\:vda2-24/ + + $ cat count | head -n 5 + 1 212 0 + 21 97 0 + 55 802 5 + 2367 2 0 + 225 13 0 + + $ echo "55 0 200" > scan + + $ cat count | head -n 5 + 1 212 0 + 21 96 0 + 55 752 5 + 2367 2 0 + 225 13 0 diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 6528036093e1..9bae1acd431f 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick. Design ====== -Userfaults are delivered and resolved through the ``userfaultfd`` syscall. +Userspace creates a new userfaultfd, initializes it, and registers one or more +regions of virtual memory with it. Then, any page faults which occur within the +region(s) result in a message being delivered to the userfaultfd, notifying +userspace of the fault. The ``userfaultfd`` (aside from registering and unregistering virtual memory ranges) provides two primary functionalities: @@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory management of mremap/mprotect is that the userfaults in all their operations never involve heavyweight structures like vmas (in fact the ``userfaultfd`` runtime load never takes the mmap_lock for writing). - Vmas are not suitable for page- (or hugepage) granular fault tracking when dealing with virtual address spaces that could span Terabytes. Too many vmas would be needed for that. -The ``userfaultfd`` once opened by invoking the syscall, can also be +The ``userfaultfd``, once created, can also be passed using unix domain sockets to a manager process, so the same manager process could handle the userfaults of a multitude of different processes without them being aware about what is going on @@ -50,6 +52,38 @@ is a corner case that would currently return ``-EBUSY``). API === +Creating a userfaultfd +---------------------- + +There are two ways to create a new userfaultfd, each of which provide ways to +restrict access to this functionality (since historically userfaultfds which +handle kernel page faults have been a useful tool for exploiting the kernel). + +The first way, supported by older kernels, is the userfaultfd(2) syscall. +Access to this is controlled in several ways: + +- By default, the userfaultfd will be able to handle kernel page faults. This + can be disabled by passing in UFFD_USER_MODE_ONLY. + +- If vm.unprivileged_userfaultfd is 0, then the caller must *either* have + CAP_SYS_PTRACE, or pass in UFFD_USER_MODE_ONLY. + +- If vm.unprivileged_userfaultfd is 1, then no particular privilege is needed to + use this syscall, even if UFFD_USER_MODE_ONLY is *not* set. + +The second way, added to the kernel more recently, is by opening and issuing a +USERFAULTFD_IOC_NEW ioctl to /dev/userfaultfd. This method yields equivalent +userfaultfds to the userfaultfd(2) syscall; its benefit is in how access to +creating userfaultfds is controlled. + +Access to /dev/userfaultfd is controlled via normal filesystem permissions +(user/group/mode for example), which gives fine grained access to userfaultfd +specifically, without also granting other unrelated privileges at the same time +(as e.g. granting CAP_SYS_PTRACE would do). + +Initializing up a userfaultfd +----------------------------- + When first opened the ``userfaultfd`` must be enabled invoking the ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or a later API version) which will specify the ``read/POLLIN`` protocol diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 4a440a7cfeb0..5a810ea6dc77 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -565,9 +565,8 @@ See Documentation/admin-guide/mm/hugetlbpage.rst hugetlb_optimize_vmemmap ======================== -This knob is not available when memory_hotplug.memmap_on_memory (kernel parameter) -is configured or the size of 'struct page' (a structure defined in -include/linux/mm_types.h) is not power of two (an unusual system config could +This knob is not available when the size of 'struct page' (a structure defined +in include/linux/mm_types.h) is not power of two (an unusual system config could result in this). Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages @@ -928,6 +927,9 @@ calls without any restrictions. The default value is 0. +An alternative to this sysctl / the userfaultfd(2) syscall is to create +userfaultfds via /dev/userfaultfd. See +Documentation/admin-guide/mm/userfaultfd.rst. user_reserve_kbytes =================== |