summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
authorStephen Rothwell <sfr@canb.auug.org.au>2016-02-15 16:04:19 +1100
committerStephen Rothwell <sfr@canb.auug.org.au>2016-02-15 16:04:19 +1100
commit9e6455cfb1ede0b5b75930ce7c5282e8707bbfad (patch)
tree28af6ced4af2ee5cb1862cb27ab5322ba192bed4 /Documentation
parent24d8a686817f6125f4c5dba91caf85ec906cf588 (diff)
parentb424a3e227e8e78eb92827afd464562552f2b49f (diff)
downloadlinux-next-9e6455cfb1ede0b5b75930ce7c5282e8707bbfad.tar.gz
Merge branch 'akpm-current/current'
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/blockdev/zram.txt10
-rw-r--r--Documentation/cgroup-v2.txt19
-rw-r--r--Documentation/filesystems/ocfs2-online-filecheck.txt94
-rw-r--r--Documentation/kcov.txt111
-rw-r--r--Documentation/kernel-parameters.txt17
-rw-r--r--Documentation/memory-hotplug.txt23
-rw-r--r--Documentation/printk-formats.txt18
-rw-r--r--Documentation/rapidio/mport_cdev.txt104
-rw-r--r--Documentation/rapidio/tsi721.txt9
-rw-r--r--Documentation/vm/ksm.txt63
-rw-r--r--Documentation/vm/page_owner.txt9
11 files changed, 469 insertions, 8 deletions
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 5bda5031c83d..f525b1663efa 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -227,6 +227,16 @@ line of text and contains the following stats separated by whitespace:
mem_used_max
zero_pages
num_migrated
+ avail_streams
+
+`avail_streams' column shows the current number of available compression
+streams, which is not necessarily equal to the max number of compression
+streams. The max number of compression streams can be set too high and
+can be unreachable (depending on the load and the usage pattern, of
+course). `avail_streams' permits finding out the real 'level of
+concurrency' that a particular zram device saw and to calculate the real
+memory consumption by allocated compression streams, not the theoretical
+maximum value.
9) Deactivate:
swapoff /dev/zram0
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index ff49cf901148..e2f4e7948a66 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -843,6 +843,15 @@ PAGE_SIZE multiple when read back.
Amount of memory used to cache filesystem data,
including tmpfs and shared memory.
+ kernel_stack
+
+ Amount of memory allocated to kernel stacks.
+
+ slab
+
+ Amount of memory used for storing in-kernel data
+ structures.
+
sock
Amount of memory used in network transmission buffers
@@ -871,6 +880,16 @@ PAGE_SIZE multiple when read back.
on the internal memory management lists used by the
page reclaim algorithm
+ slab_reclaimable
+
+ Part of "slab" that might be reclaimed, such as
+ dentries and inodes.
+
+ slab_unreclaimable
+
+ Part of "slab" that cannot be reclaimed on memory
+ pressure.
+
pgfault
Total number of page faults incurred
diff --git a/Documentation/filesystems/ocfs2-online-filecheck.txt b/Documentation/filesystems/ocfs2-online-filecheck.txt
new file mode 100644
index 000000000000..1ab07860430d
--- /dev/null
+++ b/Documentation/filesystems/ocfs2-online-filecheck.txt
@@ -0,0 +1,94 @@
+ OCFS2 online file check
+ -----------------------
+
+This document will describe OCFS2 online file check feature.
+
+Introduction
+============
+OCFS2 is often used in high-availaibility systems. However, OCFS2 usually
+converts the filesystem to read-only when encounters an error. This may not be
+necessary, since turning the filesystem read-only would affect other running
+processes as well, decreasing availability.
+Then, a mount option (errors=continue) is introduced, which would return the
+-EIO errno to the calling process and terminate furhter processing so that the
+filesystem is not corrupted further. The filesystem is not converted to
+read-only, and the problematic file's inode number is reported in the kernel
+log. The user can try to check/fix this file via online filecheck feature.
+
+Scope
+=====
+This effort is to check/fix small issues which may hinder day-to-day operations
+of a cluster filesystem by turning the filesystem read-only. The scope of
+checking/fixing is at the file level, initially for regular files and eventually
+to all files (including system files) of the filesystem.
+
+In case of directory to file links is incorrect, the directory inode is
+reported as erroneous.
+
+This feature is not suited for extravagant checks which involve dependency of
+other components of the filesystem, such as but not limited to, checking if the
+bits for file blocks in the allocation has been set. In case of such an error,
+the offline fsck should/would be recommended.
+
+Finally, such an operation/feature should not be automated lest the filesystem
+may end up with more damage than before the repair attempt. So, this has to
+be performed using user interaction and consent.
+
+User interface
+==============
+When there are errors in the OCFS2 filesystem, they are usually accompanied
+by the inode number which caused the error. This inode number would be the
+input to check/fix the file.
+
+There is a sysfs directory for each OCFS2 file system mounting:
+
+ /sys/fs/ocfs2/<devname>/filecheck
+
+Here, <devname> indicates the name of OCFS2 volumn device which has been already
+mounted. The file above would accept inode numbers. This could be used to
+communicate with kernel space, tell which file(inode number) will be checked or
+fixed. Currently, three operations are supported, which includes checking
+inode, fixing inode and setting the size of result record history.
+
+1. If you want to know what error exactly happened to <inode> before fixing, do
+
+ # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/check
+ # cat /sys/fs/ocfs2/<devname>/filecheck/check
+
+The output is like this:
+ INO DONE ERROR
+39502 1 GENERATION
+
+<INO> lists the inode numbers.
+<DONE> indicates whether the operation has been finished.
+<ERROR> says what kind of errors was found. For the detailed error numbers,
+please refer to the file linux/fs/ocfs2/filecheck.h.
+
+2. If you determine to fix this inode, do
+
+ # echo "<inode>" > /sys/fs/ocfs2/<devname>/filecheck/fix
+ # cat /sys/fs/ocfs2/<devname>/filecheck/fix
+
+The output is like this:
+ INO DONE ERROR
+39502 1 SUCCESS
+
+This time, the <ERROR> column indicates whether this fix is successful or not.
+
+3. The record cache is used to store the history of check/fix results. It's
+defalut size is 10, and can be adjust between the range of 10 ~ 100. You can
+adjust the size like this:
+
+ # echo "<size>" > /sys/fs/ocfs2/<devname>/filecheck/set
+
+Fixing stuff
+============
+On receivng the inode, the filesystem would read the inode and the
+file metadata. In case of errors, the filesystem would fix the errors
+and report the problems it fixed in the kernel log. As a precautionary measure,
+the inode must first be checked for errors before performing a final fix.
+
+The inode and the result history will be maintained temporarily in a
+small linked list buffer which would contain the last (N) inodes
+fixed/checked, the detailed errors which were fixed/checked are printed in the
+kernel log.
diff --git a/Documentation/kcov.txt b/Documentation/kcov.txt
new file mode 100644
index 000000000000..779ff4ab1c1d
--- /dev/null
+++ b/Documentation/kcov.txt
@@ -0,0 +1,111 @@
+kcov: code coverage for fuzzing
+===============================
+
+kcov exposes kernel code coverage information in a form suitable for coverage-
+guided fuzzing (randomized testing). Coverage data of a running kernel is
+exported via the "kcov" debugfs file. Coverage collection is enabled on a task
+basis, and thus it can capture precise coverage of a single system call.
+
+Note that kcov does not aim to collect as much coverage as possible. It aims
+to collect more or less stable coverage that is function of syscall inputs.
+To achieve this goal it does not collect coverage in soft/hard interrupts
+and instrumentation of some inherently non-deterministic parts of kernel is
+disbled (e.g. scheduler, locking).
+
+Usage:
+======
+
+Configure kernel with:
+
+ CONFIG_KCOV=y
+
+CONFIG_KCOV requires gcc built on revision 231296 or later.
+Profiling data will only become accessible once debugfs has been mounted:
+
+ mount -t debugfs none /sys/kernel/debug
+
+The following program demonstrates kcov usage from within a test program:
+
+#include <stdio.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <fcntl.h>
+
+#define KCOV_INIT_TRACE _IOR('c', 1, unsigned long)
+#define KCOV_ENABLE _IO('c', 100)
+#define KCOV_DISABLE _IO('c', 101)
+#define COVER_SIZE (64<<10)
+
+int main(int argc, char **argv)
+{
+ int fd;
+ unsigned long *cover, n, i;
+
+ /* A single fd descriptor allows coverage collection on a single
+ * thread.
+ */
+ fd = open("/sys/kernel/debug/kcov", O_RDWR);
+ if (fd == -1)
+ perror("open"), exit(1);
+ /* Setup trace mode and trace size. */
+ if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE))
+ perror("ioctl"), exit(1);
+ /* Mmap buffer shared between kernel- and user-space. */
+ cover = (unsigned long*)mmap(NULL, COVER_SIZE * sizeof(unsigned long),
+ PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if ((void*)cover == MAP_FAILED)
+ perror("mmap"), exit(1);
+ /* Enable coverage collection on the current thread. */
+ if (ioctl(fd, KCOV_ENABLE, 0))
+ perror("ioctl"), exit(1);
+ /* Reset coverage from the tail of the ioctl() call. */
+ __atomic_store_n(&cover[0], 0, __ATOMIC_RELAXED);
+ /* That's the target syscal call. */
+ read(-1, NULL, 0);
+ /* Read number of PCs collected. */
+ n = __atomic_load_n(&cover[0], __ATOMIC_RELAXED);
+ for (i = 0; i < n; i++)
+ printf("0x%lx\n", cover[i + 1]);
+ /* Disable coverage collection for the current thread. After this call
+ * coverage can be enabled for a different thread.
+ */
+ if (ioctl(fd, KCOV_DISABLE, 0))
+ perror("ioctl"), exit(1);
+ /* Free resources. */
+ if (munmap(cover, COVER_SIZE * sizeof(unsigned long)))
+ perror("munmap"), exit(1);
+ if (close(fd))
+ perror("close"), exit(1);
+ return 0;
+}
+
+After piping through addr2line output of the program looks as follows:
+
+SyS_read
+fs/read_write.c:562
+__fdget_pos
+fs/file.c:774
+__fget_light
+fs/file.c:746
+__fget_light
+fs/file.c:750
+__fget_light
+fs/file.c:760
+__fdget_pos
+fs/file.c:784
+SyS_read
+fs/read_write.c:562
+
+If a program needs to collect coverage from several threads (independently),
+it needs to open /sys/kernel/debug/kcov in each thread separately.
+
+The interface is fine-grained to allow efficient forking of test processes.
+That is, a parent process opens /sys/kernel/debug/kcov, enables trace mode,
+mmaps coverage buffer and then forks child processes in a loop. Child processes
+only need to enable coverage (disable happens automatically on thread end).
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 5688b69c26bb..52a8471fba9c 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1759,7 +1759,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
keepinitrd [HW,ARM]
- kernelcore=nn[KMG] [KNL,X86,IA-64,PPC] This parameter
+ kernelcore= [KNL,X86,IA-64,PPC]
+ Format: nn[KMGTPE] | "mirror"
+ This parameter
specifies the amount of memory usable by the kernel
for non-movable allocations. The requested amount is
spread evenly throughout all nodes in the system. The
@@ -1775,6 +1777,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
+ Instead of specifying the amount of memory (nn[KMGTPE]),
+ you can specify "mirror" option. In case "mirror"
+ option is specified, mirrored (reliable) memory is used
+ for non-movable allocations and remaining memory is used
+ for Movable pages. nn[KMGTPE] and "mirror" are exclusive,
+ so you can NOT specify nn[KMGTPE] and "mirror" at the same
+ time.
+
kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
@@ -2732,6 +2742,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
we can turn it on.
on: enable the feature
+ page_poison= [KNL] Boot-time parameter changing the state of
+ poisoning on the buddy allocator.
+ off: turn off poisoning
+ on: turn on poisoning
+
panic= [KNL] Kernel behaviour on panic: delay <timeout>
timeout > 0: seconds before rebooting
timeout = 0: wait forever
diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index ce2cfcf35c27..443f4b44ad97 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -256,10 +256,27 @@ If the memory block is offline, you'll read "offline".
5.2. How to online memory
------------
-Even if the memory is hot-added, it is not at ready-to-use state.
-For using newly added memory, you have to "online" the memory block.
+When the memory is hot-added, the kernel decides whether or not to "online"
+it according to the policy which can be read from "auto_online_blocks" file:
-For onlining, you have to write "online" to the memory block's state file as:
+% cat /sys/devices/system/memory/auto_online_blocks
+
+The default is "offline" which means the newly added memory is not in a
+ready-to-use state and you have to "online" the newly added memory blocks
+manually. Automatic onlining can be requested by writing "online" to
+"auto_online_blocks" file:
+
+% echo online > /sys/devices/system/memory/auto_online_blocks
+
+This sets a global policy and impacts all memory blocks that will subsequently
+be hotplugged. Currently offline blocks keep their state. It is possible, under
+certain circumstances, that some memory blocks will be added but will fail to
+online. User space tools can check their "state" files
+(/sys/devices/system/memory/memoryXXX/state) and try to online them manually.
+
+If the automatic onlining wasn't requested, failed, or some memory block was
+offlined it is possible to change the individual block's state by writing to the
+"state" file:
% echo online > /sys/devices/system/memory/memoryXXX/state
diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index 5d1128bf0282..5962949944fd 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -298,6 +298,24 @@ bitmap and its derivatives such as cpumask and nodemask:
Passed by reference.
+Flags bitfields such as page flags, gfp_flags:
+
+ %pGp referenced|uptodate|lru|active|private
+ %pGg GFP_USER|GFP_DMA32|GFP_NOWARN
+ %pGv read|exec|mayread|maywrite|mayexec|denywrite
+
+ For printing flags bitfields as a collection of symbolic constants that
+ would construct the value. The type of flags is given by the third
+ character. Currently supported are [p]age flags, [v]ma_flags (both
+ expect unsigned long *) and [g]fp_flags (expects gfp_t *). The flag
+ names and print order depends on the particular type.
+
+ Note that this format should not be used directly in TP_printk() part
+ of a tracepoint. Instead, use the show_*_flags() functions from
+ <trace/events/mmflags.h>.
+
+ Passed by reference.
+
Network device features:
%pNF 0x000000000000c000
diff --git a/Documentation/rapidio/mport_cdev.txt b/Documentation/rapidio/mport_cdev.txt
new file mode 100644
index 000000000000..20c120d4b3b8
--- /dev/null
+++ b/Documentation/rapidio/mport_cdev.txt
@@ -0,0 +1,104 @@
+RapidIO subsystem mport character device driver (rio_mport_cdev.c)
+==================================================================
+
+Version History:
+----------------
+ 1.0.0 - Initial driver release.
+
+==================================================================
+
+I. Overview
+
+This device driver is the result of collaboration within the RapidIO.org
+Software Task Group (STG) between Texas Instruments, Freescale,
+Prodrive Technologies, Nokia Networks, BAE and IDT. Additional input was
+received from other members of RapidIO.org. The objective was to create a
+character mode driver interface which exposes the capabilities of RapidIO
+devices directly to applications, in a manner that allows the numerous and
+varied RapidIO implementations to interoperate.
+
+This driver (MPORT_CDEV) provides access to basic RapidIO subsystem operations
+for user-space applications. Most of RapidIO operations are supported through
+'ioctl' system calls.
+
+When loaded this device driver creates filesystem nodes named rio_mportX in /dev
+directory for each registered RapidIO mport device. 'X' in the node name matches
+to unique port ID assigned to each local mport device.
+
+Using available set of ioctl commands user-space applications can perform
+following RapidIO bus and subsystem operations:
+
+- Reads and writes from/to configuration registers of mport devices
+ (RIO_MPORT_MAINT_READ_LOCAL/RIO_MPORT_MAINT_WRITE_LOCAL)
+- Reads and writes from/to configuration registers of remote RapidIO devices.
+ This operations are defined as RapidIO Maintenance reads/writes in RIO spec.
+ (RIO_MPORT_MAINT_READ_REMOTE/RIO_MPORT_MAINT_WRITE_REMOTE)
+- Set RapidIO Destination ID for mport devices (RIO_MPORT_MAINT_HDID_SET)
+- Set RapidIO Component Tag for mport devices (RIO_MPORT_MAINT_COMPTAG_SET)
+- Query logical index of mport devices (RIO_MPORT_MAINT_PORT_IDX_GET)
+- Query capabilities and RapidIO link configuration of mport devices
+ (RIO_MPORT_GET_PROPERTIES)
+- Enable/Disable reporting of RapidIO doorbell events to user-space applications
+ (RIO_ENABLE_DOORBELL_RANGE/RIO_DISABLE_DOORBELL_RANGE)
+- Enable/Disable reporting of RIO port-write events to user-space applications
+ (RIO_ENABLE_PORTWRITE_RANGE/RIO_DISABLE_PORTWRITE_RANGE)
+- Query/Control type of events reported through this driver: doorbells,
+ port-writes or both (RIO_SET_EVENT_MASK/RIO_GET_EVENT_MASK)
+- Configure/Map mport's outbound requests window(s) for specific size,
+ RapidIO destination ID, hopcount and request type
+ (RIO_MAP_OUTBOUND/RIO_UNMAP_OUTBOUND)
+- Configure/Map mport's inbound requests window(s) for specific size,
+ RapidIO base address and local memory base address
+ (RIO_MAP_INBOUND/RIO_UNMAP_INBOUND)
+- Allocate/Free contiguous DMA coherent memory buffer for DMA data transfers
+ to/from remote RapidIO devices (RIO_ALLOC_DMA/RIO_FREE_DMA)
+- Initiate DMA data transfers to/from remote RapidIO devices (RIO_TRANSFER).
+ Supports blocking, asynchronous and posted (a.k.a 'fire-and-forget') data
+ transfer modes.
+- Check/Wait for completion of asynchronous DMA data transfer
+ (RIO_WAIT_FOR_ASYNC)
+- Manage device objects supported by RapidIO subsystem (RIO_DEV_ADD/RIO_DEV_DEL).
+ This allows implementation of various RapidIO fabric enumeration algorithms
+ as user-space applications while using remaining functionality provided by
+ kernel RapidIO subsystem.
+
+II. Hardware Compatibility
+
+This device driver uses standard interfaces defined by kernel RapidIO subsystem
+and therefore it can be used with any mport device driver registered by RapidIO
+subsystem with limitations set by available mport implementation.
+
+At this moment the most common limitation is availability of RapidIO-specific
+DMA engine framework for specific mport device. Users should verify available
+functionality of their platform when planning to use this driver:
+
+- IDT Tsi721 PCIe-to-RapidIO bridge device and its mport device driver are fully
+ compatible with this driver.
+- Freescale SoCs 'fsl_rio' mport driver does not have implementation for RapidIO
+ specific DMA engine support and therefore DMA data transfers mport_cdev driver
+ are not available.
+
+III. Module parameters
+
+- 'dbg_level' - This parameter allows to control amount of debug information
+ generated by this device driver. This parameter is formed by set of
+ This parameter can be changed bit masks that correspond to the specific
+ functional block.
+ For mask definitions see 'drivers/rapidio/devices/rio_mport_cdev.c'
+ This parameter can be changed dynamically.
+ Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level.
+
+IV. Known problems
+
+ None.
+
+V. User-space Applications and API
+
+API library and applications that use this device driver are available from
+RapidIO.org.
+
+VI. TODO List
+
+- Add support for sending/receiving "raw" RapidIO messaging packets.
+- Add memory mapped DMA data transfers as an option when RapidIO-specific DMA
+ is not available.
diff --git a/Documentation/rapidio/tsi721.txt b/Documentation/rapidio/tsi721.txt
index 626052f403bb..7c1c7bf48ec0 100644
--- a/Documentation/rapidio/tsi721.txt
+++ b/Documentation/rapidio/tsi721.txt
@@ -16,6 +16,15 @@ For inbound messages this driver uses destination ID matching to forward message
into the corresponding message queue. Messaging callbacks are implemented to be
fully compatible with RIONET driver (Ethernet over RapidIO messaging services).
+1. Module parameters:
+- 'dbg_level' - This parameter allows to control amount of debug information
+ generated by this device driver. This parameter is formed by set of
+ This parameter can be changed bit masks that correspond to the specific
+ functional block.
+ For mask definitions see 'drivers/rapidio/devices/tsi721.h'
+ This parameter can be changed dynamically.
+ Use CONFIG_RAPIDIO_DEBUG=y to enable debug output at the top level.
+
II. Known problems
None.
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index f34a8ee6f860..a7ef7595edab 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -80,6 +80,50 @@ run - set 0 to stop ksmd from running but keep merged pages,
Default: 0 (must be changed to 1 to activate KSM,
except if CONFIG_SYSFS is disabled)
+max_page_sharing - Maximum sharing allowed for each KSM page. This
+ enforces a deduplication limit to avoid the virtual
+ memory rmap lists to grow too large. The minimum
+ value is 2 as a newly created KSM page will have at
+ least two sharers. The rmap walk has O(N)
+ complexity where N is the number of rmap_items
+ (i.e. virtual mappings) that are sharing the page,
+ which is in turn capped by max_page_sharing. So
+ this effectively spread the the linear O(N)
+ computational complexity from rmap walk context
+ over different KSM pages. The ksmd walk over the
+ stable_node "chains" is also O(N), but N is the
+ number of stable_node "dups", not the number of
+ rmap_items, so it has not a significant impact on
+ ksmd performance. In practice the best stable_node
+ "dup" candidate will be kept and found at the head
+ of the "dups" list. The higher this value the
+ faster KSM will merge the memory (because there
+ will be fewer stable_node dups queued into the
+ stable_node chain->hlist to check for pruning) and
+ the higher the deduplication factor will be, but
+ the slowest the worst case rmap walk could be for
+ any given KSM page. Slowing down the rmap_walk
+ means there will be higher latency for certain
+ virtual memory operations happening during
+ swapping, compaction, NUMA balancing and page
+ migration, in turn decreasing responsiveness for
+ the caller of those virtual memory operations. The
+ scheduler latency of other tasks not involved with
+ the VM operations doing the rmap walk is not
+ affected by this parameter as the rmap walks are
+ always schedule friendly themselves.
+
+stable_node_chains_prune_millisecs - How frequently to walk the whole
+ list of stable_node "dups" linked in the
+ stable_node "chains" in order to prune stale
+ stable_nodes. Smaller milllisecs values will free
+ up the KSM metadata with lower latency, but they
+ will make ksmd use more CPU during the scan. This
+ only applies to the stable_node chains so it's a
+ noop if not a single KSM page hit the
+ max_page_sharing yet (there would be no stable_node
+ chains in such case).
+
The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
pages_shared - how many shared pages are being used
@@ -88,10 +132,29 @@ pages_unshared - how many pages unique but repeatedly checked for merging
pages_volatile - how many pages changing too fast to be placed in a tree
full_scans - how many times all mergeable areas have been scanned
+stable_node_chains - number of stable node chains allocated, this is
+ effectively the number of KSM pages that hit the
+ max_page_sharing limit
+stable_node_dups - number of stable node dups queued into the
+ stable_node chains
+
A high ratio of pages_sharing to pages_shared indicates good sharing, but
a high ratio of pages_unshared to pages_sharing indicates wasted effort.
pages_volatile embraces several different kinds of activity, but a high
proportion there would also indicate poor use of madvise MADV_MERGEABLE.
+The maximum possible page_sharing/page_shared ratio is limited by the
+max_page_sharing tunable. To increase the ratio max_page_sharing must
+be increased accordingly.
+
+The stable_node_dups/stable_node_chains ratio is also affected by the
+max_page_sharing tunable, and an high ratio may indicate fragmentation
+in the stable_node dups, which could be solved by introducing
+fragmentation algorithms in ksmd which would refile rmap_items from
+one stable_node dup to another stable_node dup, in order to freeup
+stable_node "dups" with few rmap_items in them, but that may increase
+the ksmd CPU usage and possibly slowdown the readonly computations on
+the KSM pages of the applications.
+
Izik Eidus,
Hugh Dickins, 17 Nov 2009
diff --git a/Documentation/vm/page_owner.txt b/Documentation/vm/page_owner.txt
index 8f3ce9b3aa11..ffff1439076a 100644
--- a/Documentation/vm/page_owner.txt
+++ b/Documentation/vm/page_owner.txt
@@ -28,10 +28,11 @@ with page owner and page owner is disabled in runtime due to no enabling
boot option, runtime overhead is marginal. If disabled in runtime, it
doesn't require memory to store owner information, so there is no runtime
memory overhead. And, page owner inserts just two unlikely branches into
-the page allocator hotpath and if it returns false then allocation is
-done like as the kernel without page owner. These two unlikely branches
-would not affect to allocation performance. Following is the kernel's
-code size change due to this facility.
+the page allocator hotpath and if not enabled, then allocation is done
+like as the kernel without page owner. These two unlikely branches should
+not affect to allocation performance, especially if the static keys jump
+label patching functionality is available. Following is the kernel's code
+size change due to this facility.
- Without page owner
text data bss dec hex filename