From c4e1be9ec1130fff4d691cdc0e0f9d666009f9ae Mon Sep 17 00:00:00 2001 From: Dave Hansen Date: Thu, 6 Jul 2017 15:36:44 -0700 Subject: mm, sparsemem: break out of loops early There are a number of times that we loop over NR_MEM_SECTIONS, looking for section_present() on each section. But, when we have very large physical address spaces (large MAX_PHYSMEM_BITS), NR_MEM_SECTIONS becomes very large, making the loops quite long. With MAX_PHYSMEM_BITS=46 and a section size of 128MB, the current loops are 512k iterations, which we barely notice on modern hardware. But, raising MAX_PHYSMEM_BITS higher (like we will see on systems that support 5-level paging) makes this 64x longer and we start to notice, especially on slower systems like simulators. A 10-second delay for 512k iterations is annoying. But, a 640- second delay is crippling. This does not help if we have extremely sparse physical address spaces, but those are quite rare. We expect that most of the "slow" systems where this matters will also be quite small and non-sparse. To fix this, we track the highest section we've ever encountered. This lets us know when we will *never* see another section_present(), and lets us break out of the loops earlier. Doing the whole for_each_present_section_nr() macro is probably overkill, but it will ensure that any future loop iterations that we grow are more likely to be correct. Kirrill said "It shaved almost 40 seconds from boot time in qemu with 5-level paging enabled for me". Link: http://lkml.kernel.org/r/20170504174434.C45A4735@viggo.jf.intel.com Signed-off-by: Dave Hansen Tested-by: Kirill A. Shutemov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'drivers/base') diff --git a/drivers/base/memory.c b/drivers/base/memory.c index cc4f1d0cbffe..90225ffee501 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -820,6 +820,10 @@ int __init memory_dev_init(void) */ mutex_lock(&mem_sysfs_mutex); for (i = 0; i < NR_MEM_SECTIONS; i += sections_per_block) { + /* Don't iterate over sections we know are !present: */ + if (i > __highest_present_section_nr) + break; + err = add_memory_block(i); if (!ret) ret = err; -- cgit v1.2.1 From bfe63d3beabfac93521c8b7ccd40befd7a90148e Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Jul 2017 15:37:42 -0700 Subject: mm: drop page_initialized check from get_nid_for_pfn Commit c04fc586c1a4 ("mm: show node to memory section relationship with symlinks in sysfs") has added means to export memblock<->node association into the sysfs. It has also introduced get_nid_for_pfn which is a rather confusing counterpart of pfn_to_nid which checks also whether the pfn page is already initialized (page_initialized). This is done by checking page::lru != NULL which doesn't make any sense at all. Nothing in this path really relies on the lru list being used or initialized. Just remove it because this will become a problem with later patches. Thanks to Reza Arbab for testing which revealed this to be a problem (http://lkml.kernel.org/r/20170403202337.GA12482@dhcp22.suse.cz) Link: http://lkml.kernel.org/r/20170515085827.16474-4-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Reza Arbab Cc: Andi Kleen Cc: Andrea Arcangeli Cc: Balbir Singh Cc: Dan Williams Cc: Daniel Kiper Cc: David Rientjes Cc: Heiko Carstens Cc: Igor Mammedov Cc: Jerome Glisse Cc: Joonsoo Kim Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Tobias Regnery Cc: Toshi Kani Cc: Vitaly Kuznetsov Cc: Xishi Qiu Cc: Yasuaki Ishimatsu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/node.c | 7 ------- 1 file changed, 7 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/node.c b/drivers/base/node.c index 0440d95c9b5b..db769d3148b7 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -368,21 +368,14 @@ int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) } #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE -#define page_initialized(page) (page->lru.next) - static int __ref get_nid_for_pfn(unsigned long pfn) { - struct page *page; - if (!pfn_valid_within(pfn)) return -1; #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT if (system_state < SYSTEM_RUNNING) return early_pfn_to_nid(pfn); #endif - page = pfn_to_page(pfn); - if (!page_initialized(page)) - return -1; return pfn_to_nid(pfn); } -- cgit v1.2.1 From 1b862aecfbd419cdc4553645bf86d07554279bed Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Jul 2017 15:37:45 -0700 Subject: mm, memory_hotplug: get rid of is_zone_device_section Device memory hotplug hooks into regular memory hotplug only half way. It needs memory sections to track struct pages but there is no need/desire to associate those sections with memory blocks and export them to the userspace via sysfs because they cannot be onlined anyway. This is currently expressed by for_device argument to arch_add_memory which then makes sure to associate the given memory range with ZONE_DEVICE. register_new_memory then relies on is_zone_device_section to distinguish special memory hotplug from the regular one. While this works now, later patches in this series want to move __add_zone outside of arch_add_memory path so we have to come up with something else. Add want_memblock down the __add_pages path and use it to control whether the section->memblock association should be done. arch_add_memory then just trivially want memblock for everything but for_device hotplug. remove_memory_section doesn't need is_zone_device_section either. We can simply skip all the memblock specific cleanup if there is no memblock for the given section. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org Signed-off-by: Michal Hocko Tested-by: Dan Williams Acked-by: Vlastimil Babka Cc: Andi Kleen Cc: Andrea Arcangeli Cc: Balbir Singh Cc: Daniel Kiper Cc: David Rientjes Cc: Heiko Carstens Cc: Igor Mammedov Cc: Jerome Glisse Cc: Joonsoo Kim Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Reza Arbab Cc: Tobias Regnery Cc: Toshi Kani Cc: Vitaly Kuznetsov Cc: Xishi Qiu Cc: Yasuaki Ishimatsu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 90225ffee501..f8fd562c3f18 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -685,14 +685,6 @@ static int add_memory_block(int base_section_nr) return 0; } -static bool is_zone_device_section(struct mem_section *ms) -{ - struct page *page; - - page = sparse_decode_mem_map(ms->section_mem_map, __section_nr(ms)); - return is_zone_device_page(page); -} - /* * need an interface for the VM to add new memory regions, * but without onlining it. @@ -702,9 +694,6 @@ int register_new_memory(int nid, struct mem_section *section) int ret = 0; struct memory_block *mem; - if (is_zone_device_section(section)) - return 0; - mutex_lock(&mem_sysfs_mutex); mem = find_memory_block(section); @@ -741,11 +730,16 @@ static int remove_memory_section(unsigned long node_id, { struct memory_block *mem; - if (is_zone_device_section(section)) - return 0; - mutex_lock(&mem_sysfs_mutex); + + /* + * Some users of the memory hotplug do not want/need memblock to + * track all sections. Skip over those. + */ mem = find_memory_block(section); + if (!mem) + goto out_unlock; + unregister_mem_sect_under_nodes(mem, __section_nr(section)); mem->section_count--; @@ -754,6 +748,7 @@ static int remove_memory_section(unsigned long node_id, else put_device(&mem->dev); +out_unlock: mutex_unlock(&mem_sysfs_mutex); return 0; } -- cgit v1.2.1 From 9037a9934349b0e180896fc8cacaf1819418ba03 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Jul 2017 15:37:49 -0700 Subject: mm, memory_hotplug: split up register_one_node() Memory hotplug (add_memory_resource) has to reinitialize node infrastructure if the node is offline (one which went through the complete add_memory(); remove_memory() cycle). That involves node registration to the kobj infrastructure (register_node), the proper association with cpus (register_cpu_under_node) and finally creation of node<->memblock symlinks (link_mem_sections). The last part requires to know node_start_pfn and node_spanned_pages which we currently have but a leter patch will postpone this initialization to the onlining phase which happens later. In fact we do not need to rely on the early pgdat initialization even now because the currently hot added pfn range is currently known. Split register_one_node into core which does all the common work for the boot time NUMA initialization and the hotplug (__register_one_node). register_one_node keeps the full initialization while hotplug calls __register_one_node and manually calls link_mem_sections for the proper range. This shouldn't introduce any functional change. Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Vlastimil Babka Cc: Andi Kleen Cc: Andrea Arcangeli Cc: Balbir Singh Cc: Dan Williams Cc: Daniel Kiper Cc: David Rientjes Cc: Heiko Carstens Cc: Igor Mammedov Cc: Jerome Glisse Cc: Joonsoo Kim Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Reza Arbab Cc: Tobias Regnery Cc: Toshi Kani Cc: Vitaly Kuznetsov Cc: Xishi Qiu Cc: Yasuaki Ishimatsu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/node.c | 51 ++++++++++++++++++++------------------------------- 1 file changed, 20 insertions(+), 31 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/node.c b/drivers/base/node.c index db769d3148b7..1da0005341a1 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -461,10 +461,9 @@ int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, return 0; } -static int link_mem_sections(int nid) +int link_mem_sections(int nid, unsigned long start_pfn, unsigned long nr_pages) { - unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn; - unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages; + unsigned long end_pfn = start_pfn + nr_pages; unsigned long pfn; struct memory_block *mem_blk = NULL; int err = 0; @@ -552,10 +551,7 @@ static int node_memory_callback(struct notifier_block *self, return NOTIFY_OK; } #endif /* CONFIG_HUGETLBFS */ -#else /* !CONFIG_MEMORY_HOTPLUG_SPARSE */ - -static int link_mem_sections(int nid) { return 0; } -#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ +#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */ #if !defined(CONFIG_MEMORY_HOTPLUG_SPARSE) || \ !defined(CONFIG_HUGETLBFS) @@ -569,39 +565,32 @@ static void init_node_hugetlb_work(int nid) { } #endif -int register_one_node(int nid) +int __register_one_node(int nid) { - int error = 0; + int p_node = parent_node(nid); + struct node *parent = NULL; + int error; int cpu; - if (node_online(nid)) { - int p_node = parent_node(nid); - struct node *parent = NULL; - - if (p_node != nid) - parent = node_devices[p_node]; - - node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL); - if (!node_devices[nid]) - return -ENOMEM; - - error = register_node(node_devices[nid], nid, parent); + if (p_node != nid) + parent = node_devices[p_node]; - /* link cpu under this node */ - for_each_present_cpu(cpu) { - if (cpu_to_node(cpu) == nid) - register_cpu_under_node(cpu, nid); - } + node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL); + if (!node_devices[nid]) + return -ENOMEM; - /* link memory sections under this node */ - error = link_mem_sections(nid); + error = register_node(node_devices[nid], nid, parent); - /* initialize work queue for memory hot plug */ - init_node_hugetlb_work(nid); + /* link cpu under this node */ + for_each_present_cpu(cpu) { + if (cpu_to_node(cpu) == nid) + register_cpu_under_node(cpu, nid); } - return error; + /* initialize work queue for memory hot plug */ + init_node_hugetlb_work(nid); + return error; } void unregister_one_node(int nid) -- cgit v1.2.1 From 8b0662f245a328df8873be949b0087760420f40c Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Jul 2017 15:37:53 -0700 Subject: mm, memory_hotplug: consider offline memblocks removable is_pageblock_removable_nolock() relies on having zone association to examine all the page blocks to check whether they are movable or free. This is just wasting of cycles when the memblock is offline. Later patch in the series will also change the time when the page is associated with a zone so we let's bail out early if the memblock is offline. Link: http://lkml.kernel.org/r/20170515085827.16474-7-mhocko@kernel.org Signed-off-by: Michal Hocko Reported-by: Igor Mammedov Acked-by: Vlastimil Babka Cc: Andi Kleen Cc: Andrea Arcangeli Cc: Balbir Singh Cc: Dan Williams Cc: Daniel Kiper Cc: David Rientjes Cc: Heiko Carstens Cc: Jerome Glisse Cc: Joonsoo Kim Cc: Martin Schwidefsky Cc: Mel Gorman Cc: Reza Arbab Cc: Tobias Regnery Cc: Toshi Kani Cc: Vitaly Kuznetsov Cc: Xishi Qiu Cc: Yasuaki Ishimatsu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'drivers/base') diff --git a/drivers/base/memory.c b/drivers/base/memory.c index f8fd562c3f18..1e884d82af6f 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -128,6 +128,9 @@ static ssize_t show_mem_removable(struct device *dev, int ret = 1; struct memory_block *mem = to_memory_block(dev); + if (mem->state != MEM_ONLINE) + goto out; + for (i = 0; i < sections_per_block; i++) { if (!present_section_nr(mem->start_section_nr + i)) continue; @@ -135,6 +138,7 @@ static ssize_t show_mem_removable(struct device *dev, ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION); } +out: return sprintf(buf, "%d\n", ret); } -- cgit v1.2.1 From f1dd2cd13c4bbbc9a7c4617b3b034fa643de98fe Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Jul 2017 15:38:11 -0700 Subject: mm, memory_hotplug: do not associate hotadded memory to zones until online The current memory hotplug implementation relies on having all the struct pages associate with a zone/node during the physical hotplug phase (arch_add_memory->__add_pages->__add_section->__add_zone). In the vast majority of cases this means that they are added to ZONE_NORMAL. This has been so since 9d99aaa31f59 ("[PATCH] x86_64: Support memory hotadd without sparsemem") and it wasn't a big deal back then because movable onlining didn't exist yet. Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable onlining 511c2aba8f07 ("mm, memory-hotplug: dynamic configure movable memory and portion memory") and then things got more complicated. Rather than reconsidering the zone association which was no longer needed (because the memory hotplug already depended on SPARSEMEM) a convoluted semantic of zone shifting has been developed. Only the currently last memblock or the one adjacent to the zone_movable can be onlined movable. This essentially means that the online type changes as the new memblocks are added. Let's simulate memory hot online manually $ echo 0x100000000 > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory32/valid_zones Normal Movable $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal /sys/devices/system/memory/memory34/valid_zones:Normal Movable $ echo online_movable > /sys/devices/system/memory/memory34/state $ grep . /sys/devices/system/memory/memory3?/valid_zones /sys/devices/system/memory/memory32/valid_zones:Normal /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Normal This is an awkward semantic because an udev event is sent as soon as the block is onlined and an udev handler might want to online it based on some policy (e.g. association with a node) but it will inherently race with new blocks showing up. This patch changes the physical online phase to not associate pages with any zone at all. All the pages are just marked reserved and wait for the onlining phase to be associated with the zone as per the online request. There are only two requirements - existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap - ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses the latter one is not an inherent requirement and can be changed in the future. It preserves the current behavior and made the code slightly simpler. This is subject to change in future. This means that the same physical online steps as above will lead to the following state: Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Normal Movable /sys/devices/system/memory/memory32/valid_zones:Normal Movable /sys/devices/system/memory/memory33/valid_zones:Normal Movable /sys/devices/system/memory/memory34/valid_zones:Movable Implementation: The current move_pfn_range is reimplemented to check the above requirements (allow_online_pfn_range) and then updates the respective zone (move_pfn_range_to_zone), the pgdat and links all the pages in the pfn range with the zone/node. __add_pages is updated to not require the zone and only initializes sections in the range. This allowed to simplify the arch_add_memory code (s390 could get rid of quite some of code). devm_memremap_pages is the only user of arch_add_memory which relies on the zone association because it only hooks into the memory hotplug only half way. It uses it to associate the new memory with ZONE_DEVICE but doesn't allow it to be {on,off}lined via sysfs. This means that this particular code path has to call move_pfn_range_to_zone explicitly. The original zone shifting code is kept in place and will be removed in the follow up patch for an easier review. Please note that this patch also changes the original behavior when offlining a memory block adjacent to another zone (Normal vs. Movable) used to allow to change its movable type. This will be handled later. [richard.weiyang@gmail.com: simplify zone_intersects()] Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com [richard.weiyang@gmail.com: remove duplicate call for set_page_links] Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com [akpm@linux-foundation.org: remove unused local `i'] Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org Signed-off-by: Michal Hocko Signed-off-by: Wei Yang Tested-by: Dan Williams Tested-by: Reza Arbab Acked-by: Heiko Carstens # For s390 bits Acked-by: Vlastimil Babka Cc: Martin Schwidefsky Cc: Andi Kleen Cc: Andrea Arcangeli Cc: Balbir Singh Cc: Daniel Kiper Cc: David Rientjes Cc: Igor Mammedov Cc: Jerome Glisse Cc: Joonsoo Kim Cc: Mel Gorman Cc: Tobias Regnery Cc: Toshi Kani Cc: Vitaly Kuznetsov Cc: Xishi Qiu Cc: Yasuaki Ishimatsu Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 52 +++++++++++++++++++++++++++------------------------ 1 file changed, 28 insertions(+), 24 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/memory.c b/drivers/base/memory.c index 1e884d82af6f..b86fda30ce62 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -392,39 +392,43 @@ static ssize_t show_valid_zones(struct device *dev, struct device_attribute *attr, char *buf) { struct memory_block *mem = to_memory_block(dev); - unsigned long start_pfn, end_pfn; - unsigned long valid_start, valid_end, valid_pages; + unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; - struct zone *zone; - int zone_shift = 0; + unsigned long valid_start_pfn, valid_end_pfn; + bool append = false; + int nid; - start_pfn = section_nr_to_pfn(mem->start_section_nr); - end_pfn = start_pfn + nr_pages; - - /* The block contains more than one zone can not be offlined. */ - if (!test_pages_in_a_zone(start_pfn, end_pfn, &valid_start, &valid_end)) + /* + * The block contains more than one zone can not be offlined. + * This can happen e.g. for ZONE_DMA and ZONE_DMA32 + */ + if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages, &valid_start_pfn, &valid_end_pfn)) return sprintf(buf, "none\n"); - zone = page_zone(pfn_to_page(valid_start)); - valid_pages = valid_end - valid_start; - - /* MMOP_ONLINE_KEEP */ - sprintf(buf, "%s", zone->name); + start_pfn = valid_start_pfn; + nr_pages = valid_end_pfn - start_pfn; - /* MMOP_ONLINE_KERNEL */ - zone_can_shift(valid_start, valid_pages, ZONE_NORMAL, &zone_shift); - if (zone_shift) { - strcat(buf, " "); - strcat(buf, (zone + zone_shift)->name); + /* + * Check the existing zone. Make sure that we do that only on the + * online nodes otherwise the page_zone is not reliable + */ + if (mem->state == MEM_ONLINE) { + strcat(buf, page_zone(pfn_to_page(start_pfn))->name); + goto out; } - /* MMOP_ONLINE_MOVABLE */ - zone_can_shift(valid_start, valid_pages, ZONE_MOVABLE, &zone_shift); - if (zone_shift) { - strcat(buf, " "); - strcat(buf, (zone + zone_shift)->name); + nid = pfn_to_nid(start_pfn); + if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL)) { + strcat(buf, NODE_DATA(nid)->node_zones[ZONE_NORMAL].name); + append = true; } + if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_MOVABLE)) { + if (append) + strcat(buf, " "); + strcat(buf, NODE_DATA(nid)->node_zones[ZONE_MOVABLE].name); + } +out: strcat(buf, "\n"); return strlen(buf); -- cgit v1.2.1 From c246a213f5bad687c6c2cea27d7265eaf8f6f5d7 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Jul 2017 15:38:18 -0700 Subject: mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zone Heiko Carstens has noticed that he can generate overlapping zones for ZONE_DMA and ZONE_NORMAL: DMA [mem 0x0000000000000000-0x000000007fffffff] Normal [mem 0x0000000080000000-0x000000017fffffff] $ cat /sys/devices/system/memory/block_size_bytes 10000000 $ cat /sys/devices/system/memory/memory5/valid_zones DMA $ echo 0 > /sys/devices/system/memory/memory5/online $ cat /sys/devices/system/memory/memory5/valid_zones Normal $ echo 1 > /sys/devices/system/memory/memory5/online Normal $ cat /proc/zoneinfo Node 0, zone DMA spanned 524288 <----- present 458752 managed 455078 start_pfn: 0 <----- Node 0, zone Normal spanned 720896 present 589824 managed 571648 start_pfn: 327680 <----- The reason is that we assume that the default zone for kernel onlining is ZONE_NORMAL. This was a simplification introduced by the memory hotplug rework and it is easily fixable by checking the range overlap in the zone order and considering the first matching zone as the default one. If there is no such zone then assume ZONE_NORMAL as we have been doing so far. Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online" Link: http://lkml.kernel.org/r/20170601083746.4924-3-mhocko@kernel.org Signed-off-by: Michal Hocko Reported-by: Heiko Carstens Tested-by: Heiko Carstens Acked-by: Vlastimil Babka Cc: Dan Williams Cc: Reza Arbab Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'drivers/base') diff --git a/drivers/base/memory.c b/drivers/base/memory.c index b86fda30ce62..c7c4e0325cdb 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -419,7 +419,7 @@ static ssize_t show_valid_zones(struct device *dev, nid = pfn_to_nid(start_pfn); if (allow_online_pfn_range(nid, start_pfn, nr_pages, MMOP_ONLINE_KERNEL)) { - strcat(buf, NODE_DATA(nid)->node_zones[ZONE_NORMAL].name); + strcat(buf, default_zone_for_pfn(nid, start_pfn, nr_pages)->name); append = true; } -- cgit v1.2.1 From 385386cff4c6f047907655e05791d88198c4c523 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Thu, 6 Jul 2017 15:40:43 -0700 Subject: mm: vmstat: move slab statistics from zone to node counters Patch series "mm: per-lruvec slab stats" Josef is working on a new approach to balancing slab caches and the page cache. For this to work, he needs slab cache statistics on the lruvec level. These patches implement that by adding infrastructure that allows updating and reading generic VM stat items per lruvec, then switches some existing VM accounting sites, including the slab accounting ones, to this new cgroup-aware API. I'll follow up with more patches on this, because there is actually substantial simplification that can be done to the memory controller when we replace private memcg accounting with making the existing VM accounting sites cgroup-aware. But this is enough for Josef to base his slab reclaim work on, so here goes. This patch (of 5): To re-implement slab cache vs. page cache balancing, we'll need the slab counters at the lruvec level, which, ever since lru reclaim was moved from the zone to the node, is the intersection of the node, not the zone, and the memcg. We could retain the per-zone counters for when the page allocator dumps its memory information on failures, and have counters on both levels - which on all but NUMA node 0 is usually redundant. But let's keep it simple for now and just move them. If anybody complains we can restore the per-zone counters. [hannes@cmpxchg.org: fix oops] Link: http://lkml.kernel.org/r/20170605183511.GA8915@cmpxchg.org Link: http://lkml.kernel.org/r/20170530181724.27197-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner Cc: Josef Bacik Cc: Michal Hocko Cc: Vladimir Davydov Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/node.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/node.c b/drivers/base/node.c index 1da0005341a1..6b1ee371ee97 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -129,11 +129,11 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)), nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)), nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), - nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE) + - sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)), - nid, K(sum_zone_node_page_state(nid, NR_SLAB_RECLAIMABLE)), + nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE) + + node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)), + nid, K(node_page_state(pgdat, NR_SLAB_RECLAIMABLE)), #ifdef CONFIG_TRANSPARENT_HUGEPAGE - nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE)), + nid, K(node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE)), nid, K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR), nid, K(node_page_state(pgdat, NR_SHMEM_THPS) * @@ -141,7 +141,7 @@ static ssize_t node_read_meminfo(struct device *dev, nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)); #else - nid, K(sum_zone_node_page_state(nid, NR_SLAB_UNRECLAIMABLE))); + nid, K(node_page_state(pgdat, NR_SLAB_UNRECLAIMABLE))); #endif n += hugetlb_report_node_meminfo(nid, buf + n); return n; -- cgit v1.2.1 From f70029bbaacbfa8f082d2b4988717cba4e269f17 Mon Sep 17 00:00:00 2001 From: Michal Hocko Date: Thu, 6 Jul 2017 15:41:02 -0700 Subject: mm, memory_hotplug: drop CONFIG_MOVABLE_NODE Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a good explanation on why it is actually useful. It makes a lot of sense to make movable node semantic opt in but we already have that because the feature has to be explicitly enabled on the kernel command line. A config option on top only makes the configuration space larger without a good reason. It also adds an additional ifdefery that pollutes the code. Just drop the config option and make it de-facto always enabled. This shouldn't introduce any change to the semantic. Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org Signed-off-by: Michal Hocko Acked-by: Reza Arbab Acked-by: Vlastimil Babka Cc: Mel Gorman Cc: Andrea Arcangeli Cc: Jerome Glisse Cc: Yasuaki Ishimatsu Cc: Xishi Qiu Cc: Kani Toshimitsu Cc: Chen Yucong Cc: Joonsoo Kim Cc: Andi Kleen Cc: David Rientjes Cc: Daniel Kiper Cc: Igor Mammedov Cc: Vitaly Kuznetsov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- drivers/base/node.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'drivers/base') diff --git a/drivers/base/node.c b/drivers/base/node.c index 6b1ee371ee97..73d39bc58c42 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -639,9 +639,7 @@ static struct node_attr node_state_attr[] = { #ifdef CONFIG_HIGHMEM [N_HIGH_MEMORY] = _NODE_ATTR(has_high_memory, N_HIGH_MEMORY), #endif -#ifdef CONFIG_MOVABLE_NODE [N_MEMORY] = _NODE_ATTR(has_memory, N_MEMORY), -#endif [N_CPU] = _NODE_ATTR(has_cpu, N_CPU), }; @@ -652,9 +650,7 @@ static struct attribute *node_state_attrs[] = { #ifdef CONFIG_HIGHMEM &node_state_attr[N_HIGH_MEMORY].attr.attr, #endif -#ifdef CONFIG_MOVABLE_NODE &node_state_attr[N_MEMORY].attr.attr, -#endif &node_state_attr[N_CPU].attr.attr, NULL }; -- cgit v1.2.1