author      Sebastian Andrzej Siewior <bigeasy@linutronix.de>  2022-01-19 18:35:12 +0100
committer   Sebastian Andrzej Siewior <bigeasy@linutronix.de>  2022-01-19 18:35:12 +0100
commit      715da4c19855eb48714f5e1e4a41f7412a850cb3
tree        d3081a76d1009045aa3a34c251859451066ba918
parent      b8f3cf6ce0d58825dd408951611589aaea129838
download    linux-rt-715da4c19855eb48714f5e1e4a41f7412a850cb3.tar.gz
[ANNOUNCE] v5.16.1-rt17 (tag: v5.16.1-rt17-patches)
Dear RT folks!
I'm pleased to announce the v5.16.1-rt17 patch set.
Changes since v5.16.1-rt16:
- Make sure that the local_lock_*() operations are completely optimized
  away on !RT kernels without lock debugging (see the illustrative
  sketch after this list).
- Updates to memcg: Disable the threshold event handlers on RT; they
  are a deprecated cgroup v1 feature.
- i2c:
  - SMBus Host Notify was not working on RT. Reported by Michael Below;
    a fix is included and feedback is pending.
  - The rcar host driver must not disable force threading of its
    interrupt handler.
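As an illustration of the first item above: on a !RT kernel without lock
debugging, local_lock()/local_unlock() should reduce to bare
preempt_disable()/preempt_enable(), with the debug hooks expanding to
nothing so the compiler can drop every trace of the lock. This is a
simplified, hypothetical sketch (helper names made up for illustration),
not the actual <linux/local_lock_internal.h> code:

  /*
   * Hypothetical sketch of the !PREEMPT_RT, !DEBUG_LOCK_ALLOC case.
   * Helper names are invented for illustration only.
   */
  typedef struct { } local_lock_t;

  /* With lock debugging disabled these expand to nothing at all. */
  #define __local_lock_debug_acquire(lock)	do { } while (0)
  #define __local_lock_debug_release(lock)	do { } while (0)

  #define local_lock(lock)				\
  	do {						\
  		preempt_disable();			\
  		__local_lock_debug_acquire(lock);	\
  	} while (0)

  #define local_unlock(lock)				\
  	do {						\
  		__local_lock_debug_release(lock);	\
  		preempt_enable();			\
  	} while (0)

The rt17 change makes the empty local_lock_*() helpers macros so that
exactly this kind of collapse happens and no function call is left
behind (see patches/locking-local_lock-Make-the-empty-local_lock_-functi.patch
below).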
Known issues
- netconsole triggers WARN.
- Valentin Schneider reported a few splats on ARM64, see
https://lkml.kernel.org/r/20210810134127.1394269-1-valentin.schneider@arm.com
The delta patch against v5.16.1-rt16 is appended below and can be found here:
https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.16/incr/patch-5.16.1-rt16-rt17.patch.xz
You can get this release via the git tree at:
git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git v5.16.1-rt17
The RT patch against v5.16.1 can be found here:
https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.16/older/patch-5.16.1-rt17.patch.xz
The split quilt queue is available at:
https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.16/older/patches-5.16.1-rt17.tar.xz
Sebastian
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
15 files changed, 1118 insertions(+), 259 deletions(-)
diff --git a/patches/0001-mm-memcg-Disable-threshold-event-handlers-on-PREEMPT.patch b/patches/0001-mm-memcg-Disable-threshold-event-handlers-on-PREEMPT.patch new file mode 100644 index 000000000000..d299bfa5b069 --- /dev/null +++ b/patches/0001-mm-memcg-Disable-threshold-event-handlers-on-PREEMPT.patch @@ -0,0 +1,846 @@ +From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +Date: Tue, 18 Jan 2022 17:28:07 +0100 +Subject: [PATCH 1/4] mm/memcg: Disable threshold event handlers on PREEMPT_RT +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +During the integration of PREEMPT_RT support, the code flow around +memcg_check_events() resulted in `twisted code'. Moving the code around +and avoiding then would then lead to an additional local-irq-save +section within memcg_check_events(). While looking better, it adds a +local-irq-save section to code flow which is usually within an +local-irq-off block on non-PREEMPT_RT configurations. + +The threshold event handler is a deprecated memcg v1 feature. Instead of +trying to get it to work under PREEMPT_RT just disable it. There should +be no users on PREEMPT_RT. From that perspective it makes even less +sense to get it to work under PREEMPT_RT while having zero users. + +Make memory.soft_limit_in_bytes and cgroup.event_control return +-EOPNOTSUPP on PREEMPT_RT. Make an empty memcg_check_events() and +memcg_write_event_control() which return only -EOPNOTSUPP on PREEMPT_RT. +Document that the two knobs are disabled on PREEMPT_RT. Shuffle the code around +so that all unused function are in on #ifdef block. + +Suggested-by: Michal Hocko <mhocko@kernel.org> +Suggested-by: Michal Koutný <mkoutny@suse.com> +Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +--- + Documentation/admin-guide/cgroup-v1/memory.rst | 2 + mm/memcontrol.c | 728 ++++++++++++------------- + 2 files changed, 374 insertions(+), 356 deletions(-) + +--- a/Documentation/admin-guide/cgroup-v1/memory.rst ++++ b/Documentation/admin-guide/cgroup-v1/memory.rst +@@ -64,6 +64,7 @@ Brief summary of control files. + threads + cgroup.procs show list of processes + cgroup.event_control an interface for event_fd() ++ This knob is not available on CONFIG_PREEMPT_RT systems. + memory.usage_in_bytes show current usage for memory + (See 5.5 for details) + memory.memsw.usage_in_bytes show current usage for memory+Swap +@@ -75,6 +76,7 @@ Brief summary of control files. + memory.max_usage_in_bytes show max memory usage recorded + memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded + memory.soft_limit_in_bytes set/show soft limit of memory usage ++ This knob is not available on CONFIG_PREEMPT_RT systems. + memory.stat show various statistics + memory.use_hierarchy set/show hierarchical account enabled + This knob is deprecated and shouldn't be +--- a/mm/memcontrol.c ++++ b/mm/memcontrol.c +@@ -169,7 +169,6 @@ struct mem_cgroup_event { + struct work_struct remove; + }; + +-static void mem_cgroup_threshold(struct mem_cgroup *memcg); + static void mem_cgroup_oom_notify(struct mem_cgroup *memcg); + + /* Stuffs for move charges at task migration. 
*/ +@@ -521,43 +520,6 @@ static unsigned long soft_limit_excess(s + return excess; + } + +-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid) +-{ +- unsigned long excess; +- struct mem_cgroup_per_node *mz; +- struct mem_cgroup_tree_per_node *mctz; +- +- mctz = soft_limit_tree.rb_tree_per_node[nid]; +- if (!mctz) +- return; +- /* +- * Necessary to update all ancestors when hierarchy is used. +- * because their event counter is not touched. +- */ +- for (; memcg; memcg = parent_mem_cgroup(memcg)) { +- mz = memcg->nodeinfo[nid]; +- excess = soft_limit_excess(memcg); +- /* +- * We have to update the tree if mz is on RB-tree or +- * mem is over its softlimit. +- */ +- if (excess || mz->on_tree) { +- unsigned long flags; +- +- spin_lock_irqsave(&mctz->lock, flags); +- /* if on-tree, remove it */ +- if (mz->on_tree) +- __mem_cgroup_remove_exceeded(mz, mctz); +- /* +- * Insert again. mz->usage_in_excess will be updated. +- * If excess is 0, no tree ops. +- */ +- __mem_cgroup_insert_exceeded(mz, mctz, excess); +- spin_unlock_irqrestore(&mctz->lock, flags); +- } +- } +-} +- + static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg) + { + struct mem_cgroup_tree_per_node *mctz; +@@ -821,50 +783,6 @@ static void mem_cgroup_charge_statistics + __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages); + } + +-static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, +- enum mem_cgroup_events_target target) +-{ +- unsigned long val, next; +- +- val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events); +- next = __this_cpu_read(memcg->vmstats_percpu->targets[target]); +- /* from time_after() in jiffies.h */ +- if ((long)(next - val) < 0) { +- switch (target) { +- case MEM_CGROUP_TARGET_THRESH: +- next = val + THRESHOLDS_EVENTS_TARGET; +- break; +- case MEM_CGROUP_TARGET_SOFTLIMIT: +- next = val + SOFTLIMIT_EVENTS_TARGET; +- break; +- default: +- break; +- } +- __this_cpu_write(memcg->vmstats_percpu->targets[target], next); +- return true; +- } +- return false; +-} +- +-/* +- * Check events in order. 
+- * +- */ +-static void memcg_check_events(struct mem_cgroup *memcg, int nid) +-{ +- /* threshold event is triggered in finer grain than soft limit */ +- if (unlikely(mem_cgroup_event_ratelimit(memcg, +- MEM_CGROUP_TARGET_THRESH))) { +- bool do_softlimit; +- +- do_softlimit = mem_cgroup_event_ratelimit(memcg, +- MEM_CGROUP_TARGET_SOFTLIMIT); +- mem_cgroup_threshold(memcg); +- if (unlikely(do_softlimit)) +- mem_cgroup_update_tree(memcg, nid); +- } +-} +- + struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p) + { + /* +@@ -3751,8 +3669,12 @@ static ssize_t mem_cgroup_write(struct k + } + break; + case RES_SOFT_LIMIT: ++#ifndef CONFIG_PREEMPT_RT + memcg->soft_limit = nr_pages; + ret = 0; ++#else ++ ret = -EOPNOTSUPP; ++#endif + break; + } + return ret ?: nbytes; +@@ -4057,6 +3979,343 @@ static int mem_cgroup_swappiness_write(s + return 0; + } + ++static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg) ++{ ++ struct mem_cgroup_eventfd_list *ev; ++ ++ spin_lock(&memcg_oom_lock); ++ ++ list_for_each_entry(ev, &memcg->oom_notify, list) ++ eventfd_signal(ev->eventfd, 1); ++ ++ spin_unlock(&memcg_oom_lock); ++ return 0; ++} ++ ++static void mem_cgroup_oom_notify(struct mem_cgroup *memcg) ++{ ++ struct mem_cgroup *iter; ++ ++ for_each_mem_cgroup_tree(iter, memcg) ++ mem_cgroup_oom_notify_cb(iter); ++} ++ ++static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v) ++{ ++ struct mem_cgroup *memcg = mem_cgroup_from_seq(sf); ++ ++ seq_printf(sf, "oom_kill_disable %d\n", memcg->oom_kill_disable); ++ seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom); ++ seq_printf(sf, "oom_kill %lu\n", ++ atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL])); ++ return 0; ++} ++ ++static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, ++ struct cftype *cft, u64 val) ++{ ++ struct mem_cgroup *memcg = mem_cgroup_from_css(css); ++ ++ /* cannot set to root cgroup and only 0 and 1 are allowed */ ++ if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1))) ++ return -EINVAL; ++ ++ memcg->oom_kill_disable = val; ++ if (!val) ++ memcg_oom_recover(memcg); ++ ++ return 0; ++} ++ ++#ifdef CONFIG_CGROUP_WRITEBACK ++ ++#include <trace/events/writeback.h> ++ ++static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) ++{ ++ return wb_domain_init(&memcg->cgwb_domain, gfp); ++} ++ ++static void memcg_wb_domain_exit(struct mem_cgroup *memcg) ++{ ++ wb_domain_exit(&memcg->cgwb_domain); ++} ++ ++static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) ++{ ++ wb_domain_size_changed(&memcg->cgwb_domain); ++} ++ ++struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) ++{ ++ struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); ++ ++ if (!memcg->css.parent) ++ return NULL; ++ ++ return &memcg->cgwb_domain; ++} ++ ++/** ++ * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg ++ * @wb: bdi_writeback in question ++ * @pfilepages: out parameter for number of file pages ++ * @pheadroom: out parameter for number of allocatable pages according to memcg ++ * @pdirty: out parameter for number of dirty pages ++ * @pwriteback: out parameter for number of pages under writeback ++ * ++ * Determine the numbers of file, headroom, dirty, and writeback pages in ++ * @wb's memcg. File, dirty and writeback are self-explanatory. Headroom ++ * is a bit more involved. ++ * ++ * A memcg's headroom is "min(max, high) - used". In the hierarchy, the ++ * headroom is calculated as the lowest headroom of itself and the ++ * ancestors. 
Note that this doesn't consider the actual amount of ++ * available memory in the system. The caller should further cap ++ * *@pheadroom accordingly. ++ */ ++void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, ++ unsigned long *pheadroom, unsigned long *pdirty, ++ unsigned long *pwriteback) ++{ ++ struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); ++ struct mem_cgroup *parent; ++ ++ mem_cgroup_flush_stats(); ++ ++ *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); ++ *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); ++ *pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) + ++ memcg_page_state(memcg, NR_ACTIVE_FILE); ++ ++ *pheadroom = PAGE_COUNTER_MAX; ++ while ((parent = parent_mem_cgroup(memcg))) { ++ unsigned long ceiling = min(READ_ONCE(memcg->memory.max), ++ READ_ONCE(memcg->memory.high)); ++ unsigned long used = page_counter_read(&memcg->memory); ++ ++ *pheadroom = min(*pheadroom, ceiling - min(ceiling, used)); ++ memcg = parent; ++ } ++} ++ ++/* ++ * Foreign dirty flushing ++ * ++ * There's an inherent mismatch between memcg and writeback. The former ++ * tracks ownership per-page while the latter per-inode. This was a ++ * deliberate design decision because honoring per-page ownership in the ++ * writeback path is complicated, may lead to higher CPU and IO overheads ++ * and deemed unnecessary given that write-sharing an inode across ++ * different cgroups isn't a common use-case. ++ * ++ * Combined with inode majority-writer ownership switching, this works well ++ * enough in most cases but there are some pathological cases. For ++ * example, let's say there are two cgroups A and B which keep writing to ++ * different but confined parts of the same inode. B owns the inode and ++ * A's memory is limited far below B's. A's dirty ratio can rise enough to ++ * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid ++ * triggering background writeback. A will be slowed down without a way to ++ * make writeback of the dirty pages happen. ++ * ++ * Conditions like the above can lead to a cgroup getting repeatedly and ++ * severely throttled after making some progress after each ++ * dirty_expire_interval while the underlying IO device is almost ++ * completely idle. ++ * ++ * Solving this problem completely requires matching the ownership tracking ++ * granularities between memcg and writeback in either direction. However, ++ * the more egregious behaviors can be avoided by simply remembering the ++ * most recent foreign dirtying events and initiating remote flushes on ++ * them when local writeback isn't enough to keep the memory clean enough. ++ * ++ * The following two functions implement such mechanism. When a foreign ++ * page - a page whose memcg and writeback ownerships don't match - is ++ * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning ++ * bdi_writeback on the page owning memcg. When balance_dirty_pages() ++ * decides that the memcg needs to sleep due to high dirty ratio, it calls ++ * mem_cgroup_flush_foreign() which queues writeback on the recorded ++ * foreign bdi_writebacks which haven't expired. Both the numbers of ++ * recorded bdi_writebacks and concurrent in-flight foreign writebacks are ++ * limited to MEMCG_CGWB_FRN_CNT. ++ * ++ * The mechanism only remembers IDs and doesn't hold any object references. ++ * As being wrong occasionally doesn't matter, updates and accesses to the ++ * records are lockless and racy. 
++ */ ++void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio, ++ struct bdi_writeback *wb) ++{ ++ struct mem_cgroup *memcg = folio_memcg(folio); ++ struct memcg_cgwb_frn *frn; ++ u64 now = get_jiffies_64(); ++ u64 oldest_at = now; ++ int oldest = -1; ++ int i; ++ ++ trace_track_foreign_dirty(folio, wb); ++ ++ /* ++ * Pick the slot to use. If there is already a slot for @wb, keep ++ * using it. If not replace the oldest one which isn't being ++ * written out. ++ */ ++ for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { ++ frn = &memcg->cgwb_frn[i]; ++ if (frn->bdi_id == wb->bdi->id && ++ frn->memcg_id == wb->memcg_css->id) ++ break; ++ if (time_before64(frn->at, oldest_at) && ++ atomic_read(&frn->done.cnt) == 1) { ++ oldest = i; ++ oldest_at = frn->at; ++ } ++ } ++ ++ if (i < MEMCG_CGWB_FRN_CNT) { ++ /* ++ * Re-using an existing one. Update timestamp lazily to ++ * avoid making the cacheline hot. We want them to be ++ * reasonably up-to-date and significantly shorter than ++ * dirty_expire_interval as that's what expires the record. ++ * Use the shorter of 1s and dirty_expire_interval / 8. ++ */ ++ unsigned long update_intv = ++ min_t(unsigned long, HZ, ++ msecs_to_jiffies(dirty_expire_interval * 10) / 8); ++ ++ if (time_before64(frn->at, now - update_intv)) ++ frn->at = now; ++ } else if (oldest >= 0) { ++ /* replace the oldest free one */ ++ frn = &memcg->cgwb_frn[oldest]; ++ frn->bdi_id = wb->bdi->id; ++ frn->memcg_id = wb->memcg_css->id; ++ frn->at = now; ++ } ++} ++ ++/* issue foreign writeback flushes for recorded foreign dirtying events */ ++void mem_cgroup_flush_foreign(struct bdi_writeback *wb) ++{ ++ struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); ++ unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10); ++ u64 now = jiffies_64; ++ int i; ++ ++ for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { ++ struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i]; ++ ++ /* ++ * If the record is older than dirty_expire_interval, ++ * writeback on it has already started. No need to kick it ++ * off again. Also, don't start a new one if there's ++ * already one in flight. ++ */ ++ if (time_after64(frn->at, now - intv) && ++ atomic_read(&frn->done.cnt) == 1) { ++ frn->at = 0; ++ trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id); ++ cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, ++ WB_REASON_FOREIGN_FLUSH, ++ &frn->done); ++ } ++ } ++} ++ ++#else /* CONFIG_CGROUP_WRITEBACK */ ++ ++static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) ++{ ++ return 0; ++} ++ ++static void memcg_wb_domain_exit(struct mem_cgroup *memcg) ++{ ++} ++ ++static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) ++{ ++} ++ ++#endif /* CONFIG_CGROUP_WRITEBACK */ ++ ++#ifndef CONFIG_PREEMPT_RT ++/* ++ * DO NOT USE IN NEW FILES. ++ * ++ * "cgroup.event_control" implementation. ++ * ++ * This is way over-engineered. It tries to support fully configurable ++ * events for each user. Such level of flexibility is completely ++ * unnecessary especially in the light of the planned unified hierarchy. ++ * ++ * Please deprecate this and replace with something simpler if at all ++ * possible. 
++ */ ++ ++static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, ++ enum mem_cgroup_events_target target) ++{ ++ unsigned long val, next; ++ ++ val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events); ++ next = __this_cpu_read(memcg->vmstats_percpu->targets[target]); ++ /* from time_after() in jiffies.h */ ++ if ((long)(next - val) < 0) { ++ switch (target) { ++ case MEM_CGROUP_TARGET_THRESH: ++ next = val + THRESHOLDS_EVENTS_TARGET; ++ break; ++ case MEM_CGROUP_TARGET_SOFTLIMIT: ++ next = val + SOFTLIMIT_EVENTS_TARGET; ++ break; ++ default: ++ break; ++ } ++ __this_cpu_write(memcg->vmstats_percpu->targets[target], next); ++ return true; ++ } ++ return false; ++} ++ ++static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid) ++{ ++ unsigned long excess; ++ struct mem_cgroup_per_node *mz; ++ struct mem_cgroup_tree_per_node *mctz; ++ ++ mctz = soft_limit_tree.rb_tree_per_node[nid]; ++ if (!mctz) ++ return; ++ /* ++ * Necessary to update all ancestors when hierarchy is used. ++ * because their event counter is not touched. ++ */ ++ for (; memcg; memcg = parent_mem_cgroup(memcg)) { ++ mz = memcg->nodeinfo[nid]; ++ excess = soft_limit_excess(memcg); ++ /* ++ * We have to update the tree if mz is on RB-tree or ++ * mem is over its softlimit. ++ */ ++ if (excess || mz->on_tree) { ++ unsigned long flags; ++ ++ spin_lock_irqsave(&mctz->lock, flags); ++ /* if on-tree, remove it */ ++ if (mz->on_tree) ++ __mem_cgroup_remove_exceeded(mz, mctz); ++ /* ++ * Insert again. mz->usage_in_excess will be updated. ++ * If excess is 0, no tree ops. ++ */ ++ __mem_cgroup_insert_exceeded(mz, mctz, excess); ++ spin_unlock_irqrestore(&mctz->lock, flags); ++ } ++ } ++} ++ + static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) + { + struct mem_cgroup_threshold_ary *t; +@@ -4119,6 +4378,25 @@ static void mem_cgroup_threshold(struct + } + } + ++/* ++ * Check events in order. 
++ * ++ */ ++static void memcg_check_events(struct mem_cgroup *memcg, int nid) ++{ ++ /* threshold event is triggered in finer grain than soft limit */ ++ if (unlikely(mem_cgroup_event_ratelimit(memcg, ++ MEM_CGROUP_TARGET_THRESH))) { ++ bool do_softlimit; ++ ++ do_softlimit = mem_cgroup_event_ratelimit(memcg, ++ MEM_CGROUP_TARGET_SOFTLIMIT); ++ mem_cgroup_threshold(memcg); ++ if (unlikely(do_softlimit)) ++ mem_cgroup_update_tree(memcg, nid); ++ } ++} ++ + static int compare_thresholds(const void *a, const void *b) + { + const struct mem_cgroup_threshold *_a = a; +@@ -4133,27 +4411,6 @@ static int compare_thresholds(const void + return 0; + } + +-static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg) +-{ +- struct mem_cgroup_eventfd_list *ev; +- +- spin_lock(&memcg_oom_lock); +- +- list_for_each_entry(ev, &memcg->oom_notify, list) +- eventfd_signal(ev->eventfd, 1); +- +- spin_unlock(&memcg_oom_lock); +- return 0; +-} +- +-static void mem_cgroup_oom_notify(struct mem_cgroup *memcg) +-{ +- struct mem_cgroup *iter; +- +- for_each_mem_cgroup_tree(iter, memcg) +- mem_cgroup_oom_notify_cb(iter); +-} +- + static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg, + struct eventfd_ctx *eventfd, const char *args, enum res_type type) + { +@@ -4382,259 +4639,6 @@ static void mem_cgroup_oom_unregister_ev + spin_unlock(&memcg_oom_lock); + } + +-static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v) +-{ +- struct mem_cgroup *memcg = mem_cgroup_from_seq(sf); +- +- seq_printf(sf, "oom_kill_disable %d\n", memcg->oom_kill_disable); +- seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom); +- seq_printf(sf, "oom_kill %lu\n", +- atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL])); +- return 0; +-} +- +-static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css, +- struct cftype *cft, u64 val) +-{ +- struct mem_cgroup *memcg = mem_cgroup_from_css(css); +- +- /* cannot set to root cgroup and only 0 and 1 are allowed */ +- if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1))) +- return -EINVAL; +- +- memcg->oom_kill_disable = val; +- if (!val) +- memcg_oom_recover(memcg); +- +- return 0; +-} +- +-#ifdef CONFIG_CGROUP_WRITEBACK +- +-#include <trace/events/writeback.h> +- +-static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) +-{ +- return wb_domain_init(&memcg->cgwb_domain, gfp); +-} +- +-static void memcg_wb_domain_exit(struct mem_cgroup *memcg) +-{ +- wb_domain_exit(&memcg->cgwb_domain); +-} +- +-static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) +-{ +- wb_domain_size_changed(&memcg->cgwb_domain); +-} +- +-struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) +-{ +- struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); +- +- if (!memcg->css.parent) +- return NULL; +- +- return &memcg->cgwb_domain; +-} +- +-/** +- * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg +- * @wb: bdi_writeback in question +- * @pfilepages: out parameter for number of file pages +- * @pheadroom: out parameter for number of allocatable pages according to memcg +- * @pdirty: out parameter for number of dirty pages +- * @pwriteback: out parameter for number of pages under writeback +- * +- * Determine the numbers of file, headroom, dirty, and writeback pages in +- * @wb's memcg. File, dirty and writeback are self-explanatory. Headroom +- * is a bit more involved. +- * +- * A memcg's headroom is "min(max, high) - used". 
In the hierarchy, the +- * headroom is calculated as the lowest headroom of itself and the +- * ancestors. Note that this doesn't consider the actual amount of +- * available memory in the system. The caller should further cap +- * *@pheadroom accordingly. +- */ +-void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, +- unsigned long *pheadroom, unsigned long *pdirty, +- unsigned long *pwriteback) +-{ +- struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); +- struct mem_cgroup *parent; +- +- mem_cgroup_flush_stats(); +- +- *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); +- *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); +- *pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) + +- memcg_page_state(memcg, NR_ACTIVE_FILE); +- +- *pheadroom = PAGE_COUNTER_MAX; +- while ((parent = parent_mem_cgroup(memcg))) { +- unsigned long ceiling = min(READ_ONCE(memcg->memory.max), +- READ_ONCE(memcg->memory.high)); +- unsigned long used = page_counter_read(&memcg->memory); +- +- *pheadroom = min(*pheadroom, ceiling - min(ceiling, used)); +- memcg = parent; +- } +-} +- +-/* +- * Foreign dirty flushing +- * +- * There's an inherent mismatch between memcg and writeback. The former +- * tracks ownership per-page while the latter per-inode. This was a +- * deliberate design decision because honoring per-page ownership in the +- * writeback path is complicated, may lead to higher CPU and IO overheads +- * and deemed unnecessary given that write-sharing an inode across +- * different cgroups isn't a common use-case. +- * +- * Combined with inode majority-writer ownership switching, this works well +- * enough in most cases but there are some pathological cases. For +- * example, let's say there are two cgroups A and B which keep writing to +- * different but confined parts of the same inode. B owns the inode and +- * A's memory is limited far below B's. A's dirty ratio can rise enough to +- * trigger balance_dirty_pages() sleeps but B's can be low enough to avoid +- * triggering background writeback. A will be slowed down without a way to +- * make writeback of the dirty pages happen. +- * +- * Conditions like the above can lead to a cgroup getting repeatedly and +- * severely throttled after making some progress after each +- * dirty_expire_interval while the underlying IO device is almost +- * completely idle. +- * +- * Solving this problem completely requires matching the ownership tracking +- * granularities between memcg and writeback in either direction. However, +- * the more egregious behaviors can be avoided by simply remembering the +- * most recent foreign dirtying events and initiating remote flushes on +- * them when local writeback isn't enough to keep the memory clean enough. +- * +- * The following two functions implement such mechanism. When a foreign +- * page - a page whose memcg and writeback ownerships don't match - is +- * dirtied, mem_cgroup_track_foreign_dirty() records the inode owning +- * bdi_writeback on the page owning memcg. When balance_dirty_pages() +- * decides that the memcg needs to sleep due to high dirty ratio, it calls +- * mem_cgroup_flush_foreign() which queues writeback on the recorded +- * foreign bdi_writebacks which haven't expired. Both the numbers of +- * recorded bdi_writebacks and concurrent in-flight foreign writebacks are +- * limited to MEMCG_CGWB_FRN_CNT. +- * +- * The mechanism only remembers IDs and doesn't hold any object references. 
+- * As being wrong occasionally doesn't matter, updates and accesses to the +- * records are lockless and racy. +- */ +-void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio, +- struct bdi_writeback *wb) +-{ +- struct mem_cgroup *memcg = folio_memcg(folio); +- struct memcg_cgwb_frn *frn; +- u64 now = get_jiffies_64(); +- u64 oldest_at = now; +- int oldest = -1; +- int i; +- +- trace_track_foreign_dirty(folio, wb); +- +- /* +- * Pick the slot to use. If there is already a slot for @wb, keep +- * using it. If not replace the oldest one which isn't being +- * written out. +- */ +- for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { +- frn = &memcg->cgwb_frn[i]; +- if (frn->bdi_id == wb->bdi->id && +- frn->memcg_id == wb->memcg_css->id) +- break; +- if (time_before64(frn->at, oldest_at) && +- atomic_read(&frn->done.cnt) == 1) { +- oldest = i; +- oldest_at = frn->at; +- } +- } +- +- if (i < MEMCG_CGWB_FRN_CNT) { +- /* +- * Re-using an existing one. Update timestamp lazily to +- * avoid making the cacheline hot. We want them to be +- * reasonably up-to-date and significantly shorter than +- * dirty_expire_interval as that's what expires the record. +- * Use the shorter of 1s and dirty_expire_interval / 8. +- */ +- unsigned long update_intv = +- min_t(unsigned long, HZ, +- msecs_to_jiffies(dirty_expire_interval * 10) / 8); +- +- if (time_before64(frn->at, now - update_intv)) +- frn->at = now; +- } else if (oldest >= 0) { +- /* replace the oldest free one */ +- frn = &memcg->cgwb_frn[oldest]; +- frn->bdi_id = wb->bdi->id; +- frn->memcg_id = wb->memcg_css->id; +- frn->at = now; +- } +-} +- +-/* issue foreign writeback flushes for recorded foreign dirtying events */ +-void mem_cgroup_flush_foreign(struct bdi_writeback *wb) +-{ +- struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); +- unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10); +- u64 now = jiffies_64; +- int i; +- +- for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { +- struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i]; +- +- /* +- * If the record is older than dirty_expire_interval, +- * writeback on it has already started. No need to kick it +- * off again. Also, don't start a new one if there's +- * already one in flight. +- */ +- if (time_after64(frn->at, now - intv) && +- atomic_read(&frn->done.cnt) == 1) { +- frn->at = 0; +- trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id); +- cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, +- WB_REASON_FOREIGN_FLUSH, +- &frn->done); +- } +- } +-} +- +-#else /* CONFIG_CGROUP_WRITEBACK */ +- +-static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) +-{ +- return 0; +-} +- +-static void memcg_wb_domain_exit(struct mem_cgroup *memcg) +-{ +-} +- +-static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) +-{ +-} +- +-#endif /* CONFIG_CGROUP_WRITEBACK */ +- +-/* +- * DO NOT USE IN NEW FILES. +- * +- * "cgroup.event_control" implementation. +- * +- * This is way over-engineered. It tries to support fully configurable +- * events for each user. Such level of flexibility is completely +- * unnecessary especially in the light of the planned unified hierarchy. +- * +- * Please deprecate this and replace with something simpler if at all +- * possible. +- */ +- + /* + * Unregister event and free resources. 
+ * +@@ -4845,6 +4849,18 @@ static ssize_t memcg_write_event_control + return ret; + } + ++#else ++ ++static ssize_t memcg_write_event_control(struct kernfs_open_file *of, ++ char *buf, size_t nbytes, loff_t off) ++{ ++ return -EOPNOTSUPP; ++} ++ ++static void memcg_check_events(struct mem_cgroup *memcg, int nid) { } ++ ++#endif ++ + static struct cftype mem_cgroup_legacy_files[] = { + { + .name = "usage_in_bytes", diff --git a/patches/0001-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch b/patches/0001-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch deleted file mode 100644 index 9ac4dc028011..000000000000 --- a/patches/0001-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch +++ /dev/null @@ -1,210 +0,0 @@ -From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> -Date: Fri, 17 Dec 2021 18:19:19 +0100 -Subject: [PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption - on PREEMPT_RT - -The per-CPU counter are modified with the non-atomic modifier. The -consistency is ensure by disabling interrupts for the update. -This breaks on PREEMPT_RT because some sections additionally -acquire a spinlock_t lock (which becomes sleeping and must not be -acquired with disabled interrupts). Another problem is that -mem_cgroup_swapout() expects to be invoked with disabled interrupts -because the caller has to acquire a spinlock_t which is acquired with -disabled interrupts. Since spinlock_t never disables interrupts on -PREEMPT_RT the interrupts are never disabled at this point. - -The code is never called from in_irq() context on PREEMPT_RT therefore -disabling preemption during the update is sufficient on PREEMPT_RT. The -sections with disabled preemption must exclude memcg_check_events() so -that spinlock_t locks can still be acquired (for instance in -eventfd_signal()). - -Don't disable interrupts during updates of the per-CPU variables, -instead use shorter sections which disable preemption. - -Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> -Link: https://lkml.kernel.org/r/20211222114111.2206248-2-bigeasy@linutronix.de ---- - mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++++++++----------- - 1 file changed, 60 insertions(+), 14 deletions(-) - ---- a/mm/memcontrol.c -+++ b/mm/memcontrol.c -@@ -671,8 +671,14 @@ void __mod_memcg_state(struct mem_cgroup - if (mem_cgroup_disabled()) - return; - -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_disable(); -+ - __this_cpu_add(memcg->vmstats_percpu->state[idx], val); - memcg_rstat_updated(memcg); -+ -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_enable(); - } - - /* idx can be of type enum memcg_stat_item or node_stat_item. 
*/ -@@ -699,6 +705,9 @@ void __mod_memcg_lruvec_state(struct lru - pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); - memcg = pn->memcg; - -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_disable(); -+ - /* Update memcg */ - __this_cpu_add(memcg->vmstats_percpu->state[idx], val); - -@@ -706,6 +715,9 @@ void __mod_memcg_lruvec_state(struct lru - __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); - - memcg_rstat_updated(memcg); -+ -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_enable(); - } - - /** -@@ -788,8 +800,13 @@ void __count_memcg_events(struct mem_cgr - if (mem_cgroup_disabled()) - return; - -+ if (IS_ENABLED(PREEMPT_RT)) -+ preempt_disable(); -+ - __this_cpu_add(memcg->vmstats_percpu->events[idx], count); - memcg_rstat_updated(memcg); -+ if (IS_ENABLED(PREEMPT_RT)) -+ preempt_enable(); - } - - static unsigned long memcg_events(struct mem_cgroup *memcg, int event) -@@ -810,6 +827,9 @@ static unsigned long memcg_events_local( - static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, - int nr_pages) - { -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_disable(); -+ - /* pagein of a big page is an event. So, ignore page size */ - if (nr_pages > 0) - __count_memcg_events(memcg, PGPGIN, 1); -@@ -819,12 +839,19 @@ static void mem_cgroup_charge_statistics - } - - __this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages); -+ -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_enable(); - } - - static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg, - enum mem_cgroup_events_target target) - { - unsigned long val, next; -+ bool ret = false; -+ -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_disable(); - - val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events); - next = __this_cpu_read(memcg->vmstats_percpu->targets[target]); -@@ -841,9 +868,11 @@ static bool mem_cgroup_event_ratelimit(s - break; - } - __this_cpu_write(memcg->vmstats_percpu->targets[target], next); -- return true; -+ ret = true; - } -- return false; -+ if (IS_ENABLED(CONFIG_PREEMPT_RT)) -+ preempt_enable(); -+ return ret; - } - - /* -@@ -5645,12 +5674,14 @@ static int mem_cgroup_move_account(struc - ret = 0; - nid = folio_nid(folio); - -- local_irq_disable(); -+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) -+ local_irq_disable(); - mem_cgroup_charge_statistics(to, nr_pages); - memcg_check_events(to, nid); - mem_cgroup_charge_statistics(from, -nr_pages); - memcg_check_events(from, nid); -- local_irq_enable(); -+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) -+ local_irq_enable(); - out_unlock: - folio_unlock(folio); - out: -@@ -6670,10 +6701,12 @@ static int charge_memcg(struct folio *fo - css_get(&memcg->css); - commit_charge(folio, memcg); - -- local_irq_disable(); -+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) -+ local_irq_disable(); - mem_cgroup_charge_statistics(memcg, nr_pages); - memcg_check_events(memcg, folio_nid(folio)); -- local_irq_enable(); -+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) -+ local_irq_enable(); - out: - return ret; - } -@@ -6785,11 +6818,20 @@ static void uncharge_batch(const struct - memcg_oom_recover(ug->memcg); - } - -- local_irq_save(flags); -- __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); -- __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory); -- memcg_check_events(ug->memcg, ug->nid); -- local_irq_restore(flags); -+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { -+ local_irq_save(flags); -+ __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); -+ __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory); -+ memcg_check_events(ug->memcg, 
ug->nid); -+ local_irq_restore(flags); -+ } else { -+ preempt_disable(); -+ __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout); -+ __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory); -+ preempt_enable(); -+ -+ memcg_check_events(ug->memcg, ug->nid); -+ } - - /* drop reference from uncharge_folio */ - css_put(&ug->memcg->css); -@@ -6930,10 +6972,12 @@ void mem_cgroup_migrate(struct folio *ol - css_get(&memcg->css); - commit_charge(new, memcg); - -- local_irq_save(flags); -+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) -+ local_irq_save(flags); - mem_cgroup_charge_statistics(memcg, nr_pages); - memcg_check_events(memcg, folio_nid(new)); -- local_irq_restore(flags); -+ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) -+ local_irq_restore(flags); - } - - DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key); -@@ -7157,8 +7201,10 @@ void mem_cgroup_swapout(struct page *pag - * i_pages lock which is taken with interrupts-off. It is - * important here to have the interrupts disabled because it is the - * only synchronisation we have for updating the per-CPU variables. -+ * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU -+ * variables are synchronised by keeping preemption disabled. - */ -- VM_BUG_ON(!irqs_disabled()); -+ VM_BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT) && !irqs_disabled()); - mem_cgroup_charge_statistics(memcg, -nr_entries); - memcg_check_events(memcg, page_to_nid(page)); - diff --git a/patches/0001_random_remove_unused_irq_flags_argument_from_add_interrupt_randomness.patch b/patches/0001_random_remove_unused_irq_flags_argument_from_add_interrupt_randomness.patch index e596a409f7ec..01dcb9789602 100644 --- a/patches/0001_random_remove_unused_irq_flags_argument_from_add_interrupt_randomness.patch +++ b/patches/0001_random_remove_unused_irq_flags_argument_from_add_interrupt_randomness.patch @@ -53,7 +53,7 @@ Link: https://lore.kernel.org/r/20211207121737.2347312-2-bigeasy@linutronix.de * void add_disk_randomness(struct gendisk *disk); * * add_device_randomness() is for adding data to the random pool that -@@ -1242,7 +1242,7 @@ static __u32 get_reg(struct fast_pool *f +@@ -1260,7 +1260,7 @@ static __u32 get_reg(struct fast_pool *f return *ptr; } diff --git a/patches/0002-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch b/patches/0002-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch new file mode 100644 index 000000000000..7ed1f73e4ea1 --- /dev/null +++ b/patches/0002-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch @@ -0,0 +1,85 @@ +From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +Date: Fri, 17 Dec 2021 18:19:19 +0100 +Subject: [PATCH 2/4] mm/memcg: Protect per-CPU counter by disabling preemption + on PREEMPT_RT where needed. + +The per-CPU counter are modified with the non-atomic modifier. The +consistency is ensured by disabling interrupts for the update. +On non PREEMPT_RT configuration this works because acquiring a +spinlock_t typed lock with the _irq() suffix disables interrupts. On +PREEMPT_RT configurations the RMW operation can be interrupted. + +Another problem is that mem_cgroup_swapout() expects to be invoked with +disabled interrupts because the caller has to acquire a spinlock_t which +is acquired with disabled interrupts. Since spinlock_t never disables +interrupts on PREEMPT_RT the interrupts are never disabled at this +point. + +The code is never called from in_irq() context on PREEMPT_RT therefore +disabling preemption during the update is sufficient on PREEMPT_RT. 
+The sections which explicitly disable interrupts can remain on +PREEMPT_RT because the sections remain short and they don't involve +sleeping locks (memcg_check_events() is doing nothing on PREEMPT_RT). + +Disable preemption during update of the per-CPU variables which do not +explicitly disable interrupts. + +Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +--- + mm/memcontrol.c | 21 +++++++++++++++++++-- + 1 file changed, 19 insertions(+), 2 deletions(-) + +--- a/mm/memcontrol.c ++++ b/mm/memcontrol.c +@@ -661,6 +661,8 @@ void __mod_memcg_lruvec_state(struct lru + pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec); + memcg = pn->memcg; + ++ if (IS_ENABLED(CONFIG_PREEMPT_RT)) ++ preempt_disable(); + /* Update memcg */ + __this_cpu_add(memcg->vmstats_percpu->state[idx], val); + +@@ -668,6 +670,8 @@ void __mod_memcg_lruvec_state(struct lru + __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val); + + memcg_rstat_updated(memcg); ++ if (IS_ENABLED(CONFIG_PREEMPT_RT)) ++ preempt_enable(); + } + + /** +@@ -750,8 +754,12 @@ void __count_memcg_events(struct mem_cgr + if (mem_cgroup_disabled()) + return; + ++ if (IS_ENABLED(PREEMPT_RT)) ++ preempt_disable(); + __this_cpu_add(memcg->vmstats_percpu->events[idx], count); + memcg_rstat_updated(memcg); ++ if (IS_ENABLED(PREEMPT_RT)) ++ preempt_enable(); + } + + static unsigned long memcg_events(struct mem_cgroup *memcg, int event) +@@ -7173,9 +7181,18 @@ void mem_cgroup_swapout(struct page *pag + * i_pages lock which is taken with interrupts-off. It is + * important here to have the interrupts disabled because it is the + * only synchronisation we have for updating the per-CPU variables. ++ * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU ++ * variables are synchronised by keeping preemption disabled. + */ +- VM_BUG_ON(!irqs_disabled()); +- mem_cgroup_charge_statistics(memcg, -nr_entries); ++ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { ++ VM_BUG_ON(!irqs_disabled()); ++ mem_cgroup_charge_statistics(memcg, -nr_entries); ++ } else { ++ preempt_disable(); ++ mem_cgroup_charge_statistics(memcg, -nr_entries); ++ preempt_enable(); ++ } ++ + memcg_check_events(memcg, page_to_nid(page)); + + css_put(&memcg->css); diff --git a/patches/0002-mm-memcg-Add-a-local_lock_t-for-IRQ-and-TASK-object.patch b/patches/0003-mm-memcg-Add-a-local_lock_t-for-IRQ-and-TASK-object.patch index 42b457e4a0f2..13ba236cffa9 100644 --- a/patches/0002-mm-memcg-Add-a-local_lock_t-for-IRQ-and-TASK-object.patch +++ b/patches/0003-mm-memcg-Add-a-local_lock_t-for-IRQ-and-TASK-object.patch @@ -1,6 +1,6 @@ From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Mon, 20 Dec 2021 11:14:17 +0100 -Subject: [PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object. +Subject: [PATCH 3/4] mm/memcg: Add a local_lock_t for IRQ and TASK object. The members of the per-CPU structure memcg_stock_pcp are protected either by disabling interrupts or by disabling preemption if the @@ -50,14 +50,13 @@ interrupts with a local_lock_t. This change requires some factoring: complains. 
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> -Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de --- mm/memcontrol.c | 176 ++++++++++++++++++++++++++++++++++++-------------------- 1 file changed, 115 insertions(+), 61 deletions(-) --- a/mm/memcontrol.c +++ b/mm/memcontrol.c -@@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void) +@@ -260,8 +260,10 @@ bool mem_cgroup_kmem_disabled(void) return cgroup_memory_nokmem; } @@ -69,7 +68,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de static void obj_cgroup_release(struct percpu_ref *ref) { -@@ -296,7 +298,7 @@ static void obj_cgroup_release(struct pe +@@ -295,7 +297,7 @@ static void obj_cgroup_release(struct pe nr_pages = nr_bytes >> PAGE_SHIFT; if (nr_pages) @@ -78,7 +77,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de spin_lock_irqsave(&css_set_lock, flags); list_del(&objcg->list); -@@ -2120,26 +2122,40 @@ struct obj_stock { +@@ -2017,26 +2019,40 @@ struct obj_stock { }; struct memcg_stock_pcp { @@ -122,7 +121,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de } static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, struct mem_cgroup *root_memcg) -@@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgr +@@ -2065,7 +2081,7 @@ static bool consume_stock(struct mem_cgr if (nr_pages > MEMCG_CHARGE_BATCH) return ret; @@ -131,7 +130,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de stock = this_cpu_ptr(&memcg_stock); if (memcg == stock->cached && stock->nr_pages >= nr_pages) { -@@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgr +@@ -2073,7 +2089,7 @@ static bool consume_stock(struct mem_cgr ret = true; } @@ -140,7 +139,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de return ret; } -@@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_sto +@@ -2101,38 +2117,43 @@ static void drain_stock(struct memcg_sto static void drain_local_stock(struct work_struct *dummy) { @@ -192,16 +191,16 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de { - struct memcg_stock_pcp *stock; - unsigned long flags; +- +- local_irq_save(flags); + struct memcg_stock_pcp *stock = this_cpu_ptr(&memcg_stock); -- local_irq_save(flags); -- - stock = this_cpu_ptr(&memcg_stock); + lockdep_assert_held(&stock->stock_lock); if (stock->cached != memcg) { /* reset if necessary */ drain_stock(stock); css_get(&memcg->css); -@@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgro +@@ -2142,8 +2163,20 @@ static void refill_stock(struct mem_cgro if (stock->nr_pages > MEMCG_CHARGE_BATCH) drain_stock(stock); @@ -223,7 +222,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de } /* -@@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgro +@@ -2152,7 +2185,7 @@ static void refill_stock(struct mem_cgro */ static void drain_all_stock(struct mem_cgroup *root_memcg) { @@ -232,7 +231,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de /* If someone's already draining, avoid adding running more workers. */ if (!mutex_trylock(&percpu_charge_mutex)) -@@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_c +@@ -2163,7 +2196,7 @@ static void drain_all_stock(struct mem_c * as well as workers from this path always operate on the local * per-cpu data. CPU up doesn't touch memcg_stock at all. 
*/ @@ -241,7 +240,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de for_each_online_cpu(cpu) { struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu); struct mem_cgroup *memcg; -@@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_c +@@ -2179,14 +2212,10 @@ static void drain_all_stock(struct mem_c rcu_read_unlock(); if (flush && @@ -259,7 +258,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de mutex_unlock(&percpu_charge_mutex); } -@@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_c +@@ -2587,7 +2616,7 @@ static int try_charge_memcg(struct mem_c done_restock: if (batch > nr_pages) @@ -268,7 +267,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de /* * If the hierarchy is above the normal consumption range, schedule -@@ -2803,28 +2832,36 @@ static struct mem_cgroup *get_mem_cgroup +@@ -2700,28 +2729,36 @@ static struct mem_cgroup *get_mem_cgroup * can only be accessed after disabling interrupt. User context code can * access interrupt object stock, but not vice versa. */ @@ -314,7 +313,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de } /* -@@ -3002,7 +3039,8 @@ static void memcg_free_cache_id(int id) +@@ -2899,7 +2936,8 @@ static void memcg_free_cache_id(int id) * @nr_pages: number of pages to uncharge */ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg, @@ -324,7 +323,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de { struct mem_cgroup *memcg; -@@ -3010,7 +3048,7 @@ static void obj_cgroup_uncharge_pages(st +@@ -2907,7 +2945,7 @@ static void obj_cgroup_uncharge_pages(st if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) page_counter_uncharge(&memcg->kmem, nr_pages); @@ -333,7 +332,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de css_put(&memcg->css); } -@@ -3084,7 +3122,7 @@ void __memcg_kmem_uncharge_page(struct p +@@ -2981,7 +3019,7 @@ void __memcg_kmem_uncharge_page(struct p return; objcg = __folio_objcg(folio); @@ -342,7 +341,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de folio->memcg_data = 0; obj_cgroup_put(objcg); } -@@ -3092,17 +3130,21 @@ void __memcg_kmem_uncharge_page(struct p +@@ -2989,17 +3027,21 @@ void __memcg_kmem_uncharge_page(struct p void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, enum node_stat_item idx, int nr) { @@ -366,7 +365,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de obj_cgroup_get(objcg); stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) ? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0; -@@ -3146,38 +3188,43 @@ void mod_objcg_state(struct obj_cgroup * +@@ -3043,38 +3085,43 @@ void mod_objcg_state(struct obj_cgroup * if (nr) mod_objcg_mlstate(objcg, pgdat, idx, nr); @@ -416,7 +415,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de /* * The leftover is flushed to the centralized per-memcg value. 
-@@ -3212,8 +3259,8 @@ static void drain_obj_stock(struct obj_s +@@ -3109,8 +3156,8 @@ static void drain_obj_stock(struct obj_s stock->cached_pgdat = NULL; } @@ -426,7 +425,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de } static bool obj_stock_flush_required(struct memcg_stock_pcp *stock, -@@ -3221,11 +3268,13 @@ static bool obj_stock_flush_required(str +@@ -3118,11 +3165,13 @@ static bool obj_stock_flush_required(str { struct mem_cgroup *memcg; @@ -440,7 +439,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de if (stock->irq_obj.cached_objcg) { memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg); if (memcg && mem_cgroup_is_descendant(memcg, root_memcg)) -@@ -3238,12 +3287,15 @@ static bool obj_stock_flush_required(str +@@ -3135,12 +3184,15 @@ static bool obj_stock_flush_required(str static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, bool allow_uncharge) { @@ -458,7 +457,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de obj_cgroup_get(objcg); stock->cached_objcg = objcg; stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes) -@@ -3257,10 +3309,12 @@ static void refill_obj_stock(struct obj_ +@@ -3154,10 +3206,12 @@ static void refill_obj_stock(struct obj_ stock->nr_bytes &= (PAGE_SIZE - 1); } @@ -473,7 +472,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-3-bigeasy@linutronix.de } int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size) -@@ -7061,7 +7115,7 @@ void mem_cgroup_uncharge_skmem(struct me +@@ -7041,7 +7095,7 @@ void mem_cgroup_uncharge_skmem(struct me mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages); diff --git a/patches/0003_random_split_add_interrupt_randomness.patch b/patches/0003_random_split_add_interrupt_randomness.patch index bd0750993a37..d94c0ca75a94 100644 --- a/patches/0003_random_split_add_interrupt_randomness.patch +++ b/patches/0003_random_split_add_interrupt_randomness.patch @@ -21,7 +21,7 @@ Link: https://lore.kernel.org/r/20211207121737.2347312-4-bigeasy@linutronix.de --- a/drivers/char/random.c +++ b/drivers/char/random.c -@@ -1242,29 +1242,10 @@ static __u32 get_reg(struct fast_pool *f +@@ -1260,29 +1260,10 @@ static __u32 get_reg(struct fast_pool *f return *ptr; } @@ -52,7 +52,7 @@ Link: https://lore.kernel.org/r/20211207121737.2347312-4-bigeasy@linutronix.de if (unlikely(crng_init == 0)) { if ((fast_pool->count >= 64) && -@@ -1293,6 +1274,32 @@ void add_interrupt_randomness(int irq) +@@ -1311,6 +1292,32 @@ void add_interrupt_randomness(int irq) /* award one bit for the contents of the fast pool */ credit_entropy_bits(r, 1); } diff --git a/patches/0003-mm-memcg-Allow-the-task_obj-optimization-only-on-non.patch b/patches/0004-mm-memcg-Allow-the-task_obj-optimization-only-on-non.patch index 624edfa49778..1bfba2709995 100644 --- a/patches/0003-mm-memcg-Allow-the-task_obj-optimization-only-on-non.patch +++ b/patches/0004-mm-memcg-Allow-the-task_obj-optimization-only-on-non.patch @@ -1,6 +1,6 @@ From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Date: Wed, 22 Dec 2021 12:16:27 +0100 -Subject: [PATCH 3/3] mm/memcg: Allow the task_obj optimization only on +Subject: [PATCH 4/4] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels. 
Based on my understanding the optimisation with task_obj for in_task() @@ -11,17 +11,21 @@ With CONFIG_PREEMPT_DYNAMIC a non-PREEMPTIBLE kernel can also be configured but these kernels always have preempt_disable()/enable() present so it probably makes no sense here for the optimisation. +I did a micro benchmark with disabled interrupts and a loop of +100.000.000 invokcations of kfree(kmalloc()). Based on the results it +makes no sense to add an exception based on dynamic preemption. + Restrict the optimisation to !CONFIG_PREEMPTION kernels. +Link: https://lore.kernel.org/all/YdX+INO9gQje6d0S@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> -Link: https://lkml.kernel.org/r/20211222114111.2206248-4-bigeasy@linutronix.de --- mm/memcontrol.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) --- a/mm/memcontrol.c +++ b/mm/memcontrol.c -@@ -2126,7 +2126,7 @@ struct memcg_stock_pcp { +@@ -2023,7 +2023,7 @@ struct memcg_stock_pcp { local_lock_t stock_lock; struct mem_cgroup *cached; /* this never be root cgroup */ unsigned int nr_pages; @@ -30,7 +34,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-4-bigeasy@linutronix.de /* Protects only task_obj */ local_lock_t task_obj_lock; struct obj_stock task_obj; -@@ -2139,7 +2139,7 @@ struct memcg_stock_pcp { +@@ -2036,7 +2036,7 @@ struct memcg_stock_pcp { }; static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { .stock_lock = INIT_LOCAL_LOCK(stock_lock), @@ -39,7 +43,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-4-bigeasy@linutronix.de .task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock), #endif }; -@@ -2228,7 +2228,7 @@ static void drain_local_stock(struct wor +@@ -2125,7 +2125,7 @@ static void drain_local_stock(struct wor * drain_stock races is that we always operate on local CPU stock * here with IRQ disabled */ @@ -48,7 +52,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-4-bigeasy@linutronix.de local_lock(&memcg_stock.task_obj_lock); old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL); local_unlock(&memcg_stock.task_obj_lock); -@@ -2837,7 +2837,7 @@ static inline struct obj_stock *get_obj_ +@@ -2734,7 +2734,7 @@ static inline struct obj_stock *get_obj_ { struct memcg_stock_pcp *stock; @@ -57,7 +61,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-4-bigeasy@linutronix.de if (likely(in_task())) { *pflags = 0UL; *stock_lock_acquried = false; -@@ -2855,7 +2855,7 @@ static inline struct obj_stock *get_obj_ +@@ -2752,7 +2752,7 @@ static inline struct obj_stock *get_obj_ static inline void put_obj_stock(unsigned long flags, bool stock_lock_acquried) { @@ -66,7 +70,7 @@ Link: https://lkml.kernel.org/r/20211222114111.2206248-4-bigeasy@linutronix.de if (likely(!stock_lock_acquried)) { local_unlock(&memcg_stock.task_obj_lock); return; -@@ -3268,7 +3268,7 @@ static bool obj_stock_flush_required(str +@@ -3165,7 +3165,7 @@ static bool obj_stock_flush_required(str { struct mem_cgroup *memcg; diff --git a/patches/0004_random_move_the_fast_pool_reset_into_the_caller.patch b/patches/0004_random_move_the_fast_pool_reset_into_the_caller.patch index 62e8ae44d325..9c8b5dc57660 100644 --- a/patches/0004_random_move_the_fast_pool_reset_into_the_caller.patch +++ b/patches/0004_random_move_the_fast_pool_reset_into_the_caller.patch @@ -16,7 +16,7 @@ Link: https://lore.kernel.org/r/20211207121737.2347312-5-bigeasy@linutronix.de --- a/drivers/char/random.c +++ b/drivers/char/random.c -@@ -1242,37 +1242,35 @@ static __u32 get_reg(struct fast_pool *f +@@ -1260,37 +1260,35 
@@ static __u32 get_reg(struct fast_pool *f return *ptr; } @@ -65,7 +65,7 @@ Link: https://lore.kernel.org/r/20211207121737.2347312-5-bigeasy@linutronix.de } void add_interrupt_randomness(int irq) -@@ -1298,7 +1296,10 @@ void add_interrupt_randomness(int irq) +@@ -1316,7 +1314,10 @@ void add_interrupt_randomness(int irq) fast_mix(fast_pool); add_interrupt_bench(cycles); diff --git a/patches/0005_random_defer_processing_of_randomness_on_preempt_rt.patch b/patches/0005_random_defer_processing_of_randomness_on_preempt_rt.patch index 6a1164699e75..70bb19289067 100644 --- a/patches/0005_random_defer_processing_of_randomness_on_preempt_rt.patch +++ b/patches/0005_random_defer_processing_of_randomness_on_preempt_rt.patch @@ -39,7 +39,7 @@ Link: https://lore.kernel.org/r/20211207121737.2347312-6-bigeasy@linutronix.de --- a/drivers/char/random.c +++ b/drivers/char/random.c -@@ -1273,6 +1273,32 @@ static bool process_interrupt_randomness +@@ -1291,6 +1291,32 @@ static bool process_interrupt_randomness return true; } @@ -72,7 +72,7 @@ Link: https://lore.kernel.org/r/20211207121737.2347312-6-bigeasy@linutronix.de void add_interrupt_randomness(int irq) { struct fast_pool *fast_pool = this_cpu_ptr(&irq_randomness); -@@ -1296,9 +1322,16 @@ void add_interrupt_randomness(int irq) +@@ -1314,9 +1340,16 @@ void add_interrupt_randomness(int irq) fast_mix(fast_pool); add_interrupt_bench(cycles); diff --git a/patches/Add_localversion_for_-RT_release.patch b/patches/Add_localversion_for_-RT_release.patch index 22146ab020cb..efeddd431fc4 100644 --- a/patches/Add_localversion_for_-RT_release.patch +++ b/patches/Add_localversion_for_-RT_release.patch @@ -15,4 +15,4 @@ Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- /dev/null +++ b/localversion-rt @@ -0,0 +1 @@ -+-rt16 ++-rt17 diff --git a/patches/i2c-core-Let-i2c_handle_smbus_host_notify-use-handle.patch b/patches/i2c-core-Let-i2c_handle_smbus_host_notify-use-handle.patch new file mode 100644 index 000000000000..719ffc5da81d --- /dev/null +++ b/patches/i2c-core-Let-i2c_handle_smbus_host_notify-use-handle.patch @@ -0,0 +1,40 @@ +From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +Date: Wed, 19 Jan 2022 16:10:39 +0100 +Subject: [PATCH] i2c: core: Let i2c_handle_smbus_host_notify() use + handle_nested_irq() on PREEMPT_RT. + +The i2c-i801 driver invokes i2c_handle_smbus_host_notify() from his +interrupt service routine. On PREEMPT_RT i2c-i801's handler is forced +threaded with enabled interrupts which leads to a warning by +handle_irq_event_percpu() assuming that irq_default_primary_handler() +enabled interrupts. + +i2c-i801's interrupt handler can't be made non-threaded because the +interrupt line is shared with other devices. +All i2c host driver's interrupt handler are (force-)threaded on +PREEMPT_RT. + +Handle the IRQs by invoking handle_nested_irq() on PREEMPT_RT. 
+ +Reported-by: Michael Below <below@judiz.de> +Link: https://bugs.debian.org/1002537 +Cc: Salvatore Bonaccorso <carnil@debian.org> +Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +--- + drivers/i2c/i2c-core-base.c | 5 ++++- + 1 file changed, 4 insertions(+), 1 deletion(-) + +--- a/drivers/i2c/i2c-core-base.c ++++ b/drivers/i2c/i2c-core-base.c +@@ -1423,7 +1423,10 @@ int i2c_handle_smbus_host_notify(struct + if (irq <= 0) + return -ENXIO; + +- generic_handle_irq(irq); ++ if (!IS_ENABLED(CONFIG_PREEMPT_RT)) ++ generic_handle_irq(irq); ++ else ++ handle_nested_irq(irq); + + return 0; + } diff --git a/patches/i2c-rcar-Allow-interrupt-handler-to-be-threaded.patch b/patches/i2c-rcar-Allow-interrupt-handler-to-be-threaded.patch new file mode 100644 index 000000000000..bccd17f553fd --- /dev/null +++ b/patches/i2c-rcar-Allow-interrupt-handler-to-be-threaded.patch @@ -0,0 +1,49 @@ +From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +Date: Wed, 19 Jan 2022 15:56:50 +0100 +Subject: [PATCH] i2c: rcar: Allow interrupt handler to be threaded. + +This is a revert of commit + 24c6d4bc56388 ("i2c: rcar: make sure irq is not threaded on Gen2 and earlier") + +The IRQ-handler is not threaded unless requested. On PREEMPT_RT the +handler must be threaded because the handler acquires spinlock_t locks +which are sleeping locks on PREEMPT_RT. This is either via the DMA API +or the possible wake_up() invocation. + +Remove IRQF_NO_THREAD from flags passed to request_irq(). + +Fixes: 24c6d4bc56388 ("i2c: rcar: make sure irq is not threaded on Gen2 and earlier") +Cc: Wolfram Sang <wsa+renesas@sang-engineering.com> +Cc: linux-renesas-soc@vger.kernel.org +Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +--- + drivers/i2c/busses/i2c-rcar.c | 4 +--- + 1 file changed, 1 insertion(+), 3 deletions(-) + +--- a/drivers/i2c/busses/i2c-rcar.c ++++ b/drivers/i2c/busses/i2c-rcar.c +@@ -1025,7 +1025,6 @@ static int rcar_i2c_probe(struct platfor + struct rcar_i2c_priv *priv; + struct i2c_adapter *adap; + struct device *dev = &pdev->dev; +- unsigned long irqflags = 0; + irqreturn_t (*irqhandler)(int irq, void *ptr) = rcar_i2c_gen3_irq; + int ret; + +@@ -1076,7 +1075,6 @@ static int rcar_i2c_probe(struct platfor + rcar_i2c_write(priv, ICSAR, 0); /* Gen2: must be 0 if not using slave */ + + if (priv->devtype < I2C_RCAR_GEN3) { +- irqflags |= IRQF_NO_THREAD; + irqhandler = rcar_i2c_gen2_irq; + } + +@@ -1102,7 +1100,7 @@ static int rcar_i2c_probe(struct platfor + if (ret < 0) + goto out_pm_disable; + priv->irq = ret; +- ret = devm_request_irq(dev, priv->irq, irqhandler, irqflags, dev_name(dev), priv); ++ ret = devm_request_irq(dev, priv->irq, irqhandler, 0, dev_name(dev), priv); + if (ret < 0) { + dev_err(dev, "cannot get irq %d\n", priv->irq); + goto out_pm_disable; diff --git a/patches/locking-local_lock-Make-the-empty-local_lock_-functi.patch b/patches/locking-local_lock-Make-the-empty-local_lock_-functi.patch new file mode 100644 index 000000000000..52501259d245 --- /dev/null +++ b/patches/locking-local_lock-Make-the-empty-local_lock_-functi.patch @@ -0,0 +1,40 @@ +From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +Date: Wed, 5 Jan 2022 10:53:55 +0100 +Subject: [PATCH] locking/local_lock: Make the empty local_lock_*() function a + macro. + +It has been said that local_lock() does not add any overhead compared to +preempt_disable() in a !LOCKDEP configuration. A microbenchmark showed +an unexpected result which can be reduced to the fact that local_lock() +was not entirely optimized away. 
+In the !LOCKDEP configuration local_lock_acquire() is an empty static +inline function. On x86 the this_cpu_ptr() argument of that function is +fully evaluated, leading to additional mov+add instructions which are +not needed and not used. + +Replace the static inline function with a macro. The typecheck() macro +ensures that the argument is of proper type while the resulting +disassembly shows no traces of this_cpu_ptr(). + +Link: https://lkml.kernel.org/r/20220105202623.1118172-1-bigeasy@linutronix.de +Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> +Reviewed-by: Waiman Long <longman@redhat.com> +--- + include/linux/local_lock_internal.h | 6 +++--- + 1 file changed, 3 insertions(+), 3 deletions(-) + +--- a/include/linux/local_lock_internal.h ++++ b/include/linux/local_lock_internal.h +@@ -44,9 +44,9 @@ static inline void local_lock_debug_init + } + #else /* CONFIG_DEBUG_LOCK_ALLOC */ + # define LOCAL_LOCK_DEBUG_INIT(lockname) +-static inline void local_lock_acquire(local_lock_t *l) { } +-static inline void local_lock_release(local_lock_t *l) { } +-static inline void local_lock_debug_init(local_lock_t *l) { } ++# define local_lock_acquire(__ll) do { typecheck(local_lock_t *, __ll); } while (0) ++# define local_lock_release(__ll) do { typecheck(local_lock_t *, __ll); } while (0) ++# define local_lock_debug_init(__ll) do { typecheck(local_lock_t *, __ll); } while (0) + #endif /* !CONFIG_DEBUG_LOCK_ALLOC */ + + #define INIT_LOCAL_LOCK(lockname) { LOCAL_LOCK_DEBUG_INIT(lockname) } diff --git a/patches/printk__remove_deferred_printing.patch b/patches/printk__remove_deferred_printing.patch index 6e99ed585191..17af3e09682d 100644 --- a/patches/printk__remove_deferred_printing.patch +++ b/patches/printk__remove_deferred_printing.patch @@ -162,7 +162,7 @@ Signed-off-by: Thomas Gleixner <tglx@linutronix.de> ({ \ --- a/drivers/char/random.c +++ b/drivers/char/random.c -@@ -1507,9 +1507,8 @@ static void _warn_unseeded_randomness(co +@@ -1525,9 +1525,8 @@ static void _warn_unseeded_randomness(co print_once = true; #endif if (__ratelimit(&unseeded_warning)) @@ -782,7 +782,7 @@ Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- a/kernel/workqueue.c +++ b/kernel/workqueue.c -@@ -4826,9 +4826,7 @@ void show_one_workqueue(struct workqueue +@@ -4845,9 +4845,7 @@ void show_one_workqueue(struct workqueue * drivers that queue work while holding locks * also taken in their write paths. */ @@ -792,7 +792,7 @@ Signed-off-by: Thomas Gleixner <tglx@linutronix.de> } raw_spin_unlock_irqrestore(&pwq->pool->lock, flags); /* -@@ -4859,7 +4857,6 @@ static void show_one_worker_pool(struct +@@ -4878,7 +4876,6 @@ static void show_one_worker_pool(struct * queue work while holding locks also taken in their write * paths.
*/ @@ -800,7 +800,7 @@ Signed-off-by: Thomas Gleixner <tglx@linutronix.de> pr_info("pool %d:", pool->id); pr_cont_pool_info(pool); pr_cont(" hung=%us workers=%d", -@@ -4874,7 +4871,6 @@ static void show_one_worker_pool(struct +@@ -4893,7 +4890,6 @@ static void show_one_worker_pool(struct first = false; } pr_cont("\n"); diff --git a/patches/series b/patches/series index 44e701ef85e9..b8a4fe43d56a 100644 --- a/patches/series +++ b/patches/series @@ -58,6 +58,7 @@ smp_wake_ksoftirqd_on_preempt_rt_instead_do_softirq.patch fscache-Use-only-one-fscache_object_cong_wait.patch tcp-Don-t-acquire-inet_listen_hashbucket-lock-with-d.patch panic_remove_oops_id.patch +locking-local_lock-Make-the-empty-local_lock_-functi.patch # sched 0001_kernel_fork_redo_ifdefs_around_task_s_handling.patch @@ -77,9 +78,10 @@ panic_remove_oops_id.patch 0005_random_defer_processing_of_randomness_on_preempt_rt.patch # cgroup -0001-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch -0002-mm-memcg-Add-a-local_lock_t-for-IRQ-and-TASK-object.patch -0003-mm-memcg-Allow-the-task_obj-optimization-only-on-non.patch +0001-mm-memcg-Disable-threshold-event-handlers-on-PREEMPT.patch +0002-mm-memcg-Protect-per-CPU-counter-by-disabling-preemp.patch +0003-mm-memcg-Add-a-local_lock_t-for-IRQ-and-TASK-object.patch +0004-mm-memcg-Allow-the-task_obj-optimization-only-on-non.patch ########################################################################### # Post @@ -88,6 +90,10 @@ cgroup__use_irqsave_in_cgroup_rstat_flush_locked.patch mm__workingset__replace_IRQ-off_check_with_a_lockdep_assert..patch softirq-Use-a-dedicated-thread-for-timer-wakeups.patch +# These two need some feedback. +i2c-rcar-Allow-interrupt-handler-to-be-threaded.patch +i2c-core-Let-i2c_handle_smbus_host_notify-use-handle.patch + ########################################################################### # Kconfig bits: ########################################################################### |
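As an illustrative aside, not part of the patch queue: a minimal stand-alone C sketch of the typecheck() trick used by locking-local_lock-Make-the-empty-local_lock_-functi.patch above. The names typecheck(), local_lock_t and local_lock_acquire() mirror the kernel ones, while expensive_lookup() is a made-up stand-in for this_cpu_ptr(); everything here is an approximation that only shows the idea: the macro form merely type-checks its argument and never evaluates it, so no extra instructions are emitted for an unused expression. Builds with GCC or Clang (GNU C extensions).

/* sketch.c - illustration only; re-creates the typecheck() macro and the
 * debug-off local_lock_acquire() variant from the patch above. */
#include <stdio.h>

/* Compile-time type check: compares the type of the expression 'x'
 * against 'type' without evaluating 'x' at run time. */
#define typecheck(type, x) \
({	type __dummy; \
	typeof(x) __dummy2; \
	(void)(&__dummy == &__dummy2); \
	1; \
})

typedef struct { int dummy; } local_lock_t;

/* The empty debug-off variant as a macro: the argument is only
 * type-checked, never evaluated, so the compiler emits no code for it. */
#define local_lock_acquire(__ll) \
	do { typecheck(local_lock_t *, __ll); } while (0)

static local_lock_t test_lock;

/* Made-up stand-in for this_cpu_ptr(): with the macro above this function
 * is never called; as the argument of an empty static inline function the
 * call would still be evaluated because it has side effects. */
static local_lock_t *expensive_lookup(void)
{
	puts("expensive_lookup() evaluated");
	return &test_lock;
}

int main(void)
{
	local_lock_acquire(expensive_lookup());	/* prints nothing */
	/* local_lock_acquire(&test_lock.dummy); would produce a
	 * "comparison of distinct pointer types" warning. */
	return 0;
}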