ANBZ: #22143
In high-concurrency scenarios, highclass tasks often select the same
idle CPU when waking up, which can cause scheduling delays. To reduce
the scheduling delays caused by this situation, we introduce
ID_BOOK_CPU:
When a highclass task selects an idle cpu, check whether it has been
booked by other tasks and whether it is still idle before we select it
for wakeup. If the idle cpu we found has been booked by other tasks,
select again, until an idle cpu is booked successfully or the retry
limit is reached. If the idle cpu has not been booked by any other
task, set rq->booked to true to mark that the cpu is booked, and set
it back to false after the highclass task actually enqueues into the
rq of that cpu.
To enable ID_BOOK_CPU:
echo ID_BOOK_CPU > /sys/kernel/debug/sched_features.
To set retry limit, modify /proc/sys/kernel/sched_id_book_cpu_nr_tries.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5455
ANBZ: #11591
If several cpus push expellee tasks at the same time, there will be
lock contention, which can cause sys utilization to spike for a short
time. To avoid causing jitter to unexpellee tasks, we use
raw_spin_rq_trylock() to reduce lock contention, and check whether we
should resched on every loop.
We also introduce sysctl_sched_push_expellee_interval to control the
frequency of pushing expellee tasks; the default interval is 6ms.
We also fix a bug where a cpu could not enter nohz_full state when
ID_PUSH_EXPELLEE is turned off.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4125
ANBZ: #9497
commit 079be8fc63 upstream.
The validation of the value written to sched_rt_period_us was broken
because:
- sysctl_sched_rt_period is declared as unsigned int
- it is parsed by proc_dointvec()
- the range is asserted after the value is parsed by proc_dointvec()
Because of this, negative values written to the file were stored in an
unsigned integer and later interpreted as large positive integers,
which passed the check:
if (sysctl_sched_rt_period <= 0)
return -EINVAL;
This commit fixes the parsing by setting an explicit range for both
period_us and runtime_us in the sched_rt_sysctls table and processing
the values with proc_dointvec_minmax() instead.
Alternatively, if we wanted to use the full range of unsigned int for
the period value, we would have to split the proc_handler and use
proc_douintvec() for it; however, even
Documentation/scheduler/sched-rt-group.rst describes the range as 1 to
INT_MAX.
As far as I can tell, the only problem this causes is that the sysctl
file allows writing negative values, which when read back may confuse
userspace.
There is also a LTP test being submitted for these sysctl files at:
http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
[ pvorel: rebased for 5.15, 5.10 ]
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e4bc311745078e145f251ab5b28fb30f90652583)
[dtcccc: we have not introduced SYSCTL_NEG_ONE, use neg_one instead.]
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3549
ANBZ: #9482
commit e1aa3fe3e282a0532fffaa68360cbd45a2a59291 stable.
[ Upstream commit 39c65a94cd ]
For embedded systems with low total memory, having to run applications
with relatively large memory requirements, 10% max limitation for
watermark_scale_factor poses an issue of triggering direct reclaim every
time such application is started. This results in slow application
startup times and bad end-user experience.
By increasing watermark_scale_factor max limit we allow vendors more
flexibility to choose the right level of kswapd aggressiveness for their
device and workload requirements.
Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Fengfei Xi <xi.fengfei@h3c.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 935d44acf6 ("memfd: check for non-NULL file_seals in memfd_create() syscall")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3468
ANBZ: #8765
Group balancer is a feature that schedules task groups as a whole to
achieve better locality. It uses a soft cpu bind method which offers
a dynamic way to restrict allowed CPUs for tasks in the same group.
This patch introduces the switch and config of group balancer.
Co-developed-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3215
user-space scheduling
ANBZ: #8975
Since sched_ext is still pending and not merged into the upstream
kernel, we designed a light-weight user-space scheduling mechanism
based on bpf fmod_ret, which can modify the return value of almost any
function. For safety, the upstream kernel limits the scope of functions
allowed to be modified, so we enlarge it.
Also add a sysctl "bpf_customized_fmodret" to control this feature.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3146
ANBZ: #8997
In certain scenarios, multiple containers within a single pod will share
a PID namespace. These containers do not set specific resource limits;
instead, resource limitations are applied to the pod as a whole.
Therefore, it's necessary to use the pod's cgroup configuration as the
data source for rich containers. From the perspective of the kernel,
this is essentially the parent cgroup of the current process.
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3151
ANBZ: #7692
The functionality of rich containers is supported in cgroup v2 through
eBPF, but only a subset of features is supported. To ensure consistency
between both sides, add switches for each feature of the rich container
in cgroup v1.
Features and their names are: cpuinfo, meminfo, cpuusage, loadavg,
uptime, diskquota. All features are enabled by default if rich
container is enabled.
Add a new sysctl interface named rich_container_feature_control. When
read, it shows a space-separated list of the features which are
enabled. A space-separated list of features prefixed with '+' or '-'
can be written to enable or disable features: a feature name prefixed
with '+' enables the feature and '-' disables it. If a feature appears
more than once in the list, the last occurrence is effective. When
multiple enable and disable operations are specified, either all
succeed or all fail.
e.g.:
echo "+cpuinfo -meminfo +uptime" >
/proc/sys/kernel/rich_container_feature_control
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2489
ANBZ: #6805
During stress tests, there are frequent false soft lockup alarms
caused by long loops rather than true lockups. So we enlarge the
watchdog_thresh limit to 150 to avoid this kind of case.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2286
ANBZ: #6729
In order to dynamically turn acpu accounting on and off, we introduce
sysctl_sched_acpu_enabled instead of having it on by default.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2260
ANBZ: #6587
commit 51ce83ed52 upstream.
Use a small, non-scaled min granularity for SCHED_IDLE entities, when
competing with normal entities. This reduces the latency of getting
a normal entity back on cpu, at the expense of increased context
switch frequency of SCHED_IDLE entities.
The benefit of this change is to reduce the round-robin latency for
normal entities when competing with a SCHED_IDLE entity.
Example: on a machine with HZ=1000, spawned two threads, one of which is
SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE
thread runs for 4ms then waits for 1.4s. With this patch, it runs for
1ms and waits 340ms (as it round-robins with the other thread).
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
[dtcccc: adapt "idle_min_granularity_ns" to sysctl instead of debugfs
because kernel 5.10 has not moved them yet. I set the min value of this
config to ZERO to suppress BE tasks better.]
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2250
ANBZ: #6201
In order to control the switch of core sched, we introduce
sysctl_sched_core. When turned on, it is allowed to give/share cookies
to tasks. When turned off, the cookies of all tasks are cleared, and
it is not allowed to give/share cookies to tasks (only getting a
cookie is allowed).
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2243
ANBZ: #4814
There is a bug in the sched_expel_update_interval sysctl. See the
following:
# cat /proc/sys/kernel/sched_expel_update_interval
10 0
There are two values, which is wrong.
The variable sysctl_sched_expel_update_interval is defined as unsigned
long, so two values appear when reading this sysctl, because the
proc_dointvec handler splits the size of a long into two sizes of int.
So the variable should be defined as unsigned int, which is large
enough for its use.
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1593
ANBZ: #4226
On x86_64 with big RAM, people running containers set pid_max on the
host to large values to be able to launch more containers. At the same
time, containers running 32-bit software experience problems with
large pids: ps calls readdir/stat on proc entries, and the inode's
i_ino happens to be too big for the 32-bit API.
Thus, the ability to limit the pid value inside a container is
required, and the max pid can be set through '/proc/sys/kernel/pid_max'.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1288
ANBZ: #3460
commit e6cfaf34be upstream.
proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length. Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).
So do the proper test in the right type.
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: CVE-2022-3567
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1006
ANBZ: #3460
commit bce9332220 upstream.
proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.
Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace). So use that as
inspiration, odd coding and all.
Now the calling convention actually makes sense and works for the
intended purpose.
Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: CVE-2022-4378
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1006
ANBZ: #2022
We introduce a counter group_identity_count to indicate how many
cgroups have enabled group identity, and a function
group_identity_enabled() to indicate whether any cgroup has group
identity enabled: if group_identity_count is zero,
group_identity_enabled() returns false, otherwise it returns true.
These helpers serve the group identity fast path and also conflict
detection, e.g. conflicts with core scheduling.
Only if group_identity_count is zero can group identity be turned
on/off.
Turn on:
echo 1 > /proc/sys/kernel/sched_group_identity_enabled
Turn off:
echo 0 > /proc/sys/kernel/sched_group_identity_enabled
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/682
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
ANBZ: #1474
Add "sysctl.kernel.userns_max_level" to control the maximum nested
level of user namespace. The valid configuration values are 0-33.
When configured to zero, user namespace is effectively disabled.
Originally the check is "if (parent_ns->level > 32)" and
init_user_ns.level is zero, so the actual maximum level is 33
instead of 32.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
ANBZ: #1476
commit 5758824b20fa2308ebb5c460874d0ffd73d0d8e4 Ubuntu groovy.
It is turned on by default, but can be turned off if admins prefer or,
more importantly, if a security vulnerability is found.
The intent is to use this as mitigation so long as Ubuntu is on the
cutting edge of enablement for things like unprivileged filesystem
mounting.
(This patch is tweaked from the one currently still in Debian sid, which
in turn came from the patch we had in saucy)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
[bwh: Remove unneeded binary sysctl bits]
[ saf: move extern unprivileged_userns_clone declaration to
include/linux/user_namespace.h to conform with 2374c09b1c
"sysctl: remove all extern declaration from sysctl.c" ]
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
[jeffle: add documentation for the sysctl]
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
ANBZ: #1330
In some scenarios, if in-use hugetlb pages are migrated, the program
will fail to access its corresponding data, which then causes a
program execution exception.
To solve the problem, we add a sysctl interface
"enable_used_hugtlb_migration".
To disable migration:
echo 0 > /proc/sys/vm/enable_used_hugtlb_migration
To enable migration:
echo 1 > /proc/sys/vm/enable_used_hugtlb_migration
Note that this bug is caused by backporting commit ae37c7ff79 ("mm:
make alloc_contig_range handle in-use hugetlb pages"). The solution
only applies to alloc_contig_range(); the other paths can still
migrate in-use hugetlb pages.
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
ANBZ: #849
The kernel generally prefers to reclaim page cache over anonymous pages,
and only reclaims page cache when there is no swap configured.
However, in some extreme scenarios, where the application allocates a
large amount of anonymous memory, the page cache is almost completely
exhausted while the OOM killer barely fires. The system can suffer
from heavy IO, and application performance can be significantly
affected; the application may even become unresponsive, since the page
cache, including the application program instructions, is thrashing.
In such scenarios, some users do want an OOM kill instead of a
half-dead system.
This provides users the ability to reserve page cache system-wide.
With appropriate amount of page cache reserved, OOM killer can be
triggered in time, and some key processes can make progress (cooperate
with oom_score_adj, for example).
Enable the feature with:
echo XXX > /proc/sys/vm/min_cache_kbytes
disable the feature with:
echo 0 > /proc/sys/vm/min_cache_kbytes
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Suggested-by: yinbinbin <yinbinbin001@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #781
Group Identity is a totally different feature from BVT, and the
definition and logic of bvt_warp_ns have been totally reformed.
Since 10 lines of code are related to the BVT paper and its code
(Jacob Leverich), this patch gives credit to BVT and its related
work.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
ANBZ: #45
With allnoconfig, there are some warnings during compilation:
kernel/sched/fair.c:955:1: warning: 'id_update_make_up' defined but not used [Wunused-function]
id_update_make_up(struct task_group *tg, struct rq *rq, struct cfs_rq *cfs_rq,
^
kernel/sched/fair.c:1008:1: warning: 'id_commit_make_up' defined but not used [Wunused-function]
id_commit_make_up(struct rq *rq, bool commit)
^
kernel/sched/fair.c:1724:1: warning: 'id_can_migrate_task' defined but not used [Wunused-function]
id_can_migrate_task(struct task_struct *p, struct rq *src_rq, struct rq *dst_rq)
^
kernel/sysctl.c:122:22: warning: 'penalty_extra_delay_max' defined but not used [Wunused-function]
static unsigned long penalty_extra_delay_max = HZ;
^
To fix this problem, these functions are guarded by the corresponding
configs.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000312
Add pool_size and delay_time interfaces. When the user writes
pool_size, a cgroup pool will be created; then, when the user needs
to create a cgroup, it will take the fast path and get a cgroup from
the pool.
pool_size is the capacity of the cgroup pool, which should be larger
than the number of cgroups the user creates at the same time. For
example, it is proper to set pool_size to 500 if creating 400 cgroups
at the same time.
When the pool is exhausted, a supply work will be triggered. In order
to avoid competition with the cgroup currently being created, the
supply work can be delayed. The user can set cgroup_supply_delay_time
to delay the supply work to a time when the system is idle.
Usage:
enable pool:
echo 500 > /sys/fs/cgroup/<subsystem>/test-cgroup/pool_size
disable pool:
echo 0 > /sys/fs/cgroup/<subsystem>/test-cgroup/pool_size
delay supply work to 3000ms later:
echo 3000 > /proc/sys/kernel/cgroup_supply_delay_time
Suggested-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
to #34329845
Specific workloads using multiple threads to do sequential reads on
the same file may introduce contention on a spinlock during page cache
readahead. The readahead process will end earlier when it detects a
readahead page conflict. This is effective for sequential reads of the
same file with multiple threads and a large readahead size.
The benefit can be shown in the following test by fio:
fio --name=ratest --ioengine=mmap --iodepth=1 --rw=read --direct=0
--size=512M --numjobs=32 --directory=/mnt --filename=ratestfile
Firstly, we set readahead size to its default value:
blockdev --setra 256 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=4427MiB/s (4642MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s),
io=16.0GiB (17.2GB), run=3699-3701msec
Disk stats (read/write):
vdh: ios=22945/0, merge=0/0, ticks=12515/0, in_queue=12515,
util=50.79%
Then, we set readahead size larger:
blockdev --setra 8192 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=6450MiB/s (6764MB/s), 202MiB/s-202MiB/s (211MB/s-212MB/s),
io=16.0GiB (17.2GB), run=2535-2540msec
Disk stats (read/write):
vdh: ios=19503/0, merge=0/0, ticks=49375/0, in_queue=49375,
util=23.67%
Finally, we set sysctl vm.enable_multithread_ra_boost=1 with ra=8192
and we get:
Run status group 0 (all jobs):
READ: bw=7234MiB/s (7585MB/s), 226MiB/s-226MiB/s (237MB/s-237MB/s),
io=16.0GiB (17.2GB), run=2262-2265msec
Disk stats (read/write):
vdh: ios=10548/0, merge=0/0, ticks=40709/0, in_queue=40709,
util=23.62%
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
to #34868789
There are various kernel errors/warnings; if users want to know the
effect of an error or warning, they have to look into kernel details,
which is difficult for users to handle.
This patch classifies kernel fault events, dividing events into their
own modules such as sched, mem, io, net, etc., and reports the effect
class of the current fault event.
There are three effect classes:
Slight - just a jitter, everything still works.
Normal - the current task may hit an exception.
Fatal - the system may be unstable.
According to this information, users can easily choose a suitable
action to ensure reliability.
This feature also can be used for checking the side effect of a
system-change.
This feature can be enabled/disabled via
/proc/sys/kernel/fault_event_enable.
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Meng Shen <shenmeng@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #34609774, fix #34359599, fix #34361563
In order to prevent scheduling errors in the case of ipi failure,
update_rq_on_expel() is called in pick_next_task_fair(), but this
causes a performance regression. To prevent errors while ensuring
performance, we introduce an interval for calling
update_rq_on_expel(); the default value is 10, unit: jiffy. We also
introduce a sched_feat, ID_LOOSE_EXPEL, to control whether to call
update_rq_on_expel() in pick_next_task_fair() periodically: if on,
update_rq_on_expel() is called loosely, with an interval larger than
sysctl_sched_expel_update_interval between calls; otherwise it is
called every time. The default value of ID_LOOSE_EXPEL is OFF.
Here follows the data of test will-it-scale-sched_yield-thread-15
before applying this patch:
will-it-scale-sched_yield-thread-15-1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,0,0.00,1479767,95.80,1479767
will-it-scale-sched_yield-thread-15-50%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
48,0,0.00,15057129,0.12,0
will-it-scale-sched_yield-thread-15-100%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
96,0,0.00,14793466,0.14,0
Here follows the data of test will-it-scale-sched_yield-thread-15
after applying this patch with sysctl_sched_expel_update_interval
= 10 and ID_LOOSE_EXPEL on:
will-it-scale-sched_yield-thread-15-1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,0,0.00,1566401,95.82,1566401
will-it-scale-sched_yield-thread-15-50%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
48,0,0.00,15789744,0.12,0
will-it-scale-sched_yield-thread-15-100%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
96,0,0.00,15612437,0.15,0
We also tested it with nginx + ffmpeg, and here is the scenario:
online task: nginx
number of online tasks: 2
bvt of online tasks: 2
offline task: ffmpeg
number of offline tasks: 4
bvt of offline task: -1
Here follows the data before applying this patch:
only online online+offline
cpu usage: 7.79% 46.69%
nginx rt: 0.84ms 0.92ms
Here follows the data after applying this patch
sched_expel_update_interval = 10(unit: jiffy)
only online online+offline
cpu usage: 7.91% 46.84%
nginx rt: 0.82ms 0.89ms
sched_expel_update_interval = 100(unit: jiffy)
only online online+offline
cpu usage: 7.81% 46.45%
nginx rt: 0.81ms 0.87ms
sched_expel_update_interval = 1000(unit: jiffy)
only online online+offline
cpu usage: 7.91% 46.19%
nginx rt: 0.81ms 0.87ms
As we can see from the results, this patch resolves the performance
regression to a certain degree.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34609774
We found that underclass tasks can migrate frequently and occupy too
much idle CPU, which causes high 99th percentile latency for highclass
tasks, since the latency to wake up on an idle CPU is much lower than
the latency to preempt an underclass task.
Thus we introduce the IDLE_SAVER identity, which is activated by
default for groups with bvt -1. Now, when such tasks want to wake up
on an idle CPU, the average idle time is required to be above
'sysctl_sched_idle_saver_wmark' (default 1ms).
For underclass tasks that want to occupy more idle CPU, e.g. vcpu
threads, just deactivate the identity by turning off bit 3 in
'cpu.identity'.
Also improve the logic in select_idle_core(): now a core full of
underclass tasks is considered a backup option for highclass, and
underclass tasks are prevented from taking an idle core for expel or
idle saver reasons.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34609774
Concurrent execution on smt cpus can impact each other's performance
since they share hardware units; an underclass task should not cause
regression to the highclass task running on its smt peer in this way.
The legacy feature Noise Clean was supposed to address this issue, but
it introduced too much overhead on throttling/unthrottling tasks,
causing unstable latency in the real world.
As we now have all underclass tasks on a separate tree, just hiding
such tasks from pick_next_task() is enough.
Thus we introduce another identity, SMT_EXPELLER, which tries to expel
the underclass tasks on smt peers when running.
By applying the corresponding bit in 'group_identity', tasks become
expellers; while they are executing, their smt peers cannot pick
underclass tasks to run.
The legacy 'bvt_warp_ns' value 2 will have this identity by default.
To support the new cgroup deployments, e.g. k8s, it is required that a
highclass parent can have both highclass and underclass children.
In order to support expel in this case, we need to skip the highclass
se when and only when it is full of underclass tasks.
We achieve this by isolating them from the rb-tree temporarily, until
the end of the expel, or until the se no longer contains only
underclass tasks; this is like throttling but costs far less.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34609774
In a collocation environment, we deploy latency sensitive (LS)
workloads and best effort (BE) workloads on the same machine, and
borrowed virtual time (BVT) is the legacy feature that helps ensure
schedule priority for LS.
This patch just ports the idea; however, in order to reduce the
overhead, we introduce the group identity feature, which is based on
the idea of multiple rb-trees.
There are now two rb-trees for each cfs_rq, and tasks with a special
identity are queued into the corresponding tree; this helps reduce
the complexity of the implementation and obviously reduces the
overhead of the following smt expeller (Noise Clean) feature.
By writing a particular identity into the new cpu-cgroup entry named
'group_identity', tasks of this cgroup will perform the special
schedule behavior, including:
* ID_NORMAL
Normal CFS tasks
* ID_HIGHCLASS
Tasks preempt normal & underclass tasks on wakeup,
and also consider an underclass-only cpu as idle.
* ID_UNDERCLASS
Tasks take a penalty on wakeup.
The legacy entry 'bvt_warp_ns' is reserved for compatibility; the
relationship with identities is:
* 2 == ID_HIGHCLASS
* 1 == ID_HIGHCLASS
* 0 == ID_NORMAL
* -1 == ID_UNDERCLASS
* -2 == ID_UNDERCLASS
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34559224
lxcfs supports rich container cpuinfo based on cpu quota, and sigma
cpushare also has the requirement of deriving cpuinfo from cpu.shares.
Thus, introduce new knobs to support these:
/proc/sys/kernel/rich_container_cpuinfo_source
- 0 uses cpu quota "quota/period" by default.
It will fall back to cpuset.cpus if quota is not set.
- 1 uses cpuset.cpus.
- 2 uses cpushare "cpu.shares/1024" (1024 can be configured below)
/proc/sys/kernel/rich_container_cpuinfo_sharesbase
- when rich_container_cpuinfo_source is 2, this is the divisor.
Note that, after faking cpuinfo in dockers, it is impossible to make
/proc/stat match accordingly, but we only care about the following
stats, not /proc/stat:
/proc/cpuinfo, sysfs online and /proc/meminfo
This feature depends on CONFIG_RICH_CONTAINER_CG_SWITCH=n.
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #34559224
Make the rich container support k8s in two different ways: one uses
the global switch when CONFIG_RICH_CONTAINER_CG_SWITCH is not set, and
the other uses a per-cgroup switch when the config is set.
For the per-cgroup switch (CONFIG_RICH_CONTAINER_CG_SWITCH=y):
For a k8s shared-pidns pod, all the containers under the pod share
the same pid_namespace, and the namespace child_reaper lives in a
special container called "pause". Thus we can't use child_reaper
as the rich container information any more. Instead, we introduce
a new cgroup interface "cgroup.rich_container_source" to let users
(pouch) get involved; the rich container kernel code will search
from the current cgroup upwards until the first cgroup with this
file set to a nonzero value, and use that cgroup as the data source
of the rich container. If none is found, fall back to the default
cgroup containing the child_reaper.
New knob "cgroup.rich_container_source":
default value 0, children don't inherit.
e.g.,
pod (shared pidns)
/ | \
c1 c2 pause
If explicitly set pod's cgroup.rich_container_source(cpuacct, cpuset
and memory subsys) to be 1, "c1" or "c2" will reflect the statistics
of "pod" like:
echo 1 > /sys/fs/cgroup/cpuacct/.../pod/cgroup.rich_container_source
echo 1 > /sys/fs/cgroup/cpuset/.../pod/cgroup.rich_container_source
echo 1 > /sys/fs/cgroup/memory/.../pod/cgroup.rich_container_source
For the global switch: (CONFIG_RICH_CONTAINER_CG_SWITCH = n)
For k8s, multiple containers share one pid_namespace, and the child
reaper lives in a special "pause" container among them.
We should therefore use current rather than the child reaper to get
the information, which calls for a runtime interface.
Introduce "/proc/sys/kernel/rich_container_source":
- 0 means to use cgroups of "current" by default.
- 1 means to use cgroups of "child reaper".
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ Shile: fixed the conflicts with
fe8a4afd7c2 (ck: proc/uptime: Add uptime support for rich container) ]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Erwei Deng <erwei@linux.alibaba.com>
to #34559224
Introduce "/proc/sys/kernel/rich_container_enable" to control
rich container at runtime. Default off.
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #32761176
Accumulate unused quota from previous periods, thus accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigned quota to runtime, which saved a
lot of work.
A sysctl parameter, sysctl_sched_cfs_bw_burst_onset_percent, is introduced
to denote what percentage of burst is granted when cfs bandwidth is set. By
default it is 0, which means no burst is allowed unless accumulated.
Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.
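The accumulate-and-clamp idea, including the overflow guard the changelog mentions, can be sketched as follows; the struct and function names are illustrative, not the kernel's:

```c
#include <assert.h>

typedef unsigned long long u64;

/* Toy model of burstable refill: unused runtime carries over, capped
 * at quota + burst, with a guard against u64 wrap-around. */
struct cfs_bw {
	u64 quota;   /* runtime granted per period */
	u64 burst;   /* max extra runtime that may accumulate */
	u64 runtime; /* currently available runtime */
};

void refill_runtime(struct cfs_bw *bw)
{
	u64 cap = bw->quota + bw->burst;
	u64 refill = bw->runtime + bw->quota;

	/* if the addition wrapped, saturate at the cap */
	if (refill < bw->runtime)
		refill = cap;
	if (refill > cap)
		refill = cap;
	bw->runtime = refill;
}
```

For example, with quota 100 and burst 50, leftover runtime of 30 refills to 130; a second idle period would saturate at 150 rather than growing without bound.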
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
to #32751521
A task will not be throttled if the penalty time is less than 10ms when
memory.usage_in_bytes has exceeded memory.high. But some tasks
allocating a lot of memory concurrently can still drive the cgroup into
the OOM killer. The interface adjusts the penalty time to throttle such
tasks further, so that oomd can effectively monitor them and kill
the related task in the mem cgroup to avoid the kernel OOM killer.
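The effect of the 10ms floor and the penalty adjustment can be modeled in a few lines; the constant and names are illustrative, not the actual memcg code:

```c
#include <assert.h>

/* Toy model: computed penalties below the floor are skipped entirely,
 * so heavy concurrent allocators may never be throttled. Scaling the
 * penalty up (the knob described above) pushes it over the floor so
 * throttling kicks in and oomd gets time to react. */
#define PENALTY_FLOOR_MS 10

long effective_throttle_ms(long penalty_ms, long scale_factor)
{
	penalty_ms *= scale_factor; /* knob-controlled amplification */
	if (penalty_ms < PENALTY_FLOOR_MS)
		return 0; /* below the floor: no throttling at all */
	return penalty_ms;
}
```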
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
fix#32811140
By default the tcp_tw_timeout value is 60 seconds. The minimum is
1 second and the maximum is 600. This setting is useful on systems under
heavy TCP load.
NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
restriction and puts your system at risk of some receivers accepting old
data as new, or rejecting new data as old duplicates.
Link: http://tools.ietf.org/html/rfc793
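The documented range check can be expressed as a simple clamp; this is a sketch of the validation policy, not the actual sysctl handler:

```c
#include <assert.h>

/* Clamp tcp_tw_timeout to the documented [1, 600] second range
 * (default 60). Illustrative only. */
int clamp_tw_timeout(int secs)
{
	if (secs < 1)
		return 1;
	if (secs > 600)
		return 600;
	return secs;
}
```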
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
fix#31801703
Too many printed memcg OOM messages can trigger a softlockup.
In general, we use the same ratelimit (oom_rc) between system and memcg
OOM to limit the printed messages. But the memcg side exceeds its limit
much more frequently, so a lot of information gets printed, which is
likely to trigger a softlockup.
This patch uses different ratelimits for memcg and system OOM. We tested
the patch using the default values in the memcg, and the issue is gone.
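The burst-per-interval idea behind keeping two separate ratelimit states (one for memcg OOM, one for system OOM) can be sketched like this; the kernel's ___ratelimit() is more involved, and these names are illustrative:

```c
#include <assert.h>

/* Toy token-bucket ratelimit: at most 'burst' messages per 'interval'
 * window. Keeping one instance for memcg OOM and another for system
 * OOM means frequent memcg OOMs cannot exhaust the system budget. */
struct ratelimit {
	long interval; /* window length, abstract ticks */
	int burst;     /* messages allowed per window */
	long begin;    /* start of the current window */
	int printed;   /* messages emitted in this window */
};

int ratelimit_allow(struct ratelimit *rl, long now)
{
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;   /* open a new window */
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1; /* message may be printed */
	}
	return 0; /* suppressed */
}
```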
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
fix#32695316
For some workloads whose io activities are mostly random, context
readahead feature can introduce unnecessary io read operations, which
will impact app's performance. Context readahead's algorithm is
straightforward and not that smart.
This patch adds "/proc/sys/vm/enable_context_readahead" to control
whether this feature is enabled. Context readahead is enabled by
default; users can echo 0 to /proc/sys/vm/enable_context_readahead
to disable it.
We also tested mongodb's performance in the 'random point select' case.
With context readahead enabled:
mongodb eps 12409
With context readahead disabled:
mongodb eps 14443
About 16% performance improvement.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
fix#32824162
This is a temporary workaround to avoid the limitation when
creating hard links across two projids.
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
[ Upstream commit 43b5240ca6 ]
"numa_stat" should not be included in the scope of CONFIG_HUGETLB_PAGE. If
CONFIG_HUGETLB_PAGE is not configured, "numa_stat" is missing from /proc
even if CONFIG_NUMA is configured. Move it out of CONFIG_HUGETLB_PAGE to
fix it.
Fixes: 4518085e12 ("mm, sysctl: make NUMA stats configurable")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7d1025e559 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec_ms_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_dointvec_ms_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e877820877 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_dointvec_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c31bcc8fb8 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_doulongvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_doulongvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2d3b559df3 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_douintvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_douintvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 61d9b56a89 ("sysctl: add unsigned int range support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f613d86d01 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_dointvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4762b532ec ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_douintvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side. For now, proc_douintvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.
Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1f1be04b4d ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side. For now, proc_dointvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 44a3918c82 upstream.
With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
to Spectre v2 BHB-based attacks.
When both are enabled, print a warning message and report it in the
'spectre_v2' sysfs vulnerabilities file.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
[fllinden@amazon.com: backported to 5.10]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 08389d8882 upstream.
Add a kconfig knob which allows for unprivileged bpf to be disabled by default.
If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2.
This still allows a transition of 2 -> {0,1} through an admin. Similarly,
this also still keeps 1 -> {1} behavior intact, so that once set to permanently
disabled, it cannot be undone aside from a reboot.
We've also added extra2 with max of 2 for the procfs handler, so that an admin
still has a chance to toggle between 0 <-> 2.
Either way, as an additional alternative, applications can make use of CAP_BPF
that we added a while ago.
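The write-side policy described above (values capped at 2 via extra2, 0 <-> 2 togglable, 1 sticky until reboot) can be sketched as a transition check; the function name is hypothetical, not the actual bpf sysctl handler:

```c
#include <assert.h>

/* Toy validity check for writes to unprivileged_bpf_disabled:
 *  - values outside 0..2 are rejected (extra2 caps the range at 2)
 *  - once set to 1, the value is permanent until reboot
 *  - 2 may still be changed to 0 or 1 by an admin */
int bpf_disabled_transition_ok(int cur, int next)
{
	if (next < 0 || next > 2)
		return 0;
	if (cur == 1 && next != 1)
		return 0;
	return 1;
}
```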
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
Cc: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>