Commit Graph

687 Commits

Author SHA1 Message Date
Cruz Zhao ad071c796e anolis: sched: introduce ID_BOOK_CPU
ANBZ: #22143

In high-concurrency scenarios, highclass tasks often select the same
idle CPU when waking up, which can cause scheduling delays. The reason
is that there is a window between selecting an idle CPU and actually
enqueuing on it, during which other wakers may pick the same CPU. To
reduce the scheduling delays caused by this situation, we introduce
ID_BOOK_CPU:

When a highclass task selects an idle CPU, check whether it has been
booked by another task and whether it is still idle before selecting it
for wakeup. If the idle CPU we found has already been booked, select
again, until an idle CPU is booked successfully or the retry limit is
reached. If the idle CPU has not been booked by anyone else, set
rq->booked to true to mark the CPU as booked, and set rq->booked back
to false once the highclass task is actually enqueued on that CPU's rq.

To enable ID_BOOK_CPU:
echo ID_BOOK_CPU > /sys/kernel/debug/sched_features.

To set retry limit, modify /proc/sys/kernel/sched_id_book_cpu_nr_tries.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5455
2025-06-25 16:29:07 +08:00
Cruz Zhao 2b60f98354 anolis: sched: optimize push_expellee()
ANBZ: #11591

If several CPUs push expellee tasks at the same time, there will be
lock contention, which can cause sys usage to spike for a short time.
To avoid causing jitter to non-expellee tasks, we use
raw_spin_rq_trylock() to reduce lock contention, and check whether we
should resched on every loop iteration.

Additionally, we introduce sysctl_sched_push_expellee_interval to
control the frequency of pushing expellee tasks; the default interval
is 6ms.

We also fix a bug where the CPU could not enter the nohz_full state
when ID_PUSH_EXPELLEE is turned off.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4125
2024-12-17 08:30:52 +00:00
Cyril Hrubis 0ccd3a0b71 sched/rt: Disallow writing invalid values to sched_rt_period_us
ANBZ: #9497

commit 079be8fc63 upstream.

The validation of the value written to sched_rt_period_us was broken
because:

  - sysctl_sched_rt_period is declared as unsigned int
  - it is parsed by proc_dointvec()
  - the range is asserted only after the value has been parsed by
    proc_dointvec()

Because of this, negative values written to the file were stored in an
unsigned integer and later interpreted as large positive integers,
which passed the check:

  if (sysctl_sched_rt_period <= 0)
	return -EINVAL;

This commit fixes the parsing by setting an explicit range for both
period_us and runtime_us in the sched_rt_sysctls table and processing
the values with proc_dointvec_minmax() instead.

Alternatively, if we wanted to use the full range of unsigned int for
the period value, we would have to split the proc_handler and use
proc_douintvec() for it; however, even
Documentation/scheduler/sched-rt-group.rst describes the range as 1 to
INT_MAX.

As far as I can tell, the only problem this causes is that the sysctl
file allows writing negative values, which, when read back, may confuse
userspace.

There is also a LTP test being submitted for these sysctl files at:

  http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/

Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
[ pvorel: rebased for 5.15, 5.10 ]
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e4bc311745078e145f251ab5b28fb30f90652583)
[dtcccc: we have not introduced SYSCTL_NEG_ONE, use neg_one instead.]
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3549
2024-08-02 07:26:05 +00:00
Suren Baghdasaryan 5d749449c8 mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%
ANBZ: #9482

commit e1aa3fe3e282a0532fffaa68360cbd45a2a59291 stable.

[ Upstream commit 39c65a94cd ]

For embedded systems with low total memory, having to run applications
with relatively large memory requirements, 10% max limitation for
watermark_scale_factor poses an issue of triggering direct reclaim every
time such application is started.  This results in slow application
startup times and bad end-user experience.

By increasing watermark_scale_factor max limit we allow vendors more
flexibility to choose the right level of kswapd aggressiveness for their
device and workload requirements.

Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Fengfei Xi <xi.fengfei@h3c.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 935d44acf6 ("memfd: check for non-NULL file_seals in memfd_create() syscall")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3468
2024-08-01 02:00:10 +00:00
Cruz Zhao 1c1cead323 anolis: sched: introduce group balancer switch and config
ANBZ: #8765

Group balancer is a feature that schedules task groups as a whole to
achieve better locality. It uses a soft cpu bind method which offers
a dynamic way to restrict allowed CPUs for tasks in the same group.

This patch introduces the switch and config of group balancer.

Co-developed-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3215
2024-05-20 16:15:48 +00:00
Tianchen Ding 87dd20ef17 anolis: bpf: add whitelist for fmod_ret functions to prepare for
user-space scheduling

ANBZ: #8975

Since sched_ext is still pending and not merged into the upstream
kernel, we designed a lightweight user-space scheduling mechanism based
on bpf fmod_ret, which can modify the return value of almost any
function. For safety, the upstream kernel limits the scope of functions
allowed to be modified; let's enlarge it.

Also add a sysctl "bpf_customized_fmodret" to control this feature.

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3146
2024-05-17 16:18:07 +00:00
Yi Tao 9f8a413f69 anolis: support using parent cgroup as rich container source
ANBZ: #8997

In certain scenarios, multiple containers within a single pod will share
a PID namespace. These containers do not set specific resource limits;
instead, resource limitations are applied to the pod as a whole.
Therefore, it's necessary to use the pod's cgroup configuration as the
data source for rich containers. From the perspective of the kernel,
this is essentially the parent cgroup of the current process.

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3151
2024-05-17 06:43:15 +00:00
Yi Tao c9e66be185 anolis: Add rich container feature control
ANBZ: #7692

The functionality of rich containers is supported in cgroup v2 through
eBPF, but only a subset of features is supported. To ensure consistency
between both sides, add switches for each feature of the rich container
in cgroup v1.

Features and their names are: cpuinfo, meminfo, cpuusage, loadavg,
uptime, diskquota. All features are enabled by default if rich
container is enabled.

Add a new sysctl interface named rich_container_feature_control. When
read, it shows a space-separated list of the enabled features. A
space-separated list of features prefixed with '+' or '-' can be
written to it: '+' enables the feature and '-' disables it. If a
feature appears more than once in the list, the last occurrence is
effective. When multiple enable and disable operations are specified,
either all succeed or all fail.

e.g.:
	echo "+cpuinfo -meminfo +uptime" >
	/proc/sys/kernel/rich_container_feature_control

Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2489
2023-12-11 03:47:39 +00:00
Cruz Zhao a760c56ec5 anolis: watchdog: enlarge watchdog_thresh limit
ANBZ: #6805

During stress tests, there are frequent false soft-lockup alarms caused
by long loops rather than true lockups. So we need to enlarge the
watchdog_thresh limit to 150 to avoid this kind of case.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2286
2023-10-16 08:38:03 +00:00
Cruz Zhao 78300c148f anolis: sched: introduce sysctl_sched_acpu_enabled
ANBZ: #6729

In order to be able to dynamically turn acpu accounting on and off, we
introduce sysctl_sched_acpu_enabled, instead of having the feature
always on.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2260
2023-10-08 09:54:40 +00:00
Josh Don 389ef61178 sched: reduce sched slice for SCHED_IDLE entities
ANBZ: #6587

commit 51ce83ed52 upstream.

Use a small, non-scaled min granularity for SCHED_IDLE entities, when
competing with normal entities. This reduces the latency of getting
a normal entity back on cpu, at the expense of increased context
switch frequency of SCHED_IDLE entities.

The benefit of this change is to reduce the round-robin latency for
normal entities when competing with a SCHED_IDLE entity.

Example: on a machine with HZ=1000, spawned two threads, one of which is
SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE
thread runs for 4ms then waits for 1.4s. With this patch, it runs for
1ms and waits 340ms (as it round-robins with the other thread).

Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
[dtcccc: adapt "idle_min_granularity_ns" to sysctl instead of debugfs
because kernel 5.10 has not moved them yet. I set the min value of this
config to ZERO to suppress BE tasks better.]
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2250
2023-10-08 09:53:10 +00:00
Cruz Zhao c02f4ea408 anolis: sched/core: introduce sysctl_sched_core
ANBZ: #6201

In order to be able to control the switch of core scheduling, we
introduce sysctl_sched_core. When turned on, it is allowed to give or
share cookies to tasks. When turned off, the cookies of all tasks will
be cleared, and it is not allowed to give or share cookies to tasks
(only getting a cookie is allowed).

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2243
2023-09-28 12:37:32 +00:00
Erwei Deng 9b1b42a69c anolis: sched/fair: fix the sysctl of sched_expel_update_interval
ANBZ: #4814

There is a bug in the sched_expel_update_interval sysctl. See the
following:

	# cat /proc/sys/kernel/sched_expel_update_interval
	10    0

There are two values, which is wrong.

The variable sysctl_sched_expel_update_interval is defined as unsigned
long, so two values appear when this sysctl is read: the proc_dointvec
handler splits the size of a long into two sizes of int.

So the variable should be defined as unsigned int, which is large
enough for this use.

Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1593
2023-05-08 03:13:57 +00:00
Chao Wu 79e997af0a anolis: pidns: add pid_max per namespace
ANBZ: #4226

On x86_64 machines with big RAM, people running containers set pid_max
on the host to large values to be able to launch more containers. At
the same time, containers running 32-bit software experience problems
with large pids: ps calls readdir/stat on proc entries, and an inode's
i_ino can be too big for the 32-bit API.
Thus, the ability to limit the pid value inside a container is
required, and pid_max can be set through '/proc/sys/kernel/pid_max'.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1288
2023-02-27 03:49:34 +00:00
Linus Torvalds 886024a0d5 proc: avoid integer type confusion in get_proc_long
ANBZ: #3460

commit e6cfaf34be upstream.

proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length.  Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).

So do the proper test in the right type.

Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Fixes: CVE-2022-3567
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1006
2022-12-14 19:27:29 +08:00
Linus Torvalds d206f363e5 proc: proc_skip_spaces() shouldn't think it is working on C strings
ANBZ: #3460

commit bce9332220 upstream.

proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.

Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace).  So use that as
inspiration, odd coding and all.

Now the calling convention actually makes sense and works for the
intended purpose.

Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Fixes: CVE-2022-4378
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1006
2022-12-14 19:24:55 +08:00
Cruz Zhao 31ec2541ea anolis: sched/fair: Introduce sysctl interface to enable group identity
ANBZ: #2022

We introduce a counter, group_identity_count, to indicate how many
cgroups have enabled group identity, and a function,
group_identity_enabled(), to indicate whether any cgroup has group
identity enabled: if group_identity_count is zero,
group_identity_enabled() returns false; otherwise it returns true.

These helpers serve the group identity fast path and also conflict
detection, e.g. conflicts with core scheduling.

Group identity can be turned on or off only when group_identity_count
is zero.

Turn on:
	echo 1 > /proc/sys/kernel/sched_group_identity_enabled
Turn off:
	echo 0 > /proc/sys/kernel/sched_group_identity_enabled

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/682
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
2022-09-21 03:40:38 +00:00
Jiang Liu a380e9e60f anolis: userns: add a sysctl to control the max depth
ANBZ: #1474

Add "sysctl.kernel.userns_max_level" to control the maximum nesting
level of user namespaces. The valid configuration values are 0-33.
When configured to zero, user namespaces are effectively disabled.

Originally the check was "if (parent_ns->level > 32)" and
init_user_ns.level is zero, so the actual maximum level is 33
instead of 32.

Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-08-02 16:40:18 +08:00
Serge Hallyn a702079920 userns: add a sysctl to disable unprivileged user namespace unsharing
ANBZ: #1476

commit 5758824b20fa2308ebb5c460874d0ffd73d0d8e4 Ubuntu groovy.

It is turned on by default, but can be turned off if admins prefer or,
more importantly, if a security vulnerability is found.

The intent is to use this as mitigation so long as Ubuntu is on the
cutting edge of enablement for things like unprivileged filesystem
mounting.

(This patch is tweaked from the one currently still in Debian sid, which
in turn came from the patch we had in saucy)

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
[bwh: Remove unneeded binary sysctl bits]
[ saf: move extern unprivileged_userns_clone declaration to
  include/linux/user_namespace.h to conform with 2374c09b1c
  "sysctl: remove all extern declaration from sysctl.c" ]
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
[jeffle: add documentation for the sysctl]
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-08-02 16:40:17 +08:00
Xin Hao 57c48b6a99 ck: mm: add a switch to enable in-use hugetlb page migration
ANBZ: #1330

In some scenarios, if in-use hugetlb pages are migrated, the program
will fail to access its corresponding data, causing an execution
exception.

To solve the problem, we add a sysctl interface,
"enable_used_hugtlb_migration".
To disable migration:
	echo 0 > /proc/sys/vm/enable_used_hugtlb_migration

To enable migration:
	echo 1 > /proc/sys/vm/enable_used_hugtlb_migration

By the way, this bug was introduced by backporting commit ae37c7ff79
("mm: make alloc_contig_range handle in-use hugetlb pages"); the
solution only applies to alloc_contig_range(), and other paths can
still migrate in-use hugetlb pages.

Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
2022-08-02 16:30:35 +08:00
zhongjiang-ali 1f86172dbe anolis: mm: introduce ability to reserve page cache on system wide
ANBZ: #849

The kernel generally prefers to reclaim page cache over anonymous pages,
and only reclaims page cache when there is no swap configured.

However, in some extreme scenarios, where the application allocates a
large amount of anonymous memory, the page cache is almost completely
exhausted while the OOM killer barely fires. The system can suffer from
heavy IO, and application performance can be significantly affected, or
the application may even become unresponsive, since the page cache,
including the application's program instructions, is thrashing. In such
scenarios, some users do want OOM instead of a half-dead system.

This provides users the ability to reserve page cache system-wide.
With an appropriate amount of page cache reserved, the OOM killer can
be triggered in time, and some key processes can make progress
(cooperating with oom_score_adj, for example).

Enable the feature with:
echo XXX > /proc/sys/vm/min_cache_kbytes
Disable the feature with:
echo 0 > /proc/sys/vm/min_cache_kbytes

Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Suggested-by: yinbinbin <yinbinbin001@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2022-08-01 12:26:01 +00:00
Cruz Zhao c209a80007 anolis: sched: Credit clarification for BVT and its related work
ANBZ: #781

Group Identity is a totally different feature from BVT, and the
definition and logic of bvt_warp_ns have been totally reformed.

Since 10 lines of code are related to the BVT paper and its code
(Jacob Leverich), this patch gives credit to BVT and its related
work.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
2022-08-01 12:25:57 +00:00
Cruz Zhao 328d3ac4df anolis: sched: fix some warnings with allnoconfig
ANBZ: #45

With allnoconfig, there are some warnings during compilation:
kernel/sched/fair.c:955:1: warning: 'id_update_make_up' defined but not used [Wunused-function]
 id_update_make_up(struct task_group *tg, struct rq *rq, struct cfs_rq *cfs_rq,
 ^
kernel/sched/fair.c:1008:1: warning: 'id_commit_make_up' defined but not used [Wunused-function]
 id_commit_make_up(struct rq *rq, bool commit)
 ^
kernel/sched/fair.c:1724:1: warning: 'id_can_migrate_task' defined but not used [Wunused-function]
 id_can_migrate_task(struct task_struct *p, struct rq *src_rq, struct rq  *dst_rq)
 ^
kernel/sysctl.c:122:22: warning: 'penalty_extra_delay_max' defined but not used [Wunused-function]
 static unsigned long penalty_extra_delay_max = HZ;
                      ^
To fix this problem, these definitions are guarded by the appropriate
config options.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
2022-08-01 12:00:48 +00:00
Yi Tao 3f6bceb77c anolis: cgroup: add user space interface for cgroup pool
OpenAnolis Bug Tracker: 0000312

Add a pool_size interface and a delay_time interface. When the user
writes pool_size, a cgroup pool is created; afterwards, when the user
needs to create a cgroup, creation takes the fast path and gets a
cgroup from the pool.

pool_size is the capacity of the cgroup pool, which should be larger
than the number of cgroups the user creates at the same time. For
example, it is proper to set pool_size to 500 when creating 400
cgroups at the same time.

When the pool is exhausted, a supply work is triggered. In order to
avoid competition with cgroups currently being created, the supply
work can be delayed. The user can set cgroup_supply_delay_time to
delay the supply work to a time when the system is idle.

Usage:

enable pool:
	echo 500 > /sys/fs/cgroup/<subsystem>/test-cgroup/pool_size

disable pool:
	echo 0 > /sys/fs/cgroup/<subsystem>/test-cgroup/pool_size

delay supply work to 3000ms later:
	echo 3000 > /proc/sys/kernel/cgroup_supply_delay_time

Suggested-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
2022-08-01 11:52:22 +00:00
Tianchen Ding 7001fcff92 ck: mm: break in readahead when multiple threads detected
to #34329845

Specific workloads using multiple threads to do sequential reading of
the same file may introduce races on a spinlock during page cache
readahead. The readahead process now ends earlier when a readahead
page conflict is detected. This is effective for sequential reading of
the same file with multiple threads and a large readahead size.

The benefit can be shown in the following test by fio:
fio --name=ratest --ioengine=mmap --iodepth=1 --rw=read --direct=0
--size=512M --numjobs=32 --directory=/mnt --filename=ratestfile

Firstly, we set readahead size to its default value:
blockdev --setra 256 /dev/vdh
The result is:
Run status group 0 (all jobs):
   READ: bw=4427MiB/s (4642MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s),
io=16.0GiB (17.2GB), run=3699-3701msec
Disk stats (read/write):
  vdh: ios=22945/0, merge=0/0, ticks=12515/0, in_queue=12515,
util=50.79%

Then, we set readahead size larger:
blockdev --setra 8192 /dev/vdh
The result is:
Run status group 0 (all jobs):
   READ: bw=6450MiB/s (6764MB/s), 202MiB/s-202MiB/s (211MB/s-212MB/s),
io=16.0GiB (17.2GB), run=2535-2540msec
Disk stats (read/write):
  vdh: ios=19503/0, merge=0/0, ticks=49375/0, in_queue=49375,
util=23.67%

Finally, we set sysctl vm.enable_multithread_ra_boost=1 with ra=8192
and we get:
Run status group 0 (all jobs):
   READ: bw=7234MiB/s (7585MB/s), 226MiB/s-226MiB/s (237MB/s-237MB/s),
io=16.0GiB (17.2GB), run=2262-2265msec
Disk stats (read/write):
  vdh: ios=10548/0, merge=0/0, ticks=40709/0, in_queue=40709,
util=23.62%

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
2022-08-01 11:31:07 +00:00
Meng Shen fbd69fac5c ck: UKFEF: report softlockup event
to #34868789

Add softlockup event into unified kernel fault event framework.

Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Meng Shen <shenmeng@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
2022-08-01 11:30:53 +00:00
Meng Shen d2ffdd5d83 ck: UKFEF: unified kernel fault event framework
to #34868789

There are various kernel errors/warnings; if users want to know the
effect of an error or warning, they have to study the kernel in
detail. It is difficult for users to handle so many things.

This patch classifies kernel fault events, dividing them into their
own modules such as sched, mem, io, net, etc., and reporting the
effect class of the current fault event.
There are three effect classes:
Slight - just a jitter; everything still works.
Normal - the current task may have an exception.
Fatal  - the system may be unstable.

According to this information, users can easily choose a suitable
action to ensure reliability.

This feature can also be used for checking the side effects of a
system change.

This feature can be enabled/disabled via
/proc/sys/kernel/fault_event_enable.

Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Meng Shen <shenmeng@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
2022-08-01 11:30:51 +00:00
Cruz Zhao f0f2e04c71 ck: sched: fix the performance regression caused by update_rq_on_expel()
to #34609774
fix #34359599
fix #34361563

In order to prevent scheduling errors in the case of IPI failure,
update_rq_on_expel() is called in pick_next_task_fair(), but this
causes a performance regression. In order to prevent errors while
preserving performance, we introduce an interval for calling
update_rq_on_expel(); the default value is 10, unit: jiffy. We also
introduce a sched_feat, ID_LOOSE_EXPEL, to control whether
update_rq_on_expel() is called from pick_next_task_fair()
periodically: if on, update_rq_on_expel() is called loosely, with at
least sysctl_sched_expel_update_interval between calls; otherwise it
is called every time. The default value of ID_LOOSE_EXPEL is OFF.

Here follows the data of test will-it-scale-sched_yield-thread-15
before applying this patch:
will-it-scale-sched_yield-thread-15-1
	tasks,processes,processes_idle,threads,threads_idle,linear
	0,0,100,0,100,0
	1,0,0.00,1479767,95.80,1479767
will-it-scale-sched_yield-thread-15-50%
	tasks,processes,processes_idle,threads,threads_idle,linear
	0,0,100,0,100,0
	48,0,0.00,15057129,0.12,0
will-it-scale-sched_yield-thread-15-100%
	tasks,processes,processes_idle,threads,threads_idle,linear
	0,0,100,0,100,0
	96,0,0.00,14793466,0.14,0

Here follows the data of test will-it-scale-sched_yield-thread-15
after applying this patch with sysctl_sched_expel_update_interval
= 10 and ID_LOOSE_EXPEL on:
will-it-scale-sched_yield-thread-15-1
	tasks,processes,processes_idle,threads,threads_idle,linear
	0,0,100,0,100,0
	1,0,0.00,1566401,95.82,1566401
will-it-scale-sched_yield-thread-15-50%
	tasks,processes,processes_idle,threads,threads_idle,linear
	0,0,100,0,100,0
	48,0,0.00,15789744,0.12,0
will-it-scale-sched_yield-thread-15-100%
	tasks,processes,processes_idle,threads,threads_idle,linear
	0,0,100,0,100,0
	96,0,0.00,15612437,0.15,0

We also tested it with nginx + ffmpeg, and here is the scenario:
    online task: nginx
        number of online tasks: 2
        bvt of online tasks: 2
    offline task: ffmpeg
        number of offline tasks: 4
        bvt of offline task: -1

Here follows the data before applying this patch:
			only online	online+offline
	cpu usage:	7.79%		46.69%
	nginx rt:	0.84ms		0.92ms

Here follows the data after applying this patch
	sched_expel_update_interval = 10(unit: jiffy)
			only online	online+offline
	cpu usage:	7.91%		46.84%
	nginx rt:	0.82ms		0.89ms
	sched_expel_update_interval = 100(unit: jiffy)
			only online	online+offline
	cpu usage:	7.81%		46.45%
	nginx rt:	0.81ms		0.87ms
	sched_expel_update_interval = 1000(unit: jiffy)
			only online	online+offline
	cpu usage:	7.91%		46.19%
	nginx rt:	0.81ms		0.87ms

As we can see from the result, this patch resolves the performance
regression to a certain degree.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
2022-08-01 11:30:21 +00:00
Michael Wang 4f43246740 ck: sched: introduce group identity 'idle saver'
to #34609774

We found that underclass tasks can migrate frequently and occupy too
much idle CPU, which causes high 99th-percentile latency for highclass
tasks, since waking up on an idle CPU has much lower latency than
preempting an underclass task.

Thus we introduce the IDLE_SAVER identity, activated by default for
groups with bvt -1: when such tasks want to wake up on an idle CPU,
the CPU's average idle time is required to be above
'sysctl_sched_idle_saver_wmark' (default 1ms).

For underclass tasks that want to occupy more idle CPU, e.g. vcpu
threads, just deactivate the identity by turning off bit 3 in
'cpu.identity'.

We also improved the logic in select_idle_core(): a core full of
underclass tasks is now considered a backup option for highclass
tasks, and underclass tasks are prevented from taking an idle core
for expel or idle-saver reasons.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
2022-08-01 11:30:20 +00:00
Michael Wang 9372f9beab ck: sched: introduce group identity 'smt expeller'
to #34609774

Concurrent execution on SMT CPUs can impact each other's performance,
since they share hardware units; an underclass task should not cause a
regression to the highclass task running on its SMT peer in this way.

The legacy Noise Clean feature was supposed to address this issue, but
it introduced too much overhead for throttling/unthrottling tasks,
which caused unstable latency in the real world.

Since we now have all underclass tasks on a separate tree, just hiding
such tasks from pick_next_task() is enough.

Thus we introduce another identity, SMT_EXPELLER, which tries to expel
underclass tasks on SMT peers while running.

By applying the corresponding bit in 'group_identity', tasks become
expellers: while one is executing, its SMT peers won't be able to pick
underclass tasks for running.

The legacy 'bvt_warp_ns' value 2 will have this identity by default.

To support new cgroup deployments, e.g. k8s, it is required that a
highclass parent can have both highclass and underclass children.

In order to support expelling in this case, we need to skip a
highclass se when, and only when, it is full of underclass tasks.

We achieve this by isolating such entities from the rb-tree
temporarily, until the end of the expel or until they no longer
contain only underclass tasks, like throttling but at a far lower
cost.

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
2022-08-01 11:30:18 +00:00
Michael Wang b37e67a6c6 ck: sched: introduce per-cgroup identity
to #34609774

In a colocation environment, we deploy latency sensitive (LS) workloads
and best effort (BE) workloads on the same machine, and borrowed virtual
time (BVT) is the legacy feature that helps ensure scheduling priority
for LS.

This patch just ports the idea; however, in order to reduce the
overhead, we introduced the group identity feature, which is based on
the idea of multiple rb-trees.

There are now two rb-trees for each cfs_rq, and tasks with a special
identity will be queued into the corresponding tree. This helps reduce
the complexity of the implementation and noticeably reduces the overhead
of the following smt expeller (Noise Clean) feature.

By writing a particular identity into the new cpu-cgroup entry named
'group_identity', tasks of this cgroup will perform the special
schedule behavior, including:

  * ID_NORMAL
    Normal CFS tasks
  * ID_HIGHCLASS
    Tasks preempt normal & underclass tasks on wakeup,
    and consider an underclass-only cpu as idle.
  * ID_UNDERCLASS
    Tasks take penalty on wakeup.

And the legacy entry 'bvt_warp_ns' is reserved for compatibility; its
mapping to identities is:

  * 2 == ID_HIGHCLASS
  * 1 == ID_HIGHCLASS
  * 0 == ID_NORMAL
  * -1 == ID_UNDERCLASS
  * -2 == ID_UNDERCLASS
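
The reserved compatibility entry gives a concrete way to exercise the
mapping above; a sketch (the cgroup mount path and the "ls"/"be" group
names are assumptions, the values follow the mapping listed here):

```shell
# Assumed cgroup-v1 mount point; group names are illustrative.
echo 1  > /sys/fs/cgroup/cpu/ls/cpu.bvt_warp_ns    # maps to ID_HIGHCLASS
echo -1 > /sys/fs/cgroup/cpu/be/cpu.bvt_warp_ns    # maps to ID_UNDERCLASS
```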

Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
2022-08-01 11:30:18 +00:00
Xunlei Pang e5b00b47d4 ck: cpuinfo: Add cpuinfo support of cpu quota and cpu share
to #34559224

lxcfs supports rich container cpuinfo from cpu quota, and sigma
cpushare requires getting cpuinfo from cpu.shares.

Thus, introduce new knobs to support these:
/proc/sys/kernel/rich_container_cpuinfo_source
- 0 uses cpu quota "quota/period" by default.
  It will fall back to use cpuset.cpus if quota not set.
- 1 uses cpuset.cpus.
- 2 uses cpushare "cpu.shares/1024" (1024 can be configured below)

/proc/sys/kernel/rich_container_cpuinfo_sharesbase
- when rich_container_cpuinfo_source is 2, this is the divisor.

Note that, after faking cpuinfo in containers, it's impossible to
make /proc/stat match it accordingly, but we only care about
the following stats, not /proc/stat:
/proc/cpuinfo, sysfs online and /proc/meminfo

This feature depends on CONFIG_RICH_CONTAINER_CG_SWITCH=n.
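
The faked CPU count for modes 0 and 2 is simple arithmetic; a sketch
with illustrative numbers (not read from a real cgroup):

```shell
#!/bin/sh
# Mode 0: cpus = quota / period (illustrative cgroup values).
quota_us=400000
period_us=100000
echo $(( quota_us / period_us ))    # prints 4

# Mode 2: cpus = cpu.shares / sharesbase (sharesbase defaults to 1024).
shares=2048
sharesbase=1024
echo $(( shares / sharesbase ))     # prints 2
```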

Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
2022-08-01 11:30:17 +00:00
Xunlei Pang aa278c4ae5 ck: make the rich container support k8s.
to #34559224

Make the rich container support k8s in two different ways: one uses
the global switch when CONFIG_RICH_CONTAINER_CG_SWITCH is not set,
and the other uses a per-cgroup switch when the config is set.

For the per cgroup switch: (CONFIG_RICH_CONTAINER_CG_SWITCH = y)

For a k8s shared-pidns pod, all the containers under the pod share
the same pid_namespace, and the namespace child_reaper lives in a
special container called "pause". Thus we can't use the child_reaper
as the rich container information source any more. Instead, we introduce
a new cgroup interface "cgroup.rich_container_source" to let users
(pouch) participate: the rich container kernel code will search
from the current cgroup upwards until the first cgroup with this file
set to a nonzero value, and use this cgroup as the data source of
the rich container. If none is found, fall back to the default cgroup
containing the child_reaper.

New knob "cgroup.rich_container_source":
 default value 0, children don't inherit.

e.g.,
   pod (shared pidns)
     /    |    \
   c1    c2   pause

If the pod's cgroup.rich_container_source (cpuacct, cpuset and
memory subsys) is explicitly set to 1, "c1" or "c2" will reflect
the statistics of "pod" like:
 echo 1 > /sys/fs/cgroup/cpuacct/.../pod/cgroup.rich_container_source
 echo 1 > /sys/fs/cgroup/cpuset/.../pod/cgroup.rich_container_source
 echo 1 > /sys/fs/cgroup/memory/.../pod/cgroup.rich_container_source

For the global switch: (CONFIG_RICH_CONTAINER_CG_SWITCH = n)

For k8s, multiple containers share one pid_namespace, and the child
reaper lives in a special "pause" container among them.

We should then use current rather than the child reaper to get
the information, so it deserves a runtime interface.

Introduce "/proc/sys/kernel/rich_container_source":
 - 0 means to use cgroups of "current" by default.
 - 1 means to use cgroups of "child reaper".
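
With the global switch built in, selecting the source is a one-line
toggle (path taken from this message; requires root):

```shell
# 1 = use cgroups of the child reaper; 0 = use cgroups of current.
echo 1 > /proc/sys/kernel/rich_container_source
```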

Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>

[ Shile: fixed the conflicts with
	fe8a4afd7c2 (ck: proc/uptime: Add uptime support for rich container) ]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Erwei Deng <erwei@linux.alibaba.com>
2022-08-01 11:30:17 +00:00
Xunlei Pang f1c204622c ck: pidns: Support rich container switch on/off
to #34559224

Introduce "/proc/sys/kernel/rich_container_enable" to control
rich container at runtime. Default off.

Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
2022-08-01 11:30:11 +00:00
Huaixin Chang 98dbdf849a ck: sched/fair: Make CFS bandwidth controller burstable
to #32761176

Accumulate unused quota from previous periods, so that accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigned quota to runtime.

A sysctl parameter sysctl_sched_cfs_bw_burst_onset_percent is introduced to
denote how many percent of burst is given on setting cfs bandwidth. By
default it is 0, which means no burst is allowed unless accumulated.

Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.
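
The accumulation rule can be sketched as a tiny model: each period
refills runtime by quota, capped at quota + burst. This is a
simplification, not the in-kernel implementation, and all names and
numbers are illustrative:

```shell
#!/bin/sh
# Simplified model of burstable bandwidth refill.
quota=50000
burst=30000
runtime=0

refill() {
    runtime=$(( runtime + quota ))
    cap=$(( quota + burst ))
    if [ "$runtime" -gt "$cap" ]; then
        runtime=$cap
    fi
}

refill    # idle period: unused quota accumulates
refill    # accumulation is capped at quota + burst
echo "$runtime"    # prints 80000
```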

Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
2022-08-01 11:16:05 +00:00
zhongjiang-ali 23e00c1f94 ck: mm: add an interface to adjust the penalty time dynamically
to #32751521

It will not be throttled if the penalty time is less than 10ms when
the memory.usage_in_bytes has exceeded the memory.high. But there
are some tasks allocating a lot of memory cocurrently to result in
Oom killer. The interface adjust the penatly time to throttle the
task further, hence the oomd can effectively monitor it and kill
the related task in the mem cgroup to avoid oom killer in the kernel.

Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
2022-08-01 11:02:05 +00:00
George Zhang 75985e48c2 ck: net/tcp: Support tunable tcp timeout value in TIME-WAIT state
fix #32811140

By default the tcp_tw_timeout value is 60 seconds. The minimum is
1 second and the maximum is 600. This setting is useful on system under
heavy tcp load.

NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
restriction, and puts your system at risk of old data being accepted
as new, or new data being rejected as an old duplicate, by some
receivers.
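
A usage sketch; the exact procfs path is an assumption based on the
knob name, so verify it on your kernel:

```shell
# Shorten TIME-WAIT to 30 seconds (values below 60 violate the
# "quiet time" restriction described above).
echo 30 > /proc/sys/net/ipv4/tcp_tw_timeout
```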

Link: http://tools.ietf.org/html/rfc793

Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
2022-08-01 11:01:51 +00:00
zhongjiang-ali e8e17ddfc2 ck: mm: restrict the print message frequency further when memcg oom triggers.
fix #31801703

Too many printed memcg oom messages can trigger a softlockup.
In general, we use the same ratelimit oom_rc between system and memcg
oom to limit message printing. But the memcg limit is exceeded much
more frequently, so a lot of information gets printed, which is
likely to trigger a softlockup.

This patch uses different ratelimits for memcg and system oom. We
tested the patch using the default values in the memcg, and the
issue is gone.

Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
2022-08-01 11:01:49 +00:00
Xiaoguang Wang c20584a15b ck: mm: add proc interface to control context readahead
fix #32695316

For some workloads whose io activities are mostly random, the context
readahead feature can introduce unnecessary io read operations, which
impacts the app's performance. Context readahead's algorithm is
straightforward and not that smart.

This patch adds "/proc/sys/vm/enable_context_readahead" to control
whether to disable or enable this feature. Currently we enable context
readahead default, user can echo 0 to /proc/sys/vm/enable_context_readahead
to disable context readahead.

We also tested mongodb's performance in the 'random point select' case.
With context readahead enabled:
  mongodb eps 12409
With context readahead disabled:
  mongodb eps 14443
About a 16% performance improvement.
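
Disabling the feature is the documented one-liner, and the quoted gain
follows from the numbers above:

```shell
#!/bin/sh
# Disable context readahead (path from this patch; needs root, so
# left commented here):
# echo 0 > /proc/sys/vm/enable_context_readahead

# Sanity-check the ~16% figure from the mongodb eps numbers.
awk 'BEGIN { printf "%.1f\n", (14443 - 12409) / 12409 * 100 }'   # prints 16.4
```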

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-08-01 11:01:44 +00:00
zhangliguang 90658e1450 ck: fs,ext4: remove projid limit when create hard link
fix #32824162

This is a temporary workaround plan to avoid the limitation when
creating hard link cross two projids.

Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-08-01 11:01:43 +00:00
Muchun Song 31e16a5e11 mm: sysctl: fix missing numa_stat when !CONFIG_HUGETLB_PAGE
[ Upstream commit 43b5240ca6 ]

"numa_stat" should not be included in the scope of CONFIG_HUGETLB_PAGE, if
CONFIG_HUGETLB_PAGE is not configured even if CONFIG_NUMA is configured,
"numa_stat" is missed form /proc. Move it out of CONFIG_HUGETLB_PAGE to
fix it.

Fixes: 4518085e12 ("mm, sysctl: make NUMA stats configurable")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:13 +02:00
Kuniyuki Iwashima b8871d9186 sysctl: Fix data-races in proc_dointvec_ms_jiffies().
[ Upstream commit 7d1025e559 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec_ms_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_dointvec_ms_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:09 +02:00
Kuniyuki Iwashima 609ce7ff75 sysctl: Fix data races in proc_dointvec_jiffies().
[ Upstream commit e877820877 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_dointvec_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:07 +02:00
Kuniyuki Iwashima a5ee448d38 sysctl: Fix data races in proc_doulongvec_minmax().
[ Upstream commit c31bcc8fb8 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_doulongvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_doulongvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:07 +02:00
Kuniyuki Iwashima e3a2144b3b sysctl: Fix data races in proc_douintvec_minmax().
[ Upstream commit 2d3b559df3 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_douintvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_douintvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 61d9b56a89 ("sysctl: add unsigned int range support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:07 +02:00
Kuniyuki Iwashima 71ddde27c2 sysctl: Fix data races in proc_dointvec_minmax().
[ Upstream commit f613d86d01 ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side.  For now,
proc_dointvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:06 +02:00
Kuniyuki Iwashima d5d54714e3 sysctl: Fix data races in proc_douintvec().
[ Upstream commit 4762b532ec ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_douintvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side.  For now, proc_douintvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.

Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:06 +02:00
Kuniyuki Iwashima 80cc28a4b4 sysctl: Fix data races in proc_dointvec().
[ Upstream commit 1f1be04b4d ]

A sysctl variable is accessed concurrently, and there is always a chance
of data-race.  So, all readers and writers need some basic protection to
avoid load/store-tearing.

This patch changes proc_dointvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side.  For now, proc_dointvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-21 21:20:06 +02:00
Josh Poimboeuf afc2d635b5 x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting
commit 44a3918c82 upstream.

With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
to Spectre v2 BHB-based attacks.

When both are enabled, print a warning message and report it in the
'spectre_v2' sysfs vulnerabilities file.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
[fllinden@amazon.com: backported to 5.10]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11 12:11:49 +01:00
Daniel Borkmann 8c15bfb36a bpf: Add kconfig knob for disabling unpriv bpf by default
commit 08389d8882 upstream.

Add a kconfig knob which allows for unprivileged bpf to be disabled by default.
If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2.

This still allows a transition of 2 -> {0,1} through an admin. Similarly,
this also still keeps 1 -> {1} behavior intact, so that once set to permanently
disabled, it cannot be undone aside from a reboot.

We've also added extra2 with max of 2 for the procfs handler, so that an admin
still has a chance to toggle between 0 <-> 2.

Either way, as an additional alternative, applications can make use of CAP_BPF
that we added a while ago.
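
The allowed transitions can be exercised through the procfs handler
(path and values taken from this message; requires root):

```shell
# While in state 2, an admin may still toggle 0 <-> 2.
echo 0 > /proc/sys/kernel/unprivileged_bpf_disabled
# Writing 1 permanently disables unprivileged bpf until reboot.
echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled
```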

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
Cc: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-05 12:40:34 +01:00