ANBZ: #22143
In high-concurrency scenarios, highclass tasks often select the same
idle CPU when waking up, which can cause scheduling delays. To reduce
the scheduling delays caused by this situation, we introduce
ID_BOOK_CPU:
When a highclass task selects an idle cpu, check whether it has been
booked by other tasks and whether it is still idle before we select it
for wakeup. If the idle cpu we found has been booked by other tasks,
select again, until an idle cpu is booked successfully or the retry
limit is reached. If the idle cpu has not been booked by any other
task, set rq->booked to true to mark that the cpu is booked, and set
it back to false after the highclass task actually enqueues into the
rq of that cpu.
To enable ID_BOOK_CPU:
echo ID_BOOK_CPU > /sys/kernel/debug/sched_features.
To set retry limit, modify /proc/sys/kernel/sched_id_book_cpu_nr_tries.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5455
ANBZ: #11591
If several cpus push expellee tasks at the same time, there will be
lock contention, which can cause sys utilization to spike for a short
time. To avoid causing jitter to unexpellee tasks, we use
raw_spin_rq_trylock() to reduce lock contention, and check whether we
should resched on every loop.
We also introduce sysctl_sched_push_expellee_interval to control the
frequency of pushing expellee tasks; the default interval is 6ms.
We also fix a bug where a cpu could not enter nohz_full state when
ID_PUSH_EXPELLEE is turned off.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4125
ANBZ: #9497
commit 079be8fc63 upstream.
The validation of the value written to sched_rt_period_us was broken
because:
- sysctl_sched_rt_period is declared as unsigned int
- it is parsed by proc_dointvec()
- the range is asserted after the value is parsed by proc_dointvec()
Because of this, negative values written to the file were stored in an
unsigned integer and later interpreted as large positive integers,
which passed the check:
if (sysctl_sched_rt_period <= 0)
return -EINVAL;
This commit fixes the parsing by setting an explicit range for both
period_us and runtime_us in the sched_rt_sysctls table and processing
the values with proc_dointvec_minmax() instead.
Alternatively, if we wanted to use the full range of unsigned int for
the period value, we would have to split the proc_handler and use
proc_douintvec() for it; however, even
Documentation/scheduler/sched-rt-group.rst describes the range as 1 to
INT_MAX.
As far as I can tell, the only problem this causes is that the sysctl
file allows writing negative values, which when read back may confuse
userspace.
There is also a LTP test being submitted for these sysctl files at:
http://patchwork.ozlabs.org/project/ltp/patch/20230901144433.2526-1-chrubis@suse.cz/
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20231002115553.3007-2-chrubis@suse.cz
[ pvorel: rebased for 5.15, 5.10 ]
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e4bc311745078e145f251ab5b28fb30f90652583)
[dtcccc: we have not introduced SYSCTL_NEG_ONE, use neg_one instead.]
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3549
ANBZ: #9482
commit e1aa3fe3e282a0532fffaa68360cbd45a2a59291 stable.
[ Upstream commit 39c65a94cd ]
For embedded systems with low total memory, having to run applications
with relatively large memory requirements, 10% max limitation for
watermark_scale_factor poses an issue of triggering direct reclaim every
time such application is started. This results in slow application
startup times and bad end-user experience.
By increasing watermark_scale_factor max limit we allow vendors more
flexibility to choose the right level of kswapd aggressiveness for their
device and workload requirements.
Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Fengfei Xi <xi.fengfei@h3c.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stable-dep-of: 935d44acf6 ("memfd: check for non-NULL file_seals in memfd_create() syscall")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3468
ANBZ: #8765
Group balancer is a feature that schedules task groups as a whole to
achieve better locality. It uses a soft cpu bind method which offers
a dynamic way to restrict allowed CPUs for tasks in the same group.
This patch introduces the switch and config of group balancer.
Co-developed-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3215
user-space scheduling
ANBZ: #8975
Since sched_ext is still pending and not merged into the upstream
kernel, we designed a light-weight user-space scheduling mechanism
based on bpf fmod_ret, which can modify the return value of almost any
function. For safety, the upstream kernel limits the scope of functions
allowed to be modified, so we enlarge it.
Also add a sysctl "bpf_customized_fmodret" to control this feature.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3146
ANBZ: #8997
In certain scenarios, multiple containers within a single pod will share
a PID namespace. These containers do not set specific resource limits;
instead, resource limitations are applied to the pod as a whole.
Therefore, it's necessary to use the pod's cgroup configuration as the
data source for rich containers. From the perspective of the kernel,
this is essentially the parent cgroup of the current process.
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3151
ANBZ: #7692
The functionality of rich containers is supported in cgroup v2 through
eBPF, but only a subset of features is supported. To ensure consistency
between both sides, add switches for each feature of the rich container
in cgroup v1.
Features and their names are: cpuinfo, meminfo, cpuusage, loadavg,
uptime, diskquota. All features are enabled by default if rich
container is enabled.
Add a new sysctl interface named rich_container_feature_control. When
read, it shows a space-separated list of the features which are
enabled. A space-separated list of features prefixed with '+' or '-'
can be written to enable or disable features: a feature name prefixed
with '+' enables the feature and '-' disables it. If a feature appears
more than once in the list, the last occurrence is effective. When
multiple enable and disable operations are specified, either all
succeed or all fail.
e.g.:
echo "+cpuinfo -meminfo +uptime" >
/proc/sys/kernel/rich_container_feature_control
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2489
ANBZ: #6805
During stress tests, there are frequent false soft lockup alarms
caused by long loops rather than true lockups. So we enlarge the
watchdog_thresh limit to 150 to avoid this kind of case.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2286
ANBZ: #6729
In order to dynamically turn acpu accounting on and off, we introduce
sysctl_sched_acpu_enabled instead of having it on by default.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2260
ANBZ: #6587
commit 51ce83ed52 upstream.
Use a small, non-scaled min granularity for SCHED_IDLE entities, when
competing with normal entities. This reduces the latency of getting
a normal entity back on cpu, at the expense of increased context
switch frequency of SCHED_IDLE entities.
The benefit of this change is to reduce the round-robin latency for
normal entities when competing with a SCHED_IDLE entity.
Example: on a machine with HZ=1000, spawned two threads, one of which is
SCHED_IDLE, and affined to one cpu. Without this patch, the SCHED_IDLE
thread runs for 4ms then waits for 1.4s. With this patch, it runs for
1ms and waits 340ms (as it round-robins with the other thread).
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20210820010403.946838-4-joshdon@google.com
[dtcccc: adapt "idle_min_granularity_ns" to sysctl instead of debugfs
because kernel 5.10 has not moved them yet. I set the min value of this
config to ZERO to suppress BE tasks better.]
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2250
ANBZ: #6201
In order to control the switch of core sched, we introduce
sysctl_sched_core. When turned on, it is allowed to give/share cookies
to tasks. When turned off, the cookies of all tasks are cleared, and
it is not allowed to give/share cookies to tasks (only getting a
cookie is allowed).
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2243
ANBZ: #4814
There is a bug in the sched_expel_update_interval sysctl. See the
following:
# cat /proc/sys/kernel/sched_expel_update_interval
10 0
There are two values, which is wrong.
The variable sysctl_sched_expel_update_interval is defined as unsigned
long, so two values appear when reading this sysctl, because the
proc_dointvec handler splits the size of a long into two sizes of int.
So the variable should be defined as unsigned int, which is large
enough for its use.
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1593
ANBZ: #4226
On x86_64 with big RAM, people running containers set pid_max on the
host to large values to be able to launch more containers. At the same
time, containers running 32-bit software experience problems with
large pids: ps calls readdir/stat on proc entries, and the inode's
i_ino happens to be too big for the 32-bit API.
Thus, the ability to limit the pid value inside a container is
required, and the max pid can be set through '/proc/sys/kernel/pid_max'.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1288
ANBZ: #3460
commit e6cfaf34be upstream.
proc_get_long() is passed a size_t, but then assigns it to an 'int'
variable for the length. Let's not do that, even if our IO paths are
limited to MAX_RW_COUNT (exactly because of these kinds of type errors).
So do the proper test in the right type.
Reported-by: Kyle Zeng <zengyhkyle@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: CVE-2022-3567
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1006
ANBZ: #3460
commit bce9332220 upstream.
proc_skip_spaces() seems to think it is working on C strings, and ends
up being just a wrapper around skip_spaces() with a really odd calling
convention.
Instead of basing it on skip_spaces(), it should have looked more like
proc_skip_char(), which really is the exact same function (except it
skips a particular character, rather than whitespace). So use that as
inspiration, odd coding and all.
Now the calling convention actually makes sense and works for the
intended purpose.
Reported-and-tested-by: Kyle Zeng <zengyhkyle@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: CVE-2022-4378
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1006
ANBZ: #2022
We introduce a counter group_identity_count to indicate how many
cgroups have enabled group identity, and a function
group_identity_enabled() to indicate whether any cgroup has group
identity enabled: if group_identity_count is zero,
group_identity_enabled() returns false, otherwise it returns true.
These helpers serve the group identity fast path and also conflict
detection, e.g. conflicts with core scheduling.
Only if group_identity_count is zero can group identity be turned
on/off.
Turn on:
echo 1 > /proc/sys/kernel/sched_group_identity_enabled
Turn off:
echo 0 > /proc/sys/kernel/sched_group_identity_enabled
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/682
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
ANBZ: #1474
Add "sysctl.kernel.userns_max_level" to control the maximum nested
level of user namespace. The valid configuration values are 0-33.
When configured to zero, user namespace is effectively disabled.
Originally the check is "if (parent_ns->level > 32)" and
init_user_ns.level is zero, so the actual maximum level is 33
instead of 32.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
ANBZ: #1476
commit 5758824b20fa2308ebb5c460874d0ffd73d0d8e4 Ubuntu groovy.
It is turned on by default, but can be turned off if admins prefer or,
more importantly, if a security vulnerability is found.
The intent is to use this as mitigation so long as Ubuntu is on the
cutting edge of enablement for things like unprivileged filesystem
mounting.
(This patch is tweaked from the one currently still in Debian sid, which
in turn came from the patch we had in saucy)
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
[bwh: Remove unneeded binary sysctl bits]
[ saf: move extern unprivileged_userns_clone declaration to
include/linux/user_namespace.h to conform with 2374c09b1c
"sysctl: remove all extern declaration from sysctl.c" ]
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
[jeffle: add documentation for the sysctl]
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
ANBZ: #1330
In some scenarios, if in-use hugetlb pages are migrated, the program
will fail to access its corresponding data, which then causes a
program execution exception.
To solve the problem, we add a sysctl interface
"enable_used_hugtlb_migration".
To disable migration:
echo 0 > /proc/sys/vm/enable_used_hugtlb_migration
To enable migration:
echo 1 > /proc/sys/vm/enable_used_hugtlb_migration
Note that this bug is caused by backporting commit ae37c7ff79 ("mm:
make alloc_contig_range handle in-use hugetlb pages"). The solution
only applies to alloc_contig_range(); the other paths can still
migrate in-use hugetlb pages.
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
ANBZ: #849
The kernel generally prefers to reclaim page cache over anonymous pages,
and only reclaims page cache when there is no swap configured.
However, in some extreme scenarios, where the application allocates a
large amount of anonymous memory, the page cache is almost completely
exhausted while the OOM killer barely fires. The system can suffer
from heavy IO, and application performance can be significantly
affected; the application may even become unresponsive, since the page
cache, including the application program instructions, is thrashing.
In such scenarios, some users do want an OOM kill instead of a
half-dead system.
This provides users the ability to reserve page cache system-wide.
With appropriate amount of page cache reserved, OOM killer can be
triggered in time, and some key processes can make progress (cooperate
with oom_score_adj, for example).
Enable the feature with:
echo XXX > /proc/sys/vm/min_cache_kbytes
disable the feature with:
echo 0 > /proc/sys/vm/min_cache_kbytes
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Suggested-by: yinbinbin <yinbinbin001@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #781
Group Identity is a totally different feature from BVT, and the
definition and logic of bvt_warp_ns have been totally reformed.
Since 10 lines of code are related to the BVT paper and its code
(Jacob Leverich), this patch gives credit to BVT and its related
work.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
ANBZ: #45
With allnoconfig, there are some warnings during compilation:
kernel/sched/fair.c:955:1: warning: 'id_update_make_up' defined but not used [Wunused-function]
id_update_make_up(struct task_group *tg, struct rq *rq, struct cfs_rq *cfs_rq,
^
kernel/sched/fair.c:1008:1: warning: 'id_commit_make_up' defined but not used [Wunused-function]
id_commit_make_up(struct rq *rq, bool commit)
^
kernel/sched/fair.c:1724:1: warning: 'id_can_migrate_task' defined but not used [Wunused-function]
id_can_migrate_task(struct task_struct *p, struct rq *src_rq, struct rq *dst_rq)
^
kernel/sysctl.c:122:22: warning: 'penalty_extra_delay_max' defined but not used [Wunused-function]
static unsigned long penalty_extra_delay_max = HZ;
^
To fix this problem, these functions are guarded by the corresponding
configs.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000312
Add pool_size and delay_time interfaces. When the user writes
pool_size, a cgroup pool will be created; then, when the user needs
to create a cgroup, it will take the fast path and get a cgroup from
the pool.
pool_size is the capacity of the cgroup pool, which should be larger
than the number of cgroups the user creates at the same time. For
example, it is proper to set pool_size to 500 if creating 400 cgroups
at the same time.
When the pool is exhausted, a supply work will be triggered. In order
to avoid competition with the cgroup currently being created, the
supply work can be delayed. The user can set cgroup_supply_delay_time
to delay the supply work to a time when the system is idle.
Usage:
enable pool:
echo 500 > /sys/fs/cgroup/<subsystem>/test-cgroup/pool_size
disable pool:
echo 0 > /sys/fs/cgroup/<subsystem>/test-cgroup/pool_size
delay supply work to 3000ms later:
echo 3000 > /proc/sys/kernel/cgroup_supply_delay_time
Suggested-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yi Tao <escape@linux.alibaba.com>
to #34329845
Specific workloads using multiple threads to do sequential reads on
the same file may introduce contention on a spinlock during page cache
readahead. The readahead process will end earlier when it detects a
readahead page conflict. This is effective for sequential reads of the
same file with multiple threads and a large readahead size.
The benefit can be shown in the following test by fio:
fio --name=ratest --ioengine=mmap --iodepth=1 --rw=read --direct=0
--size=512M --numjobs=32 --directory=/mnt --filename=ratestfile
Firstly, we set readahead size to its default value:
blockdev --setra 256 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=4427MiB/s (4642MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s),
io=16.0GiB (17.2GB), run=3699-3701msec
Disk stats (read/write):
vdh: ios=22945/0, merge=0/0, ticks=12515/0, in_queue=12515,
util=50.79%
Then, we set readahead size larger:
blockdev --setra 8192 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=6450MiB/s (6764MB/s), 202MiB/s-202MiB/s (211MB/s-212MB/s),
io=16.0GiB (17.2GB), run=2535-2540msec
Disk stats (read/write):
vdh: ios=19503/0, merge=0/0, ticks=49375/0, in_queue=49375,
util=23.67%
Finally, we set sysctl vm.enable_multithread_ra_boost=1 with ra=8192
and we get:
Run status group 0 (all jobs):
READ: bw=7234MiB/s (7585MB/s), 226MiB/s-226MiB/s (237MB/s-237MB/s),
io=16.0GiB (17.2GB), run=2262-2265msec
Disk stats (read/write):
vdh: ios=10548/0, merge=0/0, ticks=40709/0, in_queue=40709,
util=23.62%
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
to #34868789
There are various kernel errors/warnings; if users want to know the
effect of an error or warning, they have to look into kernel details,
which is difficult for users to handle.
This patch classifies kernel fault events, dividing events into their
own modules such as sched, mem, io, net, etc., and reports the effect
class of the current fault event.
There are three effect classes:
Slight - just a jitter, everything still works.
Normal - the current task may hit an exception.
Fatal - the system may be unstable.
According to this information, users can easily choose a suitable
action to ensure reliability.
This feature also can be used for checking the side effect of a
system-change.
This feature can be enabled/disabled via
/proc/sys/kernel/fault_event_enable.
Signed-off-by: Wetp Zhang <wetp.zy@linux.alibaba.com>
Signed-off-by: Meng Shen <shenmeng@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #34609774, fix #34359599, fix #34361563
In order to prevent scheduling errors in the case of ipi failure,
update_rq_on_expel() is called in pick_next_task_fair(), but this
causes a performance regression. To prevent errors while ensuring
performance, we introduce an interval for calling
update_rq_on_expel(); the default value is 10, unit: jiffy. We also
introduce a sched_feat, ID_LOOSE_EXPEL, to control whether to call
update_rq_on_expel() in pick_next_task_fair() periodically: if on,
update_rq_on_expel() is called loosely, with an interval larger than
sysctl_sched_expel_update_interval between calls; otherwise it is
called every time. The default value of ID_LOOSE_EXPEL is OFF.
Here follows the data of test will-it-scale-sched_yield-thread-15
before applying this patch:
will-it-scale-sched_yield-thread-15-1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,0,0.00,1479767,95.80,1479767
will-it-scale-sched_yield-thread-15-50%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
48,0,0.00,15057129,0.12,0
will-it-scale-sched_yield-thread-15-100%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
96,0,0.00,14793466,0.14,0
Here follows the data of test will-it-scale-sched_yield-thread-15
after applying this patch with sysctl_sched_expel_update_interval
= 10 and ID_LOOSE_EXPEL on:
will-it-scale-sched_yield-thread-15-1
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
1,0,0.00,1566401,95.82,1566401
will-it-scale-sched_yield-thread-15-50%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
48,0,0.00,15789744,0.12,0
will-it-scale-sched_yield-thread-15-100%
tasks,processes,processes_idle,threads,threads_idle,linear
0,0,100,0,100,0
96,0,0.00,15612437,0.15,0
We also tested it with nginx + ffmpeg, and here is the scenario:
online task: nginx
number of online tasks: 2
bvt of online tasks: 2
offline task: ffmpeg
number of offline tasks: 4
bvt of offline task: -1
Here follows the data before applying this patch:
only online online+offline
cpu usage: 7.79% 46.69%
nginx rt: 0.84ms 0.92ms
Here follows the data after applying this patch
sched_expel_update_interval = 10(unit: jiffy)
only online online+offline
cpu usage: 7.91% 46.84%
nginx rt: 0.82ms 0.89ms
sched_expel_update_interval = 100(unit: jiffy)
only online online+offline
cpu usage: 7.81% 46.45%
nginx rt: 0.81ms 0.87ms
sched_expel_update_interval = 1000(unit: jiffy)
only online online+offline
cpu usage: 7.91% 46.19%
nginx rt: 0.81ms 0.87ms
As we can see from the results, this patch resolves the performance
regression to a certain degree.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34609774
We found that underclass tasks can migrate frequently and occupy too
much idle CPU, which causes high 99th percentile latency for highclass
tasks, since the latency to wake up on an idle CPU is much lower than
the latency to preempt an underclass task.
Thus we introduce the IDLE_SAVER identity, which is activated by
default for groups with bvt -1. Now, when such tasks want to wake up
on an idle CPU, the average idle time is required to be above
'sysctl_sched_idle_saver_wmark' (default 1ms).
For underclass tasks that want to occupy more idle CPU, e.g. vcpu
threads, just deactivate the identity by turning off bit 3 in
'cpu.identity'.
Also improve the logic in select_idle_core(): now a core full of
underclass tasks is considered a backup option for highclass, and
underclass tasks are prevented from taking an idle core for expel or
idle saver reasons.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34609774
Concurrent execution on smt cpus can impact each other's performance
since they share hardware units; an underclass task should not cause
regression to the highclass task running on its smt peer in this way.
The legacy feature Noise Clean was supposed to address this issue, but
it introduced too much overhead on throttling/unthrottling tasks,
causing unstable latency in the real world.
As we now have all underclass tasks on a separate tree, just hiding
such tasks from pick_next_task() is enough.
Thus we introduce another identity, SMT_EXPELLER, which tries to expel
the underclass tasks on smt peers when running.
By applying the corresponding bit in 'group_identity', tasks become
expellers; while they are executing, their smt peers cannot pick
underclass tasks to run.
The legacy 'bvt_warp_ns' value 2 will have this identity by default.
To support the new cgroup deployments, e.g. k8s, it is required that a
highclass parent can have both highclass and underclass children.
In order to support expel in this case, we need to skip the highclass
se when and only when it is full of underclass tasks.
We achieve this by isolating them from the rb-tree temporarily, until
the end of the expel, or until the se no longer contains only
underclass tasks; this is like throttling but costs far less.
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34609774
In a collocation environment, we deploy latency sensitive (LS)
workloads and best effort (BE) workloads on the same machine, and
borrowed virtual time (BVT) is the legacy feature that helps ensure
schedule priority for LS.
This patch just ports the idea; however, in order to reduce the
overhead, we introduce the group identity feature, which is based on
the idea of multiple rb-trees.
There are now two rb-trees for each cfs_rq, and tasks with a special
identity are queued into the corresponding tree; this helps reduce
the complexity of the implementation and obviously reduces the
overhead of the following smt expeller (Noise Clean) feature.
By writing a particular identity into the new cpu-cgroup entry named
'group_identity', tasks of this cgroup will perform the special
schedule behavior, including:
* ID_NORMAL
Normal CFS tasks
* ID_HIGHCLASS
Tasks preempt normal & underclass tasks on wakeup,
and also consider an underclass-only cpu as idle.
* ID_UNDERCLASS
Tasks take a penalty on wakeup.
The legacy entry 'bvt_warp_ns' is reserved for compatibility; the
relationship with identities is:
* 2 == ID_HIGHCLASS
* 1 == ID_HIGHCLASS
* 0 == ID_NORMAL
* -1 == ID_UNDERCLASS
* -2 == ID_UNDERCLASS
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
to #34559224
lxcfs supports rich container cpuinfo based on cpu quota, and sigma
cpushare also has the requirement of deriving cpuinfo from cpu.shares.
Thus, introduce new knobs to support these:
/proc/sys/kernel/rich_container_cpuinfo_source
- 0 uses cpu quota "quota/period" by default.
It will fall back to cpuset.cpus if quota is not set.
- 1 uses cpuset.cpus.
- 2 uses cpushare "cpu.shares/1024" (1024 can be configured below)
/proc/sys/kernel/rich_container_cpuinfo_sharesbase
- when rich_container_cpuinfo_source is 2, this is the divisor.
Note that, after faking cpuinfo in dockers, it is impossible to make
/proc/stat match accordingly, but we only care about the following
stats, not /proc/stat:
/proc/cpuinfo, sysfs online and /proc/meminfo
This feature depends on CONFIG_RICH_CONTAINER_CG_SWITCH=n.
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #34559224
Make the rich container support k8s in two different ways: one uses
the global switch when CONFIG_RICH_CONTAINER_CG_SWITCH is not set, and
the other uses a per-cgroup switch when the config is set.
For the per-cgroup switch (CONFIG_RICH_CONTAINER_CG_SWITCH=y):
For a k8s shared-pidns pod, all the containers under the pod share
the same pid_namespace, and the namespace child_reaper lives in a
special container called "pause". Thus we can't use child_reaper
as the rich container information any more. Instead, we introduce
a new cgroup interface "cgroup.rich_container_source" to let users
(pouch) get involved; the rich container kernel code will search
from the current cgroup upwards until the first cgroup with this
file set to a nonzero value, and use that cgroup as the data source
of the rich container. If none is found, fall back to the default
cgroup containing the child_reaper.
New knob "cgroup.rich_container_source":
default value 0, children don't inherit.
e.g.,
pod (shared pidns)
/ | \
c1 c2 pause
If explicitly set pod's cgroup.rich_container_source(cpuacct, cpuset
and memory subsys) to be 1, "c1" or "c2" will reflect the statistics
of "pod" like:
echo 1 > /sys/fs/cgroup/cpuacct/.../pod/cgroup.rich_container_source
echo 1 > /sys/fs/cgroup/cpuset/.../pod/cgroup.rich_container_source
echo 1 > /sys/fs/cgroup/memory/.../pod/cgroup.rich_container_source
For the global switch: (CONFIG_RICH_CONTAINER_CG_SWITCH = n)
For k8s, multiple containers share one pid_namespace, and the child
reaper lives in a special "pause" container among them.
We should therefore use current rather than the child reaper to get
the information, which calls for a runtime interface.
Introduce "/proc/sys/kernel/rich_container_source":
- 0 means to use cgroups of "current" by default.
- 1 means to use cgroups of "child reaper".
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ Shile: fixed the conflicts with
fe8a4afd7c2 (ck: proc/uptime: Add uptime support for rich container) ]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Erwei Deng <erwei@linux.alibaba.com>
to #34559224
Introduce "/proc/sys/kernel/rich_container_enable" to control
rich container at runtime. Default off.
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #32761176
Accumulate unused quota from previous periods, thus accumulated
bandwidth runtime can be used in the following periods. During
accumulation, take care of runtime overflow. The previous non-burstable
CFS bandwidth controller only assigned quota to runtime, which saved a
lot of work.
A sysctl parameter, sysctl_sched_cfs_bw_burst_onset_percent, is introduced
to denote what percentage of burst is granted when cfs bandwidth is set. By
default it is 0, which means no burst is allowed unless accumulated.
Also, parameter sysctl_sched_cfs_bw_burst_enabled is introduced as a
switch for burst. It is enabled by default.
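The accumulate-and-clamp idea, including the overflow guard the changelog mentions, can be sketched as follows; the struct and function names are illustrative, not the kernel's:

```c
#include <assert.h>

typedef unsigned long long u64;

/* Toy model of burstable refill: unused runtime carries over, capped
 * at quota + burst, with a guard against u64 wrap-around. */
struct cfs_bw {
	u64 quota;   /* runtime granted per period */
	u64 burst;   /* max extra runtime that may accumulate */
	u64 runtime; /* currently available runtime */
};

void refill_runtime(struct cfs_bw *bw)
{
	u64 cap = bw->quota + bw->burst;
	u64 refill = bw->runtime + bw->quota;

	/* if the addition wrapped, saturate at the cap */
	if (refill < bw->runtime)
		refill = cap;
	if (refill > cap)
		refill = cap;
	bw->runtime = refill;
}
```

For example, with quota 100 and burst 50, leftover runtime of 30 refills to 130; a second idle period would saturate at 150 rather than growing without bound.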
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
to #32751521
A task will not be throttled if the penalty time is less than 10ms when
memory.usage_in_bytes has exceeded memory.high. But some tasks
allocating a lot of memory concurrently can still drive the cgroup into
the OOM killer. The interface adjusts the penalty time to throttle such
tasks further, so that oomd can effectively monitor them and kill
the related task in the mem cgroup to avoid the kernel OOM killer.
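The effect of the 10ms floor and the penalty adjustment can be modeled in a few lines; the constant and names are illustrative, not the actual memcg code:

```c
#include <assert.h>

/* Toy model: computed penalties below the floor are skipped entirely,
 * so heavy concurrent allocators may never be throttled. Scaling the
 * penalty up (the knob described above) pushes it over the floor so
 * throttling kicks in and oomd gets time to react. */
#define PENALTY_FLOOR_MS 10

long effective_throttle_ms(long penalty_ms, long scale_factor)
{
	penalty_ms *= scale_factor; /* knob-controlled amplification */
	if (penalty_ms < PENALTY_FLOOR_MS)
		return 0; /* below the floor: no throttling at all */
	return penalty_ms;
}
```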
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
fix#32811140
By default the tcp_tw_timeout value is 60 seconds. The minimum is
1 second and the maximum is 600. This setting is useful on systems under
heavy TCP load.
NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time"
restriction and puts your system at risk of some receivers accepting old
data as new, or rejecting new data as old duplicates.
Link: http://tools.ietf.org/html/rfc793
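The documented range check can be expressed as a simple clamp; this is a sketch of the validation policy, not the actual sysctl handler:

```c
#include <assert.h>

/* Clamp tcp_tw_timeout to the documented [1, 600] second range
 * (default 60). Illustrative only. */
int clamp_tw_timeout(int secs)
{
	if (secs < 1)
		return 1;
	if (secs > 600)
		return 600;
	return secs;
}
```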
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
fix#31801703
Too many printed memcg OOM messages can trigger a softlockup.
In general, we use the same ratelimit (oom_rc) between system and memcg
OOM to limit the printed messages. But the memcg side exceeds its limit
much more frequently, so a lot of information gets printed, which is
likely to trigger a softlockup.
This patch uses different ratelimits for memcg and system OOM. We tested
the patch using the default values in the memcg, and the issue is gone.
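The burst-per-interval idea behind keeping two separate ratelimit states (one for memcg OOM, one for system OOM) can be sketched like this; the kernel's ___ratelimit() is more involved, and these names are illustrative:

```c
#include <assert.h>

/* Toy token-bucket ratelimit: at most 'burst' messages per 'interval'
 * window. Keeping one instance for memcg OOM and another for system
 * OOM means frequent memcg OOMs cannot exhaust the system budget. */
struct ratelimit {
	long interval; /* window length, abstract ticks */
	int burst;     /* messages allowed per window */
	long begin;    /* start of the current window */
	int printed;   /* messages emitted in this window */
};

int ratelimit_allow(struct ratelimit *rl, long now)
{
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;   /* open a new window */
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1; /* message may be printed */
	}
	return 0; /* suppressed */
}
```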
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
fix#32695316
For some workloads whose io activities are mostly random, context
readahead feature can introduce unnecessary io read operations, which
will impact app's performance. Context readahead's algorithm is
straightforward and not that smart.
This patch adds "/proc/sys/vm/enable_context_readahead" to control
whether this feature is enabled. Context readahead is enabled by
default; users can echo 0 to /proc/sys/vm/enable_context_readahead
to disable it.
We also tested mongodb's performance in the 'random point select' case.
With context readahead enabled:
mongodb eps 12409
With context readahead disabled:
mongodb eps 14443
About 16% performance improvement.
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
fix#32824162
This is a temporary workaround to avoid the limitation when
creating hard links across two projids.
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
[ Upstream commit 43b5240ca6 ]
"numa_stat" should not be included in the scope of CONFIG_HUGETLB_PAGE. If
CONFIG_HUGETLB_PAGE is not configured, "numa_stat" is missing from /proc
even if CONFIG_NUMA is configured. Move it out of CONFIG_HUGETLB_PAGE to
fix it.
Fixes: 4518085e12 ("mm, sysctl: make NUMA stats configurable")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 7d1025e559 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec_ms_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_dointvec_ms_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit e877820877 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec_jiffies() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_dointvec_jiffies() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c31bcc8fb8 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_doulongvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_doulongvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2d3b559df3 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_douintvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_douintvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 61d9b56a89 ("sysctl: add unsigned int range support")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f613d86d01 ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec_minmax() to use READ_ONCE() and
WRITE_ONCE() internally to fix data-races on the sysctl side. For now,
proc_dointvec_minmax() itself is tolerant to a data-race, but we still
need to add annotations on the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 4762b532ec ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_douintvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side. For now, proc_douintvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.
Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1f1be04b4d ]
A sysctl variable is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
This patch changes proc_dointvec() to use READ_ONCE() and WRITE_ONCE()
internally to fix data-races on the sysctl side. For now, proc_dointvec()
itself is tolerant to a data-race, but we still need to add annotations on
the other subsystem's side.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 44a3918c82 upstream.
With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable
to Spectre v2 BHB-based attacks.
When both are enabled, print a warning message and report it in the
'spectre_v2' sysfs vulnerabilities file.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
[fllinden@amazon.com: backported to 5.10]
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 08389d8882 upstream.
Add a kconfig knob which allows for unprivileged bpf to be disabled by default.
If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2.
This still allows a transition of 2 -> {0,1} through an admin. Similarly,
this also still keeps 1 -> {1} behavior intact, so that once set to permanently
disabled, it cannot be undone aside from a reboot.
We've also added extra2 with max of 2 for the procfs handler, so that an admin
still has a chance to toggle between 0 <-> 2.
Either way, as an additional alternative, applications can make use of CAP_BPF
that we added a while ago.
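The write-side policy described above (values capped at 2 via extra2, 0 <-> 2 togglable, 1 sticky until reboot) can be sketched as a transition check; the function name is hypothetical, not the actual bpf sysctl handler:

```c
#include <assert.h>

/* Toy validity check for writes to unprivileged_bpf_disabled:
 *  - values outside 0..2 are rejected (extra2 caps the range at 2)
 *  - once set to 1, the value is permanent until reboot
 *  - 2 may still be changed to 0 or 1 by an admin */
int bpf_disabled_transition_ok(int cur, int next)
{
	if (next < 0 || next > 2)
		return 0;
	if (cur == 1 && next != 1)
		return 0;
	return 1;
}
```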
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net
Cc: Salvatore Bonaccorso <carnil@debian.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>