anolis-cloud-kernel

Commit Graph

Author	SHA1	Message	Date
Cruz Zhao	98170019bf	anolis: sched: don't account util_est for each cfs_rq when group_balancer disabled ANBZ: #8765 If we account util_est for each cfs_rq when group_balancer disabled, there will be some overhead. To avoid the overhead, we only account it when group balancer enabled, and when the switch is turned, we recalculate for each cfs_rq at once. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5529	2025-07-21 02:35:12 +00:00
Yinan Liu	160227a0f9	bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen ANBZ: #22860 commit `29ebbba7d4` upstream. With the way the hooks implemented right now, we have a special condition: optval larger than PAGE_SIZE will expose only first 4k into BPF; any modifications to the optval are ignored. If the BPF program doesn't handle this condition by resetting optlen to 0, the userspace will get EFAULT. The intention of the EFAULT was to make it apparent to the developers that the program is doing something wrong. However, this inadvertently might affect production workloads with the BPF programs that are not too careful (i.e., returning EFAULT for perfectly valid setsockopt/getsockopt calls). Let's try to minimize the chance of BPF program screwing up userspace by ignoring the output of those BPF programs (instead of returning EFAULT to the userspace). pr_info_once those cases to the dmesg to help with figuring out what's going wrong. Fixes: `0d01da6afc` ("bpf: implement getsockopt and setsockopt hooks") Suggested-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Yinan Liu <yinan@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5523	2025-07-17 07:35:44 +00:00
Anton Protopopov	750c08658e	bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps ANBZ: #22586 commit `b34ffb0c6d` upstream. The LRU and LRU_PERCPU maps allocate a new element on update before locking the target hash table bucket. Right after that the maps try to lock the bucket. If this fails, then maps return -EBUSY to the caller without releasing the allocated element. This makes the element untracked: it doesn't belong to either of free lists, and it doesn't belong to the hash table, so can't be re-used; this eventually leads to the permanent -ENOMEM on LRU map updates, which is unexpected. Fix this by returning the element to the local free list if bucket locking fails. Fixes: `20b6cc34ea` ("bpf: Avoid hashtab deadlock with map_locked") Signed-off-by: Anton Protopopov <aspsk@isovalent.com> Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5495	2025-07-16 03:10:48 +00:00
Cruz Zhao	9117970143	anolis: sched: remove task_group from gb_sd when it got freed ANBZ: #8765 When a task_group which enabled group_balancer got freed, it should be removed from gb_sd, to prevent use after free. BTW, hold the gb_lock of task_group when it's enabling/disabling group_balancer. Fixes: `7f8e0c7133` ("anolis: sched: maintain group balancer task groups") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5512	2025-07-14 13:59:33 +08:00
Cruz Zhao	a4b5d3580f	anolis: sched: update parameters of group balancer root sched domain ANBZ: #8765 As we will update the root cpumask before we build group balancer sched domains, we should update the parameters too. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	5f2005c72b	anolis: sched: control the interval of gb_load_balance() ANBZ: #8765 To reduce turbulence, the interval of gb_load_balance() should be controlled, and we set it to twice the lower interval. And to prevent unnecessary migration, we only migrate task groups when the imbalance pct exceeds the threshold 117. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	717827ca16	anolis: sched: adjust the lower interval of group balancer sched domain ANBZ: #8765 The goal of lower interval is to control the interval so that the scheduler has enough time to do load balancing, so we'd better to determine the lower interval according to the sched domain where group balancer sched domain is located. So far, the sched domain has the following layers: - NUMA - DIE - MC - SMT When building group balancer sched domain according to hardware topology, we identify the topological structures corresponding, and determine the lower interval according to the span weight. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	9b5823f38c	anolis: sched: travese the descendants when migrating task groups ANBZ: #8765 As a group balancer sched domain may have many descendants, with task groups, when we select task groups to migrate we should traverse the task descendants, too. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	d4d2d65b6d	anolis: sched: set the flags of group balancer sched domain after build ANBZ: #8765 If we set the flags of group balancer sched domain when we build it, the flag of the middle level may be incorrect. So we should set the flags after build. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	f9c7a08dd3	anolis: sched: make soft_cpus_version correct ANBZ: #8765 When task fork, we should reset the soft_cpus_version to -1. When the task group moves to another gb_sd, we should inc the soft_cpus_version. Fixes: `4bb9c29b1c` ("anolis: sched: modify the interface of attach/detach_tg_to/from_group_balancer_sched_domain") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	cb5766c6e7	anolis: sched: optimize the selecting child gb_sd of task group ANBZ: #8765 When lowering a task group, we select a child group balancer sched domain to migrate it to. Firstly, we should control the interval of lowering task groups of the parent gb_sd, to avoid causing imbalance. Secondly, if the load that the task group spans is balanced among the child gb_sds, we should select the one with lower load. By the way, fix the bug of using the wrong load when calculating the load of the task group. Fixes: `bebcfc550d` ("anolis: sched: introduce dynamical load balance for group balancer") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	781988ed40	anolis: sched: adjust the condition of gb_load_balance ANBZ: #8765 When load balance, we should migrate task groups when we cannot balance the sched groups, instead of migrating task groups at the entry of load balance. By the way, we should call task_tick_gb() when tick, to lower the task groups which belongs to a higher level group balancer sched domain. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	be3d022910	anolis: sched: calculate util_est for each cfs_rq ANBZ: #8765 CFS only calculates util_est for the root cfs_rq, but for group balancer, util_est for each cfs_rq is needed for load balance. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	ede3a68d1e	anolis: sched: make sure root_cpumask of group balancer is online cpumask ANBZ: #8765 Under certain scenarios (such as virtualization environments, or when parameters like max cpus are configured in the cmdline), the cpu_possible_mask may be incorrect and contain some cpus which don't exist. Under these scenarios, group balancer may dereference a null pointer when building group balancer sched domains. To avoid unexpected scenarios, we use the intersection of housekeeping_cpumask(HK_FLAG_DOMAIN) and cpu_online_mask as root_cpumask, making sure that root_cpumask of group_balancer is an online cpumask. Considering that the cpumask of each topology level may not be reliable at boot, we validate the topology level again before we build group balancer sched domains. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	6f6c7b8117	anolis: sched: fix deadlock when add/remove task group to/from gb_sd ANBZ: #8765 When we move task group between gb_sd, we should not hold the same lock twice. Fixes: `4bb9c29b1c` ("anolis: sched: modify the interface of attach/detach_tg_to/from_group_balancer_sched_domain") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5434	2025-07-09 11:13:24 +00:00
Cruz Zhao	b2f87d3eb4	anolis: sched: add reserved sched_features ANBZ: #22299 The mechanism of sched_features provides a convenient way to add switches for enabling or disabling certain scheduler functionalities. To quickly use sched_features as switches in hotfixes or other kernel modules, we add some reserved sched_features, e.g., SCHED_FEAT_RESERVE1. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5480	2025-07-03 15:06:49 +08:00
Cruz Zhao	fbe382f2c6	anolis: sched: refactor ID_ABSOLUTE_EXPEL ANBZ: #22187 There are some corner cases that ID_ABSOLUTE_EXPEL doesn't cover, which will make it complex to resolve these problems. So we refactor ID_ABSOLUTE_EXPEL using rq->on_expel: we expand the semantics of rq->on_expel to include the meaning of being expelled by highclass tasks of the current cpu, and introduce another flag rq->expel_by_smt_sibling to indicate whether the rq is expelled by its smt sibling. If there are highclass tasks in the rq, we set rq->on_expel to true. As for the statistics, expel_sum keeps the original semantics. By the way, when there are highclass tasks running, we don't push expellees, to reduce performance loss. Fixes: `52249b01d3` ("anolis: sched/fair: introduce sched_feat ID_ABSOLUTE_EXPEL") Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5463	2025-06-30 05:44:52 +00:00
Cruz Zhao	ad071c796e	anolis: sched: introduce ID_BOOK_CPU ANBZ: #22143 In high-concurrency scenarios, highclass tasks often select the same idle CPU when waking up, which can cause scheduling delays. The reason is that To reduce the scheduling delays caused by this situation, we introduce ID_BOOK_CPU: When highclass task selecting idle cpu, check whether it has been booked by other tasks and whether it's still idle before we select it to wake up. If the idle cpu we found has been booked by other tasks, select again, until book the idle cpu successfully or reach the retry limit. If the idle cpu hasn't been booked by any other, make rq->booked true to mark that the cpu is booked, and make rq->booked false after the highclass task actually enqueues into the rq of the cpu. To Enable ID_BOOK_CPU, echo ID_BOOK_CPU > /sys/kernel/debug/sched_features. To set retry limit, modify /proc/sys/kernel/sched_id_book_cpu_nr_tries. Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5455	2025-06-25 16:29:07 +08:00
Jiakun Shuai	8cbb260f55	anolis: phytium: pswiotlb: Add PSWIOTLB mechanism to improve DMA performance ANBZ: #21762 This patch added additional "memory copy" to improve D2H direction DMA performance on Phytium Server SoCs. Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn> Signed-off-by: Jiakun Shuai <shuaijiakun1288@phytium.com.cn> Reviewed-by: Jay chen <jkchen@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5432	2025-06-16 07:11:40 +00:00
Zelin Deng	898d81d8ce	anolis: swiotlb: allocate swiotlb at high address when size >= 2G ANBZ: #11329 Generally, users can pass swiotlb=${nslabs},any to ensure that a large bounce buffer can be allocated successfully. However, a user may forget to add "any", which could cause OOM at boot stage, so allow allocating swiotlb at a high address when size >= 2G. Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5410	2025-06-06 09:18:53 +00:00
Mao Minkai	cc583d3b7a	anolis: sw64: improve sw64_rrk ANBZ: #4688 Implement an improved sw64_rrk which stores the timestamp and loglevel. Signed-off-by: Mao Minkai <maominkai@wxiat.com> Reviewed-by: He Sheng <hesheng@wxiat.com> Signed-off-by: Gu Zitao <guzitao@wxiat.com> Reviewed-by: Min Li <gumi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5372	2025-06-05 06:59:14 +00:00
Tianchen Ding	a79ba51ad1	anolis: sched: Remove fast path for SCHED_COOKIE_MATCH_UNSET ANBZ: #6587 KFENCE reported a UAF: Use-after-free read at 0x0000000093739cf3 (in kfence-#108828): pick_next_task+0x163/0xc60 __schedule+0x10c/0x730 schedule+0x5a/0xd0 exit_to_user_mode_loop+0x5e/0x120 exit_to_user_mode_prepare+0xc5/0xd0 irqentry_exit_to_user_mode+0x5/0x30 asm_exc_page_fault+0x22/0x30 kfence-#108828: 0x000000001e616912-0x00000000f97285ce, size=8, cache=kmalloc-8 allocated by task 3185328 on cpu 111 at 87583.036759s: __kmem_cache_alloc_node+0x22c/0x330 kmalloc_trace+0x26/0x90 sched_core_share_pid+0x394/0x500 __do_sys_prctl+0x592/0x9c0 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x78/0xe2 freed by task 3185328 on cpu 111 at 87583.036770s: sched_core_put_cookie+0x30/0x40 sched_core_share_pid+0x379/0x500 __do_sys_prctl+0x592/0x9c0 do_syscall_64+0x34/0x80 entry_SYSCALL_64_after_hwframe+0x78/0xe2 This is because rq->core->core_cookie has been freed when a task updates its cookie. Since the fast path is hard to maintain, just remove it. Fixes: `d0822ae661` ("anolis: sched: Introduce cookie flags") Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5369	2025-05-29 13:50:50 +08:00
Luis Chamberlain	f201b1abfd	module: avoid allocation if module is already present and ready ANBZ: #21135 commit `064f4536d1` upstream. The finit_module() system call can create unnecessary virtual memory pressure for duplicate modules. This is because load_module() can in the worst case allocate more than twice the size of a module in virtual memory. This saves at least a full size of the module in wasted vmalloc space memory by trying to avoid duplicates as soon as we can validate the module name in the read module structure. This can only be an issue if a system is getting hammered with userspace loading modules. There are typically two ways to load modules on systems: one is the kernel module auto-loading (request_module() calls in-kernel) and the other is things like udev. The auto-loading is in-kernel, but that pings back to userspace to just call modprobe. We already have a way to restrict the amount of concurrent kernel auto-loads in a given time, however that still allows multiple requests for the same module to go through and force two threads in userspace racing to call modprobe for the same exact module. Even though libkmod, which both modprobe and udev use, does check if a module is already loaded prior to calling finit_module(), races are still possible and this is clearly evident today when you have multiple CPUs. To avoid memory pressure for such stupid cases put a stop gap for them. The earliest we can detect duplicates from the modules side of things is once we have blessed the module name, sadly after the first vmalloc allocation. We can check for the module being present before a secondary vmalloc() allocation. There is a linear relationship between wasted virtual memory bytes and the number of CPU counts. The reason is that udev ends up racing to call tons of the same modules for each of the CPUs. 
We can see the different linear relationships between wasted virtual memory and CPU count during boot in the following graph:
[Graph: wasted virtual memory (y-axis "waste", 2GB to 14GB) versus CPU count (x-axis "CPUs count", 0 to 300); two series, "Before" (*) and "After" (#).]
On the y-axis we can see gigabytes of wasted virtual memory during boot due to duplicate module requests which just end up failing. Trying to infer the slope, this ends up being about ~463 MiB per CPU lost prior to this patch. After this patch we only lose about ~230 MiB per CPU, for a total savings of about ~233 MiB per CPU. This is all just on bootup! On an 8vcpu 8 GiB RAM system using kdevops and testing against selftests kmod.sh -t 0008 I see a saving in the highest side of memory consumption of up to ~84 MiB with the Linux kernel selftests kmod test 0008. With the new stress-ng module test I see a 145 MiB difference in max memory consumption with 100 ops. The stress-ng module ops tests can be pretty pathological -- it is not realistic, however it was used to finally successfully reproduce issues which are only reported to happen on systems with over 400 CPUs [0] by just using 100 ops on an 8vcpu 8 GiB RAM system. Running out of virtual memory space is no surprise given the above graph, since at least on x86_64 we're capped at 128 MiB, eventually we'd hit a series of errors and one can use the above graph to guesstimate when. This of course will vary depending on the features you have enabled. So for instance, enabling KASAN seems to make this much worse. 
The results with kmod and stress-ng can be observed and visualized below. The time it takes to run the test is also not affected. The kmod tests 0008: The gnuplot is set to a range from 400000 KiB (390 MiB) - 580000 KiB (566 MiB) given the tests peak around that range.
cat kmod.plot
set term dumb
set output fileout
set yrange [400000:580000]
plot filein with linespoints title "Memory usage (KiB)"
Before:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-before.txt ^C
root@kmod ~ # sort -n -r log-0008-before.txt | head -1
528732
So ~516.33 MiB
After:
root@kmod ~ # /data/linux-next/tools/testing/selftests/kmod/kmod.sh -t 0008
root@kmod ~ # free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > log-0008-after.txt ^C
root@kmod ~ # sort -n -r log-0008-after.txt | head -1
442516
So ~432.14 MiB
That's about ~84 MiB in savings in the worst case. The graphs:
root@kmod ~ # gnuplot -e "filein='log-0008-before.txt'; fileout='graph-0008-before.txt'" kmod.plot
root@kmod ~ # gnuplot -e "filein='log-0008-after.txt'; fileout='graph-0008-after.txt'" kmod.plot
root@kmod ~ # cat graph-0008-before.txt
[Plot: "Memory usage (KiB)", y-axis 400000 to 580000, x-axis 0 to 40; the samples lie roughly between 510000 and 530000 KiB.]
root@kmod ~ # cat graph-0008-after.txt
[Plot: "Memory usage (KiB)", same y-axis; plot body truncated.]
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5322	2025-05-22 05:59:42 +00:00
Luis Chamberlain	a924910f67	module: move early sanity checks into a helper ANBZ: #21135 commit `85e6f61c13` upstream. Move early sanity checkers for the module into a helper. This let's us make it clear when we are working with the local copy of the module prior to allocation. This produces no functional changes, it just makes subsequent changes easier to read. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5322	2025-05-22 05:59:42 +00:00
Luis Chamberlain	59fc95f4c8	module: extract patient module check into helper ANBZ: #21135 commit `f71afa6a42` upstream. The patient module check inside add_unformed_module() is large enough as we need it. It is a bit hard to read too, so just move it to a helper and do the inverse checks first to help shift the code and make it easier to read. The new helper then is module_patient_check_exists(). To make this work we need to move finished_loading() up; we do that without making any functional changes to that routine. Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5322	2025-05-22 05:59:42 +00:00
Petr Pavlu	7534596f1f	module: Don't wait for GOING modules ANBZ: #21135 commit `0254127ab9` upstream. During a system boot, it can happen that the kernel receives a burst of requests to insert the same module but loading it eventually fails during its init call. For instance, udev can make a request to insert a frequency module for each individual CPU when another frequency module is already loaded which causes the init function of the new module to return an error. Since commit `6e6de3dee5` ("kernel/module.c: Only return -EEXIST for modules that have finished loading"), the kernel waits for modules in MODULE_STATE_GOING state to finish unloading before making another attempt to load the same module. This creates unnecessary work in the described scenario and delays the boot. In the worst case, it can prevent udev from loading drivers for other devices and might cause timeouts of services waiting on them and subsequently a failed boot. This patch attempts a different solution for the problem `6e6de3dee5` was trying to solve. Rather than waiting for the unloading to complete, it returns a different error code (-EBUSY) for modules in the GOING state. This should avoid the error situation that was described in `6e6de3dee5` (user space attempting to load a dependent module because the -EEXIST error code would suggest to user space that the first module had been loaded successfully), while avoiding the delay situation too. This has been tested on linux-next since December 2022 and passes all kmod selftests except test 0009 with module compression enabled but it has been confirmed that this issue has existed and has gone unnoticed since prior to this commit and can also be reproduced without module compression with a simple usleep(5000000) on tools/modprobe.c [0]. 
These failures are caused by hitting the kernel mod_concurrent_max and can happen either due to a self-inflicted kernel module auto-load DoS somehow or on a system with large CPU count and each CPU count incorrectly triggering many module auto-loads. Both of those issues need to be fixed in-kernel. [0] https://lore.kernel.org/all/Y9A4fiobL6IHp%2F%2FP@bombadil.infradead.org/ Fixes: `6e6de3dee5` ("kernel/module.c: Only return -EEXIST for modules that have finished loading") Co-developed-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Cc: stable@vger.kernel.org Reviewed-by: Petr Mladek <pmladek@suse.com> [mcgrof: enhance commit log with testing and kmod test result interpretation ] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5322	2025-05-22 05:59:42 +00:00
Cruz Zhao	e93dbf3bd4	anolis: sched: introduce simplified proxy execution for jbd2_lock ANBZ: #20446 When an underclass task holds the jbd2_lock, it may actively reschedule and then not be scheduled on a cpu for some time because of its low priority, so that other tasks wait for the lock for a long time. In order to allow underclass tasks to release the jbd2 lock as soon as possible, we use simplified proxy execution for jbd2_lock: when an underclass task holds the jbd2 lock, we move it to the root task group temporarily, and move it back to the original task group when it releases the jbd2 lock. We also introduce switches to control whether we should enable proxy exec for a given jbd2 lock and whether we should enable proxy exec for highclass tasks (for normal tasks and underclass tasks by default): - /proc/fs/jbd2/$dev/proxy_exec - /proc/fs/jbd2/$dev/proxy_exec_for_highclass Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5086	2025-04-17 15:13:48 +00:00
Chen Zhongjin	2ef4bd297d	profiling: fix shift too large makes kernel panic ANBZ: #20389 commit 4046f3ef3bb678c05bcce12da717770a9ddfbf3c stable. commit `0fe6ee8f12` upstream. `2d186afd04` ("profiling: fix shift-out-of-bounds bugs") limits shift value by [0, BITS_PER_LONG -1], which means [0, 63]. However, syzbot found that the max shift value should be the bit number of (_etext - _stext). If shift is outside of this, the "buffer_bytes" will be zero and will cause kzalloc(0). Then the kernel panics due to dereferencing the returned pointer 16. This can be easily reproduced by passing a large number like 60 to enable profiling and then run readprofile. LOGS: BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 6148067 P4D 6148067 PUD 6142067 PMD 0 PREEMPT SMP CPU: 4 PID: 184 Comm: readprofile Not tainted 5.18.0+ #162 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 RIP: 0010:read_profile+0x104/0x220 RSP: 0018:ffffc900006fbe80 EFLAGS: 00000202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff888006150000 RSI: 0000000000000001 RDI: ffffffff82aba4a0 RBP: 000000000188bb60 R08: 0000000000000010 R09: ffff888006151000 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82aba4a0 R13: 0000000000000000 R14: ffffc900006fbf08 R15: 0000000000020c30 FS: 000000000188a8c0(0000) GS:ffff88803ed00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000010 CR3: 0000000006144000 CR4: 00000000000006e0 Call Trace: <TASK> proc_reg_read+0x56/0x70 vfs_read+0x9a/0x1b0 ksys_read+0xa1/0xe0 ? 
fpregs_assert_state_consistent+0x1e/0x40 do_syscall_64+0x3a/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x4d4b4e RSP: 002b:00007ffebb668d58 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000000000188a8a0 RCX: 00000000004d4b4e RDX: 0000000000000400 RSI: 000000000188bb60 RDI: 0000000000000003 RBP: 0000000000000003 R08: 000000000000006e R09: 0000000000000000 R10: 0000000000000041 R11: 0000000000000246 R12: 000000000188bb60 R13: 0000000000000400 R14: 0000000000000000 R15: 000000000188bb60 </TASK> Modules linked in: CR2: 0000000000000010 Killed ---[ end trace 0000000000000000 ]--- Check prof_len in profile_init() to prevent it be zero. Link: https://lkml.kernel.org/r/20220531012854.229439-1-chenzhongjin@huawei.com Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Signed-off-by: Chen Zhongjin <chenzhongjin@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5056	2025-04-14 13:11:57 +08:00
Jiri Olsa	12983ba93e	bpf: Fix prog_array_map_poke_run map poke update ANBZ: #9462 commit `4b7de80160` upstream. Lee pointed out an issue found by syzkaller [0] hitting BUG in prog array map poke update in prog_array_map_poke_run function due to error value returned from bpf_arch_text_poke function. There's a race window where bpf_arch_text_poke can fail due to missing bpf program kallsym symbols, which is accounted for with check for -EINVAL in that BUG_ON call. The problem is that in such case we won't update the tail call jump and cause imbalance for the next tail call update check which will fail with -EBUSY in bpf_arch_text_poke. I'm hitting following race during the program load:
  CPU 0                             CPU 1
  bpf_prog_load
    bpf_check
      do_misc_fixups
        prog_array_map_poke_track
                                    map_update_elem
                                      bpf_fd_array_map_update_elem
                                        prog_array_map_poke_run
                                          bpf_arch_text_poke returns -EINVAL
  bpf_prog_kallsyms_add
After bpf_arch_text_poke (CPU 1) fails to update the tail call jump, the next poke update fails on expected jump instruction check in bpf_arch_text_poke with -EBUSY and triggers the BUG_ON in prog_array_map_poke_run. Similar race exists on the program unload. Fixing this by moving the update to bpf_arch_poke_desc_update function which makes sure we call __bpf_arch_text_poke that skips the bpf address check. Each architecture has slightly different approach wrt looking up bpf address in bpf_arch_text_poke, so instead of splitting the function or adding new 'checkip' argument in previous version, it seems best to move the whole map_poke_run update as arch specific code. 
[0] https://syzkaller.appspot.com/bug?extid=97a4fe20470e9bc30810 Fixes: `ebf7d1f508` ("bpf, x64: rework pro/epilogue and tailcall handling in JIT") Reported-by: syzbot+97a4fe20470e9bc30810@syzkaller.appspotmail.com Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yonghong.song@linux.dev> Cc: Lee Jones <lee@kernel.org> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20231206083041.1306660-2-jolsa@kernel.org [dtcccc: Fix __bpf_arch_text_poke() signature.] Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Yuanhe Shu <xiangzao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5041	2025-04-11 03:26:23 +00:00
Steven Rostedt (Google)	ba47f07fbc	tracing: Make sure trace_printk() can output as soon as it can be used ANBZ: #19809 commit `3bb06eb6e9` upstream. Currently trace_printk() can be used as soon as early_trace_init() is called from start_kernel(). But if a crash happens, and "ftrace_dump_on_oops" is set on the kernel command line, all you get will be: [ 0.456075] <idle>-0 0dN.2. 347519us : Unknown type 6 [ 0.456075] <idle>-0 0dN.2. 353141us : Unknown type 6 [ 0.456075] <idle>-0 0dN.2. 358684us : Unknown type 6 This is because the trace_printk() event (type 6) hasn't been registered yet. That gets done via an early_initcall(), which may be early, but not early enough. Instead of registering the trace_printk() event (and other ftrace events, which are not trace events) via an early_initcall(), have them registered at the same time that trace_printk() can be used. This way, if there is a crash before early_initcall(), then the trace_printk()s will actually be useful. Link: https://lkml.kernel.org/r/20230104161412.019f6c55@gandalf.local.home Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: `e725c731e3` ("tracing: Split tracing initialization into two for early initialization") Reported-by: "Joel Fernandes (Google)" <joel@joelfernandes.org> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Fixes: CVE-2023-53007 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5002	2025-03-31 15:12:44 +08:00
Thomas Gleixner	e38d532944	sched/core: Prevent rescheduling when interrupts are disabled ANBZ: #19858 commit `82c387ef75` upstream. David reported a warning observed while loop testing kexec jump: Interrupts enabled after irqrouter_resume+0x0/0x50 WARNING: CPU: 0 PID: 560 at drivers/base/syscore.c:103 syscore_resume+0x18a/0x220 kernel_kexec+0xf6/0x180 __do_sys_reboot+0x206/0x250 do_syscall_64+0x95/0x180 The corresponding interrupt flag trace: hardirqs last enabled at (15573): [<ffffffffa8281b8e>] __up_console_sem+0x7e/0x90 hardirqs last disabled at (15580): [<ffffffffa8281b73>] __up_console_sem+0x63/0x90 That means __up_console_sem() was invoked with interrupts enabled. Further instrumentation revealed that in the interrupt disabled section of kexec jump one of the syscore_suspend() callbacks woke up a task, which set the NEED_RESCHED flag. A later callback in the resume path invoked cond_resched() which in turn led to the invocation of the scheduler: __cond_resched+0x21/0x60 down_timeout+0x18/0x60 acpi_os_wait_semaphore+0x4c/0x80 acpi_ut_acquire_mutex+0x3d/0x100 acpi_ns_get_node+0x27/0x60 acpi_ns_evaluate+0x1cb/0x2d0 acpi_rs_set_srs_method_data+0x156/0x190 acpi_pci_link_set+0x11c/0x290 irqrouter_resume+0x54/0x60 syscore_resume+0x6a/0x200 kernel_kexec+0x145/0x1c0 __do_sys_reboot+0xeb/0x240 do_syscall_64+0x95/0x180 This is a long standing problem, which probably got more visible with the recent printk changes. Something does a task wakeup and the scheduler sets the NEED_RESCHED flag. cond_resched() sees it set and invokes schedule() from a completely bogus context. The scheduler enables interrupts after context switching, which causes the above warning at the end. Quite some of the code paths in syscore_suspend()/resume() can result in triggering a wakeup with the exactly same consequences. They might not have done so yet, but as they share a lot of code with normal operations it's just a question of time. 
The problem only affects the PREEMPT_NONE and PREEMPT_VOLUNTARY scheduling models. Full preemption is not affected as cond_resched() is disabled and the preemption check preemptible() takes the interrupt disabled flag into account. Cure the problem by adding a corresponding check into cond_resched(). Reported-by: David Woodhouse <dwmw@amazon.co.uk> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Closes: https://lore.kernel.org/all/7717fe2ac0ce5f0a2c43fdab8b11f4483d54a2a4.camel@infradead.org Fixes: CVE-2024-58090 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4962	2025-03-31 07:08:07 +00:00
Chen Ridong	bff92e63c4	padata: avoid UAF for reorder_work ANBZ: #19145 commit `dd7d37ccf6` upstream. Although the previous patch can avoid ps and ps UAF for _do_serial, it can not avoid potential UAF issue for reorder_work. This issue can happen just as below: crypto_request crypto_request crypto_del_alg padata_do_serial ... padata_reorder // processes all remaining // requests then breaks while (1) { if (!padata) break; ... } padata_do_serial // new request added list_add // sees the new request queue_work(reorder_work) padata_reorder queue_work_on(squeue->work) ... <kworker context> padata_serial_worker // completes new request, // no more outstanding // requests crypto_del_alg // free pd <kworker context> invoke_padata_reorder // UAF of pd To avoid UAF for 'reorder_work', get 'pd' ref before put 'reorder_work' into the 'serial_wq' and put 'pd' ref until the 'serial_wq' finish. Fixes: CVE-2025-21726 Fixes: `bbefa1dd6a` ("crypto: pcrypt - Avoid deadlock by using per-instance padata queues") Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4942	2025-03-28 02:52:18 +00:00
Chen Ridong	91d7fcb009	padata: add pd get/put refcnt helper ANBZ: #19145 commit `ae154202cc` upstream. Add helpers for pd to get/put refcnt to make code concise. Fixes: CVE-2025-21726 Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4942	2025-03-28 02:52:18 +00:00
Thomas Gleixner	df33362ace	static_call: Replace pointless WARN_ON() in static_call_module_notify() ANBZ: #12626 commit `fe513c2ef0` upstream. static_call_module_notify() triggers a WARN_ON(), when memory allocation fails in __static_call_add_module(). That's not really justified, because the failure case must be correctly handled by the well known call chain and the error code is passed through to the initiating userspace application. A memory allocation fail is not a fatal problem, but the WARN_ON() takes the machine out when panic_on_warn is set. Replace it with a pr_warn(). Fixes: `9183c3f9ed` ("static_call: Add inline static call infrastructure") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/8734mf7pmb.ffs@tglx Fixes: CVE-2024-49954 Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4892	2025-03-28 01:35:46 +00:00
VanGiang Nguyen	da92b7ffa5	padata: use integer wrap around to prevent deadlock on seq_nr overflow ANBZ: #19794 commit `9a22b28123` upstream. When submitting more than 2^32 padata objects to padata_do_serial, the current sorting implementation incorrectly sorts padata objects with overflowed seq_nr, causing them to be placed before existing objects in the reorder list. This leads to a deadlock in the serialization process as padata_find_next cannot match padata->seq_nr and pd->processed because the padata instance with overflowed seq_nr will be selected next. To fix this, we use an unsigned integer wrap around to correctly sort padata objects in scenarios with integer overflow. Fixes: `bfde23ce20` ("padata: unbind parallel jobs from specific CPUs") Cc: <stable@vger.kernel.org> Co-developed-by: Christian Gafert <christian.gafert@rohde-schwarz.com> Signed-off-by: Christian Gafert <christian.gafert@rohde-schwarz.com> Co-developed-by: Max Ferger <max.ferger@rohde-schwarz.com> Signed-off-by: Max Ferger <max.ferger@rohde-schwarz.com> Signed-off-by: Van Giang Nguyen <vangiang.nguyen@rohde-schwarz.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Fixes: CVE-2024-47739 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4928	2025-03-27 14:51:10 +00:00
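The wrap-around comparison described in the padata seq_nr commit above can be illustrated with a small userspace sketch. The helper name `seq_before` is hypothetical and is not taken from the kernel patch; it only shows the general technique of ordering 32-bit sequence numbers through a signed difference, assuming the two values are never more than 2^31 apart.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: returns true when sequence
 * number a comes before b on a circular 32-bit counter.
 *
 * The unsigned subtraction wraps modulo 2^32. Reinterpreting the result
 * as a signed value (two's complement on all mainstream compilers) tells
 * us on which side of b the value a lies, even after the counter has
 * overflowed.
 */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(uint32_t)(a - b) < 0;
}
```

With a plain `a < b` comparison, an object numbered 0 (just after overflow) would sort ahead of one numbered 0xFFFFFFFF; with the signed difference it sorts after it, which is the ordering the serialization step needs.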
Andrii Nakryiko	358068fc35	bpf: avoid holding freeze_mutex during mmap operation ANBZ: #19515 commit `bc27c52eea` upstream. We use map->freeze_mutex to prevent races between map_freeze() and memory mapping BPF map contents with writable permissions. The way we naively do this means we'll hold freeze_mutex for entire duration of all the mm and VMA manipulations, which is completely unnecessary. This can potentially also lead to deadlocks, as reported by syzbot in [0]. So, instead, hold freeze_mutex only during writeability checks, bump (proactively) "write active" count for the map, unlock the mutex and proceed with mmap logic. And only if something went wrong during mmap logic, then undo that "write active" counter increment. [0] https://lore.kernel.org/bpf/678dcbc9.050a0220.303755.0066.GAE@google.com/ Fixes: `fc9702273e` ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") Reported-by: syzbot+4dc041c686b7c816a71e@syzkaller.appspotmail.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250129012246.1515826-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Fixes: CVE-2025-21853 Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4922	2025-03-27 13:21:47 +00:00
Andrii Nakryiko	4309beaab7	bpf: unify VM_WRITE vs VM_MAYWRITE use in BPF map mmaping logic ANBZ: #19515 commit `98671a0fd1` upstream. For all BPF maps we ensure that VM_MAYWRITE is cleared when memory-mapping BPF map contents as initially read-only VMA. This is because in some cases BPF verifier relies on the underlying data to not be modified afterwards by user space, so once something is mapped read-only, it shouldn't be re-mmap'ed as read-write. As such, it's not necessary to check VM_MAYWRITE in bpf_map_mmap() and map->ops->map_mmap() callbacks: VM_WRITE should be consistently set for read-write mappings, and if VM_WRITE is not set, there is no way for user space to upgrade read-only mapping to read-write one. This patch cleans up this VM_WRITE vs VM_MAYWRITE handling within bpf_map_mmap(), which is an entry point for any BPF map mmap()-ing logic. We also drop unnecessary sanitization of VM_MAYWRITE in BPF ringbuf's map_mmap() callback implementation, as it is already performed by common code in bpf_map_mmap(). Note, though, that in bpf_map_mmap_{open,close}() callbacks we can't drop VM_MAYWRITE use, because it's possible (and is outside of subsystem's control) to have initially read-write memory mapping, which is subsequently dropped to read-only by user space through mprotect(). In such case, from BPF verifier POV it's read-write data throughout the lifetime of BPF map, and is counted as "active writer". But its VMAs will start out as VM_WRITE|VM_MAYWRITE, then mprotect() can change it to just VM_MAYWRITE (and no VM_WRITE), so when it's finally munmap()'ed and bpf_map_mmap_close() is called, vm_flags will be just VM_MAYWRITE, but we still need to decrement active writer count with bpf_map_write_active_dec() as it's still considered to be a read-write mapping by the rest of BPF subsystem. 
Similar reasoning applies to bpf_map_mmap_open(), which is called whenever mmap(), munmap(), and/or mprotect() forces mm subsystem to split original VMA into multiple discontiguous VMAs. Memory-mapping handling is a bit tricky, yes. Cc: Jann Horn <jannh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250129012246.1515826-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org> Fixes: CVE-2025-21853 [dtcccc: Fix conflicts due to no vm_flags_clear().] Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4922	2025-03-27 13:21:47 +00:00
Chen Ridong	82d280f303	padata: fix UAF in padata_reorder ANBZ: #19087 commit `e01780ea46` upstream. A bug was found when running the ltp test: BUG: KASAN: slab-use-after-free in padata_find_next+0x29/0x1a0 Read of size 4 at addr ffff88bbfe003524 by task kworker/u113:2/3039206 CPU: 0 PID: 3039206 Comm: kworker/u113:2 Kdump: loaded Not tainted 6.6.0+ Workqueue: pdecrypt_parallel padata_parallel_worker Call Trace: <TASK> dump_stack_lvl+0x32/0x50 print_address_description.constprop.0+0x6b/0x3d0 print_report+0xdd/0x2c0 kasan_report+0xa5/0xd0 padata_find_next+0x29/0x1a0 padata_reorder+0x131/0x220 padata_parallel_worker+0x3d/0xc0 process_one_work+0x2ec/0x5a0 If 'mdelay(10)' is added before calling 'padata_find_next' in the 'padata_reorder' function, this issue could be reproduced easily with ltp test (pcrypt_aead01). This can be explained as below: pcrypt_aead_encrypt ... padata_do_parallel refcount_inc(&pd->refcnt); // add refcnt ... padata_do_serial padata_reorder // pd while (1) { padata_find_next(pd, true); // using pd queue_work_on ... padata_serial_worker crypto_del_alg padata_put_pd_cnt // sub refcnt padata_free_shell padata_put_pd(ps->pd); // pd is freed // loop again, but pd is freed // call padata_find_next, UAF } In the padata_reorder function, when it loops in 'while', if the alg is deleted, the refcnt may be decreased to 0 before entering 'padata_find_next', which leads to UAF. As mentioned in [1], do_serial is supposed to be called with BHs disabled and always happen under RCU protection. To address this issue, add synchronize_rcu() in 'padata_free_shell' to wait for all _do_serial calls to finish. 
[1] https://lore.kernel.org/all/20221028160401.cccypv4euxikusiq@parnassus.localdomain/ [2] https://lore.kernel.org/linux-kernel/jfjz5d7zwbytztackem7ibzalm5lnxldi2eofeiczqmqs2m7o6@fq426cwnjtkm/ Fixes: `b128a30409` ("padata: allocate workqueue internally") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Qu Zicheng <quzicheng@huawei.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Fixes: CVE-2025-21727 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4885	2025-03-27 12:06:42 +00:00
Marco Elver	d70951b88c	kcsan: Turn report_filterlist_lock into a raw_spinlock ANBZ: #13070 commit `59458fa4dd` upstream. Ran Xiaokai reports that with a KCSAN-enabled PREEMPT_RT kernel, we can see splats like: | BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1 | preempt_count: 10002, expected: 0 | RCU nest depth: 0, expected: 0 | no locks held by swapper/1/0. | irq event stamp: 156674 | hardirqs last enabled at (156673): [<ffffffff81130bd9>] do_idle+0x1f9/0x240 | hardirqs last disabled at (156674): [<ffffffff82254f84>] sysvec_apic_timer_interrupt+0x14/0xc0 | softirqs last enabled at (0): [<ffffffff81099f47>] copy_process+0xfc7/0x4b60 | softirqs last disabled at (0): [<0000000000000000>] 0x0 | Preemption disabled at: | [<ffffffff814a3e2a>] paint_ptr+0x2a/0x90 | CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.11.0+ #3 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 | Call Trace: | <IRQ> | dump_stack_lvl+0x7e/0xc0 | dump_stack+0x1d/0x30 | __might_resched+0x1a2/0x270 | rt_spin_lock+0x68/0x170 | kcsan_skip_report_debugfs+0x43/0xe0 | print_report+0xb5/0x590 | kcsan_report_known_origin+0x1b1/0x1d0 | kcsan_setup_watchpoint+0x348/0x650 | __tsan_unaligned_write1+0x16d/0x1d0 | hrtimer_interrupt+0x3d6/0x430 | __sysvec_apic_timer_interrupt+0xe8/0x3a0 | sysvec_apic_timer_interrupt+0x97/0xc0 | </IRQ> On a detected data race, KCSAN's reporting logic checks if it should filter the report. That list is protected by the report_filterlist_lock non-raw spinlock which may sleep on RT kernels. Since KCSAN may report data races in any context, convert it to a raw_spinlock. This requires being careful about when to allocate memory for the filter list itself which can be done via KCSAN's debugfs interface. 
Concurrent modification of the filter list via debugfs should be rare: the chosen strategy is to optimistically pre-allocate memory before the critical section and discard if unused. Link: https://lore.kernel.org/all/20240925143154.2322926-1-ranxiaokai627@163.com/ Reported-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Tested-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Signed-off-by: Marco Elver <elver@google.com> Fixes: CVE-2024-56610 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4898	2025-03-27 04:17:14 +00:00
Thomas Gleixner	6b7c8d0c2b	static_call: Handle module init failure correctly in static_call_del_module() ANBZ: #12640 commit `4b30051c48` upstream. Module insertion invokes static_call_add_module() to initialize the static calls in a module. static_call_add_module() invokes __static_call_init(), which allocates a struct static_call_mod to either encapsulate the built-in static call sites of the associated key into it so further modules can be added or to append the module to the module chain. If that allocation fails the function returns with an error code and the module core invokes static_call_del_module() to clean up eventually added static_call_mod entries. This works correctly, when all keys used by the module were converted over to a module chain before the failure. If not then static_call_del_module() causes a #GP as it blindly assumes that key::mods points to a valid struct static_call_mod. The problem is that key::mods is not an individual struct member of struct static_call_key, it's part of a union to save space: union { /* bit 0: 0 = mods, 1 = sites */ unsigned long type; struct static_call_mod *mods; struct static_call_site *sites; }; key::sites is a pointer to the list of built-in usage sites of the static call. The type of the pointer is differentiated by bit 0. A mods pointer has the bit clear, the sites pointer has the bit set. As static_call_del_module() blindly assumes that the pointer is a valid static_call_mod type, it fails to check for this failure case and dereferences the pointer to the list of built-in call sites, which is obviously bogus. Cure it by checking whether the key has a sites or a mods pointer. If it's a sites pointer then the key is not to be touched. As the sites are walked in the same order as in __static_call_init() the site walk can be terminated because all subsequent sites have not been touched by the init code due to the error exit. 
If it was converted before the allocation fail, then the inner loop which searches for a module match will find nothing. A fail in the second allocation in __static_call_init() is harmless and does not require special treatment. The first allocation succeeded and converted the key to a module chain. That first entry has mod::mod == NULL and mod::next == NULL, so the inner loop of static_call_del_module() will neither find a module match nor a module chain. The next site in the walk was either already converted, but can't match the module, or it will exit the outer loop because it has a static_call_site pointer and not a static_call_mod pointer. Fixes: `9183c3f9ed` ("static_call: Add inline static call infrastructure") Closes: https://lore.kernel.org/all/20230915082126.4187913-1-ruanjinjie@huawei.com Reported-by: Jinjie Ruan <ruanjinjie@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Jinjie Ruan <ruanjinjie@huawei.com> Link: https://lore.kernel.org/r/87zfon6b0s.ffs@tglx [Fixes Conflicts] Fixes: CVE-2024-50002 Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4893	2025-03-27 04:14:09 +00:00
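The bit-0 pointer tagging that the static_call_del_module() commit above relies on can be sketched in userspace as follows. All type and function names here (`struct key`, `key_has_sites`, and so on) are hypothetical stand-ins and not the kernel's definitions; the sketch assumes the pointed-to objects are at least 2-byte aligned, so bit 0 of a real pointer is always clear.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two pointer targets. */
struct site_list { int nr_sites; };
struct mod_chain { int nr_mods; };

#define KEY_TAG_SITES ((uintptr_t)1)	/* bit 0 set: value is a sites pointer */

/* One word holds either pointer; bit 0 says which one it is. */
struct key { uintptr_t type; };

static void key_set_sites(struct key *k, struct site_list *s)
{
	k->type = (uintptr_t)s | KEY_TAG_SITES;
}

static void key_set_mods(struct key *k, struct mod_chain *m)
{
	k->type = (uintptr_t)m;	/* bit 0 stays clear */
}

static bool key_has_sites(const struct key *k)
{
	return (k->type & KEY_TAG_SITES) != 0;
}

/* Returns NULL instead of a bogus pointer when the key holds sites. */
static struct mod_chain *key_mods(const struct key *k)
{
	return key_has_sites(k) ? NULL : (struct mod_chain *)k->type;
}

static struct site_list *key_sites(const struct key *k)
{
	if (!key_has_sites(k))
		return NULL;
	return (struct site_list *)(k->type & ~KEY_TAG_SITES);
}
```

A cleanup path written against this sketch would call `key_has_sites()` first and leave the key untouched when it returns true, which mirrors the check the commit message describes.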
Waiman Long	785f17672e	padata: Fix possible divide-by-0 panic in padata_mt_helper() ANBZ: #12508 commit ab8b397d5997d8c37610252528edc54bebf9f6d3 stable. commit `6d45e1c948` upstream. We are hit with a not easily reproducible divide-by-0 panic in padata.c at bootup time. [ 10.017908] Oops: divide error: 0000 1 PREEMPT SMP NOPTI [ 10.017908] CPU: 26 PID: 2627 Comm: kworker/u1666:1 Not tainted 6.10.0-15.el10.x86_64 #1 [ 10.017908] Hardware name: Lenovo ThinkSystem SR950 [7X12CTO1WW]/[7X12CTO1WW], BIOS [PSE140J-2.30] 07/20/2021 [ 10.017908] Workqueue: events_unbound padata_mt_helper [ 10.017908] RIP: 0010:padata_mt_helper+0x39/0xb0 : [ 10.017963] Call Trace: [ 10.017968] <TASK> [ 10.018004] ? padata_mt_helper+0x39/0xb0 [ 10.018084] process_one_work+0x174/0x330 [ 10.018093] worker_thread+0x266/0x3a0 [ 10.018111] kthread+0xcf/0x100 [ 10.018124] ret_from_fork+0x31/0x50 [ 10.018138] ret_from_fork_asm+0x1a/0x30 [ 10.018147] </TASK> Looking at the padata_mt_helper() function, the only way a divide-by-0 panic can happen is when ps->chunk_size is 0. The way that chunk_size is initialized in padata_do_multithreaded(), chunk_size can be 0 when the min_chunk in the passed-in padata_mt_job structure is 0. Fix this divide-by-0 panic by making sure that chunk_size will be at least 1 no matter what the input parameters are. 
Link: https://lkml.kernel.org/r/20240806174647.1050398-1-longman@redhat.com Fixes: `004ed42638` ("padata: add basic support for multithreaded jobs") Signed-off-by: Waiman Long <longman@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Waiman Long <longman@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: CVE-2024-43889 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4890	2025-03-27 04:13:12 +00:00
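The divide-by-0 described in the padata_mt_helper() commit above comes down to a chunk size that can reach zero before it is used as a divisor. The sketch below is a simplified userspace illustration, not the kernel's padata code: `pick_chunk_size` and `chunks_needed` are hypothetical helpers, and the sizing formula is reduced to the parts needed to show the clamp.

```c
#include <stddef.h>

/*
 * Hypothetical simplification: derive a per-worker chunk size from the
 * job size, honour the caller's minimum, then clamp to at least 1 so the
 * result is always safe to divide by.
 */
static size_t pick_chunk_size(size_t job_size, size_t nworks, size_t min_chunk)
{
	size_t chunk = nworks ? job_size / nworks : job_size;

	if (chunk < min_chunk)
		chunk = min_chunk;
	if (chunk < 1)		/* min_chunk == 0 and a small job land here */
		chunk = 1;
	return chunk;
}

/* Rounded-up division; safe only because chunk_size is never 0. */
static size_t chunks_needed(size_t job_size, size_t chunk_size)
{
	return (job_size + chunk_size - 1) / chunk_size;
}
```

Clamping at the point where the value is computed keeps every later division safe regardless of what the caller passed as its minimum.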
Alexey Dobriyan	dc8fdc42e4	module: fix [e_shstrndx].sh_size=0 OOB access ANBZ: #19173 commit `391e982bfa` upstream. It is trivial to craft a module to trigger OOB access in this line: if (info->secstrings[strhdr->sh_size - 1] != '\0') { BUG: unable to handle page fault for address: ffffc90000aa0fff PGD 100000067 P4D 100000067 PUD 100066067 PMD 10436f067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 7 PID: 1215 Comm: insmod Not tainted 5.18.0-rc5-00007-g9bf578647087-dirty #10 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014 RIP: 0010:load_module+0x19b/0x2391 Fixes: CVE-2022-49444 Fixes: `ec2a29593c` ("module: harden ELF info handling") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> [rebased patch onto modules-next] Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4919	2025-03-27 03:58:14 +00:00
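The out-of-bounds access in the module commit above arises because `sh_size - 1` wraps to a huge index when the size is 0. A minimal userspace sketch of the guarded check follows; `strtab_is_terminated` is a hypothetical helper, not the function used in the kernel's module loader.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: report whether a string table of the given size
 * ends in a NUL byte. The size == 0 test must come first, because
 * size - 1 on an unsigned type would wrap around to SIZE_MAX and index
 * far outside the buffer.
 */
static bool strtab_is_terminated(const char *tab, size_t size)
{
	if (size == 0)
		return false;
	return tab[size - 1] == '\0';
}
```

Rejecting the empty table up front means the terminator check only ever indexes inside the buffer.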
Miaohe Lin	d4859e78f8	kernel/resource: fix kfree() of bootmem memory again ANBZ: #19115 commit `0cbcc92917` upstream. Since commit `ebff7d8f27` ("mem hotunplug: fix kfree() of bootmem memory"), we could get a resource allocated during boot via alloc_resource(). And it's required to release the resource using free_resource(). However, many people use kfree directly which will result in kernel BUG. In order to fix this without fixing every call site, just leak a couple of bytes in such corner case. Fixes: CVE-2022-49190 Link: https://lkml.kernel.org/r/20220217083619.19305-1-linmiaohe@huawei.com Fixes: `ebff7d8f27` ("mem hotunplug: fix kfree() of bootmem memory") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4910	2025-03-27 03:55:22 +00:00
Wang Hui	9a8e730031	sched: remove redundant on_rq status change ANBZ: #19751 commit `f912d05161` upstream. activate_task/deactivate_task will change on_rq status, no need to do it again. Signed-off-by: Wang Hui <john.wanghui@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210721091109.1406043-1-john.wanghui@huawei.com [dtcccc: Also fix the same issue in __push_expellee().] Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4923	2025-03-26 06:31:01 +00:00
Tianchen Ding	d304765eee	anolis: sched/isolation: Only skip percpu kthread when updating wilds ANBZ: #19745 Normal kthreads should also be isolated. Make this behavior match with cgroup v2 cpuset root partition. Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Yi Tao <escape@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4904	2025-03-26 02:58:20 +00:00
Frederic Weisbecker	d1c0e5cb1e	kthread: unpark only parked kthread ANBZ: #19745 commit `214e01ad4e` upstream. Calling into kthread unparking unconditionally is mostly harmless when the kthread is already unparked. The wake up is then simply ignored because the target is not in TASK_PARKED state. However if the kthread is per CPU, the wake up is preceded by a call to kthread_bind() which expects the task to be inactive and in TASK_PARKED state, which obviously isn't the case if it is unparked. As a result, calling kthread_stop() on an unparked per-cpu kthread triggers such a warning: WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525 <TASK> kthread_stop+0x17a/0x630 kernel/kthread.c:707 destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810 wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257 netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693 default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769 ops_exit_list net/core/net_namespace.c:178 [inline] cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312 worker_thread+0x86d/0xd70 kernel/workqueue.c:3393 kthread+0x2f0/0x390 kernel/kthread.c:389 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> Fix this by skipping unnecessary unparking while stopping a kthread. 
Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org Fixes: `5c25b5ff89` ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com Suggested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Hillf Danton <hdanton@sina.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Yi Tao <escape@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4904	2025-03-26 02:58:20 +00:00
Peter Zijlstra	4fd79dc2b0	workqueue: Tag bound workers with KTHREAD_IS_PER_CPU ANBZ: #19745 commit `5c25b5ff89` upstream. Mark the per-cpu workqueue workers as KTHREAD_IS_PER_CPU. Workqueues have unfortunate semantics in that per-cpu workers are not default flushed and parked during hotplug, however a subset does manual flush on hotplug and hard relies on them for correctness. Therefore play silly games.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210121103506.693465814@infradead.org Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Yi Tao <escape@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4904	2025-03-26 02:58:20 +00:00
Lai Jiangshan	2bb289b01f	workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity ANBZ: #19745 commit `547a77d02f` upstream. The scheduler won't break affinity for us any more, and we should "emulate" the same behavior when the scheduler breaks affinity for us. The behavior is "changing the cpumask to cpu_possible_mask". And there might be some other CPUs online later while the worker is still running with the pending work items. The worker should be allowed to use the later online CPUs as before and process the work items ASAP. If we use cpu_active_mask here, we can't achieve this goal but using cpu_possible_mask can. Fixes: `06249738a4` ("workqueue: Manually break affinity on hotplug") Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20210111152638.2417-4-jiangshanlai@gmail.com Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Yi Tao <escape@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4904	2025-03-26 02:58:20 +00:00
Peter Zijlstra	fdc324371d	workqueue: Manually break affinity on hotplug ANBZ: #19745 commit `06249738a4` upstream. Don't rely on the scheduler to force break affinity for us -- it will stop doing that for per-cpu-kthreads. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201023102346.464718669@infradead.org Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Yi Tao <escape@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4904	2025-03-26 02:58:20 +00:00
Wander Lairson Costa	f06decbd93	sched/deadline: Fix warning in migrate_enable for boosted tasks ANBZ: #13073 commit `0664e2c311` upstream. When running the following command: while true; do stress-ng --cyclic 30 --timeout 30s --minimize --quiet done a warning is eventually triggered: WARNING: CPU: 43 PID: 2848 at kernel/sched/deadline.c:794 setup_new_dl_entity+0x13e/0x180 ... Call Trace: <TASK> ? show_trace_log_lvl+0x1c4/0x2df ? enqueue_dl_entity+0x631/0x6e0 ? setup_new_dl_entity+0x13e/0x180 ? __warn+0x7e/0xd0 ? report_bug+0x11a/0x1a0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x14/0x70 ? asm_exc_invalid_op+0x16/0x20 enqueue_dl_entity+0x631/0x6e0 enqueue_task_dl+0x7d/0x120 __do_set_cpus_allowed+0xe3/0x280 __set_cpus_allowed_ptr_locked+0x140/0x1d0 __set_cpus_allowed_ptr+0x54/0xa0 migrate_enable+0x7e/0x150 rt_spin_unlock+0x1c/0x90 group_send_sig_info+0xf7/0x1a0 ? kill_pid_info+0x1f/0x1d0 kill_pid_info+0x78/0x1d0 kill_proc_info+0x5b/0x110 __x64_sys_kill+0x93/0xc0 do_syscall_64+0x5c/0xf0 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7f0dab31f92b This warning occurs because set_cpus_allowed dequeues and enqueues tasks with the ENQUEUE_RESTORE flag set. If the task is boosted, the warning is triggered. A boosted task already had its parameters set by rt_mutex_setprio, and a new call to setup_new_dl_entity is unnecessary, hence the WARN_ON call. Check if we are requeueing a boosted task and avoid calling setup_new_dl_entity if that's the case. 
Fixes: `295d6d5e37` ("sched/deadline: Fix switching to -deadline") Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/r/20240724142253.27145-2-wander@redhat.com Fixes: CVE-2024-56583 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Cruz Zhao <cruzzhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4715	2025-03-14 03:34:57 +00:00

36536 Commits