ANBZ: #20922
commit d8189834d4 upstream.
butt3rflyh4ck reports a bug as below:
When a thread repeatedly calls F2FS_IOC_RESIZE_FS and the resize
fails, the f2fs kernel thread invokes the callback to update f2fs I/O
info; this calls f2fs_write_end_io() and may trigger a null-ptr-deref
in NODE_MAPPING().
general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
RIP: 0010:NODE_MAPPING fs/f2fs/f2fs.h:1972 [inline]
RIP: 0010:f2fs_write_end_io+0x727/0x1050 fs/f2fs/data.c:370
<TASK>
bio_endio+0x5af/0x6c0 block/bio.c:1608
req_bio_endio block/blk-mq.c:761 [inline]
blk_update_request+0x5cc/0x1690 block/blk-mq.c:906
blk_mq_end_request+0x59/0x4c0 block/blk-mq.c:1023
lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1101
__do_softirq+0x1d4/0x8ef kernel/softirq.c:571
run_ksoftirqd kernel/softirq.c:939 [inline]
run_ksoftirqd+0x31/0x60 kernel/softirq.c:931
smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
kthread+0x33e/0x440 kernel/kthread.c:379
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
The root cause is the race below, which can leave dirty metadata in
f2fs after the filesystem is remounted read-only:
Thread A                                Thread B
- f2fs_ioc_resize_fs
 - f2fs_readonly   --- return false
 - f2fs_resize_fs
                                        - f2fs_remount
                                         - write_checkpoint
                                         - set f2fs as ro
  - free_segment_range
   - update meta_inode's data
Then, if f2fs_put_super() fails to write a checkpoint due to the
read-only status, meta_inode's dirty data will be written back after
node_inode is put; finally, f2fs_write_end_io() will access a NULL
pointer through sbi->node_inode.
Thread A                                IRQ context
- f2fs_put_super
 - write_checkpoint fails
 - iput(node_inode)
 - node_inode = NULL
 - iput(meta_inode)
  - write_inode_now
   - f2fs_write_meta_page
                                        - f2fs_write_end_io
                                         - NODE_MAPPING(sbi)
                                         : access NULL pointer on node_inode
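For reference, NODE_MAPPING() is a one-line helper in fs/f2fs/f2fs.h
along these lines, so the IRQ-time writeback above faults as soon as
node_inode has been cleared:

	/* dereferences sbi->node_inode unconditionally */
	#define NODE_MAPPING(sbi) ((sbi)->node_inode->i_mapping)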
Fixes: b4b10061ef ("f2fs: refactor resize_fs to avoid meta updates in progress")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Closes: https://lore.kernel.org/r/1684480657-2375-1-git-send-email-yangtiezhu@loongson.cn
Tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: CVE-2023-2898
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5565
ANBZ: #21452
[ Upstream commit d42334578e ]
exfat_extract_uni_name() copies characters from a given file name
entry into the 'uniname' variable, which is actually defined on the
stack of the exfat_readdir() function. According to the definition of
the 'exfat_uni_name' type, the file name should be limited to 255
characters (plus space for a null terminator), but the
exfat_get_uniname_from_ext_entry() function can write more characters
because there is no check whether the filename entries exceed the
maximum filename length. This patch adds a check that stops copying
filename characters once the maximum filename length is reached.
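A minimal sketch of the kind of bound check described, using the
fs/exfat identifiers EXFAT_FILE_NAME_LEN (characters per name entry)
and MAX_NAME_LENGTH (255); surrounding declarations are omitted and
this should be read as an illustration, not the literal diff:

	while (i < es.num_entries) {
		struct exfat_dentry *ep = exfat_get_dentry_cached(&es, i++);

		if (exfat_get_entry_type(ep) != TYPE_EXTEND)
			break;

		len = exfat_extract_uni_name(ep, uniname);
		uni_len += len;
		/* stop before running past the 255-character buffer */
		if (len != EXFAT_FILE_NAME_LEN || uni_len >= MAX_NAME_LENGTH)
			break;
		uniname += EXFAT_FILE_NAME_LEN;
	}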
Cc: stable@vger.kernel.org
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reported-by: Maxim Suhanov <dfirblog@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Fixes: CVE-2023-4273
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5564
ANBZ: #23087
Currently, our kernel supports collapsing file-backed pages into
hugepages during file reads by prioritizing the file for collapse
in khugepaged. The allocation flags for collapsing are determined
based on khugepaged's defrag setting.
With this patch, the /sys/kernel/mm/transparent_hugepage/defrag
setting is honored for direct file collapse, ensuring it uses the
same allocation policy as the other THP paths.
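A hedged sketch of what honoring the defrag policy means for the
allocation mask (the surrounding collapse code is omitted; the flag
and bit names come from mm/huge_memory.c):

	/* pick collapse gfp flags from the global THP defrag policy,
	 * mirroring the fault path, instead of khugepaged's own knob */
	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
		     &transparent_hugepage_flags))
		gfp = GFP_TRANSHUGE;
	else
		gfp = GFP_TRANSHUGE_LIGHT;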
Signed-off-by: Weilin Tong <tongweilin@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5557
ANBZ: #22935
commit 3a605e32a7 upstream.
The variable 'ctrl' became useless since the code using it was dropped
from nvme_setup_cmd() in the commit 292ddf67bbd5 ("nvme: increment
request genctr on completion"). Fix it to get rid of this compilation
warning in the nvme-5.17 branch:
drivers/nvme/host/core.c: In function ‘nvme_setup_cmd’:
drivers/nvme/host/core.c:993:20: warning: unused variable ‘ctrl’ [-Wunused-variable]
struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
^~~~
Fixes: 292ddf67bbd5 ("nvme: increment request genctr on completion")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5544
ANBZ: #23086
commit 2053230af1 upstream.
The MSI-X Capability requires devices to support 64-bit Message Addresses,
but the MSI Capability can support either 32- or 64-bit addresses.
Previously, we set dev->no_64bit_msi for a few broken devices that
advertise 64-bit MSI support but don't correctly support it.
In addition, check the MSI "64-bit Address Capable" bit for all devices and
set dev->no_64bit_msi for devices that don't advertise 64-bit support.
This allows msi_verify_entries() to catch arch code defects that assign
64-bit addresses when they're not supported.
The warning is helpful to find defects like the one fixed by
https://lore.kernel.org/r/20201117165312.25847-1-vidyas@nvidia.com
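The check itself is small; a sketch of the logic described above, as
it would sit in pci_msi_init() (see the [bhelgaas] note below):

	pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &ctrl);
	/* MSI without the 64-bit Address Capable bit: never hand
	 * this device a 64-bit message address */
	if (!(ctrl & PCI_MSI_FLAGS_64BIT))
		dev->no_64bit_msi = 1;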
[bhelgaas: set no_64bit_msi in pci_msi_init(), commit log]
Link: https://lore.kernel.org/r/20201124105035.24573-1-vidyas@nvidia.com
Link: https://lore.kernel.org/r/20201203185110.1583077-4-helgaas@kernel.org
Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
ANBZ: #23086
commit cbc40d5c33 upstream.
Move pci_msi_setup_pci_dev(), which disables MSI and MSI-X interrupts, from
probe.c to msi.c so it's with all the other MSI code and more consistent
with other capability initialization. This means we must compile msi.c
always, even without CONFIG_PCI_MSI, so wrap the rest of msi.c in an #ifdef
and adjust the Makefile accordingly. No functional change intended.
Link: https://lore.kernel.org/r/20201203185110.1583077-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
ANBZ: #11854
To support mounting without loopback block devices.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5548
ANBZ: #22926
psp_mutex_lock_timeout() used time_before(jiffies, last_je), which
made it give up and yield the CPU too early. Fix it to use
time_after(jiffies, last_je + msecs_to_jiffies(100)) so that it keeps
polling for the full 100ms.
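A minimal sketch of the corrected polling loop (psp_mutex_trylock()
is a stand-in name for the actual acquire attempt):

	unsigned long last_je = jiffies;

	do {
		if (psp_mutex_trylock(mutex))	/* name assumed */
			return 1;
		cpu_relax();
	} while (!time_after(jiffies, last_je + msecs_to_jiffies(100)));
	return 0;

time_after() also copes with jiffies wraparound, which open-coded
comparisons do not.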
Signed-off-by: xiongmengbiao <xiongmengbiao@hygon.cn>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5532
ANBZ: #11854
commit 905eeb2b7c33adda23a966aeb811ab4cb9e62031 upstream.
Previously, file operations on a file-backed mount used the current
process' credentials to access the backing FD. Attempting to do so on
Android led to SELinux denials, as ACL rules on the backing file (e.g.
/system/apex/foo.apex) are restricted to a small set of processes.
Arguably, this error is redundant and leaks implementation details,
as access to files on a mount is already ACL'ed by path.
Instead, override to use the opener's cred when accessing the backing
file. This makes the behavior similar to a loop-backed mount, which
uses kworker cred when accessing the backing file and does not cause
SELinux denials.
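In sketch form, the pattern uses the stock override_creds()/
revert_creds() helpers around the backing-file I/O (the exact erofs
call sites differ):

	/* perform backing-file reads with the opener's credentials */
	const struct cred *old_cred = override_creds(file->f_cred);

	ret = vfs_iocb_iter_read(file, iocb, iter);
	revert_creds(old_cred);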
Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250612-b4-erofs-impersonate-v1-1-8ea7d6f65171@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
ANBZ: #11854
commit bbfe756dc3062c1e934f06e5ba39c239aa953b92 upstream.
If bio_add_folio() fails (because it is full),
erofs_fileio_scan_folio() needs to submit the I/O request via
erofs_fileio_rq_submit() and allocate a new I/O request with an empty
`struct bio`. Then it retries the bio_add_folio() call.
However, at this point, erofs_onlinefolio_split() has already been
called which increments `folio->private`; the retry will call
erofs_onlinefolio_split() again, but there will never be a matching
erofs_onlinefolio_end() call. This leaves the folio locked forever
and all waiters will be stuck in folio_wait_bit_common().
This bug has been added by commit ce63cb62d7 ("erofs: support
unencoded inodes for fileio"), but was practically unreachable because
there was room for 256 folios in the `struct bio` - until commit
9f74ae8c9a ("erofs: shorten bvecs[] for file-backed mounts") which
reduced the array capacity to 16 folios.
It was now trivial to trigger the bug by manually invoking readahead
from userspace, e.g.:
posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);
This should be fixed by invoking erofs_onlinefolio_split() only after
bio_add_folio() has succeeded. This is safe: asynchronous completions
invoking erofs_onlinefolio_end() will not unlock the folio because
erofs_fileio_scan_folio() is still holding a reference to be released
by erofs_onlinefolio_end() at the end.
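In sketch form, the fixed ordering looks like this
(erofs_fileio_rq_alloc() is an assumed name for the request
allocation):

	while (!bio_add_folio(&rq->bio, folio, len, cur)) {
		/* bio full: flush the request, retry with a fresh one */
		erofs_fileio_rq_submit(rq);
		rq = erofs_fileio_rq_alloc(mdev);	/* name assumed */
	}
	/* account the folio only after it is actually attached, so
	 * the split/end calls stay balanced across retries */
	erofs_onlinefolio_split(folio);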
Fixes: ce63cb62d7 ("erofs: support unencoded inodes for fileio")
Fixes: 9f74ae8c9a ("erofs: shorten bvecs[] for file-backed mounts")
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Gao Xiang <xiang@kernel.org>
Tested-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250428230933.3422273-1-max.kellermann@ionos.com
Signed-off-by: Gao Xiang <xiang@kernel.org>
Conflicts:
fs/erofs/fileio.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
ANBZ: #11854
commit 6422cde1b0 upstream.
For many use cases (e.g. container images are just fetched from remote),
performance will be impacted if underlay page cache is up-to-date but
direct i/o flushes dirty pages first.
Instead, let's use buffered I/O by default to keep in sync with loop
devices and add a (re)mount option to explicitly give a try to use
direct I/O if supported by the underlying files.
The container startup time is improved as below:
[workload] docker.io/library/workpress:latest
                                     unpack        1st run  non-1st runs
EROFS snapshotter buffered I/O file  4.586404265s  0.308s   0.198s
EROFS snapshotter direct I/O file    4.581742849s  2.238s   0.222s
EROFS snapshotter loop               4.596023152s  0.346s   0.201s
Overlayfs snapshotter                5.382851037s  0.206s   0.214s
Fixes: fb17675026 ("erofs: add file-backed mount support")
Cc: Derek McGowan <derek@mcg.dev>
Reviewed-by: Chao Yu <chao@kernel.org>
Conflicts:
fs/erofs/fileio.c
fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
ANBZ: #11854
commit fb17675026 upstream.
It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.
Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices. However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed. There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses. As
for current EROFS users, ComposeFS, containerd and Android APEXes
will directly benefit from it.
On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
- Fscache infrastructure has recently been moved into new Netfslib
which is an unexpected dependency to EROFS really, although it
originally claims "it could be used for caching other things such as
ISO9660 filesystems too." [3]
- It takes an unexpectedly long time to upstream Fscache/Cachefiles
enhancements. For example, the failover feature took more than
one year, and the daemonless feature is still far behind now;
- Ongoing HSM "fanotify pre-content hooks" [4] together with this will
perfectly supersede "erofs over fscache" in a simpler way since
developers (mainly containerd folks) could leverage their existing
caching mechanism entirely in userspace instead of strictly following
the predefined in-kernel caching tree hierarchy.
After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed then (as an EROFS
internal improvement and EROFS will not have to bother with on-demand
fetching and/or caching improvements anymore.)
[1] https://github.com/containers/storage/pull/2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/cover.1723670362.git.josef@toxicpanda.com
Closes: https://github.com/containers/composefs/issues/144
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240830032840.3783206-1-hsiangkao@linux.alibaba.com
Conflicts:
fs/erofs/Kconfig
fs/erofs/data.c
fs/erofs/inode.c
fs/erofs/internal.h
fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
ANBZ: #11854
commit 07abe43a28 upstream.
Instead of allocating the erofs_sb_info in fill_super(), allocate it
during erofs_init_fs_context() and ensure that erofs can always have
the info available during erofs_kill_sb(). After this,
erofs_fs_context is no longer needed, so replace ctx with sbi; no
functional changes.
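A sketch of the shape this gives erofs_init_fs_context() (option
parsing setup and other details omitted):

	static int erofs_init_fs_context(struct fs_context *fc)
	{
		struct erofs_sb_info *sbi = kzalloc(sizeof(*sbi), GFP_KERNEL);

		if (!sbi)
			return -ENOMEM;

		/* sbi now lives for the whole fs_context lifetime,
		 * so erofs_kill_sb() can always find it */
		fc->s_fs_info = sbi;
		fc->ops = &erofs_context_ops;
		return 0;
	}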
Suggested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240419123611.947084-2-libaokun1@huawei.com
Conflicts:
fs/erofs/internal.h
fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
ANBZ: #11854
This was missing in the original backport commit.
Fixes: 34bc2f0eb2 ("erofs: get rid of erofs_map_blocks_flatmode()")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
ANBZ: #8765
If we account util_est for each cfs_rq even when group_balancer is
disabled, there will be some overhead. To avoid the overhead, we only
account it when the group balancer is enabled, and when the switch is
toggled, we recalculate it for each cfs_rq at once.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5529
ANBZ: #22862
This patch makes the following changes to cpufeatures.h:
- Corrects the typo in X86_FEATURE_PARALLAX_E to X86_FEATURE_PARALLAX_EN
- Reformats the newly added feature macros in word5 to maintain consistent
coding style.
Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5524
ANBZ: #20656
In cases where an NVMe device fails to create I/O queues due to certain
errors, the kernel does not remove the corresponding NVMe controller.
However, under such conditions, the nvme_dev->ctrl->tagset may become
NULL. To avoid potential NULL pointer dereference, it is necessary to
perform a NULL check on both the tagset and the admin_tagset before use.
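A minimal sketch of the guard described, assuming the standard
struct nvme_ctrl fields:

	/* I/O queue creation may have failed without removing the
	 * controller, leaving the tagsets unset */
	if (!dev->ctrl.tagset || !dev->ctrl.admin_tagset)
		return;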
Fixes: 31b981ebbd ("anolis: nvme: support set per device's timeout")
Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5528
ANBZ: #22860
commit 6b6a23d5d8 upstream.
And add examples for how to correctly handle large optlens. This is
less relevant now that we don't return EFAULT anymore, but it is
still the correct thing to do.
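For getsockopt, the documented pattern for programs that don't care
about large buffers looks like this (illustrative only; PAGE_SIZE is
assumed to be defined to the system page size):

	SEC("cgroup/getsockopt")
	int getsockopt_example(struct bpf_sockopt *ctx)
	{
		/* only the first PAGE_SIZE bytes of optval are exposed
		 * to BPF; for larger buffers, set optlen to 0 so the
		 * kernel ignores our (necessarily partial) output */
		if (ctx->optlen > PAGE_SIZE) {
			ctx->optlen = 0;
			return 1;
		}
		return 1;
	}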
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-5-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
ANBZ: #22860
commit 989a4a7dbf upstream.
Instead of assuming EFAULT, let's assume the BPF program's
output is ignored.
Remove "getsockopt: deny arbitrary ctx->retval" because it
was actually testing optlen. We have a separate set of tests
for retval.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
ANBZ: #22860
commit 29ebbba7d4 upstream.
With the way the hooks are implemented right now, we have a special
condition: an optval larger than PAGE_SIZE will expose only the first
4k to BPF; any modifications to the optval are ignored. If the BPF
program doesn't handle this condition by resetting optlen to 0,
userspace will get EFAULT.
The intention of the EFAULT was to make it apparent to the
developers that the program is doing something wrong.
However, this inadvertently might affect production workloads
with the BPF programs that are not too careful (i.e., returning EFAULT
for perfectly valid setsockopt/getsockopt calls).
Let's try to minimize the chance of BPF program screwing up userspace
by ignoring the output of those BPF programs (instead of returning
EFAULT to the userspace). pr_info_once those cases to
the dmesg to help with figuring out what's going wrong.
Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
ANBZ: #22615
For Hygon Dhyana processors other than Hygon family 18h model 4h,
the ATL is used to convert umc_mca_addr to sys_addr. When getting the
die id, it uses the amd_get_nodes_per_socket() function to get the
"nodes_per_socket" value, which is wrong for Hygon CPUs.
To fix it, use the hygon_get_nodes_per_socket() function to get the
correct "nodes_per_socket" value.
Fixes: 8df0979486c2 ("EDAC/amd64: Use new AMD Address Translation Library")
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Wenhui Fan <fanwh@hygon.cn>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5521
ANBZ: #12420
When rich container is enabled, the task's start boottime should be
`task->start_boottime - init_tsk->start_boottime`.
When timens is enabled, the task's start boottime should be
`task->start_boottime + timens->boottime_offset`.
When both of them are enabled, the task's start boottime should be
`task->start_boottime - init_tsk->start_boottime`, too.
But currently the logic is wrong:
`task->start_boottime + timens->boottime_offset - init_tsk->start_boottime`.
This makes the task's start boottime a very large, and sometimes
negative, number. When the ps tool iterates /proc/*/stat, parses the
start time, and then uses localtime() to convert it, ps may segfault
because localtime() cannot process negative numbers and returns NULL.
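The corrected selection, in sketch form (in_rich_container() and the
offset naming are illustrative):

	/* sketch only: the rich-container offset replaces, rather
	 * than stacks on top of, the time namespace offset */
	if (in_rich_container(task))		/* name assumed */
		start = task->start_boottime - init_tsk->start_boottime;
	else
		start = task->start_boottime + boottime_offset; /* assumed */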
Fixes: c46463ddc5 ("fs/proc: apply the time namespace offset to /proc/stat btime")
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5518
ANBZ: #22586
commit b34ffb0c6d upstream.
The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.
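In sketch form, the fix for the LRU update path looks like this
(htab_lru_push_free() returns an element to the local free list):

	ret = htab_lock_bucket(htab, b, hash, &flags);
	if (ret) {
		/* don't leak the preallocated element on -EBUSY */
		htab_lru_push_free(htab, l_new);
		return ret;
	}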
Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Xiao Long <xiaolong@openanolis.org>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5495
ANBZ: #18841
We applied parallel deferred memory online to virtio-mem and
observed a significant speedup in the hot-plug process.
We conducted an experiment to hotplug 400G of memory, and
the results were as follows:
- Before applying the patch:
- Total Time = Origin Hotplug Time = 5537ms (72.24 GB/s)
- After applying the patch (with `parallel_hotplug_ratio=80`):
- Origin Hotplug Time = 178ms
- Deferred Parallel Hotplug Time = 1200ms
- Total Time = 1378ms (76% reduction, 290.28 GB/s)
Lastly, there's an issue regarding the guest's plug request to the
VMM. The VMM relies on the plug requests sent by the guest to
determine the size of the hot-plugged memory. Therefore, we defer
sending the plug requests until the memory has actually been onlined.
Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
ANBZ: #18841
Memory hotplug is a serial process that adds memory to Linux in the
granularity of memory blocks. We identified two memory initialization
functions that consume significant time when onlining memory blocks:
- `__init_single_page`: initialize the struct page
- `__free_pages_core`: add page to the buddy allocator
We attempted to execute these two functions in parallel during the
process of hotplugging a memory block. The experimental results showed
that when the memory block size was 1GB, the hotplug speed was
increased by approximately 200%. However, when the memory block size
was 128MB, which is the more commonly used size, the hotplug speed
was even worse than that of serial execution.
Therefore, how to improve the hotplug speed when the memory block size
is 128MB remains a challenge.
Here is my idea:
- Defer the execution of these two functions and their associated
processes to the final phase of the entire hotplug process, so
that the hotplug speed will no longer be limited by the memory
block size.
- Perform parallel execution in the final phase, as previous
implementations have proven that this can accelerate the hotplug
process.
We introduce the new online function, `deferred_online_memory`, for
deferring the actual online process of memory blocks.
Additionally, we have added a command-line argument,
parallel_hotplug_ratio, which sets the ratio of parallel workers to
the number of CPUs on the node. When parallel_hotplug_ratio is 0,
the memory online process will no longer be deferred.
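The worker count scales with the node's CPU count; in sketch form
(variable names assumed):

	/* ratio of parallel workers to the CPUs on the memory's
	 * node; a ratio of 0 means onlining is not deferred at all */
	nr_workers = cpumask_weight(cpumask_of_node(nid)) *
		     parallel_hotplug_ratio / 100;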
Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
ANBZ: #18841
This commit refactors the `online_pages()` function to prepare for
deferred memory online support. `online_pages()` is the core function
for memory hotplug. Initializing struct pages and freeing pages to buddy
are the most time-consuming operations, so we move these operations and
related ones to the deferred phase.
Additionally, since the adjustment of `present_pages` is deferred, and
`auto_movable_zone_for_pfn()` uses `present_pages` to determine which
zone the new memory block belongs to, we introduce `deferred_pages` to
indicate the number of deferred pages. This allows
`auto_movable_zone_for_pfn()` to make decisions based on both
`present_pages` and `deferred_pages`.
Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
ANBZ: #18841
This commit refactors the `move_pfn_range_to_zone()` function to prepare
for deferred memory online support. `memmap_init_zone()` is a
time-consuming operation, so we put it into the deferred phase for
execution.
Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
ANBZ: #18841
This commit refactors the `adjust_present_page_count()` function
to prepare for deferred memory online support. If memory online is
deferred, then the adjustment of the present pages must also be deferred.
Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
ANBZ: #18841
We added three macros to represent different phases of memory hotplug.
MHP_PHASE_DEFAULT represents the original process, while MHP_PHASE_PREPARE
and MHP_PHASE_DEFERRED represent the process split into the prepare
phase and the deferred phase. The purpose of this change is to move
time-consuming operations to the deferred phase, thereby effectively
improving the speed of memory hotplug through concurrency.
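In sketch form (the values are illustrative, the names come from the
text above):

	#define MHP_PHASE_DEFAULT	0 /* original, undeferred path */
	#define MHP_PHASE_PREPARE	1 /* quick prepare at hotplug */
	#define MHP_PHASE_DEFERRED	2 /* slow work, run later */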
Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
ANBZ: #8765
When a task_group with group_balancer enabled gets freed, it should
be removed from its gb_sd to prevent use-after-free.
Also, hold the task_group's gb_lock while enabling/disabling
group_balancer.
Fixes: 7f8e0c7133 ("anolis: sched: maintain group balancer task groups")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5512
ANBZ: #12858
commit 49f683b41f upstream.
Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
loads and stores are atomic. In the extremely unlikely scenario the
compiler tears the stores, it's theoretically possible for KVM to attempt
to get a vCPU using an out-of-bounds index, e.g. if the write is split
into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
257 vCPUs:
CPU0                                    CPU1
last_boosted_vcpu = 0xff;
                                        (last_boosted_vcpu = 0x100)
                                        last_boosted_vcpu[15:8] = 0x01;
i = (last_boosted_vcpu = 0x1ff)
                                        last_boosted_vcpu[7:0] = 0x00;
vcpu = kvm->vcpu_array[0x1ff];
As detected by KCSAN:
BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
value changed: 0x00000012 -> 0x00000000
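The fix itself is small: mark both accesses so the compiler must keep
the 32-bit load and store whole, along the lines of:

	int i, last = READ_ONCE(kvm->last_boosted_vcpu);

	/* ... pick the next vCPU to yield to, then publish it ... */
	WRITE_ONCE(kvm->last_boosted_vcpu, i);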
Fixes: 217ece6129 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fixes: CVE-2024-40953
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5511
ANBZ: #12459
commit aa0d42cacf upstream.
Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support
for virtualizing Intel PT via guest/host mode unless BROKEN=y. There are
myriad bugs in the implementation, some of which are fatal to the guest,
and others which put the stability and health of the host at risk.
For guest fatalities, the most glaring issue is that KVM fails to ensure
tracing is disabled, and *stays* disabled prior to VM-Enter, which is
necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing
is enabled (enforced via a VMX consistency check). Per the SDM:
If the logical processor is operating with Intel PT enabled (if
IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
IA32_RTIT_CTL" VM-entry control must be 0.
On the host side, KVM doesn't validate the guest CPUID configuration
provided by userspace, and even worse, uses the guest configuration to
decide what MSRs to save/load at VM-Enter and VM-Exit. E.g. configuring
guest CPUID to enumerate more address ranges than are supported in hardware
will result in KVM trying to passthrough, save, and load non-existent MSRs,
which generates a variety of WARNs, ToPA ERRORs in the host, a potential
deadlock, etc.
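Mechanically, hiding the parameter is tiny; a sketch of the shape of
the change in vmx.c (illustrative, not the literal diff):

	static int __read_mostly pt_mode = PT_MODE_SYSTEM;
	#ifdef CONFIG_BROKEN
	module_param(pt_mode, int, S_IRUGO);
	#endif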
Fixes: f99e3daf94 ("KVM: x86: Add Intel PT virtualization work mode")
Cc: stable@vger.kernel.org
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20241101185031.1799556-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: CVE-2024-53135
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5510
ANBZ: #22597
commit 2b88119e35 upstream.
The mtty driver exposes a PCI serial device to userspace and therefore
makes an easy target for a sample device supporting migration. The device
does not make use of DMA, therefore we can easily claim support for the
migration P2P states, as well as dirty logging. This implementation also
makes use of PRE_COPY support in order to provide migration stream
compatibility testing, which should generally be considered good practice.
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20231016224736.2575718-3-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5504
ANBZ: #22597
commit 293fbc2881 upstream.
The mtty driver does not currently conform to the vfio SET_IRQS uAPI.
For example, it claims to support mask and unmask of INTx, but actually
does nothing. It claims to support AUTOMASK for INTx, but doesn't. It
fails to teardown eventfds under the full semantics specified by the
SET_IRQS ioctl. It also fails to teardown eventfds when the device is
closed, leading to memory leaks. It claims to support the request IRQ,
but doesn't.
Fix all these.
A side effect of this is that QEMU will now report a warning:
vfio <uuid>: Failed to set up UNMASK eventfd signaling for interrupt \
INTX-0: VFIO_DEVICE_SET_IRQS failure: Inappropriate ioctl for device
The fact is that the unmask eventfd was never supported but quietly
failed. mtty never honored the AUTOMASK behavior, therefore there
was nothing to unmask. QEMU is verbose about the failure, but
properly falls back to userspace unmasking.
Fixes: 9d1a546c53 ("docs: Sample driver to demonstrate how to use Mediated device framework.")
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20231016224736.2575718-2-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5504
ANBZ: #20511
The "scan" command of the cmn700 cache scan misc device file can take
up to several hundreds milliseconds. The CPU utilization during
command run will be 100%. Although use cond_sched() during "scan"
running to release CPU resource for other workloads, the high CPU
utilization may be undesirable.
So, in the patch, add "cpu_percent" parameter to "set_param" command
of the misc device file. If "cpu_percent" is less than 100, use
schedule_timeout_interruptible() in "scan" command running to reduce the
CPU usage to the specified value.
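A sketch of the duty-cycle throttling implied (variable names
assumed):

	/* after scanning for `busy` jiffies, sleep long enough that
	 * busy / (busy + idle) matches the configured cpu_percent */
	if (cpu_percent < 100) {
		unsigned long idle = busy * (100 - cpu_percent) / cpu_percent;

		schedule_timeout_interruptible(idle);
	}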
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
ANBZ: #20511
It may take a long time to scan all HNFs (Fully-coherent Home Nodes)
of a die because debug register accesses are slow. For example, it
takes 300-500ms to scan the 64 HNFs of a die using one core at 100%
utilization. If some of the addresses go into the problematic state
again before the next scan/flush, performance may suffer.
So, in this patch, flush the addresses at a higher frequency if they
are repeatedly identified as problematic. The more times an address
is identified as problematic, the higher its flush frequency, from
once up to 4, 16, or N (the number of HNFs) times per full
scan/flush cycle.
The main cost is the memory needed to record previously identified
problematic addresses, plus increased code complexity. So, add a new
parameter named "max_frequ" to the misc device file write command
"set_param". If "max_frequ=0" is specified, the original direct
flushing mechanism is used, avoiding the extra memory usage and code
complexity. It can also be used to control the maximum flushing
frequency.
In a specjbb rt (response time) test with a fixed injection rate, the
patch improves performance by up to 3%.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
ANBZ: #20511
This is to mitigate ARM CMN-700 performance erratum 3688582. For
some versions of ARM CMN-700, if
- multiple CMN-700 instances are used (e.g. multi-dies chip) AND
- LLC SF (Snoop Filter) for a cache line in shared MESI state AND
- the LLC cache line is victimized
then, any further memory reading from the cache line address will not
fill the LLC cache but always go to DDR until the cache line is
evicted or invalidated from LLC SF.
The details of the erratum can be found at,
https://developer.arm.com/documentation/SDEN-2039384/1900/
To mitigate the erratum, dump the addresses in the SF and LLC that
are in the state above via CMN-700 debug registers
(cmn_hns_cfg_slcsf_dbgrd, details can be found in the following URL),
then flush the cache lines via DC instructions. The CMN-700 debug
registers can be accessed in EL3 only, so add a customized SMC
(Secure Monitor Call) to dump the potentially problematic addresses
and flush the cache in a misc device driver.
https://developer.arm.com/documentation/102308/0302/?lang=en
Reading and writing the misc device file (/dev/cmn700_cache_scan by
default) is the user space interface. The input/output format is
ASCII text to make the interface easy to use. The following device
file write commands are supported.
- set_param <param1>=<val1> <param2>=<val2> ...
set the parameters of scanning and flushing, available parameters
are as follows:
- step=[1-128]: number of cache sets to scan per SMC
- hnf=[0-127]-[0-127]: hnf (full-coherent home node) to scan
- sf_check_mesi=[0-1]: whether to check mesi state of SF cache line
- sf_mesi=[0-3]: the mesi state to check
- scan
scan according to the parameters above and flush the problematic
cache lines.
- scan_step
scan for one step only, to reduce latency.
Reading the file gets scanning statistics. Example output is as
follows,
nr_entry_total: 638
nr_entry_max: 3
nr_flush_total: 3
nr_flush_max: 1
us_smc_total: 327594
us_smc_max: 105
cml_nr_scan: 78
cml_nr_entry_total: 65336
cml_nr_entry_max: 5
cml_nr_flush_total: 8188
cml_nr_flush_max: 5
cml_us_smc_total: 25635226
cml_us_smc_max: 395
cml_us_flush_total: 332172
cml_us_flush_max: 41
where cml_ indicates "cumulative", that is, cml_xxx_total is the sum
of nr_xxx_total and cml_xxx_max is the maximum of xxx_max over all
scans so far.
[cml_]nr_entry_total: total number of potential problematic cache line
entries.
[cml_]nr_entry_max: maximum number of potential problematic cache
line entries for one step.
[cml_]nr_flush_total: total number of flushed cache line entries
[cml_]nr_flush_max: maximum number of flushed cache line entries for
one step.
[cml_]us_smc_total: total CPU time of SMC in micro-second.
[cml_]us_smc_max: maximum CPU time of SMC in micro-second.
[cml_]us_flush_total: total CPU time of flushing cache in micro-second.
[cml_]us_flush_max: maximum CPU time of flushing cache in micro-second.
It's possible that some cache lines may be flushed by mistake. For
example, an address may satisfy the condition above even though it
was never accessed by another core. However, all our tested workloads
show a positive or neutral performance impact, which indicates that
the mistaken-flush rate stays low.
Performance tests show that the mitigation can improve MySQL OLTP
read qps (queries per second) by up to 17.5%, and the LLC miss ratio
drops from 32.7% to 11.3%.
Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
ANBZ: #19799
commit 3d75b8aa5c upstream.
Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, e.g. when a VM and all its vCPUs is being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put. Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.
Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: events async_pf_execute [kvm]
RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
Call Trace:
<TASK>
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace 0000000000000000 ]---
INFO: task kworker/8:1:251 blocked for more than 120 seconds.
Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
Workqueue: events async_pf_execute [kvm]
Call Trace:
<TASK>
__schedule+0x33f/0xa40
schedule+0x53/0xc0
schedule_timeout+0x12a/0x140
__wait_for_common+0x8d/0x1d0
__flush_work.isra.0+0x19f/0x2c0
kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
kvm_put_kvm+0x1c1/0x320 [kvm]
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed. And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.
Note that kvm_check_async_pf_completion() may also take the work item off
the completion queue and so also needs to flush the work queue, as the
work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting
on the workqueue could theoretically delay a vCPU due to waiting for the
work to complete, but that's a very, very small chance, and likely a very
small delay. kvm_arch_async_page_present_queued() unconditionally makes a
new request, i.e. will effectively delay entering the guest, so the
remaining work is really just:
trace_kvm_async_pf_completed(addr, cr2_or_gpa);
__kvm_vcpu_wake_up(vcpu);
mmput(mm);
and mmput() can't drop the last reference to the page tables if the vCPU is
still alive, i.e. the vCPU won't get stuck tearing down page tables.
Add a helper to do the flushing, specifically to deal with "wakeup all"
work items, as they aren't actually work items, i.e. are never placed in a
workqueue. Trying to flush a bogus workqueue entry rightly makes
__flush_work() complain (kudos to whoever added that sanity check).
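A minimal sketch of the flushing helper described (the kvm_async_pf
fields follow virt/kvm/async_pf.c; the helper name is abbreviated
and freeing details are omitted):

	static void kvm_flush_async_pf_work(struct kvm_async_pf *work)
	{
		/* "wakeup all" items were never queued, and flushing
		 * them would trip __flush_work()'s sanity check */
		if (work->wakeup_all)
			return;

		/* wait for async_pf_execute() to run to completion */
		flush_work(&work->work);
	}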
Note, commit 5f6de5cbeb ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.
Fixes: af585b921e ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Reviewed-by: Xu Yilun <yilun.xu@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fixes: CVE-2024-26976
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5484
ANBZ: #8765
As we will update the root cpumask before we build group balancer sched
domains, we should update the parameters too.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
ANBZ: #8765
To reduce turbulence, the interval of gb_load_balance() should be
controlled, so we set it to twice the lower interval. And to prevent
unnecessary migration, we only migrate task groups when the imbalance
pct exceeds the threshold of 117.
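In sketch form, the threshold check reads (variable names assumed):

	/* migrate only when the busiest gb_sd exceeds the local one
	 * by more than 17%, i.e. imbalance pct 117 */
	if (busiest_load * 100 <= local_load * 117)
		return;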
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
ANBZ: #8765
The goal of the lower interval is to control the balancing interval
so that the scheduler has enough time to do load balancing, so we'd
better determine the lower interval according to the sched domain
where the group balancer sched domain is located.
So far, the sched domain has the following layers:
- NUMA
- DIE
- MC
- SMT
When building a group balancer sched domain according to the hardware
topology, we identify the corresponding topological structure and
determine the lower interval according to its span weight.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
ANBZ: #8765
A group balancer sched domain may have many descendants with task
groups attached, so when we select task groups to migrate, we should
traverse the descendants' task groups, too.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
ANBZ: #8765
If we set the flags of a group balancer sched domain while building
it, the flags of the middle levels may be incorrect. So we should set
the flags after the build.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
ANBZ: #8765
When a task forks, we should reset its soft_cpus_version to -1.
When a task group moves to another gb_sd, we should increment the
soft_cpus_version.
Fixes: 4bb9c29b1c ("anolis: sched: modify the interface of attach/detach_tg_to/from_group_balancer_sched_domain")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
ANBZ: #8765
When lowering a task group, we select a child group balancer sched
domain to migrate it to. First, we should control the interval of
lowering task groups of the parent gb_sd, to avoid causing imbalance.
Second, if the load of the task group's span is balanced among the
child gb_sds, we should select the one with the lower load.
Also, fix a bug where the wrong load was used when calculating the
task group's load.
Fixes: bebcfc550d ("anolis: sched: introduce dynamical load balance for group balancer")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434