Commit Graph

Chao Yu 218a9ce97b f2fs: fix to avoid NULL pointer dereference in f2fs_write_end_io()
ANBZ: #20922

commit d8189834d4 upstream.

butt3rflyh4ck reports a bug as below:

When a thread repeatedly calls F2FS_IOC_RESIZE_FS to resize the fs, if the
resize fails, the f2fs kernel thread would invoke a callback function to
update f2fs io info; it would call f2fs_write_end_io and may trigger a
null-ptr-deref in NODE_MAPPING.

general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
RIP: 0010:NODE_MAPPING fs/f2fs/f2fs.h:1972 [inline]
RIP: 0010:f2fs_write_end_io+0x727/0x1050 fs/f2fs/data.c:370
 <TASK>
 bio_endio+0x5af/0x6c0 block/bio.c:1608
 req_bio_endio block/blk-mq.c:761 [inline]
 blk_update_request+0x5cc/0x1690 block/blk-mq.c:906
 blk_mq_end_request+0x59/0x4c0 block/blk-mq.c:1023
 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1101
 __do_softirq+0x1d4/0x8ef kernel/softirq.c:571
 run_ksoftirqd kernel/softirq.c:939 [inline]
 run_ksoftirqd+0x31/0x60 kernel/softirq.c:931
 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
 kthread+0x33e/0x440 kernel/kthread.c:379
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

The root cause is the race case below, which can leave dirty metadata
in f2fs after the filesystem is remounted read-only:

Thread A				Thread B
- f2fs_ioc_resize_fs
 - f2fs_readonly   --- return false
 - f2fs_resize_fs
					- f2fs_remount
					 - write_checkpoint
					 - set f2fs as ro
  - free_segment_range
   - update meta_inode's data

Then, if f2fs_put_super() fails to write_checkpoint due to the readonly
status, meta_inode's dirty data will be written back after node_inode
is put; finally, f2fs_write_end_io will dereference a NULL pointer via
sbi->node_inode.

Thread A				IRQ context
- f2fs_put_super
 - write_checkpoint fails
 - iput(node_inode)
 - node_inode = NULL
 - iput(meta_inode)
  - write_inode_now
   - f2fs_write_meta_page
					- f2fs_write_end_io
					 - NODE_MAPPING(sbi)
					 : access NULL pointer on node_inode
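The shape of the fix is the familiar double-check pattern: re-validate the read-only state after serializing with remount, closing the window between the early f2fs_readonly() check and the actual resize. A toy userspace model of that pattern (all names are illustrative, not the actual f2fs code):

```c
#include <errno.h>

/* Toy model: ro_flag is the filesystem's read-only state, and
 * remount_ro() stands in for a concurrent remount that wins the race
 * between the early check and the locked section. */
static void remount_ro(int *ro_flag)
{
	*ro_flag = 1;
}

/* With only the early check, resize would proceed on a now-ro fs and
 * dirty metadata; re-checking under the lock rejects it instead. */
static int resize_fs(int *ro_flag)
{
	if (*ro_flag)
		return -EROFS;		/* early check, racy on its own */

	remount_ro(ro_flag);		/* concurrent remount slips in here */

	/* double check under the lock closes the window */
	if (*ro_flag)
		return -EROFS;

	return 0;			/* would update metadata here */
}
```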

Fixes: b4b10061ef ("f2fs: refactor resize_fs to avoid meta updates in progress")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Closes: https://lore.kernel.org/r/1684480657-2375-1-git-send-email-yangtiezhu@loongson.cn
Tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: CVE-2023-2898
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5565
2025-08-01 11:21:00 +08:00
Namjae Jeon 3c65405ba3 exfat: check if filename entries exceeds max filename length
ANBZ: #21452

[ Upstream commit d42334578e ]

exfat_extract_uni_name copies characters from a given file name entry into
the 'uniname' variable. This variable is actually defined on the stack of
the exfat_readdir() function. According to the definition of
the 'exfat_uni_name' type, the file name should be limited to 255 characters
(+ null terminator space), but the exfat_get_uniname_from_ext_entry()
function can write more characters because there is no check whether the
filename entries exceed the max filename length. This patch adds a check
to stop copying filename characters once the max filename length is exceeded.
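The added guard amounts to refusing to copy name characters past the 255-character limit, no matter how many name entries the on-disk directory supplies. A hedged userspace sketch (the 255-char limit and 15 chars per name entry come from the exFAT format; the function name is made up):

```c
#include <stddef.h>

#define EXFAT_MAX_NAME_LEN 255	/* exFAT max filename length (chars) */
#define ENTRY_NAME_CHARS   15	/* name chars carried per file-name entry */

/* Copy at most EXFAT_MAX_NAME_LEN characters even if a malformed
 * directory provides more name entries than a valid name allows. */
static size_t copy_uni_name(size_t n_entries)
{
	size_t copied = 0;

	for (size_t i = 0; i < n_entries; i++) {
		size_t chunk = ENTRY_NAME_CHARS;

		if (copied + chunk > EXFAT_MAX_NAME_LEN)
			chunk = EXFAT_MAX_NAME_LEN - copied; /* clamp last chunk */
		copied += chunk;
		if (copied == EXFAT_MAX_NAME_LEN)
			break;		/* the added check: stop copying */
	}
	return copied;
}
```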

Cc: stable@vger.kernel.org
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reported-by: Maxim Suhanov <dfirblog@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Fixes: CVE-2023-4273
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5564
2025-08-01 10:07:23 +08:00
Weilin Tong 3ab761d2a7 anolis: mm: honor THP defrag setting for direct file collapse
ANBZ: #23087

Currently, our kernel supports collapsing file-backed pages into
hugepages during file reads by prioritizing the file for collapse
in khugepaged. The allocation flags for collapsing are determined
based on khugepaged's defrag setting.

With this patch, the /sys/kernel/mm/transparent_hugepage/defrag
setting is honored for direct file collapse, ensuring the defrag
behavior for file direct collapse uses the same allocation policy
as other THP paths.

Signed-off-by: Weilin Tong <tongweilin@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5557
2025-07-31 03:33:28 +00:00
Geliang Tang c6dd932e4d nvme: drop unused variable ctrl in nvme_setup_cmd
ANBZ: #22935

commit 3a605e32a7 upstream.

The variable 'ctrl' became useless since the code using it was dropped
from nvme_setup_cmd() in the commit 292ddf67bbd5 ("nvme: increment
request genctr on completion"). Fix it to get rid of this compilation
warning in the nvme-5.17 branch:

 drivers/nvme/host/core.c: In function ‘nvme_setup_cmd’:
 drivers/nvme/host/core.c:993:20: warning: unused variable ‘ctrl’ [-Wunused-variable]
   struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
                     ^~~~

Fixes: 292ddf67bbd5 ("nvme: increment request genctr on completion")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5544
2025-07-31 02:49:39 +00:00
Vidya Sagar cb85aee82d PCI/MSI: Set device flag indicating only 32-bit MSI support
ANBZ: #23086

commit 2053230af1 upstream.

The MSI-X Capability requires devices to support 64-bit Message Addresses,
but the MSI Capability can support either 32- or 64-bit addresses.

Previously, we set dev->no_64bit_msi for a few broken devices that
advertise 64-bit MSI support but don't correctly support it.

In addition, check the MSI "64-bit Address Capable" bit for all devices and
set dev->no_64bit_msi for devices that don't advertise 64-bit support.
This allows msi_verify_entries() to catch arch code defects that assign
64-bit addresses when they're not supported.

The warning is helpful to find defects like the one fixed by
https://lore.kernel.org/r/20201117165312.25847-1-vidyas@nvidia.com

[bhelgaas: set no_64bit_msi in pci_msi_init(), commit log]
Link: https://lore.kernel.org/r/20201124105035.24573-1-vidyas@nvidia.com
Link: https://lore.kernel.org/r/20201203185110.1583077-4-helgaas@kernel.org
Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
2025-07-29 16:46:59 +08:00
Bjorn Helgaas 62ad266bac PCI/MSI: Move MSI/MSI-X flags updaters to msi.c
ANBZ: #23086

commit 830dfe88ea upstream.

pci_msi_set_enable() and pci_msix_clear_and_set_ctrl() are only used from
msi.c, so move them from drivers/pci/pci.h to msi.c.  No functional change
intended.

Link: https://lore.kernel.org/r/20201203185110.1583077-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
2025-07-29 16:46:55 +08:00
Bjorn Helgaas f9d7c2d60f PCI/MSI: Move MSI/MSI-X init to msi.c
ANBZ: #23086

commit cbc40d5c33 upstream.

Move pci_msi_setup_pci_dev(), which disables MSI and MSI-X interrupts, from
probe.c to msi.c so it's with all the other MSI code and more consistent
with other capability initialization.  This means we must compile msi.c
always, even without CONFIG_PCI_MSI, so wrap the rest of msi.c in an #ifdef
and adjust the Makefile accordingly.  No functional change intended.

Link: https://lore.kernel.org/r/20201203185110.1583077-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
2025-07-29 16:46:51 +08:00
Gao Xiang 8b2d460737 anolis: configs: enable CONFIG_EROFS_FS_BACKED_BY_FILE
ANBZ: #11854

To support mounting without loopback block devices.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5548
2025-07-28 08:43:58 +00:00
xiongmengbiao 71a37bdc5f anolis: crypto: ccp: use time_after() instead of time_before() to checks schedule condition
ANBZ: #22926

psp_mutex_lock_timeout() used time_before(jiffies, last_je), which
yields the CPU too early.

Fix it to use time_after(jiffies, last_je + msecs_to_jiffies(100)) so
that the full 100ms wait is honored.
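The distinction matters because jiffies wraps around; the kernel's comparison macros handle wraparound via signed subtraction. A userspace re-implementation sketch of the corrected condition (last_je, the 100ms window, and the helper name follow the description above, not the actual Hygon driver source):

```c
#include <stdint.h>

/* Wraparound-safe comparisons in the spirit of include/linux/jiffies.h,
 * re-implemented here for a 32-bit toy jiffies counter. */
#define time_after(a, b)	((int32_t)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

/* Corrected schedule condition: only give up the CPU once the full
 * 100ms window (msecs100, expressed in jiffies) has elapsed. */
static int should_resched(uint32_t jiffies, uint32_t last_je, uint32_t msecs100)
{
	return time_after(jiffies, last_je + msecs100);
}
```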

Signed-off-by: xiongmengbiao <xiongmengbiao@hygon.cn>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5532
2025-07-23 06:27:52 +00:00
Tatsuyuki Ishi 7fc3abd8b2 erofs: impersonate the opener's credentials when accessing backing file
ANBZ: #11854

commit 905eeb2b7c33adda23a966aeb811ab4cb9e62031 upstream.

Previously, file operations on a file-backed mount used the current
process' credentials to access the backing FD. Attempting to do so on
Android led to SELinux denials, as ACL rules on the backing file (e.g.
/system/apex/foo.apex) are restricted to a small set of processes.
Arguably, this error is redundant and leaks implementation details, as
access to files on a mount is already ACL'ed by path.

Instead, override to use the opener's cred when accessing the backing
file. This makes the behavior similar to a loop-backed mount, which
uses kworker cred when accessing the backing file and does not cause
SELinux denials.
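The mechanism is the kernel's usual override_creds()/revert_creds() dance: capture the opener's credentials at open time, then temporarily install them around each backing-file operation. A toy userspace model of the save/override/revert flow (an integer uid stands in for struct cred; all names are illustrative):

```c
/* Toy credential model: current_cred is "the credentials in effect". */
static int current_cred = 1000;	/* the process issuing the read */

static int override_creds_toy(int new_cred)
{
	int old = current_cred;

	current_cred = new_cred;	/* act as the opener from here on */
	return old;
}

static void revert_creds_toy(int old_cred)
{
	current_cred = old_cred;
}

/* Perform a backing-file access with the opener's saved credentials,
 * returning the credential the access check would have observed. */
static int backing_read_toy(int opener_cred)
{
	int old = override_creds_toy(opener_cred);
	int cred_used = current_cred;

	revert_creds_toy(old);		/* restore the caller's creds */
	return cred_used;
}
```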

Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250612-b4-erofs-impersonate-v1-1-8ea7d6f65171@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Max Kellermann baac228835 fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()
ANBZ: #11854

commit bbfe756dc3062c1e934f06e5ba39c239aa953b92 upstream.

If bio_add_folio() fails (because it is full),
erofs_fileio_scan_folio() needs to submit the I/O request via
erofs_fileio_rq_submit() and allocate a new I/O request with an empty
`struct bio`.  Then it retries the bio_add_folio() call.

However, at this point, erofs_onlinefolio_split() has already been
called which increments `folio->private`; the retry will call
erofs_onlinefolio_split() again, but there will never be a matching
erofs_onlinefolio_end() call.  This leaves the folio locked forever
and all waiters will be stuck in folio_wait_bit_common().

This bug has been added by commit ce63cb62d7 ("erofs: support
unencoded inodes for fileio"), but was practically unreachable because
there was room for 256 folios in the `struct bio` - until commit
9f74ae8c9a ("erofs: shorten bvecs[] for file-backed mounts") which
reduced the array capacity to 16 folios.

It was now trivial to trigger the bug by manually invoking readahead
from userspace, e.g.:

 posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

This should be fixed by invoking erofs_onlinefolio_split() only after
bio_add_folio() has succeeded.  This is safe: asynchronous completions
invoking erofs_onlinefolio_end() will not unlock the folio because
erofs_fileio_scan_folio() is still holding a reference to be released
by erofs_onlinefolio_end() at the end.
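In other words, the split (the `folio->private` bump) must sit after the retry loop, inside the success path, so a submit-and-retry never double-counts. A toy model of the corrected ordering (counters stand in for bio/folio state; names are hypothetical):

```c
#define BIO_CAP 16		/* bvec capacity after the shortening commit */

static int bio_slots_used;
static int split_calls;		/* erofs_onlinefolio_split() invocations */

/* Returns 1 if the folio fit into the current bio, 0 if it is full. */
static int bio_add_folio_toy(void)
{
	if (bio_slots_used >= BIO_CAP)
		return 0;
	bio_slots_used++;
	return 1;
}

static void rq_submit_toy(void)
{
	bio_slots_used = 0;	/* submit and start a fresh bio */
}

/* Corrected ordering: split exactly once, after bio_add_folio succeeds,
 * so it is matched by exactly one erofs_onlinefolio_end() later. */
static void scan_folio_toy(void)
{
	while (!bio_add_folio_toy())
		rq_submit_toy();	/* full: submit, retry with new bio */
	split_calls++;
}
```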

Fixes: ce63cb62d7 ("erofs: support unencoded inodes for fileio")
Fixes: 9f74ae8c9a ("erofs: shorten bvecs[] for file-backed mounts")
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Gao Xiang <xiang@kernel.org>
Tested-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250428230933.3422273-1-max.kellermann@ionos.com
Signed-off-by: Gao Xiang <xiang@kernel.org>
Conflicts:
	fs/erofs/fileio.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Gao Xiang 6828af4e38 erofs: shorten bvecs[] for file-backed mounts
ANBZ: #11854

commit 9f74ae8c9a upstream.

BIO_MAX_VECS is too large for __GFP_NOFAIL allocation.  We could use
a mempool (since BIOs can always proceed), but it seems overly
complicated for now.

Reviewed-by: Chao Yu <chao@kernel.org>
Conflicts:
	fs/erofs/fileio.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250107082825.74242-1-hsiangkao@linux.alibaba.com
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Gao Xiang e86624db2a erofs: use buffered I/O for file-backed mounts by default
ANBZ: #11854

commit 6422cde1b0 upstream.

For many use cases (e.g. container images are just fetched from remote),
performance will be impacted if underlay page cache is up-to-date but
direct i/o flushes dirty pages first.

Instead, let's use buffered I/O by default to keep in sync with loop
devices and add a (re)mount option to explicitly give a try to use
direct I/O if supported by the underlying files.

The container startup time is improved as below:
[workload] docker.io/library/workpress:latest
                                     unpack        1st run  non-1st runs
EROFS snapshotter buffered I/O file  4.586404265s  0.308s   0.198s
EROFS snapshotter direct I/O file    4.581742849s  2.238s   0.222s
EROFS snapshotter loop               4.596023152s  0.346s   0.201s
Overlayfs snapshotter                5.382851037s  0.206s   0.214s

Fixes: fb17675026 ("erofs: add file-backed mount support")
Cc: Derek McGowan <derek@mcg.dev>
Reviewed-by: Chao Yu <chao@kernel.org>
Conflicts:
	fs/erofs/fileio.c
	fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Gao Xiang e263d8ad31 erofs: support unencoded inodes for fileio
ANBZ: #11854

commit ce63cb62d7 upstream.

Since EROFS only needs to handle read requests in simple contexts,
just use vfs_iocb_iter_read() directly for data I/Os.

Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240905093031.2745929-1-hsiangkao@linux.alibaba.com
Conflicts:
	fs/erofs/Makefile
	fs/erofs/data.c
	fs/erofs/inode.c
	fs/erofs/internal.h
	fs/erofs/zdata.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:33:56 +08:00
Gao Xiang e0754eb85d erofs: add file-backed mount support
ANBZ: #11854

commit fb17675026 upstream.

It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.

Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices.  However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed.  There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses.  As
for current EROFS users, ComposeFS, containerd and Android APEXs will
be directly benefited from it.

On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
 - Fscache infrastructure has recently been moved into new Netfslib
   which is an unexpected dependency to EROFS really, although it
   originally claims "it could be used for caching other things such as
   ISO9660 filesystems too." [3]

 - It takes an unexpectedly long time to upstream Fscache/Cachefiles
   enhancements.  For example, the failover feature took more than
   one year, and the daemonless feature is still far behind now;

 - Ongoing HSM "fanotify pre-content hooks" [4] together with this will
   perfectly supersede "erofs over fscache" in a simpler way since
   developers (mainly containerd folks) could leverage their existing
   caching mechanism entirely in userspace instead of strictly following
   the predefined in-kernel caching tree hierarchy.

After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed (as an EROFS-internal
improvement, EROFS will no longer have to bother with on-demand fetching
and/or caching improvements.)

[1] https://github.com/containers/storage/pull/2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/cover.1723670362.git.josef@toxicpanda.com

Closes: https://github.com/containers/composefs/issues/144
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240830032840.3783206-1-hsiangkao@linux.alibaba.com
Conflicts:
	fs/erofs/Kconfig
	fs/erofs/data.c
	fs/erofs/inode.c
	fs/erofs/internal.h
	fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:23 +08:00
Baokun Li f8b1bb49b3 erofs: get rid of erofs_fs_context
ANBZ: #11854

commit 07abe43a28 upstream.

Instead of allocating the erofs_sb_info in fill_super() allocate it during
erofs_init_fs_context() and ensure that erofs can always have the info
available during erofs_kill_sb(). After this erofs_fs_context is no longer
needed, replace ctx with sbi, no functional changes.

Suggested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240419123611.947084-2-libaokun1@huawei.com
Conflicts:
	fs/erofs/internal.h
	fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:23 +08:00
Gao Xiang fddff2670c erofs: keep meta inode into erofs_buf
ANBZ: #11854

commit eb2c5e41be upstream.

So that erofs_read_metadata() can read metadata from other inodes
(e.g. packed inode) as well.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:22 +08:00
Gao Xiang dd8474084e anolis: erofs: add missing map->m_flags = 0 in erofs_map_blocks()
ANBZ: #11854

This was missing in the original backport commit.

Fixes: 34bc2f0eb2 ("erofs: get rid of erofs_map_blocks_flatmode()")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:12 +08:00
Cruz Zhao 98170019bf anolis: sched: don't account util_est for each cfs_rq when group_balancer disabled
ANBZ: #8765

Accounting util_est for each cfs_rq while group balancer is disabled
incurs some overhead. To avoid it, we only account util_est when group
balancer is enabled, and when the switch is toggled we recalculate it
for each cfs_rq at once.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5529
2025-07-21 02:35:12 +00:00
LeoLiu-oc c44eb4e82c anolis: x86/cpufeatures: Fix typo in PARALLAX_E and reformat new feature macros
ANBZ: #22862

This patch makes the following changes to cpufeatures.h:
- Corrects the typo in X86_FEATURE_PARALLAX_E to X86_FEATURE_PARALLAX_EN
- Reformats the newly added feature macros in word5 to maintain consistent
   coding style.

Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5524
2025-07-18 09:31:09 +00:00
Bitao Hu b53c2009c3 anolis: bugfix: nvme: check if the pointer is null before using the tagset
ANBZ: #20656

In cases where an NVMe device fails to create I/O queues due to certain
errors, the kernel does not remove the corresponding NVMe controller.
However, under such conditions, the nvme_dev->ctrl->tagset may become
NULL. To avoid potential NULL pointer dereference, it is necessary to
perform a NULL check on both the tagset and the admin_tagset before use.
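The defensive shape is simple: test the pointer and skip (or fall back) before dereferencing either tagset. A hedged sketch of the guard (the struct layout and helper name are illustrative, not the actual anolis patch):

```c
#include <stddef.h>

struct tag_set_toy { int timeout; };
struct ctrl_toy    { struct tag_set_toy *tagset, *admin_tagset; };

/* Apply a per-device timeout only to tagsets that actually exist;
 * returns how many tagsets were updated. */
static int set_timeout_toy(struct ctrl_toy *ctrl, int timeout)
{
	int updated = 0;

	if (ctrl->tagset) {	/* may be NULL if I/O queue setup failed */
		ctrl->tagset->timeout = timeout;
		updated++;
	}
	if (ctrl->admin_tagset) {
		ctrl->admin_tagset->timeout = timeout;
		updated++;
	}
	return updated;
}
```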

Fixes: 31b981ebbd ("anolis: nvme: support set per device's timeout")

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5528
2025-07-18 05:55:33 +00:00
Stanislav Fomichev 9e6c73f30c bpf: Document EFAULT changes for sockopt
ANBZ: #22860

commit 6b6a23d5d8 upstream.

And add examples for how to correctly handle large optlens.
This is less relevant now when we don't EFAULT anymore, but
that's still the correct thing to do.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-5-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
2025-07-17 07:35:44 +00:00
Yinan Liu 3507962043 selftests/bpf: Update EFAULT {g,s}etsockopt selftests
ANBZ: #22860

commit 989a4a7dbf upstream.

Instead of assuming EFAULT, let's assume the BPF program's
output is ignored.

Remove "getsockopt: deny arbitrary ctx->retval" because it
was actually testing optlen. We have separate set of tests
for retval.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
2025-07-17 07:35:44 +00:00
Yinan Liu 160227a0f9 bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
ANBZ: #22860

commit 29ebbba7d4 upstream.

With the way the hooks are implemented right now, we have a special
condition: an optval larger than PAGE_SIZE will expose only the first 4k
into BPF; any modifications to the optval are ignored. If the BPF program
doesn't handle this condition by resetting optlen to 0, userspace will
get EFAULT.

The intention of the EFAULT was to make it apparent to the
developers that the program is doing something wrong.
However, this inadvertently might affect production workloads
with the BPF programs that are not too careful (i.e., returning EFAULT
for perfectly valid setsockopt/getsockopt calls).

Let's try to minimize the chance of BPF program screwing up userspace
by ignoring the output of those BPF programs (instead of returning
EFAULT to the userspace). pr_info_once those cases to
the dmesg to help with figuring out what's going wrong.
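The userspace-visible change can be modeled in one function: when the program leaves optlen larger than what the kernel exposed, the old code returned -EFAULT, while the new code ignores the program's output (and pr_info_once's the event). A toy model of that epilogue (the return contract is simplified):

```c
#define EFAULT_TOY 14

/* Model of the sockopt hook epilogue: prog_optlen is what the BPF
 * program left in ctx->optlen; max_optlen is what the kernel exposed
 * (at most PAGE_SIZE). */
static int sockopt_epilogue(int prog_optlen, int max_optlen, int new_behavior)
{
	if (prog_optlen > max_optlen) {
		if (new_behavior)
			return 0;	/* ignore BPF output, log once */
		return -EFAULT_TOY;	/* old behavior: fail the syscall */
	}
	return 0;			/* program handled optlen correctly */
}
```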

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
2025-07-17 07:35:44 +00:00
Bo Liu bfec42ac43 anolis: RAS/AMD/ATL: fix error nodes per socket on for Hygon Dhyana processor
ANBZ: #22615

For Hygon Dhyana processors other than Hygon family 18h model 4h, ATL is
used to convert umc_mca_addr to sys_addr. When getting the die id, it uses
the amd_get_nodes_per_socket() function to get the "nodes_per_socket"
value, which yields a wrong value for Hygon CPUs.

To fix it, use the hygon_get_nodes_per_socket() function to get the
correct "nodes_per_socket" value.

Fixes: 8df0979486c2("EDAC/amd64: Use new AMD Address Translation Library")
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Wenhui Fan <fanwh@hygon.cn>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5521
2025-07-17 03:51:21 +00:00
Qiao Ma 3884211d3c anolis: timens: fix start boottime calculate bug when rich container and timens enabled
ANBZ: #12420

When rich container is enabled, the task's start boottime should be
`task->start_boottime - init_tsk->start_boottime`.

When timens is enabled, the task's start boottime should be
`task->start_boottime + timens->boottime_offset`.

When both of them are opened, the task's start boottime should be
`task->start_boottime - init_tsk->start_boottime`, too.
But currently the logic is wrong:
`task->start_boottime + timens->boottime_offset - init_tsk->start_boottime`.

This makes the task's start boottime a very large number, and sometimes
even a negative one. When the ps tool iterates /proc/*/stat, parses the
start time, and then uses localtime() to convert it, ps may segfault
because localtime() cannot process negative numbers and returns NULL.
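The three rules above fit in one small function; the bug was applying the timens offset on top of the rich-container baseline. A toy model of the shown value (times in arbitrary ticks; names are illustrative):

```c
/* Start boottime as it should be shown, per the rules above. */
static long long shown_start_boottime(long long task_bt, long long init_bt,
				      long long timens_off,
				      int rich_container, int timens)
{
	if (rich_container)
		return task_bt - init_bt;	/* baseline wins, no offset */
	if (timens)
		return task_bt + timens_off;
	return task_bt;
}

/* The buggy combination: offset applied on top of the baseline. */
static long long buggy_both(long long task_bt, long long init_bt,
			    long long timens_off)
{
	return task_bt + timens_off - init_bt;
}
```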

Fixes: c46463ddc5 ("fs/proc: apply the time namespace offset to /proc/stat btime")
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5518
2025-07-16 07:35:40 +00:00
Anton Protopopov 750c08658e bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
ANBZ: #22586

commit b34ffb0c6d upstream.

The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.
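The fix pattern: an element pre-allocated before the bucket lock must be handed back to the free list on the -EBUSY path instead of being dropped on the floor. A toy model with a counted free list (names are illustrative, not the htab internals):

```c
#define EBUSY_TOY 16

static int free_list_len = 8;	/* elements available for reuse */

static int alloc_elem(void)	/* take one element off the free list */
{
	if (free_list_len == 0)
		return 0;
	free_list_len--;
	return 1;
}

static void free_elem(void)	/* return an element to the free list */
{
	free_list_len++;
}

/* Update path: allocate first, then try to lock the bucket. */
static int map_update_toy(int lock_succeeds)
{
	if (!alloc_elem())
		return -1;		/* -ENOMEM */
	if (!lock_succeeds) {
		free_elem();		/* the fix: don't leak on -EBUSY */
		return -EBUSY_TOY;
	}
	/* insert into the bucket; the table now owns the element */
	return 0;
}
```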

Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Xiao Long <xiaolong@openanolis.org>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5495
2025-07-16 03:10:48 +00:00
Philo Lu 25a462a690 anolis: Revert "net: missing check virtio"
ANBZ: #22611

This reverts commit 8759fce7fb.

The gso check added in that commit is overly strict for the normal
datapath, resulting in unexpected packet loss with some devices such as
vhost [0].

As that patch only fixed a possible WARN_ON_ONCE() reported by syzbot,
which is considered insignificant, we simply revert it.

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=89add40066f9ed9abe5f7f886fe5789ff7e0c50e

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5519
2025-07-15 21:01:45 +08:00
Yang Rong 1535649a03 anolis: virtio-mem: apply parallel deferred memory online
ANBZ: #18841

We applied parallel deferred memory online to virtio-mem and
observed a significant speedup in the hot-plug process.

We conducted an experiment to hotplug 400G of memory, and
the results were as follows:
- Before applying the patch:
  - Total Time = Origin Hotplug Time = 5537ms (72.24 GB/s)
- After applying the patch (with `parallel_hotplug_ratio=80`):
  - Origin Hotplug Time = 178ms
  - Deferred Parallel Hotplug Time = 1200ms
  - Total Time = 1378ms (76% reduction, 290.28 GB/s)

Lastly, there is an issue regarding the guest's plug requests to the
VMM. The VMM relies on the plug requests sent by the guest to determine
the size of the hot-plugged memory. Therefore, we should defer sending
the plug requests until after the memory has actually been onlined.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 8dc3571250 anolis: mm/memory_hotplug: support parallel deferred memory online
ANBZ: #18841

Memory hotplug is a serial process that adds memory to Linux in the
granularity of memory blocks. We identified two memory initialization
functions that consume significant time when onlining memory blocks:
- `__init_single_page`: initialize the struct page
- `__free_pages_core`: add the page to the buddy allocator

We attempted to execute these two functions in parallel during the
process of hotplugging a memory block. The experimental results showed
that when the memory block size was 1GB, the hotplug speed was
increased by approximately 200%. However, when the memory block size
was 128MB, which is the more commonly used size, the hotplug speed
was even worse than that of serial execution.

Therefore, how to improve the hotplug speed when the memory block size
is 128MB remains a challenge.

Here is my idea:
- Defer the execution of these two functions and their associated
  processing to the final phase of the entire hotplug process, so
  that the hotplug speed will no longer be limited by the memory
  block size.
- Perform parallel execution in the final phase, as previous
  implementations have proven that this can accelerate the hotplug
  process.

We introduce the new online function, `deferred_online_memory`, for
deferring the actual online process of memory blocks.

Additionally, we have added a command-line argument,
parallel_hotplug_ratio, which sets the ratio of parallel workers to
the number of CPUs on the node. When parallel_hotplug_ratio is 0,
the memory online process will no longer be deferred.
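With parallel_hotplug_ratio as described, the deferred-phase worker count can be derived as a simple percentage of the node's CPUs, with 0 disabling deferral entirely. A sketch of that sizing rule (the clamping detail is an assumption, not taken from the patch):

```c
/* Workers for the deferred online phase: ratio is a percentage of the
 * node's CPU count; 0 means "do not defer" (original serial path). */
static int deferred_workers(int node_cpus, int parallel_hotplug_ratio)
{
	int w;

	if (parallel_hotplug_ratio == 0)
		return 0;		/* deferral disabled */
	w = node_cpus * parallel_hotplug_ratio / 100;
	return w > 0 ? w : 1;		/* at least one worker when enabled */
}
```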

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 05a35a41ac anolis: mm/memory_hotplug: refactor online_pages()
ANBZ: #18841

This commit refactors the `online_pages()` function to prepare for
deferred memory online support. `online_pages()` is the core function
for memory hotplug. Initializing struct pages and freeing pages to buddy
are the most time-consuming operations, so we move these operations and
related ones to the deferred phase.

Additionally, since the adjustment of `present_pages` is deferred, and
`auto_movable_zone_for_pfn()` uses `present_pages` to determine which
zone the new memory block belongs to, we introduce `deferred_pages` to
indicate the number of deferred pages. This allows
`auto_movable_zone_for_pfn()` to make decisions based on both
`present_pages` and `deferred_pages`.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 6c4a83f3c6 anolis: mm/memory_hotplug: refactor move_pfn_range_to_zone()
ANBZ: #18841

This commit refactors the `move_pfn_range_to_zone()` function to prepare
for deferred memory online support. `memmap_init_zone()` is a
time-consuming operation, so we move it into the deferred phase for
execution.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 6675639481 anolis: mm/memory_hotplug: refactor adjust_present_page_count()
ANBZ: #18841

This commit refactors the `adjust_present_page_count()` function
to prepare for deferred memory online support. If memory online is
deferred, then the adjustment of the present pages must also be deferred.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 26fe05d3ff anolis: mm/memory_hotplug: add MHP_PHASE_* macros
ANBZ: #18841

We added three macros to represent different phases of memory hotplug.
MHP_PHASE_DEFAULT represents the original process, while MHP_PHASE_PREPARE
and MHP_PHASE_DEFERRED represent the process split into the prepare
phase and the deferred phase. The purpose of this change is to move
time-consuming operations to the deferred phase, thereby effectively
improving the speed of memory hotplug through concurrency.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Cruz Zhao 9117970143 anolis: sched: remove task_group from gb_sd when it got freed
ANBZ: #8765

When a task_group with group_balancer enabled is freed, it should be
removed from the gb_sd to prevent use-after-free.

BTW, hold the gb_lock of the task_group while enabling/disabling
group_balancer.

Fixes: 7f8e0c7133 ("anolis: sched: maintain group balancer task groups")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5512
2025-07-14 13:59:33 +08:00
Breno Leitao 94bd1b3e30 KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()
ANBZ: #12858

commit 49f683b41f upstream.

Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
loads and stores are atomic.  In the extremely unlikely scenario the
compiler tears the stores, it's theoretically possible for KVM to attempt
to get a vCPU using an out-of-bounds index, e.g. if the write is split
into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
257 vCPUs:

  CPU0                              CPU1
  last_boosted_vcpu = 0xff;

                                    (last_boosted_vcpu = 0x100)
                                    last_boosted_vcpu[15:8] = 0x01;
  i = (last_boosted_vcpu = 0x1ff)
                                    last_boosted_vcpu[7:0] = 0x00;

  vcpu = kvm->vcpu_array[0x1ff];

As detected by KCSAN:

  BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]

  write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
		 arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
			arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  value changed: 0x00000012 -> 0x00000000

Fixes: 217ece6129 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fixes: CVE-2024-40953
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5511
2025-07-11 15:16:17 +08:00
Sean Christopherson 9cdc79e445 KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
ANBZ: #12459

commit aa0d42cacf upstream.

Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support
for virtualizing Intel PT via guest/host mode unless BROKEN=y.  There are
myriad bugs in the implementation, some of which are fatal to the guest,
and others which put the stability and health of the host at risk.

For guest fatalities, the most glaring issue is that KVM fails to ensure
tracing is disabled, and *stays* disabled prior to VM-Enter, which is
necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing
is enabled (enforced via a VMX consistency check).  Per the SDM:

  If the logical processor is operating with Intel PT enabled (if
  IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
  IA32_RTIT_CTL" VM-entry control must be 0.

On the host side, KVM doesn't validate the guest CPUID configuration
provided by userspace, and even worse, uses the guest configuration to
decide what MSRs to save/load at VM-Enter and VM-Exit.  E.g. configuring
guest CPUID to enumerate more address ranges than are supported in hardware
will result in KVM trying to passthrough, save, and load non-existent MSRs,
which generates a variety of WARNs, ToPA ERRORs in the host, a potential
deadlock, etc.

Fixes: f99e3daf94 ("KVM: x86: Add Intel PT virtualization work mode")
Cc: stable@vger.kernel.org
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20241101185031.1799556-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: CVE-2024-53135
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5510
2025-07-11 11:10:39 +08:00
Alex Williamson a66c935fc7 vfio/mtty: Enable migration support
ANBZ: #22597

commit 2b88119e35 upstream.

The mtty driver exposes a PCI serial device to userspace and therefore
makes an easy target for a sample device supporting migration.  The device
does not make use of DMA, therefore we can easily claim support for the
migration P2P states, as well as dirty logging.  This implementation also
makes use of PRE_COPY support in order to provide migration stream
compatibility testing, which should generally be considered good practice.

Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20231016224736.2575718-3-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5504
2025-07-10 16:22:26 +08:00
Alex Williamson bdd18c4e7a vfio/mtty: Overhaul mtty interrupt handling
ANBZ: #22597

commit 293fbc2881 upstream.

The mtty driver does not currently conform to the vfio SET_IRQS uAPI.
For example, it claims to support mask and unmask of INTx, but actually
does nothing.  It claims to support AUTOMASK for INTx, but doesn't.  It
fails to teardown eventfds under the full semantics specified by the
SET_IRQS ioctl.  It also fails to teardown eventfds when the device is
closed, leading to memory leaks.  It claims to support the request IRQ,
but doesn't.

Fix all these.

A side effect of this is that QEMU will now report a warning:

vfio <uuid>: Failed to set up UNMASK eventfd signaling for interrupt \
INTX-0: VFIO_DEVICE_SET_IRQS failure: Inappropriate ioctl for device

The fact is that the unmask eventfd was never supported but quietly
failed.  mtty never honored the AUTOMASK behavior, therefore there
was nothing to unmask.  QEMU is verbose about the failure, but
properly falls back to userspace unmasking.

Fixes: 9d1a546c53 ("docs: Sample driver to demonstrate how to use Mediated device framework.")
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20231016224736.2575718-2-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5504
2025-07-10 16:21:52 +08:00
Huang Ying 5e8fab7191 anolis: ccs: add CPU utilization control functionality
ANBZ: #20511

The "scan" command of the cmn700 cache scan misc device file can take
up to several hundreds milliseconds.  The CPU utilization during
command run will be 100%.  Although use cond_sched() during "scan"
running to release CPU resource for other workloads, the high CPU
utilization may be undesirable.

So, in this patch, add a "cpu_percent" parameter to the "set_param"
command of the misc device file.  If "cpu_percent" is less than 100,
schedule_timeout_interruptible() is used while the "scan" command runs
to reduce CPU usage to the specified value.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
2025-07-10 05:33:28 +00:00
Huang Ying 2a431733a2 anolis: ccs: flush repeated problematic addresses with higher frequency
ANBZ: #20511

It may take long to scan all HNF (Fully-coherent Home Node) of a die
because the debug register accessing is slow.  For example, it takes
300-500ms to scan 64 HNF of a die using one core with 100%
utilization.  If some of the addresses go into problematic state again
before the next scanning/flushing, the performance may hurt.

So, in this patch, flush addresses at a higher frequency if they are
repeatedly identified as problematic.  The more times an address is
identified as problematic, the higher its flush frequency, from once
to 4, 16, then N (number of HNFs) times per full scan/flush.

The main costs are the memory needed to record previously identified
problematic addresses and the added code complexity.  So, add a new
parameter named "max_frequ" to the misc device file write command
"set_param".  If "max_frequ=0" is specified, the original direct
flushing mechanism is used, avoiding the extra memory usage and code
complexity.  The parameter can also be used to cap the maximum flush
frequency.

In a specjbb rt (response time) test with fixed injecting rate, the
patch can improve the performance up to 3%.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
2025-07-10 05:33:28 +00:00
Huang Ying 3b2e9daf8e anolis: ccs: add misc device driver cmn700_cache_scan
ANBZ: #20511

This is to mitigate ARM CMN-700 performance erratum 3688582.  For some
versions of ARM CMN-700, if

- multiple CMN-700 instances are used (e.g. a multi-die chip) AND
- the LLC SF (Snoop Filter) holds a cache line in shared MESI state AND
- the LLC cache line is victimized

then any further memory read from the cache line's address will not
fill the LLC but will always go to DDR until the cache line is evicted
or invalidated from the LLC SF.

The details of the erratum can be found at,
https://developer.arm.com/documentation/SDEN-2039384/1900/

To mitigate the erratum, dump the addresses in the SF and LLC that are
in the state above via CMN-700 debug registers (cmn_hns_cfg_slcsf_dbgrd;
details can be found at the following URL), then flush the cache lines
via DC instructions.  The CMN-700 debug registers can be accessed in
EL3 only, so add a customized SMC (Secure Monitor Call) to dump the
potentially problematic addresses, and flush the cache in a misc device
driver.

https://developer.arm.com/documentation/102308/0302/?lang=en

Misc device file (/dev/cmn700_cache_scan by default) reading/writing
is the user space interface.  The input/output format is ASCII text,
which makes the interface easy to use.  The following device file
write commands are supported.

- set_param <param1>=<val1> <param2>=<val2> ...
  set the parameters of scanning and flushing, available parameters
  are as follows:
  - step=[1-128]: number of cache sets to scan in one SMC
  - hnf=[0-127]-[0-127]: hnf (full-coherent home node) to scan
  - sf_check_mesi=[0-1]: whether to check mesi state of SF cache line
  - sf_mesi=[0-3]: the mesi state to check

- scan
  scan according to parameters above and flush the cache line.

- scan_step
  scan for one step only, to reduce latency.

Reading the file returns scanning statistics.  Example output is as follows,

nr_entry_total:            638
nr_entry_max:                3
nr_flush_total:              3
nr_flush_max:                1
us_smc_total:           327594
us_smc_max:                105
cml_nr_scan:                78
cml_nr_entry_total:      65336
cml_nr_entry_max:            5
cml_nr_flush_total:       8188
cml_nr_flush_max:            5
cml_us_smc_total:     25635226
cml_us_smc_max:            395
cml_us_flush_total:     332172
cml_us_flush_max:           41

where cml_ means "cumulative"; that is, cml_xxx_total is the sum of
nr_xxx_total and cml_xxx_max is the maximum of xxx_max across all
scans so far.

[cml_]nr_entry_total: total number of potentially problematic cache
		      line entries.
[cml_]nr_entry_max:   maximum number of potentially problematic cache
                      line entries in one step.
[cml_]nr_flush_total: total number of flushed cache line entries.
[cml_]nr_flush_max:   maximum number of flushed cache line entries in
		      one step.
[cml_]us_smc_total:   total CPU time of SMC in microseconds.
[cml_]us_smc_max:     maximum CPU time of SMC in microseconds.
[cml_]us_flush_total: total CPU time of cache flushing in microseconds.
[cml_]us_flush_max:   maximum CPU time of cache flushing in microseconds.

It's possible that some cache lines may be flushed by mistake.  For
example, an address may satisfy the condition above even though it was
never accessed by another core.  However, all our tested workloads show
a positive or neutral performance impact, which indicates that the
false-positive rate can be kept low.

Performance test shows that the mitigation can improve mysql oltp read
qps (queries per second) up to 17.5%.  And the LLC miss ratio reduces
from 32.7% to 11.3%.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
2025-07-10 05:33:28 +00:00
Sean Christopherson bec2cded1a KVM: Always flush async #PF workqueue when vCPU is being destroyed
ANBZ: #19799

commit 3d75b8aa5c upstream.

Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, e.g. when a VM and all of its vCPUs are being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put.  Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.

Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:

 WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
 CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Workqueue: events async_pf_execute [kvm]
 RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
 Call Trace:
  <TASK>
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>
 ---[ end trace 0000000000000000 ]---
 INFO: task kworker/8:1:251 blocked for more than 120 seconds.
       Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
 Workqueue: events async_pf_execute [kvm]
 Call Trace:
  <TASK>
  __schedule+0x33f/0xa40
  schedule+0x53/0xc0
  schedule_timeout+0x12a/0x140
  __wait_for_common+0x8d/0x1d0
  __flush_work.isra.0+0x19f/0x2c0
  kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
  kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
  kvm_put_kvm+0x1c1/0x320 [kvm]
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>

If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed.  And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.

Note that kvm_check_async_pf_completion() may also take the work item off
the completion queue and so also needs to flush the work queue, as the
work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
on the workqueue could theoretically delay a vCPU due to waiting for the
work to complete, but that's a very, very small chance, and likely a very
small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
new request, i.e. will effectively delay entering the guest, so the
remaining work is really just:

        trace_kvm_async_pf_completed(addr, cr2_or_gpa);

        __kvm_vcpu_wake_up(vcpu);

        mmput(mm);

and mmput() can't drop the last reference to the page tables if the vCPU is
still alive, i.e. the vCPU won't get stuck tearing down page tables.

Add a helper to do the flushing, specifically to deal with "wakeup all"
work items, as they aren't actually work items, i.e. are never placed in a
workqueue.  Trying to flush a bogus workqueue entry rightly makes
__flush_work() complain (kudos to whoever added that sanity check).

Note, commit 5f6de5cbeb ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.

Fixes: af585b921e ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Reviewed-by: Xu Yilun <yilun.xu@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fixes: CVE-2024-26976
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5484
2025-07-10 01:38:21 +00:00
Cruz Zhao a4b5d3580f anolis: sched: update parameters of group balancer root sched domain
ANBZ: #8765

As we will update the root cpumask before we build group balancer sched
domains, we should update the parameters too.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao 5f2005c72b anolis: sched: control the interval of gb_load_balance()
ANBZ: #8765

To reduce turbulence, the interval of gb_load_balance() should be
controlled, so we set it to twice the lower interval.  And to prevent
unnecessary migration, we only migrate task groups when the imbalance
pct exceeds the threshold of 117.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao 717827ca16 anolis: sched: adjust the lower interval of group balancer sched domain
ANBZ: #8765

The goal of the lower interval is to control the balancing interval so
that the scheduler has enough time to do load balancing, so we'd better
determine the lower interval according to the sched domain in which the
group balancer sched domain is located.

So far, the sched domain has the following layers:
 - NUMA
 - DIE
 - MC
 - SMT

When building a group balancer sched domain according to the hardware
topology, we identify the corresponding topological structure and
determine the lower interval according to its span weight.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao 9b5823f38c anolis: sched: traverse the descendants when migrating task groups
ANBZ: #8765

As a group balancer sched domain may have many descendants containing
task groups, when we select task groups to migrate, we should traverse
those descendants, too.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao d4d2d65b6d anolis: sched: set the flags of group balancer sched domain after build
ANBZ: #8765

If we set the flags of a group balancer sched domain when we build it,
the flags of the middle levels may be incorrect. So we should set the
flags after the build.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao f9c7a08dd3 anolis: sched: make soft_cpus_version correct
ANBZ: #8765

When a task forks, we should reset its soft_cpus_version to -1.
When the task group moves to another gb_sd, we should increment the
soft_cpus_version.

Fixes: 4bb9c29b1c ("anolis: sched: modify the interface of attach/detach_tg_to/from_group_balancer_sched_domain")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao cb5766c6e7 anolis: sched: optimize the selecting child gb_sd of task group
ANBZ: #8765

When lowering a task group, we will select a child group balancer sched
domain to migrate it to. Firstly, we should control the interval of
lowering task groups of the parent gb_sd, to avoid causing imbalance.
Secondly, if the load the task group spans is balanced across the child
gb_sds, we should select the one with the lower load.

By the way, fix a bug that used the wrong load when calculating the
load of the task group.

Fixes: bebcfc550d ("anolis: sched: introduce dynamical load balance for group balancer")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00