Commit Graph

Chao Yu 218a9ce97b f2fs: fix to avoid NULL pointer dereference in f2fs_write_end_io()
ANBZ: #20922

commit d8189834d4 upstream.

butt3rflyh4ck reports a bug as below:

When a thread repeatedly calls F2FS_IOC_RESIZE_FS to resize the fs, if the
resize fails, the f2fs kernel thread would invoke a callback function to
update f2fs io info; it would call f2fs_write_end_io and may trigger a
null-ptr-deref in NODE_MAPPING.

general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
RIP: 0010:NODE_MAPPING fs/f2fs/f2fs.h:1972 [inline]
RIP: 0010:f2fs_write_end_io+0x727/0x1050 fs/f2fs/data.c:370
 <TASK>
 bio_endio+0x5af/0x6c0 block/bio.c:1608
 req_bio_endio block/blk-mq.c:761 [inline]
 blk_update_request+0x5cc/0x1690 block/blk-mq.c:906
 blk_mq_end_request+0x59/0x4c0 block/blk-mq.c:1023
 lo_complete_rq+0x1c6/0x280 drivers/block/loop.c:370
 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1101
 __do_softirq+0x1d4/0x8ef kernel/softirq.c:571
 run_ksoftirqd kernel/softirq.c:939 [inline]
 run_ksoftirqd+0x31/0x60 kernel/softirq.c:931
 smpboot_thread_fn+0x659/0x9e0 kernel/smpboot.c:164
 kthread+0x33e/0x440 kernel/kthread.c:379
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308

The root cause is the race case below, which can leave dirty metadata
in f2fs after the filesystem is remounted read-only:

Thread A				Thread B
- f2fs_ioc_resize_fs
 - f2fs_readonly   --- return false
 - f2fs_resize_fs
					- f2fs_remount
					 - write_checkpoint
					 - set f2fs as ro
  - free_segment_range
   - update meta_inode's data

Then, if f2fs_put_super() fails to write_checkpoint due to the readonly
status, meta_inode's dirty data will be written back after node_inode
is put; finally, f2fs_write_end_io will dereference a NULL pointer via
sbi->node_inode.

Thread A				IRQ context
- f2fs_put_super
 - write_checkpoint fails
 - iput(node_inode)
 - node_inode = NULL
 - iput(meta_inode)
  - write_inode_now
   - f2fs_write_meta_page
					- f2fs_write_end_io
					 - NODE_MAPPING(sbi)
					 : access NULL pointer on node_inode
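The shape of the fix is the familiar double-check pattern: re-validate the read-only state after serializing with remount, closing the window between the early f2fs_readonly() check and the actual resize. A toy userspace model of that pattern (all names are illustrative, not the actual f2fs code):

```c
#include <errno.h>

/* Toy model: ro_flag is the filesystem's read-only state, and
 * remount_ro() stands in for a concurrent remount that wins the race
 * between the early check and the locked section. */
static void remount_ro(int *ro_flag)
{
	*ro_flag = 1;
}

/* With only the early check, resize would proceed on a now-ro fs and
 * dirty metadata; re-checking under the lock rejects it instead. */
static int resize_fs(int *ro_flag)
{
	if (*ro_flag)
		return -EROFS;		/* early check, racy on its own */

	remount_ro(ro_flag);		/* concurrent remount slips in here */

	/* double check under the lock closes the window */
	if (*ro_flag)
		return -EROFS;

	return 0;			/* would update metadata here */
}
```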

Fixes: b4b10061ef ("f2fs: refactor resize_fs to avoid meta updates in progress")
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Closes: https://lore.kernel.org/r/1684480657-2375-1-git-send-email-yangtiezhu@loongson.cn
Tested-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: CVE-2023-2898
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5565
2025-08-01 11:21:00 +08:00
Namjae Jeon 3c65405ba3 exfat: check if filename entries exceeds max filename length
ANBZ: #21452

[ Upstream commit d42334578e ]

exfat_extract_uni_name copies characters from a given file name entry into
the 'uniname' variable. This variable is actually defined on the stack of
the exfat_readdir() function. According to the definition of
the 'exfat_uni_name' type, the file name should be limited to 255 characters
(+ null terminator space), but the exfat_get_uniname_from_ext_entry()
function can write more characters because there is no check whether the
filename entries exceed the max filename length. This patch adds a check
to stop copying filename characters once the max filename length is exceeded.
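The added guard amounts to refusing to copy name characters past the 255-character limit, no matter how many name entries the on-disk directory supplies. A hedged userspace sketch (the 255-char limit and 15 chars per name entry come from the exFAT format; the function name is made up):

```c
#include <stddef.h>

#define EXFAT_MAX_NAME_LEN 255	/* exFAT max filename length (chars) */
#define ENTRY_NAME_CHARS   15	/* name chars carried per file-name entry */

/* Copy at most EXFAT_MAX_NAME_LEN characters even if a malformed
 * directory provides more name entries than a valid name allows. */
static size_t copy_uni_name(size_t n_entries)
{
	size_t copied = 0;

	for (size_t i = 0; i < n_entries; i++) {
		size_t chunk = ENTRY_NAME_CHARS;

		if (copied + chunk > EXFAT_MAX_NAME_LEN)
			chunk = EXFAT_MAX_NAME_LEN - copied; /* clamp last chunk */
		copied += chunk;
		if (copied == EXFAT_MAX_NAME_LEN)
			break;		/* the added check: stop copying */
	}
	return copied;
}
```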

Cc: stable@vger.kernel.org
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reported-by: Maxim Suhanov <dfirblog@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Fixes: CVE-2023-4273
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5564
2025-08-01 10:07:23 +08:00
Weilin Tong 3ab761d2a7 anolis: mm: honor THP defrag setting for direct file collapse
ANBZ: #23087

Currently, our kernel supports collapsing file-backed pages into
hugepages during file reads by prioritizing the file for collapse
in khugepaged. The allocation flags for collapsing are determined
based on khugepaged's defrag setting.

With this patch, the /sys/kernel/mm/transparent_hugepage/defrag
setting is honored for direct file collapse, ensuring the defrag
behavior for file direct collapse uses the same allocation policy
as other THP paths.

Signed-off-by: Weilin Tong <tongweilin@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5557
2025-07-31 03:33:28 +00:00
Geliang Tang c6dd932e4d nvme: drop unused variable ctrl in nvme_setup_cmd
ANBZ: #22935

commit 3a605e32a7 upstream.

The variable 'ctrl' became useless since the code using it was dropped
from nvme_setup_cmd() in the commit 292ddf67bbd5 ("nvme: increment
request genctr on completion"). Fix it to get rid of this compilation
warning in the nvme-5.17 branch:

 drivers/nvme/host/core.c: In function ‘nvme_setup_cmd’:
 drivers/nvme/host/core.c:993:20: warning: unused variable ‘ctrl’ [-Wunused-variable]
   struct nvme_ctrl *ctrl = nvme_req(req)->ctrl;
                     ^~~~

Fixes: 292ddf67bbd5 ("nvme: increment request genctr on completion")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5544
2025-07-31 02:49:39 +00:00
Vidya Sagar cb85aee82d PCI/MSI: Set device flag indicating only 32-bit MSI support
ANBZ: #23086

commit 2053230af1 upstream.

The MSI-X Capability requires devices to support 64-bit Message Addresses,
but the MSI Capability can support either 32- or 64-bit addresses.

Previously, we set dev->no_64bit_msi for a few broken devices that
advertise 64-bit MSI support but don't correctly support it.

In addition, check the MSI "64-bit Address Capable" bit for all devices and
set dev->no_64bit_msi for devices that don't advertise 64-bit support.
This allows msi_verify_entries() to catch arch code defects that assign
64-bit addresses when they're not supported.

The warning is helpful to find defects like the one fixed by
https://lore.kernel.org/r/20201117165312.25847-1-vidyas@nvidia.com

[bhelgaas: set no_64bit_msi in pci_msi_init(), commit log]
Link: https://lore.kernel.org/r/20201124105035.24573-1-vidyas@nvidia.com
Link: https://lore.kernel.org/r/20201203185110.1583077-4-helgaas@kernel.org
Signed-off-by: Vidya Sagar <vidyas@nvidia.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
2025-07-29 16:46:59 +08:00
Bjorn Helgaas 62ad266bac PCI/MSI: Move MSI/MSI-X flags updaters to msi.c
ANBZ: #23086

commit 830dfe88ea upstream.

pci_msi_set_enable() and pci_msix_clear_and_set_ctrl() are only used from
msi.c, so move them from drivers/pci/pci.h to msi.c.  No functional change
intended.

Link: https://lore.kernel.org/r/20201203185110.1583077-3-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
2025-07-29 16:46:55 +08:00
Bjorn Helgaas f9d7c2d60f PCI/MSI: Move MSI/MSI-X init to msi.c
ANBZ: #23086

commit cbc40d5c33 upstream.

Move pci_msi_setup_pci_dev(), which disables MSI and MSI-X interrupts, from
probe.c to msi.c so it's with all the other MSI code and more consistent
with other capability initialization.  This means we must compile msi.c
always, even without CONFIG_PCI_MSI, so wrap the rest of msi.c in an #ifdef
and adjust the Makefile accordingly.  No functional change intended.

Link: https://lore.kernel.org/r/20201203185110.1583077-2-helgaas@kernel.org
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: wangkaiyuan <wangkaiyuan@inspur.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5556
2025-07-29 16:46:51 +08:00
Gao Xiang 8b2d460737 anolis: configs: enable CONFIG_EROFS_FS_BACKED_BY_FILE
ANBZ: #11854

To support mounting without loopback block devices.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5548
2025-07-28 08:43:58 +00:00
xiongmengbiao 71a37bdc5f anolis: crypto: ccp: use time_after() instead of time_before() to checks schedule condition
ANBZ: #22926

psp_mutex_lock_timeout() used time_before(jiffies, last_je), which
yields the CPU too early.

Fix it to use time_after(jiffies, last_je + msecs_to_jiffies(100)) so
that the full 100ms wait is honored.
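The distinction matters because jiffies wraps around; the kernel's comparison macros handle wraparound via signed subtraction. A userspace re-implementation sketch of the corrected condition (last_je, the 100ms window, and the helper name follow the description above, not the actual Hygon driver source):

```c
#include <stdint.h>

/* Wraparound-safe comparisons in the spirit of include/linux/jiffies.h,
 * re-implemented here for a 32-bit toy jiffies counter. */
#define time_after(a, b)	((int32_t)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

/* Corrected schedule condition: only give up the CPU once the full
 * 100ms window (msecs100, expressed in jiffies) has elapsed. */
static int should_resched(uint32_t jiffies, uint32_t last_je, uint32_t msecs100)
{
	return time_after(jiffies, last_je + msecs100);
}
```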

Signed-off-by: xiongmengbiao <xiongmengbiao@hygon.cn>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5532
2025-07-23 06:27:52 +00:00
Tatsuyuki Ishi 7fc3abd8b2 erofs: impersonate the opener's credentials when accessing backing file
ANBZ: #11854

commit 905eeb2b7c33adda23a966aeb811ab4cb9e62031 upstream.

Previously, file operations on a file-backed mount used the current
process' credentials to access the backing FD. Attempting to do so on
Android led to SELinux denials, as ACL rules on the backing file (e.g.
/system/apex/foo.apex) are restricted to a small set of processes.
Arguably, this error is redundant and leaks implementation details, as
access to files on a mount is already ACL'ed by path.

Instead, override to use the opener's cred when accessing the backing
file. This makes the behavior similar to a loop-backed mount, which
uses kworker cred when accessing the backing file and does not cause
SELinux denials.
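The mechanism is the kernel's usual override_creds()/revert_creds() dance: capture the opener's credentials at open time, then temporarily install them around each backing-file operation. A toy userspace model of the save/override/revert flow (an integer uid stands in for struct cred; all names are illustrative):

```c
/* Toy credential model: current_cred is "the credentials in effect". */
static int current_cred = 1000;	/* the process issuing the read */

static int override_creds_toy(int new_cred)
{
	int old = current_cred;

	current_cred = new_cred;	/* act as the opener from here on */
	return old;
}

static void revert_creds_toy(int old_cred)
{
	current_cred = old_cred;
}

/* Perform a backing-file access with the opener's saved credentials,
 * returning the credential the access check would have observed. */
static int backing_read_toy(int opener_cred)
{
	int old = override_creds_toy(opener_cred);
	int cred_used = current_cred;

	revert_creds_toy(old);		/* restore the caller's creds */
	return cred_used;
}
```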

Signed-off-by: Tatsuyuki Ishi <ishitatsuyuki@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250612-b4-erofs-impersonate-v1-1-8ea7d6f65171@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Max Kellermann baac228835 fs/erofs/fileio: call erofs_onlinefolio_split() after bio_add_folio()
ANBZ: #11854

commit bbfe756dc3062c1e934f06e5ba39c239aa953b92 upstream.

If bio_add_folio() fails (because it is full),
erofs_fileio_scan_folio() needs to submit the I/O request via
erofs_fileio_rq_submit() and allocate a new I/O request with an empty
`struct bio`.  Then it retries the bio_add_folio() call.

However, at this point, erofs_onlinefolio_split() has already been
called which increments `folio->private`; the retry will call
erofs_onlinefolio_split() again, but there will never be a matching
erofs_onlinefolio_end() call.  This leaves the folio locked forever
and all waiters will be stuck in folio_wait_bit_common().

This bug has been added by commit ce63cb62d7 ("erofs: support
unencoded inodes for fileio"), but was practically unreachable because
there was room for 256 folios in the `struct bio` - until commit
9f74ae8c9a ("erofs: shorten bvecs[] for file-backed mounts") which
reduced the array capacity to 16 folios.

It was now trivial to trigger the bug by manually invoking readahead
from userspace, e.g.:

 posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

This should be fixed by invoking erofs_onlinefolio_split() only after
bio_add_folio() has succeeded.  This is safe: asynchronous completions
invoking erofs_onlinefolio_end() will not unlock the folio because
erofs_fileio_scan_folio() is still holding a reference to be released
by erofs_onlinefolio_end() at the end.
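In other words, the split (the `folio->private` bump) must sit after the retry loop, inside the success path, so a submit-and-retry never double-counts. A toy model of the corrected ordering (counters stand in for bio/folio state; names are hypothetical):

```c
#define BIO_CAP 16		/* bvec capacity after the shortening commit */

static int bio_slots_used;
static int split_calls;		/* erofs_onlinefolio_split() invocations */

/* Returns 1 if the folio fit into the current bio, 0 if it is full. */
static int bio_add_folio_toy(void)
{
	if (bio_slots_used >= BIO_CAP)
		return 0;
	bio_slots_used++;
	return 1;
}

static void rq_submit_toy(void)
{
	bio_slots_used = 0;	/* submit and start a fresh bio */
}

/* Corrected ordering: split exactly once, after bio_add_folio succeeds,
 * so it is matched by exactly one erofs_onlinefolio_end() later. */
static void scan_folio_toy(void)
{
	while (!bio_add_folio_toy())
		rq_submit_toy();	/* full: submit, retry with new bio */
	split_calls++;
}
```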

Fixes: ce63cb62d7 ("erofs: support unencoded inodes for fileio")
Fixes: 9f74ae8c9a ("erofs: shorten bvecs[] for file-backed mounts")
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Gao Xiang <xiang@kernel.org>
Tested-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20250428230933.3422273-1-max.kellermann@ionos.com
Signed-off-by: Gao Xiang <xiang@kernel.org>
Conflicts:
	fs/erofs/fileio.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Gao Xiang 6828af4e38 erofs: shorten bvecs[] for file-backed mounts
ANBZ: #11854

commit 9f74ae8c9a upstream.

BIO_MAX_VECS is too large for __GFP_NOFAIL allocation.  We could use
a mempool (since BIOs can always proceed), but it seems overly
complicated for now.

Reviewed-by: Chao Yu <chao@kernel.org>
Conflicts:
	fs/erofs/fileio.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20250107082825.74242-1-hsiangkao@linux.alibaba.com
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Gao Xiang e86624db2a erofs: use buffered I/O for file-backed mounts by default
ANBZ: #11854

commit 6422cde1b0 upstream.

For many use cases (e.g. container images are just fetched from remote),
performance will be impacted if underlay page cache is up-to-date but
direct i/o flushes dirty pages first.

Instead, let's use buffered I/O by default to keep in sync with loop
devices and add a (re)mount option to explicitly give a try to use
direct I/O if supported by the underlying files.

The container startup time is improved as below:
[workload] docker.io/library/workpress:latest
                                     unpack        1st run  non-1st runs
EROFS snapshotter buffered I/O file  4.586404265s  0.308s   0.198s
EROFS snapshotter direct I/O file    4.581742849s  2.238s   0.222s
EROFS snapshotter loop               4.596023152s  0.346s   0.201s
Overlayfs snapshotter                5.382851037s  0.206s   0.214s

Fixes: fb17675026 ("erofs: add file-backed mount support")
Cc: Derek McGowan <derek@mcg.dev>
Reviewed-by: Chao Yu <chao@kernel.org>
Conflicts:
	fs/erofs/fileio.c
	fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241212134336.2059899-1-hsiangkao@linux.alibaba.com
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:34:00 +08:00
Gao Xiang e263d8ad31 erofs: support unencoded inodes for fileio
ANBZ: #11854

commit ce63cb62d7 upstream.

Since EROFS only needs to handle read requests in simple contexts,
just use vfs_iocb_iter_read() directly for data I/Os.

Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240905093031.2745929-1-hsiangkao@linux.alibaba.com
Conflicts:
	fs/erofs/Makefile
	fs/erofs/data.c
	fs/erofs/inode.c
	fs/erofs/internal.h
	fs/erofs/zdata.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 13:33:56 +08:00
Gao Xiang e0754eb85d erofs: add file-backed mount support
ANBZ: #11854

commit fb17675026 upstream.

It actually has been around for years: For containers and other sandbox
use cases, there will be thousands (and even more) of authenticated
(sub)images running on the same host, unlike OS images.

Of course, all scenarios can use the same EROFS on-disk format, but
bdev-backed mounts just work well for OS images since golden data is
dumped into real block devices.  However, it's somewhat hard for
container runtimes to manage and isolate so many unnecessary virtual
block devices safely and efficiently [1]: they just look like a burden
to orchestrators and file-backed mounts are preferred indeed.  There
were already enough attempts such as Incremental FS, the original
ComposeFS and PuzzleFS acting in the same way for immutable fses.  As
for current EROFS users, ComposeFS, containerd and Android APEXs will
be directly benefited from it.

On the other hand, previous experimental feature "erofs over fscache"
was once also intended to provide a similar solution (inspired by
Incremental FS discussion [2]), but the following facts show file-backed
mounts will be a better approach:
 - Fscache infrastructure has recently been moved into new Netfslib
   which is an unexpected dependency to EROFS really, although it
   originally claims "it could be used for caching other things such as
   ISO9660 filesystems too." [3]

 - It takes an unexpectedly long time to upstream Fscache/Cachefiles
   enhancements.  For example, the failover feature took more than
   one year, and the daemonless feature is still far behind now;

 - Ongoing HSM "fanotify pre-content hooks" [4] together with this will
   perfectly supersede "erofs over fscache" in a simpler way since
   developers (mainly containerd folks) could leverage their existing
   caching mechanism entirely in userspace instead of strictly following
   the predefined in-kernel caching tree hierarchy.

After "fanotify pre-content hooks" lands upstream to provide the same
functionality, "erofs over fscache" will be removed (as an EROFS-internal
improvement, EROFS will no longer have to bother with on-demand fetching
and/or caching improvements.)

[1] https://github.com/containers/storage/pull/2039
[2] https://lore.kernel.org/r/CAOQ4uxjbVxnubaPjVaGYiSwoGDTdpWbB=w_AeM6YM=zVixsUfQ@mail.gmail.com
[3] https://docs.kernel.org/filesystems/caching/fscache.html
[4] https://lore.kernel.org/r/cover.1723670362.git.josef@toxicpanda.com

Closes: https://github.com/containers/composefs/issues/144
Reviewed-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240830032840.3783206-1-hsiangkao@linux.alibaba.com
Conflicts:
	fs/erofs/Kconfig
	fs/erofs/data.c
	fs/erofs/inode.c
	fs/erofs/internal.h
	fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:23 +08:00
Baokun Li f8b1bb49b3 erofs: get rid of erofs_fs_context
ANBZ: #11854

commit 07abe43a28 upstream.

Instead of allocating the erofs_sb_info in fill_super() allocate it during
erofs_init_fs_context() and ensure that erofs can always have the info
available during erofs_kill_sb(). After this erofs_fs_context is no longer
needed, replace ctx with sbi, no functional changes.

Suggested-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240419123611.947084-2-libaokun1@huawei.com
Conflicts:
	fs/erofs/internal.h
	fs/erofs/super.c
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:23 +08:00
Gao Xiang fddff2670c erofs: keep meta inode into erofs_buf
ANBZ: #11854

commit eb2c5e41be upstream.

So that erofs_read_metadata() can read metadata from other inodes
(e.g. packed inode) as well.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:22 +08:00
Gao Xiang dd8474084e anolis: erofs: add missing map->m_flags = 0 in erofs_map_blocks()
ANBZ: #11854

This was missing in the original backport commit.

Fixes: 34bc2f0eb2 ("erofs: get rid of erofs_map_blocks_flatmode()")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5535
2025-07-22 11:08:12 +08:00
Cruz Zhao 98170019bf anolis: sched: don't account util_est for each cfs_rq when group_balancer disabled
ANBZ: #8765

Accounting util_est for each cfs_rq while group balancer is disabled
incurs some overhead. To avoid it, we only account util_est when group
balancer is enabled, and when the switch is toggled we recalculate it
for each cfs_rq at once.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5529
2025-07-21 02:35:12 +00:00
LeoLiu-oc c44eb4e82c anolis: x86/cpufeatures: Fix typo in PARALLAX_E and reformat new feature macros
ANBZ: #22862

This patch makes the following changes to cpufeatures.h:
- Corrects the typo in X86_FEATURE_PARALLAX_E to X86_FEATURE_PARALLAX_EN
- Reformats the newly added feature macros in word5 to maintain consistent
   coding style.

Signed-off-by: LeoLiu-oc <leoliu-oc@zhaoxin.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5524
2025-07-18 09:31:09 +00:00
Bitao Hu b53c2009c3 anolis: bugfix: nvme: check if the pointer is null before using the tagset
ANBZ: #20656

In cases where an NVMe device fails to create I/O queues due to certain
errors, the kernel does not remove the corresponding NVMe controller.
However, under such conditions, the nvme_dev->ctrl->tagset may become
NULL. To avoid potential NULL pointer dereference, it is necessary to
perform a NULL check on both the tagset and the admin_tagset before use.
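The defensive shape is simple: test the pointer and skip (or fall back) before dereferencing either tagset. A hedged sketch of the guard (the struct layout and helper name are illustrative, not the actual anolis patch):

```c
#include <stddef.h>

struct tag_set_toy { int timeout; };
struct ctrl_toy    { struct tag_set_toy *tagset, *admin_tagset; };

/* Apply a per-device timeout only to tagsets that actually exist;
 * returns how many tagsets were updated. */
static int set_timeout_toy(struct ctrl_toy *ctrl, int timeout)
{
	int updated = 0;

	if (ctrl->tagset) {	/* may be NULL if I/O queue setup failed */
		ctrl->tagset->timeout = timeout;
		updated++;
	}
	if (ctrl->admin_tagset) {
		ctrl->admin_tagset->timeout = timeout;
		updated++;
	}
	return updated;
}
```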

Fixes: 31b981ebbd ("anolis: nvme: support set per device's timeout")

Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5528
2025-07-18 05:55:33 +00:00
Stanislav Fomichev 9e6c73f30c bpf: Document EFAULT changes for sockopt
ANBZ: #22860

commit 6b6a23d5d8 upstream.

And add examples for how to correctly handle large optlens.
This is less relevant now when we don't EFAULT anymore, but
that's still the correct thing to do.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-5-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
2025-07-17 07:35:44 +00:00
Yinan Liu 3507962043 selftests/bpf: Update EFAULT {g,s}etsockopt selftests
ANBZ: #22860

commit 989a4a7dbf upstream.

Instead of assuming EFAULT, let's assume the BPF program's
output is ignored.

Remove "getsockopt: deny arbitrary ctx->retval" because it
was actually testing optlen. We have separate set of tests
for retval.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-3-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
2025-07-17 07:35:44 +00:00
Yinan Liu 160227a0f9 bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
ANBZ: #22860

commit 29ebbba7d4 upstream.

With the way the hooks are implemented right now, we have a special
condition: an optval larger than PAGE_SIZE will expose only the first 4k
into BPF; any modifications to the optval are ignored. If the BPF program
doesn't handle this condition by resetting optlen to 0, userspace will
get EFAULT.

The intention of the EFAULT was to make it apparent to the
developers that the program is doing something wrong.
However, this inadvertently might affect production workloads
with the BPF programs that are not too careful (i.e., returning EFAULT
for perfectly valid setsockopt/getsockopt calls).

Let's try to minimize the chance of BPF program screwing up userspace
by ignoring the output of those BPF programs (instead of returning
EFAULT to the userspace). pr_info_once those cases to
the dmesg to help with figuring out what's going wrong.
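The userspace-visible change can be modeled in one function: when the program leaves optlen larger than what the kernel exposed, the old code returned -EFAULT, while the new code ignores the program's output (and pr_info_once's the event). A toy model of that epilogue (the return contract is simplified):

```c
#define EFAULT_TOY 14

/* Model of the sockopt hook epilogue: prog_optlen is what the BPF
 * program left in ctx->optlen; max_optlen is what the kernel exposed
 * (at most PAGE_SIZE). */
static int sockopt_epilogue(int prog_optlen, int max_optlen, int new_behavior)
{
	if (prog_optlen > max_optlen) {
		if (new_behavior)
			return 0;	/* ignore BPF output, log once */
		return -EFAULT_TOY;	/* old behavior: fail the syscall */
	}
	return 0;			/* program handled optlen correctly */
}
```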

Fixes: 0d01da6afc ("bpf: implement getsockopt and setsockopt hooks")
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Yinan Liu <yinan@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5523
2025-07-17 07:35:44 +00:00
Bo Liu bfec42ac43 anolis: RAS/AMD/ATL: fix error nodes per socket on for Hygon Dhyana processor
ANBZ: #22615

For Hygon Dhyana processors other than Hygon family 18h model 4h, ATL is
used to convert umc_mca_addr to sys_addr. When getting the die id, it uses
the amd_get_nodes_per_socket() function to get the "nodes_per_socket"
value, which yields a wrong value for Hygon CPUs.

To fix it, use the hygon_get_nodes_per_socket() function to get the
correct "nodes_per_socket" value.

Fixes: 8df0979486c2("EDAC/amd64: Use new AMD Address Translation Library")
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Wenhui Fan <fanwh@hygon.cn>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5521
2025-07-17 03:51:21 +00:00
Qiao Ma 3884211d3c anolis: timens: fix start boottime calculate bug when rich container and timens enabled
ANBZ: #12420

When rich container is enabled, the task's start boottime should be
`task->start_boottime - init_tsk->start_boottime`.

When timens is enabled, the task's start boottime should be
`task->start_boottime + timens->boottime_offset`.

When both of them are opened, the task's start boottime should be
`task->start_boottime - init_tsk->start_boottime`, too.
But currently the logic is wrong:
`task->start_boottime + timens->boottime_offset - init_tsk->start_boottime`.

This makes the task's start boottime a very large number, and sometimes
even a negative one. When the ps tool iterates /proc/*/stat, parses the
start time, and then uses localtime() to convert it, ps may segfault
because localtime() cannot process negative numbers and returns NULL.
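The three rules above fit in one small function; the bug was applying the timens offset on top of the rich-container baseline. A toy model of the shown value (times in arbitrary ticks; names are illustrative):

```c
/* Start boottime as it should be shown, per the rules above. */
static long long shown_start_boottime(long long task_bt, long long init_bt,
				      long long timens_off,
				      int rich_container, int timens)
{
	if (rich_container)
		return task_bt - init_bt;	/* baseline wins, no offset */
	if (timens)
		return task_bt + timens_off;
	return task_bt;
}

/* The buggy combination: offset applied on top of the baseline. */
static long long buggy_both(long long task_bt, long long init_bt,
			    long long timens_off)
{
	return task_bt + timens_off - init_bt;
}
```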

Fixes: c46463ddc5 ("fs/proc: apply the time namespace offset to /proc/stat btime")
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Liu Song <liusong@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5518
2025-07-16 07:35:40 +00:00
Anton Protopopov 750c08658e bpf: fix a memory leak in the LRU and LRU_PERCPU hash maps
ANBZ: #22586

commit b34ffb0c6d upstream.

The LRU and LRU_PERCPU maps allocate a new element on update before locking the
target hash table bucket. Right after that the maps try to lock the bucket.
If this fails, then maps return -EBUSY to the caller without releasing the
allocated element. This makes the element untracked: it doesn't belong to
either of free lists, and it doesn't belong to the hash table, so can't be
re-used; this eventually leads to the permanent -ENOMEM on LRU map updates,
which is unexpected. Fix this by returning the element to the local free list
if bucket locking fails.
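The fix pattern: an element pre-allocated before the bucket lock must be handed back to the free list on the -EBUSY path instead of being dropped on the floor. A toy model with a counted free list (names are illustrative, not the htab internals):

```c
#define EBUSY_TOY 16

static int free_list_len = 8;	/* elements available for reuse */

static int alloc_elem(void)	/* take one element off the free list */
{
	if (free_list_len == 0)
		return 0;
	free_list_len--;
	return 1;
}

static void free_elem(void)	/* return an element to the free list */
{
	free_list_len++;
}

/* Update path: allocate first, then try to lock the bucket. */
static int map_update_toy(int lock_succeeds)
{
	if (!alloc_elem())
		return -1;		/* -ENOMEM */
	if (!lock_succeeds) {
		free_elem();		/* the fix: don't leak on -EBUSY */
		return -EBUSY_TOY;
	}
	/* insert into the bucket; the table now owns the element */
	return 0;
}
```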

Fixes: 20b6cc34ea ("bpf: Avoid hashtab deadlock with map_locked")
Signed-off-by: Anton Protopopov <aspsk@isovalent.com>
Link: https://lore.kernel.org/r/20230522154558.2166815-1-aspsk@isovalent.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

Signed-off-by: Xiao Long <xiaolong@openanolis.org>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5495
2025-07-16 03:10:48 +00:00
Philo Lu 25a462a690 anolis: Revert "net: missing check virtio"
ANBZ: #22611

This reverts commit 8759fce7fb.

The gso check added in that commit is overly strict for the normal
datapath, resulting in unexpected packet loss with some devices such as
vhost [0].

As that patch only fixed a possible WARN_ON_ONCE() reported by syzbot,
which is considered insignificant, we simply revert it.

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=89add40066f9ed9abe5f7f886fe5789ff7e0c50e

Signed-off-by: Philo Lu <lulie@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5519
2025-07-15 21:01:45 +08:00
Yang Rong 1535649a03 anolis: virtio-mem: apply parallel deferred memory online
ANBZ: #18841

We applied parallel deferred memory online to virtio-mem and
observed a significant speedup in the hot-plug process.

We conducted an experiment to hotplug 400G of memory, and
the results were as follows:
- Before applying the patch:
  - Total Time = Origin Hotplug Time = 5537ms (72.24 GB/s)
- After applying the patch (with `parallel_hotplug_ratio=80`):
  - Origin Hotplug Time = 178ms
  - Deferred Parallel Hotplug Time = 1200ms
  - Total Time = 1378ms (76% reduction, 290.28 GB/s)

Lastly, there is an issue regarding the guest's plug requests to the
VMM. The VMM relies on the plug requests sent by the guest to determine
the size of the hot-plugged memory. Therefore, we should defer sending
the plug requests until after the memory has actually been onlined.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 8dc3571250 anolis: mm/memory_hotplug: support parallel deferred memory online
ANBZ: #18841

Memory hotplug is a serial process that adds memory to Linux in the
granularity of memory blocks. We identified two memory initialization
functions that consume significant time when onlining memory blocks:
- `__init_single_page`: initialize the struct page
- `__free_pages_core`: add the page to the buddy allocator

We attempted to execute these two functions in parallel during the
process of hotplugging a memory block. The experimental results showed
that when the memory block size was 1GB, the hotplug speed was
increased by approximately 200%. However, when the memory block size
was 128MB, which is the more commonly used size, the hotplug speed
was even worse than that of serial execution.

Therefore, how to improve the hotplug speed when the memory block size
is 128MB remains a challenge.

Here is my idea:
- Defer the execution of these two functions and their associated
  processing to the final phase of the entire hotplug process, so
  that the hotplug speed will no longer be limited by the memory
  block size.
- Perform parallel execution in the final phase, as previous
  implementations have proven that this can accelerate the hotplug
  process.

We introduce the new online function, `deferred_online_memory`, for
deferring the actual online process of memory blocks.

Additionally, we have added a command-line argument,
parallel_hotplug_ratio, which sets the ratio of parallel workers to
the number of CPUs on the node. When parallel_hotplug_ratio is 0,
the memory online process will no longer be deferred.
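With parallel_hotplug_ratio as described, the deferred-phase worker count can be derived as a simple percentage of the node's CPUs, with 0 disabling deferral entirely. A sketch of that sizing rule (the clamping detail is an assumption, not taken from the patch):

```c
/* Workers for the deferred online phase: ratio is a percentage of the
 * node's CPU count; 0 means "do not defer" (original serial path). */
static int deferred_workers(int node_cpus, int parallel_hotplug_ratio)
{
	int w;

	if (parallel_hotplug_ratio == 0)
		return 0;		/* deferral disabled */
	w = node_cpus * parallel_hotplug_ratio / 100;
	return w > 0 ? w : 1;		/* at least one worker when enabled */
}
```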

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 05a35a41ac anolis: mm/memory_hotplug: refactor online_pages()
ANBZ: #18841

This commit refactors the `online_pages()` function to prepare for
deferred memory online support. `online_pages()` is the core function
for memory hotplug. Initializing struct pages and freeing pages to buddy
are the most time-consuming operations, so we move these operations and
related ones to the deferred phase.

Additionally, since the adjustment of `present_pages` is deferred, and
`auto_movable_zone_for_pfn()` uses `present_pages` to determine which
zone the new memory block belongs to, we introduce `deferred_pages` to
indicate the number of deferred pages. This allows
`auto_movable_zone_for_pfn()` to make decisions based on both
`present_pages` and `deferred_pages`.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 6c4a83f3c6 anolis: mm/memory_hotplug: refactor move_pfn_range_to_zone()
ANBZ: #18841

This commit refactors the `move_pfn_range_to_zone()` function to prepare
for deferred memory online support. `memmap_init_zone()` is a
time-consuming operation, so we move it into the deferred phase for
execution.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 6675639481 anolis: mm/memory_hotplug: refactor adjust_present_page_count()
ANBZ: #18841

This commit refactors the `adjust_present_page_count()` function
to prepare for deferred memory online support. If memory online is
deferred, then the adjustment of the present pages must also be deferred.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Yang Rong 26fe05d3ff anolis: mm/memory_hotplug: add MHP_PHASE_* macros
ANBZ: #18841

We added three macros to represent different phases of memory hotplug.
MHP_PHASE_DEFAULT represents the original process, while MHP_PHASE_PREPARE
and MHP_PHASE_DEFERRED represent the process split into the prepare
phase and the deferred phase. The purpose of this change is to move
time-consuming operations to the deferred phase, thereby effectively
improving the speed of memory hotplug through concurrency.

Signed-off-by: Yang Rong <youngrong@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4622
2025-07-15 01:37:35 +00:00
Cruz Zhao 9117970143 anolis: sched: remove task_group from gb_sd when it got freed
ANBZ: #8765

When a task_group with group_balancer enabled is freed, it should be
removed from the gb_sd to prevent use-after-free.

BTW, hold the gb_lock of the task_group while enabling/disabling
group_balancer.

Fixes: 7f8e0c7133 ("anolis: sched: maintain group balancer task groups")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5512
2025-07-14 13:59:33 +08:00
Breno Leitao 94bd1b3e30 KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()
ANBZ: #12858

commit 49f683b41f upstream.

Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
loads and stores are atomic.  In the extremely unlikely scenario the
compiler tears the stores, it's theoretically possible for KVM to attempt
to get a vCPU using an out-of-bounds index, e.g. if the write is split
into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
257 vCPUs:

  CPU0                              CPU1
  last_boosted_vcpu = 0xff;

                                    (last_boosted_vcpu = 0x100)
                                    last_boosted_vcpu[15:8] = 0x01;
  i = (last_boosted_vcpu = 0x1ff)
                                    last_boosted_vcpu[7:0] = 0x00;

  vcpu = kvm->vcpu_array[0x1ff];

As detected by KCSAN:

  BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]

  write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
		 arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
			arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  value changed: 0x00000012 -> 0x00000000

Fixes: 217ece6129 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fixes: CVE-2024-40953
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5511
2025-07-11 15:16:17 +08:00
Sean Christopherson 9cdc79e445 KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
ANBZ: #12459

commit aa0d42cacf upstream.

Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support
for virtualizing Intel PT via guest/host mode unless BROKEN=y.  There are
myriad bugs in the implementation, some of which are fatal to the guest,
and others which put the stability and health of the host at risk.

For guest fatalities, the most glaring issue is that KVM fails to ensure
tracing is disabled, and *stays* disabled prior to VM-Enter, which is
necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing
is enabled (enforced via a VMX consistency check).  Per the SDM:

  If the logical processor is operating with Intel PT enabled (if
  IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load
  IA32_RTIT_CTL" VM-entry control must be 0.

On the host side, KVM doesn't validate the guest CPUID configuration
provided by userspace, and even worse, uses the guest configuration to
decide what MSRs to save/load at VM-Enter and VM-Exit.  E.g. configuring
guest CPUID to enumerate more address ranges than are supported in hardware
will result in KVM trying to passthrough, save, and load non-existent MSRs,
which generates a variety of WARNs, ToPA ERRORs in the host, a potential
deadlock, etc.

Fixes: f99e3daf94 ("KVM: x86: Add Intel PT virtualization work mode")
Cc: stable@vger.kernel.org
Cc: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Tested-by: Adrian Hunter <adrian.hunter@intel.com>
Message-ID: <20241101185031.1799556-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: CVE-2024-53135
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5510
2025-07-11 11:10:39 +08:00
Alex Williamson a66c935fc7 vfio/mtty: Enable migration support
ANBZ: #22597

commit 2b88119e35 upstream.

The mtty driver exposes a PCI serial device to userspace and therefore
makes an easy target for a sample device supporting migration.  The device
does not make use of DMA, therefore we can easily claim support for the
migration P2P states, as well as dirty logging.  This implementation also
makes use of PRE_COPY support in order to provide migration stream
compatibility testing, which should generally be considered good practice.

Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20231016224736.2575718-3-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5504
2025-07-10 16:22:26 +08:00
Alex Williamson bdd18c4e7a vfio/mtty: Overhaul mtty interrupt handling
ANBZ: #22597

commit 293fbc2881 upstream.

The mtty driver does not currently conform to the vfio SET_IRQS uAPI.
For example, it claims to support mask and unmask of INTx, but actually
does nothing.  It claims to support AUTOMASK for INTx, but doesn't.  It
fails to teardown eventfds under the full semantics specified by the
SET_IRQS ioctl.  It also fails to teardown eventfds when the device is
closed, leading to memory leaks.  It claims to support the request IRQ,
but doesn't.

Fix all these.

A side effect of this is that QEMU will now report a warning:

vfio <uuid>: Failed to set up UNMASK eventfd signaling for interrupt \
INTX-0: VFIO_DEVICE_SET_IRQS failure: Inappropriate ioctl for device

The fact is that the unmask eventfd was never supported but quietly
failed.  mtty never honored the AUTOMASK behavior, therefore there
was nothing to unmask.  QEMU is verbose about the failure, but
properly falls back to userspace unmasking.

Fixes: 9d1a546c53 ("docs: Sample driver to demonstrate how to use Mediated device framework.")
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20231016224736.2575718-2-alex.williamson@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5504
2025-07-10 16:21:52 +08:00
Huang Ying 5e8fab7191 anolis: ccs: add CPU utilization control functionality
ANBZ: #20511

The "scan" command of the cmn700 cache scan misc device file can take
up to several hundreds milliseconds.  The CPU utilization during
command run will be 100%.  Although use cond_sched() during "scan"
running to release CPU resource for other workloads, the high CPU
utilization may be undesirable.

So, in this patch, add a "cpu_percent" parameter to the "set_param"
command of the misc device file.  If "cpu_percent" is less than 100,
schedule_timeout_interruptible() is used while the "scan" command runs
to reduce CPU usage to the specified value.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
2025-07-10 05:33:28 +00:00
Huang Ying 2a431733a2 anolis: ccs: flush repeated problematic addresses with higher frequency
ANBZ: #20511

It may take long to scan all HNF (Fully-coherent Home Node) of a die
because the debug register accessing is slow.  For example, it takes
300-500ms to scan 64 HNF of a die using one core with 100%
utilization.  If some of the addresses go into problematic state again
before the next scanning/flushing, the performance may hurt.

So, in this patch, flush addresses at a higher frequency if they are
repeatedly identified as problematic.  The more times an address is
identified as problematic, the higher its flush frequency, from once
to 4, 16, then N (number of HNFs) times per full scan/flush.

The main costs are the memory needed to record previously identified
problematic addresses and the added code complexity.  So, add a new
parameter named "max_frequ" to the misc device file write command
"set_param".  If "max_frequ=0" is specified, the original direct
flushing mechanism is used, avoiding the extra memory usage and code
complexity.  The parameter can also be used to cap the maximum flush
frequency.

In a specjbb rt (response time) test with fixed injecting rate, the
patch can improve the performance up to 3%.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
2025-07-10 05:33:28 +00:00
Huang Ying 3b2e9daf8e anolis: ccs: add misc device driver cmn700_cache_scan
ANBZ: #20511

This is to mitigate ARM CMN-700 performance erratum 3688582.  For some
versions of ARM CMN-700, if

- multiple CMN-700 instances are used (e.g. a multi-die chip) AND
- the LLC SF (Snoop Filter) holds a cache line in shared MESI state AND
- the LLC cache line is victimized

then any further memory read from the cache line's address will not
fill the LLC but will always go to DDR until the cache line is evicted
or invalidated from the LLC SF.

The details of the erratum can be found at,
https://developer.arm.com/documentation/SDEN-2039384/1900/

To mitigate the erratum, dump the addresses in the SF and LLC that are
in the state above via CMN-700 debug registers (cmn_hns_cfg_slcsf_dbgrd;
details can be found at the following URL), then flush the cache lines
via DC instructions.  The CMN-700 debug registers can be accessed in
EL3 only, so add a customized SMC (Secure Monitor Call) to dump the
potentially problematic addresses, and flush the cache in a misc device
driver.

https://developer.arm.com/documentation/102308/0302/?lang=en

Misc device file (/dev/cmn700_cache_scan by default) reading/writing
is the user space interface.  The input/output format is ASCII text,
which makes the interface easy to use.  The following device file
write commands are supported.

- set_param <param1>=<val1> <param2>=<val2> ...
  set the parameters of scanning and flushing, available parameters
  are as follows:
  - step=[1-128]: number of cache sets to scan in one SMC
  - hnf=[0-127]-[0-127]: hnf (full-coherent home node) to scan
  - sf_check_mesi=[0-1]: whether to check mesi state of SF cache line
  - sf_mesi=[0-3]: the mesi state to check

- scan
  scan according to parameters above and flush the cache line.

- scan_step
  scan for one step only, to reduce latency.

Reading the file returns scanning statistics.  Example output is as follows,

nr_entry_total:            638
nr_entry_max:                3
nr_flush_total:              3
nr_flush_max:                1
us_smc_total:           327594
us_smc_max:                105
cml_nr_scan:                78
cml_nr_entry_total:      65336
cml_nr_entry_max:            5
cml_nr_flush_total:       8188
cml_nr_flush_max:            5
cml_us_smc_total:     25635226
cml_us_smc_max:            395
cml_us_flush_total:     332172
cml_us_flush_max:           41

where cml_ means "cumulative"; that is, cml_xxx_total is the sum of
nr_xxx_total and cml_xxx_max is the maximum of xxx_max across all
scans so far.

[cml_]nr_entry_total: total number of potentially problematic cache
		      line entries.
[cml_]nr_entry_max:   maximum number of potentially problematic cache
                      line entries in one step.
[cml_]nr_flush_total: total number of flushed cache line entries.
[cml_]nr_flush_max:   maximum number of flushed cache line entries in
		      one step.
[cml_]us_smc_total:   total CPU time of SMC in microseconds.
[cml_]us_smc_max:     maximum CPU time of SMC in microseconds.
[cml_]us_flush_total: total CPU time of cache flushing in microseconds.
[cml_]us_flush_max:   maximum CPU time of cache flushing in microseconds.

It's possible that some cache lines may be flushed by mistake.  For
example, an address may satisfy the condition above even though it was
never accessed by another core.  However, all our tested workloads show
a positive or neutral performance impact, which indicates that the
false-positive rate can be kept low.

Performance test shows that the mitigation can improve mysql oltp read
qps (queries per second) up to 17.5%.  And the LLC miss ratio reduces
from 32.7% to 11.3%.

Signed-off-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5427
2025-07-10 05:33:28 +00:00
Sean Christopherson bec2cded1a KVM: Always flush async #PF workqueue when vCPU is being destroyed
ANBZ: #19799

commit 3d75b8aa5c upstream.

Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, e.g. when a VM and all of its vCPUs are being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put.  Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.

Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock.  async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:

 WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
 Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
 CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
 Workqueue: events async_pf_execute [kvm]
 RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
 Call Trace:
  <TASK>
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>
 ---[ end trace 0000000000000000 ]---
 INFO: task kworker/8:1:251 blocked for more than 120 seconds.
       Tainted: G        W          6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/8:1     state:D stack:0     pid:251   ppid:2      flags:0x00004000
 Workqueue: events async_pf_execute [kvm]
 Call Trace:
  <TASK>
  __schedule+0x33f/0xa40
  schedule+0x53/0xc0
  schedule_timeout+0x12a/0x140
  __wait_for_common+0x8d/0x1d0
  __flush_work.isra.0+0x19f/0x2c0
  kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
  kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
  kvm_put_kvm+0x1c1/0x320 [kvm]
  async_pf_execute+0x198/0x260 [kvm]
  process_one_work+0x145/0x2d0
  worker_thread+0x27e/0x3a0
  kthread+0xba/0xe0
  ret_from_fork+0x2d/0x50
  ret_from_fork_asm+0x11/0x20
  </TASK>

If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed.  And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.

Note that kvm_check_async_pf_completion() may also take the work item off
the completion queue and so also needs to flush the work queue, as the
work will not be seen by kvm_clear_async_pf_completion_queue().  Waiting
on the workqueue could theoretically delay a vCPU due to waiting for the
work to complete, but that's a very, very small chance, and likely a very
small delay.  kvm_arch_async_page_present_queued() unconditionally makes a
new request, i.e. will effectively delay entering the guest, so the
remaining work is really just:

        trace_kvm_async_pf_completed(addr, cr2_or_gpa);

        __kvm_vcpu_wake_up(vcpu);

        mmput(mm);

and mmput() can't drop the last reference to the page tables if the vCPU is
still alive, i.e. the vCPU won't get stuck tearing down page tables.

Add a helper to do the flushing, specifically to deal with "wakeup all"
work items, as they aren't actually work items, i.e. are never placed in a
workqueue.  Trying to flush a bogus workqueue entry rightly makes
__flush_work() complain (kudos to whoever added that sanity check).

Note, commit 5f6de5cbeb ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.

Fixes: af585b921e ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Reviewed-by: Xu Yilun <yilun.xu@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Fixes: CVE-2024-26976
Signed-off-by: Bin Guo <guobin@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5484
2025-07-10 01:38:21 +00:00
Cruz Zhao a4b5d3580f anolis: sched: update parameters of group balancer root sched domain
ANBZ: #8765

As we will update the root cpumask before we build group balancer sched
domains, we should update the parameters too.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao 5f2005c72b anolis: sched: control the interval of gb_load_balance()
ANBZ: #8765

To reduce turbulence, the interval of gb_load_balance() should be
controlled, so we set it to twice the lower interval.  And to prevent
unnecessary migration, we only migrate task groups when the imbalance
pct exceeds the threshold of 117.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao 717827ca16 anolis: sched: adjust the lower interval of group balancer sched domain
ANBZ: #8765

The goal of the lower interval is to control the balancing interval so
that the scheduler has enough time to do load balancing, so we'd better
determine the lower interval according to the sched domain in which the
group balancer sched domain is located.

So far, the sched domain has the following layers:
 - NUMA
 - DIE
 - MC
 - SMT

When building a group balancer sched domain according to the hardware
topology, we identify the corresponding topological structure and
determine the lower interval according to its span weight.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao 9b5823f38c anolis: sched: traverse the descendants when migrating task groups
ANBZ: #8765

As a group balancer sched domain may have many descendants containing
task groups, when we select task groups to migrate, we should traverse
those descendants, too.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao d4d2d65b6d anolis: sched: set the flags of group balancer sched domain after build
ANBZ: #8765

If we set the flags of a group balancer sched domain when we build it,
the flags of the middle levels may be incorrect. So we should set the
flags after the build.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao f9c7a08dd3 anolis: sched: make soft_cpus_version correct
ANBZ: #8765

When a task forks, we should reset its soft_cpus_version to -1.
When the task group moves to another gb_sd, we should increment the
soft_cpus_version.

Fixes: 4bb9c29b1c ("anolis: sched: modify the interface of attach/detach_tg_to/from_group_balancer_sched_domain")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00
Cruz Zhao cb5766c6e7 anolis: sched: optimize the selecting child gb_sd of task group
ANBZ: #8765

When lowering a task group, we will select a child group balancer sched
domain to migrate it to. Firstly, we should control the interval of
lowering task groups of the parent gb_sd, to avoid causing imbalance.
Secondly, if the load the task group spans is balanced across the child
gb_sds, we should select the one with the lower load.

By the way, fix a bug that used the wrong load when calculating the
load of the task group.

Fixes: bebcfc550d ("anolis: sched: introduce dynamical load balance for group balancer")
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5434
2025-07-09 11:13:24 +00:00