anolis-cloud-kernel

Commit Graph

Author	SHA1	Message	Date
Miklos Szeredi	76819e6780	fuse: don't truncate cached, mutated symlink ANBZ: #20444 [ Upstream commit `b4c173dfbb` ] Fuse allows the value of a symlink to change and this property is exploited by some filesystems (e.g. CVMFS). It has been observed, that sometimes after changing the symlink contents, the value is truncated to the old size. This is caused by fuse_getattr() racing with fuse_reverse_inval_inode(). fuse_reverse_inval_inode() updates the fuse_inode's attr_version, which results in fuse_change_attributes() exiting before updating the cached attributes This is okay, as the cached attributes remain invalid and the next call to fuse_change_attributes() will likely update the inode with the correct values. The reason this causes problems is that cached symlinks will be returned through page_get_link(), which truncates the symlink to inode->i_size. This is correct for filesystems that don't mutate symlinks, but in this case it causes bad behavior. The solution is to just remove this truncation. This can cause a regression in a filesystem that relies on supplying a symlink larger than the file size, but this is unlikely. If that happens we'd need to make this behavior conditional. Reported-by: Laura Promberger <laura.promberger@cern.ch> Tested-by: Sam Lewis <samclewis@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://lore.kernel.org/r/20250220100258.793363-1-mszeredi@redhat.com Reviewed-by: Bernd Schubert <bschubert@ddn.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/5081	2025-04-22 01:26:36 +00:00
Jens Axboe	d7a3a27c63	fs: add batch and poll flags to the uring_cmd_iopoll() handler ANBZ: #11744 commit `de97fcb303` upstream. We need the poll_flags to know how to poll for the IO, and we should have the batch structure in preparation for supporting batched completions with iopoll. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4160	2024-12-09 06:40:29 +00:00
Jens Axboe	ddd4974732	block: add a struct io_comp_batch argument to fops->iopoll() ANBZ: #11744 commit `5a72e899ce` upstream. struct io_comp_batch contains a list head and a completion handler, which will allow batches of IO to be completed more efficiently. For now, no functional changes in this patch, we just define the io_comp_batch structure and add the argument to the file_operations iopoll handler. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> [joe: apply changes for all bio_poll() and fix conflicts] Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4160	2024-12-09 06:40:29 +00:00
Amir Goldstein	cd3678b1f6	fs: create kiocb_{start,end}_write() helpers ANBZ: #12172 commit f83a32351efd6c6676fdabd5ab0e680500bc48e0 stable. Commit `ed0360bbab` upstream. aio, io_uring, cachefiles and overlayfs, all open code an ugly variant of file_{start,end}_write() to silence lockdep warnings. Create helpers for this lockdep dance so we can use the helpers in all the callers. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-4-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4171	2024-12-02 05:13:28 +00:00
Christoph Hellwig	b8f1c9a55b	block: switch polling to be bio based ANBZ: #11744 commit `3e08773c38` upstream. Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio. Polling for the bio itself leads to a few advantages: - the cookie construction can be made entirely private in blk-mq.c - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues - keeping the device and the cookie inside the bio allows trivially supporting polling of BIOs remapped by stacking drivers - a lot of code to propagate the cookie back up the submission path can be removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> [joe: fix all conflicts for *submit_bio() and blk_poll()] Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4084	2024-11-26 01:40:15 +00:00
Christoph Hellwig	429040d248	block: replace the spin argument to blk_iopoll with a flags argument ANBZ: #11744 commit `ef99b2d376` upstream. Switch the boolean spin argument to blk_poll to passing a set of flags instead. This will allow to control polling behavior in a more fine grained way. Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Mark Wunderlich <mark.wunderlich@intel.com> Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de [axboe: adapt to changed io_uring iopoll] Signed-off-by: Jens Axboe <axboe@kernel.dk> [joe: fix conflicts for fs/block_dev.c and io_uring/io_uring.c] Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/4084	2024-11-26 01:40:15 +00:00
Zhihao Cheng	97b23b74db	vfs: Don't evict inode under the inode lru traversing context ANBZ: #11123 commit 03880af02a78bc9a98b5a581f529cf709c88a9b8 stable. commit `2a0629834c` upstream. The inode reclaiming process(See function prune_icache_sb) collects all reclaimable inodes and mark them with I_FREEING flag at first, at that time, other processes will be stuck if they try getting these inodes (See function find_inode_fast), then the reclaiming process destroy the inodes by function dispose_list(). Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may do inode lookup in the inode evicting callback function, if the inode lookup is operated under the inode lru traversing context, deadlock problems may happen. Case 1: In function ext4_evict_inode(), the ea inode lookup could happen if ea_inode feature is enabled, the lookup process will be stuck under the evicting context like this: 1. File A has inode i_reg and an ea inode i_ea 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea 3. 
Then, following three processes running like this: PA PB echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // i_reg is added into lru, lru->i_ea->i_reg prune_icache_sb list_lru_walk_one inode_lru_isolate i_ea->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(i_reg) spin_unlock(&i_reg->i_lock) spin_unlock(lru_lock) rm file A i_reg->nlink = 0 iput(i_reg) // i_reg->nlink is 0, do evict ext4_evict_inode ext4_xattr_delete_inode ext4_xattr_inode_dec_ref_all ext4_xattr_inode_iget ext4_iget(i_ea->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(i_ea) ----→ AA deadlock dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&i_ea->i_state) Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file deleting process holds BASEHD's wbuf->io_mutex while getting the xattr inode, which could race with inode reclaiming process(The reclaiming process could try locking BASEHD's wbuf->io_mutex in inode evicting function), then an ABBA deadlock problem would happen as following: 1. File A has inode ia and a xattr(with inode ixa), regular file B has inode ib and a xattr. 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa 3. 
Then, following three processes running like this: PA PB PC echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // ib and ia are added into lru, lru->ixa->ib->ia prune_icache_sb list_lru_walk_one inode_lru_isolate ixa->i_state |= I_FREEING // set inode state inode_lru_isolate __iget(ib) spin_unlock(&ib->i_lock) spin_unlock(lru_lock) rm file B ib->nlink = 0 rm file A iput(ia) ubifs_evict_inode(ia) ubifs_jnl_delete_inode(ia) ubifs_jnl_write_inode(ia) make_reservation(BASEHD) // Lock wbuf->io_mutex ubifs_iget(ixa->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(ixa) | iput(ib) // ib->nlink is 0, do evict | ubifs_evict_inode | ubifs_jnl_delete_inode(ib) ↓ ubifs_jnl_write_inode ABBA deadlock ←-----make_reservation(BASEHD) dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&ixa->i_state) Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to pin the inode in memory while inode_lru_isolate() reclaims its pages instead of using ordinary inode reference. This way inode deletion cannot be triggered from inode_lru_isolate() thus avoiding the deadlock. evict() is made to wait for I_LRU_ISOLATING to be cleared before proceeding with inode cleanup. 
Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022 Fixes: `e50e5129f3` ("ext4: xattr-in-inode support") Fixes: `7959cf3a75` ("ubifs: journal: Handle xattrs like files") Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Fixes conflicts] Fixes: CVE-2024-45003 Signed-off-by: Xiao Long <xiaolong@openanolis.org> Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3896	2024-09-29 11:23:55 +00:00
Shawn Wang	fa73812bca	anolis: mm: kidled: Add a new KIDLED_YOUNG flag to replace REFERENCED checking ANBZ: #9779 Now kidled will skip the dentry/inode with REFERENCED set, which means the age of these entries is always 0. To reclaim these entries and avoid pollution of the REFERENCED flag, introduce a new KIDLED_YOUNG flag for kidled to indicate the entry has been referenced since the last kidled scanning. KIDLED_YOUNG should be set at the same time as REFERENCED. More specifically, it will be set in retain_dentry() and inode_lru_list_add() for now. Besides, kidled and coldpgs modules are modified to check KIDLED_YOUNG instead of REFERENCED: When kidled scans an entry with KIDLED_YOUNG set, it clears this flag and sets the age of this entry to 0. When coldpgs scans an entry with KIDLED_YOUNG set, it just ignores this entry. Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3720	2024-08-20 13:40:53 +00:00
Eric Biggers	1ac641314d	fscrypt: stop using keyrings subsystem for fscrypt_master_key ANBZ: #9613 commit `d7e7b9af10` upstream. The approach of fs/crypto/ internally managing the fscrypt_master_key structs as the payloads of "struct key" objects contained in a "struct key" keyring has outlived its usefulness. The original idea was to simplify the code by reusing code from the keyrings subsystem. However, several issues have arisen that can't easily be resolved: - When a master key struct is destroyed, blk_crypto_evict_key() must be called on any per-mode keys embedded in it. (This started being the case when inline encryption support was added.) Yet, the keyrings subsystem can arbitrarily delay the destruction of keys, even past the time the filesystem was unmounted. Therefore, currently there is no easy way to call blk_crypto_evict_key() when a master key is destroyed. Currently, this is worked around by holding an extra reference to the filesystem's request_queue(s). But it was overlooked that the request_queue reference is not guaranteed to pin the corresponding blk_crypto_profile too; for device-mapper devices that support inline crypto, it doesn't. This can cause a use-after-free. - When the last inode that was using an incompletely-removed master key is evicted, the master key removal is completed by removing the key struct from the keyring. Currently this is done via key_invalidate(). Yet, key_invalidate() takes the key semaphore. This can deadlock when called from the shrinker, since in fscrypt_ioctl_add_key(), memory is allocated with GFP_KERNEL under the same semaphore. - More generally, the fact that the keyrings subsystem can arbitrarily delay the destruction of keys (via garbage collection delay, or via random processes getting temporary key references) is undesirable, as it means we can't strictly guarantee that all secrets are ever wiped. - Doing the master key lookups via the keyrings subsystem results in the key_permission LSM hook being called. 
fscrypt doesn't want this, as all access control for encrypted files is designed to happen via the files themselves, like any other files. The workaround which SELinux users are using is to change their SELinux policy to grant key search access to all domains. This works, but it is an odd extra step that shouldn't really have to be done. The fix for all these issues is to change the implementation to what I should have done originally: don't use the keyrings subsystem to keep track of the filesystem's fscrypt_master_key structs. Instead, just store them in a regular kernel data structure, and rework the reference counting, locking, and lifetime accordingly. Retain support for RCU-mode key lookups by using a hash table. Replace fscrypt_sb_free() with fscrypt_sb_delete(), which releases the keys synchronously and runs a bit earlier during unmount, so that block devices are still available. A side effect of this patch is that neither the master keys themselves nor the filesystem keyrings will be listed in /proc/keys anymore. ("Master key users" and the master key users keyrings will still be listed.) However, this was mostly an implementation detail, and it was intended just for debugging purposes. I don't know of anyone using it. This patch does not change how "master key users" (->mk_users) works; that still uses the keyrings subsystem. That is still needed for key quotas, and changing that isn't necessary to solve the issues listed above. If we decide to change that too, it would be a separate patch. I've marked this as fixing the original commit that added the fscrypt keyring, but as noted above the most important issue that this patch fixes wasn't introduced until the addition of inline encryption support. 
Fixes: `22d94f493b` ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20220901193208.138056-2-ebiggers@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3679	2024-08-12 08:20:08 +00:00
Jeff Layton	1fc0d4123a	filelock: new helper: vfs_inode_has_locks ANBZ: #9667 commit `ab1ddef98a` upstream. Ceph has a need to know whether a particular inode has any locks set on it. It's currently tracking that by a num_locks field in its filp->private_data, but that's problematic as it tries to decrement this field when releasing locks and that can race with the file being torn down. Add a new vfs_inode_has_locks helper that just returns whether any locks are currently held on the inode. Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Stable-dep-of: `461ab10ef7` ("ceph: switch to vfs_inode_has_locks() to fix file lock bug") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3632	2024-08-09 19:32:34 +08:00
Jeff Layton	b1e2b3525c	fs: add ctime accessors infrastructure ANBZ: #9649 [ Upstream commit `9b6304c1d5` ] struct timespec64 has unused bits in the tv_nsec field that can be used for other purposes. In future patches, we're going to change how the inode->i_ctime is accessed in certain inodes in order to make use of them. In order to do that safely though, we'll need to eradicate raw accesses of the inode->i_ctime field from the kernel. Add new accessor functions for the ctime that we use to replace them. Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Message-Id: <20230705185812.579118-2-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: `5923d6686a` ("smb3: fix caching of ctime on setxattr") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3625	2024-08-06 06:57:29 +00:00
Christian Brauner	fe58af9413	fs: use consistent setgid checks in is_sxid() ANBZ: #9613 commit `8d84e39d76` upstream. Now that we made the VFS setgid checking consistent an inode can't be marked security irrelevant even if the setgid bit is still set. Make this function consistent with all other helpers. Note that enforcing consistent setgid stripping checks for file modification and mode- and ownership changes will cause the setgid bit to be lost in more cases than used to be the case. If an unprivileged user wrote to a non-executable setgid file that they don't have privilege over, the setgid bit will be dropped. This will lead to temporary failures in some xfstests until they have been updated. Reported-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3604	2024-08-06 02:52:58 +00:00
Amir Goldstein	93e0de95b9	vfs: fix copy_file_range() averts filesystem freeze protection ANBZ: #9613 commit `10bc8e4af6` upstream. [backport comments for pre v5.15: - ksmbd mentions are irrelevant - ksmbd hunks were dropped - sb_write_started() is missing - assert was dropped ] Commit `868f9f2f8e` ("vfs: fix copy_file_range() regression in cross-fs copies") removed fallback to generic_copy_file_range() for cross-fs cases inside vfs_copy_file_range(). To preserve behavior of nfsd and ksmbd server-side-copy, the fallback to generic_copy_file_range() was added in nfsd and ksmbd code, but that call is missing sb_start_write(), fsnotify hooks and more. Ideally, nfsd and ksmbd would pass a flag to vfs_copy_file_range() that will take care of the fallback, but that code would be subtle and we got vfs_copy_file_range() logic wrong too many times already. Instead, add a flag to explicitly request vfs_copy_file_range() to perform only generic_copy_file_range() and let nfsd and ksmbd use this flag only in the fallback path. This choice keeps the logic changes to a minimum in the non-nfsd/ksmbd code paths to reduce the risk of further regressions. Fixes: `868f9f2f8e` ("vfs: fix copy_file_range() regression in cross-fs copies") Tested-by: Namjae Jeon <linkinjeon@kernel.org> Tested-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3580	2024-08-01 01:39:28 +00:00
Bart Van Assche	3a19a0c752	fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio ANBZ: #9613 commit `b820de741a` upstream. If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the following kernel warning appears: WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8 Call trace: kiocb_set_cancel_fn+0x9c/0xa8 ffs_epfile_read_iter+0x144/0x1d0 io_read+0x19c/0x498 io_issue_sqe+0x118/0x27c io_submit_sqes+0x25c/0x5fc __arm64_sys_io_uring_enter+0x104/0xab0 invoke_syscall+0x58/0x11c el0_svc_common+0xb4/0xf4 do_el0_svc+0x2c/0xb0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x68/0xb4 el0t_64_sync+0x1a4/0x1a8 Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is submitted by libaio. Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3580	2024-08-01 01:39:28 +00:00
Hao Ge	28c59d3c6d	fs: fix undefined behavior in bit shift for SB_NOUSER ANBZ: #9613 commit `f15afbd34d` upstream. Shifting signed 32-bit value by 31 bits is undefined, so changing significant bit to unsigned. It was spotted by UBSAN. So let's just fix this by using the BIT() helper for all SB_* flags. Fixes: `e462ec50cb` ("VFS: Differentiate mount flags (MS_) from internal superblock flags") Signed-off-by: Hao Ge <gehao@kylinos.cn> Message-Id: <20230424051835.374204-1-gehao@kylinos.cn> [brauner@kernel.org: use BIT() for all SB_ flags] Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3575	2024-07-31 09:41:17 +00:00
Akinobu Mita	2a89bb1491	libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value ANBZ: #9495 [ Upstream commit `2e41f274f9` ] Patch series "fix error when writing negative value to simple attribute files". The simple attribute files do not accept a negative value since the commit `488dac0c92` ("libfs: fix error cast of negative value in simple_attr_write()"), but some attribute files want to accept a negative value. This patch (of 3): The simple attribute files do not accept a negative value since the commit `488dac0c92` ("libfs: fix error cast of negative value in simple_attr_write()"), so we have to use a 64-bit value to write a negative value. This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value. Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com Fixes: `488dac0c92` ("libfs: fix error cast of negative value in simple_attr_write()") Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reported-by: Zhao Gongyi <zhaogongyi@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Yicong Yang <yangyicong@hisilicon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3475	2024-07-12 05:49:57 +00:00
Guixin Liu	3e6454be7f	anolis: kabi: revert all kabi_use ANBZ: #9320 Because we no longer maintain kABI consistency between major versions, revert all kabi use. The reserved space is intended to ensure kABI compatibility between minor versions and their corresponding major version. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Reviewed-by: Yi Tao <escape@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/3344	2024-06-28 05:46:26 +00:00
Xu Yu	bd617fc27c	anolis: mm: support fast reflink ANBZ: #7740 Reflink takes too much time to copy-on-write the dirty pages. Redis blocks while waiting for completion, which is not acceptable for productizing the feature. This patch handles the copy-on-write in an asynchronous thread. PMD-aligned entries are not handled directly and are left for background work. [ Xu Yu: - Employ async fork framework. - Fix wrong usage of pmd_write. - Move work_struct of fast reflink from vm_area_struct to address_space, and update corresponding routines. ] Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2794	2024-03-01 01:09:03 +00:00
Rongwei Wang	4a85c08e8c	anolis: mm: pgtable_share: switch to per-VMA pgtable share data ANBZ: #6632 For anonymous shared memory, constructing the page table share data per mapping is sufficient, because other processes have no chance to create new VMAs based on the same file descriptor. But identified shared memory is also a user of page table sharing, and its file descriptor is visible to others, who can create new shared page tables. These two groups of shared page tables are different and invisible to each other. All in all, the per-VMA approach has these advantages: - Avoids conflicts when creating shared page tables with different attributes; - Avoids failures caused by differing VMA areas; - Possibly a minor semantic point: unrelated tasks cannot see the same mm_struct; - A simpler structure. So move the pgtable share data into the vma struct. The per-VMA method can serve both kinds of shared memory described above. Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2550	2023-12-29 12:54:08 +00:00
Rongwei Wang	e7de7cbc6e	anolis: mm: pgtable_share: introduce CONFIG_PAGETABLE_SHARE ANBZ: #6632 CONFIG_PAGETABLE_SHARE is introduced to select whether to enable this feature. It needs to select CONFIG_ARCH_USES_HIGH_VMA_FLAGS. Signed-off-by: Xin Hao <xhao@linux.alibaba.com> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2550	2023-12-29 12:54:08 +00:00
Khalid Aziz	fb2edd234e	mm/ptshare: Create new mm struct for page table sharing ANBZ: #6632 cherry-picked from https://lore.kernel.org/linux-mm/1fd52581f4e4960a4d07cb9784d56659ec139d3c.1682453344.git.khalid.aziz@oracle.com/ When a process passes MAP_SHARED_PT flag to mmap(), create a new mm struct to hold the shareable page table entries for the newly mapped region. This new mm is not associated with a task. Its lifetime is until the last shared mapping is deleted. This patch also adds a new pointer "ptshare_data" to struct address_space which points to the data structure that will contain pointer to this newly created mm along with properties of the shared mapping. ptshare_data maintains a refcount for the shared mapping so that it can be cleaned up upon last unmap. [Rongwei Wang: fix vm_flags_set() error] Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2550	2023-12-29 12:54:08 +00:00
Shiyang Ruan	e4d1b9eecc	fsdax: dedup file range to use a compare function ANBZ: #7740 commit `6f7db3894a` upstream. With dax we cannot deal with readpage() etc. So, we create a dax comparison function which is similar with vfs_dedupe_file_range_compare(). And introduce dax_remap_file_range_prep() for filesystem use. Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.wiliams@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: fs/remap_range.c include/linux/dax.h include/linux/fs.h Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2504	2023-12-26 09:51:37 +00:00
Christoph Hellwig	f55bb06bba	fs: explicitly unregister per-superblock BDIs ANBZ: #7146 commit `0b3ea0926a` upstream. Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi associated with them, and unregister it when the superblock is shut down. Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Miquel Raynal <miquel.raynal@bootlin.com> Cc: Richard Weinberger <richard@nod.at> Cc: Vignesh Raghavendra <vigneshr@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2407	2023-11-17 06:50:20 +00:00
Jens Axboe	3df6596538	io_uring: import 5.15-stable io_uring ANBZ: #6478 commit 788d0824269bef539fe31a785b1517882eafed93 linux-5.10.y. No upstream commit exists. This imports the io_uring codebase from 5.15.85, wholesale. Changes from that code base: - Drop IOCB_ALLOC_CACHE, we don't have that in 5.10. - Drop MKDIRAT/SYMLINKAT/LINKAT. Would require further VFS backports, and we don't support these in 5.10 to begin with. - sock_from_file() old style calling convention. - Use compat_get_bitmap() only for CONFIG_COMPAT=y [backport note] Besides, add ANCK 5.10 specific changes Including the following changes: support us granularity of io_sq_thread_idle optimize submit sqes latency support for 128-byte SQEs support CQE32 in io_uring_cqe support infrastructure for uring-cmd [backport note2] As we introduce pf_io_worker in task_struct, we use the first KABI_RESERVE Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2208	2023-09-25 12:04:35 +00:00
Matthew Wilcox (Oracle)	a94299bed6	fs: Document file_ra_state ANBZ: #6207 commit `c790fbf20a` upstream. Turn the comments into kernel-doc and improve the wording slightly. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jeff Layton <jlayton@kernel.org> Tested-by: Dave Wysochanski <dwysocha@redhat.com> Tested-By: Marc Dionne <marc.dionne@auristor.com> Link: https://lore.kernel.org/r/20210407201857.3582797-3-willy@infradead.org/ Link: https://lore.kernel.org/r/161789068619.6155.1397999970593531574.stgit@warthog.procyon.org.uk/ # v6 Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/2080	2023-08-24 11:06:13 +00:00
Miklos Szeredi	d5c57539a0	ovl: enable RCU'd ->get_acl() ANBZ: #5669 commit `332f606b32` upstream. Overlayfs does not cache ACL's (to avoid double caching). Instead it just calls the underlying filesystem's i_op->get_acl(), which will return the cached value, if possible. In rcu path walk, however, get_cached_acl_rcu() is employed to get the value from the cache, which will fail on overlayfs resulting in dropping out of rcu walk mode. This can result in a big performance hit in certain situations. Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which indicates pass-through) Reported-by: garyhuang <zjh.20052005@163.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1801	2023-06-29 01:23:21 +00:00
Jingbo Xu	4d36616698	anolis: Revert "anolis: erofs,ovl: bypass [override|revert]_creds for overlayfs" ANBZ: #5669 This reverts commit `8a90701652`. The current implementation is risky and may cause copy-up failure in specific scenarios with "permission denied". As this optimization is not practically used, revert it for now. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1801	2023-06-29 01:23:21 +00:00
Jingbo Xu	c17e876988	anolis: Revert "anolis: erofs,ovl: enable RCU'd ->get_acl()" ANBZ: #5669 This reverts commit `449488538c`, where the optimization is enabled only when lowerdirs are upon limited fs type. Later we will cherry-pick the upstream commit `332f606b32` ("ovl: enable RCU'd ->get_acl()") to re-enable this optimization. Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1801	2023-06-29 01:23:21 +00:00
Liu Bo	cc28d7800c	anolis: dax: add dax_read_one_pfn helper ANBZ: #5553 Dax pfn can be mapped to kernel virtual address just like page cache page, but there is no such a helper that is equivalent to read_cache_page(). This adds a helper "dax_read_one_pfn" to return a pfn_t. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Tested-by: Liu Bo <bo.liu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1747	2023-06-16 06:57:55 +00:00
Kanchan Joshi	6b5e751782	fs: add file_operations->uring_cmd_iopoll ANBZ: #4890 commit `de27e18e86` upstream io_uring will invoke this to do completion polling on uring-cmd operations. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220823161443.49436-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1681	2023-06-15 22:55:32 +00:00
Christian Brauner (Microsoft)	e244892f6f	attr: use consistent sgid stripping checks ANBZ: #5286 commit 0e9dbde96cacd0e95e05e862b50f3079146917f2 stable 5.10 commit `ed5a7047d2` upstream. [backported to 5.10.y, prior to idmapped mounts] Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. 
So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. 
When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. 
Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [Ziyang Zhang: modify fuse_cache_write_iter() too] Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1667	2023-05-26 07:12:17 +00:00
Yang Xu	d59b93b65b	fs: add mode_strip_sgid() helper ANBZ: #4541 commit `2b3416ceff` upstream. [remove userns argument of helper for 5.10.y backport] Add a dedicated helper to handle the setgid bit when creating a new file in a setgid directory. This is a preparatory patch for moving setgid stripping into the vfs. The patch contains no functional changes. Currently the setgid stripping logic is open-coded directly in inode_init_owner() and the individual filesystems are responsible for handling setgid inheritance. Since this has proven to be brittle as evidenced by old issues we uncovered over the last months (see [1] to [3] below) we will try to move this logic into the vfs. Link: `e014f37db1` ("xfs: use setattr_copy to set vfs inode attributes") [1] Link: `01ea173e10` ("xfs: fix up non-directory creation in SGID directories") [2] Link: `fd84bfdddd` ("ceph: fix up non-directory creation in SGID directories") [3] Link: https://lore.kernel.org/r/1657779088-2242-1-git-send-email-xuyang2018.jy@fujitsu.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1489	2023-03-24 08:16:23 +00:00
Guixin Liu	2af3c87980	anolis: kabi: Reserve some fields ANBZ: #3879 Based on upstream struct size to reserve specific fields for kabi related structs. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1148	2023-02-24 07:45:32 +00:00
Guixin Liu	a027e85f84	anolis: kabi: Replace hotfix reserve macros with KABI reserve macros ANBZ: #3879 Unify hotfix reserve macros and KABI reserve macros, now KABI reserve macros are used for both KABI and hotfix. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1087	2023-01-31 13:20:23 +00:00
Jens Axboe	943c68a0f9	mm: provide filemap_range_needs_writeback() helper ANBZ: #3599 commit `63135aa386` upstream. Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3. An internal workload complained because it was using too much CPU, and when I took a look, we had a lot of io_uring workers going to town. For an async buffered read like workload, I am normally expecting _zero_ offloads to a worker thread, but this one had tons of them. I'd drop caches and things would look good again, but then a minute later we'd regress back to using workers. Turns out that every minute something was reading parts of the device, which would add page cache for that inode. I put patches like these in for our kernel, and the problem was solved. Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache entries for the given range. This causes unnecessary work from the callers side, when the IO could have been issued totally fine without blocking on writeback when there is none. This patch (of 3): For O_DIRECT reads/writes, we check if we need to issue a call to filemap_write_and_wait_range() to issue and/or wait for writeback for any page in the given range. The existing mechanism just checks for a page in the range, which is suboptimal for IOCB_NOWAIT as we'll fallback to the slow path (and needing retry) if there's just a clean page cache page in the range. Provide filemap_range_needs_writeback() which tries a little harder to check if we actually need to issue and/or wait for writeback in the range. 
Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1034	2023-01-03 15:03:53 +08:00
Jens Axboe	28426fcd72	fs,io_uring: add infrastructure for uring-cmd ANBZ: #3541 commit `ee692a21e9` upstream file_operations->uring_cmd is a file private handler. This is somewhat similar to ioctl but hopefully a lot more sane and useful as it can be used to enable many io_uring capabilities for the underlying operation. IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't know what is in this command type, it's for the provider of ->uring_cmd() to deal with. Co-developed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220511054750.20432-2-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk> [Ziyang Zhang: remove CQE32 support for uring-cmd] Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/1017	2022-12-27 10:32:04 +00:00
Ziyang Zhang	62e8ec4c09	anolis: Revert "ck: io_uring: support ioctl" ANBZ: #3286 This reverts commit `116dd23479`. If we want to sync with io_uring upstream, IORING_OP_IOCTL could break the uapi since there is not a specific number assigned for this opcode. Revert IORING_OP_IOCTL now and it seems that no one is using it now. Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/923	2022-11-29 16:46:54 +08:00
Jingbo Xu	449488538c	anolis: erofs,ovl: enable RCU'd ->get_acl() ANBZ: #2457 cherry-picked from `332f606b32` upstream. Overlayfs does not cache ACL's (to avoid double caching). Instead it just calls the underlying filesystem's i_op->get_acl(), which will return the cached value, if possible. In rcu path walk, however, get_cached_acl_rcu() is employed to get the value from the cache, which will fail on overlayfs resulting in dropping out of rcu walk mode. This can result in a big performance hit in certain situations. Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which indicates pass-through) [Jingbo's Note] For risk control consideration, enable RCU'd ACL optimization (for overlayfs) only in constricted scenarios, e.g. the container scenarios where erofs serves as the lowerfs of overlayfs. Reported-by: garyhuang <zjh.20052005@163.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/783 Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>	2022-10-24 02:54:58 +00:00
Miklos Szeredi	3145d22992	vfs: add rcu argument to ->get_acl() callback ANBZ: #2457 commit `0cad624662` upstream. Add a rcu argument to the ->get_acl() callback to allow get_cached_acl_rcu() to call the ->get_acl() method in the next patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/783 Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>	2022-10-24 02:54:58 +00:00
Jingbo Xu	8a90701652	anolis: erofs,ovl: bypass [override|revert]_creds for overlayfs ANBZ: #2457 Add "opt_creds=on" mount option for erofs, which serves as a hint of bypassing [override|revert]_creds in permission checking in overlayfs, when erofs is mounted as lowerfs of overlayfs. overlayfs will access the realinode with creds of mounter by default, in which case [override|revert]_creds are called. In some workloads, the overhead of frequent [override|revert]_creds is quite considerable. As an optimization we can bypass [override|revert]_creds and thus tasks will access realinode with creds of caller. This is inspired by a similar optimization from Android (see Link below). The side effect is that tasks may fail to access files if caller's permission is not adequate to do that. So this optimization only applies to those constricted scenarios where caller is adequate to access realinode, e.g. container scenarios. To enable this optimization, set "opt_creds=on" mount option for those erofs instances that serve as lowerfs for overlayfs. The mount option is a hint for overlayfs suggesting that the optimization could be applied. The optimization will be enabled if all lowerfs are configured with this mount option. By default this mount option is unset and thus overlayfs won't enable the optimization. Link: https://lore.kernel.org/all/20211117015806.2192263-4-dvander@google.com/ Link: https://gitee.com/anolis/cloud-kernel/pulls/783 Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>	2022-10-24 02:54:58 +00:00
Pavel Begunkov	1f2a758e33	io_uring: disable polling pollfree files ANBZ: #2212 commit 28d8d2737e82fc29ff9e788597661abecc7f7994 linux-stable 5.10. Older kernels lack io_uring POLLFREE handling. As only affected files are signalfd and android binder the safest option would be to disable polling those files via io_uring and hope there are no users. Fixes: `221c5eb233` ("io_uring: add support for IORING_OP_POLL") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: CVE-2022-3176 Link: https://gitee.com/anolis/cloud-kernel/pulls/734 Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>	2022-09-23 03:49:43 +00:00
zhongjiang-ali	b595403c0b	anolis: mm: kidled: refactor cold slab ANBZ: #1702 The previous cold slab implementation used an internal field to record the age, which increased the object size and introduced fragmentation. This patch refactors the implementation to reduce that overhead. It reuses page->mem_cgroup when kmem accounting is disabled, because the slab is then not in the memcg, and uses an extra pointer to store the age when kmem accounting is enabled; that pointer is a special obj_cgroup pointer allocated by memcg_alloc_page_obj_cgroups. Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Acked-by: Gang Deng <gavin.dg@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/565	2022-08-02 16:44:02 +08:00
zhongjiang-ali	1d7768d166	anolis: mm: kidled: make kidled support to identify cold slab ANBZ: #1702 In some cases, a machine holds a large amount of reclaimable slab. This can cause memory fragmentation and make system performance fluctuate. Reclaiming that slab also increases the memory available to offline workloads. This patch adapts cold slab detection to the kidled framework, so that kidled can monitor slab that has stayed free for a long time. This gives users an indicator for reaping long-free slab. Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Acked-by: Gang Deng <gavin.dg@linux.alibaba.com> Link: https://gitee.com/anolis/cloud-kernel/pulls/565	2022-08-02 16:43:57 +08:00
Alessio Balsini	31512306c6	anolis: fs: Generic function to convert iocb to rw flags ANBZ: #1576 Reference: https://lore.kernel.org/lkml/20210125153057.3623715-2-balsini@android.com/ OverlayFS implements its own function to translate iocb flags into rw flags, so that they can be passed into another vfs call. With commit `ce71bfea20` ("fs: align IOCB_* flags with RWF_* flags") Jens created a 1:1 matching between the iocb flags and rw flags, simplifying the conversion. Reduce the OverlayFS code by making the flag conversion function generic and reusable. Signed-off-by: Alessio Balsini <balsini@android.com> Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2022-08-02 16:40:02 +08:00
Xu Yu	cb7613cb41	anolis: mm: duptext: add core implementation ANBZ: #32 NOTE: this is an experimental feature. This introduces and implements interfaces to create and remove duplicated pages, as well as statistics of duplicated pages in global granularity and process granularity, respectively. This also introduces the global switch and per-memcg switch to control whether page duplication is allowed. Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>	2022-08-01 12:00:46 +00:00
xiejingfeng	ee0d34abbf	ck: hotfix: Add Kernel hotfix enhancement fix #32137220 We reserve some fields beforehand for core structures prone to change, so that adding extra fields for a hotfix does no harm, thereby increasing the hotfix success rate; with this enhancement we can even hot-add features. After reserving, cache behavior normally does not matter, as the reserved fields (usually at the tail) are not accessed at all. The following structures are currently involved: MM: struct zone struct pglist_data struct mm_struct struct vm_area_struct struct mem_cgroup struct writeback_control Block: struct gendisk struct backing_dev_info struct bio struct queue_limits struct request_queue struct blkcg struct blkcg_policy struct blk_mq_hw_ctx struct blk_mq_tag_set struct blk_mq_queue_data struct blk_mq_ops struct elevator_mq_ops struct address_space struct block_device struct hd_struct struct bio_set Network: struct sk_buff struct sock struct net_device_ops struct xt_target struct dst_entry struct dst_ops struct fib_rule Scheduler: struct task_struct struct cfs_rq struct rq struct sched_statistics struct sched_entity struct signal_struct struct task_group struct cpuacct cgroup: struct cgroup_root struct cgroup_subsys_state struct cgroup_subsys struct css_set Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>	2022-08-01 11:01:51 +00:00
Hao Xu	116dd23479	ck: io_uring: support ioctl fix #32693463 Async ioctl is necessary for some scenarios, such as a nonblocking single-threaded model. Currently only the BLKDISCARD op is supported. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>	2022-08-01 11:01:45 +00:00
zhangliguang	90658e1450	ck: fs,ext4: remove projid limit when create hard link fix #32824162 This is a temporary workaround to avoid the limitation on creating a hard link across two projids. Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>	2022-08-01 11:01:43 +00:00
Josef Bacik	9febc9d8d2	fs: export an inode_update_time helper commit `e60feb445f` upstream. If you already have an inode and need to update the time on the inode there is no way to do this properly. Export this helper to allow file systems to update time on the inode so the appropriate handler is called, either ->update_time or generic_update_time. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-11-26 10:39:22 +01:00
Al Viro	40ba433a85	new helper: inode_wrong_type() commit `6e3e2c4362` upstream. inode_wrong_type(inode, mode) returns true if setting inode->i_mode to given value would've changed the inode type. We have enough of those checks open-coded to make a helper worthwhile. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2021-09-08 08:49:01 +02:00
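
The check described in the last entry above ("returns true if setting inode->i_mode to given value would've changed the inode type") can be illustrated with a small userspace sketch. This is not the kernel helper itself: the name mode_wrong_type() and its two-mode signature are assumptions made for illustration, and it operates on plain mode_t values rather than a struct inode. It only shows the idea of comparing the file-type bits selected by S_IFMT.

```c
#define _XOPEN_SOURCE 700	/* expose S_IFMT and the S_IF* type constants */

#include <stdbool.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical userspace analogue of the helper described above.
 * The name and signature are illustrative assumptions, not the kernel API.
 * Returns true if replacing old_mode with new_mode would change the file
 * type (regular file, directory, symlink, ...).
 */
static bool mode_wrong_type(mode_t old_mode, mode_t new_mode)
{
	/* S_IFMT masks the file-type bits; permission bits are ignored. */
	return ((old_mode ^ new_mode) & S_IFMT) != 0;
}
```

A caller would use such a check to reject an update that turns, say, a regular file into a directory, while still allowing permission-only changes such as 0644 to 0755 on the same file type.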

1765 Commits