Commit Graph

1765 Commits

Author SHA1 Message Date
Miklos Szeredi 76819e6780 fuse: don't truncate cached, mutated symlink
ANBZ: #20444

[ Upstream commit b4c173dfbb ]

Fuse allows the value of a symlink to change and this property is exploited
by some filesystems (e.g. CVMFS).

It has been observed, that sometimes after changing the symlink contents,
the value is truncated to the old size.

This is caused by fuse_getattr() racing with fuse_reverse_inval_inode().
fuse_reverse_inval_inode() updates the fuse_inode's attr_version, which
results in fuse_change_attributes() exiting before updating the cached
attributes

This is okay, as the cached attributes remain invalid and the next call to
fuse_change_attributes() will likely update the inode with the correct
values.

The reason this causes problems is that cached symlinks will be
returned through page_get_link(), which truncates the symlink to
inode->i_size.  This is correct for filesystems that don't mutate
symlinks, but in this case it causes bad behavior.

The solution is to just remove this truncation.  This can cause a
regression in a filesystem that relies on supplying a symlink larger than
the file size, but this is unlikely.  If that happens we'd need to make
this behavior conditional.

Reported-by: Laura Promberger <laura.promberger@cern.ch>
Tested-by: Sam Lewis <samclewis@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://lore.kernel.org/r/20250220100258.793363-1-mszeredi@redhat.com
Reviewed-by: Bernd Schubert <bschubert@ddn.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5081
2025-04-22 01:26:36 +00:00
Jens Axboe d7a3a27c63 fs: add batch and poll flags to the uring_cmd_iopoll() handler
ANBZ: #11744

commit de97fcb303 upstream.

We need the poll_flags to know how to poll for the IO, and we should
have the batch structure in preparation for supporting batched
completions with iopoll.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4160
2024-12-09 06:40:29 +00:00
Jens Axboe ddd4974732 block: add a struct io_comp_batch argument to fops->iopoll()
ANBZ: #11744

commit 5a72e899ce upstream.

struct io_comp_batch contains a list head and a completion handler, which
will allow completions to more effciently completed batches of IO.

For now, no functional changes in this patch, we just define the
io_comp_batch structure and add the argument to the file_operations iopoll
handler.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[joe: apply changes for all bio_poll() and fix conflicts]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4160
2024-12-09 06:40:29 +00:00
Amir Goldstein cd3678b1f6 fs: create kiocb_{start,end}_write() helpers
ANBZ: #12172

commit f83a32351efd6c6676fdabd5ab0e680500bc48e0 stable.

Commit ed0360bbab upstream.

aio, io_uring, cachefiles and overlayfs, all open code an ugly variant
of file_{start,end}_write() to silence lockdep warnings.

Create helpers for this lockdep dance so we can use the helpers in all
the callers.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Message-Id: <20230817141337.1025891-4-amir73il@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

Signed-off-by: Xiao Long <xiaolong@openanolis.org>
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4171
2024-12-02 05:13:28 +00:00
Christoph Hellwig b8f1c9a55b block: switch polling to be bio based
ANBZ: #11744

commit 3e08773c38 upstream.

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[joe: fix all conflicts for *submit_bio() and blk_poll()]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4084
2024-11-26 01:40:15 +00:00
Christoph Hellwig 429040d248 block: replace the spin argument to blk_iopoll with a flags argument
ANBZ: #11744

commit ef99b2d376 upstream.

Switch the boolean spin argument to blk_poll to passing a set of flags
instead.  This will allow to control polling behavior in a more fine
grained way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-10-hch@lst.de
[axboe: adapt to changed io_uring iopoll]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[joe: fix conflicts for fs/block_dev.c and io_uring/io_uring.c]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4084
2024-11-26 01:40:15 +00:00
Zhihao Cheng 97b23b74db vfs: Don't evict inode under the inode lru traversing context
ANBZ: #11123

commit 03880af02a78bc9a98b5a581f529cf709c88a9b8 stable.

commit 2a0629834c upstream.

The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes
(See function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list(). Some filesystems(eg. ext4 with
ea_inode feature, ubifs with xattr) may do inode lookup in the inode
evicting callback function, if the inode lookup is operated under the
inode lru traversing context, deadlock problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
        if ea_inode feature is enabled, the lookup process will be stuck
	under the evicting context like this:

 1. File A has inode i_reg and an ea inode i_ea
 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
 3. Then, following three processes running like this:

    PA                              PB
 echo 2 > /proc/sys/vm/drop_caches
  shrink_slab
   prune_dcache_sb
   // i_reg is added into lru, lru->i_ea->i_reg
   prune_icache_sb
    list_lru_walk_one
     inode_lru_isolate
      i_ea->i_state |= I_FREEING // set inode state
     inode_lru_isolate
      __iget(i_reg)
      spin_unlock(&i_reg->i_lock)
      spin_unlock(lru_lock)
                                     rm file A
                                      i_reg->nlink = 0
      iput(i_reg) // i_reg->nlink is 0, do evict
       ext4_evict_inode
        ext4_xattr_delete_inode
         ext4_xattr_inode_dec_ref_all
          ext4_xattr_inode_iget
           ext4_iget(i_ea->i_ino)
            iget_locked
             find_inode_fast
              __wait_on_freeing_inode(i_ea) ----→ AA deadlock
    dispose_list // cannot be executed by prune_icache_sb
     wake_up_bit(&i_ea->i_state)

Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
        deleting process holds BASEHD's wbuf->io_mutex while getting the
	xattr inode, which could race with inode reclaiming process(The
        reclaiming process could try locking BASEHD's wbuf->io_mutex in
	inode evicting function), then an ABBA deadlock problem would
	happen as following:

 1. File A has inode ia and a xattr(with inode ixa), regular file B has
    inode ib and a xattr.
 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
 3. Then, following three processes running like this:

        PA                PB                        PC
                echo 2 > /proc/sys/vm/drop_caches
                 shrink_slab
                  prune_dcache_sb
                  // ib and ia are added into lru, lru->ixa->ib->ia
                  prune_icache_sb
                   list_lru_walk_one
                    inode_lru_isolate
                     ixa->i_state |= I_FREEING // set inode state
                    inode_lru_isolate
                     __iget(ib)
                     spin_unlock(&ib->i_lock)
                     spin_unlock(lru_lock)
                                                   rm file B
                                                    ib->nlink = 0
 rm file A
  iput(ia)
   ubifs_evict_inode(ia)
    ubifs_jnl_delete_inode(ia)
     ubifs_jnl_write_inode(ia)
      make_reservation(BASEHD) // Lock wbuf->io_mutex
      ubifs_iget(ixa->i_ino)
       iget_locked
        find_inode_fast
         __wait_on_freeing_inode(ixa)
          |          iput(ib) // ib->nlink is 0, do evict
          |           ubifs_evict_inode
          |            ubifs_jnl_delete_inode(ib)
          ↓             ubifs_jnl_write_inode
     ABBA deadlock ←-----make_reservation(BASEHD)
                   dispose_list // cannot be executed by prune_icache_sb
                    wake_up_bit(&ixa->i_state)

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING
to pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion
cannot be triggered from inode_lru_isolate() thus avoiding the deadlock.
evict() is made to wait for I_LRU_ISOLATING to be cleared before
proceeding with inode cleanup.

Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Fixes: 7959cf3a75 ("ubifs: journal: Handle xattrs like files")
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

[Fixes conflicts]
Fixes: CVE-2024-45003
Signed-off-by: Xiao Long <xiaolong@openanolis.org>
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3896
2024-09-29 11:23:55 +00:00
Shawn Wang fa73812bca anolis: mm: kidled: Add a new *KIDLED_YOUNG flag to replace *REFERENCED checking
ANBZ: #9779

Now kidled will skip the dentry/inode with *REFERENCED set, which means
the age of these entries is always 0. To reclaim these entries and avoid
pollution of the *REFERENCED flag, introduce a new *KIDLED_YOUNG flag for
kidled to indicate the entry has been referenced since the last kidled
scanning.

*KIDLED_YOUNG should be set at the same time as the *REFERENCED. More
specifically, it will be set in retain_dentry() and inode_lru_list_add()
for now.

Besides, kidled and coldpgs modules are modified to check *KIDELD_YOUNG
instead of *REFERENCED:

When kidled scans an entry with *KIDLED_YOUNG set, it clears this flag
and set the age of this entry to 0.

When coldpgs scans an entry with *KIDLED_YOUNG set, it just ignores this
entry.

Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3720
2024-08-20 13:40:53 +00:00
Eric Biggers 1ac641314d fscrypt: stop using keyrings subsystem for fscrypt_master_key
ANBZ: #9613

commit d7e7b9af10 upstream.

The approach of fs/crypto/ internally managing the fscrypt_master_key
structs as the payloads of "struct key" objects contained in a
"struct key" keyring has outlived its usefulness.  The original idea was
to simplify the code by reusing code from the keyrings subsystem.
However, several issues have arisen that can't easily be resolved:

- When a master key struct is destroyed, blk_crypto_evict_key() must be
  called on any per-mode keys embedded in it.  (This started being the
  case when inline encryption support was added.)  Yet, the keyrings
  subsystem can arbitrarily delay the destruction of keys, even past the
  time the filesystem was unmounted.  Therefore, currently there is no
  easy way to call blk_crypto_evict_key() when a master key is
  destroyed.  Currently, this is worked around by holding an extra
  reference to the filesystem's request_queue(s).  But it was overlooked
  that the request_queue reference is *not* guaranteed to pin the
  corresponding blk_crypto_profile too; for device-mapper devices that
  support inline crypto, it doesn't.  This can cause a use-after-free.

- When the last inode that was using an incompletely-removed master key
  is evicted, the master key removal is completed by removing the key
  struct from the keyring.  Currently this is done via key_invalidate().
  Yet, key_invalidate() takes the key semaphore.  This can deadlock when
  called from the shrinker, since in fscrypt_ioctl_add_key(), memory is
  allocated with GFP_KERNEL under the same semaphore.

- More generally, the fact that the keyrings subsystem can arbitrarily
  delay the destruction of keys (via garbage collection delay, or via
  random processes getting temporary key references) is undesirable, as
  it means we can't strictly guarantee that all secrets are ever wiped.

- Doing the master key lookups via the keyrings subsystem results in the
  key_permission LSM hook being called.  fscrypt doesn't want this, as
  all access control for encrypted files is designed to happen via the
  files themselves, like any other files.  The workaround which SELinux
  users are using is to change their SELinux policy to grant key search
  access to all domains.  This works, but it is an odd extra step that
  shouldn't really have to be done.

The fix for all these issues is to change the implementation to what I
should have done originally: don't use the keyrings subsystem to keep
track of the filesystem's fscrypt_master_key structs.  Instead, just
store them in a regular kernel data structure, and rework the reference
counting, locking, and lifetime accordingly.  Retain support for
RCU-mode key lookups by using a hash table.  Replace fscrypt_sb_free()
with fscrypt_sb_delete(), which releases the keys synchronously and runs
a bit earlier during unmount, so that block devices are still available.

A side effect of this patch is that neither the master keys themselves
nor the filesystem keyrings will be listed in /proc/keys anymore.
("Master key users" and the master key users keyrings will still be
listed.)  However, this was mostly an implementation detail, and it was
intended just for debugging purposes.  I don't know of anyone using it.

This patch does *not* change how "master key users" (->mk_users) works;
that still uses the keyrings subsystem.  That is still needed for key
quotas, and changing that isn't necessary to solve the issues listed
above.  If we decide to change that too, it would be a separate patch.

I've marked this as fixing the original commit that added the fscrypt
keyring, but as noted above the most important issue that this patch
fixes wasn't introduced until the addition of inline encryption support.

Fixes: 22d94f493b ("fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220901193208.138056-2-ebiggers@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3679
2024-08-12 08:20:08 +00:00
Jeff Layton 1fc0d4123a filelock: new helper: vfs_inode_has_locks
ANBZ: #9667

commit ab1ddef98a upstream.

Ceph has a need to know whether a particular inode has any locks set on
it. It's currently tracking that by a num_locks field in its
filp->private_data, but that's problematic as it tries to decrement this
field when releasing locks and that can race with the file being torn
down.

Add a new vfs_inode_has_locks helper that just returns whether any locks
are currently held on the inode.

Reviewed-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Stable-dep-of: 461ab10ef7 ("ceph: switch to vfs_inode_has_locks() to fix file lock bug")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3632
2024-08-09 19:32:34 +08:00
Jeff Layton b1e2b3525c fs: add ctime accessors infrastructure
ANBZ: #9649

[ Upstream commit 9b6304c1d5 ]

struct timespec64 has unused bits in the tv_nsec field that can be used
for other purposes. In future patches, we're going to change how the
inode->i_ctime is accessed in certain inodes in order to make use of
them. In order to do that safely though, we'll need to eradicate raw
accesses of the inode->i_ctime field from the kernel.

Add new accessor functions for the ctime that we use to replace them.

Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Message-Id: <20230705185812.579118-2-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Stable-dep-of: 5923d6686a ("smb3: fix caching of ctime on setxattr")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3625
2024-08-06 06:57:29 +00:00
Christian Brauner fe58af9413 fs: use consistent setgid checks in is_sxid()
ANBZ: #9613

commit 8d84e39d76 upstream.

Now that we made the VFS setgid checking consistent an inode can't be
marked security irrelevant even if the setgid bit is still set. Make
this function consistent with all other helpers.

Note that enforcing consistent setgid stripping checks for file
modification and mode- and ownership changes will cause the setgid bit
to be lost in more cases than useed to be the case. If an unprivileged
user wrote to a non-executable setgid file that they don't have
privilege over the setgid bit will be dropped. This will lead to
temporary failures in some xfstests until they have been updated.

Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3604
2024-08-06 02:52:58 +00:00
Amir Goldstein 93e0de95b9 vfs: fix copy_file_range() averts filesystem freeze protection
ANBZ: #9613

commit 10bc8e4af6 upstream.

[backport comments for pre v5.15:
- ksmbd mentions are irrelevant - ksmbd hunks were dropped
- sb_write_started() is missing - assert was dropped
]

Commit 868f9f2f8e ("vfs: fix copy_file_range() regression in cross-fs
copies") removed fallback to generic_copy_file_range() for cross-fs
cases inside vfs_copy_file_range().

To preserve behavior of nfsd and ksmbd server-side-copy, the fallback to
generic_copy_file_range() was added in nfsd and ksmbd code, but that
call is missing sb_start_write(), fsnotify hooks and more.

Ideally, nfsd and ksmbd would pass a flag to vfs_copy_file_range() that
will take care of the fallback, but that code would be subtle and we got
vfs_copy_file_range() logic wrong too many times already.

Instead, add a flag to explicitly request vfs_copy_file_range() to
perform only generic_copy_file_range() and let nfsd and ksmbd use this
flag only in the fallback path.

This choise keeps the logic changes to minimum in the non-nfsd/ksmbd code
paths to reduce the risk of further regressions.

Fixes: 868f9f2f8e ("vfs: fix copy_file_range() regression in cross-fs copies")
Tested-by: Namjae Jeon <linkinjeon@kernel.org>
Tested-by: Luis Henriques <lhenriques@suse.de>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3580
2024-08-01 01:39:28 +00:00
Bart Van Assche 3a19a0c752 fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio
ANBZ: #9613

commit b820de741a upstream.

If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the
following kernel warning appears:

WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8
Call trace:
 kiocb_set_cancel_fn+0x9c/0xa8
 ffs_epfile_read_iter+0x144/0x1d0
 io_read+0x19c/0x498
 io_issue_sqe+0x118/0x27c
 io_submit_sqes+0x25c/0x5fc
 __arm64_sys_io_uring_enter+0x104/0xab0
 invoke_syscall+0x58/0x11c
 el0_svc_common+0xb4/0xf4
 do_el0_svc+0x2c/0xb0
 el0_svc+0x2c/0xa4
 el0t_64_sync_handler+0x68/0xb4
 el0t_64_sync+0x1a4/0x1a8

Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is
submitted by libaio.

Suggested-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Avi Kivity <avi@scylladb.com>
Cc: Sandeep Dhavale <dhavale@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3580
2024-08-01 01:39:28 +00:00
Hao Ge 28c59d3c6d fs: fix undefined behavior in bit shift for SB_NOUSER
ANBZ: #9613

commit f15afbd34d upstream.

Shifting signed 32-bit value by 31 bits is undefined, so changing
significant bit to unsigned. It was spotted by UBSAN.

So let's just fix this by using the BIT() helper for all SB_* flags.

Fixes: e462ec50cb ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Message-Id: <20230424051835.374204-1-gehao@kylinos.cn>
[brauner@kernel.org: use BIT() for all SB_* flags]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3575
2024-07-31 09:41:17 +00:00
Akinobu Mita 2a89bb1491 libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value
ANBZ: #9495

[ Upstream commit 2e41f274f9 ]

Patch series "fix error when writing negative value to simple attribute
files".

The simple attribute files do not accept a negative value since the commit
488dac0c92 ("libfs: fix error cast of negative value in
simple_attr_write()"), but some attribute files want to accept a negative
value.

This patch (of 3):

The simple attribute files do not accept a negative value since the commit
488dac0c92 ("libfs: fix error cast of negative value in
simple_attr_write()"), so we have to use a 64-bit value to write a
negative value.

This adds DEFINE_SIMPLE_ATTRIBUTE_SIGNED for a signed value.

Link: https://lkml.kernel.org/r/20220919172418.45257-1-akinobu.mita@gmail.com
Link: https://lkml.kernel.org/r/20220919172418.45257-2-akinobu.mita@gmail.com
Fixes: 488dac0c92 ("libfs: fix error cast of negative value in simple_attr_write()")
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reported-by: Zhao Gongyi <zhaogongyi@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3475
2024-07-12 05:49:57 +00:00
Guixin Liu 3e6454be7f anolis: kabi: revert all kabi_use
ANBZ: #9320

Because we dont take care of kABI consistency between major version now,
revert kabi use, these reserved space are intended to ensure kABI
compatibility between minor versions and their corresponding major
version.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by:Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3344
2024-06-28 05:46:26 +00:00
Xu Yu bd617fc27c anolis: mm: support fast reflink
ANBZ: #7740

Reflink will takes too much times to copy-on-write in the dirty pages.
Redis will block the progress to wait for completion, hence that will
not acceptiable for us to make the feature productization.

The patch will handle with the copy-on-write in the asynchronous thread.
and we will not handle with pmd aligned entry directly, and let it
behind for background work.

[ Xu Yu:
    - Employ async fork framework.
    - Fix wrong usage of pmd_write.
    - Move work_struct of fast reflink from vm_area_struct to
      address_space, and update corresponding routines.
]

Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2794
2024-03-01 01:09:03 +00:00
Rongwei Wang 4a85c08e8c anolis: mm: pgtable_share: switch to per-VMA pgtable share data
ANBZ: #6632

For anonymous shared memory, the page table data
constructed is sufficient in a per-mapping way.
That because other processes have no chances to
create new VMAs base on the same file descriptor.

But, identified shared memory also is user of page
table sharing. The file descriptor is visible for
others to create new shared page table.
Obviously, this two groups of shared page table is
different and invisible each other.

All in all, the advantages of per-VMA having:
- Avoid the conflict of create shared page tables
  with different attributes;
- Avoid the fails incurred by various VMA area;
- Maybe a little semantic issue, irrelevant tasks
  can not see the same mm_struct;
- A simpler structure;

Here moving the pgtable share data into vma struct.
The per-VMA method can meet above two shared memory.

Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
2023-12-29 12:54:08 +00:00
Rongwei Wang e7de7cbc6e anolis: mm: pgtable_share: introduce CONFIG_PAGETABLE_SHARE
ANBZ: #6632

CONFIG_PAGETABLE_SHARE introduced to select
whether enable this feature. And it needs to
select CONFIG_ARCH_USES_HIGH_VMA_FLAGS.

Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
2023-12-29 12:54:08 +00:00
Khalid Aziz fb2edd234e mm/ptshare: Create new mm struct for page table sharing
ANBZ: #6632

cherry-picked from https://lore.kernel.org/linux-mm/1fd52581f4e4960a4d07cb9784d56659ec139d3c.1682453344.git.khalid.aziz@oracle.com/

When a process passes MAP_SHARED_PT flag to mmap(), create a new mm
struct to hold the shareable page table entries for the newly mapped
region.  This new mm is not associated with a task.  Its lifetime is
until the last shared mapping is deleted.  This patch also adds a
new pointer "ptshare_data" to struct address_space which points to
the data structure that will contain pointer to this newly created
mm along with properties of the shared mapping. ptshare_data
maintains a refcount for the shared mapping so that it can be
cleaned up upon last unmap.

[Rongwei Wang: fix vm_flags_set() error]

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
2023-12-29 12:54:08 +00:00
Shiyang Ruan e4d1b9eecc fsdax: dedup file range to use a compare function
ANBZ: #7740

commit 6f7db3894a upstream.

With dax we cannot deal with readpage() etc.  So, we create a dax
comparison function which is similar with vfs_dedupe_file_range_compare().
And introduce dax_remap_file_range_prep() for filesystem use.

Link: https://lkml.kernel.org/r/20220603053738.1218681-13-ruansy.fnst@fujitsu.com
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	fs/remap_range.c
	include/linux/dax.h
	include/linux/fs.h
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2504
2023-12-26 09:51:37 +00:00
Christoph Hellwig f55bb06bba fs: explicitly unregister per-superblock BDIs
ANBZ: #7146

commit 0b3ea0926a upstream.

Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
associated with them, and unregister it when the superblock is shut
down.

Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2407
2023-11-17 06:50:20 +00:00
Jens Axboe 3df6596538 io_uring: import 5.15-stable io_uring
ANBZ: #6478

commit 788d0824269bef539fe31a785b1517882eafed93 linux-5.10.y.

No upstream commit exists.

This imports the io_uring codebase from 5.15.85, wholesale. Changes
from that code base:

- Drop IOCB_ALLOC_CACHE, we don't have that in 5.10.
- Drop MKDIRAT/SYMLINKAT/LINKAT. Would require further VFS backports,
  and we don't support these in 5.10 to begin with.
- sock_from_file() old style calling convention.
- Use compat_get_bitmap() only for CONFIG_COMPAT=y

[backport note]
Besides, add ANCK 5.10 specific changes

Including the following changes:

support us granularity of io_sq_thread_idle
optimize submit sqes latency
support for 128-byte SQEs
support CQE32 in io_uring_cqe
support infrastructure for uring-cmd

[backport note2]
As we introduce pf_io_worker in task_struct, we use the first
KABI_RESERVE

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2208
2023-09-25 12:04:35 +00:00
Matthew Wilcox (Oracle) a94299bed6 fs: Document file_ra_state
ANBZ: #6207

commit c790fbf20a upstream.

Turn the comments into kernel-doc and improve the wording slightly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Dave Wysochanski <dwysocha@redhat.com>
Tested-By: Marc Dionne <marc.dionne@auristor.com>
Link: https://lore.kernel.org/r/20210407201857.3582797-3-willy@infradead.org/
Link: https://lore.kernel.org/r/161789068619.6155.1397999970593531574.stgit@warthog.procyon.org.uk/ # v6
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2080
2023-08-24 11:06:13 +00:00
Miklos Szeredi d5c57539a0 ovl: enable RCU'd ->get_acl()
ANBZ: #5669

commit 332f606b32 upstream.

Overlayfs does not cache ACL's (to avoid double caching).  Instead it just
calls the underlying filesystem's i_op->get_acl(), which will return the
cached value, if possible.

In rcu path walk, however, get_cached_acl_rcu() is employed to get the
value from the cache, which will fail on overlayfs resulting in dropping
out of rcu walk mode.  This can result in a big performance hit in certain
situations.

Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which
indicates pass-through)

Reported-by: garyhuang <zjh.20052005@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1801
2023-06-29 01:23:21 +00:00
Jingbo Xu 4d36616698 anolis: Revert "anolis: erofs,ovl: bypass [override|revert]_creds for overlayfs"
ANBZ: #5669

This reverts commit 8a90701652.

The current implementation is risky and may cause copy-up failure in
specific scenarios with "permission denied".  As this optimization is
not practically used, revert it for now.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1801
2023-06-29 01:23:21 +00:00
Jingbo Xu c17e876988 anolis: Revert "anolis: erofs,ovl: enable RCU'd ->get_acl()"
ANBZ: #5669

This reverts commit 449488538c, where the
optimization is enabled only when lowerdirs are upon limited fs type.

Later we will cherry-pick the upstream commit 332f606b32 ("ovl: enable
RCU'd ->get_acl()") to re-enable this optimization.

Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1801
2023-06-29 01:23:21 +00:00
Liu Bo cc28d7800c anolis: dax: add dax_read_one_pfn helper
ANBZ: #5553

Dax pfn can be mapped to kernel virtual address just like page cache page,
but there is no such a helper that is equivalent to read_cache_page().

This adds a helper "dax_read_one_pfn" to return a pfn_t.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Tested-by: Liu Bo <bo.liu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1747
2023-06-16 06:57:55 +00:00
Kanchan Joshi 6b5e751782 fs: add file_operations->uring_cmd_iopoll
ANBZ: #4890

commit de27e18e86 upstream

io_uring will invoke this to do completion polling on uring-cmd
operations.

Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220823161443.49436-2-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Reviewed-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1681
2023-06-15 22:55:32 +00:00
Christian Brauner (Microsoft) e244892f6f attr: use consistent sgid stripping checks
ANBZ: #5286

commit 0e9dbde96cacd0e95e05e862b50f3079146917f2 stable 5.10

commit ed5a7047d2 upstream.

[backported to 5.10.y, prior to idmapped mounts]

Currently setgid stripping in file_remove_privs()'s should_remove_suid()
helper is inconsistent with other parts of the vfs. Specifically, it only
raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
inode isn't in the caller's groups and the caller isn't privileged over the
inode although we require this already in setattr_prepare() and
setattr_copy() and so all filesystem implement this requirement implicitly
because they have to use setattr_{prepare,copy}() anyway.

But the inconsistency shows up in setgid stripping bugs for overlayfs in
xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping works
correctly when performing various write-like operations as an unprivileged
user (fallocate, reflink, write, etc.):

echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k

The test basically creates a file with 6666 permissions. While the file has
the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
regular filesystem like xfs what will happen is:

sys_fallocate()
-> vfs_fallocate()
   -> xfs_file_fallocate()
      -> file_modified()
         -> __file_remove_privs()
            -> dentry_needs_remove_privs()
               -> should_remove_suid()
            -> __remove_privs()
               newattrs.ia_valid = ATTR_FORCE | kill;
               -> notify_change()
                  -> setattr_copy()

In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.

But we also see that ATTR_KILL_SGID won't be set because while the file
is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.

So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
ATTR_KILL_SUID is set and does:

ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);

which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

Now we call into the filesystem's ->setattr() inode operation which will
end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

if (ia_valid & ATTR_MODE) {
        umode_t mode = attr->ia_mode;
        vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
        if (!vfsgid_in_group_p(vfsgid) &&
            !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                mode &= ~S_ISGID;
        inode->i_mode = mode;
}

and since the caller in the test is neither capable nor in the group of the
inode the S_ISGID bit is stripped.

But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
has the consequence that neither the setgid nor the suid bits are stripped
even though it should be stripped because the inode isn't in the caller's
groups and the caller isn't privileged over the inode.

If overlayfs is in the mix things become a bit more complicated and the bug
shows up more clearly. When e.g., ovl_setattr() is hit from
ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
ATTR_KILL_SGID might be raised but because the check in notify_change() is
questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
stripped the S_ISGID bit isn't removed even though it should be stripped:

sys_fallocate()
-> vfs_fallocate()
   -> ovl_fallocate()
      -> file_remove_privs()
         -> dentry_needs_remove_privs()
            -> should_remove_suid()
         -> __remove_privs()
            newattrs.ia_valid = ATTR_FORCE | kill;
            -> notify_change()
               -> ovl_setattr()
                  // TAKE ON MOUNTER'S CREDS
                  -> ovl_do_notify_change()
                     -> notify_change()
                  // GIVE UP MOUNTER'S CREDS
     // TAKE ON MOUNTER'S CREDS
     -> vfs_fallocate()
        -> xfs_file_fallocate()
           -> file_modified()
              -> __file_remove_privs()
                 -> dentry_needs_remove_privs()
                    -> should_remove_suid()
                 -> __remove_privs()
                    newattrs.ia_valid = attr_force | kill;
                    -> notify_change()

The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper to perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have notify_change()
not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
first place because the caller must calculate the flags via
should_remove_suid() anyway which would raise ATTR_KILL_SGID.

While we're at it we move should_remove_suid() from inode.c to attr.c
where it belongs with the rest of the iattr helpers. Especially since it
returns ATTR_KILL_S{G,U}ID flags. We also rename it to
setattr_should_drop_suidgid() to better reflect that it indicates both
setuid and setgid bit removal and also that it returns attr flags.

Running xfstests with this doesn't report any regressions. We should really
try and use consistent checks.

Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[Ziyang Zhang: modify fuse_cache_write_iter() too]
Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1667
2023-05-26 07:12:17 +00:00
Yang Xu d59b93b65b fs: add mode_strip_sgid() helper
ANBZ: #4541

commit 2b3416ceff upstream.

[remove userns argument of helper for 5.10.y backport]

Add a dedicated helper to handle the setgid bit when creating a new file
in a setgid directory. This is a preparatory patch for moving setgid
stripping into the vfs. The patch contains no functional changes.

Currently the setgid stripping logic is open-coded directly in
inode_init_owner() and the individual filesystems are responsible for
handling setgid inheritance. Since this has proven to be brittle as
evidenced by old issues we uncovered over the last months (see [1] to
[3] below) we will try to move this logic into the vfs.

Link: e014f37db1 ("xfs: use setattr_copy to set vfs inode attributes") [1]
Link: 01ea173e10 ("xfs: fix up non-directory creation in SGID directories") [2]
Link: fd84bfdddd ("ceph: fix up non-directory creation in SGID directories") [3]
Link: https://lore.kernel.org/r/1657779088-2242-1-git-send-email-xuyang2018.jy@fujitsu.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-and-Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ferry Meng <mengferry@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1489
2023-03-24 08:16:23 +00:00
Guixin Liu 2af3c87980 anolis: kabi: Reserve some fields
ANBZ: #3879

Based on upstream struct size to reserve specific fields
for kabi related structs.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1148
2023-02-24 07:45:32 +00:00
Guixin Liu a027e85f84 anolis: kabi: Replace hotfix reserve macros with KABI reserve macros
ANBZ: #3879

Unify hotfix reserve macros and KABI reserve macros, now KABI reserve
macros are used for both KABI and hotfix.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1087
2023-01-31 13:20:23 +00:00
Jens Axboe 943c68a0f9 mm: provide filemap_range_needs_writeback() helper
ANBZ: #3599

commit 63135aa386 upstream.

Patch series "Improve IOCB_NOWAIT O_DIRECT reads", v3.

An internal workload complained because it was using too much CPU, and
when I took a look, we had a lot of io_uring workers going to town.

For an async buffered read like workload, I am normally expecting _zero_
offloads to a worker thread, but this one had tons of them.  I'd drop
caches and things would look good again, but then a minute later we'd
regress back to using workers.  Turns out that every minute something
was reading parts of the device, which would add page cache for that
inode.  I put patches like these in for our kernel, and the problem was
solved.

Don't -EAGAIN IOCB_NOWAIT dio reads just because we have page cache
entries for the given range.  This causes unnecessary work from the
callers side, when the IO could have been issued totally fine without
blocking on writeback when there is none.

This patch (of 3):

For O_DIRECT reads/writes, we check if we need to issue a call to
filemap_write_and_wait_range() to issue and/or wait for writeback for any
page in the given range.  The existing mechanism just checks for a page in
the range, which is suboptimal for IOCB_NOWAIT as we'll fallback to the
slow path (and needing retry) if there's just a clean page cache page in
the range.

Provide filemap_range_needs_writeback() which tries a little harder to
check if we actually need to issue and/or wait for writeback in the range.

Link: https://lkml.kernel.org/r/20210224164455.1096727-1-axboe@kernel.dk
Link: https://lkml.kernel.org/r/20210224164455.1096727-2-axboe@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1034
2023-01-03 15:03:53 +08:00
Jens Axboe 28426fcd72 fs,io_uring: add infrastructure for uring-cmd
ANBZ: #3541

commit ee692a21e9 upstream

file_operations->uring_cmd is a file private handler.
This is somewhat similar to ioctl but hopefully a lot more sane and
useful as it can be used to enable many io_uring capabilities for the
underlying operation.

IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
know what is in this command type, it's for the provider of ->uring_cmd()
to deal with.

Co-developed-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-2-joshi.k@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Ziyang Zhang: remove CQE32 support for uring-cmd]
Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1017
2022-12-27 10:32:04 +00:00
Ziyang Zhang 62e8ec4c09 anolis: Revert "ck: io_uring: support ioctl"
ANBZ: #3286

This reverts commit 116dd23479.

If we want to sync with io_uring upstream, IORING_OP_IOCTL could break
the uapi since there is not a specific number assigned for this opcode.

Revert IORING_OP_IOCTL now and it seems that no one is using it now.

Signed-off-by: Ziyang Zhang <ZiyangZhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/923
2022-11-29 16:46:54 +08:00
Jingbo Xu 449488538c anolis: erofs,ovl: enable RCU'd ->get_acl()
ANBZ: #2457

cherry-picked from 332f606b32 upstream.

Overlayfs does not cache ACL's (to avoid double caching).  Instead it just
calls the underlying filesystem's i_op->get_acl(), which will return the
cached value, if possible.

In rcu path walk, however, get_cached_acl_rcu() is employed to get the
value from the cache, which will fail on overlayfs resulting in dropping
out of rcu walk mode.  This can result in a big performance hit in certain
situations.

Fix by calling ->get_acl() with rcu=true in case of ACL_DONT_CACHE (which
indicates pass-through)

[Jingbo's Note]
For risk control consideration, enable RCU'd ACL optimization (for
overlayfs) only in constricted scenarios, e.g. the container scenarios
where erofs serves as the lowerfs of overlayfs.

Reported-by: garyhuang <zjh.20052005@163.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/783
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-10-24 02:54:58 +00:00
Miklos Szeredi 3145d22992 vfs: add rcu argument to ->get_acl() callback
ANBZ: #2457

commit 0cad624662 upstream.

Add a rcu argument to the ->get_acl() callback to allow
get_cached_acl_rcu() to call the ->get_acl() method in the next patch.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/783
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-10-24 02:54:58 +00:00
Jingbo Xu 8a90701652 anolis: erofs,ovl: bypass [override|revert]_creds for overlayfs
ANBZ: #2457

Add "opt_creds=on" mount option for erofs, which serves as a hint of
bypassing [override|revert]_creds in permission checking in overlayfs,
when erofs is mounted as lowerfs of overlayfs.

overlayfs will access the realinode with creds of mounter by default, in
which case [override|revert]_creds are called. In some workloads, the
overhead of frequent [override|revert]_creds is quite considerable.

As an optimization we can bypass [override|revert]_creds and thus tasks
will access realinode with creds of caller. This is inspired by a
similar optimization from Android (see Link below). The side effect is
that tasks may fail to access files if caller's permission is not
adequate to do that. So this optimization only applies to those
constricted scenarios where caller is adequate to access realinode, e.g.
container scenarios.

To enable this optimization, set "opt_creds=on" mount option for those
erofs instances that serve as lowerfs for overlayfs. The mount option is
a hint for overlayfs suggesting that the optimization could be applied.
The optimization will be enabled if all lowerfs are configured with this
mount option.

By default this mount option is unset and thus overlayfs won't enable
the optimization.

Link: https://lore.kernel.org/all/20211117015806.2192263-4-dvander@google.com/
Link: https://gitee.com/anolis/cloud-kernel/pulls/783
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-10-24 02:54:58 +00:00
Pavel Begunkov 1f2a758e33 io_uring: disable polling pollfree files
ANBZ: #2212

commit 28d8d2737e82fc29ff9e788597661abecc7f7994 linux-stable 5.10.

Older kernels lack io_uring POLLFREE handling. As only affected files
are signalfd and android binder the safest option would be to disable
polling those files via io_uring and hope there are no users.

Fixes: 221c5eb233 ("io_uring: add support for IORING_OP_POLL")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: CVE-2022-3176
Link: https://gitee.com/anolis/cloud-kernel/pulls/734
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
2022-09-23 03:49:43 +00:00
zhongjiang-ali b595403c0b anolis: mm: kidled: refactor cold slab
ANBZ: #1702

The previous cold slab use an internal field to record the age,
it will increase the object size and bring in fragmentation.

The patch refactor the implementation to cut down the defect.
It will reuse the page->mem_cgroup because slab will not in
the memcg when kmem accouting disable, and use an extra pointer
to store the age when kmem accouting enable, it is an specal
obj_cgroup pointer allocated by memcg_alloc_page_obj_cgroups.

Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/565
2022-08-02 16:44:02 +08:00
zhongjiang-ali 1d7768d166 anolis: mm: kidled: make kidled support to identify cold slab
ANBZ: #1702

In some cases, we can see a large of reclaimable slab in the machine.
It can result in some memory fragment to make the system fluctuation.
Meanwhile, reclaim the slab can increase the memory to be used for
offline business.

The patch is adapted to kidle frame, thus the kidled is able to monitor
the free slab for an long time. It is an indicator for user to reap
the long free slab.

Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/565
2022-08-02 16:43:57 +08:00
Alessio Balsini 31512306c6 anolis: fs: Generic function to convert iocb to rw flags
ANBZ: #1576

Reference:
https://lore.kernel.org/lkml/20210125153057.3623715-2-balsini@android.com/

OverlayFS implements its own function to translate iocb flags into rw
flags, so that they can be passed into another vfs call.
With commit ce71bfea20 ("fs: align IOCB_* flags with RWF_* flags")
Jens created a 1:1 matching between the iocb flags and rw flags,
simplifying the conversion.

Reduce the OverlayFS code by making the flag conversion function generic
and reusable.

Signed-off-by: Alessio Balsini <balsini@android.com>
Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-08-02 16:40:02 +08:00
Xu Yu cb7613cb41 anolis: mm: duptext: add core implementation
ANBZ: #32

NOTE: this is an experimental feature.

This introduces and implements interfaces to create and remove
duplicated pages, as well as statistics of duplicated pages in
global granularity and process granularity, respectively.

This also introduces the global switch and per-memcg switch to
control whether page duplication is allowed.

Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
2022-08-01 12:00:46 +00:00
xiejingfeng ee0d34abbf ck: hotfix: Add Kernel hotfix enhancement
fix #32137220

We reserve some fields beforehand for core structures prone to change,
so that we won't hurt when extra fields have to be added for hotfix,
thereby inceasing the success rate, we even can hot add features with
this enhancement.

After reserving, normally cache does not matter as the reserved fields
(usually at tail) are not accessed at all.

Currently involve the following structures:
   MM:
   struct zone
   struct pglist_data
   struct mm_struct
   struct vm_area_struct
   struct mem_cgroup
   struct writeback_control

   Block:
   struct gendisk
   struct backing_dev_info
   struct bio
   struct queue_limits
   struct request_queue
   struct blkcg
   struct blkcg_policy
   struct blk_mq_hw_ctx
   struct blk_mq_tag_set
   struct blk_mq_queue_data
   struct blk_mq_ops
   struct elevator_mq_ops
   struct address_space
   struct block_device
   struct hd_struct
   struct bio_set

   Network:
   struct sk_buff
   struct sock
   struct net_device_ops
   struct xt_target
   struct dst_entry
   struct dst_ops
   struct fib_rule

   Scheduler:
   struct task_struct
   struct cfs_rq
   struct rq
   struct sched_statistics
   struct sched_entity
   struct signal_struct
   struct task_group
   struct cpuacct

   cgroup:
   struct cgroup_root
   struct cgroup_subsys_state
   struct cgroup_subsys
   struct css_set

Signed-off-by: xiejingfeng <xiejingfeng@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-08-01 11:01:51 +00:00
Hao Xu 116dd23479 ck: io_uring: support ioctl
fix #32693463

Async ioctl is necessary for some scenarios like nonblocking
single-threaded model. Currently just support BLKDISCARD op.

Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-08-01 11:01:45 +00:00
zhangliguang 90658e1450 ck: fs,ext4: remove projid limit when create hard link
fix #32824162

This is a temporary workaround plan to avoid the limitation when
creating hard link cross two projids.

Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
2022-08-01 11:01:43 +00:00
Josef Bacik 9febc9d8d2 fs: export an inode_update_time helper
commit e60feb445f upstream.

If you already have an inode and need to update the time on the inode
there is no way to do this properly.  Export this helper to allow file
systems to update time on the inode so the appropriate handler is
called, either ->update_time or generic_update_time.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-26 10:39:22 +01:00
Al Viro 40ba433a85 new helper: inode_wrong_type()
commit 6e3e2c4362 upstream.

inode_wrong_type(inode, mode) returns true if setting inode->i_mode
to given value would've changed the inode type.  We have enough of
those checks open-coded to make a helper worthwhile.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08 08:49:01 +02:00