Commit Graph

1093 Commits

Author SHA1 Message Date
Qi Zheng bc307a6c2d mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
ANBZ: #20722

commit 6375e95f38 upstream.

Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or
tcmalloc.  These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
release page table memory, which may cause huge page table memory usage.

The following is a memory usage snapshot of one process, which actually
happened on our server:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty.  For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.

As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case.  We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).

Once an empty PTE is detected, we first try to hold the pmd lock within
the pte lock.  If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).

For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.

The following pseudocode shows the effect of the optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB
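
For reference, a minimal C sketch of the test loop above (buffer layout,
flags and error handling are illustrative, not the original reproducer):

        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t chunk = 2UL << 20;          /* 2M per iteration */
                size_t len = 50UL << 30;           /* 50G mapping      */
                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (buf == MAP_FAILED)
                        return 1;

                for (;;) {
                        for (size_t i = 0; i < 1024 * 25; i++) {
                                memset(buf + i * chunk, 1, chunk);   /* touch 2M memory */
                                madvise(buf + i * chunk, chunk, MADV_DONTNEED);
                        }
                }
        }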

[Zelin Deng: change and export zap_page_range_single() in this patch so
that it can be used by external callers. This change was originally
introduced by commit 21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")]

[zhengqi.arch@bytedance.com: fix uninitialized symbol 'ptl']
  Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com
  Link: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/
Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5202
2025-05-08 08:00:29 +00:00
Linus Torvalds 85b36d22d7 mm: introduce new 'lock_mm_and_find_vma()' page fault helper
ANBZ: #20359

commit c2508ec5a5 upstream.

.. and make x86 use it.

This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.

We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.

That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.

So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.

Note: I say "common helper", but it really only handles the normal
stack-grows-down case.  Which is all architectures except for PA-RISC
and IA64.  So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.

It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.

But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.
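
As a rough sketch of how an architecture's fault handler uses the helper
(modelled on the x86 conversion; bad_area_nosemaphore(), error_code and
flags are illustrative arch-specific names):

        struct vm_area_struct *vma;

        vma = lock_mm_and_find_vma(mm, address, regs);
        if (unlikely(!vma)) {
                /* No usable vma (or expansion failed); the helper has
                 * already dropped the mmap lock on this path. */
                bad_area_nosemaphore(regs, error_code, address);
                return;
        }

        /* ... handle_mm_fault(vma, address, flags, regs) as before ... */
        mmap_read_unlock(mm);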

[feng: during backporting to 5.10, slight change to the code location
 and function parameter name in mm/memory.c and arch/x86/mm/fault.c]

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5085
2025-04-22 09:42:28 +00:00
Dan Carpenter 16e11c5c5a mm/slab: make __free(kfree) accept error pointers
ANBZ: #20392

commit cd7eb8f83f upstream.

Currently, if an automatically freed allocation is an error pointer, that
will lead to a crash.  An example of this is in wm831x_gpio_dbg_show().

   char *label __free(kfree) = gpiochip_dup_line_label(chip, i);
   if (IS_ERR(label)) {
           dev_err(wm831x->dev, "Failed to duplicate label\n");
           continue;
   }

The auto clean up function should check for error pointers as well,
otherwise we're going to keep hitting issues like this.
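
A hedged sketch of the change to the kfree cleanup helper (close to the
upstream diff in include/linux/slab.h):

        /* Before: only NULL was skipped. */
        DEFINE_FREE(kfree, void *, if (_T) kfree(_T))

        /* After: error pointers are skipped as well. */
        DEFINE_FREE(kfree, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T))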

Fixes: 54da6a0924 ("locking: Introduce __cleanup() based infrastructure")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Acked-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5060
2025-04-16 01:34:16 +00:00
Dan Williams 034ce2ec0f mm/slab: Add __free() support for kvfree
ANBZ: #20392

commit a67d74a4b1 upstream.

Allow for the declaration of variables that trigger kvfree() when they
go out of scope. The check for NULL and call to kvfree() can be elided
by the compiler in most cases, otherwise without the NULL check an
unnecessary call to kvfree() may be emitted. Peter proposed a comment
for this detail [1].

[ Zelin Deng: kvfree() is not in slab.h, so just modify it in mm.h ]
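
A hedged sketch of the declaration and a typical use (the variable names
and size argument below are illustrative):

        DEFINE_FREE(kvfree, void *, if (_T) kvfree(_T))

        /* buf is kvfree()d automatically when it goes out of scope. */
        void *buf __free(kvfree) = kvmalloc(size, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;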

Link: http://lore.kernel.org/r/20230816103102.GF980931@hirez.programming.kicks-ass.net [1]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Acked-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5060
2025-04-16 01:34:16 +00:00
Peter Xu 58d47d7e0b mm: introduce page_needs_cow_for_dma() for deciding whether cow
ANBZ: #12593

commit 97a7e4733b upstream

We've got quite a few places (pte, pmd, pud) that explicitly checked
against whether we should break the cow right now during fork().  It's
easier to provide a helper, especially before we work the same thing on
hugetlbfs.

Since we'll reference is_cow_mapping() in mm.h, move it there too.
Actually it suits mm.h more since internal.h is mm/ only, but mm.h is
exported to the whole kernel.  With that, we should expect another patch
to use is_cow_mapping() wherever we can across the kernel, since we use
it quite a lot but it's always done with raw code against VM_* flags.
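
For reference, a sketch of the new helper and the is_cow_mapping() check
it builds on (close to the upstream definitions of that time, shown here
only for context):

        static inline bool is_cow_mapping(vm_flags_t flags)
        {
                return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
        }

        /* Should fork() break COW for this page, because it may be pinned for DMA? */
        static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
                                                  struct page *page)
        {
                if (!is_cow_mapping(vma->vm_flags))
                        return false;
                if (!atomic_read(&vma->vm_mm->has_pinned))
                        return false;
                return page_maybe_dma_pinned(page);
        }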

Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4271
2024-12-26 02:30:39 +00:00
zhongjiang-ali 39dbfa37de anolis: virtio-mem: make memmap on movable memory can be offlined
ANBZ: #12923

online memory will allocate memory for struct page from
normal/dma32 zone by virtio-mem. it will trigger oom
when online memory will take up a lot of kernel memory.

It will fails to offline the memmap memory in movable node when
memmap_on_memory enable, which will result in memory waste.

The patch will finish the function to offline the memmap when the
other memory of memory memblock has been offlined, the memmap also
will offlined.

Signed-off-by: hr567 <hr567@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4359
2024-12-25 11:04:18 +08:00
Yuanhe Shu 427511b962 anolis: mm: set default value for min_cache_kbytes
ANBZ: #11569

Set the default value for min_cache_kbytes to resolve the issue
of not triggering OOM.

min_cache_kbytes would be set to:
150M,  when total memory is (  0,   4G]
300M,  when total memory is ( 4G,   8G]
400M,  when total memory is ( 8G,  16G]
500M,  when total memory is (16G, 128G]
1024M, when total memory is above 128G

Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4044
2024-10-30 09:44:07 +00:00
Qinyun Tan 92598db507 anolis: mm: introduce vma flag VM_PGOFF_IS_PFN
ANBZ: #10842

Introduce a new vma flag 'VM_PGOFF_IS_PFN', which means
vma->vm_pgoff == pfn. This allows us to directly obtain pfn
through vma->vm_pgoff. No Functional Change.

Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3834
2024-09-23 07:02:48 +00:00
Kirill A. Shutemov deadc95d12 efi/unaccepted: Make sure unaccepted table is mapped
ANBZ: #9469

commit 8dbe33956d upstream

Unaccepted table is now allocated from EFI_ACPI_RECLAIM_MEMORY. It
translates into E820_TYPE_ACPI, which is not added to memblock and
therefore not mapped in the direct mapping.

This causes a crash on the first touch of the table.

Use memblock_add() to make sure that the table is mapped in direct
mapping.

Align the range to the nearest page borders. Ranges smaller than page
size are not mapped.

[ Zelin Deng: Add PAGE_ALIGN_DOWN ]

Fixes: e7761d827e ("efi/unaccepted: Use ACPI reclaim memory for unaccepted memory table")
Reported-by: Hongyu Ning <hongyu.ning@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
2024-07-10 01:38:34 +00:00
Kirill A. Shutemov 7417dc9cb9 efi/libstub: Implement support for unaccepted memory
ANBZ: #9469

commit 745e3ed85f upstream

UEFI Specification version 2.9 introduces the concept of memory
acceptance: Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.

Accepting memory is costly and it makes VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until memory is needed. It lowers boot time and reduces
memory overhead.

The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.

Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 (or whatever the arch uses) has to be modified on every
page acceptance. It leads to table fragmentation and there's a limited
number of entries in the e820 table.

Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents a
naturally aligned power-2-sized region of address space -- unit.

For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB of
physical address space.

In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
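
The arithmetic behind those figures (assuming 4KiB pages and one bit per
2MiB unit):

        4KiB bitmap = 4096 * 8 = 32768 bits -> 32768 * 2MiB = 64GiB tracked
        4PiB / 2MiB = 2^31 bits = 2^28 bytes = 256MiB of bitmap (worst case)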

Any unaccepted memory that is not aligned to unit_size gets accepted
upfront.

The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via EFI configuration table. allocate_e820() allocates the
bitmap if unaccepted memory is present, according to the size of
unaccepted region.

[ Zelin Deng: fix conflicts and add missing function ]

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230606142637.5171-4-kirill.shutemov@linux.intel.com
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
2024-07-10 01:38:34 +00:00
Kirill A. Shutemov fae8454149 mm: Add support for unaccepted memory
ANBZ: #9469

commit dcdfdd40fa upstream

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have runtime cost once the system is booted. The downside is
    very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond that point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time kernel steps
    onto a new memory block. The spikes will go away once workload data
    set size gets stabilized or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires functional scheduler, but the
    page allocator may need to tap into unaccepted memory before that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to pre-accepted pool of memory.

  - page allocator accepts memory on the first allocation of the page.
    When kernel runs out of accepted memory, it accepts memory until the
    high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.

[ Zelin Deng: As the 2 helpers above have not been implemented in
  this commit, add fake implementations for them to avoid build errors
]
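
A hedged sketch of the two helpers and the kind of no-op fallbacks
described in the note above (the config guard and exact placement are
illustrative):

        #ifdef CONFIG_UNACCEPTED_MEMORY
        void accept_memory(phys_addr_t start, phys_addr_t end);
        bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end);
        #else
        static inline void accept_memory(phys_addr_t start, phys_addr_t end) { }
        static inline bool range_contains_unaccepted_memory(phys_addr_t start,
                                                            phys_addr_t end)
        {
                return false;
        }
        #endif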

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>	# memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
2024-07-10 01:38:34 +00:00
Guixin Liu 3e6454be7f anolis: kabi: revert all kabi_use
ANBZ: #9320

Because we don't maintain kABI consistency between major versions now,
revert the kABI uses; these reserved spaces are intended to ensure kABI
compatibility between minor versions and their corresponding major
version.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3344
2024-06-28 05:46:26 +00:00
Xu Yu bd617fc27c anolis: mm: support fast reflink
ANBZ: #7740

Reflink takes too much time doing copy-on-write on the dirty pages.
Redis blocks its progress to wait for completion, which is not
acceptable for us when productizing the feature.

This patch handles the copy-on-write in an asynchronous thread, and does
not handle PMD-aligned entries directly, leaving them behind for the
background work.

[ Xu Yu:
    - Employ async fork framework.
    - Fix wrong usage of pmd_write.
    - Move work_struct of fast reflink from vm_area_struct to
      address_space, and update corresponding routines.
]

Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2794
2024-03-01 01:09:03 +00:00
Xu Yu c27dee77b3 anolis: async fork: abstract a generic framework
ANBZ: #7740

A new functionality, akin to that of async fork, will be introduced to
manipulate PMDs and VMAs.

Therefore, this abstracts the implementation of async fork into a generic
framework.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2794
2024-03-01 01:09:03 +00:00
Rongwei Wang d35e76c36b anolis: mm: pgtable_share: switch original mm to charge memcg
ANBZ: #6632

Before this patch, it was incorrect to charge the memcg with the shadow
mm, because the shadow mm has no memcg it belongs to.

Switch to the original mm when charging the memcg.

Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
2023-12-29 12:54:08 +00:00
Rongwei Wang e7de7cbc6e anolis: mm: pgtable_share: introduce CONFIG_PAGETABLE_SHARE
ANBZ: #6632

CONFIG_PAGETABLE_SHARE is introduced to select whether this feature is
enabled. It needs to select CONFIG_ARCH_USES_HIGH_VMA_FLAGS.

Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
2023-12-29 12:54:08 +00:00
Khalid Aziz 0967bd180a mm/ptshare: Add vm flag for shared PTE
ANBZ: #6632

cherry-picked from https://lore.kernel.org/linux-mm/a1bc313a3f50f085ab9fcd9826b3cdc20bc15269.1682453344.git.khalid.aziz@oracle.com/

Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
a function to determine if a vma shares PTE by checking this flag.
This is to be used to find the shared page table entries on page fault
for vmas sharing PTE.

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
2023-12-29 12:54:08 +00:00
Shiyang Ruan f464814a82 mm: introduce mf_dax_kill_procs() for fsdax case
ANBZ: #7740

commit c36e202495 upstream.

This new function is a variant of mf_generic_kill_procs that accepts a
file, offset pair instead of a struct to support multiple files sharing a
DAX mapping.  It is intended to be called by the file systems as part of
the memory_failure handler after the file system performed a reverse
mapping from the storage address to the file and file offset.

Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Gao Xiang: adapt ANCK 5.10 codebase. ]
[ Gao Xiang: folded parts of upstream commit 6a8e0596f0. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2504
2023-12-26 09:51:37 +00:00
Shakeel Butt c7e9593678 mm: memcontrol: account pagetables per node
ANBZ: #6706

commit f0c0c115fb upstream.

For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups.  However at
the moment, the pagetables are accounted per-zone.  Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.

[akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]

Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Bo Liu <boliu@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2248
2023-10-08 09:52:14 +00:00
Guixin Liu 4dba5ccef7 anolis: kabi: Reserve some space for kABI stability
ANBZ: #3879

Reserve specific fields for kABI-related structs based on the upstream
struct sizes.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2210
2023-09-20 13:21:10 +00:00
Dave Chinner a040c4fe55 mm: Add kvrealloc()
ANBZ: #6600

commit de2860f463 upstream.

During log recovery of an XFS filesystem with 64kB directory
buffers, rebuilding a buffer split across two log records results
in a memory allocation warning from krealloc like this:

xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
XFS (dm-0): Unmounting Filesystem
XFS (dm-0): Mounting V5 Filesystem
XFS (dm-0): Starting recovery (logdev: internal)
------------[ cut here ]------------
WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
.....
RIP: 0010:get_page_from_freelist+0xdee/0xe40
Call Trace:
 ? complete+0x3f/0x50
 __alloc_pages+0x16f/0x300
 alloc_pages+0x87/0x110
 kmalloc_order+0x2c/0x90
 kmalloc_order_trace+0x1d/0x90
 __kmalloc_track_caller+0x215/0x270
 ? xlog_recover_add_to_cont_trans+0x63/0x1f0
 krealloc+0x54/0xb0
 xlog_recover_add_to_cont_trans+0x63/0x1f0
 xlog_recovery_process_trans+0xc1/0xd0
 xlog_recover_process_ophdr+0x86/0x130
 xlog_recover_process_data+0x9f/0x160
 xlog_recover_process+0xa2/0x120
 xlog_do_recovery_pass+0x40b/0x7d0
 ? __irq_work_queue_local+0x4f/0x60
 ? irq_work_queue+0x3a/0x50
 xlog_do_log_recovery+0x70/0x150
 xlog_do_recover+0x38/0x1d0
 xlog_recover+0xd8/0x170
 xfs_log_mount+0x181/0x300
 xfs_mountfs+0x4a1/0x9b0
 xfs_fs_fill_super+0x3c0/0x7b0
 get_tree_bdev+0x171/0x270
 ? suffix_kstrtoint.constprop.0+0xf0/0xf0
 xfs_fs_get_tree+0x15/0x20
 vfs_get_tree+0x24/0xc0
 path_mount+0x2f5/0xaf0
 __x64_sys_mount+0x108/0x140
 do_syscall_64+0x3a/0x70
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Essentially, we are taking a multi-order allocation from kmem_alloc()
(which has an open coded no fail, no warn loop) and then
reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
then triggering the above warning.

This is a regression caused by converting this code from an open
coded no fail/no warn reallocation loop to using __GFP_NOFAIL.

What we actually need here is kvrealloc(), so that if contiguous
page allocation fails we fall back to vmalloc() and we don't
get nasty warnings happening in XFS.
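
A hedged sketch of the new helper's interface and intended use (the old
size is passed so the vmalloc fallback knows how much data to copy; the
variable names below are illustrative):

        void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags);

        /* Grow a log-recovery buffer; falls back to vmalloc() on failure. */
        ptr = kvrealloc(old_ptr, old_len, new_len, GFP_KERNEL);
        if (!ptr)
                return -ENOMEM;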

Fixes: 771915c4f6 ("xfs: remove kmem_realloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2199
2023-09-20 02:27:53 +00:00
Ning Zhang 3f29353eb6 anolis: mm, kidled: set age value in the flags of head pages
ANBZ: #6572

Patch series "Free some vmemmap pages of HugeTLB page" make
the vmemmap pages of tail pages readonly, which will result
in kernel panic when kidled store the age value into the flags
of tail pages.

In the patch, we only store the age value in the head pages.

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2181
2023-09-14 06:47:04 +00:00
Kaihao Bai ff52fd3fd0 anolis: mm: limit use confliction between kidled/coldpgs and mglru
ANBZ: #6082

Kidled and MGLRU share 8 bits in page flags, and the redundant bits of
MGLRU are set as an unavailable offset.

Within the lifetime of the operating system, only one of the two
components (kidled/MGLRU) is allowed to be started or shut down at the
kernel level.

Reuse the reset logic (traverse all pages, clear the page-flags offset)
to enable one component after closing the other.

Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2051
2023-08-15 15:28:16 +08:00
Yu Zhao 4ce21077fe mm: multi-gen LRU: groundwork
ANBZ: #6082

commit ec1c86b25f upstream

Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
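
A rough illustration of that truncation (close to the upstream helper):

        /* Map a monotonically increasing seq to a gen counter in folio->flags. */
        static inline int lru_gen_from_seq(unsigned long seq)
        {
                return seq % MAX_NR_GENS;
        }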

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations.  They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim.  These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without.  For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults".  Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting.  The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction.  The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then.  This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2051
2023-08-15 15:26:43 +08:00
Feiyang Chen 7aaddcfff2 LoongArch: Add sparse memory vmemmap support
ANBZ: #5200

commit 7b09f5af01 upstream.

Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP
uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn
operations. This is the most efficient option when sufficient kernel
resources are available.

Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn
Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xuefeng Li <lixuefeng@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Juxin Gao <gaojuxin@loongson.cn>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1830
2023-07-07 01:46:16 +00:00
Ning Zhang 6cd7711cfe anolis: mm, thp: introduce thp zero subpages reclaim
ANBZ: #5310

Transparent huge pages could reduce the number of address translation
misses, which could improve performance for applications. But one
concern is that THP may lead to memory bloat, which may cause OOM. The
reason is that a huge page may contain some zero subpages which the user
never really accessed.

This patch introduces a mechanism to reclaim these zero subpages; it
works when memory pressure is high. We first estimate whether a huge page
contains enough zero subpages, then try to split it and reclaim them.

We add an interface 'thp_reclaim' to turn this on or off in a memory cgroup:
  echo reclaim > memory.thp_reclaim to enable reclaim mode.
  echo swap > memory.thp_reclaim to enable swap mode.
  echo disable > memory.thp_reclaim to disable.

Or you can configure it by the global interface:
  /sys/kernel/mm/transparent_hugepage/reclaim

The default mode is inherited from its parent:
  1. The reclaim mode means the huge page will be split and the zero
     subpages will be reclaimed.
  2. The swap mode is reserved for combining with swap, for example,
     zswap and zram.

Meanwhile, we add an interface thp_reclaim_stat to show the stats of
each node in the memory cgroup:
  /sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_stat

Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1662
2023-06-15 11:04:37 +00:00
Muchun Song 9e776aa029 mm: hugetlb_vmemmap: cleanup CONFIG_HUGETLB_PAGE_FREE_VMEMMAP*
ANBZ: #4794

commit 47010c040d upstream

The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize".  In this patch , cheanup configs to make code more
expressive.

Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2023-04-21 14:44:21 +08:00
Muchun Song de5587db3b mm: sparsemem: move vmemmap related to HugeTLB to CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
ANBZ: #4794

commit e540841734 upstream

The vmemmap_remap_free/alloc functions are relevant to HugeTLB, so move
those functions into the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.

Link: https://lkml.kernel.org/r/20211101031651.75851-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2023-04-21 14:44:20 +08:00
Muchun Song b41bc642bc mm: sparsemem: split the huge PMD mapping of vmemmap pages
ANBZ: #4794

commit 3bc2b6a725 upstream

Patch series "Split huge PMD mapping of vmemmap pages", v4.

In order to reduce the difficulty of code review in series[1].  We disable
huge PMD mapping of vmemmap pages when that feature is enabled.  In this
series, we do not disable huge PMD mapping of vmemmap pages anymore.  We
will split huge PMD mapping when needed.  When HugeTLB pages are freed
from the pool we do not attempt to coalesce and move back to a PMD mapping
because it is much more complex.

[1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/

This patch (of 3):

In [1], PMD mappings of vmemmap pages were disabled if the feature
hugetlb_free_vmemmap was enabled.  This was done to simplify the initial
implementation of vmmemap freeing for hugetlb pages.  Now, remove this
simplification by allowing PMD mapping and switching to PTE mappings as
needed for allocated hugetlb pages.

When a hugetlb page is allocated, the vmemmap page tables are walked to
free vmemmap pages.  During this walk, split huge PMD mappings to PTE
mappings as required.  In the unlikely case PTE pages can not be
allocated, return error(ENOMEM) and do not optimize vmemmap of the hugetlb
page.

When HugeTLB pages are freed from the pool, we do not attempt to
coalesce and move back to a PMD mapping because it is much more complex.

[1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com

Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2023-04-21 14:44:20 +08:00
Muchun Song 6c90fde256 mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
ANBZ: #4794

commit ad2fa3717b upstream

When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it.  However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure.  In
this case, we just refuse to free the HugeTLB page.  This changes behavior
in some corner cases as listed below:

 1) Failing to free a huge page triggered by the user (decrease nr_pages).

    User needs to try again later.

 2) Failing to free a surplus huge page when freed by the application.

    Try again later when freeing a huge page next time.

 3) Failing to dissolve a free huge page on ZONE_MOVABLE via
    offline_pages().

    This can happen when we have plenty of ZONE_MOVABLE memory, but
    not enough kernel memory to allocate vmemmmap pages.  We may even
    be able to migrate huge page contents, but will not be able to
    dissolve the source huge page.  This will prevent an offline
    operation and is unfortunate as memory offlining is expected to
    succeed on movable zones.  Users that depend on memory hotplug
    to succeed for movable zones should carefully consider whether the
    memory savings gained from this feature are worth the risk of
    possibly not being able to offline memory in certain situations.

 4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
    alloc_contig_range() - once we have that handling in place. Mainly
    affects CMA and virtio-mem.

    Similar to 3). virtio-mem will handle migration errors gracefully.
    CMA might be able to fallback on other free areas within the CMA
    region.

Vmemmap pages are allocated from the page freeing context.  In order for
those allocations to be not disruptive (e.g.  trigger oom killer)
__GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure.  GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.

[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
  Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2023-04-21 14:44:19 +08:00
Muchun Song 8ae06854e4 mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
ANBZ: #4794

commit f41f2ed43c upstream

Every HugeTLB has more than one struct page structure.  We __know__ that
we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
store metadata associated with each HugeTLB.

There are a lot of struct page structures associated with each HugeTLB
page.  For tail pages, the value of compound_head is the same.  So we can
reuse the first page of the tail page structures.  We map the virtual addresses of
the remaining pages of tail page structures to the first tail page struct,
and then free these page frames.  Therefore, we need to reserve two pages
as vmemmap areas.
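
A worked example of the savings, assuming x86_64 with 4KiB base pages and
a 64-byte struct page:

        2MiB HugeTLB page: 512 struct pages * 64B = 8 vmemmap pages,
                           2 reserved, 6 can be freed per HugeTLB page
        1GiB HugeTLB page: 262144 struct pages * 64B = 4096 vmemmap pages,
                           2 reserved, 4094 can be freed per HugeTLB page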

When we allocate a HugeTLB page from the buddy, we can free some vmemmap
pages associated with each HugeTLB page.  It is more appropriate to do it
in the prep_new_huge_page().

The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
associated with a HugeTLB page can be freed, returns zero for now, which
means the feature is disabled.  We will enable it once all the
infrastructure is there.

[willy@infradead.org: fix documentation warning]
  Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org

Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2023-04-21 14:44:19 +08:00
Guixin Liu 2af3c87980 anolis: kabi: Reserve some fields
ANBZ: #3879

Reserve specific fields for kABI-related structs based on the upstream
struct sizes.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1148
2023-02-24 07:45:32 +00:00
Xu Yu 7cba848719 anolis: mm: introduce vm_insert_page(s)_mkspecial
ANBZ: #3289

This adds the ability to insert anonymous pages or file pages, used for
direct IO or buffer IO respectively, to a user VM. The intention behind
this is to facilitate mapping pages in IO requests to user space, which
is usually the backend of a remote block device.

This integrates the advantage of vm_insert_pages (batching the pmd lock),
and eliminates the overhead of remap_pfn_range (track_pfn_remap), since
the pages to be inserted should always be ram.

NOTE that it is the caller's responsibility to ensure the validity of
pages to be inserted, i.e., that such pages are used for IO requests.
Depending on this premise, such pages can be inserted as special PTE,
without increasing the page refcount and mapcount. On the other hand,
the special mapping should be carefully managed (e.g., zapped) when the
IO request is done.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1255
2023-02-24 03:12:49 +00:00
Xu Yu 48da3ae5a7 anolis: mm: rename fast_copy_mm interface to async_fork
ANBZ: #4116

This renames the "fast copy mm" feature to "async fork", in order
to reveal its true purpose explicitly.

The interface is changed as a result. Now, user can enable it by:

  # echo 1 > /path/to/memcg/memory.async_fork

And disable it by:

  # echo 0 > /path/to/memcg/memory.async_fork

The function itself is not changed.

Note again that this only introduces the interfaces and stubs of
async fork.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1219
2023-02-21 06:53:52 +00:00
Guixin Liu a027e85f84 anolis: kabi: Replace hotfix reserve macros with KABI reserve macros
ANBZ: #3879

Unify hotfix reserve macros and KABI reserve macros, now KABI reserve
macros are used for both KABI and hotfix.

Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1087
2023-01-31 13:20:23 +00:00
Tony Luck 10c6e7aa17 mm, hwpoison: when copy-on-write hits poison, take page offline
ANBZ: #3829

commit d302c2398b upstream.

Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.

It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.

Use memory_failure_queue() to request a call to memory_failure() for the
page with the error.

Also provide a stub version for CONFIG_MEMORY_FAILURE=n
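
A sketch of the interface and the CONFIG_MEMORY_FAILURE=n stub mentioned
above (close to the upstream declaration):

        #ifdef CONFIG_MEMORY_FAILURE
        void memory_failure_queue(unsigned long pfn, int flags);
        #else
        static inline void memory_failure_queue(unsigned long pfn, int flags) { }
        #endif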

Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Shuai Xue: fix conflict by moving memory_failure_queue into CONFIG_MEMORY_FAILURE ]
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1079
2023-01-30 03:09:42 +00:00
Simon Guo 0c1d004895 anolis: mm: gup:remove FOLL_GET_PGSTABLE flag checking
ANBZ: #3400

When gup of normal memory via DAX was supported, a FOLL_GET_PGSTABLE
flag was added to restrict the pg_stable_pfn() usage in GUP to KVM only.
But it is not right to assume that only KVM will use GUP, given that the
memory is shared by various tasks such as IO threads.

Actually, if a DAX gup page is not in ZONE_DEVICE, it is safe to decide
that the page has a stable page struct.

This patch removes the check of the FOLL_GET_PGSTABLE flag.

Fixes: fea195bb4f ("anolis: mm: gup: allow to follow _PAGE_DEVMAP && !ZONE_DEVICE pages optionally")
Signed-off-by: Simon Guo <wei.guo.simon@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/991
2022-12-09 10:43:10 +00:00
Kaihao Bai 09797fd32d anolis: mm: add switch for file zero page
ANBZ: #2510

On top of filling file holes with the zero page, a switch is added so
the feature can be turned on or off at runtime. When it is turned off,
all zero page mappings are evicted to ensure correctness.

Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/785
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2022-11-21 06:11:11 +00:00
Kaihao Bai 3d89f15277 anolis: mm: avoid MMAP_POPULATE filling zero page
ANBZ: #2510

When mmapping a file with MMAP_POPULATE/MMAP_LOCK, the page cache should
be pre-allocated, so filling the zero page should be avoided.

Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/785
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2022-11-21 06:11:11 +00:00
Zhu Yanhai ba809e4874 anolis: mm: gup: allow to follow _PAGE_DEVMAP && !ZONE_DEVICE pages optionally
ANBZ: #2586

Calling get_user_pages() on a range of user memory that has been mmaped from a
DAX file will fail when there is no 'struct page' to describe those pages.
This problem has been addressed in some device drivers by adding optional struct
page support for pages under the control of the driver. To be specific, gup would
query the pgmap_radix tree first and acquire a reference on a pgmap instance
before it ever touches the page. This is to be sure the struct page won't be
released under the nose of gup. And to add any memory into pgmap_radix, it
must not overlap with the existing PFNs.

However, there are cases where users need to map normal memory via DAX,
without losing gup support. Usually this could be done with a memmap
parameter appended to the kernel cmdline to produce a fake PMEM device, which
is not flexible and sometimes impractical (imagine how you could expose a BRD
via DAX? A BRD device can't be stacked onto another PMEM device anyway).

Given that all gup really cares about is making sure the page exists,
and if the struct pages behind the DAX entries (_PAGE_DEVMAP entries) are
actually from normal memory, gup won't need to bother whether they will be
released on the fly. Therefore, it should be fine to fix the code like below,

static inline bool pg_stable_pfn(unsigned long pfn)
{
	int nid = pfn_to_nid(pfn);
	struct zone *zone_device = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];

	if (zone_spans_pfn(zone_device, pfn))
		return false;
	else
		return true;
}

[and later in gup]
	page = vm_normal_page(vma, address, pte);
	if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
		if (pg_stable_pfn(pte_pfn(pte)))
			page = pte_page(pte);
		else {
			/*
			 * Only return device mapping pages in the FOLL_GET case
			 * since they are only valid while holding the pgmap
			 * reference.
			 */
			pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
			if (pgmap)
				page = pte_page(pte);
			else
				goto no_page;

		}
[cut here]

Nevertheless, it is too risky to expose pages of this kind to the rest of the
world all at once. So this patch proposes a FOLL_GET_PGSTABLE flag to be used
together with FOLL_GET. The purpose is to let selected users of gup (KVM in
this patch) take the shot first, while leaving the others as they were.

Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Signed-off-by: Simon Guo <wei.guo.simon@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/865
2022-11-17 02:08:13 +00:00
Xu Yu b56e2f16a5 anolis: mm: fcm: do not unconditionally call fixup ops
ANBZ: #1881

The fcm (fast copy mm) feature places stubs in the mm core path, whose
overhead should be negligible.

However, fcm_fixup_{vma,pmd} is unconditionally called whenever a
vma or pmd is about to be changed. This results in a performance
degradation of up to 10% in the will-it-scale/mmap testcase.

This fixes the performance degradation by adding a static key. There is
no need to call fixup ops when the static key is disabled.
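
A minimal sketch of the static-key gating described above (the key and
the wrapper are illustrative, not the actual fcm symbols; fcm_fixup_vma()
is the stub named in this series):

	#include <linux/jump_label.h>

	static DEFINE_STATIC_KEY_FALSE(fcm_enabled_key);

	static inline void fcm_fixup_vma_hook(struct vm_area_struct *vma)
	{
		/* Compiles down to a single nop in the hot path while off. */
		if (static_branch_unlikely(&fcm_enabled_key))
			fcm_fixup_vma(vma);
	}

Enabling the feature flips the key with static_branch_enable(), so the
fixup ops are only reachable when fcm is actually in use.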

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/665
2022-08-25 12:09:50 +08:00
Xu Yu 91acfc5102 anolis: Revert "anolis: mm: fcm: do not unconditionally call fcm_fixup_vma"
ANBZ: #1881

This reverts commit 166322a4cb.

The reverted patch does not solve the performance degradation
completely. Therefore, revert it; a better solution is provided in
the next patch.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/665
2022-08-25 12:09:15 +08:00
Xu Yu 166322a4cb anolis: mm: fcm: do not unconditionally call fcm_fixup_vma
ANBZ: #1881

The fcm (fast copy mm) feature places stubs in the mm core path, whose
overhead should be negligible.

However, the fcm_fixup_vma stub is unconditionally called whenever a vma
is about to be changed. This results in a performance degradation of up
to 10% in the will-it-scale/mmap testcase.

This fixes the performance degradation by adding proper conditions
before calling fcm_fixup_vma.

Fixes: 4995c319c5 ("anolis: mm: fcm: introduce fast copy mm interface")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/625
2022-08-11 11:05:18 +00:00
Xu Yu 4995c319c5 anolis: mm: fcm: introduce fast copy mm interface
ANBZ: #1745

Under normal circumstances, the parent is blocked synchronously in
copy_mm() when fork(2) is called, and the duration may be long when
memory usage is large. This makes the parent appear out of service from
the user's point of view. For example, when a redis instance creates a
memory snapshot based on fork(2), it easily produces latency spikes.

Fast copy mm is designed to reduce the blocking duration when the parent
process calls fork(2). It can be controlled per memcg and can be switched
on/off on the fly. It is disabled by default; a user can enable it by:

  # echo 1 > /path/to/memcg/memory.fast_copy_mm

And disable it by:

  # echo 0 > /path/to/memcg/memory.fast_copy_mm

Child memcg will inherit parent memcg's configuration.

This introduces the interfaces and stubs of fast copy mm, which split
the copy_page_range() function into a fast path and a slow path. The
parent is expected to execute only the fast path in fork(2), while the
child process executes the slow path before returning to user mode.
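
A control-flow sketch of that split; all helper names here are
hypothetical, not the actual fcm interface:

	/* Parent side, called during fork(2). */
	static int fcm_dup_mm_sketch(struct mm_struct *mm, struct mm_struct *oldmm)
	{
		if (fcm_enabled(oldmm))
			/* Fast path only: defer most of the PTE copying. */
			return fcm_copy_page_range_fast(mm, oldmm);

		/* fcm disabled: full synchronous copy, as before. */
		return fcm_copy_page_range_slow(mm, oldmm);
	}

The child would then run the deferred slow-path copy on its own mm from a
hook placed before it returns to user mode.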

Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/578
2022-08-02 16:45:14 +08:00
Tony Luck b306cfc49c x86/sgx: Hook arch_memory_failure() into mainline code
ANBZ: #1345

commit 03b122da74 upstream.

Add a call inside memory_failure() to call the arch specific code
to check if the address is an SGX EPC page and handle it.

Note the SGX EPC pages do not have a "struct page" entry, so the hook
goes in at the same point as the device mapping hook.

Pull the call to acquire the mutex earlier so the SGX errors are also
protected.

Make set_mce_nospec() skip SGX pages when trying to adjust
the 1:1 map.
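
For reference, the shape of the arch hook as described above is roughly
the following (see the upstream commit for the exact code):

	/* Default no-op, overridden by x86 when SGX is enabled. */
	#ifndef arch_memory_failure
	static inline int arch_memory_failure(unsigned long pfn, int flags)
	{
		return -ENXIO;	/* not an arch-owned page such as SGX EPC */
	}
	#endif

memory_failure() tries this hook for pfns that have no struct page,
before falling back to the device mapping handling.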

Intel-SIG: commit 03b122da74 x86/sgx: Hook arch_memory_failure()
into mainline code.
Backport for SGX MCA recovery co-existence support.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.com
[ Zhiquan: amend commit log ]
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
2022-08-02 16:39:08 +08:00
Liam Howlett b5ed53c684 mm: add vma_lookup(), update find_vma_intersection() comments
ANBZ: #1324

commit ce6d42f2e4 upstream.

Patch series "mm: Add vma_lookup()", v2.

This patch (of 22):

Many places in the kernel use find_vma() to get a vma and then check the
start address of the vma to ensure the next vma was not returned.

Other places use the find_vma_intersection() call with addr, addr + 1 as
the range; looking for just the vma at a specific address.

The third use of find_vma() is by developers who do not know that the
function starts searching at the provided address upwards for the next
vma.  This results in a bug that is often overlooked for a long time.

Adding the new vma_lookup() function will allow for cleaner code by
removing the find_vma() calls which check limits, making
find_vma_intersection() calls of a single address to be shorter, and
potentially reduce the incorrect uses of find_vma().

Also change find_vma_intersection() comments and declaration to be of the
correct length and add kernel documentation style comment.
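
For illustration, the pattern being cleaned up versus the new helper
(use(), mm and addr are placeholders; vma_lookup() is the helper added
here):

	struct vm_area_struct *vma;

	/* Old pattern: find_vma() may return the next vma above addr,
	 * so every caller has to re-check the start address. */
	vma = find_vma(mm, addr);
	if (vma && vma->vm_start <= addr)
		use(vma);

	/* New pattern: vma_lookup() returns the vma covering addr, or NULL. */
	vma = vma_lookup(mm, addr);
	if (vma)
		use(vma);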

Intel-SIG: commit ce6d42f2e4  mm: add vma_lookup(), update
find_vma_intersection() comments.
Backport for SGX virtualization support.

Link: https://lkml.kernel.org/r/20210521174745.2219620-1-Liam.Howlett@Oracle.com
Link: https://lkml.kernel.org/r/20210521174745.2219620-2-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
[ Zhiquan: amend commit log ]
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
2022-08-02 16:35:02 +08:00
Xu Yu 03909fbc5e anolis: mm: duptext: create duplicated page for remote access
ANBZ: #32

NOTE: this is an experimental feature.

This duplicates pages synchronously when accessing the remote page cache.
Specifically, it creates the duplicated pages in do_read_fault() and
filemap_map_pages().

Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
2022-08-01 12:00:46 +00:00
Gang Deng 03ffd42251 anolis: mm, kidled: fix race when free idle age
fix #36837630

When building the kernel with the idle age not stored in the page flags,
the kernel will panic as below:

[   13.977004] BUG: unable to handle kernel paging request at ffffc90000eba2b9
[   13.978021] PGD 13ad35067 P4D 13ad35067 PUD 13ad36067 PMD 139b88067 PTE 0
[   13.979014] Oops: 0002 [#1] SMP PTI
[   13.979533] CPU: 12 PID: 112 Comm: kidled Not tainted 4.19.91+ #586
[   13.980450] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[   13.982136] RIP: 0010:free_pcp_prepare+0x49/0xc0
[   13.982945] Code: 44 00 00 48 8b 15 1f 9d 13 01 48 8b 0d f8 9c 13 01 48 b8 00 00 00 00 00 16 00 00 48 01 d8 48 c1 f8 06 48 85 0
[   13.985674] RSP: 0018:ffffc900003ffe20 EFLAGS: 00010202
[   13.986429] RAX: 00000000001352b9 RBX: ffffea0004d4ae80 RCX: 0000000000000001
[   13.987468] RDX: ffffc90000d85000 RSI: 0000000000000000 RDI: ffffea0004d4ae80
[   13.988504] RBP: ffffea0004d4ae80 R08: ffffc90000ec6000 R09: 0000000000000000
[   13.989534] R10: 0000000000008e1c R11: ffffffff828c1b6d R12: ffffc90000d85000
[   13.990581] R13: ffffffff82306700 R14: 0000000000000001 R15: ffff88813adbab50
[   13.991634] FS:  0000000000000000(0000) GS:ffff88813bb00000(0000) knlGS:0000000000000000
[   13.992814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   13.993648] CR2: ffffc90000eba2b9 CR3: 000000000220a006 CR4: 00000000003706e0
[   13.994681] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   13.995721] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   13.996763] Call Trace:
[   13.997137]  free_unref_page+0x11/0x60
[   13.997693]  __vunmap+0x4e/0xb0
[   13.998159]  kidled.cold+0x1b/0x53
[   13.998680]  ? __schedule+0x31c/0x6d0
[   13.999222]  ? finish_wait+0x80/0x80
[   13.999751]  ? kidled_mem_cgroup_move_stats+0x270/0x270
[   14.000514]  kthread+0x117/0x130
[   14.001006]  ? kthread_create_worker_on_cpu+0x70/0x70
[   14.001751]  ret_from_fork+0x35/0x40

This patch uses the RCU lock to close this race window: callers can only
access the idle age under the read lock, see kidled_get/set/inc_page_age().
Note that kidled and the memory hotplug process also take mem_hotplug_lock
to avoid races between allocation and free.

Since kidled_free_page_age() may sleep, call it earlier to avoid sleeping
with pgdat_resize_lock held.
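
A minimal sketch of the read-side pattern described above (the array and
its index are illustrative, not the actual kidled data structures):

	#include <linux/rcupdate.h>

	static u8 __rcu *node_page_age;	/* per-node idle-age array */

	static u8 kidled_get_age_sketch(unsigned long idx)
	{
		u8 *age;
		u8 val = 0;

		rcu_read_lock();
		age = rcu_dereference(node_page_age);
		if (age)	/* NULL once the array has been freed */
			val = age[idx];
		rcu_read_unlock();
		return val;
	}

The writer side would publish a new array with rcu_assign_pointer() and
free the old one only after a grace period has elapsed.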

Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Reviewed-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
2022-08-01 11:46:03 +00:00
Xunlei Pang 9eee07ca21 ck: mm: Pin code section of process in memory
fix #36381167

Pin a process's code sections in memory by treating the corresponding
VMAs like mlock does.

Usage:
- pin process "PID"
  echo PID > /proc/unevictable/add_pid
- unpin it
  echo PID > /proc/unevictable/del_pid
- show all pinned process pids
  cat /proc/unevictable/add_pid

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
2022-08-01 11:31:20 +00:00
Tianchen Ding 7001fcff92 ck:mm:break in readahead when multiple threads detected
to #34329845

Specific workloads that use multiple threads to read the same file
sequentially may introduce contention on a spinlock during page cache
readahead. With this change, the readahead process ends earlier when it
detects a readahead page conflict. This is effective for sequential reads
of the same file with multiple threads and a large readahead size.
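
A simplified sketch of the early-break idea; the loop is reduced to its
essentials and the in-kernel name of the sysctl knob is an assumption:

	extern int sysctl_enable_multithread_ra_boost;	/* assumed knob name */

	static void do_readahead_sketch(struct address_space *mapping,
					pgoff_t index, unsigned long nr_to_read)
	{
		unsigned long i;

		for (i = 0; i < nr_to_read; i++) {
			struct page *page = find_get_page(mapping, index + i);

			if (page) {
				/* Another thread already read this page ahead. */
				put_page(page);
				if (sysctl_enable_multithread_ra_boost)
					break;	/* stop readahead early */
				continue;	/* default: skip and keep going */
			}
			/* ...allocate the page and queue it for readahead... */
		}
	}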

The benefit can be shown in the following test by fio:
fio --name=ratest --ioengine=mmap --iodepth=1 --rw=read --direct=0
--size=512M --numjobs=32 --directory=/mnt --filename=ratestfile

Firstly, we set readahead size to its default value:
blockdev --setra 256 /dev/vdh
The result is:
Run status group 0 (all jobs):
   READ: bw=4427MiB/s (4642MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s),
io=16.0GiB (17.2GB), run=3699-3701msec
Disk stats (read/write):
  vdh: ios=22945/0, merge=0/0, ticks=12515/0, in_queue=12515,
util=50.79%

Then, we set readahead size larger:
blockdev --setra 8192 /dev/vdh
The result is:
Run status group 0 (all jobs):
   READ: bw=6450MiB/s (6764MB/s), 202MiB/s-202MiB/s (211MB/s-212MB/s),
io=16.0GiB (17.2GB), run=2535-2540msec
Disk stats (read/write):
  vdh: ios=19503/0, merge=0/0, ticks=49375/0, in_queue=49375,
util=23.67%

Finally, we set sysctl vm.enable_multithread_ra_boost=1 with ra=8192
and we get:
Run status group 0 (all jobs):
   READ: bw=7234MiB/s (7585MB/s), 226MiB/s-226MiB/s (237MB/s-237MB/s),
io=16.0GiB (17.2GB), run=2262-2265msec
Disk stats (read/write):
  vdh: ios=10548/0, merge=0/0, ticks=40709/0, in_queue=40709,
util=23.62%

Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
2022-08-01 11:31:07 +00:00