ANBZ: #20722
commit 6375e95f38 upstream.
In pursuit of high performance, applications mostly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These memory
allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical
memory, but neither MADV_DONTNEED nor MADV_FREE releases page table
memory, which may cause huge page table memory usage.
The following is a memory usage snapshot of one process, which actually
happened on our server:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).
Once an empty PTE page is detected, we first try to hold the pmd lock within
the pte lock. If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).
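The emptiness check at the heart of both paths boils down to scanning all
PTRS_PER_PTE entries; a minimal sketch is below (the helper name is
illustrative, and locking, TLB flushing and the actual freeing of the PTE
page are omitted; see the upstream patch for the real code):
static bool pte_range_is_empty(pte_t *start_pte)
{
        int i;

        for (i = 0; i < PTRS_PER_PTE; i++) {
                if (!pte_none(ptep_get(start_pte + i)))
                        return false;   /* still in use, keep the PTE page */
        }
        return true;                    /* every entry is empty, safe to reclaim */
}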
For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.
The following code snippet can show the effect of optimization:
mmap 50G
while (1) {
        for (; i < 1024 * 25; i++) {
                touch 2M memory
                madvise MADV_DONTNEED 2M
        }
}
As we can see, the memory usage of VmPTE is reduced:
                before        after
VIRT           50.0 GB      50.0 GB
RES             3.1 MB       3.1 MB
VmPTE        102640 KB       240 KB
[Zelin Deng: change and export zap_page_range_single() in this patch so
that it can be used by external callers. This change was originally
introduced by commit 21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")]
[zhengqi.arch@bytedance.com: fix uninitialized symbol 'ptl']
Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com
Link: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/
Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5202
ANBZ: #20359
commit c2508ec5a5 upstream.
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.
That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.
Note: I say "common helper", but it really only handles the normal
stack-grows-down case. Which is all architectures except for PA-RISC
and IA64. So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.
But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.
[feng: during backporting to 5.10, slight change to the code location
and function parameter name in mm/memory.c and arch/x86/mm/fault.c]
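For reference, this is roughly how an architecture's fault handler ends up
using the helper after the conversion (a simplified sketch of the x86 usage;
on failure the helper has already dropped the mmap lock):
        vma = lock_mm_and_find_vma(mm, address, regs);
        if (unlikely(!vma)) {
                bad_area_nosemaphore(regs, error_code, address);
                return;
        }
        /* ... proceed to handle_mm_fault(vma, address, flags, regs) ... */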
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5085
ANBZ: #20392
commit cd7eb8f83f upstream.
Currently, if an automatically freed allocation is an error pointer, that
will lead to a crash. An example of this is in wm831x_gpio_dbg_show():
        char *label __free(kfree) = gpiochip_dup_line_label(chip, i);
        if (IS_ERR(label)) {
                dev_err(wm831x->dev, "Failed to duplicate label\n");
                continue;
        }
The auto clean up function should check for error pointers as well,
otherwise we're going to keep hitting issues like this.
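In essence, the fix makes the kfree cleanup helper tolerate error pointers;
a sketch of the intended change (see include/linux/slab.h in the actual
patch):
DEFINE_FREE(kfree, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T))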
Fixes: 54da6a0924 ("locking: Introduce __cleanup() based infrastructure")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Acked-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5060
ANBZ: #20392
commit a67d74a4b1 upstream.
Allow for the declaration of variables that trigger kvfree() when they
go out of scope. The check for NULL and call to kvfree() can be elided
by the compiler in most cases, otherwise without the NULL check an
unnecessary call to kvfree() may be emitted. Peter proposed a comment
for this detail [1].
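Illustrative usage of the new cleanup attribute ('buf' and 'size' are
placeholders, not taken from the patch):
        void *buf __free(kvfree) = kvzalloc(size, GFP_KERNEL);

        if (!buf)
                return -ENOMEM;
        /* use buf; kvfree(buf) runs automatically when it goes out of scope */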
[ Zelin Deng: kvfree() is not declared in slab.h in this tree, so make the change in mm.h instead ]
Link: http://lore.kernel.org/r/20230816103102.GF980931@hirez.programming.kicks-ass.net [1]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Acked-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5060
ANBZ: #12593
commit 97a7e4733b upstream
We've got quite a few places (pte, pmd, pud) that explicitly checked
against whether we should break the cow right now during fork(). It's
easier to provide a helper, especially before we work the same thing on
hugetlbfs.
Since we'll reference is_cow_mapping() in mm.h, move it there too.
Actually it suits mm.h more since internal.h is mm/ only, but mm.h is
exported to the whole kernel. With that we should expect another patch to
use is_cow_mapping() whenever we can across the kernel since we do use it
quite a lot but it's always done with raw code against VM_* flags.
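For reference, the helper boils down to a single flags check (a sketch of
the definition being moved):
static inline bool is_cow_mapping(vm_flags_t flags)
{
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}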
Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4271
ANBZ: #12923
When memory is onlined by virtio-mem, the struct page metadata is
allocated from the Normal/DMA32 zone. Onlining a large amount of memory
therefore takes up a lot of kernel memory and can trigger OOM.
In addition, when memmap_on_memory is enabled, offlining the memmap
memory in a movable node fails, which results in memory waste.
This patch completes the function so that once the rest of a memory
block has been offlined, its memmap is offlined as well.
Signed-off-by: hr567 <hr567@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4359
ANBZ: #11569
Set a default value for min_cache_kbytes to resolve the issue of OOM
not being triggered.
min_cache_kbytes is set to:
   150M, when total memory is in (  0,   4G]
   300M, when total memory is in ( 4G,   8G]
   400M, when total memory is in ( 8G,  16G]
   500M, when total memory is in (16G, 128G]
  1024M, when total memory is above 128G
Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4044
ANBZ: #10842
Introduce a new vma flag 'VM_PGOFF_IS_PFN', which means
vma->vm_pgoff == pfn. This allows us to obtain the pfn directly from
vma->vm_pgoff. No functional change.
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3834
ANBZ: #9469
commit 8dbe33956d upstream
The unaccepted memory table is now allocated from EFI_ACPI_RECLAIM_MEMORY. It
translates into E820_TYPE_ACPI, which is not added to memblock and
therefore not mapped in the direct mapping.
This causes a crash on the first touch of the table.
Use memblock_add() to make sure that the table is mapped in direct
mapping.
Align the range to the nearest page borders. Ranges smaller than page
size are not mapped.
[ Zelin Deng: Add PAGE_ALIGN_DOWN ]
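A sketch of the fix described above (variable names are illustrative):
align the table outwards to page borders and add the range to memblock so
it lands in the direct mapping:
        start = PAGE_ALIGN_DOWN(unaccepted_table_paddr);
        end   = PAGE_ALIGN(unaccepted_table_paddr + unaccepted_table_size);
        memblock_add(start, end - start);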
Fixes: e7761d827e ("efi/unaccepted: Use ACPI reclaim memory for unaccepted memory table")
Reported-by: Hongyu Ning <hongyu.ning@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
ANBZ: #9469
commit 745e3ed85f upstream
UEFI Specification version 2.9 introduces the concept of memory
acceptance: Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.
Accepting memory is costly and it makes VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until memory is needed. It lowers boot time and reduces
memory overhead.
The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.
Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 (or whatever the arch uses) has to be modified on every
page acceptance. It leads to table fragmentation and there's a limited
number of entries in the e820 table.
Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents a
naturally aligned power-2-sized region of address space -- unit.
For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB of
physical address space.
In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
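The sizing above follows directly from the 2MiB unit; a quick
back-of-the-envelope check (not code from the patch):
/*
 * 1 bit per 2MiB unit:
 *   4 KiB of bitmap = 4096 * 8 = 32768 bits -> 32768 * 2 MiB = 64 GiB
 *   4 PiB / 2 MiB   = 2^31 bits = 2^28 bytes = 256 MiB of bitmap
 */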
Any unaccepted memory that is not aligned to unit_size gets accepted
upfront.
The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via EFI configuration table. allocate_e820() allocates the
bitmap if unaccepted memory is present, according to the size of
unaccepted region.
[ Zelin Deng: fix conflicts and add missing function ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230606142637.5171-4-kirill.shutemov@linux.intel.com
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
ANBZ: #9469
commit dcdfdd40fa upstream
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.
There are several ways the kernel can deal with unaccepted memory:
1. Accept all the memory during boot. It is easy to implement and it
doesn't have runtime cost once the system is booted. The downside is
very long boot time.
Accept can be parallelized to multiple CPUs to keep it manageable
(i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
memory bandwidth and does not scale beyond that point.
2. Accept a block of memory on the first use. It requires more
infrastructure and changes in page allocator to make it work, but
it provides good boot time.
On-demand memory accept means latency spikes every time kernel steps
onto a new memory block. The spikes will go away once workload data
set size gets stabilized or all memory gets accepted.
3. Accept all memory in background. Introduce a thread (or multiple)
that gets memory accepted proactively. It will minimize the time the
system experiences latency spikes on memory allocation while keeping
boot time low.
This approach cannot function on its own. It is an extension of #2:
background memory acceptance requires functional scheduler, but the
page allocator may need to tap into unaccepted memory before that.
The downside of the approach is that these threads also steal CPU
cycles and memory bandwidth from the user's workload and may hurt
user experience.
Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.
Support of unaccepted memory requires a few changes in core-mm code:
- memblock accepts memory on allocation. It serves early boot memory
allocations and doesn't limit them to pre-accepted pool of memory.
- page allocator accepts memory on the first allocation of the page.
When kernel runs out of accepted memory, it accepts memory until the
high watermark is reached. It helps to minimize fragmentation.
EFI code will provide two helpers if the platform supports unaccepted
memory:
- accept_memory() makes a range of physical addresses accepted.
- range_contains_unaccepted_memory() checks whether anything within the
range of physical addresses requires acceptance.
[ Zelin Deng: As the two helpers above have not been implemented in
this commit, add stub implementations for them to avoid build errors ]
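A sketch of the stub fallbacks the note above refers to (signatures follow
the upstream series; on platforms with unaccepted memory the real helpers
are provided by the EFI code):
void accept_memory(phys_addr_t start, phys_addr_t end)
{
        /* nothing to accept when the platform has no unaccepted memory */
}

bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
{
        return false;
}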
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
ANBZ: #9320
Because we do not take care of kABI consistency between major versions
now, revert the kABI usage; these reserved spaces are intended to ensure
kABI compatibility between minor versions and their corresponding major
version.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3344
ANBZ: #7740
Reflink takes too much time doing copy-on-write of the dirty pages.
Redis blocks waiting for completion, which is not acceptable for us when
productizing the feature.
This patch handles the copy-on-write in an asynchronous thread. We do
not handle PMD-aligned entries directly, and leave them for the
background work.
[ Xu Yu:
- Employ async fork framework.
- Fix wrong usage of pmd_write.
- Move work_struct of fast reflink from vm_area_struct to
address_space, and update corresponding routines.
]
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2794
ANBZ: #7740
A new functionality, akin to that of async fork, will be introduced to
manipulate PMD and VMA ingeniously.
Therefore, this abstracts the implementation of async fork into a generic
framework.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2794
ANBZ: #6632
Before this patch, it was incorrect to charge the memcg with the shadow
mm, because the shadow mm has no memcg it belongs to.
Switch to the original mm when charging the memcg.
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
ANBZ: #6632
Introduce CONFIG_PAGETABLE_SHARE to select whether this feature is
enabled. It needs to select CONFIG_ARCH_USES_HIGH_VMA_FLAGS.
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
ANBZ: #6632
cherry-picked from https://lore.kernel.org/linux-mm/a1bc313a3f50f085ab9fcd9826b3cdc20bc15269.1682453344.git.khalid.aziz@oracle.com/
Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
a function to determine if a vma shares PTE by checking this flag.
This is to be used to find the shared page table entries on page fault
for vmas sharing PTE.
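A hedged sketch of the flag and helper described above; the exact flag bit
used by the backport is an assumption here (it must come from the high
vma-flag space, hence the CONFIG_ARCH_USES_HIGH_VMA_FLAGS dependency in
the related config patch):
#define VM_SHARED_PT    VM_HIGH_ARCH_4          /* assumption: actual bit may differ */

static inline bool vma_is_shared(const struct vm_area_struct *vma)
{
        return vma->vm_flags & VM_SHARED_PT;
}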
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
ANBZ: #7740
commit c36e202495 upstream.
This new function is a variant of mf_generic_kill_procs that accepts a
file, offset pair instead of a struct to support multiple files sharing a
DAX mapping. It is intended to be called by the file systems as part of
the memory_failure handler after the file system performed a reverse
mapping from the storage address to the file and file offset.
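The declaration, roughly as added upstream (shown here as a sketch for
context):
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
                      unsigned long count, int mf_flags);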
Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Gao Xiang: adapt ANCK 5.10 codebase. ]
[ Gao Xiang: folded parts of upstream commit 6a8e0596f0. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2504
ANBZ: #6706
commit f0c0c115fb upstream.
For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups. However at
the moment, the pagetables are accounted per-zone. Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.
[akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]
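The direction of the change, in a simplified sketch of the page-table page
constructor (not the literal diff): page-table pages are charged through
the per-node lruvec counter, which memcg accounting understands:
static inline bool pgtable_pte_page_ctor(struct page *page)
{
        if (!ptlock_init(page))
                return false;
        __SetPageTable(page);
        __mod_lruvec_page_state(page, NR_PAGETABLE, 1); /* was a per-zone stat */
        return true;
}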
Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Bo Liu <boliu@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2248
ANBZ: #3879
Reserve specific fields for kABI-related structs, based on the upstream
struct sizes.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2210
ANBZ: #6600
commit de2860f463 upstream.
During log recovery of an XFS filesystem with 64kB directory
buffers, rebuilding a buffer split across two log records results
in a memory allocation warning from krealloc like this:
xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
XFS (dm-0): Unmounting Filesystem
XFS (dm-0): Mounting V5 Filesystem
XFS (dm-0): Starting recovery (logdev: internal)
------------[ cut here ]------------
WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
.....
RIP: 0010:get_page_from_freelist+0xdee/0xe40
Call Trace:
? complete+0x3f/0x50
__alloc_pages+0x16f/0x300
alloc_pages+0x87/0x110
kmalloc_order+0x2c/0x90
kmalloc_order_trace+0x1d/0x90
__kmalloc_track_caller+0x215/0x270
? xlog_recover_add_to_cont_trans+0x63/0x1f0
krealloc+0x54/0xb0
xlog_recover_add_to_cont_trans+0x63/0x1f0
xlog_recovery_process_trans+0xc1/0xd0
xlog_recover_process_ophdr+0x86/0x130
xlog_recover_process_data+0x9f/0x160
xlog_recover_process+0xa2/0x120
xlog_do_recovery_pass+0x40b/0x7d0
? __irq_work_queue_local+0x4f/0x60
? irq_work_queue+0x3a/0x50
xlog_do_log_recovery+0x70/0x150
xlog_do_recover+0x38/0x1d0
xlog_recover+0xd8/0x170
xfs_log_mount+0x181/0x300
xfs_mountfs+0x4a1/0x9b0
xfs_fs_fill_super+0x3c0/0x7b0
get_tree_bdev+0x171/0x270
? suffix_kstrtoint.constprop.0+0xf0/0xf0
xfs_fs_get_tree+0x15/0x20
vfs_get_tree+0x24/0xc0
path_mount+0x2f5/0xaf0
__x64_sys_mount+0x108/0x140
do_syscall_64+0x3a/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xae
Essentially, we are taking a multi-order allocation from kmem_alloc()
(which has an open coded no fail, no warn loop) and then
reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
then triggering the above warning.
This is a regression caused by converting this code from an open
coded no fail/no warn reallocation loop to using __GFP_NOFAIL.
What we actually need here is kvrealloc(), so that if contiguous
page allocation fails we fall back to vmalloc() and we don't
get nasty warnings happening in XFS.
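The helper this change relies on (declaration as introduced by the
commit); the call below is only an illustrative sketch, not the XFS hunk
verbatim:
void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags);

/* illustrative call site: falls back to vmalloc() if kmalloc fails */
ptr = kvrealloc(old_ptr, old_len, new_len, GFP_KERNEL);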
Fixes: 771915c4f6 ("xfs: remove kmem_realloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2199
ANBZ: #6572
Patch series "Free some vmemmap pages of HugeTLB page" make
the vmemmap pages of tail pages readonly, which will result
in kernel panic when kidled store the age value into the flags
of tail pages.
In the patch, we only store the age value in the head pages.
Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2181
ANBZ: #6082
Kidled and MGLRU share 8 bits in page flags, and the redundant bits of
MGLRU are set as an unavailable offset.
In the lifetime of the operating system, only one of the Kidled/MGLRU
components is allowed to be started/shut down at the kernel level.
Reuse the reset logic (traverse all pages and clear the page-flags
offset) to enable the other component after closing one.
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2051
ANBZ: #6082
commit ec1c86b25f upstream
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
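A sketch of the truncation described above, following the shape of the
upstream helper in include/linux/mm_inline.h:
static inline int lru_gen_from_seq(unsigned long seq)
{
        /* map a monotonically increasing seq to an index into lrugen->lists[] */
        return seq % MAX_NR_GENS;
}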
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim. These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].
The next patch will address the "outlying refaults". Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction. The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then. This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2051
ANBZ: #5200
commit 7b09f5af01 upstream.
Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP
uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn
operations. This is the most efficient option when sufficient kernel
resources are available.
Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn
Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xuefeng Li <lixuefeng@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Juxin Gao <gaojuxin@loongson.cn>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1830
ANBZ: #5310
Transparent huge pages could reduce the number of address translation
misses, which could improve performance for the applications. But one
concern is that THP may lead to memory bloat which may cause OOM. The
reason is a huge page may contain some zero subpages which user didn't
really access them.
This patch introduces a mechanism to reclaim these zero subpages; it
works when memory pressure is high. We first estimate whether a huge
page contains enough zero subpages, then try to split it and reclaim
them.
We add an interface 'thp_reclaim' to turn this on or off in the memory
cgroup:
echo reclaim > memory.thp_reclaim to enable reclaim mode.
echo swap > memory.thp_reclaim to enable swap mode.
echo disable > memory.thp_reclaim to disable.
Or you can configure it by the global interface:
/sys/kernel/mm/transparent_hugepage/reclaim
The default mode is inherited from its parent:
1. The reclaim mode means the huge page will be split and the zero
subpages will be reclaimed.
2. The swap mode is reserved for combining with swap, for example,
zswap and zram.
Meanwhile, we add an interface thp_reclaim_stat to show the stat of
each node in the memory cgroup:
/sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_stat
Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1662
ANBZ: #4794
commit 47010c040d upstream
The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize". In this patch , cheanup configs to make code more
expressive.
Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit e540841734 upstream
The vmemmap_remap_free/alloc functions are relevant to HugeTLB, so move
those functions into the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
Link: https://lkml.kernel.org/r/20211101031651.75851-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit 3bc2b6a725 upstream
Patch series "Split huge PMD mapping of vmemmap pages", v4.
In order to reduce the difficulty of code review of series [1], we
disabled huge PMD mapping of vmemmap pages when that feature was enabled.
In this series, we do not disable huge PMD mapping of vmemmap pages
anymore. We will split huge PMD mappings when needed. When HugeTLB pages
are freed from the pool, we do not attempt to coalesce and move back to a
PMD mapping because it is much more complex.
[1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/
This patch (of 3):
In [1], PMD mappings of vmemmap pages were disabled if the feature
hugetlb_free_vmemmap was enabled. This was done to simplify the initial
implementation of vmemmap freeing for hugetlb pages. Now, remove this
simplification by allowing PMD mapping and switching to PTE mappings as
needed for allocated hugetlb pages.
When a hugetlb page is allocated, the vmemmap page tables are walked to
free vmemmap pages. During this walk, split huge PMD mappings to PTE
mappings as required. In the unlikely case PTE pages cannot be
allocated, return an error (ENOMEM) and do not optimize the vmemmap of
the hugetlb page.
When HugeTLB pages are freed from the pool, we do not attempt to
coalesce and move back to a PMD mapping because it is much more complex.
[1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit ad2fa3717b upstream
When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it. However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure. In
this case, we just refuse to free the HugeTLB page. This changes behavior
in some corner cases as listed below:
1) Failing to free a huge page triggered by the user (decrease nr_pages).
User needs to try again later.
2) Failing to free a surplus huge page when freed by the application.
Try again later when freeing a huge page next time.
3) Failing to dissolve a free huge page on ZONE_MOVABLE via
offline_pages().
This can happen when we have plenty of ZONE_MOVABLE memory, but
not enough kernel memory to allocate vmemmap pages. We may even
be able to migrate huge page contents, but will not be able to
dissolve the source huge page. This will prevent an offline
operation and is unfortunate as memory offlining is expected to
succeed on movable zones. Users that depend on memory hotplug
to succeed for movable zones should carefully consider whether the
memory savings gained from this feature are worth the risk of
possibly not being able to offline memory in certain situations.
4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
alloc_contig_range() - once we have that handling in place. Mainly
affects CMA and virtio-mem.
Similar to 3). virtio-mem will handle migration errors gracefully.
CMA might be able to fallback on other free areas within the CMA
region.
Vmemmap pages are allocated from the page freeing context. In order for
those allocations to be not disruptive (e.g. trigger oom killer)
__GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure. GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.
[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit f41f2ed43c upstream
Every HugeTLB has more than one struct page structure. We __know__ that
we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
store metadata associated with each HugeTLB.
There are a lot of struct page structures associated with each HugeTLB
page. For tail pages, the value of compound_head is the same. So we can
reuse the first page of tail page structures. We map the virtual addresses of
the remaining pages of tail page structures to the first tail page struct,
and then free these page frames. Therefore, we need to reserve two pages
as vmemmap areas.
When we allocate a HugeTLB page from the buddy, we can free some vmemmap
pages associated with each HugeTLB page. It is more appropriate to do it
in the prep_new_huge_page().
The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
associated with a HugeTLB page can be freed, returns zero for now, which
means the feature is disabled. We will enable it once all the
infrastructure is there.
[willy@infradead.org: fix documentation warning]
Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #3879
Reserve specific fields for kABI-related structs, based on the upstream
struct sizes.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1148
ANBZ: #3289
This adds the ability to insert anonymous pages or file pages, used for
direct IO or buffer IO respectively, to a user VM. The intention behind
this is to facilitate mapping pages in IO requests to user space, which
is usually the backend of a remote block device.
This integrates the advantage of vm_insert_pages (batching the pmd lock),
and eliminates the overhead of remap_pfn_range (track_pfn_remap), since
the pages to be inserted should always be ram.
NOTE that it is the caller's responsibility to ensure the validity of
pages to be inserted, i.e., that such pages are used for IO requests.
Relying on this premise, such pages can be inserted as special PTEs,
without increasing the page refcount and mapcount. On the other hand,
the special mapping should be carefully managed (e.g., zapped) when the
IO request is done.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1255
ANBZ: #4116
This renames the "fast copy mm" feature to "async fork", in order
to reveal its true purpose explicitly.
The interface is changed as a result. Now, user can enable it by:
# echo 1 > /path/to/memcg/memory.async_fork
And disable it by:
# echo 0 > /path/to/memcg/memory.async_fork
The function itself is not changed.
Note again that this only introduces the interfaces and stubs of
async fork.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1219
ANBZ: #3879
Unify hotfix reserve macros and KABI reserve macros, now KABI reserve
macros are used for both KABI and hotfix.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1087
ANBZ: #3829
commit d302c2398b upstream.
Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.
It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.
Use memory_failure_queue() to request a call to memory_failure() for the
page with the error.
Also provide a stub version for CONFIG_MEMORY_FAILURE=n
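The declaration plus the CONFIG_MEMORY_FAILURE=n stub mentioned above,
sketched for context (see include/linux/mm.h in the actual patch):
#ifdef CONFIG_MEMORY_FAILURE
void memory_failure_queue(unsigned long pfn, int flags);
#else
static inline void memory_failure_queue(unsigned long pfn, int flags)
{
}
#endif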
Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Shuai Xue: fix conflict by moving memory_failure_queue into CONFIG_MEMORY_FAILURE ]
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1079
ANBZ: #3400
When supporting gup of normal memory in the DAX case, a FOLL_GET_PGSTABLE
flag was added to restrict the pg_stable_pfn() usage in GUP to KVM only.
But it is not right to assume that only KVM will use GUP, given that the
memory is shared by various tasks such as io threads.
Actually, if the DAX gup page is not in ZONE_DEVICE, it is safe to
conclude that the page has a stable page struct.
This patch removes the check of the FOLL_GET_PGSTABLE flag.
Fixes: fea195bb4f ("anolis: mm: gup: allow to follow _PAGE_DEVMAP && !ZONE_DEVICE pages optionally")
Signed-off-by: Simon Guo <wei.guo.simon@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/991
ANBZ: #2510
On top of filling file holes with the zero page, a switch is added to
turn the function on/off at runtime. When it is turned off, all zero
page mappings are evicted to ensure correctness.
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/785
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #2510
When mmapping a file with MAP_POPULATE/MAP_LOCKED, the page cache should
be pre-allocated, and filling with the zero page should be avoided.
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/785
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #2586
Calling get_user_pages() on a range of user memory that has been mmapped
from a DAX file will fail when there is no 'struct page' to describe those
pages. This problem has been addressed in some device drivers by adding
optional struct page support for pages under the control of the driver. To
be specific, gup would query the pgmap_radix tree first and acquire a
reference against a pgmap instance before it ever touches the page. This is
to be sure the struct page won't be released under the nose of gup. And to
add any memory into pgmap_radix, it must not overlap with the existing PFNs.
However, there are cases where users need to map normal memory via DAX
without losing gup support. Usually this could be done with a memmap
parameter appended to the kernel cmdline to produce a fake PMEM device,
which is not flexible and sometimes impractical (imagine how you could
expose a BRD via DAX; a BRD device can't be stacked onto another PMEM
device anyway).
Given that what gup really cares about is merely the existence of the page,
and if the struct pages behind the DAX entries (_PAGE_DEVMAP entries) are
actually from normal memory, gup won't need to bother whether they will be
released on the fly. Therefore, it should be fine to fix the code like
below:
static inline bool pg_stable_pfn(unsigned long pfn)
{
        int nid = pfn_to_nid(pfn);
        struct zone *zone_device = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];

        if (zone_spans_pfn(zone_device, pfn))
                return false;
        else
                return true;
}

[and later in gup]

        page = vm_normal_page(vma, address, pte);
        if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
                if (pg_stable_pfn(pte_pfn(pte)))
                        page = pte_page(pte);
                else {
                        /*
                         * Only return device mapping pages in the FOLL_GET case
                         * since they are only valid while holding the pgmap
                         * reference.
                         */
                        pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
                        if (pgmap)
                                page = pte_page(pte);
                        else
                                goto no_page;
                }

[cut here]
Nevertheless, it's too risky to expose pages of this kind to the other
parts of the world at once. So this patch proposes a FOLL_GET_PGSTABLE
flag associated with FOLL_GET. The purpose is to let selected users of
gup (KVM in this patch) take the shot first, while leaving the others as
they were.
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Signed-off-by: Simon Guo <wei.guo.simon@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/865
ANBZ: #1881
The fcm (fast copy mm) feature places stubs in the mm core path, whose
overhead should be negligible.
However, fcm_fixup_{vma,pmd} are unconditionally called whenever a
vma or pmd is about to be changed. This resulted in a performance
degradation of up to 10% for the will-it-scale/mmap testcase.
This fixes the performance degradation by adding a static key. There is
no need to call fixup ops when the static key is disabled.
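An illustrative sketch of the static-key guard; the key name and the fixup
body are assumptions, not the exact Anolis code:
DEFINE_STATIC_KEY_FALSE(fcm_enabled_key);

static inline void fcm_fixup_vma(struct vm_area_struct *vma)
{
        if (!static_branch_unlikely(&fcm_enabled_key))
                return;         /* feature off: only a patched-out branch */
        /* ... actual fixup work ... */
}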
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/665
ANBZ: #1881
This reverts commit 166322a4cb.
The reverted patch does not solve the performance degradation
completely. Therefore, revert it and a better solution will be
provided in the next patch.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/665
ANBZ: #1881
The fcm (fast copy mm) feature places stubs in the mm core path, whose
overhead should be negligible.
However, the fcm_fixup_vma stub is unconditionally called whenever a vma
is about to be changed. This resulted in a performance degradation of up
to 10% for the will-it-scale/mmap testcase.
This fixes the performance degradation by adding proper conditions
before calling fcm_fixup_vma.
Fixes: 4995c319c5 ("anolis: mm: fcm: introduce fast copy mm interface")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/625
ANBZ: #1745
Under normal circumstances, the parent is blocked synchronously in
copy_mm() when fork(2) is called, and the duration may be long with
large memory usage. From the user's view, this makes the parent appear
out of service. For example, when a redis instance creates a memory
snapshot based on fork(2), it can easily produce latency spikes.
Fast copy mm is designed to reduce the blocking duration when the
parent process calls fork(2). It can be controlled per memcg and can be
switched on/off on the fly. It is disabled by default; the user can
enable it by:
# echo 1 > /path/to/memcg/memory.fast_copy_mm
And disable it by:
# echo 0 > /path/to/memcg/memory.fast_copy_mm
Child memcg will inherit parent memcg's configuration.
This introduces the interfaces and stubs of fast copy mm, which split
the copy_page_range() function into a fast path and a slow path. The
parent is expected to execute only the fast path in fork(2), while the
child process will execute the slow path before returning to user mode.
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/578
ANBZ: #1345
commit 03b122da74 upstream.
Add a call inside memory_failure() to call the arch specific code
to check if the address is an SGX EPC page and handle it.
Note the SGX EPC pages do not have a "struct page" entry, so the hook
goes in at the same point as the device mapping hook.
Pull the call to acquire the mutex earlier so the SGX errors are also
protected.
Make set_mce_nospec() skip SGX pages when trying to adjust
the 1:1 map.
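The arch hook, roughly as added to include/linux/mm.h (sketched here for
context); architectures without special handling fall back to a stub that
reports the pfn is not theirs:
#ifndef arch_memory_failure
static inline int arch_memory_failure(unsigned long pfn, int flags)
{
        return -ENXIO;
}
#endif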
Intel-SIG: commit 03b122da74 x86/sgx: Hook arch_memory_failure()
into mainline code.
Backport for SGX MCA recovery co-existence support.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.com
[ Zhiquan: amend commit log ]
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
ANBZ: #1324
commit ce6d42f2e4 upstream.
Patch series "mm: Add vma_lookup()", v2.
Many places in the kernel use find_vma() to get a vma and then check the
start address of the vma to ensure the next vma was not returned.
Other places use the find_vma_intersection() call with addr, addr + 1 as
the range, looking for just the vma at a specific address.
The third use of find_vma() is by developers who do not know that the
function starts searching at the provided address upwards for the next
vma. This results in a bug that is often overlooked for a long time.
Adding the new vma_lookup() function will allow for cleaner code by
removing the find_vma() calls which check limits, making
find_vma_intersection() calls of a single address to be shorter, and
potentially reduce the incorrect uses of find_vma().
This patch (of 22):
Also change find_vma_intersection() comments and declaration to be of the
correct length and add kernel documentation style comment.
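The new helper, roughly as added to include/linux/mm.h (sketch): return
the vma only if it actually contains addr, rather than the next vma
upwards:
static inline
struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
{
        struct vm_area_struct *vma = find_vma(mm, addr);

        if (vma && addr < vma->vm_start)
                vma = NULL;

        return vma;
}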
Intel-SIG: commit ce6d42f2e4 mm: add vma_lookup(), update
find_vma_intersection() comments.
Backport for SGX virtualization support.
Link: https://lkml.kernel.org/r/20210521174745.2219620-1-Liam.Howlett@Oracle.com
Link: https://lkml.kernel.org/r/20210521174745.2219620-2-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
[ Zhiquan: amend commit log ]
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
ANBZ: #32
NOTE: this is an experimental feature.
This duplicates pages synchronously when accessing remote page cache.
Specifically, this creates duplicated pages in do_read_fault() and
filemap_map_pages().
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
fix#36381167
Pin the code section of a process in memory for the corresponding
VMAs, like mlock does.
Usage:
- pin process "PID":
    echo PID > /proc/unevictable/add_pid
- unpin it:
    echo PID > /proc/unevictable/del_pid
- show all pinned process pids:
    cat /proc/unevictable/add_pid
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #34329845
Specific workloads that use multiple threads to read the same file
sequentially may introduce contention on a spinlock during page cache
readahead. The readahead process will end earlier when a readahead page
conflict is detected. This is effective for sequential reading of the
same file with multiple threads and a large readahead size.
The benefit can be shown in the following test by fio:
fio --name=ratest --ioengine=mmap --iodepth=1 --rw=read --direct=0
--size=512M --numjobs=32 --directory=/mnt --filename=ratestfile
Firstly, we set readahead size to its default value:
blockdev --setra 256 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=4427MiB/s (4642MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s),
io=16.0GiB (17.2GB), run=3699-3701msec
Disk stats (read/write):
vdh: ios=22945/0, merge=0/0, ticks=12515/0, in_queue=12515,
util=50.79%
Then, we set readahead size larger:
blockdev --setra 8192 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=6450MiB/s (6764MB/s), 202MiB/s-202MiB/s (211MB/s-212MB/s),
io=16.0GiB (17.2GB), run=2535-2540msec
Disk stats (read/write):
vdh: ios=19503/0, merge=0/0, ticks=49375/0, in_queue=49375,
util=23.67%
Finally, we set sysctl vm.enable_multithread_ra_boost=1 with ra=8192
and we get:
Run status group 0 (all jobs):
READ: bw=7234MiB/s (7585MB/s), 226MiB/s-226MiB/s (237MB/s-237MB/s),
io=16.0GiB (17.2GB), run=2262-2265msec
Disk stats (read/write):
vdh: ios=10548/0, merge=0/0, ticks=40709/0, in_queue=40709,
util=23.62%
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>