ANBZ: #20722
commit 6375e95f38 upstream.
In pursuit of high performance, applications mostly use high-performance
user-mode memory allocators such as jemalloc or tcmalloc. These memory
allocators use madvise(MADV_DONTNEED or MADV_FREE) to release physical
memory, but neither MADV_DONTNEED nor MADV_FREE releases page table
memory, which may cause huge page table memory usage.
The following is a memory usage snapshot of one process, which actually
happened on our server:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
As a first step, this commit aims to synchronously free the empty PTE
pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
cases other than madvise(MADV_DONTNEED).
Once an empty PTE page is detected, we first try to hold the pmd lock within
the pte lock. If successful, we clear the pmd entry directly (fast path).
Otherwise, we wait until the pte lock is released, then re-hold the pmd
and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
whether the PTE page is empty and free it (slow path).
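The emptiness check at the heart of both paths boils down to scanning all
PTRS_PER_PTE entries; a minimal sketch is below (the helper name is
illustrative, and locking, TLB flushing and the actual freeing of the PTE
page are omitted; see the upstream patch for the real code):
static bool pte_range_is_empty(pte_t *start_pte)
{
        int i;

        for (i = 0; i < PTRS_PER_PTE; i++) {
                if (!pte_none(ptep_get(start_pte + i)))
                        return false;   /* still in use, keep the PTE page */
        }
        return true;                    /* every entry is empty, safe to reclaim */
}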
For other cases such as madvise(MADV_FREE), consider scanning and freeing
empty PTE pages asynchronously in the future.
The following code snippet can show the effect of optimization:
mmap 50G
while (1) {
        for (; i < 1024 * 25; i++) {
                touch 2M memory
                madvise MADV_DONTNEED 2M
        }
}
As we can see, the memory usage of VmPTE is reduced:
                before        after
VIRT           50.0 GB      50.0 GB
RES             3.1 MB       3.1 MB
VmPTE        102640 KB       240 KB
[Zelin Deng: change and export zap_page_range_single() in this patch so
that it can be used by external callers. This change was originally
introduced by commit 21b85b09527c ("madvise: use zap_page_range_single for madvise dontneed")]
[zhengqi.arch@bytedance.com: fix uninitialized symbol 'ptl']
Link: https://lkml.kernel.org/r/20241206112348.51570-1-zhengqi.arch@bytedance.com
Link: https://lore.kernel.org/linux-mm/224e6a4e-43b5-4080-bdd8-b0a6fb2f0853@stanley.mountain/
Link: https://lkml.kernel.org/r/92aba2b319a734913f18ba41e7d86a265f0b84e2.1733305182.git.zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5202
ANBZ: #20359
commit c2508ec5a5 upstream.
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.
That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.
Note: I say "common helper", but it really only handles the normal
stack-grows-down case. Which is all architectures except for PA-RISC
and IA64. So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.
But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.
[feng: during backporting to 5.10, slight change to the code location
and function parameter name in mm/memory.c and arch/x86/mm/fault.c]
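For reference, this is roughly how an architecture's fault handler ends up
using the helper after the conversion (a simplified sketch of the x86 usage;
on failure the helper has already dropped the mmap lock):
        vma = lock_mm_and_find_vma(mm, address, regs);
        if (unlikely(!vma)) {
                bad_area_nosemaphore(regs, error_code, address);
                return;
        }
        /* ... proceed to handle_mm_fault(vma, address, flags, regs) ... */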
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5085
ANBZ: #20392
commit cd7eb8f83f upstream.
Currently, if an automatically freed allocation is an error pointer, that
will lead to a crash. An example of this is in wm831x_gpio_dbg_show():
        char *label __free(kfree) = gpiochip_dup_line_label(chip, i);
        if (IS_ERR(label)) {
                dev_err(wm831x->dev, "Failed to duplicate label\n");
                continue;
        }
The auto clean up function should check for error pointers as well,
otherwise we're going to keep hitting issues like this.
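In essence, the fix makes the kfree cleanup helper tolerate error pointers;
a sketch of the intended change (see include/linux/slab.h in the actual
patch):
DEFINE_FREE(kfree, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T))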
Fixes: 54da6a0924 ("locking: Introduce __cleanup() based infrastructure")
Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Acked-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5060
ANBZ: #20392
commit a67d74a4b1 upstream.
Allow for the declaration of variables that trigger kvfree() when they
go out of scope. The check for NULL and call to kvfree() can be elided
by the compiler in most cases, otherwise without the NULL check an
unnecessary call to kvfree() may be emitted. Peter proposed a comment
for this detail [1].
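Illustrative usage of the new cleanup attribute ('buf' and 'size' are
placeholders, not taken from the patch):
        void *buf __free(kvfree) = kvzalloc(size, GFP_KERNEL);

        if (!buf)
                return -ENOMEM;
        /* use buf; kvfree(buf) runs automatically when it goes out of scope */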
[ Zelin Deng: kvfree() is not declared in slab.h in this tree, so make the change in mm.h instead ]
Link: http://lore.kernel.org/r/20230816103102.GF980931@hirez.programming.kicks-ass.net [1]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Acked-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/5060
ANBZ: #12593
commit 97a7e4733b upstream
We've got quite a few places (pte, pmd, pud) that explicitly checked
against whether we should break the cow right now during fork(). It's
easier to provide a helper, especially before we work the same thing on
hugetlbfs.
Since we'll reference is_cow_mapping() in mm.h, move it there too.
Actually it suits mm.h more since internal.h is mm/ only, but mm.h is
exported to the whole kernel. With that we should expect another patch to
use is_cow_mapping() whenever we can across the kernel since we do use it
quite a lot but it's always done with raw code against VM_* flags.
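For reference, the helper boils down to a single flags check (a sketch of
the definition being moved):
static inline bool is_cow_mapping(vm_flags_t flags)
{
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}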
Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Gal Pressman <galpress@amazon.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Roland Scheidegger <sroland@vmware.com>
Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
Cc: Wei Zhang <wzam@amazon.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4271
ANBZ: #12923
When memory is onlined by virtio-mem, the struct page metadata is
allocated from the Normal/DMA32 zone. Onlining a large amount of memory
therefore takes up a lot of kernel memory and can trigger OOM.
In addition, when memmap_on_memory is enabled, offlining the memmap
memory in a movable node fails, which results in memory waste.
This patch completes the function so that once the rest of a memory
block has been offlined, its memmap is offlined as well.
Signed-off-by: hr567 <hr567@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4359
ANBZ: #11569
Set a default value for min_cache_kbytes to resolve the issue of OOM
not being triggered.
min_cache_kbytes is set to:
   150M, when total memory is in (  0,   4G]
   300M, when total memory is in ( 4G,   8G]
   400M, when total memory is in ( 8G,  16G]
   500M, when total memory is in (16G, 128G]
  1024M, when total memory is above 128G
Signed-off-by: Yuanhe Shu <xiangzao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/4044
ANBZ: #10842
Introduce a new vma flag 'VM_PGOFF_IS_PFN', which means
vma->vm_pgoff == pfn. This allows us to obtain the pfn directly from
vma->vm_pgoff. No functional change.
Signed-off-by: Qinyun Tan <qinyuntan@linux.alibaba.com>
Reviewed-by: Guanghui Feng <guanghuifeng@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3834
ANBZ: #9469
commit 8dbe33956d upstream
The unaccepted memory table is now allocated from EFI_ACPI_RECLAIM_MEMORY. It
translates into E820_TYPE_ACPI, which is not added to memblock and
therefore not mapped in the direct mapping.
This causes a crash on the first touch of the table.
Use memblock_add() to make sure that the table is mapped in direct
mapping.
Align the range to the nearest page borders. Ranges smaller than page
size are not mapped.
[ Zelin Deng: Add PAGE_ALIGN_DOWN ]
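A sketch of the fix described above (variable names are illustrative):
align the table outwards to page borders and add the range to memblock so
it lands in the direct mapping:
        start = PAGE_ALIGN_DOWN(unaccepted_table_paddr);
        end   = PAGE_ALIGN(unaccepted_table_paddr + unaccepted_table_size);
        memblock_add(start, end - start);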
Fixes: e7761d827e ("efi/unaccepted: Use ACPI reclaim memory for unaccepted memory table")
Reported-by: Hongyu Ning <hongyu.ning@intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
ANBZ: #9469
commit 745e3ed85f upstream
UEFI Specification version 2.9 introduces the concept of memory
acceptance: Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual
Machine platform.
Accepting memory is costly and it makes VMM allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until memory is needed. It lowers boot time and reduces
memory overhead.
The kernel needs to know what memory has been accepted. Firmware
communicates this information via memory map: a new memory type --
EFI_UNACCEPTED_MEMORY -- indicates such memory.
Range-based tracking works fine for firmware, but it gets bulky for
the kernel: e820 (or whatever the arch uses) has to be modified on every
page acceptance. It leads to table fragmentation and there's a limited
number of entries in the e820 table.
Another option is to mark such memory as usable in e820 and track if the
range has been accepted in a bitmap. One bit in the bitmap represents a
naturally aligned power-2-sized region of address space -- unit.
For x86, unit size is 2MiB: 4k of the bitmap is enough to track 64GiB of
physical address space.
In the worst-case scenario -- a huge hole in the middle of the
address space -- it needs 256MiB to handle 4PiB of the address
space.
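The sizing above follows directly from the 2MiB unit; a quick
back-of-the-envelope check (not code from the patch):
/*
 * 1 bit per 2MiB unit:
 *   4 KiB of bitmap = 4096 * 8 = 32768 bits -> 32768 * 2 MiB = 64 GiB
 *   4 PiB / 2 MiB   = 2^31 bits = 2^28 bytes = 256 MiB of bitmap
 */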
Any unaccepted memory that is not aligned to unit_size gets accepted
upfront.
The bitmap is allocated and constructed in the EFI stub and passed down
to the kernel via EFI configuration table. allocate_e820() allocates the
bitmap if unaccepted memory is present, according to the size of
unaccepted region.
[ Zelin Deng: fix conflicts and add missing function ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20230606142637.5171-4-kirill.shutemov@linux.intel.com
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
ANBZ: #9469
commit dcdfdd40fa upstream
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.
There are several ways the kernel can deal with unaccepted memory:
1. Accept all the memory during boot. It is easy to implement and it
doesn't have runtime cost once the system is booted. The downside is
very long boot time.
Accept can be parallelized to multiple CPUs to keep it manageable
(i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
memory bandwidth and does not scale beyond that point.
2. Accept a block of memory on the first use. It requires more
infrastructure and changes in page allocator to make it work, but
it provides good boot time.
On-demand memory accept means latency spikes every time kernel steps
onto a new memory block. The spikes will go away once workload data
set size gets stabilized or all memory gets accepted.
3. Accept all memory in background. Introduce a thread (or multiple)
that gets memory accepted proactively. It will minimize the time the
system experiences latency spikes on memory allocation while keeping
boot time low.
This approach cannot function on its own. It is an extension of #2:
background memory acceptance requires functional scheduler, but the
page allocator may need to tap into unaccepted memory before that.
The downside of the approach is that these threads also steal CPU
cycles and memory bandwidth from the user's workload and may hurt
user experience.
Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.
Support of unaccepted memory requires a few changes in core-mm code:
- memblock accepts memory on allocation. It serves early boot memory
allocations and doesn't limit them to pre-accepted pool of memory.
- page allocator accepts memory on the first allocation of the page.
When kernel runs out of accepted memory, it accepts memory until the
high watermark is reached. It helps to minimize fragmentation.
EFI code will provide two helpers if the platform supports unaccepted
memory:
- accept_memory() makes a range of physical addresses accepted.
- range_contains_unaccepted_memory() checks whether anything within the
range of physical addresses requires acceptance.
[ Zelin Deng: As the two helpers above have not been implemented in
this commit, add stub implementations for them to avoid build errors ]
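A sketch of the stub fallbacks the note above refers to (signatures follow
the upstream series; on platforms with unaccepted memory the real helpers
are provided by the EFI code):
void accept_memory(phys_addr_t start, phys_addr_t end)
{
        /* nothing to accept when the platform has no unaccepted memory */
}

bool range_contains_unaccepted_memory(phys_addr_t start, phys_addr_t end)
{
        return false;
}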
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
Reviewed-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3457
ANBZ: #9320
Because we do not take care of kABI consistency between major versions
now, revert the kABI usage; these reserved spaces are intended to ensure
kABI compatibility between minor versions and their corresponding major
version.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Reviewed-by: Yi Tao <escape@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3344
ANBZ: #7740
Reflink takes too much time doing copy-on-write of the dirty pages.
Redis blocks waiting for completion, which is not acceptable for us when
productizing the feature.
This patch handles the copy-on-write in an asynchronous thread. We do
not handle PMD-aligned entries directly, and leave them for the
background work.
[ Xu Yu:
- Employ async fork framework.
- Fix wrong usage of pmd_write.
- Move work_struct of fast reflink from vm_area_struct to
address_space, and update corresponding routines.
]
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2794
ANBZ: #7740
A new functionality, akin to that of async fork, will be introduced to
manipulate PMD and VMA ingeniously.
Therefore, this abstracts the implementation of async fork into a generic
framework.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2794
ANBZ: #6632
Before this patch, it was incorrect to charge the memcg with the shadow
mm, because the shadow mm has no memcg it belongs to.
Switch to the original mm when charging the memcg.
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
ANBZ: #6632
Introduce CONFIG_PAGETABLE_SHARE to select whether this feature is
enabled. It needs to select CONFIG_ARCH_USES_HIGH_VMA_FLAGS.
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
ANBZ: #6632
cherry-picked from https://lore.kernel.org/linux-mm/a1bc313a3f50f085ab9fcd9826b3cdc20bc15269.1682453344.git.khalid.aziz@oracle.com/
Add a bit to vm_flags to indicate a vma shares PTEs with others. Add
a function to determine if a vma shares PTE by checking this flag.
This is to be used to find the shared page table entries on page fault
for vmas sharing PTE.
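A hedged sketch of the flag and helper described above; the exact flag bit
used by the backport is an assumption here (it must come from the high
vma-flag space, hence the CONFIG_ARCH_USES_HIGH_VMA_FLAGS dependency in
the related config patch):
#define VM_SHARED_PT    VM_HIGH_ARCH_4          /* assumption: actual bit may differ */

static inline bool vma_is_shared(const struct vm_area_struct *vma)
{
        return vma->vm_flags & VM_SHARED_PT;
}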
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2550
ANBZ: #7740
commit c36e202495 upstream.
This new function is a variant of mf_generic_kill_procs that accepts a
file, offset pair instead of a struct to support multiple files sharing a
DAX mapping. It is intended to be called by the file systems as part of
the memory_failure handler after the file system performed a reverse
mapping from the storage address to the file and file offset.
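The declaration, roughly as added upstream (shown here as a sketch for
context):
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
                      unsigned long count, int mf_flags);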
Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.wiliams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Ritesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Gao Xiang: adapt ANCK 5.10 codebase. ]
[ Gao Xiang: folded parts of upstream commit 6a8e0596f0. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2504
ANBZ: #6706
commit f0c0c115fb upstream.
For many workloads, pagetable consumption is significant and it makes
sense to expose it in the memory.stat for the memory cgroups. However at
the moment, the pagetables are accounted per-zone. Converting them to
per-node and using the right interface will correctly account for the
memory cgroups as well.
[akpm@linux-foundation.org: export __mod_lruvec_page_state to modules for arch/mips/kvm/]
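The direction of the change, in a simplified sketch of the page-table page
constructor (not the literal diff): page-table pages are charged through
the per-node lruvec counter, which memcg accounting understands:
static inline bool pgtable_pte_page_ctor(struct page *page)
{
        if (!ptlock_init(page))
                return false;
        __SetPageTable(page);
        __mod_lruvec_page_state(page, NR_PAGETABLE, 1); /* was a per-zone stat */
        return true;
}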
Link: https://lkml.kernel.org/r/20201130212541.2781790-3-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Bo Liu <boliu@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Guixin Liu <kanie@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2248
ANBZ: #3879
Reserve specific fields for kABI-related structs, based on the upstream
struct sizes.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2210
ANBZ: #6600
commit de2860f463 upstream.
During log recovery of an XFS filesystem with 64kB directory
buffers, rebuilding a buffer split across two log records results
in a memory allocation warning from krealloc like this:
xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
XFS (dm-0): Unmounting Filesystem
XFS (dm-0): Mounting V5 Filesystem
XFS (dm-0): Starting recovery (logdev: internal)
------------[ cut here ]------------
WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
.....
RIP: 0010:get_page_from_freelist+0xdee/0xe40
Call Trace:
? complete+0x3f/0x50
__alloc_pages+0x16f/0x300
alloc_pages+0x87/0x110
kmalloc_order+0x2c/0x90
kmalloc_order_trace+0x1d/0x90
__kmalloc_track_caller+0x215/0x270
? xlog_recover_add_to_cont_trans+0x63/0x1f0
krealloc+0x54/0xb0
xlog_recover_add_to_cont_trans+0x63/0x1f0
xlog_recovery_process_trans+0xc1/0xd0
xlog_recover_process_ophdr+0x86/0x130
xlog_recover_process_data+0x9f/0x160
xlog_recover_process+0xa2/0x120
xlog_do_recovery_pass+0x40b/0x7d0
? __irq_work_queue_local+0x4f/0x60
? irq_work_queue+0x3a/0x50
xlog_do_log_recovery+0x70/0x150
xlog_do_recover+0x38/0x1d0
xlog_recover+0xd8/0x170
xfs_log_mount+0x181/0x300
xfs_mountfs+0x4a1/0x9b0
xfs_fs_fill_super+0x3c0/0x7b0
get_tree_bdev+0x171/0x270
? suffix_kstrtoint.constprop.0+0xf0/0xf0
xfs_fs_get_tree+0x15/0x20
vfs_get_tree+0x24/0xc0
path_mount+0x2f5/0xaf0
__x64_sys_mount+0x108/0x140
do_syscall_64+0x3a/0x70
entry_SYSCALL_64_after_hwframe+0x44/0xae
Essentially, we are taking a multi-order allocation from kmem_alloc()
(which has an open coded no fail, no warn loop) and then
reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
then triggering the above warning.
This is a regression caused by converting this code from an open
coded no fail/no warn reallocation loop to using __GFP_NOFAIL.
What we actually need here is kvrealloc(), so that if contiguous
page allocation fails we fall back to vmalloc() and we don't
get nasty warnings happening in XFS.
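The helper this change relies on (declaration as introduced by the
commit); the call below is only an illustrative sketch, not the XFS hunk
verbatim:
void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags);

/* illustrative call site: falls back to vmalloc() if kmalloc fails */
ptr = kvrealloc(old_ptr, old_len, new_len, GFP_KERNEL);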
Fixes: 771915c4f6 ("xfs: remove kmem_realloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2199
ANBZ: #6572
Patch series "Free some vmemmap pages of HugeTLB page" make
the vmemmap pages of tail pages readonly, which will result
in kernel panic when kidled store the age value into the flags
of tail pages.
In the patch, we only store the age value in the head pages.
Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2181
ANBZ: #6082
Kidled and MGLRU share 8 bits in page flags, and the redundant bits of
MGLRU are set as an unavailable offset.
In the lifetime of the operating system, only one of the Kidled/MGLRU
components is allowed to be started/shut down at the kernel level.
Reuse the reset logic (traverse all pages and clear the page-flags
offset) to enable the other component after closing one.
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2051
ANBZ: #6082
commit ec1c86b25f upstream
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.
Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.
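A sketch of the truncation described above, following the shape of the
upstream helper in include/linux/mm_inline.h:
static inline int lru_gen_from_seq(unsigned long seq)
{
        /* map a monotonically increasing seq to an index into lrugen->lists[] */
        return seq % MAX_NR_GENS;
}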
There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations. They form a closed-loop system, i.e., "the page reclaim".
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim. These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].
To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.
The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
applications usually do not prepare themselves for major page
faults like they do for blocked I/O. E.g., GUI applications
commonly use dedicated I/O threads to avoid blocking rendering
threads.
There are also two access patterns: one with temporal locality and the
other without. For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].
The next patch will address the "outlying refaults". Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.
A page is added to the youngest generation on faulting. The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction. The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then. This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.
[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/
Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2051
ANBZ: #5200
commit 7b09f5af01 upstream.
Add sparse memory vmemmap support for LoongArch. SPARSEMEM_VMEMMAP
uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn
operations. This is the most efficient option when sufficient kernel
resources are available.
Link: https://lkml.kernel.org/r/20221027125253.3458989-3-chenhuacai@loongson.cn
Signed-off-by: Min Zhou <zhoumin@loongson.cn>
Signed-off-by: Feiyang Chen <chenfeiyang@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Xuefeng Li <lixuefeng@loongson.cn>
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Juxin Gao <gaojuxin@loongson.cn>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1830
ANBZ: #5310
Transparent huge pages could reduce the number of address translation
misses, which could improve performance for the applications. But one
concern is that THP may lead to memory bloat which may cause OOM. The
reason is a huge page may contain some zero subpages which user didn't
really access them.
This patch introduces a mechanism to reclaim these zero subpages; it
works when memory pressure is high. We first estimate whether a huge
page contains enough zero subpages, then try to split it and reclaim
them.
We add an interface 'thp_reclaim' to turn this on or off in the memory
cgroup:
echo reclaim > memory.thp_reclaim to enable reclaim mode.
echo swap > memory.thp_reclaim to enable swap mode.
echo disable > memory.thp_reclaim to disable.
Or you can configure it by the global interface:
/sys/kernel/mm/transparent_hugepage/reclaim
The default mode is inherited from its parent:
1. The reclaim mode means the huge page will be split and the zero
subpages will be reclaimed.
2. The swap mode is reserved for combining with swap, for example,
zswap and zram.
Meanwhile, we add an interface thp_reclaim_stat to show the stat of
each node in the memory cgroup:
/sys/fs/cgroup/memory/{memcg}/memory.thp_reclaim_stat
Signed-off-by: Ning Zhang <ningzhang@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1662
ANBZ: #4794
commit 47010c040d upstream
The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize". In this patch , cheanup configs to make code more
expressive.
Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit e540841734 upstream
The vmemmap_remap_free/alloc functions are relevant to HugeTLB, so move
those functions into the scope of CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
Link: https://lkml.kernel.org/r/20211101031651.75851-6-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit 3bc2b6a725 upstream
Patch series "Split huge PMD mapping of vmemmap pages", v4.
In order to reduce the difficulty of code review of series [1], we
disabled huge PMD mapping of vmemmap pages when that feature was enabled.
In this series, we do not disable huge PMD mapping of vmemmap pages
anymore. We will split huge PMD mappings when needed. When HugeTLB pages
are freed from the pool, we do not attempt to coalesce and move back to a
PMD mapping because it is much more complex.
[1] https://lore.kernel.org/linux-doc/20210510030027.56044-1-songmuchun@bytedance.com/
This patch (of 3):
In [1], PMD mappings of vmemmap pages were disabled if the feature
hugetlb_free_vmemmap was enabled. This was done to simplify the initial
implementation of vmemmap freeing for hugetlb pages. Now, remove this
simplification by allowing PMD mapping and switching to PTE mappings as
needed for allocated hugetlb pages.
When a hugetlb page is allocated, the vmemmap page tables are walked to
free vmemmap pages. During this walk, split huge PMD mappings to PTE
mappings as required. In the unlikely case PTE pages cannot be
allocated, return an error (ENOMEM) and do not optimize the vmemmap of
the hugetlb page.
When HugeTLB pages are freed from the pool, we do not attempt to
coalesce and move back to a PMD mapping because it is much more complex.
[1] https://lkml.kernel.org/r/20210510030027.56044-8-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20210616094915.34432-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit ad2fa3717b upstream
When we free a HugeTLB page to the buddy allocator, we need to allocate
the vmemmap pages associated with it. However, we may not be able to
allocate the vmemmap pages when the system is under memory pressure. In
this case, we just refuse to free the HugeTLB page. This changes behavior
in some corner cases as listed below:
1) Failing to free a huge page triggered by the user (decrease nr_pages).
User needs to try again later.
2) Failing to free a surplus huge page when freed by the application.
Try again later when freeing a huge page next time.
3) Failing to dissolve a free huge page on ZONE_MOVABLE via
offline_pages().
This can happen when we have plenty of ZONE_MOVABLE memory, but
not enough kernel memory to allocate vmemmap pages. We may even
be able to migrate huge page contents, but will not be able to
dissolve the source huge page. This will prevent an offline
operation and is unfortunate as memory offlining is expected to
succeed on movable zones. Users that depend on memory hotplug
to succeed for movable zones should carefully consider whether the
memory savings gained from this feature are worth the risk of
possibly not being able to offline memory in certain situations.
4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
alloc_contig_range() - once we have that handling in place. Mainly
affects CMA and virtio-mem.
Similar to 3). virtio-mem will handle migration errors gracefully.
CMA might be able to fallback on other free areas within the CMA
region.
Vmemmap pages are allocated from the page freeing context. In order for
those allocations to be not disruptive (e.g. trigger oom killer)
__GFP_NORETRY is used. hugetlb_lock is dropped for the allocation because
a non sleeping allocation would be too fragile and it could fail too
easily under memory pressure. GFP_ATOMIC or other modes to access memory
reserves is not used because we want to prevent consuming reserves under
heavy hugetlb freeing.
[mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
[willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chen Huang <chenhuang5@huawei.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #4794
commit f41f2ed43c upstream
Every HugeTLB has more than one struct page structure. We __know__ that
we only use the first 4 (__NR_USED_SUBPAGE) struct page structures to
store metadata associated with each HugeTLB.
There are a lot of struct page structures associated with each HugeTLB
page. For tail pages, the value of compound_head is the same. So we can
reuse the first page of tail page structures. We map the virtual addresses of
the remaining pages of tail page structures to the first tail page struct,
and then free these page frames. Therefore, we need to reserve two pages
as vmemmap areas.
When we allocate a HugeTLB page from the buddy, we can free some vmemmap
pages associated with each HugeTLB page. It is more appropriate to do it
in the prep_new_huge_page().
The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap pages
associated with a HugeTLB page can be freed, returns zero for now, which
means the feature is disabled. We will enable it once all the
infrastructure is there.
[willy@infradead.org: fix documentation warning]
Link: https://lkml.kernel.org/r/20210615200242.1716568-5-willy@infradead.org
Link: https://lkml.kernel.org/r/20210510030027.56044-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Chen Huang <chenhuang5@huawei.com>
Tested-by: Bodeddula Balasubramaniam <bodeddub@amazon.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Barry Song <song.bao.hua@hisilicon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1589
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #3879
Reserve specific fields for kABI-related structs, based on the upstream
struct sizes.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1148
ANBZ: #3289
This adds the ability to insert anonymous pages or file pages, used for
direct IO or buffer IO respectively, to a user VM. The intention behind
this is to facilitate mapping pages in IO requests to user space, which
is usually the backend of a remote block device.
This integrates the advantage of vm_insert_pages (batching the pmd lock),
and eliminates the overhead of remap_pfn_range (track_pfn_remap), since
the pages to be inserted should always be ram.
NOTE that it is the caller's responsibility to ensure the validity of
pages to be inserted, i.e., that such pages are used for IO requests.
Relying on this premise, such pages can be inserted as special PTEs,
without increasing the page refcount and mapcount. On the other hand,
the special mapping should be carefully managed (e.g., zapped) when the
IO request is done.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1255
ANBZ: #4116
This renames the "fast copy mm" feature to "async fork", in order
to reveal its true purpose explicitly.
The interface is changed as a result. Now, user can enable it by:
# echo 1 > /path/to/memcg/memory.async_fork
And disable it by:
# echo 0 > /path/to/memcg/memory.async_fork
The function itself is not changed.
Note again that this only introduces the interfaces and stubs of
async fork.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1219
ANBZ: #3879
Unify hotfix reserve macros and KABI reserve macros, now KABI reserve
macros are used for both KABI and hotfix.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1087
ANBZ: #3829
commit d302c2398b upstream.
Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.
It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.
Use memory_failure_queue() to request a call to memory_failure() for the
page with the error.
Also provide a stub version for CONFIG_MEMORY_FAILURE=n
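The declaration plus the CONFIG_MEMORY_FAILURE=n stub mentioned above,
sketched for context (see include/linux/mm.h in the actual patch):
#ifdef CONFIG_MEMORY_FAILURE
void memory_failure_queue(unsigned long pfn, int flags);
#else
static inline void memory_failure_queue(unsigned long pfn, int flags)
{
}
#endif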
Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Shuai Xue: fix conflict by moving memory_failure_queue into CONFIG_MEMORY_FAILURE ]
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1079
ANBZ: #3400
When supporting gup of normal memory in the DAX case, a FOLL_GET_PGSTABLE
flag was added to restrict the pg_stable_pfn() usage in GUP to KVM only.
But it is not right to assume that only KVM will use GUP, given that the
memory is shared by various tasks such as io threads.
Actually, if the DAX gup page is not in ZONE_DEVICE, it is safe to
conclude that the page has a stable page struct.
This patch removes the check of the FOLL_GET_PGSTABLE flag.
Fixes: fea195bb4f ("anolis: mm: gup: allow to follow _PAGE_DEVMAP && !ZONE_DEVICE pages optionally")
Signed-off-by: Simon Guo <wei.guo.simon@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/991
ANBZ: #2510
On top of filling file holes with the zero page, a switch is added to
turn the function on/off at runtime. When it is turned off, all zero
page mappings are evicted to ensure correctness.
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/785
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #2510
When mmapping a file with MAP_POPULATE/MAP_LOCKED, the page cache should
be pre-allocated, and filling with the zero page should be avoided.
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/785
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
ANBZ: #2586
Calling get_user_pages() on a range of user memory that has been mmapped
from a DAX file will fail when there is no 'struct page' to describe those
pages. This problem has been addressed in some device drivers by adding
optional struct page support for pages under the control of the driver. To
be specific, gup would query the pgmap_radix tree first and acquire a
reference against a pgmap instance before it ever touches the page. This is
to be sure the struct page won't be released under the nose of gup. And to
add any memory into pgmap_radix, it must not overlap with the existing PFNs.
However, there are cases where users need to map normal memory via DAX
without losing gup support. Usually this could be done with a memmap
parameter appended to the kernel cmdline to produce a fake PMEM device,
which is not flexible and sometimes impractical (imagine how you could
expose a BRD via DAX; a BRD device can't be stacked onto another PMEM
device anyway).
Given that what gup really cares about is merely the existence of the page,
and if the struct pages behind the DAX entries (_PAGE_DEVMAP entries) are
actually from normal memory, gup won't need to bother whether they will be
released on the fly. Therefore, it should be fine to fix the code like
below:
static inline bool pg_stable_pfn(unsigned long pfn)
{
        int nid = pfn_to_nid(pfn);
        struct zone *zone_device = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];

        if (zone_spans_pfn(zone_device, pfn))
                return false;
        else
                return true;
}

[and later in gup]

        page = vm_normal_page(vma, address, pte);
        if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
                if (pg_stable_pfn(pte_pfn(pte)))
                        page = pte_page(pte);
                else {
                        /*
                         * Only return device mapping pages in the FOLL_GET case
                         * since they are only valid while holding the pgmap
                         * reference.
                         */
                        pgmap = get_dev_pagemap(pte_pfn(pte), NULL);
                        if (pgmap)
                                page = pte_page(pte);
                        else
                                goto no_page;
                }

[cut here]
Nevertheless, it's too risky to expose pages of this kind to the other
parts of the world at once. So this patch proposes a FOLL_GET_PGSTABLE
flag associated with FOLL_GET. The purpose is to let selected users of
gup (KVM in this patch) take the shot first, while leaving the others as
they were.
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
Signed-off-by: Simon Guo <wei.guo.simon@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/865
ANBZ: #1881
The fcm (fast copy mm) feature places stubs in the mm core path, whose
overhead should be negligible.
However, fcm_fixup_{vma,pmd} are unconditionally called whenever a
vma or pmd is about to be changed. This resulted in a performance
degradation of up to 10% for the will-it-scale/mmap testcase.
This fixes the performance degradation by adding a static key. There is
no need to call fixup ops when the static key is disabled.
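An illustrative sketch of the static-key guard; the key name and the fixup
body are assumptions, not the exact Anolis code:
DEFINE_STATIC_KEY_FALSE(fcm_enabled_key);

static inline void fcm_fixup_vma(struct vm_area_struct *vma)
{
        if (!static_branch_unlikely(&fcm_enabled_key))
                return;         /* feature off: only a patched-out branch */
        /* ... actual fixup work ... */
}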
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/665
ANBZ: #1881
This reverts commit 166322a4cb.
The reverted patch does not solve the performance degradation
completely. Therefore, revert it and a better solution will be
provided in the next patch.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/665
ANBZ: #1881
The fcm (fast copy mm) feature places stubs in the mm core path, whose
overhead should be negligible.
However, the fcm_fixup_vma stub is unconditionally called whenever a vma
is about to be changed. This resulted in a performance degradation of up
to 10% for the will-it-scale/mmap testcase.
This fixes the performance degradation by adding proper conditions
before calling fcm_fixup_vma.
Fixes: 4995c319c5 ("anolis: mm: fcm: introduce fast copy mm interface")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: zhong jiang <zhongjiang-ali@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/625
ANBZ: #1745
Under normal circumstances, the parent is blocked synchronously in
copy_mm() when fork(2) is called, and the duration may be long with
large memory usage. From the user's view, this makes the parent appear
out of service. For example, when a redis instance creates a memory
snapshot based on fork(2), it can easily produce latency spikes.
Fast copy mm is designed to reduce the blocking duration when the
parent process calls fork(2). It can be controlled per memcg and can be
switched on/off on the fly. It is disabled by default; the user can
enable it by:
# echo 1 > /path/to/memcg/memory.fast_copy_mm
And disable it by:
# echo 0 > /path/to/memcg/memory.fast_copy_mm
Child memcg will inherit parent memcg's configuration.
This introduces the interfaces and stubs of fast copy mm, which split
the copy_page_range() function into a fast path and a slow path. The
parent is expected to execute only the fast path in fork(2), while the
child process will execute the slow path before returning to user mode.
Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Acked-by: Gang Deng <gavin.dg@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/578
ANBZ: #1345
commit 03b122da74 upstream.
Add a call inside memory_failure() to call the arch specific code
to check if the address is an SGX EPC page and handle it.
Note the SGX EPC pages do not have a "struct page" entry, so the hook
goes in at the same point as the device mapping hook.
Pull the call to acquire the mutex earlier so the SGX errors are also
protected.
Make set_mce_nospec() skip SGX pages when trying to adjust
the 1:1 map.
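The arch hook, roughly as added to include/linux/mm.h (sketched here for
context); architectures without special handling fall back to a stub that
reports the pfn is not theirs:
#ifndef arch_memory_failure
static inline int arch_memory_failure(unsigned long pfn, int flags)
{
        return -ENXIO;
}
#endif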
Intel-SIG: commit 03b122da74 x86/sgx: Hook arch_memory_failure()
into mainline code.
Backport for SGX MCA recovery co-existence support.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-6-tony.luck@intel.com
[ Zhiquan: amend commit log ]
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
ANBZ: #1324
commit ce6d42f2e4 upstream.
Patch series "mm: Add vma_lookup()", v2.
Many places in the kernel use find_vma() to get a vma and then check the
start address of the vma to ensure the next vma was not returned.
Other places use the find_vma_intersection() call with addr, addr + 1 as
the range, looking for just the vma at a specific address.
The third use of find_vma() is by developers who do not know that the
function starts searching at the provided address upwards for the next
vma. This results in a bug that is often overlooked for a long time.
Adding the new vma_lookup() function will allow for cleaner code by
removing the find_vma() calls which check limits, making
find_vma_intersection() calls of a single address to be shorter, and
potentially reduce the incorrect uses of find_vma().
This patch (of 22):
Also change find_vma_intersection() comments and declaration to be of the
correct length and add kernel documentation style comment.
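The new helper, roughly as added to include/linux/mm.h (sketch): return
the vma only if it actually contains addr, rather than the next vma
upwards:
static inline
struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
{
        struct vm_area_struct *vma = find_vma(mm, addr);

        if (vma && addr < vma->vm_start)
                vma = NULL;

        return vma;
}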
Intel-SIG: commit ce6d42f2e4 mm: add vma_lookup(), update
find_vma_intersection() comments.
Backport for SGX virtualization support.
Link: https://lkml.kernel.org/r/20210521174745.2219620-1-Liam.Howlett@Oracle.com
Link: https://lkml.kernel.org/r/20210521174745.2219620-2-Liam.Howlett@Oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
[ Zhiquan: amend commit log ]
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
ANBZ: #32
NOTE: this is an experimental feature.
This duplicates pages synchronously when accessing remote page cache.
Specifically, this creates duplicated pages in do_read_fault() and
filemap_map_pages().
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>
fix#36381167
Pin the code section of a process in memory for the corresponding
VMAs, like mlock does.
Usage:
- pin process "PID":
    echo PID > /proc/unevictable/add_pid
- unpin it:
    echo PID > /proc/unevictable/del_pid
- show all pinned process pids:
    cat /proc/unevictable/add_pid
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
to #34329845
Specific workloads that use multiple threads to read the same file
sequentially may introduce contention on a spinlock during page cache
readahead. The readahead process will end earlier when a readahead page
conflict is detected. This is effective for sequential reading of the
same file with multiple threads and a large readahead size.
The benefit can be shown in the following test by fio:
fio --name=ratest --ioengine=mmap --iodepth=1 --rw=read --direct=0
--size=512M --numjobs=32 --directory=/mnt --filename=ratestfile
Firstly, we set readahead size to its default value:
blockdev --setra 256 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=4427MiB/s (4642MB/s), 138MiB/s-138MiB/s (145MB/s-145MB/s),
io=16.0GiB (17.2GB), run=3699-3701msec
Disk stats (read/write):
vdh: ios=22945/0, merge=0/0, ticks=12515/0, in_queue=12515,
util=50.79%
Then, we set readahead size larger:
blockdev --setra 8192 /dev/vdh
The result is:
Run status group 0 (all jobs):
READ: bw=6450MiB/s (6764MB/s), 202MiB/s-202MiB/s (211MB/s-212MB/s),
io=16.0GiB (17.2GB), run=2535-2540msec
Disk stats (read/write):
vdh: ios=19503/0, merge=0/0, ticks=49375/0, in_queue=49375,
util=23.67%
Finally, we set sysctl vm.enable_multithread_ra_boost=1 with ra=8192
and we get:
Run status group 0 (all jobs):
READ: bw=7234MiB/s (7585MB/s), 226MiB/s-226MiB/s (237MB/s-237MB/s),
io=16.0GiB (17.2GB), run=2262-2265msec
Disk stats (read/write):
vdh: ios=10548/0, merge=0/0, ticks=40709/0, in_queue=40709,
util=23.62%
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Gang Deng <gavin.dg@linux.alibaba.com>