ANBZ: #8917
We used to decide when to kfree(kpa) based on on_rb_tree, but that is
not accurate. Add a new internal counter, kpa->_ref, to track the
lifetime of kpa itself. It is separate from kpa->refcnt, which protects
the kfence pool.
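The lifetime rule can be sketched in userspace as below. This is a
simplified model under stated assumptions, not the kernel code: struct
kpa_model, kpa_model_new(), kpa_model_get(), kpa_model_put() and the
kpa_model_frees counter are hypothetical names, and free() stands in for
kfree().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kfence_pool_area. */
struct kpa_model {
	atomic_int _ref;   /* tracks the lifetime of the struct itself */
	bool on_rb_tree;   /* tree membership only; does not decide freeing */
};

static int kpa_model_frees; /* counts frees, for illustration only */

static struct kpa_model *kpa_model_new(void)
{
	struct kpa_model *kpa = calloc(1, sizeof(*kpa));

	if (kpa)
		atomic_init(&kpa->_ref, 1); /* reference held by the creator */
	return kpa;
}

static void kpa_model_get(struct kpa_model *kpa)
{
	atomic_fetch_add(&kpa->_ref, 1);
}

/* Drop one reference; free the struct when the last one goes away. */
static bool kpa_model_put(struct kpa_model *kpa)
{
	assert(atomic_load(&kpa->_ref) > 0);
	if (atomic_fetch_sub(&kpa->_ref, 1) != 1)
		return false;
	free(kpa);
	kpa_model_frees++;
	return true;
}
```

In this model, removing the area from the tree does not free it; only
the final kpa_model_put() does.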
Fixes: 3bc27d45db ("anolis: kfence: introduce pool_mode and auto distribute pools to nodes")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/3122
ANBZ: #6165
To productize kfence and deploy it online, we need a general config to
control its memory cost on different machines. For example, on machines
with little memory, kfence should be disabled or its pool size should be
limited to a small value.
Introduce a boot cmdline parameter named "kfence.booting_max" with a
syntax similar to crashkernel. A reasonable config can be:
kfence.booting_max=0-128M:0,128M-256M:1M,256M-:2M
So that:
On machines with memory of [0, 128M), kfence will not be enabled.
On machines with memory of [128M, 256M), kfence will allocate at most 1MB
for the kfence pool (which means num_objects = 127 with page_size = 4KB).
On machines with memory of 256M or more, kfence will allocate at most 2MB
for the kfence pool (which means num_objects = 255 with page_size = 4KB).
Notes:
This config only sets the upper limit, so if the user sets
num_objects = 127 and kfence.booting_max=0-:2M, kfence will still
allocate 1MB for the pool.
This config only works in upstream mode (pool_size < 1GB and
sample_interval > 0). Users of debug mode are focusing on a specific
machine and do not need this general setting.
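The range syntax and the size arithmetic above can be sketched in
userspace as follows. This is an illustrative parser, not the kernel
implementation: parse_size(), booting_max_limit(), pool_to_num_objects()
and effective_pool() are hypothetical names, and returning UINT64_MAX
("no limit") for an unmatched or malformed string is an assumption. The
object count follows pool_size = (num_objects + 1) * 2 * PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096ULL

/* Parse a number with an optional K/M/G suffix, similar to memparse(). */
static uint64_t parse_size(const char *s, const char **end)
{
	char *e;
	uint64_t v = strtoull(s, &e, 0);

	switch (*e) {
	case 'G': case 'g': v <<= 10; /* fall through */
	case 'M': case 'm': v <<= 10; /* fall through */
	case 'K': case 'k': v <<= 10; e++; break;
	default: break;
	}
	*end = e;
	return v;
}

/*
 * Return the pool size limit for a machine with total_mem bytes of memory,
 * or UINT64_MAX if no range matches or the string is malformed (assumption).
 * Each entry is "start-[end]:size"; ranges are half-open: [start, end).
 */
static uint64_t booting_max_limit(const char *spec, uint64_t total_mem)
{
	const char *p = spec;

	while (*p) {
		uint64_t start, end = UINT64_MAX, size;

		start = parse_size(p, &p);
		if (*p++ != '-')
			return UINT64_MAX;
		if (*p != ':')
			end = parse_size(p, &p);
		if (*p++ != ':')
			return UINT64_MAX;
		size = parse_size(p, &p);
		if (total_mem >= start && total_mem < end)
			return size;
		if (*p == ',')
			p++;
		else if (*p)
			return UINT64_MAX;
	}
	return UINT64_MAX;
}

/* Each object needs a page pair, and one extra pair holds no object. */
static uint64_t pool_to_num_objects(uint64_t pool_bytes)
{
	uint64_t pairs = pool_bytes / (2 * MODEL_PAGE_SIZE);

	return pairs ? pairs - 1 : 0;
}

/* The limit is only an upper bound on what the user requested. */
static uint64_t effective_pool(uint64_t requested, uint64_t limit)
{
	return requested < limit ? requested : limit;
}
```

With the example string, a 128M machine maps to a 1MB limit and thus
127 objects, while a request for 1MB under a 2M limit stays at 1MB.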
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/2532
ANBZ: #5849
commit bfa7965b33 upstream
Kfence only needs its pool to be mapped at page granularity if it is
initialized early. The previous check was overprotective. In [1], Mark
suggested to "just map the KFENCE region a page granularity". So
decouple it from the check and use page granularity mapping for the
kfence pool only. Note that late init of the kfence pool still requires
page granularity mapping.
Page granularity mapping in theory costs more memory (2M per 1GB) on the
arm64 platform. For example, tested on QEMU (emulated 1GB RAM) with
gki_defconfig and rodata protection turned off:
Before:
[root@liebao ]# cat /proc/meminfo
MemTotal: 999484 kB
After:
[root@liebao ]# cat /proc/meminfo
MemTotal: 1001480 kB
To implement this, also relocate the kfence pool allocation to before
the linear mapping is set up: arm64_kfence_alloc_pool allocates the
physical address, and __kfence_pool is set after the linear mapping is
set up.
Rewrite the initialization to adapt to the self-developed kfence enhancements.
- Add a boot cmdline callback for kfence.num_objects to make kfence
easier to adjust. If kfence.num_objects needs to change in some
scenarios, it can be adjusted on the boot command line.
- Add a new pointer __kfence_pool_early_init and some checks to be compatible
with NUMA node aware kfence pool allocation.
LINK: [1] https://lore.kernel.org/linux-arm-kernel/Y+IsdrvDNILA59UN@FVFF77S0Q05N/
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/1679066974-690-1-git-send-email-quic_zhenhuah@quicinc.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Kaihao Bai <carlo.bai@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
Link: https://gitee.com/anolis/cloud-kernel/pulls/1885
ANBZ: #28
Some functions like __alloc_skb() may call kfence_ksize() frequently.
Although kfence_ksize() returns 0 by default, the call may still add
extra cost. Fix this by calling is_kfence_address() first. It is
always_inline and uses a static branch, so there should be no
performance regression.
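The caller-side pattern can be modeled in userspace roughly as below.
All model_* names are hypothetical, a plain bool stands in for the
static branch, and the size returned by the slow path is made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool model_kfence_enabled;        /* stands in for the static branch */
static uintptr_t model_pool_start, model_pool_size;
static int model_slow_calls;             /* how often the slow path ran */

/* Cheap inline check, analogous to is_kfence_address(). */
static inline bool model_is_kfence_address(uintptr_t addr)
{
	if (!model_kfence_enabled)
		return false;
	return addr - model_pool_start < model_pool_size;
}

/* Out-of-line lookup, analogous to kfence_ksize(); the size is made up. */
static size_t model_kfence_ksize(uintptr_t addr)
{
	(void)addr;
	model_slow_calls++;
	return 64;
}

/* Caller-side pattern: only take the slow path for KFENCE addresses. */
static size_t model_ksize(uintptr_t addr, size_t slab_size)
{
	if (model_is_kfence_address(addr))
		return model_kfence_ksize(addr);
	return slab_size;
}
```

While the flag is off, model_ksize() never reaches the out-of-line
function, which is the cost this change aims to avoid.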
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
The function is_kfence_address() converts kaddr to a page without
checking kaddr, because it is valid in most common cases. But when KASAN
is enabled, an invalid kaddr may be passed in by callers such as
kasan_record_aux_stack(), kasan_poison_shadow(), kasan_unpoison_shadow(),
etc.
Fixes: 2265a18d22 ("anolis: kfence: add PG_kfence to recognize kfence address in fast path")
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
Add an interface at /sys/module/kfence/parameters/order0_page to control
whether KFENCE monitors order-0 pages.
The default is 1 (enabled).
To disable KFENCE monitoring of order-0 pages, write 0 to the interface,
or set kfence.order0_page=0 in the boot cmdline.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
Once kfence_skip_interval is enabled (sample_interval=-1), it is never
disabled, so a later positive sample_interval has no effect.
Add a new static branch to avoid false positives reported by the canary
check.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
For productization, a global pool mode is provided to users. The
num_objects_pernode interface is reverted to num_objects to match the
upstream version. Moreover, a new interface, kfence.pool_mode, is added
to determine whether KFENCE will use global mode or pool mode to
allocate KFENCE pool memory.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
After KFENCE is disabled, all areas are scanned and the number of in-use
objects in each area is recorded. When the count of an area
drops to 0, reclaiming starts.
First, all percpu freelists are forcibly flushed to the pernode freelists.
Then, the pages of these unused areas can be safely freed.
Finally, the metadata recorded in these areas is freed and
the areas are marked as zombies in the rb_tree.
They are removed from the tree the next time KFENCE is enabled.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
Based on previous commits, dynamic enabling of KFENCE is introduced. We
keep trying to allocate free 1GiB contiguous pages until the requested
size is met or no space is left. Write a positive number to
/sys/module/kfence/parameters/num_objects_pernode to set the max number of
objects KFENCE can hold, and then write a positive number to
/sys/module/kfence/parameters/sample_interval to enable KFENCE.
NOTE: this requires that KFENCE has not been initialized since boot.
To stay in sync with rb_tree searches, stop_machine() is called when
updating the tree.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
To make dynamic pool allocation, memory management, and future memory
freeing more convenient, kfence_pool_area is introduced. Compared
with the previous struct, which controlled a whole pool on one node,
multiple new structs can now control different areas on the same
node. An rb_tree is constructed to maintain all these areas.
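The idea of looking up the area that owns an address can be sketched as
below. To stay self-contained, this sketch swaps the rb_tree for a
binary search over an array sorted by start address; struct area_model
and area_lookup() are hypothetical names and the fields are a trimmed
guess at what an area needs to record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct kfence_pool_area. */
struct area_model {
	uintptr_t start;
	size_t size;
	int nid;  /* NUMA node the area belongs to */
};

/* Find the area containing addr in an array sorted by start; NULL if none. */
static const struct area_model *area_lookup(const struct area_model *areas,
					    size_t n, uintptr_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < areas[mid].start)
			hi = mid;
		else if (addr - areas[mid].start >= areas[mid].size)
			lo = mid + 1;
		else
			return &areas[mid];
	}
	return NULL;
}
```

Several areas may carry the same nid, which mirrors the point that one
node can now hold more than one area.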
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
To improve performance, add a PG_kfence flag to first recognize whether an
address is in the kfence pool. If so, the remaining work can be done by
calling non-inline functions.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #28
Some functions may call free_compound_page() to free an order-0 page
allocated by kfence, e.g., io_mem_free() in io_uring. However,
__free_pages_ok() is used to free order>0 pages, so the checks and the
free go wrong.
This patch also tries to improve performance by adding a static branch to
some hot paths.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
ANBZ: #27
Since is_kfence_address() is a hot path, add a static branch to skip the
subsequent steps, such as getting the page and nid. This static branch is
also used to speed up kfence_ksize() and to fix an error when
kfence_ksize() is called by kmemleak before KFENCE is initialized.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
Single kernel pages are now supported by kfence. Kernel order-0 pages
(including order-0 pages from vmalloc) are allocated from the kfence pool.
To exclude some specific page types (e.g., slab pages and pgtable pages),
a new gfp flag is introduced.
Out-of-bounds reads/writes and use-after-free can be detected.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
We may want all slab objects to be allocated by kfence, so when the pool
size is large, the allocation gate interval limit is removed. When the
number of objects is larger than 65535 (the default max value upstream),
kfence allocates from its pool at all times if possible.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
The original kfence is designed for high efficiency, so the pool
size is small. Since we raised the max value of num_objects to the
million level, it is necessary to support NUMA. To avoid confusion,
the boot cmdline parameter kfence.num_objects is renamed to
kfence.num_objects_pernode.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
The original kfence has a fixed num_objects and pool_size set by
CONFIG, but in some cases we want to change them dynamically. This patch
allows users to set their own num_objects via kfence.num_objects
on the boot cmdline and rebooting.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
commit a7cb5d23ea upstream.
Originally the addr != NULL check was meant to take care of the case
where __kfence_pool == NULL (KFENCE is disabled). However, this does
not work for addresses where addr > 0 && addr < KFENCE_POOL_SIZE.
This can be the case on NULL-deref where addr > 0 && addr < PAGE_SIZE or
any other faulting access with addr < KFENCE_POOL_SIZE. While the
kernel would likely crash, the stack traces and report might be
confusing due to double faults upon KFENCE's attempt to unprotect such
an address.
Fix it by just checking that __kfence_pool != NULL instead.
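The difference between the two checks can be reproduced in a small
userspace model. The helper names in_pool_old() and in_pool_fixed() and
the 2 MiB MODEL_POOL_SIZE are illustrative assumptions; addresses are
plain integers so that the unsigned wrap-around of the range check is
well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_POOL_SIZE (2UL * 1024 * 1024)

/* Old check: relies on addr != 0 to cover the "pool not set up" case. */
static bool in_pool_old(uintptr_t pool, uintptr_t addr)
{
	return addr - pool < MODEL_POOL_SIZE && addr != 0;
}

/* Fixed check: test the pool pointer itself. */
static bool in_pool_fixed(uintptr_t pool, uintptr_t addr)
{
	return addr - pool < MODEL_POOL_SIZE && pool != 0;
}
```

With pool == 0, a faulting address such as 0x10 passes the old check
but is rejected by the fixed one.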
Link: https://lkml.kernel.org/r/20210818130300.2482437-1-elver@google.com
Fixes: 0ce20dd840 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org> [5.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
commit bc8fbc5f30 upstream.
Add KFENCE test suite, testing various error detection scenarios. Makes
use of KUnit for test organization. Since KFENCE's interface to obtain
error reports is via the console, the test verifies that KFENCE outputs
expected reports to the console.
[elver@google.com: fix typo in test]
Link: https://lkml.kernel.org/r/X9lHQExmHGvETxY4@elver.google.com
[elver@google.com: show access type in report]
Link: https://lkml.kernel.org/r/20210111091544.3287013-2-elver@google.com
Link: https://lkml.kernel.org/r/20201103175841.3495947-9-elver@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Co-developed-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joern Engel <joern@purestorage.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
commit d438fabce7 upstream.
Instead of removing the fault handling portion of the stack trace based on
the fault handler's name, just use struct pt_regs directly.
Change kfence_handle_page_fault() to take a struct pt_regs, and plumb it
through to kfence_report_error() for out-of-bounds, use-after-free, or
invalid access errors, where pt_regs is used to generate the stack trace.
If the kernel is a DEBUG_KERNEL, also show registers for more information.
Link: https://lkml.kernel.org/r/20201105092133.2075331-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
OpenAnolis Bug Tracker: 0000278
commit 0ce20dd840 upstream.
Patch series "KFENCE: A low-overhead sampling-based memory safety error detector", v7.
This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors. This
series enables KFENCE for the x86 and arm64 architectures, and adds
KFENCE hooks to the SLAB and SLUB allocators.
KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design, is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.
KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error.
Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval,
the next allocation through the main allocator (SLAB or SLUB) returns a
guarded allocation from the KFENCE object pool. At this point, the timer
is reset, and the next allocation is set up after the expiration of the
interval.
To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE.
The KFENCE memory pool is of fixed size, and if the pool is exhausted no
further KFENCE allocations occur. The default config is conservative
with only 255 objects, resulting in a pool size of 2 MiB (with 4 KiB
pages).
We have verified by running synthetic benchmarks (sysbench I/O,
hackbench) and production server-workload benchmarks that a kernel with
KFENCE (using sample intervals 100-500ms) is performance-neutral
compared to a non-KFENCE baseline kernel.
KFENCE is inspired by GWP-ASan [1], a userspace tool with similar
properties. The name "KFENCE" is a homage to the Electric Fence Malloc
Debugger [2].
For more details, see Documentation/dev-tools/kfence.rst added in the
series -- also viewable here:
https://raw.githubusercontent.com/google/kasan/kfence/Documentation/dev-tools/kfence.rst
[1] http://llvm.org/docs/GwpAsan.html
[2] https://linux.die.net/man/3/efence
This patch (of 9):
This adds the Kernel Electric-Fence (KFENCE) infrastructure. KFENCE is a
low-overhead sampling-based memory safety error detector of heap
use-after-free, invalid-free, and out-of-bounds access errors.
KFENCE is designed to be enabled in production kernels, and has near
zero performance overhead. Compared to KASAN, KFENCE trades performance
for precision. The main motivation behind KFENCE's design, is that with
enough total uptime KFENCE will detect bugs in code paths not typically
exercised by non-production test workloads. One way to quickly achieve a
large enough total uptime is when the tool is deployed across a large
fleet of machines.
KFENCE objects each reside on a dedicated page, at either the left or
right page boundaries. The pages to the left and right of the object
page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access to them. Such page
faults are then intercepted by KFENCE, which handles the fault
gracefully by reporting a memory access error. To detect out-of-bounds
writes to memory within the object's page itself, KFENCE also uses
pattern-based redzones. The following figure illustrates the page
layout:
---+-----------+-----------+-----------+-----------+-----------+---
   | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
   | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
   | x GUARD x | J : RED-  | x GUARD x |  RED- : J | x GUARD x |
   | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
   | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
   | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
---+-----------+-----------+-----------+-----------+-----------+---
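The address arithmetic behind this layout can be sketched as follows.
It assumes the upstream arrangement in which the first two pages of the
pool hold no object and every object owns one data page followed by one
guard page. The function names, MODEL_PAGE_SIZE and the align parameter
are illustrative, not the kernel's identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL

/* Start of the data page for object idx; the first page pair is skipped. */
static uintptr_t object_page(uintptr_t pool, size_t idx)
{
	return pool + (idx + 1) * 2 * MODEL_PAGE_SIZE;
}

/* Address of an object of the given size, left- or right-aligned. */
static uintptr_t object_addr(uintptr_t page, size_t size, size_t align,
			     bool right)
{
	/* align must be a non-zero power of two */
	assert(size <= MODEL_PAGE_SIZE && align && !(align & (align - 1)));
	if (!right)
		return page;
	/* Push the object to the end of the page, then align it down. */
	return (page + MODEL_PAGE_SIZE - size) & ~(uintptr_t)(align - 1);
}

/* Pages 0 and 1 and every odd page after them are protected. */
static bool is_guard_page(uintptr_t pool, uintptr_t addr)
{
	size_t pg = (addr - pool) / MODEL_PAGE_SIZE;

	return pg < 2 || pg % 2 == 1;
}
```

The bytes of the data page not covered by the object are the redzone,
which is checked by pattern rather than by page faults.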
Guarded allocations are set up based on a sample interval (can be set
via kfence.sample_interval). After expiration of the sample interval, a
guarded allocation from the KFENCE object pool is returned to the main
allocator (SLAB or SLUB). At this point, the timer is reset, and the
next allocation is set up after the expiration of the interval.
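The sampling behaviour described above can be modeled with a simple
gate counter. This is a userspace sketch using C11 atomics; model_gate,
model_timer_expired() and model_should_sample() are hypothetical names,
and the kernel's timer and static branch are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* 0 means the next allocation may be served from the KFENCE pool. */
static atomic_int model_gate = 1;

/* Called when the sample interval expires: open the gate. */
static void model_timer_expired(void)
{
	atomic_store(&model_gate, 0);
}

/* Returns true if this allocation should be a guarded KFENCE allocation. */
static bool model_should_sample(void)
{
	/* Cheap read first; only the first caller after a reset wins. */
	if (atomic_load(&model_gate) != 0)
		return false;
	return atomic_fetch_add(&model_gate, 1) == 0;
}
```

Each expiry of the interval therefore yields at most one guarded
allocation, no matter how many allocations race for it.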
To enable/disable a KFENCE allocation through the main allocator's
fast-path without overhead, KFENCE relies on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. To date, we have verified by running synthetic
benchmarks (sysbench I/O, hackbench) that a kernel compiled with KFENCE
is performance-neutral compared to the non-KFENCE baseline.
For more details, see Documentation/dev-tools/kfence.rst (added later in
the series).
[elver@google.com: fix parameter description for kfence_object_start()]
Link: https://lkml.kernel.org/r/20201106092149.GA2851373@elver.google.com
[elver@google.com: avoid stalling work queue task without allocations]
Link: https://lkml.kernel.org/r/CADYN=9J0DQhizAGB0-jz4HOBBh+05kMBXb4c0cXMS7Qi5NAJiw@mail.gmail.com
Link: https://lkml.kernel.org/r/20201110135320.3309507-1-elver@google.com
[elver@google.com: fix potential deadlock due to wake_up()]
Link: https://lkml.kernel.org/r/000000000000c0645805b7f982e4@google.com
Link: https://lkml.kernel.org/r/20210104130749.1768991-1-elver@google.com
[elver@google.com: add option to use KFENCE without static keys]
Link: https://lkml.kernel.org/r/20210111091544.3287013-1-elver@google.com
[elver@google.com: add missing copyright and description headers]
Link: https://lkml.kernel.org/r/20210118092159.145934-1-elver@google.com
Link: https://lkml.kernel.org/r/20201103175841.3495947-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Co-developed-by: Marco Elver <elver@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Joern Engel <joern@purestorage.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>