The backend generally uses 64-bit immediates (e.g. what
MachineOperand::getImm() returns), so use that for analyzeCompare()
and optimizeCompareInstr() as well. This avoids truncation for
targets that support immediates larger than 32 bits. In particular, we
can avoid the bug-prone value normalization hack in the AArch64
target.
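As a minimal sketch of the truncation being avoided (the helper names below are hypothetical and not part of the LLVM API), narrowing a 64-bit immediate to int can silently turn a large value into 0 or 1:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the old interface, where the compare immediate
// was passed around as an int. On mainstream compilers the cast keeps only
// the low 32 bits (guaranteed modular since C++20).
inline int narrowCmpValue(int64_t Imm) { return static_cast<int>(Imm); }

// True when narrowing to int changes the value of the immediate.
inline bool losesInformation(int64_t Imm) {
  return static_cast<int64_t>(narrowCmpValue(Imm)) != Imm;
}
```

For example, an immediate of (1 << 32) + 1 narrows to 1, which is exactly the kind of value a compare-against-one peephole would then misinterpret.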
This is a follow-up to D108076.
Differential Revision: https://reviews.llvm.org/D108875
This is a non-intrusive fix for
https://bugs.llvm.org/show_bug.cgi?id=51476 intended for backport
to the 13.x release branch. It expands on the current hack by
distinguishing between CmpValue of 0, 1 and 2, where 0 and 1 have
the obvious meaning and 2 means "anything else". The new optimization
from D98564 should only be performed for CmpValue of 0 or 1.
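The three-way normalization can be sketched as follows. This is an illustration of the idea only; the function names are hypothetical and the actual AArch64 code may be structured differently:

```cpp
#include <cassert>
#include <cstdint>

// Collapse a compare immediate into 0, 1, or 2 ("anything else").
constexpr int normalizeCmpValue(int64_t Imm) {
  return Imm == 0 ? 0 : (Imm == 1 ? 1 : 2);
}

// The cset/cmp peephole from D98564 is only valid for real 0 or 1.
constexpr bool csetPeepholeAllowed(int NormalizedCmpValue) {
  return NormalizedCmpValue == 0 || NormalizedCmpValue == 1;
}
```

With this scheme a large immediate such as 1 << 32 maps to 2 rather than being mistaken for 0.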
For main, I think we should switch the analyzeCompare() and
optimizeCompareInstr() APIs to use int64_t instead of int, which is in
line with MachineOperand's notion of an immediate, and avoids this
problem altogether.
Differential Revision: https://reviews.llvm.org/D108076
This changes the lowering of f32 and f64 COPY from a 128-bit vector ORR to
an fmov of the appropriate type. At least on some CPUs with 64-bit NEON
data paths this is expected to be faster, and shouldn't be slower on any
CPU that treats fmov as a register rename.
Differential Revision: https://reviews.llvm.org/D106365
This avoids the use of the vector unit for copying from scalar to
vector. There is an extra ptrue instruction, but a predicate register
with the ptrue pattern populated is likely to be free in the context of
real code.
Tests were generated from a template to cover the axes mentioned at the
top of the test file.
Co-authored-by: Francesco Petrogalli <francesco.petrogalli@arm.com>
Differential Revision: https://reviews.llvm.org/D103170
PACI*SP have the advantage that they are in HINT space, meaning
they can be run successfully in hardware without PAuth support -
they will just behave as a NOP. However, PACI*SP are also implicit
landing pads (think of an extra BTI jc). Therefore, they allow
indirect jumps of all kinds into them, potentially inserting new
gadgets. This patch replaces PACI*SP by PACI* LR, SP when
compiling explicitly for hardware with full PAuth support. PACI*
is not in the HINT space, therefore it will fault when run in
hardware without PAuth support, but it is also not a landing pad,
making programs safer in newer HW.
Differential Revision: https://reviews.llvm.org/D101920
D88631 added initial support for:
- -mstack-protector-guard=
- -mstack-protector-guard-reg=
- -mstack-protector-guard-offset=
flags, and D100919 extended these to AArch64. Unfortunately, these flags
aren't retained for LTO. Make them module attributes rather than
TargetOptions.
Link: https://github.com/ClangBuiltLinux/linux/issues/1378
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D102742
Follow up to D88631 but for aarch64; the Linux kernel uses the command
line flags:
1. -mstack-protector-guard=sysreg
2. -mstack-protector-guard-reg=sp_el0
3. -mstack-protector-guard-offset=0
to use the system register sp_el0 for the stack canary, enabling the
kernel to have a unique stack canary per task (like a thread, but not
limited to userspace as the kernel can preempt itself).
Address pr/47341 for aarch64.
Fixes: https://github.com/ClangBuiltLinux/linux/issues/289
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed By: xiangzhangllvm, DavidSpickett, dmgreen
Differential Revision: https://reviews.llvm.org/D100919
Swift's new concurrency features are going to require guaranteed tail calls so
that they don't consume excessive amounts of stack space. This would normally
mean "tailcc", but there are also Swift-specific ABI desires that don't
naturally go along with "tailcc" so this adds another calling convention that's
the combination of "swiftcc" and "tailcc".
Support is added for AArch64 and X86 for now.
This extends any frame record created in the function to include that
parameter, passed in X22.
The new record looks like [X22, FP, LR] in memory, and FP is stored with 0b0001
in bits 63:60 (CodeGen assumes they are 0b0000 in normal operation). The effect
of this is that tools walking the stack should expect to see one of three
values there:
* 0b0000 => a normal, non-extended record with just [FP, LR]
* 0b0001 => the extended record [X22, FP, LR]
* 0b1111 => kernel space, and a non-extended record.
All other values are currently reserved.
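For tools walking the stack, the classification above might look like the following sketch. The enum and function names are hypothetical; only the bit patterns come from the description above:

```cpp
#include <cassert>
#include <cstdint>

enum class FrameRecordKind { Normal, ExtendedAsync, Kernel, Reserved };

// Classify a frame record by bits 63:60 of the stored FP value.
inline FrameRecordKind classifyFrameRecord(uint64_t StoredFP) {
  switch (StoredFP >> 60) {
  case 0x0: return FrameRecordKind::Normal;        // [FP, LR]
  case 0x1: return FrameRecordKind::ExtendedAsync; // [X22, FP, LR]
  case 0xF: return FrameRecordKind::Kernel;        // kernel space, [FP, LR]
  default:  return FrameRecordKind::Reserved;
  }
}

// For an extended record, clear the 0b0001 tag to recover the FP address.
// Other kinds are returned unchanged.
inline uint64_t framePointerAddress(uint64_t StoredFP) {
  if (classifyFrameRecord(StoredFP) == FrameRecordKind::ExtendedAsync)
    return StoredFP & ~(UINT64_C(0xF) << 60);
  return StoredFP;
}
```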
If compiling for arm64e this context pointer is address-discriminated with the
discriminator 0xc31a and the DB (process-specific) key.
There is also an "i8** @llvm.swift.async.context.addr()" intrinsic providing
front-ends access to this slot (and forcing its creation initialized to nullptr
if necessary).
When a ptest is used to set flags from the output of rdffr, the ptest
can be eliminated, using a flags-setting rdffrs instead.
Additionally, check that nothing consumes flags between rdffr and ptest;
this case appears to have been missed previously.
* There is no unpredicated RDFFRS instruction.
* If substituting RDFFR_PP, require that the mask argument of the
PTEST matches that of the RDFFR_PP.
* Move some precondition code up inside optimizePTestInstr, so that it
covers the new code paths for RDFFR which return earlier.
* Only consider RDFFR, PTEST in same basic block.
* Check for other flag setting instructions between the two, abort if
found.
* Drop an old TODO comment about removing dead PTEST instructions.
RDFFR_P to follow in later patch.
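The "nothing touches flags in between" check can be sketched on a mock basic block. MockInstr and the function below are hypothetical simplifications, not the MachineInstr API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified view of an instruction's effect on NZCV.
struct MockInstr {
  bool ReadsFlags;
  bool WritesFlags;
};

// Returns true if no instruction strictly between Block[From] and Block[To]
// reads or writes the flags. Both indices refer to the same basic block.
bool flagsUntouchedBetween(const std::vector<MockInstr> &Block,
                           std::size_t From, std::size_t To) {
  assert(From < To && To < Block.size());
  for (std::size_t I = From + 1; I < To; ++I)
    if (Block[I].ReadsFlags || Block[I].WritesFlags)
      return false; // abort the transformation
  return true;
}
```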
Differential Revision: https://reviews.llvm.org/D101357
This patch merges STR<S,D,Q,W,X>pre-STR<S,D,Q,W,X>ui and
LDR<S,D,Q,W,X>pre-LDR<S,D,Q,W,X>ui instruction pairs into a single
STP<S,D,Q,W,X>pre and LDP<S,D,Q,W,X>pre instruction, respectively.
For each pair, there is a MIR test that verifies this optimization.
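A rough sketch of when such a pair is mergeable, using byte offsets and hypothetical struct names. It ignores further legality checks (register overlap, intervening instructions) that a real pass would need; the range check reflects the signed 7-bit scaled immediate of the paired pre-indexed forms:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical models of "str x0, [base, #PreOffset]!" and
// "str x1, [base, #Offset]", with offsets in bytes.
struct PreIndexedAccess { unsigned BaseReg; int64_t PreOffset; unsigned SizeBytes; };
struct OffsetAccess     { unsigned BaseReg; int64_t Offset;    unsigned SizeBytes; };

bool canMergeIntoPairedPre(const PreIndexedAccess &First,
                           const OffsetAccess &Second) {
  if (First.BaseReg != Second.BaseReg || First.SizeBytes != Second.SizeBytes)
    return false;
  const int64_t Size = static_cast<int64_t>(First.SizeBytes);
  // The second access must sit directly after the first one, relative to
  // the already-updated base register.
  if (Second.Offset != Size)
    return false;
  // The paired form encodes a signed 7-bit immediate scaled by the size.
  if (First.PreOffset % Size != 0)
    return false;
  const int64_t Scaled = First.PreOffset / Size;
  return Scaled >= -64 && Scaled <= 63;
}
```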
Differential Revision: https://reviews.llvm.org/D99272
Change-Id: Ie97a20c8c716c08492fe229c22e14e3c98ef08b7
Comparisons to zero or one after cset instructions can be safely
removed in examples like:
    cset w9, eq          cset w9, eq
    cmp  w9, #1   --->   <removed>
    b.ne .L1             b.ne .L1

    cset w9, eq          cset w9, eq
    cmp  w9, #0   --->   <removed>
    b.ne .L1             b.eq .L1
This adds a peephole optimization that detects suitable cases and
removes those comparisons.
Differential Revision: https://reviews.llvm.org/D98564
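The rewrite of the branch condition in the cset/cmp examples can be modelled as below. This is a sketch restricted to EQ/NE with hypothetical names; it assumes the compare immediate is 0 or 1 and is not the actual implementation:

```cpp
#include <cassert>

enum class Cond { EQ, NE };

inline Cond invert(Cond C) { return C == Cond::EQ ? Cond::NE : Cond::EQ; }

// After "cset w, CsetCond" the register is 1 exactly when CsetCond held.
// Given "cmp w, #CmpImm" (CmpImm is 0 or 1) followed by "b.<BranchCond>",
// return the condition the branch should test on the original flags once
// the cmp is removed.
inline Cond foldCsetCompare(Cond CsetCond, int CmpImm, Cond BranchCond) {
  // The branch is taken when CsetCond held iff it tests "== 1" or "!= 0".
  const bool TakenWhenCsetCondHolds =
      (BranchCond == Cond::EQ) == (CmpImm == 1);
  return TakenWhenCsetCondHolds ? CsetCond : invert(CsetCond);
}
```

Comparing against 1 with b.ne keeps b.ne, while comparing against 0 with b.ne becomes b.eq, matching the two examples.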
This reverts commit cca9b5985c.
Buildbot reported an error for CodeGen/AArch64/machine-combiner-fmul-dup.mir:
*** Bad machine code: Virtual register killed in block, but needed live out. ***
- function: indexed_2s
- basic block: %bb.0 entry (0x640fee8)
Virtual register %7 is used after the block.
*** Bad machine code: Virtual register defs don't dominate all uses. ***
- function: indexed_2s
- v. register: %7
LLVM ERROR: Found 2 machine code errors.
This patch adds DUP+FMUL => FMUL_indexed pattern to InstCombiner.
FMUL_indexed is normally selected during instruction selection, but it
does not work in cases when VDUP and VMUL are in different basic
blocks.
Differential Revision: https://reviews.llvm.org/D99662
Prefer (self-documenting) return values to output parameters (which are
liable to be used).
While here, rename Noop to Nop which is more widely used and improves
consistency with hasEmitNops/setEmitNops/emitNop/etc.
This combine tries to do inter-block hoisting of extends of G_PHIs, into the
originating blocks of the phi's incoming value. The idea is to expose further
optimization opportunities that are normally obscured by the PHI.
Some basic heuristics and a target hook for AArch64 are added to allow tuning.
E.g. if the extend is used by a G_PTR_ADD, it doesn't perform this combine
since it may be folded into the addressing mode during selection.
There are very minor code size improvements on AArch64 -Os, but the real benefit
is that it unlocks optimizations like AArch64 conditional compares on some
benchmarks.
Differential Revision: https://reviews.llvm.org/D95703
Different targets might handle branch performance differently, so this patch allows for
targets to specify the TailDuplicateSize threshold. Said threshold defines how small a branch
can be and still be duplicated to generate straight-line code instead.
This patch also specifies said override values for the AArch64 subtarget.
Differential Revision: https://reviews.llvm.org/D95631
After 49142991a6, clang detects that MUL may be uninitialized. Set it to nullptr to suppress this check.
Adding an assert to check that it is ultimately set fails two test cases. Since this is not a new issue, leave the assertion commented out until a code owner can fix the bug. The two failing test cases are noted in the assertion comment.
Add a new goal, MustReduceRegisterPressure, for the machine combiner pass.
PowerPC will use this new goal to do some register-pressure-related optimization.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D92068
This patch replaces the AArch64StackOffset class by the generic one
defined in TypeSize.h.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D88983
The pass is updated to handle loads through complex addressing mode,
specifically, when we have a scaled register and a scale.
It requires two API updates in TII which have been implemented for X86.
See added IR and MIR testcases.
Tests-Run: make check
Reviewed-By: reames, danstrushin
Differential Revision: https://reviews.llvm.org/D87148
PAC/BTI-related codegen in the AArch64 backend is controlled by a set
of LLVM IR function attributes, added to the function by Clang, based
on command-line options and GCC-style function attributes. However,
functions generated in the LLVM middle end (for example,
asan.module.ctor or __llvm_gcov_write_out) do not get any attributes,
so the backend incorrectly skips PAC/BTI code generation for them.
This patch records the default state of PAC/BTI codegen in a set of
LLVM IR module-level attributes, based on command-line options:
* "sign-return-address", with non-zero value means generate code to
sign return addresses (PAC-RET), zero value means disable PAC-RET.
* "sign-return-address-all", with non-zero value means enable PAC-RET
for all functions, zero value means enable PAC-RET only for
functions that spill LR.
* "sign-return-address-with-bkey", with non-zero value means use B-key
for signing, zero value means use A-key.
This set of attributes is always added for AArch64 targets (as
opposed, for example, to interpreting a missing attribute as having a
value 0) in order to be able to check for conflicts when combining
module attributes during LTO.
Module-level attributes are overridden by function level attributes.
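The override rule can be sketched as follows. The types are hypothetical simplifications for illustration and do not mirror AArch64FunctionInfo; a null pointer stands for "no function-level attribute present":

```cpp
#include <cassert>

enum class SignScope { None, SpillsLROnly, All };

// Hypothetical mirror of the module-level attributes described above.
struct ModuleFlags {
  int SignReturnAddress;         // non-zero: PAC-RET enabled
  int SignReturnAddressAll;      // non-zero: all functions, else LR spillers
  int SignReturnAddressWithBKey; // non-zero: B-key, else A-key
};

inline SignScope moduleDefaultScope(const ModuleFlags &M) {
  if (M.SignReturnAddress == 0)
    return SignScope::None;
  return M.SignReturnAddressAll != 0 ? SignScope::All
                                     : SignScope::SpillsLROnly;
}

// A function-level setting, when present, wins over the module default.
inline SignScope effectiveScope(const ModuleFlags &M,
                                const SignScope *FnOverride) {
  return FnOverride ? *FnOverride : moduleDefaultScope(M);
}
```

A function created in the middle end carries no attribute, so it would pick up the module default instead of silently getting no PAC-RET.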
All the decision making about whether or not to generate PAC and/or
BTI code is factored out into AArch64FunctionInfo, there shouldn't be
any places left, other than AArch64FunctionInfo, which directly
examine PAC/BTI attributes, except AArch64AsmPrinter.cpp, which
is/will-be handled by a separate patch.
Differential Revision: https://reviews.llvm.org/D85649
The motivation here is that MachineBlockPlacement relies on analyzeBranch to remove branches to fallthrough blocks when the branch is not fully analyzable. With the introduction of the FAULTING_OP pseudo for implicit null checking (see D87861), this case becomes important. Note that it's hard to otherwise exercise this path as BranchFolding handles any fully analyzable branch sequence without using this interface.
p.s. For anyone who saw my comment in the original review, what I thought was an issue in BranchFolding originally turned out to simply be a bug in my patch. (Now fixed.)
Differential Revision: https://reviews.llvm.org/D88035
This change enables the generic implicit null transformation for the AArch64 target. As background for those unfamiliar with our implicit null check support:
An implicit null check is the use of a signal handler to catch a null pointer dereference and redirect control to a handler. Specifically, it replaces an explicit conditional branch with such a redirect. This is only done for very cold branches under frontend control with appropriate metadata.
FAULTING_OP is used to wrap the faulting instruction. It is modelled as being a conditional branch to reflect the fact it can transfer control in the CFG.
FAULTING_OP does not need to be an analyzable branch to achieve its purpose. (Or at least, that's the x86 model. I find this slightly questionable.)
When lowering to MC, we convert the FAULTING_OP back into the actual instruction, record the labels, and lower the original instruction.
As can be seen in the test changes, currently the AArch64 backend does not eliminate the unconditional branch to the fallthrough block. I've tried two approaches, neither of which worked. I plan to return to this in a separate change set once I've wrapped my head around the interactions a bit better. (X86 handles this via AllowModify on analyzeBranch, but adding the obvious code causes BranchFolding to crash. I haven't yet figured out if it's a latent bug in BranchFolding, or something I'm doing wrong.)
Differential Revision: https://reviews.llvm.org/D87851
This reverts commit bc9a29b9ee.
The reasoning that this patch was wrong was itself incorrect
(see discussion on llvm-commits). This patch does seem to be exposing
a latent SVE code generation bug on non-public tests, which should
not block a correctness fix for public, non-SVE use cases.
This reverts commit e9d9a61208.
This patch was previously reverted by 04879086b4
with the reapplication being done after breaking the assert used to
ensure SP is always 16-byte aligned, which is a requirement of the AAPCS.
For extra context the latest patch caused runtime failures when
building with "-march=armv8-a+sve -mllvm -aarch64-sve-vector-bits-min=256".
Original Commit Message:
After the commit r368987 (rG643adb55769e) was landed, the frame record (FP and LR register)
may be placed in the middle of a stack frame if a function has both callee-saved
general-purpose registers and floating point registers. This will break the stack unwinders
that simply walk through the frame records (based on the guarantee from AAPCS64
"The Frame Pointer" section). This commit fixes the problem by adding the frame record offset.
Patch By: logan
Differential Revision: D70800
In cases where MachineOutliner candidates either:
* are noreturn
* have calls with no available LR or free regs
* don't use SP
we can end up hitting stack fixup code for the caller and the callee for
a FrameID of MachineOutlinerDefault. This triggers the assert:
`assert(OF.FrameConstructionID != MachineOutlinerDefault &&
"Can only fix up stack references once");`
in AArch64InstrInfo.cpp. This assert exists for now because a lot of the
fixup code is not tested to handle fixing up more than once and needs
some better checks and enhancements to avoid potentially generating
illegal code.
I've filed a Bugzilla report to track this until these cases are handled
by the AArch64 MachineOutliner: https://bugs.llvm.org/show_bug.cgi?id=46767
This diff detects cases that will cause these multiple stack fixups and
prunes the Candidates from `RepeatedSequenceLocs`.
Differential Revision: https://reviews.llvm.org/D83923
Currently, instruction level fast math flags are not considered when
generating patterns for the machine combiner.
This currently leads to some missed opportunities to generate FMAs in
combination with `#pragma clang fp contract (fast)`.
For example, when building the example below with -O3 for AArch64, no
FMADD is generated. If built with -O2 and the DAGCombiner is used
instead of the MachineCombiner for FMAs, an FMADD is generated.
With this patch, the same code is generated in both cases.
float madd_contract(float a, float b, float c) {
#pragma clang fp contract (fast)
return (a * b) + c;
}
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D84930
It's sort of tricky to hit this in practice, but not impossible. I have
a synthetic C testcase if anyone is interested.
The implementation is identical to the equivalent NEON register copies.
Differential Revision: https://reviews.llvm.org/D84373
To make sure that no barrier gets placed on the architectural execution
path, each
BLR x<N>
instruction gets transformed to a
BL __llvm_slsblr_thunk_x<N>
instruction, with __llvm_slsblr_thunk_x<N> a thunk that contains
__llvm_slsblr_thunk_x<N>:
BR x<N>
<speculation barrier>
Therefore, the BLR instruction gets split into 2; one BL and one BR.
This transformation results in not inserting a speculation barrier on
the architectural execution path.
The mitigation is off by default and can be enabled by the
harden-sls-blr subtarget feature.
As a linker is allowed to clobber X16 and X17 on function calls, the
above code transformation would not be correct in case a linker does so
when N=16 or N=17. Therefore, when the mitigation is enabled, generation
of BLR x16 or BLR x17 is avoided.
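A small sketch of the thunk naming and the register restriction follows. The function is hypothetical. It also rejects x30 as an assumption of this sketch, since the BL that replaces the BLR overwrites x30 with the return address:

```cpp
#include <cassert>
#include <string>

// Returns the thunk symbol to call instead of "BLR x<RegNo>", or an empty
// string when no thunk may be used for that register.
std::string slsBlrThunkName(unsigned RegNo) {
  // x16/x17 may be clobbered by linker-inserted veneers between the BL and
  // the BR inside the thunk, so BLR through them is avoided altogether.
  if (RegNo == 16 || RegNo == 17)
    return std::string();
  // Assumption of this sketch: only x0..x29 are valid call targets here.
  if (RegNo > 29)
    return std::string();
  return "__llvm_slsblr_thunk_x" + std::to_string(RegNo);
}
```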
As BLRA* indirect calls are not produced by LLVM currently, this does
not aim to implement support for those.
Differential Revision: https://reviews.llvm.org/D81402
Some processors may speculatively execute the instructions immediately
following RET (returns) and BR (indirect jumps), even though
control flow should change unconditionally at these instructions.
To prevent a mis-speculatively executed gadget after these
instructions from leaking secrets through side channels, this pass places a
speculation barrier immediately after every RET and BR instruction.
Since these barriers are never on the correct, architectural execution
path, performance overhead of this is expected to be low.
On targets that implement the Armv8.0-SB Speculation Barrier extension,
a single SB instruction is emitted that acts as a speculation barrier.
On other targets, a DSB SYS followed by an ISB is emitted to act as a
speculation barrier.
These speculation barriers are implemented as pseudo instructions to
prevent later passes from analyzing and potentially removing them.
Even though currently LLVM does not produce BRAA/BRAB/BRAAZ/BRABZ
instructions, these are also mitigated by the pass and tested through a
MIR test.
The mitigation is off by default and can be enabled by the
harden-sls-retbr subtarget feature.
Differential Revision: https://reviews.llvm.org/D81400
Summary:
While clustering mem ops, AMDGPU target needs to consider number of clustered bytes
to decide on max number of mem ops that can be clustered. This patch adds support to pass
number of clustered bytes to target mem ops clustering logic.
Reviewers: foad, rampitec, arsenm, vpykhtin, javedabsar
Reviewed By: foad
Subscribers: MatzeB, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, javed.absar, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D80545
When a stack offset was too big to materialize in a single instruction, we were
trying to do it in stages:
adds xD, sp, #imm
adds xD, xD, #imm
Unfortunately, if xD is xzr then the second instruction doesn't exist and
wouldn't do what was needed if it did. Instead we can use a temporary register
for all but the last addition.
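To illustrate the staging itself, here is a sketch that splits a large offset into parts that each fit an AArch64 ADD/SUB immediate (12 bits, optionally shifted left by 12). The function names are hypothetical, and the choice of destination or temporary register for each stage is left out:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// True if Imm is encodable as a 12-bit immediate, optionally LSL #12.
bool isAddImmediate(uint64_t Imm) {
  return (Imm & ~UINT64_C(0xFFF)) == 0 || (Imm & ~UINT64_C(0xFFF000)) == 0;
}

// Split Offset into a sequence of encodable addends that sum to Offset.
std::vector<uint64_t> splitOffset(uint64_t Offset) {
  std::vector<uint64_t> Parts;
  while (Offset != 0) {
    uint64_t Part = Offset;
    if (Offset > 0xFFF) // peel off a shifted chunk first
      Part = std::min<uint64_t>(Offset & ~UINT64_C(0xFFF), 0xFFF000);
    Parts.push_back(Part);
    Offset -= Part;
  }
  return Parts;
}
```

With the fix described above, every part except the last would be added into a temporary register, and only the final addition writes the real destination.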
Summary:
This patch enables the register allocator to spill/fill lists of 2, 3
and 4 SVE vector registers to/from the stack. This is implemented with
pseudo instructions that get expanded to individual LDR_ZXI/STR_ZXI
instructions in AArch64ExpandPseudoInsts.
Patch by Sander de Smalen.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D75988
The set of patterns for unpredicated load/store was incomplete: it only
included non-extending stores. Fill out the remaining patterns for
extending stores, and add the corresponding support to frame offset
lowering.
Differential Revision: https://reviews.llvm.org/D80349
The offsets were wrong. The result is now the same as what the compiler
would generate for a function that spills lr normally.
Differential Revision: https://reviews.llvm.org/D80238
When using reversedInstructionsWithoutDebug to construct a range from a
pair of MachineInstrBundleIterators, the range unexpectedly leaves out an
element. This results in mis-optimization as @mstorsjo points out in
https://reviews.llvm.org/D78157.
The problem is that when we convert a MachineInstrBundleIterator to a
reverse iterator, the result gets incremented:
MachineInstrBundleIterator(++I.getReverse())
The comment there explains that the "resulting iterator will dereference
... to the previous node, which is somewhat unexpected; but converting
the two endpoints in a range will give the same range in reverse". This
makes it hard to understand what reversedInstructionsWithoutDebug will
do: I've removed the helper to prevent similar mistakes in the future.
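std::reverse_iterator follows the same convention, so it can serve as an analogy. The sketch below uses only the standard library (not MachineInstrBundleIterator) to show that a single converted iterator dereferences to the previous element, while converting both endpoints still yields the same range in reverse:

```cpp
#include <cassert>
#include <iterator>
#include <vector>

using Iter = std::vector<int>::const_iterator;
using RevIter = std::reverse_iterator<Iter>;

// Dereferences the reverse iterator built from It. This yields the element
// *before* It, so It must not be the begin iterator.
int derefAfterReversing(Iter It) { return *RevIter(It); }

// Converting both endpoints of [Begin, End) gives the same elements,
// visited in reverse order.
std::vector<int> reverseRange(Iter Begin, Iter End) {
  return std::vector<int>(RevIter(End), RevIter(Begin));
}
```

A helper that converts only one endpoint this way, or pairs a converted endpoint with an unconverted one, ends up one element short, which is the kind of mistake described above.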