This adds t2WhileLoopStartTP, similar to the t2DoLoopStartTP added in
D90591. It keeps a reference to both the tripcount register and the
element count register, so that the ARMLowOverheadLoops pass in the
backend can pick the correct one without having to search for it from
the operand of a VCTP.
Differential Revision: https://reviews.llvm.org/D103236
This patch prevents phi-node-elimination from generating a COPY
operation for the register defined by t2WhileLoopStartLR, as it is a
terminator that defines a value.
This happens because of the presence of phi-nodes in the loop body (the
Preheader of which is the block containing the t2WhileLoopStartLR). If
this is not done, the COPY is generated above/before the terminator
(t2WhileLoopStartLR here), and since it uses the value defined by
t2WhileLoopStartLR, MachineVerifier throws a 'use before define' error.
This essentially adds on to the change in differential D91887/D97729.
Differential Revision: https://reviews.llvm.org/D100376
Recently we improved the lowering of low overhead loops and tail
predicated loops, but concentrated first on the DLS do style loops. This
extends those improvements over to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little as the instructions are terminators that produce a value
- something that needs to be treated carefully.
Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:
%wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
%wls0 = extractvalue { i32, i1 } %wls, 0
%wls1 = extractvalue { i32, i1 } %wls, 1
br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
%lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
..
%iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
%cmp = icmp ne i32 %iv.next, 0
br i1 %cmp, label %loop, label %loop.exit
The llvm.test.start.loop.iterations need to be lowered through ISel
lowering as a pair of WLS and WLSSETUP nodes, which each get converted
to t2WhileLoopSetup and t2WhileLoopStart Pseudos. This helps prevent
t2WhileLoopStart from being a terminator that produces a value,
something difficult to control at that stage in the pipeline. Instead
the t2WhileLoopSetup produces the value of LR (essentially acting as a
lr = subs rn, 0), t2WhileLoopStart consumes that lr value (the Bcc).
These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. Otherwise we revert the loop
to prevent them from progressing further in the pipeline. The
t2WhileLoopStartLR is a single instruction that takes a GPR and produces
LR, similar to the WLS instruction.
%1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
t2B %bb.1
...
bb.2.loop:
%2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
...
%3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
t2B %bb.3
The t2WhileLoopStartLR can then be treated similar to the other low
overhead loop pseudos, eventually being lowered to a WLS providing the
branches are within range.
Differential Revision: https://reviews.llvm.org/D97729
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.
This is a recommit of 3b34b06fc5 with correctness fixes and extra
tests.
Differential Revision: https://reviews.llvm.org/D95885
Currently the load/store optimizer will only fold in increments of the
same size as the load/store. This patch expands that to any legal
immediate for the post-inc instruction.
Differential Revision: https://reviews.llvm.org/D95885
This patch handles cases where we have to save/restore the link register
into the stack and and load/store instruction which use the stack are
part of the outlined region. It checks that there will be no overflow
introduced by the new offset and fixup these instructions accordingly.
Differential Revision: https://reviews.llvm.org/D92934
This treats low overhead loop branches the same as jump tables and
indirect branches in analyzeBranch - they cannot be analyzed but the
direct branches on the end of the block may be removed. This helps
remove the unnecessary branches earlier, which can help produce better
codegen (and change block layout in a number of cases).
Differential Revision: https://reviews.llvm.org/D94392
Adds ARMBankConflictHazardRecognizer. This hazard recognizer
looks for a few situations where the same base pointer is used and
then checks whether the offsets lead to a bank conflict. Two
parameters are also added to permit overriding of the target
assumptions:
arm-data-bank-mask=<int> - Mask of bits which are to be checked for
conflicts. If all these bits are equal in the offsets, there is a
conflict.
arm-assume-itcm-bankconflict=<bool> - Assume that there will be bank
conflicts on any loads to a constant pool.
This hazard recognizer is enabled for Cortex-M7, where the Technical
Reference Manual states that there are two DTCM banks banked using bit
2 and one ITCM bank.
Differential Revision: https://reviews.llvm.org/D93054
As a linker is allowed to clobber r12 on function calls, the code
transformation that hardens indirect calls is not correct in case a
linker does so. Similarly, the transformation is not correct when
register lr is used.
This patch makes sure that r12 or lr are not used for indirect calls
when harden-sls-blr is enabled.
Differential Revision: https://reviews.llvm.org/D92469
To make sure that no barrier gets placed on the architectural execution
path, each indirect call calling the function in register rN, it gets
transformed to a direct call to __llvm_slsblr_thunk_mode_rN. mode is
either arm or thumb, depending on the mode of where the indirect call
happens.
The llvm_slsblr_thunk_mode_rN thunk contains:
bx rN
<speculation barrier>
Therefore, the indirect call gets split into 2; one direct call and one
indirect jump.
This transformation results in not inserting a speculation barrier on
the architectural execution path.
The mitigation is off by default and can be enabled by the
harden-sls-blr subtarget feature.
As a linker is allowed to clobber r12 on function calls, the
above code transformation is not correct in case a linker does so.
Similarly, the transformation is not correct when register lr is used.
Avoiding r12/lr being used is done in a follow-on patch to make
reviewing this code easier.
Differential Revision: https://reviews.llvm.org/D92468
The only non-trivial consideration in this patch is that the formation
of TBB/TBH instructions, which is done in the constant island pass, does
not understand the speculation barriers inserted by the SLSHardening
pass. As such, when harden-sls-retbr is enabled for a function, the
formation of TBB/TBH instructions in the constant island pass is
disabled.
Differential Revision: https://reviews.llvm.org/D92396
Some processors may speculatively execute the instructions immediately
following indirect control flow, such as returns, indirect jumps and
indirect function calls.
To avoid a potential miss-speculatively executed gadget after these
instructions leaking secrets through side channels, this pass places a
speculation barrier immediately after every indirect control flow where
control flow doesn't return to the next instruction, such as returns and
indirect jumps, but not indirect function calls.
Hardening of indirect function calls will be done in a later,
independent patch.
This patch is implementing the same functionality as the AArch64 counter
part implemented in https://reviews.llvm.org/D81400.
For AArch64, returns and indirect jumps only occur on RET and BR
instructions and hence the function attribute to control the hardening
is called "harden-sls-retbr" there. On AArch32, there is a much wider
variety of instructions that can trigger an indirect unconditional
control flow change. I've decided to stick with the name
"harden-sls-retbr" as introduced for the corresponding AArch64
mitigation.
This patch implements this for ARM mode. A future patch will extend this
to also support Thumb mode.
The inserted barriers are never on the correct, architectural execution
path, and therefore performance overhead of this is expected to be low.
To ensure these barriers are never on an architecturally executed path,
when the harden-sls-retbr function attribute is present, indirect
control flow is never conditionalized/predicated.
On targets that implement that Armv8.0-SB Speculation Barrier extension,
a single SB instruction is emitted that acts as a speculation barrier.
On other targets, a DSB SYS followed by a ISB is emitted to act as a
speculation barrier.
These speculation barriers are implemented as pseudo instructions to
avoid later passes to analyze them and potentially remove them.
The mitigation is off by default and can be enabled by the
harden-sls-retbr subtarget feature.
Differential Revision: https://reviews.llvm.org/D92395
Although this was something that I was hoping we would not have to do,
this patch makes t2DoLoopStartTP a terminator in order to keep it at the
end of it's block, so not allowing extra MVE instruction between it and
the end. With t2DoLoopStartTP's also starting tail predication regions,
it also marks them as having side effects. The t2DoLoopStart is still
not a terminator, giving it the extra scheduling freedom that can be
helpful, but now that we have a TP version they can be treated
differently.
Differential Revision: https://reviews.llvm.org/D91887
We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
registry allocation to make sure this does not fail.
Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else like calls that would
break it's ability to use LR. For that, this adds a
isUnspillableTerminator to opt in the new behaviour.
There is a chance that this could cause problems, and so I have added an
escape option incase. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.
This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.
Differential Revision: https://reviews.llvm.org/D91358
This adds code to revert low overhead loops with calls in them before
register allocation. Ideally we would not create low overhead loops with
calls in them to begin with, but that can be difficult to always get
correct. If we want to try and glue together t2LoopDec and t2LoopEnd
into a single instruction, we need to ensure that no instructions use LR
in the loop. (Technically the final code can be better too, as it
doesn't need to use the same registers but that has not been optimized
for here, as reverting loops with calls is expected to be very rare).
It also adds a MVETailPredUtils.h header to share the revert code
between different passes, and provides a place to expand upon, with
RevertLoopWithCall becoming a place to perform other low overhead loop
alterations like removing copies or combining LoopDec and End into a
single instruction.
Differential Revision: https://reviews.llvm.org/D91273
This introduces a new pseudo instruction, almost identical to a
t2DoLoopStart but taking 2 parameters - the original loop iteration
count needed for a low overhead loop, plus the VCTP element count needed
for a DLSTP instruction setting up a tail predicated loop. The idea is
that the instruction holds both values and the backend
ARMLowOverheadLoops pass can pick between the two, depending on whether
it creates a tail predicated loop or falls back to a low overhead loop.
To do that there needs to be something that converts a t2DoLoopStart to
a t2DoLoopStartTP, for which this patch repurposes the
MVEVPTOptimisationsPass as a "tail predication and vpt optimisation"
pass. The extra operand for the t2DoLoopStartTP is chosen based on the
operands of VCTP's in the loop, and the instruction is moved as late in
the block as possible to attempt to increase the likelihood of making
tail predicated loops.
Differential Revision: https://reviews.llvm.org/D90591
This patch make the outliner emit CFI instructions in a few more
places:
* after LR is restored, but before the return in an outlined
function
* around save/restore of LR to/from a register at calls to outlined
functions
* around save/restore of LR to/from the stack at calls to outlined
functions
The latter two only when the function does NOT spill LR. If the
function spills LR, then outliner generated saves/restores around
calls are not considered interesting for unwinding the frame.
Differential Revision: https://reviews.llvm.org/D89483
Some instructions may be removable through processes such as IfConversion,
however DefinesPredicate can not be made aware of when this should be considered.
This parameter allows DefinesPredicate to distinguish these removable instructions
on a per-call basis, allowing for more fine-grained control from processes like
ifConversion.
Renames DefinesPredicate to ClobbersPredicate, to better reflect it's purpose
Differential Revision: https://reviews.llvm.org/D88494
We really want to try and avoid spilling P0, which can be difficult
since there's only one register, so try to rematerialize any VCTP
instructions.
Differential Revision: https://reviews.llvm.org/D87280
Enable default outlining when the function has the minsize attribute
and we're targeting an m-class core.
Differential Revision: https://reviews.llvm.org/D82951
Use the stack to save and restore the link register when there is no
available register to do it.
Differential Revision: https://reviews.llvm.org/D76069
Widen the scope of memory operations that are allowed to be tail predicated
to include gathers and scatters, such that loops that are auto-vectorized
with the option -enable-arm-maskedgatscat (and actually end up containing
an MVE gather or scatter) can be tail predicated.
Differential Revision: https://reviews.llvm.org/D85138
This adds sign/zero extending scalar loads/stores to the MVE
instructions added in D77813, allowing us to create up more post-inc
instructions. These are comparatively simple, compared to LDR/STR (which
may be better turned into an LDRD/LDM), but still require some additions
over MVE instructions. Because there are i12 and i8 variants of the
offset loads/stores dealing with different signs, we may need to convert
an i12 address to a i8 negative instruction. t2LDRBi12 can also be
shrunk to a tLDRi under the right conditions, so we need to be careful
with codesize too.
Differential Revision: https://reviews.llvm.org/D78625
While validating live-out values, record instructions that look like
a reduction. This will comprise of a vector op (for now only vadd),
a vorr (vmov) which store the previous value of vadd and then a vpsel
in the exit block which is predicated upon a vctp. This vctp will
combine the last two iterations using the vmov and vadd into a vector
which can then be consumed by a vaddv.
Once we have determined that it's safe to perform tail-predication,
we need to change this sequence of instructions so that the
predication doesn't produce incorrect code. This involves changing
the register allocation of the vadd so it updates itself and the
predication on the final iteration will not update the falsely
predicated lanes. This mimics what the vmov, vctp and vpsel do and
so we then don't need any of those instructions.
Differential Revision: https://reviews.llvm.org/D75533
The ARM ARM considers p10/p11 valid arguments for MCR/MRC instructions.
MRC instructions with p10 arguments are also used in kernel code which
is shared for different architectures. Turn usage of p10/p11 to warnings
for ARMv7/ARMv8-M.
Reviewers: rengolin, olista01, t.p.northover, efriedma, psmith, simon_tatham
Reviewed By: simon_tatham
Subscribers: hiraditya, danielkiss, jcai19, tpimh, nickdesaulniers, peter.smith, javed.absar, kristof.beyls, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59733
Outline chunks of code which need to save and restore the link register
when a spare register can be used to it.
Differential Revision: https://reviews.llvm.org/D80127
Enables Machine Outlining for ARM and Thumb2 modes. This is the first
patch of the series which adds all the basic logic for the support, and
only handles tail-calls and thunks.
The outliner can be turned on by using clang -moutline option or -mllvm
-enable-machine-outliner one (like AArch64).
Differential Revision: https://reviews.llvm.org/D76066
getARMVPTBlockMask was an outdated function that only handled basic
block masks: T, TT, TTT and TTTT. This worked fine before the MVE
VPT Block Insertion Pass improvements as it was the only kind of
masks that it could generate, but now it can generate more complex
masks that uses E predicates, so it's dangerous to use that function
to calculate VPT/VPST block masks.
I replaced it with 2 different functions:
- expandPredBlockMask, in ARMBaseInfo. This adds an "E" or "T" at
the end of an existing PredBlockMask.
- recomputeVPTBlockMask, in Thumb2InstrInfo. This takes an iterator
to a VPT/VPST instruction and recomputes its block mask by looking
at the predicated instructions that follows it. This should be
used to recompute a block mask after removing/adding a predicated
instruction to the block.
The expandPredBlockMask function is pretty much imported from the MVE
VPT Blocks pass.
I had to change the ARMLowOverheadLoops and MVEVPTBlocks passes as well
so they could use these new functions.
Differential Revision: https://reviews.llvm.org/D78201
Enables the MVEGatherScatterLowering pass to build
pre-incrementing gathers. Incrementing writeback gathers
are built when it is possible to replace the loop increment
instruction.
Differential Revision: https://reviews.llvm.org/D76786
This adds some extra processing into the Pre-RA ARM load/store optimizer
to detect and merge MVE loads/stores and adds of the same base. This we
don't always turn into a post-inc during ISel, and due to the nature of
it being a graph we don't always know an order to use for the nodes, not
knowing which nodes to make post-inc and which to use the new post-inc
of. After ISel, we have an order that we can use to post-inc the
following instructions.
So this looks for a loads/store with a starting offset of 0, and an
add/sub from the same base, plus a number of other loads/stores. We then
do some checks and convert the zero offset load/store into a postinc
variant. Any loads/stores after it have the offset subtracted from their
immediates. For example:
LDR #4 LDR #4
LDR #0 LDR_POSTINC #16
LDR #8 LDR #-8
LDR #12 LDR #-4
ADD #16
It only handles MVE loads/stores at the moment. Normal loads/store will
be added in a followup patch, they just have some extra details to
ensure that we keep generating LDRD/LDM successfully.
Differential Revision: https://reviews.llvm.org/D77813
Summary:
The INLINEASM MIR instructions use immediate operands to encode the values of some operands.
The MachineInstr pretty printer function already handles those operands and prints human readable annotations instead of the immediates. This patch adds similar annotations to the output of the MIRPrinter, however uses the new MIROperandComment feature.
Reviewers: SjoerdMeijer, arsenm, efriedma
Reviewed By: arsenm
Subscribers: qcolombet, sdardis, jvesely, wdng, nhaehnle, hiraditya, jrtc27, atanasyan, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78088
VPTMaskValue was using the "instruction" encoding to represent the masks
(= the same encoding as the one used by the instructions in an object file),
but it is only used to build MCOperands, so it should use the MCOperand
encoding of the masks, which is slightly different.
Differential Revision: https://reviews.llvm.org/D76139
As a narrow stopgap for the assertion failure described in PR45025, add
a describeLoadedValue override to ARMBaseInstrInfo and use it to detect
copies in which the forwarding reg is a super/sub reg of the copy
destination. For the moment this is unsupported.
Several follow ups are possible:
1) Handle VORRq. At the moment, we do not, because isCopyInstrImpl
returns early when !MI.isMoveReg().
2) In the case where forwarding reg is a super-reg of the copy
destination, we should be able to describe the forwarding reg as a
subreg within the copy destination. I'm not 100% sure about this, but
it looks like that's what's done in AArch64InstrInfo.
3) In the case where the forwarding reg is a sub-reg of the copy
destination, maybe we could describe the forwarding reg using the
copy destinaion and a DW_OP_LLVM_fragment (I guess this should be
possible after D75036).
https://bugs.llvm.org/show_bug.cgi?id=45025
rdar://59772698
Differential Revision: https://reviews.llvm.org/D75273
This adds infrastructure to print and parse MIR MachineOperand comments.
The motivation for the ARM backend is to print condition code names instead of
magic constants that are difficult to read (for human beings). For example,
instead of this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14, $noreg
t2Bcc %bb.4, 0, killed $cpsr
we now print this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14 /* CC::always */, $noreg
t2Bcc %bb.4, 0 /* CC:eq */, killed $cpsr
This shows that MachineOperand comments are enclosed between /* and */. In this
example, the EOR instruction is not conditionally executed (i.e. it is "always
executed"), which is encoded by the 14 immediate machine operand. Thus, now
this machine operand has /* CC::always */ as a comment. The 0 on the next
conditional branch instruction represents the equal condition code, thus now
this operand has /* CC:eq */ as a comment.
As it is a comment, the MI lexer/parser completely ignores it. The benefit is
that this keeps the change in the lexer extremely minimal and no target
specific parsing needs to be done. The changes on the MIPrinter side are also
minimal, as there is only one target hooks that is used to create the machine
operand comments.
Differential Revision: https://reviews.llvm.org/D74306
Introduce a method to walk through use-def chains to decide whether
it's possible to remove a given instruction and its users. These
instructions are then stored in a set until the end of the transform
when they're erased. This is now used to perform checks on the
iteration count (LoopDec chain), element count (VCTP chain) and the
possibly redundant iteration count.
As well as being able to remove chains of instructions, we know also
check that the sub feeding the vctp is producing the expected value.
Differential Revision: https://reviews.llvm.org/D71837
In ARMLowOverheadLoops.cpp, MVETailPredication.cpp, and MVEVPTBlock.cpp we have
quite a few helper functions all looking at the opcodes of MVE instructions.
This moves all these utility functions to ARMBaseInstrInfo.
Diferential Revision: https://reviews.llvm.org/D71426
Summary:
Currently the describeLoadedValue() hook is assumed to describe the
value of the instruction's first explicit define. The hook will not be
called for instructions with more than one explicit define.
This commit adds a register parameter to the describeLoadedValue() hook,
and invokes the hook for all registers in the worklist.
This will allow us to for example describe instructions which produce
more than two parameters' values; e.g. Hexagon's various combine
instructions.
This also fixes situations in our downstream target where we may pass
smaller parameters in the high part of a register. If such a parameter's
value is produced by a larger copy instruction, we can't describe the
call site value using the super-register, and we instead need to know
which sub-register that should be used.
This also allows us to handle cases like this:
$ebx = [...]
$rdi = MOVSX64rr32 $ebx
$esi = MOV32rr $edi
CALL64pcrel32 @call
The hook will first be invoked for the MOV32rr instruction, which will
say that @call's second parameter (passed in $esi) is described by $edi.
As $edi is not preserved it will be added to the worklist. When we get
to the MOVSX64rr32 instruction, we need to describe two values; the
sign-extended value of $ebx -> $rdi for the first parameter, and $ebx ->
$edi for the second parameter, which is now possible.
This commit modifies the dbgcall-site-lea-interpretation.mir test case.
In the test case, the values of some 32-bit parameters were produced
with LEA64r. Perhaps we can in general cases handle such by emitting
expressions that AND out the lower 32-bits, but I have not been able to
land in a case where a LEA64r is used for a 32-bit parameter instead of
LEA64_32 from C code.
I have not found a case where it would be useful to describe parameters
using implicit defines, so in this patch the hook is still only invoked
for explicit defines of forwarding registers.
Reviewers: djtodoro, NikolaPrica, aprantl, vsk
Reviewed By: djtodoro, vsk
Subscribers: ormris, hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D70431
Currently the describeLoadedValue() hook is assumed to describe the
value of the instruction's first explicit define. The hook will not be
called for instructions with more than one explicit define.
This commit adds a register parameter to the describeLoadedValue() hook,
and invokes the hook for all registers in the worklist.
This will allow us to for example describe instructions which produce
more than two parameters' values; e.g. Hexagon's various combine
instructions.
This also fixes a case in our downstream target where we may pass
smaller parameters in the high part of a register. If such a parameter's
value is produced by a larger copy instruction, we can't describe the
call site value using the super-register, and we instead need to know
which sub-register that should be used.
This also allows us to handle cases like this:
$ebx = [...]
$rdi = MOVSX64rr32 $ebx
$esi = MOV32rr $edi
CALL64pcrel32 @call
The hook will first be invoked for the MOV32rr instruction, which will
say that @call's second parameter (passed in $esi) is described by $edi.
As $edi is not preserved it will be added to the worklist. When we get
to the MOVSX64rr32 instruction, we need to describe two values; the
sign-extended value of $ebx -> $rdi for the first parameter, and $ebx ->
$edi for the second parameter, which is now possible.
This commit modifies the dbgcall-site-lea-interpretation.mir test case.
In the test case, the values of some 32-bit parameters were produced
with LEA64r. Perhaps we can in general cases handle such by emitting
expressions that AND out the lower 32-bits, but I have not been able to
land in a case where a LEA64r is used for a 32-bit parameter instead of
LEA64_32 from C code.
I have not found a case where it would be useful to describe parameters
using implicit defines, so in this patch the hook is still only invoked
for explicit defines of forwarding registers.