This adds cost modelling for the inloop vectorization added in
745bf6cf44. Up until now they have been modelled as the original
underlying instruction, usually an add. This happens to works OK for MVE
with instructions that are reducing into the same type as they are
working on. But MVE's instructions can perform the equivalent of an
extended MLA as a single instruction:
%sa = sext <16 x i8> A to <16 x i32>
%sb = sext <16 x i8> B to <16 x i32>
%m = mul <16 x i32> %sa, %sb
%r = vecreduce.add(%m)
->
R = VMLADAV A, B
There are other instructions for performing add reductions of
v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64
(VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV).
The i64 are particularly interesting as there are no native i64 add/mul
instructions, leading to the i64 add and mul naturally getting very
high costs.
Also worth mentioning, under NEON there is the concept of a sdot/udot
instruction which performs a partial reduction from a v16i8 to a v4i32.
They extend and mul/sum the first four elements from the inputs into the
first element of the output, repeating for each of the four output
lanes. They could possibly be represented in the same way as above in
llvm, so long as a vecreduce.add could perform a partial reduction. The
vectorizer would then produce a combination of in and outer loop
reductions to efficiently use the sdot and udot instructions. Although
this patch does not do that yet, it does suggest that separating the
input reduction type from the produced result type is a useful concept
to model. It also shows that a MLA reduction as a single instruction is
fairly common.
This patch attempt to improve the costmodelling of in-loop reductions
by:
- Adding some pattern matching in the loop vectorizer cost model to
match extended reduction patterns that are optionally extended and/or
MLA patterns. This marks the cost of the reduction instruction correctly
and the sext/zext/mul leading up to it as free, which is otherwise
difficult to tell and may get a very high cost. (In the long run this
can hopefully be replaced by vplan producing a single node and costing
it correctly, but that is not yet something that vplan can do).
- getExtendedAddReductionCost is added to query the cost of these
extended reduction patterns.
- Expanded the ARM costs to account for these expanded sizes, which is a
fairly simple change in itself.
- Some minor alterations to allow inloop reduction larger than the highest
vector width and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer
for it to efficiently detect which instructions are reductions in the
cost model.
- The tests have some updates to show what I believe is optimal
vectorization and where we are now.
Put together this can greatly improve performance for reduction loop
under MVE.
Differential Revision: https://reviews.llvm.org/D93476
It turns out the vectorizer calls the getIntrinsicInstrCost functions
with a scalar return type and vector VF. This updates the costmodel to
handle that, still producing the correct vector costs.
A vectorizer test is added to show it vectorizing at the correct factor
again.
We have no lowering for VSELECT vXi1, vXi1, vXi1, so mark them as
expanded to turn them into a series of logical operations.
Differential Revision: https://reviews.llvm.org/D94946
This adds some basic MVE sadd_sat/ssub_sat/uadd_sat/usub_sat costs,
based on when the instruction is legal. With smaller than legal types
that are promoted we generate shr(qadd(shl, shl)), so the cost is 4
appropriately.
Differential Revision: https://reviews.llvm.org/D94958
This patch handles cases where we have to save/restore the link register
into the stack and and load/store instruction which use the stack are
part of the outlined region. It checks that there will be no overflow
introduced by the new offset and fixup these instructions accordingly.
Differential Revision: https://reviews.llvm.org/D92934
It turns our that the BranchFolder and IfCvt does not like unanalyzable
branches that fall-through. This means that removing the unconditional
branches from the end of tail predicated instruction can run into
asserts and verifier issues.
This effectively reverts 372eb2bbb6, but
adds handling to t2DoLoopEndDec which are not branches, so can be safely
skipped.
If the previous block in a function does not fallthough, adding nop's to
align it will never be executed. This means we can freely (except for
codesize) align more branches. This happens in constantislandspass (as
it cannot happen later) and only happens at aggressive optimization
levels as it does increase codesize.
Differential Revision: https://reviews.llvm.org/D94394
This treats low overhead loop branches the same as jump tables and
indirect branches in analyzeBranch - they cannot be analyzed but the
direct branches on the end of the block may be removed. This helps
remove the unnecessary branches earlier, which can help produce better
codegen (and change block layout in a number of cases).
Differential Revision: https://reviews.llvm.org/D94392
The TripCount for a predicated vector loop body will be
ceil(ElementCount/Width). This alters the conversion of an
active.lane.mask to a VCPT intrinsics to match.
Differential Revision: https://reviews.llvm.org/D94608
Not all machine loops will have a predecessor. so the pass needs to
check it before continuing.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D94780
For the ARM hard-float calling convention, calls to variadic functions
need to be treated diffrently, even if only the fixed arguments are
provided.
This fixes GCC-C-execute-pr68390 in the test-suite, which is failing on
the ARM GlobaISel bot.
Blocks can be laid out such that a t2WhileLoopStart branches backwards. This is forbidden by the architecture and so it fails to be converted into a low-overhead loop. This new pass checks for these cases and moves the target block, fixing any fall-through that would then be broken.
Differential Revision: https://reviews.llvm.org/D92385
The isVMOVNOriginalMask was previously only checking for two input
shuffles that could be better expanded as vmovn nodes. This expands that
to single input shuffles that will later be legalized to multiple
vectors.
Differential Revision: https://reviews.llvm.org/D94189
This adds uses for locals introduced for new debug messages for the load store optimizer. Those locals are only used on debug statements and otherwise create unused variable warnings.
Differential Revision: https://reviews.llvm.org/D94398
We did not have specific costs for larger than legal truncates that were
not otherwise cheap (where they were next to stores, for example). As
MVE does not have a dedicated instruction for them (and we do not use
loads/stores yet), they should be expensive as they get expanded to a
series of lane moves.
Differential Revision: https://reviews.llvm.org/D94260
The ISel patterns we have for truncating to i1's under MVE do not seem
to be correct. Instead custom lower to icmp(ne, and(x, 1), 0).
Differential Revision: https://reviews.llvm.org/D94226
Same as a9b6440edd, use zanyext to treat any_extends as zero extends
during lowering to create addw/addl/subw/subl nodes.
Differential Revision: https://reviews.llvm.org/D93835
Similar to 78d8a821e2 but for ARM, this handles any_extend whilst
creating MULL nodes, treating them as zextends.
Differential Revision: https://reviews.llvm.org/D93834
If the return values can't be lowered to registers
SelectionDAG performs the sret demotion. This patch
contains the basic implementation for the same in
the GlobalISel pipeline.
Furthermore, targets should bring relevant changes
during lowerFormalArguments, lowerReturn and
lowerCall to make use of this feature.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D92953
Current implementation assumes that, each MachineConstantPoolValue takes
up sizeof(MachineConstantPoolValue::Ty) bytes. For PowerPC, we want to
lump all the constants with the same type as one MachineConstantPoolValue
to save the cost that calculate the TOC entry for each const. So, we need
to extend the MachineConstantPoolValue that break this assumption.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D89108
The lowering of a <4 x i16> or <4 x i8> vecreduce.add into an i64 would
previously be expanded, due to the i64 not being legal. This patch
adjusts our reduction matchers, making it produce a VADDLV(sext A to
v4i32) instead.
Differential Revision: https://reviews.llvm.org/D93622
This patch upstreams support for the Armv8-a Cortex-A78C
processor for AArch64 and ARM.
In detail:
Adding cortex-a78c as cpu option for aarch64 and arm targets in clang
Adding Cortex-A78C CPU name and ProcessorModel in llvm
Details of the CPU can be found here:
https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a78c
Adds ARMBankConflictHazardRecognizer. This hazard recognizer
looks for a few situations where the same base pointer is used and
then checks whether the offsets lead to a bank conflict. Two
parameters are also added to permit overriding of the target
assumptions:
arm-data-bank-mask=<int> - Mask of bits which are to be checked for
conflicts. If all these bits are equal in the offsets, there is a
conflict.
arm-assume-itcm-bankconflict=<bool> - Assume that there will be bank
conflicts on any loads to a constant pool.
This hazard recognizer is enabled for Cortex-M7, where the Technical
Reference Manual states that there are two DTCM banks banked using bit
2 and one ITCM bank.
Differential Revision: https://reviews.llvm.org/D93054
CanBeUnnamed is rarely false. Splitting to a createNamedTempSymbol makes the
intention clearer and matches the direction of reverted r240130 (to drop the
unneeded parameters).
No behavior change.
As a linker is allowed to clobber r12 on function calls, the code
transformation that hardens indirect calls is not correct in case a
linker does so. Similarly, the transformation is not correct when
register lr is used.
This patch makes sure that r12 or lr are not used for indirect calls
when harden-sls-blr is enabled.
Differential Revision: https://reviews.llvm.org/D92469
To make sure that no barrier gets placed on the architectural execution
path, each indirect call calling the function in register rN, it gets
transformed to a direct call to __llvm_slsblr_thunk_mode_rN. mode is
either arm or thumb, depending on the mode of where the indirect call
happens.
The llvm_slsblr_thunk_mode_rN thunk contains:
bx rN
<speculation barrier>
Therefore, the indirect call gets split into 2; one direct call and one
indirect jump.
This transformation results in not inserting a speculation barrier on
the architectural execution path.
The mitigation is off by default and can be enabled by the
harden-sls-blr subtarget feature.
As a linker is allowed to clobber r12 on function calls, the
above code transformation is not correct in case a linker does so.
Similarly, the transformation is not correct when register lr is used.
Avoiding r12/lr being used is done in a follow-on patch to make
reviewing this code easier.
Differential Revision: https://reviews.llvm.org/D92468
The only non-trivial consideration in this patch is that the formation
of TBB/TBH instructions, which is done in the constant island pass, does
not understand the speculation barriers inserted by the SLSHardening
pass. As such, when harden-sls-retbr is enabled for a function, the
formation of TBB/TBH instructions in the constant island pass is
disabled.
Differential Revision: https://reviews.llvm.org/D92396
Some processors may speculatively execute the instructions immediately
following indirect control flow, such as returns, indirect jumps and
indirect function calls.
To avoid a potential miss-speculatively executed gadget after these
instructions leaking secrets through side channels, this pass places a
speculation barrier immediately after every indirect control flow where
control flow doesn't return to the next instruction, such as returns and
indirect jumps, but not indirect function calls.
Hardening of indirect function calls will be done in a later,
independent patch.
This patch is implementing the same functionality as the AArch64 counter
part implemented in https://reviews.llvm.org/D81400.
For AArch64, returns and indirect jumps only occur on RET and BR
instructions and hence the function attribute to control the hardening
is called "harden-sls-retbr" there. On AArch32, there is a much wider
variety of instructions that can trigger an indirect unconditional
control flow change. I've decided to stick with the name
"harden-sls-retbr" as introduced for the corresponding AArch64
mitigation.
This patch implements this for ARM mode. A future patch will extend this
to also support Thumb mode.
The inserted barriers are never on the correct, architectural execution
path, and therefore performance overhead of this is expected to be low.
To ensure these barriers are never on an architecturally executed path,
when the harden-sls-retbr function attribute is present, indirect
control flow is never conditionalized/predicated.
On targets that implement that Armv8.0-SB Speculation Barrier extension,
a single SB instruction is emitted that acts as a speculation barrier.
On other targets, a DSB SYS followed by a ISB is emitted to act as a
speculation barrier.
These speculation barriers are implemented as pseudo instructions to
avoid later passes to analyze them and potentially remove them.
The mitigation is off by default and can be enabled by the
harden-sls-retbr subtarget feature.
Differential Revision: https://reviews.llvm.org/D92395
MVE has a dual lane vector move instruction, capable of moving two
general purpose registers into lanes of a vector register. They look
like one of:
vmov q0[2], q0[0], r2, r0
vmov q0[3], q0[1], r3, r1
They only accept these lane indices though (and only insert into an
i32), either moving lanes 1 and 3, or 0 and 2.
This patch adds some tablegen patterns for them, selecting from vector
inserts elements. Because the insert_elements are know to be
canonicalized to ascending order there are several patterns that we need
to select. These lane indices are:
3 2 1 0 -> vmovqrr 31; vmovqrr 20
3 2 1 -> vmovqrr 31; vmov 2
3 1 -> vmovqrr 31
2 1 0 -> vmovqrr 20; vmov 1
2 0 -> vmovqrr 20
With the top one being the most common. All other potential patterns of
lane indices will be matched by a combination of these and the
individual vmov pattern already present. This does mean that we are
selecting several machine instructions at once due to the need to
re-arrange the inserts, but in this case there is nothing else that will
attempt to match an insert_vector_elt node.
This is a recommit of 6cc3d80a84 after
fixing the backward instruction definitions.
This extends the command-line support for the 'armv8.7-a' architecture
name to the ARM target.
Based on a patch written by Momchil Velikov.
Reviewed By: ostannard
Differential Revision: https://reviews.llvm.org/D93231
This introduces support for the v8.7-A architecture through a new
subtarget feature called "v8.7a". It adds two new "WFET" and "WFIT"
instructions, the nXS limited-TLB-maintenance qualifier for DSB and TLBI
instructions, a new CPU id register, ID_AA64ISAR2_EL1, and the new
HCRX_EL2 system register.
Based on patches written by Simon Tatham and Victor Campos.
Reviewed By: ostannard
Differential Revision: https://reviews.llvm.org/D91772
MVE has a dual lane vector move instruction, capable of moving two
general purpose registers into lanes of a vector register. They look
like one of:
vmov q0[2], q0[0], r2, r0
vmov q0[3], q0[1], r3, r1
They only accept these lane indices though (and only insert into an
i32), either moving lanes 1 and 3, or 0 and 2.
This patch adds some tablegen patterns for them, selecting from vector
inserts elements. Because the insert_elements are know to be
canonicalized to ascending order there are several patterns that we need
to select. These lane indices are:
3 2 1 0 -> vmovqrr 31; vmovqrr 20
3 2 1 -> vmovqrr 31; vmov 2
3 1 -> vmovqrr 31
2 1 0 -> vmovqrr 20; vmov 1
2 0 -> vmovqrr 20
With the top one being the most common. All other potential patterns of
lane indices will be matched by a combination of these and the
individual vmov pattern already present. This does mean that we are
selecting several machine instructions at once due to the need to
re-arrange the inserts, but in this case there is nothing else that will
attempt to match an insert_vector_elt node.
Differential Revision: https://reviews.llvm.org/D92553
A vpt block that just contains either VPST;VCTP or VPT;VCTP, once the
VCTP is removed will become invalid. This fixed the first by removing
the now empty block and bails out for the second, as we have no simple
way of converting a VPT to a VCMP.
Differential Revision: https://reviews.llvm.org/D92369
This adds some basic MVE masked load/store costs, notably changing the
cost of legal loads/stores to the MVECostFactor and the cost of
scalarized instructions to 8*NumElts.
Differential Revision: https://reviews.llvm.org/D86538
Although this was something that I was hoping we would not have to do,
this patch makes t2DoLoopStartTP a terminator in order to keep it at the
end of it's block, so not allowing extra MVE instruction between it and
the end. With t2DoLoopStartTP's also starting tail predication regions,
it also marks them as having side effects. The t2DoLoopStart is still
not a terminator, giving it the extra scheduling freedom that can be
helpful, but now that we have a TP version they can be treated
differently.
Differential Revision: https://reviews.llvm.org/D91887
We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
registry allocation to make sure this does not fail.
Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else like calls that would
break it's ability to use LR. For that, this adds a
isUnspillableTerminator to opt in the new behaviour.
There is a chance that this could cause problems, and so I have added an
escape option incase. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.
This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.
Differential Revision: https://reviews.llvm.org/D91358
The phi created in a low overhead loop gets created with a default
register class it seems. There are then copied inserted between the low
overhead loop pseudo instructions (which produce/consume GPRlr
instructions) and the phi holding the induction. This patch removes
those as a step towards attempting to make t2LoopDec and t2LoopEnd a
single instruction, and appears useful in it's own right as shown in the
tests.
Differential Revision: https://reviews.llvm.org/D91267
This scans through blocks looking for constants used as predicates in
MVE instructions. When two constants are found which are the inverse of
one another, the second can be replaced by a VPNOT of the first,
potentially allowing that not to be folded away into an else predicate
of a vpt block.
Differential Revision: https://reviews.llvm.org/D92470
This folds a not (an xor -1) though a predicate_cast, so that it can be
turned into a VPNOT and potentially be folded away as an else predicate
inside a VPT block.
Differential Revision: https://reviews.llvm.org/D92235
We remove VPNOT instructions in VPT blocks as we create them, turning
them into else predicates. We don't remove the dead instructions until
after the block has been created though. Because the VPNOT will have
killed the vpr register it used, this makes finalizeBundle add internal
flags to the vpr uses of any instructions after the VPNOT. These
incorrect flags can then confuse what is alive and what is not, leading
to machine verifier problems.
This patch removes them earlier instead, before the bundle is finalized
so that kill flags remain valid.
Differential Revision: https://reviews.llvm.org/D92227
This adds code to revert low overhead loops with calls in them before
register allocation. Ideally we would not create low overhead loops with
calls in them to begin with, but that can be difficult to always get
correct. If we want to try and glue together t2LoopDec and t2LoopEnd
into a single instruction, we need to ensure that no instructions use LR
in the loop. (Technically the final code can be better too, as it
doesn't need to use the same registers but that has not been optimized
for here, as reverting loops with calls is expected to be very rare).
It also adds a MVETailPredUtils.h header to share the revert code
between different passes, and provides a place to expand upon, with
RevertLoopWithCall becoming a place to perform other low overhead loop
alterations like removing copies or combining LoopDec and End into a
single instruction.
Differential Revision: https://reviews.llvm.org/D91273
Original commit rG112b3cb6ba49 introduced non-determinism in subtarget
generator due to iteration over DenseMap. New patch fixes this changing
ProcModelMapTy from DenseMap to std::map.
1. Removed #include "...AliasAnalysis.h" in other headers and modules.
2. Cleaned up includes in AliasAnalysis.h.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D92489
We already expand select and select_cc in codegenprepare, but they can
still be generated under some situations. Explicitly mark them as expand
to ensure they are not produced, leading to a failure to select the
nodes.
Differential Revision: https://reviews.llvm.org/D92373
The PREDICATE_CAST node is used to model moves between MVE predicate
registers and gpr's, and eventually become a VMSR p0, rn. When moving to
a predicate only the bottom 16 bits of the sources register are
demanded. This adds a simple fold for that, allowing it to potentially
remove instructions like uxth.
Differential Revision: https://reviews.llvm.org/D92213
Currently, we have some confusion in the codebase regarding the
meaning of LocationSize::unknown(): Some parts (including most of
BasicAA) assume that LocationSize::unknown() only allows accesses
after the base pointer. Some parts (various callers of AA) assume
that LocationSize::unknown() allows accesses both before and after
the base pointer (but within the underlying object).
This patch splits up LocationSize::unknown() into
LocationSize::afterPointer() and LocationSize::beforeOrAfterPointer()
to make this completely unambiguous. I tried my best to determine
which one is appropriate for all the existing uses.
The test changes in cs-cs.ll in particular illustrate a previously
clearly incorrect AA result: We were effectively assuming that
argmemonly functions were only allowed to access their arguments
after the passed pointer, but not before it. I'm pretty sure that
this was not intentional, and it's certainly not specified by
LangRef that way.
Differential Revision: https://reviews.llvm.org/D91649
This strips out a lot of the code that should no longer be needed from
the MVETailPredictionPass, leaving the important part - find active lane
mask instructions and convert them to VCTP operations.
Differential Revision: https://reviews.llvm.org/D91866
Patch fixes scheduling of ALU instructions which modify pc register. Patch
also fixes computation of mutually exclusive predicates for sequences of
variants to be properly expanded
Differential revision: https://reviews.llvm.org/D91266
X86 was already specially marking fma as commutable which allowed
tablegen to autogenerate commuted patterns. This moves it to the target
independent definition and fix up the targets to remove now
unneeded patterns.
Unfortunately, the tests change because the commuted version of
the patterns are generating operands in a different than the
explicit patterns.
Differential Revision: https://reviews.llvm.org/D91842
This checks to see if the loop will likely become a tail predicated loop
and disables wls loop generation if so, as the likelihood for reverting
is currently too high. These should be fairly rare situations anyway due
to the way iterations and element counts are used during lowering. Just
not trying can alter how SCEV's are materialized however, leading to
different codegen.
It also adds a option to disable all while low overhead loops, for
debugging.
Differential Revision: https://reviews.llvm.org/D91663
This converts the intermediate VPR use assertion to a condition in the if-statement to protect against assertion failures in case behaviuour is changed.
This is a follow-up to https://reviews.llvm.org/D90935 and implements the post-approval comments.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D91790
This was already something that was handled by one of the "else"
branches in maybeLoweredToCall, so this patch is an NFC but makes it
explicit and adds a test. We may in the future want to support this
under certain situations but for the moment just don't try and create
low overhead loops with inline asm in them.
Differential Revision: https://reviews.llvm.org/D91257
2c196bbc6b asserted that
`SmallVector::push_back` doesn't invalidate the parameter when it needs
to grow. Do the same for `resize`, `append`, `assign`, `insert`, and
`emplace_back`.
Differential Revision: https://reviews.llvm.org/D91744
This patch factors out the part of printInstruction that gets the
mnemonic string for a given MCInst. This is intended to be used
subsequently for the instruction-mix remarks to display the final
mnemonic (D90040).
Unfortunately making `getMnemonic` available to the AsmPrinter
seems to require making it virtual. Not sure if there's a way around
that with the current layering of the AsmPrinters.
Reviewed By: Paul-C-Anagnostopoulos
Differential Revision: https://reviews.llvm.org/D90039
This patch adds the SchedMachineModel for Cortex-M7. It
also adds test cases for the scheduling information.
Details of the pipeline and descriptions are in comments
in file ARMScheduleM7.td included in this patch.
Differential Revision: https://reviews.llvm.org/D91355
No longer rely on an external tool to build the llvm component layout.
Instead, leverage the existing `add_llvm_componentlibrary` cmake function and
introduce `add_llvm_component_group` to accurately describe component behavior.
These function store extra properties in the created targets. These properties
are processed once all components are defined to resolve library dependencies
and produce the header expected by llvm-config.
Differential Revision: https://reviews.llvm.org/D90848
Of course there was something missing, in this case a check that the def
of the count register we are adding to a t2DoLoopStartTP would dominate
the insertion point.
In the future, when we remove some of these COPY's in between, the
t2DoLoopStartTP will always become the last instruction in the block,
preventing this from happening. In the meantime we need to check they
are created in a sensible order.
Differential Revision: https://reviews.llvm.org/D91287
We have a frequent pattern where we're merging two KnownBits to get the common/shared bits, and I just fell for the gotcha where I tried to use the & operator to merge them........
Previously we used setRegClass to rgpr, which may expand the register
domain if the result was already in a constrained class (tcgpr in the
above PR).
Differential Revision: https://reviews.llvm.org/D91192
This introduces a new pseudo instruction, almost identical to a
t2DoLoopStart but taking 2 parameters - the original loop iteration
count needed for a low overhead loop, plus the VCTP element count needed
for a DLSTP instruction setting up a tail predicated loop. The idea is
that the instruction holds both values and the backend
ARMLowOverheadLoops pass can pick between the two, depending on whether
it creates a tail predicated loop or falls back to a low overhead loop.
To do that there needs to be something that converts a t2DoLoopStart to
a t2DoLoopStartTP, for which this patch repurposes the
MVEVPTOptimisationsPass as a "tail predication and vpt optimisation"
pass. The extra operand for the t2DoLoopStartTP is chosen based on the
operands of VCTP's in the loop, and the instruction is moved as late in
the block as possible to attempt to increase the likelihood of making
tail predicated loops.
Differential Revision: https://reviews.llvm.org/D90591
We already do not unroll loops with vector instructions under MVE, but
that does not include the remainder loops that the vectorizer produces.
These remainder loops will be rarely executed and are not worth
unrolling, as the trip count is likely to be low if they get executed at
all. Luckily they get llvm.loop.isvectorized to make recognizing them
simpler.
We have wanted to do this for a while but hit issues with low overhead
loops being reverted due to difficult registry allocation. With recent
changes that seems to be less of an issue now.
Differential Revision: https://reviews.llvm.org/D90055
This hints the operand of a t2DoLoopStart towards using LR, which can
help make it more likely to become t2DLS lr, lr. This makes it easier to
move if needed (as the input is the same as the output), or potentially
remove entirely.
The hint is added after others (from COPY's etc) which still take
precedence. It needed to find a place to add the hint, which currently
uses the post isel custom inserter.
Differential Revision: https://reviews.llvm.org/D89883
This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR
This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.
This is a fairly simple change in itself, but leads to a number of other
required alterations.
- The hardware loop pass, if UsePhi is set, now generates loops of the
form:
%start = llvm.start.loop.iterations(%N)
loop:
%p = phi [%start], [%dec]
%dec = llvm.loop.decrement.reg(%p, 1)
%c = icmp ne %dec, 0
br %c, loop, exit
- For this a new llvm.start.loop.iterations intrinsic was added, identical
to llvm.set.loop.iterations but produces a value as seen above, gluing
the loop together more through def-use chains.
- This new instrinsic conceptually produces the same output as input,
which is taught to SCEV so that the checks in MVETailPredication are not
affected.
- Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
been left mostly as before. We should now more reliably be able to tell
that the t2DoLoopStart is correct without having to prove it, but
t2WhileLoopStart and tail-predicated loops will remain the same.
- And all the tests have been updated. There are a lot of them!
This patch on it's own might cause more trouble that it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.
Differential Revision: https://reviews.llvm.org/D89881
This was accidentally using the same name for two different variables in
the same line. Whilst it seems to work for some compilers, others have
trouble and it is probably not a fantastic idea.
This patch make the outliner emit CFI instructions in a few more
places:
* after LR is restored, but before the return in an outlined
function
* around save/restore of LR to/from a register at calls to outlined
functions
* around save/restore of LR to/from the stack at calls to outlined
functions
The latter two only when the function does NOT spill LR. If the
function spills LR, then outliner generated saves/restores around
calls are not considered interesting for unwinding the frame.
Differential Revision: https://reviews.llvm.org/D89483
There were cases where a VCMP and a VPST were merged even if the VCMP
didn't have the same defs of its operands as the VPST. This is fixed by
adding RDA checks for the defs. This however gave rise to cases where
the new VPST created would precede the un-merged VCMP and so would fail
a predicate mask assertion since the VCMP wasn't predicated. This was
solved by converting the VCMP to a VPT instead of inserting the new
VPST.
Differential Revision: https://reviews.llvm.org/D90461
When we fold a VCMP into a VPST instruction any kill flags between the
old VCMP position and the new insertion point need to be removed, in
order to keep the verifier happy.
Differential Revision: https://reviews.llvm.org/D90964
Add support for the Neoverse V1 CPU to the ARM and AArch64 backends.
This is based on patches from Mark Murray and Victor Campos.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D90765
This is the cmp/sel sibling to D90692.
Again, the reasoning is: the throughput cost is number of instructions/uops,
so size/blended costs are identical except in special cases (for example,
fdiv or other known-expensive machine instructions or things like MVE that
may require cracking into >1 uops).
We need to check for a valid (non-null) condition type parameter because
SimplifyCFG may pass nullptr for that (and so we will crash multiple
regression tests without that check). I'm not sure if passing nullptr makes
sense, but other code in the cost model does appear to check if that param
is set or not.
Differential Revision: https://reviews.llvm.org/D90781
To accommodate frame layouts that have both fixed and scalable objects
on the stack, describing a stack location or offset using a pointer + uint64_t
is not sufficient. For this reason, we've introduced the StackOffset class,
which models both the fixed- and scalable sized offsets.
The TargetFrameLowering::getFrameIndexReference is made to return a StackOffset,
so that this can be used in other interfaces, such as to eliminate frame indices
in PEI or to emit Debug locations for variables on the stack.
This patch is purely mechanical and doesn't change the behaviour of how
the result of this function is used for fixed-sized offsets. The patch adds
various checks to assert that the offset has no scalable component, as frame
offsets with a scalable component are not yet supported in various places.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D90018
Hook up legalizations for VECREDUCE_SEQ_FMUL. This is following up on the VECREDUCE_SEQ_FADD work from D90247.
Differential Revision: https://reviews.llvm.org/D90644
This is based on the same idea that I am using for the basic model implementation
and what I have partly already done for x86: throughput cost is number of
instructions/uops, so size/blended costs are identical except in special cases
(for example, fdiv or other known-expensive machine instructions or things like
MVE that may require cracking into >1 uop)).
Differential Revision: https://reviews.llvm.org/D90692
If an instruction will be lowered to a call there is no advantage of
using a low overhead loop as the LR register will need to be spilled and
reloaded around the call, and the low overhead will end up being
reverted. This teaches our hardware loop lowering that these memory
intrinsics will be calls under certain situations.
Differential Revision: https://reviews.llvm.org/D90439
The `LiveRegUnits` utility (as well as `LivePhysRegs`) considers
callee-saved registers to be alive at the point after the return
instruction in a block. In the ARM backend, the `LR` register is
classified as callee-saved, which is not really correct (from an ARM
eABI or just common sense point of view). These two conditions cause
the `MachineOutliner` to overestimate the liveness of `LR`, which
results in unnecessary saves/restores of `LR` around calls to outlined
sequences. It also causes the `MachineVerifer` to crash in some
cases, because the save instruction reads a dead `LR`, for example
when the following program:
int h(int, int);
int f(int a, int b, int c, int d) {
a = h(a + 1, b - 1);
b = b + c;
return 1 + (2 * a + b) * (c - d) / (a - b) * (c + d);
}
int g(int a, int b, int c, int d) {
a = h(a - 1, b + 1);
b = b + c;
return 2 + (2 * a + b) * (c - d) / (a - b) * (c + d);
}
is compiled with `-target arm-eabi -march=armv7-m -Oz`.
This patch computes the liveness of `LR` in return blocks only, while
taking into account the few ARM instructions, which read `LR`, but
nevertheless the register is not mentioned (explicitly or implicitly)
in the instruction operands.
Differential Revision: https://reviews.llvm.org/D89189
This reverts the revert commit 408c4408fa.
This version of the patch includes a fix for a crash caused by
treating ICmp/FCmp constant expressions as instructions.
Original message:
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
Patch fixes case when sched class has write and read variants belonging
to different processor models.
Differential revision: https://reviews.llvm.org/D89777
If the elt size is unknown due to it being a pointer, a comparison
against 0 will cause an assert. Make sure the elt size is large enough
before comparing and for the moment just return the scalar cost.
Add Legalization support for VECREDUCE_SEQ_FADD, so that we don't need to depend on ExpandReductionsPass.
Differential Revision: https://reviews.llvm.org/D90247
On some targets, like AArch64, vector selects can be efficiently lowered
if the vector condition is a compare with a supported predicate.
This patch adds a new argument to getCmpSelInstrCost, to indicate the
predicate of the feeding select condition. Note that it is not
sufficient to use the context instruction when querying the cost of a
vector select starting from a scalar one, because the condition of the
vector select could be composed of compares with different predicates.
This change greatly improves modeling the costs of certain
compare/select patterns on AArch64.
I am also planning on putting up patches to make use of the new argument in
SLPVectorizer & LV.
Reviewed By: dmgreen, RKSimon
Differential Revision: https://reviews.llvm.org/D90070
This adds ISel matching for a form of VQDMULH. There are several ir
patterns that we could match to that instruction, this one is for:
min(ashr(mul(sext(a), sext(b)), 7), 127)
Which is what llvm will optimize to once it has removed the max that
usually makes up the min/max saturate pattern, as in this case the
compare will always be false. The additional complication to match i32
patterns (which extend into an i64) is that the min will be a
vselect/setcc, as vmin is not supported for i64 vectors. Tablegen
patterns have also been updated to attempt to reuse the MVE_TwoOpPattern
patterns.
Differential Revision: https://reviews.llvm.org/D90096
Fixes a regression caused by D82439, in which IT blocks were no longer being generated when -Oz is present.
Differential Revision: https://reviews.llvm.org/D88496
This adds a MultiHazardRecognizer and starts to make use of it in the
ARM backend. The idea of the class is to allow multiple independent
hazard recognizers to be added to a single base MultiHazardRecognizer,
allowing them to all work in parallel without requiring them to be
chained into subclasses. They can then be added or not based on cpu or
subtarget features, which will become useful in the ARM backend once
more hazard recognizers are being used for various things.
This also renames ARMHazardRecognizer to ARMHazardRecognizerFPMLx in the
process, to more clearly explain what that recognizer is designed for.
Differential Revision: https://reviews.llvm.org/D72939
Some instructions may be removable through processes such as IfConversion,
however DefinesPredicate can not be made aware of when this should be considered.
This parameter allows DefinesPredicate to distinguish these removable instructions
on a per-call basis, allowing for more fine-grained control from processes like
ifConversion.
Renames DefinesPredicate to ClobbersPredicate, to better reflect it's purpose
Differential Revision: https://reviews.llvm.org/D88494
This reverts commit 38f625d0d1.
This commit contains some holes in its logic and has been causing
issues since it was commited. The idea sounds OK but some cases were not
handled correctly. Instead of trying to fix that up later it is probably
simpler to revert it and work to reimplement it in a more reliable way.
Create the LLVM / CodeView register mappings for the 32-bit ARM Window targets.
Reviewed By: compnerd
Differential Revision: https://reviews.llvm.org/D89622
This adds some basic costs for MVE reductions - currently just costing
the simple legal add vectors as a single MVE instruction. More complex
costing can be added in the future when the framework more readily
allows it.
Differential Revision: https://reviews.llvm.org/D88980
This adds a very basic cost for active_lane_mask under MVE - making the
assumption that they will be free and then apologizing for that in a
comment.
In reality they may either be free (by being nicely folded into a tail
predicated loop), cost the same as a VCTP or be expanded into vdup's,
adds and cmp's. It is difficult to detect the difference from a single
getIntrinsicInstrCost call, so makes the assumption that the vectorizer
is adding them, and only added them where it makes sense.
We may need to change this in the future to better model predicate costs
in the vectorizer, especially at -Os or non-tail predicated loops. The
vectorizer currently does not query the cost of these instructions but
that will change in the future and a zero cost there probably makes the
most sense at the moment.
Differential Revision: https://reviews.llvm.org/D88989
In most of lib/Target we know that we are not dealing with scalable
types so it's perfectly fine to replace TypeSize comparison operators
with their fixed width equivalents, making use of getFixedSize()
and so on.
Differential Revision: https://reviews.llvm.org/D89101
There are a number of places in RDA where we assume the block will not
be empty. This isn't necessarily true for tail predicated loops where we
have removed instructions. This attempt to make the pass more resilient
to empty blocks, not casting pointers to machine instructions where they
would be invalid.
The test contains a case that was previously failing, but recently been
hidden on trunk. It contains an empty block to begin with to show a
similar error.
Differential Revision: https://reviews.llvm.org/D88926
This folds a select_cc or select(set_cc) of a max or min vector reduction with a scalar value into a VMAXV or VMINV.
Differential Revision: https://reviews.llvm.org/D87836
This folds a select_cc or select(set_cc) of a max or min vector reduction with a scalar value into a VMAXV or VMINV.
Differential Revision: https://reviews.llvm.org/D87836
We were not accounting for the pointer offset when splitting a store from
a VMOVDRR node, which could lead to incorrect aliasing info. In this
case it is the fneg via integer arithmetic that gives us a store->load
pair that we started getting wrong.
Differential Revision: https://reviews.llvm.org/D88653
Marks constants of an ICmp instruction as free if it's only user is a select
instruction that is part of a min(max()) pattern. Ensures that in loops, in
particular when loop unrolling is turned on, SSAT will still be correctly generated.
Differential Revision: https://reviews.llvm.org/D88662
Before deciding to insert a [W|D]LSTP, check that defining LR with
the element count won't affect any other instructions that should be
taking the iteration count.
Differential Revision: https://reviews.llvm.org/D88549
This is part of the Propeller framework to do post link code layout optimizations. Please see the RFC here: https://groups.google.com/forum/#!msg/llvm-dev/ef3mKzAdJ7U/1shV64BYBAAJ and the detailed RFC doc here: https://github.com/google/llvm-propeller/blob/plo-dev/Propeller_RFC.pdf
This patch provides exception support for basic block sections by splitting the call-site table into call-site ranges corresponding to different basic block sections. Still all landing pads must reside in the same basic block section (which is guaranteed by the the core basic block section patch D73674 (ExceptionSection) ). Each call-site table will refer to the landing pad fragment by explicitly specifying @LPstart (which is omitted in the normal non-basic-block section case). All these call-site tables will share their action and type tables.
The C++ ABI somehow assumes that no landing pads point directly to LPStart (which works in the normal case since the function begin is never a landing pad), and uses LP.offset = 0 to specify no landing pad. In the case of basic block section where one section contains all the landing pads, the landing pad offset relative to LPStart could actually be zero. Thus, we avoid zero-offset landing pads by inserting a **nop** operation as the first non-CFI instruction in the exception section.
**Background on Exception Handling in C++ ABI**
https://github.com/itanium-cxx-abi/cxx-abi/blob/master/exceptions.pdf
Compiler emits an exception table for every function. When an exception is thrown, the stack unwinding library queries the unwind table (which includes the start and end of each function) to locate the exception table for that function.
The exception table includes a call site table for the function, which is used to guide the exception handling runtime to take the appropriate action upon an exception. Each call site record in this table is structured as follows:
| CallSite | --> Position of the call site (relative to the function entry)
| CallSite length | --> Length of the call site.
| Landing Pad | --> Position of the landing pad (relative to the landing pad fragment’s begin label)
| Action record offset | --> Position of the first action record
The call site records partition a function into different pieces and describe what action must be taken for each callsite. The callsite fields are relative to the start of the function (as captured in the unwind table).
The landing pad entry is a reference into the function and corresponds roughly to the catch block of a try/catch statement. When execution resumes at a landing pad, it receives an exception structure and a selector value corresponding to the type of the exception thrown, and executes similar to a switch-case statement. The landing pad field is relative to the beginning of the procedure fragment which includes all the landing pads (@LPStart). The C++ ABI requires all landing pads to be in the same fragment. Nonetheless, without basic block sections, @LPStart is the same as the function @Start (found in the unwind table) and can be omitted.
The action record offset is an index into the action table which includes information about which exception types are caught.
**C++ Exceptions with Basic Block Sections**
Basic block sections break the contiguity of a function fragment. Therefore, call sites must be specified relative to the beginning of the basic block section. Furthermore, the unwinding library should be able to find the corresponding callsites for each section. To do so, the .cfi_lsda directive for a section must point to the range of call-sites for that section.
This patch introduces a new **CallSiteRange** structure which specifies the range of call-sites which correspond to every section:
`struct CallSiteRange {
// Symbol marking the beginning of the precedure fragment.
MCSymbol *FragmentBeginLabel = nullptr;
// Symbol marking the end of the procedure fragment.
MCSymbol *FragmentEndLabel = nullptr;
// LSDA symbol for this call-site range.
MCSymbol *ExceptionLabel = nullptr;
// Index of the first call-site entry in the call-site table which
// belongs to this range.
size_t CallSiteBeginIdx = 0;
// Index just after the last call-site entry in the call-site table which
// belongs to this range.
size_t CallSiteEndIdx = 0;
// Whether this is the call-site range containing all the landing pads.
bool IsLPRange = false;
};`
With N basic-block-sections, the call-site table is partitioned into N call-site ranges.
Conceptually, we emit the call-site ranges for sections sequentially in the exception table as if each section has its own exception table. In the example below, two sections result in the two call site ranges (denoted by LSDA1 and LSDA2) placed next to each other. However, their call-sites will refer to records in the shared Action Table. We also emit the header fields (@LPStart and CallSite Table Length) for each call site range in order to place the call site ranges in separate LSDAs. We note that with -basic-block-sections, The CallSiteTableLength will not actually represent the length of the call site table, but rather the reference to the action table. Since the only purpose of this field is to locate the action table, correctness is guaranteed.
Finally, every call site range has one @LPStart pointer so the landing pads of each section must all reside in one section (not necessarily the same section). To make this easier, we decide to place all landing pads of the function in one section (hence the `IsLPRange` field in CallSiteRange).
| @LPStart | ---> Landing pad fragment ( LSDA1 points here)
| CallSite Table Length | ---> Used to find the action table.
| CallSites |
| … |
| … |
| @LPStart | ---> Landing pad fragment ( LSDA2 points here)
| CallSite Table Length |
| CallSites |
| … |
| … |
…
…
| Action Table |
| Types Table |
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D73739
Just because we haven't encountered an instruction setting the VPR,
it doesn't mean we can't create a VPT block - the VPR maybe a
live-in.
Differential Revision: https://reviews.llvm.org/D88224
Added patterns to generate an SSAT or USAT with shift for
SSAT/USAT instructions that are matched from IR patterns.
Differential Revision: https://reviews.llvm.org/D88145
We have been running tests/benchmarks downstream with tail-predication enabled
for some time now and this behaves as expected: we are not aware of any
correctness issues, and this performs better across the board than with
tail-predication disabled. Time to flip the switch!
Differential Revision: https://reviews.llvm.org/D88093
This is a reimplementation of the overflow checks for the elementcount,
i.e. the 2nd argument of intrinsic get.active.lane.mask. The element
count is lowered in each iteration of the tail-predicated loop, and
we must prove that this expression doesn't overflow.
Many thanks to Eli Friedman and Sam Parker for all their help with
this work.
Differential Revision: https://reviews.llvm.org/D88086
9d9a11c7be added this check for predicatable instructions between the
D/WLSTP and the loop's start, but it was missing the last instruction in
the block. Change it to use some iterators instead.
Differential Revision: https://reviews.llvm.org/D88354
On failing to find a VCTP in the list of instructions that explicitly
predicate the entry of a VPT block, inspect whether the block is
controlled via VPT which is implicitly predicated due to it's
predicated operand(s).
Differential Revision: https://reviews.llvm.org/D87819
This might be useful for testing. We already have an option -tail-predication
but that controls the MVETailPredication pass. This
-arm-loloops-disable-tail-pred is just for disabling it in the LowoverheadLoops
pass.
Differential Revision: https://reviews.llvm.org/D88212
If the LSTP instruction is inserted with an element count low enough
to immediately predicate some lanes as false, this can have some
unintended effects on any proceeding MVE instructions in the
preheader.
Differential Revision: https://reviews.llvm.org/D88209
Previously, if a floating-point type was legal, but FNEG wasn't legal,
we would use FSUB. Instead, we should use integer ops, to preserve the
semantics. (Alternatively, there's a compiler-rt call we could use, but
there isn't much reason to use that.)
It turns out we actually are still using this obscure codepath in a few
cases: on some targets, we have "legal" floating-point types that don't
actually support any floating-point operations. In particular, ARM and
AArch64 are using this path.
The implementation for SelectionDAG is pretty simple because we can
reuse the infrastructure from FCOPYSIGN.
See also 9a3dc3e, the corresponding change to type legalization.
Also includes a "bonus" change to STRICT_FSUB legalization, so we can
lower a STRICT_FSUB to a float libcall.
Includes the changes to both LegalizeDAG and GlobalISel so we don't have
inconsistent results in the future.
Fixes https://bugs.llvm.org/show_bug.cgi?id=46792 .
Differential Revision: https://reviews.llvm.org/D84287
An existing function Type::getScalarSizeInBits returns a uint64_t
instead of a TypeSize class because the caller is requesting a
scalar size, which cannot be scalable. This patch makes other
similar functions requesting a scalar size consistent with that,
thereby eliminating more than 1000 implicit TypeSize -> uint64_t
casts.
Differential revision: https://reviews.llvm.org/D87889
Changes TTI function getIntImmCostInst to take an additional Instruction parameter,
which enables us to be able to check it is part of a min(max())/max(min()) pattern that will match SSAT.
We can then mark the constant used as free to prevent it being hoisted so SSAT can still be generated.
Required minor changes in some non-ARM backends to allow for the optional parameter to be included.
Differential Revision: https://reviews.llvm.org/D87457
The VPTBlock has been modified to track the 'global' state of the
VPR, as well as the state for each block. Each object now just holds
a list of instructions that makeup the block, while static structures
hold the predicate information. This enables global access for
querying how both a VPT block and individual instructions are
predicated. These changes now allow us, again, to handle more
complicated cases where multiple instructions build a predicate
and/or where the same predicate in used in multiple blocks.
It doesn't, however, get us back to before the tracking was 'fixed'
as some extra logic will be required to properly handle VPT
instructions. Currently a VPT could be effectively predicated because
of it's inputs, but the existing logic will not detect that and so
will refuse to perform the transformation. This can be seen in
remat-vctp.ll test where we still don't perform the transform.
Differential Revision: https://reviews.llvm.org/D87681
Remove the domain from the instructions and create a shouldInspect
helper for LowOverheadLoops which queries it or a vpr operand.
Differential Revision: https://reviews.llvm.org/D87900
security boundary
It was never supported and that part was accidentally omitted when
upstreaming D76518.
Differential Revision: https://reviews.llvm.org/D86478
Change-Id: If6ba9506eb0431c87a1d42a38aa60e47ce263039
This adds lowering for f32 values using the vmov.f16, which zeroes the
top bits whilst setting the lower bits to a pattern. This range of
values does not often come up, except where a f16 constant value has
been converted to a f32.
Differential Revision: https://reviews.llvm.org/D87790
This adds simple constant folding for VMOVrh, to constant fold fp16
constants to integer values. It can help especially with soft calling
conventions, but some of the results are not optimal as we end up
loading using a vldr. This will be improved in a follow up patch.
Differential Revision: https://reviews.llvm.org/D87789
When generating matching tables for GlobalISel, TableGen would output
"::zero_reg" whenever encountering the zero_reg, which in turn would
result in compilation error. This patch fixes that by instead outputting
NoRegister (== 0), which is the same result that TableGen produces when
generating matching tables for ISelDAG.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D86215
This extends the distributing postinc code in load/store optimizer to
also handle the case where there is an existing pre/post inc instruction,
where subsequent instructions can be modified to use the adjusted
offset from the increment. This can save us having to keep the old
register live past the increment instruction.
Differential Revision: https://reviews.llvm.org/D83377
The predicated MVE intrinsics are generated as, for example,
llvm.arm.mve.add.predicated(x, splat(y). p). We need to sink the splat
value back into the loop, like we do for other instructions, so we can
re-select qr variants.
Differential Revision: https://reviews.llvm.org/D87693
Additional sanity checks were added to get.active.lane.mask's second argument,
the loop tripcount/elementcount, in rG635b87511ec3. Like the other (overflow)
checks, skip this if tail-predication is forced.
Differential Revision: https://reviews.llvm.org/D87769
Clear the CurrentPredicate when we find an instruction which would
completely overwrite the VPR. This fix essentially means we're back
to not really being able to handle VPT instructions when tail
predicating.
Differential Revision: https://reviews.llvm.org/D87610
Modify the unit test to inspect all MVE instructions and mark the
load/store/move of vpr/p0 as valid, as well as the remaining scalar
shifts.
Differential Revision: https://reviews.llvm.org/D87753
The versions that take 'unsigned' will be removed in the future.
I tried to use getOriginalAlign instead of getAlign in some
places. getAlign factors in the minimum alignment implied by
the offset in the pointer info. Since we're also passing the
pointer info we can use the original alignment.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D87592
This adds SoftenFloatRes, PromoteFloatRes and SoftPromoteHalfRes
legalizations for VECREDUCE, to fill the remaining hole in the SDAG
legalization. These legalizations simply expand the reduction and
let it be recursively legalized. For the PromoteFloatRes case at
least it is possible to do better than that, but it's pretty tricky
(because we need to consider the interaction of three different
vector legalizations and the type promotion) and probably not
really worthwhile.
I haven't added ExpandFloatRes support, as I am not familiar with
ppc_fp128.
Differential Revision: https://reviews.llvm.org/D87569
LLVM will canonicalize conditional selectors to a different pattern than the old code that was used.
This is updating the function to match the new expected patterns and select SSAT or USAT when successful.
Tests have also been updated to use the new patterns.
Differential Review: https://reviews.llvm.org/D87379
This adds additional checks for the original scalar loop tripcount value, i.e.
get.active.lane.mask second argument, and perform several sanity checks to see
if it is of the form that we expect similarly like we already do for the IV
which is the first argument of get.active.lane.
Differential Revision: https://reviews.llvm.org/D86074
Treating an SoImm offset as a multiple of 4 between -1020 and 1020
mis-handles the second of a pair of 16-bit constants where the offset is a multiple of 2 but not a multiple of 4,
leading to an LLVM ERROR: out of range pc-relative fixup value
For 32-bit and larger (64-bit) constants, continue to treat an SoImm offset as a multiple of 4 between -1020 and 1020.
For smaller (16-bit) constants, treat an SoImm offset as a multiple of 1 between -255 and 255.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D86949
This allows the backend to tell the vectorizer to produce inloop
reductions through a TTI hook.
For the moment on ARM under MVE this means allowing integer add
reductions of the correct size. In the future this can include integer
min/max too, under -Os.
Differential Revision: https://reviews.llvm.org/D75512
This fixes a complication on top of D87276. If we are sign extending
around a mul with the two operands that are the same, instcombine will
helpfully convert one of the sext to a zext. Reverse that so that we
again generate a reduction.
Differnetial Revision: https://reviews.llvm.org/D87287
As discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/140729.html
This is hopefully the final remaining showstopper before we can remove
the 'experimental' from the reduction intrinsics.
No behavior was specified for the FP min/max reductions, so we have a
mess of different interpretations.
There are a few potential options for the semantics of these max/min ops.
I think this is the simplest based on current behavior/implementation:
make the reductions inherit from the existing llvm.maxnum/minnum intrinsics.
These correspond to libm fmax/fmin, and those are similar to the (now
deprecated?) IEEE-754 maxNum/minNum functions (NaNs are treated as missing
data). So the default expansion creates calls to libm functions.
Another option would be to inherit from llvm.maximum/minimum (NaNs propagate),
but most targets just crash in codegen when given those nodes because no
default expansion was ever implemented AFAICT.
We could also just assume 'nnan' semantics by default (we are already
assuming 'nsz' semantics in the maxnum/minnum intrinsics), but some targets
(AArch64, PowerPC) support the more defined behavior, so it doesn't make much
sense to not allow a tighter spec. Fast-math-flags (nnan) can be used to
loosen the semantics.
(Note that D67507 was proposed to update the LangRef to acknowledge the more
recent IEEE-754 2019 standard, but that patch seems to have stalled. If we do
update based on the new standard, the reduction instructions can seamlessly
inherit from whatever updates are made to the max/min intrinsics.)
x86 sees a regression here on 'nnan' tests because we have underlying,
longstanding bugs in FMF creation/propagation. Those need to be fixed apart
from this change (for example: https://llvm.org/PR35538). The expansion
sequence before this patch may not have been correct.
Differential Revision: https://reviews.llvm.org/D87391
We can sometimes get code that does:
xe = zext i16 x to i32
ye = zext i16 y to i32
m = mul i32 xe, ye
me = zext i32 m to i64
r = vecreduce.add(me)
This "double extend" can trip up the reduction identification, but
should give identical results.
This extends the pattern matching to handle them.
Differential Revision: https://reviews.llvm.org/D87276
values
The effects of unpredicated vector instruction with unknown
lanes cannot be predicted and therefore cannot be tail predicated. This
does not apply to predicated vector instructions and so this patch
allows tail predication on them.
Differential Revision: https://reviews.llvm.org/D87376
We weren't using this before, so none of the MachineFunction CFG edges had the
branch probability information added. As a result, block placement later in the
pipeline was flying blind.
This is enabled only with optimizations enabled like SelectionDAG.
Differential Revision: https://reviews.llvm.org/D86824
We really want to try and avoid spilling P0, which can be difficult
since there's only one register, so try to rematerialize any VCTP
instructions.
Differential Revision: https://reviews.llvm.org/D87280
count register
After my patch at D86087, code that now uses the mov operand rather than
the vctp operand will no longer remove modifications to the vctp operand
as they should. This patch fixes that by explicitly removing
modifications to the vctp operand rather than the register used as the
element count.
This was reverted in 503deec218
because it caused gigantic increase (3x) in branch mispredictions
in certain benchmarks on certain CPU's,
see https://reviews.llvm.org/D84108#2227365.
It has since been investigated and here are the results:
https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200907/827578.html
> It's an amazingly severe regression, but it's also all due to branch
> mispredicts (about 3x without this). The code layout looks ok so there's
> probably something else to deal with. I'm not sure there's anything we can
> reasonably do so we'll just have to take the hit for now and wait for
> another code reorganization to make the branch predictor a bit more happy :)
>
> Thanks for giving us some time to investigate and feel free to recommit
> whenever you'd like.
>
> -eric
So let's just reland this.
Original commit message:
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.
After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.
Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default, and is always run, unlike it's friend, common code sinking
transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once very late in the pipeline.
I'm proposing to harmonize this, and disable common code hoisting
until //late// in pipeline. Definition of //late// may vary,
here currently i've picked the same one as for code sinking,
but i suppose we could enable it as soon as right after
loop rotation happens.
Experimentation shows that this does indeed unsurprizingly help,
more loops got rotated, although other issues remain elsewhere.
Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run- time performance, codesize. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, that may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.
As per benchmarks i've run {F12360204}, this is mostly within the noise,
there are some small improvements, some small regressions.
One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
this will expose many more pre-existing missed optimizations, as usual :S
llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprizingly)
* size impact varies; for ThinLTO it's actually an improvement
The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.
If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
* -14 (-73.68%) loops not rotated due to the header size (yay)
* -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
* -3937 (-64.19%) common instructions hoisted
* +561 (+0.06%) x86 asm instructions
* -2 basic blocks
* +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable {F12360201}
* -36396 (-65.29%) common instructions hoisted
* +1676 (+0.02%) x86 asm instructions
* +662 (+0.06%) basic blocks
* +4395 (+0.04%) IR instructions
It is likely to be sub-optimal for when optimizing for code size,
so one might want to change tune pipeline by enabling sinking/hoisting
when optimizing for size.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D84108
This reverts commit 503deec218.
When optimising for size, make the cost of i1 logical operations
relatively expensive so that optimisations don't try to combine
predicates.
Differential Revision: https://reviews.llvm.org/D86525
This adds a simple tablegen pattern for folding predicate_cast(load)
into vldr p0, providing the alignment and offset are correct.
Differential Revision: https://reviews.llvm.org/D86702
This patch implements the foldMemoryOperand hook in Thumb1InstrInfo,
allowing tBLXr and a spilled function address to be combined back into a
tBL. This can help with codesize at Oz, especailly in the tinycrypt
library.
Differential Revision: https://reviews.llvm.org/D79785
There's a special case in hasAttribute for None when pImpl is null. If pImpl is not null we dispatch to pImpl->hasAttribute which will always return false for Attribute::None.
So if we just want to check for None its sufficient to just check that pImpl is null. Which can even be done inline.
This patch adds a helper for that case which I hope will speed up our getSubtargetImpl implementations.
Differential Revision: https://reviews.llvm.org/D86744
Skip this for now, to avoid a backend crash in:
UNREACHABLE executed at llvm/lib/Target/ARM/ARMISelLowering.cpp:13412
This should fix PR45824.
Differential Revision: https://reviews.llvm.org/D86784
These arm_mve_vldr_gather_offset_predicated and
arm_mve_vstr_scatter_offset_predicated have some extra parameters
meaning the predicate is at a later operand. If a loop contains _only_
those masked instructions, we would miss transforming the active lane
mask.
Differential Revision: https://reviews.llvm.org/D86791
Remove the code that tried to look for reduction patterns, since the
vectorizer and isel can now produce predicated arithmetic instructios
within the loop body. This has required some reorganisation and fixes
around live-out and predication checks, as well as looking for cases
where an input/output is initialised to zero.
Differential Revision: https://reviews.llvm.org/D86613
This patch adjusts the following ARM/AArch64 LLVM IR intrinsics:
- neon_bfmmla
- neon_bfmlalb
- neon_bfmlalt
so that they take and return bf16 and float types. Previously these
intrinsics used <8 x i8> and <4 x i8> vectors (a rudiment from
implementation lacking bf16 IR type).
The neon_vbfdot[q] intrinsics are adjusted similarly. This change
required some additional selection patterns for vbfdot itself and
also for vector shuffles (in a previous patch) because of SelectionDAG
transformations kicking in and mangling the original code.
This patch makes the generated IR cleaner (less useless bitcasts are
produced), but it does not affect the final assembly.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D86146
Enable default outlining when the function has the minsize attribute
and we're targeting an m-class core.
Differential Revision: https://reviews.llvm.org/D82951
Fix the ARM backend's analyzeBranch so it doesn't ignore predicated
return instructions, and make the MachineVerifier rule more strict.
Differential Revision: https://reviews.llvm.org/D40061
MVE Gather scatter codegeneration is looking a lot better than it used
to, but still has some issues. The instructions we currently model as 1
cycle per element, which is a bit low for some cases. Increasing the
cost by the MVECostFactor brings them in-line with our other instruction
costs. This will have the effect of only generating then when the extra
benefit is more likely to overcome some of the issues. Notably in
running out of registers and vectorizing loops that could otherwise be
SLP vectorized.
In the short-term whilst we look at other ways of dealing with those
more directly, we can increase the costs of gathers to make them more
likely to be beneficial when created.
Differential Revision: https://reviews.llvm.org/D86444
This adapts tail-predication to the new semantics of get.active.lane.mask as
defined in D86147. This means that:
- we can remove the BTC + 1 overflow checks because now the loop tripcount is
passed in to the intrinsic,
- we can immediately use that value to setup a counter for the number of
elements processed by the loop and don't need to materialize BTC + 1.
Differential Revision: https://reviews.llvm.org/D86303
If gather/scatters are enabled, ARMTargetTransformInfo now allows
tail predication for loops with a much wider range of strides, up
to anything that is loop invariant.
Differential Revision: https://reviews.llvm.org/D85410
As disscussed in post-commit review starting with
https://reviews.llvm.org/D84108#2227365
while this appears to be mostly a win overall, especially code-size-wise,
this appears to shake //certain// code pattens in a way that is extremely
unfavorable for performance (+30% runtime regression)
on certain CPU's (i personally can't reproduce).
So until the behaviour is better understood, and a path forward is mapped,
let's back this out for now.
This reverts commit 1d51dc38d8.
Modify the ARM getCmpSelInstrCost implementation for the code size
costs of selects. Now consider the legalization cost and increase
the cost of i1 because those values wouldn't live in a general purpose
register. We also make selects +1 more expensive to account for the IT
instruction.
Differential Revision: https://reviews.llvm.org/D82091
As part of D84741, this adds a target hook for the
preferPredicatedReductionSelect option and makes use
of it under MVE, allowing us to tail predicate most
reduction loops.
Differential Revision: https://reviews.llvm.org/D85980
Use the stack to save and restore the link register when there is no
available register to do it.
Differential Revision: https://reviews.llvm.org/D76069
VLD2/4 instructions cannot be predicated, so we cannot tail predicate
them from autovec. From intrinsics though, they should be valid as they
will just end up loading extra values into off vector lanes, not
effecting the on lanes. The same is true for loads in general where so
long as we are not using the other vector lanes, an unpredicated load
can be converted to a predicated one.
This marks VLD2 and VLD4 instructions as validForTailPredication and
allows any unpredicated load in tail predication loop, which seems to be
valid given the other checks we have.
Differential Revision: https://reviews.llvm.org/D86022
There are some cases where the instruction that sets up the iteration
count for a tail predicated loop cannot be moved before the dlstp,
stopping tail predication entirely. This patch checks if the mov operand
can be used and if so, uses that instead.
Differential Revision: https://reviews.llvm.org/D86087
This patch implements initial backend support for a -mtune CPU controlled by a "tune-cpu" function attribute. If the attribute is not present X86 will use the resolved CPU from target-cpu attribute or command line.
This patch adds MC layer support a tune CPU. Each CPU now has two sets of features stored in their GenSubtargetInfo.inc tables . These features lists are passed separately to the Processor and ProcessorModel classes in tablegen. The tune list defaults to an empty list to avoid changes to non-X86. This annoyingly increases the size of static tables on all target as we now store 24 more bytes per CPU. I haven't quantified the overall impact, but I can if we're concerned.
One new test is added to X86 to show a few tuning features with mismatched tune-cpu and target-cpu/target-feature attributes to demonstrate independent control. Another new test is added to demonstrate that the scheduler model follows the tune CPU.
I have not added a -mtune to llc/opt or MC layer command line yet. With no attributes we'll just use the -mcpu for both. MC layer tools will always follow the normal CPU for tuning.
Differential Revision: https://reviews.llvm.org/D85165
These operations take Qda and Rn register operands, which are
commutative so long as the instruction is not predicated.
Differential Revision: https://reviews.llvm.org/D85813
Similar to the Two op + select patterns that were added recently, this
adds some patterns for select + fma to turn them into predicated
operations.
Differential Revision: https://reviews.llvm.org/D85824
Widen the scope of memory operations that are allowed to be tail predicated
to include gathers and scatters, such that loops that are auto-vectorized
with the option -enable-arm-maskedgatscat (and actually end up containing
an MVE gather or scatter) can be tail predicated.
Differential Revision: https://reviews.llvm.org/D85138
When TTI was updated to use an explicit cost, TCK_CodeSize was used
although the default implicit cost would have been the hand-wavey
cost of size and latency. So, revert back to this behaviour. This is
not expected to have (much) impact on targets since most (all?) of
them return the same value for SizeAndLatency and CodeSize.
When optimising for size, the logic has been changed to query
CodeSize costs instead of SizeAndLatency.
This patch also adds a testing option in the unroller so that
OptSize thresholds can be specified.
Differential Revision: https://reviews.llvm.org/D85723
This pick ups the work on the overflow checks for get.active.lane.mask,
which ensure that it is safe to insert the VCTP intrinisc that enables
tail-predication. For a 2d auto-correlation kernel and its inner loop j:
M = Size - i;
for (j = 0; j < M; j++)
Sum += Input[j] * Input[j+i];
For this inner loop, the SCEV backedge taken count (BTC) expression is:
(-1 + (sext i16 %Size to i32)),+,-1}<nw><%for.body>
and LoopUtil cannotBeMaxInLoop couldn't calculate a bound on this, thus "BTC
cannot be max" could not be determined. So overflow behaviour had to be assumed
in the loop tripcount expression that uses the BTC. As a result
tail-predication had to be forced (with an option) for this case.
This change solves that by using ScalarEvolution's helper
getConstantMaxBackedgeTakenCount which is able to determine the range of BTC,
thus can determine it is safe, so that we no longer need to force tail-predication
as reflected in the changed test cases.
Differential Revision: https://reviews.llvm.org/D85737
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D85220
IT blocks with more than one instruction were performance deprecated in Armv8
but that doesn't mean we should follow that advise when optimising for size.
Differential Revision: https://reviews.llvm.org/D85638
This adds patterns for v16i16's vecreduce, using all the existing code
to go via an i32 VADDV/VMLAV and truncating the result.
Differential Revision: https://reviews.llvm.org/D85452
This formats some of the MVE patterns, and adds a missing
Predicates = [HasMVEInt] to some VRHADD patterns I noticed
as going through. Although I don't believe NEON would ever
use the patterns (as it would use ADDL and VSHRN instead)
they should ideally be predicated on having MVE instructions.
As with other targets, set the throughput cost of control-flow
instructions to free so that we don't miss out of vectorization
opportunities.
Differential Revision: https://reviews.llvm.org/D85283
Added patterns so that both SSAT and USAT instructions are generated with shifts. Added corresponding regression tests.
Differential Review: https://reviews.llvm.org/D85120
VPSEL has slightly different semantics under tail predication (it can
end up selecting from Qn, Qm and Qd). We do not model that at the moment
so they block tail predicated loops from being formed.
This just converts them into a predicated VMOV instead (via a VORR),
allowing tail predication to happen whilst still modelling the original
behaviour of the input.
Differential Revision: https://reviews.llvm.org/D85110
Fixes a regression caused by D82439, in which IT blocks were no longer being
generated when -Oz is present. This was due to the CPSR register being marked as
dead, while this case was not accounted for.
Differential Revision: https://reviews.llvm.org/D83667
This adds sign/zero extending scalar loads/stores to the MVE
instructions added in D77813, allowing us to create up more post-inc
instructions. These are comparatively simple, compared to LDR/STR (which
may be better turned into an LDRD/LDM), but still require some additions
over MVE instructions. Because there are i12 and i8 variants of the
offset loads/stores dealing with different signs, we may need to convert
an i12 address to a i8 negative instruction. t2LDRBi12 can also be
shrunk to a tLDRi under the right conditions, so we need to be careful
with codesize too.
Differential Revision: https://reviews.llvm.org/D78625
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.
After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.
Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default, and is always run, unlike it's friend, common code sinking
transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once very late in the pipeline.
I'm proposing to harmonize this, and disable common code hoisting
until //late// in pipeline. Definition of //late// may vary,
here currently i've picked the same one as for code sinking,
but i suppose we could enable it as soon as right after
loop rotation happens.
Experimentation shows that this does indeed unsurprizingly help,
more loops got rotated, although other issues remain elsewhere.
Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run- time performance, codesize. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, that may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.
As per benchmarks i've run {F12360204}, this is mostly within the noise,
there are some small improvements, some small regressions.
One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
this will expose many more pre-existing missed optimizations, as usual :S
llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprizingly)
* size impact varies; for ThinLTO it's actually an improvement
The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.
If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
* -14 (-73.68%) loops not rotated due to the header size (yay)
* -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
* -3937 (-64.19%) common instructions hoisted
* +561 (+0.06%) x86 asm instructions
* -2 basic blocks
* +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable {F12360201}
* -36396 (-65.29%) common instructions hoisted
* +1676 (+0.02%) x86 asm instructions
* +662 (+0.06%) basic blocks
* +4395 (+0.04%) IR instructions
It is likely to be sub-optimal for when optimizing for code size,
so one might want to change tune pipeline by enabling sinking/hoisting
when optimizing for size.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D84108
This patch uses the feature added in D79162 to fix the cost of a
sext/zext of a masked load, or a trunc for a masked store.
Previously, those were considered cheap or even free, but it's
not the case as we cannot split the load in the same way we would for
normal loads.
This updates the costs to better reflect reality, and adds a test for it
in test/Analysis/CostModel/ARM/cast.ll.
It also adds a vectorizer test that showcases the improvement: in some
cases, the vectorizer will now choose a smaller VF when
tail-predication is enabled, which results in better codegen. (Because
if it were to use a higher VF in those cases, the code we see above
would be generated, and the vmovs would block tail-predication later in
the process, resulting in very poor codegen overall)
Original Patch by Pierre van Houtryve
Differential Revision: https://reviews.llvm.org/D79163
Currently, getCastInstrCost has limited information about the cast it's
rating, often just the opcode and types. Sometimes there is a context
instruction as well, but it isn't trustworthy: for instance, when the
vectorizer is rating a plan, it calls getCastInstrCost with the old
instructions when, in fact, it's trying to evaluate the cost of the
instruction post-vectorization. Thus, the current system can get the
cost of certain casts incorrect as the correct cost can vary greatly
based on the context in which it's used.
For example, if the vectorizer queries getCastInstrCost to evaluate the
cost of a sext(load) with tail predication enabled, getCastInstrCost
will think it's free most of the time, but it's not always free. On ARM
MVE, a VLD2 group cannot be extended like a normal VLDR can. Similar
situations can come up with how masked loads can be extended when being
split.
To fix that, this path adds a new parameter to getCastInstrCost to give
it a hint about the context of the cast. It adds a CastContextHint enum
which contains the type of the load/store being created by the
vectorizer - one for each of the types it can produce.
Original patch by Pierre van Houtryve
Differential Revision: https://reviews.llvm.org/D79162
Optimize some specific immediates selection by materializing them with sub/mvn
instructions as opposed to loading them from the constant pool.
Patch by Ben Shi, powerman1st@163.com.
Differential Revision: https://reviews.llvm.org/D83745
A patch following up on the introduction of pointer induction variables, adding
a preprocessing step to the address optimisation in the MVEGatherScatterLowering
pass. If the getelementpointer that is the address is itself using a
getelementpointer as base, they will be merged into one by summing up the
offsets, after checking that this will not cause an overflow (this can be
repeated recursively).
Differential Revision: https://reviews.llvm.org/D84027
Many Thumb1 instructions are defined to set CPSR if executed outside an IT
block, but leave it alone from inside one. In MachineIR this is represented by
whether an optional register is CPSR or NoReg (0), and affects how the
instructions are printed.
This sets the instruction to the appropriate form during if-conversion.
Added extra patterns to VABD instruction so it is selected in place of VSUB and VABS. Added corresponding regression test too.
Differential Revision: https://reviews.llvm.org/D84500
Similar to 8fa824d7a3 but this time for MLA patterns, this selects
predicated vmlav/vmlava/vmlalv/vmlava instructions from
vecreduce.add(select(p, mul(x, y), 0)) nodes.
Differential Revision: https://reviews.llvm.org/D84102
There's no reason to involve the hassle of a virtual method targets
have to override for a simple boolean.
Not sure exactly what's going on with Mips, but it seems to define its
own totally separate handler classes.
Given a vecreduce.add(select(p, x, 0)), we can convert that to a
predicated vaddv, as the else value for the select is the identity
value, a zero. That is what this patch does for the vaddv, vaddva,
vaddlv and vaddlva instructions, copying the existing patterns to also
handle predication through a select.
Differential Revision: https://reviews.llvm.org/D84101
For a long time, the InstCombine pass handled target specific
intrinsics. Having target specific code in general passes was noted as
an area for improvement for a long time.
D81728 moves most target specific code out of the InstCombine pass.
Applying the target specific combinations in an extra pass would
probably result in inferior optimizations compared to the current
fixed-point iteration, therefore the InstCombine pass resorts to newly
introduced functions in the TargetTransformInfo when it encounters
unknown intrinsics.
The patch should not have any effect on generated code (under the
assumption that code never uses intrinsics from a foreign target).
This introduces three new functions:
TargetTransformInfo::instCombineIntrinsic
TargetTransformInfo::simplifyDemandedUseBitsIntrinsic
TargetTransformInfo::simplifyDemandedVectorEltsIntrinsic
A few target specific parts are left in the InstCombine folder, where
it makes sense to share code. The largest left-over part in
InstCombineCalls.cpp is the code shared between arm and aarch64.
This allows to move about 3000 lines out from InstCombine to the targets.
Differential Revision: https://reviews.llvm.org/D81728
This is very similar to 243970d03cace2, but handling a slightly
different form of predicated operations. When starting with a pattern of
the form select(p, BinOp(x, y), x), Instcombine will often transform
this to BinOp(x, select(p, y, 0)), where 0 is the identity value of the
binop (0 for adds/subs, 1 for muls, -1 for ands etc). This adds the
patterns that transforms those back into predicated binary operations.
There is also a very minor adjustment to tablegen null_frag in here, to
allow it to also be recognized as a PatLeaf node, so that it can be used
in MVE_TwoOpPattern to easily exclude the cases where we do not need the
alternate transform.
Differential Revision: https://reviews.llvm.org/D84091
Most MVE instructions can be predicated to fold a select into the
instruction, using the predicate and the selects else as a passthough.
This adds tablegen patterns for most two operand instructions using the
newly added TwoOpPattern from 1030e82598.
Differential Revision: https://reviews.llvm.org/D83222
Summary:
[Thumb] set code alignment for 16-bit load from constant pool
LLVM miscompiles this code when compiling for a target with v8.2-A FP16 and the Thumb ISA at -O0:
extern void bar(__fp16 P5);
int main() {
__fp16 P5 = 1.96875;
bar(P5);
}
The code section containing main has 2 byte alignment.
It needs to have 4 byte alignment,
because the load literal instruction has an offset from the
load address with the low 2 bits zeroed.
I do not include a test case in this check-in.
llc and llvm-mc do not exhibit this bug. They do not set code section alignment
in the same manner as clang.
Reviewers: dnsampaio
Reviewed By: dnsampaio
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D84169
Summary:
This fixes Bugzilla #46616 in which it was reported
that "tbb [pc, r0]" was marked as SoftFail
(aka unpredictable) incorrectly.
Expected behaviour is:
* ARMv8 is required to use sp as rn or rm
(tbb/tbh only have a Thumb encoding so using Arm mode
is not an option)
* If rm is the pc then the instruction is always
unpredictable
Some of this was implemented already and this fixes the
rest. Added tests cover the new and pre-existing handling.
Reviewers: ostannard
Reviewed By: ostannard
Subscribers: kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D84227
This commons out a chunk of the different two operand MVE patterns into
a single helper multidef. Or technically two multidef patterns so that
the Dup qr patterns can also get the same treatment. This is most of the
two address instructions that we have some codegen pattern for (not ones
that we select purely from intrinsics). It does not include shifts,
which are more spread out and will need some extra work to be given the
same treatment.
Differential Revision: https://reviews.llvm.org/D83219
These extra vcvt instructions were missed from 74ca67c109 because they
live in a different Domain, but should be treated in the same way.
Differential Revision: https://reviews.llvm.org/D83204
As far as I can tell, it should not be necessary for VCTP to be
unpredictable in tail predicated loops. Either it has a a valid loop
counter as a operand which will naturally keep it in the right loop, or
it doesn't and it won't be converted to a tail predicated loop. Not
marking it as having side effects allows it to be scheduled more cleanly
for cases where it is not expected to become a tail predicate loop.
Differential Revision: https://reviews.llvm.org/D83907
When the byref attribute is added, there will need to be two similar
functions for the existing cases which have an associate value copy,
and byref which does not. Most, but not all of the existing uses will
use the existing version.
The associated size function added by D82679 also needs to
contextually differ, and will help eliminate a few places still
relying on pointee element types.
This reverts commit 1067d3e176,
which reverted commit b2018198c3,
because it introduced a Dependency Cycle between Transforms/Scalar and
Transforms/Utils.
So let's just move SimplifyCFGOptions.h into Utils/, thus avoiding
the cycle.
Vector bitwise selects are matched by pseudo VBSP instruction
and expanded to VBSL/VBIT/VBIF after register allocation
depend on operands registers to minimize extra copies.
This adds a peephole optimisation to turn a t2MOVccr that could not be
folded into any other instruction into a CSEL on 8.1-m. The t2MOVccr
would usually be expanded into a conditional mov, that becomes an IT;
MOV pair. We can instead generate a CSEL instruction, which can
potentially be smaller and allows better register allocation freedom,
which can help reduce codesize. Performance is more variable and may
depend on the micrarchitecture details, but initial results look good.
If we need to control this per-cpu, we can add a subtarget feature as we
need it.
Original patch by David Penry.
Differential Revision: https://reviews.llvm.org/D83566
This reverts commit b2018198c3.
This commit introduced a Dependency Cycle between Transforms/Scalar and
Transforms/Utils. Transforms/Scalar already depends on Transforms/Utils,
so if SimplifyCFGOptions.h is moved to Scalar, and Utils/Local.h still
depends on it, we have a cycle.
Taking so many parameters is simply unmaintainable.
We don't want to include the entire llvm/Transforms/Utils/Local.h into
llvm/Transforms/Scalar.h so i've split SimplifyCFGOptions into
it's own header.
If a vector body has live-out values, it is probably a reduction, which needs a
final reduction step after the loop. MVE has a VADDV instruction to reduce
integer vectors, but doesn't have an equivalent one for float vectors. A
live-out value that is not recognised as reduction later in the optimisation
pipeline will result in the tail-predicated loop to be reverted to a
non-predicated loop and this is very expensive, i.e. it has a significant
performance impact, which is what we hope to avoid with fine tuning the ARM TTI
hook preferPredicateOverEpilogue implementation.
Differential Revision: https://reviews.llvm.org/D82953
This refactors option -disable-mve-tail-predication to take different arguments
so that we have 1 option to control tail-predication rather than several
different ones.
This is also a prep step for D82953, in which we want to reject reductions
unless that is requested with this option.
Differential Revision: https://reviews.llvm.org/D83133
Summary:
This patch separates the peeling specific parameters from the UnrollingPreferences,
and creates a new struct called PeelingPreferences. Functions which used the
UnrollingPreferences struct for peeling have been updated to use the PeelingPreferences struct.
Author: sidbav (Sidharth Baveja)
Reviewers: Whitney (Whitney Tsang), Meinersbur (Michael Kruse), skatkov (Serguei Katkov), ashlykov (Arkady Shlykov), bogner (Justin Bogner), hfinkel (Hal Finkel), anhtuyen (Anh Tuyen Tran), nikic (Nikita Popov)
Reviewed By: Meinersbur (Michael Kruse)
Subscribers: fhahn (Florian Hahn), hiraditya (Aditya Kumar), llvm-commits, LLVM
Tag: LLVM
Differential Revision: https://reviews.llvm.org/D80580
This patch upstreams support for the Arm-v8 Cortex-A78 and Cortex-X1
processors for AArch64 and ARM.
In detail:
- Adding cortex-a78 and cortex-x1 as cpu options for aarch64 and arm targets in clang
- Adding Cortex-A78 and Cortex-X1 CPU names and ProcessorModels in llvm
details of the CPU can be found here:
https://www.arm.com/products/cortex-xhttps://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a78
The following people contributed to this patch:
- Luke Geeson
- Mikhail Maltsev
Reviewers: t.p.northover, dmgreen
Reviewed By: dmgreen
Subscribers: dmgreen, kristof.beyls, hiraditya, danielkiss, cfe-commits,
llvm-commits, miyuki
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D83206
This adjusts the MVE fp16 cost model, similar to how we already do for
integer casts. It uses the base cost of 1 per cvt for most fp extend /
truncates, but adjusts it for loads and stores where we know that a
extending load has been used to get the load into the correct lane, and
only an MVE VCVTB is then needed.
Differential Revision: https://reviews.llvm.org/D81813
This adds some default costs for fp extends and truncates, generally
costing them as 1 per lane. If the type is not legal then the cost will
include a call to an __aeabi_ function.
Some NEON code is also adjusted to make sure it applies to the expected
types, now that fp16 is a more common thing.
Differential Revision: https://reviews.llvm.org/D82458
This expands the existing extend costs with a few extras for larger
types than legal, which will usually be split under MVE. It also adds
trunk support for the same thing. These should not have a large effect
on many things, but makes the costs explicit and keeps a certain balance
between the trunks and extends.
Differential Revision: https://reviews.llvm.org/D82457
This alters getMemoryOpCost to use the Base TargetTransformInfo version
that includes some additional checks for whether extending loads are
legal. This will generally have the effect of making <2 x ..> and some
<4 x ..> loads/stores more expensive, which in turn should help favour
larger vector factors.
Notably it alters the cost of a <4 x half>, which with the current
codegen will be expensive if it is not extended.
Differential Revision: https://reviews.llvm.org/D82456
Whether an instruction is deemed to have side effects in determined by
whether it has a tblgen pattern that emits a single instruction.
Because of the way a lot of the the vcvt instructions are specified
either in dagtodag code or with patterns that emit multiple
instructions, they don't get marked as not having side effects.
This just marks them as not having side effects manually. It can help
especially with instruction scheduling, to not create artificial
barriers, but one of these tests also managed to produce fewer
instructions.
Differential Revision: https://reviews.llvm.org/D81639
This patch upstreams support for the Arm-v8 Cortex-A77
processor for AArch64 and ARM.
In detail:
- Adding cortex-a77 as a cpu option for aarch64 and arm targets in clang
- Cortex-A77 CPU name and ProcessorModel in llvm
details of the CPU can be found here:
https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a77
and a similar submission to GCC can be found here:
e0664b7a63
The following people contributed to this patch:
- Luke Geeson
- Mikhail Maltsev
Reviewers: t.p.northover, dmgreen, ostannard, SjoerdMeijer
Reviewed By: dmgreen
Subscribers: dmgreen, kristof.beyls, hiraditya, danielkiss, cfe-commits,
llvm-commits, miyuki
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D82887
Move the Thumb2SizeReduce pass to before IfConversion when optimising
for minimal code size.
Running the Thumb2SizeReduction pass before IfConversionallows T1
instructions to propagate to the final output, rather than the
ifConverter modifying T2 instructions and preventing them from being
reduced later.
This change does introduce a regression regarding execution time, so
it's only applied when optimising for size.
Running the LLVM Test Suite with this change produces a geomean
difference of -0.1% for the size..text metric.
Differential Revision: https://reviews.llvm.org/D82439
Before this instruction supported output values, it fit fairly
naturally as a terminator. However, being a terminator while also
supporting outputs causes some trouble, as the physreg->vreg COPY
operations cannot be in the same block.
Modeling it as a non-terminator allows it to be handled the same way
as invoke is handled already.
Most of the changes here were created by auditing all the existing
users of MachineBasicBlock::isEHPad() and
MachineBasicBlock::hasEHPadSuccessor(), and adding calls to
isInlineAsmBrIndirectTarget or mayHaveInlineAsmBr, as appropriate.
Reviewed By: nickdesaulniers, void
Differential Revision: https://reviews.llvm.org/D79794
While validating live-out values, record instructions that look like
a reduction. This will comprise of a vector op (for now only vadd),
a vorr (vmov) which store the previous value of vadd and then a vpsel
in the exit block which is predicated upon a vctp. This vctp will
combine the last two iterations using the vmov and vadd into a vector
which can then be consumed by a vaddv.
Once we have determined that it's safe to perform tail-predication,
we need to change this sequence of instructions so that the
predication doesn't produce incorrect code. This involves changing
the register allocation of the vadd so it updates itself and the
predication on the final iteration will not update the falsely
predicated lanes. This mimics what the vmov, vctp and vpsel do and
so we then don't need any of those instructions.
Differential Revision: https://reviews.llvm.org/D75533
After the rewrite of this pass (D79175) I missed one thing: the inserted VCTP
intrinsic can be cloned to exit blocks if there are instructions present in it
that perform the same operation, but this wasn't triggering anymore. However,
it turns out that for handling reductions, see D75533, it's actually easier not
not to have the VCTP in exit blocks, so this removes that code.
This was possible because it turned out that some other code that depended on
this, rematerialization of the trip count enabling more dead code removal
later, wasn't doing much anymore due to more aggressive dead code removal that
was added to the low-overhead loops pass.
Differential Revision: https://reviews.llvm.org/D82773
This patch stops the trunc, rint, round, floor and ceil intrinsics from blocking tail predication.
Differential Revision: https://reviews.llvm.org/D82553
MVE has native reductions for integer add and min/max. The others need
to be expanded to a series of extract's and scalar operators to reduce
the vector into a single scalar. The default codegen for that expands
the reduction into a series of in-order operations.
This modifies that to something more suitable for MVE. The basic idea is
to use vector operations until there are 4 remaining items then switch
to pairwise operations. For example a v8f16 fadd reduction would become:
Y = VREV X
Z = ADD(X, Y)
z0 = Z[0] + Z[1]
z1 = Z[2] + Z[3]
return z0 + z1
The awkwardness (there is always some) comes in from something like a
v4f16, which is first legalized by adding identity values to the extra
lanes of the reduction, and which can then not be optimized away through
the vrev; fadd combo, the inserts remain. I've made sure they custom
lower so that we can produce the pairwise additions before the extra
values are added.
Differential Revision: https://reviews.llvm.org/D81397
Pre-commit for D82257, this adds a DemandedElts arg to ShrinkDemandedConstant/targetShrinkDemandedConstant which will allow future patches to (optionally) add vector support.
Similar to the recent patch for fpext, this adds vcvtb and vcvtt with
insert into vector instruction selection patterns for fptruncs. This
helps clear up a lot of register shuffling that we would otherwise do.
Differential Revision: https://reviews.llvm.org/D81637
We current extract and convert from a top lane of a f16 vector using a
VMOVX;VCVTB pair. We can simplify that to use a single VCVTT. The
pattern is mostly copied from a vector extract pattern, but produces a
VCVTTHS f32 directly.
This had to move some code around so that ARMInstrVFP had access to the
required pattern frags that were previously part of ARMInstrNEON.
Differential Revision: https://reviews.llvm.org/D81556
This extends PerformSplittingToWideningLoad to also handle FP_Ext, as
well as sign and zero extends. It uses an integer extending load
followed by a VCVTL on the bottom lanes to efficiently perform an fpext
on a smaller than legal type.
The existing code had to be rewritten a little to not just split the
node in two and let legalization handle it from there, but to actually
split into legal chunks.
Differential Revision: https://reviews.llvm.org/D81340
This adds code to lower f16 to f32 fp_exts's using an MVE VCVT
instructions, similar to a recent similar patch for fp_trunc. Again it
goes through the lowering of a BUILD_VECTOR, but is slightly simpler
only having to deal with interleaved indices. It adds a VCVTL node to
lower to, similar to VCVTN.
Differential Revision: https://reviews.llvm.org/D81339
This splits MVE vector stores of a fp_trunc in the same way that we do
for standard trunc's. It extends PerformSplittingToNarrowingStores to
handle fp_round, splitting the store into pieces and adding a VCVTNb to
perform the actual fp_round. The actual store is then converted to an
integer store so that it can truncate bottom lanes of the result.
Differential Revision: https://reviews.llvm.org/D81141
This adds code to lower f32 to f16 fp_trunc's using a pair of MVE VCVT
instructions. Due to v4f16 not being legal, fp_round are often split up
fairly early. So this reconstructs the vcvt's from a buildvector of
fp_rounds from two vector inputs. Something like:
BUILDVECTOR(FP_ROUND(EXTRACT_ELT(X, 0),
FP_ROUND(EXTRACT_ELT(Y, 0),
FP_ROUND(EXTRACT_ELT(X, 1),
FP_ROUND(EXTRACT_ELT(Y, 1), ...)
It adds a VCVTN node to handle this, which like VMOVN or VQMOVN lowers
into the top/bottom lanes of an MVE instruction.
Differential Revision: https://reviews.llvm.org/D81139
The main interface has been migrated to Align already but a few backends where broadening the type from Align to MaybeAlign.
This patch makes sure all implementations conform to the public API.
Differential Revision: https://reviews.llvm.org/D82465
The ARM ARM considers p10/p11 valid arguments for MCR/MRC instructions.
MRC instructions with p10 arguments are also used in kernel code which
is shared for different architectures. Turn usage of p10/p11 to warnings
for ARMv7/ARMv8-M.
Reviewers: rengolin, olista01, t.p.northover, efriedma, psmith, simon_tatham
Reviewed By: simon_tatham
Subscribers: hiraditya, danielkiss, jcai19, tpimh, nickdesaulniers, peter.smith, javed.absar, kristof.beyls, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59733
Summary:
Get back `const` partially lost in one of recent changes.
Additionally specify explicit qualifiers in few places.
Reviewers: samparker
Reviewed By: samparker
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82383
Summary:
This change permits scalar bfloats to be loaded, stored, moved and
used as function call arguments and return values, whenever the bf16
feature is supported by the subtarget.
Previously that was only supported in the presence of the fullfp16
feature, because the code generation strategy depended on instructions
from that extension. This change adds alternative code generation
strategies so that those operations can be done even without fullfp16.
The strategy for loads and stores is to replace VLDRH/VSTRH with
integer LDRH/STRH plus a move between register classes. I've written
isel patterns for those, conditional on //not// having the fullfp16
feature (so that in the fullfp16 case, the existing patterns will
still be used).
For function arguments and returns, instead of writing isel patterns
to match `VMOVhr` and `VMOVrh`, I've avoided generating those SDNodes
in the first place, by factoring out the code that constructs them
into helper functions `MoveToHPR` and `MoveFromHPR` which have a
fallback for non-fullfp16 subtargets.
The current output code is not especially pretty: in the new test file
you can see unnecessary store/load pairs implementing no-op bitcasts,
and lots of pointless moves back and forth between FP registers and
GPRs. But it at least works, which is an improvement on the previous
situation.
Reviewers: dmgreen, SjoerdMeijer, stuij, chill, miyuki, labrinea
Reviewed By: dmgreen, labrinea
Subscribers: labrinea, kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82372
Implement them on top of sdiv/udiv, similar to what we do for integer
types.
Potential future work: implementing i8/i16 srem/urem, optimizations for
constant divisors, optimizing the mul+sub to mls.
Differential Revision: https://reviews.llvm.org/D81511
LDRD and STRD along with UBFX and SBFX are selected from DAGToDAG
transforms, so do not have tblgen patterns. They don't get marked as
having side effects so cannot be scheduled as efficiently as you would
like.
This specifically marks then as not having side effects.
Differential Revision: https://reviews.llvm.org/D82358
The VLLDM and VLSTM instructions are incompletely specified. They
(potentially) write (or read, respectively) registers Q0-Q7, VPR, and
FPSCR, but the compiler is unaware of it.
In the new test case `cmse-vlldm-no-reorder.ll` case the compiler
missed an anti-dependency and reordered a `VLLDM` ahead of the
instruction, which stashed the return value from the non-secure call,
effectively clobbering said value.
This test case does not fail with upstream LLVM, because of scheduling
differences and I couldn't find a test case for the VLSTM either.
Differential Revision: https://reviews.llvm.org/D81586
This patch adds codegen for the following BFloat
operations to the ARM backend:
* concatenation of bf16 vectors
* bf16 vector element extraction
* bf16 vector element insertion
* duplication of a bf16 value into each lane of a vector
* duplication of a bf16 vector lane into each lane
Differential Revision: https://reviews.llvm.org/D81411
This was passing in all the parameters needed to construct a
LegalizerHelper in the custom legalization, when it's simpler to just
pass in the existing helper.
This is slightly more annoying to use in the common case where you
don't need the legalizer helper, but we could add back the common
parameters back in addition to the helper.
I didn't propagate this to all the internal target changes that this
logically implies, but did update a sample one for
legalizeMinNumMaxNum.
This is in preparation for moving AMDGPU load/store legalization
entirely into custom lowering. The current set of legalization actions
is really constraining and not really capable of expressing all the
actions needed to legalize loads/stores. In particular there's no way
to express when the memory access itself needs to change size vs. the
result type. There's also a lot of redundancy since the same
split/widen actions need to be applied in both vector and scalar
cases. All of the sub-cases logically belong as steps in the legalizer
helper, but it will be easier to consider everything at once in custom
lowering.
This patch adds basic support for BFloat in the Arm backend.
For now the code generation relies on fullfp16 being present.
Briefly:
* adds the bfloat scalar and vector types in the necessary register classes,
* adjusts the calling convention to cope with bfloat argument passing and return,
* adds codegen patterns for moves, loads and stores.
It's tested mostly by the intrinsic patches that depend on it (load/store, convert/copy).
The following people contributed to this patch:
* Alexandros Lamprineas
* Ties Stuij
Differential Revision: https://reviews.llvm.org/D81373
Summary:
As half-precision floating point arguments and returns were previously
coerced to either float or int32 by clang's codegen, the CMSE handling
of those was also performed in clang's side by zeroing the unused MSBs
of the coercer values.
This patch moves this handling to the backend's calling convention
lowering, making sure the high bits of the registers used by
half-precision arguments and returns are zeroed.
Reviewers: chill, rjmccall, ostannard
Reviewed By: ostannard
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D81428
Summary:
Half-precision floating point arguments and returns are currently
promoted to either float or int32 in clang's CodeGen and there's
no existing support for the lowering of `half` arguments and returns
from IR in AArch32's backend.
Such frontend coercions, implemented as coercion through memory
in clang, can cause a series of issues in argument lowering, as causing
arguments to be stored on the wrong bits on big-endian architectures
and incurring in missing overflow detections in the return of certain
functions.
This patch introduces the handling of half-precision arguments and returns in
the backend using the actual "half" type on the IR. Using the "half"
type the backend is able to properly enforce the AAPCS' directions for
those arguments, making sure they are stored on the proper bits of the
registers and performing the necessary floating point convertions.
Reviewers: rjmccall, olista01, asl, efriedma, ostannard, SjoerdMeijer
Reviewed By: ostannard
Subscribers: stuij, hiraditya, dmgreen, llvm-commits, chill, dnsampaio, danielkiss, kristof.beyls, cfe-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75169
The rearranges PerformANDCombine and PerformORCombine to try and make
sure we don't call isConstantSplat on any i1 vectors. As pointed out in
D81860 it may not be very well defined in those cases.
To set up a tail-predicated loop, we need to to calculate the number of
elements processed by the loop. We can now use intrinsic
@llvm.get.active.lane.mask() to do this, which is emitted by the vectoriser in
D79100. This intrinsic generates a predicate for the masked loads/stores, and
consumes the Backedge Taken Count (BTC) as its second argument. We can now use
that to reconstruct the loop tripcount, instead of the IR pattern match
approach we were using before.
Many thanks to Eli Friedman and Sam Parker for all their help with this work.
This also adds overflow checks for the different, new expressions that we
create: the loop tripcount, and the sub expression that calculates the
remaining elements to be processed. For the latter, SCEV is not able to
calculate precise enough bounds, so we work around that at the moment, but is
not entirely correct yet, it's conservative. The overflow checks can be
overruled with a force flag, which is thus potentially unsafe (but not really
because the vectoriser is the only place where this intrinsic is emitted at the
moment). It's also good to mention that the tail-predication pass is not yet
enabled by default. We will follow up to see if we can implement these
overflow checks better, either by a change in SCEV or we may want revise the
definition of llvm.get.active.lane.mask.
Differential Revision: https://reviews.llvm.org/D79175
This emits new IR intrinsic @llvm.get.active.mask for tail-folded vectorised
loops if the intrinsic is supported by the backend, which is checked by
querying TargetTransform hook emitGetActiveLaneMask.
This intrinsic creates a mask representing active and inactive vector lanes,
which is used by the masked load/store instructions that are created for
tail-folded loops. The semantics of @llvm.get.active.mask are described here in
LangRef:
https://llvm.org/docs/LangRef.html#llvm-get-active-lane-mask-intrinsics
This intrinsic is also used to provide a hint to the backend. That is, the
second argument of the intrinsic represents the back-edge taken count of the
loop. For MVE, for example, we use that to set up tail-predication, which is a
new form of predication in MVE for vector loops that implicitely predicates the
last vector loop iteration by implicitely setting active/inactive lanes, i.e.
the tail loop is predicated. In order to set up a tail-predicated vector loop,
we need to know the number of data elements processed by the vector loop, which
corresponds the the tripcount of the scalar loop, which we can now reconstruct
using @llvm.get.active.mask.
Differential Revision: https://reviews.llvm.org/D79100