Commit Graph

1981 Commits

David Green 21a8b9d9e9 [ARM] Limit PerformExtractEltToVMOVRRD to when f64 is legal.
The generic SoftFloatVectorExtract.ll test was failing when run on arm
machines, as it tries to create an f64 under soft float. Limit the
transform to when f64 is legal.

Also add a missing override, as reported in D100244.
2021-04-20 16:24:36 +01:00
David Green 48cef1fa8e [ARM] Create VMOVRRD from adjacent vector extracts
This adds a combine for extract(x, n); extract(x, n+1) ->
VMOVRRD(extract x, n/2). This allows two vector lanes to be moved at the
same time in a single instruction and, thanks to the other VMOVRRD folds
we have added recently, can help reduce the number of executed
instructions. Floating point types are very similar, but will include a
bitcast to an integer type.
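
As a rough illustration (not taken from the patch itself; the function and
value names are made up), two extracts of adjacent lanes of the same vector,
for example feeding a pair of scalar stores, are the kind of pattern this
combine targets:

  define arm_aapcs_vfpcc void @adjacent_extracts(<4 x i32> %x, i32* %p) {
  entry:
    ; lanes 0 and 1 of %x can be moved to two GPRs with a single VMOVRRD
    %e0 = extractelement <4 x i32> %x, i32 0
    %e1 = extractelement <4 x i32> %x, i32 1
    store i32 %e0, i32* %p
    %p1 = getelementptr i32, i32* %p, i32 1
    store i32 %e1, i32* %p1
    ret void
  }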

This also adds a shouldRewriteCopySrc, to prevent copy propagation from
DPR to SPR, which can break because not all DPR registers can be
extracted from directly; otherwise the machine verifier is unhappy.

Differential Revision: https://reviews.llvm.org/D100244
2021-04-20 15:15:43 +01:00
David Green 00a6045473 [ARM] Combine sub 0, csinc X, Y, CC -> csinv -X, Y, CC
Combine sub 0, csinc X, Y, CC to csinv -X, Y, CC provided that the
negation of X is cheap, currently just handling constants. This comes up
during the splat of an i1 to a predicate, where we now generate csetm,
as opposed to cset; rsb.
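
A small illustrative case (a sketch, not a test from the patch; the function
name is made up): sign-splatting an i1 via sub 0, zext, which under 8.1-M
should now select a single csetm:

  define i32 @splat_i1(i32 %a, i32 %b) {
    ; previously cset; rsb, now csetm
    %c = icmp eq i32 %a, %b
    %z = zext i1 %c to i32
    %n = sub i32 0, %z
    ret i32 %n
  }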

Differential Revision: https://reviews.llvm.org/D99940
2021-04-16 11:52:31 +01:00
Simon Pilgrim ddbb58736a [KnownBits] Rename KnownBits::computeForMul to KnownBits::mul. NFCI.
As promised in D98866
2021-04-06 10:11:41 +01:00
David Green 35e0567d58 [ARM] Add VREV MVE shuffle costs
This uses the shuffle mask cost from D98206 to give a better cost of MVE
VREV instructions. This helps especially in VectorCombine, where the cost
of shuffles is used to reorder bitcasts, and keeps the phase ordering
test for fp16 reductions producing optimal code. The isVREVMask check
has been moved to a header file to allow it to be used across the target
transform info and ISel lowering.
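
For reference, an assumed example (not from the patch) of the kind of mask
isVREVMask recognises - a VREV32.16-style shuffle that swaps each pair of
16-bit lanes within every 32-bit chunk:

  define arm_aapcs_vfpcc <8 x i16> @vrev32_16(<8 x i16> %x) {
    %s = shufflevector <8 x i16> %x, <8 x i16> undef,
         <8 x i32> <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6>
    ret <8 x i16> %s
  }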

Differential Revision: https://reviews.llvm.org/D98210
2021-03-17 21:21:43 +00:00
David Green bd516d24c1 [ARM] Move t2DoLoopStart reg alloc hint
This adjusts the place that the t2DoLoopStart reg allocation hint is
inserted, adding it in the ARM TP and VPT optimisations pass in a similar
place as other tail predicated loop optimizations. This removes the need
for doing so in a custom inserter, and should make the hint more accurate,
only adding it where we expect to create a DLS (not DLSTP or WLS).
2021-03-11 17:56:19 +00:00
David Green fad70c3068 [ARM] Improve WLS lowering
Recently we improved the lowering of low overhead loops and tail
predicated loops, concentrating first on the DLS-style do loops. This
extends those improvements over to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little, as the instructions are terminators that produce a
value, something that needs to be treated carefully.

Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:

  %wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
  %wls0 = extractvalue { i32, i1 } %wls, 0
  %wls1 = extractvalue { i32, i1 } %wls, 1
  br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
  %lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
  ..
  %iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
  %cmp = icmp ne i32 %iv.next, 0
  br i1 %cmp, label %loop, label %loop.exit

The llvm.test.start.loop.iterations intrinsic needs to be lowered
through ISel as a pair of WLS and WLSSETUP nodes, which get converted to
the t2WhileLoopSetup and t2WhileLoopStart pseudos. This helps prevent
t2WhileLoopStart from being a terminator that produces a value,
something difficult to control at that stage in the pipeline. Instead
the t2WhileLoopSetup produces the value of LR (essentially acting as
lr = subs rn, 0) and t2WhileLoopStart consumes that lr value (acting as
the Bcc).

These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. Otherwise we revert the loop
to prevent them from progressing further in the pipeline. The
t2WhileLoopStartLR is a single instruction that takes a GPR and produces
LR, similar to the WLS instruction.

  %1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
  t2B %bb.1
...
bb.2.loop:
  %2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
  ...
  %3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
  t2B %bb.3

The t2WhileLoopStartLR can then be treated like the other low overhead
loop pseudos, eventually being lowered to a WLS provided the branches
are within range.

Differential Revision: https://reviews.llvm.org/D97729
2021-03-11 17:56:19 +00:00
David Green a968e7b82e [ARM] KnownBits for CSINC/CSNEG/CSINV
This adds some simple known bits handling for the three CSINC/NEG/INV
instructions. From the operands' known bits we can compute the common
bits of the first operand and the incremented/negated/inverted second
operand. The first, especially CSINC ZR, ZR, comes up a fair amount in
the tests. The others are rarer, so a unit test for them is added.

Differential Revision: https://reviews.llvm.org/D97788
2021-03-04 08:40:20 +00:00
David Green 438c98515c [ARM] Use 0, not ZR during ISel for CSINC/INV/NEG
Instead of converting the 0 into a ZR reg during lowering, do that with
tablegen by matching the zero immediate. This, when combined with other
optimizations, is more likely to use ZR and helps keep the DAG more
easily optimizable. It should not otherwise affect code generation.
2021-03-02 19:01:14 +00:00
David Green 91ebc4e864 [ARM] VMOVN undef folding
If we insert undef using a VMOVN, we can just use the original value in
three out of the four possible combinations. Using VMOVT into an undef
vector will still require the lanes to be moved, but otherwise the
non-undef value can be used.
2021-02-28 14:44:45 +00:00
David Green 0fe64812d8 [ARM] VECTOR_REG_CAST undef -> undef
Propagate undef through VECTOR_REG_CAST nodes, allowing extra
simplification in some patterns.
2021-02-28 11:13:49 +00:00
Leonard Chan c77659e549 [llvm][IR] Do not place constants with static relocations in a mergeable section
This patch provides two major changes:

1. Add getRelocationInfo to check if a constant will have static, dynamic, or
   no relocations. (Also rename the original needsRelocation to needsDynamicRelocation.)
2. Only allow a constant with no relocations (static or dynamic) to be placed
   in a mergeable section.

This will allow unused symbols that contain static relocations and happen
to fit in mergeable constant sections (.rodata.cstN) to instead be placed
in uniquely named sections if -fdata-sections is used, and subsequently
garbage collected by --gc-sections.
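
As a hand-written illustration (not from the patch; the global names are made
up), a constant that refers to another symbol needs a relocation even in a
fully static link, so it is no longer eligible for a mergeable section:

  @msg = private unnamed_addr constant [6 x i8] c"hello\00"
  ; @table needs a (static) relocation against @msg, so it can no longer be
  ; placed in a mergeable .rodata.cstN section
  @table = constant [1 x i8*] [i8* getelementptr inbounds ([6 x i8], [6 x i8]* @msg, i32 0, i32 0)]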

See https://lists.llvm.org/pipermail/llvm-dev/2021-February/148281.html.

Differential Revision: https://reviews.llvm.org/D95960
2021-02-18 15:39:00 -08:00
Serge Pavlov 816053bc71 [FPEnv][ARM] Implement lowering of llvm.set.rounding
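
A minimal usage sketch (my assumption, not from the patch, is that 1 selects
round-to-nearest in the FLT_ROUNDS-style encoding the intrinsic documents):

  declare void @llvm.set.rounding(i32)

  define void @round_to_nearest() {
    ; 1 = round to nearest; on ARM this should become an FPSCR rounding-mode update
    call void @llvm.set.rounding(i32 1)
    ret void
  }
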
Differential Revision: https://reviews.llvm.org/D96501
2021-02-13 11:16:29 +07:00
David Green 875f0cbcc6 [ARM] Optimize fp store of extract to integer store if already available.
Given a floating point store from an extracted vector, with an integer
VGETLANE that already exists, storing the existing VGETLANEu directly
can be better for performance. As the value is known to already be in an
integer register, this can help reduce fp register pressure, removing
the need for the fp extract and allowing the use of more integer post-inc
stores not available with vstr.
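
A sketch of the kind of input (the names and exact shape are assumptions, not
taken from the patch): an fp lane store where an integer read of the same
lane already exists:

  define arm_aapcs_vfpcc i32 @store_lane(<4 x float> %v, float* %p) {
    %bc = bitcast <4 x float> %v to <4 x i32>
    %i  = extractelement <4 x i32> %bc, i32 0   ; integer lane read already available
    %f  = extractelement <4 x float> %v, i32 0  ; fp extract feeding the store
    store float %f, float* %p                   ; the same bits can be stored from %i instead
    ret i32 %i
  }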

This can be a bit narrow in scope, but helps with certain biquad kernels
that store shuffled vector elements.

Differential Revision: https://reviews.llvm.org/D96159
2021-02-12 18:34:58 +00:00
David Green 541828e35d [ARM] Single source VMOVNT
Our current lowering of VMOVNT goes via a shuffle vector of the form
<0, N, 2, N+2, 4, N+4, ..>. That can of course also be a single input
shuffle of the form <0, 0, 2, 2, 4, 4, ..>, where we use a VMOVNT to
insert a vector into the top lanes of itself. This adds lowering of that
case, re-using the existing isVMOVNMask.
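
For example (a hand-written sketch, not a test from the patch), the
single-input form of the mask looks like:

  define arm_aapcs_vfpcc <8 x i16> @single_source_vmovnt(<8 x i16> %x) {
    ; <0, 0, 2, 2, 4, 4, 6, 6>: insert %x into the top lanes of itself
    %s = shufflevector <8 x i16> %x, <8 x i16> undef,
         <8 x i32> <i32 0, i32 0, i32 2, i32 2, i32 4, i32 4, i32 6, i32 6>
    ret <8 x i16> %s
  }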

Differential Revision: https://reviews.llvm.org/D96065
2021-02-12 14:28:57 +00:00
David Green 1db7b9ceaa [ARM] Make a BE predicate bitcast consistent with the rest of llvm
We were storing predicate registers, such as an <8 x i1>, in the
opposite order to how the rest of llvm expects. This actually turns out
to be correct for the one place that usually uses it - the
ScalarizeMaskedMemIntrin pass - but only because the pass was incorrect
itself. This fixes the order so that the bits are stored in the order
the rest of llvm expects and bitcasts work as expected. This allows the
Scalarization pass to be fixed, as in https://reviews.llvm.org/D94765.

Differential Revision: https://reviews.llvm.org/D94867
2021-02-11 08:59:52 +00:00
David Green 0c7e044a7f [ARM] One-off identity shuffle
A One-Off Identity mask is a shuffle that is mostly an identity mask
from a single source but contains a single element out-of-place, either
from a different vector or from another position in the same vector. As
opposed to lowering this via an ARMISD::BUILD_VECTOR, we can generate an
extract/insert pair directly. Under ARM, with individually accessible
lane elements, this often becomes a simple lane move.
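
For instance (illustrative only, not from the patch), a v4i32 shuffle that is
the identity except for lane 1, which is taken from the second source:

  define arm_aapcs_vfpcc <4 x i32> @one_off_identity(<4 x i32> %a, <4 x i32> %b) {
    ; mask <0, 5, 2, 3>: identity from %a apart from lane 1, taken from %b
    %s = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 5, i32 2, i32 3>
    ret <4 x i32> %s
  }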

This also alters the LowerVECTOR_SHUFFLEUsingMovs code to use v4f32 (not
v4i32), a more natural type for lane moves.

Differential Revision: https://reviews.llvm.org/D95551
2021-02-08 21:24:32 +00:00
David Green 11e415dc90 [ARM] Make v2f64 scalar_to_vector legal
Because we mark all operations as expand for v2f64, scalar_to_vector
would end up lowering through a stack store/reload. But it is pretty
simple to implement, only inserting a D reg into an undef vector. This
helps clear up some inefficient codegen from soft calling conventions.

Differential Revision: https://reviews.llvm.org/D96153
2021-02-08 11:34:55 +00:00
Craig Topper 11ef356d9e [TargetLowering] Use Align in allowsMisalignedMemoryAccesses.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96097
2021-02-04 19:22:06 -08:00
Ayke van Laethem aecdf15cc7 [ARM] Do not emit ldrexd/strexd on Cortex-M chips
The ldrexd/strexd instructions are not supported on M-class chips, see
for example
https://developer.arm.com/documentation/dui0489/e/arm-and-thumb-instructions/memory-access-instructions/ldrex-and-strex
which says:

> All these 32-bit Thumb instructions are available in ARMv6T2 and
> above, except that LDREXD and STREXD are not available in the ARMv7-M
> architecture.

Looking at the ARMv8-M architecture, it appears that these instructions
aren't supported either. The Architecture Reference Manual lists
ldrex/strex but not ldrexd/strexd:
https://developer.arm.com/documentation/ddi0553/bn/

Godbolt example on LLVM 11.0.0, which incorrectly emits ldrexd/strexd
instructions: https://llvm.godbolt.org/z/5qqPnE
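
As an assumed illustration (not from the review), a 64-bit atomic load that
previously selected an ldrexd-based sequence on M-profile targets and should
now be handled without it (for example through an atomic libcall):

  define i64 @load64(i64* %p) {
    ; on Cortex-M this can no longer use ldrexd
    %v = load atomic i64, i64* %p seq_cst, align 8
    ret i64 %v
  }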

Differential Revision: https://reviews.llvm.org/D95891
2021-02-04 21:55:34 +01:00
David Green 649a3d00df [ARM] Handle f16 in GeneratePerfectShuffle
This new f16 shuffle under Neon would hit an assert in
GeneratePerfectShuffle, as it would try to treat an f16 vector as an i8.
Add f16 handling, treating it like an i16.

Differential Revision: https://reviews.llvm.org/D95446
2021-02-04 11:14:52 +00:00
David Green 5b2626ea87 [ARM] Flatten identity shuffles through vqdmulh nodes
Given a shuffle(vqdmulh(shuffle, shuffle)), we can flatten the shuffles
out if they become an identity mask. This can come up during lane
interleaving, when we do that better.

Differential Revision: https://reviews.llvm.org/D94034
2021-02-01 19:14:20 +00:00
David Green 5805521207 [ARM] Simplify VMOVRRD from extracts of buildvectors
Under the softfp calling convention, we are often left with
VMOVRRD(extract(bitcast(build_vector(a, b, c, d)))) for the return value
of the function. These can be simplified to a,b or c,d directly,
depending on the value of the extract.

Big endian is a little different because the bitcast switches the lanes
around, meaning we end up with b,a or d,c.

Differential Revision: https://reviews.llvm.org/D94989
2021-02-01 16:09:25 +00:00
David Green ad12e6ee95 [ARM] Turn sext_inreg(VGetLaneu) into VGetLaneu
This adds a DAG combine for converting sext_inreg of VGetLaneu into
VGetLanes, provided the types match correctly.

Differential Revision: https://reviews.llvm.org/D95073
2021-02-01 11:10:35 +00:00
David Green 6ab792b68d [ARM] Simplify extract of VMOVDRR
Under SoftFP calling conventions, we can be left with
extract(bitcast(BUILD_VECTOR(VMOVDRR(a, b), ..))) patterns that can
simplify to a or b, depending on the extract lane.

Differential Revision: https://reviews.llvm.org/D94990
2021-02-01 10:24:57 +00:00
Kazu Hirata 7925aa091d [llvm] Populate SmallVector at construction time (NFC) 2021-01-28 22:21:14 -08:00
David Green 40f46cb0e4 [ARM] Add alignment checks for MVE VLDn
The MVE VLD2/4 and VST2/4 instructions require the pointer to be aligned
to at least the size of the element type. This adds a check for that
into the ARM lowerInterleavedStore and lowerInterleavedLoad functions,
not creating the intrinsics if they are invalid for the alignment of
the load/store.

Unfortunately this is one of those bug fixes that does affect some
useful codegen, as we were sometimes able to do some nice lowering of
q15 types. But those cases can cause problems with under-aligned
pointers.

Differential Revision: https://reviews.llvm.org/D95319
2021-01-28 13:10:08 +00:00
Kazu Hirata 054444177b [Target] Use llvm::append_range (NFC) 2021-01-24 12:18:56 -08:00
Kazu Hirata e4847a7fcf Revert "[Target] Use llvm::append_range (NFC)"
This reverts commit cc7a238286.

The X86WinEHState.cpp hunk seems to break certain builds.
2021-01-23 11:25:27 -08:00
Kazu Hirata cc7a238286 [Target] Use llvm::append_range (NFC) 2021-01-23 10:56:31 -08:00
David Green af03324984 [ARM] Disable sign extended SSAT pattern recognition.
I may have given bad advice, and skipping sext_inreg when matching SSAT
patterns is not valid on its own. It at least needs to sext_inreg the
input again, but as far as I can tell is still only valid based on
demanded bits. For the moment disable that part of the combine, hopefully
reimplementing it more correctly in the future.
2021-01-22 14:07:48 +00:00
David Green 9ae73cdbc1 [ARM] Adjust isSaturatingConditional to return a new SDValue. NFC
This replaces the isSaturatingConditional function with
LowerSaturatingConditional that directly returns a new SSAT or
USAT SDValue, instead of returning true and the components of it.
2021-01-22 11:11:36 +00:00
David Green 6a563eef13 [ARM] Expand vXi1 VSELECT's
We have no lowering for VSELECT vXi1, vXi1, vXi1, so mark them as
expanded to turn them into a series of logical operations.
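
A small case that hits this (a sketch, not a test from the patch):

  define arm_aapcs_vfpcc <4 x i1> @vselect_v4i1(<4 x i1> %c, <4 x i1> %a, <4 x i1> %b) {
    ; expanded into and/or style logic rather than a real vector select
    %s = select <4 x i1> %c, <4 x i1> %a, <4 x i1> %b
    ret <4 x i1> %s
  }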

Differential Revision: https://reviews.llvm.org/D94946
2021-01-19 17:56:50 +00:00
David Green c29ca8551a [ARM] Update isVMOVNOriginalMask to handle single input shuffle vectors
The isVMOVNOriginalMask was previously only checking for two input
shuffles that could be better expanded as vmovn nodes. This expands that
to single input shuffles that will later be legalized to multiple
vectors.

Differential Revision: https://reviews.llvm.org/D94189
2021-01-13 08:51:28 +00:00
David Green 024af42c60 [ARM] Custom lower i1 vector truncates
The ISel patterns we have for truncating to i1's under MVE do not seem
to be correct. Instead custom lower to icmp(ne, and(x, 1), 0).
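
For example (an illustrative sketch), truncating a v4i32 to a v4i1 is now
lowered as a compare against the bottom bit:

  define arm_aapcs_vfpcc <4 x i1> @trunc_v4i32_v4i1(<4 x i32> %x) {
    ; lowered roughly as: icmp ne (and %x, 1), 0
    %t = trunc <4 x i32> %x to <4 x i1>
    ret <4 x i1> %t
  }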

Differential Revision: https://reviews.llvm.org/D94226
2021-01-08 18:21:00 +00:00
David Green ddb82fc76c [ARM] Handle any extend whilst lowering mull
Similar to 78d8a821e2 but for ARM, this handles any_extend whilst
creating MULL nodes, treating them as zextends.

Differential Revision: https://reviews.llvm.org/D93834
2021-01-06 10:51:12 +00:00
David Green 901cc9b6f3 [ARM] Extend lowering for i64 reductions
The lowering of a <4 x i16> or <4 x i8> vecreduce.add into an i64 would
previously be expanded, due to the i64 not being legal. This patch
adjusts our reduction matchers so that they produce a VADDLV(sext A to
v4i32) instead.
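
A hand-written example of the kind of reduction this covers (assuming the
generic llvm.vector.reduce.add intrinsic naming; not a test from the patch):

  declare i64 @llvm.vector.reduce.add.v4i64(<4 x i64>)

  define arm_aapcs_vfpcc i64 @addv_v4i16_i64(<4 x i16> %x) {
    ; previously expanded; now matched as VADDLV of the vector sign-extended to v4i32
    %e = sext <4 x i16> %x to <4 x i64>
    %r = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %e)
    ret i64 %r
  }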

Differential Revision: https://reviews.llvm.org/D93622
2021-01-04 12:44:43 +00:00
Kazu Hirata 0e219b6443 [Target] Construct SmallVector with iterator ranges (NFC) 2021-01-03 09:57:45 -08:00
Kristof Beyls df8ed39283 [ARM] harden-sls-blr: avoid r12 and lr in indirect calls.
As a linker is allowed to clobber r12 on function calls, the code
transformation that hardens indirect calls is not correct in case a
linker does so.  Similarly, the transformation is not correct when
register lr is used.

This patch makes sure that r12 or lr are not used for indirect calls
when harden-sls-blr is enabled.

Differential Revision: https://reviews.llvm.org/D92469
2020-12-19 12:39:59 +00:00
David Green 03e675fd12 [ARM] Turn pred_cast(xor(x, -1)) into xor(pred_cast(x), -1)
This folds a not (an xor -1) through a predicate_cast, so that it can be
turned into a VPNOT and potentially be folded away as an else predicate
inside a VPT block.

Differential Revision: https://reviews.llvm.org/D92235
2020-12-08 15:22:46 +00:00
David Green eedf0ed63e [ARM] Mark select and selectcc of MVE vector operations as expand.
We already expand select and select_cc in codegenprepare, but they can
still be generated in some situations. Explicitly mark them as expand to
ensure they are not produced, which would otherwise lead to a failure to
select the nodes.

Differential Revision: https://reviews.llvm.org/D92373
2020-12-01 15:05:55 +00:00
David Green 7923d71b4a [ARM] PREDICATE_CAST demanded bits
The PREDICATE_CAST node is used to model moves between MVE predicate
registers and GPRs, and eventually becomes a VMSR p0, rn. When moving to
a predicate, only the bottom 16 bits of the source register are
demanded. This adds a simple fold for that, allowing it to potentially
remove instructions like uxth.

Differential Revision: https://reviews.llvm.org/D92213
2020-12-01 10:32:24 +00:00
Simon Pilgrim 1a62ca65c1 [KnownBits] Add KnownBits::commonBits helper. NFCI.
We have a frequent pattern where we're merging two KnownBits to get the common/shared bits, and I just fell for the gotcha where I tried to use the & operator to merge them.
2020-11-11 12:15:54 +00:00
David Green 73a6cd4b6b [ARM] Add a RegAllocHint for hinting t2DoLoopStart towards LR
This hints the operand of a t2DoLoopStart towards using LR, which can
help make it more likely to become t2DLS lr, lr. This makes it easier to
move if needed (as the input is the same as the output), or potentially
remove entirely.

The hint is added after others (from COPYs etc.), which still take
precedence. A place was needed to add the hint, which currently is the
post-isel custom inserter.

Differential Revision: https://reviews.llvm.org/D89883
2020-11-10 16:28:57 +00:00
David Green d14db8c8dc [ARM] Match MVE vqdmulh
This adds ISel matching for a form of VQDMULH. There are several IR
patterns that we could match to that instruction; this one is for:

min(ashr(mul(sext(a), sext(b)), 7), 127)

This is what llvm will optimize to once it has removed the max that
usually makes up the min/max saturate pattern, as in this case the
compare will always be false. The additional complication when matching
i32 patterns (which extend into an i64) is that the min will be a
vselect/setcc, as vmin is not supported for i64 vectors. Tablegen
patterns have also been updated to attempt to reuse the MVE_TwoOpPattern
patterns.
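
As a hand-made sketch of the 16-bit analogue of the quoted i8 pattern (using
the llvm.smin intrinsic for the min; the exact IR in the patch's tests may
differ):

  declare <8 x i32> @llvm.smin.v8i32(<8 x i32>, <8 x i32>)

  define arm_aapcs_vfpcc <8 x i16> @vqdmulh_v8i16(<8 x i16> %a, <8 x i16> %b) {
    ; min(ashr(mul(sext a, sext b), 15), 32767), truncated back to i16
    %ea = sext <8 x i16> %a to <8 x i32>
    %eb = sext <8 x i16> %b to <8 x i32>
    %m  = mul <8 x i32> %ea, %eb
    %s  = ashr <8 x i32> %m, <i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15, i32 15>
    %mn = call <8 x i32> @llvm.smin.v8i32(<8 x i32> %s,
          <8 x i32> <i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767, i32 32767>)
    %t  = trunc <8 x i32> %mn to <8 x i16>
    ret <8 x i16> %t
  }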

Differential Revision: https://reviews.llvm.org/D90096
2020-10-30 13:34:27 +00:00
David Sherwood 47f2dc7e5f [SVE][NFC] Replace some TypeSize comparisons in non-AArch64 Targets
In most of lib/Target we know that we are not dealing with scalable
types so it's perfectly fine to replace TypeSize comparison operators
with their fixed width equivalents, making use of getFixedSize()
and so on.

Differential Revision: https://reviews.llvm.org/D89101
2020-10-15 09:01:21 +01:00
Sam Tebbs 68e002e181 [ARM] Fold select_cc(vecreduce_[u|s][min|max], x) into VMINV or VMAXV
This folds a select_cc or select(set_cc) of a max or min vector reduction with a scalar value into a VMAXV or VMINV.
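
For instance (a hand-written sketch using the generic reduction intrinsic,
not a test from the patch):

  declare i32 @llvm.vector.reduce.smax.v4i32(<4 x i32>)

  define arm_aapcs_vfpcc i32 @vmaxv_fold(<4 x i32> %v, i32 %x) {
    ; select(icmp(smax-reduction, x), smax-reduction, x) -> VMAXV with %x as the start value
    %r = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> %v)
    %c = icmp sgt i32 %r, %x
    %s = select i1 %c, i32 %r, i32 %x
    ret i32 %s
  }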

    Differential Revision: https://reviews.llvm.org/D87836
2020-10-06 14:44:58 +01:00
Amara Emerson c9f5cdd453 Revert "[ARM]Fold select_cc(vecreduce_[u|s][min|max], x) into VMINV or VMAXV"
This reverts commit 2573cf3c3d.

These seem to break some lit tests.
2020-10-05 10:52:43 -07:00
Sam Tebbs 2573cf3c3d [ARM]Fold select_cc(vecreduce_[u|s][min|max], x) into VMINV or VMAXV
This folds a select_cc or select(set_cc) of a max or min vector reduction with a scalar value into a VMAXV or VMINV.

    Differential Revision: https://reviews.llvm.org/D87836
2020-10-05 15:51:28 +01:00
David Green 7feafa0286 [ARM] Fix pointer offset when splitting stores from VMOVDRR
We were not accounting for the pointer offset when splitting a store from
a VMOVDRR node, which could lead to incorrect aliasing info. In this
case it is the fneg via integer arithmetic that gives us a store->load
pair that we started getting wrong.

Differential Revision: https://reviews.llvm.org/D88653
2020-10-03 16:47:50 +01:00