Commit Graph

61980 Commits

Author SHA1 Message Date
David Green 3a6365a439 [ARM] Add FeatureHasNoBranchPredictor for Thumb1 cores
Mark v6m/v8m-baseline cores as having no branch predictors. This should
not alter very much on its own, but it is more accurate, as these cores do
not have branch predictors, and it can help in the future.
2021-03-30 21:45:26 +01:00
Fangrui Song 73adc05ced [GlobalISel] Fix -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off builds after D99463 2021-03-30 12:52:56 -07:00
Amara Emerson a35c2c7942 [GlobalISel] Implement fewerElements legalization for vector reductions.
This patch adds 3 methods. For power-of-2 vectors, the legalization uses
tree reductions with vector ops before a final reduction op. For non-pow-2
types it generates multiple narrow reductions and combines the values with
scalar ops.
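
For intuition, here is a scalar sketch of the power-of-2 tree-reduction
strategy (illustrative only; the names and types are assumptions, not the
GlobalISel code):

    #include <cstddef>

    // Pairwise tree reduction: each level halves the element count,
    // mirroring the vector-op levels described above before the final
    // reduction op collapses the last pair.
    float treeReduceAdd(float *v, std::size_t n) { // n must be a power of 2
      for (std::size_t width = n / 2; width >= 1; width /= 2)
        for (std::size_t i = 0; i < width; ++i)
          v[i] += v[i + width];
      return v[0];
    }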

Differential Revision: https://reviews.llvm.org/D97163
2021-03-30 11:19:21 -07:00
Amara Emerson 1bc90847ee [AArch64][GlobalISel] Define some legalization rules for G_ROTR and G_ROTL.
For imported pattern purposes, we have a custom rule that promotes the rotate
amount to 64b as well.

Differential Revision: https://reviews.llvm.org/D99463
2021-03-30 11:11:19 -07:00
Jessica Paquette 700431128e [GlobalISel][AArch64] Combine G_SEXT_INREG + right shift -> G_SBFX
Basically a port of isBitfieldExtractOpFromSExtInReg in AArch64ISelDAGToDAG.

This is only done post-legalization for now. Once the legalizer knows how to
decompose these back into shifts, this requirement can probably be removed.
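
For reference, a signed bitfield extract computes the following (a scalar
sketch of the G_SBFX semantics; the function name and bounds are
illustrative assumptions):

    #include <cstdint>

    // Extract `width` bits starting at bit `lsb`, then sign-extend.
    // Assumes 0 < width and lsb + width <= 32.
    int32_t signedBitfieldExtract(int32_t x, unsigned lsb, unsigned width) {
      // Shift the field to the top, then arithmetic-shift back down so the
      // field's top bit becomes the sign bit.
      return (int32_t)((uint32_t)x << (32 - lsb - width)) >> (32 - width);
    }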

Differential Revision: https://reviews.llvm.org/D99230
2021-03-30 10:14:30 -07:00
Craig Topper a33fcafaf0 [RISCV] Pass 'half' in the lower 16 bits of an f32 value when F extension is enabled, but Zfh is not.
Without Zfh the half type isn't legal, but it could still be
used as an argument/return in IR. Clang will not generate this today.

Previously we promoted the half value to float for arguments and
returns if the F extension is enabled but Zfh isn't. Then depending on
which ABI is enabled we would pass it in either an FPR or a GPR in
float format.

If the F extension isn't enabled, it would get passed in the lower
16 bits of a GPR in half format.

With this patch the value will always be in half format and will be
in the lower bits of a GPR or FPR. This should be consistent
with where the bits are located when Zfh is enabled.

I've based this implementation off of how this is done on ARM.

I've manually NaN-boxed the value to 32 bits using integer ops.
It looks like flw, fsw, fmv.s, fmv.w.x, and fmv.x.w won't
canonicalize NaNs, so they should leave the value alone. I think those
are the instructions that could get used on this value.
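
For reference, NaN-boxing a half into 32 bits with integer ops amounts to
filling the upper bits with ones (a minimal sketch, not the backend code):

    #include <cstdint>

    // RISC-V NaN-boxing: a narrower FP value lives in the low bits of a
    // wider FP register with all 1s above it.
    uint32_t nanBoxHalf(uint16_t halfBits) {
      return 0xFFFF0000u | halfBits; // upper 16 bits all ones
    }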

Reviewed By: kito-cheng

Differential Revision: https://reviews.llvm.org/D98670
2021-03-30 09:47:54 -07:00
Tomas Matheson a9968c0a33 [NFC][CodeGen] Tidy up TargetRegisterInfo stack realignment functions
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
its name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.

This patch attempts to clarify the situation by separating them and introducing
new names:

 - shouldRealignStack - true if there is any reason the stack should be
   realigned

 - canRealignStack - true if we are still able to realign the stack (e.g. we
   can still reserve/have reserved a frame pointer)

 - hasStackRealignment = shouldRealignStack && canRealignStack (not target
   customisable)

Targets can now override shouldRealignStack to indicate that stack realignment
is required.
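
As a self-contained sketch of the relationship (the field and method names
here are illustrative, not the TargetRegisterInfo signatures):

    // The two overridable queries, and the fixed composition of them.
    struct StackRealignQueries {
      bool shouldRealign; // any reason the stack should be realigned
      bool canRealign;    // realignment is still possible (e.g. FP reserved)
      bool hasStackRealignment() const { return shouldRealign && canRealign; }
    };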

This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).

Differential Revision: https://reviews.llvm.org/D98716

Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
2021-03-30 17:31:39 +01:00
Craig Topper f069000b43 [RISCV] Remove floating point condition code legalization from lowerFixedLengthVectorSetccToRVV.
After D98939, this is done by LegalizeVectorOps, making this code dead.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D99519
2021-03-30 09:11:56 -07:00
Sebastian Neubauer 1c3b74f0ab [AMDGPU] Remove outdated TODOs. NFC
spillSGPRToVGPR is already respected in these places since D95768.

Differential Revision: https://reviews.llvm.org/D99570
2021-03-30 15:18:49 +02:00
Sanjay Patel e694e19a79 [x86] enhance matching of pmaddwd
This was crashing with the example from:
https://llvm.org/PR49716
...and that was avoided with a283d72583,
but as we can see from the SSE vs. AVX test code diff,
we can try harder to match the pattern.

This matcher code was adapted from another pmadd pattern
match in D49636, but it needs different ops to deal with
size mismatches.
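
For context, pmaddwd multiplies adjacent signed 16-bit pairs and sums each
pair into a 32-bit lane; a scalar model, with sizes fixed to one 128-bit
vector for illustration:

    #include <cstdint>

    // out[i] = a[2i]*b[2i] + a[2i+1]*b[2i+1], all in signed arithmetic.
    void pmaddwd(const int16_t a[8], const int16_t b[8], int32_t out[4]) {
      for (int i = 0; i < 4; ++i)
        out[i] = (int32_t)a[2 * i] * b[2 * i] +
                 (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }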

Differential Revision: https://reviews.llvm.org/D99531
2021-03-30 07:28:33 -04:00
David Green d4b3380dfe [ARM] Handle Splats in MVE lane interleaving
As another addition to MVE lane interleaving, this handles Splat shuffle
vectors, as the shuffle of a splat is a splat.

Differential Revision: https://reviews.llvm.org/D97291
2021-03-30 11:19:16 +01:00
Joe Ellis a7dde4c5f7 [AArch64][SVE] Lower fixed length INSERT_VECTOR_ELT
Differential Revision: https://reviews.llvm.org/D98496
2021-03-30 09:37:11 +00:00
Joe Ellis c4d39f64d0 [AArch64][SVE] Lower fixed length EXTRACT_VECTOR_ELT
Differential Revision: https://reviews.llvm.org/D98625
2021-03-30 09:35:44 +00:00
Bing1 Yu 0c63b862c4 Revert "[X86] Pass to transform tdpbsud&tdpbusd&tdpbuud intrinsics to scalar operation"
This reverts commit 275df61f04.
2021-03-30 16:33:07 +08:00
Bing1 Yu 275df61f04 [X86] Pass to transform tdpbsud&tdpbusd&tdpbuud intrinsics to scalar operation
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D99244
2021-03-30 16:21:10 +08:00
Jun Ma 65462a08bf [NFC][SVE] Remove redundant pattern 2021-03-30 10:35:08 +08:00
Jun Ma 1af373c673 [AArch64][SVE] Codegen dup_lane for dup(vector_extract)
Differential Revision: https://reviews.llvm.org/D99324
2021-03-30 10:35:08 +08:00
Jun Ma b0db2dbc29 [AArch64][SVEIntrinsicOpts] Optimize tbl+dup into dup+extractelement
Differential Revision: https://reviews.llvm.org/D99412
2021-03-30 10:35:08 +08:00
Evandro Menezes fd94cfeeb5 [RISCV] Move scheduling resources for B into a separate file (NFC)
Differential Revision: https://reviews.llvm.org/D99557
2021-03-29 20:37:22 -05:00
Thomas Lively a1b8b0739a [WebAssembly] Fix i8x16.popcnt opcode
When I updated the SIMD opcodes in f5764a8654, I accidentally missed updating
i8x16.popcnt. This patch fixes the omission.

Differential Revision: https://reviews.llvm.org/D99536
2021-03-29 17:23:15 -07:00
Florian Hahn 482283042f [AArch64] Remove custom zext/sext legalization code.
Currently performExtendCombine assumes that the src-element bitwidth * 2
is a valid MVT. But this is not the case for i1 and it causes a crash on
the v64i1 test cases added in this patch.

It turns out that this code appears to not be needed; the same patterns are
handled by other code and we end up with the same results, even without the
custom lowering. I also added additional test cases in a50037aaa6.

Let's just remove the unneeded code.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D99437
2021-03-29 22:22:05 +01:00
Nikita Popov 7669455df4 [X86][FastISel] Fix with.overflow eflags clobber (PR49587)
If the successor block has a phi node, then additional moves may
be inserted into predecessors, which may clobber eflags. Don't try
to fold the with.overflow result into the branch in that case.

This is done by explicitly checking for any phis in successor
blocks, not sure if there's some more principled way to address
this. Other fused compare and branch patterns avoid the issue by
emitting the comparison when handling the branch, so that no
instructions may be inserted in between. In this case, the
with.overflow call is emitted separately (and I don't think this
is avoidable, as it will generally have at least two users).

Fixes https://bugs.llvm.org/show_bug.cgi?id=49587.

Differential Revision: https://reviews.llvm.org/D98600
2021-03-29 23:08:47 +02:00
Stanislav Mekhanoshin 619b88849e [AMDGPU] Fix "Sequence" spelling. NFC. 2021-03-29 12:11:36 -07:00
Joe Nash 45fd7c02af Revert "[AMDGPU] Mark additional VOP3 as commutable"
This reverts commit d35d8da7d6.
2021-03-29 14:48:11 -04:00
Joe Nash d35d8da7d6 [AMDGPU] Mark additional VOP3 as commutable
Note, only src0 and src1 will be commuted if the isCommutable flag
is set. This patch does not change that, it just makes it possible
to commute src0 and src1 of more instructions.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D99376

Change-Id: I61e20490962d95ea429beb355c55f55c024dafdc
2021-03-29 14:22:20 -04:00
Craig Topper 3dd4aa7d09 [RISCV] When custom iseling masked loads/stores, copy the mask into V0 instead of virtual register.
This matches what we do in our isel patterns. In our internal
testing we've found this is needed to make the fast register
allocator happy at -O0. Otherwise it may assign V0 to an earlier
operand and find itself with no registers left when it reaches
the mask operand. By using V0 explicitly, the fast register allocator
will see it when it checks for phys register usages before it
starts allocating vregs. I'll try to update this with a test case.

Unfortunately, this does appear to prevent some instruction reordering
by the pre-RA scheduler which leads to the increased spills seen in
some tests. I suspect that problem could already occur for other
instructions that already used V0 directly.

There's a lot of repeated code here that could do with some
wrapper functions. Not sure if that should be at the level of the
new code that deals with V0. That would require multiple output
parameters to pass the glue, chain and register back. Maybe it
should be at a higher level over the entire set of push_backs.

Reviewed By: frasercrmck, HsiangKai

Differential Revision: https://reviews.llvm.org/D99367
2021-03-29 10:20:43 -07:00
Craig Topper 54bacaf311 [X86] Always use rip-relative addressing on 64-bit when rematerializing all zeros/ones registers using a folded load.
Previously we only used RIP-relative addressing when PIC was enabled. But
we know we're in the small/kernel code model here, so we should
be able to always use RIP-relative addressing, which will give a smaller
encoding.

Here's a godbolt link that demonstrates the current codegen: https://godbolt.org/z/j3158o
Note in the non-PIC version the load from .LCPI0_0 doesn't use
RIP-relative addressing, but if you change the constant in the
source from 0.0 to 1.0 it will become RIP-relative.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97208
2021-03-29 10:06:17 -07:00
Roger Ferrer Ibanez ef76a333fa [RISCV] Fix offset computation for RVV
In D97111 we changed the RVV frame layout when using sp or bp to address
the stack slots so we could address the emergency stack slot. The idea
is to put the RVV objects as far as possible (in offset terms) from the
frame reference register (sp / fp / bp).

When using fp this happens naturally because the RVV objects are already
the top of the stack and due to the constraints of RVV (VLENB being a
power of two >= 128) the stack remains aligned. The rest of this summary
does not apply to this case.

When using sp / bp we need to skip the non-RVV stack slots. The size of
the non-RVV objects is computed by subtracting the callee-saved
register size (whose computation is added in D97111 itself) from the total
size of the stack (which does not account for RVV stack slots). However,
we round that size to 16 bytes when computing it, and we end up
emitting a smaller offset that may belong to a scalar stack slot (see
D98801). So this change removes that rounding.

Also, because we want the RVV objects to be between the non-RVV stack slots
and the callee-saved register slots, we need to make sure the RVV
objects are properly aligned to 8 bytes. Adding a padding of 8 would
render the stack unaligned. So when allocating space for RVV (only when
we don't use fp) we need to add extra padding that preserves the stack
alignment. This way we can round the offset that skips the non-RVV
objects to 8 bytes without misaligning the whole stack along the way. In
some circumstances this means that the RVV objects may have padding
before (= lower offsets from sp/bp) and after (before the CSR stack
slots).

Differential Revision: https://reviews.llvm.org/D98802
2021-03-29 17:03:49 +00:00
Bradley Smith 9745dce8c3 [SelectionDAG][AArch64][SVE] Perform SETCC condition legalization in LegalizeVectorOps
This is currently performed in SelectionDAGLegalize, here we make it also
happen in LegalizeVectorOps, allowing a target to lower the SETCC condition
codes first in LegalizeVectorOps and then lower to a custom node afterwards,
without having to duplicate all of the SETCC condition legalization in the
target specific lowering.

As a result of this, fixed length floating point SETCC nodes can now be
properly lowered for SVE.

Differential Revision: https://reviews.llvm.org/D98939
2021-03-29 15:32:25 +01:00
Simon Pilgrim 805148eaf2 [X86][SSE] combineHorizOpWithShuffle - consistently use getTargetShuffleInputs to decode shuffles
Minor cleanup before I start trying to merge the unary/binary shuffle combining paths.
2021-03-29 11:31:19 +01:00
Nashe Mncube 19601a4c6c [SVE][Analysis]Instruction costs for ops on scalable-vec
The following operations have no associated cost
when applied to scalable vectors, and as a consequence
can trigger a crash when
AArch64TTIImpl::getCastInstrCost() is called:
- fptrunc
- trunc
- fpext
- fpto(u,s)i

This patch adds costs for these operations and
relevant regression tests.

Differential Revision: https://reviews.llvm.org/D98934
2021-03-29 11:15:50 +01:00
David Green 3a68c6d26c [ARM] Extend MVE lane interleaving to handle other non-instruction leaves
This extends the recent MVE lane interleaving pass to handle other
non-instruction leaves, for which a new shuffle is added. This helps
especially for constants and potentially for arguments.

Differential Revision: https://reviews.llvm.org/D97289
2021-03-29 09:05:45 +01:00
David Green 6c88ffeda3 [ARM] Fix the Changed value in the MVE lane interleaving pass. 2021-03-28 23:47:53 +01:00
Craig Topper 69bdf35dc7 [X86] Optimize vXi8 MULHS on targets where we can't sign_extend to the next register size.
For these cases we need to extract the upper or lower elements,
multiply them using 16-bit multiplies and repack them.

Previously we used punpcklbw/punpckhbw+psraw or pmovsxbw+pshufd to
extract and sign extend so we could use pmullw to compute the 16-bit
product and then shift down the high bits.

We can avoid the need to sign extend if we unpack the bytes into
the high byte of each word and fill the lower byte with 0 using
pxor. This puts the sign bit of each byte into the sign bit of
each word. Since the LHS and RHS have 8 trailing zeros, the full
32-bit product of those 16-bit values will have 16 trailing zeros.
This means the 16-bit product of the original bytes is in the upper
16 bits which we can calculate using pmulhw.
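
A scalar model of the trick for one pair of bytes (illustrative; the names
are assumptions):

    #include <cstdint>

    // Put each byte in the high half of a 16-bit lane (low byte zeroed, as
    // pxor+punpck do), then take the high 16 bits of the 16x16 product, as
    // pmulhw does. That yields the full 16-bit product of the bytes.
    int8_t mulhsByte(int8_t a, int8_t b) {
      int16_t wa = (int16_t)((uint16_t)a << 8);
      int16_t wb = (int16_t)((uint16_t)b << 8);
      int16_t prod = (int16_t)(((int32_t)wa * wb) >> 16); // == a * b
      return (int8_t)(prod >> 8); // MULHS: high 8 bits of the byte product
    }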

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98587
2021-03-28 11:41:29 -07:00
David Green 7b6f760fcd [ARM] MVE vector lane interleaving
MVE does not have a single sext/zext or trunc instruction that takes the
bottom half of a vector and extends to a full width, like NEON has with
MOVL. Instead it is expected that this happens through top/bottom
instructions. So the MVE equivalent VMOVLT/B instructions take either
the even or odd elements of the input and extend them to the larger
type, producing a vector with half the number of elements each of double
the bitwidth. As there is no simple instruction for a normal extend, we
often have to expand sext/zext/trunc into a series of lane moves (or
stack loads/stores, which we do not do yet).

This pass takes vector code that starts at truncs, looks for
interconnected blobs of operations that end with sext/zext and
transforms them by adding shuffles so that the lanes are interleaved and
the MVE VMOVL/VMOVN instructions can be used. This is done pre-ISel so
that it can work across basic blocks.

This initial version of the pass just handles a limited set of
instructions, not handling constants or splats or FP, which can all come
as extensions to this base.

Differential Revision: https://reviews.llvm.org/D95804
2021-03-28 19:34:58 +01:00
Florian Hahn eb3d9f2eb6 [SelDag] Add isIntOrFPConstant helper function.
This patch adds a new isIntOrFPConstant helper function to check if an
SDValue is an integer or FP constant. This pattern is used in various
places.

There also are places that incorrectly just check for integer constants,
e.g. D99384, so hopefully this helper will help people avoid that issue.
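
Conceptually the helper is just the following check (a sketch; the exact
location and signature in the tree may differ):

    // True if V is either an integer or a floating-point constant node.
    bool isIntOrFPConstant(SDValue V) {
      return isa<ConstantSDNode>(V) || isa<ConstantFPSDNode>(V);
    }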

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D99428
2021-03-28 12:48:58 +01:00
Hsiangkai Wang bc82e9bf25 [RISCV] Add vfabs.v pseudo instruction.
Differential Revision: https://reviews.llvm.org/D99454
2021-03-28 10:24:05 +08:00
Craig Topper 5692fc38e0 [RISCV] Add a pattern for (sext_inreg (mul (and X, 0xffffffff), (and Y, 0xffffffff)), i32) to suppress MULW formation
We have a special pattern for
(mul (and X, 0xffffffff), (and Y, 0xffffffff)) that optimizes the
ANDs to shifts. But if a sext_inreg comes first, we'll form a MULW
and limit the effectiveness of the special match. So this patch
adds a larger pattern to suppress the MULW formation by emitting
a sext.w and then the same output we use for
(mul (and X, 0xffffffff), (and Y, 0xffffffff)). This should all
get CSEd.
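
Roughly the kind of RV64 source that produces this DAG (the exact IR shape
is an assumption):

    #include <cstdint>

    // The masked 64-bit multiply is (mul (and X, 0xffffffff),
    // (and Y, 0xffffffff)); returning it as int32_t sign-extends the low
    // 32 bits per the RV64 ABI, giving the sext_inreg.
    int32_t maskedMulLow(uint64_t x, uint64_t y) {
      return (int32_t)(uint32_t)((x & 0xffffffffu) * (y & 0xffffffffu));
    }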

This is the issue I was trying to fix with D99029, but that affected
many more tests.
2021-03-27 15:37:18 -07:00
Simon Pilgrim 2a0d5da917 [X86][SSE] foldShuffleOfHorizOp - remove broadcast handling.
Remove VBROADCAST/MOVDDUP/splat-shuffle handling from foldShuffleOfHorizOp

This can all be handled by canonicalizeShuffleMaskWithHorizOp as long as we check that the HADD/SUB are only used once (to prevent infinite loops on slow-horizop targets which will try to reuse the nodes again followed by a post-hop shuffle).
2021-03-27 15:09:23 +00:00
Simon Pilgrim 41146bfe82 [X86][SSE] combineX86ShuffleChain - attempt to recognise 'hidden' identity shuffles
See if the combined shuffle mask is equivalent to an identity shuffle; typically this is due to repeated LHS/RHS ops in horiz-ops, but isTargetShuffleEquivalent might see other patterns as well.

This is another small step towards getting rid of foldShuffleOfHorizOp and relying on canonicalizeShuffleMaskWithHorizOp and generic shuffle combining.
2021-03-27 11:09:30 +00:00
Sanjay Patel a283d72583 [x86] prevent crashing while matching pmaddwd
This could crash in 2 ways: either one or both of
the input vectors could be a different size than
the math ops.

https://llvm.org/PR49716
2021-03-27 05:27:14 -04:00
Craig Topper 4d5ee71b52 [RISCV] Merge FMulAdd and FMulSub scheduler classes to a single FMA scheduler class. NFC
It's unlikely that FMADD and FMSUB would have different scheduling
information so merge them.

Reviewed By: HsiangKai

Differential Revision: https://reviews.llvm.org/D99140
2021-03-26 16:37:20 -07:00
Craig Topper c41f2f6492 [RISCV] Add scheduler classes for the Zba and Zbb extensions.
I've used IALU for the simplest operations from Zbb:
min, minu, max, maxu, sext.b, sext.h, zext.h, andn, orn, xnor

I've put add.uw in IALU32 and slli.uw in ShiftImm32.

Remaining instructions have received new classes.
All 3 sh*add are grouped together. sh*add.uw are grouped together.
Rotate left and right are together. Everything else got their own
class containing one instruction.

I think what I have here is the minimum granularity we need. I
could be convinced that we need more classes.

Reviewed By: evandro

Differential Revision: https://reviews.llvm.org/D99040
2021-03-26 14:15:29 -07:00
Simon Pilgrim c769ba9514 [X86][AVX] combineHorizOpWithShuffle - improve SHUFFLE(HOP(LOSUBVECTOR(X),HISUBVECTOR(X))) folding
Peek through bitcasts to find subvector splits and use getTargetShuffleInputs to decode target shuffles as well as ShuffleVectorSDNode.
2021-03-26 17:23:54 +00:00
Jay Foad 9d08f276d7 [AMDGPU] Use reductions instead of scans in the atomic optimizer
If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (which are generally bad for performance) and one dpp
operation.

Differential Revision: https://reviews.llvm.org/D98953
2021-03-26 15:38:14 +00:00
Zakk Chen 9049cf77e3 [RISCV] Add constraint for RVV indexed loads.
For correctness, add the constraint when the destination EEW does not
equal the source EEW.

The RVV spec has three register overlap rules and I implement the first
stricter constraint because the others are difficult to enforce.

Reviewed By: frasercrmck, craig.topper

Differential Revision: https://reviews.llvm.org/D98920
2021-03-26 07:23:24 -07:00
Jay Foad d92b4956d6 [AMDGPU] Inline FSHRPattern into its only use. NFC. 2021-03-26 09:32:02 +00:00
Craig Topper 8f62a80328 [RISCV] Optimize (and (shl GPR:, uimm5:), 0xffffffff) to use 2 shifts instead of 3.
The AND would normally become SLLI+SRLI, giving us two SLLIs and an SRLI
overall. We can detect this and combine the two SLLIs into one.
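
For illustration (the shift amount and registers are hypothetical):

    #include <cstdint>

    // (x << 5) & 0xffffffff
    // naive:    slli a0,a0,5 ; slli a0,a0,32 ; srli a0,a0,32
    // combined: slli a0,a0,37 ; srli a0,a0,32
    uint64_t shiftAndMask(uint64_t x) { return (x << 5) & 0xffffffffu; }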
2021-03-25 23:31:01 -07:00
Craig Topper 5a18c576c4 [RISCV] Don't call CheckAndMask from selectZExti32.
Now that targetShrinkDemandedConstant preserves 0xffffffff masks we
shouldn't need to call computeKnownBits here.
2021-03-25 22:07:41 -07:00
Jessica Paquette 23f657c165 [AArch64][GlobalISel] Emit bzero on Darwin
Darwin platforms for both AArch64 and X86 can provide optimized `bzero()`
routines. In this case, it may be preferable to use `bzero` in place of a
memset of 0.
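
For example, a plain zeroing memset is the kind of call that may now be
emitted as bzero on these targets (an illustrative sketch):

    #include <cstring>

    // A memset of 0; on Darwin this may now be selected as a bzero() call.
    void clearBuffer(void *p, std::size_t n) { std::memset(p, 0, n); }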

This adds a G_BZERO generic opcode, similar to G_MEMSET et al. This opcode can
be generated by platforms which may want to use bzero.

To emit the G_BZERO, this adds a pre-legalize combine for AArch64. The
conditions for this are largely a port of the bzero case in
`AArch64SelectionDAGInfo::EmitTargetCodeForMemset`.

The only difference in comparison to the SelectionDAG code is that, when
compiling for minsize, this will fire for all memsets of 0. The original code
notes that it's not beneficial to do this for small memsets; however, using
bzero here will save a mov from wzr. For minsize, I think that it's preferable
to prioritise omitting the mov.

This also fixes a bug in the libcall legalization code which would delete
instructions which could not be legalized. It also adds a check to make sure
that we actually get a libcall name.

Code size improvements (Darwin):

- CTMark -Os: -0.0% geomean (-0.1% on pairlocalalign)
- CTMark -Oz: -0.2% geomean (-0.5% on bullet)

Differential Revision: https://reviews.llvm.org/D99358
2021-03-25 17:14:25 -07:00