Commit Graph

7474 Commits

Author SHA1 Message Date
Simon Pilgrim 8767f3bb97 [X86][AVX] Remove X86ISD::SUBV_BROADCAST (PR38969)
Followup to D92645 - remove the remaining places where we create X86ISD::SUBV_BROADCAST, and fold splatted vector loads to X86ISD::SUBV_BROADCAST_LOAD instead.

Remove all the X86SubVBroadcast isel patterns, including all the fallbacks for when memory folding failed.
2020-12-18 15:49:53 +00:00
Simon Pilgrim 992fad03e2 [X86][AVX] Replace extract_subvector(broadcast(), 0) folds with generic SimplifyDemandedVectorEltsForTargetNode handling.
Simplifies a few more cases, notably shuffle demanded elts cases.
2020-12-18 11:51:10 +00:00
Simon Pilgrim 931e66bd89 [X86] Remove extract_subvector(subv_broadcast_load()) fold.
This was needed in an earlier version of D92645, but isn't now - and I've just noticed that it was potentially flawed depending on the relevant widths of the broadcasted and extracted subvectors.
2020-12-17 11:02:49 +00:00
Simon Pilgrim cdb692ee0c [X86] Add X86ISD::SUBV_BROADCAST_LOAD and begin removing X86ISD::SUBV_BROADCAST (PR38969)
Subvector broadcasts are only load instructions, yet X86ISD::SUBV_BROADCAST treats them more generally, requiring a lot of fallback tablegen patterns.

This initial patch replaces constant vector lowering inside lowerBuildVectorAsBroadcast with direct X86ISD::SUBV_BROADCAST_LOAD loads which helps us merge a number of equivalent loads/broadcasts.

As well as general plumbing/analysis additions for SUBV_BROADCAST_LOAD, I needed to wrap SelectionDAG::makeEquivalentMemoryOrdering so it can handle result chains from non generic LoadSDNode nodes.

Later patches will continue to replace X86ISD::SUBV_BROADCAST usage.

Differential Revision: https://reviews.llvm.org/D92645
2020-12-17 10:25:25 +00:00
QingShan Zhang ebdd20f430 Expand the fp_to_int/int_to_fp/fp_round/fp_extend as libcall for fp128
X86 and AArch64 expand these as libcalls inside the target, and PowerPC also
wants to expand them as libcalls for P8. This patch proposes an implementation
in the legalizer to share the logic, and removes the X86/AArch64 code to
avoid the duplication.

Reviewed By: Craig Topper

Differential Revision: https://reviews.llvm.org/D91331
2020-12-17 07:59:30 +00:00
Simon Pilgrim 553808d456 [X86] Rename reduction combiners to make it clearer what's happening. NFCI.
Since these are all working on reduction patterns, actually use that term in the function name to make them easier to search for.

At some point we're likely to start working with the ISD::VECREDUCE_* opcodes directly in the x86 backend, but that is still some way off.
2020-12-16 14:48:21 +00:00
Simon Pilgrim e55f7de946 [X86][SSE] combineReductionToHorizontal - don't rely on widenSubVector to handle illegal vector types.
Thanks to @asbirlea for reporting the bug.
2020-12-16 11:24:40 +00:00
Simon Pilgrim 712117338a [X86] Explicitly use SDValue instead of auto. NFCI.
Fix static analyzer warning about not using a SDValue&
2020-12-15 17:27:25 +00:00
Simon Pilgrim b0e5aea557 [X86] Remove unnecessary SUBV_BROADCAST combines. NFCI.
Noticed while dealing with D92645 - these are now handled by getFauxShuffleMask + shuffle combining code.
2020-12-15 16:54:34 +00:00
Simon Pilgrim bd07092669 [X86] Remove trailing whitespace. NFC. 2020-12-15 10:11:38 +00:00
Simon Pilgrim 15a31389b2 [X86][AVX] LowerBUILD_VECTOR - reduce 256/512-bit build vectors with zero/undef upper elements + pad.
As discussed on D92645, we don't do a good job of recognising when we don't require the full width of a ymm/zmm build vector because the upper elements are undef/zero.

This commit allows us to make use of implicit zeroing of upper elements with AVX instructions, which we emulate in DAG with an INSERT_SUBVECTOR into the bottom of an undef/zero vector of the original type.

This exposed a limitation in getTargetConstantBitsFromNode, which didn't extract bits from INSERT_SUBVECTORs of different element widths; I've included a fix for that as well to prevent a couple of regressions.
2020-12-15 10:11:38 +00:00
Harald van Dijk 9eac818370 [X86] Fix variadic argument handling for x32
The X86-64 ABI defines va_list as

  typedef struct {
    unsigned int gp_offset;
    unsigned int fp_offset;
    void *overflow_arg_area;
    void *reg_save_area;
  } va_list[1];

This means the size, alignment, and reg_save_area offset will depend on
whether we are in LP64 or in ILP32 mode, so this commit adds the checks.
Additionally, the VAARG_64 pseudo-instruction assumed 64-bit pointers, so
this commit adds a VAARG_X32 pseudo-instruction that behaves just like
VAARG_64, except for assuming 32-bit pointers.

Some of these changes were originally done by
Michael Liao <michael.hliao@gmail.com>.

Fixes https://bugs.llvm.org/show_bug.cgi?id=48428.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D93160
2020-12-14 23:47:27 +00:00
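The layout difference described in the x32 variadic-argument commit above can be sketched with a small model. This is only an illustration, not code from the patch: `VaListModel`, `VaListLayout`, `layoutLP64` and `layoutILP32` are hypothetical names, the pointer fields are replaced by integers of the pointer's width, and the numbers assume a host that aligns `std::uint64_t` to 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of the va_list element: the two pointer fields are
// represented by an integer type of the pointer's width.
template <typename PtrInt>
struct VaListModel {
  std::uint32_t gp_offset;
  std::uint32_t fp_offset;
  PtrInt overflow_arg_area;
  PtrInt reg_save_area;
};

using VaListLP64 = VaListModel<std::uint64_t>;   // 64-bit pointers
using VaListILP32 = VaListModel<std::uint32_t>;  // 32-bit pointers (x32)

// The three properties the commit message says depend on the pointer width.
struct VaListLayout {
  std::size_t size;
  std::size_t align;
  std::size_t regSaveAreaOffset;
};

inline VaListLayout layoutLP64() {
  return {sizeof(VaListLP64), alignof(VaListLP64),
          offsetof(VaListLP64, reg_save_area)};
}

inline VaListLayout layoutILP32() {
  return {sizeof(VaListILP32), alignof(VaListILP32),
          offsetof(VaListILP32, reg_save_area)};
}

// Sanity check of the model under the stated host assumption.
inline void checkVaListLayouts() {
  assert(layoutLP64().size > layoutILP32().size);
  assert(layoutLP64().regSaveAreaOffset > layoutILP32().regSaveAreaOffset);
}
```

Under that assumption the model gives size 24, alignment 8 and a `reg_save_area` offset of 16 for LP64, against 16, 4 and 12 for ILP32, which is why code that hard-codes the LP64 values is wrong for x32.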
Simon Pilgrim 5f5a2547c1 [X86] LowerBUILD_VECTOR - track zero/nonzero elements with APInt masks. NFCI.
Prep work for undef/zero 'upper elements' handling as proposed in D92645.
2020-12-14 16:28:45 +00:00
Kazu Hirata 913515e465 [Target] Use llvm::is_contained (NFC) 2020-12-13 19:35:10 -08:00
Simon Pilgrim d5c434d7dd [X86][SSE] combineX86ShufflesRecursively - add basic handling for combining shuffles of different widths (PR45974)
If a faux shuffle uses smaller shuffle inputs, try to recursively combine with those inputs directly instead of widening them immediately. Then widen all smaller inputs at the bottom of the recursion.

This will still mean we're generating nodes on the fly (PR45974) even if we don't combine to a new shuffle but it does help AVX2+ targets combine across xmm/ymm/zmm types, mainly as variable shuffles.
2020-12-13 17:18:07 +00:00
Simon Pilgrim 47321c311b [X86][SSE] combineReductionToHorizontal - add vXi8 ISD::MUL reduction handling (PR39709)
Default expansion leads to repeated extensions/truncations to/from vXi16 which shuffle combining and demanded elts can't completely unravel.

Better just to promote (any_extend) the input and perform a vXi16 reduction.

We'll be able to remove a lot of this if we ever get decent legalization support for reduction intrinsics in SelectionDAG.
2020-12-13 15:22:54 +00:00
Luo, Yuanke f80b29878b [X86] AMX programming model.
This patch implements the AMX programming model that was discussed on llvm-dev
(http://lists.llvm.org/pipermail/llvm-dev/2020-August/144302.html).
Thanks to Hal for the good suggestion on the RA. The fast RA is not in this patch yet.
This patch implements 7 components:

1. The C interface for the end user.
2. The AMX intrinsics in LLVM IR.
3. Transform load/store <256 x i32> to AMX intrinsics or split the
   type into two <128 x i32>.
4. The lowering from AMX intrinsics to AMX pseudo instructions.
5. Insert a pseudo ldtilecfg and build the def-use chain between ldtilecfg and the AMX
   instructions.
6. The register allocation for tile registers.
7. Morph AMX pseudo instructions into real AMX instructions.

Change-Id: I935e1080916ffcb72af54c2c83faa8b2e97d5cb0

Differential Revision: https://reviews.llvm.org/D87981
2020-12-10 17:01:54 +08:00
Saleem Abdulrasool ee74d1b420 X86: use a data driven configuration of Windows x86 libcalls (NFC)
Rather than creating a series of associated calls and ensuring that
everything is lined up, use a table driven approach that ensures that
the two always stay in sync.
2020-12-09 22:49:11 +00:00
Simon Pilgrim 24184dbb82 [X86] Fold CONCAT(VPERMV3(X,Y,M0),VPERMV3(Z,W,M1)) -> VPERMV3(CONCAT(X,Z),CONCAT(Y,W),CONCAT(M0,M1))
Further prep work toward supporting different subvector sizes in combineX86ShufflesRecursively
2020-12-09 14:29:32 +00:00
Kerry McLaughlin 4519ff4b6f [SVE][CodeGen] Add the ExtensionType flag to MGATHER
Adds the ExtensionType flag, which reflects the LoadExtType of a MaskedGatherSDNode.
Also updated SelectionDAGDumper::print_details so that details of the gather
load (is signed, is scaled & extension type) are printed.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D91084
2020-12-09 11:19:08 +00:00
Harald van Dijk 29c8ea6f1a [X86] Handle localdynamic TLS model in x32 mode
D92346 added TLS_(base_)addrX32 to handle TLS in x32 mode, but missed the
different TLS models. This diff fixes the logic for the local dynamic model
where `RAX` was used when `EAX` should be, and extends the tests to cover
all four TLS models.

Fixes https://bugs.llvm.org/show_bug.cgi?id=26472.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D92737
2020-12-08 21:06:00 +00:00
Tim Northover c5978f42ec UBSAN: emit distinctive traps
Sometimes people get minimal crash reports after a UBSAN incident. This change
tags each trap with an integer representing the kind of failure encountered,
which can aid in tracking down the root cause of the problem.
2020-12-08 10:28:26 +00:00
Simon Pilgrim 0101fb73de [X86] Fold MOVMSK(ICMP_SGT(X,-1)) -> NOT(MOVMSK(X))
Noticed while triaging PR37506
2020-12-06 17:56:41 +00:00
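The MOVMSK fold above can be checked on a scalar model. The sketch below is not the backend code: `V4i32`, `movmsk`, `icmpSgtMinusOne` and `foldedForm` are hypothetical helpers that model a 4 x i32 vector, and the NOT is assumed to apply only to the low 4 mask bits (one per lane).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4i32 = std::array<std::int32_t, 4>;

// Model of MOVMSK: collect the sign bit of each lane into a bitmask.
unsigned movmsk(const V4i32 &v) {
  unsigned mask = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    if (v[i] < 0)
      mask |= 1u << i;
  return mask;
}

// Model of a vector ICMP_SGT(X, -1): all-ones where the lane is > -1.
V4i32 icmpSgtMinusOne(const V4i32 &v) {
  V4i32 r{};
  for (std::size_t i = 0; i < v.size(); ++i)
    r[i] = v[i] > -1 ? -1 : 0;
  return r;
}

// Folded form: invert the sign mask, keeping one bit per lane.
unsigned foldedForm(const V4i32 &v) { return ~movmsk(v) & 0xFu; }

// Both forms agree on a few lane patterns, including mixed signs and zero.
inline void checkMovmskFold() {
  const V4i32 samples[] = {
      {5, -3, 0, -7}, {0, 0, 0, 0}, {-1, -1, -1, -1}, {1, -1, 2, -2}};
  for (const V4i32 &v : samples)
    assert(movmsk(icmpSgtMinusOne(v)) == foldedForm(v));
}
```

The fold works because `X > -1` is true exactly when the sign bit of `X` is clear, so the compare result's sign mask is the complement of the input's sign mask.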
Layton Kifer ac522f8700 [DAGCombiner] Fold (sext (not i1 x)) -> (add (zext i1 x), -1)
Move fold of (sext (not i1 x)) -> (add (zext i1 x), -1) from X86 to DAGCombiner to improve codegen on other targets.

Differential Revision: https://reviews.llvm.org/D91589
2020-12-06 11:52:10 -05:00
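The DAGCombiner fold above rests on a two-case identity for an i1 input. A minimal sketch, with `sextI1`, `zextI1`, `beforeFold` and `afterFold` as hypothetical names modelling the extensions to a 32-bit result:

```cpp
#include <cassert>
#include <cstdint>

// Sign-extending an i1: true becomes all-ones (-1), false becomes 0.
constexpr std::int32_t sextI1(bool b) { return b ? -1 : 0; }

// Zero-extending an i1: true becomes 1, false becomes 0.
constexpr std::int32_t zextI1(bool b) { return b ? 1 : 0; }

// (sext (not i1 x))
constexpr std::int32_t beforeFold(bool x) { return sextI1(!x); }

// (add (zext i1 x), -1)
constexpr std::int32_t afterFold(bool x) { return zextI1(x) + -1; }

// An i1 has only two values, so the identity can be checked exhaustively.
static_assert(beforeFold(false) == afterFold(false), "x = 0 case");
static_assert(beforeFold(true) == afterFold(true), "x = 1 case");

inline void checkSextNotFold() {
  assert(beforeFold(false) == -1 && afterFold(false) == -1);
  assert(beforeFold(true) == 0 && afterFold(true) == 0);
}
```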
Simon Pilgrim b96a521077 [X86] LowerRotate - enable custom lowering of ROTL/ROTR vXi16 on VBMI2 targets. 2020-12-04 12:16:59 +00:00
Simon Pilgrim d073805be6 [X86] LowerRotate - VBMI2 targets can lower vXi16 rotates using funnel shifts.
Ideally we'd do this inside DAGCombine but until we can make the FSHL/FSHR opcodes legal for VBMI2 it won't help us.
2020-12-04 11:29:23 +00:00
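Lowering a rotate through funnel shifts relies on a rotate being a funnel shift whose two inputs are the same value. The sketch below models this for 16-bit elements; `fshl16` and `rotl16` are hypothetical helpers that follow the generic funnel-shift-left definition (concatenate, shift left by the amount modulo the bit width, keep the high half), not the VBMI2 instructions themselves.

```cpp
#include <cassert>
#include <cstdint>

// Funnel shift left on 16-bit values: shift the 32-bit concatenation hi:lo
// left by (amt mod 16) and return the upper 16 bits.
std::uint16_t fshl16(std::uint16_t hi, std::uint16_t lo, unsigned amt) {
  std::uint32_t concat = (static_cast<std::uint32_t>(hi) << 16) | lo;
  return static_cast<std::uint16_t>((concat << (amt % 16)) >> 16);
}

// Reference rotate-left, written with two shifts and an OR.
std::uint16_t rotl16(std::uint16_t x, unsigned amt) {
  amt %= 16;
  if (amt == 0)
    return x;
  return static_cast<std::uint16_t>((x << amt) | (x >> (16 - amt)));
}

// Exhaustively confirm rotl(x, n) == fshl(x, x, n) for every i16 value and
// every shift amount from 0 to 31 (amounts wrap modulo 16).
bool rotateMatchesFunnelShift() {
  for (unsigned x = 0; x <= 0xFFFFu; ++x)
    for (unsigned amt = 0; amt < 32; ++amt)
      if (rotl16(static_cast<std::uint16_t>(x), amt) !=
          fshl16(static_cast<std::uint16_t>(x),
                 static_cast<std::uint16_t>(x), amt))
        return false;
  return true;
}
```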
Simon Pilgrim df1ddc4234 [X86] Let VBMI2 non-VLX targets still use funnel shift instructions 2020-12-04 11:06:43 +00:00
Simon Pilgrim 8eedd18fcb [X86] Remove unnecessary bitcast. NFC.
The X86ISD::SUBV_BROADCAST node is already VT
2020-12-04 09:44:57 +00:00
Xiang1 Zhang f2e2924463 [X86] Unbind the ebx with GOT address in regcall calling convention
No register can be allocated for an indirect call when it uses the regcall
calling convention and passes 5 or more arguments.
For example:
call vreg (ag1, ag2, ag3, ag4, ag5, ...) --> 5 regs (EAX, ECX, EDX, ESI, EDI)
are used to pass arguments and 1 reg (EBX) is used to hold the GOT pointer, so
no register can be allocated to vreg.

The Intel386 architecture provides 8 general purpose 32-bit registers. The RA
mostly uses 6 of them (EAX, EBX, ECX, EDX, ESI, EDI). 5 of these registers can
be used to pass function arguments (EAX, ECX, EDX, ESI, EDI).
EBX is used to hold the GOT pointer when making function calls via the PLT.
ESP and EBP are usually "reserved" in register allocation.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D91020
2020-12-04 10:00:13 +08:00
Harald van Dijk c9be4ef184 [X86] Add TLS_(base_)addrX32 for X32 mode
LLVM has TLS_(base_)addr32 for 32-bit TLS addresses in 32-bit mode, and
TLS_(base_)addr64 for 64-bit TLS addresses in 64-bit mode. x32 mode wants 32-bit
TLS addresses in 64-bit mode, which were not yet handled. This adds
TLS_(base_)addrX32 as copies of TLS_(base_)addr64, except that they use
tls32(base)addr rather than tls64(base)addr, and then restricts
TLS_(base_)addr64 to 64-bit LP64 mode, TLS_(base_)addrX32 to 64-bit ILP32 mode.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D92346
2020-12-02 22:20:36 +00:00
Simon Pilgrim f019362329 [X86] EltsFromConsecutiveLoads - remove old FIXME comment. NFC.
It's unlikely an undef element in a zero vector will be any use.
2020-12-02 17:21:41 +00:00
Simon Pilgrim 3900ec6f05 [X86] combineX86ShufflesRecursively - remove old FIXME comment. NFC.
It's unlikely an undef element in a zero vector will be any use, and SimplifyDemandedVectorElts now calls combineX86ShufflesRecursively so it's unlikely we actually have a dependency on these specific elements.
2020-12-02 16:29:38 +00:00
Simon Pilgrim 0dab7ecc5d [X86] EltsFromConsecutiveLoads - pull out repeated NumLoadedElts. NFCI. 2020-12-02 16:29:37 +00:00
Simon Pilgrim 1b209ff9e3 [DAG] Move vselect(icmp_ult, 0, sub(x,y)) -> usubsat(x,y) to DAGCombine (PR40111)
Move the X86 VSELECT->USUBSAT fold to DAGCombiner - there's nothing target specific about these folds.
2020-12-01 14:25:29 +00:00
Simon Pilgrim 6dbd0d36a1 [DAG] Move vselect(icmp_ult, -1, add(x,y)) -> uaddsat(x,y) to DAGCombine (PR40111)
Move the X86 VSELECT->UADDSAT fold to DAGCombiner - there's nothing target specific about these folds.

The SSE42 test diffs are relatively benign - it's avoiding an extra constant load in exchange for an extra xor operation - there are extra register moves, which is annoying as all those operations should commute them away.

Differential Revision: https://reviews.llvm.org/D91876
2020-12-01 11:56:26 +00:00
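The two VSELECT folds moved to DAGCombiner above (to USUBSAT and to UADDSAT) can be checked per element on 8-bit values. The commit titles abbreviate the compare, so the exact operands here are an assumption: the sketch uses `x < y` for the subtract case and `(x + y) < x` (unsigned wrap detection) for the add case. All function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Reference saturating ops, computed in a wider type and then clamped.
std::uint8_t usubsatRef(std::uint8_t x, std::uint8_t y) {
  return static_cast<std::uint8_t>(std::max(int(x) - int(y), 0));
}
std::uint8_t uaddsatRef(std::uint8_t x, std::uint8_t y) {
  return static_cast<std::uint8_t>(std::min(int(x) + int(y), 255));
}

// select(x <u y, 0, sub(x, y)): the pattern folded to usubsat(x, y).
std::uint8_t selectSub(std::uint8_t x, std::uint8_t y) {
  return x < y ? std::uint8_t(0) : static_cast<std::uint8_t>(x - y);
}

// select(add(x, y) <u x, -1, add(x, y)): the pattern folded to uaddsat(x, y).
std::uint8_t selectAdd(std::uint8_t x, std::uint8_t y) {
  std::uint8_t sum = static_cast<std::uint8_t>(x + y);  // wraps modulo 256
  return sum < x ? std::uint8_t(0xFF) : sum;
}

// Exhaustive check over all pairs of 8-bit inputs.
bool saturatingFoldsAgree() {
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned y = 0; y < 256; ++y) {
      std::uint8_t a = static_cast<std::uint8_t>(x);
      std::uint8_t b = static_cast<std::uint8_t>(y);
      if (selectSub(a, b) != usubsatRef(a, b) ||
          selectAdd(a, b) != uaddsatRef(a, b))
        return false;
    }
  return true;
}
```

Nothing in the check depends on a particular target, which matches the commits' point that these folds are not X86 specific.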
Harald van Dijk cdac34bd47 [X86] Zero-extend pointers to i64 for x86_64
For LP64 mode, this has no effect as pointers are already 64 bits.
For ILP32 mode (x32), this extension is specified by the ABI.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D91338
2020-11-30 18:51:23 +00:00
Simon Pilgrim 83d79ca5bf [X86][AVX512] Only lower to VPALIGNR if we have BWI (PR48322) 2020-11-30 10:51:24 +00:00
Simon Pilgrim 969918e177 [DAG] Legalize umin(x,y) -> sub(x,usubsat(x,y)) and umax(x,y) -> add(x,usubsat(y,x)) iff usubsat is legal
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.

Allows us to move the x86-specific lowering to the generic expansion code.

Differential Revision: https://reviews.llvm.org/D92183
2020-11-27 11:18:58 +00:00
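The umin/umax expansion above follows from the definition of unsigned saturating subtraction. A minimal sketch on 8-bit values, with `usubsat8`, `uminViaUsubsat` and `umaxViaUsubsat` as hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Unsigned saturating subtract: x - y, clamped at 0.
std::uint8_t usubsat8(std::uint8_t x, std::uint8_t y) {
  return x > y ? static_cast<std::uint8_t>(x - y) : std::uint8_t(0);
}

// umin(x, y) -> sub(x, usubsat(x, y))
// If x >= y this is x - (x - y) = y, otherwise x - 0 = x.
std::uint8_t uminViaUsubsat(std::uint8_t x, std::uint8_t y) {
  return static_cast<std::uint8_t>(x - usubsat8(x, y));
}

// umax(x, y) -> add(x, usubsat(y, x))
// If y > x this is x + (y - x) = y, otherwise x + 0 = x.
std::uint8_t umaxViaUsubsat(std::uint8_t x, std::uint8_t y) {
  return static_cast<std::uint8_t>(x + usubsat8(y, x));
}

// Exhaustive check against std::min / std::max over all 8-bit pairs.
bool minMaxExpansionsAgree() {
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned y = 0; y < 256; ++y) {
      std::uint8_t a = static_cast<std::uint8_t>(x);
      std::uint8_t b = static_cast<std::uint8_t>(y);
      if (uminViaUsubsat(a, b) != std::min(a, b) ||
          umaxViaUsubsat(a, b) != std::max(a, b))
        return false;
    }
  return true;
}
```

Neither expansion can overflow: the subtraction never goes below zero and the addition never exceeds the larger operand.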
Simon Pilgrim 8057ebf4a0 Revert rG12d59b696b330 "[DAG] Legalize umin(x,y) -> sub(x,usubsat(x,y)) and umax(x,y) -> add(x,usubsat(y,x)) iff usubsat is legal"
This reverts commit 12d59b696b.

Prematurely pushed this to trunk
2020-11-26 15:07:45 +00:00
Simon Pilgrim 12d59b696b [DAG] Legalize umin(x,y) -> sub(x,usubsat(x,y)) and umax(x,y) -> add(x,usubsat(y,x)) iff usubsat is legal
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.

Allows us to move the x86-specific lowering to the generic expansion code.
2020-11-26 14:47:28 +00:00
Simon Pilgrim 791040cd8b [DAG] LowerMINMAX - move default expansion to generic TargetLowering::expandIntMINMAX
This is part of the discussion on D91876 about trying to reduce custom lowering of MIN/MAX ops on older SSE targets - if we can improve generic vector expansion we should be able to relax the limitations in SelectionDAGBuilder when it will let MIN/MAX ops be generated, and avoid having to flag so many ops as 'custom'.
2020-11-22 13:02:27 +00:00
Simon Pilgrim 0341029bb4 [X86][AVX] LowerADDSAT_SUBSAT - avoid X86ISD::BLENDV in UADDSAT/USUBSAT v8i32/v4i64 lowering
Use the OR(CMP,ADD) / AND(CMP,SUB) patterns like we do on SSE targets.

Enable custom lowering for v8i32/v4i64 and generalize the 128-bit lowering code for any vector size - this also lets us use the slightly cheaper codegen for icmp_ugt instead of umin/umax.
2020-11-20 18:16:44 +00:00
Craig Topper a7eae62a42 [SelectionDAG][X86][PowerPC][Mips] Replace the default implementation of LowerOperationWrapper with the X86 and PowerPC version.
The default version only works if the returned node has a single
result. The X86 and PowerPC versions support multiple results,
allow a single result to be returned from a node with
multiple outputs, and allow a single result that is not result 0
of the node.

Also replace the Mips version since the new version should work
for it. The original version handled multiple results, but only
if the new node and original node had the same number of results.

Differential Revision: https://reviews.llvm.org/D91846
2020-11-20 10:06:53 -08:00
Simon Pilgrim 09a081f221 [X86][SSE] LowerADDSAT_SUBSAT - avoid X86ISD::BLENDV in UADDSAT/USUBSAT custom lowering
Use the OR(CMP,ADD) / AND(CMP,SUB) patterns like we do on pre-SSE4 targets.

We're still using X86ISD::BLENDV on some AVX targets as we don't do custom lowering for >= 256-bit vectors.

Really this (and combineVSelectWithAllOnesOrZeros) needs moving to DAGCombiner, but pre-SSE42 we see the vXi64 comparison type as a 2 x 32-bits result so we can't just rely on ComputeNumSignBits to give us the 'all bits' result we need.
2020-11-20 16:53:01 +00:00
Simon Pilgrim 14ae02fb33 [X86][AVX] Only share broadcasts of different widths from the same SDValue of the same SDNode (PR48215)
D57663 allowed us to reuse broadcasts of the same scalar value by extracting low subvectors from the widest type.

Unfortunately we weren't ensuring the broadcasts were from the same SDValue, just the same SDNode - which failed on multiple-value nodes like ISD::SDIVREM

FYI: I intend to request this be merged into the 11.x release branch.

Differential Revision: https://reviews.llvm.org/D91709
2020-11-19 12:15:18 +00:00
Craig Topper f0b0bab34d [X86] Use GF2P8AFFINEQB to implement vector bitreverse.
We can use GF2P8AFFINEQB to reverse bits in a byte. Shuffles are needed to reverse the bytes in elements larger than i8. LegalizeVectorOps takes care of inserting the shuffle for the larger element size.

We already have Custom lowering for v16i8 with SSSE3, v32i8 with AVX, and v64i8 with AVX512BW.

I think we might be able to use this for scalars too by moving into a vector and back. But I'll save that for a follow up as it's a little more involved.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D91515
2020-11-17 23:49:06 -08:00
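How an affine transform can reverse the bits of a byte can be sketched with a scalar model. This is an illustration only: `affineByte` is a hypothetical helper modelling the per-byte operation as I understand the instruction's documented semantics (result bit i is the parity of the input byte ANDed with matrix byte 7-i, XORed with bit i of the immediate), and the matrix constant `0x8040201008040201` is an assumption for this sketch, not a value taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the per-byte affine transform over GF(2): each result bit
// is the parity of (matrix row AND input), XORed with the matching imm bit.
std::uint8_t affineByte(std::uint64_t matrix, std::uint8_t x,
                        std::uint8_t imm) {
  std::uint8_t result = 0;
  for (unsigned i = 0; i < 8; ++i) {
    // Row used for result bit i is byte (7 - i) of the 64-bit matrix.
    std::uint8_t row = static_cast<std::uint8_t>(matrix >> (8 * (7 - i)));
    unsigned parity = 0;
    for (std::uint8_t bits = row & x; bits != 0; bits >>= 1)
      parity ^= bits & 1u;
    if (parity ^ ((imm >> i) & 1u))
      result |= static_cast<std::uint8_t>(1u << i);
  }
  return result;
}

// Reference bit reversal of a byte.
std::uint8_t bitreverse8(std::uint8_t x) {
  std::uint8_t r = 0;
  for (unsigned i = 0; i < 8; ++i)
    if (x & (1u << i))
      r |= static_cast<std::uint8_t>(1u << (7 - i));
  return r;
}

// Assumed matrix: byte k holds (1 << k), so result bit i selects input bit
// (7 - i).
constexpr std::uint64_t kBitReverseMatrix = 0x8040201008040201ULL;

bool affineMatchesBitreverse() {
  for (unsigned x = 0; x < 256; ++x)
    if (affineByte(kBitReverseMatrix, static_cast<std::uint8_t>(x), 0) !=
        bitreverse8(static_cast<std::uint8_t>(x)))
      return false;
  return true;
}
```

This only covers the reversal within a byte; as the commit message says, shuffles are still needed to reverse the byte order for elements larger than i8.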
Craig Topper 57c0c4a275 [X86] Fix crash with i64 bitreverse on 32-bit targets with XOP.
We unconditionally marked i64 as Custom, but did not install a
handler in ReplaceNodeResults when i64 isn't a legal type. This
leads to ReplaceNodeResults asserting.

There are two ways to fix this: either only mark i64 as Custom on
64-bit targets and let it expand to two i32 bitreverses, which
each need a VPPERM, or add the Custom handling to
ReplaceNodeResults. I went with the latter.
2020-11-15 19:02:34 -08:00
Craig Topper 114f044640 [X86] Use EVT::getIntegerVT instead of MVT::getIntegerVT where the type can be i2 or i4.
This was a mistake introduced in D91294. I'm not sure how to
exercise this with the existing code, but I hit it while trying
some follow up experiments.
2020-11-12 21:48:45 -08:00
Craig Topper a4124e455e [X86] When storing v1i1/v2i1/v4i1 to memory, make sure we store zeros in the rest of the byte
We can't store garbage in the unused bits. It's possible that something like zextload from i1/i2/i4 is created to read the memory. Those zextloads would be legalized assuming the extra bits are 0.

I'm not sure that the code in lowerStore is executed for the v1i1/v2i1/v4i1 case. It looks like the DAG combine in combineStore may have converted them to v8i1 first. And I think we're missing some cases to avoid going to the stack in the first place. But I don't have time to investigate those things at the moment so I wanted to focus on the correctness issue.

Should fix PR48147.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D91294
2020-11-12 21:28:18 -08:00
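The correctness issue in the v1i1/v2i1/v4i1 store commit above can be illustrated with a byte-level model. This is a sketch, not the lowering code: `storeMaskBits` and `zextLoadAssumingZeroUpperBits` are hypothetical helpers, and the "register" is modelled as a byte whose upper bits may hold garbage.

```cpp
#include <cassert>
#include <cstdint>

// Store the low numElts bits of a mask value, forcing the rest of the byte
// to zero. numElts is assumed to be 1, 2 or 4 here.
std::uint8_t storeMaskBits(std::uint8_t maskReg, unsigned numElts) {
  return static_cast<std::uint8_t>(maskReg & ((1u << numElts) - 1u));
}

// Model of a legalized zextload from iN: it loads the whole byte and
// assumes the bits above N are already zero, so it performs no masking.
std::uint32_t zextLoadAssumingZeroUpperBits(std::uint8_t storedByte) {
  return storedByte;
}

inline void checkMaskedStore() {
  const std::uint8_t reg = 0xF5;  // low 4 bits 0b0101, upper bits garbage
  // Storing the raw byte makes the later zextload read the wrong value.
  assert(zextLoadAssumingZeroUpperBits(reg) != 0x5u);
  // Masking before the store gives the expected zero-extended i4 value.
  assert(zextLoadAssumingZeroUpperBits(storeMaskBits(reg, 4)) == 0x5u);
}
```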
Simon Pilgrim 1a62ca65c1 [KnownBits] Add KnownBits::commonBits helper. NFCI.
We have a frequent pattern where we're merging two KnownBits to get the common/shared bits, and I just fell for the gotcha where I tried to use the & operator to merge them...
2020-11-11 12:15:54 +00:00
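The gotcha mentioned in the KnownBits commit above can be shown with a simplified model. `KnownBitsModel` is a hypothetical 8-bit stand-in, not the LLVM class. The sketch assumes that merging means keeping only what both inputs agree on, and that `&` on known bits means "known bits of the bitwise AND of the two values", which is my reading of why the operator is the wrong tool for merging.

```cpp
#include <cassert>
#include <cstdint>

// Simplified 8-bit stand-in for a known-bits record.
struct KnownBitsModel {
  std::uint8_t Zero;  // bits known to be 0
  std::uint8_t One;   // bits known to be 1
};

// Known bits of a constant: every bit is known.
KnownBitsModel fullyKnown(std::uint8_t value) {
  return {static_cast<std::uint8_t>(~value), value};
}

// Merge: a bit stays known only if both inputs know it with the same value.
KnownBitsModel commonBitsModel(KnownBitsModel a, KnownBitsModel b) {
  return {static_cast<std::uint8_t>(a.Zero & b.Zero),
          static_cast<std::uint8_t>(a.One & b.One)};
}

// Known bits of (valueA & valueB): a zero in either input forces a zero.
KnownBitsModel andOfValuesModel(KnownBitsModel a, KnownBitsModel b) {
  return {static_cast<std::uint8_t>(a.Zero | b.Zero),
          static_cast<std::uint8_t>(a.One & b.One)};
}

inline void checkKnownBitsGotcha() {
  // The value is either 0x0C or 0x0A, e.g. the two arms of a select.
  KnownBitsModel a = fullyKnown(0x0C);
  KnownBitsModel b = fullyKnown(0x0A);

  // Correct merge: bits 1 and 2 differ between the inputs, so they end up
  // unknown (set in neither Zero nor One).
  KnownBitsModel merged = commonBitsModel(a, b);
  assert((merged.Zero & 0x06) == 0 && (merged.One & 0x06) == 0);

  // Using the AND-of-values rule instead claims bit 2 is known zero, which
  // is wrong when the value is actually 0x0C.
  KnownBitsModel wrong = andOfValuesModel(a, b);
  assert((wrong.Zero & 0x04) != 0);
}
```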