Commit Graph

21769 Commits

Author SHA1 Message Date
Simon Pilgrim a26288e803 [X86][Atom] Fix vector fadd/fcmp/fmul resource/throughputs
Match whats documented in the Intel AOM - these are all fadd/fcmp use Port1 and fmul uses Port1, but in many cases BOTH ports are required - this was being incorrectly modelled as EITHER port.

Discovered while investigating the correct fptoui costs to fix the regressions in D101555.

Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
2021-05-20 18:56:58 +01:00
Simon Pilgrim 62fca69a70 [CostModel][X86][AVX2] Improve 256-bit vector non-uniform shifts costs
Haswell, Excavator and early Ryzen all have slower 256-bit non-uniform vector shifts (confirmed on AMDSoG/Agner/instlatx64 and llvm models) - so bump the worst case costs accordingly.

Noticed while investigating PR50364
2021-05-20 12:16:16 +01:00
Sanjay Patel f12f9beb04 [x86] propagate FMF from x86-specific intrinsic nodes to others during combining
This is another FMF gap exposed by D90901, but I don't see a way
to show the difference in a regression test as with:
f66ba4c
6025663

We will see an asm difference if we add a test as part of D90901.
2021-05-19 14:25:09 -04:00
Sanjay Patel f66ba4cfa7 [x86] propagate FMF from x86-specific intrinsic nodes to others during lowering
This is another fast-math-flags failure exposed by D90901.
2021-05-19 13:11:15 -04:00
Wang, Pengfei 9d09d20448 Reapply "[X86] Limit X86InterleavedAccessGroup to handle the same type case only"
The current implementation assumes the destination type of shuffle is the same as the decomposed ones. Add the check to avoid crush when the condition is not satisfied.

This fixes PR37616.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D102751
2021-05-19 22:27:16 +08:00
Simon Pilgrim 707fc2e2f2 Revert rG528bc10e95d5f9d6a338f9bab5e91d7265d1cf05 : "[X86FixupLEAs] Transform the sequence LEA/SUB to SUB/SUB"
Reports on D101970 indicate this is causing failures on multi-stage compiles.
2021-05-19 15:01:20 +01:00
Simon Pilgrim ab4e04a0f3 [X86][AVX] createVariablePermute - generalize the PR50356 fix for smaller indices vector as well
Generalize the fix from rGd0902a8665b1 by ensuring we widen/narrow the indices subvector first and then perform the ZERO_EXTEND_VECTOR_INREG (if necessary), which should allow us to perform the variable permutes with source/destination/indices vectors of any widths.
2021-05-19 14:39:41 +01:00
Simon Pilgrim b14f9a1ebd [X86][Atom] Fix vector integer shift by immediate resource/throughputs
Match whats documented in the Intel AOM (and Agner/instlatx64 agree) - these are all Port0 only.

Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
2021-05-19 14:39:40 +01:00
Wang, Pengfei 66513e2f20 Revert "[X86] Limit X86InterleavedAccessGroup to handle the same type case only"
This reverts commit ca23a38e37.

Revert due to EXPENSIVE_CHECKS fail.
2021-05-19 20:35:45 +08:00
Simon Pilgrim 222314d8b0 [X86] Atom (pre-SLM) doesn't support PTEST instructions 2021-05-19 12:25:29 +01:00
Simon Pilgrim 8c717920d8 [X86] Remove copy + paste typos in AtomWriteResPair comment.
Remnants from when the Atom model was copied from the Btver2 model.....
2021-05-19 12:25:28 +01:00
Wang, Pengfei ca23a38e37 [X86] Limit X86InterleavedAccessGroup to handle the same type case only
The current implementation assumes the destination type of shuffle is the same as the decomposed ones. Add the check to avoid crush when the condition is not satisfied.

This fixes PR37616.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D102751
2021-05-19 18:39:08 +08:00
Tim Northover c1dc267258 MachineBasicBlock: add liveout iterator aware of which liveins are defined by the runtime.
Using this in RegAlloc fast reduces register pressure, and in some cases allows
x86 code to compile that wouldn't before.
2021-05-19 11:00:24 +01:00
Guozhi Wei 528bc10e95 [X86FixupLEAs] Transform the sequence LEA/SUB to SUB/SUB
This patch transforms the sequence

    lea (reg1, reg2), reg3
    sub reg3, reg4

to two sub instructions

    sub reg1, reg4
    sub reg2, reg4

Similar optimization can also be applied to LEA/ADD sequence.
The modifications to TwoAddressInstructionPass is to ensure the operands of ADD
instruction has expected order (the dest register of LEA should be src register
of ADD).

Differential Revision: https://reviews.llvm.org/D101970
2021-05-18 18:02:36 -07:00
Fabian Sommer 5f2b276667 Default stack alignment of x86 NaCl to 16 bytes
X86 NaCl generally requires the stack to be aligned to 16 bytes.
This change was already implemented in two downstream NaCl compilers
based on llvm.

Reviewed By: dschuff

Differential Revision: https://reviews.llvm.org/D102610
2021-05-18 15:16:59 -07:00
Simon Pilgrim d0902a8665 [X86][AVX] createVariablePermute - correctly extend same-sized-vector indices (PR50356)
D101838 incorrectly handled indices vectors of the same size but with higher element counts to just bitcast to the target indices type instead of performing a ZERO_EXTEND_VECTOR_INREG
2021-05-18 20:30:46 +01:00
Simon Pilgrim 99c0f16ea4 [X86] Use Skylake Server model for x86-64-v4 so we have full instruction coverage
The x86-64-v4 generic cpu arch supports AVX512BW/DQ/CD/VLX which isn't covered by the Haswell model, use the SkylakeServer model instead which is a lot closer to what the arch represents.

Differential Revision: https://reviews.llvm.org/D102553
2021-05-18 18:06:40 +01:00
Roman Lebedev 75ea0abaae
[X86] AMD Zen 3: fix MULX modelling - don't forget about WriteIMulH (PR50387)
Otherwise lack thereof will be caught by a defensive check during
scheduling, and we'll crash.

I've literally never seen this syntax before..
2021-05-18 19:58:04 +03:00
Tim Northover ba1509da7b Recommit X86: support Swift Async context
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).

The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specfically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).

Recommited with a fix for unwind info when i386 pc-rel thunks are
adjacent to a prologue.
2021-05-18 15:19:05 +01:00
Roman Lebedev 3cc3960766
[X86] AMD Zen 3: cap LoopMicroOpBufferSize to workaround PR50384 (quadratic IndVars runtime)
While i would like to keep the right value here,
i would also like to be able to actually compile
e.g. vanilla test-suite.

256 is a pretty random guess, it should be pretty good enough
for serious loops, but small enough to result in tolerant
compile times for certain edge cases.

https://bugs.llvm.org/show_bug.cgi?id=50384
2021-05-18 15:56:57 +03:00
Simon Pilgrim 560b709abe [X86][AVX] Cleanup AVX2 vector integer truncation costs
Noticed while investigating PR50364, the truncation costs for v4i64->v4i16/v4i8 and v8i32->v8i8 were way too optimistic for a shuffle sequence that usually matches the AVX1 codegen (they matched AVX512 numbers which have actual truncation instructions!).
2021-05-18 13:07:29 +01:00
Mitch Phillips 6791a6b309 Revert "X86: support Swift Async context"
This reverts commit 747e5cfb9f.

Reason: New frame layout broke the sanitizer unwinder. Not clear why,
but seems like some of the changes aren't always guarded by Swyft
checks. See
https://reviews.llvm.org/rG747e5cfb9f5d944b47fe014925b0d5dc2fda74d7 for
more information.
2021-05-17 12:44:57 -07:00
Simon Pilgrim 41587466aa [X86] Don't dereference a dyn_cast<> - use a cast<> instead. NFCI.
dyn_cast<> can return null if the cast fails, by using cast<> we assert that the cast is correct helping to avoid a potential null dereference.
2021-05-17 15:58:32 +01:00
Tim Northover 747e5cfb9f X86: support Swift Async context
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).

The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specfically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).
2021-05-17 11:56:16 +01:00
Tim Northover 82a0e808bb IR/AArch64/X86: add "swifttailcc" calling convention.
Swift's new concurrency features are going to require guaranteed tail calls so
that they don't consume excessive amounts of stack space. This would normally
mean "tailcc", but there are also Swift-specific ABI desires that don't
naturally go along with "tailcc" so this adds another calling convention that's
the combination of "swiftcc" and "tailcc".

Support is added for AArch64 and X86 for now.
2021-05-17 10:48:34 +01:00
Simon Pilgrim 262e4200d1 [X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines (REAPPLIED). NFCI.
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.

Reapplies rGb95a103808ac (after reversion at rGc012a388a15b), with SSE3/SSSE3 typo fix, test added at rG0afb10de1449.
2021-05-16 13:50:58 +01:00
Nico Weber c012a388a1 Revert "[X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines. NFCI."
This reverts commit b95a103808.
Makes clang assert very early in a Chromium build. See
https://bugs.chromium.org/p/chromium/issues/detail?id=1209490#c1
for a standalone repro.
2021-05-15 12:20:02 -04:00
Simon Pilgrim 8cb04d891f [X86] X86OptimizeLEAPass::replaceDebugValue - take a copy of the DebugLoc not a reference as it may be deleted.
Fixes msan warning due to rG9ca2c50b3601
2021-05-15 16:28:20 +01:00
Simon Pilgrim 2ed89001e1 [X86] X86CmovConverterPass::convertCmovInstsToBranches - take a copy of the DebugLoc not a reference as it may be deleted.
Fixes msan warning due to rG9ca2c50b3601
2021-05-15 16:13:34 +01:00
Simon Pilgrim 73635adb86 X86SpeculativeLoadHardeningPass::hardenValueInRegister - assert that we have a i8/i16/i32/i64 sized register. NFCI.
Silence static analyzer warning for out-of-range access to the SubRegImms[] array.
2021-05-15 15:13:28 +01:00
Simon Pilgrim f9b1208681 [X86][Atom] Fix vector integer multiplication resource/throughputs
Match whats documented in the Intel AOM (and Agner/instlatx64 agree) - vector integer multiplies are pipelined - all Port0, throughput = 2 @ 128bits, 1 @ 64bits.

Noticed while checking reduction costs - now that we can use in-order models in llvm-mca, the atom model is the "worst case scenario" we have in x86.
2021-05-15 14:25:48 +01:00
Simon Pilgrim 9ca2c50b36 [X86] Try to pass DebugLoc by const-ref to avoid costly TrackingMDNodeRef copies (REAPPLIED). NFCI.
Reapply rG5ed56a821c06 (after reverted by rG7aa89c4a22fd) - don't take reference from struct that will be erased in X86FrameLowering::eliminateCallFramePseudoInstr
2021-05-15 13:23:28 +01:00
Mitch Phillips 7aa89c4a22 Revert "[X86] Try to pass DebugLoc by const-ref to avoid costly TrackingMDNodeRef copies. NFCI."
This reverts commit 5ed56a821c.

Reason: Broke the MSan buildbots. See Phabricator for more info
(https://reviews.llvm.org/rG5ed56a821c0622869739a3ae752eea97a1ee1f48).
2021-05-14 14:30:57 -07:00
Roman Lebedev 1fc1c88704
[X86] AMD Zen 3: same-reg AVX YMM VPCMPGT{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As measured by exegesis, and confirmed by ref docs.
2021-05-14 20:23:05 +03:00
Roman Lebedev 2f8572d8e2
[X86] AMD Zen 3: same-reg AVX XMM VPCMPGT{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As measured by exegesis, and confirmed by ref docs.
2021-05-14 20:23:04 +03:00
Roman Lebedev f8f7c765a0
[X86] AMD Zen 3: same-reg SSE XMM PCMPGT{B,W,D,Q} is a 1-cycle(!) dep-breaking zero-idiom
As measured by exegesis, and confirmed by ref docs.
2021-05-14 20:23:04 +03:00
Roman Lebedev 26eeb6e650
[X86] AMD Zen 3: same-reg AVX YMM VPSUBUS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:03 +03:00
Roman Lebedev 41a5dcdf87
[X86] AMD Zen 3: same-reg AVX XMM VPSUBUS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:03 +03:00
Roman Lebedev 6733fe5c0d
[X86] AMD Zen 3: same-reg SSE XMM PSUBUS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
2021-05-14 20:23:03 +03:00
Roman Lebedev 555e1d2987
[X86] AMD Zen 3: same-reg AVX YMM VPSUBS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:02 +03:00
Roman Lebedev 012417c980
[X86] AMD Zen 3: same-reg AVX XMM VPSUBS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
Yes, this one is also not zero-cycle.
2021-05-14 20:23:02 +03:00
Roman Lebedev 29c4f892fe
[X86] AMD Zen 3: same-reg SSE XMM PSUBS{B,W} is a 1-cycle(!) dep-breaking zero-idiom
Not really mentioned in ref docs, but measures as such.
2021-05-14 20:23:02 +03:00
Roman Lebedev 93f2642871
[X86] AMD Zen 3: same-reg AVX YMM VPSUB{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:23:01 +03:00
Roman Lebedev 7a45b96e04
[X86] AMD Zen 3: same-reg AVX XMM VPSUB{B,W,D,Q} is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:23:01 +03:00
Roman Lebedev 1ea8be214f
[X86] AMD Zen 3: same-reg SSE XMM PSUB{B,W,D,Q} is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:23:00 +03:00
Roman Lebedev ce22f53916
[X86] AMD Zen 3: same-reg AVX YMM VPANDN is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:23:00 +03:00
Roman Lebedev 44c2b4fe91
[X86] AMD Zen 3: same-reg AVX XMM VPANDN is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:23:00 +03:00
Roman Lebedev a72cacb53f
[X86] AMD Zen 3: same-reg SSE XMM PANDN is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:22:59 +03:00
Roman Lebedev 1d73c2b8cf
[X86] AMD Zen 3: same-reg AVX YMM VPXOR is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:22:59 +03:00
Roman Lebedev 31669b5073
[X86] AMD Zen 3: same-reg AVX XMM VPXOR is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 20:22:58 +03:00
Roman Lebedev 498bf365f4
[X86] AMD Zen 3: same-reg SSE XMM PXOR is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 20:22:58 +03:00
Simon Pilgrim b95a103808 [X86][SSE] Pull out combineToHorizontalAddSub helper from inside (F)ADD/SUB combines. NFCI.
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.
2021-05-14 16:52:55 +01:00
Roman Lebedev 4af4afe014
[X86] AMD Zen 3: same-reg AVX YMM VANDNPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev 17f99a8a41
[X86] AMD Zen 3: same-reg AVX XMM VANDNPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev 38ceb46fb0
[X86] AMD Zen 3: same-reg SSE XMM ANDNPD is a 1-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev d8a595b81c
[X86] AMD Zen 3: same-reg AVX YMM VANDNPS is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:24 +03:00
Roman Lebedev fd4cbc822b
[X86] AMD Zen 3: same-reg AVX XMM VANDNPS is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 14:06:23 +03:00
Roman Lebedev f38dcbecb6
[X86] AMD Zen 3: same-reg SSE XMM ANDNPS is a 1-cycle(!) dep-breaking zero-idiom
Same as SSE XMM XORPS/XORPD, it is not zero-cycle, even though it breaks the deps.
As confirmed by the exegesis measurements, and ref docs.
2021-05-14 14:06:23 +03:00
Simon Pilgrim 5ed56a821c [X86] Try to pass DebugLoc by const-ref to avoid costly TrackingMDNodeRef copies. NFCI. 2021-05-14 11:14:18 +01:00
Roman Lebedev 43a7f130a7
[X86] AMD Zen 3: same-reg AVX YMM VXORPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 11:56:07 +03:00
Roman Lebedev 336b9dbe88
[X86] AMD Zen 3: same-reg AVX XMM VXORPD is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis measurements, and ref docs.
2021-05-14 11:56:07 +03:00
Roman Lebedev 9c596bc541
[X86] AMD Zen 3: same-reg SSE XMM XORPD is a 1-cycle(!) dep-breaking zero-idiom
Same as with it's float friend, unlike their AVX versions.
As confirmed by exegesis, and ref docs.
2021-05-14 11:56:07 +03:00
Roman Lebedev 59554c01ab
[X86] AMD Zen 3: same-reg AVX YMM VXORPS is a zero-cycle(!) dep-breaking zero-idiom
As confirmed by exegesis, and ref docs.
2021-05-14 11:56:06 +03:00
Roman Lebedev 26c1bffe67
[X86] AMD Zen 3: same-reg AVX XMM VXORPS is a zero-cycle(!) dep-breaking zero-idiom
Unlike it's legacy SSE XMM XORPS version, which measures as being 1-cycle,
this one is certainly a zero-cycle instruction, in addition to both of them
being dependency breaking.

As confirmed by exegesis measurements, and ref docs.
2021-05-14 11:56:06 +03:00
Roman Lebedev 5fddc3312b
Revert "[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again"
As reported in post-commit feedback, this has issues with e.g. <16 x i1>:
https://llvm.godbolt.org/z/jxPvdGEW4

This reverts commit c02476f315.
2021-05-14 00:03:36 +03:00
Roman Lebedev 6b95fd199d
Revert "[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost()"
Depends on a commit that is about to be reverted.

This reverts commit 69ed93a435.
2021-05-14 00:03:36 +03:00
Roman Lebedev aa0dcb3ba4
[X86] AMD Zen 3: same-reg SSE XMM XORPS is a 1-cycle(!) dep-breaking one-idiom
While both the SOG and Agner insist that it is zero-cycle,
i can not confirm that claim. While it clearly breaks the dependency,
i can not come up with a snippet, or measurement approach,
to end up with IPC bigger than 4, which, to me, means that it actually
consumes execution resource of an FP unit for a cycle.
2021-05-14 00:03:36 +03:00
Simon Pilgrim ba0ec1be29 [X86] X86ExpandPseudo.cpp - try to pass DebugLoc by const-ref to avoid costly TrackingMDNodeRef copies. NFCI. 2021-05-13 13:31:54 +01:00
Simon Pilgrim 4956655640 [X86] X86InstrInfo.cpp - try to pass DebugLoc by const-ref to avoid costly TrackingMDNodeRef copies. NFCI. 2021-05-13 13:31:53 +01:00
Simon Pilgrim 9dfc4ac41c [X86] VZeroUpperInserter::insertVZeroUpper - avoid DebugLoc creation by embedding in the BuildMI calls. NFCI.
Try to pass DebugLoc by const-ref to avoid costly TrackingMDNodeRef copies.
2021-05-13 13:31:52 +01:00
Serge Pavlov 12537ab772 [FPEnv][X86] Implement lowering of llvm.set.rounding
Differential Revision: https://reviews.llvm.org/D74730
2021-05-13 14:30:38 +07:00
Fangrui Song 0fe6649bc5 [X86] Fix -Wunused-lambda-capture 2021-05-12 10:34:32 -07:00
Simon Pilgrim fb1d61b725 [X86][AVX] Fold concat(ps*lq(x,32),ps*lq(y,32)) -> shuffle(concat(x,y),zero) (PR46621)
On AVX1 targets we can handle v4i64 logical shifts by 32 bits as a pair of v8f32 shuffles with zero.

I was hoping to put this in LowerScalarImmediateShift, but performing that early causes regressions where other instructions were respliting the subvectors.
2021-05-12 18:04:40 +01:00
Simon Pilgrim 7bff9bdd34 [X86][AVX] combineConcatVectorOps - add ConcatSubOperand helper. NFCI.
Pull out repeated code to create a concat_vectors of the same operand from all subvecs.
2021-05-12 16:42:18 +01:00
Craig Topper 44e0e91db0 [ValueTypes] Rename MVT::getVectorNumElements() to MVT::getVectorMinNumElements(). Fix some misuses of getVectorNumElements()
getVectorNumElements() returns a value for scalable vectors
without any warning so it is effectively getVectorMinNumElements().
By renaming it and making getVectorNumElements() forward to
it, we can insert a check for scalable vectors into getVectorNumElements()
similar to EVT. I didn't do that in this patch because there are still more
fixes needed, but I was able to temporarily do it and passed the RISCV
lit tests with these changes.

The changes to isPow2VectorType and getPow2VectorType are copied from EVT.

The change to TypeInfer::EnforceSameNumElts reduces the size of AArch64's isel table.
We're now considering SameNumElts to require the scalable property to match which
removes some unneeded type checks.

This was motivated by the bug I fixed yesterday in 80b9510806

Reviewed By: frasercrmck, sdesmalen

Differential Revision: https://reviews.llvm.org/D102262
2021-05-12 07:46:45 -07:00
Sanjay Patel f58e0513dd [x86] try harder to lower to PCMPGT instead of not-of-PCMPEQ
This is motivated by the example in https://llvm.org/PR50055 ,
but it doesn't do anything for that bug currently because we
don't actually have a zero-extended setcc there.

Proof for the generic transform (inverse of what we would
try to do in combining):
https://alive2.llvm.org/ce/z/aBL-Mg

Differential Revision: https://reviews.llvm.org/D102275
2021-05-12 08:25:29 -04:00
Simon Pilgrim 72e242a286 [X86][AVX] canonicalizeShuffleMaskWithHorizOp - improve support for 256/512-bit vectors
Extend the HOP(HOP(X,Y),HOP(Z,W)) and SHUFFLE(HOP(X,Y),HOP(Z,W)) folds to handle repeating 256/512-bit vector cases.

This allows us to drop the UNPACK(HOP(),HOP()) custom fold in combineTargetShuffle.

This required isRepeatedTargetShuffleMask to be tweaked to support target shuffle masks taking more than 2 inputs.
2021-05-12 12:13:24 +01:00
Matt Arsenault 24e2e5df0e GlobalISel: Split ValueHandler into assignment and emission classes
Currently the ValueHandler handles both selecting the type and
location for arguments, as well as inserting instructions needed to
handle them. Split this so that the determination of the argument
handling is independent of the function state. Currently the checks
for tail call compatibility do not follow the full assignment logic,
so it misses cases where arguments require nontrivial legalization.

This should help avoid targets ending up in a buggy state where the
argument evaluation may change in different contexts.
2021-05-11 19:50:12 -04:00
Roman Lebedev 97e04d41e6
[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): canonicalize to integer type
This way we don't have to duplicate i32/f32 and i64/f64 entries,
which was already forgotten to be done for a few tuples.
2021-05-11 21:35:58 +03:00
Roman Lebedev 5f78ba001c
[X86][Codegen] Shift amount mod: sh? i64 x, (32-y) --> sh? i64 x, -(y+32)
I've seen this in the RawSpeed's BitPumpMSB*::push() hotpath,
after fixing the buffer abstraction to a more sane one,
when looking into a +5% runtime regression.
I was hoping that this would fix it, but it does not look it does.

This seems to be at least not worse than the original pattern.
But i'm actually mainly interested in the case where we already
compute `(y+32)` (see last test),

https://alive2.llvm.org/ce/z/ZCzJio

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D101944
2021-05-11 19:39:41 +03:00
Roman Lebedev 69ed93a435
[X86] X86TTIImpl::getInterleavedMemoryOpCostAVX2(): use getMemoryOpCost()
Now that getMemoryOpCost() correctly handles all the vector variants,
we should no longer hand-roll our own version of it, but use it directly.

The AVX512 variant probably needs a similar change,
but there it is less obvious.
2021-05-11 16:28:00 +03:00
Simon Pilgrim 759b97e55a [X86] Replace repeated isa/cast<ConstantSDNode> calls with single single dyn_cast<>. NFCI.
Noticed while looking at D101944
2021-05-11 14:18:45 +01:00
Simon Pilgrim 9acc03ad92 [X86][SSE] Replace foldShuffleOfHorizOp with generalized version in canonicalizeShuffleMaskWithHorizOp
foldShuffleOfHorizOp only handled basic shufps(hop(x,y),hop(z,w)) folds - by moving this to canonicalizeShuffleMaskWithHorizOp we can work with more general/combined v4x32 shuffles masks, float/integer domains and support shuffle-of-packs as well.

The next step will be to support 256/512-bit vector cases.
2021-05-11 14:18:45 +01:00
Roman Lebedev c02476f315
[X86][CostModel] X86TTIImpl::getMemoryOpCost(): rewrite vector handling again
Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.

Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.

Reviewed By: RKSimon, ABataev

Differential Revision: https://reviews.llvm.org/D100684
2021-05-11 16:02:22 +03:00
Roman Lebedev 6a64c462eb
[X86] AMD Zen 3: same-reg AVX YMM VPCMP is dep breaking one-idiom
As measured by exegesis, and confirmed by ref docs.
Still not zero-cycle :)
2021-05-10 23:49:27 +03:00
Roman Lebedev 2953245337
[X86] AMD Zen 3: same-reg AVX XMM VPCMP is dep breaking one-idiom
As measured by exegesis, and confirmed by ref docs.
Again, it's not zero-cycle.
2021-05-10 23:49:26 +03:00
Roman Lebedev 0f3bcb97ef
[X86] AMD Zen 3: same-reg SSE XMM PCMP is dep breaking one-idiom
As measured by exegesis, and confirmed by ref docs.
Much like with MMX PCMP, it does actually have to execute, though.
2021-05-10 23:49:26 +03:00
Roman Lebedev b24edfff4f
[X86] AMD Zen 3: same-reg PCMPEQ is an MMX all-ones dep breaking idiom
They are, however, not zero-cycle, and do actually execute.

As measured by exegesis, and confirmed by ref docs.
2021-05-10 23:49:26 +03:00
Roman Lebedev 08cf2776ac
[X86] AMD Zen 3: sub-32-bit CMP also break dependencies
They measure as having the same effect as 32-bit CMP.
2021-05-10 20:57:38 +03:00
Simon Pilgrim e32374ed5c [X86][SSE] canonicalizeShuffleMaskWithHorizOp - add TODO for better 256/512-bit shuffle+hop folding support. NFC. 2021-05-10 18:43:16 +01:00
Simon Pilgrim 605f90475f X86FlagsCopyLowering.cpp - try to pass DebugLoc by const-ref to avoid costly TrackingMDNodeRef copies. NFCI. 2021-05-10 14:00:37 +01:00
Simon Pilgrim 9243a584d3 X86LoadValueInjectionLoadHardening.cpp - use const-reference in for-range loops to avoid unnecessary copies. NFCI. 2021-05-10 14:00:36 +01:00
Roman Lebedev be23d5e814
[X86] AMD Zen 3: same-reg CMP is a zero-cycle dependency-breaking instruction
As measured by exegesis, and confirmed by ref docs.
2021-05-10 00:03:20 +03:00
Roman Lebedev 11b0568dce
[X86] AMD Zen 3: same-reg SBB is a dependency-breaking instruction
As confirmed by exegesis measurements, and ref docs.
It does actually execute.

While there, bump latency for MULX32rr, that seems to match measurements.
2021-05-10 00:03:20 +03:00
Roman Lebedev eed8552787
[X86] AMD Zen 3: same-register XOR/SUB are GPR dependency breaking zero-idioms
As measured by exegesis and confirmed in reference docs.
2021-05-10 00:03:20 +03:00
Roman Lebedev 675daef58b
[NFC][X86] Znver3: drop obsolete fixme 2021-05-09 20:37:57 +03:00
Roman Lebedev a21df76db6
[X86] AMD Zen 3: XCHG is a zero-cycle instruction
As measured by exegesis and confirmed by reference docs.
2021-05-09 20:37:57 +03:00
Roman Lebedev f858929208
[NFCI][X86] Mark Znver3 scheduling model as complete
To the best of my knowledge, all instructions are modelled,
and have reasonable values to them; flipping the switch
doesn't cause any diff for MCA tests, so either we're good,
or we have test coverage gaps.

I'm not really sure why no other X86 sched model is marked as complete.
2021-05-09 01:07:07 +03:00
Roman Lebedev d5494931f2
[NFCI][X86] Mark a few lately-added system instructions as such for Scheduling purposes 2021-05-09 01:07:07 +03:00
Simon Pilgrim 4524d8b755 [X86] combineHorizOpWithShuffle - generalize HOP(SHUFFLE(X),SHUFFLE(Y)) -> SHUFFLE(HOP(X,Y)) fold.
For 128-bit types, generalize the fold to recognise duplicate operands in either shuffle.
2021-05-08 16:23:27 +01:00
Roman Lebedev b1c38207e9
[X86] Improve costmodel for scalar byte swaps
Currently we model i16 bswap as very high cost (`10`),
which doesn't seem right, with all other being at `1`.

Regardless of `MOVBE`, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
I think it should at worst have throughput of `1`:

Since i32/i64 already have cost of `1`,
`MOVBE` doesn't improve their costs any further.

BUT, `MOVBE` must have at least a single memory operand,
with other being a register. Which means, if we have
a bswap of load, iff load has a single use,
we'll fold bswap into load.

Likewise, if we have store of a bswap, iff bswap
has a single use, we'll fold bswap into store.

So i think we should treat such a bswap as free,
unless of course we know that for the particular CPU
they are performing badly.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D101924
2021-05-08 15:17:35 +03:00
Xiang1 Zhang d4bdeca576 [X86] Support AMX fast register allocation
Differential Revision: https://reviews.llvm.org/D100026
2021-05-08 14:21:11 +08:00
Xiang1 Zhang bebafe01a7 Revert "[X86] Support AMX fast register allocation"
This reverts commit 77e2e5e07d.
2021-05-08 13:43:32 +08:00
Xiang1 Zhang 77e2e5e07d [X86] Support AMX fast register allocation 2021-05-08 13:27:21 +08:00
Roman Lebedev b8701dc174
[X86] AMD Zen 3: mark XMM/YMM (but not MMX!) reg moves as eliminatible in RegisterFile 2021-05-07 20:11:21 +03:00
Roman Lebedev 5b1610a250
[X86] AMD Zen 3: MOVSX32rr32 is a zero-cycle move
It measures as such, and the reference docs agree.

I can't easily add a MCA test, because there's no mnemonic for it,
it can only be disassembled or created as a MCInst.
2021-05-07 20:11:20 +03:00
Simon Pilgrim f744723f75 [X86] combineXor - limit fold to non-opaque constants (PR50254)
Ensure we don't try to fold when one might be an opaque constant - the constant fold will fail and then the reverse fold will happen in DAGCombine.....
2021-05-07 16:39:24 +01:00
Roman Lebedev 2819009b5a
[X86] AMD Zen 3: _REV variants of zero-cycles moves are also zero-cycles (PR50261)
Sometimes disassembler picks _REV variants of instructions
over the plain ones, which in this case exposed an issue
that the _REV variants aren't being modelled as optimizable moves.
2021-05-07 18:27:40 +03:00
Roman Lebedev 758c173309
[X86] AMD Zen 3: throughput for renameable XMM/YMM moves is 6
They are resolved at the register rename stage without
using any execution units.
2021-05-07 17:06:45 +03:00
Roman Lebedev 715c0d0bd4
[X86] AMD Zen 3: AVX YMM moves are zero-cycle
I've verified this with llvm-exegesis.
This is not limited to zero registers.
2021-05-07 17:06:45 +03:00
Roman Lebedev ee020b930d
[X86] AMD Zen 3: AVX XMM moves are zero-cycle
I've verified this with llvm-exegesis.
This is not limited to zero registers.
2021-05-07 17:06:44 +03:00
Roman Lebedev 9db4203883
[X86] AMD Zen 3: SSE XMM moves are zero-cycle
I've verified this with llvm-exegesis.
This is not limited to zero registers.

Refs:
AMD SOG 19h, 2.9.4 Zero Cycle Move
The processor is able to execute certain register to register
mov operations with zero cycle delay.

Agner,
22.13 Instructions with no latency
Register-to-register move instructions are resolved at
the register rename stage without using any execution units.
These instructions have zero latency. It is possible to do six such
register renamings per clock cycle, and it is even possible to
rename the same register multiple times in one clock cycle.
2021-05-07 17:06:44 +03:00
Roman Lebedev d8c6202576
[X86] AMD Zen 3: throughput for renameable GPR moves is 6
They are resolved at the register rename stage without
using any execution units.
2021-05-07 17:06:43 +03:00
Roman Lebedev c3cd8ed009
[NFC][X86] AMD Zen 3: move sched classes for renameables moves togeter 2021-05-07 17:06:43 +03:00
Simon Pilgrim 280aa3415e [DAG] Add a generic expansion for SHIFT_PARTS opcodes using funnel shifts
Based off a discussion on D89281 - where the AARCH64 implementations were being replaced to use funnel shifts.

Any target that has efficient funnel shift lowering can handle the shift parts expansion using the same expansion, avoiding a lot of duplication.

I've generalized the X86 implementation and moved it to TargetLowering - so far I've found that AARCH64 and AMDGPU benefit, but many other targets (ARM, PowerPC + RISCV in particular) could easily use this with a few minor improvements to their funnel shift lowering (or the folding of their target ops that funnel shifts lower to).

NOTE: I'm trying to avoid adding full SHIFT_PARTS legalizer handling as I think it might actually be possible to remove these opcodes in the medium-term and use funnel shift / libcall expansion directly.

Differential Revision: https://reviews.llvm.org/D101987
2021-05-07 13:12:30 +01:00
Simon Pilgrim 8e42024f79 [X86] Ensure we pass DebugLoc by const reference where possible. NFCI.
Avoids a lot of unnecessary tracking increments/decrements of the underlying TrackingMDNodeRef
2021-05-07 12:32:59 +01:00
Roman Lebedev 7059b28d5d
[X86] AMD Zen 3: 32/64 -bit GPR register moves are zero-cycle
I've verified this with llvm-exegesis.
This is not limited to zero registers.

Refs:
AMD SOG 19h, 2.9.4 Zero Cycle Move
The processor is able to execute certain register to register
mov operations with zero cycle delay.

Agner,
22.13 Instructions with no latency
Register-to-register move instructions are resolved at
the register rename stage without using any execution units.
These instructions have zero latency. It is possible to do six such
register renamings per clock cycle, and it is even possible to
rename the same register multiple times in one clock cycle.
2021-05-07 13:56:07 +03:00
Matt Arsenault 23ae35e858 X86/GlobalISel: Use generic version of splitToValueTypes
The custom insert of an unmerge and the callback weirdness should be
unnecessary. Since handleAssignments should now use
getRegisterTypeForCalling conv as SelectionDAG builder would, this
should now just be able to use the generic code. X86-32 relies on the
generated CCAssignFns not seeing illegal types and sharing code with
x86_64, so i64 values would incorrectly be assigned to 64-bit
registers.
2021-05-05 17:35:02 -04:00
Matt Arsenault fa0b93b5a0 GlobalISel: Use DAG call lowering infrastructure in a more compatible way
Unfortunately the current call lowering code is built on top of the
legacy MVT/DAG based code. However, GlobalISel was not using it the
same way. In short, the DAG passes legalized types to the assignment
function, and GlobalISel was passing the original raw type if it was
simple.

I do believe the DAG lowering is conceptually broken since it requires
picking a type up front before knowing how/where the value will be
passed. This ends up being a problem for AArch64, which wants to pass
i1/i8/i16 values as a different size if passed on the stack or in
registers.

The argument type decision is split across 3 different places which is
hard to follow. SelectionDAG builder uses
getRegisterTypeForCallingConv to pick a legal type, tablegen gives the
illusion of controlling the type, and the target may have additional
hacks in the C++ part of the call lowering. AArch64 hacks around this
by not using the standard AnalyzeFormalArguments and special casing
i1/i8/i16 by looking at the underlying type of the original IR
argument.

I believe people have generally assumed the calling convention code is
processing the original types, and I've discovered a number of dead
paths in several targets.

x86 actually relies on the opposite behavior from AArch64, and relies
on x86_32 and x86_64 sharing calling convention code where the 64-bit
cases implicitly do not work on x86_32 due to using the pre-legalized
types.

AMDGPU targets without legal i16/f16 have always used a broken ABI
that promotes to i32/f32. GlobalISel accidentally fixed this to be the
ABI we should have, but this fixes it so we're using the worse ABI
that is compatible with the DAG. Ideally we would fix the DAG to match
the old GlobalISel behavior, but I don't wish to fight that battle.

A new native GlobalISel call lowering framework should let the target
process the incoming types directly.

CCValAssigns select a "ValVT" and "LocVT" but the meanings of these
aren't entirely clear. Different targets don't use them consistently,
even within their own call lowering code. My current belief is the
intent was "ValVT" is supposed to be the legalized value type to use
in the end, and and LocVT was supposed to be the ABI passed type
(which is also legalized).

With the default CCState::Analyze functions always passing the same
type for these arguments, these only differ when the TableGen part of
the lowering decide to promote the type from one legal type to
another. AArch64's i1/i8/i16 hack ends up inverting the meanings of
these values, so I had to add an additional hack to let the target
interpret how large the argument memory is.

Since targets don't consistently interpret ValVT and LocVT, this
doesn't produce quite equivalent code to the initial DAG
lowerings. I've opted to consistently interpret LocVT as the in-memory
size for stack passed values, and ValVT as the register type to assign
from that memory. We therefore produce extending loads directly out of
the IRTranslator, whereas the DAG would emit regular loads of smaller
values. This will also produce loads/stores that are wider than the
argument value if the allocated stack slot is larger (and there will
be undef padding bytes). If we had the optimizations to reduce
load/stores based on truncated values, this wouldn't produce a
different end result.

Since ValVT/LocVT are more consistently interpreted, we now will emit
more G_BITCASTS as requested by the CCAssignFn. For example AArch64
was directly assigning types to some physical vector registers which
according to the tablegen spec should have been casted to a vector
with a different element type.

This also moves the responsibility for inserting
G_ASSERT_SEXT/G_ASSERT_ZEXT from the target ValueHandlers into the
generic code, which is closer to how SelectionDAGBuilder works.

I had to xfail an x86 test since I don't see a quick way to fix it
right now (I filed bug 50035 for this). It's broken independently of
this change, and only triggers since now we end up with more ands
which hit the improperly handled selection pattern.

I also observed that FP arguments that need promotion (e.g. f16 passed
as f32) are broken, and use regular G_TRUNC and G_ANYEXT.

TLDR; the current call lowering infrastructure is bad and nobody has
ever understood how it chooses types.
2021-05-05 17:35:02 -04:00
Simon Pilgrim 85460a2f5b [X86][SSE] Move unpack(hop,hop) fold from foldShuffleOfHorizOp to combineTargetShuffle
By moving this after more of the shuffle canonicalization we reduce the demanded vector elts, avoiding a few unnecessary copies/moves etc.
2021-05-05 13:36:09 +01:00
Alexey Bataev 13a51e017c [X86]Fix a crash trying to convert indices to proper type.
Need to perfortm a bitcast on IndicesVec rather than subvector extract
if the original size of the IndicesVec is the same as the size of the
  destination type.

Differential Revision: https://reviews.llvm.org/D101838
2021-05-05 05:14:42 -07:00
Matt Arsenault 6dd8834772 X86/GlobalISel: Rely on default assignValueToReg
The resulting output is semantically closer to what the DAG emits and
is more compatible with the existing CCAssignFns.

The returns of f32 in f80 are clearly broken, but they were broken
before when using G_ANYEXT to go from f32 to f80.
2021-05-04 16:36:37 -04:00
Harald van Dijk f30500632b
[X32][CET] Fix size and alignment of .note.gnu.property section
X32 uses 32-bit ELF object files with 32-bit alignment, so the
.note.gnu.property section needs to be emitted as it is for X86.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101689
2021-05-01 22:17:04 +01:00
Roman Lebedev 2b93c9c16c
[X86] AMD Zen 3 Scheduler Model
Introduce basic schedule model for AMD Zen 3 CPU's, a.k.a `znver3`.

This is fully built from scratch, from llvm-mca measurements
and documented reference materials.
Nothing was copied from `znver2`/`znver1`.

I believe this is in a reasonable state of completion for inclusion,
probably better than D52779 `bdver2` was :)

Namely:
* uops are pretty spot-on (at least what llvm-mca can measure)
  {F16422596}
* latency is also pretty spot-on (at least what llvm-mca can measure)
  {F16422601}
* throughput is within reason
  {F16422607}

I haven't run much benchmarks with this,
however RawSpeed benchmarks says this is beneficial:
{F16603978}
{F16604029}

I'll call out the obvious problems there:
* i didn't really bother with X87 instructions
* i didn't really bother with obviously-microcoded/system instructions
* There are large discrepancy in throughput for `mr` and `rm` instructions.
  I'm not really sure if it's a modelling defect that needs to be fixed,
  or it's a defect of measurments.
* Pipe distributions are probably bad :)
  I can't do much here until AMD allows that to be fixed
  by documenting the appropriate counters and updating libpfm

That being said, as @RKSimon notes:
>>! In D94395#2647381, @RKSimon wrote:
> I'll mention again that all the znver* models appear to be very inaccurate wrt SIMD/FPU instructions <...>
so how much worse this could possibly be?!

Things that aren't there:
* Various tunings: zero idioms, etc. That is follow-ups.

Differential Revision: https://reviews.llvm.org/D94395
2021-05-01 22:08:13 +03:00
Dávid Bolvanský 2af95a5275 [X86] Promote 16-bit CTTZ_ZERO_UNDEF to 32-bit variant
Related to PR50172.

Protects us against regressions after we will start doing cttz(zext(x)) -> zext(cttz(x)) transformation in the middle-end.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D101662
2021-05-01 00:42:15 +02:00
Daniil Fukalov 3489c2d7b1 [TTI] NFC: Change getTypeLegalizationCost to return InstructionCost.
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: sdesmalen, kparzysz

Differential Revision: https://reviews.llvm.org/D101533
2021-04-30 22:51:51 +03:00
Alexey Bataev 12c51f2358 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 12:48:00 -07:00
Alexey Bataev 6e859f3cd4 Revert "[COST] Improve shuffle kind detection if shuffle mask is provided."
This reverts commit 9239932221 to fix
a compiler crash on mask checks.
2021-04-29 12:40:33 -07:00
Benjamin Kramer df323ba445 Revert "[X86] Support AMX fast register allocation"
This reverts commit 3b8ec86fd5.

Revert "[X86] Refine AMX fast register allocation"

This reverts commit c3f95e9197.

This pass breaks using LLVM in a multi-threaded environment by
introducing global state.
2021-04-29 18:56:33 +02:00
Alexey Bataev 9239932221 [COST] Improve shuffle kind detection if shuffle mask is provided.
Added an extra analysis for better choosing of shuffle kind in
getShuffleCost functions for better cost estimation if mask was
provided.

Differential Revision: https://reviews.llvm.org/D100865
2021-04-29 09:42:56 -07:00
Harald van Dijk 1b788607f5
[X32][CET] Fix handling of indirect branches
As X32 uses 32-bit pointers without having 32-bit indirect branch
instructions, we need to fix up indirect branches by extending the
branch targets to 64 bits. This was already done for BRIND but not yet
for NT_BRIND. The same logic works for both, so this applies that
existing logic to NT_BRIND as well.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101499
2021-04-29 08:33:22 +01:00
Wang, Pengfei f69adfb87f [X86][AMX][NFC] Add more comments and remove unnecessary check found by Clocwork 2021-04-28 16:35:17 +08:00
Craig Topper 3067520bf4 [SelectionDAG] Use a VTSDNode to store the saturation width for FP_TO_SINT_SAT/FP_TO_UINT_SAT
Previously we used an i32 constant to store the saturation width, but i32 isn't
legal on RISCV64. This wasn't a big deal to fix, but it is extra work for the
type legalizer.

This patch uses a VTSDNode to store the type similar to SEXT_INREG. This makes
it opaque to the type legalizer.

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D101262
2021-04-27 14:38:42 -07:00
Alexey Bataev f19e8f424f [COST][X86]Improve cost model for reverse shuffle v32i16/v64i8 in AVX512F.
Improved cost model for reverse shuffle on AVX512F for types
v32i16/v64i8.

Differential Revision: https://reviews.llvm.org/D100974
2021-04-27 11:14:21 -07:00
Nick Desaulniers ea8416bf4d [CodeGenOptions] make StackProtectorGuardOffset signed
GCC supports negative values for -mstack-protector-guard-offset=, this
should be a signed value. Pre-req to D100919.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101325
2021-04-27 10:12:58 -07:00
Simon Pilgrim decab8e973 Revert rG9b7a0a50355d5 - Revert "[X86] Add support for reusing ZF etc. from locked XADD instructions (PR20841)"
Still causing some sanitizer buildbot failures.
2021-04-27 15:39:20 +01:00
Simon Pilgrim 9b7a0a5035 [X86] Add support for reusing ZF etc. from locked XADD instructions (PR20841)
XADD has the same EFLAGS behaviour as ADD

Reapplies rG2149aa73f640 (after it was reverted at rG535df472b042) - AFAICT rG029e41ec9800 should ensure we correctly tag the LXADD* ops as load/stores - I haven't been able to repro the sanitizer buildbot fails locally so this is a speculative commit.
2021-04-27 15:01:13 +01:00
Simon Pilgrim 029e41ec98 [X86] Ensure multiclass ATOMIC_RMW_BINOP is tagged as MayLoad and MayStore
These are RMW ops and should be tagged as both loads and stores.
2021-04-27 14:11:22 +01:00
dfukalov e4c606acaf [TTI] NFC: Change getScalarizationOverhead and getOperandsScalarizationOverhead to return InstructionCost.
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D101283
2021-04-27 08:51:48 +03:00
Wang, Pengfei 016092d786 Reapply "[X86][AMX] Try to hoist AMX shapes' def"
We request no intersections between AMX instructions and their shapes'
def when we insert ldtilecfg. However, this is not always ture resulting
from not only users don't follow AMX API model, but also optimizations.

This patch adds a mechanism that tries to hoist AMX shapes' def as well.
It only hoists shapes inside a BB, we can improve it for cases across
BBs in future. Currently, it only hoists shapes of which all sources' def
above the first AMX instruction. We can improve for the case that only
source that moves an immediate value to a register below AMX instruction.

Reviewed By: xiangzhangllvm

Differential Revision: https://reviews.llvm.org/D101067
2021-04-27 10:27:59 +08:00
Simon Pilgrim a0677ff5eb [X86] Rename multiclass ATOMIC_LOAD_BINOP -> ATOMIC_RMW_BINOP. NFCI.
Noticed while triaging the rG2149aa73f640c96 regressions - the LXADD ops are load+store RMW instructions, not just loads.
2021-04-26 15:17:30 +01:00
Simon Pilgrim 535df472b0 Revert rG2149aa73f640c96 "[X86] Add support for reusing ZF etc. from locked XADD instructions (PR20841)"
This might be the cause of some msan build failures - I don't have access to a msan build right now, so this is a speculative revert.
2021-04-25 12:45:07 +01:00
Simon Pilgrim 2149aa73f6 [X86] Add support for reusing ZF etc. from locked XADD instructions (PR20841)
XADD has the same EFLAGS behaviour as ADD
2021-04-25 12:02:33 +01:00
Xiang1 Zhang c3f95e9197 [X86] Refine AMX fast register allocation 2021-04-25 14:20:53 +08:00
Xiang1 Zhang 3b8ec86fd5 [X86] Support AMX fast register allocation
Differential Revision: https://reviews.llvm.org/D100026
2021-04-25 09:45:41 +08:00
Mitch Phillips caea37b37e Revert "[X86][AMX] Try to hoist AMX shapes' def"
This reverts commit 90118563ad.

Reason: Broke the MSan buildbots.
https://lab.llvm.org/buildbot/#/builders/5/builds/6967/steps/9/logs/stdio

More details can be found in the original phabricator review:
https://reviews.llvm.org/D101067
2021-04-23 10:42:26 -07:00
Simon Pilgrim 043bc88dba [CostModel][X86] Improve v2f32 fadd reduction cost
This was being reported as a similar cost to v4f32 when its a lot cheaper (just a shufps+addps).
2021-04-23 16:56:13 +01:00
Sander de Smalen f9a50f04ba [TTI] NFC: Change getIntImmCost[Inst|Intrin] to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100565
2021-04-23 16:06:36 +01:00
Sander de Smalen 43ace8b5ce [TTI] NFC: Change getScalingFactorCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100564
2021-04-23 16:06:36 +01:00
Sander de Smalen e0edfa052f [TTI] NFC: Change getAddressComputationCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100561
2021-04-23 16:06:35 +01:00
Simon Pilgrim 7b32e8b96a [X86] combineSetCCAtomicArith - pull out repeated ops. NFCI.
Reduces diff in D101074
2021-04-23 14:19:24 +01:00
Wang, Pengfei 151e244fe6 [X86][AMX][NFC] Make comparison operators to be complete
The previous D101039 didn't fix the SmallSet insertion issue, due to we
always return false for the comparison between 2 different nonnull BBs.
This patch makes the the comparison to be complete by comparing `MBB`
first, so that we can always get the invariant order by a single
operator.
2021-04-23 17:38:54 +08:00
Wang, Pengfei 53673fd1bf [X86][AMX][NFC] Avoid assert for the same immidiate value
The previous condition in the assert was over strict. We ought to allow
the same immidiate value being loaded more than once. The intention for
the assert is to check the same AMX register uses multiple different
immidiate shapes. So this fix supposes to be NFC.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D101124
2021-04-23 12:17:00 +08:00
Wang, Pengfei 90118563ad [X86][AMX] Try to hoist AMX shapes' def
We request no intersections between AMX instructions and their shapes'
def when we insert ldtilecfg. However, this is not always ture resulting
from not only users don't follow AMX API model, but also optimizations.

This patch adds a mechanism that tries to hoist AMX shapes' def as well.
It only hoists shapes inside a BB, we can improve it for cases across
BBs in future. Currently, it only hoists shapes of which all sources' def
above the first AMX instruction. We can improve for the case that only
source that moves an immediate value to a register below AMX instruction.

Differential Revision: https://reviews.llvm.org/D101067
2021-04-23 12:17:00 +08:00
Wang, Pengfei e8bce83996 [X86] Enable compilation of user interrupt handlers.
Add __uintr_frame structure and use UIRET instruction for functions with
x86 interrupt calling convention when UINTR is present.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D99708
2021-04-23 11:43:57 +08:00
Wang, Pengfei aafb6d81cf [X86][AMX][NFC] Remove assert for comparison between different BBs.
SmallSet may use operator `<` when we insert MIRef elements, so we
cannot limit the comparison between different BBs.

We allow MIRef() to be less that any initialized MIRef object, otherwise,
we always reture false when compare between different BBs.

Differential Revision: https://reviews.llvm.org/D101039
2021-04-22 20:41:59 +08:00
Simon Pilgrim a511b55cfd [X86][SSE] getFauxShuffleMask - don't decode OR(SHUFFLE,SHUFFLE) containing UNDEFs. (PR50049)
PR50049 demonstrated an infinite loop between OR(SHUFFLE,SHUFFLE) <-> BLEND(SHUFFLE,SHUFFLE) patterns.

The UNDEF elements were allowing a combined shuffle mask to be widened which lost the undef element, resulting us needing to use the BLEND pattern (as the undef element would need to be zero for the OR pattern). But then bitcast folds would re-expose the undef element allowing us to use OR again.....
2021-04-21 18:47:00 +01:00
Anirudh Prasad 8f6185c713 [AsmParser][ms][X86] Fix possible misbehaviour in parsing of special tokens at start of string.
- Previously, https://reviews.llvm.org/D72680 introduced a new attribute called `AllowSymbolAtNameStart` (in relation to the MAsmParser changes) in `MCAsmInfo.h` which (according to the comment in the header) allows the following behaviour:

```
  /// This is true if the assembler allows $ @ ? characters at the start of
  /// symbol names. Defaults to false.
```

- However, the usage of this field in AsmLexer.cpp doesn't seem completely accurate* for a couple of reasons.

```
  default:
    if (MAI.doesAllowSymbolAtNameStart()) {
      // Handle Microsoft-style identifier: [a-zA-Z_$.@?][a-zA-Z0-9_$.@#?]*
      if (!isDigit(CurChar) &&
          isIdentifierChar(CurChar, MAI.doesAllowAtInName(),
                           AllowHashInIdentifier))
        return LexIdentifier();
    }
```

1. The Dollar and At tokens, when occurring at the start of the string, are treated as separate tokens (AsmToken::Dollar and AsmToken::At respectively) and not lexed as an Identifier.
2. I'm not too sure why `MAI.doesAllowAtInName()` is used when `AllowAtInIdentifier` could be used. For X86 platforms, afaict, this shouldn't be an issue, since the `CommentString` attribute isn't "@". (alternatively the call to the setter can be set anywhere else as needed). The `AllowAtInName` does have an additional important meaning, but in the context of AsmLexer, shouldn't mean anything different compared to `AllowAtInIdentifier`

My proposal is the following:

- Introduce 3 new fields called `AllowQuestionTokenAtStartOfString`, `AllowDollarTokenAtStartOfString` and `AllowAtTokenAtStartOfString` in MCAsmInfo.h which will encapsulate the previously documented behaviour of "allowing $, @, ? characters at the start of symbol names")
- Introduce these fields where "$", "@" are lexed, and treat them as identifiers depending on whether `Allow[Dollar|At]TokenAtStartOfString` is set.
- For the sole case of "?", append it to the existing logic for treating a "default" token as an Identifier.

z/OS (HLASM) will also make use of some of these fields in follow up patches.

completely accurate* - This was based on the comments and the intended behaviour the code. I might have completely misinterpreted it, and if that is the case my sincere apologies. We can close this patch if necessary, if there are no changes to be made :)

Depends on https://reviews.llvm.org/D99374

Reviewed By: Jonathan.Crowther

Differential Revision: https://reviews.llvm.org/D99889
2021-04-21 10:21:09 -04:00
Philip Reames 4824d876f0 Revert "Allow invokable sub-classes of IntrinsicInst"
This reverts commit d87b9b81cc.

Post commit review raised concerns, reverting while discussion happens.
2021-04-20 15:38:38 -07:00
Philip Reames d87b9b81cc Allow invokable sub-classes of IntrinsicInst
It used to be that all of our intrinsics were call instructions, but over time, we've added more and more invokable intrinsics. According to the verifier, we're up to 8 right now. As IntrinsicInst is a sub-class of CallInst, this puts us in an awkward spot where the idiomatic means to check for intrinsic has a false negative if the intrinsic is invoked.

This change switches IntrinsicInst from being a sub-class of CallInst to being a subclass of CallBase. This allows invoked intrinsics to be instances of IntrinsicInst, at the cost of requiring a few more casts to CallInst in places where the intrinsic really is known to be a call, not an invoke.

After this lands and has baked for a couple days, planned cleanups:
    Make GCStatepointInst a IntrinsicInst subclass.
    Merge intrinsic handling in InstCombine and use idiomatic visitIntrinsicInst entry point for InstVisitor.
    Do the same in SelectionDAG.
    Do the same in FastISEL.

Differential Revision: https://reviews.llvm.org/D99976
2021-04-20 15:03:49 -07:00
Simon Pilgrim 2a419a0b99 [X86][SSE] combineX86ShuffleChain - check if we're blending with zero into already zero elements
Add a SelectionDAG::MaskedElementsAreZero helper that wraps SelectionDAG::MaskedValueIsZero testing for entirely zero vector elements
2021-04-20 17:09:49 +01:00
Nick Desaulniers c440b97d89 [TargetLowering] move "o" and "X" constraint handling to base class
These constraints are machine agnostic; there's no reason to handle
these per-arch. If arches don't support these constraints, then they
will fail elsewhere during instruction selection. We don't need virtual
calls to look these up; TargetLowering::getInlineAsmMemConstraint should
only be overridden by architectures with additional unique memory
constraints.

Reviewed By: echristo, MaskRay

Differential Revision: https://reviews.llvm.org/D100416
2021-04-19 10:53:31 -07:00
Roman Lebedev df9597cf5a
[X86][CostModel] X86TTIImpl::getShuffleCost(): subvector insertions are cheap
This is similar to the subvector extractions,
except that the 0'th subvector isn't free to insert,
because we generally don't know whether or not
the upper elements need to be preserved:
https://godbolt.org/z/rsxP5W4sW

This is needed to avoid regressions in D100684

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100698
2021-04-19 13:24:58 +03:00
Serge Guelton d6de1e1a71 Normalize interaction with boolean attributes
Such attributes can either be unset, or set to "true" or "false" (as string).
throughout the codebase, this led to inelegant checks ranging from

        if (Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true")

to

        if (Fn->hasAttribute("no-jump-tables") && Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true")

Introduce a getValueAsBool that normalize the check, with the following
behavior:

no attributes or attribute set to "false" => return false
attribute set to "true" => return true

Differential Revision: https://reviews.llvm.org/D99299
2021-04-17 08:17:33 +02:00
Roman Lebedev b06c55a698
[X86][CostModel] Fix cost model for non-power-of-two vector load/stores
Sometimes LV has to produce really wide vectors,
and sometimes they end up being not powers of two.
As it can be seen from the diff, the cost computation
is currently completely non-sensical in those cases.

Instead of just scalarizing everything, split/factorize the wide vector
into a number of subvectors, each one having a power-of-two elements,
recurse to get the cost of op on this subvector. Also, check how we'd
legalize this subvector, and if the legalized type is scalar,
also account for the scalarization cost.

Note that for sub-vector loads, we might be able to do better,
when the vectors are properly aligned.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D100099
2021-04-16 15:30:57 +03:00
Simon Pilgrim 9d57a77b81 [X86] combineCMP - fold cmpEQ/NE(TRUNC(X),0) -> cmpEQ/NE(X,0)
If we are truncating from a i32 source before comparing the result against zero, then see if we can directly compare the source value against zero.

If the upper (truncated) bits are known to be zero then we can compare against that, hopefully increasing the chances of us folding the compare into a EFLAG result of the source's operation.

Fixes PR49028.

Differential Revision: https://reviews.llvm.org/D100491
2021-04-15 13:55:51 +01:00
Sander de Smalen 4f42d873c2 [TTI] NFC: Change getArithmeticInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100317
2021-04-14 17:20:36 +01:00
Sander de Smalen 1af35e77f4 [TTI] NFC: Change getVectorInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100315
2021-04-14 17:20:35 +01:00
Sander de Smalen 174e8f6c5e [TTI] NFC: Change getShuffleCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100314
2021-04-14 17:20:35 +01:00
Sander de Smalen 14b934f8a6 [TTI] NFC: Change getCFInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: samparker

Differential Revision: https://reviews.llvm.org/D100313
2021-04-14 17:20:34 +01:00
Simon Pilgrim 4fbe761572 [X86][SSE] canonicalizeShuffleWithBinOps - check for more combos of merge-able binary shuffles.
In the fold SHUFFLE(BINOP(X,Y),BINOP(Z,W)) -> BINOP(SHUFFLE(X,Z),SHUFFLE(Y,W)), check if both X/Z AND Y/W have at least one merge-able shuffle in which case the total number of shuffle should still fall.

Helps with instruction count regressions we saw while fixing PR48823
2021-04-14 15:24:41 +01:00
Simon Pilgrim 73737fe990 [X86] Fold cmpeq/ne(trunc(x),0) --> cmpeq/ne(x,0)
Relax the fold from rGbaadbe04bf75 to compare any op, not just logic ops, now that the movmsk regressions have been handled.
2021-04-14 11:02:02 +01:00
Simon Pilgrim 016ceb8382 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in MOVMSK(SHUFFLE(X,u)) -> MOVMSK(X) fold
Extension to rG74f98391a7a4, we can also include any of the upper (known zero) bits in the comparison in the shuffle removal fold, just as long as we demand all the elements of the movmsk source vector.
2021-04-14 11:02:01 +01:00
Bogdan Graur 0acf4e5005
[NFC] Fix unused warning.
Differential Revision: https://reviews.llvm.org/D100449
2021-04-14 09:09:20 +02:00
Wang, Pengfei a3b52a9d13 [X86][AMX] Refactor for PostRA ldtilecfg pass.
This is a follow up of D99010. We didn't consider the live range of shape registers when hoist ldtilecfg. There maybe risks, e.g. we happen to insert it to an invalid range of some registers and get unexpected error.

This patch fixes this problem by storing the value to corresponding stack place of ldtilecfg after all its definition immediately.

This patch also fix a problem in previous code: If we don't have a ldtilecfg which dominates all AMX instructions, we cannot initialize shapes for other ldtilecfg.

There're still some optimization points left. E.g. eliminate unused mov instructions, break the def-use dependency before RA etc.

Reviewed By: LuoYuanke, xiangzhangllvm

Differential Revision: https://reviews.llvm.org/D99966
2021-04-14 10:08:23 +08:00
Simon Pilgrim 74f98391a7 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in CMP(MOVMSK(PACKSS())) -> CMP(MOVMSK()) fold
We already allow the comparison of the upper bits of 'IsAllOf' (allbits) patterns, but we can safely compare the known zero bits for 'IsAnyOf' (zerobits) patterns as well.

This fixes an issues where we are comparing a type wide than the number of vector elements, which avoids a regression mentioned in rGbaadbe04bf75.
2021-04-13 17:37:24 +01:00
Sander de Smalen 03f47bdcb1 [TTI] NFC: Change get[Interleaved]MemoryOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100205
2021-04-13 14:21:02 +01:00
Sander de Smalen d676b5749d [TTI] NFC: Change getMaskedMemoryOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100204
2021-04-13 14:21:01 +01:00
Sander de Smalen db134e2428 [TTI] NFC: Change getCmpSelInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100203
2021-04-13 14:21:01 +01:00
Sander de Smalen 2285dfb73f [TTI] NFC: Change getMinMaxReductionCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100202
2021-04-13 14:21:00 +01:00
Sander de Smalen bd86824d98 [TTI] NFC: Change getArithmeticReductionCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

This patch is practically NFC, with the exception of an AArch64 SVE related
cost-model change, where we can now return an Invalid cost instead of some
bogus number.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100201
2021-04-13 14:20:59 +01:00
Sander de Smalen fd1f8a5462 [TTI] NFC: Change getGatherScatterOpCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100200
2021-04-13 14:20:59 +01:00
Sander de Smalen 92d8421f49 [TTI] NFC: Change getCastInstrCost and getExtractWithExtendCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D100199
2021-04-13 14:20:58 +01:00
Freddy Ye 3fc1fe8db8 [X86] Support -march=rocketlake
Reviewed By: skan, craig.topper, MaskRay

Differential Revision: https://reviews.llvm.org/D100085
2021-04-13 09:48:13 +08:00
Simon Pilgrim baadbe04bf [X86] Fold cmpeq/ne(trunc(logic(x)),0) --> cmpeq/ne(logic(x),0)
Fixes the issues noted in PR48768, where the and/or/xor instruction had been promoted to avoid i8/i16 partial-dependencies, but the test against zero had not.

We can almost certainly relax this fold to work for any truncation, although it breaks a number of existing folds (notable movmsk folds which tend to rely on the truncate to determine the demanded bits/elts in the source vector).

There is a reverse combine in TargetLowering.SimplifySetCC so we must wait until after legalization before attempting this.
2021-04-12 16:05:34 +01:00
Wang, Pengfei 4cbaaf4a24 [X86][AMX] Hoist ldtilecfg
The previous code calculated the first ldtilecfg by dominating all AMX registers' def. This may result in the ldtilecfg being inserted into a loop.

This patch try to calculate the nearest point where all shapes of AMX registers are reachable.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D99010
2021-04-12 22:36:41 +08:00
Bing1 Yu 747111ea71 [X86] Pass to transform tdpbsud&tdpbusd&tdpbuud intrinsics to scalar operation
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D99244
2021-04-12 13:58:14 +08:00
Freddy Ye 5cb47be410 [X86] Remove FeatureCLWB from FeaturesICLClient
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D100279
2021-04-12 12:08:59 +08:00
Simon Pilgrim 231b87618b [X86][AVX512] Fold not(kmov(x)) -> kmov(not(x)) and not(widen_subvector(x)) -> widen_subvector(not(x))
Improve AVX512 mask inversion, rG38c799bce801 exposed some missing opportunities to move scalar not() back onto the boolvector types for folding with setcc etc.
2021-04-11 20:07:09 +01:00
Simon Pilgrim 13bdac5709 [X86] combineXor - Pull out repeated getOperand() calls. NFCI. 2021-04-11 19:01:59 +01:00
Simon Pilgrim 38c799bce8 [X86] Fold cmpeq/ne(and(X,Y),Y) --> cmpeq/ne(and(~X,Y),0)
Followup to D100177, handle an similar (demorgan inverse style) case from PR47797 as well

The AVX512 test cases could be further improved if we folded not(iX bitcast(vXi1)) -> (iX bitcast(not(vXi1)))

Alive2: https://alive2.llvm.org/ce/z/AnA_-W
2021-04-11 18:42:01 +01:00
dfukalov 8f4b7e94a2 [AMDGPU][CostModel] Refine cost model for control-flow instructions.
Added cost estimation for switch instruction, updated costs of branches, fixed
phi cost.
Had to increase `-amdgpu-unroll-threshold-if` default value since conditional
branch cost (size) was corrected to higher value.
Test renamed to "control-flow.ll".

Removed redundant code in `X86TTIImpl::getCFInstrCost()` and
`PPCTTIImpl::getCFInstrCost()`.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D96805
2021-04-10 09:20:24 +03:00
Simon Pilgrim d8bc4de3cf [X86] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) on non-BMI targets (PR44136)
Followup to D100177, enable the fold for non-BMI targets as well.
2021-04-09 16:11:11 +01:00
Simon Pilgrim 245036950a [X86][BMI] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) (PR44136)
I've initially just enabled this for BMI which has the ANDN instruction for i32/i64 - the i16/i8 cases give an idea of what'd we get when we enable it in all cases (I'll do this as a later commit).

Additionally, the i16/i8 cases could be freely promoted to i32 (as the args are already zeroext) and we could then make use of ANDN + the free cmp0 there as well - this has come up in PR48768 and PR49028 so I'm going to look at this soon.

https://alive2.llvm.org/ce/z/QVWHP_
https://alive2.llvm.org/ce/z/pLngT-

Vector cases do not appear to benefit from this as we end up with having to generate the zero vector as well - this is one of the reasons I didn't try to tie this into hasAndNot/hasAndNotCompare.

Differential Revision: https://reviews.llvm.org/D100177
2021-04-09 15:52:03 +01:00
dfukalov d066079728 [NFC][AA] Prepare to convert AliasResult to class with PartialAlias offset.
Main reason is preparation to transform AliasResult to class that contains
offset for PartialAlias case.

Reviewed By: asbirlea

Differential Revision: https://reviews.llvm.org/D98027
2021-04-09 12:54:22 +03:00
Simon Pilgrim 3ae0a405fc [X86] combineHorizOpWithShuffle - peek through one use bitcasts when decoding shuffles.
Checking for one use, peek through bitcasts of the horizop args to allows us to merge shuffles of different widths through the horizop.
2021-04-09 10:51:04 +01:00
Simon Pilgrim 302e748065 [X86] Improve optimizeCompareInstr for signed comparisons after AND/OR/XOR instructions
Extend D94856 to handle 'and', 'or' and 'xor' instructions as well

We still fail on many i8/i16 cases as the test and the logic-op are performed on different widths
2021-04-07 14:28:42 +01:00
Simon Pilgrim 583258723f [X86] Improve optimizeCompareInstr for signed comparisons after BZHI instructions
Extend D94856 to handle 'bzhi' instructions as well
2021-04-07 12:07:26 +01:00
Nicolás Alvarez a1aada75f5 [docs] Fix doxygen comments wrongly attached to the llvm namespace
Looking at the Doxygen-generated documentation for the llvm namespace
currently shows all sorts of random comments from different parts of the
codebase. These are mostly caused by:

- File doc comments that aren't marked with \file, so they're attached to
  the next declaration, which is usually "namespace llvm {".
- Class doc comments placed before the namespace rather than before the
  class.
- Code comments before the namespace that (in my opinion) shouldn't be
  extracted by doxygen at all.

This commit fixes these comments. The generated doxygen documentation now
has proper docs for several classes and files, and the docs for the llvm
and llvm::detail namespaces are now empty.

Reviewed By: thakis, mizvekov

Differential Revision: https://reviews.llvm.org/D96736
2021-04-07 01:20:18 +02:00
Simon Pilgrim 53283cc2f1 [X86][SSE] canonicalizeShuffleWithBinOps - add MOVSD/MOVSS handling. 2021-04-06 16:42:18 +01:00
Simon Pilgrim 1dcb5b5e89 [X86] Improve optimizeCompareInstr for signed comparisons after ANDN instructions
Extend D94856 to handle 'andn' instructions as well
2021-04-06 14:16:16 +01:00
Simon Pilgrim 201877d572 [CostModel][X86] Improve accuracy of vXi8 multiply reduction costs
After rG47321c311bdbe0145b9bf45d822185c37b19fa50 we promote vXi8 reductions to vXi16 to create a much faster PMULLW mul reduction, followed by a (free) truncation. This avoids the high cost of repeated vXi8 multiplications (which extend+multiply+truncate to/from vXi16 types....).

Fixes the missing vXi8 mul reduction vectorization in PR42674 (Comment #20) 'mul16' test case.
2021-04-06 11:53:22 +01:00
Simon Pilgrim ddbb58736a [KnownBits] Rename KnownBits::computeForMul to KnownBits::mul. NFCI.
As promised in D98866
2021-04-06 10:11:41 +01:00
Simon Pilgrim 36d4f6d7f8 [X86] Fold xor(zext(xor(x,c1)),c2) -> xor(zext(x),xor(zext(c1),c2))
Fixes PR47603 (second case) by extending rG89afec348dbd3e5078f176e978971ee2d3b5dec8
2021-04-05 11:40:37 +01:00
Roman Lebedev 7727cc242d
[NFC][X86] Split VPMOV* AVX2 instructions into their own sched class
At least on all three Zen's, all such instructions cleanly map
into this new class with no overrides needed.
2021-04-03 22:39:07 +03:00
Nikita Popov 665065821e [FastISel] Remove kill tracking
This is a followup to D98145: As far as I know, tracking of kill
flags in FastISel is just a compile-time optimization. However,
I'm not actually seeing any compile-time regression when removing
the tracking. This probably used to be more important in the past,
before FastRA was switched to allocate instructions in reverse
order, which means that it discovers kills as a matter of course.

As such, the kill tracking doesn't really seem to serve a purpose
anymore, and just adds additional complexity and potential for
errors. This patch removes it entirely. The primary changes are
dropping the hasTrivialKill() method and removing the kill
arguments from the emitFast methods. The rest is mechanical fixup.

Differential Revision: https://reviews.llvm.org/D98294
2021-04-03 15:50:13 +02:00
Simon Pilgrim 89afec348d [X86] Fold xor(truncate(xor(x,c1)),c2) -> xor(truncate(x),xor(truncate(c1),c2))
Fixes PR47603

This should probably be transferable to DAGCombine - the main limitation with the existing trunc(logicop) DAG fold is we don't know if legalization has tried to promote truncated logicops already. We might be able to peek through extensions as well.
2021-04-03 12:43:05 +01:00
Simon Pilgrim 7c17f1ea84 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper (REAPPLIED)
Use the getTargetShuffleInputs helper for all shuffle decoding

Reapplied (after reversion in rGfa0aff6d6960) with fix+test for subvector splitting - we weren't accounting for peeking through bitcasts changing the vector element count of the shuffle sources.
2021-04-03 11:59:19 +01:00
Nico Weber fa0aff6d69 Revert "[X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper"
This reverts commit 500969f1d0.
Makes clang assert compiling avx2 code, see
https://bugs.chromium.org/p/chromium/issues/detail?id=1195353#c4
for a standalone repro.
2021-04-02 09:55:55 -04:00
Simon Pilgrim 500969f1d0 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper
Use the getTargetShuffleInputs helper for all shuffle decoding
2021-04-02 11:50:18 +01:00
Yang Fan bc6001ce1e
[X86] Fix -Wunused-function warning (NFC)
GCC warning:
```
/llvm-project/llvm/lib/Target/X86/X86ISelLowering.cpp:9212:13: warning: ‘bool isHorizOp(unsigned int)’ defined but not used [-Wunused-function]
 9212 | static bool isHorizOp(unsigned Opcode) {
      |             ^~~~~~~~~
```
2021-04-02 09:38:12 +08:00
Mircea Trofin ce61def529 [regalloc] Ensure Query::collectInterferringVregs is called before interval iteration
The main part of the patch is the change in RegAllocGreedy.cpp: Q.collectInterferringVregs()
needs to be called before iterating the interfering live ranges.

The rest of the patch offers support that is the case: instead of  clearing the query's
InterferingVRegs field, we invalidate it. The clearing happens when the live reg matrix
is invalidated (existing triggering mechanism).

Without the change in RegAllocGreedy.cpp, the compiler ices.

This patch should make it more easily discoverable by developers that
collectInterferringVregs needs to be called before iterating.

I will follow up with a subsequent patch to improve the usability and maintainability of Query.

Differential Revision: https://reviews.llvm.org/D98232
2021-04-01 08:33:28 -07:00
Simon Pilgrim abbe80fa52 [X86][SSE] Fold HOP(HOP(X,X),HOP(Y,Y)) -> HOP(PERMUTE(HOP(X,Y)),PERMUTE(HOP(X,Y))
For slow-hop targets, attempt to merge HADD/SUB pairs used in chains.
2021-04-01 11:54:10 +01:00
Simon Pilgrim 301319840e [X86][SSE] Enable (F)HADD/SUB handling to SimplifyMultipleUseDemandedVectorElts
Attempt to bypass unused horiz-op operands.

This is very similar to the PACKSS/PACKUS handling - we should try to merge these.
2021-04-01 11:54:09 +01:00
Simon Pilgrim f7aeaced65 [X86][SSE] Add isHorizOp helper function. NFCI. 2021-04-01 11:54:09 +01:00
YangKeao 1c268a8ff4 [X86] add dwarf annotation for inline stack probe
While probing stack, the stack register is moved without dwarf
information, which could cause panic if unwind the backtrace.
This commit only add annotation for the inline stack probe case.
Dwarf information for the loop case should be done in another
patch and need further discussion.

Reviewed By: nagisa

Differential Revision: https://reviews.llvm.org/D99579
2021-04-01 00:32:50 +03:00
Craig Topper 437958d9fd [X86] Improve SMULO/UMULO codegen for vXi8 vectors.
The default expansion creates a MUL and either a MULHS/MULHU. Each
of those separately expand to sequences that use one or more
PMULLW instructions as well as additional instructions to
extend the types to vXi16. The MULHS/MULHU expansion computes the
whole 16-bit product, but only keeps the high part.

We can improve the lowering of SMULO/UMULO for some cases by using the MULHS/MULHU
expansion, but keep both the high and low parts. And we can use
those parts to calculate the overflow.

For AVX512 we might have vXi1 overflow outputs. We can improve those by using
vpcmpeqw to produce a k register if AVX512BW is enabled. This is a little better
than truncating the high result to use vpcmpeqb. If we don't have avx512bw we
can extend up to v16i32 to use vpcmpeqd to produce a k register.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97624
2021-03-31 10:13:50 -07:00
Craig Topper 50b8634a99 [X86] Improve optimizeCompareInstr for signed comparisons after BMI/TBM instructions
We previously couldn't optimize out a TEST if the branch/setcc/cmov
used the overflow flag. This patches allows the TEST to be removed
if the flag producing instruction is known to clear the OF flag.
Thats what the TEST instruction would have done so that should be
equivalent.

Need to add test cases. I'll try to get back to this if I have bandwidth.

Fixes PR48768.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D94856
2021-03-31 09:45:29 -07:00
Sander de Smalen 2f6f249a49 NFC: Change getIntrinsicInstrCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97468

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97469
2021-03-31 14:04:41 +01:00
Sander de Smalen 2f56e1c6b1 NFC: Change getTypeBasedIntrinsicCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Depends on D97466

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D97468
2021-03-31 14:04:41 +01:00
Roman Lebedev ce548aa236
[X86] AMD Zen 3 has macro fusion
This is an improvement over Zen 2, where only branch fusion is supported,
as per Agner, 21.4 Instruction fusion.
AMD SOG 17h has no mention of fusion.

AMD SOG 19h, 2.9.3 Branch Fusion
The following flag writing instructions support branch fusion
with their reg/reg, reg/imm and reg/mem forms
* CMP
* TEST
* SUB
* ADD
* INC (no fusion with branches dependent on CF)
* DEC (no fusion with branches dependent on CF)
* OR
* AND
* XOR

Agner, 22.4 Instruction fusion
<...> This applies to CMP, TEST, ADD, SUB, AND, OR, XOR, INC, DEC and
all conditional jumps, except if the arithmetic or logic instruction has a rip-relative address or
both an address displacement and an immediate operand.
2021-03-31 14:31:50 +03:00
Tomas Matheson a9968c0a33 [NFC][CodeGen] Tidy up TargetRegisterInfo stack realignment functions
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
it's name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.

This patch attempts to clarify the situation by separating them and introducing
new names:

 - shouldRealignStack - true if there is any reason the stack should be
   realigned

 - canRealignStack - true if we are still able to realign the stack (e.g. we
   can still reserve/have reserved a frame pointer)

 - hasStackRealignment = shouldRealignStack && canRealignStack (not target
   customisable)

Targets can now override shouldRealignStack to indicate that stack realignment
is required.

This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).

Differential Revision: https://reviews.llvm.org/D98716

Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
2021-03-30 17:31:39 +01:00
Sanjay Patel e694e19a79 [x86] enhance matching of pmaddwd
This was crashing with the example from:
https://llvm.org/PR49716
...and that was avoided with a283d72583 ,
but as we can see from the SSE vs. AVX test code diff,
we can try harder to match the pattern.

This matcher code was adapted from another pmadd pattern
match in D49636, but it needs different ops to deal with
size mismatches.

Differential Revision: https://reviews.llvm.org/D99531
2021-03-30 07:28:33 -04:00
Bing1 Yu 0c63b862c4 Revert "[X86] Pass to transform tdpbsud&tdpbusd&tdpbuud intrinsics to scalar operation"
This reverts commit 275df61f04.
2021-03-30 16:33:07 +08:00
Bing1 Yu 275df61f04 [X86] Pass to transform tdpbsud&tdpbusd&tdpbuud intrinsics to scalar operation
Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D99244
2021-03-30 16:21:10 +08:00
Nikita Popov 7669455df4 [X86][FastISel] Fix with.overflow eflags clobber (PR49587)
If the successor block has a phi node, then additional moves may
be inserted into predecessors, which may clobber eflags. Don't try
to fold the with.overflow result into the branch in that case.

This is done by explicitly checking for any phis in successor
blocks, not sure if there's some more principled way to address
this. Other fused compare and branch patterns avoid the issue by
emitting the comparison when handling the branch, so that no
instructions may be inserted in between. In this case, the
with.overflow call is emitted separately (and I don't think this
is avoidable, as it will generally have at least two users).

Fixes https://bugs.llvm.org/show_bug.cgi?id=49587.

Differential Revision: https://reviews.llvm.org/D98600
2021-03-29 23:08:47 +02:00
Craig Topper 54bacaf311 [X86] Always use rip-relative addressing on 64-bit when rematerializing all zeros/ones registers using a folded load.
Previously we only used RIP relative when PIC was enabled. But
we know we're in small/kernel code model here so we should
be able to always use RIP-relative which will give a smaller
encoding.

Here's a godbolt link that demonstrates the current codegen https://godbolt.org/z/j3158o
Note in the non-PIC version the load from .LCPI0_0 doesn't use
RIP-relative addressing, but if you change the constant in the
source from 0.0 to 1.0 it will become RIP-relative.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97208
2021-03-29 10:06:17 -07:00
Simon Pilgrim 805148eaf2 [X86][SSE] combineHorizOpWithShuffle - consistently use getTargetShuffleInputs to decode shuffles
Minor cleanup before I start trying to merge the unary/binary shuffle combining paths.
2021-03-29 11:31:19 +01:00
Craig Topper 69bdf35dc7 [X86] Optimize vXi8 MULHS on targets where we can't sign_extend to the next register size.
For these cases we need to extract the upper or lower elements,
multiply them using 16-bit multiplies and repack them.

Previously we used punpcklbw/punpckhbw+psraw or pmovsxbw+pshudfd to
extract and sign extend so we could use pmullw to compute the 16-bit
product and then shift down the high bits.

We can avoid the need to sign extend if we unpack the bytes into
the high byte of each word and fill the lower byte with 0 using
pxor. This puts the sign bit of each byte into the sign bit of
each word. Since the LHS and RHS have 8 trailing zeros, the full
32-bit product of those 16-bit values will have 16 trailing zeros.
This means the 16-bit product of the original bytes is in the upper
16 bits which we can calculate using pmulhw.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98587
2021-03-28 11:41:29 -07:00
Simon Pilgrim 2a0d5da917 [X86][SSE] foldShuffleOfHorizOp - remove broadcast handling.
Remove VBROADCAST/MOVDDUP/splat-shuffle handling from foldShuffleOfHorizOp

This can all be handled by canonicalizeShuffleMaskWithHorizOp along as we check that the HADD/SUB are only used once (to prevent infinite loops on slow-horizop targets which will try to reuse the nodes again followed by a post-hop shuffle).
2021-03-27 15:09:23 +00:00
Simon Pilgrim 41146bfe82 [X86][SSE] combineX86ShuffleChain - attempt to recognise 'hidden' identity shuffles
See if the combined shuffle mask is equivalent to an identity shuffle, typically this is due to repeated LHS/RHS ops in horiz-ops, but isTargetShuffleEquivalent might see other patterns as well.

This is another small step towards getting rid of foldShuffleOfHorizOp and relying on canonicalizeShuffleMaskWithHorizOp and generic shuffle combining.
2021-03-27 11:09:30 +00:00
Sanjay Patel a283d72583 [x86] prevent crashing while matching pmaddwd
This could crash in 2 ways: either one or both of
the input vectors could be a different size than
the math ops.

https://llvm.org/PR49716
2021-03-27 05:27:14 -04:00
Simon Pilgrim c769ba9514 [X86][AVX] combineHorizOpWithShuffle - improve SHUFFLE(HOP(LOSUBVECTOR(X),HISUBVECTOR(X))) folding
Peek through bitcasts to find subvector splits and use getTargetShuffleInputs to decode target shuffles as well as ShuffleVectorSDNode
2021-03-26 17:23:54 +00:00
Simon Pilgrim 36e3c6c841 [X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets
Until AVX512 we don't have any vector truncation instructions, and always lower using shuffles instead.

combineVectorTruncation performs this earlier than lowering as it makes it easier to use any sign/zero-extended bits in the truncated bits with PACKSS/PACKUS to perform the shuffle.

We currently don't attempt to use combineVectorTruncation on AVX2 targets as in the past 256-bit PACKSS/PACKUS tended to cause 128-bit lane shuffle regressions - but these should now be all resolved with combineHorizOpWithShuffle and in all cases we now reduce the amount of cross-lane shuffling and variable shuffle mask usage.

Differential Revision: https://reviews.llvm.org/D96609
2021-03-25 10:34:34 +00:00
Simon Pilgrim 9fde88c3e2 [X86][AVX] splitIntVSETCC - handle separate (canonicalized) SETCC operands
LowerVSETCC calls splitIntVSETCC after canonicalizing certain patterns, in particular (X & CPow2 != 0) -> (X & CPow2 == CPow2).

Unfortunately if we're splitting for AVX1/non-AVX512BW cases, we lose these canonicalizations as we call the split with the original SetCC node, and when the split nodes are later lowered in LowerVSETCC the patterns are lost behind extract_subvector etc. But if we pass the canonicalized operands for splitting we retain the optimizations.

Differential Revision: https://reviews.llvm.org/D99256
2021-03-25 10:18:44 +00:00
Sander de Smalen 55d18b3cc2 [TTI] Return a TypeSize from getRegisterBitWidth.
This patch changes the interface to take a RegisterKind, to indicate
whether the register bitwidth of a scalar register, fixed-width vector
register, or scalable vector register must be returned.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D98874
2021-03-24 14:45:13 +00:00
Simon Pilgrim 7920527796 [X86][AVX] combineBitcastvxi1 - improve handling of vectors truncated to vXi1
If we're truncating to vXi1 from a wider type, then prefer the original wider vector as is simplifies folding the separate truncations + extensions.

AVX1 this is only worth it for v8i1 cases, not v4i1 where we're always better off truncating down to v4i32 for movmsk.

Helps with some regressions encountered in D96609
2021-03-24 14:05:59 +00:00
Simon Pilgrim e9015bd595 [X86][AVX] lowerShuffleAsBroadcast - MOVDDUP(SCALAR_TO_VECTOR(X)) -> BROADCAST(X)
Prefer broadcast from scalar on AVX targets as this makes it easier for later folds to strip away bitcasts etc.

This helps a lot with the AVX1 poor codegen from PR49658.

There's a trivial regression in bitcast-int-to-vector-bool-*ext.ll tests due to SimplifyDemandedBits not being able to see a multi-use case, but there's bigger existing codegen issues to be addressed first in those tests (unnecessary NOTs).
2021-03-24 11:31:56 +00:00
Simon Pilgrim c1ef642ad8 [X86] Remove unused 'OneUse' option from IsNOT helper. NFCI. 2021-03-24 11:14:38 +00:00
Craig Topper 6204ac4536 [X86] Bale out of X86FastISel::X86SelectCmp for vectors.
None of the code in this function was written to handle
vectors.  Most of the cases already fail for vectors for one
reason or another. The exception is an optimization that
detects identical operands. This can be triggered by vectors,
but the code always creates a 0 or 1 constants in a scalar
register which is incorrect for vectors.

Fixes PR49706.
2021-03-23 20:16:04 -07:00
Simon Pilgrim 080cb83e52 [X86][AVX] Narrow VPBROADCASTQ->VPBROADCASTD if we don't need the upper bits.
Helps fix cases where we've splatted smaller types to a wider vector element type without needing the upper bits.

Avoid this on AVX512 targets as that can affect broadcast folding.
2021-03-23 09:41:02 +00:00
serge-sans-paille e617cf9576 [NFC] Restore original SmallString size for X86TargetMachine::getSubtargetImpl lookup
Better safe than sorry here, quoting Craig Topper:

> Clang passes a pretty lengthy feature string.
2021-03-22 19:19:46 +01:00
serge-sans-paille b2f7ce91a6 [NFC] Simpler and faster key computation for getSubtargetImpl memoization
There's no use in computing a large key that's only used for a memoization
optimization.
2021-03-22 10:02:51 +01:00
Bing1 Yu 113f077f80 [X86] Pass to transform tdpbf16ps intrinsics to scalar operation.
In previous patch https://reviews.llvm.org/D93594, we only scalarize tilezero, tileload, tilestore and tiledpbssd. In this patch we scalarize tdpbf16ps intrinsic.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D96110
2021-03-22 13:00:40 +08:00
Simon Pilgrim 3179588947 [X86][AVX] ComputeNumSignBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Added as an extension to PR49658
2021-03-21 12:22:51 +00:00
Simon Pilgrim 297b9bc3fa [X86][AVX] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Suggested by @craig.topper on PR49658
2021-03-21 10:40:57 +00:00
Simon Pilgrim 54a05f2ec8 [X86] computeKnownBitsForTargetNode - add X86ISD::PMULUDQ handling
Reuse the existing KnownBits multiplication code to handle what is effectively a ISD::UMUL_LOHI varient
2021-03-21 09:57:20 +00:00
Wang, Pengfei 2327513b85 [X86] Fix a bug when calculating the ldtilecfg insertion points.
The BB we initialized the ldtilecfg is special. We don't need to check
if its predecessor BBs need to insert ldtilecfg for calls.

We reused the flag HasCallBeforeAMX, so that the predecessors won't be
added to CfgNeedInsert.

This case happens only when the entry BB is in a loop. We need to hoist
the first tile config point out of the loop in future.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D98845
2021-03-20 17:48:59 +08:00
Fangrui Song c241659d15 [X86] Fix -Wunused-function in -DLLVM_ENABLE_ASSERTIONS=off builds 2021-03-18 23:22:58 -07:00
Bing1 Yu 0002d4bf36 [X86][AMX][NFC] Give correct Passname for Tile Register Pre-configure 2021-03-18 17:15:44 +08:00