Commit Graph

21905 Commits

Author SHA1 Message Date
Liu, Chen3 57e8f840b6 [X86][FP16] Fix a bug when Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A).
This bug was introduced by D109953. The operand order of generated FMA
is wrong.

Differential Revision: https://reviews.llvm.org/D110606
2021-09-28 11:38:53 +08:00
Roman Lebedev 2a7a768dad
[X86][Costmodel] Load/store i16 Stride=4 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For this tuple, measuring becomes problematic since there's a lot of spilling going on,
but apparently all these memory ops do not affect worst-case estimate at all here.

For load we have:
https://godbolt.org/z/zP4hd8MT6 - for intels `Block RThroughput: =150.0`; for ryzens, `Block RThroughput: <=59`
So pick cost of `150`.

For store we have:
https://godbolt.org/z/vKb8zTK8E - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=24.0`
So pick cost of `64`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110548
2021-09-27 22:20:01 +03:00
Roman Lebedev ee5a050e2e
[X86][Costmodel] Load/store i16 Stride=4 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =75.0`; for ryzens, `Block RThroughput: <=29.5`
So pick cost of `75`. (note that `# 32-byte Reload` does not affect throughput there.)

For store we have:
https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=12.0`
So pick cost of `32`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110543
2021-09-27 22:20:01 +03:00
Roman Lebedev 5615d6a6dd
[X86][Costmodel] Load/store i16 Stride=4 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/dd8T5P471 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=14.5`
So pick cost of `33`.

For store we have:
https://godbolt.org/z/zPxcKWhn4 - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=6.0`
So pick cost of `10`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110541
2021-09-27 22:20:01 +03:00
Roman Lebedev df2b42d12e
[X86][Costmodel] Load/store i16 Stride=4 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/rnsf639Wh - for intels `Block RThroughput: =17.0`; for ryzens, `Block RThroughput: <=7.5`
So pick cost of `17`.

For store we have:
https://godbolt.org/z/565KKrcY6 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `6`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110537
2021-09-27 22:20:01 +03:00
Roman Lebedev 45caac91c4
[X86][Costmodel] Load/store i16 Stride=4 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/5EYc6r9nh - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0`
So pick cost of `6`.

For store we have:
https://godbolt.org/z/z61e5d6GE - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110536
2021-09-27 22:20:01 +03:00
Roman Lebedev 7424deb743
[X86][Costmodel] Load/store i16 Stride=2 VF=32 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/q6GbK89br - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: <=7.0`
So pick cost of `18`.

For store we have:
https://godbolt.org/z/Yzfoo5TnW - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0`
So pick cost of `8`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110507
2021-09-27 14:21:12 +03:00
Roman Lebedev a5113e9445
[X86][Costmodel] Load/store i16 Stride=2 VF=16 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.5`
So pick cost of `9`.

For store we have:
https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `4`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110506
2021-09-27 14:20:11 +03:00
Roman Lebedev 70c90cc5bd
[X86][Costmodel] Load/store i16 Stride=2 VF=8 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/e5YE99a4P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0`
So pick cost of `6`.

For store we have:
https://godbolt.org/z/3vM4KsE1n - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0`
So pick cost of `3`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110505
2021-09-27 14:18:29 +03:00
Roman Lebedev 49e532aa52
[X86][Costmodel] Load/store i16 Stride=2 VF=4 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/1j3nf3dro - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/4n1zvP37j - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110504
2021-09-27 14:15:25 +03:00
Simon Pilgrim 468ff703e1 [X86] combineVectorHADDSUB - remove the broken HOP(x,x) merging code (PR51974)
This intention of this code turns out to be superfluous as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.

Fixes PR51974
2021-09-27 10:41:22 +01:00
Freddy Ye 902ec6142a [X86][ISel] Lowering FROUND(f16) and FROUNDEVEN(f16)
When AVX512FP16 is enabled, FROUND(f16) cannot be dealt with
TypeLegalize, and no libcall in libm is ready for fround(f16) now.
FROUNDEVEN(f16) has related instruction in AVX512FP16.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110312
2021-09-27 13:35:03 +08:00
Simon Pilgrim c0eff50fc5 [X86][SSE] combineMulToPMADDWD - enable sext_extend_vector_inreg(vXi16) -> zext_extend_vector_inreg(vXi16) fold
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
2021-09-26 19:37:23 +01:00
Simon Pilgrim ed3e4917b3 [X86] Fold PACK(*_EXTEND_VECTOR_INREG, UNDEF) -> *_EXTEND_VECTOR_INREG
For 128-bit vectors, we can remove a PACK of a EXTEND_VECTOR_INREG node and just create a smaller extension to the result/packed type.
2021-09-26 19:37:22 +01:00
Simon Pilgrim 3fe9767204 [X86] Fold ADD(VPMADDWD(X,Y),VPMADDWD(Z,W)) -> VPMADDWD(SHUFFLE(X,Z), SHUFFLE(Y,W))
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. its zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.

There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
2021-09-26 18:08:29 +01:00
Roman Lebedev d9413f46b3
[X86][Costmodel] Load/store i16 VF=2 interleaving costs
The only sched models that for cpu's that support avx2
but not avx512 are: haswell, broadwell, skylake, zen1-3

For load we have:
https://godbolt.org/z/M8vEKs5jY - for intels `Block RThroughput: =2.0`;
                                  for ryzens, `Block RThroughput: <=1.0`
So pick cost of `2`.

For store we have:
https://godbolt.org/z/Kx1nKz7je - for intels `Block RThroughput: =1.0`;
                                  for ryzens, `Block RThroughput: <=0.5`
So pick cost of `1`.

I'm directly using the shuffling asm the llc produced,
without any manual fixups that may be needed
to ensure sequential execution.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103144
2021-09-26 19:13:23 +03:00
Simon Pilgrim 3538ee763d [CostModel][X86] Improve AVX1/AVX2 v16i32->v16i16/v16i8 truncation costs (PR51972)
Based off worst case btver2 (AVX1) and haswell (AVX2) llvm-mca reports
2021-09-26 13:43:46 +01:00
Simon Pilgrim 8c83bd3bd4 [CostModel][X86] Adjust vXi32 multiply costs if it can be performed using PMADDWD
Update the costs to match the codegen from combineMulToPMADDWD - not only can we use PMADDWD is its zero-extended, but also if its a constant or sign-extended from a vXi16 (which can be replaced with a zero-extension).
2021-09-25 16:28:48 +01:00
Simon Pilgrim eb7c78c2c5 [X86][SSE] combineMulToPMADDWD - mask off upper bits of sign-extended vXi32 constants
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
2021-09-25 15:50:45 +01:00
Simon Pilgrim 2a4fa0c27c [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on sub-128 bit vectors 2021-09-25 15:50:45 +01:00
Simon Pilgrim f5a26ccae2 [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on pre-SSE41 targets
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
2021-09-25 14:35:31 +01:00
Simon Pilgrim 4c72b10f0a [X86] X86FastISel::fastMaterializeConstant - break if-else chain to fix llvm-else-after-return warning. NFCI
All previous if-else cases return
2021-09-25 14:31:14 +01:00
Simon Pilgrim a25f25c3b7 [X86] combineShiftToPMULH - relax from ISA from SSE41 to SSE2
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.

We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
2021-09-25 14:08:03 +01:00
Simon Pilgrim d8fc9f8727 [X86][SSE] combineMulToPMADDWD - replace sext(v8i16) -> zext(v8i16)
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
2021-09-24 16:42:01 +01:00
Sanjay Patel 09e71c367a [x86] convert logic-of-FP-compares to FP logic-of-vector-compares
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.

By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.

Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.

Differential Revision: https://reviews.llvm.org/D110342
2021-09-24 11:38:19 -04:00
Simon Pilgrim dade83c02a [X86][SLM] Fix ADDQ/SUBQ/CMPEQQ throughput to account for running on either port.
Testing on a SLM box suggests these can run on either port, but the throughput is 4cy on either (inc MMX versions). Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-24 10:06:14 +01:00
Nico Weber 3fa43da7a3 [llvm] Fix a copy-pasto
We should use IMAGE_REL_I386_SECREL in the i386 section of this file.

IMAGE_REL_I386_SECREL and IMAGE_REL_AMD64_SECREL have the same
numeric value 0xB, so this doesn't change behavior.
2021-09-23 15:34:01 -04:00
Sanjay Patel 74ba4b769a [x86] move combiner state check into convertIntLogicToFPLogic(); NFC
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
2021-09-23 14:28:22 -04:00
Simon Pilgrim c931d35216 [CostModel][X86] Increase i64 mul cost from 1 to 2
Only the most recent cpus support really 1cy 64-bit multiplies, and the X64 cost table represents a realistic worst case. The 1cy value was also discouraging vectorization when most vXi64 PMULDQ expansions aren't actually slower than scalarization.

Noticed while investigating PR51436.
2021-09-23 14:48:21 +01:00
Jay Foad 6cef28ed2d [TII] Remove the MFI argument to convertToThreeAddress. NFC.
This simplifies the API and addresses a FIXME in
TwoAddressInstructionPass::convertInstTo3Addr.

Differential Revision: https://reviews.llvm.org/D110229
2021-09-23 08:58:46 +01:00
Liu, Chen3 76656ec8ec [X86][FP16] Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A)
This patch is to support transform something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
to _mm512_fmadd_pch(a, b, acc).

Differential Revision: https://reviews.llvm.org/D109953
2021-09-23 15:37:08 +08:00
Wang, Pengfei ebec077e07 [X86][FP16] Change the order of the operands in complex FMA intrinsics to allow swap between the mul operands.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D109658
2021-09-23 11:02:48 +08:00
Hongtao Yu d9b511d8e8 [CSSPGO] Set PseudoProbeInserter as a default pass.
Currenlty PseudoProbeInserter is a pass conditioned on a target switch. It works well with a single clang invocation. It doesn't work so well when the backend is called separately (i.e, through the linker or llc), where user has always to pass -pseudo-probe-for-profiling explictly. I'm making the pass a default pass that requires no command line arg to trigger, but will be actually run depending on whether the CU comes with `llvm.pseudo_probe_desc` metadata.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D110209
2021-09-22 09:09:48 -07:00
Simon Pilgrim b1f38a27f0 [Target][CodeGen] Remove default CostKind arguments on inner/impl TTI overrides
Based off a discussion on D110100, we should be avoiding default CostKinds whenever possible.

This initial patch removes them from the 'inner' target implementation callbacks - these should only be used by the main TTI calls, so this should guarantee that we don't cause changes in CostKind by missing it in an inner call. This exposed a few missing arguments in getGEPCost and reduction cost calls that I've cleaned up.

Differential Revision: https://reviews.llvm.org/D110242
2021-09-22 15:28:08 +01:00
Kirill Stoimenov 2649999579 [asan] Fixed a bug causing a crash when redzone optimization kicked in on X86 with -asan-optimize-callbacks flag on.
This change adds the ASan intrinsic to the list whihc are setting hasCopyImplyingStackAdjustment.

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D110012
2021-09-21 22:26:03 +00:00
Craig Topper b81e26c7f4 Recommit "[X86] Clear kill flags when rewriting SETCC uses in flag copy lowering."
This time with the right bug number.

When we rewrite the setcc we replace set old setcc output register
with the new CondReg. But since CondReg can be shared by other
replacements, we don't know if the kill flags for the old register
are valid for CondReg. So be conservative and remove them.

The test case has a SETCCr and a SETCCm on the same condition so
they end up sharing the same CondReg. The SETCCr had one use with
a kill flag. This kill flag isn't valid after the replacement because
CondReg needs a live range extending to the later SETCCm replacment.

Fixes PR51903.
2021-09-21 14:59:25 -07:00
Craig Topper 51a82e051e Revert "[X86] Clear kill flags when rewriting SETCC uses in flag copy lowering."
This reverts commit 7550f146ff.

I botched the bug number.
2021-09-21 14:33:44 -07:00
Craig Topper 7550f146ff [X86] Clear kill flags when rewriting SETCC uses in flag copy lowering.
When we rewrite the setcc we replace set old setcc output register
with the new CondReg. But since CondReg can be shared by other
replacements, we don't know if the kill flags for the old register
are valid for CondReg. So be conservative and remove them.

The test case has a SETCCr and a SETCCm on the same condition so
they end up sharing the same CondReg. The SETCCr had one use with
a kill flag. This kill flag isn't valid after the replacement because
CondReg needs a live range extending to the later SETCCm replacment.

Fixes PR51908.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110046
2021-09-21 14:29:46 -07:00
Kazu Hirata 85b4b21c8b [llvm] Use make_early_inc_range (NFC) 2021-09-20 19:30:02 -07:00
Amara Emerson 4ceea77409 [X86] Rename the X86WinAllocaExpander pass and related symbols to "DynAlloca". NFC.
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.

Differential Revision: https://reviews.llvm.org/D109993
2021-09-20 16:19:28 -07:00
Simon Pilgrim 4ab7c0d3fa [X86] X86TargetTransformInfo - remove unnecessary if-else after early exit. NFCI.
(style) Break the if-else chain as they all return.
2021-09-20 12:53:17 +01:00
Simon Pilgrim 0e89ff8195 [X86] SimplifyDemandedBits - only narrow a broadcast source if we only have one use.
Helps with the regression noted on D109065 - don't truncate a broadcast source if the source has multiple uses.
2021-09-19 22:53:30 +01:00
Simon Pilgrim f855ef2601 [X86][Atom] Fix FP uops + port usage
Both ports are required in most cases. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.

Noticed while trying to improve fp costs for vectorization via the D103695 helper script.
2021-09-19 20:39:20 +01:00
Simon Pilgrim b7342e3137 [X86] Fold SHUFPS(shuffle(x),shuffle(y),mask) -> SHUFPS(x,y,mask')
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.

Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
2021-09-19 20:39:19 +01:00
Simon Pilgrim cf8fac7d07 [X86][Atom] Specific uops for all IMUL/IDIV instructions
Based off a mixture of llvm-exegesis captures (PR36895) and Intel AoM / Agner / InstLatX64 reports.
2021-09-19 16:58:52 +01:00
Roman Lebedev 5f2fe48d06
[X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart broadcasts from which not just the 0'th elt is demanded
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.

It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be preset in several elements,
not just the 0'th one.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108411
2021-09-19 17:38:32 +03:00
Roman Lebedev 07f1d8f0ca
[X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108382
2021-09-19 17:35:37 +03:00
Roman Lebedev 0852313e47
[NFC] combineX86ShufflesRecursively(): actually address nits for previous patch 2021-09-19 17:29:18 +03:00
Roman Lebedev 1e72ca94e5
[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursing
This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
and it avoids regressions in that patch.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109065
2021-09-19 17:24:58 +03:00
Simon Pilgrim e381d8b243 [X86][Atom] Fix (U)COMISS/SD uops, latency and throughput
Both ports are required, for reg and mem variants - we can also use the WriteFComX class directly and remove the unnecessary InstRW overrides. Matches what Intel AoM / Agner / InstLatX64 report as well.
2021-09-19 12:44:44 +01:00