Commit Graph

7770 Commits

Author SHA1 Message Date
Liu, Chen3 57e8f840b6 [X86][FP16] Fix a bug when Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A).
This bug was introduced by D109953. The operand order of generated FMA
is wrong.

Differential Revision: https://reviews.llvm.org/D110606
2021-09-28 11:38:53 +08:00
Simon Pilgrim 468ff703e1 [X86] combineVectorHADDSUB - remove the broken HOP(x,x) merging code (PR51974)
This intention of this code turns out to be superfluous as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.

Fixes PR51974
2021-09-27 10:41:22 +01:00
Freddy Ye 902ec6142a [X86][ISel] Lowering FROUND(f16) and FROUNDEVEN(f16)
When AVX512FP16 is enabled, FROUND(f16) cannot be dealt with
TypeLegalize, and no libcall in libm is ready for fround(f16) now.
FROUNDEVEN(f16) has related instruction in AVX512FP16.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110312
2021-09-27 13:35:03 +08:00
Simon Pilgrim c0eff50fc5 [X86][SSE] combineMulToPMADDWD - enable sext_extend_vector_inreg(vXi16) -> zext_extend_vector_inreg(vXi16) fold
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
2021-09-26 19:37:23 +01:00
Simon Pilgrim ed3e4917b3 [X86] Fold PACK(*_EXTEND_VECTOR_INREG, UNDEF) -> *_EXTEND_VECTOR_INREG
For 128-bit vectors, we can remove a PACK of a EXTEND_VECTOR_INREG node and just create a smaller extension to the result/packed type.
2021-09-26 19:37:22 +01:00
Simon Pilgrim 3fe9767204 [X86] Fold ADD(VPMADDWD(X,Y),VPMADDWD(Z,W)) -> VPMADDWD(SHUFFLE(X,Z), SHUFFLE(Y,W))
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. its zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.

There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
2021-09-26 18:08:29 +01:00
Simon Pilgrim eb7c78c2c5 [X86][SSE] combineMulToPMADDWD - mask off upper bits of sign-extended vXi32 constants
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
2021-09-25 15:50:45 +01:00
Simon Pilgrim 2a4fa0c27c [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on sub-128 bit vectors 2021-09-25 15:50:45 +01:00
Simon Pilgrim f5a26ccae2 [X86][SSE] combineMulToPMADDWD - enable sext(v8i16) -> zext(v8i16) fold on pre-SSE41 targets
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
2021-09-25 14:35:31 +01:00
Simon Pilgrim a25f25c3b7 [X86] combineShiftToPMULH - relax from ISA from SSE41 to SSE2
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.

We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
2021-09-25 14:08:03 +01:00
Simon Pilgrim d8fc9f8727 [X86][SSE] combineMulToPMADDWD - replace sext(v8i16) -> zext(v8i16)
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
2021-09-24 16:42:01 +01:00
Sanjay Patel 09e71c367a [x86] convert logic-of-FP-compares to FP logic-of-vector-compares
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.

By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.

Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.

Differential Revision: https://reviews.llvm.org/D110342
2021-09-24 11:38:19 -04:00
Sanjay Patel 74ba4b769a [x86] move combiner state check into convertIntLogicToFPLogic(); NFC
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
2021-09-23 14:28:22 -04:00
Liu, Chen3 76656ec8ec [X86][FP16] Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A)
This patch is to support transform something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
to _mm512_fmadd_pch(a, b, acc).

Differential Revision: https://reviews.llvm.org/D109953
2021-09-23 15:37:08 +08:00
Wang, Pengfei ebec077e07 [X86][FP16] Change the order of the operands in complex FMA intrinsics to allow swap between the mul operands.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D109658
2021-09-23 11:02:48 +08:00
Kirill Stoimenov 2649999579 [asan] Fixed a bug causing a crash when redzone optimization kicked in on X86 with -asan-optimize-callbacks flag on.
This change adds the ASan intrinsic to the list whihc are setting hasCopyImplyingStackAdjustment.

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D110012
2021-09-21 22:26:03 +00:00
Amara Emerson 4ceea77409 [X86] Rename the X86WinAllocaExpander pass and related symbols to "DynAlloca". NFC.
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.

Differential Revision: https://reviews.llvm.org/D109993
2021-09-20 16:19:28 -07:00
Simon Pilgrim 0e89ff8195 [X86] SimplifyDemandedBits - only narrow a broadcast source if we only have one use.
Helps with the regression noted on D109065 - don't truncate a broadcast source if the source has multiple uses.
2021-09-19 22:53:30 +01:00
Simon Pilgrim b7342e3137 [X86] Fold SHUFPS(shuffle(x),shuffle(y),mask) -> SHUFPS(x,y,mask')
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.

Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
2021-09-19 20:39:19 +01:00
Roman Lebedev 5f2fe48d06
[X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart broadcasts from which not just the 0'th elt is demanded
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.

It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be preset in several elements,
not just the 0'th one.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108411
2021-09-19 17:38:32 +03:00
Roman Lebedev 07f1d8f0ca
[X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108382
2021-09-19 17:35:37 +03:00
Roman Lebedev 0852313e47
[NFC] combineX86ShufflesRecursively(): actually address nits for previous patch 2021-09-19 17:29:18 +03:00
Roman Lebedev 1e72ca94e5
[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursing
This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
and it avoids regressions in that patch.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109065
2021-09-19 17:24:58 +03:00
Roman Lebedev 6a2c2263fb
[X86] Improve i8 all-ones element insertion in pre-SSE4.1
Should avoid some regressions in D109065

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109989
2021-09-18 22:24:06 +03:00
Roman Lebedev 358df06f4e
[X86] Improve `matchBinaryShuffle()`'s `BLEND` lowering with per-element all-zero/all-ones knowledge
We can use `OR` instead of `BLEND` if either the element we are not picking is zero (or masked away);
or the element we are picking overwhelms (e.g. it's all-ones) whatever the element we are not picking:
https://alive2.llvm.org/ce/z/RKejao

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109726
2021-09-17 19:13:33 +03:00
Simon Pilgrim 1ef62cb200 [X86] SimplifyDemandedVectorEltsForTargetNode - add PSADBW handling
Peek through PSADBW operands to handle non demanded elements.
2021-09-16 11:28:31 +01:00
Simon Pilgrim dcba994184 [X86] combineX86ShuffleChain - ensure we only peek through bitcasts to vectors (PR51858)
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.
2021-09-15 10:21:05 +01:00
Craig Topper 9af8f1b18e [SelectionDAG] Add isZero/isAllOnes methods to ConstantSDNode.
Soft deprecrate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.

Reviewed By: lattner

Differential Revision: https://reviews.llvm.org/D109535
2021-09-09 13:28:30 -07:00
Chris Lattner d51da74889 [CodeGen] Use DAG.getAllOnesConstant where possible to simplify code. NFC. 2021-09-09 10:22:51 -07:00
Craig Topper 124bcc1a13 [X86] Disable muloti4 libcalls for x86-64.
This library function only exists in compiler-rt not libgcc. So
this would fail to link unless we were linking with compiler-rt.

This is consistent with the recent removal of calls to mulodi4 on
32-bit targets like D108928.

I suppose maybe we could keep the libcalls for platforms like
Darwin that use compiler-rt exclusively?

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D109385
2021-09-09 10:03:15 -07:00
Chris Lattner 735f46715d [APInt] Normalize naming on keep constructors / predicate methods.
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`.  This achieves two things:

1) This starts standardizing predicates across the LLVM codebase,
   following (in this case) ConstantInt.  The word "Value" doesn't
   convey anything of merit, and is missing in some of the other things.

2) Calling an integer "null" doesn't make any sense.  The original sin
   here is mine and I've regretted it for years.  This moves us to calling
   it "zero" instead, which is correct!

APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go.  As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.

Included in this patch are changes to a bunch of the codebase, but there are
more.  We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.

Differential Revision: https://reviews.llvm.org/D109483
2021-09-09 09:50:24 -07:00
Akira Hatanaka dea6f71af0 [ObjC][ARC] Use the addresses of the ARC runtime functions instead of
integer 0/1 for the operand of bundle "clang.arc.attachedcall"

https://reviews.llvm.org/D102996 changes the operand of bundle
"clang.arc.attachedcall". This patch makes changes to llvm that are
needed to handle the new IR.

This should make it easier to understand what the IR is doing and also
simplify some of the passes as they no longer have to translate the
integer values to the runtime functions.

Differential Revision: https://reviews.llvm.org/D103000
2021-09-08 11:58:03 -07:00
Nick Desaulniers d0eeb64be5 [X86ISelLowering] avoid emitting libcalls to __mulodi4()
Similar to D108842, D108844, and D108926.

__has_builtin(builtin_mul_overflow) returns true for 32b x86 targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ARCH=i386 + CONFIG_BLK_DEV_NBD=y builds of the Linux kernel
that are using builtin_mul_overflow with these types for these targets.

If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.

This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.

Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://bugs.llvm.org/show_bug.cgi?id=35922
Link: https://github.com/ClangBuiltLinux/linux/issues/1438

Reviewed By: lebedev.ri, RKSimon

Differential Revision: https://reviews.llvm.org/D108928
2021-09-07 10:44:54 -07:00
Simon Pilgrim d66d520fe1 [X86][SSE] combineMulToPMADDWD - improve recognition of sign/zero extended upper bits
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }

Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. the first half is positive and the second half of the pair is zero in each 2xi16 pair), this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.

That way the sign of the result is still preserved, and the second half is still zero.

Noticed while investigating PR47437.

Differential Revision: https://reviews.llvm.org/D108522
2021-09-02 17:36:22 +01:00
Roman Lebedev 3f1f08f0ed
Revert @llvm.isnan intrinsic patchset.
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)

TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.

While the end result of discussion may lead back to the current design,
it may also not lead to the current design.

Therefore i take it upon myself
to revert the tree back to last known good state.

This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
2021-09-02 13:53:56 +03:00
Simon Pilgrim b0acd6c369 [X86] Fold PMADD(x,0) or PMADD(0,x) -> 0
Pulled out of D108522 - handle zero-operand cases for PMADDWD/VPMADDUBSW ops
2021-09-02 10:48:50 +01:00
Wang, Pengfei 74043caef2 [X86] Enable half type support in inline assembly constraints
Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105799
2021-09-01 09:29:31 +08:00
Wang, Pengfei ab40dbfe03 [X86] AVX512FP16 instructions enabling 6/6
Enable FP16 complex FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105269
2021-08-30 13:08:45 +08:00
Roman Lebedev 6734018041
[Codegen][X86] EltsFromConsecutiveLoads(): if only have AVX1, ensure that the "load" is actually foldable (PR51615)
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390

In this case, we completely don't check that the "broadcast-from-mem" we create
can actually fold the load. In this case, it's operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
  t7: i64 = extract_vector_elt t5, Constant:i64<0>
    t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t1: i64 = Register %0
      t4: i64 = undef
    t3: i64 = Constant<0>
Combining: t15: v8i32 = undef

```

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108821
2021-08-27 20:26:53 +03:00
Serge Pavlov cdbe569fb6 [X86] Implement llvm.isnan(x86_fp80) as unordered comparison
x86_fp80 format allows values that do not fit any of IEEE-754 category.
Previously they were recognized by intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using instruction FXAM, which
distinguish between NaNs and unsupported values. It can make some
programs behave differently.

As a solution, this fix changes lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into unordered
comparison, as __buildtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaN, as required by IEEE-754 and C
standards.

Differential Revision: https://reviews.llvm.org/D108037
2021-08-27 18:06:07 +07:00
Nathan Sidwell 199ac3a839 [NFC][X86] Sret return register cleanup
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.

Differential Revision: https://reviews.llvm.org/D108762
2021-08-27 04:03:49 -07:00
Roman Lebedev a8125bf4a8
[X86][Codegen] PR51615: don't replace wide volatile load with narrow broadcast-from-memory
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.

We can not replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D108757
2021-08-26 18:46:49 +03:00
Nathan Sidwell ab55cc6cef [X86] pr51000 in-register struct return tailcalling
In-register structure returns are not special, and handled by lowering
to multiple-value tuples.  We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.

Differential Revision: https://reviews.llvm.org/D105807
2021-08-25 10:15:50 -07:00
Simon Pilgrim 307890f85b [X86] Freeze vXi8 shl(x,1) -> add(x,x) vector fold (PR50468)
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468.

Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.

Differential Revision: https://reviews.llvm.org/D108139
2021-08-24 16:08:24 +01:00
Liu, Chen3 b7795eb646 [X86] Building constant vector which element type is half will cause assertion fail.
Fix assertion fail when building con constant vector which element type is half.

Differential Revision: https://reviews.llvm.org/D108612
2021-08-24 14:34:30 +08:00
Wang, Pengfei c728bd5bba [X86] AVX512FP16 instructions enabling 5/6
Enable FP16 FMA instructions.

Ref.: https://software.intel.com/content/www/us/en/develop/download/intel-avx512-fp16-architecture-specification.html

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D105268
2021-08-24 09:07:19 +08:00
Simon Pilgrim 805fb1f6c1 [X86] combineMul - move MUL_IMM comment inside function. NFC.
combineMul is now used for other things as well as the mul-with-constant expansion - move the comment to where its actually relevant.
2021-08-22 18:27:03 +01:00
Simon Pilgrim 352df10a23 [X86][AVX] matchShuffleAsBlend - use isElementEquivalent to help match broadcast/repeated elements
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but use isElementEquivalent to determine if the shuffle mask's referenced element is the same as the in-place element.

This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
2021-08-22 15:26:17 +01:00
Simon Pilgrim 96fb3eef66 Fix signed/unsigned comparison warning. NFCI. 2021-08-22 15:02:19 +01:00
Simon Pilgrim a1c892b439 [X86][SSE] lowerVECTOR_SHUFFLE - canonicalize with horizontal ops.
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
2021-08-22 14:54:48 +01:00