Commit Graph

7719 Commits

Author SHA1 Message Date
Nick Desaulniers ea8416bf4d [CodeGenOptions] make StackProtectorGuardOffset signed
GCC supports negative values for -mstack-protector-guard-offset=, this
should be a signed value. Pre-req to D100919.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D101325
2021-04-27 10:12:58 -07:00
Sander de Smalen 43ace8b5ce [TTI] NFC: Change getScalingFactorCost to return InstructionCost
This patch migrates the TTI cost interfaces to return an InstructionCost.

See this patch for the introduction of the type: https://reviews.llvm.org/D91174
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html

Differential Revision: https://reviews.llvm.org/D100564
2021-04-23 16:06:36 +01:00
Simon Pilgrim 7b32e8b96a [X86] combineSetCCAtomicArith - pull out repeated ops. NFCI.
Reduces diff in D101074
2021-04-23 14:19:24 +01:00
Simon Pilgrim a511b55cfd [X86][SSE] getFauxShuffleMask - don't decode OR(SHUFFLE,SHUFFLE) containing UNDEFs. (PR50049)
PR50049 demonstrated an infinite loop between OR(SHUFFLE,SHUFFLE) <-> BLEND(SHUFFLE,SHUFFLE) patterns.

The UNDEF elements were allowing a combined shuffle mask to be widened which lost the undef element, resulting us needing to use the BLEND pattern (as the undef element would need to be zero for the OR pattern). But then bitcast folds would re-expose the undef element allowing us to use OR again.....
2021-04-21 18:47:00 +01:00
Simon Pilgrim 2a419a0b99 [X86][SSE] combineX86ShuffleChain - check if we're blending with zero into already zero elements
Add a SelectionDAG::MaskedElementsAreZero helper that wraps SelectionDAG::MaskedValueIsZero testing for entirely zero vector elements
2021-04-20 17:09:49 +01:00
Simon Pilgrim 9d57a77b81 [X86] combineCMP - fold cmpEQ/NE(TRUNC(X),0) -> cmpEQ/NE(X,0)
If we are truncating from a i32 source before comparing the result against zero, then see if we can directly compare the source value against zero.

If the upper (truncated) bits are known to be zero then we can compare against that, hopefully increasing the chances of us folding the compare into a EFLAG result of the source's operation.

Fixes PR49028.

Differential Revision: https://reviews.llvm.org/D100491
2021-04-15 13:55:51 +01:00
Simon Pilgrim 4fbe761572 [X86][SSE] canonicalizeShuffleWithBinOps - check for more combos of merge-able binary shuffles.
In the fold SHUFFLE(BINOP(X,Y),BINOP(Z,W)) -> BINOP(SHUFFLE(X,Z),SHUFFLE(Y,W)), check if both X/Z AND Y/W have at least one merge-able shuffle in which case the total number of shuffle should still fall.

Helps with instruction count regressions we saw while fixing PR48823
2021-04-14 15:24:41 +01:00
Simon Pilgrim 73737fe990 [X86] Fold cmpeq/ne(trunc(x),0) --> cmpeq/ne(x,0)
Relax the fold from rGbaadbe04bf75 to compare any op, not just logic ops, now that the movmsk regressions have been handled.
2021-04-14 11:02:02 +01:00
Simon Pilgrim 016ceb8382 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in MOVMSK(SHUFFLE(X,u)) -> MOVMSK(X) fold
Extension to rG74f98391a7a4, we can also include any of the upper (known zero) bits in the comparison in the shuffle removal fold, just as long as we demand all the elements of the movmsk source vector.
2021-04-14 11:02:01 +01:00
Simon Pilgrim 74f98391a7 [X86][SSE] combineSetCCMOVMSK - allow comparison with upper (known zero) bits in CMP(MOVMSK(PACKSS())) -> CMP(MOVMSK()) fold
We already allow the comparison of the upper bits of 'IsAllOf' (allbits) patterns, but we can safely compare the known zero bits for 'IsAnyOf' (zerobits) patterns as well.

This fixes an issues where we are comparing a type wide than the number of vector elements, which avoids a regression mentioned in rGbaadbe04bf75.
2021-04-13 17:37:24 +01:00
Simon Pilgrim baadbe04bf [X86] Fold cmpeq/ne(trunc(logic(x)),0) --> cmpeq/ne(logic(x),0)
Fixes the issues noted in PR48768, where the and/or/xor instruction had been promoted to avoid i8/i16 partial-dependencies, but the test against zero had not.

We can almost certainly relax this fold to work for any truncation, although it breaks a number of existing folds (notable movmsk folds which tend to rely on the truncate to determine the demanded bits/elts in the source vector).

There is a reverse combine in TargetLowering.SimplifySetCC so we must wait until after legalization before attempting this.
2021-04-12 16:05:34 +01:00
Simon Pilgrim 231b87618b [X86][AVX512] Fold not(kmov(x)) -> kmov(not(x)) and not(widen_subvector(x)) -> widen_subvector(not(x))
Improve AVX512 mask inversion, rG38c799bce801 exposed some missing opportunities to move scalar not() back onto the boolvector types for folding with setcc etc.
2021-04-11 20:07:09 +01:00
Simon Pilgrim 13bdac5709 [X86] combineXor - Pull out repeated getOperand() calls. NFCI. 2021-04-11 19:01:59 +01:00
Simon Pilgrim 38c799bce8 [X86] Fold cmpeq/ne(and(X,Y),Y) --> cmpeq/ne(and(~X,Y),0)
Followup to D100177, handle an similar (demorgan inverse style) case from PR47797 as well

The AVX512 test cases could be further improved if we folded not(iX bitcast(vXi1)) -> (iX bitcast(not(vXi1)))

Alive2: https://alive2.llvm.org/ce/z/AnA_-W
2021-04-11 18:42:01 +01:00
Simon Pilgrim d8bc4de3cf [X86] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) on non-BMI targets (PR44136)
Followup to D100177, enable the fold for non-BMI targets as well.
2021-04-09 16:11:11 +01:00
Simon Pilgrim 245036950a [X86][BMI] Fold cmpeq/ne(or(X,Y),X) --> cmpeq/ne(and(~X,Y),0) (PR44136)
I've initially just enabled this for BMI which has the ANDN instruction for i32/i64 - the i16/i8 cases give an idea of what'd we get when we enable it in all cases (I'll do this as a later commit).

Additionally, the i16/i8 cases could be freely promoted to i32 (as the args are already zeroext) and we could then make use of ANDN + the free cmp0 there as well - this has come up in PR48768 and PR49028 so I'm going to look at this soon.

https://alive2.llvm.org/ce/z/QVWHP_
https://alive2.llvm.org/ce/z/pLngT-

Vector cases do not appear to benefit from this as we end up with having to generate the zero vector as well - this is one of the reasons I didn't try to tie this into hasAndNot/hasAndNotCompare.

Differential Revision: https://reviews.llvm.org/D100177
2021-04-09 15:52:03 +01:00
Simon Pilgrim 3ae0a405fc [X86] combineHorizOpWithShuffle - peek through one use bitcasts when decoding shuffles.
Checking for one use, peek through bitcasts of the horizop args to allows us to merge shuffles of different widths through the horizop.
2021-04-09 10:51:04 +01:00
Simon Pilgrim 53283cc2f1 [X86][SSE] canonicalizeShuffleWithBinOps - add MOVSD/MOVSS handling. 2021-04-06 16:42:18 +01:00
Simon Pilgrim ddbb58736a [KnownBits] Rename KnownBits::computeForMul to KnownBits::mul. NFCI.
As promised in D98866
2021-04-06 10:11:41 +01:00
Simon Pilgrim 36d4f6d7f8 [X86] Fold xor(zext(xor(x,c1)),c2) -> xor(zext(x),xor(zext(c1),c2))
Fixes PR47603 (second case) by extending rG89afec348dbd3e5078f176e978971ee2d3b5dec8
2021-04-05 11:40:37 +01:00
Simon Pilgrim 89afec348d [X86] Fold xor(truncate(xor(x,c1)),c2) -> xor(truncate(x),xor(truncate(c1),c2))
Fixes PR47603

This should probably be transferable to DAGCombine - the main limitation with the existing trunc(logicop) DAG fold is we don't know if legalization has tried to promote truncated logicops already. We might be able to peek through extensions as well.
2021-04-03 12:43:05 +01:00
Simon Pilgrim 7c17f1ea84 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper (REAPPLIED)
Use the getTargetShuffleInputs helper for all shuffle decoding

Reapplied (after reversion in rGfa0aff6d6960) with fix+test for subvector splitting - we weren't accounting for peeking through bitcasts changing the vector element count of the shuffle sources.
2021-04-03 11:59:19 +01:00
Nico Weber fa0aff6d69 Revert "[X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper"
This reverts commit 500969f1d0.
Makes clang assert compiling avx2 code, see
https://bugs.chromium.org/p/chromium/issues/detail?id=1195353#c4
for a standalone repro.
2021-04-02 09:55:55 -04:00
Simon Pilgrim 500969f1d0 [X86][SSE] isHorizontalBinOp - use getTargetShuffleInputs helper
Use the getTargetShuffleInputs helper for all shuffle decoding
2021-04-02 11:50:18 +01:00
Yang Fan bc6001ce1e
[X86] Fix -Wunused-function warning (NFC)
GCC warning:
```
/llvm-project/llvm/lib/Target/X86/X86ISelLowering.cpp:9212:13: warning: ‘bool isHorizOp(unsigned int)’ defined but not used [-Wunused-function]
 9212 | static bool isHorizOp(unsigned Opcode) {
      |             ^~~~~~~~~
```
2021-04-02 09:38:12 +08:00
Simon Pilgrim abbe80fa52 [X86][SSE] Fold HOP(HOP(X,X),HOP(Y,Y)) -> HOP(PERMUTE(HOP(X,Y)),PERMUTE(HOP(X,Y))
For slow-hop targets, attempt to merge HADD/SUB pairs used in chains.
2021-04-01 11:54:10 +01:00
Simon Pilgrim 301319840e [X86][SSE] Enable (F)HADD/SUB handling to SimplifyMultipleUseDemandedVectorElts
Attempt to bypass unused horiz-op operands.

This is very similar to the PACKSS/PACKUS handling - we should try to merge these.
2021-04-01 11:54:09 +01:00
Simon Pilgrim f7aeaced65 [X86][SSE] Add isHorizOp helper function. NFCI. 2021-04-01 11:54:09 +01:00
Craig Topper 437958d9fd [X86] Improve SMULO/UMULO codegen for vXi8 vectors.
The default expansion creates a MUL and either a MULHS/MULHU. Each
of those separately expand to sequences that use one or more
PMULLW instructions as well as additional instructions to
extend the types to vXi16. The MULHS/MULHU expansion computes the
whole 16-bit product, but only keeps the high part.

We can improve the lowering of SMULO/UMULO for some cases by using the MULHS/MULHU
expansion, but keep both the high and low parts. And we can use
those parts to calculate the overflow.

For AVX512 we might have vXi1 overflow outputs. We can improve those by using
vpcmpeqw to produce a k register if AVX512BW is enabled. This is a little better
than truncating the high result to use vpcmpeqb. If we don't have avx512bw we
can extend up to v16i32 to use vpcmpeqd to produce a k register.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97624
2021-03-31 10:13:50 -07:00
Tomas Matheson a9968c0a33 [NFC][CodeGen] Tidy up TargetRegisterInfo stack realignment functions
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
it's name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.

This patch attempts to clarify the situation by separating them and introducing
new names:

 - shouldRealignStack - true if there is any reason the stack should be
   realigned

 - canRealignStack - true if we are still able to realign the stack (e.g. we
   can still reserve/have reserved a frame pointer)

 - hasStackRealignment = shouldRealignStack && canRealignStack (not target
   customisable)

Targets can now override shouldRealignStack to indicate that stack realignment
is required.

This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).

Differential Revision: https://reviews.llvm.org/D98716

Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
2021-03-30 17:31:39 +01:00
Sanjay Patel e694e19a79 [x86] enhance matching of pmaddwd
This was crashing with the example from:
https://llvm.org/PR49716
...and that was avoided with a283d72583 ,
but as we can see from the SSE vs. AVX test code diff,
we can try harder to match the pattern.

This matcher code was adapted from another pmadd pattern
match in D49636, but it needs different ops to deal with
size mismatches.

Differential Revision: https://reviews.llvm.org/D99531
2021-03-30 07:28:33 -04:00
Simon Pilgrim 805148eaf2 [X86][SSE] combineHorizOpWithShuffle - consistently use getTargetShuffleInputs to decode shuffles
Minor cleanup before I start trying to merge the unary/binary shuffle combining paths.
2021-03-29 11:31:19 +01:00
Craig Topper 69bdf35dc7 [X86] Optimize vXi8 MULHS on targets where we can't sign_extend to the next register size.
For these cases we need to extract the upper or lower elements,
multiply them using 16-bit multiplies and repack them.

Previously we used punpcklbw/punpckhbw+psraw or pmovsxbw+pshudfd to
extract and sign extend so we could use pmullw to compute the 16-bit
product and then shift down the high bits.

We can avoid the need to sign extend if we unpack the bytes into
the high byte of each word and fill the lower byte with 0 using
pxor. This puts the sign bit of each byte into the sign bit of
each word. Since the LHS and RHS have 8 trailing zeros, the full
32-bit product of those 16-bit values will have 16 trailing zeros.
This means the 16-bit product of the original bytes is in the upper
16 bits which we can calculate using pmulhw.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98587
2021-03-28 11:41:29 -07:00
Simon Pilgrim 2a0d5da917 [X86][SSE] foldShuffleOfHorizOp - remove broadcast handling.
Remove VBROADCAST/MOVDDUP/splat-shuffle handling from foldShuffleOfHorizOp

This can all be handled by canonicalizeShuffleMaskWithHorizOp along as we check that the HADD/SUB are only used once (to prevent infinite loops on slow-horizop targets which will try to reuse the nodes again followed by a post-hop shuffle).
2021-03-27 15:09:23 +00:00
Simon Pilgrim 41146bfe82 [X86][SSE] combineX86ShuffleChain - attempt to recognise 'hidden' identity shuffles
See if the combined shuffle mask is equivalent to an identity shuffle, typically this is due to repeated LHS/RHS ops in horiz-ops, but isTargetShuffleEquivalent might see other patterns as well.

This is another small step towards getting rid of foldShuffleOfHorizOp and relying on canonicalizeShuffleMaskWithHorizOp and generic shuffle combining.
2021-03-27 11:09:30 +00:00
Sanjay Patel a283d72583 [x86] prevent crashing while matching pmaddwd
This could crash in 2 ways: either one or both of
the input vectors could be a different size than
the math ops.

https://llvm.org/PR49716
2021-03-27 05:27:14 -04:00
Simon Pilgrim c769ba9514 [X86][AVX] combineHorizOpWithShuffle - improve SHUFFLE(HOP(LOSUBVECTOR(X),HISUBVECTOR(X))) folding
Peek through bitcasts to find subvector splits and use getTargetShuffleInputs to decode target shuffles as well as ShuffleVectorSDNode
2021-03-26 17:23:54 +00:00
Simon Pilgrim 36e3c6c841 [X86][AVX] Truncate vectors with PACKSS/PACKUS on AVX2 targets
Until AVX512 we don't have any vector truncation instructions, and always lower using shuffles instead.

combineVectorTruncation performs this earlier than lowering as it makes it easier to use any sign/zero-extended bits in the truncated bits with PACKSS/PACKUS to perform the shuffle.

We currently don't attempt to use combineVectorTruncation on AVX2 targets as in the past 256-bit PACKSS/PACKUS tended to cause 128-bit lane shuffle regressions - but these should now be all resolved with combineHorizOpWithShuffle and in all cases we now reduce the amount of cross-lane shuffling and variable shuffle mask usage.

Differential Revision: https://reviews.llvm.org/D96609
2021-03-25 10:34:34 +00:00
Simon Pilgrim 9fde88c3e2 [X86][AVX] splitIntVSETCC - handle separate (canonicalized) SETCC operands
LowerVSETCC calls splitIntVSETCC after canonicalizing certain patterns, in particular (X & CPow2 != 0) -> (X & CPow2 == CPow2).

Unfortunately if we're splitting for AVX1/non-AVX512BW cases, we lose these canonicalizations as we call the split with the original SetCC node, and when the split nodes are later lowered in LowerVSETCC the patterns are lost behind extract_subvector etc. But if we pass the canonicalized operands for splitting we retain the optimizations.

Differential Revision: https://reviews.llvm.org/D99256
2021-03-25 10:18:44 +00:00
Simon Pilgrim 7920527796 [X86][AVX] combineBitcastvxi1 - improve handling of vectors truncated to vXi1
If we're truncating to vXi1 from a wider type, then prefer the original wider vector as is simplifies folding the separate truncations + extensions.

AVX1 this is only worth it for v8i1 cases, not v4i1 where we're always better off truncating down to v4i32 for movmsk.

Helps with some regressions encountered in D96609
2021-03-24 14:05:59 +00:00
Simon Pilgrim e9015bd595 [X86][AVX] lowerShuffleAsBroadcast - MOVDDUP(SCALAR_TO_VECTOR(X)) -> BROADCAST(X)
Prefer broadcast from scalar on AVX targets as this makes it easier for later folds to strip away bitcasts etc.

This helps a lot with the AVX1 poor codegen from PR49658.

There's a trivial regression in bitcast-int-to-vector-bool-*ext.ll tests due to SimplifyDemandedBits not being able to see a multi-use case, but there's bigger existing codegen issues to be addressed first in those tests (unnecessary NOTs).
2021-03-24 11:31:56 +00:00
Simon Pilgrim c1ef642ad8 [X86] Remove unused 'OneUse' option from IsNOT helper. NFCI. 2021-03-24 11:14:38 +00:00
Simon Pilgrim 080cb83e52 [X86][AVX] Narrow VPBROADCASTQ->VPBROADCASTD if we don't need the upper bits.
Helps fix cases where we've splatted smaller types to a wider vector element type without needing the upper bits.

Avoid this on AVX512 targets as that can affect broadcast folding.
2021-03-23 09:41:02 +00:00
Simon Pilgrim 3179588947 [X86][AVX] ComputeNumSignBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Added as an extension to PR49658
2021-03-21 12:22:51 +00:00
Simon Pilgrim 297b9bc3fa [X86][AVX] computeKnownBitsForTargetNode - add X86ISD::VBROADCAST handling for scalar sources
The target shuffle code handles vector sources, but X86ISD::VBROADCAST can also accept a scalar source for splatting.

Suggested by @craig.topper on PR49658
2021-03-21 10:40:57 +00:00
Simon Pilgrim 54a05f2ec8 [X86] computeKnownBitsForTargetNode - add X86ISD::PMULUDQ handling
Reuse the existing KnownBits multiplication code to handle what is effectively a ISD::UMUL_LOHI varient
2021-03-21 09:57:20 +00:00
Simon Pilgrim 64687f2cc3 [X86][SSE] canonicalizeShuffleWithBinOps - add PERMILPS/PERMILPD + PERMPD/PERMQ + INSERTPS handling.
Bail if the INSERTPS would introduce zeros across the binop.
2021-03-16 13:52:08 +00:00
Simon Pilgrim 772155793b [X86][SSE] isHorizontalBinOp - ensure we clear any unused source operands to improve HADD/SUB matching
Our shuffle matching for HADD/SUB patterns wasn't clearing repeated ops in 'fake unary' style shuffle masks (unpack(x,x) etc.), preventing matching of add(fakeunary(),fakeunary()) style patterns.
2021-03-15 16:24:29 +00:00
Simon Pilgrim 814339454d [X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles.
Fold SHUFFLE(BINOP(SHUFFLE(X),SHUFFLE(Y))) -> BINOP(SHUFFLE'(X),SHUFFLE'(Y)) style patterns as well as the existing shuffles of constants.
2021-03-15 15:01:29 +00:00
Simon Pilgrim 07232f4507 [X86][SSE] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling.
Recommit rGcd938ab162b0ac560dd0e9fee290980c7e0e47e5 with an early-out if the pshub would introduce zeros across the binop.
2021-03-15 12:43:30 +00:00
Simon Pilgrim 75a184dacf Revert rG9ba577eca2e339726bfaad4e615c6324a705b292 "[X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles. NFCI."
Sorry this wasn't supposed to be committed yet (and certainly not tagged as NFCI....)
2021-03-15 12:23:44 +00:00
Simon Pilgrim 9ba577eca2 [X86][SSE] canonicalizeShuffleWithBinOps - handle target shuffles. NFCI.
Fold SHUFFLE(BINOP(SHUFFLE(X),SHUFFLE(Y))) -> BINOP(SHUFFLE'(X),SHUFFLE'(Y)) style patterns as well as the existing shuffles of constants.
2021-03-15 11:59:25 +00:00
Simon Pilgrim 6878be5dc3 [X86][SSE] Attempt to merge single-op hops for slow targets.
For slow-hop targets, see if any single-op hops are duplicating work already done on another (dual-op) hop, which can sometimes occur as isHorizontalBinOp tries to find potential duplicates (but can't merge them itself). If so, reuse the other hop and shuffle the result.
2021-03-15 09:30:20 +00:00
Simon Pilgrim 6cb7dddaf4 [X86][AVX] Insert zeros byte elements into 256/512-bit vectors using shuffle/and
Avoid extracting/inserting subvectors which makes it more difficult for shuffle combining to merge them together.
2021-03-12 15:16:36 +00:00
Simon Pilgrim 33dcdd414c [X86] Provide lighter weight getTargetShuffleMask wrapper. NFCI.
Most callers to getTargetShuffleMask don't use the IsUnary flag.
2021-03-12 15:16:35 +00:00
Simon Pilgrim bc5e9ec2dc Revert rGcd938ab162b0ac560dd0e9fee290980c7e0e47e5 "[X86] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling."
Investigating an issue reported by @bkramer, possibly when the PSHUFB mask generates zero elements.
2021-03-11 13:14:00 +00:00
Simon Pilgrim 77394c12a4 [X86] Don't attempt to fold sub(C1, xor(X, C2)) with opaque constants
Fixes PR49451
2021-03-11 12:06:40 +00:00
Simon Pilgrim d0884541cc [X86] canonicalizeShuffleWithBinOps - add binary shuffle handling 2021-03-09 13:57:03 +00:00
Simon Pilgrim f71cee136d [X86] Break if-else chain. NFCI.
Both if blocks affect control flow - we don't need the else.

Fixes clang-tidy warning.
2021-03-08 11:44:31 +00:00
Simon Pilgrim cd938ab162 [X86] canonicalizeShuffleWithBinOps - add X86ISD::PSHUFB handling. 2021-03-07 12:56:35 +00:00
Simon Pilgrim 772a501bf4 [X86] canonicalizeShuffleWithBinOps - shuffle oneuse constants.
We can freely shuffle all ones/zeros constants but we can also freely shuffle other constants as long as they only have one use.
2021-03-07 11:17:03 +00:00
Alexey Lapshin cf7cdaff64 [X86][VARARG] Avoid spilling xmm registers for va_start.
That review is extracted from D69372.
It fixes https://bugs.llvm.org/show_bug.cgi?id=42219 bug.

For the noimplicitfloat mode, the compiler mustn't generate
floating-point code if it was not asked directly to do so.
This rule does not work with variable function arguments currently.
Though compiler correctly guards block of code, which copies xmm vararg
parameters with a check for %al, it does not protect spills for xmm registers.
Thus, such spills are generated in non-protected areas and could break code,
which does not expect floating-point data. The problem happens in -O0
optimization mode. With this optimization level there is used
FastRegisterAllocator, which spills virtual registers at basic block boundaries.
Register Allocator does not protect spills with additional control-flow modifications.
Thus to resolve that problem, it is suggested to not copy incoming physical
registers into virtual registers. Instead, store incoming physical xmm registers
into the memory from scratch.

Differential Revision: https://reviews.llvm.org/D80163
2021-03-06 15:25:47 +03:00
Simon Pilgrim d7b8cb4d57 [X86] X86ISelLowering.cpp - try to use for-range loops. NFCI. 2021-03-05 11:09:14 +00:00
Simon Pilgrim 7cbc5df438 [X86] X86TargetLowering::isSafeMemOpType - break if-else chain. NFCI.
All if-else blocks return - fixes clang-tidy warning.
2021-03-04 12:15:08 +00:00
Simon Pilgrim 1584e55a26 [X86] canonicalizeShuffleWithBinOps - handle general unaryshuffle(binop(x,c)) patterns not just xor(x,-1)
Generalize the shuffle(not(x)) -> not(shuffle(x)) fold to handle any binop with 0/-1.

Hopefully we can further generalize to help push target unary/binary shuffles through binops similar to what we do in DAGCombiner::visitVECTOR_SHUFFLE
2021-03-04 10:44:38 +00:00
Simon Pilgrim aa4afebbf9 [X86] Fold scalar_to_vector(x) -> extract_subvector(broadcast(x),0) iff broadcast(x) exists
Add handling for reusing an existing broadcast(x) to a wider vector.
2021-03-03 15:50:37 +00:00
Benjamin Kramer 10c256ccaf Revert "[X86] Fold shuffle(not(x),undef) -> not(shuffle(x,undef))"
This reverts commit 925093d88a.

Causes an infinite loop when compiling some shuffles:

$ cat bugpoint-reduced-simplified.ll
target triple = "x86_64-unknown-linux-gnu"

define void @foo() {
entry:
  %0 = load i8, i8* undef, align 1
  %broadcast.splatinsert = insertelement <16 x i8> poison, i8 %0, i32 0
  %1 = icmp ne <16 x i8> %broadcast.splatinsert, zeroinitializer
  %2 = shufflevector <16 x i1> %1, <16 x i1> undef, <16 x i32> zeroinitializer
  %wide.load = load <16 x i8>, <16 x i8>* undef, align 1
  %3 = icmp ne <16 x i8> %wide.load, zeroinitializer
  %4 = and <16 x i1> %3, %2
  %5 = zext <16 x i1> %4 to <16 x i8>
  store <16 x i8> %5, <16 x i8>* undef, align 1
  ret void
}

$ llc < bugpoint-reduced-simplified.ll
<timeout>
2021-03-02 11:24:07 +01:00
Simon Pilgrim 925093d88a [X86] Fold shuffle(not(x),undef) -> not(shuffle(x,undef))
Move NOT out to expose more AND -> ANDN folds
2021-03-01 14:47:39 +00:00
Simon Pilgrim ab3ea27b6f [X86][AVX] Reuse existing VBROADCAST(x) for SCALAR_TO_VECTOR(x)
Similar to what we already do for BROADCASTs of different vector sizes - if we're going to broadcast it anyway might as well reuse it.
2021-02-28 11:37:27 +00:00
Craig Topper 993f4d8ffa [X86] Fix a couple comments that said LHS where they meant RHS. NFC 2021-02-27 17:14:17 -08:00
James Y Knight 6de6455752 Use getAlign() on atomicrmw/cmpxchg instructions, now that it's available.
These locations were missed as part of adding alignment to the
instructions, and were still making their own alignment assumptions.
2021-02-26 15:06:15 -05:00
Simon Pilgrim ed1f45bce9 [X86][AVX] SimplifyDemandedBitsForTargetNode - add basic X86ISD::VBROADCAST handling.
Simplify through to the scalar/vector source operand.
2021-02-26 16:13:14 +00:00
Simon Pilgrim 7ac4c956af [X86] Remove unnecessary custom lowering of vXi1 SADDSAT/SSUBSAT/UADDSAT/USUBSAT
As discussed on D97478. The removal of the custom tag causes some changes in the add/sub-overflow expansion as it no longer expands to sat-arith codegen.
2021-02-26 12:10:23 +00:00
Simon Pilgrim aefe8f2f6c [DAG] Fold vXi1 multiplies -> and
This allows us to remove X86 custom lowering of vXi1 MUL, which helps simplify a load of mask math.

Mentioned in D97478 post review.
2021-02-26 11:46:12 +00:00
Simon Pilgrim 40b8b4a466 [X86] Remove unnecessary custom lowering of v16i1/v32i1 ADD/SUB
These were missed in D97478
2021-02-26 11:46:11 +00:00
Craig Topper ceaedfb5fc [X86] Remove custom lowering of vXi1 ADD/SUB now that they are canonicalized to XOR in getNode.
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D97478
2021-02-25 08:52:41 -08:00
Simon Pilgrim 8b82669d56 [X86][SSE] Move unaryshuffle(xor(x,-1)) -> xor(unaryshuffle(x),-1) fold into helper. NFCI.
We should be able to extend this "canonicalizeShuffleWithBinOps" to handle more generic binop cases where either/both operands can be cheaply shuffled.
2021-02-25 10:56:23 +00:00
Simon Pilgrim b568d3d6c9 [X86] Add vector support to sub(C1, xor(X, C2)) -> add(xor(X, ~C2), C1+1) fold. 2021-02-21 21:51:27 +00:00
Simon Pilgrim 3ab32c94a4 [X86] Replace explicit constant handling in sub(C1, xor(X, C2)) -> add(xor(X, ~C2), C1+1) fold. NFCI.
NFC cleanup before adding vector support - rely on the SelectionDAG to handle everything for us.
2021-02-21 21:40:32 +00:00
Simon Pilgrim bae04a3e2d [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - remove unnecessary BITCASTs.
In conjunction with the 'vperm2x128(bitcast(x),bitcast(y),c) -> bitcast(vperm2x128(x,y,c))' fold in combineTargetShuffle, this should remove any unnecessary bitcasts around vperm2x128 lane shuffles.
2021-02-21 18:40:32 +00:00
Simon Pilgrim a6a258f1da [X86][AVX] Fold concat(extract_subvector(v0,c0), extract_subvector(v1,c1)) -> vperm2x128
Fixes regression exposed by removing bitcasts across logic-ops in D96206.

Differential Revision: https://reviews.llvm.org/D96206
2021-02-21 14:50:43 +00:00
Simon Pilgrim 2885d1251f [X86] Fold bitcast(logic(bitcast(X), Y)) --> logic'(X, bitcast(Y)) for int-int bitcasts
Extend the existing combine that handles bitcasting for fp-logic ops to also help remove logic ops across bitcasts to/from the same integer types.

This helps improve AVX512 predicate handling for D/Q logic ops and also allows DAGCombine's scalarizeExtractedBinop to remove some annoying gpr->simd->gpr transfers.

The concat_vectors regression in pr40891.ll will be addressed in a followup commit on this patch.

Differential Revision: https://reviews.llvm.org/D96206
2021-02-21 14:40:54 +00:00
Simon Pilgrim 761bbed264 [DAG] foldSubToUSubSat - fold sub(a,trunc(umin(zext(a),b))) -> usubsat(a,trunc(umin(b,SatLimit)))
This moves the last custom x86 USUBSAT fold to generic DAGCombine.

Completes PR40111

Differential Revision: https://reviews.llvm.org/D96703
2021-02-20 12:02:07 +00:00
Simon Pilgrim 2258b367db [X86][AVX] getFauxShuffleMask - decode VBROADCAST(EXTRACT_VECTOR_ELT(V,0))
Handle the case where we're broadcasting a scalar extracted from another vector.
2021-02-19 11:06:53 +00:00
Wang, Pengfei c98644c2ec [X86] Fix a codegen crash in getSetCCResultType
This patch fixes some crashes coming from
X86ISelLowering::getSetCCResultType, which would occasionally return
an EVT constructed from an invalid MVT, which has a null Type pointer.

This patch refers to D95434.

Differential Revision: https://reviews.llvm.org/D97036
2021-02-19 17:30:10 +08:00
Simon Pilgrim 05c64ea672 [DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) (REAPPLIED)
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))

Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.

Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.

This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.

Fixes issue raised by @saugustine in rG5aa8f4c0843a where we were failing to replace null shuffle operands from MergeInnerShuffle to UNDEFs.

Differential Revision: https://reviews.llvm.org/D96345
2021-02-17 11:42:43 +00:00
Sterling Augustine 5aa8f4c084 Revert "[DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d)))"
This reverts commit 5dfba562dd.

That commit causes an assertion failure with the following repro:

typedef long b __attribute__((__vector_size__(16)));
b *d;
b e;
b __attribute__((__always_inline__)) c(b h, b i) {
  return (__attribute__((__vector_size__(8 * sizeof(short)))) short)h + i;
}
j() {
  b k, l, m, n, o[6], p, q;
  m = d[5];
  b r = m;
  b s = f(r, 8);
  q = s;
  l = d[1];
  p = l;
  t(q);
  n = c(m, l);
  o[1] = c(s, f(p, 8));
  k = __builtin_shufflevector(n, o[1], 0, 2);
  e = __builtin_ia32_psrlwi128(k, j);
}

./bin/clang -cc1 -triple x86_64-grtev4-linux-gnu -emit-obj -O1 -std=c99 test.c
2021-02-16 12:48:15 -08:00
Simon Pilgrim 5dfba562dd [DAG] Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d)))
Fold shuffle(bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))) -> bop(shuffle(x,y),shuffle(z,w)),bop(shuffle(a,b),shuffle(c,d))

Attempt to fold from a shuffle of a pair of binops to a binop of shuffles, as long as one/both of the binop sources are also shuffles that can be merged with the outer shuffle. This should guarantee that we remove one binop without introducing any additional shuffles.

Technically there's potential for a merged shuffle's lowering to be poorer than the original shuffle, but it could also be better, and I'm not seeing any regressions as long as we keep the 'don't merge splats' rule already present in MergeInnerShuffle.

This expands and generalizes an existing X86 combine and attempts to merge either of each binop's sources (with an on-the-fly commutation of the shuffle mask) - we couldn't do that in the x86 version as it had to stay in a form that DAGCombine's MergeInnerShuffle would still recognise.

Differential Revision: https://reviews.llvm.org/D96345
2021-02-16 15:46:34 +00:00
Simon Pilgrim 4841a225b7 [DAG] Move basic USUBSAT pattern matches from X86 to DAGCombine
Begin transitioning the X86 vector code to recognise sub(umax(a,b) ,b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.

This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.

We can handle the trunc(sub(..))) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.

The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.

Differential Revision: https://reviews.llvm.org/D96413
2021-02-12 18:22:57 +00:00
Simon Pilgrim eb31c3c5cb Revert rGe1172959226689a "[X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - merge VPERMILPD ops with different low/high masks."
Revert this while I investigate a downstream breakage report.
2021-02-10 10:26:44 +00:00
Simon Pilgrim 89d9ff8229 [X86][SSE] foldShuffleOfHorizOp - add SHUFPS v4f32 handling
Fold shufps(hop(x,y),hop(z,w)) -> permute(hop(x,z)) - this is very similar to the equivalent unpack fold.

I did start trying to convert foldShuffleOfHorizOp to handle generic shuffle masks but we're relying on a lot of special cases at the moment.
2021-02-09 14:18:45 +00:00
Simon Pilgrim 598ceb25d4 [X86][AVX] Fold extract_subvector(splat, c) -> extract_subvector(splat, 0)
We already do this for VBROADCASTs, extend this for any splat that SelectionDAG::isSplatValue recognises as well.
2021-02-07 11:42:41 +00:00
Craig Topper 6f4f0efd89 [X86] Don't pass a 1 to the second argument of ISD::FP_ROUND in LowerFCOPYSIGN.
I don't think we have any reason to believe the FP_ROUND here doesn't change the value.

Found while trying to see if we still need the fp128 block in CanCombineFCOPYSIGN_EXTEND_ROUND.
Removing that check caused this FP_ROUND to fire for fp128 which introduced a libcall expansion that asserted for this being a 1.

Reviewed By: RKSimon, pengfei

Differential Revision: https://reviews.llvm.org/D96098
2021-02-06 10:29:01 -08:00
Simon Pilgrim e117295922 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - merge VPERMILPD ops with different low/high masks.
Now that PR48908 has been dealt with, we can handle v4f64 permute cases by extracting the low/high lane VPERMILPD masks and creating a new mask based on which lanes are referenced by the VPERM2F128 mask.
2021-02-06 15:58:02 +00:00
Craig Topper 11ef356d9e [TargetLowering] Use Align in allowsMisalignedMemoryAccesses.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96097
2021-02-04 19:22:06 -08:00
Simon Pilgrim 31b85e1c0b [X86] Use VT::changeVectorElementType helper where possible. NFCI. 2021-02-04 15:03:56 +00:00
Simon Pilgrim fa2cdb8140 [X86] Remove stale TODO comment. NFC.
We now handle implicit zero-extension shuffle mask cases.
2021-02-04 12:14:05 +00:00
Simon Pilgrim 32b7c2fa42 [X86][SSE] Support variable-index float/double vector insertion on SSE41+ targets (PR47924)
Extends D95779 to permit insertion into float/doubles vectors while avoiding a lot of aliased memory traffic.

The scalar value is already on the simd unit, so we only need to transfer and splat the index value, then perform the select.

SSE4 codegen is a little bulky due to the tied register requirements of (non-VEX) BLENDPS/PD but the extra moves are cheap so shouldn't be an actual problem.

Differential Revision: https://reviews.llvm.org/D95866
2021-02-03 14:14:35 +00:00
Simon Pilgrim 8c2e075c2c [X86][SSE] LowerINSERT_VECTOR_ELT - pull out repeated EltSizeInBits calls. NFCI. 2021-02-02 13:45:18 +00:00
Simon Pilgrim d46a6b3d55 [X86][AVX512] Support variable-index vector insertion on AVX512 targets (PR47924)
With predicate masks, AVX512 can efficiently perform variable-index vector insertion with 2 broadcasts + 1 comparison, avoiding a lot of aliased memory traffic.

Differential Revision: https://reviews.llvm.org/D95779
2021-02-02 11:41:18 +00:00
Philip Reames 46e764a628 [x86] introduce no_callee_saved_registers attribute
This is directly analogous to the existing no_caller_saved_registers, but with the opposite intention.  A function or call so marked shifts the responsibility of spilling the usual CSRs to it's caller.

An indirect call site and callee which don't agree on the attribute is ill defined.

The motivation for this change is that being able to prune callee saves (without modifying other details of the calling convention) is sometimes useful when generating stubs and adapters.  There's no intention to expose this as a source language feature; this is expected to be used by frontends to implement adapters where warranted.

Some specific examples of use cases:
* GC compatible compiled code wants to call an externally defined library function without needing to track pointer values through CSRs.
* debug enabled code wants to call precompiled library which doesn't provide enough information to track CSRs while preserving debug quality in caller.
* adapter stub entering hand written assembler which doesn't follow normal calling conventions.
2021-02-01 16:19:14 -08:00
Philip Reames 9d09db941f [NFC][X86] Use CallBase interface to simplify code 2021-02-01 15:24:41 -08:00
Philip Reames bb6c23b1f5 [NFC][X86] Avoid redundant work inspecting callee 2021-02-01 15:24:41 -08:00
Simon Pilgrim e640b209b2 [X86][SSE] LowerScalarImmediateShift - use APInt::getLowBitsSet for vXi8 ISD::SRL mask generation. NFCI.
Match what we do for ISD::SHL
2021-02-01 18:17:40 +00:00
Simon Pilgrim 5211af4818 [X86][AVX] combineExtractWithShuffle - combine extracts from 256/512-bit vector shuffles.
We can only legally extract from the lowest 128-bit subvector, so extract the correct subvector to allow us to handle 256/512-bit vector element extracts.
2021-02-01 10:31:43 +00:00
Simon Pilgrim d6b68d1344 [X86][SSE] combineExtractWithShuffle - support zero-extending to allow extracting from narrow shuffle masks
If the shuffle mask can't be widened to match the original extracted element width, see if the upper bits are zeroable - which allows us to extract+zero-extend the smaller extraction.
2021-01-29 14:22:10 +00:00
Simon Pilgrim f84efe97bc [X86][AVX] combineHorizOpWithShuffle - fix valuetype comparison typo.
Ensure we check the valuetypes of all the HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) shuffle input ops - there was a copy+paste typo (noticed by MSVC analyzer) that meant we were checking the same input from one of the shuffles twice.

I haven't been able to create a test case for this yet - I don't think its currently possible to create a target/faux binary shuffle that scales to a 2x128 shuffle mask from two different value types.
2021-01-28 16:36:23 +00:00
Simon Pilgrim 6663330bc8 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - don't merge VPERMILPD ops with different low/high masks.
Unlike VPERMILPS, VPERMILPD can have non-repeating masks in each 128-bit subvector, we weren't accounting for this when folding vperm2f128(vpermilpd(x,c),vpermilpd(y,c)) -> vpermilpd(vperm2f128(x,y),c).

I'm intending to add support for this but wanted to get a minimal fix in first for merging into 12.xx.

Fixes PR48908
2021-01-28 12:11:31 +00:00
Freddy Ye b3b0acdc6f [NFC] Refine some uninitialized used variables.
These warning are reported by static code analysis tool: Klocwork

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D95421
2021-01-26 16:51:05 +08:00
Simon Pilgrim 13f2aee783 [X86][AVX] Generalize vperm2f128/vperm2i128 patterns to support all legal 256-bit vector types
Remove bitcasts to/from v4x64 types through vperm2f128/vperm2i128 ops to help improve shuffle combining and demanded vector elts folding.
2021-01-25 15:35:36 +00:00
Simon Pilgrim 821a51a9ca [X86][AVX] combineX86ShuffleChainWithExtract - widen to at least original root size. NFCI.
We're relying on the source inputs for shuffle combining having already been widened to the root size (otherwise the offset logic falls over) - we're going to be supporting different sized shuffle inputs soon, so we need to explicitly make the minimum widened width the original root size.
2021-01-25 13:45:37 +00:00
Simon Pilgrim 1b780cf32e [X86][AVX] LowerTRUNCATE - avoid bitcasts around extract_subvectors.
We allow extract_subvector lowering of all legal types, so pre-bitcast the source type to try and reduce bitcast pollution.
2021-01-25 12:10:36 +00:00
Simon Pilgrim f461e35cba [X86][AVX] combineX86ShuffleChain - avoid bitcasts around insert_subvector() shuffle patterns.
We allow insert_subvector lowering of all legal types, so don't always cast to the vXi64/vXf64 shuffle types - this is only necessary for X86ISD::SHUF128/X86ISD::VPERM2X128 patterns later.
2021-01-25 11:35:45 +00:00
Fangrui Song d745b82de1 [XRay] Support DW_TAG_call_site and delete unneeded PATCHABLE_EVENT_CALL/PATCHABLE_TYPED_EVENT_CALL lowering 2021-01-25 00:49:18 -08:00
Simon Pilgrim bd122f6d21 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - handle vperm2x128(movddup(x),movddup(y)) cases
Fold vperm2x128(movddup(x),movddup(y)) -> movddup(vperm2x128(x,y))
2021-01-22 16:05:19 +00:00
Simon Pilgrim c33d36e066 [X86][AVX] canonicalizeLaneShuffleWithRepeatedOps - handle unary vperm2x128(permute/shift(x,c),undef) cases
Fold vperm2x128(permute/shift(x,c),undef) -> permute/shift(vperm2x128(x,undef),c)
2021-01-22 15:47:23 +00:00
Simon Pilgrim 4846f6ab81 [X86][AVX] combineTargetShuffle - simplify the X86ISD::VPERM2X128 subvector matching
Simplify vperm2x128(concat(X,Y),concat(Z,W)) folding.

Use collectConcatOps / ISD::INSERT_SUBVECTOR to find the source subvectors instead of hardcoded immediate matching.
2021-01-22 15:47:22 +00:00
Simon Pilgrim b1166e1317 [X86][AVX] combineX86ShufflesRecursively - attempt to constant fold before widening shuffle inputs
combineX86ShufflesConstants/canonicalizeShuffleMaskWithHorizOp can both handle/earlyout shuffles with inputs of different widths, so delay widening as late as possible to make it easier to match constant folds etc.

The plan is to eventually move the widening inside combineX86ShuffleChain so that we don't create any new nodes unless we successfully combine the shuffles.
2021-01-22 13:19:35 +00:00
Simon Pilgrim ffe72f987f [X86][SSE] Don't fold shuffle(binop(),binop()) -> binop(shuffle(),shuffle()) if the shuffle are splats
rGbe69e66b1cd8 added the fold, but DAGCombiner.visitVECTOR_SHUFFLE doesn't merge shuffles if the inner shuffle is a splat, so we need to bail.

The non-fast-horiz-ops paths see some minor regressions, we might be able to improve on this after lowering to target shuffles.

Fix PR48823
2021-01-22 11:31:38 +00:00
Simon Pilgrim 86021d98d3 [X86] Avoid a std::string copy by replacing auto with const auto&. NFC.
Fixes msvc analyzer warning.
2021-01-21 11:04:07 +00:00
Max Kazantsev d6bb96e677 [X86] Add experimental option to separately tune alignment of innermost loops
We already have an experimental option to tune loop alignment. Its impact
is very wide (and there is a suspicion that it's not always profitable). We want
to have something more narrow to play with. This patch adds similar option that
overrides preferred alignment for innermost loops. This is for experimental
purposes, default values do not change the existing behavior.

Differential Revision: https://reviews.llvm.org/D94895
Reviewed By: pengfei
2021-01-21 11:15:16 +07:00
Simon Pilgrim b8b5e87e6b [X86][AVX] Handle vperm2x128 shuffling of a subvector splat.
We already handle "vperm2x128 (ins ?, X, C1), (ins ?, X, C1), 0x31" for shuffling of the upper subvectors, but we weren't dealing with the case when we were splatting the upper subvector from a single source.
2021-01-20 18:16:33 +00:00
Simon Pilgrim 19d02842ee [X86][AVX] Fold extract_subvector(VSRLI/VSHLI(x,32)) -> VSRLI/VSHLI(extract_subvector(x),32)
As discussed on D56387, if we're shifting to extract the upper/lower half of a vXi64 vector then we're actually better off performing this at the subvector level as its very likely to fold into something.

combineConcatVectorOps can perform this in reverse if necessary.
2021-01-20 14:34:54 +00:00
Simon Pilgrim 5626adcd6b [X86][SSE] combineVectorSignBitsTruncation - fold trunc(srl(x,c)) -> packss(sra(x,c))
If a srl doesn't introduce any sign bits into the truncated result, then replace with a sra to let us use a PACKSS truncation - fixes a regression noticed in D56387 on pre-SSE41 targets that don't have PACKUSDW.
2021-01-19 11:04:13 +00:00
Sanjay Patel d27bb5c375 [x86] add cast to avoid compile-time warning; NFC 2021-01-18 17:47:04 -05:00
Simon Pilgrim ce06475da9 [X86][AVX] IsElementEquivalent - add matchShuffleWithUNPCK + VBROADCAST/VBROADCAST_LOAD handling
Specify LHS/RHS operands in matchShuffleWithUNPCK's calls to isTargetShuffleEquivalent, and handle VBROADCAST/VBROADCAST_LOAD matching in IsElementEquivalent
2021-01-18 15:55:00 +00:00
Simon Pilgrim 770d1e0a88 [X86][SSE] isHorizontalBinOp - reuse any existing horizontal ops.
If we already have similar horizontal ops using the same args, then match that, even if we are on a target with slow horizontal ops.
2021-01-18 10:14:45 +00:00
Kazu Hirata 2082b10d10 [llvm] Use *::empty (NFC) 2021-01-16 09:40:55 -08:00
Simon Pilgrim be69e66b1c [X86][SSE] Attempt to fold shuffle(binop(),binop()) -> binop(shuffle(),shuffle())
If this will help us fold shuffles together, then push the shuffle through the merged binops.

Ideally this would be performed in DAGCombiner::visitVECTOR_SHUFFLE but getting an efficient+legal merged shuffle can be tricky - on SSE we can be confident that for 32/64-bit elements vectors shuffles should easily fold.
2021-01-15 16:25:25 +00:00
Simon Pilgrim 1dfd5c9ad8 [X86][AVX] combineHorizOpWithShuffle - support target shuffles in HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
Be more aggressive on (AVX2+) folds of lane shuffles of 256-bit horizontal ops by working on target/faux shuffles as well.
2021-01-15 13:55:30 +00:00
Simon Pilgrim cbbfc82586 [X86][SSE] canonicalizeShuffleMaskWithHorizOp - simplify shuffle(HOP(HOP(X,Y),HOP(Z,W))) style chains.
See if we can remove the shuffle by resorting a HOP chain so that the HOP args are pre-shuffled.

This initial version just handles (the most common) v4i32/v4f32 hadd/hsub reduction patterns - future work can extend this to v8i16 types plus PACK chains (2f64 HADD/HSUB should already be handled in the half-lane combine code later on).
2021-01-13 17:19:40 +00:00
Simon Pilgrim 0a0ee7f5a5 [X86] canonicalizeShuffleMaskWithHorizOp - minor refactor to support multiple src ops. NFCI.
canonicalizeShuffleMaskWithHorizOp currently only supports shuffles with 1 or 2 sources, but PR41813 will require us to support higher numbers of sources.

This patch just generalizes the initial setup stages to ensure all src ops are the same type and opcode and then will continue to early out if we have more than 2 sources.
2021-01-13 13:59:56 +00:00
Simon Pilgrim 0f59d09957 [X86][AVX] combineVectorSignBitsTruncation - limit AVX512 truncations to 128-bits (PR48727)
rG73a44f437bf1 result in 256-bit packss/packus ops with additional shuffles that shuffle combining can sometimes try to convert back into a truncation.
2021-01-13 10:38:23 +00:00
Bevin Hansson 07605ea1f3 [X86] Improved lowering for saturating float to int.
Adapted from D54696 by @nikic.

This patch improves lowering of saturating float to
int conversions, FP_TO_[SU]INT_SAT, for X86.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D86079
2021-01-12 15:44:41 +01:00
Simon Pilgrim 2ed914cb7e [X86][SSE] getFauxShuffleMask - handle PACKSS(SRAI(),SRAI()) shuffle patterns.
We can't easily treat ASHR a faux shuffle, but if it was just feeding a PACKSS then it was likely being used as sign-extension for a truncation, so just peek through and adjust the mask accordingly.
2021-01-12 14:07:53 +00:00
Simon Pilgrim 7e44208115 [X86][SSE] combineSubToSubus - add v16i32 handling on pre-AVX512BW targets.
v16i32 -> v16i16/v8i16 truncation is now good enough using PACKSS/PACKUS + shuffle combining that its no longer necessary to early-out on pre-AVX512BW targets.

This was noticed while looking at completing PR40111 and moving combineSubToSubus to DAGCombine entirely.
2021-01-12 13:44:11 +00:00
Simon Pilgrim a5212b5c91 [X86][SSE] combineSubToSubus - remove SSE2 early-out.
SSE2 truncation codegen has improved over the past few years (mainly due to better shuffle lowering/combining and computeKnownBits) - its no longer necessary to early-out from v8i32/v8i64 truncations.

This was noticed while looking at completing PR40111 and moving combineSubToSubus to DAGCombine entirely.
2021-01-12 12:52:11 +00:00
Simon Pilgrim 4214ca9614 [X86][AVX] Attempt to fold vpermf128(op(x,i),op(y,i)) -> op(vpermf128(x,y),i)
If vpermf128/vpermi128 is acting on 2 similar 'inlane' ops, then try to perform the vpermf128 first which will allow us to merge the ops.

This will help us fix one of the regressions in D56387
2021-01-11 16:59:25 +00:00
Simon Pilgrim 41bf338dd1 Revert rGd43a264a5dd3 "Revert "[X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())""
This reapplies commit rG80dee7965dffdfb866afa9d74f3a4a97453708b2.

[X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())

UNPCKL/UNPCKH only uses one op from each hop, so we can merge the hops and then permute the result.

REAPPLIED with a fix for unary unpacks of HOP.
2021-01-11 11:29:04 +00:00
Nico Weber d43a264a5d Revert "[X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())"
This reverts commit 80dee7965d.
Makes clang sometimes hang forever. See
https://bugs.chromium.org/p/chromium/issues/detail?id=1164786#c6 for a
stand-alone repro.
2021-01-10 20:22:53 -05:00
Simon Pilgrim 80dee7965d [X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())
UNPCKL/UNPCKH only uses one op from each hop, so we can merge the hops and then permute the result.
2021-01-08 15:22:17 +00:00
Simon Pilgrim 73a44f437b [X86][AVX] combineVectorSignBitsTruncation - use PACKSS/PACKUS in more AVX cases
AVX512 has fast truncation ops, but if the truncation source is a concatenation of subvectors then its likely that we can use PACK more efficiently.

This is only guaranteed to work for truncations to 128/256-bit vectors as the PACK works across 128-bit sub-lanes, for now I've just disabled 512-bit truncation cases but we need to get them working eventually for D61129.
2021-01-05 15:01:45 +00:00
Kazu Hirata 985f899bf2 [Target] Use llvm::append_range (NFC) 2021-01-03 09:57:43 -08:00
Fangrui Song 6be0b9a8dd [X86] Don't fold negative offset into 32-bit absolute address (e.g. movl $foo-1, %eax)
When building abseil-cpp `bin/absl_hash_test` with Clang in -fno-pic
mode, an instruction like `movl $foo-2147483648, $eax` may be produced
(subtracting a number from the address of a static variable). If foo's
address is smaller than 2147483648, GNU ld/gold/LLD will error because
R_X86_64_32 cannot represent a negative value.

```
using absl::Hash;
struct NoOp {
  template < typename HashCode >
  friend HashCode AbslHashValue(HashCode , NoOp );
};
template <typename> class HashIntTest : public testing::Test {};
TYPED_TEST_SUITE_P(HashIntTest);
TYPED_TEST_P(HashIntTest, BasicUsage) {
  if (std::numeric_limits< TypeParam >::min )
    EXPECT_NE(Hash< NoOp >()({}),
              Hash< TypeParam >()(std::numeric_limits< TypeParam >::min()));
}
REGISTER_TYPED_TEST_CASE_P(HashIntTest, BasicUsage);
using IntTypes = testing::Types< int32_t>;
INSTANTIATE_TYPED_TEST_CASE_P(My, HashIntTest, IntTypes);

ld: error: hash_test.cc:(function (anonymous namespace)::gtest_suite_HashIntTest_::BasicUsage<int>::TestBody(): .text+0x4E472): relocation R_X86_64_32 out of range: 18446744071564237392 is not in [0, 4294967295]; references absl::hash_internal::HashState::kSeed
```

Actually any negative offset is not allowed because the symbol address
can be zero (e.g. set by `-Wl,--defsym=foo=0`). So disallow such folding.

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D93931
2020-12-30 18:47:26 -08:00
Luo, Yuanke 981a0bd858 [X86] Add x86_amx type for intel AMX.
The x86_amx is used for AMX intrisics. <256 x i32> is bitcast to x86_amx when
it is used by AMX intrinsics, and x86_amx is bitcast to <256 x i32> when it
is used by load/store instruction. So amx intrinsics only operate on type x86_amx.
It can help to separate amx intrinsics from llvm IR instructions (+-*/).
Thank Craig for the idea. This patch depend on https://reviews.llvm.org/D87981.

Differential Revision: https://reviews.llvm.org/D91927
2020-12-30 13:52:13 +08:00
Simon Pilgrim 8767f3bb97 [X86][AVX] Remove X86ISD::SUBV_BROADCAST (PR38969)
Followup to D92645 - remove the remaining places where we create X86ISD::SUBV_BROADCAST, and fold splatted vector loads to X86ISD::SUBV_BROADCAST_LOAD instead.

Remove all the X86SubVBroadcast isel patterns, including all the fallbacks for if memory folding failed.
2020-12-18 15:49:53 +00:00
Simon Pilgrim 992fad03e2 [X86][AVX] Replace extract_subvector(broadcast(), 0) folds with generic SimplifyDemandedVectorEltsForTargetNode handling.
Simplifies a few more cases, notably shuffle demanded elts cases.
2020-12-18 11:51:10 +00:00
Simon Pilgrim 931e66bd89 [X86] Remove extract_subvector(subv_broadcast_load()) fold.
This was needed in an earlier version of D92645, but isn't now - and I've just noticed that it was potentially flawed depending on the relevant widths of the broadcasted and extracted subvectors.
2020-12-17 11:02:49 +00:00
Simon Pilgrim cdb692ee0c [X86] Add X86ISD::SUBV_BROADCAST_LOAD and begin removing X86ISD::SUBV_BROADCAST (PR38969)
Subvector broadcasts are only load instructions, yet X86ISD::SUBV_BROADCAST treats them more generally, requiring a lot of fallback tablegen patterns.

This initial patch replaces constant vector lowering inside lowerBuildVectorAsBroadcast with direct X86ISD::SUBV_BROADCAST_LOAD loads which helps us merge a number of equivalent loads/broadcasts.

As well as general plumbing/analysis additions for SUBV_BROADCAST_LOAD, I needed to wrap SelectionDAG::makeEquivalentMemoryOrdering so it can handle result chains from non generic LoadSDNode nodes.

Later patches will continue to replace X86ISD::SUBV_BROADCAST usage.

Differential Revision: https://reviews.llvm.org/D92645
2020-12-17 10:25:25 +00:00
QingShan Zhang ebdd20f430 Expand the fp_to_int/int_to_fp/fp_round/fp_extend as libcall for fp128
X86 and AArch64 expand it as libcall inside the target. And PowerPC also
want to expand them as libcall for P8. So, propose an implement in the
legalizer to common the logic and remove the code for X86/AArch64 to
avoid the duplicate code.

Reviewed By: Craig Topper

Differential Revision: https://reviews.llvm.org/D91331
2020-12-17 07:59:30 +00:00