vselect ((X & Pow2C) == 0), LHS, RHS --> vselect ((shl X, C') < 0), RHS, LHS
Follow-up to D83073 - the non-splat mask cases where we actually see an
improvement are quite limited from what I can tell. AVX1 needs multiply
and blend capabilities and AVX2 needs vector shift and blend capabilities.
The intersection of those 2 constraints is only vectors with 32-bit or
64-bit elements.
XOP is/was better.
Differential Revision: https://reviews.llvm.org/D83181
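A rough scalar illustration of the select rewrite in the title, not taken from the patch. It assumes 32-bit elements and C' = 31 - log2(Pow2C); the function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Original condition: the single bit (1 << bit) is clear in x.
constexpr bool bitClearViaAnd(uint32_t x, unsigned bit) {
  return (x & (uint32_t{1} << bit)) == 0;
}

// Rewritten condition: shift the tested bit into the sign position and
// test the sign. Equivalent to "(int32_t)(x << C') < 0" with C' = 31 - bit.
constexpr bool bitSetViaShiftSign(uint32_t x, unsigned bit) {
  uint32_t shifted = x << (31u - bit);
  return (shifted >> 31) != 0;
}

// vselect ((X & Pow2C) == 0), LHS, RHS
constexpr int selectOriginal(uint32_t x, unsigned bit, int lhs, int rhs) {
  return bitClearViaAnd(x, bit) ? lhs : rhs;
}

// vselect ((shl X, C') < 0), RHS, LHS (note the swapped arms)
constexpr int selectRewritten(uint32_t x, unsigned bit, int lhs, int rhs) {
  return bitSetViaShiftSign(x, bit) ? rhs : lhs;
}
```

The arms are swapped because the new condition is true exactly when the old one is false.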
We were checking the VBROADCAST_LOAD element size against the extraction destination size instead of the extracted vector element size. PEXTRW/PEXTRB implicitly zero-extend, so they have i32 destination sizes for v8i16/v16i8 vectors, which resulted in us extracting from the wrong part of a load.
This patch bails from the fold if the vector element sizes don't match, and we now use the target constant extraction code later on like the pre-AVX2 targets, fixing the test case.
Found by internal fuzzing tests.
In the test based on PR46586:
https://bugs.llvm.org/show_bug.cgi?id=46586
...we are inserting 16-bits into the high element of the vector, shuffling it
to element 0, and extracting 32-bits. But xmm1 was never initialized, so the
top 16-bits of the extract are undef without this patch.
(It seems like we could do better than this by recognizing that we only demand
a subsection of the build vector, but I want to make sure we fix the
miscompile 1st.)
This path is only used for pre-SSE4.1, and simpler patterns get squashed
somewhere along the way, so the test still includes a 'urem' as it did in the
original test from the bug report.
Differential Revision: https://reviews.llvm.org/D83319
When an argument has the 'byval' attribute and should be
passed on the stack according to the calling convention,
a stack copy would be emitted twice. This causes
the real value to be put on the stack where the pointer
should be passed.
Differential Revision: https://reviews.llvm.org/D83175
Probably not super important since there are no real CPUs with
avx512vl and not avx512bw. But vpternlog should be better than
vblendvb.
I do wonder if we should use vpternlog even with BWI. We
currently use vpblendmb or vpblendmw by putting the mask into a GPR
and moving it to a k-register. But I don't think we hoist the
GPR to k-register copy in machine LICM. Using VPTERNLOG would use
a constant pool load, but has the advantage that we're pretty good
at hoisting and rematerializing those.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D83156
VPBLENDVB is multiple uops while VPTERNLOG is a single uop. So
we should use that instead.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D83155
Using PACK for truncations leaves us with intermediate shuffles that can be tricky to remove while the truncation tree is being formed.
This fold helps pull out the PERMQ case which is one of the most common, avoiding some costly lane-crossing shuffles.
A future patch will begin adding more general shuffle folding, which we should be able to use for HADD/HSUB as well.
We canonicalize patterns like:
%s = lshr i32 %a0, 1
%t = trunc i32 %s to i1
to:
%a = and i32 %a0, 2
%c = icmp ne i32 %a, 0
...in IR, but the bit-shifting original sequence may be better for x86 vector codegen.
I tried several variants of the transform, and it's tricky to not induce regressions.
In particular, I did not find a way to cleanly handle non-splat constants, so I've left
that as a TODO item here (currently negative tests for those are included). AVX512
resulted in some diffs, but didn't look meaningful, so I left that out too. Some of
the 256-bit AVX1 diffs are questionable, but close enough that they are probably
insignificant.
Differential Revision: https://reviews.llvm.org/D83073
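For reference, a scalar sketch of why the two IR forms above compute the same i1 value. This is an illustration only; the function names are hypothetical and the vector codegen trade-off itself is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// %s = lshr i32 %a0, 1 ; %t = trunc i32 %s to i1
// Truncating to i1 keeps only the low bit of the shifted value.
constexpr bool viaShiftTrunc(uint32_t a0) {
  return ((a0 >> 1) & 1u) != 0;
}

// %a = and i32 %a0, 2 ; %c = icmp ne i32 %a, 0
constexpr bool viaAndCmp(uint32_t a0) {
  return (a0 & 2u) != 0;
}
```

Both forms test bit 1 of the input; only the instruction sequence differs.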
We consider v32i16/v64i8 to be legal types on avx512f, but we
don't have most operations until avx512bw. But we can use
and/or/xor operations. So try those before splitting.
This is especially helpful since we turn some ands with constant
masks into shuffles in early DAG combines. So we should make sure
we recover those back to AND.
The comments here indicate that we prefer to promote the shifts
instead of allowing rotate to be pattern matched. But we weren't
taking into account whether 512-bit registers are enabled or
whether we have vpsllvw/vpsrlvw instructions.
splatvar_rotate_v32i8 is a slight regression, but the other cases
are neutral or improved.
D82257/rG3521ecf1f8a3 was incorrectly sign-extending a constant vector from the lsb. This is fine if all the constant elements are 'allsignbits' in the active bits, but if only some of the elements are, then we corrupt the constant values for the others.
This fix ensures we sign extend from the msb of the active/demanded bits instead.
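A small scalar sketch of the difference, under my reading that "from the lsb" means replicating bit 0 across the element. The helper names and the 4-bit active width in the usage note are assumptions for illustration, not code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend from bit 0: every bit becomes a copy of the lsb.
// Only correct when the active bits are already all-zeros or all-ones.
constexpr int64_t signExtendFromLsb(uint64_t v) {
  return (v & 1u) ? int64_t{-1} : int64_t{0};
}

// Sign-extend from the msb of the low `activeBits` bits (1..64).
constexpr int64_t signExtendFromActiveMsb(uint64_t v, unsigned activeBits) {
  uint64_t signBit = uint64_t{1} << (activeBits - 1);
  uint64_t low =
      activeBits == 64 ? v : (v & ((uint64_t{1} << activeBits) - 1));
  // Classic xor/subtract trick; relies on two's complement conversion.
  return static_cast<int64_t>((low ^ signBit) - signBit);
}
```

With 4 active bits, the element 0xF extends to -1 either way, but 0x5 is turned into -1 by the lsb version while the msb version keeps it as 5.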
If we're masking the result of an OR-reduction before comparing against zero, we can fold this into the PTEST() / MOVMSK(CMPEQ()) codegen by pre-masking the source value.
This works particularly well on PTEST which performs the AND as part of its operation, but the MOVMSK variant also benefits for non-V2I64 cases.
Fixes PR44781
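A scalar model of the identity the fold relies on, written as a sketch with a fixed 4 x i32 vector and a splat mask. The function names are illustrative only.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Mask applied to the result of the OR-reduction, then compared with zero.
bool maskAfterReduce(const std::array<uint32_t, 4> &v, uint32_t mask) {
  uint32_t acc = 0;
  for (uint32_t e : v)
    acc |= e;
  return (acc & mask) == 0;
}

// Mask applied to every source element first, then reduced and compared.
// This is the "pre-masked" form that can feed PTEST / MOVMSK(CMPEQ()).
bool maskBeforeReduce(const std::array<uint32_t, 4> &v, uint32_t mask) {
  uint32_t acc = 0;
  for (uint32_t e : v)
    acc |= (e & mask);
  return acc == 0;
}
```

The two agree because AND distributes over OR.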
If the shuffle is a blend and one input is a 0 vector, we should prefer AND over PSHUFB since it's available on more execution ports.
Differential Revision: https://reviews.llvm.org/D82798
This was producing reg = xor undef reg, undef reg. This looks similar
to a use of a value to define itself, and I want to disallow undef
uses for SSA virtual registers. If this were to use implicit_def,
there's no guarantee the two operands end up using the same register
(I think no guarantee exists even if the two operands start out as the
same register, but this was violated when I switched this to use an
explicit implicit_def). The MOV32r0 pseudo evidently exists to handle
this case, so use it instead. This was more work than I expected for
the 64-bit case, but I didn't see any helper for materializing a
64-bit 0.
If a constant is only allsignbits in the demanded/active bits, then sign extend it to an allsignbits bool pattern for OR/XOR ops.
This also requires SimplifyDemandedBits XOR handling to be modified to call ShrinkDemandedConstant on any (non-NOT) XOR pattern to account for non-splat cases.
Next step towards fixing PR45808 - with this patch we now get a <-1,-1,0,0> v4i64 constant instead of <1,1,0,0>.
Differential Revision: https://reviews.llvm.org/D82257
Pre-commit for D82257, this adds a DemandedElts arg to ShrinkDemandedConstant/targetShrinkDemandedConstant which will allow future patches to (optionally) add vector support.
We already fold (v2i64 scalar_to_vector(aext)) -> (v2i64 bitcast(v4i32 scalar_to_vector(x))), this adds support for similar aextload cases and also handles v2f64 cases that wrap the i64 extension behind bitcasts.
Fixes the remaining issue with PR39016
Generalize the vector operand extraction code for shuffle/pack ops - we can assume that the vector operands are the same width as the result, and any non-vector values can be reused directly in the smaller width op.
Summary:
A while ago I implemented the functionality to lower Microsoft __ptr32
and __ptr64 pointers, which are stored as 32-bit and 64-bit pointer
and are extended/truncated to the appropriate pointer size when
dereferenced.
This patch adds an addrspacecast to cast from the __ptr32/__ptr64
pointer to a default address space when dereferencing.
Bug: https://bugs.llvm.org/show_bug.cgi?id=42359
Reviewers: hans, arsenm, RKSimon
Subscribers: wdng, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D81517
This caused a Chromium test to miscompile. See discussion on the Phabricator
review.
> This patch extends MatchVectorAllZeroTest to handle OR vector reduction patterns where the result is compared against zero.
>
> Fixes PR45378
>
> Differential Revision: https://reviews.llvm.org/D81547
This reverts 057c9c7ee0
We have many cases where we call SimplifyMultipleUseDemandedBits and demand specific vector elements, but all the bits from them - this adds a helper wrapper to handle this.
If a collection of interconnected phi nodes is only ever loaded, stored
or bitcast then we can convert the whole set to the bitcast type,
potentially helping to reduce the number of register moves needed as the
phi's are passed across basic block boundaries. This has to be done in
CodegenPrepare as it naturally straddles basic blocks.
The algorithm just looks from phi nodes, looking at uses and operands for
a collection of nodes that all together are bitcast between float and
integer types. We record visited phi nodes to not have to process them
more than once. The whole subgraph is then replaced with a new type.
Loads and Stores are bitcast to the correct type, which should then be
folded into the load/store, changing its type.
This comes up in the biquad testcase due to the way MVE needs to keep
values in integer registers. I have also seen it come up from aarch64
partner example code, where a complicated set of sroa/inlining produced
integer phis, where float would have been a better choice.
I also added undef and extract element handling which increased the
potency in some cases.
This adds it with an option that defaults to off, and disabled for 32bit
X86 due to potential issues around canonicalizing NaNs.
Differential Revision: https://reviews.llvm.org/D81827
Pulled out from the ongoing work on D66004, currently we don't do a good job of simplifying variable shuffle masks that have already lowered to constant pool entries.
This patch adds SimplifyDemandedVectorEltsForTargetShuffle (a custom x86 helper) to first try SimplifyDemandedVectorElts (which we already do) and then constant pool simplification to help mark undefined elements.
To prevent lowering/combines infinite loops, we only handle basic constant pool loads instead of creating new BUILD_VECTOR nodes for lowering - e.g. we don't try to convert them to broadcast/vzext_load - there might be some benefit to this but if so I'd rather we come up with some way to reuse existing code than reimplement a lot of BUILD_VECTOR code.
Differential Revision: https://reviews.llvm.org/D81791
Without SSE41 we don't have the PCMPEQQ instruction, making cmp-with-zero reductions more complicated than necessary. We can compare as vXi32 (PCMPEQD) and tweak the MOVMSK comparison to test upper/lower DWORD comparisons.
This pre-fixes something that occurs with null tests for vectors of (64-bit) pointers such as in PR35129.
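A scalar model of the idea for a v2i64 all-zero test, as a sketch only. It assumes little-endian dword order and models PCMPEQD followed by a 4-bit MOVMSK; the function names are made up.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Compare each 32-bit half against zero (PCMPEQD with zero) and collect
// one bit per dword, as MOVMSKPS would on the compare result.
unsigned movmskDwordEqZero(const std::array<uint64_t, 2> &v) {
  unsigned mask = 0;
  for (unsigned i = 0; i < 2; ++i) {
    uint32_t lo = static_cast<uint32_t>(v[i]);
    uint32_t hi = static_cast<uint32_t>(v[i] >> 32);
    if (lo == 0)
      mask |= 1u << (2 * i);
    if (hi == 0)
      mask |= 1u << (2 * i + 1);
  }
  return mask;
}

// Every i64 element is zero iff both of its dword compare bits are set.
bool allZeroViaDwords(const std::array<uint64_t, 2> &v) {
  return movmskDwordEqZero(v) == 0xFu;
}
```

An i64 element is zero only when both its lower and upper dword bits are set in the mask, which is the adjustment to the MOVMSK comparison mentioned above.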
This patch extends MatchVectorAllZeroTest to handle OR vector reduction patterns where the result is compared against zero.
Fixes PR45378
Differential Revision: https://reviews.llvm.org/D81547
Pull the lowering code out of LowerVectorAllZeroTest (and rename it MatchVectorAllZeroTest).
We should be able to reuse this in combineVectorSizedSetCCEquality as well.
Another cleanup to simplify D81547.
Reduce by splitting the vector until we reach the target size for PTEST/MOVMSK_PCMPEQ. There might be some cases where AVX512 can perform this with 512-bit vectors but so far I haven't encountered any such pattern that reaches LowerVectorAllZeroTest.
Prep work for D81547
matchScalarReduction should return all its source vectors with the same type, so we can safely perform the OR reduction with the original type.
So we just need to bitcast for PTEST/PCMPEQB with the final reduced vector.
When checking for an enum function attribute, use hasFnAttribute()
rather than hasAttribute() at FunctionIndex, because it is
significantly faster (and more concise to boot).
Reduce XMM->GPR traffic by performing bitops on the vectors, and using a single MOVMSK call.
This requires us to use vectors of the same size and element width, but we can mix fp/int type equivalents with suitable bitcasting.
If the input to the bitcast is a sign bit test, it makes sense to
directly use vpmovmskb or vmovmskps/pd. This removes the need to
copy the sign bits to a k-register and then to a GPR.
Fixes PR46200.
Differential Revision: https://reviews.llvm.org/D81327
Noticed while trying to cleanup D66004 - if a shuffle operand came from a scalar, we're better off using INSERTPS vs UNPCKLPS as this is more likely to load fold later on. It also matches our existing BUILD_VECTOR lowering.
We can extend this to other PINSRB/D/Q/W cases in the future as the need arises.
Convert shift+or bool vector patterns into CONCAT_VECTORS if we know this will be lowered to KUNPCK (which requires 16+ vector elements).
Fixes PR32547
AVX512 mask types are often bitcasted to scalar integers for various ops before being bitcast back to be used as a predicate. In many cases we can avoid these KMASK<->GPR transfers and perform equivalent operations on the mask unit.
If the destination mask type is legal, and we can confirm that the scalar op originally came from a mask/vector/float/double type then we should try to avoid the scalar entirely.
This avoids some codegen issues noticed while working on PTEST/MOVMSK improvements.
Partially fixes PR32547 - we don't create a KUNPCK yet, but OR(X,KSHIFTL(Y)) can be handled in a separate patch.
Differential Revision: https://reviews.llvm.org/D81548
If we probe *after* each static stack allocation, we need to probe *before* each
dynamic stack allocation. Provide a scheme to describe the possible scenario.
Thanks a lot to @jonpa for motivating this fix.
Differential Revision: https://reviews.llvm.org/D81067
LowerSELECT sees the CMP with 0 and wants to use a trick with SUB
and SBB. But we can use the flags from the BSF/TZCNT.
Fixes PR46203.
Differential Revision: https://reviews.llvm.org/D81312
This combine tries to shrink a vzmovl if its input is an
insert_subvector. This patch improves it to turn
(vzmovl (bitcast (insert_subvector))) into
(insert_subvector (vzmovl (bitcast))) potentially allowing the
bitcast to be folded with a load.
We can pad the v2f32 with 0s up to v8f32 and use a v8f32->v8i64
operation. This is what we end up with on non-strict nodes except
we don't pad with 0s since we don't care about exceptions.
In the sign splat case, we can fold PMOVMSKB(PACKSSBW(LO(X), HI(X))) -> PMOVMSKB(BITCAST_v32i8(X)) without introducing a signmask + comparison (which unlike for any_of won't fold into a single TEST).
Handle MOVMSK 'allof' comparisons (X86ISD::SUB X, AllBitsMask) as well as 'anyof' patterns.
This allows us to handle these patterns in the MOVMSK(BITCAST(X)) pattern to fix PR37087.
As shown on PR37087, if we have a MOVMSK(BITCAST(X)) from a wider vector, then by using MOVMSK from the wider type (32/64-bit elements) we can improve the chances of further combines with SimplifyDemandedBits/Elts and on some targets (skylake) can be more efficient.
Shifts are supposed to always shift in zeros or sign bits regardless of their inputs. It's possible the input value may have been replaced with undef by SimplifyDemandedBits, but the shifted-in zeros are still demanded.
This issue was reported to me by ispc from 10.0. Unfortunately their failing test does not fail on trunk. Seems to be because the shl is optimized out earlier now and doesn't become VSHLI.
ispc bug https://github.com/ispc/ispc/issues/1771
Differential Revision: https://reviews.llvm.org/D81212
An initial patch adding combineSetCCMOVMSK to simplify MOVMSK and its vector input based on the comparison of the MOVMSK result.
This first stage just adds support for some simple MOVMSK(PACKSSBW()) cases where we remove the PACKSS if we're comparing ne/eq zero (any_of patterns), allowing us to directly compare against the v8i16 source vector(s) bitcasted to v16i8, with suitable masking to take into account which signbits are valid.
Future combines could peek through further PACKSS, target shuffles, handle all_of patterns (ne/eq -1), optimize to a PTEST op, etc.
Differential Revision: https://reviews.llvm.org/D81171
If we're only demanding the (shifted) sign bits of the shift source value, then we can use the value directly.
This handles SimplifyDemandedBits/SimplifyMultipleUseDemandedBits for both ISD::SHL and X86ISD::VSHLI.
Differential Revision: https://reviews.llvm.org/D80869
We looked through a truncate to get to the load. So we should be
deleting the truncate first.
There is a check that the node is really unused before deleting
so this didn't cause a functional issue.
This matches what we do for the full sized vector ops at the start of combineX86ShufflesRecursively, and helps getFauxShuffleMask extract more INSERT_SUBVECTOR patterns.
Try to prevent future node creation issues (as detailed in PR45974) by making the SelectionDAG reference const, so it can still be used for analysis, but not node creation.
As detailed on PR45974 and D79987, getFauxShuffleMask is creating nodes on the fly to create shuffles with inputs the same size as the result, causing problems for hasOneUse() checks in later simplification stages.
Currently only combineX86ShufflesRecursively benefits from these widened inputs so I've begun moving the functionality there, and out of getFauxShuffleMask. This allows us to remove the widening from VBROADCAST and *EXTEND* faux shuffle cases.
This just leaves the INSERT_SUBVECTOR case in getFauxShuffleMask still creating nodes, which will require more extensive refactoring.
We already had a DAG combine for (mmx (bitconvert (i64 (extractelement v2i64))))
to MOVDQ2Q.
Remove patterns for MMX_MOVQ2DQrr/MMX_MOVDQ2Qrr that use
scalar_to_vector/extractelement involving i64 scalar type with
v2i64 and x86mmx.
Let the codegen recognize the nomerge attribute and disable branch folding when the attribute is given.
Differential Revision: https://reviews.llvm.org/D79537
Only 64 bits will be loaded, not the whole 128 bits. We can
just combine it to a plain mmx load. This has the side effect of
enabling isel load folding for it.
This is part of my desire to get rid of isel patterns that shrink loads.
If we are using PTEST to check 'allsign bits' vector elements we can use MOVMSK to extract the signbits directly and perform the comparison on the scalar value.
For vXi16 cases, as we don't have a MOVMSK for this type, we must mask each signbit out of a PMOVMSKB v2Xi8 result, which folds into the TEST comparison.
If this allows us to remove a vector op (via the SimplifyMultipleUseDemandedBits call) this is consistently faster than a PTEST (https://godbolt.org/z/ziJUst).
I'm investigating whether we ever get regressions without the SimplifyMultipleUseDemandedBits call, even if this means we don't remove a vector op, but that has exposed some other poor codegen issues that I'm still investigating and would have to wait for a later patch.
Suggested on PR42035 to avoid unnecessary ashr(x,bw-1)/pcmpgt(0,x) sign splat patterns feeding into ptest.
Differential Revision: https://reviews.llvm.org/D80563
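A scalar sketch of the vXi16 case for v8i16, assuming little-endian byte order. PMOVMSKB is modelled by hand and the names are illustrative; the 0xAAAA constant keeps only the mask bits that come from the high byte of each i16 element.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Model of PMOVMSKB on a v8i16 reinterpreted as v16i8: bit k of the result
// is the top bit of byte k, where byte 2*i is the low half of element i.
unsigned pmovmskbOfV8i16(const std::array<uint16_t, 8> &v) {
  unsigned mask = 0;
  for (unsigned i = 0; i < 8; ++i) {
    uint8_t lo = static_cast<uint8_t>(v[i] & 0xFFu);
    uint8_t hi = static_cast<uint8_t>(v[i] >> 8);
    mask |= static_cast<unsigned>(lo >> 7) << (2 * i);
    mask |= static_cast<unsigned>(hi >> 7) << (2 * i + 1);
  }
  return mask;
}

// Any i16 sign bit set? Only the odd mask bits are real i16 sign bits.
bool anySignBitSet(const std::array<uint16_t, 8> &v) {
  return (pmovmskbOfV8i16(v) & 0xAAAAu) != 0;
}
```

The AND with 0xAAAA is the per-signbit masking that folds into the TEST comparison.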
There's more code for calling CombineTo and replacing the nodes
that I'd like to share, but it's complicated by the getNode call
in the middle that needs to be specific to each opcode.
While we're here, also make sure we recursively delete the load
we're replacing. It eventually gets removed by a RemoveDeadNodes
call at the end of DAG combine, but we should be more eager about
it. We were inconsistently doing this in some places but not all.
Fix combineSubToSubus to handle the new DAG to avoid a regression.
There are still regressions in test14/test15/test16, where it
looks like we were trying to set up cases we could match to
umin+trunc+subus but the handling was never finished. The
regression here isn't unique to sub. It's a lost opportunity for
taking an AND with two truncated inputs and producing a larger
AND with a single truncate. The same thing could happen with
any other node we handle in combineTruncatedArithmetic since we
are moving the truncate up the DAG.
Differential Revision: https://reviews.llvm.org/D80483
- test both 32 and 64 bit version
- probe the tail in dynamic-alloca
- generate more concise code
Differential Revision: https://reviews.llvm.org/D79482
This replaces the build_vector lowering code that was just added in
D80013
and matches the pattern later from the x86-specific "vzext_movl".
That seems to result in the same or better improvements and gets rid
of the 'TODO' items from that patch.
AFAICT, we always shrink wider constant vectors to 128-bit on these
patterns, so we still get the implicit zero-extension to ymm/zmm
without wasting space on larger vector constants. There's a trade-off
there because that means we miss potential load-folding.
Similarly, we could load scalar constants here with implicit
zero-extension even to 128-bit. That saves constant space, but it
means we forego load-folding, and so it increases register pressure.
This seems like a good middle-ground between those 2 options.
Differential Revision: https://reviews.llvm.org/D80131
If we're extracting an upper subvector from a broadcast we're better off extracting the lowest subvector instead as it avoids an actual extract instruction and might help SimplifyDemandedVectorElts further simplify the code.
This initial version only peeks through cases where we just demand the sign bit of an ashr shift, but we could generalize this further depending on how many sign bits we already have.
The pr18014.ll case is a minor annoyance - we've failed to move the psrad/paddd after the blendvps which would have avoided the extra move, but we have still increased the ILP.
On X86 (AVX1/AVX2), non-boolean masked loads only demand the sign bit of the mask, we already do the equivalent for masked stores.
Annoyingly I can't easily handle this inside TargetLowering::SimplifyDemandedBits as this is an x86 specific case for a generic node.
Differential Revision: https://reviews.llvm.org/D80478
This is a preliminary patch before I deal with the xor+and issue raised in D77301.
We get much better code for i8/i16 funnel shifts by concatenating the operands together and performing the shift as a double width type, it avoids repeated use of the shift amount and partial registers.
fshl(x,y,z) -> (((zext(x) << bw) | zext(y)) << (z & (bw-1))) >> bw.
fshr(x,y,z) -> (((zext(x) << bw) | zext(y)) >> (z & (bw-1))).
Alive2: http://volta.cs.utah.edu:8080/z/CZx7Cn
This doesn't do as well for i32 cases on x86_64 (the xor+and followup patch is much better) so I haven't bothered with that.
Cases with constant amounts are more dubious as well so I haven't currently bothered with those - it's these kinds of 'edge' cases that put me off trying to put this in TargetLowering::expandFunnelShift.
Differential Revision: https://reviews.llvm.org/D80466
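A scalar sketch of the i8 case, comparing the concatenate-and-shift form against a direct reference. This is an illustration with made-up names, not the lowering code. For fshl the result is the high half of the shifted double-width value; for fshr it is the low half.

```cpp
#include <cassert>
#include <cstdint>

// Reference i8 funnel shifts with the amount taken modulo the bit width.
constexpr uint8_t fshl8Ref(uint8_t x, uint8_t y, unsigned z) {
  unsigned s = z % 8;
  return s == 0 ? x : static_cast<uint8_t>((x << s) | (y >> (8 - s)));
}
constexpr uint8_t fshr8Ref(uint8_t x, uint8_t y, unsigned z) {
  unsigned s = z % 8;
  return s == 0 ? y : static_cast<uint8_t>((x << (8 - s)) | (y >> s));
}

// Concatenate x:y into a wider value and do a single shift.
constexpr uint8_t fshl8Wide(uint8_t x, uint8_t y, unsigned z) {
  uint32_t concat = (uint32_t{x} << 8) | y;
  return static_cast<uint8_t>((concat << (z & 7u)) >> 8); // high half
}
constexpr uint8_t fshr8Wide(uint8_t x, uint8_t y, unsigned z) {
  uint32_t concat = (uint32_t{x} << 8) | y;
  return static_cast<uint8_t>(concat >> (z & 7u)); // low half
}
```

The wide form uses the shift amount once and never needs the (bw - z) counterpart, which is where the codegen win for i8/i16 comes from.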
If the caller needs to be responsible for making sure the MaybeAlign
has a value, then we should just make the caller convert it to an Align
with operator*.
I explicitly deleted the relational comparison operators that
were being inherited from Optional. It's unclear what the meaning
of comparing two MaybeAligns where one is defined and the other isn't
should be. So make the caller responsible for defining the behavior.
I left the ==/!= operators from Optional. But that exposed a
weird quirk: ==/!= between Align and MaybeAlign used to require the
MaybeAlign to be defined, but now we use the operator== from
Optional that takes an Optional and the Value.
Differential Revision: https://reviews.llvm.org/D80455
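To show the kind of inherited behavior in question, here is a sketch that uses std::optional<unsigned> (C++17) as a stand-in for llvm::Optional and MaybeAlign. That substitution and the helper names are assumptions for the example; the real classes are not used here.

```cpp
#include <cassert>
#include <optional>

// Stand-in for MaybeAlign: an optional alignment value.
using MaybeAlignModel = std::optional<unsigned>;

// Inherited relational comparison: std::optional orders an empty value
// before every engaged value, which is an arbitrary choice for alignments.
bool inheritedLess(MaybeAlignModel a, MaybeAlignModel b) { return a < b; }

// Equality against a plain value: an empty optional simply compares
// unequal instead of requiring the optional to be defined.
bool equalsValue(MaybeAlignModel a, unsigned v) { return a == v; }
```

With the relational operators deleted, a caller has to spell out what an undefined alignment should mean before comparing.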
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.
In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.
This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.
The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.
The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer to the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.
Force any function containing a preallocated call to use the frame
pointer.
Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.
Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).
Aside from the tests added here, I checked that this codegen produces
correct code for something like
```
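struct A {
  A();
  A(A&&);
  ~A();
};
// Assumed declaration, not in the original message: foo takes an A by
// value plus an int and returns an A.
A foo(A, int);
void bar() {
  foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}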
```
by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.
Reverted due to unexpectedly passing tests, added REQUIRES: asserts for reland.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77689
We have getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression.
However, while negating the expression the cost might change because we are changing the DAG,
and we then hit an assertion if we negated the wrong expression because the cost can no longer be trusted.
This patch removes getNegatibleCost to avoid it getting out of sync with getNegatedExpression,
and checks the cost while negating the expression. It also reduces the duplicated code between
getNegatibleCost and getNegatedExpression, and fixes the crash for the test in D76638.
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D77319
This build vector lowering pattern came up in D79886.
I've tried to limit the improvement to cases where it looks
clearly better to load, but we could remove the 'TODO'
predicates already if we are willing to overlook some
corner cases.
Differential Revision: https://reviews.llvm.org/D80013
Summary:
The BFloat IR type is introduced to provide support for, initially, the BFloat16
datatype introduced with the Armv8.6 architecture (optional from Armv8.2
onwards). It has an 8-bit exponent and a 7-bit mantissa and behaves like an IEEE
754 floating point IR type.
This is part of a patch series upstreaming Armv8.6 features. Subsequent patches
will upstream intrinsics support and C-lang support for BFloat.
Reviewers: SjoerdMeijer, rjmccall, rsmith, liutianle, RKSimon, craig.topper, jfb, LukeGeeson, sdesmalen, deadalnix, ctetreau
Subscribers: hiraditya, llvm-commits, danielkiss, arphaman, kristof.beyls, dexonsmith
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78190
We already form PMULH when the shift is truncated. But we can
also do it from just a shift by extending the result.
Unfortunately, I get regressions if I try to replace the truncate
combine with this as we turn the truncate into a more complicated
sequence first. Then we are unable to combine that sequence with
the extend produced at the end of this combine.
Differential Revision: https://reviews.llvm.org/D79682
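A scalar sketch of the signed case with i16 inputs, for illustration only. The names are hypothetical, and the code assumes an arithmetic right shift for negative values (guaranteed from C++20, and what mainstream compilers do before that).

```cpp
#include <cassert>
#include <cstdint>

// What PMULHW computes per element: the high 16 bits of the 32-bit product.
constexpr int16_t mulhs16(int16_t a, int16_t b) {
  return static_cast<int16_t>((int32_t{a} * int32_t{b}) >> 16);
}

// The untruncated pattern: extend, multiply, shift right by 16, keep i32.
constexpr int32_t wideMulShift(int16_t a, int16_t b) {
  return (int32_t{a} * int32_t{b}) >> 16;
}

// The same value recovered by sign-extending the PMULH result.
constexpr int32_t extendedMulh(int16_t a, int16_t b) {
  return int32_t{mulhs16(a, b)};
}
```

The shifted product of two i16 values always fits in 16 signed bits, so extending the PMULH result loses nothing.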
Expands on the enablement of the shouldSinkOperands() TLI hook in:
D79718
The last codegen/IR test diff shows what I suspected could happen - we were
sinking all splat shift operands into a loop. But that's not what we want in
general; we only want to sink the *shift amount* operand if it is a splat.
Differential Revision: https://reviews.llvm.org/D79827
SDAG suffers when it can't see that a funnel operand is a splat value
(due to single-basic-block visibility), so invert the normal loop
hoisting rules to move a splat op closer to its use.
This would be part 1 of an enhancement similar to D63233.
This is needed to re-fix PR37426:
https://bugs.llvm.org/show_bug.cgi?id=37426
...because we got better at canonicalizing IR to funnel shift intrinsics.
The existing CGP code for shift opcodes is likely overstepping what it was
intended to do, so that will be fixed in a follow-up.
Differential Revision: https://reviews.llvm.org/D79718
Summary:
This patch refactors handling of VarArgs in
X86TargetLowering::LowerFormalArguments.
That refactoring was requested while reviewing
D69372. Code related to varargs handling is removed
from X86TargetLowering::LowerFormalArguments and
is divided into smaller routines.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D74794
We have a couple main strategies for legalizing MULH.
-If the vXi16 type is legal, extend to do the full i16 multiply
and then shift and truncate the results.
-Use unpcks to split each 128 bit lane into high and low halves.
For signed we have an extra case to split a v32i8 to v16i8 and then
use the extending to v16i16 strategy.
This patch proposes to use the unpck strategy instead. Which is
what we already do for unsigned.
This seems to be 1 instruction shorter when the RHS is constant
like the idiv case. It's 1 instruction longer for the smulo case.
But we're trading cross lane shuffles for inlane shuffles and a
shift.
Differential Revision: https://reviews.llvm.org/D79652
XOP targets have fast per-element vector shifts and we're better off splitting to 128-bit shifts where necessary (which is what we already do in LowerShift).
For the sint_to_fp(and(X,C)) -> and(X,sint_to_fp(C)) fold, allow combineVectorCompareAndMaskUnaryOp to match any X that ComputeNumSignBits says is all-bits, not just SETCC.
Noticed while investigating mask promotion issues in PR45808
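
To illustrate why the sint_to_fp(and(X,C)) -> and(X,sint_to_fp(C)) fold is safe when X is all-bits, here is a minimal scalar sketch of a single 32-bit lane. It assumes "all-bits" means the lane is either 0 or all-ones; the function names are hypothetical and are not taken from the backend.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a float as a 32-bit integer (models a bitcast).
uint32_t bitsOfFloat(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Reinterpret an unsigned lane as a signed lane (models a bitcast).
int32_t asSigned(uint32_t u) {
  int32_t s;
  std::memcpy(&s, &u, sizeof s);
  return s;
}

// Original form: mask first, then convert the integer to float.
// x is assumed to be 0 or 0xFFFFFFFF (an "all-bits" lane).
uint32_t convertAfterMask(uint32_t x, int32_t c) {
  int32_t masked = asSigned(x & static_cast<uint32_t>(c));
  return bitsOfFloat(static_cast<float>(masked));
}

// Folded form: convert the constant once, then mask the FP bit pattern.
uint32_t maskAfterConvert(uint32_t x, int32_t c) {
  return x & bitsOfFloat(static_cast<float>(c));
}
```

Both forms agree because an all-ones lane passes C through unchanged, and a zero lane yields integer 0, whose float conversion (+0.0) is also the all-zero bit pattern.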
This patch stores the alignment for ConstantPoolSDNode as an
Align and updates the getConstantPool interface to take a MaybeAlign.
Removing getAlignment() will be done as a follow up.
Differential Revision: https://reviews.llvm.org/D79436
We rely on the combine
(sext_in_reg (v4i64 a/sext (v4i32 x)), v4i1) -> (v4i64 sext (v4i32 sext_in_reg (v4i32 x, ExtraVT)))
to avoid complex v4i64 ashr codegen, but doing so prevents v4i64 comparison mask promotion, so ensure we attempt to promote before canonicalizing the (hopefully now redundant sext_in_reg).
Helps with the poor codegen in PR45808.
Neither gcc nor icc supports this. Split out from D79472. I want
to remove more, but it looks like icc does support some things
gcc doesn't and I need to double check our internal test suites.
Y is the start of several 2 letter constraints, but we also had
partial support to recognize it by itself. But it doesn't look
like it can get through clang as a single letter so the backend
support for this was effectively dead.
This helped fix some i686 vXi64 broadcast folds that were becoming v2Xi32 broadcasts because we didn't match the broadcast until after SimplifyDemandedBits worked out we only used the bottom 32-bits in PMUL(U)DQ and type legalization had split the original i64 load.
A couple of regressions occurred which required some fixups - adding concat_vectors(broadcast_load,broadcast_load) splat support and recognising (unnecessary) unary shuffles of already broadcasted vectors.
This came about as part of the work investigating vector load combining from shuffles for PR42550.
This patch replaces the VZEXT_MOVL removal from combineShuffle with a more general version based in SimplifyDemandedVectorEltsForTargetNode.
By using computeKnownBits we can always remove the VZEXT_MOVL if the upper elements of the source operand are known to be zero.
This requires us to add the conversion ops to computeKnownBitsForTargetNode as well.
Reviewed By: @craig.topper
Differential Revision: https://reviews.llvm.org/D79335
gcc supports selecting ymm0/zmm0 for the Yz constraint when used with 256 or 512 bit vector types.
Fixes PR45806
Differential Revision: https://reviews.llvm.org/D79448
Unless we're truncating an 'all-bits' result, using PACKSS for vXi64->vXi32 truncation causes problems with later combines as ComputeNumSignBits struggles to see through BITCASTs to smaller types. If we don't use PACKSS in these cases then we fall back to shuffles which are usually just as good.
We haven't promoted AND/OR/XOR to vXi64 types for a while. So
there's no reason to use isOperationLegalOrPromote. So we can
just use isOperationLegal by merging with ADD handling.
Default legalization will create two v8i64 truncs to v8i32, concat
them to v16i32, and then truncate the rest of the way to v16i8.
Instead we can truncate directly from v8i64 to v8i8 in the lower
half of an xmm. Then concat the two halves to use vpunpcklqdq.
This is the same number of uops, but the dependency chain through
the uops is better since the halves are merged at the end.
I had to add SimplifyDemandedBits support for VTRUNC to prevent
a regression on vector-trunc-math.ll. combineTruncatedArithmetic
no longer gets a chance to shrink vXi64 mul so we were producing
the v8i64 multiply sequence using multiple PMULUDQs. With the
demanded bits fix we are able to prune out the extra ops leaving
just two PMULUDQs, one for each v8i64 half. This is twice the
width of the 2 v8i32 PMULLDs we had before, but PMULUDQ is 1
uop and PMULLD is 2. We also save some truncates. It's probably
worth using PMULUDQ even when PMULLQ is available since the latter
is 3 uops, but that will require a different change.
Differential Revision: https://reviews.llvm.org/D79231
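
As a rough illustration of why demanded bits can prune the v8i64 multiply down to a single PMULUDQ per half, here is a scalar sketch of one 64-bit lane. The three-multiply expansion is an assumption about the general shape of a 64-bit multiply built from 32x32->64 multiplies, not a copy of the backend's exact sequence, and the function names are hypothetical.

```cpp
#include <cstdint>

// Models one lane of PMULUDQ: multiply the low 32 bits of each 64-bit lane
// and keep the full 64-bit product.
uint64_t pmuludq(uint64_t a, uint64_t b) {
  return (a & 0xFFFFFFFFull) * (b & 0xFFFFFFFFull);
}

// A full 64-bit multiply assembled from three 32x32->64 multiplies.
// The cross terms only contribute to bits 32 and above.
uint64_t mul64Expanded(uint64_t a, uint64_t b) {
  uint64_t lolo = pmuludq(a, b);
  uint64_t lohi = pmuludq(a, b >> 32);
  uint64_t hilo = pmuludq(a >> 32, b);
  return lolo + ((lohi + hilo) << 32);
}

// When only the low 8 bits of the product are demanded (the lane is about to
// be truncated to i8), the cross terms are dead and one PMULUDQ is enough.
uint8_t truncMulPruned(uint64_t a, uint64_t b) {
  return static_cast<uint8_t>(pmuludq(a, b));
}
```

The same argument holds for any truncation to 32 bits or fewer, since the shifted cross terms never reach the demanded bits.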
The splitVector helper uses extractSubVector which splits build vectors like we do here, so avoid reimplementing it.
splitVector could easily be extended to peek through bitcasts as well but I'd prefer to keep this commit NFC.
Handle concat_vectors(extract_subvector(broadcast(x)), extract_subvector(broadcast(x))) -> broadcast(x)
To expose this we also need collectConcatOps to recognise the insert_subvector(x, extract_subvector(x, lo), hi) subvector splat pattern
Also fix some cost tables for vXi1 types to match the costs entries for the types they will be promoted to.
Differential Revision: https://reviews.llvm.org/D79045
This pushes the NOT pattern up the DAG to help expose it for further combines (AND->ANDN in particular).
The PSHUFD/MOVDDUP 'splat' cases are the only ones I've seen in the wild so far, we can further generalize if/when we need to.
X86 matches several 'shift+xor' funnel shift patterns:
fold (or (srl (srl x1, 1), (xor y, 31)), (shl x0, y)) -> (fshl x0, x1, y)
fold (or (shl (shl x0, 1), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
fold (or (shl (add x0, x0), (xor y, 31)), (srl x1, y)) -> (fshr x0, x1, y)
These patterns are also what we end up with the proposed expansion changes in D77301.
This patch moves these to DAGCombine's generic MatchFunnelPosNeg.
All existing X86 test cases still pass, and we just have a small codegen change in pr32282.ll.
Reviewed By: @spatel
Differential Revision: https://reviews.llvm.org/D78935
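
For reference, the three 'shift+xor' patterns above can be checked against plain funnel shift semantics with a small scalar sketch. This assumes 32-bit values and a shift amount y in [0, 31]; the function names are hypothetical.

```cpp
#include <cstdint>

// Reference 32-bit funnel shifts; the amount is taken modulo 32.
uint32_t fshl32(uint32_t x0, uint32_t x1, uint32_t y) {
  y &= 31;
  return y == 0 ? x0 : (x0 << y) | (x1 >> (32 - y));
}
uint32_t fshr32(uint32_t x0, uint32_t x1, uint32_t y) {
  y &= 31;
  return y == 0 ? x1 : (x0 << (32 - y)) | (x1 >> y);
}

// The matched forms. Splitting the opposing shift into "by 1" then "by y^31"
// keeps every shift amount in range, even when y == 0.
uint32_t fshlPattern(uint32_t x0, uint32_t x1, uint32_t y) {
  return ((x1 >> 1) >> (y ^ 31)) | (x0 << y);
}
uint32_t fshrPatternShl(uint32_t x0, uint32_t x1, uint32_t y) {
  return ((x0 << 1) << (y ^ 31)) | (x1 >> y);
}
uint32_t fshrPatternAdd(uint32_t x0, uint32_t x1, uint32_t y) {
  return ((x0 + x0) << (y ^ 31)) | (x1 >> y);
}

// Check all three patterns against the reference for every in-range amount.
bool patternsMatchForAllAmounts(uint32_t x0, uint32_t x1) {
  for (uint32_t y = 0; y < 32; ++y) {
    if (fshlPattern(x0, x1, y) != fshl32(x0, x1, y)) return false;
    if (fshrPatternShl(x0, x1, y) != fshr32(x0, x1, y)) return false;
    if (fshrPatternAdd(x0, x1, y) != fshr32(x0, x1, y)) return false;
  }
  return true;
}
```

For y in [0, 31], y ^ 31 equals 31 - y, so the two-step shift moves by 32 - y bits in total without ever using an out-of-range amount.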
This section is the remnant of how this code was structured before
we made v32i16/v64i8 legal types with avx512f when not restricting
to 256 bit vectors. Now that there are just a few items left,
merge them near similar things in the other section.
I've modified isTruncateFree to get an accurate cost for types that need to be split. I'm planning to look into fixing it for all vectors, but need more cost cleanups first.
Differential Revision: https://reviews.llvm.org/D78973
This method has been commented as deprecated for a while. Remove
it and replace all uses with the equivalent getCalledOperand().
I also made a few cleanups in here. For example, this removes the use
of getElementType on a pointer when we could just use getFunctionType
from the call.
Differential Revision: https://reviews.llvm.org/D78882
The insert(truncate/extend(extract(vec0,c0)),vec1,c1) case in rGacbc5ede99 wasn't combining the 'mineltsize' with the src vector elt size which may be smaller due to implicit extension during extraction.
Reduced from test case provided by @mstorsjo
This is an NFC patch for D77319. The idea is to hide getNegatibleCost inside getNegatedExpression()
and have it return null if the cost is expensive, and to add some helper functions for ease of use. It also
renames the old getNegatedExpression to negateExpression to avoid the semantic conflict.
Reviewed By: RKSimon
Differential revision: https://reviews.llvm.org/D78291
Followup to the PR45604 fix at rGe71dd7c011a3 where we disabled most of these cases.
By creating the shuffle at the byte level we can handle any extension/truncation as long as we track how small the scalar got and assume that the upper bytes will need to be zero.
This is another enhancement to D77895/D78362
to avoid a round-trip from XMM->GPR->XMM.
This time we handle the case of starting/ending with different FP types
but always with signed i32 as the intermediate value.
I think this covers all of the faux vector optimization possibilities
for pre-AVX512.
There is at least 1 other transform mentioned in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c19
...where we fold an 'fpext' into a preceding 'sitofp'. I think we will
want to handle that earlier (DAGCombiner or instcombine) because that's
a target-independent optimization.
Differential Revision: https://reviews.llvm.org/D78758
Summary:
These helpers are exercised by follow-up commits in this patch series,
which is all about removing CodeGen differences with vs. without debug
info in the AArch64 backend.
Reviewers: fhahn, aprantl, jpaquette, paquette
Subscribers: kristof.beyls, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78260
gcc may warn here because X86ISD::NodeType is specified as "unsigned",
but ISD::NodeType is a naked C enum (although passed as an "unsigned"
throughout SDAG).
getFauxShuffle attempts to combine INSERT_VECTOR_ELT(TRUNCATE/EXTEND(EXTRACT_VECTOR_ELT(x))) patterns into a target shuffle chain.
PR45604 identified an issue where the scalar was truncated to a size smaller than the destination vector element and then zero extended back, which requires the upper bits to be zero'd which we don't currently do.
To avoid the bug I've added an early out in these truncation cases; a future commit should allow us to handle this by inserting the necessary SM_SentinelZero padding.
This is an enhancement to D77895 to avoid another
round-trip from XMM->GPR->XMM. This time we handle
the case of starting/ending with an f64 and casting
to signed i32 as the intermediate value.
It's a bit more involved than I initially assumed
because we need to use target-specific opcodes to
represent the non-standard cast ops.
Differential Revision: https://reviews.llvm.org/D78362
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.
This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.
Cost model is still a bit of a mess. In some places I tried to
match existing behavior. But really we need to account for
splitting and concatenation costs. Cost model for shuffles is
especially pessimistic.
Differential Revision: https://reviews.llvm.org/D76212
-Consistently name the functions as split*
-Add a helper for doing the two extractSubvector calls and determining the size of the split
-Use getSplitDestVTs to get the result type for the split node.
-Move the binary and unary helper to one place in the file near the extractSubvector functions. Left the VSETCC one near LowerVSETCC since that's its only caller.
-Remove the 256/512 wrappers that just had asserts. I don't think they provided a lot of value and now, with the routines called split*, it is more obvious at the call sites what they do.
-Make the unary routine support different source and dest types to support D76212.
-Add some weaker asserts into the helpers to make up for losing the very specific asserts from the 256/512 wrappers.
Differential Revision: https://reviews.llvm.org/D78176
It can be used to avoid passing the begin and end of a range.
This makes the code shorter and it is consistent with the other
wrappers we already have.
Differential revision: https://reviews.llvm.org/D78016
The shuffle decoding is used by X86ISelLowering and
MCTargetDesc/X86InstComments. The latter used to be in a
separate InstPrinter library. The Utils library existed to allow
InstPrinter and CodeGen to share the shuffle decoding. Since
X86InstComments now lives in the MCTargetDesc, which CodeGen
already depends on, we can sink the shuffle decoding there as well.
Differential Revision: https://reviews.llvm.org/D77980
Summary:
Fold (shift (shift X, C2), C1) -> (shift X, (C1 + C2)) for logical as
well as arithmetic shifts. This is needed to prevent regressions from
an upcoming funnel shift expansion change.
While we're here, fold (VSRAI -1, C) -> -1 too.
Reviewers: RKSimon, craig.topper
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77300
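
A minimal scalar sketch of the shift-of-shift fold, for one 32-bit lane. It assumes x86 immediate-shift behaviour for out-of-range amounts (logical shifts by 32 or more give zero, arithmetic shifts by 32 or more act like a shift by 31); the helper names are hypothetical.

```cpp
#include <cstdint>

// Logical immediate shifts: an amount >= 32 clears the lane.
uint32_t srli(uint32_t x, unsigned c) { return c >= 32 ? 0u : x >> c; }
uint32_t slli(uint32_t x, unsigned c) { return c >= 32 ? 0u : x << c; }

// Arithmetic immediate shift: an amount >= 32 behaves like 31, so the lane
// fills with copies of the sign bit.
uint32_t srai(uint32_t x, unsigned c) {
  if (c > 31) c = 31;
  uint32_t r = x >> c;
  if (x & 0x80000000u) r |= ~(0xFFFFFFFFu >> c); // replicate the sign bit
  return r;
}

// (shift (shift X, C2), C1) == (shift X, C1 + C2) for all three shift kinds.
bool shiftOfShiftFolds(uint32_t x, unsigned c1, unsigned c2) {
  return srli(srli(x, c2), c1) == srli(x, c1 + c2) &&
         slli(slli(x, c2), c1) == slli(x, c1 + c2) &&
         srai(srai(x, c2), c1) == srai(x, c1 + c2);
}
```

The (VSRAI -1, C) -> -1 fold falls out of the same model: shifting an all-ones lane arithmetically by any amount leaves it all-ones.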
Improve the chances of folding the writemask into the combined shuffle by scaling a wider shuffle mask to match the root's original type.
This creates a few minor issues with variable shuffles, preventing combines of shuffles because of the more limited support for binary shuffle types. In most cases we're probably better off combining the shuffles and losing the writemask fold, but this isn't always going to be true.
As discussed in PR36617:
https://bugs.llvm.org/show_bug.cgi?id=36617#c13
...we can avoid the likely slow round-trip from XMM to GPR to XMM
by using the vector versions of the convert instructions.
Based on experimental results from recent Intel/AMD chips, we don't
need to worry about triggering denorm stalls while operating on
garbage data in the high lanes with convert instructions, so this is
expected to always be as good or better perf than the scalar
instruction equivalent. FP exceptions are also not a concern because
strict code should not be using the regular SDAG opcodes.
Differential Revision: https://reviews.llvm.org/D77895
A lot of vectorized code doesn't use masks so we shouldn't penalize them by not doing shuffle combining on avx512 targets.
I've added support for VALIGNQ/VALIGND and 512-bit SHUF128 to prevent some regressions. I also prevented recombining 256-bit SHUF128 to PERM2X128. We may not need to add 256-bit SHUF128 support, but I don't think I found any cases requiring that in my testing.
Differential Revision: https://reviews.llvm.org/D77928
As proposed in D77881, we'll have the related widening operation,
so this name becomes too vague.
While here, change the function signature to take an 'int' rather
than 'size_t' for the scaling factor, add an assert for overflow of
32-bits, and improve the documentation comments.
Summary:
There are at least three clients for KnownBits calculations:
ValueTracking, SelectionDAG and GlobalISel. To reduce duplication the
common logic should be moved out of these clients and into KnownBits
itself.
This patch does this for AND, OR and XOR calculations by implementing
and using appropriate operator overloads KnownBits::operator& etc.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74060
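
To show the kind of logic being shared, here is a stand-alone sketch of known-bits propagation through AND, OR and XOR. MiniKnownBits and its helpers are hypothetical stand-ins on 32-bit values and are not the real KnownBits class; only the idea of tracking known-zero and known-one masks is taken from the description above.

```cpp
#include <cstdint>

// A bit set in Zero is known to be 0; a bit set in One is known to be 1.
// A bit set in neither mask is unknown.
struct MiniKnownBits {
  uint32_t Zero;
  uint32_t One;
};

// Helpers for building inputs: a fully known constant and a fully unknown value.
MiniKnownBits exactly(uint32_t v) { return MiniKnownBits{~v, v}; }
MiniKnownBits unknown() { return MiniKnownBits{0u, 0u}; }

// AND: a result bit is 0 if either input bit is 0, and 1 only if both are 1.
MiniKnownBits operator&(MiniKnownBits l, MiniKnownBits r) {
  return MiniKnownBits{l.Zero | r.Zero, l.One & r.One};
}

// OR: a result bit is 1 if either input bit is 1, and 0 only if both are 0.
MiniKnownBits operator|(MiniKnownBits l, MiniKnownBits r) {
  return MiniKnownBits{l.Zero & r.Zero, l.One | r.One};
}

// XOR: a result bit is known only when both input bits are known.
MiniKnownBits operator^(MiniKnownBits l, MiniKnownBits r) {
  return MiniKnownBits{(l.Zero & r.Zero) | (l.One & r.One),
                       (l.Zero & r.One) | (l.One & r.Zero)};
}
```

Note that AND with a known-zero operand gives a fully known result even when the other operand is completely unknown, which is the sort of fact all three clients need to derive identically.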
This removes a call to getScalarType from a bunch of call sites.
It also makes the behavior consistent with SIGN_EXTEND_INREG.
Differential Revision: https://reviews.llvm.org/D77631
Shuffle combining can insert zero byte sized elements into the shuffle mask, which combineX86ShufflesConstants will attempt to fold without taking into account whether the byte-sized type is legal (e.g. AVX512F only targets).
If we have a full-zeroable vector then we should just return a zero version of the root type, otherwise if the type isn't valid we should bail.
Fixes PR45443
truncateVectorWithPACK has its own vector length controls, so we can rely on those directly. This helps some existing truncation to subvector tests, which were being combined later during shuffle lowering at which point the sign/zero bit detection had become obscured preventing lowerShuffleWithPACK working as well as it could.
We had previously limited the shuffle(HORIZOP,HORIZOP) combine to binary shuffles, but we can often merge unary shuffles just as well, folding in UNDEF/ZERO values into the 64-bit half lanes.
For the (P)HADD/HSUB cases this is limited to fast-horizontal cases but PACKSS/PACKUS combines under all cases.
Our existing combine allows us to merge the shuffle of 2 similar 64-bit wide 'horizontal ops' (HADD/PACK/etc.) if the shuffle was a UNPCK/MOVSD.
This patch generalizes this to decode any target shuffle mask that can be widened to a 128-bit repeating v2*64 mask, which helps us catch PBLENDW/PBLENDD cases.
If we're packing from 128-bits to 64-bits then we don't need the RHS argument. This helps with register allocation, especially as we avoid repeating a use of the input value.
Similar to the lowerV16I8Shuffle implementation, for binary compaction v8i16 shuffles we can avoid the PUNPCKLDQ(PSHUFB,PSHUFB) pattern on SSE41+ targets by using PACKUSDW and PBLENDW. Before SSE41 we would need to use PACKSSDW but that requires sign extension that seems to destroy any gains, even on targets without PSHUFB.
This is a bigger gain on AMD than Intel targets but should never be a regression, and avoiding the shuffle mask load(s) is always useful.
Noticed in codegen while dealing with PR31443.
Extend lowerShuffleWithPACK/matchShuffleWithPACK/createPackShuffleMask to handle compaction style shuffle masks that can be lowered to chains of PACKSS/PACKUS if their inputs are suitably sign/zero extended.
This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask should recognise the PACKSS/PACKUS chains.
This pass replaces each indirect call/jump with a direct call to a thunk that looks like:
lfence
jmpq *%r11
This ensures that if the value in register %r11 was loaded from memory, then
the value in %r11 is (architecturally) correct prior to the jump.
Also adds a new target feature to X86: +lvi-cfi
("cfi" meaning control-flow integrity)
The feature can be added via clang CLI using -mlvi-cfi.
This is an alternate implementation to https://reviews.llvm.org/D75934 that merges the thunk insertion functionality with the existing X86 retpoline code.
Differential Revision: https://reviews.llvm.org/D76812
First step toward making use of canLowerByDroppingEvenElements to match chains of PACKSS/PACKUS for compaction shuffles.
At the moment we still only match a single stage but the MatchPACK is now more general.
PTEST/TESTP sets EFLAGS as:
TESTZ: ZF = (Op0 & Op1) == 0
TESTC: CF = (~Op0 & Op1) == 0
TESTNZC: ZF == 0 && CF == 0
If we are inverting the 0'th operand of a PTEST/TESTP instruction we can adjust the comparisons to correctly handle the inversion implicitly.
Additionally, for "TESTZ" (ZF) cases, the allones case, PTEST(X,-1) can be simplified to PTEST(X,X).
We can expand this for the TESTZ(X,~Y) pattern and also handle KTEST/KORTEST in the future.
Differential Revision: https://reviews.llvm.org/D76984
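
The flag identities behind this can be sketched on a scalar model. This assumes a 64-bit integer stands in for the vector operands; the function names are hypothetical.

```cpp
#include <cstdint>

// Flag results of PTEST as defined above.
bool testZ(uint64_t op0, uint64_t op1) { return (op0 & op1) == 0; }  // ZF
bool testC(uint64_t op0, uint64_t op1) { return (~op0 & op1) == 0; } // CF

// Inverting operand 0 swaps the roles of ZF and CF, so the NOT can be
// dropped as long as the flag being tested is swapped as well.
bool invertedOperandSwapsFlags(uint64_t x, uint64_t y) {
  return testZ(~x, y) == testC(x, y) && testC(~x, y) == testZ(x, y);
}

// For the ZF test, an all-ones second operand is equivalent to testing X
// against itself, which avoids materializing the constant.
bool allOnesBecomesSelfTest(uint64_t x) {
  return testZ(x, ~0ull) == testZ(x, x);
}
```

Since TESTNZC requires both ZF == 0 and CF == 0, swapping the two flags leaves that condition unchanged.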
Summary:
Make sure we do not assert on value types not being
simple in getFauxShuffleMask when analysing operations
such as "v8i16 = truncate v8i24".
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77136
Summary: These were templated due to SelectionDAG using int masks for shuffles and IR using unsigned masks for shuffles. But now that D72467 has landed we have an int mask version of IRBuilder::CreateShuffleVector. So just use int instead of a template
Reviewers: spatel, efriedma, RKSimon
Reviewed By: efriedma
Subscribers: hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D77183
Also add lowerShuffleWithPACK call to lowerV32I16Shuffle - shuffle combining was catching it but we avoid a lot of temporary shuffle creations if we catch it at lowering first.
If canLowerByDroppingEvenElements indicates that the shuffle is a N:1 compaction pattern and the inputs are suitably sign/zero extended then we can use a chain of PACKSS/PACKUS to compact.
This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask can recognise PACKSS/PACKUS chains.
Summary:
This patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: jyknight, sdardis, nemanjai, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, jfb, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77059
If we are lowering to X86ISD::SHUF128 we are going to lose track of individual 128-bit lanes that are UNDEF, so if we can widen these to guarantee that they are sequential with their neighbour we should. This helps with later shuffle combines.
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128, and we can use the implicit zeroing of the upper half.
I've added some extra tests to vector-shuffle-combining-avx2.ll to make sure we don't lose coverage.
This was an inner helper function for the real matchShuffleAsByteRotate function, but it is more generic and is used directly for VALIGN lowering which doesn't work at the byte level.
These transforms rely on a vector reduction flag on the SDNode
set by SelectionDAGBuilder. This flag exists because SelectionDAG
can't see across basic blocks so SelectionDAGBuilder is looking
across and saving the info. X86 is the only target that uses this
flag currently. By removing the X86 code we can remove the flag
and the SelectionDAGBuilder code.
This patch adds a dedicated IR pass for X86 that looks across the
blocks and transforms the IR into a form that the X86 SelectionDAG
can finish.
An advantage of this new approach is that we can enhance it to
shrink the phi nodes and final reduction tree based on the zeroes
that we need to concatenate to bring the partially reduced
reduction back up to the original width.
Differential Revision: https://reviews.llvm.org/D76649
We can improve computeKnownBits results by avoiding excess bitcasts.
For this pattern we were doing:
(v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK))))
By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits and allow shuffle combining to peek through later on.
This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.
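
A small sketch of why the mask matters for this pattern: PACKUS saturates, so it only acts as a plain truncation once the upper byte of every 16-bit lane is known to be zero. This models 128-bit PACKUSWB on arrays and assumes a 0x00FF mask per lane; the type aliases and function names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using V8i16 = std::array<uint16_t, 8>;
using V16i8 = std::array<uint8_t, 16>;

// One lane of PACKUSWB: treat the input as signed i16 and saturate to u8.
uint8_t packusLane(uint16_t v) {
  if (v & 0x8000u) return 0;     // negative as i16: clamp to 0
  if (v > 0x00FFu) return 0xFF;  // too large: clamp to 255
  return static_cast<uint8_t>(v);
}

// Pack two v8i16 inputs into one v16i8 result, first input in the low half.
V16i8 packus(const V8i16 &a, const V8i16 &b) {
  V16i8 r{};
  for (std::size_t i = 0; i < 8; ++i) {
    r[i] = packusLane(a[i]);
    r[i + 8] = packusLane(b[i]);
  }
  return r;
}

// Clear the upper byte of each 16-bit lane (the AND with MASK).
V8i16 maskLowBytes(const V8i16 &v) {
  V8i16 r{};
  for (std::size_t i = 0; i < 8; ++i)
    r[i] = static_cast<uint16_t>(v[i] & 0x00FFu);
  return r;
}

// With the mask applied, PACKUS matches a plain per-lane truncation.
bool maskedPackusTruncates(uint16_t seed) {
  V8i16 a{}, b{};
  for (std::size_t i = 0; i < 8; ++i) {
    a[i] = static_cast<uint16_t>(seed + 0x1111u * i);
    b[i] = static_cast<uint16_t>(~seed + 0x0F0Fu * i);
  }
  V16i8 packed = packus(maskLowBytes(a), maskLowBytes(b));
  for (std::size_t i = 0; i < 8; ++i) {
    if (packed[i] != static_cast<uint8_t>(a[i])) return false;
    if (packed[i + 8] != static_cast<uint8_t>(b[i])) return false;
  }
  return true;
}
```

Without the mask, a lane such as 0x1234 would saturate to 0xFF instead of truncating to 0x34, which is why known-zero upper bits need to be visible to the later combines.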
As discussed on PR31443, we should be trying to use PACKUS for binary truncation patterns to reduce the number of shuffles.
The plan is to support AVX2+ targets once we've worked around PR45315 - we fail to peek through a VBROADCAST_LOAD mask to recognise zero upper bits in a PACKUS pattern.
We should also be able to add support for v8i16 and possibly 256/512-bit vectors as well.
As long as we extract from a source vector with smaller elements and we zero-extend the element in the final shuffle mask then we can safely peek through truncations and any/zero-extensions to find the source extraction.
Add support for combining shuffles to AVX512 truncate instructions - another step toward fixing D56387/D66004. It also fixes SKX code on PR31443.
We could probably extend this further to handle non-VLX truncation cases.
This reverts commit 4e0fe038f4. Re-lands
65b21282c7.
After landing 5ff5ddd0ad to add int3 into
trailing unreachable blocks, we can now remove these extra stack
adjustments without confusing the Win64 unwinder. See
https://llvm.org/45064#c4 or X86AvoidTrailingCall.cpp for a full
explanation.
Fixes PR45064.
Summary:
This patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: dylanmckay, sdardis, nemanjai, hiraditya, kbarton, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76551
createPSADBW uses SplitsOpsAndApply so should be able to handle
any size.
Restrict the extract result type to i32 or i64 since that's what
we have coverage for today and probably matches what the
isSimple() check gave us before.
Differential Revision: https://reviews.llvm.org/D76560
SplitsOpsAndApply will take care of any needed splitting correctly.
All that we need to check is that the vector element count is a
power of 2.
Differential Revision: https://reviews.llvm.org/D76558
We often widen xmm/ymm vectors to ymm/zmm by insertion into an undef base vector. By letting getTargetShuffleAndZeroables track the undef elts we can help avoid a lot of unnecessary cross-lane shuffles.
Fixes PR44694
Now that rG18c19441d105 has improved VPERM2X128 handling, we can perform this to improve x64->x32 truncation without poor cross-lane issues.
Someday combineX86ShufflesRecursively will handle this, but we're still really bad at dealing with different vector widths.
The combine tries to put the broadcast in either the integer or
fp domain to match the bitcast domain. But we can only do this
if the broadcast size is 32 or larger.
Under certain circumstances we'll end up in the position where the negated shift amount will get truncated to the type specified by getScalarShiftAmountTy(), so we need to test for a truncated version of the shift amount as well.
This allows us to remove half of the remaining patterns tested for by X86ISelLowering's combineOrShiftToFunnelShift.