This replaces the attempt in 20af71f8ec to use combineToExtendBoolVectorInReg to create X86ISD::BLENDV masks directly; instead we use it to canonicalize the iX bitcast to a sign-extended mask and then truncate it back to vXi1 prior to legalization breaking it apart.
Fixes #53760
If we're comparing a value against zero, strip away any zero-extension and perform the comparison on the pre-extended value
Fixes #38308
Differential Revision: https://reviews.llvm.org/D121472
If the SETCC fp-condcode is supported on SSE as a single CMPPS/PD op then we can use convertIntLogicToFPLogic to reduce EFLAGS and XMM->GPR traffic like we do for AVX targets.
Differential Revision: https://reviews.llvm.org/D121210
If the shift amount has been zero-extended, peek through as this might help us further canonicalize the shift amount.
Fixes regression mentioned in rG147cfcbef1255ba2b4875b76708dab1a685085f5
This completes the removal of uses of SelectionDAG::getSplatValue started in D119090 - by avoiding extracting the splatted element we make it a lot easier to zero-extend the bottom 64 bits of the shift amount and fix issues we had on 32-bit targets where i64 isn't legal.
I've removed the old version of getTargetVShiftNode that took the scalar shift amount argument and LowerRotate can finally efficiently handle vXi16 rotates-by-scalar (using the same code as general funnel-shifts).
The only regression we see is in the X86-AVX2 PR52719 test case in vector-shift-ashr-256.ll - this is now hitting the same problem as the X86-AVX1 case (failure to simplify a multi-use X86ISD::VBROADCAST_LOAD), which I intend to address in a follow-up patch.
For i16/32/64 vectors, if the upper bits are known to be zero, then we can try to truncate to vXi8 (if it's worth it) and perform this as a PSADBW to add+zext each v4i8 subvector to an i64 sum, which we can then reduce together.
This addresses some of the PR42674 test cases where the source data was vXi8 but had been extended to match a wider unsigned integer accumulator.
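For reference, a minimal SSE2 intrinsics sketch of the underlying PSADBW trick (illustrative only, not the lowering code itself): summing a byte vector against zero leaves a byte-sum in each 64-bit half, which can then be added together.

// Illustrative sketch: PSADBW against zero sums the bytes of each 64-bit
// half; the two partial sums are then added to give the full byte sum.
#include <emmintrin.h>
#include <stdint.h>

static uint32_t sum_bytes_16(__m128i v) {
  __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
  return (uint32_t)_mm_cvtsi128_si32(sad) +        // sum of bytes 0..7
         (uint32_t)_mm_extract_epi16(sad, 4);      // sum of bytes 8..15
}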
Differential Revision: https://reviews.llvm.org/D120193
Using getSplatValue causes poor codegen due to not always being able to remove the EXTRACT_VECTOR_ELT created inside getSplatValue.
The vXi16 shifts/rotates are still showing occasional regressions but vXi8 is a definite improvement.
combineX86ShuffleChain no longer has to assume that the shuffle inputs are the right size, so don't create unnecessary nodes messing up oneuse limits as detailed on Issue #45319
Removing widening from combineX86ShufflesRecursively will be the next step, followed by removing combineX86ShuffleChainWithExtract entirely
With only a load-fold the diffs look neutral. If there's a load and store (rmw)
fold opportunity as shown in the test based on #53862, then we end up with an
extra instruction.
Fixes #53862
Differential Revision: https://reviews.llvm.org/D120281
Peek through if we're extracting a non-zero'th subvector in an attempt to fold the extract into a lane-crossing shuffle
This also exposes a failure to fold extract_subvector(movddup(x),c) -> movddup(extract_subvector(x,c))
Extension to PR45974: unless we actually combine the target shuffles, we shouldn't be generating temporary nodes as they may interfere with the one-use checks in the shuffle recursions
When building 32b x86 code as PIC, the existing handling of "i"
constraints is conservative since generally we have to go through the
GOT to find references to functions.
But generally, BlockAddresses from C code refer to the Function in the
current TU. Permit BlockAddresses to be used with the "i" constraint
for those cases.
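As a hypothetical illustration (not the test case from the patch), this is the kind of construct that should now be accepted when building 32-bit PIC code:

// Hypothetical example: a BlockAddress for a label in the current function
// used with the "i" constraint; names and asm text are illustrative only.
void record_label(void) {
  asm volatile("# block address: %0" ::"i"(&&resume));
resume:;
}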
I regressed this in
commit 4edb9983cb ("[SelectionDAG] treat X constrained labels as i for asm")
Fixes: https://github.com/llvm/llvm-project/issues/53868
Reviewed By: efriedma, MaskRay
Differential Revision: https://reviews.llvm.org/D119905
Extend the existing split where we already do this for v32i16/v64i8
We can end up trying to use PCMPEQ/GT if the result needs to be sign-extended (typically due to the DAGCombiner::foldSextSetcc fold).
Fixes #53842
This is a retry of b4b97ec813 - that was reverted because it
could cause miscompiles by illegally reordering memory operations.
A new test based on #53695 is added here to verify we do not have
that same problem.
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
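Roughly, at the C/intrinsics level (an assumed-equivalent sketch, not the motivating case from #50310):

// Sketch of the fold: extracting one element of a loaded vector can instead
// be a scalar load at the element's offset, even if the vector load has
// other uses.
#include <smmintrin.h>  // SSE4.1 for _mm_extract_epi32

int extract_elt2(const int *p) {
  __m128i v = _mm_loadu_si128((const __m128i *)p);
  return _mm_extract_epi32(v, 2);   // equivalent to the scalar load p[2]
}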
Fixes #50310
Fixes #53695
Differential Revision: https://reviews.llvm.org/D118376
D108887 fixed an alignment mismatch by changing the caller's alignment in
the ABI. However, we found some cases that still assume the alignment is
the vector size. This patch fixes them to avoid the runtime crash.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D114536
Extend the existing fold to use SimplifyMultipleUseDemandedBits as well as SimplifyDemandedVectorElts/SimplifyDemandedBits when attempting to simplify based off known zero vector elements.
To find uniform shift/rotation amounts, we currently use SelectionDAG::getSplatValue, which creates a node that extracts the scalar value from the source vector; this makes it more difficult for later combines to remove the extraction and stay on the SIMD unit, and can be a problem when the scalar type is illegal (i.e. i64 vs v2i64 on 32-bit targets).
This patch begins to use SelectionDAG::getSplatSourceVector (which SelectionDAG::getSplatValue uses internally) and adds a new variant of getTargetVShiftNode that takes the source vector and the splat index, and adjusts the vector in place to create the zero-extended value suitable for the SSE PSLL/PSRL/PSRA uniform instructions.
I'm still addressing a number of regressions when used for normal vector shifts, so I've just handled the funnelshift/rotation lowering for this first patch. I can then focus on the yak shaving (SimplifyDemandedBits/Elts in particular) necessary to always use SelectionDAG::getSplatSourceVector.
Differential Revision: https://reviews.llvm.org/D119090
Pulled out of D106237, this replaces the X86ISD::AVG DAG node with the
generic ISD::AVGCEILU. It doesn't remove the detectAVGPattern method,
but the extra generic ISel matching does alter the existing test.
Differential Revision: https://reviews.llvm.org/D119073
Replace the *_EXTEND node with the raw operands, this will make it easier to use combineToExtendBoolVectorInReg for any boolvec extension combine.
Cleanup prep for Issue #53760
This is no-functional-change-intended because only the
x86 target enables the TLI hook currently.
We can add fmul/fdiv opcodes to the switch similar to the
proposal D119111, but we don't need to make other changes
like enabling target-specific combines.
We can also add integer opcodes (add, or, shl, etc.) to
the switch because this function is called from all of the
generic binary opcodes.
The goal is to incrementally enable the profitable diffs
from D90113 while avoiding regressions.
Differential Revision: https://reviews.llvm.org/D119150
This reverts commit b4b97ec813.
As discussed in post-commit feedback at:
https://reviews.llvm.org/D118376
...there's a stage 2 failure on a Mac running a clang-refactor tool test.
This is an intentionally limited/different form of D90113.
That patch bravely tries to generalize folds where we pull
a binop into the arms of a select:
N0 + (Cond ? 0 : FVal) --> Cond ? N0 : (N0 + FVal)
...but it is not universally profitable.
This is the inverse of IR canonicalization as discussed in
D113442.
We know that this transform is not entirely profitable even
within x86, so we only handle x86 vector fadd/fsub as a 1st
step. The intent is to prevent AVX512 regressions as mentioned
in D113442.
The plan is to port this to DAGCombiner (so it will eventually
look more like D90113) and add more types/cases in pieces with
many more tests to verify that we are seeing improvements.
Differential Revision: https://reviews.llvm.org/D118644
None of the external users actually touch these (they're purely used internally down the recursive call) - it's trivial to add another wrapper if anything ever does want to track known elements.
We already call SimplifyDemandedVectorElts using whether each vector mask element is zero/nonzero, this just extends this to also try SimplifyDemandedBits using the demanded bits mask generated from the nonzero elements.
This also requires an additional TargetLowering::SimplifyDemandedBits DemandedBits/DemandedElts wrapper.
Limit this to SSE41 - AVX1 targets to avoid UNPCKL(PSHUFB,PSHUFB), pre-SSE41 we don't have PACKUSDW/BLENDW and with AVX2 we can perform this as PERMQ(PSHUFB()).
Allows pow2 mask tests to avoid an unnecessary constant load.
Noticed while investigating how to extend MatchVectorAllZeroTest to support more allof/anyof patterns.
Without AVX512 (which can efficiently extend/truncate to vXi16/vXi32), unpacking/packing to vXi16 is more efficient than relying on the (uops-heavy) PBLENDV shift expansion
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
Fixes #50310
Differential Revision: https://reviews.llvm.org/D118376
Previous folds by combineSetCCMOVMSK might have converted these to CMP when changing the bitwidth, and the CMP->SUB fold might not have happened yet (or might not happen at all)
rG9103b73fe052 was assuming that we could OR/AND with the source vector, but that will fail on float/double vectors without bitcasting - it also missed the case that any_of checks might be testing less than all the source elements
InstCombine performs this more generally with SimplifyUsingDistributiveLaws, but we don't need anything that complex here - this is mainly to fix up cases where logic ops get created late on during lowering, often in conjunction with sext/zext ops for type legalization.
https://alive2.llvm.org/ce/z/gGpY5v
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
There are no further reasons to limit this to cmpeq-with-zero; the outstanding regressions with lowering to PTEST have now been addressed
Improves codegen for Issue #53379
vXi16 allof(cmp()) reduction patterns will have to pack the comparison results to vXi8 to use PMOVMSKB.
If we're reducing cmpeq(), then we can compare the vXi8 halves directly - similar to what we already do for vXi64 -> vXi32 for cases without PCMPEQQ.
As suggested on PR53379, for all-of icmp-eq patterns, we can use ptest(sub(x,y)) on SSE41+ targets
This is a generalization of the existing allof(cmpeq(x,0)) -> ptest(x) pattern
We can probably extend this further, in particular to handle 256-bit cases on pre-AVX2 targets, but this part of the generalization is pretty trivial
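A minimal intrinsics sketch of the equivalence being exploited (illustrative, assuming i32 elements):

// Sketch: all-of(x == y) across a v4i32 is the same as checking that the
// whole vector x - y is zero, which PTEST can do directly (SSE4.1).
#include <smmintrin.h>

int all_lanes_equal(__m128i x, __m128i y) {
  __m128i d = _mm_sub_epi32(x, y);
  return _mm_testz_si128(d, d);   // ZF set iff every bit of d is zero
}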
Fixes Issue #53379
Pulled out of D106237, this folds truncstore(extend(x)) back to store(x)
if the original store was legal. This can come up due to the order we
fold nodes. A fold from X86 needs to be adjusted to prevent infinite
loops, to have it pick the operand of a trunc more directly.
Differential Revision: https://reviews.llvm.org/D117901
AVX512 doesn't provide an ADDSUB instruction, but if we've built this from a build vector of scalar fsub/fadd elements we can still lower to blend(fsub,fadd)
This allows us to match shuffle<1,3,5,7,9,11,13,15> style shift+trunc/pack patterns as well as the existing shuffle<0,2,4,6,8,10,12,14> style shuffle trunc/pack patterns
In the future, interleaving patterns might benefit from an even more general implementation for higher strides
Allows shuffle combining through per-element shift nodes
This exposed a number of issues with shuffle combining with target intrinsics that are lowered to nodes later during legalization - in particular shuffle combining and SimplifyDemandedVectorElts were being called after canonicalizeShuffleWithBinOps, meaning that shuffles didn't have a chance to be combined away before the shuffle(binop(x,y)) -> binop(shuffle(x),shuffle(y)) fold.
Partial element rotate patterns (e.g. for element insertion on Issue #53124) were being split if every lane wasn't crossing, but really there's a good repeated mask hiding in there.
This is a suggested follow-up to D116765.
This removes a clear of the register operand, so it is better
for code size, but it does potentially create a false register
dependency on surrounding code. If that is a problem, it should
be solvable using dependency-breaking code that is used for
other instructions.
Differential Revision: https://reviews.llvm.org/D116804
To enable this on all targets there are still a number of regressions due to getSplatValue/getTargetVShiftNode, but these don't really affect pre-AVX targets.
This is very similar to the existing ROTL/ROTR support for scalar shifts in LowerRotate, I think as time goes on we should be able to share much of this code in helpers between Funnel Shift + Rotation lowering.
select (X != 0), -1, Y --> 0 - X; or (sbb), Y
select (X != 0), Y, -1 --> X - 1; or (sbb), Y
We already had these x86 carry-flag transforms, but one was over-specified to
handle a "0" select arm only. That's just a special-case of the more general
pattern (the 'or' will be deleted if Y is zero).
This is part of solving #53006, but it misses that example because some other
combine has already converted that exact pattern into math ops.
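A scalar model of the first pattern, purely as an illustrative sanity check:

// Scalar model of "select (X != 0), -1, Y --> 0 - X (sets carry); sbb; or":
// the sbb result is all-ones when X != 0 and zero otherwise, so OR-ing it
// with Y yields the select.
#include <assert.h>
#include <stdint.h>

static uint32_t select_neg1_or_y(uint32_t x, uint32_t y) {
  uint32_t sbb = (uint32_t)0 - (x != 0);  // -1 if x != 0, else 0
  return sbb | y;
}

int main(void) {
  assert(select_neg1_or_y(0, 42) == 42);
  assert(select_neg1_or_y(7, 42) == UINT32_MAX);
  return 0;
}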
Differential Revision: https://reviews.llvm.org/D116765
For the C code below, we can use VNNI to combine the mul and add operations.

int usdot_prod_qi(unsigned char *restrict a, char *restrict b, int c,
                  int n) {
  int i;
  for (i = 0; i < 32; i++) {
    c += ((int)a[i] * (int)b[i]);
  }
  return c;
}
We don't support combining across basic blocks in this patch.
Differential Revision: https://reviews.llvm.org/D116039
SelectionDAG::getConstant creates an APInt internally anyway, and getLowBitsSet helps assert for legal bitwidths. Plus it silences static analyzer out-of-bounds shift warnings.
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.
Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.
Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.
Reviewed By: lebedev.ri, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D116522
Fix issue in TargetLowering::expandROT where we only attempt to flip a rotation if the other direction has better support - this matches TargetLowering::expandFunnelShift
This allows us to enable ISD::ROTR lowering on SSE targets, which particularly simplifies/improves codegen for splat amount and AVX2 per-element shifts.
Move shift-by-constant handling into its only user (VSHIFT intrinsics lowering).
This is some prep-work for getTargetVShiftNode to no longer take a scalar shift amount - we're introducing temporary ISD::EXTRACT_VECTOR_ELT nodes via SelectionDAG::getSplatValue to accommodate this which can cause various issues, including unnecessary scalarization and xmm->gpr->xmm transfers, and causes problems for 32-bit codegen if we fail to remove an (illegal) i64 scalar extracted from a (legal) vXi64 vector.
Pull out the "rotl(x,y) --> (unpack(x,x) << zext(splat(y % bw))) >> bw" special case from vXi8 lowering so we can reuse it for vXi32 types as well.
There are still some regressions with vXi16 to handle before this becomes entirely general.
It also allows us to remove the now unnecessary hack for handling amount-modulo before splatting.
Instead of bailing and using the default expansion, we can more efficiently use the shl(unpack(x,x),unpack(amt,zero)) pattern for vXi8 rotl, as we'll then use vXi16 fast PMULLW (or PSLLVW).
This required some minor changes to improve constant folding during unpack shuffle creation and convertShiftLeftToScale to support constants that have already been lowered to constant pools.
We currently rely on generic promotion to vXi16/vXi32 types for rotation lowering on various AVX512 targets.
We can more efficiently perform this by making use of the shl(unpack(x,x),amt) style pattern that we already use for vXi8 rotation by splat amounts, either by widening to a larger vector type or unpacking lo/hi halves of the subvectors so we can access whatever vXi16/vXi32 per-element shifts are supported.
This uncovered an issue in the supportedVectorShiftWithImm/supportedVectorVarShift legality checkers which was using hasAVX512() instead of useAVX512Regs() to detect support for 512-bit vector shifts.
NOTE: I'm actually hoping to eventually reuse this code for shl(unpack(y,x),amt) funnel shift lowering (vXi8 and wider), but initially I just want to ensure we have efficient ISD::ROTL lowering for all targets.
Differential Revision: https://reviews.llvm.org/D115180
If either operand has an element with allbits set, then we don't need the equivalent element from the other operand, as allbits are guaranteed to be set.
We have reasonably fast sqrt and accurate rsqrt for the half type due to the
limited fraction bits. So we need neither multi-step refinement for rsqrt
nor to replace sqrt with rsqrt.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114844
combinePMULH currently only truncates vXi32/vXi64 multiplies to PMULHW/PMULUW if the source operands are SEXT/ZEXT instructions for a 'free' truncation.
But we can generalize this to any source operand with sufficient leading sign/zero bits that would allow PACKS/PACKUS to be used as a 'cheap' truncation.
This helps us avoid the wider multiplies, in exchange for truncation on both source operands instead of the result.
Differential Revision: https://reviews.llvm.org/D113371
We have a transform that tries to turn a pmovmskb into movmskps/pd or
movmskps into movmskpd. This transform isn't valid if the truncate
discarded bits that might be set by the original movmsk.
We could fix this by inserting an AND after the new movmsk to discard
the equivalent of the truncated bits, but I've left that for a later
patch.
Fixes PR52567.
Differential Revision: https://reviews.llvm.org/D114306
For tagged-globals, we only need to disable relaxation for globals that
we actually tag. With this patch, function pointer relocations, which
we do not instrument, can be relaxed.
This patch also makes tagged-globals work properly with LTO, as
-Wa,-mrelax-relocations=no doesn't work with LTO.
Reviewed By: pcc
Differential Revision: https://reviews.llvm.org/D113220
Check for a hidden ISD::ROTR (rotl(sub(0,x))) - vXi8 lowering can handle both (it's always beneficial for splats, but otherwise only if we have VPTERNLOG).
We currently hit infinite loops in TargetLowering::expandROT if we set ISD::ROTR to custom, which needs addressing before we extend this much further.
This is aligned with GCC's behavior.
Also, alias `-mno-fp-ret-in-387` to `-mno-x87`, by which we can fix pr51498.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112143
AMD64 ABI mandates caller to specify the number of used SSE registers
when passing variable arguments.
GCC also provides option -mskip-rax-setup to skip the setup of rax when
SSE is disabled. This helps to reduce the code size, see pr23258.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112413
If we're rotating vXi8 by a splatted amount, then unpack to vXi16, perform a SHL by the (extended) scalar, and then pack the results.
This is a vector equivalent to the "rotl(x,y) -> (((aext(x) << bw) | zext(x)) << (y & (bw-1))) >> bw" style expansion we do for scalars in LowerFunnelShift.
I think we can usefully use this for other vector types and vector funnel-shifts in the future, depending how we expand beyond D113192 for matching rotations/funnel-shifts for more type/ops.
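For reference, a scalar sketch of that expansion for a single 8-bit rotate (illustrative only, not the vector lowering itself):

// Scalar model of the expansion: concatenate x with itself into the wider
// type, shift left by the masked amount, and take the upper byte.
#include <stdint.h>

static uint8_t rotl8(uint8_t x, unsigned y) {
  unsigned doubled = ((unsigned)x << 8) | x;       // (aext(x) << bw) | zext(x)
  return (uint8_t)((doubled << (y & 7)) >> 8);     // (... << (y & (bw-1))) >> bw
}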
Update splitVectorIntUnary/splitVectorIntBinary to use this internally, after some operand type sanity checks.
Avoid code duplication and makes it easier to split other vector instruction forms in the future.
The first try at this patch ( bf5748a1af ) was reverted ( 5be64d4164 )
because it could crash. The cause of that problem was failing to account
for the optional peek-through-bitcast in the enclosing function.
This version of the patch adds a clause to avoid the fold in case of
bitcasts because it is unlikely to be profitable in that scenario.
A test case based on https://llvm.org/PR52504 was added to make sure
we don't have that problem again.
Original commit message:
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
Differential Revision: https://reviews.llvm.org/D113603
If an operand is a bitcasted or widened constant, try to more aggressively create broadcastable constants for folding, which in particular helps non-VLX modes.
I've refactored getAVX512Node so that VLX targets can make better use of this as well.
NOTE: In the future, I think we should consider removing the broadcast of constant data from DAG entirely and move this to either X86InstrInfo::foldMemoryOperand or a new pass - AVX1/2 targets has similar problems with missed (whole vector) folds that need to be improved as well.
Differential Revision: https://reviews.llvm.org/D113845
This caused assertion failures:
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:9446:
void llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode *, llvm::SDNode *):
Assertion `(!From->hasAnyUseOfValue(i) || From->getValueType(i) == To->getValueType(i))
&& "Cannot use this version of ReplaceAllUsesWith!"' failed.
See comment on the code review.
(Had to update some expectations in test/CodeGen/X86/vselect-zero.ll
manually due to other changes having landed after the reverted one.)
> and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
>
> This avoids the -1 constant vector in favor of an arithmetic shift
> instruction if it exists (the ISA is still not complete after all
> these years...).
>
> We catch this pattern late in combining by matching PCMPGT, so it
> should not interfere with more general folds.
>
> Differential Revision: https://reviews.llvm.org/D113603
This reverts commit bf5748a1af.
This helper provides a more complete approach for lowering to X86ISD::PACKSS/PACKUS nodes - testing for existing suitable sign/zero extension before recreating it.
It also optionally packs the upper half instead of the lower half.
Similar to what we've done for other ops, this patch widens VPTERNLOG to a 512-bit op for non-VLX targets.
Fixes regressions in D113192
Differential Revision: https://reviews.llvm.org/D113827
For AVX512 targets without VLX, we have to widen 128/256-bit vectors to 512-bits to use some specific AVX512 instructions (or some other instructions with predicates etc.).
I've pulled out the widening code from LowerFunnelShift into the helper function, so we can convert some other widening patterns in the future.
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.
We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefers to extend to fast v32i16 shifts).
Currently we only constant fold target shuffles if any of the sources has one use, or it would remove a variable shuffle mask - the aim being to avoid constant pool bloat.
This patch proposes we should constant fold by default and only limit this if optsize is enabled - I've added a basic test for this in vector-mul.ll (the pmuludq case is by far the most common), I can add other specific test cases if people need them.
This should permit further constant folding, break some instruction dependencies and help reduce shuffle port pressure.
Differential Revision: https://reviews.llvm.org/D113748
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
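An intrinsics sketch of the equivalence (SSE2, illustrative function names):

// Sketch: the pcmpgt mask "X > -1" selects non-negative lanes, which is the
// complement of the arithmetic-shift sign mask, so pandn(vsrai(X, 31), Y)
// produces the same result without the -1 constant vector.
#include <emmintrin.h>

__m128i mask_nonneg_before(__m128i x, __m128i y) {
  return _mm_and_si128(_mm_cmpgt_epi32(x, _mm_set1_epi32(-1)), y);
}

__m128i mask_nonneg_after(__m128i x, __m128i y) {
  return _mm_andnot_si128(_mm_srai_epi32(x, 31), y);  // pandn(vsrai(x,31), y)
}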
Differential Revision: https://reviews.llvm.org/D113603
This fixes the crash due to lacking VZEXT_MOVL support with i16.
Reviewed By: LuoYuanke, RKSimon
Differential Revision: https://reviews.llvm.org/D113661
combineMulToPMADDWD is currently limited to legal types, but there's no reason why we can't handle any larger type that the existing SplitOpsAndApply code can use to split to legal X86ISD::VPMADDWD ops.
This also exposed a missed opportunity for pre-SSE41 targets to handle SEXT ops from types smaller than vXi16 - without PMOVSX instructions these will always be expanded to unpack+shifts, so we can cheat and convert this into a ZEXT(SEXT()) sequence to make it a valid PMADDWD op.
Differential Revision: https://reviews.llvm.org/D110995
Use ComputeMinSignedBits() to ensure the mul source operands at least sign-extend down from the bottom 16 bits.
This will make it easier if/when we try to support handling of source types larger than 32-bits.
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.
I've included some usage; it's not exhaustive, just the more obvious cases where the intent is clear.
Differential Revision: https://reviews.llvm.org/D113396
We have several places where we need to extract the raw bits data from a BUILD_VECTOR node, so consolidate this to a single helper function that handles Undefs and Integer/FP constants, including implicit truncation.
This should make it easier to extend D113202 to handle more constant folding of bitcasted constant data.
Differential Revision: https://reviews.llvm.org/D113351
For v8i16 shuffle patterns that are lowered with AND+PACKUS, check to see if the sources are from a 256-bit vector and perform the masking using BLENDW at the 256-bit level.
With the test changes we can see more examples of duplicate XMM/YMM zero vectors (PR26018) :(
The check for whether a zero extension was needed was subtly wrong and
saw a value that was already 64 bits, so did not extend.
Fixes PR52357.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112860
The changes in D80163 deferred the assignment of the MachineMemOperand (MMO)
until the X86ExpandPseudo pass. This will result in a crash due to the prolog
insert point being sunk across the pseudo instruction VASTART_SAVE_XMM_REGS.
Moving the assignment to the creation of the node can avoid the problem.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D112859
This is part of rG1cfecf4fc427 that was reverted to fix PR51226 - concatenating the broadcasts is OK, it's the splatted loads that crash (we're not detecting extloads). I'm still creating a reduced test case so haven't added the load handling again yet.
The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.
This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.
There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this often creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?
Noticed while investigating the quality of interleaved load/store codegen.
Differential Revision: https://reviews.llvm.org/D111960
As mentioned on D108539, when the gather indices are smaller than the pointer size, they are sign-extended BEFORE scale is applied, making the general fold unsafe.
If the index has sufficient sign bits then folding the scale could be safe - I'll investigate this.
If the index operand for a gather/scatter intrinsic is being scaled (self-addition or a shl-by-immediate) then we may be able to fold that scaling into the intrinsic scale immediate value instead.
Fixes PR13310.
Differential Revision: https://reviews.llvm.org/D108539
In the instruction encoding, the passthru register is always
tied to the destination register. The CPU scheduler has to wait
for the last writer of this register to finish executing before
the gather can start. This is true even if the initial mask is
all ones so that the passthru will never be used.
By explicitly zeroing the register we can break the false
dependency. The zero idiom is executed completely by the
register renamer and so is immediately considered ready.
Authored by Craig.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D112505
As noted in D112464, a pre-AVX target may not be able to fold an
under-aligned vector load into another op, so we shouldn't report
that as a load folding candidate. I only found one caller where
this would make a difference -- combineCommutableSHUFP() -- so
that's where I added a test to show the (minor) regression.
Differential Revision: https://reviews.llvm.org/D112545
This can avoid a vector add and a constant pool load. Or an explicit broadcast in case of non-constant.
Also reverse the transform any time we encounter a constant index addend that can't be moved to base. In that case pull the constant from base into the index. This reduces code size needed for the displacement since we needed the index add anyway. Limit this to scale of 1 to avoid divisibility and wrap issues.
Authored by Craig.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D111595
This is a small extension of D112095 to avoid another regression
seen with D112085.
In this case, we allow the same conversion from usubsat to ALU
ops if the target supports vpternlog.
That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG().
This seems better than putting a magic immediate constant directly in
this code to create the exact vpternlog that we need. It's possible that
there are other special-cases along these lines, so we should try to
keep all of the vpternlog magic in one place.
Differential Revision: https://reviews.llvm.org/D112138
usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1)
This would be a regression with D112085 where we combine to
usubsat more aggressively, so avoid that by matching the
special-case where we are subtracting SMIN (signmask):
https://alive2.llvm.org/ce/z/4_3gBD
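A scalar sanity-check sketch of that identity (illustrative only):

// Scalar check: subtracting the sign mask with unsigned saturation clears
// the sign bit when it is set and yields zero otherwise, which is exactly
// (x ^ SMIN) & (x s>> BW-1).
#include <assert.h>
#include <stdint.h>

static uint32_t usubsat_smin(uint32_t x) {
  const uint32_t SMIN = 0x80000000u;
  return x >= SMIN ? x - SMIN : 0;                      // usubsat(x, SMIN)
}

static uint32_t xor_and_form(uint32_t x) {
  return (x ^ 0x80000000u) & (uint32_t)((int32_t)x >> 31);
}

int main(void) {
  uint32_t tests[] = {0, 1, 0x7FFFFFFFu, 0x80000000u, 0x80000001u, 0xFFFFFFFFu};
  for (unsigned i = 0; i < sizeof(tests) / sizeof(tests[0]); ++i)
    assert(usubsat_smin(tests[i]) == xor_and_form(tests[i]));
  return 0;
}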
Differential Revision: https://reviews.llvm.org/D112095
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....
Differential Revision: https://reviews.llvm.org/D111998
If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace with a lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero.
The lshr also has a better chance of folding with shuffles etc.
This patch detects the absolute value pattern on the RHS of a
subtract. If we find it we swap the CMOV true/false values and
replace the subtract with an ADD.
There may be a more generic way to do this, but I'm not sure.
Targets that don't have legal or custom ISD::ABS use a generic
expand in DAG combiner already when it sees (neg (abs(x))). I
haven't checked what happens if the neg is a more general subtract.
Fixes PR50991 for X86.
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D111858
CMOVGE reads SF and OF. CMOVNS only reads SF. This matches with
other recent changes to use a single flag where possible. It also
matches gcc codegen.
I believe this technically changes whether the conditional move happens
on INT_MIN, but for INT_MIN both registers are the same so it doesn't
matter.
Differential Revision: https://reviews.llvm.org/D111826
As noted in https://github.com/halide/Halide/pull/6302,
we hilariously fail to match PAVG if we even as much
as look at it the wrong way.
In this particular case, the problem stems from the fact that
`PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s,
and InstCombine really loves to get rid of both of these,
for example replace them with a bit mask. So we may not have
said `zext`.
Instead of checking for that + type match,
I think we should rely on the actual active type,
as per the knownbits.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D111571
The select(pshufb,pshufb) -> or(pshufb,pshufb) fold uses getConstVector to create the refreshed pshufb masks, which treats all negative indices as undef.
PR52122 shows that if we were selecting an element that the PSHUFB has set to zero we must set it back to 0x80 when we recreate the PSHUFB mask and not just leave it as SM_SentinelZero
Without popcnt we had a special case for using the parity flag from a single i8 test instruction if only bits 7:0 could be non-zero. That special case is still useful when we have popcnt.
To reach this special case, we enable custom lowering of parity for i16/i32/i64 even when popcnt is enabled. The check for POPCNT being enabled is now after the special case in LowerPARITY.
Fixes PR52093
Differential Revision: https://reviews.llvm.org/D111249
The X86 backend only needs to know whether structure return is via an
sret pointer. This removes the categorization enumeration and
adjusts, templatizes and renames the related functions.
Differential Revision: https://reviews.llvm.org/D109966
Noticed while investigating the regressions in D110995 - if the RHS element is already zero, then we don't need the corresponding LHS element.
Technically we could also recheck RHS once we have LHS's known zeros, but I haven't seen any missed opportunities from that yet.
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.
Differential Revision: https://reviews.llvm.org/D110807
PR52040 identified several issues with the HOP(HOP'(X,X),HOP'(Y,Y)) -> HOP(PERMUTE(HOP'(X,Y)),PERMUTE(HOP'(X,Y)) slow-HOP fold.
Not only was there a copy+paste typo when accessing the inner HOP operands, but the (unnecessary) ReplaceAllUsesOfValueWith call was missing one-use checks.
Now that we have better shuffle combines of HOPs we can just return a new HOP() sequence and not use ReplaceAllUsesOfValueWith at all - this actually improved pair_sum_v8i32_v4i32 codegen as it kicks off further shuffle combines.
X86's decomposeMulByConstant never permits mul decomposition to shift+add/sub if the vector multiply is legal.
Unfortunately this isn't great for SSE41+ targets, which have PMULLD for vXi32 multiplies, but it is often quite slow. This patch proposes to allow decomposition if the target has the SlowPMULLD flag (i.e. Silvermont). We also always decompose legal vXi64 multiplies - even the latest IceLake has really poor latencies for PMULLQ.
Differential Revision: https://reviews.llvm.org/D110588
On Windows, i128 arguments are passed as indirect arguments, and
they are returned in xmm0.
This is mostly fixed up by `WinX86_64ABIInfo::classify` in Clang, making
the IR functions return v2i64 instead of i128, and making the arguments
indirect. However for cases where libcalls are generated in the target
lowering, the lowering uses the default x86_64 calling convention for
i128, where they are passed/returned as a register pair.
Add custom lowering logic, similar to the existing logic for i128
div/mod (added in 4a406d32e9),
manually making the libcall (while overriding the return type to
v2i64 or passing the arguments as pointers to arguments on the stack).
X86CallingConv.td doesn't seem to handle i128 at all, otherwise
the windows specific behaviours would ideally be implemented as
overrides there, in generic code, handling these cases automatically.
This fixes https://bugs.llvm.org/show_bug.cgi?id=48940.
Differential Revision: https://reviews.llvm.org/D110413
The intention of this code turns out to be superfluous, as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.
Fixes PR51974
When AVX512FP16 is enabled, FROUND(f16) cannot be handled by type
legalization, and no libcall in libm is ready for fround(f16) yet.
FROUNDEVEN(f16) has a related instruction in AVX512FP16.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D110312
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. it's zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.
There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.
We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.
By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.
Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.
Differential Revision: https://reviews.llvm.org/D110342
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
This patch is to support transforming something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
to _mm512_fmadd_pch(a, b, acc).
Differential Revision: https://reviews.llvm.org/D109953
This change adds the ASan intrinsic to the list which sets hasCopyImplyingStackAdjustment.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D110012
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.
Differential Revision: https://reviews.llvm.org/D109993
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.
Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.
It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be present in several elements,
not just the 0'th one.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108411
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108382
We can use `OR` instead of `BLEND` if either the element we are not picking is zero (or masked away);
or the element we are picking overwhelms (e.g. it's all-ones) whatever the element we are not picking:
https://alive2.llvm.org/ce/z/RKejao
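A small intrinsics check of the first case, with values chosen to satisfy the zero precondition (illustrative only):

// Illustrative check: when every lane we are not picking is zero, BLENDV and
// OR produce identical results (SSE4.1).
#include <smmintrin.h>
#include <assert.h>
#include <string.h>

int main(void) {
  __m128i mask = _mm_set_epi32(0, -1, 0, -1);   // pick b in lanes 0 and 2
  __m128i a = _mm_set_epi32(7, 0, 9, 0);        // zero in the lanes where b is picked
  __m128i b = _mm_set_epi32(0, 0, 0, 3);        // zero in the lanes where a is picked
  __m128i blend = _mm_blendv_epi8(a, b, mask);
  __m128i orred = _mm_or_si128(a, b);
  assert(memcmp(&blend, &orred, sizeof(blend)) == 0);
  return 0;
}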
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D109726
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.
Soft deprecate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D109535
This library function only exists in compiler-rt not libgcc. So
this would fail to link unless we were linking with compiler-rt.
This is consistent with the recent removal of calls to mulodi4 on
32-bit targets like D108928.
I suppose maybe we could keep the libcalls for platforms like
Darwin that use compiler-rt exclusively?
Reviewed By: nickdesaulniers, MaskRay
Differential Revision: https://reviews.llvm.org/D109385
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`. This achieves two things:
1) This starts standardizing predicates across the LLVM codebase,
following (in this case) ConstantInt. The word "Value" doesn't
convey anything of merit, and is missing in some of the other things.
2) Calling an integer "null" doesn't make any sense. The original sin
here is mine and I've regretted it for years. This moves us to calling
it "zero" instead, which is correct!
APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go. As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.
Included in this patch are changes to a bunch of the codebase, but there are
more. We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.
Differential Revision: https://reviews.llvm.org/D109483
integer 0/1 for the operand of bundle "clang.arc.attachedcall"
https://reviews.llvm.org/D102996 changes the operand of bundle
"clang.arc.attachedcall". This patch makes changes to llvm that are
needed to handle the new IR.
This should make it easier to understand what the IR is doing and also
simplify some of the passes as they no longer have to translate the
integer values to the runtime functions.
Differential Revision: https://reviews.llvm.org/D103000
Similar to D108842, D108844, and D108926.
__has_builtin(__builtin_mul_overflow) returns true for 32b x86 targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ARCH=i386 + CONFIG_BLK_DEV_NBD=y builds of the Linux kernel
that are using __builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://bugs.llvm.org/show_bug.cgi?id=35922
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: lebedev.ri, RKSimon
Differential Revision: https://reviews.llvm.org/D108928
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }
Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. in each 2xi16 pair, the first half is positive and the second half is zero); this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.
That way the sign of the result is still preserved, and the second half is still zero.
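A scalar model of why that is safe for a single lane (illustrative, not the combine itself):

// Scalar model of one PMADDWD lane: if x is zero-extended from i16 (upper 17
// bits zero) and y has at least 17 sign bits, the high-half product is zero
// and the low-half product equals the full i32 multiply.
#include <assert.h>
#include <stdint.h>

static int32_t pmaddwd_lane(uint32_t x, uint32_t y) {
  int32_t lo = (int32_t)(int16_t)(x & 0xFFFF) * (int32_t)(int16_t)(y & 0xFFFF);
  int32_t hi = (int32_t)(int16_t)(x >> 16) * (int32_t)(int16_t)(y >> 16);
  return lo + hi;
}

int main(void) {
  uint32_t x = 0x5ABC;                    // upper 17 bits known zero
  uint32_t y = (uint32_t)(int32_t)-1234;  // at least 17 sign bits (fits in i16)
  assert(pmaddwd_lane(x, y) == (int32_t)(x * y));
  return 0;
}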
Noticed while investigating PR47437.
Differential Revision: https://reviews.llvm.org/D108522
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)
TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in a broken state meanwhile.
While the end result of discussion may lead back to the current design,
it may also not lead to the current design.
Therefore I take it upon myself
to revert the tree back to the last known good state.
This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.