This is no-functional-change-intended because only the
x86 target enables the TLI hook currently.
We can add fmul/fdiv opcodes to the switch similar to the
proposal D119111, but we don't need to make other changes
like enabling target-specific combines.
We can also add integer opcodes (add, or, shl, etc.) to
the switch because this function is called from all of the
generic binary opcodes.
The goal is to incrementally enable the profitable diffs
from D90113 while avoiding regressions.
Differential Revision: https://reviews.llvm.org/D119150
This reverts commit b4b97ec813.
As discussed in post-commit feedback at:
https://reviews.llvm.org/D118376
...there's a stage 2 failure on a Mac running a clang-refactor tool test.
This is an intentionally limited/different form of D90113.
That patch bravely tries to generalize folds where we pull
a binop into the arms of a select:
N0 + (Cond ? 0 : FVal) --> Cond ? N0 : (N0 + FVal)
...but it is not universally profitable.
This is the inverse of IR canonicalization as discussed in
D113442.
We know that this transform is not entirely profitable even
within x86, so we only handle x86 vector fadd/fsub as a 1st
step. The intent is to prevent AVX512 regressions as mentioned
in D113442.
The plan is to port this to DAGCombiner (so it will eventually
look more like D90113) and add more types/cases in pieces with
many more tests to verify that we are seeing improvements.
Differential Revision: https://reviews.llvm.org/D118644
None of the external users actual touch these (they're purely used internally down the recursive call) - its trivial to add another wrapper if anything ever does want to track known elements.
We already call SimplifyDemandedVectorElts using whether each vector mask element is zero/nonzero, this just extends this to also try SimplifyDemandedBits using the demanded bits mask generated from the nonzero elements.
This also requires an additional TargetLowering::SimplifyDemandedBits DemandedBits/DemandedElts wrapper.
Limit this to SSE41 - AVX1 targets to avoid UNPCKL(PSHUFB,PSHUFB), pre-SSE41 we don't have PACKUSDW/BLENDW and with AVX2 we can perform this as PERMQ(PSHUFB()).
Allows pow2 mask tests to avoid an unnecessary constant load.
Noticed while investigating how to extend MatchVectorAllZeroTest to support more allof/anyof patterns.
Without AVX512 (which can efficiently extend/truncate to vXi16/vXi32), unpacking/packing to vXi16 is more efficient that relying on the (uops-heavy) PBLENDV shift expansion
extract_vec_elt (load X), C --> scalar load (X+C)
As noted in the comment, DAGCombiner has this fold -- and the code in this
patch is adapted from DAGCombiner::scalarizeExtractedVectorLoad() -- but
x86 should benefit even if the loaded vector has other uses as long as we
apply some other x86-specific conditions. The motivating example from #50310
is shown in vec_int_to_fp.ll.
Fixes#50310
Differential Revision: https://reviews.llvm.org/D118376
Previous folds by combineSetCCMOVMSK might have converted these to CMP when changing the bitwidth, and the CMP->SUB fold might not have happened (or will happen)
rG9103b73fe052 was assuming that we could OR/AND with the source vector, but that will fail on float/double vectors without bitcasting - it also missed the case that any_of checks might be testing less than all the source elements
InstCombine performs this more generally with SimplifyUsingDistributiveLaws, but we don't need anything that complex here - this is mainly to fix up cases where logic ops get created late on during lowering, often in conjunction with sext/zext ops for type legalization.
https://alive2.llvm.org/ce/z/gGpY5v
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
There's no further reasons to limit this to cmpeq-with-zero, the outstanding regressions with lowering to PTEST have now been addressed
Improves codegen for Issue #53379
vXi16 patterns allof(cmp()) reduction patterns will have to be pack the comparison results to vXi8 to use PMOVMSKB.
If we're reducing cmpeq(), then we can compare the vXi8 halves directly - similar to what we already do for vXi64 -> vXi32 for cases without PCMPEQQ.
As suggested on PR53379, for all-of icmp-eq patterns, we can use ptest(sub(x,y)) on SSE41+ targets
This is a generalization of the existing allof(cmpeq(x,0)) -> ptest(x) pattern
We can probably extend this further, in particularly to handle 256-bit cases on pre-AVX2 targets, but this part of the generalization is pretty trivial
Fixes Issue #53379
Pulled out of D106237, this folds truncstore(extend(x)) back to store(x)
if the original store was legal. This can come up due to the order we
fold nodes. A fold from X86 needs to be adjusted to prevent infinite
loops, to have it pick the operand of a trunc more directly.
Differential Revision: https://reviews.llvm.org/D117901
AVX512 doesn't provide a ADDSUB instruction, but if we've built this from a build vector of scalar fsub/fadd elements we can still lower to blend(fsub,fadd)
This allows us to match shuffle<1,3,5,7,9,11,13,15> style shift+trunc/pack patterns as well as the existing shuffle<0,2,4,6,8,10,12,14> style shuffle trunc/pack patterns
In the future, interleaving patterns might benefit from an even more general implementation for higher strides
Allows shuffle combining through per-element shift nodes
This exposed a number of issues with shuffle combining with target intrinsics that are lowered to nodes later during legalization - in particular shuffle combining and SimplifyDemandedVectorElts were being called after canonicalizeShuffleWithBinOps, meaning that shuffles didn't have a chance to be combined away before the shuffle(binop(x,y)) -> binop(shuffle(x),shuffle(y)) fold.
Partial element rotate patterns (e.g. for element insertion on Issue #53124) were being split if every lane wasn't crossing, but really there's a good repeated mask hiding in there.
This is a suggested follow-up to D116765.
This removes a clear of the register operand, so it is better
for code size, but it does potentially create a false register
dependency on surrounding code. If that is a problem, it should
be solvable using dependency-breaking code that is used for
other instructions.
Differential Revision: https://reviews.llvm.org/D116804
To enable this on all targets there's still a number of regressions due to getSplatValue/getTargetVShiftNode but these don't really affect pre-AVX targets.
This is very similar to the existing ROTL/ROTR support for scalar shifts in LowerRotate, I think as time goes on we should be able to share much of this code in helpers between Funnel Shift + Rotation lowering.
select (X != 0), -1, Y --> 0 - X; or (sbb), Y
select (X != 0), Y, -1 --> X - 1; or (sbb), Y
We already had these x86 carry-flag transforms, but one was over-specified to
handle a "0" select arm only. That's just a special-case of the more general
pattern (the 'or' will be deleted if Y is zero).
This is part of solving #53006, but it misses that example because some other
combine has already converted that exact pattern into math ops.
Differential Revision: https://reviews.llvm.org/D116765
For below C code, we can use VNNI to combine the mul and add operation.
int usdot_prod_qi(unsigned char *restrict a, char *restrict b, int c,
int n) {
int i;
for (i = 0; i < 32; i++) {
c += ((int)a[i] * (int)b[i]);
}
return c;
}
We didn't support the combine acoss basic block in this patch.
Differential Revision: https://reviews.llvm.org/D116039
SelectionDAG::getConstant creates an APInt internally anyway, and getLowBitsSet helps assert for legal bitwidths. Plus it silences static analyzer out-of-bounds shift warnings.
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.
Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.
Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.
Reviewed By: lebedev.ri, RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D116522
Fix issue in TargetLowering::expandROT where we only attempt to flip a rotation if the other direction has better support - this matches TargetLowering::expandFunnelShift
This allows us to enable ISD::ROTR lowering on SSE targets, which particularly simplifies/improves codegen for splat amount and AVX2 per-element shifts.
Move shift-by-constant handling and move it into its only user (VSHIFT intrinsics lowering).
This is some prep-work for getTargetVShiftNode to no longer take a scalar shift amount - we're introducing temporary ISD::EXTRACT_VECTOR_ELT nodes via SelectionDAG::getSplatValue to accommodate this which can cause various issues, including unnecessary scalarization and xmm->gpr->xmm transfers, and causes problems for 32-bit codegen if we fail to remove an (illegal) i64 scalar extracted from a (legal) vXi64 vector.
Pull out the "rotl(x,y) --> (unpack(x,x) << zext(splat(y % bw))) >> bw" special case from vXi8 lowering so we can reuse it for vXi32 types as well.
There's still some regressions with vXi16 to handle before this becomes entirely general.
It also allows us to remove the now unnecessary hack for handling amount-modulo before splatting.
Instead of bailing and using the default expansion, we can more efficiently use the shl(unpack(x,x),unpack(amt,zero)) pattern for vXi8 rotl, as we'll then use vXi16 fast PMULLW (or PSLLVW).
This required some minor changes to improve constant folding during unpack shuffle creation and convertShiftLeftToScale to support constants that have already been lowered to constant pools.
We currently rely on generic promotion to vXi16/vXi32 types for rotation lowering on various AVX512 targets.
We can more efficiently perform this by making use of the shl(unpack(x,x),amt) style pattern that we already use for vXi8 rotation by splat amounts, either by widening to a larger vector type or unpacking lo/hi halves of the subvectors so we can access whatever vXi16/vXi32 per-element shifts are supported.
This uncovered an issue in the supportedVectorShiftWithImm/supportedVectorVarShift legality checkers which was using hasAVX512() instead of useAVX512Regs() to detect support for 512-bit vector shifts.
NOTE: I'm actually hoping to eventually reuse this code for shl(unpack(y,x),amt) funnel shift lowering (vXi8 and wider), but initially I just want to ensure we have efficient ISD::ROTL lowering for all targets.
Differential Revision: https://reviews.llvm.org/D115180
If either operand has an element with allbits set, then we don't need the equivalent element from the other operand, as allbits are guaranteed to be set.
We have reasonable fast sqrt and accurate rsqrt for half type due to the
limited fractions. So neither do we need multi steps refinement for
rsqrt nor replace sqrt by rsqrt.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114844
combinePMULH currently only truncates vXi32/vXi64 multiplies to PMULHW/PMULUW if the source operands are SEXT/ZEXT instructions for a 'free' truncation.
But we can generalize this to any source operand with sufficient leading sign/zero bits that would allow PACKS/PACKUS to be used as a 'cheap' truncation.
This helps us avoid the wider multiplies, in exchange for truncation on both source operands instead of the result.
Differential Revision: https://reviews.llvm.org/D113371
We have transform that tries turn a pmovmskb into movmskps/pd or
movmskps to movmskpd. This transform isn't valid if the truncate
discarded bits that might be set by the original movmsk.
We could fix this by inserting an AND after the new movmsk to discard
the equivalent of the truncated bits, but I've left that for later
patch.
Fixes PR52567.
Differential Revision: https://reviews.llvm.org/D114306
For tagged-globals, we only need to disable relaxation for globals that
we actually tag. With this patch function pointer relocations, which
we do not instrument, can be relaxed.
This patch also makes tagged-globals work properly with LTO, as
-Wa,-mrelax-relocations=no doesn't work with LTO.
Reviewed By: pcc
Differential Revision: https://reviews.llvm.org/D113220
Check for a hidden ISD::ROTR (rotl(sub(0,x))) - vXi8 lowering can handle both (its always beneficial for splats, but otherwise only if we have VPTERNLOG).
We currently hit infinite loops in TargetLowering::expandROT if we set ISD::ROTR to custom, which needs addressing before we extend this much further.
This is aligned with GCC's behavior.
Also, alias `-mno-fp-ret-in-387` to `-mno-x87`, by which we can fix pr51498.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112143
AMD64 ABI mandates caller to specify the number of used SSE registers
when passing variable arguments.
GCC also provides option -mskip-rax-setup to skip the setup of rax when
SSE is disabled. This helps to reduce the code size, see pr23258.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D112413
If we're rotating vXi8 by a splatted amount, then unpack to vXi16, perform a SHL by the (extended) scalar, and then pack the results.
This is a vector equivalent to the "rotl(x,y) -> (((aext(x) << bw) | zext(x)) << (y & (bw-1))) >> bw" style expansion we do for scalars in LowerFunnelShift.
I think we can usefully use this for other vector types and vector funnel-shifts in the future, depending how we expand beyond D113192 for matching rotations/funnel-shifts for more type/ops.
Update splitVectorIntUnary/splitVectorIntBinary to use this internally, after some operand type sanity checks.
Avoid code duplication and makes it easier to split other vector instruction forms in the future.
The first try at this patch ( bf5748a1af ) was reverted ( 5be64d4164 )
because it could crash. The cause of that problem was failing to account
for the optional peek-through-bitcast in the enclosing function.
This version of the patch adds a clause to avoid the fold in case of
bitcasts because it is unlikely to be profitable in that scenario.
A test case based on https://llvm.org/PR52504 was added to make sure
we don't have that problem again.
Original commit message:
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
Differential Revision: https://reviews.llvm.org/D113603
If an operand is a bitcasted or widended constant, try to more aggressively create broadcastable constants for folding, which in particular helps non-VLX modes.
I've refactored getAVX512Node so that VLX targets can make better use of this as well.
NOTE: In the future, I think we should consider removing the broadcast of constant data from DAG entirely and move this to either X86InstrInfo::foldMemoryOperand or a new pass - AVX1/2 targets has similar problems with missed (whole vector) folds that need to be improved as well.
Differential Revision: https://reviews.llvm.org/D113845
This casued assertion failures:
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp:9446:
void llvm::SelectionDAG::ReplaceAllUsesWith(llvm::SDNode *, llvm::SDNode *):
Assertion `(!From->hasAnyUseOfValue(i) || From->getValueType(i) == To->getValueType(i))
&& "Cannot use this version of ReplaceAllUsesWith!"' failed.
See comment on the code review.
(Had to update some expectations in test/CodeGen/X86/vselect-zero.ll
manually due to other changes having landed after the reverted one.)
> and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
>
> This avoids the -1 constant vector in favor of an arithmetic shift
> instruction if it exists (the ISA is still not complete after all
> these years...).
>
> We catch this pattern late in combining by matching PCMPGT, so it
> should not interfere with more general folds.
>
> Differential Revision: https://reviews.llvm.org/D113603
This reverts commit bf5748a1af.
This helper provides a more complete approach for lowering to X86ISD::PACKSS/PACKUS nodes - testing for existing suitable sign/zero extension before recreating it.
It also optionally packs the upper half instead of the lower half.
Similar to what we've done for other ops, this patch widens VPTERNLOG to a 512-bit op for non-VLX targets.
Fixes regressions in D113192
Differential Revision: https://reviews.llvm.org/D113827
For AVX512 targets without VLX, we have to widen 128/256-bit vectors to 512-bits to use some specific AVX512 instructions (or some other instructions with predicates etc.).
I've pulled out the widening code from LowerFunnelShift into the helper function, so we can convert some other widening patterns in the future.
Add support for v32i8/v64i8 converting shift-by-constant to multiply-by-constant. This helps us avoid the generic vXi8 shift lowering, and a lot of VPBLENDVB ops which can be particularly slow.
We also needed to reorder a few shift lowering patterns to prevent regressions, particularly for XOP+AVX2 (Excavator) targets (which can split to fast v16i8 shifts) and AVX512-BWI targets (which prefers to extend to fast v32i16 shifts).
Currently we only constant fold target shuffles if any of the sources has one use, or it would remove a variable shuffle mask - the aim being to avoid constant pool bloat.
This patch proposes we should constant fold by default and only limit this if optsize is enabled - I've added a basic test for this in vector-mul.ll (the pmuludq case is by far the most common), I can add other specific test cases if people need them.
This should permit further constant folding, break some instruction dependencies and help reduce shuffle port pressure.
Differential Revision: https://reviews.llvm.org/D113748
and (pcmpgt X, -1), Y --> pandn (vsrai X, BitWidth-1), Y
This avoids the -1 constant vector in favor of an arithmetic shift
instruction if it exists (the ISA is still not complete after all
these years...).
We catch this pattern late in combining by matching PCMPGT, so it
should not interfere with more general folds.
Differential Revision: https://reviews.llvm.org/D113603
This fixes the crash due to lacking VZEXT_MOVL support with i16.
Reviewed By: LuoYuanke, RKSimon
Differential Revision: https://reviews.llvm.org/D113661
combineMulToPMADDWD is currently limited to legal types, but there's no reason why we can't handle any larger type that the existing SplitOpsAndApply code can use to split to legal X86ISD::VPMADDWD ops.
This also exposed a missed opportunity for pre-SSE41 targets to handle SEXT ops from types smaller than vXi16 - without PMOVSX instructions these will always be expanded to unpack+shifts, so we can cheat and convert this into a ZEXT(SEXT()) sequence to make it a valid PMADDWD op.
Differential Revision: https://reviews.llvm.org/D110995
Use ComputeMinSignedBits() to ensure the mul source operands at least sign-extend down from the bottom 16 bits.
This will make it easier if/when we try to support handling of source types larger than 32-bits.
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.
I've included some usage, its not exhaustive, just the more obvious cases where the intention is obvious.
Differential Revision: https://reviews.llvm.org/D113396
We have several places where we need to extract the raw bits data from a BUILD_VECTOR node, so consolidate this to a single helper function that handles Undefs and Integer/FP constants, including implicit truncation.
This should make it easier to extend D113202 to handle more constant folding of bitcasted constant data.
Differential Revision: https://reviews.llvm.org/D113351
For v8i16 shuffle patterns that are lowered with AND+PACKUS, check to see if the sources are from a 256-bit vector and perform the masking using BLENDW at the 256-bit level.
With the test changes we can see more examples of duplicate XMM/YMM zero vectors (PR26018) :(
The check for whether a zero extension was needed was subtly wrong and
saw a value that was already 64 bits, so did not extend.
Fixes PR52357.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112860
The changes in D80163 defered the assignment of MachineMemOperand (MMO)
until the X86ExpandPseudo pass. This will result in crash due to prolog
insert point been sunk across the pseudo instruction VASTART_SAVE_XMM_REGS.
Moving the assignment to the creation of the node can avoid the problem.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D112859
This is part of rG1cfecf4fc427 that was reverted to fix PR51226 - concating the broadcasts is OK, its the splatted loads that crash (we're not detecting extloads). I'm still creating a reduced test case so haven't added the load handling again yet.
The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.
This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.
There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this often creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?
Noticed while investigating the quality of interleaved load/store codegen.
Differential Revision: https://reviews.llvm.org/D111960
As mentioned on D108539, when the gather indices are smaller than the pointer size, they are sign-extended BEFORE scale is applied, making the general fold unsafe.
If the index have sufficient sign-bits then folding the scale could be safe - I'll investigate this.
If the index operand for a gather/scatter intrinsic is being scaled (self-addition or a shl-by-immediate) then we may be able to fold that scaling into the intrinsic scale immediate value instead.
Fixes PR13310.
Differential Revision: https://reviews.llvm.org/D108539
In the instruction encoding, the passthru register is always
tied to the destination register. The CPU scheduler has to wait
for the last writer of this register to finish executing before
the gather can start. This is true even if the initial mask is
all ones so that the passthru will never be used.
By explicitly zeroing the register we can break the false
dependency. The zero idiom is executed completing by the
register renamer and so is immedately considered ready.
Authored by Craig.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D112505
As noted in D112464, a pre-AVX target may not be able to fold an
under-aligned vector load into another op, so we shouldn't report
that as a load folding candidate. I only found one caller where
this would make a difference -- combineCommutableSHUFP() -- so
that's where I added a test to show the (minor) regression.
Differential Revision: https://reviews.llvm.org/D112545
This can avoid a vector add and a constant pool load. Or an explicit broadcast in case of non-constant.
Also reverse the transform any time we encounter a constant index addend that can't be moved to base. In that case pull the constant from base into the index. This reduces code size needed for the displacement since we needed the index add anyway. Limit this to scale of 1 to avoid divisibility and wrap issues.
Authored by Craig.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D111595
This is a small extension of D112095 to avoid another regression
seen with D112085.
In this case, we allow the same conversion from usubsat to ALU
ops if the target supports vpternlog.
That pattern will get converted later in X86DAGToDAGISel::tryVPTERNLOG().
This seems better than putting a magic immediate constant directly in
this code to create the exact vpternlog that we need. It's possible that
there are other special-cases along these lines, so we should try to
keep all of the vpternlog magic in one place.
Differential Revision: https://reviews.llvm.org/D112138
usubsat X, SMIN --> (X ^ SMIN) & (X s>> BW-1)
This would be a regression with D112085 where we combine to
usubsat more aggressively, so avoid that by matching the
special-case where we are subtracting SMIN (signmask):
https://alive2.llvm.org/ce/z/4_3gBD
Differential Revision: https://reviews.llvm.org/D112095
Inspired by D111968, provide a isNegatedPowerOf2() wrapper instead of obfuscating code with (-Value).isPowerOf2() patterns, which I'm sure are likely avenues for typos.....
Differential Revision: https://reviews.llvm.org/D111998
If we're using an ashr to sign-extend the entire upper 16 bits of the i32 element, then we can replace with a lshr. The sign bit will be correctly shifted for PMADDWD's implicit sign-extension and the upper 16 bits are zero so the upper i16 sext-multiply is guaranteed to be zero.
The lshr also has a better chance of folding with shuffles etc.
This patch detects the absolute value pattern on the RHS of a
subtract. If we find it we swap the CMOV true/false values and
replace the subtract with an ADD.
There may be a more generic way to do this, but I'm not sure.
Targets that don't have legal or custom ISD::ABS use a generic
expand in DAG combiner already when it sees (neg (abs(x))). I
haven't checked what happens if the neg is a more general subtract.
Fixes PR50991 for X86.
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D111858
CMOVGE reads SF and OF. CMOVNS only reads SF. This matches with
other recent changes to use a single flag where possible. It also
matches gcc codegen.
I believe this technically changes whether the conditioanl move happens
on INT_MIN, but for INT_MIN both registers are the same so it doesn't
matter.
Differential Revision: https://reviews.llvm.org/D111826
As noted in https://github.com/halide/Halide/pull/6302,
we hilariously fail to match PAVG if we even as much
as look at it the wrong way.
In this particular case, the problem stems from the fact that
`PAVG` root (def) is a `trunc`, and leafs (uses) are `zext`'s,
and InstCombine really loves to get rid of both of these,
for example replace them with a bit mask. So we may not have
said `zext`.
Instead of checking for that + type match,
i think we should rely on the actual active type,
as per the knownbits.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D111571
The select(pshufb,pshufb) -> or(pshufb,pshufb) fold uses getConstVector to create the refreshed pshufb masks, which treats all negative indices as undef.
PR52122 shows that if we were selecting an element that the PSHUFB has set to zero we must set it back to 0x80 when we recreate the PSHUFB mask and not just leave it as SM_SentinelZero
Without popcnt we had a special case for using the parity flag from a single test i8 test instruction if only bits 7:0 could be non-zero. That special case is still useful when we have popcnt.
To reach this special case, we enable custom lowering of parity for i16/i32/i64 even when popcnt is enabled. The check for POPCNT being enabled is now after the special case in LowerPARITY.
Fixes PR52093
Differential Revision: https://reviews.llvm.org/D111249
The X86 backend only needs to know whether structure return is via an
sret pointer. This removes the categorization enumeration and
adjusts, templatizes and renames the related functions.
Differential Revision: https://reviews.llvm.org/D109966
Noticed while investigating the regressions in D110995 - if the RHS element is already zero, then we don't need the corresponding LHS element.
Technically we could also recheck RHS once we have LHS's known zeros, but I haven't seen any missed opportunities from that yet.
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.
Differential Revision: https://reviews.llvm.org/D110807
PR52040 identified several issues with the HOP(HOP'(X,X),HOP'(Y,Y)) -> HOP(PERMUTE(HOP'(X,Y)),PERMUTE(HOP'(X,Y)) slow-HOP fold.
Not only was there a copy+paste typo when accessing the inner HOP operands, but the (unnecessary) ReplaceAllUsesOfValueWith call was missing one use checks.
Now that we have better shuffle combines of HOPs we can just return a new HOP() sequence and not use ReplaceAllUsesOfValueWith at all - this actually improved pair_sum_v8i32_v4i32 codegen as it kicks off further shuffle combines.
X86's decomposeMulByConstant never permits mul decomposition to shift+add/sub if the vector multiply is legal.
Unfortunately this isn't great for SSE41+ targets which have PMULLD for vXi32 multiplies, but is often quite slow. This patch proposes to allow decomposition if the target has the SlowPMULLD flag (i.e. Silvermont). We also always decompose legal vXi64 multiplies - even latest IceLake has really poor latencies for PMULLQ.
Differential Revision: https://reviews.llvm.org/D110588
On Windows, i128 arguments are passed as indirect arguments, and
they are returned in xmm0.
This is mostly fixed up by `WinX86_64ABIInfo::classify` in Clang, making
the IR functions return v2i64 instead of i128, and making the arguments
indirect. However for cases where libcalls are generated in the target
lowering, the lowering uses the default x86_64 calling convention for
i128, where they are passed/returned as a register pair.
Add custom lowering logic, similar to the existing logic for i128
div/mod (added in 4a406d32e9),
manually making the libcall (while overriding the return type to
v2i64 or passing the arguments as pointers to arguments on the stack).
X86CallingConv.td doesn't seem to handle i128 at all, otherwise
the windows specific behaviours would ideally be implemented as
overrides there, in generic code, handling these cases automatically.
This fixes https://bugs.llvm.org/show_bug.cgi?id=48940.
Differential Revision: https://reviews.llvm.org/D110413
This intention of this code turns out to be superfluous as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies.
Fixes PR51974
When AVX512FP16 is enabled, FROUND(f16) cannot be dealt with
TypeLegalize, and no libcall in libm is ready for fround(f16) now.
FROUNDEVEN(f16) has related instruction in AVX512FP16.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D110312
The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.
Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. its zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements.
There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.....
If we are multiplying by a sign-extended vXi32 constant, then we can mask off the upper 16 bits to allow folding to PMADDWD and make use of its implicit sign-extension from i16
We already do this on SSE41 targets where we have sext/zext instructions, now that combineShiftToPMULH handles SSE2 targets, we can enable this here as well.
With improved shuffle combines (in particular canonicalizeShuffleWithBinOps), we can now usefully perform this on any SSE2+ target.
We should be able to remove this entirely and just use DAGCombiner's combineShiftToMULH if we can someday get it to support illegal (pre-widened) types.
As suggested on D108522, if we're sign extending a v4i16 source before multiplying as a v4i32, then we can replace that with a zero extension and rely on the implicit sign-extension of PMADDWD.
This is motivated by the examples and discussion in:
https://llvm.org/PR51245
...and related bugs.
By using vector compares and vector logic, we can convert 2 'set'
instructions into 1 'movd' or 'movmsk' and generally improve
throughput/reduce instructions.
Unfortunately, we don't have a complete vector compare ISA before
AVX, so I left SSE-only out of this patch. Ie, we'd need extra logic
ops to simulate the missing predicates for SSE 'cmpp*', so it's not
as clearly a win.
Differential Revision: https://reviews.llvm.org/D110342
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
This patch is to support transform something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
to _mm512_fmadd_pch(a, b, acc).
Differential Revision: https://reviews.llvm.org/D109953
This change adds the ASan intrinsic to the list whihc are setting hasCopyImplyingStackAdjustment.
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D110012
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.
Differential Revision: https://reviews.llvm.org/D109993
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.
Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.
It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be preset in several elements,
not just the 0'th one.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108411
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108382
We can use `OR` instead of `BLEND` if either the element we are not picking is zero (or masked away);
or the element we are picking overwhelms (e.g. it's all-ones) whatever the element we are not picking:
https://alive2.llvm.org/ce/z/RKejao
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D109726
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.
Soft deprecrate isNullValue/isAllOnesValue and update in tree
callers. This matches the changes to the APInt interface from
D109483.
Reviewed By: lattner
Differential Revision: https://reviews.llvm.org/D109535
This library function only exists in compiler-rt not libgcc. So
this would fail to link unless we were linking with compiler-rt.
This is consistent with the recent removal of calls to mulodi4 on
32-bit targets like D108928.
I suppose maybe we could keep the libcalls for platforms like
Darwin that use compiler-rt exclusively?
Reviewed By: nickdesaulniers, MaskRay
Differential Revision: https://reviews.llvm.org/D109385
This renames the primary methods for creating a zero value to `getZero`
instead of `getNullValue` and renames predicates like `isAllOnesValue`
to simply `isAllOnes`. This achieves two things:
1) This starts standardizing predicates across the LLVM codebase,
following (in this case) ConstantInt. The word "Value" doesn't
convey anything of merit, and is missing in some of the other things.
2) Calling an integer "null" doesn't make any sense. The original sin
here is mine and I've regretted it for years. This moves us to calling
it "zero" instead, which is correct!
APInt is widely used and I don't think anyone is keen to take massive source
breakage on anything so core, at least not all in one go. As such, this
doesn't actually delete any entrypoints, it "soft deprecates" them with a
comment.
Included in this patch are changes to a bunch of the codebase, but there are
more. We should normalize SelectionDAG and other APIs as well, which would
make the API change more mechanical.
Differential Revision: https://reviews.llvm.org/D109483
integer 0/1 for the operand of bundle "clang.arc.attachedcall"
https://reviews.llvm.org/D102996 changes the operand of bundle
"clang.arc.attachedcall". This patch makes changes to llvm that are
needed to handle the new IR.
This should make it easier to understand what the IR is doing and also
simplify some of the passes as they no longer have to translate the
integer values to the runtime functions.
Differential Revision: https://reviews.llvm.org/D103000
Similar to D108842, D108844, and D108926.
__has_builtin(builtin_mul_overflow) returns true for 32b x86 targets,
but Clang is deferring to compiler RT when encountering long long types.
This breaks ARCH=i386 + CONFIG_BLK_DEV_NBD=y builds of the Linux kernel
that are using builtin_mul_overflow with these types for these targets.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
This will still need to be worked around in the Linux kernel in order to
continue to support these builds of the Linux kernel for this
target with older releases of clang.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
Link: https://bugs.llvm.org/show_bug.cgi?id=35922
Link: https://github.com/ClangBuiltLinux/linux/issues/1438
Reviewed By: lebedev.ri, RKSimon
Differential Revision: https://reviews.llvm.org/D108928
PMADDWD(v8i16 x, v8i16 y) == (v4i32) { (int)x[0]*y[0] + (int)x[1]*y[1], ..., (int)x[6]*y[6] + (int)x[7]*y[7] }
Currently combineMulToPMADDWD only folds cases where the upper 17 bits of both vXi32 inputs are known zero (i.e. the first half is positive and the second half of the pair is zero in each 2xi16 pair), this can be relaxed to only require one zero-extended input if the other input has at least 17 sign bits.
That way the sign of the result is still preserved, and the second half is still zero.
Noticed while investigating PR47437.
Differential Revision: https://reviews.llvm.org/D108522
Please refer to
https://lists.llvm.org/pipermail/llvm-dev/2021-September/152440.html
(and that whole thread.)
TLDR: the original patch had no prior RFC, yet it had some changes that
really need a proper RFC discussion. It won't be productive to discuss
such an RFC, once it's actually posted, while said patch is already
committed, because that introduces bias towards already-committed stuff,
and the tree is potentially in broken state meanwhile.
While the end result of discussion may lead back to the current design,
it may also not lead to the current design.
Therefore i take it upon myself
to revert the tree back to last known good state.
This reverts commit 4c4093e6e3.
This reverts commit 0a2b1ba33a.
This reverts commit d9873711cb.
This reverts commit 791006fb8c.
This reverts commit c22b64ef66.
This reverts commit 72ebcd3198.
This reverts commit 5fa6039a5f.
This reverts commit 9efda541bf.
This reverts commit 94d3ff09cf.
This fixes another reproducer from https://bugs.llvm.org/show_bug.cgi?id=51615
And again, the fix lies not in the code added in D105390
In this case, we completely don't check that the "broadcast-from-mem" we create
can actually fold the load. In this case, it's operand was not a load at all:
```
Combining: t16: v8i32 = vector_shuffle<0,u,u,u,0,u,u,u> t14, undef:v8i32
Creating new node: t29: i32 = undef
RepeatLoad:
t8: i32 = truncate t7
t7: i64 = extract_vector_elt t5, Constant:i64<0>
t5: v2i64,ch = load<(load (s128) from %ir.arg)> t0, t2, undef:i64
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t1: i64 = Register %0
t4: i64 = undef
t3: i64 = Constant<0>
Combining: t15: v8i32 = undef
```
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D108821
x86_fp80 format allows values that do not fit any of IEEE-754 category.
Previously they were recognized by intrinsic __builtin_isnan as NaNs.
Now this intrinsic is implemented using instruction FXAM, which
distinguish between NaNs and unsupported values. It can make some
programs behave differently.
As a solution, this fix changes lowering of the intrinsic. If floating
point exceptions are ignored, llvm.isnan is lowered into unordered
comparison, as __buildtin_isnan was implemented earlier. In strictfp
functions the intrinsic is lowered using FXAM, which does not raise
exceptions even for signaling NaN, as required by IEEE-754 and C
standards.
Differential Revision: https://reviews.llvm.org/D108037
There are no paths into LowerFormalParms that have already specified
the sret register. We always materialize a virtual and then assign it
to the physical reg at the point of the return.
Differential Revision: https://reviews.llvm.org/D108762
Even though https://bugs.llvm.org/show_bug.cgi?id=51615
appears to be introduced by D105390, the fix lies here.
We can not replace a wide volatile load with a broadcast-from-memory,
because that would narrow the load, which isn't legal for volatiles.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D108757
In-register structure returns are not special, and handled by lowering
to multiple-value tuples. We can tail-call from non-sret fns to
structure-returning functions, except on i686 where the sret pointer
is callee-pop.
Differential Revision: https://reviews.llvm.org/D105807
We don't have any vXi8 shift instructions (other than on XOP which is handled separately), so replace the shl(x,1) -> add(x,x) fold with shl(x,1) -> add(freeze(x),freeze(x)) to avoid the undef issues identified in PR50468.
Split off from D106675 as I'm still looking at whether we can fix the vXi16/i32/i64 issues with the D106679 alternative.
Differential Revision: https://reviews.llvm.org/D108139
Extend matchShuffleAsBlend to not only match against known in-place elements for BLEND shuffles, but use isElementEquivalent to determine if the shuffle mask's referenced element is the same as the in-place element.
This allows us to replace a number of insertps instructions with more general blendps instructions (better opportunities for commutation, concatenation etc.).
Before lowering shuffles, see if we can merge horizontal ops or canonicalize the shuffle mask to point to the same LHS/RHS of the HOps when an HOp's args are repeated.
This changes the lowering of saddsat and ssubsat so that instead of
using:
r,o = saddo x, y
c = setcc r < 0
s = c ? INTMAX : INTMIN
ret o ? s : r
into using asr and xor to materialize the INTMAX/INTMIN constants:
r,o = saddo x, y
s = ashr r, BW-1
x = xor s, INTMIN
ret o ? x : r
https://alive2.llvm.org/ce/z/TYufgD
This seems to reduce the instruction count in most testcases across most
architectures. X86 has some custom lowering added to compensate for
cases where it can increase instruction count.
Differential Revision: https://reviews.llvm.org/D105853
AttributeList::hasAttribute() is confusing, use clearer methods like
hasParamAttr()/hasRetAttr().
Add hasRetAttr() since it was missing from AttributeList.
The IR for pmuldq/pmuludq intrinsics uses a sext_inreg/zext_inreg
pattern on the inputs. Ideally we pattern match these away during
isel. It is possible for LICM or other middle end optimizations
to separate the extend from the mul. This prevents SelectionDAG
from removing it or depending on how the extend is lowered, we
may not be able to generate an AssertSExt/AssertZExt in the
mul basic block. This will prevent pmuldq/pmuludq from being
formed at all.
This patch teaches shouldSinkOperands to recognize this so
that CodeGenPrepare will clone the extend into the same basic
block as the mul.
Fixes PR51371.
Differential Revision: https://reviews.llvm.org/D107689
This isn't optimal, but prevents crashing when the libcall isn't
available. It just calculates the full product and makes sure the high bits
match the sign of the low half. Each of the pieces should go through their own
type legalization.
This can make D107420 unnecessary.
Needs tests, but I wanted to start discussion about D107420.
Reviewed By: FreddyYe
Differential Revision: https://reviews.llvm.org/D107581
As reported on PR51281, an internal fuzz test encountered an issue when extracting constant bits from a SUBV_BROADCAST node from a constant pool source larger than the broadcasted subvector width.
The getTargetConstantBitsFromNode was assuming that the Constant would the same size as the subvector, resulting in the incorrect packing of the per-element bits data.
This patch attempts to solve this by using the SUBV_BROADCAST node to determine the subvector width, and then ensuring we extract only the lowest bits from Constant of that subvector bitsize.
Differential Revision: https://reviews.llvm.org/D107158
This is recommit of the patch 16ff91ebcc,
reverted in 0c28a7c990 because it had
an error in call of getFastMathFlags (base type should be FPMathOperator
but not Instruction). The original commit message is duplicated below:
Clang has builtin function '__builtin_isnan', which implements C
library function 'isnan'. This function now is implemented entirely in
clang codegen, which expands the function into set of IR operations.
There are three mechanisms by which the expansion can be made.
* The most common mechanism is using an unordered comparison made by
instruction 'fcmp uno'. This simple solution is target-independent
and works well in most cases. It however is not suitable if floating
point exceptions are tracked. Corresponding IEEE 754 operation and C
function must never raise FP exception, even if the argument is a
signaling NaN. Compare instructions usually does not have such
property, they raise 'invalid' exception in such case. So this
mechanism is unsuitable when exception behavior is strict. In
particular it could result in unexpected trapping if argument is SNaN.
* Another solution was implemented in https://reviews.llvm.org/D95948.
It is used in the cases when raising FP exceptions by 'isnan' is not
allowed. This solution implements 'isnan' using integer operations.
It solves the problem of exceptions, but offers one solution for all
targets, however some can do the check in more efficient way.
* Solution implemented by https://reviews.llvm.org/D96568 introduced a
hook 'clang::TargetCodeGenInfo::testFPKind', which injects target
specific code into IR. Now only SystemZ implements this hook and it
generates a call to target specific intrinsic function.
Although these mechanisms allow to implement 'isnan' with enough
efficiency, expanding 'isnan' in clang has drawbacks:
* The operation 'isnan' is hidden behind generic integer operations or
target-specific intrinsics. It complicates analysis and can prevent
some optimizations.
* IR can be created by tools other than clang, in this case treatment
of 'isnan' has to be duplicated in that tool.
Another issue with the current implementation of 'isnan' comes from the
use of options '-ffast-math' or '-fno-honor-nans'. If such option is
specified, 'fcmp uno' may be optimized to 'false'. It is valid
optimization in general, but it results in 'isnan' always returning
'false'. For example, in some libc++ implementations the following code
returns 'false':
std::isnan(std::numeric_limits<float>::quiet_NaN())
The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
operands are never NaNs. This assumption however should not be applied
to the functions that check FP number properties, including 'isnan'. If
such function returns expected result instead of actually making
checks, it becomes useless in many cases. The option '-ffast-math' is
often used for performance critical code, as it can speed up execution
by the expense of manual treatment of corner cases. If 'isnan' returns
assumed result, a user cannot use it in the manual treatment of NaNs
and has to invent replacements, like making the check using integer
operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
which also expresses the opinion, that limitations imposed by
'-ffast-math' should be applied only to 'math' functions but not to
'tests'.
To overcome these drawbacks, this change introduces a new IR intrinsic
function 'llvm.isnan', which realizes the check as specified by IEEE-754
and C standards in target-agnostic way. During IR transformations it
does not undergo undesirable optimizations. It reaches instruction
selection, where is lowered in target-dependent way. The lowering can
vary depending on options like '-ffast-math' or '-ffp-model' so the
resulting code satisfies requested semantics.
Differential Revision: https://reviews.llvm.org/D104854
Having `NewMask` outside of an if and rebinding `BaseMask` `ArrayRef`
to it is confusing. Instead, just move the `Mask` vector higher up,
and change the code that earlier had no access to it but now does
to use `Mask` instead of `BaseMask`.
This has no other intentional changes.
This is a recommit of 35c0848b57,
that was reverted to simplify reversion of an earlier change.
IR typically creates INSERT_SUBVECTOR patterns as a widening of the subvector with undefs to pad to the destination size, followed by a shuffle for the actual insertion - SelectionDAGBuilder has to do something similar for shuffles when source/destination vectors are different sizes.
This combine attempts to recognize these patterns by looking for a shuffle of a subvector (from a CONCAT_VECTORS) that starts at a modulo of its size into an otherwise identity shuffle of the base vector.
This uncovered a couple of target-specific issues as we haven't often created INSERT_SUBVECTOR nodes in generic code - aarch64 could only handle insertions into the bottom of undefs (i.e. a vector widening), and x86-avx512 vXi1 insertion wasn't keeping track of undef elements in the base vector.
Fixes PR50053
Differential Revision: https://reviews.llvm.org/D107068
Having `NewMask` outside of an if and rebinding `BaseMask` `ArrayRef`
to it is confusing. Instead, just move the `Mask` vector higher up,
and change the code that earlier had no access to it but now does
to use `Mask` instead of `BaseMask`.
This has no other intentional changes.
I want to hoist `Mask` variable higher up,
but then it would clash with this one.
So let's rename this one first.
There are no other intentional changes here other than said rename.
Given a shuffle mask, if it is picking from an input that is splat
given the current granularity of the shuffle, then adjust the mask
to pick from the same lane of the input as the mask element is in.
This may result in a shuffle being simplified into a blend.
I believe this is correct given that the splat detection matches the one
just above the new code,
My basic thought is that we might be able to get less regressions
by handling multiple insertions of the same value into a vector
if we form broadcasts+blend here, as opposed to D105390,
but i have not really thought this through,
and did not try implementing it yet.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D107009
Clang has builtin function '__builtin_isnan', which implements C
library function 'isnan'. This function now is implemented entirely in
clang codegen, which expands the function into set of IR operations.
There are three mechanisms by which the expansion can be made.
* The most common mechanism is using an unordered comparison made by
instruction 'fcmp uno'. This simple solution is target-independent
and works well in most cases. It however is not suitable if floating
point exceptions are tracked. Corresponding IEEE 754 operation and C
function must never raise FP exception, even if the argument is a
signaling NaN. Compare instructions usually does not have such
property, they raise 'invalid' exception in such case. So this
mechanism is unsuitable when exception behavior is strict. In
particular it could result in unexpected trapping if argument is SNaN.
* Another solution was implemented in https://reviews.llvm.org/D95948.
It is used in the cases when raising FP exceptions by 'isnan' is not
allowed. This solution implements 'isnan' using integer operations.
It solves the problem of exceptions, but offers one solution for all
targets, however some can do the check in more efficient way.
* Solution implemented by https://reviews.llvm.org/D96568 introduced a
hook 'clang::TargetCodeGenInfo::testFPKind', which injects target
specific code into IR. Now only SystemZ implements this hook and it
generates a call to target specific intrinsic function.
Although these mechanisms allow to implement 'isnan' with enough
efficiency, expanding 'isnan' in clang has drawbacks:
* The operation 'isnan' is hidden behind generic integer operations or
target-specific intrinsics. It complicates analysis and can prevent
some optimizations.
* IR can be created by tools other than clang, in this case treatment
of 'isnan' has to be duplicated in that tool.
Another issue with the current implementation of 'isnan' comes from the
use of options '-ffast-math' or '-fno-honor-nans'. If such option is
specified, 'fcmp uno' may be optimized to 'false'. It is valid
optimization in general, but it results in 'isnan' always returning
'false'. For example, in some libc++ implementations the following code
returns 'false':
std::isnan(std::numeric_limits<float>::quiet_NaN())
The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
operands are never NaNs. This assumption however should not be applied
to the functions that check FP number properties, including 'isnan'. If
such function returns expected result instead of actually making
checks, it becomes useless in many cases. The option '-ffast-math' is
often used for performance critical code, as it can speed up execution
by the expense of manual treatment of corner cases. If 'isnan' returns
assumed result, a user cannot use it in the manual treatment of NaNs
and has to invent replacements, like making the check using integer
operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
which also expresses the opinion, that limitations imposed by
'-ffast-math' should be applied only to 'math' functions but not to
'tests'.
To overcome these drawbacks, this change introduces a new IR intrinsic
function 'llvm.isnan', which realizes the check as specified by IEEE-754
and C standards in target-agnostic way. During IR transformations it
does not undergo undesirable optimizations. It reaches instruction
selection, where is lowered in target-dependent way. The lowering can
vary depending on options like '-ffast-math' or '-ffp-model' so the
resulting code satisfies requested semantics.
Differential Revision: https://reviews.llvm.org/D104854
In this episode, we are trying to avoid an x86 micro-arch quirk where complex
(3 operand) LEA potentially costs significantly more than simple LEA. So we
simultaneously push and pull the math around the CMOV to balance the operations.
I looked at the debug spew during instruction selection and decided against
trying a later DAGToDAG transform -- it seems very difficult to match if the
trailing memops are already selected and managing the creation of extra
instructions at that level is always tricky.
Differential Revision: https://reviews.llvm.org/D106918