We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877.
llvm-svn: 351018
We can't represent this properly with vselect like we normally do. We also have to update the instruction definition to use a VK2WM mask instead of VK4WM to represent this.
Fixes another case from PR34877.
llvm-svn: 351017
This patch takes some of the code from D49837 to allow us to enable ISD::ABS support for all SSE vector types.
Differential Revision: https://reviews.llvm.org/D56544
llvm-svn: 350998
The 128-bit input produces 64 bits of output and fills the upper 64 bits with 0. The mask only applies to the lower elements. But we can't represent this with a vselect like we normally do.
This also avoids the need to have a special X86ISD::SELECT when avx512bw isn't enabled since vselect v8i16 isn't legal there.
Fixes another instruction for PR34877.
llvm-svn: 350994
We no longer need to extend mask scalars before bitcasting them to vXi1. This was only needed for the truncate intrinsics, and was really a bug in our lowering of them.
llvm-svn: 350991
We still use i8 for the load/store type, so we need to convert to/from i16 around the mask type.
By doing this we get an i8->i16 extload which we can then pattern match to a KMOVW if the access is aligned.
llvm-svn: 350989
We can't properly represent this with a vselect since the upper elements of the result are supposed to be zeroed regardless of the mask.
This also reuses the new nodes even when the result type fits in 128 bits if the input is q/d and the result is w/b since vselect w/b using k-register condition isn't legal without avx512bw. Currently we're doing this even when avx512bw is enabled, but I might change that.
This fixes some of PR34877.
llvm-svn: 350985
Previously, we limited this transform to cases where the
extraction into the build vector happens from vectors of
the same type as the build vector, but that's not required.
There's a slight potential regression seen in the AVX512
result for phadd -- we're using the 256-bit flavor of the
instruction now even though the 128-bit subset is sufficient.
The same problem could already be seen in the AVX2 result.
Follow-up patches will attempt to narrow that back down.
llvm-svn: 350928
We were lowering the last step extract_vector_elt to a bitcast+truncate. Change it to use an extract_vector_elt of index 0 instead. Add isel patterns to do the equivalent of what the bitcast would have done. Plus an isel pattern for an any_extend+extract to prevent some regressions.
Finally add a DAG combine to turn v1i1 scalar_to_vector+extract_vector_elt of 0 into an extract_subvector.
This fixes some of the regressions from r350800.
llvm-svn: 350918
This extends combineVSelectToShrunkBlend to be able to resimplify SHRUNKBLENDs that have already been created.
This should help some of the regressions from D56387.
Differential Revision: https://reviews.llvm.org/D56421
llvm-svn: 350875
Despite what the comment says, FCMP_UNE would be an OR, not an AND. In the lowering code the first branch created still goes to the original destination. The second branch was exchanged to go to where the subsequent unconditional branch went. This is different from what we do for FCMP_OEQ, where both branches that we create go to the original unconditional branch.
As far as I can tell, I think this means we don't need to exchange the branch target with the unconditional branch for FCMP_UNE at all.
Differential Revision: https://reviews.llvm.org/D56309
llvm-svn: 350873
When we use the partial-matching function on a 128-bit chunk, we must
account for the possibility that we've matched undef halves of the
original source vectors, so the outputs may need to be reset.
This should allow closing PR40243:
https://bugs.llvm.org/show_bug.cgi?id=40243
llvm-svn: 350830
This is a partial fix for:
https://bugs.llvm.org/show_bug.cgi?id=40243
...as seen in the integer test, we still need to correct the result when using the
existing (old) horizontal op matching function because it does not model the way
x86 256-bit horizontal ops return results (each 128-bit half is its own horizontal-op).
A potential follow-up change for that is discussed in the bug report - see also D56490.
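To make that per-lane behavior concrete, here is a small standalone AVX2 intrinsics sketch (purely illustrative, not part of this patch) showing that vphaddd only pairs elements within each 128-bit half:

    #include <immintrin.h>
    #include <cstdio>

    int main() {
      __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
      __m256i b = _mm256_setr_epi32(10, 11, 12, 13, 14, 15, 16, 17);
      // vphaddd: each 128-bit half is its own horizontal op.
      __m256i h = _mm256_hadd_epi32(a, b);
      int r[8];
      _mm256_storeu_si256((__m256i *)r, h);
      // Low lane:  a0+a1 a2+a3 b0+b1 b2+b3 -> 1 5 21 25
      // High lane: a4+a5 a6+a7 b4+b5 b6+b7 -> 9 13 29 33
      for (int i = 0; i < 8; ++i)
        printf("%d ", r[i]);
      return 0;
    }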
This generally duplicates a lot of the existing matching code, but we can't just remove
that without introducing regressions, so the existing code is renamed and used less often.
Follow-ups may try to reduce that overlap.
Differential Revision: https://reviews.llvm.org/D56450
llvm-svn: 350826
Found while trying to figure out why my second version of D56421 worked better than the first version. We weren't deleting the vselect in a timely fashion and that caused SimplifyDemandedBits to see an additional user.
The new version doesn't have this problem so this fix isn't needed there, but seemed like the right thing to do.
llvm-svn: 350781
When the result type is v2i64/v2f64 and the index element size is i32, the index vector has two unused elements making the type v4i32. The mask VT should match the number of memory accesses that will be made.
This is consistent with the isel patterns used for the target independent gather/scatter intrinsic.
llvm-svn: 350687
Summary: AVX512VBMI2 supports a funnel shift by immediate and a funnel shift by a variable vector.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56361
llvm-svn: 350498
This is especially helpful on targets without avx512bw since we don't have a good way to convert from v16i8/v32i8 to v16i1/v32i1 for the truncate anyway. If we're just going to convert it to a GPR we might as well use pmovmskb to accomplish both.
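As a rough illustration (a standalone intrinsics sketch, not the lowering code itself), pmovmskb already performs both the v16i8-to-mask conversion and the move to a GPR:

    #include <emmintrin.h>
    #include <cstdio>

    int main() {
      // e.g. the result of a byte compare: all-ones or all-zeros per element.
      __m128i v = _mm_setr_epi8(-1, 0, -1, 0, 0, 0, 0, -1,
                                0, 0, 0, 0, -1, 0, 0, 0);
      // pmovmskb gathers the sign bit of each byte into bits 0..15 of a GPR.
      int mask = _mm_movemask_epi8(v);
      printf("0x%04x\n", mask); // 0x1085
      return 0;
    }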
llvm-svn: 350480
The 1st try for this was at rL350369, but it caused IR-level diffs because
our cost models differentiate custom vs. legal/promote lowering. So that was
reverted at rL350373. The cost models were fixed independently at rL350403,
so this is effectively the same patch as last time.
Original commit message:
This would show up if we fix horizontal reductions to narrow as they go along,
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.
We need to do this late to not interfere with other pattern matching of larger
horizontal sequences.
We can extend this to integer ops in a follow-up patch.
Differential Revision: https://reviews.llvm.org/D56011
llvm-svn: 350421
This would show up if we fix horizontal reductions to narrow as they go along,
but it's an improvement for size and/or Jaguar (fast-hops) independent of that.
We need to do this late to not interfere with other pattern matching of larger
horizontal sequences.
We can extend this to integer ops in a follow-up patch.
Differential Revision: https://reviews.llvm.org/D56011
llvm-svn: 350369
As noted in PR39973 and D55558:
https://bugs.llvm.org/show_bug.cgi?id=39973
...this is a partial implementation of a fold that we do as an IR canonicalization in instcombine:
// extelt (binop X, Y), Index --> binop (extelt X, Index), (extelt Y, Index)
We want to have this in the DAG too because as we can see in some of the test diffs (reductions),
the pattern may not be visible in IR.
Given that this is already an IR canonicalization, any backend that would prefer a vector op over
a scalar op is expected to already have the reverse transform in DAG lowering (not sure if that's
a realistic expectation though). The transform is limited with a TLI hook because there's an
existing transform in CodeGenPrepare that tries to do the opposite transform.
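A tiny intrinsics sketch (illustrative only, assuming SSE4.1 for the extract) of the equivalence the fold relies on:

    #include <smmintrin.h>
    #include <cassert>

    int main() {
      __m128i x = _mm_setr_epi32(1, 2, 3, 4);
      __m128i y = _mm_setr_epi32(10, 20, 30, 40);
      // extelt (add X, Y), 2 == add (extelt X, 2), (extelt Y, 2)
      int vecThenExtract = _mm_extract_epi32(_mm_add_epi32(x, y), 2);
      int extractThenScalar = _mm_extract_epi32(x, 2) + _mm_extract_epi32(y, 2);
      assert(vecThenExtract == extractThenScalar); // both are 33
      return 0;
    }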
Differential Revision: https://reviews.llvm.org/D55722
llvm-svn: 350354
INC/DEC are pretty much the same as ADD/SUB except that they don't update the C flag.
This patch removes the special nodes and just pattern matches from ADD/SUB during isel if the C flag isn't being used.
I had to avoid selecting DEC if the result isn't used. This will become a SUB immediate which will be turned into a CMP later by optimizeCompareInstr. This led to the one test change where we use a CMP instead of a DEC for an overflow intrinsic since we only checked the flag.
This also exposed a hole in our RMW flag matching use of hasNoCarryFlagUses. Our root node for the match is a store and there's no guarantee that all the flag users have been selected yet. So hasNoCarryFlagUses needs to check copyToReg and machine opcodes, but it also needs to check for the pre-match SETCC, SETCC_CARRY, BRCOND, and CMOV opcodes.
Differential Revision: https://reviews.llvm.org/D55975
llvm-svn: 350245
Peek through shift modulo masks while matching double shift patterns.
I was hoping to delay this until I could remove the X86 code with generic funnel shift matching (PR40081) but this will do for now.
Differential Revision: https://reviews.llvm.org/D56199
llvm-svn: 350222
All of these use custom isel so we can pretty easily detect the differences in the custom code in X86ISelDAGToDAG. The ISD opcodes just need to express the desired semantics not the details of how they would be selected by isel. So unifying them lets us remove the special casing from lowering.
llvm-svn: 350206
These require a different X86ISD node to be created than i16/i32/i64. I guess no one wanted to add the special code for that except in LowerXALUO. But now LowerXALUO, LowerSELECT, and LowerBRCOND all use a common helper function so they all share the special code.
Unfortunately, there are no test changes because we seem to correct the miss in a DAG combine later. I did verify it manually using test cases from xmulo.ll
llvm-svn: 350205
This makes it easier to keep the LowerBRCOND and LowerSELECT code in sync with LowerXALUO so they always pick the same operation for overflowing instructions.
This is inspired by the helper functions used by ARM and AArch64 for the same purpose.
The test change is because LowerSELECT was not in sync with LowerXALUO with regard to INC/DEC for SADDO/SSUBO.
llvm-svn: 350198
This seems to be getting in the way more than it's helping. This does mean we stop scalarizing some cases, but I'm not convinced the scalarization was really better.
Some of the changes to vsel-cmp-load.ll are a regression but D56156 should fix it.
llvm-svn: 350159
This allows us to sign extend to v4i32 first. And then share that extension to implement the final steps to v4i64 using a pcmpgt and punpckl and punpckh.
We already do something similar for SIGN_EXTEND with -x86-experimental-vector-widening-legalization.
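Roughly, the trick looks like the following standalone SSE2 sketch (illustrative, not the actual lowering): the pcmpgt against zero materializes the sign mask once, and the unpacks interleave it with the low and high halves.

    #include <emmintrin.h>
    #include <cstdio>

    int main() {
      __m128i v = _mm_setr_epi32(-1, 2, -3, 4);
      // 0 > v gives all-ones where v is negative: the per-element sign mask.
      __m128i sign = _mm_cmpgt_epi32(_mm_setzero_si128(), v);
      // Interleave value and sign mask to form sign-extended 64-bit elements.
      __m128i lo = _mm_unpacklo_epi32(v, sign);
      __m128i hi = _mm_unpackhi_epi32(v, sign);
      long long r[4];
      _mm_storeu_si128((__m128i *)&r[0], lo);
      _mm_storeu_si128((__m128i *)&r[2], hi);
      printf("%lld %lld %lld %lld\n", r[0], r[1], r[2], r[3]); // -1 2 -3 4
      return 0;
    }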
llvm-svn: 350158
This was tricking us into making these operations and then letting them get scalarized later. But I can't prove that the scalarized version is actually better.
llvm-svn: 350141
Previously we emitted a multiply and some masking that was supposed to be matched to PMULUDQ, but the masking could sometimes be removed before we got a chance to match it. So instead just emit the PMULUDQ directly.
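For context, a standalone intrinsics sketch (illustrative only) of what PMULUDQ itself computes - the low 32 bits of each 64-bit element multiplied as unsigned values into a full 64-bit product:

    #include <emmintrin.h>
    #include <cstdio>

    int main() {
      // Only elements 0 and 2 (the low dword of each qword) participate.
      __m128i a = _mm_setr_epi32(-1, 123, 7, 456); // element 0 is 0xFFFFFFFF
      __m128i b = _mm_setr_epi32(2, 789, 3, 1011);
      __m128i p = _mm_mul_epu32(a, b); // pmuludq
      unsigned long long r[2];
      _mm_storeu_si128((__m128i *)r, p);
      printf("%llu %llu\n", r[0], r[1]); // 8589934590 21
      return 0;
    }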
Remove the DAG combine that was added when the ReplaceNodeResults code was originally added. Add a new DAG combine to avoid regressions in shrink_vmul.ll
Some of the shrink_vmul.ll test cases now pick PMULUDQ instead of PMADDWD/PMULLD, but I think this should be an improvement on most CPUs.
I think all of this can go away if/when we switch to -x86-experimental-vector-widening-legalization
llvm-svn: 350134
Create PMULDQ/PMULUDQ as long as the number of elements is a power of 2.
This seems to give some improvements in our ability to use SimplifyDemandedBits.
llvm-svn: 350084
Make each of the helper functions only return their comparison node and the condition code. Leave X86ISD::SETCC creation to the LowerSETCC function itself.
Looking into whether we can use this code directly in BRCOND and SELECT lowering instead of going through LowerSETCC which creates an X86ISD::SETCC node we need to look through.
llvm-svn: 350082
Only one of the 3 callers of LowerAndToBT need the SETCC node. Two of them have to look through it to find the operands they really need. Instead create it after the one call that needs it.
LowerAndToBT now returns both the BT node and the X86 specific condition code separately.
llvm-svn: 350081
This is an alternative to what I attempted in D56057.
GetDemandedBits is a special version of SimplifyDemandedBits that allows simplifications even when the operand has other uses. GetDemandedBits will only do simplifications that allow a node to be bypassed. It won't create new nodes or alter any of the other users.
I had to add support for bypassing SIGN_EXTEND_INREG to GetDemandedBits.
Based on a patch that Simon Pilgrim sent me in email.
Fixes PR40142.
llvm-svn: 350059
Remove the TESTmr isel patterns and add another postprocessing combine for TESTrr+ANDrm->TESTmr. We already have a postprocessing combine for TESTrr+ANDrr->TESTrr. With this we can give ANDN a chance to match first. And clean it up during post processing if we ended up with just a regular AND.
This is another step towards my plan to gut EmitTest and do more flag handling during isel matching or by using optimizeCompare.
llvm-svn: 350038
This is admittedly a narrow fix for the problem:
https://bugs.llvm.org/show_bug.cgi?id=37502
...but as the XOP restriction shows, it's a maze to get this right.
In the motivating example, note that we have movddup before SSE4.1 and
again with AVX2. That's because insertps isn't available pre-SSE41 and
vbroadcast is (more generally) available with AVX2 (and the splat is
reduced to movddup via isel pattern).
Differential Revision: https://reviews.llvm.org/D55898
llvm-svn: 349937
Migrate the X86 backend from X86ISD opcodes ADDS and SUBS to generic
ISD opcodes SADDSAT and SSUBSAT. This also improves codegen for
@llvm.sadd.sat() and @llvm.ssub.sat() intrinsics.
This is a followup to D55787 and part of PR40056.
Differential Revision: https://reviews.llvm.org/D55833
llvm-svn: 349520
InstCombine seems to canonicalize our PSUB pattern into a max with the constant and an add with an inverse of the constant.
This patch recognizes this pattern and turns it into PSUBUS. Future work could improve undef element handling.
Fixes some of PR40053.
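A scalar sketch (illustrative, assuming ordinary wrap-around uint8_t arithmetic) of why the canonicalized max+add form is the same operation PSUBUS performs:

    #include <cstdint>
    #include <algorithm>
    #include <cassert>

    // The per-element operation PSUBUSB performs: unsigned saturating subtract.
    static uint8_t usubSat(uint8_t x, uint8_t c) { return x > c ? x - c : 0; }

    // The canonicalized form: max with the constant, then add its negation.
    static uint8_t maxThenAdd(uint8_t x, uint8_t c) {
      return (uint8_t)(std::max(x, c) + (uint8_t)-c);
    }

    int main() {
      const uint8_t c = 25;
      for (int x = 0; x < 256; ++x)
        assert(usubSat((uint8_t)x, c) == maxThenAdd((uint8_t)x, c));
      return 0;
    }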
Differential Revision: https://reviews.llvm.org/D55780
llvm-svn: 349519
Replace the X86ISD opcodes ADDUS and SUBUS with generic ISD opcodes
UADDSAT and USUBSAT. As a side-effect, this also makes codegen for
the @llvm.uadd.sat and @llvm.usub.sat intrinsics reasonable.
This only replaces use in the X86 backend, and does not move any of
the ADDUS/SUBUS X86 specific combines into generic codegen.
Differential Revision: https://reviews.llvm.org/D55787
llvm-svn: 349481
(VSRAI (VSHLI X, C1), C1) --> X iff NumSignBits(X) > C1
This works better as part of SimplifyDemandedBits than part of the general combine.
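A scalar illustration (assuming the usual two's-complement arithmetic right shift) of why the fold is safe when X has more sign bits than the shift amount:

    #include <cstdint>
    #include <cassert>

    int main() {
      int32_t x = -12345;     // 0xFFFFCFC7: 18 sign bits
      const int c = 16;       // NumSignBits(x) > c
      // The left shift only discards redundant copies of the sign bit,
      // so the arithmetic right shift restores x exactly.
      int32_t shifted = (int32_t)((uint32_t)x << c); // VSHLI
      assert((shifted >> c) == x);                   // VSRAI undoes it
      return 0;
    }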
llvm-svn: 349462
This fold was incredibly specific - replace with a SimplifyDemandedBits fold to remove a VSRAI if only the original sign bit is demanded (it's guaranteed to stay the same).
Test change is merely a rescheduling.
llvm-svn: 349459
Convert VSRAI to VSRLI if the sign bit is known zero and improve KnownBits output for all shift instructions.
Fixes the poor codegen comments in D55768.
llvm-svn: 349407
This allows a TEST to be used and can be combined with any AND that may already exist as an input to the shift.
This was already done in EmitTest, but was easily tricked by multiple uses because the setcc might be used by multiple instructions. Once the SETCC and users are legalized then we can look for the shift to be used by a single CMP, but the CMP itself can have multiple users.
This appears to fix the case in PR39968.
llvm-svn: 349385
This is an initial patch to add the necessary support for a DemandedElts argument to SimplifyDemandedBits, more closely matching computeKnownBits and to help improve vector codegen.
I've added only a small amount of the changes necessary to get at least one test to update - a lot more can be done but I'd like to add these methodically with proper test coverage, at the same time the hope is to slowly move some/all of SimplifyDemandedVectorElts into SimplifyDemandedBits as well.
Differential Revision: https://reviews.llvm.org/D55768
llvm-svn: 349374
I'd like to try to move a lot of the flag matching out of EmitTest and push it to isel or isel preprocessing. This is a step towards that.
The test-shrink-bug.ll change is an improvement because we are no longer interfering with test shrink handling in isel.
The pr34137.ll change is a regression, but the IR came from -O0 and was not reduced by InstCombine. So it contains a lot of redundancies like duplicate loads that made it combine poorly.
llvm-svn: 349315
Use consistent rules for when to lower to SHLD/SHRD for slow machines - this fixes a weird issue where the funnel shift gets expanded but then X86ISelLowering's combineOr sees the optsize and combines it back to SHLD/SHRD, now with the modulo amount guard.
llvm-svn: 349285
Summary:
If the setcc already has the target desired type we can reach the getSetCC/getSExtOrTrunc after the MatchingVecType check with the exact same types as the nodes we started with. This causes the VsetCC to be CSEd to N0 and the getSExtOrTrunc to be CSEd to N. When we return N, the caller will think that meant we called CombineTo and did our own worklist management. But that's not what happened. This prevents target hooks from being called for the node.
To fix this, I've now returned SDValue if the setcc is already the desired type. But to avoid some regressions in X86 I've had to disable one of the target combines that wasn't being reached before in the case of a (sext (setcc)). If we get vector widening legalization enabled that entire function will be deleted anyway so hopefully this is only for the short term.
Reviewers: RKSimon, spatel
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D55459
llvm-svn: 349137
This requires the two callers to manifest a 0 to make EmitCmp call EmitTest.
I'm looking into changing how we combine TEST and flag setting instructions to not be part of lowering. And instead be part of DAG combine or isel. Which will mean EmitTest will probably become gutted and maybe disappear entirely.
llvm-svn: 349094
There's still a couple of minor SimplifyDemandedElts regressions in some of the shift amount splats that will be fixed in future patches.
llvm-svn: 349052
Move existing rotation expansion code into TargetLowering and set it up for vectors as well.
Ideally this would share more of the funnel shift expansion, but we handle the shift amount modulo quite differently at the moment.
Begin removing x86 vector rotate custom lowering to use the expansion.
llvm-svn: 349025
I'm hoping we can just replace SETCC_CARRY with SBB. This is another step towards that.
I've explicitly used zero as the input to the setcc to avoid a false dependency that we've had with the SETCC_CARRY. I changed one of the patterns that used NEG to instead use an explicit compare with 0 on the LHS. We needed the zero anyway to avoid the false dependency. The negate would clobber its input register. By using a CMP we can avoid that which could be useful.
Differential Revision: https://reviews.llvm.org/D55414
llvm-svn: 348959
This patch introduces a generic function to determine whether a given vector type is known to be a splat value for the specified demanded elements, recursing up the DAG looking for BUILD_VECTOR or VECTOR_SHUFFLE splat patterns.
It also keeps track of the elements that are known to be UNDEF - it returns true if all the demanded elements are UNDEF (as this may be useful under some circumstances), so this needs to be handled by the caller.
A wrapper variant is also provided that doesn't take the DemandedElts or UndefElts arguments for cases where we just want to know if the SDValue is a splat or not (with/without UNDEFS).
I had hoped to completely remove the X86 local version of this function, but I'm seeing some regressions in shift/rotate codegen that will take a little longer to fix and I hope to get this in sooner so I can continue work on PR38243 which needs more capable splat detection.
Differential Revision: https://reviews.llvm.org/D55426
llvm-svn: 348953
This should really be generalized to allow increment and/or
we should replace it by using ISD::matchUnaryPredicate().
See D55515 for context.
llvm-svn: 348776
Previously we had to take the carry in and add -1 to it to set the carry flag so we could use it with ADC/SBB. But if we know it's 0 then we don't need to bother.
This should go a long way towards fixing PR24545.
llvm-svn: 348727
This addresses a FIXME and avoids depending on an isel pattern match, I think. I've removed the isel patterns too since we have no lit tests left that cover them. Hopefully that really means they are unused.
I'm trying to decide if we need SETCC_CARRY. This removes one of its usages.
Differential Revision: https://reviews.llvm.org/D55355
llvm-svn: 348536
Initial step towards making the function more generic (and probably move into SelectionDAG).
This is necessary to avoid massive codegen bloat for PR38243 (Add modulo rotate support to LowerRotate).
llvm-svn: 348498
Prep work for PR38243 - mainly adding comments on where we need to add modulo support (doing so at the moment causes massive codegen regressions).
I've also consistently added support for modulo folding for uniform constants (although at the moment we have no way to trigger this) and removed the old assertions.
llvm-svn: 348366
This is an initial patch to add a minimum level of support for funnel shifts to the SelectionDAG and to begin wiring it up to the X86 SHLD/SHRD instructions.
Some partial legalization code has been added to handle the case for 'SlowSHLD' where we want to expand instead and I've added a few DAG combines so we don't get regressions from the existing DAG builder expansion code.
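For reference, a scalar model of the i32 fshl semantics that ties the new node to SHLD (the helper name fshl32 here is just for illustration):

    #include <cstdint>
    #include <cstdio>

    // Concatenate hi:lo and return the top 32 bits after shifting left by amt
    // (amt taken modulo the bit width); shld computes the same thing.
    static uint32_t fshl32(uint32_t hi, uint32_t lo, uint32_t amt) {
      amt &= 31;
      if (amt == 0)
        return hi;                      // avoid the undefined lo >> 32
      return (hi << amt) | (lo >> (32 - amt));
    }

    int main() {
      printf("0x%08x\n", fshl32(0x12345678, 0x9ABCDEF0, 8)); // 0x3456789a
      return 0;
    }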
Differential Revision: https://reviews.llvm.org/D54698
llvm-svn: 348353
PR17686 demonstrates that for some targets FP exceptions can fire in cases where the FP_TO_UINT is expanded using a FP_TO_SINT instruction.
The existing code converts both the in-range and out-of-range cases using FP_TO_SINT and then selects the result; this patch changes this for 'strict' cases to pre-select the FP_TO_SINT input and the offset adjustment.
The X87 cases don't need the strict flag but generate much nicer code with it.
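A scalar model of the expansion being discussed (written with a branch here for clarity; the DAG version selects between the two forms, and the strict fix pre-selects the input and offset so the out-of-range signed conversion never fires):

    #include <cstdint>
    #include <cstdio>

    // Expanding fptoui f64 -> i64 when only a signed conversion is available:
    // values below 2^63 convert directly, larger values are rebased by 2^63
    // before the signed conversion and the bit is added back afterwards.
    static uint64_t fptouiViaSint(double x) {
      const double twoPow63 = 9223372036854775808.0; // 2^63
      if (x < twoPow63)
        return (uint64_t)(int64_t)x;
      return (uint64_t)(int64_t)(x - twoPow63) + (1ULL << 63);
    }

    int main() {
      printf("%llu\n", (unsigned long long)fptouiViaSint(42.0));
      printf("%llu\n", (unsigned long long)fptouiViaSint(1.5e19));
      return 0;
    }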
Differential Revision: https://reviews.llvm.org/D53794
llvm-svn: 348251