Commit Graph

7035 Commits

Author SHA1 Message Date
Simon Pilgrim 05c0d34918 [X86][SSE] Prefer trunc(movd(x)) to pextrb(x,0)
If we're extracting the 0'th index of a v16i8 vector we're better off using MOVD than PEXTRB, unless we're storing the value or we require the implicit zero extension of PEXTRB.

The biggest perf diff is on SLM targets where MOVD (uops=1, lat=3 tp=1) is notably faster than PEXTRB (uops=2, lat=5, tp=4).

This matches what we already do for PEXTRW.

Differential Revision: https://reviews.llvm.org/D76138
2020-03-13 18:43:04 +00:00
Simon Pilgrim 846c614f54 [X86] combineExtractWithShuffle - pull out repeated getSizeInBits() call. NFC. 2020-03-13 15:36:04 +00:00
Simon Pilgrim fe047fbccc [X86] LowerEXTRACT_VECTOR_ELT - pull out repeated getOperand() calls. NFC.
Also, cleanup LowerEXTRACT_VECTOR_ELT_SSE4 comments which had references to non-constant extraction indices.
2020-03-13 15:36:02 +00:00
Simon Pilgrim 4689eae820 [X86] combineOrShiftToFunnelShift - remove shift by immediate handling.
Now that D75114 has landed, DAGCombiner handles this case so the code is redundant.
2020-03-12 11:46:51 +00:00
Simon Pilgrim b3b4727a3e [X86] Replace (most) X86ISD::SHLD/SHRD usage with ISD::FSHL/FSHR generic opcodes (PR39467)
For i32 and i64 cases, X86ISD::SHLD/SHRD are close enough to ISD::FSHL/FSHR that we can use them directly, we just need to account for the operand commutation for SHRD.

The i16 SHLD/SHRD case is annoying as the shift amount is modulo-32 (vs funnel shift modulo-16), so I've added X86ISD::FSHL/FSHR equivalents, which matches the generic implementation in all other terms.

Something I'm slightly concerned with is that ISD::FSHL/FSHR legality is controlled by the Subtarget.isSHLDSlow() feature flag - we don't normally use non-ISA features for this but it allows the DAG combines to continue to operate after legalization in a lot more cases.

The X86 *bits.ll changes are all affected by the same issue - we now have a "FSHR(-1,-1,amt) -> ROTR(-1,amt) -> (-1)" simplification that reduces the dependencies enough for the branch fall through code to mess up.

Differential Revision: https://reviews.llvm.org/D75748
2020-03-11 11:17:49 +00:00
Simon Pilgrim c8ede5e485 [X86][SSE] getFauxShuffleMask - add support for INSERT_VECTOR_ELT(EXTRACT_VECTOR_ELT) shuffle pattern
We already do this for PINSRB/PINSRW and SCALAR_TO_VECTOR.
2020-03-10 15:42:37 +00:00
Simon Pilgrim e6a7e3b5e3 [X86][SSE] matchShuffleWithSHUFPD - add support for unary shuffles.
This causes one minor test change but is mainly necessary for an upcoming patch.
2020-03-10 15:42:36 +00:00
Simon Pilgrim 18c19441d1 [X86][AVX] combineX86ShuffleChain - combine binary shuffles to X86ISD::VPERM2X128
For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern.

At some point soon we're going to have make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging) which means we miss out on a lot of optimizations.
2020-03-10 10:44:28 +00:00
Craig Topper ef4f939d38 [X86] Remove isel patterns for (X86VBroadcast (i16 (trunc (i32 (load))))). Replace with a DAG combine to form VBROADCAST_LOAD.
isTypeDesirableForOp prevents loads from being shrunk to i16 by DAG
combine. Because of this we can't just match the broadcast and a
scalar load. So look for broadcast+truncate+load and form a
vbroadcast_load during DAG combine. This replaces what was
previously done as an isel pattern and I think fixes it so we
won't change the size of a volatile load. But my main motivation
is just to clean up our isel patterns.
2020-03-10 00:07:07 -07:00
Simon Pilgrim 4b130b883d [X86][SSE] SimplifyDemandedVectorEltsForTargetNode - reduce vector width of X86ISD::BLENDI
If we don't need the upper subvector elements of the BLENDI node then use a smaller vector size.

This causes a couple of minor regressions in insertelement-ones.ll which are more examples of PR26018; given how cheap allones generation is I don't consider that a showstopper, just an annoyance (and there's plenty of other poor codegen cases in that file).
2020-03-09 18:29:28 +00:00
Craig Topper 3dcc0db15e [X86] Teach combineToExtendBoolVectorInReg to create opportunities for using broadcast load instructions.
If we're inserting a scalar that is smaller than the element
size of the final VT, the value of the extra bits doesn't matter.

Previously we any_extended in the scalar domain before inserting.

This patch changes this to use a broadcast of the original
scalar type and then a bitcast to the final type. This might
enable the use of a broadcast load.

This recovers regressions from 07d68c24aa
and 9fcd212e2f without relying on
alignment of the load.

Differential Revision: https://reviews.llvm.org/D75835
2020-03-09 11:26:12 -07:00
Djordje Todorovic c15c68abdc [CallSiteInfo] Enable the call site info only for -g + optimizations
Emit call site info only in the case of '-g' + 'O>0' level.

Differential Revision: https://reviews.llvm.org/D75175
2020-03-09 12:12:44 +01:00
Craig Topper 70e4fb8a53 [X86] Add DAG combine to turn (vzext_movl (vbroadcast_load)) -> vzext_load.
If we're zeroing the other elements then we don't need the broadcast.
2020-03-08 00:35:40 -08:00
Craig Topper d81d451442 [X86] Add DAG combine to replace vXi64 vzext_movl+scalar_to_vector with vYi32 vzext_movl+scalar_to_vector if the upper 32 bits of the scalar are zero.
We can just use a 32-bit copy and zero in the SSE domain when we
zero the upper bits.

Remove an isel pattern that becomes dead with this.
2020-03-07 16:14:26 -08:00
Craig Topper d41ea65ee8 [X86] Add DAG combines to enable removing of movddup/vbroadcast + simple_load isel patterns. 2020-03-07 15:22:02 -08:00
Craig Topper bc65b68661 [X86] Add a DAG combine to turn vbroadcast(vzload X) -> vbroadcast_load
Remove now unneeded isel patterns.
2020-03-07 15:22:02 -08:00
Craig Topper ec1d1f6ae7 [X86] Use MVT instead of EVT in a couple shuffle lowering functions. 2020-03-07 09:50:53 -08:00
Reid Kleckner 65b21282c7 Avoid emitting unreachable SP adjustments after `throw`
In 172eee9c, we tried to avoid these by modelling the callee as
internally resetting the stack pointer.

However, for the majority of functions with reserved stack frames, this
would lead LLVM to emit extra SP adjustments to undo the callee's
internal adjustment. This lead us to fix the problem further on down the
pipeline in eliminateCallFramePseudoInstr. In 5b79e603d3, I added
use a heuristic to try to detect when the adjustment would be
unreachable.

This heuristic is imperfect, and when exception handling is involved, it
fails to fire. The new test is an example of this. Simply throwing an
exception with an active cleanup emits dead SP adjustments after the
throw. Not only are they dead, but if they were executed, they would be
incorrect, so they are confusing.

This change essentially reverts 172eee9c and makes the 5b79e603d3
heuristic responsible for preventing unreachable stack adjustments. This
means we may emit unreachable stack adjustments for functions using EH
with unreserved call frames, but that is not very many these days. Back
in 2016 when this change was added, we were focused on 32-bit, which we
observed to have fewer reserved frames.

Fixes PR45064

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D75712
2020-03-06 13:33:45 -08:00
Craig Topper 4c7c87f245 [X86] Simplify the code at the end of lowerShuffleAsBroadcast.
The original code could create a bitcast from f64 to i64 and back
on 32-bit targets. This was only working because getBitcast was
able to fold the casts away to avoid leaving the illegal i64 type.

Now we handle the scalar case directly by broadcasting using the
scalar type as the element type. Then bitcasting to the final VT.
This works since we ensure the scalar type is the same size as
the final VT element type. No more casts to i64.

For the vector case, we cast to VT or subvector of VT. And then
do the broadcast.

I think this all matches what we generated before, just in a more
readable way.
2020-03-04 20:45:02 -08:00
Craig Topper eadea7868f [X86] Convert vXi1 vectors to xmm/ymm/zmm types via getRegisterTypeForCallingConv rather than using CCPromoteToType in the td file
Previously we tried to promote these to xmm/ymm/zmm by promoting
in the X86CallingConv.td file. But this breaks when we run out
of xmm/ymm/zmm registers and need to fall back to memory. We end
up trying to create a non-sensical scalar to vector. This lead
to an assertion. The new tests in avx512-calling-conv.ll all
trigger this assertion.

Since we really want to treat these types like we do on avx2,
it seems better to promote them before the calling convention
code gets involved. Except when the calling convention is one
that passes the vXi1 type in a k register.

The changes in avx512-regcall-Mask.ll are because we indicated
that xmm/ymm/zmm types should be passed indirectly for the
Win64 ABI before we go to the common lines that promoted the
vXi1 types. This caused the promoted types to be picked up by
the default calling convention code. Now we promote them earlier
so they get passed indirectly as though they were xmm/ymm/zmm.

Differential Revision: https://reviews.llvm.org/D75154
2020-03-04 15:02:32 -08:00
Craig Topper 06de426426 [X86] Directly form VBROADCAST_LOAD in lowerShuffleAsBroadcast on AVX targets.
If we would emit a VBROADCAST node, we can instead directly emit
a VBROADCAST_LOAD. This allows us to get rid of the special case
to use an f64 load on 32-bit targets for vXi64.

I believe there is more cleanup we can do later in this function,
but I'll do that in follow ups.
2020-03-04 09:11:57 -08:00
Craig Topper 9284abd004 [X86] Directly form VBROADCAST_LOAD for BUILD_VECTOR of splat loads in lowerBuildVectorAsBroadcast. 2020-03-03 22:27:34 -08:00
Craig Topper 3c4e635593 [X86] Always emit an integer vbroadcast_load from lowerBuildVectorAsBroadcast regardless of AVX vs AVX2
If we go with D75412, we no longer depend on the scalar type directly. So we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types so we don't need to avoid using integer types on AVX1.

Differential Revision: https://reviews.llvm.org/D75413
2020-03-03 10:39:11 -08:00
Craig Topper 56cd3bc209 [X86] Directly emit VBROADCAST_LOAD from constant pool in lowerBuildVectorAsBroadcast
Also add a DAG combine to combine different sized broadcasts from
constant pool to avoid a regression.

Differential Revision: https://reviews.llvm.org/D75412
2020-03-03 10:39:10 -08:00
Craig Topper 68aeaab888 [X86] Don't count the chain uses when forming broadcast loads in lowerBuildVectorAsBroadcast.
The build_vector needs to be the only user of the data, but the
chain will likely have another use. So we can't make sure the
build_vector is the only user of the node.
2020-03-03 08:41:31 -08:00
Craig Topper 2f4f8fcf64 [X86] Don't add DELETED_NODES to DAG combine worklist after calling SimplifyDemandedBits/SimplifyDemandedVectorElts.
These AddToWorklist calls were added in 84cd968f75.
It's possible the SimplifyDemandedBits/SimplifyDemandedVectorElts
triggered CSE that deleted N. Detect that and avoid adding N
to the worklist.

Fixes PR45067.
2020-03-01 00:06:32 -08:00
Craig Topper f2d45e5097 [X86] Canonicalize (bitcast (vbroadcast_load)) so that the cast and vbroadcast_load are both integer or fp.
Helps a little with some isel pattern matching. Especially on
32-bit targets where we sometimes use f64 loads.
2020-02-28 15:07:49 -08:00
Craig Topper b68eeff05c [X86] Cleanup a comment around bitcasting X86ISD::VBROADCAST_LOAD and add an assert to make sure memory VT size doesn't change. 2020-02-28 15:07:49 -08:00
Craig Topper c0d0e6b198 [X86] Recognize CVTPH2PS from STRICT_FP_EXTEND
This should avoid scalarizing the cvtph2ps intrinsics with D75162

Differential Revision: https://reviews.llvm.org/D75304
2020-02-28 10:19:57 -08:00
Simon Pilgrim f90cc633de Fix cppcheck definition/declaration arg mismatch warnings. NFCI. 2020-02-27 14:35:20 +00:00
Simon Pilgrim fe6bcfaf3b [X86] Use Subtarget.useSoftFloat() in X86TargetLowering constructor
Avoid use of X86TargetLowering::useSoftFloat() in the constructor as its a virtual function
2020-02-27 14:35:20 +00:00
Simon Pilgrim e61e7f0794 Fix shadow variable warning. NFC. 2020-02-27 14:23:05 +00:00
Simon Pilgrim dc7ac563ac Fix shadow variable warnings. NFC. 2020-02-27 14:21:30 +00:00
Simon Pilgrim efe2f59ec4 [X86] LowerMSCATTER/MGATHER - reduce scope of MaskVT. NFCI.
Fixes cppcheck warning.
2020-02-27 14:20:44 +00:00
Simon Pilgrim fabe52a741 Fix uninitialized variable warning. NFC. 2020-02-27 14:20:43 +00:00
Simon Pilgrim 6bdd63dc28 [X86] createVariablePermute - handle case where recursive createVariablePermute call fails
Account for the case where a recursive createVariablePermute call with a wider vector type fails.

Original test case from @craig.topper (Craig Topper)
2020-02-27 13:52:31 +00:00
Craig Topper 82a21c1655 [X86] Add proper MachinePointerInfo to stack store created in LowerWin64_i128OP. 2020-02-26 16:55:24 -08:00
Craig Topper 870363a22d [X86] Explicitly pass Destination VT and debug location to BuildFILD. NFC
We'd already passed most everything else. Might was well pass
these two things and stop passing Op.
2020-02-26 16:26:46 -08:00
Craig Topper 15e2831fcd [X86] Explicitly pass Pointer, MachinePointerInfo and Alignment to BuildFILD.
Previously this code was called into two ways, either a FrameIndexSDNode
was passed in StackSlot. Or a load node was passed in the argument
called StackSlot. This was determined by a dyn_cast to FrameIndexSDNode.

In the case of a load, we had to go find the real pointer from
operand 0 and cast the node to MemSDNode to find the pointer info.

For the stack slot case, the code assumed that the stack slot
was perfectly aligned despite not being the creator of the slot.

This commit modifies the interface to make the caller responsible
for passing all of the required information to avoid all the
guess work and reverse engineering.

I'm not aware of any issues with the original code after an
earlier commit to fix the alignment of one of the stack objects.
This is just clean up to make the code less surprising.
2020-02-26 16:26:26 -08:00
Craig Topper 77d9b7b2cd [X86] Query constant pool object alignment instead of hardcoding. 2020-02-26 14:45:39 -08:00
Craig Topper 9c1a707ba3 [X86] Use proper alignment for stack temporary and correct MachinePointerInfo for stack accesses in LowerUINT_TO_FP. 2020-02-26 14:45:38 -08:00
Craig Topper a8186935ae [X86] Use correct MachineMemOperand for stack load in LowerFLT_ROUNDS_ 2020-02-26 14:45:38 -08:00
Craig Topper 735d27dc40 [SelectionDAG][PowerPC][AArch64][X86][ARM] Add chain input and output the ISD::FLT_ROUNDS_
This node reads the rounding control which means it needs to be ordered properly with operations that change the rounding control. So it needs to be chained to maintain order.

This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've update all in-tree targets to connect their chain through their lowering code.

Differential Revision: https://reviews.llvm.org/D75132
2020-02-25 16:58:23 -08:00
Craig Topper 9238dfb4d8 [X86] Remove mask output from X86 gather/scatter ISD opcodes.
Instead add it when we make the machine nodes during instruction
selections.

This makes this ISD node closer to ISD::MGATHER. Trying to see
if we remove the X86 specific ones.
2020-02-24 23:56:28 -08:00
Simon Pilgrim daac8dba77 [X86] combineX86ShuffleChain - select X86ISD::FAND/ISD::AND based on MaskVT
Noticed by inspection, we shouldn't use FloatDomain directly, we've already bitcast both inputs to MaskVT so select the opcode using that.
2020-02-24 18:24:44 +00:00
Simon Pilgrim 59d8d13c7b [X86] getTargetShuffleInputs - check that the source inputs are all the right size.
I'm hoping to begin improving shuffle combining across different vector sizes, but before that we must ensure that all existing getTargetShuffleInputs calls must bail if the inputs aren't the same size.
2020-02-24 16:26:10 +00:00
Craig Topper 7a7146cf72 [X86] When creating X86ISD::MGATHER nodes from AVX2 gather intrinsics, cast the mask to integer type.
The gather intrinsics use a floating point mask when the result
type is FP. But we call DemandedBits on the mask assuming its an
integer type. We also use integer types when we create it from
generic IR. So add a bitcast to the intrinsic path to guarantee
the integer type.
2020-02-23 23:00:41 -08:00
Craig Topper 5a70518660 [X86] Remove most X86 specific subclasses of MemSDNode. Just use a MemIntrinsicSDNode as we usually do.
Leave the gather/scatter subclasses, but make them inherit from
MemIntrinsicSDNode and delete their constructor and destructor.
This way we can still have the getIndex, getMask, etc. convenience
functions.
2020-02-23 15:13:32 -08:00
Craig Topper 15b6aa7448 [X86] Enable the use of movlps for i64 atomic load on 32-bit targets with sse1.
Still a little room for improvement by using movlps to store to
the stack temporary needed to move data out of the xmm register
after the load.
2020-02-23 15:11:38 -08:00
Craig Topper 2a10f8019d [X86] Use FIST for i64 atomic stores on 32-bit targets without SSE. 2020-02-23 15:11:38 -08:00
Craig Topper 84cd968f75 [X86] Add AddToWorklist(N) after calls to SimplifyDemandedBits/SimplifyDemandedVectorElts that are called on an operand of N.
If a simplication occurs the operand will be added to the worklist.
But since the demanded mask was based on N, we need to make sure
we revisit N in case there are more simplifications to be done.
Returning SDValue(N, 0) as we do, only tells DAG combine that
something changed, but that won't make it add anything to the
worklist.

Found while playing around with using VEXTRACT_STORE in more cases.
But I guess this doesn't affect any of our existing tests.
2020-02-22 21:42:59 -08:00
Craig Topper bdb1729c83 [X86] Teach EltsFromConsecutiveLoads that it's ok to form a v4f32 VZEXT_LOAD with a 64 bit memory size on SSE1 targets.
We can use MOVLPS which will load 64 bits, but we need a v4f32
result type. We already have isel patterns for this.

The code here is a little hacky. We can probably improve it with
more isel patterns.
2020-02-22 18:50:52 -08:00
Craig Topper e7a184fc7c [X86] Use movlps for i64 atomic stores on 32-targets with sse1.
This is similar to using movd which we do for sse2 targets.

I've added a DAG combine for VEXTRACT_STORE to use SimplifyDemandedVectorElts
to clean up some artifacts from type legalization.
2020-02-22 18:22:47 -08:00
Craig Topper 228a2bc9b7 [X86] Teach combineCVTPH2PS to shrink v8i16 loads when the output type is v4f32. Remove extra isel patterns.
Similar to what do for other operations that use a subset of bits.
Allows us to remove a pattern that shrinks a load. Which was
incorrect if the load was volatile.
2020-02-21 18:11:07 -08:00
Nikita Popov c90ea87cfd [X86] Fix SDLoc initialization
Fixes -Wparentheses warning, in this case indicating a genuine
bug.
2020-02-21 18:26:05 +01:00
Craig Topper 97f11600e0 [X86] Don't bother avoiding illegal FCMOVs if we don't have the cmov subtarget feature.
We'll be forced to emit branches so we might as well use the most
direct condition.
2020-02-21 00:34:15 -08:00
Craig Topper 263bef2bbc [X86] Make combineCMov not create unsupported FCMOVs when f32/f64 are using X87.
This makes the behavior consistent with what's in LowerSELECT.
2020-02-21 00:34:15 -08:00
Craig Topper 4576606831 [X86] Remove unnecessary isNullConstant in LowerSelect. NFC
At this point in the code we know that Op1 or Op2 is
all ones. Y points to the other operand. In the case that
Op2 is zero, Op1 must be all ones and Y is Op2. The OR
ORs Y into Res. But if Y is 0 the OR will be folded away by
getNode so we don't need to check for it.
2020-02-20 21:41:13 -08:00
Craig Topper 78be618717 [X86] Add CMOV_VR64 pseudo instruction for MMX. Remove mmx handling from combineSelect.
The combineSelect code was casting to i64 without any check that
i64 was legal. This can break after type legalization.

It also required splitting the mmx register on 32-bit targets.
It's not clear that this makes sense. Instead switch to using
a cmov pseudo like we do for XMM/YMM/ZMM.
2020-02-20 20:30:56 -08:00
Craig Topper e5782377f3 [X86] Add CMOV_VK1 pseudo so we don't crash on v1i1 ISD::SELECT 2020-02-20 15:13:48 -08:00
Craig Topper 7e92769862 [X86] Expand vselect of v1i1 under avx512.
We already do this for v2i1, v4i1, etc.
2020-02-20 15:13:47 -08:00
Craig Topper b00ef8951b [X86] Custom legalize v1i1 UADDSAT/USUBSAT/SADDSAT/UADDSAT to match v2i1/v4i1/v8i1 etc. 2020-02-20 15:13:46 -08:00
Craig Topper d95a10a7f9 [X86] Custom legalize v1i1 add/sub/mul to xor/xor/and with avx512.
We already did this for v2i1, v4i1, v8i1, etc.
2020-02-20 15:13:44 -08:00
Craig Topper c7b54a196e Recommit "[X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT""
With the correct author this time
2020-02-20 12:28:54 -08:00
Craig Topper 1d8860f90b Revert 714265dabb "[X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT"
I accidentally messed up the author on the previous commit somehow.
2020-02-20 12:28:33 -08:00
Quentin Colombet 714265dabb [X86] Replace a bad use of MVT::getVectorVT with EVT::getVectorVT
The type here isn't guaranteed to be a simple type.

Fixes PR44976
2020-02-20 12:25:37 -08:00
Sanjay Patel 064cd2ecdb [x86] allow peeking through an extract_subvector to find a splatted operand
The motivating case is seen in "splat4_v8f32_load_store" and based on code in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
(I haven't stepped through the v8i32 sibling test yet to see why that diverged.)

There are other potential improvements visible like allowing scalarization or vector
narrowing.

Differential Revision: https://reviews.llvm.org/D74909
2020-02-20 13:59:59 -05:00
Craig Topper 0ed7a61543 [X86] Fix a -Wparentheses warning. NFC 2020-02-20 09:32:03 -08:00
Craig Topper 3543ac9ab5 [X86] Rewrite LowerBRCOND to remove dead code and handle ISD::SETCC and overflow ops directly.
There's a lot of old leftover code in LowerBRCOND. Especially
the detecting or AND or OR of X86ISD::SETCC nodes. Those were
needed before LegalizeDAG was changed to visit nodes before
their operands.

It also relied on reversing the output of LowerSETCC to find the
flags producing node to use for the X86ISD::BRCOND node.

Rather than using LowerSETCC this patch uses emitFlagsForSetcc to
handle the integer ISD::SETCC case. This gives the flag producer
and the comparison code to use directly. I've removed the addTest
flag and just produce a X86ISD::BRCOND and return immediately.

Floating point ISD::SETCC case is just an X86ISD::FCMP with special
care for OEQ and UNE derived from the previous code. I've left
f128 out so it will emit a test. And LowerSETCC will be called
later to produce a libcall and X86ISD::SETCC. We have combines
that can merge the test and X86ISD::SETCC.

We need to handle two cases for overflow ops. Either they are used
directly or they have a seteq 0 or setne 1 to invert the overflow.
The old code did not handle the setne 1 case, but I think some
other combines were making up for it.

If we fail to find a condition, we'll wrap an AND with 1 on the
original condition and tell emitFlagsForSetcc to emit a compare
with 0. This will pickup the LowerAndToBT and or the EmitTest case.
I kept the isTruncWithZeroHighBitsInput call, but we might be able
to fold that in to emitFlagsForSetcc.

Differential Revision: https://reviews.llvm.org/D74750
2020-02-20 08:50:18 -08:00
Craig Topper 12cc105f80 [X86] Add DAG combines to form CVTPH2PS/CVTPS2PH from vXf16->vXf32/vXf64 fp_extends and vXf32->vXf16 fp_round.
Only handle power of 2 element count for simplicity. Not sure what to do with vXf64->vXf16 fp_round to avoid double rounding

Differential Revision: https://reviews.llvm.org/D74886
2020-02-20 08:26:17 -08:00
Djordje Todorovic 2f215cf36a Revert "Reland "[DebugInfo] Enable the debug entry values feature by default""
This reverts commit rGfaff707db82d.
A failure found on an ARM 2-stage buildbot.
The investigation is needed.
2020-02-20 14:41:39 +01:00
Craig Topper f559cecc3e [X86] Add DCI.isBeforeLegalize() check to the v64i1 constant splitting code in combineStore.
We only need to split after type legalization. If we're before
we can just use a wide store and type legalization will split it.

Add a v128i1 test to exercise it post type legalization.
2020-02-19 09:18:16 -08:00
Florian Hahn 216afd3301 [TargetLower] Update shouldFormOverflowOp check if math is used.
On some targets, like SPARC, forming overflow ops is only profitable if
the math result is used: https://godbolt.org/z/DxSmdB
This patch adds a new MathUsed parameter to allow the targets
to make the decision and defaults to only allowing it
if the math result is used. That is the conservative choice.

This patch also updates AArch64ISelLowering, X86ISelLowering,
ARMISelLowering.h, SystemZISelLowering.h to allow forming overflow
ops if the math result is not used. On those targets using the
overflow intrinsic for the overflow check only generates better code.

Reviewers: nikic, RKSimon, lebedev.ri, spatel

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D74722
2020-02-19 11:28:33 +01:00
Djordje Todorovic faff707db8 Reland "[DebugInfo] Enable the debug entry values feature by default"
Differential Revision: https://reviews.llvm.org/D73534
2020-02-19 11:12:26 +01:00
Craig Topper f69a29da5a [X86] Remove vXi1 select optimization from LowerSELECT. Move it to DAG combine. 2020-02-19 00:00:55 -08:00
Craig Topper 0dbc4658d8 [X86] Handle splats in LowerBUILD_VECTORvXi1 by directly emitting scalar selects instead of deferring that to LowerSELECT.
LoweSELECT will detect the constant inputs and convert to scalar
selects, but we can do it directly here.

I might remove some of the code from LowerSELECT and move it to
DAG combine so doing this explicitly will make us less dependent
on it happening in lowering.
2020-02-18 22:39:30 -08:00
Simon Pilgrim d6eef0614f [TargetLowering] Add SimplifyMultipleUseDemandedBits 'all elements' helper wrapper. NFC. 2020-02-18 19:53:50 +00:00
Craig Topper 89ab5c69c8 [X86] Add a helper function to pull some repeated code out of combineGatherScatter. NFC 2020-02-18 11:10:40 -08:00
Djordje Todorovic 2bf44d11cb Revert "Reland "[DebugInfo] Enable the debug entry values feature by default""
This reverts commit rGa82d3e8a6e67.
2020-02-18 16:38:11 +01:00
Djordje Todorovic a82d3e8a6e Reland "[DebugInfo] Enable the debug entry values feature by default"
This patch enables the debug entry values feature.

  - Remove the (CC1) experimental -femit-debug-entry-values option
  - Enable it for x86, arm and aarch64 targets
  - Resolve the test failures
  - Leave the llc experimental option for targets that do not
    support the CallSiteInfo yet

Differential Revision: https://reviews.llvm.org/D73534
2020-02-18 14:41:08 +01:00
Craig Topper e90dc7c48b [X86] Move avx512 code that forces zeros to the false side of vselects above a check for legal types.
This helps this transform occur earlier so we can fold the not
with setcc. If we delay it until after type legalization we might
have introduced instructions to widen the mask if the vselect was
widened. This can prevent the not from making it to the setcc.

We could of course add more DAG combines to handle that, but
moving this earlier is easier.
2020-02-17 22:24:21 -08:00
Craig Topper b0840934a7 [X86] Use isScalarFPTypeInSSEReg to simplify code in LowerSELECT. NFC 2020-02-17 19:43:57 -08:00
Craig Topper 3f4490d384 [X86] Add one use check to '0-x == y --> x+y == 0' in EmitCmp.
I failed to copy it when I moved this in
b62de210cf
2020-02-17 18:16:42 -08:00
Craig Topper 43e948c4b7 [X86] Change how the alignment for the stack object is created in LowerFLT_ROUNDS_.
We don't need FrameInfo's concept of the stack alignment. We just
need to tell it the desired alignment. Which in this case is 2.
2020-02-17 11:27:34 -08:00
Craig Topper b62de210cf [X86] Move '0-x == y --> x+y == 0' and similar combines to EmitCmp.
AArch64 handles this pattern in their lowering code. By emitting
CMN. ARM handles it as an isel pattern.
2020-02-17 11:27:34 -08:00
Nikita Popov 80397d2d12 [IRBuilder] Delete copy constructor
D73835 will make IRBuilder no longer trivially copyable. This patch
deletes the copy constructor in advance, to separate out the breakage.

Currently, the IRBuilder copy constructor is usually used by accident,
not by intention.  In rG7c362b25d7a9 I've fixed a number of cases where
functions accepted IRBuilder rather than IRBuilder &, thus performing
an unnecessary copy. In rG5f7b92b1b4d6 I've fixed cases where an
IRBuilder was copied, while an InsertPointGuard should have been used
instead.

The only non-trivial use of the copy constructor is the
getIRBForDbgInsertion() helper, for which I separated construction and
setting of the insertion point in this patch.

Differential Revision: https://reviews.llvm.org/D74693
2020-02-17 18:14:48 +01:00
Craig Topper 464729cf7c [X86] Remove unnecessary check for null SDValue. NFC 2020-02-16 20:25:24 -08:00
Craig Topper 272d35aef5 [X86] Separate floating point handling out of EmitCmp and emitFlagsForSetcc.
Both of those functions only have a single caller starting
at LowerSETCC. Just handle floating point directly in LowerSETCC.

This removes the need to pass Chain and IsSignaling all the way
down.
2020-02-16 10:51:05 -08:00
Craig Topper d26f11108b [X86] Split X86ISD::CMP into an integer and FP opcode. 2020-02-16 10:10:19 -08:00
Simon Pilgrim b85df2e185 [X86] combineX86ShuffleChain - add support for combining 512-bit shuffles to PALIGNR 2020-02-16 16:13:26 +00:00
Simon Pilgrim c9c1c2b335 [X86] combineX86ShuffleChain - add support for combining 512-bit shuffles to bit shifts 2020-02-16 16:13:25 +00:00
Sanjay Patel e48b536be6 [x86] form broadcast of scalar memop even with >1 use
The unseen logic diff occurs because MayFoldLoad() is defined like this:

static bool MayFoldLoad(SDValue Op) {
  return Op.hasOneUse() && ISD::isNormalLoad(Op.getNode());
}

The test diffs here all seem ok to me on screen/paper, but it's hard to know
if that will lead to universally better perf for all targets. For example,
if a target implements broadcast from mem as multiple uops, we would have to
weigh the potential reduction of instructions and register pressure vs.
possible increase in number of uops. I don't know if we can make a truly
informed decision on this at compile-time.

The motivating case that I'm looking at in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
...resembles the diff in extract-concat.ll, but we're not going to change the
larger example there without at least 1 other fix.

Differential Revision: https://reviews.llvm.org/D74088
2020-02-16 10:32:56 -05:00
Simon Pilgrim 34a054ce71 [X86] combineX86ShuffleChain - add support for combining to X86ISD::ROTLI
Refactors matchShuffleAsBitRotate to allow use by both lowerShuffleAsBitRotate and matchUnaryPermuteShuffle.
2020-02-15 20:04:54 +00:00
Simon Pilgrim 2492075add [X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets
Without PSHUFB we are better using ROTL (expanding to OR(SHL,SRL)) than using the generic v16i8 shuffle lowering - but if we can widen to v8i16 or more then the existing shuffles are still the better option.

REAPPLIED: Original commit rG11c16e71598d was reverted at rGde1d90299b16 as it wasn't accounting for later lowering. This version emits ROTLI or the OR(VSHLI/VSRLI) directly to avoid the issue.
2020-02-14 11:55:18 +00:00
Craig Topper c2e8a421ac [X86] Don't widen 128/256-bit strict compares with vXi1 result to 512-bits on KNL.
If we widen the compare we might trigger a spurious exception from
the garbage data.

We have two choices here. Explicitly force the upper bits to zero.
Or use a legacy VEX vcmpps/pd instruction and convert the XMM/YMM
result to mask register.

I've chosen to go with the second option. I'm not sure which is
really best. In some cases we could get rid of the zeroing since
the producing instruction probably already zeroed it. But we lose
the ability to fold a load. So which is best is dependent on
surrounding code.

Differential Revision: https://reviews.llvm.org/D74522
2020-02-13 13:26:40 -08:00
Amy Huang de1d90299b Revert "[X86][SSE] lowerShuffleAsBitRotate - lower to vXi8 shuffles to ROTL on pre-SSSE3 targets"
This reverts commit 11c16e7159 because it
causes a crash in chromium code. See
https://reviews.llvm.org/rG11c16e71598d51f15b4cfd0f719c4dabcc0bebf7.
2020-02-12 17:00:37 -08:00
Jay Foad 32aac25637 [KnownBits] Introduce anyext instead of passing a flag into zext
Summary:
This was a very odd API, where you had to pass a flag into a zext
function to say whether the extended bits really were zero or not. All
callers passed in a literal true or false.

I think it's much clearer to make the function name reflect the
operation being performed on the value we're tracking (rather than on
the KnownBits Zero and One fields), so zext means the value is being
zero extended and new function anyext means the value is being extended
with unknown bits.

NFC.

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74482
2020-02-12 19:06:53 +00:00
Simon Pilgrim ff307c8120 [X86] combineFneg - generalize FMA negations with isNegatibleForFree/getNegatedExpression
This has a really interesting side effect in that it improves some UMAX/UMIN reduction code which had redundant XOR(SHUFFLE(XOR(X,SIGNMASK)),SIGNMASK) patterns - the getNegatibleCost recognises it as FNEG(SHUFFLE(FNEG(X))).... We have a lot of FNEG patterns bitcasted to the integer domain for XOR signbit twiddling which is similar to what we do to allow UMAX/UMIN to be lowered using SMAX/SMIN.

Differential Revision: https://reviews.llvm.org/D74231
2020-02-12 16:07:27 +00:00
Simon Pilgrim 9eb426c88c [TargetLowering] Add NegatibleCost enum for isNegatibleForFree return codes
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.

This patch replaces the char return value with an NegatibleCost enum to more clearly demonstrate what is implied.

It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect whats going on.

Differential Revision: https://reviews.llvm.org/D74221
2020-02-12 11:51:42 +00:00
Djordje Todorovic 97ed706a96 Revert "[DebugInfo] Enable the debug entry values feature by default"
This reverts commit rG9f6ff07f8a39.

Found a test failure on clang-with-thin-lto-ubuntu buildbot.
2020-02-12 11:59:04 +01:00