Merge the INSERT_VECTOR_ELT/SCALAR_TO_VECTOR and PINSRW/PINSRB shuffle mask paths - they both do the same thing (find the source vector and handle the implicit zero extension). The PINSRW/PINSRB path also handled the insertion-of-zero case, which needed to be added to the general case as well.
This is a little more complicated than I'd like it to be. We have
to manually match a trunc+srl+load pattern that generic DAG
combine won't do for us due to isTypeDesirableForOp.
If we're extracting the 0'th index of a v16i8 vector we're better off using MOVD than PEXTRB, unless we're storing the value or we require the implicit zero extension of PEXTRB.
The biggest perf diff is on SLM targets where MOVD (uops=1, lat=3, tp=1) is notably faster than PEXTRB (uops=2, lat=5, tp=4).
This matches what we already do for PEXTRW.
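As a rough standalone illustration (intrinsics, not the combine itself; the
function names are made up), the two ways of getting element 0 out of a v16i8:

  #include <immintrin.h>
  #include <cstdint>

  // Illustrative only: element 0 of a v16i8. When the zero-extended i32 result
  // of PEXTRB isn't actually needed, reading the whole low dword with MOVD and
  // truncating in the scalar domain is the cheaper option on targets like SLM.
  uint8_t extract_elt0_movd(__m128i v) {
    return static_cast<uint8_t>(_mm_cvtsi128_si32(v)); // MOVD + scalar truncate
  }

  uint8_t extract_elt0_pextrb(__m128i v) {
    return static_cast<uint8_t>(_mm_extract_epi8(v, 0)); // PEXTRB (SSE4.1)
  }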
Differential Revision: https://reviews.llvm.org/D76138
For i32 and i64 cases, X86ISD::SHLD/SHRD are close enough to ISD::FSHL/FSHR that we can use them directly, we just need to account for the operand commutation for SHRD.
The i16 SHLD/SHRD case is annoying as the shift amount is modulo-32 (vs the funnel shift's modulo-16), so I've added X86ISD::FSHL/FSHR equivalents, which match the generic implementation in all other respects.
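As a minimal reference model of the i16 semantics (plain C++, not the lowering
code):

  #include <cstdint>

  // ISD::FSHL on i16: concatenate Hi:Lo, shift left by Amt modulo 16, keep the
  // high 16 bits. A 16-bit SHLD reduces the count modulo 32 instead, so counts
  // in the 16-31 range wouldn't match this - hence the X86ISD::FSHL/FSHR nodes.
  uint16_t fshl_i16(uint16_t Hi, uint16_t Lo, unsigned Amt) {
    Amt %= 16;
    if (Amt == 0)
      return Hi; // avoid the out-of-range (16 - Amt) shift below
    return static_cast<uint16_t>((Hi << Amt) | (Lo >> (16 - Amt)));
  }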
Something I'm slightly concerned with is that ISD::FSHL/FSHR legality is controlled by the Subtarget.isSHLDSlow() feature flag - we don't normally use non-ISA features for this, but it allows the DAG combines to continue to operate after legalization in a lot more cases.
The X86 *bits.ll changes are all affected by the same issue - we now have a "FSHR(-1,-1,amt) -> ROTR(-1,amt) -> (-1)" simplification that reduces the dependencies enough that the branch fall-through code gets messed up.
Differential Revision: https://reviews.llvm.org/D75748
For pre-AVX512 targets, combine binary shuffles to X86ISD::VPERM2X128 if possible. This mainly helps optimize the blend(extract_subvector(x,1),y) pattern.
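As a rough intrinsics-level sketch of the kind of pattern this catches (not
the exact DAG nodes; the function name is made up):

  #include <immintrin.h>

  // Illustrative only (AVX2): low half of the result from X's upper 128-bit
  // lane, high half from Y's upper lane. As extract_subvector + blend this is
  // two shuffles; a single vperm2i128/vperm2f128 covers it.
  __m256i upper_x_upper_y(__m256i X, __m256i Y) {
    // imm 0x31: result[127:0] = X[255:128], result[255:128] = Y[255:128]
    return _mm256_permute2x128_si256(X, Y, 0x31);
  }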
At some point soon we're going to have to make a decision about when to combine AVX512 shuffles more aggressively - we bail out if there is any change in element size (to protect predicate mask merging) which means we miss out on a lot of optimizations.
isTypeDesirableForOp prevents loads from being shrunk to i16 by DAG
combine. Because of this we can't just match the broadcast and a
scalar load. So look for broadcast+truncate+load and form a
vbroadcast_load during DAG combine. This replaces what was
previously done as an isel pattern and I think fixes it so we
won't change the size of a volatile load. But my main motivation
is just to clean up our isel patterns.
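A rough source-level analogue of the pattern (the function and pointer names
are made up):

  #include <immintrin.h>
  #include <cstdint>

  // The 32-bit load is truncated to 16 bits and then splatted. Generic DAG
  // combine won't shrink the load because of isTypeDesirableForOp, so the X86
  // combine now matches broadcast(trunc(load)) itself; a pattern like this can
  // then become a single 16-bit broadcast-load (e.g. vpbroadcastw from memory).
  __m256i splat_low16(const uint32_t *p) {
    return _mm256_set1_epi16(static_cast<int16_t>(*p));
  }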
If we don't need the upper subvector elements of the BLENDI node then use a smaller vector size.
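Roughly, in intrinsics terms (AVX2, function names made up):

  #include <immintrin.h>

  // Illustrative only: if nothing reads the upper half of the 256-bit blend,
  // the same value comes out of a 128-bit blend of the low halves, which is
  // the narrowing now done at the BLENDI node level.
  __m128i low_of_wide_blend(__m256i X, __m256i Y) {
    __m256i Wide = _mm256_blend_epi32(X, Y, 0x0F); // low four dwords from Y
    return _mm256_castsi256_si128(Wide);           // upper half never used
  }

  __m128i low_blend_narrowed(__m256i X, __m256i Y) {
    return _mm_blend_epi32(_mm256_castsi256_si128(X),
                           _mm256_castsi256_si128(Y), 0x0F);
  }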
This causes a couple of minor regressions in insertelement-ones.ll which are more examples of PR26018; given how cheap allones generation is, I don't consider that a showstopper, just an annoyance (and there are plenty of other poor codegen cases in that file).
If we're inserting a scalar that is smaller than the element
size of the final VT, the value of the extra bits doesn't matter.
Previously we any_extended the value in the scalar domain before inserting.
This patch changes this to use a broadcast of the original
scalar type and then a bitcast to the final type. This might
enable the use of a broadcast load.
This recovers regressions from 07d68c24aa
and 9fcd212e2f without relying on
alignment of the load.
Differential Revision: https://reviews.llvm.org/D75835
In 172eee9c, we tried to avoid these unreachable SP adjustments by
modelling the callee as internally resetting the stack pointer.
However, for the majority of functions with reserved stack frames, this
would lead LLVM to emit extra SP adjustments to undo the callee's
internal adjustment. This led us to fix the problem further down the
pipeline in eliminateCallFramePseudoInstr. In 5b79e603d3, I added
a heuristic to try to detect when the adjustment would be
unreachable.
This heuristic is imperfect, and when exception handling is involved, it
fails to fire. The new test is an example of this. Simply throwing an
exception with an active cleanup emits dead SP adjustments after the
throw. Not only are they dead, but if they were executed, they would be
incorrect, so they are confusing.
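The shape of that scenario, in source terms (hand-written sketch, not the
committed test):

  // An active cleanup at the point of a throw: the destructor of 'c' must run
  // during unwinding, and everything after the throw call itself is
  // unreachable, so any SP adjustment emitted there is dead.
  struct Cleanup {
    ~Cleanup(); // defined elsewhere so the cleanup isn't optimized away
  };
  void thrower() {
    Cleanup c;
    throw 42;
  }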
This change essentially reverts 172eee9c and makes the 5b79e603d3
heuristic responsible for preventing unreachable stack adjustments. This
means we may emit unreachable stack adjustments for functions using EH
with unreserved call frames, but those are not very common these days. Back
in 2016 when this change was added, we were focused on 32-bit, which we
observed to have fewer reserved frames.
Fixes PR45064
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D75712
The original code could create a bitcast from f64 to i64 and back
on 32-bit targets. This was only working because getBitcast was
able to fold the casts away to avoid leaving the illegal i64 type.
Now we handle the scalar case directly by broadcasting using the
scalar type as the element type, then bitcasting to the final VT.
This works since we ensure the scalar type is the same size as
the final VT element type. No more casts to i64.
For the vector case, we cast to VT or a subvector of VT and then
do the broadcast.
I think this all matches what we generated before, just in a more
readable way.
Previously we tried to promote these to xmm/ymm/zmm by promoting
in the X86CallingConv.td file. But this breaks when we run out
of xmm/ymm/zmm registers and need to fall back to memory. We end
up trying to create a nonsensical scalar-to-vector. This led
to an assertion. The new tests in avx512-calling-conv.ll all
trigger this assertion.
Since we really want to treat these types like we do on avx2,
it seems better to promote them before the calling convention
code gets involved, except when the calling convention is one
that passes the vXi1 type in a k register.
The changes in avx512-regcall-Mask.ll are because we indicated
that xmm/ymm/zmm types should be passed indirectly for the
Win64 ABI before we got to the common lines that promoted the
vXi1 types. This caused the promoted types to be picked up by
the default calling convention code. Now we promote them earlier
so they get passed indirectly as though they were xmm/ymm/zmm.
Differential Revision: https://reviews.llvm.org/D75154
If we would emit a VBROADCAST node, we can instead directly emit
a VBROADCAST_LOAD. This allows us to get rid of the special case
to use an f64 load on 32-bit targets for vXi64.
I believe there is more cleanup we can do later in this function,
but I'll do that in follow ups.
If we go with D75412, we no longer depend on the scalar type directly, so we don't need to avoid using i64. We already have AVX1 fallback patterns with i32 and i64 scalar types, so we don't need to avoid using integer types on AVX1 either.
Differential Revision: https://reviews.llvm.org/D75413
Also add a DAG combine to merge different-sized broadcasts from the
constant pool to avoid a regression.
Differential Revision: https://reviews.llvm.org/D75412
The build_vector needs to be the only user of the data result, but the
chain result will likely have another use. So we can't just check that
the build_vector is the only user of the whole node.
These AddToWorklist calls were added in 84cd968f75.
It's possible the SimplifyDemandedBits/SimplifyDemandedVectorElts
triggered CSE that deleted N. Detect that and avoid adding N
to the worklist.
Fixes PR45067.
Previously this code was called in two ways: either a FrameIndexSDNode
was passed in StackSlot, or a load node was passed in the argument
called StackSlot. This was determined by a dyn_cast to FrameIndexSDNode.
In the case of a load, we had to go find the real pointer from
operand 0 and cast the node to MemSDNode to find the pointer info.
For the stack slot case, the code assumed that the stack slot
was perfectly aligned despite not being the creator of the slot.
This commit modifies the interface to make the caller responsible
for passing all of the required information to avoid all the
guesswork and reverse engineering.
I'm not aware of any issues with the original code after an
earlier commit to fix the alignment of one of the stack objects.
This is just cleanup to make the code less surprising.
This node reads the rounding control, which means it needs to be ordered properly with operations that change the rounding control. So it needs to be chained to maintain order.
This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've updated all in-tree targets to connect their chain through their lowering code.
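For a standalone example of the ordering this buys (plain C++, not the
lowering code):

  #include <cfenv>
  #include <cfloat>

  // The FLT_ROUNDS read of the rounding control must not be scheduled before
  // the fesetround call that changes it; chaining the node is what lets
  // SelectionDAG preserve that order.
  int rounding_mode_after_change() {
    std::fesetround(FE_UPWARD); // change the rounding control
    return FLT_ROUNDS;          // must observe the new mode, not the stale one
  }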
Differential Revision: https://reviews.llvm.org/D75132
Instead add it when we make the machine nodes during instruction
selection.
This makes this ISD node closer to ISD::MGATHER. Trying to see
if we can remove the X86-specific ones.