To lower this we now create a new V1 containing the low half of both sources and a new V2 containing the upper half of both sources. Then we created a repeated lane shuffle of those new sources to create the final result.
This fixes PR35833
Differential Revison: https://reviews.llvm.org/D41794
llvm-svn: 339818
As discussed on D41794, we have many cases where we fail to combine shuffles as the input operands have other uses.
This patch permits these shuffles to be combined as long as they don't introduce additional variable shuffle masks, which should reduce instruction dependencies and allow the total number of shuffles to still drop without increasing the constant pool.
However, this may mean that some memory folds may no longer occur, and on pre-AVX require the occasional extra register move.
This also exposes some poor PMULDQ/PMULUDQ codegen which was doing unnecessary upper/lower calculations which will in fact fold to zero/undef - the fix will be added in a followup commit.
Differential Revision: https://reviews.llvm.org/D50328
llvm-svn: 339335
The code tried to find the immediate by using getNumOperands() on the MachineInstr, but there might be implicit-defs after the immediate that get counted.
Instead use getNumOperands() from the instruction description which will only count the operands that are defined in the td file.
llvm-svn: 337088
If broadcasting from another shuffle, attempt to simplify it.
We can probably generalize this a lot more (embedding in combineX86ShufflesRecursively), but BROADCAST is one of the more troublesome as it accepts inputs of different sizes to the result.
llvm-svn: 323602
Add support for custom execution domain fixing and implement support for BLENDPD/BLENDPS/PBLENDD/PBLENDW.
Differential Revision: https://reviews.llvm.org/D42042
llvm-svn: 322524
We try to prevent shuffle combining to value types that would stop the folding of masked operations, but by just returning early, we were failing to try different shuffle types.
The TODOs are all still relevant here to improve codegen but we're lacking test examples.
llvm-svn: 321085
As mentioned in D38318 and D40865, modern Intel processors prefer to combine multiple shuffles to a variable shuffle mask (PSHUFB/VPERMPS etc.) instead of having multiple stage 'fixed' shuffles which put more pressure on Port 5 (at the expense of extra shuffle mask loads).
This patch provides a FeatureFastVariableShuffle target flag for Haswell+ CPUs that prefers combining 2 or more fixed shuffles to a single variable shuffle (default is 3 shuffles).
The long term aim is to drive more of this from schedule data (probably via the MC) but we're not close to being ready for that yet.
Differential Revision: https://reviews.llvm.org/D41323
llvm-svn: 321074
As part of the unification of the debug format and the MIR format, print
MBB references as '%bb.5'.
The MIR printer prints the IR name of a MBB only for block definitions.
* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#" << ([a-zA-Z0-9_]+)->getNumber\(\)/" << printMBBReference(*\1)/g'
* find . \( -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#" << ([a-zA-Z0-9_]+)\.getNumber\(\)/" << printMBBReference(\1)/g'
* find . \( -name "*.txt" -o -name "*.s" -o -name "*.mir" -o -name "*.cpp" -o -name "*.h" -o -name "*.ll" \) -type f -print0 | xargs -0 sed -i '' -E 's/BB#([0-9]+)/%bb.\1/g'
* grep -nr 'BB#' and fix
Differential Revision: https://reviews.llvm.org/D40422
llvm-svn: 319665
This allows masked operations to be used and allows the register allocator to use YMM16-31 if necessary.
As a follow up I'll look into teaching EVEX->VEX how to turn this back into PERM2X128 if any of the additional features don't work out.
llvm-svn: 317403
There's no advantage to using these instructions when they aren't masked. This enables some additional execution domain switching without needing to update the table.
llvm-svn: 315674
Summary:
We currently disable some converting of shuffles to MOVSS/MOVSD during legalization if SSE41 is enabled. But later during shuffle combining we go back to prefering MOVSS/MOVSD.
Additionally we have patterns that look for BLENDIs to detect scalar arithmetic operations. I believe due to the combining using MOVSS/MOVSD these are unnecessary.
Interestingly, we still codegen blend instructions even though lowering/isel emit movss/movsd instructions. Turns out machine CSE commutes them to blend, and then commuting those blends back into blends that are equivalent to the original movss/movsd.
This patch fixes the inconsistency in legalization to prefer MOVSS/MOVSD. The one test change was caused by this change. The problem is that we have integer types and are mostly selecting integer instructions except for the shufps. This shufps forced the execution domain, but the vpblendw couldn't have its domain changed with a naive instruction swap. We could fix this by special casing VPBLENDW based on the immediate to widen the element type.
The rest of the patch is removing all the excess scalar patterns.
Long term we should probably add isel patterns to make MOVSS/MOVSD emit blends directly instead of relying on the double commute. We may also want to consider emitting movss/movsd for optsize. I also wonder if we should still use the VEX encoded blendi instructions even with AVX512. Blends have better throughput, and that may outweigh the register constraint.
Reviewers: RKSimon, zvi
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D38023
llvm-svn: 315181
The shuffle combining and lowerVectorShuffleAsLanePermuteAndBlend were both still trying to use VPERM2XF128 for unary shuffles when AVX2 is enabled. VPERM2X128 takes two inputs meaning when we use it for a unary shuffle one of those inputs is left undefined creating a false dependency on whatever register gets allocated there.
If we have VPERMQ/PD we should prefer those since they only have a single input.
Differential Revision: https://reviews.llvm.org/D37947
llvm-svn: 313542
The early out for AVX2 in lowerV2X128VectorShuffle is positioned in a weird spot below some shuffle mask equivalency checks.
But I think we want to allow VPERMQ for any unary shuffle.
Differential Revision: https://reviews.llvm.org/D37893
llvm-svn: 313373
This is a patch for an on-going bugzilla bug 21281 on the generated X86 code for a matrix transpose8x8 subroutine which requires vector interleaving. The generated code in AVX2 is currently non-optimal and requires 60 instructions as opposed to only 40 instructions generated for AVX1.
The patch includes a fix for the AVX2 case where vector unpack instructions use less operations than the vector blend operations available in AVX2.
In this case using vector unpack instructions is more efficient.
Reviewers:
zvi
delena
igorb
craig.topper
guyblank
eladcohen
m_zuckerman
aymanmus
RKSimon
llvm-svn: 298840
Since r274013, we've been looking through bitcasts on broadcast inputs.
In the scalar-folding case (from a load, build_vector, or sc2vec),
the input type didn't matter, as we'd simply bitcast the resulting
scalar back.
However, when broadcasting a 128-bit-lane-aligned element, we create an
EXTRACT_SUBVECTOR. Use proper types, by creating an extract_subvector
of the original input type.
llvm-svn: 294774
There are cases of AVX-512 instructions that have two possible encodings. This is the case with instructions that use vector registers with low indexes of 0 - 15 and do not use the zmm registers or the mask k registers.
The EVEX encoding prefix requires 4 bytes whereas the VEX prefix can take only up to 3 bytes. Consequently, using the VEX encoding for these instructions results in a code size reduction of ~2 bytes even though it is compiled with the AVX-512 features enabled.
Reviewers: Craig Topper, Zvi Rackoover, Elena Demikhovsky
Differential Revision: https://reviews.llvm.org/D27901
llvm-svn: 290663
Add the missing domain equivalences for movss, movsd, movd and movq zero extending loading instructions.
Differential Revision: https://reviews.llvm.org/D27684
llvm-svn: 289825
Summary:
The index and one of the table operands can be swapped by changing the opcode to the other version. Neither of these operands are the one that can load from memory so this can't be used to increase memory folding opportunities.
We need to handle the unmasked forms and the kz forms. Since the load operand isn't being commuted we can commute the load and broadcast instructions too.
Reviewers: igorb, delena, Ayal, Farhana, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D25652
llvm-svn: 287621
Summary: VALIGND and VALIGNQ are similar to PALIGNR but instead of working on a 128-bit lane they work on the entire vector register. This change leverages the shuffle rotate detection code used for PALIGNR to detect these cases.
Reviewers: delena, RKSimon
Subscribers: Farhana, llvm-commits
Differential Revision: https://reviews.llvm.org/D26297
llvm-svn: 286709
As discussed on PR28136, lowerShuffleAsRepeatedMaskAndLanePermute was attempting to match repeated masks at the 128-bit level and then permute the resultant lanes at the 128-bit (AVX1) or 64-bit (AVX2) sub-lane level.
This change allows us to create the repeated masks at the sub-lane level (and then concat them together to create a 128-bit repeated mask) and then select which sub-lane to permute. This has no effect on the AVX1 codegen.
Fixes PR28136.
llvm-svn: 275543
This improves the situation discussed in D19228 where we were forcing VPERMPD/VPERMQ where VPERM2F128/VPERM2I128 would have been better.
This was incorrectly reverted in rL275421 during triage of PR28552.
llvm-svn: 275497
This improves the situation discussed in D19228 where we were forcing VPERMPD/VPERMQ where VPERM2F128/VPERM2I128 would have been better.
llvm-svn: 275411
This patch allows target shuffles to be combined to single input immediate permute instructions - (V)PSHUFD/VPERMILPD/VPERMILPS - allowing more general pattern matching than what we current do and improves the likelihood of memory folding compared to existing patterns which tend to reuse the input in multiple arguments.
Further permute instructions (V)PSHUFLW/(V)PSHUFHW/(V)PERMQ/(V)PERMPD may be added in the future but its proven tricky to create tests cases for them so far. (V)PSHUFLW/(V)PSHUFHW is already handled quite well in combineTargetShuffle so it may be that removing some of that code may allow us to perform more of the combining in one place without duplication.
Differential Revision: http://reviews.llvm.org/D21148
llvm-svn: 273999
Using VPERMQ/VPERMPD allows memory folding of the (repeated) input where VINSERTI128/VINSERTF128 can not.
Differential Revision: http://reviews.llvm.org/D19228
llvm-svn: 266728
This patch attempts to represent a shuffle as a repeating shuffle (recognisable by is128BitLaneRepeatedShuffleMask) with the source input(s) in their original lanes, followed by a single permutation of the 128-bit lanes to their final destinations.
On AVX2 we can additionally attempt to match using 64-bit sub-lane permutation. AVX2 can also now match a similar 'broadcasted' repeating shuffle.
This patch has several benefits:
* Avoids prematurely matching with lowerVectorShuffleByMerging128BitLanes which can require both inputs to have their input lanes permuted before shuffling.
* Can replace PERMPS/PERMD instructions - although these are useful for cross-lane unary shuffling, they require their shuffle mask to be pre-loaded (and increase register pressure).
* Matching the repeating shuffle makes use of a lot of existing shuffle lowering.
There is an outstanding minor AVX1 regression (combine_unneeded_subvector1 in vector-shuffle-combining.ll) of a previously 128-bit shuffle + subvector splat being converted to a subvector splat + (2 instruction) 256-bit shuffle, I intend to fix this in a followup patch for review.
Differential Revision: http://reviews.llvm.org/D16537
llvm-svn: 260834
AVX2 can only broadcast from the zero'th element of a vector, but if the broadcastable element is the zero'th element of a 128-bit subvector its advantageous to extract the subvector, broadcast from that and avoid the loading of shuffle mask data that would be needed for VPERMPS/VPERMD. The only exception being when the source type is 4f64 or 4i64 which can directly use the immediate shuffle VPERMPD/VPERMQ directly.
Differential Revision: http://reviews.llvm.org/D16050
llvm-svn: 258081