We're relying on the source inputs for shuffle combining having already been widened to the root size (otherwise the offset logic falls over) - we're going to be supporting different sized shuffle inputs soon, so we need to explicitly make the minimum widened width the original root size.
We allow insert_subvector lowering of all legal types, so don't always cast to the vXi64/vXf64 shuffle types - this is only necessary for X86ISD::SHUF128/X86ISD::VPERM2X128 patterns later.
Simplify vperm2x128(concat(X,Y),concat(Z,W)) folding.
Use collectConcatOps / ISD::INSERT_SUBVECTOR to find the source subvectors instead of hardcoded immediate matching.
combineX86ShufflesConstants/canonicalizeShuffleMaskWithHorizOp can both handle/earlyout shuffles with inputs of different widths, so delay widening as late as possible to make it easier to match constant folds etc.
The plan is to eventually move the widening inside combineX86ShuffleChain so that we don't create any new nodes unless we successfully combine the shuffles.
rGbe69e66b1cd8 added the fold, but DAGCombiner.visitVECTOR_SHUFFLE doesn't merge shuffles if the inner shuffle is a splat, so we need to bail.
The non-fast-horiz-ops paths see some minor regressions, we might be able to improve on this after lowering to target shuffles.
Fix PR48823
We already have an experimental option to tune loop alignment. Its impact
is very wide (and there is a suspicion that it's not always profitable). We want
to have something more narrow to play with. This patch adds similar option that
overrides preferred alignment for innermost loops. This is for experimental
purposes, default values do not change the existing behavior.
Differential Revision: https://reviews.llvm.org/D94895
Reviewed By: pengfei
We already handle "vperm2x128 (ins ?, X, C1), (ins ?, X, C1), 0x31" for shuffling of the upper subvectors, but we weren't dealing with the case when we were splatting the upper subvector from a single source.
As discussed on D56387, if we're shifting to extract the upper/lower half of a vXi64 vector then we're actually better off performing this at the subvector level as its very likely to fold into something.
combineConcatVectorOps can perform this in reverse if necessary.
If a srl doesn't introduce any sign bits into the truncated result, then replace with a sra to let us use a PACKSS truncation - fixes a regression noticed in D56387 on pre-SSE41 targets that don't have PACKUSDW.
Specify LHS/RHS operands in matchShuffleWithUNPCK's calls to isTargetShuffleEquivalent, and handle VBROADCAST/VBROADCAST_LOAD matching in IsElementEquivalent
If this will help us fold shuffles together, then push the shuffle through the merged binops.
Ideally this would be performed in DAGCombiner::visitVECTOR_SHUFFLE but getting an efficient+legal merged shuffle can be tricky - on SSE we can be confident that for 32/64-bit elements vectors shuffles should easily fold.
See if we can remove the shuffle by resorting a HOP chain so that the HOP args are pre-shuffled.
This initial version just handles (the most common) v4i32/v4f32 hadd/hsub reduction patterns - future work can extend this to v8i16 types plus PACK chains (2f64 HADD/HSUB should already be handled in the half-lane combine code later on).
canonicalizeShuffleMaskWithHorizOp currently only supports shuffles with 1 or 2 sources, but PR41813 will require us to support higher numbers of sources.
This patch just generalizes the initial setup stages to ensure all src ops are the same type and opcode and then will continue to early out if we have more than 2 sources.
rG73a44f437bf1 result in 256-bit packss/packus ops with additional shuffles that shuffle combining can sometimes try to convert back into a truncation.
Adapted from D54696 by @nikic.
This patch improves lowering of saturating float to
int conversions, FP_TO_[SU]INT_SAT, for X86.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D86079
We can't easily treat ASHR a faux shuffle, but if it was just feeding a PACKSS then it was likely being used as sign-extension for a truncation, so just peek through and adjust the mask accordingly.
v16i32 -> v16i16/v8i16 truncation is now good enough using PACKSS/PACKUS + shuffle combining that its no longer necessary to early-out on pre-AVX512BW targets.
This was noticed while looking at completing PR40111 and moving combineSubToSubus to DAGCombine entirely.
SSE2 truncation codegen has improved over the past few years (mainly due to better shuffle lowering/combining and computeKnownBits) - its no longer necessary to early-out from v8i32/v8i64 truncations.
This was noticed while looking at completing PR40111 and moving combineSubToSubus to DAGCombine entirely.
If vpermf128/vpermi128 is acting on 2 similar 'inlane' ops, then try to perform the vpermf128 first which will allow us to merge the ops.
This will help us fix one of the regressions in D56387
This reapplies commit rG80dee7965dffdfb866afa9d74f3a4a97453708b2.
[X86][SSE] Fold unpack(hop(),hop()) -> permute(hop())
UNPCKL/UNPCKH only uses one op from each hop, so we can merge the hops and then permute the result.
REAPPLIED with a fix for unary unpacks of HOP.
AVX512 has fast truncation ops, but if the truncation source is a concatenation of subvectors then its likely that we can use PACK more efficiently.
This is only guaranteed to work for truncations to 128/256-bit vectors as the PACK works across 128-bit sub-lanes, for now I've just disabled 512-bit truncation cases but we need to get them working eventually for D61129.
When building abseil-cpp `bin/absl_hash_test` with Clang in -fno-pic
mode, an instruction like `movl $foo-2147483648, $eax` may be produced
(subtracting a number from the address of a static variable). If foo's
address is smaller than 2147483648, GNU ld/gold/LLD will error because
R_X86_64_32 cannot represent a negative value.
```
using absl::Hash;
struct NoOp {
template < typename HashCode >
friend HashCode AbslHashValue(HashCode , NoOp );
};
template <typename> class HashIntTest : public testing::Test {};
TYPED_TEST_SUITE_P(HashIntTest);
TYPED_TEST_P(HashIntTest, BasicUsage) {
if (std::numeric_limits< TypeParam >::min )
EXPECT_NE(Hash< NoOp >()({}),
Hash< TypeParam >()(std::numeric_limits< TypeParam >::min()));
}
REGISTER_TYPED_TEST_CASE_P(HashIntTest, BasicUsage);
using IntTypes = testing::Types< int32_t>;
INSTANTIATE_TYPED_TEST_CASE_P(My, HashIntTest, IntTypes);
ld: error: hash_test.cc:(function (anonymous namespace)::gtest_suite_HashIntTest_::BasicUsage<int>::TestBody(): .text+0x4E472): relocation R_X86_64_32 out of range: 18446744071564237392 is not in [0, 4294967295]; references absl::hash_internal::HashState::kSeed
```
Actually any negative offset is not allowed because the symbol address
can be zero (e.g. set by `-Wl,--defsym=foo=0`). So disallow such folding.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D93931
The x86_amx is used for AMX intrisics. <256 x i32> is bitcast to x86_amx when
it is used by AMX intrinsics, and x86_amx is bitcast to <256 x i32> when it
is used by load/store instruction. So amx intrinsics only operate on type x86_amx.
It can help to separate amx intrinsics from llvm IR instructions (+-*/).
Thank Craig for the idea. This patch depend on https://reviews.llvm.org/D87981.
Differential Revision: https://reviews.llvm.org/D91927
Followup to D92645 - remove the remaining places where we create X86ISD::SUBV_BROADCAST, and fold splatted vector loads to X86ISD::SUBV_BROADCAST_LOAD instead.
Remove all the X86SubVBroadcast isel patterns, including all the fallbacks for if memory folding failed.
This was needed in an earlier version of D92645, but isn't now - and I've just noticed that it was potentially flawed depending on the relevant widths of the broadcasted and extracted subvectors.
Subvector broadcasts are only load instructions, yet X86ISD::SUBV_BROADCAST treats them more generally, requiring a lot of fallback tablegen patterns.
This initial patch replaces constant vector lowering inside lowerBuildVectorAsBroadcast with direct X86ISD::SUBV_BROADCAST_LOAD loads which helps us merge a number of equivalent loads/broadcasts.
As well as general plumbing/analysis additions for SUBV_BROADCAST_LOAD, I needed to wrap SelectionDAG::makeEquivalentMemoryOrdering so it can handle result chains from non generic LoadSDNode nodes.
Later patches will continue to replace X86ISD::SUBV_BROADCAST usage.
Differential Revision: https://reviews.llvm.org/D92645
X86 and AArch64 expand it as libcall inside the target. And PowerPC also
want to expand them as libcall for P8. So, propose an implement in the
legalizer to common the logic and remove the code for X86/AArch64 to
avoid the duplicate code.
Reviewed By: Craig Topper
Differential Revision: https://reviews.llvm.org/D91331
Since these are all working on reduction patterns, actually use that term in the function name to make them easier to search for.
At some point we're likely to start working with the ISD::VECREDUCE_* opcodes directly in the x86 backend, but that is still some way off.
As discussed on D92645, we don't do a good job of recognising when we don't require the full width of a ymm/zmm build vector because the upper elements are undef/zero.
This commit allows us to make use of implicit zeroing of upper elements with AVX instructions, which we emulate in DAG with a INSERT_SUBVECTOR into the bottom of a undef/zero vector of the original type.
This exposed a limitation in getTargetConstantBitsFromNode which didn't extract bits from INSERT_SUBVECTORs of different element widths which I've included as well to prevent a couple of regressions.
The X86-64 ABI defines va_list as
typedef struct {
unsigned int gp_offset;
unsigned int fp_offset;
void *overflow_arg_area;
void *reg_save_area;
} va_list[1];
This means the size, alignment, and reg_save_area offset will depend on
whether we are in LP64 or in ILP32 mode, so this commit adds the checks.
Additionally, the VAARG_64 pseudo-instruction assumed 64-bit pointers, so
this commit adds a VAARG_X32 pseudo-instruction that behaves just like
VAARG_64, except for assuming 32-bit pointers.
Some of these changes were originally done by
Michael Liao <michael.hliao@gmail.com>.
Fixes https://bugs.llvm.org/show_bug.cgi?id=48428.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D93160