It's proving tricky to move this to the generic legalizer code, so manually insert the v2i32 subvector into v4i32, insert the AssertSext/AssertZext node, then extract the subvector again.
This avoids masks in the truncation/pack code, which means we avoid a PSHUFB in the fp_to_sint/uint code for sub-128 bit types (specific targets can still combine the packs to a pshufb if they have fast variable per-lane shuffles).
This was noticed when I was trying to improve fp_to_sint/uint costs with D103695 (and some targets had very high fp_to_sint costs due to the PSHUFB), so we can then update the fp_to_uint codegen from D89697.
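For reference, a rough sketch of the widen/assert/extract sequence described above, written against the generic SelectionDAG API (the helper name, asserted type and signedness handling are illustrative assumptions, not the committed code):

  #include "llvm/CodeGen/SelectionDAG.h"
  using namespace llvm;

  static SDValue assertExtViaWiderVector(SelectionDAG &DAG, const SDLoc &DL,
                                         SDValue V2i32, EVT AssertVT,
                                         bool IsSigned) {
    // Insert the v2i32 value into the low half of an undef v4i32.
    SDValue Wide = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, MVT::v4i32,
                               DAG.getUNDEF(MVT::v4i32), V2i32,
                               DAG.getIntPtrConstant(0, DL));
    // Attach the AssertSext/AssertZext on the widened value.
    unsigned Opc = IsSigned ? ISD::AssertSext : ISD::AssertZext;
    SDValue Assert =
        DAG.getNode(Opc, DL, MVT::v4i32, Wide, DAG.getValueType(AssertVT));
    // Extract the original v2i32 subvector again.
    return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v2i32, Assert,
                       DAG.getIntPtrConstant(0, DL));
  }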
This patch fixes PR50823.
The shuffle mask needs to be twisted twice to get the correct one, due to the difference between the inner and outer HOPs.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D104903
This demonstrates a possible fix for PR48760 - for compares with constants, canonicalize the SGT/UGT condition code to use SGE/UGE, which should reduce the number of EFLAGS bits we need to read.
As discussed on PR48760, some EFLAG bits are treated independently which can require additional uops to merge together for certain CMOVcc/SETcc/etc. modes.
I've limited this to cases where the constant increment doesn't result in a larger encoding or additional i64 constant materializations.
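For illustration, a simplified generic-DAG sketch of the constant adjustment (the patch itself operates on X86 condition codes and adds the encoding/materialization checks mentioned above; the helper name is an assumption):

  #include "llvm/CodeGen/SelectionDAG.h"
  using namespace llvm;

  // (setgt X, C) --> (setge X, C+1), as long as C+1 does not overflow.
  static SDValue canonicalizeSGTToSGE(SelectionDAG &DAG, const SDLoc &DL,
                                      EVT VT, SDValue X, SDValue C) {
    auto *CN = dyn_cast<ConstantSDNode>(C);
    if (!CN || CN->getAPIntValue().isMaxSignedValue())
      return SDValue();
    SDValue CPlus1 =
        DAG.getConstant(CN->getAPIntValue() + 1, DL, C.getValueType());
    return DAG.getSetCC(DL, VT, X, CPlus1, ISD::SETGE);
  }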
Differential Revision: https://reviews.llvm.org/D101074
Don't allow vectors to be split into GPRs for 'r' and other scalar
constraints. This prevents an assertion in getCopyToPartsVector.
Makes PR50907 give a better error instead of crashing.
This is a mechanical change. This actually also renames the
similarly named methods in the SmallString class; however, these
methods don't seem to be used outside of the llvm subproject, so
this doesn't break building of the rest of the monorepo.
It looks like the fold introduced in 63f3383ece can cause crashes
if the type of the bitcasted value is not a valid vector element type,
like x86_mmx.
To resolve the crash, reject invalid vector element types. The way it is
done in the patch is a bit clunky. Perhaps there's a better way to
check?
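For reference, a minimal hedged sketch of the kind of guard being described (not the committed check, which as noted above is clunkier):

  #include "llvm/CodeGen/ValueTypes.h"
  using namespace llvm;

  // Only scalar integer/FP types can act as vector elements, so the fold
  // should bail out on types like MVT::x86mmx.
  static bool isUsableAsVectorElement(EVT VT) {
    return VT.isScalarInteger() || (VT.isFloatingPoint() && !VT.isVector());
  }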
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D104792
select (cmpeq Cond0, Cond1), LHS, (select (cmpugt Cond0, Cond1), LHS, Y) --> (select (cmpuge Cond0, Cond1), LHS, Y)
etc.
We already perform this fold in DAGCombiner for MVT::i1 comparison results, but these can still appear after legalization (in x86 case with MVT::i8 results), where we need to be more careful about generating new comparison codes.
Pulled out of D101074 to help address the remaining regressions.
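A scalar sanity check of the fold shown above (plain C++, just to illustrate why merging the EQ and UGT arms into a single UGE is sound):

  #include <cassert>
  #include <cstdint>

  int main() {
    for (uint32_t A = 0; A < 4; ++A)
      for (uint32_t B = 0; B < 4; ++B) {
        int LHS = 1, Y = 2;
        int Original = (A == B) ? LHS : ((A > B) ? LHS : Y);
        int Folded = (A >= B) ? LHS : Y; // cmpuge selects the same value
        assert(Original == Folded);
      }
    return 0;
  }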
Differential Revision: https://reviews.llvm.org/D104707
Since this method can apply to cmpxchg operations, make sure it's clear
what value we're actually retrieving. This will help ensure we don't
accidentally ignore the failure ordering of cmpxchg in the future.
We could potentially introduce a getOrdering() method on AtomicSDNode
that asserts the operation isn't cmpxchg, but not sure that's
worthwhile.
Differential Revision: https://reviews.llvm.org/D103338
As per the discussion in D103818, so far, this does not appear to be worthwhile.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D103818
Fixes:
- PR36507 Floating point varargs are not handled correctly with
-mno-implicit-float
- PR48528 __builtin_va_start assumes it can pass SSE registers
when using -Xclang -msoft-float -Xclang -no-implicit-float
On x86_64, floating-point parameters are normally passed in XMM
registers. For va_start, we spill those to memory so va_arg can
find them. There is an interaction here with -msoft-float and
-no-implicit-float:
When -msoft-float is in effect, instead of passing floating-point
parameters in XMM registers, they are passed in general-purpose
registers.
When -no-implicit-float is in effect, it "disables implicit
floating-point instructions" (per the LangRef). The intended
effect is to not have the compiler generate floating-point code
unless explicit floating-point operations are present in the
source code, but what exactly counts as an explicit floating-point
operation is not specified. The existing behavior of LLVM here has
led to some surprises and PRs.
This change modifies the behavior as follows:
| soft | no-implicit | old behavior | new behavior |
| no | no | spill XMM regs | spill XMM regs |
| yes | no | don't spill XMM | don't spill XMM |
| no | yes | don't spill XMM | spill XMM regs |
| yes | yes | assert | don't spill XMM |
In particular, this avoids the assert that happens when
-msoft-float and -no-implicit-float are both in effect. This
seems like a perfectly reasonable combination: If we don't want
to rely on hardware floating-point support, we want to both
avoid using float registers to pass parameters and avoid having
the compiler generate floating-point code that wasn't in the
original program. Instead of crashing the compiler, the new
behavior is to not synthesize floating-point code in this
case. This fixes PR48528.
The other interesting case is when -no-implicit-float is in
effect, but -msoft-float is not. In that case, any floating-point
parameters that are present will be in XMM registers, and so we
have to spill them to correctly handle those. This fixes
PR36507. The spill is conditional on %al indicating that
parameters are present in XMM registers, so no floating-point
code will be executed unless the function is called with
floating-point parameters.
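A minimal example of the PR36507-style scenario described above (source only; the relevant flags are the ones discussed in the text): with -mno-implicit-float, the incoming XMM arguments still have to be spilled so va_arg can find them.

  #include <cstdarg>

  double sum_doubles(int N, ...) {
    va_list AP;
    va_start(AP, N); // relies on the conditional XMM spill described above
    double Total = 0.0;
    for (int I = 0; I < N; ++I)
      Total += va_arg(AP, double); // explicit FP in the source code
    va_end(AP);
    return Total;
  }

  int main() { return sum_doubles(2, 1.5, 2.5) == 4.0 ? 0 : 1; }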
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D104001
Changing vector element type doesn't work for v6i32->v6i16 now
that v6i32 is an MVT and v6i16 is not.
I would like to fix this in changeVectorElementType, but you
need an LLVMContext to call getVectorVT, which we can't get from
an MVT.
Fixes PR50709.
Fixes crash reported here https://reviews.llvm.org/D73607
Using a store to keep the trunc intact. Returning v16i24 would
cause the trunc to be optimized away in SelectionDAGBuilder.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D103940
Don't require a specific kind of IRBuilder for TargetLowering hooks.
This allows us to drop the IRBuilder.h include from TargetLowering.h.
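A hedged sketch of what the looser requirement looks like for a hook (the hook and struct names are made up for illustration): anything that only needs IRBuilderBase works with whichever IRBuilder<> specialization the caller happens to have.

  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  struct ExampleLoweringHook {
    // Accepts the abstract IRBuilderBase instead of a concrete IRBuilder<>.
    Value *emitExample(IRBuilderBase &Builder, Value *V) const {
      return Builder.CreateFreeze(V); // any IRBuilderBase API is available
    }
  };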
Differential Revision: https://reviews.llvm.org/D103759
We were hitting an issue when the scalar_to_vector source was being implicitly truncated (in this case, i8 to vXi1), but we were also using the i8 source in a broadcast to a vXi8 value.
Fixes PR50374
Currently, X86 backend only has a global one-size-fits-all `FeatureFastVariableShuffle` feature,
which controls profitability of both the cross-lane and per-lane variable shuffles.
I guess this has been fine so far.
But at least on AMD Zen 3, per-lane variable shuffles (e.g. `VPSHUFB`)
are as fast as shuffles with a fixed/immediate mask,
while lane-crossing shuffles such as `VPERMPS` perform worse.
So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles,
as suggested by @RKSimon, split the feature flag into two.
Differential Revision: https://reviews.llvm.org/D103274
SwiftTailCC has a different set of requirements than the C calling convention
for a tail call. The exact argument sequence doesn't have to match, but fewer
ABI-affecting attributes are allowed.
Also make sure the musttail diagnostic triggers if a musttail call isn't
actually a tail call.
We could previously do this by accident through the later
call to getTargetConstantBitsFromNode I think, but that only worked
if N0 had a single use. This patch makes it explicit for undef and
doesn't have a use count check.
I think this is needed to move the (shl X, 1)->(add X, X)
fold to isel for PR50468. We need to be sure X won't be IMPLICIT_DEF
which might prevent the same vreg from being used for both operands.
Differential Revision: https://reviews.llvm.org/D103192
D88631 added initial support for:
- -mstack-protector-guard=
- -mstack-protector-guard-reg=
- -mstack-protector-guard-offset=
flags, and D100919 extended these to AArch64. Unfortunately, these flags
aren't retained for LTO. Make them module attributes rather than
TargetOptions.
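A hedged sketch of the module-attribute approach (the flag names, merge behavior and helper are illustrative assumptions, not necessarily the committed spellings):

  #include "llvm/IR/Module.h"
  using namespace llvm;

  static void recordStackProtectorGuard(Module &M, StringRef Guard,
                                        StringRef Reg, unsigned Offset) {
    LLVMContext &Ctx = M.getContext();
    // Stored on the module so LTO links/merges them instead of dropping them.
    M.addModuleFlag(Module::Error, "stack-protector-guard",
                    MDString::get(Ctx, Guard));
    M.addModuleFlag(Module::Error, "stack-protector-guard-reg",
                    MDString::get(Ctx, Reg));
    M.addModuleFlag(Module::Error, "stack-protector-guard-offset", Offset);
  }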
Link: https://github.com/ClangBuiltLinux/linux/issues/1378
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D102742
This patch adds support for lowering function calls with the
`clang.arc.attachedcall` bundle. The goal is to expand such calls to the
following sequence of instructions:
callq @fn
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue / _objc_unsafeClaimAutoreleasedReturnValue
This sequence of instructions triggers Objective-C runtime optimizations,
hence we want to ensure no instructions get moved in between them.
This patch achieves that by adding a new CALL_RVMARKER ISD node,
which gets turned into the CALL64_RVMARKER pseudo, which eventually gets
expanded into the sequence mentioned above.
The ObjC runtime function to call is determined by the
argument in the bundle, which is passed through as a
target constant to the pseudo.
@ahatanak is working on using this attribute in the front- & middle-end.
Together with the front- & middle-end changes, this should address
PR31925 for X86.
This is the X86 version of 46bc40e502,
which added similar support for AArch64.
Reviewed By: ab
Differential Revision: https://reviews.llvm.org/D94597
This is another FMF gap exposed by D90901, but I don't see a way
to show the difference in a regression test as with:
f66ba4c6025663
We will see an asm difference if we add a test as part of D90901.
Generalize the fix from rGd0902a8665b1 by ensuring we widen/narrow the indices subvector first and then perform the ZERO_EXTEND_VECTOR_INREG (if necessary), which should allow us to perform the variable permutes with source/destination/indices vectors of any widths.
D101838 incorrectly handled indices vectors of the same size but with higher element counts, just bitcasting to the target indices type instead of performing a ZERO_EXTEND_VECTOR_INREG.
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).
The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specifically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).
Recommitted with a fix for unwind info when i386 pc-rel thunks are
adjacent to a prologue.
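As a small worked illustration of the bit-60 marker mentioned above (the values are made up; the real test lives in unwinders/debuggers):

  #include <cassert>
  #include <cstdint>

  int main() {
    const uint64_t ExtendedRecordBit = 1ULL << 60;
    uint64_t SavedRBP = 0x00007ffddeadbee0ULL;         // plausible frame pointer
    uint64_t Marked = SavedRBP | ExtendedRecordBit;    // stored %rbp with bit 60 set
    assert((Marked & ExtendedRecordBit) != 0);         // async frame record present
    assert((Marked & ~ExtendedRecordBit) == SavedRBP); // original pointer recoverable
    return 0;
  }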
Swift's new concurrency features are going to require guaranteed tail calls so
that they don't consume excessive amounts of stack space. This would normally
mean "tailcc", but there are also Swift-specific ABI desires that don't
naturally go along with "tailcc" so this adds another calling convention that's
the combination of "swiftcc" and "tailcc".
Support is added for AArch64 and X86 for now.
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.
Reapplies rGb95a103808ac (after reversion at rGc012a388a15b), with SSE3/SSSE3 typo fix, test added at rG0afb10de1449.
On AVX1 targets we can handle v4i64 logical shifts by 32 bits as a pair of v8f32 shuffles with zero.
I was hoping to put this in LowerScalarImmediateShift, but performing that early causes regressions where other instructions were re-splitting the subvectors.
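A scalar illustration of why a 32-bit shift of a 64-bit lane is just a shuffle with zero: treating the lane as two 32-bit elements {lo, hi}, shl by 32 produces {0, lo} and srl by 32 produces {hi, 0}.

  #include <cassert>
  #include <cstdint>

  int main() {
    uint64_t Lane = 0x0123456789abcdefULL;
    uint32_t Lo = uint32_t(Lane), Hi = uint32_t(Lane >> 32);
    uint64_t Shl32 = uint64_t(Lo) << 32; // elements {0, Lo}
    uint64_t Srl32 = uint64_t(Hi);       // elements {Hi, 0}
    assert(Shl32 == (Lane << 32) && Srl32 == (Lane >> 32));
    return 0;
  }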
getVectorNumElements() returns a value for scalable vectors
without any warning, so it is effectively getVectorMinNumElements().
By renaming it and making getVectorNumElements() forward to
it, we can insert a check for scalable vectors into getVectorNumElements()
similar to EVT. I didn't do that in this patch because there are still more
fixes needed, but I was able to temporarily do it and passed the RISCV
lit tests with these changes.
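Conceptually, the forwarding looks like the sketch below (MVTLike is a stand-in struct for illustration; the real change is in MachineValueType.h):

  struct MVTLike {
    unsigned MinNumElts;
    unsigned getVectorMinNumElements() const { return MinNumElts; }
    unsigned getVectorNumElements() const {
      // A scalable-vector assertion similar to EVT's can be added here once
      // the remaining callers are fixed.
      return getVectorMinNumElements();
    }
  };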
The changes to isPow2VectorType and getPow2VectorType are copied from EVT.
The change to TypeInfer::EnforceSameNumElts reduces the size of AArch64's isel table.
We're now considering SameNumElts to require the scalable property to match which
removes some unneeded type checks.
This was motivated by the bug I fixed yesterday in 80b9510806
Reviewed By: frasercrmck, sdesmalen
Differential Revision: https://reviews.llvm.org/D102262
Extend the HOP(HOP(X,Y),HOP(Z,W)) and SHUFFLE(HOP(X,Y),HOP(Z,W)) folds to handle repeating 256/512-bit vector cases.
This allows us to drop the UNPACK(HOP(),HOP()) custom fold in combineTargetShuffle.
This required isRepeatedTargetShuffleMask to be tweaked to support target shuffle masks taking more than 2 inputs.
foldShuffleOfHorizOp only handled basic shufps(hop(x,y),hop(z,w)) folds - by moving this to canonicalizeShuffleMaskWithHorizOp we can work with more general/combined v4x32 shuffle masks, float/integer domains and support shuffle-of-packs as well.
The next step will be to support 256/512-bit vector cases.
Ensure we don't try to fold when one might be an opaque constant - the constant fold will fail and then the reverse fold will happen in DAGCombine.
Based off a discussion on D89281 - where the AARCH64 implementations were being replaced to use funnel shifts.
Any target that has efficient funnel shift lowering can handle the shift parts expansion using the same expansion, avoiding a lot of duplication.
I've generalized the X86 implementation and moved it to TargetLowering - so far I've found that AARCH64 and AMDGPU benefit, but many other targets (ARM, PowerPC + RISCV in particular) could easily use this with a few minor improvements to their funnel shift lowering (or the folding of their target ops that funnel shifts lower to).
NOTE: I'm trying to avoid adding full SHIFT_PARTS legalizer handling as I think it might actually be possible to remove these opcodes in the medium-term and use funnel shift / libcall expansion directly.
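A worked scalar illustration of the expansion for a left shift (32-bit halves of a 64-bit value, shift amounts below the half width only; the general expansion also selects for amounts >= 32): the high half comes from a funnel shift of the two halves, the low half is a plain shift.

  #include <cassert>
  #include <cstdint>

  // fshl semantics on i32: concat(Hi, Lo) shifted left by Amt, top word kept.
  static uint32_t fshl32(uint32_t Hi, uint32_t Lo, unsigned Amt) {
    Amt &= 31;
    return Amt ? (Hi << Amt) | (Lo >> (32 - Amt)) : Hi;
  }

  int main() {
    uint64_t X = 0x0123456789abcdefULL;
    for (unsigned Amt = 0; Amt < 32; ++Amt) {
      uint32_t Lo = uint32_t(X), Hi = uint32_t(X >> 32);
      uint32_t NewHi = fshl32(Hi, Lo, Amt); // high part via funnel shift
      uint32_t NewLo = Lo << Amt;           // low part via plain shift
      assert(((uint64_t(NewHi) << 32) | NewLo) == (X << Amt));
    }
    return 0;
  }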
Differential Revision: https://reviews.llvm.org/D101987