Based on the worst-case numbers generated by D103695, we were overestimating the cost of a number of vector truncations:
AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32
AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8
Once we have a working set of conversion costs, the intention is to clean up the tables and use legalized types a lot more to reduce the number of entries we currently have.
Writes of a mask result are always tail agnostic.
Unfortunately, this seems to have made codegen worse. I can only
think this must be because the vsetvli was acting as some sort
of barrier that prevented some code movement in the scheduler.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D103331
As a follow up to D103672, we should allow vaddr to be larger than
required when assembling GFX10 MIMG instructions.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D103733
Avoid having to round up to v8f32/VReg_256 when only 5 VGPRs are
required for a MIMG address operand.
Maintain _V8 instruction variants of pseudo instructions allowing
assembly prior to GFX10 to work as-is. Currently the validator
can tell the correct size for GFX10, so it will disallow
oversized address registers.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103672
This patch optimizes (and r i) to
(BCLRI (BCLRI r, i0), i1) in which i = ~((1<<i0) | (1<<i1)), or
(BCLRI (ANDI r, i0), i1) in which i = i0 & ~(1<<i1).
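For illustration only (not part of the patch), a standalone C++ check of the
two bit identities, with a hypothetical bclri() helper standing in for the
BCLRI semantics of clearing a single bit:
```
#include <cassert>
#include <cstdint>

// Standalone sanity check of the fold's bit identities (not backend code).
// bclri(r, i) models RISC-V BCLRI: clear bit i of r.
static uint64_t bclri(uint64_t r, unsigned i) { return r & ~(1ULL << i); }

int main() {
  uint64_t r = 0xDEADBEEFCAFEF00DULL;
  unsigned i0 = 5, i1 = 37;

  // Case 1: i = ~((1<<i0) | (1<<i1))  ->  two BCLRIs.
  uint64_t mask1 = ~((1ULL << i0) | (1ULL << i1));
  assert((r & mask1) == bclri(bclri(r, i0), i1));

  // Case 2: i = imm & ~(1<<i1), imm small enough for ANDI  ->  ANDI + BCLRI.
  uint64_t imm = 0x7FF; // fits a 12-bit ANDI immediate
  uint64_t mask2 = imm & ~(1ULL << 3);
  assert((r & mask2) == bclri(r & imm, 3));
  return 0;
}
```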
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103743
This uses 3 bits of data instead of 7. I'm wondering if we can use
bitfields for the lookup table key where this would matter.
I also renamed the shift_amount template to log2 since it is used
with more than just an srl now.
This allows lowering an LDS variable into a kernel structure
even if there is a constant expression used from different
kernels.
Differential Revision: https://reviews.llvm.org/D103655
So far, support for x86_64-linux-gnux32 has been handled by explicit
comparisons of Triple.getEnvironment() to GNUX32. This worked as long as
x86_64-linux-gnux32 was the only X32 environment to worry about, but we
now have x86_64-linux-muslx32 as well. To support this, this change adds
an isX32() function and uses it. It replaces all checks for GNUX32 or
MuslX32 by isX32(), except for the following:
- Triple::isGNUEnvironment() and Triple::isMusl() are supposed to treat
GNUX32 and MuslX32 differently.
- computeTargetTriple() needs to be able to transform triples to add or
remove X32 from the environment and needs to map GNU to GNUX32, and
Musl to MuslX32.
- getMultiarchTriple() completely lacks any Musl support and retains the
explicit check for GNUX32 as it can only return x86_64-linux-gnux32.
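For reference, a minimal sketch (not the actual patch contents) of what an
isX32()-style predicate boils down to, using placeholder enumerators named
after the Triple environments mentioned above:
```
// Sketch only: treat both X32 environments uniformly. "Env" stands in for
// the value returned by Triple::getEnvironment().
enum EnvironmentType { GNU, GNUX32, Musl, MuslX32 };

static bool isX32(EnvironmentType Env) {
  return Env == GNUX32 || Env == MuslX32;
}
```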
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D103777
Include known bits support so we know we don't need to zext the
output if the input was already zero extended.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D103757
This can cause the vectorizer to generate interleaved scalar
code which might be ok for some CPUs, but definitely not all.
Disable it to restore the previous scalar behavior.
Differential Revision: https://reviews.llvm.org/D103787
This allows converting the add instruction to s_addk_i32 and
v_add_nc_u32 instead of needing v_add_co_u32 when converting to a VALU
instruction.
Differential Revision: https://reviews.llvm.org/D103322
Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103261
Use llvm.experimental.vector.insert instead of storing into an alloca
when generating code for these intrinsics. This defers the codegen of
the generated vector to instruction selection, allowing existing
shufflevector style optimizations to apply.
Additionally, introduce a new target transform that can recognise fixed
predicate patterns in the svbool variants of these intrinsics.
Differential Revision: https://reviews.llvm.org/D103082
We should be exiting when the shift amount is greater than
the bit width regardless of whether it is a power of 2.
Reported by Simon Pilgrim here https://reviews.llvm.org/D96661
This requires getting a shift amount that is out of bounds that
wasn't already optimized by SelectionDAG. This would be pretty
tricky to construct a test for.
Or it would require a non-power of 2 shift amount and a mask
that has runs of ones and zeros of the next lowest power of 2 from
that shift amount. I tried a little to produce a test for this,
but didn't get it to work.
Don't require a specific kind of IRBuilder for TargetLowering hooks.
This allows us to drop the IRBuilder.h include from TargetLowering.h.
Differential Revision: https://reviews.llvm.org/D103759
This NEG node is just a vector negation, easily represented as a SUB
zero. Removing it from the one place it is generated is essentially an
NFC, but can allow some extra folding. The updated tests are now loading
different constant literals, which have already been negated.
Differential Revision: https://reviews.llvm.org/D103703
While the IndVars issue (PR50384) has been resolved
and the compile-time performance improved, a new blocker emerged:
codegen machine instruction scheduling is also quadratic.
So we still can't really specify the right value here.
Filed PR50584.
This patch was split from https://reviews.llvm.org/D102246
[SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO
This is for the llvm-profdata part of the change. It sets the bit masks for the
profile reader in llvm-profdata. Also add an internal option
"-fs-discriminator-pass" for show and merge command to process the profile
offline.
This patch also moved setDiscriminatorMaskedBitFrom() to
SampleProfileReader::create() to simplify the interface.
Differential Revision: https://reviews.llvm.org/D103550
If we ended up with two phi instructions in a block, and we needed to fix up
the banks for the first one, we'd end up inserting our COPY before the second
phi.
E.g.
```
%x = G_PHI ...
%fixup = COPY ...
%y = G_PHI ...
```
This is invalid MIR, and breaks assumptions made by the register allocator later
down the line. With the verifier enabled, it also emits a verification error.
This teaches fixupPHIOpBanks to walk past any phi instructions in the block
when emitting the fixup copies.
Here's an example of the crashing code (same as added testcase):
https://godbolt.org/z/h5j1x3o6e
Differential Revision: https://reviews.llvm.org/D103582
This patch sets the isCommutable attribute for several opcodes that have
the "reg = OPCODE reg, reg" format.
Differential Revision: https://reviews.llvm.org/D103653
All that really matters is that the VLMAX of the preceding
instructions is the same as the VLMAX required by the mask
operation.
Also update the vmsge(u) handling to use the SEW/LMUL we use for
other mask register operations. We were matching it to the compare
before. Some cases would improve if we fixed masked compares to
use a tail agnostic policy. I think they ignore the tail policy
anyway.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D103299
setcc (csel 0, 1, cond, X), 1, ne ==> csel 0, 1, !cond, X
Where X is a condition code setting instruction.
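A scalar model of the fold, assuming csel(a, b, cond) yields "cond ? a : b"
and setcc ne produces a 0/1 value (a hedged sketch, not the AArch64 DAG code):
```
#include <cassert>

// csel(a, b, cond) models "csel a, b, cond": select a if cond holds, else b.
static int csel(int a, int b, bool cond) { return cond ? a : b; }

int main() {
  for (bool cond : {false, true}) {
    int before = (csel(0, 1, cond) != 1) ? 1 : 0; // setcc (csel 0,1,cond,X), 1, ne
    int after = csel(0, 1, !cond);                // csel 0, 1, !cond, X
    assert(before == after);
  }
  return 0;
}
```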
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D103256
Due to the dependency on runtime unrolling, UnJ is only
enabled by default on in-order scheduling models,
and if a cpu is specified through -mcpu.
Differential Revision: https://reviews.llvm.org/D103604
When using an ACLE intrinsic for an SVE2 shift, if the predicate passed
has all relevant lanes active, then use a reversed version of the
instruction if beneficial.
Don't propagate launch bound related attributes to
address taken functions and their callees. The idea
is to do a traversal over the call graph starting at
address taken functions and erase the attributes
set by previous logic i.e. process().
This two phase approach makes sure that we don't
miss out on deep nested callees from address taken
functions as a function might be called directly as
well as indirectly.
This patch is also a reattempt of D94585, as latent issues
in the hasAddressTaken function have recently been
fixed.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D103138
Before packing LDS globals into a sorted structure, make sure that
their alignment is properly updated based on their size. This will make
sure that the members of sorted structure are properly aligned, and
hence it will further reduce the probability of unaligned LDS access.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103261
In this particular example, we had a crash when compiling it
for several architectures. This patch extends the legalization
of extract_subvector to avoid this problem.
Differential Revision: https://reviews.llvm.org/D103344
This is a followup to D103422. The DenseMapInfo implementations for
ArrayRef and StringRef are moved into the ArrayRef.h and StringRef.h
headers, which means that these two headers no longer need to be
included by DenseMapInfo.h.
This required adding a few additional includes, as many files were
relying on various things pulled in by ArrayRef.h.
Differential Revision: https://reviews.llvm.org/D103491
This patch addresses an issue in which fixed-length (VLS) vector RVV
code could fail to reserve an emergency spill slot for their frame index
elimination. This is because we were previously only reserving a spill
slot when there were `scalable-vector` frame indices being used.
However, fixed-length codegen uses regular-type frame indices if it
needs to spill.
This patch does the fairly brute-force method of checking ahead of time
whether the function contains any RVV spill instructions, in which case
it reserves one slot. Note that the second RVV slot is still only
reserved for `scalable-vector` frame indices.
This unfortunately causes quite a bit of churn in existing tests, where
we chop and change stack offsets for spill slots.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103269
We were hitting an issue when the scalar_to_vector source was being implicitly truncated (in this case from i8 to vXi1) but we were also using the i8 source in a broadcast to a vXi8 value.
Fixes PR50374
`TargetFrameLowering::emitCalleeSavedFrameMoves` with 4 arguments is not
used anywhere in CodeGen. Thus it shouldn't be exposed as a virtual
function. NFC.
Differential Revision: https://reviews.llvm.org/D103328
This patch was split from https://reviews.llvm.org/D102246
[SampleFDO] New hierarchical discriminator for Flow Sensitive SampleFDO
This is mainly for the ProfileData part of the change. It will load
FS Profile when such profile is detected. For an extbinary format profile,
create_llvm_prof tool will add a flag to profile summary section.
For other format profiles, the users need to use an internal option
(-profile-isfs) to tell the compiler that the profile uses FS discriminators.
This patch also simplified the bit API used by FS discriminators.
Differential Revision: https://reviews.llvm.org/D103041
RVV vectors must be aligned to their element types, so anything less is
unaligned.
For regular loads and stores, our custom-lowering of fixed-length
vectors meant that we opted out of LegalizeDAG's built-in unaligned
expansion. This patch adds that logic in to our custom lower function.
For masked intrinsics, we declare that anything unaligned is not legal,
leaving the ScalarizeMaskedMemIntrin pass to do the expansion for us.
Note that neither of these methods can handle the expansion of
scalable-vector memory ops, so those cases are left alone by this patch.
Scalable loads and stores already go through expansion by default but
hit an assertion, and scalable masked intrinsics will silently generate
incorrect code. It may be prudent to return an error in both of these
cases.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102493
The first source has the same EEW as the destination, but we're
using earlyclobber which prevents them from ever being the same
register.
To work around this, add a special TIED pseudo to use whenever the
first source and merge operand are the same value. This allows
us to use a single operand for the merge operand and first source
which we can then tie to the destination. A tied source disables
earlyclobber for that operand.
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D103211
Some existing places use getPointerElementType() to create a copy of a
pointer type with some new address space.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D103429
It's still in use in a few places so we can't delete it yet, but there
aren't many at this point.
Differential Revision: https://reviews.llvm.org/D103352
This patch transforms the sequence
lea (reg1, reg2), reg3
sub reg3, reg4
to two sub instructions
sub reg1, reg4
sub reg2, reg4
Similar optimization can also be applied to LEA/ADD sequence.
The modification to TwoAddressInstructionPass ensures the operands of the ADD
instruction have the expected order (the dest register of the LEA should be the src register of the ADD).
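With AT&T-style operand order (sub src, dst computes dst -= src), the rewrite
rests on a simple associativity identity; a standalone check for illustration:
```
#include <cassert>
#include <cstdint>

int main() {
  // AT&T order assumed: "sub %src, %dst" computes dst -= src.
  int64_t reg1 = 1234, reg2 = -99, reg4 = 100000;

  // Before: lea (reg1, reg2), reg3 ; sub reg3, reg4
  int64_t reg3 = reg1 + reg2;
  int64_t before = reg4 - reg3;

  // After: sub reg1, reg4 ; sub reg2, reg4
  int64_t after = reg4;
  after -= reg1;
  after -= reg2;

  assert(before == after); // reg4 - (reg1 + reg2) == (reg4 - reg1) - reg2
  return 0;
}
```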
Differential Revision: https://reviews.llvm.org/D101970
This guarantees they meet this overlap exception:
"The destination EEW is smaller than the source EEW and the overlap
is in the lowest-numbered part of the source register group"
Being a single register guarantees the overlap is always in the
lowest-numbered part of the group.
Reviewed By: frasercrmck, khchen
Differential Revision: https://reviews.llvm.org/D103351
Compares are considered a narrowing operation for register overlap.
I believe for LMUL<=1 they meet this exception to allow overlap
"The destination EEW is smaller than the source EEW and the overlap is in the
lowest-numbered part of the source register group"
Both the result and the sources will occupy a single register for
LMUL<=1 so the overlap would always be in the "lowest-numbered part".
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D103336
This patch extends the RISC-V lowering of the 'fastcc' calling
convention to vector types, both fixed-length and scalable. Without this
patch, any function passing or returning vector types by value would
throw a compiler error.
Vectors are handled in 'fastcc' much as they are in the default calling
convention, the noticeable difference being the extended set of scalar
GPR registers that can be used to pass vectors indirectly.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102505
This patch adds TargetStackID::WasmLocal. This stack holds locations of
values that are only addressable by name -- not via a pointer to memory.
For the WebAssembly target, these objects are lowered to WebAssembly
local variables, which are managed by the WebAssembly run-time and are
not addressable by linear memory.
For the WebAssembly target IR indicates that an AllocaInst should be put
on TargetStackID::WasmLocal by putting it in the non-integral address
space WASM_ADDRESS_SPACE_WASM_VAR, with value 1. SROA will mostly lift
these allocations to SSA locals, but any alloca that reaches instruction
selection (usually in non-optimized builds) will be assigned the new
TargetStackID there. Loads and stores to those values are transformed
to new WebAssemblyISD::LOCAL_GET / WebAssemblyISD::LOCAL_SET nodes,
which then lower to the type-specific LOCAL_GET_I32 etc instructions via
tablegen patterns.
Differential Revision: https://reviews.llvm.org/D101140
Currently, X86 backend only has a global one-size-fits-all `FeatureFastVariableShuffle` feature,
which controls profitability of both the cross-lane and per-lane variable shuffles.
I guess this has been fine so far.
But at least on AMD Zen 3, per-lane variable shuffles (e.g. `VPSHUFB`)
are as fast as shuffles with a fixed/immediate mask,
while lane-crossing shuffles such as `VPERMPS` perform worse.
So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles,
as suggested by @RKSimon, split the feature flag into two.
Differential Revision: https://reviews.llvm.org/D103274
The code gen for f32 to i32 bitcast is not currently the most efficient;
this patch removes some of the unnecessary instructions generated.
Differential revision: https://reviews.llvm.org/D100782
It breaks up the function pass manager in the codegen pipeline.
With empty parameters, it looks at the -mllvm flag -rewrite-map-file.
This is likely not in use.
Add a check that we only have one function pass manager in the codegen
pipeline.
Some tests relied on the fact that we had a module pass somewhere in the
codegen pipeline.
addr-label.ll crashes on ARM due to this change. This is because a
ARMConstantPoolConstant containing a BasicBlock to represent a
blockaddress may hold an invalid pointer to a BasicBlock if the
blockaddress is invalidated by its BasicBlock getting removed. In that
case all referencing blockaddresses are RAUW'd with a constant int. Making
ARMConstantPoolConstant::CVal a WeakVH fixes the crash, but I'm not sure
that's the right fix. As a workaround, create a barrier right before
ISel so that IR optimizations can't happen while a
ARMConstantPoolConstant has been created.
Reviewed By: rnk, MaskRay, compnerd
Differential Revision: https://reviews.llvm.org/D99707
This patch fixes a bug in lowering scalable-vector types in RISC-V's
main calling convention. When scalable-vector types are split and passed
indirectly, the target is responsible for scaling the offset --
initially set to the known-minimum store size -- by the scalable factor.
Before this we were issuing overlapping loads or stores to the different
parts, leading to incorrect codegen.
Credit to @HsiangKai for spotting this.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D103262
This patch custom lowers FP_TO_[US]INT and [US]INT_TO_FP conversions
between floating-point and boolean vectors. As the default action is
scalarization, this patch both supports scalable-vector conversions and
improves the code generation for fixed-length vectors.
The lowering for these conversions can piggy-back on the existing
lowering, which lowers the operations to a supported narrowing/widening
conversion and then either an extension or truncation.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103312
This patch adds TargetStackID::WasmLocal. This stack holds locations of
values that are only addressable by name -- not via a pointer to memory.
For the WebAssembly target, these objects are lowered to WebAssembly
local variables, which are managed by the WebAssembly run-time and are
not addressable by linear memory.
For the WebAssembly target IR indicates that an AllocaInst should be put
on TargetStackID::WasmLocal by putting it in the non-integral address
space WASM_ADDRESS_SPACE_WASM_VAR, with value 1. SROA will mostly lift
these allocations to SSA locals, but any alloca that reaches instruction
selection (usually in non-optimized builds) will be assigned the new
TargetStackID there. Loads and stores to those values are transformed
to new WebAssemblyISD::LOCAL_GET / WebAssemblyISD::LOCAL_SET nodes,
which then lower to the type-specific LOCAL_GET_I32 etc instructions via
tablegen patterns.
Differential Revision: https://reviews.llvm.org/D101140
This ensures that the operands of any gather/scatter instructions that
we attempt to push out of the loop are invariant, preventing invalid IR
from being generated.
When you try to define a new DEBUG_TYPE in a header file, a DEBUG_TYPE
definition placed around the #includes in files that include it could
result in redefinition warnings or even compile errors.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D102594
If the operand of the WhileLoopStart is flagged as killed, that
currently gets propagated to both the t2CMPri as the instruction is
reverted, and the newly created t2DoLoopStart. Only the second should
remain as killing the operand, with the first dropping the flags.
The implementation of subword atomics does not actually
guarantee the result is zero-extended, which now caused
build bot failures after https://reviews.llvm.org/D101342
was landed.
We have special handling for a zext of a load <32b because the load does a zext
for free. In that case, we just select the G_ZEXT as if it were a copy but this
triggered the copy checking code to balk at the mismatched size.
This was being hidden because normally these get combined into G_ZEXTLOAD but
for atomics this doesn't happen. The test case here just uses a normal load
because the particular atomic isn't supported yet anyway.
This is cleaner than slicing the MxList to remove elements from
the beginning or end since that requires hardcoding the size.
I don't expect the size of the list to change, but we shouldn't
repeat it in multiple places.
If a cmpxchg specifies acquire or seq_cst on failure, make sure we
generate code consistent with that ordering even if the success ordering
is not acquire/seq_cst.
At one point, it was ambiguous whether this sort of construct was valid,
but the C++ standard and LLVM now accept arbitrary combinations of
success/failure orderings.
This doesn't address the corresponding issue in AtomicExpand. (This was
reported as https://bugs.llvm.org/show_bug.cgi?id=33332 .)
Fixes https://bugs.llvm.org/show_bug.cgi?id=50512.
Differential Revision: https://reviews.llvm.org/D103284
Since ca5f07f8c4 already reverted
the cause for this warning, this commit now causes warnings about
a default label in a switch that covers the enum.
This reverts commit cf2eeb114c.
SwiftTailCC has a different set of requirements than the C calling convention
for a tail call. The exact argument sequence doesn't have to match, but fewer
ABI-affecting attributes are allowed.
Also make sure the musttail diagnostic triggers if a musttail call isn't
actually a tail call.
When flat scratch is used, the stack pointer needs to be added when
writing arguments to the stack.
For buffer instructions, this is done in SelectMUBUFScratchOffen
and SelectMUBUFScratchOffset.
Move that to call argument lowering, like it is done in GlobalISel.
Differential Revision: https://reviews.llvm.org/D103166
This patch adds TargetStackID::WasmLocal. This stack holds locations of
values that are only addressable by name -- not via a pointer to memory.
For the WebAssembly target, these objects are lowered to WebAssembly
local variables, which are managed by the WebAssembly run-time and are
not addressable by linear memory.
For the WebAssembly target IR indicates that an AllocaInst should be put
on TargetStackID::WasmLocal by putting it in the non-integral address
space WASM_ADDRESS_SPACE_WASM_VAR, with value 1. SROA will mostly lift
these allocations to SSA locals, but any alloca that reaches instruction
selection (usually in non-optimized builds) will be assigned the new
TargetStackID there. Loads and stores to those values are transformed
to new WebAssemblyISD::LOCAL_GET / WebAssemblyISD::LOCAL_SET nodes,
which then lower to the type-specific LOCAL_GET_I32 etc instructions via
tablegen patterns.
Differential Revision: https://reviews.llvm.org/D101140
Also changes the fewerElements helper to use the lookthrough constant helper
instead of m_ICst, since m_ICst doesn't look through extends.
Differential Revision: https://reviews.llvm.org/D103227
AIX uses `__ssp_canary_word` instead of `__stack_chk_guard`.
This patch updates the target hook to use the correct symbol,
so that the basic stack protector feature can work.
The traceback will be handled in a follow-up patch.
Reviewed By: #powerpc, shchenz
Differential Revision: https://reviews.llvm.org/D103100
If an instruction's AVL operand is a PHI node in the same block,
we may be able to peek through the PHI to find vsetvli instructions
that produce the AVL in other basic blocks. If we can prove those
vsetvli instructions have the same VTYPE and were the last vsetvli
in their respective blocks, then we don't need to insert a vsetvli
for this pseudo instruction.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D103277
This is the first in a series of patches to provide builtins for
compatibility with the XL compiler. Most of the builtins already had
intrinsics and only needed to be implemented in the front end.
Intrinsics were created for the three iospace builtins, eieio, and icbt.
Pseudo instructions were created for eieio and iospace_eieio to
ensure that nops were inserted before the eieio instruction.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D102443
Determined from llvm-mca analysis (btver2 vs bdver2 vs sandybridge), the split+extends+concat sequence on AVX1 capable targets is cheaper than the #ops that the cost was previously based on.
This can help avoid needing a virtual register for the vsetvl output
when the AVL is X0. For other register AVLs it can shorten the live
range of the AVL register if it isn't needed later.
There's probably no advantage when AVL is a 5 bit immediate that
can use vsetivli. But do it anyway for consistency.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D103215
We could previously do this by accident through the later
call to getTargetConstantBitsFromNode I think, but that only worked
if N0 had a single use. This patch makes it explicit for undef and
doesn't have a use count check.
I think this is needed to move the (shl X, 1)->(add X, X)
fold to isel for PR50468. We need to be sure X won't be IMPLICIT_DEF
which might prevent the same vreg from being used for both operands.
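The identity behind the fold, checked standalone for illustration (not the
isel change itself):
```
#include <cassert>
#include <cstdint>

int main() {
  // (shl X, 1) is the same value as (add X, X), including on wrap-around.
  for (uint64_t x : {0ULL, 1ULL, 0xDEADBEEFULL, 0x8000000000000000ULL})
    assert((x << 1) == x + x);
  return 0;
}
```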
Differential Revision: https://reviews.llvm.org/D103192
Adjusting the load register type is a widenScalar type action, not a
lowering. lowerLoad should be reserved for operations that change the
memory access size, such as unaligned load decomposition. With this
trying to adjust the register type, it was hard to avoid infinite
loops in the legalizer. Adds a bandaid to avoid regressing a few
AArch64 tests, but I'm not sure what the exact condition is and
there's probably a cleaner way to do this.
For AMDGPU this regresses handling of some cases for unaligned loads,
but the way this is currently working is a pretty ugly hack.
The SkylakeServer model (and later IceLake/TigerLake targets according to Agner) have the PMOV truncations as uops=2, rthroughput=2 instructions.
Noticed while trying to reduce the diffs between cost tables and llvm-mca analysis.
This patch adds a way for the target to configure the type it uses for
the explicit vector length operands of VP SDNodes. The type must be a
legal integer type (there is still no target-independent legalization of
this operand) and must currently be at least as big as i32, the type
used by the IR intrinsics. An implicit zero-extension takes place on
targets which choose a larger type. All VP nodes should be created with
this type used for the EVL operand.
This allows 64-bit RISC-V to avoid custom legalization of all VP nodes,
keeping them in their target-independent form for that bit longer.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D103027
DAGCombine's `mergeStoresOfConstantsOrVecElts` optimization is told
whether it's to use vector types and also whether it's to issue a
truncating store. However, the truncating store code path assumes a
scalar integer `ConstantSDNode`, and when using vector types it creates
either a `BUILD_VECTOR` or `CONCAT_VECTORS` to store: neither of which
is a constant.
The `riscv64` target is able to expose a crash here because it switches
on both code paths at the same time. The `f32` is stored as `i32` which
must be promoted to `i64`, necessitating a truncating store.
It also decides later that it prefers a vector store of `v2f32`.
While vector truncating stores are legal, this combine is not able to
emit them. We also don't have a test case. This patch adds an assert to
catch this case more gracefully, and updates one of the callers of the
function to turn off the use of truncating stores when preferring
vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103173
The vector calling convention dictates that when the vector argument
registers are exhausted, GPRs are used to pass the address via the stack.
When the GPRs themselves are exhausted, at best we would previously
crash with an assertion, and at worst we'd generate incorrect code.
This patch addresses this issue by passing fixed-length vectors via the
stack with their full fixed-length size and aligned to their element
type size. Since the calling convention lowering can't yet handle
scalable vector types, this patch adds a fatal error to make it clear
that we are lacking in this regard.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102422
This patch extends the cases in which the legalizer is able to express
VSELECT in terms of XOR/AND/OR. When dealing with a VSELECT between
boolean vector types, the mask itself is an all-zeros or all-ones value
of the operand type, so a 0/1 boolean type behaves identically to a 0/-1
type.
This greatly helps RISC-V which relies on expansion for these nodes. It
also allows scalable-vector bool VSELECTs to use the default expansion,
where before it would crash in SelectionDAG::UnrollVectorOp.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103147
This enables the no-aliases forms of many instructions.
Depends on D103004
Reviewed By: tmatheson
Differential Revision: https://reviews.llvm.org/D103005
We aren't going to connect the result to anything so we might
as well avoid allocating a register.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D102031
DwarfDebug unconditionally assumes for all call instructions the 0th
operand is the callee operand, which seems to be true for other targets,
but not for WebAssembly. This adds `TargetInstrInfo::getCallOperand`
method whose default implementation returns `getOperand(0)` and makes
WebAssembly override it to use its own utility method to get the callee
operand.
This also fixes an existing bug in `WebAssembly::getCalleeOp`, which was
uncovered by this CL.
Reviewed By: dschuff, djtodoro
Differential Revision: https://reviews.llvm.org/D102978
There is a trivial but severe bug in the recent code collecting
LDS globals used by a kernel. It aborts the scan on the first constant
without scanning further uses. That leads to LDS overallocation
with multiple kernels in certain cases.
Differential Revision: https://reviews.llvm.org/D103190
In objdump, many targets support `-M no-aliases`. Instead of having a
`-*-no-aliases` for each target when LLVM adds the support, it makes more sense
to introduce objdump style `-M`.
-riscv-arch-reg-names is removed. -riscv-no-aliases has too many uses and thus is retained for now.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D103004
SEW=64 shifts only use the log2(64) bits of shift amount. If we're
splatting a 64 bit value in 2 parts, we can avoid splatting the
upper bits and just let the low bits be sign extended. They won't
be read anyway.
For the purposes of SelectionDAG semantics of the generic ISD opcodes,
if hi was non-zero or bit 31 of the low is 1, the shift was already
undefined so it should be ok to replace high with sign extend of low.
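A standalone check (for illustration, not the backend change) that
sign-extending just the low half preserves the only bits a SEW=64 shift reads,
i.e. the low log2(64) = 6 bits:
```
#include <cassert>
#include <cstdint>

int main() {
  for (uint64_t amt : {0x0000000000000021ULL, 0xFFFFFFFF80000001ULL,
                       0x123456789ABCDEF0ULL}) {
    // Keep only the low 32 bits and let them sign extend back to 64 bits.
    int64_t sext = static_cast<int32_t>(amt);
    // A SEW=64 shift reads only the low 6 bits of the shift amount, and
    // those live entirely within the low 32 bits.
    assert((static_cast<uint64_t>(sext) & 63) == (amt & 63));
  }
  return 0;
}
```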
In order to be able to find the split i64 value before it becomes
a stack operation, I added a new ISD opcode that will be expanded
to the stack spill in PreprocessISelDAG. This new node is conceptually
similar to BuildPairF64, but is expanded earlier so that we could
go through regular isel to get the right VLSE opcode for the LMUL.
BuildPairF64 is expanded in a CustomInserter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102521
It's conceivable someone could put a vsetvli in inline assembly,
so it's safer to consider them as barriers. The alternative would
be to trust that the user marks VL and VTYPE registers as clobbers
of the inline assembly if they do that, but that seems error prone.
I'm assuming inline assembly in vector code is going to be rare.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D103126
This patch extends D102737 to allow VL/VTYPE changes to be taken
into account before adding an explicit vsetvli.
We do this by using a data flow analysis to propagate VL/VTYPE
information from predecessors until we've determined a value for
every value in the function.
We use this information to determine if a vsetvli needs to be
inserted before the first vector instruction in the block.
Differential Revision: https://reviews.llvm.org/D102739
Support virtual, physical and tied i128 register operands in inline assembly.
i128 is not really supported on SystemZ and is not a legal type, and generally
such a value will be split into two i64 parts. There are however some
instructions that require a pair of two GPR64 registers contained in the GR128
bit reg class, which is untyped.
For inline assembly operands, it proved to be very cumbersome to first follow
the general behavior of splitting an i128 operand into two parts and then
later rebuild the INLINEASM MI to have one GR128 register. Instead, some
minor common code changes were made to SelectionDAGBuilder to only create one
GR128 register part to begin with. In particular:
- getNumRegisters() now has an optional parameter "RegisterVT" which is
passed by AddInlineAsmOperands() and GetRegistersForValue().
- The bitcasting in GetRegistersForValue is not performed if RegVT is
Untyped.
- The RC for a tied use in AddInlineAsmOperands() is now computed either from
the tied def (virtual register), or by getMinimalPhysRegClass() (physical
register).
- InstrEmitter.cpp:EmitCopyFromReg() has been fixed so that the register
class (DstRC) can also be computed for an illegal type.
In the SystemZ backend getNumRegisters(), splitValueIntoRegisterParts() and
joinRegisterPartsIntoValue() have been implemented to handle i128 operands.
Differential Revision: https://reviews.llvm.org/D100788
Review: Ulrich Weigand
- Currently, LLVM supports symbols of the name "token1@token2".
- "token2" is used to identify whether an appropriate symbol reference can be used for the symbol.
- Now, if the symbol reference couldn't be found, the AsmParser usually emits an error, unless the backend is configured to accept the "@" in a symbol name
- Thus, this patch aims to do that. It sets the `AllowAtInName` attribute in the SystemZ backend for the HLASM dialect.
- Setting this attribute ensures that, if a particular symbol reference is found, it uses that. If it doesn't, and there exists an "@" in the symbol name, it will use that instead of explicitly erroring out.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D103111
This patch fixes a bug in the AMDGPU Propagate Attributes pass where a call
instruction with a function pointer argument is identified as a user of the
passed function, and illegally replaces the called function of the
instruction with the function argument.
For example, given functions f and g with appropriate types, the following
illegal transformation could occur without this fix:
call void @f(void ()* @g)
-->
call void @g(void ()* @g.1)
The solution introduced in this patch is to prevent the cloning and
substitution if the instruction's called function and the function which
might be cloned do not match.
Reviewed By: arsenm, madhur13490
Differential Revision: https://reviews.llvm.org/D101847
- Currently, before printing a label in MCSymbol.cpp (MCSymbol::print), the current code "validates" the label that is to be printed.
- If it fails the validation step, then it prints the label within double quotes.
- However, the validation is provided as a virtual function in MCAsmInfo.h (i.e. isAcceptableChar() function). So we can override this for the AD_HLASM dialect in SystemZMCAsmInfo.cpp.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D103091
The previous code detected if an MBB is the bottom block to determine if it is
a backedge of a loop. We should check the latch block instead of the bottom
block, and we should check that the header and the bottom block are in the
same loop.
Differential Revision: https://reviews.llvm.org/D103145
The existing LD1 patterns do not cover cases where result type does
not match the memory type. This happens when illegal vector types are
extended and scalarized, for example:
load <2 x i16>* %v2i16
is lowered into:
// first element
(v4i32 (insert_subvector (v2i32 (scalar_to_vector (load anyext from i16)))))
// other elements
(v4i32 (insert_vector_elt (i32 (load anyext from i16)) idx))
Before this patch these patterns were compiled into LDR + INS.
Now they are compiled into LD1.
The problem was reported in
PR24820: LLVM Generates abysmal code in simple situation.
Differential Revision: https://reviews.llvm.org/D102938
Match what's documented in the Intel AOM (+Agner) - PSHUFB xmm is really slow, and mmx/xmm vector shifts are half rate.
Noticed while working to get the cost tables to more closely match llvm-mca analysis, in this case for shifts and truncations.
This function can change regbank for registers which already have a selected
bank. Depending on the instruction where these registers were used it can
cause instruction selection to fail.
Differential Revision: https://reviews.llvm.org/D98515
Match what's documented in the Intel AOM - the non-immediate variants of the PSLL*/PSRA*/PSRL* shift instructions require BOTH ports - this was being incorrectly modelled as EITHER port.
Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
Now that vmulh can be selected, this adds the MVE patterns to make it
legal and generate instructions.
Differential Revision: https://reviews.llvm.org/D88011
This function can change regbank for registers which already have a selected
bank. Depending on the instruction where these registers were used it can
cause instruction selection to fail.
We are using TOCEntry symbols like `LC..0` in TOC loads;
this is hard to read, at least requiring an additional step to figure
out the loaded symbols.
We should print out the name in comments.
Reviewed By: #powerpc, shchenz
Differential Revision: https://reviews.llvm.org/D102949
Match what's documented in the Intel AOM - the XMM variant of PSHUFB requires BOTH ports - this was being incorrectly modelled as EITHER port.
Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
Determined from llvm-mca analysis, AVX1 capable targets have a higher throughput for VPBLENDVB and shuffle ops, making it cheaper to perform shift+shuffle/select shift patterns.
Currently, BPF only contains three relocations:
R_BPF_NONE for no relocation
R_BPF_64_64 for LD_imm64 and normal 64-bit data relocation
R_BPF_64_32 for call insn and normal 32-bit data relocation
Also .BTF and .BTF.ext sections contain symbols in allocated
program and data sections. These two sections reserved 32bit
space to hold the offset relative to the symbol's section.
When LLVM JIT is used, the LLVM ExecutionEngine RuntimeDyld
may attempt to resolve relocations for .BTF and .BTF.ext,
which we want to prevent. So we used R_BPF_NONE for such relocations.
This all works fine until when we try to do linking of
multiple objects.
. R_BPF_64_64 handling of LD_imm64 vs. normal 64-bit data
is different, so lld target->relocate() needs more context
to do a correct job.
. The same for R_BPF_64_32. More context is needed for
lld target->relocate() to differentiate call insn vs.
normal 32-bit data relocation.
. Since relocations in .BTF and .BTF.ext are set to R_BPF_NONE,
they will not be relocated properly when multiple .BTF/.BTF.ext
sections are merged by lld.
This patch intends to address this issue by adding additional
relocation kinds:
R_BPF_64_ABS64 for normal 64-bit data relocation
R_BPF_64_ABS32 for normal 32-bit data relocation
R_BPF_64_NODYLD32 for .BTF and .BTF.ext style relocations.
The old R_BPF_64_{64,32} semantics:
R_BPF_64_64 for LD_imm64 relocation
R_BPF_64_32 for call insn relocation
The existing R_BPF_64_64/R_BPF_64_32 mapping to numeric values
is maintained. They are the most common use cases for
bpf programs and we want to maintain backward compatibility
as much as possible.
ExecutionEngine RuntimeDyld BPF relocations are adjusted as well.
R_BPF_64_{ABS64,ABS32} relocations will be resolved properly and
other relocations will be ignored.
Two tests are added for RuntimeDyld. Not handling R_BPF_64_NODYLD32 in
RuntimeDyldELF.cpp will result in "Relocation type not implemented yet!"
fatal error.
FK_SecRel_4 usages in BPFAsmBackend.cpp and BPFELFObjectWriter.cpp
are removed as they are not triggered in BPF backend.
BPF backend used FK_SecRel_8 for LD_imm64 instruction operands.
Differential Revision: https://reviews.llvm.org/D102712
NFC, since no instructions have their AsmMatchConverter
changed, but prepares for that to happen.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D103046
Change-Id: I6afefad899076de7b9a412374d09b95b29e012fa
rG1ad4f887bd7692a9e63fb42586f0ece366f2fe01 incorrectly assumed that vXi64 non-uniform shifts were slow like vXi32 were - but llvm-mca (+Agner) both confirm that Haswell/Broadwell are full rate.
NFC. Renames the variable in the dpp input operand generators
from DstRC to OldRC, because that is what it actually sets.
Also documents the importance of setting HasModifiers = 0 in the
dpp8 asm string.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D103047
Change-Id: Ice69ae38f644de7f228a75ca47c43e88b1f7d9e1
Determined from llvm-mca analysis, AVX2+ capable targets have a higher throughput for VPBLENDVB and VPMOVZX ops, making it cheaper to perform shift+select patterns for vXi8 shifts or extend/shift/truncate for vXi16 shifts. Similarly AVX512BW can perform vXi8 as extend/shift/truncate patterns.
This is a replacement for D101938 for inserting vsetvli
instructions where needed. This new version changes how
we track the information in such a way that we can extend
it to be aware of VL/VTYPE changes in other blocks. Given
how much it changes the previous patch, I've decided to
abandon the previous patch and post this from scratch.
For now the pass consists of a single phase that assumes
the incoming state from other basic blocks is unknown. A
follow up patch will extend this with a phase to collect
information about how VL/VTYPE change in each block and
a second phase to propagate this information to the entire
function. This will be used by a third phase to do the
vsetvli insertion.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102737
`WebAssemblyDebugValueManager` does not currently handle
`DBG_VALUE_LIST`, which is a recent addition to LLVM. We tried to
nullify them within the constructor of `WebAssemblyDebugValueManager` in
D102589, but it made the class error-prone to use because it deletes
instructions within the constructor and thus invalidates existing
iterators within the BB, so the user of the class should take special
care not to use invalidated iterators. This actually caused a bug in
ExplicitLocals pass.
Instead of trying to fix ExplicitLocals pass to make the iterator usage
correct, which is possible but error-prone, this adds
NullifyDebugValueLists pass that nullifies all `DBG_VALUE_LIST`
instructions before we run WebAssembly specific passes in the backend.
We can remove this pass after we implement handlers for
`DBG_VALUE_LIST`s in `WebAssemblyDebugValueManager` and elsewhere.
Fixes https://github.com/emscripten-core/emscripten/issues/14255.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D102999
This follows in steps of similar `getMemoryOpCost()` changes, D100099/D100684.
Intel SDM, `VPMASKMOV — Conditional SIMD Integer Packed Loads and Stores`:
```
Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to
referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no
faults will be detected if the mask bits are all zero.
```
I.e., if mask is all-zeros, any address is fine.
Masked load/store's prime use-case is e.g. tail masking the loop remainder,
where for the last iteration, only first some few elements of a vector exist.
Much the same way, I don't see why we must scalarize non-power-of-two vectors,
if the element type is something we can masked-store/load.
We simply need to legalize it, widen the mask, and be done with it.
And we even already count the cost of widening the mask.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D102990
If the local variable `NumOfVReg` satisfies isPowerOf2_32(NumOfVReg - 1) or isPowerOf2_32(NumOfVReg + 1), the ADDI and MUL instructions can be replaced with SLLI and ADD (or SUB) instructions.
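The strength reduction rests on n*x == (x << k) +/- x when n = 2^k +/- 1; a
quick standalone check (plain arithmetic, not the RISCV frame-lowering code):
```
#include <cassert>
#include <cstdint>

int main() {
  uint64_t x = 0x1234;
  for (unsigned k = 1; k < 8; ++k) {
    uint64_t nPlus = (1ULL << k) + 1;   // isPowerOf2_32(n - 1) holds
    uint64_t nMinus = (1ULL << k) - 1;  // isPowerOf2_32(n + 1) holds
    assert(nPlus * x == (x << k) + x);  // MUL -> SLLI + ADD
    assert(nMinus * x == (x << k) - x); // MUL -> SLLI + SUB
  }
  return 0;
}
```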
Based on original patch by StephenFan.
Reviewed By: frasercrmck, StephenFan
Differential Revision: https://reviews.llvm.org/D100577
To match fmod, the frem result must have the dividend's sign. The previous
implementation had the wrong sign when passing negative numbers. For example,
frem(-16, 7) was returning 5 instead of -2. We should just use ftrunc instead
of floor when lowering to get the right behavior.
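A standalone check of the sign behaviour described above, comparing a
trunc-based remainder (fmod semantics) against a floor-based one for
frem(-16, 7); illustration only, not the lowering code:
```
#include <cassert>
#include <cmath>

int main() {
  double a = -16.0, b = 7.0;
  // fmod/frem semantics: the remainder takes the dividend's sign.
  double remTrunc = a - std::trunc(a / b) * b; // -16 - (-2 * 7) = -2
  double remFloor = a - std::floor(a / b) * b; // -16 - (-3 * 7) =  5
  assert(remTrunc == -2.0 && remTrunc == std::fmod(a, b));
  assert(remFloor == 5.0); // the wrong sign the old lowering produced
  return 0;
}
```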
Differential Revision: https://reviews.llvm.org/D102528
By llvm-mca analysis, Haswell/Broadwell has a non-uniform vector shift recip-throughput cost of the AVX2 targets at 2 for both 128 and 256-bit vectors - XOP capable targets have better 128-bit vector shifts so improve the fallback in those cases.
The findLoopPreheader function will currently not find a preheader if it
branches to multiple different loop headers. This patch adds an option
to relax that, allowing ARMLowOverheadLoops to process more loops
successfully. This helps with WhileLoopStart setup instructions that can
branch/fallthrough to the low overhead loop and to branch to a separate
loop from the same preheader (but I don't believe it is possible for
both loops to be low overhead loops).
Differential Revision: https://reviews.llvm.org/D102747
This makes sure that the blocks created for lowering memcpy to loops end
up with branches, even if they fall through to the successor. Otherwise
IfCvt is getting confused with unanalyzable branches and creating
invalid block layouts.
The extra branches should be removed as the tail predicated loop is
finalized in almost all cases.
The trip count for a memcpy/memset will be n/16 rounded up to the
nearest integer. So (n+15)>>4. The old code was including a BIC too, to
clear one of the bits, which does not seem correct. This removes the
extra BIC.
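For illustration, a standalone check that (n+15)>>4 is already n/16 rounded
up, with no extra bit-clearing needed:
```
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t n = 0; n < 1000; ++n) {
    uint32_t ceilDiv16 = (n + 15) / 16;   // n/16 rounded up
    assert(((n + 15) >> 4) == ceilDiv16); // no BIC required
  }
  return 0;
}
```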
Note that ideally this would never actually be generated, as in the
creation of a tail predicated loop we will DCE that setup code, letting
the WLSTP perform the trip count calculation. So this doesn't usually
come up in testing (and apparently the ARMLowOverheadLoops pass does not
do any sort of validation on the tripcount). Only if the generation of
the WLSTP fails will it use the incorrect BIC instructions.
Differential Revision: https://reviews.llvm.org/D102629
RVV code generation does not successfully custom-lower BUILD_VECTOR in all
cases. When it resorts to default expansion it may, on occasion, be expanded to
scalar stores through the stack. Unfortunately these stores may then be picked
up by the post-legalization DAGCombiner which merges them again. The merged
store uses a BUILD_VECTOR which is then expanded, and so on.
This patch addresses the issue by overriding the `mergeStoresAfterLegalization`
hook. A lack of granularity in this method (being passed the scalar type) means
we opt out in almost all cases when RVV fixed-length vector support is enabled.
The only exception to this rule are mask vectors, which are always either
custom-lowered or are expanded to a load from a constant pool.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D102913
By llvm-mca analysis, Haswell/Broadwell has the worst v4i64 recip-throughput cost of the AVX2 targets at 6 (vs the currently used cost of 8). Similarly SkylakeServer (our only AVX512 target model) implements PMULLQ with an average cost of 1.5 (rounded up to 2.0), and the PMULUDQ-sequence (without AVX512DQ) as a cost of 6.
The prevailing style does not add the message. The directive name is not useful
because the next line replicates the error line which includes the directive.
Based on the worst case of sandybridge (vs btver2 + bdver2) llvm-mca analysis - which is a lot less than what we were predicting (I think based on total uop count).
BTVER2 has a 2 cycle throughput for v4i32 multiplies (same as SSE41 targets), which is only partially hidden by the subvector extracts/insert when splitting v8i32.
Now that getMemoryOpCost() correctly handles all the vector variants,
we should no longer hand-roll our own version of it, but use it directly.
The AVX512 variant probably needs a similar change,
but there it is less obvious.
This was initially landed in 69ed93a435,
but was reverted in 6b95fd199d
because the patch it depends on was reverted.
Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.
Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.
This was initially landed in c02476f315,
but reverted in 5fddc3312b,
because the code made some very optimistic assumptions about invariants
that didn't hold in practice.
Reviewed By: RKSimon, ABataev
Differential Revision: https://reviews.llvm.org/D100684
D88631 added initial support for:
- -mstack-protector-guard=
- -mstack-protector-guard-reg=
- -mstack-protector-guard-offset=
flags, and D100919 extended these to AArch64. Unfortunately, these flags
aren't retained for LTO. Make them module attributes rather than
TargetOptions.
Link: https://github.com/ClangBuiltLinux/linux/issues/1378
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D102742
This adds the {s,u,m}badaddr CSR aliases as well as the sptbr alias.
These are for compatibility with binutils. Furthermore, these are used
by the RISC-V Proxy Kernel and are required to enable building the Proxy
Kernel with the LLVM IAS.
The aliases here are deprecated. These are being introduced in order to
provide a compatibility story for the existing GNU toolchain, which
still supports the deprecated spelling in the assembler. However, in
order to encourage the migration of existing code, we provide warnings
indicating that the aliased CSRs are deprecated and should be replaced.
Differential Revision: https://reviews.llvm.org/D101919
Reviewed By: Craig Topper
BTVER2 has a weaker f64 multiplier than other AVX1-era targets, so we need to bump the worst case cost slightly - llvm-mca reports the new vectorization in simplebb is beneficial on btver2, bdver2 and sandybridge AVX1 targets
This patch adds support for lowering function calls with the
`clang.arc.attachedcall` bundle. The goal is to expand such calls to the
following sequence of instructions:
callq @fn
movq %rax, %rdi
callq _objc_retainAutoreleasedReturnValue / _objc_unsafeClaimAutoreleasedReturnValue
This sequence of instructions triggers Objective-C runtime optimizations,
hence we want to ensure no instructions get moved in between them.
This patch achieves that by adding a new CALL_RVMARKER ISD node,
which gets turned into the CALL64_RVMARKER pseudo, which eventually gets
expanded into the sequence mentioned above.
The ObjC runtime function to call is determined by the
argument in the bundle, which is passed through as a
target constant to the pseudo.
@ahatanak is working on using this attribute in the front- & middle-end.
Together with the front- & middle-end changes, this should address
PR31925 for X86.
This is the X86 version of 46bc40e502,
which added similar support for AArch64.
Reviewed By: ab
Differential Revision: https://reviews.llvm.org/D94597
This patch resolves the Wsign-compare warning that I observed on armv7l and x86 with both gcc and clang.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D102792
The Loop start instruction handled by the ARMLowOverheadLoops are:
$lr = t2DoLoopStart $r0
$lr = t2DoLoopStartTP $r1, $r0
$lr = t2WhileLoopStartLR $r0, %bb, implicit-def dead $cpsr
All three of these will have LR as the 0 argument, the trip count as the
1 argument.
This patch updated a few places in ARMLowOverheadLoops where the 0th arg
was being used for t2WhileLoopStartLR instructions as the trip count.
One place was entirely removed as it does not seem valid any more, the
case the code is trying to protect against should not be able to occur
with our correct-by-construction low overhead loops.
Differential Revision: https://reviews.llvm.org/D102620
I do not see any practical difference but technically
used.* variables are internal and a call to getGlobalVariable
is missing true as the second argument. NFC as far as I can tell.
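For reference, a minimal sketch (assumed helper name, not part of the patch)
of the lookup being discussed, passing AllowInternal explicitly so internal
used.* variables are actually found:
```
#include "llvm/IR/Module.h"

// Sketch: without the second argument, getGlobalVariable() defaults
// AllowInternal to false and can miss internal variables like llvm.used.
llvm::GlobalVariable *findUsedList(llvm::Module &M) {
  return M.getGlobalVariable("llvm.used", /*AllowInternal=*/true);
}
```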
Differential Revision: https://reviews.llvm.org/D102884
Accesses to the global module LDS variable start from null,
but the kernel also thinks its variables' start address is
null. Fixed by not using null as an address.
Differential Revision: https://reviews.llvm.org/D102882
This patch adds supports for inline assembly operands and some simple
operand constraints, including register and constant operands.
Differential Revision: https://reviews.llvm.org/D102585
Add `-ffixed-a[0-6]` and `-ffixed-d[0-7]` and the corresponding
subtarget features to prevent certain register from being allocated.
Differential Revision: https://reviews.llvm.org/D102805
Unlike normal loads these don't have an extension field, but we know
from TargetLowering whether these are sign-extending or zero-extending,
and so can optimise away unnecessary extensions.
This was noticed on RISC-V, where sign extensions in the calling
convention would result in unnecessary explicit extension instructions,
but this also fixes some Mips inefficiencies. PowerPC sees churn in the
tests as all the zero extensions are only for promoting 32-bit to
64-bit, but these zero extensions are still not optimised away as they
should be, likely due to i32 being a legal type.
This also simplifies the WebAssembly code somewhat, which currently
works around the lack of target-independent combines with some ugly
patterns that break once they're optimised away.
Re-landed with correct handling in ComputeNumSignBits for Tmp == VTBits,
where zero-extending atomics were incorrectly returning 0 rather than
the (slightly confusing) required return value of 1.
Re-landed again after D102819 fixed PowerPC to correctly zero-extend all
of its atomics as it claimed to do, since the combination of that bug
and this optimisation caused buildbot regressions.
Reviewed By: RKSimon, atanasyan
Differential Revision: https://reviews.llvm.org/D101342
The default expansion for BUILD_VECTORs -- save for going through
shuffles -- is to go through the stack. This method only works when the
type is at least byte-sized, so for v2i1 and v4i1 we would crash.
This patch ensures that small mask-type BUILD_VECTORs are always handled
without crashing. We lower to a SETCC of the equivalent i8 type.
This also exposes some pre-existing issues where the lowering when
optimizing for size results in larger code than without. Those will be
tackled in future patches.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102767
Match what's documented in the Intel AOM - fadd/fcmp use Port1 and fmul uses Port1, but in many cases BOTH ports are required - this was being incorrectly modelled as EITHER port.
Discovered while investigating the correct fptoui costs to fix the regressions in D101555.
Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
Partword atomic binaries are not zero extended as they should be.
This patch fixes them to ensure that they are zero extended.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D102819
The use of `SelectionDAG::getSplatValue` isn't guaranteed to return a
type-legal splat value as it may implicitly extract a vector element
from another shuffle. It is not permitted to introduce an illegal type
when lowering shuffles.
This patch addresses the crash by adding a boolean flag to
`getSplatValue`, defaulting to false, which when set will ensure a
type-legal return value. If it is unable to do that it will fail to
return a splat value.
I've been through the existing uses of `getSplatValue` in other targets
and was unable to find a need or test cases showing a need to update
their uses. In some cases, the call is made during `LegalizeVectorOps`
which may still produce illegal scalar types. In other situations, the
illegally-typed splat value may be quickly patched up to a legal type
(such as any-extending the returned `extract_vector_elt` up to a legal
type) before `LegalizeDAG` notices.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102687
__table_base is now 64-bit, since in LLVM it represents a function pointer offset
__table_base32 is a copy in wasm32 for use in elem init expr, since no truncation may be used there.
New reloc R_WASM_TABLE_INDEX_REL_SLEB64 added
Differential Revision: https://reviews.llvm.org/D101784
Follow up to D101357 / 3fa6510f6.
Supersedes D102330.
Goal: Use flags setting rdffrs instead of rdffr + ptest.
Problem: RDFFR_P doesn't have a flags-setting equivalent.
Solution: in instcombine, canonicalize to RDFFR_PP at the IR level, and
rely on RDFFR_PP+PTEST => RDFFRS_PP optimization in
AArch64InstrInfo::optimizePTestInstr.
While here:
* Test that rdffr.z+ptest generates a rdffrs.
* Use update_{test,llc}_checks.py on the tests.
* Use sve attribute on functions.
Differential Revision: https://reviews.llvm.org/D102623
Linker scripts might not handle COMDAT sections. SLSHardening adds a
new section for each __llvm_slsblr_thunk_xN. This new option allows
the thunks to be generated into the normal text section to handle these
exceptional cases.
,comdat or ,nocomdat can be appended to harden-sls to control the codegen, e.g.
-mharden-sls=[all|retbr|blr],nocomdat.
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D100546
Strictly speaking, the architecture manual no longer uses the st
mnemonic, but that's a much more intrusive change for little gain.
Differential Revision: https://reviews.llvm.org/D96313
Haswell, Excavator and early Ryzen all have slower 256-bit non-uniform vector shifts (confirmed on AMDSoG/Agner/instlatx64 and llvm models) - so bump the worst case costs accordingly.
Noticed while investigating PR50364
This adds custom lowering for the MLOAD and MSTORE ISD nodes when
passed fixed length vectors in SVE. This is done by converting the
vectors to VLA vectors and using the VLA code generation.
Fixed length extending loads and truncating stores currently produce
correct code, but do not use the built in extend/truncate in the
load and store instructions. This will be fixed in a future patch.
Differential Revision: https://reviews.llvm.org/D101834
We have been handling filters and landingpads incorrectly all along. We
pass clauses' (catches') types to `__cxa_find_matching_catch` in JS glue
code, which returns the thrown pointer and sets the selector using
`setTempRet0()`.
We apparently have been doing the same for filters' (exception specs')
types; we pass them to `__cxa_find_matching_catch` just the same way as
clauses. And `__cxa_find_matching_catch` treats all given types as
clauses. So it is a little surprising; maybe we intended to do something
from the JS side and didn't end up doing so?
So anyway, I don't think supporting exception specs in Emscripten EH is
a priority, but this can actually cause incorrect results for normal
catches when functions are inlined and the inlined spec type has a
parent-child relationship with the catch's type.
---
The below is an example of a bug that can happen when inlining and class
hierarchy is mixed. If you are busy you can skip this part:
```
struct A {};
struct B : A {};
void bar() throw (B) { throw B(); }
void foo() {
try {
bar();
} catch (A &) {
fputs ("Expected result\n", stdout);
}
}
```
In the unoptimized code, `bar`'s landingpad will have a filter for `B`
and `foo`'s landingpad will have a clause for `A`. But when `bar` is
inlined into `foo`, `foo`'s landingpad has both a filter for `B` and a
clause for `A`, and it passes the both types to
`__cxa_find_matching_catch`:
```
__cxa_find_matching_catch(typeinfo for B, typeinfo for A)
```
`__cxa_find_matching_catch` thinks both are clauses, and looks at the
first type `B`, which belongs to a filter. And the thrown type is `B`,
so it thinks the first type `B` is caught. But this makes it return an
incorrect selector, because it is supposed to catch the exception using
the second type `A`, which is a parent of `B`. As a result, the `foo` in
the example program above does not print "Expected result" but just
throws the exception to the caller. (This wouldn't have happened if `A`
and `B` were completely disjoint types, such as `float` and `int`.)
Fixes https://bugs.llvm.org/show_bug.cgi?id=50357.
Reviewed By: dschuff, kripken
Differential Revision: https://reviews.llvm.org/D102795
bswap.v2i16 + sitofp in LLVM IR generate a sequence of:
- REV32 + USHR for bswap.v2i16
- SHL + SSHR + SCVTF for sext to v2i32 and scvt
The shift instructions are excessive as noted in PR24820, and they can
be optimized to just SSHR.
Differential Revision: https://reviews.llvm.org/D102333
The ROR instruction can only handle immediates between 1 and 31. The
would-be encoding for ROR #0 is actually the RRX instruction.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D102455
This is another FMF gap exposed by D90901, but I don't see a way
to show the difference in a regression test as with:
f66ba4c6025663
We will see an asm difference if we add a test as part of D90901.
- This patch (is one in a series of patches) which introduces HLASM Parser support (for the first parameter of inline asm statements) to LLVM ([[ https://lists.llvm.org/pipermail/llvm-dev/2021-January/147686.html | main RFC here ]])
- This patch in particular introduces HLASM Parser support for Z machine instructions.
- The approach taken here was to subclass `AsmParser`, and mark various functions and variables as "protected" wherever appropriate.
- The `HLASMAsmParser` class overrides the `parseStatement` function. Two new private functions `parseAsHLASMLabel` and `parseAsMachineInstruction` are introduced as well.
The general syntax is laid out as follows (more information available in [[ https://www.ibm.com/support/knowledgecenter/SSENW6_1.6.0/com.ibm.hlasm.v1r6.asm/asmr1023.pdf | HLASM V1R6 Language Reference Manual ]] - Chapter 2 - Instruction Statement Format):
```
<TokA><spaces.*><TokB><spaces.*><TokC><spaces.*><TokD>
```
1. TokA is referred to as the Name Entry. This token is optional.
2. TokB is referred to as the Operation Entry. This token is mandatory.
3. TokC is referred to as the Operand Entry. This token is mandatory.
4. TokD is referred to as the Remarks Entry. This token is optional.
- If TokA is provided, then we either parse TokA as a possible comment or as a label (Name Entry), TokB as the Operation Entry, and so on.
- If TokA is not provided (i.e. we have one or more spaces and then the first token), then we parse the first token (i.e. TokB) as a possible Z machine instruction, TokC as the operands to the Z machine instruction, and TokD as a possible Remarks field.
- For TokC (Operand Entry), no spaces are allowed between operand entries. If a space occurs it is classified as an error.
- TokD, if provided, is taken as-is and emitted as a comment.
The following additional approach was examined, but not taken:
- Adding custom private only functions to base AsmParser class, and only invoking them for z/OS. While this would eliminate the need for another child class, these private functions would be of non-use to every other target. Similarly, adding any pure virtual functions to the base MCAsmParser class and overriding them in AsmParser would also have the same disadvantage.
Testing:
- This patch doesn't have tests added with it, for the sole reason that MCStreamer support and object file support haven't been added for the z/OS target (yet). Hence, it's not possible to generate code outright for the z/OS target. They are in the process of being committed / being worked on.
- Any comments / feedback on how to combat this "lack of testing" due to other missing required features is appreciated.
Reviewed By: Kai, uweigand
Differential Revision: https://reviews.llvm.org/D98276
The current implementation assumes the destination type of the shuffle is the same as the decomposed ones. Add a check to avoid a crash when the condition is not satisfied.
This fixes PR37616.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D102751
Generalize the fix from rGd0902a8665b1 by ensuring we widen/narrow the indices subvector first and then perform the ZERO_EXTEND_VECTOR_INREG (if necessary), which should allow us to perform the variable permutes with source/destination/indices vectors of any widths.
Match what's documented in the Intel AOM (and Agner/instlatx64 agree) - these are all Port0 only.
Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
The current implementation assumes the destination type of the shuffle is the same as the decomposed ones. Add a check to avoid a crash when the condition is not satisfied.
This fixes PR37616.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D102751
Like the element extraction of these vectors, we choose to promote up to
an i8 vector type and perform the insertion there.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102697
This patch transforms the sequence
lea (reg1, reg2), reg3
sub reg3, reg4
to two sub instructions
sub reg1, reg4
sub reg2, reg4
Similar optimization can also be applied to LEA/ADD sequence.
The modifications to TwoAddressInstructionPass are to ensure the operands of the ADD
instruction have the expected order (the dest register of the LEA should be the src register
of the ADD).
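Whichever operand order is intended in the pseudo-sequence above, the rewrite rests on the subtraction identity x - (a + b) == (x - a) - b, which holds for wrapping machine arithmetic. A minimal standalone C++ check of that identity (not the pass itself; the function names are illustrative only):
```
#include <cassert>
#include <cstdint>

// x - (a + b) computed the "LEA + SUB" way vs. the "two SUBs" way.
uint64_t viaLeaSub(uint64_t a, uint64_t b, uint64_t x) { return x - (a + b); }
uint64_t viaTwoSubs(uint64_t a, uint64_t b, uint64_t x) { return (x - a) - b; }

int main() {
  assert(viaLeaSub(7, 9, 100) == viaTwoSubs(7, 9, 100));
  assert(viaLeaSub(~0ull, 5, 3) == viaTwoSubs(~0ull, 5, 3)); // wraps consistently
  return 0;
}
```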
Differential Revision: https://reviews.llvm.org/D101970
X86 NaCl generally requires the stack to be aligned to 16 bytes.
This change was already implemented in two downstream NaCl compilers
based on llvm.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D102610
This patch adds the XPLINK64 calling convention to the SystemZ
backend. It specifies and implements the argument passing and
return value conventions.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D101010
D101838 incorrectly handled indices vectors of the same size but with higher element counts by just bitcasting to the target indices type instead of performing a ZERO_EXTEND_VECTOR_INREG
The x86-64-v4 generic cpu arch supports AVX512BW/DQ/CD/VLX which isn't covered by the Haswell model, use the SkylakeServer model instead which is a lot closer to what the arch represents.
Differential Revision: https://reviews.llvm.org/D102553
We can use an ORRWrs (mov) + SUBREG_TO_REG rather than a UBFX for G_ZEXT on
s32->s64.
This closer matches what SDAG does, and is likely more power efficient etc.
(Also fixed up arm64-rev.ll which had a fallback check line which was entirely
useless.)
Simple example: https://godbolt.org/z/h1jKKdx5c
Differential Revision: https://reviews.llvm.org/D102656
Use existing KnownBits helpers from KnownBits.h to simplify G_ICMPs.
E.g.
x == x -> true
x != x -> false
load(x) > 1 -> true (when the load is known to be greater than 1)
And so on.
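As a rough illustration of the kind of reasoning involved (a simplified stand-in, not the LLVM KnownBits class or the GISel combiner API), an unsigned "greater than" can be folded whenever the known ranges of the two sides do not overlap:
```
#include <cstdint>
#include <optional>

// Simplified known-bits record: Zero marks bits known to be 0, One marks bits known to be 1.
struct KnownBitsSketch {
  uint64_t Zero = 0, One = 0;
  uint64_t minValue() const { return One; }   // unknown bits assumed 0
  uint64_t maxValue() const { return ~Zero; } // unknown bits assumed 1
};

// Try to fold an unsigned "LHS > RHS" comparison using known bits alone.
std::optional<bool> simplifyUGT(const KnownBitsSketch &LHS, const KnownBitsSketch &RHS) {
  if (LHS.minValue() > RHS.maxValue())
    return true;  // e.g. load(x) > 1 when the loaded value is known to exceed 1
  if (LHS.maxValue() <= RHS.minValue())
    return false;
  return std::nullopt; // not enough information to fold
}
```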
Differential Revision: https://reviews.llvm.org/D102542
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).
The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specifically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).
Recommited with a fix for unwind info when i386 pc-rel thunks are
adjacent to a prologue.
While I would like to keep the right value here,
I would also like to be able to actually compile
e.g. the vanilla test-suite.
256 is a pretty random guess; it should be good enough
for serious loops, but small enough to result in tolerable
compile times for certain edge cases.
https://bugs.llvm.org/show_bug.cgi?id=50384
Noticed while investigating PR50364, the truncation costs for v4i64->v4i16/v4i8 and v8i32->v8i8 were way too optimistic for a shuffle sequence that usually matches the AVX1 codegen (they matched AVX512 numbers which have actual truncation instructions!).
Passing template parameter packs to std::map doesn't work in VS 2017/2019, so this updates the preprocessor version check to use an alternate version in VS2019, as well.
Reviewed By: DavidSpickett
Differential Revision: https://reviews.llvm.org/D102260
Where the RVV specification writes `vs2, vs1`, our TableGen patterns use
`rs1, rs2`. These differences can easily cause confusion. The VMANDNOT
instruction performs `LHS && !RHS`, and similarly for VMORNOT.
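For reference, the lane-wise semantics spelled out on plain integers (a hedged illustration of the operand order described above, not target code):
```
#include <cstdint>

// VMANDNOT-style mask op: LHS && !RHS, i.e. lhs & ~rhs lane-wise.
uint64_t mandnot(uint64_t lhs, uint64_t rhs) { return lhs & ~rhs; }
// VMORNOT-style mask op: LHS || !RHS.
uint64_t mornot(uint64_t lhs, uint64_t rhs) { return lhs | ~rhs; }
```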
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102606
The implementation just extends the vector to a larger element type, and
extracts from that. Not fancy, but generates reasonable code.
There was discussion in the review of doing the promotion in
target-independent code, but I'm sticking with this to avoid making
LegalizeDAG infrastructure more complicated.
Differential Revision: https://reviews.llvm.org/D87651
WebAssemblyDebugValueManager class currently does not handle
DBG_VALUE_LIST instructions correctly for two reasons, which are
explained in https://bugs.llvm.org/show_bug.cgi?id=50361.
This effectively nullifies DBG_VALUE_LISTs in
WebAssemblyDebugValueManager so that the info will appear as "optimized
out" in debuggers but still be at least correct in the meantime.
Reviewed By: dschuff, jmorse
Differential Revision: https://reviews.llvm.org/D102589
Follow up to D88631 but for aarch64; the Linux kernel uses the command
line flags:
1. -mstack-protector-guard=sysreg
2. -mstack-protector-guard-reg=sp_el0
3. -mstack-protector-guard-offset=0
to use the system register sp_el0 for the stack canary, enabling the
kernel to have a unique stack canary per task (like a thread, but not
limited to userspace as the kernel can preempt itself).
Address pr/47341 for aarch64.
Fixes: https://github.com/ClangBuiltLinux/linux/issues/289
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed By: xiangzhangllvm, DavidSpickett, dmgreen
Differential Revision: https://reviews.llvm.org/D100919
Set the output register class based on the output type, instead of
hard-coding VGPR_32. I think this is more correct. It doesn't make any
difference at the moment because we use the same class for 16- and
32-bit results, but it might in future if we make more use of true
16-bit register classes.
Differential Revision: https://reviews.llvm.org/D102622
Adding lowering support for bitreverse.
Previously, lowering bitreverse would expand it into a series of other instructions. This patch makes it so this produces a single rbit instruction instead.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D102397
These patterns are missing even though the underlying instruction
doesn't really care about the type. Added these patterns to resolve
https://bugs.llvm.org/show_bug.cgi?id=50084
This instruction is a nop on all server cores (certainly on all
cores that AIX supports) so it is fine to emit a nop instead of it.
In fact, that is exactly what XL emits. So we emit a nop on AIX
and we leave the codegen as is on other platforms since there may
indeed be cores out there for which this actually does some prefetching.
This adds support to the X86 backend for the newly committed swiftasync
function parameter. If such a (pointer) parameter is present it gets stored
into an augmented frame record (populated in IR, but generally containing
enhanced backtrace for coroutines using lots of tail calls back and forth).
The context frame is identical to AArch64 (primarily so that unwinders etc
don't get extra complexity). Specifically, the new frame record is [AsyncCtx,
%rbp, ReturnAddr], and its presence is signalled by bit 60 of the stored %rbp
being set to 1. %rbp still points to the frame pointer in memory for backwards
compatibility (only partial on x86, but OTOH the weird AsyncCtx before the rest
of the record is because of x86).
Swift's new concurrency features are going to require guaranteed tail calls so
that they don't consume excessive amounts of stack space. This would normally
mean "tailcc", but there are also Swift-specific ABI desires that don't
naturally go along with "tailcc" so this adds another calling convention that's
the combination of "swiftcc" and "tailcc".
Support is added for AArch64 and X86 for now.
AArch64's fcvt* instructions implement the saturating behaviour that the
fpto*i.sat intrinsics require, in cases where the destination width
matches the saturation width. Lowering them removes a lot of unnecessary
generated code.
Only scalar lowerings are supported for now.
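For context, the saturating semantics the fpto*i.sat intrinsics require look roughly like the following sketch for a 32-bit signed result (an illustration of the documented intrinsic behaviour, not the AArch64 lowering):
```
#include <cmath>
#include <cstdint>
#include <limits>

// fptosi.sat-style conversion to i32: out-of-range values clamp, NaN maps to 0.
int32_t fptosi_sat_i32(double x) {
  if (std::isnan(x))
    return 0;
  if (x <= static_cast<double>(std::numeric_limits<int32_t>::min()))
    return std::numeric_limits<int32_t>::min();
  if (x >= static_cast<double>(std::numeric_limits<int32_t>::max()))
    return std::numeric_limits<int32_t>::max();
  return static_cast<int32_t>(x);
}
```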
Differential Revision: https://reviews.llvm.org/D102353
The ComplexPattern is looking for an immediate in a certain range
that has a single use. This can be handled with a PatLeaf since
we aren't matching multiple patterns or checking any complicated
relationships between nodes.
This shrinks the isel table a little bit since tablegen no longer
has to generate patterns with commuted operands. With the PatLeaf,
tablegen can see we're matching an immediate which should always
be on the right hand side of add.
Reviewed By: benshi001
Differential Revision: https://reviews.llvm.org/D102510
This adds a simple fold into codegenprepare that converts comparison of
branches towards comparison with zero if possible. For example:
%c = icmp ult %x, 8
br %c, bla, blb
%tc = lshr %x, 3
becomes
%tc = lshr %x, 3
%c = icmp eq %tc, 0
br %c, bla, blb
As a first order approximation, this can reduce the number of
instructions needed to perform the branch as the shift is (often) needed
anyway. At the moment this does not affect very much, as llvm tends to
prefer the opposite form. But it can protect against regressions from
commits like rG9423f78240a2.
Simple cases of Add and Sub are added along with Shift, equally as the
comparison to zero can often be folded with cpsr flags.
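The fold relies on the fact that, for unsigned values and a power-of-two bound, "x < 2^k" is equivalent to "(x >> k) == 0". A small self-contained C++ spot-check of that equivalence (illustrative only, not the codegenprepare code):
```
#include <cassert>
#include <cstdint>

bool belowEightCmp(uint32_t x) { return x < 8; }            // icmp ult x, 8
bool belowEightShift(uint32_t x) { return (x >> 3) == 0; }  // lshr + icmp eq 0

int main() {
  for (uint64_t x = 0; x <= UINT32_MAX; x += 12345)   // sample the whole range
    assert(belowEightCmp(static_cast<uint32_t>(x)) ==
           belowEightShift(static_cast<uint32_t>(x)));
  return 0;
}
```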
Differential Revision: https://reviews.llvm.org/D101778
The intention is to be able to run this from additional locations (such as shuffle combining) in the future.
Reapplies rGb95a103808ac (after reversion at rGc012a388a15b), with SSE3/SSSE3 typo fix, test added at rG0afb10de1449.
The default AsmPrinter prints the GV in comments;
AIX should do so too.
This also fixes LLVM :: CodeGen/Generic/inline-asm-mem-clobber.ll.
Reviewed By: hubert.reinterpretcast
Differential Revision: https://reviews.llvm.org/D102534
Match what's documented in the Intel AOM (and Agner/instlatx64 agree) - vector integer multiplies are pipelined - all Port0, throughput = 2 @ 128bits, 1 @ 64bits.
Noticed while checking reduction costs - now that we can use in-order models in llvm-mca, the atom model is the "worst case scenario" we have in x86.
Reapply rG5ed56a821c06 (after reverted by rG7aa89c4a22fd) - don't take reference from struct that will be erased in X86FrameLowering::eliminateCallFramePseudoInstr
The FixSGPRCopies pass converts instructions to VALU when
removing illegal VGPR to SGPR copies. Instructions that use SCC
are changed to use VCC instead. When that happens, the pass must
also change instructions that define SCC to define VCC.
The pass was not changing the SCC definition when an ADDC is
converted due to an input that is a VGPR to SGPR copy. But the
initial ADD instruction, which defines SCC, is not converted.
This causes a compilation failure due to a use of an undefined
physical register.
This patch adds code that inserts the SCC definition in the
MoveToVALU worklist when a SCC use is converted to a VCC use.
Differential Revision: https://reviews.llvm.org/D102111
This patch adds the abstract class SystemZCallingConventionRegisters
which is a SystemZ-specific class detailing special registers used
by calling conventions on the target. SystemZELFRegisters and
SystemZXPLINK64Registers implement this class for ELF and XPLINK64
respectively.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D102370
moveOperands does not handle moving tied operands since it would
generally have to fix up the tied operand references. Avoid the assert
by untying and retying after the modification. These in-place
modifications really aren't manageable.
Addition of this node allows us to better utilize the different forms of
the SVE BIC instructions, including using the alias to an AND (immediate).
Differential Revision: https://reviews.llvm.org/D101831
We were previously only searching a single preheader for call
instructions when reverting WhileLoopStarts to DoLoopStarts. This
extends that to multiple blocks that can come up when, for example a
loop is expanded from a memcpy. It also expands the instructions searched
from just Calls to also include other LoopStarts, to catch other low
overhead loops in the preheader.
Differential Revision: https://reviews.llvm.org/D102269
These pseudos are converted post-isel into t2WhileLoopStart and
t2LoopEnd/LoopDec instructions, which themselves are defined to clobber
CPSR. Doing the same with the MEMCPY nodes will make sure they are
scheduled correctly to not end up with incorrect uses.
The MachineBasicBlock::iterator is continuously changing while
generating the frame handling instructions. We should use the DebugLoc
from the caller, instead of getting it from the changing iterator.
If the prologue instructions are located in a basic block without any other
instructions after them, the iterator will be
updated to the boundary of the basic block and it is invalid to use the
iterator to access DebugLoc. This patch also fixes the crash when
accessing DebugLoc using the iterator.
Differential Revision: https://reviews.llvm.org/D102386
When a stackified variable has an associated `DBG_VALUE` instruction,
DebugFixup pass adds a `DBG_VALUE` instruction after the stackified
value's last use to clear the variable's debug range info. But when the
last use instruction is a terminator, it can cause a verification
failure (when run with `-verify-machineinstrs`) because there are no
instructions allowed after a terminator.
For example:
```
%myvar = ...
DBG_VALUE target-index(wasm-operand-stack), $noreg, !"myvar", ...
BR_IF 0, %myvar, ...
DBG_VALUE $noreg, $noreg, !"myvar", ...
```
In this test, `%myvar` is stackified, so the first `DBG_VALUE`
instruction's first operand has changed to `wasm-operand-stack` to
denote it. And an additional `DBG_VALUE` instruction is added after its
last use, `BR_IF`, to signal variable `myvar` is not in the operand
stack anymore. But because the `DBG_VALUE` instruction is added after
the `BR_IF`, a terminator, it fails MachineVerifier.
`DBG_VALUE` instructions are used in `DbgEntityHistoryCalculator` to
compute value ranges to emit DWARF info, and it turns out the
`DbgEntityHistoryCalculator` terminates ranges at the end of a BB, so we
don't need to emit `DBG_VALUE` after a terminator.
Fixes https://bugs.llvm.org/show_bug.cgi?id=50175.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D102309
In wasm64, the signatures of some library functions and global variables
defined in Emscripten change:
- `emscripten_longjmp`: `(i32, i32) -> ()` -> `(i64, i32) -> ()`
This changes because the first argument is the address of a memory
buffer. This in turn causes more changes below.
- `setThrew`: `(i32, i32) -> ()` -> `(i64, i32) -> ()`
`emscripten_longjmp` calls `setThrew` with the i64 buffer argument as
the first parameter.
- `__THREW__` (global var): `i32` to `i64`
`setThrew`'s first argument is set to this `__THREW__` variable, so it
should change to i64 as well.
- `testSetjmp`: `(i32, i32*, i32) -> (i32)` -> `(i64, i32*, i32) -> (i32)`
In the code transformation done in this pass, the value of `__THREW__`
is passed as the first parameter of `testSetjmp`.
This patch creates some helper functions to easily get types that become
different depending on the wasm32/wasm64, and uses them to change
various function signatures and code transformations. Also updates the
tests with WASM32/WASM64 check lines.
(Untested) Emscripten side patch: https://github.com/emscripten-core/emscripten/pull/14108
Reviewed By: aardappel
Differential Revision: https://reviews.llvm.org/D101985
This extends any frame record created in the function to include that
parameter, passed in X22.
The new record looks like [X22, FP, LR] in memory, and FP is stored with 0b0001
in bits 63:60 (CodeGen assumes they are 0b0000 in normal operation). The effect
of this is that tools walking the stack should expect to see one of three
values there:
* 0b0000 => a normal, non-extended record with just [FP, LR]
* 0b0001 => the extended record [X22, FP, LR]
* 0b1111 => kernel space, and a non-extended record.
All other values are currently reserved.
If compiling for arm64e this context pointer is address-discriminated with the
discriminator 0xc31a and the DB (process-specific) key.
There is also an "i8** @llvm.swift.async.context.addr()" intrinsic providing
front-ends access to this slot (and forcing its creation initialized to nullptr
if necessary).
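A hedged sketch of how a stack walker might interpret the tag nibble described above (illustrative only, not the actual unwinder code):
```
#include <cstdint>

enum class FrameKind { Normal, ExtendedAsync, Kernel, Reserved };

// Classify a saved frame pointer by bits 63:60.
FrameKind classifySavedFP(uint64_t savedFP) {
  switch (savedFP >> 60) {
  case 0x0: return FrameKind::Normal;        // plain [FP, LR] record
  case 0x1: return FrameKind::ExtendedAsync; // extended [X22, FP, LR] record
  case 0xF: return FrameKind::Kernel;        // kernel space, non-extended record
  default:  return FrameKind::Reserved;      // all other values reserved
  }
}
```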
There are three essentially different cases to handle:
* -O1, no LSE. The IR is expanded to ldxp/stxp and we need patterns to select
them.
* -O0, no LSE. We get G_ATOMIC_CMPXCHG, and need to produce CMP_SWAP_N
pseudos. The registers are all 64-bit so this is easy.
* LSE. We get G_ATOMIC_CMPXCHG and need to produce a CASP instruction with
XSeqPair registers.
The last case is by far the hardest, and adds 128-bit GPR support as a
byproduct.
A consequence is that checkInstOffsetsDoNotOverlap can now distinguish
sp+offset from fp+offset, so it knows that it shouldn't try to work out
whether the accesses overlap just by comparing the offsets. For example
in these two instructions:
MIR:
BUFFER_STORE_DWORD_OFFSET %0:vgpr_32(s32), $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable store 4 into stack + 4, addrspace 5)
%4:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from `i8 addrspace(5)* undef`, addrspace 5)
ISA:
buffer_store_dword v0, off, s[0:3], s32 offset:4
buffer_load_dword v0, off, s[0:3], s34
Differential Revision: https://reviews.llvm.org/D73957
The Arm Architecture Reference Manual says that the SystemHintOp_BTI
opcode is preferred when CRm:op2 matches 0100:xx0, but llvm-mc
currently accepts 0100:xxx, which isn't right.
Differential Revision: https://reviews.llvm.org/D102415
Unlike its legacy SSE XMM XORPS version, which measures as being 1-cycle,
this one is certainly a zero-cycle instruction, in addition to both of them
being dependency breaking.
As confirmed by exegesis measurements, and ref docs.
For gfx10 gradient (g16) and address (a16) can be independent. Previous
implementation assumed that a16 implied g16.
There are some other changes that fix the verification (as well as asm/disasm)
that are required for the included test to pass - the XFAIL will be removed in
those changes.
This also includes required fixes for GlobalISel
Differential Revision: https://reviews.llvm.org/D102066
Change-Id: I7d171cc90994de05f41669b66a6d0ffa2ed05d09
A16 support for image instructions assembly/disassembly (gfx10) was missing
Also refactor MIMG op addr size calcs to common function
We'd got 3 places where the same operation was being done.
One test is now marked XFAIL until a related codegen patch is in place
Differential Revision: https://reviews.llvm.org/D102231
Change-Id: I7e86e730ef8c71901457855cba570581f4f576bb
While both the SOG and Agner insist that it is zero-cycle,
I cannot confirm that claim. While it clearly breaks the dependency,
I cannot come up with a snippet, or measurement approach,
to end up with IPC bigger than 4, which, to me, means that it actually
consumes the execution resources of an FP unit for a cycle.
Added hashst to the prologue and hashchk to the epilogue.
The hash for the prologue and epilogue must always be stored as the first
element in the local variable space on the stack.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D99377
We currently prefer t2CMPrs over t2CMPri when the node contains a shift.
This can introduce more nodes if the shift has multiple uses though, as
the value from the shift will be needed anyway, and a t2CMPri compared
with zero will more readily be removed entirely.
Differential Revision: https://reviews.llvm.org/D101688
Previously we were allowing the use of FP atomics without the
-amdgpu-unsafe-fp-atomics option if the scope is less than
system. This is just as unsafe if we have UC memory.
This change only allows global and flat FP atomics with
the unsafe option. Consequently, that makes a check for
denorm mode redundant since we skip it with the unsafe
option and do not have a way to produce these instructions
without it anyway.
Differential Revision: https://reviews.llvm.org/D102347
The complex selection pattern for add/sub shifted immediates is
incorrect in its handling of incoming constant values, in that it
does not properly anticipate the values being sign extended to
32 bits.
Co-authored-by: Graham Hunter <graham.hunter@arm.com>
Differential Revision: https://reviews.llvm.org/D101833
There are two reasons this shouldn't be restricted to Power8 and up:
1. For XL compatibility
2. Because clang will expand comparison operators to these intrinsics*
*Without this patch, the following causes a selection error:
int test(vector signed long a, vector signed long b) {
return a < b;
}
This patch provides the handling for the intrinsics in the back
end and removes the Power8 guards from the predicate functions
(vec_{all|any}_{eq|ne|gt|ge|lt|le}).
We explicitly made it error out in D101403, out of the good intention that
the error message would make people less confused. It turns out we weren't
failing all cases of wasm EH + SjLj; only a few cases were failing, and
our client was able to get around them by fixing the source code. But now we
made it fail for all cases, so even cases that previously succeeded fail,
which we didn't intend. This reverts that change.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D102364
The VSEW encoding isn't a useful value to pass around. It's better
to use SEW or log2(SEW) directly. The only real ugliness is that
the vsetvli IR intrinsics use the VSEW encoding, but it's easy
enough to decode that when the intrinsic is processed.
On AVX1 targets we can handle v4i64 logical shifts by 32 bits as a pair of v8f32 shuffles with zero.
I was hoping to put this in LowerScalarImmediateShift, but performing it that early causes regressions where other instructions were re-splitting the subvectors.
This patch disables the SIFormMemoryClauses pass at -O1. This pass has a
significant impact on compilation time, so we only want it to be enabled
starting from -O2.
Differential Revision: https://reviews.llvm.org/D101939
getVectorNumElements() returns a value for scalable vectors
without any warning so it is effectively getVectorMinNumElements().
By renaming it and making getVectorNumElements() forward to
it, we can insert a check for scalable vectors into getVectorNumElements()
similar to EVT. I didn't do that in this patch because there are still more
fixes needed, but I was able to temporarily do it and passed the RISCV
lit tests with these changes.
The changes to isPow2VectorType and getPow2VectorType are copied from EVT.
The change to TypeInfer::EnforceSameNumElts reduces the size of AArch64's isel table.
We're now considering SameNumElts to require the scalable property to match which
removes some unneeded type checks.
This was motivated by the bug I fixed yesterday in 80b9510806
Reviewed By: frasercrmck, sdesmalen
Differential Revision: https://reviews.llvm.org/D102262
When a ptest is used to set flags from the output of rdffr, the ptest
can be eliminated, using a flags-setting rdffrs instead.
Additionally, check that nothing consumes flags between rdffr and ptest;
this case appears to have been missed previously.
* There is no unpredicated RDFFRS instruction.
* If substituting RDFFR_PP, require that the mask argument of the
PTEST matches that of the RDFFR_PP.
* Move some precondition code up inside optimizePTestInstr, so that it
covers the new code paths for RDFFR which return earlier.
* Only consider RDFFR, PTEST in same basic block.
* Check for other flag setting instructions between the two, abort if
found.
* Drop an old TODO comment about removing dead PTEST instructions.
RDFFR_P to follow in later patch.
Differential Revision: https://reviews.llvm.org/D101357
Improve the code generation of build_vector.
Use the v_pack_b32_f16 instruction instead of
v_and_b32 + v_lshl_or_b32
Differential Revision: https://reviews.llvm.org/D98081
Patch by Julien Pagès!
Extend the HOP(HOP(X,Y),HOP(Z,W)) and SHUFFLE(HOP(X,Y),HOP(Z,W)) folds to handle repeating 256/512-bit vector cases.
This allows us to drop the UNPACK(HOP(),HOP()) custom fold in combineTargetShuffle.
This required isRepeatedTargetShuffleMask to be tweaked to support target shuffle masks taking more than 2 inputs.
The sve.convert.to.svbool lowering has the effect of widening a logical
<M x i1> vector representing lanes into a physical <16 x i1> vector
representing bits in a predicate register.
In general, if converting to svbool, the contents of lanes in the
physical register might not be known. For sve.convert.to.svbool the new
lanes are specified to be zeroed, requiring 'and' instructions to mask
off the new lanes. For lanes coming from a ptrue or a comparison,
however, they are known to be zero.
CodeGen Before:
ptrue p0.s, vl16
ptrue p1.s
ptrue p2.b
and p0.b, p2/z, p0.b, p1.b
ret
After:
ptrue p0.s, vl16
ret
Differential Revision: https://reviews.llvm.org/D101544
No need to handle invariant loads when avoiding WAR conflicts, as
there cannot be a vector store to the same memory location.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D101177
Based on the same for AArch64: 4751cadcca
At -O0, the fast register allocator may insert spills between the ldrex and
strex instructions inserted by AtomicExpandPass when expanding atomicrmw
instructions in LL/SC loops. To avoid this, expand to cmpxchg loops and
therefore expand the cmpxchg pseudos after register allocation.
Required a tweak to ARMExpandPseudo::ExpandCMP_SWAP to use the 4-byte encoding
of UXT, since the pseudo instruction can be allocated a high register (R8-R15)
which the 2-byte encoding doesn't support. However, the 4-byte encodings
are not present for ARM v8-M Baseline. To enable this, two new pseudos are
added for Thumb which are only valid for v8mbase, tCMP_SWAP_8 and
tCMP_SWAP_16.
The previously committed attempt in D101164 had to be reverted due to runtime
failures in the test suites. Rather than spending time fixing that
implementation (adding another implementation of atomic operations and more
divergence between backends) I have chosen to follow the approach taken in
D101163.
Differential Revision: https://reviews.llvm.org/D101898
Depends on D101912
Currently the ValueHandler handles both selecting the type and
location for arguments, as well as inserting instructions needed to
handle them. Split this so that the determination of the argument
handling is independent of the function state. Currently the checks
for tail call compatibility do not follow the full assignment logic,
so it misses cases where arguments require nontrivial legalization.
This should help avoid targets ending up in a buggy state where the
argument evaluation may change in different contexts.
We can handle the distinction easily enough in the generic code, and
this makes it easier to abstract the selection of type/location from
the code that inserts the handling code.
The waitcnt pass would increment the number of vmem events for some buffer
invalidates that were not handled by the pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D102252
When an integer is converted into floating point in a subword vector extract,
it can be done in 2 instructions instead of the 3+ instructions generated
right now. This patch removes the unnecessary generation.
Differential: https://reviews.llvm.org/D100604
Similar to X86 D73230 and AArch64 D101872
With this change, we can set dso_local in clang's -fpic -fno-semantic-interposition mode,
for default visibility external linkage non-ifunc-non-COMDAT definitions.
For such dso_local definitions, variable access/taking the address of a
function/calling a function will go through a local alias to avoid GOT/PLT.
Reviewed By: jrtc27, luismarques
Differential Revision: https://reviews.llvm.org/D101875
My thought process is that if v2i64 is an LMUL=1 type then v2i32
should be an LMUL=1/2 type. We limit the fractional LMUL so that
SEW=64 clips to LMUL=1, SEW=32 clips to LMUL=1/2, etc. This
ensures there's always a fractional LMUL available to truncate a type.
This does reduce the number of vsetvlis in some cases.
Some tests increase vsetvlis because the best container type for a
mask type is dependent on the LMUL+SEW that the mask was produced
from, but you can't tell that from the type. I think this is
something we need to solve in the machine IR when optimizing
vsetvlis.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D101215
I've seen this in RawSpeed's BitPumpMSB*::push() hotpath,
after fixing the buffer abstraction to a saner one,
when looking into a +5% runtime regression.
I was hoping that this would fix it, but it does not look like it does.
This seems to be at least no worse than the original pattern.
But I'm actually mainly interested in the case where we already
compute `(y+32)` (see last test),
https://alive2.llvm.org/ce/z/ZCzJio
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D101944
Limited to splats because we would need to truncate the shift
amount vector otherwise.
I tried to do this with new ISD nodes and a DAG combine to
avoid such a large pattern, but we don't form the splat until
LegalizeDAG and need DAG combine to remove a scalable->fixed->scalable
cast before it becomes visible to the shift node. By the time that
happens we've already visited the truncate node and won't revisit it.
I think I have an idea how to improve i64 on RV32 I'll save for a
follow up.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D102019
Now that getMemoryOpCost() correctly handles all the vector variants,
we should no longer hand-roll our own version of it, but use it directly.
The AVX512 variant probably needs a similar change,
but there it is less obvious.
foldShuffleOfHorizOp only handled basic shufps(hop(x,y),hop(z,w)) folds - by moving this to canonicalizeShuffleMaskWithHorizOp we can work with more general/combined v4x32 shuffles masks, float/integer domains and support shuffle-of-packs as well.
The next step will be to support 256/512-bit vector cases.
Instead of handling power-of-two sized vector chunks,
try handling the large vector in a stream mode,
decreasing the operational vector size
once it no longer works for the elements left to process.
Notably, this improves costs for overaligned loads - loading padding is fine.
This more directly tracks when we need to insert/extract the YMM/XMM subvector,
some costs fluctuate because of that.
Reviewed By: RKSimon, ABataev
Differential Revision: https://reviews.llvm.org/D100684
Moving code sinking pass before structurizer creates more sinking
opportunities.
The extra flow edges introduced by the structurizer can have adverse
effects on sinking, because the sinking pass prefers moving instructions
to blocks with unique predecessors and the structurizer destroys that
property in some cases.
A notable example is moving high-latency image instructions across kills.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D101115
The stack frame update code does not take into consideration spilling
to registers for callee saved registers. The option -ppc-enable-pe-vector-spills
turns on spilling to registers for callee saved registers and may expose a bug
in the code that moves a stack frame pointer update instruction.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D101366
As we have been missing support for WebAssembly globals on the IR level,
the lowering of WASM_SYMBOL_TYPE_GLOBAL to IR was incomplete. This
commit fleshes out the lowering support, lowering references to and
definitions of addrspace(1) values to correctly typed
WASM_SYMBOL_TYPE_GLOBAL symbols.
Depends on D101608.
Differential Revision: https://reviews.llvm.org/D101913
This patch adds support for WebAssembly globals in LLVM IR, representing
them as pointers to global values, in a non-default, non-integral
address space. Instruction selection legalizes loads and stores to
these pointers to new WebAssemblyISD nodes GLOBAL_GET and GLOBAL_SET.
Once the lowering creates the new nodes, tablegen pattern matches those
and converts them to Wasm global.get/set of the appropriate type.
Based on work by Paulo Matos in https://reviews.llvm.org/D95425.
Reviewed By: pmatos
Differential Revision: https://reviews.llvm.org/D101608
For Zvlsseg spilling, we need to convert the pseudo instructions
into multiple vector load/store instructions with appropriate offsets.
For example, for PseudoVSPILL3_M2, we need to convert it to
VS2R %v2, %base
ADDI %base, %base, (vlenb x 2)
VS2R %v4, %base
ADDI %base, %base, (vlenb x 2)
VS2R %v6, %base
We need to keep the size of the offset in the pseudo spilling instructions.
In this case, it is (vlenb x 2).
In the original implementation, we use the size of frame objects divide the
number of vectors in zvlsseg types. The size of frame objects is not
necessary exactly the same as the spilling data. It may be larger than
it. So, we change it to (VLENB x LMUL) in this patch. The calculation is
more direct and easy to understand.
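A minimal sketch of the offset calculation described above, assuming NF register groups spaced VLENB x LMUL bytes apart (names here are illustrative, not the actual pass code):
```
#include <cstdint>
#include <vector>

std::vector<uint64_t> zvlssegSpillOffsets(uint64_t vlenb, unsigned lmul, unsigned nf) {
  std::vector<uint64_t> offsets;
  for (unsigned i = 0; i < nf; ++i)
    offsets.push_back(uint64_t(i) * vlenb * lmul); // e.g. 0, 2*vlenb, 4*vlenb for LMUL=2, NF=3
  return offsets;
}
```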
Differential Revision: https://reviews.llvm.org/D101869
This change was originally landed in: 5000a1b4b9
It was reverted in: 061e071d8c
This change adds support for a new WASM_SEG_FLAG_STRINGS flag in
the object format which works in a similar fashion to SHF_STRINGS
in the ELF world.
Unlike the ELF linker this support is currently limited:
- No support for SHF_MERGE (non-string merging)
- Always do full tail merging ("lo" can be merged with "hello")
- Only support single byte strings (p2align 0)
Like the ELF linker merging is only performed at `-O1` and above.
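A quick illustration of the "full tail merging" rule mentioned above: a string can be folded into another whenever it is a suffix of it (a minimal sketch, not the linker implementation):
```
#include <string_view>

// "lo" can be merged into "hello" because it is a suffix of it.
bool canTailMerge(std::string_view small, std::string_view big) {
  return big.size() >= small.size() &&
         big.compare(big.size() - small.size(), small.size(), small) == 0;
}
```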
This fixes part of https://bugs.llvm.org/show_bug.cgi?id=48828,
although crucially it does not currently support debug sections
because they are not represented by data segments (they are custom
sections).
Differential Revision: https://reviews.llvm.org/D97657
This is roughly equivalent to the floating point portion of
`AArch64TargetLowering::LowerVSETCC`. Main part that's missing is the v4s16 bit.
This also adds helpers equivalent to `EmitVectorComparison`, and
`changeVectorFPCCToAArch64CC`. This moves `changeFCMPPredToAArch64CC` out of
the selector into AArch64GlobalISelUtils for the sake of code reuse.
This is done in post-legalizer lowering with pseudos to simplify selection.
The imported patterns end up handling selection for us this way.
Differential Revision: https://reviews.llvm.org/D101782
The combines in `tryCombineMemCpyFamily` have heuristics (e.g.
`TLI.getMaxStoresPerMemset`) which consider size. So, theoretically, enabling
these combines on minsize functions shouldn't be harmful.
With this enabled we save 0.9% geomean on CTMark at -Oz, and 5.1% on Bullet.
There are no code size regressions.
Differential Revision: https://reviews.llvm.org/D102198
This change adds support for a new WASM_SEG_FLAG_STRINGS flag in
the object format which works in a similar fashion to SHF_STRINGS
in the ELF world.
Unlike the ELF linker this support is currently limited:
- No support for SHF_MERGE (non-string merging)
- Always do full tail merging ("lo" can be merged with "hello")
- Only support single byte strings (p2align 0)
Like the ELF linker merging is only performed at `-O1` and above.
This fixes part of https://bugs.llvm.org/show_bug.cgi?id=48828,
although crucially it does not currently support debug sections
because they are not represented by data segments (they are custom
sections).
Differential Revision: https://reviews.llvm.org/D97657
If spills are to registers instead of to the stack then a copy will be used
and frame index scavenging is not required.
This patch adds debug info to frame index scavenging and makes sure that
spilling to registers does not cause frame index scavenging.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D101360
Analogously to https://reviews.llvm.org/D98794 this patch uses the
`alignstack` attribute to fix incorrect passing of homogeneous
aggregate (HA) arguments on AArch32. The EABI/AAPCS was recently
updated to clarify how VFP co-processor candidates are aligned:
4488e34998
Differential Revision: https://reviews.llvm.org/D100853
Expanding a fixed length operation involves wrapping the operation in an
insert/extract subvector pair, as such, when this is done to bitcast we
end up with an extract_subvector of a bitcast. DAGCombine tries to
convert this into a bitcast of an extract_subvector which restores the
initial fixed length bitcast, causing an infinite loop of legalization.
As part of this patch, we must make sure the above DAGCombine does not
trigger after legalization if the created bitcast would not be legal.
Differential Revision: https://reviews.llvm.org/D101990
When using predicated intrinsics, if the predicate used is all lanes active,
use an unpredicated form of the instruction, additionally this allows for
better use of immediate forms.
This only includes instructions where the unpredicated/predicated forms
matched in such a way that instruction selection would not introduce extra
ptrue instructions. This allows us to convert the intrinsics directly to
architecture independent ISD nodes.
Depends on D101062
Differential Revision: https://reviews.llvm.org/D101828
When using predicated arithmetic intrinsics, if the predicate used is all
lanes active, use an unpredicated form of the instruction, additionally
this allows for better use of immediate forms.
This also includes a new complex isel pattern which allows matching an
all active predicate when the types are different but the predicate is a
superset of the type being used. For example, to allow a b8 ptrue for a
b32 predicate operand.
This only includes instructions where the unpredicated/predicated forms
are mismatched between variants, meaning that the removal of the
predicate is done during instruction selection in order to prevent
spurious re-introductions of ptrue instructions.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D101062
DAGCombiner tries to combine a (fpext (load)) to (fround (extload))
but SVE has no FP-extending loads. By marking these as expand,
the combine no longer happens.
This also fixes a similar issue for fptrunc, where the source type
is not a legal type.
Reviewed By: bsmith, kmclaughlin
Differential Revision: https://reviews.llvm.org/D102053
Large loads on targets that do not useFlatForGlobal have to be split
in regbankselect. This did not happen in the case when the destination had a vgpr
bank and the address had an sgpr bank.
Instead of checking if address bank is sgpr check bank of the destination.
Differential Revision: https://reviews.llvm.org/D101992
This patch extends VectorLegalizer::ExpandSELECT to permit expansion
also for scalable vector types. The only real change is conditionally
checking for BUILD_VECTOR or SPLAT_VECTOR legality depending on the
vector type.
We can use this to fix "cannot select" errors for scalable vector
selects on the RISCV target. Note that in future patches RISCV will
possibly custom-lower vector SELECTs to VSELECTs for branchless codegen.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D102063
Since index_vector is lowered into step_vector in D100816, we can just remove
index_vector and use step_vector for codegen directly.
Differential Revision: https://reviews.llvm.org/D101593
As confirmed by exegesis measurements, and ref docs.
It does actually execute.
While there, bump latency for MULX32rr, that seems to match measurements.
To the best of my knowledge, all instructions are modelled,
and have reasonable values to them; flipping the switch
doesn't cause any diff for MCA tests, so either we're good,
or we have test coverage gaps.
I'm not really sure why no other X86 sched model is marked as complete.
Currently we model i16 bswap as very high cost (`10`),
which doesn't seem right, with all others being at `1`.
Regardless of `MOVBE`, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
I think it should at worst have throughput of `1`:
Since i32/i64 already have cost of `1`,
`MOVBE` doesn't improve their costs any further.
BUT, `MOVBE` must have at least a single memory operand,
with other being a register. Which means, if we have
a bswap of load, iff load has a single use,
we'll fold bswap into load.
Likewise, if we have store of a bswap, iff bswap
has a single use, we'll fold bswap into store.
So I think we should treat such a bswap as free,
unless of course we know that for the particular CPU
they are performing badly.
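For reference, the i16 lowering mentioned above amounts to a rotate by 8 (a minimal sketch, which most compilers recognise as a single rotate):
```
#include <cstdint>

// 16-bit byte swap == rotate by 8.
uint16_t bswap16(uint16_t x) {
  return static_cast<uint16_t>((x << 8) | (x >> 8));
}
```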
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D101924
Printing pass manager invocations is fairly verbose and not super
useful.
This allows us to remove DebugLogging from pass managers and PassBuilder
since all logging (aside from analysis managers) goes through
instrumentation now.
This has the downside of never being able to print the top level pass
manager via instrumentation, but that seems like a minor downside.
Reviewed By: ychen
Differential Revision: https://reviews.llvm.org/D101797
We never bothered to have a separate set of combines for -O0 in the prelegalizer
before. This results in some minor performance hits for a mode where performance
isn't a concern (although not regressing code size significantly is still preferable).
This also removes the CSE option since we don't need it for -O0.
Through experiments, I've arrived at a set of combines that gets the most code
size improvement at -O0, while reducing the amount of time spent in the combiner
by around 35% give or take.
Differential Revision: https://reviews.llvm.org/D102038
Using `clampScalar` here because we ought to mark s128 as custom eventually.
(Right now, it will just fall back.)
With this legalization, we get the same code as SDAG:
https://godbolt.org/z/TneoPKrKG
Differential Revision: https://reviews.llvm.org/D100908
It measures as such, and the reference docs agree.
I can't easily add a MCA test, because there's no mnemonic for it,
it can only be disassembled or created as a MCInst.
Similar to X86 D73230 & 46788a21f9
With this change, we can set dso_local in clang's -fpic -fno-semantic-interposition mode,
for default visibility external linkage non-ifunc-non-COMDAT definitions.
For such dso_local definitions, variable access/taking the address of a
function/calling a function will go through a local alias to avoid GOT/PLT.
Note: the 'S' inline assembly constraint refers to an absolute symbolic address
or a label reference (D46745).
Differential Revision: https://reviews.llvm.org/D101872
Ensure we don't try to fold when one might be an opaque constant - the constant fold will fail and then the reverse fold will happen in DAGCombine.....
Sometimes disassembler picks _REV variants of instructions
over the plain ones, which in this case exposed an issue
that the _REV variants aren't being modelled as optimizable moves.
I've verified this with llvm-exegesis.
This is not limited to zero registers.
Refs:
AMD SOG 19h, 2.9.4 Zero Cycle Move
The processor is able to execute certain register to register
mov operations with zero cycle delay.
Agner,
22.13 Instructions with no latency
Register-to-register move instructions are resolved at
the register rename stage without using any execution units.
These instructions have zero latency. It is possible to do six such
register renamings per clock cycle, and it is even possible to
rename the same register multiple times in one clock cycle.
gfx9 does not work with negative offsets, gfx10 works only with
aligned negative offsets, but not with unaligned negative offsets.
This is slightly more conservative than needed, gfx9 does support
negative offsets when a VGPR address is used and gfx10 supports
negative, unaligned offsets when an SGPR address is used, but we
do not make use of that with this patch.
Differential Revision: https://reviews.llvm.org/D101292
Retrying after revert and fix (removed implicit def flag from operand). Now
passes with expensive_checks enabled.
Since there is a single scratch resource descriptor for all shaders, if there is
a wave32 and a wave64 shader (for instance for VsFs pairs)
then the const_index_stride will be incorrect for wave32 shaders.
Differential Revision: https://reviews.llvm.org/D101830
Change-Id: Ie3b8b2921237968caca91527dd0c97b1b0cc0360
This patch converts llvm.memset intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
The llvm.memset is converted to a TP loop for both
constant and non-constant input sizes (of llvm.memset).
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D100435
Based off a discussion on D89281 - where the AARCH64 implementations were being replaced to use funnel shifts.
Any target that has efficient funnel shift lowering can handle the shift parts expansion using the same expansion, avoiding a lot of duplication.
I've generalized the X86 implementation and moved it to TargetLowering - so far I've found that AARCH64 and AMDGPU benefit, but many other targets (ARM, PowerPC + RISCV in particular) could easily use this with a few minor improvements to their funnel shift lowering (or the folding of their target ops that funnel shifts lower to).
NOTE: I'm trying to avoid adding full SHIFT_PARTS legalizer handling as I think it might actually be possible to remove these opcodes in the medium-term and use funnel shift / libcall expansion directly.
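As a rough sketch of the shared expansion idea (illustrative C++ on 64-bit halves, not the TargetLowering code): a 128-bit shift-left decomposes into a funnel shift for the high half and a plain shift for the low half.
```
#include <cstdint>
#include <utility>

// Funnel shift left of hi:lo by s (0 <= s < 64): (hi << s) | (lo >> (64 - s)).
static uint64_t fshl64(uint64_t hi, uint64_t lo, unsigned s) {
  return s ? (hi << s) | (lo >> (64 - s)) : hi;
}

// Expand a 128-bit shift-left into 64-bit parts, returning {new hi, new lo}.
std::pair<uint64_t, uint64_t> shl128(uint64_t hi, uint64_t lo, unsigned amt) {
  amt &= 127;
  if (amt >= 64)
    return {lo << (amt - 64), 0}; // low half moves entirely into the high half
  return {fshl64(hi, lo, amt), lo << amt};
}
```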
Differential Revision: https://reviews.llvm.org/D101987
Since there is a single scratch resource descriptor for all shaders, if there is
a wave32 and a wave64 shader (for instance for VsFs pairs)
then the const_index_stride will be incorrect for wave32 shaders.
Differential Revision: https://reviews.llvm.org/D101830
Change-Id: Id8de5566b0d1a07a814e2e7db016df9d20bf6d2c
I've verified this with llvm-exegesis.
This is not limited to zero registers.
Refs:
AMD SOG 19h, 2.9.4 Zero Cycle Move
The processor is able to execute certain register to register
mov operations with zero cycle delay.
Agner,
22.13 Instructions with no latency
Register-to-register move instructions are resolved at
the register rename stage without using any execution units.
These instructions have zero latency. It is possible to do six such
register renamings per clock cycle, and it is even possible to
rename the same register multiple times in one clock cycle.
GNU as documentation states that a `.thumb_func` directive implies `.thumb`, teach the asm parser to switch mode whenever it's encountered. On the other hand the labeled form, exclusive to Apple's toolchain, doesn't switch mode at all.
Reviewed By: nickdesaulniers, peter.smith
Differential Revision: https://reviews.llvm.org/D101975
Serialize ScavengeFI from SIMachineFunctionInfo into yaml.
ScavengeFI is not used outside of the PrologEpilogInserter,
so this shouldn't change anything.
Differential Revision: https://reviews.llvm.org/D101367
This is a simple fix on LE. On BE, vector shuffles are categorized into
different ops. We may need more work to eliminate these in
tablegen/pre-isel.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D101605
Lorenz Bauer reported an issue in bpf mailing list ([1]) where
for FIELD_EXISTS relocation, if the object is an array subscript,
the patched immediate is the object offset from the base address,
instead of 1.
Currently in BPF AbstractMemberAccess pass, the final offset
from the base address is the patched offset except FIELD_EXISTS
which is 1 unconditionally. In this particular case, the last
data structure access is not a field (struct/union offset)
so it didn't hit the place to set patched immediate to be 1.
This patch fixed the issue by checking the relocation type.
If the type is FIELD_EXISTS, just set to 1.
Tested by modifying some bpf selftests, libbpf is okay with
such types with FIELD_EXISTS relocation.
[1] https://lore.kernel.org/bpf/CACAyw99n-cMEtVst7aK-3BfHb99GMEChmRLCvhrjsRpHhPrtvA@mail.gmail.com/
Differential Revision: https://reviews.llvm.org/D102036
This patch converts llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
From an implementation point of view, the patch
- adds an ARM-specific SDAG node (to which the llvm.memcpy intrinsic is lowered during the first phase of ISel)
- adds a corresponding TableGen entry to generate a pseudo instruction, with a custom inserter,
on matching the above node.
- adds a custom inserter function that expands the pseudo instruction into MIR suitable
to be transformed (by later passes) into a WLSTP loop.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99723
Use result_type for the IMPLICIT_DEF in masked vector patterns.
This doesn't matter today because result_type and op_type are
always the same.
Use multiclass inheritance to reduce repeated code.
Expand 128 bit shifts instead of using a libcall.
This patch removes the 128 bit shift libcalls and thereby causes
ExpandShiftWithUnknownAmountBit() to be called.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D101993
Rename RVInstR4 as used by F/D/Zfh extensions to RVInstR4Frm.
Introduce new RVInstR4 that takes funct3 as a parameter.
Add new format classes for FSRI and FSRIW instead of trying to
bend RVInstR4 to use a shamt overlayed on rs2 and funct2.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100427
AMDGPUAsmParser::isSupportedDPPCtrl() was failing to correctly
find a DPP register operand; regardless of its position it is
always src0. Moved this check into a new validateDPP() method
where we have full instruction already. In particular it was
failing to reject this case:
v_cvt_u32_f64 v5, v[0:1] quad_perm:[0,2,1,1] row_mask:0xf bank_mask:0xf
Essentially it was broken for any case where the sizes of dst and
src0 differ.
It also improves the diagnostics with a proper error message.
The check in the InstPrinter also drops verification of the dst
register as it does not have anything to do with the dpp operand.
Differential Revision: https://reviews.llvm.org/D101930
- Add branch absolute relocation R_RBA, R_TLS relocation for the variable offset
for the tlsgd model, and R_TLSM for the region handle for the tlsgd model
- Properly set the relocation fixed values for R_TLS and R_TLSM
- Emit the TCEntry with the variant kind in the XCOFFStreamer
Reviewed by: sfertile, nemanjai, DiggerLin
Differential Revision: https://reviews.llvm.org/D100214
Instruction test for inactive kill/demote needs to be based on the
actual opcode, not whether the instruction would be lowered to demote.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D101966
In order to use __builtin_frame_address(0) with packed stack and no
backchain, the address of where the backchain would have been written is
returned (like GCC).
This address may either contain a saved register or be unused.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D101897
First clean up the strange API of tryConstantFoldOp where it took an
immediate operand value, but no indication of which operand it was the
value for.
Second clean up the loop that calls tryConstantFoldOp so that it does
not have to restart from the beginning every time it folds an
instruction.
This is NFCI but there are some minor changes caused by the order in
which things are folded.
Differential Revision: https://reviews.llvm.org/D100031
This patch converts llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
From an implementation point of view, the patch
- adds an ARM-specific SDAG node (to which the llvm.memcpy intrinsic is lowered during the first phase of ISel)
- adds a corresponding TableGen entry to generate a pseudo instruction, with a custom inserter,
on matching the above node.
- adds a custom inserter function that expands the pseudo instruction into MIR suitable
to be transformed (by later passes) into a WLSTP loop.
Note: A cli option is used to control the conversion of memcpy to TP
loop and this option is currently disabled by default. It may be enabled
in the future after further downstream testing.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99723
Unlike normal loads these don't have an extension field, but we know
from TargetLowering whether these are sign-extending or zero-extending,
and so can optimise away unnecessary extensions.
This was noticed on RISC-V, where sign extensions in the calling
convention would result in unnecessary explicit extension instructions,
but this also fixes some Mips inefficiencies. PowerPC sees churn in the
tests as all the zero extensions are only for promoting 32-bit to
64-bit, but these zero extensions are still not optimised away as they
should be, likely due to i32 being a legal type.
This also simplifies the WebAssembly code somewhat, which currently
works around the lack of target-independent combines with some ugly
patterns that break once they're optimised away.
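For illustration only (assuming RV64, where 32-bit loads and return values are sign-extended to 64 bits), a case where the extra extension becomes removable might look like this C++ sketch:
```
#include <atomic>
#include <cstdint>

int32_t observe(std::atomic<int32_t> &flag) {
  // The i32 result must be sign-extended for the return register, but the
  // 32-bit atomic load already yields a sign-extended value, so a separate
  // sext instruction is redundant once the combine knows that.
  return flag.load(std::memory_order_acquire);
}
```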
Re-landed with correct handling in ComputeNumSignBits for Tmp == VTBits,
where zero-extending atomics were incorrectly returning 0 rather than
the (slightly confusing) required return value of 1.
Reviewed By: RKSimon, atanasyan
Differential Revision: https://reviews.llvm.org/D101342
This shall speedup compilation and also remove threshold
limitations used by memory dependency analysis.
It also seems to fix the bug in coalescer_remat.ll
where an SMRD load was used in the presence of a potentially
clobbering store.
Fixes: SWDEV-272132
Differential Revision: https://reviews.llvm.org/D101962
Preexisting waitcnt may not update the scoreboard if the instruction
being examined needed to wait on fewer counters than what was encoded in
the old waitcnt instruction. Fixing this results in the elimination of
some redundant waitcnt.
These changes also enable combining consecutive waitcnt into a single
S_WAITCNT or S_WAITCNT_VSCNT instruction.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100281
It simplifies the logic by moving the predecessor (preHeader or its predecessor) above the target (or loopExit),
instead of moving the target to after the predecessor.
Since the loopExit is no longer being moved, directions of any branches within/to it are unaffected.
While the predecessor is being moved, the backwards movement simplifies some considerations,
and the only consideration now required is that a forward WLS to the predecessor should not become backwards.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D100094
Adjust the sanity check in the register parsing function to allow register
names with more than 2 characters (e.g. ccr).
Differential Revision: https://reviews.llvm.org/D101733
- Removes the mention of fastcomp, which is deprecated.
- Some functions in Emscripten have moved from JS glue code to
compiler-rt/emscripten_setjmp.c and
compiler-rt/emscripten_exception_builtins.c. This fixes comments about
that.
Reviewed By: sbc100
Differential Revision: https://reviews.llvm.org/D101812
This change enables emitting CFI unwind information for debugging purpose
for targets with MCAsmInfo::ExceptionsType == ExceptionHandling::None.
Currently generating CFI unwind information is entangled with supporting
the exceptions, even when AsmPrinter explicitly recognizes that the unwind
tables are being generated as debug information.
In fact, the unwind information is not generated even if we specify
--force-dwarf-frame-section, unless exceptions are enabled. The LIT test
llvm/test/CodeGen/AMDGPU/debug_frame.ll demonstrates this behavior.
Enable this option for AMDGPU to prepare for future patches which add
complete CFI support.
Reviewed By: dblaikie, MaskRay
Differential Revision: https://reviews.llvm.org/D78778
Widen 1 and 2 byte scalar loads to 4 bytes when sufficiently
aligned to avoid using a global load.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D100430
An address can be a uniform sum of two i64 values.
That regularly happens in a loop where the index is an induction
variable promoted to 64 bit by the LSR. We can materialize
zero in a VGPR and still use SADDR form of the load.
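A purely illustrative C++ sketch of the kind of loop where this shows up (names made up; whether LSR promotes the index this way depends on the target and options):
```
#include <cstdint>

// The index is widened to 64 bits, so each address is a uniform sum of the
// 64-bit base pointer and the widened induction variable; with a zero
// materialized in a VGPR the SADDR form of the load can still be used.
int64_t sumAll(const int32_t *base, int64_t n) {
  int64_t acc = 0;
  for (int64_t i = 0; i < n; ++i)
    acc += base[i];
  return acc;
}
```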
Differential Revision: https://reviews.llvm.org/D101591
The custom insert of an unmerge and the callback weirdness should be
unnecessary. Since handleAssignments should now use
getRegisterTypeForCalling conv as SelectionDAG builder would, this
should now just be able to use the generic code. X86-32 relies on the
generated CCAssignFns not seeing illegal types and sharing code with
x86_64, so i64 values would incorrectly be assigned to 64-bit
registers.
Unfortunately the current call lowering code is built on top of the
legacy MVT/DAG based code. However, GlobalISel was not using it the
same way. In short, the DAG passes legalized types to the assignment
function, and GlobalISel was passing the original raw type if it was
simple.
I do believe the DAG lowering is conceptually broken since it requires
picking a type up front before knowing how/where the value will be
passed. This ends up being a problem for AArch64, which wants to pass
i1/i8/i16 values as a different size if passed on the stack or in
registers.
The argument type decision is split across 3 different places which is
hard to follow. SelectionDAG builder uses
getRegisterTypeForCallingConv to pick a legal type, tablegen gives the
illusion of controlling the type, and the target may have additional
hacks in the C++ part of the call lowering. AArch64 hacks around this
by not using the standard AnalyzeFormalArguments and special casing
i1/i8/i16 by looking at the underlying type of the original IR
argument.
I believe people have generally assumed the calling convention code is
processing the original types, and I've discovered a number of dead
paths in several targets.
x86 actually relies on the opposite behavior from AArch64, and relies
on x86_32 and x86_64 sharing calling convention code where the 64-bit
cases implicitly do not work on x86_32 due to using the pre-legalized
types.
AMDGPU targets without legal i16/f16 have always used a broken ABI
that promotes to i32/f32. GlobalISel accidentally fixed this to be the
ABI we should have, but this fixes it so we're using the worse ABI
that is compatible with the DAG. Ideally we would fix the DAG to match
the old GlobalISel behavior, but I don't wish to fight that battle.
A new native GlobalISel call lowering framework should let the target
process the incoming types directly.
CCValAssigns select a "ValVT" and "LocVT" but the meanings of these
aren't entirely clear. Different targets don't use them consistently,
even within their own call lowering code. My current belief is the
intent was "ValVT" is supposed to be the legalized value type to use
in the end, and LocVT was supposed to be the ABI passed type
(which is also legalized).
With the default CCState::Analyze functions always passing the same
type for these arguments, these only differ when the TableGen part of
the lowering decide to promote the type from one legal type to
another. AArch64's i1/i8/i16 hack ends up inverting the meanings of
these values, so I had to add an additional hack to let the target
interpret how large the argument memory is.
Since targets don't consistently interpret ValVT and LocVT, this
doesn't produce quite equivalent code to the initial DAG
lowerings. I've opted to consistently interpret LocVT as the in-memory
size for stack passed values, and ValVT as the register type to assign
from that memory. We therefore produce extending loads directly out of
the IRTranslator, whereas the DAG would emit regular loads of smaller
values. This will also produce loads/stores that are wider than the
argument value if the allocated stack slot is larger (and there will
be undef padding bytes). If we had the optimizations to reduce
load/stores based on truncated values, this wouldn't produce a
different end result.
Since ValVT/LocVT are more consistently interpreted, we now will emit
more G_BITCASTS as requested by the CCAssignFn. For example AArch64
was directly assigning types to some physical vector registers which
according to the tablegen spec should have been casted to a vector
with a different element type.
This also moves the responsibility for inserting
G_ASSERT_SEXT/G_ASSERT_ZEXT from the target ValueHandlers into the
generic code, which is closer to how SelectionDAGBuilder works.
I had to xfail an x86 test since I don't see a quick way to fix it
right now (I filed bug 50035 for this). It's broken independently of
this change, and only triggers since now we end up with more ands
which hit the improperly handled selection pattern.
I also observed that FP arguments that need promotion (e.g. f16 passed
as f32) are broken, and use regular G_TRUNC and G_ANYEXT.
TLDR; the current call lowering infrastructure is bad and nobody has
ever understood how it chooses types.
The WebAssembly SIMD intrinsics in wasm_simd128.h generally try not to require
any particular alignment for memory operations to be maximally flexible. For
builtin memory access functions and their corresponding LLVM IR intrinsics,
there's no way to set the expected alignment, so the best we can do is set the
alignment to 1 in the backend. This change means that the alignment hints in the
emitted code will no longer be incorrect when users use the intrinsics to access
unaligned data.
Differential Revision: https://reviews.llvm.org/D101850
This untangles the MCContext and the MCObjectFileInfo. There is a circular
dependency between MCContext and MCObjectFileInfo. Currently this dependency
also exists during construction: you can't construct a MOFI without an MCContext,
and you can't construct that MCContext without a dummy version of the MOFI first.
This removes this dependency during construction. In a perfect world,
MCObjectFileInfo wouldn't depend on MCContext at all, but only be stored in the
MCContext, like other MC information. This is future work.
This also shifts/adds more information to the MCContext making it more
available to the different targets. Namely:
- TargetTriple
- ObjectFileType
- SubtargetInfo
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101462
This seems to have broken sanitizers, giving lots of
Assertion `NumBits <= MAX_INT_BITS && "bitwidth too large"' failed.
failures across multiple targets (currently X86 and PowerPC). Reverting
until I have a chance to reproduce and debug.
This reverts commit 6e876f9ded.
Unlike normal loads these don't have an extension field, but we know
from TargetLowering whether these are sign-extending or zero-extending,
and so can optimise away unnecessary extensions.
This was noticed on RISC-V, where sign extensions in the calling
convention would result in unnecessary explicit extension instructions,
but this also fixes some Mips inefficiencies. PowerPC sees churn in the
tests as all the zero extensions are only for promoting 32-bit to
64-bit, but these zero extensions are still not optimised away as they
should be, likely due to i32 being a legal type.
This also simplifies the WebAssembly code somewhat, which currently
works around the lack of target-independent combines with some ugly
patterns that break once they're optimised away.
Reviewed By: RKSimon, atanasyan
Differential Revision: https://reviews.llvm.org/D101342
This patch fixes an issue where a pre-indexed store, e.g.
STR x1, [x0, #24]!, and a store like STR x0, [x0, #8] were
merged into a single store STP x1, x0, [x0, #24]!.
They shouldn't be merged because the second store uses x0 as both
the stored value and the address, so it needs the updated x0.
Therefore, it should not be folded into an STP <>pre.
Additionally a new test case is added to verify this fix.
Differential Revision: https://reviews.llvm.org/D101888
Change-Id: I26f1985ac84e970961e2cdca23c590fa6773851a
This issue was reported in PR50057: Cannot select:
t10: i64 = AArch64ISD::VSHL t2, Constant:i32<2>
Shift intrinsics (llvm.aarch64.neon.ushl.i64 and sshl) with a constant
shift operand are lowered into AArch64ISD::VSHL in tryCombineShiftImm.
VSHL has i64 and v1i64 patterns for a right shift, but only v1i64 for
a left shift.
This patch adds the missing i64 pattern for AArch64ISD::VSHL, and LIT
tests to cover scalar variants (i64 and v1i64) of all shift
intrinsics (only ushl and sshl cases fail without the patch, others
were just not covered).
Differential Revision: https://reviews.llvm.org/D101580
Need to perform a bitcast on IndicesVec rather than subvector extract
if the original size of the IndicesVec is the same as the size of the
destination type.
Differential Revision: https://reviews.llvm.org/D101838
This patch supports all of the current set of VP integer binary
intrinsics by lowering them to RVV instructions. It does so by using
the existing RISCVISD *_VL custom nodes as an intermediate layer. Both
scalable and fixed-length vectors are supported by using this method.
One notable change to the existing vector codegen strategy is that
scalable all-ones and all-zeros mask SPLAT_VECTORs are now lowered to
RISCVISD VMSET_VL and VMCLR_VL nodes to match their fixed-length
BUILD_VECTOR counterparts. This allows them to reuse the existing
"all-ones" VL patterns.
To reduce the size of the phabricator diff, some tests are intentionally
left out and will be added later if the patch is accepted.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101826
Previously, RISC-V would make legal all fixed-length vector types whose
size is less than or equal to some function of the minimum value of
VLEN and the maximum-permissible LMUL grouping.
Due to vector legalization issues, this patch instead caps the legal
fixed-length vector types to those with 256 elements. This value was
chosen because it is the longest vector length which has corresponding
MVTs across all supported element types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101839
Improve the code generation of fp_to_sint
and fp_to_uint for 16-bit integers.
Differential Revision: https://reviews.llvm.org/D101481
Patch by Julien Pagès!
This patch disables some of the passes at -O1. These passes have a significant
impact on compilation time, so we only want them to be enabled starting from -O2.
Differential Revision: https://reviews.llvm.org/D101414
The resulting output is semantically closer to what the DAG emits and
is more compatible with the existing CCAssignFns.
The returns of f32 in f80 are clearly broken, but they were broken
before when using G_ANYEXT to go from f32 to f80.
We previously had an ISel pattern for i64x2.abs, but because the ISDNode was not
marked legal for v2i64, the instruction was not being selected.
Differential Revision: https://reviews.llvm.org/D101803
This patch updates the scalar atomic patterns to use the refactored load/store
implementation introduced in D93370.
All existing test cases pass when the refactored patterns are utilized.
Differential Revision: https://reviews.llvm.org/D94498
Specifically, this allows us to rely on the lane zero'ing behaviour of
SVE reduce instructions.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D101369
Ordering of operands was incorrect, meaning that the a16 operand was treated as tfe
Differential Revision: https://reviews.llvm.org/D101618
Change-Id: I3b15e71ef5ff625f19f52823414ab684d76aca33
This patch adds support for splatting i1 types to fixed-length or
scalable vector types. It does so by lowering the operation to a SETCC
of the equivalent i8 type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101465
As pointed out in D101726, this function already exists in MathExtras.
It uses different types, but with the values used here I believe that
should not make a functional difference.
In some cases, we can improve the generated code by using a load with
the "wrong" element width: in particular, using ld1b/st1b when we see
reg+reg without a shift.
Differential Revision: https://reviews.llvm.org/D100527
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
Originally submitted as 3338290c18.
Reverted in c7df6b1223.
- This patch attempts to implement the location counter syntax (*) for the HLASM variant for PC-relative instructions.
- In the HLASM variant, for purely constant relocatable values, we expect a * token preceding it, with special support for " *" which is parsed as "<pc-rel-insn 0>"
- For combinations of absolute values and relocatable values, we don't expect the "*" preceding the token.
When you have a " * " what’s accepted is:
```
*<space>.*{.*} -> <pc-rel-insn> 0
*[+|-][constant-value] -> <pc-rel-insn> [+|-]constant-value
```
When you don’t have a " * " what’s accepted is:
```
brasl 1,func is allowed (MCSymbolRef type)
brasl 1,func+4 is allowed (MCBinary type)
brasl 1,4+func is allowed (MCBinary type)
brasl 1,-4+func is allowed (MCBinary type)
brasl 1,func-4 is allowed (MCBinary type)
brasl 1,*func is not allowed (* cannot be used for non-MCConstantExprs)
brasl 1,*+func is not allowed (* cannot be used for non-MCConstantExprs)
brasl 1,*+func+4 is not allowed (* cannot be used for non-MCConstantExprs)
brasl 1,*+4+func is not allowed (* cannot be used for non-MCConstantExprs)
brasl 1,*-4+8+func is not allowed (* cannot be used for non-MCConstantExprs)
```
Reviewed By: Kai
Differential Revision: https://reviews.llvm.org/D100987
Extend the legalization of global SADDR loads and stores
(changing them to VADDR) to the FLAT scratch instructions.
Differential Revision: https://reviews.llvm.org/D101408
The previous implementation of the default AltiVec ABI marked registers V20-V31
as reserved. This failed to prevent reserved VFRC registers being allocated.
In this patch instead of marking the registers reserved we remove unallowed
registers from the allocation order completely.
This is a slight rework of an implementation by @nemanjai
Reviewed By: jsji
Differential Revision: https://reviews.llvm.org/D100050
Instead of legalizing the saddr operand with a readfirstlane
when the address is moved from an SGPR to a VGPR, we can just
change the opcode.
Differential Revision: https://reviews.llvm.org/D101405
This can come up in rare situations, where a csel is created with
identical operands. These can be folded simply to the original value,
allowing the csel to be removed and further simplification to happen.
This patch also removes FCSEL as it is unused, not being produced
anywhere or lowered to anything.
Differential Revision: https://reviews.llvm.org/D101687
- Previously, https://reviews.llvm.org/D101308 removed prefixes from registers while printing them out. This was especially needed for inline asm statements which used input/output operands.
- However, the backend SystemZAsmParser accepts both prefixed registers and prefix-less registers as part of its implementation
- This patch aims to change that by ensuring that prefixed registers are only allowed for the ATT dialect.
Reviewed By: uweigand
Differential Revision: https://reviews.llvm.org/D101665
Similarly to D101096, this makes sure that MMO operands get propagated
through from MVE gathers/scatters to the Machine Instructions. This
allows extra scheduling freedom, not forcing the instructions to act as
scheduling barriers. We create MMO's with an unknown size, specifying
that they can load from anywhere in memory, similar to the masked_gather
or X86 intrinsics.
Differential Revision: https://reviews.llvm.org/D101219
SITargetLowering::LowerFormalArguments asserts that none of these
features are used for graphics calling conventions, so
AnnotateKernelFeatures should not add them.
Differential Revision: https://reviews.llvm.org/D101534
We create MMO's for the VLDn/VSTn intrinsics in ARMTargetLowering::
getTgtMemIntrinsic, but they do not currently make it all the way through
ISel. This changes that in the various places it needs changing, making
sure that the MMO is propagated through to the final instruction. This
can help in scheduling, not treating the VLD2/VST2 as a scheduling
barrier.
Differential Revision: https://reviews.llvm.org/D101096
Setting the preferred function alignment to 16 for Cortex A53/A55
improves performance in a wide range of benchmarks. This brings it
in line with the Cortex-A53/A55 tuning that is used in GCC
(gcc/config/aarch64/aarch64.c).
Differential Revision: https://reviews.llvm.org/D101636
Change-Id: I2ce47fe7ab5e3b54f49c89038d8da4e404742de2
This shrinks the immediate that isel table needs to emit for these
instructions. Hoping this allows me to change OPC_EmitInteger to
use a better variable length encoding for representing negative
numbers. Similar to what was done a few months ago for OPC_CheckInteger.
The alternative encoding uses less bytes for negative numbers, but
increases the number of bytes need to encode 64 which was a very
common number in the RISCV table due to SEW=64. By using Log2 this
becomes 6 and is no longer a problem.
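As a quick sanity check of the arithmetic (a standalone C++20 snippet, not the isel emitter itself):
```
#include <bit>
#include <cstdint>

// 64 needs 7 significant bits, while its log2 (6) is a tiny value, which is
// why storing Log2(SEW) shrinks the immediates the isel table must encode.
static_assert(std::bit_width(std::uint64_t{64}) == 7);
static_assert(std::bit_width(std::uint64_t{64}) - 1 == 6);

int main() { return 0; }
```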
X32 uses 32-bit ELF object files with 32-bit alignment, so the
.note.gnu.property section needs to be emitted as it is for X86.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101689
Introduce basic schedule model for AMD Zen 3 CPU's, a.k.a `znver3`.
This is fully built from scratch, from llvm-mca measurements
and documented reference materials.
Nothing was copied from `znver2`/`znver1`.
I believe this is in a reasonable state of completion for inclusion,
probably better than D52779 `bdver2` was :)
Namely:
* uops are pretty spot-on (at least what llvm-mca can measure)
{F16422596}
* latency is also pretty spot-on (at least what llvm-mca can measure)
{F16422601}
* throughput is within reason
{F16422607}
I haven't run much benchmarks with this,
however RawSpeed benchmarks says this is beneficial:
{F16603978}
{F16604029}
I'll call out the obvious problems there:
* I didn't really bother with X87 instructions
* I didn't really bother with obviously-microcoded/system instructions
* There are large discrepancies in throughput for `mr` and `rm` instructions.
I'm not really sure if it's a modelling defect that needs to be fixed,
or a defect of the measurements.
* Pipe distributions are probably bad :)
I can't do much here until AMD allows that to be fixed
by documenting the appropriate counters and updating libpfm
That being said, as @RKSimon notes:
>>! In D94395#2647381, @RKSimon wrote:
> I'll mention again that all the znver* models appear to be very inaccurate wrt SIMD/FPU instructions <...>
so how much worse this could possibly be?!
Things that aren't there:
* Various tunings: zero idioms, etc. Those are follow-ups.
Differential Revision: https://reviews.llvm.org/D94395
Apply the same logic used to check if CMPXCHG nodes should be expanded
at -O0: the register allocator may end up spilling some register in
between the atomic load/store pairs, breaking the atomicity and possibly
stalling the execution.
Fixes PR48017
Reviewed By: efriedman
Differential Revision: https://reviews.llvm.org/D101163
Related to PR50172.
Protects us against regressions after we start doing the cttz(zext(x)) -> zext(cttz(x)) transformation in the middle-end.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101662
This is a long overdue cleanup. Not every use is eliminated; I stuck to uses
that were directly being called from select(), and not the render functions.
Differential Revision: https://reviews.llvm.org/D101590
SIPreEmitPeephole did not try to remove redundant s_set_gpr_idx_*
instructions in blocks that end with a conditional branch instruction.
This seems like a simple oversight.
Differential Revision: https://reviews.llvm.org/D101629
Fixes the following warnings observed when building the experimental
m68k backend (-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="M68k"):
../lib/Target/M68k/M68kMachineFunction.h:71:3: warning: explicitly
defaulted default constructor is implicitly deleted
[-Wdefaulted-function-deleted]
M68kMachineFunctionInfo() = default;
^
../lib/Target/M68k/M68kMachineFunction.h:24:20: note: default
constructor of 'M68kMachineFunctionInfo' is implicitly deleted because
field 'MF' of reference type 'llvm::MachineFunction &' would not be
initialized
MachineFunction &MF;
^
In file included from ../lib/Target/M68k/M68kISelLowering.cpp:18:
In file included from ../lib/Target/M68k/M68kSubtarget.h:17:
../lib/Target/M68k/M68kFrameLowering.h:60:8: warning:
'llvm::M68kFrameLowering::emitCalleeSavedFrameMoves' hides overloaded
virtual functions [-Woverloaded-virtual]
void emitCalleeSavedFrameMoves(MachineBasicBlock &MBB,
^
../include/llvm/CodeGen/TargetFrameLowering.h:215:3: note: hidden
overloaded virtual function
'llvm::TargetFrameLowering::emitCalleeSavedFrameMoves' declared here:
different number of parameters (2 vs 3)
emitCalleeSavedFrameMoves(MachineBasicBlock &MBB,
^
../include/llvm/CodeGen/TargetFrameLowering.h:218:16: note: hidden
overloaded virtual function
'llvm::TargetFrameLowering::emitCalleeSavedFrameMoves' declared here:
different number of parameters (4 vs 3)
virtual void emitCalleeSavedFrameMoves(MachineBasicBlock &MBB,
^
pr/50071
Reviewed By: myhsu
Differential Revision: https://reviews.llvm.org/D101588
These operations don't exist natively, so just let the
target-independent code expand to plain shifts.
The generated sequences could probably be optimized a bit more, but
they seem good enough for now.
Differential Revision: https://reviews.llvm.org/D101574
This patch merges STR<S,D,Q,W,X>pre-STR<S,D,Q,W,X>ui and
LDR<S,D,Q,W,X>pre-LDR<S,D,Q,W,X>ui instruction pairs into a single
STP<S,D,Q,W,X>pre and LDP<S,D,Q,W,X>pre instruction, respectively.
For each pair, there is a MIR test that verifies this optimization.
Differential Revision: https://reviews.llvm.org/D99272
Change-Id: Ie97a20c8c716c08492fe229c22e14e3c98ef08b7
The functionality in SVEIntrinsicOpts::isReinterpretToSVBool was moved in
D101302; however, the original, now unused, function was not removed (NFC).
Differential Revision: https://reviews.llvm.org/D101642
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
This patch introduces a new infrastructure that is used to select the load and
store instructions in the PPC backend.
The primary motivation is that the current implementation of selecting load/stores
is dependent on the ordering of patterns in TableGen. Given this limitation, we
are not able to easily and reliably generate the P10 prefixed load and store
instructions (such as when the immediates fit within 34 bits). This
refactoring is meant to provide us with more control over the patterns/different
forms to exploit, as well as eliminating dependency of pattern declaration in TableGen.
The idea of this refactoring is that it introduces a set of addressing modes that
correspond to different instruction formats of a particular load and store
instruction, along with a set of common flags that describes a load/store.
Whenever a load/store instruction is being selected, we analyze the instruction
and compute a set of flags for it. The computed flags are then used to
select the most optimal load/store addressing mode.
This patch is the first of a series of patches to be committed - it contains the
initial implementation of the refactored load/store selection infrastructure and
also updates P8/P9 patterns to adopt this infrastructure. The idea is that
incremental patches will add more implementation and support, and eventually
the old implementation will be removed.
Differential Revision: https://reviews.llvm.org/D93370
Summary:
This patch implements the backend support for adding global variables
directly to the table of contents (TOC), rather than adding the address of the
variable to the TOC.
Currently, this patch will look for the "toc-data" attribute on symbols in the
IR, and then add those symbols to the TOC.
ATM, this is implemented for 32 bit AIX.
Reviewers: sfertile
Differential Revision: https://reviews.llvm.org/D101178
As discussed in D100107, this patch first converts index_vector to
step_vector, and converts step_vector back to index_vector after LegalizeDAG.
Differential Revision: https://reviews.llvm.org/D100816
DAGCombiner was recently taught how to combine STEP_VECTOR nodes,
meaning the step value is no longer guaranteed to be one by the time it
reaches the backend for lowering.
This patch supports such cases on RISC-V by lowering other step
values to a multiply following the vid.v instruction. It includes a
small optimization for common cases where the multiply can be expressed
as a shift left.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D100856
When RVV vector objects exist, using sp to access a fixed stack object has to step over the RVV vector objects area, so the StackOffset needs to add a scalable offset equal to the size of that area.
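A minimal sketch of the offset arithmetic involved (the names here are hypothetical, not the actual frame-lowering code):
```
#include <cstdint>

// When RVV stack objects sit between SP and a fixed object, the effective
// offset has a fixed part plus a part that scales with VLENB (the vector
// register size in bytes, only known at runtime).
int64_t fixedObjectAddress(int64_t SP, int64_t FixedOffset,
                           int64_t VLENB, int64_t ScalableSlots) {
  return SP + FixedOffset + VLENB * ScalableSlots;
}
```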
Differential Revision: https://reviews.llvm.org/D100286
When creating G_SBFX/G_UBFX opcodes, the last operand is the
width instead of the bit position. The bit position is used
for the AArch64 SBFM and UBFM instructions. The bit position
is converted to a width if the SBFX/UBFX aliases are generated.
For other SBFM/UBFM aliases, such as shifts, the bit position
is used.
Differential Revision: https://reviews.llvm.org/D101543
Refactor IsHazardFn and IsExpiredFn to use constant references as these should not be mutating the instructions visited and the instruction can never be null.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D101430
Remove an early-out in wait state counting which can never be
taken.
Reviewed By: foad, rampitec
Differential Revision: https://reviews.llvm.org/D101520
As explained in the comments, matchSwap matches:
// mov t, x
// mov x, y
// mov y, t
and turns it into:
// mov t, x (t is potentially dead and move eliminated)
// v_swap_b32 x, y
On physical registers we don't have full use-def chains so the check
for T being live-out was not working properly with subregs/superregs.
Differential Revision: https://reviews.llvm.org/D101546
Added an extra analysis for a better choice of shuffle kind in the
getShuffleCost functions, for better cost estimation if a mask was
provided.
Differential Revision: https://reviews.llvm.org/D100865
When an atomic image intrinsic's return value is unused, the register class for
the destination of a sub-register copy of the return value ends up not being set.
This copy then hits a 'Register class not set' assert later.
If the return value has uses, the register class is determined by the use instruction.
The fix is to not create the sub-register copy when the image intrinsic destination has
no uses, because it would be deleted by dead-mi-elimination later anyway.
Differential Revision: https://reviews.llvm.org/D101448
- Add new variantKinds for the symbol's variable offset and region handle
- Print the proper relocation specifier @gd in the asm streamer when emitting
the TC Entry for the variable offset for the symbol
- Fix the switch section failure between the TC Entry of variable offset and
region handle
- Put .__tls_get_addr symbol in the ProgramCodeSects with XTY_ER property
Reviewed by: sfertile
Differential Revision: https://reviews.llvm.org/D100956
This reverts commit 3b8ec86fd5.
Revert "[X86] Refine AMX fast register allocation"
This reverts commit c3f95e9197.
This pass breaks using LLVM in a multi-threaded environment by
introducing global state.
Similar for or/xor with 0 in place of -1.
This is the canonical form produced by InstCombine for something like `c ? x & y : x;`. Since we have to use control flow to expand select, we'll usually end up with a mv in a basic block. By folding this we may be able to pull the and/or/xor into the block instead and avoid a mv instruction.
The code here is based on code from ARM that uses this to create predicated instructions. I'm doing it on SELECT_CC so it happens late, but we could do it on select earlier which is what ARM does. I'm not sure if we lose any combine opportunities if we do it earlier.
I left out add and sub because this can separate sext.w from the add/sub. It also made a conditional i64 addition/subtraction on RV32 worse. I guess both of those would be fixed by doing this earlier on select.
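Spelling out the identities in C++ (illustrative only, not the combine's code):
```
#include <cstdint>

// c ? (x & y) : x  ==  x & (c ? y : -1)
uint32_t andIdentity(bool c, uint32_t x, uint32_t y) {
  return x & (c ? y : 0xFFFFFFFFu);
}

// c ? (x | y) : x  ==  x | (c ? y : 0); xor behaves the same with 0.
uint32_t orIdentity(bool c, uint32_t x, uint32_t y) {
  return x | (c ? y : 0u);
}
```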
The select-binop-identity.ll test has not been committed yet, but I made the diff show the changes to it.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D101485
Added an extra analysis for a better choice of shuffle kind in the
getShuffleCost functions, for better cost estimation if a mask was
provided.
Differential Revision: https://reviews.llvm.org/D100865
- Currently, the "." (Dot) character, when not identifying an Identifier or a Constant, refers to the current PC (Program Counter)
- However, in z/OS, for the HLASM dialect, it strictly accepts only the "*" as the current PC (Support for this will be put up in a follow-up patch)
- The changes in this patch allow individual platforms to choose whether they would like to use the "." (Dot) character as a marker for the current PC or not.
- It is achieved by introducing a new field in MCAsmInfo.h called `DotIsPC` (similar to `DollarIsPC`)
Reviewed By: abhina.sreeskantharajan
Differential Revision: https://reviews.llvm.org/D100975
This replaces D98479.
This allows type legalization to form SPLAT_VECTOR_PARTS so we don't
lose the splattedness when the scalar type is split.
I'm handling SPLAT_VECTOR_PARTS for fixed vectors separately so
we can continue using non-VL nodes for scalable vectors.
I limited to RV32+vXi64 because DAGCombiner::visitBUILD_VECTOR likes
to form SPLAT_VECTOR before seeing if it can replace the BUILD_VECTOR
with other operations. Especially interesting is a splat BUILD_VECTOR of
the extract_vector_elt which can become a splat shuffle, but won't if
we form SPLAT_VECTOR first. We either need to reorder visitBUILD_VECTOR
or add visitSPLAT_VECTOR.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100803
This seems like a reasonable upper bound on VL. WG discussions for
the V spec would probably allow us to use 2^16 as an upper bound
on VLEN, but this is good enough for now.
This allows us to remove sext and zext if the user happens to assign
the size_t result into an int and then uses it as a VL intrinsic
argument, which is size_t.
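A hedged C++ sketch of the user-level pattern this helps; the functions below are placeholders, not the actual RVV intrinsics:
```
#include <cstddef>

// Placeholder stand-ins for a VL-returning and a VL-consuming intrinsic.
static size_t set_vl(size_t avl) { return avl < 64 ? avl : 64; }
static size_t g_last_vl;
static void use_vl(size_t vl) { g_last_vl = vl; }

void process(size_t n) {
  int vl = static_cast<int>(set_vl(n)); // size_t result stored into an int
  use_vl(static_cast<size_t>(vl));      // widened back to size_t when reused
  // With VL known to be far below INT_MAX, both conversions can fold away.
}
```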
Reviewed By: frasercrmck, rogfer01, arcbbb
Differential Revision: https://reviews.llvm.org/D101472
At the intrinsic layer the sve.insr operation takes a scalar. When this
scalar is an integer we are forcing a data transition between GPRs and
ZPRs that is potentially costly.
Often the integer scalar is the result of a vector extract, when
performing a reduction for example. In such cases we should keep all
data within the ZPRs.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D101169
By converting the SVE intrinsic to a normal LLVM insertelement we give
the code generator a better chance to remove transitions between GPRs
and VPRs
Co-authored-by: Paul Walker <paul.walker@arm.com>
Depends on D101302
Differential Revision: https://reviews.llvm.org/D101167
As part of this the ptrue coalescing done in SVEIntrinsicOpts has been
modified to not introduce redundant converts, since the convert removal
will no longer run after that optimisation to clean up.
Differential Revision: https://reviews.llvm.org/D101302
This allows calling buildSpillLoadStore for an empty basic block, where
MI points at the end of the block instead of to an instruction.
This only happens with downstream CFI changes, so I was not able to
create a testcase that works with upstream LLVM.
Differential Revision: https://reviews.llvm.org/D101356
Otherwise the CMP glue may be used in multiple nodes, needing to be
emitted multiple times. Currently this either increases instruction
count or fails as it attempt to insert the same node multiple times.
This patch enables support on SPE for constrained arithmetic and
comparison operations. This fixes bugzilla 50070.
One thing not covered is fcmp vs. fcmps on SPE. Some condition codes
generate a signaling comparison while others do not. In this patch, all are
treated as signaling, so there might still be some issues when
compiling from C code.
Reviewed By: jhibbits
Differential Revision: https://reviews.llvm.org/D101282
This is a complementary/alternative fix for D99068. It takes a slightly
different approach by explicitly summing up all of the required split
part type sizes and ensuring we allocate enough space for them. It also
takes the maximum alignment of each part.
Compared with D99068 there are fewer changes to the stack objects in
existing tests. However, @luismarques has shown in that patch that there
are opportunities to reduce our stack usage in the future.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D99087
As X32 uses 32-bit pointers without having 32-bit indirect branch
instructions, we need to fix up indirect branches by extending the
branch targets to 64 bits. This was already done for BRIND but not yet
for NT_BRIND. The same logic works for both, so this applies that
existing logic to NT_BRIND as well.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101499
The ARMConstantIsland pass will convert any t2B to tB if they are within
range after it has added or moved any constant pools. They don't need to
be deliberately converted beforehand, and it doesn't deal with needing
to convert tB to t2B very well.
SelectionDAG has separate ISD opcodes for regular global values and thread-local
global values, while GlobalISel does not.
This combine was ported from SDAG directly without knowing that. As a result,
it was running on TLS globals.
This makes it so that `matchFoldGlobalOffset` doesn't match on TLS globals, and
adds an assert to `selectTLSGlobalValue` to make sure that TLS globals never
have offsets.
Differential Revision: https://reviews.llvm.org/D101478
- Previously, https://reviews.llvm.org/D99889 changed the framework in the AsmLexer to treat special tokens, if they occur at the start of the string, as Identifiers.
- These are used by the MASM Parser implementation in LLVM, and we can extend some of the changes made in the previous patch to SystemZ.
- In SystemZ, the special "tokens" referred to here are "_", "$", "@", "#". [_|$|@|#] are already supported as "part" of an Identifier.
- The changes in this patch ensure that these special tokens, when they occur at the start of the Identifier, are treated as Identifiers.
Reviewed By: abhina.sreeskantharajan
Differential Revision: https://reviews.llvm.org/D100959
Note, only src0 and src1 will be commuted if the isCommutable flag
is set. This patch does not change that, it just makes it possible
to commute src0 and src1 of some U/I/B vop3 instructions.
This patch revises d35d8da7d6.
It contains the commute opportunities excluding float insts
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D101474
Change-Id: I62938173d750453839f2457a3851661a29135faf
This patch changes the AArch32 crypto instructions (sha2 and aes) to
require the specific sha2 or aes features. These features have
already been implemented and can be controlled through the command
line, but do not have the expected result (i.e. `+noaes` will not
disable aes instructions). The crypto feature retains its existing
meaning of both sha2 and aes.
Several small changes are included due to the knock-on effect this has:
- The AArch32 driver has been modified to ensure sha2/aes is correctly
set based on arch/cpu/fpu selection and feature ordering.
- Crypto extensions are permitted for AArch32 v8-R profile, but not
enabled by default.
- ACLE feature macros have been updated with the fine grained crypto
algorithms. These are also used by AArch64.
- Various tests updated due to the change in feature lists and macros.
Reviewed By: lenary
Differential Revision: https://reviews.llvm.org/D99079
In terms of readability, the `enum CFIMoveType` didn't clearly document what it
intends to convey, i.e. the type of CFI section that gets emitted.
Reviewed By: dblaikie, MaskRay
Differential Revision: https://reviews.llvm.org/D76519
- Error out when both Emscripten EH and wasm EH are used together, i.e.,
both `-enable-emscripten-cxx-exceptions` and `-exception-model=wasm`
are given together. This will not happen if you use Emscripten, but
this can happen when you call `llc` manually with wrong set of
arguments.
- Currently we don't yet support using wasm EH with Emscripten SjLj.
Unlike `-enable-emscripten-cxx-exceptions` which is turned on only
when you use `emcc -s DISABLE_EXCEPTION_CATCHING=0`,
`-enable-emscripten-sjlj` is turned on by Emscripten by default. So we
error out only when it is turned on and `setjmp` or `longjmp` is
actually used.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D101403
Previously we used an i32 constant to store the saturation width, but i32 isn't
legal on RISCV64. This wasn't a big deal to fix, but it is extra work for the
type legalizer.
This patch uses a VTSDNode to store the type similar to SEXT_INREG. This makes
it opaque to the type legalizer.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D101262
This adds a special operand type that is allowed to be either
an immediate or register. By giving it a unique operand type the
machine verifier will ignore it.
This perturbs a lot of tests but mostly it is just slightly different
instruction orders. Something bad did happen to some min/max reduction
tests. We're spilling vector registers when we weren't before.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101246
This is hopefully NFC, but should be more robust in ignoring all
instructions that should be ignored, instead of just some of them.
Differential Revision: https://reviews.llvm.org/D101372
This adds a pattern to recognize VIDUP from BUILD_VECTOR of incrementing
adds. This can come up from either geps or adds, and came up recently in
D100550. We are just looking for a BUILD_VECTOR where each lane is an
add of the first lane with N*i, where i is the lane and N is one of 1,
2, 4, or 8, supported by the VIDUP instruction.
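A rough C++ illustration of the lane pattern being matched (whether a given source actually vectorizes into this BUILD_VECTOR is up to the vectorizer, so treat it as a sketch):
```
#include <cstdint>

// Lane i holds start + 4*i: a BUILD_VECTOR of incrementing adds with step 4,
// which is the shape a VIDUP with increment 4 can produce directly.
void fill4(uint32_t *out, uint32_t start) {
  for (int i = 0; i < 4; ++i)
    out[i] = start + 4u * i;
}
```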
Differential Revision: https://reviews.llvm.org/D101263
- This patch is the first part in enforcing prefix-less registers for the HLASM dialect in z/OS
- This patch removes the "%[r|f|v]" prefix while printing registers
- To achieve this, the `AssemblerDialect` field of MAI was used
- There is also a bit of refactoring done to ensure code repetition is reduced.
- Currently the LLVM assembler for SystemZ/z/OS accepts both prefixed registers and prefix-less registers. A subsequent follow-up patch will restrict the SystemZAsmParser to only accept prefix-less registers.
Crediting @kianm as an author as well.
Reviewed By: uweigand, abhina.sreeskantharajan
Differential Revision: https://reviews.llvm.org/D101308
GCC supports negative values for -mstack-protector-guard-offset=; this
should be a signed value. Pre-req to D100919.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101325
This patch adds support for restricting prefixed instructions from
crossing a 64-byte boundary:
- Add the infrastructure to register a custom XCOFF streamer
- Add a custom XCOFF streamer for PowerPC to allow us to
intercept instructions as they are being emitted and align all 8 byte
instructions to a 64 byte boundary if required by adding a 4 byte nop.
Reviewed By: stefanp
Differential Revision: https://reviews.llvm.org/D101107
XADD has the same EFLAGS behaviour as ADD
Reapplies rG2149aa73f640 (after it was reverted at rG535df472b042) - AFAICT rG029e41ec9800 should ensure we correctly tag the LXADD* ops as load/stores - I haven't been able to repro the sanitizer buildbot fails locally so this is a speculative commit.
Buffer_load does unsigned offset calculations. Don't fold
operands of 32-bit add that are likely to cause unsigned add
overflow (common case is when one of the operands is negative).
Differential Revision: https://reviews.llvm.org/D91336
Add basic version of isCanonicalized for global-isel. Copied from sdag.
Add a post-legalizer combine that deletes G_FCANONICALIZE when its input
is already canonicalized.
Differential Revision: https://reviews.llvm.org/D96605
Add signed and unsigned integer version of med3 combine.
Source pattern is min(max(Val, K0), K1) or max(min(Val, K1), K0)
where K0 and K1 are constants and K0 <= K1. Destination is med3
that corresponds to signedness of min/max in source.
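In source terms this is the familiar clamp idiom; a small C++ example of the signed case (illustrative only):
```
#include <algorithm>

// min(max(v, K0), K1) with constants K0 <= K1 clamps v to [K0, K1]; per the
// combine above, this maps to a single signed med3 on AMDGPU.
int clampToByteRange(int v) {
  return std::min(std::max(v, 0), 255);
}
```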
Differential Revision: https://reviews.llvm.org/D90050
We request no intersections between AMX instructions and their shapes'
defs when we insert ldtilecfg. However, this is not always true, both because
users don't follow the AMX API model and because of optimizations.
This patch adds a mechanism that tries to hoist AMX shapes' defs as well.
It only hoists shapes inside a BB; we can improve it for cases across
BBs in the future. Currently, it only hoists shapes whose sources are all
defined above the first AMX instruction. We can improve it for the case where
the only source below the AMX instruction is an immediate value moved into a register.
Reviewed By: xiangzhangllvm
Differential Revision: https://reviews.llvm.org/D101067
For an example like below,
extern int do_work(int);
long bpf_helper(void *callback_fn);
long prog() {
return bpf_helper(&do_work);
}
The final generated code looks like:
r1 = do_work ll
call bpf_helper
exit
where we have debuginfo for do_work() extern function:
!17 = !DISubprogram(name: "do_work", ...)
This patch adds additional checking
when processing LD_imm64 operands for possible function pointers
so that BTF for the bpf function do_work() can be properly generated.
The original llvm function name processReloc() is renamed to
processGlobalValue() to better reflect what the function is doing.
Differential Revision: https://reviews.llvm.org/D100568
LLVM does not have valid assembly backends for atomicrmw on local memory. However, as this memory is thread local, we should be able to lower this to the relevant load/store.
Differential Revision: https://reviews.llvm.org/D98650
LLVM does not have valid assembly backends for atomicrmw on local memory. However, as this memory is thread local, we should be able to lower this to the relevant load/store.
Differential Revision: https://reviews.llvm.org/D98650
This reverts commit 0ce723cb22.
D76519 was not quite NFC. If we see a CFISection::Debug function before a
CFISection::EH one (-fexceptions -fno-asynchronous-unwind-tables), we may
incorrectly pick CFISection::Debug and emit a `.cfi_sections .debug_frame`.
We should use .eh_frame instead.
This scenario is untested.
the compilation time and there is no case for which we see any improvement in
performance. This patch removes this pass and its associated test cases from
the tree.
Differential Revision: https://reviews.llvm.org/D101313
Change-Id: I0599169a7609c19a887f8d847a71e664030cc141
This modifies my previous patch to push the strided load formation
to isel. This gives us opportunity to fold the splat into a .vx
operation first. Using a scalar register and a .vx operation reduces
vector register pressure which can be important for larger LMULs.
If we can't fold the splat into a .vx operation, then it can make
sense to use a strided load to free up the vector arithmetic
ALU to do actual arithmetic rather than tying it up with vmv.v.x.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D101138
1. Add an accessor function to MCSymbolizer to retrieve addresses
referenced by a symbolizable operand, but not resolved to a symbol.
That way, the caller can synthesize labels at those addresses and
then retry disassembling the section.
2. Implement that in AMDGPU -- a failed symbol lookup results in the
address being added to a vector returned by the new function.
3. Use that in llvm-objdump when using MCSymbolizer (which only happens
on AMDGPU) and SymbolizeOperands is on.
Differential Revision: https://reviews.llvm.org/D101145
Change-Id: I19087c3bbfece64bad5a56ee88bcc9110d83989e
This expands the VMOVRRD(extract(..(build_vector(a, b, c, d)))) pattern,
to also handle insert_vectors. Providing we can find the correct insert,
this helps further simplify patterns by removing the redundant VMOVRRD.
Differential Revision: https://reviews.llvm.org/D100245
When vectorising for AArch64 targets if you specify the SVE attribute
we automatically then treat masked loads and stores as legal. Also,
since we have no cost model for masked memory ops we believe it's
cheap to use the masked load/store intrinsics even for fixed width
vectors. This can lead to poor code quality as the intrinsics will
currently be scalarised in the backend. This patch adds a basic
cost model that marks fixed-width masked memory ops as significantly
more expensive than for scalable vectors.
Tests for the cost model are added here:
Transforms/LoopVectorize/AArch64/masked-op-cost.ll
Differential Revision: https://reviews.llvm.org/D100745
CGP can move instructions like a ptrtoint into a loop, but the
MVETailPredication when converting them will currently assume invariant
trip counts. This tries to ensure the operands are loop invariant, and
bails if not.
Differential Revision: https://reviews.llvm.org/D100550
We have several extensions that need i32 to be Custom for
INTRINSIC_WO_CHAIN with RV64 so enable it for all RV64.
For V extension, make i32 Custom for RV64 and i64 Custom for RV32.
When the i32 or i64 is legal, the operation action doesn't matter.
LegalizeDAG checks MVT::Other rather than the real type.
This teaches DAG combine that shift amount operands for grev, gorc,
shfl, and unshfl only read a few bits.
This also teaches DAG combine that grevw, gorcw, shflw, unshflw,
bcompressw, bdecompressw only consume the lower 32 bits of their
inputs.
In the future we can teach SimplifyDemandedBits to also propagate
demanded bits of the output to the inputs in some cases.
Try to fix bug 49974.
This patch fixes two issues:
1. BL does not use predicate (BL_pred is the predicate version of BL),
so we shouldn't add predicate operands in DecodeBranchImmInstruction.
2. Inside DecodeT2AddSubSPImm, we shouldn't add predicate operands into
the MCInst because ARMDisassembler::AddThumbPredicate will do that for us.
However, we should handle CC-out operand for t2SUBspImm and t2AddspImm.
Differential Revision: https://reviews.llvm.org/D100585
In terms of readability, the `enum CFIMoveType` didn't clearly document what it
intends to convey, i.e. the type of CFI section that gets emitted.
Reviewed By: dblaikie, MaskRay
Differential Revision: https://reviews.llvm.org/D76519
This is similar to D69796 from the ARM backend. We remove the UseAA
feature, enabling it globally in the AArch64 backend. This should in
general be an improvement allowing the backend to reorder more
instructions in scheduling and codegen, and enabling it by default helps
to improve the testing of the feature, not making it cpu-specific. A
debugging option is added instead for testing.
Differential Revision: https://reviews.llvm.org/D98781
This clang-formats the list of ARMISD nodes. Usually this is something I
would avoid, but these cause problems with formatting every time new
nodes are added.
The list in getTargetNodeName also makes use of MAKE_CASE macros, as
other backends do.
Use getContainerForFixedLengthVector and getRegClassIDForVecVT to
get the register class to use when making a fixed vector type legal.
Inline it into the other two call sites.
I'm looking into using fractional lmul for fixed length vectors
and getLMULForFixedLengthVector returned an integer making it
unable to express this. I considered returning the LMUL
enum, but that seemed like it would introduce more complexity to
convert it for use.
Make it a static function in RISCVISelLowering, the only place it
is used.
I think I'm going to make this return fractional LMULs in some
cases, so I'm sorting out where it should live before I start
making changes.
We can have RISCVISelDAGToDAG.cpp call the VT only version by
finding the RISCVTargetLowering object via the Subtarget.
Make the static versions just global static functions in
RISCVISelLowering that can be called by static functions in that
file.
These instructions are allowed to write v0 when they are masked.
We'll still never use v0 because of the earlyclobber constraint so
this doesn't really help anything. It just makes the definitions
correct.
While I was there remove an unused multiclass I noticed.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D101118
The values of registers in inactive lanes need to be saved during
function calls.
Save all registers used for whole wave mode, similar to how it is done
for VGPRs that are used for SGPR spilling.
Differential Revision: https://reviews.llvm.org/D99429
Reapply with fixed tests on Windows.
These are added for compatibility with XLC. They are similar to
vec_cts and vec_ctu except that the result is a doubleword vector
regardless of the parameter type.
The function AArch64TargetLowering::LowerFixedLengthVectorIntDivideToSVE
previously assumed the operands were full vectors, but this is not
always true. This function would produce bogus code if the division operands
are not full vectors, resulting in miscompiles when dividing 8-bit or
16-bit vectors.
The fix is to perform an extend + div + truncate for non-full vectors,
instead of the usual unpacking and unzipping logic. This is an additive
change which reduces the non-full integer vector divisions to a pattern
recognised by the existing lowering logic.
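As a rough scalar illustration of the extend + divide + truncate idea (not the actual SVE lowering, which operates on whole vectors; the helper name below is made up for this sketch):
```
#include <cstdint>
#include <vector>

// Scalar model of "extend + div + truncate" for narrow lanes: each i8 lane
// is sign-extended to i32, divided at full width, and the quotient is
// truncated back to i8. The real lowering does this per vector.
std::vector<int8_t> sdivNarrowLanes(const std::vector<int8_t> &A,
                                    const std::vector<int8_t> &B) {
  std::vector<int8_t> R(A.size());
  for (size_t I = 0; I < A.size(); ++I) {
    int32_t Wide = static_cast<int32_t>(A[I]) / static_cast<int32_t>(B[I]);
    R[I] = static_cast<int8_t>(Wide); // truncate back to the narrow type
  }
  return R;
}
```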
For future reference, an example of code that would miscompile before
this patch is below:
```
int8_t foo(unsigned N, int8_t *a, int8_t *b, int8_t *c) {
  int8_t result = 0;
  for (int i = 0; i < N; ++i) {
    result += (a[i] / b[i]) / c[i];
  }
  return result;
}
```
Differential Revision: https://reviews.llvm.org/D100370
Deleted HexagonMapAsm2IntrinV65.gen.td that wasn't included anywhere,
moved V6_vrmpy*_rtt* patterns to HexagonIntrinsics.td.
Touch CMakeLists.txt to force re-cmake (somehow the unused file was
listed as a dependency in the generated makefiles).
The values of registers in inactive lanes need to be saved during
function calls.
Save all registers used for whole wave mode, similar to how it is done
for VGPRs that are used for SGPR spilling.
Differential Revision: https://reviews.llvm.org/D99429
This patch adds support for both scalable- and fixed-length vector code
lowering of the llvm.minnum and llvm.maxnum intrinsics to the equivalent
RVV instructions.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D101035
The previous D101039 didn't fix the SmallSet insertion issue, because we
always return false for the comparison between 2 different nonnull BBs.
This patch makes the comparison complete by comparing `MBB`
first, so that we can always get an invariant order from a single
operator.
EMITBKEY is emitted for PAC-RET+bkey, which is not a machine instruction.
PR: 49957
Reviewed By: eugenis
Differential Revision: https://reviews.llvm.org/D100996
The previous condition in the assert was overly strict. We ought to allow
the same immediate value to be loaded more than once. The intention of
the assert is to check whether the same AMX register uses multiple different
immediate shapes. So this fix is intended to be NFC.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D101124
When we insert ldtilecfg, we require that there are no intersections
between AMX instructions and the defs of their shapes. However, this is
not always true, both because users don't follow the AMX API model and
because of optimizations.
This patch adds a mechanism that tries to hoist the defs of AMX shapes as well.
It only hoists shapes inside a BB; we can extend it to cases across
BBs in the future. Currently, it only hoists shapes whose sources are all
defined above the first AMX instruction. We can improve on the case where
the only offending source is a move of an immediate value into a register
below the AMX instruction.
Differential Revision: https://reviews.llvm.org/D101067
Add __uintr_frame structure and use UIRET instruction for functions with
x86 interrupt calling convention when UINTR is present.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D99708
9931b1f7a4 switched this to checking for
the two specific subtargets, instead of the dedicated feature. This
broke support for functions which force-added the feature when emitting
for targets that do not actually support them. This still does not work for
the targets that use the gfx6/7 or gfx10 encodings.
Background:
CFGStackify's [[ 398f253400/llvm/lib/Target/WebAssembly/WebAssemblyCFGStackify.cpp (L1481-L1540) | fixEndsAtEndOfFunction ]] fixes block/loop/try's return
type when the end of function is unreachable and the function return
type is not void. So if a function returns i32 and `block`-`end` wraps the
whole function, i.e., the `block`'s `end` is the last instruction of the
function, the `block`'s return type should be i32 too:
```
block i32
...
end
end_function
```
If there are consecutive `end`s, this signature has to be propagated to
those blocks too, like:
```
block i32
...
block i32
...
end
end
end_function
```
This applies to `try`-`end` too:
```
try i32
...
catch
...
end
end_function
```
In case of `try`, we not only follow consecutive `end`s but also follow
`catch`, because for the type of the whole `try` to be i32, both `try`
and `catch` parts have to be i32:
```
try i32
...
block i32
...
end
catch
...
block i32
...
end
end
end_function
```
---
Previously we only handled consecutive `end`s or `end` before a `catch`.
But now we have `delegate`, which serves like `end` for
`try`-`delegate`. So we have to follow `delegate` too and mark its
corresponding `try` as i32 (the function's return type):
```
try i32
...
catch
...
try i32 ;; Here
...
delegate N
end
end_function
```
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D101036
This adds support for YAML serialization of `Params` and `Results`
fields in `WebAssemblyMachineFunctionInfo`. Types are printed as `MVT`'s
string representation. This makes writing MIR tests easier.
The tests added are testing simple parsing and printing of `params` /
`results` fields under `machineFunctionInfo`.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D101029
This CL
1. Creates Utils/ directory under lib/Target/WebAssembly
2. Moves existing WebAssemblyUtilities.cpp|h into the Utils/ directory
3. Creates Utils/WebAssemblyTypeUtilities.cpp|h and puts type
declarations and type conversion functions scattered in various
places into this single place.
It has been suggested several times that it is not easy to share utility
functions between subdirectories (AsmParser, Disassembler, MCTargetDesc,
...). Sometimes we ended up [[ https://reviews.llvm.org/D92840#2478863 | duplicating ]] the same function because of
this.
There are already other targets doing this: AArch64, AMDGPU, and ARM
have Utils/ subdirectory under their target directory.
This extracts the utility functions into a single directory Utils/ and
makes them sharable among all passes in WebAssembly/ and its
subdirectories. Also I believe gathering all type-related conversion
functionality into a single place makes it more usable. (Actually I
was working on another CL that uses various type conversion functions
scattered in multiple places, which became the motivation for this CL.)
Reviewed By: dschuff, aardappel
Differential Revision: https://reviews.llvm.org/D100995
'not' expands to checking for an xor with a -1 constant. Since
this looks for a ConstantSDNode it will never match for a vector.
Co-authored-by: Craig Topper <craig.topper@sifive.com>
Differential Revision: https://reviews.llvm.org/D100687
This improves the lowering of v8i16 and v16i8 vector reverse shuffles.
Instead of going via a generic tbl it uses a rev64; ext pair, as already
happens for v4i32.
Differential Revision: https://reviews.llvm.org/D100882
These instructions don't really exist, but we have ways we can
emulate them.
.vv will swap operands and use vmsle(u).vv. .vi will adjust the
immediate and use vmsgt(u).vi when possible. For .vx we need to
use some of the multiple instruction sequences from the V extension
spec.
For unmasked vmsge(u).vx we use:
vmslt{u}.vx vd, va, x; vmnand.mm vd, vd, vd
For cases where the mask and maskedoff are the same value, we have
vmsge{u}.vx v0, va, x, v0.t. This is the vd==v0 case, which
requires a temporary, so we use:
vmslt{u}.vx vt, va, x; vmandnot.mm vd, vd, vt
For other masked cases we use this sequence:
vmslt{u}.vx vd, va, x, v0.t; vmxor.mm vd, vd, v0
We trust that register allocation will prevent vd in vmslt{u}.vx
from being v0 since v0 is still needed by the vmxor.
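As a sanity check of the unmasked expansion: vmnand.mm vd, vd, vd computes NOT(vd), so the two-instruction sequence implements 'a >= x' as 'not (a < x)'. A scalar model:
```
#include <cassert>
#include <cstdint>

// Scalar model of the unmasked vmsge expansion: vmslt produces (a < x),
// and vmnand.mm vd, vd, vd computes !(vd & vd) == !vd, giving (a >= x).
bool vmsgeExpanded(int64_t A, int64_t X) {
  bool Lt = A < X;     // vmslt{u}.vx
  return !(Lt && Lt);  // vmnand.mm vd, vd, vd
}

int main() {
  for (int64_t A = -3; A <= 3; ++A)
    for (int64_t X = -3; X <= 3; ++X)
      assert(vmsgeExpanded(A, X) == (A >= X));
  return 0;
}
```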
Differential Revision: https://reviews.llvm.org/D100925
Refactor to use new multiclass instead of individual patterns.
We already supported this due to SEW=64 on RV32, but we didn't have
test cases for all the types we supported.
Part of D100925
We don't have instructions for these, but can swap the operands
to use vmfle/vmflt. This makes the IR interface more consistent and
simplifies the frontend implementation.
Part of D100925
Implementations are allowed to optimize an x0 stride to perform
less memory accesses. This is the case in SiFive cores.
No idea if this is the case in other implementations. We might
need a tuning flag for this.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D100815
Rather than splatting each half separately and merging them with bit
manipulation in the vector domain, copy the data to the stack
and splat it using a strided load with x0 stride. At least on
some implementations this vector load is optimized to not do
a load for each element.
This is equivalent to how we move i64 to f64 on RV32.
I've only implemented this for the intrinsic fallbacks in this
patch. I think we do similar splatting/shifting/oring in other
places. If this is approved, I'll refactor the others to share
the code.
Differential Revision: https://reviews.llvm.org/D101002
Intrinsics for the following instructions are added. The intrinsic
name is "int_hexagon_<inst>[_128B]", e.g.
int_hexagon_V6_vL32b_pred_ai for 64-byte version
int_hexagon_V6_vL32b_pred_ai_128B for 128-byte version
```
V6_vL32b_pred_ai        if (Pv4) Vd32 = vmem(Rt32+#s4)
V6_vL32b_pred_pi        if (Pv4) Vd32 = vmem(Rx32++#s3)
V6_vL32b_pred_ppu       if (Pv4) Vd32 = vmem(Rx32++Mu2)
V6_vL32b_npred_ai       if (!Pv4) Vd32 = vmem(Rt32+#s4)
V6_vL32b_npred_pi       if (!Pv4) Vd32 = vmem(Rx32++#s3)
V6_vL32b_npred_ppu      if (!Pv4) Vd32 = vmem(Rx32++Mu2)
V6_vL32b_nt_pred_ai     if (Pv4) Vd32 = vmem(Rt32+#s4):nt
V6_vL32b_nt_pred_pi     if (Pv4) Vd32 = vmem(Rx32++#s3):nt
V6_vL32b_nt_pred_ppu    if (Pv4) Vd32 = vmem(Rx32++Mu2):nt
V6_vL32b_nt_npred_ai    if (!Pv4) Vd32 = vmem(Rt32+#s4):nt
V6_vL32b_nt_npred_pi    if (!Pv4) Vd32 = vmem(Rx32++#s3):nt
V6_vL32b_nt_npred_ppu   if (!Pv4) Vd32 = vmem(Rx32++Mu2):nt
V6_vS32b_pred_ai        if (Pv4) vmem(Rt32+#s4) = Vs32
V6_vS32b_pred_pi        if (Pv4) vmem(Rx32++#s3) = Vs32
V6_vS32b_pred_ppu       if (Pv4) vmem(Rx32++Mu2) = Vs32
V6_vS32b_npred_ai       if (!Pv4) vmem(Rt32+#s4) = Vs32
V6_vS32b_npred_pi       if (!Pv4) vmem(Rx32++#s3) = Vs32
V6_vS32b_npred_ppu      if (!Pv4) vmem(Rx32++Mu2) = Vs32
V6_vS32Ub_pred_ai       if (Pv4) vmemu(Rt32+#s4) = Vs32
V6_vS32Ub_pred_pi       if (Pv4) vmemu(Rx32++#s3) = Vs32
V6_vS32Ub_pred_ppu      if (Pv4) vmemu(Rx32++Mu2) = Vs32
V6_vS32Ub_npred_ai      if (!Pv4) vmemu(Rt32+#s4) = Vs32
V6_vS32Ub_npred_pi      if (!Pv4) vmemu(Rx32++#s3) = Vs32
V6_vS32Ub_npred_ppu     if (!Pv4) vmemu(Rx32++Mu2) = Vs32
V6_vS32b_nt_pred_ai     if (Pv4) vmem(Rt32+#s4):nt = Vs32
V6_vS32b_nt_pred_pi     if (Pv4) vmem(Rx32++#s3):nt = Vs32
V6_vS32b_nt_pred_ppu    if (Pv4) vmem(Rx32++Mu2):nt = Vs32
V6_vS32b_nt_npred_ai    if (!Pv4) vmem(Rt32+#s4):nt = Vs32
V6_vS32b_nt_npred_pi    if (!Pv4) vmem(Rx32++#s3):nt = Vs32
V6_vS32b_nt_npred_ppu   if (!Pv4) vmem(Rx32++Mu2):nt = Vs32
```
There are no patterns for the AArch64ISD::BSP ISD node for anything
other than NEON vectors at the moment. As a result, if we hit these
combines for vectors wider than a NEON vector (such as what we might get
with fixed length SVE) we will fail to lower.
This patch simply prevents us from attempting the combines if the input
vector type is too wide.
Reviewed By: peterwaller-arm
Differential Revision: https://reviews.llvm.org/D100961
SmallSet may use operator `<` when we insert MIRef elements, so we
cannot restrict the comparison to elements within the same BB.
We allow MIRef() to be less than any initialized MIRef object; otherwise,
we always return false when comparing between different BBs.
Differential Revision: https://reviews.llvm.org/D101039
We currently do not utilize instructions that convert single
precision vectors to doubleword integer vectors. These conversions
come up in code occasionally and this improvement allows us to
open code some functions that need to be added to altivec.h.
When inspecting the calling convention for calling Windows functions
from a non-Windows function, inspect the calling convention of
the called function, not the caller.
Also remove an unnecessary parameter to AArch64CallLowering
OutgoingArgHandler.
Differential Revision: https://reviews.llvm.org/D100890
STRICT_WWM and STRICT_WQM are already defined with Uses = [EXEC], so
there is no need to add another implicit use of $exec when lowering them
to V_MOV_B32 instructions.
Differential Revision: https://reviews.llvm.org/D100969
The value is always an immediate and can never be in a register.
This is the kind of thing TargetConstant is for.
This saves a step in GenDAGISel to convert a Constant to a TargetConstant.
This recognizes the case when Hi is (sra Lo, 31). We can use
SPLAT_VECTOR_I64 rather than splatting the high bits and
combining them in the vector register.
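In other words, when the high word is the arithmetic shift of the low word by 31, the 64-bit value is just the sign extension of the low word, so a single sign-extending 64-bit element splat of the low word suffices. A quick scalar check:
```
#include <cassert>
#include <cstdint>

int main() {
  // If Hi == (Lo >> 31) (arithmetic shift), the 64-bit value built from the
  // (Lo, Hi) pair equals the sign extension of Lo.
  int32_t Samples[] = {0, 1, -1, 0x7fffffff, INT32_MIN, 12345, -54321};
  for (int32_t Lo : Samples) {
    int32_t Hi = Lo >> 31;
    uint64_t Bits = (static_cast<uint64_t>(static_cast<uint32_t>(Hi)) << 32) |
                    static_cast<uint32_t>(Lo);
    assert(static_cast<int64_t>(Bits) == static_cast<int64_t>(Lo));
  }
  return 0;
}
```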
Each of the cases marked as legal here have an imported pattern in
AArch64GenGlobalISel.inc. So, if we mark them as legal, we get selection for
free.
Technically this is only supposed to happen if we have NEON support. But, we
fall back if we don't have that in the legalizer right now. I suppose it'd be
better to have a FIXME so we can write the testcase when the time comes.
(Plus, it'd just fall back in selection if NEON isn't available, so it's not
*wrong*, I guess?)
This fixes some fallbacks in the test suite.
(Also use `isScalar` from LegalityPredicates.cpp while we're here just to tidy
things a little bit.)
Differential Revision: https://reviews.llvm.org/D100916
This previously made references to 2.3-draft which was a short
lived version number in 2017. It was replaced by date based
versions leading up to ratification.
This patch uses the latest ratified version number and just says
what the behavior is. Nothing here is in flux.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100878
This was checked in some asserts, but not enforced by the
instruction matching.
There's still a second bug that we don't check that vt and vd
are different registers, but that will require custom checking.
Differential Revision: https://reviews.llvm.org/D100928
PR50049 demonstrated an infinite loop between OR(SHUFFLE,SHUFFLE) <-> BLEND(SHUFFLE,SHUFFLE) patterns.
The UNDEF elements were allowing a combined shuffle mask to be widened which lost the undef element, resulting in us needing to use the BLEND pattern (as the undef element would need to be zero for the OR pattern). But then bitcast folds would re-expose the undef element, allowing us to use OR again.
Let it work only on very small kernels. Measurements showed
the performance benefit is not worth the compile time.
Differential Revision: https://reviews.llvm.org/D100904
- Previously, https://reviews.llvm.org/D72680 introduced a new attribute called `AllowSymbolAtNameStart` (in relation to the MAsmParser changes) in `MCAsmInfo.h` which (according to the comment in the header) allows the following behaviour:
```
/// This is true if the assembler allows $ @ ? characters at the start of
/// symbol names. Defaults to false.
```
- However, the usage of this field in AsmLexer.cpp doesn't seem completely accurate* for a couple of reasons.
```
default:
if (MAI.doesAllowSymbolAtNameStart()) {
// Handle Microsoft-style identifier: [a-zA-Z_$.@?][a-zA-Z0-9_$.@#?]*
if (!isDigit(CurChar) &&
isIdentifierChar(CurChar, MAI.doesAllowAtInName(),
AllowHashInIdentifier))
return LexIdentifier();
}
```
1. The Dollar and At tokens, when occurring at the start of the string, are treated as separate tokens (AsmToken::Dollar and AsmToken::At respectively) and not lexed as an Identifier.
2. I'm not too sure why `MAI.doesAllowAtInName()` is used when `AllowAtInIdentifier` could be used. For X86 platforms, afaict, this shouldn't be an issue, since the `CommentString` attribute isn't "@". (alternatively the call to the setter can be set anywhere else as needed). The `AllowAtInName` does have an additional important meaning, but in the context of AsmLexer, shouldn't mean anything different compared to `AllowAtInIdentifier`
My proposal is the following:
- Introduce 3 new fields called `AllowQuestionTokenAtStartOfString`, `AllowDollarTokenAtStartOfString` and `AllowAtTokenAtStartOfString` in MCAsmInfo.h which will encapsulate the previously documented behaviour of allowing $, @, ? characters at the start of symbol names.
- Introduce these fields where "$", "@" are lexed, and treat them as identifiers depending on whether `Allow[Dollar|At]TokenAtStartOfString` is set.
- For the sole case of "?", append it to the existing logic for treating a "default" token as an Identifier.
z/OS (HLASM) will also make use of some of these fields in follow up patches.
completely accurate* - This was based on the comments and the intended behaviour of the code. I might have completely misinterpreted it, and if that is the case my sincere apologies. We can close this patch if necessary, if there are no changes to be made :)
Depends on https://reviews.llvm.org/D99374
Reviewed By: Jonathan.Crowther
Differential Revision: https://reviews.llvm.org/D99889
This patch changes the lowering of SELECT_CC from Legal to Expand for scalable
vector and adds support for scalable vectors in performSelectCombine.
When selecting the nodes to lower in visitSELECT it checks if it is possible to
use SELECT_CC in cases where SETCC is followed by SELECT. visitSELECT checks
whether SELECT_CC is legal or custom in order to replace SELECT by SELECT_CC.
SELECT_CC used to be legal for scalable vectors, so the node changes to
SELECT_CC. This used to crash the compiler as there is no support for SELECT_CC
with scalable vectors. So now the compiler lowers to VSELECT instead of
SELECT_CC.
Differential Revision: https://reviews.llvm.org/D100485
This patch fixes a case missed out by D100574, in which RVV scalable
stack offset computations may require three live registers in the case
where the offset's fixed component is 12 bits or larger and has a
scalable component.
Instead of adding an additional emergency spill slot, this patch further
optimizes the scalable stack offset computation sequences to reduce
register usage.
By emitting the sequence to compute the scalable component before the
fixed component, we can free up one scratch register to be reallocated
by the sequence for the fixed component. Doing this saves one register
and thus one additional emergency spill slot.
Compare:
```
$x5 = LUI 1
$x1 = ADDIW killed $x5, -1896
$x1 = ADD $x2, killed $x1
$x5 = PseudoReadVLENB
$x6 = ADDI $x0, 50
$x5 = MUL killed $x5, killed $x6
$x1 = ADD killed $x1, killed $x5
```
versus:
```
$x5 = PseudoReadVLENB
$x1 = ADDI $x0, 50
$x5 = MUL killed $x5, killed $x1
$x1 = LUI 1
$x1 = ADDIW killed $x1, -1896
$x1 = ADD $x2, killed $x1
$x1 = ADD killed $x1, killed $x5
```
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D100847
We were missing some instruction costs when converting vectors of
floating point half types into integers, so I've added those here.
I also manually generated assembly code for each FP->int case and
looked at the number of instructions generated, which meant
adjusting some of the existing costs too.
I've updated an existing test to reflect the new costs:
Analysis/CostModel/AArch64/sve-fptoi.ll
Differential Revision: https://reviews.llvm.org/D99935
New registers FRM, FFLAGS and FCSR were defined. They represent the
corresponding system registers. The new registers are necessary to
properly order floating point instructions in non-default modes.
Differential Revision: https://reviews.llvm.org/D99083
This is just a cleanup of the very high level stuff. I'm sure there is
more to update here but I'll leave that to others and/or a followup.
Differential Revision: https://reviews.llvm.org/D100888
It used to be that all of our intrinsics were call instructions, but over time, we've added more and more invokable intrinsics. According to the verifier, we're up to 8 right now. As IntrinsicInst is a sub-class of CallInst, this puts us in an awkward spot where the idiomatic means to check for intrinsic has a false negative if the intrinsic is invoked.
This change switches IntrinsicInst from being a sub-class of CallInst to being a subclass of CallBase. This allows invoked intrinsics to be instances of IntrinsicInst, at the cost of requiring a few more casts to CallInst in places where the intrinsic really is known to be a call, not an invoke.
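A hedged sketch of what the new class hierarchy allows client code to do (the helper name is hypothetical; only the standard LLVM headers and casting utilities are assumed):
```
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

// With IntrinsicInst deriving from CallBase, dyn_cast<IntrinsicInst> now
// matches both 'call' and 'invoke' of an intrinsic; code that really needs
// a call can still narrow the type with a second isa/cast.
static bool isInvokedIntrinsic(const Instruction &I) {
  if (const auto *II = dyn_cast<IntrinsicInst>(&I))
    return isa<InvokeInst>(II); // previously impossible: intrinsics were always CallInst
  return false;
}
```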
After this lands and has baked for a couple days, planned cleanups:
- Make GCStatepointInst an IntrinsicInst subclass.
- Merge intrinsic handling in InstCombine and use idiomatic visitIntrinsicInst entry point for InstVisitor.
- Do the same in SelectionDAG.
- Do the same in FastISel.
Differential Revision: https://reviews.llvm.org/D99976
Introduced the cost of the reverse shuffles for AArch64; currently it just
copies the costs for PermuteSingleSrc.
Differential Revision: https://reviews.llvm.org/D100871
af7925b4dd added a custom DAG combine for recognizing fp-to-ints of
extract_subvectors that could be lowered to f64x2.convert_low_i32x4_{s,u}
instructions. This commit extends the combines to recognize equivalent
extract_subvectors of fp-to-ints as well.
Differential Revision: https://reviews.llvm.org/D100790
In GFX10 VOP3 can have a literal, which opens up the possibility of two
operands using the same literal value, which is allowed and only counts
as one use of the constant bus.
AMDGPUAsmParser::validateConstantBusLimitations already knew about this
but SIInstrInfo::verifyInstruction did not.
Differential Revision: https://reviews.llvm.org/D100770
apple-m1 has the same level of ISA support as apple-a14,
so this is a straightforward mechanical change. However, that
also means this inherits apple-a14's v8.5a+nobti quirkiness.
rdar://68287159
The generic SoftFloatVectorExtract.ll test was failing when run on Arm
machines, as it tries to create an f64 under soft float. Limit the
transform to when f64 is legal.
Also add a missing override, as reported in D100244.
Mark MULHS/MULHU nodes as legal for both scalable and fixed SVE types,
and lower them to the appropriate SVE instructions.
Additionally now that the MULH nodes are legal, integer divides can be
expanded into a more performant code sequence.
Differential Revision: https://reviews.llvm.org/D100487
This adds a combine for extract(x, n); extract(x, n+1) ->
VMOVRRD(extract x, n/2). This allows two vector lanes to be moved at the
same time in a single instruction, and thanks to the other VMOVRRD folds
we have added recently can help reduce the amount of executed
instructions. Floating point types are very similar, but will include a
bitcast to an integer type.
This also adds a shouldRewriteCopySrc, to prevent copy propagation from
DPR to SPR, which can break as not all DPR regs can be extracted from
directly. Otherwise the machine verifier is unhappy.
Differential Revision: https://reviews.llvm.org/D100244
Instructions on the transcendental unit are executed in parallel to the
normal VALU, so add this as an extra resource.
This doesn't seem to have any effect, but it should be more correct.
Differential Revision: https://reviews.llvm.org/D100123
Extend shuffle canonicalization and conversion of shuffles fed by vectorized
scalars to big endian subtargets. For big endian subtargets, loads and direct
moves of scalars into vector registers put the data in the correct element for
SCALAR_TO_VECTOR if the data type is 8 bytes wide. However, if the data type is
narrower, the value still ends up in the wrong place - although a different
wrong place than on little endian targets.
This patch extends the combine that keeps values where they are if they feed a
shuffle to big endian targets.
Differential revision: https://reviews.llvm.org/D100478
when the predicate used by last{a,b} specifies a known vector length.
For example:
aarch64_sve_lasta(VL1, D) -> extractelement(D, #1)
aarch64_sve_lastb(VL1, D) -> extractelement(D, #0)
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D100476
This patch adds an additional emergency spill slot to RVV code. This is
required as RVV stack offsets may require an additional register to compute.
This patch includes an optimization by @HsiangKai <kai.wang@sifive.com>
to reduce the number of registers required for the computation of stack
offsets from 3 to 2. Otherwise we'd need two additional emergency spill
slots.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D100574
This patch exploits mtvsrdd instruction (available in ISA3.0+) to save
two callee-saved GPR registers into a single VSR, making it more
efficient.
Reviewed By: jsji, nemanjai
Differential Revision: https://reviews.llvm.org/D62565
Don't shrink VOP3 instructions if there are any uses of a carry-out
operand, because the shrunken form of the instruction would write the
carry-out to vcc instead of to a virtual register.
Differential Revision: https://reviews.llvm.org/D100760
This patch is the last one in backend to support fp128 type in
pre-POWER9 subtargets with VSX, removing temporary option and updating
remaining tests.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D92374
This patch adds basic CSKY branch instructions and symbol address series instructions.
Those two kinds of instructions are related to each other, and this involves a fair amount of work on Fixups.
For now, basic instructions are enabled except for disassembler support.
We will support generating basic codegen asm first and defer the disassembler work until later.
Differential Revision: https://reviews.llvm.org/D95029
This patch adds basic CSKY integer instructions except for branch series such as bsr, br.
It mainly includes basic ALU, load & store, compare and data move instructions.
Branch series instructions need to handle complex symbol operands and will follow in a later patch.
Differential Revision: https://reviews.llvm.org/D94007
This basic parser will handle basic instructions with register or immediate operands.
With the addition of CSKYInstPrinter, we can now make use of lit tests.
Differential Revision: https://reviews.llvm.org/D93798
It's necessary to calculate the correct instruction size because
PseudoVRELOAD and PseudoSPILL will be expanded into multiple
instructions.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100702
M68kDisassembler should list M68kDesc as a direct library dependency
since it uses logic related to code beads. Otherwise the build will
fail when building LLVM libraries as shared objects (building LLVM
libraries statically won't have this problem though).
This also includes PC-relative addresses since they are still
referenced as absolute addresses in assembly and converted to
relative addresses by the assembler.
This changes, for example:
- `bra #-2` -> `bra $100`
- `jsr #16` -> `jsr $10`
Differential Revision: https://reviews.llvm.org/D100697
Used to model structural hazards on FP issue, where some
instructions take up 2 issue slots and others one, as well
as similar structural hazards on load issue, where some
instructions take up two load lanes and others one.
Differential Revision: https://reviews.llvm.org/D98977
We previously used splats instead of v128.const to materialize vector constants
because V8 did not support v128.const. Now that V8 supports v128.const, we can
use v128.const instead. Although this increases code size, it should also
increase performance (or at least require fewer engine-side optimizations), so
it is an appropriate change to make.
Differential Revision: https://reviews.llvm.org/D100716
This patch removes -fixed-abi check for indirect calls
and also adds queue-ptr which is required for indirect calls to work.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D100633
Comparisons to zero or one after cset instructions can be safely
removed in examples like:
```
cset w9, eq          cset w9, eq
cmp  w9, #1   --->   <removed>
b.ne .L1             b.ne .L1

cset w9, eq          cset w9, eq
cmp  w9, #0   --->   <removed>
b.ne .L1             b.eq .L1
```
A peephole optimization is added to detect suitable cases and get rid of
those comparisons.
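A scalar model of why the compare is redundant: cset leaves either 0 or 1 in the register, so comparing it against 1 (or against 0, with the branch condition inverted) carries no new information:
```
#include <cassert>

int main() {
  // W9 models the result of "cset w9, eq": always 0 or 1, with 1 meaning
  // the original condition (eq) held.
  for (int W9 = 0; W9 <= 1; ++W9) {
    bool CondHeld = (W9 == 1);
    // "cmp w9, #1; b.ne" branches when w9 != 1, i.e. when the original
    // condition did not hold, so the branch can keep using b.ne on the
    // original flags once the compare is removed.
    assert((W9 != 1) == !CondHeld);
    // "cmp w9, #0; b.ne" branches when w9 != 0, i.e. when the original
    // condition held; after removing the compare the branch must test the
    // original flags directly, so b.ne becomes b.eq (as in the example above).
    assert((W9 != 0) == CondHeld);
  }
  return 0;
}
```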
Differential Revision: https://reviews.llvm.org/D98564
As noted in the FIXME, there's a sort of agreement that any
extra bits stored will be 0.
The generated code is pretty terrible. I was really hoping we
could use a tail undisturbed trick, but tail undisturbed no
longer applies to masked destinations in the current draft
spec.
Fingers crossed that it isn't common to do this. I doubt IR
from clang or the vectorizer would ever create this kind of store.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100618
This is a partial port of AArch64TargetLowering::LowerCTPOP.
This custom lowering tries to use NEON instructions to give a more efficient
CTPOP lowering when possible.
In the non-NEON/noimplicitfloat case, this should use the generic lowering
(see: https://godbolt.org/z/GcaPvWe4x). I think that's worth implementing after
implementing the widening code for s16/s8 though.
Differential Revision: https://reviews.llvm.org/D100399
These constraints are machine agnostic; there's no reason to handle
these per-arch. If arches don't support these constraints, then they
will fail elsewhere during instruction selection. We don't need virtual
calls to look these up; TargetLowering::getInlineAsmMemConstraint should
only be overridden by architectures with additional unique memory
constraints.
Reviewed By: echristo, MaskRay
Differential Revision: https://reviews.llvm.org/D100416
It turns out we actually import a bunch of selection code for intrinsics. The
imported code checks that the register banks on the G_INTRINSIC instruction
are correct. If so, it goes ahead and selects it.
This adds code to AArch64RegisterBankInfo to allow us to correctly determine
register banks on intrinsics which have known register bank constraints.
For now, this only handles @llvm.aarch64.neon.uaddlv. This is necessary for
porting AArch64TargetLowering::LowerCTPOP.
Also add a utility for getting the intrinsic ID from a G_INTRINSIC instruction.
This seems a little nicer than having to know about how intrinsic instructions
are structured.
Differential Revision: https://reviews.llvm.org/D100398
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.
Reapply after fixing dependent change D100188.
Differential Revision: https://reviews.llvm.org/D100189
This is fairly cheap to implement and means less work for future
passes like MachineDCE.
Reapply with a fix for using InstToErase after it had been erased.
Differential Revision: https://reviews.llvm.org/D100188
This is similar to the subvector extractions,
except that the 0'th subvector isn't free to insert,
because we generally don't know whether or not
the upper elements need to be preserved:
https://godbolt.org/z/rsxP5W4sW
This is needed to avoid regressions in D100684
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100698
This patch extends the lowering of RVV fixed-length vector shuffles to
avoid the default stack expansion and instead lower to vrgather
instructions.
For "permute"-style shuffles where one vector is swizzled, we can lower
to one vrgather. For shuffles involving two vector operands, we lower to
one unmasked vrgather (or splat, where appropriate) followed by a masked
vrgather which blends in the second half.
On occasion, when it's not possible to create a legal BUILD_VECTOR for
the indices, we use vrgatherei16 instructions with 16-bit index types.
For 8-bit element vectors where we may have indices over 255, we have a
fairly blunt fallback to the stack expansion to avoid custom-splitting
of the vector types.
To enable the selection of masked vrgather instructions, this patch
extends the various RISCVISD::VRGATHER nodes to take a passthru operand.
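For intuition, here is a rough scalar model of the two-gather blend described above (names and the exact out-of-range/mask semantics are simplified for illustration; the real lowering follows the RVV vrgather definition and the new passthru operand):
```
#include <cstdint>
#include <vector>

// Scalar model of a masked gather used to blend two sources: first an
// unmasked gather from V0, then a masked gather from V1 that overwrites
// only the lanes selected by Mask (the previous result acts as passthru).
std::vector<int8_t> shuffleTwoSources(const std::vector<int8_t> &V0,
                                      const std::vector<int8_t> &V1,
                                      const std::vector<uint16_t> &Idx,
                                      const std::vector<bool> &Mask) {
  std::vector<int8_t> Res(Idx.size());
  for (size_t I = 0; I < Idx.size(); ++I)   // unmasked vrgather from V0
    Res[I] = Idx[I] < V0.size() ? V0[Idx[I]] : 0;
  for (size_t I = 0; I < Idx.size(); ++I)   // masked vrgather from V1
    if (Mask[I])                            // inactive lanes keep Res[I]
      Res[I] = Idx[I] < V1.size() ? V1[Idx[I]] : 0;
  return Res;
}
```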
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100549
The internalization pass only internalizes global variables
with no users. If the global variable has some dead user,
the internalization pass will not internalize it.
To be able to internalize global variables with dead
users, a global dce pass is needed before the
internalization pass.
This patch adds that.
Reviewed by: Artem Belevich, Matt Arsenault
Differential Revision: https://reviews.llvm.org/D98783
Such attributes can either be unset, or set to "true" or "false" (as string).
Throughout the codebase, this led to inelegant checks ranging from
  if (Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true")
to
  if (Fn->hasAttribute("no-jump-tables") && Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true")
Introduce a getValueAsBool that normalizes the check, with the following
behavior:
no attributes or attribute set to "false" => return false
attribute set to "true" => return true
Differential Revision: https://reviews.llvm.org/D99299
These lines set the value to what it already was,
so they are redundant. NFC
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100664
Change-Id: Ibf6f27d50a7fa1f76c127f01b799821378bfd3b3
Gives reasoning for convertDPP8.
Also corrects typo in Operand type comment.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100665
Change-Id: I33ff269db8072d83e5e0ecdbfb731d6000fc26c4
Use the target-independent @llvm.fptosi and @llvm.fptoui intrinsics instead.
This includes removing the intrinsics for i32x4.trunc_sat_zero_f64x2_{s,u},
which are now represented in IR as a saturating truncation to a v2i32 followed by
a concatenation with a zero vector.
Differential Revision: https://reviews.llvm.org/D100596
This patch prevents phi-node-elimination from generating a COPY
operation for the register defined by t2WhileLoopStartLR, as it is a
terminator that defines a value.
This happens because of the presence of phi-nodes in the loop body (the
Preheader of which is the block containing the t2WhileLoopStartLR). If
this is not done, the COPY is generated above/before the terminator
(t2WhileLoopStartLR here), and since it uses the value defined by
t2WhileLoopStartLR, MachineVerifier throws a 'use before define' error.
This essentially adds on to the change in differential D91887/D97729.
Differential Revision: https://reviews.llvm.org/D100376
Sometimes LV has to produce really wide vectors,
and sometimes they end up being not powers of two.
As it can be seen from the diff, the cost computation
is currently completely non-sensical in those cases.
Instead of just scalarizing everything, split/factorize the wide vector
into a number of subvectors, each one having a power-of-two elements,
recurse to get the cost of op on this subvector. Also, check how we'd
legalize this subvector, and if the legalized type is scalar,
also account for the scalarization cost.
Note that for sub-vector loads, we might be able to do better,
when the vectors are properly aligned.
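A hedged sketch of the splitting idea, with a placeholder cost callback (the function and callback names are made up; the real code queries the target's cost model and legalization for each subvector):
```
#include <bit>
#include <cassert>
#include <functional>

// Split a non-power-of-two element count into power-of-two chunks (largest
// first) and sum the per-chunk costs. CostFn stands in for "cost of the op
// on an N-element subvector".
unsigned splitCost(unsigned NumElts,
                   const std::function<unsigned(unsigned)> &CostFn) {
  unsigned Cost = 0;
  while (NumElts) {
    unsigned Chunk = std::bit_floor(NumElts); // largest power of two <= NumElts
    Cost += CostFn(Chunk);
    NumElts -= Chunk;
  }
  return Cost;
}

int main() {
  // A 7-element vector decomposes into 4-, 2- and 1-element subvectors.
  unsigned Pieces = 0;
  splitCost(7, [&](unsigned N) { ++Pieces; return N; });
  assert(Pieces == 3);
  return 0;
}
```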
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D100099
Combine sub 0, csinc X, Y, CC to csinv -X, Y, CC providing that the
negation of X is cheap, currently just handling constants. This comes up
during the splat of an i1 to a predicate, where we now generate csetm,
as opposed to cset; rsb.
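The identity being exploited, checked on a scalar model: csinc X, Y, CC yields X when CC holds and Y+1 otherwise, while csinv yields X or ~Y; since -(Y+1) == ~Y in two's complement, negating a csinc is a csinv with the first operand negated.
```
#include <cassert>
#include <cstdint>

// Scalar models of the two conditional instructions.
static int32_t csinc(int32_t X, int32_t Y, bool CC) { return CC ? X : Y + 1; }
static int32_t csinv(int32_t X, int32_t Y, bool CC) { return CC ? X : ~Y; }

int main() {
  int32_t Samples[] = {0, 1, -1, 7, -42, 1000};
  for (int32_t X : Samples)
    for (int32_t Y : Samples)
      for (bool CC : {false, true})
        // sub 0, (csinc X, Y, CC)  ==  csinv (-X), Y, CC
        assert(-csinc(X, Y, CC) == csinv(-X, Y, CC));
  return 0;
}
```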
Differential Revision: https://reviews.llvm.org/D99940
It has to save all caller-saved registers before a call in the handler.
So don't emit a call that saves/restores registers.
Reviewed By: simoncook, luismarques, asb
Differential Revision: https://reviews.llvm.org/D100532
Part of the code related to ds_read/ds_write ISel is refactored, and the
corresponding comment is re-written for better readability, which would help
while implementing any future ds_read/ds_write ISel related modifications.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D100300
When we pass a AArch64 Homogeneous Floating-Point
Aggregate (HFA) argument with increased alignment
requirements, for example
struct S {
__attribute__ ((__aligned__(16))) double v[4];
};
Clang uses `[4 x double]` for the parameter, which is passed
on the stack at alignment 8, whereas it should be at
alignment 16, following Rule C.4 in
AAPCS (https://github.com/ARM-software/abi-aa/blob/master/aapcs64/aapcs64.rst#642parameter-passing-rules)
Currently we don't have a way to express in LLVM IR the
alignment requirements of the function arguments. The align
attribute is applicable to pointers only, and only for some
special ways of passing arguments (e.g. byval). When
implementing AAPCS32/AAPCS64, clang resorts to dubious hacks
of coercing to types, which naturally have the needed
alignment. We don't have enough types to cover all the
cases, though.
This patch introduces a new use of the stackalign attribute
to control stack slot alignment, when and if an argument is
passed in memory.
The attribute align is left as an optimizer hint - it still
applies to pointer types only and pertains to the content of
the pointer, whereas the alignment of the pointer itself is
determined by the stackalign attribute.
For byval arguments, the stackalign attribute assumes the
role previously performed by align, falling back to align if
stackalign is absent.
On the clang side, when passing arguments using the "direct"
style (cf. `ABIArgInfo::Kind`), now we can optionally
specify an alignment, which is emitted as the new
`stackalign` attribute.
Patch by Momchil Velikov and Lucas Prates.
Differential Revision: https://reviews.llvm.org/D98794
Move some utility functions which are used within LDS lowering pass to a separate utils
file so that other LDS related passes can make use of them when required.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D100526
This generalizes RVInstIShift/RVInstIShiftW to take the upper
5 or 7 bits of the immediate as an input instead of only bit 30. Then
we can share them.
For RVInstIShift I left a hardcoded 0 at bit 26 where RV128 gets
a 7th bit for the shift amount.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D100424
Being lazy with printing the banner seems hard to reason about; we should print it
unconditionally first (laziness could also lead to duplicate banners if we
have multiple functions in -filter-print-funcs).
The printIR() functions were doing too many things. I separated out the
call from PrintPassInstrumentation since we were essentially doing two
completely separate things in printIR() from different callers.
There were multiple ways to generate the name of some IR. That's all
been moved to getIRName(). The printing of the IR name was also
inconsistent, now it's always "IR Dump on $foo" where "$foo" is the
name. For a function, it's the function name. For a loop, it's what's
printed by Loop::print(), which is more detailed. For an SCC, it's the
list of functions in parentheses. For a module it's "[module]", to
differentiate between a possible SCC with a function called "module".
To preserve D74814, we have to check if we're going to print anything at
all first. This is unfortunate, but I would consider this a special
case that shouldn't be handled in the core logic.
Reviewed By: jamieschmeiser
Differential Revision: https://reviews.llvm.org/D100231
There are four new PowerPC instructions that are introduced in
Power 10. They are hashst, hashchk, hashstp, hashchkp.
These instructions will be used for ROP Protection.
This patch adds the four instructions.
Reviewed By: nemanjai, amyk, #powerpc
Differential Revision: https://reviews.llvm.org/D99375
Returning in memory is not supported, so fall back to sret.
Also, extend i1 and i16 to i32. Otherwise, they would be passed through
memory.
Differential Revision: https://reviews.llvm.org/D100543
If we are truncating from a i32 source before comparing the result against zero, then see if we can directly compare the source value against zero.
If the upper (truncated) bits are known to be zero then we can compare against that, hopefully increasing the chances of us folding the compare into an EFLAGS result of the source's operation.
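A scalar model of why the fold is sound: when the bits removed by the truncation are known to be zero, the truncated value is zero exactly when the full-width source is zero.
```
#include <cassert>
#include <cstdint>

int main() {
  // If the upper 24 bits of Src are known zero, comparing the i8 truncation
  // against zero is equivalent to comparing Src itself against zero.
  for (uint32_t Src = 0; Src <= 0xff; ++Src) {  // upper bits all zero
    bool TruncIsZero = static_cast<uint8_t>(Src) == 0;
    bool SrcIsZero = (Src == 0);
    assert(TruncIsZero == SrcIsZero);
  }
  return 0;
}
```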
Fixes PR49028.
Differential Revision: https://reviews.llvm.org/D100491
With this patch vbslq_f32(vnegq_s32(a), b, c) lowers to a BIT instruction.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D100304
At the moment, getMemoryOpCost returns 1 for all inputs if CostKind is
CodeSize or SizeAndLatency. This fools LoopUnroll into thinking memory
operations on large vectors have a cost of one, even if they will get
expanded to a large number of memory operations in the backend.
This patch updates getMemoryOpCost to return the cost for the type
legalization for both CodeSize and SizeAndLatency. This should more
accurately reflect the number of memory operations required.
I am not sure from the description how latency should properly be included
in SizeAndLatency, but returning the size cost should clearly be more
accurate.
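A hedged back-of-the-envelope model of the intended cost (function name and the 128-bit legal width are assumptions for illustration): a wide vector memory op legalizes to roughly ceil(width / legal width) operations, which is what the cost should reflect instead of a flat 1.
```
#include <cassert>

// Illustrative only: number of legal-width memory operations a wide vector
// load/store splits into, assuming a 128-bit widest legal vector type.
unsigned approxMemOpCost(unsigned VectorBits, unsigned LegalBits = 128) {
  return (VectorBits + LegalBits - 1) / LegalBits; // ceiling division
}

int main() {
  assert(approxMemOpCost(128) == 1);   // fits in one legal vector
  assert(approxMemOpCost(1024) == 8);  // e.g. a large matrix-extension row
  return 0;
}
```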
This does not cause any binary changes when building
MultiSource/SPEC2000/SPEC2006 with -O3 -flto for AArch64, likely because
large vector memops are not really formed by code emitted from Clang.
But using the C/C++ matrix extension can easily result in code with very
large vector operations directly from Clang, e.g.
https://clang.godbolt.org/z/6xzxcTGvb
Reviewed By: samparker
Differential Revision: https://reviews.llvm.org/D100291
On Windows, float arguments are normally passed in float registers
in the calling convention for regular functions. For variable
argument functions, floats are passed in integer registers. This
already was done correctly since many years.
However, the surprising bit was that floats among the fixed arguments
also are supposed to be passed in integer registers, contrary to regular
functions. (This also seems to be the behaviour on ARM, both on
Windows and on e.g. hardfloat Linux.)
In the calling convention, don't promote shorter floats to f64, but
convert them to integers of the same length. (Floats passed as part of
the actual variable arguments are promoted to double already on the
C/Clang level; the LLVM vararg calling convention doesn't do any
extra promotion of f32 to f64 - this matches how it works on X86 too.)
Technically, this is an ABI break compared to older LLVM versions,
but it fixes compatibility with the official platform ABI. (In practice,
floats among the fixed arguments in variable argument functions is
a pretty rare construct.)
Differential Revision: https://reviews.llvm.org/D100365
Now that LDS uses within non-kernel functions are being handled in the
LowerModuleLDS pass, we no longer need to forcefully inline non-kernel
functions just because they use LDS. Do forceful inlining only when the
LowerModuleLDS pass is not enabled. It is enabled by default.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D100481
Removes the builtins and intrinsics used to opt in to using these instructions
and replaces them with normal ISel patterns now that they are no longer
prototypes.
Differential Revision: https://reviews.llvm.org/D100402
Add a custom DAG combine and ISD opcode for detecting patterns like
(uint_to_fp (extract_subvector ...))
before the extract_subvector is expanded to ensure that they will ultimately
lower to f64x2.convert_low_i32x4_{s,u} instructions. Since these instructions
are no longer prototypes and can now be produced via standard IR, this commit
also removes the target intrinsics and builtins that had been used to prototype
the instructions.
Differential Revision: https://reviews.llvm.org/D100425
This is a service function generally useful for selection
of a FI in an SADDR. NFC for now, needed for future patch.
Differential Revision: https://reviews.llvm.org/D100406
Now that these instructions are no longer prototypes, we do not need to be
careful about keeping them opt-in and can use the standard LLVM infrastructure
for them. This commit removes the bespoke intrinsics we were using to represent
these operations in favor of the corresponding target-independent intrinsics.
The clang builtins are preserved because there is no standard way to easily
represent these operations in C/C++.
For consistency with the scalar codegen in the Wasm backend, the intrinsic used
to represent {f32x4,f64x2}.nearest is @llvm.nearbyint even though
@llvm.roundeven better captures the semantics of the underlying Wasm
instruction. Replacing our use of @llvm.nearbyint with use of @llvm.roundeven is
left to a potential future patch.
Differential Revision: https://reviews.llvm.org/D100411
Rename the "LDS lowering" pass option from `amdgpu-disable-lower-module-lds` to
`amdgpu-enable-lower-module-lds`, as the latter is consistent and reads better.
Reviewed By: JonChesterfield
Differential Revision: https://reviews.llvm.org/D100441
In the fold SHUFFLE(BINOP(X,Y),BINOP(Z,W)) -> BINOP(SHUFFLE(X,Z),SHUFFLE(Y,W)), check if both X/Z AND Y/W have at least one merge-able shuffle, in which case the total number of shuffles should still fall.
Helps with instruction count regressions we saw while fixing PR48823
The existing BTI placement pass avoids inserting "BTI c" when the
function has local linkage and is only directly called. However,
even in this case, there is a (small) chance that the linker later
adds a thunk with an indirect call to the function, e.g. if the
function is placed in a separate section and moved far away from
its callers. Make sure to add BTI for these functions too.
Differential Revision: https://reviews.llvm.org/D99417
Otherwise it reuses the same register for storing the stack slot
offset if the stack slot offset is big.
Differential Revision: https://reviews.llvm.org/D100461
Extension to rG74f98391a7a4, we can also include any of the upper (known zero) bits in the comparison in the shuffle removal fold, just as long as we demand all the elements of the movmsk source vector.
This fixes breakage on Windows/ARM64 after D94355.
Modelled after the corresponding code for X86; not entirely familiar
with those aspects of that layer otherwise.
Differential Revision: https://reviews.llvm.org/D99572
M68kAsmParser uses `llvm::getTheM68kTarget` from M68kInfo, therefore we
should put M68kInfo as its direct dependency. Otherwise the build will
fail when building LLVM libraries as shared objects (building LLVM
libraries statically won't have this problem though).
This is a follow up of D99010. We didn't consider the live range of shape registers when hoisting ldtilecfg. There may be risks, e.g. we may happen to insert it into an invalid range for some registers and get an unexpected error.
This patch fixes this problem by storing the value to the corresponding stack location of ldtilecfg immediately after each of its definitions.
This patch also fixes a problem in the previous code: if we don't have a ldtilecfg which dominates all AMX instructions, we cannot initialize shapes for the other ldtilecfg.
There are still some optimization points left, e.g. eliminating unused mov instructions, breaking the def-use dependency before RA, etc.
Reviewed By: LuoYuanke, xiangzhangllvm
Differential Revision: https://reviews.llvm.org/D99966
The VSX tablegen file has some rather egregious uses of
COPY_TO_REGCLASS even in situations where it needs to use
SUBREG_TO_REG. While this produces correct code, it often doesn't
allow the register coalescer to coalesce copies and the resulting
code ends up being suboptimal. This patch just changes over
patterns that should use SUBREG_TO_REG.
Currently, for any extern variable, if it doesn't have a
section attribute, it will be put into a default ".extern"
btf DataSec. The initial design is to put every extern
variable in a DataSec so libbpf can use it.
But later on, libbpf actually requires extern variables
to put into special sections, e.g., ".kconfig", ".ksyms", etc.
so they can be used properly based on section name.
Andrii mentioned that since ".extern" variables are
not actually used, it makes sense to remove the DataSec from
the compiler so libbpf does not need to deal with it,
esp. for static linking. The BTF for these extern variables
is still generated.
With this patch, I tested kernel selftests/bpf and all tests
passed. Indeed, removing the ".extern" DataSec seems to have no
impact.
Differential Revision: https://reviews.llvm.org/D100392
We already allow the comparison of the upper bits of 'IsAllOf' (allbits) patterns, but we can safely compare the known zero bits for 'IsAnyOf' (zerobits) patterns as well.
This fixes an issue where we were comparing a type wider than the number of vector elements, which avoids a regression mentioned in rGbaadbe04bf75.
- In the SystemZAsmParser, there will be a few queries to the type of dialect it is (AD_ATT, AD_HLASM) in future patches.
- It would be nice to have two small helper functions `isParsingATT()` and `isParsingHLASM()`
- Putting this as a separate smaller patch allows us to remove its definitions from other dependent patches.
Reviewed By: uweigand, abhina.sreeskantharajan
Differential Revision: https://reviews.llvm.org/D99891
For a global weak symbol defined as below:
char g __attribute__((weak)) = 2;
LLVM generates an allocated global with WeakAnyLinkage,
for which BPF backend generates proper BTF info.
For the above example, if a modifier "const" is added like
const char g __attribute__((weak)) = 2;
LLVM generates an allocated global with WeakODRLinkage,
for which BPF backend didn't generate any BTF as it
didn't handle WeakODRLinkage.
This patch adds support for WeakODRLinkage so that proper
BTF info can be generated for weak symbols defined with
the "const" modifier.
Differential Revision: https://reviews.llvm.org/D100362
- Currently, MCAsmInfo provides a CommentString attribute, that various targets can set, so that the AsmLexer can appropriately lex a string as a comment based on the set value of the attribute.
- However, AsmLexer also supports a few additional comment syntaxes, in addition to what's specified as a CommentString attribute. This includes regular C-style block comments (/* ... */), regular C-style line comments (// .... ) and #. While I'm not sure as to why this behaviour exists, I am assuming it does to maintain backward compatibility with GNU AS (see https://sourceware.org/binutils/docs/as/Comments.html#Comments for reference)
For example:
Consider a target which sets the CommentString attribute to '*'.
The following strings are all lexed as comments.
```
"# abc" -> comment
"// abc" -> comment
"/* abc */ -> comment
"* abc" -> comment
```
- In HLASM however, only "*" is accepted as a comment string, and nothing else.
- To achieve this, an additional attribute (`AllowAdditionalComments`) has been added to MCAsmInfo. If this attribute is set to false, then only the string specified by the CommentString attribute is used as a possible comment string to be lexed by the AsmLexer. The regular C-style block comments, line comments and "#" are disabled. As a final note, "#" will still be treated as a comment, if the CommentString attribute is set to "#".
Depends on https://reviews.llvm.org/D99277
Reviewed By: abhina.sreeskantharajan, myiwanch
Differential Revision: https://reviews.llvm.org/D99286
This patch adds attributes corresponding to
implicits to functions/kernels if
1. it has an indirect call OR
2. its address is taken.
Once such attributes are set, the rest of the codegen works
out of the box for indirect calls. This patch eliminates
the potential overhead -fixed-abi imposes even when indirect function
calls are not used.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D99347
This is a work-in-progress implementation of an assembler for M68k.
Outstanding work:
- Updating existing tests assembly syntax
- Writing new tests for the assembler (and disassembler)
I've left those until there's consensus that this approach is okay (I hope that's okay!).
Questions I'm aware of:
- Should this use Motorola or gas syntax? (At the moment it uses Motorola syntax.)
- The disassembler produces a table at runtime for disassembly generated from the code beads. Is this okay? (This is less than ideal but as I mentioned in my llvm-dev post, it's quite complicated to write a table-gen parser for code beads.)
Depends on D98519
Depends on D98532
Depends on D98534
Depends on D98535
Depends on D98536
Differential Revision: https://reviews.llvm.org/D98537
Prep work for adding intrinsics in the future.
Left an assert that the input is constant in ReplaceNodeResults,
as the intrinsic shouldn't go through that path.
We should consider the number of users of the feeder when we do the reverse
memory operation transformation. Otherwise, the transformation may have a
negative impact.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D100166
Currently the ARM backend only accepts constant expressions as the
immediate operand in load and store instructions. This allows the
result of symbolic expressions to be used in memory instructions. For
example,
```
0:
.space 2048
strb r2, [r0, #(.-0b)]
```
would be assembled into the following instruction.
```
strb r2, [r0, #2048]
```
This only adds support to ldr, ldrb, str, and strb in arm mode to
address the build failure of Linux kernel for now, but should facilitate
adding support to similar instructions in the future if the need arises.
Link:
https://github.com/ClangBuiltLinux/linux/issues/1329
Reviewed By: peter.smith, nickdesaulniers
Differential Revision: https://reviews.llvm.org/D98916
This patch adds more optimized codegen for the above SETCC forms,
by matching the '.vi' vector forms when the immediate is a 5-bit signed
immediate plus 1. The immediate can be decremented and the corresponding
SET[U]LE or SET[U]GT forms can be matched.
This work was left as a TODO from D94168.
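The rewrite relies on a simple integer identity, sketched here for the signed case (the unsigned variants follow the same identity as long as the immediate doesn't wrap): for an immediate C that is a 5-bit signed immediate plus 1, x < C is the same as x <= C-1, and x >= C the same as x > C-1, which exposes the .vi-encodable SET[U]LE / SET[U]GT forms.
```
#include <cassert>
#include <cstdint>

int main() {
  // C-1 must fit the 5-bit signed immediate range [-16, 15], so C is in
  // [-15, 16]; within that range there is no wrap-around and the identities
  // x < C  <=>  x <= C-1   and   x >= C  <=>  x > C-1   hold.
  for (int64_t C = -15; C <= 16; ++C)
    for (int64_t X = -40; X <= 40; ++X) {
      assert((X < C) == (X <= C - 1));
      assert((X >= C) == (X > C - 1));
    }
  return 0;
}
```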
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100096
Add a number of intrinsics which natively lower to MVE operations to the
lane interleaving pass, allowing it to efficiently interleave the lanes
of chunks of operations containing these intrinsics.
Differential Revision: https://reviews.llvm.org/D97293
Fixes the issues noted in PR48768, where the and/or/xor instruction had been promoted to avoid i8/i16 partial-dependencies, but the test against zero had not.
We can almost certainly relax this fold to work for any truncation, although it breaks a number of existing folds (notably movmsk folds which tend to rely on the truncate to determine the demanded bits/elts in the source vector).
There is a reverse combine in TargetLowering.SimplifySetCC so we must wait until after legalization before attempting this.
The previous code placed the first ldtilecfg at a point dominating all AMX registers' defs. This may result in the ldtilecfg being inserted into a loop.
This patch tries to calculate the nearest point where all shapes of AMX registers are reachable.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D99010
FP16 to FP32 converts can be handled in MVE lane interleaving, much like
the sext/zext lowering we do. This expands the pass with fpext and
fptrunc handling, and basic fp operations allowing more efficient
lowering of fp vectors.
Differential Revision: https://reviews.llvm.org/D97292
The patch makes two updates to the arm-block-placement pass:
- Handle arbitrarily nested loops
- Extends the search (for t2WhileLoopStartLR) to the predecessor of the
preHeader.
Differential Revision: https://reviews.llvm.org/D99649
This reverts commit cca9b5985c.
Buildbot reported an error for CodeGen/AArch64/machine-combiner-fmul-dup.mir:
*** Bad machine code: Virtual register killed in block, but needed live out. ***
- function: indexed_2s
- basic block: %bb.0 entry (0x640fee8)
Virtual register %7 is used after the block.
*** Bad machine code: Virtual register defs don't dominate all uses. ***
- function: indexed_2s
- v. register: %7
LLVM ERROR: Found 2 machine code errors.
This patch adds a DUP+FMUL => FMUL_indexed pattern to InstCombiner.
FMUL_indexed is normally selected during instruction selection, but it
does not work in cases when VDUP and VMUL are in different basic
blocks.
Differential Revision: https://reviews.llvm.org/D99662
Spilling the fp or bp to scratch could overwrite VGPRs of inactive
lanes. Fix that by using only the active lanes of the scavenged VGPR.
This builds on the assumptions that
1. a function is never called with exec=0
2. lanes do not die in a function, i.e. exec!=0 in the function epilog
3. no new lanes are active when exiting the function, i.e. exec in the
epilog is a subset of exec in the prolog.
Differential Revision: https://reviews.llvm.org/D96869
Spilling SGPRs to scratch uses a temporary VGPR. LLVM currently cannot
determine if a VGPR is used in other lanes or not, so we need to save
all lanes of the VGPR. We even need to save the VGPR if it is marked as
dead.
The generated code depends on two things:
- Can we scavenge an SGPR to save EXEC?
- And can we scavenge a VGPR?
If we can scavenge an SGPR, we
- save EXEC into the SGPR
- set the needed lane mask
- save the temporary VGPR
- write the spilled SGPR into VGPR lanes
- save the VGPR again to the target stack slot
- restore the VGPR
- restore EXEC
If we were not able to scavenge an SGPR, we do the same operations, but
every time the temporary VGPR is written to memory, we
- write VGPR to memory
- flip exec (s_not exec, exec)
- write VGPR again (previously inactive lanes)
Surprisingly often, we are able to scavenge an SGPR, even though we are
on the brink of running out of SGPRs.
Scavenging a VGPR does not have a great effect (saves three instructions
if no SGPR was scavenged), but we need to know whether the VGPR we use is
live beforehand or not; otherwise the machine verifier complains.
Differential Revision: https://reviews.llvm.org/D96336
This patch adds the memory operands for indexed loads so
that certain optimizations can take place.
Differential Revision: https://reviews.llvm.org/D100215
Change-Id: I539fcf046ca4ad1e7df1d893f57d751419d8364d
XSCMPUQP is not available on pre-P9 subtargets. This patch lowers
them into libcalls for correct behavior on power7/power8.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D92083
Improve AVX512 mask inversion: rG38c799bce801 exposed some missing opportunities to move scalar not() back onto the boolvector types for folding with setcc etc.
In the final SIMD spec, there is only a single v128.any_true instruction, rather
than one for each lane interpretation because the semantics do not depend on the
lane interpretation.
Differential Revision: https://reviews.llvm.org/D100241
Followup to D100177, handling a similar (De Morgan inverse style) case from PR47797 as well.
The AVX512 test cases could be further improved if we folded not(iX bitcast(vXi1)) -> (iX bitcast(not(vXi1)))
Alive2: https://alive2.llvm.org/ce/z/AnA_-W
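The Alive2 link above has the authoritative statement of the fold; as a hedged sketch, a De Morgan inverse of the AND-based case is presumably of this general shape, where an OR-based equality test becomes an AND-NOT test against zero (which is again a natural fit for ANDN):
```
/* Illustrative identity only -- see the Alive2 link for the exact fold.
 * (x | y) == y holds exactly when every set bit of x is also set in y,
 * which is the same as (x & ~y) == 0. */
#include <assert.h>
#include <stdint.h>

int main(void) {
  uint32_t samples[] = {0u, 1u, 0xFFu, 0xF0F0F0F0u, 0xFFFFFFFFu};
  for (unsigned i = 0; i < 5; ++i)
    for (unsigned j = 0; j < 5; ++j) {
      uint32_t x = samples[i], y = samples[j];
      assert(((x | y) == y) == ((x & ~y) == 0));
    }
  return 0;
}
```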
The first source has the same EEW as the destination and the other
source is a scalar so the overlap constraints don't apply to
the unmasked version.
For the masked version we have a constraint that the destination
can't be V0 so that covers the only overlap issue there.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D100217
Added cost estimation for the switch instruction, updated branch costs, and fixed
the phi cost.
Had to increase the `-amdgpu-unroll-threshold-if` default value since the conditional
branch cost (size) was corrected to a higher value.
Test renamed to "control-flow.ll".
Removed redundant code in `X86TTIImpl::getCFInstrCost()` and
`PPCTTIImpl::getCFInstrCost()`.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D96805
This adds support for swapping comparison operands when it may introduce new
folding opportunities.
This is roughly the same as the code added to AArch64ISelLowering in
162435e7b5.
For an example of a testcase which exercises this, see
llvm/test/CodeGen/AArch64/swap-compare-operands.ll
(Godbolt for that testcase: https://godbolt.org/z/43WEMb)
The idea behind this is that sometimes, we may be able to fold away, say, a
shift or extend in a compare by swapping its operands.
e.g. in the case of this compare:
```
lsl x8, x0, #1
cmp x8, x1
cset w0, lt
```
The following is equivalent:
```
cmp x1, x0, lsl #1
cset w0, gt
```
Most of the code here is just a reimplementation of what already exists in
AArch64ISelLowering.
(See `getCmpOperandFoldingProfit` and `getAArch64Cmp` for the equivalent code.)
Note that most of the AND code in the testcase doesn't actually fold. It seems
like we're missing selection support for that sort of fold right now, since SDAG
happily folds these away (e.g. testSwapCmpWithShiftedZeroExtend8_32 in the
original .ll testcase)
Differential Revision: https://reviews.llvm.org/D89422
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.
Differential Revision: https://reviews.llvm.org/D100189
Register types for xxsplti32dx in two td file patterns were incorrect.
Fixed the two types and added a test case that was reduced from a larger
failing test.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D100223
When lowering a BUILD_VECTOR SDNode, we choose among various possible vector
creation instructions in an attempt to minimize the total number of instructions
used. We previously considered using swizzles, consts, and splats, and this
patch adds shuffles as well. A common pattern that now lowers to shuffles is
when two 64-bit vectors are concatenated. Previously, concatenations generally
lowered to sequences of extract_lane and replace_lane instructions when they
could have been a single shuffle.
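A hypothetical example of the concatenation pattern, written with Clang's generic vector extensions rather than code from the patch (type and function names are made up); the intent is that a concat like this can now become a single shuffle instead of a chain of extract_lane/replace_lane:
```
/* Hypothetical illustration: concatenating two 64-bit vectors into one
 * 128-bit vector is expressed as a shufflevector, which the new lowering
 * can emit as a single shuffle instruction. */
typedef int v2i32 __attribute__((vector_size(8)));
typedef int v4i32 __attribute__((vector_size(16)));

v4i32 concat_two_halves(v2i32 a, v2i32 b) {
  return __builtin_shufflevector(a, b, 0, 1, 2, 3);
}
```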
Differential Revision: https://reviews.llvm.org/D100018
There are four new PowerPC instructions that are introduced in
Power 10. They are hashst, hashchk, hashstp, hashchkp.
These instructions will be used for ROP Protection.
This patch adds the four instructions.
Reviewed By: nemanjai, amyk, #powerpc
Differential Revision: https://reviews.llvm.org/D99375
I've initially just enabled this for BMI, which has the ANDN instruction for i32/i64 - the i16/i8 cases give an idea of what we'd get when we enable it in all cases (I'll do this as a later commit).
Additionally, the i16/i8 cases could be freely promoted to i32 (as the args are already zeroext) and we could then make use of ANDN + the free cmp0 there as well - this has come up in PR48768 and PR49028 so I'm going to look at this soon.
https://alive2.llvm.org/ce/z/QVWHP_
https://alive2.llvm.org/ce/z/pLngT-
Vector cases do not appear to benefit from this as we end up with having to generate the zero vector as well - this is one of the reasons I didn't try to tie this into hasAndNot/hasAndNotCompare.
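The Alive2 links above carry the authoritative folds; as a hedged sketch, one BMI-friendly shape consistent with the description is an equality test of the form (x & y) == y, which is the same as (~x & y) == 0 and so maps onto ANDN plus a free compare with zero from the flags:
```
/* Illustrative identity only -- the exact folds are in the Alive2 links.
 * (x & y) == y holds exactly when every set bit of y is also set in x,
 * i.e. (~x & y) == 0, which ANDN computes directly. */
#include <assert.h>
#include <stdint.h>

int main(void) {
  uint64_t samples[] = {0u, 1u, 0xFFu, 0xF0F0F0F0u, UINT64_MAX};
  for (unsigned i = 0; i < 5; ++i)
    for (unsigned j = 0; j < 5; ++j) {
      uint64_t x = samples[i], y = samples[j];
      assert(((x & y) == y) == ((~x & y) == 0));
    }
  return 0;
}
```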
Differential Revision: https://reviews.llvm.org/D100177
This is cheap to implement, means less work for future passes like
MachineDCE, and slightly improves the folding in some cases.
Differential Revision: https://reviews.llvm.org/D100117
Use SIInstrFlags to differentiate between the different
variants of flat instructions (flat, global and scratch).
This should make it easier to bundle the immediate offset logic in a
single place and implement restrictions and bug workarounds.
Fixed version of D99587, which does not rely on the address space.
Differential Revision: https://reviews.llvm.org/D99743
The main reason is preparation for transforming AliasResult into a class that contains
an offset for the PartialAlias case.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D98027
The new SDTypeProfile can be reused in the future for other word operation patterns without an explicit i64 type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100097
Add an explicit i64 type to RV64-only patterns to stop emitting unneeded i32 patterns.
It can reduce the isel table size.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100089
Instead of instantiating multiclasses inside multiclasses, just
inherit from them.
We can do the same for the VPseudo* multiclasses, but that may
interfere with the scheduler class work.