This simplifies our existing G_EXTRACT rules and adds some test coverage. The
main motivation is that it should make it easier to improve legalization for
instructions which use G_EXTRACT as part of the legalization process.
This also adds support for legalizing some weird types. Similar to other recent
legalizer changes, this changes the order of widening/clamping.
There was some dead code in our existing rules (e.g. the p0 case would never get
hit), so this knocks those out and makes the types we want to handle explicit.
This also removes some checks which, nowadays, are handled by the
MachineVerifier.
Differential Revision: https://reviews.llvm.org/D107505
This is re-landing the same patch again, but without the changes to
LegalizerHelper that regressed the Mips test:
test/CodeGen/Mips/GlobalISel/llvm-ir/ctpop.ll
Differential Revision: https://reviews.llvm.org/D106494
G_CONCAT_VECTORS shows up from time to time when legalizing other instructions.
We actually import patterns for the v16s8 <- v8s8, v8s8 case so marking it
as legal gives us selection for free.
Differential Revision: https://reviews.llvm.org/D107512
We don't have real demanded bits support for MULHU, but we can
still use the known bits based constant folding support at the end
of SimplifyDemandedBits to simplify a MULHU. This helps with cases
where we know the LHS and RHS have enough leading zeros so that
the high multiply result is always 0.
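A hedged, self-contained illustration of the folded case (plain C++, not the
in-tree code): if the operands' known leading zeros together cover the bit
width, the high half of the product is always zero.
  #include <cassert>
  #include <cstdint>
  static uint32_t mulhu32(uint32_t A, uint32_t B) {
    // High 32 bits of the 64-bit product, as MULHU computes.
    return static_cast<uint32_t>((static_cast<uint64_t>(A) * B) >> 32);
  }
  int main() {
    uint32_t X = 0xFFFF, Y = 0xFFFF; // both known to have 16 leading zeros
    assert(mulhu32(X, Y) == 0);      // product fits in 32 bits, so high = 0
    return 0;
  }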
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D106471
IR typically creates INSERT_SUBVECTOR patterns as a widening of the subvector with undefs to pad to the destination size, followed by a shuffle for the actual insertion - SelectionDAGBuilder has to do something similar for shuffles when source/destination vectors are different sizes.
This combine attempts to recognize these patterns by looking for a shuffle of a subvector (from a CONCAT_VECTORS) that starts at a modulo of its size into an otherwise identity shuffle of the base vector.
This uncovered a couple of target-specific issues as we haven't often created INSERT_SUBVECTOR nodes in generic code - aarch64 could only handle insertions into the bottom of undefs (i.e. a vector widening), and x86-avx512 vXi1 insertion wasn't keeping track of undef elements in the base vector.
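A hedged sketch of the mask test being described (simplified; undef elements
and multi-use cases are ignored, and all names are illustrative):
  #include <vector>
  // Returns true if Mask is an identity shuffle of the NumElts-wide base
  // vector except for a SubElts-wide run, aligned to a multiple of SubElts,
  // that picks the whole widened subvector (operand 1, indices >= NumElts).
  static bool isInsertSubvectorMask(const std::vector<int> &Mask, int NumElts,
                                    int SubElts, int &InsertIdx) {
    for (int Start = 0; Start + SubElts <= NumElts; Start += SubElts) {
      bool Match = true;
      for (int I = 0; I < NumElts && Match; ++I) {
        bool InRun = I >= Start && I < Start + SubElts;
        Match = Mask[I] == (InRun ? NumElts + (I - Start) : I);
      }
      if (Match) {
        InsertIdx = Start;
        return true;
      }
    }
    return false;
  }
For example, Mask <0,1,2,3,8,9,10,11> with NumElts=8 and SubElts=4 matches an
insertion of the subvector at index 4.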
Fixes PR50053
Differential Revision: https://reviews.llvm.org/D107068
Instructions for which produceSameValue returns true produce the same value
only for operands with the same index. matchEqualDefs used to return true for
any two values from different instructions that produce the same values,
regardless of the operand index. Fix this by checking that the values are
defined by operands with the same index.
Differential Revision: https://reviews.llvm.org/D107362
- Remove a redundant test: there were `longjmp_only` and `only_longjmp`,
which do the same thing
- Add `CHECK-LABEL` lines for function names
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D107511
Emit references to '__do_global_ctors' and '__do_global_dtors' to allow
constructor/destructor routines to run.
Reviewed by: MaskRay
Differential Revision: https://reviews.llvm.org/D107133
Kuniyuki Iwashima reported in [1] that the LLVM compiler may
convert a loop exit condition from "i < bound" to "i != bound", where
"i" is the loop index variable and "bound" is the upper bound.
In case "bound" is not a constant, the verifier must assume "i != bound"
can always remain true, which causes a verifier failure since, to the
verifier, this is an infinite loop.
The fix is to avoid transforming "i < bound" into "i != bound".
In LLVM, the transformation is done by the IndVarSimplify pass.
The compiler checks the cost of the loop-condition arithmetic (i = i + 1)
and, if the cost is low enough, it may transform "i < bound" into
"i != bound". This patch implements getArithmeticInstrCost() in the BPF
TargetTransformInfo class to return a higher cost for such an operation,
which prevents the transformation for the test case
added in this patch.
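A hedged sketch of the problematic shape (plain C++; names are illustrative,
not the kernel reproducer):
  // With a non-constant bound, rewriting the exit test breaks verification:
  void copy(int *dst, const int *src, int bound) {
    for (int i = 0; i < bound; i++) // may be rewritten to "i != bound"
      dst[i] = src[i];
  }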
[1] https://lore.kernel.org/netdev/1994df05-8f01-371f-3c3b-d33d7836878c@fb.com/
Differential Revision: https://reviews.llvm.org/D107483
Clamp the max number of elements when legalizing G_PHI. This allows us to
legalize some common fallbacks like 4 x s64.
Here's an example: https://godbolt.org/z/6YocsEYTd
Had to add -global-isel-abort=0 to legalize-phi.mir to account for the
G_EXTRACT_VECTOR_ELT from the 32 x s8 G_PHI.
Differential Revision: https://reviews.llvm.org/D107508
to `lib/CodeGen/CommandFlags.cpp`. It can replace
-x86-experimental-pref-loop-alignment=.
The loop alignment is only used by MachineBlockPlacement.
The implementation uses a new `llvm::TargetOptions` for now, as
an IR function attribute/module flags metadata may be overkill.
This is the llvm part of D106701.
This allows special constants like 0 to be recognized. It's also
expected by isel patterns if a target has mulh instructions that take an
immediate. The commuting done by tablegen won't commute patterns with
immediates since it expects DAGCombine to have done it.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D107486
InstCombine canonicalizes c ? (x+y) : x to (c ? y : 0) + x. It
does the same for and/or/xor. We already reverse this transform
for those, but don't do add/sub yet.
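A hedged scalar sketch of the two forms (illustrative code, not the compiler
transform itself):
  #include <cassert>
  int main() {
    int X = 10, Y = 3;
    for (int C = 0; C <= 1; ++C)
      // c ? (x + y) : x  is equivalent to  (c ? y : 0) + x
      assert((C ? X + Y : X) == ((C ? Y : 0) + X));
    return 0;
  }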
This allows us to handle weird types like s88; we first widen to s128, then
clamp back down to s64.
https://godbolt.org/z/9xqbP46Mz
This also makes it possible for GISel to legalize the case in pr48188.ll. It
now does the same thing as SDAG, although regalloc chooses different registers.
Differential Revision: https://reviews.llvm.org/D107417
Going through our legalization rules and doing some cleanup.
Widening and then clamping is usually easier than clamping and then widening.
This allows us to legalize some weird types like s88.
Differential Revision: https://reviews.llvm.org/D107413
An insert subvector that is inserting the result of a vector predicate
sized load into undef at index 0, whose result is casted to a predicate
type, can be combined into a direct predicate load. Likewise the same
applies to extract subvector but in reverse.
The purpose of this optimization is to clean up cases that will be
introduced in a later patch where casts to/from predicate types from i8
types will use insert subvector, rather than going through memory early.
This optimization is done in SVEIntrinsicOpts rather than InstCombine to
re-introduce scalable loads as late as possible, to give other
optimizations the best chance possible to do a good job.
Differential Revision: https://reviews.llvm.org/D106549
This fixes a crash:
Don't know how to custom expand this
UNREACHABLE executed at llvm-project/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:16788
The fix is to provide missing expansions for:
case ISD::STRICT_FP_TO_UINT:
case ISD::STRICT_FP_TO_SINT:
A test case is provided.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D107452
This patch introduces a new code object metadata field, ".kind"
which is used to add support for init and fini kernels.
HSAStreamer will use function attributes, "device-init" and
"device-fini" to distinguish between init and fini kernels from
the regular kernels and will emit metadata with ".kind" set to
"init" and "fini" respectively.
To reduce the number of init and fini kernels, the ctors and
dtors present in LLVM's global.ctors and global.dtors lists
are called from a single init and fini kernel respectively.
Reviewed by: yaxunl
Differential Revision: https://reviews.llvm.org/D105682
This assert is intended to ensure that high registers are not selected
when passed to one of the Thumb UXT instructions. However, it was
triggering even for 32-bit cases, where no UXT instruction is emitted.
Fixes PR51313.
Differential Revision: https://reviews.llvm.org/D107363
Given a shuffle mask, if it is picking from an input that is splat
given the current granularity of the shuffle, then adjust the mask
to pick from the same lane of the input as the mask element is in.
This may result in a shuffle being simplified into a blend.
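A hedged sketch of the mask adjustment (illustrative helper, not the patch
code; operand 1 is assumed to be the splat input):
  #include <vector>
  // Any element taken from the splat input can instead be taken from the
  // lane matching its own position, since all lanes of a splat are equal.
  // The result may then match a blend instead of a generic shuffle.
  static void adjustMaskForSplatOp1(std::vector<int> &Mask, int NumElts) {
    for (int I = 0, E = static_cast<int>(Mask.size()); I != E; ++I)
      if (Mask[I] >= NumElts)              // picks from the splat (operand 1)
        Mask[I] = NumElts + (I % NumElts); // pick the same lane instead
  }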
I believe this is correct given that the splat detection matches the one
just above the new code.
My basic thought is that we might be able to get fewer regressions
by handling multiple insertions of the same value into a vector
if we form broadcasts+blend here, as opposed to D105390,
but I have not really thought this through,
and did not try implementing it yet.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D107009
This attempts to make more of RDA aware of potentially overlapping
subregisters. Some of this was already in place, with it iterating
through MCRegUnitIterators. This also replaces calls to
LiveRegs.contains(..) with !LiveRegs.available(..), and updates the
isValidRegUseOf and isValidRegDefOf to search subregs.
Differential Revision: https://reviews.llvm.org/D107351
Clang has the builtin function '__builtin_isnan', which implements the C
library function 'isnan'. This function is currently implemented entirely in
clang codegen, which expands the function into a set of IR operations.
There are three mechanisms by which the expansion can be made.
* The most common mechanism is using an unordered comparison made by
the instruction 'fcmp uno'. This simple solution is target-independent
and works well in most cases. It is, however, not suitable if floating
point exceptions are tracked. The corresponding IEEE 754 operation and C
function must never raise an FP exception, even if the argument is a
signaling NaN. Compare instructions usually do not have this
property; they raise an 'invalid' exception in that case. So this
mechanism is unsuitable when exception behavior is strict. In
particular, it could result in unexpected trapping if the argument is an SNaN.
* Another solution was implemented in https://reviews.llvm.org/D95948.
It is used in the cases when raising FP exceptions by 'isnan' is not
allowed. This solution implements 'isnan' using integer operations.
It solves the problem of exceptions, but it offers a single solution for
all targets, although some could do the check in a more efficient way.
* The solution implemented by https://reviews.llvm.org/D96568 introduced a
hook 'clang::TargetCodeGenInfo::testFPKind', which injects target-specific
code into IR. Currently only SystemZ implements this hook, and it
generates a call to a target-specific intrinsic function.
Although these mechanisms allow 'isnan' to be implemented with enough
efficiency, expanding 'isnan' in clang has drawbacks:
* The operation 'isnan' is hidden behind generic integer operations or
target-specific intrinsics. This complicates analysis and can prevent
some optimizations.
* IR can be created by tools other than clang, in which case the treatment
of 'isnan' has to be duplicated in each such tool.
Another issue with the current implementation of 'isnan' comes from the
use of the options '-ffast-math' or '-fno-honor-nans'. If such an option is
specified, 'fcmp uno' may be optimized to 'false'. This is a valid
optimization in general, but it results in 'isnan' always returning
'false'. For example, in some libc++ implementations the following code
returns 'false':
std::isnan(std::numeric_limits<float>::quiet_NaN())
The options '-ffast-math' and '-fno-honor-nans' imply that FP operation
operands are never NaNs. This assumption, however, should not be applied
to functions that check FP number properties, including 'isnan'. If
such a function returns an assumed result instead of actually making the
check, it becomes useless in many cases. The option '-ffast-math' is
often used for performance-critical code, as it can speed up execution
at the expense of manual treatment of corner cases. If 'isnan' returns
the assumed result, a user cannot use it in the manual treatment of NaNs
and has to invent replacements, like making the check using integer
operations. There is a discussion in https://reviews.llvm.org/D18513#387418,
which also expresses the opinion that limitations imposed by
'-ffast-math' should be applied only to 'math' functions but not to
'tests'.
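A hedged, self-contained reproducer of the problem (behavior depends on the
compiler and options; not code from this patch):
  #include <cmath>
  #include <cstdio>
  #include <limits>
  int main() {
    float QNaN = std::numeric_limits<float>::quiet_NaN();
    // Compiled with -ffast-math, the check may be folded to 'false' and
    // this can print 0 instead of the expected 1.
    std::printf("%d\n", std::isnan(QNaN) ? 1 : 0);
    return 0;
  }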
To overcome these drawbacks, this change introduces a new IR intrinsic
function 'llvm.isnan', which realizes the check as specified by the IEEE-754
and C standards in a target-agnostic way. During IR transformations it
does not undergo undesirable optimizations. It reaches instruction
selection, where it is lowered in a target-dependent way. The lowering can
vary depending on options like '-ffast-math' or '-ffp-model' so that the
resulting code satisfies the requested semantics.
Differential Revision: https://reviews.llvm.org/D104854
While collecting reachable callees (from kernels), ignore call graph nodes
which do not have an associated function, or whose associated function is not
a definition.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D107329
- Rename `wasm.catch` intrinsic to `wasm.catch.exn`, because we are
planning to add a separate `wasm.catch.longjmp` intrinsic which
returns two values.
- Rename several variables
- Remove an unnecessary parameter from `canLongjmp` and `isEmAsmCall`
from LowerEmscriptenEHSjLj pass
- Add `-verify-machineinstrs` in a test for a safety measure
- Add more comments + fix some errors in comments
- Replace `std::vector` with `SmallVector` for cases likely with small
number of elements
- Rename `EnableEH`/`EnableSjLj` to `EnableEmEH`/`EnableEmSjLj`: We are
soon going to add `EnableWasmSjLj`, so this makes the distinction
clearer
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D107405
Previously we would emit constant pool entries for ldr inline asm at the
very end of AsmPrinter::doFinalization(). However, if we're emitting
dwarf aranges, that would end all sections with aranges. Then if we have
constant pool entries to be emitted in those same sections, we'd hit an
assert that the section has already been ended.
We want to emit constant pool entries before emitting dwarf aranges.
This patch splits out arm32/64's constant pool entry emission into its
own MCTargetStreamer virtual method.
Fixes PR51208
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D107314
This changes the lowering of f32 and f64 COPY from a 128bit vector ORR to
a fmov of the appropriate type. At least on some CPUs with 64bit NEON
data paths this is expected to be faster, and shouldn't be slower on any
CPU that treats fmov as a register rename.
Differential Revision: https://reviews.llvm.org/D106365
This patch extends the optimization of VID-sequence BUILD_VECTORs
introduced in D104921 to include simple fractional steps composed of a
separated integer numerator and denominator.
A notable limitation in this sequence detection is that only sequences
with steps N/1 or 1/D are found, meaning that the step between elements
and the frequency with which it changes is consistent across the whole
sequence. Fractional steps such as 2/3 won't be matched as those would
involve more complex tracking of state or some level of backtracking.
As it stands, however, this patch is sufficient to match common
interleave-type shuffle indices, for example matching `<0,0,1,1>` (or
commonly `<0,u,1,u>` or `<u,0,u,1>`) to an index sequence divided by 2.
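A hedged worked example of the 1/2-step case (illustrative arithmetic only):
  #include <cassert>
  int main() {
    // vid = <0,1,2,3>; dividing by the denominator 2 yields <0,0,1,1>,
    // the common interleave index sequence mentioned above.
    int Expected[4] = {0, 0, 1, 1};
    for (int I = 0; I < 4; ++I)
      assert(I / 2 == Expected[I]);
    return 0;
  }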
While the optimization is relatively `undef`-tolerant, due to greedy
pattern-matching there are even some simple patterns which confuse the
sequence detection into identifying either a suboptimal sequence or no
sequence at all.
Currently only fractional-step sequences identified as having a
power-of-two denominator are actually lowered to RVV instructions. This
is to avoid introducing divisions into the generated code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106533
Add a comment when there is a shifted value,
add x9, x0, #291, lsl #12 ; =1191936
but not when the immediate value is unshifted,
subs x9, x0, #256 ; =256
since the comment would add nothing for the reader.
Differential Revision: https://reviews.llvm.org/D107196
These instructions have an implicit use of vcc which counts towards the
constant bus limit. Pre gfx10 this means that the explicit operands
cannot be sgprs. Use the custom inserter hook to call legalizeOperands
to enforce that restriction.
Fixes https://bugs.llvm.org/show_bug.cgi?id=51217
Differential Revision: https://reviews.llvm.org/D106868
Add a new pass LowerRefTypesIntPtrConv to generate a debugtrap
instruction for inttoptr and ptrtoint of a reference type instead
of erroring, since calling these instructions on non-integral pointers
has since been allowed (see ac81cb7e6).
Differential Revision: https://reviews.llvm.org/D107102
If a vsetvli instruction is not compatible with the next vector instruction,
and there is nothing else that may update or use VL/VTYPE, we can merge
it with the next vsetvli instruction that would be inserted for the vector
instruction.
This commit only merges the VTYPE with the former vsetvli instruction when
it has the same VL.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106857
This adds handling for two cases:
1. A scalable vector where the element type is promoted.
2. A scalable vector where the element count is odd (or more generally,
not divisible by the element count of the part type).
(Some element types still don't work; for example, <vscale x 2 x i128>,
or <vscale x 2 x fp128>.)
Differential Revision: https://reviews.llvm.org/D105591
When a value is expected to be extended, we should emit an extended load rather
than a normal G_LOAD.
Add checklines to arm64-abi.ll which show that we now emit the correct loads.
For ease of comparison: https://godbolt.org/z/8WvY6EfdE
Differential Revision: https://reviews.llvm.org/D107313
Add a new pass LowerRefTypesIntPtrConv to generate a trap
instruction for inttoptr and ptrtoint of a reference type instead
of erroring, since calling these instructions on non-integral pointers
has since been allowed (see ac81cb7e6).
Differential Revision: https://reviews.llvm.org/D107102
This optimizes out the mask when the result of a bitmask is interpreted as an i8
or i16 value. Resolves PR50507.
Differential Revision: https://reviews.llvm.org/D107103
ArtifactValueFinder keeps trying to combine g_unmerge_values in some cases.
The fix is to skip the combine attempt for dead defs.
Differential Revision: https://reviews.llvm.org/D106879
If the block target for a WLSTP instruction is known to be out of range,
and cannot be fixed by the ARMBlockPlacementPass, we can relax it to a
DLSTP (and cmp/branch) to still allow the creation of tail predicated
loops. That is what this patch does, adding extra revert code to the
fallback path of ARMBlockPlacementPass.
Due to the code produced when reverting, this creates a DLSTP between a
Bcc and a Br. As a DLS isn't necessarily a terminator, we need to split
the block so that there is a block to move the DLS/Br into.
Differential Revision: https://reviews.llvm.org/D104709
Application of default mapping to BVH intrinsics was missing.
Copy parts of SelectionDAG test to GlobalISel test as these would
have indicated this error.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D107211
Currently, the default alignment is much larger than the actual size of
the vector in memory. Fix this to use a sane default.
For SVE, temporarily remove lowering of load/store operations for
predicates with less than 16 elements. The layout the backend was
assuming for SVE predicates with less than 16 elements doesn't agree
with the frontend. More work probably needs to be done here.
This change is, strictly speaking, not backwards-compatible at the
bitcode level. But probably nobody is actually depending on that; i1
vectors in memory are rare, and the code that does use them probably
ends up forcing the alignment to something sane anyway. If we think
this is a concern, I can restrict this to scalable vectors for now
(where it's actually causing issues for me at the moment).
Differential Revision: https://reviews.llvm.org/D88994
The load/store tests are giant and have some cases that fail in them,
but it's hard to tell which ones are really failing. Check the remarks
to make it easier to track.
If all demanded elements of the BUILD_VECTOR pass a isGuaranteedNotToBeUndefOrPoison check, then we can treat this specific demanded use of the BUILD_VECTOR as guaranteed not to be undef or poison either.
Differential Revision: https://reviews.llvm.org/D107174
This patch legalizes the Machine Value Type introduced in D94096 for loads
and stores. A new target hook named getAsmOperandValueType() is added which
maps i512 to MVT::i64x8. GlobalISel falls back to DAG for legalization.
Differential Revision: https://reviews.llvm.org/D94097
This distributes reductions based on the relative offset of loads, if
one is found from their operands. Given chains of reductions this will
then sort them in ascending load order, which in turn can help simple
prefetches latch on to increasing strides more easily.
Differential Revision: https://reviews.llvm.org/D106569
This could be smarter by picking an ideal type, or at least splitting
the vector in half first. This also handles lowering non-power-of-2,
non-extending vector loads.
Currently this just avoids failing to legalize some odd vector AMDGPU
tests, but is a step towards removing the split logic from the
NarrowScalar logic.
The code for splitting an unaligned access into 2 pieces is
essentially the same as for splitting a non-power-of-2 load for
scalars. It would be better to pick an optimal memory access size and
directly use it, but splitting in half is what the DAG does.
As-is this fixes handling of some unaligned sextload/zextloads for
AMDGPU. In the future this will help drop the ugly abuse of
narrowScalar to handle splitting unaligned accesses.
This adds a combine for adds of reductions, distributing them so that
they occur sequentially to enable better use of accumulating VADDVA
instructions. It combines:
add(X, add(vecreduce(Y), vecreduce(Z))) ->
add(add(X, vecreduce(Y)), vecreduce(Z))
and
add(add(A, reduce(B)), add(C, reduce(D))) ->
add(add(add(A, C), reduce(B)), reduce(D))
These together distribute the adds so that more reductions can be
selected to VADDVA.
Differential Revision: https://reviews.llvm.org/D106532
Under MVE we can use VADDV/VADDVA's to perform integer add reductions,
so it can be beneficial to use more reductions than summing subvectors
and reducing once. Especially for VMLAV/VMLAVA the mul can be
incorporated into the reduction, producing fewer instructions.
Some of the test cases currently get larger due to extra integer adds,
but will be improved in a followup patch.
Differential Revision: https://reviews.llvm.org/D106531
These are generated from SLP vectorization, for example
https://godbolt.org/z/ebxdPh1Kz. These backend tests
show cases that we can produce better code for.
Register hints when copying to a UACC register do not always produce VSRp
registers. This patch makes sure that we do not produce hints in cases
where the subregister of the UACC is not a VSRp.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D107101
This patch prevents GlobalISel from optimizing out redundant branch
instructions when compiling without optimizations.
The motivating example is code like the following common pattern in
Swift, where users expect to be able to set a breakpoint on the early
exit:
public func f(b: Bool) {
  guard b else {
    return // I would like to set a breakpoint here.
  }
  ...
}
The patch modifies two places in GlobalISel: The first one is in
IRTranslator.cpp, where the removal of redundant branches is made
conditional on the optimization level. The second one is in
AArch64InstructionSelector.cpp, where an -O0 *only* optimization is
being removed.
Disabling these optimizations increases code size at -O0 by
~8%. However, doing so improves debuggability, and debug builds are
the primary reason why developers compile without optimizations. We
thus concluded that this is the right trade-off.
rdar://79515454
This tentatively reapplies the patch without modifications; the LLDB
test that has blocked this from landing previously has since been
modified to hopefully no longer be sensitive to this change.
Differential Revision: https://reviews.llvm.org/D105238
If a target lists both a subreg and a superreg in a callee-saved
register mask, the prolog will spill both aliasing registers. Instead,
don't spill the subreg if a superreg is being spilled. This case is hit by the
PowerPC SPE code, as well as a modified RISC-V backend for CHERI I maintain out
of tree.
Reviewed By: jhibbits
Differential Revision: https://reviews.llvm.org/D73170
This transform was added with D58874, but there were no tests for overflow ops.
We need to change this one way or another because it can crash as shown in:
https://llvm.org/PR51238
Note that if there are no uses of an overflow op's bool overflow result, we
reduce it to a regular math op, so we continue to fold that case either way.
If we have uses of both the math and the overflow bool, then we are likely
not saving anything by creating an independent sub instruction as seen in
the test diffs here.
This patch makes the behavior in SDAG consistent with what we do in
instcombine AFAICT.
Differential Revision: https://reviews.llvm.org/D106983
There's a generic combine for these, but no test coverage.
It's not clear if this is actually a good fold.
The combine was added with D58874, but it has a bug that
can cause crashing ( https://llvm.org/PR51238 ).
An incorrect mask type when lowering an SVE gather/scatter was causing
a codegen fault, which manifested as the incorrect predicate size being
used (e.g., p0.b rather than p0.d).
Fixes PR51182.
Differential Revision: https://reviews.llvm.org/D106943
While v_cmp will AND inactive lanes with 0, that is not the case for logical
operations.
This fixes a Vulkan CTS test that would hang otherwise.
Differential Revision: https://reviews.llvm.org/D105709
This patch aims to improve the performance of BUILD_VECTORs which are
identified as containing a dominant element. Given that most
floating-point constants themselves require a load from the constant
pool, it was possible for the optimization to actually increase the
number of individual loads on small vectors. The exception is the zero
constant -- +0.0 -- which can be materialized efficiently.
While this optimization could do with a proper cost model to weigh the
benefits of a single vector load vs. the manipulation of individual
elements -- even for integer vectors which often require several
instructions to materialize -- without a concrete RVV implementation to
work with, any heuristic is likely to be both more obtuse and inaccurate.
Until then, this patch fixes at least one known obvious deficiency.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106963
The second test case added here was pointed out to me by @craig.topper
and shows how we "optimize" a two-element BUILD_VECTOR from being one
load from the constant pool to two loads from the constant pool.
The first test case shows that since materialization for the
floating-point +0.0 value is cheap and doesn't involve a load, the
optimization is more clearly beneficial here.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106962
Function findBestLoopTopHelper tries to find a new loop top block which can also
fall through to OldTop, but it's impossible if OldTop is not a chain header, so
it should exit immediately.
Differential Revision: https://reviews.llvm.org/D106329
Swap the order of widening so that we widen to the next power-of-2 first when
legalizing G_LOAD.
Also, provide a minimum type for the power of 2 to disallow s2 + s1. Clamping
ought to disallow s2 and s1, but I think it's better to be explicit about the
expected minimum size.
We probably need a similar change for G_STORE, but it seems to be a bit more
finicky. So, let's just handle G_LOAD for now.
Differential Revision: https://reviews.llvm.org/D107013
We were handling types like s88 like this:
1) clamp to the range
2) widen to the next power of 2
This isn't desirable because it causes an odd breakdown for types like s88.
If we widen to the next power of 2 (s128) first, then we get a clean breakdown
when we clamp back to s64.
Differential Revision: https://reviews.llvm.org/D106998
Using split-file does not work with update_llc_test_checks.py. It's also
mostly redundant, as the single and double tests can just use a single
llc and FileCheck invocation for each FPU type using -check-prefixes
rather than -check-prefix, and update_llc_test_checks.py will merge what
it can. Only test_dasmconst needs to be SPE-only and so is pulled out
into its own small file (rather than using sed to preprocess the file and
keep it commented out for EFPU2, which would work, but is ugly).
As well as cutting down on the number of RUN lines, this also results in
test_fma's CHECK lines being merged for both FPUs.
Reviewed By: kiausch
Differential Revision: https://reviews.llvm.org/D106969
The sign_extend we insert here can get turned into a zero_extend if
the sign bit is known zero. This can enable a setcc combine that
shrinks compares with zero_extend. This reduces the use count of
the zero_extend allowing other combines to turn it back into an
any_extend.
This restricts the combine to only cases where the result is used
by a CopyToReg. This works for my original motivating case. I
hope the CopyToReg use will prevent any converted extends from
turning back into an any_extend.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106754
In this episode, we are trying to avoid an x86 micro-arch quirk where complex
(3 operand) LEA potentially costs significantly more than simple LEA. So we
simultaneously push and pull the math around the CMOV to balance the operations.
I looked at the debug spew during instruction selection and decided against
trying a later DAGToDAG transform -- it seems very difficult to match if the
trailing memops are already selected and managing the creation of extra
instructions at that level is always tricky.
Differential Revision: https://reviews.llvm.org/D106918
This patch adds a peephole optimization `SETCC(FREEZE(x),const)` => `FREEZE(SETCC(x,const))`
if the SETCC is only used by BRCOND.
Combined with `BRCOND(FREEZE(X)) => BRCOND(X)`, this leads to a nice improvement in the generated assembly when x is a masked loaded value.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D105344
The dst/dstt/dstst/dststt instructions are nops on all PowerPC
cores that AIX supports. The AIX assembler also does not accept
these mnemonics. Turn them into nops on AIX (similar to dstall).
This is partially a workaround. SILowerI1Copies does not understand
unstructured loops. This would result in inserting instructions to
merge a mask register in the same block where it was defined in an
unstructured loop.
Replace the clang builtins and LLVM intrinsics for the SIMD extmul instructions
with normal codegen patterns.
Differential Revision: https://reviews.llvm.org/D106724
- This patch consists of the bare basic code needed in order to generate some assembly for the z/OS target.
- Only the .text and the .bss sections are added for now.
- The relevant MCSectionGOFF/Symbol interfaces have been added. This enables us to print out the GOFF machine code sections.
- This patch enables us to add simple lit tests wherever possible, and contribute to the testing coverage for the z/OS target.
- Further improvements and additions will be made in future patches.
Reviewed By: tmatheson
Differential Revision: https://reviews.llvm.org/D106380
Avoid several crashes when DBG_INSTR_REF and DBG_PHI instructions are fed
to the instruction scheduler. DBG_INSTR_REFs should be treated like
DBG_LABELs, and just ignored for the purpose of scheduling [0].
DBG_PHIs however behave much more like DBG_VALUEs: they refer to register
operands, and if some register defs get shuffled around during instruction
scheduling, there's a risk that the debug instr will refer to the wrong
value. There's already a facility for updating DBG_VALUEs to reflect this;
add DBG_PHI to the list of instructions that it will update.
[0] Suboptimal, but it's what instr scheduling does right now.
Differential Revision: https://reviews.llvm.org/D106663
This patch builds on top of D106575 in which scalable-vector splats were
supported in `ISD::matchBinaryPredicate`. It teaches the DAGCombiner how
to perform a variety of the pre-existing saturating add/sub combines on
scalable-vector types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106652
This patch adds support for lowering the saturating vector add/sub
intrinsics to RVV instructions, for both fixed-length and
scalable-vector forms alike.
Note that some of the DAG combines are still not triggering for the
scalable-vector tests. These require a bit more work in the DAGCombiner
itself.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106651
These will be optimized by upcoming patches. The tests are primarily not
being optimized due to the lack of support for saturating vector
arithmetic in the RISC-V backend.
On top of that, however, a large percentage of the scalable-vector tests
are also lacking support in the DAGCombiner: either in
`ISD::matchBinaryPredicate` or due to checks specifically for
`BUILD_VECTOR` and not `SPLAT_VECTOR`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106649
This implements the isLoadFromStackSlot and isStoreToStackSlot functions
for the MVE MVE_VSTRWU32 and MVE_VLDRWU32 instructions. They behave the
same as many other loads/stores, expecting a FI in Op1 and a zero offset
in Op2. At the same time this alters VLDR_P0_off and VSTR_P0_off to use
the same code too, as they should also return VPR in Op0, take a FI in
Op1 and a zero offset in Op2.
Differential Revision: https://reviews.llvm.org/D106797
Enable custom insert_subvector for larger vector types.
This is necessary now that SelectionDAG can attempt v3f64 insert
to v4f64, etc.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D105385
All floating point values in registers are in double precision
representation. In order to materialize the correct single precision
value, we need to convert the APFloat that represents the value
to double precision first.
Reviewed By: amyk, NeHuang
Differential Revision: https://reviews.llvm.org/D106812
This adds support for the case where
WideSize = DstSize + K * SrcSize
In this case, we can pad the G_MERGE_VALUES instruction with K extra undef
values with width SrcSize. Then the destination can be handled via
widenScalarDst.
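A hedged arithmetic sketch with illustrative sizes:
  #include <cassert>
  int main() {
    // e.g. an s64 destination merged from s32 sources, widened to s128:
    int WideSize = 128, DstSize = 64, SrcSize = 32;
    int K = (WideSize - DstSize) / SrcSize; // K = 2 extra undef operands
    assert(WideSize == DstSize + K * SrcSize);
    return 0;
  }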
Differential Revision: https://reviews.llvm.org/D106814
Previously, MASSV only supported P8 and P9, on AIX and Linux. This patch
adds MASSV support for P7 and P10, on AIX only.
Differential Revision: https://reviews.llvm.org/D106678
Use it in the AArch64 post-legal combiner. These don't always get folded because when
the instructions are created the constants are obscured by artifacts.
Differential Revision: https://reviews.llvm.org/D106776
When Emscripten EH mixes with Emscripten SjLj, we are not currently
handling some of them correctly. There are three cases:
1. The current function calls `setjmp` and there is an `invoke` to a
function that can either throw or longjmp. In this case, we have to
check both for exception and longjmp. We are currently handling this
case correctly:
0c0eb76782/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L1058-L1090)
When inserting routines for functions that can longjmp, which we do
only for setjmp-calling functions, we check if the function was
previously an `invoke` and handle it correctly.
2. The current function does NOT call `setjmp` and there is an `invoke`
to a function that can either throw or longjmp. Because there is no
`setjmp` call, we haven't been doing any check for functions that can
longjmp. But in that case, for `invoke`, we only check for an
exception and if it is not an exception we reset `__THREW__` to 0,
which can silently swallow the longjmp:
0c0eb76782/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L70-L80)
This CL fixes this.
3. The current function calls `setjmp` and there is no `invoke`. Because
it is not an `invoke`, we haven't been doing any check for functions
that can throw, and only insert longjmp-checking routines for
functions that can longjmp. But in that case, if a longjmpable
function throws, we only check for a longjmp so if it is not a
longjmp we reset `__THREW__` to 0, which can silently swallow the
exception:
0c0eb76782/llvm/lib/Target/WebAssembly/WebAssemblyLowerEmscriptenEHSjLj.cpp (L156-L169)
This CL fixes this.
To do that, this moves around some code, so we register necessary
functions for both EH and SjLj and precompute some data (the set of
functions that contains `setjmp`) before doing actual EH or SjLj
transformation.
This CL makes 2nd and 3rd tests in
https://github.com/emscripten-core/emscripten/pull/14732 work.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D106525
This is a followup patch for D105930 to add implicit-def of RM for
mtfsb[01] instructions as per review comments.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D106603
The legalizer generates selects for some operations, which can have constant
condition values, resulting in lots of dead code if it's not folded away.
Differential Revision: https://reviews.llvm.org/D106762
During tail duplication, SSA values may be updated and have their uses
replaced with a virtual register, and any debug instructions that use
that value are deleted. This patch fixes the implementation of the debug
instruction deletion to work correctly for debug instructions that use
the SSA value multiple times, by batching deletions so that we don't
attempt to delete the same instruction twice.
Differential Revision: https://reviews.llvm.org/D106557
For the reg+imm SVE addressing mode, imm is implicitly scaled by VL,
making it impractical for truly immediate offsets. However, if
the offset can be unscaled based on the storage element type, we
can use the reg+reg SVE addressing mode and thus either reduce the
number of generated add instructions or replace them with a mov
instruction that can be hoisted from the hot code path.
Differential Revision: https://reviews.llvm.org/D106744
This patch adds support for the next-generation arch14
CPU architecture to the SystemZ backend.
This includes:
- Basic support for the new processor and its features.
- Detection of arch14 as host processor.
- Assembler/disassembler support for new instructions.
- New LLVM intrinsics for certain new instructions.
- Support for low-level builtins mapped to new LLVM intrinsics.
- New high-level intrinsics in vecintrin.h.
- Indicate support by defining __VEC__ == 10304.
Note: No currently available Z system supports the arch14
architecture. Once new systems become available, the
official system name will be added as a supported -march name.
Codegen for the raw/struct buffer access intrinsics would update the
offset in the MMO to reflect the combined offset, if it was known to be
constant. If the combined offset was not known to be constant, or if
there was an index, it would set the offset in the MMO to 0. This is
unsafe because it makes it look like the access does not alias with
another access with a fixed non-zero offset.
Fix these cases by setting the pointer in the MMO to null, to reflect
the fact that we do not have any known IR value pointer + constant
offset for the access.
D106284 did this for SelectionDAG. This is the corresponding fix for
GlobalISel.
Differential Revision: https://reviews.llvm.org/D106451
Codegen for the raw/struct buffer access intrinsics would update the
offset in the MMO to reflect the combined offset, if it was known to be
constant. If the combined offset was not known to be constant, or if
there was an index, it would set the offset in the MMO to 0. This is
unsafe because it makes it look like the access does not alias with
another access with a fixed non-zero offset.
Fix these cases by setting the pointer in the MMO to null, to reflect
the fact that we do not have any known IR value pointer + constant
offset for the access.
Differential Revision: https://reviews.llvm.org/D106284
The register class required for some MVE loads/stores is more
constrained than the register we use when creating postinc. Make sure we
constrain the register class to keep the code correct.
This patch implements vector_splice in tablegen for all cases when the
immediate is positive and lower than the known minimum number of elements
of a scalable vector.
Vector_splice can be implemented using SVE instruction EXT.
For instance :
@llvm.experimental.vector.splice(Vector_1, Vector_2, Imm)
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E>
EXT Vector_1, Vector_2, Imm // Vector_1 = B, C, D + Vector_2 = E
Depends on D105633
Differential Revision: https://reviews.llvm.org/D106273
This patch implements vector_splice in tablegen for:
a) when the immediate is equal to -1 (Imm==-1) and uses:
INSR + LASTB
For instance :
@llvm.experimental.vector.splice(Vector_1, Vector_2, -1)
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, -1) ==> <D, E, F, G>
LASTB RegLast, Vector_1 // RegLast = D
INSR Res, (Vector_2 >> 1), RegLast // Res = D + E, F, G
Differential Revision: https://reviews.llvm.org/D105633
This patch extends support for (scalable-vector) splats in the
DAGCombiner via the `ISD::matchBinaryPredicate` function, which enables a
variety of simple combines of constants.
Users of this function may now have to distinguish between
`BUILD_VECTOR` and `SPLAT_VECTOR` vector operands. The way of dealing
with this in-tree follows the approach added for
`ISD::matchUnaryPredicate` implemented in D94501.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106575
If the rotation amount is a known splat, perform the modulo on the splat source, and then perform the splat. That way the amount-extension performed later by LowerScalarVariableShift can fold the splats away without any multiple-use issues.
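A hedged scalar illustration of why the early modulo is safe (not the patch
code):
  #include <cassert>
  #include <cstdint>
  static uint32_t rotl32(uint32_t X, uint32_t Amt) {
    Amt &= 31; // rotation amounts are modulo the bit width
    return Amt == 0 ? X : (X << Amt) | (X >> (32 - Amt));
  }
  int main() {
    // Reducing the amount on the scalar source before "splatting" it
    // across lanes produces the same rotation in every lane.
    assert(rotl32(0x12345678, 37) == rotl32(0x12345678, 37 % 32));
    return 0;
  }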
Fixes one of the concerns raised on D104156
This is a minimal extension of D106607 to allow folding for
2 non-zero constants that can be materialized as immediates.
In the reduced test examples, we save 1 instruction by rolling
the constants into LEA/ADD. In the motivating test from the bullet
benchmark, we absorb both of the constant moves into add ops via
LEA magic, so we reduce by 2 instructions.
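A hedged sketch of the identity being exploited (made-up constants):
  #include <cassert>
  int main() {
    const int C1 = 42, C2 = 7;
    int Z = 100;
    for (int Cond = 0; Cond <= 1; ++Cond)
      // Pushing the add into both CMOV arms turns two constant
      // materializations into immediates folded into LEA/ADD.
      assert(Z + (Cond ? C1 : C2) == (Cond ? Z + C1 : Z + C2));
    return 0;
  }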
Differential Revision: https://reviews.llvm.org/D106684
As noticed on D105390 - we were hardwiring the depth limit for combining to VPERMI2W/VPERMI2B instructions. Not only had we made the limit too low, we hadn't accounted for slow/fast shuffles via the VariableCrossLaneShuffleDepth control.
For types like s96, we don't want to clamp to s64, we want to first widen to
s128 and then narrow it. Otherwise we end up with impossible to legalize types.
I stumbled onto a case where our (sext_inreg (assertzexti32 (fptoui X)), i32)
isel pattern can cause an fcvt.wu and fcvt.lu to be emitted if
the assertzexti32 has an additional user. If we added a one-use check,
it would just cause a fcvt.lu followed by a sext.w when we only need
a fcvt.wu to satisfy both users.
To mitigate this I've added custom isel and new ISD opcodes for
fcvt.wu. This allows us to keep track of the fact that it started life
as a conversion to i32 without needing to match multiple nodes.
ComputeNumSignBits has been taught that these new nodes produce 33 sign
bits. To
prevent regressions when we need to zero extend the result of an
(i32 (fptoui X)), I've added a DAG combine to convert it to an
(i64 (fptoui X)) before type legalization. In most cases this would
happen in InstCombine, but a zero_extend can be created for function
returns or arguments.
To keep everything consistent I've added new nodes for fptosi as well.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106346
Many of the tests have used NEXT when DAG is more appropriate. In
some cases single DAG lines have been used. Note that these are
manual tests because they're too complex for update_llc_test_checks.py
and so it's worth not relying too much on the ordered output.
I've also made the CHECK lines more uniform when it comes to the
ordering of things like LO/HI.
If value tracking can confirm that the cttz/ctlz source is known non-zero then we don't need to create a branch (which DAG will struggle to recover from).
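A hedged illustration using compiler builtins (not the in-tree code):
  #include <cstdint>
  // __builtin_ctz is undefined at 0, so a generic cttz needs a guard branch:
  unsigned cttz(uint32_t X) { return X ? __builtin_ctz(X) : 32; }
  // If the operand is provably non-zero (here the top bit is forced on),
  // value tracking lets the zero-guard branch be dropped entirely:
  unsigned cttzNonZero(uint32_t X) { return __builtin_ctz(X | 0x80000000u); }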
Differential Revision: https://reviews.llvm.org/D106685
Most other registers are allocatable and therefore cannot be used.
This issue was flagged by the machine verifier, because reading other
registers is considered reading from an undefined register.
Differential Revision: https://reviews.llvm.org/D96969
This patch fixes some issues with the RORB pseudo instruction.
- A minor issue in which the instructions were said to use the SREG,
which is not true.
- An issue with the BLD instruction, which did not have an output operand.
- A major issue in which invalid instructions were generated. The fix
also reduces RORB from 4 to 3 instructions, so it's also a small
optimization.
These issues were flagged by the machine verifier.
Differential Revision: https://reviews.llvm.org/D96957
This patch makes sure shift instructions such as this one:
%result = shl i32 %n, %amount
are expanded to a loop just before the IR to SelectionDAG conversion so
that calls to non-existing library functions such as __ashlsi3 are
avoided. The generated code is currently pretty bad but there's a lot of
room for improvement: the shift itself can be done in just four
instructions.
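A hedged C++ sketch of the loop form the expansion produces (illustrative
only):
  #include <cstdint>
  // Equivalent of "%result = shl i32 %n, %amount" as a loop, avoiding a
  // call to a non-existent library helper such as __ashlsi3.
  uint32_t shlByLoop(uint32_t N, uint32_t Amount) {
    while (Amount--)
      N <<= 1;
    return N;
  }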
Differential Revision: https://reviews.llvm.org/D96677
Previously, AVRTargetLowering::LowerCall attempted to keep stack stores
in order with chains. Perhaps this worked in the past, but it does not
work now: it appears that the SelectionDAG legalization phase removes
these chains. Therefore, I've removed these chains entirely to match
X86 (which, similar to AVR, also prefers to use push instructions over
stack-relative stores to set up a call frame). With this change, all the
stack stores are in a somewhat reasonable order.
Differential Revision: https://reviews.llvm.org/D97853
This patch introduces a pass that uses the Attributor to deduce AMDGPU specific attributes.
Reviewed By: jdoerfert, arsenm
Differential Revision: https://reviews.llvm.org/D104997
Replace the clang builtins and LLVM intrinsics for {f32x4,f64x2}.{pmin,pmax}
with standard codegen patterns. Since wasm_simd128.h uses an integer vector as
the standard single vector type, the IR for the pmin and pmax intrinsic
functions contains bitcasts that would not be there otherwise. Add extra codegen
patterns that can still select the pmin and pmax instructions in the presence of
these bitcasts.
Differential Revision: https://reviews.llvm.org/D106612
This patch adds a reduced test case which identifies an illegal vsetvli
inserted by the compiler. The compiler emits a vsetvli which is intended
to preserve VL with the SEW/LMUL ratio e32/m1 when in fact the VL could
have been set by e64/m2 in a predecessor block.
Differential Revision: https://reviews.llvm.org/D106286
Since we're changing VTYPE, we may change VLMAX which could
invalidate the previous VL. If we can't tell if it is safe we
should use an AVL of 1 instead of keeping the old VL.
This is a quick fix. We may want to thread VL to the pseudo
instruction instead of making up a value. That will require ISD
opcode changes and changes to the C intrinsic interface.
This fixes the issue raised in D106286.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D106403
This code tries to form a TEST from CMP+AND with an optional
truncate in between. If we looked through the truncate, we may
have extra bits in the AND mask that shouldn't participate in
the checks. Normally SimplifyDemandedBits takes care of this, but
the AND may have another user. So manually mask out any extra bits.
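A hedged arithmetic illustration (made-up mask values):
  #include <cassert>
  #include <cstdint>
  int main() {
    uint32_t X = 0x100;  // only a bit above the 8-bit truncation is set
    uint32_t M = 0x1FF;  // AND mask has an extra bit past the truncation
    // (uint8_t)(X & M) is 0, but a TEST against the full mask sees 0x100,
    // so the extra mask bits must be cleared when looking through the trunc.
    assert(uint8_t(X & M) == 0);
    assert((X & M) != 0);
    assert((X & (M & 0xFF)) == 0); // correct: mask reduced to 8 bits
    return 0;
  }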
Fixes PR51175.
Differential Revision: https://reviews.llvm.org/D106634
This is not the transform direction we want in general,
but by the time we have a CMOV, we've already tried
everything else that could be better.
The transform increases the uses of the other add operand,
but that is safe according to Alive2:
https://alive2.llvm.org/ce/z/Yn6p-A
We could probably extend this to other binops (not just add).
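A hedged scalar sketch of the transform direction (illustrative values):
  #include <cassert>
  int main() {
    int X = 3, Y = -5, Z = 100;
    for (int Cond = 0; Cond <= 1; ++Cond)
      // add (cmov X, Y), Z -> cmov (add X, Z), (add Y, Z); the other add
      // operand Z gains an extra use, which is safe here.
      assert(((Cond ? X : Y) + Z) == (Cond ? X + Z : Y + Z));
    return 0;
  }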
This is the motivating pattern discussed in:
https://llvm.org/PR51069
The test with i8 shows a missed fold because there's a trunc
sitting in front of the add. That can be handled with a small
follow-up.
Differential Revision: https://reviews.llvm.org/D106607
This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE. It also includes a DAG combine to
fold extends followed by truncating stores into non-truncating
stores in order to prevent this pattern appearing once truncating
stores are supported.
Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.
Differential Revision: https://reviews.llvm.org/D104471
Add maximum NSA size limit as an ISA feature.
Use this to reduce NSA usage on GFX10.1 to avoid stability issues
with 4 and 5 dword NSA instructions.
Maintain use of longer NSA instructions on GFX10.3.
Note: this also contains some minor fixes for GlobalISel which
did not work correctly with non-NSA form instructions on GFX10.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D103348
These tests show missed opportunities in the SelectionDAG layer when
dealing with scalable-vector splats. All of these are handled for the
equivalent `ISD::BUILD_VECTOR` code, and the tests have largely been
translated from the equivalent X86 tests.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106574