Change relative branch instructions to use R_VE_SREL32 instead of
R_VE_PC_LO32 so that the ranges of relative branch instructions are
checked correctly at link time.
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D115097
The v8.1-M ARMARM uses the vrinta.f16.f16 names, as opposed to
vrinta.f16. This adds an alias for it, in the same way that we already
have for f32 and f64.
Differential Revision: https://reviews.llvm.org/D68127
This patch implements the intrinsic for ref.null.
In the process of implementing the int_wasm_ref_null_func() and
int_wasm_ref_null_extern() intrinsics, it removes the redundant
HeapType.
This also causes the textual assembler syntax for ref.null to
change. Instead of taking a `func` or `extern` argument, the
instruction mnemonic is now either ref.null_func or ref.null_extern,
with no further operand needed.
Reviewed By: tlively
Differential Revision: https://reviews.llvm.org/D114979
Add all the CompressPat patterns to map instructions between 16-bit and 32-bit using the CompressInstEmitter infra.
Although this is only used in the asm printer, also enable it in the asm parser to help debug the mapping when -enable-csky-asm-compressed-inst is on.
Differential Revision: https://reviews.llvm.org/D115026
Implement 'back-to-back' FX fusion according to the Power10 User Manual,
section '19.1.5.4 Fusion'. It is not enabled by default.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D114345
Expanding on D109750.
Since `DBG_VALUE` instructions have their final register validity determined
in `LDVImpl::handleDebugValue`, there is no apparent reason to immediately
prune unused register operands as their defs are erased. Consequently, this
renders `MachineInstr::eraseFromParentAndMarkDBGValuesForRemoval` moot,
gaining a substantial performance improvement.
The only necessary changes involve making the relevant passes consider
invalid DBG_VALUE vreg uses as valid.
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D112852
As an extension to D111976, this converts a fptosi clamped between
0 and (2^n)-1 to a fptoui.sat. This can greatly help on targets with
conversions that naturally saturate, such as Arm.
X86 disables the transform as some of the test cases increase in size.
Without native support, a fptoui.sat requires an fp clamp, so there is
little use in converting if the instruction is just going to be
expanded.
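For illustration, a rough IR-level sketch of the kind of pattern involved
(function names are hypothetical):

  declare i32 @llvm.smax.i32(i32, i32)
  declare i32 @llvm.smin.i32(i32, i32)
  declare i8 @llvm.fptoui.sat.i8.f32(float)

  ; fptosi clamped to [0, (2^8)-1], then narrowed to i8.
  define i8 @clamped_fptosi(float %x) {
    %conv = fptosi float %x to i32
    %lo = call i32 @llvm.smax.i32(i32 %conv, i32 0)
    %hi = call i32 @llvm.smin.i32(i32 %lo, i32 255)
    %res = trunc i32 %hi to i8
    ret i8 %res
  }

  ; Equivalent saturating form; it additionally gives defined results for
  ; NaN and out-of-range inputs, which the fptosi form leaves as poison.
  define i8 @saturated_fptoui(float %x) {
    %res = call i8 @llvm.fptoui.sat.i8.f32(float %x)
    ret i8 %res
  }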
Differential Revision: https://reviews.llvm.org/D112428
Code using indirect calls is broken without this, and there isn't
really much value in supporting the old attempt to vary the argument
placement based on uses. This resulted in more argument shuffling code
anyway.
Also make the option stop implying that all inputs need to be passed. This
now relies on the amdgpu-no-* attributes to avoid passing
unnecessary values.
Previously we would require adding an attribute to kernels to enable
the inputs passed in the kernarg segment, accessed by
llvm.amdgcn.implicitarg.ptr. This violates the principle of being
correct by default. Some OpenMP testcases were broken recently since
it wasn't correctly setting this attribute, and no known frontends are
setting this to anything other than the maximum.
Most of the test changes are from load widening of argument loads
since there are now more implied dereferenceable bytes.
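For illustration, a minimal IR sketch of the new default behaviour
(function names are hypothetical; the example assumes opaque pointers):

  declare ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()

  ; With this change the implicit kernarg inputs are available by default,
  ; so a function like this needs no enabling attribute on its callers.
  define i32 @reads_implicit_args() {
    %p = call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
    %v = load i32, ptr addrspace(4) %p, align 4
    ret i32 %v
  }

  ; Functions known not to need the pointer opt out via the attribute.
  define void @no_implicit_args() #0 {
    ret void
  }

  attributes #0 = { "amdgpu-no-implicitarg-ptr" }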
The ray_origin, ray_dir and ray_inv_dir arguments should all be vec3 to
match how the hardware instruction works.
Don't change the API of the corresponding OpenCL builtins.
Differential Revision: https://reviews.llvm.org/D115032
The two-address pass runs right before RA, and if an immediate
was folded into an instruction there is nothing left to remove
the dead def. We end up with something like:
  v_mov_b32_e32 v14, 0xc1700000
  v_mov_b32_e32 v14, 0x41200000
  v_fmaak_f32 v51, s67, v19, 0xc1700000
  v_fmaak_f32 v38, v51, v19, 0x41200000
The patch kills the dead move instruction right in the folding.
Differential Revision: https://reviews.llvm.org/D114999
This adjusts all the MVE and CDE intrinsics now that v2i1 is a legal
type, to use a <2 x i1> as opposed to emulating the predicate with a
<4 x i1>. The v4i1 workarounds have been removed leaving the natural
v2i1 types, notably in vctp64 which now generates a v2i1 type.
AutoUpgrade code has been added to upgrade old IR, which needs to
convert the old v4i1 to a v2i1 by converting it back and forth to an
integer with the arm.mve.v2i and arm.mve.i2v intrinsics. These should be
optimized away in the final assembly.
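As a sketch of the round trip the upgrade inserts (the full
llvm.arm.mve.pred.* spellings and overload suffixes here are assumptions
based on the names above):

  declare i32 @llvm.arm.mve.pred.v2i.v4i1(<4 x i1>)
  declare <2 x i1> @llvm.arm.mve.pred.i2v.v2i1(i32)

  ; Convert an old-style <4 x i1> predicate to the new <2 x i1> form by
  ; going through the underlying VPR.P0 integer value.
  define <2 x i1> @upgrade_predicate(<4 x i1> %old) {
    %bits = call i32 @llvm.arm.mve.pred.v2i.v4i1(<4 x i1> %old)
    %new = call <2 x i1> @llvm.arm.mve.pred.i2v.v2i1(i32 %bits)
    ret <2 x i1> %new
  }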
Differential Revision: https://reviews.llvm.org/D114455
The Power ISA defined l[bhwdq]arx as both base and
extended mnemonics. The base mnemonic takes the EH
bit as an operand and the extended mnemonic omits
it, making it implicitly zero. The existing
implementation only handles the base mnemonic when
EH is 1 and internally produces a different
instruction. There are historical reasons for this.
This patch simply removes the limitation introduced
by this implementation that disallows the base
mnemonic with EH = 0 in the ASM parser.
This resolves an issue that prevented some files
in the Linux kernel from being built with
-fintegrated-as.
Also fix a crash if the value is not an integer immediate.
MVE can treat v16i1, v8i1, v4i1 and v2i1 as different views onto the
same 16-bit VPR.P0 register, with v2i1 holding two 8-bit values for the
two halves. This was never treated as a legal type in LLVM in the past,
as there are not many 64-bit instructions and no 64-bit compares. There
are a few instructions that could use it though, notably a VSELECT (as
it can handle any size using the underlying v16i8 VPSEL), AND/OR/XOR for
similar reasons, some gathers/scatters, long multiplies and VCTP64
instructions.
This patch goes through and makes v2i1 a legal type, handling all the
cases that fall out of that. It also makes VSELECT legal for v2i64 as a
side benefit. A lot of the codegen changes as a result - usually in a way
that is a little better or a little worse, but still expensive. Costs
can change a little too in the process, again in a way that keeps
expensive things expensive. A lot of the tests that changed are mainly to
ensure correctness - the code can hopefully be improved in the future
as it comes up in practice.
The intrinsics currently keep using the v4i1 they previously used to
emulate a v2i1. This will be changed in a followup patch, but this one
was already large enough.
Differential Revision: https://reviews.llvm.org/D114449
Add clamp combine. Source is fminnum(fmaxnum(Val, 0.0), 1.0) or
fmaxnum(fminnum(Val, 1.0), 0.0) or fmed3 intrinsic with 0.0 and
1.0 as two out of three operands.
Differential Revision: https://reviews.llvm.org/D90052
Add floating point version of med3 combine.
Source is fminnum(fmaxnum(Val, K0), K1) or fmaxnum(fminnum(Val, K1), K0)
where K0 and K1 are constants and K0 <= K1.
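Roughly, in IR terms (the combine itself operates on the generic machine
instructions; the constants and function names are just for illustration):

  declare float @llvm.maxnum.f32(float, float)
  declare float @llvm.minnum.f32(float, float)
  declare float @llvm.amdgcn.fmed3.f32(float, float, float)

  ; With K0 = 2.0 and K1 = 4.0 (K0 <= K1), this clamp picks the median of
  ; {%val, K0, K1} ...
  define float @clamp_pattern(float %val) {
    %max = call float @llvm.maxnum.f32(float %val, float 2.0)
    %min = call float @llvm.minnum.f32(float %max, float 4.0)
    ret float %min
  }

  ; ... and so can be folded into a single med3.
  define float @combined(float %val) {
    %med = call float @llvm.amdgcn.fmed3.f32(float %val, float 2.0, float 4.0)
    ret float %med
  }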
Differential Revision: https://reviews.llvm.org/D90051
Recognize a constant splat padded with undef in isCanonicalized.
The fcanonicalize will be removed by RemoveFcanonicalize in the
post-legalizer combiner. We treat undef as a value that will result in
a splat in the clamp combine after regbankselect.
Differential Revision: https://reviews.llvm.org/D104408
When used as a non-leaf node, TableGen does not currently use the type
of a ComplexPattern for type inference, which also means it does not
check it doesn't conflict with the use. This differs from when used as a
leaf value, where the type is used for inference. Fixing that
discrepancy is something I intend to upstream as a subsequent review.
AArch64 currently has several ComplexPatterns that are used in contexts
where they're expected to be an iPTR. The cases that lead to type
contradictions are separated out in D108759, but there are additional
differences to the TableGen output when using my locally-patched
TableGen. None of these appear to matter, at least for passing all the
CodeGen tests, but it's safer to avoid such changes (and similar changes
were causing issues on some AMDGPU tests, leading to failures to select).
Changing these additional ComplexPatterns to use iPTR rather than i64
ensures that the TableGen output remains bit-for-bit identical (compared
to without having this patch and my TableGen patch, as well as the
intermediate state of having this patch but not my TableGen patch), and
more accurately captures the higher-level meaning of these patterns.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D109034
When used as a non-leaf node, TableGen does not currently use the type
of a ComplexPattern for type inference, which also means it does not
check it doesn't conflict with the use. This differs from when used as a
leaf value, where the type is used for inference. Fixing that
discrepancy is something I intend to upstream as a subsequent review.
AMDGPU currently has several ComplexPatterns that are used in contexts
where they're expected to be an iPTR, and where using an iPTR instead of
a fixed-width integer type matters. With my locally-patched TableGen,
none of these mismatches result in type contradictions, but do change
the patterns and cause various failures to select. These changes to the
ComplexPatterns' types reflect how they are actually used, result in
bit-for-bit identical TableGen output (without my local TableGen patch),
and ensure that with improved type inference AMDGPU's backend will
continue to work.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109032
When used as a non-leaf node, TableGen does not currently use the type
of a ComplexPattern for type inference, which also means it does not
check it doesn't conflict with the use. This differs from when used as a
leaf value, where the type is used for inference. Fixing that
discrepancy is something I intend to upstream as a subsequent review,
but these are all the type conflicts found (all legitimate) by my
locally-patched TableGen.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D108759
Changed registers to R10 and R11 because PLT resolution clobbers them. Also changed the implementation to use R11 instead of RCX, which saves a push/pop.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D115002
1. Fixed an inconsistency in vector instruction cost estimation - removed the
different logic for "not simple types", since it biases costs for these types.
2. Fixed the legalization penalty for vectors too big for the target: changed
from overwriting the default legalization cost estimate to adding a penalty.
3. Fixed a few typos in tests.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114893
Do not infer amdgpu-no-implicitarg-ptr for sanitized functions. If a
function is explicitly marked amdgpu-no-implicitarg-ptr and
sanitize_address, infer that it is required.
The load/store infrastructure previously made the incorrect assumption that
whenever it is used with a load/store intrinsic on Power10, those intrinsics
would automatically be the lxvp/stxvp intrinsics introduced in Power10.
However, this is obviously not the case, as there are multiple instances of
pre-P10 intrinsics that use the refactored load/store implementation.
This patch corrects this assumption and produces the expected intrinsic on pre-P10.
Differential Revision: https://reviews.llvm.org/D114978
Some instructions with i8 immediate ranges can only hold negative values
(like t2LDRHi8), can only hold positive values (like t2STRT), or can hold
either depending on the U bit (like the pre/post-inc instructions, e.g.
t2LDRH_POST). This patch splits AddrModeT2_i8 into AddrModeT2_i8,
AddrModeT2_i8pos and AddrModeT2_i8neg to make this clear.
This allows us to get the offset ranges of t2LDRHi8 correct in the
load/store optimizer, fixing issues where we could end up creating
instructions with positive offsets (which may then be encoded as ldrht).
Differential Revision: https://reviews.llvm.org/D114638
BVH instructions should be handled separately from VMEM and VMEM-with-sampler
instructions for waitcnt handling.
Differential Revision: https://reviews.llvm.org/D114794
The ranges in isLegalAddressImm were off by one, not allowing the
maximum values for unscaled offsets.
Differential Revision: https://reviews.llvm.org/D114636
There is custom expansion code for packed VFMK Pseudos in the VE
backend. This code erased the Pseudo without telling
ExpandPostRAPseudos about it, causing the generic expansion function to
access the erased Pseudo. This bug triggered in the
test/CodeGen/VE/VELIntrinsics/vfmk.ll test with asan-enabled builds.
Detected by:
sanitizer-x86_64-linux-fast
(https://lab.llvm.org/buildbot/#/builders/5/builds/15393)
Given a min(max(fptosi, INT_MIN), INT_MAX) with the correct constants,
we can now generate a fptosi.sat. But in the Arm backend, the constant
can be treated as high cost, pulling it out of the basic block so that
the DAG combine can no longer see it. This teaches the backend again
that it is a low cost constant, not worth hoisting out.
Recommitted from 0e98659ea1 with a fix for APInt comparison.
Differential Revision: https://reviews.llvm.org/D114380
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retrieved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.
This type of change needs to be evaluated experimentally. Preliminary
results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
We have reasonably fast sqrt and accurate rsqrt for the half type due to
its limited fraction bits. So we need neither multi-step refinement for
rsqrt nor to replace sqrt with rsqrt.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D114844
Along with vector RC flags, this scalar flag will
make various regclass queries like `isVGPR` more
accurate.
Regclasses other than vectors are currently set
with the new flag even though certain unallocatable
classes aren't truly scalars. That is ok as long
as they remain unallocatable.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D110053
Detected on targets older than gfx10 (e.g. gfx9) for constants that are
too large to be inlined (constants are sgpr by default).
In the med3 combine it is expected that regbankselect maps all operands of
the min/max we try to match to vgpr. However, constants are mapped to sgpr
and there will be a sgpr-to-vgpr copy. Matchers look through sgpr-to-vgpr
copies and return the sgpr, and these break the constant bus restriction.
Build the med3 with all vgpr operands. Use the existing sgpr-to-vgpr copies
for matched sgprs. If there is no such copy (not expected), build one.
Differential Revision: https://reviews.llvm.org/D114700
This prevents scalarization of fixed vector operations or crashes
on scalable vectors.
We don't have direct support for these operations. To emulate
ftrunc we can convert to a same-sized integer and back to fp using
round to zero. We don't need to do the convert if the value is large
enough to have no fractional bits or is a NaN.
The ceil and floor lowering would be better if we changed FRM, but
we don't model FRM correctly yet. So I've used the trunc lowering
with a conditional add or subtract of 1.0 if the truncate rounded
in the wrong direction.
There are also missed opportunities to use masked instructions.
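As a rough scalar sketch of the ftrunc emulation described above (the real
lowering operates on RVV vector values, and this simplified version does
not preserve the sign of zero):

  declare float @llvm.fabs.f32(float)

  define float @emulated_ftrunc(float %x) {
    %abs = call float @llvm.fabs.f32(float %x)
    ; Values with |x| >= 2^23 already have no fractional bits, and the
    ; unordered compare also catches NaN, so those skip the conversion.
    %skip = fcmp uge float %abs, 8388608.0
    ; Round-to-zero conversion to a same-sized integer and back.
    %int = fptosi float %x to i32
    %rtz = sitofp i32 %int to float
    %res = select i1 %skip, float %x, float %rtz
    ret float %res
  }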
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113543
First call getOperand, then erase the MachineInstr. Not the other way
round.
Expected to fix test/CodeGen/VE/VELIntrinsics/lvm.ll
Detected by asan buildbot:
sanitizer-x86_64-linux-fast
(https://lab.llvm.org/buildbot/#/builders/5/builds/15384)
combinePMULH currently only truncates vXi32/vXi64 multiplies to PMULHW/PMULHUW if the source operands are SEXT/ZEXT instructions, for a 'free' truncation.
But we can generalize this to any source operand with sufficient leading sign/zero bits that would allow PACKSS/PACKUS to be used as a 'cheap' truncation.
This helps us avoid the wider multiplies, in exchange for truncation on both source operands instead of the result.
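For example, roughly in IR terms (the combine itself runs on the DAG),
operands that are not literal sign extensions but still have enough known
sign bits, such as these ashr-by-16 values, can now be truncated with
PACKSS and multiplied with PMULHW:

  ; Both multiply operands are ashr-by-16 values, so each has at least 17
  ; known sign bits even though there is no explicit sext.
  define <8 x i16> @mulh_from_ashr(<8 x i32> %x, <8 x i32> %y) {
    %a = ashr <8 x i32> %x, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
    %b = ashr <8 x i32> %y, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
    %mul = mul <8 x i32> %a, %b
    %hi = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
    %res = trunc <8 x i32> %hi to <8 x i16>
    ret <8 x i16> %res
  }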
Differential Revision: https://reviews.llvm.org/D113371
The patch expands the existing 32-bit toc-data attribute support to 64-bit.
In both 32-bit and 64-bit mode, it is supported only for the small code model.
Differential Revision: https://reviews.llvm.org/D114654