This includes:
- New llvm.amdgcn.image.msaa.load.* intrinsics
- NSA changes, because MIMG-NSA is now limited to 3 dwords
- Split CD forms of IMAGE_SAMPLE instructions out into separate
test files since they are no longer supported in GFX11
Differential Revision: https://reviews.llvm.org/D127837
The generic legalizer framework is still used to reduce the problem
to scalar multiplication whose bit size is a multiple of 32.
Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:
- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
factors and add a 64-bit accumulator. Most partial products should
use this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
product multiply-adds can feed the accumulator into each other
directly. (The register allocator's support for that is somewhat
limited, but that only matters for 128-bit integers and larger.)
- On the other hand, on some hardware, V_MAD_U64_U32 requires the
accumulator to be stored in an even-aligned pair of GPRs. To avoid
excessive register copies, it makes sense to compute odd partial
products separately from even partial products (where a partial
product src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
halves together as a final step; see the sketch after this list.
- We can combine G_MUL+G_ADD into a single cascade of multiply-adds.
- The target can keep many carry-bits in flight simultaneously, so
combining carries using G_UADDE is preferable to G_ZEXT + G_ADD.
- Not addressed by this patch: When the factors are sign-extended,
the V_MAD_I64_I32 instruction (signed version!) can be used.
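As a concrete illustration, here is a C++ sketch of the odd/even
scheme for a 64x64 -> 128-bit multiply. It is a model of the intended
instruction sequence, not the legalizer code itself; mad_u64_u32 and
mul64x64 are hypothetical helpers, with mad_u64_u32 standing in for
V_MAD_U64_U32.

  #include <cstdint>
  #include <cstdio>

  // Model of V_MAD_U64_U32: 32x32-bit multiply plus 64-bit accumulator.
  static uint64_t mad_u64_u32(uint32_t x, uint32_t y, uint64_t acc) {
    return (uint64_t)x * y + acc;
  }

  // 64x64 -> 128-bit multiply from 32-bit halves. Partial products are
  // grouped by the parity of j0 + j1; each group forms its own mad
  // chain, and the two groups are summed as a final step.
  static void mul64x64(uint64_t a, uint64_t b, uint64_t out[2]) {
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);

    // Even partial products: a0*b0 (weight 2^0), a1*b1 (weight 2^64).
    uint64_t even_lo = mad_u64_u32(a0, b0, 0);
    uint64_t even_hi = mad_u64_u32(a1, b1, 0);

    // Odd partial products: a0*b1 and a1*b0, both with weight 2^32.
    // The second mad consumes the first one's result as accumulator.
    uint64_t odd = mad_u64_u32(a0, b1, 0);
    uint64_t odd2 = mad_u64_u32(a1, b0, odd);
    uint64_t odd_carry = odd2 < odd; // the odd sum can exceed 64 bits

    // Final step: add the odd half, shifted by 32 bits, onto the even
    // half, propagating carries into the high word.
    uint64_t lo = even_lo + (odd2 << 32);
    out[0] = lo;
    out[1] = even_hi + (odd2 >> 32) + (odd_carry << 32) +
             (lo < even_lo);
  }

  int main() {
    uint64_t r[2];
    uint64_t a = 0xDEADBEEFCAFEBABEull, b = 0x0123456789ABCDEFull;
    mul64x64(a, b, r);
    unsigned __int128 ref = (unsigned __int128)a * b;
    bool ok = r[0] == (uint64_t)ref && r[1] == (uint64_t)(ref >> 64);
    printf("%s\n", ok ? "ok" : "mismatch");
  }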
It is difficult to address these points generically:
1) Finding matching pairs of G_MUL and G_UMULH that form a wide
multiply is expensive. We could add a G_UMUL_LOHI generic instruction
and conditionally use that in the generic legalizer, but by itself
this wouldn't allow us to use the accumulation capability of
V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
post-legalization, but this is also expensive.
2) Similarly, making sense of the legalization outcome of a wide
pre-legalization G_MUL+G_ADD pair is extremely expensive.
3) How could the generic legalizer possibly deal with the
particular idiosyncrasy of "odd" vs. "even" partial products?
All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not
be burdened with such overly target-specific concerns. Hence, a
custom legalization.
Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in
general. SelectionDAG iteratively cuts wide scalars into low and
high halves until a legal size is reached. By contrast, GlobalISel
does the narrowing in a single shot, which should be better for
compile-time and for the quality of the generated code.
This patch leaves three gaps open:
1. When the factors are uniform, we should execute the multiplication on
the SALU. Register bank mapping already ensures this.
However, the resulting code sequence is not optimal because it doesn't
fully use the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32
doesn't have a carry-in.) It is very difficult to fix this after the
fact, so we should really use a different legalization sequence in
this case. Unfortunately, we don't have a divergence analysis and so
cannot make that choice.
(This only matters for 128-bit integers and larger.)
2. Avoid unnecessary multiplies when sources are known to be zero- or
sign-extended. The challenge is that the legalizer does not currently
have access to GISelKnownBits.
3. When the G_MUL is followed by a G_ADD, we should consider combining
the two instructions into a single multiply-add sequence, to utilize
the accumulator of V_MAD_U64_U32 fully. (Unless the multiply has
multiple uses and the implied duplication of the multiply is an
overall negative.) However, combining is undesirable when the factors
are uniform: in that case, it is generally better *not* to combine
the two operations, so that the multiply can be done on the SALU.
Again, we don't have a divergence analysis available and so cannot
make an informed choice.
Differential Revision: https://reviews.llvm.org/D124844
Currently the metadata is inserted by a late pass and then lowered
to an AssertZext. The metadata would be more useful if it were
inserted earlier, after inlining but before codegen.
This probably shouldn't change anything now. Simply replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.
Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk of the test
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
Handle the llvm.r600.* intrinsics, which are still in use in libclc. I
thought it would be possible to switch libclc to using
llvm.amdgcn.implicitarg.ptr already, but it turns out the implicit
arguments are currently split into pieces before and after the
explicit kernel arguments.
Summary:
Specifically, for trap handling on targets that do not support
getDoorbellID, we load the queue_ptr from the implicit kernarg and
move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.
Reviewers: arsenm, sameerds, yaxunl
Differential Revision: https://reviews.llvm.org/D120265
Add a new llvm.fptrunc.round intrinsic to precisely control
the rounding mode when converting from f32 to f16.
Differential Revision: https://reviews.llvm.org/D110579
Summary:
Add code object v5 support (default is still v4)
Generate metadata for implicit kernel args for the new ABI
Set the metadata version to be 1.2
Reviewers:
t-tye, b-sumner, arsenm, and bcahoon
Fixes:
SWDEV-307188, SWDEV-307189
Differential Revision:
https://reviews.llvm.org/D118272
As with the codegen fix in D111754, the LOD bias needs to be converted to 16
bits. Fix this in the combine.
Differential Revision: https://reviews.llvm.org/D116038
Emit an error if the return value is used on subtargets that do not
support it. Previously we were falling back to the DAG on selection
failure, where it would emit this error and then fail again.
The struct/raw forms for the buffer atomics now work as
expected. However, we're incorrectly handling the legacy form (which
we probably shouldn't handle at all). We also are not diagnosing the
use of the return value on gfx908. These will be addressed separately.
This was trying to clamp s65 down to s32, which wasn't handled so we
need to promote all the way to s128 first. Having to order the
legalization rules in just the right way is rather dissatisfying, but
I'm not sure how smart the legalizer should be in trying to interpret
the rules.
Shockingly we weren't doing this already. We should probably have this
be done earlier in the IR too, but it's still helpful to have the
lowering guarantee it so that we can modify the ABI implicit inputs
based on it.
We were trying to guess at the original IR type for image intrinsics
after legalization to figure out if they were d16, but this didn't
work. Explicitly track if this is a d16 operation or not in the
opcode, as is done for the buffer intrinsics.
The OpenCL library is using f32 image writes with a dmask of 15 for
some reason, and this was incorrectly switching them to use d16. Fixes
image failures in the OpenCL conformance test. The equivalent dmask
for loads doesn't even select in either selector.
If we know the source is a valid object, we do not need to insert a
null check. This misses a lot of opportunities from
metadata/attributes not tracked in codegen.
Use G_MERGE_VALUES and G_UNMERGE_VALUES on vector elements instead of
G_EXTRACT and G_INSERT when doing custom legalization for
G_EXTRACT_VECTOR_ELT and G_INSERT_VECTOR_ELT.
With this approach, the legalization artifact combiner gets direct
access to all vector elements.
Differential Revision: https://reviews.llvm.org/D116115
The artifact combiner is not able to access individual elements after
using LCMTy-style merge/unmerge, extract, and insert to change the
number of vector elements (pad with undef or split into sub-vector
instructions).
Instead, unmerge to individual elements and then merge the elements
into the requested types.
Change argument lowering for vectors and moreElementsVector to use
buildPadVectorWithUndefElements and buildDeleteTrailingVectorElements.
FewerElementsVector had a few helpers with differing behavior;
introduce a new helper for most of the opcodes.
The new FewerElementsVector helper is more flexible since it can
create a leftover instruction smaller than the requested type (useful
when a target wants to avoid padding with undef and use fewer
registers). If a target does not want a leftover of a different type,
it should apply moreElementsVector first.
Some of the old helpers performed moreElementsVector first to get a
split without leftover. Opcodes that used them now use
clampMaxNumElementsStrict (which does moreElementsVector first) in
LegalizerInfo to avoid test changes.
This fixes failures caused by failing to combine artifacts created
during moreElementsVector/fewerElementsVector.
Differential Revision: https://reviews.llvm.org/D114198
These actions should only be used for adjusting the register types
(and the memory type as needed to satisfy the register
type). Unaligned accesses should be split as a type of lowering.
This has the effect of improving the code in many cases since now we
produce zextloads instead of separate loads with ands. The load/store
legality rules still seem far more complicated than necessary though.
For MIMG instructions, the LOD bias operand is of type 'half' when the
A16 bit is on. 'bias' is only 16 bits but occupies 32 bits, with the
upper 16 bits containing junk.
The patch fixes both paths (SelectionDAG and GlobalISel) to encode the
LOD bias operand properly.
Differential Revision: https://reviews.llvm.org/D111754
The ray_origin, ray_dir and ray_inv_dir arguments should all be vec3 to
match how the hardware instruction works.
Don't change the API of the corresponding OpenCL builtins.
Differential Revision: https://reviews.llvm.org/D115032
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.
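In scalar terms, the straightforward sequence the expansion now uses
looks like this (a C++ model; the V_ADD_CO_U32 / V_ADDC_U32 names in
the comments are just for orientation):

  #include <cstdint>

  // 64-bit addition from 32-bit halves: add the low halves, then add
  // the high halves *with* the carry-in from the low addition.
  static void add64(uint32_t a_lo, uint32_t a_hi,
                    uint32_t b_lo, uint32_t b_hi,
                    uint32_t &r_lo, uint32_t &r_hi) {
    r_lo = a_lo + b_lo;            // V_ADD_CO_U32: produces a carry-out
    uint32_t carry = r_lo < a_lo;  // carry bit from the low add
    r_hi = a_hi + b_hi + carry;    // V_ADDC_U32: consumes the carry-in
  }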
Differential Revision: https://reviews.llvm.org/D113679
Scalarize before narrowing because the narrowing implementation does not
work on vectors. This matches what we do for regular G_MUL.
Differential Revision: https://reviews.llvm.org/D111129
We don't allow an initializer for LDS variables
and there is an early abort during instruction
selection. This patch legalizes them by ignoring
the init values. During assembly emission, proper
error reporting already exists for such instances.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109901
Add eraseInstr(s) utility functions. Before deleting an instruction,
they collect its use instructions; after deletion, they delete any use
instructions that became trivially dead.
This patch clears all dead instructions in existing legalizer MIR
tests.
Differential Revision: https://reviews.llvm.org/D109154
Rework getConstantVRegValWithLookThrough in order to make it clear if
we are matching an integer/float constant only or any constant
(default).
Add helper functions that get the DefVReg and APInt/APFloat from a
constant instr:
getIConstantVRegValWithLookThrough: integer constant, only G_CONSTANT
getFConstantVRegValWithLookThrough: float constant, only G_FCONSTANT
getAnyConstantVRegValWithLookThrough: either G_CONSTANT or G_FCONSTANT
Rename getConstantVRegVal and getConstantVRegSExtVal to getIConstantVRegVal
and getIConstantVRegSExtVal. These now only match G_CONSTANT as described
in comment.
Relevant matchers now return both DefVReg and APInt/APFloat.
Replace existing uses of getConstantVRegValWithLookThrough and
getConstantVRegVal with new helper functions. Any constant match is
only required in:
ConstantFoldBinOp: for constant argument that was bit-cast of float to int
getAArch64VectorSplat: AArch64::G_DUP operands can be any constant
amdgpu select for G_BUILD_VECTOR_TRUNC: operands can be any constant
In other places use integer only constant match.
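A sketch of the intended usage (hedged; the exact declarations live in
llvm/CodeGen/GlobalISel/Utils.h, and the returned struct pairs the
constant value with the defining vreg):

  #include "llvm/CodeGen/GlobalISel/Utils.h"

  using namespace llvm;

  static bool isPowerOfTwoIConstant(Register Reg,
                                    const MachineRegisterInfo &MRI) {
    // Matches only G_CONSTANT (looking through copies etc.);
    // G_FCONSTANT is rejected, unlike with the old
    // getConstantVRegValWithLookThrough.
    if (auto ValAndVReg = getIConstantVRegValWithLookThrough(Reg, MRI))
      return ValAndVReg->Value.isPowerOf2();
    return false;
  }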
Differential Revision: https://reviews.llvm.org/D104409
Legalizing G_MUL for non-standard types (like i33) generated an error.
Use minScalar and maxScalar instead of clampScalar. Also use a new
rule: instead of widening to the next power of 2, widen to the next
multiple of the passed argument (32 in this case), so that instead of
widening i65 to i128, we widen it to i96; see the sketch below.
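In LegalizerInfo terms the new rule shape looks roughly like this (a
sketch only, not the verbatim AMDGPU ruleset; S32 etc. are the usual
local LLT shorthands):

  getActionDefinitionsBuilder(G_MUL)
      .legalFor({S32, S16, V2S16})
      .minScalar(0, S32)                   // widen sub-32-bit scalars
      .widenScalarToNextMultipleOf(0, 32)  // i65 -> i96, not i128
      .maxScalar(0, S32);                  // then narrow in 32-bit parts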
Patch by: Mateja Marjanovic
Differential Revision: https://reviews.llvm.org/D109228
Add implementation for the legalization of the G_ROTL and G_ROTR
machine instructions. They are very similar to funnel shift
instructions; the only difference is that funnel shifts have three
operands, whereas rotate instructions have two: the register being
rotated and the shift amount. The legalization of G_ROTL/G_ROTR simply
lowers them into funnel shift instructions when those are legal.
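The underlying identity, in C++ terms, is rotl(x, n) == fshl(x, x, n):
a funnel shift whose two inputs are the same register.

  #include <cstdint>

  // Funnel shift left: concat(hi, lo) shifted left by amt (mod 32),
  // keeping the top 32 bits.
  static uint32_t fshl32(uint32_t hi, uint32_t lo, uint32_t amt) {
    amt &= 31;
    return amt ? (hi << amt) | (lo >> (32 - amt)) : hi;
  }

  // G_ROTL lowers to G_FSHL with the rotated register used twice.
  static uint32_t rotl32(uint32_t x, uint32_t amt) {
    return fshl32(x, x, amt);
  }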
Patch by: Mateja Marjanovic
Differential Revision: https://reviews.llvm.org/D105347
Legalize G_MEMCPY, G_MEMMOVE, G_MEMSET and G_MEMCPY_INLINE.
The corresponding intrinsics are replaced by a loop that uses
loads/stores in the AMDGPULowerIntrinsics pass unless their length is
a constant lower than MemIntrinsicExpandSizeThresholdOpt (default
1024). Any G_MEM* instruction that reaches the legalizer should
therefore have a constant length argument and is expanded into the
appropriate number of loads + stores.
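Conceptually, the expansion for a constant length is a full unroll
into loads + stores, as in this C++ model (chunking is fixed at 8
bytes here for simplicity; the real lowering picks access sizes from
the target's legal types and the known alignment):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  static void expandMemcpyConstLen(char *dst, const char *src,
                                   size_t len) {
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
      uint64_t chunk;
      std::memcpy(&chunk, src + i, 8);  // stands in for a G_LOAD
      std::memcpy(dst + i, &chunk, 8);  // stands in for a G_STORE
    }
    for (; i < len; ++i)  // byte tail
      dst[i] = src[i];
  }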
Differential Revision: https://reviews.llvm.org/D108357
Suffix opcodes with _gfx10.
Remove direct references to architecture-specific opcodes.
Add a BVH flag and apply it to disassembly.
Fix a number of disassembly errors on the gfx90a target caused by
previous incorrect BVH detection code.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D108117