This just commons up and simplifies some logic that was repeated in
SIInsertWaitcnts::updateEventWaitcntAfter. NFCI.
Differential Revision: https://reviews.llvm.org/D136253
The amdgcn.ldexp.* intrinsics take an i32 value as src1.
The V_LDEXP_F16 instruction considers src1 an f16 operand, and therefore
src1 is implicitly truncated to 16 bits when lowering to that instruction from the
intrinsic. This is unlikely to result in an error in practice
because values that large are not useful.
The operand class of src1 in the True16 version of the instruction has
been corrected to encode correctly on GFX11.
Reviewed By: foad, rampitec
Differential Revision: https://reviews.llvm.org/D136195
getDefIgnoringCopies and getSrcRegIgnoringCopies should not fail on
valid MIR, so don't bother to check for failure.
Differential Revision: https://reviews.llvm.org/D136238
Correct v_cndmask_b32 to support abs/neg modifiers in dpp/sdwa/e64 variants.
Correct v_cndmask_b16 for proper disassembly of abs/neg modifiers in e64_dpp variants.
Differential Revision: https://reviews.llvm.org/D135900
If inline asm has a VGPR def, it must have come from a VGPR write
somewhere inside the asm. This should be further extended to all
read after write hazards.
Unfortunately, we have a broken handling of this in the runtime of rocm
5.3. The runtime is expected to handle this correctly when v5 becomes
the default.
Differential Revision: https://reviews.llvm.org/D134714
Make support more generic to support future instructions.
Currently NFC.
Reviewed By: foad, arsenm
Differential Revision: https://reviews.llvm.org/D135678
The return type is two u8 packed into a 16 bit VGPR, so this instruction
should be True16.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D135478
Typically when you match something, you want to check the result.
Fix a couple warnings in the AMDGPUPostLegalizerCombiner which appear as a
result of this.
Differential Revision: https://reviews.llvm.org/D135491
Apparently StackColoring depends on SlotIndexes, but not
LiveIntervals. If regalloc fast were manually requested, LiveIntervals
would be dropped before SILowerSGPRSpills but not SlotIndexes.
SILowerSGPRSpills preserved SlotIndexes, but only through
LiveIntervals. As a result, SILowerSGPRSpills was incorrectly
reporting it preserved SlotIndexes. Start updating these directly,
instead of depending on LiveIntervals also being available.
Previously we would be unable to legalize V2S16 BUILD_VECTOR_TRUNC on GFX8 & below as the custom legalization was missing.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D135149
For V_CMP_CLASS_F16_t16_e64 and V_CMPX_CLASS_F16_t16_e64,
https://reviews.llvm.org/D133723 changed the value type of src1 from i32 to i16.
These src1 operands are 16 bits, therefore need to be encoded as true16
operands. So the _e32 type was correctly set to VGPR_32_Lo128.
In _e64 form the operand class went from
VSrc_b32 to VSrc_b16. For some reason, we cannot encode inline literals for
VSrc_b16, see 5f5f566b26. In this phase of
the true16 implementation, VSrc_b16 and VSrc_b32 are still similar,
except from that quirk of inlines. So set the operand class to regain
that function.
Reviewed By: dp, arsenm
Differential Revision: https://reviews.llvm.org/D134897
If we have a constant aggregate, e.g., as an initializer, we usually
failed to extract the proper value/type from it. This patch provides the
size and offset information necessary to extract the right part of the
constant.
The bitmask used to extract the bits assumed 16 bit elements and wasn't taking the size of the elements into account.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D135156
If we can not prove that f16 operands of a buildvector are canonicalized, then we can not lower into a V_PACK. In this scenario, we would previously lower into some combination of and(sdwa), shr, or. This patch allows for matching into V_PERM instead.
Change-Id: Ifa4a74fdb81ef44f22ba490c7fdf81ec8aebc945
Before, the isPreLegalize() query in CombinerHelper only checked for the
presence of a LegalizerInfo object. This is problematic when we want to have
a combine actually check for legality in a pre-legalizer combine pass, since
if we pass a LegalizerInfo object to the constructor it causes the combines to
think that we're running *post* legalizer, which isn't true.
This change fixes it to instead check an explicit bool that passes to signal
whether the pass will be run before or after legalization.
Doing so exposed a bug in the extending loads combine, which tried to check for
legality of candidate extending loads if LegalizerInfo was present. Since we
only ran it pre-legalizer and therefore with a null LegalizerInfo, it never
actually ran. Also fixes the legality checks to keep the tests passing.
Differential Revision: https://reviews.llvm.org/D135044
VALU use of an SGPR (pair) as mask followed by SALU write to the
same SGPR can cause incorrect execution of subsequent SALU reads
of the SGPR.
Reviewed By: foad, rampitec
Differential Revision: https://reviews.llvm.org/D134151
Recognize more opcodes in the function.
Fixes some regressions introduced in D134857 for fdiv.f16 too.
Depends on D134857
Reviewed By: arsenm, foad
Differential Revision: https://reviews.llvm.org/D134862
Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal.
Also removes RegBankInfo's scalarization of small BUILD_VECTORs,
replacing it with InstructionSelector logic instead.
This allows for V2S16 BUILD_VECTOR instructions to survive
all the way to ISel so we can select FMA/MAD_MIX instructions
in D134354.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134433
This change sets
-amdgpu-assume-{external-call-stack-size | dynamic-stack-object-size}
options to zero by default for code object v5 and later. The runtime is
expected to adjust the scratch size if the amdhsa_uses_dynamic_stack bit
in the kernel descriptor is set.
Differential Revision: https://reviews.llvm.org/D128346
One of the conditions to flush the vmcnt counter in loop preheaders is: The loop
contains a use of a vgpr that is defined out of the loop. The code currently
checks if a waitcnt is needed by looking at the score of that vgpr in the score
brackets. This is not enough and may cause the generation of an unnecessary
vmcnt flush. This patch fixes that case.
Differential Revision: https://reviews.llvm.org/D130313
The association between kernel and struct is done by symbol name.
This doesn't work robustly for anonymous kernels as shown by the modified
test case.
An alternative association between function and struct can be constructed
if necessary, probably though metadata, but on the basis that we currently
miscompile anonymous kernels and that they are difficult to construct from
application code and difficult to call from the runtime, this patch makes
it a fatal error for now.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134741
Surprisingly these were getting legalized to something
zero initialized.
This fixes an infinite loop when combining some vector types.
Also fixes zero initializing some undef values.
SimplifyDemandedVectorElts / SimplifyDemandedBits are not checking
for the legality of the output undefs they are replacing unused
operations with. This resulted in turning vectors into undefs
that were later re-legalized back into zero vectors.
A kernel may have an associated struct for laying out LDS variables.
This patch puts that instance, if present, at a deterministic address by
allocating it at the same time as the module scope instance.
This is relatively likely to be where the instance was allocated anyway (~NFC)
but will allow later patches to calculate where a given field can be found,
which means a function which is only reachable from a single kernel will be
able to access a LDS variable with zero overhead. That will be particularly
helpful for applications that instantiate a function template containing LDS
variables once per kernel.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D127052
Make MIMG NSA minimum addresses threshold an attribute that can
be set on a function or configured via command line.
This enables frontend tuning which allows increased NSA usage
where beneficial.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D134780
Summary:
With opaque pointer support, the "ptr" type is introduced and thus BitCast is not necessary in some cases.
This work takes care of this change, and recognizes the new address patterns to do appropriate optimizations.
Reviewers:
arsenm
Differential Revision:
https://reviews.llvm.org/D134596
For the pattern of IR (%if terminates with a divergent branch.),
divergence analysis will report %phi as uniform to help optimal code
generation.
```
%if
| \
| %then
| /
%endif: %phi = phi [ %uniform, %if ], [ %undef, %then ]
```
In the backend, %phi and %uniform will be assigned a scalar register.
But the %undef from %then will make the scalar register dead in %then.
This will likely cause the register being over-written in %then. To fix
the issue, we will rewrite %undef as %uniform. For details, please refer
the comment in AMDGPURewriteUndefForPHI.cpp. Currently there is no test
changes shown, but this is mandatory for later changes.
Reviewed by: sameerds
Differential Revision: https://reviews.llvm.org/D133840
This is a follow-on to https://reviews.llvm.org/D134073.
It renames a couple of fields to match their operands, as well as
introducing sub-operand names where required.
This change _only_ fixes the 'R600' half of the target, not the
'AMDGPU' half. Fixing the AMDGPU half will be a significantly more
difficult change (which I've not yet attempted.)
Differential Revision: https://reviews.llvm.org/D134078
Fix regression from clang opencl test in builtins-fp-atomics-gfx90a.cl
test_flat_add_local_f64 caused by D130579
Revert a3becb333d.
Differential Revision: https://reviews.llvm.org/D134568