These generic instructions are trivially selected to
V_MAD_[IU]64_[IU]32 instructions when executed on the VALU.
When at least the two factors are scalar, it is usually better to execute
some or all of the instruction on the SALU. To this end, we lower the
instruction to simpler instructions that are supported on the SALU
when applying the register bank mapping.
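As a rough, plain-C++ illustration of what is being lowered (a sketch, not the actual lowering code): the mad takes two 32-bit factors and a 64-bit addend, and a scalar-unit expansion has to rebuild that from 32-bit multiply-low/high pieces plus a carried add.
  #include <cstdint>

  // Semantics of the unsigned form: 32x32 -> 64-bit multiply plus a 64-bit addend.
  uint64_t mad_u64_u32(uint32_t a, uint32_t b, uint64_t c) {
    return (uint64_t)a * b + c;
  }

  // What a scalar-unit expansion has to compute from 32-bit pieces:
  // mul-lo, mul-hi, then a 64-bit add carried across the two halves.
  uint64_t mad_u64_u32_salu(uint32_t a, uint32_t b, uint64_t c) {
    uint64_t prod = (uint64_t)a * b;
    uint32_t lo = (uint32_t)prod, hi = (uint32_t)(prod >> 32);
    uint32_t clo = (uint32_t)c, chi = (uint32_t)(c >> 32);
    uint32_t rlo = lo + clo;
    uint32_t carry = rlo < lo ? 1u : 0u; // carry out of the low 32-bit add
    uint32_t rhi = hi + chi + carry;
    return ((uint64_t)rhi << 32) | rlo;
  }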
Differential Revision: https://reviews.llvm.org/D124843
A later change will add a 3rd user, so factoring out the common code
seems useful.
Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).
(Resubmit with missing test added.)
Differential Revision: https://reviews.llvm.org/D125324
This fixes a build failure with expensive checks after D126009.
The change added new RUN lines for GlobalISel, which uncovered a
pre-existing problem: it does not select the correct flavor of these
image instructions.
Even though single-address image instructions only use a single VGPR,
the HW accesses 4 or 5 VGPRs, which creates an alignment requirement.
Fixes: SWDEV-316648
Differential Revision: https://reviews.llvm.org/D126009
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
This change adds the constant splat versions of m_ICst() (implemented via
getBuildVectorConstantSplat()) and uses them in
matchOrShiftToFunnelShift(). The getBuildVectorConstantSplat() name is
shortened to getIConstantSplatVal() so that the *SExtVal() version has a
more compact name.
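For clarity, here is what "constant splat" means in this context, as a minimal self-contained sketch over already-extracted element values (plain C++, not the MIR-level helper):
  #include <cstdint>
  #include <optional>
  #include <vector>

  // Conceptual definition of "constant splat": every element is the same constant.
  // Returns the splat value, or nullopt if the elements differ.
  std::optional<int64_t> getConstantSplat(const std::vector<int64_t> &Elts) {
    if (Elts.empty())
      return std::nullopt;
    for (int64_t V : Elts)
      if (V != Elts.front())
        return std::nullopt;
    return Elts.front();
  }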
Differential Revision: https://reviews.llvm.org/D125516
Previously it built MIR for the results and returned a Register.
This avoids building constants for earlier elements of the vector if
later elements will fail to fold, and allows CSEMIRBuilder::buildInstr
to avoid unconditionally building a copy from the result.
Use a new helper function MachineIRBuilder::buildBuildVectorConstant
to build a G_BUILD_VECTOR of G_CONSTANTs.
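The shape of the change, as a hedged sketch in plain C++ (not the CSEMIRBuilder code itself): fold to plain values first and let the caller build a single G_BUILD_VECTOR of G_CONSTANTs only when every element folded.
  #include <cstdint>
  #include <optional>
  #include <vector>

  // Stand-in for the per-element fold; nullopt means "this element does not fold".
  std::optional<uint64_t> foldElement(uint64_t LHS, uint64_t RHS) {
    return LHS + RHS;
  }

  // Fold every element first; only the caller materializes MIR, and only if all
  // elements folded, so no constants are built for a partially foldable vector.
  std::optional<std::vector<uint64_t>> foldVector(const std::vector<uint64_t> &L,
                                                  const std::vector<uint64_t> &R) {
    std::vector<uint64_t> Out;
    for (size_t I = 0, E = L.size(); I != E; ++I) {
      std::optional<uint64_t> V = foldElement(L[I], R[I]);
      if (!V)
        return std::nullopt;
      Out.push_back(*V);
    }
    return Out;
  }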
Differential Revision: https://reviews.llvm.org/D117758
Tablegen definitions for subtarget features and C++ predicate functions to
access the features.
New subtarget processors and common latencies.
Simple changes to MIR codegen tests which pass on gfx11 because they either
have the same output as previous subtargets or operate on pseudo instructions
which are reused from previous subtargets.
Contributors:
Jay Foad <jay.foad@amd.com>
Petar Avramovic <Petar.Avramovic@amd.com>
Patch 4/N for upstreaming of AMDGPU gfx11 architecture
Depends on D124538
Reviewed By: Petar.Avramovic, foad
Differential Revision: https://reviews.llvm.org/D125261
The default output format of the update_mir_test_checks.py script has
changed since some of these tests were generated.
Also, an upcoming commit will introduce differences between GFX9 and
GFX10 in the legalization of G_MUL.
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.
Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.
Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
The most common situation where G_ASSERT_ZEXT appears for AMDGPU is a
copy from a physical register, which happens to set the actual
register class on the virtual register. After copy coalescing, the
assert's source operand had a vreg with a set class. The verifier was
strictly rejecting cases where the set class/bank weren't an exact
match. Additionally, RegBankSelect was also expecting a register bank
to be set on the register, not a class.
This is much stricter than for regular copies, so relax this behavior. This
now allows these 2 cases:
1. Source register has either class or bank, and the result does not
2. Source register has a register class, and the result is a register
with a matching bank.
This should avoid needing some kind of special handling to avoid
violating this constraint when folding copies.
This is to avoid relying on the post-isel hook.
This change also enables the saddr pattern selection for atomic
intrinsics in GlobalISel.
Differential Revision: https://reviews.llvm.org/D123583
Fix isVCC for a register that was assigned a register class during
instruction selection. This happens when the register has multiple uses.
For wave32, a uniform i1-to-vcc copy was selected like a vcc-to-vcc
copy when the uniform i1 had an assigned register class.
A uniform i1 register with an assigned register class will have the s1 LLT,
be defined by G_TRUNC, and its class will be SReg_32RegClass.
A vcc i1 register with an assigned register class will have the s1 LLT,
its class will be SReg_32RegClass for wave32 and SReg_64RegClass for
wave64, and it will not be defined by G_TRUNC.
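A hedged sketch of that distinction (the helper below is hypothetical and only mirrors the description above; include paths are assumptions for the LLVM tree of that era): since both flavors are s1 and use SReg_32RegClass on wave32, the defining G_TRUNC is what tells them apart.
  #include "llvm/CodeGen/MachineInstr.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include "llvm/CodeGen/TargetOpcodes.h"

  // Hypothetical helper, for illustration only: a uniform i1 that was assigned a
  // register class is produced by G_TRUNC; a vcc value with the same LLT is not.
  static bool isUniformI1NotVCC(llvm::Register Reg,
                                const llvm::MachineRegisterInfo &MRI) {
    if (MRI.getType(Reg) != llvm::LLT::scalar(1))
      return false;
    const llvm::MachineInstr *Def = MRI.getVRegDef(Reg);
    return Def && Def->getOpcode() == llvm::TargetOpcode::G_TRUNC;
  }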
Differential Revision: https://reviews.llvm.org/D124163
It was only handled for FLAT initially because we did not have lowering
for unaligned DS instructions. That lowering is now implemented, but the
bug was not handled.
Differential Revision: https://reviews.llvm.org/D123338
Split waterfall loops into multiple blocks so that exec mask
manipulation (s_and_saveexec) does not occur in the middle of
a block.
VGPR live range optimizer is updated to handle waterfall loops
spanning multiple blocks.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D122200
Summary:
Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.
Reviewers: arsenm, sameerds, yaxunl
Differential Revision: https://reviews.llvm.org/D120265
This change replaces the manual selection of buffer_atomic_cmpswap*
instructions in SelectionDAG and GlobalISel with a tblgen based
selection in BUFInstructions.td. This allows us to select the return and
no-return variants in tblgen.
Differential Revision: https://reviews.llvm.org/D121770
Summary:
In general, we need queue_ptr for aperture bases and trap handling,
and user SGPRs have to be set up to hold queue_ptr. In the current
implementation, user SGPRs are set up unnecessarily in some cases. If the
target has aperture registers, queue_ptr is not needed to reference aperture
bases. For trap handling, if the target supports getDoorbellID, queue_ptr is
also not necessary.
Further, code object version 5 introduces a new kernel ABI which passes
queue_ptr as an implicit kernel argument, so user SGPRs are no longer
necessary for queue_ptr. Based on the trap handling document:
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table,
llvm.debugtrap does not need queue_ptr, so we remove queue_ptr support for
llvm.debugtrap in the backend.
Reviewers: sameerds, arsenm
Fixes: SWDEV-307189
Differential Revision: https://reviews.llvm.org/D119762
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as live-ins on the function entry to
preserve their value when we have calls, so that they get saved and restored
around the calls.
But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame, and the above approach makes it difficult to track the
return address when the CFI information is emitted during frame lowering,
since it requires understanding the control flow.
This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.
Doing the above poses an issue: the return instruction now uses the undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use behind the `SI_RETURN` pseudo instruction, which doesn't take any input
operands, until `SI_RETURN` is lowered to `S_SETPC_B64_return` in
`expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm, ronlieb
Differential Revision: https://reviews.llvm.org/D114652
In convertToThreeAddress handle VOP2 mac/fmac instructions with a
literal src0 operand, since these are prime candidates for
converting to madmk/fmamk.
Previously this would only happen if src0 (or src1) was a register
defined by a move-immediate instruction, but in many cases these
operands have already been folded because SIFoldOperands runs
before TwoAddressInstructionPass.
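For reference, the scalar semantics of the forms involved (my reading of the encodings, so treat the operand roles as an assumption rather than a statement of the ISA):
  // Two-address mac: the destination also acts as the addend.
  float mac_f32(float dst, float s0, float s1) { return s0 * s1 + dst; }

  // Three-address madmk: multiply src0 by an inline 32-bit literal K, add src1.
  // With a literal already sitting in src0 of the mac, this form needs no extra mov.
  float madmk_f32(float s0, float k, float s1) { return s0 * k + s1; }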
Differential Revision: https://reviews.llvm.org/D120736
Handle V_MAC_LEGACY_F32 and V_FMAC_LEGACY_F32 in
convertToThreeAddress, to avoid the need for an extra mov
instruction in some cases.
Differential Revision: https://reviews.llvm.org/D120704
Use a subtarget feature instead of a command line argument to reduce
global state.
We want to enable flat scratch for graphics in some cases and this
doesn't work well with command line options.
Differential Revision: https://reviews.llvm.org/D119425
In some cases the newly created fneg was built after some of its uses.
Now it is built immediately after the instruction whose dst it negates.
Differential Revision: https://reviews.llvm.org/D119459
At -O0 we claim to CSE constants only. I think this should apply to
G_FCONSTANT as well as G_CONSTANT.
Differential Revision: https://reviews.llvm.org/D119344
Some globals lower to literal addresses on AMDGPU.
This may be wrong for non-integral address spaces. I'm wondering if we
should just allow regular G_ADD to use pointer types, and reserve
G_PTR_ADD for non-integral address spaces.
This will do the combine in cases that should fold but currently don't,
e.g. where we're relying on the CSEMIRBuilder's incomplete constant
folding. For instance, it doesn't handle FP operations or vectors (and we
don't have separate constant folding combines to catch them either).
We can select the _vgprcd versions of MAI instructions and have no
AGPRs, with the whole budget left for VGPRs, if (restated as code below):
1. This is a kernel;
2. It has no calls;
3. It runs on at least 2 waves, thus having no more than 256 VGPRs;
4. There is no inline asm requesting AGPRs.
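Restating the four conditions as a single predicate (every name below is hypothetical; this only mirrors the wording of the list, not the in-tree API):
  // All fields are hypothetical stand-ins for the information the list refers to.
  struct KernelInfo {
    bool IsKernel;
    bool HasCalls;
    unsigned MinWavesPerEU;   // occupancy lower bound
    bool InlineAsmUsesAGPRs;
  };

  bool canUseVGPRFormMAIWithoutAGPRs(const KernelInfo &KI) {
    return KI.IsKernel && !KI.HasCalls && KI.MinWavesPerEU >= 2 &&
           !KI.InlineAsmUsesAGPRs;
  }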
Differential Revision: https://reviews.llvm.org/D117253
WWM registers have unmodeled register liveness. For v_set_inactive_*,
clobbering the source register is dangerous because it will overwrite the
inactive lanes. When the source vgpr is dead at v_set_inactive_lane,
the inactive lanes may not really be dead. This may lead common
optimizations to do the wrong thing.
For example in a simple if-then cfg in Machine IR:
bb.if:
%src =
bb.then:
%src1 = COPY %src
%dst = V_SET_INACTIVE %src1(tied-def 0), %inactive
bb.end
... = PHI [0, %bb.then] [%src, %bb.if]
The register coalescer will think it is safe to optimize "%src1 = COPY %src"
in bb.then. And at the same time, there is no interference for the PHI in
bb.end. The source and destination values of the PHI will be assigned
the same register. The single PHI register will be overwritten by the
v_set_inactive, and then we would get the wrong value in bb.end.
With this change, we will copy the content of the source register before
setting inactive lanes after register allocation. Yes, this will sacrifice
the WWM code generation a little, but I don't have any better idea to do things
correctly.
Differential Revision: https://reviews.llvm.org/D117482
Similar to the G_*MULO change.
The code for checking if a constant is legal/pre-legalize is shared between
these, and is kind of hairy. So, factor it out into a new function:
`isConstantLegalOrBeforeLegalizer`.
To make the refactoring clean, further refactor `isLegalOrBeforeLegalizer` into
a wrapper for two functions:
- `isPreLegalize`
- `isLegal`
This is a bit easier to read in general.
https://godbolt.org/z/KW7oszP1o
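The wrapper ends up roughly like the sketch below (signatures and include paths are assumptions, and the constant check is simplified; the real helper also has to consider how vector constants are built):
  #include "llvm/CodeGen/GlobalISel/LegalizerInfo.h"
  #include "llvm/CodeGen/TargetOpcodes.h"

  using namespace llvm;

  // Free-function stand-ins for the CombinerHelper methods.
  bool isPreLegalize();                     // true while running before the legalizer
  bool isLegal(const LegalityQuery &Query); // defers to the target's LegalizerInfo

  bool isLegalOrBeforeLegalizer(const LegalityQuery &Query) {
    return isPreLegalize() || isLegal(Query);
  }

  bool isConstantLegalOrBeforeLegalizer(LLT Ty) {
    // Simplified: a scalar constant is just a G_CONSTANT of the given type.
    return isPreLegalize() || isLegal({TargetOpcode::G_CONSTANT, {Ty}});
  }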
Differential Revision: https://reviews.llvm.org/D118655
Always set uniform metadata on the pointer if it is an instruction, but
otherwise do not bother to create a trivial getelementptr instruction,
because AMDGPUInstrInfo::isUniformMMO can already detect that various
non-instruction pointers are uniform.
Most of the test case churn is from tests that used undef as a pointer,
which AMDGPUInstrInfo::isUniformMMO treats as uniform.
Differential Revision: https://reviews.llvm.org/D118909
This allows us to set the noclobber flag on (the MMO of) a load
instruction instead of on the pointer. This fixes a bug where noclobber
was being applied to all loads from the same pointer, even if some of
them were clobbered.
Differential Revision: https://reviews.llvm.org/D118775
In some cases StructurizeCFG inserts i1 xor instructions to invert
predicates. Add a quick loop to clean these up afterwards if we can get
away with modifying an existing compare instruction instead.
(StructurizeCFG is generally run late in the pipeline so instcombine
does not clean them up for us.)
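The cleanup relies on a simple identity, checked here in plain C++: xor-ing an i1 compare result with true is the same as flipping the compare's predicate, so the xor can be absorbed into an existing compare.
  #include <cassert>

  int main() {
    for (int a = -2; a <= 2; ++a)
      for (int b = -2; b <= 2; ++b)
        assert(((a < b) ^ true) == (a >= b)); // xor i1 %cmp, true == inverted predicate
    return 0;
  }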
Differential Revision: https://reviews.llvm.org/D118623
This reverts commit a6b54ddaba.
Apparently it is not safe to modify the condition even if it passes the
hasOneUse test, because StructurizeCFG might have other references to
the condition that are not manifest in the IR use-def chains.
This avoids various cases where StructurizeCFG would otherwise insert an
xor i1 instruction, and since it generally runs late in the pipeline,
instcombine does not clean up the xor-of-cmp pattern.
Differential Revision: https://reviews.llvm.org/D118478
The physical register in the asm has the wrong type for the declared
IR. It seems to work in the DAG by extracting the 4 elements that are
defined in the IR from the register, but that isn't handled here. This
doesn't seem to be a well tested path since other mismatched cases are
crashing the DAG asm handling.
Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.
Differential Revision: https://reviews.llvm.org/D117219
This change folds (or (shl x, C0), (lshr y, C1)) to funnel shift iff C0
and C1 are constants where C0 + C1 is the bit-width of the shift
instructions.
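A quick self-contained check of the identity behind the fold (assuming 32-bit values, with C0 + C1 == 32):
  #include <cassert>
  #include <cstdint>

  // fshl(x, y, C0): take the top 32 bits of (x:y) << C0, i.e. (x << C0) | (y >> (32 - C0)).
  uint32_t fshl32(uint32_t x, uint32_t y, unsigned c) {
    return (x << c) | (y >> (32 - c)); // assumes 0 < c < 32
  }

  int main() {
    uint32_t x = 0x12345678u, y = 0x9ABCDEF0u;
    unsigned C0 = 12, C1 = 20;                       // C0 + C1 == bit width
    assert(((x << C0) | (y >> C1)) == fshl32(x, y, C0));
    return 0;
  }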
Differential Revision: https://reviews.llvm.org/D116529
Emit an error if the return value is used on subtargets that do not
support it. Previously we were falling back to the DAG on selection
failure, where it would emit this error and then fail again.
The struct/raw forms for the buffer atomics now work as
expected. However, we're incorrectly handling the legacy form (which
we probably shouldn't handle at all). We also are not diagnosing the
use of the return value on gfx908. These will be addressed separately.
This was inheriting the mesa behavior, and as far as I know nobody is
using opencl kernels with amdpal. The isMesaKernel check was
irrelevant because this property needs to hold for all functions.
GlobalISelEmitter was skipping these patterns when its predicates were
checked. This patch should allow us to select d16_hi stores in
GlobalISel.
Differential Revision: https://reviews.llvm.org/D117762
Arbitrary stack pointers are accessed using MUBUF instructions with
the voffset field, which is interpreted as the swizzled address. We
want to fold into the MUBUF form to use the SP in the SGPR
offset, and previously we were special casing the interpretation of
the pointer value if the access memory operand said it was relative to
the stack pointer.
690f5b7a01 removed this check, and moved
the DAG path to special casing copies from SGPRs. This is not an
entirely sound approach, since it's still changing the interpretation
of pointer values based on the context.
Introduce a new pseudo which corresponds to the wave-to-vector address
transform. This way the memory instruction has consistent semantics
where the incoming pointer is always interpreted as a vector address,
and we're not obligated to optimize into the MUBUF offset-only
addressing mode. The DAG should probably have an equivalent pseudo.
This should fix some correctness issues, and folding this into
addressing modes will be a future optimization patch.
This was ignoring the requested result register, resulting in a
missing def when this happened in the IRTranslator. Fixes some crashes
and verifier errors at -O0.
Alternatively we could pass DstOps to the constant fold functions.
If the conversion artifact introduced by the unmerge-of-cast-of-merge
combine already existed in the function, this would introduce dead
copies which kept the old casts around; neither of these was deleted,
and legalization would then fail.
This would fail as follows:
The G_UNMERGE_VALUES of the G_SEXT of the G_BUILD_VECTOR would
introduce a G_SEXT for each of the scalars.
Some of the required G_SEXTs already existed in the function, so CSE
moves them up in the function and introduces a copy to the original
result register.
The introduced CSE copies are dead, since the original G_SEXTs were
already directly used. These copies add a use to the illegal G_SEXTs,
so they are not deleted.
The artifact combiner does not see the defs that need to be updated,
since it was hidden inside the CSE builder.
I see 2 potential fixes, and opted for the mechanically simpler one,
which is to just not insert the cast if the result operand isn't
used. Alternatively, we could not insert the cast directly into the
result register, and use replaceRegOrBuildCopy similar to the case
where there is no conversion.
I suspect this is a wider problem in the artifact combiner.
For AMDGPU, any use of the physical register EXEC prevents sinking even if
it is not a real physical register read. Add a check to see whether a
physical register use can be ignored for sinking.
Also perform the same constant and ignorable physical register checks when
considering sinking in loops.
https://reviews.llvm.org/D116053