Update intrinsics to use n x f16 and n x i16 instead
of 32-bit types. This may avoid the need for a bitcast
and is probably less confusing.
Depends on making v16f16 and v16i16 types legal.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D128951
GFX11 has a new message type MSG_DEALLOC_VGPRS which can be used to
release a shader's VGPRs. Sending this at the end of a shader (just
before the s_endpgm) can help overall system performance in cases where
the s_endpgm would have to wait for outstanding VMEM stores to complete
before releasing the VGPRs.
Differential Revision: https://reviews.llvm.org/D128442
Follow-up to D127894: the new liveness update code needs to handle
the case where the S_ANDN2 input must be extended through loops when
V_CNDMASK_B32 has been hoisted.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D128800
D106023 excluded 16-bit instructions from rematerialization, with the
justification that we can't rematerialize instructions that preserve
the high bits (and the instructions that do preserve them are a
confusing mess across subtargets). This doesn't make sense to me as a
problem, since cases where we would rely on the high-bit behavior would
still need to be represented as a register value constraint with a
tied operand. It's not a hidden side effect and should still be
rematerializable.
Without this, the new test case would fail with:
AMDGPUInstPrinter.cpp:545: void llvm::AMDGPUInstPrinter::printImmediate64(uint64_t, const llvm::MCSubtargetInfo &, llvm::raw_ostream &): Assertion `isUInt<32>(Imm) || Imm == 0x3fc45f306dc9c882' failed.
Differential Revision: https://reviews.llvm.org/D128435
The new resumable mca::Pipeline capability introduced in this patch
allows users to save the current state of the pipeline and resume from
that checkpoint.
It is best (though not required) to use this with the new
IncrementalSourceMgr, where users can add mca::Instruction instances
incrementally rather than providing a fixed number of instructions
ahead-of-time.
Note that we're using unit tests to exercise these new features, because
integrating them into the `llvm-mca` tool would cause too much churn.
Differential Revision: https://reviews.llvm.org/D127083
VOPD is a new encoding for dual-issue instructions for use in wave32.
This patch includes MC layer support only.
A VOPD instruction consists of an X component (for which there are
13 possible opcodes) and a Y component (for which there are the 13 X
opcodes plus 3 more). Most of the complexity in defining and parsing
a VOPD operation arises from the possible different total numbers of
operands and deferred parsing of certain operands depending on the
constituent X and Y opcodes.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D128218
waitcnt vmcnt instructions are currently generated in loop bodies before using
values loaded outside of the loop. In some cases, it is better to flush the
vmcnt counter in a loop preheader before entering the loop body. This patch
detects these cases and generates waitcnt instructions to flush the counter.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D115747
In GFX11 ShaderType is determined by the hardware and should no longer
be written into bits[3:2] of the ds_ordered_count offset field.
Differential Revision: https://reviews.llvm.org/D128196
Running iwyu-diff on the LLVM codebase since fb67d683db detected a few
regressions; this fixes them.
The impact on preprocessed output is negligible: -4k lines.
The granularity of SPI_SHADER_PGM_RSRC2_PS.EXTRA_LDS_SIZE changed
in GFX11. It is now in units of 256 dwords instead of 128 dwords.
COMPUTE_PGM_RSRC2.LDS_SIZE is unaffected. It is still in units of
128 dwords.
Differential Revision: https://reviews.llvm.org/D128179
The instructions that generate the source of dual source blend export
should run in strict-wqm. That is, if any lane in a quad is active,
we need to enable all four lanes of that quad so that the shuffling
operation before exporting to the dual source blend target works correctly.
Differential Revision: https://reviews.llvm.org/D127981
LDS_PARAM_LOAD and LDS_DIRECT_LOAD use EXEC per quad
(if any pixel is enabled in the quad, data is written
to all 4 pixels/threads in the quad).
Tag LDS_PARAM_LOAD and LDS_DIRECT_LOAD as using strict_wqm
to enforce this and avoid lane clobbering issues.
Note that only the instruction itself is tagged.
The implicit uses of these do not need to be in WQM.
This reduces unnecessary WQM calculation of M0.
Differential Revision: https://reviews.llvm.org/D127977
Detect LDS direct WAR/WAW hazards and compute values for
wait_vdst (va_vdst) parameter. Where appropriate this
raises wait_vdst from the default 0 to allow concurrent
issue of LDS direct with VALU execution.
Also detect LDS direct versus VMEM source VGPR hazards
and insert vm_vsrc=0 waits using s_waitcnt_depctr.
Differential Revision: https://reviews.llvm.org/D127963
`llvm::max(Align, MaybeAlign)` and `llvm::max(MaybeAlign, Align)` are
not used often enough to be required. They also make the code more opaque.
Differential Revision: https://reviews.llvm.org/D128121
This is a temporary measure to avoid generating incorrect code until the
compiler understands the new way that GFX11 encodes 16-bit operands in
VOP instructions.
Differential Revision: https://reviews.llvm.org/D128054
This includes:
- New llvm.amdgcn.image.msaa.load.* intrinsics
- NSA changes, because MIMG-NSA is now limited to 3 dwords
- Split CD forms of IMAGE_SAMPLE instructions out into separate
test files since they are no longer supported in GFX11
Differential Revision: https://reviews.llvm.org/D127837
Pre-gfx1030, null for sdst is different.
c97436f8b6 ([AMDGPU] Use null for dead sdst operand) requires a change
to make it not apply to pre-gfx1030 targets.
Differential Revision: https://reviews.llvm.org/D127869
The sched_barrier builtin allows the scheduler's behavior to be shaped by users
when very specific codegen is needed in order to create highly optimized code.
This patch adds more granular control over the types of instructions that are
allowed to be reordered with respect to one or multiple sched_barriers. A mask
is used to specify groups of instructions that should be allowed to be scheduled
around a sched_barrier. The details about how this mask may be used can be found in
llvm/include/llvm/IR/IntrinsicsAMDGPU.td.
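For illustration, a minimal IR sketch of the intended usage (the nonzero
mask value below is purely illustrative; the authoritative bit assignments
are documented in IntrinsicsAMDGPU.td):
```
; mask 0: nothing may be scheduled across this barrier
call void @llvm.amdgcn.sched.barrier(i32 0)
; a nonzero mask selects groups of instructions that may be scheduled
; across the barrier (illustrative value)
call void @llvm.amdgcn.sched.barrier(i32 4)
```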
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D127123
On gfx10+ the null register can be used as both a 32- and a 64-bit operand.
Define a 64 bit version of the register to use during codegen.
Differential Revision: https://reviews.llvm.org/D127527
Compared to permlane16, permlane64 has no BC input because it has no
boundary conditions, no fi input because the instruction acts as if FI
were always enabled, and no OLD input because it always writes to every
active lane.
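A minimal IR sketch of the new intrinsic (assuming the non-overloaded i32
form; in a wave64 it swaps data between the two 32-lane halves, so lane i
receives the value from lane i XOR 32):
```
%swapped = call i32 @llvm.amdgcn.permlane64(i32 %x)
```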
Also use the new intrinsic in the atomic optimizer pass.
Differential Revision: https://reviews.llvm.org/D127662
GFX11 uses different pseudos for these because of a new constraint
on which operands' registers can overlap.
Differential Revision: https://reviews.llvm.org/D127659
This uses a rotating remainder of division by 3 to select a different
temp vgpr each time in a sequence of several agpr copies.
Therefore, temp vgpr selection depends on the generated agpr
number. This number could change with any unrelated change to
the register definitions.
Stabilize the selection by using a real agpr number.
Differential Revision: https://reviews.llvm.org/D127524
The encoding of COMPUTE_TMPRING_SIZE.WAVESIZE and
SPI_TMPRING_SIZE.WAVESIZE has changed in GFX11: it is now in units
of 64 dwords instead of 256 dwords, and the field has been widened
from 13 bits to 15 bits.
Depends on D126989
Reviewed By: rampitec, arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D127248
sources to SALU and VALU instructions.
Contributors:
Baptiste Saleil <baptiste.saleil@amd.com>
Patch 20/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126989
Reviewed By: rampitec, foad, #amdgpu
Differential Revision: https://reviews.llvm.org/D127143
Add a basic implementation of isExtractSubvectorCheap that only
considers extracts at offset 0.
Differential Revision: https://reviews.llvm.org/D127385
Add new intrinsic and codegen support for the s_sendmsg_rtn_b32 and
s_sendmsg_rtn_b64 instructions.
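A hedged IR sketch of the new intrinsic (assuming the usual overload
mangling; the message id operand must be an immediate and the value 0
below is purely illustrative):
```
%v32 = call i32 @llvm.amdgcn.s.sendmsg.rtn.i32(i32 0)
%v64 = call i64 @llvm.amdgcn.s.sendmsg.rtn.i64(i32 0)
```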
Differential Revision: https://reviews.llvm.org/D127315
In GFX10 dlc controlled L1 cache bypass. In GFX11 it has been repurposed
to control MALL NOALLOC, and glc controls L1 as well as L0 cache bypass.
Update the documentation and SIMemoryLegalizer accordingly. Set dlc for
nontemporal and volatile accesses.
Differential Revision: https://reviews.llvm.org/D127405
Changes for GFX11:
- Clauses may not mix instructions of different types, and there are
more types. For example image instructions with and without a sampler
are now different types.
- The max size of a clause is explicitly documented as 63 instructions.
Previously it was implicitly assumed to be 64. This is such a tiny
difference that it does not seem worth making it conditional on the
subtarget.
- It can be beneficial to clause stores as well as loads.
Differential Revision: https://reviews.llvm.org/D127391
Supports encoding of existing instructions on gfx11, plus MC support for
the new VOPC dpp instructions.
Patch 19/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126978
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126989
Nic Curtis did the experiments to prove it is faster than a
separate mul and add.
Fixes: SWDEV-332806
Differential Revision: https://reviews.llvm.org/D127253
- VOP3 and SDWA forms of V_CMPX were not handled
- Hazard only exists if the compare defines EXEC (i.e. V_CMPX)
forwarded to the permlane.
Differential Revision: https://reviews.llvm.org/D127344
Clang-format InstructionSimplify and convert all "FunctionName"s to
"functionName". This patch does touch a lot of files but gets done with
the cleanup of InstructionSimplify in one commit.
This is the alternative to the less invasive clang-format only patch: D126783
Reviewed By: spatel, rengolin
Differential Revision: https://reviews.llvm.org/D126889
The generic legalizer framework is still used to reduce the problem
to scalar multiplication with the bit size a multiple of 32.
Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:
- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
factors and add a 64-bit accumulator. Most partial products should
use this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
product multiply-adds can feed the accumulator into each other
directly. (The register allocator's support for that is somewhat
limited, but that only matters for 128-bit integers and larger.)
- OTOH, on some hardware, V_MAD_U64_U32 requires the accumulator
to be stored in an even-aligned pair of GPRs. To avoid excessive
register copies, it makes sense to compute odd partial products
separately from even partial products (where a partial product
src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
halves together as a final step (see the worked example after this list).
- We can combine G_MUL+G_ADD into a single cascade of multiply-adds.
- The target can keep many carry-bits in flight simultaneously, so
combining carries using G_UADDE is preferable over G_ZEXT + G_ADD.
- Not addressed by this patch: When the factors are sign-extended,
the V_MAD_I64_I32 instruction (signed version!) can be used.
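As a small worked example: for a 64 x 64 -> 64-bit multiply, write
a = a1 * 2^32 + a0 and b = b1 * 2^32 + b0 with 32-bit digits. Then
  a * b = a0*b0 + (a0*b1 + a1*b0) * 2^32   (mod 2^64)
where a0*b0 is an "even" partial product (j0 + j1 = 0) and a0*b1, a1*b0
are the "odd" ones (j0 + j1 = 1); the two halves can be accumulated in
separate V_MAD_U64_U32 chains and added together at the end.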
It is difficult to address these points generically:
1) Finding matching pairs of G_MUL and G_UMULH to find a wide
multiply is expensive. We could add a G_UMUL_LOHI generic instruction
and conditionally use that in the generic legalizer, but by itself
this wouldn't allow us to use the accumulation capability of
V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
post-legalization, but this is also expensive.
2) Similarly, making sense of the legalization outcome of a wide
pre-legalization G_MUL+G_ADD pair is extremely expensive.
3) How could the generic legalizer possibly deal with the
particular idiosyncrasy of "odd" vs. "even" partial products?
All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not
be burdened with such overly target-specific concerns. Hence, a
custom legalization.
Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in
general. SelectionDAG iteratively cuts wide scalars into low and
high halves until a legal size is reached. By contrast, GlobalISel
does the narrowing in a single shot, which should be better for
compile-time and for the quality of the generated code.
This patch leaves three gaps open:
1. When the factors are uniform, we should execute the multiplication on
the SALU. Register bank mapping already ensures this.
However, the resulting code sequence is not optimal because it doesn't
fully use the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32
doesn't have a carry-in.) It is very difficult to fix this after the
fact, so we should really use a different legalization sequence in
this case. Unfortunately, we don't have a divergence analysis and so
cannot make that choice.
(This only matters for 128-bit integers and larger.)
2. Avoid unnecessary multiplies when sources are known to be zero- or
sign-extended. The challenge is that the legalizer does not currently
have access to GISelKnownBits.
3. When the G_MUL is followed by a G_ADD, we should consider combining
the two instructions into a single multiply-add sequence, to utilize
the accumulator of V_MAD_U64_U32 fully. (Unless the multiply has
multiple uses and the implied duplication of the multiply is an
overall negative). However, this is also not true when the factors
are uniform: in that case, it is generally better to *not* combine
the two operations, so that the multiply can be done on the SALU.
Again, we don't have a divergence analysis available and so cannot
make an informed choice.
Differential Revision: https://reviews.llvm.org/D124844
Includes dpp versions of VOP3P instructions.
Patch 18/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126917
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126978
The reverted dependent commit is now relanded, so reland this.
Includes dpp instructions and vop1/vop2 promoted to vop3
Patch 17/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126483
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126917
There was an issue with encoding wide (>64 bit) instructions on
BigEndian hosts, which is fixed in D127195. Therefore reland this.
gfx11 adds the ability to use dpp modifiers on vop3 instructions.
This patch adds machine code layer support for that. The MCCodeEmitter
is changed to use APInt instead of uint64_t to support these wider
instructions.
Patch 16/N for upstreaming of AMDGPU gfx11 architecture
Differential Revision: https://reviews.llvm.org/D126483
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.
Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
Includes dpp instructions and vop1/vop2 promoted to vop3
Patch 17/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126483
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126917
gfx11 adds the ability to use dpp modifiers on vop3 instructions.
This patch adds machine code layer support for that. The MCCodeEmitter
is changed to use APInt instead of uint64_t to support these wider
instructions.
Patch 16/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126475
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126483
The AMDGPUResourceUsageAnalysis was previously a CGSCC pass, and assumed
that a function's callees were always analyzed prior to their callers.
When it was refactored into a module pass, this assumption no longer
always holds. This results in calls being erroneously identified as
indirect, and in private segment space being reserved for them, which
in turn significantly slows down kernel launch latency.
This patch changes the order in which the module's functions are analyzed
from the order in which they occur in the module to a post-order traversal
of the call graph. Perhaps Clang always generates the module's functions
in such an order, but this is not the case for the Cray Fortran compiler.
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D126025
This patch improves the codegen of extractelement and insertelement for
vectors containing 8 elements. Before, a DAG combine transformation was
generating a sequence of 8 select/cmp pairs.
This patch changes the upper limit for this transformation so that the
movrel instruction will eventually be used instead. Extractelement and
insertelement for vectors containing fewer than 8 elements are unchanged.
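For reference, the kind of operation affected (generic IR, illustrative):
```
; a dynamic extract from an 8-element vector: previously expanded by the
; DAG combine into 8 cmp/select pairs, now left for movrel-based selection
%elt = extractelement <8 x float> %vec, i32 %idx
```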
Differential Revision: https://reviews.llvm.org/D126389
This patch includes MC layer support for VOP3 encoded instructions and generic VOP support
classes.
Some VOP1 and VOP2 instructions which share an encoding with gfx10 and
use AssemblerPredicate = isGFX10Plus are also enabled. That predicate
will be changed to isGFX10Only in a later patch.
Patch 15/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D126468
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D126475
MC layer support for ds instructions
Contributors:
Piotr Sobczak <Piotr.Sobczak@amd.com>
Patch 14/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D126463
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D126468
Rename CalleeSavedRegs defs to avoid being overly specific:
* CSR_AMDGPU_AGPRs_32_255 => CSR_AMDGPU_AGPRs
* CSR_AMDGPU_SGPRs_30_31 + CSR_AMDGPU_SGPRs_32_105 => CSR_AMDGPU_SGPRs
* CSR_AMDGPU_SI_Gfx_SGPRs_4_29 + CSR_AMDGPU_SI_Gfx_SGPRs_64_105 =>
CSR_AMDGPU_SI_Gfx_SGPRs
* CSR_AMDGPU_HighRegs => CSR_AMDGPU
* CSR_AMDGPU_HighRegs_With_AGPRs => CSR_AMDGPU_GFX90AInsts
* CSR_AMDGPU_SI_Gfx_With_AGPRs => CSR_AMDGPU_SI_Gfx_GFX90AInsts
Introduce a class RegMask to mark the cases where we use the
CalleeSavedRegs class purely as an expedient way to produce a mask.
Update the names of these masks to not mention "CSR". Other targets also
seem to do this, so a reasonable alternative is to actually update
table-gen to include a new class to do this explicitly, but the current
approach seems harmless so I opted to just make it more explicit.
Reviewed By: arsenm, sebastian-ne
Differential Revision: https://reviews.llvm.org/D109008
Avoid the dependency on TargetInstrInfo, which depends on the subtarget
and therefore the individual function.
Currently AMDGPU is constructing PseudoSourceValue instances in MachineFunctionInfo.
In order to facilitate copying MachineFunctionInfo, we need to stop allocating these
there. Alternatively we could allow targets to subclass PseudoSourceValueManager,
and allocate them similarly to MachineFunctionInfo.
This patch implements a DAG mutation which adds edges between different groups of instructions. The purpose is to try to generate code that conforms to a pipeline (groupA instructions occur before groupB, groupB -> groupC, and so on). Currently the pipeline order is hardcoded as VMEM->DSRead->MFMA->DSWrite, but the patch was designed to be easily extensible. Alias analysis is problematic for pipelining as memory instructions will usually not be able to be reordered w.r.t one another.
Differential Revision: https://reviews.llvm.org/D125997
MC layer support for instructions in the MIMG encoding (image
instructions).
Contributors:
Carl Ritson <carl.ritson@amd.com>
Patch 13/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125992
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126463
These generic instructions are trivially selected to
V_MAD_[IU]64_[IU]32 instructions when run on the VALU.
When at least the two factors are scalar, it is usually better to execute
some or all of the instruction on the SALU. To this end, we lower the
instruction to simpler instructions that are supported on the SALU
when applying the register bank mapping.
Differential Revision: https://reviews.llvm.org/D124843
MCSymbolizer::tryAddingSymbolicOperand() overloaded the Size parameter
to specify either the instruction size or the operand size depending on
the architecture. However, for proper symbolic disassembly on X86, we
need to know both sizes, as an instruction can have two operands, and
the instruction size cannot be reliably calculated based on the operand
offset and its size. Hence, split Size into OpSize and InstSize.
For X86, the new interface makes it possible to fix a couple of issues:
* Correctly adjust the value of PC-relative operands.
* Set operand size to zero when the operand is specified implicitly.
Differential Revision: https://reviews.llvm.org/D126101
Machine code support for FLAT type instructions
Contributors:
Sebastian Neubauer <sebastian.neubauer@amd.com>
Patch 12/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125989
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D125992
A new instruction encoding. Some of these instructions were previously VOP3
encoded.
Contributors:
Carl Ritson <carl.ritson@amd.com>
Patch 11/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125824
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D125989
A later change will add a 3rd user, so factoring out the common code
seems useful.
Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).
(Resubmit with missing test added.)
Differential Revision: https://reviews.llvm.org/D125324
A later change will add a 3rd user, so factoring out the common code
seems useful.
Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).
Differential Revision: https://reviews.llvm.org/D125324
Even though single address image instructions only use a single VGPR,
HW accesses 4 or 5, which creates an alignment requirement.
Fixes: SWDEV-316648
Differential Revision: https://reviews.llvm.org/D126009
This brings the MachineInstrs in line with the corresponding intrinsics
which have side effects but do not access memory. It also matches how
BUF cache invalidation instructions are defined.
The lit test changes are just because the machine scheduler previously
treated them like loads, and added an artificial scheduling edge from
them to the exit SU, which caused them to be scheduled earlier.
Differential Revision: https://reviews.llvm.org/D126074
Extend SIInstrInfo::isOperandLegal to enforce a limit on the number of
literal operands for all VALU instructions, not just VOP3. In particular
it now handles VOP2 instructions with a mandatory literal operand like
V_FMAAK_F32.
Differential Revision: https://reviews.llvm.org/D126064
Extend the literal operand checking in SIInstrInfo::verifyInstruction to
check VOP2 instructions like V_FMAAK_F32 which have a mandatory literal
operand. The rule is that src0 can also be a literal, but only if it is
the same literal value.
AMDGPUAsmParser::validateConstantBusLimitations already handles this
correctly.
Differential Revision: https://reviews.llvm.org/D126063
It is already marked as having side effects, at least in MIR. It does
not interact with anything else that is modelled as a memory access
either in IR or MachineIR.
Differential Revision: https://reviews.llvm.org/D125985
s_getreg does not interact with anything else that is modelled as a
memory access either in IR or MachineIR.
Differential Revision: https://reviews.llvm.org/D125968
AMDGPUAsmParser::validateSOPLiteral already knew about this but
SIInstrInfo::verifyInstruction did not.
Differential Revision: https://reviews.llvm.org/D125976
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
We have always had global and scratch loads to LDS on gfx9,
but did not handle them. These were available via the 'lds'
encoding bit. In gfx940 this bit was reused as 'svs' which
resulted in new '_lds' opcodes effectively pushing this
bit into the opcode, but functionally it is the same. These
instructions are also available on gfx10.
Differential Revision: https://reviews.llvm.org/D125126
MC layer support for SOP (scalar ALU operations) including encoding
support for s_delay_alu and s_sendmsg_rtn.
Contributors:
Jay Foad <jay.foad@amd.com>
Patch 7/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125319
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D125498
This reverts ffbee7acdc; see also bug 37653, which it was fixing.
The bug claims this is an undocumented feature which actually works.
In reality it is documented as not working, for a good reason.
It likely does something, but it is useless anyway. These instructions
write into the LDS. The LDS address is:
M0 + inst_offset + (TIDinWave * 4).
For a store wider than a DWORD neighboring lanes will overwrite each
other.
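For example, with a b64 store, lane 0 writes 8 bytes starting at
M0 + inst_offset + 0 while lane 1 writes 8 bytes starting at
M0 + inst_offset + 4, so lane 1's low DWORD clobbers lane 0's high DWORD.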
Differential Revision: https://reviews.llvm.org/D125409
On GFX10 VOP3 instructions can have a literal operand, so the conversion
from VOP3 MAD/FMA to VOP2 MADAK/MADMK/FMAAK/FMAMK will not happen in
SIFoldOperands. The only benefit of the VOP2 form is code size, so do it
in SIShrinkInstructions instead.
Differential Revision: https://reviews.llvm.org/D125567
Includes machine code layer support and tests, and MIR tests not requiring
CodeGen pass changes.
Includes a small change in SMInstructions.td to correct encoded bits.
Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>
Dmitry Preobrazhensky <dmitry.preobrazhensky@amd.com>
Depends on D125316
Patch 6/N for upstreaming of AMDGPU gfx11 architecture.
Reviewed By: dp, Petar.Avramovic
Differential Revision: https://reviews.llvm.org/D125319
The name `MCFixedLenDisassembler.h` is out of date after D120958.
Rename it as `MCDecoderOps.h` to reflect the change.
Reviewed By: myhsu
Differential Revision: https://reviews.llvm.org/D124987
Adds an intrinsic/builtin that can be used to fine-tune scheduler behavior.
If there is a need for highly optimized codegen and kernel developers have
knowledge of inter-wave runtime behavior which is unknown to the compiler,
this builtin can be used to tune scheduling.
This intrinsic creates a barrier between scheduling regions. The immediate
parameter is a mask to determine the types of instructions that should be
prevented from crossing the sched_barrier. In this initial patch, there are only
two variations. A mask of 0 means that no instructions may be scheduled across
the sched_barrier. A mask of 1 means that non-memory, non-side-effect inducing
instructions may cross the sched_barrier.
Note that this intrinsic is only meant to work with the scheduling passes. Any
other transformations that may move code will not be impacted in the ways
described above.
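A minimal IR sketch of the two initial variations:
```
; mask 0: no instructions may be scheduled across the sched_barrier
call void @llvm.amdgcn.sched.barrier(i32 0)
; mask 1: non-memory, non-side-effect-producing instructions may cross
call void @llvm.amdgcn.sched.barrier(i32 1)
```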
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D124700
Refactor to pass a templatized size parameter to the decoder to allow
wider-than-64-bit decodes in a later patch.
Contributors:
Jay Foad <jay.foad@amd.com>
Depends on D125261
Patch 5/N for upstreaming of AMDGPU gfx11 architecture.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D125316
Tablegen definitions for subtarget features and cpp predicate functions to
access the features.
New subtarget processors and common latencies.
Simple changes to MIR codegen tests which pass on gfx11 because they have the
same output as previous subtargets or operate on pseudo instructions which
are reused from previous subtargets.
Contributors:
Jay Foad <jay.foad@amd.com>
Petar Avramovic <Petar.Avramovic@amd.com>
Patch 4/N for upstreaming of AMDGPU gfx11 architecture
Depends on D124538
Reviewed By: Petar.Avramovic, foad
Differential Revision: https://reviews.llvm.org/D125261
This patch adds cluster edges between independent MFMA instructions. Additionally, it propagates all predecessors of cluster insts to the root of the cluster(s), and all successors to the leaf(ves) of the cluster(s) -- this is done to remove the possibility that those insts will be interspersed within the cluster.
Reviewed By: kerbowa
Differential Revision: https://reviews.llvm.org/D124678
Only fold for uniform values on pre-GFX9 chips. GFX9+ allows us
to keep the calculation entirely on the SALU.
For subtargets where integer multiplication isn't full-rate, avoid
folding if the multiply has too many uses.
Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.
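For reference, the kind of pattern that now expands into a MAD-based
chain (generic IR, illustrative):
```
%m = mul i64 %a, %b
%r = add i64 %m, %c   ; mul feeding an add: lowered using the MAD accumulator
```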
Differential Revision: https://reviews.llvm.org/D123835
As suggested by @foad on D124839:
If we're extracting a vector element that originally came from a scalar_to_vector, then avoid the bitcasting of a vector type and perform the shift masking on the (any-extended) scalar source directly, making use of the fact that the upper elements of a scalar_to_vector are all undef.
Differential Revision: https://reviews.llvm.org/D125173
FeatureAtomicFaddInsts is replaced with three more granular features.
Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>
Patch 3/N for upstreaming of AMDGPU gfx11 architecture
Depends on D124537
Reviewed By: foad, #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D124538
Given a DPP mov like this:
%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
...
%3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec
this patch just removes a check that %2 (the "old reg") was defined in
the same BB as the DPP mov instruction. GCNDPPCombine requires that the
MIR is in SSA form so I don't understand why the BB matters.
This lets the optimization work in more real world cases when the
definition of %2 gets hoisted out of a loop.
Differential Revision: https://reviews.llvm.org/D124182
The image.sample instruction can be forced to return a half type instead
of float when the d16 flag is enabled.
This patch adds a new pattern in InstCombine to detect if the output of
image.sample is used later only by an fptrunc which converts the type
from float to half. If the pattern is detected, the fptrunc and
image.sample are combined into a single image.sample that returns the
half type. Later, during lowering, the d16 flag is added to the image
sample intrinsic.
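A hedged IR sketch of the combine (1d form, assuming the usual sample
intrinsic signature; shown only as an illustration):
```
; before: full-precision sample followed by a truncate
%s = call float @llvm.amdgcn.image.sample.1d.f32.f32(i32 1, float %u, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
%h = fptrunc float %s to half
; after: a single d16 sample returning half directly
%h2 = call half @llvm.amdgcn.image.sample.1d.f16.f32(i32 1, float %u, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
```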
Differential Revision: https://reviews.llvm.org/D124232
Introduces a string attribute, amdgpu-requires-module-lds, to allow
eliding the module.lds block from kernels. Will allocate the block as before
if the attribute is missing or has its default value of true.
Patch uses the new attribute to detect the simplest possible instance of this,
where a kernel makes no calls and thus cannot call any functions that use LDS.
Tests updated to match; coverage was already good. An interesting case is in
lower-module-lds-offsets, where annotating the kernel allows the backend to pick
a different (in this case better) variable ordering than previously. A later
patch will avoid moving kernel variables into module.lds when the kernel can
have this attribute, allowing optimal ordering and locally unused variable
elimination.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122091
by moving around the code and by adding more comments, which would
later help during any required clean-up.
Differential Revision: https://reviews.llvm.org/D124733
Refactor to simplify a follow-up change.
No functional change intended. However, there is a rather subtle logic
change: the subsequent combines (e.g. reassociation) are skipped *always*
when one of the operands of the add is a mul, instead of only when
additionally mad64_32 etc. are available. This change makes sense because
the subsequent combines should never apply when one of the operands is a
mul.
Differential Revision: https://reviews.llvm.org/D123833
MUBUF and FLAT LDS DMA operations need a wait on vmcnt before the LDS
they have written can be accessed. A load from LDS to VMEM does not need a wait.
Differential Revision: https://reviews.llvm.org/D124626
This is the first patch of a series to upstream support for the new
subtarget.
Contributors:
Jay Foad <jay.foad@amd.com>
Konstantin Zhuravlyov <kzhuravl_dev@outlook.com>
Patch 1/N for upstreaming AMDGPU gfx11 architectures.
Reviewed By: foad, kzhuravl, #amdgpu
Differential Revision: https://reviews.llvm.org/D124536
As older waves execute long sequences of VALU instructions, they may
prevent younger waves from performing address calculation and then issuing
their VMEM loads, which in turn leaves the VALU unit idle. This patch tries
to prevent this by temporarily raising the wave's priority.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D124246
Add these bits to the MUBUF and FLAT LDS DMA instructions:
- LGKM_CNT - these operate on LDS;
- VALU - SPG 3.9.8: This instruction acts as both a MUBUF and
VALU instruction;
Codegen currently does not produce any of this, so the change is NFC.
Differential Revision: https://reviews.llvm.org/D124472
The image.sample instruction can be forced to return a half type instead
of float when the d16 flag is enabled.
This patch adds a new pattern in InstCombine to detect if the output of
image.sample is used later only by an fptrunc which converts the type
from float to half. If the pattern is detected, the fptrunc and
image.sample are combined into a single image.sample that returns the
half type. Later, during lowering, the d16 flag is added to the image
sample intrinsic.
Differential Revision: https://reviews.llvm.org/D124232
The builtin predicate handling has a strange behavior where the code
assumes that a PatFrag is a stack of PatFrags, and each level adds at
most one predicate. I don't think this particularly makes sense,
especially without a diagnostic to ensure you aren't trying to set
multiple at once.
This wasn't followed for address spaces and alignment, which could
potentially fall through to report no builtin predicate was
added. Just switch these to follow the existing convention for now.
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.
Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.
Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
This is to avoid relying on the post-isel hook.
This change also enable the saddr pattern selection for atomic
intrinsics in GlobalISel.
Differential Revision: https://reviews.llvm.org/D123583
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-
ds_write_b64 aligned by 8: 3.2 sec
ds_write2_b32 aligned by 8: 3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8: 13.2 sec
ds_write_b64 aligned by 1: 7.3 sec
ds_write2_b32 aligned by 1: 7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1: 13.2 sec
ds_write_b64 aligned by 2: 7.3 sec
ds_write2_b32 aligned by 2: 7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2: 13.3 sec
ds_write_b64 aligned by 4: 4.6 sec
ds_write2_b32 aligned by 4: 3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4: 13.3 sec
ds_read_b64 aligned by 8: 2.3 sec
ds_read2_b32 aligned by 8: 2.2 sec
ds_read_u16 * 4 aligned by 8: 4.8 sec
ds_read_u8 * 8 aligned by 8: 8.6 sec
ds_read_b64 aligned by 1: 4.4 sec
ds_read2_b32 aligned by 1: 7.3 sec
ds_read_u16 * 4 aligned by 1: 14.0 sec
ds_read_u8 * 8 aligned by 1: 8.7 sec
ds_read_b64 aligned by 2: 4.4 sec
ds_read2_b32 aligned by 2: 7.3 sec
ds_read_u16 * 4 aligned by 2: 4.8 sec
ds_read_u8 * 8 aligned by 2: 8.7 sec
ds_read_b64 aligned by 4: 4.4 sec
ds_read2_b32 aligned by 4: 2.3 sec
ds_read_u16 * 4 aligned by 4: 4.8 sec
ds_read_u8 * 8 aligned by 4: 8.7 sec
Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030
ds_write_b64 aligned by 8: 4.4 sec
ds_write2_b32 aligned by 8: 4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8: 13.0 sec
ds_write_b64 aligned by 1: 23.2 sec
ds_write2_b32 aligned by 1: 23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1: 13.0 sec
ds_write_b64 aligned by 2: 23.2 sec
ds_write2_b32 aligned by 2: 23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2: 13.1 sec
ds_write_b64 aligned by 4: 13.5 sec
ds_write2_b32 aligned by 4: 4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4: 13.1 sec
ds_read_b64 aligned by 8: 3.5 sec
ds_read2_b32 aligned by 8: 3.4 sec
ds_read_u16 * 4 aligned by 8: 5.3 sec
ds_read_u8 * 8 aligned by 8: 8.5 sec
ds_read_b64 aligned by 1: 13.1 sec
ds_read2_b32 aligned by 1: 22.7 sec
ds_read_u16 * 4 aligned by 1: 43.9 sec
ds_read_u8 * 8 aligned by 1: 7.9 sec
ds_read_b64 aligned by 2: 13.1 sec
ds_read2_b32 aligned by 2: 22.7 sec
ds_read_u16 * 4 aligned by 2: 5.6 sec
ds_read_u8 * 8 aligned by 2: 7.9 sec
ds_read_b64 aligned by 4: 13.1 sec
ds_read2_b32 aligned by 4: 3.4 sec
ds_read_u16 * 4 aligned by 4: 5.6 sec
ds_read_u8 * 8 aligned by 4: 7.9 sec
```
GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, whereas on GFX10 even a full split
is better. However, this is only a theoretical gain, because splitting
an access to the sub-dword level requires more registers and
packing/unpacking logic, so ignoring that option it is better to use a
single 64-bit instruction on misaligned data, with the exception of
4-byte-aligned data, where ds_read2_b32/ds_write2_b32 is better.
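For example (illustrative IR), a 2-byte-aligned 64-bit LDS access such as
```
store i64 %v, i64 addrspace(3)* %p, align 2
```
can now stay a single ds_write_b64 rather than being split, per the
conclusion above.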
Differential Revision: https://reviews.llvm.org/D123956
Fix isVCC for a register that was assigned a register class during
instruction selection. This happens when the register has multiple uses.
For wave32, a uniform i1-to-vcc copy was selected like a vcc-to-vcc
copy when the uniform i1 had an assigned register class.
A uniform i1 register with an assigned register class will have s1 LLT,
be defined using G_TRUNC, and its class will be SReg_32RegClass.
A vcc i1 register with an assigned register class will have s1 LLT, its
class will be SReg_32RegClass for wave32 and SReg_64RegClass for wave64,
and the register will not be defined by G_TRUNC.
Differential Revision: https://reviews.llvm.org/D124163
This fixes the assertion failure "Loop in the Block Graph!".
SIMachineScheduler groups instructions into blocks (also referred to
as coloring or groups) and then performs a two-level scheduling:
inter-block scheduling, and intra-block scheduling.
This approach requires that the dependency graph on the blocks which
is obtained by contracting the blocks in the original dependency graph
is acyclic. In other words: Whenever A and B end up in the same block,
all vertices on a path from A to B must be in the same block.
When compiling an example consisting of an export followed by
a buffer store, we see a dependency between these two. This dependency
may be false, but that is a different issue.
This dependency was not correctly accounted for by SIMachineScheduler.
A new test case si-scheduler-exports.ll demonstrating this is
also added in this commit.
The problematic part of SIMachineScheduler was a post-optimization of
the block assignment that tried to group all export instructions into
a separate export block for better execution performance. This routine
correctly checked that any paths from exports to exports did not
contain any non-exports, but not vice-versa: In case of an export with
a non-export successor dependency, that single export was moved
to a separate block, which could then be both a successor and a
predecessor block of a non-export block.
As fix, we now skip export grouping if there are exports with direct
non-export successor dependencies. This fixes the issue at hand,
but is slightly pessimistic:
We *could* group all exports into a separate block that have neither
direct nor indirect export successor dependencies.
We will review the potential performance impact and potentially
revisit with a more sophisticated implementation.
Note that just grouping all exports without direct non-export successor
dependencies could still lead to illegal blocks, since non-export A
could depend on export B that depends on export C. In that case,
export C has no non-export successor, but still may not be grouped
into an export block.
Based on available register budget, reserve highest available VGPR for
AGPR copy before RA. After RA, shift it to the lowest unused VGPR if one
exists.
Fixes SWDEV-330006.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123525
When folding a COPY of exec into another COPY, the call to
TII->isOperandLegal would crash because COPYs don't have defined
register classes for their operands.
Differential Revision: https://reviews.llvm.org/D122737
These don't seem to be very well used or tested, but try to make the
behavior a bit more consistent with LDS globals.
I'm not sure what the definition for amdgpu-gds-size is supposed to
mean. For now I assumed it's allocating a static size at the beginning
of the allocation, and any known globals are allocated after it.
There's no reason to create these immediately. They can be created in
the prolog/epilog code like CSR spills. There's probably a cleaner way
to do this by utilizing the CSR spill code.
This makes the frame index used transient state for
PrologEpilogInserter, and thus makes serialization easier. Really this
doesn't need to be saved here but there isn't really a better place
for it.
The sramecc feature changes the behaviour of d16 loads so they do not
preserve the unused 16 bits of the result register, but it has no impact
on d16 stores, so we should make use of them even when the feature is
enabled.
Differential Revision: https://reviews.llvm.org/D104912
Instead of lengthy constructors we can now set the members of a
read-only struct before the Attributor is created. Should make it
clearer what is configurable and also help introducing new options in
the future. This actually added IsModulePass and avoids deduction
through the Function set size. No functional change was intended.
This is an NFC patch in preparation to fix a bug related to always
reserving VGPR32 for AGPR copy.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123651
This code was pattern matching the ID computation expression as it
appears in the library. This was a compare and select, but now that
umin is canonical, we were no longer matching. Update to match the
intrinsic instead.
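For reference, the shape of the change (generic IR):
```
; the ID computation used to reach the matcher as compare+select:
%c = icmp ult i32 %a, %b
%s = select i1 %c, i32 %a, i32 %b
; ...but is now canonicalized by instcombine to:
%s2 = call i32 @llvm.umin.i32(i32 %a, i32 %b)
```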
For stores to constant address space, this will now consistently hit a
selection error instead of hitting unreachable in an asserts build.
I'm not sure what we should really do here. We could either just
codegen as if it were global, delete the instruction, or declare the
IR invalid (we really should have a target IR verifier to enforce it).
Summary:
Introduce a new function attribute, amdgpu-no-multigrid-sync-arg, which is set by default.
We use implicitarg_ptr + offset to check whether the multigrid synchronization
pointer is used. If yes, we remove this attribute and also remove
amdgpu-no-implicitarg-ptr. We generate metadata for the hidden_multigrid_sync_arg
only when the amdgpu-no-multigrid-sync-arg attribute is removed from the function.
Reviewers: arsenm, sameerds, b-sumner and foad
Differential Revision: https://reviews.llvm.org/D123548
Use default member initializers in AMDGPUSubtarget and subclasses. This
is to guard against adding a new feature boolean in AMDGPUSubtarget.h
but forgetting to initialize it to false in AMDGPUSubtarget.cpp.
This was mostly autogenerated by:
clang-tidy -checks=-*,cppcoreguidelines-prefer-member-initializer,modernize-use-default-member-init -header-filter=Subtarget -fix lib/Target/AMDGPU/*Subtarget.cpp
Differential Revision: https://reviews.llvm.org/D123613
Move feature/bug checks into a single place,
allowsMisalignedMemoryAccessesImpl.
This is mostly NFCI except for the order of selection in a couple of places.
A separate change may be needed to stop lying about Fast.
Differential Revision: https://reviews.llvm.org/D123343
If the CFG structure of a waterfall loop is not the expected shape
then gracefully abort traversing the IR for the given loop.
This applies to nested waterfall loops, which are not supported by
the VGPR live range optimizer.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D123480
Summary:
In emitting metadata for implicit kernel arguments, we need to be in sync with the actual loads
to align the implicit kernel argument segment to an 8-byte boundary. In this work, we simply force
this alignment through the first implicit argument.
In addition, we don't emit metadata for any implicit kernel argument if none of them is actually used.
Reviewers: arsenm, b-sumner
Differential Revision: https://reviews.llvm.org/D123346
The AMDGPUISD::SETCC node is like ISD::SETCC, but returns a lane mask
instead of a per-lane boolean. The lane mask is uniform.
This improves instruction selection for code patterns like
ctpop(ballot(x)), which can now use an S_BCNT1_* instruction instead
of V_BCNT_*.
GlobalISel already selects scalar instructions (an earlier commit
added a test case).
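A sketch of the motivating pattern (IR level):
```
; a uniform lane mask whose population count can now be selected to
; S_BCNT1_* instead of V_BCNT_*
%mask = call i64 @llvm.amdgcn.ballot.i64(i1 %cond)
%cnt = call i64 @llvm.ctpop.i64(i64 %mask)
```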
Differential Revision: https://reviews.llvm.org/D123432
D67148 has removed TTI::getNumberOfRegisters(bool Vector) and
started to call TTI::getNumberOfRegisters(unsigned ClassID) from
LoopVectorize. This has resulted in unrestricted vectorization on AMDGPU,
blowing up register pressure.
Differential Revision: https://reviews.llvm.org/D122850
Enable the PreRARematerialize pass when there are multiple high RP scheduling
regions present. Require the occupancy in all high RP regions be improved
before finalizing sinking. If any high RP region did not improve in occupancy
then undo all sinking and restore the state from before the pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D122501
We found that it might be beneficial to have the SIOptimizeExecMasking
pass detect more cases where v_cmp, s_and_saveexec patterns can be
transformed to s_mov, v_cmpx patterns. Currently, the search range
for finding a fitting v_cmp instruction is 5; this is doubled
to 10 here.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D123367
Currently, the utility supports lowering of non-atomic memory transfer routines only. This patch adds support for the atomic version of memcpy. This may be useful for targets not supporting atomic memcpy.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D118443
It was only handled for FLAT initially because we did not have lowering
for unaligned DS instructions. That lowering is now implemented, but the
bug was not handled.
Differential Revision: https://reviews.llvm.org/D123338
There is no need to fully scalarize an unaligned operation in some
cases; it is enough to split it down to the known alignment.
Differential Revision: https://reviews.llvm.org/D123330
Summary:
If the implicitarg_ptr intrinsic is not used, set the implicit kernarg size
to 0; otherwise set it to 256 bytes for code object version 5 (and beyond).
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D123262
Use new NotAtomic expansion to turn these into the equivalent
non-atomic operations. Independent lanes cannot access the private
memory of other lanes, so there's no possibility for synchronization.
These don't really appear directly in user code, but
InferAddressSpaces can make these appear after optimizations.
Fixes issues 54693 and 54274.
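A minimal sketch of the private-memory expansion this enables
(illustrative IR):
```
; an atomicrmw on private (scratch) memory...
%old = atomicrmw add i32 addrspace(5)* %p, i32 1 monotonic
; ...can be rewritten as the equivalent non-atomic sequence:
%v = load i32, i32 addrspace(5)* %p
%n = add i32 %v, 1
store i32 %n, i32 addrspace(5)* %p
```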
static_cast is a little safer here since the compiler will
ensure we're casting to a class derived from
yaml::MachineFunctionInfo.
I believe this first appeared on AMDGPU and was copied to the
other two targets.
Spotted when it was being copied to RISCV in D123178.
Differential Revision: https://reviews.llvm.org/D123260
Handle the llvm.r600.* intrinsics which are still in use in libclc. I
thought it would be possible to switch it to using
llvm.amdgcn.implicitarg.ptr already, but it turns out the implicit
arguments are currently split into a piece before and after the
explicit kernel arguments.
The diagnostic is unreliable, and triggers even for dead uses of
hostcall that may exist when linking the device-libs at lower
optimization levels.
Eliminate the diagnostic, and directly document the limitation for
OpenCL before code object V5.
Make some NFC changes to clarify the related code in the
MetadataStreamer.
Add a clang test to tie OCL sources containing printf to the backend IR
tests for this situation.
Reviewed By: sameerds, arsenm, yaxunl
Differential Revision: https://reviews.llvm.org/D121951
Ignore all debug uses when collecting trivially rematerializable defs. This fixes an issue with difference in codegen when enabling debug info.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123048
Previously any load (global, local or constant) feeding into a
global load or store would be counted as an indirect access. This
patch only counts global loads feeding into a global load or store.
The rationale is that the latency for global loads is generally
much larger than the other kinds.
As a side effect this makes it easier to write small kernel test
cases that are not counted as having indirect accesses, despite
the fact that arguments to the kernel are accessed with an SMEM
load.
Differential Revision: https://reviews.llvm.org/D122804
Whenever a v_cmp, s_and_saveexec instruction sequence is to be
transformed into an equivalent s_mov, v_cmpx sequence, we need to
detect whether the v_cmp target register is used between the two
instructions, since the v_cmp result gets omitted when the v_cmpx
instruction is used, which would otherwise result in invalid code.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D122797
Summary:
To compute the size of a VALU/SALU instruction, we need to check whether an operand
could ever be literal. Previously isLiteralConstant was used, which missed cases
like global variables or external symbols. These misses lead to under-estimation of
the instruction size and branch offset, and thus incorrectly skip the necessary branch
relaxation when the branch offset is actually greater than what the branch bits can hold.
In this work, we use isLiteralConstantLike to check the operands. It may be conservative,
but it is safe.
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D122778
Summary:
hasHostcallPtr() and hasHeapPtr() are only used in metadata emit.
However, we can use the corresponding function attributes directly
instead of introducing these functions.
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D122600
Revision https://reviews.llvm.org/D122332 added a pattern transformation
where v_cmpx instructions are introduced. However, the modifiers are
not correctly inherited from the original operands. The patch
adds the source modifiers if they exist, or sets them to 0.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D122489
Split waterfall loops into multiple blocks so that exec mask
manipulation (s_and_saveexec) does not occur in the middle of
a block.
VGPR live range optimizer is updated to handle waterfall loops
spanning multiple blocks.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D122200
All LLVM backends use MCDisassembler as a base class for their
instruction decoders. Use "const MCDisassembler *" for the decoder
instead of "const void *". Remove unnecessary static casts.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D122245
On GFX10.3 targets, the following instruction sequence
v_cmp_* SGPR, ...
s_and_saveexec ..., SGPR
leads to a fairly long stall caused by a VALU write to an SGPR and having the
following SALU wait for the SGPR.
An equivalent sequence is to save the exec mask manually instead of letting
s_and_saveexec do the work and use a v_cmpx instruction instead to do the
comparison.
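Schematically, the replacement sequence then has the shape
s_mov_b32 SGPR', exec_lo
v_cmpx_* ...
with v_cmpx writing EXEC directly (wave64 uses the b64 forms; this is
only a sketch of the shape, not an exact listing).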
This patch modifies the SIOptimizeExecMasking pass as this is the last position
where s_and_saveexec instructions are inserted. It does the transformation by
trying to find the pattern, extracting the operands and generating the new
instruction sequence.
It also changes some existing lit tests and introduces a few new tests to show
the changed behavior on GFX10.3 targets.
Same as D119696 including a buildbot and MIR test fix.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D122332
Since a table was introduced for MAI instructions, extend it
for use in DGEMM classification.
Differential Revision: https://reviews.llvm.org/D122337
In some cases padding bubbles between sequential MFMA instructions may
lead to increased inter-wave performance. Add an option to request padding
of some portion of these stall cycles with s_nops.
Fixes: SWDEV-326925
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D121437
Also update the function indirectCopyToAGPR() to ensure that it is called only on the GFX908 sub-target.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D122286
First, add code to reserve all required special purpose registers,
followed by code to reserve SGPRs, followed by code to reserve
VGPRs/AGPRs.
This patch is prepared as a pre-requisite to fix an issue related to
GFX90A hardware.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122219
In the frame index lowering we have to insert shift and add
instructions to adjust stack object access. We need to take care of the stack
object user kind and use scalar shift/add for scalar users.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D121524