Even if wave offset is not present we still need to do the rest
of the initialization. The mov into s32 was missing in the
kernels.
Fixes: SWDEV-310935
Differential Revision: https://reviews.llvm.org/D113628
We were previously setting an ignored bit in the kernel headers. The
current behavior is to add the large amount on top of the statically
known size of a single stack frame. I'm not sure if we should just use
the large size as the entire reported size instead.
If a kernel had no formal arguments but did have the implicit
arguments, we were reporting a required kernarg alignment of 4. For
some reason we require an 8-byte alignment for this, even though
there's no real advantage and I don't see where this is documented in
the ABI.
The code object header code also claims the minimum alignment is 16,
which is what I thought you always got at runtime anyway so I don't
know why this matters.
- CUDA cannot associate memory space with pointer types. Even though Clang could add extra attributes to specify the address space explicitly on a pointer type, it breaks the portability between Clang and NVCC.
- This change proposes to assume the address space from a pointer from the assumption built upon target-specific address space predicates, such as `__isGlobal` from CUDA. E.g.,
```
foo(float *p) {
__builtin_assume(__isGlobal(p));
// From there, we could assume p is a global pointer instead of a
// generic one.
}
```
This makes the code portable without introducing the implementation-specific features.
Note that NVCC starts to support __builtin_assume from version 11.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112041
NFC. This check mainly handles size affecting literals. Make it check all
explicit operands instead of a few by name. Also make the isLiteral
check handle the KIMM operands, see https://reviews.llvm.org/D111067
Change-Id: I1a362d55b2a10f5c74d445272e8b7829a8b77597
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D113318
Change-Id: Ie6c688f30a71e0335d1c6dd1ff65019bd7ce684e
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.
I've included some usage, its not exhaustive, just the more obvious cases where the intention is obvious.
Differential Revision: https://reviews.llvm.org/D113396
This patch changes the AMDGPU_Gfx calling convention. It defines the SGPR registers s[4:29] as callee-save and leaves some SGPRs usable for callers. The intention is to avoid unneccessary s_mov instructions for arguments the caller would otherwise save and restore in these registers.
Reviewed By: sebastian-ne
Differential Revision: https://reviews.llvm.org/D111637
There is no real source location for code inside prologue as it is
generated by compiler but source locations are being added to code
inside prologue as a side effect of https://reviews.llvm.org/D99269
because buildSpillLoadStore() is using source location of the real
instruction in the basic block if any.
Fixes: SWDEV-307590
Reviewed By: scott.linder, sebastian-ne
Differential Revision: https://reviews.llvm.org/D113100
Detailed description: This change enables the bit field extract patterns
selection to s_bfe_u32 or v_bfe_u32 dependent on the pattern root node
divergence.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D110950
The function to generate S_MOV_B64_IMM_PSEUDO was recently modified to
optimize AGPR to AGPR copy but it missed checking for the SGPR
clobbering for the S_MOV_B64_IMM_PSEUDO generation.
Differential Revision: https://reviews.llvm.org/D113005
Register operands with superclasses can possibly have multiple regBanks
if they have different register types. The regBank ambiguity resolved
during regbankselect should be used to constrain the operand regclass
instead of obtaining one from the MCInstrDesc.
This is a prerequisite patch for D109300 that introduces allocatable AV_*
Superclasses for AMDGPU by combining both VGPRs and AGPRs and we want to
restrain the regclass to either A or V based on the incoming regbank.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112323
With Global ISel getReservedRegs() is called before function is
regbank selected for the first time. Defer caching of usesAGPRs()
in this case.
Differential Revision: https://reviews.llvm.org/D112644
Change numBitsSigned to return the minimum size of a signed integer that
can hold the value. This is different by one from the previous result
but is more consistent with numBitsUnsigned. Update all callers. All
callers are now more consistent between the signed and unsigned cases,
and some callers get simpler, especially the ones that deal with
quantities like numBitsSigned(LHS) + numBitsSigned(RHS).
Differential Revision: https://reviews.llvm.org/D112813
Shift node is still needed to check if the shift is shr or shl to increment/decrement offset. Do not override the node.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112733
mul24 intrinsic's operands are simplified by
AMDGPUTargetLowering::performIntrinsicWOChainCombine(). This change adds
the mul24hi intrinsics in the combine since its operands can be
simplified like that of the mul24 intrinsics.
Differential Revision: https://reviews.llvm.org/D112702
- Move the `s_and exec` to its correct position before the content of
the waterfall loop
- Use the SI_WATERFALL pseudo instruction, like for sdag, to benefit
from optimizations
- Add support for indirect function calls
To support indirect calls, add a G_SI_CALL instruction without register
class restrictions and insert a waterfall loop when applying register
banks.
Differential Revision: https://reviews.llvm.org/D109052
- When an unconditional branch is expanded into an indirect branch, if
there is no scavenged register, an SGPR pair needs spilling to enable
the destination PC calculation. In addition, before jumping into the
destination, that clobbered SGPR pair need restoring.
- As SGPR cannot be spilled to or restored from memory directly, the
spilling/restoring of that SGPR pair reuses the regular SGPR spilling
support but without spilling it into memory. As that spilling and
restoring points are fully controlled, we only need to spill that SGPR
into the temporary VGPR, which needs spilling into its emergency slot.
- The target-specific hook is revised to take additional restore block,
where the restoring code is filled. After that, the relaxation will
place that restore block directly before the destination block and
insert an unconditional branch in any fall-through block into the
destination block.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D106449
The scheduler should set critical/excess register usage thresholds that
are guided by the maximum possible occupancy for the function. This
change is focused on setting proper lower bounds on register usage which
we would typically only see when a specific number of maximum waves is
requested with the "waves-per-eu" attribute, or by setting
"amdgpu-num-vgpr|sgpr" directly. This was broken previously. I have a
follow-on patch that will address issues with the scheduler not
targeting correct upper bounds on register usage which is typical with
launch bounds and min "waves-per-eu".
Changes by this patch:
Set the initial critical register usage thresholds to minimum values
that are determined by the maximum possible occupancy for the function,
or the number of allocatable registers, whichever is lower.
Avoid unisgned overflow if register limits are lower than the register
tracking "ErrorMargin", I.e. when using stress-regalloc=2.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112373
We were bailing out of creating 24-bit muls for results wider than 32
bits in AMDGPUCodeGenPrepare. With the 24-bit mulhi intrinsic, this
change teaches AMDGPUCodeGenPrepare to generate the 48-bit mul
correctly.
Differential Revision: https://reviews.llvm.org/D112395
These intrinsics maps to the 24-bit v_mul_hi instructions.
This change also fixes an incorrect assumption on the associativity of
24-bit mulhi in its SDNode record in tblgen.
Differential Revision: https://reviews.llvm.org/D112394
The combine asserted if constants could not be represented as uint64_t.
Use APInts to fix this.
Differential Revision: https://reviews.llvm.org/D112416
This can merge the acceptable ranges based on the call graph, rather
than the simple application of the attribute. Remove the handling from
the old pass.
Run post-RA SIShrinkInstructions just before post-RA scheduling, instead
of afterwards. After the fixes in D112305 and D112317 this seems to make
no difference, but it paves the way for scheduler tweaks that are
sensitive to the e32 vs e64 encoding of VALU instructions.
Differential Revision: https://reviews.llvm.org/D112341
As described in the comment, the way we change vcc to vcc_lo in these
operands confuses addPhysRegDataDeps into treating them as implicit
pseudo operands. Fix this by setting the correct latency from the
SchedModel after addPhysRegDataDeps wrongly set it to 0.
Differential Revision: https://reviews.llvm.org/D112317