This reverts commit ca907bfb57.
According to michel.daenzer,
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
When memory operations are outstanding on function calls, either the
caller or the callee can insert a waitcnt to ensure that all reads are
finished.
Calls need some time to be executed, so if the callee inserts the
waitcnt, filling the instruction buffer and waiting for memory will be
interleaved, hiding some latency. This comes at the cost of having a
waitcnt inside functions that may not be needed as no memory operations
are outstanding.
For function calls, this is already implemented. The same principal
applies to returns: If the caller inserts a waitcnt after the call, the
callee does not have to wait and the return and memory operation can be
run in parallel.
This commit implements waiting in the caller after returning from a
function call.
Differential Revision: https://reviews.llvm.org/D87674
During the PEI pass, the dead TargetStackID::SGPRSpill spill slots
are not being removed while spilling the FP/BP to memory.
Fixes: SWDEV-250393
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D87032
Before calling target hook to determine if two loads/stores are clusterable,
we put them into different groups to avoid fake cluster due to dependency.
For now, we are putting the loads/stores into the same group if they have
the same predecessor. We assume that, if two loads/stores have the same
predecessor, it is likely that, they didn't have dependency for each other.
However, one SUnit might have several predecessors and for now, we just
pick up the first predecessor that has non-data/non-artificial dependency,
which is too arbitrary. And we are struggling to fix it.
So, I am proposing some better implementation.
1. Collect all the loads/stores that has memory info first to reduce the complexity.
2. Sort these loads/stores so that we can stop the seeking as early as possible.
3. For each load/store, seeking for the first non-dependency instruction with the
sorted order, and check if they can cluster or not.
Reviewed By: Jay Foad
Differential Revision: https://reviews.llvm.org/D85517
If we need a scratch register for the spill don't use the same scratch
register that is being used for the MBUF offset.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D85772
If it is load cluster, we don't need to create the dependency edges(SUb->reg) from SUb to SUa
as they both depend on the base register "reg"
+-------+
+----> reg |
| +---+---+
| ^
| |
| |
| |
| +---+---+
| | SUa | Load 0(reg)
| +---+---+
| ^
| |
| |
| +---+---+
+----+ SUb | Load 4(reg)
+-------+
But if it is store cluster, we need to create it as follow shows to avoid the instruction store
depend on scheduled in-between SUb and SUa.
+-------+
+----> reg |
| +---+---+
| ^
| | Missing +-------+
| | +-------------------->+ y |
| | | +---+---+
| +---+-+-+ ^
| | SUa | Store x 0(reg) |
| +---+---+ |
| ^ |
| | +------------------------+
| | |
| +---+--++
+----+ SUb | Store y 4(reg)
+-------+
Reviewed By: evandro, arsenm, rampitec, foad, fhahn
Differential Revision: https://reviews.llvm.org/D72031
When the callee requires a dynamic stack realignment,
it is not possible to correcty access the incoming
stack arguments using the stack pointer. We reserve a
base pointer in such cases to access the function arguments
inside the callee. The base pointer will hold the incoming
stack pointer value before any kind of delta added to it.
Reviewed By: arsenm, scott.linder
Differential Revision: https://reviews.llvm.org/D78811
Remove the gap left between the stack pointer (s32) and frame pointer
(s34) now that the scratch wave offset is no longer a part of the
calling convention ABI.
Update llvm/docs/AMDGPUUsage.rst to reflect the change.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75657
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in
the entry function prologue. This allows us to removes the scratch wave
offset register from the calling convention ABI.
As part of this change, allow the use of an inline constant zero for the
SOffset of MUBUF instructions accessing the stack in entry functions
when a frame pointer is not requested/required. Entry functions with
calls still need to set up the calling convention ABI stack pointer
register, and reference it in order to address arguments of called
functions. The ABI stack pointer register remains unswizzled, but is now
wave-relative instead of queue-relative.
Non-entry functions also use an inline constant zero SOffset for
wave-relative scratch access, but continue to use the stack and frame
pointers as before. When the stack or frame pointer is converted to a
swizzled offset it is now scaled directly, as the scratch wave offset no
longer needs to be subtracted first.
Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling
convention.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75138
Make the FP register callee saved.
This is tricky because now the FP needs to be spilled in the prolog
relative to the incoming SP register, rather than the frame register
used throughout the rest of the function. I don't like how this
bypassess the standard mechanism for CSR spills just to get the
correct insert point. I may look for a better solution, since all CSR
VGPRs may also need to have all lanes activated. Another option might
be to make getFrameIndexReference change the base register if the
frame index is a CSR, and then try to figure out the right insertion
point in emitProlog.
If there is a free VGPR lane available for SGPR spilling, try to use
it for the FP. If that would require intrtoducing a new VGPR spill,
try to use a free call clobbered SGPR. Only fallback to introducing a
new VGPR spill as a last resort.
This also doesn't attempt to handle SGPR spilling with scalar stores.
llvm-svn: 365372
Every called function could possibly need this to calculate the
absolute address of stack objectst, and this avoids inserting a copy
around every call site in the kernel. It's also somewhat cleaner to
keep this in a callee saved SGPR.
llvm-svn: 363990
This forced the caller to be aware of this, which is an ugly ABI
feature.
Partially reverts r295877. The original reasons for doing this are
mostly fixed. Alloca is now in a non-0 address space, so it should be
OK to have 0 as a valid pointer. Since we treat the absolute address
as the pointer value, this part only really needed to apply to
kernels.
Since r357093, we avoid the need to increment/decrement the offset
register in more cases, and since r354816 the scavenger can fail
without spilling, so it's less critical that we try to avoid an offset
that fits in the MUBUF offset.
Restrict to callable functions for now to split this into 2 steps to
limit thte number of test updates and in case anything breaks.
llvm-svn: 362665
Since the beginning, the offset of a frame index has been consistently
interpreted backwards. It was treating it as an offset from the
scratch wave offset register as a frame register. The correct
interpretation is the offset from the SP on entry to the function,
before the prolog. Frame index elimination then should select either
SP or another register as an FP.
Treat the scratch wave offset on kernel entry as the pre-incremented
SP. Rely more heavily on the standard hasFP and frame pointer
elimination logic, and clean up the private reservation code. This
saves a copy in most callee functions.
The kernel prolog emission code is still kind of a mess relying on
checking the uses of physical registers, which I would prefer to
eliminate.
Currently selection directly emits MUBUF instructions, which require
using a reference to some register. Use the register chosen for SP,
and then ignore this later. This should probably be cleaned up to use
pseudos that don't refer to any specific base register until frame
index elimination.
Add a workaround for shaders using large numbers of SGPRs. I'm not
sure these cases were ever working correctly, since as far as I can
tell the logic for figuring out which SGPR is the scratch wave offset
doesn't match up with the shader input initialization in the shader
programming guide.
llvm-svn: 362661
This is something of a workaround, and the state of stack realignment
controls is kind of a mess. Ideally, we would be able to specify the
stack is infinitely aligned on entry to a kernel.
TargetFrameLowering provides multiple controls which apply at
different points. The StackRealignable field is used during
SelectionDAG, and for some reason distinct from this
hook. StackAlignment is a single field not dependent on the
function. It would probably be better to make that dependent on the
calling convention, and the maximum value for kernels.
Currently this doesn't really change anything, since the frame
lowering mostly does its own thing. This helps avoid regressions in a
future change which will rely more heavily on hasFP.
llvm-svn: 362447
While the stack access instructions don't care about
alignment > 4, some transformations on the pointer calculation
do make assumptions based on knowing the low bits of a pointer
are 0. If a stack object ends up being accessed through its
absolute address (relative to the kernel scratch wave offset),
the addressing expression may depend on the stack frame being
properly aligned. This was breaking in a testcase due to the
add->or combine.
I think some of the SP/FP handling logic is still backwards,
and overly simplistic to support all of the stack features.
Code which tries to modify the SP with inline asm for example
or variable sized objects will probably require redoing this.
llvm-svn: 328831