Commit Graph

9 Commits

Author SHA1 Message Date
Sebastian Neubauer a343b9b032 Revert "[AMDGPU] Insert waitcnt after returning from call"
This reverts commit ca907bfb57.

According to michel.daenzer,
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
2020-09-23 17:16:39 +02:00
Sebastian Neubauer ca907bfb57 [AMDGPU] Insert waitcnt after returning from call
When memory operations are outstanding on function calls, either the
caller or the callee can insert a waitcnt to ensure that all reads are
finished.
Calls need some time to be executed, so if the callee inserts the
waitcnt, filling the instruction buffer and waiting for memory will be
interleaved, hiding some latency. This comes at the cost of having a
waitcnt inside functions that may not be needed as no memory operations
are outstanding.

For function calls, this is already implemented. The same principal
applies to returns: If the caller inserts a waitcnt after the call, the
callee does not have to wait and the return and memory operation can be
run in parallel.

This commit implements waiting in the caller after returning from a
function call.

Differential Revision: https://reviews.llvm.org/D87674
2020-09-23 12:17:59 +02:00
Jay Foad 4bdab2e86a [AMDGPU] Fix offset for REL32_HI relocs
The addend in a REL32 reloc needs to be adjusted to account for the
offset from the PC value returned by the s_getpc instruction to the
point where the reloc is applied. This was being done correctly for
(GOTPC)REL32_LO but not for (GOTPC)REL32_HI. This will only make a
difference if the target symbol happens to get loaded almost exactly
a multiple of 4G away from the relocated instructions.

Differential Revision: https://reviews.llvm.org/D86938
2020-09-02 10:55:55 +01:00
Changpeng Fang e7081d117a AMDGPU: Implement waterfall loop for MIMG instructions with 256-bit SRsrc
Summary:
  When the resource descriptor is of vgpr, we need a waterfall loop
to read into a sgpr. In this patchm we generalized the  implementation
to work for any regster class sizes, and extend the work to MIMG
instructions.

Fixes: SWDEV-223405

Reviewers:
  arsenm, nhaehnle

Differential Revision:
  https://reviews.llvm.org/D82603
2020-08-18 16:27:36 -07:00
Carl Ritson 5bf2a9dd40 [AMDGPU] Update VMEM scalar write hazard mitigation sequence
Using s_waitcnt_depctr 0xffe3 is potentially faster than v_nop.

Reviewed By: rampitec, foad

Differential Revision: https://reviews.llvm.org/D83872
2020-07-16 11:37:45 +09:00
Jay Foad 590964c835 [AMDGPU] More accurate gfx10 latencies
Differential Revision: https://reviews.llvm.org/D81012
2020-06-04 10:29:32 +01:00
Jay Foad 42a5560503 [AMDGPU] New SIInsertHardClauses pass
Enable clausing of memory loads on gfx10 by adding a new pass to insert
the s_clause instructions that mark the start of each hard clause.

Differential Revision: https://reviews.llvm.org/D79792
2020-05-14 18:54:49 +01:00
Christudasan Devadasan b8a616ec59 [AMDGPU] Fixed the test by adding the triple. 2020-05-06 00:14:01 +05:30
Christudasan Devadasan 375cec4b6c [AMDGPU] Introduce more scratch registers in the ABI.
The AMDGPU target has a convention that defined all VGPRs
(execept the initial 32 argument registers) as callee-saved.
This convention is not efficient always, esp. when the callee
requiring more registers, ended up emitting a large number of
spills, even though its caller requires only a few.

This patch revises the ABI by introducing more scratch registers
that a callee can freely use.
The 256 vgpr registers now become:
  32 argument registers
  112 scratch registers and
  112 callee saved registers.
The scratch registers and the CSRs are intermixed at regular
intervals (a split boundary of 8) to obtain a better occupancy.

Reviewers: arsenm, t-tye, rampitec, b-sumner, mjbedy, tpr

Reviewed By: arsenm, t-tye

Differential Revision: https://reviews.llvm.org/D76356
2020-05-05 23:02:58 +05:30