Commit Graph

5060 Commits

Author SHA1 Message Date
Brendon Cahoon d45a247998 [AMDGPU] Don't remove VGPR to AGPR dead spills from frame info
Removing dead frame indices for VGPR to AGPR spills is incorrect
when the frame index is shared by multiple objects, which may
occur due to stack slot coloring. The problem is that subsequent
code that processes the other object will assert because the stack
frame index is marked dead.

Removing dead frame indices is needed prior to stack slot
coloring, which is what happens with SGPR to VGPR spills. These
spills are lowered prior to stack slot coloring, but the VGPR
to AGPR spills are processed afterwards during the Prolog/Epilog
Inserter pass. This patch marks the VGPR to AGPR spill slot as
dead if the slot is not used by another object.

Differential Revision: https://reviews.llvm.org/D115996
2021-12-23 11:09:19 -06:00
Petar Avramovic fd3cde600b AMDGPU/GlobalISel: Fix attempt to select non-legal instr in mir test
Delete inst-select-insert.xfail.mir.
G_INSERT instructions in inst-select-insert.xfail.mir are no longer
legal after D114198. This breaks build bots, since builds with
LLVM_ENABLE_ASSERTIONS=Off don't check for legality and report cannot
select while build with LLVM_ENABLE_ASSERTIONS=On reports instruction
is not legal.
2021-12-23 16:14:33 +01:00
Petar Avramovic 29f88b93fd [GlobalISel] Rework more/fewer elements for vectors
Artifact combiner is not able to access individual elements after using
LCMTy style merge/unmerge, extract and insert to change vector number of
elements (pad with undef or split to sub-vector instructions).
Use unmerge to individual elements instead and then merge elements into
requested types.
Change argument lowering for vectors and moreElementsVector to use
buildPadVectorWithUndefElements and buildDeleteTrailingVectorElements.
FewerElementsVector had a few helpers that had different behavior,
introduce new helper for most of the opcodes.
FewerElementsVector helper is more flexible since it can create leftover
instruction smaller then requested type (useful in case target wants to
avoid pad with undef and use fewer registers). If target does not want
leftover of different type it should call more elements first.
Some helpers were performing more elements first to have split without
leftover. Opcodes that used this helper use clampMaxNumElementsStrict
(does more elements first) in LegalizerInfo to avoid test changes.
Fixes failures caused by failing to combine artifacts created during
more/fewer elements vector.

Differential Revision: https://reviews.llvm.org/D114198
2021-12-23 14:30:02 +01:00
Petar Avramovic d2863088ab GlobalISel: Regen vector mir tests, add tests for vector arg lowering
Precommit for D114198 (Rework more/fewer elements for vectors).
Regenerate auto-generated mir tests for vectors (use CHECK-NEXT instead
of CHECK). Remove -global-isel-abort=0 where it is no longer needed.
Add mir tests for different AMDGPU sub-targets and they way they lower
function vector arguments (tests for legalization artifact combiner).
2021-12-23 14:30:01 +01:00
alex-t e4103c91f8 [AMDGPU] Select build_vector DAG nodes according to the divergence
This change enables divergence-driven instruction selection for the build_vector DAG nodes.
It also enables packed i16 instructions for GFX9.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D116187
2021-12-23 02:27:12 +03:00
Ron Lieberman 09b53296cf Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range"
This reverts commit 9075009d1f.

 Failed amdgpu runtime buildbot # 3514
2021-12-22 11:39:28 -05:00
RamNalamothu 9075009d1f [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as a live-in on the function entry to
preserve its value when we have calls so that it gets saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame and the above approach makes it difficult to track the
return address when the CFI information is emitted during the frame lowering,
due to the involvment of understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

And doing the above poses an issue that now the return instruction uses undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction through the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the
`S_SETPC_B64_return` during the `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D114652
2021-12-22 20:51:12 +05:30
Matt Arsenault 01d97dfde1 AMDGPU/GlobalISel: Regenerate test checks 2021-12-21 10:57:46 -05:00
Jay Foad 17006033f9 [GlobalISel] Verify operand types for G_SHL, G_LSHR, G_ASHR
Differential Revision: https://reviews.llvm.org/D115868
2021-12-21 11:59:33 +00:00
Matt Arsenault c222972442 AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting
These actions should only be used for adjusting the register types
(and the memory type as needed to satisfy the register
type). Unaligned accesses should be split as a type of lowering.

This has the effect of improving the code in many cases since now we
produce zextloads instead of separate loads with ands. The load/store
legality rules still seem far more complicated than necessary though.
2021-12-20 18:07:11 -05:00
alex-t 19727e31fb [AMDGPU] Enable divergence predicates for ctlz/cttz
ctlz/cttz get lowered to the set of target opcodes
This change enables the ISel to select SALU or VALU form according to the SDNode divergence.
CTLZ - S_FLBIT_I32_B32 if uniform and V_FFBH_U32_e64 if divergent
CTTZ - S_FF1_I32_B32   if uniform and V_FFBL_B32_e64 if divergent
Also @llvm.amdgcn.sffbh.i32 gets lowered to S_FLBIT_I32 if uniform and V_FFBH_I32_e64 if divergent
NOTE: 64bit versions S_FF1_I32_B64 and S_FLBIT_I32_B64 are not currently supported by the DAG ISel.
ctlz/cttz with i64 input are split into two 32bit instructions. Nevertheless, they already have the patterns
and were equipped with the divergence predicates to make sure they will be selected correctly when enabled.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D116044
2021-12-20 20:53:48 +03:00
alex-t 98d09705e1 [AMDGPU] Re-enabling divergence predicates for min/max
This patch enables divergence predicates for min/max nodes.
It makes ISD::MIN/MAX selected to S_MIN_I(U)32/S_MAX_I(U)32 or V_MIN_I(U)32_e64/V_MAX_I(U)32_e64

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D115954
2021-12-20 16:10:55 +03:00
alex-t 1448aa9dbd [AMDGPU] Expand not pattern according to the XOR node divergence
The "not" is defined as XOR $src -1.
 We need to transform this pattern to either S_NOT_B32 or V_NOT_B32_e32
 dependent on the "xor" node divergence.

Reviewed By: rampitec, foad

Differential Revision: https://reviews.llvm.org/D115884
2021-12-20 14:41:38 +03:00
Matt Arsenault 37a203f63e AMDGPU: Regenerate more mir test checks with -NEXT 2021-12-18 11:38:30 -05:00
Matt Arsenault 591371f7df AMDGPU: Regenerate some mir test checks with -NEXT 2021-12-18 10:46:15 -05:00
rkorsa c680fb69d6 [AMDGPU] Fixes in ISelDAG path and GlobalISel path for 'bias' operand with A16 bit on
The LOD bias operand is of type 'half' when the A16-bit is ON' for MIMG instructions.
'bias' is only 16-bit but occupies 32-bits with upper 16-bits containing junk.
The patch fixes both the paths(ISelDAG and GlobalISel) for proper encoding of LOD bias operand.

Differential Revision: https://reviews.llvm.org/D111754
2021-12-17 16:11:51 +05:30
Mircea Trofin 09103807e7 [NFC][regalloc] Introduce the RegAllocEvictionAdvisorAnalysis
This patch introduces the eviction analysis and the eviction advisor,
the default implementation, and the scaffolding for introducing the
other implementations of the advisor.

Differential Revision: https://reviews.llvm.org/D115707
2021-12-16 17:56:46 -08:00
Ron Lieberman f4420f5224 Revert "AMDGPU: Update pass pipeline test"
needed to match revert of
rG2b4876157562: AMDGPU: Remove AMDGPUFixFunctionBitcasts pass

This reverts commit 7ca355225d.
2021-12-16 21:39:04 +00:00
Ron Lieberman 8a85be807b Revert "AMDGPU: Remove AMDGPUFixFunctionBitcasts pass"
Offload abort in Nekbone

This reverts commit 2b48761575.
2021-12-16 21:21:32 +00:00
Jay Foad cce93b3397 [MachineVerifier] Undef subreg operands do not require subranges
D112556 added verification that the live interval for a subreg operand
must have subranges. This patch fixes a corner case, where if all subreg
operands for a particular register are undef uses then no subranges
are required. This matches how LiveIntervalCalc would build the live
intervals in the first place, since an undef use is not considered
to read the register.

Before this patch, CodeGen/AMDGPU/no-remat-indirect-mov.mir would fail
with -early-live-intervals:

 # After Live Interval Analysis
...
*** Bad machine code: Live interval for subreg operand has no subranges ***
- function:    index_vgpr_waterfall_loop
- basic block: %bb.1  (0x6a9a968) [352B;496B)
- instruction: 432B	%24:vgpr_32 = V_MOV_B32_e32 undef %18.sub0:vreg_512, implicit $exec, implicit %18:vreg_512, implicit $m0
- operand 1:   undef %18.sub0:vreg_512

Differential Revision: https://reviews.llvm.org/D115360
2021-12-16 09:49:27 +00:00
Matt Arsenault 7ca355225d AMDGPU: Update pass pipeline test 2021-12-15 18:37:18 -05:00
Matt Arsenault f0cc43cc91 AMDGPU: Use v_accvgpr_mov_b32 when copying AGPR tuples on gfx90a
This is an optimization, but also fixes a compile failure when no free
VGPRs are available. The problem still exists for gfx908 where a
scratch register is still required. This also still exists for the
SGPR to AGPR case.
2021-12-15 18:20:49 -05:00
Matt Arsenault 20a6cbd220 AMDGPU: Regenerate checks 2021-12-15 18:20:49 -05:00
Matt Arsenault 2b48761575 AMDGPU: Remove AMDGPUFixFunctionBitcasts pass
This was a workaround for not supporting indirect calls when
instcombine didn't eliminate constant expression casts of the callee
at -O0. Indirect calls are supposed to work now, so drop the hack.
2021-12-15 18:20:48 -05:00
Jay Foad 54fc9eb9b3 [AMDGPU] Use v_fma_f16 on GFX10
Teach convertToThreeAddress to use the V_FMA_F16_gfx9 pseudo (i.e. the
standard instruction in GFX9 onwards) instead of V_FMA_F16 (the legacy
pseudo for GFX8 compatibility, which is no longer supported in GFX10).
This follows the example of macToMad in SIFoldOperands.

Differential Revision: https://reviews.llvm.org/D115731
2021-12-15 13:14:48 +00:00
Jay Foad 4db7422771 [AMDGPU] Improve zeroesHigh16BitsOfDest for GFX9 legacy opcodes
Pseudos like V_MAD_U16 and V_FMA_F16 map down to what GFX9 calls
v_mad_legacy_u16 and v_fma_legacy_f16, which are documented to have the
same zeroing behaviour as on GFX8.

Differential Revision: https://reviews.llvm.org/D115729
2021-12-15 13:14:48 +00:00
Jon Chesterfield 624f12d34f [amdgpu] Drop lowering of LDS used by global variables
Approximately revert D103431.

LDS variables are allocated at kernel launch and deallocated at kernel exit.
The address is therefore kernel execution dependent. Global variables are
initialized by values written to .data, which can't be done for a LDS variable
as there is no kernel running, or by a global constructor. Initializing the
global to the address of some LDS allocated by a global constructor is possible
but indistinguishable from undef.

Assigning the address of a LDS variable to a global should be a sema error. It
isn't for openmp, haven't checked other languages. Failing that it could be set
to undef, perhaps in this pass.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D115413
2021-12-14 21:59:26 +00:00
Jay Foad 819fb457a6 [AMDGPU] Regenerate checks in high-bits-zeroed-16-bit-ops.mir 2021-12-14 15:33:35 +00:00
Stanislav Mekhanoshin c4aef9c281 Check subrange liveness at rematerialization
LiveRangeEdit::allUsesAvailableAt checks that VNI at use is the same
as at the original use slot. However, the VNI can be the same while
a specific subrange needed for use can be dead at the new index.

This patch adds subrange liveness check if there is a subreg use.

Fixes: SWDEV-312810

Differential Revision: https://reviews.llvm.org/D115278
2021-12-13 11:11:55 -08:00
Neubauer, Sebastian 26924b57e8 [AMDGPU] Ignore special ABI registers for graphics
Fixed ABI arguments are compute specific and should not be added to
graphics shaders or functions, so do not try to add them.

Differential Revision: https://reviews.llvm.org/D115344
2021-12-13 16:44:37 +01:00
Jon Chesterfield 28345d7f6f [amdgpu] Add regression test for LDS in metadata 2021-12-13 13:35:38 +00:00
Jon Chesterfield 24b28db8cc [amdgpu] Increase alignment of all LDS variables
Currently the superalign option only increases the alignment of
variables that are moved into the module.lds block. Change that to all LDS
variables. Also only increase the alignment once, instead of once per function.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D115488
2021-12-12 19:30:32 +00:00
Matt Arsenault 06b90175e7 AMDGPU: Remove fixed function ABI option 2021-12-10 19:41:19 -05:00
Christudasan Devadasan cf58b9ce98 [AMDGPU] Add AV class spill pseudo instructions
While enabling vector superclasses with D109301,
the AV spills are converted into VGPR spills by
introducing appropriate copies. The whole thing
ended up adding two instructions per spill (a copy
+ vgpr spill pseudo) and caused an incorrect
liverange update during inline spiller.

This patch adds the pseudo instructions for all
AV spills from 32b to 1024b and handles them in
the way all other spills are lowered.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D115439
2021-12-10 03:10:34 -05:00
Konstantin Schwarz a344653725 [GlobalISel] Fix IRTranslator for constexpr fcmp
The existing code assumed fcmp to always be an Instruction, but it can also be a ConstExpr.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D115450
2021-12-10 08:49:12 +01:00
Matt Arsenault 017ef78549 AMDGPU: Mark scc defs dead in SGPR to VMEM path for no free SGPRs
This introduces verifier errors into this broken situation which we do
not handle correctly, which is better than being silently
miscompiled. For the emergency stack slot, the scavenger likes to move
the restore instruction as late as possible, which ends up separating
the SCC def from the conditional branch.
2021-12-08 18:40:49 -05:00
Matt Arsenault 0383872295 AMDGPU: Simplify test for SGPR spilling bug
Remove the control flow from the test from
25eb7fa01d. It's not necessary to
reproduce the original assert with the patch reverted. The control
flow happens to expose a different issue that calls for a separate
test in a future change.
2021-12-08 18:40:44 -05:00
Matt Arsenault a92cf7cea5 AMDGPU: Mark SCC def as dead when expanding frame indexes
This improves liveness queries in a future change.
2021-12-08 18:40:39 -05:00
Sanjay Patel e9179a6a02 [Support] improve known bits analysis for multiply by power-of-2 (1 set bit)
This can be viewed as recognizing that multiply-by-power-of-2 doesn't
have a carry into the top bit of an M-bit * N-bit number.

Enhancing canonicalization of mul -> select might also handle some of
these if we were ok with increasing instruction count with casts in
some cases.

This doesn't help https://llvm.org/PR49055 , but it's a simpler
pattern that we miss.
Note: "-sccp" already gets these examples using a constant
range analysis.

Differential Revision: https://reviews.llvm.org/D114962
2021-12-08 11:50:05 -05:00
Jack Andersen f108c7f59d [GlobalISel] Allow DBG_VALUE to use undefined vregs before LiveDebugValues.
Expanding on D109750.

Since `DBG_VALUE` instructions have final register validity determined in
`LDVImpl::handleDebugValue`, there is no apparent reason to immediately prune
unused register operands as their defs are erased. Consequently, this renders
`MachineInstr::eraseFromParentAndMarkDBGValuesForRemoval` moot; gaining a
substantial performance improvement.

The only necessary changes involve making relevant passes consider invalid
DBG_VALUE vregs uses as valid.

Reviewed By: MatzeB

Differential Revision: https://reviews.llvm.org/D112852
2021-12-05 15:55:59 -05:00
Matt Arsenault 729bf9b26b AMDGPU: Enable fixed function ABI by default
Code using indirect calls is broken without this, and there isn't
really much value in supporting the old attempt to vary the argument
placement based on uses. This resulted in more argument shuffling code
anyway.

Also have the option stop implying all inputs need to be passed. This
will no rely on the amdgpu-no-* attributes to avoid passing
unnecessary values.
2021-12-04 10:49:18 -05:00
Matt Arsenault 2959e082e1 AMDGPU: Assume all amdhsa kernarg passed implicit arguments by default
Previously we would require adding an attribute to kernels to enable
the inputs passed in the kernarg segment, accessed by
llvm.amdgcn.implicitarg.ptr. This violates the principle of being
correct by default. Some OpenMP testcases were broken recently since
it wasn't correctly setting this attribute, and no known frontends are
setting this to anything other than the maximum.

Most of the test changes are from load widening of argument loads
since there now more implied dereferenceable bytes.
2021-12-04 10:38:25 -05:00
Matt Arsenault ae0ba7dedd AMDGPU: Optimize out implicit kernarg argument allocation if unused
We already annotate whether llvm.amdgcn.implicitarg.ptr is known to be
unused. Start using it to avoid allocating the implicit arguments if
unneeded.
2021-12-04 10:38:25 -05:00
Jay Foad 2774bad112 [AMDGPU] Change llvm.amdgcn.image.bvh.intersect.ray to take vec3 args
The ray_origin, ray_dir and ray_inv_dir arguments should all be vec3 to
match how the hardware instruction works.

Don't change the API of the corresponding OpenCL builtins.

Differential Revision: https://reviews.llvm.org/D115032
2021-12-04 10:32:11 +00:00
Jay Foad bc7dacf589 [AMDGPU] Generate checks for llvm.amdgcn.image.bvh.intersect.ray
Differential Revision: https://reviews.llvm.org/D114955
2021-12-04 10:32:11 +00:00
Stanislav Mekhanoshin e1d6306815 [AMDGPU] Fixed incomplete definitions in twoaddr-fma.mir. NFC. 2021-12-03 10:18:03 -08:00
Stanislav Mekhanoshin 3b17cb1506 [AMDGPU] Kill def when folding immediate in two-addr pass
Two-address pass works right before RA and if an immediate
was folded into an instruction there is nothing to remove
the dead def. We end up with something like:

	v_mov_b32_e32 v14, 0xc1700000
	v_mov_b32_e32 v14, 0x41200000
	v_fmaak_f32 v51, s67, v19, 0xc1700000
	v_fmaak_f32 v38, v51, v19, 0x4120000

The patch kills the dead move instruction right in the folding.

Differential Revision: https://reviews.llvm.org/D114999
2021-12-03 09:37:49 -08:00
Jay Foad b670dcb81b [AMDGPU] Add some more GFX10 test coverage 2021-12-03 14:03:31 +00:00
Jay Foad b29b6f92af [AMDGPU] Add some more GFX10 GlobalISel test coverage 2021-12-03 13:40:27 +00:00
Petar Avramovic 0b34ffe4a6 AMDGPU/GlobalISel: Add clamp combine
Add clamp combine. Source is fminnum(fmaxnum(Val, 0.0), 1.0) or
fmaxnum(fminnum(Val, 1.0), 0.0) or fmed3 intrinsic with 0.0 and
1.0 as two out of three operands.

Differential Revision: https://reviews.llvm.org/D90052
2021-12-03 12:49:39 +01:00