llvm-project

Commit Graph

Author	SHA1	Message	Date
Nico Weber	a278250b0f	Revert "Cleanup codegen includes" This reverts commit `7f230feeea`. Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang, and many LLVM tests, see comments on https://reviews.llvm.org/D121169	2022-03-10 07:59:22 -05:00
serge-sans-paille	7f230feeea	Cleanup codegen includes after: 1061034926 before: 1063332844 Differential Revision: https://reviews.llvm.org/D121169	2022-03-10 10:00:30 +01:00
Venkata Ramanaiah Nalamothu	04fff547e2	[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve its value when we have calls so that it gets saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame and the above approach makes it difficult to track the return address when the CFI information is emitted during the frame lowering, due to the involvment of understanding the control flow. This patch moves the return address ABI registers s[30:31] into callee saved registers range and stops adding live-in for return address registers, so that the CFI machinery will know where the return address resides when CSR save/restore happen during the frame lowering. And doing the above poses an issue that now the return instruction uses undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the `S_SETPC_B64_return` during the `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm, ronlieb Differential Revision: https://reviews.llvm.org/D114652	2022-03-09 12:18:02 +05:30
Stanislav Mekhanoshin	e7b362d75d	[AMDGPU] Add v_mov_b64 gfx940 opcode Differential Revision: https://reviews.llvm.org/D121023	2022-03-07 12:07:12 -08:00
Stanislav Mekhanoshin	2c830c8fab	[AMDGPU] gfx940: support V_FMAMK_F32 and V_FMAAK_F32 Differential Revision: https://reviews.llvm.org/D120769	2022-03-07 11:31:01 -08:00
Jay Foad	5ddfedc956	[AMDGPU] Fix deleting of move-immediate instructions after folding SIInstrInfo::FoldImmediate tried to delete move-immediate instructions after folding them into their only use. This did not work because it was checking hasOneNonDBGUse after doing the fold, at which point there should be no uses. This seems to have no effect on codegen, it just means less stuff for DCE to clean up later. Differential Revision: https://reviews.llvm.org/D120815	2022-03-02 16:11:16 +00:00
Jay Foad	8bed52c9eb	[AMDGPU] Make more use of madmk/fmamk instructions In convertToThreeAddress handle VOP2 mac/fmac instructions with a literal src0 operand, since these are prime candidates for converting to madmk/fmamk. Previously this would only happen if src0 (or src1) was a register defined by a move-immediate instruction, but in many cases these operands have already been folded because SIFoldOperands runs before TwoAddressInstructionPass. Differential Revision: https://reviews.llvm.org/D120736	2022-03-02 10:22:10 +00:00
Jay Foad	289339140e	[AMDGPU] Handle legacy multiply-accumulate opcodes in convertToThreeAddress Handle V_MAC_LEGACY_F32 and V_FMAC_LEGACY_F32 in convertToThreeAddress, to avoid the need for an extra mov instruction in some cases. Differential Revision: https://reviews.llvm.org/D120704	2022-03-01 16:58:00 +00:00
Jay Foad	9ac3a85047	[AMDGPU] Disentangle MFMA handling in convertToThreeAddress. NFC. Move MFMA handling to the top of convertToThreeAddress and pull IsF16 calculation out of the switch. I think this makes it clearer exactly which mac/fmac opcodes are handled, since they are now listed in the switch with minimal extra clutter. Differential Revision: https://reviews.llvm.org/D120703	2022-03-01 16:56:56 +00:00
Jay Foad	68895098d1	[AMDGPU] Preserve src2_modifiers in convertToThreeAddress Found by code inspection. I don't think it makes a difference with current codegen, because if any source modifiers were present we would have selected mad/fma instead of mac/fmac in the first place. Differential Revision: https://reviews.llvm.org/D120709	2022-03-01 14:48:25 +00:00
Sebastian Neubauer	6527b2a4d5	[AMDGPU][NFC] Fix typos Fix some typos in the amdgpu backend. Differential Revision: https://reviews.llvm.org/D119235	2022-02-18 15:05:21 +01:00
Sebastian Neubauer	1f0aadfa62	[AMDGPU] Fix kill flag on overlapping sgpr copy Same as on vgpr copies, we cannot kill the source register if it overlaps with the destination register. Otherwise, the kill of the source register will also count as a kill for the destination register. Differential Revision: https://reviews.llvm.org/D120042	2022-02-18 14:36:00 +01:00
Matt Arsenault	8b2ca766f0	AMDGPU: Reserve v32 if we may need to copy between AGPRs on gfx908 We need to guarantee cheap copies between AGPRs, and unfortunately gfx908 cannot directly do this. Theoretically we could set the scavenger up with an emergency spill slot, but it also feels unreasonable to pay that cost for what was assumed to be a simple and cheap copy. Pick a register that doesn't conflict with any ABI registers. This does not address the same issue when copying from SGPR to AGPR for gfx90a (this coincidentally fixes it for gfx908), but that's less interesting since the register allocator shouldn't be proactively introducing such copies. One edge case I'm worried about is respecting the VGPR budget implied by amdgpu-waves-per-eu. If the theoretical upper bound of a function is 32 VGPRs, this will force the actual count to be 33. This is also broken if inline assembly uses/defs something in v32. The coalescer will eliminate the intermediate vreg between the def and use, and the introduced copy will clobber the user value. (cherry picked from commit 3335784ac2d587ff4eac04586e189532ae8b2607)	2022-02-08 11:14:52 -05:00
Carl Ritson	a1fb307b4b	[AMDGPU] Allow hoisting of some VALU compare instructions Conversatively allow hoisting/sinking of VALU comparisons. If the result of a comparison is masked with exec, narrowing the set of active lanes, then it is safe to hoist it as the masking instruction will never by hoisted. Heuristically this is also true for sinking, as we do not expect the result of a sunk comparison that is masked with exec to be used outside of the loop. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D118975	2022-02-08 11:27:23 +09:00
Ruiling Song	0719c43735	AMDGPU: Don't clobber source register for V_SET_INACTIVE_* The WWM register has unmodeled register liveness, For v_set_inactive_*, clobberring source register is dangerous because it will overwrite the inactive lanes. When the source vgpr is dead at v_set_inactive_lane, the inactive lanes may be not really dead. This may make common optimizations doing wrong. For example in a simple if-then cfg in Machine IR: bb.if: %src = bb.then: %src1 = COPY %src %dst = V_SET_INACTIVE %src1(tied-def 0), %inactive bb.end ... = PHI [0, %bb.then] [%src, %bb.if] The register coalescer will think it is safe to optimize "%src1 = COPY %src" in bb.then. And at the same time, there is no interference for the PHI in bb.end. The source and destination values of the PHI will be assigned the same register. The single PHI register will be overwritten by the v_set_inactive, then we would get wrong value in bb.end. With this change, we will copy the content of the source register before setting inactive lanes after register allocation. Yes, this will sacrifice the WWM code generation a little, but I don't have any better idea to do things correctly. Differential Revision: https://reviews.llvm.org/D117482	2022-02-06 12:38:26 +08:00
Jay Foad	ddd3807e69	[AMDGPU] Use new target MMO flag MONoClobber This allows us to set the noclobber flag on (the MMO of) a load instruction instead of on the pointer. This fixes a bug where noclobber was being applied to all loads from the same pointer, even if some of them were clobbered. Differential Revision: https://reviews.llvm.org/D118775	2022-02-02 17:12:36 +00:00
Stanislav Mekhanoshin	dbf278b984	[AMDGPU] Prevent aliasing of SrcC and Dst in MAI Form the MAI spec: It’s ok that Src_C and vDst are the exact same VGPRs or Src_C and vDst are completely separated. The case that Src_C and vDst are overlapping should be avoid as new value could be written to accumulator input before it gets read. Note that this inevitably increases register pressure to the point where some programs will become uncompilable. This patch separates MAC and FMA versions of MFMA instructions using either tied dst and src2 or earlyclobber dst. Fixes: SWDEV-318900 Differential Revision: https://reviews.llvm.org/D117844	2022-01-26 14:48:20 -08:00
Stanislav Mekhanoshin	c27f8fb968	[AMDGPU] Remove cndmask from readsExecAsData Differential Revision: https://reviews.llvm.org/D117909	2022-01-24 11:24:47 -08:00
Stanislav Mekhanoshin	41ebd19681	[AMDGPU] Do not ignore exec use where exec is read as data Compares, v_cndmask_b32, and v_readfirstlane_b32 use EXEC in a way which modifies the result. This implicit EXEC use shall not be ignored for the purposes of instruction moves. Differential Revision: https://reviews.llvm.org/D117814	2022-01-20 14:05:22 -08:00
Matt Arsenault	7f26a1027f	AMDGPU/GlobalISel: Introduce pseudo to copy sp in call sequences Arbitrary stack pointers are accessed using MUBUF instructions with the voffset field, which is interpreted as the swizzled address. We want to fold fold into the MUBUF form to use the SP in the SGPR offset, and previously we were special casing the interpretation of the pointer value if the access memory operand said it was relative to the stack pointer. `690f5b7a01` removed this check, and moved the DAG path to special casing copies from SGPRs. This is not an entirely sound approach, since it's still changing the interpretation of pointer values based the context. Introduce a new pseudo which corresponds to the wave-to-vector address transform. This way the memory instruction has consistent semantics where the incoming pointer is always interpreted as a vector address, and we're not obligated to optimize into the MUBUF offset-only addressing mode. The DAG should probably have an equivalent pseudo. This should fix some correctness issues, and folding this into addressing modes will be a future optimization patch.	2022-01-19 10:13:31 -05:00
David Salinas	c0581f7df6	Revert D109159 : Revert "[amdgpu] Enable selection of `s_cselect_b64`." This reverts commit `640beb38e7`. That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort). Reverting until we have a better solution to s_cselect_b64 codegen cleanup Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456 Reviewed By: kzhuravl Differential Revision: https://reviews.llvm.org/D116686 Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2	2022-01-11 21:14:09 +00:00
Nico Weber	085f078307	Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."" This reverts commit `859ebca744`. The change contained many unrelated changes and e.g. restored unit test failes for the old lld port.	2022-01-05 13:10:25 -05:00
David Salinas	859ebca744	Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`." This reverts commit `640beb38e7`. That commit caused performance degradtion in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort). Reverting until we have a better solution to s_cselect_b64 codegen cleanup Change-Id: Ibf8e397df94001f248fba609f072088a46abae08 Reviewed By: kzhuravl Differential Revision: https://reviews.llvm.org/D115960 Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105	2022-01-05 17:57:32 +00:00
Kazu Hirata	2d303e6781	Remove redundant return and continue statements (NFC) Identified with readability-redundant-control-flow.	2021-12-24 23:17:54 -08:00
Ron Lieberman	09b53296cf	Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range" This reverts commit `9075009d1f`. Failed amdgpu runtime buildbot # 3514	2021-12-22 11:39:28 -05:00
RamNalamothu	9075009d1f	[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve its value when we have calls so that it gets saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame and the above approach makes it difficult to track the return address when the CFI information is emitted during the frame lowering, due to the involvment of understanding the control flow. This patch moves the return address ABI registers s[30:31] into callee saved registers range and stops adding live-in for return address registers, so that the CFI machinery will know where the return address resides when CSR save/restore happen during the frame lowering. And doing the above poses an issue that now the return instruction uses undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the `S_SETPC_B64_return` during the `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D114652	2021-12-22 20:51:12 +05:30
Kazu Hirata	f78c1b07cb	[Target] Use range-based for loops (NFC)	2021-12-17 10:11:08 -08:00
Matt Arsenault	f0cc43cc91	AMDGPU: Use v_accvgpr_mov_b32 when copying AGPR tuples on gfx90a This is an optimization, but also fixes a compile failure when no free VGPRs are available. The problem still exists for gfx908 where a scratch register is still required. This also still exists for the SGPR to AGPR case.	2021-12-15 18:20:49 -05:00
Jay Foad	54fc9eb9b3	[AMDGPU] Use v_fma_f16 on GFX10 Teach convertToThreeAddress to use the V_FMA_F16_gfx9 pseudo (i.e. the standard instruction in GFX9 onwards) instead of V_FMA_F16 (the legacy pseudo for GFX8 compatibility, which is no longer supported in GFX10). This follows the example of macToMad in SIFoldOperands. Differential Revision: https://reviews.llvm.org/D115731	2021-12-15 13:14:48 +00:00
Jay Foad	61f8af2657	[AMDGPU] Remove a FIXME implemented in D11061	2021-12-13 14:46:40 +00:00
Christudasan Devadasan	cf58b9ce98	[AMDGPU] Add AV class spill pseudo instructions While enabling vector superclasses with D109301, the AV spills are converted into VGPR spills by introducing appropriate copies. The whole thing ended up adding two instructions per spill (a copy + vgpr spill pseudo) and caused an incorrect liverange update during inline spiller. This patch adds the pseudo instructions for all AV spills from 32b to 1024b and handles them in the way all other spills are lowered. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D115439	2021-12-10 03:10:34 -05:00
Stanislav Mekhanoshin	3b17cb1506	[AMDGPU] Kill def when folding immediate in two-addr pass Two-address pass works right before RA and if an immediate was folded into an instruction there is nothing to remove the dead def. We end up with something like: v_mov_b32_e32 v14, 0xc1700000 v_mov_b32_e32 v14, 0x41200000 v_fmaak_f32 v51, s67, v19, 0xc1700000 v_fmaak_f32 v38, v51, v19, 0x4120000 The patch kills the dead move instruction right in the folding. Differential Revision: https://reviews.llvm.org/D114999	2021-12-03 09:37:49 -08:00
Christudasan Devadasan	5297cbf045	[AMDGPU] Enable copy between VGPR and AGPR classes during regalloc Greedy register allocator prefers to move a constrained live range into a larger allocatable class over spilling them. This patch defines the necessary superclasses for vector registers. For subtargets that support copy between VGPRs and AGPRs, the vector register spills during regalloc now become just copies. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D109301	2021-11-29 22:19:33 -05:00
Christudasan Devadasan	654c89d85a	[AMDGPU] Make vector superclasses allocatable The combined vector register classes with both VGPRs and AGPRs are currently unallocatable. This patch turns them into allocatable as a prerequisite to enable copy between VGPR and AGPR registers during regalloc. Also, added the missing AV register classes from 192b to 1024b. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D109300	2021-11-26 00:42:12 -05:00
Jay Foad	ff7f2cfa95	[AMDGPU] Add an implicit use of M0 to all V_MOV_B32_indirect_read/write NFCI. Previously the implicit use was added to V_MOV_B32_indirect_read when building the instruction. V_MOV_B32_indirect_write didn't have an implicit use of M0 at all, but apparently it did not cause any problems. Differential Revision: https://reviews.llvm.org/D114239	2021-11-19 19:00:17 +00:00
Jay Foad	30b27ecfc2	[AMDGPU] Use new opcode for indexed vgpr reads Introduce V_MOV_B32_indirect_read for indexed vgpr reads (and rename the old V_MOV_B32_indirect to V_MOV_B32_indirect_write) so they can be unambiguously distinguished from regular V_MOV_B32_e32. Previously they were distinguished by looking for extra implicit operands but this is fragile because regular moves sometimes have extra implicit operands too: - either by accident, when instructions end up with duplicate implicit operands (see e.g. D100939) - or by design, when SIInstrInfo::copyPhysReg breaks a multi-dword copy into individual subreg mov instructions and adds implicit operands for the super-register. The effect of this is that SIInstrInfo::isFoldableCopy can be simplified and identifies more foldable copies. The test diffs show that more immediate 0 values have been folded as inline operands. SIInstrInfo::isReallyTriviallyReMaterializable could probably be simplified too but that is not part of this patch. Differential Revision: https://reviews.llvm.org/D114230	2021-11-19 13:08:11 +00:00
Jay Foad	3264e95938	[CodeGen] Update LiveIntervals in TargetInstrInfo::convertToThreeAddress Delegate updating of LiveIntervals to each target's convertToThreeAddress implementation, instead of repairing LiveIntervals after the fact in TwoAddressInstruction::convertInstTo3Addr. Differential Revision: https://reviews.llvm.org/D113493	2021-11-17 10:16:47 +00:00
Joe Nash	79f52af4cd	[AMDGPU] Make getInstSizeInBytes more generic NFC. This check mainly handles size affecting literals. Make it check all explicit operands instead of a few by name. Also make the isLiteral check handle the KIMM operands, see https://reviews.llvm.org/D111067 Change-Id: I1a362d55b2a10f5c74d445272e8b7829a8b77597 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D113318 Change-Id: Ie6c688f30a71e0335d1c6dd1ff65019bd7ce684e	2021-11-08 10:34:49 -05:00
Kazu Hirata	e4bab21848	[AMDGPU] Use MachineBasicBlock::{predecessors,successors} (NFC)	2021-11-06 19:31:20 -07:00
Jay Foad	7afef22926	[AMDGPU] Use MachineInstrBuilder::addReg. NFC.	2021-11-01 15:29:51 +00:00
Jay Foad	2b548b18c1	[AMDGPU] Shrink v_mac_legacy_f32 and v_fmac_legacy_f32 Differential Revision: https://reviews.llvm.org/D112917	2021-11-01 13:55:53 +00:00
Kazu Hirata	72710af233	[CodeGen, Target] Use MachineBasicBlock::terminators (NFC)	2021-10-31 07:57:34 -07:00
Michael Liao	e6a4ba3aa6	[amdgpu] Handle the case where there is no scavenged register. - When an unconditional branch is expanded into an indirect branch, if there is no scavenged register, an SGPR pair needs spilling to enable the destination PC calculation. In addition, before jumping into the destination, that clobbered SGPR pair need restoring. - As SGPR cannot be spilled to or restored from memory directly, the spilling/restoring of that SGPR pair reuses the regular SGPR spilling support but without spilling it into memory. As that spilling and restoring points are fully controlled, we only need to spill that SGPR into the temporary VGPR, which needs spilling into its emergency slot. - The target-specific hook is revised to take additional restore block, where the restoring code is filled. After that, the relaxation will place that restore block directly before the destination block and insert an unconditional branch in any fall-through block into the destination block. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D106449	2021-10-27 18:37:27 -04:00
Stanislav Mekhanoshin	6185835656	[AMDGPU] Allow rematerialization of SOP with virtual registers D106408 was doing this for all targets although it was reverted due to couple performance regressions on some targets. The difference for AMDGPU is the ability to rematerialize SOP instructions with virtual register uses like we already do for VOP. Differential Revision: https://reviews.llvm.org/D110743	2021-10-20 11:46:50 -07:00
Joe Nash	b4b7e605a6	[AMDGPU] Support shared literals in FMAMK/FMAAK These instructions should allow src0 to be a literal with the same value as the mandatory other literal. Enable it by introducing an operand that defers adding its value to the MI when decoding till the mandatory literal is parsed. Reviewed By: dp, foad Differential Revision: https://reviews.llvm.org/D111067 Change-Id: I22b0ae0d35bad17b6f976808e48bffe9a6af70b7	2021-10-11 13:09:54 -04:00
Carl Ritson	adf7043a9f	[AMDGPU] Only remove branches in SIInstrInfo::removeBranch Without this change _term instructions can be removed during critical edge splitting. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D111126	2021-10-06 10:34:26 +09:00
Jay Foad	6cef28ed2d	[TII] Remove the MFI argument to convertToThreeAddress. NFC. This simplifies the API and addresses a FIXME in TwoAddressInstructionPass::convertInstTo3Addr. Differential Revision: https://reviews.llvm.org/D110229	2021-09-23 08:58:46 +01:00
Joe Nash	3ce1b9631a	[AMDGPU] Switch PostRA sched to MachineSched Use GCNHazardRecognizer in postra sched. Updated tests for the new schedules. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D109536 Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde	2021-09-14 15:11:27 -04:00
Michael Liao	640beb38e7	[amdgpu] Enable selection of `s_cselect_b64`. Differential Revision: https://reviews.llvm.org/D109159	2021-09-07 10:45:07 -04:00
Stanislav Mekhanoshin	d0c064715c	[AMDGPU] Small cleanup in optimizeCompareInstr. NFC.	2021-09-03 11:31:40 -07:00
Stanislav Mekhanoshin	78fbd1aa3d	[AMDGPU] Process any power of 2 in optimizeCompareInstr Differential Revision: https://reviews.llvm.org/D109201	2021-09-02 17:39:17 -07:00
Stanislav Mekhanoshin	2cfda6a691	[AMDGPU] Fold immediates in the optimizeCompareInstr Peephole works before the first SIFoldOperands so most of the immediates are in registers. Differential Revision: https://reviews.llvm.org/D109186	2021-09-02 17:23:26 -07:00
Stanislav Mekhanoshin	832c87b4fb	[AMDGPU] Use S_BITCMP0_* to replace AND in optimizeCompareInstr These can be used for reversed conditions if result of the AND is unused except in the compare: s_cmp_eq_u32 (s_and_b32 $src, 1), 0 => s_bitcmp0_b32 $src, 0 s_cmp_eq_i32 (s_and_b32 $src, 1), 0 => s_bitcmp0_b32 $src, 0 s_cmp_eq_u64 (s_and_b64 $src, 1), 0 => s_bitcmp0_b64 $src, 0 s_cmp_lg_u32 (s_and_b32 $src, 1), 1 => s_bitcmp0_b32 $src, 0 s_cmp_lg_i32 (s_and_b32 $src, 1), 1 => s_bitcmp0_b32 $src, 0 s_cmp_lg_u64 (s_and_b64 $src, 1), 1 => s_bitcmp0_b64 $src, 0 Differential Revision: https://reviews.llvm.org/D109099	2021-09-02 09:38:01 -07:00
Stanislav Mekhanoshin	f3645c792a	[AMDGPU] Use S_BITCMP1_* to replace AND in optimizeCompareInstr Differential Revision: https://reviews.llvm.org/D109082	2021-09-01 15:59:12 -07:00
Stanislav Mekhanoshin	bf77b11277	[AMDGPU] Introduce optimizeCompareInstr The following patterns are currently handled: s_cmp_eq_u32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1 s_cmp_eq_i32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1 s_cmp_eq_u64 (s_and_b64 $src, 1), 1 => s_and_b64 $src, 1 s_cmp_ge_u32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1 s_cmp_ge_i32 (s_and_b32 $src, 1), 1 => s_and_b32 $src, 1 s_cmp_lg_u32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1 s_cmp_lg_i32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1 s_cmp_lg_u64 (s_and_b64 $src, 1), 0 => s_and_b64 $src, 1 s_cmp_gt_u32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1 s_cmp_gt_i32 (s_and_b32 $src, 1), 0 => s_and_b32 $src, 1 Differential Revision: https://reviews.llvm.org/D109031	2021-09-01 15:57:05 -07:00
alex-t	ed0f4415f0	[AMDGPU] Divergence-driven compare operations instruction selection Description: This change enables the compare operations to be selected to SALU/VALU form dependent of the SDNode divergence flag. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D106079	2021-08-25 18:30:49 +03:00
Christudasan Devadasan	4f5ba46e16	[AMDGPU] Set wait state for meta instructions to zero It looked more reasonable to set the wait state to zero for all non-instructions. With that we can avoid the special handling for them in `getWaitStatesSince` and `AdvanceCycle`. This NFC patch makes the handling more generic.	2021-08-18 01:46:59 -04:00
Christudasan Devadasan	686607676f	[AMDGPU] Skip pseudo MIs in hazard recognizer Instructions like WAVE_BARRIER and SI_MASKED_UNREACHABLE are only placeholders to prevent certain unwanted transformations and will get discarded during assembly emission. They should not be counted during nop insertion. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D108022	2021-08-16 23:11:14 -04:00
Michael Liao	b0402a35fc	[amdgpu] Add 64-bit PC support when expanding unconditional branches. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D106445	2021-07-26 14:50:30 -04:00
Carl Ritson	6efb3220b4	[AMDGPU] Add VReg_192/VReg_224 support for MIMG instructions Allow MIMG instructions to be selected with 6/7 VGPRs for vaddr. Previously these were rounded up to VReg_256 this saves VGPRs. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D103800	2021-07-22 10:42:15 +09:00
Stanislav Mekhanoshin	9625ca5b60	[AMDGPU] Mark relevant rematerializable VOP2 instructions Differential Revision: https://reviews.llvm.org/D106023	2021-07-21 14:24:59 -07:00
Stanislav Mekhanoshin	4eb24817ec	[AMDGPU] Mark all relevant VOP1 instructions rematerializable Differential Revision: https://reviews.llvm.org/D105919	2021-07-21 14:05:32 -07:00
Sebastian Neubauer	afd895709d	[AMDGPU] Use isMetaInstruction for instruction size Meta instructions have a size of 0. Use isMetaInstruction instead of listing them explicitly. Differential Revision: https://reviews.llvm.org/D106043	2021-07-15 12:23:11 +02:00
Stanislav Mekhanoshin	76b7d3432e	[AMDGPU] Add TII::isIgnorableUse() to allow VOP rematerialization Any def of EXEC prevents rematerialization of any VOP instruction because of the physreg use. Create a callback to check if the physreg use can be ingored to allow rematerialization. Differential Revision: https://reviews.llvm.org/D105836	2021-07-14 13:03:58 -07:00
Sebastian Neubauer	9d72c0ad43	[AMDGPU] Mark waterfall loops as SI_WATERFALL_LOOP This way, they can be detected later, e.g. by the SIOptimizeVGPRLiveRange pass. Differential Revision: https://reviews.llvm.org/D105467	2021-07-13 12:15:08 +02:00
Stanislav Mekhanoshin	d46d534dbb	[AMDGPU] Make some VOP1 instructions rematerializable This is a pilot change to verify the logic. The rest will be done in a same way, at least the rest of VOP1. Differential Revision: https://reviews.llvm.org/D105742	2021-07-12 23:43:45 -07:00
Stanislav Mekhanoshin	661577e698	[AMDGPU] Fix immediate sign during V_MOV_B64_PSEUDO expansion Creating a V_MOV_B32 with zero extended immediate source prevented conversion to V_BFREV_B32. Differential Revision: https://reviews.llvm.org/D105235	2021-07-01 09:00:29 -07:00
Stanislav Mekhanoshin	381ded345b	[AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants This is to allow 64 bit constant rematerialization. If a constant is split into two separate moves initializing sub0 and sub1 like now RA cannot rematerizalize a 64 bit register. This gives 10-20% uplift in a set of huge apps heavily using double precession math. Fixes: SWDEV-292645 Differential Revision: https://reviews.llvm.org/D104874	2021-06-30 11:45:38 -07:00
Piotr Sobczak	f38a8b54ea	[AMDGPU] Fix 224-bit spills Related to D104622. Differential Revision: https://reviews.llvm.org/D105109	2021-06-29 17:52:16 +02:00
Carl Ritson	98f48723f2	[AMDGPU] Add 224-bit vector types and link 192-bit types to MVTs Add SReg_224, VReg_224, AReg_224, etc. Link 224-bit types with v7i32/v7f32. Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D104622	2021-06-24 12:41:22 +09:00
Carl Ritson	f8816c7400	[AMDGPU] Add v5f32/VReg_160 support for MIMG instructions Avoid having to round up to v8f32/VReg_256 when only 5 VGPRs are required for a MIMG address operand. Maintain _V8 instruction variants of pseudo instructions allowing assembly prior to GFX10 to work as-is. Currently the validator can tell for GFX10 what the correct size is, so will disallow oversize address registers. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D103672	2021-06-08 11:11:40 +09:00
Stanislav Mekhanoshin	9e2e49328f	[AMDGPU] All GWS instructions need aligned VGPR on gfx90a Fixes: SWDEV-288006 Differential Revision: https://reviews.llvm.org/D103197	2021-06-01 17:08:03 -07:00
Brendon Cahoon	3f7b7e7393	[AMDGPU] Update SCC defs to VCC when uses are changed to VCC The FixSGPRCopies pass converts instructions to VALU when removing illegal VGPR to SGPR copies. Instructions that use SCC are changed to use VCC instead. When that happens, the pass must also change instructions that define SCC to define VCC. The pass was not changing the SCC definition when an ADDC is converted due to a input that is a VGPR to SGPR copy. But, the initial ADD insruction, which define SCC, is not converted. This causes a compilation failure due to a use of an undefined physical register. This patch adds code that inserts the SCC definition in the MoveToVALU worklist when a SCC use is converted to a VCC use. Differential Revision: https://reviews.llvm.org/D102111	2021-05-14 18:05:05 -04:00
Matt Arsenault	c7cff08f79	AMDGPU: Fix assert when rewriting saddr d16 loads moveOperands does not handle moving tied operands since it would generally have to fixup the tied operand references. Avoid the assert by untying and retying after the modification. These in place modifications really aren't managable.	2021-05-14 13:24:19 -04:00
Jay Foad	7f81c5a5ba	[AMDGPU] getMemOperandsWithOffset: add vaddr operand for stack access BUF instructions A consequence is that checkInstOffsetsDoNotOverlap can now distinguish sp+offset from fp+offset, so it knows that it shouldn't try to work out whether the accesses overlap just by comparing the offsets. For example in these two instructions: MIR: BUFFER_STORE_DWORD_OFFSET %0:vgpr_32(s32), $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable store 4 into stack + 4, addrspace 5) %4:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from `i8 addrspace(5)* undef`, addrspace 5) ISA: buffer_store_dword v0, off, s[0:3], s32 offset:4 buffer_load_dword v0, off, s[0:3], s34 Differential Revision: https://reviews.llvm.org/D73957	2021-05-14 10:10:43 +01:00
David Stuttard	72d570ca08	[AMDGPU][AsmParser/Disassembler] Correct A16 and G16 handling A16 support for image instructions assembly/disassembly (gfx10) was missing Also refactor MIMG op addr size calcs to common function We'd got 3 places where the same operation was being done. One test is now marked XFAIL until a related codegen patch is in place Differential Revision: https://reviews.llvm.org/D102231 Change-Id: I7e86e730ef8c71901457855cba570581f4f576bb	2021-05-14 09:25:44 +01:00
Sebastian Neubauer	13c0316239	[AMDGPU] Restrict immediate scratch offsets gfx9 does not work with negative offsets, gfx10 works only with aligned negative offsets, but not with unaligned negative offsets. This is slightly more conservative than needed, gfx9 does support negative offsets when a VGPR address is used and gfx10 supports negative, unaligned offsets when an SGPR address is used, but we do not make use of that with this patch. Differential Revision: https://reviews.llvm.org/D101292	2021-05-07 14:51:32 +02:00
Stanislav Mekhanoshin	4d6ebe8ac0	[AMDGPU] Change FLAT Scratch SADDR to VADDR form in moveToVALU Extend the legalization of global SADDR loads and stores with changing to VADDR to the FLAT scratch instructions. Differential Revision: https://reviews.llvm.org/D101408	2021-05-03 10:57:14 -07:00
Stanislav Mekhanoshin	89a94be16b	[AMDGPU] Change FLAT SADDR to VADDR form in moveToVALU Instead of legalizing saddr operand with a readfirstlane when address is moved from SGPR to VGPR we can just change the opcode. Differential Revision: https://reviews.llvm.org/D101405	2021-05-03 10:36:26 -07:00
David Stuttard	a67a377014	[AMDGPU] Tidy up some simple expressions for clarity NFC Slight refactor for clarity. Change-Id: Ib25e7f4582c67a7c57f066cfd5382c1405d7d4c5 Differential Revision: https://reviews.llvm.org/D101610	2021-04-30 11:13:54 +01:00
Jay Foad	ec8c61efdf	[AMDGPU] Allow multiple uses of the same literal In GFX10 VOP3 can have a literal, which opens up the possibility of two operands using the same literal value, which is allowed and only counts as one use of the constant bus. AMDGPUAsmParser::validateConstantBusLimitations already knew about this but SIInstrInfo::verifyInstruction did not. Differential Revision: https://reviews.llvm.org/D100770	2021-04-20 16:44:01 +01:00
Sebastian Neubauer	cc7add5298	[AMDGPU] Use SIInstrFlags for flat variants. NFC Use SIInstrFlags to differentiate between the different variants of flat instructions (flat, global and scratch). This should make it easier to bundle the immediate offset logic in a single place and implement restrictions and bug workarounds. Fixed version of D99587, which does not rely on the address space. Differential Revision: https://reviews.llvm.org/D99743	2021-04-09 12:28:36 +02:00
Stanislav Mekhanoshin	bc27a31801	[AMDGPU] Fix copyPhysReg to not produce unalined vgpr access RA can insert something like a sub1_sub2 COPY of a wide VGPR tuple which results in the unaligned acces with v_pk_mov_b32 after the copy is expanded. This is regression after D97316. Differential Revision: https://reviews.llvm.org/D98549	2021-03-15 14:14:30 -07:00
Stanislav Mekhanoshin	3bffb1cd0e	[AMDGPU] Use single cache policy operand Replace individual operands GLC, SLC, and DLC with a single cache_policy bitmask operand. This will reduce the number of operands in MIR and I hope the amount of code. These operands are mostly 0 anyway. Additional advantage that parser will accept these flags in any order unlike now. Differential Revision: https://reviews.llvm.org/D96469	2021-03-15 13:00:59 -07:00
Jay Foad	70f013fd3b	[AMDGPU] Fix isReallyTriviallyReMaterializable for V_MOV_* D57708 changed SIInstrInfo::isReallyTriviallyReMaterializable to reject V_MOVs with extra implicit operands, but it accidentally rejected all V_MOVs because of their implicit use of exec. Fix it but avoid adding a moderately expensive call to MI.getDesc().getNumImplicitUses(). In real graphics shaders this changes quite a few vgpr copies into move- immediates, which is good for avoiding stalls on GFX10. Differential Revision: https://reviews.llvm.org/D98347	2021-03-10 16:18:12 +00:00
Ruiling Song	f0ccdde3c9	[AMDGPU] Remove SI_MASK_BRANCH This is already deprecated, so remove code working on this. Also update the tests by using S_CBRANCH_EXECZ instead of SI_MASK_BRANCH. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D97545	2021-03-09 09:13:23 +08:00
Piotr Sobczak	4672bac177	[AMDGPU] Introduce Strict WQM mode * Add amdgcn_strict_wqm intrinsic. * Add a corresponding STRICT_WQM machine instruction. * The semantic is similar to amdgcn_strict_wwm with a notable difference that not all threads will be forcibly enabled during the computations of the intrinsic's argument, but only all threads in quads that have at least one thread active. * The difference between amdgc_wqm and amdgcn_strict_wqm, is that in the strict mode an inactive lane will always be enabled irrespective of control flow decisions. Reviewed By: critson Differential Revision: https://reviews.llvm.org/D96258	2021-03-03 14:19:16 +01:00
Piotr Sobczak	c3ce7bae80	[AMDGPU] Rename amdgcn_wwm to amdgcn_strict_wwm * Introduce the new intrinsic amdgcn_strict_wwm * Deprecate the old intrinsic amdgcn_wwm The change is done for consistency as the "strict" prefix will become an important, distinguishing factor between amdgcn_wqm and amdgcn_strictwqm in the future. The "strict" prefix indicates that inactive lanes do not take part in control flow, specifically an inactive lane enabled by a strict mode will always be enabled irrespective of control flow decisions. The amdgcn_wwm will be removed, but doing so in two steps gives users time to switch to the new name at their own pace. Reviewed By: critson Differential Revision: https://reviews.llvm.org/D96257	2021-03-03 09:33:57 +01:00
Jay Foad	3ad5216ed8	[AMDGPU] Better codegen for i64 bitreverse Differential Revision: https://reviews.llvm.org/D97547	2021-02-26 15:51:36 +00:00
Matt Arsenault	78b6d73a93	AMDGPU: Add even aligned VGPR/AGPR register classes gfx90a operations require even aligned registers, but this was previously achieved by reserving registers inside the full class. Ideally this would be captured in the static instruction definitions for the operands, and we would have different instructions per subtarget. The hackiest part of this is we need to manually reassign AGPR register classes after instruction selection (we get away without this for VGPRs since those types are actually registered for legal types).	2021-02-24 14:49:37 -05:00
Stanislav Mekhanoshin	a8d9d50762	[AMDGPU] gfx90a support Differential Revision: https://reviews.llvm.org/D96906	2021-02-17 16:01:32 -08:00
Piotr Sobczak	c72a63b4b0	[AMDGPU] Add implicit vcc_lo on S_CBRANCH_VCCNZ in wave32 * Update skip-if-dead.ll with tests for wave32. * Fix the crash in verifier in one newly enabled test by adding missing fixImplicitOperands in branch insertion code. ``` * Bad machine code: Using an undefined physical register * - function: test_kill_divergent_loop - basic block: %bb.2 bb (0xad96308) - instruction: S_CBRANCH_VCCNZ %bb.1, implicit $vcc_lo - operand 1: implicit $vcc_lo LLVM ERROR: Found 1 machine code errors. ``` * Simplify "cbranch_kill" to not use interp instructions. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D96793	2021-02-17 15:14:57 +01:00
Carl Ritson	c16f776028	[AMDGPU] Move kill lowering to WQM pass and add live mask tracking Move implementation of kill intrinsics to WQM pass. Add live lane tracking by updating a stored exec mask when lanes are killed. Use live lane tracking to enable early termination of shader at any point in control flow. Reviewed By: piotr Differential Revision: https://reviews.llvm.org/D94746	2021-02-11 20:31:29 +09:00
Jay Foad	a4b1df8af3	[AMDGPU] Use named unified buffer format constant. NFC.	2021-02-08 17:34:36 +00:00
Carl Ritson	0e8f50595e	[AMDGPU] Mark V_SET_INACTIVE as defining SCC V_SET_INACTIVE is implemented with S_NOT which clobbers SCC. Mark sure it is marked appropriately. Reviewed By: piotr Differential Revision: https://reviews.llvm.org/D95509	2021-01-29 09:46:41 +09:00
dfukalov	560d7e0411	[NFC][AMDGPU] Split AMDGPUSubtarget.h to R600 and GCN subtargets ... to reduce headers dependency. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D95036	2021-01-20 22:22:45 +03:00
Joe Nash	314e29ed2b	[AMDGPU] Add _e64 suffix to VOP3 Insts Previously, instructions which could be expressed as VOP3 in addition to another encoding had a _e64 suffix on the tablegen record name, while those only available as VOP3 did not. With this patch, all VOP3s will have the _e64 suffix. The assembly does not change, only the mir. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D94341 Change-Id: Ia8ec8890d47f8f94bbbdac43745b4e9dd2b03423	2021-01-12 18:33:18 -05:00
Sebastian Neubauer	6a195491b6	[AMDGPU] Fix failing assert with scratch ST mode In ST mode, flat scratch instructions have neither an sgpr nor a vgpr for the address. This lead to an assertion when inserting hard clauses. Differential Revision: https://reviews.llvm.org/D94406	2021-01-12 09:54:02 +01:00
dfukalov	6a87e9b08b	[NFC][AMDGPU] Reduce include files dependency. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D93813	2021-01-07 22:22:05 +03:00
dfukalov	9ed8e0caab	[NFC] Reduce include files dependency and AA header cleanup (part 2). Continuing work started in https://reviews.llvm.org/D92489: Removed a bunch of includes from "AliasAnalysis.h" and "LoopPassManager.h". Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D92852	2020-12-17 14:04:48 +03:00

1 2 3 4 5 ...

680 Commits