These generic instructions are trivially selected to
V_MAD_[IU]64_[IU]32 instructions when executed on the VALU.
When at least the two factors are scalar, it is usually better to execute
some or all of the instruction on the SALU. To this end, we lower the
instruction to simpler instructions that are supported on the SALU
when applying the register bank mapping.
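As a rough, plain-C++ illustration of what is being lowered (a sketch, not the actual lowering code): the mad takes two 32-bit factors and a 64-bit addend, and a scalar-unit expansion has to rebuild that from 32-bit multiply-low/high pieces plus a carried add.
  #include <cstdint>

  // Semantics of the unsigned form: 32x32 -> 64-bit multiply plus a 64-bit addend.
  uint64_t mad_u64_u32(uint32_t a, uint32_t b, uint64_t c) {
    return (uint64_t)a * b + c;
  }

  // What a scalar-unit expansion has to compute from 32-bit pieces:
  // mul-lo, mul-hi, then a 64-bit add carried across the two halves.
  uint64_t mad_u64_u32_salu(uint32_t a, uint32_t b, uint64_t c) {
    uint64_t prod = (uint64_t)a * b;
    uint32_t lo = (uint32_t)prod, hi = (uint32_t)(prod >> 32);
    uint32_t clo = (uint32_t)c, chi = (uint32_t)(c >> 32);
    uint32_t rlo = lo + clo;
    uint32_t carry = rlo < lo ? 1u : 0u; // carry out of the low 32-bit add
    uint32_t rhi = hi + chi + carry;
    return ((uint64_t)rhi << 32) | rlo;
  }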
Differential Revision: https://reviews.llvm.org/D124843
A later change will add a 3rd user, so factoring out the common code
seems useful.
Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).
(Resubmit with missing test added.)
Differential Revision: https://reviews.llvm.org/D125324
This fixes a build failure with expensive checks after D126009.
The change added new RUN lines for GlobalISel, which uncovered a
pre-existing problem: it does not select the correct flavor of these
image instructions.
Even though single-address image instructions only use a single VGPR,
the HW accesses 4 or 5 VGPRs, which creates an alignment requirement.
Fixes: SWDEV-316648
Differential Revision: https://reviews.llvm.org/D126009
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
This change adds the constant splat versions of m_ICst() (implemented via
getBuildVectorConstantSplat()) and uses them in
matchOrShiftToFunnelShift(). The getBuildVectorConstantSplat() name is
shortened to getIConstantSplatVal() so that the *SExtVal() version has a
more compact name.
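For clarity, here is what "constant splat" means in this context, as a minimal self-contained sketch over already-extracted element values (plain C++, not the MIR-level helper):
  #include <cstdint>
  #include <optional>
  #include <vector>

  // Conceptual definition of "constant splat": every element is the same constant.
  // Returns the splat value, or nullopt if the elements differ.
  std::optional<int64_t> getConstantSplat(const std::vector<int64_t> &Elts) {
    if (Elts.empty())
      return std::nullopt;
    for (int64_t V : Elts)
      if (V != Elts.front())
        return std::nullopt;
    return Elts.front();
  }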
Differential Revision: https://reviews.llvm.org/D125516
Previously it built MIR for the results and returned a Register.
This avoids building constants for earlier elements of the vector if
later elements will fail to fold, and allows CSEMIRBuilder::buildInstr
to avoid unconditionally building a copy from the result.
Use a new helper function MachineIRBuilder::buildBuildVectorConstant
to build a G_BUILD_VECTOR of G_CONSTANTs.
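The shape of the change, as a hedged sketch in plain C++ (not the CSEMIRBuilder code itself): fold to plain values first and let the caller build a single G_BUILD_VECTOR of G_CONSTANTs only when every element folded.
  #include <cstdint>
  #include <optional>
  #include <vector>

  // Stand-in for the per-element fold; nullopt means "this element does not fold".
  std::optional<uint64_t> foldElement(uint64_t LHS, uint64_t RHS) {
    return LHS + RHS;
  }

  // Fold every element first; only the caller materializes MIR, and only if all
  // elements folded, so no constants are built for a partially foldable vector.
  std::optional<std::vector<uint64_t>> foldVector(const std::vector<uint64_t> &L,
                                                  const std::vector<uint64_t> &R) {
    std::vector<uint64_t> Out;
    for (size_t I = 0, E = L.size(); I != E; ++I) {
      std::optional<uint64_t> V = foldElement(L[I], R[I]);
      if (!V)
        return std::nullopt;
      Out.push_back(*V);
    }
    return Out;
  }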
Differential Revision: https://reviews.llvm.org/D117758
Tablegen definitions for subtarget features and C++ predicate functions to
access the features.
New subtarget processors and common latencies.
Simple changes to MIR codegen tests which pass on gfx11 because they either
have the same output as previous subtargets or operate on pseudo instructions
which are reused from previous subtargets.
Contributors:
Jay Foad <jay.foad@amd.com>
Petar Avramovic <Petar.Avramovic@amd.com>
Patch 4/N for upstreaming of AMDGPU gfx11 architecture
Depends on D124538
Reviewed By: Petar.Avramovic, foad
Differential Revision: https://reviews.llvm.org/D125261
The default output format of the update_mir_test_checks.py script has
changed since some of these tests were generated.
Also, an upcoming commit will introduce differences between GFX9 and
GFX10 in the legalization of G_MUL.
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.
Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.
Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
The most common situation where G_ASSERT_ZEXT appears for AMDGPU is a
copy from a physical register, which happens to set the actual
register class on the virtual register. After copy coalescing, the
assert's source operand had a vreg with a set class. The verifier was
strictly rejecting cases where the set class/bank weren't an exact
match. Additionally, RegBankSelect was also expecting a register bank
to be set on the register, not a class.
This is much stricter than for regular copies, so relax this behavior. This
now allows these 2 cases:
1. Source register has either class or bank, and the result does not
2. Source register has a register class, and the result is a register
with a matching bank.
This should avoid needing some kind of special handling to avoid
violating this constraint when folding copies.
This is to avoid relying on the post-isel hook.
This change also enables the saddr pattern selection for atomic
intrinsics in GlobalISel.
Differential Revision: https://reviews.llvm.org/D123583
Fix isVCC for a register that was assigned a register class during
instruction selection. This happens when the register has multiple uses.
For wave32, a uniform i1-to-vcc copy was selected like a vcc-to-vcc
copy when the uniform i1 had an assigned register class.
A uniform i1 register with an assigned register class will have the s1 LLT,
be defined by G_TRUNC, and its class will be SReg_32RegClass.
A vcc i1 register with an assigned register class will have the s1 LLT,
its class will be SReg_32RegClass for wave32 and SReg_64RegClass for
wave64, and it will not be defined by G_TRUNC.
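A hedged sketch of that distinction (the helper below is hypothetical and only mirrors the description above; include paths are assumptions for the LLVM tree of that era): since both flavors are s1 and use SReg_32RegClass on wave32, the defining G_TRUNC is what tells them apart.
  #include "llvm/CodeGen/MachineInstr.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"
  #include "llvm/CodeGen/TargetOpcodes.h"

  // Hypothetical helper, for illustration only: a uniform i1 that was assigned a
  // register class is produced by G_TRUNC; a vcc value with the same LLT is not.
  static bool isUniformI1NotVCC(llvm::Register Reg,
                                const llvm::MachineRegisterInfo &MRI) {
    if (MRI.getType(Reg) != llvm::LLT::scalar(1))
      return false;
    const llvm::MachineInstr *Def = MRI.getVRegDef(Reg);
    return Def && Def->getOpcode() == llvm::TargetOpcode::G_TRUNC;
  }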
Differential Revision: https://reviews.llvm.org/D124163
It was only handled for FLAT initially because we did not have lowering
for unaligned DS instructions. That lowering is now implemented, but the
bug was not handled.
Differential Revision: https://reviews.llvm.org/D123338
Split waterfall loops into multiple blocks so that exec mask
manipulation (s_and_saveexec) does not occur in the middle of
a block.
VGPR live range optimizer is updated to handle waterfall loops
spanning multiple blocks.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D122200
Summary:
Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.
Reviewers: arsenm, sameerds, yaxunl
Differential Revision: https://reviews.llvm.org/D120265
This change replaces the manual selection of buffer_atomic_cmpswap*
instructions in SelectionDAG and GlobalISel with a tblgen based
selection in BUFInstructions.td. This allows us to select the return and
no-return variants in tblgen.
Differential Revision: https://reviews.llvm.org/D121770
Summary:
In general, we need queue_ptr for aperture bases and trap handling,
and user SGPRs have to be set up to hold queue_ptr. In the current
implementation, user SGPRs are set up unnecessarily in some cases. If the
target has aperture registers, queue_ptr is not needed to reference aperture
bases. For trap handling, if the target supports getDoorbellID, queue_ptr is
also not necessary.
Further, code object version 5 introduces a new kernel ABI which passes
queue_ptr as an implicit kernel argument, so user SGPRs are no longer
necessary for queue_ptr. Based on the trap handling document:
https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table,
llvm.debugtrap does not need queue_ptr, so we remove queue_ptr support for
llvm.debugtrap in the backend.
Reviewers: sameerds, arsenm
Fixes: SWDEV-307189
Differential Revision: https://reviews.llvm.org/D119762
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as live-ins on the function entry to
preserve their value when we have calls, so that they get saved and restored
around the calls.
But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame, and the above approach makes it difficult to track the
return address when the CFI information is emitted during frame lowering,
since it requires understanding the control flow.
This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.
Doing the above poses an issue: the return instruction now uses the undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use behind the `SI_RETURN` pseudo instruction, which doesn't take any input
operands, until `SI_RETURN` is lowered to `S_SETPC_B64_return` in
`expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm, ronlieb
Differential Revision: https://reviews.llvm.org/D114652
In convertToThreeAddress handle VOP2 mac/fmac instructions with a
literal src0 operand, since these are prime candidates for
converting to madmk/fmamk.
Previously this would only happen if src0 (or src1) was a register
defined by a move-immediate instruction, but in many cases these
operands have already been folded because SIFoldOperands runs
before TwoAddressInstructionPass.
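For reference, the scalar semantics of the forms involved (my reading of the encodings, so treat the operand roles as an assumption rather than a statement of the ISA):
  // Two-address mac: the destination also acts as the addend.
  float mac_f32(float dst, float s0, float s1) { return s0 * s1 + dst; }

  // Three-address madmk: multiply src0 by an inline 32-bit literal K, add src1.
  // With a literal already sitting in src0 of the mac, this form needs no extra mov.
  float madmk_f32(float s0, float k, float s1) { return s0 * k + s1; }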
Differential Revision: https://reviews.llvm.org/D120736
Handle V_MAC_LEGACY_F32 and V_FMAC_LEGACY_F32 in
convertToThreeAddress, to avoid the need for an extra mov
instruction in some cases.
Differential Revision: https://reviews.llvm.org/D120704
Use a subtarget feature instead of a command line argument to reduce
global state.
We want to enable flat scratch for graphics in some cases and this
doesn't work well with command line options.
Differential Revision: https://reviews.llvm.org/D119425
In some cases the newly created fneg was built after some of its uses.
Now it is built immediately after the instruction whose dst it negates.
Differential Revision: https://reviews.llvm.org/D119459
At -O0 we claim to CSE constants only. I think this should apply to
G_FCONSTANT as well as G_CONSTANT.
Differential Revision: https://reviews.llvm.org/D119344
Some globals lower to literal addresses on AMDGPU.
This may be wrong for non-integral address spaces. I'm wondering if we
should just allow regular G_ADD to use pointer types, and reserve
G_PTR_ADD for non-integral address spaces.
This will do the combine in cases that should fold but currently don't,
e.g. where we're relying on the CSEMIRBuilder's incomplete constant
folding. For instance, it doesn't handle FP operations or vectors (and we
don't have separate constant folding combines to catch them either).
We can select the _vgprcd versions of MAI instructions and have no
AGPRs, with the whole budget left for VGPRs, if (restated as code below):
1. This is a kernel;
2. It has no calls;
3. It runs on at least 2 waves, thus having no more than 256 VGPRs;
4. There is no inline asm requesting AGPRs.
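Restating the four conditions as a single predicate (every name below is hypothetical; this only mirrors the wording of the list, not the in-tree API):
  // All fields are hypothetical stand-ins for the information the list refers to.
  struct KernelInfo {
    bool IsKernel;
    bool HasCalls;
    unsigned MinWavesPerEU;   // occupancy lower bound
    bool InlineAsmUsesAGPRs;
  };

  bool canUseVGPRFormMAIWithoutAGPRs(const KernelInfo &KI) {
    return KI.IsKernel && !KI.HasCalls && KI.MinWavesPerEU >= 2 &&
           !KI.InlineAsmUsesAGPRs;
  }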
Differential Revision: https://reviews.llvm.org/D117253
WWM registers have unmodeled register liveness. For v_set_inactive_*,
clobbering the source register is dangerous because it will overwrite the
inactive lanes. When the source vgpr is dead at v_set_inactive_lane,
the inactive lanes may not really be dead. This may lead common
optimizations to do the wrong thing.
For example in a simple if-then cfg in Machine IR:
bb.if:
%src =
bb.then:
%src1 = COPY %src
%dst = V_SET_INACTIVE %src1(tied-def 0), %inactive
bb.end
... = PHI [0, %bb.then] [%src, %bb.if]
The register coalescer will think it is safe to optimize "%src1 = COPY %src"
in bb.then. And at the same time, there is no interference for the PHI in
bb.end. The source and destination values of the PHI will be assigned
the same register. The single PHI register will be overwritten by the
v_set_inactive, and then we would get the wrong value in bb.end.
With this change, we will copy the content of the source register before
setting inactive lanes after register allocation. Yes, this will sacrifice
the WWM code generation a little, but I don't have any better idea to do things
correctly.
Differential Revision: https://reviews.llvm.org/D117482
Similar to the G_*MULO change.
The code for checking if a constant is legal/pre-legalize is shared between
these, and is kind of hairy. So, factor it out into a new function:
`isConstantLegalOrBeforeLegalizer`.
To make the refactoring clean, further refactor `isLegalOrBeforeLegalizer` into
a wrapper for two functions:
- `isPreLegalize`
- `isLegal`
This is a bit easier to read in general.
https://godbolt.org/z/KW7oszP1o
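The wrapper ends up roughly like the sketch below (signatures and include paths are assumptions, and the constant check is simplified; the real helper also has to consider how vector constants are built):
  #include "llvm/CodeGen/GlobalISel/LegalizerInfo.h"
  #include "llvm/CodeGen/TargetOpcodes.h"

  using namespace llvm;

  // Free-function stand-ins for the CombinerHelper methods.
  bool isPreLegalize();                     // true while running before the legalizer
  bool isLegal(const LegalityQuery &Query); // defers to the target's LegalizerInfo

  bool isLegalOrBeforeLegalizer(const LegalityQuery &Query) {
    return isPreLegalize() || isLegal(Query);
  }

  bool isConstantLegalOrBeforeLegalizer(LLT Ty) {
    // Simplified: a scalar constant is just a G_CONSTANT of the given type.
    return isPreLegalize() || isLegal({TargetOpcode::G_CONSTANT, {Ty}});
  }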
Differential Revision: https://reviews.llvm.org/D118655
Always set uniform metadata on the pointer if it is an instruction, but
otherwise do not bother to create a trivial getelementptr instruction,
because AMDGPUInstrInfo::isUniformMMO can already detect that various
non-instruction pointers are uniform.
Most of the test case churn is from tests that used undef as a pointer,
which AMDGPUInstrInfo::isUniformMMO treats as uniform.
Differential Revision: https://reviews.llvm.org/D118909
This allows us to set the noclobber flag on (the MMO of) a load
instruction instead of on the pointer. This fixes a bug where noclobber
was being applied to all loads from the same pointer, even if some of
them were clobbered.
Differential Revision: https://reviews.llvm.org/D118775
In some cases StructurizeCFG inserts i1 xor instructions to invert
predicates. Add a quick loop to clean these up afterwards if we can get
away with modifying an existing compare instruction instead.
(StructurizeCFG is generally run late in the pipeline so instcombine
does not clean them up for us.)
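The cleanup relies on a simple identity, checked here in plain C++: xor-ing an i1 compare result with true is the same as flipping the compare's predicate, so the xor can be absorbed into an existing compare.
  #include <cassert>

  int main() {
    for (int a = -2; a <= 2; ++a)
      for (int b = -2; b <= 2; ++b)
        assert(((a < b) ^ true) == (a >= b)); // xor i1 %cmp, true == inverted predicate
    return 0;
  }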
Differential Revision: https://reviews.llvm.org/D118623
This reverts commit a6b54ddaba.
Apparently it is not safe to modify the condition even if it passes the
hasOneUse test, because StructurizeCFG might have other references to
the condition that are not manifest in the IR use-def chains.
This avoids various cases where StructurizeCFG would otherwise insert an
xor i1 instruction, and since it generally runs late in the pipeline,
instcombine does not clean up the xor-of-cmp pattern.
Differential Revision: https://reviews.llvm.org/D118478
The physical register in the asm has the wrong type for the declared
IR. It seems to work in the DAG by extracting the 4 elements that are
defined in the IR from the register, but that isn't handled here. This
doesn't seem to be a well tested path since other mismatched cases are
crashing the DAG asm handling.
Use shufflevector to do the subvector extracts. This allows a lot more
load merging on AMDGPU and also on NVPTX when <2 x half> is involved.
Differential Revision: https://reviews.llvm.org/D117219
This change folds (or (shl x, C0), (lshr y, C1)) to funnel shift iff C0
and C1 are constants where C0 + C1 is the bit-width of the shift
instructions.
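A quick self-contained check of the identity behind the fold (assuming 32-bit values, with C0 + C1 == 32):
  #include <cassert>
  #include <cstdint>

  // fshl(x, y, C0): take the top 32 bits of (x:y) << C0, i.e. (x << C0) | (y >> (32 - C0)).
  uint32_t fshl32(uint32_t x, uint32_t y, unsigned c) {
    return (x << c) | (y >> (32 - c)); // assumes 0 < c < 32
  }

  int main() {
    uint32_t x = 0x12345678u, y = 0x9ABCDEF0u;
    unsigned C0 = 12, C1 = 20;                       // C0 + C1 == bit width
    assert(((x << C0) | (y >> C1)) == fshl32(x, y, C0));
    return 0;
  }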
Differential Revision: https://reviews.llvm.org/D116529
Emit an error if the return value is used on subtargets that do not
support it. Previously we were falling back to the DAG on selection
failure, where it would emit this error and then fail again.
The struct/raw forms for the buffer atomics now work as
expected. However, we're incorrectly handling the legacy form (which
we probably shouldn't handle at all). We also are not diagnosing the
use of the return value on gfx908. These will be addressed separately.
This was inheriting the mesa behavior, and as far as I know nobody is
using opencl kernels with amdpal. The isMesaKernel check was
irrelevant because this property needs to hold for all functions.
GlobalISelEmitter was skipping these patterns when its predicates were
checked. This patch should allow us to select d16_hi stores in
GlobalISel.
Differential Revision: https://reviews.llvm.org/D117762
Arbitrary stack pointers are accessed using MUBUF instructions with
the voffset field, which is interpreted as the swizzled address. We
want to fold into the MUBUF form to use the SP in the SGPR
offset, and previously we were special casing the interpretation of
the pointer value if the access memory operand said it was relative to
the stack pointer.
690f5b7a01 removed this check, and moved
the DAG path to special casing copies from SGPRs. This is not an
entirely sound approach, since it's still changing the interpretation
of pointer values based on the context.
Introduce a new pseudo which corresponds to the wave-to-vector address
transform. This way the memory instruction has consistent semantics
where the incoming pointer is always interpreted as a vector address,
and we're not obligated to optimize into the MUBUF offset-only
addressing mode. The DAG should probably have an equivalent pseudo.
This should fix some correctness issues, and folding this into
addressing modes will be a future optimization patch.
This was ignoring the requested result register, resulting in a
missing def when this happened in the IRTranslator. Fixes some crashes
and verifier errors at -O0.
Alternatively we could pass DstOps to the constant fold functions.
If the conversion artifact introduced by the unmerge-of-cast-of-merge
combine already existed in the function, this would introduce dead
copies which kept the old casts around; neither of these was deleted,
and legalization would then fail.
This would fail as follows:
The G_UNMERGE_VALUES of the G_SEXT of the G_BUILD_VECTOR would
introduce a G_SEXT for each of the scalars.
Some of the required G_SEXTs already existed in the function, so CSE
moves them up in the function and introduces a copy to the original
result register.
The introduced CSE copies are dead, since the original G_SEXTs were
already directly used. These copies add a use to the illegal G_SEXTs,
so they are not deleted.
The artifact combiner does not see the defs that need to be updated,
since it was hidden inside the CSE builder.
I see 2 potential fixes, and opted for the mechanically simpler one,
which is to just not insert the cast if the result operand isn't
used. Alternatively, we could not insert the cast directly into the
result register, and use replaceRegOrBuildCopy similar to the case
where there is no conversion.
I suspect this is a wider problem in the artifact combiner.
For AMDGPU, any use of the physical register EXEC prevents sinking even if
it is not a real physical register read. Add a check to see whether a
physical register use can be ignored for sinking.
Also perform the same constant and ignorable physical register checks when
considering sinking in loops.
https://reviews.llvm.org/D116053