Update intrinsics to use n x f16 and n x i16 instead
of 32-bit types. This may avoid the need for a bitcast
and is probably less confusing.
Depends on making v16f16 and v16i16 types legal.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D128951
GFX11 has a new message type MSG_DEALLOC_VGPRS which can be used to
release a shader's VGPRs. Sending this at the end of a shader (just
before the s_endpgm) can help overall system performance in cases where
the s_endpgm would have to wait for outstanding VMEM stores to complete
before releasing the VGPRs.
Differential Revision: https://reviews.llvm.org/D128442
Follow-up to D127894: the new liveness update code needs to handle
the case where the S_ANDN2 input must be extended through loops when
V_CNDMASK_B32 has been hoisted.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D128800
D106023 excluded 16-bit instructions from rematerialization, with the
justification that we can't rematerialize instructions that preserve
the high bits (and the instructions that do preserve them are a
confusing mess across subtargets). This doesn't make sense to me as a
problem, since cases where we would rely on the high-bit behavior would
still need to be represented as a register value constraint with a
tied operand. It's not a hidden side effect and should still be
rematerializable.
Without this, the new test case would fail with:
AMDGPUInstPrinter.cpp:545: void llvm::AMDGPUInstPrinter::printImmediate64(uint64_t, const llvm::MCSubtargetInfo &, llvm::raw_ostream &): Assertion `isUInt<32>(Imm) || Imm == 0x3fc45f306dc9c882' failed.
Differential Revision: https://reviews.llvm.org/D128435
The new resumable mca::Pipeline capability introduced in this patch
allows users to save the current state of the pipeline and resume from
that checkpoint.
It is best (though not required) to use this with the new
IncrementalSourceMgr, where users can add mca::Instruction instances
incrementally rather than providing a fixed number of instructions
ahead-of-time.
Note that we're using unit tests to exercise these new features, because
integrating them into the `llvm-mca` tool would cause too much churn.
Differential Revision: https://reviews.llvm.org/D127083
VOPD is a new encoding for dual-issue instructions for use in wave32.
This patch includes MC layer support only.
A VOPD instruction consists of an X component (for which there are
13 possible opcodes) and a Y component (for which there are the 13 X
opcodes plus 3 more). Most of the complexity in defining and parsing
a VOPD operation arises from the possible different total numbers of
operands and deferred parsing of certain operands depending on the
constituent X and Y opcodes.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D128218
waitcnt vmcnt instructions are currently generated in loop bodies before using
values loaded outside of the loop. In some cases, it is better to flush the
vmcnt counter in a loop preheader before entering the loop body. This patch
detects these cases and generates waitcnt instructions to flush the counter.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D115747
In GFX11 ShaderType is determined by the hardware and should no longer
be written into bits[3:2] of the ds_ordered_count offset field.
Differential Revision: https://reviews.llvm.org/D128196
Running iwyu-diff on the LLVM codebase since fb67d683db detected a few
regressions; this fixes them.
The impact on preprocessed output is negligible: -4k lines.
The granularity of SPI_SHADER_PGM_RSRC2_PS.EXTRA_LDS_SIZE changed
in GFX11. It is now in units of 256 dwords instead of 128 dwords.
COMPUTE_PGM_RSRC2.LDS_SIZE is unaffected. It is still in units of
128 dwords.
Differential Revision: https://reviews.llvm.org/D128179
The instructions that generate the source of dual source blend export
should run in strict-wqm. That is, if any lane in a quad is active,
we need to enable all four lanes of that quad so that the shuffling
operation before exporting to the dual source blend target works correctly.
Differential Revision: https://reviews.llvm.org/D127981
LDS_PARAM_LOAD and LDS_DIRECT_LOAD use EXEC per quad
(if any pixel is enabled in the quad, data is written
to all 4 pixels/threads in the quad).
Tag LDS_PARAM_LOAD and LDS_DIRECT_LOAD as using strict_wqm
to enforce this and avoid lane clobbering issues.
Note that only the instruction itself is tagged.
The implicit uses of these do not need to be in WQM.
This reduces unnecessary WQM calculation of M0.
Differential Revision: https://reviews.llvm.org/D127977
Detect LDS direct WAR/WAW hazards and compute values for
wait_vdst (va_vdst) parameter. Where appropriate this
raises wait_vdst from the default 0 to allow concurrent
issue of LDS direct with VALU execution.
Also detect LDS direct versus VMEM source VGPR hazards
and insert vm_vsrc=0 waits using s_waitcnt_depctr.
Differential Revision: https://reviews.llvm.org/D127963
`llvm::max(Align, MaybeAlign)` and `llvm::max(MaybeAlign, Align)` are
not used often enough to be required. They also make the code more opaque.
Differential Revision: https://reviews.llvm.org/D128121
This is a temporary measure to avoid generating incorrect code until the
compiler understands the new way that GFX11 encodes 16-bit operands in
VOP instructions.
Differential Revision: https://reviews.llvm.org/D128054
This includes:
- New llvm.amdgcn.image.msaa.load.* intrinsics
- NSA changes, because MIMG-NSA is now limited to 3 dwords
- Split CD forms of IMAGE_SAMPLE instructions out into separate
test files since they are no longer supported in GFX11
Differential Revision: https://reviews.llvm.org/D127837
Pre-gfx1030, null for sdst is different.
c97436f8b6 ([AMDGPU] Use null for dead sdst operand) requires a change
to make it not apply to pre-gfx1030 targets.
Differential Revision: https://reviews.llvm.org/D127869
The sched_barrier builtin allows the scheduler's behavior to be shaped by users
when very specific codegen is needed in order to create highly optimized code.
This patch adds more granular control over the types of instructions that are
allowed to be reordered with respect to one or multiple sched_barriers. A mask
is used to specify groups of instructions that should be allowed to be scheduled
around a sched_barrier. The details about how this mask may be used can be found in
llvm/include/llvm/IR/IntrinsicsAMDGPU.td.
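For illustration, a minimal IR sketch of the intended usage (the nonzero
mask value below is purely illustrative; the authoritative bit assignments
are documented in IntrinsicsAMDGPU.td):
```
; mask 0: nothing may be scheduled across this barrier
call void @llvm.amdgcn.sched.barrier(i32 0)
; a nonzero mask selects groups of instructions that may be scheduled
; across the barrier (illustrative value)
call void @llvm.amdgcn.sched.barrier(i32 4)
```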
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D127123
On gfx10+ the null register can be used as both a 32- and a 64-bit operand.
Define a 64 bit version of the register to use during codegen.
Differential Revision: https://reviews.llvm.org/D127527
Compared to permlane16, permlane64 has no BC input because it has no
boundary conditions, no fi input because the instruction acts as if FI
were always enabled, and no OLD input because it always writes to every
active lane.
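A minimal IR sketch of the new intrinsic (assuming the non-overloaded i32
form; in a wave64 it swaps data between the two 32-lane halves, so lane i
receives the value from lane i XOR 32):
```
%swapped = call i32 @llvm.amdgcn.permlane64(i32 %x)
```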
Also use the new intrinsic in the atomic optimizer pass.
Differential Revision: https://reviews.llvm.org/D127662
GFX11 uses different pseudos for these because of a new constraint
on which operands' registers can overlap.
Differential Revision: https://reviews.llvm.org/D127659
This uses a rotating remainder of division by 3 to select a different
temp vgpr each time in a sequence of several agpr copies.
Therefore, temp vgpr selection depends on the generated agpr
number. This number could change with any unrelated change to
the register definitions.
Stabilize the selection by using a real agpr number.
Differential Revision: https://reviews.llvm.org/D127524
The encoding of COMPUTE_TMPRING_SIZE.WAVESIZE and
SPI_TMPRING_SIZE.WAVESIZE has changed in GFX11: it is now in units
of 64 dwords instead of 256 dwords, and the field has been widened
from 13 bits to 15 bits.
Depends on D126989
Reviewed By: rampitec, arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D127248
sources to SALU and VALU instructions.
Contributors:
Baptiste Saleil <baptiste.saleil@amd.com>
Patch 20/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126989
Reviewed By: rampitec, foad, #amdgpu
Differential Revision: https://reviews.llvm.org/D127143
Add a basic implementation of isExtractSubvectorCheap that only
considers extracts at offset 0.
Differential Revision: https://reviews.llvm.org/D127385
Add new intrinsic and codegen support for the s_sendmsg_rtn_b32 and
s_sendmsg_rtn_b64 instructions.
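A hedged IR sketch of the new intrinsic (assuming the usual overload
mangling; the message id operand must be an immediate and the value 0
below is purely illustrative):
```
%v32 = call i32 @llvm.amdgcn.s.sendmsg.rtn.i32(i32 0)
%v64 = call i64 @llvm.amdgcn.s.sendmsg.rtn.i64(i32 0)
```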
Differential Revision: https://reviews.llvm.org/D127315
In GFX10 dlc controlled L1 cache bypass. In GFX11 it has been repurposed
to control MALL NOALLOC, and glc controls L1 as well as L0 cache bypass.
Update the documentation and SIMemoryLegalizer accordingly. Set dlc for
nontemporal and volatile accesses.
Differential Revision: https://reviews.llvm.org/D127405
Changes for GFX11:
- Clauses may not mix instructions of different types, and there are
more types. For example image instructions with and without a sampler
are now different types.
- The max size of a clause is explicitly documented as 63 instructions.
Previously it was implicitly assumed to be 64. This is such a tiny
difference that it does not seem worth making it conditional on the
subtarget.
- It can be beneficial to clause stores as well as loads.
Differential Revision: https://reviews.llvm.org/D127391
Supports encoding of existing instructions on gfx11, plus MC support for
the new VOPC dpp instructions.
Patch 19/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126978
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126989
Nic Curtis did the experiments to prove it is faster than a
separate mul and add.
Fixes: SWDEV-332806
Differential Revision: https://reviews.llvm.org/D127253
- VOP3 and SDWA forms of V_CMPX were not handled
- Hazard only exists if the compare defines EXEC (i.e. V_CMPX)
forwarded to the permlane.
Differential Revision: https://reviews.llvm.org/D127344
Clang-format InstructionSimplify and convert all "FunctionName"s to
"functionName". This patch does touch a lot of files but gets done with
the cleanup of InstructionSimplify in one commit.
This is the alternative to the less invasive clang-format only patch: D126783
Reviewed By: spatel, rengolin
Differential Revision: https://reviews.llvm.org/D126889
The generic legalizer framework is still used to reduce the problem
to scalar multiplication with the bit size a multiple of 32.
Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:
- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
factors and add a 64-bit accumulator. Most partial products should
use this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
product multiply-adds can feed the accumulator into each other
directly. (The register allocator's support for that is somewhat
limited, but that only matters for 128-bit integers and larger.)
- OTOH, on some hardware, V_MAD_U64_U32 requires the accumulator
to be stored in an even-aligned pair of GPRs. To avoid excessive
register copies, it makes sense to compute odd partial products
separately from even partial products (where a partial product
src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
halves together as a final step (see the worked example after this list).
- We can combine G_MUL+G_ADD into a single cascade of multiply-adds.
- The target can keep many carry-bits in flight simultaneously, so
combining carries using G_UADDE is preferable over G_ZEXT + G_ADD.
- Not addressed by this patch: When the factors are sign-extended,
the V_MAD_I64_I32 instruction (signed version!) can be used.
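As a small worked example: for a 64 x 64 -> 64-bit multiply, write
a = a1 * 2^32 + a0 and b = b1 * 2^32 + b0 with 32-bit digits. Then
  a * b = a0*b0 + (a0*b1 + a1*b0) * 2^32   (mod 2^64)
where a0*b0 is an "even" partial product (j0 + j1 = 0) and a0*b1, a1*b0
are the "odd" ones (j0 + j1 = 1); the two halves can be accumulated in
separate V_MAD_U64_U32 chains and added together at the end.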
It is difficult to address these points generically:
1) Finding matching pairs of G_MUL and G_UMULH to find a wide
multiply is expensive. We could add a G_UMUL_LOHI generic instruction
and conditionally use that in the generic legalizer, but by itself
this wouldn't allow us to use the accumulation capability of
V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
post-legalization, but this is also expensive.
2) Similarly, making sense of the legalization outcome of a wide
pre-legalization G_MUL+G_ADD pair is extremely expensive.
3) How could the generic legalizer possibly deal with the
particular idiosyncrasy of "odd" vs. "even" partial products?
All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not
be burdened with such overly target-specific concerns. Hence, a
custom legalization.
Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in
general. SelectionDAG iteratively cuts wide scalars into low and
high halves until a legal size is reached. By contrast, GlobalISel
does the narrowing in a single shot, which should be better for
compile-time and for the quality of the generated code.
This patch leaves three gaps open:
1. When the factors are uniform, we should execute the multiplication on
the SALU. Register bank mapping already ensures this.
However, the resulting code sequence is not optimal because it doesn't
fully use the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32
doesn't have a carry-in.) It is very difficult to fix this after the
fact, so we should really use a different legalization sequence in
this case. Unfortunately, we don't have a divergence analysis and so
cannot make that choice.
(This only matters for 128-bit integers and larger.)
2. Avoid unnecessary multiplies when sources are known to be zero- or
sign-extended. The challenge is that the legalizer does not currently
have access to GISelKnownBits.
3. When the G_MUL is followed by a G_ADD, we should consider combining
the two instructions into a single multiply-add sequence, to utilize
the accumulator of V_MAD_U64_U32 fully. (Unless the multiply has
multiple uses and the implied duplication of the multiply is an
overall negative). However, this is also not true when the factors
are uniform: in that case, it is generally better to *not* combine
the two operations, so that the multiply can be done on the SALU.
Again, we don't have a divergence analysis available and so cannot
make an informed choice.
Differential Revision: https://reviews.llvm.org/D124844
Includes dpp versions of VOP3P instructions.
Patch 18/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126917
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126978
The reverted dependent commit is now relanded, so reland this.
Includes dpp instructions and vop1/vop2 promoted to vop3
Patch 17/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126483
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126917
There was an issue with encoding wide (>64 bit) instructions on
BigEndian hosts, which is fixed in D127195. Therefore reland this.
gfx11 adds the ability to use dpp modifiers on vop3 instructions.
This patch adds machine code layer support for that. The MCCodeEmitter
is changed to use APInt instead of uint64_t to support these wider
instructions.
Patch 16/N for upstreaming of AMDGPU gfx11 architecture
Differential Revision: https://reviews.llvm.org/D126483
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.
Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
Includes dpp instructions and vop1/vop2 promoted to vop3
Patch 17/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126483
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126917
gfx11 adds the ability to use dpp modifiers on vop3 instructions.
This patch adds machine code layer support for that. The MCCodeEmitter
is changed to use APInt instead of uint64_t to support these wider
instructions.
Patch 16/N for upstreaming of AMDGPU gfx11 architecture
Depends on D126475
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126483
The AMDGPUResourceUsageAnalysis was previously a CGSCC pass, and assumed
that a function's callees were always analyzed prior to their callers.
When it was refactored into a module pass, this assumption no longer
always holds. This results in calls being erroneously identified as
indirect, and in private segment space being reserved for them, which
in turn significantly slows down kernel launch latency.
This patch changes the order in which the module's functions are analyzed
from the order in which they occur in the module to a post-order traversal
of the call graph. Perhaps Clang always generates the module's functions
in such an order, but this is not the case for the Cray Fortran compiler.
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D126025
This patch improves the codegen of extractelement and insertelement for
vectors containing 8 elements. Before, a DAG combine transformation was
generating a sequence of 8 select/cmp pairs.
This patch changes the upper limit for this transformation so that the
movrel instruction will eventually be used instead. Extractelement and
insertelement for vectors containing fewer than 8 elements are unchanged.
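For reference, the kind of operation affected (generic IR, illustrative):
```
; a dynamic extract from an 8-element vector: previously expanded by the
; DAG combine into 8 cmp/select pairs, now left for movrel-based selection
%elt = extractelement <8 x float> %vec, i32 %idx
```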
Differential Revision: https://reviews.llvm.org/D126389
This patch includes MC layer support for VOP3 encoded instructions and generic VOP support
classes.
Some VOP1 and VOP2 instructions which share an encoding with gfx10 and
use AssemblerPredicate = isGFX10Plus are also enabled. That predicate
will be changed to isGFX10Only in a later patch.
Patch 15/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D126468
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D126475
MC layer support for ds instructions
Contributors:
Piotr Sobczak <Piotr.Sobczak@amd.com>
Patch 14/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D126463
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D126468
Rename CalleeSavedRegs defs to avoid being overly specific:
* CSR_AMDGPU_AGPRs_32_255 => CSR_AMDGPU_AGPRs
* CSR_AMDGPU_SGPRs_30_31 + CSR_AMDGPU_SGPRs_32_105 => CSR_AMDGPU_SGPRs
* CSR_AMDGPU_SI_Gfx_SGPRs_4_29 + CSR_AMDGPU_SI_Gfx_SGPRs_64_105 =>
CSR_AMDGPU_SI_Gfx_SGPRs
* CSR_AMDGPU_HighRegs => CSR_AMDGPU
* CSR_AMDGPU_HighRegs_With_AGPRs => CSR_AMDGPU_GFX90AInsts
* CSR_AMDGPU_SI_Gfx_With_AGPRs => CSR_AMDGPU_SI_Gfx_GFX90AInsts
Introduce a class RegMask to mark the cases where we use the
CalleeSavedRegs class purely as an expedient way to produce a mask.
Update the names of these masks to not mention "CSR". Other targets also
seem to do this, so a reasonable alternative is to actually update
table-gen to include a new class to do this explicitly, but the current
approach seems harmless so I opted to just make it more explicit.
Reviewed By: arsenm, sebastian-ne
Differential Revision: https://reviews.llvm.org/D109008
Avoid the dependency on TargetInstrInfo, which depends on the subtarget
and therefore the individual function.
Currently AMDGPU is constructing PseudoSourceValue instances in MachineFunctionInfo.
In order to facilitate copying MachineFunctionInfo, we need to stop allocating these
there. Alternatively we could allow targets to subclass PseudoSourceValueManager,
and allocate them similarly to MachineFunctionInfo.
This patch implements a DAG mutation which adds edges between different groups of instructions. The purpose is to try to generate code that conforms to a pipeline (groupA instructions occur before groupB, groupB -> groupC, and so on). Currently the pipeline order is hardcoded as VMEM->DSRead->MFMA->DSWrite, but the patch was designed to be easily extensible. Alias analysis is problematic for pipelining as memory instructions will usually not be able to be reordered w.r.t one another.
Differential Revision: https://reviews.llvm.org/D125997
MC layer support for instructions in the MIMG encoding (image
instructions).
Contributors:
Carl Ritson <carl.ritson@amd.com>
Patch 13/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125992
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D126463
These generic instructions are trivially selected to
V_MAD_[IU]64_[IU]32 instructions when run on the VALU.
When at least the two factors are scalar, it is usually better to execute
some or all of the instruction on the SALU. To this end, we lower the
instruction to simpler instructions that are supported on the SALU
when applying the register bank mapping.
Differential Revision: https://reviews.llvm.org/D124843
MCSymbolizer::tryAddingSymbolicOperand() overloaded the Size parameter
to specify either the instruction size or the operand size depending on
the architecture. However, for proper symbolic disassembly on X86, we
need to know both sizes, as an instruction can have two operands, and
the instruction size cannot be reliably calculated based on the operand
offset and its size. Hence, split Size into OpSize and InstSize.
For X86, the new interface makes it possible to fix a couple of issues:
* Correctly adjust the value of PC-relative operands.
* Set operand size to zero when the operand is specified implicitly.
Differential Revision: https://reviews.llvm.org/D126101
Machine code support for FLAT type instructions
Contributors:
Sebastian Neubauer <sebastian.neubauer@amd.com>
Patch 12/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125989
Reviewed By: rampitec, #amdgpu
Differential Revision: https://reviews.llvm.org/D125992
A new instruction encoding. Some of these instructions were previously VOP3
encoded.
Contributors:
Carl Ritson <carl.ritson@amd.com>
Patch 11/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125824
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D125989
A later change will add a 3rd user, so factoring out the common code
seems useful.
Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).
(Resubmit with missing test added.)
Differential Revision: https://reviews.llvm.org/D125324
A later change will add a 3rd user, so factoring out the common code
seems useful.
Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).
Differential Revision: https://reviews.llvm.org/D125324
Even though single address image instructions only use a single VGPR,
HW accesses 4 or 5, which creates an alignment requirement.
Fixes: SWDEV-316648
Differential Revision: https://reviews.llvm.org/D126009
This brings the MachineInstrs in line with the corresponding intrinsics
which have side effects but do not access memory. It also matches how
BUF cache invalidation instructions are defined.
The lit test changes are just because the machine scheduler previously
treated them like loads, and added an artificial scheduling edge from
them to the exit SU, which caused them to be scheduled earlier.
Differential Revision: https://reviews.llvm.org/D126074
Extend SIInstrInfo::isOperandLegal to enforce a limit on the number of
literal operands for all VALU instructions, not just VOP3. In particular
it now handles VOP2 instructions with a mandatory literal operand like
V_FMAAK_F32.
Differential Revision: https://reviews.llvm.org/D126064
Extend the literal operand checking in SIInstrInfo::verifyInstruction to
check VOP2 instructions like V_FMAAK_F32 which have a mandatory literal
operand. The rule is that src0 can also be a literal, but only if it is
the same literal value.
AMDGPUAsmParser::validateConstantBusLimitations already handles this
correctly.
Differential Revision: https://reviews.llvm.org/D126063
It is already marked as having side effects, at least in MIR. It does
not interact with anything else that is modelled as a memory access
either in IR or MachineIR.
Differential Revision: https://reviews.llvm.org/D125985
s_getreg does not interact with anything else that is modelled as a
memory access either in IR or MachineIR.
Differential Revision: https://reviews.llvm.org/D125968
AMDGPUAsmParser::validateSOPLiteral already knew about this but
SIInstrInfo::verifyInstruction did not.
Differential Revision: https://reviews.llvm.org/D125976
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.
The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.
Differential Revision: https://reviews.llvm.org/D125557
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
Differential Revision: https://reviews.llvm.org/D114643
We have always had global and scratch loads to LDS on gfx9,
but did not handle them. These were available via the 'lds'
encoding bit. In gfx940 this bit was reused as 'svs' which
resulted in new '_lds' opcodes effectively pushing this
bit into the opcode, but functionally it is the same. These
instructions are also available on gfx10.
Differential Revision: https://reviews.llvm.org/D125126
MC layer support for SOP (scalar ALU operations) including encoding
support for s_delay_alu and s_sendmsg_rtn.
Contributors:
Jay Foad <jay.foad@amd.com>
Patch 7/N for upstreaming of AMDGPU gfx11 architecture.
Depends on D125319
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D125498
This reverts ffbee7acdc; see also bug 37653, which it was fixing.
The bug claims this is an undocumented feature which actually works.
In reality it is documented as not working, for a good reason.
It likely does something, but it is useless anyway. These instructions
write into the LDS. The LDS address is:
M0 + inst_offset + (TIDinWave * 4).
For a store wider than a DWORD neighboring lanes will overwrite each
other.
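For example, with a b64 store, lane 0 writes 8 bytes starting at
M0 + inst_offset + 0 while lane 1 writes 8 bytes starting at
M0 + inst_offset + 4, so lane 1's low DWORD clobbers lane 0's high DWORD.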
Differential Revision: https://reviews.llvm.org/D125409
On GFX10 VOP3 instructions can have a literal operand, so the conversion
from VOP3 MAD/FMA to VOP2 MADAK/MADMK/FMAAK/FMAMK will not happen in
SIFoldOperands. The only benefit of the VOP2 form is code size, so do it
in SIShrinkInstructions instead.
Differential Revision: https://reviews.llvm.org/D125567
Includes machine code layer support and tests, and MIR tests not requiring
CodeGen pass changes.
Includes a small change in SMInstructions.td to correct encoded bits.
Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>
Dmitry Preobrazhensky <dmitry.preobrazhensky@amd.com>
Depends on D125316
Patch 6/N for upstreaming of AMDGPU gfx11 architecture.
Reviewed By: dp, Petar.Avramovic
Differential Revision: https://reviews.llvm.org/D125319
The name `MCFixedLenDisassembler.h` is out of date after D120958.
Rename it as `MCDecoderOps.h` to reflect the change.
Reviewed By: myhsu
Differential Revision: https://reviews.llvm.org/D124987
Adds an intrinsic/builtin that can be used to fine-tune scheduler behavior.
If there is a need for highly optimized codegen and kernel developers have
knowledge of inter-wave runtime behavior which is unknown to the compiler,
this builtin can be used to tune scheduling.
This intrinsic creates a barrier between scheduling regions. The immediate
parameter is a mask to determine the types of instructions that should be
prevented from crossing the sched_barrier. In this initial patch, there are only
two variations. A mask of 0 means that no instructions may be scheduled across
the sched_barrier. A mask of 1 means that non-memory, non-side-effect inducing
instructions may cross the sched_barrier.
Note that this intrinsic is only meant to work with the scheduling passes. Any
other transformations that may move code will not be impacted in the ways
described above.
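A minimal IR sketch of the two initial variations:
```
; mask 0: no instructions may be scheduled across the sched_barrier
call void @llvm.amdgcn.sched.barrier(i32 0)
; mask 1: non-memory, non-side-effect-producing instructions may cross
call void @llvm.amdgcn.sched.barrier(i32 1)
```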
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D124700
Refactor to pass a templatized size parameter to the decoder to allow
wider-than-64-bit decodes in a later patch.
Contributors:
Jay Foad <jay.foad@amd.com>
Depends on D125261
Patch 5/N for upstreaming of AMDGPU gfx11 architecture.
Reviewed By: dp
Differential Revision: https://reviews.llvm.org/D125316
Tablegen definitions for subtarget features and cpp predicate functions to
access the features.
New subtarget processors and common latencies.
Simple changes to MIR codegen tests which pass on gfx11 because they have the
same output as previous subtargets or operate on pseudo instructions which
are reused from previous subtargets.
Contributors:
Jay Foad <jay.foad@amd.com>
Petar Avramovic <Petar.Avramovic@amd.com>
Patch 4/N for upstreaming of AMDGPU gfx11 architecture
Depends on D124538
Reviewed By: Petar.Avramovic, foad
Differential Revision: https://reviews.llvm.org/D125261
This patch adds cluster edges between independent MFMA instructions. Additionally, it propagates all predecessors of cluster insts to the root of the cluster(s), and all successors to the leaf(ves) of the cluster(s) -- this is done to remove the possibility that those insts will be interspersed within the cluster.
Reviewed By: kerbowa
Differential Revision: https://reviews.llvm.org/D124678
Only fold for uniform values on pre-GFX9 chips. GFX9+ allows us
to keep the calculation entirely on the SALU.
For subtargets where integer multiplication isn't full-rate, avoid
folding if the multiply has too many uses.
Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.
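For reference, the kind of pattern that now expands into a MAD-based
chain (generic IR, illustrative):
```
%m = mul i64 %a, %b
%r = add i64 %m, %c   ; mul feeding an add: lowered using the MAD accumulator
```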
Differential Revision: https://reviews.llvm.org/D123835
As suggested by @foad on D124839:
If we're extracting a vector element that originally came from a scalar_to_vector, then avoid the bitcasting of a vector type and perform the shift masking on the (any-extended) scalar source directly, making use of the fact that the upper elements of a scalar_to_vector are all undef.
Differential Revision: https://reviews.llvm.org/D125173
FeatureAtomicFaddInsts is replaced with three more granular features.
Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>
Patch 3/N for upstreaming of AMDGPU gfx11 architecture
Depends on D124537
Reviewed By: foad, #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D124538
Given a DPP mov like this:
%2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
...
%3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec
this patch just removes a check that %2 (the "old reg") was defined in
the same BB as the DPP mov instruction. GCNDPPCombine requires that the
MIR is in SSA form so I don't understand why the BB matters.
This lets the optimization work in more real world cases when the
definition of %2 gets hoisted out of a loop.
Differential Revision: https://reviews.llvm.org/D124182
The image.sample instruction can be forced to return a half type instead
of float when the d16 flag is enabled.
This patch adds a new pattern in InstCombine to detect if the output of
image.sample is used later only by an fptrunc which converts the type
from float to half. If the pattern is detected, the fptrunc and
image.sample are combined into a single image.sample that returns the
half type. Later, during lowering, the d16 flag is added to the image
sample intrinsic.
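A hedged IR sketch of the combine (1d form, assuming the usual sample
intrinsic signature; shown only as an illustration):
```
; before: full-precision sample followed by a truncate
%s = call float @llvm.amdgcn.image.sample.1d.f32.f32(i32 1, float %u, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
%h = fptrunc float %s to half
; after: a single d16 sample returning half directly
%h2 = call half @llvm.amdgcn.image.sample.1d.f16.f32(i32 1, float %u, <8 x i32> %rsrc, <4 x i32> %samp, i1 false, i32 0, i32 0)
```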
Differential Revision: https://reviews.llvm.org/D124232
Introduces a string attribute, amdgpu-requires-module-lds, to allow
eliding the module.lds block from kernels. Will allocate the block as before
if the attribute is missing or has its default value of true.
Patch uses the new attribute to detect the simplest possible instance of this,
where a kernel makes no calls and thus cannot call any functions that use LDS.
Tests updated to match; coverage was already good. An interesting case is in
lower-module-lds-offsets, where annotating the kernel allows the backend to pick
a different (in this case better) variable ordering than previously. A later
patch will avoid moving kernel variables into module.lds when the kernel can
have this attribute, allowing optimal ordering and locally unused variable
elimination.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122091
by moving around the code and by adding more comments, which would
later help during any required clean-up.
Differential Revision: https://reviews.llvm.org/D124733
Refactor to simplify a follow-up change.
No functional change intended. However, there is a rather subtle logic
change: the subsequent combines (e.g. reassociation) are skipped *always*
when one of the operands of the add is a mul, instead of only when
additionally mad64_32 etc. are available. This change makes sense because
the subsequent combines should never apply when one of the operands is a
mul.
Differential Revision: https://reviews.llvm.org/D123833
MUBUF and FLAT LDS DMA operations need a wait on vmcnt before the LDS
they have written can be accessed. A load from LDS to VMEM does not need a wait.
Differential Revision: https://reviews.llvm.org/D124626
This is the first patch of a series to upstream support for the new
subtarget.
Contributors:
Jay Foad <jay.foad@amd.com>
Konstantin Zhuravlyov <kzhuravl_dev@outlook.com>
Patch 1/N for upstreaming AMDGPU gfx11 architectures.
Reviewed By: foad, kzhuravl, #amdgpu
Differential Revision: https://reviews.llvm.org/D124536
As older waves execute long sequences of VALU instructions, they may
prevent younger waves from performing address calculation and then issuing
their VMEM loads, which in turn leaves the VALU unit idle. This patch tries
to prevent this by temporarily raising the wave's priority.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D124246
Add these bits to the MUBUF and FLAT LDS DMA instructions:
- LGKM_CNT - these operate on LDS;
- VALU - SPG 3.9.8: This instruction acts as both a MUBUF and
VALU instruction;
Codegen currently does not produce any of this, so the change is NFC.
Differential Revision: https://reviews.llvm.org/D124472
The image.sample instruction can be forced to return a half type instead
of float when the d16 flag is enabled.
This patch adds a new pattern in InstCombine to detect if the output of
image.sample is used later only by an fptrunc which converts the type
from float to half. If the pattern is detected, the fptrunc and
image.sample are combined into a single image.sample that returns the
half type. Later, during lowering, the d16 flag is added to the image
sample intrinsic.
Differential Revision: https://reviews.llvm.org/D124232
The builtin predicate handling has a strange behavior where the code
assumes that a PatFrag is a stack of PatFrags, and each level adds at
most one predicate. I don't think this particularly makes sense,
especially without a diagnostic to ensure you aren't trying to set
multiple at once.
This wasn't followed for address spaces and alignment, which could
potentially fall through to report no builtin predicate was
added. Just switch these to follow the existing convention for now.
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.
Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.
Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
This is to avoid relying on the post-isel hook.
This change also enable the saddr pattern selection for atomic
intrinsics in GlobalISel.
Differential Revision: https://reviews.llvm.org/D123583
Here is the performance data:
```
Using platform: AMD Accelerated Parallel Processing
Using device: gfx900:xnack-
ds_write_b64 aligned by 8: 3.2 sec
ds_write2_b32 aligned by 8: 3.2 sec
ds_write_b16 * 4 aligned by 8: 7.0 sec
ds_write_b8 * 8 aligned by 8: 13.2 sec
ds_write_b64 aligned by 1: 7.3 sec
ds_write2_b32 aligned by 1: 7.5 sec
ds_write_b16 * 4 aligned by 1: 14.0 sec
ds_write_b8 * 8 aligned by 1: 13.2 sec
ds_write_b64 aligned by 2: 7.3 sec
ds_write2_b32 aligned by 2: 7.5 sec
ds_write_b16 * 4 aligned by 2: 7.1 sec
ds_write_b8 * 8 aligned by 2: 13.3 sec
ds_write_b64 aligned by 4: 4.6 sec
ds_write2_b32 aligned by 4: 3.2 sec
ds_write_b16 * 4 aligned by 4: 7.1 sec
ds_write_b8 * 8 aligned by 4: 13.3 sec
ds_read_b64 aligned by 8: 2.3 sec
ds_read2_b32 aligned by 8: 2.2 sec
ds_read_u16 * 4 aligned by 8: 4.8 sec
ds_read_u8 * 8 aligned by 8: 8.6 sec
ds_read_b64 aligned by 1: 4.4 sec
ds_read2_b32 aligned by 1: 7.3 sec
ds_read_u16 * 4 aligned by 1: 14.0 sec
ds_read_u8 * 8 aligned by 1: 8.7 sec
ds_read_b64 aligned by 2: 4.4 sec
ds_read2_b32 aligned by 2: 7.3 sec
ds_read_u16 * 4 aligned by 2: 4.8 sec
ds_read_u8 * 8 aligned by 2: 8.7 sec
ds_read_b64 aligned by 4: 4.4 sec
ds_read2_b32 aligned by 4: 2.3 sec
ds_read_u16 * 4 aligned by 4: 4.8 sec
ds_read_u8 * 8 aligned by 4: 8.7 sec
Using platform: AMD Accelerated Parallel Processing
Using device: gfx1030
ds_write_b64 aligned by 8: 4.4 sec
ds_write2_b32 aligned by 8: 4.3 sec
ds_write_b16 * 4 aligned by 8: 7.9 sec
ds_write_b8 * 8 aligned by 8: 13.0 sec
ds_write_b64 aligned by 1: 23.2 sec
ds_write2_b32 aligned by 1: 23.1 sec
ds_write_b16 * 4 aligned by 1: 44.0 sec
ds_write_b8 * 8 aligned by 1: 13.0 sec
ds_write_b64 aligned by 2: 23.2 sec
ds_write2_b32 aligned by 2: 23.1 sec
ds_write_b16 * 4 aligned by 2: 7.9 sec
ds_write_b8 * 8 aligned by 2: 13.1 sec
ds_write_b64 aligned by 4: 13.5 sec
ds_write2_b32 aligned by 4: 4.3 sec
ds_write_b16 * 4 aligned by 4: 7.9 sec
ds_write_b8 * 8 aligned by 4: 13.1 sec
ds_read_b64 aligned by 8: 3.5 sec
ds_read2_b32 aligned by 8: 3.4 sec
ds_read_u16 * 4 aligned by 8: 5.3 sec
ds_read_u8 * 8 aligned by 8: 8.5 sec
ds_read_b64 aligned by 1: 13.1 sec
ds_read2_b32 aligned by 1: 22.7 sec
ds_read_u16 * 4 aligned by 1: 43.9 sec
ds_read_u8 * 8 aligned by 1: 7.9 sec
ds_read_b64 aligned by 2: 13.1 sec
ds_read2_b32 aligned by 2: 22.7 sec
ds_read_u16 * 4 aligned by 2: 5.6 sec
ds_read_u8 * 8 aligned by 2: 7.9 sec
ds_read_b64 aligned by 4: 13.1 sec
ds_read2_b32 aligned by 4: 3.4 sec
ds_read_u16 * 4 aligned by 4: 5.6 sec
ds_read_u8 * 8 aligned by 4: 7.9 sec
```
GFX10 exposes a different pattern for sub-DWORD load/store performance
than GFX9. On GFX9 it is faster to issue a single unaligned load or
store than a fully split b8 access, whereas on GFX10 even a full split
is better. However, this is only a theoretical gain, because splitting
an access to the sub-dword level requires more registers and
packing/unpacking logic, so ignoring that option it is better to use a
single 64-bit instruction on misaligned data, with the exception of
4-byte-aligned data, where ds_read2_b32/ds_write2_b32 is better.
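For example (illustrative IR), a 2-byte-aligned 64-bit LDS access such as
```
store i64 %v, i64 addrspace(3)* %p, align 2
```
can now stay a single ds_write_b64 rather than being split, per the
conclusion above.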
Differential Revision: https://reviews.llvm.org/D123956
Fix isVCC for a register that was assigned a register class during
instruction selection. This happens when the register has multiple uses.
For wave32, a uniform i1-to-vcc copy was selected like a vcc-to-vcc
copy when the uniform i1 had an assigned register class.
A uniform i1 register with an assigned register class will have s1 LLT,
be defined using G_TRUNC, and its class will be SReg_32RegClass.
A vcc i1 register with an assigned register class will have s1 LLT, its
class will be SReg_32RegClass for wave32 and SReg_64RegClass for wave64,
and the register will not be defined by G_TRUNC.
Differential Revision: https://reviews.llvm.org/D124163
This fixes the assertion failure "Loop in the Block Graph!".
SIMachineScheduler groups instructions into blocks (also referred to
as coloring or groups) and then performs a two-level scheduling:
inter-block scheduling, and intra-block scheduling.
This approach requires that the dependency graph on the blocks which
is obtained by contracting the blocks in the original dependency graph
is acyclic. In other words: Whenever A and B end up in the same block,
all vertices on a path from A to B must be in the same block.
When compiling an example consisting of an export followed by
a buffer store, we see a dependency between these two. This dependency
may be false, but that is a different issue.
This dependency was not correctly accounted for by SIMachineScheduler.
A new test case si-scheduler-exports.ll demonstrating this is
also added in this commit.
The problematic part of SIMachineScheduler was a post-optimization of
the block assignment that tried to group all export instructions into
a separate export block for better execution performance. This routine
correctly checked that any paths from exports to exports did not
contain any non-exports, but not vice-versa: In case of an export with
a non-export successor dependency, that single export was moved
to a separate block, which could then be both a successor and a
predecessor block of a non-export block.
As fix, we now skip export grouping if there are exports with direct
non-export successor dependencies. This fixes the issue at hand,
but is slightly pessimistic:
We *could* group all exports into a separate block that have neither
direct nor indirect export successor dependencies.
We will review the potential performance impact and potentially
revisit with a more sophisticated implementation.
Note that just grouping all exports without direct non-export successor
dependencies could still lead to illegal blocks, since non-export A
could depend on export B that depends on export C. In that case,
export C has no non-export successor, but still may not be grouped
into an export block.
Based on available register budget, reserve highest available VGPR for
AGPR copy before RA. After RA, shift it to the lowest unused VGPR if one
exists.
Fixes SWDEV-330006.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123525
When folding a COPY of exec into another COPY, the call to
TII->isOperandLegal would crash because COPYs don't have defined
register classes for their operands.
Differential Revision: https://reviews.llvm.org/D122737
These don't seem to be very well used or tested, but try to make the
behavior a bit more consistent with LDS globals.
I'm not sure what the definition for amdgpu-gds-size is supposed to
mean. For now I assumed it's allocating a static size at the beginning
of the allocation, and any known globals are allocated after it.
There's no reason to create these immediately. They can be created in
the prolog/epilog code like CSR spills. There's probably a cleaner way
to do this by utilizing the CSR spill code.
This makes the frame index used transient state for
PrologEpilogInserter, and thus makes serialization easier. Really this
doesn't need to be saved here but there isn't really a better place
for it.
The sramecc feature changes the behaviour of d16 loads so they do not
preserve the unused 16 bits of the result register, but it has no impact
on d16 stores, so we should make use of them even when the feature is
enabled.
Differential Revision: https://reviews.llvm.org/D104912
Instead of lengthy constructors we can now set the members of a
read-only struct before the Attributor is created. Should make it
clearer what is configurable and also help introducing new options in
the future. This actually added IsModulePass and avoids deduction
through the Function set size. No functional change was intended.
This is an NFC patch in preparation to fix a bug related to always
reserving VGPR32 for AGPR copy.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123651
This code was pattern matching the ID computation expression as it
appears in the library. This was a compare and select, but now that
umin is canonical, we were no longer matching. Update to match the
intrinsic instead.
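For reference, the shape of the change (generic IR):
```
; the ID computation used to reach the matcher as compare+select:
%c = icmp ult i32 %a, %b
%s = select i1 %c, i32 %a, i32 %b
; ...but is now canonicalized by instcombine to:
%s2 = call i32 @llvm.umin.i32(i32 %a, i32 %b)
```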
For stores to constant address space, this will now consistently hit a
selection error instead of hitting unreachable in an asserts build.
I'm not sure what we should really do here. We could either just
codegen as if it were global, delete the instruction, or declare the
IR invalid (we really should have a target IR verifier to enforce it).
Summary:
Introduce a new function attribute, amdgpu-no-multigrid-sync-arg, which is set by default.
We use implicitarg_ptr + offset to check whether the multigrid synchronization
pointer is used. If yes, we remove this attribute and also remove
amdgpu-no-implicitarg-ptr. We generate metadata for the hidden_multigrid_sync_arg
only when the amdgpu-no-multigrid-sync-arg attribute is removed from the function.
Reviewers: arsenm, sameerds, b-sumner and foad
Differential Revision: https://reviews.llvm.org/D123548
Use default member initializers in AMDGPUSubtarget and subclasses. This
is to guard against adding a new feature boolean in AMDGPUSubtarget.h
but forgetting to initialize it to false in AMDGPUSubtarget.cpp.
This was mostly autogenerated by:
clang-tidy -checks=-*,cppcoreguidelines-prefer-member-initializer,modernize-use-default-member-init -header-filter=Subtarget -fix lib/Target/AMDGPU/*Subtarget.cpp
Differential Revision: https://reviews.llvm.org/D123613
Move feature/bug checks into a single place,
allowsMisalignedMemoryAccessesImpl.
This is mostly NFCI except for the order of selection in a couple of places.
A separate change may be needed to stop lying about Fast.
Differential Revision: https://reviews.llvm.org/D123343
If the CFG structure of a waterfall loop is not the expected shape
then gracefully abort traversing the IR for the given loop.
This applies to nested waterfall loops, which are not supported by
the VGPR live range optimizer.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D123480
Summary:
In emitting metadata for implicit kernel arguments, we need to be in sync with the actual loads
to align the implicit kernel argument segment to an 8-byte boundary. In this work, we simply force
this alignment through the first implicit argument.
In addition, we don't emit metadata for any implicit kernel argument if none of them is actually used.
Reviewers: arsenm, b-sumner
Differential Revision: https://reviews.llvm.org/D123346
The AMDGPUISD::SETCC node is like ISD::SETCC, but returns a lane mask
instead of a per-lane boolean. The lane mask is uniform.
This improves instruction selection for code patterns like
ctpop(ballot(x)), which can now use an S_BCNT1_* instruction instead
of V_BCNT_*.
GlobalISel already selects scalar instructions (an earlier commit
added a test case).
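A sketch of the motivating pattern (IR level):
```
; a uniform lane mask whose population count can now be selected to
; S_BCNT1_* instead of V_BCNT_*
%mask = call i64 @llvm.amdgcn.ballot.i64(i1 %cond)
%cnt = call i64 @llvm.ctpop.i64(i64 %mask)
```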
Differential Revision: https://reviews.llvm.org/D123432
D67148 has removed TTI::getNumberOfRegisters(bool Vector) and
started to call TTI::getNumberOfRegisters(unsigned ClassID) from
LoopVectorize. This has resulted in unrestricted vectorization on AMDGPU,
blowing up register pressure.
Differential Revision: https://reviews.llvm.org/D122850
Enable the PreRARematerialize pass when there are multiple high RP scheduling
regions present. Require the occupancy in all high RP regions be improved
before finalizing sinking. If any high RP region did not improve in occupancy
then undo all sinking and restore the state from before the pass.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D122501
We found that it might be beneficial to have the SIOptimizeExecMasking
pass detect more cases where v_cmp, s_and_saveexec patterns can be
transformed to s_mov, v_cmpx patterns. Currently, the search range
for finding a fitting v_cmp instruction is 5; this is doubled
to 10 here.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D123367
Currently, the utility supports lowering of non-atomic memory transfer routines only. This patch adds support for the atomic version of memcpy. This may be useful for targets not supporting atomic memcpy.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D118443
It was only handled for FLAT initially because we did not have lowering
for unaligned DS instructions. That lowering is now implemented, but the
bug was not handled.
Differential Revision: https://reviews.llvm.org/D123338
There is no need to fully scalarize an unaligned operation in some
cases; it is enough to split it down to the known alignment.
Differential Revision: https://reviews.llvm.org/D123330
Summary:
If the implicitarg_ptr intrinsic is not used, set the implicit kernarg size
to 0; otherwise set it to 256 bytes for code object version 5 (and beyond).
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D123262
Use new NotAtomic expansion to turn these into the equivalent
non-atomic operations. Independent lanes cannot access the private
memory of other lanes, so there's no possibility for synchronization.
These don't really appear directly in user code, but
InferAddressSpaces can make these appear after optimizations.
Fixes issues 54693 and 54274.
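A minimal sketch of the private-memory expansion this enables
(illustrative IR):
```
; an atomicrmw on private (scratch) memory...
%old = atomicrmw add i32 addrspace(5)* %p, i32 1 monotonic
; ...can be rewritten as the equivalent non-atomic sequence:
%v = load i32, i32 addrspace(5)* %p
%n = add i32 %v, 1
store i32 %n, i32 addrspace(5)* %p
```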
static_cast is a little safer here since the compiler will
ensure we're casting to a class derived from
yaml::MachineFunctionInfo.
I believe this first appeared on AMDGPU and was copied to the
other two targets.
Spotted when it was being copied to RISCV in D123178.
Differential Revision: https://reviews.llvm.org/D123260
Handle the llvm.r600.* intrinsics which are still in use in libclc. I
thought it would be possible to switch it to using
llvm.amdgcn.implicitarg.ptr already, but it turns out the implicit
arguments are currently split into a piece before and after the
explicit kernel arguments.
The diagnostic is unreliable, and triggers even for dead uses of
hostcall that may exist when linking the device-libs at lower
optimization levels.
Eliminate the diagnostic, and directly document the limitation for
OpenCL before code object V5.
Make some NFC changes to clarify the related code in the
MetadataStreamer.
Add a clang test to tie OCL sources containing printf to the backend IR
tests for this situation.
Reviewed By: sameerds, arsenm, yaxunl
Differential Revision: https://reviews.llvm.org/D121951
Ignore all debug uses when collecting trivially rematerializable defs. This fixes an issue with difference in codegen when enabling debug info.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D123048
Previously any load (global, local or constant) feeding into a
global load or store would be counted as an indirect access. This
patch only counts global loads feeding into a global load or store.
The rationale is that the latency for global loads is generally
much larger than the other kinds.
As a side effect this makes it easier to write small kernel test
cases that are not counted as having indirect accesses, despite
the fact that arguments to the kernel are accessed with an SMEM
load.
Differential Revision: https://reviews.llvm.org/D122804
Whenever a v_cmp, s_and_saveexec instruction sequence is to be
transformed into an equivalent s_mov, v_cmpx sequence, we need to
detect whether the v_cmp target register is used between the two
instructions, since the v_cmp result gets omitted when the v_cmpx
instruction is used, which would otherwise result in invalid code.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D122797
Summary:
To compute the size of a VALU/SALU instruction, we need to check whether an operand
could ever be literal. Previously isLiteralConstant was used, which missed cases
like global variables or external symbols. These misses lead to under-estimation of
the instruction size and branch offset, and thus incorrectly skip the necessary branch
relaxation when the branch offset is actually greater than what the branch bits can hold.
In this work, we use isLiteralConstantLike to check the operands. It may be conservative,
but it is safe.
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D122778
Summary:
hasHostcallPtr() and hasHeapPtr() are only used in metadata emit.
However, we can use the corresponding function attributes directly
instead of introducing these functions.
Reviewers: arsenm
Differential Revision: https://reviews.llvm.org/D122600
Revision https://reviews.llvm.org/D122332 added a pattern transformation
where v_cmpx instructions are introduced. However, the modifiers are
not correctly inherited from the original operands. The patch
adds the source modifiers if they exist, or sets them to 0.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D122489
Split waterfall loops into multiple blocks so that exec mask
manipulation (s_and_saveexec) does not occur in the middle of
a block.
VGPR live range optimizer is updated to handle waterfall loops
spanning multiple blocks.
Reviewed By: ruiling
Differential Revision: https://reviews.llvm.org/D122200
All LLVM backends use MCDisassembler as a base class for their
instruction decoders. Use "const MCDisassembler *" for the decoder
instead of "const void *". Remove unnecessary static casts.
Reviewed By: skan
Differential Revision: https://reviews.llvm.org/D122245
On GFX10.3 targets, the following instruction sequence
v_cmp_* SGPR, ...
s_and_saveexec ..., SGPR
leads to a fairly long stall caused by a VALU write to an SGPR and having the
following SALU wait for the SGPR.
An equivalent sequence is to save the exec mask manually instead of letting
s_and_saveexec do the work and use a v_cmpx instruction instead to do the
comparison.
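Schematically, the replacement sequence then has the shape
s_mov_b32 SGPR', exec_lo
v_cmpx_* ...
with v_cmpx writing EXEC directly (wave64 uses the b64 forms; this is
only a sketch of the shape, not an exact listing).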
This patch modifies the SIOptimizeExecMasking pass as this is the last position
where s_and_saveexec instructions are inserted. It does the transformation by
trying to find the pattern, extracting the operands and generating the new
instruction sequence.
It also changes some existing lit tests and introduces a few new tests to show
the changed behavior on GFX10.3 targets.
Same as D119696 including a buildbot and MIR test fix.
Reviewed By: critson
Differential Revision: https://reviews.llvm.org/D122332
Since a table was introduced for MAI instructions, extend it
for use in DGEMM classification.
Differential Revision: https://reviews.llvm.org/D122337
In some cases padding bubbles between sequential MFMA instructions may
lead to increased inter-wave performance. Add an option to request padding
of some portion of these stall cycles with s_nops.
Fixes: SWDEV-326925
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D121437
Also update the function indirectCopyToAGPR() to ensure that it is called only on the GFX908 sub-target.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D122286
First, add code to reserve all required special purpose registers,
followed by code to reserve SGPRs, followed by code to reserve
VGPRs/AGPRs.
This patch is prepared as a pre-requisite to fix an issue related to
GFX90A hardware.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122219
In the frame index lowering we have to insert shift and add
instructions to adjust stack object access. We need to take care of the stack
object user kind and use scalar shift/add for scalar users.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D121524