llvm-project

Commit Graph

Author	SHA1	Message	Date
Abinav Puthan Purayil	561af89fed	[AMDGPU] Use a wrapper multiclass for buffer atomic intrinsic patterns. NFC	2022-04-22 13:59:34 +05:30
Abinav Puthan Purayil	272a876804	[AMDGPU] Rename the FlatSignedIntrPat multiclass to FlatSignedAtomicIntrPat. NFC	2022-04-22 11:47:23 +05:30
Abinav Puthan Purayil	2147b6c89d	[AMDGPU] Remove no-ret atomic ops selection in the post-isel hook No-ret atomic ops are now selected in tblgen. Differential Revision: https://reviews.llvm.org/D124086	2022-04-22 09:37:41 +05:30
Abinav Puthan Purayil	165ae7276c	[AMDGPU] Remove atomic pattern args in FLAT_[Global_]Atomic_Pseudo defs We already have explicit patterns for these. Differential Revision: https://reviews.llvm.org/D124084	2022-04-22 09:37:40 +05:30
Abinav Puthan Purayil	f935908d7b	[AMDGPU] Select no-return DS_PK_ADD_F16 in tblgen Differential Revision: https://reviews.llvm.org/D123584	2022-04-22 09:37:40 +05:30
Abinav Puthan Purayil	45ca94334e	[AMDGPU] Select no-return atomic intrinsics in tblgen This is to avoid relying on the post-isel hook. This change also enables the saddr pattern selection for atomic intrinsics in GlobalISel. Differential Revision: https://reviews.llvm.org/D123583	2022-04-22 09:37:40 +05:30
Stanislav Mekhanoshin	ac94073daa	[AMDGPU] Refine 64 bit misaligned LDS ops selection Here is the performance data: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b64 aligned by 8: 3.2 sec ds_write2_b32 aligned by 8: 3.2 sec ds_write_b16 * 4 aligned by 8: 7.0 sec ds_write_b8 * 8 aligned by 8: 13.2 sec ds_write_b64 aligned by 1: 7.3 sec ds_write2_b32 aligned by 1: 7.5 sec ds_write_b16 * 4 aligned by 1: 14.0 sec ds_write_b8 * 8 aligned by 1: 13.2 sec ds_write_b64 aligned by 2: 7.3 sec ds_write2_b32 aligned by 2: 7.5 sec ds_write_b16 * 4 aligned by 2: 7.1 sec ds_write_b8 * 8 aligned by 2: 13.3 sec ds_write_b64 aligned by 4: 4.6 sec ds_write2_b32 aligned by 4: 3.2 sec ds_write_b16 * 4 aligned by 4: 7.1 sec ds_write_b8 * 8 aligned by 4: 13.3 sec ds_read_b64 aligned by 8: 2.3 sec ds_read2_b32 aligned by 8: 2.2 sec ds_read_u16 * 4 aligned by 8: 4.8 sec ds_read_u8 * 8 aligned by 8: 8.6 sec ds_read_b64 aligned by 1: 4.4 sec ds_read2_b32 aligned by 1: 7.3 sec ds_read_u16 * 4 aligned by 1: 14.0 sec ds_read_u8 * 8 aligned by 1: 8.7 sec ds_read_b64 aligned by 2: 4.4 sec ds_read2_b32 aligned by 2: 7.3 sec ds_read_u16 * 4 aligned by 2: 4.8 sec ds_read_u8 * 8 aligned by 2: 8.7 sec ds_read_b64 aligned by 4: 4.4 sec ds_read2_b32 aligned by 4: 2.3 sec ds_read_u16 * 4 aligned by 4: 4.8 sec ds_read_u8 * 8 aligned by 4: 8.7 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b64 aligned by 8: 4.4 sec ds_write2_b32 aligned by 8: 4.3 sec ds_write_b16 * 4 aligned by 8: 7.9 sec ds_write_b8 * 8 aligned by 8: 13.0 sec ds_write_b64 aligned by 1: 23.2 sec ds_write2_b32 aligned by 1: 23.1 sec ds_write_b16 * 4 aligned by 1: 44.0 sec ds_write_b8 * 8 aligned by 1: 13.0 sec ds_write_b64 aligned by 2: 23.2 sec ds_write2_b32 aligned by 2: 23.1 sec ds_write_b16 * 4 aligned by 2: 7.9 sec ds_write_b8 * 8 aligned by 2: 13.1 sec ds_write_b64 aligned by 4: 13.5 sec ds_write2_b32 aligned by 4: 4.3 sec ds_write_b16 * 4 aligned by 
4: 7.9 sec ds_write_b8 * 8 aligned by 4: 13.1 sec ds_read_b64 aligned by 8: 3.5 sec ds_read2_b32 aligned by 8: 3.4 sec ds_read_u16 * 4 aligned by 8: 5.3 sec ds_read_u8 * 8 aligned by 8: 8.5 sec ds_read_b64 aligned by 1: 13.1 sec ds_read2_b32 aligned by 1: 22.7 sec ds_read_u16 * 4 aligned by 1: 43.9 sec ds_read_u8 * 8 aligned by 1: 7.9 sec ds_read_b64 aligned by 2: 13.1 sec ds_read2_b32 aligned by 2: 22.7 sec ds_read_u16 * 4 aligned by 2: 5.6 sec ds_read_u8 * 8 aligned by 2: 7.9 sec ds_read_b64 aligned by 4: 13.1 sec ds_read2_b32 aligned by 4: 3.4 sec ds_read_u16 * 4 aligned by 4: 5.6 sec ds_read_u8 * 8 aligned by 4: 7.9 sec ``` GFX10 exposes a different pattern for sub-DWORD load/store performance than GFX9. On GFX9 it is faster to issue a single unaligned load or store than a fully split b8 access, whereas on GFX10 even a full split is better. However, this is only a theoretical gain because splitting an access to a sub-dword level will require more registers and packing/unpacking logic, so ignoring this option it is better to use a single 64 bit instruction on misaligned data, with the exception of 4 byte aligned data where ds_read2_b32/ds_write2_b32 is better. Differential Revision: https://reviews.llvm.org/D123956	2022-04-21 09:37:16 -07:00
Petar Avramovic	e06290e53f	AMDGPU/GlobalISel: Fix isVCC for uniform s1 with reg class on wave32 Fix isVCC for register that was assigned register class during inst-selection. This happens when register has multiple uses. For wave32, uniform i1 to vcc copy was selected like vcc to vcc copy when uniform i1 had assigned register class. Uniform i1 register with assigned register class will have s1 LLT, be defined using G_TRUNC and class will be SReg_32RegClass. Vcc i1 register with assigned register class will have s1 LLT, class will be SReg_32RegClass for wave32 and SReg_64RegClass for wave64 and register will not be defined by G_TRUNC. Differential Revision: https://reviews.llvm.org/D124163	2022-04-21 16:12:04 +02:00
Jannik Silvanus	607f8ced39	[AMDGPU]: Fix failing assertion in SIMachineScheduler This fixes the assertion failure "Loop in the Block Graph!". SIMachineScheduler groups instructions into blocks (also referred to as coloring or groups) and then performs a two-level scheduling: inter-block scheduling, and intra-block scheduling. This approach requires that the dependency graph on the blocks which is obtained by contracting the blocks in the original dependency graph is acyclic. In other words: Whenever A and B end up in the same block, all vertices on a path from A to B must be in the same block. When compiling an example consisting of an export followed by a buffer store, we see a dependency between these two. This dependency may be false, but that is a different issue. This dependency was not correctly accounted for by SIMachineScheduler. A new test case si-scheduler-exports.ll demonstrating this is also added in this commit. The problematic part of SIMachineScheduler was a post-optimization of the block assignment that tried to group all export instructions into a separate export block for better execution performance. This routine correctly checked that any paths from exports to exports did not contain any non-exports, but not vice-versa: In case of an export with a non-export successor dependency, that single export was moved to a separate block, which could then be both a successor and a predecessor block of a non-export block. As a fix, we now skip export grouping if there are exports with direct non-export successor dependencies. This fixes the issue at hand, but is slightly pessimistic: We could group all exports into a separate block that have neither direct nor indirect export successor dependencies. We will review the potential performance impact and potentially revisit with a more sophisticated implementation. 
Note that just grouping all exports without direct non-export successor dependencies could still lead to illegal blocks, since non-export A could depend on export B that depends on export C. In that case, export C has no non-export successor, but still may not be grouped into an export block.	2022-04-21 14:52:29 +01:00
Dmitry Preobrazhensky	81af32b9a3	[AMDGPU][MC][NFC][GFX940] Corrected an error position Differential Revision: https://reviews.llvm.org/D124099	2022-04-21 14:04:46 +03:00
Dmitry Preobrazhensky	b4231ac4be	[AMDGPU][GFX90A+] Disabled ds_ordered_count and exp Differential Revision: https://reviews.llvm.org/D124087	2022-04-21 13:16:44 +03:00
hsmahesha	5bd87350a5	[AMDGPU] On gfx908, reserve VGPR for AGPR copy based on register budget. Based on available register budget, reserve highest available VGPR for AGPR copy before RA. After RA, shift it to lowest unused VGPR if one exists. Fixes SWDEV-330006. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D123525	2022-04-21 07:57:26 +05:30
Stanislav Mekhanoshin	aa14e2ef3e	[AMDGPU] Remove obsolete hack from allowsMisalignedMemoryAccesses. NFCI. Differential Revision: https://reviews.llvm.org/D124035	2022-04-20 11:52:56 -07:00
Jay Foad	879ac41089	[AMDGPU] Fix crash in SIOptimizeExecMaskingPreRA When folding a COPY of exec into another COPY, the call to TII->isOperandLegal would crash because COPYs don't have defined register classes for their operands. Differential Revision: https://reviews.llvm.org/D122737	2022-04-20 14:42:48 +01:00
Jay Foad	1f91512268	[AMDGPU] Simplify calls to getDefSrcRegIgnoringCopies. NFC. getDefSrcRegIgnoringCopies never returns None on valid MIR.	2022-04-20 12:37:24 +01:00
Abinav Puthan Purayil	b7df71524e	[AMDGPU][GlobalISel] Force return atomic selection for now	2022-04-20 16:00:08 +05:30
Fangrui Song	bec8dff33e	[AMDGPU] Fix -Wunused-variable in -DLLVM_ENABLE_ASSERTIONS=off builds	2022-04-19 22:36:58 -07:00
Matt Arsenault	1900b6c77b	AMDGPU: Add assert for GDS globals	2022-04-19 22:28:11 -04:00
Matt Arsenault	987df725ac	AMDGPU: Serialize VGPRForAGPRCopy	2022-04-19 22:14:52 -04:00
Matt Arsenault	b5ec131267	AMDGPU: Fix allocating GDS globals to LDS offsets These don't seem to be very well used or tested, but try to make the behavior a bit more consistent with LDS globals. I'm not sure what the definition for amdgpu-gds-size is supposed to mean. For now I assumed it's allocating a static size at the beginning of the allocation, and any known globals are allocated after it.	2022-04-19 22:14:48 -04:00
Matt Arsenault	378bb8014d	AMDGPU: Serialize a few more MachineFunctionInfo fields in MIR	2022-04-19 22:12:59 -04:00
Matt Arsenault	f90f4884c8	AMDGPU: Serialize gds size in MIR	2022-04-19 22:12:59 -04:00
Matt Arsenault	5cd17f9d43	AMDGPU: Serialize WWM registers	2022-04-19 21:44:43 -04:00
Matt Arsenault	e0d585d75a	AMDGPU: Defer creation of WWM VGPR spill slots There's no reason to create these immediately. They can be created in the prolog/epilog code like CSR spills. There's probably a cleaner way to do this by utilizing the CSR spill code. This makes the frame index used transient state for PrologEpilogInserter, and thus makes serialization easier. Really this doesn't need to be saved here but there isn't really a better place for it.	2022-04-19 21:07:13 -04:00
Matt Arsenault	4271ae22be	AMDGPU: Remove some unreachable code in WWM pass Defs must be registers and there's no point to code after llvm_unreachable.	2022-04-19 21:04:33 -04:00
Matt Arsenault	bc7902f148	AMDGPU: Remove unused MachineFunctionInfo fields These were leftovers from a half-implemented spill to LDS attempt.	2022-04-19 21:04:33 -04:00
Dmitry Preobrazhensky	e01dbabdd1	[AMDGPU][MC] Corrected error message "image data size does not match dmask and tfe" Differential Revision: https://reviews.llvm.org/D123929	2022-04-19 13:52:58 +03:00
Jay Foad	f707e1255e	[AMDGPU] Select d16 stores even when sramecc is enabled The sramecc feature changes the behaviour of d16 loads so they do not preserve the unused 16 bits of the result register, but it has no impact on d16 stores, so we should make use of them even when the feature is enabled. Differential Revision: https://reviews.llvm.org/D104912	2022-04-19 09:34:32 +01:00
Austin Kerbow	7f97ac94f7	Revert "[AMDGPU] Omit unnecessary waitcnt before barriers" This reverts commit `8d0c34fd4f`.	2022-04-18 21:24:08 -07:00
Stanislav Mekhanoshin	c1c49a3561	[AMDGPU] Fix comment type in the DSInstructions.td. NFC.	2022-04-18 14:28:12 -07:00
Christudasan Devadasan	34a68037dd	[AMDGPU][SIFrameLowering] Refactor custom SGPR spills (NFC). Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D123666	2022-04-17 13:42:42 +05:30
Johannes Doerfert	3be3b40188	[Attributor][NFCI] Introduce AttributorConfig to bundle all options Instead of lengthy constructors we can now set the members of a read-only struct before the Attributor is created. Should make it clearer what is configurable and also help introducing new options in the future. This actually added IsModulePass and avoids deduction through the Function set size. No functional change was intended.	2022-04-15 18:17:19 -05:00
Matt Arsenault	df29ec2f54	AMDGPU: Select i8/i16 global and flat atomic load/store As far as I know these should be atomic anyway, as long as the address is aligned. Unaligned atomics hit an ugly error in AtomicExpand.	2022-04-14 20:52:05 -04:00
Matt Arsenault	c528fbf882	AMDGPU: Fix assert if v_mov_b32_dpp is last instruction in the block This can happen if the use instruction is a phi. Fixes issue 49961	2022-04-14 20:21:22 -04:00
Stanislav Mekhanoshin	49b39c4f2e	[AMDGPU] Remove redundant RequiredAlignment assignment. NFCI. Differential Revision: https://reviews.llvm.org/D123699	2022-04-14 02:03:51 -07:00
hsmahesha	ea47373af4	[AMDGPU][NFC] Organize code around reserving VGPR32 for AGPR copy. This is an NFC patch in preparation to fix a bug related to always reserving VGPR32 for AGPR copy. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D123651	2022-04-14 12:51:33 +05:30
Carl Ritson	35ea326047	[AMDGPU] Try to avoid inserting duplicate s_inst_prefetch Check for existing s_inst_prefetch instructions when configuring prefetches during loop alignment. Reviewed By: rampitec, foad Differential Revision: https://reviews.llvm.org/D123569	2022-04-14 16:06:24 +09:00
Stanislav Mekhanoshin	d951d937a0	[AMDGPU] Increase hazard for store dwordx3/4 to 2 waitstates on gfx940 Fixes: SWDEV-327053 Differential Revision: https://reviews.llvm.org/D123687	2022-04-13 14:21:45 -07:00
Jay Foad	ccaf6dabcc	[AMDGPU] Initialize a couple more Subtarget fields This is just for consistency. The fields are never actually used so it is NFC.	2022-04-13 16:36:10 +01:00
Dmitry Preobrazhensky	5c0bf1303e	[AMDGPU][MC][GFX10] Removed unsupported 64bit DPP opcodes Removed 64bit DPP opcodes from asm matcher tables. Differential Revision: https://reviews.llvm.org/D123611	2022-04-13 14:43:40 +03:00
Dmitry Preobrazhensky	ab18e1a533	[AMDGPU][GFX10] Enabled op_sel for v_add_nc_u16 and v_sub_nc_u16 Differential Revision: https://reviews.llvm.org/D123594	2022-04-13 13:48:42 +03:00
Matt Arsenault	c986d476cd	AMDGPU: Update reqd-work-group-size optimization for umin intrinsic This code was pattern matching the ID computation expression as it appears in the library. This was a compare and select, but now that umin is canonical, we were no longer matching. Update to match the intrinsic instead.	2022-04-12 20:03:02 -04:00
Stanislav Mekhanoshin	f6462a26f0	[AMDGPU] Split unaligned 4 DWORD DS operations Similarly to 3 DWORD operations it is better for performance to split unaligned operations as long as these are at least DWORD aligned. Performance data: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b128 aligned by 16: 4.9 sec ds_write2_b64 aligned by 16: 5.1 sec ds_write2_b32 * 2 aligned by 16: 5.5 sec ds_write_b128 aligned by 1: 8.1 sec ds_write2_b64 aligned by 1: 8.7 sec ds_write2_b32 * 2 aligned by 1: 14.0 sec ds_write_b128 aligned by 2: 8.1 sec ds_write2_b64 aligned by 2: 8.7 sec ds_write2_b32 * 2 aligned by 2: 14.0 sec ds_write_b128 aligned by 4: 5.6 sec ds_write2_b64 aligned by 4: 8.7 sec ds_write2_b32 * 2 aligned by 4: 5.6 sec ds_write_b128 aligned by 8: 5.6 sec ds_write2_b64 aligned by 8: 5.1 sec ds_write2_b32 * 2 aligned by 8: 5.6 sec ds_read_b128 aligned by 16: 3.8 sec ds_read2_b64 aligned by 16: 3.8 sec ds_read2_b32 * 2 aligned by 16: 4.0 sec ds_read_b128 aligned by 1: 4.6 sec ds_read2_b64 aligned by 1: 8.1 sec ds_read2_b32 * 2 aligned by 1: 14.0 sec ds_read_b128 aligned by 2: 4.6 sec ds_read2_b64 aligned by 2: 8.1 sec ds_read2_b32 * 2 aligned by 2: 14.0 sec ds_read_b128 aligned by 4: 4.6 sec ds_read2_b64 aligned by 4: 8.1 sec ds_read2_b32 * 2 aligned by 4: 4.0 sec ds_read_b128 aligned by 8: 4.6 sec ds_read2_b64 aligned by 8: 3.8 sec ds_read2_b32 * 2 aligned by 8: 4.0 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b128 aligned by 16: 6.2 sec ds_write2_b64 aligned by 16: 7.1 sec ds_write2_b32 * 2 aligned by 16: 7.6 sec ds_write_b128 aligned by 1: 24.1 sec ds_write2_b64 aligned by 1: 25.2 sec ds_write2_b32 * 2 aligned by 1: 43.7 sec ds_write_b128 aligned by 2: 24.1 sec ds_write2_b64 aligned by 2: 25.1 sec ds_write2_b32 * 2 aligned by 2: 43.7 sec ds_write_b128 aligned by 4: 14.4 sec ds_write2_b64 aligned by 4: 25.1 sec ds_write2_b32 * 2 aligned by 4: 7.6 sec ds_write_b128 aligned by 8: 
14.4 sec ds_write2_b64 aligned by 8: 7.1 sec ds_write2_b32 * 2 aligned by 8: 7.6 sec ds_read_b128 aligned by 16: 6.2 sec ds_read2_b64 aligned by 16: 6.3 sec ds_read2_b32 * 2 aligned by 16: 7.5 sec ds_read_b128 aligned by 1: 12.5 sec ds_read2_b64 aligned by 1: 24.0 sec ds_read2_b32 * 2 aligned by 1: 43.6 sec ds_read_b128 aligned by 2: 12.5 sec ds_read2_b64 aligned by 2: 24.0 sec ds_read2_b32 * 2 aligned by 2: 43.6 sec ds_read_b128 aligned by 4: 12.5 sec ds_read2_b64 aligned by 4: 24.0 sec ds_read2_b32 * 2 aligned by 4: 7.5 sec ds_read_b128 aligned by 8: 12.5 sec ds_read2_b64 aligned by 8: 6.3 sec ds_read2_b32 * 2 aligned by 8: 7.5 sec ``` Differential Revision: https://reviews.llvm.org/D123634	2022-04-12 16:07:13 -07:00
Matt Arsenault	788f94f731	AMDGPU: Don't use unreachable on stores to unhandled address space For stores to constant address space, this will now consistently hit a selection error instead of hitting unreachable in an asserts build. I'm not sure what we should really do here. We could either just codegen as if it were global, delete the instruction, or declare the IR invalid (we really should have a target IR verifier to enforce it).	2022-04-12 17:31:50 -04:00
Changpeng Fang	8edaf25986	AMDGPU: Emit metadata for the hidden_multigrid_sync_arg conditionally Summary: Introduce a new function attribute, amdgpu-no-multigrid-sync-arg, which is default. We use implicitarg_ptr + offset to check whether the multigrid synchronization pointer is used. If yes, we remove this attribute and also remove amdgpu-no-implicitarg-ptr. We generate metadata for the hidden_multigrid_sync_arg only when the amdgpu-no-multigrid-sync-arg attribute is removed from the function. Reviewers: arsenm, sameerds, b-sumner and foad Differential Revision: https://reviews.llvm.org/D123548	2022-04-12 12:36:30 -07:00
Anshil Gandhi	528aa09010	[AMDGPU][Codegen] Unsupported image sample texture map instructions Disables image_sample_*_g16 instructions on architectures lacking g16 support. This patch fixes the issue 54672. Differential Revision: https://reviews.llvm.org/D123461	2022-04-12 10:38:59 -06:00
Jay Foad	8a53b25ed5	[AMDGPU] Use default member initializers in Subtarget classes Use default member initializers in AMDGPUSubtarget and subclasses. This is to guard against adding a new feature boolean in AMDGPUSubtarget.h but forgetting to initialize it to false in AMDGPUSubtarget.cpp. This was mostly autogenerated by: clang-tidy -checks=-,cppcoreguidelines-prefer-member-initializer,modernize-use-default-member-init -header-filter=Subtarget -fix lib/Target/AMDGPU/Subtarget.cpp Differential Revision: https://reviews.llvm.org/D123613	2022-04-12 16:42:30 +01:00
Stanislav Mekhanoshin	3870b36025	[AMDGPU] Split unaligned 3 DWORD DS operations I have written a minitest to check the performance. Overall the benefit of aligned b96 operations on data which is not known but happens to be aligned is small, while performance hit of using b96 operations on a really unaligned memory is high. The only exception is when data is not aligned even by 4, it is better to use b96 in this case. Here is the test output on Vega and Navi: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b96 aligned: 3.4 sec ds_write_b32 + ds_write_b64 aligned: 4.5 sec ds_write_b32 * 3 aligned: 4.8 sec ds_write_b96 misaligned by 1: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 1: 7.2 sec ds_write_b32 * 3 misaligned by 1: 10.0 sec ds_write_b96 misaligned by 2: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 2: 7.2 sec ds_write_b32 * 3 misaligned by 2: 10.1 sec ds_write_b96 misaligned by 4: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 4: 4.2 sec ds_write_b32 * 3 misaligned by 4: 4.9 sec ds_write_b96 misaligned by 8: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 8: 4.6 sec ds_write_b32 * 3 misaligned by 8: 4.9 sec ds_read_b96 aligned: 3.3 sec ds_read_b32 + ds_read_b64 aligned: 4.9 sec ds_read_b32 * 3 aligned: 2.6 sec ds_read_b96 misaligned by 1: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 1: 7.2 sec ds_read_b32 * 3 misaligned by 1: 10.1 sec ds_read_b96 misaligned by 2: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 2: 7.2 sec ds_read_b32 * 3 misaligned by 2: 10.1 sec ds_read_b96 misaligned by 4: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 4: 2.6 sec ds_read_b32 * 3 misaligned by 4: 2.6 sec ds_read_b96 misaligned by 8: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 8: 4.9 sec ds_read_b32 * 3 misaligned by 8: 2.6 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b96 aligned: 4.1 sec ds_write_b32 + ds_write_b64 aligned: 13.0 sec ds_write_b32 * 3 aligned: 4.5 sec 
ds_write_b96 misaligned by 1: 12.5 sec ds_write_b32 + ds_write_b64 misaligned by 1: 22.0 sec ds_write_b32 * 3 misaligned by 1: 31.5 sec ds_write_b96 misaligned by 2: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 2: 22.0 sec ds_write_b32 * 3 misaligned by 2: 31.5 sec ds_write_b96 misaligned by 4: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 4: 4.0 sec ds_write_b32 * 3 misaligned by 4: 4.5 sec ds_write_b96 misaligned by 8: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 8: 13.0 sec ds_write_b32 * 3 misaligned by 8: 4.5 sec ds_read_b96 aligned: 3.8 sec ds_read_b32 + ds_read_b64 aligned: 12.8 sec ds_read_b32 * 3 aligned: 4.4 sec ds_read_b96 misaligned by 1: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 1: 21.8 sec ds_read_b32 * 3 misaligned by 1: 31.5 sec ds_read_b96 misaligned by 2: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 2: 21.9 sec ds_read_b32 * 3 misaligned by 2: 31.5 sec ds_read_b96 misaligned by 4: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 4: 3.8 sec ds_read_b32 * 3 misaligned by 4: 4.5 sec ds_read_b96 misaligned by 8: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 8: 12.8 sec ds_read_b32 * 3 misaligned by 8: 4.5 sec ``` Fixes: SWDEV-330802 Differential Revision: https://reviews.llvm.org/D123524	2022-04-12 07:52:39 -07:00
Stanislav Mekhanoshin	b8e09f1553	[AMDGPU] Refactor LDS alignment checks. Move features/bugs checks into the single place allowsMisalignedMemoryAccessesImpl. This is mostly NFCI except for the order of selection in a couple of places. A separate change may be needed to stop lying about Fast. Differential Revision: https://reviews.llvm.org/D123343	2022-04-12 07:49:40 -07:00
Carl Ritson	2bca7d859a	[AMDGPU] Graceful abort for waterfalls in SIOptimizeVGPRLiveRange If the CFG structure of a waterfall loop is not the expected shape then gracefully abort traversing the IR for the given loop. This applies to nested waterfall loops which are not supported by the VGPR live range optimizer. Reviewed By: ruiling Differential Revision: https://reviews.llvm.org/D123480	2022-04-12 14:15:44 +09:00
Matt Arsenault	463bc93e5f	AMDGPU/GlobalISel: Remove unused parameter	2022-04-11 19:43:37 -04:00
Matt Arsenault	203a1e36ed	Reapply "AMDGPU: Remove AMDGPUFixFunctionBitcasts pass" This reverts commit `8a85be807b`. The unrelated failure this exposed was fixed.	2022-04-11 19:43:37 -04:00
Changpeng Fang	7f9868f9b7	AMDGPU: Align the implicit kernel argument segment to 8 bytes for v5 Summary: In emitting metadata for implicit kernel arguments, we need to be in sync with the actual loads to align the implicit kernel argument segment to 8 byte boundary. In this work, we simply force this alignment through the first implicit argument. In addition, we don't emit metadata for any implicit kernel argument if none of them is actually used. Reviewers: arsenm, b-sumner Differential Revision: https://reviews.llvm.org/D123346	2022-04-11 16:12:39 -07:00
Nicolai Hähnle	4df4922da6	AMDGPU/SDAG: Custom SETCC (i.e. ballot) is always uniform The AMDGPUISD::SETCC node is like ISD::SETCC, but returns a lane mask instead of a per-lane boolean. The lane mask is uniform. This improves instruction selection for code patterns like ctpop(ballot(x)), which can now use an S_BCNT1_* instruction instead of V_BCNT_*. GlobalISel already selects scalar instructions (an earlier commit added a test case). Differential Revision: https://reviews.llvm.org/D123432	2022-04-11 14:04:21 -05:00
Stanislav Mekhanoshin	fced87d457	[AMDGPU] Fix regression with vectorization limiting D67148 has removed TTI::getNumberOfRegisters(bool Vector) and started to call TTI::getNumberOfRegisters(unsigned ClassID) from the LoopVectorize. This has resulted in an unrestricted vectorization on AMDGPU blowing up register pressure. Differential Revision: https://reviews.llvm.org/D122850	2022-04-08 17:46:49 -07:00
Vang Thao	311edc6b5b	[AMDGPU] Enable PreRARematerialize scheduling pass with multiple high RP regions Enable the PreRARematerialize pass when there are multiple high RP scheduling regions present. Require the occupancy in all high RP regions be improved before finalizing sinking. If any high RP region did not improve in occupancy then un-do all sinking and restore the state to before the pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D122501	2022-04-08 13:08:32 -07:00
Vang Thao	cd1071171c	[AMDGPU] Fix inline asm causing assert during PreRARematerialize stage in scheduler pass Reviewed By: foad Differential Revision: https://reviews.llvm.org/D123348	2022-04-08 09:22:32 -07:00
Christudasan Devadasan	2c46d067e1	[AMDGPU][SIMachineFunctionInfo] Code cleanup (NFC).	2022-04-08 19:42:48 +05:30
Abinav Puthan Purayil	b536f24d22	[AMDGPU] Use GCNPat in the buffer atomic pattern multiclasses	2022-04-08 16:28:11 +05:30
Thomas Symalla	6d97ca690c	[AMDGPU] Increase detection range for s_mov, v_cmpx transformation. We found that it might be beneficial to have the SIOptimizeExecMasking pass detect more cases where v_cmp, s_and_saveexec patterns can be transformed to s_mov, v_cmpx patterns. Currently, the search range for finding a fitting v_cmp instruction is 5, however, this is doubled to 10 here. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D123367	2022-04-08 12:47:24 +02:00
Evgeniy Brevnov	da41214d65	Add support for atomic memory copy lowering Currently, the utility supports lowering of non atomic memory transfer routines only. This patch adds support for atomic version of memcopy. This may be useful for targets not supporting atomic memcopy. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D118443	2022-04-08 10:41:31 +07:00
Stanislav Mekhanoshin	16cf9e6dad	[AMDGPU] Fix handling of gfx10 LDS misaligned access bug It was only handled for FLAT initially because we did not have unaligned DS instructions lowering. Now it is implemented but the bug is not handled. Differential Revision: https://reviews.llvm.org/D123338	2022-04-07 15:08:29 -07:00
Stanislav Mekhanoshin	e66f0edb40	[AMDGPU] Split unaligned LDS access instead of scalarizing There is no need to fully scalarize an unaligned operation in some cases, just split it to alignment. Differential Revision: https://reviews.llvm.org/D123330	2022-04-07 14:27:37 -07:00
Changpeng Fang	6733590db2	AMDGPU: Set implicit kernarg size to be of 256 bytes for code object version 5 Summary: If implicitarg_ptr intrinsic is not used, set implicit kernarg size to 0, otherwise set it to 256 bytes for code object version 5 (and beyond). Reviewers: arsenm Differential Revision: https://reviews.llvm.org/D123262	2022-04-07 08:35:23 -07:00
Dmitry Preobrazhensky	1f6aa90386	[AMDGPU][MC][GFX10] Added syntactic sugar for s_waitcnt_depctr operand Added the following helpers: depctr_hold_cnt(...) depctr_sa_sdst(...) depctr_va_vdst(...) depctr_va_sdst(...) depctr_va_ssrc(...) depctr_va_vcc(...) depctr_vm_vsrc(...) Differential Revision: https://reviews.llvm.org/D123022	2022-04-07 17:03:44 +03:00
Matt Arsenault	e6012c8e0f	AMDGPU: Handle private atomics Use new NotAtomic expansion to turn these into the equivalent non-atomic operations. Independent lanes cannot access the private memory of other lanes, so there's no possibility for synchronization. These don't really appear directly in user code, but InferAddressSpaces can make these appear after optimizations. Fixes issues 54693 and 54274.	2022-04-06 22:47:19 -04:00
Stanislav Mekhanoshin	a41a676e8a	[AMDGPU] Check SI LDS offset bug in the allowsMisalignedMemoryAccesses Differential Revision: https://reviews.llvm.org/D123268	2022-04-06 18:05:02 -07:00
Craig Topper	1235aaefbd	[AArch64][AMDGPU][WebAssembly] Use static_cast instead of a reinterpret_cast to downcast in parseMachineFunctionInfo. NFC static_cast is a little safer here since the compiler will ensure we're casting to a class derived from yaml::MachineFunctionInfo. I believe this first appeared on AMDGPU and was copied to the other two targets. Spotted when it was being copied to RISCV in D123178. Differential Revision: https://reviews.llvm.org/D123260	2022-04-06 15:09:18 -07:00
Jay Foad	538c77172a	[AMDGPU] Fix unused variable warning after D117484	2022-04-06 14:45:38 +01:00
Matt Arsenault	54c525fc53	AMDGPU/GlobalISel: Handle legacy grid ID intrinsics Handle the llvm.r600.* intrinsics which are still in use in libclc. I thought it would be possible to switch it to using llvm.amdgcn.implicitarg.ptr already, but it turns out the implicit arguments are currently split into a piece before and after the explicit kernel arguments.	2022-04-05 22:01:31 -04:00
Matt Arsenault	6071c92768	AMDGPU: Fix LiveVariables error after lowering SI_END_CF This wasn't accounting for the block change in updating LiveVariables.	2022-04-05 21:57:50 -04:00
Scott Linder	09f33a430b	[AMDGPU][OpenCL] Remove "printf and hostcall" diagnostic The diagnostic is unreliable, and triggers even for dead uses of hostcall that may exist when linking the device-libs at lower optimization levels. Eliminate the diagnostic, and directly document the limitation for OpenCL before code object V5. Make some NFC changes to clarify the related code in the MetadataStreamer. Add a clang test to tie OCL sources containing printf to the backend IR tests for this situation. Reviewed By: sameerds, arsenm, yaxunl Differential Revision: https://reviews.llvm.org/D121951	2022-04-05 19:10:23 +00:00
Vang Thao	45c2371c0d	[AMDGPU] Ignore debug use during PreRARematerialize stage in scheduling pass Ignore all debug uses when collecting trivially rematerializable defs. This fixes an issue with difference in codegen when enabling debug info. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D123048	2022-04-04 11:15:06 -07:00
Jay Foad	c246b7bd4a	[AMDGPU] Only count global-to-global as indirect accesses Previously any load (global, local or constant) feeding into a global load or store would be counted as an indirect access. This patch only counts global loads feeding into a global load or store. The rationale is that the latency for global loads is generally much larger than the other kinds. As a side effect this makes it easier to write small kernel test cases that are not counted as having indirect accesses, despite the fact that arguments to the kernel are accessed with an SMEM load. Differential Revision: https://reviews.llvm.org/D122804	2022-04-01 13:48:13 +01:00
Thomas Symalla	1a6aa8b195	[AMDGPU] Add missing use check in SIOptimizeExecMasking pass. Whenever a v_cmp, s_and_saveexec instruction sequence shall be transformed to an equivalent s_mov, v_cmpx sequence, it needs to be detected if the v_cmp target register is used between the two instructions as the v_cmp result gets omitted by using the v_cmpx instruction, resulting in invalid code. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D122797	2022-03-31 19:25:35 +02:00
Abinav Puthan Purayil	898d5776ec	[AMDGPU][GlobalISel] Scalarize add/sub with overflow ops in the legalizer Differential Revision: https://reviews.llvm.org/D122803	2022-03-31 21:46:34 +05:30
Changpeng Fang	1711020c37	AMDGPU: Use isLiteralConstantLike to check whether the operand could ever be literal Summary: To compute the size of a VALU/SALU instruction, we need to check whether an operand could ever be literal. Previously isLiteralConstant was used, which missed cases like global variables or external symbols. These misses lead to under-estimation of the instruction size and branch offset, and thus incorrectly skip the necessary branch relaxation when the branch offset is actually greater than what the branch bits can hold. In this work, we use isLiteralConstantLike to check the operands. It may be conservative, but it is safe. Reviewers: arsenm Differential Revision: https://reviews.llvm.org/D122778	2022-03-31 08:06:31 -07:00
Abinav Puthan Purayil	acf83abcbf	[AMDGPU][GlobalISel] Remove unused variable. NFC.	2022-03-31 16:50:34 +05:30
Stanislav Mekhanoshin	f311f934e1	[AMDGPU] gfx940 VALU hazard recognizer Differential Revision: https://reviews.llvm.org/D122339	2022-03-29 10:57:54 -07:00
Shao-Ce SUN	662b9fa02c	[NFC][CodeGen] Add a setTargetDAGCombine use ArrayRef Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D122557	2022-03-29 09:53:24 +08:00
Changpeng Fang	8384ced974	[AMDGPU][NFC]: Remove unnecessary MFI functions Summary: hasHostcallPtr() and hasHeapPtr() are only used in metadata emit. However, we can use the corresponding function attributes directly instead of introducing the functions. Reviewers: arsenm Differential Revision: https://reviews.llvm.org/D122600	2022-03-28 12:13:33 -07:00
Thomas Symalla	3bd15c03c6	[AMDGPU] Fix adding modifiers when creating v_cmpx instructions. Revision https://reviews.llvm.org/D122332 added a pattern transformation where v_cmpx instructions are introduced. However, the modifiers are not correctly inherited from the original operands. The patch adds the source modifiers, if they exist, or sets them to 0. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D122489	2022-03-28 17:52:53 +02:00
Carl Ritson	1f52d02ceb	[AMDGPU] Split waterfall loop exec manipulation Split waterfall loops into multiple blocks so that exec mask manipulation (s_and_saveexec) does not occur in the middle of a block. VGPR live range optimizer is updated to handle waterfall loops spanning multiple blocks. Reviewed By: ruiling Differential Revision: https://reviews.llvm.org/D122200	2022-03-28 17:44:54 +09:00
Kazu Hirata	6212871968	[Target] Apply clang-tidy fixes for readability-redundant-member-init (NFC)	2022-03-27 22:22:37 -07:00
Maksim Panchenko	4ae9745af1	[Disassember][NFCI] Use strong type for instruction decoder All LLVM backends use MCDisassembler as a base class for their instruction decoders. Use "const MCDisassembler *" for the decoder instead of "const void *". Remove unnecessary static casts. Reviewed By: skan Differential Revision: https://reviews.llvm.org/D122245	2022-03-25 18:53:59 -07:00
Jay Foad	be9acee059	[AMDGPU] Move VOP3 classes into VOPInstructions.td. NFC. These classes are also used by VOP1/2/C instructions. Differential Revision: https://reviews.llvm.org/D122470	2022-03-25 13:56:43 +00:00
Thomas Symalla	718aec209c	[AMDGPU] Improve v_cmpx usage on GFX10.3. On GFX10.3 targets, the following instruction sequence v_cmp_* SGPR, ... s_and_saveexec ..., SGPR leads to a fairly long stall caused by a VALU write to a SGPR and having the following SALU wait for the SGPR. An equivalent sequence is to save the exec mask manually instead of letting s_and_saveexec do the work and use a v_cmpx instruction instead to do the comparison. This patch modifies the SIOptimizeExecMasking pass as this is the last position where s_and_saveexec instructions are inserted. It does the transformation by trying to find the pattern, extracting the operands and generating the new instruction sequence. It also changes some existing lit tests and introduces a few new tests to show the changed behavior on GFX10.3 targets. Same as D119696 including a buildbot and MIR test fix. Reviewed By: critson Differential Revision: https://reviews.llvm.org/D122332	2022-03-25 11:40:18 +01:00
Stanislav Mekhanoshin	64838ba365	[AMDGPU] Use GenericTable to classify DGEMM Since there is a table introduced for MAI instructions extend it to use for DGEMM classification. Differential Revision: https://reviews.llvm.org/D122337	2022-03-24 13:00:37 -07:00
Stanislav Mekhanoshin	cad9de71d7	[AMDGPU] gfx940 MAI hazard recognizer Differential Revision: https://reviews.llvm.org/D122263	2022-03-24 12:59:52 -07:00
Stanislav Mekhanoshin	6e3e14f600	[AMDGPU] Support gfx940 smfmac instructions Differential Revision: https://reviews.llvm.org/D122191	2022-03-24 12:40:42 -07:00
Stanislav Mekhanoshin	27439a7642	[AMDGPU] New gfx940 mfma instructions Differential Revision: https://reviews.llvm.org/D122044	2022-03-24 12:12:52 -07:00
Vasileios Porpodas	39aa202aff	Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 3, fixed assertion crash. Original review: https://reviews.llvm.org/D121354 This reverts commit `e6ead19b77`.	2022-03-23 18:32:17 -07:00
Austin Kerbow	1e15adba62	[AMDGPU] Add s_nop WaitStates between neighboring mfma In some cases padding bubbles between sequential MFMA instructions may lead to increased inter-wave performance. Add option to request to pad some portion of these stall cycles with s_nops. Fixes: SWDEV-326925 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D121437	2022-03-23 13:56:09 -07:00
Arthur Eubanks	e6ead19b77	Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash." This reverts commit `27bd8f9492`. Causes crashes, see comments in D121973	2022-03-23 10:57:45 -07:00
hsmahesha	f5b6866d7e	[AMDGPU] Add missing testcase for SGPR to AGPR copy, and also update the function indirectCopyToAGPR() to ensure that it is called only on the GFX908 sub-target. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D122286	2022-03-23 21:38:04 +05:30
hsmahesha	f014303e2c	[AMDGPU] [NFC]: Organize the code around reserving registers. First, add code to reserve all required special purpose registers, followed by code to reserve SGPRs, followed by code to reserve VGPRs/AGPRs. This patch is prepared as a pre-requisite to fix an issue related to GFX90A hardware. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D122219	2022-03-23 07:15:59 +05:30
Vasileios Porpodas	27bd8f9492	Recommit "[SLP] Fix lookahead operand reordering for splat loads." attempt 2, fixed assertion crash. Original review: https://reviews.llvm.org/D121354 This reverts commit `f7d7d2a08d`.	2022-03-22 16:41:55 -07:00
Stanislav Mekhanoshin	72c1a0d9c2	[AMDGPU] Allow v_accvgpr_write to use SGPR on gfx90a This is undocumented, but it should work. Differential Revision: https://reviews.llvm.org/D122252	2022-03-22 13:52:29 -07:00
Arthur Eubanks	f7d7d2a08d	Revert "Recommit "[SLP] Fix lookahead operand reordering for splat loads."" This reverts commit `79613185d3`. Causes crashes, see comments in https://reviews.llvm.org/D121973.	2022-03-22 13:33:49 -07:00
alex-t	7636c9a929	[AMDGPU] use scalar shift for SALU users in frame index elimination In the frame index lowering we have to insert shift and add instructions to adjust stack object access. We need to take care of the stack object user kind and use scalar shift/add for scalar users. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D121524	2022-03-22 13:16:24 +01:00
alex-t	0a488cba2c	[AMDGPU] use scalar shift for SALU users in frame index elimination In the frame index lowering we have to insert shift and add instructions to adjust stack object access. We need to take care of the stack object user kind and use scalar shift/add for scalar users. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D121524	2022-03-22 11:43:23 +01:00
Vasileios Porpodas	79613185d3	Recommit "[SLP] Fix lookahead operand reordering for splat loads." Original review: https://reviews.llvm.org/D121354 The original commit `9136145eb0` broke the build on several targets. Differential Revision: https://reviews.llvm.org/D121973	2022-03-21 15:57:32 -07:00
Stanislav Mekhanoshin	9b1fa6f89f	[AMDGPU] Fix AV classes VTs. NFCI. NFC at this point, but will be used at a later patch. Differential Revision: https://reviews.llvm.org/D122174	2022-03-21 13:38:05 -07:00
alex-t	a0ea7ec90f	[AMDGPU] divergence patterns for the BUILD_VECTOR i16, undef expansion. BUILD_VECTOR of i16 and undef gets expanded to the COPY_TO_REGCLASS. The latter is further lowered to the copy instructions. We need to provide the correct register class for the uniform and divergent BUILD_VECTOR nodes to avoid VGPR to SGPR copies. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D122068	2022-03-21 21:11:20 +01:00
Dmitry Preobrazhensky	1d817a1448	[AMDGPU][MC][NFC] Refactored sendmsg(...) handling Differential Revision: https://reviews.llvm.org/D121995	2022-03-21 15:37:30 +03:00
Jay Foad	321d3aae7c	[AMDGPU] SIInstrInfo::verifyInstruction tweaks. NFCI. Simplify some for loops. Don't bother checking src2 operand for writelane because it doesn't have one. Check all VALU instructions, not just VOP1/2/3/C/SDWA.	2022-03-21 11:15:55 +00:00
Thomas Symalla	7de6107dce	Revert "[AMDGPU] Improve v_cmpx usage on GFX10.3." This reverts commit `011c64191e` and `e725e2afe0`. Differential Revision: https://reviews.llvm.org/D122117	2022-03-21 09:50:44 +01:00
Thomas Symalla	e725e2afe0	[AMDGPU] [NFC] Fix missing include.	2022-03-21 09:37:22 +01:00
Thomas Symalla	011c64191e	[AMDGPU] Improve v_cmpx usage on GFX10.3. On GFX10.3 targets, the following instruction sequence v_cmp_* SGPR, ... s_and_saveexec ..., SGPR leads to a fairly long stall caused by a VALU write to a SGPR and having the following SALU wait for the SGPR. An equivalent sequence is to save the exec mask manually instead of letting s_and_saveexec do the work and use a v_cmpx instruction instead to do the comparison. This patch modifies the SIOptimizeExecMasking pass as this is the last position where s_and_saveexec instructions are inserted. It does the transformation by trying to find the pattern, extracting the operands and generating the new instruction sequence. It also changes some existing lit tests and introduces a few new tests to show the changed behavior on GFX10.3 targets. Reviewed By: sebastian-ne, critson Differential Revision: https://reviews.llvm.org/D119696	2022-03-21 09:31:59 +01:00
Jon Chesterfield	bcbd4cf1f2	Revert "[amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal" Reconsidered, better to handle per-function state in the constructor as before. This reverts commit `98e474c1b3`.	2022-03-20 00:58:26 +00:00
Jon Chesterfield	98e474c1b3	[amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal	2022-03-19 16:42:17 +00:00
Stanislav Mekhanoshin	e9a49c6483	[AMDGPU] gfx940 basic speed model This is incomplete and will handle more instructions as they are added. Differential Revision: https://reviews.llvm.org/D121966	2022-03-18 13:19:47 -07:00
Stanislav Mekhanoshin	4570527e72	[AMDGPU] Disable some MFMA instructions on gfx940 Differential Revision: https://reviews.llvm.org/D121956	2022-03-18 13:19:12 -07:00
Stanislav Mekhanoshin	0a79e1f30a	[AMDGPU] reuse blgp as neg in 2 mfma operations on gfx940 GFX940 repurposes BLGP as NEG only in DGEMM MFMA. Differential Revision: https://reviews.llvm.org/D121745	2022-03-18 12:56:51 -07:00
Abinav Puthan Purayil	aee3684995	[AMDGPU] Use COPY_TO_REGCLASS for buffer_atomic_cmpswap selection GlobalISel was selecting the av_* regclass for some cases. Differential Revision: https://reviews.llvm.org/D121933	2022-03-18 08:56:23 +05:30
Changpeng Fang	dd5895cc39	AMDGPU: Use the implicit kernargs for code object version 5 Summary: Specifically, for trap handling, for targets that do not support getDoorbellID, we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1]. To get aperture bases when targets do not have aperture registers, we load private_base or shared_base directly from the implicit kernarg. In clang, we use implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}. Reviewers: arsenm, sameerds, yaxunl Differential Revision: https://reviews.llvm.org/D120265	2022-03-17 14:12:36 -07:00
Stanislav Mekhanoshin	d9ac55fab2	[AMDGPU] New MFMA names for existing instructions Old names are supported as aliases. _1k MFMA got new opcodes. Differential Revision: https://reviews.llvm.org/D121741	2022-03-17 13:05:36 -07:00
Stanislav Mekhanoshin	522b259976	[AMDGPU] Allow v_accvgpr_write to use SGPR src on gfx940 Differential Revision: https://reviews.llvm.org/D121843	2022-03-17 12:12:06 -07:00
Vang Thao	27e1931508	[AMDGPU] Fix PreRARematerialize scheduler pass sinking subreg defs When collecting trivially rematerializable defs, skip any subreg defs. We do not want to sink these. Differential Revision: https://reviews.llvm.org/D121874	2022-03-17 11:38:53 -07:00
Jay Foad	313f306b26	[AMDGPU] Stop using getMinimalPhysRegClass in LowerFormalArguments NFCI. The motivation for this is to avoid problems in future if we add new classes containing only a subset of all VGPRs, or a subset of all SGPRs. getMinimalPhysRegClass would favour these smaller classes, which is not what we want here. Differential Revision: https://reviews.llvm.org/D121914	2022-03-17 15:19:17 +00:00
Dmitry Preobrazhensky	9c632b61eb	[AMDGPU][MC] A fix for commit `5977dfb` The commit code `5977dfba64` failed to compile with GCC5. This patch addresses the issue. For a related discussion, see https://reviews.llvm.org/D121696	2022-03-17 14:41:21 +03:00
Abinav Puthan Purayil	f59cb41ba1	[AMDGPU] Select buffer_atomic_cmpswap* in tblgen This change replaces the manual selection of buffer_atomic_cmpswap* instructions in SelectionDAG and GlobalISel with a tblgen based selection in BUFInstructions.td. This allows us to select the return and no-return variants in tblgen. Differential Revision: https://reviews.llvm.org/D121770	2022-03-17 10:12:32 +05:30
Christudasan Devadasan	6dd21d1db1	[AMDGPU][SIFoldOperands] Consider the alignment constraints Enforced an alignment check while folding the operands.	2022-03-17 08:27:53 +05:30
Christudasan Devadasan	af717d4aca	[AMDGPU][MachineVerifier] Alignment check for fp32 packed math instructions The fp32 packed math instructions are introduced in gfx90a. If their vector register operands are not properly aligned, the verifier should flag them. Currently, the verifier failed to report it and the compiler ended up emitting a broken assembly. This patch fixes that missed case in TII::verifyInstruction. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D121794	2022-03-17 08:21:35 +05:30
Jay Foad	fb8d23b8e7	[AMDGPU] Define new feature HasFlatScratchSVSMode. NFC. This is by analogy with HasFlatScratchSTMode and is slightly more informative than using isGFX940Plus. Differential Revision: https://reviews.llvm.org/D121804	2022-03-16 19:54:02 +00:00
Joe Nash	fb81f06f63	[AMDGPU] Calculate RegWidth in bits in AsmParser NFC. Switch from calculations based on dwords to bits, to be more flexible. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D121730	2022-03-16 10:52:14 -04:00
Dmitry Preobrazhensky	5977dfba64	[AMDGPU][MC][NFC] Refactored custom operands handling The original design of custom operands support assumed that most GPUs have the same or very similar operand names and encodings. This is no longer the case. As a result the support code becomes over-complicated and difficult to maintain. This change implements a different design with the following benefits: - support of aliases; - support of operands with overlapped encodings; - identification of defined but unsupported operands. Differential Revision: https://reviews.llvm.org/D121696	2022-03-16 16:04:55 +03:00
Shengchen Kan	37b378386e	[NFC][CodeGen] Rename some functions in MachineInstr.h and remove duplicated comments	2022-03-16 20:25:42 +08:00
serge-sans-paille	989f1c72e0	Cleanup codegen includes This is a (fixed) recommit of https://reviews.llvm.org/D121169 after: 1061034926 before: 1063332844 Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D121681	2022-03-16 08:43:00 +01:00
Joe Nash	5464fd36ba	[AMDGPU] Fix typo consecutive in GCNNSAReassign	2022-03-15 12:42:52 -04:00
Pavel Labath	991dc4b4e0	Remove a top-level "using namespace" in TargetTransformInfoImpl.h Avoids polluting the namespace of all files including the header.	2022-03-15 13:49:20 +01:00
Ruiling Song	98dd390573	AMDGPU: Use removeAllRegUnitsForPhysReg() I met the issue here when working on something else. Actually we have already reserved EXEC, but it looks like the register coalescer is causing the sub-register of EXEC to appear in LiveIntervals. I have not looked deeper into why the register coalescer has such behavior, but removeAllRegUnitsForPhysReg() is the right way. Reviewed By: critson, foad, arsenm Differential Revision: https://reviews.llvm.org/D117014	2022-03-15 10:28:27 +08:00
Stanislav Mekhanoshin	c4500de255	[AMDGPU] gfx940: disable OP_SEL on V_DOT instructions Differential Revision: https://reviews.llvm.org/D121634	2022-03-14 17:02:00 -07:00
Stanislav Mekhanoshin	8dd3d1cf1f	[AMDGPU] Add symbolic names for gfx940 HWREGs The namespace of HWREGs now overlaps with gfx10. Thus the patch is longer than necessary to just support new names. It also needs to handle proper error messages, i.e. to issue a "specified hardware register is not supported on this GPU" message. This may need a major refactoring in the future. Differential Revision: https://reviews.llvm.org/D121418	2022-03-14 16:13:33 -07:00
Stanislav Mekhanoshin	23499103f7	[AMDGPU] Support for gfx940 flat lds opcodes Differential Revision: https://reviews.llvm.org/D121414	2022-03-14 15:46:19 -07:00
Stanislav Mekhanoshin	1f53f20fc1	[AMDGPU] Support gfx940 v_lshl_add_u64 instruction Differential Revision: https://reviews.llvm.org/D121401	2022-03-14 15:45:42 -07:00
Stanislav Mekhanoshin	36fe3f13a9	[AMDGPU] flat scratch SVS addressing mode for gfx940 Both VADDR and SADDR are used in SVS mode. Differential Revision: https://reviews.llvm.org/D121254	2022-03-14 15:23:36 -07:00
Stanislav Mekhanoshin	47bac63d3f	[AMDGPU] gfx940 memory model Differential Revision: https://reviews.llvm.org/D121242	2022-03-14 15:01:46 -07:00
Stanislav Mekhanoshin	72a9e5f891	[AMDGPU] Restrict machine copy propagation from creating unaligned classes Fixes: SWDEV-326366 Differential Revision: https://reviews.llvm.org/D121491	2022-03-14 14:09:40 -07:00
Austin Kerbow	62bcfcb5a5	[AMDGPU] Add llvm.amdgcn.s.setprio intrinsic Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D120976	2022-03-12 22:15:42 -08:00
serge-sans-paille	ed98c1b376	Cleanup includes: DebugInfo & CodeGen Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D121332	2022-03-12 17:26:40 +01:00
Stanislav Mekhanoshin	31f215ab0c	[AMDGPU] Support v_mov_b64 in dpp combine Differential Revision: https://reviews.llvm.org/D121411	2022-03-11 11:37:32 -08:00
Stanislav Mekhanoshin	6181458662	[AMDGPU] gfx940 MUBUF format changes Differential Revision: https://reviews.llvm.org/D121234	2022-03-11 11:36:49 -08:00
Nikita Popov	3ed643ea76	[AMDGPUPromoteAlloca] Make compatible with opaque pointers This mainly changes the handling of bitcasts to not check the types being casted from/to -- we should only care about the actual load/store types. The GEP handling is also changed to not care about types, and just make sure that we get an offset corresponding to a vector element. This was a bit of a struggle for me, because this code seems to be pretty sensitive to small changes. The end result seems to produce strictly better results for the existing test coverage though, because we can now deal with more situations involving bitcasts. Differential Revision: https://reviews.llvm.org/D121371	2022-03-11 09:20:51 +01:00
Joe Nash	c8e6d68a9f	[AMDGPU] Use subreg encoding instead of reassign The HWEncoding for these 64 bit registers should be the same as the encoding for the previously defined low halves of the registers. So reuse that value instead of repeating the assignment. NFC. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D121391	2022-03-10 12:50:29 -05:00
Nico Weber	a278250b0f	Revert "Cleanup codegen includes" This reverts commit `7f230feeea`. Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang, and many LLVM tests, see comments on https://reviews.llvm.org/D121169	2022-03-10 07:59:22 -05:00
alex-t	d159b4444c	[AMDGPU] Enable divergence predicates for negative inline constant subtraction We have a pattern that undoes the sub x, c -> add x, -c canonicalization since c is more likely an inline immediate than -c. This patch enables it to select scalar or vector subtraction by the input node divergence. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D121360	2022-03-10 15:03:22 +03:00
serge-sans-paille	7f230feeea	Cleanup codegen includes after: 1061034926 before: 1063332844 Differential Revision: https://reviews.llvm.org/D121169	2022-03-10 10:00:30 +01:00
Changpeng Fang	0f20a35b9e	AMDGPU: Set up User SGPRs for queue_ptr only when necessary Summary: In general, we need queue_ptr for aperture bases and trap handling, and user SGPRs have to be set up to hold queue_ptr. In the current implementation, user SGPRs are set up unnecessarily for some cases. If the target has aperture registers, queue_ptr is not needed to reference aperture bases. For trap handling, if the target supports getDoorbellID, queue_ptr is also not necessary. Further, code object version 5 introduces a new kernel ABI which passes queue_ptr as an implicit kernel argument, so user SGPRs are no longer necessary for queue_ptr. Based on the trap handling document: https://llvm.org/docs/AMDGPUUsage.html#amdgpu-trap-handler-for-amdhsa-os-v4-onwards-table, llvm.debugtrap does not need queue_ptr, so we remove queue_ptr support for llvm.debugtrap in the backend. Reviewers: sameerds, arsenm Fixes: SWDEV-307189 Differential Revision: https://reviews.llvm.org/D119762	2022-03-09 10:14:05 -08:00
Stanislav Mekhanoshin	33fb23f728	[AMDGPU] Merge flat with global in the SILoadStoreOptimizer Flat can be merged with flat global since address cast is a no-op. A combined memory operation needs to be promoted to flat. Differential Revision: https://reviews.llvm.org/D120431	2022-03-09 10:04:37 -08:00
Jay Foad	c7218164c4	[AMDGPU] Remove HasAtomicFaddInstsGFX90X and HasAtomicFaddInstsGFX940 These compound predicates are not required, since we can use a combination of setting the SubtargetPredicate (to a subtarget predicate like isGFX940Plus) and OtherPredicates (to a list of feature predicates like HasAtomicFaddInsts) instead. NFC. Differential Revision: https://reviews.llvm.org/D121289	2022-03-09 18:02:21 +00:00
Vang Thao	28322c2514	[AMDGPU] Add scheduler pass to rematerialize trivial defs Add a new pass in the pre-ra AMDGPU scheduler to check if sinking trivially rematerializable defs that only have one use outside of the defining block will increase occupancy. If we can determine that occupancy can be increased, then rematerialize only the minimum amount of defs required to increase occupancy. Also re-schedule all regions that had occupancy matching the previous min occupancy using the new occupancy. This is based off of the discussion in https://reviews.llvm.org/D117562. The logic to determine the defs we should collect and determining if sinking would be beneficial is mostly the same. The main differences are that we are no longer limiting it to immediate defs and the def and use do not have to be part of a loop. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D119475	2022-03-09 09:34:33 -08:00
Venkata Ramanaiah Nalamothu	04fff547e2	[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve its value when we have calls so that it gets saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame and the above approach makes it difficult to track the return address when the CFI information is emitted during the frame lowering, due to the involvement of understanding the control flow. This patch moves the return address ABI registers s[30:31] into callee saved registers range and stops adding live-in for return address registers, so that the CFI machinery will know where the return address resides when CSR save/restore happen during the frame lowering. And doing the above poses an issue that now the return instruction uses undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the `S_SETPC_B64_return` during the `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm, ronlieb Differential Revision: https://reviews.llvm.org/D114652	2022-03-09 12:18:02 +05:30
Stanislav Mekhanoshin	9eabea3968	[AMDGPU] Set noclobber metadata on loads instead of cast to constant A load via pointer cast to constant will return true from pointsToConstantMemory which is not necessarily so. Fixes: SWDEV-326463 Differential Revision: https://reviews.llvm.org/D121172	2022-03-07 23:13:02 -08:00
Christudasan Devadasan	0d849b8249	AMDGPU: Skip folding REG_SEQUENCE if found unknown regclasses for its users Use TII::getRegClass to return a valid regclass or a nullptr if the RC is unknown for a given OpIdx. This fixes a potential crash occurred while getting the RC from a variadic instruction. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D120813	2022-03-08 10:11:57 +05:30
Jacob Lambert	5160447f58	[AMDGPU] Add gfx10 assembler directive to specify shared VGPR count Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D105507	2022-03-07 14:27:41 -08:00
Stanislav Mekhanoshin	932f628121	[AMDGPU] new gfx940 fp atomics Differential Revision: https://reviews.llvm.org/D121028	2022-03-07 12:32:02 -08:00
Stanislav Mekhanoshin	e7b362d75d	[AMDGPU] Add v_mov_b64 gfx940 opcode Differential Revision: https://reviews.llvm.org/D121023	2022-03-07 12:07:12 -08:00
Stanislav Mekhanoshin	8992b50e2f	[AMDGPU] gfx940 uses new names for coherency bits Differential Revision: https://reviews.llvm.org/D120855	2022-03-07 11:50:07 -08:00
Austin Kerbow	0c0636f782	[AMDGPU] Fix uninitialized value after `8d0c34fd4f`	2022-03-07 11:32:01 -08:00
Stanislav Mekhanoshin	2c830c8fab	[AMDGPU] gfx940: support V_FMAMK_F32 and V_FMAAK_F32 Differential Revision: https://reviews.llvm.org/D120769	2022-03-07 11:31:01 -08:00
Austin Kerbow	8d0c34fd4f	[AMDGPU] Omit unnecessary waitcnt before barriers It is not necessary to wait for all outstanding memory operations before barriers on hardware that can back off of the barrier in the event of an exception when traps are enabled. Add a new subtarget feature which tracks which HW has this ability. Reviewed By: #amdgpu, rampitec Differential Revision: https://reviews.llvm.org/D120544	2022-03-07 08:23:53 -08:00
Jay Foad	d7d4ed0847	[AMDGPU] Tweak predicates for image_bvh_intersect_ray instructions Don't override SubtargetPredicate since that is already set in the base classes for the appropriate subtarget like MIMG_gfx10. Use OtherPredicates instead for consistency with the way we handle features like HasImageInsts and HasExtendedImageInsts. NFC. Differential Revision: https://reviews.llvm.org/D120909	2022-03-04 12:05:23 +00:00
Aakanksha	840695814a	[AMDGPU] Add gfx1036 target Differential Revision: https://reviews.llvm.org/D120846	2022-03-02 23:26:38 +00:00
Stanislav Mekhanoshin	35ec58d8c0	[AMDGPU] gfx940 removes all image instructions Differential Revision: https://reviews.llvm.org/D120763	2022-03-02 13:55:26 -08:00
Stanislav Mekhanoshin	2e2e64df4a	[AMDGPU] Add gfx940 target This is target definition only. Differential Revision: https://reviews.llvm.org/D120688	2022-03-02 13:54:48 -08:00
Jay Foad	5ddfedc956	[AMDGPU] Fix deleting of move-immediate instructions after folding SIInstrInfo::FoldImmediate tried to delete move-immediate instructions after folding them into their only use. This did not work because it was checking hasOneNonDBGUse after doing the fold, at which point there should be no uses. This seems to have no effect on codegen, it just means less stuff for DCE to clean up later. Differential Revision: https://reviews.llvm.org/D120815	2022-03-02 16:11:16 +00:00
Jay Foad	8bed52c9eb	[AMDGPU] Make more use of madmk/fmamk instructions In convertToThreeAddress handle VOP2 mac/fmac instructions with a literal src0 operand, since these are prime candidates for converting to madmk/fmamk. Previously this would only happen if src0 (or src1) was a register defined by a move-immediate instruction, but in many cases these operands have already been folded because SIFoldOperands runs before TwoAddressInstructionPass. Differential Revision: https://reviews.llvm.org/D120736	2022-03-02 10:22:10 +00:00
Mircea Trofin	cb2160760e	[nfc][codegen] Move RegisterBank[Info].h under CodeGen This wraps up from D119053. The 2 headers are moved as described, fixed file headers and include guards, updated all files where the old paths were detected (simple grep through the repo), and `clang-format`-ed it all. Differential Revision: https://reviews.llvm.org/D119876	2022-03-01 21:53:25 -08:00
Abinav Puthan Purayil	8b4ab01c38	[AMDGPU] Select no-return atomic ops in BUFInstructions.td This change adds the selection of no-return buffer_* instructions in tblgen. The motivation for this is to get the no-return atomic isel working without relying on post-isel hooks so that GlobalISel can start selecting them (once GlobalISelEmitter allows no return atomic patterns like how DAGISel does). This change handles the selection of no-return mubuf_atomic_cmpswap in tblgen without changing the extract_subreg generation for the return variant. This handling was done by the post-isel hook. Differential Revision: https://reviews.llvm.org/D120538	2022-03-02 08:25:28 +05:30
Jay Foad	289339140e	[AMDGPU] Handle legacy multiply-accumulate opcodes in convertToThreeAddress Handle V_MAC_LEGACY_F32 and V_FMAC_LEGACY_F32 in convertToThreeAddress, to avoid the need for an extra mov instruction in some cases. Differential Revision: https://reviews.llvm.org/D120704	2022-03-01 16:58:00 +00:00
Jay Foad	9ac3a85047	[AMDGPU] Disentangle MFMA handling in convertToThreeAddress. NFC. Move MFMA handling to the top of convertToThreeAddress and pull IsF16 calculation out of the switch. I think this makes it clearer exactly which mac/fmac opcodes are handled, since they are now listed in the switch with minimal extra clutter. Differential Revision: https://reviews.llvm.org/D120703	2022-03-01 16:56:56 +00:00
Jay Foad	68895098d1	[AMDGPU] Preserve src2_modifiers in convertToThreeAddress Found by code inspection. I don't think it makes a difference with current codegen, because if any source modifiers were present we would have selected mad/fma instead of mac/fmac in the first place. Differential Revision: https://reviews.llvm.org/D120709	2022-03-01 14:48:25 +00:00
Stanislav Mekhanoshin	517171ce20	[AMDGPU] Extend SILoadStoreOptimizer to handle flat load/stores TODO: merge flat with global promoting to flat. Differential Revision: https://reviews.llvm.org/D120351	2022-02-28 11:27:30 -08:00
Carl Ritson	2bbe6506d4	[AMDGPU] Remove redundant isVALU in SIPreEmitPeephole. NFC Remove redundant isVALU call added in D120202.	2022-02-27 16:09:20 +09:00
Benjamin Kramer	1de11fe360	Use RegisterInfo::regsOverlaps instead of checking aliases This is both less code and faster since it doesn't have to expand all the sub & superreg sets. NFCI.	2022-02-26 20:32:12 +01:00
Jameson Nash	c4b1a63a1b	mark getTargetTransformInfo and getTargetIRAnalysis as const Seems like this can be const, since Passes shouldn't modify it. Reviewed By: wsmoses Differential Revision: https://reviews.llvm.org/D120518	2022-02-25 14:30:44 -05:00
Changpeng Fang	ca62b1db9f	[AMDGPU][NFC]: Emit metadata for hidden_heap_v1 kernarg Summary: Emit metadata for hidden_heap_v1 kernarg Reviewers: sameerds, b-sumner Fixes: SWDEV-307188 Differential Revision: https://reviews.llvm.org/D119027	2022-02-25 10:45:35 -08:00
Aakanksha	bf60a1c546	Avoid comparisons between types of different widths in a loop condition to prevent the loop from behaving unexpectedly This change fixes the code violations flagged in AMD compute CodeQL scan - Query Description: "Comparisons between types of different widths in a loop condition can cause the loop to behave unexpectedly." Differential Revision: https://reviews.llvm.org/D120355	2022-02-25 17:30:12 +00:00
Carl Ritson	565af157ef	[AMDGPU] Extend pre-emit peephole to redundantly masked VCC Extend pre-emit peephole for S_CBRANCH_VCC[N]Z to eliminate redundant S_AND operations against EXEC for V_CMP results in VCC. These occur after register allocation when VCC has been selected as the comparison destination. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D120202	2022-02-25 10:18:31 +09:00
Jay Foad	05d79e3562	[AMDGPU] Divergence-driven instruction selection for bitreverse Differential Revision: https://reviews.llvm.org/D119702	2022-02-24 20:21:59 +00:00
Stanislav Mekhanoshin	3279e44063	[AMDGPU] Extend SILoadStoreOptimizer to handle global stores TODO: merge flat load/stores. TODO: merge flat with global promoting to flat. Differential Revision: https://reviews.llvm.org/D120346	2022-02-24 11:09:51 -08:00
Stanislav Mekhanoshin	cefa1c5ca9	[AMDGPU] Fix combined MMO in load-store merge Loads and stores can be out of order in the SILoadStoreOptimizer. When combining MachineMemOperands of two instructions operands are sent in the IR order into the combineKnownAdjacentMMOs. At the moment it picks the first operand and just replaces its offset and size. This essentially loses alignment information and may generally result in an incorrect base pointer to be used. Use a base pointer in memory addresses order instead and only adjust size. Differential Revision: https://reviews.llvm.org/D120370	2022-02-24 10:47:57 -08:00
Bill Wendling	a5bbc6ef99	[NFC] Remove unnecessary "#include"s from header files	2022-02-23 01:20:48 -08:00
Stanislav Mekhanoshin	9e055c0fff	[AMDGPU] Extend SILoadStoreOptimizer to handle global saddr loads This adds handling of the _SADDR forms to the GLOBAL_LOAD combining. TODO: merge global stores. TODO: merge flat load/stores. TODO: merge flat with global promoting to flat. Differential Revision: https://reviews.llvm.org/D120285	2022-02-22 09:01:43 -08:00
Stanislav Mekhanoshin	ba17bd2674	[AMDGPU] Extend SILoadStoreOptimizer to handle global loads There can be situations where global and flat loads and stores are not combined by the vectorizer, in particular if their address space differ in the IR but they end up the same class instructions after selection. For example a divergent load from constant address space ends up being the same global_load as a load from global address space. TODO: merge global stores. TODO: handle SADDR forms. TODO: merge flat load/stores. TODO: merge flat with global promoting to flat. Differential Revision: https://reviews.llvm.org/D120279	2022-02-22 08:42:36 -08:00
Thomas Symalla	380ff31d83	[AMDGPU] Fix typo in comment [NFC] This replaces "V_MOB_B32" with "V_MOV_B32" in some comment.	2022-02-22 13:27:26 +01:00
Stanislav Mekhanoshin	dc0981562e	[AMDGPU] Remove redundant check in the SILoadStoreOptimizer Differential Revision: https://reviews.llvm.org/D120268	2022-02-21 15:04:44 -08:00
Jay Foad	359a792f9b	[AMDGPU] SILoadStoreOptimizer: avoid unbounded register pressure increases Previously when combining two loads this pass would sink the first one down to the second one, putting the combined load where the second one was. It would also sink any intervening instructions which depended on the first load down to just after the combined load. For example, if we started with this sequence of instructions (code flowing from left to right): X A B C D E F Y After combining loads X and Y into XY we might end up with: A B C D E F XY But if B D and F depended on X, we would get: A C E XY B D F Now if the original code had some short disjoint live ranges from A to B, C to D and E to F, in the transformed code these live ranges will be long and overlapping. In this way a single merge of two loads could cause an unbounded increase in register pressure. To fix this, change the way that loads are moved in order to merge them so that: - The second load is moved up to the first one. (But when merging stores, we still move the first store down to the second one.) - Intervening instructions are never moved. - Instead, if we find an intervening instruction that would need to be moved, give up on the merge. But this case should now be pretty rare because normal stores have no outputs, and normal loads only have address register inputs, but these will be identical for any pair of loads that we try to merge. As well as fixing the unbounded register pressure increase problem, moving loads up and stores down seems like it should usually be a win for memory latency reasons. Differential Revision: https://reviews.llvm.org/D119006	2022-02-21 10:51:14 +00:00
Jay Foad	57baa14d74	[AMDGPU] Rename AMDGPUCFGStructurizer to R600MachineCFGStructurizer Previously the name of the class (AMDGPUCFGStructurizer) did not match the name of the file (AMDILCFGStructurizer). Standardize on the name R600MachineCFGStructurizer by analogy with AMDGPUMachineCFGStructurizer. Differential Revision: https://reviews.llvm.org/D120128	2022-02-18 15:08:25 +00:00
Sebastian Neubauer	6527b2a4d5	[AMDGPU][NFC] Fix typos Fix some typos in the amdgpu backend. Differential Revision: https://reviews.llvm.org/D119235	2022-02-18 15:05:21 +01:00
Sebastian Neubauer	1f0aadfa62	[AMDGPU] Fix kill flag on overlapping sgpr copy Same as on vgpr copies, we cannot kill the source register if it overlaps with the destination register. Otherwise, the kill of the source register will also count as a kill for the destination register. Differential Revision: https://reviews.llvm.org/D120042	2022-02-18 14:36:00 +01:00
Jay Foad	69ab233a15	[AMDGPU] Return better Changed status from SIFoldOperands Differential Revision: https://reviews.llvm.org/D120023	2022-02-18 10:35:48 +00:00
Jay Foad	768e6faba8	[AMDGPU] Return better Changed status from SILowerControlFlow Differential Revision: https://reviews.llvm.org/D120025	2022-02-18 10:09:22 +00:00
Jay Foad	d86dcb7ea5	[AMDGPU] Return better Changed status from SIOptimizeExecMasking Differential Revision: https://reviews.llvm.org/D120024	2022-02-18 10:09:21 +00:00
Stanislav Mekhanoshin	b0aa1946df	[AMDGPU] Promote recursive loads from kernel argument to constant Non-clobbered pointer load chains are currently promoted to global. It is possible to promote these loads themselves into the constant address space. Loaded pointers still need to point to global because we need to be able to store into that pointer and because an actual load from it may occur after a clobber. Differential Revision: https://reviews.llvm.org/D119886	2022-02-17 11:07:03 -08:00
Jay Foad	c08896d292	[AMDGPU] Return better Changed status from SILowerI1Copies Differential Revision: https://reviews.llvm.org/D119946	2022-02-17 09:38:57 +00:00
Jay Foad	78ebb1dd24	[AMDGPU] Return better Changed status from SIAnnotateControlFlow Differential Revision: https://reviews.llvm.org/D119945	2022-02-17 09:38:57 +00:00
Jay Foad	1822a5ecdd	[AMDGPU] Return better Changed status from AMDGPUPerfHintAnalysis Differential Revision: https://reviews.llvm.org/D119944	2022-02-17 09:31:42 +00:00
Jay Foad	77e793d025	[AMDGPU] Return better Changed status from AMDGPUAnnotateUniformValues Differential Revision: https://reviews.llvm.org/D119943	2022-02-17 09:31:42 +00:00
Matt Arsenault	3884cb9235	AMDGPU: Always reserve VGPR for AGPR copies on gfx908 Just because there aren't AGPRs in the original program doesn't mean the register allocator can't choose to use them (unless we were to forcibly reserve all AGPRs if there weren't any uses). This happens in high pressure situations and introduces copies to avoid spills. In this test, the allocator ends up introducing a copy from SGPR to AGPR which requires an intermediate VGPR. I don't believe it would introduce a copy from AGPR to AGPR in this situation, since it would be trying to use an intermediate with a different class. Theoretically this is also broken on gfx90a, but I have been unable to come up with a testcase.	2022-02-16 18:48:18 -05:00
Jacob Lambert	7470244475	[AMDGPU] Add agpr_count to metadata and AsmParser gfx90a allows the number of ACC registers (AGPRs) to be set independently to the VGPR registers. For both HSA and PAL metadata, we now include an "agpr_count" key to report the number of AGPRs set for supported devices (gfx90a, gfx908, as determined by hasMAIInsts()). This is collected from SIProgramInfo.NumAccVGPR for both HSA and PAL. The AsmParser also now recognizes ".kernel.agpr_count" for supported devices. Differential Revision: https://reviews.llvm.org/D116140	2022-02-16 15:17:23 -08:00
Jacob Lambert	0bad7cb565	Hoist getTotalNumVGPRs into AMDGPUBaseInfo for use in both codegen and MC Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D119912	2022-02-16 11:04:08 -08:00
Dmitry Preobrazhensky	6655c5a6bb	[AMDGPU][MC][GFX10] Added an alias for HW_REG_HW_ID1 Enabled HW_REG_HW_ID as an alias for HW_REG_HW_ID1. This is required for compatibility with existing code. Differential Revision: https://reviews.llvm.org/D119939	2022-02-16 19:45:44 +03:00
Shao-Ce SUN	2aed07e96c	[NFC][MC] remove unused argument `MCRegisterInfo` in `MCCodeEmitter` Reviewed By: skan Differential Revision: https://reviews.llvm.org/D119846	2022-02-16 13:10:09 +08:00
Matt Arsenault	898dc8a4b1	AMDGPU: Use subtarget in class instead of querying function	2022-02-15 21:28:12 -05:00
Stanislav Mekhanoshin	29a0e0a9e5	[AMDGPU] Do not define GET_INSTRINFO_SCHED_ENUM Autogenerated names are too long and break compilation on Windows, while we do not need this enum at all. Differential Revision: https://reviews.llvm.org/D119869	2022-02-15 13:00:54 -08:00
Jay Foad	a65b9dd049	[AMDGPU] Divergence-driven instruction selection for bfm patterns Differential Revision: https://reviews.llvm.org/D119706	2022-02-15 10:49:18 +00:00
Jay Foad	f72d8897ac	[AMDGPU] Honor !invariant.load metadata on load-like intrinsics Differential Revision: https://reviews.llvm.org/D119739	2022-02-15 09:16:57 +00:00
Jay Foad	cb199e0fca	[MC] Define and use MCRegisterInfo::regsOverlap Separate MCRegisterInfo::regsOverlap out from TargetRegisterInfo::regsOverlap. This is useful in the AMDGPU AsmParser where we only have access to MCRegisterInfo. Differential Revision: https://reviews.llvm.org/D119533	2022-02-14 20:46:02 +00:00
Joe Nash	c87c61c52c	[AMDGPU] Fix AGPR offset for waitcnt An enum value stores the offset between AGPR ranges and VGPR ranges in the internal storage of SIInsertWaitcnts. It said 226 when it should say 256, causing some portion of the ranges to overlap. That in turn causes 'aliasing' between the registers, potentially inserting waitcnts that are not required. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D119749	2022-02-14 15:16:21 -05:00
alex-t	c23198ec13	[AMDGPU] Divergence-driven abs instruction selection This change enables "abs" SDNodes selection by the node divergence. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D119581	2022-02-14 21:36:32 +03:00
Stanislav Mekhanoshin	c7eb846345	[AMDGPU] Merge AMDGPULDSUtils into AMDGPUMemoryUtils Differential Revision: https://reviews.llvm.org/D119502	2022-02-11 10:32:24 -08:00
Jay Foad	b59ad64ead	[TableGen][AMDGPU] Allow empty register classes Remove ARTIFICIAL_VGPR which only existed to make VReg_1 not empty. Differential Revision: https://reviews.llvm.org/D119552	2022-02-11 17:30:04 +00:00
Sebastian Neubauer	a5d4f82b73	[AMDGPU] Make enable-flat-scratch a subtarget feature Use a subtarget feature instead of a command line argument to reduce global state. We want to enable flat scratch for graphics in some cases and this doesn't work well with command line options. Differential Revision: https://reviews.llvm.org/D119425	2022-02-11 18:23:07 +01:00
Sameer Sahasrabuddhe	d8f99bb6e0	[AMDGPU] replace hostcall module flag with function attribute The module flag to indicate use of hostcall is insufficient to catch all cases where hostcall might be in use by a kernel. This is now replaced by a function attribute that gets propagated to top-level kernel functions via their respective call-graph. If the attribute "amdgpu-no-hostcall-ptr" is absent on a kernel, the default behaviour is to emit kernel metadata indicating that the kernel uses the hostcall buffer pointer passed as an implicit argument. The attribute may be placed explicitly by the user, or inferred by the AMDGPU attributor by examining the call-graph. The attribute is inferred only if the function is not being sanitized, and the implicitarg_ptr does not result in a load of any byte in the hostcall pointer argument. Reviewed By: jdoerfert, arsenm, kpyzhov Differential Revision: https://reviews.llvm.org/D119216	2022-02-11 22:51:56 +05:30
Julien Pages	dcb2da13f1	[AMDGPU] Add a new intrinsic to control fp_trunc rounding mode Add a new llvm.fptrunc.round intrinsic to precisely control the rounding mode when converting from f32 to f16. Differential Revision: https://reviews.llvm.org/D110579	2022-02-11 12:08:23 -05:00
Mirko Brkusanin	5ff35ba8ae	[AMDGPU][GlobalISel] Fix insert point in FoldableFneg combine Newly created fneg was built after some of its uses in some cases. Now it will be built immediately after the instruction whose dst it negates. Differential Revision: https://reviews.llvm.org/D119459	2022-02-11 12:09:40 +01:00
serge-sans-paille	06943537d9	Cleanup MCParser headers As usual with that header cleanup series, some implicit dependencies now need to be explicit: llvm/MC/MCParser/MCAsmParser.h no longer includes llvm/MC/MCParser/MCAsmLexer.h Preprocessed lines to build llvm on my setup: after: 1068185081 before: 1068324320 So no compile time benefit to expect, but we still get the looser coupling between files which is great. Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D119359	2022-02-11 10:39:29 +01:00
Stanislav Mekhanoshin	290e5722e8	[AMDGPU] Improve clobbering checks in the kernel argument promotion Use same MSSA clobbering checks as in the AMDGPUAnnotateUniformValues. Kernel argument promotion needs exactly the same information so factor out utility function isClobberedInFunction. Differential Revision: https://reviews.llvm.org/D119480	2022-02-10 14:51:47 -08:00
alex-t	d88a146f2b	[AMDGPU] Missed sign/zero extend patterns for divergence-driven instruction selection This change includes tablegen patterns that were missed by https://reviews.llvm.org/D110950 and https://reviews.llvm.org/D76230 Reviewed By: foad Differential Revision: https://reviews.llvm.org/D119302	2022-02-10 19:36:12 +03:00
Simon Pilgrim	8de7297374	[AMDGPU] Pull out repeated getVecSize() calls. NFC. This is guaranteed to be evaluated so we can avoid repeated calls. Helps the static analyzer as it couldn't recognise that each getVecSize() would return the same value.	2022-02-10 16:31:36 +00:00
Jay Foad	e34623b165	[AMDGPU] Rename DSAtomicCmpXChg to DSAtomicCmpXChgSwapped. NFC. This is just a reminder that the operands are swapped compared with all the other CmpXChg instructions. Differential Revision: https://reviews.llvm.org/D119421	2022-02-10 14:54:44 +00:00
Abinav Puthan Purayil	29bd3fadbc	[AMDGPU] Select no-return atomic ops in FLATInstructions.td. This change adds the selection for the no-return global_* and flat_* instructions in tblgen. The motivation for this is to get the no-return atomic isel working without relying on post-isel hooks so that GlobalISel can start selecting them (once GlobalISelEmitter allows no return atomic patterns like how DAGISel does). Differential Revision: https://reviews.llvm.org/D119227	2022-02-10 09:26:37 +05:30
Abinav Puthan Purayil	f4e8cf25af	[AMDGPU] Select no-return ds_* atomic ops in tblgen. SelectionDAG relies on MachineInstr's HasPostISelHook for selecting the no-return atomic ops. GlobalISel, at the moment, doesn't handle HasPostISelHook. This change adds the selection for no-return ds_* atomic ops in tblgen so that it can work with both GlobalISel and SelectionDAG. I couldn't add the predicates for GlobalISel in this change since there's a restriction in GlobalISelEmitter that disallows selecting generic atomic ops that return with instructions that don't return. We can't remove the HasPostISelHook code that selects the no-return atomic ops in SelectionDAG yet since we still need to cover selections in FLATInstructions.td and BUFInstructions.td. Differential Revision: https://reviews.llvm.org/D115881	2022-02-10 09:26:37 +05:30
Jay Foad	476bb2d94e	[AMDGPU] Remove dead code from shrinkScalarLogicOp It looks like this code has been dead since shrinkScalarLogicOp was introduced in svn r348601.	2022-02-09 17:07:12 +00:00
Jay Foad	db28a45617	[AMDGPU] Remove irrelevant comments on V_BFE_I32 instructions These comments explain the encoding of the immediate operand of S_BFE_* which is not relevant for V_BFE_I32.	2022-02-09 12:06:29 +00:00
serge-sans-paille	ef736a1c39	Cleanup LLVMMC headers There's a few relevant forward declarations in there that may require downstream adding explicit includes: llvm/MC/MCContext.h no longer includes llvm/BinaryFormat/ELF.h, llvm/MC/MCSubtargetInfo.h, llvm/MC/MCTargetOptions.h llvm/MC/MCObjectStreamer.h no longer include llvm/MC/MCAssembler.h llvm/MC/MCAssembler.h no longer includes llvm/MC/MCFixup.h, llvm/MC/MCFragment.h Counting preprocessed lines required to rebuild llvm-project on my setup: before: 1052436830 after: 1049293745 Which is significant and backs up the change in addition to the usual benefits of decreasing coupling between headers and compilation units. Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D119244	2022-02-09 11:09:17 +01:00
Sameer Sahasrabuddhe	c6a6b57902	[AMDGPU] [NFC] Fix incorrect use of bitwise operator. Differential Revision: https://reviews.llvm.org/D119308	2022-02-08 22:12:54 -05:00
Stanislav Mekhanoshin	aeaf85b9c2	[AMDGPU] Select VGPR versions of MFMA if possible We can select _vgprcd versions of MAI instructions and have no AGPRs with the whole budget left for VGPRs if: 1. This is a kernel; 2. It has no calls; 3. It runs at least on 2 waves thus having not more than 256 VGPRs. 4. There is no inline asm requesting AGPRs. Differential Revision: https://reviews.llvm.org/D117253	2022-02-08 10:19:41 -08:00
Matt Arsenault	f2c99ea47d	AMDGPU: Use reserved VGPR for AGPR spills to memory Previously would reuse the VGPR used for large frame offsets with the one needed for copying from the AGPR. Fix this by reusing the register we already reserved for handling AGPR to AGPR copies.	2022-02-08 11:26:59 -05:00
Matt Arsenault	8b2ca766f0	AMDGPU: Reserve v32 if we may need to copy between AGPRs on gfx908 We need to guarantee cheap copies between AGPRs, and unfortunately gfx908 cannot directly do this. Theoretically we could set the scavenger up with an emergency spill slot, but it also feels unreasonable to pay that cost for what was assumed to be a simple and cheap copy. Pick a register that doesn't conflict with any ABI registers. This does not address the same issue when copying from SGPR to AGPR for gfx90a (this coincidentally fixes it for gfx908), but that's less interesting since the register allocator shouldn't be proactively introducing such copies. One edge case I'm worried about is respecting the VGPR budget implied by amdgpu-waves-per-eu. If the theoretical upper bound of a function is 32 VGPRs, this will force the actual count to be 33. This is also broken if inline assembly uses/defs something in v32. The coalescer will eliminate the intermediate vreg between the def and use, and the introduced copy will clobber the user value. (cherry picked from commit 3335784ac2d587ff4eac04586e189532ae8b2607)	2022-02-08 11:14:52 -05:00
Nikita Popov	997027347d	[AMDGPURewriteOutArguments] Don't use pointer element type Instead of using the pointer element type, look at how the pointer is actually being used in store instructions, while looking through bitcasts. This makes the transform compatible with opaque pointers and a bit more general. It's worth noting that I have dropped the 3-vector to 4-vector shufflevector special case, because this is now handled in a different way: If the value is actually used as a 4-vector, then we're directly going to use that type, instead of shuffling to a 3-vector in between. Differential Revision: https://reviews.llvm.org/D119237	2022-02-08 16:10:41 +01:00
Simon Pilgrim	fd2bb51f1e	[ADT] Add APInt/MathExtras isShiftedMask variant returning mask offset/length In many cases, calls to isShiftedMask are immediately followed with checks to determine the size and position of the bitmask. This patch adds variants of APInt::isShiftedMask, isShiftedMask_32 and isShiftedMask_64 that return these values as additional arguments. I've updated a number of cases that were either performing separate size/position calculations or had created their own local wrapper versions of these. Differential Revision: https://reviews.llvm.org/D119019	2022-02-08 12:04:13 +00:00
Sameer Sahasrabuddhe	02a2e46ff0	[AMDGPU] [NFC] refactor the AMDGPU attributor Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D119087	2022-02-07 21:45:32 -05:00
Carl Ritson	a1fb307b4b	[AMDGPU] Allow hoisting of some VALU compare instructions Conservatively allow hoisting/sinking of VALU comparisons. If the result of a comparison is masked with exec, narrowing the set of active lanes, then it is safe to hoist it as the masking instruction will never be hoisted. Heuristically this is also true for sinking, as we do not expect the result of a sunk comparison that is masked with exec to be used outside of the loop. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D118975	2022-02-08 11:27:23 +09:00
Vang Thao	570471199b	[AMDGPU] Fix debug values in scheduler not placed correctly when reverting Debug position data is cleared after ScheduleDAGMILive::schedule() due to it also calling placeDebugValues(). Make it so the data is not cleared after initial call to placeDebugValues since we will call it again after reverting a schedule. Secondly, since we skip debug instructions when reverting the schedule on AMDGPU, all debug instructions are now moved to the end of the scheduling region. RegionEnd points to the beginning of this chunk of debug instructions since it was not incremented when a debug instruction was skipped. RegionBegin may also point to the same debug instruction if Unsched.front() is a debug instruction thus shrinking the region to 1. Fix RegionBegin and RegionEnd so that they point to the current beginning and ending before calling placeDebugValues() since both vars will be used as reference points to move debug instructions back. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D119022	2022-02-07 11:01:13 -08:00
Matt Arsenault	31973062ec	AMDGPU: Fix clobbering SCC when expanding large offset spill pseudos If we had a large offset which required materializing in a register, we would emit an s_add_i32, clobbering SCC. Start checking if SCC is live, and instead use a VGPR offset. For MUBUF, we switch to using offen. We would do this anyway in a normal load/store with a frame index, but not for spills. The same problem still exists in other contexts where we expand frame indices. The nasty edge case is when SGPRs are spilled to memory at a large frame offset where SCC is also clobbered. This requires a second scavenging index, and also required several patches in the scavenger to correctly handle multiple recursive scavenge indexes. An even nastier edge case we still don't support is if we don't have any free SGPRs. If SCC is live and we don't have any free SGPRs to save exec, we have no way of flipping exec back and forth without also clobbering SCC. Fixes: SWDEV-309419	2022-02-07 10:02:03 -05:00
Kazu Hirata	3a3cb929ab	[llvm] Use = default (NFC)	2022-02-06 22:18:35 -08:00
Ruiling Song	0719c43735	AMDGPU: Don't clobber source register for V_SET_INACTIVE_* The WWM register has unmodeled register liveness. For v_set_inactive_*, clobbering the source register is dangerous because it will overwrite the inactive lanes. When the source vgpr is dead at v_set_inactive_lane, the inactive lanes may not really be dead. This may cause common optimizations to go wrong. For example in a simple if-then cfg in Machine IR: bb.if: %src = bb.then: %src1 = COPY %src %dst = V_SET_INACTIVE %src1(tied-def 0), %inactive bb.end ... = PHI [0, %bb.then] [%src, %bb.if] The register coalescer will think it is safe to optimize "%src1 = COPY %src" in bb.then. And at the same time, there is no interference for the PHI in bb.end. The source and destination values of the PHI will be assigned the same register. The single PHI register will be overwritten by the v_set_inactive, and we would then get the wrong value in bb.end. With this change, we will copy the content of the source register before setting inactive lanes after register allocation. Yes, this will sacrifice the WWM code generation a little, but I don't have any better idea to do things correctly. Differential Revision: https://reviews.llvm.org/D117482	2022-02-06 12:38:26 +08:00
Matt Arsenault	8b8b491379	AMDGPU/GlobalISel: Fix assertions on invalid addrspacecasts Fixes some assert on invalid situations and starts directly emitting the error.	2022-02-04 17:28:49 -05:00
Matt Arsenault	4622afa94c	AMDGPU: Convert AMDGPUResourceUsageAnalysis to a Module pass This is more precise in the face of indirect calls and aliases, still assuming the call target is defined somewhere in the current module. This sometimes changes the order the functions are printed, and also changes the point where context errors are printed relative to stdout. This also likely has negative consequences for compile time and memory usage.	2022-02-04 15:56:04 -05:00
Matt Arsenault	935abab65c	AMDGPU: Use module level register maximums for unknown callees Compute the theoretical register budget based on the IR function signature/attributes, and use the global maximum register budgets for unknown callees. This should fix the kernel reported register usage in the presence of indirect calls. The previous fix in `2b08f6af62` was incorrect because it was only taking the maximum in the known call graph, and missing something that was either outside of it or codegened later. This fixes a second case I discovered where calls to aliases also did not work as expected. CallGraphAnalysis misses these, so functions called through aliases were not codegened ahead of callers as expected. CallGraphAnalysis should probably be fixed to understand this case, and there's likely a bug with IPRA here. This fixes numerous failures in the conformance test at -O0.	2022-02-04 15:56:03 -05:00
Sebastian Neubauer	4a02562275	[AMDGPU] Lazily init pal metadata on first function Delay reading global metadata until the first function or the end of the file is emitted. That way, earlier module passes can set metadata that is emitted in the ELF. `emitStartOfAsmFile` gets called when the passes are initialized, which prevented earlier passes from changing the metadata. This fixes issues encountered after converting AMDGPUResourceUsageAnalysis to a Module pass in D117504. Differential Revision: https://reviews.llvm.org/D118492	2022-02-04 18:39:35 +01:00
Jay Foad	a456ace9c1	[AMDGPU] SILoadStoreOptimizer: rewrite checkAndPrepareMerge. NFCI. Separate the function clearly into: - Checks that can be done on CI and Paired before the loop. - The loop over all instructions between CI and Paired. - Checks that must be done on InstsToMove after the loop. Previously these were mostly done inside the loop in a very confusing way. Differential Revision: https://reviews.llvm.org/D118994	2022-02-04 17:17:29 +00:00
Jay Foad	001cb43159	[AMDGPU] SILoadStoreOptimizer: fewer calls to offsetsCanBeCombined Only call offsetsCanBeCombined with Modify = true in cases where it will really do something. NFC.	2022-02-04 14:52:11 +00:00
Jay Foad	00bbda07ae	[AMDGPU] SILoadStoreOptimizer: simplify class/subclass checks Also add a comment explaining the difference between class and subclass. NFCI.	2022-02-04 14:08:04 +00:00
Jay Foad	33ef8bdf36	[AMDGPU] SILoadStoreOptimizer: simplify optimizeInstsWithSameBaseAddr Common up all the calls to CI.setMI. NFCI.	2022-02-04 13:25:05 +00:00
Jay Foad	ca05edd927	[AMDGPU] SILoadStoreOptimizer: simplify OptimizeListAgain test At this point CI represents the combined access (original CI combined with Paired) so it doesn't make any sense to add in Paired.width again. NFCI.	2022-02-04 13:02:19 +00:00
serge-sans-paille	ffe8720aa0	Reduce dependencies on llvm/BinaryFormat/Dwarf.h This header is very large (3M lines once expanded) and was included in locations where DWARF-specific information was not needed. More specifically, this commit suppresses the dependencies on llvm/BinaryFormat/Dwarf.h in two headers: llvm/IR/IRBuilder.h and llvm/IR/DebugInfoMetadata.h. As these headers (esp. the former) are widely used, this has a decent impact on the number of preprocessed lines generated during compilation of LLVM, as showcased below. This is achieved by moving some definitions back to the .cpp file, no performance impact implied[0]. As a consequence of that patch, downstream users may need to manually include some extra files: llvm/IR/IRBuilder.h no longer includes llvm/BinaryFormat/Dwarf.h llvm/IR/DebugInfoMetadata.h no longer includes llvm/BinaryFormat/Dwarf.h In some situations, code may be relying on the fact that llvm/BinaryFormat/Dwarf.h was including llvm/ADT/Triple.h; this hidden dependency now needs to be explicit. $ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/Transforms/Scalar/*.cpp -std=c++14 -fno-rtti -fno-exceptions \| wc -l after: 10978519 before: 11245451 Related Discourse thread: https://llvm.discourse.group/t/include-what-you-use-include-cleanup [0] https://llvm-compile-time-tracker.com/compare.php?from=fa7145dfbf94cb93b1c3e610582c495cb806569b&to=995d3e326ee1d9489145e20762c65465a9caeab4&stat=instructions Differential Revision: https://reviews.llvm.org/D118781	2022-02-04 11:44:03 +01:00
Vang Thao	2ca194ff55	[AMDGPU] Fix scheduler live-ins with debug inst at start of block GCNDownwardRPTracker RPTracker.reset() skips debug instructions for NextMI so RPTracker.getNext() will never give the beginning of a sched region if it is a debug value. In this case we will never set the live-ins for that block. Add check to see if getNext also equals the MI after skipping debug instructions. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D118853	2022-02-03 12:41:32 -08:00
Stanislav Mekhanoshin	1d111090ad	[AMDGPU] Fix windows build warning with IMMBitSelConst. NFC. VS gives this warning for an integer constant: AMDGPUGenDAGISel.inc(214687): warning C4334: '<<': result of 32-bit shift implicitly converted to 64 bits (was 64-bit shift intended?)	2022-02-03 12:03:22 -08:00
Stanislav Mekhanoshin	d3b87e4a1c	[AMDGPU] HWRegs TMA and TBA also supported on gfx9 Differential Revision: https://reviews.llvm.org/D118860	2022-02-03 09:36:10 -08:00
Thomas Symalla	476babcc1d	[AMDGPU] Introduce new ISel combine for trunc-slr patterns In some cases, when selecting a (trunc (slr)) pattern, the slr gets translated to a v_lshrrev_b32_e64 instruction whereas the truncation gets selected to a sequence of v_and_b32_e64 and v_cmp_eq_u32_e64. In the final ISA, this appears as selecting the nth-bit: v_lshrrev_b32_e32 v0, 2, v1 v_and_b32_e32 v0, 1, v0 v_cmp_eq_u32_e32 vcc_lo, 1, v0 However, when the value used in the right shift is known at compilation time, the whole sequence can be reduced to two VALUs when the constant operand in the v_and is adjusted to (1 << lshrrev_operand): v_and_b32_e32 v0, (1 << 2), v1 v_cmp_ne_u32_e32 vcc_lo, 0, v0 In the example above, the following pseudo-code: v0 = (v1 >> 2) v0 = v0 & 1 vcc_lo = (v0 == 1) would be translated to: v0 = v1 & 0b100 vcc_lo = (v0 == 0b100) which should yield an equivalent result. This is a little bit hard to test as one needs to force the SelectionDAG to contain the nodes before instruction selection, but the test sequence was roughly derived from a production shader. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D118461	2022-02-03 18:06:44 +01:00
Jay Foad	b9cf52bc3d	[AMDGPU] Simplify AMDGPUAnnotateUniformValues::visitLoadInst Always set uniform metadata on the pointer if it is an instruction, but otherwise do not bother to create a trivial getelementptr instruction, because AMDGPUInstrInfo::isUniformMMO can already detect that various non-instruction pointers are uniform. Most of the test case churn is from tests that used undef as a pointer, which AMDGPUInstrInfo::isUniformMMO treats as uniform. Differential Revision: https://reviews.llvm.org/D118909	2022-02-03 16:27:48 +00:00
Matt Arsenault	d6fdbbcace	AMDGPU: Add second emergency slot for SGPR to vmem for large frames In a future change, we will sometimes use a VGPR offset for doing spills to memory, in which case we need 2 free VGPRs to do the SGPR spill. In most cases we could spill the VGPR along with the SGPR being spilled, but we don't have any free lanes for SGPR_1024 in wave32 so we could still potentially need a second scavenging slot.	2022-02-02 19:05:05 -05:00
Matt Arsenault	245e25f9c3	AMDGPU: Implement isAsmClobberable Warn on inline assembly clobbering reserved registers. It should also warn on at least some reserved register defs, but that isn't happening right now. If you have a def and re-use of a register we reserve, the register coalescer will eliminate the intermediate virtual register. When the reserved reg def is introduced later by the backend, it will end up clobbering the value the register coalescer assumed was live through the range. There is also isInlineAsmReadOnlyReg, although I don't understand what the distinction really is. It's called in SelectionDAGBuilder, long before the set of reserved registers is frozen so I'm not sure how that can possibly work reliably. Unfortunately this is also using the ugly tablegenerated names for the registers.	2022-02-02 14:20:12 -05:00
Jay Foad	ddd3807e69	[AMDGPU] Use new target MMO flag MONoClobber This allows us to set the noclobber flag on (the MMO of) a load instruction instead of on the pointer. This fixes a bug where noclobber was being applied to all loads from the same pointer, even if some of them were clobbered. Differential Revision: https://reviews.llvm.org/D118775	2022-02-02 17:12:36 +00:00
serge-sans-paille	e188aae406	Cleanup header dependencies in LLVMCore Based on the output of include-what-you-use. This is a big chunk of changes. It is very likely to break downstream code unless they took a lot of care in avoiding hidden header dependencies, something the LLVM codebase doesn't do that well :-/ I've tried to summarize the biggest changes below: - llvm/include/llvm-c/Core.h: no longer includes llvm-c/ErrorHandling.h - llvm/IR/DIBuilder.h no longer includes llvm/IR/DebugInfo.h - llvm/IR/IRBuilder.h no longer includes llvm/IR/IntrinsicInst.h - llvm/IR/LLVMRemarkStreamer.h no longer includes llvm/Support/ToolOutputFile.h - llvm/IR/LegacyPassManager.h no longer includes llvm/Pass.h - llvm/IR/Type.h no longer includes llvm/ADT/SmallPtrSet.h - llvm/IR/PassManager.h no longer includes llvm/Pass.h nor llvm/Support/Debug.h And the usual count of preprocessed lines: $ clang++ -E -Iinclude -I../llvm/include ../llvm/lib/IR/*.cpp -std=c++14 -fno-rtti -fno-exceptions \| wc -l before: 6400831 after: 6189948 200k fewer lines to process is not that bad ;-) Discourse thread on the topic: https://llvm.discourse.group/t/include-what-you-use-include-cleanup Differential Revision: https://reviews.llvm.org/D118652	2022-02-02 06:54:20 +01:00
Jacob Lambert	a24ff176a6	[AMDGPU][NFC] Fixing formatting Differential Revision: https://reviews.llvm.org/D117801	2022-02-01 17:59:01 -08:00
Stanislav Mekhanoshin	79606ee85c	[AMDGPU] Check atomics aliasing in the clobbering annotation MemorySSA considers any atomic a def to any operation it dominates just like a barrier or fence. That is correct from memory state perspective, but not required for the no-clobber metadata since we are not using it for reordering. Skip such atomics during the scan just like a barrier if it does not alias with the load. Differential Revision: https://reviews.llvm.org/D118661	2022-02-01 12:33:25 -08:00
Stanislav Mekhanoshin	c2b18a3cc5	[AMDGPU] Allow scalar loads after barrier Currently we cannot convert a vector load into scalar if there is dominating barrier or fence. It is considered a clobbering memory access to prevent memory operations reordering. While reordering is not possible the actual memory is not being clobbered by a barrier or fence and we can still use a scalar load for a uniform pointer. The solution is not to bail on a first clobbering access but traverse MemorySSA to the root excluding barriers and fences. Differential Revision: https://reviews.llvm.org/D118419	2022-02-01 11:43:17 -08:00
Changpeng Fang	1194b9cdda	AMDGPU {NFC}: Add code object v5 support and generate metadata for implicit kernel args Summary: Add code object v5 support (default is still v4) Generate metadata for implicit kernel args for the new ABI Set the metadata version to be 1.2 Reviewers: t-tye, b-sumner, arsenm, and bcahoon Fixes: SWDEV-307188, SWDEV-307189 Differential Revision: https://reviews.llvm.org/D118272	2022-01-31 18:07:47 -08:00
Jay Foad	0dcc8b86ee	[AMDGPU] AMDGPUAnnotateUniformValues: inline a single-use lambda. NFC.	2022-01-31 11:22:09 +00:00
Jay Foad	68e3946270	[AMDGPU] SILoadStoreOptimizer: break lists on instructions with side effects This just helps to keep the lists shorter and faster to sort. NFCI. Differential Revision: https://reviews.llvm.org/D118384	2022-01-28 18:03:42 +00:00
Jay Foad	4b133cee80	[AMDGPU] SILoadStoreOptimizer: reject AGPR DS_WRITE sooner Rejecting AGPR DS_WRITE instructions before adding them to any mergeable list seems cleaner than adding them to the list and rejecting them later. Differential Revision: https://reviews.llvm.org/D118368	2022-01-27 18:20:46 +00:00
Jay Foad	94a4594c54	[AMDGPU] SILoadStoreOptimizer: use separate lists for AGPR instructions Using separate lists for AGPR and non-AGPR instructions seems like a cleaner solution than putting them all in the same list and then later refusing to merge instructions of different AGPR-ness. Differential Revision: https://reviews.llvm.org/D118367	2022-01-27 18:20:46 +00:00
Jay Foad	8a52fef1e0	[AMDGPU] SILoadStoreOptimizer: tweak API of CombineInfo::setMI. NFC. Change CombineInfo::setMI to take a reference to the SILoadStoreOptimizer instance, for easy access to common fields like TII and STM. Differential Revision: https://reviews.llvm.org/D118366	2022-01-27 18:20:46 +00:00
Matt Arsenault	33b45ee44b	AMDGPU: Handle addrspacecast of constant 32-bit to flat I accidentally made this work on the GlobalISel path, and there's no real reason not to handle this.	2022-01-27 11:01:44 -05:00
Matt Arsenault	aa88b65392	AMDGPU/GlobalISel: Fix assert on invalid cond code for llvm.amdgcn.icmp	2022-01-27 10:34:06 -05:00
Matt Arsenault	f482e86980	AMDGPU/GlobalISel: Fix flat_scratch_init handling for shaders I don't think this is actually defined for mesa, but this is what we were doing on the DAG path.	2022-01-27 10:20:52 -05:00
Jay Foad	185cb8e82c	[AMDGPU] SILoadStoreOptimizer: Allow merging across a swizzled access Swizzled accesses are not merged, but there is no particular reason not to merge two instructions if any of the intervening instructions happens to be a swizzled access. This moves the check for swizzled accesses out of checkAndPrepareMerge into collectMergeableInsts where I think it makes more sense. Differential Revision: https://reviews.llvm.org/D118267	2022-01-27 14:40:58 +00:00
Jay Foad	95857a7058	[AMDGPU] SILoadStoreOptimizer: Remove redundant check for volatile SILoadStoreOptimizer::collectMergeableInsts already ends the current block if it sees a volatile (or ordered) memory access, so there is no need to check for them again when scanning the instructions between two pairing candidates in a block. Differential Revision: https://reviews.llvm.org/D118266	2022-01-27 10:14:53 +00:00
Stanislav Mekhanoshin	409c4436f9	[AMDGPU] Validate dst and src2 non-overlapping restriction in asm Differential Revision: https://reviews.llvm.org/D118089	2022-01-26 15:14:06 -08:00
Stanislav Mekhanoshin	dbf278b984	[AMDGPU] Prevent aliasing of SrcC and Dst in MAI From the MAI spec: It’s ok that Src_C and vDst are the exact same VGPRs or Src_C and vDst are completely separated. The case that Src_C and vDst are overlapping should be avoided as new value could be written to accumulator input before it gets read. Note that this inevitably increases register pressure to the point where some programs will become uncompilable. This patch separates MAC and FMA versions of MFMA instructions using either tied dst and src2 or earlyclobber dst. Fixes: SWDEV-318900 Differential Revision: https://reviews.llvm.org/D117844	2022-01-26 14:48:20 -08:00
Matt Arsenault	045be6ff36	AMDGPU/GlobalISel: Fold wave address into mubuf addressing modes	2022-01-26 15:25:26 -05:00
Matt Arsenault	09fc311af7	AMDGPU/GlobalISel: Mostly fix BFI patterns Most importantly, fixes constant bus errors in the 64-bit cases. It's surprising to me these were even passing the selection test using SReg_* sources. Also fixes pattern matching in the 32-bit cases, with simple operands. These patterns aren't working in a few cases, like with mixed SGPR inputs. The patterns aren't looking through the SGPR->VGPR copies like they need to. The vector cases also have some unmerges of build_vector which are obscuring the inputs.	2022-01-26 15:06:50 -05:00
Matt Arsenault	e6564f39c7	AMDGPU: Emit user sgpr count directives in text asm We were emitting these in the object file but not printing them.	2022-01-26 13:51:12 -05:00
Konstantina	aa418b9133	[AMDGPU][SIWholeQuadMode] Use the right VCC register to activate the correct lanes. Reviewed By: critson Differential Revision: https://reviews.llvm.org/D118096	2022-01-26 08:54:39 -08:00
Stanislav Mekhanoshin	4e077c0a0b	[AMDGPU] Remove feature register-banking Since the RegBankReassign pass was removed, this feature is not used for anything. Differential Revision: https://reviews.llvm.org/D118195	2022-01-26 08:39:17 -08:00
Benjamin Kramer	f15014ff54	Revert "Rename llvm::array_lengthof into llvm::size to match std::size from C++17" This reverts commit `ef82063207`. - It conflicts with the existing llvm::size in STLExtras, which will now never be called. - Calling it without llvm:: breaks C++17 compat	2022-01-26 16:55:53 +01:00
serge-sans-paille	ef82063207	Rename llvm::array_lengthof into llvm::size to match std::size from C++17 As a consequence move llvm::array_lengthof from STLExtras.h to STLForwardCompat.h (which is included by STLExtras.h so no build breakage expected).	2022-01-26 16:17:45 +01:00
Nikita Popov	a5e324e3e2	[AMDGPUHSAMetadataStreamer] Do not assume ABI alignment for pointers AMDGPUHSAMetadataStreamer currently assumes that pointer arguments without align attribute have ABI alignment of the pointee type. This is incompatible with opaque pointers, but also plain incorrect: Pointer arguments without explicit alignment have alignment 1. It is the responsibility of the frontend to add correct align annotations. Differential Revision: https://reviews.llvm.org/D118229	2022-01-26 15:45:14 +01:00
alex-t	5157f984ae	[AMDGPU] Enable divergence-driven XNOR selection Currently the not (xor_one_use) pattern is always selected to S_XNOR irrespective of the node divergence. This relies on a further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 ( V_XOR_B32) on those targets which have no V_XNOR. The current change enables the patterns which explicitly select the not (xor_one_use) to the appropriate form. We assume that xor (not) is already turned into the not (xor) by the combiner. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D116270	2022-01-26 15:33:10 +03:00
Sebastian Neubauer	4ed7c6eec9	[AMDGPU] Only match correct type for a16 Addresses are floats when a sampler is present and unsigned integers when no sampler is present. Therefore, only zext instructions, not sext instructions should match. Also match integer constants that can be truncated. Differential Revision: https://reviews.llvm.org/D118043	2022-01-25 14:59:16 +01:00
Nikita Popov	aa97bc116d	[NFC] Remove uses of PointerType::getElementType() Instead use either Type::getPointerElementType() or Type::getNonOpaquePointerElementType(). This is part of D117885, in preparation for deprecating the API.	2022-01-25 09:44:52 +01:00
Stanislav Mekhanoshin	bb1fe36977	[AMDGPU] Make v8i16/v8f16 legal Differential Revision: https://reviews.llvm.org/D117721	2022-01-24 11:51:08 -08:00
Stanislav Mekhanoshin	c27f8fb968	[AMDGPU] Remove cndmask from readsExecAsData Differential Revision: https://reviews.llvm.org/D117909	2022-01-24 11:24:47 -08:00
Sebastian Neubauer	80532ebb50	[AMDGPU][InstCombine] Remove zero image offset Remove the offset parameter if it is zero. Differential Revision: https://reviews.llvm.org/D117876	2022-01-24 18:06:33 +01:00
Matt Arsenault	18aabae8e2	AMDGPU: Fix assertion on fixed stack objects with VGPR->AGPR spills These have negative / out of bounds frame index values and would assert when trying to set the BitVector. Fixed stack objects can't be colored away so ignore them.	2022-01-24 09:45:41 -05:00
Matt Arsenault	99e8e17313	Reapply "Revert "GlobalISel: Add G_ASSERT_ALIGN hint instruction" This reverts commit `a97e20a3a8`.	2022-01-24 09:26:52 -05:00
serge-sans-paille	5f290c090a	Move STLFunctionalExtras out of STLExtras Only using that change in StringRef already decreases the number of preprocessed lines from 7837621 to 7776151 for LLVMSupport Perhaps more interestingly, it shows that many files were relying on the inclusion of StringRef.h to have the declaration from STLExtras.h. This patch tries hard to patch relevant part of llvm-project impacted by this hidden dependency removal. Potential impact: - "llvm/ADT/StringRef.h" no longer includes <memory>, "llvm/ADT/Optional.h" nor "llvm/ADT/STLExtras.h" Related Discourse thread: https://llvm.discourse.group/t/include-what-you-use-include-cleanup/5831	2022-01-24 14:13:21 +01:00
Sebastian Neubauer	f1e36474b9	[AMDGPU][NFC] Fix debug prints Print the instructions instead of pointers.	2022-01-24 13:55:00 +01:00
Sebastian Neubauer	ae2f9c8be8	[AMDGPU] Remove lz and nomip combine from codegen These combines have been moved into the IR combiner in D116042. Differential Revision: https://reviews.llvm.org/D116116	2022-01-21 12:09:08 +01:00
Sebastian Neubauer	603d18033c	[AMDGPU][InstCombine] Remove zero LOD bias If the bias is zero, we can remove it from the image instruction. Also copy other image optimizations (l->lz, mip->nomip) to IR combines. Differential Revision: https://reviews.llvm.org/D116042	2022-01-21 12:09:07 +01:00
Sebastian Neubauer	0530fdbbbb	[AMDGPU] Fix LOD bias in A16 combine As the codegen fix in D111754, the LOD bias needs to be converted to 16 bits. Fix this in the combine. Differential Revision: https://reviews.llvm.org/D116038	2022-01-21 12:09:06 +01:00
Stanislav Mekhanoshin	41ebd19681	[AMDGPU] Do not ignore exec use where exec is read as data Compares, v_cndmask_b32, and v_readfirstlane_b32 use EXEC in a way which modifies the result. This implicit EXEC use shall not be ignored for the purposes of instruction moves. Differential Revision: https://reviews.llvm.org/D117814	2022-01-20 14:05:22 -08:00
Matt Arsenault	064cea9c9a	AMDGPU/GlobalISel: Try to use s_and_b64 in ptrmask selection Avoids a test diff with SDAG.	2022-01-20 12:56:53 -05:00
Matt Arsenault	2e49e0cfde	AMDGPU/GlobalISel: Directly diagnose return value use for FP atomics Emit an error if the return value is used on subtargets that do not support them. Previously we were falling back to the DAG on selection failure, where it would emit this error and then fail again.	2022-01-20 12:46:45 -05:00
Matt Arsenault	be7e938e27	AMDGPU/GlobalISel: Stop handling llvm.amdgcn.buffer.atomic.fadd This code is not structured to handle the legacy buffer intrinsics and was miscompiling them.	2022-01-20 12:12:06 -05:00
Matt Arsenault	8ff3c9e0be	AMDGPU/GlobalISel: Fix selection of gfx90a FP atomics The struct/raw forms for the buffer atomics now work as expected. However, we're incorrectly handling the legacy form (which we probably shouldn't handle at all). We also are not diagnosing the use of the return value on gfx908. These will be addressed separately.	2022-01-20 12:12:06 -05:00
Matt Arsenault	89c447e4e6	AMDGPU: Stop reserving 36-bytes before kernel arguments for amdpal This was inheriting the mesa behavior, and as far as I know nobody is using opencl kernels with amdpal. The isMesaKernel check was irrelevant because this property needs to be held for all functions.	2022-01-20 12:12:05 -05:00
Abinav Puthan Purayil	d8b690409d	[AMDGPU] Set MemoryVT for truncstores in tblgen. GlobalISelEmitter was skipping these patterns when its predicates were checked. This patch should allow us to select d16_hi stores in GlobalISel. Differential Revision: https://reviews.llvm.org/D117762	2022-01-20 19:05:12 +05:30
Yaxun (Sam) Liu	15f54dd5e4	AMDGPU: Account for usage HIP-style dynamic LDS Disable promote alloca to LDS when HIP-style dynamic LDS since the size is unknown at compile time. Patch by: Siu Chi Chan Reviewed by: Matt Arsenault, Yaxun Liu Differential Revision: https://reviews.llvm.org/D117494	2022-01-19 13:05:29 -05:00
Matt Arsenault	052503979e	AMDGPU/GlobalISel: Fix introducing f16 fmed3 for gfx8	2022-01-19 10:43:21 -05:00
Matt Arsenault	adab71711e	AMDGPU/GlobalISel: Fix legalize failure on i65 ctpop	2022-01-19 10:26:28 -05:00
Jay Foad	63eea41de6	[AMDGPU] Simplify SILoadStoreOptimizer::getSubRegIdxs. NFC.	2022-01-19 15:20:35 +00:00
Matt Arsenault	7f26a1027f	AMDGPU/GlobalISel: Introduce pseudo to copy sp in call sequences Arbitrary stack pointers are accessed using MUBUF instructions with the voffset field, which is interpreted as the swizzled address. We want to fold into the MUBUF form to use the SP in the SGPR offset, and previously we were special casing the interpretation of the pointer value if the access memory operand said it was relative to the stack pointer. `690f5b7a01` removed this check, and moved the DAG path to special casing copies from SGPRs. This is not an entirely sound approach, since it's still changing the interpretation of pointer values based on the context. Introduce a new pseudo which corresponds to the wave-to-vector address transform. This way the memory instruction has consistent semantics where the incoming pointer is always interpreted as a vector address, and we're not obligated to optimize into the MUBUF offset-only addressing mode. The DAG should probably have an equivalent pseudo. This should fix some correctness issues, and folding this into addressing modes will be a future optimization patch.	2022-01-19 10:13:31 -05:00
Jim Lin	d6b0734837	[NFC] Use Register instead of unsigned	2022-01-19 20:17:04 +08:00
Piotr Sobczak	8dfb417e67	[AMDGPU] Fix missing waitcnt issue Ignore out of order counters when merging brackets. The fact that there was a pending event in the old state does not guarantee that the waitcnt was generated, so we still need to conservatively re-process the block. The patch fixes a correctness issue where the block was not re-processed and the waitcnt not inserted in consequence. Differential Revision: https://reviews.llvm.org/D117544	2022-01-19 10:54:44 +01:00
Matt Arsenault	82de129ab8	AMDGPU: Remove llvm.amdgcn.alignbit and handle bitcode upgrade to fshr	2022-01-18 14:08:36 -05:00
Matt Arsenault	de1600a1d9	AMDGPU: Avoid enabling kernel workitem IDs with reqd_work_group_size	2022-01-18 13:52:04 -05:00
Christudasan Devadasan	56a5d78893	[AMDGPU] Disable optimizeEndCf at -O0 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D116819	2022-01-18 02:48:52 -05:00
Dmitry Preobrazhensky	c7ca4c6365	[AMDGPU][GFX10][MC] Updated symbolic names of internal HW registers GFX10 no longer support HW_ID. It has been replaced with HW_ID1 and HW_ID2. See bug 52904: https://github.com/llvm/llvm-project/issues/52904 Differential Revision: https://reviews.llvm.org/D117313	2022-01-17 20:29:10 +03:00
Dmitry Preobrazhensky	b5fb7e485e	[AMDGPU][MC] Corrected disassembly of s_waitcnt s_waitcnt with default expcnt, vmcnt and lgkmcnt values was disassembled without arguments. See https://github.com/llvm/llvm-project/issues/52716 Differential Revision: https://reviews.llvm.org/D117305	2022-01-17 20:22:03 +03:00
Matt Arsenault	dc2457c8cf	AMDGPU: Fix crashing on calls to C functions from graphics contexts If we had one of the shader calling conventions calling a default calling convention callee, this would crash when the caller did not have anything to pass to the workitem ID. This is illegal, but we still need to produce something sensible. llvm-reduce likes to replace calls to intrinsics with calls to null or undef, so this does appear and is helpful to avoid hard erroring. Pass undef in this case, as already happened for the other implicit arguments. It might make sense to define the behavior here and pass null for the pointers, and -1 for the workitem ID. We do have extra bits in the workitem ID, so this wouldn't conflict with a valid value.	2022-01-17 10:13:25 -05:00
Matt Arsenault	9392b40d4b	AMDGPU/GlobalISel: Fix selection of constant 32-bit addrspace loads Unfortunately the selection patterns still rely on the address space from the memory operand instead of using the pointer type. Add this address space to the list of cases supported by global-like loads. Alternatively we would have to adjust the address space of the memory operand to deviate from the underlying IR value, which looks ugly and is more work in the legalizer. This doesn't come up in the DAG path because it uses a different selection strategy where the cast is inserted during the addressing mode matching.	2022-01-17 10:06:33 -05:00
Matt Arsenault	0b1140e883	AMDGPU: Correct getMaxNumSGPR treatment of flat_scratch This was approximating the entry point logic for flat_scratch_init, which is not really the point. We need to account for whether we need to reserve the SGPR pair used for flat_scratch, not whether we needed the initialization kernel argument. If this was an arbitrary function, we would end up over-reporting the number of potentially free SGPRs. The logic for architected flat scratch also only applies to the initialization in the kernel, not the reserved registers at the end. Avoids compile failures in a future patch from allocating more SGPRs than the subtarget supports.	2022-01-17 10:04:42 -05:00
Matt Arsenault	c3a74183a5	AMDGPU/GlobalISel: Fix legalization failure for s65 shifts This was trying to clamp s65 down to s32, which wasn't handled so we need to promote all the way to s128 first. Having to order the legalization rules in just the right way is rather dissatisfying, but I'm not sure how smart the legalizer should be in trying to interpret the rules.	2022-01-17 10:04:41 -05:00
Matt Arsenault	e09f98a69a	AMDGPU: Fix LiveVariables error after optimizing VGPR ranges This was not removing the block from the live set depending on the specific depth first visit order. Fixes a verifier error in the OpenCL conformance tests.	2022-01-17 09:38:35 -05:00
Craig Topper	454256ef4f	[AMDGPU] Correct the known bits calculation for MUL_I24. I'm not entirely sure, but based on how ComputeNumSignBits handles ISD::MUL, I believe this code was miscounting the number of sign bits. As an example of an incorrect result let's say that countMinSignBits returned 1 for the left hand side and 24 for the right hand side. LHSValBits would be 23 and RHSValBits would be 0 and the sum would be 23. This would cause the code to set 9 high bits as zero/one. Now suppose the real values for the left side is 0x800000 and the right hand side is 0xffffff. The product is 0x00800000 which has 8 sign bits not 9. The number of valid bits for the left and right operands is now the number of non-sign bits + 1. If the sum of the valid bits of the left and right sides exceeds 32, then the result may overflow and we can't say anything about the sign of the result. If the sum is 32 or less then it won't overflow and we know the result has at least 1 sign bit. For the previous example, the code will now calculate the left side valid bits as 24 and the right side as 1. The sum will be 25 and the sign bits will be 32 - 25 + 1 which is 8, the correct value. Differential Revision: https://reviews.llvm.org/D116469	2022-01-14 08:54:54 -08:00
Stanislav Mekhanoshin	fc6af7e188	[AMDGPU] Fix error handling in asm constraint syntax I believe this is unexploitable because in either case the result will be 'couldn't allocate register for constraint' error message, but error code checking is clearly wrong. Differential Revision: https://reviews.llvm.org/D117189	2022-01-13 09:33:50 -08:00
Matt Arsenault	59994c25f9	AMDGPU: Select workitem ID intrinsics to 0 with req_work_group_size Shockingly we weren't doing this already. We should probably have this be done earlier in the IR too, but it's still helpful to have the lowering guarantee it so that we can modify the ABI implicit inputs based on it.	2022-01-13 12:08:18 -05:00
Matt Arsenault	a6f49423c1	AMDGPU: Optimize outgoing workitem ID based on reqd_work_group_size If we know we aren't using a component from the kernel, we can save a few bit packing instructions. We're still enabling the VGPR input to the kernel though.	2022-01-13 12:08:18 -05:00
Petar Avramovic	235886e174	AMDGPU/GlobalISel: Fix custom legalization for fceil	2022-01-13 14:29:30 +01:00
Matt Arsenault	1adeebc2cf	AMDGPU: Fix assert on function argument as loop condition	2022-01-12 19:44:26 -05:00
Stanislav Mekhanoshin	d043822daa	[AMDGPU] Fixed physreg asm constraint parsing We are always failing parsing of the physreg constraint because we do not drop trailing brace, thus getAsInteger() returns a non-empty string and we delegate reparsing to the TargetLowering. In addition it did not parse register tuples. Fixed which has allowed to remove w/a in two places we call it. Differential Revision: https://reviews.llvm.org/D117055	2022-01-12 16:37:08 -08:00
Matt Arsenault	4515c24bbc	AMDGPU/GlobalISel: Fix assertions on legalize queries with huge align For some reason we pass around the alignment in bits as uint64_t. Two places were truncating it to unsigned, and losing bits in extreme cases.	2022-01-12 18:21:44 -05:00
Matt Arsenault	bd2c01e937	AMDGPU/GlobalISel: Do not use terminator copy before waterfall loops Stop using the _term variants of the mov to save the initial exec value before the waterfall loop. This cannot be glued to the bottom of the block because we may need to spill the result register. Just use a regular mov, like the loops produced on the DAG path. Fixes some verification errors with regalloc fast.	2022-01-12 13:44:05 -05:00
Austin Kerbow	8470bf2b08	[AMDGPU] Do not reserve any VGPR for SGPR spills After the split register allocation changes in `eebe841a47` it is no longer necessary to reserve a VGPR before RA. This can also create bugs when IPRA is enabled since we cannot predict that a called function may not reserve any register if it does not have any SGPR spills. If that happens those functions may override reserved registers that are normally callee saved. Added a test to show this. Fixes: SWDEV-309900 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D115551	2022-01-11 22:14:59 -08:00
David Salinas	c0581f7df6	Revert D109159 : Revert "[amdgpu] Enable selection of `s_cselect_b64`." This reverts commit `640beb38e7`. That commit caused performance degradation in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort). Reverting until we have a better solution to s_cselect_b64 codegen cleanup Change-Id: Ifc167b3c2dae7a65920676f22a97ba76485f3456 Reviewed By: kzhuravl Differential Revision: https://reviews.llvm.org/D116686 Change-Id: I1abf49b74a7e2ba0e0205f747a4154a468b9d7f2	2022-01-11 21:14:09 +00:00
Matt Arsenault	8e682086a0	AMDGPU/GlobalISel: Explicitly track d16 for image legalization We were trying to guess at the original IR type for image intrinsics after legalization to figure out if they were d16, but this didn't work. Explicitly track if this is a d16 operation or not in the opcode, as is done for the buffer intrinsics. The OpenCL library is using f32 image writes with a dmask of 15 for some reason, and this was incorrectly switching them to use d16. Fixes image failures in the OpenCL conformance test. The equivalent dmask for loads doesn't even select in either selector.	2022-01-10 14:25:14 -05:00
Matt Arsenault	0ba4e4b500	GlobalISel: Pass DebugLoc to getFunctionLiveInPhysReg Fixes crash in assertion about dropping debug info.	2022-01-10 13:50:52 -05:00
Matt Arsenault	68468bbe15	AMDGPU: Avoid null check during addrspacecast lowering If we know the source is a valid object, we do not need to insert a null check. This misses a lot of opportunities from metadata/attributes not tracked in codegen.	2022-01-10 13:27:39 -05:00
Kazu Hirata	720c48b58e	[AMDGPU] Fix an unused variable warning (NFC) This patch fixes: llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp:2245:12: error: unused variable 'Ins' [-Werror,-Wunused-variable]	2022-01-10 08:49:46 -08:00
Petar Avramovic	d9d2516aaf	AMDGPU/GlobalISel: Rework legalization for extract/insert vector elt Use G_MERGE_VALUES and G_UNMERGE_VALUES on vector elements instead of G_EXTRACT and G_INSERT when doing custom legalization for G_EXTRACT_VECTOR_ELT and G_INSERT_VECTOR_ELT. With this approach legalization artifact combiner gets direct access to all vector elements. Differential Revision: https://reviews.llvm.org/D116115	2022-01-10 13:15:20 +01:00
Jay Foad	ff971873b3	[GlobalISel] Fix legality checks for G_UBFX combines 1. Fix CombinerHelper::matchBitfieldExtractFromAnd to check legality with the correct types for the G_UBFX that it builds. 2. Fix AMDGPUTargetLowering::isConstantUnsignedBitfieldExtractLegal to match the legality rules: result and first operand can be s32 or s64 but the "shift amount" operands are always s32. 3. Add AMDGPU tests where the post-legalizer combiner would create illegal MIR without the above fixes. Differential Revision: https://reviews.llvm.org/D116802	2022-01-08 09:20:44 +00:00
alex-t	5d46263a5a	[AMDGPU] Enable divergence-driven 'ctpop' selection This change adds the patterns and divergence predicates for the ctpop (bitcount) nodes to make them selected according to the divergence. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D116284	2022-01-07 16:07:38 +03:00
Jay Foad	3f3fe4a5cf	[GlobalISel] Fix typo Extact to Extract in function name. NFC.	2022-01-07 11:13:35 +00:00
Kazu Hirata	2aed08131d	[llvm] Use true/false instead of 1/0 (NFC) Identified with modernize-use-bool-literals.	2022-01-07 00:39:14 -08:00
Qiu Chaofan	c2cc70e4f5	[NFC] Fix endif comments to match with include guard	2022-01-07 15:52:59 +08:00
Kazu Hirata	f3a344d212	[Target] Remove redundant member initialization (NFC) Identified with readability-redundant-member-init.	2022-01-06 22:01:44 -08:00
Matt Arsenault	4fc18de335	AMDGPU: Clear NoPHIs property in SIOptimizeVGPRLiveRanges Fixes verifier error when writing MIR tests that didn't have phis to begin with.	2022-01-06 11:01:51 -05:00
Christudasan Devadasan	50b5b367c1	[AMDGPU] Iterate LoweredEndCf in the reverse order The function that optimally inserts the exec mask restore operations by combining the blocks currently visits the lowered END_CF pseudos in the forward direction as it iterates the setvector in the order the entries are inserted in it. Due to the absence of BranchFolding at -O0, the irregularly placed BBs cause the forward traversal to incorrectly place two unconditional branches in certain BBs while combining them, especially when an intervening block later gets optimized away in subsequent iterations. It is avoided by reverse iterating the setvector. The blocks at the bottom of a function will get optimized first before processing those at the top. Fixes: SWDEV-315215 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D116273	2022-01-06 00:27:11 -05:00
Nico Weber	085f078307	Revert "Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`."" This reverts commit `859ebca744`. The change contained many unrelated changes and e.g. restored unit test failes for the old lld port.	2022-01-05 13:10:25 -05:00
David Salinas	859ebca744	Revert D109159 "[amdgpu] Enable selection of `s_cselect_b64`." This reverts commit `640beb38e7`. That commit caused performance degradation in Quicksilver test QS:sGPU and a functional test failure in (rocPRIM rocprim.device_segmented_radix_sort). Reverting until we have a better solution to s_cselect_b64 codegen cleanup Change-Id: Ibf8e397df94001f248fba609f072088a46abae08 Reviewed By: kzhuravl Differential Revision: https://reviews.llvm.org/D115960 Change-Id: Id169459ce4dfffa857d5645a0af50b0063ce1105	2022-01-05 17:57:32 +00:00
Pravin Jagtap	2899e8de67	[AMDGPU] Test commit. NFC. Differential Revision: https://reviews.llvm.org/D116641	2022-01-04 23:18:13 -05:00
serge-sans-paille	9290ccc3c1	Introduce the AttributeMask class This class is solely used as a lightweight and clean way to build a set of attributes to be removed from an AttrBuilder. Previously AttrBuilder was used both for building and removing, which introduced odd situation like creation of Attribute with dummy value because the only relevant part was the attribute kind. Differential Revision: https://reviews.llvm.org/D116110	2022-01-04 15:37:46 +01:00
Craig Topper	cbcbbd6ac8	[ValueTracking][SelectionDAG] Rename ComputeMinSignedBits->ComputeMaxSignificantBits. NFC This function returns an upper bound on the number of bits needed to represent the signed value. Use "Max" to match similar functions in KnownBits like countMaxActiveBits. Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old name around to keep this patch size down. Will do a bulk rename as follow up. Rename KnownBits::countMaxSignedBits->countMaxSignificantBits. Reviewed By: lebedev.ri, RKSimon, spatel Differential Revision: https://reviews.llvm.org/D116522	2022-01-03 11:33:30 -08:00
Kazu Hirata	e5947760c2	Revert "[llvm] Remove redundant member initialization (NFC)" This reverts commit `fd4808887e`. This patch causes gcc to issue a lot of warnings like: warning: base class ‘class llvm::MCParsedAsmOperand’ should be explicitly initialized in the copy constructor [-Wextra]	2022-01-03 11:28:47 -08:00
Craig Topper	361216f3c4	[AMDGPU] Use ComputeMinSignedBits and KnownBits::countMaxActiveBits to simplify some code. NFC Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D116516	2022-01-03 10:09:51 -08:00
Kazu Hirata	41bfac6aed	[Target] Remove unused forward declarations (NFC)	2022-01-02 10:20:15 -08:00
Kazu Hirata	fd4808887e	[llvm] Remove redundant member initialization (NFC) Identified with readability-redundant-member-init.	2022-01-01 16:18:18 -08:00
Kazu Hirata	bc360fd83a	[AMDGPU] Remove unused declarations fold_exp* and fold_log* (NFC)	2021-12-31 16:50:18 -08:00
Kazu Hirata	5c4b9ea4a7	[AMDGPU] Remove replaceWithNative (NFC) The function was introduced without any use on Aug 11, 2017 in commit `7f37794ebd`.	2021-12-31 16:43:06 -08:00
Kazu Hirata	5a667c0e74	[llvm] Use nullptr instead of 0 (NFC) Identified with modernize-use-nullptr.	2021-12-28 08:52:25 -08:00
Kazu Hirata	0542d15211	Remove redundant string initialization (NFC) Identified with readability-redundant-string-init.	2021-12-26 09:39:26 -08:00
Kazu Hirata	2d303e6781	Remove redundant return and continue statements (NFC) Identified with readability-redundant-control-flow.	2021-12-24 23:17:54 -08:00
alex-t	8020458c5d	[AMDGPU] Changing S_AND_B32 to V_AND_B32_e64 in the divergent 'trunc' to i1 pattern In 'trunc' i16/32/64 to i1 pattern the 'and $src, 1' node supply operand to 'setcc'. The latter is selected to S_CMP_EQ/V_CMP_EQ dependent on the divergence. In case the 'and' is scalar and 'setcc' is divergent, we need VGPR to SGPR copy to adjust input operand for V_CMP_EQ. This patch changes the S_AND_B32 to V_AND_B32_e64 in the 'trunc to i1' divergent patterns. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D116241	2021-12-24 18:24:49 +03:00
Brendon Cahoon	d45a247998	[AMDGPU] Don't remove VGPR to AGPR dead spills from frame info Removing dead frame indices for VGPR to AGPR spills is incorrect when the frame index is shared by multiple objects, which may occur due to stack slot coloring. The problem is that subsequent code that processes the other object will assert because the stack frame index is marked dead. Removing dead frame indices is needed prior to stack slot coloring, which is what happens with SGPR to VGPR spills. These spills are lowered prior to stack slot coloring, but the VGPR to AGPR spills are processed afterwards during the Prolog/Epilog Inserter pass. This patch marks the VGPR to AGPR spill slot as dead if the slot is not used by another object. Differential Revision: https://reviews.llvm.org/D115996	2021-12-23 11:09:19 -06:00
Petar Avramovic	29f88b93fd	[GlobalISel] Rework more/fewer elements for vectors Artifact combiner is not able to access individual elements after using LCMTy style merge/unmerge, extract and insert to change vector number of elements (pad with undef or split to sub-vector instructions). Use unmerge to individual elements instead and then merge elements into requested types. Change argument lowering for vectors and moreElementsVector to use buildPadVectorWithUndefElements and buildDeleteTrailingVectorElements. FewerElementsVector had a few helpers that had different behavior, introduce new helper for most of the opcodes. FewerElementsVector helper is more flexible since it can create leftover instruction smaller than requested type (useful in case target wants to avoid pad with undef and use fewer registers). If target does not want leftover of different type it should call more elements first. Some helpers were performing more elements first to have split without leftover. Opcodes that used this helper use clampMaxNumElementsStrict (does more elements first) in LegalizerInfo to avoid test changes. Fixes failures caused by failing to combine artifacts created during more/fewer elements vector. Differential Revision: https://reviews.llvm.org/D114198	2021-12-23 14:30:02 +01:00
Jay Foad	74ce7ff5dc	[AMDGPU] Remove a TODO that was done by D98081	2021-12-23 10:19:37 +00:00
alex-t	e4103c91f8	[AMDGPU] Select build_vector DAG nodes according to the divergence This change enables divergence-driven instruction selection for the build_vector DAG nodes. It also enables packed i16 instructions for GFX9. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D116187	2021-12-23 02:27:12 +03:00
Ron Lieberman	09b53296cf	Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range" This reverts commit `9075009d1f`. Failed amdgpu runtime buildbot # 3514	2021-12-22 11:39:28 -05:00
RamNalamothu	9075009d1f	[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve its value when we have calls so that it gets saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame and the above approach makes it difficult to track the return address when the CFI information is emitted during the frame lowering, due to the involvement of understanding the control flow. This patch moves the return address ABI registers s[30:31] into callee saved registers range and stops adding live-in for return address registers, so that the CFI machinery will know where the return address resides when CSR save/restore happen during the frame lowering. And doing the above poses an issue that now the return instruction uses undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to the `S_SETPC_B64_return` during the `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D114652	2021-12-22 20:51:12 +05:30
Kazu Hirata	9db0e21660	[llvm] Use depth_first (NFC)	2021-12-21 22:28:48 -08:00
Matt Arsenault	c222972442	AMDGPU/GlobalISel: Stop using NarrowScalar/FewerElements for unaligned splitting These actions should only be used for adjusting the register types (and the memory type as needed to satisfy the register type). Unaligned accesses should be split as a type of lowering. This has the effect of improving the code in many cases since now we produce zextloads instead of separate loads with ands. The load/store legality rules still seem far more complicated than necessary though.	2021-12-20 18:07:11 -05:00
alex-t	19727e31fb	[AMDGPU] Enable divergence predicates for ctlz/cttz ctlz/cttz get lowered to the set of target opcodes This change enables the ISel to select SALU or VALU form according to the SDNode divergence. CTLZ - S_FLBIT_I32_B32 if uniform and V_FFBH_U32_e64 if divergent CTTZ - S_FF1_I32_B32 if uniform and V_FFBL_B32_e64 if divergent Also @llvm.amdgcn.sffbh.i32 gets lowered to S_FLBIT_I32 if uniform and V_FFBH_I32_e64 if divergent NOTE: 64bit versions S_FF1_I32_B64 and S_FLBIT_I32_B64 are not currently supported by the DAG ISel. ctlz/cttz with i64 input are split into two 32bit instructions. Nevertheless, they already have the patterns and were equipped with the divergence predicates to make sure they will be selected correctly when enabled. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D116044	2021-12-20 20:53:48 +03:00
Jay Foad	8b997adc64	[AMDGPU] Remove dead code after D109052	2021-12-20 14:20:02 +00:00
alex-t	98d09705e1	[AMDGPU] Re-enabling divergence predicates for min/max This patch enables divergence predicates for min/max nodes. It makes ISD::MIN/MAX selected to S_MIN_I(U)32/S_MAX_I(U)32 or V_MIN_I(U)32_e64/V_MAX_I(U)32_e64 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D115954	2021-12-20 16:10:55 +03:00
alex-t	1448aa9dbd	[AMDGPU] Expand not pattern according to the XOR node divergence The "not" is defined as XOR $src -1. We need to transform this pattern to either S_NOT_B32 or V_NOT_B32_e32 dependent on the "xor" node divergence. Reviewed By: rampitec, foad Differential Revision: https://reviews.llvm.org/D115884	2021-12-20 14:41:38 +03:00
Jakub Kuderski	1e93f3895f	[AMDGPU] Use enum_seq to iterate over InstCounterTypes. NFC. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D115900	2021-12-18 16:07:28 -05:00
Jakub Kuderski	d9ae852fcc	[AMDGPU] Fix data race in SIInsertWaitcnts The race condition happened when two pass managers ran on two different modules but modified/read the global variables. To address this, I considered using singletons and freestanding functions to allow getting/setting `HardwareLimits` and `RegisterEncoding`, or making it local to the pass. I chose the latter and made it a member of `WaitcntsBrackets`, to minimize the amount of global state. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D115896	2021-12-18 16:03:09 -05:00
Kazu Hirata	f78c1b07cb	[Target] Use range-based for loops (NFC)	2021-12-17 10:11:08 -08:00
rkorsa	c680fb69d6	[AMDGPU] Fixes in ISelDAG path and GlobalISel path for 'bias' operand with A16 bit on The LOD bias operand is of type 'half' when the A16-bit is ON for MIMG instructions. 'bias' is only 16-bit but occupies 32-bits with upper 16-bits containing junk. The patch fixes both the paths (ISelDAG and GlobalISel) for proper encoding of LOD bias operand. Differential Revision: https://reviews.llvm.org/D111754	2021-12-17 16:11:51 +05:30
Ron Lieberman	8a85be807b	Revert "AMDGPU: Remove AMDGPUFixFunctionBitcasts pass" Offload abort in Nekbone This reverts commit `2b48761575`.	2021-12-16 21:21:32 +00:00
Matt Arsenault	4132dc917e	AMDGPU: Return result from indicatePessimisticFixpoint I don't think this fixes anything.	2021-12-16 11:26:30 -05:00
Matt Arsenault	f0cc43cc91	AMDGPU: Use v_accvgpr_mov_b32 when copying AGPR tuples on gfx90a This is an optimization, but also fixes a compile failure when no free VGPRs are available. The problem still exists for gfx908 where a scratch register is still required. This also still exists for the SGPR to AGPR case.	2021-12-15 18:20:49 -05:00
Matt Arsenault	45f16eabd6	AMDGPU: Combine is.shared/is.private of null/undef	2021-12-15 18:20:49 -05:00
Matt Arsenault	2b48761575	AMDGPU: Remove AMDGPUFixFunctionBitcasts pass This was a workaround for not supporting indirect calls when instcombine didn't eliminate constant expression casts of the callee at -O0. Indirect calls are supposed to work now, so drop the hack.	2021-12-15 18:20:48 -05:00
Joe Nash	da9c6ea007	[AMDGPU] Extract helper function in AsmParser. NFC NFC refactor to extract useful helper function isRegOrInline. Reviewed By: rampitec, dp Differential Revision: https://reviews.llvm.org/D115753 Change-Id: Ief52db9a62615c053fb5f429248657b97cb41453	2021-12-15 09:53:23 -05:00
Jay Foad	54fc9eb9b3	[AMDGPU] Use v_fma_f16 on GFX10 Teach convertToThreeAddress to use the V_FMA_F16_gfx9 pseudo (i.e. the standard instruction in GFX9 onwards) instead of V_FMA_F16 (the legacy pseudo for GFX8 compatibility, which is no longer supported in GFX10). This follows the example of macToMad in SIFoldOperands. Differential Revision: https://reviews.llvm.org/D115731	2021-12-15 13:14:48 +00:00
Jay Foad	4db7422771	[AMDGPU] Improve zeroesHigh16BitsOfDest for GFX9 legacy opcodes Pseudos like V_MAD_U16 and V_FMA_F16 map down to what GFX9 calls v_mad_legacy_u16 and v_fma_legacy_f16, which are documented to have the same zeroing behaviour as on GFX8. Differential Revision: https://reviews.llvm.org/D115729	2021-12-15 13:14:48 +00:00
Jay Foad	6a7db0dc8e	[AMDGPU] Skip some work on subtargets without scalar stores. NFC.	2021-12-15 12:46:33 +00:00
Jon Chesterfield	624f12d34f	[amdgpu] Drop lowering of LDS used by global variables Approximately revert D103431. LDS variables are allocated at kernel launch and deallocated at kernel exit. The address is therefore kernel execution dependent. Global variables are initialized by values written to .data, which can't be done for a LDS variable as there is no kernel running, or by a global constructor. Initializing the global to the address of some LDS allocated by a global constructor is possible but indistinguishable from undef. Assigning the address of a LDS variable to a global should be a sema error. It isn't for openmp, haven't checked other languages. Failing that it could be set to undef, perhaps in this pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D115413	2021-12-14 21:59:26 +00:00
Nikita Popov	220815a91a	[AMDGPUPerfHintAnalysis] Avoid getPointerElementType() Extract the load/store type from the instruction rather than fetching it from the pointer element type.	2021-12-13 16:48:21 +01:00
Neubauer, Sebastian	26924b57e8	[AMDGPU] Ignore special ABI registers for graphics Fixed ABI arguments are compute specific and should not be added to graphics shaders or functions, so do not try to add them. Differential Revision: https://reviews.llvm.org/D115344	2021-12-13 16:44:37 +01:00
Jay Foad	16de2c09dd	[AMDGPU] SIShrinkInstructions: sink code to where it's used. NFC.	2021-12-13 14:46:40 +00:00
Jay Foad	63681527ee	[AMDGPU] SIShrinkInstructions: remove redundant check canShrink already calls hasVALU32BitEncoding, so there is no need to call it again here.	2021-12-13 14:46:40 +00:00
Jay Foad	61f8af2657	[AMDGPU] Remove a FIXME implemented in D11061	2021-12-13 14:46:40 +00:00
Daniil Fukalov	e5c64b45be	[CostModel][AMDGPU] Fix intrinsics costs estimations. 1. Fixed costs inconsistency for llvm.fma.vXf16 intrinsics. 2. Added tests for llvm.sadd.sat, llvm.ssub.sat, llvm.uadd.sat, llvm.usub.sat intrinsics since they have special processing in cost model. 3. Minor intrinsics' costs tests update and refinement. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D115385	2021-12-13 17:17:34 +03:00
Jon Chesterfield	24b28db8cc	[amdgpu] Increase alignment of all LDS variables Currently the superalign option only increases the alignment of variables that are moved into the module.lds block. Change that to all LDS variables. Also only increase the alignment once, instead of once per function. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D115488	2021-12-12 19:30:32 +00:00
Kazu Hirata	67aeae0138	[llvm] Use range-based for loops (NFC)	2021-12-11 22:34:07 -08:00
Kazu Hirata	d395befa65	[llvm] Use range-based for loops (NFC)	2021-12-11 11:29:12 -08:00
Matt Arsenault	6bcf1f9181	AMDGPU: Indicate pessimistic fixpoint for entry functions There aren't going to be any callers for these, so avoid running through the machinery to look at the callers.	2021-12-11 11:42:34 -05:00
Matt Arsenault	06b90175e7	AMDGPU: Remove fixed function ABI option	2021-12-10 19:41:19 -05:00
Jon Chesterfield	86caf517bf	Revert "[amdgpu][nfc] Delete dead code in LowerModuleLDS" This reverts commit `7b9ab06d10`. Said code is better removed as part of a larger change.	2021-12-11 00:31:51 +00:00
Christudasan Devadasan	cf58b9ce98	[AMDGPU] Add AV class spill pseudo instructions While enabling vector superclasses with D109301, the AV spills are converted into VGPR spills by introducing appropriate copies. The whole thing ended up adding two instructions per spill (a copy + vgpr spill pseudo) and caused an incorrect liverange update during inline spiller. This patch adds the pseudo instructions for all AV spills from 32b to 1024b and handles them in the way all other spills are lowered. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D115439	2021-12-10 03:10:34 -05:00
Jon Chesterfield	7b9ab06d10	[amdgpu][nfc] Delete dead code in LowerModuleLDS	2021-12-10 00:43:46 +00:00
Jon Chesterfield	04b2f6ea8a	[amdgpu][nfc] Drop dead PtrSet, fix a comment	2021-12-09 17:05:20 +00:00
Arthur Eubanks	1172712f46	[NFC] Replace some deprecated getAlignment() calls with getAlign() Reviewed By: gchatelet Differential Revision: https://reviews.llvm.org/D115370	2021-12-09 08:43:19 -08:00
Matt Arsenault	017ef78549	AMDGPU: Mark scc defs dead in SGPR to VMEM path for no free SGPRs This introduces verifier errors into this broken situation which we do not handle correctly, which is better than being silently miscompiled. For the emergency stack slot, the scavenger likes to move the restore instruction as late as possible, which ends up separating the SCC def from the conditional branch.	2021-12-08 18:40:49 -05:00
Matt Arsenault	a92cf7cea5	AMDGPU: Mark SCC def as dead when expanding frame indexes This improves liveness queries in a future change.	2021-12-08 18:40:39 -05:00
Jon Chesterfield	f0e3b39a5d	[amdgpu][nfc] Move non-shared code out of LDSUtils	2021-12-08 16:23:03 +00:00
Jay Foad	077a14e00b	[AMDGPU] Mark time intrinsics as nomem, hassideeffects Adding IntrHasSideEffects to @llvm.amdgcn.s.memtime and @llvm.amdgcn.s.memrealtime means that we can stop pretending they read and write memory, and similarly for the corresponding pseudo instructions. This should stop these intrinsics from being rescheduled past all other instructions, even ones which don't load or store. See also https://reviews.llvm.org/D58635. Differential Revision: https://reviews.llvm.org/D115227	2021-12-07 16:24:06 +00:00
Jay Foad	47d15170f6	[AMDGPU] Remove redundant mayLoad = 0, mayStore = 0. NFC. Almost everything in this file is mayLoad = 0, mayStore = 0 by default anyway.	2021-12-07 09:55:05 +00:00
Jack Andersen	f108c7f59d	[GlobalISel] Allow DBG_VALUE to use undefined vregs before LiveDebugValues. Expanding on D109750. Since `DBG_VALUE` instructions have final register validity determined in `LDVImpl::handleDebugValue`, there is no apparent reason to immediately prune unused register operands as their defs are erased. Consequently, this renders `MachineInstr::eraseFromParentAndMarkDBGValuesForRemoval` moot; gaining a substantial performance improvement. The only necessary changes involve making relevant passes consider invalid DBG_VALUE vregs uses as valid. Reviewed By: MatzeB Differential Revision: https://reviews.llvm.org/D112852	2021-12-05 15:55:59 -05:00
Michael Liao	53fc971a4b	Fix `-Wunused-variable` warning. NFC.	2021-12-04 23:34:55 -05:00
Matt Arsenault	729bf9b26b	AMDGPU: Enable fixed function ABI by default Code using indirect calls is broken without this, and there isn't really much value in supporting the old attempt to vary the argument placement based on uses. This resulted in more argument shuffling code anyway. Also have the option stop implying all inputs need to be passed. This will now rely on the amdgpu-no-* attributes to avoid passing unnecessary values.	2021-12-04 10:49:18 -05:00
Matt Arsenault	2959e082e1	AMDGPU: Assume all amdhsa kernarg passed implicit arguments by default Previously we would require adding an attribute to kernels to enable the inputs passed in the kernarg segment, accessed by llvm.amdgcn.implicitarg.ptr. This violates the principle of being correct by default. Some OpenMP testcases were broken recently since it wasn't correctly setting this attribute, and no known frontends are setting this to anything other than the maximum. Most of the test changes are from load widening of argument loads since there are now more implied dereferenceable bytes.	2021-12-04 10:38:25 -05:00
Matt Arsenault	ae0ba7dedd	AMDGPU: Optimize out implicit kernarg argument allocation if unused We already annotate whether llvm.amdgcn.implicitarg.ptr is known to be unused. Start using it to avoid allocating the implicit arguments if unneeded.	2021-12-04 10:38:25 -05:00
Jay Foad	2774bad112	[AMDGPU] Change llvm.amdgcn.image.bvh.intersect.ray to take vec3 args The ray_origin, ray_dir and ray_inv_dir arguments should all be vec3 to match how the hardware instruction works. Don't change the API of the corresponding OpenCL builtins. Differential Revision: https://reviews.llvm.org/D115032	2021-12-04 10:32:11 +00:00
Stanislav Mekhanoshin	3b17cb1506	[AMDGPU] Kill def when folding immediate in two-addr pass Two-address pass works right before RA and if an immediate was folded into an instruction there is nothing to remove the dead def. We end up with something like: v_mov_b32_e32 v14, 0xc1700000 v_mov_b32_e32 v14, 0x41200000 v_fmaak_f32 v51, s67, v19, 0xc1700000 v_fmaak_f32 v38, v51, v19, 0x4120000 The patch kills the dead move instruction right in the folding. Differential Revision: https://reviews.llvm.org/D114999	2021-12-03 09:37:49 -08:00
Petar Avramovic	0b34ffe4a6	AMDGPU/GlobalISel: Add clamp combine Add clamp combine. Source is fminnum(fmaxnum(Val, 0.0), 1.0) or fmaxnum(fminnum(Val, 1.0), 0.0) or fmed3 intrinsic with 0.0 and 1.0 as two out of three operands. Differential Revision: https://reviews.llvm.org/D90052	2021-12-03 12:49:39 +01:00
Petar Avramovic	ec54867d75	AMDGPU/GlobalISel: Add floating point med3 combine Add floating point version of med3 combine. Source is fminnum(fmaxnum(Val, K0), K1) or fmaxnum(fminnum(Val, K1), K0) where K0 and K1 are constants and K0 <= K1. Differential Revision: https://reviews.llvm.org/D90051	2021-12-03 12:49:39 +01:00
Petar Avramovic	ab01f4d264	AMDGPU/GlobalISel: Do not fcanonicalize const splat padded with undef Recognize constant splat padded with undef in isCanonicalized. Fcanonicalize will be removed by RemoveFcanonicalize in post-legalizer combiner. We will treat undef as value that will result in a splat in clamp combine after regbankselect. Differential Revision: https://reviews.llvm.org/D104408	2021-12-03 12:49:38 +01:00
Jessica Clarke	3ee56eed2f	[AMDGPU][NFC] Alter ComplexPattern types to be consistent with their uses When used as a non-leaf node, TableGen does not currently use the type of a ComplexPattern for type inference, which also means it does not check it doesn't conflict with the use. This differs from when used as a leaf value, where the type is used for inference. Fixing that discrepancy is something I intend to upstream as a subsequent review. AMDGPU currently has several ComplexPatterns that are used in contexts where they're expected to be an iPTR, and where using an iPTR instead of a fixed-width integer type matters. With my locally-patched TableGen, none of these mismatches result in type contradictions, but do change the patterns and cause various failures to select. These changes to the ComplexPatterns' types reflect how they are actually used, result in bit-for-bit identical TableGen output (without my local TableGen patch), and ensure that with improved type inference AMDGPU's backend will continue to work. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D109032	2021-12-03 07:04:59 +00:00
Daniil Fukalov	ab05ab59a7	[CostModel][AMDGPU] Fix instructions costs estimation for vector types. 1. Fixed vector instructions costs estimations inconsistency - removed different logic for "not simple types" since it biases costs for these types. 2. Fixed legalization penalty for vectors too big for the target: changed from overwrite default legalization cost value estimation to added penalty. 3. Fixed a few typos in tests. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114893	2021-12-03 03:08:08 +03:00
Matt Arsenault	0eebe2e36c	AMDGPU: Sanitized functions require implicit arguments Do not infer no-amdgpu-implicitarg-ptr for sanitized functions. If a function is explicitly marked amdgpu-no-implicitarg-ptr and sanitize_address, infer that it is required.	2021-12-02 17:55:43 -05:00
Kazu Hirata	262dd1e42d	[llvm] Use range-based for loops (NFC)	2021-12-02 09:27:47 -08:00
David Stuttard	0e8590f065	[AMDGPU] Add support for in-order bvh in waitcnt pass bvh should be handled separately from vmem and vmem with sampler instructions for waitcnt handling. Differential Revision: https://reviews.llvm.org/D114794	2021-12-02 14:26:11 +00:00
Austin Kerbow	da067ed569	[AMDGPU] Set most sched model resource's BufferSize to one Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retrieved from memory. Since we don't actually track waitcnt here, this should do a better job at modeling their behavior. Practically, this means that the scheduler will trigger the 'STALL' heuristic more often. This type of change needs to be evaluated experimentally. Preliminary results are positive. Fixes: SWDEV-282962 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114777	2021-12-01 22:31:28 -08:00
Christudasan Devadasan	399b7de0ea	[AMDGPU] Add a regclass flag for scalar registers Along with vector RC flags, this scalar flag will make various regclass queries like `isVGPR` more accurate. Regclasses other than vectors are currently set with the new flag even though certain unallocatable classes aren't truly scalars. It would be ok as long as they remain unallocatable. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D110053	2021-12-01 23:31:07 -05:00
Petar Avramovic	641906da8d	AMDGPU/GlobalISel: Fix constant bus restriction errors for med3 Detected on targets older than gfx10 (e.g. gfx9) for constants that are too large to be inlined (constants are sgpr by default). In med3 combine it is expected that regbankselect maps all operands of min/max we try to match to vgpr. However constants are mapped to sgpr and there will be a sgpr-to-vgpr copy. Matchers look through sgpr-to-vgpr copies and return sgpr and these break constant bus restriction. Build med3 with all vgpr operands. Use existing sgpr-to-vgpr copies for matched sgprs. If there is no such copy (not expected) build one. Differential Revision: https://reviews.llvm.org/D114700	2021-12-01 21:36:37 +01:00
Mateja Marjanovic	ca57b80cd6	Code quality: Combine V_RSQ Combine V_RCP and V_SQRT into V_RSQ on AMDGPU for GlobalISel. Change-Id: I93c5dcb412483156a6e8b68c4085cbce83ac9703	2021-11-30 17:17:15 +01:00
Abinav Puthan Purayil	14c4051122	[AMDGPU][NFC] Remove unused defvar in AMDGPUInstructions.td.	2021-11-30 17:03:37 +05:30
Christudasan Devadasan	5297cbf045	[AMDGPU] Enable copy between VGPR and AGPR classes during regalloc Greedy register allocator prefers to move a constrained live range into a larger allocatable class over spilling them. This patch defines the necessary superclasses for vector registers. For subtargets that support copy between VGPRs and AGPRs, the vector register spills during regalloc now become just copies. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D109301	2021-11-29 22:19:33 -05:00
Mirko Brkusanin	8951136216	[AMDGPU][GlobalISel] Transform (fadd (fpext (fmul x, y)), z) -> (fma (fpext x), (fpext y), z) Patch by: Mateja Marjanovic Differential Revision: https://reviews.llvm.org/D97937	2021-11-29 16:27:21 +01:00
Mirko Brkusanin	881840fc26	[AMDGPU][GlobalISel] Transform (fadd (fmul x, y), z) -> (fma x, y, z) Patch by: Mateja Marjanovic Differential Revision: https://reviews.llvm.org/D93305	2021-11-29 16:27:21 +01:00
Kazu Hirata	fd7d40640d	[llvm] Use range-based for loops (NFC)	2021-11-28 18:14:49 -08:00
Kazu Hirata	387927bbaf	[Target] Use range-based for loops (NFC)	2021-11-26 21:21:17 -08:00
Carl Ritson	8967d044fc	[AMDGPU] Add SIMemoryLegalizer comments to clarify bit usage Attempt to further document the intended cache policies requested by different combinations of GLC, SLC and DLC bits. GFX10 non-temporal stores are updated to set GLC. Reviewed By: t-tye Differential Revision: https://reviews.llvm.org/D114351	2021-11-26 21:05:58 +09:00
Christudasan Devadasan	654c89d85a	[AMDGPU] Make vector superclasses allocatable The combined vector register classes with both VGPRs and AGPRs are currently unallocatable. This patch turns them into allocatable as a prerequisite to enable copy between VGPR and AGPR registers during regalloc. Also, added the missing AV register classes from 192b to 1024b. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D109300	2021-11-26 00:42:12 -05:00
Jay Foad	d7e03df719	[AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32 Select SelectionDAG ops smul_lohi/umul_lohi to v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0. v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions so it is better to use one instruction than two. Further improvements are possible to make better use of the addend operand, but this is already a strict improvement over what we have now. Differential Revision: https://reviews.llvm.org/D113986	2021-11-24 11:25:02 +00:00
Jay Foad	8a52bd82e3	[AMDGPU] Only select VOP3 forms of VOP2 instructions Change VOP_PAT_GEN to default to not generating an instruction selection pattern for the VOP2 (e32) form of an instruction, only for the VOP3 (e64) form. This allows SIFoldOperands maximum freedom to fold copies into the operands of an instruction, before SIShrinkInstructions tries to shrink it back to the smaller encoding. This affects the following VOP2 instructions: v_min_i32 v_max_i32 v_min_u32 v_max_u32 v_and_b32 v_or_b32 v_xor_b32 v_lshr_b32 v_ashr_i32 v_lshl_b32 A further cleanup could simplify or remove VOP_PAT_GEN, since its optional second argument is never used. Differential Revision: https://reviews.llvm.org/D114252	2021-11-24 11:15:30 +00:00
Carl Ritson	976f3b3c9e	[AMDGPU] Only allow implicit WQM in pixel shaders Implicit derivatives are only valid in pixel shaders, hence only implicitly enable WQM for pixel shaders. This avoids unintended WQM in other shader types (e.g. compute) when image sampling instructions are used. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114414	2021-11-24 20:04:42 +09:00
Abinav Puthan Purayil	078da26b1c	[AMDGPU] Check for unneeded shift mask in shift PatFrags. The existing constrained shift PatFrags only dealt with masked shift from OpenCL front-ends. This change copies the X86DAGToDAGISel::isUnneededShiftMask() function to AMDGPU and uses it in the shift PatFrag predicates. Differential Revision: https://reviews.llvm.org/D113448	2021-11-24 10:53:12 +05:30
Stanislav Mekhanoshin	661a232e34	[AMDGPU] Remove a no-op check in the gfx90a hazard recognizer Also rename helper function accordingly. Differential Revision: https://reviews.llvm.org/D114289	2021-11-23 15:35:19 -08:00
Matt Arsenault	273a0c8bc9	PrologEpilogInserter: Use explicit control for scavenge slot placement AMDGPU is unusual in that the stack is indexed in the same direction as stack growth (up). We therefore always need the emergency stack slots placed as low as possible to ensure they are in range of load/store instruction immediate offsets. The existing logic is mostly OK, but failed if we required stack realignment. I don't understand what the existing control isFPCloseToIncomingSP is supposed to mean, but can only be used to stop placing the scavenge slots earlier. Make this explicit so that targets can opt-in rather than opt-out only.	2021-11-23 18:01:12 -05:00
Kazu Hirata	d45cb1d7ea	[llvm] Use range-based for loops (NFC)	2021-11-23 08:54:48 -08:00
alex-t	9e03e8c99e	[AMDGPU] Enable fneg and fabs divergence-driven instruction selection. Detailed description: We currently have a set of patterns to select ISD::FNEG and ISD::FABS to the bitwise operations. We need to make them predicated to select the VALU or SALU bitwise operation variant according to the SDNode divergence bit. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114257	2021-11-23 19:37:07 +03:00
Jay Foad	44a3916f78	[AMDGPU] Allow VOP3 source modifiers in fpow expansion Differential Revision: https://reviews.llvm.org/D114353	2021-11-22 20:39:46 +00:00
Kazu Hirata	59a26448a6	[Target] Use range-based for loops (NFC)	2021-11-22 08:21:07 -08:00
Kazu Hirata	49e3838145	[llvm] Use make_early_inc_range (NFC)	2021-11-21 19:24:17 -08:00
RamNalamothu	18f9351223	[AMDGPU] Do not generate ELF symbols for the local branch target labels The compiler was generating symbols in the final code object for local branch target labels. This bloats the code object, slows down the loader, and is only used to simplify disassembly. Use '--symbolize-operands' with llvm-objdump to improve readability of the branch target operands in disassembly. Fixes: SWDEV-312223 Reviewed By: scott.linder Differential Revision: https://reviews.llvm.org/D114273	2021-11-20 10:32:41 +05:30
Jay Foad	ff7f2cfa95	[AMDGPU] Add an implicit use of M0 to all V_MOV_B32_indirect_read/write NFCI. Previously the implicit use was added to V_MOV_B32_indirect_read when building the instruction. V_MOV_B32_indirect_write didn't have an implicit use of M0 at all, but apparently it did not cause any problems. Differential Revision: https://reviews.llvm.org/D114239	2021-11-19 19:00:17 +00:00
Jay Foad	30b27ecfc2	[AMDGPU] Use new opcode for indexed vgpr reads Introduce V_MOV_B32_indirect_read for indexed vgpr reads (and rename the old V_MOV_B32_indirect to V_MOV_B32_indirect_write) so they can be unambiguously distinguished from regular V_MOV_B32_e32. Previously they were distinguished by looking for extra implicit operands but this is fragile because regular moves sometimes have extra implicit operands too: - either by accident, when instructions end up with duplicate implicit operands (see e.g. D100939) - or by design, when SIInstrInfo::copyPhysReg breaks a multi-dword copy into individual subreg mov instructions and adds implicit operands for the super-register. The effect of this is that SIInstrInfo::isFoldableCopy can be simplified and identifies more foldable copies. The test diffs show that more immediate 0 values have been folded as inline operands. SIInstrInfo::isReallyTriviallyReMaterializable could probably be simplified too but that is not part of this patch. Differential Revision: https://reviews.llvm.org/D114230	2021-11-19 13:08:11 +00:00
Stanislav Mekhanoshin	f8e615462b	[AMDGPU] Fix SIPostRABundler crash on null register used by dbg value Recently we started generating DBG_VALUEs with $noreg operands. This crashes SIPostRABundler, and it should not iterate these registers anyway. Fixes: SWDEV-311733 Differential Revision: https://reviews.llvm.org/D114202	2021-11-18 17:01:19 -08:00
Zarko Todorovski	5b8bbbecfa	[NFC][llvm] Inclusive language: reword and remove uses of sanity in llvm/lib/Target Reworded removed code comments that contain `sanity check` and `sanity test`.	2021-11-17 21:59:00 -05:00
Simon Pilgrim	3020608b61	Fix MSVC signed/unsigned mismatch warning. NFC.	2021-11-17 18:59:23 +00:00
Mirko Brkusanin	db6bc2ab51	[AMDGPU][GlobalISel] Fold G_FNEG above when users cannot fold mods If possible fold fneg into instruction above if users cannot fold mods and we know it will decrease instruction count. Follows same logic as SDAG combiner in choosing opportunities to combine. Differential Revision: https://reviews.llvm.org/D112827	2021-11-17 14:25:13 +01:00
Jay Foad	3264e95938	[CodeGen] Update LiveIntervals in TargetInstrInfo::convertToThreeAddress Delegate updating of LiveIntervals to each target's convertToThreeAddress implementation, instead of repairing LiveIntervals after the fact in TwoAddressInstruction::convertInstTo3Addr. Differential Revision: https://reviews.llvm.org/D113493	2021-11-17 10:16:47 +00:00
Kazu Hirata	ee0133dc6d	[llvm] Use range-for loops (NFC)	2021-11-16 09:01:56 -08:00
Jon Chesterfield	30b29db7c7	[amdgpu] Don't crash on empty global ctor/dtor Global ctor/dtor can be an empty array, which is a Constant not a ConstantArray. The cast<ConstantArray> therefore asserts / crashes. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D113800	2021-11-16 14:36:08 +00:00
Matt Arsenault	659887b405	AMDGPU: Mark prolog/epilog SCC defs as dead A future change will add SCC liveness checks. Since we are still relying on forward register scavenging, add dead flags to avoid spuriously detecting SCC as live.	2021-11-15 21:35:06 -05:00
Dmitry Preobrazhensky	91f4650ebb	[AMDGPU][MC][GFX10] Corrected global_atomic_fcmpswap* Corrected src data size of global_atomic_fcmpswap and global_atomic_fcmpswap_x2 opcodes. Differential Revision: https://reviews.llvm.org/D113746	2021-11-15 12:51:12 +03:00
Kazu Hirata	a84a401f7e	[AMDGPU] Remove selectStoreIntrinsic (NFC) The last use was removed on Jan 13, 2020 in commit `533d650e94`.	2021-11-14 19:40:44 -08:00
Kazu Hirata	efa896e5f7	[Target] Use SDNode::uses (NFC)	2021-11-12 21:23:04 -08:00
Jay Foad	a70bbb5f7a	[AMDGPU] Simplify 64-bit division/remainder expansion The old expansion open-coded a 64-bit addition in a strange way, by adding the high parts without carry-in from the low part, and then adding the carry back in later on. Fixing this saves a couple of instructions and makes the code much easier to understand. Differential Revision: https://reviews.llvm.org/D113679	2021-11-12 15:48:41 +00:00
Neubauer, Sebastian	d1f45ed58f	[AMDGPU][NFC] Fix typos Differential Revision: https://reviews.llvm.org/D113672	2021-11-12 11:37:21 +01:00
Kazu Hirata	2ca45adf24	[CodeGen, Target] Use MachineRegisterInfo::use_operands (NFC)	2021-11-11 22:28:55 -08:00
kpyzhov	c9690092c8	[AMDGPU] Small correction in SITargetLowering::performOrCombine(). Differential Revision: https://reviews.llvm.org/D113203	2021-11-10 21:07:27 -05:00
Stanislav Mekhanoshin	476ab0f809	[AMDGPU] Fixed stack pointer init with architected flat scratch Even if wave offset is not present we still need to do the rest of the initialization. The mov into s32 was missing in the kernels. Fixes: SWDEV-310935 Differential Revision: https://reviews.llvm.org/D113628	2021-11-10 17:18:38 -08:00
Matt Arsenault	c7a0c2d0f7	AMDGPU: Report large stack usage for recursive calls We were previously setting an ignored bit in the kernel headers. The current behavior is to add the large amount on top of the statically known size of a single stack frame. I'm not sure if we should just use the large size as the entire reported size instead.	2021-11-10 20:02:01 -05:00
Matt Arsenault	90ff148719	AMDGPU: Account for implicit argument alignment for kernarg segment If a kernel had no formal arguments but did have the implicit arguments, we were reporting a required kernarg alignment of 4. For some reason we require an 8-byte alignment for this, even though there's no real advantage and I don't see where this is documented in the ABI. The code object header code also claims the minimum alignment is 16, which is what I thought you always got at runtime anyway so I don't know why this matters.	2021-11-09 17:48:37 -05:00
Michael Liao	bf225939bc	[InferAddressSpaces] Support assumed addrspaces from addrspace predicates. - CUDA cannot associate memory space with pointer types. Even though Clang could add extra attributes to specify the address space explicitly on a pointer type, it breaks the portability between Clang and NVCC. - This change proposes to assume the address space from a pointer from the assumption built upon target-specific address space predicates, such as `__isGlobal` from CUDA. E.g., ``` foo(float *p) { __builtin_assume(__isGlobal(p)); // From there, we could assume p is a global pointer instead of a // generic one. } ``` This makes the code portable without introducing the implementation-specific features. Note that NVCC starts to support __builtin_assume from version 11. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112041	2021-11-08 16:51:57 -05:00
Joe Nash	79f52af4cd	[AMDGPU] Make getInstSizeInBytes more generic NFC. This check mainly handles size affecting literals. Make it check all explicit operands instead of a few by name. Also make the isLiteral check handle the KIMM operands, see https://reviews.llvm.org/D111067 Change-Id: I1a362d55b2a10f5c74d445272e8b7829a8b77597 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D113318 Change-Id: Ie6c688f30a71e0335d1c6dd1ff65019bd7ce684e	2021-11-08 10:34:49 -05:00
Simon Pilgrim	f059b04f7b	[DAG] Add SelectionDAG::ComputeMinSignedBits helper As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper. I've included some usage, its not exhaustive, just the more obvious cases where the intention is obvious. Differential Revision: https://reviews.llvm.org/D113396	2021-11-08 14:12:45 +00:00
skc7	a0633f5ccb	[AMDGPU] Test Commit. NFC Reviewed By: hsmhsm Differential Revision: https://reviews.llvm.org/D113379	2021-11-08 07:09:09 +00:00
Kazu Hirata	aee86f9b6c	[AMDGPU] Remove unused declaration selectSMRD (NFC) The function body proper was removed on Feb 20, 2019 in commit `79b5c3842b`.	2021-11-07 09:53:18 -08:00
Benjamin Kramer	9b8b16457c	Put implementation details into anonymous namespaces. NFCI.	2021-11-07 15:18:30 +01:00
Kazu Hirata	e4bab21848	[AMDGPU] Use MachineBasicBlock::{predecessors,successors} (NFC)	2021-11-06 19:31:20 -07:00
Kazu Hirata	14d656b3d8	[Target] Use llvm::reverse (NFC)	2021-11-06 13:08:21 -07:00
Kazu Hirata	2c4ba3e9d3	[Target] Use make_early_inc_range (NFC)	2021-11-05 09:14:32 -07:00
Jay Foad	c93bf53a3e	[AMDGPU] NFC formatting fixes in SIMemoryLegalizer	2021-11-05 09:10:24 +00:00
Thomas Symalla	76cbe62262	[AMDGPU] Changes the AMDGPU_Gfx calling convention by making the SGPRs 4..29 callee-save. This is to avoid superfluous s_movs when executing amdgpu_gfx function calls as the callee is likely not going to change the argument values. This patch changes the AMDGPU_Gfx calling convention. It defines the SGPR registers s[4:29] as callee-save and leaves some SGPRs usable for callers. The intention is to avoid unnecessary s_mov instructions for arguments the caller would otherwise save and restore in these registers. Reviewed By: sebastian-ne Differential Revision: https://reviews.llvm.org/D111637	2021-11-04 21:50:18 +01:00
RamNalamothu	539f500e78	[AMDGPU] Do not add debug locations to the code inside prologue There is no real source location for code inside prologue as it is generated by compiler but source locations are being added to code inside prologue as a side effect of https://reviews.llvm.org/D99269 because buildSpillLoadStore() is using source location of the real instruction in the basic block if any. Fixes: SWDEV-307590 Reviewed By: scott.linder, sebastian-ne Differential Revision: https://reviews.llvm.org/D113100	2021-11-04 08:02:41 +05:30
alex-t	0a3d755ee9	[AMDGPU] Enable divergence-driven BFE selection Detailed description: This change enables the bit field extract patterns selection to s_bfe_u32 or v_bfe_u32 dependent on the pattern root node divergence. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D110950	2021-11-03 23:26:59 +03:00
Kazu Hirata	4bef0304e1	[AArch64, AMDGPU] Use make_early_inc_range (NFC)	2021-11-03 09:22:51 -07:00
Abinav Puthan Purayil	fbe61fb0aa	[AMDGPU] Fix SGPR checks in S_MOV_B64_IMM_PSEUDO generation. The function to generate S_MOV_B64_IMM_PSEUDO was recently modified to optimize AGPR to AGPR copy but it missed checking for the SGPR clobbering for the S_MOV_B64_IMM_PSEUDO generation. Differential Revision: https://reviews.llvm.org/D113005	2021-11-03 09:09:24 +05:30
Jay Foad	be1a8f8834	[AMDGPU] Really preserve LiveVariables in SILowerControlFlow https://bugs.llvm.org/show_bug.cgi?id=52204 Differential Revision: https://reviews.llvm.org/D112731	2021-11-02 15:03:37 +00:00
Martin Liska	c5029023fb	Fix building with GCC 12: Fixes: https://bugs.llvm.org/show_bug.cgi?id=52380 Differential Revision: https://reviews.llvm.org/D112990	2021-11-02 14:28:00 +01:00
Jay Foad	7afef22926	[AMDGPU] Use MachineInstrBuilder::addReg. NFC.	2021-11-01 15:29:51 +00:00
Jay Foad	2b548b18c1	[AMDGPU] Shrink v_mac_legacy_f32 and v_fmac_legacy_f32 Differential Revision: https://reviews.llvm.org/D112917	2021-11-01 13:55:53 +00:00
Kazu Hirata	72710af233	[CodeGen, Target] Use MachineBasicBlock::terminators (NFC)	2021-10-31 07:57:34 -07:00
Christudasan Devadasan	aa2d3b59ce	GlobalISel/Utils: Use incoming regbank while constraining the superclasses Register operands with superclasses can possibly have multiple regBanks if they have different register types. The regBank ambiguity resolved during regbankselect should be used to constrain the operand regclass instead of obtaining one from the MCInstrDesc. This is a prerequisite patch for D109300 that introduces allocatable AV_* Superclasses for AMDGPU by combining both VGPRs and AGPRs and we want to restrain the regclass to either A or V based on the incoming regbank. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112323	2021-10-30 07:20:45 -04:00
Stanislav Mekhanoshin	e5340ed30c	[AMDGPU] Fix global isel for kernels using agprs on gfx90a With Global ISel getReservedRegs() is called before function is regbank selected for the first time. Defer caching of usesAGPRs() in this case. Differential Revision: https://reviews.llvm.org/D112644	2021-10-29 14:23:14 -07:00
Jay Foad	21a1d4cf71	[AMDGPU] Change numBitsSigned for simplicity and document it. NFC. Change numBitsSigned to return the minimum size of a signed integer that can hold the value. This is different by one from the previous result but is more consistent with numBitsUnsigned. Update all callers. All callers are now more consistent between the signed and unsigned cases, and some callers get simpler, especially the ones that deal with quantities like numBitsSigned(LHS) + numBitsSigned(RHS). Differential Revision: https://reviews.llvm.org/D112813	2021-10-29 14:22:06 +01:00
Vang Thao	52b43d1549	[AMDGPU] Fix cvt_f32_ubyte combine with shl Shift node is still needed to check if the shift is shr or shl to increment/decrement offset. Do not override the node. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112733	2021-10-28 21:43:06 -07:00
Kazu Hirata	01b4789b62	[AMDGPU] Remove hasDefinedInitializer (NFC) The last use was removed on Sep 16, 2021 in commit `7a62a5b56d`.	2021-10-28 20:33:34 -07:00
Kazu Hirata	dd5d46b009	[AMDGPU] Remove unused BBSelectRegister in AMDGPUMachineCFGStructurizer (NFC) This field seems to be unused for at least one year.	2021-10-28 20:33:32 -07:00
Kazu Hirata	309357c01a	[AMDGPU] Remove unused declaration eliminateDeadBranchOperands (NFC)	2021-10-28 20:33:30 -07:00
Abinav Puthan Purayil	2da6ef3664	[AMDGPU] Add 24-bit mulhi intrinsics in INTRINSIC_WO_CHAIN combine. mul24 intrinsic's operands are simplified by AMDGPUTargetLowering::performIntrinsicWOChainCombine(). This change adds the mul24hi intrinsics in the combine since its operands can be simplified like that of the mul24 intrinsics. Differential Revision: https://reviews.llvm.org/D112702	2021-10-28 16:57:48 +05:30
Sebastian Neubauer	fd1cfc9094	[AMDGPU][GlobalISel] Fix waterfall loops - Move the `s_and exec` to its correct position before the content of the waterfall loop - Use the SI_WATERFALL pseudo instruction, like for sdag, to benefit from optimizations - Add support for indirect function calls To support indirect calls, add a G_SI_CALL instruction without register class restrictions and insert a waterfall loop when applying register banks. Differential Revision: https://reviews.llvm.org/D109052	2021-10-28 10:30:55 +02:00
Kazu Hirata	cee3419d65	[AMDGPU] Remove unused declaration findNumUsedRegistersSI (NFC)	2021-10-27 21:24:02 -07:00
Michael Liao	e6a4ba3aa6	[amdgpu] Handle the case where there is no scavenged register. - When an unconditional branch is expanded into an indirect branch, if there is no scavenged register, an SGPR pair needs spilling to enable the destination PC calculation. In addition, before jumping into the destination, that clobbered SGPR pair need restoring. - As SGPR cannot be spilled to or restored from memory directly, the spilling/restoring of that SGPR pair reuses the regular SGPR spilling support but without spilling it into memory. As that spilling and restoring points are fully controlled, we only need to spill that SGPR into the temporary VGPR, which needs spilling into its emergency slot. - The target-specific hook is revised to take additional restore block, where the restoring code is filled. After that, the relaxation will place that restore block directly before the destination block and insert an unconditional branch in any fall-through block into the destination block. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D106449	2021-10-27 18:37:27 -04:00
Austin Kerbow	02e60f2e77	[AMDGPU] Use max waves for scheduler's initial occupancy target The scheduler should set critical/excess register usage thresholds that are guided by the maximum possible occupancy for the function. This change is focused on setting proper lower bounds on register usage which we would typically only see when a specific number of maximum waves is requested with the "waves-per-eu" attribute, or by setting "amdgpu-num-vgpr\|sgpr" directly. This was broken previously. I have a follow-on patch that will address issues with the scheduler not targeting correct upper bounds on register usage which is typical with launch bounds and min "waves-per-eu". Changes by this patch: Set the initial critical register usage thresholds to minimum values that are determined by the maximum possible occupancy for the function, or the number of allocatable registers, whichever is lower. Avoid unsigned overflow if register limits are lower than the register tracking "ErrorMargin", i.e. when using stress-regalloc=2. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D112373	2021-10-26 15:30:26 -07:00
Neubauer, Sebastian	eb16570ab0	[AMDGPU] Remove unused CSR defs CSR_AMDGPU_VGPRs_24_255 and CSR_AMDGPU_VGPRs_32_255 are not used anywhere, so remove them. Differential Revision: https://reviews.llvm.org/D112535	2021-10-26 16:01:49 +02:00
Abinav Puthan Purayil	61e3b9fefe	[AMDGPU] Add constrained shift pattern matches. The motivation for this is due to clang's conformance to https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#operators-shift which makes clang emit (<shift> a, (and b, <width> - 1)) for `a <shift> b` in OpenCL where a is an int of bit width <width>. Differential revision: https://reviews.llvm.org/D110231	2021-10-26 19:07:19 +05:30
Abinav Puthan Purayil	781dd39b7b	[AMDGPU] Enable 48-bit mul in AMDGPUCodeGenPrepare. We were bailing out of creating 24-bit muls for results wider than 32 bits in AMDGPUCodeGenPrepare. With the 24-bit mulhi intrinsic, this change teaches AMDGPUCodeGenPrepare to generate the 48-bit mul correctly. Differential Revision: https://reviews.llvm.org/D112395	2021-10-26 18:53:07 +05:30

7293 Commits