llvm-project

Commit Graph

Author	SHA1	Message	Date
Matt Arsenault	7834194837	TableGen: Introduce generated getSubRegisterClass function Currently there isn't a generic way to get a smaller register class that can be produced from a subregister of a larger class. Replaces a manually implemented version for AMDGPU. This will be used to improve subregister support in the allocator.	2022-09-12 09:03:37 -04:00
Daniil Fukalov	7ed3d81333	[NFCI] Move cost estimation from TargetLowering to TargetTransformInfo. TragetLowering had two last InstructionCost related `getTypeLegalizationCost()` and `getScalingFactorCost()` members, but all other costs are processed in TTI. E.g. it is not comfortable to use other TTI members in these two functions overrided in a target. Minor refactoring: `getTypeLegalizationCost()` now doesn't need DataLayout parameter - it was always passed from TTI. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D117723	2022-08-18 00:38:55 +03:00
Kazu Hirata	109df7f9a4	[llvm] Qualify auto in range-based for loops (NFC) Identified with readability-qualified-auto.	2022-08-13 12:55:42 -07:00
Fangrui Song	de9d80c1c5	[llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC With C++17 there is no Clang pedantic warning or MSVC C5051.	2022-08-08 11:24:15 -07:00
Leon Clark	6a275cd53c	Transform illegal intrinsics to V_ILLEGAL Related tasks: - SWDEV-240194 - SWDEV-309417 - SWDEV-334876 Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D123693	2022-08-06 08:59:00 +01:00
David Stuttard	b14d7bf750	AMDGPU: Turn off force init 16 input SGPRS for pal Pal uses a different mechanism for user sgprs. Differential Revision: https://reviews.llvm.org/D129566	2022-07-25 10:52:46 +01:00
Kazu Hirata	0387da6f4f	Use value instead of getValue (NFC)	2022-07-19 21:18:26 -07:00
Kazu Hirata	41ae78ea3a	Use has_value instead of hasValue (NFC)	2022-07-19 20:15:44 -07:00
Jon Chesterfield	3a20597776	[amdgpu] Implement lds kernel id intrinsic Implement an intrinsic for use lowering LDS variables to different addresses from different kernels. This will allow kernels that cannot reach an LDS variable to avoid wasting space for it. There are a number of implicit arguments accessed by intrinsic already so this implementation closely follows the existing handling. It is slightly novel in that this SGPR is written by the kernel prologue. It is necessary in the general case to put variables at different addresses such that they can be compactly allocated and thus necessary for an indirect function call to have some means of determining where a given variable was allocated. Claiming an arbitrary SGPR into which an integer can be written by the kernel, in this implementation based on metadata associated with that kernel, which is then passed on to indirect call sites is sufficient to determine the variable address. The intent is to emit a __const array of LDS addresses and index into it. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D125060	2022-07-19 17:46:19 +01:00
Piotr Sobczak	2bd8e74b94	[AMDGPU] Fix bitcast v4i64/v16i16 Fix a regression introduced in D128865. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D129375	2022-07-11 22:27:52 +02:00
Piotr Sobczak	bd675af2a2	[AMDGPU] Make v16i16/v16f16 legal There are upcoming intrinsics to use the new types. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D128865	2022-06-30 23:08:40 +02:00
Matt Arsenault	0bdaef38c9	AMDGPU: Add gfx11 feature to force initializing 16 input SGPRs The total user+system SGPR count needs to be padded out to 16 if fewer inputs are enabled.	2022-06-29 14:52:19 -04:00
Jay Foad	3fbc945c3a	[AMDGPU] llvm.amdgcn.exp.compr is not supported on GFX11 Differential Revision: https://reviews.llvm.org/D128259	2022-06-28 14:48:25 +01:00
Joe Nash	ae72fee74e	[AMDGPU] gfx11 Select on Buffer Atomic FAdd Rtn type Reviewed By: #amdgpu, foad, rampitec Differential Revision: https://reviews.llvm.org/D128205	2022-06-23 11:05:32 -04:00
Rodrigo Dominguez	971fa4b196	[AMDGPU] GFX11: remove ShaderType from ds_ordered_count offset field In GFX11 ShaderType is determined by the hardware and should no longer be written into bits[3:2] of the ds_ordered_count offset field. Differential Revision: https://reviews.llvm.org/D128196	2022-06-23 14:20:33 +01:00
Jay Foad	c155a944fb	[AMDGPU] GFX11 CodeGen support for MIMG instructions This includes: - New llvm.amdgcn.image.msaa.load.* intrinsics - NSA changes, because MIMG-NSA is now limited to 3 dwords - Split CD forms of IMAGE_SAMPLE instructions out into separate test files since they are no longer supported in GFX11 Differential Revision: https://reviews.llvm.org/D127837	2022-06-16 18:23:14 +01:00
Jay Foad	4b2d70fa5b	[AMDGPU] Basic implementation of isExtractSubvectorCheap Add a basic implementation of isExtractSubvectorCheap that only considers extracts at offset 0. Differential Revision: https://reviews.llvm.org/D127385	2022-06-10 14:43:07 +01:00
Guillaume Chatelet	0788186182	[Alignment][NFC] Remove usage of MemSDNode::getAlignment I can't remove the function just yet as it is used in the generated .inc files. I would also like to provide a way to compare alignment with TypeSize since it came up a few times. Differential Revision: https://reviews.llvm.org/D126910	2022-06-07 13:52:20 +00:00
Julien Pages	2dfe419446	[AMDGPU] Improve codegen of extractelement/insertelement in some cases This patch improves the codegen of extractelement and insertelement for vector containing 8 elements. Before, a dag combine transformation was generating a sequence of 8 select/cmp. This patch changes the upper limit for this transformation and the movrel instruction will eventually be used instead. Extractlement/insertelement for vectors containing less than 8 elements are unchanged. Differential Revision: https://reviews.llvm.org/D126389	2022-06-02 17:05:55 -04:00
Matt Arsenault	0e1c71e4a4	CodeGen: Move getAddressSpaceForPseudoSourceKind into TargetMachine Avoid the dependency on TargetInstrInfo, which depends on the subtarget and therefore the individual function. Currently AMDGPU is constructing PseudoSourceValue instances in MachineFunctionInfo. In order to facilitate copying MachineFunctionInfo, we need to stop allocating these there. Alternatively we could allow targets to subclass PseudoSourceValueManager, and allocate them similarly to MachineFunctionInfo.	2022-06-01 09:45:40 -04:00
Stanislav Mekhanoshin	5df6669d45	[AMDGPU] Enforce alignment of image vaddr on gfx90a Even though single address image instructions only use a single VGPR HW accesses 4 or 5 which creates alignment requirement. Fixes: SWDEV-316648 Differential Revision: https://reviews.llvm.org/D126009	2022-05-24 10:05:39 -07:00
Shao-Ce SUN	25af3afa67	[NFC][AMDGPU][CodeGen] Use ArrayRef in TargetLowering functions Based on D123467. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D124508	2022-05-18 10:50:23 +08:00
Stanislav Mekhanoshin	dee3190293	[AMDGPU] Add llvm.amdgcn.global.load.lds intrinsic Differential Revision: https://reviews.llvm.org/D125279	2022-05-17 12:35:27 -07:00
Stanislav Mekhanoshin	791ec1c68e	[AMDGPU] Add intrinsics llvm.amdgcn.{raw\|struct}.buffer.load.lds Differential Revision: https://reviews.llvm.org/D124884	2022-05-17 10:32:13 -07:00
Nicolai Hähnle	6c2a01ce3a	AMDGPU/SDAG: Refine the fold to v_mad_[iu]64_[iu]32 Only fold for uniform values on pre-GFX9 chips. GFX9+ allow us to keep the calculation entirely on the SALU. For subtargets where integer multiplication isn't full-rate, avoid folding if the multiply has too many uses. Finally, we expand 64x32 and 64x64 multiplies here as well, if they feed into an addition. This results in better code generation than the generic expansion for such multiplies because we end up using the accumulator of the MAD instructions. Differential Revision: https://reviews.llvm.org/D123835	2022-05-10 09:15:51 -05:00
Simon Pilgrim	7e3ef7dcd2	[AMDGPU] lowerEXTRACT_VECTOR_ELT - fold from a SCALAR_TO_VECTOR source As suggested by @foad on D124839 If we're extracting a vector element that originally came from a scalar_to_vector, then avoid the bitcasting of a vector type and perform the shift masking on the (any-extended) scalar source directly, making use of the fact that the upper elements of a scalar_to_vector are all undef. Differential Revision: https://reviews.llvm.org/D125173	2022-05-07 20:23:31 +01:00
Jon Chesterfield	bc78c09952	[amdgpu] Elide module lds allocation in kernels with no callees Introduces a string attribute, amdgpu-requires-module-lds, to allow eliding the module.lds block from kernels. Will allocate the block as before if the attribute is missing or has its default value of true. Patch uses the new attribute to detect the simplest possible instance of this, where a kernel makes no calls and thus cannot call any functions that use LDS. Tests updated to match, coverage was already good. Interesting cases is in lower-module-lds-offsets where annotating the kernel allows the backend to pick a different (in this case better) variable ordering than previously. A later patch will avoid moving kernel variables into module.lds when the kernel can have this attribute, allowing optimal ordering and locally unused variable elimination. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D122091	2022-05-04 22:42:07 +01:00
hsmahesha	589b9df4e1	[AMDGPU] Fix scalar_to_vector for v8i16/v8f16 so that the stack access is avoided. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D124734	2022-05-03 07:28:15 +05:30
hsmahesha	3175323ce1	[AMDGPU][NFC] Make lowerINSERT_VECTOR_ELT() more readable by moving around the code and by adding more comments, which would later help during any required clean-up. Differential Revision: https://reviews.llvm.org/D124733	2022-05-03 07:28:15 +05:30
Nicolai Hähnle	deaa678137	AMDGPU/SDAG: Factor out the fold (add (mul x, y), y) --> mad_[iu]64_[iu]32 Refactor to simplify a follow-up change. No functional change intended. However, there is a rather subtle logic change: the subsequent combines (e.g. reassociation) are skipped always when one of the operands of the add is a mul, instead of only when additionally mad64_32 etc. are available. This change makes sense because the subsequent combines should never apply when one of the operands is a mul. Differential Revision: https://reviews.llvm.org/D123833	2022-05-02 17:40:03 -05:00
Matt Arsenault	794a0bb547	AMDGPU: Directly implement computeKnownBits for workitem intrinsics Currently metadata is inserted in a late pass which is lowered to an AssertZext. The metadata would be more useful if it was inserted earlier after inlining, but before codegen. Probably shouldn't change anything now. Just replacing the late metadata annotation needs more work, since we lose out on optimizations after these are lowered to CopyFromReg. Seems to be slightly better than relying on the AssertZext from the metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using -start-before=amdgpu-isel instead of running the usual codegen pipeline.	2022-04-22 10:49:50 -04:00
Abinav Puthan Purayil	2147b6c89d	[AMDGPU] Remove no-ret atomic ops selection in the post-isel hook No-ret atomic ops are now selected in tblgen. Differential Revision: https://reviews.llvm.org/D124086	2022-04-22 09:37:41 +05:30
Stanislav Mekhanoshin	ac94073daa	[AMDGPU] Refine 64 bit misaligned LDS ops selection Here is the performance data: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b64 aligned by 8: 3.2 sec ds_write2_b32 aligned by 8: 3.2 sec ds_write_b16 * 4 aligned by 8: 7.0 sec ds_write_b8 * 8 aligned by 8: 13.2 sec ds_write_b64 aligned by 1: 7.3 sec ds_write2_b32 aligned by 1: 7.5 sec ds_write_b16 * 4 aligned by 1: 14.0 sec ds_write_b8 * 8 aligned by 1: 13.2 sec ds_write_b64 aligned by 2: 7.3 sec ds_write2_b32 aligned by 2: 7.5 sec ds_write_b16 * 4 aligned by 2: 7.1 sec ds_write_b8 * 8 aligned by 2: 13.3 sec ds_write_b64 aligned by 4: 4.6 sec ds_write2_b32 aligned by 4: 3.2 sec ds_write_b16 * 4 aligned by 4: 7.1 sec ds_write_b8 * 8 aligned by 4: 13.3 sec ds_read_b64 aligned by 8: 2.3 sec ds_read2_b32 aligned by 8: 2.2 sec ds_read_u16 * 4 aligned by 8: 4.8 sec ds_read_u8 * 8 aligned by 8: 8.6 sec ds_read_b64 aligned by 1: 4.4 sec ds_read2_b32 aligned by 1: 7.3 sec ds_read_u16 * 4 aligned by 1: 14.0 sec ds_read_u8 * 8 aligned by 1: 8.7 sec ds_read_b64 aligned by 2: 4.4 sec ds_read2_b32 aligned by 2: 7.3 sec ds_read_u16 * 4 aligned by 2: 4.8 sec ds_read_u8 * 8 aligned by 2: 8.7 sec ds_read_b64 aligned by 4: 4.4 sec ds_read2_b32 aligned by 4: 2.3 sec ds_read_u16 * 4 aligned by 4: 4.8 sec ds_read_u8 * 8 aligned by 4: 8.7 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b64 aligned by 8: 4.4 sec ds_write2_b32 aligned by 8: 4.3 sec ds_write_b16 * 4 aligned by 8: 7.9 sec ds_write_b8 * 8 aligned by 8: 13.0 sec ds_write_b64 aligned by 1: 23.2 sec ds_write2_b32 aligned by 1: 23.1 sec ds_write_b16 * 4 aligned by 1: 44.0 sec ds_write_b8 * 8 aligned by 1: 13.0 sec ds_write_b64 aligned by 2: 23.2 sec ds_write2_b32 aligned by 2: 23.1 sec ds_write_b16 * 4 aligned by 2: 7.9 sec ds_write_b8 * 8 aligned by 2: 13.1 sec ds_write_b64 aligned by 4: 13.5 sec ds_write2_b32 aligned by 4: 4.3 sec ds_write_b16 * 4 aligned by 4: 7.9 sec ds_write_b8 * 8 aligned by 4: 13.1 sec ds_read_b64 aligned by 8: 3.5 sec ds_read2_b32 aligned by 8: 3.4 sec ds_read_u16 * 4 aligned by 8: 5.3 sec ds_read_u8 * 8 aligned by 8: 8.5 sec ds_read_b64 aligned by 1: 13.1 sec ds_read2_b32 aligned by 1: 22.7 sec ds_read_u16 * 4 aligned by 1: 43.9 sec ds_read_u8 * 8 aligned by 1: 7.9 sec ds_read_b64 aligned by 2: 13.1 sec ds_read2_b32 aligned by 2: 22.7 sec ds_read_u16 * 4 aligned by 2: 5.6 sec ds_read_u8 * 8 aligned by 2: 7.9 sec ds_read_b64 aligned by 4: 13.1 sec ds_read2_b32 aligned by 4: 3.4 sec ds_read_u16 * 4 aligned by 4: 5.6 sec ds_read_u8 * 8 aligned by 4: 7.9 sec ``` GFX10 exposes a different pattern for sub-DWORD load/store performance than GFX9. On GFX9 it is faster to issue a single unaligned load or store than a fully split b8 access, where on GFX10 even a full split is better. However, this is a theoretical only gain because splitting an access to a sub-dword level will require more registers and packing/ unpacking logic, so ignoring this option it is better to use a single 64 bit instruction on a misaligned data with the exception of 4 byte aligned data where ds_read2_b32/ds_write2_b32 is better. Differential Revision: https://reviews.llvm.org/D123956	2022-04-21 09:37:16 -07:00
Stanislav Mekhanoshin	aa14e2ef3e	[AMDGPU] Remove obsolete hack from allowsMisalignedMemoryAccesses. NFCI. Differential Revision: https://reviews.llvm.org/D124035	2022-04-20 11:52:56 -07:00
Stanislav Mekhanoshin	49b39c4f2e	[AMDGPU] Remove redundand RequiredAlignment assignment. NFCI. Differential Revision: https://reviews.llvm.org/D123699	2022-04-14 02:03:51 -07:00
Carl Ritson	35ea326047	[AMDGPU] Try to avoid inserting duplicate s_inst_prefetch Check for existing s_inst_prefetch instructions when configuring prefetches during loop alignment. Reviewed By: rampitec, foad Differential Revision: https://reviews.llvm.org/D123569	2022-04-14 16:06:24 +09:00
Stanislav Mekhanoshin	f6462a26f0	[AMDGPU] Split unaligned 4 DWORD DS operations Similarly to 3 DWORD operations it is better for performance to split unlaligned operations as long a these are at least DWORD alignmened. Performance data: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b128 aligned by 16: 4.9 sec ds_write2_b64 aligned by 16: 5.1 sec ds_write2_b32 * 2 aligned by 16: 5.5 sec ds_write_b128 aligned by 1: 8.1 sec ds_write2_b64 aligned by 1: 8.7 sec ds_write2_b32 * 2 aligned by 1: 14.0 sec ds_write_b128 aligned by 2: 8.1 sec ds_write2_b64 aligned by 2: 8.7 sec ds_write2_b32 * 2 aligned by 2: 14.0 sec ds_write_b128 aligned by 4: 5.6 sec ds_write2_b64 aligned by 4: 8.7 sec ds_write2_b32 * 2 aligned by 4: 5.6 sec ds_write_b128 aligned by 8: 5.6 sec ds_write2_b64 aligned by 8: 5.1 sec ds_write2_b32 * 2 aligned by 8: 5.6 sec ds_read_b128 aligned by 16: 3.8 sec ds_read2_b64 aligned by 16: 3.8 sec ds_read2_b32 * 2 aligned by 16: 4.0 sec ds_read_b128 aligned by 1: 4.6 sec ds_read2_b64 aligned by 1: 8.1 sec ds_read2_b32 * 2 aligned by 1: 14.0 sec ds_read_b128 aligned by 2: 4.6 sec ds_read2_b64 aligned by 2: 8.1 sec ds_read2_b32 * 2 aligned by 2: 14.0 sec ds_read_b128 aligned by 4: 4.6 sec ds_read2_b64 aligned by 4: 8.1 sec ds_read2_b32 * 2 aligned by 4: 4.0 sec ds_read_b128 aligned by 8: 4.6 sec ds_read2_b64 aligned by 8: 3.8 sec ds_read2_b32 * 2 aligned by 8: 4.0 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b128 aligned by 16: 6.2 sec ds_write2_b64 aligned by 16: 7.1 sec ds_write2_b32 * 2 aligned by 16: 7.6 sec ds_write_b128 aligned by 1: 24.1 sec ds_write2_b64 aligned by 1: 25.2 sec ds_write2_b32 * 2 aligned by 1: 43.7 sec ds_write_b128 aligned by 2: 24.1 sec ds_write2_b64 aligned by 2: 25.1 sec ds_write2_b32 * 2 aligned by 2: 43.7 sec ds_write_b128 aligned by 4: 14.4 sec ds_write2_b64 aligned by 4: 25.1 sec ds_write2_b32 * 2 aligned by 4: 7.6 sec ds_write_b128 aligned by 8: 14.4 sec ds_write2_b64 aligned by 8: 7.1 sec ds_write2_b32 * 2 aligned by 8: 7.6 sec ds_read_b128 aligned by 16: 6.2 sec ds_read2_b64 aligned by 16: 6.3 sec ds_read2_b32 * 2 aligned by 16: 7.5 sec ds_read_b128 aligned by 1: 12.5 sec ds_read2_b64 aligned by 1: 24.0 sec ds_read2_b32 * 2 aligned by 1: 43.6 sec ds_read_b128 aligned by 2: 12.5 sec ds_read2_b64 aligned by 2: 24.0 sec ds_read2_b32 * 2 aligned by 2: 43.6 sec ds_read_b128 aligned by 4: 12.5 sec ds_read2_b64 aligned by 4: 24.0 sec ds_read2_b32 * 2 aligned by 4: 7.5 sec ds_read_b128 aligned by 8: 12.5 sec ds_read2_b64 aligned by 8: 6.3 sec ds_read2_b32 * 2 aligned by 8: 7.5 sec ``` Differential Revision: https://reviews.llvm.org/D123634	2022-04-12 16:07:13 -07:00
Matt Arsenault	788f94f731	AMDGPU: Don't use unreachable on stores to unhandled address space For stores to constant address space, this will now consistently hit a selection error instead of hitting unreachable in an asserts build. I'm not sure what we should really do here. We could either just codegen as if it were global, delete the instruction, or declare the IR invalid (we really should have a target IR verifier to enforce it).	2022-04-12 17:31:50 -04:00
Stanislav Mekhanoshin	3870b36025	[AMDGPU] Split unaligned 3 DWORD DS operations I have written a minitest to check the performance. Overall the benefit of aligned b96 operations on data which is not known but happens to be aligned is small, while performance hit of using b96 operations on a really unaligned memory is high. The only exception is when data is not aligned even by 4, it is better to use b96 in this case. Here is the test output on Vega and Navi: ``` Using platform: AMD Accelerated Parallel Processing Using device: gfx900:xnack- ds_write_b96 aligned: 3.4 sec ds_write_b32 + ds_write_b64 aligned: 4.5 sec ds_write_b32 * 3 aligned: 4.8 sec ds_write_b96 misaligned by 1: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 1: 7.2 sec ds_write_b32 * 3 misaligned by 1: 10.0 sec ds_write_b96 misaligned by 2: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 2: 7.2 sec ds_write_b32 * 3 misaligned by 2: 10.1 sec ds_write_b96 misaligned by 4: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 4: 4.2 sec ds_write_b32 * 3 misaligned by 4: 4.9 sec ds_write_b96 misaligned by 8: 4.8 sec ds_write_b32 + ds_write_b64 misaligned by 8: 4.6 sec ds_write_b32 * 3 misaligned by 8: 4.9 sec ds_read_b96 aligned: 3.3 sec ds_read_b32 + ds_read_b64 aligned: 4.9 sec ds_read_b32 * 3 aligned: 2.6 sec ds_read_b96 misaligned by 1: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 1: 7.2 sec ds_read_b32 * 3 misaligned by 1: 10.1 sec ds_read_b96 misaligned by 2: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 2: 7.2 sec ds_read_b32 * 3 misaligned by 2: 10.1 sec ds_read_b96 misaligned by 4: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 4: 2.6 sec ds_read_b32 * 3 misaligned by 4: 2.6 sec ds_read_b96 misaligned by 8: 4.1 sec ds_read_b32 + ds_read_b64 misaligned by 8: 4.9 sec ds_read_b32 * 3 misaligned by 8: 2.6 sec Using platform: AMD Accelerated Parallel Processing Using device: gfx1030 ds_write_b96 aligned: 4.1 sec ds_write_b32 + ds_write_b64 aligned: 13.0 sec ds_write_b32 * 3 aligned: 4.5 sec ds_write_b96 misaligned by 1: 12.5 sec ds_write_b32 + ds_write_b64 misaligned by 1: 22.0 sec ds_write_b32 * 3 misaligned by 1: 31.5 sec ds_write_b96 misaligned by 2: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 2: 22.0 sec ds_write_b32 * 3 misaligned by 2: 31.5 sec ds_write_b96 misaligned by 4: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 4: 4.0 sec ds_write_b32 * 3 misaligned by 4: 4.5 sec ds_write_b96 misaligned by 8: 12.4 sec ds_write_b32 + ds_write_b64 misaligned by 8: 13.0 sec ds_write_b32 * 3 misaligned by 8: 4.5 sec ds_read_b96 aligned: 3.8 sec ds_read_b32 + ds_read_b64 aligned: 12.8 sec ds_read_b32 * 3 aligned: 4.4 sec ds_read_b96 misaligned by 1: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 1: 21.8 sec ds_read_b32 * 3 misaligned by 1: 31.5 sec ds_read_b96 misaligned by 2: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 2: 21.9 sec ds_read_b32 * 3 misaligned by 2: 31.5 sec ds_read_b96 misaligned by 4: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 4: 3.8 sec ds_read_b32 * 3 misaligned by 4: 4.5 sec ds_read_b96 misaligned by 8: 10.9 sec ds_read_b32 + ds_read_b64 misaligned by 8: 12.8 sec ds_read_b32 * 3 misaligned by 8: 4.5 sec ``` Fixes: SWDEV-330802 Differential Revision: https://reviews.llvm.org/D123524	2022-04-12 07:52:39 -07:00
Stanislav Mekhanoshin	b8e09f1553	[AMDGPU] Refactor LDS alignment checks. Move features/bugs checks into the single place allowsMisalignedMemoryAccessesImpl. This is mostly NFCI except for the order of selection in couple places. A separate change may be needed to stop lying about Fast. Differential Revision: https://reviews.llvm.org/D123343	2022-04-12 07:49:40 -07:00
Stanislav Mekhanoshin	16cf9e6dad	[AMDGPU] Fix handling of gfx10 LDS misaligned access bug It was only handled for FLAT initially because we did not have unaligned DS instructions lowering. Now it is implemented but the bug is not handled. Differential Revision: https://reviews.llvm.org/D123338	2022-04-07 15:08:29 -07:00
Matt Arsenault	e6012c8e0f	AMDGPU: Handle private atomics Use new NotAtomic expansion to turn these into the equivalent non-atomic operations. Independent lanes cannot access the private memory of other lanes, so there's no possibility for synchronization. These don't really appear directly in user code, but InferAddressSpaces can make these appear after optimizations. Fixes issues 54693 and 54274.	2022-04-06 22:47:19 -04:00
Stanislav Mekhanoshin	a41a676e8a	[AMDGPU] Check SI LDS offset bug in the allowsMisalignedMemoryAccesses Differential Revision: https://reviews.llvm.org/D123268	2022-04-06 18:05:02 -07:00
Shao-Ce SUN	662b9fa02c	[NFC][CodeGen] Add a setTargetDAGCombine use ArrayRef Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D122557	2022-03-29 09:53:24 +08:00
Stanislav Mekhanoshin	6e3e14f600	[AMDGPU] Support gfx940 smfmac instructions Differential Revision: https://reviews.llvm.org/D122191	2022-03-24 12:40:42 -07:00
Jon Chesterfield	bcbd4cf1f2	Revert "[amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal" Reconsidered, better to handle per-function state in the constructor as before. This reverts commit `98e474c1b3`.	2022-03-20 00:58:26 +00:00
Jon Chesterfield	98e474c1b3	[amdgpu][nfc] Pass function instead of module to allocateModuleLDSGlobal	2022-03-19 16:42:17 +00:00
Changpeng Fang	dd5895cc39	AMDGPU: Use the implicit kernargs for code object version 5 Summary: Specifically, for trap handling, for targets that do not support getDoorbellID, we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1]. To get aperture bases when targets do not have aperture registers, we load private_base or shared_base directly from the implicit kernarg. In clang, we use implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}. Reviewers: arsenm, sameerds, yaxunl Differential Revision: https://reviews.llvm.org/D120265	2022-03-17 14:12:36 -07:00
Jay Foad	313f306b26	[AMDGPU] Stop using getMinimalPhysRegClass in LowerFormalArguments NFCI. The motivation for this is avoid problems in future if we add new classes containing only a subset of all VGPRs, or a subset of all SGPRs. getMinimalPhysRegClass would favour these smaller classes, which is not what we want here. Differential Revision: https://reviews.llvm.org/D121914	2022-03-17 15:19:17 +00:00
Shengchen Kan	37b378386e	[NFC][CodeGen] Rename some functions in MachineInstr.h and remove duplicated comments	2022-03-16 20:25:42 +08:00

1 2 3 4 5 ...

1065 Commits