llvm-project

Commit Graph

Author	SHA1	Message	Date
Matt Arsenault	1022c0dfde	AMDGPU: Decompose all values to 32-bit pieces for calling conventions This is the more natural lowering, and presents more opportunities to reduce 64-bit ops to 32-bit. This should also help avoid issues graphics shaders have had with 64-bit values, and simplify argument lowering in globalisel. llvm-svn: 366578	2019-07-19 13:57:44 +00:00
Stanislav Mekhanoshin	6e0fa292c2	[AMDGPU] Change register type for v32 vectors When it is AReg_1024 this results in unnecessary copying into AGPRs of a 32 element vectors even though they are not intended for an mfma instruction. Differential Revision: https://reviews.llvm.org/D64815 llvm-svn: 366252	2019-07-16 20:06:00 +00:00
Matt Arsenault	49169a963e	AMDGPU: Add 24-bit mul intrinsics Insert these during codegenprepare. This works around a DAG issue where generic combines eliminate the and asserting the high bits are zero, which then exposes an unknown read source to the mul combine. It doesn't worth the hassle of trying to insert an AssertZext or something to try to deal with it. llvm-svn: 366094	2019-07-15 17:50:31 +00:00
Stanislav Mekhanoshin	1dfae6fe50	[AMDGPU] use v32f32 for 3 mfma intrinsics These should really use v32f32, but were defined as v32i32 due to the lack of the v32f32 type. Differential Revision: https://reviews.llvm.org/D64667 llvm-svn: 365972	2019-07-12 22:42:01 +00:00
Matt Arsenault	51a05d72ae	AMDGPU: Drop remnants of byval support for shaders Before 2018, mesa used to use byval interchangably with inreg, which didn't really make sense. Fix tests still using it to avoid breaking in a future commit. llvm-svn: 365953	2019-07-12 20:12:17 +00:00
Stanislav Mekhanoshin	e67cc380a8	[AMDGPU] gfx908 mfma support Differential Revision: https://reviews.llvm.org/D64584 llvm-svn: 365824	2019-07-11 21:19:33 +00:00
Stanislav Mekhanoshin	e93279fd1b	[AMDGPU] gfx908 atomic fadd and atomic pk_fadd Differential Revision: https://reviews.llvm.org/D64435 llvm-svn: 365717	2019-07-11 00:10:17 +00:00
Stanislav Mekhanoshin	50d7f46460	[AMDGPU] gfx908 mAI instructions, MC part Differential Revision: https://reviews.llvm.org/D64446 llvm-svn: 365563	2019-07-09 21:43:09 +00:00
Christudasan Devadasan	b2d24bd540	[AMDGPU] Created a sub-register class for the return address operand in the return instruction. Function return instruction lowering, currently uses the fixed register pair s[30:31] for holding the return address. It can be any SGPR pair other than the CSRs. Created an SGPR pair sub-register class exclusive of the CSRs, and used this regclass while lowering the return instruction. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D63924 llvm-svn: 365512	2019-07-09 16:48:42 +00:00
Matt Arsenault	71dfb7ec5c	AMDGPU: Make s34 the FP register Make the FP register callee saved. This is tricky because now the FP needs to be spilled in the prolog relative to the incoming SP register, rather than the frame register used throughout the rest of the function. I don't like how this bypassess the standard mechanism for CSR spills just to get the correct insert point. I may look for a better solution, since all CSR VGPRs may also need to have all lanes activated. Another option might be to make getFrameIndexReference change the base register if the frame index is a CSR, and then try to figure out the right insertion point in emitProlog. If there is a free VGPR lane available for SGPR spilling, try to use it for the FP. If that would require intrtoducing a new VGPR spill, try to use a free call clobbered SGPR. Only fallback to introducing a new VGPR spill as a last resort. This also doesn't attempt to handle SGPR spilling with scalar stores. llvm-svn: 365372	2019-07-08 19:03:38 +00:00
Tim Renouf	5816889c74	[AMDGPU] Custom lower INSERT_SUBVECTOR v3, v4, v5, v8 Summary: Since the changes to introduce vec3 and vec5, INSERT_VECTOR for these sizes has been marked "expand", which made LegalizeDAG lower it to loads and stores via a stack slot. The code got optimized a bit later, but the now-unused stack slot was never deleted. This commit avoids that problem by custom lowering INSERT_SUBVECTOR into an EXTRACT_VECTOR_ELT and INSERT_VECTOR_ELT for each element in the subvector to insert. V2: Addressed review comments re test. Differential Revision: https://reviews.llvm.org/D63160 Change-Id: I9e3c13e36f68cfa3431bb9814851cc1f673274e1 llvm-svn: 365148	2019-07-04 17:38:24 +00:00
Matt Arsenault	5fe851b6cd	AMDGPU: Custom lower vector_shuffle for v4i16/v4f16 Ordinarily it is lowered as a build_vector of each extract_vector_elt, which in turn get lowered to bitcasts and bit shifts. Very little understand the lowered extract pattern, resulting in much worse code. We treat concat_vectors of v2i16 as legal, so prefer that. llvm-svn: 364959	2019-07-02 19:15:45 +00:00
Nicolai Haehnle	10c911db63	AMDGPU/GFX10: implement ds_ordered_count changes Summary: ds_ordered_count can now simultaneously operate on up to 4 dwords in a single instruction, which are taken from (and returned to) lanes 0..3 of a single VGPR. Change-Id: I19b6e7b0732b617c10a779a7f9c0303eec7dd276 Reviewers: mareko, arsenm, rampitec Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D63716 llvm-svn: 364815	2019-07-01 17:17:52 +00:00
Nicolai Haehnle	4dc3b2bf95	AMDGPU: Support GDS atomics Summary: Original patch by Marek Olšák Change-Id: Ia97d5d685a63a377d86e82942436d1fe6e429bab Reviewers: mareko, arsenm, rampitec Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, jfb, Petar.Avramovic, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D63452 llvm-svn: 364814	2019-07-01 17:17:45 +00:00
Stanislav Mekhanoshin	07fd88d735	[AMDGPU] Packed thread ids in function call ABI Differential Revision: https://reviews.llvm.org/D63851 llvm-svn: 364619	2019-06-28 01:52:13 +00:00
Nicolai Haehnle	2710171a15	AMDGPU: Write LDS objects out as global symbols in code generation Summary: The symbols use the processor-specific SHN_AMDGPU_LDS section index introduced with a previous change. The linker is then expected to resolve relocations, which are also emitted. Initially disabled for HSA and PAL environments until they have caught up in terms of linker and runtime loader. Some notes: - The llvm.amdgcn.groupstaticsize intrinsics can no longer be lowered to a constant at compile times, which means some tests can no longer be applied. The current "solution" is a terrible hack, but the intrinsic isn't used by Mesa, so we can keep it for now. - We no longer know the full LDS size per kernel at compile time, which means that we can no longer generate a relevant error message at compile time. It would be possible to add a check for the size of individual variables, but ultimately the linker will have to perform the final check. Change-Id: If66dbf33fccfbf3609aefefa2558ac0850d42275 Reviewers: arsenm, rampitec, t-tye, b-sumner, jsjodin Subscribers: qcolombet, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D61494 llvm-svn: 364297	2019-06-25 11:52:30 +00:00
Matt Arsenault	22e3dc60a0	AMDGPU: Fix not using s33 for scratch wave offset in kernels Fixes missing piece from r363990. llvm-svn: 364099	2019-06-21 20:04:02 +00:00
Matt Arsenault	d88db6d7fc	AMDGPU: Always use s33 for global scratch wave offset Every called function could possibly need this to calculate the absolute address of stack objectst, and this avoids inserting a copy around every call site in the kernel. It's also somewhat cleaner to keep this in a callee saved SGPR. llvm-svn: 363990	2019-06-20 21:58:24 +00:00
Matt Arsenault	740322f1eb	AMDGPU: Add intrinsics for DS GWS semaphore instructions llvm-svn: 363983	2019-06-20 21:11:42 +00:00
Matt Arsenault	8ad1decf45	AMDGPU: Insert mem_viol check loop around GWS pre-GFX9 It is necessary to emit this loop around GWS operations in case the wave is preempted pre-GFX9. llvm-svn: 363979	2019-06-20 20:54:32 +00:00
Matt Arsenault	e4c2e9b016	AMDGPU: Consolidate some getGeneration checks This is incomplete, and ideally these would all be removed, but it's better to localize them to the subtarget first with comments about what they're for. llvm-svn: 363902	2019-06-19 23:54:58 +00:00
Matt Arsenault	4d55d024be	Reapply "AMDGPU: Add ds_gws_init / ds_gws_barrier intrinsics" This reapplies r363678, using the correct chain for the CopyToReg for v0. glueCopyToM0 counterintuitively changes the operands of the original node. llvm-svn: 363870	2019-06-19 19:55:27 +00:00
Simon Pilgrim	128ce93c60	Revert rL363678 : AMDGPU: Add ds_gws_init / ds_gws_barrier intrinsics There may or may not be additional work to handle this correctly on SI/CI. ........ Breaks EXPENSIVE_CHECKS buildbots - http://lab.llvm.org:8011/builders/llvm-clang-x86_64-expensive-checks-win/builds/78/ llvm-svn: 363797	2019-06-19 13:00:54 +00:00
Matt Arsenault	8d35dcd703	AMDGPU: Add ds_gws_init / ds_gws_barrier intrinsics There may or may not be additional work to handle this correctly on SI/CI. llvm-svn: 363678	2019-06-18 13:19:57 +00:00
Nicolai Haehnle	6d71be4e67	AMDGPU: Be explicit about whether the high-word in SI_PC_ADD_REL_OFFSET is 0 Summary: Instead of encoding a high-word of 0 using a fake TargetGlobalAddress, just use a literal target constant. This simplifies some subsequent changes. The generated assembly is now more explicit about the kind of relocation that is to be used. Change-Id: I066835202d23b5941fa7a358eb4b89e9b71ab6f8 Reviewers: arsenm, rampitec Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D61491 llvm-svn: 363516	2019-06-16 17:32:01 +00:00
Nicolai Haehnle	490e83cd43	AMDGPU/GFX10: Support DLC bit in llvm.amdgcn.s.buffer.load intrinsic Summary: Change-Id: Ie4c971462a7749740938c687144e77441dac2539 Reviewers: rampitec, arsenm Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D62486 Change-Id: Iae59523edd75c74918d2118df6571a7b671717a0 llvm-svn: 363514	2019-06-16 17:14:12 +00:00
Stanislav Mekhanoshin	5250021672	[AMDGPU] gfx10 conditional registers handling This is cpp source part of wave32 support, excluding overriden getRegClass(). Differential Revision: https://reviews.llvm.org/D63351 llvm-svn: 363513	2019-06-16 17:13:09 +00:00
Stanislav Mekhanoshin	68a2fef9ae	[AMDGPU] gfx1010 wave32 icmp/fcmp intrinsic changes for wave32 Differential Revision: https://reviews.llvm.org/D63301 llvm-svn: 363339	2019-06-13 23:47:36 +00:00
Stanislav Mekhanoshin	5f581c9f08	[AMDGPU] gfx1010 premlane instructions Differential Revision: https://reviews.llvm.org/D63202 llvm-svn: 363185	2019-06-12 17:52:51 +00:00
Simon Pilgrim	4e0648a541	[TargetLowering] Add MachineMemOperand::Flags to allowsMemoryAccess tests (PR42123) As discussed on D62910, we need to check whether particular types of memory access are allowed, not just their alignment/address-space. This NFC patch adds a MachineMemOperand::Flags argument to allowsMemoryAccess and allowsMisalignedMemoryAccesses, and wires up calls to pass the relevant flags to them. If people are happy with this approach I can then update X86TargetLowering::allowsMisalignedMemoryAccesses to handle misaligned NT load/stores. Differential Revision: https://reviews.llvm.org/D63075 llvm-svn: 363179	2019-06-12 17:14:03 +00:00
Simon Pilgrim	266f43964e	[TargetLowering] Add allowsMemoryAccess(MachineMemOperand) helper wrapper. NFCI. As suggested by @arsenm on D63075 - this adds a TargetLowering::allowsMemoryAccess wrapper that takes a Load/Store node's MachineMemOperand to handle the AddressSpace/Alignment arguments and will also implicitly handle the MachineMemOperand::Flags change in D63075. llvm-svn: 363048	2019-06-11 11:00:23 +00:00
Piotr Sobczak	9b11e93d90	[AMDGPU] Optimize image_[load\|store]_mip Summary: Replace image_load_mip/image_store_mip with image_load/image_store if lod is 0. Reviewers: arsenm, nhaehnle Reviewed By: arsenm Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D63073 llvm-svn: 362957	2019-06-10 15:58:51 +00:00
Alexander Timofeev	37bd9bd137	[AMDGPU] Partial revert for the `ba447bae74` "Divergence driven ISel. Assign register class for cross block values according to the divergence." that discovered the design flaw leading to several issues that required to be solved before. This change reverts AMDGPU specific changes and keeps common part unaffected. llvm-svn: 362749	2019-06-06 21:13:02 +00:00
Matt Arsenault	34c8b835b1	AMDGPU: Don't fix emergency stack slot at offset 0 This forced the caller to be aware of this, which is an ugly ABI feature. Partially reverts r295877. The original reasons for doing this are mostly fixed. Alloca is now in a non-0 address space, so it should be OK to have 0 as a valid pointer. Since we treat the absolute address as the pointer value, this part only really needed to apply to kernels. Since r357093, we avoid the need to increment/decrement the offset register in more cases, and since r354816 the scavenger can fail without spilling, so it's less critical that we try to avoid an offset that fits in the MUBUF offset. Restrict to callable functions for now to split this into 2 steps to limit thte number of test updates and in case anything breaks. llvm-svn: 362665	2019-06-05 22:37:50 +00:00
Matt Arsenault	b812b7a45e	AMDGPU: Invert frame index offset interpretation Since the beginning, the offset of a frame index has been consistently interpreted backwards. It was treating it as an offset from the scratch wave offset register as a frame register. The correct interpretation is the offset from the SP on entry to the function, before the prolog. Frame index elimination then should select either SP or another register as an FP. Treat the scratch wave offset on kernel entry as the pre-incremented SP. Rely more heavily on the standard hasFP and frame pointer elimination logic, and clean up the private reservation code. This saves a copy in most callee functions. The kernel prolog emission code is still kind of a mess relying on checking the uses of physical registers, which I would prefer to eliminate. Currently selection directly emits MUBUF instructions, which require using a reference to some register. Use the register chosen for SP, and then ignore this later. This should probably be cleaned up to use pseudos that don't refer to any specific base register until frame index elimination. Add a workaround for shaders using large numbers of SGPRs. I'm not sure these cases were ever working correctly, since as far as I can tell the logic for figuring out which SGPR is the scratch wave offset doesn't match up with the shader input initialization in the shader programming guide. llvm-svn: 362661	2019-06-05 22:20:47 +00:00
Matt Arsenault	8dbeb9256c	TTI: Improve default costs for addrspacecast For some reason multiple places need to do this, and the variant the loop unroller and inliner use was not handling it. Also, introduce a new wrapper to be slightly more precise, since on AMDGPU some addrspacecasts are free, but not no-ops. llvm-svn: 362436	2019-06-03 18:41:34 +00:00
Aakanksha Patil	d5443f8c21	AMDGPU: Return address lowering The patch computes the return address for the current function. Differential revision: https://reviews.llvm.org/D59666 llvm-svn: 362001	2019-05-29 18:20:11 +00:00
Michael Liao	5fc1dfa784	[AMDGPU] Correct the handling of inlineasm output registers. Summary: - There's a regression due to the cross-block RC assignment. Use the proper way to derive the output register RC in inline asm. Reviewers: rampitec, alex-t Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, dstuttard, tpr, t-tye, eraman, hiraditya, llvm-commits, yaxunl Tags: #llvm Differential Revision: https://reviews.llvm.org/D62537 llvm-svn: 361868	2019-05-28 19:37:09 +00:00
Alexander Timofeev	ba447bae74	[AMDGPU] Divergence driven ISel. Assign register class for cross block values according to the divergence. Details: To make instruction selection really divergence driven it is necessary to assign the correct register classes to the cross block values beforehand. For the divergent targets same value type requires different register classes dependent on the value divergence. Reviewers: rampitec, nhaehnle Differential Revision: https://reviews.llvm.org/D59990 This commit was reverted because of the build failure. The reason was mlformed patch. Build failure fixed. llvm-svn: 361741	2019-05-26 20:33:26 +00:00
Peter Collingbourne	3b93737446	Revert r361644, "[AMDGPU] Divergence driven ISel. Assign register class for cross block values according to the divergence." Broke sanitizer bots: http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux/builds/21694/steps/bootstrap%20clang/logs/stdio http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/32478/steps/check-llvm%20asan/logs/stdio llvm-svn: 361688	2019-05-25 01:52:38 +00:00
Alexander Timofeev	dffedea014	[AMDGPU] Divergence driven ISel. Assign register class for cross block values according to the divergence. Details: To make instruction selection really divergence driven it is necessary to assign the correct register classes to the cross block values beforehand. For the divergent targets same value type requires different register classes dependent on the value divergence. Reviewers: rampitec, nhaehnle Differential Revision: https://reviews.llvm.org/D59990 llvm-svn: 361644	2019-05-24 15:32:18 +00:00
Matt Arsenault	5c714cbdd8	AMDGPU: Correct maximum possible private allocation size We were assuming a much larger possible per-wave visible stack allocation than is possible: `faa3ae5138/src/core/runtime/amd_gpu_agent.cpp (L70)` Based on this, we can assume the high 15 bits of a frame index or sret are 0. The frame index value is the per-lane offset, so the maximum frame index value is MAX_WAVE_SCRATCH / wavesize. Remove the corresponding subtarget feature and option that made this configurable. llvm-svn: 361541	2019-05-23 19:38:14 +00:00
Matt Arsenault	99e6f4d11a	AMDGPU: Introduce TokenFactor for ABI register copies in call sequence The call was missing chain dependencies on the pre-call copies. I don't think this was causing any real issues however. llvm-svn: 360906	2019-05-16 15:10:27 +00:00
Stanislav Mekhanoshin	28a1936f6d	[AMDGPU] gfx1010: use fmac instructions Differential Revision: https://reviews.llvm.org/D61527 llvm-svn: 359959	2019-05-04 04:20:37 +00:00
Stanislav Mekhanoshin	93f15c922f	[AMDGPU] gfx1010 loop alignment Differential Revision: https://reviews.llvm.org/D61529 llvm-svn: 359935	2019-05-03 21:17:29 +00:00
Sanjay Patel	284472be6d	[SelectionDAG] remove constant folding limitations based on FP exceptions We don't have FP exception limits in the IR constant folder for the binops (apart from strict ops), so it does not make sense to have them here in the DAG either. Nothing else in the backend tries to preserve exceptions (again outside of strict ops), so I don't see how this could have ever worked for real code that cares about FP exceptions. There are still cases (examples: unary opcodes in SDAG, FMA in IR) where we are trying (at least partially) to preserve exceptions without even asking if the target supports FP exceptions. Those should be corrected in subsequent patches. Real support for FP exceptions requires several changes to handle the constrained/strict FP ops. Differential Revision: https://reviews.llvm.org/D61331 llvm-svn: 359791	2019-05-02 14:47:59 +00:00
Stanislav Mekhanoshin	64399da8b8	[AMDGPU] gfx1010 lost VOP2 forms of some add/sub Add legalization of V_ADD_I32, V_SUB_I32, V_SUBREV_I32. Differential Revision: llvm-svn: 359757	2019-05-02 04:26:35 +00:00
Stanislav Mekhanoshin	692560dc98	[AMDGPU] gfx1010 MIMG implementation Differential Revision: https://reviews.llvm.org/D61339 llvm-svn: 359698	2019-05-01 16:32:58 +00:00
Stanislav Mekhanoshin	a224f68a10	[AMDGPU] gfx1010 DS implementation Differential Revision: https://reviews.llvm.org/D61332 llvm-svn: 359696	2019-05-01 16:11:11 +00:00
Sjoerd Meijer	180f1ae57c	[TargetLowering] Change getOptimalMemOpType to take a function attribute list The MachineFunction wasn't used in getOptimalMemOpType, but more importantly, this allows reuse of findOptimalMemOpLowering that is calling getOptimalMemOpType. This is the groundwork for the changes in D59766 and D59787, that allows implementation of TTI::getMemcpyCost. Differential Revision: https://reviews.llvm.org/D59785 llvm-svn: 359537	2019-04-30 08:38:12 +00:00
Matt Arsenault	055e4dce45	AMDGPU: Remove dx10-clamp from subtarget features Since this can be set with s_setreg*, it should not be a subtarget property. Set a default based on the calling convention, and Introduce a new amdgpu-dx10-clamp attribute to override this if desired. Also introduce a new amdgpu-ieee attribute to match. The values need to match to allow inlining. I think it is OK for the caller's dx10-clamp attribute to override the callee, but there doesn't appear to be the infrastructure to do this currently without definining the attribute in the generic Attributes.td. Eventually the calling convention lowering will need to insert a mode switch somewhere for these. llvm-svn: 357302	2019-03-29 19:14:54 +00:00
Tim Renouf	6f0191a55a	[AMDGPU] Use three- and five-dword result type in image ops Some image ops return three or five dwords. Previously, we modeled that with a 4 or 8 dword register class. The register allocator could cleverly spot that some subregs were dead and allocate something else there, but that caused the de-optimization that waitcnt insertion would think that the result was used immediately. This commit allows such an image op to have a result with a three or five dword result, avoiding the above de-optimization. Differential Revision: https://reviews.llvm.org/D58905 Change-Id: I3651211bbd7ed22721ee7b9fefd7bcc60a809d8b llvm-svn: 356757	2019-03-22 15:21:11 +00:00
Tim Renouf	677387d8dc	[AMDGPU] Implemented dwordx3 variants of buffer/tbuffer load/store intrinsics Now we have vec3 MVTs, this commit implements dwordx3 variants of the buffer intrinsics. On gfx6, a dwordx3 buffer load intrinsic is implemented as a dwordx4 instruction, and a dwordx3 buffer store intrinsic is not supported. We need to support the dwordx3 load intrinsic because it is generated by subtarget-unaware code in InstCombine. Differential Revision: https://reviews.llvm.org/D58904 Change-Id: I016729d8557b98a52f529638ae97c340a5922a4e llvm-svn: 356755	2019-03-22 14:58:02 +00:00
Tim Renouf	033f99a2e5	[AMDGPU] Added v5i32 and v5f32 register classes They are not used by anything yet, but a subsequent commit will start using them for image ops that return 5 dwords. Differential Revision: https://reviews.llvm.org/D58903 Change-Id: I63e1904081e39a6d66e4eb96d51df25ad399d271 llvm-svn: 356735	2019-03-22 10:11:21 +00:00
Tim Renouf	361b5b2193	[AMDGPU] Support for v3i32/v3f32 Added support for dwordx3 for most load/store types, but not DS, and not intrinsics yet. SI (gfx6) does not have dwordx3 instructions, so they are not enabled there. Some of this patch is from Matt Arsenault, also of AMD. Differential Revision: https://reviews.llvm.org/D58902 Change-Id: I913ef54f1433a7149da8d72f4af54dbb13436bd9 llvm-svn: 356659	2019-03-21 12:01:21 +00:00
David Stuttard	fc2a747345	[AMDGPU] Allow MIMG with no uses in adjustWritemask in isel Summary: If an MIMG instruction has managed to get through to adjustWritemask in isel but has no uses (and doesn't enable TFC) then prevent an assertion by not attempting to adjust the writemask. The instruction will be removed anyway. Change-Id: I9a5dba6bafe1f35ac99c1b73df390936e2ac27a7 Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D58964 llvm-svn: 356540	2019-03-20 09:29:55 +00:00
Ryan Taylor	00e063ab92	[AMDGPU] Add buffer/load 8/16 bit overloaded intrinsics Summary: Add buffer store/load 8/16 overloaded intrinsics for buffer, raw_buffer and struct_buffer Change-Id: I166a29f071b2ff4e4683fb0392564b1f223ac61d Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D59265 llvm-svn: 356465	2019-03-19 16:07:00 +00:00
Neil Henning	e85f6bd64f	[AMDGPU] Ban i8 min3 promotion. I found this really weird WWM-related case whereby through the WWM transformations our isel lowering was trying to promote 2 min's into a min3 for the i8 type, which our hardware doesn't support. The new min3_i8.ll test case would previously spew the error: PromoteIntegerResult #0: t69: i8 = SMIN3 t70, Constant:i8<0>, t68 Before the simple fix to our isel lowering to not do it for i8 MVT's. Differential Revision: https://reviews.llvm.org/D59543 llvm-svn: 356464	2019-03-19 15:50:24 +00:00
Tim Renouf	2e94f6e584	[AMDGPU] Asm/disasm v_cndmask_b32_e64 with abs/neg source modifiers This commit allows v_cndmask_b32_e64 with abs, neg source modifiers on src0, src1 to be assembled and disassembled. This does appear to be allowed, even though they are floating point modifiers and the operand type is b32. To do this, I added src0_modifiers and src1_modifiers to the MachineInstr, which involved fixing up several places in codegen and mir tests. Differential Revision: https://reviews.llvm.org/D59191 Change-Id: I69bf4a8c73ebc65744f6110bb8fc4e937d79fbea llvm-svn: 356398	2019-03-18 19:25:39 +00:00
Neil Henning	523dab0788	[AMDGPU] Add an experimental buffer fat pointer address space. Add an experimental buffer fat pointer address space that is currently unhandled in the backend. This commit reserves address space 7 as a non-integral pointer repsenting the 160-bit fat pointer (128-bit buffer descriptor + 32-bit offset) that is heavily used in graphics workloads using the AMDGPU backend. Differential Revision: https://reviews.llvm.org/D58957 llvm-svn: 356373	2019-03-18 14:44:28 +00:00
Matt Arsenault	bc6d07ca46	MIR: Allow targets to serialize MachineFunctionInfo This has been a very painful missing feature that has made producing reduced testcases difficult. In particular the various registers determined for stack access during function lowering were necessary to avoid undefined register errors in a large percentage of cases. Implement a subset of the important fields that need to be preserved for AMDGPU. Most of the changes are to support targets parsing register fields and properly reporting errors. The biggest sort-of bug remaining is for fields that can be initialized from the IR section will be overwritten by a default initialized machineFunctionInfo section. Another remaining bug is the machineFunctionInfo section is still printed even if empty. llvm-svn: 356215	2019-03-14 22:54:43 +00:00
Matt Arsenault	caf1316f71	IR: Add immarg attribute This indicates an intrinsic parameter is required to be a constant, and should not be replaced with a non-constant value. Add the attribute to all AMDGPU and generic intrinsics that comments indicate it should apply to. I scanned other target intrinsics, but I don't see any obvious comments indicating which arguments are intended to be only immediates. This breaks one questionable testcase for the autoupgrade. I'm unclear on whether the autoupgrade is supposed to really handle declarations which were never valid. The verifier fails because the attributes now refer to a parameter past the end of the argument list. llvm-svn: 355981	2019-03-12 21:02:54 +00:00
Matt Arsenault	26e76ef0e2	DAG: Don't try to cluster loads with tied inputs This avoids breaking possible value dependencies when sorting loads by offset. AMDGPU has some load instructions that write into the high or low bits of the destination register, and have a tied input for the other input bits. These can easily have the same base pointer, but be a swizzle so the high address load needs to come first. This was inserting glue forcing the opposite ordering, producing a cycle the InstrEmitter would assert on. It may be potentially expensive to look for the dependency between the other loads, so just skip any where this could happen. Fixes bug 40936 by reverting r351379, which added a hacky attempt to fix this by adding chains in this case, which I think was just working around broken glue before the InstrEmitter. The core of the patch is re-implementing the fix for that problem. llvm-svn: 355728	2019-03-08 20:46:15 +00:00
Dmitry Preobrazhensky	ef92035827	[AMDGPU][MC][GFX8+] Added syntactic sugar for 'vgpr index' operand of instructions s_set_gpr_idx_on and s_set_gpr_idx_mode See bug 39331: https://bugs.llvm.org/show_bug.cgi?id=39331 Reviewers: artem.tamazov, arsenm Differential Revision: https://reviews.llvm.org/D58288 llvm-svn: 354969	2019-02-27 13:12:12 +00:00
Stanislav Mekhanoshin	da1628eb67	[AMDGPU] Fixed hang during DAG combine SITargetLowering::reassociateScalarOps() does not touch constants so that DAGCombiner::ReassociateOps() does not revert the combine. However a global address is not a ConstantSDNode. Switched to the method used by DAGCombiner::ReassociateOps() itself to detect constants. Differential Revision: https://reviews.llvm.org/D58695 llvm-svn: 354926	2019-02-26 20:56:25 +00:00
Matt Arsenault	aa6fb4c45e	AMDGPU: Remove debugger related subtarget features As far as I know these aren't needed anymore. llvm-svn: 354634	2019-02-21 23:27:46 +00:00
Stanislav Mekhanoshin	42e229e130	[AMDGPU] fix commuted case of sub combine Differential Revision: https://reviews.llvm.org/D58481 llvm-svn: 354543	2019-02-21 02:58:00 +00:00
Stanislav Mekhanoshin	871821f786	[AMDGPU] Ressociate 'add (add x, y), z' to use SALU Reassociate adds to collect scalar operands in a single instruction when possible. That will result in a scalar add followed by vector instead of two vector adds, thus better utilizing SALU. Differential Revision: https://reviews.llvm.org/D58220 llvm-svn: 354066	2019-02-14 22:11:25 +00:00
Stanislav Mekhanoshin	0e858b028d	[AMDGPU] Split dot-insts feature Differential Revision: https://reviews.llvm.org/D57971 llvm-svn: 353587	2019-02-09 00:34:21 +00:00
Craig Topper	784929d045	Implementation of asm-goto support in LLVM This patch accompanies the RFC posted here: http://lists.llvm.org/pipermail/llvm-dev/2018-October/127239.html This patch adds a new CallBr IR instruction to support asm-goto inline assembly like gcc as used by the linux kernel. This instruction is both a call instruction and a terminator instruction with multiple successors. Only inline assembly usage is supported today. This also adds a new INLINEASM_BR opcode to SelectionDAG and MachineIR to represent an INLINEASM block that is also considered a terminator instruction. There will likely be more bug fixes and optimizations to follow this, but we felt it had reached a point where we would like to switch to an incremental development model. Patch by Craig Topper, Alexander Ivchenko, Mikhail Dvoretckii Differential Revision: https://reviews.llvm.org/D53765 llvm-svn: 353563	2019-02-08 20:48:56 +00:00
Matt Arsenault	a8b4339c2f	AMDGPU/GlobalISel: Legalize addrspacecast Use a placeholder constant for now on targets that need the load from the queue ptr. llvm-svn: 353497	2019-02-08 02:40:47 +00:00
Scott Linder	e2c5847414	[AMDGPU] Consider XOR in waterfall loop as a terminator Ensure the XOR in the waterfall loop for indirect addressing is considered a terminator. Differential Revision: https://reviews.llvm.org/D57703 llvm-svn: 353207	2019-02-05 19:50:32 +00:00
Scott Linder	d19d197221	[AMDGPU] Support emitting GOT relocations for function calls Differential Revision: https://reviews.llvm.org/D57416 llvm-svn: 353083	2019-02-04 20:00:07 +00:00
Tim Corringham	fa3e4e5b53	[AMDGPU] Fix for vector element insertion Summary: Incorrect code was generated when lowering insertelement operations for vectors with 8 or 16 bit elements. The value being inserted was not adjusted for the position of the element within the 32 bit word and so only the low element within each 32 bit word could receive the intended value. Fixed by simply replicating the value to each element of a congruent vector before the mask and or operation used to update the intended element. A number of affected LIT tests have been updated appropriately. before the mask & or into the intended Reviewers: arsenm, nhaehnle Reviewed By: arsenm Subscribers: llvm-commits, arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye Tags: #llvm Differential Revision: https://reviews.llvm.org/D57588 llvm-svn: 352885	2019-02-01 16:51:09 +00:00
Matt Arsenault	cdd191d9db	AMDGPU: Add DS append/consume intrinsics Since these pass the pointer in m0 unlike other DS instructions, these need to worry about whether the address is uniform or not. This assumes the address is dynamically uniform, and just uses readfirstlane to get a copy into an SGPR. I don't know if these have the same 16-bit add for the addressing mode offset problem on SI or not, but I've just assumed they do. Also includes some misc. changes to avoid test differences between the LDS and GDS versions. llvm-svn: 352422	2019-01-28 20:14:49 +00:00
Tim Corringham	824ca3f3dd	[AMDGPU] Add intrinsics for 16 bit interpolation Summary: Added the intrinsics llvm.amdgcn.interp.p1.f16() and llvm.amdgcn.interp.p2.f16() and related LIT test. The p1 intrinsic generates code appropriate for both 16 and 32 bank LDS. Reviewers: #amdgpu, dstuttard, arsenm, tpr Reviewed By: #amdgpu, arsenm Subscribers: jvesely, mgorny, arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D46754 llvm-svn: 352357	2019-01-28 13:48:59 +00:00
Matt Arsenault	a5840c3c39	Codegen support for atomicrmw fadd/fsub llvm-svn: 351851	2019-01-22 18:36:06 +00:00
Chandler Carruth	2946cd7010	Update the file headers across all of the LLVM projects in the monorepo to reflect the new license. We understand that people may be surprised that we're moving the header entirely to discuss the new license. We checked this carefully with the Foundation's lawyer and we believe this is the correct approach. Essentially, all code in the project is now made available by the LLVM project under our new license, so you will see that the license headers include that license only. Some of our contributors have contributed code under our old license, and accordingly, we have retained a copy of our old license notice in the top-level files in each project and repository. llvm-svn: 351636	2019-01-19 08:50:56 +00:00
Matt Arsenault	85af701e85	AMDGPU: Remove llvm.SI.load.const It's taken 3 years, but now all of the old AMDGPU and SI intrinsics are finally gone llvm-svn: 351586	2019-01-18 20:27:02 +00:00
Changpeng Fang	fe9269f804	AMDGPU: Adjust the chain for loads writing to the HI part of a register. Summary: For these loads that write to the HI part of a register, we should chain them to the op that writes to the LO part of the register to maintain the appropriate order. Reviewers: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D56454 llvm-svn: 351379	2019-01-16 21:32:53 +00:00
Marek Olsak	c5cec5e1fa	AMDGPU: Add llvm.amdgcn.ds.ordered.add & swap Reviewers: arsenm, nhaehnle Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D52944 llvm-svn: 351351	2019-01-16 15:43:53 +00:00
Marek Olsak	33eb4d947d	AMDGPU: Add a fast path for icmp.i1(src, false, NE) Summary: This allows moving the condition from the intrinsic to the standard ICmp opcode, so that LLVM can do simplifications on it. The icmp.i1 intrinsic is an identity for retrieving the SGPR mask. And we can also get the mask from and i1, or i1, xor i1. Reviewers: arsenm, nhaehnle Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D52060 llvm-svn: 351150	2019-01-15 02:13:18 +00:00
David Stuttard	f77079f892	[AMDGPU] Add support for TFE/LWE in image intrinsics. 2nd try TFE and LWE support requires extra result registers that are written in the event of a failure in order to detect that failure case. The specific use-case that initiated these changes is sparse texture support. This means that if image intrinsics are used with either option turned on, the programmer must ensure that the return type can contain all of the expected results. This can result in redundant registers since the vector size must be a power-of-2. This change takes roughly 6 parts: 1. Modify the instruction defs in tablegen to add new instruction variants that can accomodate the extra return values. 2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE (where the bulk of the work for these instruction types is now done) 3. Extra verification code to catch cases where intrinsics have been used but insufficient return registers are used. 4. Modification to the adjustWritemask optimisation to account for TFE/LWE being enabled (requires extra registers to be maintained for error return value). 5. An extra pass to zero initialize the error value return - this is because if the error does not occur, the register is not written and thus must be zeroed before use. Also added a new (on by default) option to ensure ALL return values are zero-initialized that is required for sparse texture support. 6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO for this to re-enable and handle correctly). There's an additional fix now to avoid a dmask=0 For an image intrinsic with tfe where all result channels except tfe were unused, I was getting an image instruction with dmask=0 and only a single vgpr result for tfe. That is incorrect because the hardware assumes there is at least one vgpr result, plus the one for tfe. Fixed by forcing dmask to 1, which gives the desired two vgpr result with tfe in the second one. The TFE or LWE result is returned from the intrinsics using an aggregate type. Look in the test code provided to see how this works, but in essence IR code to invoke the intrinsic looks as follows: %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 1, i32 0) %v.vec = extractvalue {<4 x float>, i32} %v, 0 %v.err = extractvalue {<4 x float>, i32} %v, 1 This re-submit of the change also includes a slight modification in SIISelLowering.cpp to work-around a compiler bug for the powerpc_le platform that caused a buildbot failure on a previous submission. Differential revision: https://reviews.llvm.org/D48826 Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda Work around for ppcle compiler bug Change-Id: Ie284cf24b2271215be1b9dc95b485fd15000e32b llvm-svn: 351054	2019-01-14 11:55:24 +00:00
Stanislav Mekhanoshin	d3757d3f3a	[AMDGPU] Separate feature dot-insts Differential Revision: https://reviews.llvm.org/D56524 llvm-svn: 350793	2019-01-10 03:25:20 +00:00
Stanislav Mekhanoshin	ed0d6c60af	Remove check for single use in ShrinkDemandedConstant This removes check for single use from general ShrinkDemandedConstant to the BE because of the AArch64 regression after D56289/rL350475. After several hours of experiments I did not come up with a testcase failing on any other targets if check is not performed. Moreover, direct call to ShrinkDemandedConstant is not really needed and superceed by SimplifyDemandedBits. Differential Revision: https://reviews.llvm.org/D56406 llvm-svn: 350684	2019-01-09 02:24:22 +00:00
Piotr Sobczak	378131bae0	[AMDGPU] Handle OR as operand of raw load/store Summary: Use isBaseWithConstantOffset() which handles OR as an operand to llvm.amdgcn.raw.buffer.load and llvm.amdgcn.raw.buffer.store. Change-Id: Ifefb9dc5ded8710d333df07ab1900b230e33539a Reviewers: nhaehnle, mareko, arsenm Reviewed By: arsenm Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D55999 llvm-svn: 350208	2019-01-02 09:47:41 +00:00
Simon Pilgrim	44dfd81d01	Fix unused variable warning. NFCI. llvm-svn: 348649	2018-12-07 21:44:25 +00:00
Matt Arsenault	ce2e053134	AMDGPU: Allow f32 types for llvm.amdgcn.s.buffer.load llvm-svn: 348625	2018-12-07 18:41:39 +00:00
Matt Arsenault	ca8eb0b672	AMDGPU: Remove llvm.SI.tbuffer.store llvm-svn: 348619	2018-12-07 18:03:47 +00:00
Matt Arsenault	aa9bcd56b1	AMDGPU: Remove llvm.AMDGPU.kill This is the last of the old AMDGPU intrinsics. llvm-svn: 348615	2018-12-07 17:46:16 +00:00
Nicolai Haehnle	a7b00058e0	AMDGPU: Divergence-driven selection of scalar buffer load intrinsics Summary: Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if the load is really uniform. So select the scalar load intrinsics directly to either VMEM or SMRD buffer loads based on divergence analysis. If an offset happens to end up in a VGPR -- either because a floating point calculation was involved, or due to other remaining deficiencies in SIFixSGPRCopies -- we use v_readfirstlane. There is some unrelated churn in tests since we now select MUBUF offsets in a unified way with non-scalar buffer loads. Change-Id: I170e6816323beb1348677b358c9d380865cd1a19 Reviewers: arsenm, alex-t, rampitec, tpr Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D53283 llvm-svn: 348050	2018-11-30 22:55:38 +00:00
Nicolai Haehnle	a9cc92c247	AMDGPU: Fix various issues around the VirtReg2Value mapping Summary: The VirtReg2Value mapping is crucial for getting consistently reliable divergence information into the SelectionDAG. This patch fixes a bunch of issues that lead to incorrect divergence info and introduces tight assertions to ensure we don't regress: 1. VirtReg2Value is generated lazily; there were some cases where a lookup was performed before all relevant virtual registers were created, leading to an out-of-sync mapping. Those cases were: - Complex code to lower formal arguments that generated CopyFromReg nodes from live-in registers (fixed by never querying the mapping for live-in registers). - Code that generates CopyToReg for formal arguments that are used outside the entry basic block (fixed by never querying the mapping for Register nodes, which don't need the divergence info anyway). 2. For complex values that are lowered to a sequence of registers, all registers must be reflected in the VirtReg2Value mapping. I am not adding any new tests, since I'm not actually aware of any bugs that these problems are causing with trunk as-is. However, I recently added a test case (in r346423) which fails when D53283 is applied without this change. Also, the new assertions should provide most of the effective test coverage. There is one test change in sdwa-peephole.ll. The underlying issue is that since the divergence info is now correct, the DAGISel will select V_OR_B32 directly instead of S_OR_B32. This leads to an extra COPY which affects the behavior of MachineLICM in a way that ends up with the S_MOV_B32 with the constant in a different basic block than the V_OR_B32, which is presumably what defeats the peephole. Reviewers: alex-t, arsenm, rampitec Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D54340 llvm-svn: 348049	2018-11-30 22:55:29 +00:00
David Stuttard	c6603861d8	Revert r347871 "Fix: Add support for TFE/LWE in image intrinsic" Also revert fix r347876 One of the buildbots was reporting a failure in some relevant tests that I can't repro or explain at present, so reverting until I can isolate. llvm-svn: 347911	2018-11-29 20:14:17 +00:00
David Stuttard	535c1af0bf	Fix: Add support for TFE/LWE in image intrinsic My change svn-id: 347871 caused a buildbot failure due to an unused variable def (used in an assert). Change-Id: Ia882d18bb6fa79b4d7bbfda422b9ea5d23eab336 llvm-svn: 347876	2018-11-29 15:56:36 +00:00
David Stuttard	de02e4b1cc	Add support for TFE/LWE in image intrinsics TFE and LWE support requires extra result registers that are written in the event of a failure in order to detect that failure case. The specific use-case that initiated these changes is sparse texture support. This means that if image intrinsics are used with either option turned on, the programmer must ensure that the return type can contain all of the expected results. This can result in redundant registers since the vector size must be a power-of-2. This change takes roughly 6 parts: 1. Modify the instruction defs in tablegen to add new instruction variants that can accomodate the extra return values. 2. Updates to lowerImage in SIISelLowering.cpp to accomodate setting TFE or LWE (where the bulk of the work for these instruction types is now done) 3. Extra verification code to catch cases where intrinsics have been used but insufficient return registers are used. 4. Modification to the adjustWritemask optimisation to account for TFE/LWE being enabled (requires extra registers to be maintained for error return value). 5. An extra pass to zero initialize the error value return - this is because if the error does not occur, the register is not written and thus must be zeroed before use. Also added a new (on by default) option to ensure ALL return values are zero-initialized that is required for sparse texture support. 6. Disable the inst_combine optimization in the presence of tfe/lwe (later TODO for this to re-enable and handle correctly). There's an additional fix now to avoid a dmask=0 For an image intrinsic with tfe where all result channels except tfe were unused, I was getting an image instruction with dmask=0 and only a single vgpr result for tfe. That is incorrect because the hardware assumes there is at least one vgpr result, plus the one for tfe. Fixed by forcing dmask to 1, which gives the desired two vgpr result with tfe in the second one. The TFE or LWE result is returned from the intrinsics using an aggregate type. Look in the test code provided to see how this works, but in essence IR code to invoke the intrinsic looks as follows: %v = call {<4 x float>,i32} @llvm.amdgcn.image.load.1d.v4f32i32.i32(i32 15, i32 %s, <8 x i32> %rsrc, i32 1, i32 0) %v.vec = extractvalue {<4 x float>, i32} %v, 0 %v.err = extractvalue {<4 x float>, i32} %v, 1 Differential revision: https://reviews.llvm.org/D48826 Change-Id: If222bc03642e76cf98059a6bef5d5bffeda38dda llvm-svn: 347871	2018-11-29 15:21:13 +00:00
Stanislav Mekhanoshin	443a7f9788	[AMDGPU] Disable DAG combine at -O0 Differential Revision: https://reviews.llvm.org/D54358 llvm-svn: 347659	2018-11-27 15:13:37 +00:00
Fangrui Song	d83a5526d5	[AMDGPU] Fix -Wunused-variable llvm-svn: 347234	2018-11-19 17:54:27 +00:00
Stanislav Mekhanoshin	054f8101f1	[AMDGPU] Convert insert_vector_elt into set of selects This allows to avoid scratch use or indirect VGPR addressing for small vectors. Differential Revision: https://reviews.llvm.org/D54606 llvm-svn: 347231	2018-11-19 17:39:20 +00:00
Stanislav Mekhanoshin	bcb34ac2ea	[AMDGPU] combine extractelement into several selects An extractelement with non-constant index will be lowered either to scratch or movrel loop in most cases. This patch converts such instruction into a set of selects if vector size is not too big. Differential Revision: https://reviews.llvm.org/D54351 llvm-svn: 346800	2018-11-13 21:18:21 +00:00
Nicolai Haehnle	bc233f5523	Revert "AMDGPU: Divergence-driven selection of scalar buffer load intrinsics" This reverts commit r344696 for now (except for some test additions). See https://bugs.freedesktop.org/show_bug.cgi?id=108611. llvm-svn: 346364	2018-11-07 21:53:43 +00:00
Craig Topper	0b5f8169b0	[TargetLowering] Change TargetLoweringBase::getPreferredVectorAction to take an MVT instead of an EVT. NFC The main caller of this already has an MVT and several targets called getSimpleVT inside without checking isSimple. This makes the simpleness explicit. llvm-svn: 346180	2018-11-05 23:26:13 +00:00
Sylvestre Ledru	df92dabaef	Fixed inclusion of M_PI fow MinGW-w64 Patch by KOLANICH llvm-svn: 346000	2018-11-02 17:25:40 +00:00
Neil Henning	7d1b77df57	[AMDGPU] UBSan bug fix for r345710 UBSan detected an error in our ISelLowering that is exposed only when you have a dmask == 0x1. Fix this by adding in an explicit check to ensure we don't do the UBSan detected shl << 32. llvm-svn: 345962	2018-11-02 10:24:57 +00:00
Reid Kleckner	4dc0b1ac60	Fix clang -Wimplicit-fallthrough warnings across llvm, NFC This patch should not introduce any behavior changes. It consists of mostly one of two changes: 1. Replacing fall through comments with the LLVM_FALLTHROUGH macro 2. Inserting 'break' before falling through into a case block consisting of only 'break'. We were already using this warning with GCC, but its warning behaves slightly differently. In this patch, the following differences are relevant: 1. GCC recognizes comments that say "fall through" as annotations, clang doesn't 2. GCC doesn't warn on "case N: foo(); default: break;", clang does 3. GCC doesn't warn when the case contains a switch, but falls through the outer case. I will enable the warning separately in a follow-up patch so that it can be cleanly reverted if necessary. Reviewers: alexfh, rsmith, lattner, rtrieu, EricWF, bollu Differential Revision: https://reviews.llvm.org/D53950 llvm-svn: 345882	2018-11-01 19:54:45 +00:00
Neil Henning	63718b214a	[AMDGPU] support image load/store a16 Our a16 support was only enabled for sample/gather and buffer load/store, but not for image load/store operations (which take an i16 as the pixel index rather than a half). Fix our isel lowering and add test cases to prove it out. Differential Revision: https://reviews.llvm.org/D53750 llvm-svn: 345710	2018-10-31 10:34:48 +00:00
Matt Arsenault	abc4f29f9c	AMDGPU: Remove custom BUILD_VECTOR combine This was looping in a testcase and removing it now slightly improves a test. llvm-svn: 345560	2018-10-30 01:37:59 +00:00
Matt Arsenault	687ec75d10	DAG: Change behavior of fminnum/fmaxnum nodes Introduce new versions that follow the IEEE semantics to help with legalization that may need quieted inputs. There are some regressions from inserting unnecessary canonicalizes when these are matched from fast math fcmp + select which should be fixed in a future commit. llvm-svn: 344914	2018-10-22 16:27:27 +00:00
Nicolai Haehnle	4821937d2e	AMDGPU: Avoid selecting ds_{read,write}2_b32 on SI Summary: To workaround a hardware issue in the (base + offset) calculation when base is negative. The impact on code quality should be limited since SILoadStoreOptimizer still runs afterwards and is able to combine loads/stores based on known sign information. This fixes visible corruption in Hitman on SI (easily reproducible by running benchmark mode). Change-Id: Ia178d207a5e2ac38ae7cd98b532ea2ae74704e5f Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99923 Reviewers: arsenm, mareko Subscribers: jholewinski, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D53160 llvm-svn: 344698	2018-10-17 15:37:48 +00:00
Nicolai Haehnle	c4a2ff0950	AMDGPU: Divergence-driven selection of scalar buffer load intrinsics Summary: Moving SMRD to VMEM in SIFixSGPRCopies is rather bad for performance if the load is really uniform. So select the scalar load intrinsics directly to either VMEM or SMRD buffer loads based on divergence analysis. If an offset happens to end up in a VGPR -- either because a floating point calculation was involved, or due to other remaining deficiencies in SIFixSGPRCopies -- we use v_readfirstlane. There is some unrelated churn in tests since we now select MUBUF offsets in a unified way with non-scalar buffer loads. Change-Id: I170e6816323beb1348677b358c9d380865cd1a19 Reviewers: arsenm, alex-t, rampitec, tpr Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D53283 llvm-svn: 344696	2018-10-17 15:37:30 +00:00
Konstantin Zhuravlyov	aa067cb9fb	AMDGPU: Rename isAmdCodeObjectV2 -> isAmdHsaOrMesa The isAmdCodeObjectV2 is a misleading name which actually checks whether the os is amdhsa or mesa. Also add a test to make sure we do not generate old kernel header for code object v3. Differential Revision: https://reviews.llvm.org/D52897 llvm-svn: 343813	2018-10-04 21:02:16 +00:00
Tim Renouf	a37679d67b	[AMDGPU] Fix for negative offsets in buffer/tbuffer intrinsics Summary: The new buffer/tbuffer intrinsics handle an out-of-range immediate offset by moving/adding offset&-4096 to a vgpr, leaving an in-range immediate offset, with a chance of the move/add being CSEd for similar loads/stores. However it turns out that a negative offset in a vgpr is illegal, even if adding the immediate offset makes it legal again. Therefore, this commit disables the offset&-4096 thing if the offset is negative. Differential Revision: https://reviews.llvm.org/D52683 Change-Id: Ie02f0a74f240a138dc2a29d17cfbd9e350e4ed13 llvm-svn: 343672	2018-10-03 10:29:43 +00:00
Matt Arsenault	f432011d33	AMDGPU: Fix private handling for allowsMisalignedMemoryAccesses If the alignment is at least 4, this should report true. Something still seems off with how < 4-byte types are handled here though. Fixing this seems to change how some combines get to where they get, but somehow isn't changing the net result. llvm-svn: 342879	2018-09-24 13:18:15 +00:00
Matt Arsenault	ebf46143ea	AMDGPU: Don't form fmed3 if it will require materialization If there is a single use constant, it can be folded into the min/max, but not into med3. llvm-svn: 342443	2018-09-18 02:34:54 +00:00
David Stuttard	20de3e99b5	[AMDGPU] Ensure trig range reduction only used for subtargets that require it Summary: GFX9 and above support sin/cos instructions with a greater range and thus don't require a fract instruction prior to invocation. Added a subtarget feature to reflect this and added code to take advantage of expanded range on GFX9+ Also updated the tests to check correct behaviour Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, tpr, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D51933 Change-Id: I1c1f1d3726a5ae32116646ca5cfa1ab4ef69e5b0 llvm-svn: 342222	2018-09-14 10:27:19 +00:00
Matt Arsenault	ff987ac6ea	AMDGPU: Fix not preserving alignent in call setups If an argument was passed on the stack, this was using the default alignment. I'm not sure there's an observable change from this. This was observable due to bugs in expansion of unaligned loads and stores, but since that is fixed I don't think this matters much. llvm-svn: 342133	2018-09-13 12:14:31 +00:00
Matt Arsenault	d1f4571a66	AMDGPU: Remove function pointer type hack Now the pointer size should always be correct and we don't need to improperly inspect the pointee type. llvm-svn: 341806	2018-09-10 12:16:11 +00:00
Matt Arsenault	7f6dc597d3	AMDGPU: Stop reporting is-noop addrspacecast for constant 32-bit This will require something to cast. Before this would eliminate the cast, which would result in copies of $noreg. llvm-svn: 341803	2018-09-10 11:59:27 +00:00
Matt Arsenault	57b5966dad	DAG: Handle odd vector sizes in calling conv splitting This already worked if only one register piece was used, but didn't if a type was split into multiple, unequal sized pieces. Fixes not splitting 3i16/v3f16 into two registers for AMDGPU. This will also allow fixing the ABI for 16-bit vectors in a future commit so that it's the same for all subtargets. llvm-svn: 341801	2018-09-10 11:49:23 +00:00
Matt Arsenault	d77fcc2a92	AMDGPU: Use GOT PSV since it has an address space now llvm-svn: 341768	2018-09-10 02:23:39 +00:00
Matt Arsenault	df84dc6979	AMDGPU: Remove old hack for function addresses llvm-svn: 341567	2018-09-06 17:23:24 +00:00
Matt Arsenault	813613c494	AMDGPU: Fix DAG divergence not reporting flat loads Match behavior in DAG of r340343 llvm-svn: 341393	2018-09-04 18:58:19 +00:00
Stanislav Mekhanoshin	44451b3344	[AMDGPU] Split v32i32 loads Differential Revision: https://reviews.llvm.org/D51555 llvm-svn: 341266	2018-08-31 22:43:36 +00:00
Matt Arsenault	bf07a50a98	AMDGPU: Restrict extract_vector_elt combine to loads The intention is to enable the extract_vector_elt load combine, and doing this for other operations interferes with more useful optimizations on vectors. Handle any type of load since in principle we should do the same combine for the various load intrinsics. llvm-svn: 341219	2018-08-31 15:39:52 +00:00
Matt Arsenault	0da6350dc8	AMDGPU: Remove remnants of old address space mapping llvm-svn: 341165	2018-08-31 05:49:54 +00:00
Nicolai Haehnle	35617ed4cb	[NFC] Rename the DivergenceAnalysis to LegacyDivergenceAnalysis Summary: This is patch 1 of the new DivergenceAnalysis (https://reviews.llvm.org/D50433). The purpose of this patch is to free up the name DivergenceAnalysis for the new generic implementation. The generic implementation class will be shared by specialized divergence analysis classes. Patch by: Simon Moll Reviewed By: nhaehnle Subscribers: jvesely, jholewinski, arsenm, nhaehnle, mgorny, jfb, llvm-commits Differential Revision: https://reviews.llvm.org/D50434 Change-Id: Ie8146b11be2c50d5312f30e11c7a3036a15b48cb llvm-svn: 341071	2018-08-30 14:21:36 +00:00
Ryan Taylor	1f334d0062	[AMDGPU] Add support for a16 modifiear for gfx9 Summary: Adding support for a16 for gfx9. A16 bit replaces r128 bit for gfx9. Change-Id: Ie8b881e4e6d2f023fb5e0150420893513e5f4841 Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, jfb, llvm-commits Differential Revision: https://reviews.llvm.org/D50575 llvm-svn: 340831	2018-08-28 15:07:30 +00:00
Tim Renouf	904343f879	[AMDGPU] Add support for multi-dword s.buffer.load intrinsic Summary: Patch by Marek Olsak and David Stuttard, both of AMD. This adds a new amdgcn intrinsic supporting s.buffer.load, in particular multiple dword variants. These are convenient to use from some front-end implementations. Also modified the existing llvm.SI.load.const intrinsic to common up the underlying implementation. This modification also requires that we can lower to non-uniform loads correctly by splitting larger dword variants into sizes supported by the non-uniform versions of the load. V2: Addressed minor review comments. V3: i1 glc is now i32 cachepolicy for consistency with buffer and tbuffer intrinsics, plus fixed formatting issue. V4: Added glc test. Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D51098 Change-Id: I83a6e00681158bb243591a94a51c7baa445f169b llvm-svn: 340684	2018-08-25 14:53:17 +00:00
Matt Arsenault	bb8e64e7f5	AMDGPU: Fix not respecting byval alignment in call frame setup This was hackily adding in the 4-bytes reserved for the callee's emergency stack slot. Treat it like a normal stack allocation so we get the correct alignment padding behavior. This fixes an inconsistency between the caller and callee. llvm-svn: 340396	2018-08-22 11:09:45 +00:00
Tim Renouf	4f703f5e11	[AMDGPU] New buffer intrinsics Summary: This commit adds new intrinsics llvm.amdgcn.raw.buffer.load llvm.amdgcn.raw.buffer.load.format llvm.amdgcn.raw.buffer.load.format.d16 llvm.amdgcn.struct.buffer.load llvm.amdgcn.struct.buffer.load.format llvm.amdgcn.struct.buffer.load.format.d16 llvm.amdgcn.raw.buffer.store llvm.amdgcn.raw.buffer.store.format llvm.amdgcn.raw.buffer.store.format.d16 llvm.amdgcn.struct.buffer.store llvm.amdgcn.struct.buffer.store.format llvm.amdgcn.struct.buffer.store.format.d16 llvm.amdgcn.raw.buffer.atomic.* llvm.amdgcn.struct.buffer.atomic.* with the following changes from the llvm.amdgcn.buffer.* intrinsics: * there are separate raw and struct versions: raw does not have an index arg and sets idxen=0 in the instruction, and struct always sets idxen=1 in the instruction even if the index is 0, to allow for the fact that gfx9 does bounds checking differently depending on whether idxen is set; * there is a combined cachepolicy arg (glc+slc) * there are now only two offset args: one for the offset that is included in bounds checking and swizzling, to be split between the instruction's voffset and immoffset fields, and one for the offset that is excluded from bounds checking and swizzling, to go into the instruction's soffset field. The AMDISD::BUFFER_* SD nodes always have an index operand, all three offset operands, combined cachepolicy operand, and an extra idxen operand. The obsolescent llvm.amdgcn.buffer.* intrinsics continue to work. Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, jfb, llvm-commits Differential Revision: https://reviews.llvm.org/D50306 Change-Id: If897ea7dc34fcbf4d5496e98cc99a934f62fc205 llvm-svn: 340269	2018-08-21 11:07:10 +00:00
Tim Renouf	35484c9d50	[AMDGPU] New tbuffer intrinsics Summary: This commit adds new intrinsics llvm.amdgcn.raw.tbuffer.load llvm.amdgcn.struct.tbuffer.load llvm.amdgcn.raw.tbuffer.store llvm.amdgcn.struct.tbuffer.store with the following changes from the llvm.amdgcn.tbuffer.* intrinsics: * there are separate raw and struct versions: raw does not have an index arg and sets idxen=0 in the instruction, and struct always sets idxen=1 in the instruction even if the index is 0, to allow for the fact that gfx9 does bounds checking differently depending on whether idxen is set; * there is a combined format arg (dfmt+nfmt) * there is a combined cachepolicy arg (glc+slc) * there are now only two offset args: one for the offset that is included in bounds checking and swizzling, to be split between the instruction's voffset and immoffset fields, and one for the offset that is excluded from bounds checking and swizzling, to go into the instruction's soffset field. The AMDISD::TBUFFER_* SD nodes always have an index operand, all three offset operands, combined format operand, combined cachepolicy operand, and an extra idxen operand. The tbuffer pseudo- and real instructions now also have a combined format operand. The obsolescent llvm.amdgcn.tbuffer.* and llvm.SI.tbuffer.store intrinsics continue to work. V2: Separate raw and struct intrinsics. V3: Moved extract_glc and extract_slc defs to a more sensible place. V4: Rebased on D49995. V5: Only two separate offset args instead of three. V6: Pseudo- and real instructions have joint format operand. V7: Restored optionality of dfmt and nfmt in assembler. V8: Addressed minor review comments. Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, dstuttard, t-tye, llvm-commits Differential Revision: https://reviews.llvm.org/D49026 Change-Id: If22ad77e349fac3a5d2f72dda53c010377d470d4 llvm-svn: 340268	2018-08-21 11:06:05 +00:00
Chandler Carruth	c73c0307fe	[MI] Change the array of `MachineMemOperand` pointers to be a generically extensible collection of extra info attached to a `MachineInstr`. The primary change here is cleaning up the APIs used for setting and manipulating the `MachineMemOperand` pointer arrays so chat we can change how they are allocated. Then we introduce an extra info object that using the trailing object pattern to attach some number of MMOs but also other extra info. The design of this is specifically so that this extra info has a fixed necessary cost (the header tracking what extra info is included) and everything else can be tail allocated. This pattern works especially well with a `BumpPtrAllocator` which we use here. I've also added the basic scaffolding for putting interesting pointers into this, namely pre- and post-instruction symbols. These aren't used anywhere yet, they're just there to ensure I've actually gotten the data structure types correct. I'll flesh out support for these in a subsequent patch (MIR dumping, parsing, the works). Finally, I've included an optimization where we store any single pointer inline in the `MachineInstr` to avoid the allocation overhead. This is expected to be the overwhelmingly most common case and so should avoid any memory usage growth due to slightly less clever / dense allocation when dealing with >1 MMO. This did require several ergonomic improvements to the `PointerSumType` to reasonably support the various usage models. This also has a side effect of freeing up 8 bits within the `MachineInstr` which could be repurposed for something else. The suggested direction here came largely from Hal Finkel. I hope it was worth it. ;] It does hopefully clear a path for subsequent extensions w/o nearly as much leg work. Lots of thanks to Reid and Justin for careful reviews and ideas about how to do all of this. Differential Revision: https://reviews.llvm.org/D50701 llvm-svn: 339940	2018-08-16 21:30:05 +00:00
Matt Arsenault	7121bed210	AMDGPU: Custom lower fexp This will allow the library to just use __builtin_expf directly without expanding this itself. Note f64 still won't work because there is no exp instruction for it. llvm-svn: 339902	2018-08-16 17:07:52 +00:00
Matt Arsenault	a816073764	AMDGPU: Improve extract_vector_elt reduction combine Handle fmul, fsub and preserve flags. Also really test minnum/maxnum reductions. The existing tests were only checking from minnum/maxnum matched from a fast math compare and select which is not the same. llvm-svn: 339820	2018-08-15 21:34:06 +00:00
Matt Arsenault	b3a80e5397	AMDGPU: Implement llvm.amdgcn.icmp/fcmp for i16/f16 Also support these on targets without support for these, since it will allow us to freely create these in instcombine. llvm-svn: 339819	2018-08-15 21:25:20 +00:00
Chandler Carruth	66654b72c9	[SDAG] Remove the reliance on MI's allocation strategy for `MachineMemOperand` pointers attached to `MachineSDNodes` and instead have the `SelectionDAG` fully manage the memory for this array. Prior to this change, the memory management was deeply confusing here -- The way the MI was built relied on the `SelectionDAG` allocating memory for these arrays of pointers using the `MachineFunction`'s allocator so that the raw pointer to the array could be blindly copied into an eventual `MachineInstr`. This creates a hard coupling between how `MachineInstr`s allocate their array of `MachineMemOperand` pointers and how the `MachineSDNode` does. This change is motivated in large part by a change I am making to how `MachineFunction` allocates these pointers, but it seems like a layering improvement as well. This would run the risk of increasing allocations overall, but I've implemented an optimization that should avoid that by storing a single `MachineMemOperand` pointer directly instead of allocating anything. This is expected to be a net win because the vast majority of uses of these only need a single pointer. As a side-effect, this makes the API for updating a `MachineSDNode` and a `MachineInstr` reasonably different which seems nice to avoid unexpected coupling of these two layers. We can map between them, but we shouldn't be surprised at where that occurs. =] Differential Revision: https://reviews.llvm.org/D50680 llvm-svn: 339740	2018-08-14 23:30:32 +00:00
Matt Arsenault	b5acec1f79	AMDGPU: Use splat vectors for undefs when folding canonicalize If one of the elements is undef, use the canonicalized constant from the other element instead of 0. Splat vectors are more useful for other optimizations, such as matching vector clamps. This was breaking on clamps of half3 from the undef 4th component. llvm-svn: 339512	2018-08-12 08:42:54 +00:00
Matt Arsenault	3ead7d7389	AMDGPU: Fix packing undef parts of build_vector llvm-svn: 339511	2018-08-12 08:42:46 +00:00
Matt Arsenault	940e6075e4	AMDGPU: More canonicalized operations llvm-svn: 339464	2018-08-10 19:20:17 +00:00
Matt Arsenault	3dcf4ce435	AMDGPU: Combine and of seto/setuo and fp_class Clear the nan (or non-nan) test bits from the mask. llvm-svn: 339462	2018-08-10 18:58:56 +00:00
Matt Arsenault	8ad00d30fa	AMDGPU: Match isfinite pattern to class instructions llvm-svn: 339460	2018-08-10 18:58:41 +00:00
Matt Arsenault	5bb9d798b4	AMDGPU: Add LLVM_FALLTHROUGH llvm-svn: 339458	2018-08-10 17:57:12 +00:00
Matt Arsenault	935f3b70fe	AMDGPU: Error more gracefully on libcalls I think this is the only situation where the callsite will have a null instruction. llvm-svn: 339271	2018-08-08 16:58:39 +00:00
Matt Arsenault	e719139b10	AMDGPU: Fix shifts for i128 llvm-svn: 339270	2018-08-08 16:58:33 +00:00
Matt Arsenault	08f3fe4fae	AMDGPU: cvt_pk_rtz_f16 canonicalizes llvm-svn: 339078	2018-08-06 23:01:31 +00:00
Matt Arsenault	e94ee833f9	AMDGPU: Handle some vector operations in isCanonicalized llvm-svn: 339077	2018-08-06 22:45:51 +00:00
Matt Arsenault	a29e76244a	AMDGPU: Push fcanonicalize through partially constant build_vector This usually avoids some re-packing code, and may help find canonical sources. llvm-svn: 339072	2018-08-06 22:30:44 +00:00
Matt Arsenault	f2a167fb1d	AMDGPU: Refactor fcanonicalize combine This will make more complex combines easier. llvm-svn: 339070	2018-08-06 22:10:26 +00:00
Matt Arsenault	d49ab0b214	AMDGPU: Treat more custom operations as canonicalizing Everything should quiet, and I think everything should flush. I assume the min3/med3/max3 follow the same rules as regular min/max for flushing, which should at least be conservatively correct. There are still more operations that need to be handled. llvm-svn: 339065	2018-08-06 21:58:11 +00:00
Matt Arsenault	ce6d61fba8	AMDGPU: Conversions always produce canonical results Not sure why this was checking for denormals for f16. My interpretation of the IEEE standard is conversions should produce a canonical result, and the ISA manual says denormals are created when appropriate. llvm-svn: 339064	2018-08-06 21:51:52 +00:00
Matt Arsenault	f8768bfc84	AMDGPU: Fix implementation of isCanonicalized If denormals are enabled, denormals are canonical. Also fix a few other issues. minnum/maxnum are supposed to canonicalize. Temporarily improve workaround for the instruction behavior change in gfx9. Handle selects and fcopysign. The tests were also largely broken, since they were checking for a flush used on some targets after the store of the result. llvm-svn: 339061	2018-08-06 21:38:27 +00:00

1 2 3 4 5 ...

731 Commits