llvm-project

Commit Graph

Author	SHA1	Message	Date
hsmahesha	52ffbfdffc	[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering. Before packing LDS globals into a sorted structure, make sure that their alignment is properly updated based on their size. This will make sure that the members of sorted structure are properly aligned, and hence it will further reduce the probability of unaligned LDS access. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D103261	2021-06-07 18:00:41 +05:30
Mirko Brkusanin	35ef4c940b	[AMDGPU][GlobalISel] Legalize G_ABS Legalize and select G_ABS so that we can use llvm.abs intrinsic Differential Revision: https://reviews.llvm.org/D102391	2021-06-04 14:46:43 +02:00
madhur13490	6a3beb1f68	[AMDGPU] [IndirectCalls] Don't propagate attributes to address taken functions and their callees Don't propagate launch bound related attributes to address taken functions and their callees. The idea is to do a traversal over the call graph starting at address taken functions and erase the attributes set by previous logic i.e. process(). This two phase approach makes sure that we don't miss out on deep nested callees from address taken functions as a function might be called directly as well as indirectly. This patch is also reattempt to D94585 as latent issues are fixed in hasAddressTaken function in the recent past. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D103138	2021-06-04 11:36:56 +05:30
hsmahesha	753437fc1d	Revert "[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering." This reverts commit `d71ff907ef`.	2021-06-04 11:16:46 +05:30
hsmahesha	d71ff907ef	[AMDGPU] Increase alignment of LDS globals if necessary before LDS lowering. Before packing LDS globals into a sorted structure, make sure that their alignment is properly updated based on their size. This will make sure that the members of sorted structure are properly aligned, and hence it will further reduce the probability of unaligned LDS access. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D103261	2021-06-04 09:34:37 +05:30
Julien Pagès	37821155c9	[AMDGPU] Fix a crash when selecting a particular case of buffer_load_format_d16 In this particular example, we had a crash when compiling it for several architectures. This patch extends the legalization of extract_subvector to avoid this problem. Differential Revision: https://reviews.llvm.org/D103344	2021-06-03 16:40:18 -04:00
Stanislav Mekhanoshin	9e2e49328f	[AMDGPU] All GWS instructions need aligned VGPR on gfx90a Fixes: SWDEV-288006 Differential Revision: https://reviews.llvm.org/D103197	2021-06-01 17:08:03 -07:00
Arthur Eubanks	8815ce03e8	Remove "Rewrite Symbols" from codegen pipeline It breaks up the function pass manager in the codegen pipeline. With empty parameters, it looks at the -mllvm flag -rewrite-map-file. This is likely not in use. Add a check that we only have one function pass manager in the codegen pipeline. Some tests relied on the fact that we had a module pass somewhere in the codegen pipeline. addr-label.ll crashes on ARM due to this change. This is because a ARMConstantPoolConstant containing a BasicBlock to represent a blockaddress may hold an invalid pointer to a BasicBlock if the blockaddress is invalidated by its BasicBlock getting removed. In that case all referencing blockaddresses are RAUW a constant int. Making ARMConstantPoolConstant::CVal a WeakVH fixes the crash, but I'm not sure that's the right fix. As a workaround, create a barrier right before ISel so that IR optimizations can't happen while a ARMConstantPoolConstant has been created. Reviewed By: rnk, MaskRay, compnerd Differential Revision: https://reviews.llvm.org/D99707	2021-05-31 08:32:36 -07:00
Djordje Todorovic	dee85d47d9	[LiveDebugVariables] Stop trimming locations of non-inlined vars The D35953, D62650 and D73691 introduced trimming of variables locations in LiveDebugVariables pass, since there are some cases where after the virtregrewrite we have exploded number of DBG_VALUEs created for some inlined variables. As it looks, all problematic cases were regarding inlined variables, so it seems reasonable to stop trimming the location ranges for non-inlined variables. It has very good impact on the llvm-locstats report. Differential Revision: https://reviews.llvm.org/D102917	2021-05-31 02:59:19 -07:00
David Green	222aeb4d51	[DSE] Remove stores in the same loop iteration DSE will currently only remove stores in the same block unless they can be guaranteed to be loop invariant. This expands that to any stores that are in the same Loop, at the same loop level. This should still account for where AA/MSSA will not handle aliasing between loops, but allow the dead stores to be removed where they overlap in the same loop iteration. It requires adding loop info to DSE, but that looks fairly harmless. The test case this helps is from code like this, which can come up in certain matrix operations: for(i=..) dst[i] = 0; for(j=..) dst[i] += src[in+j]; After LICM, this becomes: for(i=..) dst[i] = 0; sum = 0; for(j=..) sum += src[in+j]; dst[i] = sum; The first store is dead, and with this patch is now removed. Differntial Revision: https://reviews.llvm.org/D100464	2021-05-31 10:22:37 +01:00
Sebastian Neubauer	690f5b7a01	[AMDGPU] Fix function calls with flat scratch When flat scratch is used, the stack pointer needs to be added when writing arguments to the stack. For buffer instructions, this is done in SelectMUBUFScratchOffen and SelectMUBUFScratchOffset. Move that to call argument lowering, like it is done in GlobalISel. Differential Revision: https://reviews.llvm.org/D103166	2021-05-28 11:22:13 +02:00
Sebastian Neubauer	6133b60a27	[AMDGPU] Precommit test Add scratch run to gfx-callable-argument-types.ll.	2021-05-28 11:22:13 +02:00
Matt Arsenault	e892705d74	GlobalISel: Do not change register types in lowerLoad Adjusting the load register type is a widenScalar type action, not a lowering. lowerLoad should be reserved for operations that change the memory access size, such as unaligned load decomposition. With this trying to adjust the register type, it was hard to avoid infinite loops in the legalizer. Adds a bandaid to avoid regressing a few AArch64 tests, but I'm not sure what the exact condition is and there's probably a cleaner way to do this. For AMDGPU this regresses handling of some cases for unaligned loads, but the way this is currently working is a pretty ugly hack.	2021-05-27 11:49:37 -04:00
Matt Arsenault	34046de042	AMDGPU/GlobalISel: Fix broken test run line	2021-05-27 11:28:51 -04:00
Matt Arsenault	808dc6f866	VirtRegMap: Preserve LiveDebugVariables This avoids recomputing it between regalloc runs when allocation is split, and also avoids a debug info test regression.	2021-05-27 10:40:14 -04:00
Matt Arsenault	ee35900089	AMDGPU/GlobalISel: Lower constant-32-bit zextload/sextload consistently We were accidentally leaning on code in lowerLoad which expands extending loads which should be removed.	2021-05-27 09:49:13 -04:00
Sebastian Neubauer	0bb60dbe34	[AMDGPU][GlobalISel] Allow amdgpu_gfx calling conv Calling functions from shaders already works with the SelectionDAG. Differential Revision: https://reviews.llvm.org/D103183	2021-05-27 10:41:40 +02:00
Stanislav Mekhanoshin	5e2facb922	[AMDGPU] Fix kernel LDS lowering for constants There is a trivial but severe bug in the recent code collecting LDS globals used by kernel. It aborts scan on the first constant without scanning further uses. That leads to LDS overallocation with multiple kernels in certain cases. Differential Revision: https://reviews.llvm.org/D103190	2021-05-26 11:34:50 -07:00
jweightma	fcd32d62c0	[AMDGPU] Fix function pointer argument bug in AMDGPU Propagate Attributes pass. This patch fixes a bug in the AMDGPU Propagate Attributes pass where a call instruction with a function pointer argument is identified as a user of the passed function, and illegally replaces the called function of the instruction with the function argument. For example, given functions f and g with appropriate types, the following illegal transformation could occur without this fix: call void @f(void ()* @g) --> call void @g(void ()* @g.1) The solution introduced in this patch is to prevent the cloning and substitution if the instruction's called function and the function which might be cloned do not match. Reviewed By: arsenm, madhur13490 Differential Revision: https://reviews.llvm.org/D101847	2021-05-26 16:40:15 +02:00
Mirko Brkusanin	9601849984	[AMDGPU][GlobalISel] Stop foldInsertEltToCmpSelect from changing reg banks This function can change regbank for registers which already have a selected bank. Depending on the instruction where these registers were used it can cause instruction selection to fail. Differential Revision: https://reviews.llvm.org/D98515	2021-05-26 11:57:41 +02:00
Mirko Brkusanin	7386ad4e9e	Revert "[AMDGPU][GlobalISel] Stop foldInsertEltToCmpSelect from changing reg banks" This reverts commit `18c5444702`.	2021-05-26 11:57:41 +02:00
Michael Liao	c9dd29925f	[SelectionDAG] Propagate scoped AA metadata when lowering mem intrinsics. - When memory intrinsics, such as memcpy, the attached scoped AA metadata is not passed down to the backend. As a result, the backend cannot schedule relevant memory operations around them following that hint. In this patch, SelectionDAG is enhanced to propagate that metadata (scoped AA only) when they are lowered into loads and stores. Differential Revision: https://reviews.llvm.org/D102215	2021-05-25 14:42:26 -04:00
Michael Liao	4df3b60199	Add pre-commit tests for [D102215](https://reviews.llvm.org/D102215 ).	2021-05-25 14:42:25 -04:00
Stanislav Mekhanoshin	8de4db697f	[AMDGPU] Lower kernel LDS into a sorted structure Differential Revision: https://reviews.llvm.org/D102954	2021-05-25 11:29:29 -07:00
Mirko Brkusanin	18c5444702	[AMDGPU][GlobalISel] Stop foldInsertEltToCmpSelect from changing reg banks This function can change regbank for registers which already have a selected bank. Depending on the instruction where these registers were used it can cause instruction selection to fail.	2021-05-25 19:34:09 +02:00
Christudasan Devadasan	90d784053f	AMDGPU/GlobalISel: Legalize G_[SU]DIVREM instructions Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D100726	2021-05-25 10:51:07 +05:30
serge-sans-paille	4ab3041acb	Revert "[NFC] remove explicit default value for strboolattr attribute in tests" This reverts commit `bda6e5bee0`. See https://lab.llvm.org/buildbot/#/builders/109/builds/15424 for instance	2021-05-24 19:43:40 +02:00
serge-sans-paille	bda6e5bee0	[NFC] remove explicit default value for strboolattr attribute in tests Since `d6de1e1a71`, no attributes is quivalent to setting attribute to false. This is a preliminary commit for https://reviews.llvm.org/D99080	2021-05-24 19:31:04 +02:00
Andy Wingo	81bc732816	[IR][Verifier] Relax restriction on alloca address spaces In the WebAssembly target, we would like to allow alloca in two address spaces. The alloca instruction already has an address space argument, but the verifier asserts that the address space of an alloca is the default alloca address space from the datalayout. This patch removes this restriction. Targets that would like to impose additional restrictions should do so via target-specific verification passes. Differential Revision: https://reviews.llvm.org/D101045	2021-05-21 11:52:45 +02:00
Nicolai Hähnle	a888e492f6	[IR] Memory intrinsics are not unconditionally `nosync` Remove the `nosync` attribute from the memory intrinsic definitions (i.e. memset, memcpy, memmove). Like native memory accesses, memory intrinsics can be volatile. This is indicated by an immarg in the intrinsic call. All else equal, a volatile memory intrinsic is `sync`, so we cannot annotate the intrinsic functions themselves as `nosync`. The attributor and function-attr passes know to take the volatile bit into account. Since `nosync` is a default attribute, this means we have to stop using the DefaultAttrIntrinsic tablegen class for memory intrinsics, and specify all default attributes other than `nosync` explicitly. Most of the test changes are trivial churn, but one test case (in nosync.ll) was in fact incorrect before this change. Differential Revision: https://reviews.llvm.org/D102295	2021-05-21 03:40:59 +02:00
Stanislav Mekhanoshin	748db5bfac	[AMDGPU] Fix module LDS selection Accesses to global module LDS variable start from null, but kernel also thinks its variables start address is null. Fixed by not using a null as an address. Differential Revision: https://reviews.llvm.org/D102882	2021-05-20 15:59:01 -07:00
Jessica Paquette	84ae1cf8ed	Recommit "[GlobalISel] Simplify G_ICMP to true/false when the result is known" Add missing REQUIRES line to prelegalizer-combiner-icmp-to-true-false-known-bits.	2021-05-19 09:29:19 -07:00
Nico Weber	52a7797626	Revert "[GlobalISel] Simplify G_ICMP to true/false when the result is known" This reverts commit `892497c806`. Breaks tests, see comments on https://reviews.llvm.org/D102542	2021-05-19 09:02:27 -04:00
Arthur Eubanks	1c7f32334d	[TargetLowering] Only inspect attributes in the arguments for ArgListEntry Parameter attributes are considered part of the function [1], and like mismatched calling conventions [2], we can't have the verifier check for mismatched parameter attributes. This is a reland after fixing MSan issues in D102667. [1] https://llvm.org/docs/LangRef.html#parameter-attributes [2] https://llvm.org/docs/FAQ.html#why-does-instcombine-simplifycfg-turn-a-call-to-a-function-with-a-mismatched-calling-convention-into-unreachable-why-not-make-the-verifier-reject-it Reviewed By: rnk Differential Revision: https://reviews.llvm.org/D101806	2021-05-18 14:30:22 -07:00
Jessica Paquette	892497c806	[GlobalISel] Simplify G_ICMP to true/false when the result is known Use existing KnownBits helpers from KnownBits.h to simplify G_ICMPs. E.g. x == x -> true x != x -> false load(x) > 1 -> true (when the load is known to be greater than 1) And so on. Differential Revision: https://reviews.llvm.org/D102542	2021-05-18 09:26:41 -07:00
Simon Pilgrim	81606ab856	Revert rGd70cbd1ce9b426f2c7e0e0f900769bbcbb300a95 "[AMDGPU] Regenerate wave32.ll tests" This is failing on buildbots but not locally - not sure why	2021-05-18 12:15:38 +01:00
Simon Pilgrim	d70cbd1ce9	[AMDGPU] Regenerate wave32.ll tests Keep the manual GFX10DEFWAVE checks for VGPRBlocks	2021-05-18 12:03:31 +01:00
Stanislav Mekhanoshin	45764efb69	[AMDGPU] Do not check denorm for LDS FP atomic with unsafe flag This is already how it is handled for global and flat atomics. Differential Revision: https://reviews.llvm.org/D102366	2021-05-17 16:53:09 -07:00
Arthur Eubanks	341902672c	Revert "[TargetLowering] Only inspect attributes in the arguments for ArgListEntry" This reverts commit `16748bd2fb`. Causes https://crbug.com/1209013	2021-05-16 22:02:10 -07:00
Brendon Cahoon	3f7b7e7393	[AMDGPU] Update SCC defs to VCC when uses are changed to VCC The FixSGPRCopies pass converts instructions to VALU when removing illegal VGPR to SGPR copies. Instructions that use SCC are changed to use VCC instead. When that happens, the pass must also change instructions that define SCC to define VCC. The pass was not changing the SCC definition when an ADDC is converted due to a input that is a VGPR to SGPR copy. But, the initial ADD insruction, which define SCC, is not converted. This causes a compilation failure due to a use of an undefined physical register. This patch adds code that inserts the SCC definition in the MoveToVALU worklist when a SCC use is converted to a VCC use. Differential Revision: https://reviews.llvm.org/D102111	2021-05-14 18:05:05 -04:00
Stanislav Mekhanoshin	6fb02596a2	[AMDGPU] Add support for architected flat scratch Add support for the readonly flat Scratch register initialized by the SPI. Differential Revision: https://reviews.llvm.org/D102432	2021-05-14 10:53:48 -07:00
Matt Arsenault	c7cff08f79	AMDGPU: Fix assert when rewriting saddr d16 loads moveOperands does not handle moving tied operands since it would generally have to fixup the tied operand references. Avoid the assert by untying and retying after the modification. These in place modifications really aren't managable.	2021-05-14 13:24:19 -04:00
Jay Foad	7f81c5a5ba	[AMDGPU] getMemOperandsWithOffset: add vaddr operand for stack access BUF instructions A consequence is that checkInstOffsetsDoNotOverlap can now distinguish sp+offset from fp+offset, so it knows that it shouldn't try to work out whether the accesses overlap just by comparing the offsets. For example in these two instructions: MIR: BUFFER_STORE_DWORD_OFFSET %0:vgpr_32(s32), $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 4, 0, 0, 0, 0, 0, implicit $exec :: (dereferenceable store 4 into stack + 4, addrspace 5) %4:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %stack.0.alloca, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 0, 0, 0, 0, 0, 0, implicit $exec :: (load 4 from `i8 addrspace(5)* undef`, addrspace 5) ISA: buffer_store_dword v0, off, s[0:3], s32 offset:4 buffer_load_dword v0, off, s[0:3], s34 Differential Revision: https://reviews.llvm.org/D73957	2021-05-14 10:10:43 +01:00
David Stuttard	31b62aa162	[AMDGPU] Fix codegen of image intrinsics for g16 and a16 For gfx10 gradient (g16) and address (a16) can be independent. Previous implementation assumed that a16 implied g16. There are some other changes that fix the verification (as well as asm/disasm) that are required for the included test to pass - the XFAIL will be removed in those changes. This also includes required fixes for GlobalISel Differential Revision: https://reviews.llvm.org/D102066 Change-Id: I7d171cc90994de05f41669b66a6d0ffa2ed05d09	2021-05-14 09:28:15 +01:00
David Stuttard	72d570ca08	[AMDGPU][AsmParser/Disassembler] Correct A16 and G16 handling A16 support for image instructions assembly/disassembly (gfx10) was missing Also refactor MIMG op addr size calcs to common function We'd got 3 places where the same operation was being done. One test is now marked XFAIL until a related codegen patch is in place Differential Revision: https://reviews.llvm.org/D102231 Change-Id: I7e86e730ef8c71901457855cba570581f4f576bb	2021-05-14 09:25:44 +01:00
Carl Ritson	9cf6ff7aff	[AMDGPU] Do not clause NSA instructions To ensure correct behaviour NSA instructions should not be claused. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D102211	2021-05-14 12:54:56 +09:00
Matt Arsenault	6a70874d27	AMDGPU/GlobalISel: Implement tail calls Or at least the sibling call cases which the DAG already handles.	2021-05-13 18:57:42 -04:00
Aakanksha Patil	464e4dc50f	[AMDGPU] Add gfx1034 target Differential Revision: https://reviews.llvm.org/D102306	2021-05-13 14:25:18 -04:00
Stanislav Mekhanoshin	8f98356bb5	[AMDGPU] Only allow global fp atomics with unsafe option Previously we were allowing to use FP atomics without -amdgpu-unsafe-fp-atomics option if a scope is less then system. This is not safe just as well if we have UC memory. This change only allows global and flat FP atomics with the unsafe option. Consequentially that makes a check for denorm mode redundant since we skip it with the unsafe option and do not have a way to produce these instructions without it anyway. Differential Revision: https://reviews.llvm.org/D102347	2021-05-13 08:52:20 -07:00
Juneyoung Lee	395607af3c	Reapply [ConstantFold] Fold more operations to poison This was reverted to mitigate mitigate miscompiles caused by the logical and/or to bitwise and/or fold. Reapply it now that the underlying issue has been fixed by D101191. ----- This patch folds more operations to poison. Alive2 proof: https://alive2.llvm.org/ce/z/mxcb9G (it does not contain tests about div/rem because they fold to poison when raising UB) Reviewed By: nikic Differential Revision: https://reviews.llvm.org/D92270	2021-05-13 16:04:12 +02:00
Baptiste Saleil	5885f1a4cb	[AMDGPU] Disable the SIFormMemoryClauses pass at -O1 This patch disables the SIFormMemoryClauses pass at -O1. This pass has a significant impact on compilation time, so we only want it to be enabled starting from -O2. Differential Revision: https://reviews.llvm.org/D101939	2021-05-12 11:51:59 -04:00
Julien Pagès	46adccc5cc	[AMDGPU] Improve Codegen for build_vector Improve the code generation of build_vector. Use the v_pack_b32_f16 instruction instead of v_and_b32 + v_lshl_or_b32 Differential Revision: https://reviews.llvm.org/D98081 Patch by Julien Pagès!	2021-05-12 14:17:44 +01:00
Piotr Sobczak	68137ef568	[AMDGPU] Skip invariant loads when avoiding WAR conflicts No need to handle invariant loads when avoiding WAR conflicts, as there cannot be a vector store to the same memory location. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D101177	2021-05-12 10:57:05 +02:00
Matt Arsenault	cc79aaced0	AMDGPU: Fix SILoadStoreOptimizer for gfx90a This was hardcoding the register class to use for the newly created pointer registers, violating the aligned VGPR requirement.	2021-05-11 21:26:43 -04:00
Matt Arsenault	a15ed701ab	AMDGPU: Fix assert on constant load from addrspacecasted pointer This was trying to create a bitcast between different address spaces.	2021-05-11 20:12:20 -04:00
Austin Kerbow	4433f4601e	[AMDGPU] Fix extra waitcnt being added with BUFFER_INVL2 The waitcnt pass would increment the number of vmem events for some buffer invalidates that were not handled by the pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D102252	2021-05-11 13:17:33 -07:00
Piotr Sobczak	09fe84abb4	[AMDGPU] Move code sinking before structurizer Moving code sinking pass before structurizer creates more sinking opportunities. The extra flow edges introduced by the structurizer can have adverse effects on sinking, because the sinking pass prefers moving instructions to blocks with unique predecessors and the structurizer destroys that property in some cases. A notable example is moving high-latency image instructions across kills. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D101115	2021-05-11 14:07:23 +02:00
Jay Foad	3b873831c4	[AMDGPU] Add some GFX10.3 testing. NFC.	2021-05-11 11:21:19 +01:00
Carl Ritson	ad558a4ff7	[AMDGPU] Pre-commit tests for D102211	2021-05-11 12:16:58 +09:00
Arthur Eubanks	16748bd2fb	[TargetLowering] Only inspect attributes in the arguments for ArgListEntry Parameter attributes are considered part of the function [1], and like mismatched calling conventions [2], we can't have the verifier check for mismatched parameter attributes. [1] https://llvm.org/docs/LangRef.html#parameter-attributes [2] https://llvm.org/docs/FAQ.html#why-does-instcombine-simplifycfg-turn-a-call-to-a-function-with-a-mismatched-calling-convention-into-unreachable-why-not-make-the-verifier-reject-it Reviewed By: rnk Differential Revision: https://reviews.llvm.org/D101806	2021-05-10 12:35:11 -07:00
Petar Avramovic	f6985a197e	AMDGPU/GlobalISel: Use destination register bank in applyMappingLoad Large loads on target that does not useFlatForGlobal have to be split in regbankselect. This did not happen in case when destination had vgpr bank and address had sgpr bank. Instead of checking if address bank is sgpr check bank of the destination. Differential Revision: https://reviews.llvm.org/D101992	2021-05-10 10:18:30 +02:00
Petar Avramovic	d13ce17bb4	AMDGPU/GlobalISel: Add regbankselect test for vgpr(dest) sgpr(address) load Pre-commit for D101992.	2021-05-10 10:18:30 +02:00
Qiu Chaofan	2db4979c0f	[VectorCombine] Simplify to scalar store if only one element updated This patch simplifies load-insertelt-store pattern into getelementptr-store. Reviewed By: fhahn Differential Revision: https://reviews.llvm.org/D98240	2021-05-08 18:14:51 +08:00
Sebastian Neubauer	13c0316239	[AMDGPU] Restrict immediate scratch offsets gfx9 does not work with negative offsets, gfx10 works only with aligned negative offsets, but not with unaligned negative offsets. This is slightly more conservative than needed, gfx9 does support negative offsets when a VGPR address is used and gfx10 supports negative, unaligned offsets when an SGPR address is used, but we do not make use of that with this patch. Differential Revision: https://reviews.llvm.org/D101292	2021-05-07 14:51:32 +02:00
David Stuttard	606d4e8061	AMDGPU: Correct const_index_stride for wave 32 for PAL ABI Retrying after revert and fix (removed implicit def flag from operand). Now passes with expensive_checks enabled. Since there is a single scratch resource descriptor for all shaders, if there is a wave32 and a wave64 shader (for instance for VsFs pairs) then the const_index_stride will be incorrect for wave32 shaders. Differential Revision: https://reviews.llvm.org/D101830 Change-Id: Ie3b8b2921237968caca91527dd0c97b1b0cc0360	2021-05-07 13:42:57 +01:00
Simon Pilgrim	280aa3415e	[DAG] Add a generic expansion for SHIFT_PARTS opcodes using funnel shifts Based off a discussion on D89281 - where the AARCH64 implementations were being replaced to use funnel shifts. Any target that has efficient funnel shift lowering can handle the shift parts expansion using the same expansion, avoiding a lot of duplication. I've generalized the X86 implementation and moved it to TargetLowering - so far I've found that AARCH64 and AMDGPU benefit, but many other targets (ARM, PowerPC + RISCV in particular) could easily use this with a few minor improvements to their funnel shift lowering (or the folding of their target ops that funnel shifts lower to). NOTE: I'm trying to avoid adding full SHIFT_PARTS legalizer handling as I think it might actually be possible to remove these opcodes in the medium-term and use funnel shift / libcall expansion directly. Differential Revision: https://reviews.llvm.org/D101987	2021-05-07 13:12:30 +01:00
David Stuttard	793b4b2603	Revert "AMDGPU: Correct const_index_stride for wave 32 for PAL ABI" This reverts commit `442de0c1ad`.	2021-05-07 12:49:17 +01:00
David Stuttard	442de0c1ad	AMDGPU: Correct const_index_stride for wave 32 for PAL ABI Since there is a single scratch resource descriptor for all shaders, if there is a wave32 and a wave64 shader (for instance for VsFs pairs) then the const_index_stride will be incorrect for wave32 shaders. Differential Revision: https://reviews.llvm.org/D101830 Change-Id: Id8de5566b0d1a07a814e2e7db016df9d20bf6d2c	2021-05-07 12:19:49 +01:00
Stanislav Mekhanoshin	c714d03785	[AMDGPU] Expose __builtin_amdgcn_perm for v_perm_b32 Differential Revision: https://reviews.llvm.org/D102022	2021-05-06 16:17:33 -07:00
Carl Ritson	67cfefebbb	[AMDGPU] Fix WQM failure with single block inactive demote Instruction test for inactive kill/demote needs to be based on actual opcode not whether instruction would be lowered to demote. Reviewed By: piotr Differential Revision: https://reviews.llvm.org/D101966	2021-05-06 21:02:26 +09:00
Simon Pilgrim	0fdce16efb	[AMDGPU] Regenerate fp2int tests. NFCI.	2021-05-06 11:55:14 +01:00
Simon Pilgrim	20e976e248	[AMDGPU] Regenerate shift tests. NFCI.	2021-05-06 11:55:13 +01:00
Jay Foad	7c706af03b	[AMDGPU] SIFoldOperands: clean up tryConstantFoldOp First clean up the strange API of tryConstantFoldOp where it took an immediate operand value, but no indication of which operand it was the value for. Second clean up the loop that calls tryConstantFoldOp so that it does not have to restart from the beginning every time it folds an instruction. This is NFCI but there are some minor changes caused by the order in which things are folded. Differential Revision: https://reviews.llvm.org/D100031	2021-05-06 09:55:22 +01:00
Stanislav Mekhanoshin	ab90ae6f47	[AMDGPU] Switch AnnotateUniformValues to MemorySSA This shall speedup compilation and also remove threshold limitations used by memory dependency analysis. It also seem to fix the bug in the coalescer_remat.ll where an SMRD load was used in presence of a potentially clobbering store. Fixes: SWDEV-272132 Differential Revision: https://reviews.llvm.org/D101962	2021-05-05 18:34:41 -07:00
Austin Kerbow	6617a5a5ea	[AMDGPU] Move insertion of function entry waitcnt later This allows tracking these as preexisting waitcnt. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D101380	2021-05-05 17:58:38 -07:00
Austin Kerbow	f5199d7ae0	[AMDGPU] Revise handling of preexisting waitcnt Preexisting waitcnt may not update the scoreboard if the instruction being examined needed to wait on fewer counters than what was encoded in the old waitcnt instruction. Fixing this results in the elimination of some redudnat waitcnt. These changes also enable combining consecutive waitcnt into a single S_WAITCNT or S_WAITCNT_VSCNT instruction. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D100281	2021-05-05 17:21:33 -07:00
RamNalamothu	41f8b8e807	[MCAsmInfo] Support UsesCFIForDebug for targets with no exception handling This change enables emitting CFI unwind information for debugging purpose for targets with MCAsmInfo::ExceptionsType == ExceptionHandling::None. Currently generating CFI unwind information is entangled with supporting the exceptions, even when AsmPrinter explicitly recognizes that the unwind tables are being generated as debug information. In fact, the unwind information is not generated even if we specify --force-dwarf-frame-section, unless exceptions are enabled. The LIT test llvm/test/CodeGen/AMDGPU/debug_frame.ll demonstrates this behavior. Enable this option for AMDGPU to prepare for future patches which add complete CFI support. Reviewed By: dblaikie, MaskRay Differential Revision: https://reviews.llvm.org/D78778	2021-05-06 04:53:45 +05:30
Matt Arsenault	b6d244e5b8	AMDGPU: Fix lit test	2021-05-05 18:41:18 -04:00
Vang Thao	7a41639c60	[AMDGPU][GlobalISel] Widen 1 and 2 byte scalar loads Widen 1 and 2 byte scalar loads to 4 bytes when sufficiently aligned to avoid using a global load. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D100430	2021-05-05 15:18:19 -07:00
Matt Arsenault	ef5f0adecd	AMDGPU: Add a few more tail call tests Add some cases I noticed were missing when porting to GlobalISel. The cases that required any argument splitting did not work at first.	2021-05-05 17:55:02 -04:00
Stanislav Mekhanoshin	909a5ccf3b	[AMDGPU] Improve global SADDR selection An address can be a uniform sum of two i64 bit values. That regularly happens in a loop where index is an induction variable promoted to 64 bit by the LSR. We can materialize zero in a VGPR and still use SADDR form of the load. Differential Revision: https://reviews.llvm.org/D101591	2021-05-05 14:44:21 -07:00
Matt Arsenault	fa0b93b5a0	GlobalISel: Use DAG call lowering infrastructure in a more compatible way Unfortunately the current call lowering code is built on top of the legacy MVT/DAG based code. However, GlobalISel was not using it the same way. In short, the DAG passes legalized types to the assignment function, and GlobalISel was passing the original raw type if it was simple. I do believe the DAG lowering is conceptually broken since it requires picking a type up front before knowing how/where the value will be passed. This ends up being a problem for AArch64, which wants to pass i1/i8/i16 values as a different size if passed on the stack or in registers. The argument type decision is split across 3 different places which is hard to follow. SelectionDAG builder uses getRegisterTypeForCallingConv to pick a legal type, tablegen gives the illusion of controlling the type, and the target may have additional hacks in the C++ part of the call lowering. AArch64 hacks around this by not using the standard AnalyzeFormalArguments and special casing i1/i8/i16 by looking at the underlying type of the original IR argument. I believe people have generally assumed the calling convention code is processing the original types, and I've discovered a number of dead paths in several targets. x86 actually relies on the opposite behavior from AArch64, and relies on x86_32 and x86_64 sharing calling convention code where the 64-bit cases implicitly do not work on x86_32 due to using the pre-legalized types. AMDGPU targets without legal i16/f16 have always used a broken ABI that promotes to i32/f32. GlobalISel accidentally fixed this to be the ABI we should have, but this fixes it so we're using the worse ABI that is compatible with the DAG. Ideally we would fix the DAG to match the old GlobalISel behavior, but I don't wish to fight that battle. A new native GlobalISel call lowering framework should let the target process the incoming types directly. CCValAssigns select a "ValVT" and "LocVT" but the meanings of these aren't entirely clear. Different targets don't use them consistently, even within their own call lowering code. My current belief is the intent was "ValVT" is supposed to be the legalized value type to use in the end, and and LocVT was supposed to be the ABI passed type (which is also legalized). With the default CCState::Analyze functions always passing the same type for these arguments, these only differ when the TableGen part of the lowering decide to promote the type from one legal type to another. AArch64's i1/i8/i16 hack ends up inverting the meanings of these values, so I had to add an additional hack to let the target interpret how large the argument memory is. Since targets don't consistently interpret ValVT and LocVT, this doesn't produce quite equivalent code to the initial DAG lowerings. I've opted to consistently interpret LocVT as the in-memory size for stack passed values, and ValVT as the register type to assign from that memory. We therefore produce extending loads directly out of the IRTranslator, whereas the DAG would emit regular loads of smaller values. This will also produce loads/stores that are wider than the argument value if the allocated stack slot is larger (and there will be undef padding bytes). If we had the optimizations to reduce load/stores based on truncated values, this wouldn't produce a different end result. Since ValVT/LocVT are more consistently interpreted, we now will emit more G_BITCASTS as requested by the CCAssignFn. For example AArch64 was directly assigning types to some physical vector registers which according to the tablegen spec should have been casted to a vector with a different element type. This also moves the responsibility for inserting G_ASSERT_SEXT/G_ASSERT_ZEXT from the target ValueHandlers into the generic code, which is closer to how SelectionDAGBuilder works. I had to xfail an x86 test since I don't see a quick way to fix it right now (I filed bug 50035 for this). It's broken independently of this change, and only triggers since now we end up with more ands which hit the improperly handled selection pattern. I also observed that FP arguments that need promotion (e.g. f16 passed as f32) are broken, and use regular G_TRUNC and G_ANYEXT. TLDR; the current call lowering infrastructure is bad and nobody has ever understood how it chooses types.	2021-05-05 17:35:02 -04:00
Michael Kitzan	a11489ae3e	[MachineCSE][NFC]: Refactor and comment on preventing CSE for isConvergent instrs - Move the code preventing CSE of `isConvergent` instrs into `ProcessBlockCSE` (from `isProfitableToCSE`) - Add comments explaining why `isConvergent` is used to prevent CSE of non-local instrs in MachineCSE and the new test	2021-05-05 14:22:03 -07:00
Stanislav Mekhanoshin	4c178d809b	[AMDGPU] Pre-commit 2 new saddr load tests. NFC.	2021-05-05 09:16:11 -07:00
Baptiste Saleil	83646f60a8	[AMDGPU] Fix llc pipeline lit test for bots enabling expensive checks	2021-05-05 10:57:58 -04:00
Jay Foad	f106fe5f23	[AMDGPU] Autogenerate checks for a clustering test and add GFX10	2021-05-05 13:18:17 +01:00
Julien Pagès	a1ed39df96	[AMDGPU] Select V_CVT_*16_F16 more often Improve the code generation of fp_to_sint and fp_to_uint for integer on 16-bits. Differential Revision: https://reviews.llvm.org/D101481 Patch by Julien Pagès!	2021-05-05 08:57:51 +01:00
Baptiste Saleil	845c8a60e9	[AMDGPU] Add rm line to lit test to cleanup bots	2021-05-04 18:27:50 -04:00
Baptiste Saleil	a018bd5199	[AMDGPU] Fix lit failure introduced by `6a17609157`	2021-05-04 17:25:58 -04:00
Baptiste Saleil	6a17609157	[AMDGPU] Disable the scalar IR, SDWA and load store vectorizer passes at -O1 This patch disables some of the passes at -O1. These passes have a significant impact on compilation time, so we only want them to be enabled starting from -O2. Differential Revision: https://reviews.llvm.org/D101414	2021-05-04 16:44:39 -04:00
Christudasan Devadasan	80c79035ef	DAG: Cleanup assertion in EmitFuncArgumentDbgValue Removing an assertion introduced with D68945. The patch was later reverted with `6531a78ac4`, but failed to remove this assertion. It causes a problem while trying to split a 64-bit argument into sub registers. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D101594	2021-05-04 21:48:58 +05:30
Stanislav Mekhanoshin	4d6ebe8ac0	[AMDGPU] Change FLAT Scratch SADDR to VADDR form in moveToVALU Extend the legalization of global SADDR loads and stores with changing to VADDR to the FLAT scratch instructions. Differential Revision: https://reviews.llvm.org/D101408	2021-05-03 10:57:14 -07:00
Stanislav Mekhanoshin	89a94be16b	[AMDGPU] Change FLAT SADDR to VADDR form in moveToVALU Instead of legalizing saddr operand with a readfirstlane when address is moved from SGPR to VGPR we can just change the opcode. Differential Revision: https://reviews.llvm.org/D101405	2021-05-03 10:36:26 -07:00
Sebastian Neubauer	c0c8548b70	[AMDGPU] Do not annotate features for graphics SITargetLowering::LowerFormalArguments asserts that none of these features are used for graphics calling conventions, so AnnotateKernelFeatures should not add them. Differential Revision: https://reviews.llvm.org/D101534	2021-05-03 10:33:11 +02:00
Jay Foad	7e43483dd1	[AMDGPU] Remove set_gpr_idx instructions in conditional blocks SIPreEmitPeephole did not try to remove redundant s_set_gpr_idx_* instructions in blocks that end with a conditional branch instruction. This seems like a simple oversight. Differential Revision: https://reviews.llvm.org/D101629	2021-04-30 22:15:45 +01:00
Jay Foad	e2a2df2a1e	[AMDGPU] Add test for set_gpr_idx removal with conditional branches	2021-04-30 15:01:32 +01:00
Jay Foad	181c492ee7	[AMDGPU] Add implicit negative check for the set_gpr_idx tests The only effect of the optimization is to remove s_set_gpr_idx_* instructions, and update_mir_test_checks.py always inserts CHECK: rather than CHECK-NEXT: checks, so without this implicit negative check, the tests would always pass even if the optimization did nothing. Differential Revision: https://reviews.llvm.org/D101622	2021-04-30 14:45:12 +01:00
Jay Foad	66b8a16cc0	[AMDGPU] Fix inconsistent ---/... in MIR tests and regenerate checks In some cases the lack of --- or ... confused update_mir_test_checks.py into not adding any checks for a function.	2021-04-30 14:10:50 +01:00
Christudasan Devadasan	544be70864	[AMDGPU] Skip promote-alloca for insertelement/insertvalue users It is difficult to track the users of vector and aggregate types. Reviewed by: arsenm Differential Revision: https://reviews.llvm.org/D101562	2021-04-30 08:37:26 +05:30
Matt Arsenault	e6701e575c	AMDGPU: Add missing runline to test There are checks for gfx908, but this wasn't actually running with it.	2021-04-29 20:59:22 -04:00
Jay Foad	16d707e656	[AMDGPU] Fix v_swap_b32 formation on physical registers As explained in the comments, matchSwap matches: // mov t, x // mov x, y // mov y, t and turns it into: // mov t, x (t is potentially dead and move eliminated) // v_swap_b32 x, y On physical registers we don't have full use-def chains so the check for T being live-out was not working properly with subregs/superregs. Differential Revision: https://reviews.llvm.org/D101546	2021-04-29 20:53:40 +01:00
Petar Avramovic	c34900e133	AMDGPU/GlobalISel: Fix selection of image intrinsics with unused return When atomic image intrinsic return value is unused, register class for destination of a sub-register copy of return value ends up not being set. This copy then hits 'Register class not set' assert later. If return value has uses, register class is determined by use instruction. Fix is to not create sub-register copy when image intrinsic destination has no uses because it would be deleted by dead-mi-elimination later anyway. Differential Revision: https://reviews.llvm.org/D101448	2021-04-29 20:56:03 +02:00
Jay Foad	1ecddddbec	[AMDGPU] Add a v_swap_b32 test case to be fixed	2021-04-29 16:03:15 +01:00
Joe Nash	168228d76a	[AMDGPU] Make some VOP3 insts commutable Note, only src0 and src1 will be commuted if the isCommutable flag is set. This patch does not change that, it just makes it possible to commute src0 and src1 of some U/I/B vop3 instructions. This patch revises `d35d8da7d6`. It contains the commute opportunities excluding float insts Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D101474 Change-Id: I62938173d750453839f2457a3851661a29135faf	2021-04-28 13:59:08 -04:00
Petar Avramovic	8110fcc8fc	AMDGPU/GlobalISel: Fix negative offset folding for buffer_load Buffer_load does unsigned offset calculations. Don't fold operands of 32-bit add that are likely to cause unsigned add overflow (common case is when one of the operands is negative). Differential Revision: https://reviews.llvm.org/D91336	2021-04-27 14:45:22 +02:00
Petar Avramovic	6a3e1b3531	AMDGPU/GlobalISel: Add test for buffer_load with negative offset Pre-commit test for D91336.	2021-04-27 14:45:21 +02:00
Petar Avramovic	fb7be0d912	AMDGPU/GlobalISel: Remove redundant G_FCANONICALIZE Add basic version of isCanonicalized for global-isel. Copied from sdag. Add post legalizer combine that deletes G_FCANONICALIZE when its input is already Canonicalized. Differential Revision: https://reviews.llvm.org/D96605	2021-04-27 12:26:37 +02:00
Petar Avramovic	4a9bc59867	AMDGPU/GlobalISel: Add integer med3 combines Add signed and unsigned integer version of med3 combine. Source pattern is min(max(Val, K0), K1) or max(min(Val, K1), K0) where K0 and K1 are constants and K0 <= K1. Destination is med3 that corresponds to signedness of min/max in source. Differential Revision: https://reviews.llvm.org/D90050	2021-04-27 11:52:23 +02:00
Baptiste Saleil	caf1294d95	[AMDGPU] Experiments show that the GCNRegBankReassign pass significantly impacts the compilation time and there is no case for which we see any improvement in performance. This patch removes this pass and its associated test cases from the tree. Differential Revision: https://reviews.llvm.org/D101313 Change-Id: I0599169a7609c19a887f8d847a71e664030cc141	2021-04-26 17:21:49 -04:00
Sebastian Neubauer	9579af2bd7	[AMDGPU] Fix autogenerated wwm-reserved-spill.ll Due to a bug in update_llc_test_checks.py, the test is wrongly coalesced between run lines. Remove common check prefix to fix that. NFC.	2021-04-26 19:09:09 +02:00
Sebastian Neubauer	fcc40d9c17	[AMDGPU] Use MapVector for WWMReservedRegs Use MapVector instead of SmallDenseMap because it has a deterministic iteration order. Differential Revision: https://reviews.llvm.org/D101299	2021-04-26 17:43:00 +02:00
Michael Kitzan	59f2dd5f1a	[MachineCSE] Prevent CSE of non-local convergent instrs At the moment, MachineCSE allows CSE-ing convergent instrs which are non-local to each other. This can cause illegal codegen as convergent instrs are control flow dependent. The patch prevents non-local CSE of convergent instrs by adding a check in isProfitableToCSE and rejecting CSE-ing if we're considering CSE-ing non-local convergent instrs. We can still CSE convergent instrs which are in the same control flow scope, so the patch purposely does not make all convergent instrs non-CSE candidates in isCSECandidate. https://reviews.llvm.org/D101187	2021-04-23 16:44:48 -07:00
Sebastian Neubauer	3366d81153	[AMDGPU] Save WWM registers in functions The values of registers in inactive lanes needs to be saved during function calls. Save all registers used for whole wave mode, similar to how it is done for VGPRs that are used for SGPR spilling. Differential Revision: https://reviews.llvm.org/D99429 Reapply with fixed tests on window.	2021-04-23 18:09:24 +02:00
Jay Foad	5802cbefc1	[AMDGPU] Fix typo in implicit operand lists Several tests had a typo where they mentioned sgpr17 twice instead of sgpr17 and sgpr27. This had a significant effect on the "scavenge_sgpr_pei_no_sgprs" tests because there was actually an sgpr available, namely sgpr27. Differential Revision: https://reviews.llvm.org/D100960	2021-04-23 15:44:17 +01:00
Sebastian Neubauer	22d99cb63f	Revert "[AMDGPU] Save WWM registers in functions" This reverts commit `91464c30bf`. Seems to break tests on windows.	2021-04-23 16:38:50 +02:00
Piotr Sobczak	83a3395b30	[AMDGPU][NFC] Update auto-gen test Most likely the "glc" was not added to the test when the volatile loads started generating those bits.	2021-04-23 16:33:16 +02:00
Sebastian Neubauer	91464c30bf	[AMDGPU] Save WWM registers in functions The values of registers in inactive lanes needs to be saved during function calls. Save all registers used for whole wave mode, similar to how it is done for VGPRs that are used for SGPR spilling. Differential Revision: https://reviews.llvm.org/D99429	2021-04-23 16:09:31 +02:00
Matt Arsenault	b58332774f	AMDGPU: Fix assert on inline asm on gfx90a This was assuming all mayLoad instructions have one def.	2021-04-23 09:00:25 -04:00
Matt Arsenault	ed633a1daa	AMDGPU: Restore atomic fp feature on FP atomic instruction definitions `9931b1f7a4` switched this to checking for the two specific subtargets, instead of the dedicated feature. This broke supporting functions which force added the feature when emitting targets that do not actually support them. This stil does not work for the targets that use the gfx6/7 or gfx10 encodings.	2021-04-22 21:32:01 -04:00
Matt Arsenault	987e52851e	AMDGPU: Fix assert when trying to fold reg_sequence of physreg copies	2021-04-21 21:58:18 -04:00
Matt Arsenault	70ab76a81b	AMDGPU: Fix indirect tail calls Fix a selection error on uniform callees, and use a regular call if divergent.	2021-04-21 09:15:24 -04:00
Jay Foad	ec8c61efdf	[AMDGPU] Allow multiple uses of the same literal In GFX10 VOP3 can have a literal, which opens up the possibility of two operands using the same literal value, which is allowed and only counts as one use of the constant bus. AMDGPUAsmParser::validateConstantBusLimitations already knew about this but SIInstrInfo::verifyInstruction did not. Differential Revision: https://reviews.llvm.org/D100770	2021-04-20 16:44:01 +01:00
Matt Arsenault	1cb8a9d595	AMDGPU/GlobalISel: Fix uitofp/sitofp with non-power-of-2 integers	2021-04-20 11:13:29 -04:00
Jay Foad	b22721f01a	[AMDGPU] GCNDPPCombine: don't shrink V_ADD_CO_U32 if carry out is used Don't shrink VOP3 instructions if there are any uses of a carry-out operand, because the shrunken form of the instruction would write the carry-out to vcc instead of to a virtual register. Differential Revision: https://reviews.llvm.org/D100760	2021-04-20 09:17:52 +01:00
madhur13490	6a4d9cb7e0	[AMDGPU] Remove error check for indirect calls and add missing queue-ptr This patch removes -fixed-abi check for indirect calls and also adds queue-ptr which is required for indirect calls to work. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D100633	2021-04-20 00:35:17 +05:30
Jay Foad	ef443390a9	[AMDGPU] Remove MachineDCE after SIFoldOperands Remove the MachineDCE pass after the first SIFoldOperands pass now that SIFoldOperands deletes its own dead instructions. Reapply after fixing dependent change D100188. Differential Revision: https://reviews.llvm.org/D100189	2021-04-19 12:08:02 +01:00
Philip Reames	f549176ad9	[funcattrs] Add the maximal set of implied attributes to definitions Have funcattrs expand all implied attributes into the IR. This expands the infrastructure from D100400, but for definitions not declarations this time. Somewhat subtly, this mostly isn't semantic. Because the accessors did the inference, any client which used the accessor was already getting the stronger result. Clients that directly checked presence of attributes (there are some), will see a stronger result now. The old behavior can end up quite confusing for two reasons: * Without this change, we have situations where function-attrs appears to fail when inferring an attribute (as seen by a human reading IR), but that consuming code will see that it should have been implied. As a human trying to sanity check test results and study IR for optimization possibilities, this is exceeding error prone and confusing. (I'll note that I wasted several hours recently because of this.) * We can have transforms which trigger without the IR appearing (on inspection) to meet the preconditions. This change doesn't prevent this from happening (as the accessors still involve multiple checks), but it should make it less frequent. I'd argue in favor of deleting the extra checks out of the accessors after this lands, but I want that in it's own review as a) it's purely stylistic, and b) I already know there's some disagreement. Once this lands, I'm also going to do a cleanup change which will delete some now redundant duplicate predicates in the inference code, but again, that deserves to be a change of it's own. Differential Revision: https://reviews.llvm.org/D100226	2021-04-16 14:22:19 -07:00
Stelios Ioannou	bf147c4653	[LSR] Fix for pre-indexed generated constant offset This patch changed the isLegalUse check to ensure that LSRInstance::GenerateConstantOffsetsImpl generates an offset that results in a legal addressing mode and formula. The check is changed to look similar to the assert check used for illegal formulas. Differential Revision: https://reviews.llvm.org/D100383 Change-Id: Iffb9e32d59df96b8f072c00f6c339108159a009a	2021-04-15 16:44:42 +01:00
Sebastian Neubauer	7842e1725e	[AMDGPU] Fix large return values with amdgpu_gfx Returning in memory is not supported, so fall back to sret. Also, extend i1 and i16 to i32. Otherwise, they would be passed through memory. Differential Revision: https://reviews.llvm.org/D100543	2021-04-15 14:57:56 +02:00
hsmahesha	4973b0c4e7	[AMDGPU] Disable forceful inline of non-kernel functions which use LDS. Now since LDS uses within non-kernel functions are being handled in the pass - LowerModuleLDS, we NO need to forcefully inline non-kernel functions just because they use LDS. Do forceful inlining only when the pass - LowerModuleLDS is not enabled. It is enabled by default. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D100481	2021-04-15 09:12:56 +05:30
Philip Reames	dd985551c2	Reapply "[InferAttributes] Materialize all infered attributes for declaration"" and follow on patches. This reverts commit `ab98f2c712` and `98eea392cd`. It includes a fix for the clang test which triggered the revert. I failed to notice this one because there was another AMDGPU llvm test with a similiar name and the exact same text in the error message. Odd. Since only one build bot reported the clang test, I didn't notice that one.	2021-04-14 16:38:07 -07:00
Nico Weber	98eea392cd	Revert "Fix buildbots after 61a85da" This reverts commit `c609d53363`. `61a85da` was reverted in `ab98f2c7`	2021-04-14 18:47:46 -04:00
Philip Reames	c609d53363	Fix buildbots after `61a85da`	2021-04-14 15:16:05 -07:00
hsmahesha	e3070db0f7	[AMDGPU] Rename "LDS lowering" pass name. Rename the name of "LDS lowering" pass from `amdgpu-disable-lower-module-lds` to `amdgpu-enable-lower-module-lds` as later is consistent and reads better. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D100441	2021-04-14 20:19:53 +05:30
Sebastian Neubauer	929edd4375	[AMDGPU] Mark scavenged SGPR as used Otherwise it reuses the same register for storing the stack slot offset if the stack slot offset is big. Differential Revision: https://reviews.llvm.org/D100461	2021-04-14 14:55:01 +02:00
madhur13490	5682ae2fc6	[AMDGPU] Set implicit arg attributes for indirect calls This patch adds attributes corresponding to implicits to functions/kernels if 1. it has an indirect call OR 2. it's address is taken. Once such attributes are set, rest of the codegen would work out-of-box for indirect calls. This patch eliminates the potential overhead -fixed-abi imposes even though indirect functions calls are not used. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D99347	2021-04-13 13:15:13 +00:00
Sanjay Patel	661cc71a1c	[PassManager][PhaseOrdering] lower expects before running simplifyCFG Retry of `330619a3a6` that includes a clang test update. Original commit message: If we run passes before lowering llvm.expect intrinsics to metadata, then those passes have no way to act on the hints provided by llvm.expect. SimplifyCFG is the known offender, and we made it smarter about profile metadata in D98898 <https://reviews.llvm.org/D98898>. In the motivating example from https://llvm.org/PR49336 , this means we were ignoring the recommended method for a programmer to tell the compiler that a compare+branch is expensive. This change appears to solve that case - the metadata survives to the backend, the compare order is as expected in IR, and the backend does not do anything to reverse it. We make the same change to the old pass manager to keep things synchronized. Differential Revision: https://reviews.llvm.org/D100213	2021-04-12 15:07:53 -04:00
Sanjay Patel	23ac9d1e6e	Revert "[PassManager][PhaseOrdering] lower expects before running simplifyCFG" This reverts commit `330619a3a6`. There are clang tests that also need to be updated.	2021-04-12 13:58:54 -04:00
Sanjay Patel	330619a3a6	[PassManager][PhaseOrdering] lower expects before running simplifyCFG If we run passes before lowering llvm.expect intrinsics to metadata, then those passes have no way to act on the hints provided by llvm.expect. SimplifyCFG is the known offender, and we made it smarter about profile metadata in D98898. In the motivating example from https://llvm.org/PR49336 , this means we were ignoring the recommended method for a programmer to tell the compiler that a compare+branch is expensive. This change appears to solve that case - the metadata survives to the backend, the compare order is as expected in IR, and the backend does not do anything to reverse it. We make the same change to the old pass manager to keep things synchronized. Differential Revision: https://reviews.llvm.org/D100213	2021-04-12 12:23:31 -04:00
Sebastian Neubauer	6cc91adf1e	[AMDGPU] Kill temporary register after restoring Not a correctness issue, but the temporary register is not used afterwards and should be dead. Differential Revision: https://reviews.llvm.org/D100295	2021-04-12 14:20:03 +02:00
Sebastian Neubauer	b76c2a6c2b	[AMDGPU] Fix saving fp and bp Spilling the fp or bp to scratch could overwrite VGPRs of inactive lanes. Fix that by using only the active lanes of the scavenged VGPR. This builds on the assumptions that 1. a function is never called with exec=0 2. lanes do not die in a function, i.e. exec!=0 in the function epilog 3. no new lanes are active when exiting the function, i.e. exec in the epilog is a subset of exec in the prolog. Differential Revision: https://reviews.llvm.org/D96869	2021-04-12 11:52:55 +02:00
Sebastian Neubauer	ca3bae94c4	[AMDGPU] Autogenerate test. NFC	2021-04-12 11:51:28 +02:00
Sebastian Neubauer	32bc9a9bc3	[AMDGPU] Unify spill code Instead of reimplementing spilling in prolog and epilog, reuse buildSpillLoadStore. Reviewed By: scott.linder Differential Revision: https://reviews.llvm.org/D99269	2021-04-12 11:19:08 +02:00
Sebastian Neubauer	f9a8c6a0e5	[AMDGPU] Save VGPR of whole wave when spilling Spilling SGPRs to scratch uses a temporary VGPR. LLVM currently cannot determine if a VGPR is used in other lanes or not, so we need to save all lanes of the VGPR. We even need to save the VGPR if it is marked as dead. The generated code depends on two things: - Can we scavenge an SGPR to save EXEC? - And can we scavenge a VGPR? If we can scavenge an SGPR, we - save EXEC into the SGPR - set the needed lane mask - save the temporary VGPR - write the spilled SGPR into VGPR lanes - save the VGPR again to the target stack slot - restore the VGPR - restore EXEC If we were not able to scavenge an SGPR, we do the same operations, but everytime the temporary VGPR is written to memory, we - write VGPR to memory - flip exec (s_not exec, exec) - write VGPR again (previously inactive lanes) Surprisingly often, we are able to scavenge an SGPR, even though we are at the brink of running out of SGPRs. Scavenging a VGPR does not have a great effect (saves three instructions if no SGPR was scavenged), but we need to know if the VGPR we use is live before or not, otherwise the machine verifier complains. Differential Revision: https://reviews.llvm.org/D96336	2021-04-12 11:01:38 +02:00
dfukalov	8f4b7e94a2	[AMDGPU][CostModel] Refine cost model for control-flow instructions. Added cost estimation for switch instruction, updated costs of branches, fixed phi cost. Had to increase `-amdgpu-unroll-threshold-if` default value since conditional branch cost (size) was corrected to higher value. Test renamed to "control-flow.ll". Removed redundant code in `X86TTIImpl::getCFInstrCost()` and `PPCTTIImpl::getCFInstrCost()`. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D96805	2021-04-10 09:20:24 +03:00
Mitch Phillips	092f288d36	Revert "[AMDGPU] Remove MachineDCE after SIFoldOperands" This reverts commit `5a0117b2d0`. Reason: Dependent change `d19a42eba9` broke the ASan buildbots.	2021-04-09 15:47:44 -07:00
Jay Foad	5a0117b2d0	[AMDGPU] Remove MachineDCE after SIFoldOperands Remove the MachineDCE pass after the first SIFoldOperands pass now that SIFoldOperands deletes its own dead instructions. Differential Revision: https://reviews.llvm.org/D100189	2021-04-09 20:41:09 +01:00
Stanislav Mekhanoshin	034fe0e03d	[AMDGPU] Added udot2 op_sel test. NFC.	2021-04-09 12:19:42 -07:00
Jay Foad	a4ced03d34	[AMDGPU] SIFoldOperands: eagerly delete dead copies This is cheap to implement, means less work for future passes like MachineDCE, and slightly improves the folding in some cases. Differential Revision: https://reviews.llvm.org/D100117	2021-04-09 13:52:54 +01:00
Philip Reames	35393c865c	[funcattrs] Infer nosync from instruction walk Pretty straightforward use of existing infrastructure and port of the attributor inference rules for nosync. A couple points of interest: * I deliberately switched from "monotonic or better" to "unordered or better". This is simply me being conservative and is better in line with the rest of the optimizer. We treat monotonic conservatively pretty much everywhere. * The operand bundle test change is suspicious. It looks like we might have missed something here, but if so, it's an issue with the existing nofree inference as well. I'm going to take a closer look at that separately. * I needed to keep the previous inference from readnone. This surprised me, but made sense once I realized readonly inference goes to lengths to reason about local vs non-local memory and that writes to local memory are okay. This is fine for the purpose of nosync, but would e.g. prevent us from inferring nofree from readnone - which is slightly surprising. Differential Revision: https://reviews.llvm.org/D99769	2021-04-08 14:05:00 -07:00
Konstantin Zhuravlyov	4fae63c612	AMDGPU: Add gfx90c support to code object v2 for backwards compatibility Differential Revision: https://reviews.llvm.org/D100126	2021-04-08 16:42:43 -04:00
Stanislav Mekhanoshin	627dab3dbf	[AMDGPU] Check for all meta instrs in GCNRegBankReassign It used to work correctly even with a KILL, but there is no reason to consider meta instructions since they do not create real HW uses. Differential Revision: https://reviews.llvm.org/D100135	2021-04-08 13:41:10 -07:00
Nikita Popov	59a2f67011	[LoopRotate] Don't split loop pass manager After D99249 we use three different loop pass managers for LICM, LoopRotate and LICM+LoopUnswitch. This happens because LazyBFI and LazyBPI are not preserved by LoopRotate (note that D74640 is no longer needed). Avoid this by marking them as preserved. My understanding of D86156 is that it is okay to simply preserve them (which LoopUnswitch already does for the same reason) and rely on callbacks to deal with deleted blocks. Differential Revision: https://reviews.llvm.org/D99843	2021-04-08 22:05:18 +02:00
Stanislav Mekhanoshin	189310a140	[AMDGPU] Allow -amdgpu-unsafe-fp-atomics to ignore denorm mode Fixes: SWDEV-274276 Differential Revision: https://reviews.llvm.org/D100072	2021-04-08 12:46:36 -07:00
Jay Foad	e184eeaa3b	[AMDGPU] Add some implicit uses to tests. NFC. This is just to stop a future patch from optimizing away the things that we actually want to check for.	2021-04-08 16:37:48 +01:00
Jay Foad	c28f79a0e3	[AMDGPU] SIFoldOperands: try harder to fold cndmask instructions Look through copies to find more cases where the two values being selected are identical. The motivation for this is just to be able to remove the weird special case where tryFoldCndMask was called from foldInstOperand, part way through folding a move-immediate into its users, without regressing any lit tests.	2021-04-08 14:26:12 +01:00
Sebastian Neubauer	c10cc4ea27	[AMDGPU] Fix computing live registers in prolog ScratchExecCopy needs to be marked as live, we cannot use that register while EXEC is stored in there. Marking SGPRForFPSaveRestoreCopy and SGPRForBPSaveRestoreCopy as available is unnecessary, they should not be live at that point anway. Differential Revision: https://reviews.llvm.org/D100098	2021-04-08 14:52:50 +02:00
Thomas Preud'homme	04419628e0	[AMDGPU, test] Fix use of undef FileCheck var Test CodeGen/AMDGPU/amdgpu.private-memory.ll and CodeGen/AMDGPU/private-memory-r600.ll have a block of CHECK directives whose prefix is inconsistent: R600-CHECK Vs R600. This leads to a R600-NOT directive using an undefined CHAN variable due to R600-CHECK directives never being considered by FileCheck. Fixing the prefix leads to the testcase failing. As per https://reviews.llvm.org/D99865#2675235 this commit removes the directives instead since it is not possible to write a reliable check. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D99865	2021-04-08 09:42:59 +01:00
hsmahesha	ac64995ceb	[AMDGPU] Only use ds_read/write_b128 for alignment >= 16 PS: Submitting on behalf of Jay. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D100008	2021-04-08 08:12:05 +05:30
hsmahesha	d5fee599c5	[AMDGPU] Add some exhaustive ds read/write alignment tests PS: Submitting on behalf of Jay. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D100007	2021-04-08 08:08:49 +05:30
Tony Tye	4658cd4c18	[AMDGPU] Update gfx90a memory model support Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D100070	2021-04-07 22:17:58 +00:00
Jay Foad	e9608a84d8	[AMDGPU][SDag] Add IMG init also for image_gather4 instructions This fixes an oversight in D99747 which moved the IMG init code from SIAddIMGInit to AdjustInstrPostInstrSelection, but did not set the hasPostISelHook flag on gather4 instructions. Differential Revision: https://reviews.llvm.org/D99953	2021-04-06 14:47:20 +01:00
Jay Foad	0bf4836dc4	[AMDGPU] Fix dubious regexes with unescaped brackets. NFC.	2021-04-06 13:17:41 +01:00
Jay Foad	6fec0a34ce	[AMDGPU] Fix typo in regular expression checks. NFC.	2021-04-06 12:29:48 +01:00
Jay Foad	6eb5b06ecf	[AMDGPU] Regenerate checks to fix prefixes broken in D96340. NFC.	2021-04-06 11:43:53 +01:00
Stanislav Mekhanoshin	30b3aab329	Copy syncscope when expanding atomicrmw into cmpxchg loop Fixes: SWDEV-280070 Differential Revision: https://reviews.llvm.org/D99902	2021-04-05 17:29:38 -07:00
Roman Lebedev	a26f1bf67e	[PassManager] Run additional LICM before LoopRotate Loop rotation often has to perform code duplication from header into preheader, which introduces PHI nodes. >>! In D99204, @thopre wrote: > > With loop peeling, it is important that unnecessary PHIs be avoided or > it will leads to spurious peeling. One source of such PHIs is loop > rotation which creates PHIs for invariant loads. Those PHIs are > particularly problematic since loop peeling is now run as part of simple > loop unrolling before GVN is run, and are thus a source of spurious > peeling. > > Note that while some of the load can be hoisted and eventually > eliminated by instruction combine, this is not always possible due to > alignment issue. In particular, the motivating example [1] was a load > inside a class instance which cannot be hoisted because the `this' > pointer has an alignment of 1. > > [1] http://lists.llvm.org/pipermail/llvm-dev/attachments/20210312/4ce73c47/attachment.cpp Now, we could enhance LoopRotate to avoid duplicating code when not needed, but instead hoist loop-invariant code, but isn't that a code duplication? (sic) We have LICM, and in fact we already run it right after LoopRotation. We could try to move it to before LoopRotation, that is basically free from compile-time perspective: https://llvm-compile-time-tracker.com/compare.php?from=6c93eb4477d88af046b915bc955c03693b2cbb58&to=a4bee6d07732b1184c436da489040b912f0dc271&stat=instructions But, looking at stats, i think it isn't great that we would no longer do LICM after LoopRotation, in particular: \| statistic name \| LoopRotate-LICM \| LICM-LoopRotate \| Δ \| % \| abs(%) \| \| asm-printer.EmittedInsts \| 9015930 \| 9015799 \| -131 \| 0.00% \| 0.00% \| \| indvars.NumElimCmp \| 3536 \| 3544 \| 8 \| 0.23% \| 0.23% \| \| indvars.NumElimExt \| 36725 \| 36580 \| -145 \| -0.39% \| 0.39% \| \| indvars.NumElimIV \| 1197 \| 1187 \| -10 \| -0.84% \| 0.84% \| \| indvars.NumElimIdentity \| 143 \| 136 \| -7 \| -4.90% \| 4.90% \| \| indvars.NumElimRem \| 4 \| 5 \| 1 \| 25.00% \| 25.00% \| \| indvars.NumLFTR \| 29842 \| 29890 \| 48 \| 0.16% \| 0.16% \| \| indvars.NumReplaced \| 2293 \| 2227 \| -66 \| -2.88% \| 2.88% \| \| indvars.NumSimplifiedSDiv \| 6 \| 8 \| 2 \| 33.33% \| 33.33% \| \| indvars.NumWidened \| 26438 \| 26329 \| -109 \| -0.41% \| 0.41% \| \| instcount.TotalBlocks \| 1178338 \| 1173840 \| -4498 \| -0.38% \| 0.38% \| \| instcount.TotalFuncs \| 111825 \| 111829 \| 4 \| 0.00% \| 0.00% \| \| instcount.TotalInsts \| 9905442 \| 9896139 \| -9303 \| -0.09% \| 0.09% \| \| lcssa.NumLCSSA \| 425871 \| 423961 \| -1910 \| -0.45% \| 0.45% \| \| licm.NumHoisted \| 378357 \| 378753 \| 396 \| 0.10% \| 0.10% \| \| licm.NumMovedCalls \| 2193 \| 2208 \| 15 \| 0.68% \| 0.68% \| \| licm.NumMovedLoads \| 35899 \| 31821 \| -4078 \| -11.36% \| 11.36% \| \| licm.NumPromoted \| 11178 \| 11154 \| -24 \| -0.21% \| 0.21% \| \| licm.NumSunk \| 13359 \| 13587 \| 228 \| 1.71% \| 1.71% \| \| loop-delete.NumDeleted \| 8547 \| 8402 \| -145 \| -1.70% \| 1.70% \| \| loop-instsimplify.NumSimplified \| 12876 \| 11890 \| -986 \| -7.66% \| 7.66% \| \| loop-peel.NumPeeled \| 1008 \| 925 \| -83 \| -8.23% \| 8.23% \| \| loop-rotate.NumNotRotatedDueToHeaderSize \| 368 \| 365 \| -3 \| -0.82% \| 0.82% \| \| loop-rotate.NumRotated \| 42015 \| 42003 \| -12 \| -0.03% \| 0.03% \| \| loop-simplifycfg.NumLoopBlocksDeleted \| 240 \| 242 \| 2 \| 0.83% \| 0.83% \| \| loop-simplifycfg.NumLoopExitsDeleted \| 497 \| 20 \| -477 \| -95.98% \| 95.98% \| \| loop-simplifycfg.NumTerminatorsFolded \| 618 \| 336 \| -282 \| -45.63% \| 45.63% \| \| loop-unroll.NumCompletelyUnrolled \| 11028 \| 11032 \| 4 \| 0.04% \| 0.04% \| \| loop-unroll.NumUnrolled \| 12608 \| 12529 \| -79 \| -0.63% \| 0.63% \| \| mem2reg.NumDeadAlloca \| 10222 \| 10221 \| -1 \| -0.01% \| 0.01% \| \| mem2reg.NumPHIInsert \| 192110 \| 192106 \| -4 \| 0.00% \| 0.00% \| \| mem2reg.NumSingleStore \| 637650 \| 637643 \| -7 \| 0.00% \| 0.00% \| \| scalar-evolution.NumBruteForceTripCountsComputed \| 814 \| 812 \| -2 \| -0.25% \| 0.25% \| \| scalar-evolution.NumTripCountsComputed \| 283108 \| 282934 \| -174 \| -0.06% \| 0.06% \| \| scalar-evolution.NumTripCountsNotComputed \| 106712 \| 106718 \| 6 \| 0.01% \| 0.01% \| \| simple-loop-unswitch.NumBranches \| 5178 \| 4752 \| -426 \| -8.23% \| 8.23% \| \| simple-loop-unswitch.NumCostMultiplierSkipped \| 914 \| 503 \| -411 \| -44.97% \| 44.97% \| \| simple-loop-unswitch.NumSwitches \| 20 \| 18 \| -2 \| -10.00% \| 10.00% \| \| simple-loop-unswitch.NumTrivial \| 183 \| 95 \| -88 \| -48.09% \| 48.09% \| ... but that actually regresses LICM (-12% `licm.NumMovedLoads`), loop-simplifycfg (`NumLoopExitsDeleted`, `NumTerminatorsFolded`), simple-loop-unswitch (`NumTrivial`). What if we instead have LICM both before and after LoopRotate? \| statistic name \| LoopRotate-LICM \| LICM-LoopRotate-LICM \| Δ \| % \| abs(%) \| \| asm-printer.EmittedInsts \| 9015930 \| 9014474 \| -1456 \| -0.02% \| 0.02% \| \| indvars.NumElimCmp \| 3536 \| 3546 \| 10 \| 0.28% \| 0.28% \| \| indvars.NumElimExt \| 36725 \| 36681 \| -44 \| -0.12% \| 0.12% \| \| indvars.NumElimIV \| 1197 \| 1185 \| -12 \| -1.00% \| 1.00% \| \| indvars.NumElimIdentity \| 143 \| 146 \| 3 \| 2.10% \| 2.10% \| \| indvars.NumElimRem \| 4 \| 5 \| 1 \| 25.00% \| 25.00% \| \| indvars.NumLFTR \| 29842 \| 29899 \| 57 \| 0.19% \| 0.19% \| \| indvars.NumReplaced \| 2293 \| 2299 \| 6 \| 0.26% \| 0.26% \| \| indvars.NumSimplifiedSDiv \| 6 \| 8 \| 2 \| 33.33% \| 33.33% \| \| indvars.NumWidened \| 26438 \| 26404 \| -34 \| -0.13% \| 0.13% \| \| instcount.TotalBlocks \| 1178338 \| 1173652 \| -4686 \| -0.40% \| 0.40% \| \| instcount.TotalFuncs \| 111825 \| 111829 \| 4 \| 0.00% \| 0.00% \| \| instcount.TotalInsts \| 9905442 \| 9895452 \| -9990 \| -0.10% \| 0.10% \| \| lcssa.NumLCSSA \| 425871 \| 425373 \| -498 \| -0.12% \| 0.12% \| \| licm.NumHoisted \| 378357 \| 383352 \| 4995 \| 1.32% \| 1.32% \| \| licm.NumMovedCalls \| 2193 \| 2204 \| 11 \| 0.50% \| 0.50% \| \| licm.NumMovedLoads \| 35899 \| 35755 \| -144 \| -0.40% \| 0.40% \| \| licm.NumPromoted \| 11178 \| 11163 \| -15 \| -0.13% \| 0.13% \| \| licm.NumSunk \| 13359 \| 14321 \| 962 \| 7.20% \| 7.20% \| \| loop-delete.NumDeleted \| 8547 \| 8538 \| -9 \| -0.11% \| 0.11% \| \| loop-instsimplify.NumSimplified \| 12876 \| 12041 \| -835 \| -6.48% \| 6.48% \| \| loop-peel.NumPeeled \| 1008 \| 924 \| -84 \| -8.33% \| 8.33% \| \| loop-rotate.NumNotRotatedDueToHeaderSize \| 368 \| 365 \| -3 \| -0.82% \| 0.82% \| \| loop-rotate.NumRotated \| 42015 \| 42005 \| -10 \| -0.02% \| 0.02% \| \| loop-simplifycfg.NumLoopBlocksDeleted \| 240 \| 241 \| 1 \| 0.42% \| 0.42% \| \| loop-simplifycfg.NumTerminatorsFolded \| 618 \| 619 \| 1 \| 0.16% \| 0.16% \| \| loop-unroll.NumCompletelyUnrolled \| 11028 \| 11029 \| 1 \| 0.01% \| 0.01% \| \| loop-unroll.NumUnrolled \| 12608 \| 12525 \| -83 \| -0.66% \| 0.66% \| \| mem2reg.NumPHIInsert \| 192110 \| 192073 \| -37 \| -0.02% \| 0.02% \| \| mem2reg.NumSingleStore \| 637650 \| 637652 \| 2 \| 0.00% \| 0.00% \| \| scalar-evolution.NumTripCountsComputed \| 283108 \| 282998 \| -110 \| -0.04% \| 0.04% \| \| scalar-evolution.NumTripCountsNotComputed \| 106712 \| 106691 \| -21 \| -0.02% \| 0.02% \| \| simple-loop-unswitch.NumBranches \| 5178 \| 5185 \| 7 \| 0.14% \| 0.14% \| \| simple-loop-unswitch.NumCostMultiplierSkipped \| 914 \| 925 \| 11 \| 1.20% \| 1.20% \| \| simple-loop-unswitch.NumTrivial \| 183 \| 179 \| -4 \| -2.19% \| 2.19% \| \| simple-loop-unswitch.NumBranches \| 5178 \| 4752 \| -426 \| -8.23% \| 8.23% \| \| simple-loop-unswitch.NumCostMultiplierSkipped \| 914 \| 503 \| -411 \| -44.97% \| 44.97% \| \| simple-loop-unswitch.NumSwitches \| 20 \| 18 \| -2 \| -10.00% \| 10.00% \| \| simple-loop-unswitch.NumTrivial \| 183 \| 95 \| -88 \| -48.09% \| 48.09% \| I.e. we end up with less instructions, less peeling, more LICM activity, also note how none of those 4 regressions are here. Namely: \| statistic name \| LICM-LoopRotate \| LICM-LoopRotate-LICM \| Δ \| % \| abs(%) \| \| asm-printer.EmittedInsts \| 9015799 \| 9014474 \| -1325 \| -0.01% \| 0.01% \| \| indvars.NumElimCmp \| 3544 \| 3546 \| 2 \| 0.06% \| 0.06% \| \| indvars.NumElimExt \| 36580 \| 36681 \| 101 \| 0.28% \| 0.28% \| \| indvars.NumElimIV \| 1187 \| 1185 \| -2 \| -0.17% \| 0.17% \| \| indvars.NumElimIdentity \| 136 \| 146 \| 10 \| 7.35% \| 7.35% \| \| indvars.NumLFTR \| 29890 \| 29899 \| 9 \| 0.03% \| 0.03% \| \| indvars.NumReplaced \| 2227 \| 2299 \| 72 \| 3.23% \| 3.23% \| \| indvars.NumWidened \| 26329 \| 26404 \| 75 \| 0.28% \| 0.28% \| \| instcount.TotalBlocks \| 1173840 \| 1173652 \| -188 \| -0.02% \| 0.02% \| \| instcount.TotalInsts \| 9896139 \| 9895452 \| -687 \| -0.01% \| 0.01% \| \| lcssa.NumLCSSA \| 423961 \| 425373 \| 1412 \| 0.33% \| 0.33% \| \| licm.NumHoisted \| 378753 \| 383352 \| 4599 \| 1.21% \| 1.21% \| \| licm.NumMovedCalls \| 2208 \| 2204 \| -4 \| -0.18% \| 0.18% \| \| licm.NumMovedLoads \| 31821 \| 35755 \| 3934 \| 12.36% \| 12.36% \| \| licm.NumPromoted \| 11154 \| 11163 \| 9 \| 0.08% \| 0.08% \| \| licm.NumSunk \| 13587 \| 14321 \| 734 \| 5.40% \| 5.40% \| \| loop-delete.NumDeleted \| 8402 \| 8538 \| 136 \| 1.62% \| 1.62% \| \| loop-instsimplify.NumSimplified \| 11890 \| 12041 \| 151 \| 1.27% \| 1.27% \| \| loop-peel.NumPeeled \| 925 \| 924 \| -1 \| -0.11% \| 0.11% \| \| loop-rotate.NumRotated \| 42003 \| 42005 \| 2 \| 0.00% \| 0.00% \| \| loop-simplifycfg.NumLoopBlocksDeleted \| 242 \| 241 \| -1 \| -0.41% \| 0.41% \| \| loop-simplifycfg.NumLoopExitsDeleted \| 20 \| 497 \| 477 \| 2385.00% \| 2385.00% \| \| loop-simplifycfg.NumTerminatorsFolded \| 336 \| 619 \| 283 \| 84.23% \| 84.23% \| \| loop-unroll.NumCompletelyUnrolled \| 11032 \| 11029 \| -3 \| -0.03% \| 0.03% \| \| loop-unroll.NumUnrolled \| 12529 \| 12525 \| -4 \| -0.03% \| 0.03% \| \| mem2reg.NumDeadAlloca \| 10221 \| 10222 \| 1 \| 0.01% \| 0.01% \| \| mem2reg.NumPHIInsert \| 192106 \| 192073 \| -33 \| -0.02% \| 0.02% \| \| mem2reg.NumSingleStore \| 637643 \| 637652 \| 9 \| 0.00% \| 0.00% \| \| scalar-evolution.NumBruteForceTripCountsComputed \| 812 \| 814 \| 2 \| 0.25% \| 0.25% \| \| scalar-evolution.NumTripCountsComputed \| 282934 \| 282998 \| 64 \| 0.02% \| 0.02% \| \| scalar-evolution.NumTripCountsNotComputed \| 106718 \| 106691 \| -27 \| -0.03% \| 0.03% \| \| simple-loop-unswitch.NumBranches \| 4752 \| 5185 \| 433 \| 9.11% \| 9.11% \| \| simple-loop-unswitch.NumCostMultiplierSkipped \| 503 \| 925 \| 422 \| 83.90% \| 83.90% \| \| simple-loop-unswitch.NumSwitches \| 18 \| 20 \| 2 \| 11.11% \| 11.11% \| \| simple-loop-unswitch.NumTrivial \| 95 \| 179 \| 84 \| 88.42% \| 88.42% \| {F15983613} {F15983615} {F15983616} (this is vanilla llvm testsuite + rawspeed + darktable) As an example of the code where early LICM only is bad, see: https://godbolt.org/z/GzEbacs4K This does have an observable compile-time regression of +~0.5% geomean https://llvm-compile-time-tracker.com/compare.php?from=7c5222e4d1a3a14f029e5f614c9aefd0fa505f1e&to=5d81826c3411982ca26e46b9d0aff34c80577664&stat=instructions but i think that's basically nothing, and there's potential that it might be avoidable in the future by fixing clang to produce alignment information on function arguments, thus making the second run unneeded. Differential Revision: https://reviews.llvm.org/D99249	2021-04-02 11:11:42 +03:00
Philip Reames	a8ac8816c9	Update a test missed in `6ef4505`	2021-04-01 12:17:01 -07:00
Brendon Cahoon	65c8bfb509	[AMDGPU] Enable output modifiers for double precision instructions Update SIFoldOperands pass to recognize v_add_f64 and v_mul_f64 instructions for folding output modifiers. Differential Revision: https://reviews.llvm.org/D99505	2021-04-01 10:08:17 -04:00
Dmitry Preobrazhensky	cd953434f2	[AMDGPU][MC][GFX10][GFX90A] Corrected _e32/_e64 suffices Fixed bugs https://bugs.llvm.org//show_bug.cgi?id=49643, https://bugs.llvm.org//show_bug.cgi?id=49644, https://bugs.llvm.org//show_bug.cgi?id=49645. Differential Revision: https://reviews.llvm.org/D99413	2021-04-01 14:21:00 +03:00
Simonas Kazlauskas	777a58e05b	Support {S,U}REMEqFold before legalization This allows these optimisations to apply to e.g. `urem i16` directly before `urem` is promoted to i32 on architectures where i16 operations are not intrinsically legal (such as on Aarch64). The legalization then later can happen more directly and generated code gets a chance to avoid wasting time on computing results in types wider than necessary, in the end. Seems like mostly an improvement in terms of results at least as far as x86_64 and aarch64 are concerned, with a few regressions here and there. It also helps in preventing regressions in changes like {D87976}. Reviewed By: lebedev.ri Differential Revision: https://reviews.llvm.org/D88785	2021-04-01 01:35:41 +03:00
Jay Foad	b138cf115e	[AMDGPU] Add some image tests with enable-prt-strict-null disabled. NFC.	2021-03-31 17:27:20 +01:00
Jay Foad	a991ee330b	[AMDGPU] Use a common check prefix for some image tests. NFC.	2021-03-31 17:27:20 +01:00
Jay Foad	5d0e9ddfa5	[AMDGPU][GlobalISel] Add support for global atomicrmw fadd This includes gfx908 which only has a no-return version of the global_atomic_add_f32 instruction, using the same hack that was previously implemented for selecting from the llvm.amdgcn.global.atomic.fadd intrinsic. Differential Revision: https://reviews.llvm.org/D97767	2021-03-31 11:13:00 +01:00
Krasimir Georgiev	c51e91e046	Revert "[Passes] Add relative lookup table converter pass" This reverts commit `5178ffc7cf`. Compiling `llvm-profdata` with a compiler build from this produces a crashing binary.	2021-03-30 14:13:37 +02:00
Gulfem Savrun Yeniceri	5178ffc7cf	[Passes] Add relative lookup table converter pass Lookup tables generate non PIC-friendly code, which requires dynamic relocation as described in: https://bugs.llvm.org/show_bug.cgi?id=45244 This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly. Differential Revision: https://reviews.llvm.org/D94355	2021-03-29 21:53:32 +00:00
Joe Nash	45fd7c02af	Revert "[AMDGPU] Mark additional VOP3 as commutable" This reverts commit `d35d8da7d6`.	2021-03-29 14:48:11 -04:00
Joe Nash	d35d8da7d6	[AMDGPU] Mark additional VOP3 as commutable Note, only src0 and src1 will be commuted if the isCommutable flag is set. This patch does not change that, it just makes it possible to commute src0 and src1 of more instructions. Reviewed By: foad, rampitec Differential Revision: https://reviews.llvm.org/D99376 Change-Id: I61e20490962d95ea429beb355c55f55c024dafdc	2021-03-29 14:22:20 -04:00
Roger Ferrer Ibanez	489ca73ac4	[PrologEpilogInserter][AMDGPU] Only adjust offset for emergency spill slots if the stack grows down D89239 adjusts the stack offset of emergency spill slots for overaligned stacks. However the adjustment is not valid for targets whose stack grows up (such as AMDGPU). This change makes the adjustment conditional only to those targets whose stack grows down. Fixes https://bugs.llvm.org/show_bug.cgi?id=49686 Differential Revision: https://reviews.llvm.org/D99504	2021-03-29 17:26:58 +00:00
Petar Avramovic	b082e6f88a	[AMDGPU] Extend gfx10 test coverage. NFC. Differential Revision: https://reviews.llvm.org/D99267	2021-03-29 11:13:55 +02:00
Jay Foad	9d08f276d7	[AMDGPU] Use reductions instead of scans in the atomic optimizer If the result of an atomic operation is not used then it can be more efficient to build a reduction across all lanes instead of a scan. Do this for GFX10, where the permlanex16 instruction makes it viable. For wave64 this saves a couple of dpp operations. For wave32 it saves one readlane (which are generally bad for performance) and one dpp operation. Differential Revision: https://reviews.llvm.org/D98953	2021-03-26 15:38:14 +00:00
Gulfem Savrun Yeniceri	5fbe1fdf17	Revert "[Passes] Add relative lookup table converter pass" This reverts commit `5fd001a5ff` because it broke clang-with-thin-lto-ubuntu bot.	2021-03-24 18:59:33 +00:00
Gulfem Savrun Yeniceri	5fd001a5ff	[Passes] Add relative lookup table converter pass Lookup tables generate non PIC-friendly code, which requires dynamic relocation as described in: https://bugs.llvm.org/show_bug.cgi?id=45244 This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly. Differential Revision: https://reviews.llvm.org/D94355	2021-03-24 17:31:18 +00:00
Konstantin Zhuravlyov	f4ace63737	AMDGPU: Add target id and code object v4 support - Add target id support (https://clang.llvm.org/docs/ClangOffloadBundler.html#target-id) - Add code object v4 support (https://llvm.org/docs/AMDGPUUsage.html#elf-code-object) - Add kernarg_size to kernel descriptor - Change trap handler ABI to no longer move queue pointer into s[0:1] - Cleanup ELF definitions - Add V2, V3, V4 suffixes to make a clear distinction for code object version - Consolidate note names Differential Revision: https://reviews.llvm.org/D95638	2021-03-24 11:54:05 -04:00
alex-t	dccf83acf9	[AMDGPU] SIOptimizeExecMaskingPreRA should check constant bus constraint when folds EXEC copy Folding EXEC copy into it's single use may lead to constant bus constraint violation as it adds one more SGPR operand. This change makes it validate the user instruction with the new SGPR operand and only fold it if it is legal. Reviewed By: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D98888	2021-03-24 14:14:13 +03:00
Matt Arsenault	b24436ac96	GlobalISel: Lower funnel shifts	2021-03-23 09:11:17 -04:00
Jay Foad	d42f63beeb	[AMDGPU] Use non-compressed exports in a test. NFC. I don't think there's any need for this test to use compressed exports. Using normal exports seems a bit more straightforwards and avoids a tiny bit of bitcasting. Differential Revision: https://reviews.llvm.org/D99167	2021-03-23 11:18:12 +00:00
Pushpinder Singh	d0e5422eb8	[GlobalISel][AMDGPU] Lower G_UMULO/G_SMULO Reviewed By: foad Differential Revision: https://reviews.llvm.org/D93963	2021-03-23 05:45:43 +00:00
Carl Ritson	64db6b8d37	[AMDGPU] Only unbundle memory accesses in SIMemoryLegalizer This restores previous behaviour and is a step toward removing unbundling entirely. Reviewed By: foad, rampitec Differential Revision: https://reviews.llvm.org/D99061	2021-03-23 11:30:36 +09:00
Gulfem Savrun Yeniceri	e3a6d70c68	Revert "[Passes] Add relative lookup table converter pass" This reverts commit `78a65cd945` which caused buildbot failures.	2021-03-23 00:43:16 +00:00
Gulfem Savrun Yeniceri	78a65cd945	[Passes] Add relative lookup table converter pass Lookup tables generate non PIC-friendly code, which requires dynamic relocation as described in: https://bugs.llvm.org/show_bug.cgi?id=45244 This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly. Differential Revision: https://reviews.llvm.org/D94355	2021-03-22 22:09:02 +00:00
Matt Arsenault	c34819afe3	GlobalISel: Handle G_BUILD_VECTOR in isKnownToBeAPowerOfTwo	2021-03-22 14:20:35 -04:00
Matt Arsenault	1dd23c6d53	AMDGPU: Allow tail calls for amdgpu_gfx functions	2021-03-22 10:55:19 -04:00
Matt Arsenault	6314a72730	AMDGPU/GlobalISel: Enable CSE in pre-legalizer combiner	2021-03-21 10:07:37 -04:00
Carl Ritson	fe5f4c397f	[AMDGPU] Rename SIInsertSkips Pass Pass no longer handles skips. Pass now removes unnecessary unconditional branches and lowers early termination branches. Hence rename to SILateBranchLowering. Move code to handle returns to epilog from SIPreEmitPeephole into SILateBranchLowering. This means SIPreEmitPeephole only contains optional optimisations, and all required transforms are in SILateBranchLowering. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D98915	2021-03-20 11:48:04 +09:00
Carl Ritson	5df2af8b0e	[AMDGPU] Merge SIRemoveShortExecBranches into SIPreEmitPeephole SIRemoveShortExecBranches is an optimisation so fits well in the context of SIPreEmitPeephole. Test changes relate to early termination from kills which have now been lowered prior to considering branches for removal. As these use s_cbranch the execz skips are now retained instead. Currently either behaviour is valid as kill with EXEC=0 is a nop; however, if early termination is used differently in future then the new behaviour is the correct one. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D98917	2021-03-20 11:26:42 +09:00
Carl Ritson	b76c09023d	[AMDGPU] Allow index optimisation in SIPreEmitPeephole for bundles Add code so duplication index register changes can be removed from inside bundles. Reviewed By: rampitec, foad Differential Revision: https://reviews.llvm.org/D98940	2021-03-20 10:26:23 +09:00
Jay Foad	87248e852b	[AMDGPU] Rationalize some check prefixes and use more common prefixes. NFC.	2021-03-19 16:48:33 +00:00
Jay Foad	5df52f7708	[AMDGPU] Remove weird target triples from tests. NFC.	2021-03-19 16:48:32 +00:00
Simon Pilgrim	9d2df96407	[DAG] computeKnownBits - add ISD::MULHS/MULHU/SMUL_LOHI/UMUL_LOHI handling Reuse the existing KnownBits multiplication code to handle the 'extend + multiply + extract high bits' pattern for multiply-high ops. Noticed while looking at the codegen for D88785 / D98587 - the patch helps division-by-constant expansion code in particular, which suggests that we might have some further KnownBits div/rem cases we could handle - but this was far easier to implement. Differential Revision: https://reviews.llvm.org/D98857	2021-03-19 16:02:31 +00:00
Jay Foad	b8616e40da	[AMDGPU] Add atomic optimizer nouse tests Add some atomic optimizer tests where there is no use of the result of the atomic operation, which is a common case in real code. NFC. Differential Revision: https://reviews.llvm.org/D98952	2021-03-19 15:39:42 +00:00
Jay Foad	685335a014	[AMDGPU] Remove duplicate test functions. NFC.	2021-03-19 11:36:14 +00:00
Stanislav Mekhanoshin	edd6da10d2	[AMDGPU] Remove cpol, tfe, and swz from MUBUF patterns These are always selected as 0 anyway. Differential Revision: https://reviews.llvm.org/D98663	2021-03-18 14:36:04 -07:00
Jon Chesterfield	253f804deb	[amdgpu] Update med3 combine to skip i64 [amdgpu] Update med3 combine to skip i64 Fixes an assumption that a type which is not i32 will be i16. This asserts when trying to sign/zero extend an i64 to i32. Test case was cut down from an openmp application. Variations on it are hit by other combines before reaching the problematic one, e.g. replacing the immediate values with other function arguments changes the codegen path and misses this combine. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D98872	2021-03-18 15:56:41 +00:00
Jay Foad	078b338ba6	[AMDGPU] Add some gfx1010 test coverage. NFC.	2021-03-18 14:00:07 +00:00
Matt Arsenault	b9a0384983	GlobalISel: Preserve source value information for outgoing byval args Pass through the original argument IR value in order to preserve the aliasing information in the memcpy memory operands.	2021-03-18 09:16:54 -04:00
Matt Arsenault	61f834cc09	GlobalISel: Insert memcpy for outgoing byval arguments byval requires an implicit copy between the caller and callee such that the callee may write into the stack area without it modifying the value in the parent. Previously, this was passing through the raw pointer value which would break if the callee wrote into it. Most of the time, this copy can be optimized out (however we don't have the optimization SelectionDAG does yet). This will trigger more fallbacks for AMDGPU now, since we don't have legalization for memcpy yet (although we should stop using byval anyway).	2021-03-18 09:16:54 -04:00
Simon Pilgrim	388fbefb4f	[AMDGPU] Regenerate atomic_optimizations_global_pointer.ll tests	2021-03-18 11:15:44 +00:00
Simon Pilgrim	cfc256ba9f	[DAG] TargetLowering::isBinOp() - add ISD::SSUBSAT/USUBSAT Add to the generic non-commutative binop list.	2021-03-17 14:51:00 +00:00
Simon Pilgrim	4a68740547	Revert rG3b635253ddd0106c88051cff3540d8eb90bee22f "[AMDGPU] Regenerate wave32.ll test checks" Breaks on some buildbots.	2021-03-17 11:47:09 +00:00
Simon Pilgrim	3b635253dd	[AMDGPU] Regenerate wave32.ll test checks This is to help simplify the diff on an upcoming patch	2021-03-17 11:27:11 +00:00
Stanislav Mekhanoshin	bc27a31801	[AMDGPU] Fix copyPhysReg to not produce unalined vgpr access RA can insert something like a sub1_sub2 COPY of a wide VGPR tuple which results in the unaligned acces with v_pk_mov_b32 after the copy is expanded. This is regression after D97316. Differential Revision: https://reviews.llvm.org/D98549	2021-03-15 14:14:30 -07:00
Stanislav Mekhanoshin	3bffb1cd0e	[AMDGPU] Use single cache policy operand Replace individual operands GLC, SLC, and DLC with a single cache_policy bitmask operand. This will reduce the number of operands in MIR and I hope the amount of code. These operands are mostly 0 anyway. Additional advantage that parser will accept these flags in any order unlike now. Differential Revision: https://reviews.llvm.org/D96469	2021-03-15 13:00:59 -07:00
Jon Chesterfield	13e49dcee4	[amdgpu] Implement lower function LDS pass [amdgpu] Implement lower function LDS pass Local variables are allocated at kernel launch. This pass collects global variables that are used from non-kernel functions, moves them into a new struct type, and allocates an instance of that type in every kernel. Uses are then replaced with a constantexpr offset. Prior to this pass, accesses from a function are compiled to trap. With this pass, most such accesses are removed before reaching codegen. The trap logic is left unchanged by this pass. It is still reachable for the cases this pass misses, notably the extern shared construct from hip and variables marked constant which survive the optimizer. This is of interest to the openmp project because the deviceRTL runtime library uses cuda shared variables from functions that cannot be inlined. Trunk llvm therefore cannot compile some openmp kernels for amdgpu. In addition to the unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled and the function pointer hashing scheme deleted passes the openmp suite. This lowering will use more LDS than strictly necessary. It is intended to be a functionally correct fallback for cases that are difficult to target from future optimisation passes. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D94648	2021-03-15 15:24:01 +00:00
Carl Ritson	13877db2fa	[AMDGPU] Fix shortfalls in WQM marking When tracking defined lanes through phi nodes in the live range graph each branch of the phi must be handled independently. Also rewrite the marking algorithm to reduce unnecessary operations. Previously a shared set of defined lanes was used which caused marking to stop prematurely. This was observable in existing lit tests, but test patterns did not cover this detail. Reviewed By: piotr Differential Revision: https://reviews.llvm.org/D98614	2021-03-15 21:44:15 +09:00
Roman Lebedev	78b8ce40ef	Reland [SCEV] Improve modelling for (null) pointer constants This reverts commit `329aeb5db4`, and relands commit `61f006ac65`. This is a continuation of D89456. As it was suggested there, now that SCEV models `PtrToInt`, we can try to improve SCEV's pointer handling. In particular, i believe, i will need this in the future to further fix `SCEVAddExpr`operation type handling. This removes special handling of `ConstantPointerNull` from `ScalarEvolution::createSCEV()`, and add constant folding into `ScalarEvolution::getPtrToIntExpr()`. This way, `null` constants stay as such in SCEV's, but gracefully become zero integers when asked. Reviewed By: Meinersbur Differential Revision: https://reviews.llvm.org/D98147	2021-03-13 16:05:34 +03:00
Roman Lebedev	329aeb5db4	Temporairly evert "[SCEV] Improve modelling for (null) pointer constants" This appears to have broken ubsan bot: https://lab.llvm.org/buildbot/#/builders/85/builds/3062 https://reviews.llvm.org/D98147#2623549 It looks like LSR needs some kind of a change around insertion point handling. Reverting until i have a fix. This reverts commit `61f006ac65`.	2021-03-13 09:10:28 +03:00
Roman Lebedev	61f006ac65	[SCEV] Improve modelling for (null) pointer constants This is a continuation of D89456. As it was suggested there, now that SCEV models `PtrToInt`, we can try to improve SCEV's pointer handling. In particular, i believe, i will need this in the future to further fix `SCEVAddExpr`operation type handling. This removes special handling of `ConstantPointerNull` from `ScalarEvolution::createSCEV()`, and add constant folding into `ScalarEvolution::getPtrToIntExpr()`. This way, `null` constants stay as such in SCEV's, but gracefully become zero integers when asked. Reviewed By: Meinersbur Differential Revision: https://reviews.llvm.org/D98147	2021-03-12 22:11:58 +03:00
Simonas Kazlauskas	a2eca31da2	Test cases for rem-seteq fold with illegal types This also briefly tests a larger set of architectures than the more exhaustive functionality tests for AArch64 and x86. As requested in D88785 Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D98339	2021-03-12 16:28:04 +02:00
Matt Arsenault	6b76d82853	GlobalISel: Fix marking byval arguments as immutable byval arguments need to be assumed writable. Only implicitly stack passed arguments which aren't addressable in the IR can be assumed immutable. Mips is still broken since for some reason its doing its own thing with the ValueHandlers (and x86 doesn't actually handle byval arguments now, although some of the code is there).	2021-03-12 09:01:53 -05:00
Matt Arsenault	34471c3060	GlobalISel: Partially fix handling of byval arguments This was essentially ignoring byval and treating them as a pointer argument which needed to be loaded from. This should copy the frame index value to the virtual register, not insert a load from the frame index into the pointer value. For AMDGPU, this was producing a load from the byval pointer argument, to a pointer used for the byval arguments. I do not understand how AArch64 managed to work before since it appears to be similarly broken. We could also change the ValueHandler API to avoid the extra copy from the frame index, since currently it returns a new register. I believe there is still an issue with outgoing byval arguments. These should have a copy inserted in case the callee decided to overwrite the memory.	2021-03-12 09:01:53 -05:00
Carl Ritson	f08dadd242	[AMDGPU] Do not annotate an else branch if there is a kill As llvm.amdgcn.kill is lowered to a terminator it can cause else branch annotations to end up in the wrong block. Do not annotate conditionals as else branches where there is a kill to avoid this. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D97427	2021-03-12 11:52:08 +09:00
Carl Ritson	c07f2025e4	[AMDGPU] Restrict image_msaa_load to MSAA dimension types This instruction is only valid on 2D MSAA and 2D MSAA Array surfaces. Remove intrinsic support for other dimension types, and block assembly for unsupported dimensions. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D98397	2021-03-12 09:47:24 +09:00
Ruiling Song	e8e6817d00	[AMDGPU] Don't check hasStackObjects() when reserving VGPR We have amdgpu_gfx functions that have high register pressure. If we do not reserve VGPR for SGPR spill, we will fall into the path to spill the SGPR to memory, which does not only have correctness issue, but also have really bad performance. I don't know why there is the check for hasStackObjects(), in our case, we don't have stack objects at the time of finalizeLowering(). So just remove the check that we always reserve a VGPR for possible SGPR spill in non-entry functions. Reviewed by: arsenm Differential Revision: https://reviews.llvm.org/D98345	2021-03-12 08:11:14 +08:00
Matt Arsenault	70cb57d7da	AMDGPU/GlobalISel: Improve private addressing mode matching This enables the look-through-copy to hack around not correctly regbankselecting constants to match the use bank.	2021-03-11 10:23:35 -05:00
Matt Arsenault	cf5ecd5644	GlobalISel: Fix off by one in finding explicit byval alignment For attribute sets, the return index is at 0, and arguments start at 1. getParamAlignment adds the offset of 1, so we need to convert from attribute index back to IR index.	2021-03-11 10:23:08 -05:00
Matt Arsenault	0e0c7ef8e4	AMDGPU/GlobalISel: Add more tests for byval arguments	2021-03-11 10:23:08 -05:00
Ruiling Song	66340846b3	[AMDGPU] Always create Stack Object for reserved VGPR As we may overwrite inactive lanes of a caller-save-vgpr, we should always save/restore the reserved vgpr for sgpr spill. Reviewed by: arsenm Differential Revision: https://reviews.llvm.org/D98319	2021-03-11 10:06:07 +08:00
Daniel Sanders	134a179dee	[mir] Change 'undef' for MMO base addresses to 'unknown-address' Differential Revision: https://reviews.llvm.org/D98100	2021-03-10 16:46:44 -08:00
Stanislav Mekhanoshin	574a9dabc6	[AMDGPU] Always expand system scope fp atomics on gfx90a FP atomics in system scope cannot be used and shall always be expanded in a CAS loop. Differential Revision: https://reviews.llvm.org/D98085	2021-03-10 12:35:23 -08:00
Jay Foad	70f013fd3b	[AMDGPU] Fix isReallyTriviallyReMaterializable for V_MOV_* D57708 changed SIInstrInfo::isReallyTriviallyReMaterializable to reject V_MOVs with extra implicit operands, but it accidentally rejected all V_MOVs because of their implicit use of exec. Fix it but avoid adding a moderately expensive call to MI.getDesc().getNumImplicitUses(). In real graphics shaders this changes quite a few vgpr copies into move- immediates, which is good for avoiding stalls on GFX10. Differential Revision: https://reviews.llvm.org/D98347	2021-03-10 16:18:12 +00:00
Christudasan Devadasan	4c6ab48fb1	GlobalISel: Try to combine G_[SU]DIV and G_[SU]REM It is good to have a combined `divrem` instruction when the `div` and `rem` are computed from identical input operands. Some targets can lower them through a single expansion that computes both division and remainder. It effectively reduces the number of instructions than individually expanding them. Reviewed By: arsenm, paquette Differential Revision: https://reviews.llvm.org/D96013	2021-03-10 18:46:07 +05:30
Christudasan Devadasan	24c0ad7143	[AMDGPU] Fix the dead frame indices during custom spill lowering. AMDGPU target tries to handle the SGPR and VGPR spills in a custom pass before the actual frame lowering pass. Once they are handled and the respective frames are eliminated in the custom pass, certain uses of them still remain. For instance, the DBG_VALUE instructions inserted by the allocator alongside the spill instruction will use the corresponding frame index. They become dead later during PEI and causes a crash while trying to replace the frame indices. We should possibly avoid this custom pass. For now, replacing such dead references with null register value. Reviewed By: arsenm, scott.linder Differential Revision: https://reviews.llvm.org/D98038	2021-03-09 23:22:49 +05:30
Ruiling Song	f0ccdde3c9	[AMDGPU] Remove SI_MASK_BRANCH This is already deprecated, so remove code working on this. Also update the tests by using S_CBRANCH_EXECZ instead of SI_MASK_BRANCH. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D97545	2021-03-09 09:13:23 +08:00
Craig Topper	0eb405c3b8	[SelectionDAG] Add computeKnownBits support for ISD::USUBSAT. The result of ISD::USUBSAT will never be larger than the LHS. We can use this to put a bound on the number of leading zeros. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D98133	2021-03-07 09:48:42 -08:00
Jay Foad	99682bc039	Revert "Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030"" This reverts commit `e58d68fcd0`. This reinstates commit `fc28f600e5` with a fix to initialize HasShaderCyclesRegister. See https://reviews.llvm.org/D97928.	2021-03-06 09:00:01 +00:00
Mitch Phillips	e58d68fcd0	Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030" Broke the ASan/MSan buildbots. See more comments in the original patch, https://reviews.llvm.org/D97928. Build failure at http://lab.llvm.org:8011/#/builders/5/builds/5327 This reverts commit `fc28f600e5`.	2021-03-05 18:24:59 -08:00
Jay Foad	fc28f600e5	[AMDGPU] Restore the s_memtime instruction in gfx1030 gfx1030 added a new way to implement readcyclecounter using the SHADER_CYCLES hardware register, but the s_memtime instruction still exists, so the MC layer should still accept it and the llvm.amdgcn.s.memtime intrinsic should still work. Differential Revision: https://reviews.llvm.org/D97928	2021-03-05 20:19:11 +00:00
RamNalamothu	3998a8e797	[AMDGPU] Do not attempt sgpr spills to vgpr, when it is disabled This covers a path missed in https://reviews.llvm.org/D95768. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D98013	2021-03-05 22:47:21 +05:30
Sebastian Neubauer	e0e73714fb	[AMDGPU] Keep skip branch for ds instructions Same as other memory instructions, ds instructions add latency even if exec is zero. Jumping over them if exec=0 is cheaper than executing them. With this change, the branch instruction that skips over a basic block if exec=0 is not removed when the block contains a ds instruction. Differential Revision: https://reviews.llvm.org/D97922	2021-03-05 12:34:09 +01:00
Petar Avramovic	36beaa3ba3	Reland AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect Recommit `bf5a582650`. Depends on `4c8fb7ddd6` which was reverted. RegBankSelect creates zext and trunc when it selects banks for uniform i1. Add zext_trunc_fold from generic combiner to post RegBankSelect combiner. Differential Revision: https://reviews.llvm.org/D95432	2021-03-05 11:05:37 +01:00
Petar Avramovic	d44f61f81c	Reland [GlobalISel] Combine zext(trunc x) to x Recommit `4112299ee7`. Depends on `4c8fb7ddd6` which was reverted. Combine zext(trunc x) to x when truncated bits are known to be zero. Differential Revision: https://reviews.llvm.org/D96031	2021-03-05 11:05:37 +01:00
Jay Foad	ed7458398a	[AMDGPU] Don't check for VMEM hazards on GFX10 The hazard where a VMEM reads an SGPR written by a VALU counts as a data dependency hazard, so no nops are required on GFX10. Tested with Vulkan CTS on GFX10.1 and GFX10.3. Differential Revision: https://reviews.llvm.org/D97926	2021-03-04 21:44:56 +00:00
Petar Avramovic	d7834556b7	Reland [GlobalISel] Start using vectors in GISelKnownBits This is recommit of `4c8fb7ddd6`. MIR in one unit test had mismatched types. For vectors we consider a bit as known if it is the same for all demanded vector elements (all elements by default). KnownBits BitWidth for vector type is size of vector element. Add support for G_BUILD_VECTOR. This allows combines of urem_pow2_to_mask in pre-legalizer combiner. Differential Revision: https://reviews.llvm.org/D96122	2021-03-04 21:47:13 +01:00
Daniel Sanders	9fc2be6f28	[mir] Fix confusing MIR when MMO's value is nullptr but offset is non-zero :: (store 1 + 4, addrspace 1) -> :: (store 1 into undef + 4, addrspace 1) An offset without a base isn't terribly useful but it's convenient to update the offset without checking the value. For example, when breaking apart stores into smaller units Differential Revision: https://reviews.llvm.org/D97812	2021-03-04 10:34:30 -08:00
Nico Weber	e68de60bc4	Revert "AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect" This reverts commit `bf5a582650`. Also depends on now-reverted `4c8fb7ddd6`	2021-03-04 10:16:11 -05:00
Nico Weber	59beb1ef6d	Revert "[GlobalISel] Combine zext(trunc x) to x" This reverts commit `4112299ee7`. Seems to depend on `4c8fb7ddd6` which is being reverted.	2021-03-04 10:13:40 -05:00
Nico Weber	4b1015361c	Revert "[GlobalISel] Start using vectors in GISelKnownBits" This reverts commit `4c8fb7ddd6`. Breaks check-llvm everywhere, see https://reviews.llvm.org/D96122	2021-03-04 10:13:40 -05:00
Petar Avramovic	bf5a582650	AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect RegBankSelect creates zext and trunc when it selects banks for uniform i1. Add zext_trunc_fold from generic combiner to post RegBankSelect combiner. Differential Revision: https://reviews.llvm.org/D95432	2021-03-04 15:05:24 +01:00
Petar Avramovic	4112299ee7	[GlobalISel] Combine zext(trunc x) to x Combine zext(trunc x) to x when truncated bits are known to be zero. Differential Revision: https://reviews.llvm.org/D96031	2021-03-04 15:05:23 +01:00

... 3 4 5 6 7 ...

4784 Commits