Commit Graph

4784 Commits

Jay Foad 16d707e656 [AMDGPU] Fix v_swap_b32 formation on physical registers
As explained in the comments, matchSwap matches:

// mov t, x
// mov x, y
// mov y, t

and turns it into:

// mov t, x (t is potentially dead and move eliminated)
// v_swap_b32 x, y

On physical registers we don't have full use-def chains, so the check
for T being live-out was not working properly with subregs/superregs.

Differential Revision: https://reviews.llvm.org/D101546
2021-04-29 20:53:40 +01:00
Petar Avramovic c34900e133 AMDGPU/GlobalISel: Fix selection of image intrinsics with unused return
When an atomic image intrinsic's return value is unused, the register class
for the destination of a sub-register copy of the return value ends up not
being set. This copy then hits a 'Register class not set' assert later.
If the return value has uses, the register class is determined by the use
instruction. The fix is to not create the sub-register copy when the image
intrinsic's destination has no uses, because the copy would be deleted by
dead-mi-elimination later anyway.

Differential Revision: https://reviews.llvm.org/D101448
2021-04-29 20:56:03 +02:00
Jay Foad 1ecddddbec [AMDGPU] Add a v_swap_b32 test case to be fixed 2021-04-29 16:03:15 +01:00
Joe Nash 168228d76a [AMDGPU] Make some VOP3 insts commutable
Note, only src0 and src1 will be commuted if the isCommutable flag
is set. This patch does not change that, it just makes it possible
to commute src0 and src1 of some U/I/B vop3 instructions.

This patch revises d35d8da7d6.
It contains the commute opportunities excluding float insts

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D101474

Change-Id: I62938173d750453839f2457a3851661a29135faf
2021-04-28 13:59:08 -04:00
Petar Avramovic 8110fcc8fc AMDGPU/GlobalISel: Fix negative offset folding for buffer_load
Buffer_load performs unsigned offset calculations. Don't fold
operands of a 32-bit add that are likely to cause unsigned add
overflow (the common case is when one of the operands is negative).

Differential Revision: https://reviews.llvm.org/D91336
2021-04-27 14:45:22 +02:00
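
For illustration, a standalone C++ sketch of the overflow hazard described
above (the constants are made up, not taken from the patch):

#include <cstdint>
#include <cstdio>

int main() {
  // A buffer_load-style address: base plus a folded 32-bit offset that
  // the hardware later interprets as an unsigned byte offset.
  uint32_t base = 16;
  int32_t add_operand = -32;                      // negative add operand
  uint32_t folded = base + (uint32_t)add_operand; // wraps to 0xFFFFFFF0
  printf("folded offset = 0x%08X\n", folded);
  // Instead of a small negative adjustment, the hardware would see a
  // huge positive offset, so such folds must be rejected.
  return 0;
}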
Petar Avramovic 6a3e1b3531 AMDGPU/GlobalISel: Add test for buffer_load with negative offset
Pre-commit test for D91336.
2021-04-27 14:45:21 +02:00
Petar Avramovic fb7be0d912 AMDGPU/GlobalISel: Remove redundant G_FCANONICALIZE
Add a basic version of isCanonicalized for GlobalISel, copied from
SelectionDAG. Add a post-legalizer combine that deletes G_FCANONICALIZE
when its input is already canonicalized.

Differential Revision: https://reviews.llvm.org/D96605
2021-04-27 12:26:37 +02:00
Petar Avramovic 4a9bc59867 AMDGPU/GlobalISel: Add integer med3 combines
Add signed and unsigned integer version of med3 combine.
Source pattern is min(max(Val, K0), K1) or max(min(Val, K1), K0)
where K0 and K1 are constants and K0 <= K1. Destination is med3
that corresponds to signedness of min/max in source.

Differential Revision: https://reviews.llvm.org/D90050
2021-04-27 11:52:23 +02:00
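
A runnable C++ sketch of the clamp-to-med3 equivalence the combine relies
on (the constants are arbitrary illustrations):

#include <algorithm>
#include <cassert>
#include <cstdint>

// med3(a, b, c) returns the middle of its three inputs.
int32_t med3(int32_t a, int32_t b, int32_t c) {
  return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

int main() {
  const int32_t K0 = -8, K1 = 100; // constants with K0 <= K1
  for (int32_t v = -200; v <= 200; ++v) {
    int32_t clamped = std::min(std::max(v, K0), K1); // source pattern
    assert(clamped == med3(v, K0, K1));              // med3 replacement
  }
  return 0;
}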
Baptiste Saleil caf1294d95 [AMDGPU] Experiments show that the GCNRegBankReassign pass significantly impacts
compilation time and there is no case for which we see any improvement in
performance. This patch removes this pass and its associated test cases from
the tree.

Differential Revision: https://reviews.llvm.org/D101313

Change-Id: I0599169a7609c19a887f8d847a71e664030cc141
2021-04-26 17:21:49 -04:00
Sebastian Neubauer 9579af2bd7 [AMDGPU] Fix autogenerated wwm-reserved-spill.ll
Due to a bug in update_llc_test_checks.py, the test is wrongly
coalesced between run lines. Remove common check prefix to fix that.
NFC.
2021-04-26 19:09:09 +02:00
Sebastian Neubauer fcc40d9c17 [AMDGPU] Use MapVector for WWMReservedRegs
Use MapVector instead of SmallDenseMap because it has a deterministic
iteration order.

Differential Revision: https://reviews.llvm.org/D101299
2021-04-26 17:43:00 +02:00
Michael Kitzan 59f2dd5f1a [MachineCSE] Prevent CSE of non-local convergent instrs
At the moment, MachineCSE allows CSE-ing convergent instrs which are
non-local to each other. This can cause illegal codegen as convergent
instrs are control flow dependent. The patch prevents non-local CSE of
convergent instrs by adding a check in isProfitableToCSE and rejecting
CSE-ing if we're considering CSE-ing non-local convergent instrs. We
can still CSE convergent instrs which are in the same control flow
scope, so the patch purposely does not make all convergent instrs
non-CSE candidates in isCSECandidate.

https://reviews.llvm.org/D101187
2021-04-23 16:44:48 -07:00
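
A simplified model of the guard described above; isProfitableToCSE is the
real hook named in the message, but the types below are illustrative
stand-ins rather than the MachineCSE API:

#include <cstdio>

struct Instr {
  int BlockId;       // basic block the instruction lives in
  bool IsConvergent; // e.g. cross-lane ops that depend on control flow
};

// Reject CSE of convergent instructions across basic blocks; same-block
// (same control-flow scope) CSE remains allowed.
bool isProfitableToCSE(const Instr &Candidate, const Instr &Existing) {
  if (Candidate.IsConvergent && Candidate.BlockId != Existing.BlockId)
    return false;
  return true; // other profitability checks elided
}

int main() {
  Instr A{0, true}, B{1, true}, C{0, true};
  printf("cross-block convergent: %d\n", isProfitableToCSE(A, B)); // 0
  printf("same-block convergent:  %d\n", isProfitableToCSE(A, C)); // 1
  return 0;
}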
Sebastian Neubauer 3366d81153 [AMDGPU] Save WWM registers in functions
The values of registers in inactive lanes need to be saved during
function calls.

Save all registers used for whole wave mode, similar to how it is done
for VGPRs that are used for SGPR spilling.

Differential Revision: https://reviews.llvm.org/D99429

Reapply with fixed tests on Windows.
2021-04-23 18:09:24 +02:00
Jay Foad 5802cbefc1 [AMDGPU] Fix typo in implicit operand lists
Several tests had a typo where they mentioned sgpr17 twice instead of
sgpr17 and sgpr27. This had a significant effect on the
"scavenge_sgpr_pei_no_sgprs" tests because there was actually an sgpr
available, namely sgpr27.

Differential Revision: https://reviews.llvm.org/D100960
2021-04-23 15:44:17 +01:00
Sebastian Neubauer 22d99cb63f Revert "[AMDGPU] Save WWM registers in functions"
This reverts commit 91464c30bf.

Seems to break tests on windows.
2021-04-23 16:38:50 +02:00
Piotr Sobczak 83a3395b30 [AMDGPU][NFC] Update auto-gen test
Most likely the "glc" was not added to the test when
the volatile loads started generating those bits.
2021-04-23 16:33:16 +02:00
Sebastian Neubauer 91464c30bf [AMDGPU] Save WWM registers in functions
The values of registers in inactive lanes need to be saved during
function calls.

Save all registers used for whole wave mode, similar to how it is done
for VGPRs that are used for SGPR spilling.

Differential Revision: https://reviews.llvm.org/D99429
2021-04-23 16:09:31 +02:00
Matt Arsenault b58332774f AMDGPU: Fix assert on inline asm on gfx90a
This was assuming all mayLoad instructions have one def.
2021-04-23 09:00:25 -04:00
Matt Arsenault ed633a1daa AMDGPU: Restore atomic fp feature on FP atomic instruction definitions
9931b1f7a4 switched this to checking for
the two specific subtargets, instead of the dedicated feature. This
broke support for functions which force-added the feature when emitting
for targets that do not actually support them. This still does not work
for the targets that use the gfx6/7 or gfx10 encodings.
2021-04-22 21:32:01 -04:00
Matt Arsenault 987e52851e AMDGPU: Fix assert when trying to fold reg_sequence of physreg copies 2021-04-21 21:58:18 -04:00
Matt Arsenault 70ab76a81b AMDGPU: Fix indirect tail calls
Fix a selection error on uniform callees, and use a regular call if
divergent.
2021-04-21 09:15:24 -04:00
Jay Foad ec8c61efdf [AMDGPU] Allow multiple uses of the same literal
In GFX10 VOP3 can have a literal, which opens up the possibility of two
operands using the same literal value, which is allowed and only counts
as one use of the constant bus.

AMDGPUAsmParser::validateConstantBusLimitations already knew about this
but SIInstrInfo::verifyInstruction did not.

Differential Revision: https://reviews.llvm.org/D100770
2021-04-20 16:44:01 +01:00
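
A toy C++ model of the counting rule the verifier was missing (the literal
values are made up):

#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

// Distinct literal values each occupy a constant-bus slot; repeated uses
// of the same value count only once.
int literalBusUses(const std::vector<uint32_t> &literalOperands) {
  std::set<uint32_t> unique(literalOperands.begin(), literalOperands.end());
  return (int)unique.size();
}

int main() {
  // Same literal on two operands: one constant-bus use, which is legal.
  printf("%d\n", literalBusUses({0x12345678, 0x12345678})); // 1
  // Two different literals: two uses, more than the encoding allows.
  printf("%d\n", literalBusUses({0x12345678, 0x0000CAFE})); // 2
  return 0;
}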
Matt Arsenault 1cb8a9d595 AMDGPU/GlobalISel: Fix uitofp/sitofp with non-power-of-2 integers 2021-04-20 11:13:29 -04:00
Jay Foad b22721f01a [AMDGPU] GCNDPPCombine: don't shrink V_ADD_CO_U32 if carry out is used
Don't shrink VOP3 instructions if there are any uses of a carry-out
operand, because the shrunken form of the instruction would write the
carry-out to vcc instead of to a virtual register.

Differential Revision: https://reviews.llvm.org/D100760
2021-04-20 09:17:52 +01:00
madhur13490 6a4d9cb7e0 [AMDGPU] Remove error check for indirect calls and add missing queue-ptr
This patch removes the -fixed-abi check for indirect calls
and also adds the queue-ptr, which is required for indirect calls to work.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D100633
2021-04-20 00:35:17 +05:30
Jay Foad ef443390a9 [AMDGPU] Remove MachineDCE after SIFoldOperands
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.

Reapply after fixing dependent change D100188.

Differential Revision: https://reviews.llvm.org/D100189
2021-04-19 12:08:02 +01:00
Philip Reames f549176ad9 [funcattrs] Add the maximal set of implied attributes to definitions
Have funcattrs expand all implied attributes into the IR. This expands the infrastructure from D100400, but this time for definitions, not declarations.

Somewhat subtly, this mostly isn't semantic. Because the accessors did the inference, any client which used the accessors was already getting the stronger result. Clients that directly checked for the presence of attributes (there are some) will now see a stronger result.

The old behavior can end up quite confusing for two reasons:
* Without this change, we have situations where function-attrs appears to fail when inferring an attribute (as seen by a human reading IR), but consuming code will see that it should have been implied. As a human trying to sanity-check test results and study IR for optimization possibilities, this is exceedingly error-prone and confusing. (I'll note that I wasted several hours recently because of this.)
* We can have transforms which trigger without the IR appearing (on inspection) to meet the preconditions. This change doesn't prevent this from happening (as the accessors still involve multiple checks), but it should make it less frequent.

I'd argue in favor of deleting the extra checks out of the accessors after this lands, but I want that in its own review as a) it's purely stylistic, and b) I already know there's some disagreement.

Once this lands, I'm also going to do a cleanup change which will delete some now-redundant duplicate predicates in the inference code, but again, that deserves to be a change of its own.

Differential Revision: https://reviews.llvm.org/D100226
2021-04-16 14:22:19 -07:00
Stelios Ioannou bf147c4653 [LSR] Fix for pre-indexed generated constant offset
This patch changed the isLegalUse check to ensure that
LSRInstance::GenerateConstantOffsetsImpl generates an
offset that results in a legal addressing mode and
formula. The check is changed to look similar to the
assert check used for illegal formulas.

Differential Revision: https://reviews.llvm.org/D100383

Change-Id: Iffb9e32d59df96b8f072c00f6c339108159a009a
2021-04-15 16:44:42 +01:00
Sebastian Neubauer 7842e1725e [AMDGPU] Fix large return values with amdgpu_gfx
Returning in memory is not supported, so fall back to sret.
Also, extend i1 and i16 to i32. Otherwise, they would be passed through
memory.

Differential Revision: https://reviews.llvm.org/D100543
2021-04-15 14:57:56 +02:00
hsmahesha 4973b0c4e7 [AMDGPU] Disable forceful inline of non-kernel functions which use LDS.
Now that LDS uses within non-kernel functions are handled in the
LowerModuleLDS pass, we no longer need to forcefully inline non-kernel
functions just because they use LDS. Do the forceful inlining only when
LowerModuleLDS is not enabled. It is enabled by default.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D100481
2021-04-15 09:12:56 +05:30
Philip Reames dd985551c2 Reapply "[InferAttributes] Materialize all inferred attributes for declaration" and follow-on patches.
This reverts commit ab98f2c712 and 98eea392cd.

It includes a fix for the clang test which triggered the revert.  I failed to notice this one because there was another AMDGPU llvm test with a similar name and the exact same text in the error message.  Odd.  Since only one buildbot reported the clang test, I didn't notice that one.
2021-04-14 16:38:07 -07:00
Nico Weber 98eea392cd Revert "Fix buildbots after 61a85da"
This reverts commit c609d53363.
61a85da was reverted in ab98f2c7
2021-04-14 18:47:46 -04:00
Philip Reames c609d53363 Fix buildbots after 61a85da 2021-04-14 15:16:05 -07:00
hsmahesha e3070db0f7 [AMDGPU] Rename "LDS lowering" pass name.
Rename the "LDS lowering" pass option from `amdgpu-disable-lower-module-lds` to
`amdgpu-enable-lower-module-lds`, as the latter is consistent and reads better.

Reviewed By: JonChesterfield

Differential Revision: https://reviews.llvm.org/D100441
2021-04-14 20:19:53 +05:30
Sebastian Neubauer 929edd4375 [AMDGPU] Mark scavenged SGPR as used
Otherwise it reuses the same register for storing the stack slot
offset if the stack slot offset is big.

Differential Revision: https://reviews.llvm.org/D100461
2021-04-14 14:55:01 +02:00
madhur13490 5682ae2fc6 [AMDGPU] Set implicit arg attributes for indirect calls
This patch adds attributes corresponding to
implicits to functions/kernels if
1. the function has an indirect call, OR
2. its address is taken.

Once such attributes are set, the rest of the codegen works
out of the box for indirect calls. This patch eliminates
the potential overhead -fixed-abi imposes even when indirect function
calls are not used.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D99347
2021-04-13 13:15:13 +00:00
Sanjay Patel 661cc71a1c [PassManager][PhaseOrdering] lower expects before running simplifyCFG
Retry of 330619a3a6 that includes a clang test update.

Original commit message:

If we run passes before lowering llvm.expect intrinsics to metadata,
then those passes have no way to act on the hints provided by llvm.expect.
SimplifyCFG is the known offender, and we made it smarter about profile
metadata in D98898 <https://reviews.llvm.org/D98898>.

In the motivating example from https://llvm.org/PR49336 , this means we
were ignoring the recommended method for a programmer to tell the compiler
that a compare+branch is expensive. This change appears to solve that case -
the metadata survives to the backend, the compare order is as expected in IR,
and the backend does not do anything to reverse it.

We make the same change to the old pass manager to keep things synchronized.

Differential Revision: https://reviews.llvm.org/D100213
2021-04-12 15:07:53 -04:00
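
For reference, a minimal C++ example of the hint in question (the function
is hypothetical):

#include <cstdio>

long compute(long x) {
  // The programmer marks x == 0 as unlikely. The hint only reaches the
  // backend if llvm.expect is lowered to branch-weight metadata before
  // SimplifyCFG runs.
  if (__builtin_expect(x == 0, 0)) {
    printf("rare slow path\n");
    return -1;
  }
  return x * 2; // hot path
}

int main() { return (int)compute(21) - 42; }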
Sanjay Patel 23ac9d1e6e Revert "[PassManager][PhaseOrdering] lower expects before running simplifyCFG"
This reverts commit 330619a3a6.
There are clang tests that also need to be updated.
2021-04-12 13:58:54 -04:00
Sanjay Patel 330619a3a6 [PassManager][PhaseOrdering] lower expects before running simplifyCFG
If we run passes before lowering llvm.expect intrinsics to metadata,
then those passes have no way to act on the hints provided by llvm.expect.
SimplifyCFG is the known offender, and we made it smarter about profile
metadata in D98898.

In the motivating example from https://llvm.org/PR49336 , this means we
were ignoring the recommended method for a programmer to tell the compiler
that a compare+branch is expensive. This change appears to solve that case -
the metadata survives to the backend, the compare order is as expected in IR,
and the backend does not do anything to reverse it.

We make the same change to the old pass manager to keep things synchronized.

Differential Revision: https://reviews.llvm.org/D100213
2021-04-12 12:23:31 -04:00
Sebastian Neubauer 6cc91adf1e [AMDGPU] Kill temporary register after restoring
Not a correctness issue, but the temporary register is not used
afterwards and should be dead.

Differential Revision: https://reviews.llvm.org/D100295
2021-04-12 14:20:03 +02:00
Sebastian Neubauer b76c2a6c2b [AMDGPU] Fix saving fp and bp
Spilling the fp or bp to scratch could overwrite VGPRs of inactive
lanes. Fix that by using only the active lanes of the scavenged VGPR.

This builds on the assumptions that
1. a function is never called with exec=0
2. lanes do not die in a function, i.e. exec!=0 in the function epilog
3. no new lanes are active when exiting the function, i.e. exec in the
   epilog is a subset of exec in the prolog.

Differential Revision: https://reviews.llvm.org/D96869
2021-04-12 11:52:55 +02:00
Sebastian Neubauer ca3bae94c4 [AMDGPU] Autogenerate test. NFC 2021-04-12 11:51:28 +02:00
Sebastian Neubauer 32bc9a9bc3 [AMDGPU] Unify spill code
Instead of reimplementing spilling in prolog and epilog, reuse
buildSpillLoadStore.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D99269
2021-04-12 11:19:08 +02:00
Sebastian Neubauer f9a8c6a0e5 [AMDGPU] Save VGPR of whole wave when spilling
Spilling SGPRs to scratch uses a temporary VGPR. LLVM currently cannot
determine if a VGPR is used in other lanes or not, so we need to save
all lanes of the VGPR. We even need to save the VGPR if it is marked as
dead.

The generated code depends on two things:
- Can we scavenge an SGPR to save EXEC?
- And can we scavenge a VGPR?

If we can scavenge an SGPR, we
- save EXEC into the SGPR
- set the needed lane mask
- save the temporary VGPR
- write the spilled SGPR into VGPR lanes
- save the VGPR again to the target stack slot
- restore the VGPR
- restore EXEC

If we were not able to scavenge an SGPR, we do the same operations, but
every time the temporary VGPR is written to memory, we
- write VGPR to memory
- flip exec (s_not exec, exec)
- write VGPR again (previously inactive lanes)

Surprisingly often, we are able to scavenge an SGPR, even though we are
at the brink of running out of SGPRs.
Scavenging a VGPR does not have a great effect (it saves three instructions
if no SGPR was scavenged), but we need to know if the VGPR we use was
live beforehand or not; otherwise the machine verifier complains.

Differential Revision: https://reviews.llvm.org/D96336
2021-04-12 11:01:38 +02:00
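
A toy wave32 simulation of the "write, flip exec, write again" sequence
described above (the exec mask and values are made up):

#include <cstdint>
#include <cstdio>

int main() {
  uint32_t exec = 0x0000FFFFu; // lanes 0-15 active
  uint32_t vgpr[32], mem[32] = {};
  for (int l = 0; l < 32; ++l) vgpr[l] = 100 + l;

  for (int pass = 0; pass < 2; ++pass) {
    for (int l = 0; l < 32; ++l)
      if (exec & (1u << l)) mem[l] = vgpr[l]; // store active lanes only
    exec = ~exec;                             // s_not exec, exec
  }
  // Two stores under complementary masks saved every lane.
  for (int l = 0; l < 32; ++l)
    if (mem[l] != vgpr[l]) { printf("lane %d lost\n", l); return 1; }
  printf("all 32 lanes saved\n");
  return 0;
}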
dfukalov 8f4b7e94a2 [AMDGPU][CostModel] Refine cost model for control-flow instructions.
Added cost estimation for the switch instruction, updated branch costs, and
fixed the phi cost.
Had to increase the `-amdgpu-unroll-threshold-if` default value since the
conditional branch cost (size) was corrected to a higher value.
Test renamed to "control-flow.ll".

Removed redundant code in `X86TTIImpl::getCFInstrCost()` and
`PPCTTIImpl::getCFInstrCost()`.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D96805
2021-04-10 09:20:24 +03:00
Mitch Phillips 092f288d36 Revert "[AMDGPU] Remove MachineDCE after SIFoldOperands"
This reverts commit 5a0117b2d0.

Reason: Dependent change d19a42eba9 broke
the ASan buildbots.
2021-04-09 15:47:44 -07:00
Jay Foad 5a0117b2d0 [AMDGPU] Remove MachineDCE after SIFoldOperands
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.

Differential Revision: https://reviews.llvm.org/D100189
2021-04-09 20:41:09 +01:00
Stanislav Mekhanoshin 034fe0e03d [AMDGPU] Added udot2 op_sel test. NFC. 2021-04-09 12:19:42 -07:00
Jay Foad a4ced03d34 [AMDGPU] SIFoldOperands: eagerly delete dead copies
This is cheap to implement, means less work for future passes like
MachineDCE, and slightly improves the folding in some cases.

Differential Revision: https://reviews.llvm.org/D100117
2021-04-09 13:52:54 +01:00
Philip Reames 35393c865c [funcattrs] Infer nosync from instruction walk
Pretty straightforward use of existing infrastructure and port of the attributor inference rules for nosync.

A couple points of interest:
* I deliberately switched from "monotonic or better" to "unordered or better". This is simply me being conservative and is better in line with the rest of the optimizer. We treat monotonic conservatively pretty much everywhere.
* The operand bundle test change is suspicious. It looks like we might have missed something here, but if so, it's an issue with the existing nofree inference as well. I'm going to take a closer look at that separately.
* I needed to keep the previous inference from readnone. This surprised me, but made sense once I realized readonly inference goes to lengths to reason about local vs non-local memory and that writes to local memory are okay. This is fine for the purpose of nosync, but would e.g. prevent us from inferring nofree from readnone - which is slightly surprising.

Differential Revision: https://reviews.llvm.org/D99769
2021-04-08 14:05:00 -07:00
Konstantin Zhuravlyov 4fae63c612 AMDGPU: Add gfx90c support to code object v2 for backwards compatibility
Differential Revision: https://reviews.llvm.org/D100126
2021-04-08 16:42:43 -04:00
Stanislav Mekhanoshin 627dab3dbf [AMDGPU] Check for all meta instrs in GCNRegBankReassign
It used to work correctly even with a KILL, but there is
no reason to consider meta instructions since they do not
create real HW uses.

Differential Revision: https://reviews.llvm.org/D100135
2021-04-08 13:41:10 -07:00
Nikita Popov 59a2f67011 [LoopRotate] Don't split loop pass manager
After D99249 we use three different loop pass managers for LICM,
LoopRotate and LICM+LoopUnswitch. This happens because LazyBFI
and LazyBPI are not preserved by LoopRotate (note that D74640
is no longer needed). Avoid this by marking them as preserved.

My understanding of D86156 is that it is okay to simply preserve
them (which LoopUnswitch already does for the same reason) and
rely on callbacks to deal with deleted blocks.

Differential Revision: https://reviews.llvm.org/D99843
2021-04-08 22:05:18 +02:00
Stanislav Mekhanoshin 189310a140 [AMDGPU] Allow -amdgpu-unsafe-fp-atomics to ignore denorm mode
Fixes: SWDEV-274276

Differential Revision: https://reviews.llvm.org/D100072
2021-04-08 12:46:36 -07:00
Jay Foad e184eeaa3b [AMDGPU] Add some implicit uses to tests. NFC.
This is just to stop a future patch from optimizing away the things that
we actually want to check for.
2021-04-08 16:37:48 +01:00
Jay Foad c28f79a0e3 [AMDGPU] SIFoldOperands: try harder to fold cndmask instructions
Look through copies to find more cases where the two values being
selected are identical. The motivation for this is just to be able to
remove the weird special case where tryFoldCndMask was called from
foldInstOperand, part way through folding a move-immediate into its
users, without regressing any lit tests.
2021-04-08 14:26:12 +01:00
Sebastian Neubauer c10cc4ea27 [AMDGPU] Fix computing live registers in prolog
ScratchExecCopy needs to be marked as live; we cannot use that register
while EXEC is stored in there.

Marking SGPRForFPSaveRestoreCopy and SGPRForBPSaveRestoreCopy as
available is unnecessary; they should not be live at that point anyway.

Differential Revision: https://reviews.llvm.org/D100098
2021-04-08 14:52:50 +02:00
Thomas Preud'homme 04419628e0 [AMDGPU, test] Fix use of undef FileCheck var
Test CodeGen/AMDGPU/amdgpu.private-memory.ll and
CodeGen/AMDGPU/private-memory-r600.ll have a block of CHECK directives
whose prefix is inconsistent: R600-CHECK vs R600. This leads to a
R600-NOT directive using an undefined CHAN variable due to R600-CHECK
directives never being considered by FileCheck. Fixing the prefix leads
to the testcase failing. As per https://reviews.llvm.org/D99865#2675235
this commit removes the directives instead since it is not possible to
write a reliable check.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D99865
2021-04-08 09:42:59 +01:00
hsmahesha ac64995ceb [AMDGPU] Only use ds_read/write_b128 for alignment >= 16
PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100008
2021-04-08 08:12:05 +05:30
hsmahesha d5fee599c5 [AMDGPU] Add some exhaustive ds read/write alignment tests
PS: Submitting on behalf of Jay.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100007
2021-04-08 08:08:49 +05:30
Tony Tye 4658cd4c18 [AMDGPU] Update gfx90a memory model support
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D100070
2021-04-07 22:17:58 +00:00
Jay Foad e9608a84d8 [AMDGPU][SDag] Add IMG init also for image_gather4 instructions
This fixes an oversight in D99747, which moved the IMG init code from
SIAddIMGInit to AdjustInstrPostInstrSelection, but did not set the
hasPostISelHook flag on gather4 instructions.

Differential Revision: https://reviews.llvm.org/D99953
2021-04-06 14:47:20 +01:00
Jay Foad 0bf4836dc4 [AMDGPU] Fix dubious regexes with unescaped brackets. NFC. 2021-04-06 13:17:41 +01:00
Jay Foad 6fec0a34ce [AMDGPU] Fix typo in regular expression checks. NFC. 2021-04-06 12:29:48 +01:00
Jay Foad 6eb5b06ecf [AMDGPU] Regenerate checks to fix prefixes broken in D96340. NFC. 2021-04-06 11:43:53 +01:00
Stanislav Mekhanoshin 30b3aab329 Copy syncscope when expanding atomicrmw into cmpxchg loop
Fixes: SWDEV-280070

Differential Revision: https://reviews.llvm.org/D99902
2021-04-05 17:29:38 -07:00
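
A C++ sketch of the general atomicrmw-to-cmpxchg-loop shape; std::atomic
carries memory orders rather than LLVM syncscopes, so the orderings below
merely stand in for the properties the expansion must copy over:

#include <atomic>
#include <cstdio>

// An RMW many targets cannot do natively (float add) expanded into a
// compare-exchange loop, preserving the requested ordering.
float atomic_fadd(std::atomic<float> &a, float v) {
  float expected = a.load(std::memory_order_relaxed);
  while (!a.compare_exchange_weak(expected, expected + v,
                                  std::memory_order_seq_cst,
                                  std::memory_order_relaxed)) {
    // expected was refreshed with the current value; retry
  }
  return expected; // old value, like atomicrmw
}

int main() {
  std::atomic<float> a{1.0f};
  atomic_fadd(a, 2.5f);
  printf("%f\n", a.load()); // 3.500000
  return 0;
}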
Roman Lebedev a26f1bf67e [PassManager] Run additional LICM before LoopRotate
Loop rotation often has to perform code duplication
from the header into the preheader, which introduces PHI nodes.

In D99204, @thopre wrote:
>
> With loop peeling, it is important that unnecessary PHIs be avoided or
> it will lead to spurious peeling. One source of such PHIs is loop
> rotation which creates PHIs for invariant loads. Those PHIs are
> particularly problematic since loop peeling is now run as part of simple
> loop unrolling before GVN is run, and are thus a source of spurious
> peeling.
>
> Note that while some of the loads can be hoisted and eventually
> eliminated by instruction combine, this is not always possible due to
> alignment issue. In particular, the motivating example [1] was a load
> inside a class instance which cannot be hoisted because the `this'
> pointer has an alignment of 1.
>
> [1] http://lists.llvm.org/pipermail/llvm-dev/attachments/20210312/4ce73c47/attachment.cpp

Now, we could enhance LoopRotate to avoid duplicating code when not needed,
but instead hoist loop-invariant code, but isn't that a code duplication? (*sic*)
We have LICM, and in fact we already run it right after LoopRotation.

We could try to move it to before LoopRotation,
that is basically free from compile-time perspective:
https://llvm-compile-time-tracker.com/compare.php?from=6c93eb4477d88af046b915bc955c03693b2cbb58&to=a4bee6d07732b1184c436da489040b912f0dc271&stat=instructions
But, looking at stats, I think it isn't great that we would no longer do LICM after LoopRotation, in particular:
| statistic name                                   | LoopRotate-LICM | LICM-LoopRotate |     Δ |       % | abs(%) |
| asm-printer.EmittedInsts                         | 9015930         | 9015799         |  -131 |   0.00% |  0.00% |
| indvars.NumElimCmp                               | 3536            | 3544            |     8 |   0.23% |  0.23% |
| indvars.NumElimExt                               | 36725           | 36580           |  -145 |  -0.39% |  0.39% |
| indvars.NumElimIV                                | 1197            | 1187            |   -10 |  -0.84% |  0.84% |
| indvars.NumElimIdentity                          | 143             | 136             |    -7 |  -4.90% |  4.90% |
| indvars.NumElimRem                               | 4               | 5               |     1 |  25.00% | 25.00% |
| indvars.NumLFTR                                  | 29842           | 29890           |    48 |   0.16% |  0.16% |
| indvars.NumReplaced                              | 2293            | 2227            |   -66 |  -2.88% |  2.88% |
| indvars.NumSimplifiedSDiv                        | 6               | 8               |     2 |  33.33% | 33.33% |
| indvars.NumWidened                               | 26438           | 26329           |  -109 |  -0.41% |  0.41% |
| instcount.TotalBlocks                            | 1178338         | 1173840         | -4498 |  -0.38% |  0.38% |
| instcount.TotalFuncs                             | 111825          | 111829          |     4 |   0.00% |  0.00% |
| instcount.TotalInsts                             | 9905442         | 9896139         | -9303 |  -0.09% |  0.09% |
| lcssa.NumLCSSA                                   | 425871          | 423961          | -1910 |  -0.45% |  0.45% |
| licm.NumHoisted                                  | 378357          | 378753          |   396 |   0.10% |  0.10% |
| licm.NumMovedCalls                               | 2193            | 2208            |    15 |   0.68% |  0.68% |
| licm.NumMovedLoads                               | 35899           | 31821           | -4078 | -11.36% | 11.36% |
| licm.NumPromoted                                 | 11178           | 11154           |   -24 |  -0.21% |  0.21% |
| licm.NumSunk                                     | 13359           | 13587           |   228 |   1.71% |  1.71% |
| loop-delete.NumDeleted                           | 8547            | 8402            |  -145 |  -1.70% |  1.70% |
| loop-instsimplify.NumSimplified                  | 12876           | 11890           |  -986 |  -7.66% |  7.66% |
| loop-peel.NumPeeled                              | 1008            | 925             |   -83 |  -8.23% |  8.23% |
| loop-rotate.NumNotRotatedDueToHeaderSize         | 368             | 365             |    -3 |  -0.82% |  0.82% |
| loop-rotate.NumRotated                           | 42015           | 42003           |   -12 |  -0.03% |  0.03% |
| loop-simplifycfg.NumLoopBlocksDeleted            | 240             | 242             |     2 |   0.83% |  0.83% |
| loop-simplifycfg.NumLoopExitsDeleted             | 497             | 20              |  -477 | -95.98% | 95.98% |
| loop-simplifycfg.NumTerminatorsFolded            | 618             | 336             |  -282 | -45.63% | 45.63% |
| loop-unroll.NumCompletelyUnrolled                | 11028           | 11032           |     4 |   0.04% |  0.04% |
| loop-unroll.NumUnrolled                          | 12608           | 12529           |   -79 |  -0.63% |  0.63% |
| mem2reg.NumDeadAlloca                            | 10222           | 10221           |    -1 |  -0.01% |  0.01% |
| mem2reg.NumPHIInsert                             | 192110          | 192106          |    -4 |   0.00% |  0.00% |
| mem2reg.NumSingleStore                           | 637650          | 637643          |    -7 |   0.00% |  0.00% |
| scalar-evolution.NumBruteForceTripCountsComputed | 814             | 812             |    -2 |  -0.25% |  0.25% |
| scalar-evolution.NumTripCountsComputed           | 283108          | 282934          |  -174 |  -0.06% |  0.06% |
| scalar-evolution.NumTripCountsNotComputed        | 106712          | 106718          |     6 |   0.01% |  0.01% |
| simple-loop-unswitch.NumBranches                 | 5178            | 4752            |  -426 |  -8.23% |  8.23% |
| simple-loop-unswitch.NumCostMultiplierSkipped    | 914             | 503             |  -411 | -44.97% | 44.97% |
| simple-loop-unswitch.NumSwitches                 | 20              | 18              |    -2 | -10.00% | 10.00% |
| simple-loop-unswitch.NumTrivial                  | 183             | 95              |   -88 | -48.09% | 48.09% |

... but that actually regresses LICM (-12% `licm.NumMovedLoads`),
loop-simplifycfg (`NumLoopExitsDeleted`, `NumTerminatorsFolded`),
simple-loop-unswitch (`NumTrivial`).

What if we instead have LICM both before and after LoopRotate?
| statistic name                                | LoopRotate-LICM | LICM-LoopRotate-LICM |     Δ |       % | abs(%) |
| asm-printer.EmittedInsts                      | 9015930         | 9014474              | -1456 |  -0.02% |  0.02% |
| indvars.NumElimCmp                            | 3536            | 3546                 |    10 |   0.28% |  0.28% |
| indvars.NumElimExt                            | 36725           | 36681                |   -44 |  -0.12% |  0.12% |
| indvars.NumElimIV                             | 1197            | 1185                 |   -12 |  -1.00% |  1.00% |
| indvars.NumElimIdentity                       | 143             | 146                  |     3 |   2.10% |  2.10% |
| indvars.NumElimRem                            | 4               | 5                    |     1 |  25.00% | 25.00% |
| indvars.NumLFTR                               | 29842           | 29899                |    57 |   0.19% |  0.19% |
| indvars.NumReplaced                           | 2293            | 2299                 |     6 |   0.26% |  0.26% |
| indvars.NumSimplifiedSDiv                     | 6               | 8                    |     2 |  33.33% | 33.33% |
| indvars.NumWidened                            | 26438           | 26404                |   -34 |  -0.13% |  0.13% |
| instcount.TotalBlocks                         | 1178338         | 1173652              | -4686 |  -0.40% |  0.40% |
| instcount.TotalFuncs                          | 111825          | 111829               |     4 |   0.00% |  0.00% |
| instcount.TotalInsts                          | 9905442         | 9895452              | -9990 |  -0.10% |  0.10% |
| lcssa.NumLCSSA                                | 425871          | 425373               |  -498 |  -0.12% |  0.12% |
| licm.NumHoisted                               | 378357          | 383352               |  4995 |   1.32% |  1.32% |
| licm.NumMovedCalls                            | 2193            | 2204                 |    11 |   0.50% |  0.50% |
| licm.NumMovedLoads                            | 35899           | 35755                |  -144 |  -0.40% |  0.40% |
| licm.NumPromoted                              | 11178           | 11163                |   -15 |  -0.13% |  0.13% |
| licm.NumSunk                                  | 13359           | 14321                |   962 |   7.20% |  7.20% |
| loop-delete.NumDeleted                        | 8547            | 8538                 |    -9 |  -0.11% |  0.11% |
| loop-instsimplify.NumSimplified               | 12876           | 12041                |  -835 |  -6.48% |  6.48% |
| loop-peel.NumPeeled                           | 1008            | 924                  |   -84 |  -8.33% |  8.33% |
| loop-rotate.NumNotRotatedDueToHeaderSize      | 368             | 365                  |    -3 |  -0.82% |  0.82% |
| loop-rotate.NumRotated                        | 42015           | 42005                |   -10 |  -0.02% |  0.02% |
| loop-simplifycfg.NumLoopBlocksDeleted         | 240             | 241                  |     1 |   0.42% |  0.42% |
| loop-simplifycfg.NumTerminatorsFolded         | 618             | 619                  |     1 |   0.16% |  0.16% |
| loop-unroll.NumCompletelyUnrolled             | 11028           | 11029                |     1 |   0.01% |  0.01% |
| loop-unroll.NumUnrolled                       | 12608           | 12525                |   -83 |  -0.66% |  0.66% |
| mem2reg.NumPHIInsert                          | 192110          | 192073               |   -37 |  -0.02% |  0.02% |
| mem2reg.NumSingleStore                        | 637650          | 637652               |     2 |   0.00% |  0.00% |
| scalar-evolution.NumTripCountsComputed        | 283108          | 282998               |  -110 |  -0.04% |  0.04% |
| scalar-evolution.NumTripCountsNotComputed     | 106712          | 106691               |   -21 |  -0.02% |  0.02% |
| simple-loop-unswitch.NumBranches              | 5178            | 5185                 |     7 |   0.14% |  0.14% |
| simple-loop-unswitch.NumCostMultiplierSkipped | 914             | 925                  |    11 |   1.20% |  1.20% |
| simple-loop-unswitch.NumTrivial               | 183             | 179                  |    -4 |  -2.19% |  2.19% |
| simple-loop-unswitch.NumBranches              | 5178            | 4752                 |  -426 |  -8.23% |  8.23% |
| simple-loop-unswitch.NumCostMultiplierSkipped | 914             | 503                  |  -411 | -44.97% | 44.97% |
| simple-loop-unswitch.NumSwitches              | 20              | 18                   |    -2 | -10.00% | 10.00% |
| simple-loop-unswitch.NumTrivial               | 183             | 95                   |   -88 | -48.09% | 48.09% |

I.e. we end up with fewer instructions, less peeling, and more LICM activity;
also note how none of those 4 regressions appear here. Namely:

| statistic name                                   | LICM-LoopRotate | LICM-LoopRotate-LICM |     Δ |        % |   abs(%) |
| asm-printer.EmittedInsts                         | 9015799         | 9014474              | -1325 |   -0.01% |    0.01% |
| indvars.NumElimCmp                               | 3544            | 3546                 |     2 |    0.06% |    0.06% |
| indvars.NumElimExt                               | 36580           | 36681                |   101 |    0.28% |    0.28% |
| indvars.NumElimIV                                | 1187            | 1185                 |    -2 |   -0.17% |    0.17% |
| indvars.NumElimIdentity                          | 136             | 146                  |    10 |    7.35% |    7.35% |
| indvars.NumLFTR                                  | 29890           | 29899                |     9 |    0.03% |    0.03% |
| indvars.NumReplaced                              | 2227            | 2299                 |    72 |    3.23% |    3.23% |
| indvars.NumWidened                               | 26329           | 26404                |    75 |    0.28% |    0.28% |
| instcount.TotalBlocks                            | 1173840         | 1173652              |  -188 |   -0.02% |    0.02% |
| instcount.TotalInsts                             | 9896139         | 9895452              |  -687 |   -0.01% |    0.01% |
| lcssa.NumLCSSA                                   | 423961          | 425373               |  1412 |    0.33% |    0.33% |
| licm.NumHoisted                                  | 378753          | 383352               |  4599 |    1.21% |    1.21% |
| licm.NumMovedCalls                               | 2208            | 2204                 |    -4 |   -0.18% |    0.18% |
| licm.NumMovedLoads                               | 31821           | 35755                |  3934 |   12.36% |   12.36% |
| licm.NumPromoted                                 | 11154           | 11163                |     9 |    0.08% |    0.08% |
| licm.NumSunk                                     | 13587           | 14321                |   734 |    5.40% |    5.40% |
| loop-delete.NumDeleted                           | 8402            | 8538                 |   136 |    1.62% |    1.62% |
| loop-instsimplify.NumSimplified                  | 11890           | 12041                |   151 |    1.27% |    1.27% |
| loop-peel.NumPeeled                              | 925             | 924                  |    -1 |   -0.11% |    0.11% |
| loop-rotate.NumRotated                           | 42003           | 42005                |     2 |    0.00% |    0.00% |
| loop-simplifycfg.NumLoopBlocksDeleted            | 242             | 241                  |    -1 |   -0.41% |    0.41% |
| loop-simplifycfg.NumLoopExitsDeleted             | 20              | 497                  |   477 | 2385.00% | 2385.00% |
| loop-simplifycfg.NumTerminatorsFolded            | 336             | 619                  |   283 |   84.23% |   84.23% |
| loop-unroll.NumCompletelyUnrolled                | 11032           | 11029                |    -3 |   -0.03% |    0.03% |
| loop-unroll.NumUnrolled                          | 12529           | 12525                |    -4 |   -0.03% |    0.03% |
| mem2reg.NumDeadAlloca                            | 10221           | 10222                |     1 |    0.01% |    0.01% |
| mem2reg.NumPHIInsert                             | 192106          | 192073               |   -33 |   -0.02% |    0.02% |
| mem2reg.NumSingleStore                           | 637643          | 637652               |     9 |    0.00% |    0.00% |
| scalar-evolution.NumBruteForceTripCountsComputed | 812             | 814                  |     2 |    0.25% |    0.25% |
| scalar-evolution.NumTripCountsComputed           | 282934          | 282998               |    64 |    0.02% |    0.02% |
| scalar-evolution.NumTripCountsNotComputed        | 106718          | 106691               |   -27 |   -0.03% |    0.03% |
| simple-loop-unswitch.NumBranches                 | 4752            | 5185                 |   433 |    9.11% |    9.11% |
| simple-loop-unswitch.NumCostMultiplierSkipped    | 503             | 925                  |   422 |   83.90% |   83.90% |
| simple-loop-unswitch.NumSwitches                 | 18              | 20                   |     2 |   11.11% |   11.11% |
| simple-loop-unswitch.NumTrivial                  | 95              | 179                  |    84 |   88.42% |   88.42% |

{F15983613} {F15983615} {F15983616}
(this is vanilla llvm testsuite + rawspeed + darktable)

As an example of the code where early LICM only is bad, see:
https://godbolt.org/z/GzEbacs4K

This does have an observable compile-time regression of +~0.5% geomean
https://llvm-compile-time-tracker.com/compare.php?from=7c5222e4d1a3a14f029e5f614c9aefd0fa505f1e&to=5d81826c3411982ca26e46b9d0aff34c80577664&stat=instructions
but I think that's basically nothing, and there's potential that it might
be avoidable in the future by fixing clang to produce alignment information
on function arguments, thus making the second run unneeded.

Differential Revision: https://reviews.llvm.org/D99249
2021-04-02 11:11:42 +03:00
Philip Reames a8ac8816c9 Update a test missed in 6ef4505 2021-04-01 12:17:01 -07:00
Brendon Cahoon 65c8bfb509 [AMDGPU] Enable output modifiers for double precision instructions
Update SIFoldOperands pass to recognize v_add_f64 and v_mul_f64
instructions for folding output modifiers.

Differential Revision: https://reviews.llvm.org/D99505
2021-04-01 10:08:17 -04:00
Dmitry Preobrazhensky cd953434f2 [AMDGPU][MC][GFX10][GFX90A] Corrected _e32/_e64 suffices
Fixed bugs https://bugs.llvm.org//show_bug.cgi?id=49643, https://bugs.llvm.org//show_bug.cgi?id=49644, https://bugs.llvm.org//show_bug.cgi?id=49645.

Differential Revision: https://reviews.llvm.org/D99413
2021-04-01 14:21:00 +03:00
Simonas Kazlauskas 777a58e05b Support {S,U}REMEqFold before legalization
This allows these optimisations to apply to e.g. `urem i16` directly
before `urem` is promoted to i32 on architectures where i16 operations
are not intrinsically legal (such as on AArch64). The legalization can then
happen more directly, and the generated code gets a chance to avoid
wasting time computing results in wider types than necessary.

Seems like mostly an improvement in terms of results at least as far as x86_64 and aarch64 are concerned, with a few regressions here and there. It also helps in preventing regressions in changes like {D87976}.

Reviewed By: lebedev.ri

Differential Revision: https://reviews.llvm.org/D88785
2021-04-01 01:35:41 +03:00
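
A self-contained C++ check of the urem flavour of this fold for i16 (the
divisor is chosen for illustration): for odd d, x % d == 0 exactly when x
times the multiplicative inverse of d modulo 2^16 is at most
floor((2^16 - 1) / d).

#include <cstdint>
#include <cstdio>

int main() {
  const uint16_t d = 7;              // example odd divisor
  const uint16_t inv = 28087;        // 7 * 28087 == 1 (mod 2^16)
  const uint16_t limit = 0xFFFF / d; // 9362
  for (uint32_t x = 0; x <= 0xFFFF; ++x) {
    bool folded = (uint16_t)(x * inv) <= limit; // no division needed
    bool reference = (x % d) == 0;
    if (folded != reference) { printf("mismatch at %u\n", x); return 1; }
  }
  printf("fold matches urem for all i16 values\n");
  return 0;
}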
Jay Foad b138cf115e [AMDGPU] Add some image tests with enable-prt-strict-null disabled. NFC. 2021-03-31 17:27:20 +01:00
Jay Foad a991ee330b [AMDGPU] Use a common check prefix for some image tests. NFC. 2021-03-31 17:27:20 +01:00
Jay Foad 5d0e9ddfa5 [AMDGPU][GlobalISel] Add support for global atomicrmw fadd
This includes gfx908 which only has a no-return version of the
global_atomic_add_f32 instruction, using the same hack that was
previously implemented for selecting from the
llvm.amdgcn.global.atomic.fadd intrinsic.

Differential Revision: https://reviews.llvm.org/D97767
2021-03-31 11:13:00 +01:00
Krasimir Georgiev c51e91e046 Revert "[Passes] Add relative lookup table converter pass"
This reverts commit 5178ffc7cf.

Compiling `llvm-profdata` with a compiler built from this produces a
crashing binary.
2021-03-30 14:13:37 +02:00
Gulfem Savrun Yeniceri 5178ffc7cf [Passes] Add relative lookup table converter pass
Lookup tables generate non-PIC-friendly code, which requires dynamic relocation as described in:
https://bugs.llvm.org/show_bug.cgi?id=45244

This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly.

Differential Revision: https://reviews.llvm.org/D94355
2021-03-29 21:53:32 +00:00
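
A rough C++ illustration of the relative-table idea; the offsets are
computed at run time here for simplicity (the pass emits them as link-time
constant expressions), and truncating to 32 bits assumes the usual small
code model where the table and its targets sit within 2 GiB of each other:

#include <cstdint>
#include <cstdio>

// Instead of an array of absolute pointers (each needing a dynamic
// relocation in PIC code), store 32-bit offsets from the table itself.
static const char *const strs[] = {"zero", "one", "two"};
static int32_t rel_table[3];

int main() {
  for (int i = 0; i < 3; ++i)
    rel_table[i] = (int32_t)((intptr_t)strs[i] - (intptr_t)rel_table);
  for (int i = 0; i < 3; ++i)
    printf("%s\n", (const char *)((intptr_t)rel_table + rel_table[i]));
  return 0;
}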
Joe Nash 45fd7c02af Revert "[AMDGPU] Mark additional VOP3 as commutable"
This reverts commit d35d8da7d6.
2021-03-29 14:48:11 -04:00
Joe Nash d35d8da7d6 [AMDGPU] Mark additional VOP3 as commutable
Note, only src0 and src1 will be commuted if the isCommutable flag
is set. This patch does not change that, it just makes it possible
to commute src0 and src1 of more instructions.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D99376

Change-Id: I61e20490962d95ea429beb355c55f55c024dafdc
2021-03-29 14:22:20 -04:00
Roger Ferrer Ibanez 489ca73ac4 [PrologEpilogInserter][AMDGPU] Only adjust offset for emergency spill slots if the stack grows down
D89239 adjusts the stack offset of emergency spill slots for overaligned
stacks. However, the adjustment is not valid for targets whose stack
grows up (such as AMDGPU).

This change makes the adjustment conditional, applying it only to targets
whose stack grows down.

Fixes https://bugs.llvm.org/show_bug.cgi?id=49686

Differential Revision: https://reviews.llvm.org/D99504
2021-03-29 17:26:58 +00:00
Petar Avramovic b082e6f88a [AMDGPU] Extend gfx10 test coverage. NFC.
Differential Revision: https://reviews.llvm.org/D99267
2021-03-29 11:13:55 +02:00
Jay Foad 9d08f276d7 [AMDGPU] Use reductions instead of scans in the atomic optimizer
If the result of an atomic operation is not used then it can be more
efficient to build a reduction across all lanes instead of a scan. Do
this for GFX10, where the permlanex16 instruction makes it viable. For
wave64 this saves a couple of dpp operations. For wave32 it saves one
readlane (which is generally bad for performance) and one dpp
operation.

Differential Revision: https://reviews.llvm.org/D98953
2021-03-26 15:38:14 +00:00
Gulfem Savrun Yeniceri 5fbe1fdf17 Revert "[Passes] Add relative lookup table converter pass"
This reverts commit 5fd001a5ff
because it broke the clang-with-thin-lto-ubuntu bot.
2021-03-24 18:59:33 +00:00
Gulfem Savrun Yeniceri 5fd001a5ff [Passes] Add relative lookup table converter pass
Lookup tables generate non-PIC-friendly code, which requires dynamic relocation as described in:
https://bugs.llvm.org/show_bug.cgi?id=45244

This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly.

Differential Revision: https://reviews.llvm.org/D94355
2021-03-24 17:31:18 +00:00
Konstantin Zhuravlyov f4ace63737 AMDGPU: Add target id and code object v4 support
- Add target id support (https://clang.llvm.org/docs/ClangOffloadBundler.html#target-id)
  - Add code object v4 support (https://llvm.org/docs/AMDGPUUsage.html#elf-code-object)
    - Add kernarg_size to kernel descriptor
    - Change trap handler ABI to no longer move queue pointer into s[0:1]
  - Cleanup ELF definitions
    - Add V2, V3, V4 suffixes to make a clear distinction for code object version
    - Consolidate note names

Differential Revision: https://reviews.llvm.org/D95638
2021-03-24 11:54:05 -04:00
alex-t dccf83acf9 [AMDGPU] SIOptimizeExecMaskingPreRA should check the constant bus constraint when folding an EXEC copy
Folding an EXEC copy into its single use may lead to a constant bus
constraint violation, as it adds one more SGPR operand. This change makes
the pass validate the user instruction with the new SGPR operand and only
fold the copy if it is legal.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D98888
2021-03-24 14:14:13 +03:00
Matt Arsenault b24436ac96 GlobalISel: Lower funnel shifts 2021-03-23 09:11:17 -04:00
Jay Foad d42f63beeb [AMDGPU] Use non-compressed exports in a test. NFC.
I don't think there's any need for this test to use compressed exports.
Using normal exports seems a bit more straightforward and avoids a tiny
bit of bitcasting.

Differential Revision: https://reviews.llvm.org/D99167
2021-03-23 11:18:12 +00:00
Pushpinder Singh d0e5422eb8 [GlobalISel][AMDGPU] Lower G_UMULO/G_SMULO
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D93963
2021-03-23 05:45:43 +00:00
Carl Ritson 64db6b8d37 [AMDGPU] Only unbundle memory accesses in SIMemoryLegalizer
This restores previous behaviour and is a step toward removing
unbundling entirely.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D99061
2021-03-23 11:30:36 +09:00
Gulfem Savrun Yeniceri e3a6d70c68 Revert "[Passes] Add relative lookup table converter pass"
This reverts commit 78a65cd945 which
caused buildbot failures.
2021-03-23 00:43:16 +00:00
Gulfem Savrun Yeniceri 78a65cd945 [Passes] Add relative lookup table converter pass
Lookup tables generate non-PIC-friendly code, which requires dynamic relocation as described in:
https://bugs.llvm.org/show_bug.cgi?id=45244

This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly.

Differential Revision: https://reviews.llvm.org/D94355
2021-03-22 22:09:02 +00:00
Matt Arsenault c34819afe3 GlobalISel: Handle G_BUILD_VECTOR in isKnownToBeAPowerOfTwo 2021-03-22 14:20:35 -04:00
Matt Arsenault 1dd23c6d53 AMDGPU: Allow tail calls for amdgpu_gfx functions 2021-03-22 10:55:19 -04:00
Matt Arsenault 6314a72730 AMDGPU/GlobalISel: Enable CSE in pre-legalizer combiner 2021-03-21 10:07:37 -04:00
Carl Ritson fe5f4c397f [AMDGPU] Rename SIInsertSkips Pass
The pass no longer handles skips. It now removes unnecessary
unconditional branches and lowers early termination branches,
hence the rename to SILateBranchLowering.

Move code to handle returns to epilog from SIPreEmitPeephole
into SILateBranchLowering. This means SIPreEmitPeephole only
contains optional optimisations, and all required transforms
are in SILateBranchLowering.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98915
2021-03-20 11:48:04 +09:00
Carl Ritson 5df2af8b0e [AMDGPU] Merge SIRemoveShortExecBranches into SIPreEmitPeephole
SIRemoveShortExecBranches is an optimisation, so it fits well in the
context of SIPreEmitPeephole.

Test changes relate to early termination from kills, which are now
lowered prior to considering branches for removal.
As these use s_cbranch, the execz skips are now retained instead.
Currently either behaviour is valid, as kill with EXEC=0 is a nop;
however, if early termination is used differently in the future then
the new behaviour is the correct one.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D98917
2021-03-20 11:26:42 +09:00
Carl Ritson b76c09023d [AMDGPU] Allow index optimisation in SIPreEmitPeephole for bundles
Add code so that duplicate index register changes can be removed from
inside bundles.

Reviewed By: rampitec, foad

Differential Revision: https://reviews.llvm.org/D98940
2021-03-20 10:26:23 +09:00
Jay Foad 87248e852b [AMDGPU] Rationalize some check prefixes and use more common prefixes. NFC. 2021-03-19 16:48:33 +00:00
Jay Foad 5df52f7708 [AMDGPU] Remove weird target triples from tests. NFC. 2021-03-19 16:48:32 +00:00
Simon Pilgrim 9d2df96407 [DAG] computeKnownBits - add ISD::MULHS/MULHU/SMUL_LOHI/UMUL_LOHI handling
Reuse the existing KnownBits multiplication code to handle the 'extend + multiply + extract high bits' pattern for multiply-high ops.

Noticed while looking at the codegen for D88785 / D98587 - the patch helps division-by-constant expansion code in particular, which suggests that we might have some further KnownBits div/rem cases we could handle - but this was far easier to implement.

Differential Revision: https://reviews.llvm.org/D98857
2021-03-19 16:02:31 +00:00
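
The pattern in question, written out as a tiny C++ function for the i16
case:

#include <cstdint>
#include <cstdio>

// MULHU modelled as zero-extend, multiply, extract the high half; known
// bits of the wide product therefore yield known bits of the result.
uint16_t mulhu16(uint16_t a, uint16_t b) {
  return (uint16_t)(((uint32_t)a * b) >> 16);
}

int main() {
  printf("0x%04X\n", mulhu16(0x8000, 0x8000)); // 0x4000
  return 0;
}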
Jay Foad b8616e40da [AMDGPU] Add atomic optimizer nouse tests
Add some atomic optimizer tests where there is no use of the result of
the atomic operation, which is a common case in real code. NFC.

Differential Revision: https://reviews.llvm.org/D98952
2021-03-19 15:39:42 +00:00
Jay Foad 685335a014 [AMDGPU] Remove duplicate test functions. NFC. 2021-03-19 11:36:14 +00:00
Stanislav Mekhanoshin edd6da10d2 [AMDGPU] Remove cpol, tfe, and swz from MUBUF patterns
These are always selected as 0 anyway.

Differential Revision: https://reviews.llvm.org/D98663
2021-03-18 14:36:04 -07:00
Jon Chesterfield 253f804deb [amdgpu] Update med3 combine to skip i64

Fixes an assumption that a type which is not i32 will be i16, which asserts
when trying to sign/zero-extend an i64 to i32.

Test case was cut down from an openmp application. Variations on it are hit by
other combines before reaching the problematic one, e.g. replacing the
immediate values with other function arguments changes the codegen path and
misses this combine.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98872
2021-03-18 15:56:41 +00:00
Jay Foad 078b338ba6 [AMDGPU] Add some gfx1010 test coverage. NFC. 2021-03-18 14:00:07 +00:00
Matt Arsenault b9a0384983 GlobalISel: Preserve source value information for outgoing byval args
Pass through the original argument IR value in order to preserve the
aliasing information in the memcpy memory operands.
2021-03-18 09:16:54 -04:00
Matt Arsenault 61f834cc09 GlobalISel: Insert memcpy for outgoing byval arguments
byval requires an implicit copy between the caller and callee such
that the callee may write into the stack area without it modifying the
value in the parent. Previously, this was passing through the raw
pointer value which would break if the callee wrote into it.

Most of the time, this copy can be optimized out (however we don't
have the optimization SelectionDAG does yet).

This will trigger more fallbacks for AMDGPU now, since we don't have
legalization for memcpy yet (although we should stop using byval
anyway).
2021-03-18 09:16:54 -04:00
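
byval's contract, restated in plain C++ terms (the names are illustrative):

#include <cstdio>

struct S { int a[4]; };

// Conceptually a byval argument: the callee owns a private stack copy
// and may scribble on it without the caller observing the writes.
void callee(S s) {
  s.a[0] = 42;
}

int main() {
  S s = {{1, 2, 3, 4}};
  callee(s);
  printf("%d\n", s.a[0]); // still 1: the caller's value is untouched
  return 0;
}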
Simon Pilgrim 388fbefb4f [AMDGPU] Regenerate atomic_optimizations_global_pointer.ll tests 2021-03-18 11:15:44 +00:00
Simon Pilgrim cfc256ba9f [DAG] TargetLowering::isBinOp() - add ISD::SSUBSAT/USUBSAT
Add to the generic non-commutative binop list.
2021-03-17 14:51:00 +00:00
Simon Pilgrim 4a68740547 Revert rG3b635253ddd0106c88051cff3540d8eb90bee22f "[AMDGPU] Regenerate wave32.ll test checks"
Breaks on some buildbots.
2021-03-17 11:47:09 +00:00
Simon Pilgrim 3b635253dd [AMDGPU] Regenerate wave32.ll test checks
This is to help simplify the diff on an upcoming patch
2021-03-17 11:27:11 +00:00
Stanislav Mekhanoshin bc27a31801 [AMDGPU] Fix copyPhysReg to not produce unaligned vgpr access
RA can insert something like a sub1_sub2 COPY of a wide VGPR
tuple, which results in an unaligned access with v_pk_mov_b32
after the copy is expanded. This is a regression after D97316.

Differential Revision: https://reviews.llvm.org/D98549
2021-03-15 14:14:30 -07:00
Stanislav Mekhanoshin 3bffb1cd0e [AMDGPU] Use single cache policy operand
Replace the individual operands GLC, SLC, and DLC with a single cache_policy
bitmask operand. This will reduce the number of operands in MIR and, I hope,
the amount of code. These operands are mostly 0 anyway.

An additional advantage is that the parser will accept these flags in any
order, unlike before.

Differential Revision: https://reviews.llvm.org/D96469
2021-03-15 13:00:59 -07:00
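
A sketch of the single-bitmask encoding in C++; the bit assignments are
illustrative (see the AMDGPU backend's definitions for the real values):

#include <cstdint>
#include <cstdio>

// One cache_policy immediate replaces three separate 0/1 operands.
enum CachePolicy : uint32_t {
  GLC = 1u << 0,
  SLC = 1u << 1,
  DLC = 1u << 2,
};

int main() {
  uint32_t cpol = GLC | DLC; // order-independent, unlike three operands
  printf("glc=%d slc=%d dlc=%d\n",
         !!(cpol & GLC), !!(cpol & SLC), !!(cpol & DLC));
  return 0;
}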
Jon Chesterfield 13e49dcee4 [amdgpu] Implement lower function LDS pass

Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.

Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern shared construct from hip and variables marked
constant which survive the optimizer.

This is of interest to the openmp project because the deviceRTL runtime library
uses cuda shared variables from functions that cannot be inlined. Trunk llvm
therefore cannot compile some openmp kernels for amdgpu. In addition to the
unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled
and the function pointer hashing scheme deleted passes the openmp suite.

This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94648
2021-03-15 15:24:01 +00:00
Carl Ritson 13877db2fa [AMDGPU] Fix shortfalls in WQM marking
When tracking defined lanes through phi nodes in the live range
graph, each branch of the phi must be handled independently.
Also rewrite the marking algorithm to reduce unnecessary
operations.

Previously a shared set of defined lanes was used which caused
marking to stop prematurely. This was observable in existing lit
tests, but test patterns did not cover this detail.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D98614
2021-03-15 21:44:15 +09:00
Roman Lebedev 78b8ce40ef Reland [SCEV] Improve modelling for (null) pointer constants
This reverts commit 329aeb5db4,
and relands commit 61f006ac65.

This is a continuation of D89456.

As was suggested there, now that SCEV models `PtrToInt`,
we can try to improve SCEV's pointer handling.
In particular, I believe I will need this in the future
to further fix `SCEVAddExpr` operation type handling.

This removes the special handling of `ConstantPointerNull`
from `ScalarEvolution::createSCEV()` and adds constant folding
into `ScalarEvolution::getPtrToIntExpr()`.
This way, `null` constants stay as such in SCEVs,
but gracefully become zero integers when asked.

Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D98147
2021-03-13 16:05:34 +03:00
Roman Lebedev 329aeb5db4
Temporarily revert "[SCEV] Improve modelling for (null) pointer constants"
This appears to have broken ubsan bot:
https://lab.llvm.org/buildbot/#/builders/85/builds/3062
https://reviews.llvm.org/D98147#2623549

It looks like LSR needs some kind of a change around insertion point handling.
Reverting until I have a fix.

This reverts commit 61f006ac65.
2021-03-13 09:10:28 +03:00
Roman Lebedev 61f006ac65
[SCEV] Improve modelling for (null) pointer constants
This is a continuation of D89456.

As was suggested there, now that SCEV models `PtrToInt`,
we can try to improve SCEV's pointer handling.
In particular, I believe I will need this in the future
to further fix `SCEVAddExpr` operation type handling.

This removes the special handling of `ConstantPointerNull`
from `ScalarEvolution::createSCEV()` and adds constant folding
into `ScalarEvolution::getPtrToIntExpr()`.
This way, `null` constants stay as such in SCEVs,
but gracefully become zero integers when asked.

Reviewed By: Meinersbur

Differential Revision: https://reviews.llvm.org/D98147
2021-03-12 22:11:58 +03:00
Simonas Kazlauskas a2eca31da2 Test cases for rem-seteq fold with illegal types
This also briefly tests a larger set of architectures than the more
exhaustive functionality tests for AArch64 and x86.

As requested in D88785

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98339
2021-03-12 16:28:04 +02:00
Matt Arsenault 6b76d82853 GlobalISel: Fix marking byval arguments as immutable
byval arguments need to be assumed writable. Only implicitly stack
passed arguments which aren't addressable in the IR can be assumed
immutable.

Mips is still broken since, for some reason, it's doing its own thing
with the ValueHandlers (and x86 doesn't actually handle byval
arguments now, although some of the code is there).
2021-03-12 09:01:53 -05:00
Matt Arsenault 34471c3060 GlobalISel: Partially fix handling of byval arguments
This was essentially ignoring byval and treating them as a pointer
argument which needed to be loaded from. This should copy the frame
index value to the virtual register, not insert a load from the frame
index into the pointer value.

For AMDGPU, this was producing a load from the byval pointer argument,
to a pointer used for the byval arguments. I do not understand how
AArch64 managed to work before since it appears to be similarly
broken.

We could also change the ValueHandler API to avoid the extra copy from
the frame index, since currently it returns a new register.

I believe there is still an issue with outgoing byval arguments. These
should have a copy inserted in case the callee decided to overwrite
the memory.
2021-03-12 09:01:53 -05:00
Carl Ritson f08dadd242 [AMDGPU] Do not annotate an else branch if there is a kill
As llvm.amdgcn.kill is lowered to a terminator it can cause
else branch annotations to end up in the wrong block.
Do not annotate conditionals as else branches where there is
a kill to avoid this.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D97427
2021-03-12 11:52:08 +09:00
Carl Ritson c07f2025e4 [AMDGPU] Restrict image_msaa_load to MSAA dimension types
This instruction is only valid on 2D MSAA and 2D MSAA Array
surfaces.  Remove intrinsic support for other dimension types,
and block assembly for unsupported dimensions.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D98397
2021-03-12 09:47:24 +09:00
Ruiling Song e8e6817d00 [AMDGPU] Don't check hasStackObjects() when reserving VGPR
We have amdgpu_gfx functions with high register pressure. If
we do not reserve a VGPR for SGPR spills, we will fall into the path
that spills the SGPR to memory, which not only has a correctness issue
but also really bad performance.

I don't know why there is a check for hasStackObjects(); in our case,
we don't have stack objects at the time of finalizeLowering(). So just
remove the check, so that we always reserve a VGPR for possible SGPR spills
in non-entry functions.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98345
2021-03-12 08:11:14 +08:00
Matt Arsenault 70cb57d7da AMDGPU/GlobalISel: Improve private addressing mode matching
This enables the look-through-copy to hack around not correctly
regbankselecting constants to match the use bank.
2021-03-11 10:23:35 -05:00
Matt Arsenault cf5ecd5644 GlobalISel: Fix off by one in finding explicit byval alignment
For attribute sets, the return index is at 0, and arguments start at
1. getParamAlignment adds the offset of 1, so we need to convert from
attribute index back to IR index.
2021-03-11 10:23:08 -05:00
Matt Arsenault 0e0c7ef8e4 AMDGPU/GlobalISel: Add more tests for byval arguments 2021-03-11 10:23:08 -05:00
Ruiling Song 66340846b3 [AMDGPU] Always create Stack Object for reserved VGPR
As we may overwrite inactive lanes of a caller-saved VGPR, we should
always save/restore the reserved VGPR for SGPR spills.

Reviewed by: arsenm

Differential Revision: https://reviews.llvm.org/D98319
2021-03-11 10:06:07 +08:00
Daniel Sanders 134a179dee [mir] Change 'undef' for MMO base addresses to 'unknown-address'
Differential Revision: https://reviews.llvm.org/D98100
2021-03-10 16:46:44 -08:00
Stanislav Mekhanoshin 574a9dabc6 [AMDGPU] Always expand system scope fp atomics on gfx90a
FP atomics in system scope cannot be used and shall always
be expanded in a CAS loop.

Differential Revision: https://reviews.llvm.org/D98085
2021-03-10 12:35:23 -08:00
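A minimal sketch of the CAS-loop shape such an expansion takes, written as portable C++ rather than the backend's actual output:

```cpp
#include <atomic>
#include <cstdio>

// A compare-and-swap loop emulating an FP atomic add, the shape a
// system-scope FP atomic is expanded to when the hardware instruction
// cannot be used.
float atomic_fadd(std::atomic<float> &addr, float value) {
  float expected = addr.load();
  // On failure, compare_exchange_weak reloads `expected`; retry until it sticks.
  while (!addr.compare_exchange_weak(expected, expected + value)) {
  }
  return expected; // previous value, matching atomicrmw semantics
}

int main() {
  std::atomic<float> a{1.0f};
  atomic_fadd(a, 2.5f);
  std::printf("%f\n", a.load()); // 3.5
}
```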
Jay Foad 70f013fd3b [AMDGPU] Fix isReallyTriviallyReMaterializable for V_MOV_*
D57708 changed SIInstrInfo::isReallyTriviallyReMaterializable to reject
V_MOVs with extra implicit operands, but it accidentally rejected all
V_MOVs because of their implicit use of exec. Fix it but avoid adding a
moderately expensive call to MI.getDesc().getNumImplicitUses().

In real graphics shaders this changes quite a few vgpr copies into move-
immediates, which is good for avoiding stalls on GFX10.

Differential Revision: https://reviews.llvm.org/D98347
2021-03-10 16:18:12 +00:00
Christudasan Devadasan 4c6ab48fb1 GlobalISel: Try to combine G_[SU]DIV and G_[SU]REM
It is good to have a combined `divrem` instruction when the
`div` and `rem` are computed from identical input operands.
Some targets can lower them through a single expansion that
computes both division and remainder. This effectively reduces
the number of instructions compared to expanding them individually.

Reviewed By: arsenm, paquette

Differential Revision: https://reviews.llvm.org/D96013
2021-03-10 18:46:07 +05:30
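For a flavor of why this pays off, the standard library exposes the same idea: std::div computes the quotient and remainder together, as a combined divrem lowering does with one expansion or libcall instead of two.

```cpp
#include <cstdlib>
#include <cstdio>

int main() {
  // One operation yields both results, the analogue of a combined divrem.
  std::div_t qr = std::div(23, 5);
  std::printf("quot=%d rem=%d\n", qr.quot, qr.rem); // quot=4 rem=3
}
```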
Christudasan Devadasan 24c0ad7143 [AMDGPU] Fix the dead frame indices during custom spill lowering.
AMDGPU target tries to handle the SGPR and VGPR spills in a
custom pass before the actual frame lowering pass. Once they
are handled and the respective frames are eliminated in the
custom pass, certain uses of them still remain. For instance,
the DBG_VALUE instructions inserted by the allocator alongside
the spill instruction will use the corresponding frame index.
They become dead later during PEI and cause a crash while trying to
replace the frame indices. We should possibly avoid this custom pass.
For now, replace such dead references with a null register value.

Reviewed By: arsenm, scott.linder

Differential Revision: https://reviews.llvm.org/D98038
2021-03-09 23:22:49 +05:30
Ruiling Song f0ccdde3c9 [AMDGPU] Remove SI_MASK_BRANCH
This is already deprecated, so remove code working on this.
Also update the tests by using S_CBRANCH_EXECZ instead of SI_MASK_BRANCH.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97545
2021-03-09 09:13:23 +08:00
Craig Topper 0eb405c3b8 [SelectionDAG] Add computeKnownBits support for ISD::USUBSAT.
The result of ISD::USUBSAT will never be larger than the LHS. We
can use this to put a bound on the number of leading zeros.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D98133
2021-03-07 09:48:42 -08:00
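A small demonstration of the bound (using the GCC/Clang builtin __builtin_clz): since usubsat(x, y) never exceeds x, the result has at least as many leading zeros as x, which is exactly what computeKnownBits can now report.

```cpp
#include <cstdint>
#include <cstdio>

// usubsat(x, y) <= x for all inputs, so its result has at least as many
// leading zeros as the LHS.
uint32_t usubsat(uint32_t x, uint32_t y) { return x > y ? x - y : 0; }
int clz32(uint32_t v) { return v ? __builtin_clz(v) : 32; }

int main() {
  uint32_t x = 0x00ffff00, y = 0x00000123;
  uint32_t r = usubsat(x, y);
  std::printf("clz(x)=%d clz(r)=%d\n", clz32(x), clz32(r)); // clz(r) >= clz(x)
}
```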
Jay Foad 99682bc039 Revert "Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030""
This reverts commit e58d68fcd0.

This reinstates commit fc28f600e5
with a fix to initialize HasShaderCyclesRegister. See
https://reviews.llvm.org/D97928.
2021-03-06 09:00:01 +00:00
Mitch Phillips e58d68fcd0 Revert "[AMDGPU] Restore the s_memtime instruction in gfx1030"
Broke the ASan/MSan buildbots. See more comments in the original patch,
https://reviews.llvm.org/D97928.

Build failure at http://lab.llvm.org:8011/#/builders/5/builds/5327

This reverts commit fc28f600e5.
2021-03-05 18:24:59 -08:00
Jay Foad fc28f600e5 [AMDGPU] Restore the s_memtime instruction in gfx1030
gfx1030 added a new way to implement readcyclecounter using the
SHADER_CYCLES hardware register, but the s_memtime instruction still
exists, so the MC layer should still accept it and the
llvm.amdgcn.s.memtime intrinsic should still work.

Differential Revision: https://reviews.llvm.org/D97928
2021-03-05 20:19:11 +00:00
RamNalamothu 3998a8e797 [AMDGPU] Do not attempt sgpr spills to vgpr, when it is disabled
This covers a path missed in https://reviews.llvm.org/D95768.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D98013
2021-03-05 22:47:21 +05:30
Sebastian Neubauer e0e73714fb [AMDGPU] Keep skip branch for ds instructions
Like other memory instructions, ds instructions add latency even if
exec is zero. Jumping over them when exec=0 is cheaper than executing
them.
With this change, the branch instruction that skips over a basic block
if exec=0 is not removed when the block contains a ds instruction.

Differential Revision: https://reviews.llvm.org/D97922
2021-03-05 12:34:09 +01:00
Petar Avramovic 36beaa3ba3 Reland AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect
Recommit bf5a582650. Depends on
4c8fb7ddd6 which was reverted.

RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.

Differential Revision: https://reviews.llvm.org/D95432
2021-03-05 11:05:37 +01:00
Petar Avramovic d44f61f81c Reland [GlobalISel] Combine zext(trunc x) to x
Recommit 4112299ee7. Depends on
4c8fb7ddd6 which was reverted.

Combine zext(trunc x) to x when truncated bits are known to be zero.

Differential Revision: https://reviews.llvm.org/D96031
2021-03-05 11:05:37 +01:00
Jay Foad ed7458398a [AMDGPU] Don't check for VMEM hazards on GFX10
The hazard where a VMEM reads an SGPR written by a VALU counts as a data
dependency hazard, so no nops are required on GFX10. Tested with Vulkan
CTS on GFX10.1 and GFX10.3.

Differential Revision: https://reviews.llvm.org/D97926
2021-03-04 21:44:56 +00:00
Petar Avramovic d7834556b7 Reland [GlobalISel] Start using vectors in GISelKnownBits
This is recommit of 4c8fb7ddd6.
MIR in one unit test had mismatched types.

For vectors we consider a bit as known if it is the same for all demanded
vector elements (all elements by default). KnownBits BitWidth for vector
type is size of vector element. Add support for G_BUILD_VECTOR.
This allows combines of urem_pow2_to_mask in pre-legalizer combiner.

Differential Revision: https://reviews.llvm.org/D96122
2021-03-04 21:47:13 +01:00
Daniel Sanders 9fc2be6f28 [mir] Fix confusing MIR when MMO's value is nullptr but offset is non-zero
:: (store 1 + 4, addrspace 1)
->
:: (store 1 into undef + 4, addrspace 1)

An offset without a base isn't terribly useful, but it's convenient to update
the offset without checking the value. For example, when breaking apart
stores into smaller units.

Differential Revision: https://reviews.llvm.org/D97812
2021-03-04 10:34:30 -08:00
Nico Weber e68de60bc4 Revert "AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect"
This reverts commit bf5a582650.
Also depends on now-reverted 4c8fb7ddd6
2021-03-04 10:16:11 -05:00
Nico Weber 59beb1ef6d Revert "[GlobalISel] Combine zext(trunc x) to x"
This reverts commit 4112299ee7.
Seems to depend on 4c8fb7ddd6 which
is being reverted.
2021-03-04 10:13:40 -05:00
Nico Weber 4b1015361c Revert "[GlobalISel] Start using vectors in GISelKnownBits"
This reverts commit 4c8fb7ddd6.
Breaks check-llvm everywhere, see https://reviews.llvm.org/D96122
2021-03-04 10:13:40 -05:00
Petar Avramovic bf5a582650 AMDGPU/GlobalISel: Combine zext(trunc x) to x after RegBankSelect
RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.

Differential Revision: https://reviews.llvm.org/D95432
2021-03-04 15:05:24 +01:00
Petar Avramovic 4112299ee7 [GlobalISel] Combine zext(trunc x) to x
Combine zext(trunc x) to x when truncated bits are known to be zero.

Differential Revision: https://reviews.llvm.org/D96031
2021-03-04 15:05:23 +01:00
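The combine rests on a simple identity: if the bits removed by the truncate are known to be zero, the trunc/zext round trip reproduces the original value. A trivial C++ check of that identity:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  uint32_t x = 0x000000ab;               // top 24 bits known to be zero
  uint8_t  t = static_cast<uint8_t>(x);  // trunc
  uint32_t z = static_cast<uint32_t>(t); // zext
  std::printf("%s\n", z == x ? "zext(trunc x) == x" : "high bits were lost");
}
```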
Petar Avramovic 4c8fb7ddd6 [GlobalISel] Start using vectors in GISelKnownBits
For vectors we consider a bit as known if it is the same for all demanded
vector elements (all elements by default). KnownBits BitWidth for vector
type is size of vector element. Add support for G_BUILD_VECTOR.
This allows combines of urem_pow2_to_mask in pre-legalizer combiner.

Differential Revision: https://reviews.llvm.org/D96122
2021-03-04 15:05:23 +01:00
Baptiste Saleil 54c0f520c7 [VirtRegRewriter] Insert missing killed flags when tracking subregister liveness
VirtRegRewriter may sometimes fail to correctly apply the kill flag where necessary,
which causes unnecessary codegen on PowerPC. This patch fixes the way masks for
defined lanes are computed and the way the mask for used lanes is computed.

Contact albion.fung@ibm.com instead of author for problems related to this commit.

Differential Revision: https://reviews.llvm.org/D92405
2021-03-03 12:02:04 -05:00
Matt Arsenault 78dcff4841 GlobalISel: Add default implementation of assignValueToReg
Refactor insertion of the asserting ops. This enables using them for
AMDGPU.

This code should essentially be the same for every target. Mips, X86
and ARM all have different code there now, but this seems to be an
accident. The assignment functions are called with different types
than they would be in the DAG, so this is all likely an assortment of
hacks to get around that.
2021-03-03 09:29:53 -05:00
Piotr Sobczak 4672bac177 [AMDGPU] Introduce Strict WQM mode
* Add amdgcn_strict_wqm intrinsic.
* Add a corresponding STRICT_WQM machine instruction.
* The semantics are similar to amdgcn_strict_wwm, with the notable difference that not all threads will be forcibly enabled during the computation of the intrinsic's argument, but only all threads in quads that have at least one thread active.
* The difference between amdgcn_wqm and amdgcn_strict_wqm is that in strict mode an inactive lane will always be enabled, irrespective of control flow decisions.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D96258
2021-03-03 14:19:16 +01:00
Piotr Sobczak c3ce7bae80 [AMDGPU] Rename amdgcn_wwm to amdgcn_strict_wwm
* Introduce the new intrinsic amdgcn_strict_wwm
 * Deprecate the old intrinsic amdgcn_wwm

The change is done for consistency, as the "strict"
prefix will become an important distinguishing factor
between amdgcn_wqm and amdgcn_strict_wqm in the future.

The "strict" prefix indicates that inactive lanes do not
take part in control flow, specifically an inactive lane
enabled by a strict mode will always be enabled irrespective
of control flow decisions.

The amdgcn_wwm will be removed, but doing so in two steps
gives users time to switch to the new name at their own pace.

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D96257
2021-03-03 09:33:57 +01:00
Carl Ritson 2ddac69f98 [AMDGPU] Rename llvm.amdgcn.msaa.load to llvm.amdgcn.msaa.load.x
While the underlying instruction is called image_msaa_load,
the resource must be x component only.
Rename the intrinsic for clarity.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D97829
2021-03-03 17:30:39 +09:00
Matt Arsenault fd82cbcf7d GlobalISel: Merge and cleanup more AMDGPU call lowering code
This merges more AMDGPU ABI lowering code into the generic call
lowering. Start cleaning up by factoring away more of the pack/unpack
logic into the buildCopy{To|From}Parts functions. These could use more
improvement, and the SelectionDAG versions are significantly more
complex, and we'll eventually have to emulate all of those cases too.

This is mostly NFC, but does result in some minor instruction
reordering. It also removes some of the limitations with mismatched
sizes the old code had. However, similarly to the merge on the input,
this is forcing gfx6/gfx7 to use the gfx8+ ABI (which is what we
actually want, but SelectionDAG is stuck using the weird emergent
ABI).

This also changes the load/store size for stack passed EVTs for
AArch64, which makes it consistent with the DAG behavior.
2021-03-02 17:31:13 -05:00
Dmitry Preobrazhensky 28f164bca7 [AMDGPU][MC][GFX9+] Corrected encoding of op_sel_hi for unused operands in VOP3P
Corrected encoding of VOP3P op_sel_hi for unused operands. See bug 49363.

Differential Revision: https://reviews.llvm.org/D97689
2021-03-02 13:02:25 +03:00
Stanislav Mekhanoshin 7c724a896f [AMDGPU] Do not check max-bb for a single block callee
The -amdgpu-inline-max-bb option could lead to suboptimal
codegen by preventing inlining of really simple functions,
including pure wrapper calls. Relax the cutoff by allowing
calls to a function with a single block, on the grounds
that inlining it will not increase the total number of blocks.

Differential Revision: https://reviews.llvm.org/D97744
2021-03-01 19:48:50 -08:00
Yuanfang Chen 5de2d189e6 [Diagnose] Unify MCContext and LLVMContext diagnosing
The situation with inline asm/MC error reporting is kind of messy at the
moment. The errors from MC layout are not reliably propagated, and users
have to specify an inline asm handler separately to get inline asm
diagnostics. The latter is not a correctness issue but could be improved.

* Kill the LLVMContext inline asm diagnostic handler and migrate it to use
  DiagnosticInfo/DiagnosticHandler.
* Introduce `DiagnosticInfoSrcMgr` to diagnose SourceMgr-backed errors. This
  covers use cases like inline asm, MC, and any clients using SourceMgr.
* Move AsmPrinter::SrcMgrDiagInfo and its instance to MCContext. The next step
  is to combine MCContext::SrcMgr and MCContext::InlineSrcMgr because in all
  use cases, only one of them is used.
* If LLVMContext is available, let MCContext use LLVMContext's diagnostic
  handler; if LLVMContext is not available, MCContext uses its own default
  diagnostic handler, which just prints SMDiagnostic.
* Change a few clients (Clang, llc, lldb) to use the new way of reporting.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D97449
2021-03-01 15:58:37 -08:00
Arthur Eubanks 040c1b49d7 Move EntryExitInstrumentation pass location
This seems to be more of a Clang thing rather than a generic LLVM thing,
so this moves it out of LLVM pipelines and as Clang extension hooks into
LLVM pipelines.

Move the post-inline EEInstrumentation out of the backend pipeline and
into a late pass, similar to other sanitizer passes. It doesn't fit
into the codegen pipeline.

Also fix up EntryExitInstrumentation not running at -O0 under the new
PM. PR49143

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D97608
2021-03-01 10:08:10 -08:00
Jay Foad 796a60d2ea [AMDGPU] New intrinsic void llvm.amdgcn.s.sethalt(i32)
The expected use case is for frontends to insert this into
shaders that are to be run under a debugger. The shader can
then be resumed or single stepped from the point of the call
under debugger control.

Differential Revision: https://reviews.llvm.org/D97670
2021-03-01 14:30:23 +00:00
Matt Arsenault 25e60f645a AMDGPU/GlobalISel: Add subtarget to a test
SelectionDAG forces us to have a weird ABI for 16-bit values without
legal 16-bit operations, but currently GlobalISel bypasses this and
sometimes ends up using the gfx8+ ABI in some contexts. Make sure
we're testing the normal ABI to avoid a test change in a future patch.
2021-02-28 10:29:25 -05:00
Matt Arsenault 81b2c23b77 AMDGPU: Use kill instruction to hint soft clause live ranges
Previously we would use a bundle to hint the register allocator to not
overwrite the pointers in a sequence of loads to avoid breaking soft
clauses. This bundling was based on a fuzzy register pressure
heuristic, so we could not guarantee using more registers than are
really available. This would result in register allocator failing on
unsatisfiable bundles. Use a kill to artificially extend the live
ranges, so we can always succeed at register allocation even if it
means extra spills in the worst case.

This seems to capture most of the benefit of the bundle while avoiding
most of the risk presented by the bundle. However the lit tests do
show a handful of regressions. In some cases with sequences of
volatile loads, unused load components end up getting reallocated to
the next load which forces a wait between. There are also a few small
scheduling regressions where a hazard used to be avoided, and one
spill torture test which for some reason nearly doubles the stack
usage. There is also a bit of noise from leftover kills (it may make
sense for post-RA pseudos to strip all of these out).
2021-02-26 18:26:40 -05:00
Jay Foad dc2259537a [AMDGPU] Add selection pattern for v_xnor_b32
This allows GlobalISel to use this instruction where available. I assume
SelectionDAG always selects s_xnor_b32 so it isn't affected by this
change.

Differential Revision: https://reviews.llvm.org/D97560
2021-02-26 16:41:47 +00:00
Jay Foad 3ad5216ed8 [AMDGPU] Better codegen for i64 bitreverse
Differential Revision: https://reviews.llvm.org/D97547
2021-02-26 15:51:36 +00:00
Michael Liao 0d4e12e3c1 [amdgpu] Atomic should be source of divergence.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D97392
2021-02-24 15:27:47 -05:00
Matt Arsenault 589223e044 AMDGPU: Remove special case in shouldCoalesce
Unaligned registers are now constrained with classes, rather than
specially reserving a subset of the whole class.
2021-02-24 14:49:44 -05:00
Matt Arsenault 78b6d73a93 AMDGPU: Add even aligned VGPR/AGPR register classes
gfx90a operations require even aligned registers, but this was
previously achieved by reserving registers inside the full class.

Ideally this would be captured in the static instruction definitions
for the operands, and we would have different instructions per
subtarget. The hackiest part of this is we need to manually reassign
AGPR register classes after instruction selection (we get away without
this for VGPRs since those types are actually registered for legal
types).
2021-02-24 14:49:37 -05:00
Jay Foad 449e36ce72 [AMDGPU] Add a bit more gfx90a test coverage
Update the GlobalISel version of llvm.amdgcn.workitem.id.ll to mostly
match the SelctionDAG version.

Differential Revision: https://reviews.llvm.org/D97377
2021-02-24 17:08:32 +00:00
Matt Arsenault e844f24a27 AMDGPU: Use aligned vgprs/agprs in gfx90a mir tests
These would fail a verifier check in a future change.
2021-02-23 16:46:22 -05:00
Nicolai Hähnle 52bc2e7577 [AMDGPU][SelectionDAG] Don't combine uniform multiplies to MUL_[UI]24
Prefer to keep uniform (non-divergent) multiplies on the scalar ALU when
possible. This significantly improves some game cases by eliminating
v_readfirstlane instructions when the result feeds into a scalar
operation, like the address calculation for a scalar load or store.

Since isDivergent is only an approximation of whether a value is in
SGPRs, it can potentially regress some situations where a uniform value
ends up in a VGPR. These should be rare in real code, although the test
changes do contain a number of examples.

Most of the test changes are just using s_mul instead of v_mul/mad which
is generally better for both register pressure and latency (at least on
GFX10 where sgpr pressure doesn't affect occupancy and vector ALU
instructions have significantly longer latency than scalar ALU). Some
R600 tests now use MULLO_INT instead of MUL_UINT24.

GlobalISel appears to handle more scenarios in the desirable way,
although it can also be thrown off and fails to select the 24-bit
multiplies in some cases.

An alternative solution, considered and rejected, was to allow selecting
MUL_[UI]24 to S_MUL_I32. I've rejected this because those SD operations
are defined as don't-care on the most significant 8 bits of their inputs,
and this fact is used in some combines via SimplifyDemandedBits.

Based on a patch by Nicolai Hähnle.

Differential Revision: https://reviews.llvm.org/D97063
2021-02-23 15:39:19 +00:00
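A sketch of MUL_U24-style semantics in plain C++ (illustrative, not the backend definition): only the low 24 bits of each operand participate, which is the don't-care property on the top input bits mentioned above.

```cpp
#include <cstdint>
#include <cstdio>

// MUL_U24-style multiply: only the low 24 bits of each operand are defined.
// The ISD node treats the top 8 input bits as don't-care; the masking here
// just makes that explicit for the demonstration.
uint32_t mul_u24(uint32_t a, uint32_t b) {
  return (a & 0xffffffu) * (b & 0xffffffu);
}

int main() {
  std::printf("0x%x\n", mul_u24(0xff123456u, 2)); // top byte of a is ignored
}
```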
Jay Foad fdaa2d0259 [AMDGPU] Use divergent addresses for vector loads
Change some test cases to use divergent addresses for vector loads,
which should be the common case in real world code. Using uniform
addresses causes poor instruction selection for the surrounding
code which has to be fixed up post-register-allocation, and this causes
a lot of testsuite churn for a forthcoming patch to stop selecting
24-bit vector multiply instructions for uniform multiplies.

This shows up some problems in the idot tests where we fail to select
v_dot instructions because the patterns only match MUL_[UI]24 ISD nodes,
but the DAG contains i16 mul nodes instead.

Differential Revision: https://reviews.llvm.org/D97062
2021-02-23 13:33:15 +00:00
Dmitry Preobrazhensky 4813518092 [AMDGPU][MC] Corrected bound_ctrl for compatibility with sp3
Enabled "bound_ctrl:1" and disabled "bound_ctrl:-1" syntax.
Corrected printer to output "bound_ctrl:1" instead of "bound_ctrl:0".
See bug 35397 for detailed issue description.

Differential Revision: https://reviews.llvm.org/D97048
2021-02-22 14:59:40 +03:00
Nikita Popov 71a8e4e7d6 [MemCopyOpt] Enable MemorySSA by default
This enables use of MemorySSA instead of MemDep in MemCpyOpt. To
allow this without significant compile-time impact, the MemCpyOpt
pass is moved directly before DSE (in the cases where this was not
already the case), which allows us to reuse the existing MemorySSA
analysis.

Unlike the MemDep-based implementation, the MemorySSA-based MemCpyOpt
can also perform simple optimizations across basic blocks.

Differential Revision: https://reviews.llvm.org/D94376
2021-02-19 18:06:25 +01:00
madhur13490 3c297a2564 Make fixed-abi default for AMD HSA OS
The fixed ABI uses pre-defined and predictable
SGPRs/VGPRs for passing arguments. This patch makes
this scheme the default when the HSA OS is specified in the triple.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96340
2021-02-19 15:05:25 +00:00
Jay Foad b2c7f06db1 [AMDGPU] Add some GFX9 test coverage. NFC. 2021-02-19 14:38:52 +00:00
Carl Ritson 8181dcd30f [AMDGPU] WQM/WWM: Fix marking of partial definitions
Track lanes when processing definitions for marking WQM/WWM.
If all lanes have been defined then marking can stop.
This prevents marking unnecessary instructions as WQM/WWM.

In particular this fixes a bug where values passing through
V_SET_INACTIVE would be marked as requiring WWM.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D95503
2021-02-19 20:45:24 +09:00
Matt Arsenault 27093f1a94 AMDGPU: Add regression testcase for bundle pressure issue
This is a somewhat reduced testcase that regressed, causing the revert
in 477e3fe4f8.

This was producing a bundle that could not be allocated. This is a
tricky one to reduce/reproduce, but I do like having some sanity check
for this.
2021-02-18 17:39:33 -05:00
Matt Arsenault 62d946e133 GlobalISel: Merge some AMDGPU ABI lowering code to generic code
AMDGPU currently has a lot of pre-processing code to pre-split
argument types into 32-bit pieces before passing it to the generic
code in handleAssignments. This is a bit sloppy and also requires some
overly fancy iterator work when building the calls. It's better if all
argument marshalling code is handled directly in
handleAssignments. This handles more situations like decomposing large
element vectors into sub-element sized pieces.

This should mostly be NFC, but does change the generated code by
shifting where the initial argument packing instructions are placed. I
think this is nicer looking, since it now emits the packing code
directly after the relevant copies, rather than after the copies for
the remaining arguments.

This doubles down on gfx6/gfx7 using the gfx8+ ABI for 16-bit
types. This is ultimately the better option, but incompatible with the
DAG. Fixing this requires more work, especially for f16.
2021-02-18 17:26:55 -05:00
Konstantin Zhuravlyov 622652bf73 AMDGPU: Fix checks in llvm.amdgcn.workitem.id.ll
Differential Revision: https://reviews.llvm.org/D96967
2021-02-18 11:56:15 -05:00
Jay Foad e1b1119f21 [AMDGPU] Tidy up a FIXME fixed by D34973 2021-02-18 14:28:27 +00:00
Stanislav Mekhanoshin a8d9d50762 [AMDGPU] gfx90a support
Differential Revision: https://reviews.llvm.org/D96906
2021-02-17 16:01:32 -08:00
Jessica Paquette 26fb036559 [GlobalISel] Implement computeNumSignBits for G_ASSERT_SEXT
Same implementation as G_SEXT_INREG.

Add a testcase to combine-sext-inreg for a concrete example, and a testcase
to KnownBitsTest.

Differential Revision: https://reviews.llvm.org/D96897
2021-02-17 13:53:17 -08:00
Piotr Sobczak c72a63b4b0 [AMDGPU] Add implicit vcc_lo on S_CBRANCH_VCCNZ in wave32
* Update skip-if-dead.ll with tests for wave32.
* Fix the crash in verifier in one newly enabled test by adding
  missing fixImplicitOperands in branch insertion code.

```
*** Bad machine code: Using an undefined physical register ***
- function:    test_kill_divergent_loop
- basic block: %bb.2 bb (0xad96308)
- instruction: S_CBRANCH_VCCNZ %bb.1, implicit $vcc_lo
- operand 1:   implicit $vcc_lo
LLVM ERROR: Found 1 machine code errors.
```

* Simplify "cbranch_kill" to not use interp instructions.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D96793
2021-02-17 15:14:57 +01:00
Piotr Sobczak 08131c7439 [AMDGPU] Fix a miscompile with S_ADD/S_SUB
The helper function isBoolSGPR is too aggressive when determining
when a v_cndmask can be skipped on a boolean value because the
function does not check the operands of and/or/xor.

This can be problematic for the Add/Sub combines that can leave
bits set even for inactive lanes leading to wrong results.

Fix this by inspecting the operands of and/or/xor recursively.

Differential Revision: https://reviews.llvm.org/D86878
2021-02-17 12:24:58 +01:00
Tony Tye c62b737ad6 [AMDGPU] Correct rmw atomics s_waitcnt generation
The AMD GPU SIMemoryLegalizer was using the ordering address space
rather than the instruction address space when determining the
s_waitcnt to generate to ensure that a read-modify-write atomic has
completed. This resulted in additional unnecessary counters being
waited on.

Differential Revision: https://reviews.llvm.org/D96743
2021-02-17 01:32:29 +00:00
Simon Pilgrim df45c18135 [DAG] PromoteIntRes_ADDSUBSHLSAT - promote ISD::UADDSAT as clamped add
Similar to D96622, we're better off just promoting uaddsat(x,y) -> umin(add(x,y),c) instead of trying to perform a shifted uaddsat.

I initially tried to just use shifted promotion in cases where we didn't have a legal/custom umin, but we don't appear to have any targets that have uaddsat but not umin, so IMO we're better off always using the umin and avoiding an untested shifted uaddsat code path.

Differential Revision: https://reviews.llvm.org/D96767
2021-02-16 17:37:44 +00:00
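A minimal sketch of the clamped-add promotion, assuming an 8-bit uaddsat promoted to a 32-bit legal type:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Promote an illegal 8-bit uaddsat to a wider legal type: add in 32 bits
// (which cannot overflow for 8-bit inputs), then clamp with umin against
// the narrow type's maximum -- uaddsat(x,y) -> umin(add(x,y), 255).
uint8_t uaddsat8_via_promotion(uint8_t x, uint8_t y) {
  uint32_t wide = uint32_t(x) + uint32_t(y);
  return uint8_t(std::min<uint32_t>(wide, 0xffu));
}

int main() {
  std::printf("%u\n", unsigned(uaddsat8_via_promotion(200, 100))); // 255
}
```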
Matt Arsenault a7455d7b7c AMDGPU: Remove kills following clusters of memory instruction
In a future commit, soft clauses will be hinted with kill instructions
rather than forced together with bundles. Look for kills that look
like this, and erase them. I'm not sure if the check for specific uses
is worthwhile, or if it would be better to just unconditionally erase
kills.

This reduces test churn in a future patch.
2021-02-16 10:49:28 -05:00
Matt Arsenault c320e8196a AMDGPU: Fix debug info handling in post-RA bundler
This was allowing debug instructions to break the bundling, which
would change scheduling behavior. Bundle debug info / kills inside
the bundle. This seems to work OK, although the asm printer doesn't
understand these in a bundle. This implicitly expects the memory
legalizer to unbundle. It would probably be slightly nicer to move
these after.

Rewrite the loop to be clearer and make sure we don't end a bundle on
a meta instruction, only allow them in between other valid bundle
instructions.
2021-02-16 10:42:06 -05:00
Stanislav Mekhanoshin 5cf9292ce3 [AMDGPU] Add two TSFlags: IsAtomicNoRtn and IsAtomicRtn
We are using the AtomicNoRet map in multiple places to determine
whether an instruction is an atomic and whether it is the rtn or nortn
variant. This method does not always work, since we have some
instructions which only have a rtn or a nortn version.

One such instruction is ds_wrxchg_rtn_b32, which does not have a
nortn version. This has caused changes in memory legalizer
tests.

Differential Revision: https://reviews.llvm.org/D96639
2021-02-15 11:27:59 -08:00
Carl Ritson aef781b47a [AMDGPU] Add llvm.amdgcn.wqm.demote intrinsic
Add intrinsic which demotes all active lanes to helper lanes.
This is used to implement demote to helper Vulkan extension.

In practice demoting a lane to helper simply means removing it
from the mask of live lanes used for WQM/WWM/Exact mode.
Where the shader does not use WQM, demotes just become kills.

Additionally add llvm.amdgcn.live.mask intrinsic to complement
demote operations. In theory llvm.amdgcn.ps.live can be used
to detect helper lanes; however, ps.live can be moved by LICM.
The movement of ps.live cannot be remedied without changing
its type signature and such a change would require ps.live
users to update as well.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D94747
2021-02-15 08:45:46 +09:00
Tony Tye 8a91b68b95 [AMDGPU] Limit memory scope for scratch, LDS and GDS
Changes for AMD GPU SIMemoryLegalizer:

- Limit the memory scope to maximum supported by the scratch, LDS and
  GDS address spaces.

- Improve assertion checking.

- Correct toSIAtomicScope argument name.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D96643
2021-02-14 17:34:12 +00:00
Fangrui Song 962b29d716 ELFObjectWriter: Don't sort non-local symbols
As we don't sort local symbols, don't sort non-local symbols.  This makes
non-local symbols appear in their register order, which matches GNU as. The
register order is nice in that you can write tests with interleaved CHECK
prefixes, e.g.

```
// CHECK: something about foo
.globl foo
foo:
// CHECK: something about bar
.globl bar
bar:
```

With the lexicographical order, the user needs to place the lexicographically
smallest symbol first or keep CHECK prefixes in one place.
2021-02-13 10:32:27 -08:00
Simon Pilgrim 60ba5397df [DAG] PromoteIntRes_ADDSUBSHLSAT - use promoted ISD::USUBSAT directly
As discussed on D96413, as long as the promoted bits of the args are zero we can use the basic ISD::USUBSAT pattern directly, without the shifting like we do for other ops.

I think something similar should be possible for ISD::UADDSAT as well, which I'll look at later.

Also, create a ISD::USUBSAT node directly - this will be expanded back by the legalizer later on if necessary.

Differential Revision: https://reviews.llvm.org/D96622
2021-02-13 12:35:10 +00:00
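An exhaustive 8-bit check of the property relied on here: with zero-extended (promoted) operands, a wide usubsat followed by truncation equals the narrow usubsat, with no shifting required.

```cpp
#include <cstdint>
#include <cstdio>

uint32_t usubsat32(uint32_t x, uint32_t y) { return x > y ? x - y : 0; }

int main() {
  // With zero-extended operands, the wide usubsat truncates to exactly the
  // narrow usubsat for every 8-bit input pair.
  for (unsigned x = 0; x < 256; ++x)
    for (unsigned y = 0; y < 256; ++y) {
      uint8_t narrow = x > y ? uint8_t(x - y) : uint8_t(0);
      if (uint8_t(usubsat32(x, y)) != narrow)
        return std::puts("mismatch"), 1;
    }
  std::puts("wide usubsat + trunc matches narrow usubsat for all inputs");
  return 0;
}
```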
Fangrui Song 39db16e75b [test] Make ELF tests less reliant on the lexicographical order of non-local symbols 2021-02-13 01:01:06 -08:00
Jessica Paquette 145549ff89 [GlobalISel] Combine (x + 0) -> x, G_PTR_ADD edition
Add it to right_identity_zero.

Differential Revision: https://reviews.llvm.org/D96621
2021-02-12 12:09:48 -08:00
Simon Pilgrim 4841a225b7 [DAG] Move basic USUBSAT pattern matches from X86 to DAGCombine
Begin transitioning the X86 vector code to recognise sub(umax(a,b),b) or sub(a,umin(a,b)) USUBSAT patterns to make it more generic and available to all targets.

This initial patch just moves the basic umin/umax patterns to DAG, removing some vector-only checks on the way - these are some of the patterns that the legalizer will try to expand back to so we can be reasonably relaxed about matching these pre-legalization.

We can handle the trunc(sub(..))) variants as well, which helps with patterns where we were promoting to a wider type to detect overflow/saturation.

The remaining x86 code requires some cleanup first - some of it isn't actually tested etc. I also need to resurrect D25987.

Differential Revision: https://reviews.llvm.org/D96413
2021-02-12 18:22:57 +00:00
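Both source patterns can be verified exhaustively for 8-bit values; a quick C++ check:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  // Check both USUBSAT source patterns over all 8-bit inputs:
  //   usubsat(a, b) == umax(a, b) - b == a - umin(a, b)
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned b = 0; b < 256; ++b) {
      unsigned sat = a > b ? a - b : 0;
      if (std::max(a, b) - b != sat || a - std::min(a, b) != sat)
        return std::puts("mismatch"), 1;
    }
  std::puts("both patterns are equivalent to usubsat");
  return 0;
}
```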
Petar Avramovic f0d65f4096 AMDGPU/GlobalISel: Calculate isKnownNeverNaN for fminnum and fmaxnum
Implements same logis as in SelectionDAG.
G_FMINNUM_IEEE and G_FMAXNUM_IEEE are never SNaN by definition and
never NaN when one operand is known non-NaN and other known non-SNaN.
G_FMINNUM and G_FMAXNUM are never NaN/SNaN when one of the operands
is known non-NaN/SNaN.

Differential Revision: https://reviews.llvm.org/D91716
2021-02-12 17:14:34 +01:00
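The underlying IEEE minNum behaviour can be seen with std::fmin, which returns the non-NaN operand when exactly one input is a NaN, so the result is non-NaN whenever one operand is known non-NaN:

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // IEEE minNum semantics: with exactly one NaN operand, the other operand
  // is returned. std::fmin follows these semantics.
  double r = std::fmin(std::nan(""), 2.0);
  std::printf("fmin(NaN, 2.0) = %g, isnan = %d\n", r, std::isnan(r)); // 2, 0
}
```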
Petar Avramovic 122c649c98 AMDGPU/GlobalISel: Check values of constants in isKnownNeverNaN
Differential Revision: https://reviews.llvm.org/D91714
2021-02-12 17:14:34 +01:00
Petar Avramovic 841ee7423d AMDGPU/GlobalISel: Precommit globalisel tests for isKnownNeverNaN 2021-02-12 17:14:34 +01:00
Stanislav Mekhanoshin cb41ee92da [AMDGPU] Fix promote alloca with double use in a same insn
If we have an instruction where more than one pointer operand
is derived from the same promoted alloca, we fix it for
one argument but do not fix a second use, considering this user
done.

Fix this by deferring processing of memory intrinsics until all
potential operands are replaced.

Fixes: SWDEV-271358

Differential Revision: https://reviews.llvm.org/D96386
2021-02-11 11:42:25 -08:00
Matt Arsenault e3c6fa3611 AMDGPU: Restrict soft clause bundling at half of the available regs
Fixes a testcase that was overcommitting large register tuples to a
bundle, which the register allocator could not possibly satisfy.  This
was producing a bundle which used nearly all of the available SGPRs
with a series of 16-dword loads (not all of which are freely available
to use).

This is a quick hack for some deeper issues with how the clause
bundler tracks register pressure.

Overall the pressure tracking used here doesn't make sense and is too
imprecise for what it needs to avoid the allocator failing. The
pressure estimate does not account for the alignment requirements of
large SGPR tuples, so this was really underestimating the pressure
impact. This also ignores the impact of the extended live range of the
use registers after the bundle is introduced. Additionally, it didn't
account for some wide tuples not being available due to reserved
registers.

This regresses a few cases. These end up introducing more
spilling. This is also a function of the global pressure being used in
the decision to bundle, not the local pressure impact of the bundle
itself.
2021-02-11 14:08:59 -05:00
Jay Foad 23db2d363f [AMDGPU] Better selection of base offset when merging DS reads/writes
When merging a pair of DS reads or writes needs to materialize the base
offset in a vgpr, choose a value that is aligned to as high a power of
two as possible. This maximises the chance that different pairs can use
the same base offset, in which case the base offset registers can be
commoned up by MachineCSE.

Differential Revision: https://reviews.llvm.org/D96421
2021-02-11 17:46:09 +00:00
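A hypothetical helper showing the rounding idea (chooseBase and its parameters are illustrative, not the in-tree code): rounding base offsets down to a common power-of-two boundary lets nearby pairs share one materialized base register.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper (not the in-tree logic): round the materialized base
// down to a power-of-two boundary so that nearby DS read/write pairs are
// more likely to compute the same base, which MachineCSE can then share.
uint32_t chooseBase(uint32_t offset, uint32_t align /* power of two */) {
  return offset & ~(align - 1);
}

int main() {
  // Both offsets land on base 0x12000, leaving small immediate DS offsets.
  std::printf("0x%x 0x%x\n",
              chooseBase(0x12345, 0x1000), chooseBase(0x12fc0, 0x1000));
}
```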
Carl Ritson c16f776028 [AMDGPU] Move kill lowering to WQM pass and add live mask tracking
Move implementation of kill intrinsics to WQM pass. Add live lane
tracking by updating a stored exec mask when lanes are killed.
Use live lane tracking to enable early termination of shader
at any point in control flow.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D94746
2021-02-11 20:31:29 +09:00
Jay Foad b5f3383152 [AMDGPU] Add another test case for combining DS reads 2021-02-10 14:59:49 +00:00
Matt Arsenault f4ca6d8289 AMDGPU: Fix verifier error with argument passed in CSR SGPR
We need to avoid setting the kill flag on the CSR spill if there's an
additional use of the register after the spill.

This does rely on consistency between the entry block liveins and the
MRI's function live ins, which is not something the verifier checks
now.
2021-02-09 13:49:44 -05:00
Matt Arsenault e855cc6d04 AMDGPU/GlobalISel: Remove dead check prefixes 2021-02-08 17:09:28 -05:00
Jay Foad d8323b1a86 [AMDGPU] Generate test checks and add GFX10 test coverage
Differential Revision: https://reviews.llvm.org/D96143
2021-02-08 12:57:51 +00:00
Thomas Symalla f89f6d1e5d [AMDGPU]: Fixes an invalid clamp selection pattern.
When running the tests on PowerPC and x86, the lit test GlobalISel/trunc.ll fails at the memory sanitizer step. This seems to be due to invalid matching logic (the pattern matches even when it shouldn't) and likely missing variable initialisation.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D95878
2021-02-08 13:06:30 +01:00
Wen-Heng (Jack) Chung 04766c401b [AMDGPU] Add Fiji target in fptosi/fptoui instruction-select MIR tests.
In response to review comments in D95964, add a target with f16 instructions.

Differential Revision: https://reviews.llvm.org/D96061
2021-02-05 11:33:54 -06:00
Wen-Heng (Jack) Chung 50578cf339 [AMDGPU] Add f16 to i1 CodeGen patterns.
Follow patterns used for f32 and f64 types.

Differential Revision: https://reviews.llvm.org/D95964
2021-02-04 11:44:18 -06:00
Jay Foad d84e5fdac1 [AMDGPU][GlobalISel] Fix v2s16 right shifts
When widening, each half of the v2s16 operands needs to be sign extended
for G_ASHR or zero extended for G_LSHR.

Differential Revision: https://reviews.llvm.org/D96048
2021-02-04 17:04:32 +00:00
Jay Foad b3bb5c3efc [AMDGPU][GlobalISel] Use scalar min/max instructions
SALU min/max s32 instructions exist so use them. This means that
regbankselect can handle min/max much like add/sub/mul/shifts.

Differential Revision: https://reviews.llvm.org/D96047
2021-02-04 17:04:32 +00:00
Konstantin Zhuravlyov 6054a456da AMDGPU: Add support for amdgpu-unsafe-fp-atomics attribute
If amdgpu-unsafe-fp-atomics is specified, allow {flat|global}_atomic_add_f32 even if atomic modes don't match.

Differential Revision: https://reviews.llvm.org/D95391
2021-02-04 08:09:34 -05:00
Sebastian Neubauer 6c59dc474d [AMDGPU] Save all lanes for reserved VGPRs
When SGPRs are spilled to VGPRs, they can overwrite any lane. We need
to preserve the value of inactive lanes in function calls, so we save
the register even if it is marked as caller saved.

Also, teach buildPrologSpill to work when no registers are free like in
CodeGen/AMDGPU/pei-scavenge-vgpr-spill.mir and update the comment on
findScratchNonCalleeSaveRegister as it is not used anymore to realign
the stack pointer since D95865.

Differential Revision: https://reviews.llvm.org/D95946
2021-02-04 09:56:36 +01:00
Amara Emerson 1a13ee1efb [GlobalISel] Add sext(constant) -> constant artifact combine.
This is the G_SEXT counterpart to the existing G_ZEXT/G_ANYEXT combines.

Differential Revision: https://reviews.llvm.org/D95729
2021-02-03 14:10:08 -08:00
Matt Arsenault 39fbb5c3e3 RegisterCoalescer: Fix not setting undef on coalesced subregister uses
This was only adding undef to the use if the copy itself had a
subregister index. It did not consider the subrange liveness if the
use had a subreg index to begin with.
2021-02-03 13:54:43 -05:00
Matt Arsenault d886da042c RegisterCoalescer: Prune undef subranges from copy pairs in loops
If we had a pair of copies inside a loop which introduced new liveness
to a subregister which was undef before the loop, we would have a
dummy phi-only segment remaining across the loop body. Later, this
false segment would confuse RenameIndependentSubregs, causing it to
introduce IMPLICIT_DEFs with broken value numbering.

It seems that always adding the lanes to ShrinkMask is OK, so any
conditions should be purely a compile-time filter.
2021-02-03 13:42:53 -05:00
Matt Arsenault 477e3fe4f8 Revert "AMDGPU: Don't consider global pressure when bundling soft clauses"
This reverts commit 1e377a273f.

A regression was reported.
2021-02-03 13:25:05 -05:00
Stanislav Mekhanoshin 6038d68baf [AMDGPU] Added -mcpu to a couple more tests. NFC. 2021-02-03 10:20:18 -08:00
Juneyoung Lee 06829034ca Revert "[ConstantFold] Fold more operations to poison"
This reverts commit 53040a968d due to its
bad interaction with select i1 -> and/or i1 transformation.

This fixes:
https://bugs.llvm.org/show_bug.cgi?id=49005
https://bugs.llvm.org/show_bug.cgi?id=48435
2021-02-04 00:24:02 +09:00
Matt Arsenault 9719f17011 AMDGPU: Move handling of allocation of fixed ABI inputs
For the fixed ABI, set this in the initial argument constructor,
rather than relying on the allocation logic to set the values. Also
stop passing them for amdgpu_gfx, since the DAG path seems to skip
these. I'm unclear on what amdgpu_gfx's expectations are.  This will
allow moving the special input registers out of the normal argument
range.
2021-02-03 09:27:59 -05:00
Sebastian Neubauer d49efdc969 Revert "[AMDGPU] Add a new Clamp Pattern to the GlobalISel Path."
This reverts commits 62af0305b7cc..677a3529d3e6 from D93708.
They cause failures in the sanitizer builds because of uninitialized
values.

A fix is in D95878, but it might take some time until this is pushed,
so reverting the changes for now.
2021-02-03 11:03:34 +01:00
Matt Arsenault af2cbe8eff AMDGPU: Fix adding extra operands for i128 asm constraints
We don't register i128 as a legal type with addRegisterClass, but it
appears in the list of legal register types. This inconsistency
resulted in the asm constraint lowering trying to use 2 128-bit
registers for these operands. This would leave behind a dead def that
would waste registers.

Regresses GlobalISel tests for i128 load/store, but these aren't very
important right now. Ideally these would not depend on the list of
register types.
2021-02-02 19:01:04 -05:00
Matt Arsenault 1e377a273f AMDGPU: Don't consider global pressure when bundling soft clauses
This should only consider whether the pressure impact of the bundle at
the given point in the program will decrease the occupancy. High VGPR
pressure was incorrectly blocking the formation of scalar bundles, and
vice versa. This was also blocking bundling from high pressure
situations at other points in the program.
2021-02-02 19:00:14 -05:00
Sebastian Neubauer 8b898b19a8 [AMDGPU] Remove unused tmp register
The temporary register is only used to compute the frame pointer.
The frame pointer is overwritten and not used in between, so we
can reuse the frame pointer for the computation, saving one register.

Differential Revision: https://reviews.llvm.org/D95865
2021-02-02 17:17:54 +01:00
Sebastian Neubauer 6b6ae583cf [AMDGPU] Save fp/bp after csr saves
Saving callee-save registers happens in whole wave mode. Exec is saved
to a free register, which can be reused to save the frame pointer.
Therefore, saving the fp needs to happen after saving csrs.

Differential Revision: https://reviews.llvm.org/D95861
2021-02-02 17:17:54 +01:00
Sebastian Neubauer b91afa474e [AMDGPU] Mark epilog restores as frame-destroy
I guess these instructions were marked as frame-setup by accident; they are
restores that are part of the epilog.

Differential Revision: https://reviews.llvm.org/D95783
2021-02-02 10:24:37 +01:00
Thomas Symalla fa3e840d3d Removed the generic virtual register creations. Reworked the tests. 2021-02-02 09:14:54 +01:00
Thomas Symalla 6604d81e1b Added and used new target pseudo for v_cvt_pk_i16_i32, changes due to code review. 2021-02-02 09:14:53 +01:00
Thomas Symalla 79e729bdf1 Fixed tests. 2021-02-02 09:14:53 +01:00
Thomas Symalla 3a46502264 Move step to PreLegalizer 2021-02-02 09:14:53 +01:00
Thomas Symalla cdfd9b3bf5 Move Combiner to PreLegalize step 2021-02-02 09:14:53 +01:00
Thomas Symalla f2ef2fbc69 Renamed identifiers in lit 2021-02-02 09:14:53 +01:00
Thomas Symalla dae85e4671 Fixed the lit tests and a bug in the implementation. 2021-02-02 09:14:52 +01:00
Thomas Symalla d41b7fa9bf Renames 2021-02-02 09:14:52 +01:00
Thomas Symalla 62af0305b7 Added clamp i64 to i16 global isel pattern. 2021-02-02 09:14:52 +01:00
Matt Arsenault 41877b82f0 AMDGPU: Fix dbg_value handling when forming soft clause bundles
DBG_VALUES placed between memory instructions would change
codegen. Skip over these and re-insert them after the bundle instead
of giving up on bundling.
2021-02-01 22:16:35 -05:00
Austin Kerbow 0397dca021 [AMDGPU] Fix crash with sgpr spills to vgpr disabled
This would assert with amdgpu-spill-sgpr-to-vgpr disabled when trying to
spill the FP.

Fixes: SWDEV-262704

Reviewed By: RamNalamothu

Differential Revision: https://reviews.llvm.org/D95768
2021-02-01 08:35:25 -08:00
Matt Arsenault 1801e2aa24 RegAlloc: Fix assert if all registers in class reserved
With a context instruction, this would produce a context
error. However, it would continue on and do an out of bounds access of
the empty allocation order array.
2021-01-31 11:10:04 -05:00
Roman Lebedev a78d8feb48
[LowerConstantIntrinsics] Preserve Dominator Tree, if avaliable 2021-01-30 01:14:50 +03:00
Carl Ritson 0824694d68 [AMDGPU] Fix WWM Entry SCC preservation
SCC was not correctly preserved when entering WWM.
The current lit test was unable to detect this, as the entry block is
handled differently.
Additionally, fix an issue where SCC was unnecessarily preserved
when exiting from WWM to Exact mode.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D95500
2021-01-29 10:05:36 +09:00
Carl Ritson 0e8f50595e [AMDGPU] Mark V_SET_INACTIVE as defining SCC
V_SET_INACTIVE is implemented with S_NOT, which clobbers SCC.
Make sure it is marked appropriately.

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D95509
2021-01-29 09:46:41 +09:00
Cassie Jones f22f4557a7 [GlobalISel] Implement widenScalar for carry-in add/sub
These are widened to a wider UADDE/USUBE, with the overflow value
unused, and with the same synthesis of a new overflow value as for the
O operations.

Reviewed By: paquette

Differential Revision: https://reviews.llvm.org/D95326
2021-01-28 17:06:24 -05:00
Jay Foad 39ef0965df [AMDGPU] Simplify some RUN lines. NFC. 2021-01-28 17:57:55 +00:00
Mirko Brkusanin 3c979ae9ec [AMDGPU][GlobalISel] Remove redundant cmp when copying constant to vcc
Differential Revision: https://reviews.llvm.org/D95540
2021-01-28 11:20:09 +01:00
Mirko Brkusanin 4b422708ba [AMDGPU][GlobalISel] Handle G_PTR_ADD when looking for constant offset
Look throught G_PTRTOINT and G_PTR_ADD nodes when looking for constant
offset for buffer stores. This also helps with merging of these instructions
later on.

Differential Revision: https://reviews.llvm.org/D95242
2021-01-28 11:20:09 +01:00
Piotr Sobczak fc8e741121 [AMDGPU] Avoid an illegal operand in si-shrink-instructions
Before the patch it was possible to trigger a constant bus
violation when folding immediates into a shrunk instruction.

The patch adds a check to enforce the legality of the new operand.

Differential Revision: https://reviews.llvm.org/D95527
2021-01-28 08:49:21 +01:00
Carl Ritson 2b9ed4fca6 [AMDGPU][NFC] Pre-commit test for D95509 2021-01-28 12:37:58 +09:00
Carl Ritson 8d8be87979 [AMDGPU][NFC] Generate llvm.amdgcn.set.inactive tests
This is a pre-commit for D95509.
2021-01-28 11:43:36 +09:00
Stanislav Mekhanoshin d91ee2f782 [AMDGPU] Do not reassign spilled registers
We cannot call LRM::unassign() if LRM::assign() was never called
before; these are symmetrical calls. There are two ways of
assigning a physical register to a virtual one: via LRM::assign() and
via VRM::assignVirt2Phys(). LRM::assign() will call the VRM to
assign the register and then update the LiveIntervalUnion. The inline
spiller calls the VRM directly, and thus the LiveIntervalUnion never gets
updated. A call to LRM::unassign() then asserts about inconsistent
liveness.

Note that not all callers of the InlineSpiller even
have an LRM to pass; RegAllocPBQP does not have one, so we cannot
always pass an LRM into the spiller.

The only way to get into that spiller LRE_DidCloneVirtReg() call
is from LiveRangeEdit::eliminateDeadDefs if we split an LI.

This patch refuses to reassign a LiveInterval created by a split
to work around the problem. In fact, we cannot reassign a spill
anyway, as all registers of the needed class are occupied and we
are spilling.

Fixes: SWDEV-267996

Differential Revision: https://reviews.llvm.org/D95489
2021-01-27 16:29:05 -08:00
Fangrui Song 4d28f0a6a4 [llc] Add reportError helper and canonicalize error messages 2021-01-26 15:33:37 -08:00
Jessica Paquette f36007e811 [GlobalISel] Implement computeKnownBits for G_SEXT_INREG
Just use the existing `Known.sextInReg` implementation.

- Update KnownBitsTest.cpp.
- Update combine-redundant-and.mir for a more concrete example.

Differential Revision: https://reviews.llvm.org/D95484
2021-01-26 15:01:38 -08:00
Austin Kerbow 2291bd137d [AMDGPU] Update subtarget features for new target ID support
Support for XNACK and SRAMECC is not static on some GPUs. We must be able
to differentiate between different scenarios for these dynamic subtarget
features.

The possible settings are:

- Unsupported: The GPU has no support for XNACK/SRAMECC.
- Any: Preference is unspecified. Use conservative settings that can run anywhere.
- Off: Request support for XNACK/SRAMECC Off
- On: Request support for XNACK/SRAMECC On

GCNSubtarget will track the four options based on the following criteria. If
the subtarget does not support XNACK/SRAMECC we say the setting is
"Unsupported". If no subtarget features for XNACK/SRAMECC are requested we
must support "Any" mode. If the subtarget features XNACK/SRAMECC exist in the
feature string when initializing the subtarget, the settings are "On/Off".

The defaults are updated to be conservatively correct, meaning if no setting
for XNACK or SRAMECC is explicitly requested, defaults will be used which
generate code that can be run anywhere. This corresponds to the "Any" setting.

Differential Revision: https://reviews.llvm.org/D85882
2021-01-26 11:25:51 -08:00
Matt Arsenault 5f9707b796 AMDGPU: Fix redundant FP spilling/assert in some functions
If a function has stack objects, and a call, we require an FP. If we
did not initially have any stack objects, and only introduced them
during PrologEpilogInserter for CSR VGPR spills, SILowerSGPRSpills
would end up spilling the FP register as if it were a normal
register. This would result in an assert in a debug build, or
redundant handling of the FP register in a release build.

Try to predict that we will have an FP later, although this is ugly.
2021-01-26 13:01:45 -05:00
Mitch Phillips c9466ede7e Revert "Revert "[GlobalISel] LegalizerHelper - Extract widenScalarAddoSubo method""
This reverts commit 554b3211fe.

Differential Revision: https://reviews.llvm.org/D95035
2021-01-25 16:22:22 -08:00
Stanislav Mekhanoshin eace81c48f [AMDGPU] Added -mcpu=tahiti to 3 tests. NFC. 2021-01-25 15:50:59 -08:00
Konstantin Zhuravlyov 2cdb34efda Revert "[IndirectFunctions] Skip propagating attributes to address taken functions"
This reverts commit dd8ae42674.

This commit causes infinite loop when compiling rocThrust and hipCUB.

Differential Revision: https://reviews.llvm.org/D95389
2021-01-25 15:58:06 -05:00
Carl Ritson a80ebd0179 [AMDGPU] Fix llvm.amdgcn.init.exec and frame materialization
Frame-base materialization may insert vector instructions before EXEC is initialised.
Fix this by moving the lowering of llvm.amdgcn.init.exec later in the backend.
Also remove the SI_INIT_EXEC_LO pseudo as it is not necessary.

Reviewed By: ruiling

Differential Revision: https://reviews.llvm.org/D94645
2021-01-25 08:31:17 +09:00
Roger Ferrer Ibanez d4ce062340 [RISCV][PrologEpilogInserter] "Float" emergency spill slots to avoid making them immediately unreachable from the stack pointer
In RISC-V there is a single addressing mode of the form imm(reg), where
imm is a signed 12-bit integer with a range of [-2048..2047] bytes
from reg.

The test MultiSource/UnitTests/C++11/frame_layout of the LLVM test-suite
exercises several scenarios with the stack, including function calls
where the stack will need to be realigned due to a local variable having
a large alignment of 4096 bytes.

In situations of large stacks, the RISC-V backend (in
RISCVFrameLowering) reserves an extra emergency spill slot which can be
used (if no free register is found) by the register scavenger after the
frame indexes have been eliminated. PrologEpilogInserter already takes
care of keeping the emergency spill slots as close as possible to the
stack pointer or frame pointer (depending on what the function will
use). However there is a final alignment step to honour the maximum
alignment of the stack that, when using the stack pointer to access the
emergency spill slots, has the side effect of setting them farther from
the stack pointer.

In the case of the frame_layout testcase, the net result is that we do
have an emergency spill slot but it is so far from the stack pointer
(more than 2048 bytes due to the extra alignment of a variable to 4096
bytes) that it becomes unreachable via any immediate offset.

During elimination of the frame index, many (regular) offsets of the
stack may be immediately unreachable already. Their address needs to be
computed using a register. A virtual register is created and later
RegisterScavenger should be able to find an unused (physical) register.
However if no register is available, RegisterScavenger will pick a
physical register and spill it onto an emergency stack slot, while we
compute the offset (restoring the chosen register after all this). This
assumes that the emergency stack slot is easily reachable (that is,
without requiring another register!).

This is the assumption we seem to break when we perform the extra
alignment in PrologEpilogInserter.
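
For concreteness, the reachability constraint can be written as a small
check (illustrative only; no such helper exists in the tree):

```
#include <cstdint>

// imm(reg) addressing allows a signed 12-bit byte offset.
bool fitsInSImm12(int64_t OffsetFromSP) {
  return OffsetFromSP >= -2048 && OffsetFromSP <= 2047;
}
// A spill slot pushed past SP+2047 by 4096-byte realignment padding fails
// this check, so the scavenger cannot reach it without another register.
```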

We can "float" the emergency spill slots by increasing (in absolute
value) their offsets from the incoming stack pointer. This way the
emergency spill slots will remain close to the stack pointer (once the
function has allocated storage for the stack, including the needed
realignment). The new size computed in PrologEpilogInserter is padding
so it should be OK to move the emergency spill slots there. Also because
we're increasing the alignment, the new location should stay aligned for
the purpose of the emergency spill slots.

Note that this change also impacts other backends as shown by the tests.
Changes are minor adjustments to the emergency stack slot offset.

Differential Revision: https://reviews.llvm.org/D89239
2021-01-23 09:10:03 +00:00
Stanislav Mekhanoshin ca904b81e6 [AMDGPU] Fix FP materialization/resolve with flat scratch
Differential Revision: https://reviews.llvm.org/D95266
2021-01-22 16:06:47 -08:00
Mitch Phillips 554b3211fe Revert "[GlobalISel] LegalizerHelper - Extract widenScalarAddoSubo method"
This reverts commit 2bb92bf451.

Dependent patch broke UBSan on Android:
3dedad475d
2021-01-22 14:32:11 -08:00
Cassie Jones 2bb92bf451 [GlobalISel] LegalizerHelper - Extract widenScalarAddoSubo method
The widenScalar implementation for signed and unsigned overflowing
operations were very similar: both are checked by truncating the result
and then re-sign/zero-extending it and checking that it matches the
computed operation.

Using a truncate + zero-extend for the unsigned case instead of manually
producing the AND instruction like before leads to an extra copy
instruction during legalization, but this should be harmless.
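
A rough sketch of the shared check after widening, using assumed local
names; the unsigned case uses G_ZEXT and the signed case G_SEXT:

```
auto WideSum = MIRBuilder.buildAdd(WideTy, WideLHS, WideRHS);
auto Trunc = MIRBuilder.buildTrunc(NarrowTy, WideSum);
auto Ext = MIRBuilder.buildZExt(WideTy, Trunc); // or buildSExt for signed
// Overflow occurred iff the value does not survive the round trip.
MIRBuilder.buildICmp(CmpInst::ICMP_NE, OverflowTy, WideSum, Ext);
```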

Differential Revision: https://reviews.llvm.org/D95035
2021-01-22 14:08:46 -08:00
Sebastian Neubauer 8214982b50 [AMDGPU] Implement mir parseCustomPseudoSourceValue
Allow parsing generated mir with custom pseudo source value tokens.
Also rename pseudo source values to have more meaningful names.

Relands ba7dcd8542, which had memory leaks.

Differential Revision: https://reviews.llvm.org/D95215
2021-01-22 11:24:08 +01:00
Christudasan Devadasan ff8a1cae18 [AMDGPU] Fix the inconsistency in soffset for MUBUF stack accesses.
During instruction selection, there is an inconsistency in choosing
the initial soffset value. Certain early passes modify this value,
which required additional fixups during eliminateFrameIndex to make
all cases work. This whole transformation looks trivial and can be
handled better.

This patch clearly defines the initial value for soffset and keeps it
unchanged before eliminateFrameIndex. The initial value must be zero
for MUBUF with a frame index. The non-frame index MUBUF forms that
use a raw offset from SP will have the stack register for soffset.
During frame elimination, the soffset remains zero for entry functions
with zero dynamic allocas and no callsites, or else is updated to the
appropriate frame/stack register.

Also did some code cleanup and made all asserts around soffset
stricter to match.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D95071
2021-01-22 14:20:59 +05:30
Christudasan Devadasan c971bcd210 [AMDGPU] Test clean up (NFC) 2021-01-22 13:38:52 +05:30
Arthur Eubanks a11bf9a7fb [AMDGPU][Inliner] Remove amdgpu-inline and add a new TTI inline hook
Having a custom inliner doesn't really fit in with the new PM's
pipeline. It's also extra technical debt.

amdgpu-inline only does a couple of custom things compared to the normal
inliner:
1) It disables inlining if the number of BBs in a function would exceed
   some limit
2) It increases the threshold if there are pointers to private arrays(?)

These can all be handled as TTI inliner hooks.
There already exists a hook for backends to multiply the inlining
threshold.

This way we can remove the custom amdgpu-inline pass.
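
As a sketch, the threshold bump becomes a one-line hook override instead of
a whole pass (the value here is illustrative, not the actual AMDGPU tuning):

```
// TargetTransformInfo already exposes this multiplier to the inliner.
unsigned GCNTTIImpl::getInliningThresholdMultiplier() { return 9; }
```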

This caused inline-hint.ll to fail. After some investigation, it looks
like getInliningThresholdMultiplier() was previously getting applied
twice in amdgpu-inline (https://reviews.llvm.org/D62707 fixed it not
applying at all, so some later inliner change must have fixed
something), so I had to change the threshold in the test.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D94153
2021-01-21 20:29:17 -08:00
RamNalamothu b6c3a59c3f [AMDGPU] Test case demonstrating issues with generation of .debug_frame
This test case demonstrates that Call Frame Information generation is
entirely tied to whether exceptions are enabled. Currently LLVM does not
generate CFI for debugging purposes, i.e. a .debug_frame section, unless
exceptions are enabled, even if --force-dwarf-frame-section is passed.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D94801
2021-01-22 07:39:06 +05:30
Nikita Popov 65fd034b95 [FunctionAttrs] Infer willreturn for functions without loops
If a function doesn't contain loops and does not call non-willreturn
functions, then it is willreturn. Loops are detected by checking
for backedges in the function. We don't attempt to handle finite
loops at this point.
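
A minimal sketch of the detection idea, using the existing CFG helper and
omitting the surrounding attribute-inference logic:

```
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/CFG.h"
#include "llvm/IR/Function.h"
using namespace llvm;

// A function with no backedges and only willreturn callees can itself
// be inferred willreturn.
static bool hasNoBackedges(const Function &F) {
  SmallVector<std::pair<const BasicBlock *, const BasicBlock *>, 8> Backedges;
  FindFunctionBackedges(F, Backedges);
  return Backedges.empty();
}
```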

Differential Revision: https://reviews.llvm.org/D94633
2021-01-21 20:29:33 +01:00
Sebastian Neubauer 4dbdff66fe Revert "[AMDGPU] Implement mir parseCustomPseudoSourceValue"
This reverts commit ba7dcd8542.

(caused memory leaks)
2021-01-21 18:11:48 +01:00
Jay Foad c0b3c5a064 [AMDGPU][GlobalISel] Run SIAddImgInit
This pass is required to get correct codegen for image instructions with
the tfe or lwe bits set.

Differential Revision: https://reviews.llvm.org/D95132
2021-01-21 15:54:54 +00:00
Matt Arsenault 94375d1083 AMDGPU: Remove v_rsq_f64 patterns
This isn't accurate enough without correction.
2021-01-21 10:51:36 -05:00
Matt Arsenault 2a0db8d70e AMDGPU: Use more accurate fast f64 fdiv
A raw v_rcp_f64 isn't accurate enough, so start applying correction.
2021-01-21 10:51:36 -05:00
Sebastian Neubauer ba7dcd8542 [AMDGPU] Implement mir parseCustomPseudoSourceValue
Allow parsing generated mir with custom pseudo source value tokens.
Also rename pseudo source values to have more meaningful names.

Differential Revision: https://reviews.llvm.org/D94768
2021-01-21 16:32:17 +01:00
Simon Pilgrim 69bc0990a9 [DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE (REAPPLIED).
Add DemandedElts support inside the TRUNCATE analysis.

REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d

Differential Revision: https://reviews.llvm.org/D56387
2021-01-21 13:01:34 +00:00
madhur13490 dd8ae42674 [IndirectFunctions] Skip propagating attributes to address taken functions
In case of indirect calls or address taken functions,
skip propagating any attributes to them. We just
propagate features to such functions.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D94585
2021-01-21 07:04:28 +00:00
Hans Wennborg a51226057f Revert "[DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE"
It caused "Vector shift amounts must be in the same as their first arg"
asserts in Chromium builds. See the code review for repro instructions.

> Add DemandedElts support inside the TRUNCATE analysis.
>
> Differential Revision: https://reviews.llvm.org/D56387

This reverts commit cad4275d69.
2021-01-20 20:06:55 +01:00
Simon Pilgrim cad4275d69 [DAGCombiner] Enable SimplifyDemandedBits vector support for TRUNCATE
Add DemandedElts support inside the TRUNCATE analysis.

Differential Revision: https://reviews.llvm.org/D56387
2021-01-20 15:39:58 +00:00
Mirko Brkusanin a6a72dfdf2 [AMDGPU][GlobalISel] Avoid selecting S_PACK with constants
If constants are hidden behind G_ANYEXT we can treat them the same way as
G_SEXT. For that purpose we extend getConstantVRegValWithLookThrough with an
option to handle G_ANYEXT the same way as G_SEXT.
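
A hedged sketch of the intended use; the exact parameter list is an
assumption, but the idea is an extra flag that lets the constant lookup
walk through G_ANYEXT:

```
if (auto Val = getConstantVRegValWithLookThrough(
        Reg, MRI, /*LookThroughInstrs=*/true, /*HandleFConstants=*/true,
        /*LookThroughAnyExt=*/true))
  Cst = Val->Value; // the constant seen through the extension
```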

Differential Revision: https://reviews.llvm.org/D92219
2021-01-20 11:54:53 +01:00
Jay Foad 0808c7009a [AMDGPU] Fix test case for D94010 2021-01-19 16:46:47 +00:00
Jay Foad de2f942399 [AMDGPU] Simplify test case for D94010 2021-01-19 16:36:43 +00:00
Simon Pilgrim 207f32948b [DAG] SimplifyDemandedBits - use KnownBits comparisons to remove ISD::UMIN/UMAX ops
Use the KnownBits icmp comparisons to determine when a ISD::UMIN/UMAX op is unnecessary should either op be known to be ULT/ULE or UGT/UGE than the other.
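
A sketch of the idea (not the exact patch): when the operands' known bits
already decide the unsigned comparison, the min/max picks a fixed operand.
Known0/Known1 are assumed locals holding the operands' KnownBits:

```
if (Op.getOpcode() == ISD::UMIN)
  if (Optional<bool> ULE = KnownBits::ule(Known0, Known1))
    // Known Op0 <= Op1 means umin is Op0; known false means it is Op1.
    return TLO.CombineTo(Op, Op.getOperand(*ULE ? 0 : 1));
```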

Differential Revision: https://reviews.llvm.org/D94532
2021-01-18 10:29:23 +00:00
Carl Ritson 790c75c163 [AMDGPU] Add SI_EARLY_TERMINATE_SCC0 for early terminating shader
Add pseudo instruction to allow early termination of pixel shader
anywhere based on the value of SCC.  The intention is to use this
when a mask of live lanes is updated, e.g. live lanes in WQM pass.
This facilitates early termination of shaders even when EXEC is
incomplete, e.g. in non-uniform control flow.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D88777
2021-01-13 13:29:05 +09:00
Joe Nash 314e29ed2b [AMDGPU] Add _e64 suffix to VOP3 Insts
Previously, instructions which could be expressed as VOP3 in addition
to another encoding had a _e64 suffix on the tablegen record name,
while those only available as VOP3 did not. With this patch, all VOP3s
will have the _e64 suffix. The assembly does not change, only the MIR.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D94341

Change-Id: Ia8ec8890d47f8f94bbbdac43745b4e9dd2b03423
2021-01-12 18:33:18 -05:00
Matt Arsenault 3d39709159 AMDGPU: Remove wrapper only call limitation
This seems to only have overridden cold handling, which we probably
shouldn't do. As far as I can tell the wrapper library functions are
still inlined as appropriate.
2021-01-12 17:12:49 -05:00
Craig Topper 03c8d6a0c4 [LegalizeDAG][RISCV][PowerPC][AMDGPU][WebAssembly] Improve expansion of SETONE/SETUEQ on targets without SETO/SETUO.
If SETO/SETUO aren't legal, they'll be expanded and we'll end up
with 3 comparisons.

SETONE is equivalent to (SETOGT || SETOLT), so if one of those
operations is supported, use that expansion. We don't need both since
we can commute the operands to make the other.

SETUEQ can be implemented with !(SETOGT || SETOLT) or (SETULE && SETUGE).
I've only implemented the first because it didn't look like most of the
affected targets had legal SETULE/SETUGE.
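
A sketch of the SETONE expansion described above, assuming the usual
LegalizeDAG locals (DAG, dl, LHS, RHS, VT):

```
SDValue OGT = DAG.getSetCC(dl, VT, LHS, RHS, ISD::SETOGT);
SDValue OLT = DAG.getSetCC(dl, VT, LHS, RHS, ISD::SETOLT);
// SETONE: ordered and not equal.
SDValue One = DAG.getNode(ISD::OR, dl, VT, OGT, OLT);
// SETUEQ is then the logical negation of the same OR.
```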

Reviewed By: frasercrmck, tlively, nemanjai

Differential Revision: https://reviews.llvm.org/D94450
2021-01-12 10:45:03 -08:00
Simon Pilgrim a4931d4fe3 [AMDGPU] Regenerate umax crash test 2021-01-12 18:02:15 +00:00
Jay Foad 794e3d94d5 [AMDGPU][GlobalISel] Remove some duplicate RUN lines
Differential Revision: https://reviews.llvm.org/D86618
2021-01-12 11:02:16 +00:00
Sebastian Neubauer 6a195491b6 [AMDGPU] Fix failing assert with scratch ST mode
In ST mode, flat scratch instructions have neither an sgpr nor a vgpr
for the address. This led to an assertion when inserting hard clauses.

Differential Revision: https://reviews.llvm.org/D94406
2021-01-12 09:54:02 +01:00
Craig Topper b1c304c494 [CodeGen] Try to make the print of memory operand alignment a little more user friendly.
Memory operands store a base alignment that does not factor in
the effect of the offset on the alignment.

Previously the printing code only printed the base alignment if
it was different than the size. If there is an offset, the reader
would need to figure out the effective alignment themselves. This
has confused me before and someone else was recently confused on
IRC.

This patch prints the possibly offset-adjusted alignment if it is
different from the size, and prints the base alignment if it is
different from the adjusted alignment. The MIR parser has been updated to
read basealign in addition to align.
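
A sketch of the relationship between the two printed values, using the
existing MachineMemOperand and Alignment helpers:

```
Align Base = MMO->getBaseAlign();
// The offset-adjusted value is what getAlign() reports and what the
// printer emits as "align"; "basealign" is printed when Base differs.
Align Effective = commonAlignment(Base, MMO->getOffset());
assert(Effective == MMO->getAlign());
```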

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94344
2021-01-11 19:58:47 -08:00
Mircea Trofin 05e90cefeb [NFC] Disallow unused prefixes under llvm/test/CodeGen
This patch finishes addressing unused prefixes under CodeGen: 2
remaining tests fixed, and then undo-ing the lit.local.cfg changes under
various subdirs and moving the policy under CodeGen.

Differential Revision: https://reviews.llvm.org/D94430
2021-01-11 12:32:18 -08:00
Jay Foad 6dcf9207df [AMDGPU] Fix a urem combine test to test what it was supposed to 2021-01-11 13:32:34 +00:00
QingShan Zhang 7539c75bb4 [DAGCombine] Remove the check for unsafe-fp-math when we are checking the AFN
We are checking unsafe-fp-math for sqrt but not for fpow, which is inconsistent.
As the direction is to remove this global option, we need to remove the unsafe-fp-math
check for sqrt and update the test with afn fast-math flags.

Reviewed By: Spatel

Differential Revision: https://reviews.llvm.org/D93891
2021-01-11 02:25:53 +00:00
Tony 2f499b9aff [AMDGPU] Add volatile support to SIMemoryLegalizer
Treat a non-atomic volatile load and store as a relaxed atomic at
system scope for the address spaces accessed. This will ensure all
relevant caches will be bypassed.

A volatile atomic is not changed and still only bypasses caches up to
the level specified by the SyncScope operand.

Differential Revision: https://reviews.llvm.org/D94214
2021-01-09 00:52:33 +00:00
Mircea Trofin a8bda3df42 [NFC] Disallow unused prefixes in CodeGen/AMDGPU
This adds the lit config, and cleans up remaining tests.

Differential Revision: https://reviews.llvm.org/D94245
2021-01-08 11:49:23 -08:00
Christudasan Devadasan ae25a397e9 AMDGPU/GlobalISel: Enable sret demotion 2021-01-08 10:56:35 +05:30
Matt Arsenault 2cbbc6e87c GlobalISel: Fail legalization on narrowing extload below memory size 2021-01-07 17:40:34 -05:00
Matt Arsenault 1f9b6ef91f GlobalISel: Add combine for G_UREM by power of 2
Really I want this in the legalizer, but this is a start.
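
A sketch of the replacement the combine emits, with assumed local names:
x urem 2^k becomes x & (2^k - 1):

```
LLT Ty = MRI.getType(Dst);
auto Mask = MIRBuilder.buildConstant(Ty, DivisorValue - 1); // 2^k - 1
MIRBuilder.buildAnd(Dst, X, Mask);
MI.eraseFromParent();
```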
2021-01-07 16:36:35 -05:00
Mircea Trofin ee57d30f44 [NFC] Removed unused prefixes from CodeGen/AMDGPU
Last bulk batch.

Differential Revision: https://reviews.llvm.org/D94236
2021-01-07 09:48:14 -08:00
Mircea Trofin e881a25f1e [NFC] Removed unused prefixes in CodeGen/AMDGPU
This covers tests starting with s.

Differential Revision: https://reviews.llvm.org/D94184
2021-01-07 08:00:11 -08:00
Matt Arsenault 6b7d5a928f AMDGPU/GlobalISel: Start cleaning up calling convention lowering
There are various hacks working around limitations in
handleAssignments, and the logical split between different parts isn't
correct. Start separating the type legalization needed to satisfy the
DAG infrastructure from the code required to split into register
types. The type splitting should be moved to generic code.
2021-01-07 10:36:45 -05:00
Arthur Eubanks a515342de9 [test] Pin AMDGPU/opt-pipeline.ll to legacy PM
The pipeline being tested is specifically the legacy PM pipeline.
2021-01-06 11:44:16 -08:00
Mircea Trofin 90347ab96f [NFC] Removed unused prefixes in CodeGen/AMDGPU
This covers tests starting with m-r.

Differential Revision: https://reviews.llvm.org/D94181
2021-01-06 10:32:44 -08:00
Mircea Trofin b470630913 [NFC] Removed unused prefixes from CodeGen/AMDGPU
All the 'l'-starting tests.

Differential Revision: https://reviews.llvm.org/D94151
2021-01-06 09:34:11 -08:00
Matt Arsenault ab3a3f543b AMDGPU/GlobalISel: Update fdiv lowering for denormal/ulp interaction
Change the GlobalISel fast fdiv handling to match the changes in
2531535984 and
884acbb9e1
2021-01-06 12:32:01 -05:00
Matt Arsenault 0a3cf7f476 AMDGPU/GlobalISel: Add baseline IR tests for fdiv
The fdiv lowering is currently split between an IR pass and codegen,
so make sure this works end to end. We also currently differ from the
DAG on some edge cases, which this will show in a future change.
2021-01-06 11:37:00 -05:00
Matt Arsenault 136f498919 AMDGPU: Explicitly use SelectionDAG in legacy intrinsic tests
GlobalISel will probably not support the legacy buffer intrinsics, so
don't fail when the default is switched.
2021-01-06 11:37:00 -05:00
Mircea Trofin c1cd42d698 [NFC] Removed unused prefixes in CodeGen/AMDGPU
This covers the tests starting with h-k.

Differential Revision: https://reviews.llvm.org/D94147
2021-01-05 20:22:40 -08:00
Mircea Trofin cdfd4c5c1a [NFC] Removed unused prefixes in test/CodeGen/AMDGPU
More patches to follow. This covers the pertinent tests starting with e,
f, and g.

Differential Revision: https://reviews.llvm.org/D94124
2021-01-05 19:18:30 -08:00
Changpeng Fang cb5b52a06e AMDGPU: Annotate amdgpu.noclobber for global loads only
Summary:
  This is to avoid unnecessary analysis since amdgpu.noclobber is only used for globals.

Reviewers:
  arsenm

Fixes:
   SWDEV-239161

Differential Revision:
  https://reviews.llvm.org/D94107
2021-01-05 14:47:19 -08:00
Mircea Trofin 1ebe86adf5 [NFC] Removed unused prefixes in test/CodeGen/AMDGPU
More patches to follow.

Differential Revision: https://reviews.llvm.org/D94121
2021-01-05 14:16:52 -08:00
Mircea Trofin bec987ea67 [NFC] Removed unused prefixes in CodeGen/AMDGPU
This is part of the pertinent tests, more to follow in subsequent
patches.

Differential Revision: https://reviews.llvm.org/D94114
2021-01-05 14:10:03 -08:00
Mircea Trofin a9543469d5 [NFC] Removed unused prefixes in CodeGen/AMDGPU/GlobalISel
Differential Revision: https://reviews.llvm.org/D94099
2021-01-05 12:57:17 -08:00
Jay Foad 3914bebe91 [AMDGPU] Handle v_fmac_legacy_f32 in SIFoldOperands
Convert it to v_fma_legacy_f32 if it is profitable to do so, just like
other mac instructions that are converted to their mad equivalents.

Differential Revision: https://reviews.llvm.org/D94010
2021-01-05 11:55:33 +00:00
Jay Foad 639a50e2f1 [AMDGPU] Precommit test case for D94010 2021-01-05 11:55:14 +00:00
Arthur Eubanks 8e293fe6ad [NewPM][AMDGPU] Pass TargetMachine to AMDGPUSimplifyLibCallsPass
Missed in https://reviews.llvm.org/D93863.
2021-01-04 13:48:09 -08:00
Cameron McInally 92be640bd7 [FPEnv][AMDGPU] Disable FSUB(-0,X)->FNEG(X) DAGCombine when subnormals are flushed
This patch disables the FSUB(-0,X)->FNEG(X) DAG combine when we're flushing subnormals. It requires updating the existing AMDGPU tests to use the fneg IR instruction, in place of the old fsub(-0,X) canonical form, since AMDGPU is the only backend currently checking the DenormalMode flags.

Note that this will require follow-up optimizations to make sure the FSUB(-0,X) form is handled appropriately.
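
A sketch of the guard (assuming DAGCombiner locals; getDenormalMode and
DenormalMode::getIEEE are existing APIs, the surrounding code is not):

```
// Only fold fsub -0.0, X into fneg when subnormals are preserved;
// flushing would make the two differ for denormal X.
if (DAG.getDenormalMode(VT) == DenormalMode::getIEEE())
  return DAG.getNode(ISD::FNEG, DL, VT, N->getOperand(1));
```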

Differential Revision: https://reviews.llvm.org/D93243
2021-01-04 14:44:10 -06:00
Arthur Eubanks 191552344b [NewPM][AMDGPU] Make amdgpu-aa work with NewPM
An AMDGPUAA class already existed that was supposed to work with the new
PM, but it wasn't tested and was a bit broken.

Fix up the existing classes to have the right keys/parameters.
Wire up AMDGPUAA inside AMDGPUTargetMachine.

Add it to the list of alias analyses for the "default" AAManager since
in adjustPassManager() amdgpu-aa is added into the pipeline at the
beginning.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D93914
2021-01-04 12:36:27 -08:00
Arthur Eubanks 4e838ba9ea [NewPM][AMDGPU] Port amdgpu-always-inline
And add to AMDGPU opt pipeline.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94025
2021-01-04 12:27:01 -08:00
Arthur Eubanks fd323a897c [NewPM][AMDGPU] Port amdgpu-printf-runtime-binding
And add to AMDGPU opt pipeline.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94026
2021-01-04 12:25:50 -08:00
Arthur Eubanks e1833e7493 [NewPM][AMDGPU] Port amdgpu-unify-metadata
And add to AMDGPU opt pipeline.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94023
2021-01-04 11:57:46 -08:00
Arthur Eubanks a5f863e076 [NewPM][AMDGPU] Port amdgpu-propagate-attributes-early/late
And add to AMDGPU opt pipeline.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D94022
2021-01-04 11:53:37 -08:00
Arthur Eubanks b8f22f9d30 [NewPM][AMDGPU] Run InternalizePass when -amdgpu-internalize-symbols
The legacy PM doesn't run EP_ModuleOptimizerEarly on -O0, so skip
running it here when given O0.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D93886
2021-01-04 11:34:40 -08:00
Roman Lebedev 4b80647367 [AMDGPU][SimplifyCFG] Teach AMDGPUUnifyDivergentExitNodes to preserve {,Post}DomTree
This is a (last big?) part of the patch series to make SimplifyCFG
preserve DomTree. Currently, it still does not actually preserve it,
even though it is pretty much fully updated to preserve it.

Once the default is flipped, a valid DomTree must be passed into
simplifyCFG, which means that whatever pass calls simplifyCFG,
should also be smart about DomTree's.

As far as i can see from `check-llvm` with default flipped,
this is the last LLVM test batch (other than bugpoint tests)
that needed fixes to not break with default flipped.

The changes here are boringly identical to the ones i did
over 42+ times/commits recently already,
so while AMDGPU is outside of my normal ecosystem,
i'm going to go for post-commit review here,
like in all the other 42+ changes.

Note that while the pass is taught to preserve {,Post}DomTree,
it still doesn't do that by default, because simplifycfg
still doesn't do that by default, and flipping default
in this pass will implicitly flip the default for simplifycfg.
That will happen, but not right now.
2021-01-02 01:01:20 +03:00
Roman Lebedev b23b1bcc26 [NFC][CodeGen][Tests] Mark all tests that fail to preserve DomTree for SimplifyCFG as such
These tests start to fail when SimplifyCFG's default regarding DomTree
updating is switched on, so mark them as needing changes.
2021-01-02 01:01:19 +03:00
Juneyoung Lee 9b29610228 Use unary CreateShuffleVector if possible
As mentioned in D93793, there are quite a few places where unary `IRBuilder::CreateShuffleVector(X, Mask)` can be used
instead of `IRBuilder::CreateShuffleVector(X, Undef, Mask)`.
Let's update them.
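
The mechanical shape of the update; both overloads are existing IRBuilder
APIs, and X/Mask stand in for the per-call-site values:

```
// Before: explicit placeholder as the second vector operand.
Value *Old = Builder.CreateShuffleVector(
    X, UndefValue::get(X->getType()), Mask);
// After: the unary overload supplies the placeholder itself.
Value *New = Builder.CreateShuffleVector(X, Mask);
```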

Actually, it would have been more natural if the patches were made in this order:
(1) let them use unary CreateShuffleVector first
(2) update IRBuilder::CreateShuffleVector to use poison as a placeholder value (D93793)

The order is swapped, but in terms of correctness it is still fine.

Reviewed By: spatel

Differential Revision: https://reviews.llvm.org/D93923
2020-12-30 22:36:08 +09:00
Arthur Eubanks 7ecbe0c7a0 [NewPM][AMDGPU] Port amdgpu-lower-kernel-attributes
And add it to the AMDGPU opt pipeline.

This is a function pass instead of a module pass (like the legacy pass)
because it's getting added to a CGSCCPassManager, and you can't put a
module pass in a CGSCCPassManager.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D93885
2020-12-29 10:26:06 -08:00
Arthur Eubanks c2ef06d3dd [NewPM] Port infer-address-spaces
And add it to the AMDGPU opt pipeline.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D93880
2020-12-28 19:58:12 -08:00
Arthur Eubanks 0e9abcfc19 [AMDGPU][NewPM] Port amdgpu-promote-alloca(-to-vector)
And add to AMDGPU opt pipeline.

Don't pin an opt run to the legacy PM when -enable-new-pm=1 if these
passes (or passes introduced in https://reviews.llvm.org/D93863) are in
the list of passes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D93875
2020-12-28 17:52:31 -08:00
Arthur Eubanks 9abc457724 [NewPM][AMDGPU] Port amdgpu-simplifylib/amdgpu-usenative
And add them to the pipeline via
AMDGPUTargetMachine::registerPassBuilderCallbacks(), which mirrors
AMDGPUTargetMachine::adjustPassManager().

These passes can't be unconditionally added to PassRegistry.def since
they are only present when the AMDGPU backend is enabled. And there are
no target-specific headers in llvm/include, so parsing these pass names
must occur somewhere in the AMDGPU directory. I decided the best place
was inside the TargetMachine, since the PassBuilder invokes
TargetMachine::registerPassBuilderCallbacks() anyway. If we come up with
a cleaner solution for target-specific passes in the future that's fine,
but there aren't too many target-specific IR passes living in
target-specific directories so it shouldn't be too bad to change in the
future.

Reviewed By: ychen, arsenm

Differential Revision: https://reviews.llvm.org/D93863
2020-12-28 10:38:51 -08:00
alex-t 644da789e3 [AMDGPU] Split edge to make si_if dominate end_cf
The basic block containing the "if" does not necessarily dominate the block
that is the "false" target for the if.

That "false" target block may have another predecessor besides the "if"
block. The IR value corresponding to the Exec mask is generated by the
si_if intrinsic and then used by the end_cf intrinsic. In this case the IR
verifier complains that 'Def does not dominate all uses'.

This change splits the edge between the "if" block and the "false" target
block to make the target dominated by the "if" block.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D91435
2020-12-28 17:14:02 +03:00
Juneyoung Lee 9d70dbdc2b [InstCombine] use poison as placeholder for undemanded elems
Currently undef is used as a don’t-care vector when constructing a vector using a series of insertelement.
However, this is problematic because undef isn’t undefined enough.
Especially, a sequence of insertelement can be optimized to shufflevector, but using undef as its placeholder makes shufflevector a poison-blocking instruction because undef cannot be optimized to poison.
This makes a few straightforward optimizations incorrect, such as:

```
;  https://bugs.llvm.org/show_bug.cgi?id=44185

define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
  %xv = insertelement <4 x float> %q, float %x, i32 2
  %r = shufflevector <4 x float> %y, <4 x float> %xv, <4 x i32> { 0, 6, 2, undef }
  ret <4 x float> %r ; %r[3] is undef
}
=>
define <4 x float> @insert_not_undef_shuffle_translate_commute(float %x, <4 x float> %y, <4 x float> %q) {
  %r = insertelement <4 x float> %y, float %x, i32 1
  ret <4 x float> %r ; %r[3] = %y[3], incorrect if %y[3] = poison
}

Transformation doesn't verify!
ERROR: Target is more poisonous than source
```

I’d like to suggest
1. Using poison as insertelement’s placeholder value (IRBuilder::CreateVectorSplat should be patched too)
2. Updating shufflevector’s semantics to return poison element if mask is undef

Note that poison is currently lowered into UNDEF in SelDag, so codegen part is okay.
m_Undef() matches PoisonValue as well, so existing optimizations will still fire.

The only concern is hidden miscompilations that will go incorrect when poison constant is given.
A conservative way is copying all tests having `insertelement undef` & replacing it with `insertelement poison` & run Alive2 on it, but it will create many tests and people won’t like it. :(

Instead, I’ll simply locally maintain the tests and run Alive2.
If there is any bug found, I’ll report it.

Relevant links: https://bugs.llvm.org/show_bug.cgi?id=43958 , http://lists.llvm.org/pipermail/llvm-dev/2019-November/137242.html

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D93586
2020-12-28 08:58:15 +09:00
Praveen Velliengiri 61177943c9 [AMDGPU] Use MUBUF instructions for global address space access
Currently, the compiler crashes in instruction selection of global
load/stores in gfx600 due to the lack of FLAT instructions. This patch
fixes the crash by selecting MUBUF instructions for global load/stores
in gfx600.

Authored-by: Praveen Velliengiri <Praveen.Velliengiri@amd.com>

Reviewed by: t-tye

Differential revision: https://reviews.llvm.org/D92483
2020-12-24 10:13:04 +00:00
Evgeniy Brevnov 9fb074e7bb [BPI] Improve static heuristics for "cold" paths.
The current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing an "unreachable" instruction, a call marked with the 'cold' attribute, or the 'unwind' handler of an 'invoke' instruction. The issue is that heuristics are applied one by one until the first match, which essentially ignores the relative hotness/coldness of other paths.

The new approach unifies processing of "cold" paths by assigning a predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down the IR similarly to the existing approach, and finally set up edge probabilities based on the estimated block weights.

One important difference is how we propagate weight up. The existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless, at least because it always gives a 50/50 distribution, which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting a more accurate probability. The new algorithm propagates the weight up only to blocks that dominate and are post-dominated by a block with some "known" weight; in other words, to blocks that are either always executed together with that block or not executed at all.

In addition, the new approach processes loops in a uniform way as well: loop exit edges are estimated as "cold" paths relative to back edges and are considered uniformly with other coldness/hotness markers.

Reviewed By: yrouban

Differential Revision: https://reviews.llvm.org/D79485
2020-12-23 22:47:36 +07:00
Sebastian Neubauer 221fdedc69 [AMDGPU][GlobalISel] Fold flat vgpr + constant addresses
Use getPtrBaseWithConstantOffset in selectFlatOffsetImpl to fold more
vgpr+constant addresses.

Differential Revision: https://reviews.llvm.org/D93692
2020-12-23 10:40:30 +01:00
Matt Arsenault bac54639c7 AMDGPU: Add spilled CSR SGPRs to entry block live ins 2020-12-22 21:55:59 -05:00
Matt Arsenault 29ed846d67 AMDGPU: Fix assert when checking for implicit operand legality 2020-12-22 20:56:24 -05:00
Stanislav Mekhanoshin d15119a02d [AMDGPU][GlobalISel] GlobalISel for flat scratch
It does not seem to fold offsets, but this is not specific to flat
scratch: getPtrBaseWithConstantOffset() does not return the split for
these tests, unlike its SDag counterpart.

Differential Revision: https://reviews.llvm.org/D93670
2020-12-22 16:33:06 -08:00
Stanislav Mekhanoshin ca4bf58e4e [AMDGPU] Support unaligned flat scratch in TLI
Adjust SITargetLowering::allowsMisalignedMemoryAccessesImpl for
unaligned flat scratch support. Mostly needed for global isel.

Differential Revision: https://reviews.llvm.org/D93669
2020-12-22 16:12:31 -08:00
Stanislav Mekhanoshin ae8f4b2178 [AMDGPU] Folding of FI operand with flat scratch
Differential Revision: https://reviews.llvm.org/D93501
2020-12-22 10:48:04 -08:00
Fangrui Song 8ffda237a6 MCContext::reportError: don't call report_fatal_error
Errors from MCAssembler, MCObjectStreamer and *ObjectWriter typically cause a crash:

```
% cat c.c
int bar;
extern int foo __attribute__((alias("bar")));
% clang -c -fcommon c.c
fatal error: error in backend: Common symbol 'bar' cannot be used in assignment expr
PLEASE submit a bug report to ...
Stack dump:
...
```

`LLVMTargetMachine::addPassesToEmitFile` constructs `MachineModuleInfoWrapperPass`
which creates a MCContext without SourceMgr. `MCContext::reportError` calls
`report_fatal_error` which gets captured by Clang `LLVMErrorHandler` and gets translated
to the output above.

Since `MCContext::reportError` errors indicate user errors, reporting them with a crash
is inappropriate. So this patch changes `report_fatal_error` to `SourceMgr().PrintMessage`.
```
% clang -c -fcommon c.c
<unknown>:0: error: Common symbol 'bar' cannot be used in assignment expr
```

Ideally we should at least recover the original filename (the line information
is generally lost).  That requires general improvement to MC diagnostics,
because currently in many cases SMLoc information is lost.
2020-12-20 23:23:12 -08:00
Pushpinder Singh e2303a448e [FastRA] Fix handling of bundled MIs
The fast register allocator skips bundled MIs, as the main assignment
loop uses MachineBasicBlock::iterator (= MachineInstrBundleIterator).
This was causing SIInsertWaitcnts, which expects all instructions to
have registers assigned, to crash.

This patch makes sure to set everything inside the bundle to the same
assignments done on the BUNDLE header.

Reviewed By: qcolombet

Differential Revision: https://reviews.llvm.org/D90369
2020-12-21 02:10:55 -05:00
Whitney Tsang 2a814cd9e1 Ensure SplitEdge to return the new block between the two given blocks
This PR implements the function splitBasicBlockBefore to address an issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs in SplitEdge when the Succ has a single predecessor
and the edge between the BB and Succ is not critical. This produces
the result ‘BB->Succ->New’. The new function splitBasicBlockBefore
was added to splitBlockBefore to handle the issue and now produces
the correct result ‘BB->New->Succ’.

Below is an example of splitting the block bb1 at its first instruction.

/// Original IR
bb0:
	br bb1
bb1:
        %0 = mul i32 1, 2
	br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
	br bb1
bb1:
	br bb1.split
bb1.split:
        %0 = mul i32 1, 2
	br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
	br bb1.split
bb1.split
	br bb1
bb1:
        %0 = mul i32 1, 2
	br bb2
bb2:

Differential Revision: https://reviews.llvm.org/D92200
2020-12-18 17:37:17 +00:00
Bangtian Liu 511cfe9441 Revert "Ensure SplitEdge to return the new block between the two given blocks"
This reverts commit d20e0c3444.
2020-12-17 21:00:37 +00:00
Bangtian Liu d20e0c3444 Ensure SplitEdge to return the new block between the two given blocks
This PR implements the function splitBasicBlockBefore to address an issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs in SplitEdge when the Succ has a single predecessor
and the edge between the BB and Succ is not critical. This produces
the result ‘BB->Succ->New’. The new function splitBasicBlockBefore
was added to splitBlockBefore to handle the issue and now produces
the correct result ‘BB->New->Succ’.

Below is an example of splitting the block bb1 at its first instruction.

/// Original IR
bb0:
	br bb1
bb1:
        %0 = mul i32 1, 2
	br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
	br bb1
bb1:
	br bb1.split
bb1.split:
        %0 = mul i32 1, 2
	br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
	br bb1.split
bb1.split
	br bb1
bb1:
        %0 = mul i32 1, 2
	br bb2
bb2:

Differential Revision: https://reviews.llvm.org/D92200
2020-12-17 16:00:15 +00:00
Matt Arsenault f333736757 AMDGPU: Remove SGPRSpillVGPRDefinedSet hack
These VGPRs should be reserved and therefore do not need "correct"
liveness. They should not have undef uses, which can still cause
issues.
2020-12-16 21:33:35 -05:00
Bangtian Liu c10757200d Revert "Ensure SplitEdge to return the new block between the two given blocks"
This reverts commit cf638d793c.
2020-12-16 11:52:30 +00:00
Stanislav Mekhanoshin eb66bf0802 [AMDGPU] Print SCRATCH_EN field after the kernel
Differential Revision: https://reviews.llvm.org/D93353
2020-12-15 22:44:30 -08:00
Bangtian Liu cf638d793c Ensure SplitEdge to return the new block between the two given blocks
This PR implements the function splitBasicBlockBefore to address an issue
that occurred during SplitEdge(BB, Succ, ...), inside splitBlockBefore.
The issue occurs in SplitEdge when the Succ has a single predecessor
and the edge between the BB and Succ is not critical. This produces
the result ‘BB->Succ->New’. The new function splitBasicBlockBefore
was added to splitBlockBefore to handle the issue and now produces
the correct result ‘BB->New->Succ’.

Below is an example of splitting the block bb1 at its first instruction.

/// Original IR
bb0:
	br bb1
bb1:
        %0 = mul i32 1, 2
	br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlock
bb0:
	br bb1
bb1:
	br bb1.split
bb1.split:
        %0 = mul i32 1, 2
	br bb2
bb2:
/// IR after splitEdge(bb0, bb1) using splitBasicBlockBefore
bb0:
	br bb1.split
bb1.split
	br bb1
bb1:
        %0 = mul i32 1, 2
	br bb2
bb2:

Differential Revision: https://reviews.llvm.org/D92200
2020-12-15 23:32:29 +00:00