llvm-project

Commit Graph

Author	SHA1	Message	Date
Matt Arsenault	d2e52eec51	AMDGPU: Select global saddr mode from SGPR pointer Use the 64-bit SGPR base with a 0 offset, since it's 1 fewer instruction to materialize the 0 vs. the 64-bit copy.	2020-11-16 11:51:06 -05:00
Sebastian Neubauer	a343b9b032	Revert "[AMDGPU] Insert waitcnt after returning from call" This reverts commit `ca907bfb57`. According to michel.daenzer: "This completely broke the Mesa radeonsi driver on Navi 14. Xorg + xterm come up with major corruption & psychedelic colours."	2020-09-23 17:16:39 +02:00
Sebastian Neubauer	ca907bfb57	[AMDGPU] Insert waitcnt after returning from call When memory operations are outstanding on function calls, either the caller or the callee can insert a waitcnt to ensure that all reads are finished. Calls need some time to be executed, so if the callee inserts the waitcnt, filling the instruction buffer and waiting for memory will be interleaved, hiding some latency. This comes at the cost of having a waitcnt inside functions that may not be needed as no memory operations are outstanding. For function calls, this is already implemented. The same principle applies to returns: If the caller inserts a waitcnt after the call, the callee does not have to wait and the return and memory operation can be run in parallel. This commit implements waiting in the caller after returning from a function call. Differential Revision: https://reviews.llvm.org/D87674	2020-09-23 12:17:59 +02:00
Jay Foad	c799f873cb	[AMDGPU] Don't cluster stores Clustering loads has caching benefits, but as far as I know there is no advantage to clustering stores on any AMDGPU subtargets. The disadvantage is that it tends to increase register pressure and restricts scheduling freedom. Differential Revision: https://reviews.llvm.org/D85530	2020-09-14 13:40:17 +01:00
QingShan Zhang	3359ea62ed	[Scheduling] Create the missing dependency edges for store cluster If it is a load cluster, we don't need to create the dependency edges (SUb->reg) from SUb to SUa, as they both depend on the base register "reg". [Diagram: SUa "Load 0(reg)" and SUb "Load 4(reg)" each have an edge to reg, and SUb has an edge to SUa.] But if it is a store cluster, we need to create the edge marked "Missing" in the diagram, so that the instruction the store depends on is not scheduled in between SUb and SUa. [Diagram: SUa "Store x 0(reg)" and SUb "Store y 4(reg)" each have an edge to reg, SUb has an edge to SUa and an edge to y, and the edge from SUa to y is the one marked "Missing".] Reviewed By: evandro, arsenm, rampitec, foad, fhahn Differential Revision: https://reviews.llvm.org/D72031	2020-08-07 04:58:03 +00:00
hsmahesha	33fd4a18e7	[AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic Get rid of all fixmes and base the heuristic on `num-clustered-dwords`. The main intuition behind this is as follows. The existing heuristic roughly summarizes as below: * Assume all the mem op instructions participating in the clustering process load/store the same number of bytes * If the number of bytes loaded by each mem op is 4, then cluster at most 5 mem ops, that is at most 20 bytes * If the number of bytes loaded by each mem op is 8, then cluster at most 3 mem ops, that is at most 24 bytes * If the number of bytes loaded by each mem op is 16, then cluster at most 2 mem ops, that is at most 32 bytes So, we need to make sure that the new heuristic does not completely deviate from the above one, and that it properly handles both the sub-word loads and the wide loads. Reviewed By: arsenm, rampitec Differential Revision: https://reviews.llvm.org/D84354	2020-07-30 21:41:13 +05:30
hsmahesha	4905536086	Revert "[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size" This reverts commit `cc9d693856`.	2020-07-17 12:20:37 +05:30
Your Name	cc9d693856	[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size Summary: Make use of both the - (1) clustered bytes and (2) cluster length, to decide on the max number of mem ops that can be clustered. On an average, when loads are dword or smaller, consider `5` as max threshold, otherwise `4`. This heuristic is purely based on different experimentation conducted, and there is no analytical logic here. Reviewers: foad, rampitec, arsenm, vpykhtin Reviewed By: rampitec Subscribers: llvm-commits, kerbowa, hiraditya, t-tye, Anastasia, tpr, dstuttard, yaxunl, nhaehnle, wdng, jvesely, kzhuravl, thakis Tags: #llvm Differential Revision: https://reviews.llvm.org/D82393	2020-06-24 00:39:41 +05:30
Michael Liao	b1360caa82	[SDAG] Add new AssertAlign ISD node. Summary: - AssertAlign node records the guaranteed alignment on its source node, where these alignments are retrieved from alignment attributes in LLVM IR. These tracked alignments could help DAG combining and lowering generating efficient code. - In this patch, the basic support of AssertAlign node is added. So far, we only generate AssertAlign nodes on return values from intrinsic calls. - Addressing selection in AMDGPU is revised accordingly to capture the new (base + offset) patterns. Reviewers: arsenm, bogner Subscribers: jvesely, wdng, nhaehnle, tpr, hiraditya, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D81711	2020-06-23 00:51:11 -04:00
hsmahesha	7410571ce9	Revert "[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size" This reverts commit `40a632a335`.	2020-06-09 19:27:17 +05:30
hsmahesha	40a632a335	[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size Summary: Make use of both the - (1) clustered bytes and (2) cluster length, to decide on the max number of mem ops that can be clustered. On an average, when loads are dword or smaller, consider `5` as max threshold, otherwise `4`. This heuristic is purely based on different experimentation conducted, and there is no analytical logic here. Reviewers: foad, rampitec, arsenm, vpykhtin Reviewed By: foad, rampitec Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, Anastasia, t-tye, hiraditya, kerbowa, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D81085	2020-06-09 14:09:14 +05:30
Jay Foad	43830790d7	[AMDGPU] Remove dubious logic in bidirectional list scheduler Summary: pickNodeBidirectional tried to compare the best top candidate and the best bottom candidate by examining TopCand.Reason and BotCand.Reason. This is unsound because, after calling pickNodeFromQueue, Cand.Reason does not reflect the most important reason why Cand was chosen. Rather it reflects the most recent reason why it beat some other potential candidate, which could have been for some low priority tie breaker reason. I have seen this cause problems where TopCand is a good candidate, but because TopCand.Reason is ORDER (which is very low priority) it is repeatedly ignored in favour of a mediocre BotCand. This is not how bidirectional scheduling is supposed to work. To fix this I changed the code to always compare TopCand and BotCand directly, like the generic implementation of pickNodeBidirectional does. This removes some uncommented AMDGPU-specific logic; if this logic turns out to be important then perhaps it could be moved into an override of tryCandidate instead. Graphics shader benchmarking on gfx10 shows a lot more positive than negative effects from this change. Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits Tags: #llvm Differential Revision: https://reviews.llvm.org/D68338	2020-02-28 21:35:34 +00:00
Simon Pilgrim	fc2b4a02b1	[DAGCombine] visitEXTRACT_VECTOR_ELT - add SimplifyDemandedBits multi use support Similar to what we already do with SimplifyDemandedVectorElts, call SimplifyDemandedBits across all the extracted elements of the source vector, treating it as single use. There's a minor regression in store-weird-sizes.ll which will be addressed in an upcoming SimplifyDemandedBits patch.	2020-02-20 15:49:38 +00:00
Stanislav Mekhanoshin	555d8f4ef5	[AMDGPU] Bundle loads before post-RA scheduler We are relying on artificial DAG edges inserted by the MemOpClusterMutation to keep loads and stores together in the post-RA scheduler. This does not work all the time since it allows a completely independent instruction to be scheduled in the middle of the cluster. Removed the DAG mutation and added a pass to bundle already clustered instructions. These bundles are unpacked before the memory legalizer because it does not work with bundles but also because it allows waitcounts to be inserted in the middle of a store cluster. Removing artificial edges also allows a more relaxed scheduling. Differential Revision: https://reviews.llvm.org/D72737	2020-01-24 11:33:38 -08:00
Simon Pilgrim	e3eec06dde	[AMDGPU] Reapplied BFE canonicalization from D60462 This was committed in rL358887 but reverted in rL360066 due to an x86 regression; really it should have been pre-committed instead of being part of the SimplifyDemandedBits bitcast patch. llvm-svn: 360263	2019-05-08 15:49:10 +00:00
Craig Topper	55a71b575c	Revert r359392 and r358887 Reverts "[X86] Remove (V)MOV64toSDrr/m and (V)MOVDI2SSrr/m. Use 128-bit result MOVD/MOVQ and COPY_TO_REGCLASS instead" Reverts "[TargetLowering][AMDGPU][X86] Improve SimplifyDemandedBits bitcast handling" Eric Christopher and Jorge Gorbe Moya reported some issues with these patches to me off list. Removing the CodeGenOnly instructions has changed how fneg is handled during fast-isel with sse/sse2. We're now emitting fsub -0.0, x instead of moving to the integer domain (in a GPR), xoring the sign bit, and then moving back to xmm. This is because the fast isel table no longer contains an entry for (f32/f64 bitcast (i32/i64)) so the target independent fneg code fails. The use of fsub changes the behavior of nan with respect to -O2 codegen which will always use a pxor. NOTE: We still have a difference with double with -m32 since the move to GPR doesn't work there. I'll file a separate PR for that and add test cases. Since removing the CodeGenOnly instructions was fixing PR41619, I'm reverting r358887 which exposed that PR. Though I wouldn't be surprised if that bug can still be hit independent of that. This should hopefully get Google back to green. I'll work with Simon and other X86 folks to figure out how to move forward again. llvm-svn: 360066	2019-05-06 19:29:24 +00:00
Simon Pilgrim	6276ce0142	[TargetLowering][AMDGPU][X86] Improve SimplifyDemandedBits bitcast handling This patch adds support for BigBitWidth -> SmallBitWidth bitcasts, splitting the DemandedBits/Elts accordingly. The AMDGPU backend needed an extra (srl (and x, c1 << c2), c2) -> (and (srl x, c2), c1) combine to encourage BFE creation. I investigated putting this in DAGCombine but it caused a lot of noise on other targets - some improvements, some regressions. The X86 changes are all definite wins. Differential Revision: https://reviews.llvm.org/D60462 llvm-svn: 358887	2019-04-22 14:04:35 +00:00
Simon Pilgrim	e017ed3245	[SelectionDAG] Improve SimplifyDemandedBits to SimplifyDemandedVectorElts simplification D52935 introduced the ability for SimplifyDemandedBits to call SimplifyDemandedVectorElts through BITCASTs if the demanded bit mask entirely covered the sub element. This patch relaxes this to demanding an element if we need any bit from it. Differential Revision: https://reviews.llvm.org/D54761 llvm-svn: 348073	2018-12-01 12:08:55 +00:00
Simon Pilgrim	bac49ac455	[AMDGPU] Regenerate weird stores tests. Makes an upcoming SimplifyDemandedBits optimization much easier to understand. llvm-svn: 347326	2018-11-20 17:04:02 +00:00
Matt Arsenault	8c4a35237a	AMDGPU: Add pass to lower kernel arguments to loads This replaces most argument uses with loads, but for now not all. The code in SelectionDAG for calling convention lowering is actively harmful for amdgpu_kernel. It attempts to split the argument types into register legal types, which results in low quality code for arbitrary types. Since all kernel arguments are passed in memory, we just want the raw types. I've tried a couple of methods of mitigating this in SelectionDAG, but it's easier to just bypass this problem altogether. It's possible to hack around the problem in the initial lowering, but the real problem is the DAG then expects to be able to use CopyToReg/CopyFromReg for uses of the arguments outside the block. Exposing the argument loads in the IR also has the advantage that the LoadStoreVectorizer can merge them. I'm not sure what the best approach to dealing with the IR argument list is. The patch as-is just leaves the IR arguments in place, so all the existing code will still compute the same kernarg size and pointlessly lower the arguments. Arguably the frontend should emit kernels with an empty argument list in the first place. Alternatively a dummy array could be inserted as a single argument just to reserve space. This does have some disadvantages. Local pointer kernel arguments can no longer have AssertZext placed on them as the equivalent !range metadata is not valid on pointer typed loads. This is mostly bad for SI which needs to know about the known bits in order to use the DS instruction offset, so in this case this is not done. More importantly, this skips noalias arguments since this pass does not yet convert this to the equivalent !alias.scope and !noalias metadata. Producing this metadata correctly seems to be tricky, although this logically is the same as inlining into a function which doesn't exist. 
Additionally, exposing these loads to the vectorizer may result in degraded aliasing information if a pointer load is merged with another argument load. I'm also not entirely sure this is preserving the current clover ABI, although I would greatly prefer if it would stop widening arguments and match the HSA ABI. As-is I think it is extending < 4-byte arguments to 4-bytes but doesn't align them to 4-bytes. llvm-svn: 335650	2018-06-26 19:10:00 +00:00
Matt Arsenault	762d498808	AMDGPU: Add combine for trunc of bitcast from build_vector If the truncate is only accessing the first element of the vector, we can use the original source value. This helps with some combine ordering issues after operations are lowered to integer operations between bitcasts of build_vector. In particular it stops unnecessarily materializing the unused top half of a vector in some cases. llvm-svn: 331909	2018-05-09 18:37:39 +00:00
Matt Arsenault	3f71c0e3ee	AMDGPU: Select DS insts without m0 initialization GFX9 stopped using m0 for most DS instructions. Select a different instruction without the use. I think this will be less error prone than trying to manually maintain m0 uses as needed. llvm-svn: 319270	2017-11-29 00:55:57 +00:00
Matt Arsenault	e123aba94e	DAG: Legalize truncstores to illegal int types Truncate to a legal int type, and produce a new truncstore from a narrower type. llvm-svn: 319185	2017-11-28 17:11:30 +00:00

23 Commits
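
Note on commit 6276ce0142: the combine it describes, (srl (and x, c1 << c2), c2) -> (and (srl x, c2), c1), relies on the two forms being equal for logical shifts on a fixed-width integer. The sketch below is not LLVM code; it only models the two expressions on 32-bit unsigned values so the equivalence can be checked by hand. The function names and the 32-bit width are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed width: model i32 values as Python ints masked to 32 bits


def srl_of_and(x: int, c1: int, c2: int) -> int:
    """Model of (srl (and x, c1 << c2), c2): mask first, then shift right."""
    shifted_mask = (c1 << c2) & MASK32  # the shift left wraps at 32 bits
    return ((x & MASK32) & shifted_mask) >> c2


def and_of_srl(x: int, c1: int, c2: int) -> int:
    """Model of (and (srl x, c2), c1): shift right first, then mask."""
    return ((x & MASK32) >> c2) & (c1 & MASK32)


def forms_agree(xs, masks, shifts) -> bool:
    """Check that both forms give the same result for every combination."""
    return all(
        srl_of_and(x, c1, c2) == and_of_srl(x, c1, c2)
        for x in xs
        for c1 in masks
        for c2 in shifts
    )
```

The second form is a shift followed by a mask, which is the shape of a bitfield extract; the commit message says this is meant to encourage BFE creation on AMDGPU.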