Commit Graph

7214 Commits

Author SHA1 Message Date
Dmitry Preobrazhensky a80116efec [AMDGPU][MC][GFX11] Add a helper function for identification of VOPD instructions
Differential Revision: https://reviews.llvm.org/D133608
2022-09-13 12:41:39 +03:00
Dmitry Preobrazhensky 815ba49068 [AMDGPU][MC] Add detection of mandatory literals in parser
Differential Revision: https://reviews.llvm.org/D133606
2022-09-13 12:37:30 +03:00
Kazu Hirata 9606608474 [llvm] Use x.empty() instead of llvm::empty(x) (NFC)
I'm planning to deprecate and eventually remove llvm::empty.

I thought about replacing llvm::empty(x) with std::empty(x), but it
turns out that all uses can be converted to x.empty().  That is, no
use requires the ability of std::empty to accept C arrays and
std::initializer_list.

Differential Revision: https://reviews.llvm.org/D133677
2022-09-12 13:34:35 -07:00
Matt Arsenault 7834194837 TableGen: Introduce generated getSubRegisterClass function
Currently there isn't a generic way to get a smaller register class
that can be produced from a subregister of a larger class. Replaces a
manually implemented version for AMDGPU. This will be used to improve
subregister support in the allocator.
2022-09-12 09:03:37 -04:00
Johannes Doerfert c922cac868 Revert "[Attributor] AAPointerInfo should allow "harmless" uses"
Revert "[Attributor] Teach AAPointerInfo to look into aggregates"

This reverts commit 844f6c5d03 and
4ed0a88cd8 as they broke the buildbots
that run openmp/libomptarget/test/offloading/bug49021.cpp.
2022-09-11 21:37:54 -07:00
Johannes Doerfert 4ed0a88cd8 [Attributor] Teach AAPointerInfo to look into aggregates
If we have a constant aggregate, e.g., as an initializer, we usually
failed to extract the proper value/type from it. This patch provides the
size and offset information necessary to extract the right part of the
constant.
2022-09-11 20:16:11 -07:00
Jay Foad 8901f7cebc [AMDGPU] Fix crash legalizing G_EXTRACT_VECTOR_ELT with negative index
Fixes https://github.com/llvm/llvm-project/issues/57408

Differential Revision: https://reviews.llvm.org/D132938
2022-09-09 15:53:34 +01:00
Dmitry Preobrazhensky 6b79610fd5 [AMDGPU][MC][GFX11][NFC] Correct VOPD parsing
Differential Revision: https://reviews.llvm.org/D133492
2022-09-09 13:03:29 +03:00
Jay Foad afa0ed33df [AMDGPU] Fix shrinking of F16 FMA on newer subtargets
D125803 introduced shrinking of F16 FMA to FMAAK/FMAMK in
SIShrinkInstructions (useful on GFX10+ where VOP3 instructions may have
a literal operand) but failed to handle the V_FMA_F16_gfx9_e64 form of
the opcode which is used on GFX9+.

Differential Revision: https://reviews.llvm.org/D133489
2022-09-08 16:41:04 +01:00
Joe Loser 5e96cea1db [llvm] Use std::size instead of llvm::array_lengthof
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.

Change call sites to use `std::size` instead.

Differential Revision: https://reviews.llvm.org/D133429
2022-09-08 09:01:53 -06:00
Ivan Kosarev 57c943d581 [AMDGPU] Only raise wave priority if there is a long enough sequence of VALU instructions.
Reviewed By: nhaehnle

Differential Revision: https://reviews.llvm.org/D124671
2022-09-08 15:21:30 +01:00
Justin Bogner a81c7dbf0d [AMDGPU] Drop _oneuse checks from med3 patterns
We use _oneuse checks to make sure combines won't accidentally
increase code size, but this prevents the optimization in cases where
we happen to want to clamp multiple values to the same range

It's safe to drop these checks for two reasons:

1. The pattern of max/min operations for med3 is complicated enough
   it's unlikely to come up by accident, so this will still only fire
   when appropriate to do so
2. Even if every intermediate is used and we don't save a single
   operation, we still won't end up with more operations since the
   med3 replaces the final max/min.

In pathological cases we could potentially end up with a larger
encoding size or possibly slightly increased vgpr pressure, but the
risk of that is low, especially considering the upside.

Differential Revision: https://reviews.llvm.org/D132621
2022-09-07 16:31:49 -07:00
Stanislav Mekhanoshin fb28bf3fb4 [AMDGPU] Fix liveness verifier error in hazard recognizer
After D133067 we are inserting swaps to use a new physical
register. I have noticed verifier errors about undefined
physical register uses if we are tracking liveness post RA.

We have no access to LIS at this point, so mark new register
uses as undef to calm down the verifier. Liveness should not
matter at this point anyway.

Note the description of the RegState::Undef: "Value of the
register doesn't matter." I.e. it does not say it is strictly
undefined. In fact that is what we really need: this value
does not matter.

I also had to modify the test a bit since with tracking enabled
it does not pass verification even before the recognizer.

Differential Revision: https://reviews.llvm.org/D133459
2022-09-07 16:30:36 -07:00
Stanislav Mekhanoshin 95d497ff2a [AMDGPU] W/a hazard if 64 bit shift amount is a highest allocated VGPR
In this case gfx90a uses v0 instead of the correct register. Swap
the value temporarily with a lower register and then swap it back.

Unfortunately hazard recognizer works after wait count insertion,
so we cannot simply reuse an arbitrary register, hence w/a also
includes a full waitcount. This can be avoided if we run it from
expandPostRAPseudo, but that is a complete misplacement.

Differential Revision: https://reviews.llvm.org/D133067
2022-09-07 14:23:49 -07:00
Jon Chesterfield 23f6c8d635 [amdgpu] Always, instead of mostly, remove unused LDS symbols
Currently LDS variables are removed by the lower module pass
if they have a use which is caught by the replace with struct control flow.
This makes tests brittle to changes to that control flow which induces
noise when trying to improve lowering. Some tests already check that
variables are removed, while others checked that they are not removed.

LDS variables are not (currently) externally accessible, and if that
changes the machinery which makes them externally accessible will look
like a use. This change therefore breaks no applications.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D133028
2022-09-07 18:28:16 +01:00
Jay Foad 96dfa523c2 [AMDGPU] Refactor SIFoldOperands. NFC.
Refactor static functions into class methods so they have access to TII, MRI
etc.
2022-09-07 11:05:01 +01:00
Jay Foad 5291c3dd36 [AMDGPU] Simplify mad/mac patterns. NFC.
Simplify instruction selection patterns for mad/mac:
- Use any_fmad consistently to make it clear that all patterns treat
  fmad and AMDGPUfmad_ftz identically.
- For mad, put the patterns on the instruction definitions. For mac the
  patterns are still out-of-line because we want to set AddedComplexity
  and to have special handling of the source modifiers.

Differential Revision: https://reviews.llvm.org/D133305
2022-09-07 09:58:28 +01:00
raghavmedicherla 57f01fee1e [AMDGPU/Metadata] Rename HSAMD::MetadataStreamer classes
Renamed all HSAMD::MetadataStreamer classes to improve readability of the code.

Differential Revision: https://reviews.llvm.org/D133156
2022-09-06 16:46:37 -04:00
Ivan Kosarev 5db8d6fd2b [AMDGPU][CodeGen] Support (base | offset) SMEM loads.
Prevents generation of unnecessary s_or_b32 instructions.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D132552
2022-09-05 14:22:06 +01:00
Ivan Kosarev f33645301e [AMDGPU][CodeGen] Support (soffset + offset) s_buffer_load's.
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D130263
2022-09-05 12:53:05 +01:00
Kazu Hirata 7d8c2d17eb [llvm] Use range-based for loops (NFC)
Identified with modernize-loop-convert.
2022-09-03 23:27:25 -07:00
Kazu Hirata 32aa35b504 Drop empty string literals from static_assert (NFC)
Identified with modernize-unary-static-assert.
2022-09-03 11:17:47 -07:00
Kazu Hirata fedc59734a [llvm] Use range-based for loops (NFC) 2022-09-03 11:17:40 -07:00
Juan Manuel MARTINEZ CAAMAÑO ee761374f7 [AMDGPU][NFC] Fix typo in commment: replace SiMemOpInfo by SIMemOpInfo 2022-09-02 16:45:10 +02:00
Jon Chesterfield a28bbd00c6 [amdgpu][nfc] Factor predicate out of findLDSVariablesToLower 2022-08-31 15:44:51 +01:00
Stanislav Mekhanoshin fd1f8c85f2 [AMDGPU] Limit TID / wavefrontsize uniformness to 1D kernels
If a kernel has uneven dimensions we can have a value of workitem-id-x
divided by the wavefrontsize non-uniform. For example dimensions (65, 2)
will have workitems with address (64, 0) and (0, 1) packed into a same
wave which gives 1 and 0 after the division by 64 respectively.

Unfortunately, this limits the optimization to OpenCL only and only if
reqd_work_group_size attribute is set. This patch limits it to 1D kernels,
although that shall be possible to perform this optimization is the size
of the X dimension is a power of 2, we just do not currently have
infrastructure to query it.

Note that presence of amdgpu-no-workitem-id-y attribute does not help
as it only hints the lack of the workitem-id-y query, but not the absence
of the actual 2nd dimension, therefore affecting just the SGPR allocation.

Differential Revision: https://reviews.llvm.org/D132879
2022-08-30 12:22:08 -07:00
Joe Nash 3e39ab25e6 [AMDGPU][GFX11] Fix dst register class for V_CVT_U32_U16
This instruction was referring to the wrong VOPProfile, likely due to a
typo, leading to an incorrect destination register type.

The MC layer will care about this change, but is NFC while 16-bit values
actually use 32 bit registers.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D132878
2022-08-30 14:01:25 -04:00
Joe Nash 70e7a1257c [AMDGPU][NFC] Allow separate RC for VOP3 DPP Dst
Create a field in VOPProfile called DstRCVOP3DPP to allow the VOP3
versions of DPP instructions to have a different destination register
class than the non-VOP3 encoding. NFC for current instructions, but
planned to be functional in upcoming ones.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D132673
2022-08-29 11:22:07 -04:00
Kazu Hirata 6ed2cb4ad5 Revert "[llvm] Use llvm::is_contained (NFC)"
This reverts commit ebf574f59a.

This patch seems to cause build failures on Windows.
2022-08-28 18:52:49 -07:00
Kazu Hirata ebf574f59a [llvm] Use llvm::is_contained (NFC) 2022-08-28 17:35:03 -07:00
Kazu Hirata 9861a68a7c [Target] Qualify auto in range-based for loops (NFC) 2022-08-28 10:41:50 -07:00
Kazu Hirata ce9f007c7c [llvm] Use llvm::find_if (NFC) 2022-08-28 10:41:48 -07:00
Kazu Hirata 21de2888a4 Use llvm::is_contained (NFC) 2022-08-27 09:53:11 -07:00
Stanislav Mekhanoshin 813ae2871d [AMDGPU] Detect uniformness of TID / wavefrontsize
A value of 'workitemid / wavefrontize' or 'workitemid & (wavefrontize - 1)'
is wave uniform.

Differential Revision: https://reviews.llvm.org/D132511
2022-08-26 23:26:08 -07:00
Simon Pilgrim f9de13232f [X86] Promote i8/i16 CTTZ (BSF) instructions and remove speculation branch
This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.

For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions by masking the operand above the msb (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the src operand is known never zero (i.e. due to the promotion masking) we can remove the CMOV zero src handling.

Although BSF isn't very fast, most CPUs from the last 20 years don't do that bad a job with it, although there are some annoying passthrough EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards assuming this will most likely be executed as a TZCNT instruction on any semi-modern CPU.

Differential Revision: https://reviews.llvm.org/D132520
2022-08-24 17:28:18 +01:00
Simon Pilgrim 3cf48963ff [AMDGPU] Remove old isCheapToSpeculateCttz FIXME
As confirmed on D132520 - this should always return true
2022-08-24 15:53:38 +01:00
Pierre van Houtryve 59cf9dd923 [AMDGPU][GISel] Enable Selection of ADD3 for G_PTR_ADD
Allows things like `(G_PTR_ADD (G_PTR_ADD a, b), c)` to be
simplified into a single ADD3 instruction instead of two adds.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D131254
2022-08-24 14:44:19 +00:00
Alex Richardson 38107171ed [RegisterInfoEmitter] Generate isConstantPhysReg(). NFCI
This commit moves the information on whether a register is constant into
the Tablegen files to allow generating the implementaiton of
isConstantPhysReg(). I've marked isConstantPhysReg() as final in this
generated file to ensure that changes are made to tablegen instead of
overriding this function, but if that turns out to be too restrictive,
we can remove the qualifier.

This should be pretty much NFC, but I did notice that e.g. the AMDGPU
generated file also includes the LO16/HI16 registers now.

The new isConstant flag will also be used by D131958 to ensure that
constant registers are marked as call-preserved.

Differential Revision: https://reviews.llvm.org/D131962
2022-08-24 14:16:20 +00:00
Jay Foad 1bca81c12e [AMDGPU] Remove unused S_ADD_U64_CO_PSEUDO and S_SUB_U64_CO_PSEUDO 2022-08-24 10:28:35 +01:00
Raghav 79d2529c10 AMDGPU/MetaData: Restrict address space key to only be emitted for "global_buffer" and "dynamic_shared_pointer"
This matches .address_space docs at https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v3

Differential Revision: https://reviews.llvm.org/D132145
2022-08-23 14:01:01 -04:00
Thomas Symalla 5ee0fb7ed2 [NFC][AMDGPU] Some cleanups in the SIOptimizeExecMasking pass.
Fix typos and remove an unused argument.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D132292
2022-08-23 18:16:47 +02:00
Philip Reames 104fa367ee [TTI] Use OperandValueInfo in getArithmeticInstrCost implementation [NFC]
This change completes the process of replacing OperandValueKind and OperandValueProperties which were previously passed independently in this API with a single container class which contains both.

This is the change which motivated the whole sequence which preceeded it.  In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as result, we silently passed additional information through a callsite which previously dropped the power-of-two fact.  This might be harmless in most cases, but at least a couple clearly dependend for correctness on not passing that property through.

I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance.  For instance, every parameter which changes type in this change also changes name.  This was intentional to make sure that every call site possible effected must show up in the diff.  This let me audit each one closely.
2022-08-22 15:16:39 -07:00
Simon Pilgrim 5263155d5b [CostModel] Add CostKind argument to getShuffleCost
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.

Differential Revision: https://reviews.llvm.org/D132287
2022-08-21 10:54:51 +01:00
Kazu Hirata 8b1b0d1d81 Revert "Use std::is_same_v instead of std::is_same (NFC)"
This reverts commit c5da37e42d.

This patch seems to break builds with some versions of MSVC.
2022-08-20 23:00:39 -07:00
Kazu Hirata c5da37e42d Use std::is_same_v instead of std::is_same (NFC) 2022-08-20 22:36:26 -07:00
Thomas e565e2fa5c [NFC][AMDGPU] Fix typo. 2022-08-20 08:30:42 +02:00
Austin Kerbow b0f4678b90 [AMDGPU] Add iglp_opt builtin and MFMA GEMM Opt strategy
Adds a builtin that serves as an optimization hint to apply specific optimized
DAG mutations during scheduling. This also disables any other mutations or
clustering that may interfere with the desired pipeline. The first optimization
strategy that is added here is designed to improve the performance of small gemm
kernels on gfx90a.

Reviewed By: jrbyrnes

Differential Revision: https://reviews.llvm.org/D132079
2022-08-19 15:38:36 -07:00
jeff 20cf170e68 [InferAddressSpaces] [AMDGPU] Add inference for flat_atomic intrinsics
Certain address space dependent optimizations, like SeperateConstOffsetFromGEP, assume agreement between the address space of the recursive uses and the address space of the def. If this assumption is invalid, then optimizations may or may not be correct depending on properties of an address space for a given target, the address spaces of recursive uses, and the optimization being done.

This patch infers the previous address space for flat_atomic ptr arguments. As a result, the address spaces of the uses in flat_atomic cases will agree with the address space in recursive defs. If this results in non-flat address space, then isel may infer a different intrinsic. For example, if the result is a flat_atomic using global address space, then it will be lowered to the corresponding global_atomic intrinsic.

Change-Id: Ifcd981709dc2ea94d4acbcb84efe7176593ec8c7
2022-08-19 11:37:20 -07:00
Joe Nash 063ee26ea3 [AMDGPU] Update comment on shrinking dpp. NFC 2022-08-18 11:29:32 -04:00
Jeffrey Byrnes 1c8d7ea973 [AMDGPU] Implement pipeline solver for non-trivial pipelines
Requested SchedGroup pipelines may be non-trivial to satisify. A minimimal example is if the requested pipeline is {2 VMEM, 2 VALU, 2 VMEM} and the original order of SUnits is {VMEM, VALU, VMEM, VALU, VMEM}. Because of existing dependencies, the choice of which SchedGroup the middle VMEM goes into impacts how closely we are able to match the requested pipeline. It seems minimizing the degree of misfit (as measured by the number of edges we can't add) w.r.t the choice we make when mapping an instruction -> SchedGroup is an NP problem. This patch implements the PipelineSolver class which produces a solution for the defined problem for the sched_group_barrier mutation. The solver has both an exponential time exact algorithm and a greedy algorithm. The patch includes some controls which allows the user to select the greedy/exact algorithm.

Differential Revision: https://reviews.llvm.org/D130797
2022-08-17 16:21:59 -07:00