I'm planning to deprecate and eventually remove llvm::empty.
I thought about replacing llvm::empty(x) with std::empty(x), but it
turns out that all uses can be converted to x.empty(). That is, no
use requires the ability of std::empty to accept C arrays and
std::initializer_list.
Differential Revision: https://reviews.llvm.org/D133677
Currently there isn't a generic way to get a smaller register class
that can be produced from a subregister of a larger class. Replaces a
manually implemented version for AMDGPU. This will be used to improve
subregister support in the allocator.
Revert "[Attributor] Teach AAPointerInfo to look into aggregates"
This reverts commit 844f6c5d03 and
4ed0a88cd8 as they broke the buildbots
that run openmp/libomptarget/test/offloading/bug49021.cpp.
If we have a constant aggregate, e.g., as an initializer, we usually
failed to extract the proper value/type from it. This patch provides the
size and offset information necessary to extract the right part of the
constant.
D125803 introduced shrinking of F16 FMA to FMAAK/FMAMK in
SIShrinkInstructions (useful on GFX10+ where VOP3 instructions may have
a literal operand) but failed to handle the V_FMA_F16_gfx9_e64 form of
the opcode which is used on GFX9+.
Differential Revision: https://reviews.llvm.org/D133489
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.
Change call sites to use `std::size` instead.
Differential Revision: https://reviews.llvm.org/D133429
We use _oneuse checks to make sure combines won't accidentally
increase code size, but this prevents the optimization in cases where
we happen to want to clamp multiple values to the same range
It's safe to drop these checks for two reasons:
1. The pattern of max/min operations for med3 is complicated enough
it's unlikely to come up by accident, so this will still only fire
when appropriate to do so
2. Even if every intermediate is used and we don't save a single
operation, we still won't end up with more operations since the
med3 replaces the final max/min.
In pathological cases we could potentially end up with a larger
encoding size or possibly slightly increased vgpr pressure, but the
risk of that is low, especially considering the upside.
Differential Revision: https://reviews.llvm.org/D132621
After D133067 we are inserting swaps to use a new physical
register. I have noticed verifier errors about undefined
physical register uses if we are tracking liveness post RA.
We have no access to LIS at this point, so mark new register
uses as undef to calm down the verifier. Liveness should not
matter at this point anyway.
Note the description of the RegState::Undef: "Value of the
register doesn't matter." I.e. it does not say it is strictly
undefined. In fact that is what we really need: this value
does not matter.
I also had to modify the test a bit since with tracking enabled
it does not pass verification even before the recognizer.
Differential Revision: https://reviews.llvm.org/D133459
In this case gfx90a uses v0 instead of the correct register. Swap
the value temporarily with a lower register and then swap it back.
Unfortunately hazard recognizer works after wait count insertion,
so we cannot simply reuse an arbitrary register, hence w/a also
includes a full waitcount. This can be avoided if we run it from
expandPostRAPseudo, but that is a complete misplacement.
Differential Revision: https://reviews.llvm.org/D133067
Currently LDS variables are removed by the lower module pass
if they have a use which is caught by the replace with struct control flow.
This makes tests brittle to changes to that control flow which induces
noise when trying to improve lowering. Some tests already check that
variables are removed, while others checked that they are not removed.
LDS variables are not (currently) externally accessible, and if that
changes the machinery which makes them externally accessible will look
like a use. This change therefore breaks no applications.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D133028
Simplify instruction selection patterns for mad/mac:
- Use any_fmad consistently to make it clear that all patterns treat
fmad and AMDGPUfmad_ftz identically.
- For mad, put the patterns on the instruction definitions. For mac the
patterns are still out-of-line because we want to set AddedComplexity
and to have special handling of the source modifiers.
Differential Revision: https://reviews.llvm.org/D133305
If a kernel has uneven dimensions we can have a value of workitem-id-x
divided by the wavefrontsize non-uniform. For example dimensions (65, 2)
will have workitems with address (64, 0) and (0, 1) packed into a same
wave which gives 1 and 0 after the division by 64 respectively.
Unfortunately, this limits the optimization to OpenCL only and only if
reqd_work_group_size attribute is set. This patch limits it to 1D kernels,
although that shall be possible to perform this optimization is the size
of the X dimension is a power of 2, we just do not currently have
infrastructure to query it.
Note that presence of amdgpu-no-workitem-id-y attribute does not help
as it only hints the lack of the workitem-id-y query, but not the absence
of the actual 2nd dimension, therefore affecting just the SGPR allocation.
Differential Revision: https://reviews.llvm.org/D132879
This instruction was referring to the wrong VOPProfile, likely due to a
typo, leading to an incorrect destination register type.
The MC layer will care about this change, but is NFC while 16-bit values
actually use 32 bit registers.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D132878
Create a field in VOPProfile called DstRCVOP3DPP to allow the VOP3
versions of DPP instructions to have a different destination register
class than the non-VOP3 encoding. NFC for current instructions, but
planned to be functional in upcoming ones.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D132673
This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.
For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions by masking the operand above the msb (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the src operand is known never zero (i.e. due to the promotion masking) we can remove the CMOV zero src handling.
Although BSF isn't very fast, most CPUs from the last 20 years don't do that bad a job with it, although there are some annoying passthrough EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards assuming this will most likely be executed as a TZCNT instruction on any semi-modern CPU.
Differential Revision: https://reviews.llvm.org/D132520
Allows things like `(G_PTR_ADD (G_PTR_ADD a, b), c)` to be
simplified into a single ADD3 instruction instead of two adds.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D131254
This commit moves the information on whether a register is constant into
the Tablegen files to allow generating the implementaiton of
isConstantPhysReg(). I've marked isConstantPhysReg() as final in this
generated file to ensure that changes are made to tablegen instead of
overriding this function, but if that turns out to be too restrictive,
we can remove the qualifier.
This should be pretty much NFC, but I did notice that e.g. the AMDGPU
generated file also includes the LO16/HI16 registers now.
The new isConstant flag will also be used by D131958 to ensure that
constant registers are marked as call-preserved.
Differential Revision: https://reviews.llvm.org/D131962
This change completes the process of replacing OperandValueKind and OperandValueProperties which were previously passed independently in this API with a single container class which contains both.
This is the change which motivated the whole sequence which preceeded it. In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as result, we silently passed additional information through a callsite which previously dropped the power-of-two fact. This might be harmless in most cases, but at least a couple clearly dependend for correctness on not passing that property through.
I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance. For instance, every parameter which changes type in this change also changes name. This was intentional to make sure that every call site possible effected must show up in the diff. This let me audit each one closely.
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
Adds a builtin that serves as an optimization hint to apply specific optimized
DAG mutations during scheduling. This also disables any other mutations or
clustering that may interfere with the desired pipeline. The first optimization
strategy that is added here is designed to improve the performance of small gemm
kernels on gfx90a.
Reviewed By: jrbyrnes
Differential Revision: https://reviews.llvm.org/D132079
Certain address space dependent optimizations, like SeperateConstOffsetFromGEP, assume agreement between the address space of the recursive uses and the address space of the def. If this assumption is invalid, then optimizations may or may not be correct depending on properties of an address space for a given target, the address spaces of recursive uses, and the optimization being done.
This patch infers the previous address space for flat_atomic ptr arguments. As a result, the address spaces of the uses in flat_atomic cases will agree with the address space in recursive defs. If this results in non-flat address space, then isel may infer a different intrinsic. For example, if the result is a flat_atomic using global address space, then it will be lowered to the corresponding global_atomic intrinsic.
Change-Id: Ifcd981709dc2ea94d4acbcb84efe7176593ec8c7
Requested SchedGroup pipelines may be non-trivial to satisify. A minimimal example is if the requested pipeline is {2 VMEM, 2 VALU, 2 VMEM} and the original order of SUnits is {VMEM, VALU, VMEM, VALU, VMEM}. Because of existing dependencies, the choice of which SchedGroup the middle VMEM goes into impacts how closely we are able to match the requested pipeline. It seems minimizing the degree of misfit (as measured by the number of edges we can't add) w.r.t the choice we make when mapping an instruction -> SchedGroup is an NP problem. This patch implements the PipelineSolver class which produces a solution for the defined problem for the sched_group_barrier mutation. The solver has both an exponential time exact algorithm and a greedy algorithm. The patch includes some controls which allows the user to select the greedy/exact algorithm.
Differential Revision: https://reviews.llvm.org/D130797