Commit Graph

7293 Commits

Author SHA1 Message Date
Joe Nash dc850fbf3b [AMDGPU] NFC. Assert that mask is full with VOPC DPP
VOPC DPP should not be formed when the row_mask and bank_mask are not
0xf (full) because the resulting VOP DPP would have different semantics
than the MOV DPP followed by VOP. Existing checks in GCNDPPCombine cover
this case but for different reasons, so assert the property for
future-proofing.

Reviewed By: nhaehnle

Differential Revision: https://reviews.llvm.org/D130101
2022-07-20 13:23:03 -04:00
Kazu Hirata 0387da6f4f Use value instead of getValue (NFC) 2022-07-19 21:18:26 -07:00
Kazu Hirata 41ae78ea3a Use has_value instead of hasValue (NFC) 2022-07-19 20:15:44 -07:00
Johannes Doerfert bf789b1957 [Attributor] Replace AAValueSimplify with AAPotentialValues
For the longest time we used `AAValueSimplify` and
`genericValueTraversal` to determine "potential values". This was
problematic for many reasons:
- We recomputed the result a lot as there was no caching for the 9
  locations calling `genericValueTraversal`.
- We added the idea of "intra" vs. "inter" procedural simplification
  only as an afterthought. `genericValueTraversal` did offer an option
  but `AAValueSimplify` did not. Thus, we might end up with "too much"
  simplification in certain situations and then gave up on it.
- Because `genericValueTraversal` was not a real `AA` we ended up with
  problems like the infinite recursion bug (#54981) as well as code
  duplication.

This patch introduces `AAPotentialValues` and replaces the
`AAValueSimplify` uses with it. `genericValueTraversal` is folded into
`AAPotentialValues` as are the instruction simplifications performed in
`AAValueSimplify` before. We further distinguish "intra" and "inter"
procedural simplification now.

`AAValueSimplify` was not deleted as we haven't ported the
re-materialization of instructions yet. There are other differences over
the former handling, e.g., we may not fold trivially foldable
instructions right now, e.g., `add i32 1, 1` is not folded to `i32 2`
but if an operand would be simplified to `i32 1` we would fold it still.

We are also even more aware of function/SCC boundaries in CGSCC passes,
which is good even if some tests look like they regress.

Fixes: https://github.com/llvm/llvm-project/issues/54981

Note: A previous version was flawed and consequently reverted in
      6555558a80.
2022-07-19 16:24:42 -05:00
Jon Chesterfield 3a20597776 [amdgpu] Implement lds kernel id intrinsic
Implement an intrinsic for use lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.

There are a number of implicit arguments accessed by intrinsic already
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.

It is necessary in the general case to put variables at different addresses
such that they can be compactly allocated and thus necessary for an
indirect function call to have some means of determining where a
given variable was allocated. Claiming an arbitrary SGPR into which
an integer can be written by the kernel, in this implementation based
on metadata associated with that kernel, which is then passed on to
indirect call sites is sufficient to determine the variable address.

The intent is to emit a __const array of LDS addresses and index into it.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D125060
2022-07-19 17:46:19 +01:00
Jon Chesterfield 2224bbcd74 [nfc][amdgpu] LDS. Move selection logic up the stack. 2022-07-19 17:20:19 +01:00
Joe Nash b28bb8cc9c [AMDGPU] Remove old operand from VOPC DPP
For most DPP instructions, the old operand stores the value that was in
the current lane before the DPP operation, and is tied to the
destination. For VOPC DPP, this is unnecessary and incorrect.

There appears to have been a latent bug related to D122737 with
SIInstrInfo::isOperandLegal. If you checked if a register operand was legal
when the InstructionDesc expected an immediate, it reported that is valid.
Its fix is necessary for and tested in this patch.

Reviewed By: foad, rampitec

Differential Revision: https://reviews.llvm.org/D130040
2022-07-19 09:35:05 -04:00
Abinav Puthan Purayil 9fa425c1ab [AMDGPU] Set amdgpu-memory-bound if a basic block has dense global memory access
AMDGPUPerfHintAnalysis doesn't set the memory bound attribute if
FuncInfo::InstCost outweighs MemInstCost even if we have a basic block
with relatively high global memory access. GCNSchedStrategy could revert
optimal scheduling in favour of occupancy which seems to degrade
performance for some kernels. This change introduces the
HasDenseGlobalMemAcc metric in the heuristic that makes the analysis
more conservative in these cases.

This fixes SWDEV-334259/SWDEV-343932

Differential Revision: https://reviews.llvm.org/D129759
2022-07-19 15:16:28 +05:30
Matt Arsenault 8d0383eb69 CodeGen: Remove AliasAnalysis from regalloc
This was stored in LiveIntervals, but not actually used for anything
related to LiveIntervals. It was only used in one check for if a load
instruction is rematerializable. I also don't think this was entirely
correct, since it was implicitly assuming constant loads are also
dereferenceable.

Remove this and rely only on the invariant+dereferenceable flags in
the memory operand. Set the flag based on the AA query upfront. This
should have the same net benefit, but has the possible disadvantage of
making this AA query nonlazy.

Preserve the behavior of assuming pointsToConstantMemory implying
dereferenceable for now, but maybe this should be changed.
2022-07-18 17:23:41 -04:00
Stanislav Mekhanoshin 523a99c0eb [AMDGPU] Support for gfx940 fp8 smfmac
Differential Revision: https://reviews.llvm.org/D129908
2022-07-18 12:12:41 -07:00
Stanislav Mekhanoshin 2695f0a688 [AMDGPU] Support for gfx940 fp8 mfma
Differential Revision: https://reviews.llvm.org/D129906
2022-07-18 11:49:56 -07:00
Stanislav Mekhanoshin 9fa5a6b7e8 [AMDGPU] Support for gfx940 fp8 conversions
Differential Revision: https://reviews.llvm.org/D129902
2022-07-18 11:48:43 -07:00
Petar Avramovic c287bc4841 [AMDGPU][MC][GFX11] AsmParser for op_sel for VOP3 dpp opcodes
Parse op_sel for *_e64_dpp VOP3 opcodes.
Depends on D129637 and setting of VOP3_OPSEL in dpp pseudos.

Differential Revision: https://reviews.llvm.org/D129767
2022-07-18 15:08:52 +02:00
Ivan Kosarev 432cbd7827 [AMDGPU][CodeGen] Support (register + immediate) SMRD offsets.
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D129381
2022-07-18 11:29:31 +01:00
Ivan Kosarev 9c66c02e2e [AMDGPU][CodeGen] Match SMRDs with constant bases and register offsets.
Saves some add instructions on a couple Rage 2 shaders and is also a
prerequisite for a coming-soon change matching (register + immediate)
offsets.

Reviewed By: foad, arsenm

Differential Revision: https://reviews.llvm.org/D129095
2022-07-18 11:18:23 +01:00
Abinav Puthan Purayil d96361d714 [AMDGPU] Add the uses_dynamic_stack field to the kernel descriptor and the kernel metadata map
This change introduces the dynamic stack boolean field to code-object-v3
and above under the code properties of the kernel descriptor and under
the kernel metadata map of NT_AMDGPU_METADATA. This field corresponds to
the is_dynamic_callstack field of amd_kernel_code_t.

Differential Revision: https://reviews.llvm.org/D128344
2022-07-18 10:07:13 +05:30
Kazu Hirata 7094ab4ee7 [llvm] Modernize bool literals (NFC)
Identified with modernize-use-bool-literals.
2022-07-17 18:08:51 -07:00
Carl Ritson 547e3cba7d [AMDGPU] Improve liveness copying in si-optimize-exec-masking-pre-ra
Further improve liveness copying for CC register post optimization
by mirroring live internal splits.
The fixes a bug in register allocation when CC register liveness
is extended across a branches instead of split.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D129557
2022-07-17 17:34:05 +09:00
Kazu Hirata deac0ac523 [AMDGPU] Use default member initialization (NFC)
Identified with modernize-use-default-member-init.
2022-07-16 12:44:35 -07:00
Kazu Hirata 6cbfffb3a3 [AMDGPU] Declare TableRef in terms of ArrayRef (NFC) 2022-07-16 10:56:20 -07:00
Jon Chesterfield eda2bcad02 [nfc][amdgpu] Remove dead variable and function 2022-07-15 23:56:43 +01:00
Vang Thao 67357739c6 [AMDGPU] Add remarks to output some resource usage
Add analyis remarks to output kernel name, register usage, occupancy,
scratch usage, spills, and LDS information.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D123878
2022-07-15 11:01:53 -07:00
Dmitry Preobrazhensky 185c36de73 [AMDGPU][MC][NFC] Remove unnecessary code
Differential Revision: https://reviews.llvm.org/D129766
2022-07-15 13:17:36 +03:00
Dmitry Preobrazhensky 2a6532d542 [AMDGPU][MC][GFX11] Correct disassembly of *_e64_dpp opcodes which support op_sel
These opcodes cannot be disassembled because op_sel operand is missing - it must be added manually.
See https://github.com/llvm/llvm-project/issues/56512 for detailed issue analysis.

Differential Revision: https://reviews.llvm.org/D129637
2022-07-15 13:11:59 +03:00
jeff 8a12f20ef7 [AMDGPU] Update the mechanism used to check for cycles and add eges in power-sched mutation 2022-07-14 16:24:13 -07:00
Alexander Timofeev 2e29b0138c [AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.
Since the divergence-driven instruction selection has been enabled for AMDGPU,
 all the uniform instructions are expected to be selected to SALU form, except those not having one.
 VGPR to SGPR copies appear in MIR to connect values producers and consumers. This change implements an algorithm
 that evolves a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form
 and overhead introduced by the data transfer between the VGPRs and SGPRs.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D128252
2022-07-14 23:59:02 +02:00
Jay Foad e45aa230ad [AMDGPU] Update LiveVariables after killing an immediate def
D114999 added code to kill an immediate def if it was folded into its
only use by convertToThreeAddress. This patch updates LiveVariables when
that happens in order to fix verification failures exposed by D129213.

Differential Revision: https://reviews.llvm.org/D129661
2022-07-14 10:49:41 +01:00
David Green 3e0bf1c7a9 [CodeGen] Move instruction predicate verification to emitInstruction
D25618 added a method to verify the instruction predicates for an
emitted instruction, through verifyInstructionPredicates added into
<Target>MCCodeEmitter::encodeInstruction. This is a very useful idea,
but the implementation inside MCCodeEmitter made it only fire for object
files, not assembly which most of the llvm test suite uses.

This patch moves the code into the <Target>_MC::verifyInstructionPredicates
method, inside the InstrInfo.  The allows it to be called from other
places, such as in this patch where it is called from the
<Target>AsmPrinter::emitInstruction methods which should trigger for
both assembly and object files. It can also be called from other places
such as verifyInstruction, but that is not done here (it tends to catch
errors earlier, but in reality just shows all the mir tests that have
incorrect feature predicates). The interface was also simplified
slightly, moving computeAvailableFeatures into the function so that it
does not need to be called externally.

The ARM, AMDGPU (but not R600), AVR, Mips and X86 backends all currently
show errors in the test-suite, so have been disabled with FIXME
comments.

Recommitted with some fixes for the leftover MCII variables in release
builds.

Differential Revision: https://reviews.llvm.org/D129506
2022-07-14 09:33:28 +01:00
Jannik Silvanus e5c4cde451 [AMDGPU] SIMachineScheduler: Add support for several MachineScheduler features
The SI machine scheduler inherits from ScheduleDAGMI.
This patch adds support for a few features that are implemented
in ScheduleDAGMI (or its base classes) that were missing so far
because their support is implemented in overridden functions.

* Support cl::opt -view-misched-dags
  This option allows to open a graphical window of the scheduling DAG.

* Support cl::opt -misched-print-dags
  This option allows to print the scheduling DAG in text form.

* After constructing the scheduling DAG, call postprocessDAG()
  to apply any registered DAG mutations.
  Note that currently there are no mutations defined in AMDGPUTargetMachine.cpp
  in case SIScheduler is used.
  Still add this to avoid surprises in the future in case mutations are added.

Differential Revision: https://reviews.llvm.org/D128808
2022-07-14 09:45:31 +02:00
Kazu Hirata 611ffcf4e4 [llvm] Use value instead of getValue (NFC) 2022-07-13 23:11:56 -07:00
David Green 95252133e1 Revert "Move instruction predicate verification to emitInstruction"
This reverts commit e2fb8c0f4b as it does
not build for Release builds, and some buildbots are giving more warning
than I saw locally. Reverting to fix those issues.
2022-07-13 13:28:11 +01:00
David Green e2fb8c0f4b Move instruction predicate verification to emitInstruction
D25618 added a method to verify the instruction predicates for an
emitted instruction, through verifyInstructionPredicates added into
<Target>MCCodeEmitter::encodeInstruction. This is a very useful idea,
but the implementation inside MCCodeEmitter made it only fire for object
files, not assembly which most of the llvm test suite uses.

This patch moves the code into the <Target>_MC::verifyInstructionPredicates
method, inside the InstrInfo.  The allows it to be called from other
places, such as in this patch where it is called from the
<Target>AsmPrinter::emitInstruction methods which should trigger for
both assembly and object files. It can also be called from other places
such as verifyInstruction, but that is not done here (it tends to catch
errors earlier, but in reality just shows all the mir tests that have
incorrect feature predicates). The interface was also simplified
slightly, moving computeAvailableFeatures into the function so that it
does not need to be called externally.

The ARM, AMDGPU (but not R600), AVR, Mips and X86 backends all currently
show errors in the test-suite, so have been disabled with FIXME
comments.

Differential Revision: https://reviews.llvm.org/D129506
2022-07-13 12:53:32 +01:00
Jay Foad 5d41fe0768 [AMDGPU] SILowerControlFlow uses LiveIntervals
The availability of LiveIntervals affects kill flags in the output, so
declare the use to avoid strange effects where the output of this pass
is different depending on what other passes are scheduled after it.

Differential Revision: https://reviews.llvm.org/D129555
2022-07-12 16:53:53 +01:00
Piotr Sobczak 2bd8e74b94 [AMDGPU] Fix bitcast v4i64/v16i16
Fix a regression introduced in D128865.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D129375
2022-07-11 22:27:52 +02:00
NAKAMURA Takumi 393e12bddd R600ISelLowering.h: Silence a warning. [-Warray-parameter]
FIXME: Could it be rewritten with llvm::ArrayRef ?
2022-07-10 18:29:55 +09:00
David Blaikie 9008d0a38e Fix -Warray-parameter warning
Remove the bound in the definition, since it's not guaranteed/could
provide a false sense of security (I'd be inclined to go further and
change this to a pointer parameter, since that's what it really is - but
figured I'd preserve some of the author's intent here)
2022-07-09 17:04:01 +00:00
serge-sans-paille e1272ab6ec [AMDGPU][NFC] Harmonize decl&def of R600TargetLowering::OptimizeSwizzle
The freshly baked -Warray-parameter warning discovered an inconsistency in
argument declaration, use the stricter one.

This fixes build issues like https://lab.llvm.org/buildbot#builders/18/builds/5305
2022-07-09 09:07:31 +02:00
Abinav Puthan Purayil 17a81ecf85 [AMDGPU] Use the HasNoUse predicate for no-ret atomic op selection
This change replaces the C++ predicates with the HasNoUse builtin
predicate that would enable the no-ret atomic op selection in
GlobalISel.

Differential Revision: https://reviews.llvm.org/D125213
2022-07-08 09:47:33 +05:30
Abinav Puthan Purayil 7504c7a877 [AMDGPU] Use AddedComplexity for ret and noret atomic ops selection
This patch removes the predicate for return atomic ops and uses
AddedComplexity to distinguish its selection from its no return variant.
This will produce better matchers that doesn't unnecessarily check for
the negated predicate if the initial predicate failed. Also, it
simplifies the enabling of no return atomic ops selection in GlobalISel.

Differential Revision: https://reviews.llvm.org/D128241
2022-07-08 09:47:33 +05:30
Austin Kerbow 6817031d0b [AMDGPU] Disable FillMFMAShadowMutation by default
Disable amdgpu mfma power sched.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D129172
2022-07-07 09:34:45 -07:00
Shilei Tian 1023ddaf77 [LLVM] Add the support for fmax and fmin in atomicrmw instruction
This patch adds the support for `fmax` and `fmin` operations in `atomicrmw`
instruction. For now (at least in this patch), the instruction will be expanded
to CAS loop. There are already a couple of targets supporting the feature. I'll
create another patch(es) to enable them accordingly.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D127041
2022-07-06 10:57:53 -04:00
Thomas Symalla 86bd7e2065 [NFC][AMDGPU] Cleanup the SIOptimizeExecMasking pass.
This patch removes a bit of code duplication and
moves the v_cmpx optimization out of the
runOnMachineFunction pass.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D129086
2022-07-06 11:03:03 +02:00
Carl Ritson 8bc5e7ac51 [AMDGPU] Additional liveness tests for si-optimize-exec-masking-pre-ra
Merge tests and fixes from D128110 and D128315 on top of already
committed D128800.

Original author: arsenm

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D128882
2022-07-06 15:05:32 +09:00
Jay Foad 4dbc2876cf [AMDGPU] GFX11 trivial NFC tweaks
A few miscellaneous comment, whitespace and indentation tweaks.
2022-07-05 17:20:17 +01:00
Jay Foad 12fd00ee17 [AMDGPU] Add patterns for GFX11 v_minmax and v_maxmin instructions
Differential Revision: https://reviews.llvm.org/D128445
2022-07-05 16:07:47 +01:00
Joe Nash 0483c91eee [AMDGPU] gfx11 CodeGen for new DPP instructions
Modifies the GCNDPPCombine pass to enable DPP formation for the new DPP
instruction in gfx11, namely VOP3 encoded instructions with DPP and VOPC
with DPP.

Depends on D128656

Reviewed By: #amdgpu, rampitec

Differential Revision: https://reviews.llvm.org/D128682
2022-07-05 10:17:59 -04:00
Joe Nash d1af09ad96 [AMDGPU] gfx11 Generate VOPD Instructions
We form VOPD  instructions in the GCNCreateVOPD pass by combining
back-to-back component instructions. There are strict register
constraints for creating a legal VOPD, namely that the matching operands
(e.g. src0x and src0y, src1x and src1y) must be in different register
banks. We add a PostRA scheduler
mutation to put possible VOPD components back-to-back.

Depends on D128442, D128270

Reviewed By: #amdgpu, rampitec

Differential Revision: https://reviews.llvm.org/D128656
2022-07-05 09:18:19 -04:00
Ivan Kosarev 4696a33dfa [AMDGPU][NFC] Refine matching SMRD offsets.
Tell the matcher what we are looking for instead of matching everything
and then discarding the result if doesn't fit.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D128171
2022-07-05 14:07:22 +01:00
Ivan Kosarev 8cd79bc12c [AMDGPU][GlobalISel] Support register offsets for SMRDs.
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D128836
2022-07-05 13:41:06 +01:00
Thomas Symalla 04c5fed5e0 [NFC] Fix wrong comment. 2022-07-05 13:37:44 +02:00
Nikita Popov 8e70258b18 [AMDGPUCodeGenPrepare] Check result of ConstantFoldBinaryOpOperands()
This function will become fallible once we don't support constant
expressions for all binops, so make sure to check the result.
2022-07-04 14:20:23 +02:00
Mirko Brkusanin 2208342c9b [AMDGPU][GlobalISel] Always use VGPR bank for G_FCMP
Differential Revision: https://reviews.llvm.org/D128980
2022-07-01 15:03:37 +02:00
Piotr Sobczak b6ef36a1c4 [AMDGPU] Update WMMA intrinsics with explicit f16 types
Update intrinsics to use n x f16 and n x i16 instead
of 32-bit types. This may avoid the need for a bitcast
and is probably less confusing.

Depends on making v16f16 and v16i16 types legal.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D128951
2022-07-01 08:55:25 +02:00
Piotr Sobczak bd675af2a2 [AMDGPU] Make v16i16/v16f16 legal
There are upcoming intrinsics to use the new types.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D128865
2022-06-30 23:08:40 +02:00
Jay Foad 0f94d2b385 [AMDGPU] GFX11: automatically release VGPRs at the end of the shader
GFX11 has a new message type MSG_DEALLOC_VGPRS which can be used to
release a shader's VGPRs. Sending this at the end of a shader (just
before the s_endpgm) can help overall system performance in cases where
the s_endpgm would have to wait for outstanding VMEM stores to complete
before releasing the VGPRs.

Differential Revision: https://reviews.llvm.org/D128442
2022-06-30 20:55:14 +01:00
Piotr Sobczak 4874838a63 [AMDGPU] gfx11 WMMA instruction support
gfx11 introduces new WMMA (Wave Matrix Multiply-accumulate)
instructions.

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D128756
2022-06-30 11:13:45 -04:00
Carl Ritson d0f6641615 [AMDGPU] Fix liveness for loops in si-optimize-exec-masking-pre-ra
Follow up to D127894, new liveness update code needs to handle
the case where S_ANDN2 input must be extended through loops when
V_CNDMASK_B32 has been hoisted.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D128800
2022-06-30 15:26:50 +09:00
Jay Foad cfb7ffdec0 [AMDGPU] New AMDGPUInsertDelayAlu pass
Differential Revision: https://reviews.llvm.org/D128270
2022-06-29 21:30:20 +01:00
Matt Arsenault 0bdaef38c9 AMDGPU: Add gfx11 feature to force initializing 16 input SGPRs
The total user+system SGPR count needs to be padded out to 16 if fewer
inputs are enabled.
2022-06-29 14:52:19 -04:00
Matt Arsenault ffd6aaf5b6 AMDGPU: Make packed 32-bit instructions rematerializable 2022-06-29 11:57:54 -04:00
Matt Arsenault 4c400dc103 AMDGPU: Make 16-bit pk instructions rematerializable 2022-06-29 11:57:53 -04:00
Matt Arsenault da6d7728d4 AMDGPU: Mark more instructions as rematerializable
D106023 excluded 16-bit instructions from rematerialization, with the
justification that we can't rematerialize instructions that preserve
the high bits (plus the instructions which do are a confusing mess
between different subtargets). This doesn't make sense to me as a
problem since cases where we would rely on the high bit behavior would
still need to be represented as a register value constraint with a
tied operand. It's not a hidden side effect and should still be
rematerializable.
2022-06-29 11:19:15 -04:00
Matt Arsenault d342d130da AMDGPU: Use isMeta flags on pseudoinstructions 2022-06-29 10:31:29 -04:00
Stanislav Mekhanoshin 21895c6b50 [AMDGPU] Relax verification of soffset in scalar stores
It must use m0 only on GFX8. Later chips can use ang SGPR.

Differential Revision: https://reviews.llvm.org/D128765
2022-06-28 16:10:08 -07:00
Jay Foad 3fbc945c3a [AMDGPU] llvm.amdgcn.exp.compr is not supported on GFX11
Differential Revision: https://reviews.llvm.org/D128259
2022-06-28 14:48:25 +01:00
Joe Nash f1cfaa956d [AMDGPU] Use GFX11 S_PACK_HL instruction in more cases
Differential Revision: https://reviews.llvm.org/D128527
2022-06-28 14:35:19 +01:00
Jay Foad b5818e4eb4 [AMDGPU] Cluster stores as well as loads for GFX11
Differential Revision: https://reviews.llvm.org/D128517
2022-06-27 16:41:41 +01:00
Jay Foad 77e63b25f9 [AMDGPU] Fix assertion failure on mad with negative immediate addend
Without this, the new test case would fail with:

AMDGPUInstPrinter.cpp:545: void llvm::AMDGPUInstPrinter::printImmediate64(uint64_t, const llvm::MCSubtargetInfo &, llvm::raw_ostream &): Assertion `isUInt<32>(Imm) || Imm == 0x3fc45f306dc9c882' failed.

Differential Revision: https://reviews.llvm.org/D128435
2022-06-27 09:49:20 +01:00
Kazu Hirata a7938c74f1 [llvm] Don't use Optional::hasValue (NFC)
This patch replaces Optional::hasValue with the implicit cast to bool
in conditionals only.
2022-06-25 21:42:52 -07:00
Kazu Hirata 3b7c3a654c Revert "Don't use Optional::hasValue (NFC)"
This reverts commit aa8feeefd3.
2022-06-25 11:56:50 -07:00
Kazu Hirata aa8feeefd3 Don't use Optional::hasValue (NFC) 2022-06-25 11:55:57 -07:00
Min-Yih Hsu 97579dcc6d [MCA] Introducing incremental SourceMgr and resumable pipeline
The new resumable mca::Pipeline capability introduced in this patch
allows users to save the current state of pipeline and resume from the
very checkpoint.
It is better (but not require) to use with the new IncrementalSourceMgr,
where users can add mca::Instruction incrementally rather than having a
fixed number of instructions ahead-of-time.

Note that we're using unit tests to test these new features. Because
integrating them into the `llvm-mca` tool will make too many churns.

Differential Revision: https://reviews.llvm.org/D127083
2022-06-24 15:39:51 -07:00
Joe Nash 07b7fada73 [AMDGPU] gfx11 VOPD instructions MC support
VOPD is a new encoding for dual-issue instructions for use in wave32.
This patch includes MC layer support only.

A VOPD instruction is constituted of an X component (for which there are
13 possible opcodes) and a Y component (for which there are the 13 X
opcodes plus 3 more). Most of the complexity in defining and parsing
a VOPD operation arises from the possible different total numbers of
operands and deferred parsing of certain operands depending on the
constituent X and Y opcodes.

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D128218
2022-06-24 11:08:39 -04:00
Konstantin Zhuravlyov 7736ce1c56 AMDGPU: Clear kill flags when optimizing vcmp save exec sequence
It was causing bad machine code for several blender scenes:
  *** Bad machine code: Using an undefined physical register ***
  - function:    kernel_holdout_emission_blurring_pathtermination_ao
  - basic block: %bb.28 if.end40.i (0x7f84861a2320)
  - instruction: V_CMPX_EQ_U32_nosdst_e64 0, $vgpr3, implicit-def $exec, implicit $exec
  - operand 1:   $vgpr3

Differential Revision: https://reviews.llvm.org/D127768
2022-06-24 11:30:22 -04:00
Joe Nash ae72fee74e [AMDGPU] gfx11 Select on Buffer Atomic FAdd Rtn type
Reviewed By: #amdgpu, foad, rampitec

Differential Revision: https://reviews.llvm.org/D128205
2022-06-23 11:05:32 -04:00
Baptiste Saleil 79e77a9f39 [AMDGPU] Flush the vmcnt counter in loop preheaders when necessary
waitcnt vmcnt instructions are currently generated in loop bodies before using
values loaded outside of the loop. In some cases, it is better to flush the
vmcnt counter in a loop preheader before entering the loop body. This patch
detects these cases and generates waitcnt instructions to flush the counter.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D115747
2022-06-23 10:53:21 -04:00
Rodrigo Dominguez 971fa4b196 [AMDGPU] GFX11: remove ShaderType from ds_ordered_count offset field
In GFX11 ShaderType is determined by the hardware and should no longer
be written into bits[3:2] of the ds_ordered_count offset field.

Differential Revision: https://reviews.llvm.org/D128196
2022-06-23 14:20:33 +01:00
Ruiling Song 49b8ca3f7c AMDGPU: Don't crash on global_ctor/dtor declaration
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D128320
2022-06-23 21:04:54 +08:00
Dmitry Preobrazhensky dcb24f93af [AMDGPU][MC][GFX11] Correct disassembly of VOP3.DPP8 opcodes
Fix bug #56163.
Add W32/W64 tests for all VOP3.DPP opcodes.

Differential Revision: https://reviews.llvm.org/D128369
2022-06-23 13:07:45 +03:00
Matt Arsenault b03d902b61 AMDGPU: Fix invalid liveness after si-optimize-exec-masking-pre-ra
This was leaving behind a use at the deleted instruction which the
verifier would fail during allocation.
2022-06-22 20:49:03 -04:00
serge-sans-paille 27fd01d3f8 [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since fb67d683db detected a few
regressions, fixing them.

The impact on preprocessed output is negligible: -4k lines.
2022-06-22 18:50:39 +02:00
Guillaume Chatelet cef65864af [Alignment] Use Align for MaxKernArgAlign
Differential Revision: https://reviews.llvm.org/D128118
2022-06-22 13:40:37 +00:00
Ruiling Song 4dcb42fae5 AMDGPU: Skip unexpected CFG in SIOptimizeVGPRLiveRange
There are some cases that we use si_if/si_else in unatural way.
Just skip them.

Fixes: https://github.com/llvm/llvm-project/issues/55922

Reviewed by: critson

Differential Revision: https://reviews.llvm.org/D128193
2022-06-22 12:49:41 +08:00
Joe Nash 90254d524f [AMDGPU] gfx11 Remove SDWA from shuffle_vector ISel
gfx11 does not have SDWA

Reviewed By: #amdgpu, rampitec

Differential Revision: https://reviews.llvm.org/D128208
2022-06-21 14:55:00 -04:00
Jay Foad 929a8ad2b6 [AMDGPU] Update SPI_SHADER_PGM_RSRC2_PS.EXTRA_LDS_SIZE for GFX11
The granularity of SPI_SHADER_PGM_RSRC2_PS.EXTRA_LDS_SIZE changed
in GFX11. It is now in units of 256 dwords instead of 128 dwords.

COMPUTE_PGM_RSRC2.LDS_SIZE is unaffected. It is still in units of
128 dwords.

Differential Revision: https://reviews.llvm.org/D128179
2022-06-21 14:48:12 +01:00
Carl Ritson 62abc8c200 [AMDGPU] Set GFX11 null export target based on export attributes
If shader only has depth exports use MRTZ otherwise use MRT0.

Differential Revision: https://reviews.llvm.org/D128185
2022-06-21 09:40:31 +01:00
Kazu Hirata 7a47ee51a1 [llvm] Don't use Optional::getValue (NFC) 2022-06-20 22:45:45 -07:00
Kazu Hirata 064a08cd95 Don't use Optional::hasValue (NFC) 2022-06-20 20:05:16 -07:00
Ruiling Song 732eed40fd [AMDGPU] Mark GFX11 dual source blend export as strict-wqm
The instructions that generate the source of dual source blend export
should run in strict-wqm. That is if any lane in a quad is active,
we need to enable all four lanes of that quad to make the shuffling
operation before exporting to dual source blend target work correctly.

Differential Revision: https://reviews.llvm.org/D127981
2022-06-20 21:58:12 +01:00
Piotr Sobczak 29621c13ef [AMDGPU] Tag GFX11 LDS loads as using strict_wqm
LDS_PARAM_LOAD and LDS_DIRECT_LOAD use EXEC per quad
(if any pixel is enabled in the quad, data is written
to all 4 pixels/threads in the quad).

Tag LDS_PARAM_LOAD and LDS_DIRECT_LOAD as using strict_wqm
to enforce this and avoid lane clobbering issues.
Note that only the instruction itself is tagged.
The implicit uses of these do not need to be set WQM.
The reduces unnecessary WQM calculation of M0.

Differential Revision: https://reviews.llvm.org/D127977
2022-06-20 21:58:12 +01:00
Jay Foad 13107c2770 [AMDGPU] Add support for GFX11 LDSDIR hazards
Detect LDS direct WAR/WAW hazards and compute values for
wait_vdst (va_vdst) parameter.  Where appropriate this
raises wait_vdst from the default 0 to allow concurrent
issue of LDS direct with VALU execution.

Also detect LDS direct versus VMEM source VGPR hazards
and insert vm_vsrc=0 waits using s_waitcnt_depctr.

Differential Revision: https://reviews.llvm.org/D127963
2022-06-20 21:58:12 +01:00
Guillaume Chatelet d154d0ac06 [NFC] Simplify code 2022-06-20 15:15:52 +00:00
Jay Foad ba306216d2 [AMDGPU] Reorder cases. NFC. 2022-06-20 14:30:17 +01:00
Jay Foad d7762a3b36 [AMDGPU] Increase instruction cache line size to 128 bytes for GFX11
Differential Revision: https://reviews.llvm.org/D128189
2022-06-20 14:25:10 +01:00
Jay Foad b8e32e808d [AMDGPU] Remove a duplicate atomic fadd pattern
This was left over after D124538.
2022-06-20 14:08:57 +01:00
Dmitry Preobrazhensky 485e8b4f63 [AMDGPU][MC][GFX11] Correct disassembly of DPP variants of VOPC64 opcodes
Fix bugs https://github.com/llvm/llvm-project/issues/56091, https://github.com/llvm/llvm-project/issues/56065.

Differential Revision: https://reviews.llvm.org/D128075
2022-06-20 14:23:07 +03:00
Mirko Brkusanin 6cae753bf4 [AMDGPU][GlobalISel] Legalize G_FSUB for s16
Differential Revision: https://reviews.llvm.org/D128066
2022-06-20 12:25:49 +02:00
Guillaume Chatelet f1255186c7 [NFC][Alignment] Remove max functions between Align and MaybeAlign
`llvm::max(Align, MaybeAlign)` and `llvm::max(MaybeAlign, Align)` are
not used often enough to be required. They also make the code more opaque.

Differential Revision: https://reviews.llvm.org/D128121
2022-06-20 08:37:48 +00:00
Jay Foad 7050d5b98c [AMDGPU] Limit GFX11 to using 128 VGPRs
This is a temporary measure to avoid generating incorrect code until the
compiler understands the new way that GFX11 encodes 16-bit operands in
VOP instructions.

Differential Revision: https://reviews.llvm.org/D128054
2022-06-20 07:58:27 +01:00
Kazu Hirata 129b531c9c [llvm] Use value_or instead of getValueOr (NFC) 2022-06-18 23:07:11 -07:00
Kazu Hirata 437f960062 [llvm] Call *set::insert without checking membership first (NFC) 2022-06-18 10:22:05 -07:00
Kazu Hirata 4271a1ff33 [llvm] Call *set::insert without checking membership first (NFC) 2022-06-18 10:17:22 -07:00
Joe Nash 2a68364745 [AMDGPU] gfx11 waitcnt support for VINTERP and LDSDIR instructions
Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D127781
2022-06-17 09:30:37 -04:00
Joe Nash 20d20156f4 [AMDGPU] gfx11 VINTERP intrinsics and ISel support
Depends on D127664

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D127756
2022-06-17 09:16:59 -04:00
Joe Nash 6d5d8b1313 [AMDGPU] gfx11 ldsdir intrinsics and ISel
Reviewed By: #amdgpu, rampitec

Differential Revision: https://reviews.llvm.org/D127664
2022-06-17 09:03:16 -04:00
LiaoChunyu 6181c19283 [AMDGPU][NFC] Remove isConstantAddr
fix isConstantAddr defined but not used

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D127959
2022-06-17 08:49:29 +08:00
Joe Nash 2d43de13df [AMDGPU] gfx11 new dot instruction codegen support
Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D127904
2022-06-16 14:19:34 -04:00
Jay Foad 7e681ef35e [AMDGPU] Add GFX11 codegen for llvm.amdgcn.mov.dpp8
Differential Revision: https://reviews.llvm.org/D127980
2022-06-16 19:44:28 +01:00
Jay Foad 36ec1fcaac [AMDGPU] Add GFX11 llvm.amdgcn.ds.add.gs.reg.rtn / llvm.amdgcn.ds.sub.gs.reg.rtn intrinsics
Differential Revision: https://reviews.llvm.org/D127955
2022-06-16 18:23:14 +01:00
Jay Foad c155a944fb [AMDGPU] GFX11 CodeGen support for MIMG instructions
This includes:
- New llvm.amdgcn.image.msaa.load.* intrinsics
- NSA changes, because MIMG-NSA is now limited to 3 dwords
- Split CD forms of IMAGE_SAMPLE instructions out into separate
  test files since they are no longer supported in GFX11

Differential Revision: https://reviews.llvm.org/D127837
2022-06-16 18:23:14 +01:00
Jay Foad 445a483b41 [AMDGPU] Add new GFX11 intrinsic llvm.amdgcn.exp.row
Differential Revision: https://reviews.llvm.org/D127671
2022-06-16 18:23:14 +01:00
Dmitry Preobrazhensky b26afab9d1 [AMDGPU][MC][GFX11] Correct src0 for dpp variants of v_cvt_*_e64
Differential Revision: https://reviews.llvm.org/D127847
2022-06-16 13:48:43 +03:00
David Stuttard 77851cc1cf [AMDGPU] Change use null for dead sdst to be gfx1030+
Pre gfx1030 null for sdst is different.
c97436f8b6 [AMDGPU] Use null for dead sdst operand - requires a change to make
it not apply to pre gfx1030

Differential Revision: https://reviews.llvm.org/D127869
2022-06-16 10:39:06 +01:00
Jay Foad 9dff14be9e [AMDGPU] Add support for GFX11 hazards
Add support for partial stall over EXEC hazard and trans use hazard.

Differential Revision: https://reviews.llvm.org/D127872
2022-06-16 08:15:21 +01:00
Austin Kerbow 4bba82116a [AMDGPU] Fix buildbot failures after 48ebc1af29
Some buildbots (lto, windows) were failing due to some function reference
variables being improperly initialized.
2022-06-15 00:23:30 -07:00
Austin Kerbow 48ebc1af29 [AMDGPU] Add more expressive sched_barrier controls
The sched_barrier builtin allow the scheduler's behavior to be shaped by users
when very specific codegen is needed in order to create highly optimized code.
This patch adds more granular control over the types of instructions that are
allowed to be reordered with respect to one or multiple sched_barriers. A mask
is used to specify groups of instructions that should be allowed to be scheduled
around a sched_barrier. The details about this mask may be used can be found in
llvm/include/llvm/IR/IntrinsicsAMDGPU.td.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D127123
2022-06-14 22:03:05 -07:00
Austin Kerbow bd9eed3aec [AMDGPU] Add isMFMA helper function. NFC
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D127124
2022-06-14 22:01:49 -07:00
Joe Nash 989bd57f98 [AMDGPU] gfx11 support add_f16
The instruction was skipped in the earlier large patch adding
VOP2, https://reviews.llvm.org/D126917.

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D127697
2022-06-14 08:59:45 -04:00
Dmitry Preobrazhensky 365d827f65 [AMDGPU][MC][GFX11] Correct ds_swizzle_b32
Enable offset parsing.

Differential Revision: https://reviews.llvm.org/D127404
2022-06-14 12:58:03 +03:00
Stanislav Mekhanoshin c97436f8b6 [AMDGPU] Use null for dead sdst operand
Differential Revision: https://reviews.llvm.org/D127542
2022-06-13 14:41:40 -07:00
Stanislav Mekhanoshin cb9ae93712 [AMDGPU] Define SGPR_NULL64 register. NFCI.
On gfx10+ null register can be used as both 32 and 64 bit operand.
Define a 64 bit version of the register to use during codegen.

Differential Revision: https://reviews.llvm.org/D127527
2022-06-13 13:23:33 -07:00
Jay Foad bfcfd53b92 [AMDGPU] Add GFX11 llvm.amdgcn.permlane64 intrinsic
Compared to permlane16, permlane64 has no BC input because it has no
boundary conditions, no fi input because the instruction acts as if FI
were always enabled, and no OLD input because it always writes to every
active lane.

Also use the new intrinsic in the atomic optimizer pass.

Differential Revision: https://reviews.llvm.org/D127662
2022-06-13 21:12:11 +01:00
Jay Foad 7b9f620e78 [AMDGPU] Work around GFX11 flat scratch SVS swizzling bug
Differential Revision: https://reviews.llvm.org/D127635
2022-06-13 21:00:42 +01:00
Jay Foad d943c51465 [AMDGPU] Fix GFX11 codegen for V_MAD_U64_U32 and V_MAD_I64_I32
GFX11 uses different pseudos for these because of a new constraint
on which operands' registers can overlap.

Differential Revision: https://reviews.llvm.org/D127659
2022-06-13 20:59:18 +01:00
Stanislav Mekhanoshin 0f81830632 [AMDGPU] Make temp vgpr selection stable in indirectCopyToAGPR
This uses rotating reminder of division by 3 to select another
temp vgpr each next time in a sequence of several agpr copies.
Therefore, temp vgpr selection depends on the generated agpr
number. This number could change with any unrelated change to
the register definitions.

Stabilize the selection by using a real agpr number.

Differential Revision: https://reviews.llvm.org/D127524
2022-06-13 09:39:46 -07:00
Fangrui Song adf4142f76 [MC] De-capitalize SwitchSection. NFC
Add SwitchSection to return switchSection. The API will be removed soon.
2022-06-10 22:50:55 -07:00
Jay Foad ff85d61a6e Update *_TMPRING_SIZE.WAVESIZE for GFX11
The encoding of COMPUTE_TMPRING_SIZE.WAVESIZE and
SPI_TMPRING_SIZE.WAVESIZE has changed in GFX11: it is now in units
of 64 dwords instead of 256 dwords, and the field has been widened
from 13 bits to 15 bits.

Depends on D126989

Reviewed By: rampitec, arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D127248
2022-06-10 13:24:00 -04:00
Joe Nash ea3c9a87d3 [AMDGPU] gfx11 add bits to COMPUTE_PGM_RSRC3
Contributors:
Konstantin Zhuravlyov <kzhuravl_dev@outlook.com>

Patch 21/N for upstreaming of AMDGPU gfx11 architecture

Depends on D127143

Reviewed By: rampitec, #amdgpu, kzhuravl

Differential Revision: https://reviews.llvm.org/D127241
2022-06-10 13:07:14 -04:00
Joe Nash 78d8fdb88b [AMDGPU] NFC. Comment change to GFX10+ in AsmParser 2022-06-10 12:34:07 -04:00
Joe Nash 9175ab7746 [AMDGPU] gfx11 SRC_POPS_EXISTING_WAVE_ID is removed 2022-06-10 12:32:22 -04:00
Joe Nash fd3304ef85 [AMDGPU] gfx11 EXECZ and VCCZ are no longer allowed to be used as
sources to SALU and VALU instructions.

Contributors:
Baptiste Saleil <baptiste.saleil@amd.com>

Patch 20/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126989

Reviewed By: rampitec, foad, #amdgpu

Differential Revision: https://reviews.llvm.org/D127143
2022-06-10 10:03:43 -04:00
Jay Foad 4b2d70fa5b [AMDGPU] Basic implementation of isExtractSubvectorCheap
Add a basic implementation of isExtractSubvectorCheap that only
considers extracts at offset 0.

Differential Revision: https://reviews.llvm.org/D127385
2022-06-10 14:43:07 +01:00
Ivan Kosarev 60d6fbb621 [AMDGPU][GFX9][GFX10] Support base+soffset+offset SMEM atomics.
Resolves a part of
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D127314
2022-06-10 13:22:41 +01:00
Dmitry Preobrazhensky f8aba9995a [AMDGPU][MC][GFX1013] Enable image_msaa_load
Differential Revision: https://reviews.llvm.org/D127198
2022-06-10 13:42:05 +03:00
Jay Foad 6c372daa84 [AMDGPU] New GFX11 intrinsic llvm.amdgcn.s.sendmsg.rtn
Add new intrinsic and codegen support for the s_sendmsg_rtn_b32 and
s_sendmsg_rtn_b64 instructions.

Differential Revision: https://reviews.llvm.org/D127315
2022-06-10 08:15:23 +01:00
Jay Foad b0a3849439 [AMDGPU] Update dlc usage for GFX11
In GFX10 dlc controlled L1 cache bypass. In GFX11 it has been repurposed
to control MALL NOALLOC, and glc controls L1 as well as L0 cache bypass.

Update the documentation and SIMemoryLegalizer accordingly. Set dlc for
nontemporal and volatile accesses.

Differential Revision: https://reviews.llvm.org/D127405
2022-06-10 08:10:34 +01:00
Jay Foad ffe86e3bdd [AMDGPU] Update SIInsertHardClauses for GFX11
Changes for GFX11:
- Clauses may not mix instructions of different types, and there are
  more types. For example image instructions with and without a sampler
  are now different types.
- The max size of a clause is explicitly documented as 63 instructions.
  Previously it was implicitly assumed to be 64. This is such a tiny
  difference that it does not seem worth making it conditional on the
  subtarget.
- It can be beneficial to clause stores as well as loads.

Differential Revision: https://reviews.llvm.org/D127391
2022-06-09 21:29:56 +01:00
Joe Nash be1082c6d5 [AMDGPU] gfx11 VOPC instructions
Supports encoding existing instrutions on gfx11 and MC support for the new VOPC
dpp instructions.

Patch 19/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126978

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126989
2022-06-09 15:22:42 -04:00
Stanislav Mekhanoshin 23db8e4b43 [AMDGPU] Use v_mad_u64_u32 for IMAD32
Nic Curtis done the experiments to prove it is faster than a
separate mul and add.

Fixes: SWDEV-332806

Differential Revision: https://reviews.llvm.org/D127253
2022-06-09 11:39:49 -07:00
Stanislav Mekhanoshin 5c974d086c [AMDGPU] Fix hazard handling of v_cmpx to permlane
- VOP3 and SDWA forms of V_CMPX were not handled
- Hazard only exists if the compare defines EXEC (i.e. V_CMPX)
  forwarded to the permlane.

Differential Revision: https://reviews.llvm.org/D127344
2022-06-09 10:33:54 -07:00
Simon Moll b8c2781ff6 [NFC] format InstructionSimplify & lowerCaseFunctionNames
Clang-format InstructionSimplify and convert all "FunctionName"s to
"functionName".  This patch does touch a lot of files but gets done with
the cleanup of InstructionSimplify in one commit.

This is the alternative to the less invasive clang-format only patch: D126783

Reviewed By: spatel, rengolin

Differential Revision: https://reviews.llvm.org/D126889
2022-06-09 16:10:08 +02:00
Benjamin Kramer 0abb472fff AMDGPU/GISel: Remove unused variable. NFC. 2022-06-09 13:43:47 +02:00
Nicolai Hähnle 264d1136f9 AMDGPU/GISel: Introduce custom legalization of G_MUL
The generic legalizer framework is still used to reduce the problem
to scalar multiplication with the bit size a multiple of 32.

Generating optimal code sequences for big integer multiplication is
somewhat tricky and has a number of target-specific intricacies:

- The target has V_MAD_U64_U32 instructions that multiply two 32-bit
  factors and add a 64-bit accumulator. Most partial products should
  use this instruction.
- The accumulator is mapped to consecutive 32-bit GPRs, and partial-
  product multiply-adds can feed the accumulator into each other
  directly. (The register allocator's support for that is somewhat
  limited, but that only matters for 128-bit integers and larger.)
- OTOH, on some hardware, V_MAD_U64_U32 requires the accumulator
  to be stored in an even-aligned pair of GPRs. To avoid excessive
  register copies, it makes sense to compute odd partial products
  separately from even partial products (where a partial product
  src0[j0] * src1[j1] is "odd" if j0 + j1 is odd) and add both
  halves together as a final step.
- We can combine G_MUL+G_ADD into a single cascade of multiply-adds.
- The target can keep many carry-bits in flight simultaneously, so
  combining carries using G_UADDE is preferable over G_ZEXT + G_ADD.
- Not addressed by this patch: When the factors are sign-extended,
  the V_MAD_I64_I32 instruction (signed version!) can be used.

It is difficult to address these points generically:

1) Finding matching pairs of G_MUL and G_UMULH to find a wide
multiply is expensive. We could add a G_UMUL_LOHI generic instruction
and conditionally use that in the generic legalizer, but by itself
this wouldn't allow us to use the accumulation capability of
V_MAD_U64_U32. One could attempt to find matching G_ADD + G_UADDE
post-legalization, but this is also expensive.

2) Similarly, making sense of the legalization outcome of a wide
pre-legalization G_MUL+G_ADD pair is extremely expensive.

3) How could the generic legalizer possibly deal with the
particular idiosyncracy of "odd" vs. "even" partial products.

All this points in the direction of directly emitting an ideal code
sequence during legalization, but the generic legalizer should not
be burdened with such overly target-specific concerns. Hence, a
custom legalization.

Note that the implemented approach is different from that used by
SelectionDAG because narrowing of scalars works differently in
general. SelectionDAG iteratively cuts wide scalars into low and
high halves until a legal size is reached. By contrast, GlobalISel
does the narrowing in a single shot, which should be better for
compile-time and for the quality of the generated code.

This patch leaves three gaps open:

1. When the factors are uniform, we should execute the multiplication on
   the SALU. Register bank mapping already ensures this.

   However, the resulting code sequence is not optimal because it doesn't
   fully use the carry-in capabilities of S_ADDC_U32. (V_MAD_U64_U32
   doesn't have a carry-in.) It is very difficult to fix this after the
   fact, so we should really use a different legalization sequence in
   this case. Unfortunately, we don't have a divergence analysis and so
   cannot make that choice.

   (This only matters for 128-bit integers and larger.)

2. Avoid unnecessary multiplies when sources are known to be zero- or
   sign-extended. The challenge is that the legalizer does not currently
   have access to GISelKnownBits.

3. When the G_MUL is followed by a G_ADD, we should consider combining
   the two instructions into a single multiply-add sequence, to utilize
   the accumulator of V_MAD_U64_U32 fully. (Unless the multiply has
   multiple uses and the implied duplication of the multiply is an
   overall negative). However, this is also not true when the factors
   are uniform: in that case, it is generally better to *not* combine
   the two operations, so that the multiply can be done on the SALU.

   Again, we don't have a divergence analysis available and so cannot
   make an informed choice.

Differential Revision: https://reviews.llvm.org/D124844
2022-06-09 13:38:56 +02:00
Joe Nash 40f35cef89 [AMDGPU] gfx11 VOP3P instruction MC support
Includes dpp versions of VOP3P instructions.

Patch 18/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126917

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126978
2022-06-08 13:32:01 -04:00
Joe Nash 086a9c1062 Reland [AMDGPU] gfx11 VOP1+VOP2 Instruction MC support
The reverted dependent commit is now relanded, so reland this.
Includes dpp instructions and vop1/vop2 promoted to vop3

Patch 17/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126483

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126917
2022-06-08 11:10:57 -04:00
Joe Nash e243ead6fc Reland [AMDGPU] gfx11 vop3dpp instructions
There was an issue with encoding wide (>64 bit) instructions on
BigEndian hosts, which is fixed in D127195. Therefore reland this.

gfx11 adds the ability to use dpp modifiers on vop3 instructions.
This patch adds machine code layer support for that. The MCCodeEmitter
is changed to use APInt instead of uint64_t to support these wider
instructions.

Patch 16/N for upstreaming of AMDGPU gfx11 architecture

Differential Revision: https://reviews.llvm.org/D126483
2022-06-07 14:49:13 -04:00
Jay Foad 81edc831fb [AMDGPU] Add support for the .reloc directive
Differential Revision: https://reviews.llvm.org/D127117
2022-06-07 15:18:54 +01:00
Matt Arsenault cc5a1b3dd9 llvm-reduce: Add cloning of target MachineFunctionInfo
MIR support is totally unusable for AMDGPU without this, since the set
of reserved registers is set from fields here.

Add a clone method to MachineFunctionInfo. This is a subtle variant of
the copy constructor that is required if there are any MIR constructs
that use pointers. Specifically, at minimum fields that reference
MachineBasicBlocks or the MachineFunction need to be adjusted to the
values in the new function.
2022-06-07 10:14:48 -04:00
Matt Arsenault cfe5168499 AMDGPU: Make PSV instances static members 2022-06-07 10:14:48 -04:00
Guillaume Chatelet 0788186182 [Alignment][NFC] Remove usage of MemSDNode::getAlignment
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.

Differential Revision: https://reviews.llvm.org/D126910
2022-06-07 13:52:20 +00:00
Fangrui Song 15d82c62dc [MC] De-capitalize MCStreamer functions
Follow-up to c031378ce0 .
The class is mostly consistent now.
2022-06-07 00:31:02 -07:00
Joe Nash eaed07eb7e Revert "[AMDGPU] gfx11 vop3dpp instructions"
This reverts commit 99a83b1286.
2022-06-06 17:12:09 -04:00
Joe Nash f617f89e5b Revert "[AMDGPU] gfx11 VOP1+VOP2 Instruction MC support"
This reverts commit 6079804498.
2022-06-06 17:11:35 -04:00
Ivan Kosarev facbfb121a [AMDGPU][GFX9+] Support base+soffset+offset s_atc_probe's.
Resolves part of
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126791
2022-06-06 16:46:22 +01:00
Ivan Kosarev 79ec1e8fd6 [AMDGPU][GFX9][GFX10] Support base+soffset+offset s_dcache_discard's.
Resolves part of
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126766
2022-06-06 16:32:16 +01:00
Joe Nash 6079804498 [AMDGPU] gfx11 VOP1+VOP2 Instruction MC support
Includes dpp instructions and vop1/vop2 promoted to vop3

Patch 17/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126483

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126917
2022-06-06 09:57:59 -04:00
Joe Nash 99a83b1286 [AMDGPU] gfx11 vop3dpp instructions
gfx11 adds the ability to use dpp modifiers on vop3 instructions.
This patch adds machine code layer support for that. The MCCodeEmitter
is changed to use APInt instead of uint64_t to support these wider
instructions.

Patch 16/N for upstreaming of AMDGPU gfx11 architecture

Depends on D126475

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126483
2022-06-06 09:34:59 -04:00
Fangrui Song 77e300ffdf [MC] Change EndOfStatement "unexpected tokens in .xxx directive " to "expected newline" 2022-06-05 15:11:01 -07:00
Fangrui Song 95a134254a Remove unneeded cl::ZeroOrMore for cl::opt/cl::list options 2022-06-05 01:07:51 -07:00
Kazu Hirata e0039b8d6a Use llvm::less_second (NFC) 2022-06-04 22:48:32 -07:00
Jacob Weightman 814a0abcce AMDGPU: allow reordering of functions in AMDGPUResourceUsageAnalysis
The AMDGPUResourceUsageAnalysis was previously a CGSCC pass, and assumed
that a function's callees were always analyzed prior to their callees.
When it was refactored into a module pass, this assumption no longer
always holds. This results in calls being erroneously identified as
indirect, and reserving private segment space for them. This results in
significantly slower kernel launch latency.

This patch changes the order in which the module's functions are analyzed
from the order in which they occur in the module to a post-order traversal
of the call graph. Perhaps Clang always generates the module's functions
in such an order, but this is not the case for the Cray Fortran compiler.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D126025
2022-06-03 15:55:54 -05:00
Matt Arsenault dd7e407d81 AMDGPU: Move SpilledReg from MFI to SIRegisterInfo
This isn't the most natural place for it, but it avoids a circular
include dependency in an out of tree patch.
2022-06-02 17:11:24 -04:00
Julien Pages 2dfe419446 [AMDGPU] Improve codegen of extractelement/insertelement in some cases
This patch improves the codegen of extractelement and insertelement for vector
containing 8 elements. Before, a dag combine transformation was generating a
sequence of 8 select/cmp.
This patch changes the upper limit for this transformation and the movrel
instruction will eventually be used instead. Extractlement/insertelement for
vectors containing less than 8 elements are unchanged.

Differential Revision: https://reviews.llvm.org/D126389
2022-06-02 17:05:55 -04:00
Joe Nash 3732cd59be [AMDGPU] gfx11 vop3 and inherited vop instructions
This patch includes MC layer support for VOP3 encoded instructions and generic VOP support
classes.
Some VOP1 and VOP2 instructions which share an encoding with gfx10 and are using
the AssemblerPredicate = isGFX10Plus are also enabled. That predicate
will be changed to isGFX10Only in a later patch.

Patch 15/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D126468

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126475
2022-06-02 14:03:02 -04:00
Joe Nash e4870c8357 [AMDGPU] gfx11 ds instructions
MC layer support for ds instructions

Contributors:
Piotr Sobczak <Piotr.Sobczak@amd.com>

Patch 14/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D126463

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D126468
2022-06-02 13:36:56 -04:00
Matt Arsenault 89b1808a2f AMDGPU: Fix missing c++ mode comment 2022-06-01 21:14:48 -04:00
Stanislav Mekhanoshin c9e242f6dd [AMDGPU] Change GISel error handling for TFE on GFX90A
Differential Revision: https://reviews.llvm.org/D126797
2022-06-01 11:07:25 -07:00
Scott Linder 2d43955cec [AMDGPU][NFC] Refactor AMDGPUCallingConv.td
Rename CalleeSavedRegs defs to avoid being overly specific:

* CSR_AMDGPU_AGPRs_32_255 => CSR_AMDGPU_AGPRs
* CSR_AMDGPU_SGPRs_30_31 + CSR_AMDGPU_SGPRs_32_105 => CSR_AMDGPU_SGPRs
* CSR_AMDGPU_SI_Gfx_SGPRs_4_29 + CSR_AMDGPU_SI_Gfx_SGPRs_64_105 =>
  CSR_AMDGPU_SI_Gfx_SGPRs
* CSR_AMDGPU_HighRegs => CSR_AMDGPU
* CSR_AMDGPU_HighRegs_With_AGPRs => CSR_AMDGPU_GFX90AInsts
* CSR_AMDGPU_SI_Gfx_With_AGPRs => CSR_AMDGPU_SI_Gfx_GFX90AInsts

Introduce a class RegMask to mark the cases where we use the
CalleeSavedRegs class purely as an expedient way to produce a mask.
Update the names of these masks to not mention "CSR". Other targets also
seem to do this, so a reasonable alternative is to actually update
table-gen to include a new class to do this explicitly, but the current
approach seems harmless so I opted to just make it more explicit.

Reviewed By: arsenm, sebastian-ne

Differential Revision: https://reviews.llvm.org/D109008
2022-06-01 16:24:09 +00:00
Matt Arsenault 0e1c71e4a4 CodeGen: Move getAddressSpaceForPseudoSourceKind into TargetMachine
Avoid the dependency on TargetInstrInfo, which depends on the subtarget
and therefore the individual function.

Currently AMDGPU is constructing PseudoSourceValue instances in MachineFunctionInfo.
In order to facilitate copying MachineFunctionInfo, we need to stop allocating these
there. Alternatively we could allow targets to subclass PseudoSourceValueManager,
and allocate them similarly to MachineFunctionInfo.
2022-06-01 09:45:40 -04:00
Stanislav Mekhanoshin dec1283279 [AMDGPU] Fix image opcodes GlobalISel on gfx90a.
- Correct flavor of an instruction was not selected.
- GFX90A does not support TFE.

Differential Revision: https://reviews.llvm.org/D126312
2022-05-31 14:07:46 -07:00
jeff 2e61dfb124 [AMDGPU] Instruction Type Pipeline
This patch implements a DAG mutation which adds edges between different groups of instructions. The purpose is to try to generate code that conforms to a pipeline (groupA instructions occur before groupB, groupB -> groupC, and so on). Currently the pipeline order is hardcoded as VMEM->DSRead->MFMA->DSWrite, but the patch was designed to be easily extensible. Alias analysis is problematic for pipelining as memory instructions will usually not be able to be reordered w.r.t one another.

Differential Revision: https://reviews.llvm.org/D125997
2022-05-31 17:48:52 +00:00
Joe Nash e8860bee28 [AMDGPU] gfx11 Image instructions
MC layer support for instructions in the MIMG encoding(Image
instructions).

Contributors:
Carl Ritson <carl.ritson@amd.com>

Patch 13/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125992

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D126463
2022-05-31 10:53:35 -04:00
Ivan Kosarev f199b2b00f [AMDGPU][NFC] Refine defining the offset field for GFX10+ SMEM instructions.
Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126662
2022-05-31 09:54:51 +01:00
Ivan Kosarev b4dbcba3b7 [AMDGPU][GFX9][NFC] Rename the base class for SMEM stores. 2022-05-30 10:31:59 +01:00
Ivan Kosarev 082822b381 [AMDGPU][GFX9] Support base+soffset+offset SMEM stores.
Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126388
2022-05-30 10:27:57 +01:00
Nicolai Hähnle 5df2893a9a AMDGPU: Add G_AMDGPU_MAD_64_32 instructions
These generic instructions are trivially selected to
V_MAD_[IU]64_[IU]32 instructions when run on the VALU.

When at least both factors are scalar, it is usually better to execute
some or all of the instruction on the SALU. To this end, we lower the
instruction to simpler instructions that are supported on the SALU
when applying the register bank mapping.

Differential Revision: https://reviews.llvm.org/D124843
2022-05-27 12:36:17 -05:00
Ivan Kosarev b0ccf38b01 [AMDGPU][GFX9] Support base+soffset+offset SMEM loads.
Resolves part of
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D125700
2022-05-26 12:42:33 +01:00
serge-sans-paille fb67d683db [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since 7030654296 detected a few
regressions, fixing them.

Differential Revision: https://reviews.llvm.org/D126417
2022-05-26 08:12:34 +02:00
Maksim Panchenko bed9efed71 [MCDisassembler] Disambiguate Size parameter in tryAddingSymbolicOperand()
MCSymbolizer::tryAddingSymbolicOperand() overloaded the Size parameter
to specify either the instruction size or the operand size depending on
the architecture. However, for proper symbolic disassembly on X86, we
need to know both sizes, as an instruction can have two operands, and
the instruction size cannot be reliably calculated based on the operand
offset and its size. Hence, split Size into OpSize and InstSize.

For X86, the new interface allows to fix a couple of issues:
  * Correctly adjust the value of PC-relative operands.
  * Set operand size to zero when the operand is specified implicitly.

Differential Revision: https://reviews.llvm.org/D126101
2022-05-25 13:44:32 -07:00
Joe Nash 835e09c4c3 [AMDGPU] gfx11 FLAT Instructions
MachineCode Support for FLAT type instructions

Contributors:
Sebastian Neubauer <sebastian.neubauer@amd.com>

Patch 12/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125989

Reviewed By: rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D125992
2022-05-25 15:29:39 -04:00
Joe Nash ef1ea5ac01 [AMDGPU] gfx11 vinterp instructions MC support
A new instruction encoding. Some of these instructions were previously VOP3
encoded.

Contributors:
Carl Ritson <carl.ritson@amd.com>

Patch 11/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125824

Reviewed By: critson

Differential Revision: https://reviews.llvm.org/D125989
2022-05-25 14:59:16 -04:00
Joe Nash 1a51ab766f [AMDGPU] gfx11 export instructions
Contributors:
Jay Foad <jay.foad@amd.com>
Dmitry Preobrazhensky <d-pre@mail.ru>

Patch 10/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125822

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D125824
2022-05-25 14:44:09 -04:00
Nicolai Hähnle affa1b1cc5 AMDGPU/GISel: Factor out AMDGPURegisterBankInfo::buildReadFirstLane
A later change will add a 3rd user, so factoring out the common code
seems useful.

Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).

(Resubmit with missing test added.)

Differential Revision: https://reviews.llvm.org/D125324
2022-05-25 12:14:01 -05:00
Nicolai Hähnle afc90101a5 Revert "AMDGPU/GISel: Factor out AMDGPURegisterBankInfo::buildReadFirstLane"
This reverts commit 2a28467e53.
2022-05-25 12:03:23 -05:00
Nicolai Hähnle 2a28467e53 AMDGPU/GISel: Factor out AMDGPURegisterBankInfo::buildReadFirstLane
A later change will add a 3rd user, so factoring out the common code
seems useful.

Reorganizing the executeInWaterfallLoop causes some more COPYs to be
generated, but those all fold away during instruction selection.
Generating the comparisons uses generic instructions over machine
instructions now which admittedly shouldn't make a difference
(though it should make it easier to move the waterfall loop generation
to another place).

Differential Revision: https://reviews.llvm.org/D125324
2022-05-25 11:35:02 -05:00
Stanislav Mekhanoshin 5df6669d45 [AMDGPU] Enforce alignment of image vaddr on gfx90a
Even though single address image instructions only use a single VGPR
HW accesses 4 or 5 which creates alignment requirement.

Fixes: SWDEV-316648

Differential Revision: https://reviews.llvm.org/D126009
2022-05-24 10:05:39 -07:00
Ivan Kosarev 1586e1dc95 [AMDGPU][MC][GFX11] Support base+soffset+offset SMEM loads.
Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D126207
2022-05-24 15:13:14 +01:00
Dmitry Preobrazhensky 818cc9b285 [AMDGPU][MC][GFX940] Disable v_mac_f32_dpp
Differential Revision: https://reviews.llvm.org/D126070
2022-05-23 15:49:44 +03:00
Jay Foad 9af56c676e [AMDGPU] Mark SMEM cache invalidations as not reading memory
This brings the MachineInstrs in line with the corresponding intrinsics
which have side effects but do not access memory. It also matches how
BUF cache invalidation instructions are defined.

The lit test changes are just because the machine scheduler previously
treated them like loads, and added an artificial scheduling edge from
them to the exit SU, which caused them to be scheduled earlier.

Differential Revision: https://reviews.llvm.org/D126074
2022-05-20 17:18:03 +01:00
Jay Foad 78ec59e6ae [AMDGPU] Handle mandatory literals in isOperandLegal
Extend SIInstrInfo::isOperandLegal to enforce a limit on the number of
literal operands for all VALU instructions, not just VOP3. In particular
it now handles VOP2 instructions with a mandatory literal operand like
V_FMAAK_F32.

Differential Revision: https://reviews.llvm.org/D126064
2022-05-20 16:14:00 +01:00
Jay Foad 5b18ef7256 [AMDGPU] Add verification for mandatory literals
Extend the literal operand checking in SIInstrInfo::verifyInstruction to
check VOP2 instructions like V_FMAAK_F32 which have a mandatory literal
operand. The rule is that src0 can also be a literal, but only if it is
the same literal value.

AMDGPUAsmParser::validateConstantBusLimitations already handles this
correctly.

Differential Revision: https://reviews.llvm.org/D126063
2022-05-20 16:14:00 +01:00
Dmitry Preobrazhensky f598dfb3bf [AMDGPU][MC][GFX8+] Correct SMEM offset parsing
Differential Revision: https://reviews.llvm.org/D125907
2022-05-20 14:00:34 +03:00
Jay Foad 9ece051847 [AMDGPU] Mark s_get_waveid_in_workgroup as not reading memory
It is already marked as having side effects, at least in MIR. It does
not interact with anything else that is modelled as a memory access
either in IR or MachineIR.

Differential Revision: https://reviews.llvm.org/D125985
2022-05-19 21:25:46 +01:00
Jay Foad 86b55edab6 [AMDGPU] Mark s_getreg as having side effects instead of reading memory
s_getreg does not interact with anything else that is modelled as a
memory access either in IR or MachineIR.

Differential Revision: https://reviews.llvm.org/D125968
2022-05-19 21:25:46 +01:00
Jay Foad d14f2a6359 [AMDGPU] Allow multiple uses of the same literal in SOP2/SOPC
AMDGPUAsmParser::validateSOPLiteral already knew about this but
SIInstrInfo::verifyInstruction did not.

Differential Revision: https://reviews.llvm.org/D125976
2022-05-19 16:42:20 +01:00
Joe Nash ac2ff258d6 [AMDGPU] gfx11 scalar memory instructions
Contributors:
Mirko Brkusanin <Mirko.Brkusanin@amd.com>

Patch 9/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125820

Reviewed By: kosarev, #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D125822
2022-05-19 10:27:47 -04:00
Joe Nash 729467acef [AMDGPU] gfx11 LDSDIR instructions MC support
Contributors:
Carl Ritson <carl.ritson@amd.com>

Patch 8/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125498

Reviewed By: critson, rampitec, #amdgpu

Differential Revision: https://reviews.llvm.org/D125820
2022-05-19 10:08:47 -04:00
Dmitry Preobrazhensky 44673278e0 [AMDGPU][MC][GFX940] Add SMFMAC aliases
Differential Revision: https://reviews.llvm.org/D125888
2022-05-19 13:40:48 +03:00
Jay Foad 6bec3e9303 [APInt] Remove all uses of zextOrSelf, sextOrSelf and truncOrSelf
Most clients only used these methods because they wanted to be able to
extend or truncate to the same bit width (which is a no-op). Now that
the standard zext, sext and trunc allow this, there is no reason to use
the OrSelf versions.

The OrSelf versions additionally have the strange behaviour of allowing
extending to a *smaller* width, or truncating to a *larger* width, which
are also treated as no-ops. A small amount of client code relied on this
(ConstantRange::castOp and MicrosoftCXXNameMangler::mangleNumber) and
needed rewriting.

Differential Revision: https://reviews.llvm.org/D125557
2022-05-19 11:23:13 +01:00
Dmitry Preobrazhensky 32ca9bd7b5 [AMDGPU][MC][GFX940] Correct tied operand decoding for smfmac opcodes
Differential Revision: https://reviews.llvm.org/D125790
2022-05-18 15:39:30 +03:00
Dmitry Preobrazhensky 169416c64a [AMDGPU][MC][GFX7] Disable cache policy modifiers with SMRD
Differential Revision: https://reviews.llvm.org/D125799
2022-05-18 15:17:49 +03:00
Dmitry Preobrazhensky 95a8af2750 [AMDGPU][MC][NFC] MUBUF code cleanup
Removed code that is no longer used after https://reviews.llvm.org/D124485.

Differential Revision: https://reviews.llvm.org/D125811
2022-05-18 15:00:38 +03:00
Jay Foad e2926501d8 [AMDGPU] Aggressively fold immediates in SIShrinkInstructions
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.

Differential Revision: https://reviews.llvm.org/D114644
2022-05-18 11:04:33 +01:00
Jay Foad 3eb2281bc0 [AMDGPU] Aggressively fold immediates in SIFoldOperands
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.

This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
  which might increase occupancy. (On the other hand, many of these
  registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
  into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.

Differential Revision: https://reviews.llvm.org/D114643
2022-05-18 10:19:35 +01:00
Jay Foad dd12c3433e [AMDGPU] Shrink F16 MAD/FMA to MADAK/MADMK/FMAAK/FMAMK on GFX10
Differential Revision: https://reviews.llvm.org/D125803
2022-05-18 10:00:06 +01:00
Shao-Ce SUN 25af3afa67 [NFC][AMDGPU][CodeGen] Use ArrayRef in TargetLowering functions
Based on D123467.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124508
2022-05-18 10:50:23 +08:00
Stanislav Mekhanoshin dee3190293 [AMDGPU] Add llvm.amdgcn.global.load.lds intrinsic
Differential Revision: https://reviews.llvm.org/D125279
2022-05-17 12:35:27 -07:00
Stanislav Mekhanoshin a09af86693 [AMDGPU] Enable FLAT LDS DMA on gfx9/10 before gfx940
We always had global and scratch loads to LDS in the gfx9,
but did not handle it. These were available via the 'lds'
encoding bit. In gfx940 this bit was reused as 'svs' which
resulted in new '_lds' opcodes effectively pushing this
bit into the opcode, but functionally it is the same. These
instructions are also available on gfx10.

Differential Revision: https://reviews.llvm.org/D125126
2022-05-17 12:16:37 -07:00
Joe Nash d21b9b4946 [AMDGPU] gfx11 scalar alu instructions
MC layer support for SOP(scalar alu operations) including encoding
support for s_delay_alu and s_sendmsg_rtn.

Contributors:
Jay Foad <jay.foad@amd.com>

Patch 7/N for upstreaming of AMDGPU gfx11 architecture.

Depends on D125319

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D125498
2022-05-17 13:35:41 -04:00
Stanislav Mekhanoshin 791ec1c68e [AMDGPU] Add intrinsics llvm.amdgcn.{raw|struct}.buffer.load.lds
Differential Revision: https://reviews.llvm.org/D124884
2022-05-17 10:32:13 -07:00
Stanislav Mekhanoshin 332b73fe12 [AMDGPU] Revert wide LDS DMA support.
This reverts ffbee7acdc, see also bug 37653 which it was fixing.
The bug claims this is an undocumented feature which actually works.
In the reality it is documented as not working for a good reason.
It likely does something, but it is useless anyway. These instructions
write into the LDS. The LDS address is:

M0 + inst_offset + (TIDinWave * 4).

For a store wider than a DWORD neighboring lanes will overwrite each
other.

Differential Revision: https://reviews.llvm.org/D125409
2022-05-16 11:23:35 -07:00
Joe Nash 6ef17f20d9 [AMDGPU] Mark sendmsg hasSideEffects. NFC
Address the FIXME by marking the sendmsg instructions with
hasSideEffects.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D125569
2022-05-16 09:59:27 -04:00
Jay Foad 27fa41583f [AMDGPU] Shrink MAD/FMA to MADAK/MADMK/FMAAK/FMAMK on GFX10
On GFX10 VOP3 instructions can have a literal operand, so the conversion
from VOP3 MAD/FMA to VOP2 MADAK/MADMK/FMAAK/FMAMK will not happen in
SIFoldOperands. The only benefit of the VOP2 form is code size, so do it
in SIShrinkInstructions instead.

Differential Revision: https://reviews.llvm.org/D125567
2022-05-16 15:15:23 +01:00
Joe Nash c70259405c [AMDGPU] gfx11 BUF Instructions
Includes MachineCode layer support and tests, and MIR tests not requiring
CodeGen pass changes.
Includes a small change in SMInstructions.td to correct encoded bits.

Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>
Dmitry Preobrazhensky <dmitry.preobrazhensky@amd.com>

Depends on D125316

Patch 6/N for upstreaming of AMDGPU gfx11 architecture.

Reviewed By: dp, Petar.Avramovic

Differential Revision: https://reviews.llvm.org/D125319
2022-05-16 09:41:40 -04:00
Jay Foad c1af2d329f [AMDGPU] SIShrinkInstructions: change static functions to methods
This is a mechanical change to avoid passing MRI and TII around
explicitly. NFC.

Differential Revision: https://reviews.llvm.org/D125566
2022-05-16 09:43:41 +01:00
Jay Foad dfb006c0c9 [AMDGPU] Extract SIInstrInfo::removeModOperands. NFC.
Make this an externally callable function for use in a future patch.

Differential Revision: https://reviews.llvm.org/D125565
2022-05-16 09:43:41 +01:00
Sheng c644488a8b Rename `MCFixedLenDisassembler.h` as `MCDecoderOps.h`
The name `MCFixedLenDisassembler.h` is out of date after D120958.

Rename it as `MCDecoderOps.h` to reflect the change.

Reviewed By: myhsu

Differential Revision: https://reviews.llvm.org/D124987
2022-05-15 08:44:58 +08:00
Ivan Kosarev bf5fc0d603 [AMDGPU][NFC] Remove unused function.
Introduced in
https://reviews.llvm.org/rG229d5e669bbbe7ca38ad832627a9809405939f1b

and then became unused in
https://reviews.llvm.org/D19584

Reviewed By: foad, dp

Differential Revision: https://reviews.llvm.org/D125385
2022-05-12 08:52:06 +01:00
Ivan Kosarev cb67b2ccc4 [AMDGPU][GFX10] Support base+soffset+offset SMEM stores.
Also makes another step towards resolving
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: foad, dp

Differential Revision: https://reviews.llvm.org/D125380
2022-05-12 08:48:05 +01:00
Austin Kerbow 2db700215a [AMDGPU] Add llvm.amdgcn.sched.barrier intrinsic
Adds an intrinsic/builtin that can be used to fine tune scheduler behavior. If
there is a need to have highly optimized codegen and kernel developers have
knowledge of inter-wave runtime behavior which is unknown to the compiler this
builtin can be used to tune scheduling.

This intrinsic creates a barrier between scheduling regions. The immediate
parameter is a mask to determine the types of instructions that should be
prevented from crossing the sched_barrier. In this initial patch, there are only
two variations. A mask of 0 means that no instructions may be scheduled across
the sched_barrier. A mask of 1 means that non-memory, non-side-effect inducing
instructions may cross the sched_barrier.

Note that this intrinsic is only meant to work with the scheduling passes. Any
other transformations that may move code will not be impacted in the ways
described above.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124700
2022-05-11 13:22:51 -07:00
Joe Nash a0a406b257 [AMDGPU] gfx11 Decode wider instructions. NFC
Refactor to pass a templatized size parameter to the decoder to allow wider than
64bit decodes in a later patch.

Contributors:
Jay Foad <jay.foad@amd.com>

Depends on D125261

Patch 5/N for upstreaming of AMDGPU gfx11 architecture.

Reviewed By: dp

Differential Revision: https://reviews.llvm.org/D125316
2022-05-11 11:05:58 -04:00
Joe Nash 18ed279a3a [AMDGPU] gfx11 subtarget features & early tests
Tablegen definitions for subtarget features and cpp predicate functions to
access the features.
New Sub-TargetProcessors and common latencies.
Simple changes to MIR codegen tests which pass on gfx11 because they have the
same output as previous subtargets or operate on pseudo instructions which
are reused from previous subtargets.

Contributors:
Jay Foad <jay.foad@amd.com>
Petar Avramovic <Petar.Avramovic@amd.com>

Patch 4/N for upstreaming of AMDGPU gfx11 architecture

Depends on D124538

Reviewed By: Petar.Avramovic, foad

Differential Revision: https://reviews.llvm.org/D125261
2022-05-11 10:31:49 -04:00
Mehdi Amini 3ffb08844c Remove unused variable (fix -Werror build on MSVC) 2022-05-10 21:04:52 +00:00
jeff f822db7670 [AMDGPU] Allow for MFMA Inst Clustering
This patch adds cluster edges between independent MFMA instructions. Additionally, it propogates all predecessors of cluster insts to the root of the cluster(s), and all successors to the leaf(ves) of the cluster(s) -- this is done to remove the possibility that those insts will be interspersed within the cluster.

Reviewed By: kerbowa

Differential Revision: https://reviews.llvm.org/D124678
2022-05-10 12:57:40 -07:00
jeff 3ff8ee2447 [NFC] Fix typo
Reviewed By: kerbowa

Differential Revision: https://reviews.llvm.org/D124647
2022-05-10 12:11:21 -07:00
Ivan Kosarev 88f04bdbd8 [AMDGPU][GFX10] Support base+soffset+offset SMEM loads.
Also makes a step towards resolving
https://github.com/llvm/llvm-project/issues/38652

Reviewed By: foad, dp

Differential Revision: https://reviews.llvm.org/D125117
2022-05-10 16:17:14 +01:00
Nicolai Hähnle 6c2a01ce3a AMDGPU/SDAG: Refine the fold to v_mad_[iu]64_[iu]32
Only fold for uniform values on pre-GFX9 chips. GFX9+ allow us
to keep the calculation entirely on the SALU.

For subtargets where integer multiplication isn't full-rate, avoid
folding if the multiply has too many uses.

Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.

Differential Revision: https://reviews.llvm.org/D123835
2022-05-10 09:15:51 -05:00
Simon Pilgrim 7e3ef7dcd2 [AMDGPU] lowerEXTRACT_VECTOR_ELT - fold from a SCALAR_TO_VECTOR source
As suggested by @foad on D124839

If we're extracting a vector element that originally came from a scalar_to_vector, then avoid the bitcasting of a vector type and perform the shift masking on the (any-extended) scalar source directly, making use of the fact that the upper elements of a scalar_to_vector are all undef.

Differential Revision: https://reviews.llvm.org/D125173
2022-05-07 20:23:31 +01:00
Joe Nash 7e71a03966 [AMDGPU] Split FeatureAtomicFaddInsts
FeatureAtomicFaddInsts is replaced with three more granular features.
Contributors:
Petar Avramovic <Petar.Avramovic@amd.com>

Patch 3/N for upstreaming of AMDGPU gfx11 architecture

Depends on D124537

Reviewed By: foad, #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D124538
2022-05-05 13:27:45 -04:00
Jay Foad ba6c8d42d4 [AMDGPU] Combine DPP mov even if old reg def is in different BB
Given a DPP mov like this:

  %2:vgpr_32 = V_MOV_B32_e32 0, implicit $exec
  ...
  %3:vgpr_32 = V_MOV_B32_dpp %2, %1, 1, 1, 1, 0, implicit $exec

this patch just removes a check that %2 (the "old reg") was defined in
the same BB as the DPP mov instruction. GCNDPPCombine requires that the
MIR is in SSA form so I don't understand why the BB matters.

This lets the optimization work in more real world cases when the
definition of %2 gets hoisted out of a loop.

Differential Revision: https://reviews.llvm.org/D124182
2022-05-05 11:30:31 +01:00
Mariusz Sikora 2417de2758 [AMDGPU] Use d16 flag for image.sample instructions
Image.sample instruction can be forced to return half type instead of
float when d16 flag is enabled.

This patch adds new pattern in InstCombine to detect if output of
image.sample is used later only by fptrunc which converts the type
from float to half. If pattern is detected then fptrunc and image.sample
are combined to single image.sample which is returning half type.
Later in Lowering part d16 flag is added to image sample intrinsic.

Differential Revision: https://reviews.llvm.org/D124232
2022-05-05 06:29:19 +02:00
Stanislav Mekhanoshin 63f21f4cc7 [AMDGPU] Handle LDS DMA and LDS_DIRECT hazards
There shall be 1 wait state between M0 write and LDS DMA/LDS_DIRECT use.

Differential Revision: https://reviews.llvm.org/D124550
2022-05-04 14:45:16 -07:00
Jon Chesterfield bc78c09952 [amdgpu] Elide module lds allocation in kernels with no callees
Introduces a string attribute, amdgpu-requires-module-lds, to allow
eliding the module.lds block from kernels. Will allocate the block as before
if the attribute is missing or has its default value of true.

Patch uses the new attribute to detect the simplest possible instance of this,
where a kernel makes no calls and thus cannot call any functions that use LDS.

Tests updated to match, coverage was already good. Interesting cases is in
lower-module-lds-offsets where annotating the kernel allows the backend to pick
a different (in this case better) variable ordering than previously. A later
patch will avoid moving kernel variables into module.lds when the kernel can
have this attribute, allowing optimal ordering and locally unused variable
elimination.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122091
2022-05-04 22:42:07 +01:00
serge-sans-paille 7030654296 [iwyu] Handle regressions in libLLVM header include
Running iwyu-diff on LLVM codebase since fa5a4e1b95 detected a few
regressions, fixing them.

Differential Revision: https://reviews.llvm.org/D124847
2022-05-04 08:32:38 +02:00
Nicolai Hähnle 8b42e6d057 AMDGPU: Remove redundant call to MachineInstrBuilder::setMBB
setInstrAndDebugLoc also sets the basic block automatically.

Differential Revision: https://reviews.llvm.org/D124809
2022-05-03 07:49:20 -05:00
hsmahesha 589b9df4e1 [AMDGPU] Fix scalar_to_vector for v8i16/v8f16
so that the stack access is avoided.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124734
2022-05-03 07:28:15 +05:30
hsmahesha 3175323ce1 [AMDGPU][NFC] Make lowerINSERT_VECTOR_ELT() more readable
by moving around the code and by adding more comments, which would
later help during any required clean-up.

Differential Revision: https://reviews.llvm.org/D124733
2022-05-03 07:28:15 +05:30
Nicolai Hähnle deaa678137 AMDGPU/SDAG: Factor out the fold (add (mul x, y), y) --> mad_[iu]64_[iu]32
Refactor to simplify a follow-up change.

No functional change intended. However, there is a rather subtle logic
change: the subsequent combines (e.g. reassociation) are skipped *always*
when one of the operands of the add is a mul, instead of only when
additionally mad64_32 etc. are available. This change makes sense because
the subsequent combines should never apply when one of the operands is a
mul.

Differential Revision: https://reviews.llvm.org/D123833
2022-05-02 17:40:03 -05:00
Stanislav Mekhanoshin 51e02409f0 [AMDGPU] Produce waitcounts for LDS DMA
MUBUF and FLAT LDS DMA operations need a wait on vmcnt before LDS written
can be accessed. A load from LDS to VMEM does not need a wait.

Differential Revision: https://reviews.llvm.org/D124626
2022-04-29 11:14:11 -07:00
Joe Nash 813e521e55 [AMDGPU] Add gfx11 subtarget ELF definition
This is the first patch of a series to upstream support for the new
subtarget.

Contributors:
Jay Foad <jay.foad@amd.com>
Konstantin Zhuravlyov <kzhuravl_dev@outlook.com>

Patch 1/N for upstreaming AMDGPU gfx11 architectures.

Reviewed By: foad, kzhuravl, #amdgpu

Differential Revision: https://reviews.llvm.org/D124536
2022-04-29 12:27:17 -04:00
Ivan Kosarev 6ddf2a824d [AMDGPU] Adjust wave priority based on VMEM instructions to avoid duty-cycling.
As older waves execute long sequences of VALU instructions, this may
prevent younger waves from address calculation and then issuing their
VMEM loads, which in turn leads the VALU unit to idle. This patch tries
to prevent this by temporarily raising the wave's priority.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D124246
2022-04-27 14:37:18 +01:00
Stanislav Mekhanoshin 6a24e37219 [AMDGPU] Remove now unused variable HasLdsModifier. NFC. 2022-04-26 17:49:30 -07:00
Stanislav Mekhanoshin 0274811b5a [AMDGPU] Add both mayLoad and mayStore to MUBUF LDS opcodes
Differential Revision: https://reviews.llvm.org/D124483
2022-04-26 17:30:24 -07:00
Stanislav Mekhanoshin 00d84a9f92 [AMDGPU] Remove vdata from buffer to lds load
Differential Revision: https://reviews.llvm.org/D124485
2022-04-26 17:16:26 -07:00
Stanislav Mekhanoshin a9ccc7bc54 [AMDGPU] Properly mark MUBUF and FLAT LDS DMA instructions. NFC.
Add these bits to the MUBUF and FLAT LDS DMA instructions:

- LGKM_CNT - these operate on LDS;
- VALU - SPG 3.9.8: This instruction acts as both a MUBUF and
VALU instruction;

Codegen currently does not produce any of this, so the change is NFC.

Differential Revision: https://reviews.llvm.org/D124472
2022-04-26 14:20:26 -07:00
Vasileios Porpodas fa8a9fea47 Recommit "[SLP][TTI] Refactoring of `getShuffleCost` `Args` to work like `getArithmeticInstrCost`"
This reverts commit 6a9bbd9f20.

Code review: https://reviews.llvm.org/D124202
2022-04-26 14:02:40 -07:00
Piotr Sobczak c6afbdb5d2 Revert "[AMDGPU] Use d16 flag for image.sample instructions"
This reverts commit d1762fc454.

Reverting D124232 as the buildbot reported some errors in sanitizers.
2022-04-25 17:18:49 +02:00
Mariusz Sikora d1762fc454 [AMDGPU] Use d16 flag for image.sample instructions
Image.sample instruction can be forced to return half type instead of
float when d16 flag is enabled.

This patch adds new pattern in InstCombine to detect if output of
image.sample is used later only by fptrunc which converts the type
from float to half. If pattern is detected then fptrunc and image.sample
are combined to single image.sample which is returning half type.
Later in Lowering part d16 flag is added to image sample intrinsic.

Differential Revision: https://reviews.llvm.org/D124232
2022-04-25 13:05:52 +01:00
Matt Arsenault 0ecbb683a2 TableGen/GlobalISel: Make address space/align predicates consistent
The builtin predicate handling has a strange behavior where the code
assumes that a PatFrag is a stack of PatFrags, and each level adds at
most one predicate. I don't think this particularly makes sense,
especially without a diagnostic to ensure you aren't trying to set
multiple at once.

This wasn't followed for address spaces and alignment, which could
potentially fall through to report no builtin predicate was
added. Just switch these to follow the existing convention for now.
2022-04-22 15:48:07 -04:00
Matt Arsenault 794a0bb547 AMDGPU: Directly implement computeKnownBits for workitem intrinsics
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.

Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.

Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
2022-04-22 10:49:50 -04:00