Commit Graph

491 Commits

Kazu Hirata 20cde15415 [Target] Use std::nullopt instead of None (NFC)
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated.  The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.

This is part of an effort to migrate from llvm::Optional to
std::optional:

https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-02 20:36:06 -08:00
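
A small before/after illustration of the mechanical replacement described above (the function name is hypothetical):

```cpp
#include <optional>

// Before (llvm::Optional):
//   Optional<unsigned> getCachedAlign() { return None; }

// After the mechanical replacement:
std::optional<unsigned> getCachedAlign() { return std::nullopt; }
```
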
Mateja Marjanovic 595a08847a [AMDGPU] Add support for new LLVM vector types
Add VReg, AReg and SReg on AMDGPU for bit widths: 288, 320, 352 and 384.

Differential Revision: https://reviews.llvm.org/D138205
2022-11-29 17:02:04 +01:00
Janek van Oirschot 322966f8f8 [AMDGPU] Add llvm.is.fpclass intrinsic to existing SelectionDAG fp
class support and introduce GlobalISel implementation for AMDGPU

Uses existing SelectionDAG lowering of the llvm.amdgcn.class intrinsic
for llvm.is.fpclass.
2022-11-28 16:00:36 -05:00
Ivan Kosarev ec8ede8177 [AMDGPU][CodeGen] Support raw format TFE buffer loads other than byte, short and d16 ones.
Differential Revision: https://reviews.llvm.org/D138215
2022-11-24 10:50:26 +00:00
Stanislav Mekhanoshin bcaf31ec3f [AMDGPU] Allow finer grain control of an unaligned access speed
A target can return whether a misaligned access is 'fast', as defined
by the target, or not. In reality there can be different levels
of 'fast' and 'slow'. This patch changes the boolean 'Fast'
argument of the allowsMisalignedMemoryAccesses family of functions
to an unsigned representing its speed.

A target can still define it as it wants, and the direct translation
of the current code uses 0 and 1 for the current false and true. This
makes the change an NFC.

A subsequent patch will start using an actual speed value in
the load/store vectorizer to check whether a vectorized access is
going to be not just fast, but no slower than before.

Differential Revision: https://reviews.llvm.org/D124217
2022-11-17 09:23:53 -08:00
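
A standalone sketch of the interface change described above; the real TargetLowering hook takes more parameters (EVT, address space, alignment, MMO flags), so this is illustrative only:

```cpp
// Before the patch the out-parameter was `bool *Fast` (a yes/no answer);
// after it, an `unsigned *Fast` carrying a speed, where 0 and 1 mirror
// the old false/true and targets may later report finer-grained values.
struct TLISketch {
  bool allowsMisalignedMemoryAccesses(unsigned *Fast) const {
    if (Fast)
      *Fast = 1;   // direct translation of the old 'true'
    return true;   // the access is allowed; *Fast says how fast it is
  }
};
```
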
Simon Pilgrim 78739fdb4d [DAG] Enable combineShiftOfShiftedLogic folds after type legalization
This was disabled to prevent regressions, which appear to occur only on AMDGPU (at least in our current lit tests). I've addressed them by adding AMDGPUTargetLowering::isDesirableToCommuteWithShift overrides.

Fixes #57872

Differential Revision: https://reviews.llvm.org/D136042
2022-10-29 12:30:04 +01:00
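
The fold rests on a bitwise identity along these lines (example mine, not taken from the patch): a shift of a logic op whose operand was already shifted can be rewritten with a single combined shift.

```cpp
#include <cassert>
#include <cstdint>

int main() {
  uint32_t x = 0xDEADBEEF, y = 0x12345678;
  // shift(logic(shift(x, C0), y), C1) == logic(shift(x, C0+C1), shift(y, C1))
  assert((((x << 3) & y) << 2) == ((x << 5) & (y << 2)));
}
```
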
Pierre van Houtryve 824dd811be [AMDGPU][DAG] Fix trunc/shift combine condition
The condition needs to be different for right-shifts, else we may lose information in some cases.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D136059
2022-10-21 06:36:07 +00:00
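
To see why right shifts need a different condition (illustration mine): truncation commutes with a left shift, but an early truncation before a right shift discards high source bits that the shift would have brought into range.

```cpp
#include <cassert>
#include <cstdint>

int main() {
  uint32_t x = 0xABCD1234;
  // Left shift: trunc(shl(x, 4)) == shl(trunc(x), 4) -- always safe.
  assert((uint16_t)(x << 4) == (uint16_t)((uint16_t)x << 4));
  // Right shift: trunc(srl(x, 4)) != srl(trunc(x), 4) in general --
  // bits [16..19] of x are lost by truncating first.
  assert((uint16_t)(x >> 4) != (uint16_t)((uint16_t)x >> 4));
}
```
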
Leon Clark 6370bc2435 Add f16 nearbyint support.
Enable lowering of FNEARBYINT for f16 and extend existing tests.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D135124
2022-10-14 08:05:24 +01:00
Stanislav Mekhanoshin 5a3fe9a039 [AMDGPU] Move SIModeRegisterDefaults to SI MFI
It does not belong to a general AMDGPU MFI.

Differential Revision: https://reviews.llvm.org/D134666
2022-09-28 13:13:40 -07:00
Vitaly Buka 20a80d60a8 Revert "[AMDGPU] Move SIModeRegisterDefaults to SI MFI"
Breaks msan bots. Details in D134666.

This reverts commit 0ce96e06ee.
2022-09-26 22:22:09 -07:00
Stanislav Mekhanoshin 0ce96e06ee [AMDGPU] Move SIModeRegisterDefaults to SI MFI
It does not belong to a general AMDGPU MFI.

Differential Revision: https://reviews.llvm.org/D134666
2022-09-26 13:20:24 -07:00
Matt Arsenault 28e03692ae AMDGPU: Fix expansion of 16-bit atomicrmw
Fixes issue 57830
2022-09-20 14:47:40 -04:00
Simon Pilgrim f9de13232f [X86] Promote i8/i16 CTTZ (BSF) instructions and remove speculation branch
This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.

For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions by masking the operand above the msb (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the src operand is known never zero (i.e. due to the promotion masking) we can remove the CMOV zero src handling.

Although BSF isn't very fast, most CPUs from the last 20 years don't do that bad a job with it, although there are some annoying passthrough EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards assuming this will most likely be executed as a TZCNT instruction on any semi-modern CPU.

Differential Revision: https://reviews.llvm.org/D132520
2022-08-24 17:28:18 +01:00
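
The masking trick described above, in plain C++20 (illustration mine): setting a bit just above the narrow type's msb makes the promoted operand provably non-zero, so no zero-check branch or CMOV is needed, and the count is still correct for the narrow input.

```cpp
#include <bit>
#include <cassert>
#include <cstdint>

// cttz of a 16-bit value via a 32-bit count, with no zero check:
unsigned cttz16(uint16_t x) {
  // Bit 16 guarantees the 32-bit operand is never zero; for x == 0 the
  // result is 16, the conventional "type width" answer for a 16-bit cttz.
  return std::countr_zero(uint32_t(x) | 0x10000u);
}

int main() {
  assert(cttz16(0) == 16);
  assert(cttz16(0x0008) == 3);
}
```
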
Simon Pilgrim 3cf48963ff [AMDGPU] Remove old isCheapToSpeculateCttz FIXME
As confirmed on D132520 - this should always return true
2022-08-24 15:53:38 +01:00
David Stuttard 1d1cc05539 AMDGPU: mbcnt allow for non-zero src1 for known-bits
Src1 for mbcnt can be a non-zero literal or register. Take this into account
when calculating known bits.

Differential Revision: https://reviews.llvm.org/D131478
2022-08-11 13:23:43 +01:00
David Truby 9a976f3661 [llvm] Always use TargetConstant for FP_ROUND ISD Nodes
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.

This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D130370
2022-08-03 14:02:11 +01:00
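
A minimal sketch against the LLVM C++ API of what the consistent construction looks like (helper name mine): FP_ROUND's second operand is its flag, and the patch makes it always a TargetConstant.

```cpp
#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Build an FP_ROUND whose flag operand is a TargetConstant, as the
// patch requires, rather than a plain Constant.
static SDValue buildFPRound(SelectionDAG &DAG, const SDLoc &DL, SDValue Op,
                            EVT VT) {
  return DAG.getNode(ISD::FP_ROUND, DL, VT, Op,
                     DAG.getIntPtrConstant(0, DL, /*isTarget=*/true));
}
```
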
Shilei Tian 1023ddaf77 [LLVM] Add the support for fmax and fmin in atomicrmw instruction
This patch adds support for `fmax` and `fmin` operations in the `atomicrmw`
instruction. For now (at least in this patch), the instruction will be expanded
to a CAS loop. There are already a couple of targets supporting the feature. I'll
create follow-up patches to enable them accordingly.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D127041
2022-07-06 10:57:53 -04:00
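
The CAS-loop expansion strategy, sketched with std::atomic (illustration mine; the real expansion works on IR, and `fmax` follows IEEE maxnum semantics rather than std::max):

```cpp
#include <algorithm>
#include <atomic>

float atomic_fmax(std::atomic<float> &Addr, float Val) {
  float Old = Addr.load();
  // Retry until no other thread raced us between the load and the store;
  // compare_exchange_weak refreshes Old with the current value on failure.
  while (!Addr.compare_exchange_weak(Old, std::max(Old, Val)))
    ;
  return Old; // atomicrmw yields the original value
}
```
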
Piotr Sobczak bd675af2a2 [AMDGPU] Make v16i16/v16f16 legal
There are upcoming intrinsics to use the new types.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D128865
2022-06-30 23:08:40 +02:00
Guillaume Chatelet f1255186c7 [NFC][Alignment] Remove max functions between Align and MaybeAlign
`llvm::max(Align, MaybeAlign)` and `llvm::max(MaybeAlign, Align)` are
not used often enough to be required. They also make the code more opaque.

Differential Revision: https://reviews.llvm.org/D128121
2022-06-20 08:37:48 +00:00
Guillaume Chatelet 0788186182 [Alignment][NFC] Remove usage of MemSDNode::getAlignment
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.

Differential Revision: https://reviews.llvm.org/D126910
2022-06-07 13:52:20 +00:00
Shao-Ce SUN 25af3afa67 [NFC][AMDGPU][CodeGen] Use ArrayRef in TargetLowering functions
Based on D123467.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124508
2022-05-18 10:50:23 +08:00
Matt Arsenault 794a0bb547 AMDGPU: Directly implement computeKnownBits for workitem intrinsics
Currently the metadata is inserted in a late pass and is lowered
to an AssertZext. The metadata would be more useful if it were
inserted earlier, after inlining but before codegen.

Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.

Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk of that test using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
2022-04-22 10:49:50 -04:00
Nicolai Hähnle 4df4922da6 AMDGPU/SDAG: Custom SETCC (i.e. ballot) is always uniform
The AMDGPUISD::SETCC node is like ISD::SETCC, but returns a lane mask
instead of a per-lane boolean. The lane mask is uniform.

This improves instruction selection for code patterns like
ctpop(ballot(x)), which can now use an S_BCNT1_* instruction instead
of V_BCNT_*.

GlobalISel already selects scalar instructions (an earlier commit
added a test case).

Differential Revision: https://reviews.llvm.org/D123432
2022-04-11 14:04:21 -05:00
Stanislav Mekhanoshin e66f0edb40 [AMDGPU] Split unaligned LDS access instead of scalarizing
There is no need to fully scalarize an unaligned operation in
some cases; it is enough to split it to the alignment.

Differential Revision: https://reviews.llvm.org/D123330
2022-04-07 14:27:37 -07:00
Shao-Ce SUN 662b9fa02c [NFC][CodeGen] Add a setTargetDAGCombine variant that takes an ArrayRef
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D122557
2022-03-29 09:53:24 +08:00
Changpeng Fang dd5895cc39 AMDGPU: Use the implicit kernargs for code object version 5
Summary:
  Specifically, for trap handling, for targets that do not support getDoorbellID,
we load the queue_ptr from the implicit kernarg, and move queue_ptr to s[0:1].
To get aperture bases when targets do not have aperture registers, we load
private_base or shared_base directly from the implicit kernarg. In clang, we use
implicitarg_ptr + offsets to implement __builtin_amdgcn_workgroup_size_{xyz}.

Reviewers: arsenm, sameerds, yaxunl

Differential Revision: https://reviews.llvm.org/D120265
2022-03-17 14:12:36 -07:00
serge-sans-paille 989f1c72e0 Cleanup codegen includes
This is a (fixed) recommit of https://reviews.llvm.org/D121169

after:  1061034926
before: 1063332844

Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D121681
2022-03-16 08:43:00 +01:00
Nico Weber a278250b0f Revert "Cleanup codegen includes"
This reverts commit 7f230feeea.
Breaks CodeGenCUDA/link-device-bitcode.cu in check-clang
and many LLVM tests; see comments on https://reviews.llvm.org/D121169
2022-03-10 07:59:22 -05:00
serge-sans-paille 7f230feeea Cleanup codegen includes
after:  1061034926
before: 1063332844

Differential Revision: https://reviews.llvm.org/D121169
2022-03-10 10:00:30 +01:00
Venkata Ramanaiah Nalamothu 04fff547e2 [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as live-ins on the function entry to
preserve their value when we have calls, so that they get saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame, and the above approach makes it difficult to track the
return address when the CFI information is emitted during frame lowering,
since that would involve understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

Doing the above poses an issue: the return instruction now uses the undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction behind the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to
`S_SETPC_B64_return` during `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm, ronlieb

Differential Revision: https://reviews.llvm.org/D114652
2022-03-09 12:18:02 +05:30
Julien Pages dcb2da13f1 [AMDGPU] Add a new intrinsic to control fp_trunc rounding mode
Add a new llvm.fptrunc.round intrinsic to precisely control
the rounding mode when converting from f32 to f16.

Differential Revision: https://reviews.llvm.org/D110579
2022-02-11 12:08:23 -05:00
Simon Pilgrim fd2bb51f1e [ADT] Add APInt/MathExtras isShiftedMask variant returning mask offset/length
In many cases, calls to isShiftedMask are immediately followed with checks to determine the size and position of the bitmask.

This patch adds variants of APInt::isShiftedMask, isShiftedMask_32 and isShiftedMask_64 that return these values as additional arguments.

I've updated a number of cases that were either performing separate size/position calculations or had created their own local wrapper versions of these.

Differential Revision: https://reviews.llvm.org/D119019
2022-02-08 12:04:13 +00:00
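
In the spirit of the new variants (sketch mine, not the exact LLVM signature): a shifted mask is one contiguous run of set bits, so its position and length fall out of countr_zero and popcount.

```cpp
#include <bit>
#include <cstdint>

// Returns true iff V is a contiguous run of ones; on success MaskIdx is
// the position of the lowest set bit and MaskLen the run length.
bool isShiftedMask64(uint64_t V, unsigned &MaskIdx, unsigned &MaskLen) {
  if (V == 0)
    return false;
  MaskIdx = std::countr_zero(V);
  MaskLen = std::popcount(V);
  uint64_t M = V >> MaskIdx;  // run moved down to bit 0
  return (M & (M + 1)) == 0;  // all-ones iff M+1 is a power of two
}
```
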
Stanislav Mekhanoshin bb1fe36977 [AMDGPU] Make v8i16/v8f16 legal
Differential Revision: https://reviews.llvm.org/D117721
2022-01-24 11:51:08 -08:00
Craig Topper 454256ef4f [AMDGPU] Correct the known bits calculation for MUL_I24.
I'm not entirely sure, but based on how ComputeNumSignBits handles
ISD::MUL, I believe this code was miscounting the number of sign
bits.

As an example of an incorrect result, let's say that countMinSignBits
returned 1 for the left hand side and 24 for the right hand side.
LHSValBits would be 23 and RHSValBits would be 0, and the sum would
be 23. This would cause the code to set 9 high bits as zero/one. Now
suppose the real value for the left side is 0x800000 and the right
hand side is 0xffffff. The product is 0x00800000, which has 8 sign bits,
not 9.

The number of valid bits for the left and right operands is now
the number of non-sign bits + 1. If the sum of the valid bits of
the left and right sides exceeds 32, then the result may overflow and we
can't say anything about the sign of the result. If the sum is 32
or less then it won't overflow and we know the result has at least
1 sign bit.

For the previous example, the code will now calculate the left
side valid bits as 24 and the right side as 1. The sum will be 25
and the sign bits will be 32 - 25 + 1 which is 8, the correct value.

Differential Revision: https://reviews.llvm.org/D116469
2022-01-14 08:54:54 -08:00
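
The corrected computation in sketch form (names mine), matching the walkthrough above:

```cpp
// Operands of MUL_I24 are 24-bit signed quantities; the SignBits counts
// come from ComputeNumSignBits on each operand.
unsigned numSignBitsMulI24(unsigned LHSSignBits, unsigned RHSSignBits) {
  unsigned LHSValBits = 24 - LHSSignBits + 1; // non-sign bits + 1
  unsigned RHSValBits = 24 - RHSSignBits + 1;
  unsigned Sum = LHSValBits + RHSValBits;
  if (Sum > 32)
    return 1;          // may overflow: only the trivial 1 sign bit is known
  return 32 - Sum + 1; // e.g. 24 + 1 valid bits -> 8 sign bits
}
```
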
Jay Foad ff971873b3 [GlobalISel] Fix legality checks for G_UBFX combines
1. Fix CombinerHelper::matchBitfieldExtractFromAnd to check legality
   with the correct types for the G_UBFX that it builds.
2. Fix AMDGPUTargetLowering::isConstantUnsignedBitfieldExtractLegal to
   match the legality rules: result and first operand can be s32 or s64
   but the "shift amount" operands are always s32.
3. Add AMDGPU tests where the post-legalizer combiner would create
   illegal MIR without the above fixes.

Differential Revision: https://reviews.llvm.org/D116802
2022-01-08 09:20:44 +00:00
Jay Foad 3f3fe4a5cf [GlobalISel] Fix typo Extact to Extract in function name. NFC. 2022-01-07 11:13:35 +00:00
Craig Topper cbcbbd6ac8 [ValueTracking][SelectionDAG] Rename ComputeMinSignedBits->ComputeMaxSignificantBits. NFC
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.

Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.

Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.

Reviewed By: lebedev.ri, RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D116522
2022-01-03 11:33:30 -08:00
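
For a fully known value the count is exact: significant bits is bit width minus sign bits plus one, the minimum width that still represents the signed value (sketch mine; for partially known values the function returns an upper bound).

```cpp
#include <bit>
#include <cstdint>

// Minimum number of bits needed to represent x as a signed value.
unsigned maxSignificantBits(int32_t x) {
  unsigned SignBits = (x < 0) ? std::countl_one(uint32_t(x))
                              : std::countl_zero(uint32_t(x));
  return 32 - SignBits + 1;
}
// maxSignificantBits(-1)  == 1  (every bit is a copy of the sign)
// maxSignificantBits(127) == 8  (01111111)
```
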
Ron Lieberman 09b53296cf Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range"
This reverts commit 9075009d1f.

Failed amdgpu runtime buildbot #3514.
2021-12-22 11:39:28 -05:00
RamNalamothu 9075009d1f [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range
Currently the return address ABI registers s[30:31], which fall in the call
clobbered register range, are added as live-ins on the function entry to
preserve their value when we have calls, so that they get saved and restored
around the calls.

But the DWARF unwind information (CFI) needs to track where the return address
resides in a frame, and the above approach makes it difficult to track the
return address when the CFI information is emitted during frame lowering,
since that would involve understanding the control flow.

This patch moves the return address ABI registers s[30:31] into callee saved
registers range and stops adding live-in for return address registers, so that
the CFI machinery will know where the return address resides when CSR
save/restore happen during the frame lowering.

Doing the above poses an issue: the return instruction now uses the undefined
register `sgpr30_sgpr31`. This is resolved by hiding the return address register
use by the return instruction behind the `SI_RETURN` pseudo instruction, which
doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to
`S_SETPC_B64_return` during `expandPostRAPseudo()`.

As an added benefit, this patch simplifies overall return instruction handling.

Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D114652
2021-12-22 20:51:12 +05:30
Jay Foad d7e03df719 [AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32
Select SelectionDAG ops smul_lohi/umul_lohi to
v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0.
v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions
so it is better to use one instruction than two.

Further improvements are possible to make better use of the addend
operand, but this is already a strict improvement over what we have
now.

Differential Revision: https://reviews.llvm.org/D113986
2021-11-24 11:25:02 +00:00
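
The selected pattern at the source level (illustration mine): a full-width product of two 32-bit values, which smul_lohi/umul_lohi expose as a (low, high) pair and which one mad instruction with a zero addend now produces.

```cpp
#include <cstdint>

// Full 64-bit product of 32-bit inputs; on AMDGPU this can now select to
// v_mad_i64_i32(a, b, 0) instead of separate v_mul_lo/v_mul_hi ops.
int64_t wideningMul(int32_t a, int32_t b) {
  return int64_t(a) * int64_t(b);
}
```
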
Kazu Hirata efa896e5f7 [Target] Use SDNode::uses (NFC) 2021-11-12 21:23:04 -08:00
Jay Foad a70bbb5f7a [AMDGPU] Simplify 64-bit division/remainder expansion
The old expansion open-coded a 64-bit addition in a strange way, by
adding the high parts *without* carry-in from the low part, and then
adding the carry back in later on. Fixing this saves a couple of
instructions and makes the code much easier to understand.

Differential Revision: https://reviews.llvm.org/D113679
2021-11-12 15:48:41 +00:00
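
For reference, the straightforward 64-bit addition out of 32-bit halves that the fixed expansion corresponds to (illustration mine): the carry from the low halves is propagated directly into the high halves.

```cpp
#include <cstdint>

// Add two 64-bit values given as (lo, hi) 32-bit pairs.
void add64(uint32_t ALo, uint32_t AHi, uint32_t BLo, uint32_t BHi,
           uint32_t &Lo, uint32_t &Hi) {
  Lo = ALo + BLo;
  uint32_t Carry = Lo < ALo; // wrapped iff the sum is below an operand
  Hi = AHi + BHi + Carry;
}
```
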
Simon Pilgrim f059b04f7b [DAG] Add SelectionDAG::ComputeMinSignedBits helper
As suggested on D113371, this adds a wrapper to SelectionDAG::ComputeNumSignBits, similar to the llvm::ComputeMinSignedBits wrapper.

I've included some usage; it's not exhaustive, just the cases where the intention is obvious.

Differential Revision: https://reviews.llvm.org/D113396
2021-11-08 14:12:45 +00:00
Thomas Symalla 76cbe62262 [AMDGPU] Change the AMDGPU_Gfx calling convention by making SGPRs 4..29 callee-save, to avoid superfluous s_movs when executing amdgpu_gfx function calls, as the callee is likely not going to change the argument values.
This patch changes the AMDGPU_Gfx calling convention. It defines the SGPR registers s[4:29] as callee-save and leaves some SGPRs usable for callers. The intention is to avoid unnecessary s_mov instructions for arguments the caller would otherwise save and restore in these registers.

Reviewed By: sebastian-ne

Differential Revision: https://reviews.llvm.org/D111637
2021-11-04 21:50:18 +01:00
Jay Foad 21a1d4cf71 [AMDGPU] Change numBitsSigned for simplicity and document it. NFC.
Change numBitsSigned to return the minimum size of a signed integer that
can hold the value. This is different by one from the previous result
but is more consistent with numBitsUnsigned. Update all callers. All
callers are now more consistent between the signed and unsigned cases,
and some callers get simpler, especially the ones that deal with
quantities like numBitsSigned(LHS) + numBitsSigned(RHS).

Differential Revision: https://reviews.llvm.org/D112813
2021-10-29 14:22:06 +01:00
Kazu Hirata 01b4789b62 [AMDGPU] Remove hasDefinedInitializer (NFC)
The last use was removed on Sep 16, 2021 in commit
7a62a5b56d.
2021-10-28 20:33:34 -07:00
Abinav Puthan Purayil 2da6ef3664 [AMDGPU] Add 24-bit mulhi intrinsics in INTRINSIC_WO_CHAIN combine.
The mul24 intrinsics' operands are simplified by
AMDGPUTargetLowering::performIntrinsicWOChainCombine(). This change adds
the mul24hi intrinsics to that combine, since their operands can be
simplified in the same way.

Differential Revision: https://reviews.llvm.org/D112702
2021-10-28 16:57:48 +05:30
Roman Lebedev 684cbae89a [KnownBits] Introduce `countMaxActiveBits()` and use it in a few places 2021-10-11 23:36:06 +03:00
Christudasan Devadasan 7a62a5b56d [AMDGPU] Legalize initialized LDS variables
We don't allow an initializer for LDS variables
and there is an early abort during instruction
selection. This patch legalizes them by ignoring
the init values. During assembly emission, proper
error reporting already exists for such instances.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109901
2021-09-23 22:53:20 -04:00
Jacob Lambert dc6e8dfdfe [AMDGPU][NFC] Correct typos in lib/Target/AMDGPU/AMDGPU*.cpp files. Test commit for new contributor. 2021-09-20 14:48:50 -07:00