Noticed while I was touching other nearby code. I don't have a
test where this matters because the targets I work on
use zero or one boolean contents. And the tests cases I've seen
this fire on happen before type legalization where the result type
is MVT::i1 so the distinction doesn't matter.
There was code to handle the first operand being different than
the result type. And code to handle first operand having the
same type as the type to extend from. This should never happen
for a correctly formed SIGN_EXTEND_INREG. I've replace the
code with asserts.
I also noticed we created the same APInt twice so I've reused it.
Add DemandedElts support inside the TRUNCATE analysis.
REAPPLIED - this was reverted by @hans at rGa51226057fc3 due to an issue with vector shift amount types, which was fixed in rG935bacd3a724 and an additional test case added at rG0ca81b90d19d
Differential Revision: https://reviews.llvm.org/D56387
As noticed on D56387, for vectors we must always correctly adjust the shift amount type during truncation (not just after legalization). We were getting away with it as we currently only accepted scalars via the dyn_cast<ConstantSDNode>.
It caused "Vector shift amounts must be in the same as their first arg"
asserts in Chromium builds. See the code review for repro instructions.
> Add DemandedElts support inside the TRUNCATE analysis.
>
> Differential Revision: https://reviews.llvm.org/D56387
This reverts commit cad4275d69.
Use the KnownBits icmp comparisons to determine when a ISD::UMIN/UMAX op is unnecessary should either op be known to be ULT/ULE or UGT/UGE than the other.
Differential Revision: https://reviews.llvm.org/D94532
These methods are recursive so a little costly.
We only look at the result in one place in this function and it's
conditional. We also only need the second call if the first had
enough returned enough sign bits.
This improves llvm::isConstOrConstSplat by allowing it to analyze
ISD::SPLAT_VECTOR nodes, in order to allow more constant-folding of
operations using scalable vector types.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D94168
Attempt to simplify all/any-of style patterns that concatenate 2 smaller integers together into an and(x,y)/or(x,y) + icmp 0/-1 instead.
This is mainly to help some bool predicate reduction patterns where we end up concatenating bool vectors that have been bitcasted to integers.
Differential Revision: https://reviews.llvm.org/D93599
This patch adds support for the fptoui.sat and fptosi.sat intrinsics,
which provide basically the same functionality as the existing fptoui
and fptosi instructions, but will saturate (or return 0 for NaN) on
values unrepresentable in the target type, instead of returning
poison. Related mailing list discussion can be found at:
https://groups.google.com/d/msg/llvm-dev/cgDFaBmCnDQ/CZAIMj4IBAAJ
The intrinsics have overloaded source and result type and support
vector operands:
i32 @llvm.fptoui.sat.i32.f32(float %f)
i100 @llvm.fptoui.sat.i100.f64(double %f)
<4 x i32> @llvm.fptoui.sat.v4i32.v4f16(half %f)
// etc
On the SelectionDAG layer two new ISD opcodes are added,
FP_TO_UINT_SAT and FP_TO_SINT_SAT. These opcodes have two operands
and one result. The second operand is an integer constant specifying
the scalar saturation width. The idea here is that initially the
second operand and the scalar width of the result type are the same,
but they may change during type legalization. For example:
i19 @llvm.fptsi.sat.i19.f32(float %f)
// builds
i19 fp_to_sint_sat f, 19
// type legalizes (through integer result promotion)
i32 fp_to_sint_sat f, 19
I went for this approach, because saturated conversion does not
compose well. There is no good way of "adjusting" a saturating
conversion to i32 into one to i19 short of saturating twice.
Specifying the saturation width separately allows directly saturating
to the correct width.
There are two baseline expansions for the fp_to_xint_sat opcodes. If
the integer bounds can be exactly represented in the float type and
fminnum/fmaxnum are legal, we can expand to something like:
f = fmaxnum f, FP(MIN)
f = fminnum f, FP(MAX)
i = fptoxi f
i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN
If the bounds cannot be exactly represented, we expand to something
like this instead:
i = fptoxi f
i = select f ult FP(MIN), MIN, i
i = select f ogt FP(MAX), MAX, i
i = select f uo f, 0, i # unnecessary if unsigned as 0 = MIN
It should be noted that this expansion assumes a non-trapping fptoxi.
Initial tests are for AArch64, x86_64 and ARM. This exercises all of
the scalar and vector legalization. ARM is included to test float
softening.
Original patch by @nikic and @ebevhan (based on D54696).
Differential Revision: https://reviews.llvm.org/D54749
X86 and AArch64 expand it as libcall inside the target. And PowerPC also
want to expand them as libcall for P8. So, propose an implement in the
legalizer to common the logic and remove the code for X86/AArch64 to
avoid the duplicate code.
Reviewed By: Craig Topper
Differential Revision: https://reviews.llvm.org/D91331
Changes in this patch:
- Minor changes to the LowerVECREDUCE_SEQ_FADD function added by @cameron.mcinally
to also work for scalable types
- Added TableGen patterns for FP reductions with unpacked types (nxv2f16, nxv4f16 & nxv2f32)
- Asserts added to expandFMINNUM_FMAXNUM & expandVecReduceSeq for scalable types
Reviewed By: cameron.mcinally
Differential Revision: https://reviews.llvm.org/D93050
Move the X86 VSELECT->UADDSAT fold to DAGCombiner - there's nothing target specific about these folds.
The SSE42 test diffs are relatively benign - its avoiding an extra constant load in exchange for an extra xor operation - there are extra register moves, which is annoying as all those operations should commute them away.
Differential Revision: https://reviews.llvm.org/D91876
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.
Allows us to move the x86-specific lowering to the generic expansion code.
Differential Revision: https://reviews.llvm.org/D92183
If usubsat() is legal, this is likely to result in smaller codegen expansion than the default cmp+select codegen expansion.
Allows us to move the x86-specific lowering to the generic expansion code.
If smax() is legal, this is likely to result in smaller codegen expansion for abs(x) than the xor(add,ashr) method.
This is also what PowerPC has been doing for its abs implementation, so it lets us get rid of a load of custom lowering code there (and which was never updated when they added smax lowering).
Alive2: https://alive2.llvm.org/ce/z/xRk3cD
Differential Revision: https://reviews.llvm.org/D92095
`SimplifySetCC` invokes `getNodeIfExists` without passing `Flags` argument and `getNodeIfExists` uses a default `SDNodeFlags` to intersect the original flags, as a consequence, flags like `nsw` is dropped. Added a new helper function `doesNodeExist` to check if a node exists without modifying its flags.
Reviewed By: #powerpc, nemanjai
Differential Revision: https://reviews.llvm.org/D89938
This is part of the discussion on D91876 about trying to reduce custom lowering of MIN/MAX ops on older SSE targets - if we can improve generic vector expansion we should be able to relax the limitations in SelectionDAGBuilder when it will let MIN/MAX ops be generated, and avoid having to flag so many ops as 'custom'.
Lowers the llvm.masked.scatter intrinsics (scalar plus vector addressing mode only)
Changes included in this patch:
- Custom lowering for MSCATTER, which chooses the appropriate scatter store opcode to use.
Floating-point scatters are cast to integer, with patterns added to match FP reinterpret_casts.
- Added the getCanonicalIndexType function to convert redundant addressing
modes (e.g. scaling is redundant when accessing bytes)
- Tests with 32 & 64-bit scaled & unscaled offsets
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D90941
Add a TLI hook to allow SelectionDAG to fine tune the conversion of CTPOP to a chain of "x & (x - 1)" when CTPOP isn't legal.
A subsequent patch will attempt to fine tune the X86 code gen.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D89952
This patch uses the existing LowerFixedLengthReductionToSVE function to also lower
scalable vector reductions. A separate function has been added to lower VECREDUCE_AND
& VECREDUCE_OR operations with predicate types using ptest.
Lowering scalable floating-point reductions will be addressed in a follow up patch,
for now these will hit the assertion added to expandVecReduce() in TargetLowering.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D89382
Add Legalization support for VECREDUCE_SEQ_FADD, so that we don't need to depend on ExpandReductionsPass.
Differential Revision: https://reviews.llvm.org/D90247
For i1 types, boolean false is represented identically regardless of
the boolean content, so we can allow optimizations that otherwise
would not be correct for booleans with false represented as a negative
one.
Patch by Erik Hogeman.
Differential Revision: https://reviews.llvm.org/D90145
Some of our conversion algorithms produce -0.0 when converting unsigned i64 to double when the rounding mode is round toward negative. This switches them to other algorithms that don't have this problem. Since it is undefined behavior to change rounding mode with the non-strict nodes, this patch only changes the behavior for strict nodes.
There are still problems with unsigned i32 conversions too which I'll try to fix in another patch.
Fixes part of PR47393
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D87115
Updates an optimization that relies on boolean contents being either 0
or 1 to properly check for this before triggering.
The following:
(X & 8) != 0 --> (X & 8) >> 3
Produces unexpected results when a boolean 'true' value is represented
by negative one.
Patch by Erik Hogeman.
Differential Revision: https://reviews.llvm.org/D89390
This enables these transforms for vectors:
(ctpop x) u< 2 -> (x & x-1) == 0
(ctpop x) u> 1 -> (x & x-1) != 0
(ctpop x) == 1 --> (x != 0) && ((x & x-1) == 0)
(ctpop x) != 1 --> (x == 0) || ((x & x-1) != 0)
All enabled if CTPOP isn't Legal. This differs from the scalar
behavior where the first two are done unconditionally and the
last two are done if CTPOP isn't Legal or Custom. The Legal
check produced better results for vectors based on X86's
custom handling. Might be worth re-visiting scalars here.
I disabled the looking through truncate for vectors. The
code that creates new setcc can use the same result VT as the
original setcc even if we truncated the input. That may work
work for most scalars, but definitely wouldn't work for vectors
unless it was a vector of i1.
Fixes or at least improves PR47825
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D89346
In certain places in llvm/lib/CodeGen we were relying upon the TypeSize
comparison operators when in fact the code was only ever expecting
either scalar values or fixed width vectors. I've changed some of these
places to use the equivalent scalar operator.
Differential Revision: https://reviews.llvm.org/D88482
In certain places in the code we can never end up in a situation where
we're mixing fixed width and scalable vector types. For example,
we can't have truncations and extends that change the lane count. Also,
in other places such as GenWidenVectorStores and GenWidenVectorLoads we
know from the behaviour of FindMemType that we can never choose a vector
type with a different scalable property.
In various places I have used EVT::bitsXY functions instead of
TypeSize::isKnownXY, where it probably makes sense to keep an assert
that scalable properties match.
Differential Revision: https://reviews.llvm.org/D88654
As requested in D89346. This allows us to add some early outs.
I reordered some checks a little bit to make the more common bail outs happen earlier. Like checking opcode before checking hasOneUse. And I moved the bit width check to make sure it was safe to look through a truncate to the spot where we look through truncates instead of after.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D89494
This combine can look through (trunc (ctpop X)). When doing this
it tries to make sure the trunc doesn't lose any information
from the ctpop. It does this by checking that the truncated type
has more bits that Log2_32_Ceil of the ctpop type. The Ceil is
unnecessary and pessimizes non-power of 2 types.
For example, ctpop of i256 requires 9 bits to represent the max
value of 256. But ctpop of i255 only requires 8 bits to represent
the max result of 255. Log2_32_Ceil of 256 and 255 both return 8
while Log2_32 returns 8 for 256 and 7 for 255
The code with popcnt enabled is a regression for this test case,
but it does match what already happens with i256 truncated to i9.
Since power of 2 is more likely, I don't think it should block
this change.
Differential Revision: https://reviews.llvm.org/D89412
We were already doing this for integer constants. This patch implements
the same thing for floating point constants.
Differential Revision: https://reviews.llvm.org/D88570
getNode handling for ISD:SETCC calls FoldSETCC which can canonicalize
FP constants to the RHS. When this happens we should create the node
with the FMF that was requested. By using FlagInserter when can ensure
any calls to getNode/getSetcc during canonicalization will also get the flags.
Differential Revision: https://reviews.llvm.org/D88063
This reverts partial of a2fb5446 (actually, 2508ef01) about removing
negated FP constant immediately if it has no uses. However, as discussed
in bug 47517, there're cases when NegX is folded into constant from
other places while NegY is removed by that line of code and NegX is
equal to NegY. In these cases, NegX is deleted before used and crash
happens. So revert the code and add necessary test case.
When we know that a particular type is always going to be fixed
width we have so far been writing code like this:
getSizeInBits().getFixedSize()
Since we are doing this in quite a few places now it seems to make
sense to add a new helper function that allows us to replace
these calls with a single getFixedSizeInBits() call.
Differential Revision: https://reviews.llvm.org/D88649
2508ef01 fixed a bug about constant removal in negation. But after
sanitizing check I found there's still some issue about it so it's
reverted.
Temporary nodes will be removed if useless in negation. Before the
removal, they'd be checked if any other nodes used it. So the removal
was moved after getNode. However in rare cases the node to be removed is
the same as result of getNode. We missed that and will be fixed by this
patch.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D87614
2508ef01 doesn't totally fix the issue since we did not handle the case
when unused temporary negated result is the same with the result, which
is found by address sanitizer.
960cbc53 immediately removes nodes that won't be used to avoid
compilation time explosion. This patch adds the removal to constants to
fix PR47517.
Reviewed By: RKSimon, steven.zhang
Differential Revision: https://reviews.llvm.org/D87614
The versions that take 'unsigned' will be removed in the future.
I tried to use getOriginalAlign instead of getAlign in some
places. getAlign factors in the minimum alignment implied by
the offset in the pointer info. Since we're also passing the
pointer info we can use the original alignment.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D87592
Previously, we formed ISD::PARITY by looking for (and (ctpop X), 1)
but the AND might be separated from the ctpop. For example if the
parity result is multiplied by 2, we'll pull the AND through the
shift.
So to handle more cases, move to SimplifyDemandedBits where we
can handle more cases that result in only the LSB of the CTPOP
being used.
As discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/140729.html
This is hopefully the final remaining showstopper before we can remove
the 'experimental' from the reduction intrinsics.
No behavior was specified for the FP min/max reductions, so we have a
mess of different interpretations.
There are a few potential options for the semantics of these max/min ops.
I think this is the simplest based on current behavior/implementation:
make the reductions inherit from the existing llvm.maxnum/minnum intrinsics.
These correspond to libm fmax/fmin, and those are similar to the (now
deprecated?) IEEE-754 maxNum/minNum functions (NaNs are treated as missing
data). So the default expansion creates calls to libm functions.
Another option would be to inherit from llvm.maximum/minimum (NaNs propagate),
but most targets just crash in codegen when given those nodes because no
default expansion was ever implemented AFAICT.
We could also just assume 'nnan' semantics by default (we are already
assuming 'nsz' semantics in the maxnum/minnum intrinsics), but some targets
(AArch64, PowerPC) support the more defined behavior, so it doesn't make much
sense to not allow a tighter spec. Fast-math-flags (nnan) can be used to
loosen the semantics.
(Note that D67507 was proposed to update the LangRef to acknowledge the more
recent IEEE-754 2019 standard, but that patch seems to have stalled. If we do
update based on the new standard, the reduction instructions can seamlessly
inherit from whatever updates are made to the max/min intrinsics.)
x86 sees a regression here on 'nnan' tests because we have underlying,
longstanding bugs in FMF creation/propagation. Those need to be fixed apart
from this change (for example: https://llvm.org/PR35538). The expansion
sequence before this patch may not have been correct.
Differential Revision: https://reviews.llvm.org/D87391
This removes the after the fact FMF handling from D46854 in favor of passing fast math flags to getNode. This should be a superset of D87130.
This required adding a SDNodeFlags to SelectionDAG::getSetCC.
Now we manage to contant fold some stuff undefs during the
initial getNode that we don't do in later DAG combines.
Differential Revision: https://reviews.llvm.org/D87200
I have fixed up some more ElementCount/TypeSize related warnings in
the following tests:
CodeGen/AArch64/sve-split-extract-elt.ll
CodeGen/AArch64/sve-split-insert-elt.ll
In SelectionDAG::CreateStackTemporary we were relying upon the implicit
cast from TypeSize -> uint64_t when calling MachineFrameInfo::CreateStackObject.
I've fixed this by passing in the known minimum size instead, which I
believe is fine because the associated stack id indicates whether this
is a scalable object or not.
I've also fixed up a case in TargetLowering::SimplifyDemandedBits when
extracting a vector element from a scalable vector. The result is a scalar,
hence it wasn't caught at the start of the function. If the vector is
scalable we just bail out for now.
Differential Revision: https://reviews.llvm.org/D86431
I have fixed up a number of warnings resulting from TypeSize -> uint64_t
casts and calling getVectorNumElements() on scalable vector types. I
think most of the changes are fairly trivial except for those in
DAGTypeLegalizer::SplitVecRes_MLOAD I've tried to ensure we create
the MachineMemoryOperands in a sensible way for scalable vectors.
I have added a CHECK line to the following test:
CodeGen/AArch64/sve-split-load.ll
that ensures no new warnings are added.
Differential Revision: https://reviews.llvm.org/D86697
This is the follow up patch for https://reviews.llvm.org/D86183 as we miss to delete the node if NegX == NegY, which has use after we create the node.
```
if (NegX && (CostX <= CostY)) {
Cost = std::min(CostX, CostZ);
RemoveDeadNode(NegY);
return DAG.getNode(Opcode, DL, VT, NegX, Y, NegZ, Flags); #<-- NegY is used here if NegY == NegX.
}
```
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D86689
D77152 tried to do this but got it wrong in the shift-by-zero case.
D86430 reverted the wrong code. Reimplement the optimization with
different code depending on whether the shift amount is known to be
non-zero (modulo bitwidth).
This improves code quality for fshl tests on AMDGPU, which only has an
fshr instruction.
Differential Revision: https://reviews.llvm.org/D86438
This is a fixup of commit 0819a6416f (D77152) which could
result in miscompiles. The miscompile could only happen for targets
where isOperationLegalOrCustom could return different values for
FSHL and FSHR.
The commit mentioned above added logic in expandFunnelShift to
convert between FSHL and FSHR by swapping direction of the
funnel shift. However, that transform is only legal if we know
that the shift count (modulo bitwidth) isn't zero.
Basically, since fshr(-1,0,0)==0 and fshl(-1,0,0)==-1 then doing a
rewrite such as fshr(X,Y,Z) => fshl(X,Y,0-Z) would be incorrect if
Z modulo bitwidth, could be zero.
```
$ ./alive-tv /tmp/test.ll
----------------------------------------
define i32 @src(i32 %x, i32 %y, i32 %z) {
%0:
%t0 = fshl i32 %x, i32 %y, i32 %z
ret i32 %t0
}
=>
define i32 @tgt(i32 %x, i32 %y, i32 %z) {
%0:
%t0 = sub i32 32, %z
%t1 = fshr i32 %x, i32 %y, i32 %t0
ret i32 %t1
}
Transformation doesn't verify!
ERROR: Value mismatch
Example:
i32 %x = #x00000000 (0)
i32 %y = #x00000400 (1024)
i32 %z = #x00000000 (0)
Source:
i32 %t0 = #x00000000 (0)
Target:
i32 %t0 = #x00000020 (32)
i32 %t1 = #x00000400 (1024)
Source value: #x00000000 (0)
Target value: #x00000400 (1024)
```
It could be possible to add back the transform, given that logic
is added to check that (Z % BW) can't be zero. Since there were
no test cases proving that such a transform actually would be useful
I decided to simply remove the faulty code in this patch.
Reviewed By: foad, lebedev.ri
Differential Revision: https://reviews.llvm.org/D86430
In SelectionDAGBuilder always translate the fshl and fshr intrinsics to
FSHL and FSHR (or ROTL and ROTR) instead of lowering them to shifts and
ORs. Improve the legalization of FSHL and FSHR to avoid code quality
regressions.
Differential Revision: https://reviews.llvm.org/D77152
In DAGTypeLegalizer::GenWidenVectorLoads the algorithm assumes it only
ever deals with fixed width types, hence the offsets for each individual
store never take 'vscale' into account. I've changed the code in that
function to use TypeSize instead of unsigned for tracking the remaining
load amount. In addition, I've changed the load loop to use the new
IncrementPointer helper function for updating the addresses in each
iteration, since this handles scalable vector types.
Also, I've added report_fatal_errors in GenWidenVectorExtLoads,
TargetLowering::scalarizeVectorLoad and TargetLowering::scalarizeVectorStores,
since these functions currently use a sequence of element-by-element
scalar loads/stores. In a similar vein, I've also added a fatal error
report in FindMemType for the case when we decide to return the element
type for a scalable vector type.
I've added new tests in
CodeGen/AArch64/sve-split-load.ll
CodeGen/AArch64/sve-ld-addressing-mode-reg-imm.ll
for the changes in GenWidenVectorLoads.
Differential Revision: https://reviews.llvm.org/D85909
When the result type of insertelement needs to be split,
SplitVecRes_INSERT_VECTOR_ELT will try to store the vector to a
stack temporary, store the element at the location of the stack
temporary plus the index, and reload the Lo/Hi parts.
This patch does the following to ensure this works for scalable vectors:
- Sets the StackID with getStackIDForScalableVectors() in CreateStackTemporary
- Adds an IsScalable flag to getMemBasePlusOffset() and scales the
offset by VScale when this is true
- Ensures the immediate is clamped correctly by clampDynamicVectorIndex
so that we don't try to use an out of range index
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D84874
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D85220
These aren't the canonical forms we'd get from InstCombine, but
we do have X86 tests for them. Recognizing them is pretty cheap.
While there make use of APInt:isSignedMinValue/isSignedMaxValue
instead of creating a new APInt to compare with. Also use
SelectionDAG::getAllOnesConstant helper to hide the all ones
APInt creation.
Try to be more consistent with the SDLoc param in the TargetLowering methods.
This also exposes an issue where we were passing a SDNode as a SDLoc, relying on the implicit SDLoc(SDNode) constructor.
In an upcoming AMDGPU patch, the scalar cases will be legal and vector
ops should be scalarized, rather than producing a long sequence of
vector ops which will also need to be scalarized.
Use a lazy heuristic that seems to work and improves the thumb2 MVE
test.
Summary:
This patch modifies IncrementMemoryAddress to use a vscale
when calculating the new address if the data type is scalable.
Also adds tablegen patterns which match an extract_subvector
of a legal predicate type with zip1/zip2 instructions
Reviewers: sdesmalen, efriedma, david-arm
Reviewed By: efriedma, david-arm
Subscribers: tschuett, hiraditya, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D83137
Use a simpler code sequence when the shift amount is known not to be
zero modulo the bit width.
Nothing much uses this until D77152 changes the translation of fshl and
fshr intrinsics.
Differential Revision: https://reviews.llvm.org/D82540
Using a negation instead of a subtraction from a constant can save an
instruction on some targets.
Nothing much uses this until D77152 changes the translation of fshl and
fshr intrinsics.
Differential Revision: https://reviews.llvm.org/D82539
If a constant is only allsignbits in the demanded/active bits, then sign extend it to an allsignbits bool pattern for OR/XOR ops.
This also requires SimplifyDemandedBits XOR handling to be modified to call ShrinkDemandedConstant on any (non-NOT) XOR pattern to account for non-splat cases.
Next step towards fixing PR45808 - with this patch we now get a <-1,-1,0,0> v4i64 constant instead of <1,1,0,0>.
Differential Revision: https://reviews.llvm.org/D82257
Pre-commit for D82257, this adds a DemandedElts arg to ShrinkDemandedConstant/targetShrinkDemandedConstant which will allow future patches to (optionally) add vector support.
Implement them on top of sdiv/udiv, similar to what we do for integer
types.
Potential future work: implementing i8/i16 srem/urem, optimizations for
constant divisors, optimizing the mul+sub to mls.
Differential Revision: https://reviews.llvm.org/D81511
Summary:
This patch adds base support for code generating fixed length
vector operations targeting a known SVE vector length. To achieve
this we lower fixed length vector operations to equivalent scalable
vector operations, whereby SVE predication is used to limit the
elements processed to those present within the fixed length vector.
Specifically this patch implements load and store operations, which
get lowered to their masked counterparts thusly:
V = load(Addr) =>
V = extract_fixed_vector(masked_load(make_pred(V.NumElts), Addr))
store(V, (Addr)) =>
masked_store(insert_fixed_vector(V), make_pred(V.NumElts), Addr))
Reviewers: rengolin, efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D80385
We have many cases where we call SimplifyMultipleUseDemandedBits and demand specific vector elements, but all the bits from them - this adds a helper wrapper to handle this.
For little endian targets, if we only need the lowest element and none of the extended bits then we can just use the (bitcasted) source vector directly.
We already do this in SimplifyDemandedBits, this adds the SimplifyMultipleUseDemandedBits equivalent.
For now I have changed SimplifyDemandedBits and it's various callers
to assume we know nothing for scalable vectors and to ignore the
demanded bits completely. I have also done something similar for
SimplifyDemandedVectorElts. These changes fix up lots of warnings
due to calls to EVT::getVectorNumElements() for types with scalable
vectors. These functions are all used for optimisations, rather than
functional requirements. In future we can revisit this code if
there is a need to improve code quality for SVE.
Differential Revision: https://reviews.llvm.org/D80537
Just computing the alignment makes sense without caring about the
general known bits, such as for non-integral pointers. Separate the
two and start calling into the TargetLowering hooks for frame indexes.
Start calling the TargetLowering implementation for FrameIndexes,
which improves the AMDGPU matching for stack addressing modes. Also
introduce a new hook for returning known alignment of target
instructions. For AMDGPU, it would be useful to report the known
alignment implied by certain intrinsic calls.
Also stop using MaybeAlign.
This wasn't getting much value from the DAG or depth arguments, since
it's only called on the frame index root nodes. FrameIndexes can also
only return a scalar value, so it also didn't need DemandedElts.
If we're only demanding the (shifted) sign bits of the shift source value, then we can use the value directly.
This handles SimplifyDemandedBits/SimplifyMultipleUseDemandedBits for both ISD::SHL and X86ISD::VSHLI.
Differential Revision: https://reviews.llvm.org/D80869
For the supported binops (basic arithmetic, logicals + shifts), if we fail to simplify the demanded vector elts, then call SimplifyMultipleUseDemandedBits and try to peek through ops to remove unnecessary dependencies.
This helps with PR40502.
Differential Revision: https://reviews.llvm.org/D79003
For the 'inverse shift', we currently always perform a subtraction of the original (masked) shift amount.
But for the case where we are handling power-of-2 type widths, we can replace:
(sub bw-1, (and amt, bw-1) ) -> (and (xor amt, bw-1), bw-1) -> (and ~amt, bw-1)
This allows x86 shifts to fold away the and-mask.
Followup to D77301 + D80466.
http://volta.cs.utah.edu:8080/z/Nod0Gr
Differential Revision: https://reviews.llvm.org/D80489
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.
In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.
This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.
The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.
The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer fo the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.
Force any function containing a preallocated call to use the frame
pointer.
Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.
Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).
Aside from the tests added here, I checked that this codegen produces
correct code for something like
```
struct A {
A();
A(A&&);
~A();
};
void bar() {
foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```
by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.
Reverted due to unexpectedly passing tests, added REQUIRES: asserts for reland.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77689
See https://reviews.llvm.org/D74651 for the preallocated IR constructs
and LangRef changes.
In X86TargetLowering::LowerCall(), if a call is preallocated, record
each argument's offset from the stack pointer and the total stack
adjustment. Associate the call Value with an integer index. Store the
info in X86MachineFunctionInfo with the integer index as the key.
This adds two new target independent ISDOpcodes and two new target
dependent Opcodes corresponding to @llvm.call.preallocated.{setup,arg}.
The setup ISelDAG node takes in a chain and outputs a chain and a
SrcValue of the preallocated call Value. It is lowered to a target
dependent node with the SrcValue replaced with the integer index key by
looking in X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to an
%esp adjustment, the exact amount determined by looking in
X86MachineFunctionInfo with the integer index key.
The arg ISelDAG node takes in a chain, a SrcValue of the preallocated
call Value, and the arg index int constant. It produces a chain and the
pointer fo the arg. It is lowered to a target dependent node with the
SrcValue replaced with the integer index key by looking in
X86MachineFunctionInfo. In
X86TargetLowering::EmitInstrWithCustomInserter() this is lowered to a
lea of the stack pointer plus an offset determined by looking in
X86MachineFunctionInfo with the integer index key.
Force any function containing a preallocated call to use the frame
pointer.
Does not yet handle a setup without a call, or a conditional call.
Does not yet handle musttail. That requires a LangRef change first.
Tried to look at all references to inalloca and see if they apply to
preallocated. I've made preallocated versions of tests testing inalloca
whenever possible and when they make sense (e.g. not alloca related,
inalloca edge cases).
Aside from the tests added here, I checked that this codegen produces
correct code for something like
```
struct A {
A();
A(A&&);
~A();
};
void bar() {
foo(foo(foo(foo(foo(A(), 4), 5), 6), 7), 8);
}
```
by replacing the inalloca version of the .ll file with the appropriate
preallocated code. Running the executable produces the same results as
using the current inalloca implementation.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77689
We have the getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression.
However, during negating the expression, the cost might change as we are changing the DAG,
and then, hit the assertion if we negated the wrong expression as the cost is not trustful anymore.
This patch is target to remove the getNegatibleCost to avoid the out of sync with getNegatedExpression,
and check the cost during negating the expression. It also reduce the duplicated code between
getNegatibleCost and getNegatedExpression. And fix the crash for the test in D76638
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D77319
Use an extra shift-by-1 instead of a compare and select to handle the
shift-by-zero case. This sometimes saves one instruction (if the compare
couldn't be combined with a previous instruction). It also works better
on targets that don't have good select instructions.
Note that currently this change doesn't affect most targets because
expandFunnelShift is not used because funnel shift intrinsics are
lowered early in SelectionDAGBuilder. But there is work afoot to change
that; see D77152.
Differential Revision: https://reviews.llvm.org/D77301
We have the getNegatibleCost/getNegatedExpression to evaluate the cost and negate the expression.
However, during negating the expression, the cost might change as we are changing the DAG,
and then, hit the assertion if we negated the wrong expression as the cost is not trustful anymore.
This patch is target to remove the getNegatibleCost to avoid the out of sync with getNegatedExpression,
and check the cost during negating the expression. It also reduce the duplicated code between
getNegatibleCost and getNegatedExpression. And fix the crash for the test in D76638
Reviewed By: RKSimon, spatel
Differential Revision: https://reviews.llvm.org/D77319
If the SimplifyMultipleUseDemandedBits calls BITCASTs that peek through back to the original type then we can remove the BITCASTs entirely.
Differential Revision: https://reviews.llvm.org/D79572
Calling getShiftAmountTy with LegalTypes set may return a type that's too narrow to hold the shift amount for integer type it's applied to.
Fixes the regression introduced by D79096
Differential Revision: https://reviews.llvm.org/D79405
The two code paths have the same goal, legalizing a load of a non-byte-sized vector by loading the "flattened" representation in memory, slicing off each single element and then building a vector out of those pieces.
The technique employed by `ExpandLoad` is slightly more convoluted and produces slightly better codegen on ARM, AMDGPU and x86 but suffers from some bugs (D78480) and is wrong for BE machines.
Differential Revision: https://reviews.llvm.org/D79096
This method has been commented as deprecated for a while. Remove
it and replace all uses with the equivalent getCalledOperand().
I also made a few cleanups in here. For example, to removes use
of getElementType on a pointer when we could just use getFunctionType
from the call.
Differential Revision: https://reviews.llvm.org/D78882
This is a NFC patch for D77319. The idea is to hide the getNegatibleCost inside the getNegatedExpression()
to have it return null if the cost is expensive, and add some helper function for easy to use. And
rename the old getNegatedExpression to negateExpression to avoid the semantic conflict.
Reviewed By: RKSimon
Differential revision: https://reviews.llvm.org/D78291
This is a minor NFC change to make the code more clear. We have the NegatibleCost that
has cheaper, neutral, and expensive. Typically, the smaller one means the less cost.
It is inverse for current implementation, which makes following code not easy to read.
If (CostX > CostY) negate(X)
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D77993
Summary:
There are at least three clients for KnownBits calculations:
ValueTracking, SelectionDAG and GlobalISel. To reduce duplication the
common logic should be moved out of these clients and into KnownBits
itself.
This patch does this for AND, OR and XOR calculations by implementing
and using appropriate operator overloads KnownBits::operator& etc.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74060
This removes a call to getScalarType from a bunch of call sites.
It also makes the behavior consistent with SIGN_EXTEND_INREG.
Differential Revision: https://reviews.llvm.org/D77631
I think we can save the MRI argument from these since it's in
GISelKnownBits already, but currently not accessible.
Implementation deferred to avoid dependency on other patches.
ISD::ROTL/ROTR rotation values are guaranteed to act as a modulo amount, so for power-of-2 bitwidths we only need the lowest bits.
Differential Revision: https://reviews.llvm.org/D76201
PowerPC hits an assertion due to somewhat the same reason as https://reviews.llvm.org/D70975.
Though there are already some hack, it still failed with some case, when the operand 0 is NOT
a const fp, it is another fma that with const fp. And that const fp is negated which result in multi-uses.
A better fix is to check the uses of the negated const fp. If there are already use of its negated
value, we will have benefit as no extra Node is added.
Differential revision: https://reviews.llvm.org/D75501
I expect that the isCondCodeLegal checks should match that CC of
the node that we're going to create.
Rewriting to a switch to minimize repeated mentions of the same
constants.
We can only report the knownbits for a SCALAR_TO_VECTOR node if we only demand the 0'th element - the upper elements are undefined and shouldn't be trusted.
This is causing a number of regressions that need addressing but we need to get the bugfix in first.
Summary:
This patch adds intrinsics and ISelDAG nodes for signed
and unsigned fixed-point division:
```
llvm.sdiv.fix.sat.*
llvm.udiv.fix.sat.*
```
These intrinsics perform scaled, saturating division
on two integers or vectors of integers. They are
required for the implementation of the Embedded-C
fixed-point arithmetic in Clang.
Reviewers: bjope, leonardchan, craig.topper
Subscribers: hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71550
Use the SelectionDAG::getValidShiftAmountConstant helper to get const/constsplat shift amounts, which allows us to drop the out of range shift amount early-out.
First step towards better non-uniform shift amount support in SimplifyDemandedBits.
Summary:
This was a very odd API, where you had to pass a flag into a zext
function to say whether the extended bits really were zero or not. All
callers passed in a literal true or false.
I think it's much clearer to make the function name reflect the
operation being performed on the value we're tracking (rather than on
the KnownBits Zero and One fields), so zext means the value is being
zero extended and new function anyext means the value is being extended
with unknown bits.
NFC.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74482
The isNegatibleForFree/getNegatedExpression methods currently rely on a raw char value to indicate whether a negation is beneficial or not.
This patch replaces the char return value with an NegatibleCost enum to more clearly demonstrate what is implied.
It also renames isNegatibleForFree to getNegatibleCost to more accurately reflect whats going on.
Differential Revision: https://reviews.llvm.org/D74221
Summary: This patch introduces an API for MemOp in order to simplify and tighten the client code.
Reviewers: courbet
Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73964
We have to be careful in SimplifyDemandedBits with loads in case we attempt to combine back to a constant (which then gets turned into a constant pool load again), but we can at least set the upper KnownBits for a ZEXTLoad to zero.
This allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits to create a simpler ISD::INSERT_SUBVECTOR, which is particularly useful for cases where we're splitting into subvectors anyhow.
Summary: This is a first step before changing the types to llvm::Align and introduce functions to ease client code.
Reviewers: courbet
Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73785
This allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits to create a simpler ISD::EXTRACT_SUBVECTOR, which is particularly useful for cases where we're splitting into subvectors anyhow.
Differential Revision: This allows SimplifyDemandedBits to call SimplifyMultipleUseDemandedBits to create a simpler ISD::EXTRACT_SUBVECTOR, which is particularly useful for cases where we're splitting into subvectors anyhow.
This was moved in October 2018, but we don't appear to be using
this for vectors on any in tree target.
Moving it back simplifies D72794 so we can share the code for i32->f32.
Summary:
This always just used the same libcall as unordered, but the comparison predicate was different. This change appears to have been made when targets were given the ability to override the predicates. Before that they were hardcoded into the type legalizer. At that time we never inverted predicates and we handled ugt/ult/uge/ule compares by emitting an unordered check ORed with a ogt/olt/oge/ole checks. So only ordered needed an inverted predicate. Later ugt/ult/uge/ule were optimized to only call a single libcall and invert the compare.
This patch removes the ordered entries and just uses the inverting logic that is now present. This removes some odd things in both the Mips and WebAssembly code.
Reviewers: efriedma, ABataev, uweigand, cameron.mcinally, kpn
Reviewed By: efriedma
Subscribers: dschuff, sdardis, sbc100, arichardson, jgravelle-google, kristof.beyls, hiraditya, aheejin, sunfish, atanasyan, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72536
ONE is currently softened to OGT | OLT. But the libcalls for OGT and OLT libcalls will trigger an exception for QNAN. At least for X86 with libgcc. UEQ on the other hand uses UO | OEQ. The UO and OEQ libcalls will not trigger an exception for QNAN.
This patch changes ONE to use the inverse of the UEQ lowering. So we now produce O & UNE. Technically the existing behavior was correct for a signalling ONE, but since I don't know how to generate one of those from clang that seemed like something we can deal with later as we would need to fix other predicates as well. Also removing spurious exceptions seemed better than missing an exception.
There are also problems with quiet OGT/OLT/OLE/OGE, but those are harder to fix.
Differential Revision: https://reviews.llvm.org/D72477
If we're doing a compare that only tests the sign bit and only the sign bit is demanded, we can just bypass the node. This removes one of the blend dependencies in our v2i64->v2f32 uint_to_fp codegen on pre-sse4.2 targets.
Differential Revision: https://reviews.llvm.org/D72356
Summary:
This patch adds intrinsics and ISelDAG nodes for
signed and unsigned fixed-point division:
llvm.sdiv.fix.*
llvm.udiv.fix.*
These intrinsics perform scaled division on two
integers or vectors of integers. They are required
for the implementation of the Embedded-C fixed-point
arithmetic in Clang.
Patch by: ebevhan
Reviewers: bjope, leonardchan, efriedma, craig.topper
Reviewed By: craig.topper
Subscribers: Ka-Ka, ilya, hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70007
This isn't a functonal change since we also check the bit width is the
same and the input type is integer. This guarantees the input and
output type are the same. But passing the input type makes the code
more readable.
Previously, for vectors we created a vselect with a condition that
didn't match what the target wanted according to getSetCCResultType.
To make up for this, X86 had a special DAG combine to detect if
the condition was all sign bits and then insert its own truncate
or extend. By adding the extend/truncate here explicitly we can
avoid that.
This patch attempts to peek through vectors based on the demanded bits/elt of a particular ISD::EXTRACT_VECTOR_ELT node, allowing us to avoid dependencies on ops that have no impact on the extract.
In particular this helps remove some unnecessary scalar->vector->scalar patterns.
The wasm shift patterns are annoying - @tlively has indicated that the wasm vector shift codegen are to be refactored in the near-term and isn't considered a major issue.
Reapplied after reversion at rL368660 due to PR42982 which was fixed at rGca7fdd41bda0.
Differential Revision: https://reviews.llvm.org/D65887
When the "disable-tail-calls" attribute was added, checks were added for
it in various backends. Now this code has proliferated, and it is
something the target is responsible for checking. Move that
responsibility back to the ISels (fast, global, and SD).
There's no major functionality change, except for targets that never
implemented this check.
This LLVM attribute was originally added in
d9699bc7bd (2015).
Reviewers: echristo, MaskRay
Differential Revision: https://reviews.llvm.org/D72118
The NoFPExcept bit in SDNodeFlags currently defaults to true, unlike all
other such flags. This is a problem, because it implies that all code that
transforms SDNodes without copying flags can introduce a correctness bug,
not just a missed optimization.
This patch changes the default to false. This makes it necessary to move
setting the (No)FPExcept flag for constrained intrinsics from the
visitConstrainedIntrinsic routine to the generic visit routine at the
place where the other flags are set, or else the intersectFlagsWith
call would erase the NoFPExcept flag again.
In order to avoid making non-strict FP code worse, whenever
SelectionDAGISel::SelectCodeCommon matches on a set of orignal nodes
none of which can raise FP exceptions, it will preserve this property
on all results nodes generated, by setting the NoFPExcept flag on
those result nodes that would otherwise be considered as raising
an FP exception.
To check whether or not an SD node should be considered as raising
an FP exception, the following logic applies:
- For machine nodes, check the mayRaiseFPException property of
the underlying MI instruction
- For regular nodes, check isStrictFPOpcode
- For target nodes, check a newly introduced isTargetStrictFPOpcode
The latter is implemented by reserving a range of target opcodes,
similarly to how memory opcodes are identified. (Note that there a
bit of a quirk in identifying target nodes that are both memory nodes
and strict FP nodes. To simplify the logic, right now all target memory
nodes are automatically also considered strict FP nodes -- this could
be fixed by adding one more range.)
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D71841
This allows us to clean up some places that were peeking through
the MERGE_VALUES node after the call. By returning the SDValues
directly, we can clean that up.
Unfortunately, there are several call sites in AMDGPU that wanted
the MERGE_VALUES and now need to create their own.
This allows us to delete InlineAsm::Constraint_i workarounds in
SelectionDAGISel::SelectInlineAsmMemoryOperand overrides and
TargetLowering::getInlineAsmMemConstraint overrides.
They were introduced to X86 in r237517 to prevent crashes for
constraints like "=*imr". They were later copied to other targets.
Fix several several additional problems with the int <-> FP conversion
logic both in common code and in the X86 target. In particular:
- The STRICT_FP_TO_UINT expansion emits a floating-point compare. This
compare can raise exceptions and therefore needs to be a strict compare.
I've made it signaling (even though quiet would also be correct) as
signaling is the more usual default for an LT. This code exists both
in common code and in the X86 target.
- The STRICT_UINT_TO_FP expansion algorithm was incorrect for strict mode:
it emitted two STRICT_SINT_TO_FP nodes and then used a select to choose one
of the results. This can cause spurious exceptions by the STRICT_SINT_TO_FP
that ends up not chosen. I've fixed the algorithm to use only a single
STRICT_SINT_TO_FP instead.
- The !isStrictFPEnabled logic in DoInstructionSelection would sometimes do
the wrong thing because it calls getOperationAction using the result VT.
But for some opcodes, incuding [SU]INT_TO_FP, getOperationAction needs to
be called using the operand VT.
- Remove some (obsolete) code in X86DAGToDAGISel::Select that would mutate
STRICT_FP_TO_[SU]INT to non-strict versions unnecessarily.
Reviewed by: craig.topper
Differential Revision: https://reviews.llvm.org/D71840
This is an alternate fix for the bug discussed in D70595.
This also includes minimal tests for other in-tree targets to show the problem more
generally.
We check the number of uses as a predicate for whether some value is free to negate,
but that use count can change as we rewrite the expression in getNegatedExpression().
So something that was marked free to negate during the cost evaluation phase becomes
not free to negate during the rewrite phase (or the inverse - something that was not
free becomes free). This can lead to a crash/assert because we expect that everything
in an expression that is negatible to be handled in the corresponding code within
getNegatedExpression().
This patch adds a hack to work-around the case where we probably no longer detect
that either multiply operand of an FMA isNegatibleForFree which is assumed to be
true when we started rewriting the expression.
Differential Revision: https://reviews.llvm.org/D70975
This is an alternate fix for the bug discussed in D70595.
This also includes minimal tests for other in-tree targets to show the problem more
generally.
We check the number of uses as a predicate for whether some value is free to negate,
but that use count can change as we rewrite the expression in getNegatedExpression().
So something that was marked free to negate during the cost evaluation phase becomes
not free to negate during the rewrite phase (or the inverse - something that was not
free becomes free). This can lead to a crash/assert because we expect that everything
in an expression that is negatible to be handled in the corresponding code within
getNegatedExpression().
This patch adds a hack to work-around the case where we probably no longer detect
that either multiply operand of an FMA isNegatibleForFree which is assumed to be
true when we started rewriting the expression.
Differential Revision: https://reviews.llvm.org/D70975
of integers to floating point.
This includes some of Craig Topper's changes for promotion support from
D71130.
Differential Revision: https://reviews.llvm.org/D69275
Summary:
To find potential opportunities to use getMemBasePlusOffset() I looked at
all ISD::ADD uses found with the regex getNode\(ISD::ADD,.+,.+Ptr
in lib/CodeGen/SelectionDAG. If this patch is accepted I will convert
the files in the individual backends too.
The motivation for this change is our out-of-tree CHERI backend
(https://github.com/CTSRD-CHERI/llvm-project). We use a separate register
type to store pointers (128-bit capabilities, which are effectively
unforgeable and monotonic fat pointers). These capabilities permit a
reduced set of operations and therefore use a separate ValueType (iFATPTR).
to represent pointers implemented as capabilities.
Therefore, we need to avoid using ISD::ADD for our patterns that operate
on pointers and need to use a function that chooses ISD::ADD or a new
ISD::PTRADD opcode depending on the value type.
We originally added a new DAG.getPointerAdd() function, but after this
patch series we can modify the implementation of getMemBasePlusOffset()
instead. Avoiding direct uses of ISD::ADD for pointer types will
significantly reduce the amount of assertion/instruction selection
failures for us in future upstream merges.
Reviewers: spatel
Reviewed By: spatel
Subscribers: merge_guards_bot, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71207
Summary:
The use of a boolean isInteger flag (generally initialized using
VT.isInteger()) caused errors in our out-of-tree CHERI backend
(https://github.com/CTSRD-CHERI/llvm-project).
In our backend, pointers use a separate ValueType (iFATPTR) and therefore
.isInteger() returns false. This meant that getSetCCInverse() was using the
floating-point variant and generated incorrect code for us:
`(void *)0x12033091e < (void *)0xffffffffffffffff` would return false.
Committing this change will significantly reduce our merge conflicts
for each upstream merge.
Reviewers: spatel, bogner
Reviewed By: bogner
Subscribers: wuzish, arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70917
This is an alternate fix for the bug discussed in D70595.
This also includes minimal tests for other in-tree targets
to show the problem more generally.
We check the number of uses as a predicate for whether some
value is free to negate, but that use count can change as we
rewrite the expression in getNegatedExpression(). So something
that was marked free to negate during the cost evaluation
phase becomes not free to negate during the rewrite phase (or
the inverse - something that was not free becomes free).
This can lead to a crash/assert because we expect that
everything in an expression that is negatible to be handled
in the corresponding code within getNegatedExpression().
This patch skips the use check during the rewrite phase.
So we determine that some expression isNegatibleForFree
(identically to without this patch), but during the rewrite,
don't rely on use counts to decide how to create the optimal
expression.
Differential Revision: https://reviews.llvm.org/D70975
D53794 introduced code to perform the FP_TO_UINT expansion via FP_TO_SINT in a way that would never expose floating-point exceptions in the intermediate steps. Unfortunately, I just noticed there is still a way this can happen. As discussed in D53794, the compiler now generates this sequence:
// Sel = Src < 0x8000000000000000
// Val = select Sel, Src, Src - 0x8000000000000000
// Ofs = select Sel, 0, 0x8000000000000000
// Result = fp_to_sint(Val) ^ Ofs
The problem is with the Src - 0x8000000000000000 expression. As I mentioned in the original review, that expression can never overflow or underflow if the original value is in range for FP_TO_UINT. But I missed that we can get an Inexact exception in the case where Src is a very small positive value. (In this case the result of the sub is ignored, but that doesn't help.)
Instead, I'd suggest to use the following sequence:
// Sel = Src < 0x8000000000000000
// FltOfs = select Sel, 0, 0x8000000000000000
// IntOfs = select Sel, 0, 0x8000000000000000
// Result = fp_to_sint(Val - FltOfs) ^ IntOfs
In the case where the value is already in range of FP_TO_SINT, we now simply compute Val - 0, which now definitely cannot trap (unless Val is a NaN in which case we'd want to trap anyway).
In the case where the value is not in range of FP_TO_SINT, but still in range of FP_TO_UINT, the sub can never be inexact, as Val is between 2^(n-1) and (2^n)-1, i.e. always has the 2^(n-1) bit set, and the sub is always simply clearing that bit.
There is a slight complication in the case where Val is a constant, so we know at compile time whether Sel is true or false. In that scenario, the old code would automatically optimize the sub away, while this no longer happens with the new code. Instead, I've added extra code to check for this case and then just fall back to FP_TO_SINT directly. (This seems to catch even slightly more cases.)
Original version of the patch by Ulrich Weigand. X86 changes added by Craig Topper
Differential Revision: https://reviews.llvm.org/D67105
InstCombine may synthesize FMINNUM/FMAXNUM nodes from fcmp+select
sequences (where the fcmp is marked nnan). Currently, if the
target does not otherwise handle these nodes, they'll get expanded
to libcalls to fmin/fmax. However, these functions may reside in
libm, which may introduce a library dependency that was not originally
present in the source code, potentially resulting in link failures.
To fix this problem, add code to TargetLowering::expandFMINNUM_FMAXNUM
to expand FMINNUM/FMAXNUM to a compare+select sequence instead of the
libcall. This is done only if the node is marked as "nnan"; in this case,
the expansion to compare+select is always correct. This also suffices to
catch all cases where FMINNUM/FMAXNUM was synthesized as above.
Differential Revision: https://reviews.llvm.org/D70965
I need to be able to drop an operand for STRICT_FP_ROUND handling on X86. Merging these functions gives me the ArrayRef interface that passes the return type, operands, and debugloc instead of the Node.
Differential Revision: https://reviews.llvm.org/D70503
Summary:
This is a preparatory cleanup before i add more
of this fold to deal with comparisons with non-zero.
In essence, the current lowering is:
```
Name: (X % C1) == 0 -> X * C3 <= C4
Pre: (C1 u>> countTrailingZeros(C1)) * C3 == 1
%zz = and i8 C3, 0 ; trick alive into making C3 avaliable in precondition
%o0 = urem i8 %x, C1
%r = icmp eq i8 %o0, 0
=>
%zz = and i8 C3, 0 ; and silence it from complaining about said reg
%C4 = -1 /u C1
%n0 = mul i8 %x, C3
%n1 = lshr i8 %n0, countTrailingZeros(C1) ; rotate right
%n2 = shl i8 %n0, ((8-countTrailingZeros(C1)) %u 8) ; rotate right
%n3 = or i8 %n1, %n2 ; rotate right
%r = icmp ule i8 %n3, %C4
```
https://rise4fun.com/Alive/oqd
It kinda just works, really no weird edge-cases.
But it isn't all that great for when comparing with non-zero.
In particular, given `(X % C1) == C2`, there will be problems
in the always-false tautological case where `C2 u>= C1`:
https://rise4fun.com/Alive/pH3
That case is tautological, always-false:
```
Name: (X % Y) u>= Y
%o0 = urem i8 %x, %y
%r = icmp uge i8 %o0, %y
=>
%r = false
```
https://rise4fun.com/Alive/ofu
While we can't/shouldn't get such tautological case normally,
we do deal with non-splat vectors, so unless we want to give up
in this case, we need to fixup/short-circuit such lanes.
There are two lowering variants:
1. We can blend between whatever computed result and the correct tautological result
```
Name: (X % C1) == C2 -> X * C3 <= C4 || false
Pre: (C2 == 0 || C1 u<= C2) && (C1 u>> countTrailingZeros(C1)) * C3 == 1
%zz = and i8 C3, 0 ; trick alive into making C3 avaliable in precondition
%o0 = urem i8 %x, C1
%r = icmp eq i8 %o0, C2
=>
%zz = and i8 C3, 0 ; and silence it from complaining about said reg
%C4 = -1 /u C1
%n0 = mul i8 %x, C3
%n1 = lshr i8 %n0, countTrailingZeros(C1) ; rotate right
%n2 = shl i8 %n0, ((8-countTrailingZeros(C1)) %u 8) ; rotate right
%n3 = or i8 %n1, %n2 ; rotate right
%is_tautologically_false = icmp ule i8 C1, C2
%res = icmp ule i8 %n3, %C4
%r = select i1 %is_tautologically_false, i1 0, i1 %res
```
https://rise4fun.com/Alive/PjT5https://rise4fun.com/Alive/1KV
2. We can invert the comparison result
```
Name: (X % C1) == C2 -> X * C3 <= C4 || false
Pre: (C2 == 0 || C1 u<= C2) && (C1 u>> countTrailingZeros(C1)) * C3 == 1
%zz = and i8 C3, 0 ; trick alive into making C3 avaliable in precondition
%o0 = urem i8 %x, C1
%r = icmp eq i8 %o0, C2
=>
%zz = and i8 C3, 0 ; and silence it from complaining about said reg
%C4 = -1 /u C1
%n0 = mul i8 %x, C3
%n1 = lshr i8 %n0, countTrailingZeros(C1) ; rotate right
%n2 = shl i8 %n0, ((8-countTrailingZeros(C1)) %u 8) ; rotate right
%n3 = or i8 %n1, %n2 ; rotate right
%is_tautologically_false = icmp ule i8 C1, C2
%C4_fixed = select i1 %is_tautologically_false, i8 -1, i8 %C4
%res = icmp ule i8 %n3, %C4_fixed
%r = xor i1 %res, %is_tautologically_false
```
https://rise4fun.com/Alive/2xChttps://rise4fun.com/Alive/jpb5
3. We can expand into `and`/`or`:
https://rise4fun.com/Alive/WGnhttps://rise4fun.com/Alive/lcb5
Blend-one is likely better since we avoid having to load the
replacement from constant pool. `xor` is second best since
it's still pretty general. I'm not adding `and`/`or` variants.
Reviewers: RKSimon, craig.topper, spatel
Reviewed By: RKSimon
Subscribers: nick, hiraditya, xbolva00, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70051
Summary:
Replaces
```
unsigned getShiftAmountThreshold(EVT VT)
```
by
```
bool shouldAvoidTransformToShift(EVT VT, unsigned amount)
```
thus giving more flexibility for targets to decide whether particular shift amounts must be considered expensive or not.
Updates the MSP430 target with a custom implementation.
This continues D69116, D69120, D69326 and updates them, so all of them must be committed before this.
Existing tests apply, a few more have been added.
Reviewers: asl, spatel
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D70042
Summary:
Additional filtering of undesired shifts for targets that do not support them efficiently.
Related with D69116 and D69120
Applies the TLI.getShiftAmountThreshold hook to prevent undesired generation of shifts for the following IR code:
```
define i16 @testShiftBits(i16 %a) {
entry:
%and = and i16 %a, -64
%cmp = icmp eq i16 %and, 64
%conv = zext i1 %cmp to i16
ret i16 %conv
}
define i16 @testShiftBits_11(i16 %a) {
entry:
%cmp = icmp ugt i16 %a, 63
%conv = zext i1 %cmp to i16
ret i16 %conv
}
define i16 @testShiftBits_12(i16 %a) {
entry:
%cmp = icmp ult i16 %a, 64
%conv = zext i1 %cmp to i16
ret i16 %conv
}
```
The attached diff file shows the piece code in TargetLowering that is responsible for the generation of shifts in relation to the IR above.
Before applying this patch, shifts will be generated to replace non-legal icmp immediates. However, shifts may be undesired if they are even more expensive for the target.
For all my previous patches in this series (cited above) I added test cases for the MSP430 target. However, in this case, the target is not suitable for showing improvements related with this patch, because the MSP430 does not implement "isLegalICmpImmediate". The default implementation returns always true, therefore the patched code in TargetLowering is never reached for that target. Targets implementing both "isLegalICmpImmediate" and "getShiftAmountThreshold" will benefit from this.
The differential effect of this patch can only be shown for the MSP430 by temporarily implementing "isLegalICmpImmediate" to return false for large immediates. This is simulated with the implementation of a command line flag that was incorporated in D69975
This patch belongs to a initiative to "relax" the generation of shifts by LLVM for targets requiring it
Reviewers: spatel, lebedev.ri, asl
Reviewed By: spatel
Subscribers: lenary, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D69326
We need to be checking the value types for the inner setccs not
the outer setcc. We need to ensure those setccs produce a 0/1
value or that the xor is on the i1 type. I think at the time
this code was originally written, getBooleanContents didn't
take any arguments so this was probably correct. But now we can
have a different boolean contents for integer and floating point.
Not sure why the other combines below the xor were also checking
the boolean contents. None of them involve any setccs other than
the outer one and they only produce a new setcc.
Differential Revision: https://reviews.llvm.org/D69480
This combine is only valid if the inner setcc produces a 0/1 result
or the inner type is MVT::i1.
I haven't seen this cause any issues, just happened to notice it
while reviewing combines in this function.
While there also fix another call to use the value type from the
SDValue for the operand instead of calling SDNode::getValueType(0).
Though its likely the use is result 0, its not guaranteed.
This broke various Windows builds, see comments on the Phabricator
review.
This also reverts the follow-up 20bf0cf.
> Summary:
> This fold, helps recover from the rest of the D62266 ARM regressions.
> https://rise4fun.com/Alive/TvpC
>
> Note that while the fold is quite flexible, i've restricted it
> to the single interesting pattern at the moment.
>
> Reviewers: efriedma, craig.topper, spatel, RKSimon, deadalnix
>
> Reviewed By: deadalnix
>
> Subscribers: javed.absar, kristof.beyls, llvm-commits
>
> Tags: #llvm
>
> Differential Revision: https://reviews.llvm.org/D62450
We should do the fold only if both constants are plain,
non-opaque constants, at least that is the DAG.FoldConstantArithmetic()
requirement.
And if the constant we are comparing with is zero - we shouldn't be
trying to do this fold in the first place.
Fixes https://bugs.llvm.org/show_bug.cgi?id=43769
Summary:
This fold, helps recover from the rest of the D62266 ARM regressions.
https://rise4fun.com/Alive/TvpC
Note that while the fold is quite flexible, i've restricted it
to the single interesting pattern at the moment.
Reviewers: efriedma, craig.topper, spatel, RKSimon, deadalnix
Reviewed By: deadalnix
Subscribers: javed.absar, kristof.beyls, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62450
Provides a TLI hook to allow targets to relax the emission of shifts, thus enabling
codegen improvements on targets with no multiple shift instructions and cheap selects
or branches.
Contributes to a Fix for PR43559:
https://bugs.llvm.org/show_bug.cgi?id=43559
Patch by: @joanlluch (Joan LLuch)
Differential Revision: https://reviews.llvm.org/D69116
llvm-svn: 375347
This patch converts the DAGCombine isNegatibleForFree/GetNegatedExpression into overridable TLI hooks.
The intention is to let us extend existing FNEG combines to work more generally with negatible float ops, allowing it work with target specific combines and opcodes (e.g. X86's FMA variants).
Unlike the SimplifyDemandedBits, we can't just handle target nodes through a Target callback, we need to do this as an override to allow targets to handle generic opcodes as well. This does mean that the target implementations has to duplicate some checks (recursion depth etc.).
Partial reversion of rL372756 - I've identified the infinite loop issue inside the X86 override but haven't fixed it yet so I've only (re)committed the common TargetLowering refactoring part of the patch.
Differential Revision: https://reviews.llvm.org/D67557
llvm-svn: 373343
Summary:
It seems we missed that the target hook can't query the known-bits for the
inputs to a target instruction. Fix that oversight
Reviewers: aditya_nandakumar
Subscribers: rovka, hiraditya, volkan, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67380
llvm-svn: 373264
ISD::SADDO uses the suggested sequence described in the section §2.4 of
the RISCV Spec v2.2. ISD::SSUBO uses the dual approach but checking for
(non-zero) positive.
Differential Revision: https://reviews.llvm.org/D47927
llvm-svn: 373187
Summary:
After the switch in SimplifyDemandedBits, it tries to create a
constant when possible. If the original node is a TargetConstant
the default in the switch will call computeKnownBits on the
TargetConstant which will succeed. This results in the
TargetConstant becoming a Constant. But TargetConstant exists to
avoid being changed.
I've fixed the two cases that relied on this in tree by explicitly
making the nodes constant instead of target constant. The Sparc
case is an old bug. The Mips case was recently introduced now that
ImmArg on intrinsics gets turned into a TargetConstant when the
SelectionDAG is created. I've removed the ImmArg since it lowers
to generic code.
Reviewers: arsenm, RKSimon, spatel
Subscribers: jyknight, sdardis, wdng, arichardson, hiraditya, fedor.sergeev, jrtc27, atanasyan, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67802
llvm-svn: 372409
This patch converts the DAGCombine isNegatibleForFree/GetNegatedExpression into overridable TLI hooks and includes a demonstration X86 implementation.
The intention is to let us extend existing FNEG combines to work more generally with negatible float ops, allowing it work with target specific combines and opcodes (e.g. X86's FMA variants).
Unlike the SimplifyDemandedBits, we can't just handle target nodes through a Target callback, we need to do this as an override to allow targets to handle generic opcodes as well. This does mean that the target implementations has to duplicate some checks (recursion depth etc.).
I've only begun to replace X86's FNEG handling here, handling FMADDSUB/FMSUBADD negation and some low impact codegen changes (some FMA negatation propagation). We can build on this in future patches.
Differential Revision: https://reviews.llvm.org/D67557
llvm-svn: 372333
As commented on D67557 we have a lot of uses of depth checks all using magic numbers.
This patch adds the SelectionDAG::MaxRecursionDepth constant and moves over some general cases to use this explicitly.
Differential Revision: https://reviews.llvm.org/D67711
llvm-svn: 372315
This is the first sweep of generic code to add isAtomic bailouts where appropriate. The intention here is to have the switch from AtomicSDNode to LoadSDNode/StoreSDNode be close to NFC; that is, I'm not looking to allow additional optimizations at this time. That will come later. See D66309 for context.
Differential Revision: https://reviews.llvm.org/D66318
llvm-svn: 371786
Loosely based on DAGCombiner version, but this part is slightly simpler in
GlobalIsel because all address calculation is performed by G_GEP. That makes
the inc/dec distinction moot so there's just pre/post to think about.
No targets can handle it yet so testing is via a special flag that overrides
target hooks.
llvm-svn: 371384