This patch converts llvm.memset intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
The llvm.memset is converted to a TP loop for both
constant and non-constant input sizes (of llvm.memset).
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D100435
This patch converts llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
From an implementation point of view, the patch
- adds an ARM specific SDAG Node (to which the llvm.memcpy intrinsic is lowered to, during first phase of ISel)
- adds a corresponding TableGen entry to generate a pseudo instruction, with a custom inserter,
on matching the above node.
- Adds a custom inserter function that expands the pseudo instruction into MIR suitable
to be (by later passes) into a WLSTP loop.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99723
This patch converts llvm.memcpy intrinsic into Tail Predicated
Hardware loops for a target that supports the Arm M-profile
Vector Extension (MVE).
From an implementation point of view, the patch
- adds an ARM specific SDAG Node (to which the llvm.memcpy intrinsic is lowered to, during first phase of ISel)
- adds a corresponding TableGen entry to generate a pseudo instruction, with a custom inserter,
on matching the above node.
- Adds a custom inserter function that expands the pseudo instruction into MIR suitable
to be (by later passes) into a WLSTP loop.
Note: A cli option is used to control the conversion of memcpy to TP
loop and this option is currently disabled by default. It may be enabled
in the future after further downstream testing.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99723
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
Originally submitted as 3338290c18.
Reverted in c7df6b1223.
Similarly to D101096, this makes sure that MMO operands get propagated
through from MVE gathers/scatters to the Machine Instructions. This
allows extra scheduling freedom, not forcing the instructions to act as
scheduling barriers. We create MMO's with an unknown size, specifying
that they can load from anywhere in memory, similar to the masked_gather
or X86 intrinsics.
Differential Revision: https://reviews.llvm.org/D101219
atomicrmw instructions are expanded by AtomicExpandPass before register allocation
into cmpxchg loops. Register allocation can insert spills between the exclusive loads
and stores, which invalidates the exclusive monitor and can lead to infinite loops.
To avoid this, reimplement atomicrmw operations as pseudo-instructions and expand them
after register allocation.
Floating point legalisation:
f16 ATOMIC_LOAD_FADD(*f16, f16) is legalised to
f32 ATOMIC_LOAD_FADD(*i16, f32) and then eventually
f32 ATOMIC_LOAD_FADD_16(*i16, f32)
Differential Revision: https://reviews.llvm.org/D101164
Otherwise the CMP glue may be used in multiple nodes, needing to be
emitted multiple times. Currently this either increases instruction
count or fails as it attempt to insert the same node multiple times.
This adds a pattern to recognize VIDUP from BUILD_VECTOR of incrementing
adds. This can come up from either geps or adds, and came up recently in
D100550. We are just looking for a BUILD_VECTOR where each lane is an
add of the first lane with N*i, where i is the lane and N is one of 1,
2, 4, or 8, supported by the VIDUP instruction.
Differential Revision: https://reviews.llvm.org/D101263
This expands the VMOVRRD(extract(..(build_vector(a, b, c, d)))) pattern,
to also handle insert_vectors. Providing we can find the correct insert,
this helps further simplify patterns by removing the redundant VMOVRRD.
Differential Revision: https://reviews.llvm.org/D100245
This clang-formats the list of ARMISD nodes. Usually this is something I
would avoid, but these cause problems with formatting every time new
nodes are added.
The list in getTargetNodeName also makes use of MAKE_CASE macros, as
other backends do.
The generic SoftFloatVectorExtract.ll test was failing when run on arm
machines, as it tries to create a f64 under soft float. Limit the
transform to when f64 is legal.
Also add a missing override, as reported in D100244.
This adds a combine for extract(x, n); extract(x, n+1) ->
VMOVRRD(extract x, n/2). This allows two vector lanes to be moved at the
same time in a single instruction, and thanks to the other VMOVRRD folds
we have added recently can help reduce the amount of executed
instructions. Floating point types are very similar, but will include a
bitcast to an integer type.
This also adds a shouldRewriteCopySrc, to prevent copy propagation from
DPR to SPR, which can break as not all DPR regs can be extracted from
directly. Otherwise the machine verifier is unhappy.
Differential Revision: https://reviews.llvm.org/D100244
Combine sub 0, csinc X, Y, CC to csinv -X, Y, CC providing that the
negation of X is cheap, currently just handling constants. This comes up
during the splat of an i1 to a predicate, where we now generate csetm,
as opposed to cset; rsb.
Differential Revision: https://reviews.llvm.org/D99940
This uses the shuffle mask cost from D98206 to give a better cost of MVE
VREV instructions. This helps especially in VectorCombine where the cost
of shuffles is used to reorder bitcasts, which this helps keep the phase
ordering test for fp16 reductions producing optimal code. The isVREVMask
has been moved to a header file to allow it to be used across target
transform and isel lowering.
Differential Revision: https://reviews.llvm.org/D98210
This adjusts the place that the t2DoLoopStart reg allocation hint is
inserted, adding it in the ARMTPAndVPTOptimizaionPass in a similar place
as other tail predicated loop optimizations. This removes the need for
doing so in a custom inserter, and should make the hint more accurate,
only adding it where we expect to create a DLS (not DLSTP or WLS).
Recently we improved the lowering of low overhead loops and tail
predicated loops, but concentrated first on the DLS do style loops. This
extends those improvements over to the WLS while loops, improving the
chance of lowering them successfully. To do this the lowering has to
change a little as the instructions are terminators that produce a value
- something that needs to be treated carefully.
Lowering starts at the Hardware Loop pass, inserting a new
llvm.test.start.loop.iterations that produces both an i1 to control the
loop entry and an i32 similar to the llvm.start.loop.iterations
intrinsic added for do loops. This feeds into the loop phi, properly
gluing the values together:
%wls = call { i32, i1 } @llvm.test.start.loop.iterations.i32(i32 %div)
%wls0 = extractvalue { i32, i1 } %wls, 0
%wls1 = extractvalue { i32, i1 } %wls, 1
br i1 %wls1, label %loop.ph, label %loop.exit
...
loop:
%lsr.iv = phi i32 [ %wls0, %loop.ph ], [ %iv.next, %loop ]
..
%iv.next = call i32 @llvm.loop.decrement.reg.i32(i32 %lsr.iv, i32 1)
%cmp = icmp ne i32 %iv.next, 0
br i1 %cmp, label %loop, label %loop.exit
The llvm.test.start.loop.iterations need to be lowered through ISel
lowering as a pair of WLS and WLSSETUP nodes, which each get converted
to t2WhileLoopSetup and t2WhileLoopStart Pseudos. This helps prevent
t2WhileLoopStart from being a terminator that produces a value,
something difficult to control at that stage in the pipeline. Instead
the t2WhileLoopSetup produces the value of LR (essentially acting as a
lr = subs rn, 0), t2WhileLoopStart consumes that lr value (the Bcc).
These are then converted into a single t2WhileLoopStartLR at the same
point as t2DoLoopStartTP and t2LoopEndDec. Otherwise we revert the loop
to prevent them from progressing further in the pipeline. The
t2WhileLoopStartLR is a single instruction that takes a GPR and produces
LR, similar to the WLS instruction.
%1:gprlr = t2WhileLoopStartLR %0:rgpr, %bb.3
t2B %bb.1
...
bb.2.loop:
%2:gprlr = PHI %1:gprlr, %bb.1, %3:gprlr, %bb.2
...
%3:gprlr = t2LoopEndDec %2:gprlr, %bb.2
t2B %bb.3
The t2WhileLoopStartLR can then be treated similar to the other low
overhead loop pseudos, eventually being lowered to a WLS providing the
branches are within range.
Differential Revision: https://reviews.llvm.org/D97729
This adds some simple known bits handling for the three CSINC/NEG/INV
instructions. From the operands known bits we can compute the common
bits of the first operand and incremented/negated/inverted second
operand. The first, especially CSINC ZR, ZR, comes up fair amount in the
tests. The others are more rare so a unit test for them is added.
Differential Revision: https://reviews.llvm.org/D97788
Instead of converting the 0 into a ZR reg during lowering, do that with
tablegen by matching the zero immediate. This when combined with other
optimizations is more likely to use ZR and helps keep the DAG more
easily optimizable. It should not otherwise effect code generation.
If we insert undef using a VMOVN, we can just use the original value in
three out of the four possible combinations. Using VMOVT into a undef
vector will still require the lanes to be moved, but otherwise the
non-undef value can be used.
This patch provides two major changes:
1. Add getRelocationInfo to check if a constant will have static, dynamic, or
no relocations. (Also rename the original needsRelocation to needsDynamicRelocation.)
2. Only allow a constant with no relocations (static or dynamic) to be placed
in a mergeable section.
This will allow unused symbols that contain static relocations and happen to
fit in mergeable constant sections (.rodata.cstN) to instead be placed in
unique-named sections if -fdata-sections is used and subsequently garbage collected
by --gc-sections.
See https://lists.llvm.org/pipermail/llvm-dev/2021-February/148281.html.
Differential Revision: https://reviews.llvm.org/D95960
Given a floating point store from an extracted vector, with an integer
VGETLANE that already exists, storing the existing VGETLANEu directly
can be better for performance. As the value is known to already be in an
integer registers, this can help reduce fp register pressure, removed
the need for the fp extract and allows use of more integer post-inc
stores not available with vstr.
This can be a bit narrow in scope, but helps with certain biquad kernels
that store shuffled vector elements.
Differential Revision: https://reviews.llvm.org/D96159
Our current lowering of VMOVNT goes via a shuffle vector of the form
<0, N, 2, N+2, 4, N+4, ..>. That can of course also be a single input
shuffle of the form <0, 0, 2, 2, 4, 4, ..>, where we use a VMOVNT to
insert a vector into the top lanes of itself. This adds lowering of that
case, re-using the existing isVMOVNMask.
Differential Revision: https://reviews.llvm.org/D96065
We were storing predicate registers, such as a <8 x i1>, in the opposite
order to how the rest of llvm expects. This actually turns out to be
correct for the one place that usually uses it - the
ScalarizeMaskedMemIntrin pass, but only because the pass was incorrect
itself. This fixes the order so that bits are stored in the opposite
order and bitcasts work as expected. This allows the Scalarization pass
to be fixed, as in https://reviews.llvm.org/D94765.
Differential Revision: https://reviews.llvm.org/D94867
A One-Off Identity mask is a shuffle that is mostly an identity mask
from as single source but contains a single element out-of-place, either
from a different vector or from another position in the same vector. As
opposed to lowering this via a ARMISD::BUILD_VECTOR we can generate an
extract/insert pair directly. Under ARM with individually accessible
lane elements this often becomes a simple lane move.
This also alters the LowerVECTOR_SHUFFLEUsingMovs code to use v4f32 (not
v4i32), a more natural type for lane moves.
Differential Revision: https://reviews.llvm.org/D95551
Because we mark all operations as expand for v2f64, scalar_to_vector
would end up lowering through a stack store/reload. But it is pretty
simple to implement, only inserting a D reg into an undef vector. This
helps clear up some inefficient codegen from soft calling conventions.
Differential Revision: https://reviews.llvm.org/D96153
This new f16 shuffle under Neon would hit an assert in
GeneratePerfectShuffle as it would try to treat a f16 vector as an i8.
Add f16 handling, treating them like an i16.
Differential Revision: https://reviews.llvm.org/D95446
Given a shuffle(vqdmulh(shuffle, shuffle), we can flatter the shuffles
out if they become an identity mask. This can come up during lane
interleaving, when we do that better.
Differential Revision: https://reviews.llvm.org/D94034
Under the softfp calling convention, we are often left with
VMOVRRD(extract(bitcast(build_vector(a, b, c, d)))) for the return value
of the function. These can be simplified to a,b or c,d directly,
depending on the value of the extract.
Big endian is a little different because the bitcast switches the lanes
around, meaning we end up with b,a or d,c.
Differential Revision: https://reviews.llvm.org/D94989
This adds a DAG combine for converting sext_inreg of VGetLaneu into
VGetLanes, providing the types match correctly.
Differential Revision: https://reviews.llvm.org/D95073
Under SoftFP calling conventions, we can be left with
extract(bitcast(BUILD_VECTOR(VMOVDRR(a, b), ..))) patterns that can
simplify to a or b, depending on the extract lane.
Differential Revision: https://reviews.llvm.org/D94990
The MVE VLD2/4 and VST2/4 instructions require the pointer to be aligned
to at least the size of the element type. This adds a check for that
into the ARM lowerInterleavedStore and lowerInterleavedLoad functions,
not creating the intrinsics if they are invalid for the alignment of
the load/store.
Unfortunately this is one of those bug fixes that does effect some
useful codegen, as we were able to sometimes do some nice lowering of
q15 types. But they can cause problem with low aligned pointers.
Differential Revision: https://reviews.llvm.org/D95319
I may have given bad advice, and skipping sext_inreg when matching SSAT
patterns is not valid on it's own. It at least needs to sext_inreg the
input again, but as far as I can tell is still only valid based on
demanded bits. For the moment disable that part of the combine,
hopefully reimplementing it in the future more correctly.
This replaces the isSaturatingConditional function with
LowerSaturatingConditional that directly returns a new SSAT or
USAT SDValue, instead of returning true and the components of it.
We have no lowering for VSELECT vXi1, vXi1, vXi1, so mark them as
expanded to turn them into a series of logical operations.
Differential Revision: https://reviews.llvm.org/D94946
The isVMOVNOriginalMask was previously only checking for two input
shuffles that could be better expanded as vmovn nodes. This expands that
to single input shuffles that will later be legalized to multiple
vectors.
Differential Revision: https://reviews.llvm.org/D94189
The ISel patterns we have for truncating to i1's under MVE do not seem
to be correct. Instead custom lower to icmp(ne, and(x, 1), 0).
Differential Revision: https://reviews.llvm.org/D94226
Similar to 78d8a821e2 but for ARM, this handles any_extend whilst
creating MULL nodes, treating them as zextends.
Differential Revision: https://reviews.llvm.org/D93834
The lowering of a <4 x i16> or <4 x i8> vecreduce.add into an i64 would
previously be expanded, due to the i64 not being legal. This patch
adjusts our reduction matchers, making it produce a VADDLV(sext A to
v4i32) instead.
Differential Revision: https://reviews.llvm.org/D93622
As a linker is allowed to clobber r12 on function calls, the code
transformation that hardens indirect calls is not correct in case a
linker does so. Similarly, the transformation is not correct when
register lr is used.
This patch makes sure that r12 or lr are not used for indirect calls
when harden-sls-blr is enabled.
Differential Revision: https://reviews.llvm.org/D92469
This folds a not (an xor -1) though a predicate_cast, so that it can be
turned into a VPNOT and potentially be folded away as an else predicate
inside a VPT block.
Differential Revision: https://reviews.llvm.org/D92235
We already expand select and select_cc in codegenprepare, but they can
still be generated under some situations. Explicitly mark them as expand
to ensure they are not produced, leading to a failure to select the
nodes.
Differential Revision: https://reviews.llvm.org/D92373
The PREDICATE_CAST node is used to model moves between MVE predicate
registers and gpr's, and eventually become a VMSR p0, rn. When moving to
a predicate only the bottom 16 bits of the sources register are
demanded. This adds a simple fold for that, allowing it to potentially
remove instructions like uxth.
Differential Revision: https://reviews.llvm.org/D92213
We have a frequent pattern where we're merging two KnownBits to get the common/shared bits, and I just fell for the gotcha where I tried to use the & operator to merge them........
This hints the operand of a t2DoLoopStart towards using LR, which can
help make it more likely to become t2DLS lr, lr. This makes it easier to
move if needed (as the input is the same as the output), or potentially
remove entirely.
The hint is added after others (from COPY's etc) which still take
precedence. It needed to find a place to add the hint, which currently
uses the post isel custom inserter.
Differential Revision: https://reviews.llvm.org/D89883
This adds ISel matching for a form of VQDMULH. There are several ir
patterns that we could match to that instruction, this one is for:
min(ashr(mul(sext(a), sext(b)), 7), 127)
Which is what llvm will optimize to once it has removed the max that
usually makes up the min/max saturate pattern, as in this case the
compare will always be false. The additional complication to match i32
patterns (which extend into an i64) is that the min will be a
vselect/setcc, as vmin is not supported for i64 vectors. Tablegen
patterns have also been updated to attempt to reuse the MVE_TwoOpPattern
patterns.
Differential Revision: https://reviews.llvm.org/D90096
In most of lib/Target we know that we are not dealing with scalable
types so it's perfectly fine to replace TypeSize comparison operators
with their fixed width equivalents, making use of getFixedSize()
and so on.
Differential Revision: https://reviews.llvm.org/D89101
This folds a select_cc or select(set_cc) of a max or min vector reduction with a scalar value into a VMAXV or VMINV.
Differential Revision: https://reviews.llvm.org/D87836
This folds a select_cc or select(set_cc) of a max or min vector reduction with a scalar value into a VMAXV or VMINV.
Differential Revision: https://reviews.llvm.org/D87836
We were not accounting for the pointer offset when splitting a store from
a VMOVDRR node, which could lead to incorrect aliasing info. In this
case it is the fneg via integer arithmetic that gives us a store->load
pair that we started getting wrong.
Differential Revision: https://reviews.llvm.org/D88653
An existing function Type::getScalarSizeInBits returns a uint64_t
instead of a TypeSize class because the caller is requesting a
scalar size, which cannot be scalable. This patch makes other
similar functions requesting a scalar size consistent with that,
thereby eliminating more than 1000 implicit TypeSize -> uint64_t
casts.
Differential revision: https://reviews.llvm.org/D87889
security boundary
It was never supported and that part was accidentally omitted when
upstreaming D76518.
Differential Revision: https://reviews.llvm.org/D86478
Change-Id: If6ba9506eb0431c87a1d42a38aa60e47ce263039
This adds lowering for f32 values using the vmov.f16, which zeroes the
top bits whilst setting the lower bits to a pattern. This range of
values does not often come up, except where a f16 constant value has
been converted to a f32.
Differential Revision: https://reviews.llvm.org/D87790
This adds simple constant folding for VMOVrh, to constant fold fp16
constants to integer values. It can help especially with soft calling
conventions, but some of the results are not optimal as we end up
loading using a vldr. This will be improved in a follow up patch.
Differential Revision: https://reviews.llvm.org/D87789
The predicated MVE intrinsics are generated as, for example,
llvm.arm.mve.add.predicated(x, splat(y). p). We need to sink the splat
value back into the loop, like we do for other instructions, so we can
re-select qr variants.
Differential Revision: https://reviews.llvm.org/D87693
The versions that take 'unsigned' will be removed in the future.
I tried to use getOriginalAlign instead of getAlign in some
places. getAlign factors in the minimum alignment implied by
the offset in the pointer info. Since we're also passing the
pointer info we can use the original alignment.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D87592
LLVM will canonicalize conditional selectors to a different pattern than the old code that was used.
This is updating the function to match the new expected patterns and select SSAT or USAT when successful.
Tests have also been updated to use the new patterns.
Differential Review: https://reviews.llvm.org/D87379
This fixes a complication on top of D87276. If we are sign extending
around a mul with the two operands that are the same, instcombine will
helpfully convert one of the sext to a zext. Reverse that so that we
again generate a reduction.
Differnetial Revision: https://reviews.llvm.org/D87287
We can sometimes get code that does:
xe = zext i16 x to i32
ye = zext i16 y to i32
m = mul i32 xe, ye
me = zext i32 m to i64
r = vecreduce.add(me)
This "double extend" can trip up the reduction identification, but
should give identical results.
This extends the pattern matching to handle them.
Differential Revision: https://reviews.llvm.org/D87276
Skip this for now, to avoid a backend crash in:
UNREACHABLE executed at llvm/lib/Target/ARM/ARMISelLowering.cpp:13412
This should fix PR45824.
Differential Revision: https://reviews.llvm.org/D86784
If gather/scatters are enabled, ARMTargetTransformInfo now allows
tail predication for loops with a much wider range of strides, up
to anything that is loop invariant.
Differential Revision: https://reviews.llvm.org/D85410
Changes the Offset arguments to both functions from int64_t to TypeSize
& updates all uses of the functions to create the offset using TypeSize::Fixed()
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D85220
This adds patterns for v16i16's vecreduce, using all the existing code
to go via an i32 VADDV/VMLAV and truncating the result.
Differential Revision: https://reviews.llvm.org/D85452
Similar to 8fa824d7a3 but this time for MLA patterns, this selects
predicated vmlav/vmlava/vmlalv/vmlava instructions from
vecreduce.add(select(p, mul(x, y), 0)) nodes.
Differential Revision: https://reviews.llvm.org/D84102
Given a vecreduce.add(select(p, x, 0)), we can convert that to a
predicated vaddv, as the else value for the select is the identity
value, a zero. That is what this patch does for the vaddv, vaddva,
vaddlv and vaddlva instructions, copying the existing patterns to also
handle predication through a select.
Differential Revision: https://reviews.llvm.org/D84101
Vector bitwise selects are matched by pseudo VBSP instruction
and expanded to VBSL/VBIT/VBIF after register allocation
depend on operands registers to minimize extra copies.
MVE has native reductions for integer add and min/max. The others need
to be expanded to a series of extract's and scalar operators to reduce
the vector into a single scalar. The default codegen for that expands
the reduction into a series of in-order operations.
This modifies that to something more suitable for MVE. The basic idea is
to use vector operations until there are 4 remaining items then switch
to pairwise operations. For example a v8f16 fadd reduction would become:
Y = VREV X
Z = ADD(X, Y)
z0 = Z[0] + Z[1]
z1 = Z[2] + Z[3]
return z0 + z1
The awkwardness (there is always some) comes in from something like a
v4f16, which is first legalized by adding identity values to the extra
lanes of the reduction, and which can then not be optimized away through
the vrev; fadd combo, the inserts remain. I've made sure they custom
lower so that we can produce the pairwise additions before the extra
values are added.
Differential Revision: https://reviews.llvm.org/D81397
Pre-commit for D82257, this adds a DemandedElts arg to ShrinkDemandedConstant/targetShrinkDemandedConstant which will allow future patches to (optionally) add vector support.
This extends PerformSplittingToWideningLoad to also handle FP_Ext, as
well as sign and zero extends. It uses an integer extending load
followed by a VCVTL on the bottom lanes to efficiently perform an fpext
on a smaller than legal type.
The existing code had to be rewritten a little to not just split the
node in two and let legalization handle it from there, but to actually
split into legal chunks.
Differential Revision: https://reviews.llvm.org/D81340
This adds code to lower f16 to f32 fp_exts's using an MVE VCVT
instructions, similar to a recent similar patch for fp_trunc. Again it
goes through the lowering of a BUILD_VECTOR, but is slightly simpler
only having to deal with interleaved indices. It adds a VCVTL node to
lower to, similar to VCVTN.
Differential Revision: https://reviews.llvm.org/D81339
This splits MVE vector stores of a fp_trunc in the same way that we do
for standard trunc's. It extends PerformSplittingToNarrowingStores to
handle fp_round, splitting the store into pieces and adding a VCVTNb to
perform the actual fp_round. The actual store is then converted to an
integer store so that it can truncate bottom lanes of the result.
Differential Revision: https://reviews.llvm.org/D81141
This adds code to lower f32 to f16 fp_trunc's using a pair of MVE VCVT
instructions. Due to v4f16 not being legal, fp_round are often split up
fairly early. So this reconstructs the vcvt's from a buildvector of
fp_rounds from two vector inputs. Something like:
BUILDVECTOR(FP_ROUND(EXTRACT_ELT(X, 0),
FP_ROUND(EXTRACT_ELT(Y, 0),
FP_ROUND(EXTRACT_ELT(X, 1),
FP_ROUND(EXTRACT_ELT(Y, 1), ...)
It adds a VCVTN node to handle this, which like VMOVN or VQMOVN lowers
into the top/bottom lanes of an MVE instruction.
Differential Revision: https://reviews.llvm.org/D81139
Summary:
This change permits scalar bfloats to be loaded, stored, moved and
used as function call arguments and return values, whenever the bf16
feature is supported by the subtarget.
Previously that was only supported in the presence of the fullfp16
feature, because the code generation strategy depended on instructions
from that extension. This change adds alternative code generation
strategies so that those operations can be done even without fullfp16.
The strategy for loads and stores is to replace VLDRH/VSTRH with
integer LDRH/STRH plus a move between register classes. I've written
isel patterns for those, conditional on //not// having the fullfp16
feature (so that in the fullfp16 case, the existing patterns will
still be used).
For function arguments and returns, instead of writing isel patterns
to match `VMOVhr` and `VMOVrh`, I've avoided generating those SDNodes
in the first place, by factoring out the code that constructs them
into helper functions `MoveToHPR` and `MoveFromHPR` which have a
fallback for non-fullfp16 subtargets.
The current output code is not especially pretty: in the new test file
you can see unnecessary store/load pairs implementing no-op bitcasts,
and lots of pointless moves back and forth between FP registers and
GPRs. But it at least works, which is an improvement on the previous
situation.
Reviewers: dmgreen, SjoerdMeijer, stuij, chill, miyuki, labrinea
Reviewed By: dmgreen, labrinea
Subscribers: labrinea, kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D82372
Implement them on top of sdiv/udiv, similar to what we do for integer
types.
Potential future work: implementing i8/i16 srem/urem, optimizations for
constant divisors, optimizing the mul+sub to mls.
Differential Revision: https://reviews.llvm.org/D81511
This patch adds basic support for BFloat in the Arm backend.
For now the code generation relies on fullfp16 being present.
Briefly:
* adds the bfloat scalar and vector types in the necessary register classes,
* adjusts the calling convention to cope with bfloat argument passing and return,
* adds codegen patterns for moves, loads and stores.
It's tested mostly by the intrinsic patches that depend on it (load/store, convert/copy).
The following people contributed to this patch:
* Alexandros Lamprineas
* Ties Stuij
Differential Revision: https://reviews.llvm.org/D81373
Summary:
As half-precision floating point arguments and returns were previously
coerced to either float or int32 by clang's codegen, the CMSE handling
of those was also performed in clang's side by zeroing the unused MSBs
of the coercer values.
This patch moves this handling to the backend's calling convention
lowering, making sure the high bits of the registers used by
half-precision arguments and returns are zeroed.
Reviewers: chill, rjmccall, ostannard
Reviewed By: ostannard
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D81428
Summary:
Half-precision floating point arguments and returns are currently
promoted to either float or int32 in clang's CodeGen and there's
no existing support for the lowering of `half` arguments and returns
from IR in AArch32's backend.
Such frontend coercions, implemented as coercion through memory
in clang, can cause a series of issues in argument lowering, as causing
arguments to be stored on the wrong bits on big-endian architectures
and incurring in missing overflow detections in the return of certain
functions.
This patch introduces the handling of half-precision arguments and returns in
the backend using the actual "half" type on the IR. Using the "half"
type the backend is able to properly enforce the AAPCS' directions for
those arguments, making sure they are stored on the proper bits of the
registers and performing the necessary floating point convertions.
Reviewers: rjmccall, olista01, asl, efriedma, ostannard, SjoerdMeijer
Reviewed By: ostannard
Subscribers: stuij, hiraditya, dmgreen, llvm-commits, chill, dnsampaio, danielkiss, kristof.beyls, cfe-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D75169
The rearranges PerformANDCombine and PerformORCombine to try and make
sure we don't call isConstantSplat on any i1 vectors. As pointed out in
D81860 it may not be very well defined in those cases.