An existing function Type::getScalarSizeInBits returns a uint64_t
instead of a TypeSize class because the caller is requesting a
scalar size, which cannot be scalable. This patch makes other
similar functions requesting a scalar size consistent with that,
thereby eliminating more than 1000 implicit TypeSize -> uint64_t
casts.
Differential revision: https://reviews.llvm.org/D87889
Changes TTI function getIntImmCostInst to take an additional Instruction parameter,
which enables us to be able to check it is part of a min(max())/max(min()) pattern that will match SSAT.
We can then mark the constant used as free to prevent it being hoisted so SSAT can still be generated.
Required minor changes in some non-ARM backends to allow for the optional parameter to be included.
Differential Revision: https://reviews.llvm.org/D87457
The VPTBlock has been modified to track the 'global' state of the
VPR, as well as the state for each block. Each object now just holds
a list of instructions that makeup the block, while static structures
hold the predicate information. This enables global access for
querying how both a VPT block and individual instructions are
predicated. These changes now allow us, again, to handle more
complicated cases where multiple instructions build a predicate
and/or where the same predicate in used in multiple blocks.
It doesn't, however, get us back to before the tracking was 'fixed'
as some extra logic will be required to properly handle VPT
instructions. Currently a VPT could be effectively predicated because
of it's inputs, but the existing logic will not detect that and so
will refuse to perform the transformation. This can be seen in
remat-vctp.ll test where we still don't perform the transform.
Differential Revision: https://reviews.llvm.org/D87681
Remove the domain from the instructions and create a shouldInspect
helper for LowOverheadLoops which queries it or a vpr operand.
Differential Revision: https://reviews.llvm.org/D87900
security boundary
It was never supported and that part was accidentally omitted when
upstreaming D76518.
Differential Revision: https://reviews.llvm.org/D86478
Change-Id: If6ba9506eb0431c87a1d42a38aa60e47ce263039
This adds lowering for f32 values using the vmov.f16, which zeroes the
top bits whilst setting the lower bits to a pattern. This range of
values does not often come up, except where a f16 constant value has
been converted to a f32.
Differential Revision: https://reviews.llvm.org/D87790
This adds simple constant folding for VMOVrh, to constant fold fp16
constants to integer values. It can help especially with soft calling
conventions, but some of the results are not optimal as we end up
loading using a vldr. This will be improved in a follow up patch.
Differential Revision: https://reviews.llvm.org/D87789
When generating matching tables for GlobalISel, TableGen would output
"::zero_reg" whenever encountering the zero_reg, which in turn would
result in compilation error. This patch fixes that by instead outputting
NoRegister (== 0), which is the same result that TableGen produces when
generating matching tables for ISelDAG.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D86215
This extends the distributing postinc code in load/store optimizer to
also handle the case where there is an existing pre/post inc instruction,
where subsequent instructions can be modified to use the adjusted
offset from the increment. This can save us having to keep the old
register live past the increment instruction.
Differential Revision: https://reviews.llvm.org/D83377
The predicated MVE intrinsics are generated as, for example,
llvm.arm.mve.add.predicated(x, splat(y). p). We need to sink the splat
value back into the loop, like we do for other instructions, so we can
re-select qr variants.
Differential Revision: https://reviews.llvm.org/D87693
Additional sanity checks were added to get.active.lane.mask's second argument,
the loop tripcount/elementcount, in rG635b87511ec3. Like the other (overflow)
checks, skip this if tail-predication is forced.
Differential Revision: https://reviews.llvm.org/D87769
Clear the CurrentPredicate when we find an instruction which would
completely overwrite the VPR. This fix essentially means we're back
to not really being able to handle VPT instructions when tail
predicating.
Differential Revision: https://reviews.llvm.org/D87610
Modify the unit test to inspect all MVE instructions and mark the
load/store/move of vpr/p0 as valid, as well as the remaining scalar
shifts.
Differential Revision: https://reviews.llvm.org/D87753
The versions that take 'unsigned' will be removed in the future.
I tried to use getOriginalAlign instead of getAlign in some
places. getAlign factors in the minimum alignment implied by
the offset in the pointer info. Since we're also passing the
pointer info we can use the original alignment.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D87592
This adds SoftenFloatRes, PromoteFloatRes and SoftPromoteHalfRes
legalizations for VECREDUCE, to fill the remaining hole in the SDAG
legalization. These legalizations simply expand the reduction and
let it be recursively legalized. For the PromoteFloatRes case at
least it is possible to do better than that, but it's pretty tricky
(because we need to consider the interaction of three different
vector legalizations and the type promotion) and probably not
really worthwhile.
I haven't added ExpandFloatRes support, as I am not familiar with
ppc_fp128.
Differential Revision: https://reviews.llvm.org/D87569
LLVM will canonicalize conditional selectors to a different pattern than the old code that was used.
This is updating the function to match the new expected patterns and select SSAT or USAT when successful.
Tests have also been updated to use the new patterns.
Differential Review: https://reviews.llvm.org/D87379
This adds additional checks for the original scalar loop tripcount value, i.e.
get.active.lane.mask second argument, and perform several sanity checks to see
if it is of the form that we expect similarly like we already do for the IV
which is the first argument of get.active.lane.
Differential Revision: https://reviews.llvm.org/D86074
Treating an SoImm offset as a multiple of 4 between -1020 and 1020
mis-handles the second of a pair of 16-bit constants where the offset is a multiple of 2 but not a multiple of 4,
leading to an LLVM ERROR: out of range pc-relative fixup value
For 32-bit and larger (64-bit) constants, continue to treat an SoImm offset as a multiple of 4 between -1020 and 1020.
For smaller (16-bit) constants, treat an SoImm offset as a multiple of 1 between -255 and 255.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D86949
This allows the backend to tell the vectorizer to produce inloop
reductions through a TTI hook.
For the moment on ARM under MVE this means allowing integer add
reductions of the correct size. In the future this can include integer
min/max too, under -Os.
Differential Revision: https://reviews.llvm.org/D75512
This fixes a complication on top of D87276. If we are sign extending
around a mul with the two operands that are the same, instcombine will
helpfully convert one of the sext to a zext. Reverse that so that we
again generate a reduction.
Differnetial Revision: https://reviews.llvm.org/D87287
As discussed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-April/140729.html
This is hopefully the final remaining showstopper before we can remove
the 'experimental' from the reduction intrinsics.
No behavior was specified for the FP min/max reductions, so we have a
mess of different interpretations.
There are a few potential options for the semantics of these max/min ops.
I think this is the simplest based on current behavior/implementation:
make the reductions inherit from the existing llvm.maxnum/minnum intrinsics.
These correspond to libm fmax/fmin, and those are similar to the (now
deprecated?) IEEE-754 maxNum/minNum functions (NaNs are treated as missing
data). So the default expansion creates calls to libm functions.
Another option would be to inherit from llvm.maximum/minimum (NaNs propagate),
but most targets just crash in codegen when given those nodes because no
default expansion was ever implemented AFAICT.
We could also just assume 'nnan' semantics by default (we are already
assuming 'nsz' semantics in the maxnum/minnum intrinsics), but some targets
(AArch64, PowerPC) support the more defined behavior, so it doesn't make much
sense to not allow a tighter spec. Fast-math-flags (nnan) can be used to
loosen the semantics.
(Note that D67507 was proposed to update the LangRef to acknowledge the more
recent IEEE-754 2019 standard, but that patch seems to have stalled. If we do
update based on the new standard, the reduction instructions can seamlessly
inherit from whatever updates are made to the max/min intrinsics.)
x86 sees a regression here on 'nnan' tests because we have underlying,
longstanding bugs in FMF creation/propagation. Those need to be fixed apart
from this change (for example: https://llvm.org/PR35538). The expansion
sequence before this patch may not have been correct.
Differential Revision: https://reviews.llvm.org/D87391
We can sometimes get code that does:
xe = zext i16 x to i32
ye = zext i16 y to i32
m = mul i32 xe, ye
me = zext i32 m to i64
r = vecreduce.add(me)
This "double extend" can trip up the reduction identification, but
should give identical results.
This extends the pattern matching to handle them.
Differential Revision: https://reviews.llvm.org/D87276
values
The effects of unpredicated vector instruction with unknown
lanes cannot be predicted and therefore cannot be tail predicated. This
does not apply to predicated vector instructions and so this patch
allows tail predication on them.
Differential Revision: https://reviews.llvm.org/D87376
We weren't using this before, so none of the MachineFunction CFG edges had the
branch probability information added. As a result, block placement later in the
pipeline was flying blind.
This is enabled only with optimizations enabled like SelectionDAG.
Differential Revision: https://reviews.llvm.org/D86824
We really want to try and avoid spilling P0, which can be difficult
since there's only one register, so try to rematerialize any VCTP
instructions.
Differential Revision: https://reviews.llvm.org/D87280
count register
After my patch at D86087, code that now uses the mov operand rather than
the vctp operand will no longer remove modifications to the vctp operand
as they should. This patch fixes that by explicitly removing
modifications to the vctp operand rather than the register used as the
element count.
This was reverted in 503deec218
because it caused gigantic increase (3x) in branch mispredictions
in certain benchmarks on certain CPU's,
see https://reviews.llvm.org/D84108#2227365.
It has since been investigated and here are the results:
https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200907/827578.html
> It's an amazingly severe regression, but it's also all due to branch
> mispredicts (about 3x without this). The code layout looks ok so there's
> probably something else to deal with. I'm not sure there's anything we can
> reasonably do so we'll just have to take the hit for now and wait for
> another code reorganization to make the branch predictor a bit more happy :)
>
> Thanks for giving us some time to investigate and feel free to recommit
> whenever you'd like.
>
> -eric
So let's just reland this.
Original commit message:
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.
After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.
Surprizingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default, and is always run, unlike it's friend, common code sinking
transform, `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once very late in the pipeline.
I'm proposing to harmonize this, and disable common code hoisting
until //late// in pipeline. Definition of //late// may vary,
here currently i've picked the same one as for code sinking,
but i suppose we could enable it as soon as right after
loop rotation happens.
Experimentation shows that this does indeed unsurprizingly help,
more loops got rotated, although other issues remain elsewhere.
Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run- time performance, codesize. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, that may enable all the transforms
that require canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting, some of them will be caught late.
As per benchmarks i've run {F12360204}, this is mostly within the noise,
there are some small improvements, some small regressions.
One big regression i saw i fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but i'm sure
this will expose many more pre-existing missed optimizations, as usual :S
llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprizingly)
* size impact varies; for ThinLTO it's actually an improvement
The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.
If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
* -14 (-73.68%) loops not rotated due to the header size (yay)
* -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
* -3937 (-64.19%) common instructions hoisted
* +561 (+0.06%) x86 asm instructions
* -2 basic blocks
* +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable {F12360201}
* -36396 (-65.29%) common instructions hoisted
* +1676 (+0.02%) x86 asm instructions
* +662 (+0.06%) basic blocks
* +4395 (+0.04%) IR instructions
It is likely to be sub-optimal for when optimizing for code size,
so one might want to change tune pipeline by enabling sinking/hoisting
when optimizing for size.
Reviewed By: mkazantsev
Differential Revision: https://reviews.llvm.org/D84108
This reverts commit 503deec218.
When optimising for size, make the cost of i1 logical operations
relatively expensive so that optimisations don't try to combine
predicates.
Differential Revision: https://reviews.llvm.org/D86525
This adds a simple tablegen pattern for folding predicate_cast(load)
into vldr p0, providing the alignment and offset are correct.
Differential Revision: https://reviews.llvm.org/D86702
This patch implements the foldMemoryOperand hook in Thumb1InstrInfo,
allowing tBLXr and a spilled function address to be combined back into a
tBL. This can help with codesize at Oz, especailly in the tinycrypt
library.
Differential Revision: https://reviews.llvm.org/D79785
There's a special case in hasAttribute for None when pImpl is null. If pImpl is not null we dispatch to pImpl->hasAttribute which will always return false for Attribute::None.
So if we just want to check for None its sufficient to just check that pImpl is null. Which can even be done inline.
This patch adds a helper for that case which I hope will speed up our getSubtargetImpl implementations.
Differential Revision: https://reviews.llvm.org/D86744
Skip this for now, to avoid a backend crash in:
UNREACHABLE executed at llvm/lib/Target/ARM/ARMISelLowering.cpp:13412
This should fix PR45824.
Differential Revision: https://reviews.llvm.org/D86784
These arm_mve_vldr_gather_offset_predicated and
arm_mve_vstr_scatter_offset_predicated have some extra parameters
meaning the predicate is at a later operand. If a loop contains _only_
those masked instructions, we would miss transforming the active lane
mask.
Differential Revision: https://reviews.llvm.org/D86791
Remove the code that tried to look for reduction patterns, since the
vectorizer and isel can now produce predicated arithmetic instructios
within the loop body. This has required some reorganisation and fixes
around live-out and predication checks, as well as looking for cases
where an input/output is initialised to zero.
Differential Revision: https://reviews.llvm.org/D86613
This patch adjusts the following ARM/AArch64 LLVM IR intrinsics:
- neon_bfmmla
- neon_bfmlalb
- neon_bfmlalt
so that they take and return bf16 and float types. Previously these
intrinsics used <8 x i8> and <4 x i8> vectors (a rudiment from
implementation lacking bf16 IR type).
The neon_vbfdot[q] intrinsics are adjusted similarly. This change
required some additional selection patterns for vbfdot itself and
also for vector shuffles (in a previous patch) because of SelectionDAG
transformations kicking in and mangling the original code.
This patch makes the generated IR cleaner (less useless bitcasts are
produced), but it does not affect the final assembly.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D86146