Commit Graph

25 Commits

Author SHA1 Message Date
Roman Lebedev cae2d871c0
[NFCI][Thumb2] Regenerate MVE tests i missed in 59560e8589 2020-12-14 21:01:00 +03:00
David Green 0447f3508f [ARM][RegAlloc] Add t2LoopEndDec
We currently have problems with the way that low overhead loops are
specified, with LR being spilled between the t2LoopDec and the t2LoopEnd
forcing the entire loop to be reverted late in the backend. As they will
eventually become a single instruction, this patch introduces a
t2LoopEndDec which is the combination of the two, combined before
registry allocation to make sure this does not fail.

Unfortunately this instruction is a terminator that produces a value
(and also branches - it only produces the value around the branching
edge). So this needs some adjustment to phi elimination and the register
allocator to make sure that we do not spill this LR def around the loop
(needing to put a spill after the terminator). We treat the loop very
carefully, making sure that there is nothing else like calls that would
break it's ability to use LR. For that, this adds a
isUnspillableTerminator to opt in the new behaviour.

There is a chance that this could cause problems, and so I have added an
escape option incase. But I have not seen any problems in the testing
that I've tried, and not reverting Low overhead loops is important for
our performance. If this does work then we can hopefully do the same for
t2WhileLoopStart and t2DoLoopStart instructions.

This patch also contains the code needed to convert or revert the
t2LoopEndDec in the backend (which just needs a subs; bne) and the code
pre-ra to create them.

Differential Revision: https://reviews.llvm.org/D91358
2020-12-10 12:14:23 +00:00
Nikita Popov e987fbdd85 [BasicAA] Generalize recursive phi alias analysis
For recursive phis, we skip the recursive operands and check that
the remaining operands are NoAlias with an unknown size. Currently,
this is limited to inbounds GEPs with positive offsets, to
guarantee that the recursion only ever increases the pointer.

Make this more general by only requiring that the underlying object
of the phi operand is the phi itself, i.e. it it based on itself in
some way. To compensate, we need to use a beforeOrAfterPointer()
location size, as we no longer have the guarantee that the pointer
is strictly increasing.

This allows us to handle some additional cases like negative geps,
geps with dynamic offsets or geps that aren't inbounds.

Differential Revision: https://reviews.llvm.org/D91914
2020-11-29 10:25:23 +01:00
Roman Lebedev b33fbbaa34
Reland [SimplifyCFG] FoldBranchToCommonDest: lift use-restriction on bonus instructions
This was orginally committed in 2245fb8aaa.
but was immediately reverted in f3abd54958
because of a PHI handling issue.

Original commit message:

1. It doesn't make sense to enforce that the bonus instruction
   is only used once in it's basic block. What matters is
   whether those user instructions fit within our budget, sure,
   but that is another question.
2. It doesn't make sense to enforce that said bonus instructions
   are only used within their basic block. Perhaps the branch
   condition isn't using the value computed by said bonus instruction,
   and said bonus instruction is simply being calculated
   to be used in successors?

So iff we can clone bonus instructions, to lift these restrictions,
we just need to carefully update their external uses
to use the new cloned instructions.

Notably, this transform (even without this change) appears to be
poison-unsafe as per alive2, but is otherwise (including the patch) legal.

We don't introduce any new PHI nodes, but only "move" the instructions
around, i'm not really seeing much potential for extra cost modelling
for the transform, especially since now we allow at most one such
bonus instruction by default.

This causes the fold to fire +11.4% more (13216 -> 14725)
as of vanilla llvm test-suite + RawSpeed.

The motivational pattern is IEEE-754-2008 Binary16->Binary32
extension code:
ca57d77fb2/src/librawspeed/common/FloatingPoint.h (L115-L120)
^ that should be a switch, but it is not now: https://godbolt.org/z/bvja5v
That being said, even thought this seemed like this would fix it: https://godbolt.org/z/xGq3TM
apparently that fold is happening somewhere else afterall,
so something else also has a similar 'artificial' restriction.
2020-11-27 12:47:15 +03:00
Craig Topper 4252f7773a [SelectionDAG][ARM][AArch64][Hexagon][RISCV][X86] Add SDNPCommutative to fma and fmad nodes in tablegen. Remove explicit commuted patterns from targets.
X86 was already specially marking fma as commutable which allowed
tablegen to autogenerate commuted patterns. This moves it to the target
independent definition and fix up the targets to remove now
unneeded patterns.

Unfortunately, the tests change because the commuted version of
the patterns are generating operands in a different than the
explicit patterns.

Differential Revision: https://reviews.llvm.org/D91842
2020-11-23 10:09:20 -08:00
David Green 73a6cd4b6b [ARM] Add a RegAllocHint for hinting t2DoLoopStart towards LR
This hints the operand of a t2DoLoopStart towards using LR, which can
help make it more likely to become t2DLS lr, lr. This makes it easier to
move if needed (as the input is the same as the output), or potentially
remove entirely.

The hint is added after others (from COPY's etc) which still take
precedence. It needed to find a place to add the hint, which currently
uses the post isel custom inserter.

Differential Revision: https://reviews.llvm.org/D89883
2020-11-10 16:28:57 +00:00
David Green b2ac9681a7 [ARM] Alter t2DoLoopStart to define lr
This changes the definition of t2DoLoopStart from
t2DoLoopStart rGPR
to
GPRlr = t2DoLoopStart rGPR

This will hopefully mean that low overhead loops are more tied together,
and we can more reliably generate loops without reverting or being at
the whims of the register allocator.

This is a fairly simple change in itself, but leads to a number of other
required alterations.

 - The hardware loop pass, if UsePhi is set, now generates loops of the
   form:
       %start = llvm.start.loop.iterations(%N)
     loop:
       %p = phi [%start], [%dec]
       %dec = llvm.loop.decrement.reg(%p, 1)
       %c = icmp ne %dec, 0
       br %c, loop, exit
 - For this a new llvm.start.loop.iterations intrinsic was added, identical
   to llvm.set.loop.iterations but produces a value as seen above, gluing
   the loop together more through def-use chains.
 - This new instrinsic conceptually produces the same output as input,
   which is taught to SCEV so that the checks in MVETailPredication are not
   affected.
 - Some minor changes are needed to the ARMLowOverheadLoop pass, but it has
   been left mostly as before. We should now more reliably be able to tell
   that the t2DoLoopStart is correct without having to prove it, but
   t2WhileLoopStart and tail-predicated loops will remain the same.
 - And all the tests have been updated. There are a lot of them!

This patch on it's own might cause more trouble that it helps, with more
tail-predicated loops being reverted, but some additional patches can
hopefully improve upon that to get to something that is better overall.

Differential Revision: https://reviews.llvm.org/D89881
2020-11-10 15:57:58 +00:00
Michael Liao 4b11201592 [MachineInstr] Add support for instructions with multiple memory operands.
- Basically iterate each pair of memory operands from both instructions
  and return true if any of them may alias.
- The exception are memory instructions without any memory operand. They
  may touch everything and could alias to any memory instruction.

Differential Revision: https://reviews.llvm.org/D89447
2020-11-03 20:44:40 -05:00
David Green 6dcbc323fd Revert "[ARM][LowOverheadLoops] Adjust Start insertion."
This reverts commit 38f625d0d1.

This commit contains some holes in its logic and has been causing
issues since it was commited. The idea sounds OK but some cases were not
handled correctly. Instead of trying to fix that up later it is probably
simpler to revert it and work to reimplement it in a more reliable way.
2020-10-20 08:55:21 +01:00
Sam Parker 38f625d0d1 [ARM][LowOverheadLoops] Adjust Start insertion.
Try to move the insertion point to become the terminator of the
block, usually the preheader.

Differential Revision: https://reviews.llvm.org/D88638
2020-10-01 10:49:19 +01:00
Sam Parker 7b90516d47 [ARM][LowOverheadLoops] Start insertion point
If possible, try not to move the start position earlier than it
already is.

Differential Revision: https://reviews.llvm.org/D88542
2020-10-01 10:05:25 +01:00
Sam Parker a3e41d4581 [ARM] Make MachineVerifier more strict about terminators
Fix the ARM backend's analyzeBranch so it doesn't ignore predicated
return instructions, and make the MachineVerifier rule more strict.

Differential Revision: https://reviews.llvm.org/D40061
2020-08-27 07:10:20 +01:00
David Green 9e03547cab [ARM][HWLoops] Create hardware loops for sibling loops
Given a loop with two subloops, it should be possible for both to be
converted to hardware loops. That's what this patch does, simply enough.
It slightly alters the loop iterating order to try and convert all
subloops. If one (or more) succeeds, it stops as before.

Differential Revision: https://reviews.llvm.org/D78502
2020-07-03 17:20:02 +01:00
Tres Popp 09d72ad399 Revert "[CGP] Enable CodeGenPrepares phi type convertion."
This reverts commit 67121d7b82.

This is causing compile times to be 2x slower on some large binaries.
2020-06-22 13:06:18 +02:00
David Green 67121d7b82 [CGP] Enable CodeGenPrepares phi type convertion. 2020-06-21 16:46:16 +01:00
David Green 076e08aa45 [LSR] Filter for postinc formulae
In more complicated loops we can easily hit the complexity limits of
loop strength reduction. If we do and filtering occurs, it's all too
easy to remove the wrong formulae for post-inc preferring accesses due
to it attempting to maximise register re-use. The patch adds an
alternative filtering step when the target is preferring postinc to pick
postinc formulae instead, hopefully lowering the complexity to below the
limit so that aggressive filtering is not needed.

There is also a change in here to stop considering existing addrecs as
free under postinc. We should already be modelling them as a reg so
don't want it to cause us to get the cost wrong. (I'm not sure that code
makes sense in general, but there are X86 tests specifically for it
where it seems to be helping so have left it around for the standard
non-post-inc case).

Differential Revision: https://reviews.llvm.org/D80273
2020-06-17 12:32:04 +01:00
David Green e8e7b2cb46 [ARM] More tests for MVE LSR and float issues. NFC 2020-05-28 22:04:12 +01:00
Jean-Michel Gorius 65cd2c7a80 Revert "[CodeGen] Add support for multiple memory operands in MachineInstr::mayAlias"
This temporarily reverts commit 7019cea26d.

It seems that, for some targets, there are instructions with a lot of memory operands (probably more than would be expected). This causes a lot of buildbots to timeout and notify failed builds. While investigations are ongoing to find out why this happens, revert the changes.
2020-05-22 21:26:46 +02:00
Jean-Michel Gorius 7019cea26d [CodeGen] Add support for multiple memory operands in MachineInstr::mayAlias
Summary:
To support all targets, the mayAlias member function needs to support instructions with multiple operands.

This revision also changes the order of the emitted instructions in some test cases.

Reviewers: efriedma, hfinkel, craig.topper, dmgreen

Reviewed By: efriedma

Subscribers: MatzeB, dmgreen, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80161
2020-05-21 23:02:54 +02:00
David Green fa15255d8a [ARM] Convert floating point splats to integer
Under MVE a vdup will always take a gpr register, not a floating point
value. During DAG combine we convert the types to a bitcast to an
integer in an attempt to fold the bitcast into other instructions. This
is OK, but only works inside the same basic block. To do the same trick
across a basic block boundary we need to convert the type in
codegenprepare, before the splat is sunk into the loop.

This adds a convertSplatType function to codegenprepare to do that,
putting bitcasts around the splat to force the type to an integer. There
is then some adjustment to the code in shouldSinkOperands to handle the
extra bitcasts.

Differential Revision: https://reviews.llvm.org/D78728
2020-05-13 15:24:16 +01:00
David Green eecba95067 [ARM] Replace arm vendor with none. NFC 2020-04-22 18:19:35 +01:00
David Green 892af45c86 [ARM] Distribute MVE post-increments
This adds some extra processing into the Pre-RA ARM load/store optimizer
to detect and merge MVE loads/stores and adds of the same base. This we
don't always turn into a post-inc during ISel, and due to the nature of
it being a graph we don't always know an order to use for the nodes, not
knowing which nodes to make post-inc and which to use the new post-inc
of. After ISel, we have an order that we can use to post-inc the
following instructions.

So this looks for a loads/store with a starting offset of 0, and an
add/sub from the same base, plus a number of other loads/stores. We then
do some checks and convert the zero offset load/store into a postinc
variant. Any loads/stores after it have the offset subtracted from their
immediates.  For example:
  LDR #4           LDR #4
  LDR #0           LDR_POSTINC #16
  LDR #8           LDR #-8
  LDR #12          LDR #-4
  ADD #16
It only handles MVE loads/stores at the moment. Normal loads/store will
be added in a followup patch, they just have some extra details to
ensure that we keep generating LDRD/LDM successfully.

Differential Revision: https://reviews.llvm.org/D77813
2020-04-22 14:16:51 +01:00
David Green 37b9cc8f29 [ARM] Sink splats to vector float instructions
Some MVE floating point instructions have gpr register variants that take
the scalar gpr value and splat them to all lanes. In order to accept
them in loops, the shuffle_vector and insert need to be sunk down into
the loop, next to the instruction so that ISel can see the whole
pattern.

This does that sinking for FAdd, FSub, FMul and FCmp. The patterns for
mul are slightly more constrained as there are no fms variants taking
register arguments.

Differential Revision: https://reviews.llvm.org/D76023
2020-03-26 09:02:18 +00:00
David Green b3499f572d [ARM] Change VDUP type to i32 for MVE
The MVE VDUP instruction take a GPR and splats into every lane of a
vector register. Unlike NEON we do not have a VDUPLANE equivalent
instruction, doing the same splat from a fp register. Previously a VDUP
to a v4f32/v8f16 would be represented as a (v4f32 VDUP f32), which
would mean the instruction pattern needs to add a COPY_TO_REGCLASS to
the GPR.

Instead this now converts that earlier during an ISel DAG combine,
converting (VDUP x) to (VDUP (bitcast x)). This can allow instruction
selection to tell that the input needs to be an i32, which in one of the
testcases allows it to use ldr (or specifically ldm) over (vldr;vmov).

Whilst being simple enough for floats, as the types sizes are the same,
these is no BITCAST equivalent for getting a half into a i32. This uses
a VMOVrh ARMISD node, which doesn't know the same tricks yet.

Differential Revision: https://reviews.llvm.org/D76292
2020-03-20 09:48:45 +00:00
David Green 9cf920e64d [ARM] Extra MVE float loop tests. NFC 2020-03-20 09:21:45 +00:00