The alias analysis in DAG Combine looks at the BaseAlign, the Offset and
the Size of two accesses, and determines if they are known to access
different parts of memory by the fact that they are different offsets
from inside that "alignment window". It does not seem to account for
accesses that are not a multiple of the size, and may overflow from one
alignment window into another.
For example, in the test case we have a 19-byte memset that is split into
a 16-byte NEON store and an unaligned 4-byte store with a 15-byte
offset. This 15-byte offset (with a base align of 8) wraps around to the
next alignment window. When compared to an access at a 16-byte
offset (of the same 4-byte size and 8-byte base align), the two accesses
are said not to alias.
I've fixed this here by just ensuring that the offsets are a multiple of
the size, ensuring that they don't overlap by wrapping. Fixes PR45035,
which was exposed by the UseAA changes in the arm backend.
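A minimal Python model of the window check may make the failure easier to see. It is only a sketch: the function names and the `require_multiple` switch are hypothetical and are not taken from the DAG Combine source, and it assumes both accesses share the same size and base alignment.

```python
def ranges_overlap(off0, off1, size):
    """Ground truth: do [off0, off0+size) and [off1, off1+size) share a byte?"""
    return off0 < off1 + size and off1 < off0 + size


def no_alias_by_alignment(off0, off1, size, base_align, require_multiple):
    """Return True when the alignment-window reasoning claims 'no alias'.

    Hypothetical model of the check described above, for two accesses with
    the same size and the same base alignment.
    """
    if off0 == off1 or base_align <= size:
        return False
    # Modelled fix: only trust the window if both offsets are a multiple of
    # the access size, so neither access can wrap into the next window.
    if require_multiple and (off0 % size != 0 or off1 % size != 0):
        return False
    win0 = off0 % base_align
    win1 = off1 % base_align
    return win0 + size <= win1 or win1 + size <= win0
```

With the numbers from the test case (offsets 15 and 16, size 4, base align 8), the unguarded check reports "no alias" although the byte ranges overlap; with the guard it stays conservative.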
Differential Revision: https://reviews.llvm.org/D75238
We can only report the knownbits for a SCALAR_TO_VECTOR node if we only demand the 0'th element - the upper elements are undefined and shouldn't be trusted.
This is causing a number of regressions that need addressing but we need to get the bugfix in first.
Way back in D24994, the combination of LexicalScopes::dominates and
LiveDebugValues was identified as having worst-case quadratic complexity,
but it wasn't triggered by any code path at the time. I've since run into a
scenario where this occurs, in a very large basic block where large numbers
of inlined DBG_VALUEs are present.
The quadratic-ness comes from LiveDebugValues::join calling "dominates" on
every variable location, and LexicalScopes::dominates potentially touching
every instruction in a block to test for the presence of a scope. We have,
however, already computed the presence of scopes in blocks, in the
"InsnRanges" of each scope. This patch switches the dominates method to
examine whether a block is present in a scope's InsnRanges, avoiding
walking through the whole block.
At the same time, fix getMachineBasicBlocks to account for the fact that
InsnRanges can cover multiple blocks, and add some unit tests, as Lexical
Scopes didn't have any.
Differential revision: https://reviews.llvm.org/D73725
Ensure that we're recording implicit defs, as well as visiting implicit
uses and implicit defs when we're walking through operands.
Differential Revision: https://reviews.llvm.org/D75185
According to Joseph Myers, a libm maintainer:
> They were only ever an ABI (selected by use of -ffinite-math-only or
> options implying it, which resulted in the headers using "asm" to redirect
> calls to some libm functions), not an API. The change means that ABI has
> turned into compat symbols (only available for existing binaries, not for
> anything newly linked, not included in static libm at all, not included in
> shared libm for future glibc ports such as RV32), so, yes, in any case
> where tools generate direct calls to those functions (rather than just
> following the "asm" annotations on function declarations in the headers),
> they need to stop doing so.
As a consequence, we should no longer assume these symbols are available on the
target system.
Still keep the TargetLibraryInfo for constant folding.
Differential Revision: https://reviews.llvm.org/D74712
This is part 3 of a 3-part series to address a compile-time explosion
issue in LiveDebugValues.
---
Start encoding register locations within VarLoc IDs, and take advantage
of this encoding to speed up transferRegisterDef.
There is no fundamental algorithmic change: this patch simply swaps out
SparseBitVector in favor of CoalescingBitVector. That changes iteration
order (hence the test updates), but otherwise this patch is NFCI.
The only interesting change is in transferRegisterDef. Instead of doing:
```
KillSet = {}
for (ID : OpenRanges.getVarLocs())
  if (DeadRegs.count(ID))
    KillSet.add(ID)
```
We now do:
```
KillSet = {}
for (Reg : DeadRegs)
  for (ID : intervalsReservedForReg(Reg, OpenRanges.getVarLocs()))
    KillSet.add(ID)
```
By not visiting each open location every time we visit an instruction,
this eliminates some potentially quadratic behavior. The new
implementation basically does a constant amount of work per instruction
because the interval map lookups are very fast.
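As a rough illustration of why the per-register lookup is cheap, here is a Python sketch that keeps the open location IDs sorted and finds the IDs reserved for a register with two binary searches. The 32/32 bit split, the sorted-list representation, and the names `make_id`, `ids_reserved_for_reg` and `kill_set` are assumptions made for the example; the patch itself uses CoalescingBitVector.

```python
from bisect import bisect_left

INDEX_BITS = 32  # assumed split: high bits hold the register, low bits an index


def make_id(reg, index):
    """Pack a register number and a per-register index into one integer ID."""
    return (reg << INDEX_BITS) | index


def ids_reserved_for_reg(reg, sorted_ids):
    """Return the IDs in the interval [make_id(reg, 0), make_id(reg + 1, 0))."""
    lo = bisect_left(sorted_ids, make_id(reg, 0))
    hi = bisect_left(sorted_ids, make_id(reg + 1, 0))
    return sorted_ids[lo:hi]


def kill_set(dead_regs, open_ids):
    """Collect the open IDs that describe any of the dead registers."""
    killed = []
    for reg in dead_regs:
        killed.extend(ids_reserved_for_reg(reg, open_ids))
    return killed
```

The work done is proportional to the number of dead registers and matching IDs, not to the total number of open locations.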
For a file in WebKit, this brings the time spent in LiveDebugValues down
from ~2.5 minutes to 4 seconds, reducing compile time spent in that pass
from 28% of the total to just over 1%.
Before:
```
2.49 min 27.8% 0 s LiveDebugValues::process
2.41 min 27.0% 5.40 s LiveDebugValues::transferRegisterDef
1.51 min 16.9% 1.51 min LiveDebugValues::VarLoc::isDescribedByReg() const
32.73 s 6.1% 8.70 s llvm::SparseBitVector<128u>::SparseBitVectorIterator::operator++()
```
After:
```
4.53 s 1.1% 0 s LiveDebugValues::process
3.00 s 0.7% 107.00 ms LiveDebugValues::transferRegisterCopy
892.00 ms 0.2% 406.00 ms LiveDebugValues::transferSpillOrRestoreInst
404.00 ms 0.1% 32.00 ms LiveDebugValues::transferRegisterDef
110.00 ms 0.0% 2.00 ms LiveDebugValues::getUsedRegs
57.00 ms 0.0% 1.00 ms std::__1::vector<>::push_back
40.00 ms 0.0% 1.00 ms llvm::CoalescingBitVector<>::find(unsigned long long)
```
FWIW, I tried the same approach using SparseBitVector, but got bad
results. To do that, I had to extend SparseBitVector to support 64-bit
indices and expose its lower bound operation. The problem with this is
that the performance is very hard to predict: SparseBitVector's lower
bound operation falls back to O(n) linear scans in a std::list if you're
not /very/ careful about managing iteration order. When I profiled this
the performance looked worse than the baseline.
You can see the full CoalescingBitVector-based implementation here:
https://github.com/vedantk/llvm-project/commits/try-coalescing
You can see the full SparseBitVector-based implementation here:
https://github.com/vedantk/llvm-project/commits/try-sparsebitvec-find
Depends on D74984 and D74985.
Differential Revision: https://reviews.llvm.org/D74986
This is part 2 of a 3-part series to address a compile-time explosion
issue in LiveDebugValues.
---
Each VarLoc has a unique ID: this ID is used to look up a VarLoc in the
VarLocMap, and to virtually insert a VarLoc into a VarLocSet. Instead of
inserting the VarLoc /itself/ into the VarLocSet, we insert just the ID,
because this can be represented efficiently with a SparseBitVector.
This change introduces LocIndex, a layer of abstraction on top of VarLoc
IDs. Prior to this change, an ID was just an index into a vector. With
this change, an ID encodes both an index /and/ a register location. The
type-checker ensures that conversions to and from LocIndex are correct.
For the moment the register location is always 0 (undef). We have plenty
of bits left over to encode physregs, stack slots, and other locations
in the future.
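The idea of an ID that carries both a location and an index can be sketched in Python as follows. The field names, the 32-bit widths, and the `as_raw`/`from_raw` helpers are assumptions for illustration and are not claimed to match the LocIndex definition in the patch.

```python
from typing import NamedTuple


class LocIndex(NamedTuple):
    """Illustrative stand-in: a location (0 meaning undef) plus a vector index."""
    location: int
    index: int

    def as_raw(self):
        # Pack both 32-bit fields into one 64-bit integer ID.
        return (self.location << 32) | self.index

    @staticmethod
    def from_raw(raw):
        # Inverse of as_raw: split the 64-bit ID back into its two fields.
        return LocIndex(raw >> 32, raw & 0xFFFFFFFF)
```

With the location fixed at 0, the raw ID equals the plain index, which is consistent with the register location being always 0 for the moment.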
Differential Revision: https://reviews.llvm.org/D74985
The code changes here are hopefully straightforward:
1. Use MachineInstruction flags to decide if FP ops can be reassociated
(use both "reassoc" and "nsz" to be consistent with IR transforms;
we probably don't need "nsz", but that's a safer interpretation of
the FMF).
2. Check that both nodes allow reassociation to change instructions.
This is a stronger requirement than we've usually implemented in
IR/DAG, but this is needed to solve the motivating bug (see below),
and it seems unlikely to impede optimization at this late stage.
3. Intersect/propagate MachineIR flags to enable further reassociation
in MachineCombiner.
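The three points above can be modelled with plain bit flags. This is a hedged Python sketch: the constants and function names are hypothetical and do not correspond to the actual MachineInstr flag values or target hooks.

```python
# Hypothetical flag bits for the sketch; not the real MachineInstr values.
REASSOC = 1 << 0
NSZ = 1 << 1
OTHER = 1 << 2

REQUIRED = REASSOC | NSZ


def can_reassociate(flags_a, flags_b):
    """Both instructions must carry 'reassoc' and 'nsz' (points 1 and 2)."""
    return (flags_a & REQUIRED) == REQUIRED and (flags_b & REQUIRED) == REQUIRED


def merged_flags(flags_a, flags_b):
    """New instructions keep only the flags common to both inputs (point 3)."""
    return flags_a & flags_b
```

An op created without any FMF (flags 0) blocks reassociation with a fast neighbour, which is the behaviour needed for the special fadd ops discussed below.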
We managed to make MachineCombiner flexible enough that no changes are
needed to that pass itself. So this patch should only affect x86
(assuming no other targets have implemented the hooks using MachineIR
flags yet).
The motivating example in PR43609 is another case of fast-math transforms
interacting badly with special FP ops created during lowering:
https://bugs.llvm.org/show_bug.cgi?id=43609
The special fadd ops used for converting int to FP assume that they will
not be altered, so those are created without FMF.
However, the MachineCombiner pass was being enabled for FP ops using the
global/function-level TargetOption for "UnsafeFPMath". We managed to run
instruction/node-level FMF all the way down to MachineIR sometime in the
last 1-2 years though, so we can do better now.
The test diffs require some explanation:
1. llvm/test/CodeGen/X86/fmf-flags.ll - no target option for unsafe math was
specified here, so MachineCombiner kicks in where it did not previously;
to make it behave consistently, we need to specify a CPU schedule model,
so use the default model, and there are no code diffs.
2. llvm/test/CodeGen/X86/machine-combiner.ll - replace the target option for
unsafe math with the equivalent IR-level flags, and there are no code diffs;
we can't remove the NaN/nsz options because those are still used to drive
x86 fmin/fmax codegen (special SDAG opcodes).
3. llvm/test/CodeGen/X86/pow.ll - similar to #1
4. llvm/test/CodeGen/X86/sqrt-fastmath.ll - similar to #1, but MachineCombiner
does some reassociation of the estimate sequence ops; presumably these are
perf wins based on latency/throughput (and we get some reduction of move
instructions too); I'm not sure how it affects numerical accuracy, but the
test reflects reality better now because we would expect MachineCombiner to
be enabled if the IR was generated via something like "-ffast-math" with clang.
5. llvm/test/CodeGen/X86/vec_int_to_fp.ll - this is the test added to model PR43609;
the fadds are not reassociated now, so we should get the expected results.
6. llvm/test/CodeGen/X86/vector-reduce-fadd-fast.ll - similar to #1
7. llvm/test/CodeGen/X86/vector-reduce-fmul-fast.ll - similar to #1
Differential Revision: https://reviews.llvm.org/D74851
This addresses the issues P8198 and P8199 (from D73534).
The methods did not handle bundles properly.
Differential Revision: https://reviews.llvm.org/D74904
Summary:
If the describeLoadedValue() hook produced a DIExpression when
describing an instruction, and it was not possible to emit a call site
entry directly (the value operand was not an immediate nor a preserved
register), then that described value could not be inserted into the
worklist, and would instead be dropped, meaning that the parameter's
call site value couldn't be described.
This patch extends the worklist so that each entry has a DIExpression
that is built up when iterating through the instructions.
This allows us to describe instruction chains like this:
$reg0 = mv $fp
$reg0 = add $reg0, offset
call @call_with_offseted_fp
Since DW_OP_LLVM_entry_value operations can't be combined with any other
expression, such call site entries will not be emitted. I have added a
test, dbgcall-site-expr-entry-value.mir, which verifies that we don't
assert or emit broken DWARF in such cases.
Reviewers: djtodoro, aprantl, vsk
Reviewed By: djtodoro, vsk
Subscribers: hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D75036
Summary:
This is a preparatory patch for D75036, in which a debug expression is
associated with each parameter register in the worklist. In that patch
the two lambda functions addToWorklist() and finishCallSiteParams() grow
a bit, so move those out to separate functions. This patch also prepares
for each parameter register having its own expression by moving the
creation of the DbgValueLoc into finishCallSiteParams().
Reviewers: djtodoro, vsk
Reviewed By: djtodoro, vsk
Subscribers: hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D75050
This may inhibit vector narrowing in general, but there's
already an inconsistency in the way that we deal with this
pattern as shown by the test diff.
We may want to add a dedicated function for narrowing fneg.
It's often folded into some other op, so moving it away from
other math ops may cause regressions that we would not see
for normal binops.
See D73978 for more details.
Add getUniqueReachingMIDef to RDA which performs a global search for
a machine instruction that produces a unique definition of a given
register at a given point. Also add two helper functions
(getMIOperand) that wrap around this functionality to get the
incoming definition uses of a given instruction. These now replace
the uses of getReachingMIDef in ARMLowOverheadLoops. getReachingMIDef
has been renamed to getReachingLocalMIDef and has been made private
along with getInstFromId.
Differential Revision: https://reviews.llvm.org/D74605
Only MCAsmStreamer (assembly output) needs to keep names of temporary labels created by
MCContext::createTempSymbol().
This change makes the rL236642 optimization available for cc1as and
probably some other users.
This eliminates a behavior difference between llvm-mc -filetype=obj and cc1as, which caused
https://reviews.llvm.org/D74006#1890487
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D75097
This node reads the rounding control which means it needs to be ordered properly with operations that change the rounding control. So it needs to be chained to maintain order.
This patch adds a chain input and output to the node and connects it to the chain in SelectionDAGBuilder. I've updated all in-tree targets to connect their chain through their lowering code.
Differential Revision: https://reviews.llvm.org/D75132
Contrary to what I claimed in my previous commit, the caching is
actually not NFC on PHIs.
When we put a big enough max depth, we end up simulating loops.
The cache is effectively cutting the simulation short and we
get less information as a result.
E.g.,
```
v0 = G_CONSTANT i8 0xC0
jump
v1 = G_PHI i8 v0, v2
v2 = G_LSHR i8 v1, 1
```
Let's say we want the known bits of v1.
- With cache:
Set v1 cache to we know nothing
v1 is v0 & v2
v0 gives us 0xC0
v2 gives us known bits of v1 >> 1
v1 is in the cache
=> v1 is 0, thus v2 is 0x80
Finally v1 is v0 & v2 => 0x80
- Without cache and enough depth to do two iterations of the loop:
v1 is v0 & v2
v0 gives us 0xC0
v2 gives us known bits of v1 >> 1
v1 is v0 & v2
v0 is 0xC0
v2 is v1 >> 1
Reach the max depth for v1...
unwinding
v1 is know nothing
v2 is 0x80
v0 is 0xC0
v1 is 0x80
v2 is 0xC0
v0 is 0xC0
v1 is 0xC0
Thus now v1 is 0xC0 instead of 0x80.
I've added a unittest demonstrating that.
NFC
Add a dump method that recursively prints an instruction and all
the instructions defining its operands and so on.
This is helpful when looking at combiner issues.
NFC
Differential Revision: https://reviews.llvm.org/D75094
MachineVerifier still takes 45-50% of total compile time with
-verify-machineinstrs, with calcRegsPassed dataflow taking ~50-60% of
MachineVerifier.
The majority of that time is spent in BBInfo::addPassed, mostly within
DenseSet implementing the sets the dataflow is operating over.
In particular, 1/4 of that DenseSet time is spent just iterating over it
(operator++), 40-50% on insertions, and most of the rest in ::count.
Given that, we're implementing custom sets just for this analysis here,
focusing on cheap insertions and O(n) iteration time (as opposed to
O(U), where U is the universe).
As it's based _mostly_ on BitVector for sparse and SmallVector for
dense, it may remotely resemble SparseSet. The difference is that our
solution is a lot less clever: it doesn't have the constant-time `clear`,
which we wouldn't use anyway as reusing these sets across analyses is
cumbersome, and is thus more space efficient and safer (it has a
resizable Universe and a fallback to DenseSet for sparse if it gets too
big).
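For readers unfamiliar with this kind of set, here is a small Python sketch of the general shape: a membership bitmap indexed by element plus a dense list for iteration. The class name and methods are hypothetical, and the sketch leaves out the resizable universe and the DenseSet fallback mentioned above.

```python
class RegSet:
    """Illustrative set over a small integer universe.

    Insertions are cheap (one flag test and one append), and iteration costs
    O(n) in the number of stored elements rather than O(U) in the universe.
    """

    def __init__(self, universe_size):
        self.member = bytearray(universe_size)  # membership flag per element
        self.dense = []                         # elements in insertion order

    def insert(self, reg):
        """Add reg; return True if it was not already present."""
        if self.member[reg]:
            return False
        self.member[reg] = 1
        self.dense.append(reg)
        return True

    def __contains__(self, reg):
        return bool(self.member[reg])

    def __iter__(self):
        return iter(self.dense)
```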
With this patch MachineVerifier gets ~15-20% faster, its contribution to
total compile time drops from 45-50% to ~35%, while contribution of
calcRegsPassed to MachineVerifier drops from 50-60% to ~35% as well.
calcRegsPassed itself gets another 2x faster here.
All measured on a large suite of shaders targeting a number of GPUs.
Reviewers: bogner, stoklund, rudkx, qcolombet
Reviewed By: rudkx
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75033
Summary:
Terminators in LLVM aren't prohibited from returning values. This means that
the "callbr" instruction, which is used for "asm goto", can support "asm goto
with outputs."
This patch removes all restrictions against "callbr" returning values. The
heavy lifting is done by the code generator. The "INLINEASM_BR" instruction's
a terminator, and the code generator doesn't allow non-terminator instructions
after a terminator. In order to correctly model the feature, we need to copy
outputs from "INLINEASM_BR" into virtual registers. Of course, those copies
aren't terminators.
To get around this issue, we split the block containing the "INLINEASM_BR"
right before the "COPY" instructions. This results in two cheats:
- Any physical registers defined by "INLINEASM_BR" need to be marked as
live-in into the block with the "COPY" instructions. This violates an
assumption that physical registers aren't marked as "live-in" until after
register allocation. But it seems as if the live-in information only
needs to be correct after register allocation. So we're able to get away
with this.
- The indirect branches from the "INLINEASM_BR" are moved to the "COPY"
block. This is to satisfy PHI nodes.
I've been told that MLIR can support this handily, but until we're able to
use it, we'll have to stick with the above.
Reviewers: jyknight, nickdesaulniers, hfinkel, MaskRay, lattner
Reviewed By: nickdesaulniers, MaskRay, lattner
Subscribers: rriddle, qcolombet, jdoerfert, MatzeB, echristo, MaskRay, xbolva00, aaron.ballman, cfe-commits, JonChesterfield, hiraditya, llvm-commits, rnk, craig.topper
Tags: #llvm, #clang
Differential Revision: https://reviews.llvm.org/D69868
Changes the handling of odd breakdowns, and avoids using
G_EXTRACT/G_INSERT. Pad with undef to a wider size, and unmerge. Also
avoid introducing instructions for the fully undef components.
Depending on the target, test suite, pipeline config, and perhaps other
factors, the machine verifier, when forced on with -verify-machineinstrs,
can increase compile time 2-2.5 times over (Release, Asserts On), taking
up ~60% of the time. It is an invaluable tool, but this overhead
significantly slows down machine verifier-enabled testing.
MachineVerifier spends nearly 75% of its time in the calcRegsPassed
method. It's a classic forward dataflow analysis executed over sets, but
visiting MBBs in arbitrary order. We switch that to RPO here.
This speeds up MachineVerifier by about 35%, decreasing the overall
compile time with -verify-machineinstrs by 20-25% or so.
calcRegsPassed itself gets 2x faster here.
All measured on a large suite of shaders targeting a number of GPUs.
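To illustrate why visiting order matters for a forward dataflow problem, here is a Python sketch that computes a reverse post-order and counts how many sweeps a simple iterate-to-fixpoint solver needs. The sweep-based solver, the string block names, and the "passed" transfer function are simplifications assumed for the example; they are not the calcRegsPassed implementation.

```python
def reverse_post_order(succs, entry):
    """Return the blocks reachable from entry in reverse post-order."""
    seen, post = set(), []

    def visit(block):
        seen.add(block)
        for succ in succs.get(block, ()):
            if succ not in seen:
                visit(succ)
        post.append(block)

    visit(entry)
    return post[::-1]


def solve_forward(succs, defs, order):
    """Propagate register sets forward; return (passed sets, sweep count)."""
    preds = {b: [] for b in order}
    for b in order:
        for s in succs.get(b, ()):
            preds[s].append(b)
    passed = {b: set() for b in order}
    sweeps, changed = 0, True
    while changed:
        changed = False
        sweeps += 1
        for b in order:
            # A block receives whatever its predecessors define or pass on.
            new = set()
            for p in preds[b]:
                new |= passed[p] | defs[p]
            if new != passed[b]:
                passed[b] = new
                changed = True
    return passed, sweeps
```

On a straight-line CFG of four blocks, the reverse post-order reaches the fixed point in two sweeps, while the opposite order needs four to compute the same sets.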
Reviewers: bogner, stoklund, rudkx, qcolombet
Reviewed By: bogner
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75032
This is the second patch as part of https://bugs.llvm.org/show_bug.cgi?id=36544
Merging in the ConstantSDNode variant of FoldConstantArithmetic. After this, I will begin merging in FoldConstantVectorArithmetic.
I've ensured this patch can build & pass all lit tests in Windows and Linux environments.
Patch by @justice_adams (Justice Adams)
Differential Revision: https://reviews.llvm.org/D74881
This adds infrastructure to print and parse MIR MachineOperand comments.
The motivation for the ARM backend is to print condition code names instead of
magic constants that are difficult to read (for human beings). For example,
instead of this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14, $noreg
t2Bcc %bb.4, 0, killed $cpsr
we now print this:
dead renamable $r2, $cpsr = tEOR killed renamable $r2, renamable $r1, 14 /* CC::always */, $noreg
t2Bcc %bb.4, 0 /* CC::eq */, killed $cpsr
This shows that MachineOperand comments are enclosed between /* and */. In this
example, the EOR instruction is not conditionally executed (i.e. it is "always
executed"), which is encoded by the 14 immediate machine operand. Thus, now
this machine operand has /* CC::always */ as a comment. The 0 on the next
conditional branch instruction represents the equal condition code, thus now
this operand has /* CC::eq */ as a comment.
As it is a comment, the MI lexer/parser completely ignores it. The benefit is
that this keeps the change in the lexer extremely minimal and no target
specific parsing needs to be done. The changes on the MIPrinter side are also
minimal, as there is only one target hook that is used to create the machine
operand comments.
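The round trip can be illustrated with a small Python sketch that drops `/* ... */` operand comments from a printed line. The regular expression and the helper name are assumptions made for the example; the real MI lexer is not implemented this way.

```python
import re

# Matches an operand comment together with the whitespace that precedes it.
_OPERAND_COMMENT = re.compile(r"\s*/\*.*?\*/")


def strip_operand_comments(line):
    """Return the MIR line as a comment-ignoring reader would see it."""
    return _OPERAND_COMMENT.sub("", line)
```

Applied to the annotated tEOR line above, this yields exactly the line that was printed before the change, so no target-specific parsing is needed.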
Differential Revision: https://reviews.llvm.org/D74306
Change the way that we remove the redundant iteration count code in
the presence of IT blocks. collectLocalKilledOperands has been
introduced to scan an instruction's operands, collecting the killed
instructions and then visiting them too. This is used to delete the
code in the preheader which calculates the iteration count. We also
track any IT blocks within the preheader and, if we remove all the
instructions from the IT block, we also remove the IT instruction.
isSafeToRemove is used to remove any redundant uses of the iteration
count within the loop body.
Differential Revision: https://reviews.llvm.org/D74975
Summary:
This patch adds intrinsics and ISelDAG nodes for signed
and unsigned fixed-point division:
```
llvm.sdiv.fix.sat.*
llvm.udiv.fix.sat.*
```
These intrinsics perform scaled, saturating division
on two integers or vectors of integers. They are
required for the implementation of the Embedded-C
fixed-point arithmetic in Clang.
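As a worked illustration of "scaled, saturating division", here is a Python model on `bits`-wide integers. It is a sketch under assumptions: the function names are hypothetical, and rounding toward zero is chosen only to make the example deterministic; it is not a statement about the rounding the intrinsics guarantee.

```python
def sdiv_fix_sat(a, b, scale, bits):
    """Signed fixed-point division of a by b, both with `scale` fractional bits."""
    if b == 0:
        raise ZeroDivisionError("division by zero is not modelled")
    wide = a << scale                  # pre-scale so the quotient keeps the scale
    quotient = abs(wide) // abs(b)     # assumed rounding: toward zero
    if (wide < 0) != (b < 0):
        quotient = -quotient
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, quotient))  # saturate to the signed range


def udiv_fix_sat(a, b, scale, bits):
    """Unsigned counterpart, saturating to [0, 2**bits - 1]."""
    if b == 0:
        raise ZeroDivisionError("division by zero is not modelled")
    return min((1 << bits) - 1, (a << scale) // b)
```

For example, with 4 fractional bits and 8-bit values, 1.5 / 2.0 (24 / 32) gives 12, i.e. 0.75, while 4.0 / 0.25 (64 / 4) would be 256 and saturates to 127.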
Reviewers: bjope, leonardchan, craig.topper
Subscribers: hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71550
Summary:
The type used to represent functional units in MC is
'unsigned', which is 32 bits wide. This is currently
not a problem in any upstream target as no one seems
to have hit the limit on this yet, but in our
downstream one, we need to define more than 32
functional units.
Increasing the size does not seem to cause a huge
size increase in the binary (an llc debug build went
from 1366497672 to 1366523984, a difference of 26k),
so perhaps it would be acceptable to have this patch
applied upstream as well.
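The limit can be illustrated with a one-line Python model, assuming each functional unit occupies one bit of a mask stored in the given width. The helper name is hypothetical.

```python
def unit_mask(unit_index, width_bits):
    """Bitmask for one functional unit, truncated to the storage width."""
    return (1 << unit_index) & ((1 << width_bits) - 1)
```

Under that assumption, unit number 32 (the 33rd unit) is lost in a 32-bit mask but representable in a 64-bit one.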
Subscribers: hiraditya, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71210
This version fixes a buildbot failure caused by picking the wrong insert
point for XORs. We cannot pick the XOR binary operator as insert point,
as it is not guaranteed that both input operands for the overflow
intrinsic are defined before it.
This reverts the revert commit
c7fc0e5da6.
A question about this behavior came up on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2020-February/139003.html
...and as part of backend improvements in D73978.
We decided not to implement a more general change that would have
folded any FP binop with nearly arbitrary constant + undef operand
to undef because that is not theoretically correct (even if it is
practically correct).
This is the SDAG-equivalent to the IR change in D74713.
This patch adds a cache that is valid only for the duration of a call
to getKnownBits. With such short lived cache we avoid all the problems
of cache invalidation while still getting the benefits of reusing
the information we already computed.
This cache is useful whenever an instruction occurs more than once
in a chain of computation.
E.g.,
v0 = G_ADD v1, v2
v3 = G_ADD v0, v1
Previously we would compute the known bits for:
v1, v2, v0, then v1 again and finally v3.
With the patch, now we won't have to recompute v1 again.
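The effect on the example above can be reproduced with a small Python sketch that counts how often each value is analyzed. The function name and the dictionary encoding of the instructions are hypothetical; the cache here lives only for one top-level query, as described.

```python
def count_visits(defs, root, use_cache):
    """Count how many times each value is analyzed while walking operands."""
    visits = {}
    cache = set()  # valid only for the duration of this one query

    def analyze(value):
        if use_cache and value in cache:
            return
        visits[value] = visits.get(value, 0) + 1
        for operand in defs.get(value, ()):
            analyze(operand)
        cache.add(value)

    analyze(root)
    return visits


# v0 = G_ADD v1, v2 ; v3 = G_ADD v0, v1
EXAMPLE = {"v0": ("v1", "v2"), "v3": ("v0", "v1")}
```

Querying v3 without the cache analyzes v1 twice; with the cache each value is analyzed once.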
NFC
Summary:
This prevents BFI queries on new blocks (from
MachineSinking::GetAllSortedSuccessors) and fixes a bunch of assert failures
under -check-bfi-unknown-block-queries=true.
Reviewers: davidxl
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74511
This changes the SimplifyLibCalls utility to accept an IRBuilderBase,
which allows us to pass through the IRBuilder used by InstCombine.
This will ensure that new instructions get added to the worklist.
The annotated test-case drops from 4 to 2 InstCombine iterations thanks
to this.
To achieve this, I'm adding an IRBuilderBase::OperandBundlesGuard,
which is basically the same as the existing InsertPointGuard and
FastMathFlagsGuard, but for operand bundles. Also add a
setDefaultOperandBundles() method so these can be set outside the
constructor.
Differential Revision: https://reviews.llvm.org/D74792
Use the SelectionDAG::getValidShiftAmountConstant helper to get const/constsplat shift amounts, which allows us to drop the out of range shift amount early-out.
First step towards better non-uniform shift amount support in SimplifyDemandedBits.
I believe this was carried over from getELFKindForNamedSection since
the wasm backend originally used ELF object writing as a template.
Differential Revision: https://reviews.llvm.org/D74565
This includes both GEPs where the indexed type is a scalable vector, and
GEPs where the result type is a scalable vector.
Differential Revision: https://reviews.llvm.org/D73602
When analyzing PHIs, we gather the known bits for every operand and
merge them together to get the known bits of the result of the PHI.
It is not unusual that merging the information leaves us knowing nothing
about the result (e.g., phi a: i8 3, b: i8 unknown, ...: after looking at
the second argument we know that nothing will be known about the result).
Thus, as soon as we reach that state, we stop analyzing the remaining
operands (i.e., in the previous example, we won't process anything after
looking at `b`).
This improves compile time in particular with PHIs with a large number
of operands.
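A Python sketch of the early exit, using a pair of known-zero/known-one masks, might look like this. The `KnownBits` tuple, the helper names, and the 8-bit default width are assumptions for the example and do not mirror the actual GlobalISel code.

```python
from typing import NamedTuple


class KnownBits(NamedTuple):
    zero: int  # bits known to be 0
    one: int   # bits known to be 1


def constant(value, width=8):
    """Known bits of a constant: every bit is known."""
    mask = (1 << width) - 1
    return KnownBits(~value & mask, value & mask)


UNKNOWN = KnownBits(0, 0)


def phi_known_bits(operands, width=8):
    """Intersect operand known bits; stop once nothing is known.

    Returns the merged known bits and the number of operands examined.
    """
    mask = (1 << width) - 1
    zero, one = mask, mask  # neutral starting point for the intersection
    examined = 0
    for op in operands:
        examined += 1
        zero &= op.zero
        one &= op.one
        if zero == 0 and one == 0:
            break  # further operands cannot add information
    return KnownBits(zero, one), examined
```

For `phi 3, unknown, 5, 7`, only the first two operands are examined before the result is known to be "nothing".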
NFC.
Similar to what we already do with SimplifyDemandedVectorElts, call SimplifyDemandedBits across all the extracted elements of the source vector, treating it as single use.
There's a minor regression in store-weird-sizes.ll which will be addressed in an upcoming SimplifyDemandedBits patch.