This call is expensive, so don't perform it for zero indices.
Also rename the variable to use Alloc rather than Alloca, this
doesn't have anything to do with allocas in particular.
Also remove new-pass-manager version of ExpandLargeDivRem because there is no way
yet to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
The previous implementation of time tracing in NewPassManager is direct but messive.
The key codes are like the demo below:
```
/// Runs the function pass across every function in the module.
PreservedAnalyses run(LazyCallGraph::SCC &C, CGSCCAnalysisManager &AM,
LazyCallGraph &CG, CGSCCUpdateResult &UR) {
/// ...
PreservedAnalyses PassPA;
{
TimeTraceScope TimeScope(Pass.name());
PassPA = Pass.run(F, FAM);
}
/// ...
}
```
It can be bothered to judge where should we add the tracing codes by hands.
With the PassInstrumentation framework, we can easily add `Before/After` callback
functions to add time tracing codes.
Differential Revision: https://reviews.llvm.org/D131960
In the Tensorflow C lib utilities, an error gets thrown if some features
haven't gotten passed into the model (due to differences in ordering
which now don't exist with the transition to TFLite). However, this is
not currently the case when using TFLiteUtils. This patch makes some
minor changes to throw an error when not all inputs of the model have
been passed, which when not handled will result in a seg fault within
TFLite.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D133451
This adds the ExpandLargeDivRem to the default pass pipeline.
The limit at which it expands div/rem instructions is configured
via a new TargetTransformInfo hook (default: no expansion)
X86, Arm and AArch64 backends implement this hook to expand div/rem
instructions with more than 128 bits.
Differential Revision: https://reviews.llvm.org/D130076
The if-statement should check whehter TFLITE is on or not rather than if the variable is specified.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D132902
CGP uses a raw `getInstructionCost(I, TargetTransformInfo::TCK_SizeAndLatency) >= TCC_Expensive` check to see if its better to move an expensive instruction used in a select behind a branch instead.
This is causing issues with upcoming improvements to TCK_SizeAndLatency costs on X86 as we need to use TCK_SizeAndLatency as an uop count (so its compatible with various target-specific buffer sizes - see D132288), but we can have instructions that have a low TCK_SizeAndLatency value but should still be treated as 'expensive' (FDIV for example) - by adding a isExpensiveToSpeculativelyExecute wrapper we can keep the current behaviour but still add an x86 override in a future patch when the cost tables are updated to compensate.
The current code is basically just emulating what the analysis manager does.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D132581
We currently instrument CallBrInst but do not annotate it with
the branch weight. This patch enables PGO annotation of CallBrInst.
Differential Revision: https://reviews.llvm.org/D133040
This is a fix for PR57025 and an alternative to D131776. The problem
in the phi-translation-to-wrong-context.ll test case is that phi
translation of %gep.j into if2 pick %gep.i as the result. While this
instruction has the correct pointer address, it occurs in a context
where %i != 0. As such, we get a NoAlias result for the store in
if2, even though they do alias for %i == 0 (which is legal in the
original context of the pointer).
PHITranslateValue already has a MustDominate option, which can be
used to restrict PHI translation results to values that dominate the
translated-into block. However, this is more aggressive than what we
need and would significantly regress GVN results. In particular, if
we have a pointer value that does not require any translation, then
it is fine to continue using that value in the predecessor, because
the context is still correct for the original query. We only run into
problems if PHITranslateSubExpr() picks a completely random
instruction in a context that may have preconditions that do not hold.
Fix this by always performing the dominance checks in
PHITranslateSubExpr(), without enabling the more general MustDominate
requirement.
Fixes https://github.com/llvm/llvm-project/issues/57025. This also
fixes the test case for https://github.com/llvm/llvm-project/issues/30999,
but I'm not sure whether that's just the particular test case,
or a general solution to the problem.
Differential Revision: https://reviews.llvm.org/D132935
Also, some local variables were renamed in accordance with the code
style as well as `std::tie` occurrences and `.first`/`.second` member
uses were replaced with structure bindings.
Differential Revision: https://reviews.llvm.org/D132806
This patch replaces calls to GreatestCommonDivisor64 with std::gcd
where both arguments are known to be of unsigned types no larger than
64 bits in size.
Otherwise when we visit all libcalls in
updateCGAndAnalysisManagerForPass(), the old libcall is dead and doesn't
have a node.
We treat libcalls conservatively in LazyCallGraph because any function
may introduce calls to them out of thin air.
It is weird to change the signature of a libcall since introducing calls
to the libcall with a different signature may break, but other passes
like deadargelim already do it, so let's preserve this behavior for now.
Fixes an issue found in D128830.
Reviewed By: psamolysov
Differential Revision: https://reviews.llvm.org/D132764
The simpler diff-checks require pointers with add-recs from the same
innermost loop, but this property wasn't check completely. Add the
missing check to ensure both addrecs are in the innermost loop.
Fixes#57315.
When we have a dependency with a dependence distance which can only be hit on an iteration beyond the actual trip count of the loop, we can ignore that dependency when analyzing said loop. We already had this code, but had restricted it solely to unknown dependence distances. This change applies it to all dependence distances.
Without this code, we relied on the vectorizer reducing VF such that our infeasible dependence was respected. This usually worked out to about the same result, but not always. For fixed length vectorization, this could mean a smaller VF than optimal being chosen or additional runtime checks. For scalable vectorization - where the bounds on access implied by VF are broader - we could often not find a feasible VF at all.
Differential Revision: https://reviews.llvm.org/D131924
This is a long-standing FIXME with a non-FMF test that exposes
the bug as shown in issue #57357.
It's possible that there's still a way to miscompile by
mis-identifying/mis-folding FP min/max patterns, but
this patch only exposes a couple of seemingly minor
regressions while preventing the broken transform.
TLite is a lightweight, statically linkable[1], model evaluator, supporting a
subset of what the full tensorflow library does, sufficient for the
types of scenarios we envision having. It is also faster.
We still use saved models as "source of truth" - 'release' mode's AOT
starts from a saved model; and the ML training side operates in terms of
saved models.
Using TFLite solves the following problems compared to using the full TF
C API:
- a compiler-friendly implementation for runtime-loadable (as opposed
to AOT-embedded) models: it's statically linked; it can be built via
cmake;
- solves an issue we had when building the compiler with both AOT and
full TF C API support, whereby, due to a packaging issue on the TF
side, we needed to have the pip package and the TF C API library at
the same version. We have no such constraints now.
The main liability is it supporting a subset of what the full TF
framework does. We do not expect that to cause an issue, but should that
be the case, we can always revert back to using the full framework
(after also figuring out a way to address the problems that motivated
the move to TFLite).
Details:
This change switches the development mode to TFLite. Models are still
expected to be placed in a directory - i.e. the parameters to clang
don't change; what changes is the directory content: we still need
an `output_spec.json` file; but instead of the saved_model protobuf and
the `variables` directory, we now just have one file, `model.tflite`.
The change includes a utility showing how to take a saved model and
convert it to TFLite, which it uses for testing.
The full TF implementation can still be built (not side-by-side). We
intend to remove it shortly, after patching downstream dependencies. The
build behavior, however, prioritizes TFLite - i.e. trying to enable both
full TF C API and TFLite will just pick TFLite.
[1] thanks to @petrhosek's changes to TFLite's cmake support and its deps!
This has the effect of exposing the power-of-two property for use in memory op costing, but no target actually uses it yet. The main point of this change is simple consistency with the recently changes getArithmeticInstrCost, and to remove the last (interface) use of OperandValueKind.
This change completes the process of replacing OperandValueKind and OperandValueProperties which were previously passed independently in this API with a single container class which contains both.
This is the change which motivated the whole sequence which preceeded it. In an original spike version of this change, I'd noticed a nasty bug: I'd changed the signature without changing names, and as result, we silently passed additional information through a callsite which previously dropped the power-of-two fact. This might be harmless in most cases, but at least a couple clearly dependend for correctness on not passing that property through.
I did my best to split off prior changes which reduced the scope of this one, and which made it possible to use compiler assistance. For instance, every parameter which changes type in this change also changes name. This was intentional to make sure that every call site possible effected must show up in the diff. This let me audit each one closely.
This removes the last use of OperandValueKind from the client side API, and (once this is fully plumbed through TTI implementation) allow use of the same properties in store costing as arithmetic costing.
This completes the client side transition to the OperandValueInfo version of this routine. Backend TTI implementations still use the prior versions for now.
OperandValueKind and OperandValueProperties both provide facts about the operands of an instruction for purposes of cost modeling. We've discussed merging them several times; before I plumb through more flags, let's go ahead and do so.
This change only adds the client side interface for getArithmeticInstrCost and makes a couple of minor changes in client code to prove that it works. Target TTI implementations still use the split flags. I'm deliberately splitting what could be one big change into a series of smaller ones so that I can lean on the compiler to catch errors along the way.
Initial implementation had too weak requirements to positive/negative
range crossings. Not crossing zero with nuw is not enough for two reasons:
- If ArLHS has negative step, it may turn from positive to negative
without crossing 0 boundary from left to right (and crossing right to
left doesn't count for unsigned);
- If ArLHS crosses SINT_MAX boundary, it still turns from positive to
negative;
In fact we require that ArLHS always stays non-negative or negative,
which an be enforced by the following set of preconditions:
- both nuw and nsw;
- positive step (looks liftable);
Because of positive step, boundary crossing is only possible from left
part to the right part. And because of no-wrap flags, it is guaranteed
to never happen.
Defaults to TCK_RecipThroughput - as most explicit calls were assuming TCK_RecipThroughput (vectorizers) or was just doing a before-vs-after comparison (vectorcombiner). Calls via getInstructionCost were just dropping the CostKind, so again there should be no change at this time (as getShuffleCost and its expansions don't use CostKind yet) - but it will make it easier for us to better account for size/latency shuffle costs in inline/unroll passes in the future.
Differential Revision: https://reviews.llvm.org/D132287
Follow-up to 7f1262a322. That patch avoided removing the
call, but it still allowed the constant-folded result. This
makes the behavior consistent with 1-arg libm folding: if the
call potentially raises an exception, then we just bail out.
It seems likely that there are other corner-cases like this,
but the tests are incomplete, so we have lived with these
discrepancies for a long time. This was untested before the
the constant folding was expanded in D127964.
These may raise an error (set errno) as discussed in the post-commit
comments for D127964, so we can't fold away the call and potentially
alter that behavior.
A debug message in `LoopAccessAnalysis` did not have a newline in it, causing printed debug messages to be formatted incorrectly.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D132172
If the incoming previous value of a fixed-order recurrence is a phi in
the header, go through incoming values from the latch until we find a
non-phi value. Use this as the new Previous, all uses in the header
will be dominated by the original phi, but need to be moved after
the non-phi previous value.
At the moment, fixed-order recurrences are modeled as a chain of
first-order recurrences.
Reviewed By: Ayal
Differential Revision: https://reviews.llvm.org/D119661
* Replace getUserCost with getInstructionCost, covering all cost kinds.
* Remove getInstructionLatency, it's not implemented by any backends, and we should fold the functionality into getUserCost (now getInstructionCost) to make it easier for targets to handle the cost kinds with their existing cost callbacks.
Original Patch by @samparker (Sam Parker)
Differential Revision: https://reviews.llvm.org/D79483
Currently, clang ignores the 0 initialisation in finite math
For example:
```
double f_prod = 0;
double arr[1000];
for (size_t i = 0; i < 1000; i++) {
f_prod *= arr[i];
}
```
Clang will ignore that `f_prod` is set to zero and it will generate assembly to iterate over the loop.
Reviewed By: fhahn, spatel
Differential Revision: https://reviews.llvm.org/D131672
Handle cases where a forked pointer has an add or sub instruction
before reaching a select.
Reviewed By: fhahn
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D130278
If computeKnownBits encounters a phi node, and we fail to determine any known bits through direct analysis, see if the incoming value is part of a branch condition feeding the phi.
Handle cases where icmp(IncomingValue PRED Constant) is driving a branch instruction feeding that phi node - at the moment this only handles EQ/ULT/ULE predicate cases as they are the most straightforward to handle and most likely for branch-loop 'max upper bound' cases - we can extend this if/when necessary.
I investigated a more general icmp(LHS PRED RHS) KnownBits system, but the hard limits we put on value tracking depth through phi nodes meant that we were mainly catching constants anyhow.
Fixes the pointless vectorization in PR38280 / Issue #37628 (excessive unrolling still needs handling though)
Differential Revision: https://reviews.llvm.org/D131838
This reverts commit 354fa0b480.
Returning as is. The patch was reverted due to a miscompile, but
this patch is not causing it. This patch made it possible to infer
some nuw flags in code guarded by `false` condition, and then someone
else to managed to propagate the flag from dead code outside.
Returning the patch to be able to reproduce the issue.
We already support SGE, so the same logic should hold for SLE with
the LHS and RHS swapped.
I didn't see this in the wild. Just happened to walk past this code
and thought it was odd that it was asymmetric in what condition
codes it handled.
Reviewed By: spatel, reames
Differential Revision: https://reviews.llvm.org/D131805
This reverts commit 34ae308c73.
Our internal testing found a miscompile. Not sure if it's caused by
this patch or it revealed something else. Reverting while investigating.
The value of the attribute is a size in bytes. It has the effect of
suppressing inlining of functions whose stacksizes exceed the given value.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D129904
Contextual knowledge may be used to prove invariance of some conditions.
For example, in this case:
```
; %len >= 0
guard(%iv = {start,+,1}<nuw> <s %len)
guard(%iv = {start,+,1}<nuw> <u %len)
```
the 2nd check always fails if `start` is negative and always passes otherwise.
It looks like there are more opportunities of this kind that are still to be
implemented in the future.
Differential Revision: https://reviews.llvm.org/D129753
Reviewed By: apilipenko
My most recent change for D131607 had a formatting error that I didn't
notice until after I committed it. Let me fix it now so changes to this
file will be back-to-back from me.
Another ticket split out of D107285, this extends the optimization
of 0.0 - -X to just X when using constrained intrinsics and the
optimization is allowed.
If the negation of X is done with fsub then the match fails because of
the lack of IR Matcher support for constrained intrinsics.
While I'm here, remove some TODO notices since the work is no longer
planned.
Differential Revision: https://reviews.llvm.org/D131607
From the opengroup specifications, atan2 may fail if the result
underflows and atan may fail if the argument is subnormal, but
we assume that does not happen and eliminate the calls if we
can constant fold the result at compile-time.
Differential Revision: https://reviews.llvm.org/D127964
After D121595 was commited, I noticed regressions assosicated with small trip
count numbersvectorisation by tail folding with scalable vectors. As a solution
for those issues I propose to introduce the minimal trip count threshold value.
Differential Revision: https://reviews.llvm.org/D130755
To move from TF C API to TFLite, we found that the argmax op in TFLite does not work for int64 inputs, so cast the int64 inputs to int32 inputs to make TFLite argmax op work
Differential Revision: https://reviews.llvm.org/D131462
We get a couple of improvements from recognizing swapped
operand patterns that were not handled by the replicated
code.
This should also enable simplifying larger patterns as
seen in issue #56653 and issue #56654, but that requires
enhancements to isImpliedCondition() itself.
Given a poison constant as input, the dyn_cast to a ConstantInt would
fail so we would fall through to the generic code that attempts to fold
each element of the input vectors. The inputs to these intrinsics are
not vectors though, leading to a compile time crash. Instead bail out
properly for poison values by returning nullptr. This doesn't try to
define what poison means for these intrinsics.
Fixes#56945
A const reference is preferred over a non-null const pointer.
`Type *` is kept as is to match the other overload.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D131197
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.
Differential Revision: https://reviews.llvm.org/D131114
BoundsChecking uses ObjectSizeOffsetEvaluator to keep track of the
underlying size/offset of pointers in allocations. However,
ObjectSizeOffsetVisitor (something ObjectSizeOffsetEvaluator
uses to check for constant sizes/offsets)
doesn't quite treat sizes and offsets the same way as
BoundsChecking. BoundsChecking wants to know the size of the
underlying allocation and the current pointer's offset within
it, but ObjectSizeOffsetVisitor only cares about the size
from the pointer to the end of the underlying allocation.
This only comes up when merging two size/offset pairs. Add a new mode to
ObjectSizeOffsetVisitor which cares about the underlying size/offset
rather than the size from the current pointer to the end of the
allocation.
Fixes a false positive with -fsanitize=bounds.
Reviewed By: vitalybuka, asbirlea
Differential Revision: https://reviews.llvm.org/D131001
This patch is the first of the two-patch series (D130188, D130179) that
resolve PR56275 (https://github.com/llvm/llvm-project/issues/56275)
which is a missed opportunity, where a perfrectly valid case for loop
interchange failed interchange legality.
If the distance/direction vector produced by dependence analysis (DA) is
negative, it needs to be normalized (reversed). This patch provides helper
functions `isDirectionNegative()` and `normalize()` in DA that does the
normalization, and clients can query DA to do normalization if needed.
A pass option `<normalized-results>` is added to DependenceAnalysisPrinterPass,
and we leverage it to update DA test cases to make sure of test coverage. The
test cases added in `Banerjee.ll` shows that negative vectors are normalized
with `print<da><normalized-results>`.
Reviewed By: bmahjour, Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D130188
This just shuffles implementations and declarations around. Now the
logger and the TF C API-based model evaluator are separate.
Differential Revision: https://reviews.llvm.org/D131116
Unfortunately, this overflow is extremely hard to reproduce reliably (in fact, I was unable to do so). The issue is that:
- getOperandsToCreate sometimes skips creating an SCEV for the LHS
- then, createSCEV is called for the BinaryOp
- ... which calls getNoWrapFlagsFromUB
- ... which under certain circumstances calls isSCEVExprNeverPoison
- ... which under certain circumstances requires the SCEVs of all operands
For certain deep dependency trees, this causes a stack overflow.
Reviewed By: bkramer, fhahn
Differential Revision: https://reviews.llvm.org/D129745
Mark ModRefInfo as a bitmask enum, which allows using normal
& and | operators on it. This supersedes various functions like
unionModRef() and intersectModRef(). I think this makes the code
cleaner than going through helper functions...
Differential Revision: https://reviews.llvm.org/D130870
Sometimes SCEV cannot infer nuw/nsw from something as simple as
```
len in [0, MAX_INT]
...
iv = phi(0, iv.next)
guard(iv <s len)
guard(iv <u len)
iv.next = iv + 1
```
just because flag strenthening only relies on definition and does not use local facts.
This patch adds support for the simplest case: inference of flags of `add(x, constant)`
if we can contextually prove that `x <= max_int - constant`.
In case if it has negative CT impact, we can add an option to switch it off. I woudln't
expect that though.
Differential Revision: https://reviews.llvm.org/D129643
Reviewed By: apilipenko
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
If we have interleave groups in the loop we want to vectorise then
we should fall back on normal vectorisation with a scalar epilogue. In
such cases when tail-folding is enabled we'll almost certainly go on to
create vplans with very high costs for all vector VFs and fall back on
VF=1 anyway. This is likely to be worse than if we'd just used an
unpredicated vector loop in the first place.
Once the vectoriser has proper support for analysing all the costs
for each combination of VF and vectorisation style, then we should
be able to remove this.
Added an extra test here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D128342
Now the API getExtendedAddReductionCost is used to determine the cost of extended Add reduction with optional Mul. For Arm, it could cover the cases. But for other target, for example: RISCV, they support other kinds of extended recution, such as FAdd.
This patch does the following changes:
1, Split getExtendedAddReductionCost into 2 new API: getExtendedReductionCost which handles the extended reduction with addtional input of Opcode; getMulAccReductionCost which handle the MLA cases the getExtendedAddReductionCost.
2, Refactor getReductionPatternCost, add some contraint condition to make sure the getMulAccReductionCost should only handle the reuction of Add + Mul.
Differential Revision: https://reviews.llvm.org/D130868
getModRefInfo() queries currently track whether the result is a
MustAlias on a best-effort basis. The only user of this functionality
is the optimized memory access type in MemorySSA -- which in turn
has no users. Given that this functionality has not found a user
since it was introduced five years ago (in D38862), I think we
should drop it again.
The context is that I'm working to separate FunctionModRefBehavior
to track mod/ref for different location kinds (like argmem or
inaccessiblemem) separately, and the fact that ModRefInfo also has
an unrelated Must flag makes this quite awkward, especially as this
means that NoModRef is not a zero value. If we want to retain the
functionality, I would probably split getModRefInfo() results into
a part that just contains the ModRef information, and a separate
part containing a (best-effort) AliasResult.
Differential Revision: https://reviews.llvm.org/D130713
issue #56775
I rearranged the Thumb2 codegen test to avoid simplifying the chain
of rounding instructions. I'm assuming the intent of the test is
to verify lowering of each of those intrinsics.
At the moment, proveNoSignedWrapViaInduction may be called for the
same AddRec a large number of times via getSignExtendExpr. This can have
a severe compile-time impact for very loop-heavy code.
If proveNoSignedWrapViaInduction failed to prove NSW the first time,
it is unlikely to succeed on subsequent tries and the cost doesn't seem
to be justified.
This is the signed version of 8daa338297 / D130648.
This can drastically improve compile-time in some excessive cases and
also has a slightly positive compile-time impact on CTMark:
NewPM-O3: -0.06%
NewPM-ReleaseThinLTO: -0.04%
NewPM-ReleaseLTO-g: -0.04%
https://llvm-compile-time-tracker.com/compare.php?from=8daa338297d533db4d1ae8d3770613eb25c29688&to=aed126a196e7a5a9803543d9b4d6bdb233d0009c&stat=instructions
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D130694
This patch introduces the inline cost priority into the
module inliner, which uses the same computation as
InlineCost.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D130012
This patch introduces the inline cost priority into the
module inliner, which uses the same computation as
InlineCost.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D130012
At the moment, proveNoUnsignedWrapViaInduction may be called for the
same AddRec a large number of times via getZeroExtendExpr. This can have
a severe compile-time impact for very loop-heavy code. One one
particular workload, LSR takes ~51s without this patch, almost
exlusively in proveNoUnsignedWrapViaInduction. With this patch, the time
in LSR drops to ~0.4s.
If proveNoUnsignedWrapViaInduction failed to prove NUW the first time,
it is unlikely to succeed on subsequent tries and the cost doesn't seem
to be justified.
Besides drastically improving compile-time in some excessive cases, this
also has a slightly positive compile-time impact on CTMark:
NewPM-O3: -0.07%
NewPM-ReleaseThinLTO: -0.08%
NewPM-ReleaseLTO-g: -0.06
https://llvm-compile-time-tracker.com/compare.php?from=b435da027d7774c24cdb8c88d09f6b771e07fb14&to=f2729e33e8284b502f6c35a43345272252f35d12&stat=instructions
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D130648
As test in PR56672 shows, LAA produces different results which lead to either
positive or negative vectorization decisions depending on the order of blocks
in loop. The exact reason of this is not clear to me, however this makes investigation
of related bugs extremely complex.
Current order of blocks in the loop is arbitrary. It may change, for example, if loop
info analysis is dropped and recomputed. Seems that it interferes with LAA's logic.
This patch chooses fixed traversal order of blocks in loops, making it RPOT.
Note: this is *not* a fix for bug with incorrect analysis result. It just makes
the answer more robust to make the investigation easier.
Differential Revision: https://reviews.llvm.org/D130482
Reviewed By: aeubanks, fhahn
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.
Reviewed By: bogner, davidxl
Differential Revision: https://reviews.llvm.org/D128860
I am playing with the LoopDataPrefetch pass and found out that it
bails to work with a pointer in a non-zero address space. This
patch adds the target callback to check if an address space is to
be considered for prefetching. Default implementation still only
allows address space 0, so this is NFCI.
This does not currently affect any known targets, but seems to be
generally useful for the future.
Differential Revision: https://reviews.llvm.org/D129795
We already call the more general isImpliedCondition() (which calls
isImpliedTrueByMatchingCmp() internally) from simplifyAndInst()
and simplifyOrInst().
There was a difference visible with this change on a vector test
before a925bef70c, but I can't find any gaps now.
Turning on opaque pointers has uncovered an issue with WPD where we currently pattern match away `assume(type.test)` in WPD so that a later LTT doesn't resolve the type test to undef and introduce an `assume(false)`. The pattern matching can fail in cases where we transform two `assume(type.test)`s into `assume(phi(type.test.1, type.test.2))`.
Currently we create `assume(type.test)` for all virtual calls that might be devirtualized. This is to support `-Wl,--lto-whole-program-visibility`.
To prevent this, all virtual calls that may not be in the same LTO module instead use a new `llvm.public.type.test` intrinsic in place of the `llvm.type.test`. Then when we know if `-Wl,--lto-whole-program-visibility` is passed or not, we can either replace all `llvm.public.type.test` with `llvm.type.test`, or replace all `llvm.public.type.test` with `true`. This prevents WPD from trying to pattern match away `assume(type.test)` for public virtual calls when failing the pattern matching will result in miscompiles.
Reviewed By: tejohnson
Differential Revision: https://reviews.llvm.org/D128955
It would be better for CodeMetrics to use hasOneLiveUse while analyzing
static and called once callsites, since inline cost now uses
hasOneLiveUse instead of hasOneUse to avoid overpessimization on dead
constant cases (since this patch https://reviews.llvm.org/D109294).
This change has no noticeable influence now, but it helps improve the
accuracy of cost models of passes that use CodeMetrics.
Reviewed By: fhahn, nikic
Differential Revision: https://reviews.llvm.org/D130461
We no longer need specialized knowledge of these allocator functions in
this file since we have the correct attributes available now.
As far as I can tell the changes in the attributor tests are due to
things getting more consistent on alloc-family once we remove the static
list entries.
The two test changes in NewGVN merit extra scrutiny: NewGVN appears to
be _extremely_ sensitive to the inaccessiblememonly for reasons that
are beyond me. As a result, I had-enumerated all the attributes on
allocation functions in those two tests instead of using -inferattrs.
I assumed that the two -disable-simplify-libcalls tests there no
longer are sensible since the function declaration now includes all the
relevant attributes.
Differential Revision: https://reviews.llvm.org/D130107
The code in this `#if 0` block appears to be a net benefit. Put it
behind a switch defaulting to off to support experimentation and as a
request for comment.
The codegen impact of enabling this that I'm currently persuing is that
it allows PRE to take place more frequently, particularly in loops with
second order recurrences.
Preliminary experimental data:
Across LNT on AArch64, 54 benchmarks are sped up by >1%, and 42 are
regressed by >1%, the geomean (exec_time_enabled / exec_time_disabled)
of these 96 "1% or greater significance" benchmarks is 0.991. For the
full set of 770 benchmarks it's 0.998.
There are two benchmarks which experience a >30% speedup, and the worst
slowdown is ~12%, and for every benchmark with a slowdown there is a
benckmark which is sped up by a greater factor.
Differential Revision: https://reviews.llvm.org/D130241
Handle guards uniformly with assumes, rather than iterating through all
block instructions in attempt to find them.
Differential Revision: https://reviews.llvm.org/D129874
Reviewed By: nikic
This patch adds a command line flag to be able to test
the type based cost-model analysis for Intrinsics.
Differential Revision: https://reviews.llvm.org/D129109
Adds a number of utilities that are used to help create and update
memprof related metadata. These will be used during profile matching
and annotation, as well as by the inliner when updating the metadata.
Also adds unit tests for the utilities.
See also related RFCs:
RFC: Sanitizer-based Heap Profiler [1]
RFC: A binary serialization format for MemProf [2]
RFC: IR metadata format for MemProf [3]
(Note that the IR metadata format has changed from the RFC during
implementation, as described in the preceeding patch adding the basic
metadata and verification support.)
Depends on D128141.
Differential Revision: https://reviews.llvm.org/D128854
This allows us to accept annotations from out-of-tree languages (the
example test is derived from Rust) so they can enjoy the benefits of
LLVM's optimizations without requiring LLVM to have language-specific
knowledge.
Differential Revision: https://reviews.llvm.org/D123091
Prior to this change, we relied on the hard-coded list for all of the
information performed by MemoryBuiltins. With this change, we're able to
start relying on properites of functions described in attributes, which
opens the door to out-of-tree compilers being able to describe their
allocator functions to LLVM's optimizer logic without having to register
their implementation details with LLVM.
Differential Revision: https://reviews.llvm.org/D123090
No need to add checks for every type per pointer that we couldn't create
a check for the first time around, just the types that weren't
successful.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D119376
This patch adds the AArch64 hook for preferPredicateOverEpilogue,
which currently returns true if SVE is enabled and one of the
following conditions (non-exhaustive) is met:
1. The "sve-tail-folding" option is set to "all", or
2. The "sve-tail-folding" option is set to "all+noreductions"
and the loop does not contain reductions,
3. The "sve-tail-folding" option is set to "all+norecurrences"
and the loop has no first-order recurrences.
Currently the default option is "disabled", but this will be
changed in a later patch.
I've added new tests to show the options behave as expected here:
Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll
Differential Revision: https://reviews.llvm.org/D129560
Replace the value-accepting isReallocLikeFn() overload with a
getReallocatedOperand() function, which returns which operand is
the one being reallocated. Currently, this is always the first one,
but once allockind(realloc) is respected, the reallocated operand
will be determined by the allocptr parameter attribute.
Remove isFreeCall() in favor of getFreedOperand(). Replace the
two remaining uses with a getFreedOperand() != nullptr check, as
they only care that something is getting freed. (The usage in DSE
is correct as such. The allocator-related checks in CFLGraph look
rather questionable in general.)
We currently assume in a number of places that free-like functions
free their first argument. This is true for all hardcoded free-like
functions, but with the new attribute-based design, the freed
argument is supposed to be indicated by the allocptr attribute.
To make sure we handle this correctly once allockind(free) is
respected, add a getFreedOperand() helper which returns the freed
argument, rather than just indicating whether the call frees *some*
argument.
This migrates most but not all users of isFreeCall() to the new
API. The remaining users are a bit more tricky.
Default getAllocSize() to use the trivial mapper. Also switch
from using std::function to function_ref.
Furthermore, update the doc comment to point out a subtle difference
between getAllocSize() and getObjectSize(): The latter may also
return something for calls that return their argument (via "returned"
attribute or special intrinsics like invariant groups).
Fetching allocation data for calls is a rather hot operation, and
TLI lookups are slow. We can greatly reduce the number of calls
for which TLI is queried by checking that they return a pointer
value first, as this is a requirement for allocation functions
anyway.
There is a problem in loop cache analysis that the types of SCEV variables
`Coeff` and `ElemSize` in function `isConsecutive()` may not match. The
mismatch would cause SCEV failures when `Coeff` is multiplied with `ElemSize`.
The fix in this patch is to extend the type of both `Coeff` and `ElemSize` to
whichever is wider in those two variables. As a clean-up, duplicate calculations
of `Stride` in `computeRefCost()` is then removed.
Reviewed By: Meinersbur, #loopoptwg
Differential Revision: https://reviews.llvm.org/D128877
Noticed via inspection; to my knowledge, impossible to hit today. In theory, we could have a fixed stride check be analyzed, then a scalable one. With the old code, the scalable one would be silently dropped, and the runtime guard would go ahead with only the fixed one. This would be a miscompile.
In fact, in unreached code we can say that every fact is true. So do not waste time trying to
do something smarter.
Formally it's not an NFC because it may change query results in unreached code, but they
won't have any impact on execution.
Hypothetical CT boost expected but not measured in practice.
Differential Revision: https://reviews.llvm.org/D129878
To solve the readnone problems in coroutines. See
https://discourse.llvm.org/t/address-thread-identification-problems-with-coroutine/62015
for details.
According to the discussion, we decide to fix the problem by inserting
isPresplitCoroutine() checks in different passes instead of
wrapping/unwrapping readnone attributes in CoroEarly/CoroCleanup passes.
In this direction, we might not be able to cover every case at first.
Let's take a "find and fix" strategy.
Reviewed By: nikic, nhaehnle, jyknight
Differential Revision: https://reviews.llvm.org/D127383
Following discussion in PR56243, we need to somehow detect the situation
when token values penetrate LCSSA form for transforms that require that
it is maintained by all values (for example, to sustain use-def dominance
invarians). This patch introduces a parameter to LCSSA checkers to control
their ignorance about tokens.
Differential Revision: https://reviews.llvm.org/D129983
Reviewed By: efriedma
llvm/lib/Analysis/LoopAccessAnalysis.cpp:916:12: error: no viable conversion from returned value of type 'SmallVector<[...], 2>' to function return type 'SmallVector<[...], (default)
CalculateSmallVectorDefaultInlinedElements<T>::value aka 3>'
return Scevs;
^~~~~