Update llvm::sys::fs::mapped_file_region to have a move constructor and
a move assignment operator, allowing it to be used as an Optional. Also,
update FileOutputBuffer's OnDiskBuffer to take advantage of this,
avoiding an extra allocation from the unique_ptr.
A nice follow-up would be to make the mapped_file_region constructor
private and replace its use with a factory function, such as
mapped_file_region::create(), that returns an Expected (or ErrorOr). I
don't plan on doing that immediately, but I might swing back later.
No functionality change, besides the saved allocation in OnDiskBuffer.
Differential Revision: https://reviews.llvm.org/D100159
This adds support for swapping comparison operands when it may introduce new
folding opportunities.
This is roughly the same as the code added to AArch64ISelLowering in
162435e7b5.
For an example of a testcase which exercises this, see
llvm/test/CodeGen/AArch64/swap-compare-operands.ll
(Godbolt for that testcase: https://godbolt.org/z/43WEMb)
The idea behind this is that sometimes, we may be able to fold away, say, a
shift or extend in a compare by swapping its operands.
e.g. in the case of this compare:
```
lsl x8, x0, #1
cmp x8, x1
cset w0, lt
```
The following is equivalent:
```
cmp x1, x0, lsl #1
cset w0, gt
```
Most of the code here is just a reimplementation of what already exists in
AArch64ISelLowering.
(See `getCmpOperandFoldingProfit` and `getAArch64Cmp` for the equivalent code.)
Note that most of the AND code in the testcase doesn't actually fold. It seems
like we're missing selection support for that sort of fold right now, since SDAG
happily folds these away (e.g testSwapCmpWithShiftedZeroExtend8_32 in the
original .ll testcase)
Differential Revision: https://reviews.llvm.org/D89422
Remove the MachineDCE pass after the first SIFoldOperands pass now
that SIFoldOperands deletes its own dead instructions.
Differential Revision: https://reviews.llvm.org/D100189
When inserting a new def and renaming of uses is asked, always compute
IDF and do the renaming for the blocks with Phis in that IDF.
Resolves PR49859.
Differential Revision: https://reviews.llvm.org/D100163
Regiser types for xxsplti32dx for two td file patterns was incorrect.
Fixed the two types and added a test case that was reduced from a larger
failing test.
Reviewed By: nemanjai, #powerpc
Differential Revision: https://reviews.llvm.org/D100223
When lowering a BUILD_VECTOR SDNode, we choose among various possible vector
creation instructions in an attempt to minimize the total number of instructions
used. We previously considered using swizzles, consts, and splats, and this
patch adds shuffles as well. A common pattern that now lowers to shuffles is
when two 64-bit vectors are concatenated. Previously, concatenations generally
lowered to sequences of extract_lane and replace_lane instructions when they
could have been a single shuffle.
Differential Revision: https://reviews.llvm.org/D100018
I recently forgot a comma in a defm argument list and tablegen just
failed with exit code 1 without printing an error message. I believe
this issue was introduced in a9fc44c557.
This change prints the following instead:
.../clang/include/clang/Driver/Options.td:569:3: error: Expected comma before next argument
Reviewed By: Paul-C-Anagnostopoulos
Differential Revision: https://reviews.llvm.org/D100178
There are four new PowerPC instructions that are introduced in
Power 10. They are hashst, hashchk, hashstp, hashchkp.
These instructions will be used for ROP Protection.
This patch adds the four instructions.
Reviewed By: nemanjai, amyk, #powerpc
Differential Revision: https://reviews.llvm.org/D99375
This patch updates the linkage name in the DISubprogram of coro-split
functions, which is particularly important for Swift, where the
funclets have a special name mangling. This patch does not affect C++
coroutines, since the DW_AT_specification is expected to hold the
(original) linkage name. I believe this is mostly due to limitations
in AsmPrinter, so we might be able to relax this restriction in the
future.
Differential Revision: https://reviews.llvm.org/D99693
I've initially just enabled this for BMI which has the ANDN instruction for i32/i64 - the i16/i8 cases give an idea of what'd we get when we enable it in all cases (I'll do this as a later commit).
Additionally, the i16/i8 cases could be freely promoted to i32 (as the args are already zeroext) and we could then make use of ANDN + the free cmp0 there as well - this has come up in PR48768 and PR49028 so I'm going to look at this soon.
https://alive2.llvm.org/ce/z/QVWHP_https://alive2.llvm.org/ce/z/pLngT-
Vector cases do not appear to benefit from this as we end up with having to generate the zero vector as well - this is one of the reasons I didn't try to tie this into hasAndNot/hasAndNotCompare.
Differential Revision: https://reviews.llvm.org/D100177
As suggested in the review thread for 5094e12 and seen in the
motivating example from https://llvm.org/PR49885, it's not
clear if we have a way to create the optimal code without
this heuristic.
In LazyValueInfoImpl::isNonNullAtEndOfBlock we populate a set of
pointers, known to be non-null at the end of a block (e.g. because we
did a load through them). We then infer that any pointer, based on an
element of this set is non-null as well ("based" here meaning a
non-null pointer is the underlying object). This is incorrect, even if
the base pointer was non-null, the value of a GEP, that lacks the
inbounds` attribute, may be null.
This issue appeared as miscompilation of the following test case:
int puts(const char *);
typedef struct iter {
int *val;
} iter_t;
static long distance(iter_t first, iter_t last) {
long r = 0;
for (; first.val != last.val; first.val++)
++r;
return r;
}
int main() {
int arr[2] = {0};
iter_t i, j;
i.val = arr;
j.val = arr + 1;
if (distance(i, j) >= 2)
puts("failed");
else
puts("passed");
}
This fixes PR49662.
Differential Revision: https://reviews.llvm.org/D99642
This is cheap to implement, means less work for future passes like
MachineDCE, and slightly improves the folding in some cases.
Differential Revision: https://reviews.llvm.org/D100117
Use SIInstrFlags to differentiate between the different
variants of flat instructions (flat, global and scratch).
This should make it easier to bundle the immediate offset logic in a
single place and implement restrictions and bug workarounds.
Fixed version of D99587, which does not rely on the address space.
Differential Revision: https://reviews.llvm.org/D99743
Add an ability to store `Offset` between partially aliased location. Use this
storage within returned `ResultAlias` instead of caching it in `AAQueryInfo`.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D98718
Main reason is preparation to transform AliasResult to class that contains
offset for PartialAlias case.
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D98027
These cases were failing before, but with cryptic asserts.
Add asserts in the RegScavenger that fail earlier with better
messages. NFC
Differential Revision: https://reviews.llvm.org/D100109
New SDTypeProfile can be reused for other word operation patterns without explicit i64 type in the future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100097
Previously loading the vtable used in calling a virtual method in a loop
was not hoisted out of the loop. This fixes that.
canSinkOrHoistInst() itself doesn't check that the load operands are
loop invariant, callers also check that separately.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D99784
meetBDVState looks pretty difficult to read and follow.
This is purely NFC but doing several things:
1) Combine meet and meetBDVState
2) Move the function to be a member of BDVState
3) Make BDVState be a mutable object
4) Convert switch to sequence of ifs
5) Adds comments.
Reviewers: reames, dantrushin
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D99064
Add explicit type i64 to RV64 only patterns to stop emitting unneeded i32 patterns.
It can reduce the isel table size.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100089
This reverts commit e35afbe535, reapplying
022ccedde8 and
e7ed5c920d.
- The first attempt missed defining `SignpostEmitterImpl`.
- The second attempt missed defining `llvm::SignpostEmitterImpl`.
Not sure how I failed to test both versions locally before; I thought
I'd turned the feature off via rerunning `cmake` but it must have been
stuck in place. This time I confirmed via `clang -E` that I was testing
both build configurations.
Original commit message:
Replace some manual memory management with std::unique_ptr.
Differential Revision: https://reviews.llvm.org/D100151
This reverts commit 078072285d, reapplying
022ccedde8.
I figured out why this was failing in other environments: it's not a
problem with std::unique_ptr, but that SignpostEmitterImpl only has a
forward declaration. Adding an empty definition should do the trick.
Original commit message:
Replace some manual memory management with std::unique_ptr.
Differential Revision: https://reviews.llvm.org/D100151
The destructor for SignPostEmitterImpl::SignpostLog is known statically. Avoid
the unnecessary vtable indirection through std::function in the std::unique_ptr
by turning LogDeleter into a struct. No real functionality change here.
Differential Revision: https://reviews.llvm.org/D100154
This is a DenseMap, which has its own initializer; we don't need to explicitly
call the default constructor here.
Differential Revision: https://reviews.llvm.org/D100152
Add a variant of `fs::resize_file` for use immediately before opening a
file with `mapped_file_region::readwrite`. On Windows, `_chsize`
(`ftruncate`) is slow, but `CreateFileMapping` (`mmap`) automatically
extends the file so the call to `fs::resize_file` can be skipped.
This optimization was added to `FileOutputBuffer` in
da9bc2e56d5a5c6332a9def1a0065eb399182b93; this commit just extracts the
logic out and adds a unit test.
Differential Revision: https://reviews.llvm.org/D95490
Instead of instantiating multiclasses inside multiclasses, just
inherit from them.
We can do the same for the VPseudo* multiclasses, but that may
interfere with the scheduler class work.
Pretty straightforward use of existing infrastructure and port of the attributor inference rules for nosync.
A couple points of interest:
* I deliberately switched from "monotonic or better" to "unordered or better". This is simply me being conservative and is better in line with the rest of the optimizer. We treat monotonic conservatively pretty much everywhere.
* The operand bundle test change is suspicious. It looks like we might have missed something here, but if so, it's an issue with the existing nofree inference as well. I'm going to take a closer look at that separately.
* I needed to keep the previous inference from readnone. This surprised me, but made sense once I realized readonly inference goes to lengths to reason about local vs non-local memory and that writes to local memory are okay. This is fine for the purpose of nosync, but would e.g. prevent us from inferring nofree from readnone - which is slightly surprising.
Differential Revision: https://reviews.llvm.org/D99769
This fixes a "Cached first special instruction is wrong!" assert.
The assert fires because replacing a value with another can cause an
instruction to no longer be "special" to ICF. In this case,
devirtualization happened, turning an indirect call to a
call to a willreturn function which is no longer special.
Reviewed By: nikic, rnk
Differential Revision: https://reviews.llvm.org/D99977
It used to work correctly even with a KILL, but there is
no reason to consider meta instructions since they do not
create real HW uses.
Differential Revision: https://reviews.llvm.org/D100135
After D99249 we use three different loop pass managers for LICM,
LoopRotate and LICM+LoopUnswitch. This happens because LazyBFI
and LazyBPI are not preserved by LoopRotate (note that D74640
is no longer needed). Avoid this by marking them as preserved.
My understanding of D86156 is that it is okay to simply preserve
them (which LoopUnswitch already does for the same reason) and
rely on callbacks to deal with deleted blocks.
Differential Revision: https://reviews.llvm.org/D99843
wasm64 was missing DAG ISEL patterns for external symbol based global.get, but simply adding these analogous to the existing 32-bit versions doesn't work.
This is because we are conflating the 32-bit global index with the pointer represented by the external symbol, which for wasm32 happened to work.
The simplest fix is to pretend we have a 64-bit global index. This sounds incorrect, but is immaterial since once this index is stored as a MachineOperand it becomes 64-bit anyway (and has been all along). As such, the EmitInstrWithCustomInserter based implementation I experimented with become a no-op and no further changes in the C++ code are required.
Differential Revision: https://reviews.llvm.org/D99904
After loop interchange, the (old) outer loop header should not jump to
the `LoopExit`. Note that the old outer loop becomes the new inner loop
after interchange. If we branched to `LoopExit` then after interchange
we would jump directly from the (new) inner loop header to `LoopExit`
without executing the rest of outer loop.
This patch modifies adjustLoopBranches() such that the old outer
loop header (which becomes the new inner loop header) jumps to the
old inner loop latch which becomes the new outer loop latch after
interchange.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D98475
Add InstAlias that allows the last operand to be an imm for following instructions:
1. Zbb or Zbp:
- ror
- rorw (RV64 Only)
2. Zbs
- best
- bclr
- binv
- bext
Reviewed By: craig.topper, jrtc27
Differential Revision: https://reviews.llvm.org/D100083
clang++ uses llvm.compiler.used in certain cases to preserve
symbol which is fully inlined. D96087 has resulted in undefined
symbols in such cases. Set it to false by default to preserve
old behavior but keep the option for specific uses where we
want to ignore these (e.g. to detect a potential indirect call
to a function).
Differential Revision: https://reviews.llvm.org/D99897
During SelectionDAG, we must track the SDNodes that each SDDbgValue depends on
to compute its value. These are ultimately derived from the location operands to
the SDDbgValue, but were stored in a separate vector prior to this patch. This
resulted in cases where one of the lists was updated incorrectly, resulting in
crashes during compilation. This patch fixes the issue by directly recomputing
the dependency list from the SDDbgOperands in getDependencies().
Differential Revision: https://reviews.llvm.org/D99423
Look through copies to find more cases where the two values being
selected are identical. The motivation for this is just to be able to
remove the weird special case where tryFoldCndMask was called from
foldInstOperand, part way through folding a move-immediate into its
users, without regressing any lit tests.
Instead of passing the start value and the defined value to
widenPHIInstruction, pass the VPWidenPHIRecipe directly, which can be
used to get both (and more in future patches).
This allows mapping larger files, delaying OOM failures until too many
pages of them are accessed. This is makes the behavior of the
mapped_file_region in this regard consistent between its "Unix" and
"Windows" implementations.
Guard the code witih #if defined(MAP_NORESERVE), consistent with other
uses of MAP_NORESERVE in llvm-project, because some FreeBSD versions do
not provide this flag.
Reviewed By: clayborg
Differential Revision: https://reviews.llvm.org/D96626
ScratchExecCopy needs to be marked as live, we cannot use that register
while EXEC is stored in there.
Marking SGPRForFPSaveRestoreCopy and SGPRForBPSaveRestoreCopy as
available is unnecessary, they should not be live at that point anway.
Differential Revision: https://reviews.llvm.org/D100098
When attempting to truncate a FP vector and store the result out
to memory we crashed because we had no pattern for truncating FP
stores. In fact, we don't support these types of stores and the
correct fix is to stop marking these truncating stores as legal.
Tests have been added here:
CodeGen/AArch64/sve-fptrunc-store.ll
Differential Revision: https://reviews.llvm.org/D100025
During LoopStrengthReduce, some of the SSA values that are used by debug values
may be lost and/or salvaged. After LSR we attempt to recover any undef debug
values, including any that were salvaged but then lost their values afterwards,
by replacing the lost values with any live equal values (plus a possible
constant offset) that have been gathered prior to running LSR. When we do this
we restore the debug value's original DIExpression, to undo any salvaging (as we
have gone back to using the original debug value).
This process can currently produce invalid debug info if the number of operands
has changed by salvaging during LSR. Replacing old values during the
applyEqualValues step does not change the number of location operands, which
means that when we restore the old DIExpression we may have a mismatch between
the number of operands used by the debug value and the number of operands
referenced by the DIExpression. This patch fixes this by restoring the full
original location metadata at the start of the applyEqualValues step, so that
there is no mismatch in operand count between the debug value and its
DIExpression.
Differential Revision: https://reviews.llvm.org/D98644
Without the fix we get
../lib/Target/NVPTX/NVPTXLowerArgs.cpp:236:24: error: lambda capture 'Arg' is not used [-Werror,-Wunused-lambda-capture]
auto IsALoadChain = [Arg](Value *Start) {
^~~
1 error generated.
D99674 stopped the folding of certain select operations into and/or, due
to incorrect folding in the presence of poison. D97360 added some costs
to attempt to account for the change, but only worked at the getUserCost
level, not the getCmpSelInstrCost that the vectorizer will use directly.
This adds similar logic into the vectorizer to handle these logical
and/or selects, treating them like and/or directly.
This fixes 60% performance regressions from code like the attached test
case.
Differential Revision: https://reviews.llvm.org/D99884
This patch adds RVV codegen support for OR/XOR/AND reductions for both
scalable- and fixed-length vector types. There are a few possible
codegen strategies for each -- vmfirst.m, vmsbf.m, and vmsif.m could be
used to some extent -- but the vpopc.m instruction was chosen since it
produces the scalar result in one instruction, after which scalar
instructions can finish off the computation.
The reductions are lowered identically for both scalable- and
fixed-length vectors, although some alternate strategies may be more
optimal on fixed-length vectors since it's cheaper to get the length of
those types.
Other reduction types were not deemed to be relevant for mask vectors.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D100030
The GNU AS manual states the following about single-character constants enclosed within single quotes:
> Some backslash escapes apply to characters, \b, \f, \n, \r, \t, and \" with the same meaning as for strings, plus \' for a single quote.
Add two more characters to the switch handling this case to match GAS behaviour, plus a test to make sure nothing regresses.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D99609
Combine all collected stats into separate struct RAGreedyStats
with add and report methods.
The motivation is to extend the number of statistics to capture and instead of
adding new parameters, just combine all of them into one structure.
Additionally I plan to use report from different places in future to report data
for function as well.
Reviewers: reames, MatzeB, anemet, thegameg
Reviewed By: thegameg
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D100012
To save compile time, avoid computation of stats if ORE will not emit it.
The motivation is to add more stats and compute it only if it will dumped.
Reviewers: reames, MatzeB, anemet, thegameg
Reviewed By: reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D100010
Summary: Set the default DwarfInlinedStrings as inlined strings for DBX, due to DBX does not support .dwstr section for now.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D99933
If the stack size is larger than 12 bits, we have to use a scratch
register to store the stack size. Before we introduce the scalable stack
offset, we could simplify
%0 = ADDI %stack.0, 0
=>
%scratch = ... # sequence of instructions to move the offset into
%%scratch
%0 = ADD %fp, %scratch
However, if the offset contains scalable part, we need to consider it.
%0 = ADDI %stack.0, 0
=>
%scratch = ... # sequence of instructions to move the offset into
%%scratch
%scratch = ADD %fp, %scratch
%scalable_offset = ... # sequence of instructions for vscaled-offset.
%0 = ADD/SUB %scratch, %scalable_offset
Differential Revision: https://reviews.llvm.org/D100035
This is a (late) follow-up patch of 8871a4b4ca and
c95f39891a to make ConstantStruct::get/ConstantArray::getImpl
correctly return PoisonValue if all elements are poison.
This was found while discussing about the elements of a vector-typed UndefValue (D99853)
Pseudo probes, when scattered in a block, can be chained dependencies of other regular DAG nodes and block DAG combine optimizations. To fix this, scattered probes in a block are grouped and placed at the beginning of the block. This shouldn't affect the profile quality.
Test Plan:
Reviewed By: wenlei, wmi
Differential Revision: https://reviews.llvm.org/D100002
New custom DAG nodes were added to represent operations on CSR. These
nodes are lowered to corresponding pseudo instruction. Using the pseudo
instructions allows to specify different scheduling information for
operations on different system registers. It also make possible to
specify dependencies of instructions on specific system registers.
Differential Revision: https://reviews.llvm.org/D98936
After loop interchange, the (old) outer loop header should not jump to
`LoopExit`. Note that the old outer loop becomes the new inner loop
after interchange. If we branched to `LoopExit` then after interchange
we would jump directly from the (new) inner loop header to `LoopExit`
without executing the rest of (new) outer loop.
This patch modifies adjustLoopBranches() such that the old outer
loop header (which becomes the new inner loop header) jumps to the
old inner loop latch which becomes the new outer loop latch after
interchange.
Reviewed By: bmahjour
Differential Revision: https://reviews.llvm.org/D98475
Allow pass to work separately with SGPR, VGPR registers or both.
This is NFC now but will be needed to split RA for separate
SGPR and VGPR passes.
Differential Revision: https://reviews.llvm.org/D100063
If the constants have a difference of 1 we can convert one to
the other by adding or subtracting the condition.
We have a DAG combine for this, but it only runs before type
legalization. If the select is introduced later during type
legalization or op legalization we will miss it.
We don't need a specific condition, but some conditions are
harder to materialize than others on RISCV. I know that SETLT
will be a single instruction and it is what is used by the
motivating pattern from signed saturating add/sub.
Differential Revision: https://reviews.llvm.org/D99021
When using the large code model with FastISel (for example via
clang -O0 which adds the optnone attribute), FP constants could
still be materialized using adrp + ldr. Unconditionally enable
the existing path for MachO to materialize the constant in code.
For testing, restore literal_pools_float.ll to exercise the constant
pool and add two optnone-functions that return a float and a double,
respectively. Consolidate fpimm.ll and add a new fast-isel-fpimm.ll
to check the code paths taken with FastISel.
Differential Revision: https://reviews.llvm.org/D99607
Since we have created a new OF_TextWithCRLF flag, we no longer need to worry about OF_Text flag turning on CRLF translation. I can remove this workaround I added to globally open all ToolOutputFiles as binary on Windows.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D100034
This can't use our normal strategy of splatting the scalar and using
a .vv operation instead of .vx.
Instead this patch bitcasts the vector to the equivalent SEW=32
vector and inserts the scalar parts using two vslide1up/down. We
do that unmasked and apply the mask separately at the end with
a vmerge.
For vslide1up there maybe some other options here like getting
i64 into element 0 and using vslideup.vi with this vector as
vd and the original source as vs1. Masking would still need to
be done afterwards.
That idea doesn't work for vslide1down. We need to slidedown and
then insert a single scalar at vl-1 which we could do with a
vslideup, but that assumes vl > 0 which I don't think we can assume.
The i32 double slide1down implemented here is the best I could come
up with and I just made vslide1up consistent.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D99910
This allows FoldConstantArithmetic to handle SPLAT_VECTOR in
addition to BUILD_VECTOR. This allows it to support scalable
vectors. I'm also allowing fixed length SPLAT_VECTOR which is
used by some targets, but I'm not familiar enough to write tests
for those targets.
I had to block this function from running on CONCAT_VECTORS to
avoid calling getNode for a CONCAT_VECTORS of 2 scalars.
This can happen because the 2 operand getNode calls this
function for any opcode. Previously we were protected because
CONCAT_VECTORs of BUILD_VECTOR is folded to a larger BUILD_VECTOR
before that call. But it's not always possible to fold a CONCAT_VECTORS
of SPLAT_VECTORs, and we don't even try.
This fixes PR49781 where DAG combine thought constant folding
should be possible, but FoldConstantArithmetic couldn't do it.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D99682
-Make sure of the CreateShl/LShr/AShr methods that take a uint64_t
instead of creating a ConstantInt for 1 ourselves.
-Use Builder.getInt1 or ConstantInt::getBool instead of a conditional.
-Pull out repeated calls to getType.
All of the code that handles general constant here (other than the more
restrictive APInt-dealing code) expects that it is an immediate,
because otherwise we won't actually fold the constants, and increase
instruction count. And it isn't obvious why we'd be okay with
increasing the number of constant expressions,
those still will have to be run..
But after 2829094a8e
this could also cause endless combine loops.
So actually properly restrict this code to immediates.
This fixes the examples from
D99674 and
https://llvm.org/PR49878
The matchers succeed on partial undef/poison vector constants,
but the transform creates a full 'not' (-1) constant, so it
would undo a demanded vector elements change triggered by the
extractelement.
Differential Revision: https://reviews.llvm.org/D100044
We see a regression related to low probe factor(0.01) which prevents some callsites being promoted in ICPPass and later cause the missing inline in CGSCC inliner. The root cause is due to redundant(the second) multiplication of the probe factor and this change try to fix it.
`Sum` does multiply a factor right after findCallSamples but later when using as the parameter in setProbeDistributionFactor, it multiplies one again.
This change could get ~2% perf back on mcf benchmark. In mcf, previously the corresponding factor is 1 and it's the recent feature introducing the <1 factor then trigger this bug.
Reviewed By: hoy, wenlei
Differential Revision: https://reviews.llvm.org/D99787
Use report_fatal_error here since this is an internal error, and not
something the user can/should be trying to fix.
Also distinguish between the symbol being missing and the symbol having
the wrong type.
We have a failure internally where the symbol is missing. Currently
trying to reduce the test case to something we can attach to an llvm
bug.
Differential Revision: https://reviews.llvm.org/D99960
The struct is used for both, callee and caller-save registers now.
The frame index is not set for entrypoints, as we do not need to save
the registers then.
Update the struct name to reflect that.
Differential Revision: https://reviews.llvm.org/D99722
Extend D94856 to handle 'and', 'or' and 'xor' instructions as well
We still fail on many i8/i16 cases as the test and the logic-op are performed on different widths
No need to lookup through and/or try to vectorize operands of the
CmpInst instructions during attempts to find/vectorize min/max
reductions. Compiler implements postanalysis of the CmpInsts so we can
skip extra attempts in tryToVectorizeHorReductionOrInstOperands and save
compile time.
Differential Revision: https://reviews.llvm.org/D99950
The swap of the operands can affect later transforms that
are expecting a constant as operand 1. I don't think we
can trigger a bug with the current code, but I hit that
problem while drafting a new transform for min/max intrinsics.
I do not see any bit-width restriction from the point of the
LLVM Lang Ref - Operand Bundles on the types of the deopt bundle
operands. Statepoint Lowering seems to be able to work with any
types.
This patch relaxes the two related assertions and adds a new test
for this change.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D100006
This reverts commit a547b4e26b,
relanding commit 31d219d299,
which was reverted because there was a conflicting inverse transform,
which was causing an endless combine loop, which has now been adjusted.
Original commit message:
https://alive2.llvm.org/ce/z/67w-wQ
We prefer `add`s over `sub`, and this particular xform
allows further folds to happen:
Fixes https://bugs.llvm.org/show_bug.cgi?id=49858
I.e., if any/all of the consants is an expression, don't do it.
Since those constants won't reduce into an immediate,
but would be left as an constant expression, they could cause
endless combine loops after 31d219d299
added an inverse transformation.
A value from reachable block may come to a Phi node as its input from
unreachable block. This may confuse matchSimpleRecurrence which
has no access to DomTree and can falsely recognize something as a recurrency
because of this effect, as the attached test shows.
Patch `ae7b1e` deals with half of this problem, but it only accounts from
the case when an unreachable instruction comes to Phi as an input.
This patch provides a generalization by checking that no Phi block's
predecessor is unreachable (no matter what the input is).
Differential Revision: https://reviews.llvm.org/D99929
Reviewed By: reames
Consider the .debug_pubnames and .debug_pubtypes their own kind of
accelerator and stop emitting them together with the Apple-style
accelerator tables. The only reason we were still emitting both was for
(byte-for-byte) compatibility with dsymutil-classic.
- This patch adds a new accelerator table kind "Pub" which can be
specified with --accelerator=Pub.
- This patch removes the ability to emit both pubnames/types and apple
style accelerator tables. I don't think anyone is relying on that but
it's worth pointing out.
- This patch removes the --minimize option and makes this behavior the
default. Specifying the flag will result in a warning but won't abort
the program.
Differential revision: https://reviews.llvm.org/D99907
Looking at the Doxygen-generated documentation for the llvm namespace
currently shows all sorts of random comments from different parts of the
codebase. These are mostly caused by:
- File doc comments that aren't marked with \file, so they're attached to
the next declaration, which is usually "namespace llvm {".
- Class doc comments placed before the namespace rather than before the
class.
- Code comments before the namespace that (in my opinion) shouldn't be
extracted by doxygen at all.
This commit fixes these comments. The generated doxygen documentation now
has proper docs for several classes and files, and the docs for the llvm
and llvm::detail namespaces are now empty.
Reviewed By: thakis, mizvekov
Differential Revision: https://reviews.llvm.org/D96736
We encountered a hang in our internal code base. I'm having trouble
creating a test case because the test that hit it was testing some
code that is not upstream.
Summary:
The function SplitCriticalEdge (called by SplitEdge) can return a nullptr in
cases where the edge is a critical. SplitEdge uses SplitCriticalEdge assuming it
can always split all critical edges, which is an incorrect assumption.
The three cases where the function SplitCriticalEdge will return a nullptr is:
1. DestBB is an exception block
2. Options.IgnoreUnreachableDests is set to true and
isa(DestBB->getFirstNonPHIOrDbgOrLifetime()) is not equal to a nullptr
3. LoopSimplify form must be preserved (Options.PreserveLoopSimplify is true)
and it cannot be maintained for a loop due to indirect branches
For each of these situations they are handled in the following way:
1. Modified the function ehAwareSplitEdge originally from
llvm/lib/Transforms/Coroutines/CoroFrame.cpp to handle the cases when the DestBB
is an exception block. This function is called directly in SplitEdge.
SplitEdge does not call SplitCriticalEdge in this case
2. Options.IgnoreUnreachableDests is set to false by default, so this situation
does not apply.
3. Return a nullptr in this situation since the SplitCriticalEdge also returned
nullptr. Nothing we can do in this case.
Reviewed By: asbirlea
Differential Revision:https://reviews.llvm.org/D94619
Follow up to a6d2a8d6f5. These were found by simply grepping for "::assume", and are the subset of that result which looked cleaner to me using the isa/dyn_cast patterns.
Follow up to a6d2a8d6f5. This covers all the public interfaces of the bundle related code. I tried to cleanup the internals where the changes were obvious, but there's definitely more room for improvement.
Fixes the ASan RISC-V memory mapping (originally introduced by D87580 and
D87581). This should be an improvement both in terms of first principles
soundness and observed test failures --- test failures would occur
non-deterministically depending on the ASLR random offset.
On RISC-V Linux (64-bit), `TASK_UNMAPPED_BASE` is currently defined as
`PAGE_ALIGN(TASK_SIZE / 3)`. The non-power-of-two divisor makes the result
be the not very round number 0x1555556000. That address had to be further
rounded to ensure page alignment after the shadow scale shifting is applied.
Still, that value explains why the mapping table may look less regular than
expected.
Further cleanups:
- Moved the mapping table comment, to ensure that the two Linux/AArch64
tables stayed together;
- Removed mention of Sv48. Neither the original mapping nor this one are
compatible with an actual Linux Sv48 address space (mainline Linux still
operates Sv48 in Sv39 mode). A future patch can improve this;
- Removed the additional comments, for consistency.
Differential Revision: https://reviews.llvm.org/D97646
Previously, 34-bit constants were materialized in selectI64Imm(), and we relied
on td pattern matching to instead produce a pli. This becomes problematic as
there is no guarantee that the 34-bit constant will reach the td pattern
selection for pli. It is also possible for other transformations (such as complex
bit permutations) to also produce and utilize the 34-bit constant materialized
through selectI64Imm().
This patch instead produces pli on Power10 directly whenever the constant fits
within 34-bits.
Differential Revision: https://reviews.llvm.org/D99906
Add the subclass, update a few places which check for the intrinsic to use idiomatic dyn_cast, and update the public interface of AssumptionCache to use the new class. A follow up change will do the same for the newer assumption query/bundle mechanisms.
performScalarPREInsertion() inserts instructions into blocks that we
need to tell ImplicitControlFlowTracking about, otherwise the ICF cache
may be invalid.
Fixes PR49193.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D99909
The current code does not properly handle vector indices unless they are
the first index.
At the moment LangRef gives the impression that the vector index must be
the one and only index (https://llvm.org/docs/LangRef.html#getelementptr-instruction).
But vector indices can appear at any position and according to the
verifier there may be multiple vector indices. If that's the case, the
number of elements must match.
This patch updates SimplifyGEPInst to properly handle those additional
cases.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D99961
Many of the operands are handled the same or in the same order
for all these intrinsics. Factor out the code for selecting and
pushing them into the Operands vector.
Differential Revision: https://reviews.llvm.org/D99923
This allows frontend and backend diagnostic files to all go into the
same place. Have it control the Windows (mini-)dump location.
Differential Revision: https://reviews.llvm.org/D99199
This patch adds support for TLS variables to the XCOFF object writer:
- Add TData and TBSS sections
- Add CsectGroups for the mapping classes XCOFF::XMC_TL and XCOFF::XMC_UL
- Add XMC_UL in the enum entry of CsectStorageMapping class to print the string
while reading the symbol properties for TLS variables
- Fix the starting address of TData and TBSS sections
Reviewed by: hubert.reinterpretcast, DiggerLin
Differential Revision: https://reviews.llvm.org/D98946
The key change (4f5e92c) to switch gc.result and gc.relocate to being readnone landed nearly two weeks ago, and we haven't seen any fallout. Time to remove the code added to make reverting easy.
This fixes an oversight in D99747 which moved the IMG init code from
SIAddIMGInit to AdjustInstrPostInstrSelection, but did not set the
hasPostISelHook flag on gather4 instructions.
Differential Revision: https://reviews.llvm.org/D99953
Previously we could only vectorize FP reductions if fast math was enabled, as this allows us to
reorder FP operations. However, it may still be beneficial to vectorize the loop by moving
the reduction inside the vectorized loop and making sure that the scalar reduction value
be an input to the horizontal reduction, e.g:
%phi = phi float [ 0.0, %entry ], [ %reduction, %vector_body ]
%load = load <8 x float>
%reduction = call float @llvm.vector.reduce.fadd.v8f32(float %phi, <8 x float> %load)
This patch adds a new flag (IsOrdered) to RecurrenceDescriptor and makes use of the changes added
by D75069 as much as possible, which already teaches the vectorizer about in-loop reductions.
For now in-order reduction support is off by default and controlled with the `-enable-strict-reductions` flag.
Reviewed By: david-arm
Differential Revision: https://reviews.llvm.org/D98435
Problem:
On SystemZ we need to open text files in text mode. On Windows, files opened in text mode adds a CRLF '\r\n' which may not be desirable.
Solution:
This patch adds two new flags
- OF_CRLF which indicates that CRLF translation is used.
- OF_TextWithCRLF = OF_Text | OF_CRLF indicates that the file is text and uses CRLF translation.
Developers should now use either the OF_Text or OF_TextWithCRLF for text files and OF_None for binary files. If the developer doesn't want carriage returns on Windows, they should use OF_Text, if they do want carriage returns on Windows, they should use OF_TextWithCRLF.
So this is the behaviour per platform with my patch:
z/OS:
OF_None: open in binary mode
OF_Text : open in text mode
OF_TextWithCRLF: open in text mode
Windows:
OF_None: open file with no carriage return
OF_Text: open file with no carriage return
OF_TextWithCRLF: open file with carriage return
The Major change is in llvm/lib/Support/Windows/Path.inc to only set text mode if the OF_CRLF is set.
```
if (Flags & OF_CRLF)
CrtOpenFlags |= _O_TEXT;
```
These following files are the ones that still use OF_Text which I left unchanged. I modified all these except raw_ostream.cpp in recent patches so I know these were previously in Binary mode on Windows.
./llvm/lib/Support/raw_ostream.cpp
./llvm/lib/TableGen/Main.cpp
./llvm/tools/dsymutil/DwarfLinkerForBinary.cpp
./llvm/unittests/Support/Path.cpp
./clang/lib/StaticAnalyzer/Core/HTMLDiagnostics.cpp
./clang/lib/Frontend/CompilerInstance.cpp
./clang/lib/Driver/Driver.cpp
./clang/lib/Driver/ToolChains/Clang.cpp
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D99426
Changes getRecurrenceIdentity to always return a neutral value of -0.0 for FAdd.
Reviewed By: dmgreen, spatel
Differential Revision: https://reviews.llvm.org/D98963
For VPWidenPHIRecipes that model all incoming values as VPValue
operands, print those operands instead of printing the original PHI.
D99294 updates recipes of reduction PHIs to use the VPValue for the
incoming value from the loop backedge, making use of this new printing.
After rG47321c311bdbe0145b9bf45d822185c37b19fa50 we promote vXi8 reductions to vXi16 to create a much faster PMULLW mul reduction, followed by a (free) truncation. This avoids the high cost of repeated vXi8 multiplications (which extend+multiply+truncate to/from vXi16 types....).
Fixes the missing vXi8 mul reduction vectorization in PR42674 (Comment #20) 'mul16' test case.
This patch enhances hasAddressTaken() to ignore bitcasts as a
callee in callbase instruction. Such bitcast usage doesn't really take
the address in a useful meaningful way.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D98884
It is generally beneficial to prefer "movi d0, #0" over "fmov s0, wzr" as this
is most efficient across all cores; it is recognised as a zeroing idiom. For
newer cores, fmov instructions can also be eliminated early and there is no
difference with movi, but some implementations lack this so is not true for
other/older cores. Thus this standardises on using movi as this should always
gives the same or better performance than the fmov with wzr.
Differential Revision: https://reviews.llvm.org/D99586
This was using the .2d variant which zeros 128 bits, but using the .2s variant
that zeros 64 bits is faster on some cores.
This is a prep step for D99586 to always using movi for zeroing floats.
Differential Revision: https://reviews.llvm.org/D99710
The reason for the NewPM redesign is described in the commit
cba3e783389a: [NewPM] Disable PreservedCFGChecker ...
The checker introduces an internal custom CFG analysis that tracks
current up-to date CFG snapshot. The analysis is invalidated along
any other CFG related analysis (the key is CFGAnalyses). If the CFG
analysis is not invalidated at a functional pass exit then the checker
asserts that the CFG snapshot taken from this analysis is equals to
a snapshot of the current CFG.
Along the way:
- the function CFG::printDiff() is simplified by removing function
name calculation. The name is printed by the caller;
- fixed CFG invalidated condition (see CFG::invalidate());
- StandardInstrumentations::registerCallbacks() gets additional
optional parameter of type FunctionAnalysisManager*, which is
needed by the checker to get the custom CFG analysis;
- several PM related tests updated to explicitly set
-verify-cfg-preserved=1 as they need.
This patch is safe to land as the CFGChecker is left switched off
(the options -verify-cfg-preserved is false by default). It will be
switched on by a separate patch to minimize possible reverts.
Reviewed By: skatkov, kuhar
Differential Revision: https://reviews.llvm.org/D91327
I missed a few intrinsics in 3dd4aa7d09
when I did this for masked loads and masked segment loads/stores.
Found while trying to share more code between these custom isel
functions.
When we are able to SROA an alloca, we know all uses of it, meaning we
don't have to preserve the invariant group intrinsics and metadata.
It's possible that we could lose information regarding redundant
loads/stores, but that's unlikely to have any real impact since right
now the only user is Clang and vtables.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D99760
This is the sibling fix to c590a9880d -
as there, we can't subsitute a vector value the equality
compare replacement that we are trying requires that the
comparison is true for the entire value. Vector select
can be partly true/false.
It's a bit silly, but it allows us to write stricter type
constraints for isel. There's still some extra type checks in
the generated table due to some type interference limitations
around HWMode.
For use in an uncoming patch. Left out the phi case (which could otherwise fit in this framework) as it would cause infinite recursion in said patch. We can probably also leverage this in instcombine to ensure we keep the two sets of related analysis and transforms in sync.
These look like $00A0cf for hex and %001010101 for binary. They are used in Motorola assembly syntax.
Differential Revision: https://reviews.llvm.org/D98519
TextAPI/ELF has moved out into InterfaceStubs, so theres no longer a
need to seperate out TextAPI between formats.
Reviewed By: ributzka, int3, #lld-macho
Differential Revision: https://reviews.llvm.org/D99811
This patch supports bitcasts from scalar types to fixed-length vectors
and vice versa. It custom-lowers and custom-legalizes them to
EXTRACT_VECTOR_ELT/INSERT_VECTOR_ELT operations, using a single-element
vectors to hold the scalar where appropriate.
Previously, some of these would fail to select, others would be expanded
through stack loads and stores. Effort was made to ensure the codegen
avoids the stack for both legal and illegal scalar types.
Some of the codegen could be improved, but on first glance it looks like
a general optimization of EXTRACT_VECTOR_ELT when extracting an i64
element on RV32.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99667
As shown in the example based on:
https://llvm.org/PR49832
...and the existing test, we can't substitute
a vector value because the equality compare
replacement that we are attempting requires
that the comparison is true for the entire
value. Vector select can be partly true/false.
In 0dbcb36394, most most target symbols were made hidden by default
with the public ones marked with LLVM_EXTERNAL_VISIBILITY. When the
M68k target was added, this particular change was forgotten so that
external tools cannot make use of the public M68k target functions
in libLLVM.so. Thus, add the missing LLVM_EXTERNAL_VISIBILITY macro
to all public target functions in the M68k backend.
Differential Revision: https://reviews.llvm.org/D99869
Caught in internal testing, these operations are assumed legal by
default, even for scalable vector types. Expand them back into separate
truncations and stores, or loads and extensions.
Also add explicit fixed-length vector tests for these operations, even
though they should have been correct already.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99654
During vectorization better to postpone the vectorization of the CmpInst
instructions till the end of the basic block. Otherwise we may vectorize
it too early and may miss some vectorization patterns, like reductions.
Reworked part of D57059
Differential Revision: https://reviews.llvm.org/D99796
This patch introduces a DIPrinter interface to implement by different output style printer implementations. DIPrinterGNU and DIPrinterLLVM implement the GNU and LLVM output style printing respectively. No functional changes.
This refactoring clarifies and simplifies the code, and makes a new output style addition easier.
Reviewed By: jhenderson, dblaikie
Differential Revision: https://reviews.llvm.org/D98994
The W version of orc.b does not exist in Zbp so we need to use
gorci encoding. If we have Zbp, we can use gorciw which can avoid a
sext.w in some cases.
This is identical to 781d077afb,
but for the other function.
For certain shift amount bit widths, we must first ensure that adding
shift amounts is safe, that the sum won't have an unsigned overflow.
Fixes https://bugs.llvm.org/show_bug.cgi?id=49778
This is discussed in https://llvm.org/PR48999 ,
but it does not solve that request.
The difference in the vector test shows that some
other logic transform is limited to scalar types.
When converting a switch with two cases and a default into a
select, also handle the denegerate case where two cases have the
same value.
Generate this case directly as
%or = or i1 %cmp1, %cmp2
%res = select i1 %or, i32 %val, i32 %default
rather than
%sel1 = select i1 %cmp1, i32 %val, i32 %default
%res = select i1 %cmp2, i32 %val, i32 %sel1
as InstCombine is going to canonicalize to the former anyway.
Even if one of the operands is overdefined, we may still produce
a non-overdefined result, e.g. due to a min/max operation. This
matches our handling elsewhere, e.g. for binary operators.
The slot poisoning comment refers to a much older LVI cache
implementation.
As long as it's a constant we can directly pattern match it
without any problems. It's only when it isn't a constant that
we need to add an AND.
In theory this should allow more target independent optimizations
to remain active.
This patch fixes llvm.org/pr49688 by conditionally folding select i1 into and/or:
```
select cond, cond2, false
->
and cond, cond2
```
This is not safe if cond2 is poison whereas cond isn’t.
Unconditionally disabling this transformation affects later pipelines that depend on and/or i1s.
To minimize its impact, this patch conservatively checks whether cond2 is an instruction that
creates a poison or its operand creates a poison.
This approach is similar to what InstSimplify's SimplifyWithOpReplaced is doing.
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D99674
This was prompted by D95727, which had the side-effect to break the
'release' mode build bot for ML-driven policies. The problem is that now
the pre-compiled object files don't get transitively carried through as
'source' anymore; that being said, the previous way of consuming them
was problematic, because it was only working for static builds; in
dynamic builds, the whole tf_xla_runtime was linked, which is
undesirable.
The alternative is to treat tf_xla_runtime as an archive, which then
leads to the desired effect.
Differential Revision: https://reviews.llvm.org/D99829
This is a followup to D98145: As far as I know, tracking of kill
flags in FastISel is just a compile-time optimization. However,
I'm not actually seeing any compile-time regression when removing
the tracking. This probably used to be more important in the past,
before FastRA was switched to allocate instructions in reverse
order, which means that it discovers kills as a matter of course.
As such, the kill tracking doesn't really seem to serve a purpose
anymore, and just adds additional complexity and potential for
errors. This patch removes it entirely. The primary changes are
dropping the hasTrivialKill() method and removing the kill
arguments from the emitFast methods. The rest is mechanical fixup.
Differential Revision: https://reviews.llvm.org/D98294
Fixes PR47603
This should probably be transferable to DAGCombine - the main limitation with the existing trunc(logicop) DAG fold is we don't know if legalization has tried to promote truncated logicops already. We might be able to peek through extensions as well.
Use the getTargetShuffleInputs helper for all shuffle decoding
Reapplied (after reversion in rGfa0aff6d6960) with fix+test for subvector splitting - we weren't accounting for peeking through bitcasts changing the vector element count of the shuffle sources.
Started to see build errors like this
../lib/Support/Z3Solver.cpp:19:10: fatal error: 'z3.h' file not found
#include <z3.h>
^~~~~~
1 error generated.
after commit 43ceb74eb1.
The -isystem path to the Z3_INCLUDE_DIR wen't missing in the compile
commands. No idea why target_include_directories stopped working with
that commit, but using include_directories seem to work better.
InstCombine performs simple forwarding from stores to loads, but
currently only handles the case where the load and store have the
same size. This extends it to also handle a store of a constant
with a larger size followed by a load with a smaller size.
This is implemented through ConstantFoldLoadThroughBitcast() which
is fairly primitive (e.g. does not allow storing a large integer
and then loading a small one), but at least can forward the first
element of a vector store. Unfortunately it seems that we currently
don't have a generic helper for "read a constant value as a different
type", it's all tangled up with other logic in either
ConstantFolding or VNCoercion.
Differential Revision: https://reviews.llvm.org/D98114
The AAMDNodes part of the MemoryLocation is not used by the BasicAA
cache, so don't store it. This reduces the size of each cache entry
from 112 bytes to 48 bytes.
BasicAA itself doesn't make use of AA metadata, but passes it
through to recursive queries and makes it part of the cache key.
Aliasing decisions that are based on AA metadata (i.e. TBAA and
ScopedAA) are based *only* on AA metadata, so checking them with
different pointer values or sizes is not useful, the result will
always be the same.
While this change is a mild compile-time improvement by itself,
the actual goal here is to reduce the size of AA cache keys in
a followup change.
Differential Revision: https://reviews.llvm.org/D90098
Define -fatal-warnings to make warnings fatal, and accept /WX as an ML.EXE compatible alias for it.
Also make sure that if Warning() returns true, we always treat it as an error.
Reviewed By: thakis
Differential Revision: https://reviews.llvm.org/D92504
Head files are included in a separate patch in case the name needs to be changed.
RV32 / 64:
clmul
clmulh
clmulr
Differential Revision: https://reviews.llvm.org/D99711
Forgot to amend the Author.
Original commit message:
Header files are included in a separate patch in case the name needs to be changed.
RV32 / 64:
orc.b
Differential Revision: https://reviews.llvm.org/D99320
Make variables and text-macro references case-insensitive, to match ml.exe.
Also improve error handling for text-macro expansion.
Reviewed By: thakis
Differential Revision: https://reviews.llvm.org/D92503
Implementation for RISC-V Zbr extension intrinsic.
Header files are included in separate patch in case the name needs to be changed
RV32 / 64:
crc32b
crc32h
crc32w
crc32cb
crc32ch
crc32cw
RV64 Only:
crc32d
crc32cd
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99009
For positive constants we try shifting left to remove leading zeros
and fill the bottom bits with 1s. We then materialize that constant
shift it right.
This patch adds a new strategy to try filling the bottom bits with
zeros instead. This catches some additional cases.
When run under valgrind, or with a malloc that poisons freed memory,
this can lead to segfaults or other problems.
To avoid modifying the AdditionalUsers DenseMap while still iterating,
save the instructions to be notified in a separate SmallPtrSet, and use
this to later call OperandChangedState on each instruction.
Fixes PR49582.
Reviewed By: fhahn
Differential Revision: https://reviews.llvm.org/D98602
This patch moves mapping of IR operands to VPValues out of
tryToCreateWidenRecipe. This allows using existing VPValue operands when
widening recipes directly, which will be introduced in future patches.
The safepoints being inserted exists to free memory, or coordinate with another thread to do so. Thus, we must strip any inferred attributes and reinfer them after the lowering.
I'm not aware of any active miscompiles caused by this, but since I'm working on strengthening inference of both and leveraging them in the optimization decisions, I figured a bit of future proofing was warranted.
Change the definition of G_SBFX and G_UBFX so that the lsb and width
can have different types than the src and dst operands.
Differential Revision: https://reviews.llvm.org/D99739
The ultimate reduction node may have multiple uses, but if the ultimate
reduction is min/max reduction and based on SelectInstruction, the
condition of this select instruction must have only single use.
Differential Revision: https://reviews.llvm.org/D99753
In order to bring up scalable vector support in LLVM incrementally,
we introduced behaviour to emit a warning, instead of an error, when
asking the wrong question of a scalable vector, like asking for the
fixed number of elements.
This patch puts that behaviour under a flag. The default behaviour is
that the compiler will always error, which means that all LLVM unit
tests and regression tests will now fail when a code-path is taken that
still uses the wrong interface.
The behaviour to demote an error to a warning can be individually enabled
for tools that want to support experimental use of scalable vectors.
This patch enables that behaviour when driving compilation from Clang.
This means that for users who want to try out scalable-vector support,
fixed-width codegen support, or build user-code with scalable vector
intrinsics, Clang will not crash and burn when the compiler encounters
such a case.
This allows us to do away with the following pattern in many of the SVE tests:
RUN: .... 2>%t
RUN: cat %t | FileCheck --check-prefix=WARN
WARN-NOT: warning: ...
The behaviour to emit warnings is only temporary and we expect this flag
to be removed in the future when scalable vector support is more stable.
This patch also has fixes the following tests:
unittests:
ScalableVectorMVTsTest.SizeQueries
SelectionDAGAddressAnalysisTest.unknownSizeFrameObjects
AArch64SelectionDAGTest.computeKnownBitsSVE_ZERO_EXTEND_VECTOR_INREG
regression tests:
Transforms/InstCombine/vscale_gep.ll
Reviewed By: paulwalker-arm, ctetreau
Differential Revision: https://reviews.llvm.org/D98856
The motivation for this patch is to better estimate the cost of
extracelement instructions in cases were they are going to be free,
because the source vector can be used directly.
A simple example is
%v1.lane.0 = extractelement <2 x double> %v.1, i32 0
%v1.lane.1 = extractelement <2 x double> %v.1, i32 1
%a.lane.0 = fmul double %v1.lane.0, %x
%a.lane.1 = fmul double %v1.lane.1, %y
Currently we only consider the extracts free, if there are no other
users.
In this particular case, on AArch64 which can fit <2 x double> in a
vector register, the extracts should be free, independently of other
users, because the source vector of the extracts will be in a vector
register directly, so it should be free to use the vector directly.
The SLP vectorized version of noop_extracts_9_lanes is 30%-50% faster on
certain AArch64 CPUs.
It looks like this does not impact any code in
SPEC2000/SPEC2006/MultiSource both on X86 and AArch64 with -O3 -flto.
This originally regressed after D80773, so if there's a better
alternative to explore, I'd be more than happy to do that.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D99719
D99717 introduced some test cases which showed that the output of one
vsetvli into another would not be picked up by the RISCVCleanupVSETVLI
pass. This patch teaches the optimization about such a pattern. The
pattern is quite common when using the RVV vsetvli intrinsic to pass the
VL onto other intrinsics.
The second test case introduced by D99717 is left unoptimized by this
patch. It is a rarer case and will require us to rewire any uses of the
redundant vset[i]vli's output to the previous one's.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99730
Support reassociation for min/max. With that we should be able to transform min(min(a, b), c) -> min(min(a, c), b) if min(a, c) is already available.
Reviewed By: mkazantsev, lebedev.ri
Differential Revision: https://reviews.llvm.org/D88287
Recently we switched to use InvalidProbeCount = UINT64_MAX (instead of 0) to represent dangling probe, but UINT64_MAX is not excluded when computing profile summary. This caused profile summary to produce incorrect hot/cold threshold. The change fixed it by excluding UINT64_MAX from summary builder.
Differential Revision: https://reviews.llvm.org/D99788
This is a patch to fix the bug in alignment calculation (see https://reviews.llvm.org/D90529#2619492).
Consider this code:
```
call void @llvm.assume(i1 true) ["align"(i32* %a, i32 32, i32 28)]
%arrayidx = getelementptr inbounds i32, i32* %a, i64 -1
; aligment of %arrayidx?
```
The llvm.assume guarantees that `%a - 28` is 32-bytes aligned, meaning that `%a` is 32k + 28 for some k.
Therefore `a - 4` cannot be 32-bytes aligned but the existing code was calculating the pointer as 32-bytes aligned.
The reason why this happened is as follows.
`DiffSCEV` stores `%arrayidx - %a` which is -4.
`OffSCEV` stores the offset value of “align”, which is 28.
`DiffSCEV` + `OffSCEV` = 24 should be used for `a - 4`'s offset from 32k, but `DiffSCEV` - `OffSCEV` = 32 was being used instead.
Reviewed By: Tyker
Differential Revision: https://reviews.llvm.org/D98759
The code is assuming that having an exact exit count for the loop implies that exit counts for every exit are known. This used to be true, but when we added handling for dead exits we broke this invariant. The new invariant is that an exact loop count implies that any exits non trivially dead have exit counts.
We could have fixed this by either a) explicitly checking for a dead exit, or b) just testing for SCEVCouldNotCompute. I chose the second as it was simpler.
(Debugging this took longer than it should have since I'd mistyped the original assert and it wasn't checking what it was meant to...)
p.s. Sorry for the lack of test case. Getting things into a state to actually hit this is difficult and fragile. The original repro involves loop-deletion leaving SCEV in a slightly inprecise state which lets us bypass other transforms in IndVarSimplify on the way to this one. All of my attempts to separate it into a standalone test failed.
This occurs when we type legalize an i64 scalar input on RV32. We
need to manually splat, which requires a vector input. Rather
than special case this in lowering just pattern match it.
This removes the restriction that only Thumb2 targets enable runtime
loop unrolling, allowing it for Thumb1 only cores as well. The existing
T2 heuristics are used (for the time being) to control when and how
unrolling is performed.
Differential Revision: https://reviews.llvm.org/D99588
The default legalization strategy is PromoteFloat which keeps
half in single precision format through multiple floating point
operations. Conversion to/from float is done at loads, stores,
bitcasts, and other places that care about the exact size being 16
bits.
This patches switches to the alternative method softPromoteHalf.
This aims to keep the type in 16-bit format between every operation.
So we promote to float and immediately round for any arithmetic
operation. This should be closer to the IR semantics since we
are rounding after each operation and not accumulating extra
precision across multiple operations. X86 is the only other
target that enables this today. See https://reviews.llvm.org/D73749
I had to update getRegisterTypeForCallingConv to force f16 to
use f32 when the F extension is enabled. This way we can still
pass it in the lower bits of an FPR for ilp32f and lp64f ABIs.
The softPromoteHalf would otherwise always give i16 as the
argument type.
Reviewed By: asb, frasercrmck
Differential Revision: https://reviews.llvm.org/D99148
This implements the most basic possible nosync inference. The choice of inference rule is taken from the comments in attributor and the discussion on the review of the change which introduced the nosync attribute (0626367202).
This is deliberately minimal. As noted in code comments, I do plan to add a more robust inference which actually scans the function IR directly, but a) I need to do some refactoring of the attributor code to use common interfaces, and b) I wanted to get something in. I also wanted to minimize the "interesting" analysis discussion since that's time intensive.
Context: This combines with existing nofree attribute inference to help prove dereferenceability in the ongoing deref-at-point semantics work.
Differential Revision: https://reviews.llvm.org/D99749
Hookup TLI when inferring object size from allocation calls. This allows the analysis to prove dereferenceability for known allocation functions (such as malloc/new/etc) in addition to those marked explicitly with the allocsize attribute.
This is a follow up to 0129cd5 now that the bug fixed by e2c6621e6 is resolved.
As noted in the test, this relies on being able to prove that there is no free between allocation and context (e.g. hoist location). At the moment, this is handled conservatively. I'm working strengthening out ability to reason about no-free regions separately.
Differential Revision: https://reviews.llvm.org/D99737
We have this logic duplicated in several cases, none of which were exhaustive. Consolidate it in one place.
I don't believe this actually impacts behavior of the callers. I think they all filter their inputs such that their partial implementations were correct. If not, this might be fixing a cornercase bug.
We need to splat the scalar separately and use .vv, but there is
no vmsgt(u).vv. So add isel patterns to select vmslt(u).vv with
swapped operands.
We also need to get VT to use for the splat from an operand rather
than the result since the result VT is nxvXi1.
Reviewed By: HsiangKai
Differential Revision: https://reviews.llvm.org/D99704
There's no target independent ISD opcode for MULHSU, so custom
legalize 2*XLen multiplies ourselves. We have to be a little
careful to prefer MULHU or MULHSU.
I thought about doing this in isel by pattern matching the
(add (mul X, (srai Y, XLen-1)), (mulhu X, Y)) pattern. I decided
against this because the add might become part of a chain of adds.
I don't trust DAG combine not to reassociate with other adds making
it difficult to find both pieces again.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D99479
Doing this during instruction selection avoids the cost of running
SIAddIMGInit which is yet another pass over the MIR.
Differential Revision: https://reviews.llvm.org/D99670
Doing this in a post-isel hook avoids the cost of running SIAddIMGInit
which is yet another pass over the MIR.
Differential Revision: https://reviews.llvm.org/D99747
This adds a new integer materialization strategy mainly targeted
at 64-bit constants like 0xffffffff where there are 32 or more trailing
ones with leading zeros. We can materialize these by using an addi -1
and srli to restore the leading zeros. This matches what gcc does.
I haven't limited to just these cases though. The implementation
here takes the constant, shifts out all the leading zeros and
shifts ones into the LSBs, creates the new sequence, adds an srli,
and checks if this is shorter than our original strategy.
I've separated the recursive portion into a standalone function
so I could append the new strategy outside of the recursion. Since
external users are no longer using the recursive function, I've
cleaned up the external interface to return the sequence instead of
taking a vector by reference.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D98821
Support deriving dereferenceability facts from allocation sites with known object sizes while correctly accounting for any possibly frees between allocation and use site. (At the moment, we're conservative and only allowing it in functions where we know we can't free.)
This is part of the work on deref-at-point semantics. I'm making the change unconditional as the miscompile in this case is way too easy to trip by accident, and the optimization was only recently added (by me).
There will be a follow up patch wiring through TLI since that should now be doable without introducing widespread miscompiles.
Differential Revision: https://reviews.llvm.org/D95815
The main part of the patch is the change in RegAllocGreedy.cpp: Q.collectInterferringVregs()
needs to be called before iterating the interfering live ranges.
The rest of the patch offers support that is the case: instead of clearing the query's
InterferingVRegs field, we invalidate it. The clearing happens when the live reg matrix
is invalidated (existing triggering mechanism).
Without the change in RegAllocGreedy.cpp, the compiler ices.
This patch should make it more easily discoverable by developers that
collectInterferringVregs needs to be called before iterating.
I will follow up with a subsequent patch to improve the usability and maintainability of Query.
Differential Revision: https://reviews.llvm.org/D98232
- This patch adds in support to accept the "#" character as part of an Identifier.
- This support is needed especially for the HLASM dialect since "#" is treated as part of the valid "Alphabet" range
- The way this is done is by making use of the previous precedent set by the `AllowAtInIdentifier` field in `MCAsmLexer.h`. A new field called `AllowHashInIdentifier` is introduced.
- The static function `IsIdentifierChar` is also updated to accept the `#` character if the `AllowHashInIdentifier` field is set to true.
Note: The field introduced in `MCAsmLexer.h` could very well be moved to `MCAsmInfo.h`. I'm not opposed to it. I decided to put it in `MCAsmLexer` since there seems to be some sort of precedent already with `AllowAtInIdentifier`.
Reviewed By: abhina.sreeskantharajan, nickdesaulniers, MaskRay
Differential Revision: https://reviews.llvm.org/D99277
When an SVE function calls another SVE function using the C calling
convention we use the more efficient SVE VectorCall PCS. However,
for the Fast calling convention we're incorrectly falling back to
the generic AArch64 PCS.
This patch adds the same "can use SVE vector calling convention"
detection used by CallingConv::C to CallingConv::Fast.
Co-authored-by: Paul Walker <paul.walker@arm.com>
Differential Revision: https://reviews.llvm.org/D99657
1. Need to cleanup InstrElementSize map for each new tree, otherwise might
use sizes from the previous run of the vectorization attempt.
2. No need to include into analysis the instructions from the different basic
blocks to save compile time.
Differential Revision: https://reviews.llvm.org/D99677
If the inner shuffle already contains undef elements, then accept them in the merged shuffle as well.
This helps some X86 HADD/SUB patterns where slow targets were ending up with HADD/SUB because the (un)merged shuffles were stuck either side of the ADD/SUB - meaning we ended up with a total cost much higher than the "2*shuffle+add" that a slow target usually expands a HADD/SUB to.
By convention, VOP1/2/C instructions which can be promoted to VOP3 have _e32 suffix while promoted instructions have _e64 suffix. Instructions which have a single variant should have no _e32/_e64 suffix. Unfortunately there was no simple way to identify single variant instructions - it was implemented by a hack. See bug https://bugs.llvm.org/show_bug.cgi?id=39086.
This fix simplifies handling of single VOP instructions by adding a dedicated flag.
Differential Revision: https://reviews.llvm.org/D99408
Removes CFGAnalyses from the preserved analyses set
returned by LoopFlattenPass::run().
Reviewed By: Dave Green, Ta-Wei Tu
Differential Revision: https://reviews.llvm.org/D99700
A frequent pattern for floating point conditional branches use an xor
to invert the input for the branch. Instead we can fold away the xor
by swapping the branch target instead.
Differential Revision: https://reviews.llvm.org/D99171
Name GVN uses name 'LI' for two different unrelated things:
LoadInst and LoopInfo. This patch relates the variables with
former meaning into 'Load' to disambiguate the code.
Before this change, the `llvm.access.group` metadata was dropped
when moving a load instruction in GVN. This prevents vectorizing
a C/C++ loop with `#pragma clang loop vectorize(assume_safety)`.
This change propagates the metadata as well as other metadata if
it is safe (the move-destination basic block and source basic
block belong to the same loop).
Differential Revision: https://reviews.llvm.org/D93503
This commit adjusts the order of two swappable if statements to
make code cleaner.
Reviewed By: lattner, nikic
Differential Revision: https://reviews.llvm.org/D99648
This allows these optimisations to apply to e.g. `urem i16` directly
before `urem` is promoted to i32 on architectures where i16 operations
are not intrinsically legal (such as on Aarch64). The legalization then
later can happen more directly and generated code gets a chance to avoid
wasting time on computing results in types wider than necessary, in the end.
Seems like mostly an improvement in terms of results at least as far as x86_64 and aarch64 are concerned, with a few regressions here and there. It also helps in preventing regressions in changes like {D87976}.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D88785
While probing stack, the stack register is moved without dwarf
information, which could cause panic if unwind the backtrace.
This commit only add annotation for the inline stack probe case.
Dwarf information for the loop case should be done in another
patch and need further discussion.
Reviewed By: nagisa
Differential Revision: https://reviews.llvm.org/D99579
Use SetVector instead of SmallPtrSet to track values with uniform use. Doing this
can help avoid non-determinism caused by iterating over unordered containers.
This bug was found with reverse iteration turning on,
--extra-llvm-cmake-variables="-DLLVM_REVERSE_ITERATION=ON".
Failing LLVM test consecutive-ptr-uniforms.ll .
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D99549
Removes the prototype builtin and intrinsic for i64x2.eq and implements that
instruction as well as the other i64x2 comparison instructions in the final SIMD
spec. Unsigned comparisons were not included in the final spec, so they still
need to be scalarized via a custom lowering.
Differential Revision: https://reviews.llvm.org/D99623
This is a patch teaching ValueTracking that `s/u*.with.overflow` intrinsics do not
create undef/poison and they propagate poison.
I couldn't write a nice example like the one with ctpop; ValueTrackingTest.cpp were simply updated
to check these instead.
This patch helps reducing regression while fixing https://llvm.org/pr49688 .
Reviewed By: nikic
Differential Revision: https://reviews.llvm.org/D99671
This fixes an issue introduced with my change d4648e, and reported in pr49768.
The root problem is that dominance collapses in unreachable code, and that LoopInfo explicitly only models reachable code. Since the recurrence matcher doesn't filter by reachability (and can't easily because not all consumers have domtree), we need to bailout before assuming that finding a recurrence implies we found a loop.
The default expansion creates a MUL and either a MULHS/MULHU. Each
of those separately expand to sequences that use one or more
PMULLW instructions as well as additional instructions to
extend the types to vXi16. The MULHS/MULHU expansion computes the
whole 16-bit product, but only keeps the high part.
We can improve the lowering of SMULO/UMULO for some cases by using the MULHS/MULHU
expansion, but keep both the high and low parts. And we can use
those parts to calculate the overflow.
For AVX512 we might have vXi1 overflow outputs. We can improve those by using
vpcmpeqw to produce a k register if AVX512BW is enabled. This is a little better
than truncating the high result to use vpcmpeqb. If we don't have avx512bw we
can extend up to v16i32 to use vpcmpeqd to produce a k register.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97624
On ppc64 linux , MachineLICM will hoist caller preserved registers, including TOC loads of the global variable address, out of loops. This is to enable this on AIX for both ppc64 and ppc32.
Differential Revision: https://reviews.llvm.org/D99076
We previously couldn't optimize out a TEST if the branch/setcc/cmov
used the overflow flag. This patches allows the TEST to be removed
if the flag producing instruction is known to clear the OF flag.
Thats what the TEST instruction would have done so that should be
equivalent.
Need to add test cases. I'll try to get back to this if I have bandwidth.
Fixes PR48768.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D94856
in this patch we add a new libLTO API to specify debug options independent of an lto_code_gen_t.
This allows clients to pass codegen flags (through libLTO) which otherwise today are ignored.
Reviewed By: steven_wu
Differential Revision: https://reviews.llvm.org/D92611
Our CLZW isel pattern is quite easily broken by surrounding code
preventing it from matching sometimes. This usually results in
failing to remove the and X, 0xffffffff inserted by type
legalization. The add with -32 that type legalization also inserts
will often gets combined into other add/sub nodes. That doesn't
usually result in extra code when we don't use clzw.
CTTZ seems to be less fragile, but I wanted to keep it consistent
with CTLZ.
Reviewed By: asb, HsiangKai
Differential Revision: https://reviews.llvm.org/D99317
Also modify the simm5_plus1 check because Imm-1 is UB if Imm happens
to be INT64_MIN. I don't think the compiler would optimize based on that in this
usage, but it could fail UBSan or -ftrapv.
Reviewed By: HsiangKai, frasercrmck
Differential Revision: https://reviews.llvm.org/D99637
This marks FSIN and other operations to EXPAND for scalable
vectors, so that they are not assumed to be legal by the cost-model.
Depends on D97470
Reviewed By: dmgreen, paulwalker-arm
Differential Revision: https://reviews.llvm.org/D97471
Attempts to compute savings more accurately cannot impact the set of critically important call sites.
Reviewed By: kazu
Differential Revision: https://reviews.llvm.org/D98577
This is an improvement over Zen 2, where only branch fusion is supported,
as per Agner, 21.4 Instruction fusion.
AMD SOG 17h has no mention of fusion.
AMD SOG 19h, 2.9.3 Branch Fusion
The following flag writing instructions support branch fusion
with their reg/reg, reg/imm and reg/mem forms
* CMP
* TEST
* SUB
* ADD
* INC (no fusion with branches dependent on CF)
* DEC (no fusion with branches dependent on CF)
* OR
* AND
* XOR
Agner, 22.4 Instruction fusion
<...> This applies to CMP, TEST, ADD, SUB, AND, OR, XOR, INC, DEC and
all conditional jumps, except if the arithmetic or logic instruction has a rip-relative address or
both an address displacement and an immediate operand.
This adds almost everything required for supporting the new stepvector
intrinsic on RVV. It is lowered to the existing VID_VL SDNode.
The only exception is a limitation that RV32 cannot yet lower the
intrinsic on i64 vectors. This is because the step operand is
(currently) required to be at least as large as the vector element type.
I will look into patching that out and loosening the requirement to only
an integer pointer type.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D99594
This includes gfx908 which only has a no-return version of the
global_atomic_add_f32 instruction, using the same hack that was
previously implemented for selecting from the
llvm.amdgcn.global.atomic.fadd intrinsic.
Differential Revision: https://reviews.llvm.org/D97767
Currently the code only checks for integer constants (ConstantSDNode)
and triggers an infinite cycle for single-element floating point
vector constants. We need to check for both FP and integer constants.
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D99384
Adds utilities for creating anonymous pointers and jump stubs to x86_64.h. These
are used by the GOT and Stubs builder, but may also be used by pass writers who
want to create pointer stubs for indirection.
This patch also switches the underlying type for LinkGraph content from
StringRef to ArrayRef<char>. This avoids any confusion when working with buffers
that contain null bytes in the middle like, for example, a newly added null
pointer content array. ;)
Summary: Try to insert dbg.declare to entry.resume basic block in resume
function. In this way, we could print alloca such as __promise in
gdb/lldb under O2, which would be beneficial to debug coroutine program.
Test Plan: check-llvm
Reviewed by: aprantl
Differential Revision: https://reviews.llvm.org/D96938
We only use this in Pat patterns, so it just needs to be an
ImmLeaf. If we did need it as an instruction operand, the
ParserMatchClass, EncoderMethod, and DecoderMethod were probably wrong.
GCC warning:
```
/llvm-project/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp: In member function ‘bool llvm::CombinerHelper::matchFunnelShiftToRotate(llvm::MachineInstr&)’:
/llvm-project/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp:3882:35: warning: ?: using integer constants in boolean context, the expression will always evaluate to ‘true’ [-Wint-in-bool-context]
3882 | Opc == TargetOpcode::G_FSHL ? TargetOpcode::G_ROTL : TargetOpcode::G_ROTR;
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
MemberOffsets are stored at the end of StructLayout. The class
contains a single entry array to mark the start of the member
offsets. getStructLayout calculates the additional space needed
for additional elements before allocating memory.
This patch converts this to use TrailingObjects. This simplifies
the size computation in getStructLayout and gets rid of the
single entry array.
This is prep work, but to use TypeSize instead of uint64_t for
D98169. The single entry array doesn't work with TypeSize because
TypeSize doesn't have a default constructor. We thought this
change was an improvement by itself so we've separated it out.
Reviewed By: mehdi_amini
Differential Revision: https://reviews.llvm.org/D99608
The number of events and the type index should be encoded in ULEB128,
but they were incorrctly encoded in LEB128. The smallest number with
which its LEB128 and ULEB128 encodings are different is 64.
There's no way we can generate 64 events in the C++ toolchain
implementation so we can't test that, but the attached test tests when
the type index is 64.
Reviewed By: dschuff
Differential Revision: https://reviews.llvm.org/D99627
another one for distributed mode.
Currently during module importing, ThinLTO opens all the source modules,
collect functions to be imported and append them to the destination module,
then leave all the modules open through out the lto backend pipeline. This
patch refactors it in the way that one source module will be closed before
another source module is opened. All the source modules will be closed after
importing phase is done. It will save some amount of memory when there are
many source modules to be imported.
Note that this patch only changes the distributed thinlto mode. For in
process thinlto mode, one source module is shared acorss different thinlto
backend threads so it is not changed in this patch.
Differential Revision: https://reviews.llvm.org/D99554
Mark v6m/v8m-baseline cores as having no branch predictors. This should
not alter very much on its own, but is more correct as the cores do not
have branch predictors and can help in the future.
Use SetVector instead of SmallPtrSet for external definitions created for VPlan.
Doing this can help avoid non-determinism caused by iterating over unordered containers.
This bug was found with reverse iteration turning on,
--extra-llvm-cmake-variables="-DLLVM_REVERSE_ITERATION=ON".
Failing LLVM-Unit test VPRecipeTest.dump.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D99544
This patch adds 3 methods, one for power-of-2 vectors which use tree
reductions using vector ops, before a final reduction op. For non-pow-2
types it generates multiple narrow reductions and combines the values with
scalar ops.
Differential Revision: https://reviews.llvm.org/D97163
For imported pattern purposes, we have a custom rule that promotes the rotate
amount to 64b as well.
Differential Revision: https://reviews.llvm.org/D99463
Negative numbers are represented using DW_OP_consts along with signed representation
of the number as the argument.
Test case IR is generated using Fortran front-end.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D99273
Currently prof metadata with branch counts is added only for BranchInst and SwitchInst, but not for IndirectBrInst. As a result, BPI/BFI make incorrect inferences for indirect branches, which can be very hot.
This diff adds metadata for IndirectBrInst, in addition to BranchInst and SwitchInst.
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D99550
Use profiled call edges to augment the top-down order. There are cases that the top-down order computed based on the static call graph doesn't reflect real execution order. For example:
1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets by allowing the caller processed before them.
2. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, thus may cause potential inlining to be overlooked. The function order in one SCC is being adjusted to a top-down order based on the profile to favor more inlining.
3. Transitive indirect call edges due to inlining. When a callee function is inlined into into a caller function in LTO prelink, every call edge originated from the callee will be transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink since the inlined callee is gone from the static call graph.
4. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times but only one definition survives due to ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless based on a local file scope. More importantly, the inlinee, once fully inlined to a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function which may be called from external modules. while this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the survived copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions.
Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed to handle case #3 and case 4.
I'm seeing an average 0.4% perf win out of SPEC2017. For certain benchmark such as Xalanbmk and GCC, the win is bigger, above 2%.
The change is an enhancement to https://reviews.llvm.org/D95988.
Reviewed By: wmi, wenlei
Differential Revision: https://reviews.llvm.org/D99351
Basically a port of isBitfieldExtractOpFromSExtInReg in AArch64ISelDAGToDAG.
This is only done post-legalization for now. Once the legalizer knows how to
decompose these back into shifts, this requirement can probably be removed.
Differential Revision: https://reviews.llvm.org/D99230
Without Zfh the half type isn't legal, but it could still be
used as an argument/return in IR. Clang will not generate this today.
Previously we promoted the half value to float for arguments and
returns if the F extension is enabled but Zfh isn't. Then depending on
which ABI is enabled we would pass it in either an FPR or a GPR in
float format.
If the F extension isn't enabled, it would get passed in the lower
16 bits of a GPR in half format.
With this patch the value will always in half format and will be
in the lower bits of a GPR or FPR. This should be consistent
with where the bits are located when Zfh is enabled.
I've based this implementation off of how this is done on ARM.
I've manually nan-boxed the value to 32 bits using integer ops.
It looks like flw, fsw, fmv.s, fmv.w.x, fmf.x.w won't
canonicalize nans so should leave the value alone. I think those
are the instructions that could get used on this value.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D98670
Currently needsStackRealignment returns false if canRealignStack returns false.
This means that the behavior of needsStackRealignment does not correspond to
it's name and description; a function might need stack realignment, but if it
is not possible then this function returns false. Furthermore,
needsStackRealignment is not virtual and therefore some backends have made use
of canRealignStack to indicate whether a function needs stack realignment.
This patch attempts to clarify the situation by separating them and introducing
new names:
- shouldRealignStack - true if there is any reason the stack should be
realigned
- canRealignStack - true if we are still able to realign the stack (e.g. we
can still reserve/have reserved a frame pointer)
- hasStackRealignment = shouldRealignStack && canRealignStack (not target
customisable)
Targets can now override shouldRealignStack to indicate that stack realignment
is required.
This change will make it easier in a future change to handle the case where we
need to realign the stack but can't do so (for example when the register
allocator creates an aligned spill after the frame pointer has been
eliminated).
Differential Revision: https://reviews.llvm.org/D98716
Change-Id: Ib9a4d21728bf9d08a545b4365418d3ffe1af4d87
This was crashing with the example from:
https://llvm.org/PR49716
...and that was avoided with a283d72583 ,
but as we can see from the SSE vs. AVX test code diff,
we can try harder to match the pattern.
This matcher code was adapted from another pmadd pattern
match in D49636, but it needs different ops to deal with
size mismatches.
Differential Revision: https://reviews.llvm.org/D99531
This fixes the miscompilation reported in https://reviews.llvm.org/rG5bb38e84d3d0#986154 .
`select _, true, false` matches both m_LogicalAnd and m_LogicalOr, making later
transformations confused.
Simplify the branch condition to not have the form.
As another addition to MVE lane interleaving, this handles Splat shuffle
vectors, as the shuffle of a splat is a splat.
Differential Revision: https://reviews.llvm.org/D97291
This patch adds support for the vectorization of induction variables when
using scalable vectors, which required the following changes:
1. Removed assert from InnerLoopVectorizer::getStepVector.
2. Modified InnerLoopVectorizer::createVectorIntOrFpInductionPHI to use
a runtime determined value for VF and removed an assert.
3. Modified InnerLoopVectorizer::buildScalarSteps to work for scalable
vectors. I did this by calculating the full vector value for each Part
of the unroll factor (UF) and caching this in the VP state. This means
that we are always able to extract an arbitrary element from the vector
if necessary. In addition to this, I also permitted the caching of the
individual lane values themselves for the known minimum number of elements
in the same way we do for fixed width vectors. This is a further
optimisation that improves the code quality since it avoids unnecessary
extractelement operations when extracting the first lane.
4. Added an assert to InnerLoopVectorizer::widenPHIInstruction, since while
testing some code paths I noticed this is currently broken for scalable
vectors.
Various tests to support different cases have been added here:
Transforms/LoopVectorize/AArch64/sve-inductions.ll
Differential Revision: https://reviews.llvm.org/D98715
We previously made a change to getUserCost to return a Invalid cost
when one of the TTI costs returned '-1' (meaning 'unknown' or
'infinitely expensive'). It makes no sense to say that:
shufflevector <2 x i8> %x, <2 x i8> %y, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
has an invalid cost. Perhaps the cost is not known, but the IR is valid
and can be code-generated. Invalid should only be used for IR that
cannot possibly be code-generated and where a cost is nonsensical.
With more passes now asserting that the cost must be valid, it is possible
that those assertions will fail for perfectly valid IR. An incomplete
cost-model probably shouldn't be a reason for the compiler to break.
It's better to consider these costs as 'very expensive' and ignore them
for other reasons. At some point, we should consider replacing -1 with
some other mechanism.
Reviewed By: paulwalker-arm, dmgreen
Differential Revision: https://reviews.llvm.org/D99502
This option tells LLJIT to disable platform support explicitly: JITDylibs aren't scanned for special init/deinit symbols and no runtime API interposes are injected.
It's useful in two cases: for platforms that don't have such requirements and platforms for which we have no explicit support yet and that don't work well with the generic IR platform.
Reviewed By: lhames
Differential Revision: https://reviews.llvm.org/D99416
This is needed for Fortran assumed shape arrays whose dimensions are
defined as,
- 'count' is taken from array descriptor passed as parameter by
caller, access from descriptor is defined by type DIExpression.
- 'lowerBound' is defined by callee.
The current alternate way represents using upperBound in place of
count, where upperBound is calculated in callee in a temp variable
using lowerBound and count
Representation with count (DIExpression) is not only clearer as
compared to upperBound (DIVariable) but it has another advantage that
variable count is accessed by being parameter has better chance of
survival at higher optimization level than upperBound being local
variable.
Reviewed By: aprantl
Differential Revision: https://reviews.llvm.org/D99335
Empty functions (functions with no real code) are irrelevant for propeller optimizations and their addresses sometimes conflict with other functions which obfuscates the analysis.
This simple change skips the BB address map emission for such functions.
Reviewed By: tmsriram
Differential Revision: https://reviews.llvm.org/D99395
This commit adds debugging support for set types defined in languages
such as Pascal and Modula-2.
Patch by Peter McKinna!
Differential Revision: https://reviews.llvm.org/D76115
When I updated the SIMD opcodes in f5764a8654, I accidentally missed updating
i8x16.popcnt. This patch fixes the omission.
Differential Revision: https://reviews.llvm.org/D99536
Use SmallVector instead of SmallSet to track the context profiles mapped. Doing this
can help avoid non-determinism caused by iterating over unordered containers.
This bug was found with reverse iteration turning on,
--extra-llvm-cmake-variables="-DLLVM_REVERSE_ITERATION=ON".
Failing LLVM test profile-context-tracker-debug.ll .
Reviewed By: MaskRay, wenlei
Differential Revision: https://reviews.llvm.org/D99547
dsymutil is not relocating the DW_AT_low_pc for a DW_TAG_label. This
patch fixes that and adds a test.
Differential revision: https://reviews.llvm.org/D99534
Lookup tables generate non PIC-friendly code, which requires dynamic relocation as described in:
https://bugs.llvm.org/show_bug.cgi?id=45244
This patch adds a new pass that converts lookup tables to relative lookup tables to make them PIC-friendly.
Differential Revision: https://reviews.llvm.org/D94355
Currently performExtendCombine assumes that the src-element bitwidth * 2
is a valid MVT. But this is not the case for i1 and it causes a crash on
the v64i1 test cases added in this patch.
It turns out that this code appears to not be needed; the same patterns are
handled by other code and we end up with the same results, even without the
custom lowering. I also added additional test cases in a50037aaa6.
Let's just remove the unneeded code.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99437
If the successor block has a phi node, then additional moves may
be inserted into predecessors, which may clobber eflags. Don't try
to fold the with.overflow result into the branch in that case.
This is done by explicitly checking for any phis in successor
blocks, not sure if there's some more principled way to address
this. Other fused compare and branch patterns avoid the issue by
emitting the comparison when handling the branch, so that no
instructions may be inserted in between. In this case, the
with.overflow call is emitted separately (and I don't think this
is avoidable, as it will generally have at least two users).
Fixes https://bugs.llvm.org/show_bug.cgi?id=49587.
Differential Revision: https://reviews.llvm.org/D98600
Note, only src0 and src1 will be commuted if the isCommutable flag
is set. This patch does not change that, it just makes it possible
to commute src0 and src1 of more instructions.
Reviewed By: foad, rampitec
Differential Revision: https://reviews.llvm.org/D99376
Change-Id: I61e20490962d95ea429beb355c55f55c024dafdc
D89239 adjusts the stack offset of emergency spill slots for overaligned
stacks. However the adjustment is not valid for targets whose stack
grows up (such as AMDGPU).
This change makes the adjustment conditional only to those targets whose
stack grows down.
Fixes https://bugs.llvm.org/show_bug.cgi?id=49686
Differential Revision: https://reviews.llvm.org/D99504
This matches what we do in our isel patterns. In our internal
testing we've found this is needed to make the fast register
allocator happy at -O0. Otherwise it may assign V0 to an earlier
operand and find itself with no registers left when it reaches
the mask operand. By using V0 explicitly, the fast register allocator
will see it when it checks for phys register usages before it
starts allocating vregs. I'll try to update this with a test case.
Unfortunately, this does appear to prevent some instruction reordering
by the pre-RA scheduler which leads to the increased spills seen in
some tests. I suspect that problem could already occur for other
instructions that already used V0 directly.
There's a lot of repeated code here that could do with some
wrapper functions. Not sure if that should be at the level of the
new code that deals with V0. That would require multiple output
parameters to pass the glue, chain and register back. Maybe it
should be at a higher level over the entire set of push_backs.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D99367
Previously we only used RIP relative when PIC was enabled. But
we know we're in small/kernel code model here so we should
be able to always use RIP-relative which will give a smaller
encoding.
Here's a godbolt link that demonstrates the current codegen https://godbolt.org/z/j3158o
Note in the non-PIC version the load from .LCPI0_0 doesn't use
RIP-relative addressing, but if you change the constant in the
source from 0.0 to 1.0 it will become RIP-relative.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D97208
In D97111 we changed the RVV frame layout when using sp or bp to address
the stack slots so we could address the emergency stack slot. The idea
is to put the RVV objects as far as possible (in offset terms) from the
frame reference register (sp / fp / bp).
When using fp this happens naturally because the RVV objects are already
the top of the stack and due to the constraints of RVV (VLENB being a
power of two >= 128) the stack remains aligned. The rest of this summary
does not apply to this case.
When using sp / bp we need to skip the non-RVV stack slots. The size of
the the non-RVV objects is computed subtracting the callee saved
register size (whose computation is added in D97111 itself) to the total
size of the stack (which does not account for RVV stack slots). However,
when doing so we round to 16 bytes when computing that size and we end
emitting a smaller offset that may belong to a scalar stack slot (see
D98801). So this change removes that rounding.
Also, because we want the RVV objects be between the non-RVV stack slots
and the callee-saved register slots, we need to make sure the RVV
objects are properly aligned to 8 bytes. Adding a padding of 8 would
render the stack unaligned. So when allocating space for RVV (only when
we don't use fp) we need to have extra padding that preserves the stack
alignment. This way we can round to 8 bytes the offset that skips the
non-RVV objects and we do not misalign the whole stack in the way. In
some circumstances this means that the RVV objects may have padding
before (=lower offsets from sp/bp) and after (before the CSR stack
slots).
Differential Revision: https://reviews.llvm.org/D98802
This change sets up a framework in llvm-profgen to estimate inline decision and adjust context-sensitive profile based on that. We call it a global pre-inliner in llvm-profgen.
It will serve two purposes:
1) Since context profile for not inlined context will be merged into base profile, if we estimate a context will not be inlined, we can merge the context profile in the output to save profile size.
2) For thinLTO, when a context involving functions from different modules is not inined, we can't merge functions profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we would be able to adjust/merge context profiles beforehand as a mitigation.
Compiler inline heuristic uses inline cost which is not available in llvm-profgen. But since inline cost is closely related to size, we could get an estimate through function size from debug info. Because the size we have in llvm-profgen is the final size, it could also be more accurate than the inline cost estimation in the compiler.
This change only has the framework, with a few TODOs left for follow up patches for a complete implementation:
1) We need to retrieve size for funciton//inlinee from debug info for inlining estimation. Currently we use number of samples in a profile as place holder for size estimation.
2) Currently the thresholds are using the values used by sample loader inliner. But they need to be tuned since the size here is fully optimized machine code size, instead of inline cost based on not yet fully optimized IR.
Differential Revision: https://reviews.llvm.org/D99146
during profile update.
When we inline a function and update the profile, the value profiles of the
indirect call in the inliner and inlinee will be scaled. In
https://reviews.llvm.org/D96806 and https://reviews.llvm.org/D97350, we start
using the magic number NOMORE_ICP_MAGICNUM (-1) to mark targets which have
been promoted. The magic number shouldn't be scaled during the profile update.
Although the problem has been suppressed by https://reviews.llvm.org/D98187
for SampleFDO, which stops profile update for inlining in sampleFDO, the patch
is still wanted since it will be more consistent to handle the magic number
properly in profile update.
Differential Revision: https://reviews.llvm.org/D99394
Re-apply 25fbe803d4, with a small update to emit the right remark
class.
Original message:
[LV] Move runtime pointer size check to LVP::plan().
This removes the need for the remaining doesNotMeet check and instead
directly checks if there are too many runtime checks for vectorization
in the planner.
A subsequent patch will adjust the logic used to decide whether to
vectorize with runtime to consider their cost more accurately.
Reviewed By: lebedev.ri
This is currently performed in SelectionDAGLegalize, here we make it also
happen in LegalizeVectorOps, allowing a target to lower the SETCC condition
codes first in LegalizeVectorOps and then lower to a custom node afterwards,
without having to duplicate all of the SETCC condition legalization in the
target specific lowering.
As a result of this, fixed length floating point SETCC nodes can now be
properly lowered for SVE.
Differential Revision: https://reviews.llvm.org/D98939
This is a 2nd try of:
3c8473ba53
which was reverted at:
a26312f9d4
because of crashing.
This version includes extra code and tests to avoid the known
crashing examples as discussed in PR49730.
Original commit message:
As noted in D98152, we need to patch SLP to avoid regressions when
we start canonicalizing to integer min/max intrinsics.
Most of the real work to make this possible was in:
7202f47508
Differential Revision: https://reviews.llvm.org/D98981
This removes the need for the remaining doesNotMeet check and instead
directly checks if there are too many runtime checks for vectorization
in the planner.
A subsequent patch will adjust the logic used to decide whether to
vectorize with runtime to consider their cost more accurately.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D98634
Using $ breaks demangling of the symbols. For example,
$ c++filt _Z3foov\$123
_Z3foov$123
This causes problems for developers who would like to see nice stack traces
etc., but also for automatic crash tracking systems which try to organize
crashes based on the stack traces.
Instead, use the period as suffix separator, since Itanium demanglers normally
ignore such suffixes:
$ c++filt _Z3foov.123
foo() [clone .123]
This is already done in some places; try to do it everywhere.
Differential revision: https://reviews.llvm.org/D97484
The following operations have no associated cost for them
when applied to scalable vectors, and as a consequence
can trigger a crash when a call is made to
AArch64TTIImpl::getCastInstrCost():
- fptrunc
- trunc
- fpext
- fpto(u,s)i
This patch adds costs for these operations and
relevant regression tests.
Differential Revision: https://reviews.llvm.org/D98934
This extends the recent MVE lane interleaving passto handle other
non-instruction leaves, for which a new shuffle is added. This helps
especially for constants and potentially for arguments.
Differential Revision: https://reviews.llvm.org/D97289
LLVMOrcDisposeObjectLayer and LLVMOrcExecutionSessionGetJITDylibByName did not
have matching signatures between the C-API header and binding implementations.
Fixes http://llvm.org/PR49745.
Patch by Mats Larsen. Thanks Mats!
Reviewed by: lhames
Differential Revision: https://reviews.llvm.org/D99478
This can only happen if offset types that are larger than the
pointer size are involved. The previous implementation did not
assert in this case because it initialized the APInts to the
width of one of the variables -- though I strongly suspect it
did not compute correct results in this case.
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=32621
reported by fhahn.
For these cases we need to extract the upper or lower elements,
multiply them using 16-bit multiplies and repack them.
Previously we used punpcklbw/punpckhbw+psraw or pmovsxbw+pshudfd to
extract and sign extend so we could use pmullw to compute the 16-bit
product and then shift down the high bits.
We can avoid the need to sign extend if we unpack the bytes into
the high byte of each word and fill the lower byte with 0 using
pxor. This puts the sign bit of each byte into the sign bit of
each word. Since the LHS and RHS have 8 trailing zeros, the full
32-bit product of those 16-bit values will have 16 trailing zeros.
This means the 16-bit product of the original bytes is in the upper
16 bits which we can calculate using pmulhw.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D98587
MVE does not have a single sext/zext or trunc instruction that takes the
bottom half of a vector and extends to a full width, like NEON has with
MOVL. Instead it is expected that this happens through top/bottom
instructions. So the MVE equivalent VMOVLT/B instructions take either
the even or odd elements of the input and extend them to the larger
type, producing a vector with half the number of elements each of double
the bitwidth. As there is no simple instruction for a normal extend, we
often have to expand sext/zext/trunc into a series of lane moves (or
stack loads/stores, which we do not do yet).
This pass takes vector code that starts at truncs, looks for
interconnected blobs of operations that end with sext/zext and
transforms them by adding shuffles so that the lanes are interleaved and
the MVE VMOVL/VMOVN instructions can be used. This is done pre-ISel so
that it can work across basic blocks.
This initial version of the pass just handles a limited set of
instructions, not handling constants or splats or FP, which can all come
as extensions to this base.
Differential Revision: https://reviews.llvm.org/D95804
I think byval/sret and the others are close to being able to rip out
the code to support the missing type case. A lot of this code is
shared with inalloca, so catch this up to the others so that can
happen.