This pass allows a user to dump a MIR function to a dot file
and view it as a graph. It aims to provide functionality similar to the
-dot-cfg pass on LLVM IR. As of now the pass supports the following flags:
-dot-mcfg-only [optional]: print only block names in the graph, omitting
instructions
-mcfg-dot-filename-prefix [optional]: prefix to add to the output dot file
-mcfg-func-name [optional]: specify the function name or a substring of it;
handy if the MIR file contains multiple functions and you need to see the
graph of just one
More flags and details can be introduced as requirements arise in the
future. This pass is inspired by the -dot-cfg IR pass, and its APIs are
written in an almost identical format.
Patch by Yashwant Singh <Yashwant.Singh@amd.com> (yassingh)
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D133709
I don't know what was going on originally with these tests. It seems reasonable
to have the immediate be the same byte alignment unit as the IR, in which case
we need to take the log2 in order to set the right number of low bits.
This fixes a miscompile in chromium.
Differential Revision: https://reviews.llvm.org/D134380
This patch uses the API provided by LiveRangeEdit to handle rematerialization.
It will make future maintenance and improvement easier.
No functional change.
Differential Revision: https://reviews.llvm.org/D133610
This patch changes a FADD / FMUL => FMA ISel pattern implemented
in D80801 so that it peeks through more than one FMA.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D132837
I'm not sure why the SEXT_INREG was gated on a bitwidth check of the mask
vs element size.
This fixes a miscompile in chromium's skia library.
Differential Revision: https://reviews.llvm.org/D134236
The following changes are necessary to get the generated tree
matcher to compile:
- In CodeExpansions::declare(), the assert() prevents connecting
two instructions. E.g. the match code
(match (MUL $t, $s1, $s2),
(SUB $d, $t, $s3)),
results in two declarations of $t, one for the def and one for
the use. Removing the assertion allows this construct.
If $t is later used, it is one of the operands, which should be
perfectly fine.
- The code emitted in GIMatchTreeVRegDefPartitioner::generatePartitionSelectorCode()
is not compilable:
- The value of NewInstrID should be emitted, not the name
- Both calls involving getOperand() end with one parenthesis too many
- Swaps generated condition for the partition code in the latter function
It also changes the rules i2p_to_p2i, fabs_fabs_fold, and fneg_fneg_fold
to use the tree matcher for a linear match. These rules are tested by:
CodeGen/AArch64/GlobalISel/combine-fabs.mir
CodeGen/AArch64/GlobalISel/combine-fneg.mir
CodeGen/AArch64/GlobalISel/combine-ptrtoint.mir
CodeGen/AMDGPU/GlobalISel/combine-add-nullptr.mir
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D133257
This patch adds instruction-based features to the regalloc advisor, gated
behind a flag so a user can decide at runtime whether or not they
want to enable the feature. The features are only enabled when LLVM is
compiled in MLGO development mode, i.e. LLVM_HAVE_TF_API is set to true.
To extract the instruction features, I'm taking a list of segments from
each LiveInterval and noting the start and end SlotIndices. This list is then
sorted based on the start SlotIndex and I iterate through each SlotIndex
to grab instructions, making sure to check for overlaps. This results in
a vector of opcodes and a binary mapping matrix that maps live ranges to
the opcodes of the instructions within that LR.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D131930
Use salvageDebugInfo for instructions erased as trivially dead in
GlobalISel.
It would be helpful to implement support for G_PTR_ADD and G_FRAME_INDEX
in salvageDebugInfo in the future in order to preserve more variable
locations.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D133986
This is a follow-up to D133777, which resolved a use-after-free case but
did not cover all possible memory bugs due to misplacement of loads.
In short, the overall problem was that sunk loads could be moved after
state-modifying instructions leading to memory bugs.
The solution is to restrict load sinking unless it is found to be sound.
i) Within a basic block (the to-be-sunk load and the select user are in the same BB),
loads can be sunk only if there is no intervening state-modifying instruction.
This is a conservative approach to avoid resorting to alias analysis to detect
potential memory overlap.
ii) Across basic blocks, sinking of loads is avoided. This is because going over
multiple basic blocks looking for memory conflicts could be computationally
expensive and also unlikely to allow loads to sink. Further, experiments showed
that not sinking these loads has a slight positive performance effect.
Maybe for some of these loads, having some separation allows enough time
for the load to be executed in time for its user. This is not the case for
floating point operations that benefit more from sinking.
The solution in D133777 was essentially undone in this patch,
since this patch is a complete solution to the observed problem.
Overall, the performance impact of this patch is minimal.
Tested on two internal Google workloads with instrPGO.
Search application showed <0.05% perf difference,
while the database one showed a slight improvement,
but not statistically significant.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D133999
This is a partial port of the code used by the SelectionDAGBuilder to
translate selects.
In particular, see matchSelectPattern in ValueTracking.cpp. This is a
GISel-equivalent of the portion which handles fminnum/fmaxnum/fminimum/fmaximum.
I tried to set it up so it'd be easy to add the non-FP cases. Those are simpler.
On the AArch64 end, it seems like the FP cases are more important for perf
right now, so I bit the bullet and went at the more complicated problem. :)
I elected to do this as a post-legalize combine rather than in the
IRTranslator because:
- Deciding which fmax/fmin to use can depend on legalization rules
- Philosophically speaking (TM), putting it in a combine just feels cleaner
- Being able to enable/disable the combine is handy
Another option would be to use the ValueTracking code in the IRTranslator and
match what SelectionDAGBuilder::visitSelect does. I think that may be somewhat
annoying since we'd need to write lowerings back into the selects in the
legalizer. I'm not strongly opposed to the approach.
We'd also want to be careful with vector selects once that's implemented,
and explicitly check whether a vector select is legal on the target. That'd
probably need a hook.
From what I can tell, doing this as a combine is probably a cleaner option
long-term.
Differential Revision: https://reviews.llvm.org/D116702
Currently, it can become extremely costly to compute MayIncreasePressure if the
size of CSUses turns out to be very large. In that case, it's no longer cost
effective to keep computing MayIncreasePressure. Therefore, to limit the amount
of time spent in isProfitableToCSE, we simply conservatively assume
MayIncreasePressure if the size of CSUses is too large. This can reduce overall
compile time by 30% for some benchmarks.
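A minimal sketch of the cutoff pattern described above, with assumed names and an assumed threshold value (this is not the actual MachineCSE code):
```
#include <vector>

// Hypothetical helper illustrating the conservative cutoff: once CSUses grows
// past a threshold, skip the expensive scan and assume pressure may increase.
static bool mayIncreasePressure(const std::vector<unsigned> &CSUses) {
  constexpr size_t CSUsesThreshold = 1024;   // assumed cutoff, not the real value
  if (CSUses.size() > CSUsesThreshold)
    return true;                             // conservative answer, cheap to compute
  bool MayIncrease = false;
  for (unsigned Use : CSUses) {
    // ... the full per-use analysis would run here ...
    (void)Use;
  }
  return MayIncrease;
}

int main() {
  std::vector<unsigned> Small(10), Huge(5000);
  return (!mayIncreasePressure(Small) && mayIncreasePressure(Huge)) ? 0 : 1;
}
```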
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D134003
I want to default all VP operations to Expand. These 2 were blocking
because VE doesn't support them and the tests were expecting them
to fail a specific way. Using Expand caused them to fail differently.
Seemed better to emulate them using operations that are supported.
@simoll mentioned on Discord that VE has some expansion downstream. Not
sure if it's done like this or in the VE target.
Reviewed By: frasercrmck, efocht
Differential Revision: https://reviews.llvm.org/D133514
used, and MUL_LOHI is available
Folding a sra(mul) / srl(mul) into a mulh introduces an extra
multiplication to compute the high half of the multiplication,
while it is more profitable to compute the high and low halves with a
single mul_lohi.
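For illustration only (plain C, not the DAG nodes themselves): when both halves of the product are needed, a single widening multiply already produces them, which is what mul_lohi models; going through a separate mulh would add a second multiplication.
```
#include <cassert>
#include <cstdint>

// One widening multiply yields both halves, which is what mul_lohi models.
static void mulLoHi(uint32_t A, uint32_t B, uint32_t &Lo, uint32_t &Hi) {
  uint64_t P = (uint64_t)A * B;   // single widening multiply
  Lo = (uint32_t)P;               // low half
  Hi = (uint32_t)(P >> 32);       // high half; no extra mulh needed
}

int main() {
  uint32_t Lo, Hi;
  mulLoHi(0xffffffffu, 2, Lo, Hi);
  assert(Lo == 0xfffffffeu && Hi == 1);
}
```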
Differential Revision: https://reviews.llvm.org/D133768
The IR stack protector pass should insert stack checks before all tail
calls, not only the musttail calls, so that the attributes `sspreq` and
`tail call`, which are emitted by llvm-opt, can both be honored by
llvm-llc.
Reviewed By: compnerd
Differential Revision: https://reviews.llvm.org/D133860
On AArch64, doing the vector truncate separately after the fptoui
conversion can be lowered more efficiently using tbl.4, building on
D133495.
https://alive2.llvm.org/ce/z/T538CC
Depends on D133495
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133496
Similar to using tbl to lower vector ZExts, tbl4 can be used to lower
vector truncates.
The initial version supports i32->i8 conversions.
Depends on D120571
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D133495
Callee-saved registers must be preserved, so -fzero-call-used-regs
should not be zeroing them. The previous implementation only avoided
zeroing callee-saved registers that were saved and restored inside the
function, but we need to preserve all of them.
Fixes https://github.com/llvm/llvm-project/issues/57692.
Differential Revision: https://reviews.llvm.org/D133946
The method of counting resource consumption is modified to be based on
"Cycles" value when DFA is not used.
The calculation of ResMII is modified to total "Cycles" and divide it
by the number of units for each resource. Previously, ResMII was
excessive because it was assumed that resources were consumed for
the cycles of "Latency" value.
The method of resource reservation is modified similarly. When a
"Cycles" value is larger than 1, the resource is considered to be
consumed by 1 unit for that many cycles starting from the scheduled cycle.
To realize this, ResourceManager maintains a resource table for all
slots. Previously, resource consumption was always 1 for 1 cycle
regardless of the value of "Cycles" or "Latency".
In addition, the number of micro operations per cycle is modified to
be constrained by "IssueWidth". To disable the constraint,
--pipeliner-force-issue-width=100 can be used.
For the case of using DFA, the scheduling results are unchanged.
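As a rough, self-contained illustration of the new ResMII calculation (the structure and names below are assumptions for the sketch, not the actual pipeliner code): total the "Cycles" consumed on each resource, divide by that resource's unit count, and take the maximum.
```
#include <algorithm>
#include <cstddef>
#include <vector>

struct ResourceUse { unsigned Resource; unsigned Cycles; };

// NumUnits[I] is assumed non-zero. Sum the Cycles used on each resource
// across the loop body, divide by the unit count (rounding up), take the max.
static unsigned computeResMII(const std::vector<ResourceUse> &Uses,
                              const std::vector<unsigned> &NumUnits) {
  std::vector<unsigned> Total(NumUnits.size(), 0);
  for (const ResourceUse &U : Uses)
    Total[U.Resource] += U.Cycles;
  unsigned ResMII = 1;
  for (size_t I = 0; I < NumUnits.size(); ++I)
    ResMII = std::max(ResMII, (Total[I] + NumUnits[I] - 1) / NumUnits[I]);
  return ResMII;
}

int main() {
  // Two resources: 1 unit of resource 0, 2 units of resource 1.
  std::vector<unsigned> NumUnits{1, 2};
  std::vector<ResourceUse> Uses{{0, 2}, {1, 3}, {1, 1}};
  return computeResMII(Uses, NumUnits) == 2 ? 0 : 1;
}
```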
Reviewed By: dpenry
Differential Revision: https://reviews.llvm.org/D133572
This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D133593
This patch extends CodeGenPrepare to lower zext v16i8 -> v16i32 in loops
using a wide shuffle creating a v64i8 vector, selecting groups of 3
zero elements and an element from the input.
This is profitable on AArch64 where such shuffles can be lowered to tbl
instructions, but only in loops, because it requires materializing 4
masks, which can be done in the loop preheader.
This is the only reason the transform is part of CGP. If there's a
better alternative I missed, please let me know. The same goes for the
shouldReplaceZExtWithShuffle hook which guards this. I am not sure if
this transform will be beneficial on other targets, but it seems like
there is no other convenient way.
This improves the generated code for loops like the one below in
combination with D96522.
```
int foo(uint8_t *p, int N) {
  unsigned long long sum = 0;
  for (int i = 0; i < N; i++, p++) {
    unsigned int v = *p;
    sum += (v < 127) ? v : 256 - v;
  }
  return sum;
}
```
https://clang.godbolt.org/z/Wco866MjY
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D120571
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
The class priority is expected to be at most 5 bits before it starts
clobbering bits used for other fields. Also clamp the instruction
distance in case we have millions of instructions.
AMDGPU was accidentally overflowing into the global priority bit in
some cases. I think in principle we would have wanted this, but in the
cases I've looked at, it had the counterintuitive effect of
de-prioritizing the large register tuple.
Avoid using weird bit hack PPC uses for global priority. The
AllocationPriority field is really 5 bits, and PPC was relying on
overflowing this to 6-bits to forcibly set the global priority
bit. Split this out as a separate flag to avoid having magic behavior
for values above 31.
MachineMemOperand::print can dereference a NULL pointer if TII
is not passed from printMemOperand. This does not happen while
dumping the DAG/MIR from llc but crashes the debugger if a dump()
method is called from gdb.
Differential Revision: https://reviews.llvm.org/D133903
Get some load-store forwarding cases for big-endian where a larger store covers
a smaller load, and the offset would be 0 and handled on little-endian but on
big-endian the offset is adjusted to be non-zero. The idea is just to shift the
data to make it look like the offset 0 case.
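A small illustration of the idea (not the actual DAGCombiner code): for a 64-bit big-endian store forwarding to a 32-bit load at the same address, the wanted bytes sit in the high half of the stored value, so shifting the data down reproduces the little-endian offset-0 case.
```
#include <cassert>
#include <cstdint>

// Big-endian layout puts the most significant bytes at the lowest address, so
// a 32-bit load at offset 0 reads the high half of the 64-bit stored value.
static uint32_t forwardBigEndian(uint64_t StoredVal) {
  return (uint32_t)(StoredVal >> 32);   // shift the data, then truncate
}

int main() {
  // Storing 0x0102030405060708 big-endian puts bytes 01 02 03 04 at the low
  // addresses, so the 32-bit load from the store's address reads 0x01020304.
  assert(forwardBigEndian(0x0102030405060708ULL) == 0x01020304u);
}
```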
Differential Revision: https://reviews.llvm.org/D130115
https://reviews.llvm.org/D133637 fixes the problem where we should hash raw content of
register mask instead of the pointer to it.
Fix the same issue in `llvm::hash_value()`.
Remove the added API `MachineOperand::getRegMaskSize()` to avoid potential confusion.
Add an assert to emphasize that we probably should hash a machine operand iff it has
an associated machine function, but keep the fallback logic in the original change.
Reviewed By: MatzeB
Differential Revision: https://reviews.llvm.org/D133747
There's no real read of the register, so the copy introduced a new
live value. Make sure we introduce a replacement implicit_def instead
of just erasing the copy.
Found from llvm-reduce since it tries to set undef on everything.
When a select is converted to a branch and load instructions are sunk to the true/false blocks,
lifetime intrinsics (if present) could be made unsound if not moved.
This conservatively moves all lifetime intrinsics in a transformed BB to the end block to ensure
preserved lifetime semantics.
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D133777
This is the ultimate fallback code if UADDO isn't supported.
If the target uses 0/1 we used one compare, but if the target doesn't
use 0/1 we emitted two compares. Regardless of boolean constants we
should only need to check that the Result is less than one of the
original operands. So we only need one compare.
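Written out in plain C, the single-compare form looks like this (an illustrative sketch, not the DAG expansion itself):
```
#include <cassert>
#include <cstdint>

// For unsigned addition, overflow happened exactly when the wrapped result is
// smaller than either original operand, so one compare is enough.
static bool uaddOverflows(uint32_t A, uint32_t B, uint32_t &Result) {
  Result = A + B;        // wraps modulo 2^32
  return Result < A;     // "Result < B" would work equally well
}

int main() {
  uint32_t R;
  assert(!uaddOverflows(1, 2, R) && R == 3);
  assert(uaddOverflows(0xffffffffu, 1, R) && R == 0);
}
```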
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D133708
Some uses of std::make_pair and the std::pair's first/second members
in the ScheduleDAGInstrs.[cpp|h] files were replaced with using of the
vector's emplace_back along with structure bindings from C++17.
This attempts to stop the type promotion pass from transforming where it is
not profitable, by not marking PhiNodes as ToPromote and being more
aggressive about pulling extends out of loops.
Differential Revision: https://reviews.llvm.org/D133203
This patch refactors SlotIndex::getInstrDistance to
SlotIndex::getApproxInstrDistance to better describe the actual
functionality of this function. This patch also adds in some additional
comments better documenting the assumptions that this function makes to
increase clarity.
Based on discussion on the LLVM Discourse:
https://discourse.llvm.org/t/odd-behavior-in-slotindex-getinstrdistance/64934/5
Reviewed By: mtrofin, foad
Differential Revision: https://reviews.llvm.org/D133386
The bit masking lowering only works for vectors of scalars, so for pointer
element types we need to add some casting.
Differential Revision: https://reviews.llvm.org/D133672
__declspec(safebuffers) is equivalent to
__attribute__((no_stack_protector)). This information is recorded in
CodeView.
While we are here, add support for strict_gs_check.
I'm planning to deprecate and eventually remove llvm::empty.
I thought about replacing llvm::empty(x) with std::empty(x), but it
turns out that all uses can be converted to x.empty(). That is, no
use requires the ability of std::empty to accept C arrays and
std::initializer_list.
Differential Revision: https://reviews.llvm.org/D133677
MachineOperand::getRegMask() returns a pointer to register mask. We should hash the raw content of register mask instead of its pointer.
Reviewed By: kyulee
Differential Revision: https://reviews.llvm.org/D133637
For remainder:
If (1 << (BitWidth / 2)) % Divisor == 1, we can add the high and low halves
together and use a (BitWidth / 2) urem. If (BitWidth / 2) is a legal integer
type, this urem will be expanded by DAGCombiner using a multiply by magic
constant. We do have to take into account that adding the high and low halves
together can produce a carry, making it a (BitWidth / 2)+1 bit number.
So we need to also add back in the carry from the first addition.
For division:
We can use the above trick to compute the remainder, subtract that
remainder from the dividend, then multiply by the multiplicative
inverse of the Divisor modulo (1 << BitWidth).
This is based on the section "Remainder by Summing Digits" in
Hacker's Delight.
The remainder trick is similar to a trick you may have learned for
determining if a decimal number is divisible by 3. You can add all the
digits together and see if the sum is divisible by 3. If you're not sure
if the sum is divisible by 3, you can add its digits together. This
can be repeated until you have a single decimal digit. If that digit
is 3, 6, or 9, then the original number is divisible by 3. This works
because 10 % 3 == 1.
gcc already does this same trick. There are additional tricks gcc
does for urem as well as srem, udiv, and sdiv that I plan to add in
future patches.
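As a concrete, self-contained illustration of the remainder trick, here is a 64-bit urem by 3 computed with 32-bit pieces (a demonstration of the math, not the DAG expansion code):
```
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Works because (1 << 32) % 3 == 1: X % 3 == (Hi + Lo) % 3, with the carry
// from Hi + Lo added back in, exactly as described above.
static uint32_t urem3ByHalves(uint64_t X) {
  uint64_t Lo = X & 0xffffffffULL;
  uint64_t Hi = X >> 32;
  uint64_t Sum = Lo + Hi;                   // may be a 33-bit number
  uint32_t Carry = (uint32_t)(Sum >> 32);   // 0 or 1
  uint32_t Digits = (uint32_t)Sum + Carry;  // cannot wrap for these inputs
  return Digits % 3;                        // 32-bit urem, expandable via magic
}

int main() {
  for (uint64_t X : {0ULL, 1ULL, 2ULL, 3ULL, 0xffffffffULL, 0x100000000ULL,
                     0xffffffffffffffffULL})
    assert(urem3ByHalves(X) == X % 3);
}
```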
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130862
Also remove the new-pass-manager version of ExpandLargeDivRem because there is no way
yet to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
Simplify extended add/sub (with carry-in and carry-out) to add/sub with
carry (with carry-out only) if carry-in is known to be zero.
Differential Revision: https://reviews.llvm.org/D133702
This was previously only tried in order to relax register class constraints, but
it can also help if there are subranges involved.
This solves a compilation failure for AMDGPU when there is high
pressure created by large register tuples. If one virtual register is
using most of the available budget, we need to be able to evict
subranges.
This solves the immediate failure, but this solution leaves a lot to
be desired. In the relevant testcases, we have 32-element tuples but
most of the uses are operations on 1 element subranges of it. What
we're now getting is a spill and restore of the full 1024 bits and an
extract of the used 32-bits. It would be far better if we introduced a
copy to a new virtual register with a smaller register class and used
narrower spills.
Furthermore, we could probably do a better job if the allocator were
to introduce new subranges where none previously existed in the
highest pressure scenarios. The block and region splits should also
try to split specific subranges out.
The mve-vst3.ll test changes look like noise to me, but instruction
count increased by one. mve-vst4.ll looks like a solid improvement
with several 16-byte spills eliminated. splitkit-copy-live-lanes.mir
also shows a solid reduction in total spill count.
This could use more tests but it's pretty tiring to come up with cases
that fail on this.
Some uses of std::make_pair and the std::pair's first/second members
in the ScheduleDAGRRList.cpp file were replaced with using of the
vector's emplace_back along with structure bindings from C++17.
Fixes #50098
LLVM uses X19 as the frame base pointer, if it needs to. This means you
can get warnings if you clobber that with inline asm.
However, it doesn't explain why. The frame base register is not part
of the ABI so it's pretty confusing why you get that warning out of the blue.
This adds a method to explain a reserved register with X19 as the first one.
The logic is the same as getReservedRegs.
I could have added a return parameter to isASMClobberable and friends
but found that there are a lot of things that call isReservedReg in various
ways.
So while one more method on the pile isn't great design, it is simpler
right now to do it this way and only pay the cost if you are actually using
a reserved register.
Reviewed By: lenary
Differential Revision: https://reviews.llvm.org/D133213
Replacing the following instances of UndefValue with PoisonValue, where the UndefValue is used as an arbitrary value:
- llvm/lib/CodeGen/WinEHPrepare.cpp
`demotePHIsOnFunclets`: RAUW arbitrary value for lingering uses of removed PHI nodes
- llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
`FoldSingleEntryPHINodes`: Removes a self-referential single entry phi node.
- llvm/lib/Transforms/Utils/CallGraphUpdater.cpp
`finalize`: Remove all references to removed functions.
- llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
`cleanup`: when the result is not used, the inserted instructions are removed.
- llvm/tools/bugpoint/CrashDebugger.cpp
`TestInts`: the program is cloned and instructions are removed to narrow down source of crash.
Differential Revision: https://reviews.llvm.org/D133640
getNegatedExpression can delete nodes. If the first call to
getNegatedExpression produced a node that the second call also
manages to create, it might get deleted. Use a HandleSDNode to
ensure it has a use to prevent it from being deleted.
Fixes PR57658.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D133602
The LLVM performance tips suggest that allocas should be placed at the
beginning of the entry block. So far, llvm doesn’t provide any helper to
find that position.
Add BasicBlock::getFirstNonPHIOrDbgOrAlloca and IRBuilder::SetInsertPointPastAllocas(Function*)
that get an insert position after the (static) allocas at the start of a
function and use it in ShadowStackGCLowering.
Differential Revision: https://reviews.llvm.org/D132554
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.
Change call sites to use `std::size` instead.
Differential Revision: https://reviews.llvm.org/D133429
This patch introduces the priority analysis and the priority advisor,
the default implementation, and the scaffolding for introducing the
other implementations of the advisor.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D132835
Interpret MD_pcsections in AsmPrinter emitting the requested metadata to
the associated sections. Functions and normal instructions are handled.
Differential Revision: https://reviews.llvm.org/D130879
Propagate (most) PC sections metadata to MachineInstr when GlobalISel is
doing instruction selection.
This change results in support for architectures using GlobalISel (such
as AArch64 at -O0). Not all instructions may be supported yet, and some
require further target-specific handling (such as done for AArch64
pseudo-atomics). Expanding the set of supported instructions is planned
on a case-by-case basis as new use cases for PC sections metadata arise.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130886
When expanding IR atomics to target-specific atomics, copy all
!pcsections Metadata to expanded atomics automatically.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130885
Propagate PC sections metadata to MachineInstr when FastISel is doing
instruction selection.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130884
Add a new entry to SDNodeExtraInfo to propagate PCSections through
SelectionDAG.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130882
Details:
Currently CodeGenPrepare is very time consuming when handling big functions.
Old algorithm:
It iterates over each BB in the function and handles every instruction in the BB.
Because some instruction optimizations may affect the BBs' dominator tree,
the old logic re-iterates and tries to optimize each BB again.
Suppose we have a big function with 20000 BBs: if we handled the last BB
with fine-tuning of the dominator tree, we need to completely re-iterate and
try to optimize the 20000 BBs from the beginning.
The complexity is near N!
And we really encountered some big tests (> 20000 BBs) that cost more than 30
minutes in this pass. (A debug-build compiler will cost 2 hours here.)
What does this patch do for huge functions?
It mainly changes the iteration scheme for optimization.
1. We do optimizeBlock for each BB (the same as the old way).
   In the meantime, if a BB is changed/updated during the optimization, it is
   put into FreshBBs (to try optimizeBlock on it again).
   BBs newly created in the previous iteration are also put into FreshBBs.
2. BBs that were not updated in the previous iteration are simply skipped.
   Strictly speaking, this may miss some opportunities, but the probability is
   very small.
3. For instructions in a single BB, we do optimizeInst for each instruction.
   If optimizeInst changes the instruction dominator in this BB, rather than
   breaking and going back to optimize the first BB (the old way), we directly
   iterate over the instructions (doing optimizeInst) in this updated BB again
   (the new way).
What does this patch do for small/normal (not huge) functions?
It is the same as the old algorithm. (NFC)
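A very rough, self-contained sketch of the new iteration scheme; the types and helpers below are stand-ins, not the real CodeGenPrepare code:
```
#include <set>
#include <vector>

struct BasicBlock { int Id; };  // stand-in for llvm::BasicBlock

// Stand-in for the real per-block optimization: it would run optimizeInst over
// the block and record changed/new blocks into FreshBBs. Here it reports "no
// change" so the sketch terminates.
static bool optimizeBlock(BasicBlock &BB, std::set<BasicBlock *> &FreshBBs) {
  (void)BB;
  (void)FreshBBs;
  return false;
}

static void runCGP(std::vector<BasicBlock> &Blocks) {
  std::set<BasicBlock *> FreshBBs;
  bool FirstPass = true;
  bool Changed = true;
  while (Changed) {
    Changed = false;
    std::set<BasicBlock *> NextFresh;
    for (BasicBlock &BB : Blocks) {
      if (!FirstPass && !FreshBBs.count(&BB))
        continue;                               // skip blocks untouched last round
      Changed |= optimizeBlock(BB, NextFresh);  // may add BB / new blocks to NextFresh
    }
    FreshBBs.swap(NextFresh);
    FirstPass = false;
  }
}

int main() {
  std::vector<BasicBlock> Blocks{{0}, {1}, {2}};
  runCGP(Blocks);
}
```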
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D129352
This patch is essentially an alternative to https://reviews.llvm.org/D75836 and was mentioned by @lhames in a comment.
The gist of the issue is that Mach-O has restrictions on which kind of sections are allowed after debug info has been emitted, which is also properly asserted within LLVM. Problem is that stack maps are currently emitted as one of the last sections in each target-specific AsmPrinter so far, which would cause the assertion to trigger. The current approach of special casing for the `__LLVM_STACKMAPS` section is not viable either, as downstream users can overwrite the stackmap format using plugins, which may want to use different sections.
This patch fixes the issue by emitting the stack map earlier, right before debug info is emitted. The way this is implemented is by taking the choice of when to emit the StackMap away from the target AsmPrinter and doing so in the base class. The only disadvantage of this approach is that the `StackMaps` member is now part of the base class, even for targets that do not support them. This is functionally not a problem however, as emitting an empty `StackMaps` is a no-op.
Differential Revision: https://reviews.llvm.org/D132708
During SelectionDAG legalization SDNodes with associated extra info may
be replaced with a new SDNode. Preserve associated extra info on
ReplaceAllUsesWith and remove entries in DeallocateNode.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130881
For information infrequently attached to SDNodes, it is useful to
provide a way to add this information out-of-line. This is already done
for call-site specific information.
Rename CallSiteDbgInfo to NodeExtraInfo in preparation of adding
additional information not necessarily related to call sites only.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130880
This adds the ExpandLargeDivRem to the default pass pipeline.
The limit at which it expands div/rem instructions is configured
via a new TargetTransformInfo hook (default: no expansion).
X86, Arm and AArch64 backends implement this hook to expand div/rem
instructions with more than 128 bits.
Differential Revision: https://reviews.llvm.org/D130076
Provide MachineInstr::setPCSection(), to propagate relevant metadata
through the backend. Use ExtraInfo to store the metadata.
Reviewed By: vitalybuka
Differential Revision: https://reviews.llvm.org/D130876
The main difference is that this preserves intermediate rounding steps,
which the other route doesn't. This aligns bfloat16 more with half
floats, which use this path on most targets.
I didn't understand what the difference was between these softening
approaches when I first added bfloat lowerings; it would be nice if we only
had one of them.
Based on @pengfei 's D131502
Differential Revision: https://reviews.llvm.org/D133207
The issue occurred when two subregs of the same reg are used in the same
instruction (e.g. inline asm): one "def early-clobber" and the other just "def".
The register coalescer ran into bad recursion if the early-clobbered subreg is second
in the following sequence of COPYs.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D127136
MachineVerifier tried to checkLivenessAtDef() ignoring that it is actually a subreg.
The issue occurred when two subregs of the same reg are used in the same
instruction (e.g. inline asm): one "def early-clobber" and the other just "def".
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D126661
For some indices we can simply extract the fixed-length subvector from the
low half of the scalable vector, for example when the index is less than the
minimum number of elements in the low half. For all other cases we can
expand the operation through the stack by storing out the vector and
reloading the fixed-length part we need.
Fixes https://github.com/llvm/llvm-project/issues/55412
Tests added here:
CodeGen/AArch64/sve-extract-fixed-from-scalable-vector.ll
Differential Revision: https://reviews.llvm.org/D117499
Similar to #57402 - we were calling isGuaranteedNotToBeUndefOrPoison on the freeze operand (with Depth = 0), but weren't accounting for the fact that a later isGuaranteedNotToBeUndefOrPoison assertion will call from the new node (with Depth = 0 as well) - which will then recursively call isGuaranteedNotToBeUndefOrPoison for its operands with Depth = 1
Fixes #57554
The way ComputeNumSignBits was being used was only correct if
OuterBitSize is exactly 2x InnerBitSize. Which is always true,
but not obviously so. Comparing ComputeMaxSignificantBits to
InnerBitSize feels more correct.
CGP uses a raw `getInstructionCost(I, TargetTransformInfo::TCK_SizeAndLatency) >= TCC_Expensive` check to see if it's better to move an expensive instruction used in a select behind a branch instead.
This is causing issues with upcoming improvements to TCK_SizeAndLatency costs on X86 as we need to use TCK_SizeAndLatency as an uop count (so its compatible with various target-specific buffer sizes - see D132288), but we can have instructions that have a low TCK_SizeAndLatency value but should still be treated as 'expensive' (FDIV for example) - by adding a isExpensiveToSpeculativelyExecute wrapper we can keep the current behaviour but still add an x86 override in a future patch when the cost tables are updated to compensate.
[MachineFunctionPass] Support -filter-passes for -print-changed
-filter-passes specifies a `PassID` (a lower-case dashed-separated pass name,
also used by -print-after, -stop-after, etc) instead of a CamelCasePass.
`-filter-passes=CamelCaseNewPMPass` seems like a workaround for new PM passes before
we can use lower-case dashed-separated pass names (as used by `-passes=`).
Example:
```
# getPassName() is "IRTranslator". PassID is "irtranslator"
llc -mtriple=aarch64 -print-changed -filter-passes=irtranslator < print-changed-machine.ll
```
Close https://github.com/llvm/llvm-project/issues/57453
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D133055
DwarfEhPrepare inserts calls to _Unwind_Resume into landing pads.
If _Unwind_Resume happens to be defined in the same module and
debug info is used, then this leads to a verifier error:
inlinable function call in a function with debug info must
have a !dbg location
call void @_Unwind_Resume(ptr %exn.obj) #0
Fix this by assigning a dummy location to the call. (As this
happens in the backend, inlining is not actually relevant here.)
Fixes https://github.com/llvm/llvm-project/issues/57469.
Differential Revision: https://reviews.llvm.org/D133095
Some targets like RISC-V require operands to be inspected to
determine if an instruction is similar to a move.
Spotted while investigating code differences between using an ADDI
vs an ADDIW. RISC-V has the isAsCheapAsAMove flag for ADDI, but
the TII hook checks the immediate is 0 or the register is X0. ADDIW
is never generated with X0 or with an immediate of 0 so it doesn't
have the isAsCheapAsAMove flag.
I don't know enough about the PRE code to write a test for this yet.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D132981
Warn if `.size` is specified for a function symbol. The size of a
function symbol is determined solely by its content.
I noticed this simplification was possible while debugging #57427, but
this change doesn't fix that specific issue.
Differential Revision: https://reviews.llvm.org/D132929
We feed the result from the first extractShiftForRotate call into the second, and that result might no longer be a shift op (usually due to constant folding).
NOTE: We REALLY need to stop creating nodes on the fly inside extractShiftForRotate!
Fixes Issue #57474
We were calling isGuaranteedNotToBeUndefOrPoison on operands (with Depth = 0), but weren't accounting for the fact that a later isGuaranteedNotToBeUndefOrPoison assertion will call from the new node (with Depth = 0 as well) - which will then recursively call isGuaranteedNotToBeUndefOrPoison for its operands with Depth = 1
Fixes #57402
IIUC, the conversion part is not part of atomic operations and fences should be put around converted atomic operations.
This also fixes atomic load of floating point values which requires fence on PowerPC.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D127609
The provided testcase would previously fail with an assertion because code further down tried to allocate registers for `token` return types and arguments. This is especially problematic as the process would then exit instead of falling back to using FastISel.
This patch fixes that by simply explicitly failing translation if either of these intrinsics are encountered.
Fixes https://github.com/llvm/llvm-project/issues/57349
Differential Revision: https://reviews.llvm.org/D132974
widenScalarDst updates the insert point to after MI, so
widenScalarSrc must be called before widenScalarDst. Otherwise
the updated Src values will appear after MI and break SSA, e.g.:
%14:_(s64), %15:_(s1) = G_UADDE %9:_, %11:_, %13:_
becomes
%14:_(s64), %16:_(s32) = G_UADDE %9:_, %11:_, %17:_
%15:_(s1) = G_TRUNC %16:_(s32)
%17:_(s32) = G_ZEXT %13:_(s1)
Differential Revision: https://reviews.llvm.org/D132547
Change-Id: Ie3458747a6879433f4d5ab9939d2bd102dd0f2db
Mostly just modeled after vp.fneg except there is a
"functional instruction" for fneg while fabs is always an
intrinsic.
Reviewed By: fakepaper56
Differential Revision: https://reviews.llvm.org/D132793
This patch replaces calls to greatestCommonDivisor with std::gcd where
two arguments are of the same type. This means that
std::common_type_t of the argument type is the same as the argument
type.
We could drop calls to std::abs in some cases, but that's left for
another patch.
This patch replaces calls to greatestCommonDivisor with std::gcd where
both arguments are known to be unsigned. This means that
std::common_type_t of the two argument types should just be the wider
one of the two.
This patch replaces getLCMSize with std::lcm, a C++17 feature.
Note that all the arguments are of unsigned with no implicit type
conversion as they are passed to getLCMSize.
Adds a pass ExpandLargeDivRem to expand div/rem instructions
with more than 128 bits into a loop computing that value.
As discussed on https://reviews.llvm.org/D120327, this approach has the advantage
that it is independent of the runtime library. This also helps the clang driver,
which otherwise would need to understand enough about the runtime library
to know whether to allow _BitInts with more than 128 bits.
Targets are still free to disable this pass and instead provide a faster
implementation in a runtime library.
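For illustration, a minimal shift-and-subtract division loop of the kind such an expansion can use, shown on 64-bit values for brevity (the pass targets integers wider than 128 bits, and the emitted loop may differ in detail):
```
#include <cassert>
#include <cstdint>

// Bit-by-bit restoring division: bring down one numerator bit at a time and
// subtract the divisor whenever the running remainder allows it.
static uint64_t expandUDiv(uint64_t Num, uint64_t Den, uint64_t &Rem) {
  uint64_t Quot = 0;
  Rem = 0;
  for (int Bit = 63; Bit >= 0; --Bit) {
    Rem = (Rem << 1) | ((Num >> Bit) & 1);
    if (Rem >= Den) {
      Rem -= Den;
      Quot |= (uint64_t)1 << Bit;
    }
  }
  return Quot;
}

int main() {
  uint64_t R;
  assert(expandUDiv(1000, 7, R) == 142 && R == 6);
}
```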
Fixes https://github.com/llvm/llvm-project/issues/44994
Differential Revision: https://reviews.llvm.org/D126644
This patch follows the InstCombine approach of stripping poison generating flags (nsw/nuw from add/sub etc.) to allow us to push a freeze() through the op. Unlike InstCombine it doesn't retain any flags, but we have plenty of DAG folds that do the same thing already. We assert that the newly generated op isGuaranteedNotToBeUndefOrPoison.
Similar to the ValueTracking approach, isGuaranteedNotToBeUndefOrPoison has been updated to confirm that if an op can't create undef/poison and its operands are guaranteed not to be undef/poison - then its not undef/poison. This is just for the generic opcodes - target specific opcodes will need to do this manually just in case they have some special cases.
Differential Revision: https://reviews.llvm.org/D132333
While this does not matter for most targets, when building for Arm Morello,
we have to mark the symbol as a function and add size information, so that
LLD can correctly evaluate relocations against the local symbol.
Since Morello is an out-of-tree target, I tried to reproduce this with
in-tree backends and with the previous reviews applied this results in
a noticeable difference when targeting Thumb.
Background: Morello uses a method similar to Thumb where the encoding mode is
specified in the LSB of the symbol. If we don't mark the target as a
function, the relocation will not have the LSB set and calls will end up
using the wrong encoding mode (which will almost certainly crash).
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131429
D132080 introduced a bug leading to `RegisterClassInfo` caches not
getting invalidated when there was exactly one more CSR register added.
Differential Revision: https://reviews.llvm.org/D132606
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.
Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.
KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.
A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:
```
.c:
int f(void);
int (*p)(void) = f;
p();
.s:
.4byte __kcfi_typeid_f
.global f
f:
...
```
Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.
As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.
Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.
Relands 67504c9549 with a fix for
32-bit builds.
Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay
Differential Revision: https://reviews.llvm.org/D119296
The diff modifies ext-tsp code layout algorithm in the following ways:
(i) fixes merging of cold block chains (this is a port of D129397);
(ii) adjusts the cost model utilized for optimization;
(iii) adjusts some APIs so that the implementation can be used in BOLT; this is
a prerequisite for D129895.
The only non-trivial change is (ii). Here we introduce different weights for
conditional and unconditional branches in the cost model. Based on the new model
it is slightly more important to increase the number of "fall-through
unconditional" jumps, which makes sense, as placing two blocks with an
unconditional jump next to each other reduces the number of jump instructions in
the generated code. Experimentally, this makes a mild impact on the performance;
I've seen up to 0.2%-0.3% perf win on some benchmarks.
Reviewed By: hoy
Differential Revision: https://reviews.llvm.org/D129893
This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.
For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions by masking the operand above the msb (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the src operand is known never zero (i.e. due to the promotion masking) we can remove the CMOV zero src handling.
Although BSF isn't very fast, most CPUs from the last 20 years don't do that bad a job with it, although there are some annoying passthrough EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards assuming this will most likely be executed as a TZCNT instruction on any semi-modern CPU.
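A small model of the masking idea for an i16 cttz, using the GCC/Clang builtin __builtin_ctz as a stand-in for the promoted 32-bit BSF/TZCNT (illustrative only, not the lowering code):
```
#include <cassert>
#include <cstdint>

// OR in a bit just above the source width: the input is now never zero, so no
// zero-input CMOV is needed, and cttz(0) correctly yields 16 for an i16.
static unsigned cttz16(uint16_t X) {
  return (unsigned)__builtin_ctz((uint32_t)X | 0x10000u);
}

int main() {
  assert(cttz16(0) == 16);
  assert(cttz16(1) == 0);
  assert(cttz16(0x8000) == 15);
  assert(cttz16(12) == 2);
}
```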
Differential Revision: https://reviews.llvm.org/D132520
Based off Issue #57283 - we need to try harder to ensure we're not creating nodes on-the-fly - so make sure we're just using SelectionDAG for analysis where possible
extractShiftForRotate may fail to return canonicalized shifts due to constant folding or other simplification that can occur in getNode()
Fixes Issue #57283
(ctpop x) == 1 --> (x != 0) && ((x & x-1) == 0)
Adjust the legality check to avoid the poor codegen on AArch64.
We probably only want to use popcount on this pattern when it
is a single instruction.
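The rewritten form, written out in C for clarity:
```
#include <cassert>
#include <cstdint>

// "Exactly one bit set" as a non-zero test plus a clear-lowest-set-bit test,
// instead of a popcount.
static bool hasSingleBit(uint32_t X) {
  return X != 0 && (X & (X - 1)) == 0;
}

int main() {
  assert(!hasSingleBit(0));
  assert(hasSingleBit(1) && hasSingleBit(0x80000000u));
  assert(!hasSingleBit(3));
}
```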
Fixes #57225
Differential Revision: https://reviews.llvm.org/D132237
This patch builds on prior support patches to enable support for
variadic debug values in InstrRefLDV, allowing DBG_VALUE_LISTs to
have their ranges extended.
Differential Revision: https://reviews.llvm.org/D128212
musttail should be honored even in the presence of attributes like "disable-tail-calls". SelectionDAG properly handles this.
Update LangRef to explicitly mention that this is the semantics of musttail.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D132193
This patch adds the last of the changes required to enable
DBG_VALUE_LIST handling in InstrRefLDV, handling variadic debug values
during the transfer tracking step. Most of the changes are fairly
straightforward, and based around tracking multiple locations per
variable in TransferTracker::VLocTracker.
Differential Revision: https://reviews.llvm.org/D128211
In preparation for supporting DBG_VALUE_LIST in InstrRefLDV, this patch
adds the logic for emitting DBG_VALUE_LIST instructions from
InstrRefLDV. The logical changes here are fairly simple, with the main
change being that instead of directly prepending offsets to the DIExpr,
we use appendOpsToArg to modify the expression for individual debug
operands in the expression. The function emitLoc is also changed to take
a list of debug ops, with an empty list meaning an undef value.
Differential Revision: https://reviews.llvm.org/D128209
CodeGenPrepare pass can sink pointer comparison across statepoint
to the point of use (see comment in IR/SafepointIRVerifier.cpp)
Due to specifics of statepoints, it is still legal to have tied
def and use rewritten to the same register in TwoAddress pass.
However, properly updating LiveIntervals and LiveVariables becomes
complicated. For simplicity, let's fall back to generic handling of
tied registers when we detect such case.
TODO: This fixes a functional (assertion) failure. Ideally we should
try to recompute new live range/liveness in place.
Reviewed By: skatkov
Differential Revision: https://reviews.llvm.org/D132255
In preparation for adding support for DBG_VALUE_LIST instructions in
InstrRefLDV, this patch updates the logic for joining variables at block
joins to support joining variables that use multiple debug operands.
This is one of the more meaty "logical" changes, although the line count
isn't too high - this changes pickVPHILoc to find a valid joined
location for every operand, with part of the function being split off
into pickValuePHILoc which finds a location for a single operand.
Differential Revision: https://reviews.llvm.org/D128180
This interface allows a target to reject a proposed
SMS schedule. For Hexagon/PowerPC, all schedules
are accepted, leaving behavior unchanged. For ARM,
schedules which exceed register pressure limits are
rejected.
Also, two RegisterPressureTracker methods now need to be public so
that register pressure can be computed by more callers.
Reapplication of D128941 (reverted in D132037) with a small fix.
Differential Revision: https://reviews.llvm.org/D132170