This pass replaces each indirect call/jump with a direct call to a thunk that looks like:
lfence
jmpq *%r11
This ensures that if the value in register %r11 was loaded from memory, then
the value in %r11 is (architecturally) correct prior to the jump.
Also adds a new target feature to X86: +lvi-cfi
("cfi" meaning control-flow integrity)
The feature can be added via clang CLI using -mlvi-cfi.
This is an alternate implementation to https://reviews.llvm.org/D75934 that merges the thunk insertion functionality with the existing X86 retpoline code.
Differential Revision: https://reviews.llvm.org/D76812
Summary:
This patch adds parsing and dumping of the DWARFv5 .debug_macro section in llvm-dwarfdump.
It does not introduce any new switch; the existing switch "--debug-macro"
should be used to dump either the macinfo or macro section.
Reviewed By: dblaikie, ikudrin, jhenderson
Differential Revision: https://reviews.llvm.org/D73086
Summary:
Add a mapping from the exp2 math functions
to the corresponding SVML calls.
This is a follow-up to and an extension of the earlier llvm diff
https://reviews.llvm.org/D19544
Test Plan:
- update test case and run ninja check.
- run tests locally
Reviewers: wenlei, hoyFB, mmasten, mzolotukhin, spatel
Reviewed By: spatel
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77114
Summary:
A recent change in the instruction simplifier enables a call to a function that just returns one of its parameters to be simplified to simply using the parameter. This exposes a bug in the inliner where double inlining may be involved, which in turn may cause a compiler ICE when an already-inlined callsite is reused for further inlining.
To put it simply, in a C program like the one below, when the call second(t) is inlined, its code t = third(t) is reduced to just loading the return value of the callsite first(). This causes the inliner's internal data structure to register the first() callsite for the call edge representing the third() call, which incurs double inlining when both call edges are considered inline candidates. This fix stops the inliner from reusing a callsite for new call edges.
```
void top()
{
  int t = first();
  second(t);
}
void second(int t)
{
  t = third(t);
  fourth(t);
}
int third(int t)
{
  return t;
}
```
The actual failing case is much trickier than the example here and is only reproducible with the legacy inliner. The legacy inliner works by processing each SCC in bottom-up order. That means in reality function first may already be inlined into top, or function third is either inlined into second or folded into nothing. To repro the failure seen when building a large application, we need to figure out a way to confuse the inliner so that the bottom-up inlining is not fulfilled. I'm doing this by making the call to second indirect so that the alias analyzer fails to figure out the right call graph edge from top to second, and top can be processed before second during the bottom-up walk. We also need to tweak the test code so that when the inlining of top happens, the function body of second is not yet that optimized, by delaying the function attribute deducer pass (i.e., the one which tells that function third has no side effect and just returns its parameter). Since the CGSCC pass is iterative, additional calls are added to top to postpone the inlining of second to the second round, right after the first function attribute deducing pass is done. I haven't been able to repro the failure with the new pass manager since the processing order of inlined callsites is a bit different, but in theory the issue could happen there too.
Note that this fix could have the side effect of blocking the simplification of inlined code, specifically for a call site that can be folded to another call site. I hope this can be compensated for by subsequent inlining or folding, as shown in the attached unit test. The ideal fix would be to separate the use of VMap. However, in reality this failing pattern shouldn't happen often, and even if it happens there should be a good chance that the non-folded call site will be refolded by iterative inlining or subsequent simplification.
Reviewers: wenlei, davidxl, tejohnson
Reviewed By: wenlei, davidxl
Subscribers: eraman, nikic, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76248
Summary:
This patch comes from H.J.'s 2bd54ce7fa
This patch fixes the LLVM unit tests that fail when running on a CET machine (e.g. ExecutionEngine/MCJIT/MCJITTests).
The main reason we enable IBT for "JIT compiled with CET" is that the JIT does not know whether its caller program is CET-enabled or not.
If the JIT's caller program is not CET-enabled, it does not matter whether the JIT generates CET code.
But if the JIT's caller program is CET-enabled, the JIT must generate CET code, or control-protection exceptions will occur.
I have tested the patch with the llvm unit tests and llvm-test-suite on a CET machine; they passed.
H.J. also tested it by building and running VNCserver (Virtual Network Console), and it works too
(without this patch, VNCserver crashes on a CET machine).
Reviewers: hjl.tools, craig.topper, LuoYuanke, annita.zhang, pengfei
Subscribers: tstellar, efriedma, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76900
This was causing a machine verifier failure on the test suite.
Make sure that we don't end up with a weird register class here.
Failure for reference:
*** Bad machine code: Illegal virtual register for instruction ***
- function: check_constrain
- basic block: %bb.1 (0x7f8b70839f80)
- instruction: early-clobber %6:gpr64, early-clobber %7:gpr64sp =
JumpTableDest32 %5:gpr64, %1:gpr64sp, %jump-table.0
- operand 3: %1:gpr64sp
Expected a GPR64 register, but got a GPR64sp register
Differential Revision: https://reviews.llvm.org/D77349
The MI peephole already removes unnecessary FRSP instructions. This patch
removes unnecessary XSRSP instructions in the same way.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D77208
Summary:
This fixes a few issues related to SMRD offsets. On gfx9 and gfx10 we have a
signed byte offset immediate; however, we could overflow into a negative value
because we treated it as unsigned.
Also, the SMRD SOFFSET sgpr is an unsigned offset on all subtargets. We
sometimes tried to use negative values here.
Third, S_BUFFER instructions should never use a signed offset immediate.
Differential Revision: https://reviews.llvm.org/D77082
This will likely introduce catastrophic performance regressions on
older subtargets, but should be correct. A follow up change will
remove the old fp32-denormals subtarget features, and switch to using
the new denormal-fp-math/denormal-fp-math-f32 attributes. Frontends
should be making sure to add the denormal-fp-math-f32 attribute when
appropriate to avoid performance regressions.
The problem on Windows was that the \b in "..\bin" was interpreted
as an escape sequence. Use r"" strings to prevent that.
This reverts commit ab11b9eefa,
with raw strings in the lit.site.cfg.py.in files.
Differential Revision: https://reviews.llvm.org/D77184
Consider a callee function that has a call (C) within it which feeds
into the return. When we inline that callee into a callsite that has
return attributes, we can backward propagate valid attributes to the
call (C) within that inlined callee body.
This is safe to do only if we can guarantee transfer of execution to the
successor in the window of instructions between the return value (i.e. the
call C) and the return instruction.
Also, this is valid only for attributes which are a property of the
callsite and not for those that are dependent on the ABI or are a
property of the call itself.
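For illustration, here is a hypothetical sketch (all names are made up, and nonnull is used as an example of such a callsite attribute):
```llvm
declare i8* @inner()

define i8* @callee() {
  %r = call i8* @inner()
  ret i8* %r
}

define i8* @caller() {
  ; the callsite carries a return attribute
  %p = call nonnull i8* @callee()
  ret i8* %p
}

; After inlining @callee into @caller, the call to @inner() directly produces
; the value that was annotated at the callsite, so the attribute can be
; propagated backwards:
;   %p.i = call nonnull i8* @inner()
```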
Reviewed-By: reames, jdoerfert
Differential Revision: https://reviews.llvm.org/D76140
This is a workaround for clang adding noinline to all functions at
-O0. Previously, we would just add alwaysinline, and the verifier
would complain about having both noinline and alwaysinline. We
currently can't truly codegen this case as a freestanding function, so
override the user forcing noinline.
Currently, all generated lit.site.cfg files contain absolute paths.
This makes it impossible to build on one machine, and then transfer the
build output to another machine for test execution. Being able to do
this is useful for several use cases:
1. When running tests on an ARM machine, it would be possible to build
on a fast x86 machine and then copy build artifacts over after building.
2. It allows running several test suites (clang, llvm, lld) on 3
different machines, reducing test time from sum(each test suite time) to
max(each test suite time).
This patch makes it possible to pass a list of variables that should be
relative in the generated lit.site.cfg.py file to
configure_lit_site_cfg(). The lit.site.cfg.py.in file needs to call
`path()` on these variables, so that the paths are converted to absolute
form at lit start time.
The testers would have to have an LLVM checkout at the same revision,
and the build dir would have to be at the same relative path as on the
builder.
This does not yet cover how to figure out which files to copy from the
builder machine to the tester machines. (One idea is to look at the
`--graphviz=test.dot` output and copy all inputs of the `check-llvm`
target.)
Differential Revision: https://reviews.llvm.org/D77184
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
We do not attempt this in InstCombine because we do not want to change
types and create new shuffle ops that are potentially not lowered as
well as the original code. Here, we can check the cost model to see if
it is worthwhile.
I've aggressively enabled this transform even if the types are the same
size and/or equal cost because moving the bitcast allows InstCombine to
make further simplifications.
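A minimal sketch of the fold (hypothetical IR, not one of the motivating tests), for a case where the mask can be scaled to the wider element type:
```llvm
define <2 x i64> @bitcast_of_shuffle(<4 x i32> %v) {
  %shuf = shufflevector <4 x i32> %v, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 0, i32 1>
  %cast = bitcast <4 x i32> %shuf to <2 x i64>
  ret <2 x i64> %cast
}
; can become (when the cost model agrees):
;   %cast = bitcast <4 x i32> %v to <2 x i64>
;   %shuf = shufflevector <2 x i64> %cast, <2 x i64> undef, <2 x i32> <i32 1, i32 0>
;   ret <2 x i64> %shuf
```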
In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the
redundant shuffles, but we probably want to extend VectorCombine to
handle the inverse pattern (shuffle-of-bitcast) to get that
simplification directly in IR.
Differential Revision: https://reviews.llvm.org/D76727
SILoadStoreOptimizer::checkAndPrepareMerge() expects the base and
paired instructions to come in order and scans the MBB from the base to
the paired instruction. The original order can be changed if there was
a dependent instruction in between and the base instruction was moved.
Fixed by bailing out of the optimization. In theory it might still be
possible to perform a merge by swapping the instructions, but in practice
it bails anyway because it finds a dependency on that same instruction
which caused the base to move.
Differential Revision: https://reviews.llvm.org/D77245
These are versions of a function that regressed with:
rGf2fbdf76d8d0
That particular problem occurs with an instcombine-simplifycfg-instcombine
sequence, but we can show that it exists within instcombine only with
other variations of the pattern.
This reverts commit f2fbdf76d8.
As noted in the post-commit thread:
https://reviews.llvm.org/rGf2fbdf76d8d0
...this can obscure a min/max pattern where the components
have extra uses. We can show that the problem is independent
of this change with a slightly modified source example, so
this revert just delays/reduces the need to fix the real
problem.
We need to improve our analysis of negation or -- more
generally -- subtraction using patches like D77230 or D68408.
Summary:
Splitting knowledge retention into queries (in Analysis) and a builder (in Transform/Utils)
allows both the queries and Transform/Utils to use Analysis.
Reviewers: jdoerfert, sstefan1
Reviewed By: jdoerfert
Subscribers: mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77171
This patch adds
- New arguments to getMinPrefetchStride() to let the target decide on a
per-loop basis if software prefetching should be done even with a stride
within the limit of the hw prefetcher.
- New TTI hook enableWritePrefetching() to let a target do write prefetching
by default (defaults to false).
- In LoopDataPrefetch:
- A search through the whole loop to gather information before emitting any
prefetches. This way the target can get information via new arguments to
getMinPrefetchStride() and emit prefetches more selectively. The collected
information includes: whether the loop has a call, how many memory
accesses there are, how many of them are strided, and how many prefetches
will cover them. This is NFC compared to before, as long as the target does
not change its definition of getMinPrefetchStride().
- If a previous access to the same exact address was 'read', and the
current one is 'write', make it a 'write' prefetch.
- If two accesses that are covered by the same prefetch do not dominate
each other, put the prefetch in a block that dominates both of them.
- If a ConstantMaxTripCount is less than ItersAhead, then skip the loop.
- A SystemZ implementation of getMinPrefetchStride().
Review: Ulrich Weigand, Michael Kruse
Differential Revision: https://reviews.llvm.org/D70228
Add an option to llvm-dwarfdump to calculate the number of bytes within
the debug sections. Dump these numbers when using the --statistics
option as well.
This is an initial patch (e.g. we should support other units,
since we only support 'bytes' now).
Differential Revision: https://reviews.llvm.org/D74205
This adds MVE vmull patterns, which are conceptually the same as
mul(vmovl, vmovl), and so the tablegen patterns follow the same
structure.
For i8 and i16 this is simple enough, but for the i32 version the
multiply (in 64 bits) is illegal, meaning we need to catch the pattern
earlier in a DAG fold. Because bitcasts are involved in the zext
versions, the patterns are also a little different in little and big
endian; I have only added little endian support in this patch.
Differential Revision: https://reviews.llvm.org/D76740
The unpredictable/hasSideEffects flag is usually inferred by tablegen
from whether the instruction has a tablegen pattern (and that pattern
only has a single output instruction). Now that the MVE intrinsics are
all committed and producing code, the remaining instructions still
marked as unpredictable need to be specially handled. This adds the flag
directly to instructions that need it, notably the V*MLAL instructions
and some of the MOVs.
Differential Revision: https://reviews.llvm.org/D76910
Summary:
This allows doing `memcmp(p, q, 7)` with 2 loads instead of a call to
memcmp.
This fixes part of PR45147.
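For the equality-comparison case, the expansion is roughly as below (a hand-written sketch with overlapping loads of bytes 0-3 and 3-6; names are made up and the exact ExpandMemCmp output differs):
```llvm
define i1 @memcmp7_eq(i8* %p, i8* %q) {
  %p0 = bitcast i8* %p to i32*
  %q0 = bitcast i8* %q to i32*
  %a0 = load i32, i32* %p0, align 1
  %b0 = load i32, i32* %q0, align 1
  %p3 = getelementptr i8, i8* %p, i64 3
  %q3 = getelementptr i8, i8* %q, i64 3
  %p1 = bitcast i8* %p3 to i32*
  %q1 = bitcast i8* %q3 to i32*
  %a1 = load i32, i32* %p1, align 1
  %b1 = load i32, i32* %q1, align 1
  %x0 = xor i32 %a0, %b0
  %x1 = xor i32 %a1, %b1
  %or = or i32 %x0, %x1
  %eq = icmp eq i32 %or, 0
  ret i1 %eq
}
```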
Reviewers: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76133
As pointed out by @thakis, currently CallSiteSplitting bails out after
checking the first PHI node. We should check all PHI nodes, until we
find one where call site splitting is beneficial.
This patch also slightly simplifies the code using BasicBlock::phis().
Reviewers: davidxl, junbuml, thakis
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D77089
There was a TODO in genericValueTraversal to provide the context
instruction, and due to the lack of it, users that wanted one just used
something available. Unfortunately, using a fixed instruction is wrong
in the presence of PHIs so we need to update the context instruction
properly.
Reviewed By: uenoku
Differential Revision: https://reviews.llvm.org/D76870
While D68850 allowed functions to be deleted, I accidentally saved some
version of the function to be used once a suitable prefix was found.
This turned out to be problematic when the occasionally deleted function
is also occasionally modified. The test case is adjusted to resemble the
case in which the problem was found.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D76586
Use DL & ABI information for better alignment deduction, e.g., if a type
is accessed and the ABI specifies an alignment requirement for such an
access we can use it. This is based on a patch by @lebedev.ri and
inspired by getBaseAlign in Loads.cpp.
Depends on D76673.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D76674
If we have a must-tail call the callee and caller need to have matching
ABIs. Part of that is alignment which we might modify when we deduce
alignment of arguments of either. Since we would need to keep them in
sync, which is not as simple, we simply avoid deducing alignment for
arguments of the must-tail caller or callee.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D76673
Failure to export __cxa_atexit can lead to an attempt to import a definition
from the process itself (if __cxa_atexit is referenced from another JITDylib),
but the process definition will clash with the existing non-exported definition
to produce an unexpected DuplicateDefinitionError.
This patch fixes the immediate issue by exporting __cxa_atexit. It also fixes a
bug where atexit functions in other JITDylibs were not being run by adding a
copy of run_atexits_helper to every JITDylib.
A follow up patch will deal with the bug where definition generators are called
despite a non-exported definition being present.
This means the linker will expect them to be undefined at link time and
will generate imports from the `env` module rather than reporting
undefined externals.
Differential Revision: https://reviews.llvm.org/D77192
The MemoryBuffer::getMemBuffer method's RequiresNullTerminator parameter
defaults to true, but object files are not null terminated so we need to
explicitly pass false here.
Negation is equivalent to bitwise-not + 1, so try to convert more
subtracts into adds using this relationship:
0 - (A ^ C) => ((A ^ C) ^ -1) + 1 => (A ^ ~C) + 1
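A minimal sketch with C == 42 (so ~C == -43); hypothetical IR, not one of the added tests:
```llvm
define i32 @neg_of_xor(i32 %x) {
  %a = xor i32 %x, 42
  %r = sub i32 0, %a
  ret i32 %r
}
; can become:
;   %a = xor i32 %x, -43
;   %r = add i32 %a, 1
```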
I doubt this will recover the regression noted in rGf2fbdf76d8d0,
but seems like we're going to need to improve here and/or revive D68408?
Alive2 proofs:
http://volta.cs.utah.edu:8080/z/Re5tMU
http://volta.cs.utah.edu:8080/z/An-uns
Differential Revision: https://reviews.llvm.org/D77230
find() was altering the UserChain, even in cases where it subsequently
discovered that the resulting constant was a 0. This confuses
rebuildWithoutConstOffset() when it attempts to walk the chain later, since it
is expected that the chain itself be a path down the use-def edges of an
expression.
Make the attributor pass aware of aligned_alloc for converting heap
allocations to stack ones.
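A minimal sketch of the kind of pattern this enables (hypothetical IR; @use is a placeholder):
```llvm
declare i8* @aligned_alloc(i64, i64)
declare void @free(i8*)
declare void @use(i8* nocapture)

define void @f() {
  %p = call i8* @aligned_alloc(i64 32, i64 128)
  call void @use(i8* %p)
  call void @free(i8* %p)
  ret void
}
; Heap-to-stack conversion can replace the allocation with something like:
;   %p = alloca i8, i64 128, align 32
```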
Depends on D76971.
Differential Revision: https://reviews.llvm.org/D76974
Summary:
The previous code for determining the innermost region in CFGSort was
not correct. We determine subregion relationship by domination of their
headers, i.e., if region A's header dominates region B's header, B is a
subregion of A. Previously we assumed that if a BB belongs to both a
loop and an exception, the region with the smaller number of BBs is the
innermost one. This may not be true, because while WebAssemblyException
contains BBs in all its subregions (loops or exceptions), MachineLoop
may not, because MachineLoop does not contain BBs that don't have a path
to its header even if they are dominated by its header.
Loop header <-----|
  |               |
Exception header  |
  |     \         |
  A      B        |
  |       \       |
  |        C      |
  |               |
Loop latch        |
  |               |
  ----------------|
For example, in this CFG, the loop does not contain B and C, because
they don't have a path back to the loop's header. But for CFGSort we
should consider that the exception here belongs to the loop and that the
exception should be a subregion of the loop and scheduled together with it.
So here we should use `WE->contains(ML->getHeader())` (and not
`ML->contains(WE->getHeader())`) for the region shown above.
This also fixes some comments and deletes the `Regions` vector in the
`RegionInfo` class, which was not used anywhere.
Reviewers: dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77181
There are scenarios where the logic behind the --elf-hash-histogram/--hash-symbols/--hash-table
options might crash when given a broken hash table.
This patch adds pre-checks of the tables for these 3 options
and provides test cases.
Differential revision: https://reviews.llvm.org/D77147
Summary:
Currently, the comparison argument used for ATOMIC_CMP_XCHG is legalised
with GetPromotedInteger, which leaves the upper bits of the value
undefined. Since this is used for comparing in an LR/SC loop with a
full-width comparison, we must sign extend it. We introduce a new
getExtendForAtomicCmpSwapArg to complement getExtendForAtomicOps, since
many targets have compare-and-swap instructions (or pseudos) that
correctly handle an any-extend input, and the existing function
determines the extension of the result, whereas we are concerned with
the input.
This is related to https://reviews.llvm.org/D58829, which solved the
issue for ATOMIC_CMP_SWAP_WITH_SUCCESS, but not the simpler
ATOMIC_CMP_SWAP.
Reviewers: asb, lenary, efriedma
Reviewed By: asb
Subscribers: arichardson, hiraditya, rbar, johnrusso, simoncook, sabuasal, niosHD, kito-cheng, shiva0217, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, rkruppe, jfb, PkmX, jocewei, psnobl, benna, Jim, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, evandro, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74453
Prior to this change the clang interface stubs format resembled
something ending with a symbol list like this:
Symbols:
a: { Type: Func }
This was problematic because we didn't actually want a map format and
also because we didn't like that an empty symbol list required
"Symbols: {}". That is to say without the empty {} llvm-ifs would crash
on an empty list.
With this new format it is much more clear which field is the symbol
name, and instead the [] that is used to express an empty symbol vector
is optional, ie:
Symbols:
- { Name: a, Type: Func }
or
Symbols: []
or
Symbols:
This further diverges the format from existing llvm-elftapi. This is a
good thing because although the format originally came from the same
place, they are not the same in any way.
Differential Revision: https://reviews.llvm.org/D76979
This allows the MVE VPT Block insertion pass to remove VPNOTs in
order to create more complex VPT blocks such as TE, TEET, TETE, etc.
Differential Revision: https://reviews.llvm.org/D75993
Summary:
Aggregate types containing scalable vectors aren't supported and as far
as I can tell this pass is mostly concerned with optimisations on
aggregate types, so the majority of this pass isn't very useful for
scalable vectors.
This patch modifies SROA such that mem2reg is run on allocas with
scalable types that are promotable, but nothing else such as slicing is
done.
The use of TypeSize in this pass has also been updated to be explicitly
fixed size. When invoking the following methods in DataLayout:
* getTypeSizeInBits
* getTypeStoreSize
* getTypeStoreSizeInBits
* getTypeAllocSize
we now call getFixedSize on the resultant TypeSize. This is quite an
extensive change with around 50 calls to these functions, and also the
first change of this kind (being explicit about fixed vs scalable
size) as far as I'm aware, so feedback welcome.
A test is included containing IR with scalable vectors that this pass is
able to optimise.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D76720
PTEST/TESTP sets EFLAGS as:
TESTZ: ZF = (Op0 & Op1) == 0
TESTC: CF = (~Op0 & Op1) == 0
TESTNZC: ZF == 0 && CF == 0
If we are inverting the 0'th operand of a PTEST/TESTP instruction we can adjust the comparisons to correctly handle the inversion implicitly.
Additionally, for the "TESTZ" (ZF) case with an all-ones operand, PTEST(X,-1) can be simplified to PTEST(X,X).
We can expand this for the TESTZ(X,~Y) pattern and also handle KTEST/KORTEST in the future.
Differential Revision: https://reviews.llvm.org/D76984
Summary:
Make sure we do not assert on value types not being
simple in getFauxShuffleMask when analysing operations
such as "v8i16 = truncate v8i24".
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77136
Currently ConstantRange::binaryAnd/binaryOr results are too pessimistic
for single element constant ranges.
If both operands are single element ranges, we can use APInt's AND and
OR implementations directly.
Note that some other binary operations on constant ranges can cover the
single element cases naturally, but for OR and AND this unfortunately is
not the case.
Reviewers: nikic, spatel, lebedev.ri
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D76446
Currently, DAG combiner uses (fmul (rsqrt x) x) to estimate square
root of x. However, this method would return NaN if x is +Inf, which
is incorrect.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D76853
This patch adds checks to the verifier to ensure the dimension arguments
passed to the matrix intrinsics match the vector types for their
arguments/return values.
Reviewers: anemet, Gerolf, andrew.w.kaylor, LuoYuanke
Reviewed By: anemet
Differential Revision: https://reviews.llvm.org/D77129
Also add lowerShuffleWithPACK call to lowerV32I16Shuffle - shuffle combining was catching it but we avoid a lot of temporary shuffle creations if we catch it at lowering first.
Summary:
In https://bugs.llvm.org/show_bug.cgi?id=45297, instruction selection fails
for `PPCISD::ST_VSR_SCAL_INT`. The reason `PPCISD::ST_VSR_SCAL_INT` is
generated even with `-power8-vector` set in the IR is that PPC's
combiner checks `hasP8Altivec` rather than `hasP8Vector`. This patch
should resolve PR45297.
Differential Revision: https://reviews.llvm.org/D76773
Summary:
Select folding in JumpThreading can create a conditional branch on a
code path that did not have one in the original program. This is not a
valid transformation in sanitize_memory functions.
Note that JumpThreading does select folding in 3 different places. Two
of them seem safe - they apply to a select instruction in a BB that ends
with an unconditional branch to another BB, which (in turn) ends with a
conditional branch or a switch with the same condition.
Fixes PR45220.
Reviewers: glider, dvyukov, efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76332
Summary:
When we encounter an XCOFF file, reflect that in the triple information.
In addition to knowing the object file format, we know that the
associated OS is AIX.
This means that we can expect that there is no output difference in the
processing of an XCOFF32 input file between cases where the triple is
left unspecified by the user and cases where the user specifies
`--triple powerpc-ibm-aix` explicitly.
Reviewers: jhenderson, sfertile, jasonliu, daltenty
Reviewed By: jasonliu
Subscribers: wuzish, nemanjai, hiraditya, MaskRay, rupprecht, steven.zhang, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77025
The UnwindHelp object is used during exception handling by runtime
code. It must be findable from a fixed offset from FP.
This change allocates the UnwindHelp object as a fixed object (as is
done for x86_64) to ensure that both the generated code and runtime
agree on the location of the object.
Fixes https://bugs.llvm.org/show_bug.cgi?id=45346
Differential Revision: https://reviews.llvm.org/D77016
The generated code for a funclet can have an add to sp in the epilogue
for which there is no corresponding sub in the prologue.
This patch removes the early return from emitPrologue that was
preventing the sub to sp, and instead conditionalizes the appropriate
parts of the rest of the function.
Fixes https://bugs.llvm.org/show_bug.cgi?id=45345
Differential Revision: https://reviews.llvm.org/D77015
This reverts commit 28518d9ae3.
There is a failure in MsgPackReader.cpp when built with clang. It
complains that "signext and zeroext" are incompatible. Investigating
offline whether it is in fact UB in the MsgPackReader code.
The attached test case is simplified from tcmalloc. Both function calls should be optimized as tail calls, but LLVM can only optimize the first call; the second call can't be optimized because the function dupRetToEnableTailCallOpts failed to duplicate the ret into block case2.
Two problems blocked the duplication:
1. The intrinsic call llvm.assume is not handled by dupRetToEnableTailCallOpts.
2. The control flow is more complex than expected: dupRetToEnableTailCallOpts can only duplicate the ret into its predecessor, but here we have an intermediate block between the call and the ret.
The solutions:
1. Since CodeGenPrepare is already at the end of the LLVM IR phase, we can simply delete the intrinsic call to llvm.assume.
2. A general solution to the complex control flow is hard, but for this case, after exit2 is duplicated into case1, exit2 is the only successor of exit1 and exit1 is the only predecessor of exit2, so they can be combined through eliminateFallThrough. But this function is called too late; there is no dupRetToEnableTailCallOpts after it. We can add an earlier call to eliminateFallThrough to solve it.
Differential Revision: https://reviews.llvm.org/D76539
We have loads preserving low and high 16 bits of their
destinations. However, we always use a whole 32 bit register
for these. The same happens with 16 bit stores: we have to
use a full 32 bit register, so if the high bits are clobbered the
register needs to be copied. One example of such code is
added to the load-hi16.ll.
The proper solution to the problem is to define 16 bit subregs
and use them in the operations which do not read another half
of a VGPR or preserve it if the VGPR is written.
This patch simply defines subregisters and register classes.
At the moment there should be no difference in code generation.
A lot more work is needed to actually use these new register
classes. Therefore, there are no new tests at this time.
Register weight calculation has changed with new subregs so
appropriate changes were made to keep all calculations just
as they are now, especially calculations of register pressure.
Differential Revision: https://reviews.llvm.org/D74873
Consider a callee function that has a call (C) within it which feeds
into the return. When we inline that callee into a callsite that has
return attributes, we can backward propagate those attributes to the
call (C) within that inlined callee body.
This is safe to do only if we can guarantee transfer of execution to the
successor in the window of instructions between the return value (i.e. the
call C) and the return instruction.
See added test cases.
Reviewed-By: reames, jdoerfert
Differential Revision: https://reviews.llvm.org/D76140
Registers used in any address (as well as in a few other contexts)
have special semantics when a "zero" register is used, which is
why the back-end defines extra register classes ADDR32, ADDR64 etc
to be used to prevent the register allocator from using %r0 there.
However, when writing assembler code "by hand", you sometimes need
to trigger that special semantics; currently, though, the AsmParser
will reject %r0 in those places. In some cases it may be possible
to write that instruction differently - but in others it is currently
not possible at all.
This check in AsmParser simply seems overly strict, so this patch
just removes the check completely. This brings the behaviour of
AsmParser in line with the GNU assembler as well.
Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=45092
Make InstCombine aware of the aligned_alloc library function.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Depends on D76970.
Differential Revision: https://reviews.llvm.org/D76971
Summary:
CGProfilePass is run by default in certain new pass manager optimization pipelines. Assemblers other than the LLVM integrated assembler (such as GNU as) cannot recognize the .cgprofile entries generated and emitted by this pass, causing a build-time error.
This patch adds new options in clang CodeGenOpts and PassBuilder options so that we can turn cgprofile off when not using integrated assembler.
Reviewers: Bigcheese, xur, george.burgess.iv, chandlerc, manojgupta
Reviewed By: manojgupta
Subscribers: manojgupta, void, hiraditya, dexonsmith, llvm-commits, tcwang, llozano
Tags: #llvm, #clang
Differential Revision: https://reviews.llvm.org/D62627
Summary: this patch preserves information from various places in EarlyCSE into assume bundles.
Reviewers: jdoerfert
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76769
If canLowerByDroppingEvenElements indicates that the shuffle is a N:1 compaction pattern and the inputs are suitably sign/zero extended then we can use a chain of PACKSS/PACKUS to compact.
This helps avoid PSHUFB (and its mask load) for short shuffle chains; shuffle combining will still replace with a PSHUFB if we have enough shuffles, as getFauxShuffleMask can recognise PACKSS/PACKUS chains.
The script now includes extra info about command-line options used
when generating its advertisement heading, but we don't want that
here. This is a special-case because we have enhanced the check
lines (as noted in the 2nd comment line).
This patch updates ValueLattice to distinguish between ranges that are
guaranteed to not include undef and ranges that may include undef.
A constant range guaranteed to not contain undef can be used to simplify
instructions to arbitrary values. A constant range that may contain
undef can only be used to simplify to a constant. If the value can be
undef, it might take a value outside the range. For example, consider
the snippet below
define i32 @f(i32 %a, i1 %c) {
  br i1 %c, label %true, label %false
true:
  %a.255 = and i32 %a, 255
  br label %exit
false:
  br label %exit
exit:
  %p = phi i32 [ %a.255, %true ], [ undef, %false ]
  %f.1 = icmp eq i32 %p, 300
  call void @use(i1 %f.1)
  %res = and i32 %p, 255
  ret i32 %res
}
In the exit block, %p would be a constant range [0, 256) including undef, as
%p could be undef. We can use the range information to replace %f.1 with
false because we remove the compare, effectively forcing the use of the
constant to be != 300. We cannot replace %res with %p however, because
if %a were undef the compare %f.1 may be true but the second use might not be
< 256.
Currently LazyValueInfo uses the new behavior just when simplifying AND
instructions and does not distinguish between constant ranges with and
without undef otherwise. I think we should address the remaining issues
in LVI incrementally.
Reviewers: efriedma, reames, aqjune, jdoerfert, sstefan1
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D76931
For the PHI node
%1 = phi [%A, %entry], [%X, %latch]
it is incorrect to use the SCEV of the backedge value %X as the exit value
of the PHI unless %X is loop invariant.
This is because the exit value of %1 is the value of %X at the
one-before-last iteration of the loop.
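A small illustration (hypothetical IR; %iv plays the role of %1 and %x the role of %X):
```llvm
define i32 @f(i32 %n) {
entry:
  br label %loop
loop:
  %iv = phi i32 [ 0, %entry ], [ %x, %loop ]
  %x = add i32 %iv, 1
  %c = icmp slt i32 %x, %n
  br i1 %c, label %loop, label %exit
exit:
  ; For %n >= 1 the loop exits with %x == %n but %iv == %n - 1, so using the
  ; SCEV of %x as the exit value of %iv would be off by one iteration.
  ret i32 %iv
}
```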
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D73181
Canonicalize the case when a scalar extracted from a vector is
truncated. Transform such cases to bitcast-then-extractelement.
This will enable erasing the truncate operation.
This commit fixes PR45314.
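A minimal sketch (hypothetical IR, little-endian lane numbering shown):
```llvm
define i16 @extract_then_trunc(<4 x i32> %v) {
  %e = extractelement <4 x i32> %v, i32 1
  %t = trunc i32 %e to i16
  ret i16 %t
}
; can be canonicalized to:
;   %bc = bitcast <4 x i32> %v to <8 x i16>
;   %e  = extractelement <8 x i16> %bc, i32 2
;   ret i16 %e
```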
reviewers: spatel
Differential revision: https://reviews.llvm.org/D76983
Add a new llvm.amdgcn.ballot intrinsic modeled on the ballot function
in GLSL and other shader languages. It returns a bitfield containing the
result of its boolean argument in all active lanes, and zero in all
inactive lanes.
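A minimal usage sketch (wave64 form; an i32-result overload exists for wave32):
```llvm
declare i64 @llvm.amdgcn.ballot.i64(i1)

define i64 @lanes_with_positive_x(i32 %x) {
  %cond = icmp sgt i32 %x, 0
  ; bit N of %mask is set iff lane N is active and %cond is true in that lane
  %mask = call i64 @llvm.amdgcn.ballot.i64(i1 %cond)
  ret i64 %mask
}
```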
This is intended to replace the existing llvm.amdgcn.icmp and
llvm.amdgcn.fcmp intrinsics after a suitable transition period.
Use the new intrinsic in the atomic optimizer pass.
Differential Revision: https://reviews.llvm.org/D65088
For casts with constant range operands, we can use
ConstantRange::castOp.
Reviewers: davide, efriedma, mssimpso
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D71938
Leverage ARM ELF build attribute section to create ELF attribute section
for RISC-V. Extract the common part of parsing logic for this section
into ELFAttributeParser.[cpp|h] and ELFAttributes.[cpp|h].
Differential Revision: https://reviews.llvm.org/D74023
Octeon branches (bbit0/bbit032/bbit1/bbit132) have an immediate operand,
so it is legal to have such replacement within
MipsBranchExpansion::replaceBranch().
According to the specification, a branch (e.g. bbit0 ) looks like:
bbit0 rs p offset // p is an immediate operand
if !rs<p> then branch
Without this patch, an assertion triggers in the method,
and the problem was found in a real-world example.
Differential Revision: https://reviews.llvm.org/D76842
In the past, AVR functions were only lowered with interrupt-specific
machine code if the function was defined with the "avr-interrupt" or
"avr-signal" calling conventions.
This patch modifies the backend so that if the function does not have a
special calling convention, but does have an "interrupt" attribute,
that function is treated as an interrupt handler.
This also extracts the "is this function an interrupt" logic from
several disparate places in the backend into one AVRMachineFunctionInfo
attribute.
Bug found by Wilhelm Meier.
The compbinary format uses MD5 to represent strings in the name table. That gives a smaller profile without the need for compression/decompression when writing/reading the profile. This patch adds the same support to the extbinary format. It is off by default but the user can choose to enable it.
Note that using MD5 in the name table brings a very small chance of name conflicts leading to profile mismatch. Besides, profiles using the feature won't have profile remapping support.
Differential Revision: https://reviews.llvm.org/D76255
When we have
```
a = G_OR x, x
```
or
```
b = G_AND y, y
```
We can drop the G_OR/G_AND and just use x/y respectively.
Also update arm64-fallback.ll because there was an or in there which hits this
transformation.
Differential Revision: https://reviews.llvm.org/D77105
Implement identity combines for operations like the following:
```
%a = G_SUB %b, 0
```
This can just be replaced with %b.
Over CTMark, this gives some minor size improvements at -O3.
Differential Revision: https://reviews.llvm.org/D76640
This reverts commit b3297ef051.
This change is incorrect. The current semantics of null in the IR is a
pointer with the bit value 0. It is not a cast from an integer 0, so
this should preserve the pointer type.
We currently don't have a way to map to the equivalent intrinsic
opcode, so track immediate 0s in place of the address for the
selection to know to change the final opcode.
InstCombine has a mess of logic that tries to preserve min/max patterns,
but AFAICT, this one is not necessary because we can always narrow the
corresponding select in this sequence to match the narrow compare.
The biggest danger for this patch is inducing infinite looping or
assert from exceeding max iterations. If any bots hit that in the
vicinity of this commit, this is the likely patch to blame.
I got a report recently that a user was having trouble interpreting the
meaning of the error message. Hopefully this is more readable; produces
something like the following:
error: No such file or directory: Could not read profile data!
Differential Revision: https://reviews.llvm.org/D76796
Summary:
Code frequently relies upon the results of "is.constant" intrinsics to
DCE invalid code paths. We don't want the intrinsic to be made control-
dependent on any additional values. For instance, we can't split a PHI
into a "constant" and "non-constant" part via jump threading in order
to "optimize" the constant part, because the "is.constant" intrinsic is
meant to return "false".
Reviewers: wmi, kazu, MaskRay
Reviewed By: kazu
Subscribers: jdoerfert, efriedma, joerg, lebedev.ri, nikic, xbolva00, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75799
Summary:
This change adds amdgcn.reloc.constant intrinsic to the amdgpu backend, which will compile into a relocation entry in the resulting elf.
The intrinsic takes a MetadataNode (String) as its only argument, which specifies the symbol name of the relocation entry.
`SelectionDAGBuilder::getValueImpl` is changed to allow metadata operands passed through to ISel.
Author: csyonghe <yonghe@google.com>
Reviewers: tpr, nhaehnle
Reviewed By: nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76440
For each natural loop with multiple exit blocks, this pass creates a
new block N such that all exiting blocks now branch to N, and then
control flow is redistributed to all the original exit blocks.
The bulk of the transformation is a new function introduced in
BasicBlockUtils that can redirect control flow from a set of incoming
blocks to a set of outgoing blocks via a common "hub".
This is a useful workaround for a limitation in the structurizer which
incorrectly orders blocks when processing a nest of loops. This pass
bypasses that issue by ensuring that each natural loop is recognized
as a separate region. Since the structurizer is a region pass, it no
longer sees a nest of loops in a single region, and instead processes
each "level" in the nesting as a separate region.
The AMDGPU backend provides a new option to enable this pass before
the structurizer, which may eventually be enabled by default.
Reviewers: madhur13490, arsenm, nhaehnle
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D75865
In InnerLoopVectorizer::getOrCreateTripCount, when the backedge taken
count is a SCEV add expression, its type is defined by the type of the
last operand of the add expression.
In the test case from PR45259, this last operand happens to be a
pointer, which (according to llvm::Type) does not have a primitive size
in bits. In this case, LoopVectorize fails to truncate the SCEV and
crashes as a result.
Using ScalarEvolution::getTypeSizeInBits makes the truncation work as expected.
https://bugs.llvm.org/show_bug.cgi?id=45259
Differential Revision: https://reviews.llvm.org/D76669
Summary:
Otherwise the PostRA list scheduler may reorder instructions, e.g. it may
schedule this
'''
pushq $0x8
pop %rbx
lea 0x2a0(%rsp),%r15
'''
to
'''
pushq $0x8
lea 0x2a0(%rsp),%r15
pop %rbx
'''
by mistake. The patch prevents this from happening by making sure POP has an
implicit use of SP.
Reviewers: craig.topper
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77031
I still think the call lowering type legalization logic split between
the generic code and the target is too confusing, but it is largely induced
by the reliance on the DAG infrastructure.
This test missed the check of histograms printed for .hash sections.
It was removed by mistake in D71606, where I tried to get rid of precompiled objects
and did not realize at that time that both SHT_GNU_HASH and SHT_HASH sections
were tested, and not just the GNU version.
Also it never tested aliases for the --elf-hash-histogram option.
Differential revision: https://reviews.llvm.org/D76920
We are able to use `-DBITS=32/64` to reduce this test case.
I've rewritten the comments we had to generalize them and
fixed the wrong computations they contained.
Differential revision: https://reviews.llvm.org/D76924
If we are lowering to X86ISD::SHUF128 we are going to lose track of individual 128-bit lanes that are UNDEF, so if we can widen these to guarantee that they are sequential with their neighbour we should. This helps with later shuffle combines.
Add a bit more logic into the 'FalseLaneZeros' tracking to enable
horizontal reductions and also make the VADDV variants
validForTailPredication.
Differential Revision: https://reviews.llvm.org/D76708
In the original batch of MVE VMOVimm code generation VMOV.i64 was left
out due to the way it was done downstream. It turns out that it's fairly
simple though. This adds the codegen for it, similar to NEON.
Big-endian is technically incorrect in this version, which John is fixing
in a Neon patch.
Aligned_alloc is a standard lib function and has been in glibc since
2.16 and in the C11 standard. It has semantics similar to malloc/calloc
for several analyses/transforms. This patch introduces aligned_alloc
in target library info and memory builtins. Subsequent ones will
make other passes aware and fix https://bugs.llvm.org/show_bug.cgi?id=44062
This change will also be useful to LLVM generators that need to allocate
buffers of vector elements larger than 16 bytes (e.g. 256-bit ones),
for which element-boundary alignment is not typically provided by glibc malloc.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Differential Revision: https://reviews.llvm.org/D76970
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128, and we can use the implicit zeroing of the upper half.
I've added some extra tests to vector-shuffle-combining-avx2.ll to make sure we don't lose coverage.
Replace the explicit isAtom() || isSLM() test with the more general (and more specific) slowTwoMemOps() check to avoid the use of the PUSHrmm push from memory case.
This is actually very tricky to test in anything but quite complex code, but the atomic-idempotent.ll tests seem to be the most straightforward to use.
Differential Revision: https://reviews.llvm.org/D76239
Summary:
On targets with different pointer sizes, -alignment-from-assumptions could attempt to create SCEV expressions which use different effective SCEV types. The provided test illustrates the issue.
In `getNewAlignment`, AASCEV would be the (only) alloca, which would have an effective SCEV type of i32. But PtrSCEV, the GEP in this case, due to being in the flat/default address space, will have an effective SCEV of i64.
This patch resolves the issue by truncating PtrSCEV to AASCEV's effective type.
Reviewers: hfinkel, jdoerfert
Reviewed By: jdoerfert
Subscribers: jvesely, nhaehnle, hiraditya, javed.absar, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75471
Currently, bpf does not specify 128bit alignment in its
layout spec. So for a structure like
struct ipv6_key_t {
unsigned pid;
unsigned __int128 saddr;
unsigned short lport;
};
clang will generate IR type
%struct.ipv6_key_t = type { i32, [12 x i8], i128, i16, [14 x i8] }
The additional padding is to ensure that later IR->MIR lowering can generate a
correct stack layout with the target layout spec.
But it is common practice for a tracing program to be
first compiled with a target flag (e.g., x86_64 or aarch64) through
clang to generate IR and then to go through llc to generate bpf
byte code. Tracing programs often refer to kernel internal
data structures, which need to be compiled with a non-bpf target.
But such a compilation model may cause a problem on aarch64.
The bcc issue https://github.com/iovisor/bcc/issues/2827
reported such a problem.
For the above structure, since aarch64 has "i128:128" in its
layout string, the generated IR will have
%struct.ipv6_key_t = type { i32, i128, i16 }
Since bpf does not have "i128:128" in its spec string,
the SelectionDAG assumes alignment 8 for i128 and
computes the stack storage size for the above as 32 bytes,
which leads to incorrect code later.
x86_64 does not have this issue as it does not have
"i128:128" in its layout spec; it permits i128 to
be aligned to 8 bytes on the stack. Its IR type looks like
%struct.ipv6_key_t = type { i32, [12 x i8], i128, i16, [14 x i8] }
The fix here is to add i128 support to the layout spec, the same as
aarch64. The only downside is that we may have less optimal stack
allocation in certain cases, since we now require 16-byte alignment
for i128 instead of 8. But this is probably fine as i128 is
not used widely and in most cases users should already
have proper alignment.
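Roughly, with "i128:128" added to the BPF datalayout (the exact string below is an assumption, not copied from the patch), the struct no longer needs explicit padding fields:
```llvm
target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n32:64-S128"
%struct.ipv6_key_t = type { i32, i128, i16 }
```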
Differential Revision: https://reviews.llvm.org/D76587
There was already a test case for landingpads to handle this case, but I
had forgotten to consider PHI instructions preceding the EH_LABEL in the
landingpad.
PR45261
To make sure that replaced operands get DCEd. This drops one
iteration from gepphigep.ll, which is still not optimal.
This was the last test case performing more than 3 iterations.
NFC-ish, only worklist order should change.
Because this code does not use the IC-aware replaceInstUsesWith()
helper, we need to manually push users to the worklist.
This is NFC-ish, in that it may only change worklist order.
MC already knows how to emulate the .weak directive (with its ELF
semantics; i.e., an undefined weak symbol resolves to 0, and a defined
weak symbol has lower link precedence than a strong symbol of the same
name) using COFF weak externals. Plumb this through the ASM printer too,
so that definitions marked with __attribute__((weak)) at the language
level (which gets translated to weak linkage at the IR level) have the
corresponding .weak directive emitted. Note that declarations marked
with __attribute__((weak)) at the language level (which translates to
extern_weak at the IR level) already have .weak directives emitted.
Weak*/linkonce* symbols without an associated comdat (in particular, ones
generated with __attribute__((weak)) in C/C++) were earlier emitted as
normal unique globals, as the comdat is required to provide the linkonce
semantics. This change makes sure they are emitted as .weak instead,
allowing other symbols to override them.
Rename the existing coff-weak.ll test to coff-linkonce.ll. I'm not
quite sure what that test covers, since the behavior being tested in it
(the emission of a one_only section) is just a result of passing
-function-sections to llc; the linkonce_odr makes no difference.
Add a new coff-weak.ll which tests the new directive emission.
Based on an previous patch by Shoaib Meenai.
Differential Revision: https://reviews.llvm.org/D44543
This seems to be used in some resource files, e.g.
f3217573d7/include/wx/msw/wx.rc (L28).
MSVC rc.exe and GNU windres both allow any value here, and silently
just truncate to uint16_t range. This just explicitly allows the
-1 value and errors out on others - the same was done for control
IDs in dialogs in c1a67857ba.
Differential Revision: https://reviews.llvm.org/D76951
This change implements constant folding to constrained versions of
intrinsics, implementing rounding: floor, ceil, trunc, round, rint and
nearbyint.
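A minimal sketch of the kind of call that can now be folded (hypothetical test):
```llvm
declare double @llvm.experimental.constrained.floor.f64(double, metadata)

define double @fold_floor() {
  ; with a constant operand and "fpexcept.ignore", this can be folded to 2.0
  %r = call double @llvm.experimental.constrained.floor.f64(double 2.500000e+00, metadata !"fpexcept.ignore")
  ret double %r
}
```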
Differential Revision: https://reviews.llvm.org/D72930
When we see this:
```
%a = COPY $physreg
...
SOMETHING implicit-def $physreg
...
%b = COPY $physreg
```
The two copies are not equivalent, and so we shouldn't perform any folding
on them.
When we have two instructions which use a physical register, check that they
define the same virtual register(s) as well.
e.g., if we run into this case
```
%a = COPY $physreg
...
%b = COPY %a
```
we can say that the two copies are the same, and can be folded.
Differential Revision: https://reviews.llvm.org/D76890
Currently -fno-unroll-loops is ignored when doing LTO on Darwin. This
patch adds a new -lto-no-unroll-loops option to the LTO code generator
and forwards it to the linker if -fno-unroll-loops is passed.
Reviewers: thegameg, steven_wu
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D76916
Summary:
"Per CU" is a bit simplistic for gfx10, but I couldn't think of a better
name.
Reviewers: arsenm, rampitec, nhaehnle, dstuttard, tpr
Subscribers: kzhuravl, jvesely, wdng, yaxunl, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76861
Generalizes D62014 (R_386_NONE/R_X86_64_NONE).
Unlike ARM (D76746) and AArch64 (D76754), we cannot delete FK_NONE from
getFixupKindSize because FK_NONE is still used by R_386_TLS_DESC_CALL/R_X86_64_TLSDESC_CALL.
Having arbitrary passes looking at the TargetOptions is pretty
messy. This was also disregarding if a function already had an
explicit attribute setting on it. opt/llc now add the attributes to
functions that don't specify the attribute. clang and lld do not call
the function to do this, which they maybe should.
This was also treating unsafe-fp-math as implying the others, and
setting the other attributes based on it. This is not done anywhere
else, and I'm not sure is correct based on the current description of
the option bit.
Effectively reverts 1d8cf2be89
Generalizes D61992. In GNU as, the .reloc directive supports arbitrary relocation types.
A MCFixupKind value `V` larger than or equal to FirstLiteralRelocationKind
is used to represent the relocation type whose number is V-FirstLiteralRelocationKind.
This is useful for linker tests. Without the feature the assembler
cannot produce certain relocation records (e.g. R_ARM_ALU_PC_G0/R_ARM_LDR_PC_G0)
This helps move forward D75349 and D76575.
Differential Revision: https://reviews.llvm.org/D76746
This will cause the operation to be repeated in both a masked and another masked
or unmasked form. This can be a waste of execution resources.
Differential Revision: https://reviews.llvm.org/D60940
Summary:
Implement several XCOFF hooks to get '-r' option working for llvm-objdump -r.
Reviewer: DiggerLin, hubert.reinterpretcast, jhenderson, MaskRay
Differential Revision: https://reviews.llvm.org/D75131
Given that some instructions generate wider result elements than
their inputs, flag them as being able to generate non zeros in the
false lanes.
Differential Revision: https://reviews.llvm.org/D76766
We might have a crash scenario when we have an invalid DT_STRTAB value
that is larger than the file size. I've added a test case to demonstrate.
Differential revision: https://reviews.llvm.org/D76706
Summary:
There is a tiny logic error in D75300 that makes branches not be
correctly aligned with the option -x86-pad-max-prefix-size.
Reviewers: reames, MaskRay, craig.topper, LuoYuanke, jyknight
Reviewed By: reames
Subscribers: hiraditya, llvm-commits, annita.zhang
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76285
Summary: This patch is the first effort toward adding basic optimizations for FREEZE in SelDag.
Reviewers: spatel, lebedev.ri
Reviewed By: spatel
Subscribers: xbolva00, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76707
Fix the LowerGlobalDtors pass to run destructors in the same order as the
regular LLVM destructor lowering -- in reverse order. Adjacent
destructors with the same associated object are grouped, but destructors
are not reordered based on associated objects.
Differential Revision: https://reviews.llvm.org/D70685
These transforms rely on a vector reduction flag on the SDNode
set by SelectionDAGBuilder. This flag exists because SelectionDAG
can't see across basic blocks so SelectionDAGBuilder is looking
across and saving the info. X86 is the only target that uses this
flag currently. By removing the X86 code we can remove the flag
and the SelectionDAGBuilder code.
This patch adds a dedicated IR pass for X86 that looks across the
blocks and transforms the IR into a form that the X86 SelectionDAG
can finish.
An advantage of this new approach is that we can enhance it to
shrink the phi nodes and final reduction tree based on the zeroes
that we need to concatenate to bring the partially reduced
reduction back up to the original width.
Differential Revision: https://reviews.llvm.org/D76649
We can improve computeKnownBits results by avoiding excess bitcasts.
For this pattern we were doing:
(v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK))))
By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits and allows shuffle combining to peek through later on.
This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.
Summary:
For a source file "test.c"
void foo() {};
llc will generate assembly code as (assembly path):
.globl foo
.globl .foo
.csect foo[DS]
foo:
.long .foo
.long TOC[TC0]
.long 0
and symbol table as (xcoff object file)
[4] m 0x00000004 .data 1 unamex foo
[5] a4 0x0000000c 0 0 SD DS 0 0
[6] m 0x00000004 .data 1 extern foo
[7] a4 0x00000004 0 0 LD DS 0 0
After the first patch, the assembly will be:
.globl foo[DS] # -- Begin function foo
.globl .foo
.align 2
.csect foo[DS]
.long .foo
.long TOC[TC0]
.long 0
and the symbol table will be:
[6] m 0x00000004 .data 1 extern foo
[7] a4 0x00000004 0 0 DS DS 0 0
This changes the code for the assembly path and the XCOFF object file path in llc.
Reviewers: Jason Liu
Subscribers: wuzish, nemanjai, hiraditya
Differential Revision: https://reviews.llvm.org/D76162
Summary:
The linker is free to relax this (relocation R_PPC_GOT_TPREL16) against
R_PPC_TLS, if it sees fit (initial exec to local exec). If r0 is used,
this can generate execution-invalid code (it converts to 'addi %rX, %r0,
FOO', which translates in PPC lingo to 'li %rX, FOO'). Forbid this
instead.
This fixes static binaries using locales on FreeBSD/powerpc
(tested on FreeBSD/powerpcspe).
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D76662
As discussed on PR31443, we should be trying to use PACKUS for binary truncation patterns to reduce the number of shuffles.
The plan is to support AVX2+ targets once we've worked around PR45315 - we fail to peek through a VBROADCAST_LOAD mask to recognise zero upper bits in a PACKUS pattern.
We should also be able to add support for v8i16 and possibly 256/512-bit vectors as well.
```
// llvm-objdump -d output (before)
0: bl .-4
4: bl .+0
8: bl .+4
// llvm-objdump -d output (after) ; GNU objdump -d
0: bl 0xfffffffc / bl 0xfffffffffffffffc
4: bl 0x4
8: bl 0xc
```
Many Operands are not annotated as OPERAND_PCREL.
They are not affected (e.g. `b .+67108860`). I plan to fix them in future patches.
Modified test/tools/llvm-objdump/ELF/PowerPC/branch-offset.s to test
address space wraparound for powerpc32 and powerpc64.
Reviewed By: sfertile, jhenderson
Differential Revision: https://reviews.llvm.org/D76591
```
// llvm-objdump -d output (before)
400000: e8 0b 00 00 00 callq 11
400005: e8 0b 00 00 00 callq 11
// llvm-objdump -d output (after)
400000: e8 0b 00 00 00 callq 0x400010
400005: e8 0b 00 00 00 callq 0x400015
// GNU objdump -d. The lack of 0x is not ideal because the result cannot be re-assembled
400000: e8 0b 00 00 00 callq 400010
400005: e8 0b 00 00 00 callq 400015
```
In llvm-objdump, we pass the address of the next MCInst. Ideally we
should just pass the address of the current MCInst; unfortunately we
cannot call X86MCCodeEmitter::encodeInstruction (X86MCCodeEmitter
requires MCInstrInfo and MCContext) to get the length of the MCInst.
MCInstPrinter::printInst has other callers (e.g. llvm-mc -filetype=asm, llvm-mca) which set Address to 0.
They leave MCInstPrinter::PrintBranchImmAsAddress as false and this change is a no-op for them.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D76580
In some scalarize/split result methods (unary, binary, ...), flags in
SDNode were not passed down, which may lead to unexpected results under
unsafe floating-point optimizations. This patch fixes them (though the
fix may not be complete).
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D76832
As long we extract from a source vector with smaller elements and we zero-extend the element in the final shuffle mask then we can safely peek through truncations and any/zero-extensions to find the source extraction.
Summary:
The InstAlias definitions below were duplicated; this patch removes the
repeated definitions:
mtdec/mfdec mtsdr1/mfsdr1 mtsrr0/mfsrr0 mtsrr1/mfsrr1 mtasr
Reviewed By: nemanjai, steven.zhang
Differential Revision: https://reviews.llvm.org/D75821
Summary:
This patch adds initial support for the following intrinsics:
* llvm.aarch64.sve.st2
* llvm.aarch64.sve.st3
* llvm.aarch64.sve.st4
For storing two, three and four vectors' worth of data. Basic codegen for
the reg+immediate forms is implemented. Reg+reg addressing modes will be
addressed in a later patch.
These intrinsics are intended for use in the Arm C Language Extension
(ACLE).
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D75947
Summary:
This patch introduces command-line support for the Armv8.6-a architecture and assembly support for BFloat16. Details can be found at
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
in addition to the GCC patch for the 8.6-a CLI:
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-11/msg02647.html
In detail, this patch adds:
- -march options for armv8.6-a
- BFloat16 assembly
This is part of a patch series, starting with command-line and Bfloat16
assembly support. The subsequent patches will upstream intrinsics
support for BFloat16, followed by Matrix Multiplication and the
remaining Virtualization features of the armv8.6-a architecture.
Based on work by:
- labrinea
- MarkMurrayARM
- Luke Cheeseman
- Javed Asbar
- Mikhail Maltsev
- Luke Geeson
Reviewers: SjoerdMeijer, craig.topper, rjmccall, jfb, LukeGeeson
Reviewed By: SjoerdMeijer
Subscribers: stuij, kristof.beyls, hiraditya, dexonsmith, danielkiss, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D76062
Some MVE floating point instructions have gpr register variants that take
the scalar gpr value and splat it to all lanes. In order to accept
them in loops, the shuffle_vector and insert need to be sunk down into
the loop, next to the instruction so that ISel can see the whole
pattern.
This does that sinking for FAdd, FSub, FMul and FCmp. The patterns for
mul are slightly more constrained as there are no fms variants taking
register arguments.
Differential Revision: https://reviews.llvm.org/D76023
I want to extend D60940 to scalar FP which will prevent forming
masked instructions if the arithmetic op has another use. To
prepare for that, this patch updates tests to avoid repeating
the operation multiple times with different masking.
Previously, we would ignore alloca alignment when building the frame
and just use the natural alignment of the allocated type. If an alloca
is over-aligned for its IR type, this could lead to a frame entry with
inadequate alignment for the downstream uses of the alloca.
Since highly-aligned fields also tend to produce poor layouts under a
naive layout algorithm, I've also switched coroutine frames to use the
new optimal struct layout algorithm.
In order to communicate the frame size and alignment to later passes,
I needed to set align+dereferenceable attributes on the frame-pointer
parameter of the resume function. This is clearly the right thing to
do, but the align attribute currently seems to result in assumptions
being added during inlining that the optimizer cannot easily remove.
We can legalize the MUL operation for v8i16 with the instruction (vmladduhm A, B, 0)
when AltiVec is enabled. Currently it is set as Custom and expanded later, which is not
the right way. We can then add a pattern to match mul + add with (vmladduhm A, B, C).
Reviewed By: Nemanjai
Differential Revision: https://reviews.llvm.org/D76751
More mechanical splitting of tests so we can add a one use
check to the isel patterns for forming masked instructions.
In a few cases I changed immediates of instructions in
order to avoid needing to split.
The AMDGPUPropagateAttributes pass was skipping some functions
when cloning. Functions were added to the root set and then skipped
on the next iteration because they were already in the root set,
while they were meant to be processed with different features.
Differential Revision: https://reviews.llvm.org/D76815
AMDGPUPropagateAttributes can swap names while cloning a function.
Only do this if the original symbol was not externally visible.
Differential Revision: https://reviews.llvm.org/D76789
Summary:
This patch allows code sinking in InstCombine to be performed when an instruction has uses in llvm.assume.
A use is considered droppable when it is preferable to modify the user so that the use disappears rather than to prevent a transformation because of the use.
For now, uses are considered droppable if they are in an llvm.assume.
Reviewers: jdoerfert, nikic, spatel, lebedev.ri, sstefan1
Reviewed By: jdoerfert
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73832
Because using -print-imports is not thread-safe, make the test rely on llvm-dis instead.
Also cover the ICALL-PROM part as intended originally.
Differential Revision: https://reviews.llvm.org/D76775
Add support for combining shuffles to AVX512 truncate instructions - another step toward fixing D56387/D66004. It also fixes SKX code on PR31443.
We could probably extend this further to handle non-VLX truncation cases.
Summary:
This patch implements the following CDE intrinsics:
T __arm_vcx1q_m(int coproc, T inactive, uint32_t imm, mve_pred_t p);
T __arm_vcx2q_m(int coproc, T inactive, U n, uint32_t imm, mve_pred_t p);
T __arm_vcx3q_m(int coproc, T inactive, U n, V m, uint32_t imm, mve_pred_t p);
T __arm_vcx1qa_m(int coproc, T acc, uint32_t imm, mve_pred_t p);
T __arm_vcx2qa_m(int coproc, T acc, U n, uint32_t imm, mve_pred_t p);
T __arm_vcx3qa_m(int coproc, T acc, U n, V m, uint32_t imm, mve_pred_t p);
The intrinsics are not part of the released ACLE spec, but internally at
Arm we have reached consensus to add them to the next ACLE release.
Reviewers: simon_tatham, MarkMurrayARM, ostannard, dmgreen
Reviewed By: simon_tatham
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76610
Move the ARM ConstantIsland and LowOverheadLoops passes later in the pipeline
such that they will be run after the upcoming Machine Outlining pass.
Differential Revision: https://reviews.llvm.org/D76065
This pass can handle all the optimization
opportunities found just before code emission.
Presently it includes the vcc branch optimization that was
previously handled in SIInsertSkips.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D76712
A spilled load of an immediate can use MVHI/MVGHI instead.
A compare of a spilled register against an immediate can use CHSI/CGHSI.
A logical compare can use CLFHSI/CLGHSI.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D76055
Summary:
These were accidentally left out of D76123. I added tests for the
other three instructions in this small cross-product family (vqdmlah,
vqrdmlah, vqrdmlash) but missed this one.
Reviewers: miyuki
Reviewed By: miyuki
Subscribers: kristof.beyls, dmgreen, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76714
We need to split tests that rely on isel duplicating operations
for different masking conditions. Repeating the operation is
more costly than emitting the masking separately.
The change here is a mechanical splitting of tests that
call multiple intrinsics in one function into separate
functions that call one intrinsic. We could obviously avoid
the splitting by giving the intrinsics different operands, but
that would need closer scrutiny than just splitting.
to remap object file paths (but not source paths) before
processing. This is meant to be used for Clang objects where the
module cache location was remapped using ``-fdebug-prefix-map``, to
help dsymutil find the Clang module cache.
<rdar://problem/55685132>
Differential Revision: https://reviews.llvm.org/D76391
On Darwin these need to be selected into a function call for the TLS
address lookup. As a result, they can't be moved below a physreg write,
which happens in call sequences. In the long term, we should have some
mechanism in the localizer to prevent localizing into target-specific
atomic instruction sequences.
rdar://60056248
Differential Revision: https://reviews.llvm.org/D76652
This patch integrates operand bundle llvm.assumes [0] with the
Attributor. Most IRAttributes will now look at uses of the associated
value and if there are llvm.assume operand bundle uses with the right
tag we will check if they are in the must-be-executed-context (around
the context instruction). Droppable users, which is currently only
llvm::assume, are handled special in some places now as well.
[0] http://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html
Reviewed By: uenoku
Differential Revision: https://reviews.llvm.org/D74888
These intrinsics take a v4i32/v4f32 input and are supposed to
broadcast elements 0 and 1. Instead the autoupgrade code was
broadcasting elements 0, 1, 2, and 3.
I could fix the autoupgrade, but since it's been broken for years
it seemed better just to steer anyone still trying to use these
intrinsics away completely.
This reverts commit 4e0fe038f4. Re-lands
65b21282c7.
After landing 5ff5ddd0ad to add int3 into
trailing unreachable blocks, we can now remove these extra stack
adjustments without confusing the Win64 unwinder. See
https://llvm.org/PR45064#c4 or X86AvoidTrailingCall.cpp for a full
explanation.
Fixes PR45064.
Record the address of a tail-calling branch instruction within its call
site entry using DW_AT_call_pc. This allows a debugger to determine the
address to use when creating artificial frames.
This creates an extra attribute + relocation at tail call sites, which
constitute 3-5% of all call sites in xnu/clang respectively.
rdar://60307600
Differential Revision: https://reviews.llvm.org/D76336
Summary:
DivRemPairs is unsound with respect to undef values.
```
// bb1:
// %rem = srem %x, %y
// bb2:
// %div = sdiv %x, %y
// -->
// bb1:
// %div = sdiv %x, %y
// %mul = mul %div, %y
// %rem = sub %x, %mul
```
If X can be undef, X should be frozen first.
For example, let's assume that Y = 1 & X = undef:
```
%div = sdiv undef, 1 // %div = undef
%rem = srem undef, 1 // %rem = 0
=>
%div = sdiv undef, 1 // %div = undef
%mul = mul %div, 1 // %mul = undef
%rem = sub %x, %mul // %rem = undef - undef = undef
```
http://volta.cs.utah.edu:8080/z/m7Xrx5
Same for Y. If X = 1 and Y = (undef | 1), %rem in src is either 1 or 0,
but %rem in tgt can be one of many integer values.
This resolves https://bugs.llvm.org/show_bug.cgi?id=42619 .
This miscompilation disappears if the undef value is removed, but that may take a while.
DivRemPairs happens pretty late in the optimization pipeline, so this seemed like a good candidate to fix using freeze without major regression, compared to other broken optimizations.
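As a rough sketch (illustrative only, not the actual DivRemPairs code), the frozen decomposition looks like this, with the frozen dividend used consistently by both the division and the reconstructed remainder:
```
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Expand a remainder in terms of the division while pinning a
// possibly-undef dividend to a single value with freeze.
static Value *expandRemUsingDiv(IRBuilder<> &B, Value *X, Value *Y) {
  Value *FX = B.CreateFreeze(X, "x.frozen"); // pin undef/poison in %x
  Value *Div = B.CreateSDiv(FX, Y, "div");
  Value *Mul = B.CreateMul(Div, Y, "mul");
  return B.CreateSub(FX, Mul, "rem"); // same frozen %x as the division
}
```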
Reviewers: spatel, lebedev.ri, george.burgess.iv
Reviewed By: spatel
Subscribers: wuzish, regehr, nlopes, nemanjai, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76483
This adds a simple fold to combine a VMOVrh load into an integer load.
It is similar to what is already performed for BITCAST, but needs to account
for the types being of different sizes, creating a zero-extending load.
Differential Revision: https://reviews.llvm.org/D76485
Summary:
The directory_count and file_name_count fields are (section 6.2.4 of
DWARF5 spec) supposed to be uleb128s, not bytes. This bug meant that it
was not possible to correctly parse headers with more than 128 files or
directories.
I've found this bug by code inspection, though the limit is so small
someone would have run into it for real sooner or later. I've verified
that the producer side handles many files correctly, and that we are
able to parse such files after this fix.
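As a minimal sketch of the parsing change (illustrative, not the actual DWARFDebugLine code), assuming llvm::DataExtractor is used to walk the header:
```
#include "llvm/Support/DataExtractor.h"

using namespace llvm;

// DWARF v5 directory_count / file_name_count are ULEB128s, not single
// bytes, so counts of 128 or more decode correctly.
static uint64_t readEntryCount(DataExtractor &Data, uint64_t &Offset) {
  return Data.getULEB128(&Offset); // previously read as a single byte
}
```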
Reviewers: dblaikie, jhenderson
Subscribers: aprantl, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76498
These should really be moved over to a ConstantFolding test file,
but since this may overlap with the in-progress D76010 and similar
tests already exist here, we can do that as a later cleanup.
We deliberately split stores of the form
store(truncate(larger-than-legal-type)) into two stores, allowing each
store to perform part of the truncate for free.
There are times however where it makes more sense to use VMOVN to
de-interlace the results back into a single vector, and store that in
one go. This adds a check for that situation, not splitting the store if
it looks like a VMOVN can be more useful.
Differential Revision: https://reviews.llvm.org/D76511
coroutine frame
Currently we move all allocas into the frame when building the coroutine frame in
the CoroSplit pass. However, this can be relaxed.
Since the CoroSplit pass runs after the Inline pass, we can use lifetime intrinsics to
do such analysis: if the scope of a lifetime intrinsic does not cross any suspend
point, rather than moving the alloca to the frame, we can just move it to the entry
block of the corresponding function. This reduces the frame size.
More importantly, this also avoids data races in multithreaded environments.
Consider a function inlined into a coroutine: it starts a thread which accesses
local variables, while after inlining the movement of allocas to the frame means
the frame also accesses them, causing a data race.
Differential Revision: https://reviews.llvm.org/D75664
PR35760 shows an example program which, when compiled with `clang -O0`
or gcc at any optimization level, prints '0'. However, llvm transforms
the program in a way that causes it to print '1'.
Fix the issue by having `AllUsesOfValueWillTrapIfNull` return false when
analyzing a load from a global which is used by an `icmp`. This special
case was untested [0] so this is just deleting dead code.
An alternative fix might be to change the GlobalStatus analysis for the
global to report "Stored" instead of "StoredOnce". However, "StoredOnce"
is appropriate when only one value other than the initializer is stored
to the global.
[0]
http://lab.llvm.org:8080/coverage/coverage-reports/coverage/Users/buildslave/jenkins/workspace/coverage/llvm-project/llvm/lib/Transforms/IPO/GlobalOpt.cpp.html#L662
Differential Revision: https://reviews.llvm.org/D76645
When we find something like this:
```
%a:_(s32) = G_SOMETHING ...
...
%select:_(s32) = G_SELECT %cond(s1), %a, %a
```
We can remove the select and just replace it entirely with `%a` because it's
always going to result in `%a`.
Same if we have
```
%select:_(s32) = G_SELECT %cond(s1), %a, %b
```
where we can deduce that `%a == %b`.
This implements the following cases:
- `%select:_(s32) = G_SELECT %cond(s1), %a, %a` -> `%a`
- `%select:_(s32) = G_SELECT %cond(s1), %a, %some_copy_from_a` -> `%a`
- `%select:_(s32) = G_SELECT %cond(s1), %a, %b` -> `%a` when `%a` and `%b`
are defined by identical instructions
This gives a few minor code size improvements on CTMark at -O3 for AArch64.
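A minimal sketch of the first two cases, assuming GlobalISel's getSrcRegIgnoringCopies helper to look through copies (not the exact combiner code):
```
#include "llvm/CodeGen/GlobalISel/Utils.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"

using namespace llvm;

// Sel is assumed to be a G_SELECT: dst, cond, true-value, false-value.
// If both selected values trace back to the same vreg, the select can be
// replaced with that vreg.
static bool matchTrivialSelect(MachineInstr &Sel, MachineRegisterInfo &MRI,
                               Register &Replacement) {
  Register TrueReg = Sel.getOperand(2).getReg();
  Register FalseReg = Sel.getOperand(3).getReg();
  if (getSrcRegIgnoringCopies(TrueReg, MRI) !=
      getSrcRegIgnoringCopies(FalseReg, MRI))
    return false;
  Replacement = TrueReg;
  return true;
}
```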
Differential Revision: https://reviews.llvm.org/D76523
An analysis of real world code turned up a number of patterns with BUILD_VECTOR
of nodes resulting from operations on extracted vector elements for which we
produce poor code. This addresses those cases. No attempt is made for
completeness as that would entail a large amount of work for something that
there is no evidence of in real code.
Differential revision: https://reviews.llvm.org/D72660
The e500 core has a silicon bug that triggers an illegal instruction
program trap on any sync other than msync. Other cores will typically
ignore illegal sync types, and the documentation even implies that the
'illegal' bits are ignored.
Address this hardware deficiency by only using msync, like the PPC440.
Differential Revision: https://reviews.llvm.org/D76614
There seems to be a small benefit to the legalized sequence for v2f16
round with packed instructions, so allow vectorizing it by reducing
the cost.
An unintended side effect is vectorization of f32 round also
happens. The current FMA logic seems off to me, and isn't checking for
packed instructions.
Debian and some other distributions install llvm-strip as llvm-strip-$major (e.g. `/usr/bin/llvm-strip-9`)
D54193 made it work with llvm-strip-$major but did not add a test.
The behavior was regressed by D69146.
Fixes https://github.com/ClangBuiltLinux/linux/issues/940
Reviewed By: alexshap
Differential Revision: https://reviews.llvm.org/D76562
Since intrinsics can now specify when an argument is required to be
constant, it is now OK to replace arguments with variables if they
aren't. This means intrinsics must now be accurately marked with
immarg.
Two one-use checks were added with rGfdcb27105537,
but only the first one is necessary to limit an
increase in instruction count. The second transform
only creates one instruction, so it is always a
reasonable canonicalization/optimization.
Validate the types (return and argument types) of the found runtime library
function declarations against the expected types.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D76058
Otherwise, the Win64 unwinder considers direct branches to such empty
trailing BBs to be a branch out of the function. It treats such a branch
as a tail call, which can only be part of an epilogue. If the unwinder
misclassifies such a branch as part of the epilogue, it will fail to
unwind the stack further. This can lead to bad stack traces, or failure
to handle exceptions properly. This is described in
https://llvm.org/PR45064#c4, and by the comment at the top of the
X86AvoidTrailingCallPass.cpp file.
It should be safe to insert int3 for such blocks. An empty trailing BB
that reaches this pass is pretty much guaranteed to be unreachable. If
a program executed such a block, it would fall off the end of the
function.
Most of the complexity in this patch comes from threading through the
"EHFuncletEntry" boolean on the MIRParser and registering the pass so we
can stop and start codegen around it. I used an MIR test because we
should teach LLVM to optimize away these branches as a follow-up.
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D76531
We special cased must-tail calls all over the place because they cannot
be modified as other calls can be. However, we already centralized the
modification API so we can centralize the handling as well. This
simplifies the code and allows us to remove must-tail calls completely.
Summary:
Add new generic MIR opcodes G_SADDSAT etc. Add support in IRTranslator
for translating the saturating add/subtract intrinsics to the new
opcodes.
Reviewers: aemerson, dsanders, paquette, arsenm
Subscribers: jvesely, wdng, nhaehnle, rovka, hiraditya, volkan, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76600
Replace single-lane (W... form) vector "multiply and add" and "multiply and
subtract" instructions with equivalent floating point instructions whenever
possible in SystemZShortenInst.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D76370
If the section headers have been removed by a tool such as llvm-objcopy
or llvm-strip, previously llvm-readobj/llvm-readelf would not dump the
dynamic symbols when --dyn-symbols was specified. However, the nchain
value of the DT_HASH data specifies the number of dynamic symbols, so if
it is present, we can use that. This patch implements this behaviour.
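For reference, the relevant piece of the SysV hash table layout is just its second word; a hypothetical helper (not the llvm-readobj code, which goes through the llvm::object APIs) would be:
```
#include <cstdint>

// A DT_HASH table starts with two 32-bit words: nbucket and nchain.
// nchain equals the number of dynamic symbol table entries, so it can be
// used to bound .dynsym even when section headers are stripped.
static uint32_t dynSymCountFromHash(const uint32_t *HashTable) {
  return HashTable[1]; // nchain
}
```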
Fixes https://bugs.llvm.org/show_bug.cgi?id=45089.
Reviewed by: grimar, MaskRay
Differential Revision: https://reviews.llvm.org/D76352
When deciding whether to generate a post-inc load/store, look at the
other memory nodes that use the same base address and, if any precede
the current node, then don't do the combine.
The change only seems to be affecting the Arm backend, which I was
surprised at, but it appears to fix a lot of our issues around MVE
masked load/stores having to store a temporary address after an early
post-increment on a shared base address.
Differential Revision: https://reviews.llvm.org/D75847
Summary:
Widening G_UNMERGE_VALUES to a type which is larger than the
original source type is the same as widening it to the same
type as the source type: in both cases, G_UNMERGE_VALUES has
to be replaced with bit arithmetic. Although the arithmetic
itself is independent of whether the source type is smaller
than or equal to the widened type, widening the source type to the
widened type should result in fewer artifacts being emitted,
since this is the type that the user explicitly requested.
Reviewers: arsenm, dsanders, aemerson, aditya_nandakumar
Reviewed By: arsenm, dsanders
Subscribers: jvesely, wdng, nhaehnle, rovka, hiraditya, volkan, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76494
createPSADBW uses SplitsOpsAndApply so should be able to handle
any size.
Restrict the extract result type to i32 or i64 since that's what
we have coverage for today and probably matches what the
isSimple() check gave us before.
Differential Revision: https://reviews.llvm.org/D76560
This patch attempts to more accurately model the reduction of
power of 2 vectors of types we natively support. This takes into
account the narrowing of vectors that occur as we go from 512
bits to 256 bits, to 128 bits. It also takes into account the use
of wider elements in the shuffles for the first 2 steps of a
reduction from 128 bits. And uses a v8i16 shift for the final step
of vXi8 reduction.
The default implementation uses the legalized type for the arithmetic
for all levels. And uses the single source permute cost of the
legalized type for all levels. This penalizes things like
lack of v16i8 pshufb on pre-sse3 targets and the splitting and
joining that needs to be done for integer types on AVX1. We never
need a v16i8 shuffle for a reduction and we only need split AVX1 ops
when the type is wide and needs to be split. I think we're still
over-costing splits and joins for AVX1, but we're closer now.
I've also removed all pairwise special casing because I don't
think we ever want to generate that on X86. I've also adjusted
the add handling to more accurately account for any type splitting
that occurs before we reach a legal type.
Differential Revision: https://reviews.llvm.org/D76478
This directive inserts code to add $gp to the argument's register when
support for position independent code is enabled.
For example, this code:
.cpadd $4
expands to:
addu $4, $4, $gp
SplitsOpsAndApply will take care of any needed splitting correctly.
All that we need to check is that the vector element count is a
power of 2.
Differential Revision: https://reviews.llvm.org/D76558
D75801 removed the last and only user of this option, so we can
drop it now. The original idea behind this was to only run expensive
transforms under -O3, but apart from the one known bits transform,
this has never really taken off. I believe nowadays the recommendation
is to put expensive transforms in AggressiveInstCombine instead,
though that isn't terribly popular either :)
Differential Revision: https://reviews.llvm.org/D76540
For the folding pattern `x-(fma y,z,u*v) -> (fma -y,z,(fma -u,v,x))`, if
`y*z` is 1, `u*v` is -1 and `x` is -0, the sign of the result would change.
Differential Revision: https://reviews.llvm.org/D76419
Summary:
Avoid blind cast from Operator to ExtractElementInst in
computeKnownBitsFromOperator. This resulted in some crashes
in downstream fuzz testing. Instead we use getOperand directly
on the Operator when accessing the vector/index operands.
Haven't seen any problems with InsertElement and ShuffleVector,
but I believe those could be used in constant expressions as well.
So the same kind of fix as for ExtractElement was also applied for
InsertElement.
When it comes to ShuffleVector we now simply bail out if a dynamic
cast of the Operator to ShuffleVectorInst fails. I've got no
reproducer indicating problems for ShuffleVector, and a fix would be
slightly more complicated as getShuffleDemandedElts is involved.
Reviewers: RKSimon, nikic, spatel, efriedma
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76564
We often widen xmm/ymm vectors to ymm/zmm by insertion into an undef base vector. By letting getTargetShuffleAndZeroables track the undef elts we can help avoid a lot of unnecessary cross-lane shuffles.
Fixes PR44694
Now that rG18c19441d105 has improved VPERM2X128 handling, we can perform this to improve x64->x32 truncation without poor cross-lane issues.
Someday combineX86ShufflesRecursively will handle this, but we're still really bad at dealing with different vector widths.
Technically we can permit EXTLOAD of the LHS operand, but only if all the extended bits are shifted out. Until we have test coverage for that case, I'm just disabling this to fix PR45265.
-fuse-init-array is now the CC1 default but TargetLoweringObjectFileELF::UseInitArray still defaults to false.
The following two unknown OS target triples continue using .ctors/.dtors because InitializeELF is not called.
clang -target i386 -c a.c
clang -target x86_64 -c a.c
This cleanup fixes this as a bonus.
X86SpeculativeLoadHardeningPass::tracePredStateThroughCall can call
MCContext::createTempSymbol before TargetLoweringObjectFileELF::Initialize().
We need to call TargetLoweringObjectFileELF::Initialize() earlier.
test/CodeGen/X86/speculative-load-hardening-indirect.ll
Differential Revision: https://reviews.llvm.org/D71360
Summary:
DataLayout::getTypeStoreSize() returns TypeSize.
For cases where it cannot be a scalable vector (e.g., GlobalVariable),
explicitly call TypeSize::getFixedSize().
For cases where the scalable property doesn't matter (e.g., a check for a
zero-sized type), use TypeSize::isNonZero().
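Sketches of the two patterns (not the exact call sites touched by this patch):
```
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Type.h"

using namespace llvm;

// A GlobalVariable can never have a scalable type, so ask for the fixed
// size explicitly instead of relying on an implicit conversion.
static uint64_t storeSizeOfGlobal(const DataLayout &DL,
                                  const GlobalVariable &GV) {
  return DL.getTypeStoreSize(GV.getValueType()).getFixedSize();
}

// A zero-size check does not care about scalability at all.
static bool hasNonZeroStoreSize(const DataLayout &DL, Type *Ty) {
  return DL.getTypeStoreSize(Ty).isNonZero();
}
```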
Reviewers: sdesmalen, efriedma, apazos, reames
Reviewed By: efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76454
Summary:
Return None if the GEP index type is a scalable vector. Sizes of scalable
vectors are multiplied by a runtime constant.
Avoid transforming:
%a = bitcast i8* %p to <vscale x 16 x i8>*
%tmp0 = getelementptr <vscale x 16 x i8>, <vscale x 16 x i8>* %a, i64 0
store <vscale x 16 x i8> zeroinitializer, <vscale x 16 x i8>* %tmp0
%tmp1 = getelementptr <vscale x 16 x i8>, <vscale x 16 x i8>* %a, i64 1
store <vscale x 16 x i8> zeroinitializer, <vscale x 16 x i8>* %tmp1
into:
%a = bitcast i8* %p to <vscale x 16 x i8>*
%tmp0 = getelementptr <vscale x 16 x i8>, <vscale x 16 x i8>* %a, i64 0
%1 = bitcast <vscale x 16 x i8>* %tmp0 to i8*
call void @llvm.memset.p0i8.i64(i8* align 16 %1, i8 0, i64 32, i1 false)
Reviewers: sdesmalen, efriedma, apazos, reames
Reviewed By: sdesmalen
Subscribers: tschuett, hiraditya, rkruppe, arphaman, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76464
Summary:
When using full LTO in cross-compile settings, instead of generating the
default archive kind of the host platform, we can deduce the archive
kind based on the target triple.
This specifically addresses https://github.com/android/ndk/issues/1209
by making it possible to drop llvm-ar in place of GNU ar without extra
flags.
Reviewers: compnerd, pcc, srhines, danalbert
Subscribers: hiraditya, MaskRay, steven_wu, dexonsmith, rupprecht, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76461
If ExpensiveCombines is enabled (which is the case with -O3 on the
legacy PM and always on the new PM), InstCombine tries to compute
the known bits of all instructions in the hope that all bits end up
being known, which is fairly expensive.
How effective is it? If we add some statistics on how often the
constant folding succeeds and how many KnownBits calculations are
performed and run test-suite we get:
"instcombine.NumConstPropKnownBits": 642,
"instcombine.NumConstPropKnownBitsComputed": 18744965,
In other words, we get one fold for every 30000 KnownBits calculations.
However, the truth is actually much worse: Currently, known bits are
computed before performing other folds, so there is a high chance
that cases that get folded by known bits would also have been
handled by other folds.
What happens if we compute known bits after all other folds
(hacky implementation: https://gist.github.com/nikic/751f25b3b9d9e0860db5dde934f70f46)?
"instcombine.NumConstPropKnownBits": 0,
"instcombine.NumConstPropKnownBitsComputed": 18105547,
So it turns out despite doing 18 million known bits calculations,
the known bits fold does not do anything useful on test-suite.
I was originally planning to move this into AggressiveInstCombine
so it only runs once in the pipeline, but seeing this, I think
we're better off removing it entirely.
As this is the only use of the "expensive combines" mechanism,
it may be removed afterwards, but I'll leave that to a separate patch.
Differential Revision: https://reviews.llvm.org/D75801
UseInitArray is now the CC1 default but TargetLoweringObjectFileELF::UseInitArray still defaults to false.
The following two unknown OS target triples continue using .ctors/.dtors because InitializeELF is not called.
clang -target i386 -c a.c
clang -target x86_64 -c a.c
This cleanup fixes this as a bonus.
Differential Revision: https://reviews.llvm.org/D71360
Ideally SimplifyDemanded should compute the same known bits as
computeKnownBits(). This patch addresses one discrepancy, where
ValueTracking is more powerful: If we have a shl nsw shift, we
know that the sign bit of the input and output must be the same.
If this results in a conflict, the result is poison.
This is implemented in
2c4ca6832f/lib/Analysis/ValueTracking.cpp (L1175-L1179)
and
2c4ca6832f/lib/Analysis/ValueTracking.cpp (L904-L908).
This implements the same basic logic in SimplifyDemanded. It's
slightly stronger, because I return undef instead of zero for the
poison case (which is not an option inside ValueTracking).
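A minimal sketch of the sign-bit propagation (illustrative; the real logic lives in InstCombine's SimplifyDemanded code):
```
#include "llvm/Support/KnownBits.h"

using namespace llvm;

// For `shl nsw`, the sign bit of the result equals the sign bit of the
// input, so copy whatever is known about the input's sign bit into the
// output. A resulting conflict means the shift is poison.
static void refineShlNswSignBit(const KnownBits &In, KnownBits &Out) {
  unsigned SignBit = Out.getBitWidth() - 1;
  if (In.Zero[SignBit])
    Out.Zero.setBit(SignBit); // input known non-negative
  else if (In.One[SignBit])
    Out.One.setBit(SignBit);  // input known negative
}
```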
As mentioned in https://reviews.llvm.org/D75801#inline-698484,
we could detect poison in more cases, this just establishes parity
with the existing logic.
Differential Revision: https://reviews.llvm.org/D76489
Summary:
It can be the case that a vector type is legal but the corresponding
scalar type is not legal for an architecture (i8 vs. v16i8 on AArch64).
Check if the scalar type created when folding
truncate(build_vector(x,y)) -> build_vector(truncate(x),truncate(y))
is legal if we are running after the type legalizer.
This fixes https://github.com/android/ndk/issues/1207.
Reviewers: RKSimon, srhines
Subscribers: kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76312
Adds/changes some types in the ByVal cc test so that they aren't all
structs of arrays of bytes, and adds testing for passing multiple
ByVal arguments.
The sll/srl/sra scalar vector shifts can be replaced with generic shifts if the shift amount is known to be in range.
This also required public DemandedElts variants of llvm::computeKnownBits to be exposed (PR36319).
Summary:
I've implemented them as target-specific IR intrinsics rather than
using `@llvm.experimental.vector.reduce.add`, on the grounds that the
'experimental' intrinsic doesn't currently have much code generation
benefit, and my replacements encapsulate the sign- or zero-extension
so that you don't expose the illegal MVE vector type (`<4 x i64>`) in
IR.
The machine instructions come in two versions: with and without an
input accumulator. My new IR intrinsics, like the 'experimental' one,
don't take an accumulator parameter: we represent that by just adding
on the input value using an ordinary i32 or i64 add. So if you write
the `vaddvaq` C-language intrinsic with an input accumulator of zero,
it can be optimised to VADDV, and conversely, if you write something
like `x += vaddvq(y)` then that can be combined into VADDVA.
Most of this is achieved in isel lowering, by converting these IR
intrinsics into the existing `ARMISD::VADDV` family of custom SDNode
types. For the difficult case (64-bit accumulators), isel lowering
already implements the optimization of folding an addition into a
VADDLV to make a VADDLVA; so once we've made a VADDLV, our job is
already done, except that I had to introduce a parallel set of ARMISD
nodes for the //predicated// forms of VADDLV.
For the simpler VADDV, we handle the predicated form by just leaving
the IR intrinsic alone and matching it in an ordinary dag pattern.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: dmgreen
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76491
Summary:
I've implemented these as target-specific IR intrinsics, because
they're not //quite// enough like @llvm.experimental.vector.reduce.min
(which doesn't take the extra scalar parameter). Also this keeps the
predicated and unpredicated versions looking similar, and the
floating-point minnm/maxnm versions fold into the same schema.
We had a couple of min/max reductions already implemented, from the
initial pathfinding exercise in D67158. Those were done by having
separate IR intrinsic names for the signed and unsigned integer
versions; as part of this commit, I've changed them to use a flag
parameter indicating signedness, which is how we ended up deciding
that the rest of the MVE intrinsics family ought to work. So now
hopefully the whole lot is consistent.
In the new llc test, the output code from the `v8f16` test functions
looks quite unpleasant, but most of it is PCS lowering (you can't pass
a `half` directly in or out of a function). In other circumstances,
where you do something else with your `half` in the same function, it
doesn't look nearly as nasty.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: MarkMurrayARM
Subscribers: kristof.beyls, hiraditya, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76490
The zero-sized structs force creation of a stack object of size 1, align
8 in the locals area, but otherwise have no effect on the calling convention
code, i.e. they consume no registers or stack space in the parameter save area.
The 32-bit codegen has 8 bytes of padding to fit the new stack object so
the stack size stays the same. 64-bit codegen has no padding in the stack
frames allocated, so 8 bytes are added, and because of the 16-byte aligned
stack, the stack size increases from 112 bytes to 128.
Summary:
For some reason the order in which we call getNegatedExpression
for the involved operands, after a call to isCheaperToUseNegatedFPOps,
seems to matter. This patch includes a new test case in
test/CodeGen/X86/fdiv.ll that crashes if we reverse the order of
those calls. Before this patch that could happen depending on
which compiler was used when building llvm. With my GCC
version (7.4.0) I got the crash, because it seems to use
a different argument evaluation order compared to clang.
All other users of isCheaperToUseNegatedFPOps already used this
pattern with unfolded/ordered calls to getNegatedExpression, so
this patch is aligning visitFDIV with the other use cases.
This patch simply deals with the non-determinism for FDIV. While
the underlying problem with getNegatedExpression is discussed
further in D76439.
Reviewers: spatel, RKSimon
Reviewed By: spatel
Subscribers: hiraditya, mgrang, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76319
Summary:
Currently we custom select add/sub with carry-out to the scalar form, relying on later replacing them with the vector form if necessary.
This change enables custom selection code to take the divergence of adde/addc SDNodes into account and select the appropriate form in one step.
Reviewers: arsenm, vpykhtin, rampitec
Reviewed By: arsenm, vpykhtin
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa
Differential Revision: https://reviews.llvm.org/D76371
Summary:
This patch implements the following intrinsics:
uint8x16_t __arm_vcx1q_u8 (int coproc, uint32_t imm);
T __arm_vcx1qa(int coproc, T acc, uint32_t imm);
T __arm_vcx2q(int coproc, T n, uint32_t imm);
uint8x16_t __arm_vcx2q_u8(int coproc, T n, uint32_t imm);
T __arm_vcx2qa(int coproc, T acc, U n, uint32_t imm);
T __arm_vcx3q(int coproc, T n, U m, uint32_t imm);
uint8x16_t __arm_vcx3q_u8(int coproc, T n, U m, uint32_t imm);
T __arm_vcx3qa(int coproc, T acc, U n, V m, uint32_t imm);
Most of them are polymorphic. Furthermore, some intrinsics are
polymorphic in 2 or 3 parameter types; such polymorphism is not
supported by the existing MVE/CDE TableGen backends, and we don't
really want a combinatorial explosion caused by 1000 different
combinations of 3 vector types. Because of this some intrinsics are
implemented as macros involving a cast of the polymorphic arguments to
uint8x16_t.
The IR intrinsics are even more restricted in terms of types: all MVE
vectors are cast to v16i8.
Reviewers: simon_tatham, MarkMurrayARM, dmgreen, ostannard
Reviewed By: MarkMurrayARM
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76299
Summary:
This change implements ACLE CDE intrinsics that translate to
instructions working with general-purpose registers.
The specification is available at
https://static.docs.arm.com/101028/0010/ACLE_2019Q4_release-0010.pdf
Each ACLE intrinsic gets a corresponding LLVM IR intrinsic (because
they have distinct function prototypes). Dual-register operands are
represented as pairs of i32 values. Because of this the instruction
selection for these intrinsics cannot be represented as TableGen
patterns and requires custom C++ code.
Reviewers: simon_tatham, MarkMurrayARM, dmgreen, ostannard
Reviewed By: MarkMurrayARM
Subscribers: kristof.beyls, hiraditya, danielkiss, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76296
Prior to this change, for non-relocatable objects llvm-readobj would
assume that all symbols that corresponded to a stack size section's
entries were in the section specified by the section's sh_link field.
In the presence of an output section description combining
SHF_LINK_ORDER sections linking different output sections, this cannot
be respected, since linker script section patterns are "by name" by
nature. Consequently, the sh_link value would not be correct for all
section entries.
This patch changes llvm-readobj to ignore the section of symbols in a
non-relocatable object.
Fixes https://bugs.llvm.org/show_bug.cgi?id=45228.
Reviewed by: grimar, MaskRay
Differential Revision: https://reviews.llvm.org/D76425
It seems we do not test well how we print relocation addends.
Also, the behavior of the dumpers does not seem to be ideal here
(llvm-readelf does not match GNU readelf, as the test case shows).
This patch adds a test case to document the current behavior.
Differential revision: https://reviews.llvm.org/D75671
The MVE VDUP instruction takes a GPR and splats it into every lane of a
vector register. Unlike NEON we do not have a VDUPLANE equivalent
instruction, doing the same splat from an fp register. Previously a VDUP
to a v4f32/v8f16 would be represented as a (v4f32 VDUP f32), which
would mean the instruction pattern needs to add a COPY_TO_REGCLASS to
the GPR.
Instead this now converts that earlier during an ISel DAG combine,
converting (VDUP x) to (VDUP (bitcast x)). This can allow instruction
selection to tell that the input needs to be an i32, which in one of the
testcases allows it to use ldr (or specifically ldm) over (vldr;vmov).
Whilst this is simple enough for floats, as the type sizes are the same,
there is no BITCAST equivalent for getting a half into an i32. This uses
a VMOVrh ARMISD node, which doesn't know the same tricks yet.
Differential Revision: https://reviews.llvm.org/D76292
Floating point positive zero can be selected using fmv.w.x / fmv.d.x /
fcvt.d.w and the zero source register.
Differential Revision: https://reviews.llvm.org/D75729
If a call argument has the "returned" attribute, we can simplify
the call to the value of that argument. This was already partially
handled by InstSimplify/InstCombine for the case where the argument
is an integer constant, and the result is thus known via known bits.
The non-constant (or non-int) argument cases weren't handled though.
This previously landed as an InstSimplify transform, but was reverted
due to assertion failures when compiling the Linux kernel. The reason
is that simplifying a call to another call breaks assumptions in
call graph updating during inlining. As the code is not easy to fix,
and there is no particularly strong motivation for having this in
InstSimplify, the transform is only performed in InstCombine instead.
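A minimal sketch of the fold (not the exact InstCombine code):
```
#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// If the callee marks an argument as `returned`, the call's result can be
// replaced by that argument, provided the types match.
static Value *simplifyReturnedArgCall(CallBase &CB) {
  if (Value *Arg = CB.getReturnedArgOperand())
    if (Arg->getType() == CB.getType())
      return Arg;
  return nullptr;
}
```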
Differential Revision: https://reviews.llvm.org/D75815
This is the same change as D75824, but for two cases where
InstCombine performs the same optimization: Replacing an instruction
whose bits are fully known with a constant. This is not (generally)
legal for musttail calls.
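A sketch of the guard both places now need (illustrative only):
```
#include "llvm/IR/Instructions.h"

using namespace llvm;

// A musttail call must remain immediately followed by (an optional bitcast
// and) a return of its own value, so its result cannot be replaced by a
// constant even when all bits are known.
static bool canReplaceResultWithConstant(const Instruction &I) {
  if (auto *CI = dyn_cast<CallInst>(&I))
    if (CI->isMustTailCall())
      return false;
  return true;
}
```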
Differential Revision: https://reviews.llvm.org/D76457
For MemoryPhis, we have to avoid the case where the MemoryPhi may be executed
before the access we are currently looking at.
To do this we do a post-order numbering of the basic blocks in the
function and bail out once we reach a MemoryPhi with a larger (or equal)
post-order block number than the current MemoryAccess.
This changes the order in which we visit stores for elimination.
This patch also adds support for exploring multiple paths. We keep a worklist (ToCheck) of memory accesses that might be eliminated by our starting MemoryDef or MemoryPhis for further exploration. For MemoryPhis, we add the incoming values to the worklist, for MemoryDefs we add the defining access.
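A minimal sketch of the numbering step, assuming llvm::post_order over the function's CFG (not the actual DSE code):
```
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"

using namespace llvm;

// Number blocks in post-order once per function. A MemoryPhi whose block
// has a post-order number greater than or equal to that of the current
// access's block may be executed before the access, so we bail out there.
static DenseMap<BasicBlock *, unsigned> computePostOrderNumbers(Function &F) {
  DenseMap<BasicBlock *, unsigned> PONumbers;
  unsigned N = 0;
  for (BasicBlock *BB : post_order(&F))
    PONumbers[BB] = N++;
  return PONumbers;
}
```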
Reviewers: dmgreen, rnk, efriedma, bryant, asbirlea
Reviewed By: asbirlea
Differential Revision: https://reviews.llvm.org/D72148
If there were no free VGPRs we would need two emergency spill slots for register
scavenging during PEI/frame index elimination. Reuse 'ResultReg' for scale
calculation so that only one spill is needed.
Differential Revision: https://reviews.llvm.org/D76387
Apart from the argument registers, set the CostPerUse
value as per the ratio reg_index/allocation_granularity.
This is a pre-commit for introducing the scratch registers
in the ABI. This change should help achieve a more balanced
register allocation.
Differential Revision: https://reviews.llvm.org/D76417
For now, when the final suspend can be simplified by simplifySuspendPoint,
handleFinalSuspend is executed as well to remove the last case in the switch
instruction. This patch fixes it.
Differential Revision: https://reviews.llvm.org/D76345
Summary:
Swift ABI is based on basic C ABI described here https://github.com/WebAssembly/tool-conventions/blob/master/BasicCABI.md
The Swift calling convention on WebAssembly differs a little from swiftcc
on other architectures.
On non-WebAssembly architectures, swiftcc accepts extra parameters that are
attributed with swifterror or swiftself by the caller. Even if the callee
doesn't have these parameters, the invocation succeeds, ignoring the extra
parameters.
But WebAssembly strictly checks that callee and caller signatures are the
same: https://github.com/WebAssembly/design/blob/master/Semantics.md#calls
So at the WebAssembly level, all swiftcc functions end up with extra arguments,
and all function definitions and invocations explicitly have additional
parameters to fill swifterror and swiftself.
This patch supports this signature difference for swiftself and swifterror
when the cc is swiftcc.
e.g.
```
declare swiftcc void @foo(i32, i32)
@data = global i8* bitcast (void (i32, i32)* @foo to i8*)
define swiftcc void @bar() {
%1 = load i8*, i8** @data
%2 = bitcast i8* %1 to void (i32, i32, i32)*
call swiftcc void %2(i32 1, i32 2, i32 swiftself 3)
ret void
}
```
For swiftcc, emit additional swiftself and swifterror parameters
during lowering if they aren't already present. These additional parameters
are added for both the callee and the caller.
They are necessary to match callee and caller signature for direct and
indirect function call.
Differential Revision: https://reviews.llvm.org/D76049
Summary:
These were merged to the SIMD proposal in
https://github.com/WebAssembly/simd/pull/128.
Depends on D76397 to avoid merge conflicts.
Reviewers: aheejin
Subscribers: dschuff, sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76399
Port over the following:
- shuffle undef, undef, any_mask -> undef
- shuffle anything, anything, undef_mask -> undef
This sort of thing shows up a lot when you try to bugpoint code containing
shufflevector.
Differential Revision: https://reviews.llvm.org/D76382
Add pseudo instructions for ldrsbt/ldrht/ldrsht with implicit immediate
and add fall back C++ code to transform the instruction to the
equivalent LDRSBTi/LDRHTi/LDRSHTi form.
This is similar to how it has been done in commit
fb3950ec63
This fixes:
https://bugs.llvm.org/show_bug.cgi?id=45070
Summary:
This patch fixes https://bugs.llvm.org/show_bug.cgi?id=44611 by
preventing an infinite loop in the jump threading pass when
-jump-threading-across-loop-headers is on. Specifically, without this
patch, jump threading through two basic blocks would trigger on the
same area of the CFG over and over, resulting in an infinite loop.
Consider testcase PR44611-across-header-hang.ll in this patch. The
first opportunity to thread through two basic blocks is:
from bb_body2 through bb_header and bb_body1 to bb_body2.
The pass duplicates bb_header and bb_body1 as, say, bb_header.thread1
and bb_body1.thread1. Since bb_header contains a successor edge back
to itself, bb_header.thread1 also contains a successor edge to
bb_header, immediately giving rise to the next jump threading
opportunity:
from bb_header.thread1 through bb_header and bb_body1 to bb_body2.
After that, we repeatedly thread an incoming edge into bb_header
through bb_header and bb_body1 to bb_body2. In other words, we keep
peeling one iteration from bb_header's self loop.
The patch fixes the problem by preventing the pass from duplicating a
basic block containing a self loop.
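A minimal sketch of the check (illustrative; the actual pass phrases it in terms of the blocks it is about to duplicate):
```
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"

using namespace llvm;

// Refuse to duplicate a block that branches back to itself: each
// duplication just peels one more iteration and recreates the same
// threading opportunity, so the pass would never terminate.
static bool hasSelfLoop(BasicBlock *BB) {
  return llvm::is_contained(successors(BB), BB);
}
```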
Reviewers: wmi, junparser, efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76390
Remove the gap left between the stack pointer (s32) and frame pointer
(s34) now that the scratch wave offset is no longer a part of the
calling convention ABI.
Update llvm/docs/AMDGPUUsage.rst to reflect the change.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75657
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in
the entry function prologue. This allows us to remove the scratch wave
offset register from the calling convention ABI.
As part of this change, allow the use of an inline constant zero for the
SOffset of MUBUF instructions accessing the stack in entry functions
when a frame pointer is not requested/required. Entry functions with
calls still need to set up the calling convention ABI stack pointer
register, and reference it in order to address arguments of called
functions. The ABI stack pointer register remains unswizzled, but is now
wave-relative instead of queue-relative.
Non-entry functions also use an inline constant zero SOffset for
wave-relative scratch access, but continue to use the stack and frame
pointers as before. When the stack or frame pointer is converted to a
swizzled offset it is now scaled directly, as the scratch wave offset no
longer needs to be subtracted first.
Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling
convention.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75138
Support prefixing destructive operations, with the MOVPRFX instruction, to build constructive operations.
Differential Revision: https://reviews.llvm.org/D75064
Previously we multiplied the cost for the table entries by the number of splits needed. But that implies that each split goes through a reduction to scalar independently. I think what really happens is that we AND/OR the split pieces until we're down to a single value with a legal type and then do the special reduction sequence on that.
So to model that, this patch takes the number of splits minus one, multiplied by the cost of an AND/OR at the legal element count, and adds that on top of the table lookup.
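Spelled out as a formula (illustrative names, not the real TTI code):
```
// Combining split halves costs one AND/OR per extra split at the legal
// element count; the table entry then covers a single reduction sequence.
unsigned splitReductionCost(unsigned NumSplits, unsigned AndOrCostAtLegalVT,
                            unsigned TableCost) {
  return (NumSplits - 1) * AndOrCostAtLegalVT + TableCost;
}
```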
Differential Revision: https://reviews.llvm.org/D76400
A number of X86 tests were accidentally disabled in
https://reviews.llvm.org/D73568. This commit re-enables those tests.
```
$ for x86_test in $(gg 'REQUIRES: x86$' llvm/test | fst); do sed -i "" '/REQUIRES: x86/d' $x86_test; done
```
(Note that 'x86' is not an available feature, that's what caused the
tests to be disabled.)
The slli/srli/srai 'immediate' vector shifts (although they are not immediate anymore, to match gcc) can be replaced with generic shifts if the shift amount is known to be in range.
Currently obj2yaml always emits the `EntSize` property when `sh_entsize != 0`.
This is not correct. For example, for a `SHT_DYNAMIC` section, `EntSize == 0`
is abnormal, while `sizeof(ELFT::Dyn)` is the expected default.
To reduce the output produced, we should not dump default values.
yaml2obj tests that show the `sh_entsize` values produced are:
1) For `SHT_REL*` sections: `yaml2obj\ELF\reloc-sec-entry-size.yaml`
2) For `SHT_DYNAMIC`: `yaml2obj\ELF\dynamic-section.yaml`
Differential revision: https://reviews.llvm.org/D76227
We do not have tests that show the current behavior.
This is needed for D76227, which changes the logic for dumping `EntSize` fields.
Differential revision: https://reviews.llvm.org/D76282
Summary:
In order to keep the names consistent with other SVE gather loads, the
intrinsics for gather prefetch are renamed as follows:
* @llvm.aarch64.sve.gather.prfb -> @llvm.aarch64.sve.prfb.gather
Reviewed by: fpetrogalli
Differential Revision: https://reviews.llvm.org/D76421
For PHIs with multiple incoming values, we can improve precision by
using constant ranges for integers. We can over-approximate phis
by merging the incoming values.
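A minimal sketch of the over-approximation (not the actual SCCP code, which also widens when the merged range gets too large):
```
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/ConstantRange.h"

using namespace llvm;

// Over-approximate a phi of integers by unioning the constant ranges of
// its incoming values.
static ConstantRange mergeIncomingRanges(ArrayRef<ConstantRange> Incoming,
                                         unsigned BitWidth) {
  ConstantRange Result = ConstantRange::getEmpty(BitWidth);
  for (const ConstantRange &CR : Incoming)
    Result = Result.unionWith(CR);
  return Result;
}
```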
Reviewers: davide, efriedma, mssimpso
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D71933
We usually start error messages with lowercase letters and most of them
in llvm-dwp follow that rule. This patch fixes a few messages that
started with capital letters.
Differential revision: https://reviews.llvm.org/D76277
`.rela.dyn` is a dynamic relocation section that normally has
no value in `sh_info` field.
The existing `elf-reladyn-section-shinfo.yaml`, which tests this piece, has issues:
1) It does not check the case when we have more than one `SHT_REL[A]`
section with `sh_info == 0` in the object. Because of this it did not catch the issue:
currently we print an excessive "Info" field:
```
- Name: .rela.dyn
Type: SHT_RELA
EntSize: 0x0000000000000018
- Name: .rel.dyn
Type: SHT_REL
EntSize: 0x0000000000000010
Info: ' [1]'
```
2) It seems it could be more generic. I've added a `rel-rela-section.yaml` instead.
Differential revision: https://reviews.llvm.org/D76281
If one of the operands of a binary operator is a constant range, we can
use ConstantRange::binaryOp to approximate the result.
We still handle single element constant ranges as we did previously,
with ConstantExpr::get(), because ConstantRange::binaryOp still gives
worse results in a few cases for single element ranges.
Also note that we bail out early if any of the operands is still unknown.
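A minimal sketch of the range computation (illustrative; the actual code keeps the single-element special case described above):
```
#include "llvm/IR/ConstantRange.h"
#include "llvm/IR/InstrTypes.h"

using namespace llvm;

// Approximate the result range of a binary operator from the ranges of
// its operands.
static ConstantRange approximateBinOp(const BinaryOperator &BO,
                                      const ConstantRange &LHS,
                                      const ConstantRange &RHS) {
  return LHS.binaryOp(BO.getOpcode(), RHS);
}
```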
Reviewers: davide, efriedma, mssimpso
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D71936
On PowerPC, fma is faster than fadd + fmul for some types
(PPCTargetLowering::isFMAFasterThanFMulAndFAdd). We should implement the target
hook isProfitableToHoist to prevent the SimplifyCFG pass from breaking the fma
pattern by hoisting fmul to the predecessor block.
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D76207
Summary:
DataLayout::getTypeAllocSize() returns TypeSize. For cases where the scalable
property doesn't matter (the check for a zero-sized alloca), we should explicitly
call getKnownMinSize() to avoid an implicit type conversion to uint64_t, which is
invalid for scalable vector types.
Reviewers: sdesmalen, efriedma, spatel, apazos
Reviewed By: efriedma
Subscribers: tschuett, hiraditya, rkruppe, psnobl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76386
SelectionDAG CSEs nodes based on their result type and operands, but not their flags. The flags are expected to be intersected when they are CSEd. In SelectionDAGBuilder, for FP nodes we manage both the fast math flags and the nofpexcept flag after the nodes have already been CSEd when they were created with getNode. The management of the fastmath flags before the constrained nodes prevents the nofpexcept management from working correctly.
This commit moves the FMF handling for constrained intrinsics into their visitor and disables the common FMF handling for these nodes.
Differential Revision: https://reviews.llvm.org/D75224
This is fixing up various places that use the implicit
TypeSize->uint64_t conversion.
The new overloads in MemoryLocation.h are already used in various places
that construct a MemoryLocation from a TypeSize, including MemorySSA.
(They were using the implicit conversion before.)
Differential Revision: https://reviews.llvm.org/D76249
This ports some combines from DAGCombiner.cpp which perform some trivial
transformations on instructions with undef operands.
Not having these can make it extremely annoying to find out where we differ
from SelectionDAG by looking at existing lit tests. Without them, we tend to
produce pretty bad code generation when we run into instructions which use
undef operands.
Also remove the nonpow2_store_narrowing testcase from arm64-fallback.ll, since
we no longer fall back on the add.
Differential Revision: https://reviews.llvm.org/D76339
Summary: The following change allows machine outlining to be applied N times, where N is specified by a compiler option; by default the value of N is 1. The motivation is that repeated machine outlining can further reduce code size. Please refer to the presentation "Improving Swift Binary Size via Link Time Optimization" at the LLVM Developers' Meeting in 2019.
Reviewers: aschwaighofer, tellenbach, paquette
Reviewed By: paquette
Subscribers: tellenbach, hiraditya, llvm-commits, jinlin
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71027
Summary:
This is another set of instructions too complicated to be sensibly
expressed in IR by anything short of a target-specific intrinsic.
Given input vectors a,b, the instruction generates intermediate values
2*(a[0]*b[0]+a[1]*b[1]), 2*(a[2]*b[2]+a[3]*b[3]), etc; takes the high
half of each double-width value, and overwrites half the lanes in the
output vector c, which you therefore have to provide the input value
of. Optionally you can swap the elements of b so that they are things
like a[0]*b[1]+a[1]*b[0]; optionally you can round to nearest when
taking the high half; and optionally you can take the difference
rather than the sum of the two products. Finally, saturation is applied
when converting back to a single-width vector lane.
Reviewers: dmgreen, MarkMurrayARM, miyuki, ostannard
Reviewed By: miyuki
Subscribers: kristof.beyls, hiraditya, cfe-commits
Tags: #clang
Differential Revision: https://reviews.llvm.org/D76359
This isn't really usable, and requires using the
-amdgpu-fixed-function-abi flag to work.
Assumes a uniform call target, and will hit a verifier error if the
call target ends up in a VGPR. Also doesn't attempt to do anything
sensible for the reported register/stack usage.
This reverts commit 9bca8fc4cf.
Rearrange handling to avoid changing the instruction in the case where
it's going to be erased and replaced with undef.
Summary:
For the case where "done" bits on existing exports are removed
by unifyReturnBlockSet(), unify all return blocks - even the
uniformly reached ones. We do not want to end up with a non-unified,
uniformly reached block containing a normal export with the "done"
bit cleared.
That case is believed to be rare - possible with infinite loops
in pixel shaders.
This is a fix for D71192.
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76364