The previous code used the type of the first field for the VT
passed to getNode for every field.
I've based the implementation here off what is done in visitSelect
as it removes the need to special case aggregates.
Differential Revision: https://reviews.llvm.org/D77093
So that constant expressions like the following are permitted:
and w0, w0, #~(0xfe<<24)
and w1, w1, #~(0xff<<24)
The behavior matches GNU as (opcodes/aarch64-opc.c:aarch64_logical_immediate_p).
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D75885
The cmyk test is based on the known regression that resulted from:
rGf2fbdf76d8d0
This improves on the equivalent signed min/max change:
rG867f0c3c4d8c
The underlying icmp equivalence is:
~X pred ~Y --> Y pred X
For an icmp with constant, canonicalization results in a swapped pred:
~X < C --> X > ~C
SUMMARY:
For the llvm-objdump -D, the symbol name is used as a label in the disassembly for the specific address (when a symbol address is equal to the virtual address in the dump).
In XCOFF, multiple symbols may have the same name, being differentiated by their storage mapping class. It is helpful to print the QualName and not just the name when forming the output label for a csect symbol. The symbol index further removes any ambiguity caused by duplicate names.
To maintain compatibility with the binutils objdump, the XCOFF-specific --symbol-description option is added to enable the enhanced format.
Reviewers: hubert.reinterpretcast, James Henderson, Jason Liu ,daltenty
Subscribers: wuzish, nemanjai, hiraditya
Differential Revision: https://reviews.llvm.org/D72973
llvm-readobj prints the file path as part of the output, so
--implicit-check-not patterns can accidentally match it. This patch
makes the test more robust by preventing failures if "foo" is in
somebody's path.
Reviewed by: grimar
Differential Revision: https://reviews.llvm.org/D77546
ExecutionEngine/MCJIT/cet-code-model-lager.ll is failing on 32-bit
windows, see llvm-commits thread for fef2dab.
This reverts commit 43f031d312
and the follow-ups fef2dab100 and
6a800f6f62.
It was failing in 32-bit Windows builds with:
$ ":" "RUN: at line 1"
$ "c:\src\llvm_package_944db8a4\build32_stage0\bin\lli.exe" "-mtriple=i686-pc-windows-msvc-elf" "-code-model=large" "C:\src\llvm_package_944db8a4\llvm-project\llvm\test\ExecutionEngine\MCJIT\cet-code-model-lager.ll"
# command stderr:
Assertion failed: Is64Bit && "Large code model is only legal in 64-bit mode.", file C:\src\llvm_package_944db8a4\llvm-project\llvm\lib\Target\X86\X86ISelLowering.cpp, line 4212
Let's see if this helps.
Summary:
This patch adds support for emission of following DWARFv5 macro forms
in .debug_macro section.
1. DW_MACRO_start_file
2. DW_MACRO_end_file
3. DW_MACRO_define_strp
4. DW_MACRO_undef_strp.
Reviewed By: dblaikie, ikudrin
Differential Revision: https://reviews.llvm.org/D72828
truncateVectorWithPACK has its own vector length controls, so we can rely on those directly. This helps some existing truncation to subvector tests, which were being combined later during shuffle lowering at which point the sign/zero bit detection had become obscured preventing lowerShuffleWithPACK working as well as it could.
Summary:
Modify lea/load/store instructions to accept `disp(index, base)`
style addressing mode (called ASX format). Also, uniform the
number of DAG nodes to have 3 operands for this ASX format
instructions, and update selectADDR functions to lower
appropriate MI.
Reviewers: arsenm, simoll, k-ishizaka
Reviewed By: simoll
Differential Revision: https://reviews.llvm.org/D76822
This patch adds a -matrix-default-layout option which can be used to
set the default matrix layout to row-major or column-major (default).
The initial patch updates codegen for loads, stores, binary operators
and matrix multiply.
Reviewers: anemet, Gerolf, andrew.w.kaylor, LuoYuanke
Reviewed By: anemet
Differential Revision: https://reviews.llvm.org/D76325
This patch adds initial fusion for load/multiply/store chains of matrix
operations.
The patch contains roughly two parts:
1. Code generation for a fused load/multiply/store chain (LowerMatrixMultiplyFused).
First, we ensure that both loads of the multiply operands do not alias the store.
If they do, we create new non-aliasing copies of the operands. Note that this
may introduce new basic block. Finally we process TileSize x TileSize blocks.
That is: load tiles from the input operands, multiply and store them.
2. Identify fusion candidates & matrix instructions.
As a first step, collect all instructions with shape info and fusion candidates
(currently @llvm.matrix.multiply calls). Next, try to fuse candidates and
collect instructions eliminated by fusion. Finally iterate over all matrix
instructions, skip the ones eliminated by fusion and lower the rest as usual.
Reviewers: anemet, Gerolf, hfinkel, andrew.w.kaylor, LuoYuanke
Reviewed By: anemet
Differential Revision: https://reviews.llvm.org/D75566
llvm-dwp did not check section identifiers read from input files.
In the case of an unexpected identifier, the calculated index for
Contributions[] pointed outside the array. This fix avoids the issue
by skipping unsupported identifiers.
Differential Revision: https://reviews.llvm.org/D76543
In package files, the base offset provided by index sections should be
used to find the contribution of a unit. The patch adds that base
offset when reading range list tables.
Differential revision: https://reviews.llvm.org/D77401
This fixes the reading of location lists headers for compilation units
in package files by adjusting the reading offset according to the
corresponding record in the unit index. This is required for
DW_FORM_loclistx to work.
Differential revision: https://reviews.llvm.org/D77146
Without the patch, all version 5 compile units in a DWP file read
location tables from the beginning of a .debug_loclists.dwo section.
The patch fixes that by adjusting the reading offset the same way as
for pre-v5 units. The section identifier to find the contribution
entry corresponds to the version of the unit.
Differential revision: https://reviews.llvm.org/D77145
DWARFv5 defines index sections in package files in a slightly different
way than the pre-standard GNU proposal, see Section 7.3.5 in the DWARF
standard and https://gcc.gnu.org/wiki/DebugFissionDWP for GNU proposal.
The main concern here is values for section identifiers, which are
partially overlapped with changed meanings. The patch adds support for
v5 index sections and resolves that difficulty by defining a set of
identifiers for internal use which can represent and distinct values
of both standards.
Differential Revision: https://reviews.llvm.org/D75929
The new and old pass managers (PassManagerBuilder.cpp and
PassBuilder.cpp) are exposed to an `extern` declaration of
`attributor-disable` option which will guard the addition of the
attributor passes to the pass pipelines.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D76871
Add a new overload of StaticLibraryDefinitionGenerator::Load that takes a triple
argument and supports loading archives from MachO universal binaries in addition
to regular archives.
The LLI tool is updated to use this overload.
We had previously limited the shuffle(HORIZOP,HORIZOP) combine to binary shuffles, but we can often merge unary shuffles just as well, folding in UNDEF/ZERO values into the 64-bit half lanes.
For the (P)HADD/HSUB cases this is limited to fast-horizontal cases but PACKSS/PACKUS combines under all cases.
This patch builds upon D76140 by updating metadata on pointer typed
loads in inlined functions, when the load is the return value, and the
callsite contains return attributes which can be updated as metadata on
the load.
Added test cases show this for nonnull, dereferenceable,
dereferenceable_or_null
Reviewed-By: jdoerfert
Differential Revision: https://reviews.llvm.org/D76792
The newly-created constant zero will need an extra register to hold it
in the current statepoint lowering implementation. Remove it if there
exists one.
Summary:
This patch upstreams support the optional ARMv8.0 Data Gathering Hint (DGH)
extension, which adds the Data Gathering Hint instruction to the hint
space.
See ARMv8.0-DGH in the Arm Architecture Reference Manual Armv8 for more
information.
Reviewers: t.p.northover, rengolin, SjoerdMeijer, ab, danielkiss, samparker
Reviewed By: SjoerdMeijer
Subscribers: LukeGeeson, ostannard, kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77097
Summary:
This patch upstreams support for the ARMv8.6A Enhanced Counter Virtualization
(ECV) extension, which adds 6 new system registers.
See ARMv8.6-ECV in the Arm Architecture Reference Manual Armv8 for more
information.
Reviewers: t.p.northover, rengolin, SjoerdMeijer, pcc, ab, chill
Reviewed By: SjoerdMeijer
Subscribers: LukeGeeson, ostannard, kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77094
As discussed in D76983, that patch can turn a chain of insert/extract
with scalar trunc ops into bitcast+extract and existing instcombine
vector transforms end up creating a shuffle out of that (see the
PhaseOrdering test for an example). Currently, that process requires
at least this sequence: -instcombine -early-cse -instcombine.
Before D76983, the sequence of insert/extract would reach the SLP
vectorizer and become a vector trunc there.
Based on a small sampling of public targets/types, converting the
shuffle to a trunc is better for codegen in most cases (and a
regression of that form is the reason this was noticed). The trunc is
clearly better for IR-level analysis as well.
This means that we can induce "spontaneous vectorization" without
invoking any explicit vectorizer passes (at least a vector cast op
may be created out of scalar casts), but that seems to be the right
choice given that we started with a chain of insert/extract, and the
backend would expand back to that chain if a target does not support
the op.
Differential Revision: https://reviews.llvm.org/D77299
Summary:
This patch upstreams support for the ARMv8.6A Fine Grain Traps (FGT)
extension, which adds 5 new system registers.
See ARMv8.6-FGT in the Arm Architecture Reference Manual Armv8 for more
information.
Reviewers: t.p.northover, rengolin, SjoerdMeijer, ab, momchil.velikov
Reviewed By: SjoerdMeijer
Subscribers: LukeGeeson, ostannard, kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76991
The cmyk tests are based on the known regression that resulted from:
rGf2fbdf76d8d0
So this improvement in analysis might be enough to restore that commit.
Summary:
This patch upstreams v8.6A activity monitors virtualization
assembler support, which consists of 32 new system
registers (two groups, each with 16 numbered registers).
See ARMv8.6-AMU in the Arm Architecture Reference Manual Armv8 for more
information.
Reviewers: t.p.northover, rengolin, SjoerdMeijer, ab, john.brawn, ostannard
Reviewed By: ostannard
Subscribers: LukeGeeson, dnsampaio, ostannard, kristof.beyls, hiraditya, danielkiss, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76998
Our existing combine allows to merge the shuffle of 2 similar 64-bit wide 'horizontal ops' (HADD/PACK/etc.) if the shuffle was a UNPCK/MOVSD.
This patch generalizes this to decode any target shuffle mask that can be widened to a 128-bit repeating v2*64 mask, which helps us catch PBLENDW/PBLENDD cases.
If we're packing from 128-bits to 64-bits then we don't need the RHS argument. This helps with register allocation, especially as we avoid repeating a use of the input value.
Summary:
This patch is to teach `llvm-objdump` dump dynamic symbols (`-T` and `--dynamic-syms`). Currently, this patch is not fully compatible with `gnu-objdump`, but I would like to continue working on this in next few patches. It has two issues.
1. Some symbols shouldn't be marked as global(g). (`-t/--syms` has same issue as well) (Fixed by D75659)
2. `gnu-objdump` can dump version information and *dynamically* insert before symbol name field.
`objdump -T a.out` gives:
```
DYNAMIC SYMBOL TABLE:
0000000000000000 w D *UND* 0000000000000000 _ITM_deregisterTMCloneTable
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.2.5 printf
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.2.5 __libc_start_main
0000000000000000 w D *UND* 0000000000000000 __gmon_start__
0000000000000000 w D *UND* 0000000000000000 _ITM_registerTMCloneTable
0000000000000000 w DF *UND* 0000000000000000 GLIBC_2.2.5 __cxa_finalize
```
`llvm-objdump -T a.out` gives:
```
DYNAMIC SYMBOL TABLE:
0000000000000000 w D *UND* 0000000000000000 _ITM_deregisterTMCloneTable
0000000000000000 g DF *UND* 0000000000000000 printf
0000000000000000 g DF *UND* 0000000000000000 __libc_start_main
0000000000000000 w D *UND* 0000000000000000 __gmon_start__
0000000000000000 w D *UND* 0000000000000000 _ITM_registerTMCloneTable
0000000000000000 w DF *UND* 0000000000000000 __cxa_finalize
```
Reviewers: jhenderson, grimar, MaskRay, espindola
Reviewed By: jhenderson, grimar
Subscribers: emaste, rupprecht, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75756
`isKnownReachable` had only interface (always returns true).
Changed it to call `isPotentiallyReachable`.
This change enables deductions of other Abstract Attributes depending on
AAReachability to use reachability information obtained from CFG, and it
can make them stronger.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D76210
This commit was made to settle [[ https://github.com/llvm/llvm-project/issues/175 | this issue on GitHub ]].
I added analysis getters for LoopInfo, DominatorTree, and
PostDominatorTree. And I added a test to show an improvement of the
deduction of `dereferenceable` attribute.
Reviewed By: jdoerfert, uenoku
Differential Revision: https://reviews.llvm.org/D76378
Query AAValueSimplify on pointers in memory accessing instructions to take
advantage of the constant propagation (or any other value simplification) of such values.
Summary:
When we insert a call to the personality function wrapper
(`_Unwind_CallPersonality`) for a catch pad, we store some necessary
info in `__wasm_lpad_context` struct and pass it. One of the info is the
LSDA address for the function. For this, we insert a call to
`wasm.lsda()`, which will be lowered down to the address of LSDA, and
store it in a field in `__wasm_lpad_context`.
There are exceptions to this personality call insertion: catchpads for
`catch (...)` and cleanuppads (for destructors) don't need personality
function calls, because we don't need to figure out whether the current
exception should be caught or not. (They always should.)
There was a little optimization to `wasm.lsda()` call insertion. Because
the LSDA address is the same throughout a function, we don't need to
insert a store of `wasm.lsda()` return value in every catchpad. For
example:
```
try {
foo();
} catch (int) {
// wasm.lsda() call and a store are inserted here, like, in
// pseudocode,
// %lsda = wasm.lsda();
// store %lsda to a field in __wasm_lpad_context
try {
foo();
} catch (int) {
// We don't need to insert the wasm.lsda() and store again, because
// to arrive here, we have already stored the LSDA address to
// __wasm_lpad_context in the outer catch.
}
}
```
So the previous algorithm checked if the current catch has a parent EH
pad, we didn't insert a call to `wasm.lsda()` and its store.
But this was incorrect, because what if the outer catch is `catch (...)`
or a cleanuppad?
```
try {
foo();
} catch (...) {
// wasm.lsda() call and a store are NOT inserted here
try {
foo();
} catch (int) {
// We need wasm.lsda() here!
}
}
```
In this case we need to insert `wasm.lsda()` in the inner catchpad,
because the outer catchpad does not have one.
To minimize the number of inserted `wasm.lsda()` calls and stores, we
need a way to figure out whether we have encountered `wasm.lsda()` call
in any of EH pads that dominates the current EH pad. To figure that
out, we now visit EH pads in BFS order in the dominator tree so that we
visit parent BBs first before visiting its child BBs in the domtree.
We keep a set named `ExecutedLSDA`, which basically means "Do we have
`wasm.lsda()` either in the current EH pad or any of its parent EH
pads in the dominator tree?". This is to prevent scanning the domtree up
to the root in the worst case every time we examine an EH pad: each EH
pad only needs to examine its immediate parent EH pad.
- If any of its parent EH pads in the domtree has `wasm.lsda()`, this
means we don't need `wasm.lsda()` in the current EH pad. We also insert
the current EH pad in `ExecutedLSDA` set.
- If none of its parent EH pad has `wasm.lsda()`
- If the current EH pad is a `catch (...)` or a cleanuppad, done.
- If the current EH pad is neither a `catch (...)` nor a cleanuppad,
add `wasm.lsda()` and the store in the current EH pad, and add the
current EH pad to `ExecutedLSDA` set.
Reviewers: dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77423
Similar to the lowerV16I8Shuffle implementation, for binary compaction v8i16 shuffles we can avoid the PUNPCKLDQ(PSHUFB,PSHUFB) pattern on SSE41+ targets by using PACKUSDW and PBLENDW. Before SSE41 we would need to use PACKSSDW but that requires sign extension that seems to destroy any gains, even on targets without PSHUFB.
This is a bigger gain on AMD than Intel targets but should never be a regression, and avoiding the shuffle mask load(s) is always useful.
Noticed in codegen while dealing with PR31443.
Summary:
Note: This revision is very similar to D62296.
In D75756, we need `getDynamicSymbolIterators()` to skip first NULL symbol in `.dynsym`. And I believe it might be worth pointing this out in a separate patch to gather you experts' opinions.
I have checked that current code base will not be affected by this change.
```
dynamic_symbol_begin()
|- dynamic_symbol_end(): Ok
`- getDynamicSymbolIterators()
|- addDynamicElfSymbols(): llvm/tools/llvm-objdump/llvm-objdump.cpp, Line 934
| Ok, NULL symbol will be omitted by Line 945-947
| StringRef Name = unwrapOrError(Symbol.getName(), Obj->getName());
| if (Name.empty()) continue;
|- dumpSymbolNameFromObject(): llvm/tools/llvm-nm/llvm-nm.cpp, Line 1192
| There's no test for dumping dynamic debugging symbol. This patch helps improve llvm-nm behavior. (we should add test for this later)
`- computeSymbolSizes(): llvm/lib/Object/SymbolSize.cpp, Line 52
|- OProfileJITEventListener::notifyObjectLoaded(): llvm/lib/ExecutionEngine/OProfileJIT/OProfileJITEventListener.cpp, Line 92
| Ok, NULL symbol will be omitted by Line 94-95
| if (!Sym.getType() || *Sym.getType() != SF_Function) continue;
|- IntelJITEventListener::notifyObjectLoaded(): llvm/lib/ExecutionEngine/IntelJITEvents/IntelJITEventListener.cpp, Line 98
| Ok, NULL symbol will be omitted by Line 124-126 (same as previous one)
|- PerfJITEventListener::notifyObjectLoaded(): llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp, Line 244
| Ok, NULL symbol will be omitted by Line 254-256, (same as previous one)
|- SymbolizableObjectFile::create(): llvm/lib/DebugInfo/Symbolize/SymbolizableObjectFile.cpp, Line 73
| Ok, NULL symbol will be omitted by Line 75
| res->addSymbol()
| In addSymbol(), Line 167-168
| if (!Sec || (Obj && Obj->section_end() == *Sec)) return std::error_code();
|- dumpCXXData(): llvm/tools/llvm-cxxdump/llvm-cxxdump.cpp, Line 189
| Ok, NULL symbol will be omitted by Line 199-202
| object::section_iterator SecI = *SecIOrErr;
| // Skip external symbols.
| if (SecI == Obj->section_end())
| continue;
`- printLineInfoForInput(): llvm/tools/llvm-rtdyld/llvm-rtdyld.cpp, Line 418
Ok, NULL symbol will be omitted by Line 430-477
if (Type == object::SymbolRef::ST_Function) {
...
}
```
Reviewers: grimar, jhenderson, MaskRay
Reviewed By: jhenderson, MaskRay
Subscribers: rupprecht, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76081
After finding all such gadgets in a given function, the pass minimally inserts
LFENCE instructions in such a manner that the following property is satisfied:
for all SOURCE+SINK pairs, all paths in the CFG from SOURCE to SINK contain at
least one LFENCE instruction. The algorithm that implements this minimal
insertion is influenced by an academic paper that minimally inserts memory
fences for high-performance concurrent programs:
http://www.cs.ucr.edu/~lesani/companion/oopsla15/OOPSLA15.pdf
The algorithm implemented in this pass is as follows:
1. Build a condensed CFG (i.e., a GadgetGraph) consisting only of the following components:
-SOURCE instructions (also includes function arguments)
-SINK instructions
-Basic block entry points
-Basic block terminators
-LFENCE instructions
2. Analyze the GadgetGraph to determine which SOURCE+SINK pairs (i.e., gadgets) are already mitigated by existing LFENCEs. If all gadgets have been mitigated, go to step 6.
3. Use a heuristic or plugin to approximate minimal LFENCE insertion.
4. Insert one LFENCE along each CFG edge that was cut in step 3.
5. Go to step 2.
6. If any LFENCEs were inserted, return true from runOnFunction() to tell LLVM that the function was modified.
By default, the heuristic used in Step 3 is a greedy heuristic that avoids
inserting LFENCEs into loops unless absolutely necessary. There is also a
CLI option to load a plugin that can provide even better optimization,
inserting fewer fences, while still mitigating all of the LVI gadgets.
The plugin can be found here: https://github.com/intel/lvi-llvm-optimization-plugin,
and a description of the pass's behavior with the plugin can be found here:
https://software.intel.com/security-software-guidance/insights/optimized-mitigation-approach-load-value-injection.
Differential Revision: https://reviews.llvm.org/D75937
Adds a new data structure, ImmutableGraph, and uses RDF to find LVI gadgets and add them to a MachineGadgetGraph.
More specifically, a new X86 machine pass finds Load Value Injection (LVI) gadgets consisting of a load from memory (i.e., SOURCE), and any operation that may transmit the value loaded from memory over a covert channel, or use the value loaded from memory to determine a branch/call target (i.e., SINK).
Also adds a new target feature to X86: +lvi-load-hardening
The feature can be added via the clang CLI using -mlvi-hardening.
Differential Revision: https://reviews.llvm.org/D75936
Adding a pass that replaces every ret instruction with the sequence:
pop <scratch-reg>
lfence
jmp *<scratch-reg>
where <scratch-reg> is some available scratch register, according to the
calling convention of the function being mitigated.
Differential Revision: https://reviews.llvm.org/D75935
We can fix register class of PHI based on its all AGPR uses.
That leaves behind all PHIs which were already processed
earlier. Propagate RC back to PHI operands of a PHI.
Differential Revision: https://reviews.llvm.org/D77344
Extracting to the same index that we are going to insert back into
allows forming select ("blend") shuffles and enables further transforms.
Admittedly, this is a quick-fix for a more general problem that I'm
hoping to solve by adding transforms for patterns that start with an
insertelement.
But this might resolve some regressions known to be caused by the
extract-extract transform (although I have not gotten more details on
those yet).
In the motivating case from PR34724:
https://bugs.llvm.org/show_bug.cgi?id=34724
The combination of subsequent instcombine and codegen transforms gets us this improvement:
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm4
vmovshdup %xmm1, %xmm3 ## xmm3 = xmm1[1,1,3,3]
vaddps %xmm0, %xmm2, %xmm0
vaddps %xmm1, %xmm3, %xmm1
vshufps $200, %xmm4, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm4[0,3]
vinsertps $177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]
-->
vmovshdup %xmm0, %xmm2 ## xmm2 = xmm0[1,1,3,3]
vhaddps %xmm1, %xmm1, %xmm1
vaddps %xmm0, %xmm2, %xmm0
vshufps $200, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm1[0,3]
Differential Revision: https://reviews.llvm.org/D76623
Extend lowerShuffleWithPACK/matchShuffleWithPACK/createPackShuffleMask to handle compaction style shuffle masks that can be lowered to chains of PACKSS/PACKUS if their inputs are suitably sign/zero extended.
This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask should recognise the PACKSS/PACKUS chains.
As discussed in post-commit review in https://reviews.llvm.org/D73501
if the goal of this is to help vectorizer, then we should actually
be teaching vectorizer to do this, because right now this rewrite
is still budget-limited, which isn't what we'd want.
Additionally, while the rest of the patch series was universally profitable,
this particular patch is reportedly (https://reviews.llvm.org/D73501#1905171)
exposing cost-modeling issues on ARM.
So let's just back this particular patch out. Once there's an undo transform,
this could be considered for reintegration.
This reverts commit 44edc6fd2c.
We got some of the potential optimizations with D76727 and D76844.
There are 2 likely enhancements that we could add to -vector-combine
to get most of the remaining cases:
1. Allow bitcasted shuffle mask narrowing (widen the elements).
2. Combine shuffle-of-shuffle into a single shuffle.
This is already partly handled by the x86 backend, but the
tests here show that we still miss some of the potential
combines.
Currently when the target is big-endian vmov.i64 reverses the order of the two
words of the vector. This is correct only when the underlying element type is
32-bit, as actually what it should be doing is considering it a vector of the
underlying type and reversing the elements of that.
Differential Revision: https://reviews.llvm.org/D76515
If we have an element-wise vmov immediate instruction then a subsequent vrev
with width greater or equal to the vmov element width, then that vrev won't do
anything. Add a DAG combine to convert bitcasts that would become such vrevs
into vector_reg_casts instead.
Differential Revision: https://reviews.llvm.org/D76514
1. It's redundant to explicitly check that string does not contain `PAUSE_MM`
token if check that the `PAUSE` token is followed by an angle bracket.
2. The removed `CHECK-NOT` directives do not have trailing colons.
This pass replaces each indirect call/jump with a direct call to a thunk that looks like:
lfence
jmpq *%r11
This ensures that if the value in register %r11 was loaded from memory, then
the value in %r11 is (architecturally) correct prior to the jump.
Also adds a new target feature to X86: +lvi-cfi
("cfi" meaning control-flow integrity)
The feature can be added via clang CLI using -mlvi-cfi.
This is an alternate implementation to https://reviews.llvm.org/D75934 That merges the thunk insertion functionality with the existing X86 retpoline code.
Differential Revision: https://reviews.llvm.org/D76812
Summary:
This patch adds parsing and dumping DWARFv5 .debug_macro section in llvm-dwarfdump,
it does not introduce any new switch. Existing switch "--debug-macro"
should be used to dump macinfo or macro section.
Reviewed By: dblaikie, ikudrin, jhenderson
Differential Revision: https://reviews.llvm.org/D73086
Summary:
Add mapping from exp2 math functions
to corresponding SVML calls.
This is a follow up and extension for llvm diff
https://reviews.llvm.org/D19544
Test Plan:
- update test case and run ninja check.
- run tests locally
Reviewers: wenlei, hoyFB, mmasten, mzolotukhin, spatel
Reviewed By: spatel
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77114
Summary:
A recent change in the instruction simplifier enables a call to a function that just returns one of its parameter to be simplified as simply loading the parameter. This exposes a bug in the inliner where double inlining may be involved which in turn may cause compiler ICE when an already-inlined callsite is reused for further inlining.
To put it simply, in the following-like C program, when the function call second(t) is inlined, its code t = third(t) will be reduced to just loading the return value of the callsite first(). This causes the inliner internal data structure to register the first() callsite for the call edge representing the third() call, therefore incurs a double inlining when both call edges are considered an inline candidate. I'm making a fix to break the inliner from reusing a callsite for new call edges.
```
void top()
{
int t = first();
second(t);
}
void second(int t)
{
t = third(t);
fourth(t);
}
void third(int t)
{
return t;
}
```
The actual failing case is much trickier than the example here and is only reproducible with the legacy inliner. The way the legacy inliner works is to process each SCC in a bottom-up order. That means in reality function first may be already inlined into top, or function third is either inlined to second or is folded into nothing. To repro the failure seen from building a large application, we need to figure out a way to confuse the inliner so that the bottom-up inlining is not fulfilled. I'm doing this by making the second call indirect so that the alias analyzer fails to figure out the right call graph edge from top to second and top can be processed before second during the bottom-up. We also need to tweak the test code so that when the inlining of top happens, the function body of second is not that optimized, by delaying the pass of function attribute deducer (i.e, which tells function third has no side effect and just returns its parameter). Since the CGSCC pass is iterative, additional calls are added to top to postpone the inlining of second to the second round right after the first function attribute deducing pass is done. I haven't been able to repro the failure with the new pass manager since the processing order of ininlined callsites is a bit different, but in theory the issue could happen there too.
Note that this fix could introduce a side effect that blocks the simplification of inlined code, specifically for a call site that can be folded to another call site. I hope this can probably be complemented by subsequent inlining or folding, as shown in the attached unit test. The ideal fix should be to separate the use of VMap. However, in reality this failing pattern shouldn't happen often. And even if it happens, there should be a good chance that the non-folded call site will be refolded by iterative inlining or subsequent simplification.
Reviewers: wenlei, davidxl, tejohnson
Reviewed By: wenlei, davidxl
Subscribers: eraman, nikic, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76248
Summary:
This patch comes from H.J.'s 2bd54ce7fa
**This patch fix the failed llvm unit tests which running on CET machine. **(e.g. ExecutionEngine/MCJIT/MCJITTests)
The reason we enable IBT at "JIT compiled with CET" is mainly that: the JIT don't know the its caller program is CET enable or not.
If JIT's caller program is non-CET, it is no problem JIT generate CET code or not.
But if JIT's caller program is CET enabled, JIT must generate CET code or it will cause Control protection exceptions.
I have test the patch at llvm-unit-test and llvm-test-suite at CET machine. It passed.
and H.J. also test it at building and running VNCserver(Virtual Network Console), it works too.
(if not apply this patch, VNCserver will crash at CET machine.)
Reviewers: hjl.tools, craig.topper, LuoYuanke, annita.zhang, pengfei
Subscribers: tstellar, efriedma, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76900
This was causing a machine verifier failure on the test suite.
Make sure that we don't end up with a weird register class here.
Failure for reference:
*** Bad machine code: Illegal virtual register for instruction ***
- function: check_constrain
- basic block: %bb.1 (0x7f8b70839f80)
- instruction: early-clobber %6:gpr64, early-clobber %7:gpr64sp =
JumpTableDest32 %5:gpr64, %1:gpr64sp, %jump-table.0
- operand 3: %1:gpr64sp
Expected a GPR64 register, but got a GPR64sp register
Differential Revision: https://reviews.llvm.org/D77349
MI peephole will remove unnecessary FRSP instructions. This patch
removes such unnecessary XSRSP.
Reviewed By: steven.zhang
Differential Revision: https://reviews.llvm.org/D77208
Summary:
This fixes a few issues related to SMRD offsets. On gfx9 and gfx10 we have a
signed byte offset immediate, however we can overflow into a negative since we
treat it as unsigned.
Also, the SMRD SOFFSET sgpr is an unsigned offset on all subtargets. We
sometimes tried to use negative values here.
Third, S_BUFFER instructions should never use a signed offset immediate.
Differential Revision: https://reviews.llvm.org/D77082
This will likely introduce catastrophic performance regressions on
older subtargets, but should be correct. A follow up change will
remove the old fp32-denormals subtarget features, and switch to using
the new denormal-fp-math/denormal-fp-math-f32 attributes. Frontends
should be making sure to add the denormal-fp-math-f32 attribute when
appropriate to avoid performance regressions.
The problem on Windows was that the \b in "..\bin" was interpreted
as an escape sequence. Use r"" strings to prevent that.
This reverts commit ab11b9eefa,
with raw strings in the lit.site.cfg.py.in files.
Differential Revision: https://reviews.llvm.org/D77184
Consider a callee function that has a call (C) within it which feeds
into the return. When we inline that callee into a callsite that has
return attributes, we can backward propagate valid attributes to the
call (C) within that inlined callee body.
This is safe to do so only if we can guarantee transfer of execution to
successor in the window of instructions between return value (i.e. the
call C) and the return instruction.
Also, this is valid only for attributes which are a property of a
callsite and not those that are not dependent on the ABI, or a property
of the call itself.
Reviewed-By: reames, jdoerfert
Differential Revision: https://reviews.llvm.org/D76140
This is a workaround for clang adding noinline to all functions at
-O0. Previously, we would just add alwaysinline, and the verifier
would complain about having both noinline and alwaysinline. We
currently can't truly codegen this case as a freestanding function, so
override the user forcing noinline.
Currently, all generated lit.site.cfg files contain absolute paths.
This makes it impossible to build on one machine, and then transfer the
build output to another machine for test execution. Being able to do
this is useful for several use cases:
1. When running tests on an ARM machine, it would be possible to build
on a fast x86 machine and then copy build artifacts over after building.
2. It allows running several test suites (clang, llvm, lld) on 3
different machines, reducing test time from sum(each test suite time) to
max(each test suite time).
This patch makes it possible to pass a list of variables that should be
relative in the generated lit.site.cfg.py file to
configure_lit_site_cfg(). The lit.site.cfg.py.in file needs to call
`path()` on these variables, so that the paths are converted to absolute
form at lit start time.
The testers would have to have an LLVM checkout at the same revision,
and the build dir would have to be at the same relative path as on the
builder.
This does not yet cover how to figure out which files to copy from the
builder machine to the tester machines. (One idea is to look at the
`--graphviz=test.dot` output and copy all inputs of the `check-llvm`
target.)
Differential Revision: https://reviews.llvm.org/D77184
bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'
We do not attempt this in InstCombine because we do not want to change
types and create new shuffle ops that are potentially not lowered as
well as the original code. Here, we can check the cost model to see if
it is worthwhile.
I've aggressively enabled this transform even if the types are the same
size and/or equal cost because moving the bitcast allows InstCombine to
make further simplifications.
In the motivating cases from PR35454:
https://bugs.llvm.org/show_bug.cgi?id=35454
...this is enough to let instcombine and the backend eliminate the
redundant shuffles, but we probably want to extend VectorCombine to
handle the inverse pattern (shuffle-of-bitcast) to get that
simplification directly in IR.
Differential Revision: https://reviews.llvm.org/D76727
SILoadStoreOptimizer::checkAndPrepareMerge() expects base and
paired instruction to come in order and scans MBB from base to
the paired instruction. An original order can be changed if
there were a dependent instruction in between and base instruction
was moved.
Fixed by bailing the optimization. In theory it might be possible
still to perform a merge by swapping instructions, but on practice
it bails anyway because it finds dependency on that same instruction
which has resulted in the base move.
Differential Revision: https://reviews.llvm.org/D77245
These are versions of a function that regressed with:
rGf2fbdf76d8d0
That particular problem occurs with an instcombine-simplifycfg-instcombine
sequence, but we can show that it exists within instcombine only with
other variations of the pattern.
This reverts commit f2fbdf76d8.
As noted in the post-commit thread:
https://reviews.llvm.org/rGf2fbdf76d8d0
...this can obscure a min/max pattern where the components
have extra uses. We can show that the problem is independent
of this change with a slightly modified source example, so
this revert just delays/reduces the need to fix the real
problem.
We need to improve our analysis of negation or -- more
generally -- subtraction using patches like D77230 or D68408.
Summary:
Splitting Knowledge retention into Queries in Analysis and Builder into Transform/Utils
allows Queries and Transform/Utils to use Analysis.
Reviewers: jdoerfert, sstefan1
Reviewed By: jdoerfert
Subscribers: mgorny, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77171
This patch adds
- New arguments to getMinPrefetchStride() to let the target decide on a
per-loop basis if software prefetching should be done even with a stride
within the limit of the hw prefetcher.
- New TTI hook enableWritePrefetching() to let a target do write prefetching
by default (defaults to false).
- In LoopDataPrefetch:
- A search through the whole loop to gather information before emitting any
prefetches. This way the target can get information via new arguments to
getMinPrefetchStride() and emit prefetches more selectively. Collected
information includes: Does the loop have a call, how many memory
accesses, how many of them are strided, how many prefetches will cover
them. This is NFC to before as long as the target does not change its
definition of getMinPrefetchStride().
- If a previous access to the same exact address was 'read', and the
current one is 'write', make it a 'write' prefetch.
- If two accesses that are covered by the same prefetch do not dominate
each other, put the prefetch in a block that dominates both of them.
- If a ConstantMaxTripCount is less than ItersAhead, then skip the loop.
- A SystemZ implementation of getMinPrefetchStride().
Review: Ulrich Weigand, Michael Kruse
Differential Revision: https://reviews.llvm.org/D70228
Add an option to llvm-dwarfdump to calculate the bytes within
the debug sections. Dump this numbers when using --statistics
option as well.
This is an initial patch (e.g. we should support other units,
since we only support 'bytes' now).
Differential Revision: https://reviews.llvm.org/D74205
This adds MVE vmull patterns, which are conceptually the same as
mul(vmovl, vmovl), and so the tablegen patterns follow the same
structure.
For i8 and i16 this is simple enough, but in the i32 version the
multiply (in 64bits) is illegal, meaning we need to catch the pattern
earlier in a dag fold. Because bitcasts are involved in the zext
versions and the patterns are a little different in little and big
endian. I have only added little endian support in this patch.
Differential Revision: https://reviews.llvm.org/D76740
The unpredictable/hasSideEffects flag is usually inferred by tablegen
from whether the instruction has a tablegen pattern (and that pattern
only has a single output instruction). Now that the MVE intrinsics are
all committed and producing code, the remaining instructions still
marked as unpredictable need to be specially handled. This adds the flag
directly to instructions that need it, notably the V*MLAL instructions
and some of the MOV's.
Differential Revision: https://reviews.llvm.org/D76910
Summary:
This allows doing `memcmp(p, q, 7)` with 2 loads instead of a call to
memcmp.
This fixes part of PR45147.
Reviewers: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76133
As pointed out by @thakis, currently CallSiteSplitting bails out after
checking the first PHI node. We should check all PHI nodes, until we
find one where call site splitting is beneficial.
This patch also slightly simplifies the code using BasicBlock::phis().
Reviewers: davidxl, junbuml, thakis
Reviewed By: davidxl
Differential Revision: https://reviews.llvm.org/D77089
There was a TODO in genericValueTraversal to provide the context
instruction and due to the lack of it users that wanted one just used
something available. Unfortunately, using a fixed instruction is wrong
in the presence of PHIs so we need to update the context instruction
properly.
Reviewed By: uenoku
Differential Revision: https://reviews.llvm.org/D76870
While D68850 allowed functions to be deleted I accidentally saved some
version of the function to be used once a suitable prefix was found.
This turned out to be problematic when the occasionally deleted function
is also occasionally modified. The test case is adjusted to resemble the
case in which the problem was found.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D76586
Use DL & ABI information for better alignment deduction, e.g., if a type
is accessed and the ABI specifies an alignment requirement for such an
access we can use it. This is based on a patch by @lebedev.ri and
inspired by getBaseAlign in Loads.cpp.
Depends on D76673.
Reviewed By: lebedev.ri
Differential Revision: https://reviews.llvm.org/D76674
If we have a must-tail call the callee and caller need to have matching
ABIs. Part of that is alignment which we might modify when we deduce
alignment of arguments of either. Since we would need to keep them in
sync, which is not as simple, we simply avoid deducing alignment for
arguments of the must-tail caller or callee.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D76673
Failure to export __cxa_atexit can lead to an attempt to import a definition
from the process itself (if __cxa_atexit is referenced from another JITDylib),
but the process definition will clash with the existing non-exported definition
to produce an unexpected DuplicateDefinitionError.
This patch fixes the immediate issue by exporting __cxa_atexit. It also fixes a
bug where atexit functions in other JITDylibs were not being run by adding a
copy of run_atexits_helper to every JITDylib.
A follow up patch will deal with the bug where definition generators are called
despite a non-exported definition being present.
This means the linker will be expect them be undefined at link time an
will generate imports from the `env` module rather than reporting
undefined externals.
Differential Revision: https://reviews.llvm.org/D77192
The MemoryBuffer::getMemBuffer method's RequiresNullTerminator parameter
defaults to true, but object files are not null terminated so we need to
explicitly pass false here.
Negation is equivalent to bitwise-not + 1, so try to convert more
subtracts into adds using this relationship:
0 - (A ^ C) => ((A ^ C) ^ -1) + 1 => A ^ ~C + 1
I doubt this will recover the regression noted in rGf2fbdf76d8d0,
but seems like we're going to need to improve here and/or revive D68408?
Alive2 proofs:
http://volta.cs.utah.edu:8080/z/Re5tMUhttp://volta.cs.utah.edu:8080/z/An-uns
Differential Revision: https://reviews.llvm.org/D77230
find() was altering the UserChain, even in cases where it subsequently
discovered that the resulting constant was a 0. This confuses
rebuildWithoutConstOffset() when it attempts to walk the chain later, since it
is expected that the chain itself be a path down the use-def edges of an
expression.
Make the attributor pass aware of aligned_alloc for converting heap
allocations to stack ones.
Depends on D76971.
Differential Revision: https://reviews.llvm.org/D76974
Summary:
The previous code for determining the innermost region in CFGSort was
not correct. We determine subregion relationship by domination of their
headers, i.e., if region A's header dominates region B's header, B is a
subregion of A. Previously we assumed that if a BB belongs to both a
loop and an exception, the region with fewer number of BBs is the
innermost one. This may not be true, because while WebAssemblyException
contains BBs in all its subregions (loops or exceptions), MachineLoop
may not, because MachineLoop does not contain BBs that don't have a path
to its header even if they are dominated by its header.
Loop header <---|
| |
Exception header |
| \ |
A B |
| \ |
| C |
| |
Loop latch |
| |
-------------|
For example, in this CFG, the loop does not contain B and C, because
they don't have a path back to the loops header. But for CFGSort we
consider the exception here belongs to the loop and the exception should
be a subregion of the loop and scheduled together.
So here we should use `WE->contains(ML->getHeader())` (but not
`ML->contains(WE->getHeader())`, for the stated region above).
This also fixes some comments and deletes `Regions` vector in
`RegionInfo` class, which was not used anywere.
Reviewers: dschuff
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77181
We have scenarios when the logic of --elf-hash-histogram/--hash-symbols/--hash-table
options might crash when given a broken hash table.
This patch adds pre-checks for tables for these 3 options
and provides test cases.
Differential revision: https://reviews.llvm.org/D77147
Summary:
Currently, the comparison argument used for ATOMIC_CMP_XCHG is legalised
with GetPromotedInteger, which leaves the upper bits of the value
undefind. Since this is used for comparing in an LR/SC loop with a
full-width comparison, we must sign extend it. We introduce a new
getExtendForAtomicCmpSwapArg to complement getExtendForAtomicOps, since
many targets have compare-and-swap instructions (or pseudos) that
correctly handle an any-extend input, and the existing function
determines the extension of the result, whereas we are concerned with
the input.
This is related to https://reviews.llvm.org/D58829, which solved the
issue for ATOMIC_CMP_SWAP_WITH_SUCCESS, but not the simpler
ATOMIC_CMP_SWAP.
Reviewers: asb, lenary, efriedma
Reviewed By: asb
Subscribers: arichardson, hiraditya, rbar, johnrusso, simoncook, sabuasal, niosHD, kito-cheng, shiva0217, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, rkruppe, jfb, PkmX, jocewei, psnobl, benna, Jim, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, evandro, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D74453
Prior to this change the clang interface stubs format resembled
something ending with a symbol list like this:
Symbols:
a: { Type: Func }
This was problematic because we didn't actually want a map format and
also because we didn't like that an empty symbol list required
"Symbols: {}". That is to say without the empty {} llvm-ifs would crash
on an empty list.
With this new format it is much more clear which field is the symbol
name, and instead the [] that is used to express an empty symbol vector
is optional, ie:
Symbols:
- { Name: a, Type: Func }
or
Symbols: []
or
Symbols:
This further diverges the format from existing llvm-elftapi. This is a
good thing because although the format originally came from the same
place, they are not the same in any way.
Differential Revision: https://reviews.llvm.org/D76979
This allows the MVE VPT Block insertion pass to remove VPNOTs in
order to create more complex VPT blocks such as TE, TEET, TETE, etc.
Differential Revision: https://reviews.llvm.org/D75993
Summary:
Aggregate types containing scalable vectors aren't supported and as far
as I can tell this pass is mostly concerned with optimisations on
aggregate types, so the majority of this pass isn't very useful for
scalable vectors.
This patch modifies SROA such that mem2reg is run on allocas with
scalable types that are promotable, but nothing else such as slicing is
done.
The use of TypeSize in this pass has also been updated to be explicitly
fixed size. When invoking the following methods in DataLayout:
* getTypeSizeInBits
* getTypeStoreSize
* getTypeStoreSizeInBits
* getTypeAllocSize
we now called getFixedSize on the resultant TypeSize. This is quite an
extensive change with around 50 calls to these functions, and also the
first change of this kind (being explicit about fixed vs scalable
size) as far as I'm aware, so feedback welcome.
A test is included containing IR with scalable vectors that this pass is
able to optimise.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D76720
PTEST/TESTP sets EFLAGS as:
TESTZ: ZF = (Op0 & Op1) == 0
TESTC: CF = (~Op0 & Op1) == 0
TESTNZC: ZF == 0 && CF == 0
If we are inverting the 0'th operand of a PTEST/TESTP instruction we can adjust the comparisons to correct handle the inversion implicitly.
Additionally, for "TESTZ" (ZF) cases, the allones case, PTEST(X,-1) can be simplified to PTEST(X,X).
We can expand this for the TESTZ(X,~Y) pattern and also handle KTEST/KORTEST in the future.
Differential Revision: https://reviews.llvm.org/D76984
Summary:
Make sure we do not assert on value types not being
simple in getFauxShuffleMask when analysing operations
such as "v8i16 = truncate v8i24".
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77136
Currently ConstantRange::binaryAnd/binaryOr results are too pessimistic
for single element constant ranges.
If both operands are single element ranges, we can use APInt's AND and
OR implementations directly.
Note that some other binary operations on constant ranges can cover the
single element cases naturally, but for OR and AND this unfortunately is
not the case.
Reviewers: nikic, spatel, lebedev.ri
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D76446
Currently, DAG combiner uses (fmul (rsqrt x) x) to estimate square
root of x. However, this method would return NaN if x is +Inf, which
is incorrect.
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D76853
This patch adds checks to the verifier to ensure the dimension arguments
passed to the matrix intrinsics match the vector types for their
arugments/return values.
Reviewers: anemet, Gerolf, andrew.w.kaylor, LuoYuanke
Reviewed By: anemet
Differential Revision: https://reviews.llvm.org/D77129
Also add lowerShuffleWithPACK call to lowerV32I16Shuffle - shuffle combining was catching it but we avoid a lot of temporary shuffle creations if we catch it at lowering first.
Summary:
In https://bugs.llvm.org/show_bug.cgi?id=45297, it fails selecting
instructions for `PPCISD::ST_VSR_SCAL_INT`. The reason it generate the
`PPCISD::ST_VSR_SCAL_INT` with `-power8-vector` in IR is PPC's
combiner checks `hasP8Altivec` rather than `hasP8Vector`. This patch
should resolve PR45297.
Differential Revision: https://reviews.llvm.org/D76773
Summary:
Select folding in JumpThreading can create a conditional branch on a
code patch that did not have one in the original program. This is not a
valid transformation in sanitize_memory functions.
Note that JumpThreading does select folding in 3 different places. Two
of them seem safe - they apply to a select instruction in a BB that ends
with an unconditional branch to another BB, which (in turn) ends with a
conditional branch or a switch with the same condition.
Fixes PR45220.
Reviewers: glider, dvyukov, efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76332
Summary:
When we encounter an XCOFF file, reflect that in the triple information.
In addition to knowing the object file format, we know that the
associated OS is AIX.
This means that we can expect that there is no output difference in the
processing of an XCOFF32 input file between cases where the triple is
left unspecified by the user and cases where the user specifies
`--triple powerpc-ibm-aix` explicitly.
Reviewers: jhenderson, sfertile, jasonliu, daltenty
Reviewed By: jasonliu
Subscribers: wuzish, nemanjai, hiraditya, MaskRay, rupprecht, steven.zhang, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77025
The UnwindHelp object is used during exception handling by runtime
code. It must be findable from a fixed offset from FP.
This change allocates the UnwindHelp object as a fixed object (as is
done for x86_64) to ensure that both the generated code and runtime
agree on the location of the object.
Fixes https://bugs.llvm.org/show_bug.cgi?id=45346
Differential Revision: https://reviews.llvm.org/D77016
The generated code for a funclet can have an add to sp in the epilogue
for which there is no corresponding sub in the prologue.
This patch removes the early return from emitPrologue that was
preventing the sub to sp, and instead conditionalizes the appropriate
parts of the rest of the function.
Fixes https://bugs.llvm.org/show_bug.cgi?id=45345
Differential Revision: https://reviews.llvm.org/D77015
This reverts commit 28518d9ae3.
There is a failure in MsgPackReader.cpp when built with clang. It
complains about "signext and zeroext" are incompatible. Investigating
offline if it is infact a UB in the MsgPackReader code.
The attached test case is simplified from tcmalloc. Both function calls should be optimized as tailcall. But llvm can only optimize the first call. The second call can't be optimized because function dupRetToEnableTailCallOpts failed to duplicate ret into block case2.
There 2 problems blocked the duplication:
1 Intrinsic call llvm.assume is not handled by dupRetToEnableTailCallOpts.
2 The control flow is more complex than expected, dupRetToEnableTailCallOpts can only duplicate ret into its predecessor, but here we have an intermediate block between call and ret.
The solutions:
1 Since CodeGenPrepare is already at the end of LLVM IR phase, we can simply delete the intrinsic call to llvm.assume.
2 A general solution to the complex control flow is hard, but for this case, after exit2 is duplicated into case1, exit2 is the only successor of exit1 and exit1 is the only predecessor of exit2, so they can be combined through eliminateFallThrough. But this function is called too late, there is no more dupRetToEnableTailCallOpts after it. We can add an earlier call to eliminateFallThrough to solve it.
Differential Revision: https://reviews.llvm.org/D76539
We have loads preserving low and high 16 bits of their
destinations. However, we always use a whole 32 bit register
for these. The same happens with 16 bit stores, we have to
use full 32 bit register so if high bits are clobbered the
register needs to be copied. One example of such code is
added to the load-hi16.ll.
The proper solution to the problem is to define 16 bit subregs
and use them in the operations which do not read another half
of a VGPR or preserve it if the VGPR is written.
This patch simply defines subregisters and register classes.
At the moment there should be no difference in code generation.
A lot more work is needed to actually use these new register
classes. Therefore, there are no new tests at this time.
Register weight calculation has changed with new subregs so
appropriate changes were made to keep all calculations just
as they are now, especially calculations of register pressure.
Differential Revision: https://reviews.llvm.org/D74873
Consider a callee function that has a call (C) within it which feeds
into the return. When we inline that callee into a callsite that has
return attributes, we can backward propagate those attributes to the
call (C) within that inlined callee body.
This is safe to do so only if we can guarantee transfer of execution to
successor in the window of instructions between return value (i.e. the
call C) and the return instruction.
See added test cases.
Reviewed-By: reames, jdoerfert
Differential Revision: https://reviews.llvm.org/D76140
Registers used in any address (as well as in a few other contexts)
have special semantics when a "zero" register is used, which is
why the back-end defines extra register classes ADDR32, ADDR64 etc
to be used to prevent the register allocator from using %r0 there.
However, when writing assembler code "by hand", you sometimes need
to trigger that special semantics. However, currently the AsmParser
will reject %r0 in those places. In some cases it may be possible
to write that instruction differently - but in others it is currently
not possible at all.
This check in AsmParser simply seems overly strict, so this patch
just removes the check completely. This brings the behaviour of
AsmParser in line with the GNU assembler as well.
Bugzilla: https://bugs.llvm.org/show_bug.cgi?id=45092
Make InstCombine aware of the aligned_alloc library function.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Depends on D76970.
Differential Revision: https://reviews.llvm.org/D76971
Summary:
CGProfilePass is run by default in certain new pass manager optimization pipeline. Assemblers other than llvm as (such as gnu as) cannot recognize the .cgprofile entries generated and emitted from this pass, causing build time error.
This patch adds new options in clang CodeGenOpts and PassBuilder options so that we can turn cgprofile off when not using integrated assembler.
Reviewers: Bigcheese, xur, george.burgess.iv, chandlerc, manojgupta
Reviewed By: manojgupta
Subscribers: manojgupta, void, hiraditya, dexonsmith, llvm-commits, tcwang, llozano
Tags: #llvm, #clang
Differential Revision: https://reviews.llvm.org/D62627
Summary: this patch preserve information from various places in EarlyCSE into assume bundles.
Reviewers: jdoerfert
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76769
If canLowerByDroppingEvenElements indicates that the shuffle is a N:1 compaction pattern and the inputs are suitably sign/zero extended then we can use a chain of PACKSS/PACKUS to compact.
This helps avoid PSHUFB (and its mask load) for short shuffle chains, shuffle combining will still replace with a PSHUFB if we have enough shuffles as getFauxShuffleMask can recognise PACKSS/PACKUS chains.
The script now includes extra info about command-line options used
when generating its advertisement heading, but we don't want that
here. This is a special-case because we have enhanced the check
lines (as noted in the 2nd comment line).
This patch updates ValueLattice to distinguish between ranges that are
guaranteed to not include undef and ranges that may include undef.
A constant range guaranteed to not contain undef can be used to simplify
instructions to arbitrary values. A constant range that may contain
undef can only be used to simplify to a constant. If the value can be
undef, it might take a value outside the range. For example, consider
the snipped below
define i32 @f(i32 %a, i1 %c) {
br i1 %c, label %true, label %false
true:
%a.255 = and i32 %a, 255
br label %exit
false:
br label %exit
exit:
%p = phi i32 [ %a.255, %true ], [ undef, %false ]
%f.1 = icmp eq i32 %p, 300
call void @use(i1 %f.1)
%res = and i32 %p, 255
ret i32 %res
}
In the exit block, %p would be a constant range [0, 256) including undef as
%p could be undef. We can use the range information to replace %f.1 with
false because we remove the compare, effectively forcing the use of the
constant to be != 300. We cannot replace %res with %p however, because
if %a would be undef %cond may be true but the second use might not be
< 256.
Currently LazyValueInfo uses the new behavior just when simplifying AND
instructions and does not distinguish between constant ranges with and
without undef otherwise. I think we should address the remaining issues
in LVI incrementally.
Reviewers: efriedma, reames, aqjune, jdoerfert, sstefan1
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D76931
For the PHI node
%1 = phi [%A, %entry], [%X, %latch]
it is incorrect to use SCEV of backedge val %X as an exit value
of PHI unless %X is loop invariant.
This is because exit value of %1 is value of %X at one-before-last
iteration of the loop.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D73181
Canonicalize the case when a scalar extracted from a vector is
truncated. Transform such cases to bitcast-then-extractelement.
This will enable erasing the truncate operation.
This commit fixes PR45314.
reviewers: spatel
Differential revision: https://reviews.llvm.org/D76983
Add a new llvm.amdgcn.ballot intrinsic modeled on the ballot function
in GLSL and other shader languages. It returns a bitfield containing the
result of its boolean argument in all active lanes, and zero in all
inactive lanes.
This is intended to replace the existing llvm.amdgcn.icmp and
llvm.amdgcn.fcmp intrinsics after a suitable transition period.
Use the new intrinsic in the atomic optimizer pass.
Differential Revision: https://reviews.llvm.org/D65088
For casts with constant range operands, we can use
ConstantRange::castOp.
Reviewers: davide, efriedma, mssimpso
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D71938
Leverage ARM ELF build attribute section to create ELF attribute section
for RISC-V. Extract the common part of parsing logic for this section
into ELFAttributeParser.[cpp|h] and ELFAttributes.[cpp|h].
Differential Revision: https://reviews.llvm.org/D74023
Octeon branches (bbit0/bbit032/bbit1/bbit132) have an immediate operand,
so it is legal to have such replacement within
MipsBranchExpansion::replaceBranch().
According to the specification, a branch (e.g. bbit0 ) looks like:
bbit0 rs p offset // p is an immediate operand
if !rs<p> then branch
Without this patch, an assertion triggers in the method,
and the problem has been found in the real example.
Differential Revision: https://reviews.llvm.org/D76842
In the past, AVR functions were only lowered with interrupt-specific
machine code if the function was defined with the "avr-interrupt" or
"avr-signal" calling conventions.
This patch modifies the backend so that if the function does not have a
special calling convention, but does have an "interrupt" attribute,
that function is interpreted as a function with interrupts.
This also extracts the "is this function an interrupt" logic from
several disparate places in the backend into one AVRMachineFunctionInfo
attribute.
Bug found by Wilhelm Meier.
Compbinary format uses MD5 to represent strings in name table. That gives smaller profile without the need of compression/decompression when writing/reading the profile. The patch adds the support in extbinary format. It is off by default but user can choose to enable it.
Note the feature of using MD5 in name table can bring very small chance of name conflict leading to profile mismatch. Besides, profile using the feature won't have the profile remapping support.
Differential Revision: https://reviews.llvm.org/D76255
When we have
```
a = G_OR x, x
```
or
```
b = G_AND y, y
```
We can drop the G_OR/G_AND and just use x/y respectively.
Also update arm64-fallback.ll because there was an or in there which hits this
transformation.
Differential Revision: https://reviews.llvm.org/D77105
Implement identity combines for operations like the following:
```
%a = G_SUB %b, 0
```
This can just be replaced with %b.
Over CTMark, this gives some minor size improvements at -O3.
Differential Revision: https://reviews.llvm.org/D76640
This reverts commit b3297ef051.
This change is incorrect. The current semantic of null in the IR is a
pointer with the bitvalue 0. It is not a cast from an integer 0, so
this should preserve the pointer type.
We currently don't have a way to map to the equivalent intrinsic
opcode, so track immediate 0s in place of the address for the
selection to know to change the final opcode.
InstCombine has a mess of logic that tries to preserve min/max patterns,
but AFAICT, this one is not necessary because we can always narrow the
corresponding select in this sequence to match the narrow compare.
The biggest danger for this patch is inducing infinite looping or
assert from exceeding max iterations. If any bots hit that in the
vicinity of this commit, this is the likely patch to blame.
I got a report recently that a user was having trouble interpreting the
meaning of the error message. Hopefully this is more readable; produces
something like the following:
error: No such file or directory: Could not read profile data!
Differential Revision: https://reviews.llvm.org/D76796
Summary:
Code frequently relies upon the results of "is.constant" intrinsics to
DCE invalid code paths. We don't want the intrinsic to be made control-
dependent on any additional values. For instance, we can't split a PHI
into a "constant" and "non-constant" part via jump threading in order
to "optimize" the constant part, because the "is.constant" intrinsic is
meant to return "false".
Reviewers: wmi, kazu, MaskRay
Reviewed By: kazu
Subscribers: jdoerfert, efriedma, joerg, lebedev.ri, nikic, xbolva00, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75799
Summary:
This change adds amdgcn.reloc.constant intrinsic to the amdgpu backend, which will compile into a relocation entry in the resulting elf.
The intrinsics takes a MetadataNode (String) as its only argument, which specifies the symbol name of the relocation entry.
`SelectionDAGBuilder::getValueImpl` is changed to allow metadata operands passed through to ISel.
Author: csyonghe <yonghe@google.com>
Reviewers: tpr, nhaehnle
Reviewed By: nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76440
For each natural loop with multiple exit blocks, this pass creates a
new block N such that all exiting blocks now branch to N, and then
control flow is redistributed to all the original exit blocks.
The bulk of the tranformation is a new function introduced in
BasicBlockUtils that an redirect control flow from a set of incoming
blocks to a set of outgoing blocks via a common "hub".
This is a useful workaround for a limitation in the structurizer which
incorrectly orders blocks when processing a nest of loops. This pass
bypasses that issue by ensuring that each natural loop is recognized
as a separate region. Since the structurizer is a region pass, it no
longer sees a nest of loops in a single region, and instead processes
each "level" in the nesting as a separate region.
The AMDGPU backend provides a new option to enable this pass before
the structurizer, which may eventually be enabled by default.
Reviewers: madhur13490, arsenm, nhaehnle
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D75865
In InnerLoopVectorizer::getOrCreateTripCount, when the backedge taken
count is a SCEV add expression, its type is defined by the type of the
last operand of the add expression.
In the test case from PR45259, this last operand happens to be a
pointer, which (according to llvm::Type) does not have a primitive size
in bits. In this case, LoopVectorize fails to truncate the SCEV and
crashes as a result.
Uing ScalarEvolution::getTypeSizeInBits makes the truncation work as expected.
https://bugs.llvm.org/show_bug.cgi?id=45259
Differential Revision: https://reviews.llvm.org/D76669
Summary:
Otherwise PostRA list scheduler may reorder instruction, such as
schedule this
'''
pushq $0x8
pop %rbx
lea 0x2a0(%rsp),%r15
'''
to
'''
pushq $0x8
lea 0x2a0(%rsp),%r15
pop %rbx
'''
by mistake. The patch is to prevent this to happen by making sure POP has
implicit use of SP.
Reviewers: craig.topper
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D77031
I still think the call lowering type legalization logic split between
the generic code and target is too confusing, but largely induced by
the reliance on the DAG infrastructure.
This test missed the check of histograms printed for .hash sections.
It was removed by mistake in D71606 where I tried to get rid of precompiled objects
and did not realize that time that both SHT_GNU_HASH and SHT_HASH sections
were tested and not just GNU version.
Also it never tested aliases for the --elf-hash-histogram option.
Differential revision: https://reviews.llvm.org/D76920
We are able to reduce `-DBITS=32/64` to reduce this test case.
I've rewrote the comments we had to generalize them and
fix wrong computations they contained.
Differential revision: https://reviews.llvm.org/D76924
If we are lowering to X86ISD::SHUF128 we are going to lose track of individual 128-bit lanes that are UNDEF, so if we can widen these to guarantee that they are sequential with their neighbour we should. This helps with later shuffle combines.
Add a bit more logic into the 'FalseLaneZeros' tracking to enable
horizontal reductions and also make the VADDV variants
validForTailPredication.
Differential Revision: https://reviews.llvm.org/D76708
In the original batch of MVE VMOVimm code generation VMOV.i64 was left
out due to the way it was done downstream. It turns out that it's fairly
simple though. This adds the codegen for it, similar to NEON.
Bigendian is technically incorrect in this version, which John is fixing
in a Neon patch.
Aligned_alloc is a standard lib function and has been in glibc since
2.16 and in the C11 standard. It has semantics similar to malloc/calloc
for several analyses/transforms. This patch introduces aligned_alloc
in target library info and memory builtins. Subsequent ones will
make other passes aware and fix https://bugs.llvm.org/show_bug.cgi?id=44062
This change will also be useful to LLVM generators that need to allocate
buffers of vector elements larger than 16 bytes (for eg. 256-bit ones),
element boundary alignment for which is not typically provided by glibc malloc.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Differential Revision: https://reviews.llvm.org/D76970
As explained on PR40720, EXTRACTF128 is always as good/better than VPERM2F128, and we can use the implicit zeroing of the upper half.
I've added some extra tests to vector-shuffle-combining-avx2.ll to make sure we don't lose coverage.
Replace the explicit isAtom() || isSLM() test with the more general (and more specific) slowTwoMemOps() check to avoid the use of the PUSHrmm push from memory case.
This is actually very tricky to test in anything but quite complex code, but the atomic-idempotent.ll tests seem to be the most straightforward to use.
Differential Revision: https://reviews.llvm.org/D76239
Summary:
On targets with different pointer sizes, -alignment-from-assumptions could attempt to create SCEV expressions which use different effective SCEV types. The provided test illustrates the issue.
In `getNewAlignment`, AASCEV would be the (only) alloca, which would have an effective SCEV type of i32. But PtrSCEV, the GEP in this case, due to being in the flat/default address space, will have an effective SCEV of i64.
This patch resolves the issue by truncating PtrSCEV to AASCEV's effective type.
Reviewers: hfinkel, jdoerfert
Reviewed By: jdoerfert
Subscribers: jvesely, nhaehnle, hiraditya, javed.absar, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75471
Currently, bpf does not specify 128bit alignment in its
layout spec. So for a structure like
struct ipv6_key_t {
unsigned pid;
unsigned __int128 saddr;
unsigned short lport;
};
clang will generate IR type
%struct.ipv6_key_t = type { i32, [12 x i8], i128, i16, [14 x i8] }
Additional padding is to ensure later IR->MIR can generate correct
stack layout with target layout spec.
But it is common practice for a tracing program to be
first compiled with target flag (e.g., x86_64 or aarch64) through
clang to generate IR and then go through llc to generate bpf
byte code. Tracing program often refers to kernel internal
data structures which needs to be compiled with non-bpf target.
But such a compilation model may cause a problem on aarch64.
The bcc issue https://github.com/iovisor/bcc/issues/2827
reported such a problem.
For the above structure, since aarch64 has "i128:128" in its
layout string, the generated IR will have
%struct.ipv6_key_t = type { i32, i128, i16 }
Since bpf does not have "i128:128" in its spec string,
the selectionDAG assumes alignment 8 for i128 and
computes the stack storage size for the above is 32 bytes,
which leads incorrect code later.
The x86_64 does not have this issue as it does not have
"i128:128" in its layout spec as it does permits i128 to
be alignmented at 8 bytes at stack. Its IR type looks like
%struct.ipv6_key_t = type { i32, [12 x i8], i128, i16, [14 x i8] }
The fix here is add i128 support in layout spec, the same as
aarch64. The only downside is we may have less optimal stack
allocation in certain cases since we require 16byte alignment
for i128 instead of 8. But this is probably fine as i128 is
not used widely and in most cases users should already
have proper alignment.
Differential Revision: https://reviews.llvm.org/D76587
There was already a test case for landingpads to handle this case, but I
had forgotten to consider PHI instructions preceding the EH_LABEL in the
landingpad.
PR45261
To make sure that replaced operands get DCEd. This drops one
iteration from gepphigep.ll, which is still not optimal.
This was the last test case performing more than 3 iterations.
NFC-ish, only worklist order should change.
Because this code does not use the IC-aware replaceInstUsesWith()
helper, we need to manually push users to the worklist.
This is NFC-ish, in that it may only change worklist order.
MC already knows how to emulate the .weak directive (with its ELF
semantics; i.e., an undefined weak symbol resolves to 0, and a defined
weak symbol has lower link precedence than a strong symbol of the same
name) using COFF weak externals. Plumb this through the ASM printer too,
so that definitions marked with __attribute__((weak)) at the language
level (which gets translated to weak linkage at the IR level) have the
corresponding .weak directive emitted. Note that declarations marked
with __attribute__((weak)) at the language level (which translates to
extern_weak at the IR level) already have .weak directives emitted.
Weak*/linkonce* symbols without an associated comdat (in particular, ones
generated with __attribute__((weak)) in C/C++) were earlier emitted as
normal unique globals, as the comdat is required to provide the linkonce
semantics. This change makes sure they are emitted as .weak instead,
allowing other symbols to override them.
Rename the existing coff-weak.ll test to coff-linkonce.ll. I'm not
quite sure what that test covers, since the behavior being tested in it
(the emission of a one_only section) is just a result of passing
-function-sections to llc; the linkonce_odr makes no difference.
Add a new coff-weak.ll which tests the new directive emission.
Based on an previous patch by Shoaib Meenai.
Differential Revision: https://reviews.llvm.org/D44543
This seems to be used in some resource files, e.g.
f3217573d7/include/wx/msw/wx.rc (L28).
MSVC rc.exe and GNU windres both allow any value here, and silently
just truncate to uint16_t range. This just explicitly allows the
-1 value and errors out on others - the same was done for control
IDs in dialogs in c1a67857ba.
Differential Revision: https://reviews.llvm.org/D76951
This change implements constant folding to constrained versions of
intrinsics, implementing rounding: floor, ceil, trunc, round, rint and
nearbyint.
Differential Revision: https://reviews.llvm.org/D72930
When we see this:
```
%a = COPY $physreg
...
SOMETHING implicit-def $physreg
...
%b = COPY $physreg
```
The two copies are not equivalent, and so we shouldn't perform any folding
on them.
When we have two instructions which use a physical register check that they
define the same virtual register(s) as well.
e.g., if we run into this case
```
%a = COPY $physreg
...
%b = COPY %a
```
we can say that the two copies are the same, and can be folded.
Differential Revision: https://reviews.llvm.org/D76890
Currently -fno-unroll-loops is ignored when doing LTO on Darwin. This
patch adds a new -lto-no-unroll-loops option to the LTO code generator
and forwards it to the linker if -fno-unroll-loops is passed.
Reviewers: thegameg, steven_wu
Reviewed By: thegameg
Differential Revision: https://reviews.llvm.org/D76916
Summary:
"Per CU" is a bit simplistic for gfx10, but I couldn't think of a better
name.
Reviewers: arsenm, rampitec, nhaehnle, dstuttard, tpr
Subscribers: kzhuravl, jvesely, wdng, yaxunl, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76861
Generalizes D62014 (R_386_NONE/R_X86_64_NONE).
Unlike ARM (D76746) and AArch64 (D76754), we cannot delete FK_NONE from
getFixupKindSize because FK_NONE is still used by R_386_TLS_DESC_CALL/R_X86_64_TLSDESC_CALL.
Having arbitrary passes looking at the TargetOptions is pretty
messy. This was also disregarding if a function already had an
explicit attribute setting on it. opt/llc now add the attributes to
functions that don't specify the attribute. clang and lld do not call
the function to do this, which they maybe should.
This was also treating unsafe-fp-math as implying the others, and
setting the other attributes based on it. This is not done anywhere
else, and I'm not sure is correct based on the current description of
the option bit.
Effectively reverts 1d8cf2be89
Generalizes D61992. In GNU as, the .reloc directive supports arbitrary relocation types.
A MCFixupKind value `V` larger than or equal to FirstLiteralRelocationKind
is used to represent the relocation type whose number is V-FirstLiteralRelocationKind.
This is useful for linker tests. Without the feature the assembler
cannot produce certain relocation records (e.g. R_ARM_ALU_PC_G0/R_ARM_LDR_PC_G0)
This helps move forward D75349 and D76575.
Differential Revision: https://reviews.llvm.org/D76746
This will cause the operation to be repeated in both a mask and another masked
or unmasked form. This can a wasted of execution resources.
Differential Revision: https://reviews.llvm.org/D60940
Summary:
Implement several XCOFF hooks to get '-r' option working for llvm-objdump -r.
Reviewer: DiggerLin, hubert.reinterpretcast, jhenderson, MaskRay
Differential Revision: https://reviews.llvm.org/D75131
Given that some instructions generate wider result elements than
their inputs, flag them as being able to generate non zeros in the
false lanes.
Differential Revision: https://reviews.llvm.org/D76766
We might have a crash scenario when we have an invalid DT_STRTAB value
that is larger than the file size. I've added a test case to demonstrate.
Differential revision: https://reviews.llvm.org/D76706
Summary:
There is a tiny logic error of D75300, making branch is not
correctly aligned with option -x86-pad-max-prefix-size
Reviewers: reames, MaskRay, craig.topper, LuoYuanke, jyknight
Reviewed By: reames
Subscribers: hiraditya, llvm-commits, annita.zhang
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76285
Summary: This patch is the first effort to adding basic optimizations for FREEZE in SelDag.
Reviewers: spatel, lebedev.ri
Reviewed By: spatel
Subscribers: xbolva00, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76707
Fix the LowerGlobalDtors pass to run destructors in the same order as the
regular LLVM destructor lowering -- in reverse order. Adjacent
destructors with the same associated object are grouped, but destructors
are not reordered based on associated objects.
Differential Revision: https://reviews.llvm.org/D70685
These transforms rely on a vector reduction flag on the SDNode
set by SelectionDAGBuilder. This flag exists because SelectionDAG
can't see across basic blocks so SelectionDAGBuilder is looking
across and saving the info. X86 is the only target that uses this
flag currently. By removing the X86 code we can remove the flag
and the SelectionDAGBuilder code.
This pass adds a dedicated IR pass for X86 that looks across the
blocks and transforms the IR into a form that the X86 SelectionDAG
can finish.
An advantage of this new approach is that we can enhance it to
shrink the phi nodes and final reduction tree based on the zeroes
that we need to concatenate to bring the partially reduced
reduction back up to the original width.
Differential Revision: https://reviews.llvm.org/D76649
We can improve computeKnownBits results by avoiding excess bitcasts.
For this pattern we were doing:
(v16i8 PACKUS(v8i16 BITCAST(v16i8 AND(V1, MASK)), v8i16 BITCAST(v16i8 AND(V2, MASK))))
By performing the MASK/AND with a v8i16 type and bitcasting V1/V2 directly we can help computeKnownBits see that the mask is clearing the upper bits and allows shuffle combining to peek through later on.
This will be necessary to extend rG9d1721ce3926 to AVX2+ targets in a future patch.
SUMMARY:
SUMMARY
for a source file "test.c"
void foo() {};
llc will generate assembly code as (assembly patch)
.globl foo
.globl .foo
.csect foo[DS]
foo:
.long .foo
.long TOC[TC0]
.long 0
and symbol table as (xcoff object file)
[4] m 0x00000004 .data 1 unamex foo
[5] a4 0x0000000c 0 0 SD DS 0 0
[6] m 0x00000004 .data 1 extern foo
[7] a4 0x00000004 0 0 LD DS 0 0
After first patch, the assembly will be as
.globl foo[DS] # -- Begin function foo
.globl .foo
.align 2
.csect foo[DS]
.long .foo
.long TOC[TC0]
.long 0
and symbol table will as
[6] m 0x00000004 .data 1 extern foo
[7] a4 0x00000004 0 0 DS DS 0 0
Change the code for the assembly path and xcoff objectfile patch for llc.
Reviewers: Jason Liu
Subscribers: wuzish, nemanjai, hiraditya
Differential Revision: https://reviews.llvm.org/D76162
Summary:
The linker is free to relax this (relocation R_PPC_GOT_TPREL16) against
R_PPC_TLS, if it sees fit (initial exec to local exec). If r0 is used,
this can generate execution-invalid code (converts to 'addi %rX, %r0,
FOO, which translates in PPC-lingo to li %rX, FOO). Forbid this
instead.
This fixes static binaries using locales on FreeBSD/powerpc
(tested on FreeBSD/powerpcspe).
Reviewed By: nemanjai
Differential Revision: https://reviews.llvm.org/D76662
As discussed on PR31443, we should be trying to use PACKUS for binary truncation patterns to reduce the number of shuffles.
The plan is to support AVX2+ targets once we've worked around PR45315 - we fail to peek through a VBROADCAST_LOAD mask to recognise zero upper bits in a PACKUS pattern.
We should also be able to add support for v8i16 and possibly 256/512-bit vectors as well.
```
// llvm-objdump -d output (before)
0: bl .-4
4: bl .+0
8: bl .+4
// llvm-objdump -d output (after) ; GNU objdump -d
0: bl 0xfffffffc / bl 0xfffffffffffffffc
4: bl 0x4
8: bl 0xc
```
Many Operand's are not annotated as OPERAND_PCREL.
They are not affected (e.g. `b .+67108860`). I plan to fix them in future patches.
Modified test/tools/llvm-objdump/ELF/PowerPC/branch-offset.s to test
address space wraparound for powerpc32 and powerpc64.
Reviewed By: sfertile, jhenderson
Differential Revision: https://reviews.llvm.org/D76591
```
// llvm-objdump -d output (before)
400000: e8 0b 00 00 00 callq 11
400005: e8 0b 00 00 00 callq 11
// llvm-objdump -d output (after)
400000: e8 0b 00 00 00 callq 0x400010
400005: e8 0b 00 00 00 callq 0x400015
// GNU objdump -d. The lack of 0x is not ideal because the result cannot be re-assembled
400000: e8 0b 00 00 00 callq 400010
400005: e8 0b 00 00 00 callq 400015
```
In llvm-objdump, we pass the address of the next MCInst. Ideally we
should just thread the address of the current address, unfortunately we
cannot call X86MCCodeEmitter::encodeInstruction (X86MCCodeEmitter
requires MCInstrInfo and MCContext) to get the length of the MCInst.
MCInstPrinter::printInst has other callers (e.g llvm-mc -filetype=asm, llvm-mca) which set Address to 0.
They leave MCInstPrinter::PrintBranchImmAsAddress as false and this change is a no-op for them.
Reviewed By: jhenderson
Differential Revision: https://reviews.llvm.org/D76580
In some scalarize/split result methods (unary, binary, ...), flags in
SDNode were not passed down, which may lead to unexpected results in
unsafe float-point optimization. This patch fixes them. (maybe not
complete)
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D76832
As long we extract from a source vector with smaller elements and we zero-extend the element in the final shuffle mask then we can safely peek through truncations and any/zero-extensions to find the source extraction.
Summary:
Below InstAlias have been redefined, this patch is to remove the repeated
definition.
mtdec/mfdec mtsdr1/mfsdr1 mtsrr0/mfsrr0 mtsrr1/mfsrr1 mtasr
Reviewed By: nemanjai, steven.zhang
Differential Revision: https://reviews.llvm.org/D75821
Summary:
This patch adds initial support for the following intrinsics:
* llvm.aarch64.sve.st2
* llvm.aarch64.sve.st3
* llvm.aarch64.sve.st4
For storing two, three and four vectors worth of data. Basic codegen for
reg+immediate forms are implemented. Reg+reg addressing modes will be
addressed in a later patch.
These intrinsics are intended for use in the Arm C Language Extension
(ACLE).
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D75947
Summary:
This patch introduces command-line support for the Armv8.6-a architecture and assembly support for BFloat16. Details can be found
https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a
in addition to the GCC patch for the 8..6-a CLI:
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-11/msg02647.html
In detail this patch
- march options for armv8.6-a
- BFloat16 assembly
This is part of a patch series, starting with command-line and Bfloat16
assembly support. The subsequent patches will upstream intrinsics
support for BFloat16, followed by Matrix Multiplication and the
remaining Virtualization features of the armv8.6-a architecture.
Based on work by:
- labrinea
- MarkMurrayARM
- Luke Cheeseman
- Javed Asbar
- Mikhail Maltsev
- Luke Geeson
Reviewers: SjoerdMeijer, craig.topper, rjmccall, jfb, LukeGeeson
Reviewed By: SjoerdMeijer
Subscribers: stuij, kristof.beyls, hiraditya, dexonsmith, danielkiss, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D76062
Some MVE floating point instructions have gpr register variants that take
the scalar gpr value and splat them to all lanes. In order to accept
them in loops, the shuffle_vector and insert need to be sunk down into
the loop, next to the instruction so that ISel can see the whole
pattern.
This does that sinking for FAdd, FSub, FMul and FCmp. The patterns for
mul are slightly more constrained as there are no fms variants taking
register arguments.
Differential Revision: https://reviews.llvm.org/D76023
I want to extend D60940 to scalar FP which will prevent forming
masked instructions if the arithmetic op has another use. To
prepare for that, this patch updates tests to avoid repeating
the operation multiple times with different masking.
Previously, we would ignore alloca alignment when building the frame
and just use the natural alignment of the allocated type. If an alloca
is over-aligned for its IR type, this could lead to a frame entry with
inadequate alignment for the downstream uses of the alloca.
Since highly-aligned fields also tend to produce poor layouts under a
naive layout algorithm, I've also switched coroutine frames to use the
new optimal struct layout algorithm.
In order to communicate the frame size and alignment to later passes,
I needed to set align+dereferenceable attributes on the frame-pointer
parameter of the resume function. This is clearly the right thing to
do, but the align attribute currently seems to result in assumptions
being added during inlining that the optimizer cannot easily remove.
We can legalize the operation MUL for v8i16 with instruction (vmladduhm A, B, 0)
if altivec enabled. Now, it is set as custom and expand it later, which is not
the right way. And then, we can add the pattern to match the mul + add with (vmladduhm A, B, C)
Reviewed By: Nemanjai
Differential Revision: https://reviews.llvm.org/D76751
More mechanical splitting of tests so we can add a one use
check to the isel patterns for forming masked instructions.
In a few cases I changed immediates of instructions in
order to avoid needing to split.