This enabled the select optimize patch for ARM Out of order AArch64
cores. It is trying to solve a problem that is difficult for the
compiler to fix. The criteria for when a csel is better or worse than a
branch depends heavily on whether the branch is well predicted and the
amount of ILP in the loop (as well as other criteria like the core in
question and the relative performance of the branch predictor). The
pass seems to do a decent job though, with the inner loop heuristics
being well implemented and doing a better job than I had expected in
general, even without PGO information.
I've been doing quite a bit of benchmarking. The headline numbers are
these for SPEC2017 on a Neoverse N1:
500.perlbench_r -0.12%
502.gcc_r 0.02%
505.mcf_r 6.02%
520.omnetpp_r 0.32%
523.xalancbmk_r 0.20%
525.x264_r 0.02%
531.deepsjeng_r 0.00%
541.leela_r -0.09%
548.exchange2_r 0.00%
557.xz_r -0.20%
Running benchmarks with a combination of the llvm-test-suite plus
several versions of SPEC gave between a 0.2% and 0.4% geomean
improvement depending on the core/run. The instruction count went down
by 0.1% too, which is a good sign, but the results can be a little
noisy. Some issues from other benchmarks I had ran were improved in
rGca78b5601466f8515f5f958ef8e63d787d9d812e. In summary well predicted
branches will see in improvement, badly predicted branches may get
worse, and on average performance seems to be a little better overall.
This patch enables the pass for AArch64 under -O3 for cores that will
benefit for it. i.e. not in-order cores that do not fit into the "Assume
infinite resources that allow to fully exploit the available
instruction-level parallelism" cost model. It uses a subtarget feature
for specifying when the pass will be enabled, which I have enabled under
cpu=generic as the performance increases for out of order cores seems
larger than any decreases for inorder, which were minor.
Differential Revision: https://reviews.llvm.org/D138990
This option was enabled in D128582, and whilst it seems to be a net
improvement in many cases, at least a couple of issues have been
reported from D135957 and from the CSE added to the backend causing more
instructions in executed blocks. Revert for the time being, until we can
improve the precision.
The Key for the SubtargetMap had the StreamingSVEModeDisabled in the
wrong place. This change is non-functional, since the string (key) is
still unique.
When the SME attributes tell that a function is or may be executed in Streaming
SVE mode, we currently need to be conservative and disable _any_ vectorization
(fixed or scalable) because the code-generator does not yet support generating
streaming-compatible code.
Scalable auto-vec will be gradually enabled in the future when we have
confidence that the loop-vectorizer won't use any SVE or NEON instructions
that are illegal in Streaming SVE mode.
Reviewed By: paulwalker-arm
Differential Revision: https://reviews.llvm.org/D135950
The new pass implements the following:
* Inserts code at the start of an arm_new_za function to
commit a lazy-save when the lazy-save mechanism is active.
* Adds a smstart intrinsic at the start of the function.
* Adds a smstop intrinsic at the end of the function.
Patch co-authored by kmclaughlin.
Differential Revision: https://reviews.llvm.org/D133896
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.
Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.
KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.
A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:
```
.c:
int f(void);
int (*p)(void) = f;
p();
.s:
.4byte __kcfi_typeid_f
.global f
f:
...
```
Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.
As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.
Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.
Relands 67504c9549 with a fix for
32-bit builds.
Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay
Differential Revision: https://reviews.llvm.org/D119296
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.
Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.
KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.
A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:
```
.c:
int f(void);
int (*p)(void) = f;
p();
.s:
.4byte __kcfi_typeid_f
.global f
f:
...
```
Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.
As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.
Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.
Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay
Differential Revision: https://reviews.llvm.org/D119296
GEP's across basic blocks were not getting splitted due to EnableGEPOpt
which was turned off by default. Hence, EarlyCSE missed the opportunity
to eliminate common part of GEP's. This can be achieved by simply
turning GEP pass on.
- This patch moves SeparateConstOffsetFromGEPPass() just before LSR.
- It enables EnableGEPOpt by default.
Resolves - https://github.com/llvm/llvm-project/issues/50528
Added an unit test.
Differential Revision: https://reviews.llvm.org/D128582
treated as Copy instruction in MCP.
This is then used in AArch64 to remove copy instructions after taildup
ran in machine block placement
Differential Revision: https://reviews.llvm.org/D125335
This patch adds an AArch64 specific PostRA MachineScheduler to try to schedule
STP Q's to the same base-address in ascending order of offsets. We have found
this to improve performance on Neoverse N1 and should not hurt other AArch64
cores.
Differential Revision: https://reviews.llvm.org/D125377
This pass inserts the necessary CFI instructions to compensate for the
inconsistency of the call-frame information caused by linear (non-CGA
aware) nature of the unwind tables.
Unlike the `CFIInstrInserer` pass, this one almost always emits only
`.cfi_remember_state`/`.cfi_restore_state`, which results in smaller
unwind tables and also transparently handles custom unwind info
extensions like CFA offset adjustement and save locations of SVE
registers.
This pass takes advantage of the constraints taht LLVM imposes on the
placement of save/restore points (cf. `ShrinkWrap.cpp`):
* there is a single basic block, containing the function prologue
* possibly multiple epilogue blocks, where each epilogue block is
complete and self-contained, i.e. CSR restore instructions (and the
corresponding CFI instructions are not split across two or more
blocks.
* prologue and epilogue blocks are outside of any loops
Thus, during execution, at the beginning and at the end of each basic
block the function can be in one of two states:
- "has a call frame", if the function has executed the prologue, or
has not executed any epilogue
- "does not have a call frame", if the function has not executed the
prologue, or has executed an epilogue
These properties can be computed for each basic block by a single RPO
traversal.
From the point of view of the unwind tables, the "has/does not have
call frame" state at beginning of each block is determined by the
state at the end of the previous block, in layout order.
Where these states differ, we insert compensating CFI instructions,
which come in two flavours:
- CFI instructions, which reset the unwind table state to the
initial one. This is done by a target specific hook and is
expected to be trivial to implement, for example it could be:
```
.cfi_def_cfa <sp>, 0
.cfi_same_value <rN>
.cfi_same_value <rN-1>
...
```
where `<rN>` are the callee-saved registers.
- CFI instructions, which reset the unwind table state to the one
created by the function prologue. These are the sequence:
```
.cfi_restore_state
.cfi_remember_state
```
In this case we also insert a `.cfi_remember_state` after the
last CFI instruction in the function prologue.
Reviewed By: MaskRay, danielkiss, chill
Differential Revision: https://reviews.llvm.org/D114545
static_cast is a little safer here since the compiler will
ensure we're casting to a class derived from
yaml::MachineFunctionInfo.
I believe this first appeared on AMDGPU and was copied to the
other two targets.
Spotted when it was being copied to RISCV in D123178.
Differential Revision: https://reviews.llvm.org/D123260
This pass inserts the necessary CFI instructions to compensate for the
inconsistency of the call-frame information caused by linear (non-CFG
aware) nature of the unwind tables.
Unlike the `CFIInstrInserer` pass, this one almost always emits only
`.cfi_remember_state`/`.cfi_restore_state`, which results in smaller
unwind tables and also transparently handles custom unwind info
extensions like CFA offset adjustement and save locations of SVE
registers.
This pass takes advantage of the constraints that LLVM imposes on the
placement of save/restore points (cf. `ShrinkWrap.cpp`):
* there is a single basic block, containing the function prologue
* possibly multiple epilogue blocks, where each epilogue block is
complete and self-contained, i.e. CSR restore instructions (and the
corresponding CFI instructions are not split across two or more
blocks.
* prologue and epilogue blocks are outside of any loops
Thus, during execution, at the beginning and at the end of each basic
block the function can be in one of two states:
- "has a call frame", if the function has executed the prologue, or
has not executed any epilogue
- "does not have a call frame", if the function has not executed the
prologue, or has executed an epilogue
These properties can be computed for each basic block by a single RPO
traversal.
In order to accommodate backends which do not generate unwind info in
epilogues we compute an additional property "strong no call frame on
entry" which is set for the entry point of the function and for every
block reachable from the entry along a path that does not execute the
prologue. If this property holds, it takes precedence over the "has a
call frame" property.
From the point of view of the unwind tables, the "has/does not have
call frame" state at beginning of each block is determined by the
state at the end of the previous block, in layout order.
Where these states differ, we insert compensating CFI instructions,
which come in two flavours:
- CFI instructions, which reset the unwind table state to the
initial one. This is done by a target specific hook and is
expected to be trivial to implement, for example it could be:
```
.cfi_def_cfa <sp>, 0
.cfi_same_value <rN>
.cfi_same_value <rN-1>
...
```
where `<rN>` are the callee-saved registers.
- CFI instructions, which reset the unwind table state to the one
created by the function prologue. These are the sequence:
```
.cfi_restore_state
.cfi_remember_state
```
In this case we also insert a `.cfi_remember_state` after the
last CFI instruction in the function prologue.
Reviewed By: MaskRay, danielkiss, chill
Differential Revision: https://reviews.llvm.org/D114545
The introduction and some examples are on this page:
https://devblogs.microsoft.com/cppblog/announcing-jmc-stepping-in-visual-studio/
The `/JMC` flag enables these instrumentations:
- Insert at the beginning of every function immediately after the prologue with
a call to `void __fastcall __CheckForDebuggerJustMyCode(unsigned char *JMC_flag)`.
The argument for `__CheckForDebuggerJustMyCode` is the address of a boolean
global variable (the global variable is initialized to 1) with the name
convention `__<hash>_<filename>`. All such global variables are placed in
the `.msvcjmc` section.
- The `<hash>` part of `__<hash>_<filename>` has a one-to-one mapping
with a directory path. MSVC uses some unknown hashing function. Here I
used DJB.
- Add a dummy/empty COMDAT function `__JustMyCode_Default`.
- Add `/alternatename:__CheckForDebuggerJustMyCode=__JustMyCode_Default` link
option via ".drectve" section. This is to prevent failure in
case `__CheckForDebuggerJustMyCode` is not provided during linking.
Implementation:
All the instrumentations are implemented in an IR codegen pass. The pass is placed immediately before CodeGenPrepare pass. This is to not interfere with mid-end optimizations and make the instrumentation target-independent (I'm still working on an ELF port in a separate patch).
Reviewed By: hans
Differential Revision: https://reviews.llvm.org/D118428
When this pass was originally implemented, the fix pass was enabled
using a llvm command-line flag. This works fine, except in the case of
LTO, where the flag is not passed into the linker plugin in order to
enable the function pass in the LTO backend.
Now LTO exists, the expectation now is to use target features rather
than command-line arguments to control code generation, as this ensures
that different command-line arguments in different files are correctly
represented, and target-features always get to the LTO plugin as they
are encoded into LLVM IR.
The fall-out of this change is that the fix pass has to always be added
to the backend pass pipeline, so now it makes no changes if the function
does not have the right target feature to enable it. This should make a
minimal difference to compile time.
One advantage is it's now much easier to enable when compiling for a
Cortex-A53, as CPUs imply their own individual sets of target-features,
in a more fine-grained way. I haven't done this yet, but it is an
option, if the fix should be enabled in more places.
Existing tests of the user interface are unaffected, the changes are to
reflect that the argument is now turned into a target feature.
Reviewed By: tmatheson
Differential Revision: https://reviews.llvm.org/D114703
This is a first attempt at a constant value consecutive store merging pass,
a counterpart to the DAGCombiner's store merging optimization.
The high level goals of this pass:
* Have a simple and efficient algorithm. As close to linear time as we can get.
Thus, prioritizing scalability of the algorithm over merging every corner case
we can find. The DAGCombiner's store merging code has been the source of
compile time and complexity issues in the past and I wanted to avoid that.
* Don't introduce any new data structures for ordering memory operations. In MIR,
we don't have the concept of chains like we do in the DAG, and the instruction
order is stricter than enforcing ordering with graph edges. Although I
considered adding something similar, I couldn't justify the overhead.
The pass is current split into 3 main parts. The main store merging code focuses
on identifying candidate stores and managing the candidate group that's under
consideration for merging. Analyzing addressing of stores is a potentially
complex part and for now there's just a basic implementation to identify easy
cases. Finally, the other main bit of complexity is the alias analysis, which
tries to follow the same logic as the DAG's AA.
Currently this implementation only supports merging of constant stores. Stores
of arbitrary variables are technically possible with a very small change, but
the DAG chooses not to do this. Doing so here makes most code worse since
there's extra overhead in merging values into wider registers.
On AArch64 -Os, this optimization results in very minor savings on CTMark.
Differential Revision: https://reviews.llvm.org/D109131
This patch ensures that we always tune for a given CPU on AArch64
targets when the user specifies the "-mtune=xyz" flag. In the
AArch64Subtarget if the tune flag is unset we use the CPU value
instead.
I've updated the release notes here:
llvm/docs/ReleaseNotes.rst
and added tests here:
clang/test/Driver/aarch64-mtune.c
Differential Revision: https://reviews.llvm.org/D110258
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.
This allows us to ensure that Support doesn't have includes from MC/*.
Differential Revision: https://reviews.llvm.org/D111454
This enables the type promotion pass for AArch64, which acts as a
CodeGenPrepare pass to promote illegal integers to legal ones,
especially useful for removing extends that would otherwise require
cross-basic-block analysis.
I have enabled this generally, for both ISel and GlobalISel. In some
quick experiments it appeared to help GlobalISel remove extra extends in
places too, but that might just be missing optimizations that are better
left for later. We can disable it again if required.
In my experiments, this can improvement performance in some cases, and
codesize was a small improvement. SPEC was a very small improvement,
within the noise. Some of the test cases show extends being moved out of
loops, often when the extend would be part of a cmp operand, but that
should reduce the latency of the instruction in the loop on many cpus.
The signed-truncation-check tests are increasing as they are no longer
matching specific DAG combines.
We also hope to add some additional improvements to the pass in the near
future, to capture more cases of promoting extends through phis that
have come up in a few places lately.
Differential Revision: https://reviews.llvm.org/D110239
This reverts the revert commit f85d8a5bed
with bug fixes.
Original message:
MOVi32imm + ANDWrr ==> ANDWri + ANDWri
MOVi64imm + ANDXrr ==> ANDXri + ANDXri
The mov pseudo instruction could be expanded to multiple mov instructions later.
In this case, try to split the constant operand of mov instruction into two
bitmask immediates. It makes only two AND instructions intead of multiple
mov + and instructions.
Added a peephole optimization pass on MIR level to implement it.
Differential Revision: https://reviews.llvm.org/D109963
MOVi32imm + ANDWrr ==> ANDWri + ANDWri
MOVi64imm + ANDXrr ==> ANDXri + ANDXri
The mov pseudo instruction could be expanded to multiple mov instructions later.
In this case, try to split the constant operand of mov instruction into two
bitmask immediates. It makes only two AND instructions intead of multiple
mov + and instructions.
Added a peephole optimization pass on MIR level to implement it.
Differential Revision: https://reviews.llvm.org/D109963
We never bothered to have a separate set of combines for -O0 in the prelegalizer
before. This results in some minor performance hits for a mode where performance
isn't a concern (although not regressing code size significantly is still preferable).
This also removes the CSE option since we don't need it for -O0.
Through experiments, I've arrived at a set of combines that gets the most code
size improvement at -O0, while reducing the amount of time spent in the combiner
by around 35% give or take.
Differential Revision: https://reviews.llvm.org/D102038
To do this while supporting the existing functionality in SelectionDAG of using
PGO info, we add the ProfileSummaryInfo and LazyBlockFrequencyInfo analysis
dependencies to the instruction selector pass.
Then, use the predicate to generate constant pool loads for f32 materialization,
if we're targeting optsize/minsize.
Differential Revision: https://reviews.llvm.org/D97732
This patch adjusts the placement of the bundle unpacking to just before
code emission. In particular, this means bundle unpacking happens AFTER
the machine outliner. With the previous position, the machine outliner
may outline parts of a bundle, which breaks them up.
This is an issue for BLR_RVMARKER handling, as illustrated by the
rvmarker-pseudo-expansion-and-outlining.mir test case. The machine
outliner should not break up the bundles created during pseudo
expansion.
This should fix PR49082.
Reviewed By: SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D96294
In the future Windows will enable Control-flow Enforcement Technology (CET aka shadow stacks). To protect the path where the context is updated during exception handling, the binary is required to enumerate valid unwind entrypoints in a dedicated section which is validated when the context is being set during exception handling.
This change allows llvm to generate the section that contains the appropriate symbol references in the form expected by the msvc linker.
This feature is enabled through a new module flag, ehcontguard, which was modelled on the cfguard flag.
The change includes a test that when the module flag is enabled the section is correctly generated.
The set of exception continuation information includes returns from exceptional control flow (catchret in llvm).
In order to collect catchret we:
1) Includes an additional flag on machine basic blocks to indicate that the given block is the target of a catchret operation,
2) Introduces a new machine function pass to insert and collect symbols at the start of each block, and
3) Combines these targets with the other EHCont targets that were already being collected.
Change originally authored by Daniel Frampton <dframpto@microsoft.com>
For more details, see MSVC documentation for `/guard:ehcont`
https://docs.microsoft.com/en-us/cpp/build/reference/guard-enable-eh-continuation-metadata
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D94835
Second land attempt. MachineVerifier DefRegState expensive check errors fixed.
Prologs and epilogs handle callee-save registers and tend to be irregular with
different immediate offsets that are not often handled by the MachineOutliner.
Commit D18619/a5335647d5e8 (combining stack operations) stretched irregularity
further.
This patch tries to emit homogeneous stores and loads with the same offset for
prologs and epilogs respectively. We have observed that this canonicalizes
(homogenizes) prologs and epilogs significantly and results in a greatly
increased chance of outlining, resulting in a code size reduction.
Despite the above results, there are still size wins to be had that the
MachineOutliner does not provide due to the special handling X30/LR. To handle
the LR case, his patch custom-outlines prologs and epilogs in place. It does
this by doing the following:
* Injects HOM_Prolog and HOM_Epilog pseudo instructions during a Prolog and
Epilog Injection Pass.
* Lowers and optimizes said pseudos in a AArchLowerHomogneousPrologEpilog Pass.
* Outlined helpers are created on demand. Identical helpers are merged by the linker.
* An opt-in flag is introduced to enable this feature. Another threshold flag
is also introduced to control the aggressiveness of outlining for application's need.
This reduced an average of 4% of code size on LLVM-TestSuite/CTMark targeting arm64/-Oz.
Differential Revision: https://reviews.llvm.org/D76570
Prologs and epilogs handle callee-save registers and tend to be irregular with
different immediate offsets that are not often handled by the MachineOutliner.
Commit D18619/a5335647d5e8 (combining stack operations) stretched irregularity
further.
This patch tries to emit homogeneous stores and loads with the same offset for
prologs and epilogs respectively. We have observed that this canonicalizes
(homogenizes) prologs and epilogs significantly and results in a greatly
increased chance of outlining, resulting in a code size reduction.
Despite the above results, there are still size wins to be had that the
MachineOutliner does not provide due to the special handling X30/LR. To handle
the LR case, his patch custom-outlines prologs and epilogs in place. It does
this by doing the following:
* Injects HOM_Prolog and HOM_Epilog pseudo instructions during a Prolog and
Epilog Injection Pass.
* Lowers and optimizes said pseudos in a AArchLowerHomogneousPrologEpilog Pass.
* Outlined helpers are created on demand. Identical helpers are merged by the linker.
* An opt-in flag is introduced to enable this feature. Another threshold flag
is also introduced to control the aggressiveness of outlining for application's need.
This reduced an average of 4% of code size on LLVM-TestSuite/CTMark targeting arm64/-Oz.
Differential Revision: https://reviews.llvm.org/D76570
Add the aarch64[_be]-*-gnu_ilp32 targets to support the GNU ILP32 ABI for AArch64.
The needed codegen changes were mostly already implemented in D61259, which added support for the watchOS ILP32 ABI. The main changes are:
- Wiring up the new target to enable ILP32 codegen and MC.
- ILP32 va_list support.
- ILP32 TLSDESC relocation support.
There was existing MC support for ELF ILP32 relocations from D25159 which could be enabled by passing "-target-abi ilp32" to llvm-mc. This was changed to check for "gnu_ilp32" in the target triple instead. This shouldn't cause any issues since the existing support was slightly broken: it was generating ELF64 objects instead of the ELF32 object files expected by the GNU ILP32 toolchain.
This target has been tested by running the full rustc testsuite on a big-endian ILP32 system based on the GCC ILP32 toolchain.
Reviewed By: kristof.beyls
Differential Revision: https://reviews.llvm.org/D94143