This allows a number of optimisation passes to work, e.g. BranchFolding and MachineBlockPlacement.
Differential Revision: https://reviews.llvm.org/D131316
There is at least one Clang test (clang/test/CodeGen/arm_acle.c) which
has functions guarded by #if's that cause those functions to be compiled
only for a subset of RUN lines.
This results in a case where one RUN line has a body for the function
and another doesn't. Treat this case as a conflict for any prefixes that
the two RUN lines have in common.
This change exposed a bug where functions with '$' in the name weren't
properly recognized in ARM assembly (despite there being a test case
that was supposed to catch the problem!). This bug is fixed as well.
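To illustrate the kind of fix involved, here is a minimal sketch of a function-label matcher widened to accept '$' (the regex and helper name are illustrative, not the script's actual code):
```
import re

# Hypothetical sketch: an ARM-style function-label matcher that also accepts
# '$' in addition to the usual identifier characters.
FUNC_LABEL_RE = re.compile(r'^(?P<func>[\w.$-]+):')

def find_function_name(asm_line):
    """Return the label name if asm_line starts a function body, else None."""
    m = FUNC_LABEL_RE.match(asm_line)
    return m.group('func') if m else None

assert find_function_name('test$function:') == 'test$function'
```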
Differential Revision: https://reviews.llvm.org/D130089
The first attempt missed changing test files for tools
(update_llc_test_checks.py).
Original commit message:
This implements the main suggested change from issue #56498.
Using the shorter (non-extending) instruction with only
-Oz ("minsize") rather than -Os ("optsize") is left as a
possible follow-up.
As noted in the bug report, the zero-extending load may have
shorter latency/better throughput across a wide range of x86
micro-arches, and it avoids a potential false dependency.
The cost is an extra instruction byte.
This could cause perf ups and downs from secondary effects,
but I don't think it is possible to account for those in
advance, and that will likely also depend on exact micro-arch.
This does bring LLVM x86 codegen more in line with existing
gcc codegen, so if problems are exposed they are more likely
to occur for both compilers.
Differential Revision: https://reviews.llvm.org/D129775
Add LoongArch assembly scrubbing and triple support to update_llc_test_checks.
Depends on D128432
Reviewed By: MaskRay, xen0n
Differential Revision: https://reviews.llvm.org/D128433
Previously, when we appended check lines at the end, we could not share prefixes. This patch should make that possible and allow us to reduce some check line counts (especially for Clang/OpenMP tests).
See also: https://reviews.llvm.org/D128686
Differential Revision: https://reviews.llvm.org/D128684
Use the query that doesn't assert if TracksLiveness isn't set, since the livein list needs to always be available. We also need to start printing liveins regardless of TracksLiveness.
Support the pattern where a test file uses multiple prefixes per run line:
one prefix that is unique to the run line, and additional prefixes that are
common with other run lines.
Decide on a per-function basis which prefix(es) to emit, based on which run
lines have the same output.
Move the renaming of vregs earlier, so that we can compare the output as it
would actually be printed in check lines.
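Roughly, the per-function decision can be sketched like this (the helper name and data shapes are illustrative, not the script's actual API): run lines are grouped by the output they produce for a function, and each group's checks are emitted under the prefixes shared by every run line in that group.
```
from collections import defaultdict

def choose_prefixes(func_output_by_run, prefixes_by_run):
    # Hypothetical sketch: group run lines by the output they produce for one
    # function, then find the prefixes shared by every run line in a group.
    runs_by_output = defaultdict(list)
    for run_id, output in func_output_by_run.items():
        runs_by_output[output].append(run_id)
    blocks = []
    for output, run_ids in runs_by_output.items():
        common = set.intersection(*(set(prefixes_by_run[r]) for r in run_ids))
        blocks.append((sorted(common), output))
    return blocks
```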
Differential Revision: https://reviews.llvm.org/D126411
This is scoped to autogenerated tests.
The goal is to support having each RUN line specify a list of check-prefixes in which some prefixes may turn out to be redundant. For example, for X86, if one specified prefixes for both AVX1 and AVX2, and the codegen happened to match today, one of the prefixes would be used and the other one not.
If the unused prefix were dropped, and later, codegen differences were
introduced, one would have to go figure out where to add what prefix
(paraphrasing
https://lists.llvm.org/pipermail/llvm-dev/2021-February/148326.html)
To avoid getting errors due to unused prefixes, whole directories can be
opted out (as discussed on that thread), but that means that tests that
aren't autogenerated in such directories could have undetected unused
prefix bugs.
This patch proposes an alternative that both avoids the directory-level opt-out above and supports the main autogen scenario discussed first. The autogen tool appends the list of unused prefixes at the end of the test file, together with a note explaining that this is the case. Each prefix is set up to always pass.
This way, unexpected unused prefixes are easily discoverable, and
expected cases "just work".
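A minimal sketch of how such an always-passing block could be appended (the helper name and note wording are illustrative, not the tool's exact output):
```
def append_unused_prefix_notes(test_lines, unused_prefixes, comment_marker=';'):
    # Hypothetical sketch: append a note plus one trivially-passing check per
    # unused prefix so FileCheck never errors out on them.
    if not unused_prefixes:
        return test_lines
    out = list(test_lines)
    out.append(comment_marker * 2 + ' NOTE: These prefixes are unused; the lines below keep them passing.')
    for prefix in sorted(unused_prefixes):
        # '{{.*}}' matches anything, so the prefix always has at least one hit.
        out.append('%s %s: {{.*}}' % (comment_marker, prefix))
    return out
```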
Differential Revision: https://reviews.llvm.org/D124306
I had initially assumed this was the problem with
https://github.com/llvm/llvm-project/issues/55271#issuecomment-1133426243
But it turns out that was a simpler issue. This patch is still more correct than what we were doing before, so I figured I'd submit it anyway.
No test case because I'm not sure how to get an undef around
until expansion.
Looking at the test deltas, I wonder if it would be valid to combine (sext_inreg (freeze (aextload X))) -> (freeze (sextload X)).
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D126175
Variable captures such as `<MCInst #` can change based on unrelated changes to the LLVM backends. To avoid the generated test cases changing along with them, use an incrementing counter for variable names instead of using the actual value from the output file.
This change may also be beneficial for some nameless IR variables
(especially when combined with filtering of output), but for now I've
restricted this change to the obvious candidates (--asm-show-inst output).
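A minimal sketch of the counter-based renaming (the helper name is illustrative, not the script's actual code): the first distinct value seen becomes variable 1, the next becomes 2, and so on, so the generated names no longer depend on the backend's numbering.
```
import re

def stabilize_mcinst_captures(lines):
    # Hypothetical sketch: name FileCheck variables by an incrementing counter
    # (MCINST1, MCINST2, ...) in order of first appearance, instead of
    # embedding the backend's actual instruction number in the name.
    counter = 0
    name_for_value = {}

    def repl(m):
        nonlocal counter
        value = m.group(1)
        if value not in name_for_value:
            counter += 1
            name_for_value[value] = 'MCINST%d' % counter
            # The first occurrence defines the variable, later ones reference it.
            return '<MCInst #[[%s:[0-9]+]]' % name_for_value[value]
        return '<MCInst #[[%s]]' % name_for_value[value]

    return [re.sub(r'<MCInst #(\d+)', repl, line) for line in lines]
```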
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D125405
To avoid test churn when backends add/rename new instructions/registers,
it makes sense to use FileCheck captures for the exact MCInst/Reg number.
This is motivated by D125307, where I use --asm-show-inst to differentiate
the output for multiple instructions with the same mnemonic.
This does not quite fix the churn issue yet: While files with the generated
checks will be immune to the numbers changing, the update script test
still suffers from this problem since the number is encoded in the
FileCheck variable name. I plan to address this in a follow-up patch.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D125307
To avoid test churn when backends add/rename new instructions/registers,
it makes sense to scrub the exact MCInst/Reg number.
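A hedged sketch of what such scrubbing could look like (the regexes and helper name are illustrative):
```
import re

def scrub_mcinst_numbers(asm):
    # Hypothetical sketch: replace the exact numeric ids with FileCheck regex
    # blocks so backend renumbering does not churn the generated checks.
    asm = re.sub(r'<MCInst #\d+', '<MCInst #{{[0-9]+}}', asm)
    asm = re.sub(r'<MCOperand Reg:\d+', '<MCOperand Reg:{{[0-9]+}}', asm)
    return asm
```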
Differential Revision: https://reviews.llvm.org/D125305
As the subject says: we have a lot of tests that are driven by the LoopVectorizer's debug output, but we don't have any meaningful way to autogenerate check lines in them, which means that an insurmountable amount of manual work is required when modifying the relevant cost models. That is not sustainable, so this patch presents a solution.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D121133
This patch makes it possible to generate NVPTX assembly check lines with the update_llc_test_checks.py utility.
Differential Revision: https://reviews.llvm.org/D122986
Place PersistentId declaration under #if LLVM_ENABLE_ABI_BREAKING_CHECKS to
reduce memory usage when it is not needed.
Differential Revision: https://reviews.llvm.org/D120714
Re-commit of 32e8b550e5
This patch rearranges emission of CFI instructions, so the resulting
DWARF and `.eh_frame` information is precise at every instruction.
The current state is that the unwind info is emitted only after the
function prologue. This is fine for synchronous (e.g. C++) exceptions,
but the information is generally incorrect when the program counter is
at an instruction in the prologue or the epilogue, for example:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
mov x29, sp
.cfi_def_cfa w29, 16
...
```
After the `stp` is executed, the (initial) rule for the CFA still says the CFA is in the `sp`, even though it's already offset by 16 bytes.
A correct unwind info could look like:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
.cfi_def_cfa_offset 16
mov x29, sp
.cfi_def_cfa w29, 16
...
```
Having this information precise up to an instruction is useful for sampling profilers that would like to get a stack backtrace. The end goal (towards which this patch is just a step) is to have fully working `-fasynchronous-unwind-tables`.
Reviewed By: danielkiss, MaskRay
Differential Revision: https://reviews.llvm.org/D111411
Currently the return address ABI registers s[30:31], which fall in the call-clobbered register range, are added as live-in on the function entry to preserve their value when we have calls, so that they get saved and restored around the calls.
But the DWARF unwind information (CFI) needs to track where the return address resides in a frame, and the above approach makes it difficult to track the return address when the CFI information is emitted during frame lowering, since that would require understanding the control flow.
This patch moves the return address ABI registers s[30:31] into the callee-saved register range and stops adding them as live-in, so that the CFI machinery will know where the return address resides when the CSR save/restore happens during frame lowering.
Doing the above poses an issue: the return instruction now uses the undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction behind the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until `SI_RETURN` gets lowered to `S_SETPC_B64_return` during `expandPostRAPseudo()`.
As an added benefit, this patch simplifies overall return instruction handling.
Note: The AMDGPU CFI changes are there only in the downstream code and another
version of this patch will be posted for review for the downstream code.
Reviewed By: arsenm, ronlieb
Differential Revision: https://reviews.llvm.org/D114652
Previously we used sra+add+xor if ADDCARRY was supported. This changes to sra+xor+sub if SUBCARRY is available.
This is consistent with the recent change to the default expansion
in LegalizeDAG.
Differential Revision: https://reviews.llvm.org/D121039
We can't just split on spaces; that's not going to give us the same argv we'd have gotten from the shell, since an argument could be inside a quoted string. We must actually parse the command line as argv.
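A minimal sketch of the difference, using Python's standard shlex module (the wrapper name and the example command are illustrative):
```
import shlex

def parse_run_command(cmd):
    # Hypothetical sketch: build argv the way a shell would, so quoted
    # arguments stay intact, instead of naively splitting on spaces.
    return shlex.split(cmd)

cmd = 'llc -mtriple=x86_64 -mattr="+sse2,+avx" %s'
assert cmd.split(' ')[2] == '-mattr="+sse2,+avx"'        # naive split keeps the quotes
assert parse_run_command(cmd)[2] == '-mattr=+sse2,+avx'  # shlex strips them
```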
It caused builds targeting iOS to assert with:
(StackSize == 0 && "We already have the CFA offset!"),
function generateCompactUnwindEncoding, file AArch64AsmBackend.cpp, line 624.
See the comment on the code review for a reproducer.
> This patch rearranges emission of CFI instructions, so the resulting
> DWARF and `.eh_frame` information is precise at every instruction.
>
> The current state is that the unwind info is emitted only after the
> function prologue. This is fine for synchronous (e.g. C++) exceptions,
> but the information is generally incorrect when the program counter is
> at an instruction in the prologue or the epilogue, for example:
>
> ```
> stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
> mov x29, sp
> .cfi_def_cfa w29, 16
> ...
> ```
>
> after the `stp` is executed the (initial) rule for the CFA still says
> the CFA is in the `sp`, even though it's already offset by 16 bytes
>
> A correct unwind info could look like:
> ```
> stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
> .cfi_def_cfa_offset 16
> mov x29, sp
> .cfi_def_cfa w29, 16
> ...
> ```
>
> Having this information precise up to an instruction is useful for
> sampling profilers that would like to get a stack backtrace. The end
> goal (towards this patch is just a step) is to have fully working
> `-fasynchronous-unwind-tables`.
>
> Reviewed By: danielkiss, MaskRay
>
> Differential Revision: https://reviews.llvm.org/D111411
This reverts commit 32e8b550e5.
body_start was never used, resulting in the first filtered line being skipped.
Fixes the --filter option introduced in D117694.
Differential Revision: https://reviews.llvm.org/D119704
Add a check on run lines to pick up isel options in llc commands and allow generating check lines for final isel output rather than assembly. If the llc command line contains -debug-only=isel, update_llc_test_checks.py will try to scrub isel output; otherwise, the script falls back on the default behaviour, which is to scrub assembly output instead.
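A hedged sketch of that dispatch (the function and parameter names are illustrative, not the script's actual API):
```
def pick_output_scrubber(llc_cmd, scrub_asm, scrub_isel):
    # Hypothetical sketch: choose which scrubber to run based on the RUN line.
    if '-debug-only=isel' in llc_cmd:
        return scrub_isel
    return scrub_asm
```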
The motivation for this change is to allow update_llc_test_checks.py to autogenerate checks of instruction selection results. In this way, we can detect errors at an earlier stage, before the compilation goes all the way to assembly. It is an example of having some transparency into the stages between IR and assembly. These generated tests are almost like "unit tests" of the isel stage.
This patch only implements the initial change to differentiate isel output from assembly output for Lanai. Other targets are not supported for isel check generation at the moment, although adding support only requires implementing the function regex and scrubber for the corresponding targets. The Lanai implementation was chosen mainly for the simplicity of demonstrating the difference between isel checks and asm checks.
This patch also does not include the implementation of the function prefix, which is required for the generated isel checks to pass. I will put up a follow-up revision for the function prefix change to complete isel support.
Reviewed By: Flakebi
Differential Revision: https://reviews.llvm.org/D119368
This patch rearranges emission of CFI instructions, so the resulting
DWARF and `.eh_frame` information is precise at every instruction.
The current state is that the unwind info is emitted only after the
function prologue. This is fine for synchronous (e.g. C++) exceptions,
but the information is generally incorrect when the program counter is
at an instruction in the prologue or the epilogue, for example:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
mov x29, sp
.cfi_def_cfa w29, 16
...
```
After the `stp` is executed, the (initial) rule for the CFA still says the CFA is in the `sp`, even though it's already offset by 16 bytes.
A correct unwind info could look like:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
.cfi_def_cfa_offset 16
mov x29, sp
.cfi_def_cfa w29, 16
...
```
Having this information precise up to an instruction is useful for sampling profilers that would like to get a stack backtrace. The end goal (towards which this patch is just a step) is to have fully working `-fasynchronous-unwind-tables`.
Reviewed By: danielkiss, MaskRay
Differential Revision: https://reviews.llvm.org/D111411
Previously we used sra (X, size(X)-1); xor (add (X, Y), Y).
By placing sub at the end, we allow RISCV to combine sign_extend_inreg
with it to form subw.
Some X86 tests for Z - abs(X) seem to have improved as well.
Other targets look to be a wash.
I had to modify ARM's abs matching code to match from sub instead of
xor. Maybe instead ISD::ABS should be made legal. I'll try that in
parallel to this patch.
This is an alternative to D119099 which was focused on RISCV only.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D119171
We have the `clang -cc1` command-line option `-funwind-tables=1|2` and
the codegen option `VALUE_CODEGENOPT(UnwindTables, 2, 0) ///< Unwind
tables (1) or asynchronous unwind tables (2)`. However, this is
encoded in LLVM IR by the presence or the absence of the `uwtable` attribute, i.e. we lose the information about whether we want just some unwind tables or asynchronous unwind tables.
Asynchronous unwind tables take more space in the runtime image, I'd
estimate something like 80-90% more, as the difference is adding
roughly the same number of CFI directives as for prologues, only a bit
simpler (e.g. `.cfi_offset reg, off` vs. `.cfi_restore reg`). Or even
more, if you consider tail duplication of epilogue blocks.
Asynchronous unwind tables could also restrict code generation to having only a finite number of frame pointer adjustments (an example of *not* having a finite number of `SP` adjustments is AArch64 stack untagging (MTE), where in some cases the compiler can modify `SP` in a loop).
Having the CFI precise up to an instruction generally also means one cannot bundle together CFI instructions once the prologue is done; they need to be interspersed with ordinary instructions, which means extra `DW_CFA_advance_loc` commands, further increasing the unwind table size.
That is to say, async unwind tables impose a non-negligible overhead,
yet for the most common use cases (like C++ exceptions), they are not
even needed.
This patch extends the `uwtable` attribute with an optional
value:
- `uwtable` (defaults to `async`)
- `uwtable(sync)`, synchronous unwind tables
- `uwtable(async)`, asynchronous (instruction precise) unwind tables
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D114543
Re-add filtering options with fixes for failed tests. We were not passing the is_filtered argument in all check generator calls in update_cc_test_checks.py.
Enhance the various update_*_test_checks.py tools to allow filtering the tool output with regular expressions. The --filter option will emit only tool output lines matching the given regular expression, while the --filter-out option will emit only tool output lines not matching the given regular expression. Filters are applied in order of appearance on the command line (or in UTC_ARGS) and the first matching filter terminates the search.
This allows test authors to create more focused tests by removing irrelevant
tool output and checking only the pieces of output necessary to test the desired
functionality.
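A minimal sketch of that ordering (the helper name and data shapes are assumptions, not the scripts' real types): filters are checked in command-line order and the first one that matches decides whether a line is kept.
```
import re

def line_is_kept(line, filters):
    # Hypothetical sketch: `filters` is a list of (regex, is_filter_out)
    # pairs in command-line / UTC_ARGS order.
    for regex, is_filter_out in filters:
        if re.search(regex, line):
            # The first matching filter terminates the search.
            return not is_filter_out
    # Unmatched lines survive only when every filter is a --filter-out.
    return all(is_filter_out for _, is_filter_out in filters)
```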
Differential Revision: https://reviews.llvm.org/D117694
Enhance the various update_*_test_checks.py tools to allow filtering the tool output with regular expressions. The --filter option will emit only tool output lines matching the given regular expression, while the --filter-out option will emit only tool output lines not matching the given regular expression. Filters are applied in order of appearance on the command line (or in UTC_ARGS) and the first matching filter terminates the search.
This allows test authors to create more focused tests by removing irrelevant
tool output and checking only the pieces of output necessary to test the desired
functionality.
Differential Revision: https://reviews.llvm.org/D117694
By reordering the objects on the stack frame after looking at their users, a better utilization of displacement operands results. This means fewer Load Address instructions are needed to access these objects. This is important for very large functions where otherwise small changes could cause many more (or fewer) accesses to go out of range.
Note: this is not yet enabled for SystemZXPLINKFrameLowering, but should be.
Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D115690
The UpdateTestChecks tool itself does not care which pass manager is used in the opt invocation. So the lit tests that verify the behavior of the UpdateTestChecks tool are updated to use the new-PM syntax (-passes=) when specifying the pass pipeline in the test cases used for verifying the tool.
Differential Revision: https://reviews.llvm.org/D114517
Add an alias of `addi [x], zero, imm` to generate the pseudo instruction li, which makes assembly much more readable. Existing tests can be updated by running the script `llvm/utils/update_llc_test_checks.py`.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D112692
CMOVGE reads SF and OF. CMOVNS only reads SF. This matches other recent changes to use a single flag where possible. It also matches gcc codegen.
I believe this technically changes whether the conditional move happens on INT_MIN, but for INT_MIN both registers are the same so it doesn't matter.
Differential Revision: https://reviews.llvm.org/D111826
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.
The idea is that in-order cpus require the most help in instruction scheduling, whereas out-of-order cpus can for the most part out-of-order schedule around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.
On an out-of-order cpu the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.
When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.
A lot of existing tests have been updated. This is a summary of the important
differences:
- Most changes are the same instructions in a different order.
- Sometimes this leads to very minor inefficiencies, such as requiring
an extra mov to move variables into r0/v0 for the return value of a test
function.
- misched-fusion.ll was no longer fusing the pairs of instructions it
should, as per D110561. I've changed the schedule used in the test
for now.
- neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
the different latencies. This seems fine to me.
- Some SVE tests do not always remove movprfx where they did before due
to different register allocation giving different destructive forms.
- The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
produce two LDR where they previously produced an LDP due to
store-pair-suppress kicking in.
- arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
- Some tests such as arm64-neon-mul-div.ll and
ragreedy-local-interval-cost.ll have more, less or just different
spilling.
- In aarch64_generated_funcs.ll.generated.expected one part of the function is no longer outlined. Interestingly, if I switch this to use any other schedule even less is outlined.
Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.
Differential Revision: https://reviews.llvm.org/D110830
On MIPS, functions with exception handling code emit an additional temporary label at the start of the function (due to UseAssignmentForEHBegin):
_Z8do_catchv: # @_Z8do_catchv
.Ltmp3:
.set .Lfunc_begin0, .Ltmp3
.cfi_startproc
.cfi_personality 128, DW.ref.__gxx_personality_v0
.cfi_lsda 0, .Lexception0
.frame $c11,48,$c17
.mask 0x00000000,0
.fmask 0x00000000,0
.set noreorder
.set nomacro
.set noat
# %bb.0: # %entry
The `[^:]*` regex was terminating the search after .Ltmp<N>: and therefore
not detecting functions with exception handling.
Reviewed By: atanasyan, MaskRay
Differential Revision: https://reviews.llvm.org/D100027
Make the update_llc_test_checks script test independent of llc behavior by using cat with static files to simulate llc output. This allows changing llc without breaking the script test case.
The update script is executed in a temporary directory, so the
llc-generated assembly files are copied there. %T is deprecated, but it
allows copying a file with a predictable filename.
Differential Revision: https://reviews.llvm.org/D110143
Previously the script emitted output using plain CHECK directives. This
can result in a test passing even if there are some instructions between
CHECK directives that should have been removed. It also makes debugging
tests that have the output in a different order more difficult since
FileCheck can match with a later line and then complain about the "wrong"
directive not being found.
This will cause quite large diffs when updating existing tests, but I'm not sure we need an opt-in flag here.
Depends on D109765 (pre-commit tests)
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D109767