Put AllocationFn check before I->willReturn can allow CodeGenPrepare to remove useless malloc instruction
Differential Revision: https://reviews.llvm.org/D130126
In https://reviews.llvm.org/D30114, support for mismatching address
spaces was introduced to CodeGenPrepare's optimizeMemoryInst, using
addrspacecast as it was argued that only no-op addrspacecasts would be
considered when constructing the address mode. However, by doing
inttoptr/ptrtoint, it's possible to get CGP to emit an addrspace
that's not actually no-op, introducing a miscompilation:
define void @kernel(i8* %julia_ptr) {
%intptr = ptrtoint i8* %julia_ptr to i64
%ptr = inttoptr i64 %intptr to i32 addrspace(3)*
br label %end
end:
store atomic i32 1, i32 addrspace(3)* %ptr unordered, align 4
ret void
}
Gets compiled to:
define void @kernel(i8* %julia_ptr) {
end:
%0 = addrspacecast i8* %julia_ptr to i32 addrspace(3)*
store atomic i32 1, i32 addrspace(3)* %0 unordered, align 4
ret void
}
In the case of NVPTX, this introduces a cvta.to.shared, whereas
leaving out the %end block and branch doesn't trigger this
optimization. This results in illegal memory accesses as seen in
https://github.com/JuliaGPU/CUDA.jl/issues/558
In this change, I introduced a check before doing the pointer cast
that verifies address spaces are the same. If not, it emits a
ptrtoint/inttoptr combination to get a no-op cast between address
spaces. I decided against disallowing ptrtoint/inttoptr with
non-default AS in matchOperationAddr, because now its still possible
to look through multiple sequences of them that ultimately do not
result in a address space mismatch (i.e. the second lit test).
D125887 changed the ctlz/cttz despeculation transform to insert
a freeze for the introduced branch on zero. While this does fix
the "branch on poison" issue, we may still get in trouble if we
pick a different value for the branch and for the ctz argument
(i.e. non-zero for the branch, but zero for the ctz). To avoid
this, we should use the same frozen value in both positions.
This does cause a regression in RISCV codegen by introducing an
additional sext. The DAG looks like this:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %3
t4: i64 = AssertSext t2, ValueType:ch:i32
t23: i64 = freeze t4
t9: ch = CopyToReg t0, Register:i64 %0, t23
t16: ch = CopyToReg t0, Register:i64 %4, Constant:i64<32>
t18: ch = TokenFactor t9, t16
t25: i64 = sign_extend_inreg t23, ValueType:ch:i32
t24: i64 = setcc t25, Constant:i64<0>, seteq:ch
t28: i64 = and t24, Constant:i64<1>
t19: ch = brcond t18, t28, BasicBlock:ch<cond.end 0x8311f68>
t21: ch = br t19, BasicBlock:ch<cond.false 0x8311e80>
I don't see a really obvious way to improve this, as we can't push
the freeze past the AssertSext (which may produce poison).
Differential Revision: https://reviews.llvm.org/D126638
If the fma operates on a legal vector type, the indexed variants can be
used, if the second operand is a splat of a valid index.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D126234
Freeze the condition of the newly introduced conditional branch,
to avoid immediate undefined behavior if the input to ctlz/cttz
was originally poison.
Differential Revision: https://reviews.llvm.org/D125887
We often see code like the following after running SCCP:
switch (x) { case 42: phi(42, ...); }
This tends to produce bad code as we currently materialize the constant
phi-argument in the switch-block. This increases register pressure and
if the pattern repeats for `n` case statements, we end up generating `n`
constant values.
This changes CodeGenPrepare to catch this pattern and revert it back to:
switch (x) { case 42: phi(x, ...); }
Differential Revision: https://reviews.llvm.org/D124552
This adds a `TargetLoweringBase::getSwitchConditionType` callback to
give targets a chance to control the type used in
`CodeGenPrepare::optimizeSwitchInst`.
Implement callback for X86 to avoid i8 and i16 types where possible as
they often incur extra zero-extensions.
This is NFC for non-X86 targets.
Differential Revision: https://reviews.llvm.org/D124894
Re-commit of 32e8b550e5
This patch rearranges emission of CFI instructions, so the resulting
DWARF and `.eh_frame` information is precise at every instruction.
The current state is that the unwind info is emitted only after the
function prologue. This is fine for synchronous (e.g. C++) exceptions,
but the information is generally incorrect when the program counter is
at an instruction in the prologue or the epilogue, for example:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
mov x29, sp
.cfi_def_cfa w29, 16
...
```
after the `stp` is executed the (initial) rule for the CFA still says
the CFA is in the `sp`, even though it's already offset by 16 bytes
A correct unwind info could look like:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
.cfi_def_cfa_offset 16
mov x29, sp
.cfi_def_cfa w29, 16
...
```
Having this information precise up to an instruction is useful for
sampling profilers that would like to get a stack backtrace. The end
goal (towards this patch is just a step) is to have fully working
`-fasynchronous-unwind-tables`.
Reviewed By: danielkiss, MaskRay
Differential Revision: https://reviews.llvm.org/D111411
It caused builds to assert with:
(StackSize == 0 && "We already have the CFA offset!"),
function generateCompactUnwindEncoding, file AArch64AsmBackend.cpp, line 624.
when targeting iOS. See comment on the code review for reproducer.
> This patch rearranges emission of CFI instructions, so the resulting
> DWARF and `.eh_frame` information is precise at every instruction.
>
> The current state is that the unwind info is emitted only after the
> function prologue. This is fine for synchronous (e.g. C++) exceptions,
> but the information is generally incorrect when the program counter is
> at an instruction in the prologue or the epilogue, for example:
>
> ```
> stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
> mov x29, sp
> .cfi_def_cfa w29, 16
> ...
> ```
>
> after the `stp` is executed the (initial) rule for the CFA still says
> the CFA is in the `sp`, even though it's already offset by 16 bytes
>
> A correct unwind info could look like:
> ```
> stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
> .cfi_def_cfa_offset 16
> mov x29, sp
> .cfi_def_cfa w29, 16
> ...
> ```
>
> Having this information precise up to an instruction is useful for
> sampling profilers that would like to get a stack backtrace. The end
> goal (towards this patch is just a step) is to have fully working
> `-fasynchronous-unwind-tables`.
>
> Reviewed By: danielkiss, MaskRay
>
> Differential Revision: https://reviews.llvm.org/D111411
This reverts commit 32e8b550e5.
This patch rearranges emission of CFI instructions, so the resulting
DWARF and `.eh_frame` information is precise at every instruction.
The current state is that the unwind info is emitted only after the
function prologue. This is fine for synchronous (e.g. C++) exceptions,
but the information is generally incorrect when the program counter is
at an instruction in the prologue or the epilogue, for example:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
mov x29, sp
.cfi_def_cfa w29, 16
...
```
after the `stp` is executed the (initial) rule for the CFA still says
the CFA is in the `sp`, even though it's already offset by 16 bytes
A correct unwind info could look like:
```
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
.cfi_def_cfa_offset 16
mov x29, sp
.cfi_def_cfa w29, 16
...
```
Having this information precise up to an instruction is useful for
sampling profilers that would like to get a stack backtrace. The end
goal (towards this patch is just a step) is to have fully working
`-fasynchronous-unwind-tables`.
Reviewed By: danielkiss, MaskRay
Differential Revision: https://reviews.llvm.org/D111411
This teaches AArch64TargetLowering::shouldSinkOperands to sink the
operands of aarch64_neon_pmull intrinsic.
Differential Revision: https://reviews.llvm.org/D117944
Fixes the build issue with D111034, whose goal was to optimize
add/sub with long immediates.
Optimize ([add|sub] r, imm) -> ([ADD|SUB] ([ADD|SUB] r, #imm0, lsl #12), #imm1),
if imm == (imm0<<12)+imm1. and both imm0 and imm1 are non-zero 12-bit unsigned
integers.
Optimize ([add|sub] r, imm) -> ([SUB|ADD] ([SUB|ADD] r, #imm0, lsl #12), #imm1),
if imm == -(imm0<<12)-imm1, and both imm0 and imm1 are non-zero 12-bit unsigned
integers.
The change which fixed the build issue in D111034 was the use of new virtual
registers so that SSA form is maintained until deleting MI.
Differential Revision: https://reviews.llvm.org/D117429
Fixes the build issue with D111034, whose goal was to optimize
add/sub with long immediates.
Optimize ([add|sub] r, imm) -> ([ADD|SUB] ([ADD|SUB] r, #imm0, lsl #12), #imm1),
if imm == (imm0<<12)+imm1. and both imm0 and imm1 are non-zero 12-bit unsigned
integers.
Optimize ([add|sub] r, imm) -> ([SUB|ADD] ([SUB|ADD] r, #imm0, lsl #12), #imm1),
if imm == -(imm0<<12)-imm1, and both imm0 and imm1 are non-zero 12-bit unsigned
integers.
The change which fixed the build issue in D111034 was the use of new virtual
registers so that SSA form is maintained until deleting MI.
Differential Revision: https://reviews.llvm.org/D117429
This teaches AArch64TargetLowering::shouldSinkOperands to sink splat
shuffles to certain neon intrinsics, so that they can make use of the
lane variants of the instructions that are available.
Differential Revision: https://reviews.llvm.org/D112994
This patch fixes a crash when despeculating ctlz/cttz intrinsics with
scalable-vector types. It is not safe to speculatively get the size of
the vector type in bits in case the vector type is not a fixed-length type. As
it happens this isn't required as vector types are skipped anyway.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D112141
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.
The idea is that in-order cpu's require the most help in instruction
scheduling, whereas out-of-order cpus can for the most part out-of-order
schedule around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.
On an out-of-order cpu the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.
When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.
A lot of existing tests have updated. This is a summary of the important
differences:
- Most changes are the same instructions in a different order.
- Sometimes this leads to very minor inefficiencies, such as requiring
an extra mov to move variables into r0/v0 for the return value of a test
function.
- misched-fusion.ll was no longer fusing the pairs of instructions it
should, as per D110561. I've changed the schedule used in the test
for now.
- neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
the different latencies. This seems fine to me.
- Some SVE tests do not always remove movprfx where they did before due
to different register allocation giving different destructive forms.
- The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
produce two LDR where they previously produced an LDP due to
store-pair-suppress kicking in.
- arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LPD.
- Some tests such as arm64-neon-mul-div.ll and
ragreedy-local-interval-cost.ll have more, less or just different
spilling.
- In aarch64_generated_funcs.ll.generated.expected one part of the
function is no longer outlined. Interestingly if I switch this to use
any other scheduled even less is outlined.
Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.
Differential Revision: https://reviews.llvm.org/D110830
This updates a few more check lines, in some mte tests that were close
to auto generated already and some CodeGenPrepare/consthoist tests where
being able to see the entire code sequence is useful for determining
whether code differences are improvements or not.
In the combination of addressing modes, when replacing the matched phi nodes,
sometimes the phi node to be replaced has been modified. For example,
there’s matcher set [A, B] and [C, A], which will have cyclic dependency:
A is replaced by B and C will be replaced by A. Because we tried to match new phi node
to another new phi node, we should ignore new phi nodes when mapping new phi node to old one.
Reviewed By: skatkov
Differential Revision: https://reviews.llvm.org/D108635
In current implementation, the instruction to be sunk will be inserted before the target instruction without considering the def-use tree,
which may case Instruction does not dominate all uses error. We need to choose a suitable location to insert according to the use chain
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D107262
This was checking extends as shuffles, where as we should be checking
the operands. This helps sink the shuffles, creating more addl/subl
instructions.
Differential Revision: https://reviews.llvm.org/D107623
This patch updates ConstantVector::getSplat to use poison instead
of undef when using insertelement/shufflevector to splat.
This follows on from D93793.
Differential Revision: https://reviews.llvm.org/D107751
This adds a simple fold into codegenprepare that converts comparison of
branches towards comparison with zero if possible. For example:
%c = icmp ult %x, 8
br %c, bla, blb
%tc = lshr %x, 3
becomes
%tc = lshr %x, 3
%c = icmp eq %tc, 0
br %c, bla, blb
As a first order approximation, this can reduce the number of
instructions needed to perform the branch as the shift is (often) needed
anyway. At the moment this does not effect very much, as llvm tends to
prefer the opposite form. But it can protect against regressions from
commits like rG9423f78240a2.
Simple cases of Add and Sub are added along with Shift, equally as the
comparison to zero can often be folded with cpsr flags.
Differential Revision: https://reviews.llvm.org/D101778
We were missing bitreverse matches in cases where InstCombine had seen a byte-level rotation at the end of a bitreverse sequence (replacing or() with fshl()), hindering the exhaustive bitreverse matching in CodeGenPrepare later on.
tries to check for the
absence of a sequence of instructions with several CHECK-NOT with
one of
those directives using a variable defined in another.
LLVM test CodeGenPrepare/ARM/sink-add-mul-shufflevector.ll tries to
check for the absence of a sequence of instructions with several
CHECK-NOT with one of those directives using a variable defined in
another. However, CHECK-NOT are checked independently so that is using
a variable defined in a pattern that should not occur in the input. The
bug was then copied over in
Transforms/CodeGenPrepare/ARM/sink-add-mul-shufflevector-inseltpoison.ll
This commit removes the definition and uses of variable to check each
line independently, making the check stronger than the current one.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D99597
The new test llvm/test/Transforms/CodeGenPrepare/remove-assume-block.ll
breaks on non-X86 machines. Change it to look like the existing test
llvm/test/Transforms/CodeGenPrepare/X86/delete-assume-dead-code.ll
to fix it.
Reviewed By: bkramer
Differential Revision: https://reviews.llvm.org/D97952
CodeGenPrepare currently first removes empty blocks, then in a loop
performs other optimizations. One of those optimizations is the removal
of call instructions that invoke @llvm.assume, which can create new
empty blocks.
This means that when a branch only contains a call to __builtin_assume(),
the empty branch will survive into MIR, and will then only be
half-removed by MIR-level optimizations (e.g. removing the branch but
leaving the condition intact).
Fix it by eliminating @llvm.expect builtin calls before removing empty
blocks.
Reviewed By: bkramer
Differential Revision: https://reviews.llvm.org/D97848