Init captures are added in processBlock() to avoid capturing structured
bindings, which caused build problems with clang.
RISCV has this disabled for now until problems relating to post-RA pseudo
expansions are resolved.
A new pass MachineLateInstrsCleanup is added to be run after PEI.
This is a simple pass that removes redundant, identical instructions
whenever they are found, by scanning the MF once while keeping track of
register definitions in a map. These instructions are typically immediate
loads resulting from rematerialization, and address loads emitted by the
target in eliminateFrameIndex().
This is enabled by default, but a target could easily disable it by means of
'disablePass(&MachineLateInstrsCleanupID);'.
This late cleanup is naturally not "optimal" in removing instructions, as
it works on phys-regs, but it is still quite effective. It would be
desirable to improve other parts of CodeGen and avoid these redundant
instructions in the first place, but there are no ideas for this yet.
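As a self-contained toy sketch of the idea (hypothetical instruction model
and names; the real pass works on MachineInstrs and also invalidates map
entries whose source registers are clobbered):

  #include <map>
  #include <string>
  #include <vector>

  // One forward scan per block: remember the last definition seen for
  // each register; a redefinition identical to the still-live previous
  // one is redundant and gets dropped.
  struct Inst {
    std::string Def;  // register defined, e.g. "r7"
    std::string Text; // whole instruction, e.g. "li r7, 42"
  };

  void cleanupBlock(std::vector<Inst> &Block) {
    std::map<std::string, std::string> LastDef; // reg -> defining text
    std::vector<Inst> Kept;
    for (const Inst &I : Block) {
      auto It = LastDef.find(I.Def);
      if (It != LastDef.end() && It->second == I.Text)
        continue; // identical redefinition: remove it
      LastDef[I.Def] = I.Text;
      Kept.push_back(I);
    }
    Block = Kept;
  }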
Differential Revision: https://reviews.llvm.org/D123394
Reviewed By: RKSimon, foad, craig.topper, arsenm, asb
As stated in
https://discourse.llvm.org/t/rfc-llc-add-expandlargeintfpconvert-pass-for-fp-int-conversion-of-large-bitint/65528,
this implementation is very similar to ExpandLargeDivRem: it expands
‘fptoui .. to’, ‘fptosi .. to’, ‘uitofp .. to’, and ‘sitofp .. to’
instructions with a bitwidth above a threshold into auto-generated
functions. This is useful for targets like x86_64 that cannot lower FP
conversions with more than 128 bits. The expanded code follows the IR
generated from `compiler-rt/lib/builtins/floattidf.c`,
`compiler-rt/lib/builtins/fixdfti.c`, etc.
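For a rough picture of the kind of code this expands to (a hand-written
sketch that ignores correct rounding, which the compiler-rt-derived
expansion handles carefully):

  #include <cstdint>

  // Soft-float uitofp for a 256-bit unsigned integer given as four
  // little-endian 64-bit words: fold the words in from the top,
  // scaling by 2^64 at each step.
  double uitofp_i256(const uint64_t Words[4]) {
    const double TwoPow64 = 18446744073709551616.0; // 2^64
    double Result = 0.0;
    for (int I = 3; I >= 0; --I)
      Result = Result * TwoPow64 + static_cast<double>(Words[I]);
    return Result;
  }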
Corner cases:
1. For fp16: there are no related builtins in compiler-rt, so the
implementation mainly uses the fp32 <-> fp16 libcalls.
2. For fp80: this pass is soft-FP emulation and no fp80 instructions can
help with this problem, so I recommend users avoid this usage. For now, the
implementation uses fp128 as the temporary conversion type and inserts
fptrunc/fpext at the start/end of the function.
3. For bf16: the clang frontend currently doesn't support bf16 arithmetic
operations (convert to int, float, +, -, *, ...), so this patch doesn't
consider bf16 for now.
4. For unsigned FPToI: both the default hardware behavior and libgcc ignore
the "returns 0 for negative input" part of the spec, so this pass follows
the same old convention and does not handle unsigned FPToI specially. See
this example: https://gcc.godbolt.org/z/bnv3jqW1M
The end-to-end tests are uploaded at https://reviews.llvm.org/D138261
Reviewed By: LuoYuanke, mgehre-amd
Differential Revision: https://reviews.llvm.org/D137241
Currently per-function metadata consists of:
(start-pc, size, features)
This adds a new UAR feature and, if it is set, an additional element:
(start-pc, size, features, stack-args-size)
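Conceptually (illustrative field names, not the actual emitted layout):

  #include <cstddef>
  #include <cstdint>

  // Sketch of one per-function metadata record; the stack-args-size
  // element is only emitted when the UAR feature bit is set.
  struct FunctionMetadata {
    uintptr_t StartPC;
    size_t Size;
    uint32_t Features;    // feature bitmask, including the new UAR bit
    size_t StackArgsSize; // present only if the UAR feature is set
  };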
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D136078
We can reuse constants if we use SRL followed by AND and AND followed by SHL.
The same was done for bitreverse previously.
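For example, in the half-swap step below, pairing SRL-then-AND with
AND-then-SHL lets a single mask constant serve both sides (a hand-written
illustration, not the exact DAG output):

  #include <cstdint>

  uint32_t swapBytesInHalves(uint32_t X) {
    const uint32_t Mask = 0x00FF00FF; // shared by both expressions
    // (X >> 8) & Mask and (X & Mask) << 8 reuse the same constant; the
    // naive (X << 8) & 0xFF00FF00 form would require a second one.
    return ((X >> 8) & Mask) | ((X & Mask) << 8);
  }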
Differential Revision: https://reviews.llvm.org/D138045
Adds the Complex Deinterleaving Pass implementing support for complex numbers in a target-independent manner, deferring to the TargetLowering for the given target to create a target-specific intrinsic.
Differential Revision: https://reviews.llvm.org/D114174
SpeculativelyExecuteBB(), which converts a branch + phi structure
into a select, currently bails out if the block contains an assume
(because it is not speculatable).
Adjust the fold to ignore ephemeral values (i.e. assumes and values
only used in assumes) for cost modelling purposes, and drop them
when performing the fold.
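In source terms, the fold targets shapes like this (a hypothetical example;
the assume used to block the conversion and is now dropped):

  int selectLike(bool C, int A, int B) {
    int R = B;
    if (C) {
      __builtin_assume(A != 0); // ephemeral: ignored for cost, then dropped
      R = A;                    // speculatable: folds to R = C ? A : B
    }
    return R;
  }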
Theoretically, we could try to preserve the assume information by
generating an assume(br_cond || assume_cond) style assume, but this
is very unlikely to be useful (because we don't do anything
useful with assumes of this form) and it would make things
substantially more complicated once we take operand bundle assumes
into account (which don't really support an || operation).
I'd prefer not to do that without good motivation.
Differential Revision: https://reviews.llvm.org/D137339
The instruction icmp ule <4 x i32> %0, zeroinitializer will usually be
simplified to icmp eq <4 x i32> %0, zeroinitializer, since an unsigned
value is <= 0 only when it equals 0. This is not guaranteed though, and the
code for lowering vector compares could pick the wrong form of the
instruction if the unsimplified form reached it. I've tried to make
the code more explicit about the supported conditions.
This fixes NEON being unable to select VCMPZ with HS conditions, and
fixes some incorrect MVE patterns.
Fixes #58514.
Differential Revision: https://reviews.llvm.org/D136447
Currently MachineCSE forbids PRE when the instruction reads a physical
register. Relax this so that it's allowed when the value being read is
the same as what would be read in the place the instruction would be
hoisted to.
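In source terms, the shape PRE targets looks like this (a generic
hand-written illustration; the relaxed check additionally permits the
hoisted computation to read a physical register, provided that register
holds the same value at the hoist point):

  // A + B is computed on one path and again afterwards; PRE hoists the
  // computation above the branch so it executes exactly once.
  int f(bool C, int A, int B) {
    int Y = 0;
    if (C)
      Y = A + B;   // partially redundant with the computation below
    int Z = A + B; // after PRE, both uses share one hoisted A + B
    return Y + Z;
  }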
This is being done in preparation for adding FPCR handling to the
AArch64 backend, in order to prevent it from worsening the
generated code, but for targets that already have a similar register
it should improve things.
This patch affects code generation in several tests. The new code
looks better except for in Thumb2/LowOverheadLoops/memcall.ll where
we perform PRE but the LowOverheadLoops transformation then undoes
it. Also in AMDGPU/selectcc-opt.ll the CHECK lines make things look worse,
but the function as a whole is actually better (as a MOV is PRE'd).
Differential Revision: https://reviews.llvm.org/D136675
This prepares for an upcoming change to make --print-imm-hex the default
behavior of llvm-objdump. These tests were updated in a semi-automatic
fashion.
See D136972 for details.
In D136659 I found a few tests that write through readonly parameters:
* Analysis/BasicAA/pr18573.ll: @foo1 writes through %arr.ptr, but declares it
readonly. I removed the readonly annotation.
* CodeGen/ARM/ParallelDSP/aliasing.ll: @restrict writes through the readonly
%arg3, @store_alias_arg3_illegal_1 writes through the readonly %arg3, and
@store_alias_arg3_illegal_2 writes through the readonly %arg3. I removed
readonly from all three. Also, I added some CHECK-LABEL directives to make it
harder for FileCheck output to be mixed up.
* Transforms/LoopVectorize/AArch64/sve-gather-scatter.ll:
@gather_nxv4i32_ind64_stride2 writes through the readonly %a. I removed the
readonly attribute.
* Transforms/LoopVectorize/interleaved-accesses.ll: @load_gap_reverse writes
through the readonly %P1 and %P2. Also, the corresponding C code in the comment
didn't match the test. I removed the readonly attribute from both parameters
and corrected the C code.
Differential Revision: https://reviews.llvm.org/D136880
We have some odd redundancy where clang specially handles
the stack size case. If clang prints it, the source location comes first,
followed by "warning". The backend diagnostic, as printed by other tools,
puts "warning" first.
Use DefaultAttrsIntrinsics for most ARM intrinsics. This adds the
WillReturn, NoSync, NoFree and NoCallback attributes and is needed
to avoid regressions in the future.
I've switched to DefaultAttrsIntrinsics for everything doing arithmetic
and load/store. I've left some TODOs in cases where not all of the default
attributes are correct (e.g. ldrex etc. are clearly not nosync) or where it
wasn't entirely obvious to me (e.g. stuff interacting with a coprocessor).
Differential Revision: https://reviews.llvm.org/D136758
Add statistics about how much memory is used, in variables, spills, and
unsafestack.
Issue #58168 describes some of the difficulty diagnosing stack size issues
identified by -Wframe-larger-than. D135488 addresses some of those issues by
giving developers a method to view the stack layout and thereby understand
where and how stack memory is used.
However, that solution requires an additional pass, when a short summary about
how the compiler has allocated stack memory can inform developers about where
they should investigate. When they need the complete context, D135488 can
provide them with a more comprehensive set of diagnostics.
Reviewed By: nickdesaulniers
Differential Revision: https://reviews.llvm.org/D136484
This was previously using the 32-bit variant of the instruction, instead
of the 16-bit one as intended.
Fixes #58512
Differential Revision: https://reviews.llvm.org/D136422
Currently we do not parse the section flag `Info` in asm files. When we emit a section with the Info flag to asm and then compile that asm to an object file, we lose the Info flag for the section.
The motivation for this change is ARM64EC's hybmp$x section. If we lose the Info flag, the MSVC linker will report a warning:
`warning LNK4078: multiple '.hybmp' sections found with different attributes`
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D136125
The main motivation for these additional targets is to cover the
differences in the instructions available between Thumb2 and Thumb1.
This shows up in these tests due to the lack of the following in
Thumb1 (a sketch follows the list):
- Multiply and Subtract instruction (mls) - used when calculating
  a remainder.
- Unsigned Multiply Long instruction (umull) - used in certain
  cases when optimising division by a constant.
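For instance, a remainder computed from an existing quotient is a single
multiply-and-subtract on Thumb2 (hand-written illustration):

  int remainder(int N, int D) {
    int Q = N / D;
    return N - Q * D; // Thumb2: one mls; Thumb1: separate mul and sub
  }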
Differential Revision: https://reviews.llvm.org/D135875
fp16 and bf16 values can be used in GCC's inline assembly using the "w"
constraint, which means "VFP floating-point registers d0-d31" - fp16 and
bf16 values are stored in S registers (which alias the D registers).
This change ensures that LLVM is compatible with GCC for programs that
use fp16 and the 'w' constraint.
Differential Revision: https://reviews.llvm.org/D135662
Much like f16 and f32, we shouldn't try to shrink bf16 to a smaller fp
constant. The code may not be optimal, but this allows us to legalize
bf16 constants under Arm without errors.
fp16 and bf16 values can be used in GCC's inline assembly using the "t"
constraint, which means "VFP floating-point registers s0-s31" - fp16 and
bf16 values are stored in S registers too.
This change ensures that LLVM is compatible with GCC for programs that
use fp16 and the 't' constraint.
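For example, code of this shape now compiles consistently with both
compilers (a hand-written illustration assuming an Arm target with the
FP16 extension):

  __fp16 negateHalf(__fp16 X) {
    __fp16 R;
    // 't' requests an S register (s0-s31), where fp16 values live.
    __asm__("vneg.f16 %0, %1" : "=t"(R) : "t"(X));
    return R;
  }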
Fixes #57753
Differential Revision: https://reviews.llvm.org/D134553
The `CodeGenPrepare` pass can sink a bitwise `and` used by a compare
with zero into the basic blocks where its users are. This operation is
guarded by a lowering hook, which is disabled for ARM. In ARM
architecture versions from v7-M up, these two operations can be folded
into a `tst rN, #imm` instruction. Sinking the `and` can also enable
the cmov-to-bfi DAG combiner.
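In source terms the pattern looks like this (hand-written illustration):

  bool flagClear(unsigned X) {
    // The 'and' result is only compared with zero; once the 'and' is
    // sunk next to the compare, ISel can fold the pair into: tst rN, #16
    return (X & 0x10) == 0;
  }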
This patch fixes some benchmark regressions caused
by https://reviews.llvm.org/D129370 as well as scoring slightly better overall.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D134360
Commit c442698 updated the check lines in this file, but did so in a
way that removed a number of the existing checks, as the
update_llc_test_checks script does not understand all triples. This
fixes it up as needed to keep testing Thumb1 code.
The IR stack protector pass should insert stack checks before tail
calls, not only musttail calls, so that the attributes `sspreq` and
`tail call`, which are emitted by opt, can both be honored by llc.
Reviewed By: compnerd
Differential Revision: https://reviews.llvm.org/D133860
For remainder:
If (1 << (BitWidth / 2)) % Divisor == 1, we can add the high and low halves
together and use a (BitWidth / 2) urem. If (BitWidth / 2) is a legal integer
type, this urem will be expanded by DAGCombiner using multiply by a magic
constant. We do have to take into account that adding the high and low
halves together can produce a carry, making the sum a (BitWidth / 2) + 1
bit number, so we also need to add the carry from the first addition back in.
For division:
We can use the above trick to compute the remainder, subtract that
remainder from the dividend, then multiply by the multiplicative
inverse of the Divisor modulo (1 << BitWidth).
This is based on the section "Remainder by Summing Digits" in
Hacker's Delight.
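A worked 64-bit example with Divisor == 3, where (1 << 32) % 3 == 1
(a hand-written illustration of both tricks):

  #include <cstdint>

  // Remainder: N = Hi*2^32 + Lo and 2^32 % 3 == 1, so
  // N % 3 == (Hi + Lo) % 3. A carry out of Hi + Lo is worth 2^32,
  // i.e. 1 (mod 3), so add it back in.
  uint32_t urem3(uint64_t N) {
    uint32_t Lo = static_cast<uint32_t>(N);
    uint32_t Hi = static_cast<uint32_t>(N >> 32);
    uint32_t Sum = Lo + Hi;
    uint32_t Carry = Sum < Lo ? 1 : 0; // carry from the first addition
    return (Sum + Carry) % 3;          // now a 32-bit urem, expandable
                                       // via a magic-constant multiply
  }

  // Division: subtract the remainder, then multiply by the
  // multiplicative inverse of 3 modulo 2^64.
  uint64_t udiv3(uint64_t N) {
    return (N - urem3(N)) * 0xAAAAAAAAAAAAAAABull;
  }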
The remainder trick is similar to a trick you may have learned for
determining if a decimal number is divisible by 3. You can add all the
digits together and see if the sum is divisible by 3. If you're not sure
if the sum is divisible by 3, you can add its digits together. This
can be repeated until you have a single decimal digit. If that digit
is 3, 6, or 9, then the original number is divisible by 3. This works
because 10 % 3 == 1.
gcc already does this same trick. There are additional tricks gcc
applies to urem, as well as to srem, udiv, and sdiv, that I plan to add
in future patches.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130862
This adds the ExpandLargeDivRem pass to the default pass pipeline.
The limit at which it expands div/rem instructions is configured
via a new TargetTransformInfo hook (default: no expansion).
X86, Arm and AArch64 backends implement this hook to expand div/rem
instructions with more than 128 bits.
Differential Revision: https://reviews.llvm.org/D130076
When the only ADR instruction we have is the 16-bit Thumb one, then all
constant pool entries need to be 4-byte aligned, as tADR has an offset
that's a multiple of 4.
It looks like previously there happened to be no situations in which
we encountered a constant pool entry with alignment less than 4, so
failing to do this didn't cause any problems, but the expansion of
cttz to a table added by D128911 does use a constant pool with
alignment 1, so we now need to handle it correctly.
Differential Revision: https://reviews.llvm.org/D133199
This allows relaxing some relocations to symbol+offset instead of emitting
a relocation against a symbol.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131433
While this does not matter for most targets, when building for Arm Morello,
we have to mark the symbol as a function and add size information, so that
LLD can correctly evaluate relocations against the local symbol.
Since Morello is an out-of-tree target, I tried to reproduce this with
in-tree backends and, with the previous reviews applied, this results in
a noticeable difference when targeting Thumb.
Background: Morello uses a method similar to Thumb, where the encoding mode is
specified in the LSB of the symbol. If we don't mark the target as a
function, the relocation will not have the LSB set and calls will end up
using the wrong encoding mode (which will almost certainly crash).
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D131429
Previously, we were generating zeroes when generating code alignments for AArch64, but now we should omit the value and let the assembler choose to generate nops or zeroes.
Reviewed By: efriedma, MaskRay
Differential Revision: https://reviews.llvm.org/D132508
For the mingw target, if a symbol is not marked DSO local, a `.refptr`
stub is generated for it. This makes CFG check calls use an extra pointer
dereference, which adds extra overhead compared to the MSVC version,
so mark the CFG guard check function pointer DSO local to stop it.
This should have no effect on the MSVC target.
Also adapt the existing cfguard tests to run for mingw targets, so that
this change is checked.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D132331
There are two different senses in which a block can be "address-taken".
There can be a BlockAddress involved, which means we need to map the
IR-level value to some specific block of machine code. Or there can be
constructs inside a function which involve using the address of a basic
block to implement certain kinds of control flow.
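The first sense arises, for example, from GNU C's labels-as-values
extension (hand-written illustration):

  // '&&label' becomes a BlockAddress in IR; the backend must map it to
  // the exact machine basic block carrying that label.
  int dispatch(int I) {
    static void *Targets[] = {&&Zero, &&One}; // GNU extension
    goto *Targets[I];
  Zero:
    return 0;
  One:
    return 1;
  }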
Mixing these together causes a problem: when target-specific passes
mark arbitrary blocks "address-taken" and we have a BlockAddress, we
can't actually tell which MachineBasicBlock corresponds to the
BlockAddress.
So split this into two separate bits: one for BlockAddress, and one for
the machine-specific bits.
Discovered while trying to sort out related stuff on D102817.
Differential Revision: https://reviews.llvm.org/D124697