Commit Graph

68457 Commits

Author SHA1 Message Date
David Green a9e9dd9a3a [AArch64] Add bf16 select handling
A bfloat select operation will currently crash, but is allowed from C.
This adds handling for the operation, turning it into a FCSELHrrr if
fullfp16 is present, or converting it to a FCSELSrrr if not. The
FCSELSrrr is created via using INSERT_SUBREG/EXTRACT_SUBREG to convert
the bf16 to a f32 and using the f32 pattern for FCSELSrrr. (I originally
attempted to do this via a tablegen pattern, but it appears that the
nzcv glue is places onto the wrong node, causing it to be forgotten and
incorrect scheduling to be emitted).

The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not
present, which helps avoid an unnecessary promotion to f32.

Differential Revision: https://reviews.llvm.org/D131253
2022-08-11 14:20:36 +01:00
Simon Pilgrim 5dcf0c342b [X86] lowerShuffleWithVPMOV - remove oneuse constraints on shuffle(trunc(x),undef) -> vpmov(x) lowering
These were added in rG057bdd63 but shuffle combining has gotten a lot better at folding different vector widths since then.
2022-08-11 14:06:42 +01:00
David Stuttard 1d1cc05539 AMDGPU: mbcnt allow for non-zero src1 for known-bits
Src1 for mbcnt can be a non-zero literal or register. Take this into account
when calculating known bits.

Differential Revision: https://reviews.llvm.org/D131478
2022-08-11 13:23:43 +01:00
Yeting Kuo 875694089d [RISCV] Peephole optimization to fold merge.vvm and unmasked intrinsics.
The patch uses peephole method to fold merge.vvm and unmasked intrinsics to
masked intrinsics. Using peephole intead of tablegen patterns is to avoid large
auto gnerated code.

Note: The patch ignores segment loads since I don't know how to test them.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D130442
2022-08-11 17:58:11 +08:00
Jonas Hahnfeld 940733d6a0 [RISCV] Re-enable JIT support
Commit 8922adf646 recently made JITTargetMachineBuilder honor the
hasJIT property of the target. LLVM supports just-in-time compilation
on RISC-V, so set the flag.

Differential Revision: https://reviews.llvm.org/D131617
2022-08-11 11:41:02 +02:00
Fangrui Song 96850003d2 [PowerPC] Change a double Log2 for localentry to integral Log2. NFC 2022-08-10 23:45:13 -07:00
aqjune 02e56e2533 [CodeGen] Generate efficient assembly for freeze(poison) version of `mm*_cast*` intel intrinsics
This patch makes the variants of `mm*_cast*` intel intrinsics that use `shufflevector(freeze(poison), ..)` emit efficient assembly.
(These intrinsics are planned to use `shufflevector(freeze(poison), ..)` after shufflevector's semantics update; relevant thread: D103874)

To do so, this patch

1. Updates `LowerAVXCONCAT_VECTORS` in X86ISelLowering.cpp to recognize `FREEZE(UNDEF)` operand of `CONCAT_VECTOR` in addition to `UNDEF`
2. Updates X86InstrVecCompiler.td to recognize `insert_subvector` of `FREEZE(UNDEF)` vector as its first operand.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D130339
2022-08-11 13:36:21 +09:00
wanglei c437412fbc [LoongArch] Override TargetLowering::isOffsetFoldingLegal()
This patch disable GlobalAddress+Offset folding.

Differential Revision: https://reviews.llvm.org/D131491
2022-08-11 11:26:54 +08:00
jacquesguan 21bf59c92a [RISCV] Add cost model for mask vector extend and truncate instruction.
As extending from or truncating to mask vector do not use the same instructions as the normal cast, this path changed it to 2 which is the number of instructions we used.

Differential Revision: https://reviews.llvm.org/D131552
2022-08-11 10:55:43 +08:00
WANG Xuerui ed078c48f0 [LoongArch] Add insn aliases `jr` and `ret`
Differential Revision: https://reviews.llvm.org/D131512
2022-08-11 10:02:45 +08:00
WANG Xuerui 326f7aed38 [LoongArch] Add codegen support for bitreverse
Differential Revision: https://reviews.llvm.org/D131378
2022-08-11 08:55:14 +08:00
Evgenii Stepanov 8ea1cf3111 Revert "[AMDGPU] SIFixSGPRCopies refactoring"
Breaks ASan tests.

This reverts commit 3f8ae7efa8.
2022-08-10 11:32:46 -07:00
Venkata Ramanaiah Nalamothu 486594119d [AMDGPU] Fix prologue/epilogue markers in .debug_line table for trivial functions
All the prologue instructions should have unknown source location
co-ordinates while the epilogue instructions should have source
location of last non-debug instruction after which epilogue
instructions are insrted.

This ensures the prologue/epilogue markers are generated correctly
in the line table.

Changes are brought in from the downstream CFI patches.

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D131485
2022-08-10 23:00:19 +05:30
Umesh Kalappa 9757f4f2dd [PowerPC] Don't use the S30 and S31 regs for the pic code
These changes to address issue
https://github.com/llvm/llvm-project/issues/55857.

Since R30/S30 is used as pointer (32 bits) for GOT Table in the ppc32 ABI,
remove it from the SPE callee save register when PIC is enabled.

This prevents emitting the SPE load and store for S30 and S31 regs.

Differential revision: https://reviews.llvm.org/D127495
2022-08-10 10:31:27 -05:00
Amaury Séchet 9bceb8981d [X86] (0 - SetCC) | C -> (zext (not SetCC)) * (C + 1) - 1 if we can get a LEA out of it.
This adresses various regression in D131260 , as well as is a useful optimization in itself.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D131358
2022-08-10 15:12:00 +00:00
Justin Hibbits f43b228581 PowerPC: Don't hoist float multiply + add to fused operation on SPE
SPE doesn't have a fmadd instruction, so don't bother hoisting a
multiply and add sequence to this, as it'd become just a library call.
Hoisting happens too late for the CTR usability test to veto using the
CTR in a loop, and results in an assert "Invalid PPC CTR loop!".
2022-08-10 11:04:27 -04:00
Simon Pilgrim b92c7dc211 [X86] Use DAG.getFreeze() to create freeze node. NFC. 2022-08-10 15:03:56 +01:00
David Truby b1b9c39629 [AArch64][SVE] Use SVE for VLS fcopysign for wide vectors
Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D128642
2022-08-10 10:17:19 +00:00
Alex Bradbury 47b1f8362a [RISCV] Implement isUsedByReturnOnly TargetLowering hook in order to tailcall more libcalls
Prior to this patch, libcalls inserted by the SelectionDAG legalizer
could never be tailcalled. The eligibility of libcalls for tail calling
is is partly determined by checking TargetLowering::isInTailCallPosition
and comparing the return type of the libcall and the calleer.
isInTailCallPosition in turn calls TargetLowering::isUsedByReturnOnly
(which always returns false if not implemented by the target).

This patch provides a minimal implementation of
TargetLowering::isUsedByReturnOnly - enough to support tail calling
libcalls on hard float ABIs. Soft-float ABIs are left for a follow on
patch. libcall-tail-calls.ll also shows missed opportunities to tail
call integer libcalls, but this is due to issues outside of
the isUsedByReturnOnly hook.

Differential Revision: https://reviews.llvm.org/D131087
2022-08-10 10:50:29 +01:00
Alex Bradbury 7e7860c5d7 [X86][NFCI] Remove target-specific branch optimisation that's handled in BranchFolding
This specific optimisation is handled in OptimizeBlock in BranchFolding
so is redundant. As discussed on the review thread, I've verified that
we have test coverage for that optimisation within test/CodeGen/X86 by
disabling the BranchFolding version of this transform after applying
this patch and rerunning the test suite.

Differential Revision: https://reviews.llvm.org/D129204
2022-08-10 10:35:31 +01:00
Alex Bradbury 104a24ec8b [WebAssembly] Produce error when encountering unlowerable Wasm global accesses
WebAssembly globals are represented as IR globals with the wasm_var
address space (AS1). Prior to this patch, a wasm global load that isn't
lowerable will produce a failure to select, while a wasm global store
will produced incorrect code. This patch ensures we consistently produce
a clear error.

As noted in the test cases, it's conceivable that a frontend or an
optimisation pass could produce similar IR even in the presence of the
semantic restrictions on pointers to Wasm globals in the frontend, which
is a separate problem to address.

Differential Revision: https://reviews.llvm.org/D131387
2022-08-10 10:34:10 +01:00
jacquesguan b6b1c0d1c4 [RISCV] Add cost model for fp-mask cast op.
The cost of convert from or to mask vector is different from other cases. We could not use PowDiff to calculate it. This patch set it to 3 as we use 3 instruction to make it.

Differential Revision: https://reviews.llvm.org/D131149
2022-08-10 17:14:37 +08:00
Phoebe Wang c7ec6e19d5 [X86][BF16] Make backend type bf16 to follow the psABI
X86 psABI has updated to support __bf16 type, the ABI of which is the
same as FP16. See https://discourse.llvm.org/t/patch-add-optional-bfloat16-support/63149

This is an alternative of D129858, which has less code modification and
supports the vector type as well.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D130832
2022-08-10 08:58:56 +08:00
alex-t 3f8ae7efa8 [AMDGPU] SIFixSGPRCopies refactoring
This change finalizes the series of patches aiming to replace old
strategy of VGPR to SGPR copies loweriong.  Following the
https://reviews.llvm.org/D128252 and https://reviews.llvm.org/D130367 code
parts that are no longer used were removed.  Pass main loop is no longer used
for the MIR changes but collect information for further analysis.  Actual MIR
lowering happens further according the analysys result in the set of separate
functions. Another important change concerns the order of lowering: VGPR to
SGPR copies lowering is done first to have priority on the rest of the MIR
changes.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D131246
2022-08-10 00:51:57 +02:00
Pengxuan Zheng 9bb6622423 [ARM] Do not use LOAD_STACK_GUARD with ROPI/RWPI
ROPI/RWPI are not supported with LOAD_STACK_GUARD currently.

Reviewed By: nickdesaulniers, rengolin

Differential Revision: https://reviews.llvm.org/D131427
2022-08-09 14:59:08 -07:00
Dinar Temirbulatov cab6cd6834 [AArch64][LoopVectorize] Introduce trip count minimal value threshold to ignore tail-folding.
After D121595 was commited, I noticed regressions assosicated with small trip
count numbersvectorisation by tail folding with scalable vectors. As a solution
for those issues I propose to introduce the minimal trip count threshold value.

  Differential Revision: https://reviews.llvm.org/D130755
2022-08-09 22:10:17 +01:00
Archibald Elliott b20fe2c25b [docs][AArch64] Label Features with Arm ARM Names
This patch adds the names of the Arm Architecture Reference Manual (ARM)
features to the corresponding Subtarget Features in the AArch64 backend
and target parser.

The aim of this is to make it clearer what architectural features a
subtarget feature might enable (so, which features a CPU must provide to
support that subtarget feature), and so make it easier to add new CPUs
in the future.

Differential Revision: https://reviews.llvm.org/D131257
2022-08-09 18:45:50 +01:00
Yaxun (Sam) Liu e780648a15 [AMDGPU] Unify unreachable intrinsics
si-annotate-control-flow does depth first traversal of BB's of
a function to insert amdgcn if intrinsics for conditional
branches so that isel can generate correct instructions later.

si-annotate-control-flow checks whether the successor BB for the 'else'
branch of a conditional branch has been visited. If it has been
visited, si-annotate-control-flow assumes the conditional
branch has been handled and will not try to insert if intrinsic
for it.

This assumption is not correct when the IR contains multiple
unreachable BB's. Then 'if' intrinscs are not inserted and incorrect
ISA are generated.

This patch fixes the issue by let amdgpu-unify-divergent-exit-nodes
unify unreachables even if they are uniformly reached. In this way
the IR will not contain multiple exits, and structurizer is able to
structurize the IR containing one unified exit.

Reviewed by: Ruiling Song, Matt Arsenault

Differential Revision: https://reviews.llvm.org/D131181

Fixes: SWDEV-343244
2022-08-09 10:23:32 -04:00
Nikita Popov f5ed0cb217 [RISCV] Add target feature to force-enable atomics
This adds a +forced-atomics target feature with the same semantics
as +atomics-32 on ARM (D130480). For RISCV targets without the +a
extension, this forces LLVM to assume that lock-free atomics
(up to 32/64 bits for riscv32/64 respectively) are available.

This means that atomic load/store are lowered to a simple load/store
(and fence as necessary), as these are guaranteed to be atomic
(as long as they're aligned). Atomic RMW/CAS are lowered to __sync
(rather than __atomic) libcalls. Responsibility for providing the
__sync libcalls lies with the user (for privileged single-core code
they can be implemented by disabling interrupts). Code using
+forced-atomics and -forced-atomics are not ABI compatible if atomic
variables cross the ABI boundary.

For context, the difference between __sync and __atomic is that the
former are required to be lock-free, while the latter requires a
shared global lock provided by a shared object library. See
https://llvm.org/docs/Atomics.html#libcalls-atomic for a detailed
discussion on the topic.

This target feature will be used by Rust's riscv32i target family
to support the use of atomic load/store without atomic RMW/CAS.

Differential Revision: https://reviews.llvm.org/D130621
2022-08-09 16:04:46 +02:00
gonglingqin cf75ef460c [LoongArch] Add codegen support for ISD::ROTL and ISD::ROTR
Differential Revision: https://reviews.llvm.org/D131231
2022-08-09 19:39:17 +08:00
WANG Xuerui 7d48a9e1ae [LoongArch] Support register-register-addressed GPR loads/stores
Differential Revision: https://reviews.llvm.org/D131380
2022-08-09 19:13:36 +08:00
Alex Richardson 6db15a82cc [ARM] Use getSymbolPreferLocal() in GetARMGVSymbol
This allows relaxing some relocations to STT_SECTION symbol+offset
instead of emitting a relocation against a symbol.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D131433
2022-08-09 09:53:47 +00:00
Alex Richardson 9a2b14afa0 [ARM] Emit local aliases (.Lfoo$local) for functions
ARMAsmPrinter::emitFunctionEntryLabel() was not calling the base class
function so the $local alias was not being emitted. This should not have
any function effect right now since ARM does not generate different code
for the $local symbols, but it could be improved in the future.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D131392
2022-08-09 09:53:47 +00:00
gonglingqin d3580c2eb6 [LoongArch] Add codegen support for not
Differential Revision: https://reviews.llvm.org/D131384
2022-08-09 14:05:09 +08:00
wanglei 8716513e65 [LoongArch] Implement branch analysis
This allows a number of optimisation passes to work.
E.g. BranchFolding and MachineBlockPlacement.

Differential Revision: https://reviews.llvm.org/D131316
2022-08-09 14:03:09 +08:00
WANG Xuerui f35cb7ba34 [LoongArch] Add codegen support for bswap
Differential Revision: https://reviews.llvm.org/D131352
2022-08-09 13:42:03 +08:00
Luo, Yuanke aaf6c7b05c [globalisel] Select register bank for DBG_VALUE
The register operand of DBG_VALUE is not selected to a proper register
bank in both AArch64 and X86. This would cause getRegClass crash after
global ISel. After discussion, we think the MIR should assume all
vritual register should be set proper register class after global ISel,
so this patch is to fix the gap of DBG_VALUE for AArch64 and X86.

Differential Revision: https://reviews.llvm.org/D129037
2022-08-09 13:11:51 +08:00
Chen Zheng 22e475f5ac [NFC] fix warning 2022-08-09 00:22:01 -04:00
Yuta Mukai 3f561996bf [AArch64] Fix and add A64FX scheduling resource/latency info
1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR)
   is added and CompleteModel is set to one.

2. Information for pseudo SVE instructions is added. Those
   instructions are present at the time of scheduling.

3. Resource and latency information for SVE instructions is modified
   to be more accurate.
   For example, the description for CMPEQ, which consumes one cycle
   each of unit FLA and PPR, is as follows.
```
Previous:
  def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>;
  def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {...
Modified:
  def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>;
  def A64FXGI1 : ProcResGroup<[A64FXIPPR]>;
  def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {...
```

Reference: A64FX Microarchitecture Manual (Table 16-3)
https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf

Reviewed By: dmgreen, kawashima-fj

Differential Revision: https://reviews.llvm.org/D131165
2022-08-09 10:53:40 +09:00
Chen Zheng d9004dfbab [PowerPC] mapping hardward loop intrinsics to powerpc pseudo
Map hardware loop intrinsics loop_decrement and set_loop_iteration
to the new PowerPC pseudo instructions, so that the hardware loop
intrinsics will be expanded to normal cmp+branch form or ctrloop
form based on the CTR register usage on MIR level.

Reviewed By: lkail

Differential Revision: https://reviews.llvm.org/D123366
2022-08-08 21:34:20 -04:00
Fangrui Song de9d80c1c5 [llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC
With C++17 there is no Clang pedantic warning or MSVC C5051.
2022-08-08 11:24:15 -07:00
Craig Topper e2bfbed2bb [RISCV] Add ReadFStoreData as a SchedRead.
The floating point stores use a different register class, it
probably makes sense to have a different SchedRead.

Reviewed By: monkchiang

Differential Revision: https://reviews.llvm.org/D131379
2022-08-08 09:33:19 -07:00
Simon Pilgrim 9ea54ac9ce [X86] X86ISelDAGToDAG.cpp - use auto for all values derived from cast/dyn_cast (style). NFC. 2022-08-08 14:35:06 +01:00
Simon Tatham 72017e9b16 [llvm-objdump,ARM] Fix big-endian AArch32 disassembly.
The ABI for big-endian AArch32, as specified by AAELF32, is above-
averagely complicated. Relocatable object files are expected to store
instruction encodings in byte order matching the ELF file's endianness
(so, big-endian for a BE ELF file). But executable images can
//either// do that //or// store instructions little-endian regardless
of data and ELF endianness (to support BE32 and BE8 platforms
respectively). They signal the latter by setting the EF_ARM_BE8 flag
in the ELF header.

(In the case of the Thumb instruction set, this all means that each
16-bit halfword of a Thumb instruction is stored in one or other
endianness. The two halfwords of a 32-bit Thumb instruction must
appear in the same order no matter what, because the first halfword is
the one that must avoid overlapping the encoding of any 16-bit Thumb
instruction.)

llvm-objdump was unconditionally expecting Arm instructions to be
stored little-endian. So it would correctly disassemble a BE8 image,
but if you gave it a BE32 image or a BE object file, it would retrieve
every instruction in byte-swapped form and disassemble it to
nonsense. (Even an object file output by LLVM itself, because
ARMMCCodeEmitter outputs instructions big-endian in big-endian mode,
which is correct for writing an object file.)

This patch allows llvm-objdump to correctly disassemble all three of
those classes of Arm ELF file. It does it by introducing a new
SubtargetFeature for big-endian instructions, setting it from the ELF
image type and flags during llvm-objdump setup, and teaching both
ARMDisassembler and llvm-objdump itself to pay attention to it when
retrieving instruction data from a section being disassembled.

Differential Revision: https://reviews.llvm.org/D130902
2022-08-08 10:49:51 +01:00
Cullen Rhodes a6dec9f5b2 [AArch64][SVE] Add patterns to select masked FP arith
Add patterns to select predicated instructions when lowering:

  fadd(a, select(mask, b, splat(0)))
  fsub(a, select(mask, b, splat(0)))

'fadd' is unsafe unless no-signed zeros fast-math flag is set, since

  -0.0 + 0.0 = 0.0

changes the sign. Alive2: https://alive2.llvm.org/ce/z/wbhJh_

Also adds FMA patterns for:

  fadd(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c)
  fsub(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c)

These patterns require the 'contract' fast-math flag to be set, and the
fadd 'nsz' as above.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D130564
2022-08-08 08:44:13 +00:00
Kazu Hirata e20d210eef [llvm] Qualify auto (NFC)
Identified with readability-qualified-auto.
2022-08-07 23:55:27 -07:00
wanglei 0c2b738f8f [LoongArch] Support for varargs
This patch ensures the `$fp` always points to the bottom of the vararg
spill region.
Includes support for expand `ISD::DYNAMIC_STACKALLOC`.

Differential Revision: https://reviews.llvm.org/D130250
2022-08-08 14:01:24 +08:00
Sheng 64d326c33c [M68k] Add MC support for link/unlk
Reviewers: myhsu

Differential Revision: https://reviews.llvm.org/D125444
2022-08-08 11:00:11 +08:00
Kazu Hirata ba0407ba86 [llvm] Use range-based for loops (NFC) 2022-08-07 00:16:21 -07:00
Kazu Hirata 54199d805a [x86] Remove unused declaration processWaitCnt (NFC)
The declaration was introduced without a corresponding definition on
Jan 2, 2022 in commit 85e6e748d4.
2022-08-07 00:16:19 -07:00
Kazu Hirata d0ec61c9ff [Target] Remove unused forward declarations (NFC) 2022-08-07 00:16:16 -07:00
Kazu Hirata a2d4501718 [llvm] Fix comment typos (NFC) 2022-08-07 00:16:14 -07:00
Krzysztof Parzyszek 2bc390bdd6 [RDF] Use default TargetOperandInfo if not given in constructor
All current in-tree users use the default implementation.
2022-08-06 14:32:52 -05:00
Chen Zheng ef60e44fe8 [PowerPC] fix stack size allocated for float point argument
This is for https://github.com/llvm/llvm-project/issues/56469

Allocate 4 bytes for float point arguments on PPC32.

Reviewed By: nemanjai

Differential Revision: https://reviews.llvm.org/D129558
2022-08-06 08:38:52 -04:00
Markus Böck f7b73b7e8e [llvm] Remove uses of deprecated `std::iterator`
std::iterator has been deprecated in C++17 and some standard library implementations such as MS STL or libc++ emit deperecation messages when using the class.
Since LLVM has now switched to C++17 these will emit warnings on these implementations, or worse, errors in build configurations using -Werror.

This patch fixes these issues by replacing them with LLVMs own llvm::iterator_facade_base which offers a superset of functionality of std::iterator.

Differential Revision: https://reviews.llvm.org/D131320
2022-08-06 14:07:37 +02:00
Leon Clark 6a275cd53c Transform illegal intrinsics to V_ILLEGAL
Related tasks:

- SWDEV-240194
- SWDEV-309417
- SWDEV-334876

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D123693
2022-08-06 08:59:00 +01:00
Craig Topper 75c64c7c4e [RISCV] Don't use li+sh3add for constants that can use lui+add.
If we're adding a constant that can't use addi we try a few tricks,
one of which is using li+sh3add. We should not do this if lui+add
would work. For example adding 8192. Using sh3add prevents folding
a sext.w to form addw, thus increasing instruction count.
2022-08-05 12:47:03 -07:00
Philip Reames 9a9848f4b9 [RISCVInsertVSETVLI] Remove an unsound optimization
This fixes a bug reported privately by @craig.topper. Here's an example which illustrates the problem:

vsetivli a1, a0, e32, m1, ta, mu # both DefInfo and PrevInfo
vsetivli a2, a1, e32, m4, ta, mu

With the unsound result being:

vsetivli a1, a0, e32, m1, ta, mu
vsetivli a2, a0, e32, m4, ta, mu

Consider the case where this is running on a machine with VLEN=512,. For this case, the VLMAXs are 16 and 64 respectively.

Consider for a0 = 33. The correct result is: a1 = 16, and a2 = 16

After the unsound optimization: a1 = 16 and a2 = 33

This particular example used VLMAXs which differed by more than a power of two. With a difference of only one power of two, there's another form of this bug which involves the AVL < 2 x VLMAX special case, but that ones more complicated to construct as many examples turn out accidentally sound.

This patch takes the approach of simply removing the unsound optimization, but there are multiple sound sub-cases of it. I plan to return to at least a couple of them, but figured it was cleaner to remove the unsound optimization (for ease of backporting), and then review the new optimizations on their own.

Differential Revision: https://reviews.llvm.org/D131264
2022-08-05 12:13:08 -07:00
Paul Walker 0533c39a76 [SVE] Expand DUPM patterns to handle all integer vector types.
NOTE: i8 vector splats are ignored because the immediate range of
DUP already has full coverage.

Differential Revision: https://reviews.llvm.org/D131078
2022-08-05 16:00:08 +00:00
Mirko Brkusanin 19bb535ed9 [AMDGPU] Remove unused MIMG tablegen variants
There are no AMDGPUSampleVariant versions for _G16, it is treated more like a
modifier for derivatives (_D) (also for intrinsics where it is overloaded type
instead of part of instrinsic name) so we ended up making more variants for
these instruction then we actually needed.

32-bit derivatives need 6 dwords at most, while 16-bit need 4 at most. Using
same AMDGPUSampleVariant for both, we ended up creating 2 extra variants per
instruction than were necessary.

In total this deletes 260 unused tablegen records.

Differential Revision: https://reviews.llvm.org/D131252
2022-08-05 15:30:47 +02:00
Dawid Jurczak 1bd31a6898 [NFC] Add SmallVector constructor to allow creation of SmallVector<T> from ArrayRef of items convertible to type T
Extracted from https://reviews.llvm.org/D129781 and address comment:
https://reviews.llvm.org/D129781#3655571

Differential Revision: https://reviews.llvm.org/D130268
2022-08-05 13:35:41 +02:00
Phoebe Wang 2312b747b8 [X86] Move getting module flag into `runOnMachineFunction` to reduce compile-time. NFCI
Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D131245
2022-08-05 01:58:17 -07:00
wanglei 57eb77d411 [LoongArch] Implement more of the ABI
According to the description of the LoongArch abi documentation,
(https://loongson.github.io/LoongArch-Documentation/LoongArch-ELF-ABI-EN.html)
the calling convention of LoongArch is almost the same as the RISCV's
(except for the vector part), so we borrow the implementation of RISCV.

This patch only guarantees the correctness of lp64d, because only the
part of lp64d is described in detail in the documentation.

Differential Revision: https://reviews.llvm.org/D130249
2022-08-05 15:14:16 +08:00
David Green 38c2366b3f [AArch64][GlobalISel] Recognise some CCMPri
This is a simple addition to emitConditionalComparison, to match CCMP
with immediates using getIConstantVRegValWithLookThrough, letting it
select the CCMPri variants of the instructions.

Differential Revision: https://reviews.llvm.org/D131073
2022-08-05 07:48:42 +01:00
Phoebe Wang 7f648d27a8 Reland "[X86][MC] Always emit `rep` prefix for `bsf`"
`BMI` new instruction `tzcnt` has better performance than `bsf` on new
processors. Its encoding has a mandatory prefix '0xf3' compared to
`bsf`. If we force emit `rep` prefix for `bsf`, we will gain better
performance when the same code run on new processors.

GCC has already done this way: https://c.godbolt.org/z/6xere6fs1

Fixes #34191

Reviewed By: craig.topper, skan

Differential Revision: https://reviews.llvm.org/D130956
2022-08-05 10:22:48 +08:00
Craig Topper 12a1ca9c42 [RISCV] Relax another one use restriction in performSRACombine.
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), i32), C)
it's possible that the add is used by multiple sras. We should
allow the combine if all the SRAs will eventually be updated.

After transforming all of the sras, the shls will share a single
(sext_inreg (add X, C1), i32).

This pattern occurs if an sra with 32 is used as index in multiple
GEPs with different scales. The shl from the GEPs will be combined
with the sra before we get a chance to match the sra pattern.
2022-08-04 14:32:31 -07:00
Mingming Liu bc8f2f3649 [AArch64][TTI][NFC] Overload method 'getVectorInstrCost' to provide vector instruction itself, as a context information for cost estimation.
1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method.
2) This patch also changes a few callsites (VectorCombine.cpp,
   SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method.
3) This is a split of D128302.

Differential Revision: https://reviews.llvm.org/D131114
2022-08-04 12:58:25 -07:00
Craig Topper a2de12c987 [RISCV] Relax a one use restriction performSRACombine
When folding (sra (add (shl X, 32), C1), 32 - C) -> (shl (sext_inreg (add X, C1), C)
ignore the use count on the (shl X, 32).

The sext_inreg after the transform is free. So we're only making
2 new instructions, the add and the shl. So we only need to be
concerned with replacing the original sra+add. The original shl
can have other uses. This helps if there are multiple different
constants being added to the same shl.
2022-08-04 11:25:08 -07:00
Joshua Cranmer 2138c90645 [IR] Move support for dxil::TypedPointerType to LLVM core IR.
This allows the construct to be shared between different backends. However, it
still remains illegal to use TypedPointerType in LLVM IR--the type is intended
to remain an auxiliary type, not a real LLVM type. So no support is provided for
LLVM-C, nor bitcode, nor LLVM assembly (besides the bare minimum needed to make
Type->dump() work properly).

Reviewed By: beanz, nikic, aeubanks

Differential Revision: https://reviews.llvm.org/D130592
2022-08-04 10:41:11 -04:00
jacquesguan b61cfc91ea [RISCV] Add cost modelling for vector widenning reduction.
In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice of the input vector's SEW-width. In this situation, the cost of extended add reduction should be same as single-width add reduction. So as the vector float widenning reduction.

Differential Revision: https://reviews.llvm.org/D129994
2022-08-04 15:31:31 +08:00
Phoebe Wang 6f867f9102 [X86] Support ``-mindirect-branch-cs-prefix`` for call and jmp to indirect thunk
This is to address feature request from https://github.com/ClangBuiltLinux/linux/issues/1665

Reviewed By: nickdesaulniers, MaskRay

Differential Revision: https://reviews.llvm.org/D130754
2022-08-04 15:12:15 +08:00
Thomas Lively b19de814ad [WebAssembly] Improve codegen for v128.bitselect
Add patterns selecting ((v1 ^ v2) & c) ^ v2 and ((v1 ^ v2) & ~c) ^ v2 to
v128.bitselect.

Resolves #56827.

Reviewed By: aheejin

Differential Revision: https://reviews.llvm.org/D131131
2022-08-03 23:28:37 -07:00
Craig Topper 91e8079cd5 [X86] Teach PostprocessISelDAG to fold ANDrm+TESTrr when chain result is used.
The isOnlyUserOf prevented the fold if the chain result had any
users. What we really care about is the the data result from the
AND is only used by the TEST, and the flags results from the ANDs
aren't used at all. It's ok if the chain has users, we just need
to replace those users with the chain from the TESTrm.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D131117
2022-08-03 21:00:22 -07:00
Craig Topper 53d560b22f [RISCV] Prevent infinite loop after D129980.
D129980 converts (seteq (i64 (and X, 0xffffffff)), C1) into
(seteq (i64 (sext_inreg X, i32)), C1). If bit 31 of X is 0, it
will be turned back into an 'and' by SimplifyDemandedBits which
can cause an infinite loop.

To prevent this, check if bit 31 is 0 with computeKnownBits before
doing the transformation.

Fixes PR56905.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D131113
2022-08-03 15:19:07 -07:00
Craig Topper 84e9194828 Revert "[X86][MC] Always emit `rep` prefix for `bsf`"
This reverts commit c2066d19cd.

It's causing failures on the build bots.
2022-08-03 14:51:34 -07:00
Chris Bieneman ce0bb316eb [DX] [NFC] Move hasSection check up
Juming out earlier if the global doesn't have a section is just a
cleaner early out.
2022-08-03 15:54:53 -05:00
Craig Topper ff91b2d9df [X86] Promote i16 CTTZ/CTTZ_ZERO_UNDEF always.
If we're going to emit a rep prefix before bsf as proposed in
D130956, it makes sense to promote i16 operations to i32 to avoid
the false depedency of tzcntw.

Reviewed By: skan, pengfei

Differential Revision: https://reviews.llvm.org/D130995
2022-08-03 13:12:20 -07:00
David Truby 9a976f3661 [llvm] Always use TargetConstant for FP_ROUND ISD Nodes
This patch ensures consistency in the construction of FP_ROUND nodes
such that they always use ISD::TargetConstant instead of ISD::Constant.

This additionally fixes a bug in the AArch64 SVE backend where patterns
were matching against TargetConstant nodes and sometimes failing when
passed a Constant node.

Reviewed By: paulwalker-arm

Differential Revision: https://reviews.llvm.org/D130370
2022-08-03 14:02:11 +01:00
Alex Bradbury 28f12a09ae [RISCV] Teach ComputeNumSignBitsForTargetNode about masked atomic intrinsics
An unnecessary sext.w is generated when masking the result of the
riscv_masked_cmpxchg_i64 intrinsic. Implementing handling of the
intrinsic in ComputeNumSignBitsForTargetNode allows it to be removed.

Although this isn't a particularly important optimisation, removing the
sext.w simplifies implementation of an additional cmpxchg-related
optimisation in D130192.

Although I can't produce a test with different codegen for the other
atomics intrinsics, these are added as well for completeness.

Differential Revision: https://reviews.llvm.org/D130191
2022-08-03 13:41:58 +01:00
Dmitry Preobrazhensky 05b3aadfff [AMDGPU][MC][GFX11] Correct v_dot2_f16_f16 and v_dot2_bf16_bf16
Enable SGPRs for the following operands of these opcodes:

- src operands of VOP3 variant.
- src2 operand of DPP variants.

Differential Revision: https://reviews.llvm.org/D130989
2022-08-03 15:08:23 +03:00
Dmitry Preobrazhensky ae553f9e49 [AMDGPU][MC][GFX10] Correct encoding of VOP3 v_cmpx* opcodes
Encode dst=EXEC but allow disassembler accept any dst value.

Differential Revision: https://reviews.llvm.org/D130978
2022-08-03 15:03:44 +03:00
Fraser Cormack 646e2f4803 [VP] Rename VP int<->float conversion ISD opcodes
These should be named like the non-VP versions for consistency.

Reviewed By: reames

Differential Revision: https://reviews.llvm.org/D130967
2022-08-03 10:04:38 +01:00
Phoebe Wang c2066d19cd [X86][MC] Always emit `rep` prefix for `bsf`
`BMI` new instruction `tzcnt` has better performance than `bsf` on new
processors. Its encoding has a mandatory prefix '0xf3' compared to
`bsf`. If we force emit `rep` prefix for `bsf`, we will gain better
performance when the same code run on new processors.

GCC has already done this way: https://c.godbolt.org/z/6xere6fs1

Fixes #34191

Reviewed By: skan

Differential Revision: https://reviews.llvm.org/D130956
2022-08-03 17:09:36 +08:00
Liu, Chen3 5bbb0a831f [X86] Using `X86MemOperand` instead of `Operand` for `i32mem_TC` and `i64mem_TC`
To fix build fail when X86_GEN_FOLD_TABLES is enabled.

Differential Revision: https://reviews.llvm.org/D131049
2022-08-03 16:17:51 +08:00
Nikita Popov b128e057c1 [AA] Make ModRefInfo a bitmask enum (NFC)
Mark ModRefInfo as a bitmask enum, which allows using normal
& and | operators on it. This supersedes various functions like
unionModRef() and intersectModRef(). I think this makes the code
cleaner than going through helper functions...

Differential Revision: https://reviews.llvm.org/D130870
2022-08-03 10:05:55 +02:00
Craig Topper f19497f7b0 [RISCV] Use InstVisitor in RISCVCodeGenPrepare. NFC
Makes it easy to add new instructions to look at without dispatching
manually.
2022-08-02 21:19:30 -07:00
Paul Kirth d434e40f39 [llvm][NFC] Refactor code to use ProfDataUtils
In this patch we replace common code patterns with the use of utility
functions for dealing with profiling metadata. There should be no change
in functionality, as the existing checks should be preserved in all
cases.

Reviewed By: bogner, davidxl

Differential Revision: https://reviews.llvm.org/D128860
2022-08-03 00:09:45 +00:00
Austin Kerbow 3dfa562643 [AMDGPU] Add CL option for max-ilp scheduler.
When compiling for multiple targets the scheduler that is selected via the
-misched option is applied globally. This patch adds a target CL option instead.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D131022
2022-08-02 16:52:14 -07:00
Craig Topper a5605f1f68 [RISCV] Fix operand number in debug message in RISCVMergeBaseOffset.
This used to print from the ADDI where the operand number was
correct. It recently changed to print from the LUI or AUIPC which
needs to use operand 1 instead of 2.

This shows up as a crash with -debug.
2022-08-02 15:27:23 -07:00
Austin Kerbow 40eec27618 [AMDGPU] Add llvm_unreachable to switch statement added in d7100b398. 2022-08-02 13:45:38 -07:00
Austin Kerbow d7100b398b [AMDGPU] Add GCNMaxILPSchedStrategy
Creates a new scheduling strategy that attempts to maximize ILP for a single
wave.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D130869
2022-08-02 13:21:24 -07:00
Xiang Li 20f7f9b709 [NFC][DirectX backend] Fix crash when emit_obj for DirectX backend.
When emit-obj from clang directly, DirectX backend will hit assert caused by not initialize passes for AsmPrinter.
The fix will initialize the passes by calling createPassConfig.
Also ignore global variable which not has section in DXILAsmPrinter::emitGlobalVariable to avoid hit llvm_unreachable in DXILTargetObjectFile::SelectSectionForGlobal.

Reviewed By: beanz

Differential Revision: https://reviews.llvm.org/D130856
2022-08-02 12:09:07 -07:00
Vladislav Dzhidzhoev 71aecbb75c [AArch64] Treat x18 as callee-saved in functions with Windows calling convention on Darwin
rGcf97e0ec42b8 makes $x18 to be treated as callee-saved in functions with
Windows calling convention on non-Windows OSes.

Here we mark $x18 as callee-saved for functions with Windows calling
convention on Darwin, as well as on other non-Windows platforms, in
order to prevent some miscompilations (like miscompilation of
win64cc-darwin-backup-x18.ll).

Since getCalleeSavedRegs doesn't return x18 in list of callee-saved
registers, assignCalleeSavedSpillSlots and determineCalleeSaves
consider different sets of registers as callee-saved. It causes an
error:
```
Assertion failed: ((!HasCalleeSavedStackSize || getCalleeSavedStackSize() == Size) && "Invalid size calculated for callee saves"), function getCalleeSavedStackSize, file
AArch64MachineFunctionInfo.h, line 292.
```

Differential Revision: https://reviews.llvm.org/D130676
2022-08-02 20:33:42 +03:00
Guozhi Wei 85a6dd50ad [MIPS] Expose the ZERO register as a constant physical register
The ZERO register should be exposed as a constant physical register through the interface TargetRegisterInfo::isConstantPhysReg.

Differential Revision: https://reviews.llvm.org/D130932
2022-08-02 17:04:52 +00:00
Craig Topper ae6877836e [RISCV] Add scheduler classes to PseudoVMV*R_V.
I think these pseudos will exist when the post-RA scheduler runs
so they should have sched classes.

Reviewed By: monkchiang

Differential Revision: https://reviews.llvm.org/D130945
2022-08-02 09:38:32 -07:00
Craig Topper 2e5c516a3d [RISCV] Add scheduler class to PseudoReadVLENB.
Reviewed By: monkchiang

Differential Revision: https://reviews.llvm.org/D130938
2022-08-02 09:38:32 -07:00
Alexander Timofeev a321d95b59 [AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs
In the 2e29b0138c we introduce a specific solving algorithm
that analyzes the VGPR to SGPR copies use chains and either lowers
the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms.
Same time we still have the code that blindly converts to VALU REG_SEQUENCE and PHIs
in case they produce SGPR but have VGPRs input operands. In case the REG_SEQUENCE and PHIs
are in the VGPR to SGPR copy use chain, and this chain was considered long enough to convert
copy to v_readfistlane_b32, further lowering them to VALU leads to several kinds of issues.
At first, we have v_readfistlane_b32 which is completely useless because most parts of its use chain
were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF
because of the weird mixing of SALU and VALU instructions.
This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact
that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define SGPR but have VGPR inputs,
we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common
VGPR to SGPR lowering algorithm.
This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D130367
2022-08-02 18:37:57 +02:00
Jay Foad e301e071ba [AMDGPU] Remove IR SpeculativeExecution pass from codegen pipeline
This pass seems to have very little effect because all it does is hoist
some instructions, but it is followed later in the codegen pipeline by
the IR CodeSinking pass which does the opposite.

Differential Revision: https://reviews.llvm.org/D130258
2022-08-02 17:35:20 +01:00
Jay Foad c24d68fff1 [AMDGPU] Take advantage of VOP3 literals in convertToThreeAddress
This improves a corner case where v_fmac can be converted to v_fma on
GFX10+ even if it has a literal operand.

Differential Revision: https://reviews.llvm.org/D130992
2022-08-02 17:27:11 +01:00
Phoebe Wang 23021d4d8c [X86][FP16] Fix vector_shuffle and lowering without f16c feature problems
The problem Alexander reported on D127982 was caused by an optimization
for AVX512-FP16 instruction. We must limit it to the feature enabled only.

During the investigation, I found we didn't expand for fp_round/fp_extend
without F16C. This may result runtime crash, so change them too.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D130817
2022-08-02 22:26:41 +08:00