Commit Graph

63852 Commits

Author SHA1 Message Date
David Green ad8e75caa2 [ARM] Fix for matching reductions that are both sext and zext.
Fix a silly mistake that was not making sure that _both_ operands were
the correct extend code.
2021-07-16 23:11:42 +01:00
Nemanja Ivanovic 35a18a981f [PowerPC] Implement intrinsics for mtfsf[i]
This provides intrinsics for emitting instructions that set the FPSCR (`mtfsf/mtfsfi`).

The patch also conservatively marks the rounding mode as an implicit def for both since they both may set the rounding mode depending on the operands.

Reviewed By: #powerpc, qiucf

Differential Revision: https://reviews.llvm.org/D105957
2021-07-16 16:26:11 -05:00
Jon Roelofs 15267595fd [RISCV] Compose vector subregs hierarchically
This fixes the test I broke in: https://reviews.llvm.org/D105953#2883579

Differential revision: https://reviews.llvm.org/D106168
2021-07-16 12:32:13 -07:00
Simon Pilgrim d2458bcdc6 [X86][SSE] combineX86ShufflesRecursively - bail if constant folding fails due to oneuse limits.
Fixes an issue reported on D105827 where a single shuffle of a constant (with multiple uses) was caught in an infinite loop: one shuffle (UNPCKL) used an undef arg, but then that got recombined to SHUFPS because the constant value had its own undef that confused matching.
2021-07-16 19:21:46 +01:00
Lei Huang c8937b6cb9 [PowerPC] Implement XL compact math builtins
Implement a subset of builtins required for compatibility with the AIX XL compiler.

Reviewed By: nemanjai

Differential Revision: https://reviews.llvm.org/D105930
2021-07-16 13:21:13 -05:00
Craig Topper 8f0343cc9c [RISCV] Use tail agnostic policy for fixed vector vwmacc(u).
This adds new pseudoinstructions with ForceTailAgnostic set. This
matches what we did for non-widening VMACC. We should move to a
tail policy operand on the pseudos when we expand the intrinsic
interface to include the tail policy.
2021-07-16 10:41:09 -07:00
Craig Topper d634ec8d29 [RISCV] Refactor where in the multiclass hierarchy we add commutable VFMADD/VFMACC instructions. NFC
I'm preparing to add tail agnostic versions of VWMACC and VFWMACC
so this will make them more consistent.
2021-07-16 10:41:09 -07:00
Guozhi Wei 5609c8b607 [X86FixupLEAs] Try again to transform the sequence LEA/SUB to SUB/SUB
This patch transforms the sequence
    lea (reg1, reg2), reg3
    sub reg3, reg4
to two sub instructions
    sub reg1, reg4
    sub reg2, reg4
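
(A quick standalone check, not part of the patch, of the arithmetic identity the rewrite relies on; wrapping unsigned arithmetic stands in for two's-complement registers.)

```
#include <cassert>
#include <cstdint>

// reg4 - (reg1 + reg2) == (reg4 - reg1) - reg2 under wrapping arithmetic,
// so the LEA/SUB pair can be rewritten as two SUBs.
uint64_t beforeRewrite(uint64_t R1, uint64_t R2, uint64_t R4) { return R4 - (R1 + R2); }
uint64_t afterRewrite(uint64_t R1, uint64_t R2, uint64_t R4) { return (R4 - R1) - R2; }

int main() {
  assert(beforeRewrite(3, 5, 100) == afterRewrite(3, 5, 100));     // 92 == 92
  assert(beforeRewrite(~0ull, 7, 0) == afterRewrite(~0ull, 7, 0)); // wrap-around case
  return 0;
}
```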

Similar optimization can also be applied to LEA/ADD sequence.

The modifications to TwoAddressInstructionPass are to ensure the operands of the ADD
instruction have the expected order (the dest register of the LEA should be the src
register of the ADD).

Differential Revision: https://reviews.llvm.org/D104684
2021-07-16 10:16:03 -07:00
Craig Topper 4dbb788068 [RISCV] Teach constant materialization that it can use zext.w at the end with Zba to reduce number of instructions.
If the upper 32 bits are zero and bit 31 is set, we might be able to
use zext.w to fill in the zeros after using an lui and/or addi.

Most of this patch is plumbing the subtarget features into the constant
materialization.

Reviewed By: luismarques

Differential Revision: https://reviews.llvm.org/D105509
2021-07-16 09:35:56 -07:00
Craig Topper 0ce13f92b7 [RISCV] Add curly braces around a case body that declares variables. NFC
This is at the end of the switch so doesn't cause any issues now,
but if a new case is added it will break.
2021-07-16 09:35:56 -07:00
Matt Arsenault 9ad1a49956 Mips/GlobalISel: Use LLT form of getMachineMemOperand
NFC here since it's just using a scalar anyway.
2021-07-16 11:41:32 -04:00
Masoud Ataei ee2068b30e [PowerPC] Updated the error message of MASSV pass to mention vectorization
is needed to be enabled on P8 and later targets.

Differential Revision: https://reviews.llvm.org/D106091
2021-07-16 14:45:09 +00:00
Amy Kwan ba627a32e1 [PowerPC] Update Refactored Load/Store Implementation, XForm VSX Patterns, and Tests
This patch includes the following updates to the load/store refactoring effort introduced in D93370:
 - Update various VSX patterns that used to "force" an XForm, to instead just use XForm.
   This allows the ability for the patterns to compute the most optimal addressing
   mode (and to produce a DForm instruction when possible)
- Update pattern and test case for the LXVD2X/STXVD2X intrinsics
- Update LIT test cases that used to use the XForm instruction to use the DForm instruction

Differential Revision: https://reviews.llvm.org/D95115
2021-07-16 09:28:48 -05:00
Fraser Cormack e3fa2b1eab Revert "[RISCV] Lower more BUILD_VECTOR sequences to RVV's VID"
This reverts commit a6ca88e908.

More caution is required to avoid overflow/underflow. Thanks to the
sanitizers for catching this.
2021-07-16 15:00:20 +01:00
Matt Arsenault 3ceb92295e AMDGPU/GlobalISel: Preserve more memory types 2021-07-16 08:57:26 -04:00
Matt Arsenault 21a0ef8d19 AMDGPU/GlobalISel: Redo kernel argument load handling
This avoids relying on G_EXTRACT on unusual types, and also properly
decomposes structs into multiple registers. This also preserves the
LLTs in the memory operands.
2021-07-16 08:56:54 -04:00
Dmitry Preobrazhensky 09c9f4dc7d [AMDGPU][MC] Added missing isCall/isBranch flags
Added isCall for S_CALL_B64; added isBranch for S_SUBVECTOR_LOOP_*.

Differential Revision: https://reviews.llvm.org/D106072
2021-07-16 14:59:10 +03:00
Nicholas Guy 9769535efd [AArch64] Update Cortex-A55 SchedModel to improve LDP scheduling
Specifying the latencies of specific LDP variants appears to improve
performance almost universally.

Differential Revision: https://reviews.llvm.org/D105882
2021-07-16 12:00:57 +01:00
Cullen Rhodes 99eb96f031 [AArch64][SME] Add load and store instructions
This patch adds support for the following contiguous load and store
instructions:

  * LD1B, LD1H, LD1W, LD1D, LD1Q
  * ST1B, ST1H, ST1W, ST1D, ST1Q

A new register class and operand are added for the 32-bit vector select
register W12-W15. The differences in the following tests, which have been
re-generated, are caused by the introduction of this register class:

  * llvm/test/CodeGen/AArch64/GlobalISel/irtranslator-inline-asm.ll
  * llvm/test/CodeGen/AArch64/GlobalISel/regbank-inlineasm.mir
  * llvm/test/CodeGen/AArch64/stp-opt-with-renaming-reserved-regs.mir
  * llvm/test/CodeGen/AArch64/stp-opt-with-renaming.mir

D88663 attempts to resolve the issue with the store pair test
differences in the AArch64 load/store optimizer.

The GlobalISel differences are caused by changes in the enum values of
register classes, tests have been updated with the new values.

The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06

Reviewed By: CarolineConcatto

Differential Revision: https://reviews.llvm.org/D105572
2021-07-16 10:11:10 +00:00
Fraser Cormack a6ca88e908 [RISCV] Lower more BUILD_VECTOR sequences to RVV's VID
This patch teaches the compiler to identify a wider variety of
`BUILD_VECTOR`s which form integer arithmetic sequences, and to lower
them to `vid.v` with modifications for non-unit steps and non-zero
addends.

The sequences handled by this optimization must either be monotonically
increasing or decreasing. Consecutive elements holding the same value
indicate a fractional step which, while simple mathematically,
becomes more complex to handle both in the realm of lossy integer
division and in the presence of `undef`s.

For example, a common "interleaving" shuffle index will be lowered by
LLVM to both `<0,u,1,u,2,...>` and `<u,0,u,1,u,...>` `BUILD_VECTOR`
nodes. Either of these would ideally be lowered to `vid.v` shifted right
by 1. Detection of this sequence in presence of general `undef` values
is more complicated, however: `<0,u,u,1,>` could match either
`<0,0,0,1,>` or `<0,0,1,1,>` depending on later values in the sequence.
Both are possible, so backtracking or multiple passes is inevitable.

Sticking to monotonic sequences keeps the logic simpler as it can be
done in one pass. Fractional steps will likely be a separate
optimization in a future patch.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D104921
2021-07-16 10:35:13 +01:00
Mehdi Amini 76374573ce Use ManagedStatic and lazy initialization of cl::opt in libSupport to make it free of global initializer
We can build it with -Werror=global-constructors now. This helps
in situations where libSupport is embedded as a shared library,
potentially in dlopen/dlclose scenarios, and when command-line
parsing or other facilities may not be involved. Avoiding the
implicit construction of these cl::opt can avoid double-registration
issues and other kinds of problems.
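
A minimal sketch of the lazy-initialization pattern described above; the option name and getter are illustrative, not the actual options touched by this patch.

```
// Hedged sketch: wrap the cl::opt in a ManagedStatic with a custom creator so
// that no global constructor runs at library load time; the option is only
// constructed on first use.
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/ManagedStatic.h"

namespace {
struct CreateExampleFlag {
  static void *call() {
    return new llvm::cl::opt<bool>("example-flag",
                                   llvm::cl::desc("Illustrative option"),
                                   llvm::cl::init(false));
  }
};
} // namespace

static llvm::ManagedStatic<llvm::cl::opt<bool>, CreateExampleFlag> ExampleFlag;

bool exampleFlagEnabled() { return *ExampleFlag; } // lazily constructed here
```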

Reviewed By: lattner, jpienaar

Differential Revision: https://reviews.llvm.org/D105959
2021-07-16 07:38:16 +00:00
Mehdi Amini 8d051d8546 Revert "Use ManagedStatic and lazy initialization of cl::opt in libSupport to make it free of global initializer"
This reverts commit af9321739b.
Still some specific config broken in some way that requires more
investigation.
2021-07-16 07:35:13 +00:00
Mehdi Amini af9321739b Use ManagedStatic and lazy initialization of cl::opt in libSupport to make it free of global initializer
We can build it with -Werror=global-constructors now. This helps
in situations where libSupport is embedded as a shared library,
potentially in dlopen/dlclose scenarios, and when command-line
parsing or other facilities may not be involved. Avoiding the
implicit construction of these cl::opt can avoid double-registration
issues and other kinds of problems.

Reviewed By: lattner, jpienaar

Differential Revision: https://reviews.llvm.org/D105959
2021-07-16 06:54:26 +00:00
Mehdi Amini 16b5e9d6a2 Revert "Use ManagedStatic and lazy initialization of cl::opt in libSupport to make it free of global initializer"
This reverts commit 42f588f39c.
Broke some buildbots
2021-07-16 03:46:53 +00:00
Mehdi Amini 42f588f39c Use ManagedStatic and lazy initialization of cl::opt in libSupport to make it free of global initializer
We can build it with -Werror=global-constructors now. This helps
in situations where libSupport is embedded as a shared library,
potentially in dlopen/dlclose scenarios, and when command-line
parsing or other facilities may not be involved. Avoiding the
implicit construction of these cl::opt can avoid double-registration
issues and other kinds of problems.

Reviewed By: lattner, jpienaar

Differential Revision: https://reviews.llvm.org/D105959
2021-07-16 03:33:20 +00:00
Matt Arsenault e91da668d0 GlobalISel: Track argument pointeriness with arg flags
Since we're still building on top of the MVT based infrastructure, we
need to track the pointer type/address space on the side so we can end
up with the correct pointer LLTs when interpreting CCValAssigns.
2021-07-15 19:11:40 -04:00
Victor Huang 4eb107ccba [PowerPC] Add PowerPC population count, reversed load and store related builtins and intrinsics for XL compatibility
This patch is in a series of patches to provide builtins for compatibility
with the XL compiler. This patch adds the builtins and intrinsics for population
count, reversed load and store related operations.

Reviewed By: nemanjai, #powerpc

Differential revision: https://reviews.llvm.org/D106021
2021-07-15 17:23:56 -05:00
Harald van Dijk a8ad917054
[X86] Fix handling of maskmovdqu in X32
The maskmovdqu instruction is an odd one: it has a 32-bit and a 64-bit
variant, the former using EDI, the latter RDI, but the use of the
register is implicit. In 64-bit mode, a 0x67 prefix can be used to get
the version using EDI, but there is no way to express this in
assembly in a single instruction; the only way is with an explicit
addr32.

This change adds support for the instruction. When generating assembly
text, that explicit addr32 will be added. When not generating assembly
text, it will be kept as a single instruction and will be emitted with
that 0x67 prefix. When parsing assembly text, it will be re-parsed as
ADDR32 followed by MASKMOVDQU64, which still results in the correct
bytes when converted to machine code.

The same applies to vmaskmovdqu as well.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D103427
2021-07-15 22:56:08 +01:00
Jessica Paquette 46c8e7122b [AArch64][GlobalISel] Clamp <n x p0> vecs when legalizing G_EXTRACT_VECTOR_ELT
This case was missing from G_EXTRACT_VECTOR_ELT. It's the same as for s64.

https://godbolt.org/z/Tnq4acY8z

Differential Revision: https://reviews.llvm.org/D105952
2021-07-15 14:05:28 -07:00
Artem Belevich d774b4aa5e [NVPTX, CUDA] Add .and.popc variant of the b1 MMA instruction.
That should allow clang to compile mma.h from CUDA-11.3.

Differential Revision: https://reviews.llvm.org/D105384
2021-07-15 12:02:09 -07:00
Sushma Unnibhavi aaccc985a8 [M68k][GlobalISel] LegalizerInfo implementation
Added rules for G_ADD, G_SUB, G_MUL, G_UDIV to be legal.

Differential Revision: https://reviews.llvm.org/D105536
2021-07-15 13:00:43 -06:00
Sam Tebbs ff0ef6a518 [ARM][LowOverheadLoops] Make some stack spills valid for tail predication
This patch makes vector spills valid for tail predication when all loads
from the same stack slot are within the loop

Differential Revision: https://reviews.llvm.org/D105443
2021-07-15 19:23:52 +01:00
Quinn Pham de3956605a [PowerPC] Fix popcntb XL Compat Builtin for 32bit
This patch implements the `__popcntb` XL compatibility builtin for 32bit in the frontend and backend. This patch also updates tests for `__popcntb` and other XL Compat sync related builtins.

Reviewed By: #powerpc, nemanjai, amyk

Differential Revision: https://reviews.llvm.org/D105360
2021-07-15 13:19:47 -05:00
Stanislav Mekhanoshin c46d99e4ba [AMDGPU] Refine -O0 and -O1 passes.
Differential Revision: https://reviews.llvm.org/D105579
2021-07-15 09:51:54 -07:00
David Green dad506bd4e [ARM] Expand types handled in VQDMULH recognition
We have a DAG combine for recognizing the sequence of nodes that make up
an MVE VQDMULH, but it currently only handles specifically legal types.
This patch expands that to other power-of-2 vector types. For smaller than
legal types this means any_extending the type and casting it to a legal
type, using a VQDMULH where we only use some of the lanes. The result is
sign extended back to the original type, to properly set the invalid
lanes. Larger than legal types are split into chunks with extracts and
concatenated back together.

Differential Revision: https://reviews.llvm.org/D105814
2021-07-15 14:47:53 +01:00
Simon Pilgrim 91e151476c [TTI] Consistently make getMinVectorRegisterBitWidth() methods const. NFCI.
The underlying getMinVectorRegisterBitWidth() methods are const, but it was missed in a couple of TargetTransformInfo wrappers.

Noticed while working on D103925
2021-07-15 13:27:55 +01:00
Irina Dobrescu 831ee6b0c3 [AArch64][GlobalISel] Optimise lowering for some vector types for min/max
Differential Revision: https://reviews.llvm.org/D105696
2021-07-15 11:34:32 +01:00
Sebastian Neubauer afd895709d [AMDGPU] Use isMetaInstruction for instruction size
Meta instructions have a size of 0. Use isMetaInstruction instead of
listing them explicitly.

Differential Revision: https://reviews.llvm.org/D106043
2021-07-15 12:23:11 +02:00
Cullen Rhodes dfa76933c2 [AArch64][SME] Add outer product instructions
This patch adds support for the following outer product instructions:

  * BFMOPA, BFMOPS, FMOPA, FMOPS, SMOPA, SMOPS, SUMOPA, SUMOPS, UMOPA,
    UMOPS, USMOPA, USMOPS.

Depends on D105570.

The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D105571
2021-07-15 09:51:06 +00:00
Bogdan Graur 442123cada Fixes memory sanitizer 'use-of-uninitialized-value' diagnostic.
Differential Revision: https://reviews.llvm.org/D106047
2021-07-15 11:17:04 +02:00
Kai Luo b9c3941cd6 [PowerPC] Generate inlined quadword lock free atomic operations via AtomicExpand
This patch uses AtomicExpandPass to implement quadword lock free atomic operations. It adopts the method introduced in https://reviews.llvm.org/D47882, which expands atomic operations post-RA to avoid spilling that might prevent LL/SC progress.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D103614
2021-07-15 01:12:09 +00:00
Thomas Lively 4a4229f70f [WebAssembly] Codegen for v128.storeX_lane instructions
Replace the experimental clang builtins and LLVM intrinsics for these
instructions with normal codegen patterns. Resolves PR50435.

Differential Revision: https://reviews.llvm.org/D106019
2021-07-14 16:15:25 -07:00
Jon Roelofs 0e49c54a8c [AArch64] Fix selection of G_UNMERGE <2 x s16>
Differential revision: https://reviews.llvm.org/D106007
2021-07-14 13:40:56 -07:00
Stanislav Mekhanoshin 76b7d3432e [AMDGPU] Add TII::isIgnorableUse() to allow VOP rematerialization
Any def of EXEC prevents rematerialization of any VOP instruction
because of the physreg use. Create a callback to check if the
physreg use can be ignored to allow rematerialization.

Differential Revision: https://reviews.llvm.org/D105836
2021-07-14 13:03:58 -07:00
David Green 31b8f40006 [ARM] Move add(VMLALVA(A, X, Y), B) to VMLALVA(add(A, B), X, Y)
For i64 reductions we currently try and convert add(VMLALV(X, Y), B) to
VMLALVA(B, X, Y), incorporating the addition into the VMLALVA. If we
have an add of an existing VMLALVA, this patch pushes the add up above
the VMLALVA so that it may potentially be simplified further, for
example being folded into another VMLALV.

Differential Revision: https://reviews.llvm.org/D105686
2021-07-14 20:06:49 +01:00
Eli Friedman 1e30bf8621 [SelectionDAG] Add an overload of getStepVector that assumes step 1.
This is mostly a minor convenience, but the pattern seems frequent
enough to be worthwhile (and we'll probably add more uses in the
future).

Differential Revision: https://reviews.llvm.org/D105850
2021-07-14 11:37:01 -07:00
Thomas Lively 970e090010 [WebAssembly] Codegen for v128.loadX_lane instructions
Replace the experimental clang builtin and LLVM intrinsics for these
instructions with normal codegen patterns. Resolves PR50433.

Differential Revision: https://reviews.llvm.org/D105950
2021-07-14 11:31:53 -07:00
David Green 338314f9c2 [ARM] Lower v16i8 -> i64 VMLA reductions.
MVE does not have a VMLALV instruction that can perform v16i8 -> i64
reductions, like it does for v8i16->i64 and v4i32->i64 reductions. That
means that the pattern to create them will be split up by type
legalization, creating a lot of instructions.

This extends the patterns for matching i64 reductions a little to handle
the v16i8->i64 case. We need to turn them into a pair of v8i16->i64
VMLALVs that each perform half of the reduction and are summed together
(so the latter is a VMLALVA). The order of the lanes does not matter for
the reduction so we generate a MVEEXT for the extension, that will
either be folded into an extending load or can be optimized to a
VREV/VMOVL. Some of the resulting codegen isn't optimal, but will be
improved in a later patch.

Differential Revision: https://reviews.llvm.org/D105680
2021-07-14 18:11:32 +01:00
Matt Arsenault 47269da5d8 GlobalISel: Handle lowering non-power-of-2 extloads 2021-07-14 11:54:11 -04:00
Sander de Smalen eac1670739 [CostModel][AArch64] Make loads/stores of <vscale x 1 x eltty> invalid.
At the moment, <vscale x 1 x eltty> are not yet fully handled by the
code-generator, so to avoid vectorizing loops with that VF, we mark the
cost for these types as invalid.
The reason for not adding a new "TTI::getMinimumScalableVF" is because
the type is supposed to be a type that can be legalized. It partially is,
although the support for these types needs some more work.

Reviewed By: paulwalker-arm, dmgreen

Differential Revision: https://reviews.llvm.org/D103882
2021-07-14 16:44:22 +01:00
Jinsong Ji fe52296a34 [AIX] Enable dollar sign as PC in inlineasm
$ is used as the PC for PowerPC inline asm; ELF uses it,
so enable it for AIX XCOFF as well.

Reviewed By: #powerpc, amyk, nemanjai

Differential Revision: https://reviews.llvm.org/D105956
2021-07-14 13:37:52 +00:00
Tim Northover b18bda6791 ARM: reuse existing libcall global variable if possible.
If we try to create a new GlobalVariable on each iteration, the Module will
detect the name collision and "helpfully" rename later iterations by appending
".1" etc. But "___udivsi3.1" doesn't exist and we definitely don't want to try
to call it.

So instead check whether there's already a global with the right name in the
module and use that if so.
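
A hedged sketch of the lookup-before-create approach described above; the helper name is illustrative and not the patch's actual code.

```
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/Module.h"

// Reuse an existing global with the libcall's name instead of creating a
// duplicate that the Module would rename to "___udivsi3.1".
static llvm::GlobalValue *getOrCreateLibcallGlobal(llvm::Module &M,
                                                   llvm::StringRef Name,
                                                   llvm::Type *Ty) {
  if (llvm::GlobalVariable *Existing = M.getGlobalVariable(Name))
    return Existing; // created on an earlier iteration; reuse it
  return new llvm::GlobalVariable(M, Ty, /*isConstant=*/false,
                                  llvm::GlobalValue::ExternalLinkage,
                                  /*Initializer=*/nullptr, Name);
}
```
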
2021-07-14 14:14:47 +01:00
Simon Pilgrim ee71c1bbcc [X86] Implement smarter instruction lowering for FP_TO_UINT from f32/f64 to i32/i64 and vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
We know that "CVTTPS2SI" returns 0x80000000 for out of range inputs (and for FP_TO_UINT, negative float values are undefined). We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend using the following logic:

small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))

Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency 2, low throughput instruction so this logic is applied there too (in particular for AVX2 also). It furthermore gets rid of one high latency floating point comparison in the previous lowering.
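
A scalar C++ illustration of the logic above; the helper mimics CVTTPS2SI's out-of-range behaviour for a single lane and is not the actual vectorized lowering.

```
#include <cstdint>

// Stand-in for one lane of CVTTPS2SI: out-of-range inputs produce the
// "integer indefinite" value 0x80000000.
static int32_t cvtt_s32(float X) {
  if (!(X >= -2147483648.0f && X < 2147483648.0f))
    return INT32_MIN;
  return static_cast<int32_t>(X);
}

static uint32_t fp_to_ui(float X) {
  int32_t Small = cvtt_s32(X);               // valid when X < 2^31
  int32_t Big = cvtt_s32(X - 2147483648.0f); // valid when X >= 2^31
  // Small >> 31 (arithmetic shift assumed) is all-ones exactly when X was too
  // large for a signed conversion, selecting the adjusted result instead.
  return static_cast<uint32_t>(Small | (Big & (Small >> 31)));
}
```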

@TomHender checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).

Original Patch by @TomHender (Tom Hender)

Differential Revision: https://reviews.llvm.org/D89697
2021-07-14 12:03:49 +01:00
Stephen Tozer 810e4c3c66 [DebugInfo] Correctly update dbg.values with duplicated location ops
This patch fixes code that incorrectly handled dbg.values with duplicate
location operands, i.e. !DIArgList(i32 %a, i32 %a). The errors in
question were caused by either applying an update to dbg.value multiple
times when the update is only valid once, or by updating the
DIExpression for only the first instance of a value that appears
multiple times.

Differential Revision: https://reviews.llvm.org/D105831
2021-07-14 11:17:24 +01:00
Fraser Cormack 03a4702c88 [RISCV] Fix the neutral element in vector 'fadd' reductions
Using positive zero as the neutral element in 'fadd' reductions, while
it generates better code, is incorrect. The correct neutral element is
negative zero: 0.0 + -0.0 = 0.0, whereas -0.0 + -0.0 = -0.0.
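
A two-line illustration of the identity-element point (standard IEEE-754 behaviour, not code from the patch):

```
#include <cstdio>

int main() {
  // Reducing the single-element vector {-0.0} must give -0.0. With +0.0 as
  // the identity the sign is lost; with -0.0 as the identity it is kept.
  std::printf("%g\n", 0.0 + -0.0);  // prints 0 (i.e. +0.0)
  std::printf("%g\n", -0.0 + -0.0); // prints -0
  return 0;
}
```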

There are perhaps more optimal lowerings of negative zero avoiding
constant-pool loads which could be left as future work.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D105902
2021-07-14 10:18:38 +01:00
Sebastian Neubauer 4359b870b1 [AMDGPU] Init scratch only if necessary
If no scratch or flat instructions are used, we do not need to
initialize the flat scratch hardware register.

Differential Revision: https://reviews.llvm.org/D105920
2021-07-14 10:45:22 +02:00
Cullen Rhodes c08dabb0f4 [AArch64][SME] Add matrix register definitions and parsing support
SME introduces the ZA array, a new piece of architectural register state
consisting of a matrix of [SVLb x SVLb] bytes, where SVL is the
implementation defined Streaming SVE vector length and SVLb is the
number of 8-bit elements in a vector of SVL bits.

SME instructions consist of three types of matrix operands:

  * Tiles: a ZA tile is a square, two-dimensional sub-array of elements
  within the ZA array. These tiles make up the larger accumulator array
  and the granularity varies based on the element size, i.e.
    - ZAQ0..ZAQ15 (smallest tile granule)
    - ZAD0..ZAD7
    - ZAS0..ZAS3
    - ZAH0..ZAH1
    or ZAB0       (largest tile granule, single tile)
  * Tile vectors: similar to regular tiles, but have an extra 'h' or 'v'
  to tell how the vector at [reg+offset] is laid out in the tile,
  horizontally or vertically. E.g. za1h.h or za15v.q, which correspond
  to vectors in registers ZAH1 and ZAQ15, respectively.
  * Accumulator matrix: this is the entire accumulator array ZA.

This patch adds the register classes and related operands and parsing
for SME instructions operating on the accumulator array.

The ADDHA and ADDVA instructions which operate on tiles are also added
in this patch to make some use of the code added, later patches will
make use of the other operands introduced here.

The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06

Co-authored by: Sander de Smalen (@sdesmalen)

Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D105570
2021-07-14 08:25:49 +00:00
Ruiling Song d9b9fdd91b [AMDGPU] Don't handle export done when unify exit nodes
This patch aims to revert the changes introduced by D70781 D71192 D76364

D70781 was introduced to fix a hardware hang where we did not insert exp-
null-done for a kill inside an infinite loop. At that time we had not added
exp-null-done for kill early termination, but I believe that, as of now, we will
always add the exp-null-done for the early termination case in SILateBranchLowering.

D71192 was introduced to handle the only_kill case, which has also been
handled by the kill early termination work.

D76364 was used to fix a regression from D71192, where we cleared the done
bit of the export in the existing program and did not let the normal return
block branch to the new unified return block.

With this change, we just trust that frontends have set up exp-done correctly,
which is true for all existing frontends. The backend only inserts
exp-null-done for the kill cases, which are handled in SILateBranchLowering.cpp.

Reviewed by: critson

Differential Revision: https://reviews.llvm.org/D105610
2021-07-14 14:54:37 +08:00
Jon Roelofs 87c6bf92a9 [AArch64] rm unused subreg's 2021-07-13 18:06:31 -07:00
Jon Roelofs 6377388c32 [AArch64] Fix AArch64::dsub's size 2021-07-13 18:06:31 -07:00
Jessica Paquette 5bd7cc4f42 [AArch64][GlobalISel] Mark v2s64 -> v2p0 G_INTTOPTR as legal
Allow

```
%x:_<2 x p0> = G_INTTOPTR %y:_<2 x s64>
```

This shows up when building clang for AArch64 with GlobalISel.

Also show that we can select it.

This should match SDAG's behaviour: https://godbolt.org/z/33oqYoaYv

Differential Revision: https://reviews.llvm.org/D105944
2021-07-13 17:28:14 -07:00
Matt Arsenault eebe841a47 RegAlloc: Allow targets to split register allocation
AMDGPU normally spills SGPRs to VGPRs. Previously, since all register
classes are handled at the same time, this was problematic. We don't
know ahead of time how many registers will be needed to be reserved to
handle the spilling. If no VGPRs were left for spilling, we would have
to try to spill to memory. If the spilled SGPRs were required for exec
mask manipulation, it is highly problematic because the lanes active
at the point of spill are not necessarily the same as at the restore
point.

Avoid this problem by fully allocating SGPRs in a separate regalloc
run from VGPRs. This way we know the exact number of VGPRs needed, and
can reserve them for a second run.  This fixes the most serious
issues, but it is still possible using inline asm to make all VGPRs
unavailable. Start erroring in the case where we ever would require
memory for an SGPR spill.

This is implemented by giving each regalloc pass a callback which
reports if a register class should be handled or not. A few passes
need some small changes to deal with leftover virtual registers.

In the AMDGPU implementation, a new pass is introduced to take the
place of PrologEpilogInserter for SGPR spills emitted during the first
run.

One disadvantage of this is currently StackSlotColoring is no longer
used for SGPR spills. It would need to be run again, which will
require more work.

Error if the standard -regalloc option is used. Introduce new separate
-sgpr-regalloc and -vgpr-regalloc flags, so the two runs can be
controlled individually. PBQP is not currently supported, so this also
prevents using the unhandled allocator.
2021-07-13 18:49:29 -04:00
Victor Huang 18c19414eb [PowerPC] Add PowerPC compare and multiply related builtins and intrinsics for XL compatibility
This patch is in a series of patches to provide builtins for compatibility
with the XL compiler. This patch adds the builtins and intrinsics for compare
and multiply related operations.

Reviewed By: nemanjai, #powerpc

Differential revision: https://reviews.llvm.org/D102875
2021-07-13 16:55:09 -05:00
Victor Huang 781929b423 [PowerPC][NFC] Power ISA features for Semachecking
[NFC] This patch adds features for pwr7, pwr8, and pwr9 that can be
used for Sema checking of builtin functions that are only valid for certain
versions of ppc.

Reviewed By: nemanjai, #powerpc
Authored By: Quinn Pham <Quinn.Pham@ibm.com>

Differential revision: https://reviews.llvm.org/D105501
2021-07-13 13:13:34 -05:00
Victor Huang e4585d3f4e Revert "[PowerPC][NFC] Power ISA features for Semachecking"
This reverts commit 10e0cdfc65.
2021-07-13 13:13:34 -05:00
Jon Roelofs eba638dbbb [AArch64][GlobalISel] Legalize load <2 x i16>
Differential revision: https://reviews.llvm.org/D105913
2021-07-13 11:12:05 -07:00
Jon Roelofs 43c7ca8e49 [AArch64][GlobalISel] Legalize store <2 x i16>
Differential revision: https://reviews.llvm.org/D105912
2021-07-13 11:12:05 -07:00
Craig Topper 1e670dc7d7 [RISCV] Use DIVUW/REMUW/DIVW instructions for i8/i16/i32 udiv/urem/sdiv when LHS is constant.
We don't really have optimizations for division with a constant
LHS. If we don't use a W instruction we end up needing to sign
or zero extend the RHS to use the 64-bit instruction.

I had to sign_extend i32 constants on the LHS instead of using
any_extend which becomes zero_extend. If we don't do this, constants
that were originally negative become harder to materialize. I think
this problem exists for more of our W instruction cases. For example
(i32 (shl -1, X)), but we don't have lit tests. I'll work on that
as a follow up.

I also left a FIXME for enabling W instruction for RHS constants
under -Oz.

Reviewed By: luismarques

Differential Revision: https://reviews.llvm.org/D105769
2021-07-13 10:33:57 -07:00
Amy Kwan b5f4ac4c11 [PowerPC] Add FI alignment check if the addressing mode is DS/DQ-Form, emit X-Form if necessary.
This patch adds a function that checks whether or not the frame index
is aligned when the computed addressing mode is an aligned D-Form (DS- or DQ-Form).
If the frame index appears to be unaligned within these two modes, reset
the mode to X-Form in order to fall back to selecting X-Form loads.

A test case is added to ensure that the test emits X-Form loads and not DQ-Form
loads since the frame index is not aligned within the test case.

Differential Revision: https://reviews.llvm.org/D105661
2021-07-13 12:31:52 -05:00
Craig Topper 46e8970817 [RISCV] Prevent use of t0(aka x5) as rs1 for jalr instructions.
Some microarchitectures treat rs1=x1/x5 on jalr as a hint to pop
the return-address stack. We should avoid using x5 on jalr
instructions since we aren't using x5 as an alternate link register.

Differential Revision: https://reviews.llvm.org/D105875
2021-07-13 09:46:21 -07:00
Arthur Eubanks 693bc04bf6 [OpaquePtr] Use GlobalValue::getValueType() more 2021-07-13 09:34:34 -07:00
Arthur Eubanks b25aca503d [OpaquePtr] Use AllocaInst::getAllocatedType() 2021-07-13 09:34:33 -07:00
Fangrui Song 3d89fb4d13 [RISCV] Support machine constraint "S"
Similar to D46745, "S" represents an absolute symbolic operand, which
can be used to specify the access models, e.g.

  extern int var;
  void *addr_via_asm() {
    void *ret;
    asm("lui %0, %%hi(%1)\naddi %0,%0,%%lo(%1)" : "=r"(ret) : "S"(&var));
    return ret;
  }

'S' is documented in trunk GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101275

Reviewed By: luismarques

Differential Revision: https://reviews.llvm.org/D105254
2021-07-13 09:30:09 -07:00
Albion Fung f1aca5ac96 [PowerPC] Fix L[D|W]ARX Implementation
LDARX and LWARX sometimes get optimized out by the compiler
when they are critical to the correctness of the code. This inline asm generation
ensures that they are preserved.

Differential Revision: https://reviews.llvm.org/D105754
2021-07-13 11:02:07 -05:00
Victor Huang 10e0cdfc65 [PowerPC][NFC] Power ISA features for Semachecking
[NFC] This patch adds features for pwr7, pwr8, and pwr9 that can be
used for Sema checking of builtin functions that are only valid for certain
versions of ppc.

Reviewed By: nemanjai, #powerpc
Authored By: Quinn Pham <Quinn.Pham@ibm.com>

Differential revision: https://reviews.llvm.org/D105501
2021-07-13 10:51:25 -05:00
Matt Arsenault fb44c3223e AMDGPU: Promote signext/zeroext i16 shader returns
This makes them consistent with all the other return convention
handling. If we don't do this, we lose the sext/zext flag if treated
as a full assignment, which complicates a future GlobalISel patch.
2021-07-13 11:04:51 -04:00
Matt Arsenault 77a608d9de GlobalISel: Remove getIntrinsicID utility function
This is redundant with a method directly on MachineInstr
2021-07-13 11:04:10 -04:00
Matt Arsenault 121541fdcd Mips/GlobalISel: Use more standard call lowering infrastructure
This also fixes some missing implicit uses on call instructions, adds
missing G_ASSERT_SEXT/ZEXT annotations, and some missing outgoing
sext/zexts. This also fixes not respecting tablegen requested type
promotions.

This starts treating f64 passed in i32 GPRs as a type of custom
assignment, which restores some previously XFAILed tests. This is due
to getNumRegistersForCallingConv returning a static value, but in this
case it is context dependent on other arguments.

Most of the ugliness is reproducing a hack CC_MipsO32 uses in
SelectionDAG. CC_MipsO32 depends on a bunch of vectors populated from
the original IR argument types in MipsCCState. The way this ends up
working in GlobalISel is it only ends up inspecting the most recently
added vector element. I'm pretty sure there are cleaner ways to do
this, but this seemed easier than fixing up the current DAG
handling. This is another case where it would be easier if the
CCAssignFns were passed the original type instead of only the
pre-legalized ones.

There's still a lot of junk here that shouldn't be necessary. This
also likely breaks big endian handling, but it wasn't complete/tested
anyway since the IRTranslator gives up on big endian targets.
2021-07-13 11:04:10 -04:00
Matt Arsenault 6a3904f16e Mips: Mark special case calling convention handling as custom
The number of registers used for passing f64 in some cases is context
dependent, and thus getNumRegistersForCallingConv is sometimes
inaccurate. For f64, it reports 1 but is sometimes split into 2 32-bit
registers.

For GlobalISel, the generic argument assignment code expects
getNumRegistersForCallingConv to return an accurate answer. Switch to
marking these arguments as custom so we can deal with this case as a
custom assignment instead.

This temporarily breaks a few globalisel tests which are fixed by a
future change to use more of the generic infrastructure.
2021-07-13 11:04:10 -04:00
Simon Pilgrim 3cee36c5ac [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result (REAPPLIED)
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))

Reapplied rGe4aa6ad13216 with fix for intrinsic variants of the opcode which uses a vector return type
2021-07-13 12:31:09 +01:00
Hafiz Abid Qadeer b205f2bb89 [AMDGPU] Handle s_branch to another section.
Currently, if the target of an s_branch instruction is in another section, it fails with an undefined-label error, although in this case the label is not undefined but present in another section. This patch handles this issue: while handling the fixup_si_sopp_br fixup in getRelocType, if the target label is undefined we issue an error as before; if it is defined, a new relocation type R_AMDGPU_REL16 is returned.

This issue has been reported in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100181 and https://bugs.llvm.org/show_bug.cgi?id=45887. Before https://reviews.llvm.org/D79943, we used to get a crash for this scenario. The crash is fixed now but we still get an undefined label error. Jumps to another section can arise with hot/cold splitting.

A patch to handle the relocation in lld will follow shortly.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D105760
2021-07-13 12:17:47 +01:00
Sebastian Neubauer ad2c66ec5d [AMDGPU] Optimize VGPR LiveRange in waterfall loops
The loops are run exactly once per lane, so VGPRs do not need to be
saved. Use the SIOptimizeVGPRLiveRange pass to add phi nodes that take
undef when coming from the loop.

There is still a shortcoming:
Return values from a function call in the loop are copied because their
live range conflicts with the live range of arguments, even if arguments
are only IMPLICIT_DEF after the phi insertion.

Differential Revision: https://reviews.llvm.org/D105192
2021-07-13 12:15:08 +02:00
Sebastian Neubauer 9d72c0ad43 [AMDGPU] Mark waterfall loops as SI_WATERFALL_LOOP
This way, they can be detected later, e.g. by the
SIOptimizeVGPRLiveRange pass.

Differential Revision: https://reviews.llvm.org/D105467
2021-07-13 12:15:08 +02:00
Tim Northover 7802f62b3f AArch64: use 4-byte slots for arm64_32 pointers in a tail call 2021-07-13 11:08:59 +01:00
Fraser Cormack d991b7212b [RISCV] Pass undef VECTOR_SHUFFLE indices on to BUILD_VECTOR
Often when lowering vector shuffles, we split the shuffle into two
LHS/RHS shuffles which are then blended together. To do so we split the
original indices into two, indexed into each respective vector. These
two index vectors are then separately lowered as BUILD_VECTORs.

This patch forwards on any undef indices to the BUILD_VECTOR, rather
than having the VECTOR_SHUFFLE lowering decide on an optimal concrete
index. The motivation for this change is so that we don't duplicate
optimization logic between the two lowering methods and let BUILD_VECTOR
do what it does best.

Propagating undef in this way allows us, for example, to generate
`vid.v` to produce the LHS indices of commonly-used interleave-type
shuffles. I have designs on further optimizing interleave-type and other
common shuffle patterns in the near future.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D104789
2021-07-13 10:41:54 +01:00
Stanislav Mekhanoshin d46d534dbb [AMDGPU] Make some VOP1 instructions rematerializable
This is a pilot change to verify the logic. The rest will be
done in the same way, at least for the rest of VOP1.

Differential Revision: https://reviews.llvm.org/D105742
2021-07-12 23:43:45 -07:00
Qiu Chaofan 6fd9c1901f [PowerPC] Fix typo in vector shuffle combining
a22ecb4 fixed a crash on big endian subtargets. This commit fixes a typo
in that commit which may cause a miscompile.
2021-07-13 14:35:47 +08:00
David Green ca78151001 [ARM] Introduce MVEEXT ISel lowering
Similar to D91921 (and D104515) this introduces two MVESEXT and MVEZEXT
nodes that larger-than-legal sext and zext are lowered to. These either
get optimized away or end up becoming a series of stack loads/store, in
order to perform the extending whilst keeping the order of the lanes
correct. They are generated from v8i16->v8i32, v16i8->v16i16 and
v16i8->v16i32 extends, potentially with an intermediate extend for the
larger v16i8->v16i32 extend. A number of combines have been added for
obvious cases that come up in tests, notably MVEEXT of shuffles. More
may be needed in the future, but this seems to cover most of the cases
that come up in the tests.

Differential Revision: https://reviews.llvm.org/D105090
2021-07-13 07:21:20 +01:00
hyeongyu kim d7d9c577ed [NFC] Edit the comment in M68kInstrInfo::ExpandMOVSZX_RM 2021-07-13 15:10:24 +09:00
Vitaly Buka 606551ee98 Revert "[X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result"
Fails here https://lab.llvm.org/buildbot/#/builders/37/builds/5267

This reverts commit e4aa6ad132.
2021-07-12 22:26:54 -07:00
Jon Roelofs 6611fbc62a [AArch64] Dump a little more info about unimplemented reg-to-reg copies. NFC 2021-07-12 15:37:11 -07:00
Eli Friedman 6c04b7dd4f [AArch64] Optimize overflow checks for [s|u]mul.with.overflow.i32.
Saves one instruction for signed, uses a cheaper instruction for
unsigned.

Differential Revision: https://reviews.llvm.org/D105770
2021-07-12 15:30:42 -07:00
Amy Kwan 35909ff6cf [PowerPC] Fix the splat immediate in PPCMIPeephole depending on if we have an Altivec and VSX splat instruction.
The following assertion can occur because Altivec and VSX splats use a different operand number for the immediate:
```
int64_t llvm::MachineOperand::getImm() const: Assertion `isImm() && "Wrong MachineOperand accessor"' failed.
```
This patch updates PPCMIPeephole.cpp to assign the correct splat immediate.

Differential Revision: https://reviews.llvm.org/D105790
2021-07-12 16:20:11 -05:00
Wouter van Oortmerssen 1689d14ed1 [WebAssembly] fix typo in range check for Asm locals 2021-07-12 13:07:11 -07:00
Simon Pilgrim ae0d73ac3b [CostModel][X86] Adjust fptosi/fptoui SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXf32/vXf64 -> vXi8/vXi16 fptosi/fptoui costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 20:38:25 +01:00
Thomas Johnson 6b3eba7c28 [ARC] Add disassembly for the conditioned move immediate instruction
This change is a step towards implementing codegen for __builtin_clz().
Full support for CLZ with a regression test will follow shortly.
Differential Revision: https://reviews.llvm.org/D105560
2021-07-12 12:35:56 -07:00
Jinsong Ji 2377eca93c [PowerPC] Custom Lowering BUILD_VECTOR for v2i64 for P7 as well
The lowering for v2i64 is now guarded with hasDirectMove; however, the
current lowering can handle the pattern correctly, only lowering it when
there are efficient patterns and corresponding instructions.

The original guard was added in D21135, and was for Legal action.
The code has evolved now; this guard is not necessary anymore.

Reviewed By: #powerpc, nemanjai

Differential Revision: https://reviews.llvm.org/D105596
2021-07-12 17:56:10 +00:00
Thomas Lively cbabfc63b1 [WebAssembly] Custom combines for f32x4.demote_zero_f64x2
Replace the clang builtin function and LLVM intrinsic for
f32x4.demote_zero_f64x2 with combines from normal SDNodes. Also add missing
combines for i32x4.trunc_sat_zero_f64x2_{s,u}, which share the same pattern.

Differential Revision: https://reviews.llvm.org/D105755
2021-07-12 10:32:18 -07:00
Craig Topper d5c97f4bf0 [X86] Teach X86FloatingPoint's handleCall to only erase the FP stack if there is a regmask operand that clobbers the FP stack.
There are some calls to functions like `__alloca` that are missing
a regmask operand. Lack of a regmask operand means that all
registers that aren't mentioned by def operands are preserved.
__alloca only updates EAX and ESP and has def operands for
them so this is ok. Because there is no regmask the register
allocator won't spill the FP registers across the call. Assuming
we want to keep the FP stack untouched across these calls, we
need to handle this in the FP stackifier.
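
A hedged sketch of the kind of check described here; the actual patch may differ in detail, and X86::FP0 refers to the target's internal register enum.

```
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

// Only treat a call as clobbering the x87 stack when it carries a regmask
// operand that actually clobbers an FP stack register; calls like __alloca
// that have no regmask operand preserve it.
static bool callClobbersFPStack(const llvm::MachineInstr &MI) {
  for (const llvm::MachineOperand &MO : MI.operands())
    if (MO.isRegMask())
      return llvm::MachineOperand::clobbersPhysReg(MO.getRegMask(), X86::FP0);
  return false; // no regmask: all registers not mentioned by defs are preserved
}
```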

We might want to add a proper regmask operand to the code that
creates these calls to indicate all registers are preserved, but we'd
still need this change to the FP stackifier to know to preserve the
FP stack for such a regmask.

The test is kind of long, but bugpoint wasn't able to reduce it
any further.

Fixes PR50782

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D105762
2021-07-12 10:15:38 -07:00
Albion Fung ef49d925e2 [PowerPC] Implement trap and conversion builtins for XL compatibility
This patch implements trap and FP to and from double conversions. The builtins
generate code that mirror what is generated from the XL compiler. Intrinsics
are named conventionally with builtin_ppc, but are aliased to provide the same
builtin names as the XL compiler.

Differential Revision: https://reviews.llvm.org/D103668
2021-07-12 11:04:17 -05:00
Benjamin Kramer 0da3573a9e [AArch64] Silence unused variable warning. NFC.
AArch64ISelLowering.cpp:15167:8: warning: unused variable 'OpCode' [-Wunused-variable]
  auto OpCode = N->getOpcode();
       ^
2021-07-12 16:01:11 +02:00
Cullen Rhodes 9e42675103 [AArch64] Add target features for Armv9-A Scalable Matrix Extension (SME)
First patch in a series adding MC layer support for the Arm Scalable
Matrix Extension.

This patch adds the following features:

    sme, sme-i64, sme-f64

The sme-i64 and sme-f64 flags are for the optional I16I64 and F64F64
features.

If a target supports I16I64 then the following instructions are
implemented:

  * 64-bit integer ADDHA and ADDVA variants (D105570).
  * SMOPA, SMOPS, SUMOPA, SUMOPS, UMOPA, UMOPS, USMOPA, and USMOPS
    instructions that accumulate 16-bit integer outer products into 64-bit
    integer tiles.

If a target supports F64F64 then the FMOPA and FMOPS instructions that
accumulate double-precision floating-point outer products into
double-precision tiles are implemented.

Outer products are implemented in D105571.

The reference can be found here:
https://developer.arm.com/documentation/ddi0602/2021-06

Reviewed By: CarolineConcatto

Differential Revision: https://reviews.llvm.org/D105569
2021-07-12 13:28:10 +00:00
Michael Liao 8253fa2298 Fix warning '-Wparentheses'. NFC. 2021-07-12 09:25:30 -04:00
Jonas Paulsson 96421af5f8 [SystemZ] Bugfix for the 'N' code for inline asm operand.
Don't use a local MachineOperand copy in SystemZAsmPrinter::PrintAsmOperand()
and change the register as it may break the MRI tracking of register
uses. Use an MCOperand instead.

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D105757
2021-07-12 15:04:08 +02:00
Simon Pilgrim 96b4117d51 [CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca reports.
Update truncation costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 13:50:43 +01:00
David Green f73334c46d [AArch64] Set the latency of Cortex-A55 stores to 1
This sets the latency of stores to 1 in the Cortex-A55 scheduling model,
to better match the values given in the software optimization guide.

The latency of a store in normal llvm scheduling does not appear to have
a lot of uses. If the store has no outputs then the latency is somewhat
meaningless (and pre/post increment update operands use the WriteAdr
write for those operands instead). The one place it does alter things is
the latency between a store and the end of the scheduling region, which
can in turn have an effect on the critical path length. As a result a
latency of 1 is more correct and offers ever-so-slightly better
scheduling of instructions near the end of the block.

They are marked as RetireOOO to keep the llvm-mca from introducing
stalls where none would exist.

Differential Revision: https://reviews.llvm.org/D105541
2021-07-12 13:39:35 +01:00
David Truby c305557acd [llvm][sve] Lowering for VLS truncating stores
This adds custom lowering for truncating stores when operating on
fixed length vectors in SVE. It also includes a DAG combine to
fold extends followed by truncating stores into non-truncating
stores in order to prevent this pattern appearing once truncating
stores are supported.

Currently truncating stores are not used in certain cases where
the size of the vector is larger than the target vector width.

Differential Revision: https://reviews.llvm.org/D104471
2021-07-12 11:14:17 +01:00
Simon Pilgrim e4aa6ad132 [X86][SSE] X86ISD::FSETCC nodes (cmpss/cmpsd) return a 0/-1 allbits signbits result
Annoyingly, i686 cmpsd handling still fails to remove the unnecessary neg(and(x,1))
2021-07-12 09:56:59 +01:00
Fangrui Song 57503524b1 [AArch64] De-capitalize some Emit* functions
AsmParser/AsmPrinter/Streamer are mostly consistent on emit* functions now.
2021-07-11 22:05:39 -07:00
Daniel Egger 98c2e4115d [ARM] Add lowering of uadd_sat to uq{add|sub}8 and uq{add|sub}16
This follows the lead of https://reviews.llvm.org/D68974 to add lowering
of unsigned saturated addition/subtraction.

Differential Revision: https://reviews.llvm.org/D105413
2021-07-11 15:58:11 +01:00
Amara Emerson 97c426394a [AArch64][GlobalISel] Implement moreElements legalization for G_SHUFFLE_VECTOR.
Differential Revision: https://reviews.llvm.org/D103301
2021-07-10 00:25:26 -07:00
Amara Emerson 58a2cb5143 [GlobalISel] Add a new artifact combiner for unmerge which looks through general artifact expressions.
The original motivation for this was to implement moreElementsVector of shuffles
on AArch64, which resulted in complex sequences of artifacts like unmerge(unmerge(concat...))
which the combiner couldn't handle. It seemed here that the better option,
instead of writing ever-more-complex combines, was to have a way to find
the original "non-artifact" source registers for a given definition, walking
through arbitrary expressions of unmerge/concat/insert. As long as the bits
aren't extended or truncated, this is a pretty simple algorithm that avoids
the need for lots of combines and instead jumps straight to the final result
we want.

I've only used this new technique in 2 places within tryCombineUnmerge; using it
in more general situations resulted in infinite loops in AMDGPU. So for now
it's used when we would otherwise fail to combine and that seems to work.

In order to support looking through G_INSERTs, I also had to add it as an
artifact in isArtifact(), which caused a whole lot of issues in tests. AMDGPU
started infinite looping since full legalization of G_INSERT doesn't seem to
be there. To work around this, I've temporarily added a CLI option to use the
old behaviour so that the MIR tests will still run and terminate.

Other minor changes include no longer making >128b G_MERGE/UNMERGE legal.
We never had isel support for that anyway and it was a remnant of the legacy
legalizer rules. However being legal prevented the combiner from checking if it
was dead and deleting them.

Differential Revision: https://reviews.llvm.org/D104355
2021-07-09 22:35:00 -07:00
Thomas Lively e5220104d0 [WebAssembly] Custom combines for f64x2.promote_low_f32x4
Replace the clang builtin function and LLVM intrinsic previously used to select
the f64x2.promote_low_f32x4 instruction with custom combines from standard
SelectionDAG nodes. Implement the new combines to share code with the similar
combines for f64x2.convert_low_i32x4_{s,u}. Resolves PR50232.

Differential Revision: https://reviews.llvm.org/D105675
2021-07-09 18:59:29 -07:00
Derek Schuff ac02baab48 WebAssembly: Update datalayout to match fp128 ABI change
This fix goes along with d1a96e906c
and makes the fp128 alignment match clang's long double alignment.

Differential Revision: https://reviews.llvm.org/D105749
2021-07-09 16:51:36 -07:00
Kazu Hirata 5f306feb4d [WebAssembly] Fix warnings 2021-07-09 16:40:01 -07:00
Wouter van Oortmerssen f3e6c3f327 [WebAssembly] Fixed 2 warnings in Asm Type Checker 2021-07-09 14:38:52 -07:00
Wouter van Oortmerssen 9647a6f719 [WebAssembly] Added initial type checker to MC Assembler
This is to protect against nonsensical instruction sequences being assembled,
which would either cause asserts/crashes further down, or a Wasm module being output that doesn't validate.

Unlike a validator, this type checker is able to give type-errors as part of the parsing process, which makes the assembler much friendlier to be used by humans writing manual input.

Because the MC system is single pass (instructions aren't even stored in MC format, they are directly output) the type checker has to be single pass as well, which means that from now on .globaltype and .functype decls must come before their use. An extra pass is added to Codegen to collect information for this purpose, since AsmPrinter is normally single pass / streaming as well, and would otherwise generate this information on the fly.

A `-no-type-check` flag was added to llvm-mc (and any other tools that take asm input) that suppresses type errors, as a quick escape hatch for tests that were not intended to be type correct.

This is a first version of the type checker that ignores control flow, i.e. it checks that types are correct along the linear path, but not the branch path. This will still catch most errors. Branch checking could be added in the future.

Differential Revision: https://reviews.llvm.org/D104945
2021-07-09 14:07:25 -07:00
Stanislav Mekhanoshin 4a3b055653 [AMDGPU] Fix flags of V_MOV_B64_PSEUDO
In particular it was not rematerializable.

Differential Revision: https://reviews.llvm.org/D105724
2021-07-09 12:49:28 -07:00
Graham Yiu ecd15fbf6b [ARC][NFC] Include file re-ordering
- Sort includes in alphabetical order via clang-format
2021-07-09 12:20:32 -07:00
Jeremy Morse 30cce54dad [X86] Return src/dest register from stack spill/restore recogniser
LLVM provides target hooks to recognise stack spill and restore
instructions, such as isLoadFromStackSlot, and it also provides post frame
elimination versions such as isLoadFromStackSlotPostFE. These are supposed
to return the store-source and load-destination registers; unfortunately on
X86, the PostFE recognisers just return "1", apparently to signify "yes
it's a spill/load". This patch alters the hooks to correctly return the
store-source and load-destination registers:

This is really useful for debug-info as it helps follow variable values
as they move on/off the stack. There should be no codegen changes: the only
other users of these PostFE target hooks are MachineInstr::getRestoreSize
and MachineInstr::getSpillSize, which don't attempt to interpret the
returned register location.

While we're here, delete the (InstrRef) LiveDebugValues heuristic that
tries to find the spill source register by looking for a killed reg -- we
should be able to rely on the target hooks for that. This involves
temporarily turning off an InstrRef LiveDebugValues test on aarch64
(patch to re-enable it is in D104521).

Differential Revision: https://reviews.llvm.org/D105428
2021-07-09 18:12:30 +01:00
Sylvestre Ledru 0ac7532cc1 M68k: adjust the usage of ArgInfo after change 9b057f647d70fc958d4a1a7a00e2deba65
Fails with:

```

/build/llvm-toolchain-snapshot-13~++20210709092633+88326bbce38c/llvm/lib/Target/M68k/GlSel/M68kCallLowering.cpp: In member function 'virtual bool llvm::M68kCallLowering::lowerReturn(llvm::MachineIRBuilder&, const llvm::Value*, llvm::ArrayRef<llvm::Register>, llvm::FunctionLoweringInfo&, llvm::Register) const':
/build/llvm-toolchain-snapshot-13~++20210709092633+88326bbce38c/llvm/lib/Target/M68k/GlSel/M68kCallLowering.cpp:71:42: error: no matching function for call to 'llvm::CallLowering::ArgInfo::ArgInfo(<brace-enclosed initializer list>)'
     ArgInfo OrigArg{VRegs, Val->getType()};
```

Differential Revision: https://reviews.llvm.org/D105689
2021-07-09 18:56:49 +02:00
zhijian 841077a7e9 [AIX][XCOFF] Use bit order of has_vec and longtbtable bits as defined in AIX header debug.h
Summary:

  The bit order of the has_vec and longtbtable bits in the traceback table generated by the XL compiler flipped at some point after v12.1. This is different from the definition in the AIX header debug.h. The change in the XL compiler that caused the deviation from the OS header definition was unintentional. Since both orderings are extant and the XL compiler runtime also expects the ordering defined by the OS, we will correct the output from LLVM to match the defined ordering given by the OS (which is also consistent with the Assembler Language Reference). Mitigation for traceback tables encoded with the wrong ordering is required for either ordering.

Reviewers: XingXue, HubertTong
Differential Revision: https://reviews.llvm.org/D105487
2021-07-09 11:06:46 -04:00
Simon Pilgrim 9dbeac16ba [X86] ReplaceNodeResults - fp_to_sint/uint - manually widen v2i32 results to let us add AssertSext/AssertZext
It's proving tricky to move this to the generic legalizer code, so manually insert the v2i32 subvector into v4i32, insert the AssertSext/AssertZext node, then extract the subvector again.

This avoids masks in the truncation/pack code, which means we avoid a PSHUFB in the fp_to_sint/uint code for sub-128 bit types (specific targets can still combine the packs to a pshufb if they have fast variable per-lane shuffles).

This was noticed when I was trying to improve fp_to_sint/uint costs with D103695 (and some targets had very high fp_to_sint costs due to the PSHUFB), so we can then update the fp_to_uint codegen from D89697.
2021-07-09 12:07:33 +01:00
David Green 38c9a4068d [TTI] Remove IsPairwiseForm from getArithmeticReductionCost
This patch removes the IsPairwiseForm flag from the Reduction Cost TTI
hooks, along with some accompanying code for pattern matching reductions
from trees starting at extract elements. IsPairWise is now assumed to be
false, which was the predominant way that the value was used from both
the Loop and SLP vectorizers. Since the adjustments such as D93860, the
SLP vectorizer has not relied upon this distinction between pairwise and
non-pairwise reductions.

This also removes some code that was detecting reductions trees starting
from extract elements inside the costmodel. This case was
double-counting costs though, adding the individual costs on the
individual instruction _and_ the total cost of the reduction. Removing
it changes the costs in llvm/test/Analysis/CostModel/X86/reduction.ll to
not double count. The cost of reduction intrinsics is still tested
through the various tests in
llvm/test/Analysis/CostModel/X86/reduce-xyz.ll.

Differential Revision: https://reviews.llvm.org/D105484
2021-07-09 11:51:16 +01:00
Kai Luo 55bd12d4b7 [PowerPC] Remove implicit use register after transformToImmFormFedByLI()
When the instruction has an imm form and is fed by an LI, we can remove the redundant LI instruction.
Below is an example:
```
    renamable $x5 = LI8 2
    renamable $x4 = exact SRD killed renamable $x4, killed renamable $r5, implicit $x5
```

will be converted to:
```
   renamable $x5 = LI8 2
   renamable $x4 = exact RLDICL killed renamable $x4, 62, 2,  implicit killed $x5
```

But when we do this optimization, we forgot to remove the implicit killed $x5.
This bug caused an LNT test-case failure. This patch fixes the above bug.

Reviewed By: #powerpc, shchenz

Differential Revision: https://reviews.llvm.org/D85288
2021-07-09 04:42:54 +00:00
Muhammad Omair Javaid 932e3d9960 Revert "GlobalISel/AArch64: don't optimize away redundant branches at -O0"
This reverts commit 458c230b5e.

This broke LLDB buildbot testcase where breakpoint set at start of loop
failed to hit. https://lab.llvm.org/buildbot/#/builders/96/builds/9404

https://github.com/llvm/llvm-project/blob/main/lldb/test/API/commands/process/attach/main.cpp#L15

Differential Revision: https://reviews.llvm.org/D105238
2021-07-09 08:23:36 +05:00
Stanislav Mekhanoshin e5b0fe1b83 [AMDGPU] Mark more SOP instructions as rematerializable
The rest of the SOP instructions implicitly set SCC and are not
suitable for rematerialization.

Differential Revision: https://reviews.llvm.org/D105670
2021-07-08 16:00:45 -07:00
Craig Topper 631516301e [ARM] Pass 2 instead of 0 to PHINode::Create in MVEGatherScatterLowering. NFC
This parameter controls how much space is reserved for incoming
values. There are always going to be 2 incoming values in this case.

While there, remove the unused std::vector right below.

Found while looking at porting this code to RISCV.
2021-07-08 15:59:33 -07:00
Thomas Lively 3dd75f5371 [WebAssembly] Scalarize extract_vector_elt of binops
Override the `shouldScalarizeBinop` target lowering hook using the same
implementation used in the x86 backend. This causes `extract_vector_elt`s of
vector binary ops to be scalarized if the scalarized version would be supported.

Differential Revision: https://reviews.llvm.org/D105646
2021-07-08 14:31:53 -07:00
Nikita Popov 9e225a2a71 [AMDGPU] Simplify GEP construction (NFC)
Noticed while making a related change. This code was doing
something really peculiar: Creating an APInt by parsing a string.
And then creating a SmallVector with one element to create the
GEP.

Instead create the APInt from integers and directly pass the single
index to GetElementPtrInst::Create().
2021-07-08 21:21:43 +02:00
Nikita Popov cfb94212d4 [AMDGPU] Pass explicit GEP type in printf transform (NFC)
This code is working on an i8*. Avoid nullptr element type in
preparation for removing support.
2021-07-08 21:21:43 +02:00
Nikita Popov b5a7da4391 [NVPTX] Pass explicit GEP type (NFC)
Use source element type of original GEP, as we're just changing
the address space.
2021-07-08 21:21:43 +02:00
Alexey Bataev 0d74fd3fdf [SLP][COST][X86]Improve cost model for masked gather.
Revived D101297 in its original form + added some changes in X86
legalization checking for masked gathers.

This solution is the most stable and the most correct one. We have to
check the legality before trying to build the masked gather in SLP.
Without this check we have an incorrect cost (for SLP) in case the masked
gather is not legal or is slower than a plain gather, and we're missing some
vectorization opportunities.

This can be fixed in the cost model, but in this case we need to add
special checks for the cost of GEPs for ScatterVectorize node, add
special check for small trees, etc., i.e. there are a lot of corner
cases here and there, which increase the code base and make it harder to
maintain the code.

> Can't we rely on cost model to deal with this? This can be profitable for futher vectorization, when we can start from such gather loads as seed.

The question from D101297. Actually, no, it can't: a simple
gather may give us a better result, especially after we started
vectorization of insertelements. Plus, like I said before, the cost for
non-legal masked gathers leads to missed vectorization opportunities.

Differential Revision: https://reviews.llvm.org/D105042
2021-07-08 11:53:30 -07:00
Craig Topper 6dd94cbff5 [ARM] Use matchSimpleRecurrence to simplify some code in MVEGatherScatterLowering. NFCI
Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D105262
2021-07-08 11:42:56 -07:00
Matt Arsenault 43f25e61ce Mips/GlobalISel: Remove custom splitToValueTypes 2021-07-08 13:39:06 -04:00
Matt Arsenault 9b057f647d GlobalISel: Track original argument index in ArgInfo
SelectionDAG's equivalents in ISD::InputArg/OutputArg track the
original argument index. Mips relies on this, and its currently
reinventing its own parallel CallLowering infrastructure which tracks
these indexes on the side. Add this to help move towards deleting the
custom mips handling.
2021-07-08 13:39:02 -04:00
Matt Arsenault 2f9504aa41 Mips/GlobalISel: Use correct callee calling convention
This was using the convention from the calling function.
2021-07-08 13:38:57 -04:00
Simon Pilgrim 8ef67fa9d2 [CostModel][X86] Account for older SSE targets with slow fp->int conversions
Both the conversion cost and the xmm->gpr transfer cost tend to be a lot higher on early SSE targets
2021-07-08 18:08:24 +01:00
Stanislav Mekhanoshin 74a5760d35 [AMDGPU] Set LoopInfo as preserved by SIAnnotateControlFlow
The pass does not change loops, it just adds calls.

Differential Revision: https://reviews.llvm.org/D105583
2021-07-08 09:34:43 -07:00
Jeremy Morse 63cc251eb9 [DebugInfo][InstrRef][4/4] Support DBG_INSTR_REF through all backend passes
This is a cleanup patch -- we're now able to support all flavours of
variable location in instruction referencing mode. This patch updates
various tests for debug instructions to be broader: numerous code paths
try to ignore debug instructions, and they now have to ignore the
additional DBG_PHI and DBG_INSTR_REFs that we can generate.

A small amount of rework happens for LiveDebugVariables: as we don't need
to track live intervals through regalloc any more, we can get away with
unlinking debug instructions before regalloc, then re-inserting them after.
Note that this isn't (yet) true of DBG_VALUE_LISTs, they still have to go
through live interval tracking.

In SelectionDAG, add a helper lambda that emits half-formed DBG_INSTR_REFs
for arguments in instr-ref mode, DBG_VALUE otherwise. This is one of the
final locations where DBG_VALUEs are emitted for vreg arguments.

X86InstrInfo now un-sets the debug instr number on SUB instructions that
get mutated into CMP instructions. As the instruction no longer computes a
subtraction, we can't use it for variable locations.

Differential Revision: https://reviews.llvm.org/D88898
2021-07-08 16:42:24 +01:00
Michael Liao cc92833f8a [amdgpu] Remove the GlobalDCE pass prior to the internalization pass.
- In [D98783](https://reviews.llvm.org/D98783), an extra GlobalDCE pass
  is inserted before the internalization pass to ensure a global
  variable without users could be internalized even if there are dead
  users. Instead of inserting a dedicated optimization pass, the
  dead user checking, i.e. 'use_empty()', should be preceded by
  constant dead user removal to ensure an accurate result.

Differential Revision: https://reviews.llvm.org/D105590
2021-07-08 10:25:58 -04:00
Bradley Smith 026bb84bcd [AArch64][SVE] Add ISel patterns for floating point compare with zero instructions
Additionally, lower the floating point compare SVE intrinsics to
SETCC_MERGE_ZERO ISD nodes to avoid duplicating ISel patterns.

Differential Revision: https://reviews.llvm.org/D105486
2021-07-08 10:46:12 +00:00
Sebastian Neubauer 9ced1e44ad [AMDGPU] Fix typo 2021-07-08 10:07:33 +02:00
Thomas Lively f8c5a4c670 [WebAssembly] Optimize out shift masks
WebAssembly's shift instructions implicitly mask the shift count, so optimize
out redundant explicit masks of the shift count. For vector shifts, this
currently only works if the mask is applied before splatting the shift count,
but this should be addressed in a future commit. Resolves PR49655.
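
A small C++ sketch (not from the patch) of the kind of redundant scalar mask that can now be folded away:

```
unsigned shiftLeft(unsigned x, unsigned amount) {
  // On wasm32, i32.shl already masks the shift count modulo 32, so the
  // explicit "& 31" below is redundant and can be removed by the backend.
  return x << (amount & 31);
}
```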

Differential Revision: https://reviews.llvm.org/D105600
2021-07-07 23:14:31 -07:00
Patrick Holland d38b9f1f31 Revert "[MCA] [AMDGPU] Adding an implementation to AMDGPUCustomBehaviour for handling s_waitcnt instructions."
Build failures when building with shared libraries. Reverting until I can fix.

Differential Revision: https://reviews.llvm.org/D104730
2021-07-07 20:48:42 -07:00
Qiu Chaofan a22ecb4508 [PowerPC] Fix i64 to vector lowering on big endian
Lowering for scalar to vector would be skipped if the current subtarget is big
endian and the scalar is 64 bits or larger. However, there's an issue in the
implementation where SToVRHS may refer to SToVLHS's scalar size if SToVLHS is
present, which leads to a crash.

Reviewed By: nemanjai, shchenz

Differential Revision: https://reviews.llvm.org/D105094
2021-07-08 11:05:09 +08:00
Stanislav Mekhanoshin 0fdb25cd95 [AMDGPU] Disable garbage collection passes
Differential Revision: https://reviews.llvm.org/D105593
2021-07-07 15:47:57 -07:00
Patrick Holland af3baf1761 [MCA] [AMDGPU] Adding an implementation to AMDGPUCustomBehaviour for handling s_waitcnt instructions.
This commit also makes some slight changes to the scheduling model for AMDGPU to set the RetireOOO flag for all scheduling classes.

This flag is only used by llvm-mca and allows instructions to retire out of order.

See the differential link below for a deeper explanation of everything.

Differential Revision: https://reviews.llvm.org/D104730
2021-07-07 14:17:54 -07:00
Adrian Prantl 458c230b5e GlobalISel/AArch64: don't optimize away redundant branches at -O0
This patch prevents GlobalISel from optimizing out redundant branch
instructions when compiling without optimizations.

The motivating example is code like the following common pattern in
Swift, where users expect to be able to set a breakpoint on the early
exit:

public func f(b: Bool) {
  guard b else {
    return // I would like to set a breakpoint here.
  }
  ...
}

The patch modifies two places in GlobalISel: The first one is in
IRTranslator.cpp where the removal of redundant branches is made
conditional on the optimization level. The second one is in
AArch64InstructionSelector.cpp where an -O0 *only* optimization is
being removed.

Disabling these optimizations increases code size at -O0 by
~8%. However, doing so improves debuggability, and debug builds are
the primary reason why developers compile without optimizations. We
thus concluded that this is the right trade-off.

rdar://79515454

Differential Revision: https://reviews.llvm.org/D105238
2021-07-07 12:51:55 -07:00
Nemanja Ivanovic 6a06dbafa1 [PowerPC] Disable permuted SCALAR_TO_VECTOR on LE without direct moves
There are some patterns involving the permuted scalar to vector node
for which we don't have patterns without direct moves on little endian
subtargets. This causes selection errors. While we can of course add
the missing patterns, any additional effort to make this work is not
useful since there is no support for any CPU that can run in
little endian mode and does not support direct moves.
2021-07-07 13:50:49 -05:00
Simon Pilgrim ded8866f4a [X86][Atom] Fix vector fp<->int resource/throughputs
Match what's documented in the Intel AOM - almost all the conversion instructions require BOTH ports (apart from the MMX cvtpi2ps/cvtpi2ps instructions which we already override) - this was being incorrectly modelled as EITHER port.

Now that we can use in-order models in llvm-mca, the atom model is a good "worst case scenario" analysis for x86.
2021-07-07 16:52:34 +01:00
Irina Dobrescu 5888a194c1 [AArch64][GlobalISel] Lower vector types for min/max
Differential Revision: https://reviews.llvm.org/D105433
2021-07-07 15:34:03 +01:00
Zarko Todorovski ee6ca9c7df [AIX] Use VSSRC/VSFRC Register classes for f32/f64 callee arguments on P8 and above
Adding usage of VSSRC and VSFRC when adding the live in registers on AIX.
This matches the behaviour of the rest of PPC Subtargets.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D104396
2021-07-07 09:18:20 -04:00
Simon Pilgrim 4c7e9a3852 [CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports.
Update costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 13:58:27 +01:00
Simon Pilgrim a7da0296a6 [CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca reports.
Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 12:03:45 +01:00
Jay Foad ce098ccc1c [AMDGPU] Simplify tablegen files. NFC.
There is no need to cast records to strings before comparing them.
2021-07-07 09:19:23 +01:00
Stanislav Mekhanoshin b16400449f [AMDGPU] isPassEnabled() helper to check cl::opt and OptLevel
We have several checks for both cl::opt and OptLevel over our
pass config, although these checks do not work properly if the
default value of a cl::opt is false. Create a helper to
use instead that handles this properly. NFC for now.

Differential Revision: https://reviews.llvm.org/D105517
2021-07-06 21:53:35 -07:00
Nemanja Ivanovic 3553698de7 [PowerPC] Re-enable combine for i64 BSWAP on targets without LDBRX
The combine was disabled in 4e22c7265d as it caused failures in
the ppc64be-multistage (bootstrap) bot.
It turns out that the combine did not correctly update the MMO for
the high load which caused aliased stores to be reported as unaliased.
This patch fixes that problem and re-enables the combine.
2021-07-06 20:42:01 -05:00
Eli Friedman 56b3e9edc4 [AArch64] Sync isDef32 to the current x86 version.
We should probably come up with some better way to do this, but let's
make sure to catch known issues for now.
2021-07-06 17:05:01 -07:00
Stanislav Mekhanoshin a0ab45799b [AMDGPU] Move atomic expand past infer address spaces
There are cases where infer address spaces pass cannot yet
infer an address space in the opt pipeline and then in the
llc pipeline it runs too late for atomic expand pass to
benefit from a specific address space.

Move atomic expand pass past the infer address spaces.

Fixes: SWDEV-293410

Differential Revision: https://reviews.llvm.org/D105511
2021-07-06 15:53:32 -07:00
Stanislav Mekhanoshin 5915d33874 [AMDGPU] Do not run IR optimizations at -O0
Differential Revision: https://reviews.llvm.org/D105515
2021-07-06 15:29:52 -07:00
Stanislav Mekhanoshin aff66b7eef [AMDGPU] Fix pass name of AMDGPULowerKernelAttributes. NFC.
This was obviously copy-pasted.
2021-07-06 15:03:31 -07:00
Krzysztof Parzyszek 94e01d579c [Hexagon] Generate trap/undef if misaligned access is detected
This applies to memory accesses to (compile-time) constant addresses
(such as memory-mapped registers). Currently when a misaligned access
to such an address is detected, a fatal error is reported. This change
will emit a remark instead, and compilation will continue with a trap
(and, for loads, "undef") emitted.

This fixes https://llvm.org/PR50838.
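
A hypothetical C++ example (the address and names are invented for illustration) of the kind of access this affects:

```
#include <cstdint>

// Hypothetical memory-mapped register at a compile-time constant address
// that is not sufficiently aligned for a 32-bit access.
volatile int32_t *const Reg = reinterpret_cast<volatile int32_t *>(0x40000002);

int32_t readReg() {
  // Previously a fatal error at compile time; now a remark is emitted and
  // the load is replaced by a trap with an "undef" result.
  return *Reg;
}
```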

Differential Revision: https://reviews.llvm.org/D50524
2021-07-06 14:52:23 -05:00
Craig Topper 12d51f95fe [RISCV] Implement lround*/llround*/lrint*/llrint* with fcvt instruction with -fno-math-errno
These are fp->int conversions using either RMM or dynamic rounding modes.

The lround and lrint opcodes have a return type of either i32 or
i64 depending on sizeof(long) in the frontend which should follow
xlen. llround/llrint should always return i64 so we'll need a libcall
for those on rv32.

The frontend will only emit the intrinsics if -fno-math-errno is in
effect otherwise a libcall will be emitted which will not use
these ISD opcodes.

gcc also does this optimization.
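
An illustrative C++ snippet (not part of the patch), assuming an rv64 target and -fno-math-errno:

```
#include <cmath>

// With -fno-math-errno these can now lower to a single fcvt-style
// conversion with the appropriate rounding mode instead of a libcall.
long roundToLong(double x) { return std::lround(x); }
long long roundToLongLong(double x) { return std::llround(x); }
long rintToLong(double x) { return std::lrint(x); }
```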

Reviewed By: arcbbb

Differential Revision: https://reviews.llvm.org/D105206
2021-07-06 11:43:22 -07:00
Jonas Paulsson 458eac2573 [SystemZ] Support the 'N' code for the odd register in inline-asm.
The odd register of a (128 bit) register pair is accessed by using the 'N'
code on an inline assembly operand.

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D105502
2021-07-06 19:46:49 +02:00
Craig Topper 2b5e53111a [RISCV] Add support for matching vwmul(u) and vwmacc(u) from fixed vectors.
This adds a DAG combine to detect sext/zext inputs and emit a
new ISD opcode. The extends will either be removed or replaced
with narrower extends.

Isel patterns are used to match add and widening mul to vwmacc
similar to the recently added vmacc patterns.

There's still some work to be done to match vmulsu.
We should also rewrite splats that were extended as scalars and
then splatted.
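
A C++ sketch of the kind of source that exposes the pattern (an assumption, not taken from the patch), when vectorized with fixed-length vectors:

```
#include <cstdint>

// Widening multiply-accumulate: the sign-extends of the i16 inputs feeding
// the i32 multiply-add are what the new DAG combine recognizes.
void wideningMacc(int32_t *acc, const int16_t *a, const int16_t *b, int n) {
  for (int i = 0; i < n; ++i)
    acc[i] += int32_t(a[i]) * int32_t(b[i]);
}
```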

Reviewed By: arcbbb

Differential Revision: https://reviews.llvm.org/D104802
2021-07-06 10:24:31 -07:00
Simon Pilgrim b298308ba2 [CostModel][X86] fptosi/fptoui to i8/i16 are truncated from fptosi to i32
Provide a generic fallback that performs the fptosi to i32 types, then truncates to sub-i32 scalars.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.
2021-07-06 17:28:03 +01:00
Jonas Paulsson 37a92f3b03 [SystemZ] Generate XC loop for memset 0 of variable length.
Benchmarking has shown that it is worthwhile to implement a variable length
memset of 0 with XC (exclusive or) like gcc does, instead of using a libcall.

This requires the use of the EXecute Relative Long (EXRL) instruction which
can now be done in a framework that can also be used with other target
instructions (not just XC).
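
An illustrative C++ example (not from the patch) of the code shape this targets:

```
#include <cstring>

// A variable-length memset of zero: instead of a memset libcall the backend
// can now emit an XC loop built around the EXRL instruction.
void clearBuffer(char *buf, std::size_t len) { std::memset(buf, 0, len); }
```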

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D103865
2021-07-06 18:07:31 +02:00
Bradley Smith 5ab9000fbb [AArch64][SVE] Fix selection failures for scalable MLOAD nodes with passthru
Differential Revision: https://reviews.llvm.org/D105348
2021-07-06 14:17:23 +00:00
Simon Pilgrim 6f3f9535fc [CostModel][X86] i8/i16 sitofp/uitofp are sext/zext to i32 for sitofp
Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions.

These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.

We get the extension for free for non-vector loads.
2021-07-06 13:58:52 +01:00
Kerry McLaughlin a7512401e5 [LV] Prevent vectorization with unsupported element types.
This patch adds a TTI function, isElementTypeLegalForScalableVector, to query
whether it is possible to vectorize a given element type. This is called by
isLegalToVectorizeInstTypesForScalable to reject scalable vectorization if
any of the instruction types in the loop are unsupported, e.g:

  void foo(__int128_t* ptr, int N) {
    #pragma clang loop vectorize_width(4, scalable)
    for (int i = 0; i < N; ++i)
      ptr[i] = ptr[i] + 42;
  }

This example currently crashes if we attempt to vectorize since i128 is not a
supported type for scalable vectorization.

Reviewed By: sdesmalen, david-arm

Differential Revision: https://reviews.llvm.org/D102253
2021-07-06 13:06:21 +01:00
Peter Waller c5dfee44b9 [CodeGen][AArch64][SVE] Use ld1r[bhsd] for vector splat from memory
This avoids the use of the vector unit for copying from scalar to
vector. There is an extra ptrue instruction, but a predicate register
with the ptrue pattern populated is likely to be free in the context of
real code.

Tests were generated from a template to cover the axes mentioned at the
top of the test file.
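
A rough C++ sketch (assumed shape, not from the patch) where a splat of a loaded scalar appears:

```
#include <cstdint>

// The scalar loaded through 'addend' is broadcast across each vector of
// 'dst'; with this change the splat can be a single ld1r-style
// load-and-broadcast rather than a scalar load plus a copy into the
// vector unit.
void addSplat(int32_t *dst, const int32_t *src, const int32_t *addend, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i] + *addend;
}
```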

Co-authored-by: Francesco Petrogalli <francesco.petrogalli@arm.com>

Differential Revision: https://reviews.llvm.org/D103170
2021-07-06 12:03:54 +00:00
Jay Foad c9d747e9cd [AMDGPU] Remove outdated comment and tidy up. NFC.
This was left over from D94746.
2021-07-06 11:29:36 +01:00
Sebastian Neubauer db646de3ee [AMDGPU] Set optional PAL metadata
Set informational fields in the .shader_functions table.

Also correct the documentation, .scratch_memory_size and .lds_size are
integers.

Differential Revision: https://reviews.llvm.org/D105116
2021-07-06 11:58:00 +02:00
Albion Fung 7d10dd60ce [PowerPC] Implement Load and Reserve and Store Conditional Builtins
This patch implements the load and reserve and store conditional
builtins for the PowerPC target, in order to have feature parity with
xlC on AIX.

Differential revision: https://reviews.llvm.org/D105236
2021-07-05 21:35:41 -05:00
David Green a77e2d196c [ARM] Fix arm.mve.pred.v2i range upper limit
The range metadata specifies a half-open range, so our top limit was off
by one.
2021-07-05 21:06:30 +01:00
Sushma Unnibhavi 086370faee [M68k][GloballSel] Lower outgoing return values in IRTranslator
Implementation of lowerReturn in the IRTranslator for the M68k backend.

Differential Revision: https://reviews.llvm.org/D105332
2021-07-05 11:39:09 -07:00
Tiehu Zhang d4ed965b2d [AArch64ISelDAGToDAG] Fix ORRWrs/ORRXrs usefulbits calculation bug
For the following case:

    t8: i32 = or t7, t4
    t10: i32 = ORRWrs t8, t8, TargetConstant:i32<73>

Current code wrongly returns (t8 >> shiftConstant) as the
UsefulBits of t8, which in fact is (t8 | (t8 >> shiftConstant)).

Reviewed by: sdesmalen, mdchen

Differential Revision: https://reviews.llvm.org/D102759
2021-07-06 00:38:42 +08:00
Paul Walker 88522455c0 Fix typo in help text for -aarch64-enable-branch-targets. 2021-07-05 16:15:40 +01:00
Caroline Concatto a2c5c56055 [AArch64][CostModel] Add cost model for experimental.vector.splice
This patch adds a new ShuffleKind, SK_Splice, and then handles the cost in
getShuffleCost, as in experimental.vector.reverse.

Differential Revision: https://reviews.llvm.org/D104630
2021-07-05 14:30:24 +01:00
Wang, Pengfei 9ab99f773f [X86] Twist shuffle mask when fold HOP(SHUFFLE(X,Y),SHUFFLE(X,Y)) -> SHUFFLE(HOP(X,Y))
This patch fixes PR50823.

The shuffle mask needs to be twisted twice to obtain the correct one, due to the difference between the inner and outer HOPs.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104903
2021-07-05 21:29:42 +08:00
Simon Pilgrim 5db826e4ce [CostModel][X86] Handle costs for insert/extractelement with non-immediate indices via stack
Determine the insert/extractelement costs when performing this as a sequence of aliased loads+stores via the stack.
2021-07-05 13:26:53 +01:00
Simon Pilgrim 65e4240fa1 [CostModel][X86] Adjust i32/i64 to f32/f64 scalar based on llvm-mca reports (+ Agner).
Older SSE targets have slower gpr->fpu scalar conversions - we also need to account for uitofp i32 -> f32/f64 being lowered as sitofp i64 -> f32/f64
2021-07-05 13:26:53 +01:00
Bradley Smith cc273983f7 [AArch64][SVE] Improve fixed length codegen for common vector shuffle case
Improve codegen when lowering the common vector shuffle case from the
vectorizer (op1[last]:op2[0:last-1]). This patch only handles this
common case as it is difficult to handle this more generally when using
fixed length vectors, due to being unable to use the SVE ext instruction.

Differential Revision: https://reviews.llvm.org/D105289
2021-07-05 12:09:27 +01:00
Sjoerd Meijer ee752134ac [AArch64] Cost-model i8 vector loads/stores
Loads of <4 x i8> vectors were modeled as extremely expensive. And while we
don't have a load instruction that supports this, it isn't that expensive to
create a vector of i8 elements. The codegen for this was fixed/optimised in
D105110. This now tweaks the cost model and enables SLP vectorisation of my
motivating case loadi8.ll.

Differential Revision: https://reviews.llvm.org/D103629
2021-07-05 11:25:10 +01:00
David Stuttard b8173c3178 [AMDGPU] Stop mulhi from doing 24 bit mul for uniform values
Added support to check if architecture supports s_mulhi which is used as part of
the decision whether or not to use valu 24 bit mul (if the mulhi gets
transformed to a valu op anyway, then may as well use it).

This is an extension of the work in D97063

Differential Revision: https://reviews.llvm.org/D103321

Change-Id: I80b1323de640a52623d69ac005a97d06a5d42a14
2021-07-05 10:33:23 +01:00
Craig Topper 21a1bcbd4d [RISCV] Pass FeatureBitset by reference rather than by value. NFCI
FeatureBitset is 4 64-bit values in an array. It's better passed by
reference rather than copying it.

I may be adding FeatureBitset as an argument to another function
and noticed this while working on that.
2021-07-04 23:11:40 -07:00
Nikita Popov a213f735d8 [IR] Deprecate GetElementPtrInst::CreateInBounds without element type
This API is not compatible with opaque pointers, the method
accepting an explicit pointer element type should be used instead.

Thankfully there were few in-tree users. The BPF case still ends
up using the pointer element type for now and needs something like
D105407 to avoid doing so.
2021-07-04 16:49:30 +02:00
Paul Walker 287d39dd5a [NFC] Fix a few whitespace issues and typos. 2021-07-04 11:49:58 +01:00
Nikita Popov fabc17192e [IRBuilder] Add type argument to CreateMaskedLoad/Gather
Same as other CreateLoad-style APIs, these need an explicit type
argument to support opaque pointers.

Differential Revision: https://reviews.llvm.org/D105395
2021-07-04 12:17:59 +02:00
David Green fbc329efbd [AArch64] Add S/UQXTRN tablegen patterns.
This adds simple patterns for signed and unsigned saturating extract
narrow instructions. They combine a min/max/truncate into a single
instruction, providing that the immediates on the min/max are correct
for the saturation type. This is just handled in tablegen with some
extra patterns.

v2i64->v2i32 is not handled here as the min/max nodes are not legal,
making the lowering quite different.
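
A scalar C++ model (a sketch of the per-element computation, not the tablegen patterns themselves):

```
#include <algorithm>
#include <cstdint>

// Signed saturating extract-narrow (i32 -> i16): a min/max clamp to the
// i16 range followed by a truncate, which is the shape the new patterns
// match on vectors.
int16_t saturatingNarrow(int32_t x) {
  return static_cast<int16_t>(std::min<int32_t>(std::max<int32_t>(x, -32768), 32767));
}
```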

Differential Revision: https://reviews.llvm.org/D103263
2021-07-03 07:57:19 +01:00
Kai Luo c063946476 [AIX] Adjust CSR order to avoid breaking ABI regarding traceback
Allocate non-volatile registers in order to be compatible with ABI, regarding gpr_save.

Quoted from https://www.ibm.com/docs/en/ssw_aix_72/assembler/assembler_pdf.pdf page55,
> The preferred method of using GPRs is to use the volatile registers first. Next, use the nonvolatile registers
> in descending order, starting with GPR31.

This patch is based on @jsji 's initial draft.

Tested on test-suite and SPEC, found no degradation.

Reviewed By: jsji, ZarkoCA, xingxue

Differential Revision: https://reviews.llvm.org/D100167
2021-07-03 04:45:26 +00:00
Krzysztof Parzyszek df88c26f0d [OpaquePtr] Add type parameter to emitLoadLinked
Differential Revision: https://reviews.llvm.org/D105353
2021-07-02 13:07:40 -05:00
Krzysztof Parzyszek 81b42ca951 [Hexagon] Handle opaque pointers in vector combine 2021-07-02 13:07:40 -05:00
Amir Ayupov 884bc6a6ed [X86] Modify LOOP*, HLT control flow attributes
Add missing control flow attributes:
- LOOP*: isBranch, isTerminator
- HLT: isTerminator

This helps downstream disassemblers (such as BOLT) reconstruct the control
flow graph more accurately.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D102297
2021-07-02 10:34:29 -07:00
Simon Pilgrim e5fdff1cf8 [X86][SLM] Keep similar scheduler costs types together. NFCI.
The SLM model is inconsistent about where it keeps its 'unsupported' schedule classes - better to keep them close to similar classes.

I'm not sure why some ymm classes are defined and others are unsupported though (but I haven't altered them) - the only SLM-like CPU supporting any ymm is KNL and that currently uses the HSW model.
2021-07-02 14:50:24 +01:00
Simon Pilgrim d867634fbd [CostModel][X86] Update comment describing source of costs - we now use llvm-mca more than IACA 2021-07-02 14:29:32 +01:00
Simon Pilgrim d181fd918d [CostModel][X86] Drop some hard coded fp<->int scalarization costs
Scalarization costs handling is a lot better now, and the hard coded costs were higher than the worst case numbers from the script in D103695
2021-07-02 14:29:32 +01:00
Simon Pilgrim 2aecffcd40 [CostModel][X86] Find AVX conversion costs using legalized types if custom types didn't match
Building on rG2a1ef8784ad9a, fallback to attempting to match against legalized types like we do for SSE targets.
2021-07-02 13:49:31 +01:00
Simon Pilgrim cdca1785d3 [CostModel][X86] Adjust uitofp(vXi64) SSE/AVX legalized costs based on llvm-mca reports.
Update v4i64 -> v4f32/v4f64 uitofp costs based on the worst case costs from the script in D103695.

Fixes a few regressions before we start adding AVX costs for legalized types.
2021-07-02 13:09:00 +01:00
Florian Hahn 1a248233a5
[AArch64] Use custom lowering for fp16 vector copysign.
The custom copysign lowering already supports fp16. Use it.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D105277
2021-07-02 11:15:30 +01:00
Roman Lebedev c2c0d3ea89
Revert "[WebAssembly] Implementation of global.get/set for reftypes in LLVM IR"
This reverts commit 4facbf213c.

```
********************
FAIL: LLVM :: CodeGen/WebAssembly/funcref-call.ll (44466 of 44468)
******************** TEST 'LLVM :: CodeGen/WebAssembly/funcref-call.ll' FAILED ********************
Script:
--
: 'RUN: at line 1';   /builddirs/llvm-project/build-Clang12/bin/llc < /repositories/llvm-project/llvm/test/CodeGen/WebAssembly/funcref-call.ll --mtriple=wasm32-unknown-unknown -asm-verbose=false -mattr=+reference-types | /builddirs/llvm-project/build-Clang12/bin/FileCheck /repositories/llvm-project/llvm/test/CodeGen/WebAssembly/funcref-call.ll
--
Exit Code: 2

Command Output (stderr):
--
llc: /repositories/llvm-project/llvm/include/llvm/Support/LowLevelTypeImpl.h:44: static llvm::LLT llvm::LLT::scalar(unsigned int): Assertion `SizeInBits > 0 && "invalid scalar size"' failed.

```
2021-07-02 11:49:51 +03:00
Paulo Matos 4facbf213c [WebAssembly] Implementation of global.get/set for reftypes in LLVM IR
Reland of 31859f896.

This change implements new DAG nodes GLOBAL_GET/GLOBAL_SET, and
lowering methods for load and stores of reference types from IR
globals. Once the lowering creates the new nodes, tablegen pattern
matches those and converts them to Wasm global.get/set.

Differential Revision: https://reviews.llvm.org/D104797
2021-07-02 09:46:28 +02:00
Matt Arsenault 32a73198fc Mips/GlobalISel: Use accurate memory LLTs 2021-07-01 20:08:14 -04:00
Eli Friedman 0176ac9503 [AArch64] Optimize SVE bitcasts of unpacked types.
Target-independent code only knows how to spill to the stack; instead,
use AArch64ISD::REINTERPRET_CAST.

Differential Revision: https://reviews.llvm.org/D104573
2021-07-01 15:35:48 -07:00
David Green 3d48775b89 [ARM] Reassociate BFI
D104868 removed an (incorrect) fold for distributing BFI instructions in
a chain, combining them into a single instruction. BFIs like that are
hard to test, as the patterns are often destroyed before they become
BFIs. But it can come up in places, with chains of BFIs that can be
combined.

This patch adds a replacement, which reassociates BFI instructions with
non-overlapping insertion masks so that low bits are inserted first.
This can end up sorting the nodes so that adjacent inserts are next to
one another, allowing the existing folds to combine into a single BFI.

Differential Revision: https://reviews.llvm.org/D105096
2021-07-01 21:08:13 +01:00
Matt Arsenault 99c7e918b5 GlobalISel: Use LLT in call lowering callbacks
This preserves the memory type so the lowerings can rely on them.
2021-07-01 12:15:54 -04:00
Bradley Smith 2668727929 [SelectionDAG] Implement PromoteIntRes_INSERT_SUBVECTOR
Inserting into a smaller-than-legal scalable vector would result in an
internal compiler error. For example, inserting a <vscale x 4 x i8> into
a <vscale x 8 x i8> (both illegal vector types for SVE) would cause a
crash.

This crash was happening because there was no code to promote (legalise)
the result of an INSERT_SUBVECTOR node.

This patch implements PromoteIntRes_INSERT_SUBVECTOR, which legalises
the ISD node. This is currently done by going through memory. This is
necessary because of the requirement that the SubVec parameter of the
INSERT_SUBVECTOR node must be smaller than the Vec parameter, which
means that INSERT_SUBVECTOR cannot always have legal result/operand
types.

Co-Authored-by: Joe Ellis <joe.ellis@arm.com>

Differential Revision: https://reviews.llvm.org/D102766
2021-07-01 17:05:53 +01:00
Stanislav Mekhanoshin 661577e698 [AMDGPU] Fix immediate sign during V_MOV_B64_PSEUDO expansion
Creating a V_MOV_B32 with zero extended immediate source
prevented conversion to V_BFREV_B32.

Differential Revision: https://reviews.llvm.org/D105235
2021-07-01 09:00:29 -07:00
Irina Dobrescu 71d5b0a757 [AArch64][GlobalISel]Legalise some vector types for min/max
Differential Revision: https://reviews.llvm.org/D105200
2021-07-01 16:29:38 +01:00
Simon Pilgrim 5e5ba14b4d [CostModel][X86] Adjust fp<->int vXi32 SSE legalized costs based on llvm-mca reports.
Building on rG2a1ef8784ad9a, adjust the SSE cost tables to use the legalized types based on the worst case costs from the script in D103695.

To account for different numbers of src/dst legalized type registers we must scale the cost by the maximum of the src/dst, not just the src
2021-07-01 15:34:20 +01:00
Sam Tebbs 24d76419d6 [ARM] Transform a floating-point to fixed-point conversion to a VCVT_fix
Much like fixed-point to floating-point conversion, the converse can
also be transformed into a fixed-point VCVT. This patch transforms
multiplications of floating point numbers by 2^n into a VCVT_fix. The
exception is that a float to fixed conversion with 1 fractional bit
ends up being an FADD (FADD(x, x) emulates FMUL(x, 2)) rather than an FMUL, so there is a special case for that. This patch also moves the code from https://reviews.llvm.org/D103903 into a separate function, as fixed-to-float and float-to-fixed are very similar.
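
A C++ sketch (illustrative only, not from the patch) of the per-element source pattern involved:

```
#include <cstdint>

// Converting to a Q15 fixed-point value: the multiply by 2^15 before the
// float->int conversion is the pattern that can now become a fixed-point
// VCVT with 15 fractional bits.
int32_t toQ15(float x) { return static_cast<int32_t>(x * 32768.0f); }

// With a single fractional bit the scale-by-two shows up as FADD(x, x)
// rather than an FMUL, which is the special case mentioned above.
int32_t toQ1(float x) { return static_cast<int32_t>(x * 2.0f); }
```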

Differential Revision: https://reviews.llvm.org/D104793
2021-07-01 15:10:40 +01:00
Bradley Smith 01b846674d [AArch64][SVE] Add support for fixed length MSCATTER/MGATHER
Since gather lowering can now lower to nodes that may need expansion via
the vector legalizer, do MGATHER lowering via vector legalizer.

Additionally, as part of adding passthru support for fixed typed
gathers, fix passthru support for scalable types.

Depends on D104910

Differential Revision: https://reviews.llvm.org/D104217
2021-07-01 12:13:59 +01:00
Simon Pilgrim 2a1ef8784a [CostModel][X86] getCastInstrCost - attempt to match custom cast/conversion before legalized types.
Move the (SSE-only) generic, legalized type conversion matching after the specific,custom conversion cases, allowing us to properly provide cost overrides.

The next step will be to clean up some of the weird existing costs and then to enable AVX+ legalized costs, which will let us strip out a lot of the cost tables entries.
2021-07-01 12:06:40 +01:00
Jeremy Morse 47c3fe2a22 [DebugInfo][InstrRef][1/4] Support transformations that widen values
Very late in compilation, backends like X86 will perform optimisations like
this:

    $cx = MOV16rm $rax, ...
    ->
    $rcx = MOV64rm $rax, ...

Widening the load from 16 bits to 64 bits. Seeing how the lower 16 bits
remain the same, this doesn't affect execution. However, any debug
instruction reference to the defined operand now refers to a 64 bit value,
not a 16 bit one, which might be unexpected. Elsewhere in codegen, there's
often this pattern:

    CALL64pcrel32 @foo, implicit-def $rax
    %0:gr64 = COPY $rax
    %1:gr32 = COPY %0.sub_32bit

Where we want to refer to the definition of $eax by the call, but don't
want to refer the copies (they don't define values in the way
LiveDebugValues sees it). To solve this, add a subregister field to the
existing "substitutions" facility, so that we can describe a field within
a larger value definition. I would imagine that this would be used most
often when a value is widened, and we need to refer to the original,
narrower definition.

Differential Revision: https://reviews.llvm.org/D88891
2021-07-01 11:19:27 +01:00
Qiu Chaofan 07f0faed11 [NFC][Scheduler] Refactor tryCandidate to return boolean
This patch changes return type of tryCandidate from void to bool:

1. Methods in some targets already follow this convention.
2. This would help if some target wants to re-use generic code.
3. It looks more intuitive if these try-methods return the same type.

We may need to change return type of them from bool to some enum
further, to make it less confusing.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D103951
2021-07-01 14:31:47 +08:00
Jun Ma ae5433945f [AArch64][SVEIntrinsicOpts] Convert cntb/h/w/d to vscale intrinsic or constant.
As is mentioned above

Differential Revision: https://reviews.llvm.org/D104852
2021-07-01 10:09:47 +08:00
Fangrui Song 17858da022 [AArch64] Remove unneeded ExternalSymbolSDNode code for machine constraint "S". NFC
ExternalSymbolSDNode is used for implicitly generated libcalls, but with an
address-taking operation we cannot reference an ExternalSymbolSDNode.
2021-06-30 17:52:56 -07:00
Min-Yih Hsu 557bed31e4 Reapply "[M68k][GloballSel] Formal arguments lowering in IRTranslator"
Implementation of formal arguments lowering in the IRTranslator for the
M68k backend

Differential Revision: https://reviews.llvm.org/D104542
2021-06-30 17:13:45 -07:00
Matt Arsenault 28f2f66200 GlobalISel: Use LLT in memory legality queries
This enables proper lowering of non-byte sized loads. We still aren't
faithfully preserving memory types everywhere, so the legality checks
still only consider the size.
2021-06-30 17:44:13 -04:00
Jonas Paulsson 7aef99351a [MCStreamer] Move emission of attributes section into MCELFStreamer
Enable the emission of a GNU attributes section by reusing the code for
emitting the ARM build attributes section.

The GNU attributes follow the exact same section format as the ARM
BuildAttributes section, so this can be factored out and reused for GNU
attributes generally.

The immediate motivation for this is to emit a GNU attributes section for the
vector ABI on SystemZ (https://reviews.llvm.org/D105067).

Review: Logan Chien, Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D102894
2021-06-30 16:00:27 -05:00
Jon Roelofs a642872476 [GISel] Support llvm.memcpy.inline
Differential revision: https://reviews.llvm.org/D105072
2021-06-30 12:39:05 -07:00
Stanislav Mekhanoshin 381ded345b [AMDGPU] Add S_MOV_B64_IMM_PSEUDO for wide constants
This is to allow 64 bit constant rematerialization. If a constant
is split into two separate moves initializing sub0 and sub1, as it is
now, RA cannot rematerialize a 64 bit register.

This gives 10-20% uplift in a set of huge apps heavily using double
precision math.

Fixes: SWDEV-292645

Differential Revision: https://reviews.llvm.org/D104874
2021-06-30 11:45:38 -07:00
David Green cd76f43b49 [ARM] Set the immediate cost of GEP operands to 0
This prevents constant gep operands from being hoisted by the Constant
Hoisting pass, leaving them to CodegenPrepare which can usually do a
better job at splitting large offsets. This can, in general, improve
performance and decrease codesize, especially for v6m where many
constants have a high cost.
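
An illustrative C++ case (the struct and field names are invented) where large constant GEP offsets appear:

```
#include <cstdint>

// The field accesses below need large constant offsets from the base
// pointer. With a zero immediate cost for GEP operands, ConstantHoisting
// leaves these offsets in place and CodeGenPrepare can split them more
// profitably.
struct Peripheral {
  char padding[4096];
  int32_t status;
  int32_t control;
};

int32_t readBoth(Peripheral *p) { return p->status + p->control; }
```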

Differential Revision: https://reviews.llvm.org/D104877
2021-06-30 19:19:03 +01:00
zhijian 9a9e6189d7 [AIX][XCOFF][BUG-Fixed] need to switch back to the text section after emitting a dummy eh structure
Summary:

In the patch https://reviews.llvm.org/D103651, "[AIX][XCOFF] generate eh_info when vector registers are saved according to the traceback table".

When generating eh_info, we switch to another section; when done, we need to switch back to the text section again.

Reviewers: Jason Liu
Differential Revision: https://reviews.llvm.org/105195
2021-06-30 13:56:37 -04:00
Simon Pilgrim 59fa435ea6 [X86] Canonicalize SGT/UGT compares with constants to use SGE/UGE to reduce the number of EFLAGs reads. (PR48760)
This demonstrates a possible fix for PR48760 - for compares with constants, canonicalize the SGT/UGT condition code to use SGE/UGE which should reduce the number of EFLAGs bits we need to read.

As discussed on PR48760, some EFLAG bits are treated independently which can require additional uops to merge together for certain CMOVcc/SETcc/etc. modes.

I've limited this to cases where the constant increment doesn't result in a larger encoding or additional i64 constant materializations.
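
A scalar C++ sketch (illustrative, not from the patch) of the equivalence relied on:

```
#include <cstdint>

// x > 41 (SGT) is canonicalized to x >= 42 (SGE); the two are equivalent
// whenever the constant can be incremented without overflow or without
// needing a larger immediate encoding.
bool beforeCanonicalization(int32_t x) { return x > 41; }
bool afterCanonicalization(int32_t x) { return x >= 42; }
```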

Differential Revision: https://reviews.llvm.org/D101074
2021-06-30 18:46:50 +01:00
Craig Topper 0f1f92156f [ARM] Fix incorrect assignment of Changed variable in MVEGatherScatterLowering::optimiseOffsets.
I believe this Changed flag should be initialized to false,
otherwise the if (!Changed) is always dead. This doesn't
manifest in a functional issue because the PHINode checks will
fail if nothing changed. They are identical to the earlier
checks that must have already failed to get into this else block.

While there remove an else after return to reduce indentation.

Differential Revision: https://reviews.llvm.org/D105159
2021-06-30 07:52:57 -07:00
Simon Pilgrim 47941d601d [CostModel][X86] Adjust fp<->int vXi32 AVX1+ costs based on llvm-mca reports
Based off the worse case numbers generated by D103695, the AVX1/2/512 sitofp/uitofp/fptosi/fptoui costs were higher than necessary (based off instruction counts instead of actual throughput).

The SSE costs still need further fixes, but I hit an issue with the order in which SSE costs are checked - we need to check CUSTOM costs (with non-legal types) first, and then fallback to LEGALIZED types. I'm looking at this now, and this should let us start thinning out a lot of the duplicates in the costs tables.

Then we can finally start work on vXi64 / vXi16 / vXi8 / vXi1 integers, which should let us look at sub-128-bit vectorization (D103925).
2021-06-30 15:23:34 +01:00
alex-t e585b332e4 [AMDGPU] PHI node cost should not be counted for the size and latency.
Details: https://reviews.llvm.org/D96805 changed the GCNTTIImpl::getCFInstrCost to return 1 for the PHI nodes
  for the TTI::TCK_CodeSize and TTI::TCK_SizeAndLatency. This is incorrect because the value moves that are the
  result of the PHI lowering are inserted into the basic block predecessors - not into the block itself.
  As a result of this change LoopRotate and LoopUnroll were broken because of the incorrect Loop header and loop
  body size/cost estimation.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D105104
2021-06-30 16:11:17 +03:00
Simon Pilgrim fcd0cb3921 Fix MSVC "32-bit shift implicitly converted to 64 bits" warning. 2021-06-30 13:23:53 +01:00
madhur13490 a7ed55f64c [AMDGPU] Simplify getReservedNumSGPRs
This is a followup patch to D103636, where
checking amdgpu-calls and
amdgpu-stack-objects seemed unnecessary. Removing these
checks didn't regress any tests functionally.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D104513
2021-06-30 16:19:39 +05:30
Florian Mayer a24f104645 [MTE] Remove redundant helper function.
Looking at PostDominatorTree::dominates, we can see that it has the same
logic (with the addition of handling Phi nodes - which are not used as inputs in
this pass) as the helper function.

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D105141
2021-06-30 11:11:26 +01:00
Igor Kudrin 657e067bb5 [ARMInstPrinter] Print the target address of a branch instruction
This follows other patches that changed printing immediate values of
branch instructions to target addresses, see D76580 (x86), D76591 (PPC),
D77853 (AArch64).

As observing immediate values might sometimes be useful, they are
printed as comments for branch instructions.

// llvm-objdump -d output (before)
000200b4 <_start>:
   200b4: ff ff ff fa   blx     #-4 <thumb>
000200b8 <thumb>:
   200b8: ff f7 fc ef   blx     #-8 <_start>

// llvm-objdump -d output (after)
000200b4 <_start>:
   200b4: ff ff ff fa   blx     0x200b8 <thumb>         @ imm = #-4
000200b8 <thumb>:
   200b8: ff f7 fc ef   blx     0x200b4 <_start>        @ imm = #-8

// GNU objdump -d.
000200b4 <_start>:
   200b4:       faffffff        blx     200b8 <thumb>
000200b8 <thumb>:
   200b8:       f7ff effc       blx     200b4 <_start>

Differential Revision: https://reviews.llvm.org/D104701
2021-06-30 16:35:28 +07:00
Igor Kudrin 17bcae8906 [ARM][NFC] Remove an unused method
`ARMInstPrinter::printMveAddrModeQOperand()` was added in D62680, but
was never used. It looks like `printT2AddrModeImm8Operand<false>()` is
used instead.

Differential Revision: https://reviews.llvm.org/D105124
2021-06-30 15:55:37 +07:00
Sjoerd Meijer b062fff87a Recommit "[AArch64] Custom lower <4 x i8> loads"
This recommits D104782 including a fix for adding a wrong operand to the new
load node.

Differential Revision: https://reviews.llvm.org/D105110
2021-06-30 09:18:06 +01:00
Tony Tye 7f19aa73c2 [AMDGPU] Update gfx90a memory model support
Update AMDGPU gfx90a memory model to make coarse grain memory allocations
consistent when fine grained system scope atomic acquire and release is
performed.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D105137
2021-06-30 04:05:22 +00:00
Steffen Larsen 3644726a78 [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA and MMA instructions
Adds NVPTX builtins and intrinsics for the CUDA PTX `wmma.load`, `wmma.store`, `wmma.mma`, and `mma` instructions added in PTX 6.5 and 7.0.

PTX ISA description of

  - `wmma.load`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-ld
  - `wmma.store`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-st
  - `wmma.mma`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma
  - `mma`: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma

Overview of `wmma.mma` and `mma` matrix shape/type combinations added with specific PTX versions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape

Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>
Co-Authored-by: Stuart Adams <stuart.adams@codeplay.com>

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D104847
2021-06-29 15:44:07 -07:00
Matt Arsenault 990278d026 CodeGen: Store LLT instead of uint64_t in MachineMemOperand
GlobalISel is relying on regular MachineMemOperands to track all of
the memory properties of accesses. Just the raw byte size is
insufficent to disambiguate all situations. For example, if we need to
split an unaligned extending load, we need to know the number of bits
in the original source value and can't infer it from the result
type. This is also a problem for extending vector loads.

This does decrease the maximum representable size from the full
uint64_t bytes to a maximum of 16-bits. No in tree testcases hit this,
other than places using UINT64_MAX for unknown sizes. This may be an
issue for G_MEMCPY and co., although they can just use unknown size
for large static sizes. This also has potential for backend abuse by
relying on the type when it really shouldn't be relevant after
selection.

This does not include the necessary MIR printer/parser changes to
represent this.
2021-06-29 17:38:51 -04:00
Craig Topper 3b6dfa381e [RISCV] Protect the SHL/SRA/SRL handlers in LowerOperation against being called for an illegal i32 shift amount.
It seems it is possible for DAG combine to create a shl with an
i64 result type and an i32 shift amount. This is ok before type
legalization since the type don't need to match in SelectionDAG.
This results in type legalization calling LowerOperation to
legalize just the amount. We weren't expecting this so we
asserted for not finding a fixed vector shift.

To fix this, I've added a check for the fixed vector case and
returned SDValue() to get the default type legalizer. I've
factored all shifts together and added a fixed vector specific
handler to avoid repeating similar code for each in
LowerOperation.

The particular case I found was exposed by D104581, but the bad
shift is created after that patch triggers.
2021-06-29 09:45:13 -07:00
Piotr Sobczak f38a8b54ea [AMDGPU] Fix 224-bit spills
Related to D104622.

Differential Revision: https://reviews.llvm.org/D105109
2021-06-29 17:52:16 +02:00
Dylan Fleming c3d3defd11 [SVE] Added CodeGen support for inserting an element into a predicate vector
Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D104722
2021-06-29 14:55:40 +01:00
Ben Shi c85175c5f6 [AVR] Fix a bug in prologue of ISR
The r1 register should be cleared in prologue of ISR as it is used
as constant zero.

Reviewed By: dylanmckay

Differential Revision: https://reviews.llvm.org/D99467
2021-06-29 21:44:50 +08:00
Tim Northover c82957e792 ARM: fix vacuously true assertion to actually check what it should. NFC. 2021-06-29 14:24:03 +01:00
David Green 371ee32e01 [ARM] Fold extract of ARM_BUILD_VECTOR
This adds a small fold for extract (ARM_BUILD_VECTOR) to fold to the
original node. This can help simplify the resulting codegen in some
cases.

Differential Revision: https://reviews.llvm.org/D104860
2021-06-29 11:03:19 +01:00
Krzysztof Parzyszek 9c5ed8d567 [Hexagon] Add patterns to load i1
This fixes https://llvm.org/PR50853
2021-06-28 12:17:30 -05:00
Sjoerd Meijer 3a7cea2858 Revert "[AArch64] Custom lower <4 x i8> loads"
This reverts commit 51e434fc25 because of a
build bot failure in test-suite::GCC-C-execute-pr60960.test that I need to
investigate.
2021-06-28 17:44:46 +01:00
Jay Foad 75cacc6775 [AMDGPU] Use opName instead of PseudoName in VOP2 multiclasses. NFC.
This is just for consistency with all other instruction multiclasses
that pass around pseudo names as arguments.
2021-06-28 16:46:35 +01:00
David Spickett 558d9e8228 [llvm][ARM] Treat xscale arch as an alias of armv5te
Previously xscale was known to everything apart
from the ELF streamer so we would crash as soon
as you tried to output an object file.

Reviewed By: nickdesaulniers

Differential Revision: https://reviews.llvm.org/D104776
2021-06-28 15:20:24 +00:00
Bradley Smith c089e29aa4 [AArch64][SVE] DAG combine SETCC_MERGE_ZERO of a SETCC_MERGE_ZERO
This helps remove extra comparisons when generating masks for fixed
length masked operations.

Differential Revision: https://reviews.llvm.org/D104910
2021-06-28 15:06:06 +01:00
Brendon Cahoon f9f5d41545 [AMDGPU][GlobalISel] Legalize and select G_SBFX and G_UBFX
Adds legalizer, register bank select, and instruction
select support for G_SBFX and G_UBFX. These opcodes generate
scalar or vector ALU bitfield extract instructions for
AMDGPU. The instructions allow both constant or register
values for the offset and width operands.

The 32-bit scalar version is expanded to a sequence that
combines the offset and width into a single register.

There are no 64-bit vgpr bitfield extract instructions, so the
operations are expanded to a sequence of instructions that
implement the operation. If the width is a constant,
then the 32-bit bitfield extract instructions are used.

Moved the AArch64 specific code for creating G_SBFX to
CombinerHelper.cpp so that it can be used by other targets.
Only bitfield extracts with constant offset and width values
are handled currently.

Differential Revision: https://reviews.llvm.org/D100149
2021-06-28 09:06:44 -04:00
Lucas Prates 88b1135e72 [Aarch64] Adding support for Armv9-A Realm Management Extension
This adds support for Armv9-A's Realm Management Extension, including
three new system registers - MFAR_EL3, GPCCR_EL3 and GPTBR_EL3 - and
four new TLBI instructions.

The reference for the Realm Management Extension can be found at: https://developer.arm.com/documentation/ddi0615/aa.

Based on patches by Victor Campos.

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D104773
2021-06-28 13:45:22 +01:00
David Green a1c0f09a89 [ARM] Add an extra fold for f32 extract(vdup(i32))
This adds another small fold for extract of a vdup, between a i32 and a
f32, converting to a BITCAST. This allows some extra folding to happen,
simplifying the resulting code.

Differential Revision: https://reviews.llvm.org/D104857
2021-06-28 08:54:03 +01:00
Min-Yih Hsu 04242bdca9 Revert "[M68k][GloballSel] Formal arguments lowering in IRTranslator"
This reverts commit 8f43407a07 due to
failure on its associated test.
2021-06-27 23:22:40 -07:00
Sushma Unnibhavi 8f43407a07 [M68k][GloballSel] Formal arguments lowering in IRTranslator
Implementation of formal arguments lowering in the IRTranslator for the
M68k backend

Differential Revision: https://reviews.llvm.org/D104542
2021-06-27 16:13:05 -07:00
Craig Topper 010f0f000f Revert "[RISCV] Use zexti32/sexti32 in srliw/sraiw isel patterns to improve usage of those instructions."
I thought this might help with another optimization I was
thinking about, but I don't think it will. So it just wastes
compile time calling computeKnownBits for no benefit.

This reverts commit 81b2f95971.
2021-06-27 10:33:43 -07:00
Craig Topper 81f6d7c082 [X86] Tighten up some inline assembly constraint handling.
Don't allow vectors to split into GPRs for 'r' and other scalar
constraints. Prevents assertion in getCopyToPartsVector.

Makes PR50907 give a better error instead of crashing.
2021-06-26 22:57:22 -07:00
David Green 41d8149ee9 [ARM] Lower MVETRUNC to stack operations
The MVETRUNC node truncates two wide vectors to a single vector with
narrower elements. This is usually lowered to a series of extract/insert
elements, going via GPR registers. This patch changes that to instead
use a pair of truncating stores and a stack reload. This cuts down the
number of instructions at the expense of some stack space.
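
A scalar C++ model (a sketch of the operation, not the MVE lowering itself):

```
#include <cstdint>

// MVETRUNC in scalar form: two wide <4 x i32> inputs are truncated and
// packed into one <8 x i16> result. The new lowering performs this with two
// truncating stores into a stack slot followed by a single vector reload.
void truncatePair(int16_t out[8], const int32_t lo[4], const int32_t hi[4]) {
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<int16_t>(lo[i]);
    out[i + 4] = static_cast<int16_t>(hi[i]);
  }
}
```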

Differential Revision: https://reviews.llvm.org/D104515
2021-06-26 22:12:57 +01:00
David Green 5955812927 [ARM] Introduce MVETRUNC ISel lowering
Currently, when encountering store(trunc(..)) where the trunc is double
a legal vector length in MVE, we split the node into two different stores
each performing half of the trunc from the wider type. This works well
for efficiently lowering wider than legal types, else the trunc becomes
a series of individual lane moves. Unfortunately this splitting is
currently one of the first combines attempted, so can happen before any
other combines which might be more preferable.

This patch instead introduces the concept of a MVETRUNC ISel node that
the trunc is initially lowered to, to keep it intact as a single item as
opposed to splitting it up. This allows us to push the store(trunc(..))
combine later, allowing other optimisations to potentially happen on the
trunc first. The store(trunc(..)) splitting can then be done later in
the legalisation period if needed, or else fall back to a buildvector as
before.

This can also be used in the future to lower to loads/stores, as opposed
to the more expensive lane extracts/inserts. Some extra combines are
added to keep all the existing tests happy.

Differential Revision: https://reviews.llvm.org/D91921
2021-06-26 22:00:26 +01:00
Craig Topper 81b2f95971 [RISCV] Use zexti32/sexti32 in srliw/sraiw isel patterns to improve usage of those instructions. 2021-06-26 11:57:26 -07:00
David Green 0f83d37a14 [ARM] MVE vabd
This adds MVE lowering for VABDS/VABDU, using the code parted from
AArch64 in D91937.

Differential Revision: https://reviews.llvm.org/D91938
2021-06-26 19:41:32 +01:00
David Green 2887f14639 [ISel] Port AArch64 SABD and UABD to DAGCombine
This ports the AArch64 SABD and UABD over to DAG Combine, where they can be
used by more backends (notably MVE in a follow-up patch). The matching code
has changed very little, just to handle legal operations and types
differently. It selects from (ABS (SUB (EXTEND a), (EXTEND b))), producing
an abds/abdu which is zexted to the original type.
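
A per-element C++ model (a sketch, not the DAG code) of what the unsigned form computes for i8 elements:

```
#include <cstdint>
#include <cstdlib>

// The matched pattern ABS(SUB(EXTEND a, EXTEND b)): extend, subtract, take
// the absolute value, then narrow back to the original element width.
uint8_t absDiffU8(uint8_t a, uint8_t b) {
  int wide = static_cast<int>(a) - static_cast<int>(b);
  return static_cast<uint8_t>(std::abs(wide));
}
```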

Differential Revision: https://reviews.llvm.org/D91937
2021-06-26 19:34:16 +01:00
Jim Lin 779d2b0a42 [RISCV][NFC] Combine the control flow for different RetOp of interrupt function
Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D104838
2021-06-26 17:28:03 +08:00
Craig Topper d4f4a1ba62 [RISCV] Add DAG combine to detect opportunities to replace (i64 (any_extend (i32 X)) with sign_extend.
If type legalization is going to insert a sign_extend for other users
of X and we can fold the sign_extend into ADDW/MULW/SUBW, it is
better to replace the ANY_EXTEND so we don't end up with a separate
ADD/MUL/SUB instruction for the users of the ANY_EXTEND.

I'm only handling setcc uses right now, but there are other
instructions that force sign_extends like ashr.

There are probably other *W instructions we could use in addition
to ADDW/SUBW/MULW.

My motivating case was a loop terminating compare and a phi use
as seen in the new test file.

Reviewed By: asb

Differential Revision: https://reviews.llvm.org/D104581
2021-06-25 23:16:37 -07:00
Eric Astor e074d580b2 [ms] [llvm-ml] Disable C-style comments 2021-06-25 23:09:13 -04:00
Luo, Yuanke 36003c20ad [X86] Selecting fld0 for undefined value in fast ISEL.
When opt-bisect-limit is set on the command line to a value that is less
than the ISel pass number and CurBisectNum expires, the "DAG to DAG" pass
lowers its opt level to O0. However, "processimpdefs" and the "X86 FP
Stackifier" are not stopped by the CurBisectNum expiration, so an undefined
fp0 is generated. This causes a crash in the "X86 FP Stackifier" pass,
because the Stackifier doesn't expect any undefined fp value.

Here is the scenario that causes the compiler crash.

  successors: %bb.26
  liveins: $r14
    ST_FPrr $st0, implicit-def $fpsw, implicit $fpcw
    renamable $rdi = MOV64ri @.str.3.16422
    renamable $rdx = LEA64r %stack.6, 1, $noreg, 0, $noreg
    ADJCALLSTACKDOWN64 0, 0, 0, implicit-def $rsp, implicit-def dead
    $eflags, implicit-def $ssp, implicit $rsp, implicit $ssp
    dead $esi = MOV32r0 implicit-def dead $eflags, implicit-def $rsi
    CALL64pcrel32 @foo, implicit $rsp, implicit $ssp, implicit $rdi,
    implicit $rsi, implicit $rdx, implicit-def dead $fp0
    renamable $xmm0 = MOVSDrm_alt %stack.10, 1, $noreg, 0, $noreg :: (load 8
    from %stack.10)
    ADJCALLSTACKUP64 0, 0, implicit-def $rsp, implicit-def dead $eflags,
    implicit-def $ssp, implicit $rsp, implicit $ssp
    renamable $fp2 = CHS_Fp80 killed undef renamable $fp0, implicit-def
    $fpsw
    JMP_1 %bb.26
The CALL64pcrel32 marks fp0 dead, so llvm frees the stack slot for fp0
and the stack becomes empty. The later instruction CHS_Fp80 uses the
undefined register fp0; the original code assumes there must be a stack
slot for the src register (fp0) without accounting for it being undefined,
so llvm reports an error.

We had some discussion in https://reviews.llvm.org/D104440 and decided to
fix it in fast ISel. The fix is to lower an undefined fp value to a zero
value, so that it relieves the burden on the "X86 FP Stackifier" pass.
Thanks to Craig for the suggestion and the initial patch to fix it.

Differential Revision: https://reviews.llvm.org/D104678
2021-06-26 08:43:09 +08:00
Jon Chesterfield 50ad3478bd Disable ReplaceLDS pass, patch up tests to match
Most tests passed with an extra argument to explicitly enable the pass.
One did not, so it was deleted as part of this change. I can't see why the
codegen would differ between default-on and default-off-but-explicitly-enabled.
The test can be retrieved from the project history.

This would be a revert, but git revert was not clean. Disabling the pass
and leaving it in tree is less likely to cause breakage elsewhere than
patching up the git revert conflicts on unfamiliar code. It'll be landed
without review, as @hsmhsm is believed unavailable at present.

Differential Revision: https://reviews.llvm.org/D104962
2021-06-26 01:36:42 +01:00
Nemanja Ivanovic 4e22c7265d [PowerPC] Disable combine 64-bit bswap(load) without LDBRX
This causes failures on the big endian bootstrap bot.
Disabling this combine temporarily until I can get a proper fix.
2021-06-25 15:11:22 -05:00
Ulrich Weigand b2674670f2 [SystemZ] Add support for .reloc assembler directive
Add support for the .reloc directive along the lines of
other back-ends.

This fixes a regression after https://reviews.llvm.org/D104080
was merged, since that patch presupposed support for .reloc.
2021-06-25 21:51:10 +02:00
Craig Topper 0f3bc00a7d [X86] Simplify part of the isel for X86ISD::FCMP/STRICT_FCMP/STRICT_FCMPS.
We don't need to have the compare output a value and then copy it
to FPSW for use by FNSTSW. Instead we can just have the compare
output Glue and glue the FNSTSW to it. InstrEmitter effectively
performed this optimization when emitting the Machine IR. Doing
it directly simplifies the code and reduces the work in
InstrEmitter. There's no change in the machine IR at the end of
isel before and after this change.
2021-06-25 11:39:01 -07:00
Sander de Smalen b732e6c9a8 Revert "[GlobalISel] NFC: Have LLT::getSizeInBits/Bytes return a TypeSize."
This patch seems to be causing build errors, reverting it for now.

This reverts commit aeab9d9570.
2021-06-25 17:37:16 +01:00
Sander de Smalen aeab9d9570 [GlobalISel] NFC: Have LLT::getSizeInBits/Bytes return a TypeSize.
To reflect that the size may be scalable, a TypeSize is returned
instead of an unsigned. In places where the result is used,
it currently relies on an implicit cast of TypeSize -> uint64_t,
which asserts that the type is not scalable.

This patch is NFC for fixed-width vectors.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D104454
2021-06-25 17:06:50 +01:00
Jay Foad c3cc9d1eb2 [AMDGPU] Removed unused Predicate HasOffset3fBug. NFC.
The predicate definition didn't make sense anyway because it was defined
as being the opposite of what the name suggests.
2021-06-25 16:58:44 +01:00
Krzysztof Parzyszek 8a9ec39bd0 [Hexagon] Convert getTypeAlignment to return Align
Plus some minor related changes of the same nature.
2021-06-25 10:53:14 -05:00
Sander de Smalen c9acd2f32e [GlobalISel] NFC: Change LLT::changeNumElements to LLT::changeElementCount.
Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D104453
2021-06-25 15:54:00 +01:00
Sander de Smalen 968980ef08 [GlobalISel] NFC: Change LLT::scalarOrVector to take ElementCount.
Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D104452
2021-06-25 11:26:16 +01:00
Sjoerd Meijer 51e434fc25 [AArch64] Custom lower <4 x i8> loads
This custom lowers <4 x i8> vector loads using a 32-bit load, followed by 2
SSHLL instructions to extend it to e.g. a <4 x i32> vector. Before, it was
really inefficient and expensive to construct a <4 x i32> for this as 4 byte
loads and 4 moves were used. With this improvement SLP vectorisation might for
example become profitable, see D103629.

Differential Revision: https://reviews.llvm.org/D104782
2021-06-25 09:53:51 +01:00
Qiu Chaofan a08fc1361a [PowerPC] Change VSRpRC allocation order
On PowerPC, VSRpRC represents pairs of even and odd VSX registers,
and VRRC corresponds to the higher 32 VSX registers. In some cases, extra
copies are produced when handling incoming VRRC arguments with VSRpRC.

This patch changes allocation order of VSRpRC to eliminate this kind of
copy.

Stack frame sizes may increase if non-volatile registers are allocated, and
some other vector copies still happen. These need fixing in future changes.

Reviewed By: nemanjai

Differential Revision: https://reviews.llvm.org/D104855
2021-06-25 16:04:41 +08:00
Amara Emerson f9b3840c3d [ARM] Fix crash in chained BFI combine due to incorrectly RAUW'ing a node.
For a bfi chain like:
a = bfi input, x, y
b = bfi a, x', y'

The previous code was RAUW'ing a with x, mutating the second 'b' bfi, and when
SelectionDAG's CSE code ended up deleting it unexpectedly, bad things happened.
There's no need to RAUW in this case because we can just return our newly
created replacement BFI node. It also looked incorrect because it didn't account
for other users of the 'a' bfi.

Since it seems that chains of more than 2 BFI nodes are hard/impossible to
produce without this combine kicking in at some point, I've removed that
functionality since it had no test coverage.

rdar://79095399

Differential Revision: https://reviews.llvm.org/D104868
2021-06-24 23:35:47 -07:00
Fraser Cormack ab1bd25593 [RISCV] Permit larger RVV stacks and stack offsets
This patch teaches the compiler to generate code to handle larger RVV
stack sizes and stack offsets which resolve an amount larger than 2047
vector registers in size.

The previous behaviour was asserting on such large values as it was only
able to materialize the constant by feeding it to the 12-bit immediate
of an `ADDI` instruction. The compiler can now materialize this amount
into a temporary register before continuing with the computation.

A test case for this scenario is included which also checks that the
temporary register used to materialize the amount doesn't require an
additional spill slot over what we're already reserving for RVV code.

Reviewed By: rogfer01

Differential Revision: https://reviews.llvm.org/D104727
2021-06-25 07:17:33 +01:00
Serge Pavlov b36d214bed [X86] Add description of FXAM instruction
Previously this instruction could be used only in the assembler. This change
makes it available to the compiler as well. Scheduling information was copied
from the FTST instruction; hopefully this is a satisfactory approximation.

Differential Revision: https://reviews.llvm.org/D104853
2021-06-25 12:26:51 +07:00
Kai Luo b904574b3d [PowerPC] Move PPCBranchSelector as close to asm printer as possible
Currently, PPCBranchSelector is not immediately preceding asm printer pass. `-debug-pass=Structure` gives
```
      PowerPC Branch Selector
      Contiguously Lay Out Funclets
      StackMap Liveness Analysis
      Live DEBUG_VALUE analysis
      Lazy Machine Block Frequency Analysis
      Machine Optimization Remark Emitter
      Linux PPC Assembly Printer
```
After the patch
```
      Contiguously Lay Out Funclets
      StackMap Liveness Analysis
      Live DEBUG_VALUE analysis
      PowerPC Branch Selector
      Lazy Machine Block Frequency Analysis
      Machine Optimization Remark Emitter
      Linux PPC Assembly Printer
```

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D104762
2021-06-25 02:05:19 +00:00
Nemanja Ivanovic dcccb2f594 [PowerPC] Fix bswap combine for big endian systems
Commit 0464586ac5 added a combine
for a 64-bit load feeding a bswap but the implementation is only
correct for little endian systems.
This fixes it for big endian systems.
2021-06-24 18:04:50 -05:00
Martin Storsjö 42f74e8249 [llvm] Rename StringRef _lower() method calls to _insensitive()
This is a mechanical change. This actually also renames the
similarly named methods in the SmallString class, however these
methods don't seem to be used outside of the llvm subproject, so
this doesn't break building of the rest of the monorepo.
2021-06-25 00:22:01 +03:00
Krzysztof Parzyszek d09218a82e [Hexagon] Opaquify pointer usage in GEP commoning 2021-06-24 16:06:36 -05:00
Nemanja Ivanovic 0464586ac5 [PowerPC] Combine 64-bit bswap(load) without LDBRX
When targeting CPUs that don't have LDBRX, we end up producing code that is
very inefficient and large for this common idiom. This patch just
optimizes it to two 32-bit LWBRX instructions along with a merge.
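
The value-level identity the combine relies on can be sketched in C as follows (a hedged illustration, not the actual DAG combine; bswap32 stands in for what one LWBRX load produces):

```
#include <assert.h>
#include <stdint.h>

/* Stand-in for the byte-reversed value a 32-bit LWBRX load would produce. */
static uint32_t bswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
         ((v << 8) & 0x00ff0000u) | (uint32_t)(v << 24);
}

int main(void) {
  uint64_t x = 0x0102030405060708ULL;
  /* Byte-swap each 32-bit half, then merge with the halves exchanged. */
  uint64_t merged = ((uint64_t)bswap32((uint32_t)x) << 32) |
                    bswap32((uint32_t)(x >> 32));
  assert(merged == 0x0807060504030201ULL);
  return 0;
}
```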

This fixes https://bugs.llvm.org/show_bug.cgi?id=49610

Differential revision: https://reviews.llvm.org/D104836
2021-06-24 15:11:47 -05:00
Aakanksha Patil 3453f3dd46 [AMDGPU] Add gfx1035 target
Differential Revision: https://reviews.llvm.org/D104804
2021-06-24 14:32:41 -04:00
Pablo Barrio 571c8c5263 [AArch64][v8.3A] Avoid inserting implicit landing pads (PACI*SP)
PACI*SP have the advantage that they are in HINT space, meaning
they can be run successfully in hardware without PAuth support -
they will just behave as a NOP. However, PACI*SP are also implicit
landing pads (think of an extra BTI jc). Therefore, they allow
indirect jumps of all kinds into them, potentially inserting new
gadgets. This patch replaces PACI*SP by PACI* LR, SP when
compiling explicitly for hardware with full PAuth support. PACI*
is not in the HINT space, therefore it will fault when run in
hardware without PAuth support, but it is also not a landing pad,
making programs safer in newer HW.

Differential Revision: https://reviews.llvm.org/D101920
2021-06-24 18:24:32 +01:00
Anirudh Prasad 631362665c [AsmParser][SystemZ][z/OS] Support for emitting labels in upper case
- Currently, the emitting of labels in the parsePrimaryExpr function is case sensitive. It just takes the identifier and emits it as-is.
- However, for HLASM the emitting of labels is case independent. We are emitting them in upper case only, to enforce case independence. So we need to ensure that at the time of parsing the label we are emitting it in upper case (in `parseAsHLASMLabel`), but also, when we are processing a PC-relative relocatable expression, we need to ensure we emit it in upper case (in `parsePrimaryExpr`).
- To achieve this, a new MCAsmInfo attribute has been introduced which corresponding targets can override if needed.

Reviewed By: abhina.sreeskantharajan, uweigand

Differential Revision: https://reviews.llvm.org/D104715
2021-06-24 12:50:11 -04:00
David Green 1113e06821 [ARM] Extend narrow values to allow using truncating scatters
As a minor adjustment to the existing lowering of offset scatters, this
extends any smaller-than-legal vectors into full vectors using a zext,
so that the truncating scatters can be used. Due to the way MVE
legalizes the vectors this should be cheap in most situations, and will
prevent the vector from being scalarized.

Differential Revision: https://reviews.llvm.org/D103704
2021-06-24 13:09:11 +01:00
Florian Hahn a54c6fc083
[X86] Exclude invalid element types for bitcast/broadcast folding.
It looks like the fold introduced in 63f3383ece can cause crashes
if the type of the bitcasted value is not a valid vector element type,
like x86_mmx.

To resolve the crash, reject invalid vector element types. The way it is
done in the patch is a bit clunky. Perhaps there's a better way to
check?

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104792
2021-06-24 12:39:01 +01:00
Rosie Sumpter 0c4651f0a8 [CostModel][AArch64] Improve cost model for vector reduction intrinsics
OR, XOR and AND entries are added to the cost table. An extra cost
is added when vector splitting occurs.

This is done to address the issue of a missed SLP vectorization
opportunity due to unreasonably high costs being attributed to the vector
Or reduction (see: https://bugs.llvm.org/show_bug.cgi?id=44593).

Differential Revision: https://reviews.llvm.org/D104538
2021-06-24 12:02:58 +01:00
Simon Pilgrim c4d3eedc7f [X86] Fold nested select_cc to select (cmp*ge/le Cond0, Cond1), LHS, Y)
select (cmpeq Cond0, Cond1), LHS, (select (cmpugt Cond0, Cond1), LHS, Y) --> (select (cmpuge Cond0, Cond1), LHS, Y)
etc,

We already perform this fold in DAGCombiner for MVT::i1 comparison results, but these can still appear after legalization (in x86 case with MVT::i8 results), where we need to be more careful about generating new comparison codes.
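
A quick way to convince oneself the fold is sound is a brute-force check of the scalar equivalent (a hedged sketch; the names and values below are illustrative, not from the patch):

```
#include <assert.h>

int main(void) {
  const int lhs = 111, y = 222;
  for (unsigned c0 = 0; c0 < 16; ++c0)
    for (unsigned c1 = 0; c1 < 16; ++c1) {
      /* select(cmpeq, LHS, select(cmpugt, LHS, Y)) ... */
      int nested = (c0 == c1) ? lhs : ((c0 > c1) ? lhs : y);
      /* ... --> select(cmpuge, LHS, Y) */
      int folded = (c0 >= c1) ? lhs : y;
      assert(nested == folded);
    }
  return 0;
}
```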

Pulled out of D101074 to help address the remaining regressions.

Differential Revision: https://reviews.llvm.org/D104707
2021-06-24 11:27:57 +01:00
Sander de Smalen d5e14ba88c [GlobalISel] NFC: Change LLT::vector to take ElementCount.
This also adds new interfaces for the fixed- and scalable case:
* LLT::fixed_vector
* LLT::scalable_vector

The strategy for migrating to the new interfaces was as follows:
* If the new LLT is a (modified) clone of another LLT, taking the
  same number of elements, then use LLT::vector(OtherTy.getElementCount())
  or if the number of elements is halved/doubled, it uses .divideCoefficientBy(2)
  or operator*. That is because there is no reason to specifically restrict
  the types to 'fixed_vector'.
* If the algorithm works on the number of elements (as unsigned), then
  just use fixed_vector. This will need to be fixed up in the future when
  modifying the algorithm to also work for scalable vectors, and will then
  need additional tests to confirm the behaviour works the same for
  scalable vectors.
* If the test used the `/*Scalable=*/true` flag of LLT::vector, then
  this is replaced by LLT::scalable_vector.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D104451
2021-06-24 11:26:12 +01:00
Fraser Cormack a4729f7f88 [RISCV] Lower RVV vector SELECTs to VSELECTs
This patch optimizes the code generation of vector-type SELECTs (LLVM
select instructions with scalar conditions) by custom-lowering to
VSELECTs (LLVM select instructions with vector conditions) by splatting
the condition to a vector. This avoids the default expansion path which
would either introduce control flow or fully scalarize.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D104772
2021-06-24 10:12:51 +01:00
Carl Ritson 98f48723f2 [AMDGPU] Add 224-bit vector types and link 192-bit types to MVTs
Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D104622
2021-06-24 12:41:22 +09:00
Jon Chesterfield 660cae84c3 Revert "[AMDGPU] [IndirectCalls] Don't propagate attributes to address taken functions and their callees"
This reverts commit 6a3beb1f68.
Test case that triggers an infinite loop before the revert is at
the review for D103138.
2021-06-24 02:33:50 +01:00
Stanislav Mekhanoshin d274d64ef4 [AMDGPU] Check for pointer operand while refining LDS align
Also skips the propagation if alignment is 1.

Differential Revision: https://reviews.llvm.org/D104796
2021-06-23 12:27:55 -07:00
David Green 8cfc080132 [ARM] Limit v6m unrolling with multiple live outs
v6m cores only have a limited number of registers available. Unrolling
can mean we spend more on stack spills and reloads than we save from the
unrolling. This patch adds an extra heuristic to put a limit on the
unroll count for loops with multiple live out values, as measured from
the LCSSA phi nodes.

Differential Revision: https://reviews.llvm.org/D104659
2021-06-23 16:36:37 +01:00
Craig Topper a37cf17834 [RISCV] Add explicit copy to V0 in the masked vmsge(u).vx intrinsic handling.
This is consistent with our other masked vector instructions.
Previously we found cases where not doing this broke fast reg
alloc.
2021-06-23 08:04:42 -07:00
Jay Foad a16cb95a3a [AMDGPU] Remove unused multiclass MUBUF_Real_gfx10_with_name 2021-06-23 14:37:28 +01:00
Nikita Popov 8c01deb8e6 [ARMParallelDSP] Remove unnecessary wrapper function (NFC)
AreSequentialAccesses() forwards directly to isConsecutiveAccess()
and has an unnecessary template parameter to boot.
2021-06-23 15:27:54 +02:00
Jay Foad dfb8c08739 [AMDGPU] Stop using LegacyLegalizerInfo. NFCI.
Differential Revision: https://reviews.llvm.org/D103684
2021-06-23 10:50:32 +01:00
Jay Foad c65f3f562b [AMDGPU] Simplify collectReachableCallees. NFCI.
Don't use SCC iterators when we're only interested in reachability.
Use df_begin/df_end inline to find reachable nodes.

Differential Revision: https://reviews.llvm.org/D104704
2021-06-23 09:11:29 +01:00
Stanislav Mekhanoshin 2b43209ee3 [AMDGPU] Propagate LDS align into instructions
Differential Revision: https://reviews.llvm.org/D104316
2021-06-23 00:57:16 -07:00
Martin Storsjö 1cb7849a55 Revert "[AArch64LoadStoreOptimizer] Recommit: Generate more STPs by renaming registers earlier"
This reverts commit ea011ec5ed.

This still causes some miscompiles, I'll follow up in the phabricator
review with a sample of that issue (which is part of the sample of
the previous issue).
2021-06-23 09:54:16 +03:00
Min-Yih Hsu dfafd56daa [M68k] Fix incorrect #include-ed file in M68kSubtarget
In https://reviews.llvm.org/rG2193347e72fa , a cpp file is accidentally
included instead of its header file counterpart. This patch fixes this
error.
2021-06-22 23:02:21 -07:00
Jim Lin 5cb5225cf5 [M68k] Refactor codegen patterns for logic operations and add tests for it
Refactor patterns for and, or and xor operations and add missing tests for them

Reviewed By: myhsu

Differential Revision: https://reviews.llvm.org/D104626
2021-06-23 13:25:24 +08:00
David Green 015c27caa2 [ARM] Change some Gather/Scatter interface types to Instructions. NFC
These returned Values are cast to an Instruction already, this just
cleans up the interface a little to match the expected types.
2021-06-22 19:11:39 +01:00
Matt Arsenault 39f8a792f0 AMDGPU: Try to eliminate clearing of high bits of 16-bit instructions
These used to consistently be zeroed pre-gfx9, but gfx9 made the
situation complicated since now some still do and some don't. This
also manages to pick up a few cases that the pattern fails to optimize
away.

We handle some cases with instruction patterns, but some get
through. In particular this improves the integer cases.
2021-06-22 13:42:49 -04:00
Matt Arsenault 9ad8a1f6fb AMDGPU: Fix high 16-bit optimization on gfx9
We can do this optimization in the majority of cases, but we currently
don't have a way to do it. We do not track/model which instructions
have which behavior, the control bit to change the high bit behavior,
or making use of preserved bits at all. This is a bit fuzzy since we
don't know precisely how the source instruction will be lowered, but
that only really matters in one case (for fma_mixlo).

We do need to fixup some of these cases after selection, but the
pattern helps eliminate many of these zexts.
2021-06-22 13:16:45 -04:00
zhijian bd240b3d77 [AIX][XCOFF] generate eh_info when vector registers are saved according to the traceback table.
Summary:

generate eh_info when vector registers are saved according to the traceback table.

struct eh_info_t {
  unsigned version;          /* EH info version 0 */
#if defined(__64BIT__)
  char _pad[4];              /* padding */
#endif
  unsigned long lsda;        /* Pointer to Language Specific Data Area */
  unsigned long personality; /* Pointer to the personality routine */
};

The values of lsda and personality are zero when the number of vector registers saved is larger than zero and the function has no personality routine.

Reviewers: Jason Liu
Differential Revision: https://reviews.llvm.org/D103651
2021-06-22 13:01:31 -04:00
Stanislav Mekhanoshin d797a7f8da [AMDGPU] Use performOptimizedStructLayout for LDS sort
This gives better packing.

Differential Revision: https://reviews.llvm.org/D104331
2021-06-22 09:58:10 -07:00
Matt Arsenault a7786badb7 AMDGPU: Move zeroed FP high bits optimization to patterns 2021-06-22 12:47:56 -04:00
Meera Nakrani ea011ec5ed [AArch64LoadStoreOptimizer] Recommit: Generate more STPs by renaming registers earlier
This is a recommit that fixes unwanted STP generation by checking that
the base register has not been modified or used elsewhere.

Our initial motivating case was memcpy's with alignments > 16. The
loads/stores, to which small memcpy's expand, are kept together in
several places so that we get a sequence like this for a 64 bit copy:
LD w0
LD w1
ST w0
ST w1
The load/store optimiser can generate a LDP/STP w0, w1 from this because
the registers read/written are consecutive. In our case however, the
sequence is optimised during ISel, resulting in:
LD w0
ST w0
LD w0
ST w0
This instruction reordering allows reuse of registers. Since the registers
are no longer consecutive (i.e. they are the same), it inhibits LDP/STP
creation. The approach here is to perform renaming:
LD w0
ST w0
LD w1
ST w1
to enable the folding of the stores into a STP. We do not yet generate
the LDP due to a limitation in the renaming implementation, but plan to
look at that in a follow-up so that we fully support this case. While
this was initially motivated by certain memcpy's, this is a general
approach and thus is beneficial for other cases too, as can be seen
in some test changes.

Differential Revision: https://reviews.llvm.org/D103597
2021-06-22 15:29:13 +00:00
Thomas Johnson 2ef1fbfe0e Add norm sub-target feature to table gen for ARC
This adds the `norm` sub-target feature (without backing implementation for now) to table gen.

Differential Revision: https://reviews.llvm.org/D104558
2021-06-22 14:39:29 +03:00
Eli Friedman 74909e4b6e Rename MachineMemOperand::getOrdering -> getSuccessOrdering.
Since this method can apply to cmpxchg operations, make sure it's clear
what value we're actually retrieving.  This will help ensure we don't
accidentally ignore the failure ordering of cmpxchg in the future.

We could potentially introduce a getOrdering() method on AtomicSDNode
that asserts the operation isn't cmpxchg, but not sure that's
worthwhile.

Differential Revision: https://reviews.llvm.org/D103338
2021-06-21 16:49:27 -07:00
Eli Friedman bf0d0671a1 [ARM] Make sure we don't transform unaligned store to stm on Thumb1.
This isn't likely to come up in practice; the combination of compiler
flags required to hit this issue should be rare. Found by inspection.
2021-06-21 14:32:42 -07:00
Fangrui Song c618692218 [AArch64][X86] Allow 64-bit label differences lower to IMAGE_REL_*_REL32
`IMAGE_REL_ARM64_REL64/IMAGE_REL_AMD64_REL64` do not exist and `.quad a - .` is
currently not representable.

For instrumentation, `.quad a - .` is useful representing a cross-section
reference in a metadata section, to allow ELF medium/large code models. The COFF
limitation makes such generic instrumentations inconvenient. I plan to make a
PGO/coverage metadata section field relative in D104556.

Differential Revision: https://reviews.llvm.org/D104564
2021-06-21 14:32:25 -07:00
Craig Topper c2e01ee4a5 [RISCV] Remove extra character from a comment. NFC 2021-06-21 12:52:02 -07:00
Jonas Paulsson b2cd98d5fe [SystemZ] Fix some typos in comments. 2021-06-21 13:50:54 -05:00
Craig Topper 9080659ac7 [RISCV] Add isel patterns to match vmacc/vmadd/vnmsub/vnmsac from add/sub and mul.
Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D104163
2021-06-21 11:27:44 -07:00
Sam Tebbs bbe16b7af2 [ARM] Transform a fixed-point to floating-point conversion into a VCVT_fix
Conversion from a fixed-point number to a floating-point number is done by
multiplying the fixed-point number by 2^(-n) where n is the number of
fractional bits. Currently this is lowered to a vcvt
(integer to floating-point) then a vmul, but it can instead be lowered
directly to a vcvt (fixed-point to floating-point). This patch enables
such transformations as long as the multiplication factor is a power of 2.
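
As an illustration of the arithmetic being matched, here is a hedged C sketch of a fixed-point to floating-point conversion via a multiply by 2^-n (fixed_to_float and the Q16.16 example value are illustrative, not from the patch):

```
#include <assert.h>
#include <stdint.h>

/* Fixed-point to float: int-to-float conversion followed by a multiply
   by 2^-n, which a fixed-point vcvt performs in a single instruction. */
static float fixed_to_float(int32_t v, unsigned frac_bits) {
  return (float)v * (1.0f / (float)(1u << frac_bits));
}

int main(void) {
  /* 0x00028000 interpreted as Q16.16 is 2.5. */
  assert(fixed_to_float(0x00028000, 16) == 2.5f);
  return 0;
}
```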

Differential Revision: https://reviews.llvm.org/D103903
2021-06-21 14:14:09 +01:00
Bradley Smith 9e7329e37e [AArch64][SVE] Wire up vscale_range attribute to SVE min/max vector queries
Differential Revision: https://reviews.llvm.org/D103702
2021-06-21 13:00:36 +01:00
Jordan Rupprecht b650778dc4 [NFC] Wrap entire assert-only block in LLVM_DEBUG 2021-06-21 04:01:27 -07:00
Sebastian Neubauer bbd7424402 [AMDGPU] Fix linking with shared libraries
AMDGPULDSUtils depends on llvm::CallGraph.
2021-06-21 11:11:13 +02:00
Ruiling Song 208332de8a [AMDGPU] Add Optimize VGPR LiveRange Pass.
This pass aims to optimize VGPR live-range in a typical divergent if-else
control flow. For example:

def(a)
if(cond)
  use(a)
  ... // A
else
  use(a)

As AMDGPU accesses vgprs with respect to the active mask, we can mark `a` as
dead in region A. For details, please refer to the comments in the
implementation file.

The pass is enabled by default, the frontend can disable it through
"-amdgpu-opt-vgpr-liverange=false".

Differential Revision: https://reviews.llvm.org/D102212
2021-06-21 15:25:55 +08:00
hsmahesha 80fd5fa526 [AMDGPU] Replace non-kernel function uses of LDS globals by pointers.
The main motivation behind pointer replacement of LDS use within non-kernel
functions is to *avoid* the subsequent LDS lowering pass from directly packing
LDS (assume large LDS) into a struct type, which would otherwise cause allocating
huge memory for the struct instance within every kernel.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D103225
2021-06-21 11:51:49 +05:30
Fangrui Song 521d373274 Fix -Wunused-variable and -Wunused-but-set-variable in -DLLVM_ENABLE_ASSERTIONS=off build. NFC 2021-06-20 11:09:07 -07:00
Craig Topper b663f30fa4 [RISCV] Prevent formation of shXadd(.uw) and add.uw if it prevents the use of addi.
If the outer add has a simm12 immediate operand we should prefer
it instead of materializing it in a register. That would guarantee
an extra instruction and temporary register. Since we don't check
one use on the shl or zext we might generate more instructions if
there is an additional user.
2021-06-19 12:10:42 -07:00
Roman Lebedev 834aafa55b
[NFC] AMD Zen 3: fix typo in a comment 2021-06-19 22:05:17 +03:00
Fangrui Song 59d90fe817 Simplify some typedef struct 2021-06-19 11:36:44 -07:00
Michael Liao 940efa4f69 [amdgpu] Improve the conversion from f32 to i64.
- Take the same principle as the conversion from f64 to i64 with extra
  necessary pre- and post-processing. It helps to reduce that conversion
  sequence by half compared to legacy one.

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D104427
2021-06-19 12:46:48 -04:00
Tomas Matheson 18dbe68978 [ARM][NFC] Tidy up subtarget frame pointer routines
getFramePointerReg only depends on information in ARMSubtarget,
so move it in there so it can be accessed from more places.

Make use of ARMSubtarget::getFramePointerReg to remove duplicated code.

The main use of useR7AsFramePointer is getFramePointerReg, so inline it.

Differential Revision: https://reviews.llvm.org/D104476
2021-06-19 17:00:45 +01:00
Ben Shi d934b72809 [RISCV] Optimize add-mul in the zba extension with SH*ADD
This patch does the following optimization.

Rx + Ry * 18 => (SH1ADD (SH3ADD Rx, Rx), Ry)
Rx + Ry * 20 => (SH2ADD (SH2ADD Rx, Rx), Ry)
Rx + Ry * 24 => (SH3ADD (SH1ADD Rx, Rx), Ry)
Rx + Ry * 36 => (SH2ADD (SH3ADD Rx, Rx), Ry)
Rx + Ry * 40 => (SH3ADD (SH2ADD Rx, Rx), Ry)
Rx + Ry * 72 => (SH3ADD (SH3ADD Rx, Rx), Ry)
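
A hedged sketch checking these decompositions, with sh1add/sh2add/sh3add modelling the Zba SHnADD instructions ((rs1 << N) + rs2); the variable names are illustrative and the multiplicand is the shifted operand, following the left-hand sides above:

```
#include <assert.h>
#include <stdint.h>

static uint64_t sh1add(uint64_t rs1, uint64_t rs2) { return (rs1 << 1) + rs2; }
static uint64_t sh2add(uint64_t rs1, uint64_t rs2) { return (rs1 << 2) + rs2; }
static uint64_t sh3add(uint64_t rs1, uint64_t rs2) { return (rs1 << 3) + rs2; }

int main(void) {
  uint64_t rx = 123, ry = 456;
  assert(rx + ry * 18 == sh1add(sh3add(ry, ry), rx)); /* 18 = 2 * 9 */
  assert(rx + ry * 20 == sh2add(sh2add(ry, ry), rx)); /* 20 = 4 * 5 */
  assert(rx + ry * 24 == sh3add(sh1add(ry, ry), rx)); /* 24 = 8 * 3 */
  assert(rx + ry * 36 == sh2add(sh3add(ry, ry), rx)); /* 36 = 4 * 9 */
  assert(rx + ry * 40 == sh3add(sh2add(ry, ry), rx)); /* 40 = 8 * 5 */
  assert(rx + ry * 72 == sh3add(sh3add(ry, ry), rx)); /* 72 = 8 * 9 */
  return 0;
}
```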

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D104588
2021-06-19 14:33:27 +08:00
Matt Arsenault d6467e00df AMDGPU: Fix infinite loop in DAG combine with fneg + fma
We were not reporting isFNegFree for v2f32, although it is effectively
free after legalization. The generic combine was pulling fneg out of
the fma source operands, and the AMDGPU combine was doing the
opposite.
2021-06-18 19:09:03 -04:00
Matt Arsenault ad4a18251a AMDGPU: Fix assert on m0_lo16/m0_hi16
These get added (redundantly) to the bundle expanded for indirect
register accesses. We hit this path only when there is a call in the
function.
2021-06-18 18:48:53 -04:00
Craig Topper ac87133f1d [RISCV] Teach vsetvli insertion to remember when predecessors have same AVL and SEW/LMUL ratio if their VTYPEs otherwise mismatch.
Previously we went directly to unknown state on VTYPE mismatch.
If we instead remember the partial match, we can use this to
still use X0, X0 vsetvli in successors if AVL and needed SEW/LMUL
ratio match.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D104069
2021-06-18 12:16:07 -07:00
Anshil Gandhi 2e5dc4a1ef [AMDGPU] [CodeGen] Fold negate llvm.amdgcn.class into test mask
Implemented the transformation of xor (llvm.amdgcn.class x, mask), -1 into
llvm.amdgcn.class(x, ~mask). Added LIT tests as well.
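
A hedged sketch of why the fold is sound: every value belongs to exactly one floating-point class, so negating a class test is the same as testing against the complemented mask (classify/fp_class below are stand-ins, not the intrinsic):

```
#include <assert.h>

#define NUM_CLASSES 10 /* the class mask tests 10 floating-point classes */
#define FULL_MASK ((1u << NUM_CLASSES) - 1)

/* Stand-in for the intrinsic: true if x's (one-hot) class bit is in mask.
   Only the class bits of the mask are examined. */
static int fp_class(unsigned class_bit, unsigned mask) {
  return (class_bit & mask & FULL_MASK) != 0;
}

int main(void) {
  for (int c = 0; c < NUM_CLASSES; ++c) {
    unsigned class_bit = 1u << c; /* every value is in exactly one class */
    for (unsigned mask = 0; mask <= FULL_MASK; ++mask)
      /* xor(class(x, mask), -1) on an i1 is logical negation. */
      assert(!fp_class(class_bit, mask) == fp_class(class_bit, ~mask));
  }
  return 0;
}
```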

Differential Revision: https://reviews.llvm.org/D104049
2021-06-18 13:04:12 -06:00
Jingu Kang 78b75b452b [AArch64] Add TableGen patterns to generate uaddlv
uaddv(uaddlp(x)) ==> uaddlv(x)
addp(uaddlp(x))  ==> uaddlv(x)
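
A hedged C sketch of why the two sides agree (plain loops model uaddlp, uaddv and uaddlv; the values are illustrative):

```
#include <assert.h>
#include <stdint.h>

int main(void) {
  uint8_t x[8] = {250, 1, 2, 3, 4, 5, 6, 255};

  /* uaddlp: pairwise add with widening (v8i8 -> v4i16). */
  uint16_t pairs[4];
  for (int i = 0; i < 4; ++i)
    pairs[i] = (uint16_t)((uint16_t)x[2 * i] + x[2 * i + 1]);

  /* uaddv of the pairs vs. uaddlv of the original vector. */
  uint32_t sum_uaddv = 0, sum_uaddlv = 0;
  for (int i = 0; i < 4; ++i) sum_uaddv += pairs[i];
  for (int i = 0; i < 8; ++i) sum_uaddlv += x[i];

  assert(sum_uaddv == sum_uaddlv);
  return 0;
}
```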

Differential Revision: https://reviews.llvm.org/D104236
2021-06-18 17:23:26 +01:00
Luke c2e97ba85e [RISCV] Don't enable Interleaved Access Vectorization
The patch https://reviews.llvm.org/D101469 is intended to enable loop unrolling,
not interleaved access vectorization. The method bool enableInterleavedAccessVectorization()
should not be implemented.
2021-06-18 12:32:30 +08:00
Igor Kudrin 85ec210751 [objdump][ARM] Fix evaluating the target address of a Thumb BLX(i)
The instruction can be 16-bit aligned while targeting 32-bit aligned
code. To calculate the target address correctly, the address of the
instruction has to be adjusted.
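
A hedged sketch of the address computation for a Thumb-state BLX (immediate), where the branch base is the 4-byte-aligned PC (thumb_blx_target and the example addresses are illustrative):

```
#include <assert.h>
#include <stdint.h>

static uint64_t thumb_blx_target(uint64_t insn_addr, int64_t imm) {
  uint64_t pc = insn_addr + 4;      /* Thumb PC reads as instruction + 4 */
  uint64_t aligned_pc = pc & ~3ULL; /* Align(PC, 4) */
  return aligned_pc + (uint64_t)imm;
}

int main(void) {
  /* BLX at a 16-bit-aligned address: 0x1006 + 4 = 0x100a, aligned down to 0x1008. */
  assert(thumb_blx_target(0x1006, 0x100) == 0x1108);
  return 0;
}
```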

Differential Revision: https://reviews.llvm.org/D104446
2021-06-18 10:40:55 +07:00
Carl Ritson a10aeb3b32 [AMDGPU] Remove duplicate setOperationAction for v4i16/v4f16 (NFC) 2021-06-18 12:38:54 +09:00
Heejin Ahn 1d891d44f3 [WebAssembly] Rename event to tag
We recently decided to change 'event' to 'tag', and 'event section' to
'tag section', out of the rationale that the section contains a
generalized tag that references a type, which may be used for something
other than exceptions, and the name 'event' can be confusing in the web
context.

See
- https://github.com/WebAssembly/exception-handling/issues/159#issuecomment-857910130
- https://github.com/WebAssembly/exception-handling/pull/161

Reviewed By: tlively

Differential Revision: https://reviews.llvm.org/D104423
2021-06-17 20:34:19 -07:00
Jim Lin e7bf451056 [M68k][NFC] Fix indentation in M68kInstrArithmetic.td
Merely fix indentation

Reviewed By: myhsu

Differential Revision: https://reviews.llvm.org/D104434
2021-06-18 09:49:04 +08:00
Saleem Abdulrasool 116841c623 RISCV: clean up target expression handling
The target specific expression handling was slightly regressed by
bbea64250f.  This restores the proper
sub-expression evaluation to allow for constant folding within the
expression.  We explicitly discard the layout and assembler when
evaluating the expression to avoid any symbolic computation and instead
using the `evaluateAsRelocatable` to canonicalise and constant fold
only.

We can also simplify the expression handling - none of the target
variants support symbolic difference.  This simplifies the logic for
that and adds additional tests to ensure that we do not accidentally
regress here in the future.

Reviewed By: maskray

Differential Revision: https://reviews.llvm.org/D104473
2021-06-17 13:35:32 -07:00
Jon Roelofs 7b06120882 [AArch64][GISel] and+or+shl => bfi
This fixes a GISEL vs SDAG regression that showed up at -Os in 256.bzip2

In `_getAndMoveToFrontDecode`:

gisel:
```
and w9, w0, #0xff
orr w9, w9, w8, lsl #8
```

sdag:
```
bfi w0, w8, #8, #24
```

Differential revision: https://reviews.llvm.org/D103291
2021-06-17 12:52:59 -07:00
Roman Lebedev 69caacc626
[X86] AMD Zen 3: don't confuse shift and shuffle, NFC
These proc res groups occupy the exact same pipes,
so this doesn't affect the modelling,
but it's confusing nonetheless.
2021-06-17 21:07:35 +03:00
Haojian Wu 53f5f14136 fix an -Wunused-variable warning in release build, NFC 2021-06-17 18:48:47 +02:00
Saleem Abdulrasool bbea64250f RISCV: adjust handling of relocation emission for RISCV
This re-architects the RISCV relocation handling to bring the
implementation closer in line with the implementation in binutils.  We
would previously aggressively resolve the relocation.  With this
restructuring, we always will emit a paired relocation for any symbolic
difference of the type of S±T[±C] where S and T are labels and C is a
constant.

GAS has a special target hook controlled by `RELOC_EXPANSION_POSSIBLE`
which indicates that a fixup may be expanded into multiple relocations.
This is used by the RISCV backend to always emit a paired relocation -
either ADD[WIDTH] + SUB[WIDTH] for text relocations or SET[WIDTH] +
SUB[WIDTH] for a debug info relocation.  Irrespective of whether linker
relaxation support is enabled, symbolic difference is always emitted as
a paired relocation.

This change also sinks the target specific behaviour down into the
target specific area rather than exposing it to the shared relocation
handling.  In the process, we also sink the "special" handling for debug
information down into the RISCV target.  Although this improves the path
for the other targets, this is not necessarily entirely ideal either.
The changes in the debug info emission could be done through another
type of hook as this functionality would be required by any other target
which wishes to do linker relaxation.  However, as there are no other
targets in LLVM which currently do this, this is a reasonable thing to
do until such time as the code needs to be shared.

Improve the handling of the relocation (and add a reduced test case from
the Linux kernel) to ensure that we handle complex expressions for
symbolic difference.  This ensures that we correctly relocate symbols with
the addends normalized and associated with the addition portion of the
paired relocation.

This change also addresses some review comments from Alex Bradbury about
the relocations meant for use in the DWARF CFA being named incorrectly
(using ADD6 instead of SET6) in the original change which introduced the
relocation type.

This resolves the issues with the symbolic difference emission
sufficiently to enable building the Linux kernel with clang+IAS+lld
(without linker relaxation).

Resolves PR50153, PR50156!
Fixes: ClangBuiltLinux/linux#1023, ClangBuiltLinux/linux#1143

Reviewed By: nickdesaulniers, maskray

Differential Revision: https://reviews.llvm.org/D103539
2021-06-17 08:20:02 -07:00
Simon Pilgrim cdb4fcf9a1 [X86] combineSelect - refactor MIN/MAX detection code to make it easier to add additional select(setcc,x,y) folds. NFCI.
I need to add some additional handling to address some of the regressions from D101074
2021-06-17 13:50:59 +01:00
Fraser Cormack fed1503e85 [RISCV][VP] Lower FP VP ISD nodes to RVV instructions
With the exception of `frem`, this patch supports the current set of VP
floating-point binary intrinsics by lowering them to RVV instructions. It
does so by using the existing `RISCVISD *_VL` custom nodes as an intermediate
layer. Both scalable and fixed-length vectors are supported by using this
method.

The `frem` node is unsupported due to a lack of available instructions. For
fixed-length vectors we could scalarize but that option is not (currently)
available for scalable-vector types. The support is intentionally left out so
it equivalent for both vector types.

The matching of vector/scalar forms is currently lacking, as scalable vector
types do not lower to the custom `VFMV_V_F_VL` node. We could either make
floating-point scalable vector splats lower to this node, or support the
matching of multiple kinds of splat via a `ComplexPattern`, much like we do for
integer types.

Reviewed By: rogfer01

Differential Revision: https://reviews.llvm.org/D104237
2021-06-17 10:04:00 +01:00
Bjorn Pettersson 4c7f820b2b Update @llvm.powi to handle different int sizes for the exponent
This can be seen as a follow up to commit 0ee439b705,
that changed the second argument of __powidf2, __powisf2 and
__powitf2 in compiler-rt from si_int to int. That was to align with
how those runtimes are defined in libgcc.
One thing that seems to have been missing in that patch was to make
sure that the rest of LLVM also handle that the argument now depends
on the size of int (not using the si_int machine mode for 32-bit).
When using __builtin_powi for a target with 16-bit int clang crashed.
And when emitting libcalls to those rtlib functions, typically when
lowering @llvm.powi, the backend would always prepare the exponent
argument as an i32 which caused miscompiles when the rtlib was
compiled with 16-bit int.

The solution used here is to use an overloaded type for the second
argument in @llvm.powi. This way clang can use the "correct" type
when lowering __builtin_powi, and then later when emitting the libcall
it is assumed that the type used in @llvm.powi matches the rtlib
function.
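
For reference, here is a hedged C model of what such a powi libcall computes, with the exponent carried in the target's `int` type (exponentiation by squaring; this is an illustration, not the compiler-rt implementation):

```
#include <assert.h>

/* Plain C model of x raised to an integer power, the operation the
   __powi* libcalls provide; the exponent has the width of 'int'. */
static double powi(double x, int n) {
  unsigned long long e = n < 0 ? 0ULL - (unsigned long long)n
                               : (unsigned long long)n;
  double result = 1.0, base = x;
  while (e) { /* exponentiation by squaring */
    if (e & 1)
      result *= base;
    base *= base;
    e >>= 1;
  }
  return n < 0 ? 1.0 / result : result;
}

int main(void) {
  assert(powi(2.0, 10) == 1024.0);
  assert(powi(2.0, -3) == 0.125);
  return 0;
}
```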

One thing that needed some extra attention was that, when vectorizing
calls, several passes did not support having several overloaded arguments
in the intrinsics. This patch allows overload of a
scalar operand by adding hasVectorInstrinsicOverloadedScalarOpd, with
an entry for powi.

Differential Revision: https://reviews.llvm.org/D99439
2021-06-17 09:38:28 +02:00
Fangrui Song 1a76bff626 RISCVFixupKinds.h: Don’t duplicate function or class name at the beginning of the comment && fix some comments 2021-06-16 10:42:43 -07:00
Sushma Unnibhavi 2193347e72 [M68k][GlobalISel] Adding initial GlobalISel infrastructure
Wiring up GlobalISel for the M68k backend

Differential Revision: https://reviews.llvm.org/D101819
2021-06-16 10:48:38 -06:00
Dylan Fleming 2a936be388 [SVE] Selection failure with scalable insertelements
Reviewed By: efriedma, CarolineConcatto

Differential Revision: https://reviews.llvm.org/D104244
2021-06-16 15:38:31 +01:00
Jay Foad 66234ce49f [AMDGPU] Set VOP3P flag on Real instructions
This does not affect codegen but might benefit llvm-mca.
2021-06-16 15:00:45 +01:00
David Spickett e4ecd83fe9 [llvm][AArch64] Handle arrays of struct properly (from IR)
This only applies to FastIsel. GlobalIsel seems to sidestep
the issue.

This fixes https://bugs.llvm.org/show_bug.cgi?id=46996

One of the things we do in llvm is decide if a type needs
consecutive registers. Previously, we just checked if it
was an array or not.
(plus an SVE specific check that is not changing here)

This causes some confusion when you have arbitrary IR like:
```
%T1 = type { double, i1 };
define [ 1 x %T1 ] @foo() {
entry:
  ret [ 1 x %T1 ] zeroinitializer
}
```

We see it is an array so we call CC_AArch64_Custom_Block
which bails out when it sees the i1, a type we don't want
to put into a block.

This leaves the location of the double in some kind of
intermediate state and leads to odd codegen. Which then crashes
the backend because it doesn't know how to implement
what it's been asked for.

You get this:
```
  renamable $d0 = FMOVD0
  $w0 = COPY killed renamable $d0
```

Rather than this:
```
  $d0 = FMOVD0
  $w0 = COPY $wzr
```

The backend knows how to copy 64 bit to 64 bit registers,
but not 64 to 32. It can certainly be taught how but the real
issue seems to be us even trying to assign a register block
in the first place.

This change makes the logic of
AArch64TargetLowering::functionArgumentNeedsConsecutiveRegisters
a bit more in depth. If we find an array, also check that all the
nested aggregates in that array have a single member type.

Then CC_AArch64_Custom_Block's assumption of a type that looks
like [ N x type ] will be valid and we get the expected codegen.

New tests have been added to exercise these situations. Note that
some of the output is not ABI compliant. The aim of this change is
to simply handle these situations and not to make our processing
of arbitrary IR ABI compliant.

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D104123
2021-06-16 13:56:01 +00:00
Jay Foad 7f3ac6714a [AMDGPU] Set SALU, VALU and other instruction type flags on Real instructions
This does not affect codegen but might benefit llvm-mca.
2021-06-16 13:36:02 +01:00
Jay Foad 24ffc343f9 [AMDGPU] Set IsAtomicRet and IsAtomicNoRet on Real instructions
This does not affect codegen but might benefit llvm-mca.
2021-06-16 12:23:29 +01:00
Jay Foad 323b3e645d [AMDGPU] Set mayLoad and mayStore on Real instructions
This does not affect codegen but might benefit llvm-mca.
2021-06-16 12:10:23 +01:00
David Green 0a714eaa51 [ARM] Correct type of setcc results for FP vectors
Under MVE v4f32 and v8f16 vectors should be using v4i1/v8i1 predicates
for the setcc result type, as they have predicated registers for those
types. Setting this correctly prevents some inefficient optimizations
from happening.
2021-06-16 11:11:03 +01:00
Jay Foad 6f778fed8e [AMDGPU] Set more flags on Real instructions
This does not affect codegen, which only tests these flags on Pseudo
instructions, but might help llvm-mca which has to work with Real
instructions. In particular setting LGKM_CNT on DS instructions helps
with the problem identified in D104149.

Differential Revision: https://reviews.llvm.org/D104293
2021-06-16 09:58:50 +01:00
Jay Foad 37109974af [AMDGPU] Use defvar in SOPInstructions.td. NFC.
Factor out repeated !cast<SOP*_Pseudo>(NAME) into a new "defvar ps",
just to improve readability and maintainability.

Differential Revision: https://reviews.llvm.org/D104306
2021-06-16 09:16:45 +01:00
Roman Lebedev 308f6a5245
[NFC][X86] lowerVECTOR_SHUFFLE(): drop FIXME about widening to i128 (YMM half) element type
As per the discussion in D103818, so far, this does not appear to be worthwhile.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103818
2021-06-16 10:24:33 +03:00
Saleem Abdulrasool 17bdc0ff6f X86: balance the frame prologue and epilogue on Win64
This was broken in ba1509da7b.  The Win64
frame would not perform the setup of the Swift async context parameter
but would tear down the setup in the epilogue resulting in crashes.
This ensures that we do the full setup when we do the tear down.
Although this is non-conforming to the Win64 calling convention, it
corrects the setup and exposes the actual issue that the change
introduced: incorrect frame setup.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D104246
2021-06-15 20:13:52 -07:00
Wenlei He 76de2f4a9c CMake: allow overriding CMAKE_CXX_VISIBILITY_PRESET
This allows overriding the `CMAKE_CXX_VISIBILITY_PRESET` on the command line. For example, setting the value to `default` lets PIC LLVM static libraries be converted to DSOs, without the need to rebuild LLVM with BUILD_SHARED_LIBS=ON.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D104168
2021-06-15 15:51:18 -07:00
Nemanja Ivanovic 821a8f680e [PowerPC] Fix spilling of paired VSX registers
We have added STXVP/LXVP for spilling and restoring the registers
but we neglected to add FI elimination code for these. The result
is that we end up producing impossible MachineInstr's that have
register operands in place of immediates.
2021-06-15 14:13:17 -05:00
Bob Haarman 3bc899b4de [X86] avoid assert with varargs, soft float, and no-implicit-float
Fixes:
 - PR36507 Floating point varargs are not handled correctly with
   -mno-implicit-float
 - PR48528 __builtin_va_start assumes it can pass SSE registers
   when using -Xclang -msoft-float -Xclang -no-implicit-float

On x86_64, floating-point parameters are normally passed in XMM
registers. For va_start, we spill those to memory so va_arg can
find them. There is an interaction here with -msoft-float and
-no-implicit-float:

When -msoft-float is in effect, instead of passing floating-point
parameters in XMM registers, they are passed in general-purpose
registers.

When -no-implicit-float is in effect, it "disables implicit
floating-point instructions" (per the LangRef). The intended
effect is to not have the compiler generate floating-point code
unless explicit floating-point operations are present in the
source code, but what exactly counts as an explicit floating-point
operation is not specified. The existing behavior of LLVM here has
led to some surprises and PRs.

This change modifies the behavior as follows:

  | soft | no-implicit | old behavior    | new behavior    |
  |  no  |   no        | spill XMM regs  | spill XMM regs  |
  | yes  |   no        | don't spill XMM | don't spill XMM |
  |  no  |  yes        | don't spill XMM | spill XMM regs  |
  | yes  |  yes        | assert          | don't spill XMM |

In particular, this avoids the assert that happens when
-msoft-float and -no-implicit-float are both in effect. This
seems like a perfectly reasonable combination: If we don't want
to rely on hardware floating-point support, we want to both
avoid using float registers to pass parameters and avoid having
the compiler generate floating-point code that wasn't in the
original program. Instead of crashing the compiler, the new
behavior is to not synthesize floating-point code in this
case. This fixes PR48528.

The other interesting case is when -no-implicit-float is in
effect, but -msoft-float is not. In that case, any floating-point
parameters that are present will be in XMM registers, and so we
have to spill them to correctly handle those. This fixes
PR36507. The spill is conditional on %al indicating that
parameters are present in XMM registers, so no floating-point
code will be executed unless the function is called with
floating-point parameters.
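
For illustration, a variadic function of the following shape is the kind of source that exercises these paths (a hedged example, not taken from the PRs):

```
#include <stdarg.h>
#include <stdio.h>

/* A variadic function whose callers pass floating-point arguments. */
static double sum(int count, ...) {
  va_list ap;
  va_start(ap, count); /* this is where XMM argument registers may be spilled */
  double total = 0.0;
  for (int i = 0; i < count; ++i)
    total += va_arg(ap, double);
  va_end(ap);
  return total;
}

int main(void) {
  printf("%f\n", sum(3, 1.0, 2.5, 4.0));
  return 0;
}
```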

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D104001
2021-06-15 11:27:35 -07:00
David Green 93aa445e16 Revert "[ARM] Extend narrow values to allow using truncating scatters"
This commit adds nodes that might not always be used, which the
expensive checks builder does not like. Reverting for now to think up a
better way of handling it.
2021-06-15 18:19:25 +01:00
Arthur Eubanks be5d454f3f [NFC][OpaquePtr] Avoid calling getPointerElementType()
Pointee types are going away soon.

For this, we mostly just care about store/load types, which are already
available without the pointee types. The other intrinsics always use
i8*.

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D103719
2021-06-15 09:53:12 -07:00
Arthur Eubanks 25b2126b9e [NFC] Remove redundant variable
Differential Revision: https://reviews.llvm.org/D103706
2021-06-15 09:53:11 -07:00
David Green b9bd2936f9 [ARM] Extend narrow values to allow using truncating scatters
As a minor adjustment to the existing lowering of offset scatters, this
extends any smaller-than-legal vectors into full vectors using a zext,
so that the truncating scatters can be used. Due to the way MVE
legalizes the vectors this should be cheap in most situations, and will
prevent the vector from being scalarized.

Differential Revision: https://reviews.llvm.org/D103704
2021-06-15 17:45:14 +01:00
David Green 680d3f8f17 [ARM] Use rq gather/scatters for smaller v4 vectors
A pointer will always fit into an i32, so a rq offset gather/scatter can
be used with v4i8 and v4i16 gathers, using a base of 0 and the Ptr as
the offsets. The rq gather can then correctly extend the type, allowing
us to use the gathers without falling back to scalarizing.

This patch rejigs tryCreateMaskedGatherOffset in the
MVEGatherScatterLowering pass to decompose the Ptr into Base:0 +
Offset:Ptr (with a scale of 1), if the Ptr could not be decomposed from
a GEP. v4i32 gathers will already use qi gathers, this extends that to
v4i8 and v4i16 gathers using the extending rq variants.

Differential Revision: https://reviews.llvm.org/D103674
2021-06-15 17:06:15 +01:00
David Green 09924cbab7 [ARM] Rejig some of the MVE gather/scatter lowering pass. NFC
This adjusts some of how the gather/scatter lowering pass passes around
data and where certain gathers/scatters are created from. It should not
affect code generation on its own, but allows other patches to more
clearly reason about the code.

A number of extra test cases were also added for smaller gathers/
scatters that can be extended, and some of the test comments were
updated.
2021-06-15 15:38:39 +01:00
Roman Lebedev 88da6c1ead
[X86] Schedule-model second (mask) output of GATHER instruction
Much like `mulx`'s `WriteIMulH`, there are two outputs of
AVX2 GATHER instructions. This was changed back in rL160110,
but the sched model change wasn't present.

So right now, for sched models that are marked as complete
(`znver3` only now), codegen'ning `GATHER` results in a crash:
```
DefIdx 1 exceeds machine model writes for early-clobber renamable $ymm3, dead early-clobber renamable $ymm2 = VPGATHERDDYrm killed renamable $ymm3(tied-def 0), undef renamable $rax, 4, renamable $ymm0, 0, $noreg, killed renamable $ymm2(tied-def 1) :: (load 32, align 1)
```
https://godbolt.org/z/Ks7zW7WGh

I'm guessing we need to deal with this like we deal with `WriteIMulH`.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104205
2021-06-15 12:04:33 +03:00
Craig Topper 4017d0335a [X86] Use EVT::getVectorVT instead of changeVectorElementType in reduceVMULWidth.
Changing vector element type doesn't work for v6i32->v6i16 now
that v6i32 is an MVT and v6i16 is not.

I would like to fix this in changeVectorElementType, but you
need a LLVMContext to call getVectorVT which we can't get from
an MVT.

Fixes PR50709.
2021-06-14 22:07:04 -07:00
Kai Luo 1c450c3d7e [PowerPC] Export 16 byte load-store instructions
Export `lq`, `stq`, `lqarx` and `stqcx.` in preparation for implementing 16-byte lock free atomic operations on AIX.
Add a new register class `g8prc` for these instructions, since these instructions require even-odd register pair.

Reviewed By: nemanjai, jsji, #powerpc

Differential Revision: https://reviews.llvm.org/D103010
2021-06-15 01:56:10 +00:00
Huihui Zhang 1c096bf09f [SVE][LSR] Teach LSR to enable simple scaled-index addressing mode generation for SVE.
Currently, Loop strength reduce is not handling loops with scalable stride very well.

Take loop vectorized with scalable vector type <vscale x 8 x i16> for instance,
(refer to test/CodeGen/AArch64/sve-lsr-scaled-index-addressing-mode.ll added).

Memory accesses are incremented by "16*vscale", while induction variable is incremented
by "8*vscale". The scaling factor "2" needs to be extracted to build candidate formula
i.e., "reg(%in) + 2*reg({0,+,(8 * %vscale)}". So that addrec register reg({0,+,(8*vscale)})
can be reused among Address and ICmpZero LSRUses to enable optimal solution selection.

This patch allow LSR getExactSDiv to recognize special cases like "C1*X*Y /s C2*X*Y",
and pull out "C1 /s C2" as scaling factor whenever possible. Without this change, LSR
is missing candidate formula with proper scaled factor to leverage target scaled-index
addressing mode.

Note: This patch doesn't fully fix AArch64 isLegalAddressingMode for scalable
vector. But allow simple valid scale to pass through.

Reviewed By: sdesmalen

Differential Revision: https://reviews.llvm.org/D103939
2021-06-14 16:42:34 -07:00
Piotr Sobczak e0c382a9d5 [AMDGPU] Limit runs of fixLdsBranchVmemWARHazard
The code in fixLdsBranchVmemWARHazard looks for patterns of a vmem/lds
access followed by a branch, followed by an lds/vmem access.

The handling of the hazard requires an arbitrary number of instructions
to process. In the worst case where a function has a vmem access, but no lds
accesses, all instructions are examined only to conclude that the hazard
cannot occur.

Add a pre-processing stage which detects whether both lds and vmem are
present in the function, and only then does the more costly search.

This patch significantly improves compilation time in the cases the hazard
cannot happen. In one pathological case I looked at, IsHazardInst is needlessly
called 88.6 million times.

The numbers could also be improved by introducing a map around the
inner calls to ::getWaitStatesSince in fixLdsBranchVmemWARHazard, but
nothing will beat not running fixLdsBranchVmemWARHazard at all in the cases
detected by shouldRunLdsBranchVmemWARHazardFixup().

Differential Revision: https://reviews.llvm.org/D104219
2021-06-14 22:30:23 +02:00
Saleem Abdulrasool 8c8dbc1082 X86: pass swift_async context in R14 on Win64
Pass swift_async context in a callee-saved register rather than as a
regular parameter.  This is similar to the Swift `self` and `error`
parameters.
2021-06-14 11:02:21 -07:00
Fraser Cormack c75e454cb9 [RISCV] Transform unaligned RVV vector loads/stores to aligned ones
This patch adds support for loading and storing unaligned vectors via an
equivalently-sized i8 vector type, which has support in the RVV
specification for byte-aligned access.

This offers a more optimal path for handling of unaligned fixed-length
vector accesses, which are currently scalarized. It also prevents
crashing when `LegalizeDAG` sees an unaligned scalable-vector load/store
operation.

Future work could be to investigate loading/storing via the largest
vector element type for the given alignment, in case that would be more
optimal on hardware. For instance, a 4-byte-aligned nxv2i64 vector load
could be loaded as nxv4i32 instead of as nxv16i8.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D104032
2021-06-14 18:12:18 +01:00
zhijian 7ed515d168 [AIX][XCOFF] emit vector info of traceback table.
Summary:

emit vector info of traceback table.

Reviewers: Jason Liu,Hubert Tong
Differential Revision: https://reviews.llvm.org/D93659
2021-06-14 11:15:22 -04:00
Jingu Kang 08ce52ef5e [AArch64] Improve SAD pattern
Given a vecreduce_add node, detect the below pattern and convert it to the node
sequence with UABDL, [S|U]ABD and UADDLP.

i32 vecreduce_add(
 v16i32 abs(
   v16i32 sub(
    v16i32 [sign|zero]_extend(v16i8 a), v16i32 [sign|zero]_extend(v16i8 b))))
=================>
i32 vecreduce_add(
  v4i32 UADDLP(
    v8i16 add(
      v8i16 zext(
        v8i8 [S|U]ABD low8:v16i8 a, low8:v16i8 b
      v8i16 zext(
        v8i8 [S|U]ABD high8:v16i8 a, high8:v16i8 b
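
A hedged C sketch showing that the rewritten sequence still computes the sum of absolute differences (plain loops stand in for the vector ops; the names and values are illustrative):

```
#include <assert.h>
#include <stdint.h>

int main(void) {
  uint8_t a[16], b[16];
  for (int i = 0; i < 16; ++i) {
    a[i] = (uint8_t)(i * 17);
    b[i] = (uint8_t)(255 - i * 9);
  }

  /* Original form: vecreduce_add(abs(sub(extend a, extend b))). */
  uint32_t direct = 0;
  for (int i = 0; i < 16; ++i) {
    int32_t d = (int32_t)a[i] - (int32_t)b[i];
    direct += (uint32_t)(d < 0 ? -d : d);
  }

  /* Rewritten form: UABD on each half, widen, then accumulate. */
  uint32_t rewritten = 0;
  for (int half = 0; half < 2; ++half)
    for (int i = 0; i < 8; ++i) {
      uint8_t p = a[half * 8 + i], q = b[half * 8 + i];
      rewritten += (uint16_t)(p > q ? p - q : q - p); /* UABD + widen */
    }

  assert(direct == rewritten);
  return 0;
}
```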

Differential Revision: https://reviews.llvm.org/D104042
2021-06-14 15:48:51 +01:00
Eric Astor f09e200b31 [ms] [llvm-ml] When parsing MASM, "jmp short" instructions are case insensitive
Handle "short" in a case-insensitive fashion in MASM.

Required to correctly parse z_Windows_NT-586_asm.asm from the OpenMP runtime.

Reviewed By: thakis

Differential Revision: https://reviews.llvm.org/D104195
2021-06-13 18:36:00 -04:00
Eric Astor 56edcbc2ad Fix misspelled instruction in X86 assembly parser
Did not correctly handle "jecxz short <address>".

Discovered while working on LLVM-ML; shows up in z_Windows_NT-586_asm.asm from the OpenMP runtime.

Reviewed By: MaskRay

Differential Revision: https://reviews.llvm.org/D104194
2021-06-13 18:34:15 -04:00
LemonBoy 5be3a1a064 [SPARC] Legalize truncation and extension between fp128 and half
Lower truncations and extensions between fp128 and half values into libcalls.
Expand truncating stores into two separate operations: a truncation and a
store.

Reviewed By: jrtc27

Differential Revision: https://reviews.llvm.org/D104185
2021-06-13 20:05:15 +02:00
David Green bee2f618d5 [ARM] Introduce t2WhileLoopStartTP
This adds t2WhileLoopStartTP, similar to the t2DoLoopStartTP added in
D90591. It keeps a reference to both the tripcount register and the
element count register, so that the ARMLowOverheadLoops pass in the
backend can pick the correct one without having to search for it from
the operand of a VCTP.

Differential Revision: https://reviews.llvm.org/D103236
2021-06-13 13:55:34 +01:00
Kristina Bessonova f6b9836b09 [ARM][NEON] Combine base address updates for vld1Ndup intrinsics
Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D103836
2021-06-13 11:18:32 +02:00
Luo, Yuanke 5be314f79b [X86] Check immediate before get it.
For a CMP immediate instruction, when operand 1 is a symbol address, we should
check whether it is an immediate first. Here is the example code:
`CMP64mi32 $noreg, 8, killed renamable $rcx, @d, $noreg, @a, implicit-def
$eflags`
Many thanks to Craig Topper for the test case to reproduce this issue.
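
A small sketch of the guard, assuming an illustrative helper rather than the
exact X86 code: check MachineOperand::isImm() before calling getImm(), since
the operand may be a global-address operand such as `@a` above.

```
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"

// Guard against treating a symbol-address operand as an immediate.
bool getCmpImmediate(const llvm::MachineInstr &MI, unsigned OpIdx,
                     int64_t &Imm) {
  const llvm::MachineOperand &MO = MI.getOperand(OpIdx);
  if (!MO.isImm())
    return false; // e.g. a global address like @a in the MIR above
  Imm = MO.getImm();
  return true;
}
```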

Differential Revision: https://reviews.llvm.org/D104037
2021-06-13 15:40:52 +08:00
Luo, Yuanke 1e72b9d52f Revert "[X86] Check immediate before get it."
This reverts commit 9eb2f723c2.
2021-06-13 13:55:38 +08:00
Luo, Yuanke 9eb2f723c2 [X86] Check immediate before get it.
For a CMP immediate instruction, when operand 1 is a symbol address, we should
check whether it is an immediate first. Here is the example code:
`CMP64mi32 $noreg, 8, killed renamable $rcx, @d, $noreg, @a, implicit-def
$eflags`
Many thanks to Craig Topper for the test case to reproduce this issue.

Differential Revision: https://reviews.llvm.org/D104037
2021-06-13 09:08:40 +08:00
Craig Topper c997867dc0 [X86] Add ISD::FREEZE and ISD::AssertAlign to the list of opcodes that don't guarantee upper 32 bits are zero.
The freeze issue was reported here
https://llvm.discourse.group/t/bug-or-feature-freeze-instruction/3639

I don't have a test for AssertAlign. I just noticed it was missing
and assume it should be similar to the other two Asserts.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D104178
2021-06-12 09:52:29 -07:00
Florian Hahn 5cd66420cc
Revert "[X86FixupLEAs] Transform the sequence LEA/SUB to SUB/SUB"
This reverts commit 1b748faf2b because it
breaks building the llvm-test-suite with -verify-machineinstrs on X86:
http://green.lab.llvm.org/green/job/test-suite-verify-machineinstrs-x86_64-O3/9585/

Running llc -verify-machineinstrs on X86 crashes on the IR below:

    target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"

    %struct.widget = type { i32, i32, i32, i32, i32*, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [16 x [16 x i16]], [6 x [32 x i32]], [16 x [16 x i32]], [4 x [12 x [4 x [4 x i32]]]], [16 x i32], i8**, i32*, i32***, i32**, i32, i32, i32, i32, %struct.baz*, %struct.wobble.1*, i32, i32, i32, i32, i32, i32, %struct.quux.2*, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x i32], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32***, i32***, i32****, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, [3 x [2 x i32]], [3 x [2 x i32]], i32, i32, i64, i64, %struct.zot.3, %struct.zot.3, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
    %struct.baz = type { i32, i32, i32, i32, i32, i32, i32, i32, i32, %struct.snork*, %struct.wombat.0*, %struct.wobble*, i32, i32*, i32*, i32*, i32, i32*, i32*, i32*, i32 (%struct.widget*, %struct.eggs*)*, i32, i32, i32, i32 }
    %struct.snork = type { %struct.spam*, %struct.zot, i32 (%struct.wombat*, %struct.widget*, %struct.snork*)* }
    %struct.spam = type { i32, i32, i32, i32, i8*, i32 }
    %struct.zot = type { i32, i32, i32, i32, i32, i8*, i32* }
    %struct.wombat = type { i32, i32, i32, i32, i32, i32, i32, i32, void (i32, i32, i32*, i32*)*, void (%struct.wombat*, %struct.widget*, %struct.zot*)* }
    %struct.wombat.0 = type { [4 x [11 x %struct.quux]], [2 x [9 x %struct.quux]], [2 x [10 x %struct.quux]], [2 x [6 x %struct.quux]], [4 x %struct.quux], [4 x %struct.quux], [3 x %struct.quux] }
    %struct.quux = type { i16, i8 }
    %struct.wobble = type { [2 x %struct.quux], [4 x %struct.quux], [3 x [4 x %struct.quux]], [10 x [4 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [5 x %struct.quux]], [10 x [15 x %struct.quux]], [10 x [15 x %struct.quux]] }
    %struct.eggs = type { [1000 x i8], [1000 x i8], [1000 x i8], i32, i32, i32, i32, i32, i32, i32, i32 }
    %struct.wobble.1 = type { i32, [2 x i32], i32, i32, %struct.wobble.1*, %struct.wobble.1*, i32, [2 x [4 x [4 x [2 x i32]]]], i32, i64, i64, i32, i32, [4 x i8], [4 x i8], i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
    %struct.quux.2 = type { i32, i32, i32, i32, i32, %struct.quux.2* }
    %struct.zot.3 = type { i64, i16, i16, i16 }

    define void @blam(%struct.widget* %arg, i32 %arg1) local_unnamed_addr {
    bb:
      %tmp = load i32, i32* undef, align 4
      %tmp2 = sdiv i32 %tmp, 6
      %tmp3 = sdiv i32 undef, 6
      %tmp4 = load i32, i32* undef, align 4
      %tmp5 = icmp eq i32 %tmp4, 4
      %tmp6 = select i1 %tmp5, i32 %tmp3, i32 %tmp2
      %tmp7 = getelementptr inbounds [4 x [4 x i32]], [4 x [4 x i32]]* undef, i64 0, i64 0, i64 0
      %tmp8 = zext i16 undef to i32
      %tmp9 = zext i16 undef to i32
      %tmp10 = load i16, i16* undef, align 2
      %tmp11 = zext i16 %tmp10 to i32
      %tmp12 = zext i16 undef to i32
      %tmp13 = zext i16 undef to i32
      %tmp14 = zext i16 undef to i32
      %tmp15 = load i16, i16* undef, align 2
      %tmp16 = zext i16 %tmp15 to i32
      %tmp17 = zext i16 undef to i32
      %tmp18 = sub nsw i32 %tmp8, %tmp9
      %tmp19 = shl nsw i32 undef, 1
      %tmp20 = add nsw i32 %tmp19, %tmp18
      %tmp21 = sub nsw i32 %tmp11, %tmp12
      %tmp22 = shl nsw i32 undef, 1
      %tmp23 = add nsw i32 %tmp22, %tmp21
      %tmp24 = sub nsw i32 %tmp13, %tmp14
      %tmp25 = shl nsw i32 undef, 1
      %tmp26 = add nsw i32 %tmp25, %tmp24
      %tmp27 = sub nsw i32 %tmp16, %tmp17
      %tmp28 = shl nsw i32 undef, 1
      %tmp29 = add nsw i32 %tmp28, %tmp27
      %tmp30 = sub nsw i32 %tmp20, %tmp29
      %tmp31 = sub nsw i32 %tmp23, %tmp26
      %tmp32 = shl nsw i32 %tmp30, 1
      %tmp33 = add nsw i32 %tmp32, %tmp31
      store i32 %tmp33, i32* undef, align 4
      %tmp34 = mul nsw i32 %tmp31, -2
      %tmp35 = add nsw i32 %tmp34, %tmp30
      store i32 %tmp35, i32* undef, align 4
      %tmp36 = select i1 %tmp5, i32 undef, i32 undef
      br label %bb37

    bb37:                                             ; preds = %bb
      %tmp38 = load i32, i32* undef, align 4
      %tmp39 = ashr i32 %tmp38, %tmp6
      %tmp40 = load i32, i32* undef, align 4
      %tmp41 = sdiv i32 %tmp39, %tmp40
      store i32 %tmp41, i32* undef, align 4
      ret void
    }
2021-06-12 11:41:38 +01:00
Florian Hahn e087b4f149
Revert "[X86FixupLEAs] Sub register usage of LEA dest should block LEA/SUB optimization"
This reverts commit f35bcea1d4 because it
depends on 1b748faf2b, which breaks
building the llvm-test-suite with -verify-machineinstrs on X86.

See 154adc0f135cff3f8a8861c335d2b88c8049d098 for more details.
2021-06-12 11:40:47 +01:00
madhur13490 c27e8141b3 [AMDGPU][IndirectCalls] Fix register usage propagation for indirect/external calls
This patch computes the maximum number of SGPRs and VGPRs used by the module
in the presence of indirect calls and makes that the register requirement for
functions/kernels which make indirect calls.

This patch also refactors code in AMDGPUSubTarget.cpp, adding "base" variants
of getMaxNumSGPRs that are used by the MachineFunction version and a new
Function version.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D103636
2021-06-12 11:59:34 +05:30
Matt Arsenault a845dc1e56 AMDGPU/GlobalISel: Remove leftover hack for argument memory sizes
Since the call lowering code now tries to respect the tablegen
reported argument types, this is no longer necessary.
2021-06-11 13:45:25 -04:00
Matt Arsenault 6dd54dada3 AMDGPU/GlobalISel: Fix indentation 2021-06-11 13:45:25 -04:00
Guozhi Wei f35bcea1d4 [X86FixupLEAs] Sub register usage of LEA dest should block LEA/SUB optimization
In function searchALUInst, sub register usage of LEA dest should also block LEA/SUB optimization, otherwise the sub register usage gets an undefined value.

This patch fixes https://bugs.llvm.org/show_bug.cgi?id=50615.

Differential Revision: https://reviews.llvm.org/D103922
2021-06-11 09:45:56 -07:00
Simon Pilgrim 61cdaf66fe [ADT] Remove APInt/APSInt toString() std::string variants
<string> is currently the highest impact header in a clang+llvm build:

https://commondatastorage.googleapis.com/chromium-browser-clang/llvm-include-analysis.html

One of the most common places this is being included is the APInt.h header, which needs it for an old toString() implementation that returns std::string - an inefficient method compared to the SmallString versions that it actually wraps.

This patch replaces these APInt/APSInt methods with a pair of llvm::toString() helpers inside StringExtras.h, adjusts users accordingly and removes the <string> from APInt.h - I was hoping that more of these users could be converted to use the SmallString methods, but it appears that most end up creating a std::string anyhow. I avoided trying to use the raw_ostream << operators as well as I didn't want to lose having the integer radix explicit in the code.
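
A short usage sketch of the change; the exact signatures are assumed from the
summary above, so treat them as approximate.

```
#include "llvm/ADT/APInt.h"
#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/StringExtras.h"
#include <string>

void printValue(const llvm::APInt &V) {
  // Preferred: format into a SmallString, avoiding a heap allocation for
  // small values.
  llvm::SmallString<16> Buf;
  V.toString(Buf, /*Radix=*/16, /*Signed=*/false);

  // Replacement for the removed APInt::toString(): a free function in
  // StringExtras.h, for when a std::string is genuinely needed.
  std::string S = llvm::toString(V, /*Radix=*/16, /*Signed=*/false);
  (void)Buf;
  (void)S;
}
```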

Differential Revision: https://reviews.llvm.org/D103888
2021-06-11 13:19:15 +01:00
Zarko Todorovski c1bb75febe [PowerPC] Allow wa inline asm to also accept floating point arguments
GCC documentation for the `wa` constraint states that:
```
wa

    A VSX register (VSR), vs0…vs63. This is either an FPR (vs0…vs31 are f0…f31)
    or a VR (vs32…vs63 are v0…v31).
```
This technically means that we could accept floating point parameters. In fact,
gcc itself does. The following testcase compiles and runs on all PPC platforms with GCC,
whereas clang/llc will assert:
```
#include <stdio.h>
double foo ( vector double a ) {
  double b, c;
  asm("xvabsdp  %x0, %x2        \n"
             "xxsldwi  %x1, %x0, %x0, 2 \n"
      :  "+wa"    (b),
         "=wa"    (c)
      :  "wa"    (a)
      );
  return b+c;
}
int main(void) {
  vector double a = {-3., -4.};
  double t = foo( a );
  printf("%g\n", t);
}
```
This patch allows clang/llc to build and run this testcase.

Reviewed By: nemanjai, #powerpc

Differential Revision: https://reviews.llvm.org/D103409
2021-06-11 07:19:10 -04:00
Koutheir Attouchi 789708617d Do not generate calls to the 128-bit function __multi3() on 32-bit ARM
Re-applying this patch after bots failures. Should be fine now.

The function __multi3() is undefined on 32-bit ARM, so a call to it should
never be emitted. Instead, plain instructions need to be generated to
perform 128-bit multiplications.

Differential Revision: https://reviews.llvm.org/D103906
2021-06-11 11:45:21 +01:00
Rosie Sumpter d7c219a506 [CostModel][AArch64] Improve the cost estimate of CTPOP intrinsic
Added a case for CTPOP to AArch64TTIImpl::getIntrinsicInstrCost so that
the cost estimate matches the codegen in
test/CodeGen/AArch64/arm64-vpopcnt.ll

Differential Revision: https://reviews.llvm.org/D103952
2021-06-11 11:15:46 +01:00
Bing1 Yu 56d5c46b49 [X86] Support __tile_stream_loadd intrinsic for new AMX interface
Adding support for __tile_stream_loadd intrinsic.

Reviewed By: LuoYuanke

Differential Revision: https://reviews.llvm.org/D103784
2021-06-11 17:28:43 +08:00
Simon Pilgrim 5e6bfb661e [Analysis] Pass RecurrenceDescriptor as const reference. NFCI.
We were passing the RecurrenceDescriptor by value to most of the reduction analysis methods, despite it being rather bulky with TrackingVH members (that can be costly to copy). In all these cases we're only using the RecurrenceDescriptor for rather basic purposes (access to types/kinds etc.).

Differential Revision: https://reviews.llvm.org/D104029
2021-06-11 10:24:14 +01:00
Qiu Chaofan bc104fdcec [PowerPC] Relax register superclasses for paired memops
Relaxing the superclass constraint for VSX register classes helps reduce
32-byte spills and copies when register pressure is high.

In the affected test cases, some introduce more copies due to the new
allocation order. However, this patch should not be the root cause, and we may
be able to fix that in other places of register allocation.

Reviewed By: nemanjai

Differential Revision: https://reviews.llvm.org/D104006
2021-06-11 14:54:03 +08:00
Hsiangkai Wang 643b6407fa [RISCV] Avoid scalar outgoing arguments overwriting vector frame objects.
When using FP to access stack objects, the scalable stack objects will
be put at the lower end of the frame. It looks like

```
|-------------------|  <-- FP
| callee-saved regs |
|-------------------|
| scalar local vars |
|-------------------|
| RVV local vars    |
|-------------------|  <-- SP
```

If there are scalar arguments that need to be passed through memory and there
are vector objects on the stack accessed via FP, the outgoing scalar arguments
will overwrite the vector objects. It looks like

```
|-------------------|  <-- FP
| callee-saved regs |
|-------------------|
| scalar local vars |
|-------------------|         |-------------------|
| RVV local vars    |         | outgoing args     | <- outgoing arguments
|-------------------|  <-- SP |-------------------|    overwrite from here.
```

In this patch, we reserve stack space for the outgoing arguments before
function calls if FP is used for access and there are scalable vector frame
objects. It looks like

```
|-------------------|  <-- FP
| callee-saved regs |
|-------------------|
| scalar local vars |
|-------------------|
| RVV local vars    |
|-------------------|
| outgoing args     |
|-------------------|  <-- SP
```

Differential Revision: https://reviews.llvm.org/D103622
2021-06-11 12:26:29 +08:00
Craig Topper 420bd5ee8e [RISCV] Use ComputeNumSignBits/MaskedValueIsZero in RISCVDAGToDAGISel::selectSExti32/selectZExti32.
This helps us select W instructions in more cases. Most of the
affected tests have had the sign_extend_inreg or AND folded into
sextload/zextload.

Differential Revision: https://reviews.llvm.org/D104079
2021-06-10 19:06:45 -07:00
Amara Emerson 670edf3ee0 [AArch64][GlobalISel] Fix incorrectly generating uxtw/sxtw for addressing modes.
When the extend is from 8 or 16 bits, the addressing modes don't support those
extensions, but we weren't checking that and therefore always generated the 32->64b
extension mode. Fun.

Differential Revision: https://reviews.llvm.org/D104070
2021-06-10 16:59:39 -07:00
Jessica Paquette 933df6ca79 [AArch64][GlobalISel] Legalize scalar G_CTTZ + G_CTTZ_ZERO_UNDEF
This adds legalization for scalar G_CTTZ and G_CTTZ_ZERO_UNDEF. Vector support
requires handling vector G_BITREVERSE, which I haven't gotten around to yet.

For G_CTTZ_ZERO_UNDEF, we just lower it to G_CTTZ.

For G_CTTZ, we match SelectionDAG's lowering to a G_BITREVERSE + G_CTLZ.

e.g. https://godbolt.org/z/nPEseYh1s

(With this patch, we have slightly worse codegen than SDAG for types smaller
than s32; it seems like we're missing a combine.)
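
The lowering relies on the identity cttz(x) == ctlz(bitreverse(x)) for
non-zero x; a quick self-contained check of that identity (an illustration
only, not GlobalISel code):

```
#include <cassert>
#include <cstdint>

// Reverse the bits of a 32-bit value.
static uint32_t bitreverse32(uint32_t X) {
  uint32_t R = 0;
  for (int I = 0; I < 32; ++I)
    R |= ((X >> I) & 1u) << (31 - I);
  return R;
}

int main() {
  // Counting trailing zeros equals counting leading zeros of the bit-reversed
  // value, which is what the G_BITREVERSE + G_CTLZ lowering exploits.
  for (uint32_t X : {1u, 2u, 0x80u, 0x40000000u, 0xdeadbeefu})
    assert(__builtin_ctz(X) == __builtin_clz(bitreverse32(X)));
  return 0;
}
```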

Also, this adds a function to build G_BITREVERSE to MachineIRBuilder.

Differential Revision: https://reviews.llvm.org/D104065
2021-06-10 15:29:51 -07:00
David Green 5d5b686f6b [ARM] Fix Changed status in MVEGatherScatterLoweringPass.
Now that we are calling SimplifyInstructionsInBlock, make sure we update
Changed when it reports alterations.
2021-06-10 21:53:04 +01:00
David Green e0c605f638 [ARM] Ensure instructions are simplified prior to GatherScatter lowering.
Surprisingly, not all instructions are always simplified after unrolling
and before MVE gather/scatter lowering. Notably dead gather operations
can be left around which cause the gather/scatter lowering pass to crash
if there are multiple gathers, some of which are dead.

This patch ensures they are simplified before we modify anything, which
can change some of the existing tests, including making them no-longer
test what they originally tested. This uses a combination of disabling
the gather/scatter lowering pass and adjusting the test to keep them as
before.

Differential Revision: https://reviews.llvm.org/D103150
2021-06-10 20:18:12 +01:00
Jessica Paquette 1b894ccdc9 [AArch64][GlobalISel] Mark some G_BITREVERSE types as legal + select them
We fall back on G_CTTZ_ZERO_UNDEF a lot when building clang for arm64 with
gisel.

Handling this will require that we can handle G_BITREVERSE.

This patch marks G_BITREVERSE instructions with natively supported types as
legal. We get selection on these types for free via the importer.

Differential Revision: https://reviews.llvm.org/D103999
2021-06-10 10:33:52 -07:00
Benjamin Kramer 3dceffd0fd [AArch64] Silence fallthrough warning. NFC.
AArch64TargetTransformInfo.cpp:302:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
  default:
    ^
2021-06-10 17:23:37 +02:00
Luo, Yuanke 63233da723 [X86][NFC] Fix typo. 2021-06-10 22:49:11 +08:00
Irina Dobrescu de79919e9e [AArch64] Add cost tests for bitreverse
This patch includes cost tests for bit reverse as well as some adjustments to the cost model.

Differential Revision: https://reviews.llvm.org/D102755
2021-06-10 14:51:33 +01:00
David Green 9872551ca0 [ARM] Skip debug during vpt block creation
Debug info is currently preventing VPT block creation, leading to
different codegen. This patch attempts to skip any debug instructions
during vpt block creation, making sure they do not interfere.

Differential Revision: https://reviews.llvm.org/D103610
2021-06-10 14:49:04 +01:00
Timm Bäder a9e4f91adf [llvm][PPC] Add missing case for 'I' asm memory operands
From https://llvm.org/docs/LangRef.html#asm-template-argument-modifiers:

I: Print the letter ‘i’ if the operand is an integer constant,
otherwise nothing. Used to print ‘addi’ vs ‘add’ instructions.
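
A hedged example in the spirit of the GCC docs rather than this patch's tests:
with an "ri" constraint the same template prints `add` for a register operand
and `addi` for an integer-constant operand. It only assembles when targeting
PowerPC.

```
// Build for a PowerPC target, e.g. clang --target=powerpc64le-linux-gnu.
long add_reg_or_imm(long A, long B) {
  long R;
  asm("add%I2 %0,%1,%2" : "=r"(R) : "r"(A), "ri"(B));
  return R;
}
```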

Differential Revision: https://reviews.llvm.org/D103968
2021-06-10 12:52:50 +02:00
David Spickett 64de8763aa Revert "Implementation of global.get/set for reftypes in LLVM IR"
This reverts commit 31859f896c.

Causing SVE and RISCV-V test failures on bots.
2021-06-10 10:11:17 +00:00
Simon Pilgrim 4eb47e3cd4 [TargetLowering] getABIAlignmentForCallingConv - pass DataLayout by const reference. NFCI.
Avoid unnecessary copies and match every other method in TargetLowering that takes DataLayout as an argument.
2021-06-10 10:55:24 +01:00
Paulo Matos 31859f896c Implementation of global.get/set for reftypes in LLVM IR
This change implements new DAG nodes GLOBAL_GET/GLOBAL_SET, and lowering
methods for loads and stores of reference types from IR globals. Once the
lowering creates the new nodes, tablegen patterns match those and convert them
to Wasm global.get/set.

Reviewed By: tlively

Differential Revision: https://reviews.llvm.org/D95425
2021-06-10 10:07:45 +02:00
Martin Storsjö 99653702fd Revert "[AArch64LoadStoreOptimizer] Generate more STPs by renaming registers earlier"
This reverts commit d96ea46629, as it
caused various misoptimizations, see https://reviews.llvm.org/D103597
for discussion on the issues.
2021-06-10 10:30:13 +03:00
hsmahesha f6632f11ed [AMDGPU] Fix missing lowering of LDS used in global scope.
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D103431
2021-06-10 08:40:01 +05:30
Jinsong Ji 4a89ed373c [AIX] Add traceback ssp canary bit support
We will need to set the ssp canary bit in the traceback table to communicate
with the unwinder about the canary.

Reviewed By: #powerpc, shchenz

Differential Revision: https://reviews.llvm.org/D103202
2021-06-10 02:40:02 +00:00
Craig Topper 8dfd0810f2 [RISCV] Remove unused method from RISCVInsertVSETVLI. NFC
If this becomes needed, it's trivial to add it back.
2021-06-09 15:35:26 -07:00
Nico Weber 68a1d9a1f5 Revert "Do not generate calls to the 128-bit function __multi3() on 32-bit ARM"
This reverts commit 64e9aa3302.
Breaks check-llvm everywhere, see https://reviews.llvm.org/D103906
2021-06-09 13:21:05 -04:00
Koutheir Attouchi 64e9aa3302 Do not generate calls to the 128-bit function __multi3() on 32-bit ARM
The function __multi3() is undefined on 32-bit ARM, so a call to it
should never be emitted. Instead, plain instructions need to be
generated to perform 128-bit multiplications.

Differential Revision: https://reviews.llvm.org/D103906
2021-06-09 16:21:16 +01:00
Craig Topper 765ef4bb2a [X86] Check destination element type before forming VTRUNCS/VTRUNCUS in combineTruncateWithSat.
Fixes crash reported here https://reviews.llvm.org/D73607

Using a store to keep the trunc intact. Returning v16i24 would
cause the trunc to be optimized away in SelectionDAGBuilder.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D103940
2021-06-09 07:08:17 -07:00
Yvan Roux 6c78dbd4ca [ARM] Fix Machine Outliner LDRD/STRD handling in Thumb mode.
This is a fix for PR50481

Immediate values for AddrModeT2_i8s4 are already scaled in the MCInst operand.
This patch changes the number of bits and the scale factor to reflect that
state when checking the stack offset status. AddrModeT2_i7s[2|4] also has this
particularity, but since MVE instructions are not outlined, these cases are
simply moved to the unhandled ones.

Differential Revision: https://reviews.llvm.org/D103167
2021-06-09 15:37:21 +02:00
Simon Pilgrim 630820bafc [X86][SLM] Adjust XMM non-PMULLD throughput costs to half rate.
Match what's reported in the costs table, Agner's tables and the Intel AOM.
2021-06-09 13:51:40 +01:00
Meera Nakrani d96ea46629 [AArch64LoadStoreOptimizer] Generate more STPs by renaming registers earlier
Our initial motivating case was memcpy's with alignments > 16. The
loads/stores, to which small memcpy's expand, are kept together in
several places so that we get a sequence like this for a 64 bit copy:
LD w0
LD w1
ST w0
ST w1
The load/store optimiser can generate a LDP/STP w0, w1 from this because
the registers read/written are consecutive. In our case however, the
sequence is optimised during ISel, resulting in:
LD w0
ST w0
LD w0
ST w0
This instruction reordering allows reuse of registers. Since the registers
are no longer consecutive (i.e. they are the same), it inhibits LDP/STP
creation. The approach here is to perform renaming:
LD w0
ST w0
LD w1
ST w1
to enable the folding of the stores into a STP. We do not yet generate
the LDP due to a limitation in the renaming implementation, but plan to
look at that in a follow-up so that we fully support this case. While
this was initially motivated by certain memcpy's, this is a general
approach and thus is beneficial for other cases too, as can be seen
in some test changes.

Differential Revision: https://reviews.llvm.org/D103597
2021-06-09 11:25:26 +00:00
Fraser Cormack 502edebd9d [ValueTypes][RISCV] Cap RVV fixed-length vectors by size
This patch changes RVV's policy for its supported list of fixed-length
vector types by capping by vector size rather than element count. Now
all 1024-byte vectors (of supported element types) are supported, rather
than all 256-element vectors.

This is a more natural fit for the architecture, and allows us to, for
example, improve the support for vector bitcasts.

This change necessitated adding some new simple types to avoid
"regressing" on the number of currently-supported vectors. We round out
the 1024-byte types by adding `v512i8`, `v1024i8`, `v512i16` and
`v512f16`.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D103884
2021-06-09 12:15:37 +01:00
Fraser Cormack e8f1f89103 [RISCV] Support CONCAT_VECTORS on scalable masks
This patch is a simple fix which registers CONCAT_VECTORS as
custom-lowered for scalable mask vectors. This follows the pattern of
all other scalable-vector types, as the default expansion of
CONCAT_VECTORS cannot handle scalable types, and even if it did it'd go
through the stack and generate worse code.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D103896
2021-06-09 09:07:44 +01:00
Kai Luo bf58600bad [PowerPC] Make sure the first probe is full size or is the last probe when stack is realigned
When `-fstack-clash-protection` is enabled and the stack has to be realigned, some parts of the redzone are written prior to the probe, so the probe might overwrite content already written in the redzone. To avoid this, we have to make sure the first probe is a full probe size or is the last probe so that we can skip the redzone.

It also fixes an ABI violation on PPC where `r1` isn't updated atomically.

This fixes https://bugs.llvm.org/show_bug.cgi?id=49903.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D100290
2021-06-09 06:35:35 +00:00
Jim Lin 242ddd5089 [RISCV][NFC] Add a single space after comma for VType
In most cases, there is a single space after the comma in assembly operands.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D103790
2021-06-09 11:18:22 +08:00
Kai Luo c87c294397 [PowerPC][Dwarf] Assign MMA register's dwarf register number to negative value
According to the ELF V2 ABI, `0` should be the dwarf number of `r0`. Currently, MMA's registers also use `0` as their dwarf number; this confuses `RegisterInfoEmitter` and generates a wrong dwarf -> llvm mapping.
```
extern const MCRegisterInfo::DwarfLLVMRegPair PPCDwarfFlavour1Dwarf2L[] = {
  { 0U, PPC::VSRp31 },
```
This leads to wrong cfi output in https://reviews.llvm.org/D100290.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D103761
2021-06-09 02:24:01 +00:00
Brendon Cahoon 294efbbd3e Reland "[AMDGPU] Add gfx1013 target"
This reverts commit 211e584fa2.

Fixed a use-after-free error that caused the sanitizers to fail.
2021-06-08 21:15:35 -04:00
Jonas Paulsson 8b32e25bc2 [SystemZ] Return true from convertSetCCLogicToBitwiseLogic for scalar integer.
Review: Ulrich Weigand
2021-06-08 16:27:28 -05:00
Jonas Paulsson d5e4f28c0a [SystemZ] Return true from isMaskAndCmp0FoldingBeneficial().
Return true if the mask is a constant uint of 2 bytes, in which case TMLL is
available.

Review: Ulrich Weigand
2021-06-08 15:42:46 -05:00
Brendon Cahoon 211e584fa2 Revert "[AMDGPU] Add gfx1013 target"
This reverts commit ea10a86984.

A sanitizer buildbot reports an error.
2021-06-08 16:29:41 -04:00
David Green d7853bae94 [ARM] Generate VDUP(Const) from constant buildvectors
If we cannot otherwise use a VMOVimm/VMOVFPimm/VMVNimm, fall back to
producing a VDUP(const) as opposed to a constant pool load. This will at
least be smaller codesize and can allow the VDUP to be folded into other
instructions.

Differential Revision: https://reviews.llvm.org/D103808
2021-06-08 20:51:33 +01:00
Michael Liao 27332968d8 [amdgpu] Add `-enable-ocl-mangling-mismatch-workaround`.
- Add `-enable-ocl-mangling-mismatch-workaround` to work around the
  mismatch on OCL name mangling so far.

Reviewed By: yaxunl, rampitec

Differential Revision: https://reviews.llvm.org/D103920
2021-06-08 15:42:27 -04:00
Nick Desaulniers 3787ee4571 reland [IR] make -stack-alignment= into a module attr
Relands commit 433c8d950c with fixes for
MIPS.

Similar to D102742, specifying the stack alignment via CodegenOpts means
that this flag gets dropped during LTO, unless the command line is
re-specified as a plugin opt. Instead, encode this information as a
module level attribute so that we don't have to expose this llvm
internal flag when linking the Linux kernel with LTO.
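
A minimal sketch of recording such a module-level attribute; the flag key
below is an assumption taken from the patch title, so check the patch for the
exact spelling.

```
#include "llvm/IR/Module.h"

// Record the requested stack alignment on the module so it survives LTO,
// instead of keeping it only in the codegen options.
void recordStackAlignment(llvm::Module &M, unsigned AlignInBytes) {
  // "override-stack-alignment" is the assumed key name.
  M.addModuleFlag(llvm::Module::Error, "override-stack-alignment",
                  AlignInBytes);
}
```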

Looks like external dependencies might need a fix:
* https://github.com/llvm-hs/llvm-hs/issues/345
* https://github.com/halide/Halide/issues/6079

Link: https://github.com/ClangBuiltLinux/linux/issues/1377

Reviewed By: tejohnson

Differential Revision: https://reviews.llvm.org/D103048
2021-06-08 10:59:46 -07:00
Simon Pilgrim 01b77159e3 PPCISelLowering.cpp - don't dereference a dyn_cast<>.
dyn_cast<> can return nullptr which we would then dereference - use cast<> which will assert that the type is correct.
2021-06-08 17:59:05 +01:00
Brendon Cahoon ea10a86984 [AMDGPU] Add gfx1013 target
Differential Revision: https://reviews.llvm.org/D103663
2021-06-08 12:49:49 -04:00
Craig Topper 8b4c80d380 Further improve register allocation for vwadd(u).wv, vwsub(u).wv, vfwadd.wv, and vfwsub.wv.
The first source has the same EEW as the destination, but we're
using earlyclobber which prevents them from ever being the same
register. This patch attempts to work around this.

-For unmasked .wv, add a special TIED pseudo that pretends like
 the first operand and the destination must be the same register. This
 disables the earlyclobber for that source. Mark the instruction
 as convertible to 3 address form which will switch it to the
 original untied pseudo when the TwoAddressInstructionPass decides
 that keeping them tied would require an extra copy. This uses
 code in RISCVInstrInfo.cpp to do the conversion to the untied
 opcode.

The untie test cases show that we can generate the untied version.
Not sure it was profitable to do it in this case, but they have
really simple IR.

Reviewed By: arcbbb

Differential Revision: https://reviews.llvm.org/D103552
2021-06-08 09:43:43 -07:00
Craig Topper c57bce9cc5 [RISCV] Remove ForceTailAgnostic flag from vmv.s.x, vfmv.s.f and reductions.
In 0.9 these were defined to leave elements other than 0 in the
destination unmodified. They were changed to use the tail policy
in 0.10. I missed that update.

I assume no one has noticed because in-order cores treat tail
agnostic the same as tail undisturbed. I believe Spike and QEMU do
the same.

Reviewed By: arcbbb, frasercrmck

Differential Revision: https://reviews.llvm.org/D103736
2021-06-08 09:22:40 -07:00
Nick Desaulniers a596b54d47 Revert "[IR] make -stack-alignment= into a module attr"
This reverts commit 433c8d950c.

Breaks the MIPS build.
2021-06-08 08:55:50 -07:00
Nick Desaulniers 433c8d950c [IR] make -stack-alignment= into a module attr
Similar to D102742, specifying the stack alignment via CodegenOpts means
that this flag gets dropped during LTO, unless the command line is
re-specified as a plugin opt. Instead, encode this information as a
module level attribute so that we don't have to expose this llvm
internal flag when linking the Linux kernel with LTO.

Looks like external dependencies might need a fix:
* https://github.com/llvm-hs/llvm-hs/issues/345
* https://github.com/halide/Halide/issues/6079

Link: https://github.com/ClangBuiltLinux/linux/issues/1377

Reviewed By: tejohnson

Differential Revision: https://reviews.llvm.org/D103048
2021-06-08 08:31:04 -07:00
Simon Moll 2b626aba44 [VE][NFC] IRBuilder<> -> IRBuilderBase
VE's TTI broke with the switch from IRBuilder<> to IRBuilderBase.
Follow that change so it compiles again.
2021-06-08 13:55:49 +02:00
Kerry McLaughlin 5db52751a5 [CostModel] Return an invalid cost for memory ops with unsupported types
Fixes getTypeConversion to return `TypeScalarizeScalableVector` when a scalable vector
type cannot be legalized by widening/splitting. When this is the method of legalization
found, getTypeLegalizationCost will return an Invalid cost.

The getMemoryOpCost, getMaskedMemoryOpCost & getGatherScatterOpCost functions already call
getTypeLegalizationCost and will now also return an Invalid cost for unsupported types.
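
A simplified sketch of the resulting behaviour, assuming a made-up helper
shape rather than the actual AArch64 TTI code: once legalization would have to
scalarize a scalable vector, the memory-op cost comes back as Invalid.

```
#include "llvm/Support/InstructionCost.h"

// If the only legalization found is scalarizing a scalable vector, report an
// invalid cost instead of a misleading finite one.
llvm::InstructionCost
memoryOpCostSketch(bool MustScalarizeScalableVector,
                   llvm::InstructionCost LegalizationCost) {
  if (MustScalarizeScalableVector)
    return llvm::InstructionCost::getInvalid();
  return LegalizationCost;
}
```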

Reviewed By: sdesmalen, david-arm

Differential Revision: https://reviews.llvm.org/D102515
2021-06-08 12:07:36 +01:00