A target can report whether a misaligned access is 'fast', as defined
by the target, or not. In reality there can be different levels
of 'fast' and 'slow'. This patch changes the boolean 'Fast'
argument of the allowsMisalignedMemoryAccesses family of functions
to an unsigned representing its speed.
A target can still define it however it wants; the direct translation
of the current code uses 0 and 1 for the previous false and true,
which makes the change NFC.
A subsequent patch will start using an actual speed value in
the load/store vectorizer to check whether a vectorized access will
be not just fast, but no slower than before.
Differential Revision: https://reviews.llvm.org/D124217
We've been shipping implementations of these with a soft-float ABI since MacOS
10.10 in 2014 and there's evidence they're in binaries now, so we can't easily
switch to %xmm0.
This emits special libcalls with casts in place to restore the soft-float ABI
for __truncdfhf2, __truncsfhf2, and __extendhfsf2.
This was done as a test for D137302, and it makes sense to push these changes.
Reviewed By: dblaikie
Differential Revision: https://reviews.llvm.org/D137493
If the inner broadcast scalar type is smaller than or the same width as the outer broadcast scalar type, then we can broadcast using the same inner type directly. This works for vbroadcast_load as well.
On AVX512, extract legal bool vectors as bool subvectors before bitcasting to scalars, to avoid spilling to the stack.
This helps Rust, which internally represents bool vectors as bool arrays.
It also exposes more missed opportunities to use the KADD instruction to add masks together before moving to a GPR.
Fixes #58546
Extends the existing anyextend fold to make use of the implicit zero-extension of the movd instruction.
This also helps replace some nasty xmm->gpr->xmm traffic with a shuffle pattern instead.
Noticed while looking at D130953
This is an alternative of D120395 and D120411.
Previously we used `__bfloat16` as a typedef of `unsigned short`. The
name may give users the impression that it is a brand-new type representing
BF16, so they may use it in arithmetic operations, and we don't have
a good way to block that.
To solve the problem, we introduced `__bf16` into the X86 psABI and landed the
support in Clang in D130964. Now we can solve the problem by switching the
intrinsics to the new type.
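A hedged illustration of the problem with the old typedef (the typedef name below is a stand-in; exact intrinsic names omitted):
```cpp
// Because the old __bfloat16 was only a typedef of unsigned short, nothing
// stopped users from doing integer arithmetic on values they believed were
// BF16.
typedef unsigned short bfloat16_legacy; // stand-in for the old __bfloat16 typedef

bfloat16_legacy add_wrong(bfloat16_legacy a, bfloat16_legacy b) {
  return a + b; // compiles, but this is integer math, not BF16 math
}
// With the intrinsics switched to the real __bf16 type, such accidental
// integer arithmetic can be diagnosed or given proper BF16 semantics,
// depending on target support.
```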
Reviewed By: LuoYuanke, RKSimon
Differential Revision: https://reviews.llvm.org/D132329
[This Godbolt link](https://godbolt.org/z/s17Kv1s9T) shows different codegen between clang and gcc for a transpose operation.
clang result:
```
vmovdqu xmm0, xmmword ptr [rcx + rax]
vmovdqu xmm1, xmmword ptr [rcx + rax + 16]
vmovdqu xmm2, xmmword ptr [r8 + rax]
vmovdqu xmm3, xmmword ptr [r8 + rax + 16]
vpunpckhbw xmm4, xmm2, xmm0
vpunpcklbw xmm0, xmm2, xmm0
vpunpcklbw xmm2, xmm3, xmm1
vpunpckhbw xmm1, xmm3, xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm4
```
gcc result:
```
vmovdqu ymm3, YMMWORD PTR [rdi+rax]
vpunpcklbw ymm1, ymm3, YMMWORD PTR [rsi+rax]
vpunpckhbw ymm0, ymm3, YMMWORD PTR [rsi+rax]
vperm2i128 ymm2, ymm1, ymm0, 32
vperm2i128 ymm1, ymm1, ymm0, 49
vmovdqu YMMWORD PTR [rcx+rax*2], ymm2
vmovdqu YMMWORD PTR [rcx+32+rax*2], ymm1
```
clang's code is roughly 15% slower than gcc's when evaluated on an internal compression benchmark.
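For context, a hedged reconstruction of the kind of loop behind this codegen (the function name and exact indexing are assumptions, not the benchmark source):
```cpp
// Interleave two byte streams: dst[2*i] = a[i], dst[2*i+1] = b[i].
// The loop vectorizer turns the interleaved store into the shufflevector
// shown below.
void interleave_bytes(unsigned char *dst, const unsigned char *a,
                      const unsigned char *b, int n) {
  for (int i = 0; i < n; ++i) {
    dst[2 * i] = a[i];
    dst[2 * i + 1] = b[i];
  }
}
```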
The loop vectorizer generates the following shufflevector intrinsic:
```
%interleaved.vec = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
```
which is lowered to SelectionDAG:
```
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t6: v64i8 = concat_vectors t2, undef:v32i8
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t7: v64i8 = concat_vectors t4, undef:v32i8
t8: v64i8 = vector_shuffle<0,64,1,65,2,66,3,67,4,68,5,69,6,70,7,71,8,72,9,73,10,74,11,75,12,76,13,77,14,78,15,79,16,80,17,81,18,82,19,83,20,84,21,85,22,86,23,87,24,88,25,89,26,90,27,91,28,92,29,93,30,94,31,95> t6, t7
```
So far this `vector_shuffle` is good enough for us to pattern-match and transform, but as we go further down the SelectionDAG pipeline, it gets split into smaller shuffles. During dagcombine1, the shuffle is split by `foldShuffleOfConcatUndefs`.
```
// shuffle (concat X, undef), (concat Y, undef), Mask -->
// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t19: v32i8 = vector_shuffle<0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47> t2, t4
t15: ch,glue = CopyToReg t0, Register:v32i8 $ymm0, t19
t20: v32i8 = vector_shuffle<16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63> t2, t4
t17: ch,glue = CopyToReg t15, Register:v32i8 $ymm1, t20, t15:1
```
With `foldShuffleOfConcatUndefs` commented out, the vector is still split later by the type legalizer, which comes after dagcombine1, because v64i8 is not a legal type on AVX2 (64 * 8 = 512 bits while ymm = 256 bits). There doesn't seem to be a good way to avoid this split. Lowering the `vector_shuffle` into unpck and perm during dagcombine1 is too early. Therefore, although somewhat inconvenient, we decided to go with pattern-matching a pair of vector shuffles later in the SelectionDAG pipeline, as part of `lowerV32I8Shuffle`.
The code looks at the two operands of the first shuffle it encounters, iterates through the users of the operands, and tries to find two shuffles that are consecutive interleaves. Once the pattern is found, it lowers them into unpcks and perms. It returns the perm for the shuffle that is currently being lowered (letting ISel modify the DAG) and replaces the other shuffle in place.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D134477
In Linux PIC model, there are 4 cases about value/label addressing:
Case 1: Function call or Label jmp inside the module.
Case 2: Data access (such as global variable, static variable) inside the module.
Case 3: Function call or Label jmp outside the module.
Case 4: Data access (such as global variable) outside the module.
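A hedged C++ sketch of the four cases (names are illustrative; "inside the module" means defined in this executable/DSO, "outside" means provided by another DSO):
```cpp
// Illustration only; will not link without definitions for the extern names.
static int internal_data;          // Case 2: data access inside the module
extern int external_data;          // Case 4: data access outside the module
static void internal_func() {}     // Case 1: call/jump target inside the module
extern void external_func();       // Case 3: call/jump target outside the module

void use_all() {
  internal_data = 1;   // typically PC-relative (RIP-relative on x86-64)
  external_data = 2;   // typically goes through the GOT
  internal_func();     // direct PC-relative call
  external_func();     // typically goes through the PLT
}
```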
Because the current LLVM inline asm architecture is designed not to
"recognize" the asm code, it is quite troublesome for us to treat memory
addressing differently for the same value/address used in different instructions.
For example, in the PIC model, a function call may go through the PLT or be
directly PC-relative, while lea/mov of a function address may use the GOT.
This patch fixes/refines case 1 and case 2 in inline asm.
Because inline asm currently doesn't support jumping to an outside label, this
patch mainly focuses on fixing the function call addressing bugs in inline asm.
Reviewed By: Pengfei, RKSimon
Differential Revision: https://reviews.llvm.org/D133914
Allows us to use combineX86ShuffleChainWithExtract to combine targetshuffle(low_subvector(x),high_subvector(x)) -> low_subvector(targetshuffle(x)) style patterns
This is currently very limited (it must have a v2i64/v2f64 result), but while triaging I noticed we might be able to extend this to allow more types for targets with suitable variable cross lane shuffle support.
Fixes #58339
Fix the crash issue of D129537 and reopen it.
Currently the X86 shuffle lowering widens the element type for a
shuffle if the mask element values are adjacent. For the example below:
%t2 = add nsw <16 x i32> %t0, %t1
%t3 = sub nsw <16 x i32> %t0, %t1
%t4 = shufflevector <16 x i32> %t2, <16 x i32> %t3,
<16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4,
i32 5, i32 6, i32 7, i32 8, i32 9, i32 10,
i32 11, i32 12, i32 13, i32 14, i32 15>
ret <16 x i32> %t4
The compiler would transform the shuffle to
%t4 = shufflevector <8 x i64> %t2, <8 x i64> %t3,
<8 x i32> <i32 8, i32 1, i32 2, i32 3, i32 4,
i32 5, i32 6, i32 7>
This may lose the opportunity to let ISel select a mask instruction when
AVX512 is enabled.
This patch prevents the transform when the AVX512 feature is enabled.
Thanks to Simon for the idea.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130830
In combineOr (X86ISelLowering.cpp) there is a DAG combine that rewrites
a "(0 - SetCC) | C" pattern into something simpler, given that a LEA
can be used. Another requirement is that C has some specific value,
for example 1 or 7. When checking those requirements the code used a
32-bit unsigned variable to store the value of C. So for a 64-bit OR
this could miscompile in case any of the 32 most significant bits in
C were non-zero.
This patch fixes the bug by using a sufficiently large type for the
C value.
The faulty code seems to have been introduced by commit 9bceb8981d
(D131358).
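A hedged illustration (hypothetical source, not a test case from the patch) of a 64-bit case where truncating C to 32 bits changes its value:
```cpp
#include <cstdint>

// C = 0x100000001: its low 32 bits are 1, so storing C in a 32-bit variable
// makes the combine think the "C == 1" requirement is met even though the
// full 64-bit constant is different.
uint64_t or_with_setcc(uint64_t x) {
  uint64_t setcc = (x == 0);            // 0 or 1
  return (0 - setcc) | 0x100000001ULL;  // the "(0 - SetCC) | C" pattern
}
```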
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D134892
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
For remainder:
If (1 << (BitWidth / 2)) % Divisor == 1, we can add the high and low halves
together and use a (BitWidth / 2) urem. If (BitWidth / 2) is a legal integer
type, this urem will be expanded by DAGCombiner using a multiply by magic
constant. We do have to take into account that adding high and low
together can produce a carry, making it a (BitWidth / 2)+1 bit number.
So we need to also add back in the carry from the first addition.
For division:
We can use the above trick to compute the remainder, subtract that
remainder from the dividend, then multiply by the multiplicative
inverse of the Divisor modulo (1 << BitWidth).
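As a concrete, hedged C++ sketch of both tricks for Divisor == 3 and BitWidth == 64 (where (1 << 32) % 3 == 1 and 0xAAAAAAAAAAAAAAAB is the multiplicative inverse of 3 modulo 2^64); this is illustrative, not the in-tree expansion:
```cpp
#include <cstdint>

// Remainder: x % 3 via a 32-bit urem, because 2^32 % 3 == 1.
uint32_t urem3(uint64_t x) {
  uint32_t lo = static_cast<uint32_t>(x);
  uint32_t hi = static_cast<uint32_t>(x >> 32);
  uint64_t sum = static_cast<uint64_t>(lo) + hi;   // up to a 33-bit number
  // Fold the carry back in: 2^32 == 1 (mod 3). The folded sum fits in 32 bits.
  sum = static_cast<uint32_t>(sum) + (sum >> 32);
  return static_cast<uint32_t>(sum) % 3;           // 32-bit urem, expandable by magic multiply
}

// Division: subtract the remainder, then multiply by the multiplicative
// inverse of the divisor modulo 2^64.
uint64_t udiv3(uint64_t x) {
  return (x - urem3(x)) * 0xAAAAAAAAAAAAAAABULL;
}
```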
This is based on the section "Remainder by Summing Digits" in
Hacker's Delight.
The remainder trick is similar to a trick you may have learned for
determining if a decimal number is divisible by 3. You can add all the
digits together and see if the sum is divisible by 3. If you're not sure
if the sum is divisible by 3, you can add its digits together. This
can be repeated until you have a single decimal digit. If that digit
is 3, 6, or 9, then the original number is divisible by 3. This works
because 10 % 3 == 1.
gcc already does this same trick. There are additional tricks gcc
does for urem as well as srem, udiv, and sdiv that I plan to add in
future patches.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130862
Also remove the new-pass-manager version of ExpandLargeDivRem because there is no way
yet to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
LLVM contains a helpful function for getting the size of a C-style
array: `llvm::array_lengthof`. This is useful prior to C++17, but not as
helpful for C++17 or later: `std::size` already has support for C-style
arrays.
Change call sites to use `std::size` instead.
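A minimal before/after sketch (hypothetical array name):
```cpp
#include <cstddef>
#include <iterator>

static const int Table[] = {1, 2, 3, 4};

// Before (pre-C++17 helper): llvm::array_lengthof(Table)
// After  (C++17 standard):   std::size(Table)
constexpr std::size_t TableSize = std::size(Table);
static_assert(TableSize == 4, "std::size works on C-style arrays");
```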
Differential Revision: https://reviews.llvm.org/D133429
For now, clang and gcc both fail to generate the SAE version of _mm512_cvt_roundps_ph:
https://godbolt.org/z/oh7eTGY5z. The Intrinsics Guide description is also wrong and will be
updated soon.
Reviewed By: pengfei
Differential Revision: https://reviews.llvm.org/D132641
See if we can remove X86ISD::ANDNP nodes by checking the state of the knownbits of the demanded elements.
This is equivalent to the generic ISD::AND handling, but flips the bits of the LHS operand to account for ANDNP.
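The identity being exploited, as a tiny hedged scalar stand-in for the vector node:
```cpp
#include <cstdint>

// X86ISD::ANDNP computes (~LHS) & RHS, so the generic ISD::AND known-bits
// reasoning applies once the known bits of the LHS are flipped.
uint32_t andnp(uint32_t lhs, uint32_t rhs) { return ~lhs & rhs; }
```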
Differential Revision: https://reviews.llvm.org/D128216
The KCFI sanitizer, enabled with `-fsanitize=kcfi`, implements a
forward-edge control flow integrity scheme for indirect calls. It
uses a !kcfi_type metadata node to attach a type identifier for each
function and injects verification code before indirect calls.
Unlike the current CFI schemes implemented in LLVM, KCFI does not
require LTO, does not alter function references to point to a jump
table, and never breaks function address equality. KCFI is intended
to be used in low-level code, such as operating system kernels,
where the existing schemes can cause undue complications because
of the aforementioned properties. However, unlike the existing
schemes, KCFI is limited to validating only function pointers and is
not compatible with executable-only memory.
KCFI does not provide runtime support, but always traps when a
type mismatch is encountered. Users of the scheme are expected
to handle the trap. With `-fsanitize=kcfi`, Clang emits a `kcfi`
operand bundle to indirect calls, and LLVM lowers this to a
known architecture-specific sequence of instructions for each
callsite to make runtime patching easier for users who require this
functionality.
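Conceptually (a hedged sketch, not the emitted instruction sequence or the exact preamble layout), the injected check compares a type id stored just before the callee's entry against the expected id for the call's type:
```cpp
#include <cstdint>

using FnPtr = int (*)();

// Hedged sketch: the 32-bit type id lives in the callee's preamble; the
// offset and load shown here are illustrative, not the real layout.
int call_checked(FnPtr p, std::uint32_t expected_id) {
  auto addr = reinterpret_cast<std::uintptr_t>(p);
  auto actual =
      *reinterpret_cast<const std::uint32_t *>(addr - sizeof(std::uint32_t));
  if (actual != expected_id)
    __builtin_trap(); // KCFI traps on type mismatch; no runtime library
  return p();
}
```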
A KCFI type identifier is a 32-bit constant produced by taking the
lower half of xxHash64 from a C++ mangled typename. If a program
contains indirect calls to assembly functions, they must be
manually annotated with the expected type identifiers to prevent
errors. To make this easier, Clang generates a weak SHN_ABS
`__kcfi_typeid_<function>` symbol for each address-taken function
declaration, which can be used to annotate functions in assembly
as long as at least one C translation unit linked into the program
takes the function address. For example on AArch64, we might have
the following code:
```
.c:
int f(void);
int (*p)(void) = f;
p();
.s:
.4byte __kcfi_typeid_f
.global f
f:
...
```
Note that X86 uses a different preamble format for compatibility
with Linux kernel tooling. See the comments in
`X86AsmPrinter::emitKCFITypeId` for details.
As users of KCFI may need to locate trap locations for binary
validation and error handling, LLVM can additionally emit the
locations of traps to a `.kcfi_traps` section.
Similarly to other sanitizers, KCFI checking can be disabled for a
function with a `no_sanitize("kcfi")` function attribute.
Relands 67504c9549 with a fix for
32-bit builds.
Reviewed By: nickdesaulniers, kees, joaomoreira, MaskRay
Differential Revision: https://reviews.llvm.org/D119296
This patch adds a Type operand to the TLI isCheapToSpeculateCttz/isCheapToSpeculateCtlz callbacks, allowing targets to decide whether branches should occur on a type-by-type/legality basis.
For X86, this patch proposes to allow CTTZ speculation for i8/i16 types that will lower to promoted i32 BSF instructions by masking the operand above the msb (we already do something similar for i8/i16 TZCNT). This required a minor tweak to CTTZ lowering - if the src operand is known never zero (i.e. due to the promotion masking) we can remove the CMOV zero src handling.
Although BSF isn't very fast, most CPUs from the last 20 years don't do that bad a job with it, though there are some annoying pass-through EFLAGS dependencies. Additionally, now that we emit 'REP BSF' in most cases, we are tending towards assuming this will most likely be executed as a TZCNT instruction on any semi-modern CPU.
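The masking idea, as a hedged C++ sketch for the i8 case (it mirrors what is already done for i8/i16 TZCNT, not the exact DAG code):
```cpp
// OR in a bit just above the type's MSB so the value is never zero; a 32-bit
// BSF/TZCNT is then safe without a zero check, and a zero input yields 8,
// the bit width of the original type.
unsigned cttz8(unsigned char x) {
  return __builtin_ctz(static_cast<unsigned>(x) | 0x100u);
}
```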
Differential Revision: https://reviews.llvm.org/D132520
TargetLowering had the last two InstructionCost-related members, `getTypeLegalizationCost()`
and `getScalingFactorCost()`, but all other costs are processed in TTI.
E.g. it is not convenient to use other TTI members in these two functions
when they are overridden in a target.
Minor refactoring: `getTypeLegalizationCost()` no longer needs a DataLayout
parameter - it was always passed from TTI.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D117723
There are two different senses in which a block can be "address-taken".
There can be a BlockAddress involved, which means we need to map the
IR-level value to some specific block of machine code. Or there can be
constructs inside a function which involve using the address of a basic
block to implement certain kinds of control flow.
Mixing these together causes a problem: if target-specific passes are
marking random blocks "address-taken" and we have a BlockAddress, we
can't actually tell which MachineBasicBlock corresponds to the
BlockAddress.
So split this into two separate bits: one for BlockAddress, and one for
the machine-specific bits.
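The IR-level sense can be illustrated with a hedged GNU C++ computed-goto example; the machine-specific sense has no source-level equivalent and comes from backend constructs such as target-generated jump tables or similar control flow:
```cpp
// Taking a label's address with &&label (a GNU extension) creates a
// BlockAddress in IR: that value must map to one specific machine block.
int pick(int x) {
  void *target = x ? &&yes : &&no; // BlockAddress-style "address-taken"
  goto *target;
yes:
  return 1;
no:
  return 0;
}
```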
Discovered while trying to sort out related stuff on D102817.
Differential Revision: https://reviews.llvm.org/D124697
Previously, LegalizeDAG didn't check that mask.compress's passthrough might be a float, and this led to a getConstant crash since getConstant doesn't support FP types.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D131947
I had hoped to make this a generic fold in DAGCombine, but there are quite a few regressions in Thumb2 MVE that need addressing first.
Fixes regressions from D106675.