This is an alternative to D120395 and D120411.
Previously we used `__bfloat16` as a typedef of `unsigned short`. The
name may give users the impression that it is a brand new type representing
BF16, so they may use it in arithmetic operations and we have no good way
to block that.
To solve the problem, we introduced `__bf16` into the X86 psABI and landed
the support in Clang via D130964. Now we can fix the issue by switching the
intrinsics to the new type.
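A minimal usage sketch (assuming the AVX-512 BF16 intrinsic `_mm_cvtness_sbh` and the `-mavx512bf16 -mavx512vl` flags; not taken from the patch):
```
#include <immintrin.h>

// With the old typedef (__bfloat16 == unsigned short), the result of a BF16
// intrinsic could silently be used in integer arithmetic on the raw bits.
// With __bf16 the intrinsic returns a real BF16 storage type, so such misuse
// can be diagnosed by the compiler.
__bf16 to_bf16(float f) {
  return _mm_cvtness_sbh(f); // convert one float to BF16
}
```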
Reviewed By: LuoYuanke, RKSimon
Differential Revision: https://reviews.llvm.org/D132329
[This Godbolt link](https://godbolt.org/z/s17Kv1s9T) shows different codegen between clang and gcc for a transpose operation.
clang result:
```
vmovdqu xmm0, xmmword ptr [rcx + rax]
vmovdqu xmm1, xmmword ptr [rcx + rax + 16]
vmovdqu xmm2, xmmword ptr [r8 + rax]
vmovdqu xmm3, xmmword ptr [r8 + rax + 16]
vpunpckhbw xmm4, xmm2, xmm0
vpunpcklbw xmm0, xmm2, xmm0
vpunpcklbw xmm2, xmm3, xmm1
vpunpckhbw xmm1, xmm3, xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm4
```
gcc result:
```
vmovdqu ymm3, YMMWORD PTR [rdi+rax]
vpunpcklbw ymm1, ymm3, YMMWORD PTR [rsi+rax]
vpunpckhbw ymm0, ymm3, YMMWORD PTR [rsi+rax]
vperm2i128 ymm2, ymm1, ymm0, 32
vperm2i128 ymm1, ymm1, ymm0, 49
vmovdqu YMMWORD PTR [rcx+rax*2], ymm2
vmovdqu YMMWORD PTR [rcx+32+rax*2], ymm1
```
clang's code is roughly 15% slower than gcc's when evaluated on an internal compression benchmark.
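The kernel being vectorized is a byte interleave along these lines (a hedged sketch of the access pattern, not the exact benchmark source):
```
// Interleave two byte arrays: even output bytes come from a, odd ones from b.
// The loop vectorizer widens this into the 64-element shufflevector below.
void interleave(const unsigned char *a, const unsigned char *b,
                unsigned char *out, int n) {
  for (int i = 0; i < n; ++i) {
    out[2 * i] = a[i];
    out[2 * i + 1] = b[i];
  }
}
```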
The loop vectorizer generates the following shufflevector instruction:
```
%interleaved.vec = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
```
which is lowered to SelectionDAG:
```
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t6: v64i8 = concat_vectors t2, undef:v32i8
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t7: v64i8 = concat_vectors t4, undef:v32i8
t8: v64i8 = vector_shuffle<0,64,1,65,2,66,3,67,4,68,5,69,6,70,7,71,8,72,9,73,10,74,11,75,12,76,13,77,14,78,15,79,16,80,17,81,18,82,19,83,20,84,21,85,22,86,23,87,24,88,25,89,26,90,27,91,28,92,29,93,30,94,31,95> t6, t7
```
So far this `vector_shuffle` is good enough for us to pattern-match and transform, but as we go down the SelectionDAG pipeline, it gets split into smaller shuffles. During dagcombine1, the shuffle is split by `foldShuffleOfConcatUndefs`.
```
// shuffle (concat X, undef), (concat Y, undef), Mask -->
// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t19: v32i8 = vector_shuffle<0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47> t2, t4
t15: ch,glue = CopyToReg t0, Register:v32i8 $ymm0, t19
t20: v32i8 = vector_shuffle<16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63> t2, t4
t17: ch,glue = CopyToReg t15, Register:v32i8 $ymm1, t20, t15:1
```
With `foldShuffleOfConcatUndefs` commented out, the vector is still split later by the type legalizer, which comes after dagcombine1, because v64i8 is not a legal type with AVX2 (64 * 8 = 512 bits while ymm = 256 bits). There doesn't seem to be a good way to avoid this split. Lowering the `vector_shuffle` into unpck and perm during dagcombine1 is too early. Therefore, although somewhat inconvenient, we decided to go with pattern-matching a pair of vector shuffles later in the SelectionDAG pipeline, as part of `lowerV32I8Shuffle`.
The code looks at the two operands of the first shuffle it encounters, iterates through the users of the operands, and tries to find two shuffles that are consecutive interleaves. Once the pattern is found, it lowers them into unpcks and perms. It returns the perm for the shuffle that's currently being lowered (letting ISel modify the DAG) and replaces the other shuffle in place.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D134477
In the Linux PIC model, there are four cases of value/label addressing:
Case 1: Function call or Label jmp inside the module.
Case 2: Data access (such as global variable, static variable) inside the module.
Case 3: Function call or Label jmp outside the module.
Case 4: Data access (such as global variable) outside the module.
Because the current LLVM inline asm architecture is designed not to "recognize"
the asm code, it is quite troublesome to treat memory addressing differently for
the same value/address used in different instructions.
For example, in the PIC model, a function call may go through the PLT or be
directly PC-relative, while lea/mov of a function address may use the GOT.
This patch fixes/refines case 1 and case 2 in inline asm.
Since inline asm currently does not support jumping to an outside label, this
patch mainly focuses on fixing the function call addressing bugs in inline asm.
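A hedged illustration of case 1 and case 2 in inline asm (the operand modifier and constraints are common GCC-style idioms, shown only to highlight the addressing difference; they are not taken from the patch):
```
static void local_func(void) {}
static int local_data;

void demo(void) {
  // Case 1: a function call inside the module should be emitted PC-relative
  // (or via the PLT), not through a GOT load of the address.
  asm volatile("call %P0" : : "i"(local_func));
  // Case 2: a data access inside the module; the compiler picks the
  // appropriate PIC addressing for the memory operand.
  int tmp;
  asm volatile("movl %1, %0" : "=r"(tmp) : "m"(local_data));
}
```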
Reviewed By: Pengfei, RKSimon
Differential Revision: https://reviews.llvm.org/D133914
Allows us to use combineX86ShuffleChainWithExtract to combine targetshuffle(low_subvector(x),high_subvector(x)) -> low_subvector(targetshuffle(x)) style patterns
This is currently very limited (it must have a v2i64/v2f64 result), but while triaging I noticed we might be able to extend this to allow more types for targets with suitable variable cross lane shuffle support.
Fixes #58339
Fix crash issue of D129537 and reopen it.
Currently the X86 shuffle lowering widens the element type of a shuffle when
adjacent mask elements are consecutive. For the example below:
```
%t2 = add nsw <16 x i32> %t0, %t1
%t3 = sub nsw <16 x i32> %t0, %t1
%t4 = shufflevector <16 x i32> %t2, <16 x i32> %t3,
      <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4,
                  i32 5, i32 6, i32 7, i32 8, i32 9, i32 10,
                  i32 11, i32 12, i32 13, i32 14, i32 15>
ret <16 x i32> %t4
```
the compiler transforms the shuffle into
```
%t4 = shufflevector <8 x i64> %t2, <8 x i64> %t3,
      <8 x i32> <i32 8, i32 1, i32 2, i32 3, i32 4,
                 i32 5, i32 6, i32 7>
```
This can lose the opportunity for ISel to select a masked instruction when
AVX512 is enabled.
This patch prevents the transform when the AVX512 feature is enabled.
Thanks to Simon for the idea.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130830
Clang may optimize conditional tailcall blocks with the following layout:
```
cmp <condition>
je tailcall_target
ret
```
When retpoline is in place, indirect calls are converted into direct calls to a retpoline thunk. When these indirect calls are tail calls, they may be subject to the optimization described above (there is no indirect JCC, but since the jump is now direct it can be made conditional). The above layout is non-ideal for the Linux kernel scenario because the branches into thunks may be patched back into indirect branches at runtime depending on the underlying CPU features, which would not be feasible if the binary were emitted with the optimized layout above.
Thus, prevent clang from emitting it if CodeModel is Kernel.
Feature request from the respective kernel mailing list: https://lore.kernel.org/llvm/Yv3uI%2FMoJVctmBCh@worktop.programming.kicks-ass.net/
Reviewed By: nickdesaulniers, pengfei
Differential Revision: https://reviews.llvm.org/D134915
This stops reporting CostPerUse 1 for `R8`-`R15` and `XMM8`-`XMM31`.
This was previously done because instructions using these registers require a
REX prefix, resulting in longer instruction encodings. I found that this
regresses the quality of the register allocation, as the costs impose an
ordering on eviction candidates. I also feel that there is a bit of an
impedance mismatch: the actual costs occur when encoding instructions that use
those registers, but the order of VReg assignments is not primarily ordered by
the number of Defs+Uses.
I did extensive measurements with the llvm-test-suite with SPEC2006 +
SPEC2017 included; internal services showed similar patterns. Generally
there are a lot of improvements but also a lot of regressions, but on
average the allocation quality seems to improve at the cost of a small
code size regression.
Results for measuring static and dynamic instruction counts:
Dynamic Counts (scaled by execution frequency) / Optimization Remarks:
Spills+FoldedSpills -5.6%
Reloads+FoldedReloads -4.2%
Copies -0.1%
Static / LLVM Statistics:
regalloc.NumSpills mean -1.6%, geomean -2.8%
regalloc.NumReloads mean -1.7%, geomean -3.1%
size..text mean +0.4%, geomean +0.4%
Static / LLVM Statistics:
regalloc.NumSpills mean -2.2%, geomean -3.1%
regalloc.NumReloads mean -2.6%, geomean -3.9%
size..text mean +0.6%, geomean +0.6%
Static / LLVM Statistics:
regalloc.NumSpills mean -3.0%
regalloc.NumReloads mean -3.3%
size..text mean +0.3%, geomean +0.3%
Differential Revision: https://reviews.llvm.org/D133902
In combineOr (X86ISelLowering.cpp) there is a DAG combine that rewrites
a "(0 - SetCC) | C" pattern into something simpler, given that an LEA
can be used. Another requirement is that C has some specific value,
for example 1 or 7. When checking those requirements the code used a
32-bit unsigned variable to store the value of C. So for a 64-bit OR
this could miscompile in case any of the 32 most significant bits in
C were non-zero.
This patch fixes the bug by using a sufficiently large type for the
C value.
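A hedged illustration of the bug class (not the actual LLVM code):
```
#include <cstdint>

// Storing the 64-bit constant C in a 32-bit variable silently drops the high
// bits, so a check like "C == 1 || C == 7" can pass for a value that is not
// actually 1 or 7, and the rewrite then miscompiles the 64-bit OR.
bool looksLikeSeven(uint64_t C) {
  uint32_t Truncated = static_cast<uint32_t>(C); // high 32 bits lost
  return Truncated == 7; // true for C == 0x100000007, which is not 7
}
```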
The faulty code seems to have been introduced by commit 9bceb8981d
(D131358).
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D134892
This has the advantage of dealing with live EFLAGS, using LEA instead of
SUB when needed to avoid clobbering them. It also respects the "lea-sp" feature.
We could allow unrolled stack probing from blocks with live EFLAGS if
canUseAsEpilogue learns when emitStackProbeInlineGeneric will be used.
Differential Revision: https://reviews.llvm.org/D134495
The mul by constant costmodels handle power-of-2 constants, but not negated-power-of-2, despite the backends handling both.
This patch adds the OperandValueProperties::OP_NegatedPowerOf2 enum and wires it for use for basic mul cost analysis and SLP handling.
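A small illustration of the case being costed (hedged; not from the patch itself):
```
// A multiply by a negated power of two is typically lowered as a shift plus a
// negate, so its cost should track the positive power-of-two case rather than
// a generic multiply.
long long mulByNeg8(long long x) {
  return x * -8; // usually becomes -(x << 3)
}
```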
Fixes #50778
Differential Revision: https://reviews.llvm.org/D111968
This mainly just adds costs for the targets where we have actual funnelshift/rotate instructions (VBMI2/XOP etc.) - the cases where we expand still need addressing, although for many the default shift+or expansion, especially for uniform cases, isn't that bad.
This was achieved with the 'cost-tables vs llvm-mca' script D103695
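For reference, rotates are a common source of these intrinsics (a small sketch using a clang builtin; not part of the patch):
```
// A rotate becomes llvm.fshl with both value operands equal; targets with
// native rotate/funnel-shift instructions (XOP, VBMI2, ...) get cheap costs,
// the rest pay for the shift+or expansion.
unsigned rotl32(unsigned x, unsigned n) {
  return __builtin_rotateleft32(x, n);
}
```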
Add costs for the funnel shift instructions - fixes some discrepancies I was hitting with cost numbers from the 'cost-tables vs llvm-mca' script D103695
The change adds support for the case when the return value is passed in
memory rather than in registers.
Differential Revision: https://reviews.llvm.org/D134181
The LEA optimization pass visits each basic block of a given machine
function. In each basic block, for each pair of LEAs that differ only
in their displacement fields, we replace all uses of the second LEA
with the first LEA while adjusting the displacement.
Now, without this patch, after all the replacements are made, the
following assert triggers:
```
assert(MRI->use_empty(LastVReg) &&
       "The LEA's def register must have no uses");
```
The replacement loop uses:
```
for (MachineOperand &MO :
     llvm::make_early_inc_range(MRI->use_operands(LastVReg))) {
```
which is equivalent to:
```
for (auto UI = MRI->use_begin(LastVReg), UE = MRI->use_end();
     UI != UE;) {
  MachineOperand &MO = *UI++; // <-- Look!
```
That is, immediately after the post-increment, make_early_inc_range
already holds the iterator for the next iteration.
The problem is that in one iteration of the loop, we could replace two
uses in a debug instruction like:
DBG_VALUE_LIST !"r", !DIExpression(DW_OP_LLVM_arg, 0), %0:gr64, %0:gr64, ...
So, the iterator for the next iteration becomes invalid. We end up
traversing a garbage use list from that point on. In turn, we don't
get to visit remaining uses.
The patch fixes the problem by switching to a "draining" while loop:
```
while (!MRI->use_empty(LastVReg)) {
  MachineOperand &MO = *MRI->use_begin(LastVReg);
  MachineInstr &MI = *MO.getParent();
```
The credit goes to Simon Pilgrim for reducing the test case.
Fixes https://github.com/llvm/llvm-project/issues/57673
Differential Revision: https://reviews.llvm.org/D133631
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (and recent fixes to the bdver2 + alderlake models)
Adding full CostKinds costs affects some other tests, as they make assumptions about SizeLatency costs, so those need addressing first
Although unsupported on HSW, we reuse this model for KNL which does require them
Noticed when running the cost model fuzz script from D103695 with -mcpu=knl
These were based off a mixture of vector integer add/sub costs and the numbers from the 'cost-tables vs llvm-mca' script from D103695 - the extra costs for different predicates are still proving tricky to implement, but I've gotten most costs to within +/-1 now - the AVX512 ones are tricky as we still don't handle predicate results properly, so most of these were done by hand.
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
These are the worst case generic vector shift costs, where nothing is known about the shift amounts - in particular this should stop us using the default sizelatency cost of 1 for so many pre-AVX2 vector shifts that can often actually expand during lowering to 20+ uops, just for 128-bit vectors, resulting in some horrible inline/unroll decisions.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Vector shift by const uniform is the cheapest shift instruction we have; non-const uniform shifts have a marginally higher cost - some targets 'splat' the amount internally to use the shift-per-element instruction, others see a higher cost for the explicit zeroing of the upper bits for the (64-bit) shift amount.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Corrects the shift-by-constant costs to better account for them being converted to multiplies during lowering - which demonstrates that we should probably be trying harder NOT to convert these to multiplies for some CPUs (v4i32 in particular).
__declspec(safebuffers) is equivalent to
__attribute__((no_stack_protector)). This information is recorded in
CodeView.
While we are here, add support for strict_gs_check.
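For reference, a minimal usage sketch (assuming clang with -fms-extensions; not taken from the patch):
```
// __declspec(safebuffers) disables stack protection for this one function,
// equivalent to __attribute__((no_stack_protector)); the choice is recorded
// in the CodeView debug info.
__declspec(safebuffers) void copy_untrusted(char *dst, const char *src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i];
}
```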
I'm planning to deprecate and eventually remove llvm::empty.
I thought about replacing llvm::empty(x) with std::empty(x), but it
turns out that all uses can be converted to x.empty(). That is, no
use requires the ability of std::empty to accept C arrays and
std::initializer_list.
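A before/after sketch of the mechanical change (illustrative only):
```
#include <vector>

// Illustrative only: every remaining use of llvm::empty(x) can be written as
// x.empty(), since none of them pass a C array or std::initializer_list.
bool isDone(const std::vector<int> &Worklist) {
  return Worklist.empty(); // previously: llvm::empty(Worklist)
}
```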
Differential Revision: https://reviews.llvm.org/D133677
For remainder:
If (1 << (BitWidth / 2)) % Divisor == 1, we can add the high and low halves
together and use a (BitWidth / 2) urem. If (BitWidth / 2) is a legal integer
type, this urem will be expanded by DAGCombiner using a multiply by magic
constant. We do have to take into account that adding high and low
together can produce a carry, making it a (BitWidth / 2)+1 bit number.
So we need to also add back in the carry from the first addition.
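A hedged sketch of the expansion for a 64-bit urem by 3 (the names and the concrete divisor are illustrative):
```
#include <cstdint>

// (1ull << 32) % 3 == 1, so x % 3 == ((x >> 32) + (x & 0xffffffff)) % 3 once
// the carry from the 32-bit addition is folded back in. The final 32-bit urem
// can then be expanded with a multiply by magic constant.
uint64_t urem3(uint64_t x) {
  uint64_t sum = (x & 0xffffffffu) + (x >> 32); // up to 33 bits
  sum = (sum & 0xffffffffu) + (sum >> 32);      // add the carry back in
  return static_cast<uint32_t>(sum) % 3;
}
```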
For division:
We can use the above trick to compute the remainder, subtract that
remainder from the dividend, then multiply by the multiplicative
inverse of the Divisor modulo (1 << BitWidth).
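A matching sketch for the division side (again with divisor 3 on 64 bits; the inverse constant is specific to that choice):
```
#include <cstdint>

// Compute the remainder with the summing trick, subtract it so the dividend
// becomes an exact multiple of 3, then multiply by the multiplicative inverse
// of 3 modulo 2^64 (0xAAAAAAAAAAAAAAAB), which recovers the quotient.
uint64_t udiv3(uint64_t x) {
  uint64_t sum = (x & 0xffffffffu) + (x >> 32);
  sum = (sum & 0xffffffffu) + (sum >> 32);
  uint64_t rem = static_cast<uint32_t>(sum) % 3;
  return (x - rem) * 0xAAAAAAAAAAAAAAABull;
}
```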
This is based on the section "Remainder by Summing Digits" in
Hacker's Delight.
The remainder trick is similar to a trick you may have learned for
determining if a decimal number is divisible by 3. You can add all the
digits together and see if the sum is divisible by 3. If you're not sure
if the sum is divisible by 3, you can add its digits together. This
can be repeated until you have a single decimal digit. If that digit
is 3, 6, or 9, then the original number is divisible by 3. This works
because 10 % 3 == 1.
gcc already does this same trick. There are additional tricks gcc
does for urem as well as srem, udiv, and sdiv that I plan to add in
future patches.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130862
Also remove the new-pass-manager version of ExpandLargeDivRem because there is
not yet a way to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
They shouldn't be happening after XOP shift costs - AVX2 shift support takes preference over XOP for everything but vXi8 shifts - the improvement is pretty limited as it only affects bdver4 targets, but it does help clean up a fraction of the messy shift cost logic...
Noticed while trying to get vector ctpop/ctlz/cttz costs fixed using the script from D103695 - all of these are full-rate but the throughput costs were weirdly high for bdver2
Matches AMD 15h SoG, Agner and instlatx64
Noticed while trying to get vector shifts costs fixed using the script from D103695 - all of these are full-rate but the throughput costs were weirdly high for bdver2
Matches AMD 15h SoG, Agner and instlatx64