The isOnlyUserOf prevented the fold if the chain result had any
users. What we really care about is the the data result from the
AND is only used by the TEST, and the flags results from the ANDs
aren't used at all. It's ok if the chain has users, we just need
to replace those users with the chain from the TESTrm.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D131117
I can't remove the function just yet as it is used in the generated .inc files.
I would also like to provide a way to compare alignment with TypeSize since it came up a few times.
Differential Revision: https://reviews.llvm.org/D126910
This patch adds support for inline assembly address operands using the "p"
constraint on X86 and SystemZ.
This was in fact broken on X86 (see example at
https://reviews.llvm.org/D110267, Nov 23).
These operands should probably be treated the same as memory operands by
CodeGenPrepare, which have been commented with "TODO" there.
Review: Xiang Zhang and Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D122220
Allows us to fold XOR(X, MIN_SIGNED_VALUE) == ADD(X, MIN_SIGNED_VALUE) into LEA patterns
As mentioned on PR52267.
Differential Revision: https://reviews.llvm.org/D122815
Instead of taking a SkipDefs parameter, rename to getCondSrcNoFromDesc
and have it return the source operand number. Make getCondFromMI
responsible for adding the number of Defs for MI instructions.
While there remove some unneeded casts to unsigned and check for
negative numbers instead of explicitly -1. Less than 0 is easier
for a compiler to codegen.
Differential Revision: https://reviews.llvm.org/D122113
We favor 'and' and 'test' in earlier phases of optimization,
and that's usually the better option, but we can save a few
instruction bytes by converting a mask constant to a shift here.
Differential Revision: https://reviews.llvm.org/D121147
Optimize a pattern where a sequence of 8/16 or 32 bits is tested for
zero: LLVM normalizes this towards and `AND` with mask which is usually
good, but does not work well on X86 when the mask does not fit into a
64bit register. This DagToDAG peephole transforms sequences like:
```
movabsq $562941363486720, %rax # imm = 0x1FFFE00000000
testq %rax, %rdi
```
to
```
shrq $33, %rdi
testw %di, %di
```
The result has a shorter encoding and saves a register if the tested
value isn't used otherwise.
Differential Revision: https://reviews.llvm.org/D121320
After D118128 relaxed the heuristic to require only one EFLAGS generating operand, it now makes sense to avoid X86ISD::SMUL/UMULO duplication as well.
Differential Revision: https://reviews.llvm.org/D119578
As suggested by @craig.topper, relaxing LEA matching to only require the ADD to be fed from a single op with EFLAGS helps avoid duplication when the EFLAGS are consumed in a later, dependent instruction.
There was some concern about whether the heuristic is too simple, not taking into account lost loads that can't fold by using a LEA, but some basic tests (included in select-lea.ll) don't suggest that's really a problem.
Differential Revision: https://reviews.llvm.org/D118128
This is effectively inverting the transform added with D116804
because the downside of the false dependency of something like
"sbb %eax, %eax" is much greater than the upside of eliminating
a zeroing instruction on (all?) Intel CPUs.
Differential Revision: https://reviews.llvm.org/D118843
As noted in D116804, we want to effectively invert that patch
for CPUs (intel) that don't break the false dependency on
sbb %eax, %eax
So we will likely want to create that here in the
X86DAGToDAGISel::Select() case for X86::SETCC_CARRY.
This reverts commit fd4808887e.
This patch causes gcc to issue a lot of warnings like:
warning: base class ‘class llvm::MCParsedAsmOperand’ should be
explicitly initialized in the copy constructor [-Wextra]
The VINSERTF128 instruction is often much quicker, and never slower, than the more general VPERM2F128 instruction, so we should try to use that in more circumstances.
This requires a fallback to a commuted VPERM2F128 for the case where we need to fold the 256-bit vector source instead of the 128-bit subvector source.
There is one interesting side effect - DAGCombine's narrowExtractedVectorLoad combine gets called in a number of locations, this often creates an extracted subvector load without regard to other uses of the original wider load. I'm expecting AVX cpus to be capable of merging such aliased loads, but I do wonder whether narrowExtractedVectorLoad's call to X86TargetLowering::shouldReduceLoadWidth needs to be altered to check for more partial uses?
Noticed while investigating the quality of interleaved load/store codegen.
Differential Revision: https://reviews.llvm.org/D111960
The tests are based on the example from:
https://llvm.org/PR52032
I suspect that it looks worse than it actually is. :)
That is, llvm-mca says there's no uop/timing difference with the
load folding and pcmpeq vs. broadcast on Haswell (and probably
other targets).
The load-folding definitely makes the code smaller, so it's good
for that at least. So this requires carving a narrow hole in the
transform to get just this case without changing others that look
good as-is (in other words, the transform still seems good for
most examples).
Differential Revision: https://reviews.llvm.org/D112464
This can avoid a vector add and a constant pool load. Or an explicit broadcast in case of non-constant.
Also reverse the transform any time we encounter a constant index addend that can't be moved to base. In that case pull the constant from base into the index. This reduces code size needed for the displacement since we needed the index add anyway. Limit this to scale of 1 to avoid divisibility and wrap issues.
Authored by Craig.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D111595
Stop using APInt constructors and methods that were soft-deprecated in
D109483. This fixes all the uses I found in llvm, except for the APInt
unit tests which should still test the deprecated methods.
Differential Revision: https://reviews.llvm.org/D110807
Currently, we only deal with the case where we can match
the number of low bits to be kept, i.e.:
```
x & ((1 << y) - 1)
```
will extract low `y` bits of `x`.
But what will
```
x & (-1 >> y)
```
do?
Logically, it will extract `bitwidth(x) - y` low bits, i.e.:
```
x & ~(-1 << (bitwidth(x)-y))
```
... except we can't do such a transformation in IR in general,
because if we wanted to extract all the bits `(-1 >> 0)` is fine,
but `-1 << bitwidth(x)` would be `poison`: https://alive2.llvm.org/ce/z/BKJZfw,
Yet, here with BMI's BEXTR and BMI2's BZHI we don't have any such problems with edge-cases.
So what we can do is: https://alive2.llvm.org/ce/z/gm5M2B
As briefly discussed with @craig.topper, this appears to be not worse than what we'd end up with currently (a pair of shifts):
* https://godbolt.org/z/nsPb8bejs (direct data dependency, sequential execution)
* https://godbolt.org/z/7bj3zeh1d (no direct data dependency, parallel execution)
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D107923
This is a more general version of D109273. Though it doesn't
peek through bitcasts or rearange broadcasts.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D109295
This code tries to form a TEST from CMP+AND with an optional
truncate in between. If we looked through the truncate, we may
have extra bits in the AND mask that shouldn't participate in
the checks. Normally SimplifyDemendedBits takes care of this, but
the AND may have another user. So manually mask out any extra bits.
Fixes PR51175.
Differential Revision: https://reviews.llvm.org/D106634
We don't need to have the compare output a value and then copy it
to FPSW for use by FNSTSW. Instead we can just have the compare
output Glue and glue the FNSTSW to it. InstrEmitter effectively
performed this optimization when emitting the Machine IR. Doing
it directly simplifies the codes and reduces the work in
InstrEmitter. There's no change in the machine IR at the end of
isel before and after this change.
I've seen this in the RawSpeed's BitPumpMSB*::push() hotpath,
after fixing the buffer abstraction to a more sane one,
when looking into a +5% runtime regression.
I was hoping that this would fix it, but it does not look it does.
This seems to be at least not worse than the original pattern.
But i'm actually mainly interested in the case where we already
compute `(y+32)` (see last test),
https://alive2.llvm.org/ce/z/ZCzJio
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D101944
As X32 uses 32-bit pointers without having 32-bit indirect branch
instructions, we need to fix up indirect branches by extending the
branch targets to 64 bits. This was already done for BRIND but not yet
for NT_BRIND. The same logic works for both, so this applies that
existing logic to NT_BRIND as well.
Reviewed By: MaskRay
Differential Revision: https://reviews.llvm.org/D101499
Adding support for intrinsics of AMX-BF16.
This patch alse fix a bug that AMX-INT8 instructions will be selected with wrong
predicate.
Differential Revision: https://reviews.llvm.org/D97358
This is an optimized approach for D94155.
Previous code build the model that tile config register is the user of
each AMX instruction. There is a problem for the tile config register
spill. When across function, the ldtilecfg instruction may be inserted
on each AMX instruction which use tile config register. This cause all
tile data register clobber.
To fix this issue, we remove the model of tile config register. Instead,
we analyze the AMX instructions between one call to another. We will
insert ldtilecfg after the first call if we find any AMX instructions.
Reviewed By: LuoYuanke
Differential Revision: https://reviews.llvm.org/D95136
Revert the change to use APInt::isSignedIntN from
5ff5cf8e05.
Its clear that the games we were playing to avoid the topological
sort aren't working. So just fix it once and for all.
Fixes PR48888.