r289653 added a case where `vselect <cond> <vector1> <all-zeros>`
is transformed to:
`vselect xor(cond, DAG.getConstant(1, DL, CondVT) <all-zeros> <vector1>`
This was not aimed to catch cases where Cond is not a vXi1
mask but it does. Moreover, when Cond type is VxiN (N > 1)
then xor(cond, DAG.getConstant(1, DL, CondVT) != NOT(cond).
This patch changes the above to xor with allones, and avoids
entering the case for non-mask Conds.
llvm-svn: 291745
DAG patterns optimization: truncate + unsigned saturation supported by VPMOVUS* instructions in AVX-512.
And VPACKUS* instructions on SEE* targets.
Differential Revision: https://reviews.llvm.org/D28216
llvm-svn: 291670
This was reverted because it would miscompile code where the cmp had
multiple uses. That was due to a deficiency in the existing code, which
was fixed in r291630 (see the PR for details).
This re-commit includes an extra test for the kind of code that got
miscompiled: @test_sub_1_setcc_jcc.
llvm-svn: 291640
We would miscompile the following:
void g(int);
int f(volatile long long *p) {
bool b = __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST) < 0;
g(b ? 12 : 34);
return b ? 56 : 78;
}
into
pushq %rax
lock incq (%rdi)
movl $12, %eax
movl $34, %edi
cmovlel %eax, %edi
callq g(int)
testq %rax, %rax <---- Bad.
movl $56, %ecx
movl $78, %eax
cmovsl %ecx, %eax
popq %rcx
retq
because the code failed to take into account that the cmp has multiple
uses, replaced one of them, and left the other one comparing garbage.
llvm-svn: 291630
This patch fix PR31351: https://llvm.org/bugs/show_bug.cgi?id=31351
1. This patch adds new type of shuffle lowering
2. We can use the expand instruction, When the shuffle pattern is as following:
{ 0*a[0]0*a[1]...0*a[n] , n >=0 where a[] elements in a ascending order}.
Reviewers: 1. igorb
2. guyblank
3. craig.topper
4. RKSimon
Differential Revision: https://reviews.llvm.org/D28352
llvm-svn: 291584
Use the existing AVX2 v8i16 vector shift lowering for v16i8 (extending to v16i32) on AVX512 targets and v32i8 (extending to v32i16) on AVX512BW targets.
Cost model updates to follow.
llvm-svn: 291451
Use the existing AVX2 v8i16 vector shift lowering for v16i16 on AVX512 targets (AVX512BW will have already have lowered with vpsravw).
Cost model updates to follow.
llvm-svn: 291445
I noticed this problem as part of the ongoing attempt to canonicalize min/max ops in IR.
The debug output shows nodes like this:
t4: i32 = xor t2, Constant:i32<-1>
t21: i8 = setcc t4, Constant:i32<0>, setlt:ch
t14: i32 = select t21, t4, Constant:i32<-1>
And because the select is holding onto the t4 (xor) node while EmitTest creates a new
x86-specific xor node, the lowering results in:
t4: i32 = xor t2, Constant:i32<-1>
t25: i32,i32 = X86ISD::XOR t2, Constant:i32<-1>
t28: i32,glue = X86ISD::CMOV Constant:i32<-1>, t4, Constant:i8<15>, t25:1
Differential Revision: https://reviews.llvm.org/D28374
llvm-svn: 291392
Summary:
For instructions such as PSLLW/PSLLD/PSLLQ a variable shift amount may be passed in an XMM register.
The lower 64-bits of the register are evaluated to determine the shift amount.
This patch improves the construction of the vector containing the shift amount.
Reviewers: craig.topper, delena, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28353
llvm-svn: 291120
In some cases its more efficient to combine TRUNC( BINOP( X, Y ) ) --> BINOP( TRUNC( X ), TRUNC( Y ) ) if the binop is legal for the truncated types.
This is true for vector integer multiplication (especially vXi64), as well as ADD/AND/XOR/OR in cases where we only need to truncate one of the inputs at runtime (e.g. a duplicated input or an one use constant we can fold).
Further work could be done here - scalar cases (especially i64) could often benefit (if we avoid partial registers etc.), other opcodes, and better analysis of when truncating the inputs reduces costs.
I have considered implementing this for all targets within the DAGCombiner but wasn't sure we could devise a suitable cost model system that would give us the range we need.
Differential Revision: https://reviews.llvm.org/D28219
llvm-svn: 290947
This reverts commit r290694. It broke sanitizer tests on Win64. I'll
probably bring this back, but the jump tables will just live in .text
like they do for MSVC.
llvm-svn: 290714
Summary:
We were already using 32-bit jump table entries, but this was a
consequence of the default PIC model on Win64, and not an intentional
design decision. This patch ensures that we always use 32-bit label
difference jump table entries on Win64 regardless of the PIC model. This
is a good idea because it saves executable size and object file size.
Moving the jump tables to .rdata cleans up the disassembled object code
and reduces the available ROP targets, but it requires adding one more
RIP-relative lea to the code. COFF doesn't have relocations to express
the difference between two arbitrary symbols, so we can't use the jump
table label in the label difference like we do elsewhere.
Fixes PR31488
Reviewers: majnemer, compnerd
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D28141
llvm-svn: 290694
I added API for creation a target specific memory node in DAG. Today, all memory nodes are common for all targets and their constructors are located in SelectionDAG.cpp.
There are some cases in X86 where we need to create a special node - truncation-with-saturation store, float-to-half-store.
In the current patch I added truncation-with-saturation nodes and I'm using them for intrinsics. In the future I plan to implement DAG lowering for truncation-with-saturation pattern.
Differential Revision: https://reviews.llvm.org/D27899
llvm-svn: 290250
The vectorcall calling convention specifies that arguments to functions are to be passed in registers, when possible.
vectorcall uses more registers for arguments than fastcall or the default x64 calling convention use.
The vectorcall calling convention is only supported in native code on x86 and x64 processors that include Streaming SIMD Extensions 2 (SSE2) and above.
The current implementation does not handle Homogeneous Vector Aggregates (HVAs) correctly and this review attempts to fix it.
This aubmit also includes additional lit tests to cover better HVAs corner cases.
Differential Revision: https://reviews.llvm.org/D27392
llvm-svn: 290240
I haven't managed to get this to fail yet but its technically possible for the AND -> shuffle decomposition to result in illegal types.
llvm-svn: 290183
Not sure whether it causes and ASAN false positive or whether it
actually leads to incorrect code or whether it even exposes bad code.
Hans, I'll get you instructions to reproduce this.
llvm-svn: 290066
These nodes are only emitted for lowering FABS/FNEG/FNABS/FCOPYSIGN. Ideally we just wouldn't create these nodes if SSE2 or higher is available, but it was simple to just convert them in DAG combine.
For SSE2, AVX, and AVX512 with DQI this is no functional change as the execution domain fixing pass ensures the right domain is selected regardless of the ISD opcode.
For AVX-512 without DQI we end up using integer instructions since the floating point versions aren't available. But we were already doing that for any logical operations in code that didn't come from FABS/FNEG/FNABS/FCOPYSIGN so this seems no worse. And we get the benefit of being able to fold broadcasts now.
llvm-svn: 290060
atomic_load_add returns the value before addition, but sets EFLAGS based on the
result of the addition. That means it's setting the flags based on effectively
subtracting C from the value at x, which is also what the outer cmp does.
This targets a pattern that occurs frequently with reference counting pointers:
void decrement(long volatile *ptr) {
if (_InterlockedDecrement(ptr) == 0)
release();
}
Clang would previously compile it (for 32-bit at -Os) as:
00000000 <?decrement@@YAXPCJ@Z>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: 31 c9 xor %ecx,%ecx
6: 49 dec %ecx
7: f0 0f c1 08 lock xadd %ecx,(%eax)
b: 83 f9 01 cmp $0x1,%ecx
e: 0f 84 00 00 00 00 je 14 <?decrement@@YAXPCJ@Z+0x14>
14: c3 ret
and with this patch it becomes:
00000000 <?decrement@@YAXPCJ@Z>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: f0 ff 08 lock decl (%eax)
7: 0f 84 00 00 00 00 je d <?decrement@@YAXPCJ@Z+0xd>
d: c3 ret
(Equivalent variants with _InterlockedExchangeAdd, std::atomic<>'s fetch_add
or pre-decrement operator generate the same code.)
Differential Revision: https://reviews.llvm.org/D27781
llvm-svn: 289955