The MMX intrinsics for shift by immediate take a 32-bit shift
amount but the hardware for shifting by immediate only encodes
8-bits. For the intrinsic we don't require the shift amount to
fit in 8-bits in the frontend because we don't check that its an
immediate in the frontend. If its is not an immediate we move it
to an MMX register and use the shift by register.
But if it is an immediate we'll use the shift by immediate
instruction. But we need to change the shift amount to 8-bits.
We were previously doing this accidentally by masking it in the
encoder. But this can make a large shift amount into a small
in bounds shift amount. Instead we should clamp larger shift
amounts to 255 so that the they don't become in bounds.
Fixes PR43922
PVS Studio noticed that we were asserting "VT.getVectorNumElements() == VT.getVectorNumElements()" instead of "VT.getVectorNumElements() == InVT.getVectorNumElements()".
When writing an email for a follow up proposal, I realized one of the diffs in the committed change was incorrect. Digging into it revealed that the fix is complicated enough to require some thought, so reverting in the meantime.
The problem is visible in this diff (from the revert):
; X64-SSE-LABEL: store_fp128:
; X64-SSE: # %bb.0:
-; X64-SSE-NEXT: movaps %xmm0, (%rdi)
+; X64-SSE-NEXT: subq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 32
+; X64-SSE-NEXT: movaps %xmm0, (%rsp)
+; X64-SSE-NEXT: movq (%rsp), %rsi
+; X64-SSE-NEXT: movq {{[0-9]+}}(%rsp), %rdx
+; X64-SSE-NEXT: callq __sync_lock_test_and_set_16
+; X64-SSE-NEXT: addq $24, %rsp
+; X64-SSE-NEXT: .cfi_def_cfa_offset 8
; X64-SSE-NEXT: retq
store atomic fp128 %v, fp128* %fptr unordered, align 16
ret void
The problem here is three fold:
1) x86-64 doesn't guarantee atomicity of anything larger than 8 bytes. Some platforms observably break this guarantee, others don't, but the codegen isn't considering this, so it's wrong on at least some platforms.
2) When I started to track down the problem, I discovered that DAGCombiner had stripped the atomicity off the store entirely. This comes down to idiomatic usage of DAG.getStore passing all MMO components separately as opposed to just passing the MMO.
3) On x86 (not -64), there are cases where 8 byte atomiciy is supported, but only for floating point operations. This would seem to imply that operation typing matters for correctness, and DAGCombine happily folds away bitcasts. I'm not 100% sure there's a problem here, but I'm not entirely sure there isn't either.
I plan on returning to each issue in turn; sorry for the churn here.
If we don't demand all elements, then attempt to combine to a simpler shuffle.
At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts (see D66004).
This reapplies rL368307 (reverted at rL369167) after the fix for the infinite loop reported at PR43024 was applied at rG3f087e38a2e7b87a5adaaac1c1b61e51220e7ff3
This stops infinite loops where KnownUndef elements are converted to Zeroable, resulting in KnownZero elements which are then simplified (via SimplifyDemandedElts etc.) back to KnownUndef elements........
Prep fix for PR43024 which will allow rL368307 to be re-applied.
This doesn't affect actual codegen, but is a minor refactor toward fixing PR43024 where we need to avoid excess changes (folding zeroables etc.) to the shuffle mask at Depth == 0.
Previously we marked zeroable elements in a way that prevented
the widening check from recognizing that it could widen. Now
we only mark them zeroable if V2 is an all zeros vector. This
matches what we do for widening elements in lowerVectorShuffle.
Fixes PR43866.
Teach combineVectorSizedSetCCEquality() to handle arbitrary memcmp
expansions but do not change any default policy for now.
This also fixes a bug in the memcmp expansion itself when large
displacements are needed.
https://reviews.llvm.org/D69507
Enable the new SelectionDAG representation for unordered loads and stores introduced in r371441 by default. As a reminder, the new lowering changes the representation of an unordered atomic load from an AtomicSDNode - which is essentially a black box which gets passed through without combines messing with it - to a LoadSDNode w/a atomic marker on the MMO. The later parallels the way we handle volatiles, and I've audited the code to ensure that every location which checks one checks the other.
This has been fairly heavily fuzzed, and I examined diffs in a reasonable large corpus of assembly by hand, so I'm reasonable sure this is correct for the common case. Late in the review for this, it was discovered that I hadn't correctly handled cases which could be legalized into CAS operations. This points out that there's a strong bias in the IR of the frontend I'm working with towards only legal atomics. If there are problems with this patch, the most likely area will be legalization.
Differential Revision: https://reviews.llvm.org/D69219
This catches some cases. There are probably ways to improve this.
I tried doing it as a combine on the setcc, but that broke
some cases involving flag reuse in place of test.
I renamed the isX86CCUnsigned to isX86CCSigned and flipped its
polarity to make it consistent with the similar functions for
ISD::SETCC. This avoids calling EQ/NE as being signed or unsigned.
Fixes PR43823.
Differential Revision: https://reviews.llvm.org/D69499
The legalization of v2i1->i2 or v4i1->i4 bitcasts followed by a setcc can create an and after the bitcast. If we're lucky enough that the input to the bitcast is a concat_vectors where the first operand is a setcc that can natively 0 all the upper bits of ak-register, then we should replace the other operands of the concat_vectors with zero in order to remove the AND.
With the AND removed we might be able to use a kortest on the result.
Differential Revision: https://reviews.llvm.org/D69205
PTEST and especially the MOVMSK instructions are slow on Knights Landing
or later. As a bonus, this patch increases instruction parallelism by
emitting:
KORTEST(PCMPNEQ(a, b), PCMPNEQ(c, d)) == 0
Instead of:
KORTEST(AND(PCMPEQ(a, b), PCMPEQ(c, d))) == ~0
https://reviews.llvm.org/D69157
tryToWidenViaDuplication lowers using the shuffle_v8i16(unpack_v16i8(shuffle_v8i16(x),shuffle_v8i16(x))) pattern, but the unpack only needs the even/odd 16i8 args if the original v16i8 shuffle mask references the even/odd elements - which isn't true for many extension style shuffles.
llvm-svn: 375342
We were always generating a single source HADDPD, but really we should only do this if shouldUseHorizontalOp says its a good idea.
Differential Revision: https://reviews.llvm.org/D69175
llvm-svn: 375341
Add generic DAG combine for extending masked loads.
Allow us to generate sext/zext masked loads which can access v4i8,
v8i8 and v4i16 memory to produce v4i32, v8i16 and v4i32 respectively.
Differential Revision: https://reviews.llvm.org/D68337
llvm-svn: 375085
Exposes an issue in getFauxShuffleMask where the OR(SHUFFLE,SHUFFLE) decode should always resolve zero/undef elements.
Part of the fix for PR43024 where ideally we shouldn't call resolveTargetShuffleAndZeroables for Depth == 0
llvm-svn: 374928
The only things VBROADCAST_LOAD uses is an address and a chain
node. It has no vector inputs.
So if its a user of the source of another broadcast that could
only mean one of two things. The other broadcast is broadcasting
the address of the broadcast_load. Or the source is a load and
the use we're seeing is the chain result from that load. Neither
of these cases make sense to combine here.
This issue was reported post-commit r373871. Test case has not
been reduced yet.
llvm-svn: 374862
This prevents isel from emitting a TEST instruction that
optimizeCompareInstr will need to remove later.
In some of the modified tests, the SUB gets duplicated due to
the flags being needed in two places and being clobbered in
between. optimizeCompareInstr was able to optimize away the TEST
that was using the result of one of them, but optimizeCompareInstr
doesn't know to turn SUB into CMP after removing the TEST. It
only knows how to turn SUB into CMP if the result was already
dead.
With this change the TEST never exists, so optimizeCompareInstr
doesn't have to remove it. Then it can just turn the SUB into
CMP immediately.
Fixes PR43649.
llvm-svn: 374755
We were already controlling whether the KnownZero elements were being written to the target mask, this extends it to the KnownUndef elements as well so we can prevent the target shuffle mask being manipulated at all.
llvm-svn: 374732
This enables use of the saturating truncate instructions when the
result type is less than 128 bits. It also enables the use of
saturating truncate instructions on KNL when the input is less
than 512 bits. We can do this by widening the input and then
extracting the result.
llvm-svn: 374731