Commit Graph

4691 Commits

Author SHA1 Message Date
Simon Pilgrim fc97d5049f Spelling mistakes in comments. NFCI.
llvm-svn: 299000
2017-03-29 15:27:24 +00:00
Simon Pilgrim 2845189bd1 [X86][AVX2] Prevent unary interleaving patterns from calling lowerVectorShuffleAsSplitOrBlend (PR32453)
llvm-svn: 298993
2017-03-29 13:00:00 +00:00
Simon Pilgrim c7c5aa47cf [X86][MMX] Match MMX fp_to_sint conversions from XMM registers
We currently perform the various fp_to_sint XMM conversions and then transfer to the MMX register (on 32-bit via the stack).

This patch improves support for MOVDQ2Q XMM to MMX transfers and adds the XMM->MMX fp_to_sint direct conversion patterns. The SSE2 specifications are the same as for XMM->XMM and XMM->MMX rounding/exceptions/etc.

Differential Revision: https://reviews.llvm.org/D30868

llvm-svn: 298943
2017-03-28 21:32:11 +00:00
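For illustration, the direct XMM->MMX fp_to_sint conversion also exists at the intrinsics level (a sketch; trunc_two_floats is a hypothetical name, not from the patch):

    #include <xmmintrin.h>

    // Truncating float->int conversion whose result lands in an MMX
    // register; with the new patterns this can select cvttps2pi directly
    // instead of converting in XMM and then transferring.
    __m64 trunc_two_floats(__m128 v) {
      return _mm_cvttps_pi32(v); // converts the low two floats to two i32s
    }
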
Sanjay Patel f01a1dad7f [x86] use VPMOVMSK to replace memcmp libcalls for 32-byte equality
Follow-up to:
https://reviews.llvm.org/rL298775

llvm-svn: 298933
2017-03-28 17:23:49 +00:00
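For illustration, a sketch of what the expanded 32-byte equality check looks like at the source level (assuming AVX2; the function name equal32 is hypothetical, not from the patch):

    #include <immintrin.h>

    // Equality-only memcmp over 32 bytes, expressed with AVX2 intrinsics.
    // The inline expansion is conceptually equivalent to this: one vector
    // compare plus one VPMOVMSKB, checked against the all-ones mask.
    bool equal32(const void *a, const void *b) {
      __m256i va = _mm256_loadu_si256((const __m256i *)a);
      __m256i vb = _mm256_loadu_si256((const __m256i *)b);
      __m256i eq = _mm256_cmpeq_epi8(va, vb); // 0xFF per equal byte lane
      return _mm256_movemask_epi8(eq) == -1;  // all 32 bytes matched
    }
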
Simon Pilgrim 3e2aa7f40e [X86][AVX2] Add support for combining v16i16 shuffles to VPBLENDW
llvm-svn: 298929
2017-03-28 16:40:38 +00:00
Simon Pilgrim 6b30172372 [X86][SSE] Refactored shuffle BLEND combining to make future v16i16 support easier. NFCI.
Call the matchVectorShuffleAsBlend test as early as possible.

llvm-svn: 298925
2017-03-28 15:50:23 +00:00
Simon Pilgrim aa675ca77d Fix signed/unsigned comparison warning
llvm-svn: 298917
2017-03-28 13:40:09 +00:00
Simon Pilgrim d48f47e25c [X86][SSE] Begin merging vector shuffle to BLEND for lowering and combining.
Split off matchVectorShuffleAsBlend from lowerVectorShuffleAsBlend for reuse in combining.

llvm-svn: 298914
2017-03-28 13:05:48 +00:00
Simon Pilgrim 6afe0e2833 [X86][SSE] Set second operand to undef instead of first operand in unary shuffle combines.
Copy isn't necessary after the matchVectorShuffleWithUNPCK refactor and undef value will make some future undef/zero handling easier.

llvm-svn: 298910
2017-03-28 12:16:42 +00:00
Simon Pilgrim defee5683c Strip trailing whitespace
llvm-svn: 298909
2017-03-28 11:15:17 +00:00
Gadi Haber 89d5f9391a [X86][AVX2] bugzilla bug 21281 Performance regression in vector interleave in AVX2
This is a patch for the ongoing bugzilla bug 21281 on the generated X86 code for a matrix transpose8x8 subroutine which requires vector interleaving. The generated code in AVX2 is currently non-optimal and requires 60 instructions as opposed to only 40 instructions generated for AVX1.
The patch includes a fix for the AVX2 case where vector unpack instructions use fewer operations than the vector blend operations available in AVX2.
In this case using vector unpack instructions is more efficient.

Reviewers: zvi, delena, igorb, craig.topper, guyblank, eladcohen, m_zuckerman, aymanmus, RKSimon

llvm-svn: 298840
2017-03-27 12:13:37 +00:00
Simon Pilgrim 92925ea701 [X86][SSE] Add computeKnownBitsForTargetNode support for (V)PSLL/(V)PSRL instructions
llvm-svn: 298806
2017-03-26 13:17:55 +00:00
Simon Pilgrim bec234c970 [X86] Pull out repeated ScalarValueSizeInBits code. NFCI.
llvm-svn: 298783
2017-03-25 21:22:12 +00:00
Simon Pilgrim c0720a4052 [X86][SSE] Combine (VSRLI (VSRAI X, Y), (NumSignBits-1)) -> (VSRLI X, (NumSignBits-1))
Part 3 of 3.

Differential Revision: https://reviews.llvm.org/D31347

llvm-svn: 298782
2017-03-25 20:43:01 +00:00
Simon Pilgrim 6397963c81 [X86][SSE] Added ComputeNumSignBitsForTargetNode support for (V)PSRAI
Part 2 of 3.

Differential Revision: https://reviews.llvm.org/D31347

llvm-svn: 298780
2017-03-25 19:58:36 +00:00
Simon Pilgrim 5400a4d0af [X86][SSE] Generalised CMP+AND1 combine to ZERO/ALLBITS+MASK
Patch to generalize combinePCMPAnd1 (for handling SETCC + ZEXT cases) to work for any input that has zero/all bits set masked with an 'all low bits' mask.

Replaced the implicit assumption of shift availability with a call to SupportedVectorShiftWithImm.

Part 1 of 3.

Differential Revision: https://reviews.llvm.org/D31347

llvm-svn: 298779
2017-03-25 19:50:14 +00:00
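The identity being exploited, checked on scalars for clarity (a worked example, not code from the patch):

    #include <cassert>
    #include <cstdint>

    int main() {
      // For any x known to be all-zeros or all-ones:
      for (int32_t x : {INT32_C(0), INT32_C(-1)}) {
        // (x & 1) is the same as a logical shift right by 31 ...
        assert((x & 1) == (int32_t)((uint32_t)x >> 31));
        // ... and in general (x & lowbits(k)) == x >>u (32 - k).
        const int k = 5;
        assert((x & ((1 << k) - 1)) == (int32_t)((uint32_t)x >> (32 - k)));
      }
    }
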
Sanjay Patel 9ebb68843e [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality
This is the payoff for D31156 - if a target has efficient comparison instructions for vector-sized equality, 
we can replace memcmp calls with inline code that is both smaller and faster.

Differential Revision: https://reviews.llvm.org/D31290

llvm-svn: 298775
2017-03-25 16:05:33 +00:00
Simon Pilgrim 6aac646308 [X86][SSE] Generalised lowerTruncate by PACKSS to work with any 'zero/all bits' result, not just comparisons.
Added vector compare opcodes to X86TargetLowering::ComputeNumSignBitsForTargetNode

Covered by existing tests added for D22814.

llvm-svn: 298704
2017-03-24 16:12:31 +00:00
Eric Christopher cff8492492 Remove the subtarget argument from LowerFP_TO_INT since there's one
stored on X86TargetLowering.

llvm-svn: 298628
2017-03-23 17:35:08 +00:00
Eric Christopher a19a14b42f Remove unused X86Subtarget argument from getOnesVector.
llvm-svn: 298627
2017-03-23 17:35:06 +00:00
Simon Pilgrim 1c048ab6ba [X86][SSE] Extract elements from narrower shuffle masks.
Add support for widening narrow shuffle masks so we can directly extract from the relevant input vector of the shuffle.

llvm-svn: 298616
2017-03-23 16:09:34 +00:00
Simon Pilgrim 8a18299f20 [X86][SSE] Tidyup canWidenShuffleElements. NFCI.
Pull out mask elements at the start, allowing us to make the widening pattern matching more readable.

llvm-svn: 298594
2017-03-23 13:33:03 +00:00
Michael Zuckerman 85436ece89 [X86][TD][vpmovm2 ] New TD pattern for the vpmovm2 instruction
Up until now, the vpmovm2 instruction described its destination operand size
by the source operand size. This patch adds a new pattern for the vpmovm2
instruction. The node describes a new expansion of the destination (from
{128|256} to 512).

Differential Revision: https://reviews.llvm.org/D30654

llvm-svn: 298586
2017-03-23 09:57:01 +00:00
Eric Christopher fd8510cfec Clean up some Subtarget uses and casts in the X86 backend, removing unnecessary work or calls.
llvm-svn: 298555
2017-03-22 22:44:52 +00:00
Reid Kleckner b518054b87 Rename AttributeSet to AttributeList
Summary:
This class is a list of AttributeSetNodes corresponding to the function
prototype of a call or function declaration. This class used to be
called ParamAttrListPtr, then AttrListPtr, then AttributeSet. It is
typically accessed by parameter and return value index, so
"AttributeList" seems like a more intuitive name.

Rename AttributeSetImpl to AttributeListImpl to follow suit.

It's useful to rename this class so that we can rename AttributeSetNode
to AttributeSet later. AttributeSet is the set of attributes that apply
to a single function, argument, or return value.

Reviewers: sanjoy, javed.absar, chandlerc, pete

Reviewed By: pete

Subscribers: pete, jholewinski, arsenm, dschuff, mehdi_amini, jfb, nhaehnle, sbc100, void, llvm-commits

Differential Revision: https://reviews.llvm.org/D31102

llvm-svn: 298393
2017-03-21 16:57:19 +00:00
Sanjay Patel 79379cae15 [x86] use PMOVMSK for vector-sized equality comparisons
We could do better by splitting any oversized type into whatever vector size the target supports, 
but I left that for future work if it ever comes up. The motivating case is memcmp() calls on 16-byte
structs, so I think we can wire that up with a TLI hook that feeds into this.

Differential Revision: https://reviews.llvm.org/D31156

llvm-svn: 298376
2017-03-21 13:50:33 +00:00
Evgeniy Stepanov e829eecc05 [Fuchsia] Use %gs for ABI slots under -mcmodel=kernel
Make x86_64-fuchsia targets under -mcmodel=kernel use %gs rather
than %fs to access ABI slots for stack-protector and safe-stack

Patch by Roland McGrath.

Differential Revision: https://reviews.llvm.org/D30870

llvm-svn: 298302
2017-03-20 20:35:37 +00:00
Craig Topper 5992c8d1dc [AVX-512] Handle kor/kand/kandn/kxor/kxnor/knot intrinsics at lowering time instead of isel
Summary:
Currently we handle these intrinsics at isel with special patterns. But as they just map to normal logic operations, we should just handle them at lowering. This will expose them to DAG combine optimizations. Right now the kor-sequence test generates a bunch of regclass copies between GR16 and VK16 that the peephole optimizer and/or register coalescing are removing to keep everything in the mask domain. By handling the logic op intrinsics earlier, these copies become bitcasts in the DAG and get removed by DAG combine which seems more robust.

This should help enable my plan to stop copying between K registers and GR8/GR16. The peephole optimizer can't remove a chain of copies between K and GR32 with insert_subreg/extract_subreg present in the chain, so the kor-sequence test breaks. But this patch should dodge the problem entirely.

Reviewers: zvi, delena, RKSimon, igorb

Reviewed By: igorb

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D31056

llvm-svn: 298228
2017-03-19 17:11:09 +00:00
Oren Ben Simhon 0ef61ec32a [MIR] Support Custom Register Mask and CSRs
The MIR printer dumps a string that describes the register mask of a function.
A static predefined list of register masks matches a static list of strings.
However, when the register mask is not from the static predefined list, there is no descriptor string and the printer fails.
This patch adds support for custom register mask printing and dumping.
Also, the list of callee saved registers (describing the registers that must be preserved for the caller) might be dynamic.
As such, this data needs to be dumped and parsed back into the Machine Register Info.

Differential Revision: https://reviews.llvm.org/D30971

llvm-svn: 298207
2017-03-19 08:14:18 +00:00
Matthias Braun e9f8209e87 ExecutionDepsFix: Normalize names; NFC
Normalize ExeDepsFix, execution-fix, ExecutionDependencyFix and
ExecutionDepsFix to the last one.

llvm-svn: 298183
2017-03-18 05:05:40 +00:00
Nirav Dave ac6081cb67 Make library calls sensitive to regparm module flag (Fixes PR3997).
Reviewers: mkuper, rnk

Subscribers: mehdi_amini, jyknight, aemerson, llvm-commits, rengolin

Differential Revision: https://reviews.llvm.org/D27050

llvm-svn: 298179
2017-03-18 00:44:07 +00:00
Nirav Dave 6de2c77944 Capitalize ArgListEntry fields. NFC.
llvm-svn: 298178
2017-03-18 00:43:57 +00:00
Sanjay Patel 455703a0c6 [x86] clean up setcc with negated operand transform and add missing test; NFCI
llvm-svn: 298118
2017-03-17 20:29:40 +00:00
Sanjay Patel 25bd713d33 [x86] avoid adc/sbb assert when both sides of add are zexted (PR32316)
As noted in the comment, we might want to account for this case,
but I didn't look at what that would mean for the asm. 

I'm also not sure why this only reproduces with avx512, but I'm 
putting a conservative fix in for now to avoid the crash. 

Also, if both sides of an add are zexted, shouldn't we shrink that add?

https://bugs.llvm.org/show_bug.cgi?id=32316

llvm-svn: 298107
2017-03-17 17:27:31 +00:00
Simon Pilgrim 493f4462bf [X86][SSE] Fixed shuffle MOVSS/MOVSD combining of all zeroable inputs
Turns out it can happen, so the assertion was too harsh

Found during fuzz testing

llvm-svn: 297833
2017-03-15 13:16:46 +00:00
Simon Pilgrim cf2da96c82 [SelectionDAG] Add a signed integer absolute ISD node
Reduced version of D26357 - based on the discussion on llvm-dev about canonicalization of UMIN/UMAX/SMIN/SMAX as well as ABS, I've reduced that patch to just the ABS ISD node (with x86/sse support) to improve basic combines and lowering.

ARM/AArch64, Hexagon, PowerPC and NVPTX all have similar instructions, allowing us to make this a generic opcode and move away from the hard-coded tablegen patterns, which make it tricky to match more complex patterns.

At the moment this patch doesn't attempt legalization as we only create an ABS node if it's legal/custom.

Differential Revision: https://reviews.llvm.org/D29639

llvm-svn: 297780
2017-03-14 21:26:58 +00:00
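For reference, typical source-level idioms that can map onto a generic integer absolute node (illustrative; the exact patterns matched live in the combines):

    #include <cstdint>

    int32_t abs_select(int32_t x) { return x < 0 ? -x : x; }

    int32_t abs_branchless(int32_t x) {
      int32_t m = x >> 31;   // arithmetic shift: 0 or -1 (sign mask)
      return (x + m) ^ m;    // equals abs(x), modulo INT32_MIN wrapping
    }
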
Oren Ben Simhon fe34c5e429 Disable Callee Saved Registers
Each calling convention (CC) defines a static list of registers that should be preserved by a callee function. All other registers should be saved by the caller.
Some CCs use an additional condition: if a register is used for passing/returning arguments, the caller needs to save it, even if it is part of the Callee Saved Registers (CSR) list.
The current LLVM implementation doesn't support this. It will save a register if it is part of the static CSR list and will not care if the register is passed/returned by the callee.
The solution is to dynamically allocate the CSR lists (only for these CCs). The lists will be updated with the actual registers that should be saved by the callee.
Since we need the allocated lists to live as long as the function exists, the lists should reside inside the Machine Register Info (MRI), which is a property of the Machine Function and managed by it (and has the same life span).
The lists should be saved in the MRI and populated upon LowerCall and LowerFormalArguments.
The patch will also assist in implementing the future no_caller_saved_registers attribute intended for the interrupt handler CC.

Differential Revision: https://reviews.llvm.org/D28566

llvm-svn: 297715
2017-03-14 09:09:26 +00:00
Craig Topper 616641632e [X86] Lower AVX2 gather intrinsics similar to AVX-512. Apply the same input source optimizations to break execution dependencies.
For AVX-512 we force the input to zero if the input is undef or the mask is all ones to break an execution dependency. This patch brings the same behavior to AVX2.

llvm-svn: 297652
2017-03-13 18:34:46 +00:00
Craig Topper eb7ea28bdd [AVX-512] If gather mask is all ones, force the input to a zero vector.
We were already forcing undef inputs to become a zero vector, this now catches an all ones mask too.

Ideally we'd use undef and let execution dep fix handle picking the best register/clearance for the undef, but I don't think it can handle the early clobber today.

llvm-svn: 297651
2017-03-13 18:17:46 +00:00
Craig Topper 7d56c8315b [AVX-512] Fix the valid immediates for the scatter/gather prefetch intrinsics.
The immediate should be 1 or 2, not 0 or 1. This was found while adding bounds checking to clang. In fact the existing clang builtin test failed if we ran it all the way to assembly.

llvm-svn: 297591
2017-03-12 22:29:12 +00:00
Sanjay Patel f06b963a2b [x86] don't blindly transform SETB into SBB
I noticed unnecessary 'sbb' instructions in D30472 and while looking at 'ptest' codegen recently. 
This happens because we were transforming any 'setb' - even when we only wanted a single-bit result.

This patch moves those transforms under visitAdd/visitSub, so we're only creating sbb/adc when it
is a win. I don't know why we need a SETCC_CARRY node type, but I'm not proposing to change that
existing behavior in this patch.

Also, I'm skeptical that sbb/adc are a win for all micro-arches, so I added comments to the test files
where this transform still fires.

The test changes here are all cases where we no longer produce sbb/adc. Avoiding partial register
stalls (generating an xor to clear a register) is not handled in some cases, but that's a separate
issue.

Differential Revision: https://reviews.llvm.org/D30611

llvm-svn: 297586
2017-03-12 18:28:48 +00:00
Simon Pilgrim 18debfa5b4 [X86][SSE] Improve extraction of elements from v16i8 (pre-SSE41)
Without SSE41 (pextrb) we currently extract byte elements from a vector by spilling to stack and reloading the byte.

This patch is an initial attempt at using MOVD/PEXTRW to extract the relevant DWORD/WORD from the vector and then shift+truncate to collect the correct byte.

Extraction of multiple bytes this way would result in code bloat, but as explained in the patch we could probably afford to be more aggressive with the supported extractions before again falling back on spilling - possibly by counting the number of extracts and which DWORD/WORD they originate from.

Differential Revision: https://reviews.llvm.org/D29841

llvm-svn: 297568
2017-03-11 20:42:31 +00:00
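A sketch of the MOVD/PEXTRW technique at the intrinsics level (assuming SSE2; extract_byte5 is a hypothetical name):

    #include <emmintrin.h>
    #include <cstdint>

    // Extract byte 5 of a v16i8 without SSE4.1's pextrb: grab word 2
    // (which covers bytes 4-5) with PEXTRW, then shift out the low byte.
    uint8_t extract_byte5(__m128i v) {
      int w = _mm_extract_epi16(v, 2); // PEXTRW, word index 2
      return (uint8_t)(w >> 8);        // byte 5 is the high byte of word 2
    }
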
Craig Topper 02b463270c [X86] Remove unnecessary commented out code. NFC
llvm-svn: 297563
2017-03-11 18:25:56 +00:00
Simon Pilgrim bfe263352a [X86] Fix Wunused-lambda-capture warning
llvm-svn: 297521
2017-03-10 22:10:34 +00:00
Simon Pilgrim b02667c469 [APInt] Add APInt::insertBits() method to insert an APInt into a larger APInt
We currently have to insert bits via a temporary variable of the same size as the target with various shift/mask stages, resulting in further temporary variables, all of which require the allocation of memory for large APInts (MaskSizeInBits > 64).

This is another of the compile time issues identified in PR32037 (see also D30265).

This patch adds the APInt::insertBits() helper method which avoids the temporary memory allocation and masks/inserts the raw bits directly into the target.

Differential Revision: https://reviews.llvm.org/D30780

llvm-svn: 297458
2017-03-10 13:44:32 +00:00
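Usage of the new helper, a minimal sketch:

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    void demo() {
      APInt Wide(128, 0);
      APInt Sub(8, 0xAB);
      // Insert the 8 bits of Sub at bit position 32 of Wide, without
      // materializing a temporary 128-bit value for shifting/masking.
      Wide.insertBits(Sub, 32); // bits [32,40) of Wide now hold 0xAB
    }
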
Simon Pilgrim 836bcc689f [X86][SSE] combineX86ShufflesRecursively can handle shuffle masks up to 64 elements wide
By defining the mask types as SmallVector<int, 16> we were causing a lot of unnecessary heap usage.

llvm-svn: 297267
2017-03-08 09:36:39 +00:00
Sanjoy Das c08a79fbf2 [X86] Add option to specify preferable loop alignment
Summary:
Loop alignment can cause a significant change in
performance for short loops.
To be able to evaluate the impact of loop alignment this change
introduces the new option x86-experimental-pref-loop-alignment.
The alignment will be 2^Value bytes, the default value is 4.

Patch by Serguei Katkov!

Reviewers: craig.topper

Reviewed By: craig.topper

Subscribers: sanjoy, llvm-commits

Differential Revision: https://reviews.llvm.org/D30391

llvm-svn: 297178
2017-03-07 18:47:22 +00:00
Reid Kleckner 812191584f [X86] Fix arg copy elision for illegal types
Use the store size of the argument type, which will be a byte-sized
quantity, rather than dividing the size in bits by 8.

Fixes PR32136 and re-enables copy elision from i64 arguments.

Reverts the workaround from r296950.

llvm-svn: 297045
2017-03-06 18:39:39 +00:00
Benjamin Kramer bb635e034c [X86] Silence GCC enum compare warning.
X86ISelLowering.cpp:26506:36: error: enumeral mismatch in conditional
expression: 'llvm::X86ISD::NodeType' vs 'llvm::ISD::NodeType'
[-Werror=enum-compare]

llvm-svn: 296986
2017-03-05 12:53:20 +00:00
Simon Pilgrim 9f5c251d57 [X86][SSE] Lower 128-bit vectors to SIGN/ZERO_EXTEND_VECTOR_IN_REG ops
As described on PR31712, we miss a variety of legalization combines because we lower these to X86ISD::VSEXT/VZEXT despite them having the same functionality. This patch makes 128-bit (SSE41) SIGN/ZERO_EXTEND_VECTOR_IN_REG ops legal, adds the necessary tablegen plumbing and uses a helper 'getExtendInVec' to decide when to use SIGN/ZERO_EXTEND_VECTOR_IN_REG or VSEXT/VZEXT.

We're missing a couple of shuffle combines that will be added in a future patch for review.

Later patches can then support the AVX2 cases as a mixture of SIGN/ZERO_EXTEND and SIGN/ZERO_EXTEND_VECTOR_IN_REG, and then finally deal with the AVX512 cases.

Differential Revision: https://reviews.llvm.org/D30549

llvm-svn: 296985
2017-03-05 09:57:20 +00:00
Sanjay Patel b974be5ef4 [x86] don't require a zext when forming ADC/SBB
The larger goal is to move the ADC/SBB transforms currently in 
combineX86SetCC() to combineAddOrSubToADCOrSBB() because we're 
creating ADC/SBB in lots of places where we shouldn't.

This was intended to be an NFC change, but avx-512 has something 
strange going on. It doesn't seem like any of the affected tests 
should really be using SET+TEST or ADC; a simple ADD could replace
several instructions. But that's another bug...

llvm-svn: 296978
2017-03-04 20:35:19 +00:00
Simon Pilgrim 40a0e66b37 [X86][SSE] Enable post-legalize vXi64 shuffle combining on 32-bit targets
Long ago (2010 according to svn blame), combineShuffle probably needed to prevent the accidental creation of illegal i64 types, but there no longer appear to be any combines that can cause this, as they all have their own legality checks.

Differential Revision: https://reviews.llvm.org/D30213

llvm-svn: 296966
2017-03-04 12:50:47 +00:00
Matthias Braun 21f340fd25 X86ISelLowering: Only perform copy elision on legal types.
This fixes cases where i1 types were not properly legalized yet and led
to the creation of 0-sized stack slots.

This fixes http://llvm.org/PR32136

llvm-svn: 296950
2017-03-04 01:40:40 +00:00
Sanjay Patel a84fd041c6 [x86] check for commuted add pattern to find ADC/SBB
llvm-svn: 296933
2017-03-04 00:18:31 +00:00
Sanjay Patel 7ee83b41e0 [x86] refactor combineAddOrSubToADCOrSBB(); NFCI
The comments were wrong, and this is not an obvious transform.
This hopefully makes it clearer that we're missing the commuted
patterns for adds. It's less clear that this is actually a good
transform for all micro-arch.

This is prep work for trying to clean up the current adc/sbb 
codegen because it's definitely not happening optimally.

llvm-svn: 296918
2017-03-03 22:35:11 +00:00
Sanjay Patel 58e241896d [x86] clean up materializeSBB(); NFCI
This is producing SBB where it is obviously not necessary, so it needs to be limited.

llvm-svn: 296894
2017-03-03 17:58:39 +00:00
Sanjay Patel e8674825fe [x86] fix formatting; NFC
llvm-svn: 296875
2017-03-03 15:17:41 +00:00
Simon Pilgrim c37a32d2b9 Use APInt::getHighBitsSet instead of APInt::getBitsSet for upper bit mask creation
llvm-svn: 296874
2017-03-03 14:37:57 +00:00
Simon Pilgrim b3067dc374 [X86][MMX] Fixed i32 extraction on 32-bit targets
MMX extraction often ends up as extract_i32(bitcast_v2i32(extract_i64(bitcast_v1i64(x86mmx v), 0)), 0) which fails to simplify on 32-bit targets as i64 isn't legal

llvm-svn: 296782
2017-03-02 18:56:06 +00:00
Reid Kleckner f7c0980c10 Elide argument copies during instruction selection
Summary:
Avoids tons of prologue boilerplate when arguments are passed in memory
and left in memory. This can happen in a debug build or in a release
build when an argument alloca is escaped.  This will dramatically affect
the code size of x86 debug builds, because X86 fast isel doesn't handle
arguments passed in memory at all. It only handles the x86_64 case of up
to 6 basic register parameters.

This is implemented by analyzing the entry block before ISel to identify
copy elision candidates. A copy elision candidate is an argument that is
used to fully initialize an alloca before any other possibly escaping
uses of that alloca. If an argument is a copy elision candidate, we set
a flag on the InputArg. If the target generates loads from a fixed
stack object that matches the size and alignment requirements of the
alloca, the SelectionDAG builder will delete the stack object created
for the alloca and replace it with the fixed stack object. The load is
left behind to satisfy any remaining uses of the argument value. The
store is now dead and is therefore elided. The fixed stack object is
also marked as mutable, as it may now be modified by the user, and it
would be invalid to rematerialize the initial load from it.

Supersedes D28388

Fixes PR26328

Reviewers: chandlerc, MatzeB, qcolombet, inglorion, hans

Subscribers: igorb, llvm-commits

Differential Revision: https://reviews.llvm.org/D29668

llvm-svn: 296683
2017-03-01 21:42:00 +00:00
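A sketch of the kind of function this helps (hypothetical example): an argument passed in memory whose alloca is fully initialized from the argument before any escaping use.

    struct Big { long a, b, c, d; }; // passed in memory on x86
    void use(Big *);

    // 's' escapes, so it must live in memory. Previously -O0 code copied
    // 's' from its fixed argument slot into a fresh alloca; now the
    // argument slot itself can back the alloca and the copy is elided.
    void f(Big s) {
      use(&s);
    }
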
Simon Pilgrim 5c4efcdddf [X86][SSE] Attempt to extract vector elements through target shuffles
DAGCombiner already supports peeking through shuffles to improve vector element extraction, but legalization often leaves us in situations where we need to extract vector elements after shuffles have already been lowered.

This patch adds support for VECTOR_EXTRACT_ELEMENT/PEXTRW/PEXTRB instructions to attempt to handle target shuffles as well. I've covered some basic scenarios including handling shuffle mask scaling and the implicit zero-extension of PEXTRW/PEXTRB; there is more that could be done here (mentioned in TODOs) but I haven't found many cases where it's worth it.

Differential Revision: https://reviews.llvm.org/D30176

llvm-svn: 296381
2017-02-27 21:01:57 +00:00
Craig Topper 7502119ce8 [X86] Use APInt instead of SmallBitVector tracking undef elements from getTargetConstantBitsFromNode and getConstVector.
Summary:
SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here have more elements than that and therefore cause a malloc.

APInt on the other hand supports up to 64 bits without a malloc. That's the maximum number of bits we need here so we can avoid a malloc for all cases by using APInt.

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D30392

llvm-svn: 296355
2017-02-27 16:15:32 +00:00
Craig Topper 3917ca2af4 [X86] Use APInt instead of SmallBitVector for tracking Zeroable elements in shuffle lowering
Summary:
SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here have more elements than that and therefore cause a malloc.

APInt on the other hand supports up to 64 bits without a malloc. That's the maximum number of bits we need here so we can avoid a malloc for all cases by using APInt.

Reviewers: RKSimon

Reviewed By: RKSimon

Subscribers: llvm-commits

Differential Revision: https://reviews.llvm.org/D30390

llvm-svn: 296354
2017-02-27 16:15:30 +00:00
Craig Topper ed0101a0b9 [X86] Check for less than 0 rather than explicit compare with -1. NFC
llvm-svn: 296321
2017-02-27 06:05:30 +00:00
Simon Pilgrim 0f5fb5f549 [APInt] Add APInt::extractBits() method to extract APInt subrange (reapplied)
The current pattern for extract bits in range is typically:

Mask.lshr(BitOffset).trunc(SubSizeInBits);

Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.

This is another of the compile time issues identified in PR32037 (see also D30265).

This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.

Differential Revision: https://reviews.llvm.org/D30336

llvm-svn: 296272
2017-02-25 20:01:58 +00:00
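Usage, a minimal sketch:

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    void demo(const APInt &Mask) {             // e.g. a 128-bit mask
      APInt OldWay = Mask.lshr(32).trunc(16);  // allocates temporaries
      APInt NewWay = Mask.extractBits(16, 32); // 16 bits starting at bit 32,
                                               // no temporary allocation
    }
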
Simon Pilgrim cdf2bd656a Revert: r296141 [APInt] Add APInt::extractBits() method to extract APInt subrange
The current pattern for extract bits in range is typically:

Mask.lshr(BitOffset).trunc(SubSizeInBits);

Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.

This is another of the compile time issues identified in PR32037 (see also D30265).

This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.

Differential Revision: https://reviews.llvm.org/D30336

llvm-svn: 296147
2017-02-24 18:31:04 +00:00
Simon Pilgrim bd9fb2ae95 [APInt] Add APInt::extractBits() method to extract APInt subrange
The current pattern for extract bits in range is typically:

Mask.lshr(BitOffset).trunc(SubSizeInBits);

Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.

This is another of the compile time issues identified in PR32037 (see also D30265).

This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.

Differential Revision: https://reviews.llvm.org/D30336

llvm-svn: 296141
2017-02-24 17:46:18 +00:00
Simon Pilgrim 7f6a7c97a7 [X86][SSE] Target shuffle combine can try to combine up to 16 vectors
Noticed while profiling PR32037, the target shuffle ops were being stored in SmallVector<*,8> types but the combiner could store as many as 16 ops at maximum depth (2 per depth).

llvm-svn: 296130
2017-02-24 15:35:52 +00:00
Sanjay Patel 9f0fa52aa2 [x86] use DAG.getAllOnesConstant(); NFCI
llvm-svn: 296128
2017-02-24 15:09:59 +00:00
Simon Pilgrim aed352273e [APInt] Add APInt::setBits() method to set all bits in range
The current pattern for setting bits in range is typically:

Mask |= APInt::getBitsSet(MaskSizeInBits, LoPos, HiPos);

Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.

This is one of the key compile time issues identified in PR32037.

This patch adds the APInt::setBits() helper method which avoids the temporary memory allocation completely, this first implementation uses setBit() internally instead but already significantly reduces the regression in PR32037 (~10% drop). Additional optimization may be possible.

I investigated whether there is a need for APInt::clearBits() and APInt::flipBits() equivalents, but I haven't seen these patterns to be particularly common; if needed, reusing the code would be trivial.

Differential Revision: https://reviews.llvm.org/D30265

llvm-svn: 296102
2017-02-24 10:15:29 +00:00
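Usage, a minimal sketch:

    #include "llvm/ADT/APInt.h"
    using namespace llvm;

    void demo(APInt &Mask) {
      // Old pattern: allocates a temporary mask for wide APInts.
      Mask |= APInt::getBitsSet(Mask.getBitWidth(), 8, 16);
      // New helper: sets bits [8,16) in place.
      Mask.setBits(8, 16);
    }
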
Craig Topper 8783bbb598 [AVX-512] Separate the fadd/fsub/fmul/fdiv/fmax/fmin with rounding mode ISD opcodes into separate packed and scalar opcodes. This is more consistent with the rest of the ISD opcodes. NFC
llvm-svn: 296094
2017-02-24 07:21:10 +00:00
Petr Hosek a7d5916308 [Fuchsia] Use thread-pointer ABI slots for stack-protector and safe-stack
The Fuchsia ABI defines slots from the thread pointer where the
stack-guard value for stack-protector, and the unsafe stack pointer
for safe-stack, are stored. This parallels the Android ABI support.

Patch by Roland McGrath

Differential Revision: https://reviews.llvm.org/D30237

llvm-svn: 296081
2017-02-24 03:10:10 +00:00
Evgeniy Stepanov ee2d77f6d6 Disable TLS for stack protector on Android API<17.
The TLS slot did not exist back then.

llvm-svn: 296014
2017-02-23 21:06:35 +00:00
Simon Pilgrim 13cdd57964 [X86][SSE] getTargetConstantBitsFromNode - insert constant bits directly into masks.
Minor optimization, don't create temporary mask APInts that are just going to be OR'd into the accumulate masks - insert directly instead.

llvm-svn: 295848
2017-02-22 15:38:13 +00:00
Simon Pilgrim 3a895c4873 [X86][SSE] Use APInt::getBitsSet() instead of APInt::getLowBitsSet().shl() separately. NFCI.
llvm-svn: 295845
2017-02-22 15:04:55 +00:00
Craig Topper 56d4022997 [AVX-512] Allow legacy scalar min/max intrinsics to select EVEX instructions when available
This patch introduces new X86ISD::FMAXS and X86ISD::FMINS opcodes. The legacy intrinsics now lower to this node. As do the AVX-512 masked intrinsics when the rounding mode is CUR_DIRECTION.

I've merged a copy of the tablegen multiclass avx512_fp_scalar into avx512_fp_scalar_sae. avx512_fp_scalar still needs to support CUR_DIRECTION appearing as a rounding mode for X86ISD::FADD_ROUND and others.

Differential revision: https://reviews.llvm.org/D30186

llvm-svn: 295810
2017-02-22 06:54:18 +00:00
Geoff Berry 5d534b6a11 [CodeGenPrepare] Sink and duplicate more 'and' instructions.
Summary:
Rework the code that was sinking/duplicating (icmp and, 0) sequences
into blocks where they were being used by conditional branches to form
more tbz instructions on AArch64.  The new code is more general in that
it just looks for 'and's that have all icmp 0's as users, with a target
hook used to select which subset of 'and' instructions to consider.
This change also enables 'and' sinking for X86, where it is more widely
beneficial than on AArch64.

The 'and' sinking/duplicating code is moved into the optimizeInst phase
of CodeGenPrepare, where it can take advantage of the fact the
OptimizeCmpExpression has already sunk/duplicated any icmps into the
blocks where they are used.  One minor complication from this change is
that optimizeLoadExt needed to be updated to always mark 'and's it has
determined should be in the same block as their feeding load in the
InsertedInsts set to avoid an infinite loop of hoisting and sinking the
same 'and'.

This change fixes a regression on X86 in the tsan runtime caused by
moving GVNHoist to a later place in the optimization pipeline (see
PR31382).

Reviewers: t.p.northover, qcolombet, MatzeB

Subscribers: aemerson, mcrosier, sebpop, llvm-commits

Differential Revision: https://reviews.llvm.org/D28813

llvm-svn: 295746
2017-02-21 18:53:14 +00:00
Simon Pilgrim 8eb515d8c4 [X86] EltsFromConsecutiveLoads SDLoc argument should be const&.
There appears never to have been a time that the reference was updated.

llvm-svn: 295739
2017-02-21 17:42:28 +00:00
Simon Pilgrim 3546156122 [X86][SSE] Prefer to combine shuffles to VZEXT over VZEXT_MOVL.
This matches what is already done during shuffle lowering and helps prevent the need for a zero-vector in cases where shuffles match both patterns.

llvm-svn: 295723
2017-02-21 15:09:00 +00:00
Igor Breger 812f319794 [AVX512] Fix EXTRACT_VECTOR_ELT for v2i1/v4i1/v32i1/v64i1 with variable index.
Differential Revision: https://reviews.llvm.org/D30189

llvm-svn: 295718
2017-02-21 14:01:25 +00:00
Craig Topper 16d9730b86 [X86] Fix formatting. NFC
llvm-svn: 295695
2017-02-21 06:27:13 +00:00
Sanjoy Das 90208720e3 Add a wrapper around copy_if in STLExtras; NFC
I will add one more use for this in a later change.

llvm-svn: 295685
2017-02-21 00:38:44 +00:00
Simon Pilgrim 2967ed1c7e [X86] Tidyup combineExtractVectorElt. NFCI.
Pull out repeated code for extraction index operand and source vector value type.

Use isNullConstant helper to check for zero extraction index.

llvm-svn: 295670
2017-02-20 16:09:45 +00:00
Igor Breger fda32d266a [X86] Fix EXTRACT_VECTOR_ELT with variable index from v32i16 and v64i8 vector.
It's more profitable to go through memory (1 cycle throughput)
than to use a VMOVD + VPERMV/PSHUFB sequence (2/3 cycles throughput) to implement EXTRACT_VECTOR_ELT with a variable index.
The IACA tool was used to get a performance estimate (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer).
For example, for the var_shuffle_v16i8_v16i8_xxxxxxxxxxxxxxxx_i8 test from vector-shuffle-variable-128.ll I get 26 cycles vs 79 cycles.
The VINSERT node is also removed, since we don't need it any more.

Differential Revision: https://reviews.llvm.org/D29690

llvm-svn: 295660
2017-02-20 14:16:29 +00:00
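The "go through memory" approach, sketched with intrinsics on a 16-byte vector for brevity (illustrative; extract_var is a hypothetical name):

    #include <emmintrin.h>
    #include <cstdint>

    // Variable-index extract by spilling the vector to the stack:
    // one vector store plus one scalar load.
    uint8_t extract_var(__m128i v, unsigned idx) {
      alignas(16) uint8_t tmp[16];
      _mm_store_si128((__m128i *)tmp, v);
      return tmp[idx & 15];
    }
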
Simon Pilgrim 5910ebe720 [X86][AVX512] Add support for ASHR v2i64/v4i64 support without VLX
Use v8i64 ASHR instructions if we don't have VLX.

Differential Revision: https://reviews.llvm.org/D28537

llvm-svn: 295656
2017-02-20 12:16:38 +00:00
Simon Pilgrim 14a7eee0b4 [X86] Use peekThroughOneUseBitcasts helper. NFCI.
llvm-svn: 295618
2017-02-19 21:40:51 +00:00
Simon Pilgrim d590de2998 [X86][SSE] Use getTargetConstantBitsFromNode to find zeroable shuffle elements.
Replaces existing approach that could only search BUILD_VECTOR nodes.

Requires getTargetConstantBitsFromNode to discriminate cases with all/partial UNDEF bits in each element - this should also be useful when we get around to supporting getTargetShuffleMaskIndices with UNDEF elements. 

llvm-svn: 295613
2017-02-19 19:40:31 +00:00
Simon Pilgrim 4271186f9c [X86][SSE] Enable initial support for domain crossing at high shuffle combine depths.
As discussed on D27692, this permits another domain to be used to combine a shuffle at high depths.

We currently set the required depth at 4 or more combined shuffles, this is probably too high for most targets but is a good starting point and already helps avoid a number of costly variable shuffles.

llvm-svn: 295608
2017-02-19 17:19:38 +00:00
Simon Pilgrim 6d07d514de [X86][SSE] Generalize INSERTPS/SHUFPS/SHUFPD combines across domains.
Relax the INSERTPS/SHUFPS/SHUFPD combines to support integer inputs if permitted.

llvm-svn: 295606
2017-02-19 15:15:40 +00:00
Simon Pilgrim b4460cf5a9 [X86][SSE] Add domain crossing support for target shuffle combines.
Add the infrastructure to flag whether float and/or int domains are permitable.

A future patch will enable domain crossing based off shuffle depth and the value types of the source vectors.

llvm-svn: 295604
2017-02-19 14:12:25 +00:00
Simon Pilgrim 2f2d8dc630 Fix signed/unsigned comparison warning.
llvm-svn: 295580
2017-02-18 22:56:17 +00:00
Simon Pilgrim 7a87eebcad [X86] Fix enumeral/non-enumeral comparison warning.
gcc only allows you to mix enums / ints if they have the same signedness.

llvm-svn: 295576
2017-02-18 22:40:58 +00:00
Simon Pilgrim 2e78c94ea5 [X86][SSE] Avoid repeated calls to SDValue::getValueType.
Added assertion to check input type of X86ISD::VZEXT during target known bits calculation.

llvm-svn: 295575
2017-02-18 22:25:27 +00:00
Sanjay Patel 12c2093e1e [x86] fold sext (xor Bool, -1) --> sub (zext Bool), 1
This is the same transform that is current used for:
select Bool, 0, -1

llvm-svn: 295568
2017-02-18 21:03:28 +00:00
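The identity, checked on both boolean values (a worked example):

    #include <cassert>

    int main() {
      for (bool b : {false, true}) {
        int sext_not = (!b) ? -1 : 0; // sext (xor Bool, -1)
        int sub_form = (int)b - 1;    // sub (zext Bool), 1
        assert(sext_not == sub_form); // b=0: -1 == -1; b=1: 0 == 0
      }
    }
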
Simon Pilgrim 7db8f42fe3 [X86] Simplify by pulling out valuetype. NFCI.
llvm-svn: 295502
2017-02-17 22:10:10 +00:00
Simon Pilgrim 2fe568c95e [X86] Remove local areOnlyUsersOf helper and use SDNode::areOnlyUsersOf instead.
llvm-svn: 295326
2017-02-16 15:11:49 +00:00
Simon Pilgrim 5b4c30fb32 [X86][SSE] Don't call EltsFromConsecutiveLoads if any element is missing.
Minor performance speedup - if any call to getShuffleScalarElt fails to get a result, don't both calling for the remaining elements as EltsFromConsecutiveLoads will fail anyhow.

llvm-svn: 295235
2017-02-15 21:09:00 +00:00
Simon Pilgrim da25d5c7b6 [X86][SSE] Propagate undef upper elements from scalar_to_vector during shuffle combining
Only do this for integer types currently - floats types (in particular insertps) load folding often fails with this.

llvm-svn: 295208
2017-02-15 17:41:33 +00:00
Simon Pilgrim 0f0e5bd3c6 [X86][SSE] Allow matchVectorShuffleWithUNPCK to recognise ZERO inputs
Add support for specifying an UNPCK input as ZERO, particularly improves ZEXT cases with non-zero offsets

llvm-svn: 295169
2017-02-15 11:46:15 +00:00
Craig Topper fbc7805e25 [X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types
Summary:
We don't seem to have great rules on what a valid VBROADCAST node looks like. And as a consequence we end up with a lot of patterns to try to catch everything. We have patterns with scalar inputs, 128-bit vector inputs, 256-bit vector inputs, and 512-bit vector inputs.

As you can see from the things improved here we are currently missing patterns for 128-bit loads being extended to 256-bit before the vbroadcast.

I'd like to propose that VBROADCAST should always take a 128-bit vector type as input. As a first step towards that this patch adds an EXTRACT_SUBVECTOR in front of VBROADCAST when the input is 256 or 512-bits. In the future I would like to add scalar_to_vector around all the scalar operations. And maybe we should consider adding a VBROADCAST+load node to avoid separating loads from the broadcasting operation when the load itself isn't foldable.

This requires an additional change in target shuffle combining to look for the extract subvector and look through it to find the original operand. I'm sure this change isn't perfect but was enough to fix a few test failures that were being caused.

Another interesting thing I noticed is that the changes in masked_gather_scatter.ll show cases where we don't remove a useless insert into element 1 before broadcasting element 0.

Reviewers: delena, RKSimon, zvi

Reviewed By: zvi

Subscribers: igorb, llvm-commits

Differential Revision: https://reviews.llvm.org/D28747

llvm-svn: 295155
2017-02-15 06:58:47 +00:00
Diego Novillo 8adfc8ef3a Remove unused variable.
llvm-svn: 295065
2017-02-14 16:39:54 +00:00
Simon Pilgrim 6f732e026d [X86][SSE] Allow matchVectorShuffleWithUNPCK to recognise UNDEF inputs
Add support for specifying an UNPCK input as UNDEF

llvm-svn: 295061
2017-02-14 16:22:04 +00:00
Simon Pilgrim a0878dea9e [X86][SSE] Move unary inputs handling inside matchVectorShuffleWithUNPCK.
llvm-svn: 295053
2017-02-14 13:47:17 +00:00
Simon Pilgrim 3efdffcb27 [X86][SSE] Tidyup matchVectorShuffleWithUNPCK helper function call.
Don't bother setting the V1/V2 operands again for unary shuffles.

Don't bother legalizing the value type unless the match succeeds.

llvm-svn: 295051
2017-02-14 12:54:39 +00:00
Simon Pilgrim fd6a84fbaa Fix indentation. NFCI.
llvm-svn: 294959
2017-02-13 15:31:08 +00:00
Simon Pilgrim 828dee1f70 [X86][SSE] Create matchVectorShuffleWithUNPCK helper function.
Currently only used by target shuffle combining - will use it for lowering as well in a future patch.

llvm-svn: 294943
2017-02-13 11:52:58 +00:00
Craig Topper 680c73e7ab [X86] Genericize the handling of INSERT_SUBVECTOR from an EXTRACT_SUBVECTOR to support 512-bit vectors with 128-bit or 256-bit subvectors.
We now detect that both the extract and insert indices are non-zero and convert to a shuffle. This will be lowered as a blend for 256-bit vectors or as a vshuf operations for 512-bit vectors.

llvm-svn: 294931
2017-02-13 04:53:29 +00:00
Craig Topper 53eafa8ea4 [X86] Don't let LowerEXTRACT_SUBVECTOR call getNode for EXTRACT_SUBVECTOR.
This results in the simplifications inside of getNode running while we're legalizing nodes popped off the worklist during the final DAG combine. This basically makes a DAG-combine-like operation occur during this legalize step, but we don't handle it quite the same way; I think we don't recursively add the removed nodes to the DAG combiner worklist.

llvm-svn: 294929
2017-02-12 23:49:46 +00:00
Simon Pilgrim cc9242bd1c [X86] Fix typo in function name. NFCI.
convertBitVectorToUnsiged -> convertBitVectorToUnsigned

llvm-svn: 294914
2017-02-12 20:53:44 +00:00
Simon Pilgrim 04ec0f2b2a [X86][SSE] Update argument names to match function name. NFCI.
The target shuffle match function arguments were using the term 'Ops' but the function names referred to them as 'Inputs' - use 'Inputs' consistently.

llvm-svn: 294900
2017-02-12 16:46:41 +00:00
Simon Pilgrim 4cd841757a [X86][AVX2] Add support for combining target shuffles to VPMOVZX
Initial 256-bit vector support - 512-bit support requires extra checks for AVX512BW support (PMOVZXBW) that will be handled in a future patch.

llvm-svn: 294896
2017-02-12 14:31:23 +00:00
Craig Topper 1c37e991e6 [X86] Move code for using blendi for insert_subvector out to an isel pattern. This gives the DAG combiner more opportunity to optimize without needing to dig through the blend.
llvm-svn: 294876
2017-02-11 22:57:12 +00:00
Simon Pilgrim 755d9127f5 [X86][SSE] Use VSEXT/VZEXT constant folding for SIGN_EXTEND_VECTOR_INREG/ZERO_EXTEND_VECTOR_INREG
Preparatory step for PR31712

llvm-svn: 294874
2017-02-11 22:47:06 +00:00
Simon Pilgrim 437d64c49e [X86][SSE] Improve VSEXT/VZEXT constant folding.
Generalize VSEXT/VZEXT constant folding to work with any target constant bits source, not just BUILD_VECTOR.

llvm-svn: 294873
2017-02-11 21:55:24 +00:00
Simon Pilgrim 4ef9672f0f [X86][SSE] Add early-out when trying to match blend shuffle. NFCI.
llvm-svn: 294864
2017-02-11 18:06:24 +00:00
Amaury Sechet 58ce15aba1 Fix indentation in X86ISelLowering. NFC
llvm-svn: 294859
2017-02-11 17:48:48 +00:00
Simon Pilgrim 0e6945e48a [X86][SSE] Convert getTargetShuffleMaskIndices to use getTargetConstantBitsFromNode.
Removes duplicate constant extraction code in getTargetShuffleMaskIndices.

getTargetConstantBitsFromNode - adds support for VZEXT_MOVL(SCALAR_TO_VECTOR) and fails if the caller doesn't support undef bits.

llvm-svn: 294856
2017-02-11 17:27:21 +00:00
Simon Pilgrim d59fa0e38a [X86] Merge repeated getScalarValueSizeInBits calls. NFCI.
llvm-svn: 294852
2017-02-11 16:42:07 +00:00
Ahmed Bougacha 2e275e272f [X86] Bitcast subvector before broadcasting it.
Since r274013, we've been looking through bitcasts on broadcast inputs.
In the scalar-folding case (from a load, build_vector, or sc2vec),
the input type didn't matter, as we'd simply bitcast the resulting
scalar back.

However, when broadcasting a 128-bit-lane-aligned element, we create an
EXTRACT_SUBVECTOR.  Use proper types, by creating an extract_subvector
of the original input type.

llvm-svn: 294774
2017-02-10 19:51:47 +00:00
Simon Pilgrim 8c8b10389d [X86][SSE] Use SDValue::getConstantOperandVal helper. NFCI.
Also reordered an if statement to test low cost comparisons first

llvm-svn: 294748
2017-02-10 14:27:59 +00:00
Simon Pilgrim c371159aac [X86][SSE] Add support for extracting target constants from BUILD_VECTOR
In some cases we call getTargetConstantBitsFromNode for nodes that haven't been lowered from BUILD_VECTOR yet

Note: We're getting very close to being able to move most of the constant extraction code from getTargetShuffleMaskIndices into getTargetConstantBitsFromNode
llvm-svn: 294746
2017-02-10 14:04:11 +00:00
Simon Pilgrim 1140281413 [X86][SSE] Add missing comment describing combing to SHUFPS. NFCI
llvm-svn: 294745
2017-02-10 13:16:01 +00:00
Simon Pilgrim 7f0d7e08b2 [X86] Remove duplicate call to getValueType. NFCI.
llvm-svn: 294640
2017-02-09 22:35:59 +00:00
Simon Pilgrim e0b5c2acbd Convert to for-range loop. NFCI.
llvm-svn: 294610
2017-02-09 18:52:24 +00:00
Simon Pilgrim 6bf1bd3ed6 [X86][MMX] Remove the (long time) unused MMX_PINSRW ISD opcode.
llvm-svn: 294596
2017-02-09 17:08:47 +00:00
Pierre Gousseau 6953b32475 [X86][btver2] PR31902: Fix a crash in combineOrCmpEqZeroToCtlzSrl under fast math.
In combineOrCmpEqZeroToCtlzSrl, replace "getConstantOperand == 0" by "isNullConstant" to account for floating point constants.

Differential Revision: https://reviews.llvm.org/D29756

llvm-svn: 294588
2017-02-09 14:43:58 +00:00
Simon Pilgrim 563e23e66e [X86][SSE] Attempt to break register dependencies during lowerBuildVector
LowerBuildVectorv16i8/LowerBuildVectorv8i16 insert values into an UNDEF vector if the build vector doesn't contain any zero elements, resulting in register dependencies with a previous use of the register.

This patch attempts to break the register dependency by either always zeroing the vector beforehand or (if we're inserting to the 0'th element) by using VZEXT_MOVL(SCALAR_TO_VECTOR(i32 AEXT(Elt))) which lowers to (V)MOVD and performs a similar function. Additionally (V)MOVD is a shorter instruction than PINSRB/PINSRW. We already do something similar for SSE41 PINSRD.

On pre-SSE41 LowerBuildVectorv16i8 we go a little further and use VZEXT_MOVL(SCALAR_TO_VECTOR(i32 ZEXT(Elt))) if the build vector contains zeros, to avoid the vector zeroing at the cost of a scalar zero extension; this can probably be brought over to the other cases in a future patch (load folding etc.).

Differential Revision: https://reviews.llvm.org/D29720

llvm-svn: 294581
2017-02-09 11:50:19 +00:00
Craig Topper 50f3d1452c [X86] Clzero intrinsic and its addition under znver1
This patch does the following.

1. Adds an Intrinsic int_x86_clzero which works with __builtin_ia32_clzero
2. Identifies clzero feature using cpuid info. (Function:8000_0008, Checks if EBX[0]=1)
3. Adds the clzero feature under znver1 architecture.
4. The custom inserter is added in Lowering.
5. A testcase is added to check the intrinsic.
6. The clzero instruction is added to assembler test.

Patch by Ganesh Gopalasubramanian with a couple formatting tweaks, a disassembler test, and using update_llc_test.py from me.

Differential revision: https://reviews.llvm.org/D29385

llvm-svn: 294558
2017-02-09 04:27:34 +00:00
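Usage of the builtin named in point 1 (a sketch; requires a target with the clzero feature, e.g. -march=znver1 or -mclzero):

    // Zero the cache line containing 'p'.
    void zero_line(void *p) {
      __builtin_ia32_clzero(p);
    }
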
Simon Pilgrim dcd10344a3 [X86][SSE] Tidyup LowerBuildVectorv16i8 and LowerBuildVectorv8i16. NFCI.
Run clang-format and standardized variable names between functions.

llvm-svn: 294456
2017-02-08 14:44:45 +00:00
Sanjay Patel b0cee9b273 [x86] improve comments for SHRUNKBLEND node creation; NFC
llvm-svn: 294344
2017-02-07 19:54:16 +00:00
Sanjay Patel ef6d573f67 [x86] use range-for loops; NFCI
llvm-svn: 294337
2017-02-07 19:18:25 +00:00
Sanjay Patel 633ecbf3c4 [x86] use getSignBit() for clarity; NFCI
llvm-svn: 294333
2017-02-07 19:01:35 +00:00
Simon Pilgrim 8c0f62d293 [X86][SSE] Ensure that vector shift-by-immediate inputs are correctly bitcast to the result type
vXi8/vXi64 vector shifts are often shifted as vYi16/vYi32 types but we weren't always remembering to bitcast the input.

Tested with a new assert as we don't currently manipulate these shifts enough for test cases to catch them.

llvm-svn: 294308
2017-02-07 14:22:25 +00:00
Simon Pilgrim bfd4495512 [X86][SSE] Combine shuffle nodes with multiple uses if all the users are being combined.
Currently we only combine shuffle nodes if they have a single user to prevent us from causing code bloat by splitting the shuffles into several different combines.

We don't take into account that in some cases we will already have combined all the users during recursively calling up the shuffle tree.

This patch keeps a list of all the shuffle nodes that have been combined so far and permits combining of further shuffle nodes if all its users are in that list.

Differential Revision: https://reviews.llvm.org/D29399

llvm-svn: 294183
2017-02-06 13:44:45 +00:00
Simon Pilgrim 380ce75687 [X86][SSE] Replace insert_vector_elt(vec, -1, idx) with shuffle
Similar to what we already do for zero elt insertion, we can quickly rematerialize 'allbits' vectors so as to avoid an unnecessary GPR value and insertion into a vector.

llvm-svn: 294162
2017-02-05 22:50:29 +00:00
Craig Topper 6a35a81fc5 [X86] In LowerTRUNCATE, create an ISD::VECTOR_SHUFFLE instead of explicitly creating a PSHUFB. This will be lowered by regular shuffle lowering to a PSHUFB later.
Similar was already done for several other shuffles in this function.

The test changes are because the old code used explicit zeroing for elements that could have been undef.

While I was here I also changed other shuffle vectors in the same function to use the same input twice instead of creating UNDEF nodes. getVectorShuffle can create the UNDEF for us.

llvm-svn: 294130
2017-02-05 18:33:14 +00:00
Craig Topper 978fdb75a4 [X86] Add support for folding (insert_subvector vec1, (extract_subvector vec2, idx1), idx1) -> (blendi vec2, vec1).
llvm-svn: 294112
2017-02-04 23:26:46 +00:00
Craig Topper 3d95228dbe [X86] Simplify the code that turns INSERT_SUBVECTOR into BLENDI. NFCI
llvm-svn: 294111
2017-02-04 23:26:42 +00:00
Simon Pilgrim 034c1bd32c [X86][SSE] Add support for combining scalar_to_vector(extract_vector_elt) into a target shuffle.
Correctly flagging upper elements as undef.

llvm-svn: 294020
2017-02-03 17:59:58 +00:00
Craig Topper bbb2b95ce5 [X86] Mark 256-bit and 512-bit INSERT_SUBVECTOR operations as legal and remove the custom lowering.
llvm-svn: 293969
2017-02-03 00:24:49 +00:00
Reid Kleckner 3c467e225e [X86] Avoid sorted order check in release builds
Effectively reverts r290248 and fixes the unused function warning with
ifndef NDEBUG.

llvm-svn: 293945
2017-02-02 22:06:30 +00:00
Craig Topper c45657375b [X86] Move turning 256-bit INSERT_SUBVECTORS into BLENDI from legalize to DAG combine.
On one test this seems to have given more chance for DAG combine to do other INSERT_SUBVECTOR/EXTRACT_SUBVECTOR combines before the BLENDI was created. Looks like we can still improve more by teaching DAG combine to optimize INSERT_SUBVECTOR/EXTRACT_SUBVECTOR with BLENDI.

llvm-svn: 293944
2017-02-02 22:02:57 +00:00
Simon Pilgrim 20ab6b875a [X86][SSE] Use MOVMSK for all_of/any_of reduction patterns
This is a first attempt at using the MOVMSK instructions to replace all_of/any_of reduction patterns (i.e. an and/or + shuffle chain).

So far this only matches patterns where we are reducing an all/none bits source vector (i.e. a comparison result) but we should be able to expand on this in conjunction with improvements to 'bool vector' handling both in the x86 backend as well as the vectorizers etc.

Differential Revision: https://reviews.llvm.org/D28810

llvm-svn: 293880
2017-02-02 11:52:33 +00:00
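A sketch of the reduction being replaced, shown at the intrinsics level (assuming SSE2; reduce is a hypothetical name):

    #include <emmintrin.h>

    // all_of / any_of over a v4i32 comparison result via one MOVMSK
    // plus a scalar compare, instead of an and/or + shuffle chain.
    void reduce(__m128i a, __m128i b, bool &any, bool &all) {
      __m128i eq = _mm_cmpeq_epi32(a, b);              // 0 or -1 per lane
      int msk = _mm_movemask_ps(_mm_castsi128_ps(eq)); // one bit per lane
      any = msk != 0;   // some lane equal
      all = msk == 0xF; // all four lanes equal
    }
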
Craig Topper 047a8be18a [X86] Remove some unused DAGCombinerInfo parameters. NFC
llvm-svn: 293873
2017-02-02 08:03:23 +00:00
Craig Topper 94ed54b49a [X86] Move some INSERT_SUBVECTOR optimizations from legalize to DAG combine.
This moves creation of SUBV_BROADCAST and merging of adjacent loads that are being inserted together.

This is a step towards removing legalizing of INSERT_SUBVECTOR except for vXi1 cases.

llvm-svn: 293872
2017-02-02 08:03:20 +00:00
Simon Pilgrim ca931efc21 [X86][SSE] Remove unused argument. NFCI.
llvm-svn: 293777
2017-02-01 16:34:50 +00:00
Simon Pilgrim 55a9c79bd1 [X86][SSE] Merge SSE2 PINSRW lowering with SSE41 PINSRB/PINSRW lowering. NFCI.
These are identical apart from the extra SSE41 guard for PINSRB.

llvm-svn: 293766
2017-02-01 13:32:19 +00:00
Simon Pilgrim 1b39d5db7b [X86][SSE] Add support for combining PINSRB into a target shuffle.
llvm-svn: 293637
2017-01-31 14:59:44 +00:00
Benjamin Kramer 94a833962c [X86] Silence unused variable warning in Release builds.
llvm-svn: 293631
2017-01-31 14:13:53 +00:00
Simon Pilgrim 4eab18f6b8 [X86][SSE] Detect unary PBLEND shuffles.
These can appear during shuffle combining.

llvm-svn: 293628
2017-01-31 13:58:01 +00:00
Simon Pilgrim c29eab52e8 [X86][SSE] Add support for combining PINSRW into a target shuffle.
Also add the ability to recognise PINSR(Vex, 0, Idx).

Target shuffle combines won't replace multiple insertions with a bit mask until a depth of 3 or more, so we avoid codesize bloat.

The unnecessary vpblendw in clearupper8xi16a will be fixed in an upcoming patch.

llvm-svn: 293627
2017-01-31 13:51:10 +00:00
Craig Topper b76494e017 [X86] Remove 'else' after 'return'. NFC
llvm-svn: 293589
2017-01-31 02:09:46 +00:00
Simon Pilgrim 3905e03a47 [X86][SSE] Fix unsigned <= 0 warning in assert. NFCI.
Thanks to @mkuper

llvm-svn: 293561
2017-01-30 22:58:44 +00:00
Simon Pilgrim a80a47afef [X86][SSE] Generalize the number of decoded shuffle inputs. NFCI.
combineX86ShufflesRecursively can still only handle a maximum of 2 shuffle inputs but everything before it now supports any number of shuffle inputs.

This will be necessary for combining OR(SHUFFLE, SHUFFLE) patterns.

llvm-svn: 293560
2017-01-30 22:48:49 +00:00
Simon Pilgrim 098998aef0 [X86][SSE] Add support for combining PINSRW+ASSERTZEXT+PEXTRW patterns with target shuffles
llvm-svn: 293500
2017-01-30 16:58:34 +00:00
Asaf Badouh e11d2d73bf [X86][MCU] Minor bug fix for r293469 + test case
llvm-svn: 293478
2017-01-30 13:14:37 +00:00
Asaf Badouh 53713df0c2 [X86][MCU] replace select with bit manipulation instead of branches
Differential Revision: https://reviews.llvm.org/D28354

llvm-svn: 293469
2017-01-30 08:16:59 +00:00
Craig Topper 3b7e823f92 [AVX-512] Don't reuse VSHLI/VSRLI for mask register shifts. VSHLI/VSHRI shift within elements while KSHIFT moves whole elements.
llvm-svn: 293448
2017-01-30 00:06:01 +00:00
Craig Topper db919caf1b [AVX-512] Fix lowering for mask register concatenation with undef in the lower half.
Previously this test case fired an assertion in getNode because we tried to create an insert_subvector with both input types the same size and the index pointing to half the vector width.

llvm-svn: 293446
2017-01-29 22:53:33 +00:00
Simon Pilgrim 76073f8d22 [X86][SSE] Lower scalar_to_vector(0) to zero vector
Replaces an xor+movd/movq with an xorps, which will be shorter in codesize, avoids an int-fpu transfer, allows modern cores to fast-path the result during decode, and helps other combines recognise an all-zero vector.

The only reason I can think of that we'd want to keep scalar_to_vector in this case is to help recognise the upper elts are undef but this doesn't seem to be a problem.

Differential Revision: https://reviews.llvm.org/D29097

llvm-svn: 293438
2017-01-29 18:13:37 +00:00
Elena Demikhovsky 17fe27f1f2 [X86 Codegen] Fixed a bug in unsigned saturation
PACKUSWB converts a signed word to an unsigned byte (the same holds for DWs), so it can't be used for the umin+truncate pattern.
AVX-512 VPMOVUS* instructions fit the pattern since they convert Unsigned to Unsigned.

See https://llvm.org/bugs/show_bug.cgi?id=31773

Differential Revision: https://reviews.llvm.org/D29196

llvm-svn: 293431
2017-01-29 13:18:30 +00:00
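A concrete case showing the mismatch (a worked example, not code from the patch): PACKUSWB saturates its signed word inputs, so a word like 0x8000 disagrees with umin+truncate.

    #include <cassert>
    #include <cstdint>

    int main() {
      uint16_t x = 0x8000;
      // umin + truncate: min(0x8000, 0xFF) = 0xFF.
      uint8_t want = (uint8_t)(x < 0xFF ? x : 0xFF);
      // PACKUSWB sees 0x8000 as signed -32768 and saturates to 0.
      int16_t s = (int16_t)x;
      uint8_t pack = s < 0 ? 0 : (s > 255 ? 255 : (uint8_t)s);
      assert(want == 0xFF && pack == 0); // not equivalent
    }
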
Craig Topper 6533e40e9d [X86] Fix vector ANDN matching to work correctly when both inputs to the AND are XORs.
llvm-svn: 293403
2017-01-28 23:52:09 +00:00
Simon Pilgrim 027bb453d9 [X86][SSE] Add support for combining ANDNP byte masks with target shuffles
llvm-svn: 293178
2017-01-26 14:31:12 +00:00
Simon Pilgrim 3057fd53f9 [X86][SSE] Pull out target shuffle resolve code into helper. NFCI.
Pulled out code that removed unused inputs from a target shuffle mask into a helper function to allow it to be reused in a future commit.

llvm-svn: 293175
2017-01-26 13:06:02 +00:00
Craig Topper bad53cce26 [AVX-512] Move the combine that runs combineBitcastForMaskedOp to the last DAG combine phase where I had originally meant to put it.
llvm-svn: 293157
2017-01-26 07:17:58 +00:00
Craig Topper f0bab7b739 [X86] When bitcasting INSERT_SUBVECTOR/EXTRACT_SUBVECTOR to match masked operations, use the correct type for the immediate operand.
llvm-svn: 293156
2017-01-26 07:17:53 +00:00
Martin Bohme 526299c81c [X86][SSE] Add explicit braces to avoid -Wdangling-else warning.
Reviewers: RKSimon

Subscribers: llvm-commits, igorb

Differential Revision: https://reviews.llvm.org/D29076

llvm-svn: 292924
2017-01-24 12:31:30 +00:00
Simon Pilgrim 0c45338961 Fix unused variable warning
llvm-svn: 292921
2017-01-24 11:54:27 +00:00
Simon Pilgrim e1ec9072f6 [X86][SSE] Add support for constant folding vector arithmetic shift by immediates
llvm-svn: 292919
2017-01-24 11:46:13 +00:00
Simon Pilgrim 6340e54861 [X86][SSE] Add support for constant folding vector logical shift by immediates
llvm-svn: 292915
2017-01-24 11:21:57 +00:00
Craig Topper fc8798fa1b [X86] Remove unnecessary peekThroughBitcasts call that's already taken care of by the ISD::isBuildVectorAllOnes check below.
llvm-svn: 292894
2017-01-24 06:57:29 +00:00
Craig Topper 993edc9db1 [X86] Don't split v8i32 all ones values if only AVX1 is available. Keep it intact and split it at isel.
This allows us to remove the check in ANDN combining that had to look through the extraction.

llvm-svn: 292881
2017-01-24 04:33:03 +00:00
Craig Topper eb440a14a5 [X86] Remove Undef handling from extractSubVector. This is now handled inside getNode.
llvm-svn: 292877
2017-01-24 02:43:54 +00:00
Simon Pilgrim 0218ce1080 [X86][SSE] Add missing X86ISD::ANDNP combines.
llvm-svn: 292767
2017-01-22 22:45:23 +00:00
Simon Pilgrim 7e1cc97513 [X86][SSE] Improve shuffle combining with zero insertions
Add support for handling shuffles with scalar_to_vector(0)

llvm-svn: 292766
2017-01-22 22:21:44 +00:00
Sanjay Patel 8f49aede82 [x86] avoid crashing with illegal vector type (PR31672)
https://llvm.org/bugs/show_bug.cgi?id=31672

llvm-svn: 292758
2017-01-22 17:06:12 +00:00
Craig Topper 8e0724d332 [X86] Don't allow commuting to form phsub operations.
Fixes PR31714.

llvm-svn: 292713
2017-01-21 06:59:38 +00:00
Simon Pilgrim db101e4d57 [X86][SSE] Improve comments describing combineTruncatedArithmetic. NFCI.
llvm-svn: 292502
2017-01-19 18:18:32 +00:00
Simon Pilgrim 5f2f53b106 [X86][SSE] Attempt to pre-truncate arithmetic operations that have already been extended
As discussed on D28219 - it is profitable to combine trunc(binop(s/zext(x), s/zext(y))) to binop(trunc(s/zext(x)), trunc(s/zext(y))), assuming the trunc(ext()) will simplify further
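
A scalar model of why this is sound (a sketch, not the DAG combine itself): truncation commutes with wrap-around arithmetic, so doing the op at the narrow width yields the same low bits.

  #include <cstdint>
  uint8_t truncAfter(uint8_t x, uint8_t y) {
    return uint8_t(uint32_t(x) + uint32_t(y)); // trunc(binop(zext, zext))
  }
  uint8_t truncBefore(uint8_t x, uint8_t y) {
    return uint8_t(x + y);                     // binop at the narrow width
  }
  // truncAfter == truncBefore for all inputs, so the combine is safe
  // whenever the inner trunc(ext()) pairs fold away.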

llvm-svn: 292493
2017-01-19 16:25:02 +00:00
Elena Demikhovsky e01512cecf Recommiting unsigned saturation with a bugfix.
A test case that crashed is added to avx512-trunc.ll.
(PR31589)

llvm-svn: 292479
2017-01-19 12:08:21 +00:00
Craig Topper 200ea31684 [AVX-512] Support ADD/SUB/MUL of mask vectors
Summary:
Currently we expand and scalarize these operations, but I think we should be able to implement ADD/SUB with KXOR and MUL with KAND.

We already do this for scalar i1 operations so I just extended it to vectors of i1.
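
The i1 identities this relies on, checked exhaustively (illustrative C++, not the lowering code):

  #include <cassert>
  int main() {
    for (int a = 0; a <= 1; ++a)
      for (int b = 0; b <= 1; ++b) {
        assert(((a + b) & 1) == (a ^ b)); // ADD on i1 is XOR (KXOR)
        assert(((a - b) & 1) == (a ^ b)); // SUB on i1 is also XOR
        assert(((a * b) & 1) == (a & b)); // MUL on i1 is AND (KAND)
      }
  }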

Reviewers: zvi, delena

Reviewed By: delena

Subscribers: guyblank, llvm-commits

Differential Revision: https://reviews.llvm.org/D28888

llvm-svn: 292474
2017-01-19 07:12:35 +00:00
Craig Topper c227529105 [X86] Merge LowerADD and LowerSUB into a single LowerADD_SUB since they are identical.
llvm-svn: 292469
2017-01-19 03:49:29 +00:00
Michael Kuperstein d3d2925933 Revert r291670 because it introduces a crash.
r291670 doesn't crash on the original testcase from PR31589,
but it crashes on a slightly more complex one.

PR31589 has the new reproducer.

llvm-svn: 292444
2017-01-18 23:05:58 +00:00
Kirill Bobyrev 6afbaf0944 Revert 292404 due to buildbot failures.
llvm-svn: 292407
2017-01-18 16:34:25 +00:00
Kirill Bobyrev 9ad06dbe17 [X86] Minor code cleanup to fix several clang-tidy warnings. NFC
llvm-svn: 292404
2017-01-18 16:15:47 +00:00
Michael Zuckerman 0c0240ce84 [X86] Improve mul combine for negative multiplier (2^c - 1)
This patch improves the mul instruction combine function (combineMul)
by adding a new layer of logic.
In this patch, we are adding the ability to fold (mul x, -((1 << c) - 1))
or (mul x, -((1 << c) + 1)) into (neg((x << c) - x)) or (neg((x << c) + x)) respectively.
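
A quick check of the underlying identities (illustrative; nonnegative values sidestep UB on signed left shifts):

  #include <cassert>
  int main() {
    const int xs[] = {0, 1, 3, 1000};
    for (int x : xs)
      for (int c = 1; c < 8; ++c) {
        assert(x * -((1 << c) - 1) == -((x << c) - x));
        assert(x * -((1 << c) + 1) == -((x << c) + x));
      }
  }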

Differential Revision: https://reviews.llvm.org/D28232

llvm-svn: 292358
2017-01-18 09:31:13 +00:00
Bob Wilson f2d0b68b3b Revert r291640 change to fold X86 comparison with atomic_load_add.
Even with the fix from r291630, this still causes problems. I get
widespread assertion failures in the Swift runtime's WeakRefCount::increment()
function. I sent a reduced testcase in reply to the commit.

llvm-svn: 292242
2017-01-17 19:18:57 +00:00
Craig Topper 729d30d0ae [AVX-512] Add support for taking a bitcast between a SUBV_BROADCAST and VSELECT and moving it to the input of the SUBV_BROADCAST if it will help with using a masked operation.
llvm-svn: 292201
2017-01-17 06:49:59 +00:00
Michael Zuckerman 6baa3838e9 Fix blend mask by switching the sides of the operands, since the Blend node uses the opposite mask to the Select node.
llvm-svn: 292066
2017-01-15 16:43:14 +00:00
Craig Topper 9cc685a56e [X86] Simplify the code that calculates a scaled blend mask. We don't need a second loop.
llvm-svn: 291996
2017-01-14 04:29:15 +00:00
Craig Topper 9850210d03 [AVX-512] Change blend mask in lowerVectorShuffleAsBlend to a 64-bit value. Also add 32-bit mode command lines to the test case that exercises this just to make sure we sanely handle the 64-bit immediate there.
This fixes an undefined behavior sanitizer failure from r291888.

llvm-svn: 291994
2017-01-14 04:19:35 +00:00
Benjamin Kramer 061f4a5fe6 Apply clang-tidy's performance-unnecessary-value-param to LLVM.
With some minor manual fixes for using function_ref instead of
std::function. No functional change intended.

llvm-svn: 291904
2017-01-13 14:39:03 +00:00
Simon Pilgrim 7f2a6d5e8c [X86][AVX512] Add support for variable ASHR v2i64/v4i64 support without VLX
Use v8i64 variable ASHR instructions if we don't have VLX.

This is a reduced version of D28537 that just adds support for variable shifts - I'll continue with that patch (for just constant/uniform shifts) once I've fixed the type legalization issue in avx512-cvt.ll.
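
The same widening idea expressed with intrinsics (a conceptual sketch, not the generated code):

  #include <immintrin.h>
  // Without VLX there is no 256-bit VPSRAVQ, so widen to 512 bits,
  // use the v8i64 variable shift, and take the low 256 bits back.
  __m256i srav_epi64_no_vlx(__m256i x, __m256i amt) {
    __m512i wide = _mm512_srav_epi64(_mm512_castsi256_si512(x),
                                     _mm512_castsi256_si512(amt));
    return _mm512_castsi512_si256(wide);
  }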

Differential Revision: https://reviews.llvm.org/D28604

llvm-svn: 291901
2017-01-13 13:16:19 +00:00
Diana Picus 116bbab4e4 [CodeGen] Rename MachineInstrBuilder::addOperand. NFC
Rename from addOperand to just add, to match the other method that has been
added to MachineInstrBuilder for adding more than just 1 operand.

See https://reviews.llvm.org/D28057 for the whole discussion.
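
A hypothetical call site before and after the rename (sketch):

  #include "llvm/CodeGen/MachineInstrBuilder.h"
  using namespace llvm;
  static void copyOperand(MachineInstrBuilder &MIB, const MachineOperand &MO) {
    // Previously spelled: MIB.addOperand(MO);
    MIB.add(MO); // the new, shorter name
  }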

Differential Revision: https://reviews.llvm.org/D28556

llvm-svn: 291891
2017-01-13 09:58:52 +00:00
Michael Zuckerman 558a4d8419 [X86][AVX512] Adding missing shuffle lowering to blend mask instructions
Some shuffles can be lowered to blend mask instructions (VPBLENDMB/VPBLENDMW/VPBLENDMD/VPBLENDMQ).
In this patch, I added a new pattern match for this case.
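
What a blend-mask instruction computes, shown via the corresponding intrinsic (illustrative):

  #include <immintrin.h>
  // VPBLENDMD: lanes whose mask bit is set come from b; cleared bits
  // select from a.
  __m512i blend_d(__mmask16 m, __m512i a, __m512i b) {
    return _mm512_mask_blend_epi32(m, a, b);
  }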

Reviewers:
1. craig.topper
2. guyblank
3. RKSimon
4. igorb     

Differential Revision: https://reviews.llvm.org/D28483

llvm-svn: 291888
2017-01-13 09:06:00 +00:00
Nikolai Bozhenov f02ac0eeb2 [X86] Replace AND+IMM64 with SRL/SHL
Emit SHRQ/SHLQ instead of ANDQ with a 64 bit constant mask if the result
is unused and the mask has only higher/lower bits set. For example, with
this patch LLVM emits

  shrq $41, %rdi
  je

instead of

  movabsq $0xFFFFFE0000000000, %rcx
  testq   %rcx, %rdi
  je

This reduces the number of instructions, code size and register pressure.
The transformation is applied only for cases where the mask cannot be
encoded as an immediate value within the TESTQ instruction.
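
Why the rewrite is sound for the example above (a self-contained check, not from the commit):

  #include <cstdint>
  // The mask 0xFFFFFE0000000000 sets bits [63:41], so testing it
  // against zero asks "are the top 23 bits of x clear?", which is
  // exactly what shrq $41 followed by a zero test answers.
  bool viaTest(uint64_t x)  { return (x & 0xFFFFFE0000000000ULL) == 0; }
  bool viaShift(uint64_t x) { return (x >> 41) == 0; }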

Differential Revision: https://reviews.llvm.org/D28198

llvm-svn: 291806
2017-01-12 19:54:27 +00:00
Nikolai Bozhenov 6bdf92cec7 [X86] Tune bypassing of slow division for Intel CPUs
64-bit integer division in Intel CPUs is extremely slow, much slower
than 32-bit division. On the other hand, 8-bit and 16-bit divisions
aren't any faster. The only important exception is Atom where DIV8
is fastest. Because of that, the patch
1) Enables bypassing of 64-bit division for Atom, Silvermont and
   all big cores.
2) Modifies 64-bit bypassing to use 32-bit division instead of a
   16-bit one. This doesn't make the shorter division slower but
   increases the chances of taking it. Moreover, it's much more likely
   to prove at compile-time that a value fits 32 bits and doesn't
   require a run-time check (e.g. zext i32 to i64).
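
A conceptual sketch of the bypass (an assumption about the shape of the emitted check, not the exact codegen):

  #include <cstdint>
  uint64_t divBypass(uint64_t a, uint64_t b) {
    if (((a | b) >> 32) == 0)           // both operands fit in 32 bits
      return uint32_t(a) / uint32_t(b); // fast 32-bit divide
    return a / b;                       // slow 64-bit divide
  }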

Differential Revision: https://reviews.llvm.org/D28196

llvm-svn: 291800
2017-01-12 19:34:15 +00:00
Craig Topper 24c3a2395f [AVX-512] Improve lowering of zero_extend of v4i1 to v4i32 and v2i1 to v2i64 with VLX, but no DQ or BW support.
llvm-svn: 291747
2017-01-12 06:49:12 +00:00
Craig Topper 69ab67b279 [AVX-512] Improve lowering of sign_extend of v4i1 to v4i32 and v2i1 to v2i64 when avx512vl is available, but not avx512dq.
llvm-svn: 291746
2017-01-12 06:49:08 +00:00
Elad Cohen c5ba925ef2 [X86][AVX512] Fix PR31515 - Do not flip vselect condition if it's not a vXi1 mask
r289653 added a case where `vselect <cond> <vector1> <all-zeros>`
is transformed to:
`vselect xor(cond, DAG.getConstant(1, DL, CondVT)) <all-zeros> <vector1>`
This was not aimed at catching cases where Cond is not a vXi1
mask, but it does. Moreover, when the Cond type is vXiN (N > 1),
xor(cond, DAG.getConstant(1, DL, CondVT)) != NOT(cond).
This patch changes the above to an xor with allones, and avoids
entering the case for non-mask Conds.
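
A two-bit example of the difference (illustrative):

  // xor with 1 flips only bit 0; NOT (xor with all-ones) flips every bit.
  static_assert((0b10 ^ 0b01) == 0b11, "xor-with-1 keeps the high bit set");
  static_assert((0b10 ^ 0b11) == 0b01, "xor-with-allones is a true NOT");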

llvm-svn: 291745
2017-01-12 06:49:03 +00:00