Commit Graph

19391 Commits

Author SHA1 Message Date
Tim Northover 522fb7eedc GlobalISel: support swiftself attribute
llvm-svn: 367683
2019-08-02 14:09:49 +00:00
Daniel Sanders 2bea69bf65 Finish moving TargetRegisterInfo::isVirtualRegister() and friends to llvm::Register as started by r367614. NFC
llvm-svn: 367633
2019-08-01 23:27:28 +00:00
Craig Topper a9ed5436bd [X86] In decomposeMulByConstant, legalize the VT before querying whether the multiply is legal
If a type is larger than a legal type and needs to be split, we would previously allow the multiply to be decomposed even if the split multiply is legal. Since the shift + add/sub code would also need to be split, its not any better to decompose it.

This patch figures out what type the mul will eventually be legalized to and then uses that type for the query. I tried just returning false illegal types and letting them get handled after type legalization, but then we can't recognize and i64 constant splat on 32-bit targets since will be destroyed by type legalization. We could special case vectors of i64 to avoid that...

Differential Revision: https://reviews.llvm.org/D65533

llvm-svn: 367601
2019-08-01 18:49:07 +00:00
Simon Pilgrim 63d4114f72 [X86][SSE] Add PEXTR*(PINSR*(v, s, c), c) -> s combine.
We should probably extend this to cover bitcasts as well to help other cases in promote-vec3.ll.

llvm-svn: 367582
2019-08-01 16:38:39 +00:00
Simon Pilgrim 33f5f863b5 [X86][SSE] SimplifyMultipleUseDemandedBits - Add PEXTR/PINSR B+W handling
This adds SimplifyMultipleUseDemandedBitsForTargetNode X86 support and uses it to allow us to peek through vector insertions to avoid dependencies on entire insertion chains.

llvm-svn: 367570
2019-08-01 14:46:03 +00:00
Simon Pilgrim f99f9881e3 [X86] EltsFromConsecutiveLoads - don't attempt to merge volatile loads (PR42846)
llvm-svn: 367556
2019-08-01 13:13:18 +00:00
Amy Huang 153f20057c Revert "[MS] Emit S_HEAPALLOCSITE debug info in Selection DAG" and
and partial fix.
Causes windows buildbot errors.

This reverts commit 6e65c34523963094acd0d6c94a5f5c64b32fe6aa and
53da7ca943.

llvm-svn: 367496
2019-07-31 23:59:31 +00:00
Craig Topper b51dc64063 [X86] Add DAG combine to fold any_extend_vector_inreg+truncstore to an extractelement+store
We have custom code that ignores the normal promoting type legalization on less than 128-bit vector types like v4i8 to emit pavgb, paddusb, psubusb since we don't have the equivalent instruction on a larger element type like v4i32. If this operation appears before a store, we can be left with an any_extend_vector_inreg followed by a truncstore after type legalization. When truncstore isn't legal, this will normally be decomposed into shuffles and a non-truncating store. This will then combine away the any_extend_vector_inreg and shuffle leaving just the store. On avx512, truncstore is legal so we don't decompose it and we had no combines to fix it.

This patch adds a new DAG combine to detect this case and emit either an extract_store for 64-bit stoers or a extractelement+store for 32 and 16 bit stores. This makes the avx512 codegen match the avx2 codegen for these situations. I'm restricting to only when -x86-experimental-vector-widening-legalization is false. When we're widening we're not likely to create this any_extend_inreg+truncstore combination. This means we should be able to remove this code when we flip the default. I would like to flip the default soon, but I need to investigate some performance regressions its causing in our branch that I wasn't seeing on trunk.

Differential Revision: https://reviews.llvm.org/D65538

llvm-svn: 367488
2019-07-31 22:43:08 +00:00
Mark Lacey 641ea2e701 [GISel] Address review feedback on passing MD_callees to lowerCall.
Preserve the nullptr default for KnownCallees that appears in
the base class.

llvm-svn: 367477
2019-07-31 20:34:05 +00:00
Mark Lacey 7b8d3eb9e2 [GISel] Pass MD_callees metadata down in call lowering.
Summary:
This will make it possible to improve IPRA by taking into account
register usage in indirect calls.

NFC yet; this is just laying the groundwork to start building
up patches to take advantage of the information for improved register
allocation.

Reviewers: aditya_nandakumar, volkan, qcolombet, arsenm, rovka, aemerson, paquette

Subscribers: sdardis, wdng, javed.absar, hiraditya, jrtc27, atanasyan, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65488

llvm-svn: 367476
2019-07-31 20:34:02 +00:00
Djordje Todorovic b9973f87c6 Reland "[DwarfDebug] Dump call site debug info"
The build failure found after the rL365467 has been
resolved.

Differential Revision: https://reviews.llvm.org/D60716

llvm-svn: 367446
2019-07-31 16:51:28 +00:00
Simon Pilgrim 0707f66ad0 [X86] Moved IsNOT helper earlier. NFCI.
Makes it available for more combines to use without adding declarations.

llvm-svn: 367436
2019-07-31 14:36:04 +00:00
Simon Pilgrim 24ad2b5e7d [X86][AVX] Ensure chained subvector insertions are the same size (PR42833)
Before combining insert_subvector(insert_subvector(vec, sub0, c0), sub1, c1) patterns, ensure that the subvectors are all the same type. On AVX512 targets especially we might have a mixture of 128/256 subvector insertions.

llvm-svn: 367429
2019-07-31 12:55:39 +00:00
Amy Huang 53da7ca943 [MS] Emit S_HEAPALLOCSITE debug info in SelectionDAG
Summary: This emits labels around heapallocsite calls in SelectionDAG.

Reviewers: rnk

Subscribers: MatzeB, aprantl, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61105

llvm-svn: 367374
2019-07-31 00:16:13 +00:00
Craig Topper 8b58371fae [X86] Fix mistake in comment. NFC
The code is matching sext not zext.

llvm-svn: 367357
2019-07-30 21:00:24 +00:00
Simon Pilgrim b989bc47c0 [X86] SimplifyDemandedVectorEltsForTargetNode should be calling resolveTargetShuffleInputs not getTargetShuffleMask
Add TODO comment.

llvm-svn: 367318
2019-07-30 15:06:09 +00:00
Simon Pilgrim e4d5423dcd [X86][AVX] SimplifyDemandedVectorElts - handle extraction from X86ISD::SUBV_BROADCAST source (PR42819)
PR42819 showed an issue that we couldn't handle the case where we demanded a 'sub-sub-vector' of the SUBV_BROADCAST 'sub-vector' source.

This patch recognizes these cases and extracts the sub-sub-vector instead of trying to broadcast to a type smaller than the 'sub-vector' source. 

llvm-svn: 367306
2019-07-30 11:35:13 +00:00
Craig Topper 479b45411e [X86] Fix typo in comment. We're looking at a right shift not a left shift. NFC
llvm-svn: 367251
2019-07-29 19:22:51 +00:00
Simon Pilgrim 962c03fac4 [X86] resolveTargetShuffleInputs - add depth to limit recursion.
Avoids slow downs from calls to ComputeNumSignBits/computeKnownBits going too deep.

llvm-svn: 367240
2019-07-29 17:17:58 +00:00
Simon Pilgrim 5ab948f823 [X86] combineX86ShufflesRecursively - start recursion at depth = 0. NFCI.
As discussed on rL367171, we have a problem where the depth recursion used in combineX86ShufflesRecursively was subtly different to computeKnownBits etc. - it starts at Depth=1 instead of Depth=0 like the others and has a different maximum recursion depth.

This NFC patch fixes the recursion depth to start at 0, so we can more easily reuse depth values in calls from combineX86ShufflesRecursively and its helper functions in computeKnownBits etc.

llvm-svn: 367232
2019-07-29 15:57:06 +00:00
Craig Topper eb1beabad9 [X86] Don't use PMADDWD for vector add reductions of multiplies if the mul inputs have an additional user.
The pmaddwd inserts a truncate, if that truncate would end up
creating additional instructions instead of making a zext
narrower, then we shouldn't do it.

I've restricted this to only sse4.1 targets since on prior
targets the zext will be done in stages. So the truncate will
probably not create additional instructions. Might need some
more investigation of mul shrinking and the other pmaddwd
transform to be sure this is the right decision.

There might be a slight regression on AVX1 targets due to add
splitting. Hard to say for sure. Maybe we need to look into
using the vector reduction flag to use 2 narrow loads and a
blend instead of extracting and inserting.

llvm-svn: 367198
2019-07-29 01:36:58 +00:00
Craig Topper 894916cac9 [X86] In combineLoopMAddPattern and combineLoopSADPattern, preserve the vector reduction flag on the final add. Handle unrolled loops by letting DAG combine revisit.
This reverts r340478 and r340631 and replaces them with a simpler
method of just letting DAG combine revisit the nodes to handle
the other operand.

llvm-svn: 367195
2019-07-28 18:45:42 +00:00
Simon Pilgrim 353a848473 [X86][SSE] Replace PMULDQ GetDemandedBits combine with SimplifyMultipleUseDemandedBits handler (Reapplied)
Recommit rL367100 which was reverted at rL367141. Until PR42777 is fixed, we no longer get the benefits of peeking through bitcasts but it does still remove a GetDemandedBits user and gives us the equivalent combines.

llvm-svn: 367172
2019-07-27 13:30:29 +00:00
Vlad Tsyrklevich 485b8789de Revert "[X86][SSE] Replace PMULDQ GetDemandedBits combine with SimplifyMultipleUseDemandedBits handler."
This reverts r367100, it appears to be causing test failures after
Nico's revert of r367091.

llvm-svn: 367141
2019-07-26 18:14:21 +00:00
Simon Pilgrim d93e8ece7b [X86][SSE] Replace PMULDQ GetDemandedBits combine with SimplifyMultipleUseDemandedBits handler.
This removes a GetDemandedBits user and allows us to benefit from the DemandedElts propagated through SimplifyDemandedBits.

llvm-svn: 367100
2019-07-26 11:10:20 +00:00
Pengfei Wang 9ad565f70e [WinEH] Allocate space in funclets stack to save XMM CSRs
Summary:
This is an alternate approach to D57970.
Currently funclets reuse the same stack slots that are used in the
parent function for saving callee-saved xmm registers. If the parent
function modifies a callee-saved xmm register before an excpetion is
thrown, the catch handler will overwrite the original saved value.

This patch allocates space in funclets stack for saving callee-saved xmm
registers and uses RSP instead RBP to access memory.

Reviewers: andrew.w.kaylor, LuoYuanke, annita.zhang, craig.topper,
RKSimon

Subscribers: rnk, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63396

Signed-off-by: pengfei <pengfei.wang@intel.com>
llvm-svn: 367088
2019-07-26 07:33:15 +00:00
Simon Pilgrim 447fe31964 [X86] concatSubVectors - remove unnecessary args. NFCI.
All these args can be cheaply recomputed and it makes it much easier to use the function as a quick helper.

llvm-svn: 367014
2019-07-25 13:05:46 +00:00
Seiya Nuta 21277e3ec2 [MC] Add MCInstrAnalysis::evaluateMemoryOperandAddress
Summary:
Add a new method which tries to compute the target address referenced by an operand.

This patch supports x86_64 RIP-relative addressing for now.

It is necessary to print referenced symbol names in llvm-objdump.

Reviewers: andreadb, MaskRay, grosbach, jgalenson, craig.topper

Reviewed By: MaskRay, craig.topper

Subscribers: bcain, rupprecht, jhenderson, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63847

llvm-svn: 366987
2019-07-25 06:57:09 +00:00
Roman Lebedev 017e272c3a [Codegen] (X & (C l>>/<< Y)) ==/!= 0 --> ((X <</l>> Y) & C) ==/!= 0 fold
Summary:
This was originally reported in D62818.
https://rise4fun.com/Alive/oPH

InstCombine does the opposite fold, in hope that `C l>>/<< Y` expression
will be hoisted out of a loop if `Y` is invariant and `X` is not.
But as it is seen from the diffs here, if it didn't get hoisted,
the produced assembly is almost universally worse.

Much like with my recent "hoist add/sub by/from const" patches,
we should get almost universal win if we hoist constant,
there is almost always an "and/test by imm" instruction,
but "shift of imm" not so much, so we may avoid having to
materialize the immediate, and thus need one less register.
And since we now shift not by constant, but by something else,
the live-range of that something else may reduce.

Special care needs to be applied not to disturb x86 `BT` / hexagon `tstbit`
instruction pattern. And to not get into endless combine loop.

Reviewers: RKSimon, efriedma, t.p.northover, craig.topper, spatel, arsenm

Reviewed By: spatel

Subscribers: hiraditya, MaskRay, wuzish, xbolva00, nikic, nemanjai, jvesely, wdng, nhaehnle, javed.absar, tpr, kristof.beyls, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62871

llvm-svn: 366955
2019-07-24 22:57:22 +00:00
Simon Pilgrim 7d318b2bb1 [DAGCombine] matchBinOpReduction - add partial reduction matching
This patch adds support for recognizing cases where a larger vector type is being used to reduce just the elements in the lower subvector:

e.g. <8 x i32> reduction pattern in a <16 x i32> vector:

<4,5,6,7,u,u,u,u,u,u,u,u,u,u,u,u>
<2,3,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
<1,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u>

matchBinOpReduction returns the lower extracted subvector in such cases, assuming isExtractSubvectorCheap accepts the extraction.

I've only enabled it for X86 reduction sums so far. I intend to enable it for the bitop/minmax cases in future patches, and eventually I think its worth turning it on all the time. This is mainly just a case of ensuring calls to matchBinOpReduction don't make assumptions on the vector width based on the original vector extraction.

Fixes the x86 partial reduction sum cases in PR33758 and PR42023.

Differential Revision: https://reviews.llvm.org/D65047

llvm-svn: 366933
2019-07-24 17:29:56 +00:00
Craig Topper 76bc3d6e07 [X86] In lowerVectorShuffle, instead of creating a new node to canonicalize the shuffle mask by commuting, just commute the mask and swap V1/V2.
LegalizeDAG tries to legal the DAG by legalizing nodes before
their operands.

If we create a new node, we end up legalizing it after its operands.
This prevents some of the optimizations that can be done when the
operand is a build_vector since the build_vector will have been
legalized to something else.

Differential Revision: https://reviews.llvm.org/D65132

llvm-svn: 366835
2019-07-23 18:46:15 +00:00
Craig Topper 510e6fadaa [X86] When using AND+PACKUS in lowerV16I8Shuffle, generate the build vector directly in v16i8 with the correct 0x00 or 0xFF elements rather than using another VT and bitcasting it.
The build_vector will become a constant pool load. By using the
desired type initially, it ensures we don't generate a bitcast
of the constant pool load which will need to be folded with
the load.

While experimenting with another patch, I noticed that when the
load type and the constant pool type don't match, then
SimplifyDemandedBits can't handle it. While we should probably
fix that, this was a simple way to fix the issue I saw.

llvm-svn: 366732
2019-07-22 19:58:49 +00:00
Simon Pilgrim b3d719e1cf [X86] EltsFromConsecutiveLoads - support common source loads (REAPPLIED)
This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.

A helper function, findEltLoadSrc, recurses to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.

Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.

Fixed out of bounds load assert identified in rL366501

Differential Revision: https://reviews.llvm.org/D64551

llvm-svn: 366681
2019-07-22 12:44:10 +00:00
Christudasan Devadasan 006cf8c03d Added address-space mangling for stack related intrinsics
Modified the following 3 intrinsics:
int_addressofreturnaddress,
int_frameaddress & int_sponentry.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D64561

llvm-svn: 366679
2019-07-22 12:42:48 +00:00
Simon Pilgrim 86fa3270ef [X86] SimplifyDemandedVectorEltsForTargetNode - Move SUBV_BROADCAST narrowing handling. NFCI.
Move the narrowing of SUBV_BROADCAST to where we handle all the other opcodes.

llvm-svn: 366660
2019-07-21 19:04:44 +00:00
Simon Pilgrim adec0f2252 [X86][SSE] Use PSADBW to improve vXi8 sum reduction (PR42674)
As detailed on PR42674, we can reduce a vXi8 down until we have the final <8 x i8>, and then use PSADBW with zero, to sum those values. We then extract the bottom i8, discarding any overflow from the upper bits of the i16 result.

llvm-svn: 366636
2019-07-20 15:20:11 +00:00
Than McIntosh e238a4c757 [X86] for split stack, not save/restore nested arg if unused
Summary:
For split-stack, if the nested argument (i.e. R10) is not used, no need to save/restore it in the prologue.

Reviewers: thanm

Reviewed By: thanm

Subscribers: mstorsjo, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64673

llvm-svn: 366569
2019-07-19 12:54:44 +00:00
Amara Emerson cf12c7815f [GlobalISel] Translate calls to memcpy et al to G_INTRINSIC_W_SIDE_EFFECTs and legalize later.
I plan on adding memcpy optimizations in the GlobalISel pipeline, but we can't
do that unless we delay lowering to actual function calls. This patch changes
the translator to generate G_INTRINSIC_W_SIDE_EFFECTS for these functions, and
then have each target specify that using the new custom legalizer for intrinsics
hook that they want it expanded it a libcall.

Differential Revision: https://reviews.llvm.org/D64895

llvm-svn: 366516
2019-07-19 00:24:45 +00:00
Reid Kleckner ba9c9e62cb Revert [X86] EltsFromConsecutiveLoads - support common source loads
This reverts r366441 (git commit 48104ef7c9)

This causes clang to fail to compile some file in Skia. Reduction soon.

llvm-svn: 366501
2019-07-18 21:26:41 +00:00
Simon Pilgrim 48104ef7c9 [X86] EltsFromConsecutiveLoads - support common source loads
This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.

A helper function, findEltLoadSrc, recurses to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.

Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.

Differential Revision: https://reviews.llvm.org/D64551

llvm-svn: 366441
2019-07-18 14:33:25 +00:00
Sanjay Patel e654785912 [x86] try harder to form LEA from ADD to avoid flag conflicts (PR40483)
LEA doesn't affect flags, so use it more liberally to replace an ADD when
we know that the ADD operands affect flags.

In the motivating example from PR40483:
https://bugs.llvm.org/show_bug.cgi?id=40483
...this lets us avoid duplicating a math op just to avoid flag conflict.

As mentioned in the TODO comments, this heuristic can be extended to
fire more often if that leads to more improvements.

Differential Revision: https://reviews.llvm.org/D64707

llvm-svn: 366431
2019-07-18 12:48:01 +00:00
Craig Topper 8da0402210 [X86] Disable combineConcatVectors for vXi1 vectors.
I'm not convinced the code this calls is properly vetted for
vXi1 vectors. Experimental vector widening legalization testing
for D55251 is now hitting an assertion failure inside
EltsFromConsecutiveLoads. This is occurring from a v2i1 load
having a store size different than its VT size. Hopefully
this commit will keep such issues from happening.

llvm-svn: 366405
2019-07-18 06:18:06 +00:00
Craig Topper 61fff7a337 [X86] Make sure we mark 128/256 MLOAD as Legal with VLX when min-legal-vector-width=256 is in effect.
This started triggering an assertion after r364718 when we made
these Custom under AVX2.

llvm-svn: 366382
2019-07-17 22:26:00 +00:00
Sanjay Patel d746a210e1 [x86] use more phadd for reductions
This is part of what is requested by PR42023:
https://bugs.llvm.org/show_bug.cgi?id=42023

There's an extension needed for FP add, but exactly how we would specify
that using flags is not clear to me, so I left that as a TODO.
We're still missing patterns for partial reductions when the input vector
is 256-bit or 512-bit, but I think that's a failure of vector narrowing.
If we can reduce the widths, then this matching should work on those tests.

Differential Revision: https://reviews.llvm.org/D64760

llvm-svn: 366268
2019-07-16 21:30:41 +00:00
Craig Topper c0b2ed664b [X86] In combineStore, don't convert v2f32 load/store pairs to f64 loads/stores.
Type legalization can take care of this. This gives DAG combine
a little more time with the original types.

llvm-svn: 366182
2019-07-16 05:52:27 +00:00
Rui Ueyama 49a3ad21d6 Fix parameter name comments using clang-tidy. NFC.
This patch applies clang-tidy's bugprone-argument-comment tool
to LLVM, clang and lld source trees. Here is how I created this
patch:

$ git clone https://github.com/llvm/llvm-project.git
$ cd llvm-project
$ mkdir build
$ cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug \
    -DLLVM_ENABLE_PROJECTS='clang;lld;clang-tools-extra' \
    -DCMAKE_EXPORT_COMPILE_COMMANDS=On -DLLVM_ENABLE_LLD=On \
    -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ../llvm
$ ninja
$ parallel clang-tidy -checks='-*,bugprone-argument-comment' \
    -config='{CheckOptions: [{key: StrictMode, value: 1}]}' -fix \
    ::: ../llvm/lib/**/*.{cpp,h} ../clang/lib/**/*.{cpp,h} ../lld/**/*.{cpp,h}

llvm-svn: 366177
2019-07-16 04:46:31 +00:00
Craig Topper 51193871da [X86] Teach convertToThreeAddress to handle SUB with immediate
We mostly avoid sub with immediate but there are a couple cases that can create them. One is the add 128, %rax -> sub -128, %rax trick in isel. The other is when a SUB immediate gets created for a compare where both the flags and the subtract value is used. If we are unable to linearize the SelectionDAG to satisfy the flag user and the sub result user from the same instruction, we will clone the sub immediate for the two uses. The one that produces flags will eventually become a compare. The other will have its flag output dead, and could then be considered for LEA creation.

I added additional test cases to add.ll to show the the sub -128 trick gets converted to LEA and a case where we don't need to convert it.

This showed up in the current codegen for PR42571.

Differential Revision: https://reviews.llvm.org/D64574

llvm-svn: 366151
2019-07-15 23:07:56 +00:00
Sanjay Patel eb99165b97 [x86] try to keep FP casted+truncated+extracted vector element out of GPRs
inttofp (trunc (extelt X, 0)) --> inttofp (extelt (bitcast X), 0)

We have pseudo-vectorization of scalar int to FP casts, so this tries to
make that more likely by replacing a truncate with a bitcast. I didn't see
any test diffs starting from 'uitofp', so I left that as a TODO. We can't
only match the shorter trunc+extract pattern because there's an opposing
transform somewhere, so we infinite loop. Waiting to try this during
lowering is another possibility.

A motivating case is shown in PR39975 and included in the test diffs here:
https://bugs.llvm.org/show_bug.cgi?id=39975

Differential Revision: https://reviews.llvm.org/D64710

llvm-svn: 366098
2019-07-15 18:17:23 +00:00
Craig Topper 81971b2b79 [X86] Return UNDEF from LowerScalarImmediateShift when the shift amount is out of range.
I think we only turn out of range shiftss to undef when
all elements are out of range or the shift amount is a splat out
of range. I'm not sure which, I didn't check.

During lowering we can split a shift where some elements
are out of range into multiple shifts. This can create a
new shift with a splat shift amount that is out of range.

This patch returns undef for this case.

Fixes PR42615.

Differential Revision: https://reviews.llvm.org/D64699

llvm-svn: 366096
2019-07-15 17:56:57 +00:00
Simon Pilgrim 60fb5e97a0 [X86] isTargetShuffleEquivalent - assert the expected mask is correctly formed. NFCI.
While we don't make any assumptions about the actual mask, assert that the expected mask only contains valid mask element values.

llvm-svn: 366066
2019-07-15 14:29:14 +00:00
Craig Topper 635d103e0b [X86] Separate the memory size of vzext_load/vextract_store from the element size of the result type. Use them improve the codegen of v2f32 loads/stores with sse1 only.
Summary:
SSE1 only supports v4f32. But does have instructions like movlps/movhps that load/store 64-bits of memory.

This patch breaks the connection between the node VT of the vzext_load/vextract_store patterns and the memory VT. Enabling a v4f32 node with a 64-bit memory VT. I've used i64 as the memory VT here. I've written the PatFrag predicate to just check the store size not the specific VT. I think the VT will only matter for CSE purposes. We could use v2f32, but if we want to start using these operations in more places a simple integer type might make the most sense.

I'd like to maybe use this same thing for SSE2 and later as well, but that will need more work to be supported by EltsFromConsecutiveLoads to avoid regressing lit tests. I'd maybe also like to combine bitcasts with these load/stores nodes now that the types are disconnected. And I'd also like to consider canonicalizing (scalar_to_vector + load) to vzext_load.

If you want I can split the mechanical tablegen stuff where I added the 32/64 off from the sse1 change.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64528

llvm-svn: 366034
2019-07-15 02:02:31 +00:00
Craig Topper 9450b0084a [X86] Remove offset of 8 from the call to FuseInst for UNPCKLPDrr folding added in r365287.
This was copy/pasted from above and I forgot to change it. We just
need the default offset of 0 here.

Fixes PR42616.

llvm-svn: 366011
2019-07-14 04:13:33 +00:00
Sanjay Patel 2097f75eab [x86] simplify cmov with same true/false operands
llvm-svn: 365998
2019-07-13 12:04:52 +00:00
Craig Topper b828f0b90a [X86] Use MachineInstr::findRegisterDefOperand to simplify some code in optimizeCompareInstr. NFCI
llvm-svn: 365946
2019-07-12 19:26:35 +00:00
Craig Topper 98f931639b [X86] Add NEG to isUseDefConvertible.
We can use the C flag from NEG to detect that the input was zero.

Really we could probably use the Z flag too. But C matches what
we'd do for usubo 0, X.

Haven't found a test case for this due to the usubo formation
in CGP. But I verified if I comment out the CGP code this
transformation catches some of the same cases.

llvm-svn: 365929
2019-07-12 17:52:17 +00:00
Djordje Todorovic 0739ccd3b5 Revert "[DwarfDebug] Dump call site debug info"
A build failure was found on the SystemZ platform.

This reverts commit 9e7e73578e54cd22b3c7af4b54274d743b6607cc.

llvm-svn: 365886
2019-07-12 09:45:12 +00:00
Sanjay Patel 5cc7c9ab93 [X86] Merge negated ISD::SUB nodes into X86ISD::SUB equivalent (PR40483)
Follow up to D58597, where it was noted that the commuted ISD::SUB variant
was having problems with lack of combines.

See also D63958 where we untangled setcc/sub pairs.

Differential Revision: https://reviews.llvm.org/D58875

llvm-svn: 365791
2019-07-11 15:56:33 +00:00
Fangrui Song f9ca13cb5f [X86] -fno-plt: use GOT __tls_get_addr only if GOTPCRELX is enabled
Summary:
As of binutils 2.32, ld has a bogus TLS relaxation error when the GD/LD
code sequence using R_X86_64_GOTPCREL (instead of R_X86_64_GOTPCRELX) is
attempted to be relaxed to IE/LE (binutils PR24784). gold and lld are good.

In gcc/config/i386/i386.md, there is a configure-time check of as/ld
support and the GOT relaxation will not be used if as/ld doesn't support
it:

    if (flag_plt || !HAVE_AS_IX86_TLS_GET_ADDR_GOT)
      return "call\t%P2";
    return "call\t{*%p2@GOT(%1)|[DWORD PTR %p2@GOT[%1]]}";

In clang, -DENABLE_X86_RELAX_RELOCATIONS=OFF is the default. The ld.bfd
bogus error can be reproduced with:

    thread_local int a;
    int main() { return a; }

clang -fno-plt -fpic a.cc -fuse-ld=bfd

GOTPCRELX gained relative good support in 2016, which is considered
relatively new.  It is even difficult to conditionally default to
-DENABLE_X86_RELAX_RELOCATIONS=ON due to cross compilation reasons. So
work around the ld.bfd bug by only using GOT when GOTPCRELX is enabled.

Reviewers: dalias, hjl.tools, nikic, rnk

Reviewed By: nikic

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64304

llvm-svn: 365752
2019-07-11 10:10:09 +00:00
Craig Topper 88729e3dec [X86] Don't convert 8 or 16 bit ADDs to LEAs on Atom in FixupLEAPass.
We use the functions that convert to three address to do the
conversion, but changing an 8 or 16 bit will cause it to create
a virtual register. This can't be done after register allocation
where this pass runs.

I've switched the pass completely to a white list of instructions
that can be converted to LEA instead of a blacklist that was
incorrect. This will avoid surprises if we enhance the three
address conversion function to include additional instructions
in the future.

Fixes PR42565.

llvm-svn: 365720
2019-07-11 01:01:39 +00:00
Craig Topper 1c327c7e0a [X86] Add patterns with and_flag_nocf for BLSI and TBM instructions.
Fixes similar issues to r352306.

llvm-svn: 365705
2019-07-10 22:44:32 +00:00
Craig Topper d916f23b83 [X86] Add BLSR and BLSMSK to isUseDefConvertible.
Unfortunately subo formation in CGP prevents obvious ways of
testing this.

But we already have BLSI in here and the flag behavior is
well understood.

Might become more useful if we improve PR42571.

llvm-svn: 365702
2019-07-10 22:14:39 +00:00
Craig Topper 021ba49b31 [X86] Remove unused variable. NFC
llvm-svn: 365697
2019-07-10 21:01:34 +00:00
Simon Pilgrim 5dd2af5248 [X86] EltsFromConsecutiveLoads - clean up element size calcs. NFCI.
Determine the element/load size calculations earlier and assert that they are whole bytes in size.

llvm-svn: 365674
2019-07-10 17:49:27 +00:00
Simon Pilgrim 093f4aa72f [X86] EltsFromConsecutiveLoads - remove duplicate check for element size. NFCI.
We've already checked that each element is the correct contributory size for VT when we inspect the elements for Undef/Zero/Load.

llvm-svn: 365656
2019-07-10 16:22:31 +00:00
Simon Pilgrim 893448a3e4 [X86] EltsFromConsecutiveLoads - ensure element reg/store sizes are the same size. NFCI.
This renames the type so it doesn't sound like its based off the load size - as we're moving towards supporting combining loads of different sizes.

llvm-svn: 365655
2019-07-10 16:14:26 +00:00
Simon Pilgrim 0a9479ef39 [X86] EltsFromConsecutiveLoads - cleanup Zero/Undef/Load element collection. NFCI.
llvm-svn: 365628
2019-07-10 13:28:13 +00:00
Simon Pilgrim ef1aac3191 [X86] EltsFromConsecutiveLoads - LDBase is non-null. NFCI.
Don't bother checking for LDBase != null - it should be (and we assert that it is).

llvm-svn: 365622
2019-07-10 12:22:59 +00:00
Simon Pilgrim c972193583 [X86] EltsFromConsecutiveLoads - store Loads on a per-element basis. NFCI.
Cache the LoadSDNode nodes so we can easily map to/from the element index instead of packing them together - this will be useful for future patches for PR16739 etc.

llvm-svn: 365620
2019-07-10 11:26:57 +00:00
Simon Pilgrim 6a58583951 [X86][SSE] EltsFromConsecutiveLoads - add basic dereferenceable support
This patch checks to see if the vector element loads are based off a dereferenceable pointer that covers the entire vector width, in which case we don't need to have element loads at both extremes of the vector width - just the start (base pointer) of it.

Another step towards partial vector loads......

Differential Revision: https://reviews.llvm.org/D64205

llvm-svn: 365614
2019-07-10 10:46:36 +00:00
Craig Topper 50f70de557 [X86] Limit getTargetConstantFromNode to only work on NormalLoads not extending loads.
This seems to fix a failure reported by Jordan Rupprecht, but we
don't have a reduced test case yet.

llvm-svn: 365589
2019-07-10 00:40:01 +00:00
Craig Topper 1ae60797cd [X86] Don't form extloads in combineExtInVec unless the load extension is legal.
This should prevent doing this on pre-sse4.1 targets or for 256
bit vectors without avx2.

I don't know of a failure from this. Op legalization will probably
take care of, but seemed better to be safe.

llvm-svn: 365577
2019-07-09 23:05:54 +00:00
Craig Topper 84a1f07363 [X86][AMDGPU][DAGCombiner] Move call to allowsMemoryAccess into isLoadBitCastBeneficial/isStoreBitCastBeneficial to allow X86 to bypass it
Basically the problem is that X86 doesn't set the Fast flag from
allowsMemoryAccess on certain CPUs due to slow unaligned memory
subtarget features. This prevents bitcasts from being folded into
loads and stores. But all vector loads and stores of the same width
are the same cost on X86.

This patch merges the allowsMemoryAccess call into isLoadBitCastBeneficial to allow X86 to skip it.

Differential Revision: https://reviews.llvm.org/D64295

llvm-svn: 365549
2019-07-09 19:55:28 +00:00
Simon Pilgrim 294f37561a [X86] LowerToHorizontalOp - use count_if to count non-UNDEF ops. NFCI.
llvm-svn: 365540
2019-07-09 19:19:17 +00:00
Djordje Todorovic 01eaae6dd1 [DwarfDebug] Dump call site debug info
Dump the DWARF information about call sites and call site parameters into
debug info sections.

The patch also provides an interface for the interpretation of instructions
that could load values of a call site parameters in order to generate DWARF
about the call site parameters.

([13/13] Introduce the debug entry values.)

Co-authored-by: Ananth Sowda <asowda@cisco.com>
Co-authored-by: Nikola Prica <nikola.prica@rt-rk.com>
Co-authored-by: Ivan Baev <ibaev@cisco.com>

Differential Revision: https://reviews.llvm.org/D60716

llvm-svn: 365467
2019-07-09 11:33:56 +00:00
Reid Kleckner 2f07c2e9d9 Standardize on MSVC behavior for triples with no environment
Summary:
This makes it so that IR files using triples without an environment work
out of the box, without normalizing them.

Typically, the MSVC behavior is more desirable. For example, it tends to
enable things like constant merging, use of associative comdats, etc.

Addresses PR42491

Reviewers: compnerd

Subscribers: hiraditya, dexonsmith, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D64109

llvm-svn: 365387
2019-07-08 21:05:20 +00:00
Simon Pilgrim e1a9b49d6b [X86] ISD::INSERT_SUBVECTOR - use uint64_t index. NFCI.
Keep the uint64_t type from getConstantOperandVal to stop truncation/extension overflow warnings in MSVC in subvector index math.

llvm-svn: 365328
2019-07-08 14:52:56 +00:00
Craig Topper 1deca50ab1 [X86] Allow execution domain fixing to turn SHUFPD into SHUFPS.
This can help with code size on SSE targets where SHUFPD requires
a 0x66 prefix and SHUFPS doesn't.

llvm-svn: 365293
2019-07-08 06:52:49 +00:00
Craig Topper d8261f0288 [X86] Make movsd commutable to shufpd with a 0x02 immediate on pre-SSE4.1 targets.
This can help avoid a copy or enable load folding.

On SSE4.1 targets we can commute it to blendi instead.

I had to make shufpd with a 0x02 immediate commutable as well
since we expect commuting to be reversible.

llvm-svn: 365292
2019-07-08 06:52:43 +00:00
Craig Topper 46f2b583a2 [X86] Add MOVSDrr->MOVLPDrm entry to load folding table. Add custom handling to turn UNPCKLPDrr->MOVHPDrm when load is under aligned.
If the load is aligned we can turn UNPCKLPDrr into UNPCKLPDrm.

llvm-svn: 365287
2019-07-08 02:10:20 +00:00
Craig Topper ac744d5a86 [X86] Make sure load isn't volatile before shrinking it in MOVDDUP isel patterns.
llvm-svn: 365275
2019-07-07 05:33:20 +00:00
Simon Pilgrim a7145c45a7 [X86] SimplifyDemandedVectorEltsForTargetNode - fix shadow variable warning. NFCI.
Fixes cppcheck warning.

llvm-svn: 365271
2019-07-06 18:46:09 +00:00
Simon Pilgrim 01f1bad618 [X86] LowerBuildVectorv16i8 - pull out repeated getOperand() call. NFCI.
llvm-svn: 365270
2019-07-06 18:33:29 +00:00
Craig Topper 317d6093df [X86] Remove patterns from MOVLPSmr and MOVHPSmr instructions.
These patterns are the same as the MOVLPDmr and MOVHPDmr patterns,
but with a bitcast at the end. We can just select the PD instruction
and let execution domain fixing switch to PS.

llvm-svn: 365267
2019-07-06 17:59:51 +00:00
Craig Topper 913105ca42 [X86] Add patterns to select MOVLPDrm from MOVSD+load and MOVHPD from UNPCKL+load.
These narrow the load so we can only do it if the load isn't
volatile.

There also tests in vector-shuffle-128-v4.ll that this should
support, but we don't seem to fold bitcast+load on pre-sse4.2
targets due to the slow unaligned mem 16 flag.

llvm-svn: 365266
2019-07-06 17:59:45 +00:00
Craig Topper d22b2d01ca [X86] Correct the size check in foldMemoryOperandCustom.
The Size either needs to be 0 meaning we aren't folding
a stack reload. Or the stack slot needs to be at least
16 bytes. I've also added a paranoia check ensure the
RCSize is at leat 16 bytes as well. This avoids any
FR32/FR64 surprises, but I think we already filtered
those earlier.

All of our test case have Size as either 0 or 16 and
RCSize == 16. So the Size <= 16 check worked for those
cases.

llvm-svn: 365234
2019-07-05 18:54:00 +00:00
Craig Topper 6e6d229e5e [X86] Update SSE1 MOVLPSrm and MOVHPSrm isel patterns to ensure loads are non-volatile before folding.
These patterns use 128-bit loads, but the instructions only load
64-bits. We shouldn't narrow the load if its volatile.

Fixes another variant of PR42079

llvm-svn: 365225
2019-07-05 17:31:29 +00:00
Craig Topper 8a93952a5c [X86] Remove unnecessary isel pattern for MOVLPSmr.
This was identical to a pattern for MOVPQI2QImr with a bitcast
as an input. But we should be able to turn MOVPQI2QImr into
MOVLPSmr in the execution domain fixup pass so we shouldn't
need this.

llvm-svn: 365224
2019-07-05 17:31:25 +00:00
Robert Lougher 9dcfbbae76 This reverts r365061 and r365062 (test update)
Revision r365061 changed a skip of debug instructions for a skip
of meta instructions. This is not safe, as IMPLICIT_DEF is classed
as a meta instruction.

llvm-svn: 365202
2019-07-05 12:42:06 +00:00
Robert Lougher 2478b62098 Revert r365198 as this accidentally commited something that
should not have been added.

llvm-svn: 365199
2019-07-05 12:30:45 +00:00
Robert Lougher 3bea2b15f5 This reverts r365061 and r365062 (test update)
Revision r365061 changed a skip of debug instructions for a skip
of meta instructions. This is not safe, as IMPLICIT_DEF is classed
as a meta instruction.

llvm-svn: 365198
2019-07-05 12:20:21 +00:00
Simon Pilgrim 8b25d9bf01 [X86][SSE] LowerINSERT_VECTOR_ELT - early out for out of range indices
Fixes OSS-Fuzz #15662

llvm-svn: 365180
2019-07-05 10:34:53 +00:00
Craig Topper 171732aeb3 [X86] Add custom isel to select ADD/SUB/OR/XOR/AND to their non-immediate forms under optsize when the immediate has additional users.
Summary:
We attempt to prevent folding immediates with multiple users under optsize. But we only do this from store nodes and X86ISD::ADD/SUB/XOR/OR/AND patterns. We don't do it for ISD::ADD/SUB/XOR/OR/AND even though we count them as users when deciding whether to fold into other nodes. This leads to situations where we block folding to a compare for example, but still fold into an AND or OR as seen in PR27202.

Unfortunately touching the isel patterns in tablegen for the ISD::ADD/SUB/XOR/OR/AND opcodes will cause the patterns to be unusable for fast isel. And we don't have a way to make a fast isel only pattern.

To workaround this, this patch adds custom isel in front of the isel table that will select the non-immediate forms if the immediate has additional users. This may create some issues for ANDN and NOT matching. And there's room for improvement with unsigned 32 immediates on 64-bit AND.

This patch needs more thorough test cases, but I wanted to get feedback on the direction. Please send me any other test cases you've seen in the wild.

I think we probably have the same issue with the immediate matching when we fold RMW from X86ISD::ADD/SUB/XOR/OR/AND. And our TEST immedaite shrinking logic. Our cost modeling for immediates that can fit in a sign extended 8-bit immediate on a 16/32/64 bit operation is completely wrong.

I also wonder if we should update the ConstantHoisting cost model and block folding for "opaque" constants. But of course constants can still be created by DAG combine and lowering optimizations.

Fixes PR27202

Reviewers: spatel, RKSimon, andreadb

Reviewed By: RKSimon

Subscribers: jsji, hiraditya, jdoerfert, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59909

llvm-svn: 365163
2019-07-04 22:53:57 +00:00
Simon Pilgrim fde766de4b [X86][AVX1] Combine concat_vectors(pshufd(x,c),pshufd(y,c)) -> vpermilps(concat_vectors(x,y),c)
Bitcast v4i32 to v8f32 and back again - it might be worth adding isel patterns for X86PShufd v8i32 on AVX1 targets like we did for X86Blendi to avoid the bitcasts?

llvm-svn: 365125
2019-07-04 10:17:10 +00:00
Craig Topper 163b8bb3f5 [X86] Use pointer sized indices instead of i32 for EXTRACT_VECTOR_ELT and INSERT_VECTOR_ELT in a couple places.
Most places already did this.

llvm-svn: 365109
2019-07-04 06:21:54 +00:00
Robert Lougher 720baf0416 [X86] Avoid SFB - Skip meta instructions
This patch generalizes the fix in D61680 to ignore all meta instructions,
not just debug info.

Patch by Chris Dawson.

Differential Revision: https://reviews.llvm.org/D62605

llvm-svn: 365061
2019-07-03 17:43:55 +00:00
Francis Visoiu Mistrih 83bbe2f418 [CodeGen] Make branch funnels pass the machine verifier
We previously marked all the tests with branch funnels as
`-verify-machineinstrs=0`.

This is an attempt to fix it.

1) `ICALL_BRANCH_FUNNEL` has no defs. Mark it as `let OutOperandList =
(outs)`

2) After that we hit an assert: ``` Assertion failed: (Op.getValueType()
!= MVT::Other && Op.getValueType() != MVT::Glue && "Chain and glue
operands should occur at end of operand list!"), function AddOperand,
file
/Users/francisvm/llvm/llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp,
line 461.  ```

The chain operand was added at the beginning of the operand list. Move
that to the end.

3) After that we hit another verifier issue in the pseudo expansion
where the registers used in the cmps and jmps are not added to the
livein lists. Add the `EFLAGS` to all the new MBBs that we create.

PR39436

Differential Review: https://reviews.llvm.org/D54155

llvm-svn: 365058
2019-07-03 17:16:45 +00:00
Simon Pilgrim 26812c7675 [X86] ComputeNumSignBitsForTargetNode - add target shuffle support.
llvm-svn: 365057
2019-07-03 17:06:59 +00:00
Simon Pilgrim 783dbe402f [X86][AVX] combineX86ShufflesRecursively - peek through extract_subvector
If we have more then 2 shuffle ops to combine, try to use combineX86ShuffleChainWithExtract to see if some are from the same super vector.

llvm-svn: 365050
2019-07-03 15:46:08 +00:00
Simon Pilgrim 868d0b7fd9 [X86][AVX] Combine vpermi(bitcast(x)) -> bitcast(vpermi(x))
iff the number of elements doesn't change.

This gets around an issue with combineX86ShuffleChain not being able to hint which domain is preferred for shuffles that can be done with either.

Fixes regression introduced in rL365041

llvm-svn: 365044
2019-07-03 14:34:16 +00:00
Simon Pilgrim 0c230209fe [X86][AVX] combineX86ShuffleChainWithExtract - add number of non-zero extract_subvectors to the combine depth
This better accounts for the cost/benefit of removing extract_subvectors from the shuffle and will be more useful in future patches.

The vpermq predicate regression will be fixed shortly.

llvm-svn: 365041
2019-07-03 14:17:21 +00:00
Simon Pilgrim 8c099cbe7c [X86][SSE] lowerUINT_TO_FP_v2i32 - explicitly cast half word to double
Fixes MSVC analyzer extension->double warning.

llvm-svn: 365027
2019-07-03 11:23:27 +00:00
Simon Pilgrim 8df90b843d [X86][SSE] LowerINSERT_VECTOR_ELT - ensure insertion index correctness. NFCI.
Assert that the insertion index is in range and use uint64_t for the index to fix MSVC/cppcheck truncation warning.

llvm-svn: 365025
2019-07-03 10:59:52 +00:00
Simon Pilgrim 8853bd9592 [X86][SSE] LowerScalarImmediateShift - ensure shift amount correctness. NFCI.
Assert that the shift amount is in range and create vXi8 shift masks in a way that doesn't cause MSVC/cppcheck shift result is truncated then extended warnings.

llvm-svn: 365024
2019-07-03 10:47:33 +00:00
Simon Pilgrim 64e3a51534 Fix uninitialized variable warnings. NFCI.
Both MSVC and cppcheck don't like the fact that the variables are initialized via references.

llvm-svn: 365018
2019-07-03 10:22:08 +00:00
Simon Pilgrim 7b7b9b78a2 [X86] LowerFunnelShift - use modulo constant shift amount.
This avoids the use of getZExtValue and uses the modulo shift amount which is whats expected for funnel shifts anyhow. 

llvm-svn: 365016
2019-07-03 10:04:16 +00:00
Craig Topper b770d2c9d4 [X86] Add a DAG combine for turning *_extend_vector_inreg+load into an appropriate extload if the load isn't volatile.
Remove the corresponding isel patterns that did the same thing without checking for volatile.

This fixes another variation of PR42079

llvm-svn: 364977
2019-07-02 23:20:03 +00:00
Simon Pilgrim 5613874947 [X86] getTargetConstantBitsFromNode - remove unnecessary getZExtValue() (PR42486)
Don't use APInt::getZExtValue() if you can avoid it - eventually someone will call it with i128 or something that doesn't fit into 64-bits.

In this case it was completely superfluous as we'd moved the rest of the code to always use APInt.

Fixes the <1 x i128> addition bug in PR42486

llvm-svn: 364953
2019-07-02 18:20:38 +00:00
Craig Topper cffbaa93b7 [X86] Add patterns to select (scalar_to_vector (loadf32)) as (V)MOVSSrm instead of COPY_TO_REGCLASS + (V)MOVSSrm_alt.
Similar for (V)MOVSD. Ultimately, I'd like to see about folding
scalar_to_vector+load to vzload. Which would select as (V)MOVSSrm
so this is closer to that.

llvm-svn: 364948
2019-07-02 17:51:02 +00:00
Simon Pilgrim 9304168103 [X86][AVX] combineX86ShuffleChain - pull out CombineShuffleWithExtract lambda. NFCI.
Pull out CombineShuffleWithExtract lambda to new combineX86ShuffleChainWithExtract wrapper and refactored it to handle more than 2 shuffle inputs - this will allow combineX86ShufflesRecursively to call this in a future patch.

llvm-svn: 364924
2019-07-02 13:30:04 +00:00
Simon Pilgrim d609ebb779 [X86] resolveTargetShuffleInputsAndMask - add repeated input handling.
We were relying on combineX86ShufflesRecursively to handle this - this patch gets it done earlier which should make it easier for other code to use resolveTargetShuffleInputsAndMask.

llvm-svn: 364906
2019-07-02 10:53:17 +00:00
Craig Topper 2d306b2d57 [X86] Add PreprocessISelDAG support for turning ISD::FP_TO_SINT/UINT into X86ISD::CVTTP2SI/CVTTP2UI and to reduce the number of isel patterns.
llvm-svn: 364887
2019-07-02 05:53:37 +00:00
Craig Topper 3f722d40c5 [X86] Use v4i32 vzloads instead of v2i64 for vpmovzx/vpmovsx patterns where only 32-bits are loaded.
v2i64 vzload defines a 64-bit memory access. It doesn't look like
we have any coverage for this either way.

Also remove some vzload usages where the instruction loads only
16-bits.

llvm-svn: 364851
2019-07-01 21:25:11 +00:00
Craig Topper 328b24150e [X86] Remove several bad load folding isel patterns for VPMOVZX/VPMOVSX.
These patterns all matched a v2i64 vzload which only loads 64-bits
to instructions that load a full 128-bits.

llvm-svn: 364847
2019-07-01 21:23:38 +00:00
Craig Topper 5e7815b695 [X86] Correct v4f32->v2i64 cvt(t)ps2(u)qq memory isel patterns
These instructions only read 64-bits of memory so we shouldn't
allow a full vector width load to be pattern matched in case it
is marked volatile.

Instead allow vzload or scalar_to_vector+load.

Also add a DAG combine to turn full vector loads into vzload when
used by one of these instructions if the load isn't volatile.

This fixes another case for PR42079

llvm-svn: 364838
2019-07-01 19:01:37 +00:00
Robert Lougher e20030f612 [X86] Avoid SFB - Fix inconsistent codegen with/without debug info(2)
The function findPotentialBlockers may consider debug info instructions as
potential blockers and may stop searching for a store-load pair prematurely.

This patch corrects this and tests the cases where the store is separated
from the load by more than InspectionLimit debug instructions.

Patch by Chris Dawson.

Differential Revision: https://reviews.llvm.org/D62408

llvm-svn: 364829
2019-07-01 18:28:21 +00:00
Simon Pilgrim e3e38cce4a [X86] Add widenSubVector to size in bits helper. NFCI.
We can already widenSubVector to a specific type (of the same scalar type) - this variant just specifies the target vector size.

This will be useful when CombineShuffleWithExtract relaxes the need to have the same scalar type for all shuffle operand subvector sources.

llvm-svn: 364803
2019-07-01 16:20:47 +00:00
Simon Pilgrim 172fe5dd19 [X86] CombineShuffleWithExtract - updated description comments. NFCI.
CombineShuffleWithExtract no longer requires that both shuffle ops are extract_subvectors, from the same type or from the same size.

llvm-svn: 364745
2019-07-01 11:33:45 +00:00
Craig Topper 29fff0797b [X86] Improve the type checking fast-isel handling of vector bitcasts.
We had a bunch of vector size legality checks for the source type
based on feature flags, but we didn't check the destination type at
all beyond ensuring that it was a "simple" type. But this allowed
the destination to be i128 which isn't legal.

This commit changes the code to use TLI's isTypeLegal logic in
place of the all the subtarget checks. Then additionally checks
that the source and dest are vectors.

Fixes 42452

llvm-svn: 364729
2019-07-01 07:09:34 +00:00
Craig Topper 4ca81a9b99 [X86] Add a DAG combine to replace vector loads feeding a v4i32->v2f64 CVTSI2FP/CVTUI2FP node with a vzload.
But only when the load isn't volatile.

This improves load folding during isel where we only have vzload
and scalar_to_vector+load patterns. We can't have full vector load
isel patterns for the same volatile load issue.

Also add some missing masked cvtsi2fp/cvtui2fp with vzload patterns.

llvm-svn: 364728
2019-07-01 07:09:31 +00:00
Craig Topper d1728f8987 [X86] Add MOVHPDrm/MOVLPDrm patterns that use VZEXT_LOAD.
We already had patterns that used scalar_to_vector+load. But we can
also have a vzload.

Found while investigating combining scalar_to_vector+load to vzload.

llvm-svn: 364726
2019-07-01 07:09:23 +00:00
Fangrui Song 78ee2fbf98 Cleanup: llvm::bsearch -> llvm::partition_point after r364719
llvm-svn: 364720
2019-06-30 11:19:56 +00:00
Craig Topper 725a8a5dc4 [X86] Custom lower AVX masked loads to masked load and vselect instead of selecting a maskmov+vblend during isel.
AVX masked loads only support 0 as the value for masked off elements.
So we need an extra blend to support other values. Previously we
expanded the masked load to two instructions with isel patterns.
With this patch we now insert the vselect during lowering and it
will be separately selected as a blend.

llvm-svn: 364718
2019-06-30 06:46:37 +00:00
Sanjay Patel 9126c84f50 [x86] remove stale comment about cmov; NFC
The cmov node used to sometimes return a glue result (and that's what
'flag' meant in this context), but that was removed with D38664.

llvm-svn: 364687
2019-06-28 21:45:55 +00:00
Simon Pilgrim 978a08c885 [X86] CombineShuffleWithExtract - recurse through EXTRACT_SUBVECTOR chain
llvm-svn: 364667
2019-06-28 17:57:32 +00:00
Simon Pilgrim a54e1a0f01 [X86] CombineShuffleWithExtract - only require 1 source to be EXTRACT_SUBVECTOR
We were requiring that both shuffle operands were EXTRACT_SUBVECTORs, but we can relax this to only require one of them to be.

Also, we shouldn't bother attempting this if both operands are from the lowest subvector (or not EXTRACT_SUBVECTOR at all).

llvm-svn: 364644
2019-06-28 12:24:49 +00:00
Craig Topper cbb88a5169 [X86] Connect the output chain properly when combining vzext_movl+load into vzext_load.
llvm-svn: 364625
2019-06-28 06:58:50 +00:00
Craig Topper e832adea0f [X86] Remove some duplicate patterns that already exist as part of their instruction definition. NFC
llvm-svn: 364623
2019-06-28 05:03:47 +00:00
Sanjay Patel a95ca2b5ff [x86] prevent crashing from select narrowing with AVX512
llvm-svn: 364585
2019-06-27 20:16:58 +00:00
Simon Pilgrim 1fd1c60979 [X86] combineX86ShufflesRecursively - merge shuffles with more than 2 inputs
We already had the infrastructure for this, but were waiting for the fix for a number of regressions which were handled by the recent shuffle(extract_subvector(),extract_subvector()) -> extract_subvector(shuffle()) shuffle combines

llvm-svn: 364569
2019-06-27 17:30:51 +00:00
Simon Pilgrim e9a2f4fe2c Use getConstantOperandAPInt instead of getConstantOperandVal for comparisons.
getConstantOperandAPInt avoids any large integer issues - these are unlikely but the fuzzers do like to mess around.....

llvm-svn: 364564
2019-06-27 16:46:00 +00:00
Simon Pilgrim 74343eba37 [X86] getTargetVShiftByConstNode - reduce variable scope. NFCI.
Fixes cppcheck warning.

llvm-svn: 364561
2019-06-27 16:33:44 +00:00
Djordje Todorovic 71d3869f60 [Backend] Keep call site info valid through the backend
Handle call instruction replacements and deletions in order to preserve
valid state of the call site info of the MachineFunction.

NOTE: If the call site info is enabled for a new target, the assertion from
the MachineFunction::DeleteMachineInstr() should help to locate places
where the updateCallSiteInfo() should be called in order to preserve valid
state of the call site info.

([10/13] Introduce the debug entry values.)

Co-authored-by: Ananth Sowda <asowda@cisco.com>
Co-authored-by: Nikola Prica <nikola.prica@rt-rk.com>
Co-authored-by: Ivan Baev <ibaev@cisco.com>

Differential Revision: https://reviews.llvm.org/D61062

llvm-svn: 364536
2019-06-27 13:10:29 +00:00
Simon Pilgrim c5cff5d3d1 [X86] getFauxShuffle - add DemandedElts as a filter
This is currently benign but will be used in the future based on the elements referenced by the parent shuffle(s).

llvm-svn: 364530
2019-06-27 12:35:52 +00:00
Simon Pilgrim 90e121fbe6 [X86][AVX] SimplifyDemandedVectorElts - combine PERMPD(x) -> EXTRACTF128(X)
If we only use the bottom lane, see if we can simplify this to extract_subvector - which is always at least as quick as PERMPD/PERMQ.

llvm-svn: 364518
2019-06-27 11:16:03 +00:00
Djordje Todorovic 7eeeb5947e [ISEL][X86] Tracking of registers that forward call arguments
While lowering calls, collect info about registers that forward arguments
into following function frame. We store such info into the MachineFunction
of the call. This is used very late when dumping DWARF info about
call site parameters.

([9/13] Introduce the debug entry values.)

Co-authored-by: Ananth Sowda <asowda@cisco.com>
Co-authored-by: Nikola Prica <nikola.prica@rt-rk.com>
Co-authored-by: Ivan Baev <ibaev@cisco.com>

Differential Revision: https://reviews.llvm.org/D60715

llvm-svn: 364516
2019-06-27 10:51:15 +00:00
Diana Picus 43fb5ae50c [GlobalISel] Accept multiple vregs for lowerCall's args
Change the interface of CallLowering::lowerCall to accept several
virtual registers for each argument, instead of just one.  This is a
follow-up to D46018.

CallLowering::lowerReturn was similarly refactored in D49660 and
lowerFormalArguments in D63549.

With this change, we no longer pack the virtual registers generated for
aggregates into one big lump before delegating to the target. Therefore,
the target can decide itself whether it wants to handle them as separate
pieces or use one big register.

ARM and AArch64 have been updated to use the passed in virtual registers
directly, which means we no longer need to generate so many
merge/extract instructions.

NFCI for AMDGPU, Mips and X86.

Differential Revision: https://reviews.llvm.org/D63551

llvm-svn: 364512
2019-06-27 09:18:03 +00:00
Diana Picus 8138996128 [GlobalISel] Accept multiple vregs for lowerCall's result
Change the interface of CallLowering::lowerCall to accept several
virtual registers for the call result, instead of just one.  This is a
follow-up to D46018.

CallLowering::lowerReturn was similarly refactored in D49660 and
lowerFormalArguments in D63549.

With this change, we no longer pack the virtual registers generated for
aggregates into one big lump before delegating to the target. Therefore,
the target can decide itself whether it wants to handle them as separate
pieces or use one big register.

ARM and AArch64 have been updated to use the passed in virtual registers
directly, which means we no longer need to generate so many
merge/extract instructions.

NFCI for AMDGPU, Mips and X86.

Differential Revision: https://reviews.llvm.org/D63550

llvm-svn: 364511
2019-06-27 09:15:53 +00:00
Diana Picus c3dbe23977 [GlobalISel] Accept multiple vregs in lowerFormalArgs
Change the interface of CallLowering::lowerFormalArguments to accept
several virtual registers for each formal argument, instead of just one.
This is a follow-up to D46018.

CallLowering::lowerReturn was similarly refactored in D49660. lowerCall
will be refactored in the same way in follow-up patches.

With this change, we forward the virtual registers generated for
aggregates to CallLowering. Therefore, the target can decide itself
whether it wants to handle them as separate pieces or use one big
register. We also copy the pack/unpackRegs helpers to CallLowering to
facilitate this.

ARM and AArch64 have been updated to use the passed in virtual registers
directly, which means we no longer need to generate so many
merge/extract instructions.

AArch64 seems to have had a bug when lowering e.g. [1 x i8*], which was
put into a s64 instead of a p0. Added a test-case which illustrates the
problem more clearly (it crashes without this patch) and fixed the
existing test-case to expect p0.

AMDGPU has been updated to unpack into the virtual registers for
kernels. I think the other code paths fall back for aggregates, so this
should be NFC.

Mips doesn't support aggregates yet, so it's also NFC.

x86 seems to have code for dealing with aggregates, but I couldn't find
the tests for it, so I just added a fallback to DAGISel if we get more
than one virtual register for an argument.

Differential Revision: https://reviews.llvm.org/D63549

llvm-svn: 364510
2019-06-27 08:54:17 +00:00
Diana Picus 69ce1c1319 [GlobalISel] Allow multiple VRegs in ArgInfo. NFC
Allow CallLowering::ArgInfo to contain more than one virtual register.
This is useful when passes split aggregates into several virtual
registers, but need to also provide information about the original type
to the call lowering. Used in follow-up patches.

Differential Revision: https://reviews.llvm.org/D63548

llvm-svn: 364509
2019-06-27 08:50:53 +00:00
Mikael Holmen 7b81b61368 Silence gcc warning after r364458
Without the fix gcc 7.4.0 complains with

../lib/Target/X86/X86ISelLowering.cpp: In function 'bool getFauxShuffleMask(llvm::SDValue, llvm::SmallVectorImpl<int>&, llvm::SmallVectorImpl<llvm::SDValue>&, llvm::SelectionDAG&)':
../lib/Target/X86/X86ISelLowering.cpp:6690:36: error: enumeral and non-enumeral type in conditional expression [-Werror=extra]
             int Idx = (ZeroMask[j] ? SM_SentinelZero : (i + j + Ofs));
                        ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors

llvm-svn: 364507
2019-06-27 08:16:18 +00:00
Craig Topper 9153501f07 [X86] Remove (vzext_movl (scalar_to_vector (load))) matching code from selectScalarSSELoad.
I think this will be turning into vzext_load during DAG combine.

llvm-svn: 364499
2019-06-27 05:52:00 +00:00
Craig Topper 9ea5a32251 [X86] Teach selectScalarSSELoad to not narrow volatile loads.
llvm-svn: 364498
2019-06-27 05:51:56 +00:00
Craig Topper 3d12971e1c [X86] Rework the logic in LowerBuildVectorv16i8 to make better use of any_extend and break false dependencies. Other improvements
This patch rewrites the loop iteration to only visit every other element starting with element 0. And we work on the "even" element and "next" element at the same time. The "First" logic has been moved to the bottom of the loop and doesn't run on every element. I believe it could create dangling nodes previously since we didn't check if we were going to use SCALAR_TO_VECTOR for the first insertion. I got rid of the "First" variable and just do a null check on V which should be equivalent. We also no longer use undef as the starting V for vectors with no zeroes to avoid false dependencies. This matches v8i16.

I've changed all the extends and OR operations to use MVT::i32 since that's what they'll be promoted to anyway. I've tried to use zero_extend only when necessary and use any_extend otherwise. This resulted in some improvements in tests where we are now able to promote aligned (i32 (extload i8)) to a 32-bit load.

Differential Revision: https://reviews.llvm.org/D63702

llvm-svn: 364469
2019-06-26 20:16:19 +00:00
Craig Topper afa58b6ba1 [X86] Remove isTypePromotionOfi1ZeroUpBits and its helpers.
This was trying to optimize concat_vectors with zero of setcc or
kand instructions. But I think it produced the same code we
produce for a concat_vectors with 0 even it it doesn't come from
one of those operations.

llvm-svn: 364463
2019-06-26 19:45:48 +00:00
Simon Pilgrim dfe079ffbf [X86][SSE] getFauxShuffleMask - handle OR(x,y) where x and y have no overlapping bits
Create a per-byte shuffle mask based on the computeKnownBits from each operand - if for each byte we have a known zero (or both) then it can be safely blended.

Fixes PR41545

llvm-svn: 364458
2019-06-26 18:21:26 +00:00
Simon Pilgrim 435ee9fb1f [X86][SSE] X86TargetLowering::isCommutativeBinOp - add PMULDQ
Allows narrowInsertExtractVectorBinOp to reduce vector size instead of the more restricted SimplifyDemandedVectorEltsForTargetNode

llvm-svn: 364434
2019-06-26 14:58:11 +00:00
Simon Pilgrim 6b687bf681 [X86][SSE] X86TargetLowering::isCommutativeBinOp - add PCMPEQ
Allows narrowInsertExtractVectorBinOp to reduce vector size

llvm-svn: 364432
2019-06-26 14:40:49 +00:00
Simon Pilgrim b13c6f1a9d [X86][SSE] X86TargetLowering::isBinOp - add PCMPGT
Allows narrowInsertExtractVectorBinOp to reduce vector size

llvm-svn: 364431
2019-06-26 14:34:41 +00:00
Simon Pilgrim 24f96a0eee [X86] shouldScalarizeBinop - never scalarize target opcodes.
We have (almost) no target opcodes that have scalar/vector equivalents - for now assume we can't scalarize them (we can add exceptions if we need to).

llvm-svn: 364429
2019-06-26 14:21:29 +00:00
Roman Lebedev 13889145f0 [X86][Codegen] X86DAGToDAGISel::matchBitExtract(): consistently capture lambdas by value
llvm-svn: 364420
2019-06-26 12:19:52 +00:00
Roman Lebedev fbb2e40d5c [X86] X86DAGToDAGISel::matchBitExtract(): pattern c: truncation awareness
Summary:
The one thing of note here is that the 'bitwidth' constant (32/64) was previously pessimistic.
Given `x & (-1 >> (C - z))`, we were taking `C` to be `bitwidth(x)`, but in reality
we want `(-1 >> (C - z))` pattern to mean "low z bits must be all-ones".
And for that, `C` should be `bitwidth(-1 >> (C - z))`, i.e. of the shift operation itself.

Last pattern D does not seem to exhibit any of these truncation issues.
Although it has the opposite problem - if we extract low bits (no shift) from i64,
and then truncate to i32, then we fail to shrink this 64-bit extraction into 32-bit extraction.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62806

llvm-svn: 364419
2019-06-26 12:19:47 +00:00
Roman Lebedev b0ecc1cc6b [X86] X86DAGToDAGISel::matchBitExtract(): pattern b: truncation awareness
Summary:
(Not so) boringly identical to pattern a (D62786)
Not yet sure how do deal with the last pattern c.

Reviewers: RKSimon, craig.topper, spatel

Reviewed By: RKSimon

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62793

llvm-svn: 364418
2019-06-26 12:19:39 +00:00
Roman Lebedev 8b9a03973a [X86] X86DAGToDAGISel::matchBitExtract(): pattern a: truncation awareness
Summary:
Finally tying up loose ends here.

The problem is quite simple:
If we have pattern `(x >> start) &  (1 << nbits) - 1`,
and then truncate the result, that truncation will be propagated upwards,
into the `and`. And that isn't currently handled.

I'm only fixing pattern `a` here,
the same fix will be needed for patterns `b`/`c` too.

I *think* this isn't missing any extra legality checks,
since we only look past truncations. Similary, i don't think
we can get any other truncation there other than i64->i32.

Reviewers: craig.topper, RKSimon, spatel

Reviewed By: craig.topper

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62786

llvm-svn: 364417
2019-06-26 12:19:11 +00:00
Hans Wennborg 6876de90e8 Fix the build after r364401
It was failing with:

/b/s/w/ir/cache/builder/src/third_party/llvm/llvm/lib/Target/X86/X86ISelLowering.cpp:18772:66:
error: call of overloaded 'makeArrayRef(<brace-enclosed initializer list>)' is ambiguous
     scaleShuffleMask<int>(Scale, makeArrayRef<int>({ 0, 2, 1, 3 }), Mask);
                                                                  ^
/b/s/w/ir/cache/builder/src/third_party/llvm/llvm/lib/Target/X86/X86ISelLowering.cpp:18772:66: note: candidates are:
In file included from /b/s/w/ir/cache/builder/src/third_party/llvm/llvm/include/llvm/CodeGen/MachineFunction.h:20:0,
                 from /b/s/w/ir/cache/builder/src/third_party/llvm/llvm/include/llvm/CodeGen/CallingConvLower.h:19,
                 from /b/s/w/ir/cache/builder/src/third_party/llvm/llvm/lib/Target/X86/X86ISelLowering.h:17,
                 from /b/s/w/ir/cache/builder/src/third_party/llvm/llvm/lib/Target/X86/X86ISelLowering.cpp:14:
/b/s/w/ir/cache/builder/src/third_party/llvm/llvm/include/llvm/ADT/ArrayRef.h:480:15:
note: llvm::ArrayRef<T> llvm::makeArrayRef(const std::vector<_RealType>&) [with T = int]
   ArrayRef<T> makeArrayRef(const std::vector<T> &Vec) {
               ^
/b/s/w/ir/cache/builder/src/third_party/llvm/llvm/include/llvm/ADT/ArrayRef.h:485:37:
note: llvm::ArrayRef<T> llvm::makeArrayRef(const llvm::ArrayRef<T>&) [with T = int]
   template <typename T> ArrayRef<T> makeArrayRef(const ArrayRef<T> &Vec) {
                                     ^

llvm-svn: 364414
2019-06-26 11:56:38 +00:00
Simon Pilgrim c0711af7f9 [X86][AVX] combineExtractSubvector - 'little to big' extract_subvector(bitcast()) support
Ideally this needs to be a generic combine in DAGCombiner::visitEXTRACT_SUBVECTOR but there's some nasty regressions in aarch64 due to neon shuffles not handling bitcasts at all.....

llvm-svn: 364407
2019-06-26 11:21:09 +00:00
Simon Pilgrim 3845a4f849 [X86][AVX] truncateVectorWithPACK - avoid bitcasted shuffles
truncateVectorWithPACK is often used in conjunction with ComputeNumSignBits which struggles when peeking through bitcasts.

This fix tries to avoid bitcast(shuffle(bitcast())) patterns in the 256-bit 64-bit sublane shuffles so we can still see through at least until lowering when the shuffles will need to be bitcasted to widen the shuffle type.

llvm-svn: 364401
2019-06-26 09:50:11 +00:00
Clement Courbet be98e0ab78 [ExpandMemCmp] Honor prefer-vector-width.
Reviewers: gchatelet, echristo, spatel, atdt

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63769

llvm-svn: 364384
2019-06-26 07:06:49 +00:00
Matt Arsenault 8fcc70f141 Don't look for the TargetFrameLowering in the implementation
The same oddity was apparently copy-pasted between multiple targets.

llvm-svn: 364349
2019-06-25 20:53:35 +00:00
Craig Topper 4577b8c17c [X86] Remove isel patterns that look for (vzext_movl (scalar_to_vector (load)))
I believe these all get canonicalized to vzext_movl. The only case where that wasn't true was when the load was loadi32 and the load was an extload aligned to 32 bits. But that was fixed in r364207.

Differential Revision: https://reviews.llvm.org/D63701

llvm-svn: 364337
2019-06-25 17:31:52 +00:00
Craig Topper 14ea14ae85 [X86] Add a DAG combine to turn vzmovl+load into vzload if the load isn't volatile. Remove isel patterns for vzmovl+load
We currently have some isel patterns for treating vzmovl+load the same as vzload, but that shrinks the load which we shouldn't do if the load is volatile.

Rather than adding isel checks for volatile. This patch removes the patterns and teachs DAG combine to merge them into vzload when its legal to do so.

Differential Revision: https://reviews.llvm.org/D63665

llvm-svn: 364333
2019-06-25 17:08:26 +00:00
Simon Pilgrim aae4b68703 [X86] lowerShuffleAsSpecificZeroOrAnyExtend - add ANY_EXTEND TODO.
lowerShuffleAsSpecificZeroOrAnyExtend should be able to lower to ANY_EXTEND_VECTOR_INREG as well as ZER_EXTEND_VECTOR_INREG.

llvm-svn: 364313
2019-06-25 13:36:53 +00:00
Clement Courbet 3bc5ad551a [ExpandMemCmp] Move all options to TargetTransformInfo.
Split off from D60318.

llvm-svn: 364281
2019-06-25 08:04:13 +00:00
Craig Topper 7fccb2ac5e [X86] Don't a vzext_movl in LowerBuildVectorv16i8/LowerBuildVectorv8i16 if there are no zeroes in the vector we're building.
In LowerBuildVectorv16i8 we took care to use an any_extend if the first pair is in the lower 16-bits of the vector and no elements are 0. So bits [31:16] will be undefined. But we still emitted a vzext_movl to ensure that bits [127:32] are 0. If we don't need any zeroes we should be consistent and make all of 127:16 undefined.

In LowerBuildVectorv8i16 we can just delete the vzext_movl code because we only use the scalar_to_vector when there are no zeroes. So the vzext_movl is always unnecessary.

Found while investigating whether (vzext_movl (scalar_to_vector (loadi32)) patterns are necessary. At least one of the cases where they were necessary was where the loadi32 matched 32-bit aligned 16-bit extload. Seemed weird that we required vzext_movl for that case.

Differential Revision: https://reviews.llvm.org/D63700

llvm-svn: 364207
2019-06-24 17:28:41 +00:00
Craig Topper 033774e144 [X86] Cleanups and safety checks around the isFNEG
This patch does a few things to start cleaning up the isFNEG function.

-Remove the Op0/Op1 peekThroughBitcast calls that seem unnecessary. getTargetConstantBitsFromNode has its own peekThroughBitcast inside. And we have a separate peekThroughBitcast on the return value.
-Add a check of the scalar size after the first peekThroughBitcast to ensure we haven't changed the element size and just did something like f32->i32 or f64->i64.
-Remove an unnecessary check that Op1's type is floating point after the peekThroughBitcast. We're just going to look for a bit pattern from a constant. We don't care about its type.
-Add VT checks on several places that consume the return value of isFNEG. Due to the peekThroughBitcasts inside, the type of the return value isn't guaranteed. So its not safe to use it to build other nodes without ensuring the type matches the type being used to build the node. We might be able to replace these checks with bitcasts instead, but I don't have a test case so a bail out check seemed better for now.

Differential Revision: https://reviews.llvm.org/D63683

llvm-svn: 364206
2019-06-24 17:28:26 +00:00
Matt Arsenault faeaedf8e9 GlobalISel: Remove unsigned variant of SrcOp
Force using Register.

One downside is the generated register enums require explicit
conversion.

llvm-svn: 364194
2019-06-24 16:16:12 +00:00
Matt Arsenault e3a676e9ad CodeGen: Introduce a class for registers
Avoids using a plain unsigned for registers throughoug codegen.
Doesn't attempt to change every register use, just something a little
more than the set needed to build after changing the return type of
MachineOperand::getReg().

llvm-svn: 364191
2019-06-24 15:50:29 +00:00
Craig Topper e8da65c698 [X86] Turn v16i16->v16i8 truncate+store into a any_extend+truncstore if we avx512f, but not avx512bw.
Ideally we'd be able to represent this truncate as a any_extend to
v16i32 and a truncate, but SelectionDAG doens't know how to not
fold those together.

We have isel patterns to use a vpmovzxwd+vpdmovdb for the truncate,
but we aren't able to simultaneously fold the load and the store
from the isel pattern. By pulling the truncate into the store we
can successfully hide it from the DAG combiner. Then we can isel
pattern match the truncstore and load+any_extend separately.

llvm-svn: 364163
2019-06-23 23:51:21 +00:00
Craig Topper c8d94e7889 [X86] Fix isel pattern that was looking for a bitcasted load. Remove what appears to be a copy/paste mistake.
DAG combine should ensure bitcasts of loads don't exist.

Also remove 3 patterns that are identical to the block above them.

llvm-svn: 364158
2019-06-23 19:17:50 +00:00
Craig Topper cadd826d0a [X86][SelectionDAG] Cleanup and simplify masked_load/masked_store in tablegen. Use more precise PatFrags for scalar masked load/store.
Rename masked_load/masked_store to masked_ld/masked_st to discourage
their direct use. We need to check truncating/extending and
compressing/expanding before using them. This revealed that
our scalar masked load/store patterns were misusing these.

With those out of the way, renamed masked_load_unaligned and
masked_store_unaligned to remove the "_unaligned". We didn't
check the alignment anyway so the name was somewhat misleading.

Make the aligned versions inherit from masked_load/store instead
from a separate identical version. Merge the 3 different alignments
PatFrags into a single version that uses the VT from the SDNode to
determine the size that the alignment needs to match.

llvm-svn: 364150
2019-06-23 06:06:04 +00:00
Simon Pilgrim a962c1bc0f [X86][SSE] Fold extract_subvector(vselect(x,y,z),0) -> vselect(extract_subvector(x,0),extract_subvector(y,0),extract_subvector(z,0))
llvm-svn: 364136
2019-06-22 17:57:01 +00:00
Craig Topper 4649a051bf [X86] Add DAG combine to turn (vzmovl (insert_subvector undef, X, 0)) into (insert_subvector allzeros, (vzmovl X), 0)
128/256 bit scalar_to_vectors are canonicalized to (insert_subvector undef, (scalar_to_vector), 0). We have isel patterns that try to match this pattern being used by a vzmovl to use a 128-bit instruction and a subreg_to_reg.

This patch detects the insert_subvector undef portion of this and pulls it through the vzmovl, creating a narrower vzmovl and an insert_subvector allzeroes. We can then match the insertsubvector into a subreg_to_reg operation by itself. Then we can fall back on existing (vzmovl (scalar_to_vector)) patterns.

Note, while the scalar_to_vector case is the motivating case I didn't restrict to just that case. I'm also wondering about shrinking any 256/512 vzmovl to an extract_subvector+vzmovl+insert_subvector(allzeros) but I fear that would have bad implications to shuffle combining.

I also think there is more canonicalization we can do with vzmovl with loads or scalar_to_vector with loads to create vzload.

Differential Revision: https://reviews.llvm.org/D63512

llvm-svn: 364095
2019-06-21 19:10:21 +00:00
Craig Topper 4569cdbcf5 [X86] Don't mark v64i8/v32i16 ISD::SELECT as custom unless they are legal types.
We don't have any Custom handling during type legalization. Only
operation legalization.

Fixes PR42355

llvm-svn: 364093
2019-06-21 18:50:00 +00:00
Craig Topper ce6c06dfdd [X86] Add a debug print of the node in the default case for unhandled opcodes in ReplaceNodeResults.
This should be unreachable, but bugs can make it reachable. This
adds a debug print so we can see the bad node in the output when
the llvm_unreachable triggers.

llvm-svn: 364091
2019-06-21 18:49:21 +00:00
Simon Pilgrim 5dba4ed208 [X86][AVX] Combine INSERT_SUBVECTOR(SRC0, EXTRACT_SUBVECTOR(SRC1)) as shuffle
Subvector shuffling often ends up as insert/extract subvector.

llvm-svn: 364090
2019-06-21 18:35:04 +00:00
Craig Topper 6af1be9664 [X86] Use vmovq for v4i64/v4f64/v8i64/v8f64 vzmovl.
We already use vmovq for v2i64/v2f64 vzmovl. But we were using a
blendpd+xorpd for v4i64/v4f64/v8i64/v8f64 under opt speed. Or
movsd+xorpd under optsize.

I think the blend with 0 or movss/d is only needed for
vXi32 where we don't have an instruction that can move 32
bits from one xmm to another while zeroing upper bits.

movq is no worse than blendpd on any known CPUs.

llvm-svn: 364079
2019-06-21 17:24:21 +00:00
Simon Pilgrim 96e77ce626 [X86] isBinOp - move commutative ops to isCommutativeBinOp. NFCI.
TargetLoweringBase::isBinOp checks isCommutativeBinOp as a fallback, so don't duplicate.

llvm-svn: 364072
2019-06-21 16:23:28 +00:00
Simon Pilgrim 36a999ffb8 [X86] X86ISD::ANDNP is a (non-commutative) binop
The sat add/sub tests still have unnecessary extract_subvector((vandnps ymm, ymm), 0) uses that should be split to (vandnps (extract_subvector(ymm, 0), extract_subvector(ymm, 0)), but its getting better.

llvm-svn: 364038
2019-06-21 12:42:39 +00:00
Simon Pilgrim 9184b009cf [X86] createMMXBuildVector - call with BuildVectorSDNode directly. NFCI.
llvm-svn: 364030
2019-06-21 11:25:06 +00:00
Simon Pilgrim c26b8f2afc [X86] combineAndnp - use isNOT instead of manually checking for (XOR x, -1)
llvm-svn: 364026
2019-06-21 11:13:15 +00:00
Simon Pilgrim b5733581c4 [X86] foldVectorXorShiftIntoCmp - use isConstOrConstSplat. NFCI.
Use the isConstOrConstSplat helper instead of inspecting the build vector manually.

llvm-svn: 364024
2019-06-21 10:54:30 +00:00
Simon Pilgrim 771c33e375 [X86][AVX] isNOT - handle concat_vectors(xor X, -1, xor Y, -1) pattern
llvm-svn: 364022
2019-06-21 10:44:15 +00:00
Fangrui Song dc8de6037c Simplify std::lower_bound with llvm::{bsearch,lower_bound}. NFC
llvm-svn: 364006
2019-06-21 05:40:31 +00:00
Craig Topper 9e1665f2d6 [X86] Add BLSI to isUseDefConvertible.
Summary:
BLSI sets the C flag is the input is not zero. So if its followed
by a TEST of the input where only the Z flag is consumed, we can
replace it with the opposite check of the C flag.

We should be able to do the same for BLSMSK and BLSR, but the
naive test case for those is being optimized to a subo by
CodeGenPrepare.

Reviewers: spatel, RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63589

llvm-svn: 363957
2019-06-20 17:52:53 +00:00
Simon Pilgrim a4d705e0ef [X86] LowerAVXExtend - handle ANY_EXTEND_VECTOR_INREG lowering as well.
llvm-svn: 363922
2019-06-20 11:31:54 +00:00
Craig Topper b4ea64570c [X86] Remove memory instructions form isUseDefConvertible.
The caller of this is looking for comparisons of the input
to these instructions with 0. But the memory instructions
input is an addess not a value input in a register.

llvm-svn: 363907
2019-06-20 04:58:40 +00:00
Craig Topper 451f7feb64 [X86] Add v64i8/v32i16 to several places in X86CallingConv.td where they seemed obviously missing.
llvm-svn: 363906
2019-06-20 04:29:00 +00:00
Sanjay Patel b5640b6fe8 [x86] avoid vector load narrowing with extracted store uses (PR42305)
This is an exception to the rule that we should prefer xmm ops to ymm ops.
As shown in PR42305:
https://bugs.llvm.org/show_bug.cgi?id=42305
...the store folding opportunity with vextractf128 may result in better
perf by reducing the instruction count.

Differential Revision: https://reviews.llvm.org/D63517

llvm-svn: 363853
2019-06-19 18:13:47 +00:00
Simon Pilgrim 0018b78ef6 [X86][SSE] combineToExtendVectorInReg - add ANY_EXTEND support TODO. NFCI.
So I don't forget - there's a load of yak shaving to do first.

llvm-svn: 363847
2019-06-19 17:42:37 +00:00
Simon Pilgrim 34279db355 [X86][SSE] Combine shuffles to ANY_EXTEND/ANY_EXTEND_VECTOR_INREG.
We already do this for ZERO_EXTEND/ZERO_EXTEND_VECTOR_INREG - this just extends the pattern matcher to recognize cases where we don't need the zeros in the extension.

llvm-svn: 363841
2019-06-19 17:21:15 +00:00
Simon Pilgrim cdc0236e3a [X86] getExtendInVec - take a ISD::*_EXTEND opcode instead of a IsSigned bool flag. NFCI.
Prep work to support ANY_EXTEND/ANY_EXTEND_VECTOR_INREG without needing another flag.

llvm-svn: 363818
2019-06-19 15:18:24 +00:00
Simon Pilgrim d4754cac89 [X86] Add *_EXTEND -> *_EXTEND_VECTOR_INREG opcode conversion helper. NFCI.
Given a *_EXTEND or *_EXTEND_VECTOR_INREG opcode, convert it to *_EXTEND_VECTOR_INREG.

llvm-svn: 363812
2019-06-19 14:54:02 +00:00
Simon Pilgrim 2b309027ed [X86] Merge extract_subvector(*_EXTEND) and extract_subvector(*_EXTEND_VECTOR_INREG) handling. NFCI.
llvm-svn: 363808
2019-06-19 14:25:27 +00:00
Clement Courbet 4ef7c2868a [X86] Add missing properties on llvm.x86.sse.{st,ld}mxcsr
Summary:
llvm.x86.sse.stmxcsr only writes to memory.
llvm.x86.sse.ldmxcsr only reads from memory, and might generate an FPE.

Reviewers: craig.topper, RKSimon

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62896

llvm-svn: 363773
2019-06-19 08:44:31 +00:00
Matt Arsenault 9cac4e6d14 Rename ExpandISelPseudo->FinalizeISel, delay register reservation
This allows targets to make more decisions about reserved registers
after isel. For example, now it should be certain there are calls or
stack objects in the frame or not, which could have been introduced by
legalization.

Patch by Matthias Braun

llvm-svn: 363757
2019-06-19 00:25:39 +00:00
Craig Topper 10e6128c62 [X86] Remove unnecessary line that makes v4f32 FP_ROUND Legal. NFC
FP_ROUND defaults to Legal for all MVT types and nothing changes
the v4f32 entry way from this default. If we needed this line
we'd also need one for v8f32 with AVX512 which we don't have.

llvm-svn: 363719
2019-06-18 19:04:03 +00:00
Simon Pilgrim 9c8593934a [X86][AVX] extract_subvector(any_extend(x)) -> any_extend_vector_inreg(x)
Part of fixing the X86 regression noted in D63281 - I've split this into X86 and generic parts - the generic commit will be coming shortly and will fix the vector-reduce-mul-widen.ll regression introduced here.

llvm-svn: 363693
2019-06-18 15:30:50 +00:00
Simon Pilgrim 7dd529e54d [X86] Replace any_extend* vector extensions with zero_extend* equivalents
First step toward addressing the vector-reduce-mul-widen.ll regression in D63281 - we should replace ANY_EXTEND/ANY_EXTEND_VECTOR_INREG in X86ISelDAGToDAG to avoid having to add duplicate patterns when treating any extensions as legal.

In future patches this will also allow us to keep any extension nodes around a lot longer in the DAG, which should mean that we can keep better track of undef elements that otherwise become zeros that we think we have to keep......

Differential Revision: https://reviews.llvm.org/D63326

llvm-svn: 363655
2019-06-18 09:50:13 +00:00
Craig Topper f4284f8a9d [X86] Move code that shrinks immediates for ((x << C1) op C2) into a helper function. NFCI
Preliminary step for D59909

llvm-svn: 363645
2019-06-18 04:23:58 +00:00
Craig Topper 587427716c [X86] Remove MOVDI2SSrm/MOV64toSDrm/MOVSS2DImr/MOVSDto64mr CodeGenOnly instructions.
The isel patterns for these use a bitcast and load/store, but
DAG combine should have canonicalized those away.

For the purposes of the memory folding table these opcodes can be
replaced by the MOVSSrm_alt/MOVSDrm_alt and MOVSSmr/MOVSDmr opcodes.

llvm-svn: 363644
2019-06-18 03:23:15 +00:00
Craig Topper 8582ecd8d9 [X86] Introduce new MOVSSrm/MOVSDrm opcodes that use VR128 register class.
Rename the old versions that use FR32/FR64 to MOVSSrm_alt/MOVSDrm_alt.

Use the new versions in patterns that previously used a COPY_TO_REGCLASS
to VR128. These patterns expect the upper bits to be zero. The
current set up appears to work, but I'm not sure we should be
enforcing upper bits being zero through a COPY_TO_REGCLASS.

I wanted to flip the arrangement and use a COPY_TO_REGCLASS to
FR32/FR64 for the patterns that need an f32/f64 result, but that
complicated fastisel and globalisel.

I've been doing some experiments with reducing some isel patterns
and ended up in a situation where I had a
(SUBREG_TO_REG (COPY_TO_RECLASS (VMOVSSrm), VR128)) and our
post-isel peephole was unable to avoid using an instruction for
the SUBREG_TO_REG due to the COPY_TO_REGCLASS. Having a VR128
instruction removes the COPY_TO_REGCLASS that was breaking this.

llvm-svn: 363643
2019-06-18 03:23:11 +00:00
Craig Topper 971ad74ba2 Use VR128X instead of FR32X/FR64X for the register class in VMOVSSZmrk/VMOVSDZmrk.
Removes COPY_TO_REGCLASS from some patterns.

llvm-svn: 363630
2019-06-17 23:08:29 +00:00
Craig Topper 0e18300802 [X86] Make an assert in LowerSCALAR_TO_VECTOR stricter to make it clear what types are allowed here. NFC
Make it clear that only integer type with i32 or smaller elements shoudl get to this part of the code.

llvm-svn: 363629
2019-06-17 23:08:09 +00:00
Craig Topper f3f968adcd [X86] Add TB_NO_REVERSE to some memory folding table entries where the register form requires 64-bit mode, but the memory form does not.
We don't know if its safe to unfold if we're in 32-bit mode.

This is simlar to what was done to some load opcodes in r363523.

I think its pretty unlikely we will try to unfold these anyway so
I don't think this is testable.

llvm-svn: 363595
2019-06-17 18:38:07 +00:00
Simon Pilgrim 835999e48a [X86][SSE] Scalarize under-aligned XMM vector nt-stores (PR42026)
If a XMM non-temporal store has less than natural alignment, scalarize the vector - with SSE4A we can stay on the vector and use MOVNTSD(f64), else we must move to GPRs and use MOVNTI(i32/i64).

llvm-svn: 363592
2019-06-17 18:20:04 +00:00
Simon Pilgrim bb9adfdb4e [X86][AVX] Split under-aligned vector nt-stores.
If a YMM/ZMM non-temporal store has less than natural alignment, split the vector - either they will be satisfactorily aligned or will continue to be split until they are XMMs - at which point the legalizer will scalarize it.

llvm-svn: 363582
2019-06-17 17:22:38 +00:00
Warren Ristow 6452bdd29b [LV] Suppress vectorization in some nontemporal cases
When considering a loop containing nontemporal stores or loads for
vectorization, suppress the vectorization if the corresponding
vectorized store or load with the aligment of the original scaler
memory op is not supported with the nontemporal hint on the target.

This adds two new functions:
  bool isLegalNTStore(Type *DataType, unsigned Alignment) const;
  bool isLegalNTLoad(Type *DataType, unsigned Alignment) const;

to TTI, leaving the target independent default implementation as
returning true, but with overriding implementations for X86 that
check the legality based on available Subtarget features.

This fixes https://llvm.org/PR40759

Differential Revision: https://reviews.llvm.org/D61764

llvm-svn: 363581
2019-06-17 17:20:08 +00:00
Simon Pilgrim 12cb792d7f [X86] combineLoad - begun making the load split code more generic. NFCI.
This is currently only used for ymm->xmm splitting but we shouldn't hardcode the offsets/alignment.

This is necessary for an upcoming patch to split under-aligned non-temporal vector loads.

llvm-svn: 363570
2019-06-17 15:54:36 +00:00
Simon Pilgrim 454e6b9010 [X86][SSE] Prevent misaligned non-temporal vector load/store combines
For loads, pre-SSE41 we can't perform NT loads at all, and after that we can only perform vector aligned loads, so if the alignment is less than for a xmm we'll just end up using the regular unaligned vector loads anyway.

First step towards fixing PR42026 - the next step for stores will be to use SSE4A movntsd where possible and to avoid the stack spill on SSE2 targets.

Differential Revision: https://reviews.llvm.org/D63246

llvm-svn: 363564
2019-06-17 14:26:10 +00:00
Craig Topper 9f2f127009 [X86] Add TB_NO_REVERSE to some folding table entries where the register from uses the REX prefix, but the memory form does not.
It would not be safe to unfold the memory form the register form
without checking that we are compiling for 64-bit mode.

This probaby isn't a real functional issue since we are unlikely
to unfold any of these instructions since they don't have any
tied registers, aren't commutable, and don't have any inputs
other than the address.

llvm-svn: 363523
2019-06-16 22:33:09 +00:00
Sanjay Patel d14389c0a5 [x86] split 256-bit vector selects if operands are vector concats
This is similar logic/motivation to the select splitting in D62969.

In D63233, the pattern changes so that we no longer have an extract_subvector of vselect,
but the operands of the select are still being concatenated.

The closest case is represented in either the first or last test diffs here - we have an
extra instruction, but we converted 3-4 ymm instructions into 4-5 xmm instructions.
I think that's the right trade-off for most AVX1 targets.

In the example based on PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428
...this makes the loop about 30% faster (tested on Haswell by compiling with -mavx).

Differential Revision: https://reviews.llvm.org/D63364

llvm-svn: 363508
2019-06-16 14:04:49 +00:00
Simon Pilgrim fcffc2facc [X86] CombineShuffleWithExtract - handle cases with different vector extract sources
Insert the shorter vector source into an undef vector of the longer vector source's type.

llvm-svn: 363507
2019-06-16 08:00:41 +00:00
Simon Pilgrim 456ca5d7f7 [X86] CombineShuffleWithExtract - assert all src ops types are multiples of rootsize. NFCI.
llvm-svn: 363501
2019-06-15 19:12:44 +00:00
Simon Pilgrim 90e87af303 [X86][AVX] Handle lane-crossing shuffle(extract_subvector(x,c1),extract_subvector(y,c2),m1) shuffles
Pull out the existing (non)lane-crossing fold into a helper lambda and use for lane-crossing unary shuffles as well.

Fixes PR34380

llvm-svn: 363500
2019-06-15 18:30:43 +00:00
Simon Pilgrim 990f3ceb67 [X86][AVX] Decode constant bits from insert_subvector(c1, c2, c3)
This mostly happens due to SimplifyDemandedVectorElts reducing a vector to insert_subvector(undef, c1, 0)

llvm-svn: 363499
2019-06-15 17:05:24 +00:00
Kevin P. Neal fece7c6c83 [FPEnv] Lower STRICT_FP_EXTEND and STRICT_FP_ROUND nodes in preprocess phase of ISelLowering to mirror non-strict nodes on x86.
I recently discovered a bug on the x86 platform: The fp80 type was not handled well by x86 for constrained floating point nodes, as their regular counterparts are replaced by extending loads and truncating stores during the preprocess phase. Normally, platforms don't have this issue, as they don't typically attempt to perform such legalizations during instruction selection preprocessing. Before this change, strict_fp nodes survived until they were mutated to normal nodes, which happened shortly after preprocessing on other platforms. This modification lowers these nodes at the same phase while properly utilizing the chain.5

Submitted by:	Drew Wock <drew.wock@sas.com>
Reviewed by:	Craig Topper, Kevin P. Neal
Approved by:	Craig Topper
Differential Revision:	https://reviews.llvm.org/D63271

llvm-svn: 363417
2019-06-14 16:28:55 +00:00
Eric Christopher 5e83d8fff4 Move commentary on opcode translation for code16 mov instructions
to segment registers closer to the segment register check for when
we add further optimizations.

llvm-svn: 363355
2019-06-14 04:51:55 +00:00
Craig Topper cf34a2bd5d [X86Disassembler] Unify the EVEX and VEX code in emitContextTable. Merge the ATTR_VEXL/ATTR_EVEXL bits. NFCI
Merging the two bits shrinks the context table from 16384 bytes to 8192 bytes.

Remove the ATTRIBUTE_BITS macro and just create an enum directly. Then fix the ATTR_max define to be 8192 to reflect the table size so we stop hardcoding it separately.

llvm-svn: 363330
2019-06-13 22:15:25 +00:00
Simon Pilgrim 757a2f13fd [X86] Use fresh MemOps when emitting VAARG64
Previously it copied over MachineMemOperands verbatim which caused MOV32rm to have store flags set, and MOV32mr to have load flags set. This fixes some assertions being thrown with EXPENSIVE_CHECKS on.

Committed on behalf of @luke (Luke Lau)

Differential Revision: https://reviews.llvm.org/D62726

llvm-svn: 363268
2019-06-13 14:05:37 +00:00
Simon Pilgrim 6b56ad164c [CodeGen] Add getMachineMemOperand + MachineMemOperand::Flags allocator helper wrapper. NFCI.
Pre-commit for D62726 on behalf of @luke (Luke Lau)

llvm-svn: 363257
2019-06-13 12:58:55 +00:00
Simon Pilgrim 0baf136a4d [X86][SSE] Avoid assert for broadcast(horiz-op()) cases for non-f64 cases.
Based on fuzz test from @craig.topper

llvm-svn: 363251
2019-06-13 11:26:21 +00:00
Tom Stellard f335672218 X86: Clean up pass initialization
Summary:
- Remove redundant initializations from pass constructors that were
  already being initialized by LLVMInitializeX86Target().

- Add initialization function for the FPS pass.

Reviewers: craig.topper

Reviewed By: craig.topper

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63218

llvm-svn: 363221
2019-06-13 02:09:32 +00:00
Simon Pilgrim 4e0648a541 [TargetLowering] Add MachineMemOperand::Flags to allowsMemoryAccess tests (PR42123)
As discussed on D62910, we need to check whether particular types of memory access are allowed, not just their alignment/address-space.

This NFC patch adds a MachineMemOperand::Flags argument to allowsMemoryAccess and allowsMisalignedMemoryAccesses, and wires up calls to pass the relevant flags to them.

If people are happy with this approach I can then update X86TargetLowering::allowsMisalignedMemoryAccesses to handle misaligned NT load/stores.

Differential Revision: https://reviews.llvm.org/D63075

llvm-svn: 363179
2019-06-12 17:14:03 +00:00
Simon Pilgrim 5b0e0dd709 [X86][AVX] Fold concat(vpermilps(x,c),vpermilps(y,c)) -> vpermilps(concat(x,y),c)
Handles PSHUFD/PSHUFLW/PSHUFHW (AVX2) + VPERMILPS (AVX1).

An extra AVX1 PSHUFD->VPERMILPS combine will be added in a future commit.

llvm-svn: 363178
2019-06-12 16:38:20 +00:00
Craig Topper ed4cd44870 [X86] Add VCMPSSZrr_Intk and VCMPSDZrr_Intk to isNonFoldablePartialRegisterLoad.
The non-masked versions are already in there. I'm having some
trouble coming up with a way to test this right now. Most load
folding should happen during isel so I'm not sure how to get
peephole pass to do it.

llvm-svn: 363125
2019-06-12 06:29:53 +00:00
Simon Pilgrim 266f43964e [TargetLowering] Add allowsMemoryAccess(MachineMemOperand) helper wrapper. NFCI.
As suggested by @arsenm on D63075 - this adds a TargetLowering::allowsMemoryAccess wrapper that takes a Load/Store node's MachineMemOperand to handle the AddressSpace/Alignment arguments and will also implicitly handle the MachineMemOperand::Flags change in D63075.

llvm-svn: 363048
2019-06-11 11:00:23 +00:00
Craig Topper 627d8168e7 [X86] Add load folding isel patterns to scalar_math_patterns and AVX512_scalar_math_fp_patterns.
Also add a FIXME for the peephole pass not being able to handle this.

llvm-svn: 363032
2019-06-11 04:30:53 +00:00
Tom Stellard 4b0b26199b Revert CMake: Make most target symbols hidden by default
This reverts r362990 (git commit 374571301d)

This was causing linker warnings on Darwin:

ld: warning: direct access in function 'llvm::initializeEvexToVexInstPassPass(llvm::PassRegistry&)'
from file '../../lib/libLLVMX86CodeGen.a(X86EvexToVex.cpp.o)' to global weak symbol
'void std::__1::__call_once_proxy<std::__1::tuple<void* (&)(llvm::PassRegistry&),
std::__1::reference_wrapper<llvm::PassRegistry>&&> >(void*)' from file '../../lib/libLLVMCore.a(Verifier.cpp.o)'
means the weak symbol cannot be overridden at runtime. This was likely caused by different translation
units being compiled with different visibility settings.

llvm-svn: 363028
2019-06-11 03:21:13 +00:00
Tom Stellard 374571301d CMake: Make most target symbols hidden by default
Summary:
For builds with LLVM_BUILD_LLVM_DYLIB=ON and BUILD_SHARED_LIBS=OFF
this change makes all symbols in the target specific libraries hidden
by default.

A new macro called LLVM_EXTERNAL_VISIBILITY has been added to mark symbols in these
libraries public, which is mainly needed for the definitions of the
LLVMInitialize* functions.

This patch reduces the number of public symbols in libLLVM.so by about
25%.  This should improve load times for the dynamic library and also
make abi checker tools, like abidiff require less memory when analyzing
libLLVM.so

One side-effect of this change is that for builds with
LLVM_BUILD_LLVM_DYLIB=ON and LLVM_LINK_LLVM_DYLIB=ON some unittests that
access symbols that are no longer public will need to be statically linked.

Before and after public symbol counts (using gcc 8.2.1, ld.bfd 2.31.1):
nm before/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
36221
nm after/libLLVM-9svn.so | grep ' [A-Zuvw] ' | wc -l
26278

Reviewers: chandlerc, beanz, mgorny, rnk, hans

Reviewed By: rnk, hans

Subscribers: Jim, hiraditya, michaelplatings, chapuni, jholewinski, arsenm, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, sbc100, jgravelle-google, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, zzheng, edward-jones, mgrang, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, kristina, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D54439

llvm-svn: 362990
2019-06-10 22:12:56 +00:00
Craig Topper 9000a72a4b [X86] When promoting i16 compare with immediate to i32, try to use sign_extend for eq/ne if the input is truncated from a type with enough sign its.
Summary:
Our default behavior is to use sign_extend for signed comparisons and zero_extend for everything else. But for equality we have the freedom to use either extension. If we can prove the input has been truncated from something with enough sign bits, we can use sign_extend instead and let DAG combine optimize it out. A similar rule is used by type legalization in LegalizeIntegerTypes.

This gets rid of the movzx in PR42189. The immediate will still take 4 bytes instead of the 2 bytes plus 0x66 prefix a cmp di, 32767 would get, but it avoids a length changing prefix.

Reviewers: RKSimon, spatel, xbolva00

Reviewed By: xbolva00

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63032

llvm-svn: 362920
2019-06-10 04:50:12 +00:00
Craig Topper ceb807bbbc [X86] Disable f32->f64 extload when sse2 is enabled
Summary:
We can only use the memory form of cvtss2sd under optsize due to a partial register update. So previously we were emitting 2 instructions for extload when optimizing for speed. Also due to a late optimization in preprocessiseldag we had to handle (fpextend (loadf32)) under optsize.

This patch forces extload to expand so that it will always be in the (fpextend (loadf32)) form during isel. And when optimizing for speed we can just let each of those pieces select an instruction independently.

Reviewers: spatel, RKSimon

Reviewed By: RKSimon

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62710

llvm-svn: 362919
2019-06-10 04:37:16 +00:00
Craig Topper dd10099d5c [X86] Use EVEX instructions for f128 FAND/FOR/FXOR when avx512vl is enabled.
llvm-svn: 362915
2019-06-10 01:18:55 +00:00
Craig Topper f7ba8b808a [X86] Convert f32/f64 FANDN/FAND/FOR/FXOR to vector logic ops and scalar_to_vector/extract_vector_elts to reduce isel patterns.
Previously we did the equivalent operation in isel patterns with
COPY_TO_REGCLASS operations to transition. By inserting
scalar_to_vetors and extract_vector_elts before isel we can
allow each piece to be selected individually and accomplish the
same final result.

I ideally we'd use vector operations earlier in lowering/combine,
but that looks to be more difficult.

The scalar-fp-to-i64.ll changes are because we have a pattern for
using movlpd for store+extract_vector_elt. While an f64 store
uses movsd. The encoding sizes are the same.

llvm-svn: 362914
2019-06-10 00:41:07 +00:00
Jatin Bhateja 2a30aeb010 [X86] NFCI : Comment updation for EVEX to VEX translation.
Reviewers: llvm-commits, jbhateja

Reviewed By: jbhateja

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D63055

llvm-svn: 362898
2019-06-09 09:59:26 +00:00
Craig Topper 2ba0e2518b [X86] Remove (store (f32 (extractelt (v4f32))) isel patterns which is redundant.
We emit a MOVSSmr and a COPY_TO_REGCLASS, but that's what we would get from
selecting the store and extractelt independently.

llvm-svn: 362895
2019-06-09 03:21:33 +00:00
Craig Topper 7d8494c41c [X86] Mutate scalar fceil/ffloor/ftrunc/fnearbyint/frint into X86ISD::RNDSCALE during PreProcessIselDAG to cut down on number of isel patterns.
Similar was done for vectors in r362535. Removes about 1200 bytes from the isel table.

llvm-svn: 362894
2019-06-08 23:53:31 +00:00
Jonas Paulsson fdc4ea34e3 [SystemZ, RegAlloc] Favor 3-address instructions during instruction selection.
This patch aims to reduce spilling and register moves by using the 3-address
versions of instructions per default instead of the 2-address equivalent
ones. It seems that both spilling and register moves are improved noticeably
generally.

Regalloc hints are passed to increase conversions to 2-address instructions
which are done in SystemZShortenInst.cpp (after regalloc).

Since the SystemZ reg/mem instructions are 2-address (dst and lhs regs are
the same), foldMemoryOperandImpl() can no longer trivially fold a spilled
source register since the reg/reg instruction is now 3-address. In order to
remedy this, new 3-address pseudo memory instructions are used to perform the
folding only when the dst and lhs virtual registers are known to be allocated
to the same physreg. In order to not let MachineCopyPropagation run and
change registers on these transformed instructions (making it 3-address), a
new target pass called SystemZPostRewrite.cpp is run just after
VirtRegRewriter, that immediately lowers the pseudo to a target instruction.

If it would have been possibe to insert a COPY instruction and change a
register operand (convert to 2-address) in foldMemoryOperandImpl() while
trusting that the caller (e.g. InlineSpiller) would update/repair the
involved LiveIntervals, the solution involving pseudo instructions would not
have been needed. This is perhaps a potential improvement (see Phabricator
post).

Common code changes:

* A new hook TargetPassConfig::addPostRewrite() is utilized to be able to run a
target pass immediately before MachineCopyPropagation.

* VirtRegMap is passed as an argument to foldMemoryOperand().

Review: Ulrich Weigand, Quentin Colombet
https://reviews.llvm.org/D60888

llvm-svn: 362868
2019-06-08 06:19:15 +00:00
Craig Topper bd03230cb0 [X86] Remove unnecessary new line escape from the end of a macro. NFC
llvm-svn: 362837
2019-06-07 20:30:40 +00:00
Sanjay Patel 6880bceda2 [x86] narrow extract subvector of vector select
This is a potentially large perf win for AVX1 targets because of the way we
auto-vectorize to 256-bit but then expect the backend to legalize/optimize
for the half-implemented AVX1 ISA.

On the motivating example from PR37428 (even though this patch doesn't solve
the vector shift issue):
https://bugs.llvm.org/show_bug.cgi?id=37428
...there's a 16% speedup when compiling with "-mavx" (perf tested on Haswell)
because we eliminate the remaining 256-bit vblendv ops.

I added comments on a couple of tests that require further work. If we have
256-bit logic ops separating the vselect and extract, we should probably narrow
everything to 128-bit, but that requires a larger pattern match.

Differential Revision: https://reviews.llvm.org/D62969

llvm-svn: 362797
2019-06-07 13:17:46 +00:00
Pengfei Wang f8b28931a7 [X86] -march=cooperlake (llvm)
Support intel -march=cooperlake in llvm

Patch by Shengchen Kan (skan)

Differential Revision: https://reviews.llvm.org/D62836

llvm-svn: 362776
2019-06-07 08:31:35 +00:00
Craig Topper f320f26716 [X86] Make a bunch of merge masked binops commutable for loading folding.
This primarily affects add/fadd/mul/fmul/and/or/xor/pmuludq/pmuldq/max/min/fmaxc/fminc/pmaddwd/pavg.

We already commuted the unmasked and zero masked versions.

I've added 512-bit stack folding tests for most of the instructions
affected. I've tested needing commuting and not commuting across
unmasked, merged masked, and zero masked. The 128/256 bit instructions
should behave similarly.

llvm-svn: 362746
2019-06-06 21:00:04 +00:00
Craig Topper 6b67dfa54c [X86] Make masked floating point equality/ordered compares commutable for load folding purposes.
Same as what is supported for the unmasked form.

llvm-svn: 362717
2019-06-06 16:39:04 +00:00
Craig Topper 9226ba6b37 [X86] Don't turn avx masked.load with constant mask into masked.load+vselect when passthru value is all zeroes.
This is intended to enable the use of an immediate blend or
more optimal instruction. But if the passthru is zero we don't
need any additional instructions.

llvm-svn: 362675
2019-06-06 05:41:27 +00:00
Craig Topper 3975b15dba [X86] Fix mistake that marked VADDSSrrb_Int/VADDSDrrb_Int/VMULSSrrb_Int/VMULSDrrb_Int as commutable.
One of the sources controls the pass through value for the upper bits
of the result so we can't really commute it.

In practice this problem isn't a functional issue because we would
only try to commute this instruction in order to fold a load. But
we can't do embedded rounding and fold a load at the same time. So
the load fold would never succeed so I don't think we would ever
commute or at least keep the version after commuting.

llvm-svn: 362647
2019-06-05 21:00:31 +00:00
Craig Topper d0fff89b81 [X86] Add the vector integer min/max instructions to isAssociativeAndCommutative.
As far as I know these should be freely reassociatable just like
the floating point MAXC/MINC instructions.

The *reduce* test changes are largely regressions and caused by
the "generic" CPU we default to not having a scheduler model.

The machine-combiner-int-vec.ll test shows the positive benefits
of this change.

Differential Revision: https://reviews.llvm.org/D62787

llvm-svn: 362629
2019-06-05 18:25:09 +00:00
Sanjay Patel 2bf82879bd [x86] split more 256-bit stores of concatenated vectors
As suggested in D62498 - collectConcatOps() matches both
concat_vectors and insert_subvector patterns, and we see
more test improvements by using the more general match.

llvm-svn: 362620
2019-06-05 16:40:57 +00:00
Simon Pilgrim de586bd1fd [X86][AVX] Generalize split256BitStore to splitVectorStore. NFCI.
Enables us to use this to split 512-bit vectors in future patches.

llvm-svn: 362617
2019-06-05 16:14:14 +00:00
Simon Pilgrim 886a55eaa0 [X86][AVX] combineX86ShuffleChain - combine shuffle(extractsubvector(x),extractsubvector(y))
We already handle the case where we combine shuffle(extractsubvector(x),extractsubvector(x)), this relaxes the requirement to permit different sources as long as they have the same value type.

This causes a couple of cases where the VPERMV3 binary shuffles occur at a wider width than before, which I intend to improve in future commits - but as only the subvector's mask indices are defined, these will broadcast so we don't see any increase in constant size.

llvm-svn: 362599
2019-06-05 12:56:53 +00:00
Craig Topper 78fdce25a1 [X86] Cleanup convertIntLogicToFPLogic a little. NFCI
-Use early returns to reduce indentation
-Replace multipe ifs with a switch.
-Replace an assert with an llvm_unreachable default in the switch.
-Check that the FP type we're going to use for the
 X86ISD::FAND/FOR/FXOR is legal rather than checking that the
 integer type matches the width of a legal scalar fp type. This all
 runs after legalization so it shouldn't really matter, but making
 sure we're using a valid type in the X86ISD node is really
 whats important.

llvm-svn: 362565
2019-06-05 01:00:34 +00:00
Craig Topper 137de38009 [X86] Mutate fceil/ffloor/ftrunc/fnearbyint/frint into X86ISD::RNDSCALE during PreProcessIselDAG to cut down on pattern permutations
We already need to have patterns for X86ISD::RNDSCALE to support software intrinsics. But we currently have 5 sets of patterns for the 5 rounding operations. For of these 6 patterns we have to support 3 vectors widths, 2 element sizes, sse/vex/evex encodings, load folding, and broadcast load folding. This results in a fair amount of bytes in the isel table.

This patch adds code to PreProcessIselDAG to morph the fceil/ffloor/ftrunc/fnearbyint/frint to X86ISD::RNDSCALE. This way we can remove everything, but the intrinsic pattern while still allowing the operations to be considered Legal for DAGCombine and Legalization. This shrinks the DAGISel by somewhere between 9K and 10K.

There is one complication to this, the STRICT versions of these nodes are currently mutated to their none strict equivalents at isel time when the node is visited. This won't be true in the future since that loses the chain ordering information. For now I've also added support for the non-STRICT nodes to Select so we can change the STRICT versions there after they've been mutated to their non-STRICT versions. We'll probably need a STRICT version of RNDSCALE or something to handle this in the future. Which will take us back to needing 2 sets of patterns for strict and non-strict, but that's still better than the 11 or 12 sets of patterns we'd need.

We can probably do something similar for scalar, but I haven't looked at it yet.

Differential Revision: https://reviews.llvm.org/D62757

llvm-svn: 362535
2019-06-04 18:03:07 +00:00
Benjamin Kramer 03ff1b3c30 [X86] Fold single-use variable into assert. NFC.
Avoids an unused variable warning in Release builds.

llvm-svn: 362534
2019-06-04 18:01:07 +00:00
Sanjay Patel 606eb2367f [x86] split 256-bit store of concatenated vectors
This shows up as a side issue to the main problem for the AVX target example from PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3

But as we can see in the pile of existing test diffs, it's actually a widespread problem
that affects any AVX or later target. Apart from a couple of oddballs, I think these are
all improvements for the reasons stated in the code comment: we do not want to enable YMM
unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit
stores anyway.

We could say that MergeConsecutiveStores() is going overboard on some of these examples,
but that won't solve the problem completely. But that is a reason I'm proposing this as
a lowering rather than a combine: we will infinite loop fighting the merge code if we try
this earlier.

Differential Revision: https://reviews.llvm.org/D62498

llvm-svn: 362524
2019-06-04 16:40:04 +00:00
Sanjay Patel 1e63dd0b44 [SelectionDAG][x86] limit post-legalization store merging by type
The proposal in D62498 showed that x86 would benefit from vector
store splitting, but that may conflict with the generic DAG
combiner's store merging transforms.

Add memory type to the existing TLI hook that enables the merging
transforms, so we can limit those changes to scalars only for x86.

llvm-svn: 362507
2019-06-04 15:15:59 +00:00
Simon Pilgrim a6e289e9f8 [X86][SSE] Pulled out (sub (xor X, M), M) 'ConditionalNegate' out pattern match code. NFCI.
As discussed on D62777 - we should be able to use this in more SSE41+ cases as well but that requires us to separate it from the OR(AND(),ANDN()) matcher.

llvm-svn: 362504
2019-06-04 15:02:33 +00:00
Craig Topper dcf865f0ca [X86] Fix the pattern for merge masked vcvtps2pd.
r362199 fixed it for zero masking, but not zero masking. The load
folding in the peephole pass hid the bug. This patch turns off
the peephole pass on the relevant test to ensure coverage.

llvm-svn: 362440
2019-06-03 19:29:14 +00:00
Simon Pilgrim 8a32ca381d [CostModel][X86] Improve masked load/store AVX1/AVX2 costs
A mixture of internal tests and review of the scheduler models indicates we're overestimating the cost of a masked load, which we're estimating at 4x regular memory ops - more realistic values indicates that its closer to 2x. Masked stores costs are a lot more diverse but 8x is roughly in the middle of the range.

e.g. SandyBridge
defm : X86WriteRes<WriteFMaskedLoad, [SBPort23,SBPort05], 8, [1,2], 3>;
defm : X86WriteRes<WriteFMaskedLoadY, [SBPort23,SBPort05], 9, [1,2], 3>;
defm : X86WriteRes<WriteFMaskedStore, [SBPort4,SBPort01,SBPort23], 5, [1,1,1], 3>;
defm : X86WriteRes<WriteFMaskedStoreY, [SBPort4,SBPort01,SBPort23], 5, [1,1,1], 3>;

e.g. Btver2
defm : X86WriteRes<WriteFMaskedLoad, [JLAGU, JFPU01, JFPX], 6, [1, 2, 2], 1>;
defm : X86WriteRes<WriteFMaskedLoadY, [JLAGU, JFPU01, JFPX], 6, [2, 4, 4], 2>;
defm : X86WriteRes<WriteFMaskedStore, [JSAGU, JFPU01, JFPX], 6, [1, 1, 4], 1>;
defm : X86WriteRes<WriteFMaskedStoreY, [JSAGU, JFPU01, JFPX], 6, [2, 2, 4], 2>;

Differential Revision: https://reviews.llvm.org/D61257

llvm-svn: 362338
2019-06-02 20:37:02 +00:00
Simon Pilgrim 59a8db628b [TTI][X86] Cleanup getMaskedMemoryOpCost. NFCI.
Prep work before resurrecting D61257.

llvm-svn: 362335
2019-06-02 18:06:42 +00:00
Simon Pilgrim 71a39bcf68 [X86] isHorizontalBinOp - add extract_subvector(shuffle(x)) handling (PR39921)
Let's us match horizontal op patterns on fast-variable-shuffle targets (Haswell etc.)

llvm-svn: 362327
2019-06-02 15:47:49 +00:00
Simon Pilgrim 7a869e7036 [DAGCombine] Fold insert_subvector(bitcast(x),bitcast(y),c1) -> bitcast(insert_subvector(x,y),c2)
Move this combine from x86 into generic DAGCombine, which currently only manages cases where the bitcast is between types of the same scalarsize.

Differential Revision: https://reviews.llvm.org/D59188

llvm-svn: 362324
2019-06-02 14:42:11 +00:00
Craig Topper 396a915c26 [X86] Add the SSE versions of PMULLW and PMULLD to isAssociativeAndCommutative.
llvm-svn: 362309
2019-06-02 00:42:58 +00:00
Craig Topper c288a19bb7 [X86] Add AVX512BF16 and AVX512VP2INTERSECT instructions to the loading folding tables.
llvm-svn: 362288
2019-06-01 06:20:59 +00:00
Craig Topper 48fdb61766 [X86] Make the X86FoldTablesEmitter functional again. Fix the spacing in the output to make it easier to diff.
Fix a few other formatting issues in the manual table. And remove some
old FIXMEs.

llvm-svn: 362287
2019-06-01 06:20:55 +00:00
Craig Topper 31d00d80a2 [X86] Remove patterns for X86VSintToFP/X86VUintToFP+loadv4f32 to v2f64.
These patterns can incorrectly narrow a volatile load from 128-bits to 64-bits.
Similar to PR42079.

Switch to using (v4i32 (bitcast (v2i64 (scalar_to_vector (loadi64))))) as the
load pattern used in the instructions.

This probably still has issues in 32-bit mode where loadi64 isn't legal. Maybe
we should use VZMOVL for widened loads even when we don't need the upper bits
as zeroes?

llvm-svn: 362203
2019-05-31 07:38:26 +00:00
Craig Topper b79cc5f802 [X86] Remove avx512 isel patterns for fpextend+load. Prefer to only match fp extloads instead.
DAG combine will usually fold fpextend+load to an fp extload anyway. So the
256 and 512 patterns were probably unnecessary. The 128 bit pattern was special
in that it looked for a v4f32 load, but then used it in an instruction that
only loads 64-bits. This is bad if the load happens to be volatile. We could
probably make the patterns volatile aware, but that's more work for something
that's probably rare. The peephole pass might kick in and save us anyway. We
might also be able to fix this with some additional DAG combines.

This also adds patterns for vselect+extload to enabled masked vcvtps2pd to be
used. Previously we looked for the unlikely vselect+fpextend+load.

llvm-svn: 362199
2019-05-31 06:21:53 +00:00
Craig Topper 23066033a1 [X86] Correct the ins operand order for MASKPAIR16STORE to match other store instructions.
This makes the 5 address operands come first. And the data operand comes last.

This matches the operand order the instruction is created with. It's also the
expected order in X86MCInstLower. So everything appeared to work, but the
operands didn't match their declared type.

Fixes a -verify-machineinstrs failure.

Also remove the isel patterns from these instructions since they should only
be used for stack spills and reloads. I'm not even sure what types the patterns
were looking for to match.

llvm-svn: 362193
2019-05-31 05:20:27 +00:00
Pengfei Wang 2e67d0c842 [X86] Add VP2INTERSECT instructions
Support Intel AVX512 VP2INTERSECT instructions in llvm

Patch by Xiang Zhang (xiangzhangllvm)

Differential Revision: https://reviews.llvm.org/D62366

llvm-svn: 362188
2019-05-31 02:50:41 +00:00
Craig Topper 70dc2200a2 [X86] Remove result type constraints from the extloadv2f32/extloadv4f32/extloadv8f32 PatFrags. NFC
The result types aren't mentioned in the pattern name so really shouldn't be in the PatFrags.

The users of these either have their own type constraint or rely on the type constranit system to realize the only legal extend would be to f64.

llvm-svn: 362175
2019-05-30 23:35:24 +00:00
Craig Topper d6b74cc859 [X86] Remove code that unnecessarily sets EXTLOAD with src type of v2f32/v4f32/v8f32 as Legal for SSE2/AVX/AVX512 respectively. NFC
The LoadExt table defaults to all combinations being Legal. For
vector types, only src VTs with an i1 element type were ever changed.
So we don't need to mark them legal manually.

llvm-svn: 362170
2019-05-30 22:29:06 +00:00
Simon Pilgrim 32aac1727a [X86][SSE] Improve bool vector extload (PR26091)
We already have good codegen for (vXiY *ext(vXi1 bitcast(iX))) cases, this patch uses it for loads of vXi1 types as well - changing the load into a iX integer load, and bitcasting so that combineToExtendBoolVectorInReg can then use it.

Differential Revision: https://reviews.llvm.org/D62449

llvm-svn: 362081
2019-05-30 10:25:20 +00:00
Pengfei Wang 1f67d94279 [X86] Add ENQCMD instructions
For more details about these instructions, please refer to the latest
ISE document:
https://software.intel.com/en-us/download/intel-architecture-instruction-set-extensions-programming-reference.

Patch by Tianqing Wang (tianqing)

Differential Revision: https://reviews.llvm.org/D62281

llvm-svn: 362053
2019-05-30 03:59:16 +00:00
Pengfei Wang 72e3f9662b Revert "[X86] Use 'llvm_unreachable' instead of nullptr in unreachable code to"
This reverts commit c1b3716614bc0a107e6f41a7d3d503baefad8a5b.

llvm-svn: 361918
2019-05-29 02:49:59 +00:00
Pengfei Wang 818c652643 [X86] Use 'llvm_unreachable' instead of nullptr in unreachable code to
avoid static check fail

RegClassOrBank is an object of RegClassOrRegBank, which is defined as
using llvm::RegClassOrRegBank = typedef PointerUnion<const
TargetRegisterClass *, const RegisterBank *>
so control flow can not get here. Use ""llvm_unreachable" here to avoid
"null pointer" confusion.

Patch by Shengchen Kan (skan)

Differential Revision: https://reviews.llvm.org/D62006

Signed-off-by: pengfei <pengfei.wang@intel.com>
llvm-svn: 361912
2019-05-29 02:20:37 +00:00
Fangrui Song 656afe370d [X86] Fix x86-64 call *foo@tlsdesc(%rax) and support R_386_TLSGOTDESC R_386_TLS_DESC_CALL
D18885 emitted 5 bytes for call *foo@tlsdesc(%rax). It should use the
2-byte form instead and let R_X86_64_TLSDESC_CALL apply to the beginning
of the call instruction.

The 2-byte form was deliberately chosen to make ->LE and ->IE relaxation work:

    0:   48 8d 05 00 00 00 00    lea    0x0(%rip),%rax        # 7 <.text+0x7>
                         3: R_X86_64_GOTPC32_TLSDESC     a-0x4
    7:   ff 10                   callq  *(%rax)
                         7: R_X86_64_TLSDESC_CALL        a

=>

    0:   48 c7 c0 fc ff ff ff    mov    $0xfffffffffffffffc,%rax
    7:   66 90                   xchg   %ax,%ax

Also change the symbol type to STT_TLS when VK_TLSCALL or VK_TLSDESC is
seen.

Reviewed By: compnerd

Differential Revision: https://reviews.llvm.org/D62512

llvm-svn: 361910
2019-05-29 02:02:59 +00:00
Adhemerval Zanella 6d7bf5e8df [CodeGen] Add lrint/llrint builtins
This patch add the ISD::LRINT and ISD::LLRINT along with new
intrinsics.  The changes are straightforward as for other
floating-point rounding functions, with just some adjustments
required to handle the return value being an interger.

The idea is to optimize lrint/llrint generation for AArch64
in a subsequent patch.  Current semantic is just route it to libm
symbol.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D62017

llvm-svn: 361875
2019-05-28 20:47:44 +00:00
Sanjay Patel f7980e727f Revert "[x86] split 256-bit store of concatenated vectors"
This reverts commit d5a8637072.

Most likely suspect for this bot failure:
http://lab.llvm.org:8011/builders/clang-cmake-x86_64-avx2-linux/builds/9684

llvm-svn: 361850
2019-05-28 17:37:58 +00:00
David Greene 561fcc0d63 [X86-64] Fix 256-bit SET0 lowering for non-VLX targets
If we don't have VLX then 256-bit SET0 should be lowered
to VPXOR with ZMM registers.  This restores functionality
accidentally removed by r309926.

Differential Revision: https://reviews.llvm.org/D62415

llvm-svn: 361843
2019-05-28 15:37:01 +00:00
Sanjay Patel d5a8637072 [x86] split 256-bit store of concatenated vectors
This shows up as a side issue to the main problem for the AVX target example from PR37428:
https://bugs.llvm.org/show_bug.cgi?id=37428 - https://godbolt.org/z/7tpRa3

But as we can see in the pile of existing test diffs, it's actually a widespread problem
that affects any AVX or later target. Apart from a couple of oddballs, I think these are
all improvements for the reasons stated in the code comment: we do not want to enable YMM
unnecessarily (avoid vzeroupper and frequency throttling) and some cores split 256-bit
stores anyway.

We could say that MergeConsecutiveStores() is going overboard on some of these examples,
but that won't solve the problem completely. But that is the reason I'm proposing this as
a lowering rather than a combine: we will infinite loop fighting the merge code if we try
this earlier.

Differential Revision: https://reviews.llvm.org/D62498

llvm-svn: 361822
2019-05-28 13:54:17 +00:00
Sanjay Patel 6bf4ca9d2e [x86] fix 256-bit vector store splitting to honor 'volatile'
Forking this out of the discussion in D62498
(and assuming that will be committed later, so adding the helper function here).
The LangRef says:
"the backend should never split or merge target-legal volatile load/store instructions."

Differential Revision: https://reviews.llvm.org/D62506

llvm-svn: 361815
2019-05-28 12:58:07 +00:00
Benjamin Kramer 57e267a2e9 [X86] Custom lower CONCAT_VECTORS of v2i1
The generic legalizer cannot handle this. Add an assert instead of
silently miscompiling vectors with elements smaller than 8 bits.

llvm-svn: 361814
2019-05-28 12:52:57 +00:00
Simon Pilgrim 4b48aa0e30 [X86] X86CmovConverterPass::collectCmovCandidates - fix uninitialized variable warnings. NFCI.
llvm-svn: 361804
2019-05-28 10:53:23 +00:00
Simon Pilgrim a044410f37 [X86][SSE] Add shuffle combining support for ISD::ANY_EXTEND_VECTOR_INREG
Reuses what we already have in place for ISD::ZERO_EXTEND_VECTOR_INREG just with a different sentinel

llvm-svn: 361734
2019-05-26 16:00:35 +00:00
Simon Pilgrim 58a8541dcc [X86][AVX] combineBitcastvxi1 - peek through bitops to determine size of original vector
We were only testing for direct SETCC results - this allows us to peek through AND/OR/XOR combinations of the comparison results as well.

There's a missing SEXT(PACKSS) fold that I need to investigate for v8i1 cases before I can enable it there as well.

llvm-svn: 361716
2019-05-26 10:54:23 +00:00
Simon Pilgrim 40fa52b174 [X86] lowerBuildVectorToBitOp - support build_vector(shift()) -> shift(build_vector(),C)
Commonly occurs in sign-extension cases

llvm-svn: 361706
2019-05-25 18:02:17 +00:00
Nikita Popov d87eceda0e [X86] Combine fminnum/fmaxnum with non-nan operand to fmin/fmax
If we have a known non-nan operand, place it in the second operand
of fmin/fmax that is returned if either operand is nan.

Differential Revision: https://reviews.llvm.org/D62448

llvm-svn: 361704
2019-05-25 16:44:29 +00:00
Craig Topper 46e5052b8e [X86FixupLEAs] Turn optIncDec into a generic two address LEA optimizer. Support LEA64_32r properly.
INC/DEC is really a special case of a more generic issue. We should also turn leas into add reg/reg or add reg/imm regardless of the slow lea flags.

This also supports LEA64_32 which has 64 bit input registers and 32 bit output registers. So we need to convert the 64 bit inputs to their 32 bit equivalents to check if they are equal to base reg.

One thing to note, the original code preserved the kill flags by adding operands to the new instruction instead of using addReg. But I think tied operands aren't supposed to have the kill flag set. I dropped the kill flags, but I could probably try to preserve it in the add reg/reg case if we think its important. Not sure which operand its supposed to go on for the LEA64_32r instruction due to the super reg implicit uses. Though I'm also not sure those are needed since they were probably just created by an INSERT_SUBREG from a 32-bit input.

Differential Revision: https://reviews.llvm.org/D61472

llvm-svn: 361691
2019-05-25 06:17:47 +00:00
Craig Topper 4b08fcdeb1 [X86] Add zero idioms to the haswell, broadwell, and skylake schedule models. Add 256-bit fp xor to sandybridge zero idioms
This copies the Sandy Bridge zero idiom support to later CPUs. Adding the AVX2 and AVX512F/VL instructions as appropriate.

Differential Revision: https://reviews.llvm.org/D62360

llvm-svn: 361690
2019-05-25 04:47:49 +00:00
Simon Pilgrim 95b8d9bbf8 [SelectionDAG] computeKnownBits - support constant pool values from target
This patch adds the overridable TargetLowering::getTargetConstantFromLoad function which allows targets to return any constant value loaded by a LoadSDNode node - only X86 makes use of this so far but everything should be in place for other targets.

computeKnownBits then uses this function to improve codegen, notably vector code after legalization.

A future commit will do the same for ComputeNumSignBits but computeKnownBits sees the bigger benefit.

This required a couple of fixes:
* SimplifyDemandedBits must early-out for getTargetConstantFromLoad cases to prevent infinite loops of constant regeneration (similar to what we already do for BUILD_VECTOR).
* Fix a DAGCombiner::visitTRUNCATE issue as we had trunc(shl(v8i32),v8i16) <-> shl(trunc(v8i16),v8i32) infinite loops after legalization on AVX512 targets.

Differential Revision: https://reviews.llvm.org/D61887

llvm-svn: 361620
2019-05-24 10:03:11 +00:00
Robert Lougher 170dfeb2ff Resubmit r360436 "[X86] Avoid SFB - Fix inconsistent codegen with/without debug info"
Fixes https://bugs.llvm.org/show_bug.cgi?id=40969

The functions findPotentiallyBlockedCopies and buildCopy are currently not
accounting for the presence of debug instructions. In the former this results
in the optimization not being trigerred, and in the latter results in
inconsistent codegen.

This patch enables the optimization to be performed in a debug build and
ensures the codegen is consistent with non-debug builds.

Patch by Chris Dawson.

Differential Revision: https://reviews.llvm.org/D61680

llvm-svn: 361527
2019-05-23 18:15:12 +00:00
Fangrui Song 86c9ca48c3 [X86] Support -fno-plt __tls_get_addr calls
In general dynamic/local dynamic TLS models, with -fno-plt,

* x86: emit `calll *___tls_get_addr@GOT(%ebx)` instead of `calll ___tls_get_addr@PLT`
  Note, on x86, if we can get rid of %ebx as the PIC register,
  it may be better to use a register not preserved across function calls.
* x86_64: emit `callq *__tls_get_addr@GOTPCREL(%rip)` instead of `callq __tls_get_addr@PLT`

Reorganize the code by separating 32-bit and 64-bit.

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D62106

llvm-svn: 361453
2019-05-23 01:05:13 +00:00
Craig Topper 93f38e1f1a [X86] Explcitly disable VEXTRACT instruction matching for an immediate of 0. Remove a bunch of isel patterns that become unnecessary.
We effectively had a second set of isel patterns that tried to use a
regular store instruction and an extract_subreg instruction. Or a masked move
and an extract_subreg. These patterns were intended to override the
matching of VEXTRACT instructions by taking advantage of the priority
of the explicit immediate 0 for the index.

This patch instaed just disables the immediate 0 matchin the VEXTRACT
patterns. This each of the component pieces of the larger patterns will
match by themselves.

This found a bug of sorts were we didn't use 128-bit store for 512->128
extract on KNL. Its unclear what the right thing here should be.
Using the vextract avoids constraining the register allocator to use
xmm0-15. But it always results in a longer encoding if the register
allocator ends up choosing xmm0-15 anyway.

llvm-svn: 361431
2019-05-22 21:00:18 +00:00
Craig Topper 9816d55776 [X86][InstCombine] Remove InstCombine code that turns X86 round intrinsics into llvm.ceil/floor. Remove some isel patterns that existed because that was happening.
We were turning roundss/sd/ps/pd intrinsics with immediates of 1 or 2 into
llvm.floor/ceil.  The llvm.ceil/floor intrinsics are supposed to correspond
to the libm functions.  For the libm functions we need to disable the
precision exception so the llvm.floor/ceil functions should always map to
encodings 0x9 and 0xA.

We had a mix of isel patterns where some used 0x9 and 0xA and others used
0x1 and 0x2. We need to be consistent and always use 0x9 and 0xA.

Since we have no way in isel of knowing where the llvm.ceil/floor came
from, we can't map X86 specific intrinsics with encodings 1 or 2 to it.
We could map 0x9 and 0xA to llvm.ceil/floor instead, but I'd really like
to see a use case and optimization advantage first.

I've left the backend test cases to show the blend we now emit without
the extra isel patterns. But I've removed the InstCombine tests completely.

llvm-svn: 361425
2019-05-22 20:04:55 +00:00
Sjoerd Meijer aa4f1ffca4 [TargetMachine] error message unsupported code model
When the tiny code model is requested for a target machine that does not
support this, we get an error message (which is nice) but also this diagnostic
and request to submit a bug report:

    fatal error: error in backend: Target does not support the tiny CodeModel
    [Inferior 2 (process 31509) exited with code 0106]
    clang-9: error: clang frontend command failed with exit code 70 (use -v to see invocation)
    (gdb) clang version 9.0.0 (http://llvm.org/git/clang.git 29994b0c63a40f9c97c664170244a7bba5ecc15e) (http://llvm.org/git/llvm.git 95606fdf91c2d63a931e865f4b78b2e9828ddc74)
    Target: arm-arm-none-eabi
    Thread model: posix
    clang-9: note: diagnostic msg: PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
    clang-9: note: diagnostic msg:
    ********************
    PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
    Preprocessed source(s) and associated run script(s) are located at:
    clang-9: note: diagnostic msg: /tmp/tiny-dfe1a2.c
    clang-9: note: diagnostic msg: /tmp/tiny-dfe1a2.sh
    clang-9: note: diagnostic msg:

But this is not a bug, this is a feature. :-) Not only is this not a bug, this
is also pretty confusing. This patch causes just to print the fatal error and
not the diagnostic:

fatal error: error in backend: Target does not support the tiny CodeModel

Differential Revision: https://reviews.llvm.org/D62236

llvm-svn: 361370
2019-05-22 10:40:26 +00:00
Nikita Popov 15df05152d [X86] Don't compare i128 through vector if construction not cheap (PR41971)
Fix for https://bugs.llvm.org/show_bug.cgi?id=41971. Make the
combineVectorSizedSetCCEquality() transform more conservative by
checking that the bitcast to the vector type will be cheap/free
for both operands. I'm considering it cheap if it's a constant,
a load or already a vector. I've dropped the explicit check for
f128 because it should fall out naturally (in the cases where
it'd be detrimental).

Differential Revision: https://reviews.llvm.org/D62220

llvm-svn: 361352
2019-05-22 06:47:06 +00:00
Pengfei Wang 6a0d432e9e [X86] [CET] Deal with return-twice function such as vfork, setjmp when
CET-IBT enabled

Return-twice functions will indirectly jump after the caller's position.
So when CET-IBT is enable, we should make sure these is endbr*
instructions follow these Return-twice function caller. Like GCC does.

Patch by Xiang Zhang (xiangzhangllvm)

Differential Revision: https://reviews.llvm.org/D61881

llvm-svn: 361342
2019-05-22 00:50:21 +00:00
Craig Topper ed6df47bae [X86] Remove an unneeded ZERO_EXTEND creation from LowerINTRINSIC_W_CHAIN. NFC
We were trying to ZERO_EXTEND from an i8 X86ISD::SETCC to i8 again.

llvm-svn: 361288
2019-05-21 19:03:45 +00:00
Simon Pilgrim 4b82e50315 [X86][SSE] computeKnownBitsForTargetNode - add X86ISD::ANDNP support
Fixes PACKSS-PSHUFB shuffle regressions mentioned on D61692

llvm-svn: 361270
2019-05-21 15:20:24 +00:00
Petar Jovanovic e85bbf564d [DebugInfoMetadata] Refactor DIExpression::prepend constants (NFC)
Refactor DIExpression::With* into a flag enum in order to be less
error-prone to use (as discussed on D60866).

Patch by Djordje Todorovic.

Differential Revision: https://reviews.llvm.org/D61943

llvm-svn: 361137
2019-05-20 10:35:57 +00:00
Craig Topper 3164b50af7 [X86] Remove combineShift function. Just dispatch directly to the handler for each flavor from the main switch. NFC
llvm-svn: 361108
2019-05-19 01:01:46 +00:00
Simon Pilgrim 065431c82b [X86][SSE] Fold movmsk(not(x)) -> not(movmsk)
Helps to improve folding of comparisons with movmsk results.

llvm-svn: 361056
2019-05-17 17:56:25 +00:00
Simon Pilgrim 2c2f8e74b9 [X86][SSE] Match all-of bool scalar reductions into a bitcast/movmsk + cmp.
Same as what we do for vector reductions in combineHorizontalPredicateResult, use movmsk+cmp for scalar (and(extract(x,0),extract(x,1)) reduction patterns. 

llvm-svn: 361052
2019-05-17 17:25:55 +00:00
Simon Pilgrim 279314e81b [X86][AVX] Remove LowerCTTZ's AVX1 custom vector handling.
We can now rely on generic expansion to handle this.

llvm-svn: 361038
2019-05-17 14:37:19 +00:00