Add similar functionality to isShuffleEquivalent - if the mask elements don't match, try matching the BUILD_VECTOR scalars instead.
As target shuffles need to handle SM_Sentinel values, this can get a bit tricky, so commit just adds actual mask element index handling - full SM_SentinelZero support will be added when the need arises.
Also, enables support in matchVectorShuffleWithPACK
llvm-svn: 369212
Simplifies shuffle mask comparisons by just bailing out if the shuffle mask has any out of range values - will make an upcoming patch much simpler.
llvm-svn: 369211
This was a quick pass through some obvious places. I haven't tried the clang-tidy check.
I also replaced the zeroes in getX86SubSuperRegister with X86::NoRegister which is the real sentinel name.
Differential Revision: https://reviews.llvm.org/D66363
llvm-svn: 369151
Eventually we need to generalize combineExtractWithShuffle to handle all faux shuffles and handle truncate (and X86ISD::VTRUNC etc.) there, but we're not ready yet (still creates nodes on the fly, incomplete DemandedElts support, bad use of recursive Depth limit).
llvm-svn: 369134
Summary:
This clang-tidy check is looking for unsigned integer variables whose initializer
starts with an implicit cast from llvm::Register and changes the type of the
variable to llvm::Register (dropping the llvm:: where possible).
Partial reverts in:
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
X86FixupLEAs.cpp - Some functions return unsigned and arguably should be MCRegister
X86FrameLowering.cpp - Some functions return unsigned and arguably should be MCRegister
HexagonBitSimplify.cpp - Function takes BitTracker::RegisterRef which appears to be unsigned&
MachineVerifier.cpp - Ambiguous operator==() given MCRegister and const Register
PPCFastISel.cpp - No Register::operator-=()
PeepholeOptimizer.cpp - TargetInstrInfo::optimizeLoadInstr() takes an unsigned&
MachineTraceMetrics.cpp - MachineTraceMetrics lacks a suitable constructor
Manual fixups in:
ARMFastISel.cpp - ARMEmitLoad() now takes a Register& instead of unsigned&
HexagonSplitDouble.cpp - Ternary operator was ambiguous between unsigned/Register
HexagonConstExtenders.cpp - Has a local class named Register, used llvm::Register instead of Register.
PPCFastISel.cpp - PPCEmitLoad() now takes a Register& instead of unsigned&
Depends on D65919
Reviewers: arsenm, bogner, craig.topper, RKSimon
Reviewed By: arsenm
Subscribers: RKSimon, craig.topper, lenary, aemerson, wuzish, jholewinski, MatzeB, qcolombet, dschuff, jyknight, dylanmckay, sdardis, nemanjai, jvesely, wdng, nhaehnle, sbc100, jgravelle-google, kristof.beyls, hiraditya, aheejin, kbarton, fedor.sergeev, javed.absar, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, tpr, PkmX, jocewei, jsji, Petar.Avramovic, asbirlea, Jim, s.egerton, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65962
llvm-svn: 369041
Now that we're using widening legalization. We need to improve our extract_subvector cost model for these types. This patch begins by modeling these as a subvector extract followed by a permute. I've left FIXMEs in the code for future improvements.
Differential Revision: https://reviews.llvm.org/D65892
llvm-svn: 369022
Now that we've moved to C++14, we no longer need the llvm::make_unique
implementation from STLExtras.h. This patch is a mechanical replacement
of (hopefully) all the llvm::make_unique instances across the monorepo.
llvm-svn: 369013
If the last step in an FP add reduction allows reassociation and doesn't care
about -0.0, then we are free to recognize that computation as a reduction
that may reorder the intermediate steps.
This is requested directly by PR42705:
https://bugs.llvm.org/show_bug.cgi?id=42705
and solves PR42947 (if horizontal math instructions are actually faster than
the alternative):
https://bugs.llvm.org/show_bug.cgi?id=42947
Differential Revision: https://reviews.llvm.org/D66236
llvm-svn: 368995
We already had the pattern for just the scalar to vector and bitcast,
but not the case where we wanted zeroes in the high half of the xmm.
llvm-svn: 368972
fp_to_sint is turned into X86cvttp2si during isel preprocessing.
The other redundant isel patterns were removed previously, but I
missed this one because its in the MMX td file.
llvm-svn: 368968
If the width is 256 bits, then we must have AVX so the else here
was unnecessary. Once that's removed then the >= 256 bit code is
identical to the 128 bit code with a different VT so combine them.
llvm-svn: 368956
Now that we legalize by widening, the element types here won't change. Previously these were modeled as the elements being widened and then the instruction might become an AND or SHL/ASHR pair. But now they'll become something like a ZERO_EXTEND_VECTOR_INREG/SIGN_EXTEND_VECTOR_INREG.
For AVX2, when the destination type is legal its clear the cost should be 1 since we have extend instructions that can produce 256 bit vectors from less than 128 bit vectors. I'm a little less sure about AVX1 costs, but I think the ones I changed were definitely too high, but they might still be too high.
Differential Revision: https://reviews.llvm.org/D66169
llvm-svn: 368858
If the target shuffle mask is from a wider type, attempt to scale the mask so that the extraction can attempt to peek through.
Fixes the regression mentioned in rL368662
Reapplying this as rL368308 had to be reverted as part of rL368660 to revert rL368276
llvm-svn: 368663
If we don't demand all elements, then attempt to combine to a simpler shuffle.
At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts.
The insertps-combine.ll regression is because XFormVExtractWithShuffleIntoLoad can't see through shuffles of different widths - this will be fixed in a follow-up commit.
Reapplying this as rL368307 had to be reverted as part of rL368660 to revert rL368276
llvm-svn: 368662
This introduced a false positive MemorySanitizer warning about use of
uninitialized memory in a vectorized crc function in Chromium. That suggests
maybe something is not right with this transformation. See
https://crbug.com/992853#c7 for a reproducer.
This also reverts the follow-up commits r368307 and r368308 which
depended on this.
> This patch attempts to peek through vectors based on the demanded bits/elt of a particular ISD::EXTRACT_VECTOR_ELT node, allowing us to avoid dependencies on ops that have no impact on the extract.
>
> In particular this helps remove some unnecessary scalar->vector->scalar patterns.
>
> The wasm shift patterns are annoying - @tlively has indicated that the wasm vector shift codegen are to be refactored in the near-term and isn't considered a major issue.
>
> Differential Revision: https://reviews.llvm.org/D65887
llvm-svn: 368660
Currently we can't keep any state in the selector object that we get from
subtarget. As a result we have to plumb through all our variables through
multiple functions. This change makes it non-const and adds a virtual init()
method to allow further state to be captured for each target.
AArch64 makes use of this in this patch to cache a call to hasFnAttribute()
which is expensive to call, and is used on each selection of G_BRCOND.
Differential Revision: https://reviews.llvm.org/D65984
llvm-svn: 368652
r367088 made it so that funclets store XMM registers into their local
frame instead of storing them to the parent frame. However, that change
forgot to update the parent frame pointer offset for catch blocks. This
change does that.
Fixes crashes when an exception is rethrown in a catch block that saves
XMMs, as described in https://crbug.com/992860.
llvm-svn: 368631
We need AVX512BW to be able to truncate an i16 vector. If we don't
have that we have to extend i16->i32, then trunc, i32->i8. But we
won't be able to remove the min/max if we do that. At least not
without more special handling.
llvm-svn: 368623
If we're after type legalize, we should make sure we won't create
a store with an illegal type when we separate the AVG pattern
from the truncating store.
I don't know of a way to fail for this today. Just noticed while
I was in the vicinity.
llvm-svn: 368608
Support -march=tigerlake for x86.
Compare with Icelake Client, It include 4 more new features ,they are
avx512vp2intersect, movdiri, movdir64b, shstk.
Patch by Xiang Zhang (xiangzhangllvm)
Differential Revision: https://reviews.llvm.org/D65840
llvm-svn: 368543
The test case that changed is probably better served through
allowing combineTruncatedArithmetic to create narrow vectors. It
also appears InstCombine would have simplified this test case
to remove the zext and trunc anyway.
llvm-svn: 368522
On SSE41+ targets we always lower vector shuffles to ZERO_EXTEND_VECTOR_INREG, even if we don't need the extended bits.
This patch relaxes this so that we lower to ANY_EXTEND_VECTOR_INREG if we can, meaning that shuffle combines have a better idea of what elements need to be kept zero. This helps the multiple reduction code as we can now combine away a lot more of the pack+extend codes.
Differential Revision: https://reviews.llvm.org/D65741
llvm-svn: 368515
Summary:
This patch adds a special DAG combine for SSE1 to recognize the IR pattern InstCombine gives us for movmsk. This only does the recognition for a few cases where its obvious the input won't be scalarized resulting in building a vector just do to the movmsk. I've made it separate from our existing matching for movmsk since that's called in multiple places and I didn't spend time to see if the other callers would make sense here. Plus the restrictions and additional checks would complicate that.
This fixes the case from PR42870. Buts its probably still broken the presence of logic ops feeding the movmsk pattern which would further hide the v4f32 type.
Reviewers: spatel, RKSimon, xbolva00
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65689
llvm-svn: 368506
Summary:
On windows if the frame size exceed 4096 bytes, compiler need to
generate a call to _alloca_probe. X86CallFrameOptimization pass
changes the reserved stack size and cause of stack probe function
not be inserted. This patch fix the issue by detecting the call
frame size, if the size exceed 4096 bytes, drop X86CallFrameOptimization.
Reviewers: craig.topper, wxiao3, annita.zhang, rnk, RKSimon
Reviewed By: rnk
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65923
llvm-svn: 368503
Summary:
Targets often have instructions that can sign-extend certain cases faster
than the equivalent shift-left/arithmetic-shift-right. Such cases can be
identified by matching a shift-left/shift-right pair but there are some
issues with this in the context of combines. For example, suppose you can
sign-extend 8-bit up to 32-bit with a target extend instruction.
%1:_(s32) = G_SHL %0:_(s32), i32 24 # (I've inlined the G_CONSTANT for brevity)
%2:_(s32) = G_ASHR %1:_(s32), i32 24
%3:_(s32) = G_ASHR %2:_(s32), i32 1
would reasonably combine to:
%1:_(s32) = G_SHL %0:_(s32), i32 24
%2:_(s32) = G_ASHR %1:_(s32), i32 25
which no longer matches the special case. If your shifts and extend are
equal cost, this would break even as a pair of shifts but if your shift is
more expensive than the extend then it's cheaper as:
%2:_(s32) = G_SEXT_INREG %0:_(s32), i32 8
%3:_(s32) = G_ASHR %2:_(s32), i32 1
It's possible to match the shift-pair in ISel and emit an extend and ashr.
However, this is far from the only way to break this shift pair and make
it hard to match the extends. Another example is that with the right
known-zeros, this:
%1:_(s32) = G_SHL %0:_(s32), i32 24
%2:_(s32) = G_ASHR %1:_(s32), i32 24
%3:_(s32) = G_MUL %2:_(s32), i32 2
can become:
%1:_(s32) = G_SHL %0:_(s32), i32 24
%2:_(s32) = G_ASHR %1:_(s32), i32 23
All upstream targets have been configured to lower it to the current
G_SHL,G_ASHR pair but will likely want to make it legal in some cases to
handle their faster cases.
To follow-up: Provide a way to legalize based on the constant. At the
moment, I'm thinking that the best way to achieve this is to provide the
MI in LegalityQuery but that opens the door to breaking core principles
of the legalizer (legality is not context sensitive). That said, it's
worth noting that looking at other instructions and acting on that
information doesn't violate this principle in itself. It's only a
violation if, at the end of legalization, a pass that checks legality
without being able to see the context would say an instruction might not be
legal. That's a fairly subtle distinction so to give a concrete example,
saying %2 in:
%1 = G_CONSTANT 16
%2 = G_SEXT_INREG %0, %1
is legal is in violation of that principle if the legality of %2 depends
on %1 being constant and/or being 16. However, legalizing to either:
%2 = G_SEXT_INREG %0, 16
or:
%1 = G_CONSTANT 16
%2:_(s32) = G_SHL %0, %1
%3:_(s32) = G_ASHR %2, %1
depending on whether %1 is constant and 16 does not violate that principle
since both outputs are genuinely legal.
Reviewers: bogner, aditya_nandakumar, volkan, aemerson, paquette, arsenm
Subscribers: sdardis, jvesely, wdng, nhaehnle, rovka, kristof.beyls, javed.absar, hiraditya, jrtc27, atanasyan, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61289
llvm-svn: 368487
As discussed on PR42825, if we are inverting the selection mask we can just swap the inputs and avoid the inversion.
Differential Revision: https://reviews.llvm.org/D65522
llvm-svn: 368438
I've now needed to add an extra parameter to this call twice recently. Not only
is the signature getting extremely unwieldy, but just updating all of the
callsites and implementations is a pain. Putting the parameters in a struct
sidesteps both issues.
llvm-svn: 368408
The only way to generate these was through promoting legalization
of narrow vectors, but we widen those types now. So we shouldn't
produce these nodes.
llvm-svn: 368396
Under this configuration we'll want to split the v8i64 or v16i32 into two vectors. The default legalization will try to truncate each of those 256-bit pieces one step to 128-bit, concatenate those, then truncate one more time from the new 256 to 128 bits.
With this patch we now truncate the two splits to 64-bits then concatenate those. We have to do this two different ways depending on whether have widening legalization enabled. Without widening legalization we have to manually construct X86ISD::VTRUNC to prevent the ISD::TRUNCATE with a narrow result being promoted to 128 bits with a larger element type than what we want followed by something like a pshufb to grab the lower half of each element to finish the job. With widening legalization we just get the right thing. When we switch to widening by default we can just delete the other code path.
Differential Revision: https://reviews.llvm.org/D65626
llvm-svn: 368349
This fixes znver1 so that it properly enables CMPXHG8B. We can
probably remove explicit CMPXCHG8B from CPUs that also have
CMPXCHG16B, but keeping this simple to allow cherry pick to 9.0.
Fixes PR42935.
llvm-svn: 368324
If the target shuffle mask is from a wider type, attempt to scale the mask so that the extraction can attempt to peek through.
Fixes the regression mentioned in rL368307
llvm-svn: 368308
If we don't demand all elements, then attempt to combine to a simpler shuffle.
At the moment we can only do this if Depth == 0 as combineX86ShufflesRecursively uses Depth to track whether the shuffle has really changed or not - we'll need to change this before we can properly start merging combineX86ShufflesRecursively into SimplifyDemandedVectorElts.
The insertps-combine.ll regression is because XFormVExtractWithShuffleIntoLoad can't see through shuffles of different widths - this will be fixed in a follow-up commit.
llvm-svn: 368307
If we're splitting the 512-bit vector anyway and we have zero/sign bits, then we might as well use pack instructions to concat and truncate at once.
Differential Revision: https://reviews.llvm.org/D65904
llvm-svn: 368210
The assert that caused this to be reverted should be fixed now.
Original commit message:
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 368183
We have aliases that disambiguate memory forms of bt/btc/btr/bts
without suffixes to the 32-bit form. These aliases should have
been updated when the instructions were updated in r356413.
llvm-svn: 368127
The upper 4 bits of the immediate byte are used to encode a
register. We need to limit the explicit immediate to fit in the
remaining 4 bits.
Fixes PR42899.
llvm-svn: 368123
If we're after type legalization we should only be trying to turn
v2i64 into v2i32. So bitcast to v4i32, shuffle the even elements
together. Then use X86ISD::CVTSI2P. The alternative is to leave
the v2i64 type alone and let it scalarized. Hopefully keeping
it packed is better.
Fixes PR42905.
llvm-svn: 368091
Summary:
Cleans X86.td's Barcelona entry to be more like the others,
by moving the features out of the `Proc<>`, thus potentially
making it possible to inherit from them.
Split off from D63628
Reviewers: craig.topper, RKSimon
Reviewed By: craig.topper
Subscribers: hiraditya, jfb, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65791
llvm-svn: 368061
If we don't demand any non-undef shuffle elements then the assert will fail as all shuffle inputs would still be flagged as 'identity' safe.
Exposed by an incoming patch.
llvm-svn: 368022
Summary:
Before this patch MGATHER/MSCATTER is capable of representing all
common addressing modes, but only when illegal types are used.
This patch adds an IndexType property so more representations
are available when using legal types only.
Original modes:
vector of bases
base + vector of signed scaled offsets
New modes:
base + vector of signed unscaled offsets
base + vector of unsigned scaled offsets
base + vector of unsigned unscaled offsets
The current behaviour of addressing modes for gather/scatter remains
unchanged.
Patch by Paul Walker.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D65636
llvm-svn: 368008
This patch changes our defualt legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the elements widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 367901
The test case is based on the example from the post-commit thread for:
https://reviews.llvm.org/rGc9171bd0a955
This replaces the x86-specific simple-type check from:
rL367766
with a check in the DAGCombiner. Adding the check isn't
strictly necessary after the fix from:
rL367768
...but it seems likely that we're heading for trouble if
we are creating weird types in this transform.
I combined the earlier legality check into the initial
clause to simplify the code.
So we should only try the trunc/sext transform at the
earliest combine stage, but we limit the transform to
simple types anyway because the TLI hook is probably
too lax about what it considers a free truncate.
llvm-svn: 367834
Summary:
This is patch is part of a serie to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet, jfb, jakehehrlich
Reviewed By: jfb
Subscribers: wuzish, jholewinski, arsenm, dschuff, nemanjai, jvesely, nhaehnle, javed.absar, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, dexonsmith, PkmX, jocewei, jsji, s.egerton, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65514
llvm-svn: 367828
Summary:
The SimplifyDemandedVectorElts function can replace with undef
when no elements are demanded, but due to how it interacts with
TargetLoweringOpts, it can only do this when the node has
no other users.
Remove a now unneeded DAG combine from the X86 backend.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65713
llvm-svn: 367788
This avoids the crash from:
https://bugs.llvm.org/show_bug.cgi?id=42880
...and I think it's a proper constraint for the TLI hook.
But that example raises questions about what happens to get us
into this situation (created i29 types) and what happens later
(why does legalization die on those types), so I'm not sure if
we will resolve the bug based on this change.
llvm-svn: 367766
This is consistent with the target independent intrinsic handling.
Not sure this really matters since we just pull the constant out
using getZExtValue later.
llvm-svn: 367736
If a type is larger than a legal type and needs to be split, we would previously allow the multiply to be decomposed even if the split multiply is legal. Since the shift + add/sub code would also need to be split, its not any better to decompose it.
This patch figures out what type the mul will eventually be legalized to and then uses that type for the query. I tried just returning false illegal types and letting them get handled after type legalization, but then we can't recognize and i64 constant splat on 32-bit targets since will be destroyed by type legalization. We could special case vectors of i64 to avoid that...
Differential Revision: https://reviews.llvm.org/D65533
llvm-svn: 367601
This adds SimplifyMultipleUseDemandedBitsForTargetNode X86 support and uses it to allow us to peek through vector insertions to avoid dependencies on entire insertion chains.
llvm-svn: 367570
We have custom code that ignores the normal promoting type legalization on less than 128-bit vector types like v4i8 to emit pavgb, paddusb, psubusb since we don't have the equivalent instruction on a larger element type like v4i32. If this operation appears before a store, we can be left with an any_extend_vector_inreg followed by a truncstore after type legalization. When truncstore isn't legal, this will normally be decomposed into shuffles and a non-truncating store. This will then combine away the any_extend_vector_inreg and shuffle leaving just the store. On avx512, truncstore is legal so we don't decompose it and we had no combines to fix it.
This patch adds a new DAG combine to detect this case and emit either an extract_store for 64-bit stoers or a extractelement+store for 32 and 16 bit stores. This makes the avx512 codegen match the avx2 codegen for these situations. I'm restricting to only when -x86-experimental-vector-widening-legalization is false. When we're widening we're not likely to create this any_extend_inreg+truncstore combination. This means we should be able to remove this code when we flip the default. I would like to flip the default soon, but I need to investigate some performance regressions its causing in our branch that I wasn't seeing on trunk.
Differential Revision: https://reviews.llvm.org/D65538
llvm-svn: 367488
Summary:
This will make it possible to improve IPRA by taking into account
register usage in indirect calls.
NFC yet; this is just laying the groundwork to start building
up patches to take advantage of the information for improved register
allocation.
Reviewers: aditya_nandakumar, volkan, qcolombet, arsenm, rovka, aemerson, paquette
Subscribers: sdardis, wdng, javed.absar, hiraditya, jrtc27, atanasyan, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65488
llvm-svn: 367476
Before combining insert_subvector(insert_subvector(vec, sub0, c0), sub1, c1) patterns, ensure that the subvectors are all the same type. On AVX512 targets especially we might have a mixture of 128/256 subvector insertions.
llvm-svn: 367429
PR42819 showed an issue that we couldn't handle the case where we demanded a 'sub-sub-vector' of the SUBV_BROADCAST 'sub-vector' source.
This patch recognizes these cases and extracts the sub-sub-vector instead of trying to broadcast to a type smaller than the 'sub-vector' source.
llvm-svn: 367306
As discussed on rL367171, we have a problem where the depth recursion used in combineX86ShufflesRecursively was subtly different to computeKnownBits etc. - it starts at Depth=1 instead of Depth=0 like the others and has a different maximum recursion depth.
This NFC patch fixes the recursion depth to start at 0, so we can more easily reuse depth values in calls from combineX86ShufflesRecursively and its helper functions in computeKnownBits etc.
llvm-svn: 367232
The pmaddwd inserts a truncate, if that truncate would end up
creating additional instructions instead of making a zext
narrower, then we shouldn't do it.
I've restricted this to only sse4.1 targets since on prior
targets the zext will be done in stages. So the truncate will
probably not create additional instructions. Might need some
more investigation of mul shrinking and the other pmaddwd
transform to be sure this is the right decision.
There might be a slight regression on AVX1 targets due to add
splitting. Hard to say for sure. Maybe we need to look into
using the vector reduction flag to use 2 narrow loads and a
blend instead of extracting and inserting.
llvm-svn: 367198
This reverts r340478 and r340631 and replaces them with a simpler
method of just letting DAG combine revisit the nodes to handle
the other operand.
llvm-svn: 367195
Recommit rL367100 which was reverted at rL367141. Until PR42777 is fixed, we no longer get the benefits of peeking through bitcasts but it does still remove a GetDemandedBits user and gives us the equivalent combines.
llvm-svn: 367172
Summary:
This is an alternate approach to D57970.
Currently funclets reuse the same stack slots that are used in the
parent function for saving callee-saved xmm registers. If the parent
function modifies a callee-saved xmm register before an excpetion is
thrown, the catch handler will overwrite the original saved value.
This patch allocates space in funclets stack for saving callee-saved xmm
registers and uses RSP instead RBP to access memory.
Reviewers: andrew.w.kaylor, LuoYuanke, annita.zhang, craig.topper,
RKSimon
Subscribers: rnk, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63396
Signed-off-by: pengfei <pengfei.wang@intel.com>
llvm-svn: 367088
Summary:
Add a new method which tries to compute the target address referenced by an operand.
This patch supports x86_64 RIP-relative addressing for now.
It is necessary to print referenced symbol names in llvm-objdump.
Reviewers: andreadb, MaskRay, grosbach, jgalenson, craig.topper
Reviewed By: MaskRay, craig.topper
Subscribers: bcain, rupprecht, jhenderson, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63847
llvm-svn: 366987
Summary:
This was originally reported in D62818.
https://rise4fun.com/Alive/oPH
InstCombine does the opposite fold, in hope that `C l>>/<< Y` expression
will be hoisted out of a loop if `Y` is invariant and `X` is not.
But as it is seen from the diffs here, if it didn't get hoisted,
the produced assembly is almost universally worse.
Much like with my recent "hoist add/sub by/from const" patches,
we should get almost universal win if we hoist constant,
there is almost always an "and/test by imm" instruction,
but "shift of imm" not so much, so we may avoid having to
materialize the immediate, and thus need one less register.
And since we now shift not by constant, but by something else,
the live-range of that something else may reduce.
Special care needs to be applied not to disturb x86 `BT` / hexagon `tstbit`
instruction pattern. And to not get into endless combine loop.
Reviewers: RKSimon, efriedma, t.p.northover, craig.topper, spatel, arsenm
Reviewed By: spatel
Subscribers: hiraditya, MaskRay, wuzish, xbolva00, nikic, nemanjai, jvesely, wdng, nhaehnle, javed.absar, tpr, kristof.beyls, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62871
llvm-svn: 366955
This patch adds support for recognizing cases where a larger vector type is being used to reduce just the elements in the lower subvector:
e.g. <8 x i32> reduction pattern in a <16 x i32> vector:
<4,5,6,7,u,u,u,u,u,u,u,u,u,u,u,u>
<2,3,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
<1,u,u,u,u,u,u,u,u,u,u,u,u,u,u,u>
matchBinOpReduction returns the lower extracted subvector in such cases, assuming isExtractSubvectorCheap accepts the extraction.
I've only enabled it for X86 reduction sums so far. I intend to enable it for the bitop/minmax cases in future patches, and eventually I think its worth turning it on all the time. This is mainly just a case of ensuring calls to matchBinOpReduction don't make assumptions on the vector width based on the original vector extraction.
Fixes the x86 partial reduction sum cases in PR33758 and PR42023.
Differential Revision: https://reviews.llvm.org/D65047
llvm-svn: 366933
LegalizeDAG tries to legal the DAG by legalizing nodes before
their operands.
If we create a new node, we end up legalizing it after its operands.
This prevents some of the optimizations that can be done when the
operand is a build_vector since the build_vector will have been
legalized to something else.
Differential Revision: https://reviews.llvm.org/D65132
llvm-svn: 366835
The build_vector will become a constant pool load. By using the
desired type initially, it ensures we don't generate a bitcast
of the constant pool load which will need to be folded with
the load.
While experimenting with another patch, I noticed that when the
load type and the constant pool type don't match, then
SimplifyDemandedBits can't handle it. While we should probably
fix that, this was a simple way to fix the issue I saw.
llvm-svn: 366732
This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.
A helper function, findEltLoadSrc, recurses to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.
Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.
Fixed out of bounds load assert identified in rL366501
Differential Revision: https://reviews.llvm.org/D64551
llvm-svn: 366681
As detailed on PR42674, we can reduce a vXi8 down until we have the final <8 x i8>, and then use PSADBW with zero, to sum those values. We then extract the bottom i8, discarding any overflow from the upper bits of the i16 result.
llvm-svn: 366636
Summary:
For split-stack, if the nested argument (i.e. R10) is not used, no need to save/restore it in the prologue.
Reviewers: thanm
Reviewed By: thanm
Subscribers: mstorsjo, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64673
llvm-svn: 366569
I plan on adding memcpy optimizations in the GlobalISel pipeline, but we can't
do that unless we delay lowering to actual function calls. This patch changes
the translator to generate G_INTRINSIC_W_SIDE_EFFECTS for these functions, and
then have each target specify that using the new custom legalizer for intrinsics
hook that they want it expanded it a libcall.
Differential Revision: https://reviews.llvm.org/D64895
llvm-svn: 366516
This patch enables us to find the source loads for each element, splitting them into a Load and ByteOffset, and attempts to recognise consecutive loads that are in fact from the same source load.
A helper function, findEltLoadSrc, recurses to find a LoadSDNode and determines the element's byte offset within it. When attempting to match consecutive loads, byte offsetted loads then attempt to matched against a previous load that has already been confirmed to be a consecutive match.
Next step towards PR16739 - after this we just need to account for shuffling/repeated elements to create a vector load + shuffle.
Differential Revision: https://reviews.llvm.org/D64551
llvm-svn: 366441
LEA doesn't affect flags, so use it more liberally to replace an ADD when
we know that the ADD operands affect flags.
In the motivating example from PR40483:
https://bugs.llvm.org/show_bug.cgi?id=40483
...this lets us avoid duplicating a math op just to avoid flag conflict.
As mentioned in the TODO comments, this heuristic can be extended to
fire more often if that leads to more improvements.
Differential Revision: https://reviews.llvm.org/D64707
llvm-svn: 366431
I'm not convinced the code this calls is properly vetted for
vXi1 vectors. Experimental vector widening legalization testing
for D55251 is now hitting an assertion failure inside
EltsFromConsecutiveLoads. This is occurring from a v2i1 load
having a store size different than its VT size. Hopefully
this commit will keep such issues from happening.
llvm-svn: 366405
This is part of what is requested by PR42023:
https://bugs.llvm.org/show_bug.cgi?id=42023
There's an extension needed for FP add, but exactly how we would specify
that using flags is not clear to me, so I left that as a TODO.
We're still missing patterns for partial reductions when the input vector
is 256-bit or 512-bit, but I think that's a failure of vector narrowing.
If we can reduce the widths, then this matching should work on those tests.
Differential Revision: https://reviews.llvm.org/D64760
llvm-svn: 366268
We mostly avoid sub with immediate but there are a couple cases that can create them. One is the add 128, %rax -> sub -128, %rax trick in isel. The other is when a SUB immediate gets created for a compare where both the flags and the subtract value is used. If we are unable to linearize the SelectionDAG to satisfy the flag user and the sub result user from the same instruction, we will clone the sub immediate for the two uses. The one that produces flags will eventually become a compare. The other will have its flag output dead, and could then be considered for LEA creation.
I added additional test cases to add.ll to show the the sub -128 trick gets converted to LEA and a case where we don't need to convert it.
This showed up in the current codegen for PR42571.
Differential Revision: https://reviews.llvm.org/D64574
llvm-svn: 366151
inttofp (trunc (extelt X, 0)) --> inttofp (extelt (bitcast X), 0)
We have pseudo-vectorization of scalar int to FP casts, so this tries to
make that more likely by replacing a truncate with a bitcast. I didn't see
any test diffs starting from 'uitofp', so I left that as a TODO. We can't
only match the shorter trunc+extract pattern because there's an opposing
transform somewhere, so we infinite loop. Waiting to try this during
lowering is another possibility.
A motivating case is shown in PR39975 and included in the test diffs here:
https://bugs.llvm.org/show_bug.cgi?id=39975
Differential Revision: https://reviews.llvm.org/D64710
llvm-svn: 366098
I think we only turn out of range shiftss to undef when
all elements are out of range or the shift amount is a splat out
of range. I'm not sure which, I didn't check.
During lowering we can split a shift where some elements
are out of range into multiple shifts. This can create a
new shift with a splat shift amount that is out of range.
This patch returns undef for this case.
Fixes PR42615.
Differential Revision: https://reviews.llvm.org/D64699
llvm-svn: 366096
Summary:
SSE1 only supports v4f32. But does have instructions like movlps/movhps that load/store 64-bits of memory.
This patch breaks the connection between the node VT of the vzext_load/vextract_store patterns and the memory VT. Enabling a v4f32 node with a 64-bit memory VT. I've used i64 as the memory VT here. I've written the PatFrag predicate to just check the store size not the specific VT. I think the VT will only matter for CSE purposes. We could use v2f32, but if we want to start using these operations in more places a simple integer type might make the most sense.
I'd like to maybe use this same thing for SSE2 and later as well, but that will need more work to be supported by EltsFromConsecutiveLoads to avoid regressing lit tests. I'd maybe also like to combine bitcasts with these load/stores nodes now that the types are disconnected. And I'd also like to consider canonicalizing (scalar_to_vector + load) to vzext_load.
If you want I can split the mechanical tablegen stuff where I added the 32/64 off from the sse1 change.
Reviewers: spatel, RKSimon
Reviewed By: RKSimon
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64528
llvm-svn: 366034
We can use the C flag from NEG to detect that the input was zero.
Really we could probably use the Z flag too. But C matches what
we'd do for usubo 0, X.
Haven't found a test case for this due to the usubo formation
in CGP. But I verified if I comment out the CGP code this
transformation catches some of the same cases.
llvm-svn: 365929
Follow up to D58597, where it was noted that the commuted ISD::SUB variant
was having problems with lack of combines.
See also D63958 where we untangled setcc/sub pairs.
Differential Revision: https://reviews.llvm.org/D58875
llvm-svn: 365791
Summary:
As of binutils 2.32, ld has a bogus TLS relaxation error when the GD/LD
code sequence using R_X86_64_GOTPCREL (instead of R_X86_64_GOTPCRELX) is
attempted to be relaxed to IE/LE (binutils PR24784). gold and lld are good.
In gcc/config/i386/i386.md, there is a configure-time check of as/ld
support and the GOT relaxation will not be used if as/ld doesn't support
it:
if (flag_plt || !HAVE_AS_IX86_TLS_GET_ADDR_GOT)
return "call\t%P2";
return "call\t{*%p2@GOT(%1)|[DWORD PTR %p2@GOT[%1]]}";
In clang, -DENABLE_X86_RELAX_RELOCATIONS=OFF is the default. The ld.bfd
bogus error can be reproduced with:
thread_local int a;
int main() { return a; }
clang -fno-plt -fpic a.cc -fuse-ld=bfd
GOTPCRELX gained relative good support in 2016, which is considered
relatively new. It is even difficult to conditionally default to
-DENABLE_X86_RELAX_RELOCATIONS=ON due to cross compilation reasons. So
work around the ld.bfd bug by only using GOT when GOTPCRELX is enabled.
Reviewers: dalias, hjl.tools, nikic, rnk
Reviewed By: nikic
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64304
llvm-svn: 365752
We use the functions that convert to three address to do the
conversion, but changing an 8 or 16 bit will cause it to create
a virtual register. This can't be done after register allocation
where this pass runs.
I've switched the pass completely to a white list of instructions
that can be converted to LEA instead of a blacklist that was
incorrect. This will avoid surprises if we enhance the three
address conversion function to include additional instructions
in the future.
Fixes PR42565.
llvm-svn: 365720
Unfortunately subo formation in CGP prevents obvious ways of
testing this.
But we already have BLSI in here and the flag behavior is
well understood.
Might become more useful if we improve PR42571.
llvm-svn: 365702
This renames the type so it doesn't sound like its based off the load size - as we're moving towards supporting combining loads of different sizes.
llvm-svn: 365655
Cache the LoadSDNode nodes so we can easily map to/from the element index instead of packing them together - this will be useful for future patches for PR16739 etc.
llvm-svn: 365620
This patch checks to see if the vector element loads are based off a dereferenceable pointer that covers the entire vector width, in which case we don't need to have element loads at both extremes of the vector width - just the start (base pointer) of it.
Another step towards partial vector loads......
Differential Revision: https://reviews.llvm.org/D64205
llvm-svn: 365614
This should prevent doing this on pre-sse4.1 targets or for 256
bit vectors without avx2.
I don't know of a failure from this. Op legalization will probably
take care of, but seemed better to be safe.
llvm-svn: 365577
Basically the problem is that X86 doesn't set the Fast flag from
allowsMemoryAccess on certain CPUs due to slow unaligned memory
subtarget features. This prevents bitcasts from being folded into
loads and stores. But all vector loads and stores of the same width
are the same cost on X86.
This patch merges the allowsMemoryAccess call into isLoadBitCastBeneficial to allow X86 to skip it.
Differential Revision: https://reviews.llvm.org/D64295
llvm-svn: 365549
Dump the DWARF information about call sites and call site parameters into
debug info sections.
The patch also provides an interface for the interpretation of instructions
that could load values of a call site parameters in order to generate DWARF
about the call site parameters.
([13/13] Introduce the debug entry values.)
Co-authored-by: Ananth Sowda <asowda@cisco.com>
Co-authored-by: Nikola Prica <nikola.prica@rt-rk.com>
Co-authored-by: Ivan Baev <ibaev@cisco.com>
Differential Revision: https://reviews.llvm.org/D60716
llvm-svn: 365467
Summary:
This makes it so that IR files using triples without an environment work
out of the box, without normalizing them.
Typically, the MSVC behavior is more desirable. For example, it tends to
enable things like constant merging, use of associative comdats, etc.
Addresses PR42491
Reviewers: compnerd
Subscribers: hiraditya, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D64109
llvm-svn: 365387
This can help avoid a copy or enable load folding.
On SSE4.1 targets we can commute it to blendi instead.
I had to make shufpd with a 0x02 immediate commutable as well
since we expect commuting to be reversible.
llvm-svn: 365292
These patterns are the same as the MOVLPDmr and MOVHPDmr patterns,
but with a bitcast at the end. We can just select the PD instruction
and let execution domain fixing switch to PS.
llvm-svn: 365267
These narrow the load so we can only do it if the load isn't
volatile.
There also tests in vector-shuffle-128-v4.ll that this should
support, but we don't seem to fold bitcast+load on pre-sse4.2
targets due to the slow unaligned mem 16 flag.
llvm-svn: 365266
The Size either needs to be 0 meaning we aren't folding
a stack reload. Or the stack slot needs to be at least
16 bytes. I've also added a paranoia check ensure the
RCSize is at leat 16 bytes as well. This avoids any
FR32/FR64 surprises, but I think we already filtered
those earlier.
All of our test case have Size as either 0 or 16 and
RCSize == 16. So the Size <= 16 check worked for those
cases.
llvm-svn: 365234
These patterns use 128-bit loads, but the instructions only load
64-bits. We shouldn't narrow the load if its volatile.
Fixes another variant of PR42079
llvm-svn: 365225
This was identical to a pattern for MOVPQI2QImr with a bitcast
as an input. But we should be able to turn MOVPQI2QImr into
MOVLPSmr in the execution domain fixup pass so we shouldn't
need this.
llvm-svn: 365224
Revision r365061 changed a skip of debug instructions for a skip
of meta instructions. This is not safe, as IMPLICIT_DEF is classed
as a meta instruction.
llvm-svn: 365202
Revision r365061 changed a skip of debug instructions for a skip
of meta instructions. This is not safe, as IMPLICIT_DEF is classed
as a meta instruction.
llvm-svn: 365198
Summary:
We attempt to prevent folding immediates with multiple users under optsize. But we only do this from store nodes and X86ISD::ADD/SUB/XOR/OR/AND patterns. We don't do it for ISD::ADD/SUB/XOR/OR/AND even though we count them as users when deciding whether to fold into other nodes. This leads to situations where we block folding to a compare for example, but still fold into an AND or OR as seen in PR27202.
Unfortunately touching the isel patterns in tablegen for the ISD::ADD/SUB/XOR/OR/AND opcodes will cause the patterns to be unusable for fast isel. And we don't have a way to make a fast isel only pattern.
To workaround this, this patch adds custom isel in front of the isel table that will select the non-immediate forms if the immediate has additional users. This may create some issues for ANDN and NOT matching. And there's room for improvement with unsigned 32 immediates on 64-bit AND.
This patch needs more thorough test cases, but I wanted to get feedback on the direction. Please send me any other test cases you've seen in the wild.
I think we probably have the same issue with the immediate matching when we fold RMW from X86ISD::ADD/SUB/XOR/OR/AND. And our TEST immedaite shrinking logic. Our cost modeling for immediates that can fit in a sign extended 8-bit immediate on a 16/32/64 bit operation is completely wrong.
I also wonder if we should update the ConstantHoisting cost model and block folding for "opaque" constants. But of course constants can still be created by DAG combine and lowering optimizations.
Fixes PR27202
Reviewers: spatel, RKSimon, andreadb
Reviewed By: RKSimon
Subscribers: jsji, hiraditya, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59909
llvm-svn: 365163
Bitcast v4i32 to v8f32 and back again - it might be worth adding isel patterns for X86PShufd v8i32 on AVX1 targets like we did for X86Blendi to avoid the bitcasts?
llvm-svn: 365125
This patch generalizes the fix in D61680 to ignore all meta instructions,
not just debug info.
Patch by Chris Dawson.
Differential Revision: https://reviews.llvm.org/D62605
llvm-svn: 365061
We previously marked all the tests with branch funnels as
`-verify-machineinstrs=0`.
This is an attempt to fix it.
1) `ICALL_BRANCH_FUNNEL` has no defs. Mark it as `let OutOperandList =
(outs)`
2) After that we hit an assert: ``` Assertion failed: (Op.getValueType()
!= MVT::Other && Op.getValueType() != MVT::Glue && "Chain and glue
operands should occur at end of operand list!"), function AddOperand,
file
/Users/francisvm/llvm/llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp,
line 461. ```
The chain operand was added at the beginning of the operand list. Move
that to the end.
3) After that we hit another verifier issue in the pseudo expansion
where the registers used in the cmps and jmps are not added to the
livein lists. Add the `EFLAGS` to all the new MBBs that we create.
PR39436
Differential Review: https://reviews.llvm.org/D54155
llvm-svn: 365058
If we have more then 2 shuffle ops to combine, try to use combineX86ShuffleChainWithExtract to see if some are from the same super vector.
llvm-svn: 365050
iff the number of elements doesn't change.
This gets around an issue with combineX86ShuffleChain not being able to hint which domain is preferred for shuffles that can be done with either.
Fixes regression introduced in rL365041
llvm-svn: 365044
This better accounts for the cost/benefit of removing extract_subvectors from the shuffle and will be more useful in future patches.
The vpermq predicate regression will be fixed shortly.
llvm-svn: 365041