This patch replaces the separate APInts for KnownZero/KnownOne with a single KnownBits struct. This is similar to what was done to ValueTracking's version recently.
This is largely a mechanical transformation from KnownZero to Known.Zero.
Differential Revision: https://reviews.llvm.org/D32569
llvm-svn: 301620
This patch uses various APInt methods to reduce the number of temporary APInts. These were all found while working through converting SelectionDAG's computeKnownBits to also use the KnownBits struct recently added to the ValueTracking version.
llvm-svn: 301618
This reverts commit r301105, 4, 3 and 1, as a follow up of the previous
revert, which broke even more bots.
For reference:
Revert "[APInt] Use operator<<= where possible. NFC"
Revert "[APInt] Use operator<<= instead of shl where possible. NFC"
Revert "[APInt] Use ashInPlace where possible."
PR32754.
llvm-svn: 301111
immediate operands.
This commit adds an AArch64 dag-combine that optimizes code generation
for logical instructions taking immediate operands. The optimization
uses demanded bits to change a logical instruction's immediate operand
so that the immediate can be folded into the immediate field of the
instruction.
This recommits r300932 and r300930, which was causing dag-combine to
loop forever. The problem was that optimizeLogicalImm was returning
true even when there was no change to the immediate node (which happened
when the immediate was all zeros or ones), which caused dag-combine to
push and pop the same node to the work list over and over again without
making any progress.
This commit fixes the bug by returning false early in optimizeLogicalImm
if the immediate is all zeros or ones. Also, it changes the code to
compare the immediate with 0 or Mask rather than calling
countPopulation.
rdar://problem/18231627
Differential Revision: https://reviews.llvm.org/D5591
llvm-svn: 301019
It seems that r300930 was creating an infinite loop in dag-combine when
compling the following file:
MultiSource/Benchmarks/MiBench/consumer-typeset/z21.c
llvm-svn: 300940
immediate operands.
This commit adds an AArch64 dag-combine that optimizes code generation
for logical instructions taking immediate operands. The optimization
uses demanded bits to change a logical instruction's immediate operand
so that the immediate can be folded into the immediate field of the
instruction.
This recommits r300913, which broke bots because I didn't fix a call to
ShrinkDemandedConstant in SIISelLowering.cpp after changing the APIs of
TargetLoweringOpt and TargetLowering.
rdar://problem/18231627
Differential Revision: https://reviews.llvm.org/D5591
llvm-svn: 300930
immediate operands.
This commit adds an AArch64 dag-combine that optimizes code generation
for logical instructions taking immediate operands. The optimization
uses demanded bits to change a logical instruction's immediate operand
so that the immediate can be folded into the immediate field of the
instruction.
rdar://problem/18231627
Differential Revision: https://reviews.llvm.org/D5591
llvm-svn: 300913
getSignBit is a static function that creates an APInt with only the sign bit set. getSignMask seems like a better name to convey its functionality. In fact several places use it and then store in an APInt named SignMask.
Differential Revision: https://reviews.llvm.org/D32108
llvm-svn: 300856
This patch uses lshrInPlace to replace code where the object that lshr is called on is being overwritten with the result.
This adds an lshrInPlace(const APInt &) version as well.
Differential Revision: https://reviews.llvm.org/D32155
llvm-svn: 300566
Our 16 bit support is assembler-only + the terrible hack that is
.code16gcc. Simply using 32 bit registers does the right thing for the
latter.
Fixes PR32681.
llvm-svn: 300429
Summary:
In PR32594, inline assembly using the 'A' constraint on x86_64 causes
llvm to crash with a "Cannot select" stack trace. This is because
`X86TargetLowering::getRegForInlineAsmConstraint` hardcodes that 'A'
means the EAX and EDX registers.
However, on x86_64 it means the RAX and RDX registers, and on 16-bit x86
(ia16?) it means the old AX and DX registers.
Add new register classes in `X86RegisterInfo.td` to support these cases,
and amend the logic in `getRegForInlineAsmConstraint` to cope with
different subtargets. Also add a test case, derived from PR32594.
Reviewers: craig.topper, qcolombet, RKSimon, ab
Reviewed By: ab
Subscribers: ab, emaste, royger, llvm-commits
Differential Revision: https://reviews.llvm.org/D31902
llvm-svn: 300404
From a user prospective, it forces the use of an annoying nullptr to mark the end of the vararg, and there's not type checking on the arguments.
The variadic template is an obvious solution to both issues.
Differential Revision: https://reviews.llvm.org/D31070
llvm-svn: 299949
Module::getOrInsertFunction is using C-style vararg instead of
variadic templates.
From a user prospective, it forces the use of an annoying nullptr
to mark the end of the vararg, and there's not type checking on the
arguments. The variadic template is an obvious solution to both
issues.
llvm-svn: 299925
Module::getOrInsertFunction is using C-style vararg instead of
variadic templates.
From a user prospective, it forces the use of an annoying nullptr
to mark the end of the vararg, and there's not type checking on the
arguments. The variadic template is an obvious solution to both
issues.
Patch by: Serge Guelton <serge.guelton@telecom-bretagne.eu>
Differential Revision: https://reviews.llvm.org/D31070
llvm-svn: 299699
Before r294774, there was a problem when lowering broadcasts to use
128-bit subvectors.
When we looked through a bitcast to find the broadcast input, we'd keep
using the original type, so you'd end up with things like:
(v8f32 (broadcast
(v4f32 (extract_subvector
(v8i32 V),
...))
))
r294774 fixed it to always emit subvectors with the scalar type of the
original source.
It also introduced some asserts, to check that we use scalars with
the same size, and vectors with the same number of elements.
The scalar size equality is checked earlier when looking through bitcasts,
and is a useful assert.
However, the number of elements don't have to be identical: we're always
going to extract a 128-bit subvector, and we can have different size
inputs if we looked through a concat_vector to find a 256-bit source.
Relax the overzealous assert.
Replace it with a check of the original source vector being 256 or 512
bits. If it's 128 bits, we can't extract_subvector from it.
Fixes PR32371.
llvm-svn: 299490
https://reviews.llvm.org/D30537 / https://reviews.llvm.org/rL296977 added these transforms
and other related transforms to the generic DAGCombiner (with a hook that x86 sets to true),
so these patterns should not exist by the time we reach the target-specific combiner hook.
llvm-svn: 299448
PSADBW pattern currently supports the 32 bit IR pattern and only GLT (greather than) comparison.
The patch extends the pattern to catch also 64 bit IR pattern and includes all other comparison types (not only GLT).
Differential Revision: https://reviews.llvm.org/D31577
llvm-svn: 299425
It can be costly to transfer from the gprs to the xmm registers and can prevent loads merging.
This patch splits vXi16/vXi32/vXi64 BUILD_VECTORS that use the same operand in multiple elements into a BUILD_VECTOR with only a single insertion of each of those elements and then performs an unary shuffle to duplicate the values.
There are a couple of minor regressions this patch unearths due to some missing MOVDDUP/BROADCAST folds that I will address in a future patch.
Note: Now that vector shuffle lowering and combining is pretty good we should be reusing that instead of duplicating so much in LowerBUILD_VECTOR - this is the first of several patches to address this.
Differential Revision: https://reviews.llvm.org/D31373
llvm-svn: 299387
The x86_64 ABI requires that the stack is 16 byte aligned on function calls. Thus, the 8-byte error code, which is pushed by the CPU for certain exceptions, leads to a misaligned stack. This results in bugs such as Bug 26413, where misaligned movaps instructions are generated.
This commit fixes the misalignment by adjusting the stack pointer in these cases. The adjustment is done at the beginning of the prologue generation by subtracting another 8 bytes from the stack pointer. These additional bytes are popped again in the function epilogue.
Fixes Bug 26413
Patch by Philipp Oppermann.
Differential Revision: https://reviews.llvm.org/D30049
llvm-svn: 299383
This moves the isMask and isShiftedMask functions to be class methods. They now use the MathExtras.h function for single word size and leading/trailing zeros/ones or countPopulation for the multiword size. The previous implementation made multiple temorary memory allocations to do the bitwise arithmetic operations to match the MathExtras.h implementation.
Differential Revision: https://reviews.llvm.org/D31565
llvm-svn: 299362
Our _MM_HINT_T0/T1 constant values are 3/2 which matches gcc, but not icc or Intel documentation. Interestingly gcc had this same bug on their implementation of the gather/scatter builtins at one point too.
Fixes PR32411.
llvm-svn: 299234
Currently ComputeNumSignBits returns the minimum number of sign bits for all elements of vector data, when we may only be interested in one/some of the elements.
This patch adds a DemandedElts argument that allows us to specify the elements we actually care about. The original ComputeNumSignBits implementation calls with a DemandedElts demanding all elements to match current behaviour. Scalar types set this to 1.
I've only added support for BUILD_VECTOR and EXTRACT_VECTOR_ELT so far, all others will default to demanding all elements but can be updated in due course.
Followup to D25691.
Differential Revision: https://reviews.llvm.org/D31311
llvm-svn: 299219
Follow up to D25691, this sets up the plumbing necessary to support vector demanded elements support in known bits calculations in target nodes.
Differential Revision: https://reviews.llvm.org/D31249
llvm-svn: 299201
We currently perform the various fp_to_sint XMM conversion and then transfer to the MMX register (on 32-bit via the stack).
This patch improves support for MOVDQ2Q XMM to MMX transfers and adds the XMM->MMX fp_to_sint direct conversion patterns. The SSE2 specifications are the same as for XMM->XMM and XMM->MMX rounding/exceptions/etc.
Differential Revision: https://reviews.llvm.org/D30868
llvm-svn: 298943
This is a patch for an on-going bugzilla bug 21281 on the generated X86 code for a matrix transpose8x8 subroutine which requires vector interleaving. The generated code in AVX2 is currently non-optimal and requires 60 instructions as opposed to only 40 instructions generated for AVX1.
The patch includes a fix for the AVX2 case where vector unpack instructions use less operations than the vector blend operations available in AVX2.
In this case using vector unpack instructions is more efficient.
Reviewers:
zvi
delena
igorb
craig.topper
guyblank
eladcohen
m_zuckerman
aymanmus
RKSimon
llvm-svn: 298840
Patch to generalize combinePCMPAnd1 (for handling SETCC + ZEXT cases) to work for any input that has zero/all bits set masked with an 'all low bits' mask.
Replaced the implicit assumption of shift availability with a call to SupportedVectorShiftWithImm.
Part 1 of 3.
Differential Revision: https://reviews.llvm.org/D31347
llvm-svn: 298779
This is the payoff for D31156 - if a target has efficient comparison instructions for vector-sized equality,
we can replace memcmp calls with inline code that is both smaller and faster.
Differential Revision: https://reviews.llvm.org/D31290
llvm-svn: 298775
Up until now, vpmovm2 instruction described its destination operand size
by the source operand size. This patch adds new pattern for the vpmovm2
instruction. The node describes new expansion of the destination (from
{128|256} to 512).
Differential Revision: https://reviews.llvm.org/D30654
llvm-svn: 298586
Summary:
This class is a list of AttributeSetNodes corresponding the function
prototype of a call or function declaration. This class used to be
called ParamAttrListPtr, then AttrListPtr, then AttributeSet. It is
typically accessed by parameter and return value index, so
"AttributeList" seems like a more intuitive name.
Rename AttributeSetImpl to AttributeListImpl to follow suit.
It's useful to rename this class so that we can rename AttributeSetNode
to AttributeSet later. AttributeSet is the set of attributes that apply
to a single function, argument, or return value.
Reviewers: sanjoy, javed.absar, chandlerc, pete
Reviewed By: pete
Subscribers: pete, jholewinski, arsenm, dschuff, mehdi_amini, jfb, nhaehnle, sbc100, void, llvm-commits
Differential Revision: https://reviews.llvm.org/D31102
llvm-svn: 298393
We could do better by splitting any oversized type into whatever vector size the target supports,
but I left that for future work if it ever comes up. The motivating case is memcmp() calls on 16-byte
structs, so I think we can wire that up with a TLI hook that feeds into this.
Differential Revision: https://reviews.llvm.org/D31156
llvm-svn: 298376
Make x86_64-fuchsia targets under -mcmodel=kernel use %gs rather
than %fs to access ABI slots for stack-protector and safe-stack
Patch by Roland McGrath.
Differential Revision: https://reviews.llvm.org/D30870
llvm-svn: 298302
Summary:
Currently we handle these intrinsics at isel with special patterns. But as they just map to normal logic operations, we should just handle them at lowering. This will expose them to DAG combine optimizations. Right now the kor-sequence test generates a bunch of regclass copies between GR16 and VK16 that the peephole optimizer and/or register coallescing are removing to keep everything in the mask domain. By handling the logic op intrinsics earlier, these copies become bitcasts in the DAG and get removed by DAG combine which seems more robust.
This should help enable my plan to stop copying between K registers and GR8/GR16. The peephole optimizer can't remove a chain of copies between K and GR32 with insert_subreg/extract_subreg present in the chain so the kor-sequence test break. But this patch should dodge the problem entirely.
Reviewers: zvi, delena, RKSimon, igorb
Reviewed By: igorb
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D31056
llvm-svn: 298228
The MIR printer dumps a string that describe the register mask of a function.
A static predefined list of register masks matches a static list of strings.
However when the register mask is not from the static predefined list, there is no descriptor string and the printer fails.
This patch adds support to custom register mask printing and dumping.
Also the list of callee saved registers (describing the registers that must be preserved for the caller) might be dynamic.
As such this data needs to be dumped and parsed back to the Machine Register Info.
Differential Revision: https://reviews.llvm.org/D30971
llvm-svn: 298207
As noted in the comment, we might want to account for this case,
but I didn't look at what that would mean for the asm.
I'm also not sure why this only reproduces with avx512, but I'm
putting a conservative fix in for now to avoid the crash.
Also, if both sides of an add are zexted, shouldn't we shrink that add?
https://bugs.llvm.org/show_bug.cgi?id=32316
llvm-svn: 298107
Reduced version of D26357 - based on the discussion on llvm-dev about canonicalization of UMIN/UMAX/SMIN/SMAX as well as ABS I've reduced that patch to just the ABS ISD node (with x86/sse support) to improve basic combines and lowering.
ARM/AArch64, Hexagon, PowerPC and NVPTX all have similar instructions allowing us to make this a generic opcode and move away from the hard coded tablegen patterns which makes it tricky to match more complex patterns.
At the moment this patch doesn't attempt legalization as we only create an ABS node if its legal/custom.
Differential Revision: https://reviews.llvm.org/D29639
llvm-svn: 297780
Each Calling convention (CC) defines a static list of registers that should be preserved by a callee function. All other registers should be saved by the caller.
Some CCs use additional condition: If the register is used for passing/returning arguments – the caller needs to save it - even if it is part of the Callee Saved Registers (CSR) list.
The current LLVM implementation doesn’t support it. It will save a register if it is part of the static CSR list and will not care if the register is passed/returned by the callee.
The solution is to dynamically allocate the CSR lists (Only for these CCs). The lists will be updated with actual registers that should be saved by the callee.
Since we need the allocated lists to live as long as the function exists, the list should reside inside the Machine Register Info (MRI) which is a property of the Machine Function and managed by it (and has the same life span).
The lists should be saved in the MRI and populated upon LowerCall and LowerFormalArguments.
The patch will also assist to implement future no_caller_saved_regsiters attribute intended for interrupt handler CC.
Differential Revision: https://reviews.llvm.org/D28566
llvm-svn: 297715
For AVX-512 we force the input to zero if the input is undef or the mask is all ones to break an execution dependency. This patch brings the same behavior to AVX2.
llvm-svn: 297652
We were already forcing undef inputs to become a zero vector, this now catches an all ones mask too.
Ideally we'd use undef and let execution dep fix handle picking the best register/clearance for the undef, but I don't think it can handle the early clobber today.
llvm-svn: 297651
The immediate should be 1 or 2, not 0 or 1. This was found while adding bounds checking to clang. In fact the existing clang builtin test failed if we ran it all the way to assembly.
llvm-svn: 297591
I noticed unnecessary 'sbb' instructions in D30472 and while looking at 'ptest' codegen recently.
This happens because we were transforming any 'setb' - even when we only wanted a single-bit result.
This patch moves those transforms under visitAdd/visitSub, so we we're only creating sbb/adc when it
is a win. I don't know why we need a SETCC_CARRY node type, but I'm not proposing to change that
existing behavior in this patch.
Also, I'm skeptical that sbb/adc are a win for all micro-arches, so I added comments to the test files
where this transform still fires.
The test changes here are all cases where we no longer produce sbb/adc. Avoiding partial register
stalls (generating an xor to clear a register) is not handled in some cases, but that's a separate
issue.
Differential Revision: https://reviews.llvm.org/D30611
llvm-svn: 297586
Without SSE41 (pextrb) we currently extract byte elements from a vector by spilling to stack and reloading the byte.
This patch is an initial attempt at using MOVD/PEXTRW to extract the relevant DWORD/WORD from the vector and then shift+truncate to collect the correct byte.
Extraction of multiple bytes this way would result in code bloat, but as explained in the patch we could probably afford to be more aggressive with the supported extractions before again falling back on spilling - possibly through counting the number of extracts and which DWORD/WORD they originate?
Differential Revision: https://reviews.llvm.org/D29841
llvm-svn: 297568
We currently have to insert bits via a temporary variable of the same size as the target with various shift/mask stages, resulting in further temporary variables, all of which require the allocation of memory for large APInts (MaskSizeInBits > 64).
This is another of the compile time issues identified in PR32037 (see also D30265).
This patch adds the APInt::insertBits() helper method which avoids the temporary memory allocation and masks/inserts the raw bits directly into the target.
Differential Revision: https://reviews.llvm.org/D30780
llvm-svn: 297458
Summary:
Loop alignment can cause a significant change of
the perfromance for short loops.
To be able to evaluate the impact of loop alignment this change
introduces the new option x86-experimental-pref-loop-alignment.
The alignment will be 2^Value bytes, the default value is 4.
Patch by Serguei Katkov!
Reviewers: craig.topper
Reviewed By: craig.topper
Subscribers: sanjoy, llvm-commits
Differential Revision: https://reviews.llvm.org/D30391
llvm-svn: 297178
Use the store size of the argument type, which will be a byte-sized
quantity, rather than dividing the size in bits by 8.
Fixes PR32136 and re-enables copy elision from i64 arguments.
Reverts the workaround in from r296950.
llvm-svn: 297045
As described on PR31712, we miss a variety of legalization combines because we lower these to X86ISD::VSEXT/VZEXT despite them having the same functionality. This patch makes 128-bit (SSE41) SIGN/ZERO_EXTEND_VECTOR_IN_REG ops legal, adds the necessary tablegen plumbing and uses a helper 'getExtendInVec' to decide when to use SIGN/ZERO_EXTEND_VECTOR_IN_REG or VSEXT/VZEXT.
We're missing a couple of shuffle combines that will be added in a future patch for review.
Later patches can then support the AVX2 cases as a mixture of SIGN/ZERO_EXTEND and SIGN/ZERO_EXTEND_VECTOR_IN_REG, and then finally deal with the AVX512 cases.
Differential Revision: https://reviews.llvm.org/D30549
llvm-svn: 296985
The larger goal is to move the ADC/SBB transforms currently in
combineX86SetCC() to combineAddOrSubToADCOrSBB() because we're
creating ADC/SBB in lots of places where we shouldn't.
This was intended to be an NFC change, but avx-512 has something
strange going on. It doesn't seem like any of the affected tests
should really be using SET+TEST or ADC; a simple ADD could replace
several instructions. But that's another bug...
llvm-svn: 296978
Long ago (2010 according to svn blame), combineShuffle probably needed to prevent the accidental creation of illegal i64 types but there doesn't appear to be any combines that can cause this any more as they all have their own legality checks.
Differential Revision: https://reviews.llvm.org/D30213
llvm-svn: 296966
This fixes cases where i1 types were not properly legalized yet and lead
to the creating of 0-sized stack slots.
This fixes http://llvm.org/PR32136
llvm-svn: 296950
The comments were wrong, and this is not an obvious transform.
This hopefully makes it clearer that we're missing the commuted
patterns for adds. It's less clear that this is actually a good
transform for all micro-arch.
This is prep work for trying to clean up the current adc/sbb
codegen because it's definitely not happening optimally.
llvm-svn: 296918
MMX extraction often ends up as extract_i32(bitcast_v2i32(extract_i64(bitcast_v1i64(x86mmx v), 0)), 0) which fails to simplify on 32-bit targets as i64 isn't legal
llvm-svn: 296782
Summary:
Avoids tons of prologue boilerplate when arguments are passed in memory
and left in memory. This can happen in a debug build or in a release
build when an argument alloca is escaped. This will dramatically affect
the code size of x86 debug builds, because X86 fast isel doesn't handle
arguments passed in memory at all. It only handles the x86_64 case of up
to 6 basic register parameters.
This is implemented by analyzing the entry block before ISel to identify
copy elision candidates. A copy elision candidate is an argument that is
used to fully initialize an alloca before any other possibly escaping
uses of that alloca. If an argument is a copy elision candidate, we set
a flag on the InputArg. If the the target generates loads from a fixed
stack object that matches the size and alignment requirements of the
alloca, the SelectionDAG builder will delete the stack object created
for the alloca and replace it with the fixed stack object. The load is
left behind to satisfy any remaining uses of the argument value. The
store is now dead and is therefore elided. The fixed stack object is
also marked as mutable, as it may now be modified by the user, and it
would be invalid to rematerialize the initial load from it.
Supersedes D28388
Fixes PR26328
Reviewers: chandlerc, MatzeB, qcolombet, inglorion, hans
Subscribers: igorb, llvm-commits
Differential Revision: https://reviews.llvm.org/D29668
llvm-svn: 296683
DAGCombiner already supports peeking thorough shuffles to improve vector element extraction, but legalization often leaves us in situations where we need to extract vector elements after shuffles have already been lowered.
This patch adds support for VECTOR_EXTRACT_ELEMENT/PEXTRW/PEXTRB instructions to attempt to handle target shuffles as well. I've covered some basic scenarios including handling shuffle mask scaling and the implicit zero-extension of PEXTRW/PEXTRB, there is more that could be done here (that I've mentioned in TODOs) but I haven't found many cases where its worth it.
Differential Revision: https://reviews.llvm.org/D30176
llvm-svn: 296381
Summary:
SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here use more than those number of elements and therefore cause a malloc.
APInt on the other hand supports up to 64 bits without a malloc. That's the maximum number of bits we need here so we can avoid a malloc for all cases by using APInt.
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30392
llvm-svn: 296355
Summary:
SmallBitVector uses a malloc for more than 58 bits on a 64-bit target and more than 27 bits on a 32-bit target. Some of the vector types we deal with here use more than those number of elements and therefore cause a malloc.
APInt on the other hand supports up to 64 bits without a malloc. That's the maximum number of bits we need here so we can avoid a malloc for all cases by using APInt.
Reviewers: RKSimon
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D30390
llvm-svn: 296354
The current pattern for extract bits in range is typically:
Mask.lshr(BitOffset).trunc(SubSizeInBits);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.
This is another of the compile time issues identified in PR32037 (see also D30265).
This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.
Differential Revision: https://reviews.llvm.org/D30336
llvm-svn: 296272
The current pattern for extract bits in range is typically:
Mask.lshr(BitOffset).trunc(SubSizeInBits);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.
This is another of the compile time issues identified in PR32037 (see also D30265).
This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.
Differential Revision: https://reviews.llvm.org/D30336
llvm-svn: 296147
The current pattern for extract bits in range is typically:
Mask.lshr(BitOffset).trunc(SubSizeInBits);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation of memory for the temporary variable.
This is another of the compile time issues identified in PR32037 (see also D30265).
This patch adds the APInt::extractBits() helper method which avoids the temporary memory allocation.
Differential Revision: https://reviews.llvm.org/D30336
llvm-svn: 296141
Noticed while profiling PR32037, the target shuffle ops were being stored in SmallVector<*,8> types but the combiner could store as many as 16 ops at maximum depth (2 per depth).
llvm-svn: 296130
The current pattern for setting bits in range is typically:
Mask |= APInt::getBitsSet(MaskSizeInBits, LoPos, HiPos);
Which can be particularly slow for large APInts (MaskSizeInBits > 64) as they require the allocation memory for the temporary variable.
This is one of the key compile time issues identified in PR32037.
This patch adds the APInt::setBits() helper method which avoids the temporary memory allocation completely, this first implementation uses setBit() internally instead but already significantly reduces the regression in PR32037 (~10% drop). Additional optimization may be possible.
I investigated whether there is need for APInt::clearBits() and APInt::flipBits() equivalents but haven't seen these patterns to be particularly common, but reusing the code would be trivial.
Differential Revision: https://reviews.llvm.org/D30265
llvm-svn: 296102
The Fuchsia ABI defines slots from the thread pointer where the
stack-guard value for stack-protector, and the unsafe stack pointer
for safe-stack, are stored. This parallels the Android ABI support.
Patch by Roland McGrath
Differential Revision: https://reviews.llvm.org/D30237
llvm-svn: 296081
Minor optimization, don't create temporary mask APInts that are just going to be OR'd into the accumulate masks - insert directly instead.
llvm-svn: 295848
This patch introduces new X86ISD::FMAXS and X86ISD::FMINS opcodes. The legacy intrinsics now lower to this node. As do the AVX-512 masked intrinsics when the rounding mode is CUR_DIRECTION.
I've merged a copy of the tablegen multiclass avx512_fp_scalar into avx512_fp_scalar_sae. avx512_fp_scalar still needs to support CUR_DIRECTION appearing as a rounding mode for X86ISD::FADD_ROUND and others.
Differential revision: https://reviews.llvm.org/D30186
llvm-svn: 295810
Summary:
Rework the code that was sinking/duplicating (icmp and, 0) sequences
into blocks where they were being used by conditional branches to form
more tbz instructions on AArch64. The new code is more general in that
it just looks for 'and's that have all icmp 0's as users, with a target
hook used to select which subset of 'and' instructions to consider.
This change also enables 'and' sinking for X86, where it is more widely
beneficial than on AArch64.
The 'and' sinking/duplicating code is moved into the optimizeInst phase
of CodeGenPrepare, where it can take advantage of the fact the
OptimizeCmpExpression has already sunk/duplicated any icmps into the
blocks where they are used. One minor complication from this change is
that optimizeLoadExt needed to be updated to always mark 'and's it has
determined should be in the same block as their feeding load in the
InsertedInsts set to avoid an infinite loop of hoisting and sinking the
same 'and'.
This change fixes a regression on X86 in the tsan runtime caused by
moving GVNHoist to a later place in the optimization pipeline (see
PR31382).
Reviewers: t.p.northover, qcolombet, MatzeB
Subscribers: aemerson, mcrosier, sebpop, llvm-commits
Differential Revision: https://reviews.llvm.org/D28813
llvm-svn: 295746
This matches what is already done during shuffle lowering and helps prevent the need for a zero-vector in cases where shuffles match both patterns.
llvm-svn: 295723
Pull out repeated code for extraction index operand and source vector value type.
Use isNullConstant helper to check for zero extraction index.
llvm-svn: 295670
Its more profitable to go through memory (1 cycles throughput)
than using VMOVD + VPERMV/PSHUFB sequence ( 2/3 cycles throughput) to implement EXTRACT_VECTOR_ELT with variable index.
IACA tool was used to get performace estimation (https://software.intel.com/en-us/articles/intel-architecture-code-analyzer)
For example for var_shuffle_v16i8_v16i8_xxxxxxxxxxxxxxxx_i8 test from vector-shuffle-variable-128.ll I get 26 cycles vs 79 cycles.
Removing the VINSERT node, we don't need it any more.
Differential Revision: https://reviews.llvm.org/D29690
llvm-svn: 295660
Replaces existing approach that could only search BUILD_VECTOR nodes.
Requires getTargetConstantBitsFromNode to discriminate cases with all/partial UNDEF bits in each element - this should also be useful when we get around to supporting getTargetShuffleMaskIndices with UNDEF elements.
llvm-svn: 295613
As discussed on D27692, this permits another domain to be used to combine a shuffle at high depths.
We currently set the required depth at 4 or more combined shuffles, this is probably too high for most targets but is a good starting point and already helps avoid a number of costly variable shuffles.
llvm-svn: 295608
Add the infrastructure to flag whether float and/or int domains are permitable.
A future patch will enable domain crossing based off shuffle depth and the value types of the source vectors.
llvm-svn: 295604
Minor performance speedup - if any call to getShuffleScalarElt fails to get a result, don't both calling for the remaining elements as EltsFromConsecutiveLoads will fail anyhow.
llvm-svn: 295235
Summary:
We don't seem to have great rules on what a valid VBROADCAST node looks like. And as a consequence we end up with a lot of patterns to try to catch everything. We have patterns with scalar inputs, 128-bit vector inputs, 256-bit vector inputs, and 512-bit vector inputs.
As you can see from the things improved here we are currently missing patterns for 128-bit loads being extended to 256-bit before the vbroadcast.
I'd like to propose that VBROADCAST should always take a 128-bit vector type as input. As a first step towards that this patch adds an EXTRACT_SUBVECTOR in front of VBROADCAST when the input is 256 or 512-bits. In the future I would like to add scalar_to_vector around all the scalar operations. And maybe we should consider adding a VBROADCAST+load node to avoid separating loads from the broadcasting operation when the load itself isn't foldable.
This requires an additional change in target shuffle combining to look for the extract subvector and look through it to find the original operand. I'm sure this change isn't perfect but was enough to fix a few test failures that were being caused.
Another interesting thing I noticed is that the changes in masked_gather_scatter.ll show cases were we don't remove a useless insert into element 1 before broadcasting element 0.
Reviewers: delena, RKSimon, zvi
Reviewed By: zvi
Subscribers: igorb, llvm-commits
Differential Revision: https://reviews.llvm.org/D28747
llvm-svn: 295155