Removed AssertZext node, which was inserted between X86ISD::SETCC and "truncate to i1".
Differential Revision: https://reviews.llvm.org/D22850
llvm-svn: 277289
Up until now, we only had code to match PSADBW patterns that look like what
comes out of the loop vectorizer - a partial reduction inside the loop body
that gets fed into a horizontal operation in a different basic block.
This adds support for straight-line patterns, like those generated by the
SLP vectorizer.
Differential Revision: https://reviews.llvm.org/D22889
llvm-svn: 277219
Support for lowering to VBROADCASTF128 etc. in D22460 was not correctly ensuring that the only users of the 128-bit vector load were the insertions of the vector into the lower/upper subvectors.
llvm-svn: 277214
We currently default to using either generic shuffles or MASK+PACKUS/PACKSS to truncate all integer vectors. For vector comparisons, we know that the result will be either all or zero bits in every element, which can be efficiently truncated by directly using PACKSS to repeatedly halve the size of each element.
Due to the limited input values (-1 or 0) we don't need to account for vector element size, so for simplicity we just use the PACKSS(vXi16,vXi16) implementation in all cases. Additionally for AVX2 PACKSS of 256bit data we must perform a PERMQ shuffle to reorder the data into the correct order. I did investigate performing a single shuffle after all the PACKSS calls but the need to cross 128bit lanes makes this difficult to achieve efficiently.
We avoid performing this on AVX512 as it should have better alternative truncation instructions.
Differential Revision: https://reviews.llvm.org/D22814
llvm-svn: 277132
This places the 132/213/231 form number in front of the SS/SD/PS/PD. Move the Y for 256-bit versions to be after the PS/PD. Change the AVX512 scalar forms to include a Z in the their name. This new format should be consistent with the general naming of instructions.
llvm-svn: 276559
As reported on PR26235, we don't currently make use of the VBROADCASTF128/VBROADCASTI128 instructions (or the AVX512 equivalents) to load+splat a 128-bit vector to both lanes of a 256-bit vector.
This patch enables lowering from subvector insertion/concatenation patterns and auto-upgrades the llvm.x86.avx.vbroadcastf128.pd.256 / llvm.x86.avx.vbroadcastf128.ps.256 intrinsics to match.
We could possibly investigate using VBROADCASTF128/VBROADCASTI128 to load repeated constants as well (similar to how we already do for scalar broadcasts).
Reapplied with fix for PR28657 - removed intrinsic definitions (clang companion patch to be be submitted shortly).
Differential Revision: https://reviews.llvm.org/D22460
llvm-svn: 276416
Under normal circumstances we prefer the higher performance MOVD to extract the 0'th element of a v8i16 vector instead of PEXTRW.
But as detailed on PR27265, this prevents the SSE41 implementation of PEXTRW from folding the store of the 0'th element. Additionally it prevents us from making use of the fact that the (SSE2) reg-reg version of PEXTRW implicitly zero-extends the i16 element to the i32/i64 destination register.
This patch only preferentially lowers to MOVD if we will not be zero-extending the extracted i16, nor prevent a store from being folded (on SSSE41).
Fix for PR27265.
Differential Revision: https://reviews.llvm.org/D22509
llvm-svn: 276289
As requested on D22509, I've pulled out the v8i16 extraction lowering as the SSE41 and pre-SSE41 implementations are effectively the same.
llvm-svn: 276285
As reported on PR26235, we don't currently make use of the VBROADCASTF128/VBROADCASTI128 instructions (or the AVX512 equivalents) to load+splat a 128-bit vector to both lanes of a 256-bit vector.
This patch enables lowering from subvector insertion/concatenation patterns and auto-upgrades the llvm.x86.avx.vbroadcastf128.pd.256 / llvm.x86.avx.vbroadcastf128.ps.256 intrinsics to match.
We could possibly investigate using VBROADCASTF128/VBROADCASTI128 to load repeated constants as well (similar to how we already do for scalar broadcasts).
Differential Revision: https://reviews.llvm.org/D22460
llvm-svn: 276281
This patch adds costs for the vectorized implementations of CTPOP, the default values were seriously underestimating the cost of these and was encouraging vectorization on targets where serialized use of POPCNT would be much better.
Differential Revision: https://reviews.llvm.org/D22456
llvm-svn: 276104
The following condition expression ( a >> n) & 1 is converted to "bt a, n" instruction. It works on all intel targets.
But on AVX-512 it was broken because the expression is modified to (truncate (a >>n) to i1).
I added the new sequence (truncate (a >>n) to i1) to the BT pattern.
Differential Revision: https://reviews.llvm.org/D22354
llvm-svn: 275950
Summary:
Instead, we take a single flags arg (a bitset).
Also add a default 0 alignment, and change the order of arguments so the
alignment comes before the flags.
This greatly simplifies many callsites, and fixes a bug in
AMDGPUISelLowering, wherein the order of the args to getLoad was
inverted. It also greatly simplifies the process of adding another flag
to getLoad.
Reviewers: chandlerc, tstellarAMD
Subscribers: jholewinski, arsenm, jyknight, dsanders, nemanjai, llvm-commits
Differential Revision: http://reviews.llvm.org/D22249
llvm-svn: 275592
As discussed on PR28136, lowerShuffleAsRepeatedMaskAndLanePermute was attempting to match repeated masks at the 128-bit level and then permute the resultant lanes at the 128-bit (AVX1) or 64-bit (AVX2) sub-lane level.
This change allows us to create the repeated masks at the sub-lane level (and then concat them together to create a 128-bit repeated mask) and then select which sub-lane to permute. This has no effect on the AVX1 codegen.
Fixes PR28136.
llvm-svn: 275543
This improves the situation discussed in D19228 where we were forcing VPERMPD/VPERMQ where VPERM2F128/VPERM2I128 would have been better.
This was incorrectly reverted in rL275421 during triage of PR28552.
llvm-svn: 275497
This improves the situation discussed in D19228 where we were forcing VPERMPD/VPERMQ where VPERM2F128/VPERM2I128 would have been better.
llvm-svn: 275411
Primarily this is to allow blend with zero instead of having to use vperm2f128, but we can use this in the future to deal with AVX512 cases where we need to keep the original element size to correctly fold masked operations.
llvm-svn: 275406
We know that pcmp produces all-ones/all-zeros bitmasks, so we can use that behavior to avoid unnecessary constant loading.
One could argue that load+and is actually a better solution for some CPUs (Intel big cores) because shifts don't have the
same throughput potential as load+and on those cores, but that should be handled as a CPU-specific later transformation if
it ever comes up. Removing the load is the more general x86 optimization. Note that the uneven usage of vpbroadcast in the
test cases is filed as PR28505:
https://llvm.org/bugs/show_bug.cgi?id=28505
Differential Revision: http://reviews.llvm.org/D22225
llvm-svn: 275276
At present the only shuffle with a variable mask we recognise is PSHUFB, which influences if its worth the cost of mask creation/loading of a combined target shuffle with a variable mask. This change sets up the infrastructure to support other shuffles in the future but has no effect yet.
llvm-svn: 275059
Calls to matchVectorShuffleAsInsertPS only need to ensure the inputs are 128-bit vectors. Only lowerVectorShuffleAsInsertPS needs to ensure that they are v4f32.
llvm-svn: 275028
Until we have a better way to extract constants through bitcasted build vectors (and how to handle undefs of partial lanes etc.) at least accept build vectors that are all zeroes.
llvm-svn: 274833
We were checking for 2 insertions (which is caught earlier in the pattern matching loop) instead of the case where we have no insertions.
Turns out this code never fires as we always try to lower to insertps after trying to lower to blendps, which would catch these cases - I'm about to make some changes to support combining to insertps which could cause this to fire so I don't want to remove it.
llvm-svn: 274648
The patch removes redundant kmov instructions (not all, we still have a lot of work here) and redundant "and" instructions after "setcc".
I use "AssertZero" marker between X86ISD::SETCC node and "truncate" to eliminate extra "and $1" instruction.
I also changed zext, aext and trunc patterns in the .td file. It allows to remove extra "kmov" instruictions.
This patch fixes https://llvm.org/bugs/show_bug.cgi?id=28173.
Fast ISEL mode is not supported correctly for AVX-512. ICMP/FCMP scalar instruction should return result in k-reg. It will be fixed in one of the next patches. I redirected handling of "cmp" to the DAG builder mode. (The code looks worse in one specific test case, but without this fix the new patch fails).
Differential revision: http://reviews.llvm.org/D21956
llvm-svn: 274613
For the most part this simplifies all callers. There were two places in X86 that needed an explicit makeArrayRef to shorten a statically sized array.
llvm-svn: 274337
This is a mechanical change to make TargetLowering API take MachineInstr&
(instead of MachineInstr*), since the argument is expected to be a valid
MachineInstr. In one case, changed a parameter from MachineInstr* to
MachineBasicBlock::iterator, since it was used as an insertion point.
As a side effect, this removes a bunch of MachineInstr* to
MachineBasicBlock::iterator implicit conversions, a necessary step
toward fixing PR26753.
llvm-svn: 274287
This is mostly a mechanical change to make TargetInstrInfo API take
MachineInstr& (instead of MachineInstr* or MachineBasicBlock::iterator)
when the argument is expected to be a valid MachineInstr. This is a
general API improvement.
Although it would be possible to do this one function at a time, that
would demand a quadratic amount of churn since many of these functions
call each other. Instead I've done everything as a block and just
updated what was necessary.
This is mostly mechanical fixes: adding and removing `*` and `&`
operators. The only non-mechanical change is to split
ARMBaseInstrInfo::getOperandLatencyImpl out from
ARMBaseInstrInfo::getOperandLatency. Previously, the latter took a
`MachineInstr*` which it updated to the instruction bundle leader; now,
the latter calls the former either with the same `MachineInstr&` or the
bundle leader.
As a side effect, this removes a bunch of MachineInstr* to
MachineBasicBlock::iterator implicit conversions, a necessary step
toward fixing PR26753.
Note: I updated WebAssembly, Lanai, and AVR (despite being
off-by-default) since it turned out to be easy. I couldn't run tests
for AVR since llc doesn't link with it turned on.
llvm-svn: 274189
When lowering two blended PACKUS, we used to disregard the types
of the PACKUS inputs, indiscriminately generating a v16i8 PACKUS.
This leads to non-selectable things like:
(v16i8 (PACKUS (v4i32 v0), (v4i32 v1)))
Instead, check that the PACKUSes have the same type, and use that
as the final result type.
llvm-svn: 274138
The original implementation attempted to zero registers using
XOR %foo, %foo. This is problematic because it constitutes a
read-modify-write of a register which might not be defined.
Instead, use MOV32r0 to avoid these problems; expandPostRAPseudo does
the right thing here.
llvm-svn: 274024
AVX1 can only broadcast vectors as floats/doubles, so for 256-bit vectors we insert bitcasts if we are shuffling v8i32/v4i64 types. Unfortunately the presence of these bitcasts prevents the current broadcast lowering code from peeking through cases where we have concatenated / extracted vectors to create the 256-bit vectors.
This patch allows us to peek through bitcasts as long as the number of elements doesn't change (i.e. element bitwidth is the same) so the broadcast index is not affected.
Note this bitcast peek is different from the stage later on which doesn't care about the type and is just trying to find a load node.
As we're being more aggressive with bitcasts, we also need to ensure that the broadcast type is correctly bitcasted
Differential Revision: http://reviews.llvm.org/D21660
llvm-svn: 274013
This patch allows target shuffles to be combined to single input immediate permute instructions - (V)PSHUFD/VPERMILPD/VPERMILPS - allowing more general pattern matching than what we current do and improves the likelihood of memory folding compared to existing patterns which tend to reuse the input in multiple arguments.
Further permute instructions (V)PSHUFLW/(V)PSHUFHW/(V)PERMQ/(V)PERMPD may be added in the future but its proven tricky to create tests cases for them so far. (V)PSHUFLW/(V)PSHUFHW is already handled quite well in combineTargetShuffle so it may be that removing some of that code may allow us to perform more of the combining in one place without duplication.
Differential Revision: http://reviews.llvm.org/D21148
llvm-svn: 273999
AVX1 can only broadcast vectors as floats/doubles, so for 256-bit vectors we insert bitcasts if we are shuffling v8i32/v4i64 types. Unfortunately the presence of these bitcasts prevents the current broadcast lowering code from peeking through cases where we have concatenated / extracted vectors to create the 256-bit vectors.
This patch allows us to peek through bitcasts as long as the number of elements doesn't change (i.e. element bitwidth is the same) so the broadcast index is not affected.
Note this bitcast peek is different from the stage later on which doesn't care about the type and is just trying to find a load node.
Differential Revision: http://reviews.llvm.org/D21660
llvm-svn: 273848
The setCallee function will set the number of fixed arguments based
on the size of the argument list. The FixedArgs parameter was often
explicitly set to 0, leading to a lack of consistent value for non-
vararg functions.
Differential Revision: http://reviews.llvm.org/D20376
llvm-svn: 273403
Summary:
Fix the computation of the offsets present in the scopetable when using the
SEH (__except_handler4).
This patch added an intrinsic to track the position of the allocation on the
stack of the EHGuard. This position is needed when producing the ScopeTable.
```
struct _EH4_SCOPETABLE {
DWORD GSCookieOffset;
DWORD GSCookieXOROffset;
DWORD EHCookieOffset;
DWORD EHCookieXOROffset;
_EH4_SCOPETABLE_RECORD ScopeRecord[1];
};
struct _EH4_SCOPETABLE_RECORD {
DWORD EnclosingLevel;
long (*FilterFunc)();
union {
void (*HandlerAddress)();
void (*FinallyFunc)();
};
};
```
The code to generate the EHCookie is added in `X86WinEHState.cpp`.
Which is adding these instructions when using SEH4.
```
Lfunc_begin0:
# BB#0: # %entry
pushl %ebp
movl %esp, %ebp
pushl %ebx
pushl %edi
pushl %esi
subl $28, %esp
movl %ebp, %eax <<-- Loading FramePtr
movl %esp, -36(%ebp)
movl $-2, -16(%ebp)
movl $L__ehtable$use_except_handler4_ssp, %ecx
xorl ___security_cookie, %ecx
movl %ecx, -20(%ebp)
xorl ___security_cookie, %eax <<-- XOR FramePtr and Cookie
movl %eax, -40(%ebp) <<-- Storing EHGuard
leal -28(%ebp), %eax
movl $__except_handler4, -24(%ebp)
movl %fs:0, %ecx
movl %ecx, -28(%ebp)
movl %eax, %fs:0
movl $0, -16(%ebp)
calll _may_throw_or_crash
LBB1_1: # %cont
movl -28(%ebp), %eax
movl %eax, %fs:0
addl $28, %esp
popl %esi
popl %edi
popl %ebx
popl %ebp
retl
```
And the corresponding offset is computed:
```
Luse_except_handler4_ssp$parent_frame_offset = -36
.p2align 2
L__ehtable$use_except_handler4_ssp:
.long -2 # GSCookieOffset
.long 0 # GSCookieXOROffset
.long -40 # EHCookieOffset <<----
.long 0 # EHCookieXOROffset
.long -2 # ToState
.long _catchall_filt # FilterFunction
.long LBB1_2 # ExceptionHandler
```
Clang is not yet producing function using SEH4, but it's a work in progress.
This patch is a step toward having a valid implementation of SEH4.
Unfortunately, it is not yet fully working. The EH registration block is not
allocated at the right offset on the stack.
Reviewers: rnk, majnemer
Subscribers: llvm-commits, chrisha
Differential Revision: http://reviews.llvm.org/D21231
llvm-svn: 273281
Fix for PR27726 - sitofp i64 to fp128 was loading the merged load i64 to a x87 register preventing legalization for conversion to fp128.
Added 32-bit tests for fp128 cast/conversions.
llvm-svn: 273210
We currently only allow exact matches of shuffle mask patterns during target shuffle combining.
This patch relaxes this to permit SM_SentinelUndef in the combined shuffle to always be accepted as well as allowing exact matching of the SM_SentinelZero value.
I've adjusted some tests that were requiring exact shuffle masks to now include undef values.
Differential Revision: http://reviews.llvm.org/D21495
llvm-svn: 273119
For <N x i32> type mul, pmuludq will be used for targets without SSE41, which
often introduces many extra pack and unpack instructions in vectorized loop
body because pmuludq generates <N/2 x i64> type value. However when the operands
of <N x i32> mul are extended from smaller size values like i8 and i16, the type
of mul may be shrunk to use pmullw + pmulhw/pmulhuw instead of pmuludq, which
generates better code. For targets with SSE41, pmulld is supported so no
shrinking is needed.
Differential Revision: http://reviews.llvm.org/D20931
llvm-svn: 272694
Summary:
The "-Werror=enum-compare" shows that the statement is using two different enums:
enumeral mismatch in conditional expression: 'llvm::X86ISD::NodeType' vs 'llvm::ISD::NodeType'
A follow-up fix on D21235.
Reviewers: klimek
Subscribers: spatel, cfe-commits
Differential Revision: http://reviews.llvm.org/D21278
llvm-svn: 272539
This used to be free, copying and moving DebugLocs became expensive
after the metadata rewrite. Passing by reference eliminates a ton of
track/untrack operations. No functionality change intended.
llvm-svn: 272512
This patch is intended to solve:
https://llvm.org/bugs/show_bug.cgi?id=28044
By changing the definition of X86ISD::CMPP to use float types, we allow it to be created
and pass legalization for an SSE1-only target where v4i32 is not legal.
The motivational trail for this change includes:
https://llvm.org/bugs/show_bug.cgi?id=28001
and eventually makes this trigger:
http://reviews.llvm.org/D21190
Ie, after this step, we should be free to have Clang generate FP compare IR instead of x86
intrinsics for SSE C packed compare intrinsics. (We can auto-upgrade and remove the LLVM
sse.cmp intrinsics as a follow-up step.) Once we're generating vector IR instead of x86
intrinsics, a big pile of generic optimizations can trigger.
Differential Revision: http://reviews.llvm.org/D21235
llvm-svn: 272511
PSHUFB can speed up BITREVERSE of byte vectors by performing LUT on the low/high nibbles separately and ORing the results. Wider integer vector types are already BSWAP'd beforehand so also make use of this approach.
llvm-svn: 272477
These are byte shift instructions and it will make shuffle combining a lot more straightforward if we can assume a vXi8 vector of bytes so decoded shuffle masks match the return type's number of elements
llvm-svn: 272468
512-bit VPSLLDQ/VPSRLDQ can only be used for avx512bw targets so lowerVectorShuffleAsShift had to be adjusted to include the subtarget
llvm-svn: 272300
As suggested by clang-tidy's performance-unnecessary-copy-initialization.
This can easily hit lifetime issues, so I audited every change and ran the
tests under asan, which came back clean.
llvm-svn: 272126
Summary:
This patch is adding support for the MSVC buffer security check implementation
The buffer security check is turned on with the '/GS' compiler switch.
* https://msdn.microsoft.com/en-us/library/8dbf701c.aspx
* To be added to clang here: http://reviews.llvm.org/D20347
Some overview of buffer security check feature and implementation:
* https://msdn.microsoft.com/en-us/library/aa290051(VS.71).aspx
* http://www.ksyash.com/2011/01/buffer-overflow-protection-3/
* http://blog.osom.info/2012/02/understanding-vs-c-compilers-buffer.html
For the following example:
```
int example(int offset, int index) {
char buffer[10];
memset(buffer, 0xCC, index);
return buffer[index];
}
```
The MSVC compiler is adding these instructions to perform stack integrity check:
```
push ebp
mov ebp,esp
sub esp,50h
[1] mov eax,dword ptr [__security_cookie (01068024h)]
[2] xor eax,ebp
[3] mov dword ptr [ebp-4],eax
push ebx
push esi
push edi
mov eax,dword ptr [index]
push eax
push 0CCh
lea ecx,[buffer]
push ecx
call _memset (010610B9h)
add esp,0Ch
mov eax,dword ptr [index]
movsx eax,byte ptr buffer[eax]
pop edi
pop esi
pop ebx
[4] mov ecx,dword ptr [ebp-4]
[5] xor ecx,ebp
[6] call @__security_check_cookie@4 (01061276h)
mov esp,ebp
pop ebp
ret
```
The instrumentation above is:
* [1] is loading the global security canary,
* [3] is storing the local computed ([2]) canary to the guard slot,
* [4] is loading the guard slot and ([5]) re-compute the global canary,
* [6] is validating the resulting canary with the '__security_check_cookie' and performs error handling.
Overview of the current stack-protection implementation:
* lib/CodeGen/StackProtector.cpp
* There is a default stack-protection implementation applied on intermediate representation.
* The target can overload 'getIRStackGuard' method if it has a standard location for the stack protector cookie.
* An intrinsic 'Intrinsic::stackprotector' is added to the prologue. It will be expanded by the instruction selection pass (DAG or Fast).
* Basic Blocks are added to every instrumented function to receive the code for handling stack guard validation and errors handling.
* Guard manipulation and comparison are added directly to the intermediate representation.
* lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
* lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
* There is an implementation that adds instrumentation during instruction selection (for better handling of sibbling calls).
* see long comment above 'class StackProtectorDescriptor' declaration.
* The target needs to override 'getSDagStackGuard' to activate SDAG stack protection generation. (note: getIRStackGuard MUST be nullptr).
* 'getSDagStackGuard' returns the appropriate stack guard (security cookie)
* The code is generated by 'SelectionDAGBuilder.cpp' and 'SelectionDAGISel.cpp'.
* include/llvm/Target/TargetLowering.h
* Contains function to retrieve the default Guard 'Value'; should be overriden by each target to select which implementation is used and provide Guard 'Value'.
* lib/Target/X86/X86ISelLowering.cpp
* Contains the x86 specialisation; Guard 'Value' used by the SelectionDAG algorithm.
Function-based Instrumentation:
* The MSVC doesn't inline the stack guard comparison in every function. Instead, a call to '__security_check_cookie' is added to the epilogue before every return instructions.
* To support function-based instrumentation, this patch is
* adding a function to get the function-based check (llvm 'Value', see include/llvm/Target/TargetLowering.h),
* If provided, the stack protection instrumentation won't be inlined and a call to that function will be added to the prologue.
* modifying (SelectionDAGISel.cpp) do avoid producing basic blocks used for inline instrumentation,
* generating the function-based instrumentation during the ISEL pass (SelectionDAGBuilder.cpp),
* if FastISEL (not SelectionDAG), using the fallback which rely on the same function-based implemented over intermediate representation (StackProtector.cpp).
Modifications
* adding support for MSVC (lib/Target/X86/X86ISelLowering.cpp)
* adding support function-based instrumentation (lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp, .h)
Results
* IR generated instrumentation:
```
clang-cl /GS test.cc /Od /c -mllvm -print-isel-input
```
```
*** Final LLVM Code input to ISel ***
; Function Attrs: nounwind sspstrong
define i32 @"\01?example@@YAHHH@Z"(i32 %offset, i32 %index) #0 {
entry:
%StackGuardSlot = alloca i8* <<<-- Allocated guard slot
%0 = call i8* @llvm.stackguard() <<<-- Loading Stack Guard value
call void @llvm.stackprotector(i8* %0, i8** %StackGuardSlot) <<<-- Prologue intrinsic call (store to Guard slot)
%index.addr = alloca i32, align 4
%offset.addr = alloca i32, align 4
%buffer = alloca [10 x i8], align 1
store i32 %index, i32* %index.addr, align 4
store i32 %offset, i32* %offset.addr, align 4
%arraydecay = getelementptr inbounds [10 x i8], [10 x i8]* %buffer, i32 0, i32 0
%1 = load i32, i32* %index.addr, align 4
call void @llvm.memset.p0i8.i32(i8* %arraydecay, i8 -52, i32 %1, i32 1, i1 false)
%2 = load i32, i32* %index.addr, align 4
%arrayidx = getelementptr inbounds [10 x i8], [10 x i8]* %buffer, i32 0, i32 %2
%3 = load i8, i8* %arrayidx, align 1
%conv = sext i8 %3 to i32
%4 = load volatile i8*, i8** %StackGuardSlot <<<-- Loading Guard slot
call void @__security_check_cookie(i8* %4) <<<-- Epilogue function-based check
ret i32 %conv
}
```
* SelectionDAG generated instrumentation:
```
clang-cl /GS test.cc /O1 /c /FA
```
```
"?example@@YAHHH@Z": # @"\01?example@@YAHHH@Z"
# BB#0: # %entry
pushl %esi
subl $16, %esp
movl ___security_cookie, %eax <<<-- Loading Stack Guard value
movl 28(%esp), %esi
movl %eax, 12(%esp) <<<-- Store to Guard slot
leal 2(%esp), %eax
pushl %esi
pushl $204
pushl %eax
calll _memset
addl $12, %esp
movsbl 2(%esp,%esi), %esi
movl 12(%esp), %ecx <<<-- Loading Guard slot
calll @__security_check_cookie@4 <<<-- Epilogue function-based check
movl %esi, %eax
addl $16, %esp
popl %esi
retl
```
Reviewers: kcc, pcc, eugenis, rnk
Subscribers: majnemer, llvm-commits, hans, thakis, rnk
Differential Revision: http://reviews.llvm.org/D20346
llvm-svn: 272053
We currently only combine to blend+zero if the target value type has 8 elements or less, but this was missing a lot of cases where the combined mask had been widened.
This change makes it so we use the combined mask to determine the blend value type, allowing us to catch more widened cases.
llvm-svn: 272003
Windows itanium is nearly identical to windows-msvc (MS ABI for C, itanium for
C++). Enable the TLS support for the target similar to the MSVC model.
llvm-svn: 271797
The AVX2 v16i16 shift lowering works by unpacking to 2 x v8i32, performing the shift and then truncating the result.
The unpacking is used to place the values in the upper 16-bits so that we can correctly sign-extend for SRA shifts. Unfortunately we weren't ensuring that the lower 16-bits were zero to ensure that SHL correctly shifts in zero bits.
llvm-svn: 271796
This patch begins adding support for lowering to the XOP VPERMIL2PD/VPERMIL2PS shuffle instructions - adding the X86ISD::VPERMIL2 opcode and cleaning up the usage.
The internal llvm intrinsics were assuming the shuffle mask operand was the same type as the float/double input operands (I guess to simplify the intrinsic definitions in X86InstrXOP.td to a single value type). These needed changing to integer types (matching the clang builtin and the AMD intrinsics definitions), an auto upgrade path is added to convert old calls.
Mask decoding/target shuffle support will be added in future patches.
Differential Revision: http://reviews.llvm.org/D20049
llvm-svn: 271633
Summary:
In PR29973 Sanjay Patel reported an assertion failure when a certain
loop was optimized, for a target without SSE2 support. It turned out
this was because of the AVG pattern detection introduced in rL253952.
Prevent the assertion failure by bailing out early in
`detectAVGPattern()`, if the target does not support SSE2.
Also add a minimized test case.
Reviewers: congh, eli.friedman, spatel
Subscribers: emaste, llvm-commits
Differential Revision: http://reviews.llvm.org/D20905
llvm-svn: 271548
Introduced in r271244, this is probably undefined behaviour and asserts when
compiled with Visual C++ debug mode.
On further note, the loop is quadratic with regard to the number of successors
since removeSuccessor is linear and could probably be modified to linear time.
llvm-svn: 271278
This adds support to the backed to actually support SjLj EH as an exception
model. This is *NOT* the default model, and requires explicitly opting into it
from the frontend. GCC supports this model and for MinGW can still be enabled
via the `--using-sjlj-exceptions` options.
Addresses PR27749!
llvm-svn: 271244
Most often as not this is what it started out as, the extraction is zero-cost on AVX and the PMOVZX/PMOVSX folding logic is based around 128-bit loads.
llvm-svn: 270858
This patch is a first step towards a more extendible method of matching combined target shuffle masks.
Initially this just pulls out the existing basic mask matches and adds support for some 256/512 bit equivalents. Future patterns will require a number of features to be added but I wanted to keep this patch simple.
I hope we can avoid duplication between shuffle lowering and combining and share more complex pattern match functions in future commits.
Differential Revision: http://reviews.llvm.org/D19198
llvm-svn: 270230
This refactors the logic in X86 to avoid code duplication. It also
splits it in two steps: it first decides if a symbol is local to the DSO
and then uses that information to decide how to access it.
The first part is implemented by shouldAssumeDSOLocal. It is not in any
way specific to X86. In a followup patch I intend to move it to
somewhere common and reused it in other backends.
llvm-svn: 270209
Since the calls don't return, the instruction afterwards will never run,
and is just taking up unnecessary space in the binary.
Differential Revision: http://reviews.llvm.org/D20406
llvm-svn: 270109
with an additional fix to make RegAllocFast ignore undef physreg uses. It would
previously get confused about the "push %eax" instruction's use of eax. That
method for adjusting the stack pointer is used in X86FrameLowering::emitSPUpdate
as well, but since that runs after register-allocation, we didn't run into the
RegAllocFast issue before.
llvm-svn: 269949
Summary:
MONITORX/MWAITX instructions provide similar capability to the MONITOR/MWAIT
pair while adding a timer function, such that another termination of the MWAITX
instruction occurs when the timer expires. The presence of the MONITORX and
MWAITX instructions is indicated by CPUID 8000_0001, ECX, bit 29.
The MONITORX and MWAITX instructions are intercepted by the same bits that
intercept MONITOR and MWAIT. MONITORX instruction establishes a range to be
monitored. MWAITX instruction causes the processor to stop instruction execution
and enter an implementation-dependent optimized state until occurrence of a
class of events.
Opcode of MONITORX instruction is "0F 01 FA". Opcode of MWAITX instruction is
"0F 01 FB". These opcode information is used in adding tests for the
disassembler.
These instructions are enabled for AMD's bdver4 architecture.
Patch by Ganesh Gopalasubramanian!
Reviewers: echristo, craig.topper, RKSimon
Subscribers: RKSimon, joker.eph, llvm-commits
Differential Revision: http://reviews.llvm.org/D19795
llvm-svn: 269911
This patch moves the expansion of WIN_ALLOCA pseudo-instructions
into a separate pass that walks the CFG and lowers the instructions
based on a conservative estimate of the offset between the stack
pointer and the lowest accessed stack address.
The goal is to reduce binary size and run-time costs by removing
calls to _chkstk. While it doesn't fix all the code quality problems
with inalloca calls, it's an incremental improvement for PR27076.
Differential Revision: http://reviews.llvm.org/D20263
llvm-svn: 269828
The new X86 shuffle lowering can do just fine without transforming vselects
into vector_shuffles. It looks like the only thing this code does right now
is cause trouble - in particular, it can lead to combine/legalization infinite
loops.
Note that it's not completely NFC, since some of the shuffle masks get inverted,
which may cause slight differences further down the line. We may want to find
a way to invert those masks, but that's orthogonal to this commit.
This fixes the hang in PR27689.
llvm-svn: 269676
This patch uses PSHUFB to lower vector CTLZ and avoid (slower) scalarizations.
The leading zero count of each 4-bit nibble of the vector is determined by using a PSHUFB lookup. Pairs of results are then repeatedly combined up to the original element width.
Differential Revision: http://reviews.llvm.org/D20016
llvm-svn: 269646
This is similar to r268953, but for floating point and vector register
classes.
Explanations:
The setting of the inline asm constraints was implicitly relying on the
order of the register classes in the file generated by tablegen.
Since, we do not have any control on that order, make sure we do not
depend on it anymore.
llvm-svn: 268973
This reapplies commit r268796, with a fix for the setting of the inline asm
constraints. I.e., "mark" LOW32_ADDR_ACCESS_RBP as a GR variant, so that the
regular processing of the GR operands (setting of the subregisters) happens.
Original commit log:
[X86] Add a new LOW32_ADDR_ACCESS_RBP register class.
ABIs like NaCl uses 32-bit addresses but have 64-bit frame.
The new register class reflects those constraints when choosing a
register class for a address access.
llvm-svn: 268955
The setting of the inline asm constraints was implicitly relying on the
order of the register classes in the file generated by tablegen.
Since, we do not have any control on that order, make sure we do not
depend on it anymore.
llvm-svn: 268953
A number of libcalls don't exist in any particular lib but are, instead,
defined in math.h as inline functions (even in C mode!). Don't rely on
their existence when lowering @llvm.{cos,sin,floor,..}.f32, promote them
instead.
N.B. We had logic to handle FREM but were missing out on a number of
others. This change generalizes the FREM handling.
llvm-svn: 268875
Both Linux and kFreeBSD use glibc, so follow similiar code paths.
Add isTargetGlibc to check for this, and use it instead of isTargetLinux
in a few places.
Fixes PR22248 for kFreeBSD.
Differential Revision: http://reviews.llvm.org/D19104
llvm-svn: 268624
The result type of setcc is dependent on whether or not AVX512 is
present.
We had an X86-specific DAG-combine which assumed that the result type
should be i8 when it could be i1.
This meant that we would generate illegal setccs which LowerSETCC did
not like.
Instead, use an appropriate type and zero extend to i8.
Also, there were some scenarios where the fold should have fired but
didn't because we were overly cautious about the types. This meant that
we generated:
shrl $31, %edi
andl $1, %edi
kmovw %edi, %k0
kxnorw %k0, %k0, %k1
kshiftrw $15, %k1, %k1
kxorw %k1, %k0, %k0
kmovw %k0, %eax
instead of:
testl %edi, %edi
setns %al
This fixes PR27638.
llvm-svn: 268609
The callseq_end node must be glued with the TLS calls, otherwise,
the generic code will miss the uses of the returned value and will
mark it dead.
Moreover, TLSCall 64-bit pseudo must not set an implicit-use on RDI,
the pseudo uses the symbol address at this point not RDI and the
lowering will do the right thing.
llvm-svn: 267797
This effectively adds back the extractelt combine removed by r262358:
the direct case can still occur (because x86_mmx is special, see
r262446), but it's the indirect case that's now superseded by the
generic combine.
llvm-svn: 267651
We aren't currently making use of this in any successful mask decode and its actually incorrect as it inserts the wrong number of SM_SentinelUndef mask elements.
llvm-svn: 267350
Reused the ability to split constants of a type wider than the shuffle mask to work with masks generated from scalar constants transfered to xmm.
This fixes an issue preventing PSHUFB target shuffle masks decoding rematerialized scalar constants and also exposes the XOP VPPERM bug described in PR27472.
llvm-svn: 267343
CTTZ_ZERO_UNDEF can be custom lowered specially if CTLZ is supported. Otherwise CTTZ and CTTZ_ZERO_UNDEF are handled the same way by using CTPOP and bitmath.
llvm-svn: 266952
With this change, ideally IR pass can always generate llvm.stackguard
call to get the stack guard; but for now there are still IR form stack
guard customizations around (see getIRStackGuard()). Future SSP
customization should go through LOAD_STACK_GUARD.
There is a behavior change: stack guard values are not CSEed anymore,
since we should never reuse the value in case that it has been spilled (and
corrupted). See ssp-guard-spill.ll. This also cause the change of stack
size and codegen in X86 and AArch64 test cases.
Ideally we'd like to know if the guard created in llvm.stackprotector() gets
spilled or not. If the value is spilled, discard the value and reload
stack guard; otherwise reuse the value. This can be done by teaching
register allocator to know how to rematerialize LOAD_STACK_GUARD and
force a rematerialization (which seems hard), or check for spilling in
expandPostRAPseudo. It only makes sense when the stack guard is a global
variable, which requires more instructions to load. Anyway, this seems to go out
of the scope of the current patch.
llvm-svn: 266806
Using VPERMQ/VPERMPD allows memory folding of the (repeated) input where VINSERTI128/VINSERTF128 can not.
Differential Revision: http://reviews.llvm.org/D19228
llvm-svn: 266728
no functional change.
ExtraLoad and WrapperKind are been used only if (OpFlags == X86II::MO_GOTPCREL).
Differential Revision: http://reviews.llvm.org/D18942
llvm-svn: 266557
It is very likely that the swiftself parameter is alive throughout most
functions function so putting it into a callee save register should
avoid spills for the callers with only a minimum amount of extra spills
in the callees.
Currently the generated code is correct but unnecessarily spills and
reloads arguments passed in callee save registers, I will address this
in upcoming patches.
This also adds a missing check that for tail calls the preserved value
of the caller must be the same as the callees parameter.
Differential Revision: http://reviews.llvm.org/D18902
llvm-svn: 266252
Extend the existing lowering of vXi8 multiplies to support v64i8 on avx512bw targets.
I added the Lower512IntArith helper function to help with this - not sure how often this could be used in the future, but it seemed better than putting all that logic inside LowerMUL.
Differential Revision: http://reviews.llvm.org/D18937
llvm-svn: 265902
This is a cleanup patch for SSP support in LLVM. There is no functional change.
llvm.stackprotectorcheck is not needed, because SelectionDAG isn't
actually lowering it in SelectBasicBlock; rather, it adds check code in
FinishBasicBlock, ignoring the position where the intrinsic is inserted
(See FindSplitPointForStackProtector()).
llvm-svn: 265851
Re-apply r265450 which caused PR27245 and was reverted in r265559
because of a wrong generalization: the fetch_and_add->add_and_fetch
combine only works in specific, but pretty common, cases:
(icmp slt x, 0) -> (icmp sle (add x, 1), 0)
(icmp sge x, 0) -> (icmp sgt (add x, 1), 0)
(icmp sle x, 0) -> (icmp slt (sub x, 1), 0)
(icmp sgt x, 0) -> (icmp sge (sub x, 1), 0)
Original Message:
We only generate LOCKed versions of add/sub when the result is unused.
It often happens that the result is used, but only by a comparison. We
can optimize those out by reusing EFLAGS, which lets us use the proper
instructions, instead of having to fallback to LXADD.
Instead of doing this as an MI peephole (as we do for the other
non-LOCKed (really, non-MR) forms), do it in ISel. It becomes quite
tricky later.
This also makes it eventually possible to stop expanding and/or/xor
if the only user is an icmp (also see D18141).
This uses the LOCK ISD opcodes added by r262244.
Differential Revision: http://reviews.llvm.org/D17633
llvm-svn: 265636
Summary:
In the context of http://wg21.link/lwg2445 C++ uses the concept of
'stronger' ordering but doesn't define it properly. This should be fixed
in C++17 barring a small question that's still open.
The code currently plays fast and loose with the AtomicOrdering
enum. Using an enum class is one step towards tightening things. I later
also want to tighten related enums, such as clang's
AtomicOrderingKind (which should be shared with LLVM as a 'C++ ABI'
enum).
This change touches a few lines of code which can be improved later, I'd
like to keep it as NFC for now as it's already quite complex. I have
related changes for clang.
As a follow-up I'll add:
bool operator<(AtomicOrdering, AtomicOrdering) = delete;
bool operator>(AtomicOrdering, AtomicOrdering) = delete;
bool operator<=(AtomicOrdering, AtomicOrdering) = delete;
bool operator>=(AtomicOrdering, AtomicOrdering) = delete;
This is separate so that clang and LLVM changes don't need to be in sync.
Reviewers: jyknight, reames
Subscribers: jyknight, llvm-commits
Differential Revision: http://reviews.llvm.org/D18775
llvm-svn: 265602
Bionic has a defined thread-local location for the stack protector
cookie. Emit a direct load instead of going through __stack_chk_guard.
llvm-svn: 265481
We only generate LOCKed versions of add/sub when the result is unused.
It often happens that the result is used, but only by a comparison. We
can optimize those out by reusing EFLAGS, which lets us use the proper
instructions, instead of having to fallback to LXADD.
Instead of doing this as an MI peephole (as we do for the other
non-LOCKed (really, non-MR) forms), do it in ISel. It becomes quite
tricky later.
This also makes it eventually possible to stop expanding and/or/xor
if the only user is an icmp (also see D18141).
This uses the LOCK ISD opcodes added by r262244.
Differential Revision: http://reviews.llvm.org/D17633
llvm-svn: 265450
We can only perform a tail call to a callee that preserves all the
registers that the caller needs to preserve.
This situation happens with calling conventions like preserver_mostcc or
cxx_fast_tls. It was explicitely handled for fast_tls and failing for
preserve_most. This patch generalizes the check to any calling
convention.
Related to rdar://24207743
Differential Revision: http://reviews.llvm.org/D18680
llvm-svn: 265329
Implemented truncstore for KNL and skylake-avx512.
Covered vectors from v2i1 to v64i1. We save the value in bits (not in bytes) - v32i1 is saved in 4 bytes.
Differential Revision: http://reviews.llvm.org/D18740
llvm-svn: 265283
Add support for lowering with the MOVMSK instruction to extract vector element signbits to a GPR.
This is an early step towards more optimal handling of vector comparison results.
Differential Revision: http://reviews.llvm.org/D18741
llvm-svn: 265266
Follow-up to http://reviews.llvm.org/D18566 and http://reviews.llvm.org/D18676 -
where we noticed that an intermediate splat was being generated for memsets of
non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The 16-byte test that was added in D18566 is now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
Note that the SSE1 path is not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
llvm-svn: 265161
Follow-up to D18566 - where we noticed that an intermediate splat was being
generated for memsets of non-zero chars.
That was because we told getMemsetStores() to use a 32-bit vector element type,
and it happily obliged by producing that constant using an integer multiply.
The tests that were added in the last patch are now equivalent for AVX1 and AVX2
(no splats, just a vector load), but we have PR27141 to track that splat difference.
In the new tests, the splat via shuffling looks ok to me, but there might be some
room for improvement depending on uarch there.
Note that the SSE1/2 paths are not changed in this patch. That can be a follow-up.
This patch should resolve PR27100.
Differential Revision: http://reviews.llvm.org/D18676
llvm-svn: 265148
Change isConsecutiveLoads to check that loads are non-volatile as this
is a requirement for any load merges. Propagate change to two callers.
Reviewers: RKSimon
Subscribers: llvm-commits
Differential Revision: http://reviews.llvm.org/D18546
llvm-svn: 265013
Fix for issue introduced D17297, where we were breaking early from the loop detecting consecutive loads which could leave us thinking a consecutive load with zeros was possible.
llvm-svn: 264922
XOP's VPPERM has some great 'permute operations' that it can do as well as part of shuffling the bytes of a 128-bit vector - in this case we use it to perform BITREVERSE in a single instruction.
llvm-svn: 264870
operations.
Specifically, we had code that tried to badly approximate reconstructing
all of the possible variations on addressing modes in two x86
instructions based on those in one pseudo instruction. This is not the
first bug uncovered with doing this, so stop doing it altogether.
Instead generically and pedantically copy every operand from the address
over to both new instructions, and strip kill flags from any register
operands.
This fixes a subtle bug seen in the wild where we would mysteriously
drop parts of the addressing mode, causing for example the index
argument in the added test case to just be completely ignored.
Hypothetically, this was an extremely bad miscompile because it actually
caused a predictable and leveragable write of a 64bit quantity to an
unintended offset (the first element of the array intead of whatever
other element was intended). As a consequence, in theory this could even
have introduced security vulnerabilities.
However, this was only something that could happen with an atomic
floating point add. No other operation could trigger this bug, so it
seems extremely unlikely to have occured widely in the wild.
But it did in fact occur, and frequently in scientific applications
which were using relaxed atomic updates of a floating point value after
adding a delta. Those would end up being quite badly miscompiled by
LLVM, which is how we found this. Of course, this often looks like
a race condition in the code, but it was actually a miscompile.
I suspect that this whole RELEASE_FADD thing was a complete mistake.
There is no such operation, and I worry that anything other than add
will get remarkably worse codegeneration. But that's not for this
change....
llvm-svn: 264845
If all a BUILD_VECTOR's source elements are the same bit (AND/XOR/OR) operation type and each has one constant operand, lower to a pair of BUILD_VECTOR and just apply the bit operation to the vectors.
The constant operands will form a constant vector meaning that we still only have a single BUILD_VECTOR to lower and we will have replaced all the scalarized operations with a single SSE equivalent.
Its not in our interest to start make a general purpose vectorizer from this, but I'm seeing enough of these scalar bit operations from the later legalization/scalarization stages to support them at least.
Differential Revision: http://reviews.llvm.org/D18492
llvm-svn: 264666
ICMP instruction selection fails on SKX and KNL for i1 operand.
I use XOR to resolve:
(A == B) is equivalent to (A xor B) == 0
Differential Revision: http://reviews.llvm.org/D18511
llvm-svn: 264566
Currently this is to mainly to prevent scalarization of integer division by constants.
Differential Revision: http://reviews.llvm.org/D18307
llvm-svn: 264511
LowerMul v32i8 on AVX2 needs to split the 256-bit sources to allow sign-extension back to v16i16 to occur. Since this is basically the same as Lower256IntArith we simplify by using that here instead.
llvm-svn: 264506
LowerShift was using the same code as Lower256IntArith to split 256-bit vectors into 2 x 128-bit vectors, so now we just call Lower256IntArith.
llvm-svn: 264403
KTEST instruction may be used instead of TEST in this case:
%int_sel3 = bitcast <8 x i1> %sel3 to i8
%res = icmp eq i8 %int_sel3, zeroinitializer
br i1 %res, label %L2, label %L1
Differential Revision: http://reviews.llvm.org/D18444
llvm-svn: 264298
This patch begins adding support for lowering to the XOP VPPERM instruction - adding the X86ISD::VPPERM opcode.
Differential Revision: http://reviews.llvm.org/D18189
llvm-svn: 264260
This should be hoisted further up so it can be used in DAGCombiner and other backends,
but I'm limiting the scope in the interest of patch minimalism.
It's not quite NFC because some of the replaced code was using an 'if' check rather
than a 'while' loop, so those cases would only look through a single bitcast.
llvm-svn: 264186
Improve vector extension of vectors on hardware without dedicated VSEXT/VZEXT instructions.
We already convert these to SIGN_EXTEND_VECTOR_INREG/ZERO_EXTEND_VECTOR_INREG but can further improve this by using the legalizer instead of prematurely splitting into legal vectors in the combine as this only properly helps for lowering to VSEXT/VZEXT.
Removes a lot of unnecessary any_extend + mask pattern - (Fix for PR25718).
Reapplied with a fix for PR26953 (missing vector widening legalization).
Differential Revision: http://reviews.llvm.org/D17932
llvm-svn: 264062
Improve computeZeroableShuffleElements to be able to peek through bitcasts to extract zero/undef values from BUILD_VECTOR nodes of different element sizes to the shuffle mask.
Differential Revision: http://reviews.llvm.org/D14261
llvm-svn: 263906
We were being too aggressive in trying to combine a shuffle into a blend-with-zero pattern, often resulting in a endless loop of contrasting combines
This patch stops the combine if we already have a blend in place (means we miss some domain corrections)
llvm-svn: 263717
Converting masked vector loads to regular vector loads for x86 AVX should always be a win.
I raised the legality issue of reading the extra memory bytes on llvm-dev. I did not see any
objections.
1. x86 already does this kind of optimization for multiple scalar loads -> vector load.
2. If other targets have the same flexibility, we could move this transform up to CGP or DAGCombiner.
Differential Revision: http://reviews.llvm.org/D18094
llvm-svn: 263446
The SSE41 v8i16 shift lowering using (v)pblendvb is great for non-constant shift amounts, but if it is constant then we can efficiently reduce the VSELECT to shuffles with the pre-SSE41 lowering.
llvm-svn: 263383
cmpxchg[8|16]b uses RBX as one of its argument.
In other words, using this instruction clobbers RBX as it is defined to hold one
the input. When the backend uses dynamically allocated stack, RBX is used as a
reserved register for the base pointer.
Reserved registers have special semantic that only the target understands and
enforces, because of that, the register allocator don’t use them, but also,
don’t try to make sure they are used properly (remember it does not know how
they are supposed to be used).
Therefore, when RBX is used as a reserved register but defined by something that
is not compatible with that use, the register allocator will not fix the
surrounding code to make sure it gets saved and restored properly around the
broken code. This is the responsibility of the target to do the right thing with
its reserved register.
To fix that, when the base pointer needs to be preserved, we use a different
pseudo instruction for cmpxchg that save rbx.
That pseudo takes two more arguments than the regular instruction:
- One is the value to be copied into RBX to set the proper value for the
comparison.
- The other is the virtual register holding the save of the value of RBX as the
base pointer. This saving is done as part of isel (i.e., we emit a copy from
rbx).
cmpxchg_save_rbx <regular cmpxchg args>, input_for_rbx_reg, save_of_rbx_as_bp
This gets expanded into:
rbx = copy input_for_rbx_reg
cmpxchg <regular cmpxchg args>
rbx = save_of_rbx_as_bp
Note: The actual modeling of the pseudo is a bit more complicated to make sure
the interferes that appears after the pseudo gets expanded are properly modeled
before that expansion.
This fixes PR26883.
llvm-svn: 263325
Improve vector extension of vectors on hardware without dedicated VSEXT/VZEXT instructions.
We already convert these to SIGN_EXTEND_VECTOR_INREG/ZERO_EXTEND_VECTOR_INREG but can further improve this by using the legalizer instead of prematurely splitting into legal vectors in the combine as this only properly helps for lowering to VSEXT/VZEXT.
Removes a lot of unnecessary any_extend + mask pattern - (Fix for PR25718).
Differential Revision: http://reviews.llvm.org/D17932
llvm-svn: 263303
Its not enough that we test for SSSE3 - that's only OK for 128-bit vectors - we also need to test for AVX2 / AVX512BW for 256/512 bit vector cases.
llvm-svn: 263239
Looking at the IR definition of a masked load made me realize
there was no reason to use a shuffle here, so we don't need
to convert the format of the mask at all.
llvm-svn: 263167
Generalise the existing SIGN_EXTEND to SIGN_EXTEND_VECTOR_INREG combine to support zero extension as well and get rid of a lot of unnecessary ANY_EXTEND + mask patterns.
Reapplied with a fix for PR26870 (avoid premature use of TargetConstant in ZERO_EXTEND_VECTOR_INREG expansion).
Differential Revision: http://reviews.llvm.org/D17691
llvm-svn: 263159