llvm-project

Commit Graph

Author	SHA1	Message	Date
Liu, Chen3	57e8f840b6	[X86][FP16] Fix a bug when Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A). This bug was introduced by D109953. The operand order of generated FMA is wrong. Differential Revision: https://reviews.llvm.org/D110606	2021-09-28 11:38:53 +08:00
Xiang1 Zhang	ebe9944a34	[ISel] Legalized arithmetic.fence.f128 for 32-bits target Reviewed By: Craig Topper, Wang Pengfei Differential Revision: https://reviews.llvm.org/D110467	2021-09-28 10:27:25 +08:00
Lang Hames	21a06254a3	[ORC] Switch from JITTargetAddress to ExecutorAddr for EPC-call APIs. Part of the ongoing move to ExecutorAddr.	2021-09-27 16:53:09 -07:00
Lang Hames	6fe2e9a9cc	[ORC] Hold shared_ptr<SymbolStringPool> in errors containing SymbolStringPtrs. This allows these error values to remain valid, even if they tear down the JIT itself.	2021-09-27 15:46:56 -07:00
Congzhe Cao	c42772752a	[CodeMoverUtils] Enhance isSafeToMoveBefore() when control flow equivalence is satisfied With improved analysis in determining CFG equivalence that does not require strict dominance and post-dominance conditions, we now relax isSafeToMoveBefore() such that an instruction I can be moved before InsertPoint even if they do not strictly dominate each other, as long as they follow the same control flow path. For example, we can move Instruction 0 before Instruction 1, and vice versa. ``` if (cond1) // Instruction 0: %add = add i32 1, 2 if (cond1) // Instruction 1: %add2 = add i32 2, 1 ``` Reviewed By: Whitney Differential Revision: https://reviews.llvm.org/D110456	2021-09-27 18:37:36 -04:00
Jozef Lawrynowicz	6cfb4d46ba	[llvm-readobj] Support dumping of MSP430 ELF attributes The MSP430 ABI supports build attributes for specifying the ISA, code model, data model and enum size in ELF object files. Differential Revision: https://reviews.llvm.org/D107969	2021-09-28 00:56:11 +03:00
modimo	20faf78919	[ThinLTO] Add noRecurse and noUnwind thinlink function attribute propagation Thinlink provides an opportunity to propagate function attributes across modules, enabling additional propagation opportunities. This change propagates (currently default off, turn on with `disable-thinlto-funcattrs=0`) noRecurse and noUnwind based off of function summaries of the prevailing functions in bottom-up call-graph order. Testing on clang self-build: 1. There's a 35-40% increase in noUnwind functions due to the additional propagation opportunities. 2. Throughput is measured at 10-15% increase in thinlink time which itself is 1.5% of E2E link time. Implementation-wise this adds the following summary function attributes: 1. noUnwind: function is noUnwind 2. mayThrow: function contains a non-call instruction that `Instruction::mayThrow` returns true on (e.g. windows SEH instructions) 3. hasUnknownCall: function contains calls that don't make it into the summary call-graph thus should not be propagated from (e.g. indirect for now, could add no-opt functions as well) Testing: Clang self-build passes and 2nd stage build passes check-all ninja check-all with newly added tests passing Reviewed By: tejohnson Differential Revision: https://reviews.llvm.org/D36850	2021-09-27 12:28:07 -07:00
Roman Lebedev	2a7a768dad	[X86][Costmodel] Load/store i16 Stride=4 VF=32 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For this tuple, measuring becomes problematic since there's a lot of spilling going on, but apparently all these memory ops do not affect worst-case estimate at all here. For load we have: https://godbolt.org/z/zP4hd8MT6 - for intels `Block RThroughput: =150.0`; for ryzens, `Block RThroughput: <=59` So pick cost of `150`. For store we have: https://godbolt.org/z/vKb8zTK8E - for intels `Block RThroughput: =64.0`; for ryzens, `Block RThroughput: <=24.0` So pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110548	2021-09-27 22:20:01 +03:00
Roman Lebedev	ee5a050e2e	[X86][Costmodel] Load/store i16 Stride=4 VF=16 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =75.0`; for ryzens, `Block RThroughput: <=29.5` So pick cost of `75`. (note that `# 32-byte Reload` does not affect throughput there.) For store we have: https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110543	2021-09-27 22:20:01 +03:00
Roman Lebedev	5615d6a6dd	[X86][Costmodel] Load/store i16 Stride=4 VF=8 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/dd8T5P471 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=14.5` So pick cost of `33`. For store we have: https://godbolt.org/z/zPxcKWhn4 - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=6.0` So pick cost of `10`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110541	2021-09-27 22:20:01 +03:00
Roman Lebedev	df2b42d12e	[X86][Costmodel] Load/store i16 Stride=4 VF=4 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rnsf639Wh - for intels `Block RThroughput: =17.0`; for ryzens, `Block RThroughput: <=7.5` So pick cost of `17`. For store we have: https://godbolt.org/z/565KKrcY6 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110537	2021-09-27 22:20:01 +03:00
Roman Lebedev	45caac91c4	[X86][Costmodel] Load/store i16 Stride=4 VF=2 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/5EYc6r9nh - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `6`. For store we have: https://godbolt.org/z/z61e5d6GE - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110536	2021-09-27 22:20:01 +03:00
Sanjay Patel	fdba1dccbe	[InstCombine] reduce code for shl-of-sub transform; NFC	2021-09-27 14:56:01 -04:00
Jon Chesterfield	80fa43fe9a	Revert "[openmp] Add addrspacecast to getOrCreateIdent" This reverts commit `1a761e5b7b`. Failed CI, albeit with a different failure mode to BZ51982	2021-09-27 19:27:35 +01:00
Jon Chesterfield	1a761e5b7b	[openmp] Add addrspacecast to getOrCreateIdent Fixes 51982. Minor refactor to remove `return x = y` construct. Test case derived from https://github.com/ROCm-Developer-Tools/aomp/blob/aomp-dev/test/smoke/nest_call_par2/nest_call_par2.c by deleting parts while checking the assertion failure still occurred. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D110556	2021-09-27 19:23:12 +01:00
Sanjay Patel	623f93ed1c	[InstCombine] add use check to shl transform This bug was introduced with the refactoring in: `9075edc89b` ...but there were no tests to detect it.	2021-09-27 14:10:26 -04:00
Jameson Nash	e27a6db529	Bad SLPVectorization shufflevector replacement, resulting in write to wrong memory location We see that it might otherwise do: %10 = getelementptr {}, <2 x {}> %9, <2 x i32> <i32 10, i32 4> %11 = bitcast <2 x {}*> %10 to <2 x i64> ... %27 = extractelement <2 x i64> %11, i32 0 %28 = bitcast i64 %27 to <2 x i64>* store <2 x i64> %22, <2 x i64>* %28, align 4, !tbaa !2 Which is an out-of-bounds store (the extractelement got offset 10 instead of offset 4 as intended). With the fix, we correctly generate extractelement for i32 1 and generate correct code. Differential Revision: https://reviews.llvm.org/D106613	2021-09-27 14:06:13 -04:00
Praveen Velliengiri	e90b512c4d	[AMDGPU] Change ASAN init/fini kernels linkage to external. The HSA runtime fails to find the symbols for the Init and Fini kernels because they are marked with internal linkage; change the linkage to external to fix those errors. Differential Revision: https://reviews.llvm.org/D110054	2021-09-27 11:50:37 -06:00
Quinn Pham	682e15f371	[PowerPC] Fix td pattern for P10 VSLDBI and VSRDBI This patch fixes the pattern for the P10 instructions Vector Shift Left Double by Bit Immediate VN-form and Vector Shift Right Double by Bit Immediate VN-form. The third argument should be a target constant (`timm`) instead of an `i32` because an immediate is expected. Reviewed By: lei Differential Revision: https://reviews.llvm.org/D109920	2021-09-27 12:36:18 -05:00
Craig Topper	a2a07e8db3	[RISCV] Fold store of vmv.x.s to a vse with VL=1. This can avoid a loss of decoupling with the scalar unit on cores with decoupled scalar and vector units. We should support FP too, but those use extract_element and not a custom ISD node so it is a little different. I also left a FIXME in the test for i64 extract and store on RV32. Reviewed By: frasercrmck Differential Revision: https://reviews.llvm.org/D109482	2021-09-27 09:54:46 -07:00
Kazu Hirata	59540b29f8	[InstCombine] Fix an "unused variable" warning	2021-09-27 09:49:32 -07:00
Craig Topper	933182e948	[RISCV] Improve support for forming widening multiplies when one input is a scalar splat. If one input of a fixed vector multiply is a sign/zero extend and the other operand is a splat of a scalar, we can use a widening multiply if the scalar value has sufficient sign/zero bits. Reviewed By: frasercrmck Differential Revision: https://reviews.llvm.org/D110028	2021-09-27 09:37:07 -07:00
Sanjay Patel	9075edc89b	[InstCombine] move shl-only folds out from under commonShiftTransforms(); NFCI This is no-functional-change-intended, but it hopefully makes things slightly clearer and more efficient to have transforms that require 'shl' be called only from visitShl(). Further cleanup is possible.	2021-09-27 12:09:47 -04:00
Kazu Hirata	b68a62b3a9	[Lanai] Remove redundant declaration getTheLanaiTarget (NFC) Note that getTheLanaiTarget is declared in TargetInfo/LanaiTargetInfo.h, which LanaiDisassembler.cpp includes. Identified with readability-redundant-declaration.	2021-09-27 08:58:27 -07:00
Nico Weber	7664508910	[llvm/OptTable] Add named param comment for GroupedShortOption	2021-09-27 11:33:29 -04:00
Nico Weber	730bbc6f72	[llvm/OptTable] Drop "The" prefix on fields	2021-09-27 11:24:51 -04:00
Nico Weber	6ffd8e3902	[llvm] Convert OptTable::ParseOneArg() to std::unique_ptr<>	2021-09-27 11:19:21 -04:00
Nico Weber	7789a68e5a	[llvm] Convert OptTable::parseOneArgGrouped() to std::unique_ptr<>	2021-09-27 11:19:15 -04:00
Nico Weber	2f955424c4	[llvm] Convert Option::accept(), acceptInternal() to std::unique_ptr<> These functions transfer ownership to the caller. Make this clear in the type system. No behavior change.	2021-09-27 11:05:02 -04:00
Sanjay Patel	21429cf43a	[InstCombine] generalize fold for (trunc (X u>> C1)) u>> C This is another step towards trying to re-apply D110170 by eliminating conflicting transforms that cause infinite loops. `a47c8e40c7` was a previous patch in this direction. The diffs here are mostly cosmetic, but intentional: 1. The existing code that would handle this pattern in FoldShiftByConstant() is limited to 'shl' only now. The formatting change to IsLeftShift shows that we could move several transforms into visitShl() directly for efficiency because they are not common shift transforms. 2. The tests are regenerated to show new instruction names to prove that we are getting (almost) identical logic results. 3. The one case where we differ ("trunc_sandwich_small_shift1") shows that we now use a narrow 'and' instruction. Previously, we relied on another transform to do that, but it is limited to legal types. That seems to be a legacy constraint from when IR analysis and codegen were less robust. https://alive2.llvm.org/ce/z/JxyGA4 declare void @llvm.assume(i1) define i8 @src(i32 %x, i32 %c0, i8 %c1) { ; The sum of the shifts must not overflow the source width. %z1 = zext i8 %c1 to i32 %sum = add i32 %c0, %z1 %ov = icmp ult i32 %sum, 32 call void @llvm.assume(i1 %ov) %sh1 = lshr i32 %x, %c0 %tr = trunc i32 %sh1 to i8 %sh2 = lshr i8 %tr, %c1 ret i8 %sh2 } define i8 @tgt(i32 %x, i32 %c0, i8 %c1) { %z1 = zext i8 %c1 to i32 %sum = add i32 %c0, %z1 %maskc = lshr i8 -1, %c1 %s = lshr i32 %x, %sum %t = trunc i32 %s to i8 %a = and i8 %t, %maskc ret i8 %a }	2021-09-27 10:57:31 -04:00
Sanjay Patel	025a805d7c	[InstCombine] match variable names and code comments; NFC Similar to: `29c09c7` Planned follow-up is to add a transform here to allow removing a common shift fold that is conflicting with D110170.	2021-09-27 10:57:31 -04:00
Sebastian Neubauer	bf980930e5	[AMDGPU] Ignore KILLs when forming clauses KILL instructions are sometimes present and prevented hard clauses from being formed. Fix this by ignoring all meta instructions in clauses. Differential Revision: https://reviews.llvm.org/D106042	2021-09-27 16:33:52 +02:00
Sjoerd Meijer	eba76056a3	[FuncSpec] Don't specialise (or crash) on poison or constexpr values Function specialization was crashing on poison values and constexpr values. The problem is that these values are not added to the solver, so it crashes when a lookup is performed for these values. This fixes that by not specialising on these values. For poison that is obvious, but for constexpr this is a change in behaviour. Thus, in one way this is a bit of a stopgap, but specialising on constexpr values wasn't done very intentionally, and needs some more work and tests if we want to support this. As a follow up, we need to look at whether the solver should exit more gracefully and return a "don't know", or whether it should really support these constexprs. This should fix PR51600 (https://bugs.llvm.org/show_bug.cgi?id=51600). Differential Revision: https://reviews.llvm.org/D110529	2021-09-27 14:58:53 +01:00
Jun Ma	3a998c06a8	Revert "Recommit "Revert "[CVP] processSwitch: Remove default case when switch cover all possible values.""" This reverts commit `8ba2adcf9e`.	2021-09-27 20:39:05 +08:00
Roman Lebedev	7424deb743	[X86][Costmodel] Load/store i16 Stride=2 VF=32 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/q6GbK89br - for intels `Block RThroughput: =18.0`; for ryzens, `Block RThroughput: <=7.0` So pick cost of `18`. For store we have: https://godbolt.org/z/Yzfoo5TnW - for intels `Block RThroughput: =8.0`; for ryzens, `Block RThroughput: <=4.0` So pick cost of `8`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110507	2021-09-27 14:21:12 +03:00
Roman Lebedev	a5113e9445	[X86][Costmodel] Load/store i16 Stride=2 VF=16 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=3.5` So pick cost of `9`. For store we have: https://godbolt.org/z/Y1E7qnjz8 - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110506	2021-09-27 14:20:11 +03:00
Roman Lebedev	70c90cc5bd	[X86][Costmodel] Load/store i16 Stride=2 VF=8 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/e5YE99a4P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `6`. For store we have: https://godbolt.org/z/3vM4KsE1n - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `3`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110505	2021-09-27 14:18:29 +03:00
Roman Lebedev	49e532aa52	[X86][Costmodel] Load/store i16 Stride=2 VF=4 interleaving costs The only sched models for CPUs that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1j3nf3dro - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/4n1zvP37j - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110504	2021-09-27 14:15:25 +03:00
Fraser Cormack	e2b46e336b	[DAGCombiner][VP] Fold zero-length or false-masked VP ops This patch adds a generic DAGCombine for vector-predicated (VP) nodes. Those for which we can determine that no vector element is active can be replaced by either undef or, for reductions, the start value. This is tested rather trivially at the IR level, where it's possible that we want to teach instcombine to perform this optimization. However, we can also see the zero-evl case arise during SelectionDAG legalization, when wide VP operations can be split into two and the upper operation emerges as trivially false. It's possible that we could perform this optimization "proactively" (both on legal vectors and before splitting) and reduce the width of an operation and insert it into a larger undef vector: ``` v8i32 vp_add x, y, mask, 4 -> v8i32 insert_subvector (v8i32 undef), (v4i32 vp_add xsub, ysub, mask, 4), i32 0 ``` This is somewhat analogous to similar vector narrow/widening optimizations, but it's unclear at this point whether that's beneficial to do this for VP ops for any/all targets. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D109148	2021-09-27 11:30:09 +01:00
David Green	bb2d23dcd4	[ARM] Improve detection of fallthrough when aligning blocks We align non-fallthrough branches under Cortex-M at O3 to lead to fewer instruction fetches. This improves that for the block after a LE or LETP. These blocks will still have terminating branches until the LowOverheadLoops pass is run (as they are not handled by analyzeBranch, the branch is not removed until later), so canFallThrough will return false. These extra branches will eventually be removed, leaving a fallthrough, so treat them as such and don't add unnecessary alignments. Differential Revision: https://reviews.llvm.org/D107810	2021-09-27 11:21:21 +01:00
Simon Pilgrim	468ff703e1	[X86] combineVectorHADDSUB - remove the broken HOP(x,x) merging code (PR51974) The intention of this code turns out to be superfluous as we can handle this with shuffle combining, and it has a critical flaw in that it doesn't check for dependencies. Fixes PR51974	2021-09-27 10:41:22 +01:00
Fraser Cormack	d48f6df1f8	[RISCV] Create the correct mask type when lowering EXTRACT_VECTOR_ELT This particular case was creating a `VMSET_VL` using the old fixed-length type in order to pass a mask to other custom nodes operating on the scalable container type. This kind of thing wasn't caught for us; I only noticed when experimenting with odd-length vectors, where it was trying to generate an invalid `v3i1` MVT. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D110420	2021-09-27 09:43:40 +01:00
Freddy Ye	902ec6142a	[X86][ISel] Lowering FROUND(f16) and FROUNDEVEN(f16) When AVX512FP16 is enabled, FROUND(f16) cannot be handled by TypeLegalize, and no libcall in libm is available for fround(f16) yet. FROUNDEVEN(f16) has a related instruction in AVX512FP16. Reviewed By: craig.topper Differential Revision: https://reviews.llvm.org/D110312	2021-09-27 13:35:03 +08:00
Lang Hames	1ea8d12510	[ORC] Add missing lock to CompileOnDemandLayer::getPerDylibResources. The getPerDylibResources method may be called concurrently from multiple threads, so we need to protect access to the underlying map. Possible fix for https://llvm.org/PR51064	2021-09-26 18:35:58 -07:00
Lang Hames	4b37462aab	[ORC] Fix SimpleRemoteEPC data races. Adds a 'start' method to SimpleRemoteEPCTransport to defer transport startup until the client has been configured. This avoids races on client members if the first message arrives while the client is being configured. Also fixes races on the file descriptors in FDSimpleRemoteEPCTransport.	2021-09-26 18:11:48 -07:00
Nikita Popov	7a855596c3	[BasicAA] Don't check whether GEP is sized (NFC) GEPs are required to have sized source element type, so we can just assert that here.	2021-09-26 21:21:54 +02:00
Simon Pilgrim	c0eff50fc5	[X86][SSE] combineMulToPMADDWD - enable sext_extend_vector_inreg(vXi16) -> zext_extend_vector_inreg(vXi16) fold The plan is to allow combineMulToPMADDWD to match illegal vector types (as long as they're still pow2), which should allow us to start removing the 128-bit limit on more of the PMADDWD combines.	2021-09-26 19:37:23 +01:00
Simon Pilgrim	ed3e4917b3	[X86] Fold PACK(_EXTEND_VECTOR_INREG, UNDEF) -> _EXTEND_VECTOR_INREG For 128-bit vectors, we can remove a PACK of an EXTEND_VECTOR_INREG node and just create a smaller extension to the result/packed type.	2021-09-26 19:37:22 +01:00
Lang Hames	6498b0e991	Reintroduce "[ORC] Introduce EPCGenericRTDyldMemoryManager." This reintroduces "[ORC] Introduce EPCGenericRTDyldMemoryManager." (`bef55a2b47`) and "[lli] Add ChildTarget dependence on OrcTargetProcess library." (`7a219d801b`) which were reverted in `99951a5684` due to bot failures. The root cause of the bot failures should be fixed by "[ORC] Fix uninitialized variable." (`0371049277`) and "[ORC] Wait for handleDisconnect to complete in SimpleRemoteEPC::disconnect." (`320832cc9b`).	2021-09-27 03:24:33 +10:00
Simon Pilgrim	3fe9767204	[X86] Fold ADD(VPMADDWD(X,Y),VPMADDWD(Z,W)) -> VPMADDWD(SHUFFLE(X,Z), SHUFFLE(Y,W)) Merge addition of VPMADDWD nodes if each element pair doesn't use the upper element in each pair (i.e. it's zero) - we can generalize this to either element in the pair if we one day create VPMADDWD with zero lower elements. There are still a number of issues with extending/shuffling with 256/512-bit VPMADDWD nodes so this initially only works for v2i32/v4i32 cases - I'm working on removing all these limitations but there's still a bit of yak shaving to go.	2021-09-26 18:08:29 +01:00
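
The shift fold in commit 21429cf43a above, `(trunc (X u>> C1)) u>> C`, can be sanity-checked by brute force. The sketch below is only an illustration, not part of the commit: it mirrors the `@src` and `@tgt` functions from that message in plain Python integers, and it assumes scaled-down widths (12-bit source, 6-bit destination, instead of i32 and i8) so that the exhaustive loop finishes quickly. The names `SRC_BITS`, `DST_BITS`, `src`, `tgt`, and `check_all` are hypothetical helpers chosen here.

```python
# Brute-force check of the identity behind the fold:
#   lshr(trunc(lshr(x, c0)), c1) == and(trunc(lshr(x, c0 + c1)), lshr(-1, c1))
# Assumption: widths are scaled down from the commit's i32 -> i8 example.

SRC_BITS = 12                      # stands in for i32
DST_BITS = 6                       # stands in for i8
DST_MASK = (1 << DST_BITS) - 1     # all-ones value of the narrow type


def src(x, c0, c1):
    """Original pattern: shift wide, truncate, then shift narrow."""
    sh1 = x >> c0
    tr = sh1 & DST_MASK            # trunc to the narrow type
    return tr >> c1


def tgt(x, c0, c1):
    """Folded pattern: one wide shift, truncate, then mask."""
    maskc = DST_MASK >> c1         # lshr of narrow -1 by c1
    s = x >> (c0 + c1)
    t = s & DST_MASK               # trunc to the narrow type
    return t & maskc


def check_all():
    """Return the first counterexample (x, c0, c1), or None if none exists."""
    for c1 in range(DST_BITS):
        # Same precondition as the llvm.assume in the commit message:
        # the sum of the shifts must stay below the source width.
        for c0 in range(SRC_BITS - c1):
            for x in range(1 << SRC_BITS):
                if src(x, c0, c1) != tgt(x, c0, c1):
                    return (x, c0, c1)
    return None


if __name__ == "__main__":
    print(check_all())
```

Calling `check_all()` walks every input for the reduced widths; a `None` result means no mismatch was found at those widths. It does not replace the Alive2 proof linked in the commit message.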


151087 Commits