Commit Graph

64315 Commits

Author SHA1 Message Date
Craig Topper 40b230f685 [RISCV] Limit transformAddImmMulImm to prevent an infinite loop.
This fixes an issue reported in D108607.
2021-09-23 15:53:11 -07:00
Vang Thao 1443ba6163 [AMDGPU] Propagate defining src reg for AGPR to AGPR Copies
On targets that do not support AGPR to AGPR copying directly, try to find the
defining accvgpr_write and propagate its source vgpr register to the copies
before register allocation so the source vgpr register does not get clobbered.

The postrapseudos pass also attempts to propagate the defining accvgpr_write, but
if the register to propagate is clobbered, it gives up and creates new
temporary vgpr registers instead.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D108830
2021-09-23 15:17:53 -07:00
Craig Topper 70f50114f3 [RISCV] Add another isel optimization for (and (shl x, c2), c1)
Turn (and (shl x, c2), c1) -> (slli (srli x, c3-c2), c3) if c1 is a
shifted mask with no leading zeros and c3 trailing zeros where c3
is greater than c2.
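A worked illustration (constants chosen here, not from the commit): take c2=1
and c1=0xFFFFFFFFFFFFFFF0, a shifted mask with no leading zeros and c3=4
trailing zeros, so c3 > c2:
   (and (shl x, 1), 0xFFFFFFFFFFFFFFF0) -> (slli (srli x, 3), 4)
For x=0xFF both sides evaluate to 0x1F0.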
2021-09-23 14:18:07 -07:00
Nico Weber 3fa43da7a3 [llvm] Fix a copy-pasto
We should use IMAGE_REL_I386_SECREL in the i386 section of this file.

IMAGE_REL_I386_SECREL and IMAGE_REL_AMD64_SECREL have the same
numeric value 0xB, so this doesn't change behavior.
2021-09-23 15:34:01 -04:00
Craig Topper 4a69551d66 [RISCV] Add more isel optimizations for (and (shr x, c2), c1).
Turn (and (shr x, c2), c1) -> (slli (srli x, c2+c3), c3) if c1 is a
shifted mask with c2 leading zeros and c3 trailing zeros.

When the number of leading zeros is c2+32 we can use SRLIW in place of SRLI.
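A worked illustration (constants chosen here, not from the commit): take c2=8
and c1=0x00FFFFFFFFFFFFF0, a shifted mask with c2=8 leading zeros and c3=4
trailing zeros:
   (and (srl x, 8), 0x00FFFFFFFFFFFFF0) -> (slli (srli x, 12), 4)
For x=-1 both sides evaluate to 0x00FFFFFFFFFFFFF0.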
2021-09-23 11:29:04 -07:00
Sanjay Patel 74ba4b769a [x86] move combiner state check into convertIntLogicToFPLogic(); NFC
This function can be adapted to solve bugs like PR51245,
but it could require differentiating the combiner timing
between the existing and new transforms.
2021-09-23 14:28:22 -04:00
Thomas Lively 2f519825ba [WebAssembly] Add prototype relaxed SIMD fma/fms instructions
Add experimental clang builtins, LLVM intrinsics, and backend definitions for
the new {f32x4,f64x2}.{fma,fms} instructions in the relaxed SIMD proposal:
https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md.
Do not allow these instructions to be selected without explicit user opt-in.

Differential Revision: https://reviews.llvm.org/D110295
2021-09-23 11:01:36 -07:00
Piotr Sobczak 2ac53fffae [AMDGPU] Avoid processing functions in amdgpu-propagate-attributes pass for shaders
The pass amdgpu-propagate-attributes ("Early/Late propagate attributes
from kernels to functions") is currently also run for shaders, where
it does nothing. Modify the check so the pass only processes functions
for kernels.

Differential Revision: https://reviews.llvm.org/D109961
2021-09-23 16:46:56 +02:00
Simon Pilgrim c931d35216 [CostModel][X86] Increase i64 mul cost from 1 to 2
Only the most recent CPUs really support 1cy 64-bit multiplies, and the X64 cost table represents a realistic worst case. The 1cy value was also discouraging vectorization when most vXi64 PMULDQ expansions aren't actually slower than scalarization.

Noticed while investigating PR51436.
2021-09-23 14:48:21 +01:00
Jim Lin fbacf5ad38 [RISCV] Add missing op type OPERAND_UIMM2, OPERAND_UIMM3 and OPERAND_UIMM7 for verifyInstruction
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D110307
2021-09-23 19:30:46 +08:00
Fraser Cormack e7c879a69d [RISCV][VP] Add support for VP_REDUCE_* operations
This patch adds codegen support for lowering the vector-predicated
reduction intrinsics to RVV instructions. The process is similar to that
of the other reduction intrinsics, save for the fact that every VP
reduction has a start value. We reuse the existing custom "VL" nodes,
adding extra patterns where required to handle non-true masks.

To support these nodes, the `RISCVISD::VECREDUCE_*_VL` nodes have been
given an explicit "merge" operand. This is to facilitate the VP
reductions, where we must be careful to ensure that even if no operation
is performed (when VL=0) we still produce the start value. The RVV
reductions don't update the destination register under these conditions,
so we tie the splatted start value to the output register.
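A minimal IR sketch of the contract being preserved (types here are
illustrative):

   %r = call i32 @llvm.vp.reduce.add.nxv2i32(i32 %start,
            <vscale x 2 x i32> %v, <vscale x 2 x i1> %m, i32 %evl)
   ; when %evl is 0 the result must still be %start, which is why the
   ; splatted start value is tied to the output register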

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D107657
2021-09-23 11:11:05 +01:00
Jay Foad 6cef28ed2d [TII] Remove the MFI argument to convertToThreeAddress. NFC.
This simplifies the API and addresses a FIXME in
TwoAddressInstructionPass::convertInstTo3Addr.

Differential Revision: https://reviews.llvm.org/D110229
2021-09-23 08:58:46 +01:00
Liu, Chen3 76656ec8ec [X86][FP16] Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A)
This patch supports transforming something like
_mm512_add_ph(acc, _mm512_fmadd_pch(a, b, _mm512_setzero_ph()))
into _mm512_fmadd_pch(a, b, acc).
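A hedged C sketch of the source shape this targets (assumes AVX512-FP16
intrinsics from <immintrin.h>):

   #include <immintrin.h>
   __m512h fold(__m512h acc, __m512h a, __m512h b) {
     // Complex multiply into a zero accumulator, then a separate add ...
     __m512h t = _mm512_fmadd_pch(a, b, _mm512_setzero_ph());
     // ... which now combines into a single _mm512_fmadd_pch(a, b, acc).
     return _mm512_add_ph(acc, t);
   }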

Differential Revision: https://reviews.llvm.org/D109953
2021-09-23 15:37:08 +08:00
Mikael Holmen e7b169a8ae [AMDGPU] Fix gcc warnings about unused variables [NFC] 2021-09-23 08:08:00 +02:00
Usman Nadeem 3b12282b0e [AArch64][SVE][InstCombine] Eliminate redundant chains of tuple get/set
Differential Revision: https://reviews.llvm.org/D109667

Change-Id: I06a3c28e3658ecda109a3a1b73265828274ab2ea
2021-09-22 20:59:46 -07:00
Wang, Pengfei ebec077e07 [X86][FP16] Change the order of the operands in complex FMA intrinsics to allow swap between the mul operands.
Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D109658
2021-09-23 11:02:48 +08:00
Zhi An Ng 1552179ac0 [WebAssembly] Add relaxed-simd feature
This currently only defines a constant, but in the future it will be used
to gate builtins for experimenting with and prototyping the relaxed-simd proposal
(https://github.com/WebAssembly/relaxed-simd/).

Differential Revision: https://reviews.llvm.org/D110111
2021-09-22 14:52:50 -07:00
Craig Topper f0a422f935 [RISCV] Add fcvt.s.w(u)/fcvt.d.w(u)/fcvt.h.w(u) to hasAllNBitUsers
These instructions only read the lower 32 bits of their input.
2021-09-22 14:24:26 -07:00
Craig Topper b33a1cc05b [RISCV] Optimize vp.store with an all ones mask to avoid a vmset.
We can use the riscv_vse intrinsic instead of riscv_vse_mask. The code here
is based on similar code for handling masked.scatter and vp.scatter.
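A hedged IR sketch of the case being optimized (types illustrative):

   call void @llvm.vp.store.v2i64.p0v2i64(<2 x i64> %v, <2 x i64>* %p,
       <2 x i1> <i1 true, i1 true>, i32 %evl)
   ; the all-ones mask lets this lower via riscv_vse with no vmset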

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D110206
2021-09-22 09:12:47 -07:00
Hongtao Yu d9b511d8e8 [CSSPGO] Set PseudoProbeInserter as a default pass.
Currently PseudoProbeInserter is a pass conditioned on a target switch. It works well with a single clang invocation. It doesn't work so well when the backend is called separately (i.e., through the linker or llc), where the user always has to pass -pseudo-probe-for-profiling explicitly. I'm making the pass a default pass that requires no command line arg to trigger, but it will actually run depending on whether the CU comes with `llvm.pseudo_probe_desc` metadata.

Reviewed By: wenlei

Differential Revision: https://reviews.llvm.org/D110209
2021-09-22 09:09:48 -07:00
Simon Pilgrim b1f38a27f0 [Target][CodeGen] Remove default CostKind arguments on inner/impl TTI overrides
Based off a discussion on D110100, we should be avoiding default CostKinds whenever possible.

This initial patch removes them from the 'inner' target implementation callbacks - these should only be used by the main TTI calls, so this should guarantee that we don't cause changes in CostKind by missing it in an inner call. This exposed a few missing arguments in getGEPCost and reduction cost calls that I've cleaned up.

Differential Revision: https://reviews.llvm.org/D110242
2021-09-22 15:28:08 +01:00
Sander de Smalen 6375ca4059 [AArch64][SVE] Add extract_subvector patterns for unpacked fp16 and bfloat types.
Reviewed By: david-arm

Differential Revision: https://reviews.llvm.org/D110163
2021-09-22 14:25:17 +01:00
Tim Northover 3a00e58c2f AArch64: use indivisible cmpxchg for 128-bit atomic loads at O0
Like normal atomicrmw operations, at -O0 the simple register-allocator can
insert spills into the LL/SC loop if it's expanded and visible when regalloc
runs. This can cause the operation to never succeed by repeatedly clearing the
monitor. Instead expand to a cmpxchg, which has a pseudo-instruction for -O0.
2021-09-22 14:20:43 +01:00
David Green 02cd8a6b91 [ARM] Allow smaller VMOVL in tail predicated loops
This allows VMOVL in tail predicated loops so long as the vector
size the VMOVL is extending into is less than or equal to the size of
the VCTP in the tail predicated loop. These cases represent a
sign-extend-inreg (or zero-extend-inreg), which needn't block tail
predication as in https://godbolt.org/z/hdTsEbx8Y.

For this a vecsize has been added to the TSFlag bits of MVE
instructions, which stores the size of the elements that the MVE
instruction operates on. In the case of multiple sizes (such as an
MVE_VMOVLs8bh that extends from i8 to i16), the largest size is
chosen. The sizes are encoded as 00 = i8, 01 = i16, 10 = i32 and 11 =
i64, which often (but not always) comes from the instruction encoding
directly. A unit test was added, and although only a subset of the
vecsizes are currently used, the rest should be useful for other cases.

Differential Revision: https://reviews.llvm.org/D109706
2021-09-22 12:07:52 +01:00
Sander de Smalen ab3607c0ed [AArch64][SVE] Add missing load/store patterns for unpacked bfloat vectors.
Reviewed By: c-rhodes

Differential Revision: https://reviews.llvm.org/D110063
2021-09-22 09:45:33 +01:00
Jay Foad 0205806d0f [AMDGPU] Convert mac/fmac to mad/fma when folding output modifiers
Use of output modifiers forces VOP3 encoding for a VOP2 mac/fmac
instruction, so we might as well convert it to the more flexible VOP3-
only mad/fma form.

With this change, the only way we should emit VOP3-encoded mac/fmac is
if regalloc chooses registers that require the VOP3 encoding, e.g. sgprs
for both src0 and src1. In all other cases the mac/fmac should either be
converted to mad/fma or shrunk to VOP2 encoding.

Differential Revision: https://reviews.llvm.org/D110156
2021-09-22 09:36:34 +01:00
Jay Foad 3828ea6181 [AMDGPU] Divergence-driven instruction selection for mul i32
Differential Revision: https://reviews.llvm.org/D109881
2021-09-22 09:36:34 +01:00
Matt Arsenault ec55dcedce AMDGPU: Refactor getWavesPerEU to separate flat workgroup size query
Add an overload to pass the flat workgroup range in separately. This
will allow the attributor to use the assumed value for
amdgpu-flat-workgroup-sizes when inferring amdgpu-waves-per-eu.
2021-09-21 22:57:17 -04:00
Chen Zheng ffa9fa9ed2 [PowerPC] prepare for update form with non-const increment.
This is a follow-up of D105872. Now we are able to prepare for update
form with non-const increment.

Reviewed By: jsji

Differential Revision: https://reviews.llvm.org/D106032
2021-09-22 02:54:28 +00:00
Usman Nadeem 645b8f5365 [AArch64][SVE] Add patterns to generate ADR instruction
Differential Revision: https://reviews.llvm.org/D109665

Change-Id: I9d2928688b80b804a16f52928e2057749ec2c0b2
2021-09-21 15:50:49 -07:00
Arthur Eubanks e42234383e Make DiagnosticInfoResourceLimit's limit param required
And always print it.

This makes some LLVM diagnostics match up better with Clang's diagnostics.

Updated some AMDGPU uses of DiagnosticInfoResourceLimit and now we print
better diagnostics for those.

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D110204
2021-09-21 15:27:58 -07:00
Kirill Stoimenov 2649999579 [asan] Fixed a bug causing a crash when redzone optimization kicked in on X86 with -asan-optimize-callbacks flag on.
This change adds the ASan intrinsic to the list which sets hasCopyImplyingStackAdjustment.

Reviewed By: eugenis

Differential Revision: https://reviews.llvm.org/D110012
2021-09-21 22:26:03 +00:00
Craig Topper b81e26c7f4 Recommit "[X86] Clear kill flags when rewriting SETCC uses in flag copy lowering."
This time with the right bug number.

When we rewrite the setcc we replace set old setcc output register
with the new CondReg. But since CondReg can be shared by other
replacements, we don't know if the kill flags for the old register
are valid for CondReg. So be conservative and remove them.

The test case has a SETCCr and a SETCCm on the same condition so
they end up sharing the same CondReg. The SETCCr had one use with
a kill flag. This kill flag isn't valid after the replacement because
CondReg needs a live range extending to the later SETCCm replacement.

Fixes PR51903.
2021-09-21 14:59:25 -07:00
Craig Topper 51a82e051e Revert "[X86] Clear kill flags when rewriting SETCC uses in flag copy lowering."
This reverts commit 7550f146ff.

I botched the bug number.
2021-09-21 14:33:44 -07:00
Craig Topper 7550f146ff [X86] Clear kill flags when rewriting SETCC uses in flag copy lowering.
When we rewrite the setcc we replace set old setcc output register
with the new CondReg. But since CondReg can be shared by other
replacements, we don't know if the kill flags for the old register
are valid for CondReg. So be conservative and remove them.

The test case has a SETCCr and a SETCCm on the same condition so
they end up sharing the same CondReg. The SETCCr had one use with
a kill flag. This kill flag isn't valid after the replacement because
CondReg needs a live range extending to the later SETCCm replacement.

Fixes PR51908.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D110046
2021-09-21 14:29:46 -07:00
alex-t 1a33294652 [AMDGPU] Filtering out the inactive lanes bits when lowering copy to SCC
Normally, given that the DA results are kept consistent over the selection DAG, uniform comparisons get selected to S_CMP_* but divergent ones to V_CMP_*. Sometimes, for the sake of efficiency, SSA subgraphs may be converted to VALU to avoid repeatedly copying data back and forth. Hence we must maintain correctness when passing an i1 from the VALU to the SALU context and vice versa.

VALU operations only process the active lanes of the VGPR and ignore inactive ones.
Active lanes correspond to 1 bits in the EXEC mask register.
SALU represents an i1 as just one bit, but VALU uses 64 bits: 0/1 and 0/(0xffffffffffffffff & EXEC) respectively.
SALU uses the one-bit conditional flag SCC, but VALU uses VCC, which is a pair of 32-bit SGPRs.

To expose SCC to the VALU context we need to convert the one-bit boolean value to the appropriate 64-bit value. To return to the SALU context we do the opposite.

To correctly convert a 64-bit VALU boolean to either 0 or 1, we need to filter out the bits corresponding to the inactive lanes.
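A rough sketch of the two directions (the instruction choice here is
illustrative, not lifted from the patch):

   ; SCC -> VALU context: materialize a 64-bit lane mask from the 1-bit SCC
   s_cselect_b64  s[0:1], -1, 0
   ; VALU -> SCC context: filter out inactive-lane bits before testing
   s_and_b64      s[0:1], vcc, exec
   s_cmp_lg_u64   s[0:1], 0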

Reviewed By: piotr

Differential Revision: https://reviews.llvm.org/D109900
2021-09-21 21:19:31 +03:00
Brendon Cahoon cbdf624bb8 [AMDGPU] Correctly merge alias.scope and noalias metadata for memops
When adding alias.scope and noalias metadata to a memcpy function,
the alias.scope and noalias metadata from the operands are merged.
The rule for merging alias.scope is to take the intersection of
the domains and the union of the scopes within those domains.
The rule for merging noalias is to take the intersection.

The bug is that AMDGPULowerModuleLDS was using concatenation for
both alias.scope and noalias. For example, suppose f1 and f2 are added
to the LDS structure and there is a memcpy(f2, f1, sizeof(f1)).
Concatenation then creates noalias metadata for the memcpy that
includes both {f1, f2}. That means that the memcpy is assumed
not to alias a prior load of f2, which enables the optimizer to
remove a load of f2 that occurs after the memcpy.

The function MDNode::getMostGenericAliasScope defines the semantics
for alias.scope. There is a function, combineMetadata in Local.cpp,
that uses intersect for noalias.
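An illustrative sketch of the difference (scope names chosen for exposition):
if the source operand carries noalias !{!f1} and the destination carries
noalias !{!f2}, the correct merged noalias set is the intersection (empty
here), whereas concatenation produces !{!f1, !f2} and wrongly asserts that
the memcpy aliases neither f1 nor f2.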

Differential Revision: https://reviews.llvm.org/D110049
2021-09-21 13:02:01 -05:00
Craig Topper 7c975665b4 [RISCV] Make some arrays of constants 'static const'. NFC
This helps the compiler generate better code.
2021-09-21 10:52:47 -07:00
Amy Kwan 2af57b6099 [PowerPC] Add prefix load pattern for fpext to v2f64
This patch adds a prefixed load pattern involving v2f32 fpext to v2f64, where we
are dealing with a value whose offset fits into a 34-bit signed immediate.
A reduced test case that exercises the pattern is also added, with the
pattern checked in the big endian CHECKs of the newly added test.

Differential Revision: https://reviews.llvm.org/D109887
2021-09-21 12:45:24 -05:00
Craig Topper aeb63d464f [RISCV] Teach RISCVTargetLowering::shouldSinkOperands to sink splats for and/or/xor.
This requires a minor change to CodeGenPrepare to ensure that
shouldSinkOperands will be called for And.

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D110106
2021-09-21 10:07:29 -07:00
Dmitry Preobrazhensky 3500e7d2b0 [AMDGPU][MC][GFX7][GFX10] Corrected image_atomic_fcmpswap
Differential Revision: https://reviews.llvm.org/D109616
2021-09-21 18:06:02 +03:00
Ben Shi b3052013b4 [RISCV] Optimize (add (mul x, c0), c1)
Optimize (add (mul x, c0), c1) -> (ADDI (MUL (ADDI, c1/c0), c0), c1%c0),
if c1/c0 and c1%c0 are simm12, while c1 is not.

Optimize (add (mul x, c0), c1) -> (MUL (ADDI, c1/c0), c0),
if c1%c0 is zero, and c1/c0 is simm12 while c1 is not.
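A worked illustration (constants chosen here, not from the commit): with
c0=100 and c1=4097 (not simm12), both c1/c0=40 and c1%c0=97 are simm12, so
   (add (mul x, 100), 4097) -> (ADDI (MUL (ADDI x, 40), 100), 97)
since (x+40)*100 + 97 = 100*x + 4097.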

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D108607
2021-09-21 14:13:14 +00:00
Dmitry Preobrazhensky b8e7f53208 [AMDGPU][MC][GFX10] Enabled dlc for FLAT and GLOBAL atomics
Differential Revision: https://reviews.llvm.org/D109614
2021-09-21 16:23:20 +03:00
Jonas Paulsson a48b43f981 [SystemZ] Emit EXRL target instructions before text section is ended.
SystemZ adds the EXRL target instructions at the end of each file. This must
be done before debug info emission since that may end the text section, and
therefore this is now done in emitConstantPools() (instead of in
emitEndOfAsmFile).

Review: Ulrich Weigand

Differential Revision: https://reviews.llvm.org/D109513
2021-09-21 14:32:28 +02:00
Nicholas Guy 9e4d72675f [AArch64] Improve schedule modelling on the Cortex-A55
Enables the FuseAddress feature in the Cortex-A55 scheduling model

Differential Revision: https://reviews.llvm.org/D109323
2021-09-21 13:03:34 +01:00
Jay Foad 598bebeaa6 [AMDGPU] Prefer fmac over fma when selecting FMA_W_CHAIN
FMA_W_CHAIN is used when lowering fdiv f32. Prefer to select it to fmac
if there are no source modifiers, just like we do for other mad/mac and
fma/fmac cases.

Differential Revision: https://reviews.llvm.org/D110074
2021-09-21 11:57:45 +01:00
Jay Foad 86dcb59206 [AMDGPU] Prefer v_fmac over v_fma only when no source modifiers are used
v_fmac with source modifiers forces VOP3 encoding, but it is strictly
better to use the VOP3-only v_fma instead, because $dst and $src2 are
not tied so it gives the register allocator more freedom and avoids a
copy in some cases.

This is the same strategy we already use for v_mad vs v_mac and
v_fma_legacy vs v_fmac_legacy.

Differential Revision: https://reviews.llvm.org/D110070
2021-09-21 11:57:45 +01:00
Cullen Rhodes b23d22f7d5 [PowerPC] NFC: Remove unused tblgen template args
Identified in D109359.

Reviewed By: nemanjai

Differential Revision: https://reviews.llvm.org/D109715
2021-09-21 08:24:16 +00:00
Yonghong Song ea72b0319d BPF: make 32bit register spill with 64bit alignment
In llvm, for non-alu32 mode, the stack alignment is 64bit so only one
64bit spill per 64bit slot. For alu32 mode, the stack alignment
is 32bit, so it is possible to have two 32bit spills per
64bit slot.

Currently, bpf kernel verifier does not preserve register states
for 32bit spills. That is, one 32bit register may hold a constant
value or a bounded range before spill. After reload from the
stack, the information is lost and sometimes this may cause
verifier failure. For 64bit register spill, the verifier
indeed tries to preserve the register state for reloading.

The current verifier can be modestly changed to handle one
32bit spill per 64bit stack slot with state-preserving reload.
Handling two 32bit spills per 64bit stack slot will require
substantial changes.

This patch changes stack alignment for alu32 to be 64bit.
This way, for any 64bit slot in alu32 mode, only one
32bit or 64bit register value can be saved. Together
with the previously-mentioned verifier enhancement, 32bit
spills can be handled with state preserving.

Note that llvm stack slot coalescing
seems to only do adjacent packing, which may leave some holes
in the stack. For example,
   stack slot 8   <== 8 bytes
   stack slot 4   <== 8 bytes with 4 byte hole
   stack slot 8   <== 8 bytes
   stack slot 4   <== 4 bytes

Differential Revision: https://reviews.llvm.org/D109073
2021-09-20 21:00:25 -07:00
Kazu Hirata 85b4b21c8b [llvm] Use make_early_inc_range (NFC) 2021-09-20 19:30:02 -07:00
Amara Emerson 4ceea77409 [X86] Rename the X86WinAllocaExpander pass and related symbols to "DynAlloca". NFC.
For x86 Darwin, we have a stack checking feature which re-uses some of this
machinery around stack probing on Windows. Renaming this to be more appropriate
for a generic feature.

Differential Revision: https://reviews.llvm.org/D109993
2021-09-20 16:19:28 -07:00
Jacob Lambert dc6e8dfdfe [AMDGPU][NFC] Correct typos in lib/Target/AMDGPU/AMDGPU*.cpp files. Test commit for new contributor. 2021-09-20 14:48:50 -07:00
Craig Topper a95ba81073 [RISCV] Teach RISCVTargetLowering::shouldSinkOperands to sink splats for FMA.
If either of the multiplicands is a splat, we can sink it to use
vfmacc.vf or similar.
2021-09-20 11:49:50 -07:00
Craig Topper 04ab6c85ef [RISCV] Teach RISCVTargetLowering::shouldSinkOperands to sink splats for FAdd/FSub/FMul/FDiv. 2021-09-20 10:25:46 -07:00
Craig Topper d85e347a28 [RISCV] Add a pass to recognize VLS strided loads/store from gather/scatter.
For strided accesses the loop vectorizer seems to prefer creating a
vector induction variable with a start value of the form
<i32 0, i32 1, i32 2, ...>. This value will be incremented each
loop iteration by a splat constant equal to the length of the vector.
Within the loop, arithmetic using splat values will be done on this
vector induction variable to produce indices for a vector GEP.

This pass attempts to dig through the arithmetic back to the phi
to create a new scalar induction variable and a stride. We push
all of the arithmetic out of the loop by folding it into the start,
step, and stride values. Then we create a scalar GEP to use as the
base pointer for a strided load or store using the computed stride.
Loop strength reduce will run after this pass and can do some
cleanups to the scalar GEP and induction variable.
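A hedged C sketch of the kind of loop this recognizes (the stride is chosen
for illustration):

   // The vectorizer indexes a[i*5] with a vector IV starting at <0,1,2,...>
   // and multiplied by a splat of 5, feeding a gather; this pass digs the
   // scalar base and the constant stride (5 * sizeof(int)) back out so a
   // strided load can be used instead.
   int sum(const int *a, int n) {
     int s = 0;
     for (int i = 0; i < n; ++i)
       s += a[i * 5];
     return s;
   }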

Reviewed By: frasercrmck

Differential Revision: https://reviews.llvm.org/D107790
2021-09-20 09:39:44 -07:00
David Green 3f90df22f1 [ARM] MVE reverse shuffles.
The vectorizer can sometimes make reverse shuffles from indices that
count down. In MVE, we don't have a 128bit rev instruction, but we can
select this to a VREV64 with some lane movs to swap the two halves.

Ideally this would use VMOVD's, but only gets as far as VMOVS's at the
moment.

Differential Revision: https://reviews.llvm.org/D69510
2021-09-20 13:48:01 +01:00
Simon Pilgrim 4ab7c0d3fa [X86] X86TargetTransformInfo - remove unnecessary if-else after early exit. NFCI.
(style) Break the if-else chain as they all return.
2021-09-20 12:53:17 +01:00
Petar Avramovic e4c46ddd91 [GlobalISel] Improve elimination of dead instructions in legalizer
Add eraseInstr(s) utility functions. Before deleting an instruction, they
collect its use instructions. After deletion, they delete any use
instructions that became trivially dead.
This patch clears all dead instructions in existing legalizer mir tests.

Differential Revision: https://reviews.llvm.org/D109154
2021-09-20 13:00:58 +02:00
Tim Northover 13aa102e07 AArch64: use ldp/stp for 128-bit atomic load/store in v8.4 onwards
v8.4 says that normal 128-bit loads/stores are single-copy atomic if
they're properly aligned (which all LLVM atomics are) so we no longer need to
do a full RMW operation to guarantee we got a clean read.
2021-09-20 09:50:11 +01:00
David Spickett 92c9b28347 Revert "[AArch64][SVE] Teach cost model that masked loads/stores are cheap"
This reverts commit 734708e04f.

Due to build failures on the 2 stage SVE VLS bot.
https://lab.llvm.org/buildbot/#/builders/176/builds/908/steps/11/logs/stdio
2021-09-20 08:45:18 +00:00
Simon Pilgrim 0e89ff8195 [X86] SimplifyDemandedBits - only narrow a broadcast source if we only have one use.
Helps with the regression noted on D109065 - don't truncate a broadcast source if the source has multiple uses.
2021-09-19 22:53:30 +01:00
Kazu Hirata 84b07c9b3a [llvm] Use pop_back_val (NFC) 2021-09-19 13:44:23 -07:00
Simon Pilgrim f855ef2601 [X86][Atom] Fix FP uops + port usage
Both ports are required in most cases. Update the uops counts + port usage based off the most recent llvm-exegesis captures (PR36895) and what Intel AoM / Agner / InstLatX64 reports as well.

Noticed while trying to improve fp costs for vectorization via the D103695 helper script.
2021-09-19 20:39:20 +01:00
Simon Pilgrim b7342e3137 [X86] Fold SHUFPS(shuffle(x),shuffle(y),mask) -> SHUFPS(x,y,mask')
We can combine unary shuffles into either of SHUFPS's inputs and adjust the shuffle mask accordingly.

Unlike general shuffle combining, we can be more aggressive and handle multiuse cases as we're not going to accidentally create additional shuffles.
2021-09-19 20:39:19 +01:00
Simon Pilgrim cf8fac7d07 [X86][Atom] Specific uops for all IMUL/IDIV instructions
Based off a mixture of llvm-exegesis captures (PR36895) and Intel AoM / Agner / InstLatX64 reports.
2021-09-19 16:58:52 +01:00
Roman Lebedev 5f2fe48d06 [X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart broadcasts from which not just the 0'th elt is demanded
Apparently this has no test coverage before D108382,
but D108382 itself shows a few regressions that this fixes.

It doesn't seem worthwhile breaking apart broadcasts,
assuming we want the broadcasted value to be present in several elements,
not just the 0'th one.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108411
2021-09-19 17:38:32 +03:00
Roman Lebedev 07f1d8f0ca [X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
Split off from D108253.
Broadcast is simpler than any other shuffle we might produce
to do what we want to do here, so prefer it.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D108382
2021-09-19 17:35:37 +03:00
Roman Lebedev 0852313e47 [NFC] combineX86ShufflesRecursively(): actually address nits for previous patch 2021-09-19 17:29:18 +03:00
Roman Lebedev 1e72ca94e5 [X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() after finishing recursing
This was suggested in https://reviews.llvm.org/D108382#inline-1039018,
and it avoids regressions in that patch.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109065
2021-09-19 17:24:58 +03:00
David Green 1da52ef294 [ARM] Add VGETLANEu patterns for v4f16 and v8f16
These were apparently missing, having no pattern that could convert a
VGETLANEu of a v4f16 to an i32. Added bf16 whilst here, following the
same code.
2021-09-19 14:25:21 +01:00
Simon Pilgrim e381d8b243 [X86][Atom] Fix (U)COMISS/SD uops, latency and throughput
Both ports are required, for reg and mem variants - we can also use the WriteFComX class directly and remove the unnecessary InstRW overrides. Matches what Intel AoM / Agner / InstLatX64 report as well.
2021-09-19 12:44:44 +01:00
Ben Shi dee5a8ca32 [RISCV] Optimize (add (shl x, c0), (shl y, c1)) with SH*ADD
Optimize (add (shl x, c0), (shl y, c1)) ->
         (SLLI (SH*ADD x, y), c1), if c0-c1 == 1/2/3.
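A worked illustration (constants chosen here, not from the commit): with c0=5
and c1=3, c0-c1=2, so
   (add (shl x, 5), (shl y, 3)) -> (SLLI (SH2ADD x, y), 3)
since (x<<5) + (y<<3) = ((x<<2) + y) << 3.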

Reviewed By: craig.topper, luismarques

Differential Revision: https://reviews.llvm.org/D108916
2021-09-19 16:35:12 +08:00
Roman Lebedev 6a2c2263fb [X86] Improve i8 all-ones element insertion in pre-SSE4.1
Should avoid some regressions in D109065

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109989
2021-09-18 22:24:06 +03:00
David Green cb5e3f7959 [ARM] Prevent large integer VQDMULH pattern crashes
Put a limit on the size of constant integers we test when looking for
VQDMULH, to prevent it from crashing on values wider than 64 bits.
2021-09-18 18:47:02 +01:00
Usman Nadeem 757384abff [AArch64][SVE][InstCombine] Fold redundant zip1/2(uzp1/2) operations
zip1(uzp1(A, B), uzp2(A, B)) --> A
    zip2(uzp1(A, B), uzp2(A, B)) --> B
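A minimal IR sketch of the first fold (the element type is illustrative):

   %u1 = call <vscale x 4 x i32> @llvm.aarch64.sve.uzp1.nxv4i32(<vscale x 4 x i32> %A, <vscale x 4 x i32> %B)
   %u2 = call <vscale x 4 x i32> @llvm.aarch64.sve.uzp2.nxv4i32(<vscale x 4 x i32> %A, <vscale x 4 x i32> %B)
   %z  = call <vscale x 4 x i32> @llvm.aarch64.sve.zip1.nxv4i32(<vscale x 4 x i32> %u1, <vscale x 4 x i32> %u2)
   ; %z is just %A, so all three calls fold away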

Differential Revision: https://reviews.llvm.org/D109666

Change-Id: I4a6578db2fcef9ff71ad0e77b9fe08354e6dbfcd
2021-09-17 15:24:46 -07:00
Roman Lebedev 358df06f4e [X86] Improve `matchBinaryShuffle()`'s `BLEND` lowering with per-element all-zero/all-ones knowledge
We can use `OR` instead of `BLEND` if either the element we are not picking is zero (or masked away);
or the element we are picking overwhelms (e.g. it's all-ones) whatever the element we are not picking:
https://alive2.llvm.org/ce/z/RKejao

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D109726
2021-09-17 19:13:33 +03:00
Simon Pilgrim db23f27786 [X86] X86PreTileConfig - Use const-ref iterator in for-range loop. NFCI.
Avoid unnecessary copies, reported by MSVC static analyzer.
2021-09-17 14:04:53 +01:00
Simon Pilgrim 5ebe95e256 [X86][Atom] Fix integer shuffles uops, latency and throughput
The MMX pack/unpck shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only).
The SSE pslldq/psrldq shuffles don't need an override - they have the same behaviour as other shuffles (Port0 only).
The SSE pshufb shuffles use 4uops (+1 load).

Noticed the pslldq/psrldq issue while trying to improve reduction costs via the D103695 helper script, and fixed the others while reviewing. Confirmed with Intel AoM / Agner / InstLatX64.
2021-09-17 12:11:54 +01:00
Jonas Paulsson 1a5ab3e97c [SystemZ] Recognize .machine directive in parser.
The .machine directive can be used in assembly files to specify the ISA for
the instructions following it.

Review: Ulrich Weigand
Differential Revision: https://reviews.llvm.org/D109660
2021-09-17 12:03:54 +02:00
Petar Avramovic d477a7c2e7 GlobalISel/Utils: Refactor integer/float constant match functions
Rework getConstantVRegValWithLookThrough in order to make it clear if we
are matching an integer/float constant only or any constant (default).
Add helper functions that get the DefVReg and APInt/APFloat from a constant instr:
getIConstantVRegValWithLookThrough: integer constant, only G_CONSTANT
getFConstantVRegValWithLookThrough: float constant, only G_FCONSTANT
getAnyConstantVRegValWithLookThrough: either G_CONSTANT or G_FCONSTANT

Rename getConstantVRegVal and getConstantVRegSExtVal to getIConstantVRegVal
and getIConstantVRegSExtVal. These now only match G_CONSTANT as described
in comment.

Relevant matchers now return both DefVReg and APInt/APFloat.
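A hedged C++ sketch of the renamed integer matcher in use (surrounding code
assumed):

   // Matches only a G_CONSTANT reaching Reg, looking through copies; the
   // result carries both the constant and its defining vreg.
   if (auto VRegAndVal = getIConstantVRegValWithLookThrough(Reg, MRI)) {
     APInt Imm = VRegAndVal->Value;
     Register Def = VRegAndVal->VReg;
     // ... use Imm / Def ...
   }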

Replace existing uses of getConstantVRegValWithLookThrough and
getConstantVRegVal with new helper functions. Any constant match is
only required in:
ConstantFoldBinOp: for constant argument that was bit-cast of float to int
getAArch64VectorSplat: AArch64::G_DUP operands can be any constant
amdgpu select for G_BUILD_VECTOR_TRUNC: operands can be any constant

In other places use integer only constant match.

Differential Revision: https://reviews.llvm.org/D104409
2021-09-17 11:22:13 +02:00
Chen Zheng 80584f0056 Revert "[PowerPC][ELF] make sure local variable space does not overlap with parameter save area"
This causes mis-compile issues on PowerPC Linux.

This reverts commit 324bd467a2.
2021-09-17 08:07:18 +00:00
Jacob Lambert 4c1023b4b7 [AMDGPU] NFC: Fixing small spelling errors in AMDGPU header files
Nonfunctional commit fixing several minor spelling errors in llvm/lib/Target/AMDGPU header files.
Testing workflow as a new contributor.

Differential Revision: https://reviews.llvm.org/D109733
2021-09-16 13:03:09 -07:00
Craig Topper 73e5b9ea90 [RISCV] Select (srl (sext_inreg X, i32), uimm5) to SRAIW if only lower 32 bits are used.
SimplifyDemandedBits can turn sra into srl if the bits being shifted
in aren't demanded. This patch can recover the original sra in some cases.
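A hedged C sketch of the affected shape (illustrative):

   // (int)x is a sext_inreg of the 64-bit register; the i32 result demands
   // only the low 32 bits, so DAG combine may turn the sra into an srl;
   // this patch still selects sraiw here.
   int f(long x) { return (int)x >> 5; }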

I've renamed the tablegen class for detecting W users since the "overflowing operator"
term I originally borrowed from Operator.h does not include srl.

Reviewed By: luismarques

Differential Revision: https://reviews.llvm.org/D109162
2021-09-16 11:03:35 -07:00
Vang Thao 106959acc1 [AMDGPU] Inline non-kernel functions using extern lds
In https://reviews.llvm.org/D100481, forceful inlining of all non-kernel
functions using lds was disabled since the AMDGPULowerModuleLDS pass now handles
static lds. However, that pass does not handle extern lds, so non-kernel
functions using extern lds must still be inlined.

Reviewed By: hsmhsm, arsenm

Differential Revision: https://reviews.llvm.org/D109773
2021-09-16 10:58:51 -07:00
Kazu Hirata cfc7402419 [llvm] Use drop_begin (NFC) 2021-09-16 08:46:26 -07:00
Doug Gregor a773db7d76 Add a command-line flag to control the Swift extended async frame info.
Introduce a new command-line flag `-swift-async-fp={auto|always|never}`
that controls how code generation sets the Swift extended async frame
info bit. There are three possibilities:

* `auto`: determine how to set the bit based on the deployment target, either
statically or dynamically via `swift_async_extendedFramePointerFlags`.
* `always`: the default, always set the bit statically, regardless of deployment
target.
* `never`: never set the bit, regardless of deployment target.

Patch by Doug Gregor <dgregor@apple.com>

Reviewed By: doug.gregor

Differential Revision: https://reviews.llvm.org/D109392
2021-09-16 06:57:45 -07:00
Alexandros Lamprineas 1bd5ea968e [ARM] Mitigate the cve-2021-35465 security vulnerability.
Recently a vulnerability was found in the implementation of the VLLDM
instruction in the Arm Cortex-M33, Cortex-M35P and Cortex-M55. If the
VLLDM instruction is abandoned due to an exception when it is partially
completed, it is possible for a subsequent non-secure handler to access
and modify the partially restored register values. This vulnerability is
identified as CVE-2021-35465.

The mitigation sequence varies between v8-m and v8.1-m as follows:

v8-m.main
---------
mrs        r5, control
tst        r5, #8       /* CONTROL_S.SFPA */
it         ne
.inst.w    0xeeb00a40   /* vmovne s0, s0 */
1:
vlldm      sp           /* Lazy restore of d0-d16 and FPSCR. */

v8.1-m.main
-----------
vscclrm    {vpr}        /* Clear VPR. */
vlldm      sp           /* Lazy restore of d0-d16 and FPSCR. */

More details on
developer.arm.com/support/arm-security-updates/vlldm-instruction-security-vulnerability

Differential Revision: https://reviews.llvm.org/D109157
2021-09-16 12:56:43 +01:00
Alexandros Lamprineas 61f25daa8d [ARM][CMSE] Clear the secure fp-registers when using softfp abi.
When expanding the non-secure call instruction we only emit code
to clear the secure floating-point registers if the targeted
architecture has floating-point support. The potential problem is
when the source code containing non-secure calls is built with
-mfloat-abi=soft but some other part of the system has been built
with -mfloat-abi=softfp (soft and softfp are compatible as they use
the same procedure calling standard). In this case floating-point
registers could leak to non-secure state, as the non-secure side
won't have cleared them on the assumption that no floating point
was used.

Differential Revision: https://reviews.llvm.org/D109153
2021-09-16 12:56:43 +01:00
Cullen Rhodes 17f1ccc759 [AArch64][SVE] NFC: Remove unnecessary if 2021-09-16 11:26:46 +00:00
Simon Pilgrim 1ef62cb200 [X86] SimplifyDemandedVectorEltsForTargetNode - add PSADBW handling
Peek through PSADBW operands to handle non-demanded elements.
2021-09-16 11:28:31 +01:00
Jay Foad 128a49727a [AMDGPU] Fix upcoming TableGen warnings on unused template arguments. NFC.
The warning is implemented by D109359 which is still in review.

Differential Revision: https://reviews.llvm.org/D109826
2021-09-16 09:07:18 +01:00
Jessica Paquette c8b3d7d6d6 [AArch64][GlobalISel] Ensure atomic loads always get assigned GPR destinations
The default register bank selection code for G_LOAD assumes that we ought to
use a FPR when the load is casted to a float/double.

For atomics, this isn't true; we should always use GPRs.

Without this patch, we crash in the following example:

https://godbolt.org/z/MThjas441

Also make the code a little more stylistically consistent while we're here.

Also test some other weird cast combinations as well.

Differential Revision: https://reviews.llvm.org/D109771
2021-09-15 17:05:09 -07:00
Ahmed Bougacha e159d3cbfc [AArch64][GlobalISel] Use MI::getIntrinsicID in more spots. NFC.
There's technically a difference in the logic used by
findIntrinsicID and MachineInstr::getIntrinsicID, but it shouldn't
be a meaningful difference here, with G_INTRINSIC instructions.
getIntrinsicID's "first non-def" logic should be correct for those.
2021-09-15 16:45:34 -07:00
Simon Pilgrim 0767e43d87 [CostModel][X86] Adjust bitreverse/ctpop/ctlz/cttz AVX2+ costs based on llvm-mca reports
Based off the worse case numbers generated by D103695, the AVX2/512 bit reversing/counting costs were higher than necessary (based off instruction counts instead of actual throughput).
2021-09-15 13:04:40 +01:00
Martin Storsjö b33a43e57c [ARM] Move fetching of ARMSubtarget into the scopes that need it. NFC.
This was requested in D38253, but missed back then.

Differential Revision: https://reviews.llvm.org/D109046
2021-09-15 15:03:20 +03:00
David Green a2332d5332 [ARM] Prevent continuous folding of SUBC
Under some situations under Thumb1, we could be stuck in an infinite
loop recombining the same instruction. This puts a limit on that, not
combining SUBC with SUBE repeatedly.
2021-09-15 11:23:32 +01:00
Simon Pilgrim dcba994184 [X86] combineX86ShuffleChain - ensure we only peek through bitcasts to vectors (PR51858)
When searching for hidden identity shuffles (added at rG41146bfe82aecc79961c3de898cda02998172e4b), only peek through bitcasts to the source operand if it is a vector type as well.
2021-09-15 10:21:05 +01:00
Simon Atanasyan 533471ff2f [MIPS] Remove unused tblgen template args. NFC
Identified in D109359.
2021-09-15 12:16:07 +03:00
Cullen Rhodes 18655140d6 [NVPTX] NFC: Remove unused imm type intrinsic arg
Identified in D109359.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D109755
2021-09-15 08:56:51 +00:00
Xiang1 Zhang 1f1c71aeac [X86][InlineAsm] Use mem size information (*word ptr) for "global variable + registers" memory expression in inline asm.
Differential Revision: https://reviews.llvm.org/D109739
2021-09-15 16:11:14 +08:00