llvm-project

Commit Graph

Author	SHA1	Message	Date
Vitaly Buka	16fecdfa70	Revert "[AArch64] Add `foldCSELOfCSEl` DAG combine" Breaks ubsan on buildbot, details in D125504 This reverts commit `6f9423ef06`.	2022-08-16 20:29:37 -07:00
Karl Meakin	6f9423ef06	[AArch64] Add `foldCSELOfCSEl` DAG combine Differential Revision: https://reviews.llvm.org/D125504	2022-08-16 12:49:11 +01:00
Zain Jaffal	7155ed4289	[AArch64] Add support for 256-bit non temporal loads Currently all temporal loads are mapped to `LDP` or `LDR`. This patch will map all the non temporal 256-bit loads into `LDNP`. Future patches should address other non-temporal loads. Reviewed By: fhahn, dmgreen Differential Revision: https://reviews.llvm.org/D131773	2022-08-16 12:19:36 +01:00
Vitaly Buka	e0e960923f	[AArch64] Fix signed integer overflow in CSINC case Followup to D131815, which overflows on different values.	2022-08-15 15:04:20 -07:00
Vitaly Buka	f1596952f9	[AArch64] Fix signed integer overflow in CSINC case https://lab.llvm.org/staging/#/builders/224/builds/2/steps/16/logs/stdio Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D131815	2022-08-13 13:12:09 -07:00
David Green	a9e9dd9a3a	[AArch64] Add bf16 select handling A bfloat select operation will currently crash, but is allowed from C. This adds handling for the operation, turning it into a FCSELHrrr if fullfp16 is present, or converting it to a FCSELSrrr if not. The FCSELSrrr is created via using INSERT_SUBREG/EXTRACT_SUBREG to convert the bf16 to a f32 and using the f32 pattern for FCSELSrrr. (I originally attempted to do this via a tablegen pattern, but it appears that the nzcv glue is placed onto the wrong node, causing it to be forgotten and incorrect scheduling to be emitted). The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not present, which helps avoid an unnecessary promotion to f32. Differential Revision: https://reviews.llvm.org/D131253	2022-08-11 14:20:36 +01:00
David Truby	b1b9c39629	[AArch64][SVE] Use SVE for VLS fcopysign for wide vectors Currently fcopysign for VLS vectors lowers through NEON even when the vector width is wider than a NEON vector, causing bad codegen as the vectors are split. This patch causes SVE to be used for these vectors instead, giving much better codegen on wide VLS vectors. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D128642	2022-08-10 10:17:19 +00:00
Fangrui Song	de9d80c1c5	[llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC With C++17 there is no Clang pedantic warning or MSVC C5051.	2022-08-08 11:24:15 -07:00
David Truby	9a976f3661	[llvm] Always use TargetConstant for FP_ROUND ISD Nodes This patch ensures consistency in the construction of FP_ROUND nodes such that they always use ISD::TargetConstant instead of ISD::Constant. This additionally fixes a bug in the AArch64 SVE backend where patterns were matching against TargetConstant nodes and sometimes failing when passed a Constant node. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130370	2022-08-03 14:02:11 +01:00
David Green	1206f72e31	[AArch64] Fold Mul(And(Srl(X, 15), 0x10001), 0xffff) to CMLTz This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16 CMLTz instruction. The Srl and And extract the top bit (whether the input is negative) and the Mul sets all values in the i16 half to all 1/0 depending on if that top bit was set. This is equivalent to a v8i16 CMLTz instruction. The same applies to other sizes with equivalent constants. Differential Revision: https://reviews.llvm.org/D130874	2022-08-02 13:01:59 +01:00
Kazu Hirata	12b29900a1	Use any_of (NFC)	2022-07-30 10:35:56 -07:00
Matt Devereau	a8b726ac65	[AArch64][SVE] Change DupLane128Combine Index comparison to 0 IdxInsert == IdxDupLane is incorrect. IdxInsert is the starting element number, whereas IdxIndex is the index of a quadword	2022-07-29 14:31:00 +00:00
David Sherwood	487fa6f8c3	[AArch64][DAGCombine] Add performBuildVectorCombine 'extract_elt ~> anyext' A build vector of two extracted elements is equivalent to an extract subvector where the inner vector is any-extended to the extract_vector_elt VT, because extract_vector_elt has the effect of an any-extend. (build_vector (extract_elt_i16_to_i32 vec Idx+0) (extract_elt_i16_to_i32 vec Idx+1)) => (extract_subvector (anyext_i16_to_i32 vec) Idx) Depends on D130697 Differential Revision: https://reviews.llvm.org/D130698	2022-07-29 09:51:09 +01:00
Amara Emerson	65246d3eb4	Use hasNItemsOrLess() in MRI::hasAtMostUserInstrs().	2022-07-27 11:42:14 -07:00
Mingming Liu	34348814e1	[AArch64] Explicitly use v1i64 type for llvm.aarch64.neon.pmull64 Without this, the intrinsic will be expanded to an integer; thereby an explicit copy (from GPR to SIMD register) will be codegen'd. This matches the general convention of using "v1" types to represent scalar integer operations in vector registers. The similar approach is observed in D56616, and the pattern likely applies on other intrinsic that accepts integer scalars (e.g., int_aarch64_neon_sqdmulls_scalar) Differential Revision: https://reviews.llvm.org/D130548	2022-07-27 11:11:16 -07:00
Amara Emerson	19cdd1908b	[AArch64][GlobalISel] Add heuristics for localizing G_CONSTANT. This adds similar heuristics to G_GLOBAL_VALUE, querying the cost of materializing a specific constant in code size. Doing so prevents us from sinking constants which require multiple instructions to generate into use blocks. Code size savings on CTMark -Os: Program size.__text before after diff ClamAV/clamscan 381940.00 382052.00 0.0% lencod/lencod 428408.00 428428.00 0.0% SPASS/SPASS 411868.00 411876.00 0.0% kimwitu++/kc 449944.00 449944.00 0.0% Bullet/bullet 463588.00 463556.00 -0.0% sqlite3/sqlite3 284696.00 284668.00 -0.0% consumer-typeset/consumer-typeset 414492.00 414424.00 -0.0% 7zip/7zip-benchmark 595244.00 594972.00 -0.0% mafft/pairlocalalign 247512.00 247368.00 -0.1% tramp3d-v4/tramp3d-v4 372884.00 372044.00 -0.2% Geomean difference -0.0% Differential Revision: https://reviews.llvm.org/D130554	2022-07-27 10:51:16 -07:00
Paul Walker	e5c892dd85	[SVE][SelectionDAG] Use INDEX to generate matching instances of BUILD_VECTOR. This patch starts small, only detecting sequences of the form <a, a+n, a+2n, a+3n, ...> where a and n are ConstantSDNodes. Differential Revision: https://reviews.llvm.org/D125194	2022-07-26 15:28:37 +00:00
Sander de Smalen	a41ddf178e	[AArch64][SVE] Sink ptrue into loop if it is used by PTEST. This helps fold away the ptest instructions, which needs the knowledge on whether the general predicate is known to zero the inactive lanes. This fixes some PTEST regressions introduced by D129282. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D129852	2022-07-26 15:07:41 +01:00
Sander de Smalen	370ff43a15	[AArch64][SVE] Consider more intrinsics in 'isZeroingInactiveLanes'. This fixes some PTEST regressions introduced by D129282. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D129851	2022-07-26 15:07:41 +01:00
Bradley Smith	953a98ef8d	[AArch64][SVE] Fold target specific ext/trunc nodes into loads/stores Due to the way fixed length SVE lowering works, we sometimes introduce ext/trunc nodes very late, these nodes then immediately get converted into target specific nodes (UUNPKLO/UZP1) before they get a chance to be folded into a load/store. This patch introduces target specific dag combines for these nodes so that we can still create extending loads/truncating stores out of them. Differential Revision: https://reviews.llvm.org/D128065	2022-07-25 15:24:05 +00:00
Cullen Rhodes	c04ff587dc	[AArch64] Combine setcc (iN (bitcast (vNi1 X))) with vecreduce_or Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130163	2022-07-25 12:14:33 +00:00
Rosie Sumpter	034a27e688	[AArch64] Add f16 fpimm patterns This patch recognizes f16 immediates as legal and adds the necessary patterns. This allows the fadda folding introduced in `05d424d165` to be applied to the f16 cases. Differential Revision: https://reviews.llvm.org/D129989	2022-07-25 09:08:10 +01:00
Simon Pilgrim	939cf9b1be	[AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation As noticed on D129765 and reported on Issue #56531 - aarch64 targets can use the neon ctpop + add-reduce instructions to speed up scalar ctpop instructions, but we fail to do this for parity calculations. I'm not sure where the cutoff should be for specific CPUs, but i64 (+ i128 special case) shows a definite reduction in instruction count. i32 is about the same (but scalar <-> neon transfers are probably more costly?), and sub-i32 promotion looks to be a definite regression compared to parity expansion optimized for those widths. Differential Revision: https://reviews.llvm.org/D130246	2022-07-22 17:24:17 +01:00
Cullen Rhodes	bf268a05cd	[AArch64] Emit vector FP cmp when LE is used with fast-math Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130093	2022-07-22 07:53:55 +00:00
Matt Devereau	cd3d7bf15d	[AArch64][SVE] Add DAG-Combine to push bitcasts from floating point loads after DUPLANE128 This patch lowers duplane128(insert_subvector(undef, bitcast(op(128bitsubvec)), 0), 0) to bitcast(duplane128(insert_subvector(undef, op(128bitsubvec), 0), 0)). This enables floating-point loads to match patterns added in https://reviews.llvm.org/D130010 Differential Revision: https://reviews.llvm.org/D130013	2022-07-21 11:00:10 +00:00
Simon Pilgrim	0f6b0461b0	[DAG] SimplifyDemandedBits - relax "xor (X >> ShiftC), XorC --> (not X) >> ShiftC" to match only demanded bits The "xor (X >> ShiftC), XorC --> (not X) >> ShiftC" fold is currently limited to the XOR mask being a shifted all-bits mask, but we can relax this to only need to match under the demanded bits. This helps expose more bit extraction/clearing patterns and fixes the PowerPC testCompares*.ll regressions from D127115 Alive2: https://alive2.llvm.org/ce/z/fl7T7K Differential Revision: https://reviews.llvm.org/D129933	2022-07-19 10:59:07 +01:00
Simon Pilgrim	4f10516325	[AArch64] isDesirableToCommuteWithShift - add explicit ShiftLHS variable instead of altering an incoming argument variable As discussed on D129995, altering the 'N' variable to point to shift's source value was confusing.	2022-07-18 13:28:07 +01:00
Simon Pilgrim	259c36e7c1	[DAG] Add asserts to isDesirableToCommuteWithShift overrides to ensure it's being called from a shift. NFC.	2022-07-18 13:11:24 +01:00
Simon Pilgrim	a5d0122f75	[DAG] Canonicalize non-inlane shuffle -> AND if all non-inlane referenced elements are known zero As mentioned on D127115, this patch attempts to recognise shuffle masks that could be simplified to an AND mask - we already have a similar transform that will fold AND -> 'clear mask' shuffle, but this patch handles cases where the referenced elements are not from the same lane indices but are known to be zero. Differential Revision: https://reviews.llvm.org/D129150	2022-07-16 11:38:24 +01:00
Rosie Sumpter	e5edc1b5ee	[AArch64][SVE] Ensure PTEST operands have type nxv16i1 Currently any legal predicate types will be pattern-matched when creating a PTEST instruction. This could be a problem in future since PTEST always uses the .B specifier for the operand, but it is not always guaranteed that the extra lanes of unpacked types (e.g. nxv4i1) are zero. This patch ensures the operands of PTEST are type nxv16i1, where the undef lanes are set to zero. Differential Revision: https://reviews.llvm.org/D129282/	2022-07-12 09:27:59 +01:00
David Green	4334cbd49b	[AArch64] Remove incorrect use of DemandElts This call to computeKnownBits was passing in a 0xff mask, looking like it was expecting it to be used as a DemandBits, not a DemandElts mask.	2022-07-08 11:38:00 +01:00
Florian Hahn	afdedd405e	[AArch64] Try to re-use extended operand for SETCC with vector ops. Try to re-use an already extended operand for SetCC with vector operands feeding an extended select. Doing so avoids requiring another full extension of the SET_CC result when lowering the select. This improves lowering for certain extend/cmp/select patterns operating. For example with v16i8, this replaces 6 instructions for the extra extension with 4 separate selects. This improves the generated code for loops like the one below in combination with D96522. int foo(uint8_t p, int N) { unsigned long long sum = 0; for (int i = 0; i < N ; i++, p++) { unsigned int v = p; sum += (v < 127) ? v : 256 - v; } return sum; } https://clang.godbolt.org/z/Wco866MjY On the AArch64 cores I have access to, the patch improves performance of the vector loop by ~10%. This could be generalized per follow-ups, but the initial version targets one of the more important cases in combination with D96522. Alive2 modeling: * sext EQ https://alive2.llvm.org/ce/z/5upBvb * sext NE https://alive2.llvm.org/ce/z/zbEcJp * zext EQ https://alive2.llvm.org/ce/z/_xMwof * zext NE https://alive2.llvm.org/ce/z/5FwKfc * zext unsigned predicate: https://alive2.llvm.org/ce/z/iEwLU3 * sext signed predicate: https://alive2.llvm.org/ce/z/aMBega Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D120481	2022-07-07 16:50:00 -07:00
Bradley Smith	60d6be5dd3	[LegalizeTypes] Replace vecreduce_xor/or/and with vecreduce_add/umax/umin if not legal This is done during type legalization since the target representation of these nodes may not be valid until after type legalization, and after type legalization the fact that these are dealing with i1 types may be lost. Differential Revision: https://reviews.llvm.org/D128996	2022-07-07 09:33:54 +00:00
Sander de Smalen	5d4f6ce229	[AArch64][SVE] Zero other lanes when doing OR reduction on unpacked predicate using ptest. When the predicate vector is unpacked, we cannot assume anything about the values in the other lanes. We have to make sure we use the correct predicate where we know that the other lanes have been zeroed. Reviewed By: RosieSumpter Differential Revision: https://reviews.llvm.org/D129081	2022-07-06 16:12:44 +00:00
Sander de Smalen	95e08824fa	[AArch64] Add support for various operations on nxv1i1 types. The supported operations are: * Logical operations (and, or, xor, bic) * Logical reductions (and, or, xor, [us]min, [us]max) * Conversions to/from svbool_t * Predicate count (CNTP) Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D128835	2022-07-06 15:57:11 +00:00
David Sherwood	77b13a57a9	[AArch64][SME] Add SME addha/va intrinsics This patch adds the following new SME intrinsics: @llvm.aarch64.sme.addva @llvm.aarch64.sme.addha Differential Revision: https://reviews.llvm.org/D127861	2022-07-05 09:47:17 +01:00
Sander de Smalen	bf89d24f53	[AArch64] NFC: Move safe predicate casting to a separate function. This patch puts the code to safely bitcast a predicate, and possibly zero any undefined lanes when doing a widening cast, into one place and merges the functionality with lowerConvertToSVBool. This is some cleanup inspired by D128665. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D128926	2022-07-04 10:32:54 +00:00
Sander de Smalen	690db16422	[AArch64] Make nxv1i1 types a legal type for SVE. One motivation to add support for these types are the LD1Q/ST1Q instructions in SME, for which we have defined a number of load/store intrinsics which at the moment still take a `<vscale x 16 x i1>` predicate regardless of their element type. This patch adds basic support for the nxv1i1 type such that it can be passed/returned from functions, as well as some basic support to support some existing tests that result in a nxv1i1 type. It also adds support for splats. Other operations (e.g. insert/extract subvector, logical ops, etc) will be supported in follow-up patches. Reviewed By: paulwalker-arm, efriedma Differential Revision: https://reviews.llvm.org/D128665	2022-07-01 15:11:13 +00:00
Matt Devereau	5166345f50	[SVE][AArch64] Refine hasSVEArgsOrReturn As described in aapcs64 (https://github.com/ARM-software/abi-aa/blob/2022Q1/aapcs64/aapcs64.rst#scalable-vector-registers) AAVPCS is used only when registers z0-z7 take an SVE argument. This fixes the case where floats occupy the lower bits of registers z0-z7 but SVE arguments in registers greater than z7 cause a function to use AAVPCS where it should use AAPCS. Moving SVE function deduction from AArch64RegisterInfo::hasSVEArgsOrReturn to AArch64TargetLowering::LowerFormalArguments where physical register lowering is more accurate fixes this. Differential Revision: https://reviews.llvm.org/D127209	2022-07-01 13:24:55 +00:00
Matt Devereau	018a0dd5c8	[AArch64][SVE] Create AArch64ISD node for DUPQLANE128 Create an AArch64ISD node instead of emitting machine node DUP_ZZI_Q. This allows a simpler DAG combine for work previously attempted in https://reviews.llvm.org/D128503 Differential Revision: https://reviews.llvm.org/D128902	2022-07-01 11:46:24 +00:00
Sander de Smalen	fbefc62a96	[AArch64][SME] Sink tile offset operands into the loop for load/store instructions. This helps ISel decompose the generic offset for the tile into a base + offset. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D128508	2022-06-28 10:28:36 +01:00
David Sherwood	054faac9f9	[AArch64][SME] Add SVE2 psel, uclamp, sclamp and revd IR intrinsics When the SME feature is enabled we also gain access to a few extra SVE2 instructions. This patch adds LLVM IR intrinsics to make use of these new instructions: @llvm.aarch64.sve.psel @llvm.aarch64.sve.revd @llvm.aarch64.sve.sclamp @llvm.aarch64.sve.uclamp Differential Revision: https://reviews.llvm.org/D128332	2022-06-28 10:25:06 +01:00
David Sherwood	f916ee0fb1	[AArch64][SME] Add SME outer product intrinsics This patch adds the following intrinsics to support the SME ACLE: * @llvm.aarch64.sme.mopa: Non-widening outer product + accumulate * @llvm.aarch64.sme.mops: Non-widening outer product + subtract * @llvm.aarch64.sme.mopa.wide: Widening outer product + accumulate * @llvm.aarch64.sme.mops.wide: Widening outer product + subtract * @llvm.aarch64.sme.smopa.wide: Widening signed sum of outer product + accumulate * @llvm.aarch64.sme.smops.wide: Widening signed sum of outer product + subtract * @llvm.aarch64.sme.umopa.wide: Widening unsigned sum of outer product + accumulate * @llvm.aarch64.sme.umops.wide: Widening unsigned sum of outer product + subtract * @llvm.aarch64.sme.sumopa.wide: Widening signed by unsigned sum of outer product + accumulate * @llvm.aarch64.sme.sumops.wide: Widening signed by unsigned sum of outer product + subtract * @llvm.aarch64.sme.usmopa.wide: Widening unsigned by signed sum of outer product + accumulate * @llvm.aarch64.sme.usmops.wide: Widening unsigned by signed sum of outer product + subtract Differential Revision: https://reviews.llvm.org/D127956	2022-06-28 09:41:44 +01:00
David Green	03c65c0d32	[AArch64] Convert vector add(ext, ext) into ext(add(ext, ext)) Given a vector add or sub from extends that needs more than one 'step' (i.e. i8 to i32 or i16 to i64), we can transform the sequence to sext(add(ext, ext)), to allow the add(ext, ext) to become a single uaddl and a larger extend, producing fewer instructions in total. https://alive2.llvm.org/ce/z/S2T4k- Differential Revision: https://reviews.llvm.org/D128426	2022-06-24 10:04:28 +01:00
Paul Walker	e8716179eb	[SVE] Make ISD::SPLAT_VECTOR a legal operation. The implication of this patch being AArch64ISD::DUP no longer supports scalable vectors. Differential Revision: https://reviews.llvm.org/D128265	2022-06-23 00:42:47 +01:00
David Sherwood	aa0a413df8	[AArch64][SME] Add some SME PSTATE setting/query intrinsics This patch adds support for: * Querying the PSTATE.SM state with @llvm.aarch64.sme.get.pstatesm * Reading/writing the TPIDR2 register with new @llvm.aarch64.sme.get.tpidr2 and @llvm.aarch64.sme.set.tpidr2 intrinsics. Tests added here: CodeGen/AArch64/sme-get-pstatesm.ll CodeGen/AArch64/sme-read-write-tpidr2.ll Differential Revision: https://reviews.llvm.org/D127957	2022-06-22 10:26:45 +01:00
Paul Walker	7b285ae0e8	[SVE] Lower "unpredicated" sabd/uabd intrinsics to ISD::ABDS/U. This enables an existing transformation that when combined with an add will emit saba/uaba instructions. Differential Revision: https://reviews.llvm.org/D128198	2022-06-22 00:02:51 +01:00
David Green	3f81841474	[AArch64] Add Extract(DUP(C)) as a canonical constant. As a followup to D128144, this adds extract(DUP(C)) as a canonical constant to prevent it being transformed back into a BUILD_VECTOR, leading to an infinite loop.	2022-06-21 09:51:22 +01:00
David Green	c0ecbfa4fd	[AArch64] Known bits for AArch64ISD::DUP An AArch64ISD::DUP is just a splat, where the known bits for each lane are the same as the input. This teaches that to computeKnownBitsForTargetNode. Problems arise for constants though, as a constant BUILD_VECTOR can be lowered to an AArch64ISD::DUP, which SimplifyDemandedBits would then turn back into a constant BUILD_VECTOR leading to an infinite cycle. This has been prevented by adding a isTargetCanonicalConstantNode node to prevent the conversion back into a BUILD_VECTOR. Differential Revision: https://reviews.llvm.org/D128144	2022-06-20 19:11:57 +01:00
David Sherwood	013358632e	[AArch64][SME] Add the zero intrinsic The SME zero instruction takes a mask as an input declaring which 64-bit element tiles should be zeroed. There is a 1:1 mapping between the zero intrinsic and the instruction, however we also want to make the register allocator aware that some tile registers are being written to. We can actually just use the custom inserter for a pseudo instruction to correctly mark all the appropriate registers in the mask as implicitly defined by the operation. Differential Revision: https://reviews.llvm.org/D127843	2022-06-20 14:27:59 +01:00
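
The bit-level equivalence described in commit 1206f72e31 (Mul(And(Srl(X, 15), 0x10001), 0xffff) on a 32-bit lane versus a per-i16 "less than zero" mask) can be checked with a small scalar model. This is only an illustrative sketch in Python, not LLVM code: the function names `fold_source` and `cmltz_i16_pair` are hypothetical, and a single 32-bit lane is modeled as a plain integer.

```python
def fold_source(x):
    """Model Mul(And(Srl(x, 15), 0x10001), 0xffff) on one 32-bit lane.

    The shift and mask move the sign bit of each i16 half to bit 0 and
    bit 16; multiplying by 0xffff smears each of those bits across its half.
    """
    x &= 0xFFFFFFFF
    return (((x >> 15) & 0x10001) * 0xFFFF) & 0xFFFFFFFF


def cmltz_i16_pair(x):
    """Model a compare-less-than-zero on the two i16 halves of one lane.

    Each half becomes 0xffff if its sign bit is set, otherwise 0.
    (Hypothetical helper for illustration only.)
    """
    x &= 0xFFFFFFFF
    out = 0
    for shift in (0, 16):
        half = (x >> shift) & 0xFFFF
        if half & 0x8000:  # sign bit of the i16 half
            out |= 0xFFFF << shift
    return out
```

Comparing the two functions over a range of inputs (for example every value whose halves are drawn from 0x0000, 0x7fff, 0x8000 and 0xffff) is one way to convince yourself that the two forms agree on each lane.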

1445 Commits