llvm-project

Commit Graph

Author	SHA1	Message	Date
OverMighty	232953f996	[AArch64] Add pattern for SQDML*Lv1i32_indexed There was no pattern to fold into these instructions. This patch adds the pattern obtained from the following ACLE intrinsics so that they generate sqdmlal/sqdmlsl instructions instead of separate sqdmull and sqadd/sqsub instructions: - vqdmlalh_s16, vqdmlslh_s16 - vqdmlalh_lane_s16, vqdmlalh_laneq_s16, vqdmlslh_lane_s16, vqdmlslh_laneq_s16 (when the lane index is 0) It also modifies the result of the existing pattern for the latter, when the lane index is not 0, to use the v1i32_indexed instructions instead of the v4i16_indexed ones. Fixes #49997. Differential Revision: https://reviews.llvm.org/D131700	2022-08-17 12:00:47 +01:00
Vitaly Buka	16fecdfa70	Revert "[AArch64] Add `foldCSELOfCSEl` DAG combine" Breaks ubsan on buildbot, details in D125504 This reverts commit `6f9423ef06`.	2022-08-16 20:29:37 -07:00
Karl Meakin	6f9423ef06	[AArch64] Add `foldCSELOfCSEl` DAG combine Differential Revision: https://reviews.llvm.org/D125504	2022-08-16 12:49:11 +01:00
Zain Jaffal	7155ed4289	[AArch64] Add support for 256-bit non temporal loads Currenlty all temporal loads are mapped to `LDP` or `LDR`. This patch will map all the non temporal 256-bit loads into `LDNP`. Future patches should address other non-temporal loads. Reviewed By: fhahn, dmgreen Differential Revision: https://reviews.llvm.org/D131773	2022-08-16 12:19:36 +01:00
Vitaly Buka	e0e960923f	[AArch64] Fix signed integer overflow in CSINC case Followup to D131815, which overlflows on different values.	2022-08-15 15:04:20 -07:00
Vitaly Buka	f1596952f9	[AArch64] Fix signed integer overflow in CSINC case https://lab.llvm.org/staging/#/builders/224/builds/2/steps/16/logs/stdio Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D131815	2022-08-13 13:12:09 -07:00
Kazu Hirata	109df7f9a4	[llvm] Qualify auto in range-based for loops (NFC) Identified with readability-qualified-auto.	2022-08-13 12:55:42 -07:00
Florian Hahn	c2af37dcdb	Revert "[AArch64][GlobalISel] Recognise some CCMPri" This reverts commit `38c2366b3f`. This patch seems to break boostraping LLVM with `-fglobal-isel -O3` on AArch64 hardware. Without the revert, there are 500+ test failures for the `check-llvm-codegen-x86` target.	2022-08-13 17:44:41 +01:00
David Green	a9e9dd9a3a	[AArch64] Add bf16 select handling A bfloat select operation will currently crash, but is allowed from C. This adds handling for the operation, turning it into a FCSELHrrr if fullfp16 is present, or converting it to a FCSELSrrr if not. The FCSELSrrr is created via using INSERT_SUBREG/EXTRACT_SUBREG to convert the bf16 to a f32 and using the f32 pattern for FCSELSrrr. (I originally attempted to do this via a tablegen pattern, but it appears that the nzcv glue is places onto the wrong node, causing it to be forgotten and incorrect scheduling to be emitted). The FCSELSrrr can also be used for fp16 selects when +fullfp16 is not present, which helps avoid an unnecessary promotion to f32. Differential Revision: https://reviews.llvm.org/D131253	2022-08-11 14:20:36 +01:00
David Truby	b1b9c39629	[AArch64][SVE] Use SVE for VLS fcopysign for wide vectors Currently fcopysign for VLS vectors lowers through NEON even when the vector width is wider than a NEON vector, causing bad codegen as the vectors are split. This patch causes SVE to be used for these vectors instead, giving much better codegen on wide VLS vectors. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D128642	2022-08-10 10:17:19 +00:00
Dinar Temirbulatov	cab6cd6834	[AArch64][LoopVectorize] Introduce trip count minimal value threshold to ignore tail-folding. After D121595 was commited, I noticed regressions assosicated with small trip count numbersvectorisation by tail folding with scalable vectors. As a solution for those issues I propose to introduce the minimal trip count threshold value. Differential Revision: https://reviews.llvm.org/D130755	2022-08-09 22:10:17 +01:00
Archibald Elliott	b20fe2c25b	[docs][AArch64] Label Features with Arm ARM Names This patch adds the names of the Arm Architecture Reference Manual (ARM) features to the corresponding Subtarget Features in the AArch64 backend and target parser. The aim of this is to make it clearer what architectural features a subtarget feature might enable (so, which features a CPU must provide to support that subtarget feature), and so make it easier to add new CPUs in the future. Differential Revision: https://reviews.llvm.org/D131257	2022-08-09 18:45:50 +01:00
Luo, Yuanke	aaf6c7b05c	[globalisel] Select register bank for DBG_VALUE The register operand of DBG_VALUE is not selected to a proper register bank in both AArch64 and X86. This would cause getRegClass crash after global ISel. After discussion, we think the MIR should assume all vritual register should be set proper register class after global ISel, so this patch is to fix the gap of DBG_VALUE for AArch64 and X86. Differential Revision: https://reviews.llvm.org/D129037	2022-08-09 13:11:51 +08:00
Yuta Mukai	3f561996bf	[AArch64] Fix and add A64FX scheduling resource/latency info 1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR) is added and CompleteModel is set to one. 2. Information for pseudo SVE instructions is added. Those instructions are present at the time of scheduling. 3. Resource and latency information for SVE instructions is modified to be more accurate. For example, the description for CMPEQ, which consumes one cycle each of unit FLA and PPR, is as follows. ``` Previous: def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>; def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {... Modified: def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>; def A64FXGI1 : ProcResGroup<[A64FXIPPR]>; def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {... ``` Reference: A64FX Microarchitecture Manual (Table 16-3) https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf Reviewed By: dmgreen, kawashima-fj Differential Revision: https://reviews.llvm.org/D131165	2022-08-09 10:53:40 +09:00
Fangrui Song	de9d80c1c5	[llvm] LLVM_FALLTHROUGH => [[fallthrough]]. NFC With C++17 there is no Clang pedantic warning or MSVC C5051.	2022-08-08 11:24:15 -07:00
Cullen Rhodes	a6dec9f5b2	[AArch64][SVE] Add patterns to select masked FP arith Add patterns to select predicated instructions when lowering: fadd(a, select(mask, b, splat(0))) fsub(a, select(mask, b, splat(0))) 'fadd' is unsafe unless no-signed zeros fast-math flag is set, since -0.0 + 0.0 = 0.0 changes the sign. Alive2: https://alive2.llvm.org/ce/z/wbhJh_ Also adds FMA patterns for: fadd(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c) fsub(a, select(mask, mul(b, c), splat(0))) -> fmla(a, mask, b, c) These patterns require the 'contract' fast-math flag to be set, and the fadd 'nsz' as above. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130564	2022-08-08 08:44:13 +00:00
Kazu Hirata	d0ec61c9ff	[Target] Remove unused forward declarations (NFC)	2022-08-07 00:16:16 -07:00
Kazu Hirata	a2d4501718	[llvm] Fix comment typos (NFC)	2022-08-07 00:16:14 -07:00
Paul Walker	0533c39a76	[SVE] Expand DUPM patterns to handle all integer vector types. NOTE: i8 vector splats are ignored because the immediate range of DUP already has full coverage. Differential Revision: https://reviews.llvm.org/D131078	2022-08-05 16:00:08 +00:00
David Green	38c2366b3f	[AArch64][GlobalISel] Recognise some CCMPri This is a simple addition to emitConditionalComparison, to match CCMP with immediates using getIConstantVRegValWithLookThrough, letting it select the CCMPri variants of the instructions. Differential Revision: https://reviews.llvm.org/D131073	2022-08-05 07:48:42 +01:00
Mingming Liu	bc8f2f3649	[AArch64][TTI][NFC] Overload method 'getVectorInstrCost' to provide vector instruction itself, as a context information for cost estimation. 1) Overloaded (instruction-based) method is a wrapper around the current (opcode-based) method. 2) This patch also changes a few callsites (VectorCombine.cpp, SLPVectorizer.cpp, CodeGenPrepare.cpp) to call the overloaded method. 3) This is a split of D128302. Differential Revision: https://reviews.llvm.org/D131114	2022-08-04 12:58:25 -07:00
David Truby	9a976f3661	[llvm] Always use TargetConstant for FP_ROUND ISD Nodes This patch ensures consistency in the construction of FP_ROUND nodes such that they always use ISD::TargetConstant instead of ISD::Constant. This additionally fixes a bug in the AArch64 SVE backend where patterns were matching against TargetConstant nodes and sometimes failing when passed a Constant node. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130370	2022-08-03 14:02:11 +01:00
Vladislav Dzhidzhoev	71aecbb75c	[AArch64] Treat x18 as callee-saved in functions with Windows calling convention on Darwin rGcf97e0ec42b8 makes $x18 to be treated as callee-saved in functions with Windows calling convention on non-Windows OSes. Here we mark $x18 as callee-saved for functions with Windows calling convention on Darwin, as well as on other non-Windows platforms, in order to prevent some miscompilations (like miscompilation of win64cc-darwin-backup-x18.ll). Since getCalleeSavedRegs doesn't return x18 in list of callee-saved registers, assignCalleeSavedSpillSlots and determineCalleeSaves consider different sets of registers as callee-saved. It causes an error: ``` Assertion failed: ((!HasCalleeSavedStackSize \|\| getCalleeSavedStackSize() == Size) && "Invalid size calculated for callee saves"), function getCalleeSavedStackSize, file AArch64MachineFunctionInfo.h, line 292. ``` Differential Revision: https://reviews.llvm.org/D130676	2022-08-02 20:33:42 +03:00
David Green	1206f72e31	[AArch64] Fold Mul(And(Srl(X, 15), 0x10001), 0xffff) to CMLTz This folds a v4i32 Mul(And(Srl(X, 15), 0x10001), 0xffff) into a v8i16 CMLTz instruction. The Srl and And extract the top bit (whether the input is negative) and the Mul sets all values in the i16 half to all 1/0 depending on if that top bit was set. This is equivalent to a v8i16 CMLTz instruction. The same applies to other sizes with equivalent constants. Differential Revision: https://reviews.llvm.org/D130874	2022-08-02 13:01:59 +01:00
David Sherwood	4ef9cb6c17	[AArch64][LoopVectorize] Disable tail-folding for SVE when loop has interleaved accesses If we have interleave groups in the loop we want to vectorise then we should fall back on normal vectorisation with a scalar epilogue. In such cases when tail-folding is enabled we'll almost certainly go on to create vplans with very high costs for all vector VFs and fall back on VF=1 anyway. This is likely to be worse than if we'd just used an unpredicated vector loop in the first place. Once the vectoriser has proper support for analysing all the costs for each combination of VF and vectorisation style, then we should be able to remove this. Added an extra test here: Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll Differential Revision: https://reviews.llvm.org/D128342	2022-08-02 09:52:33 +01:00
Vasileios Porpodas	f669030373	[TTI][AArch64][SLP] Sets the cost of an ADD reduction 2xi64 to 2. 2xi64 is the legalized type for wide reductions (like 16xi64) and setting the cost to 2 makes `load-reduce` and `load-zext-reduce` patterns profitable. The few performance measurments that I did on an aarch64 machine confirm that these patterns are actually faster when vectorized. Differential Revision: https://reviews.llvm.org/D130740	2022-08-01 13:03:14 -07:00
Kazu Hirata	bf6021709a	Use drop_begin (NFC)	2022-07-31 15:17:09 -07:00
Kazu Hirata	12b29900a1	Use any_of (NFC)	2022-07-30 10:35:56 -07:00
Matt Devereau	a8b726ac65	[AArch64][SVE] Change DupLane128Combine Index comparison to 0 IdxInsert == IdxDupLane is incorrect. IdxInsert is the starting element number, whereas IdxIndex is the index of a quadword	2022-07-29 14:31:00 +00:00
David Sherwood	487fa6f8c3	[AArch64][DAGCombine] Add performBuildVectorCombine 'extract_elt ~> anyext' A build vector of two extracted elements is equivalent to an extract subvector where the inner vector is any-extended to the extract_vector_elt VT, because extract_vector_elt has the effect of an any-extend. (build_vector (extract_elt_i16_to_i32 vec Idx+0) (extract_elt_i16_to_i32 vec Idx+1)) => (extract_subvector (anyext_i16_to_i32 vec) Idx) Depends on D130697 Differential Revision: https://reviews.llvm.org/D130698	2022-07-29 09:51:09 +01:00
chendewen	7eeb468ae5	[Aarch64] Add cost for missing extensions. This patch adds a cost estimate for some missing sign extensions. ref: https://reviews.llvm.org/D14730 Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D130565	2022-07-28 17:34:00 +08:00
Amara Emerson	93e3aeb9a8	[AArch64][GlobalISel] Fix custom legalization of rotates using sext for shift vs zext. Rotates are defined according to DAG documentation as having unsigned shifts, so we need to zero-extend instead of sign-extend here. Fixes issue 56664	2022-07-27 22:10:42 -07:00
Amara Emerson	65246d3eb4	Use hasNItemsOrLess() in MRI::hasAtMostUserInstrs().	2022-07-27 11:42:14 -07:00
Mingming Liu	34348814e1	[AArch64] Explicitly use v1i64 type for llvm.aarch64.neon.pmull64 Without this, the intrinsic will be expanded to an integer; thereby an explicit copy (from GPR to SIMD register) will be codegen'd. This matches the general convention of using "v1" types to represent scalar integer operations in vector registers. The similar approach is observed in D56616, and the pattern likely applies on other intrinsic that accepts integer scalars (e.g., int_aarch64_neon_sqdmulls_scalar) Differential Revision: https://reviews.llvm.org/D130548	2022-07-27 11:11:16 -07:00
Amara Emerson	19cdd1908b	[AArch64][GlobalISel] Add heuristics for localizing G_CONSTANT. This adds similar heuristics to G_GLOBAL_VALUE, querying the cost of materializing a specific constant in code size. Doing so prevents us from sinking constants which require multiple instructions to generate into use blocks. Code size savings on CTMark -Os: Program size.__text before after diff ClamAV/clamscan 381940.00 382052.00 0.0% lencod/lencod 428408.00 428428.00 0.0% SPASS/SPASS 411868.00 411876.00 0.0% kimwitu++/kc 449944.00 449944.00 0.0% Bullet/bullet 463588.00 463556.00 -0.0% sqlite3/sqlite3 284696.00 284668.00 -0.0% consumer-typeset/consumer-typeset 414492.00 414424.00 -0.0% 7zip/7zip-benchmark 595244.00 594972.00 -0.0% mafft/pairlocalalign 247512.00 247368.00 -0.1% tramp3d-v4/tramp3d-v4 372884.00 372044.00 -0.2% Geomean difference -0.0% Differential Revision: https://reviews.llvm.org/D130554	2022-07-27 10:51:16 -07:00
Amara Emerson	9cc1dd209d	[AArch64][GlobalISel] Lower vector G_CTTZ. Fixes issue 56398	2022-07-27 00:14:30 -07:00
Paul Walker	e5c892dd85	[SVE][SelectionDAG] Use INDEX to generate matching instances of BUILD_VECTOR. This patch starts small, only detecting sequences of the form <a, a+n, a+2n, a+3n, ...> where a and n are ConstantSDNodes. Differential Revision: https://reviews.llvm.org/D125194	2022-07-26 15:28:37 +00:00
Sander de Smalen	a41ddf178e	[AArch64][SVE] Sink ptrue into loop if it is used by PTEST. This helps fold away the ptest instructions, which needs the knowledge on whether the general predicate is known to zero the inactive lanes. This fixes some PTEST regressions introduced by D129282. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D129852	2022-07-26 15:07:41 +01:00
Sander de Smalen	370ff43a15	[AArch64][SVE] Consider more intrinsics in 'isZeroingInactiveLanes'. This fixes some PTEST regressions introduced by D129282. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D129851	2022-07-26 15:07:41 +01:00
Simon Tatham	55f1fbf005	[MC,llvm-objdump,ARM] Target-dependent disassembly resync policy. Currently, when llvm-objdump is disassembling a code section and encounters a point where no instruction can be decoded, it uses the same policy on all targets: consume one byte of the section, emit it as "<unknown>", and try disassembling from the next byte position. On an architecture where instructions are always 4 bytes long and 4-byte aligned, this makes no sense at all. If a 4-byte word cannot be decoded as an instruction, then the next place that a valid instruction could //possibly// be found is 4 bytes further on. Disassembling from a misaligned address can't possibly produce anything that the code generator intended, or that the CPU would even attempt to execute. This patch introduces a new MCDisassembler virtual method called `suggestBytesToSkip`, which allows each target to choose its own resynchronization policy. For Arm (as opposed to Thumb) and AArch64, I've filled in the new method to return a fixed width of 4. Thumb is a more interesting case, because the criterion for identifying 2-byte and 4-byte instruction encodings is very simple, and doesn't require the particular instruction to be recognized. So `suggestBytesToSkip` is also passed an ArrayRef of the bytes in question, so that it can take that into account. The new test case shows Thumb disassembly skipping over two unrecognized instructions, and identifying one as 2-byte and one as 4-byte. For targets other than Arm and AArch64, this is NFC: the base class implementation of `suggestBytesToSkip` still returns 1, so that the existing behavior is unchanged. Other targets can fill in their own implementations as they see fit; I haven't attempted to choose a new behavior for each one myself. I've updated all the call sites of `MCDisassembler::getInstruction` in llvm-objdump, and also one in sancov, which was the only other place I spotted the same idiom of `if (Size == 0) Size = 1` after a call to `getInstruction`. Reviewed By: DavidSpickett Differential Revision: https://reviews.llvm.org/D130357	2022-07-26 09:35:30 +01:00
Cullen Rhodes	6082051da1	[AArch64][SVE] Add patterns to select mla/mls Adds patterns for: add(a, select(mask, mul(b, c), splat(0))) -> mla(a, mask, b, c) sub(a, select(mask, mul(b, c), splat(0))) -> mls(a, mask, b, c) Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130492	2022-07-26 07:52:44 +00:00
Bradley Smith	953a98ef8d	[AArch64][SVE] Fold target specific ext/trunc nodes into loads/stores Due to the way fixed length SVE lowering works, we sometimes introduce ext/trunc nodes very late, these nodes then immediately get converted into target specific nodes (UUNPKLO/UZP1) before they get a chance to be folded into a load/store. This patch introduces target specific dag combines for these nodes so that we can still create extending loads/truncating stores out of them. Differential Revision: https://reviews.llvm.org/D128065	2022-07-25 15:24:05 +00:00
Cullen Rhodes	c04ff587dc	[AArch64] Combine setcc (iN (bitcast (vNi1 X))) with vecreduce_or Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130163	2022-07-25 12:14:33 +00:00
Rosie Sumpter	034a27e688	[AArch64] Add f16 fpimm patterns This patch recognizes f16 immediates as legal and adds the necessary patterns. This allows the fadda folding introduced in `05d424d165` to be applied to the f16 cases. Differential Revision: https://reviews.llvm.org/D129989	2022-07-25 09:08:10 +01:00
Cullen Rhodes	836f790bb1	[AArch64][SVE] Add patterns to select masked add/sub instructions When lowering add(a, select(mask, b, splat(0))) the sel instruction can be removed by using predicated add/sub instructions. Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D129751	2022-07-25 07:22:05 +00:00
Kazu Hirata	b5188591a0	[llvm] Remove redundaunt virtual specifiers (NFC) Identified with modernize-use-override.	2022-07-24 21:50:35 -07:00
Arnold Schwaighofer	58e6ee0e1f	llvm.swift.async.context.addr cannot be modeled as NoMem because we don't want it to be cse'd accross async suspends An async suspend models the split between two partial async functions. `llvm.swift.async.context.addr ` will have a different value in the two partial functions so it is not correct to generally CSE the instruction. rdar://97336162 Differential Revision: https://reviews.llvm.org/D130201	2022-07-22 11:50:58 -07:00
Simon Pilgrim	939cf9b1be	[AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation As noticed on D129765 and reported on Issue #56531 - aarch64 targets can use the neon ctpop + add-reduce instructions to speed up scalar ctpop instructions, but we fail to do this for parity calculations. I'm not sure where the cutoff should be for specific CPUs, but i64 (+ i128 special case) shows a definite reduction in instruction count. i32 is about the same (but scalar <-> neon transfers are probably more costly?), and sub-i32 promotion looks to be a definite regression compared to parity expansion optimized for those widths. Differential Revision: https://reviews.llvm.org/D130246	2022-07-22 17:24:17 +01:00
Shubham Narlawar	f55dbfbd9d	[AArch64] Move SeparateConstOffsetFromGEPPass before LSR and enable EnableGEPOpt by default. GEP's across basic blocks were not getting splitted due to EnableGEPOpt which was turned off by default. Hence, EarlyCSE missed the opportunity to eliminate common part of GEP's. This can be achieved by simply turning GEP pass on. - This patch moves SeparateConstOffsetFromGEPPass() just before LSR. - It enables EnableGEPOpt by default. Resolves - https://github.com/llvm/llvm-project/issues/50528 Added an unit test. Differential Revision: https://reviews.llvm.org/D128582	2022-07-22 15:20:53 +01:00
Cullen Rhodes	bf268a05cd	[AArch64] Emit vector FP cmp when LE is used with fast-math Reviewed By: paulwalker-arm Differential Revision: https://reviews.llvm.org/D130093	2022-07-22 07:53:55 +00:00

1 2 3 4 5 ...

6254 Commits