llvm-project

Commit Graph

Author	SHA1	Message	Date
Florian Hahn	76afbf60ed	[VPlan] Replace uses with new value in VPInstructionsToVPRecipe (NFC). Now that VPRecipeBase inherits from VPDef, we can always use the new VPValue for replacement, if the recipe defines one. Given the recipes that are supported at the moment, all new recipes must have either 0 or 1 defined values.	2021-01-25 19:38:08 +00:00
Florian Hahn	3201274dea	[VPlan] Handle scalarized values in VPTransformState. This patch adds plumbing to handle scalarized values directly in VPTransformState. Reviewed By: gilr Differential Revision: https://reviews.llvm.org/D92282	2021-01-25 14:21:56 +00:00
Sander de Smalen	171d12489f	[SLPVectorizer] NFC: Migrate getVectorCallCosts to use InstructionCost. This change also changes getReductionCost to return InstructionCost, and it simplifies two expressions by removing a redundant 'isValid' check.	2021-01-25 12:27:01 +00:00
Sanjay Patel	77adbe6a8c	[SLP] fix fast-math requirements for fmin/fmax reductions `a6f0221276` enabled intersection of FMF on reduction instructions, so it is safe to ease the check here. There is still some room to improve here - it looks like we have nearly duplicate flags propagation logic inside of the LoopUtils helper but it is limited targets that do not form reduction intrinsics (they form the shuffle expansion).	2021-01-24 08:55:56 -05:00
Nikita Popov	c83cff45c7	[IR] Add NoAliasScopeDeclInst (NFC) Add an intrinsic type class to represent the llvm.experimental.noalias.scope.decl intrinsic, to make code working with it a bit nicer by hiding the metadata extraction from view.	2021-01-23 22:40:32 +01:00
Kazu Hirata	1238378f18	[llvm] Use pop_back_val (NFC)	2021-01-23 10:56:33 -08:00
Sanjay Patel	a6f0221276	[SLP] fix fast-math-flag propagation on FP reductions As shown in the test diffs, we could miscompile by propagating flags that did not exist in the original code. The flags required for fmin/fmax reductions will be fixed in a follow-up patch.	2021-01-23 11:17:20 -05:00
Anton Rapetov	a4914dc1f2	[SLP] do not traverse constant uses Walking the use list of a Constant (particularly, ConstantData) is not scalable, since a given constant may be used by many instructinos in many functions in many modules. Differential Revision: https://reviews.llvm.org/D94713	2021-01-22 08:14:09 -05:00
David Sherwood	2e080eb00a	[SVE] Add support for scalable vectorization of loops with selects and cmps I have removed an unnecessary assert in LoopVectorizationCostModel::getInstructionCost that prevented a cost being calculated for select instructions when using scalable vectors. In addition, I have changed AArch64TTIImpl::getCmpSelInstrCost to only do special cost calculations for fixed width vectors and fall back to the base version for scalable vectors. I have added a simple cost model test for cmps and selects: test/Analysis/CostModel/sve-cmpsel.ll and some simple tests that show we vectorize loops with cmp and select: test/Transforms/LoopVectorize/AArch64/sve-basic-vec.ll Differential Revision: https://reviews.llvm.org/D95039	2021-01-22 09:48:13 +00:00
David Green	39db5753f9	[LV][ARM] Inloop reduction cost modelling This adds cost modelling for the inloop vectorization added in `745bf6cf44`. Up until now they have been modelled as the original underlying instruction, usually an add. This happens to works OK for MVE with instructions that are reducing into the same type as they are working on. But MVE's instructions can perform the equivalent of an extended MLA as a single instruction: %sa = sext <16 x i8> A to <16 x i32> %sb = sext <16 x i8> B to <16 x i32> %m = mul <16 x i32> %sa, %sb %r = vecreduce.add(%m) -> R = VMLADAV A, B There are other instructions for performing add reductions of v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64 (VADDLV) and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV). The i64 are particularly interesting as there are no native i64 add/mul instructions, leading to the i64 add and mul naturally getting very high costs. Also worth mentioning, under NEON there is the concept of a sdot/udot instruction which performs a partial reduction from a v16i8 to a v4i32. They extend and mul/sum the first four elements from the inputs into the first element of the output, repeating for each of the four output lanes. They could possibly be represented in the same way as above in llvm, so long as a vecreduce.add could perform a partial reduction. The vectorizer would then produce a combination of in and outer loop reductions to efficiently use the sdot and udot instructions. Although this patch does not do that yet, it does suggest that separating the input reduction type from the produced result type is a useful concept to model. It also shows that a MLA reduction as a single instruction is fairly common. This patch attempt to improve the costmodelling of in-loop reductions by: - Adding some pattern matching in the loop vectorizer cost model to match extended reduction patterns that are optionally extended and/or MLA patterns. This marks the cost of the reduction instruction correctly and the sext/zext/mul leading up to it as free, which is otherwise difficult to tell and may get a very high cost. (In the long run this can hopefully be replaced by vplan producing a single node and costing it correctly, but that is not yet something that vplan can do). - getExtendedAddReductionCost is added to query the cost of these extended reduction patterns. - Expanded the ARM costs to account for these expanded sizes, which is a fairly simple change in itself. - Some minor alterations to allow inloop reduction larger than the highest vector width and i64 MVE reductions. - An extra InLoopReductionImmediateChains map was added to the vectorizer for it to efficiently detect which instructions are reductions in the cost model. - The tests have some updates to show what I believe is optimal vectorization and where we are now. Put together this can greatly improve performance for reduction loop under MVE. Differential Revision: https://reviews.llvm.org/D93476	2021-01-21 21:03:41 +00:00
Sanjay Patel	2f03528f5e	[SLP] rename reduction variable to avoid shadowing; NFC The code structure can likely be improved now that 'OperationData' is gone.	2021-01-21 16:02:38 -05:00
Sanjay Patel	d77753381f	[SLP] simplify reduction matching This is NFC-intended and removes the "OperationData" class which had become nothing more than a recurrence (reduction) type. I adjusted the matching logic to distinguish instructions from non-instructions - that's all that the "IsLeafValue" member was keeping track of.	2021-01-21 14:58:57 -05:00
Kazu Hirata	e53472de68	[Transforms] Use llvm::append_range (NFC)	2021-01-20 21:35:54 -08:00
Sanjay Patel	c09be0d2a0	[SLP] reduce reduction code for checking vectorizable ops; NFC This is another step towards removing `OperationData` and fixing FMF matching/propagation bugs when forming reductions.	2021-01-20 11:14:48 -05:00
Sanjay Patel	1c54112a57	[SLP] refactor more reduction functions; NFC We were able to remove almost all of the state from OperationData, so these don't make sense as members of that class - just pass the RecurKind in as a param. More streamlining is possible, but I'm trying to avoid logic/typo bugs while fixing this. Eventually, we should not need the `OperationData` class.	2021-01-20 11:14:48 -05:00
Sanjay Patel	8590d24543	[SLP] move reduction createOp functions; NFC We were able to remove almost all of the state from OperationData, so these don't make sense as members of that class - just pass the RecurKind in as a param.	2021-01-20 11:14:48 -05:00
Kazu Hirata	8857202489	[llvm] Use llvm::find (NFC)	2021-01-19 20:19:14 -08:00
Alexey Bataev	e463bd53c0	Revert "[SLP]Merge reorder and reuse shuffles." This reverts commit `438682de6a` to fix the bug with the reducing size of the resulting vector for the entry node with multiple users.	2021-01-19 11:48:04 -08:00
Jeroen Dobbelaere	121cac01e8	[noalias.decl] Look through llvm.experimental.noalias.scope.decl Just like llvm.assume, there are a lot of cases where we can just ignore llvm.experimental.noalias.scope.decl. Reviewed By: nikic Differential Revision: https://reviews.llvm.org/D93042	2021-01-19 20:09:42 +01:00
David Sherwood	c3ce262794	[NFC] Make remaining cost functions in LoopVectorize.cpp use InstructionCost A previous patch has already changed getInstructionCost to return an InstructionCost type. This patch changes the other various getXXXCost functions to return an InstructionCost too. This is a non-functional change - I've added a few asserts that the costs are valid in places where we're selecting between vector call and intrinsic costs. However, since we don't yet return invalid costs from any of the TTI implementations these asserts should not fire. See this patch for the introduction of the type: https://reviews.llvm.org/D91174 See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html Differential Revision: https://reviews.llvm.org/D94065	2021-01-19 09:08:40 +00:00
Sanjay Patel	5b77ac32b1	[SLP] match maxnum/minnum intrinsics as FP reduction ops After much refactoring over the last 2 weeks to the reduction matching code, I think this change is finally ready. We effectively broke fmax/fmin vector reduction optimization when we started canonicalizing to intrinsics in instcombine, so this should restore that functionality for SLP. There are still FMF problems here as noted in the code comments, but we should be avoiding miscompiles on those for fmax/fmin by restricting to full 'fast' ops (negative tests are included). Fixing FMF propagation is a planned follow-up. Differential Revision: https://reviews.llvm.org/D94913	2021-01-18 17:37:16 -05:00
Sanjay Patel	3dbbadb8ef	[SLP] rename reduction query for min/max ops; NFC This will avoid confusion once we start matching min/max intrinsics. All of these hacks to accomodate cmp+sel idioms should disappear once we canonicalize to min/max intrinsics.	2021-01-18 09:32:57 -05:00
Sanjay Patel	d1c4e859ce	[SLP] reduce opcode API dependency in reduction cost calc; NFC The icmp opcode is now hard-coded in the cost model call. This will make it easier to eventually remove all opcode queries for min/max patterns as we transition to intrinsics.	2021-01-18 09:32:57 -05:00
Caroline Concatto	36710c38c1	[NFC]Migrate VectorCombine.cpp to use InstructionCost This patch changes these functions: vectorizeLoadInsert isExtractExtractCheap foldExtractedCmps scalarizeBinopOrCmp getShuffleExtract foldBitcastShuf to use the class InstructionCost when calling TTI.get<something>Cost(). This patch is part of a series of patches to use InstructionCost instead of unsigned/int for the cost model functions. See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html See this patch for the introduction of the type: https://reviews.llvm.org/D91174 ps.:This patch adds the test \|\| !NewCost.isValid(), because we want to return false when: !NewCost.isValid && !OldCost.isValid()->the cost to transform it expensive and !NewCost.isValid() && OldCost.isValid() Therefore for simplication we only add test for !NewCost.isValid() Differential Revision: https://reviews.llvm.org/D94069	2021-01-18 13:37:21 +00:00
Sanjay Patel	49b96cd9ef	[SLP] remove opcode field from reduction data class This is NFC-intended and another step towards supporting intrinsics as reduction candidates. The remaining bits of the OperationData class do not make much sense as-is, so I will try to improve that, but I'm trying to take minimal steps because it's still not clear how this was intended to work.	2021-01-16 13:55:52 -05:00
Sanjay Patel	fcfcc3cc6b	[SLP] fix typos; NFC	2021-01-16 13:55:52 -05:00
Sanjay Patel	48dbac5b6b	[SLP] remove unnecessary use of 'OperationData' This is another NFC-intended patch to allow matching intrinsics (example: maxnum) as candidates for reductions. It's possible that the loop/if logic can be reduced now, but it's still difficult to understand how this all works.	2021-01-16 13:55:52 -05:00
Kazu Hirata	19aacdb715	[llvm] Construct SmallVector with iterator ranges (NFC)	2021-01-16 09:40:53 -08:00
Sanjay Patel	ceb3cdccd0	[SLP] remove dead code in reduction matching; NFC To get into this block we had: !A \|\| B \|\| C and we checked C in the first 'if' clause leaving !A \|\| B. But the 2nd 'if' is checking: A && !B --> !(!A \|\| B)	2021-01-15 17:03:26 -05:00
Sanjay Patel	1f21de535d	[SLP] remove unused reduction functions; NFC These were made obsolete by simplifying the code in recent patches.	2021-01-15 14:59:33 -05:00
Sanjay Patel	b21905dfe3	[SLP] remove unnecessary state in matching reductions This is NFC-intended. I'm still trying to figure out how the loop where this is used works. It does not seem like we require this data at all, but it's hard to confirm given the complicated predicates.	2021-01-14 18:32:37 -05:00
Bjorn Pettersson	d58512b2e3	[SLP] Don't vectorize stores of non-packed types (like i1, i2) In the spirit of commit `fc783e91e0` (llvm-svn: 248943) we shouldn't vectorize stores of non-packed types (i.e. types that has padding between consecutive variables in a scalar layout, but being packed in a vector layout). The problem was detected as a miscompile in a downstream test case. Reviewed By: anton-afanasyev Differential Revision: https://reviews.llvm.org/D94446	2021-01-14 11:30:33 +01:00
Sanjay Patel	123674a816	[SLP] simplify type check for reductions This is NFC-intended. The 'valid' call allows int/FP/pointers for other parts of SLP. The difference here is that we can't reduce pointers.	2021-01-13 13:30:46 -05:00
Sanjay Patel	9e7895a868	[SLP] reduce code duplication while processing reductions; NFC	2021-01-12 16:03:57 -05:00
Sanjay Patel	92fb5c49e8	[SLP] rename variable to improve readability; NFC The OperationData in the 2nd block (visiting the operands) is completely independent of the 1st block.	2021-01-12 16:03:57 -05:00
Sanjay Patel	554be30a42	[SLP] reduce code duplication in processing reductions; NFC	2021-01-12 16:03:57 -05:00
Sanjay Patel	46507a96fc	[SLP] reduce code duplication while matching reductions; NFC	2021-01-12 16:03:57 -05:00
Philip Reames	9f61fbd75a	[LV] Relax assumption that LCSSA implies single entry This relates to the ongoing effort to support vectorization of multiple exit loops (see D93317). The previous code assumed that LCSSA phis were always single entry before the vectorizer ran. This was correct, but only because the vectorizer allowed only a single exiting edge. There's nothing in the definition of LCSSA which requires single entry phis. A common case where this comes up is with a loop with multiple exiting blocks which all reach a common exit block. (e.g. see the test updates) Differential Revision: https://reviews.llvm.org/D93725	2021-01-12 12:34:52 -08:00
Florian Hahn	eb0371e403	[VPlan] Unify value/recipe printing after VPDef transition. This patch unifies the way recipes and VPValues are printed after the transition to VPDef. VPSlotTracker has been updated to iterate over all recipes and all their defined values to number those. There is no need to number values in Value2VPValue. It also updates a few places that only used slot numbers for VPInstruction. All recipes now can produce numbered VPValues.	2021-01-11 14:42:46 +00:00
Florian Hahn	a94497a342	[VPlan] Move initial quote emission from ::print to ::dumpBasicBlock. This means there will be no stray " when printing individual recipes using print()/dump() in a debugger, for example.	2021-01-11 12:22:15 +00:00
David Sherwood	40abeb11f4	[NFC][InstructionCost] Change LoopVectorizationCostModel::getInstructionCost to return InstructionCost This patch is part of a series of patches that migrate integer instruction costs to use InstructionCost. In the function selectVectorizationFactor I have simply asserted that the cost is valid and extracted the value as is. In future we expect to encounter invalid costs, but we should filter out those vectorization factors that lead to such invalid costs. See this patch for the introduction of the type: https://reviews.llvm.org/D91174 See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2020-November/146408.html Differential Revision: https://reviews.llvm.org/D92178	2021-01-11 09:22:37 +00:00
David Sherwood	b7ccaca537	[NFC] Remove min/max functions from InstructionCost Removed the InstructionCost::min/max functions because it's fine to use std::min/max instead. Differential Revision: https://reviews.llvm.org/D94301	2021-01-11 09:00:12 +00:00
Sanjay Patel	3f09c77d33	[SLP] fix typo in assert This snuck into `0aa75fb12f` , but I didn't catch it locally.	2021-01-10 13:15:04 -05:00
Sanjay Patel	0aa75fb12f	[SLP] put verifyFunction call behind EXPENSIVE_CHECKS A severe compile-time slowdown from this call is noted in: https://llvm.org/PR48689 My naive fix was to put it under LLVM_DEBUG ( `267ff79` ), but that's not limiting in the way we want. This is a quick fix (or we could just remove the call completely and rely on some later pass to discover potentially wrong IR?). A bigger/better fix would be to improve/limit verifyFunction() as noted in: https://llvm.org/PR47712 Differential Revision: https://reviews.llvm.org/D94328	2021-01-10 12:32:21 -05:00
Kazu Hirata	6a6e382161	[llvm] Drop unnecessary make_range (NFC)	2021-01-09 09:25:00 -08:00
Florian Hahn	65f578fc0e	[VPlan] Keep start value of VPWidenPHIRecipe as VPValue. Similar to D92129, update VPWidenPHIRecipe to manage the start value as VPValue. This allows adjusting the start value as a VPlan transform, which will be used in a follow-up patch to support reductions during epilogue vectorization. Reviewed By: gilr Differential Revision: https://reviews.llvm.org/D93975	2021-01-09 16:34:15 +00:00
Florian Hahn	c493e9216b	[VPlan] Move reduction start value creation to widenPHIRecipe. This was suggested to prepare for D93975. By moving the start value creation to widenPHInstruction, we set the stage to manage the start value directly in VPWidenPHIRecipe, which be used subsequently to set the 'resume' value for reductions during epilogue vectorization. It also moves RdxDesc to the recipe, so we do not have to rely on Legal to look it up later. Reviewed By: gilr Differential Revision: https://reviews.llvm.org/D94175	2021-01-08 17:49:43 +00:00
Alexander Belyaev	bcbdeafa9c	Revert "[SLP]Need shrink the load vector after reordering." This reverts commit `4284afdf94`. This changes computed values in fused_batchnorm_test_cpu. Not equal to tolerance rtol=1e-06, atol=0.001 Mismatched value: a is different from b. not close where = (array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]), array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]), array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5])) not close lhs = [-0.6636615 -0.9804948 -1.148275 -0.68193716 -0.8572368 -0.65046215 -0.6993756 -1.2244141 -1.0938729 -0.50369143 -0.51830524 -0.738452 -0.7214286 -0.48115745 -0.9380924 -0.9341769 -0.5916775 -1.2896856 -0.7264182 -0.9746917 -0.783249 -0.7659018 -0.86214024 -0.47784212] not close rhs = [ 0.44102234 0.12418899 -0.04359123 0.42274666 0.24744703 0.45422167 0.40530816 -0.11973029 0.01081094 0.6009924 0.5863786 0.3662318 0.38325527 0.62352633 0.1665914 0.1705069 0.5130063 -0.18500176 0.37826565 0.12999213 0.3214348 0.338782 0.24254355 0.62684166] not close dif = [1.1046839 1.1046838 1.1046838 1.1046839 1.1046839 1.1046839 1.1046838 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046838 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046839 1.1046838 1.1046838] not close tol = [0.00100044 0.00100012 0.00100004 0.00100042 0.00100025 0.00100045 0.00100041 0.00100012 0.00100001 0.0010006 0.00100059 0.00100037 0.00100038 0.00100062 0.00100017 0.00100017 0.00100051 0.00100019 0.00100038 0.00100013 0.00100032 0.00100034 0.00100024 0.00100063]	2021-01-08 14:42:26 +01:00
Sanjay Patel	267ff7901c	[SLP] limit verifyFunction to debug build (PR48689) As noted in PR48689, the verifier may have some kind of exponential behavior that should be addressed separately. For now, only run it in debug mode to prevent problems for release+asserts. That limit is what we had before D80401, and I'm not sure if there was a reason to change it in that patch.	2021-01-08 08:10:17 -05:00
Cullen Rhodes	1e7efd397a	[LV] Legalize scalable VF hints In the following loop: void foo(int a, int b, int N) { for (int i=0; i<N; ++i) a[i + 4] = a[i] + b[i]; } The loop dependence constrains the VF to a maximum of (4, fixed), which would mean using <4 x i32> as the vector type in vectorization. Extending this to scalable vectorization, a VF of (4, scalable) implies a vector type of <vscale x 4 x i32>. To determine if this is legal vscale must be taken into account. For this example, unless max(vscale)=1, it's unsafe to vectorize. For SVE, the number of bits in an SVE register is architecturally defined to be a multiple of 128 bits with a maximum of 2048 bits, thus the maximum vscale is 16. In the loop above it is therefore unfeasible to vectorize with SVE. However, in this loop: void foo(int a, int b, int N) { #pragma clang loop vectorize_width(X, scalable) for (int i=0; i<N; ++i) a[i + 32] = a[i] + b[i]; } As long as max(vscale) multiplied by the number of lanes 'X' doesn't exceed the dependence distance, it is safe to vectorize. For SVE a VF of (2, scalable) is within this constraint, since a vector of <16 x 2 x 32> will have no dependencies between lanes. For any number of lanes larger than this it would be unsafe to vectorize. This patch extends 'computeFeasibleMaxVF' to legalize scalable VFs specified as loop hints, implementing the following behaviour: * If the backend does not support scalable vectors, ignore the hint. * If scalable vectorization is unfeasible given the loop dependence, like in the first example above for SVE, then use a fixed VF. * Accept scalable VFs if it's safe to do so. * Otherwise, clamp scalable VFs that exceed the maximum safe VF. Reviewed By: sdesmalen, fhahn, david-arm Differential Revision: https://reviews.llvm.org/D91718	2021-01-08 10:49:44 +00:00

1 2 3 4 5 ...

2322 Commits