llvm-project

Commit Graph

Author	SHA1	Message	Date
Sanjay Patel	8ce1e3b759	[CGP] avoid zext/trunc of a memcmp expansion compare This could be viewed as another shortcoming of the DAGCombiner: when both operands of a compare are zexted from the same source type, we should be able to compare the original types. The effect on PowerPC perf is likely unnoticeable, but there's a visible regression for x86 if we feed the suboptimal IR for memcmp expansion to the DAG: _cmp_eq4_zexted_to_i64: movl (%rdi), %ecx movl (%rsi), %edx xorl %eax, %eax cmpq %rdx, %rcx sete %al _cmp_eq4_better: movl (%rdi), %ecx xorl %eax, %eax cmpl (%rsi), %ecx sete %al llvm-svn: 304923	2017-06-07 16:16:45 +00:00
Nemanja Ivanovic	d8623f0825	[PowerPC] Eliminate integer compare instructions - vol. 5 Adds handling for i64 SETNE comparison (both sign and zero extended). Differential Revision: https://reviews.llvm.org/D33720 llvm-svn: 304907	2017-06-07 13:18:06 +00:00
Nemanja Ivanovic	bb67f847d6	[PowerPC] Eliminate integer compare instructions - vol. 3 Adds handling for i32 SETNE comparison (both sign and zero extended). Differential Revision: https://reviews.llvm.org/D33718 llvm-svn: 304901	2017-06-07 12:23:41 +00:00
Sanjay Patel	f57015d4cc	[CGP / PowerPC] use direct compares if there's only one load per block in memcmp() expansion I'd like to enable CGP memcmp expansion for x86, but the output from CGP would regress the special cases (memcmp(x,y,N) != 0 for N=1,2,4,8,16,32 bytes) that we already handle. I'm not sure if we'll actually be able to produce the optimal code given the block-at-a-time limitation in the DAG. We might have to just avoid those special-cases here in CGP. But regardless of that, I think this is a win for the more general cases. http://rise4fun.com/Alive/cbQ Differential Revision: https://reviews.llvm.org/D33963 llvm-svn: 304849	2017-06-07 00:17:08 +00:00
Sanjay Patel	7a52296f1f	[PowerPC] auto-generate full checks and increase test coverage 3 of the tests were testing exactly the same thing: memcmp(x, y, 16) != 0. I changed that to test 4, 7, and 16 bytes, so we can see how those differ. llvm-svn: 304838	2017-06-06 22:06:07 +00:00
Matthias Braun	0021d46a1c	RegisterScavenging: Add ScavengerTest pass This pass allows to run the register scavenging independently of PrologEpilogInserter to allow targeted testing. Also adds some basic register scavenging tests. llvm-svn: 304606	2017-06-02 23:01:42 +00:00
Sean Fertile	457ddd311a	[PowerPC] Correctly specify the cache line size for Power 7, 8 and 9. Fixes PPCTTIImpl::getCacheLineSize() returning the wrong cache line size for newer ppc processors. Commiting on behalf of Stefan Pintilie. Differential Revision: https://reviews.llvm.org/D33656 llvm-svn: 304317	2017-05-31 18:20:17 +00:00
Zaara Syeda	3a7578c658	[PPC] Inline expansion of memcmp This patch does an inline expansion of memcmp. It changes the memcmp library call into an inline expansion when the size is known at compile time and is under a target specified threshold. This expansion is implemented in CodeGenPrepare and expands into straight line code. The target specifies a maximum load size and the expansion works by using this size to load the two sources, compare, and exit early if a difference is found. It also has a special case when the memcmp result is used in a compare to zero equality. Differential Revision: https://reviews.llvm.org/D28637 llvm-svn: 304313	2017-05-31 17:12:38 +00:00
Tony Jiang	60c247de18	[PowerPC] Fix a performance bug for PPC::XXPERMDI. There are some VectorShuffle Nodes in SDAG which can be selected to XXPERMDI Instruction, this patch recognizes them and does the selection to improve the PPC performance. Differential Revision: https://reviews.llvm.org/D33404 llvm-svn: 304298	2017-05-31 13:09:57 +00:00
Nemanja Ivanovic	accab033c9	[PowerPC] Eliminate integer compare instructions - vol. 3 This patch builds upon https://reviews.llvm.org/rL302810 to add handling for the 64-bit SETEQ patterns. Differential Revision: https://reviews.llvm.org/D33369 llvm-svn: 304286	2017-05-31 08:04:07 +00:00
Nemanja Ivanovic	e597bd8230	[PowerPC] Eliminate integer compare instructions - vol. 2 This patch builds upon https://reviews.llvm.org/rL302810 to add handling for bitwise logical operations in general purpose registers. The idea is to keep the values in GPRs as long as possible - only extracting them to a condition register bit when no further operations are to be done. Differential Revision: https://reviews.llvm.org/D31851 llvm-svn: 304282	2017-05-31 05:40:25 +00:00
Tim Shen	0bd0aa8f07	[AntiDepBreaker] Revert r299124 and add a test. Summary: AntiDepBreaker intends to add all live-outs, including the implicit CSRs, in StartBlock. r299124 was done without understanding that intention. Now with the live-ins propagated correctly (D32464), we can revert this change. Reviewers: MatzeB, qcolombet Subscribers: nemanjai, llvm-commits Differential Revision: https://reviews.llvm.org/D33697 llvm-svn: 304251	2017-05-30 22:26:52 +00:00
Matthias Braun	eec1f3672a	LivePhysRegs: Fix addLiveOutsNoPristines() for return blocks past PEI Re-commit r303938 and r303954 with a fix for addLiveIns(): the internal addPristines() function must be called on an empty set or it may accidentally reset saved registers. - addLiveOutsNoPristines() needs to add callee saved registers that are actually saved and restored somewhere to the set (they are not pristine). - Cleanup/rewrite the code for addLiveOuts()/addLiveOutsNoPristines(). This fixes the problem from D32156. Differential Revision: https://reviews.llvm.org/D32464 llvm-svn: 304001	2017-05-26 16:23:08 +00:00
Matthias Braun	c93c063993	Revert "LivePhysRegs: Fix addLiveOutsNoPristines() for return blocks past PEI" Tentatively revert this to see if it fixes the buildbot stage2 breakages. This reverts commit r303938. This reverts commit r303954. llvm-svn: 303960	2017-05-26 02:25:20 +00:00
Matthias Braun	9e6826de77	Test for r303938 llvm-svn: 303954	2017-05-26 01:29:25 +00:00
Tim Shen	a2b85da879	[PPC] Fix atomics lowering in DAG lowering. I forgot to forward the chain, causing some missing instruction dependencies. The test crashes the compiler without this patch. Inspired by the test case, D33519 also tries to remove the extra sync. Differential Revision: https://reviews.llvm.org/D33573 llvm-svn: 303931	2017-05-25 22:58:35 +00:00
Tony Jiang	0a429f040e	[PowerPC] Fix a performance bug for PPC::XXSLDWI. There are some VectorShuffle Nodes in SDAG which can be selected to XXSLDWI instruction, this patch recognizes them and does the selection to improve the PPC performance. llvm-svn: 303822	2017-05-24 23:48:29 +00:00
Zaara Syeda	932978315b	P9: D-form vector load/store. Differential Revision: https://reviews.llvm.org/D33248 llvm-svn: 303780	2017-05-24 17:50:37 +00:00
Hiroshi Inoue	37e63b1b21	Summary PPC backend eliminates compare instructions by using record-form instructions in PPCInstrInfo::optimizeCompareInstr, which is called from peephole optimization pass. This patch improves this optimization to eliminate more compare instructions in two types of common case. - comparison against a constant 1 or -1 The record-form instructions set CR bit based on signed comparison against 0. So, the current implementation does not exploit the record-form instruction for comparison against a non-zero constant. This patch enables record-form optimization for constant of 1 or -1 if possible; it changes the condition "greater than -1" into "greater than or equal to 0" and "less than 1" into "less than or equal to 0". With this patch, compare can be eliminated in the following code sequence, as an example. uint64_t a, b; if ((a \| b) & 0x8000000000000000ull) { ... } else { ... } - andi for 32-bit comparison on PPC64 Since record-form instructions execute 64-bit signed comparison and so we have limitation in eliminating 32-bit comparison, i.e. with cmplwi, using the record-form. The original implementation already has such checks but andi. is not recognized as an instruction which executes implicit zero extension and hence safe to convert into record-form if used for equality check. %1 = and i32 %a, 10 %2 = icmp ne i32 %1, 0 br i1 %2, label %foo, label %bar In this simple example, LLVM generates andi. + cmplwi + beq on PPC64. This patch make it possible to eliminate the cmplwi for this case. I added andi. for optimization targets if it is safe to do so. Differential Revision: https://reviews.llvm.org/D30081 llvm-svn: 303500	2017-05-21 06:00:05 +00:00
Kyle Butt	f6c61ef64d	CodeGen: Power: Add lowering for shifts of v1i128. When legalizing vector operations on vNi128, they will be split to v1i128 because that is a legal type on ppc64, but then the compiler will crash in selection dag because it fails to select for these operations. This patch fixes shift operations. Logical shift right and left shift can be performed in the vector unit, but algebraic shift right requires being split. Differential Revision: https://reviews.llvm.org/D32774 llvm-svn: 303307	2017-05-17 21:54:41 +00:00
Krzysztof Parzyszek	2b0533126e	[PPC] Properly update register save area offsets The variables MinGPR/MinG8R were not updated properly when resetting the offsets, which in the included testcase lead to saving the CR register in the same location as R30. This fixes another issue reported in PR26519. Differential Revision: https://reviews.llvm.org/D33017 llvm-svn: 303257	2017-05-17 13:25:09 +00:00
Tim Shen	0fbbef43e0	[PPC] Add -ppc-asm-full-reg-names to atomic-2.ll. NFC. Differential Revisions: https://reviews.llvm.org/D32763 llvm-svn: 303209	2017-05-16 20:58:55 +00:00
Tim Shen	3bef27cc6f	[PPC] Lower load acquire/seq_cst trailing fence to cmp + bne + isync. Summary: This fixes pr32392. The lowering pipeline is: llvm.ppc.cfence in IR -> PPC::CFENCE8 in isel -> Actual instructions in expandPostRAPseudo. The reason why expandPostRAPseudo is chosen is because previous passes are likely eliminating instructions like cmpw 3, 3 (early CSE) and bne- 7, .+4 (some branch pass(s)). Differential Revision: https://reviews.llvm.org/D32763 llvm-svn: 303205	2017-05-16 20:18:06 +00:00
Nirav Dave	da8f221273	Elide stores which are overwritten without being observed. Summary: In SelectionDAG, when a store is immediately chained to another store to the same address, elide the first store as it has no observable effects. This is causes small improvements dealing with intrinsics lowered to stores. Test notes: * Many testcases overwrite store addresses multiple times and needed minor changes, mainly making stores volatile to prevent the optimization from optimizing the test away. * Many X86 test cases optimized out instructions associated with associated with va_start. * Note that test_splat in CodeGen/AArch64/misched-stp.ll no longer has dependencies to check and can probably be removed and potentially replaced with another test. Reviewers: rnk, john.brawn Subscribers: aemerson, rengolin, qcolombet, jyknight, nemanjai, nhaehnle, javed.absar, llvm-commits Differential Revision: https://reviews.llvm.org/D33206 llvm-svn: 303198	2017-05-16 19:43:56 +00:00
Kyle Butt	7d531daece	CodeGen: BlockPlacement: Increase tail duplication size for O3. At O3 we are more willing to increase size if we believe it will improve performance. The current threshold for tail-duplication of 2 instructions is conservative, and can be relaxed at O3. Benchmark results: llvm test-suite: 6% improvement in aha, due to duplication of loop latch 3% improvement in hexxagon 2% slowdown in lpbench. Seems related, but couldn't completely diagnose. Internal google benchmark: Produces 4% improvement on internal google protocol buffer serialization benchmarks. Differential-Revision: https://reviews.llvm.org/D32324 llvm-svn: 303084	2017-05-15 17:30:47 +00:00
Guozhi Wei	22e7da9597	[PPC] Change the register constraint of the first source operand of instruction mtvsrdd to g8rc_nox0 According to Power ISA V3.0 document, the first source operand of mtvsrdd is constant 0 if r0 is specified. So the corresponding register constraint should be g8rc_nox0. This bug caused wrong output generated by 401.bzip2 when -mcpu=power9 and fdo are specified. Differential Revision: https://reviews.llvm.org/D32880 llvm-svn: 302834	2017-05-11 22:17:35 +00:00
Nemanja Ivanovic	96c3d626a2	[PowerPC] Eliminate integer compare instructions - vol. 1 This patch is the first in a series of patches to provide code gen for doing compares in GPRs when the compare result is required in a GPR. It adds the infrastructure to select GPR sequences for i1->i32 and i1->i64 extensions. This first patch handles equality comparison on i32 operands with the result sign or zero extended. Differential Revision: https://reviews.llvm.org/D31847 llvm-svn: 302810	2017-05-11 16:54:23 +00:00
Serge Pavlov	d526b13e61	Add extra operand to CALLSEQ_START to keep frame part set up previously Using arguments with attribute inalloca creates problems for verification of machine representation. This attribute instructs the backend that the argument is prepared in stack prior to CALLSEQ_START..CALLSEQ_END sequence (see http://llvm.org/docs/InAlloca.htm for details). Frame size stored in CALLSEQ_START in this case does not count the size of this argument. However CALLSEQ_END still keeps total frame size, as caller can be responsible for cleanup of entire frame. So CALLSEQ_START and CALLSEQ_END keep different frame size and the difference is treated by MachineVerifier as stack error. Currently there is no way to distinguish this case from actual errors. This patch adds additional argument to CALLSEQ_START and its target-specific counterparts to keep size of stack that is set up prior to the call frame sequence. This argument allows MachineVerifier to calculate actual frame size associated with frame setup instruction and correctly process the case of inalloca arguments. The changes made by the patch are: - Frame setup instructions get the second mandatory argument. It affects all targets that use frame pseudo instructions and touched many files although the changes are uniform. - Access to frame properties are implemented using special instructions rather than calls getOperand(N).getImm(). For X86 and ARM such replacement was made previously. - Changes that reflect appearance of additional argument of frame setup instruction. These involve proper instruction initialization and methods that access instruction arguments. - MachineVerifier retrieves frame size using method, which reports sum of frame parts initialized inside frame instruction pair and outside it. The patch implements approach proposed by Quentin Colombet in https://bugs.llvm.org/show_bug.cgi?id=27481#c1. It fixes 9 tests failed with machine verifier enabled and listed in PR27481. Differential Revision: https://reviews.llvm.org/D32394 llvm-svn: 302527	2017-05-09 13:35:13 +00:00
Krzysztof Parzyszek	038a0546db	[PPC] When restoring R30 (PIC base pointer), mark it as <def> This happened on the PPC32/SVR4 path and was discovered when building FreeBSD on PPC32. It was a typo-class error in the frame lowering code. This fixes PR26519. llvm-svn: 302183	2017-05-04 19:14:54 +00:00
Tim Shen	e59d06fe78	[PowerPC, DAGCombiner] Fold a << (b % (sizeof(a) * 8)) back to a single instruction Summary: This is the corresponding llvm change to D28037 to ensure no performance regression. Reviewers: bogner, kbarton, hfinkel, iteratee, echristo Subscribers: nemanjai, llvm-commits Differential Revision: https://reviews.llvm.org/D28329 llvm-svn: 301990	2017-05-03 00:07:02 +00:00
Nemanja Ivanovic	b89c27f515	[PowerPC] Emit VMX loads/stores for aligned ops to avoid adding swaps on LE Fixes PR30730. This is a re-commit of a pulled commit. The commit was pulled because some software projects contained uses of Altivec vectors that violated alignment requirements. Known issues have now been fixed. Committing on behalf of Lei Huang. Differential Revision: https://reviews.llvm.org/D26861 llvm-svn: 301892	2017-05-02 01:47:34 +00:00
Sanjoy Das	ba0daee6b2	[StackMaps] Increase the size of the "location size" field Summary: In some cases LLVM (especially the SLP vectorizer) will create vectors that are 256 bytes (or larger). Given that this is intentional[0] is likely to get more common, this patch updates the StackMap binary format to deal with the spill locations for said vectors. This change also bumps the stack map version from 2 to 3. [0]: https://reviews.llvm.org/D32533#738350 Reviewers: reames, kavon, skatkov, javed.absar Subscribers: mcrosier, nemanjai, llvm-commits Differential Revision: https://reviews.llvm.org/D32629 llvm-svn: 301615	2017-04-28 04:48:42 +00:00
Adrian Prantl	083e6a5b5c	Don't emit CFI instructions at the end of a function When functions are terminated by unreachable instructions, the last instruction might trigger a CFI instruction to be generated. However, emitting it would be be illegal since the function (and thus the FDE the CFI is in) has already ended with the previous instruction. Darwin's dwarfdump --verify --eh-frame complains about this and the specification supports this. Relevant bits from the DWARF 5 standard (6.4 Call Frame Information): "[The] address_range [field in an FDE]: The number of bytes of program instructions described by this entry." "Row creation instructions: [...] The new location value is always greater than the current one." The first quotation implies that a CFI cannot describe a target address outside of the enclosing FDE's range. rdar://problem/26244988 Differential Revision: https://reviews.llvm.org/D32246 llvm-svn: 301219	2017-04-24 18:45:59 +00:00
Sanjay Patel	ae382bb6af	[DAG] add splat vector support for 'xor' in SimplifyDemandedBits This allows forming more 'not' ops, so we get improvements for ISAs that have and-not. Follow-up to: https://reviews.llvm.org/rL300725 llvm-svn: 300763	2017-04-19 21:23:09 +00:00
Sanjay Patel	c9d36f181f	[PowerPC] add test and auto-generate checks; NFC llvm-svn: 300700	2017-04-19 14:58:09 +00:00
Hal Finkel	cef9e52736	[PowerPC] multiply-with-overflow might use the CTR register Check the legality of ISD::[US]MULO to see whether Intrinsic::[us]mul_with_overflow will legalize into a function call (and, thus, will use the CTR register). Fixes PR32485. Patch by Tim Neumann! Differential Revision: https://reviews.llvm.org/D31790 llvm-svn: 299910	2017-04-11 02:03:17 +00:00
Matt Arsenault	f10061ec70	Add address space mangling to lifetime intrinsics In preparation for allowing allocas to have non-0 addrspace. llvm-svn: 299876	2017-04-10 20:18:21 +00:00
Eli Friedman	5fba1e53f2	Turn on -addr-sink-using-gep by default. The new codepath has been in the tree for years, and there isn't any reason to use two codepaths here. Differential Revision: https://reviews.llvm.org/D30596 llvm-svn: 299723	2017-04-06 22:42:18 +00:00
Sanjay Patel	b2f1621bb1	[DAGCombiner] add and use TLI hook to convert and-of-seteq / or-of-setne to bitwise logic+setcc (PR32401) This is a generic combine enabled via target hook to reduce icmp logic as discussed in: https://bugs.llvm.org/show_bug.cgi?id=32401 It's likely that other targets will want to enable this hook for scalar transforms, and there are probably other patterns that can use bitwise logic to reduce comparisons. Note that we are missing an IR canonicalization for these patterns, and we will probably prefer the pair-of-compares form in IR (shorter, more likely to fold). Differential Revision: https://reviews.llvm.org/D31483 llvm-svn: 299542	2017-04-05 14:09:39 +00:00
Sanjay Patel	a4546efbc8	add/move codegen tests for and/or of setcc; NFC llvm-svn: 299396	2017-04-03 22:45:46 +00:00
Sanjay Patel	665021e7ee	[DAGCombiner] enable vector transforms for any/all {sign} bits set/clear The code already allowed vector types in via "isInteger" (which might want a more specific name), so use splat-friendly constant predicates to match those types. llvm-svn: 299304	2017-04-01 15:05:54 +00:00
Sanjay Patel	fe9340c168	[PowerPC, x86] add vector tests for any/all {sign} bits set/clear; NFC llvm-svn: 299303	2017-04-01 14:32:18 +00:00
Sanjay Patel	34da36e74f	[DAGCombiner] add fold for 'All sign bits set?' (and (setlt X, 0), (setlt Y, 0)) --> (setlt (and X, Y), 0) We have 7 similar folds, but this one got away. The fact that the x86 test with a branch didn't change is probably a separate bug. We may also be missing this and the related folds in instcombine. llvm-svn: 299252	2017-03-31 20:28:06 +00:00
Sanjay Patel	a97d36fdda	[PowerPC] add tests for setcc+setcc+logic; NFC These are the same tests added for x86 with r299238, but PPC doesn't specify all branches as cheap, so we see different patterns in tests with branches. llvm-svn: 299244	2017-03-31 18:51:03 +00:00
Eric Christopher	9fd267c221	Temporarily revert "[PPC] In PPCBoolRetToInt change the bool value to i64 if the target is ppc64" as it's causing test failures, I've given Carrot a testcase offline. This reverts commit r298955. llvm-svn: 299153	2017-03-31 02:16:54 +00:00
Eric Christopher	bf8f759305	Add testcase for r299124. Patch by Tim Shen! llvm-svn: 299125	2017-03-30 22:35:10 +00:00
Adam Nemet	edaec6de73	[DAGCombiner] Initial support for the fast-math flag contract Now alternatively to the TargetOption.AllowFPOpFusion global flag, FMUL->FADD can also use the per operation FMF to allow fusion. The idea here is not to port everything to the new scheme (e.g. fused multiply-and-sub will be ported later) but that this work all the way from clang. The transformation is conditionalized on both the FADD and the FMUL having the FMF contract flag. Differential Revision: https://reviews.llvm.org/D31169 llvm-svn: 299096	2017-03-30 18:53:04 +00:00
Sanjay Patel	b8a728f993	[CodeGen] clean up and add tests for scalar and-of-setcc; NFC https://bugs.llvm.org/show_bug.cgi?id=32401 llvm-svn: 299034	2017-03-29 21:58:52 +00:00
Guozhi Wei	f8d40181c9	[PPC] In PPCBoolRetToInt change the bool value to i64 if the target is ppc64 In PPCBoolRetToInt bool value is changed to i32 type. On ppc64 it may introduce an extra zero extension for the return value. This patch changes the integer type to i64 to avoid the zero extension on ppc64. This patch fixed PR32442. Differential Revision: https://reviews.llvm.org/D31407 llvm-svn: 298955	2017-03-28 22:55:01 +00:00
Dehao Chen	b197d5b0a0	Fix trellis layout to avoid mis-identify triangle. Summary: For the following CFG: A->B B->C A->C If there is another edge B->D, then ABC should not be considered as triangle. Reviewers: davidxl, iteratee Reviewed By: iteratee Subscribers: nemanjai, llvm-commits Differential Revision: https://reviews.llvm.org/D31310 llvm-svn: 298661	2017-03-23 23:28:09 +00:00
Tim Shen	ce26a45f7c	[PPC] Add generated tests for all atomic operations Summary: Add tests for all atomic operations for powerpc64le, so that all changes can be easily examined. Reviewers: kbarton, hfinkel, echristo Subscribers: mehdi_amini, nemanjai, llvm-commits Differential Revision: https://reviews.llvm.org/D31285 llvm-svn: 298614	2017-03-23 16:02:47 +00:00
Oren Ben Simhon	9ce0ec5dbc	CalleeSavedRegister was removed from MIR and is recalculated upon MIR parsing. llvm-svn: 298210	2017-03-19 11:18:09 +00:00
Kyle Butt	d609d6ebf0	CodeGen: BlockPlacement: Adjust test case so it covers rL297925. NFC I had ajusted the test case before when testing a chain of length 2, and then reverted it with rL296845 when I switched to 3 triangles. After running benchmarks and examining generated code at length 2 I forgot to put the test back. llvm-svn: 298000	2017-03-16 21:33:29 +00:00
Nemanja Ivanovic	ffcf0fb1cc	[PowerPC][Altivec] Add mfvrd and mffprd extended mnemonic mfvrd and mffprd are both alias to mfvrsd. This patch enables correct parsing of the aliases, but we still emit a mfvrsd. Committing on behalf of brunoalr (Bruno Rosa). Differential Revision: https://reviews.llvm.org/D29177 llvm-svn: 297849	2017-03-15 16:04:53 +00:00
Nirav Dave	54e22f33d9	In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. Recommiting with compiler time improvements Recommitting after fixup of 32-bit aliasing sign offset bug in DAGCombiner. * Simplify Consecutive Merge Store Candidate Search Now that address aliasing is much less conservative, push through simplified store merging search and chain alias analysis which only checks for parallel stores through the chain subgraph. This is cleaner as the separation of non-interfering loads/stores from the store-merging logic. When merging stores search up the chain through a single load, and finds all possible stores by looking down from through a load and a TokenFactor to all stores visited. This improves the quality of the output SelectionDAG and the output Codegen (save perhaps for some ARM cases where we correctly constructs wider loads, but then promotes them to float operations which appear but requires more expensive constant generation). Some minor peephole optimizations to deal with improved SubDAG shapes (listed below) Additional Minor Changes: 1. Finishes removing unused AliasLoad code 2. Unifies the chain aggregation in the merged stores across code paths 3. Re-add the Store node to the worklist after calling SimplifyDemandedBits. 4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but seems sufficient to not cause regressions in tests. 5. Remove Chain dependencies of Memory operations on CopyfromReg nodes as these are captured by data dependence 6. Forward loads-store values through tokenfactors containing {CopyToReg,CopyFromReg} Values. 7. Peephole to convert buildvector of extract_vector_elt to extract_subvector if possible (see CodeGen/AArch64/store-merge.ll) 8. Store merging for the ARM target is restricted to 32-bit as some in some contexts invalid 64-bit operations are being generated. This can be removed once appropriate checks are added. This finishes the change Matt Arsenault started in r246307 and jyknight's original patch. Many tests required some changes as memory operations are now reorderable, improving load-store forwarding. One test in particular is worth noting: CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store forwarding converts a load-store pair into a parallel store and a memory-realized bitcast of the same value. However, because we lose the sharing of the explicit and implicit store values we must create another local store. A similar transformation happens before SelectionDAG as well. Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle llvm-svn: 297695	2017-03-14 00:34:14 +00:00
Tim Shen	c7472d912b	Revert "Revert "[PowerPC][ELFv2ABI] Allocate parameter area on-demand to reduce stack frame size"" After inspection, it's an UB in our code base. Someone cast a var-arg function pointer to a non-var-arg one. :/ Re-commit r296771 to continue testing on the patch. Sorry for the trouble! llvm-svn: 297256	2017-03-08 02:41:35 +00:00
Tim Shen	70054bb827	Revert "[PowerPC][ELFv2ABI] Allocate parameter area on-demand to reduce stack frame size" This reverts commit r296771. We found some wide spread test failures internally. I'm working on a testcase. Politely revert the patch in the mean time. :) llvm-svn: 297124	2017-03-07 07:40:10 +00:00
Nemanja Ivanovic	12e67d868a	[PowerPC] Fix failure with STBRX when store is narrower than the bswap Fixes a crash caused by r296811 by truncating the input of the STBRX node when the bswap is wider than i32. Fixes https://bugs.llvm.org/show_bug.cgi?id=32140 Differential Revision: https://reviews.llvm.org/D30615 llvm-svn: 297001	2017-03-06 07:32:13 +00:00
Sanjay Patel	066f3208bf	[DAGCombiner] allow transforming (select Cond, C +/- 1, C) to (add(ext Cond), C) select Cond, C +/- 1, C --> add(ext Cond), C -- with a target hook. This is part of the ongoing process to obsolete D24480. The motivation is to canonicalize to select IR in InstCombine whenever possible, so we need to have a way to undo that easily in codegen. PowerPC is an obvious winner for this kind of transform because it has fast and complete bit-twiddling abilities but generally lousy conditional execution perf (although this might have changed in recent implementations). x86 also sees some wins, but the effect is limited because these transforms already mostly exist in its target-specific combineSelectOfTwoConstants(). The fact that we see any x86 changes just shows that that code is a mess of special-case holes. We may be able to remove some of that logic now. My guess is that other targets will want to enable this hook for most cases. The likely follow-ups would be to add value type and/or the constants themselves as parameters for the hook. As the tests in select_const.ll show, we can transform any select-of-constants to math/logic, but the general transform for any 2 constants needs one more instruction (multiply or 'and'). ARM is one target that I think may not want this for most cases. I see infinite loops there because it wants to use selects to enable conditionally executed instructions. Differential Revision: https://reviews.llvm.org/D30537 llvm-svn: 296977	2017-03-04 19:18:09 +00:00
Chandler Carruth	ce52b80744	[SDAG] Revert r296476 (and r296486, r296668, r296690). This patch causes compile times for some patterns to explode. I have a (large, unreduced) test case that slows down by more than 20x and several test cases slow down by 2x. I'm sending some of the test cases directly to Nirav and following up with more details in the review log, but this should unblock anyone else hitting this. llvm-svn: 296862	2017-03-03 10:02:25 +00:00
Kyle Butt	1fa6030767	CodeGen: BlockPlacement: Precompute layout for chains of triangles. For chains of triangles with small join blocks that can be tail duplicated, a simple calculation of probabilities is insufficient. Tail duplication can be profitable in 3 different ways for these cases: 1) The post-dominators marked 50% are actually taken 56% (This shrinks with longer chains) 2) The chains are statically correlated. Branch probabilities have a very U-shaped distribution. [http://nrs.harvard.edu/urn-3:HUL.InstRepos:24015805] If the branches in a chain are likely to be from the same side of the distribution as their predecessor, but are independent at runtime, this transformation is profitable. (Because the cost of being wrong is a small fixed cost, unlike the standard triangle layout where the cost of being wrong scales with the # of triangles.) 3) The chains are dynamically correlated. If the probability that a previous branch was taken positively influences whether the next branch will be taken We believe that 2 and 3 are common enough to justify the small margin in 1. The code pre-scans a function's CFG to identify this pattern and marks the edges so that the standard layout algorithm can use the computed results. llvm-svn: 296845	2017-03-03 01:00:22 +00:00
Guozhi Wei	ed28e742ee	[PPC] Fix code generation for bswap(int32) followed by store16 This patch fixes pr32063. Current code in PPCTargetLowering::PerformDAGCombine can transform bswap store into a single PPCISD::STBRX instruction. but it doesn't consider the case that the operand size of bswap may be larger than store size. When it occurs, we need 2 modifications, 1 For the last operand of PPCISD::STBRX, we should not use DAG.getValueType(N->getOperand(1).getValueType()), instead we should use cast<StoreSDNode>(N)->getMemoryVT(). 2 Before PPCISD::STBRX, we need to shift the original operand of bswap to the right side. Differential Revision: https://reviews.llvm.org/D30362 llvm-svn: 296811	2017-03-02 21:07:59 +00:00
Nemanja Ivanovic	db8425eff0	[PowerPC][ELFv2ABI] Allocate parameter area on-demand to reduce stack frame size This patch reduces the stack frame size by not allocating the parameter area if it is not required. In the current implementation LowerFormalArguments_64SVR4 already handles the parameter area, but LowerCall_64SVR4 does not (when calculating the stack frame size). What this patch does is make LowerCall_64SVR4 consistent with LowerFormalArguments_64SVR4. Committing on behalf of Hiroshi Inoue. Differential Revision: https://reviews.llvm.org/D29881 llvm-svn: 296771	2017-03-02 17:38:59 +00:00
Sanjay Patel	92938657a0	[DAGCombiner] fold binops with constant into select-of-constants This is part of the ongoing attempt to improve select codegen for all targets and select canonicalization in IR (see D24480 for more background). The transform is a subset of what is done in InstCombine's FoldOpIntoSelect(). I first noticed a regression in the x86 avx512-insert-extract.ll tests with a patch that hopes to convert more selects to basic math ops. This appears to be a general missing DAG transform though, so I added tests for all standard binops in rL296621 (PowerPC was chosen semi-randomly; it has scripted FileCheck support, but so do ARM and x86). The poor output for "sel_constants_shl_constant" is tracked with: https://bugs.llvm.org/show_bug.cgi?id=32105 Differential Revision: https://reviews.llvm.org/D30502 llvm-svn: 296699	2017-03-01 22:51:31 +00:00
Nemanja Ivanovic	b223cfabcc	Improve scheduling with branch coalescing This patch adds a MachineSSA pass that coalesces blocks that branch on the same condition. Committing on behalf of Lei Huang. Differential Revision: https://reviews.llvm.org/D28249 llvm-svn: 296670	2017-03-01 20:29:34 +00:00
Artur Pilipenko	e1b2d31468	[DAGCombiner] Support {a\|s}ext, {a\|z\|s}ext load nodes in load combine Resubmit r295336 after the bug with non-zero offset patterns on BE targets is fixed (r296336). Support {a\|s}ext, {a\|z\|s}ext load nodes as a part of load combine patters. Reviewed By: filcab Differential Revision: https://reviews.llvm.org/D29591 llvm-svn: 296651	2017-03-01 18:12:29 +00:00
Sanjay Patel	ffc6943011	[PPC] add tests for select-of-constants with binop; NFC llvm-svn: 296621	2017-03-01 14:26:49 +00:00
Nirav Dave	f830dec3f2	In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner. * Simplify Consecutive Merge Store Candidate Search Now that address aliasing is much less conservative, push through simplified store merging search and chain alias analysis which only checks for parallel stores through the chain subgraph. This is cleaner as the separation of non-interfering loads/stores from the store-merging logic. When merging stores search up the chain through a single load, and finds all possible stores by looking down from through a load and a TokenFactor to all stores visited. This improves the quality of the output SelectionDAG and the output Codegen (save perhaps for some ARM cases where we correctly constructs wider loads, but then promotes them to float operations which appear but requires more expensive constant generation). Some minor peephole optimizations to deal with improved SubDAG shapes (listed below) Additional Minor Changes: 1. Finishes removing unused AliasLoad code 2. Unifies the chain aggregation in the merged stores across code paths 3. Re-add the Store node to the worklist after calling SimplifyDemandedBits. 4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but seems sufficient to not cause regressions in tests. 5. Remove Chain dependencies of Memory operations on CopyfromReg nodes as these are captured by data dependence 6. Forward loads-store values through tokenfactors containing {CopyToReg,CopyFromReg} Values. 7. Peephole to convert buildvector of extract_vector_elt to extract_subvector if possible (see CodeGen/AArch64/store-merge.ll) 8. Store merging for the ARM target is restricted to 32-bit as some in some contexts invalid 64-bit operations are being generated. This can be removed once appropriate checks are added. This finishes the change Matt Arsenault started in r246307 and jyknight's original patch. Many tests required some changes as memory operations are now reorderable, improving load-store forwarding. One test in particular is worth noting: CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store forwarding converts a load-store pair into a parallel store and a memory-realized bitcast of the same value. However, because we lose the sharing of the explicit and implicit store values we must create another local store. A similar transformation happens before SelectionDAG as well. Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle llvm-svn: 296476	2017-02-28 14:24:15 +00:00
Michael Kuperstein	13bf8a2684	[CGP] Split some critical edges coming out of indirect branches Splitting critical edges when one of the source edges is an indirectbr is hard in general (because it requires changing the memory the indirectbr reads). But if a block only has a single indirectbr predecessor (which is the common case), we can simulate splitting that edge by splitting the destination block, and retargeting the direct branches. This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame() ends up using an indirect branch with ~100 successors, and passing a constant to each of those. Since MachineSink can't break indirect critical edges on demand (and doing this in MIR doesn't look feasible), this causes us to emit about ~100 defs of registers containing constants, which we in the predecessor block, where only one of those constants is used in each successor. So, at each computed goto, we needlessly spill about a 100 constants to stack. The end result is that a clang-compiled python interpreter can be about ~2.5x slower on a simple python reduction loop than a gcc-compiled interpreter. Differential Revision: https://reviews.llvm.org/D29916 llvm-svn: 296416	2017-02-28 00:11:34 +00:00
Amaury Sechet	681472cd0f	Do full codegen for various tests. NFC llvm-svn: 296305	2017-02-27 01:15:57 +00:00
Daniel Jasper	3ca4525612	Revert "[CGP] Split some critical edges coming out of indirect branches" This reverts commit r296149 as it leads to crashes when compiling for PPC. llvm-svn: 296295	2017-02-26 11:09:12 +00:00
Nirav Dave	73cd0194cf	Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled." This reverts commit r296252 until 256-bit operations are more efficiently generated in X86. llvm-svn: 296279	2017-02-26 01:27:32 +00:00
Nirav Dave	beabf456df	In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. Recommiting after fixup of 32-bit aliasing sign offset bug in DAGCombiner. * Simplify Consecutive Merge Store Candidate Search Now that address aliasing is much less conservative, push through simplified store merging search and chain alias analysis which only checks for parallel stores through the chain subgraph. This is cleaner as the separation of non-interfering loads/stores from the store-merging logic. When merging stores search up the chain through a single load, and finds all possible stores by looking down from through a load and a TokenFactor to all stores visited. This improves the quality of the output SelectionDAG and the output Codegen (save perhaps for some ARM cases where we correctly constructs wider loads, but then promotes them to float operations which appear but requires more expensive constant generation). Some minor peephole optimizations to deal with improved SubDAG shapes (listed below) Additional Minor Changes: 1. Finishes removing unused AliasLoad code 2. Unifies the chain aggregation in the merged stores across code paths 3. Re-add the Store node to the worklist after calling SimplifyDemandedBits. 4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but seems sufficient to not cause regressions in tests. 5. Remove Chain dependencies of Memory operations on CopyfromReg nodes as these are captured by data dependence 6. Forward loads-store values through tokenfactors containing {CopyToReg,CopyFromReg} Values. 7. Peephole to convert buildvector of extract_vector_elt to extract_subvector if possible (see CodeGen/AArch64/store-merge.ll) 8. Store merging for the ARM target is restricted to 32-bit as some in some contexts invalid 64-bit operations are being generated. This can be removed once appropriate checks are added. This finishes the change Matt Arsenault started in r246307 and jyknight's original patch. Many tests required some changes as memory operations are now reorderable, improving load-store forwarding. One test in particular is worth noting: CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store forwarding converts a load-store pair into a parallel store and a memory-realized bitcast of the same value. However, because we lose the sharing of the explicit and implicit store values we must create another local store. A similar transformation happens before SelectionDAG as well. Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle llvm-svn: 296252	2017-02-25 11:43:58 +00:00
Michael Kuperstein	46b131e3f8	[CGP] Split some critical edges coming out of indirect branches Splitting critical edges when one of the source edges is an indirectbr is hard in general (because it requires changing the memory the indirectbr reads). But if a block only has a single indirectbr predecessor (which is the common case), we can simulate splitting that edge by splitting the destination block, and retargeting the direct branches. This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame() ends up using an indirect branch with ~100 successors, and passing a constant to each of those. Since MachineSink can't break indirect critical edges on demand (and doing this in MIR doesn't look feasible), this causes us to emit about ~100 defs of registers containing constants, which we in the predecessor block, where only one of those constants is used in each successor. So, at each computed goto, we needlessly spill about a 100 constants to stack. The end result is that a clang-compiled python interpreter can be about ~2.5x slower on a simple python reduction loop than a gcc-compiled interpreter. Differential Revision: https://reviews.llvm.org/D29916 llvm-svn: 296149	2017-02-24 18:41:32 +00:00
Nemanja Ivanovic	195c5452d3	[PowerPC] Use subfic instruction for subtract from immediate Provide a 64-bit pattern to use SUBFIC for subtracting from a 16-bit immediate. The corresponding pattern already exists for 32-bit integers. Committing on behalf of Hiroshi Inoue. Differential Revision: https://reviews.llvm.org/D29387 llvm-svn: 296144	2017-02-24 18:16:06 +00:00
Nemanja Ivanovic	82d53ed492	[PowerPC] Use rldicr instruction for AND with an immediate if possible Emit clrrdi (extended mnemonic for rldicr) for AND-ing with masks that clear bits from the right hand size. Committing on behalf of Hiroshi Inoue. Differential Revision: https://reviews.llvm.org/D29388 llvm-svn: 296143	2017-02-24 18:03:16 +00:00
Sanjay Patel	832b1622d8	[DAGCombiner] add missing folds for scalar select of {-1,0,1} The motivation for filling out these select-of-constants cases goes back to D24480, where we discussed removing an IR fold from add(zext) --> select. And that goes back to: https://reviews.llvm.org/rL75531 https://reviews.llvm.org/rL159230 The idea is that we should always canonicalize patterns like this to a select-of-constants in IR because that's the smallest IR and the best for value tracking. Note that we currently do the opposite in some cases (like the cases in this patch). Ie, the proposed folds in this patch already exist in InstCombine today: https://github.com/llvm-mirror/llvm/blob/master/lib/Transforms/InstCombine/InstCombineSelect.cpp#L1151 As this patch shows, most targets generate better machine code for simple ext/add/not ops rather than a select of constants. So the follow-up steps to make this less of a patchwork of special-case folds and missing IR canonicalization: 1. Have DAGCombiner convert any select of constants into ext/add/not ops. 2 Have InstCombine canonicalize in the other direction (create more selects). Differential Revision: https://reviews.llvm.org/D30180 llvm-svn: 296137	2017-02-24 17:17:33 +00:00
Michael Kuperstein	581c9f4b20	Revert r269060 to pacify bots. llvm-svn: 296064	2017-02-24 01:22:19 +00:00
Michael Kuperstein	12e79d5002	[CGP] Split some critical edges coming out of indirect branches Splitting critical edges when one of the source edges is an indirectbr is hard in general (because it requires changing the memory the indirectbr reads). But if a block only has a single indirectbr predecessor (which is the common case), we can simulate splitting that edge by splitting the destination block, and retargeting the direct branches. This is motivated by the use of computed gotos in python 2.7: PyEval_EvalFrame() ends up using an indirect branch with ~100 successors, and passing a constant to each of those. Since MachineSink can't break indirect critical edges on demand (and doing this in MIR doesn't look feasible), this causes us to emit about ~100 defs of registers containing constants, which we in the predecessor block, where only one of those constants is used in each successor. So, at each computed goto, we needlessly spill about a 100 constants to stack. The end result is that a clang-compiled python interpreter can be about ~2.5x slower on a simple python reduction loop than a gcc-compiled interpreter. Differential Revision: https://reviews.llvm.org/D29916 llvm-svn: 296060	2017-02-24 00:56:21 +00:00
Bill Seurer	8e48f416ad	[DAGCombiner] revert r295336 r295336 causes a bootstrapped clang to fail for many compilations on powerpc BE. See http://lab.llvm.org:8011/builders/clang-ppc64be-linux-multistage/builds/2315 for example. Reverting as per the developer's request. llvm-svn: 295849	2017-02-22 16:27:33 +00:00
Sanjay Patel	f2a345c8ee	[PowerPC] add tests for select-of-constants; NFC llvm-svn: 295460	2017-02-17 16:43:43 +00:00
Artur Pilipenko	85d758299e	[DAGCombiner] Support {a\|s}ext, {a\|z\|s}ext load nodes in load combine Resubmit -r295314 with PowerPC and AMDGPU tests updated. Support {a\|s}ext, {a\|z\|s}ext load nodes as a part of load combine patters. Reviewed By: filcab Differential Revision: https://reviews.llvm.org/D29591 llvm-svn: 295336	2017-02-16 17:07:27 +00:00
Kyle Butt	7fbec9bdf1	Codegen: Make chains from trellis-shaped CFGs Lay out trellis-shaped CFGs optimally. A trellis of the shape below: A B \|\ /\| \| \ / \| \| X \| \| / \ \| \|/ \\| C D would be laid out A; B->C ; D by the current layout algorithm. Now we identify trellises and lay them out either A->C; B->D or A->D; B->C. This scales with an increasing number of predecessors. A trellis is a a group of 2 or more predecessor blocks that all have the same successors. because of this we can tail duplicate to extend existing trellises. As an example consider the following CFG: B D F H / \ / \ / \ / \ A---C---E---G---Ret Where A,C,E,G are all small (Currently 2 instructions). The CFG preserving layout is then A,B,C,D,E,F,G,H,Ret. The current code will copy C into B, E into D and G into F and yield the layout A,C,B(C),E,D(E),F(G),G,H,ret define void @straight_test(i32 %tag) { entry: br label %test1 test1: ; A %tagbit1 = and i32 %tag, 1 %tagbit1eq0 = icmp eq i32 %tagbit1, 0 br i1 %tagbit1eq0, label %test2, label %optional1 optional1: ; B call void @a() br label %test2 test2: ; C %tagbit2 = and i32 %tag, 2 %tagbit2eq0 = icmp eq i32 %tagbit2, 0 br i1 %tagbit2eq0, label %test3, label %optional2 optional2: ; D call void @b() br label %test3 test3: ; E %tagbit3 = and i32 %tag, 4 %tagbit3eq0 = icmp eq i32 %tagbit3, 0 br i1 %tagbit3eq0, label %test4, label %optional3 optional3: ; F call void @c() br label %test4 test4: ; G %tagbit4 = and i32 %tag, 8 %tagbit4eq0 = icmp eq i32 %tagbit4, 0 br i1 %tagbit4eq0, label %exit, label %optional4 optional4: ; H call void @d() br label %exit exit: ret void } here is the layout after D27742: straight_test: # @straight_test ; ... Prologue elided ; BB#0: # %entry ; A (merged with test1) ; ... More prologue elided mr 30, 3 andi. 3, 30, 1 bc 12, 1, .LBB0_2 ; BB#1: # %test2 ; C rlwinm. 3, 30, 0, 30, 30 beq 0, .LBB0_3 b .LBB0_4 .LBB0_2: # %optional1 ; B (copy of C) bl a nop rlwinm. 3, 30, 0, 30, 30 bne 0, .LBB0_4 .LBB0_3: # %test3 ; E rlwinm. 3, 30, 0, 29, 29 beq 0, .LBB0_5 b .LBB0_6 .LBB0_4: # %optional2 ; D (copy of E) bl b nop rlwinm. 3, 30, 0, 29, 29 bne 0, .LBB0_6 .LBB0_5: # %test4 ; G rlwinm. 3, 30, 0, 28, 28 beq 0, .LBB0_8 b .LBB0_7 .LBB0_6: # %optional3 ; F (copy of G) bl c nop rlwinm. 3, 30, 0, 28, 28 beq 0, .LBB0_8 .LBB0_7: # %optional4 ; H bl d nop .LBB0_8: # %exit ; Ret ld 30, 96(1) # 8-byte Folded Reload addi 1, 1, 112 ld 0, 16(1) mtlr 0 blr The tail-duplication has produced some benefit, but it has also produced a trellis which is not laid out optimally. With this patch, we improve the layouts of such trellises, and decrease the cost calculation for tail-duplication accordingly. This patch produces the layout A,C,E,G,B,D,F,H,Ret. This layout does have back edges, which is a negative, but it has a bigger compensating positive, which is that it handles the case where there are long strings of skipped blocks much better than the original layout. Both layouts handle runs of executed blocks equally well. Branch prediction also improves if there is any correlation between subsequent optional blocks. Here is the resulting concrete layout: straight_test: # @straight_test ; BB#0: # %entry ; A (merged with test1) mr 30, 3 andi. 3, 30, 1 bc 12, 1, .LBB0_4 ; BB#1: # %test2 ; C rlwinm. 3, 30, 0, 30, 30 bne 0, .LBB0_5 .LBB0_2: # %test3 ; E rlwinm. 3, 30, 0, 29, 29 bne 0, .LBB0_6 .LBB0_3: # %test4 ; G rlwinm. 3, 30, 0, 28, 28 bne 0, .LBB0_7 b .LBB0_8 .LBB0_4: # %optional1 ; B (Copy of C) bl a nop rlwinm. 3, 30, 0, 30, 30 beq 0, .LBB0_2 .LBB0_5: # %optional2 ; D (Copy of E) bl b nop rlwinm. 3, 30, 0, 29, 29 beq 0, .LBB0_3 .LBB0_6: # %optional3 ; F (Copy of G) bl c nop rlwinm. 3, 30, 0, 28, 28 beq 0, .LBB0_8 .LBB0_7: # %optional4 ; H bl d nop .LBB0_8: # %exit Differential Revision: https://reviews.llvm.org/D28522 llvm-svn: 295223	2017-02-15 19:49:14 +00:00
Amaury Sechet	9df26d330f	Fix typo in test filename. NFC llvm-svn: 294860	2017-02-11 17:48:49 +00:00
Amaury Sechet	887117fb3d	Add test case for pr31890. NFC llvm-svn: 294455	2017-02-08 14:35:48 +00:00
Nemanja Ivanovic	17aeb5a260	[PowerPC][Altivec] Add vnot extended mnemonic Adds the vnot extended mnemonic for the vnor instruction. Committing on behalf of brunoalr (Bruno Rosa). Differential Revision: https://reviews.llvm.org/D29225 llvm-svn: 294330	2017-02-07 18:57:29 +00:00
Sanne Wouda	57b63d6ade	[LLC] Add an inline assembly diagnostics handler. Summary: llc would hit a fatal error for errors in inline assembly. The diagnostics message is now printed. Reviewers: rengolin, MatzeB, javed.absar, anemet Reviewed By: anemet Subscribers: jyknight, nemanjai, llvm-commits Differential Revision: https://reviews.llvm.org/D29408 llvm-svn: 293999	2017-02-03 11:14:39 +00:00
Nirav Dave	93f9d5ce04	Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled." This reverts commit r293893 which is miscompiling lua on ARM and bootstrapping for x86-windows. llvm-svn: 293915	2017-02-02 18:24:55 +00:00
Nirav Dave	4442667fc5	In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. Recommiting after fixing X86 inc/dec chain bug. * Simplify Consecutive Merge Store Candidate Search Now that address aliasing is much less conservative, push through simplified store merging search and chain alias analysis which only checks for parallel stores through the chain subgraph. This is cleaner as the separation of non-interfering loads/stores from the store-merging logic. When merging stores search up the chain through a single load, and finds all possible stores by looking down from through a load and a TokenFactor to all stores visited. This improves the quality of the output SelectionDAG and the output Codegen (save perhaps for some ARM cases where we correctly constructs wider loads, but then promotes them to float operations which appear but requires more expensive constant generation). Some minor peephole optimizations to deal with improved SubDAG shapes (listed below) Additional Minor Changes: 1. Finishes removing unused AliasLoad code 2. Unifies the chain aggregation in the merged stores across code paths 3. Re-add the Store node to the worklist after calling SimplifyDemandedBits. 4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but seems sufficient to not cause regressions in tests. 5. Remove Chain dependencies of Memory operations on CopyfromReg nodes as these are captured by data dependence 6. Forward loads-store values through tokenfactors containing {CopyToReg,CopyFromReg} Values. 7. Peephole to convert buildvector of extract_vector_elt to extract_subvector if possible (see CodeGen/AArch64/store-merge.ll) 8. Store merging for the ARM target is restricted to 32-bit as some in some contexts invalid 64-bit operations are being generated. This can be removed once appropriate checks are added. This finishes the change Matt Arsenault started in r246307 and jyknight's original patch. Many tests required some changes as memory operations are now reorderable, improving load-store forwarding. One test in particular is worth noting: CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store forwarding converts a load-store pair into a parallel store and a memory-realized bitcast of the same value. However, because we lose the sharing of the explicit and implicit store values we must create another local store. A similar transformation happens before SelectionDAG as well. Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle llvm-svn: 293893	2017-02-02 14:39:42 +00:00
Kit Barton	d26978796e	[PowerPC] Fix sjlj pseduo instructions to use G8RC_NOX0 register class The the following instructions: - LD/LWZ (expanded from sjLj pseudo-instructions) - LXVL/LXVLL vector loads - STXVL/STXVLL vector stores all require G8RC_NO0X class registers for RA. Differential Revision: https://reviews.llvm.org/D29289 Committed for Lei Huang llvm-svn: 293769	2017-02-01 14:33:57 +00:00
Kyle Butt	b15c06677c	CodeGen: Allow small copyable blocks to "break" the CFG. When choosing the best successor for a block, ordinarily we would have preferred a block that preserves the CFG unless there is a strong probability the other direction. For small blocks that can be duplicated we now skip that requirement as well, subject to some simple frequency calculations. Differential Revision: https://reviews.llvm.org/D28583 llvm-svn: 293716	2017-01-31 23:48:32 +00:00
Nicolai Haehnle	8813d5d221	[DAGCombine] require UnsafeFPMath for re-association of addition Summary: The affected transforms all implicitly use associativity of addition, for which we usually require unsafe math to be enabled. The "Aggressive" flag is only meant to convey information about the performance of the fused ops relative to a fmul+fadd sequence. Fixes Bug 31626. Reviewers: spatel, hfinkel, mehdi_amini, arsenm, tstellarAMD Subscribers: jholewinski, nemanjai, wdng, llvm-commits Differential Revision: https://reviews.llvm.org/D28675 llvm-svn: 293635	2017-01-31 14:35:37 +00:00
Nemanja Ivanovic	2f2a6ab991	[PowerPC][Altivec] Add vmr extended mnemonic Just adds the vmr (Vector Move Register) mnemonic for the VOR instruction in the PPC back end. Committing on behalf of brunoalr (Bruno Rosa). Differential Revision: https://reviews.llvm.org/D29133 llvm-svn: 293626	2017-01-31 13:43:11 +00:00
Sean Fertile	3c8c385a77	[PPC] cleanup of mayLoad/mayStore flags and memory operands. 1) Explicitly sets mayLoad/mayStore property in the tablegen files on load/store instructions. 2) Updated the flags on a number of intrinsics indicating that they write memory. 3) Added SDNPMemOperand flags for some target dependent SDNodes so that they propagate their memory operand Review: https://reviews.llvm.org/D28818 llvm-svn: 293200	2017-01-26 18:59:15 +00:00
Nirav Dave	d32a421f75	Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled." This reverts commit r293184 which is failing in LTO builds llvm-svn: 293188	2017-01-26 16:46:13 +00:00
Nirav Dave	de6516c466	In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. * Simplify Consecutive Merge Store Candidate Search Now that address aliasing is much less conservative, push through simplified store merging search and chain alias analysis which only checks for parallel stores through the chain subgraph. This is cleaner as the separation of non-interfering loads/stores from the store-merging logic. When merging stores search up the chain through a single load, and finds all possible stores by looking down from through a load and a TokenFactor to all stores visited. This improves the quality of the output SelectionDAG and the output Codegen (save perhaps for some ARM cases where we correctly constructs wider loads, but then promotes them to float operations which appear but requires more expensive constant generation). Some minor peephole optimizations to deal with improved SubDAG shapes (listed below) Additional Minor Changes: 1. Finishes removing unused AliasLoad code 2. Unifies the chain aggregation in the merged stores across code paths 3. Re-add the Store node to the worklist after calling SimplifyDemandedBits. 4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is arbitrary, but seems sufficient to not cause regressions in tests. 5. Remove Chain dependencies of Memory operations on CopyfromReg nodes as these are captured by data dependence 6. Forward loads-store values through tokenfactors containing {CopyToReg,CopyFromReg} Values. 7. Peephole to convert buildvector of extract_vector_elt to extract_subvector if possible (see CodeGen/AArch64/store-merge.ll) 8. Store merging for the ARM target is restricted to 32-bit as some in some contexts invalid 64-bit operations are being generated. This can be removed once appropriate checks are added. This finishes the change Matt Arsenault started in r246307 and jyknight's original patch. Many tests required some changes as memory operations are now reorderable, improving load-store forwarding. One test in particular is worth noting: CodeGen/PowerPC/ppc64-align-long-double.ll - Improved load-store forwarding converts a load-store pair into a parallel store and a memory-realized bitcast of the same value. However, because we lose the sharing of the explicit and implicit store values we must create another local store. A similar transformation happens before SelectionDAG as well. Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle llvm-svn: 293184	2017-01-26 16:02:24 +00:00
Rafael Espindola	82149a1aa9	Use shouldAssumeDSOLocal in classifyGlobalReference. And teach shouldAssumeDSOLocal that ppc has no copy relocations. The resulting code handle a few more case than before. For example, it knows that a weak symbol can be resolved to another .o file, but it will still be in the main executable. llvm-svn: 293180	2017-01-26 15:02:31 +00:00
Tim Shen	fd1e5aa8df	[APFloat] Switch from (PPCDoubleDoubleImpl, IEEEdouble) layout to (IEEEdouble, IEEEdouble) Summary: This patch changes the layout of DoubleAPFloat, and adjust all operations to do either: 1) (IEEEdouble, IEEEdouble) -> (uint64_t, uint64_t) -> PPCDoubleDoubleImpl, then run the old algorithm. 2) Do the right thing directly. 1) includes multiply, divide, remainder, mod, fusedMultiplyAdd, roundToIntegral, convertFromString, next, convertToInteger, convertFromAPInt, convertFromSignExtendedInteger, convertFromZeroExtendedInteger, convertToHexString, toString, getExactInverse. 2) includes makeZero, makeLargest, makeSmallest, makeSmallestNormalized, compare, bitwiseIsEqual, bitcastToAPInt, isDenormal, isSmallest, isLargest, isInteger, ilogb, scalbn, frexp, hash_value, Profile. I could split this into two patches, e.g. use 1) for all operatoins first, then incrementally change some of them to 2). I didn't do that, because 1) involves code that converts data between PPCDoubleDoubleImpl and (IEEEdouble, IEEEdouble) back and forth, and may pessimize the compiler. Instead, I find easy functions and use approach 2) for them directly. Next step is to implement move multiply and divide from 1) to 2). I don't have plans for other functions in 1). Differential Revision: https://reviews.llvm.org/D27872 llvm-svn: 292839	2017-01-23 22:39:35 +00:00
Benjamin Kramer	db9e0b659d	Fix some broken CHECK lines. The colon is important. llvm-svn: 292761	2017-01-22 20:28:56 +00:00
Tony Jiang	8e8c444d3d	[PowerPC] Expand ISEL instruction into if-then-else sequence. Generally, the ISEL is expanded into if-then-else sequence, in some cases (like when the destination register is the same with the true or false value register), it may just be expanded into just the if or else sequence. llvm-svn: 292154	2017-01-16 20:12:26 +00:00

1 2 3 4 5 ...

1514 Commits