llvm-project

Commit Graph

Author	SHA1	Message	Date
Daniel Sanders	7813ae879e	Replace string GNU Triples with llvm::Triple in MCAsmInfo subclasses and create*AsmInfo(). NFC. Summary: This is the first of several patches to eliminate StringRef forms of GNU triples from the internals of LLVM. After this is complete, GNU triples will be replaced by a more authoritative representation in the form of an LLVM TargetTuple. Reviewers: rengolin Reviewed By: rengolin Subscribers: ted, llvm-commits, rengolin, jholewinski Differential Revision: http://reviews.llvm.org/D10236 llvm-svn: 239036	2015-06-04 13:12:25 +00:00
Elena Demikhovsky	2f1a0dabd0	AVX-512: I brought back vector-shuffle-512-v8.ll test. I re-generated it after all AVX-512 shuffle optimizations. llvm-svn: 239026	2015-06-04 07:49:56 +00:00
Elena Demikhovsky	4078c75bd4	AVX-512: added all SKX forms of VPERMW/D/Q instructions. Added all forms of VPERMPS/PD instructions. Added encoding tests. llvm-svn: 239016	2015-06-04 07:07:13 +00:00
Elena Demikhovsky	214335d703	Removed {}, NFC. llvm-svn: 239014	2015-06-04 07:01:29 +00:00
Sanjay Patel	667a7e2a0f	make reciprocal estimate code generation more flexible by adding command-line options (3rd try) The first try (r238051) to land this was reverted due to ExecutionEngine build failure; that was hopefully addressed by r238788. The second try (r238842) to land this was reverted due to BUILD_SHARED_LIBS failure; that was hopefully addressed by r238953. This patch adds a TargetRecip class for processing many recip codegen possibilities. The class is intended to handle both command-line options to llc as well as options passed in from a front-end such as clang with the -mrecip option. The x86 backend is updated to use the new functionality. Only -mcpu=btver2 with -ffast-math should see a functional change from this patch. All other x86 CPUs continue to not use reciprocal estimates by default with -ffast-math. Differential Revision: http://reviews.llvm.org/D8982 llvm-svn: 239001	2015-06-04 01:32:35 +00:00
Asaf Badouh	402ebb34af	re-apply 238809 AVX-512: Implemented GETEXP instruction for KNL and SKX Added rounding mode modifier for SQRTPS/PD Added tests for encoding and intrinsics. CR: http://reviews.llvm.org/D9991 llvm-svn: 238923	2015-06-03 13:41:48 +00:00
Elena Demikhovsky	86224fe468	AVX-512: More code improvements in shuffles, NFC llvm-svn: 238919	2015-06-03 12:05:03 +00:00
Elena Demikhovsky	21de893377	AVX-512: VSHUFPD instruction selection - code improvements llvm-svn: 238918	2015-06-03 11:21:01 +00:00
Elena Demikhovsky	9e38086534	AVX-512: Implemented SHUFF32x4/SHUFF64x2/SHUFI32x4/SHUFI64x2 instructions for SKX and KNL. Added tests for encoding. By Igor Breger (igor.breger@intel.com) llvm-svn: 238917	2015-06-03 10:56:40 +00:00
Elena Demikhovsky	f7e641cc2d	X86: Added MPX feature and bound registers. Intel® Memory Protection Extensions (Intel® MPX) is a new feature in Skylake. It is a part of KNL and SKX sets. It is also a part of Skylake client. I added definition of %bnd0 - %bnd3 registers, each register is a pair of 64-bit integers. llvm-svn: 238916	2015-06-03 10:30:57 +00:00
Simon Pilgrim	452252e6c8	[X86] Removed (unused) FSRL x86 operation This patch removes the old X86ISD::FSRL op - which allowed float vectors to use the byte right shift operations (causing a domain switch....). Since the refactoring of the shuffle lowering code this no longer has any use. Differential Revision: http://reviews.llvm.org/D10169 llvm-svn: 238906	2015-06-03 08:32:36 +00:00
Rafael Espindola	cf8beece97	Revert "make reciprocal estimate code generation more flexible by adding command-line options (2nd try)" This reverts commit r238842. It broke -DBUILD_SHARED_LIBS=ON build. llvm-svn: 238900	2015-06-03 05:32:44 +00:00
Rafael Espindola	9aa3ab30a9	Avoid a call to getOrCreateSymbol when we already have the symbol. llvm-svn: 238890	2015-06-03 00:02:40 +00:00
Sanjay Patel	6f031d848e	make reciprocal estimate code generation more flexible by adding command-line options (2nd try) The first try (r238051) to land this was reverted due to bot failures that were hopefully addressed by r238788. This patch adds a TargetRecip class for processing many recip codegen possibilities. The class is intended to handle both command-line options to llc as well as options passed in from a front-end such as clang with the -mrecip option. The x86 backend is updated to use the new functionality. Only -mcpu=btver2 with -ffast-math should see a functional change from this patch. All other x86 CPUs continue to not use reciprocal estimates by default with -ffast-math. Differential Revision: http://reviews.llvm.org/D8982 llvm-svn: 238842	2015-06-02 15:28:15 +00:00
Elena Demikhovsky	8938f5acca	AVX-512: Implemented VRANGESD and VRANGESS instructions for SKX Implemented DAG lowering for all these forms. Added tests for encoding. By Igor Breger (igor.breger@intel.com) llvm-svn: 238834	2015-06-02 14:12:54 +00:00
Elena Demikhovsky	44a129c533	AVX-512: Shorten implementation of lowerV16X32VectorShuffle() using lowerVectorShuffleWithSHUFPS() and other shuffle-helpers routines. Added matching of VALIGN instruction. llvm-svn: 238830	2015-06-02 13:43:18 +00:00
Elena Demikhovsky	3425c932da	AVX-512: Implemented VFIXUPIMMSD and VFIXUPIMMSS instructions for KNL Implemented DAG lowering for all these forms. Added tests for encoding. By Igor Breger (igor.breger@intel.com) llvm-svn: 238811	2015-06-02 08:28:57 +00:00
Asaf Badouh	8d897dd05f	revert 238809 llvm-svn: 238810	2015-06-02 07:45:19 +00:00
Asaf Badouh	17de10f37e	AVX-512: Implemented GETEXP instruction for KNL and SKX Added rounding mode modifier for SQRTPS/PD Added tests for encoding and intrinsics. llvm-svn: 238809	2015-06-02 07:18:14 +00:00
Elena Demikhovsky	67afb630e1	AVX-512: Optimized vector shuffle for v16f32 and v16i32 types. llvm-svn: 238743	2015-06-01 13:26:18 +00:00
Elena Demikhovsky	3582eb3b39	AVX-512: Implemented VRANGEPD and VRANGEPD instructions for SKX. Implemented DAG lowering for all these forms. Added tests for encoding. By Igor Breger (igor.breger@intel.com) llvm-svn: 238738	2015-06-01 11:05:34 +00:00
Elena Demikhovsky	0c41088ebf	AVX-512: Implemented vector shuffle lowering for v8i64 and v8f64 types. I removed the vector-shuffle-512-v8.ll, it is auto-generated test, not valid any more. llvm-svn: 238735	2015-06-01 09:49:53 +00:00
Elena Demikhovsky	75ede68793	AVX-512: added all forms of VPSHUFD and VPSHUFHW, VPSHUFLW including encodings. llvm-svn: 238729	2015-06-01 07:17:23 +00:00
Elena Demikhovsky	42c96d9c0a	AVX-512: Implemented VFIXUPIMMPD and VFIXUPIMMPS instructions for KNL and SKX Implemented DAG lowering for all these forms. Added tests for encoding. by Igor Breger (igor.breger@intel.com) llvm-svn: 238728	2015-06-01 06:50:49 +00:00
Elena Demikhovsky	dd68d0cb0f	AVX-512: Fixed a bug in compress and expand intrinsics. By Igor Breger (igor.breger@intel.com) llvm-svn: 238724	2015-06-01 06:30:13 +00:00
Matt Arsenault	bd7d80a4a6	Add address space argument to isLegalAddressingMode This is important because of different addressing modes depending on the address space for GPU targets. This only adds the argument, and does not update any of the uses to provide the correct address space. llvm-svn: 238723	2015-06-01 05:31:59 +00:00
Rafael Espindola	5eb02e45e3	Simplify another function that doesn't fail. llvm-svn: 238703	2015-06-01 00:27:26 +00:00
Simon Pilgrim	f19ef9f741	Stripped trailing whitespace. NFC. llvm-svn: 238654	2015-05-30 13:01:42 +00:00
Chandler Carruth	cb58910ce8	[x86] Unify the horizontal adding used for popcount lowering taking the best approach of each. For vNi16, we use SHL + ADD + SRL pattern that seems easily the best. For vNi32, we use the PUNPCK + PSADBW + PACKUSWB pattern. In some cases there is a huge improvement with this in IACA's estimated throughput -- over 2x higher throughput!!!! -- but the measurements are too good to be true. In one narrow case, the SHL + ADD + SHL + ADD + SRL pattern looks slightly faster, but I'm not sure I believe any of the measurements at this point. Both are the exact same uops though. Hard to be confident of anything past that. If anyone wants to collect very detailed (Agner-level) timings with the result of this patch, or with the i32 case replaced with SHL + ADD + SHL + ADD + SRL, I'd be very interested. Note that you'll need to test it on both Ivybridge and Haswell, with both SSE3, SSSE3, and AVX selected as I saw unique behavior in each of these buckets with IACA all of which should be checked against measured performance. But this patch is still a useful improvement by dropping duplicate work and getting the much nicer PSADBW lowering for v2i64. I'd still like to rephrase this in terms of generic horizontal sum. It's a bit lame to have a special case of that just for popcount. llvm-svn: 238652	2015-05-30 10:35:03 +00:00
Chandler Carruth	11e6f8fed1	[x86] Split out the horizontal byte sum lowering component of the LUT lowering into a helper function. NFC. llvm-svn: 238650	2015-05-30 09:46:16 +00:00
Chandler Carruth	9cc2516676	[x86] Replace the long spelling of getting a bitcast with the much shorter one. NFC. In addition to being much shorter to type and requiring fewer arguments, this change saves over 30 lines from this one file, all wasted on total boilerplate... llvm-svn: 238640	2015-05-30 04:23:13 +00:00
Chandler Carruth	060cdca996	[x86] Replace the long spelling of getting a bitcast with the new short spelling. NFC. llvm-svn: 238639	2015-05-30 04:19:57 +00:00
Chandler Carruth	502b23a7a9	[sdag] Add the helper I most want to the DAG -- building a bitcast around a value using its existing SDLoc. Start using this in just one function to save omg lines of code. llvm-svn: 238638	2015-05-30 04:14:10 +00:00
Chandler Carruth	2599da3cfd	[x86] Restore the bitcasts I removed when refactoring this to avoid shifting vectors of bytes as x86 doesn't have direct support for that. This removes a bunch of redundant masking in the generated code for SSE2 and SSE3. In order to avoid the really significant code size growth this would have triggered, I also factored the completely repetitive logic for shifting and masking into two lambdas which in turn makes all of this much easier to read IMO. llvm-svn: 238637	2015-05-30 04:05:11 +00:00
Chandler Carruth	6ba9730a4e	[x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique. Summary: A description of this technique can be found here: http://wm.ite.pl/articles/sse-popcount.html The core of the idea is to use an in-register lookup table and the PSHUFB instruction to compute the population count for the low and high nibbles of each byte, and then to use horizontal sums to aggregate these into vector population counts with wider element types. On x86 there is an instruction that will directly compute the horizontal sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily. Various tricks are used to get vNi32 and vNi16 from the vNi8 that the LUT computes. The base implementation of this, and most of the work, was done by Bruno in a follow up to D6531. See Bruno's detailed post there for lots of timing information about these changes. I have extended Bruno's patch in the following ways: 0) I committed the new tests with baseline sequences so this shows a diff, and regenerated the tests using the update scripts. 1) Bruno had noticed and mentioned in IRC a redundant mask that I removed. 2) I introduced a particular optimization for the i32 vector cases where we use PSHL + PSADBW to compute the low i32 popcounts, and PSHUFD + PSADBW to compute doubled high i32 popcounts. This takes advantage of the fact that to line up the high i32 popcounts we have to shift them anyways, and we can shift them by one fewer bit to effectively divide the count by two. While the PSHUFD based horizontal add is no faster, it doesn't require registers or load traffic the way a mask would, and provides more ILP as it happens on different ports with high throughput. 3) I did some code cleanups throughout to simplify the implementation logic. 4) I refactored it to continue to use the parallel bitmath lowering when SSSE3 is not available to preserve the performance of that version on SSE2 targets where it is still much better than scalarizing as we'll still do a bitmath implementation of popcount even in scalar code there. With #1 and #2 above, I analyzed the result in IACA for sandybridge, ivybridge, and haswell. In every case I measured, the throughput is the same or better using the LUT lowering, even v2i64 and v4i64, and even compared with using the native popcnt instruction! The latency of the LUT lowering is often higher than the latency of the scalarized popcnt instruction sequence, but I think those latency measurements are deeply misleading. Keeping the operation fully in the vector unit and having many chances for increased throughput seems much more likely to win. With this, we can lower every integer vector popcount implementation using the LUT strategy if we have SSSE3 or better (and thus have PSHUFB). I've updated the operation lowering to reflect this. This also fixes an issue where we were scalarizing horribly some AVX lowerings. Finally, there are some remaining cleanups. There is duplication between the two techniques in how they perform the horizontal sum once the byte population count is computed. I'm going to factor and merge those two in a separate follow-up commit. Differential Revision: http://reviews.llvm.org/D10084 llvm-svn: 238636	2015-05-30 03:20:59 +00:00
Chandler Carruth	c2e400de83	[x86] Restructure the parallel bitmath lowering of popcount into a separate routine, generalize it to work for all the integer vector sizes, and do general code cleanups. This dramatically improves lowerings of byte and short element vector popcount, but more importantly it will make the introduction of the LUT-approach much cleaner. The biggest cleanup I've done is to just force the legalizer to do the bitcasting we need. We run these iteratively now and it makes the code much simpler IMO. Other changes were minor, and mostly naming and splitting things up in a way that makes it more clear what is going on. The other significant change is to use a different final horizontal sum approach. This is the same number of instructions as the old method, but shifts left instead of right so that we can clear everything but the final sum with a single shift right. This seems likely better than a mask which will usually have to read the mask from memory. It is certainly fewer u-ops. Also, this will be temporary. This and the LUT approach share the need of horizontal adds to finish the computation, and we have more clever approaches than this one that I'll switch over to. llvm-svn: 238635	2015-05-30 03:20:55 +00:00
Jim Grosbach	13760bd152	MC: Clean up MCExpr naming. NFC. llvm-svn: 238634	2015-05-30 01:25:56 +00:00
Reid Kleckner	e6531a5588	[WinEH] Adjust the 32-bit SEH prologue to better match reality It turns out that _except_handler3 and _except_handler4 really use the same stack allocation layout, at least today. They just make different choices about encoding the LSDA. This is in preparation for lowering the llvm.eh.exceptioninfo(). llvm-svn: 238627	2015-05-29 22:57:46 +00:00
Reid Kleckner	173a72524f	Disable FP elimination in funcs using 32-bit MSVC EH personalities The value in 'ebp' acts as an implicit argument to the outlined handlers, and is recovered with frameaddress(1). llvm-svn: 238619	2015-05-29 21:58:11 +00:00
Rafael Espindola	4d37b2a259	Remove getData. This completes the mechanical part of merging MCSymbol and MCSymbolData. llvm-svn: 238617	2015-05-29 21:45:01 +00:00
Reid Kleckner	5b8ebfbc25	Only add the EH state insertion pass on 32-bit Windows llvm-svn: 238612	2015-05-29 20:43:10 +00:00
Rafael Espindola	beb6060a51	Remove the MCSymbolData typedef. The getData member function is next. llvm-svn: 238611	2015-05-29 20:41:47 +00:00
Reid Kleckner	1d3d4adbb9	[WinEH] Emit EH tables for __CxxFrameHandler3 on 32-bit x86 Small (really small!) C++ exception handling examples work on 32-bit x86 now. This change disables the use of .seh_* directives in WinException when CFI is not in use. It also uses absolute symbol references in the tables instead of imagerel32 relocations. Also fixes a cache invalidation bug in MMI personality classification. llvm-svn: 238575	2015-05-29 17:00:57 +00:00
Matthias Braun	e41e146c16	CodeGen: Use mop_iterator instead of MIOperands/ConstMIOperands MIOperands/ConstMIOperands are classes iterating over the MachineOperand of a MachineInstr, however MachineInstr::mop_iterator does the same thing. I assume these two iterators exist to have a uniform interface to iterate over the operands of a machine instruction bundle and a single machine instruction. However in practice I find it more confusing to have 2 different iterator classes, so this patch transforms (nearly all) the code to use mop_iterators. The only exception being MIOperands::analyzePhysReg() and MIOperands::analyzeVirtReg() still needing an equivalent, I leave that as an exercise for the next patch. Differential Revision: http://reviews.llvm.org/D9932 This version is slightly modified from the proposed revision in that it introduces MachineInstr::getOperandNo to avoid the extra counting variable in the few loops that previously used MIOperands::getOperandNo. llvm-svn: 238539	2015-05-29 02:56:46 +00:00
Reid Kleckner	fe4d491bd9	[WinEH] Start inserting state number stores for C++ EH This moves all the state numbering code for C++ EH to WinEHPrepare so that we can call it from the X86 state numbering IR pass that runs before isel. Now we just call the same state numbering machinery and insert a bunch of stores. It also populates MachineModuleInfo with information about the current function. llvm-svn: 238514	2015-05-28 22:00:24 +00:00
Rafael Espindola	3a5d3cce80	Remove a trivial forwarding function. NFC. llvm-svn: 238506	2015-05-28 21:36:02 +00:00
Reid Kleckner	bfcad2f181	Remove debug prints from r238487 llvm-svn: 238501	2015-05-28 21:23:53 +00:00
Reid Kleckner	80956a0142	Disable x86 tail call optimizations that jump through GOT For x86 targets, do not do sibling call optimization when materializing the callee's address would require a GOT relocation. We can still do tail calls to internal functions, hidden functions, and protected functions, because they do not require this kind of relocation. It is still possible to get GOT relocations when the user explicitly asks for it with musttail or -tailcallopt, both of which are supposed to guarantee TCO. Based on a patch by Chih-hung Hsieh. Reviewers: srhines, timmurray, danalbert, enh, void, nadav, rnk Subscribers: joerg, davidxl, llvm-commits Differential Revision: http://reviews.llvm.org/D9799 llvm-svn: 238487	2015-05-28 20:44:28 +00:00
Reid Kleckner	e2e57faa7d	[WinEH] Remove debugging dump() call llvm-svn: 238472	2015-05-28 20:02:05 +00:00
Elena Demikhovsky	86c7b46680	AVX-512: Fixed a bug in extracting subvector from v64i1 By Igor Breger (igor.breger@intel.com) llvm-svn: 238322	2015-05-27 14:09:33 +00:00
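
Note on commit 6ba9730a4e above: the nibble-LUT popcount it describes can be illustrated with a small scalar emulation. This is only a sketch under stated assumptions, not the LLVM lowering itself. The function names are hypothetical, plain Python lists stand in for vector registers, the table lookups stand in for PSHUFB, and the per-half sum stands in for the horizontal byte sum the message attributes to PSADBW.

```python
# Scalar emulation of the nibble-LUT vector popcount; the function names are
# hypothetical and do not correspond to any LLVM code.

# Popcount of every 4-bit value 0..15; this plays the role of the
# in-register lookup table.
NIBBLE_POPCOUNT = [bin(i).count("1") for i in range(16)]


def byte_popcounts(vec):
    """Per-byte popcount of a 16-byte vector via two table lookups per byte."""
    assert len(vec) == 16
    lo = [NIBBLE_POPCOUNT[b & 0x0F] for b in vec]  # lookup on the low nibbles
    hi = [NIBBLE_POPCOUNT[b >> 4] for b in vec]    # lookup on the high nibbles
    return [l + h for l, h in zip(lo, hi)]         # byte-wise add


def popcount_v2i64(vec):
    """Sum the byte counts of the low 8 and high 8 bytes (horizontal sum)."""
    counts = byte_popcounts(vec)
    return [sum(counts[:8]), sum(counts[8:])]


vec = bytes([0xFF] * 8 + [0x01] * 8)
print(popcount_v2i64(vec))  # prints [64, 8]
```

The narrower vNi32 and vNi16 results mentioned in the commit need extra shuffling or shifting of the byte counts, which this sketch does not attempt.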


11726 Commits