llvm-project

Commit Graph

Author	SHA1	Message	Date
Craig Topper	a21c557955	[RISCV] Remove Zbproposedc extension This consists of 3 compressed instructions, c.not, c.neg, and c.zext.w. I believe these have been picked up by the Zce effort using different encodings. I don't think it makes sense to keep them in bitmanip. It will eventually cause a conflict if/when Zce is implemented in llvm. Differential Revision: https://reviews.llvm.org/D110871	2021-09-30 14:23:05 -07:00
Albion Fung	4195ed9959	[PowerPC] Improved codegen related to xscvdpsxws/xscvdpuxws This patch removes the unnecessary mf/mtvsr generated in conjunction with xscvdpsxws/xscvdpuxws. Differential revision: https://reviews.llvm.org/D109902	2021-09-30 14:31:00 -05:00
Stanislav Mekhanoshin	244aa7f735	[AMDGPU] move hasAGPRs/hasVGPRs into header It is now very simple and can go right into the header allowing optimizer to combine callers, such as isVGPRClass and similar. It does not need anything from the TRI itself anymore, so make it static class member along with the callers. Differential Revision: https://reviews.llvm.org/D110762	2021-09-30 10:02:02 -07:00
Kazu Hirata	f631173d80	[llvm] Migrate from arg_operands to args (NFC) Note that arg_operands is considered a legacy name. See llvm/include/llvm/IR/InstrTypes.h for details.	2021-09-30 08:51:21 -07:00
David Green	f9aa8623fe	[ARM] Add more MVE intrinsics to sink splats to This adds a few more unpredicated intrinsics to sink splats to, in order to create more qr instruction variants. Notably this includes saddsat/uaddsat but also some of the unpredicated mve intrinsics. Differential Revision: https://reviews.llvm.org/D110333	2021-09-30 14:41:23 +01:00
Jingu Kang	13f3c39f36	Second Recommit "[AArch64] Split bitmask immediate of bitwise AND operation" This reverts the revert commit `c07f709969` with bug fixes. Differential Revision: https://reviews.llvm.org/D109963	2021-09-30 09:27:08 +01:00
Ruiling Song	52785989e9	AMDGPU: Broadcast scalar boolean to vector boolean explicitly This is used to fix wrong code generation of s_add_co_select_user in test/CodeGen/AMDGPU/expand-scalar-carry-out-select-user.ll s_addc_u32 s4, s6, 0 s_cselect_b64 vcc, 1, 0 <-- vcc set as 0x1 if SCC==1 v_mov_b32_e32 v1, s4 s_cmp_gt_u32 s6, 31 v_cndmask_b32_e32 v1, 0, v1, vcc If the s_addc_u32 set SCC, then we will get value 0x1 in VCC. The v_cndmask will do per thread selection with VCC as condition register. As VCC only gets the first bit being set, only the first thread/lane in destination register can get correct result if the very first lane is active. In fact, we should broadcast the value to all active lanes of the final register. The idea here is doing this broadcast to vector boolean explicitly instead of lowering it into a COPY from SCC which would be interpreted as selecting between 0/1. This is used to replace D109754. Reviewed-by: foad, alex-t Differential Revision: https://reviews.llvm.org/D109889	2021-09-30 10:15:01 +08:00
Amara Emerson	1c0e8a98e4	[AArch64][GlobalISel] Widen G_BUILD_VECTOR source & dest element types to s8.	2021-09-29 15:11:30 -07:00
Ricky Taylor	e1e3b6ee72	[M68k] Avoid UB in disassembler When reading 32 bits a 32-bit shift would be executed. This is undefined behaviour, but in this case we can just replace the entire scratch value to avoid it. Differential Revision: https://reviews.llvm.org/D110769	2021-09-29 22:07:14 +01:00
Stefan Pintilie	fb4e44c4e7	[PowerPC] The builtins load8r and store8r are Power 7 plus. This patch makes sure that the builtins __builtin_ppc_load8r and __builtin_ppc_store8r are only available for Power 7 and up. Currently the builtins seem to produce incorrect code if used for Power 6 or before. Reviewed By: nemanjai, #powerpc Differential Revision: https://reviews.llvm.org/D110653	2021-09-29 14:34:40 -05:00
Roman Lebedev	2d42a192e0	[X86][Costmodel] Load/store i8 Stride=2 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/xz6x7c35P - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=2.5` So pick cost of `6`. For store we have: https://godbolt.org/z/xz6x7c35P - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `4`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110709	2021-09-29 21:52:45 +03:00
Roman Lebedev	bac60c55e0	[X86][Costmodel] Load/store i8 Stride=2 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/a9hv4z47v - for intels `Block RThroughput: =4.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `4`. For store we have: https://godbolt.org/z/6GfPn1b79 - for intels `Block RThroughput: =3.0`; for ryzens, `Block RThroughput: <=2.0` So pick cost of `3`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110708	2021-09-29 21:52:45 +03:00
Roman Lebedev	1962185671	[X86][Costmodel] Load/store i8 Stride=2 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 Identical to VF=2. For load we have: https://godbolt.org/z/4TEbdzbMM - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/MYfzGPf3Y - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110705	2021-09-29 21:52:45 +03:00
Roman Lebedev	08face1f9a	[X86][Costmodel] Load/store i8 Stride=2 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 Identical to VF=2. For load we have: https://godbolt.org/z/sGE41GYo7 - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/ba5r3s9xa - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110704	2021-09-29 21:52:45 +03:00
Roman Lebedev	7d52628eb0	[X86][Costmodel] Load/store i8 Stride=2 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/caKqjr9hb - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. For store we have: https://godbolt.org/z/6TTn3eKj8 - for intels `Block RThroughput: =1.0`; for ryzens, `Block RThroughput: <=0.5` So pick cost of `1`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110702	2021-09-29 21:52:44 +03:00
Jay Foad	f9b68304a2	[AMDGPU] Enable machine verification after AMDGPUISelDAGToDAG This was introduced in D32628 but it does not seem to be required any more. At least it does not show any problems in check-llvm in an LLVM_ENABLE_EXPENSIVE_CHECKS build. Differential Revision: https://reviews.llvm.org/D110692	2021-09-29 18:47:19 +01:00
Kazu Hirata	9a640a1cb8	[AArch64] Remove redundant declaration createAArch64ObjectTargetStreamer (NFC) Note that createAArch64ObjectTargetStreamer is declared in AArch64TargetStreamer.h and defined in AArch64TargetStreamer.cpp. Identified with readability-redundant-declaration.	2021-09-29 09:08:41 -07:00
David Green	e9adcbde31	[AArch64] Model Cortex-A55 Q register NEON instructions Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction requires taking both units (and can only be issued as the first instruction in a dual issue pair). This patch models that by splitting the WriteV SchedWrite into two - the WriteVd that reads/writes only 64bit operands, and the WriteVq that read/writes 128bit registers. The A55 schedule then uses this distinction to model the WriteVq as taking both resource units, and starting a Schedule Group and WriteVd as taking one as before. I believe this is more correct, even if it does not lead to much better performance. Differential Revision: https://reviews.llvm.org/D108766	2021-09-29 16:55:31 +01:00
Jay Foad	9886f21bc1	[MSP430] Recognize Bi as an indirect branch in analyzeBranch. NFC. Recognize Bi as an unconditional branch, just like JMP. This allows machine verification to run after MSP430BranchSelector without failing this assertion: virtual bool llvm::MSP430InstrInfo::analyzeBranch(llvm::MachineBasicBlock &, llvm::MachineBasicBlock &, llvm::MachineBasicBlock &, SmallVectorImpl<llvm::MachineOperand> &, bool) const: Assertion `I->getOpcode() == MSP430::JCC && "Invalid conditional branch"' failed. Note that machine verification is currently disabled after addPreEmitPass passes because of problems on other targets, so this is currently NFC. Differential Revision: https://reviews.llvm.org/D110691	2021-09-29 16:43:11 +01:00
Simon Pilgrim	676f2809b5	[CostModel][AArch64] Don't dereference CostTblEntry before null check. Fix static analysis warning that we check for null Entry after dereferencing it. I don't think this can actually happen as i8/i16 should legalize to use the i32 path which should return a cost - but I'd rather play it safe than rely on an implicit type legalization.	2021-09-29 16:35:29 +01:00
David Green	8a645fc44b	[AArch64] Enable type promotion for AArch64 This enables the type promotion pass for AArch64, which acts as a CodeGenPrepare pass to promote illegal integers to legal ones, especially useful for removing extends that would otherwise require cross-basic-block analysis. I have enabled this generally, for both ISel and GlobalISel. In some quick experiments it appeared to help GlobalISel remove extra extends in places too, but that might just be missing optimizations that are better left for later. We can disable it again if required. In my experiments, this can improve performance in some cases, and codesize was a small improvement. SPEC was a very small improvement, within the noise. Some of the test cases show extends being moved out of loops, often when the extend would be part of a cmp operand, but that should reduce the latency of the instruction in the loop on many cpus. The signed-truncation-check tests are increasing as they are no longer matching specific DAG combines. We also hope to add some additional improvements to the pass in the near future, to capture more cases of promoting extends through phis that have come up in a few places lately. Differential Revision: https://reviews.llvm.org/D110239	2021-09-29 15:13:12 +01:00
Nemanja Ivanovic	09b67aa1c3	[PowerPC] Implement builtin for vbpermd The instruction has similar semantics to vbpermq but for doublewords. It was added in Power9 and the ABI documents the builtin. Differential revision: https://reviews.llvm.org/D107899	2021-09-29 06:34:31 -05:00
Sander de Smalen	87bcbd61b5	[AArch64][SVE] Fix extract_subvector patterns for unpacked fp types. The patterns added in D110163 were incorrect, since it used the wrong element widths for its shuffles. Example for nxv2f16 extract_subvector(nxv8f16 %in, 6): <a|b|c|d|e|f|g|h> ^^^ extract g and h. => UUNPKHI .h -> .s results in: <e |f |g |h > => UUNPKHI .s -> .d results in: <g |h > Reviewed By: david-arm Differential Revision: https://reviews.llvm.org/D110523	2021-09-29 11:14:49 +01:00
Martin Storsjö	d6216e2cd1	[X86] Fix handling of i128<->fp on Windows On Windows, i128 arguments are passed as indirect arguments, and they are returned in xmm0. This is mostly fixed up by `WinX86_64ABIInfo::classify` in Clang, making the IR functions return v2i64 instead of i128, and making the arguments indirect. However for cases where libcalls are generated in the target lowering, the lowering uses the default x86_64 calling convention for i128, where they are passed/returned as a register pair. Add custom lowering logic, similar to the existing logic for i128 div/mod (added in `4a406d32e9`), manually making the libcall (while overriding the return type to v2i64 or passing the arguments as pointers to arguments on the stack). X86CallingConv.td doesn't seem to handle i128 at all, otherwise the windows specific behaviours would ideally be implemented as overrides there, in generic code, handling these cases automatically. This fixes https://bugs.llvm.org/show_bug.cgi?id=48940. Differential Revision: https://reviews.llvm.org/D110413	2021-09-29 13:05:59 +03:00
Amara Emerson	e6ed880e47	[AArch64][GlobalISel] Make some vector G_SMULH/G_UMULH legal.	2021-09-29 02:35:29 -07:00
hsmahesha	c0735cb9f1	[AMDGPU] Do not internalize ASan device library functions. ASan device library functions (those that start with the prefix __asan_) are at the moment undergoing undesired optimizations due to internalization. Hence, in order to avoid such undesired optimizations on ASan device library functions, do not internalize them in the first place. Reviewed By: yaxunl Differential Revision: https://reviews.llvm.org/D110468	2021-09-29 07:19:02 +05:30
Sterling Augustine	c07f709969	Revert "Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"" This reverts commit `73a196a11c`. Causes crashes as reported in https://reviews.llvm.org/D109963	2021-09-28 18:02:06 -07:00
Jessica Paquette	241c7b1473	[AArch64][GlobalISel] Run overlapping_and after legalization When we have code with truncates, those truncates may be changed into G_ANDs with constants. These may, in turn, feed into other G_AND instructions. Running this combine post-legalize allows us to optimize examples like this one: https://godbolt.org/z/zrGY4dfEW SDAG currently optimizes the example above so that there is only one `and`. GISel doesn't optimize it, because the G_AND we'd optimize here is translated as a G_TRUNC. Later, that G_TRUNC is turned into a G_AND during legalization. Differential Revision: https://reviews.llvm.org/D110667	2021-09-28 17:13:34 -07:00
Roman Lebedev	b6b7860954	[X86][Costmodel] Load/store i16 Stride=6 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For this tuple, measuring becomes problematic since there's a lot of spilling going on, but apparently all these memory ops do not affect worst-case estimate at all here. For load we have: https://godbolt.org/z/5qGb9odP6 - for intels `Block RThroughput: <=106.0`; for ryzens, `Block RThroughput: <=34.8` So pick cost of `106`. For store we have: https://godbolt.org/z/KrWcv4Ph7 - for intels `Block RThroughput: =58.0`; for ryzens, `Block RThroughput: <=20.5` So pick cost of `58`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110593	2021-09-28 19:15:08 +03:00
Roman Lebedev	24e42f7d28	[X86][Costmodel] Load/store i16 Stride=6 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/3Tc5s897j - for intels `Block RThroughput: =39.0`; for ryzens, `Block RThroughput: <=13.5` So pick cost of `39`. For store we have: https://godbolt.org/z/fo1h9E67e - for intels `Block RThroughput: =21.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `21`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110592	2021-09-28 19:15:07 +03:00
Roman Lebedev	b3011bcc78	[X86][Costmodel] Load/store i16 Stride=6 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/1Wcaf9c7T - for intels `Block RThroughput: =9.0`; for ryzens, `Block RThroughput: <=4.5` So pick cost of `9`. For store we have: https://godbolt.org/z/1Wcaf9c7T - for intels `Block RThroughput: =15.0`; for ryzens, `Block RThroughput: <=6.0` So pick cost of `15`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110591	2021-09-28 19:15:01 +03:00
Roman Lebedev	aa93c55889	[X86][Costmodel] Load/store i16 Stride=6 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/bhscej4WM - for intels `Block RThroughput: =13.0`; for ryzens, `Block RThroughput: <=7.0` So pick cost of `13`. For store we have: https://godbolt.org/z/Yf4Pfnxbq - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=3.5` So pick cost of `10`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110590	2021-09-28 19:14:56 +03:00
Quinn Pham	70391b3468	[PowerPC] FP compare and test XL compat builtins. This patch is in a series of patches to provide builtins for compatibility with the XL compiler. This patch adds builtins for compare exponent and test data class operations on floating point values. Reviewed By: #powerpc, lei Differential Revision: https://reviews.llvm.org/D109437	2021-09-28 11:01:51 -05:00
Kazu Hirata	9e4f1f9265	[SystemZ] Remove redundant declaration SystemZMnemonicSpellCheck (NFC) Note that SystemZMnemonicSpellCheck is defined in SystemZGenAsmMatcher.inc, which SystemZAsmParser.cpp includes. Identified with readability-redundant-declaration.	2021-09-28 08:38:05 -07:00
David Green	fdd8c10959	[ARM] Delay reverting WLS in arm-block-placement As we have to split blocks, we may be left in an invalid loop state after a WLS is reverted to a DLS. Instead remember the WLS that could not be fixed and revert them after finishing processing all other loops. Differential Revision: https://reviews.llvm.org/D110567	2021-09-28 15:38:29 +01:00
Jingu Kang	73a196a11c	Recommit "[AArch64] Split bitmask immediate of bitwise AND operation" This reverts the revert commit `f85d8a5bed` with bug fixes. Original message: MOVi32imm + ANDWrr ==> ANDWri + ANDWri MOVi64imm + ANDXrr ==> ANDXri + ANDXri The mov pseudo instruction could be expanded to multiple mov instructions later. In this case, try to split the constant operand of mov instruction into two bitmask immediates. It makes only two AND instructions instead of multiple mov + and instructions. Added a peephole optimization pass on MIR level to implement it. Differential Revision: https://reviews.llvm.org/D109963	2021-09-28 15:26:29 +01:00
David Green	2c53215e99	[ARM] Skip debug info in recomputeVPTBlockMask The ARMLowOverheadLoops pass recalculates VPT block masks when it converts VCMP's inside VPT blocks into VPT's. The function to do so doesn't seem to handle debug info though, leading to invalid block creation or asserts at compile time. Make sure the function skips any debug info between the MVE instructions it inspects. Differential Revision: https://reviews.llvm.org/D110564	2021-09-28 14:58:13 +01:00
hyeongyu kim	86bf234d0b	[IR] Change the default value of InsertElement to poison (1/4) This patch is for fixing potential insertElement-related bugs like D93818. ``` V = UndefValue::get(VecTy); for(...) V = Builder.CreateInsertElement(V, Elt, Idx); => V = PoisonValue::get(VecTy); for(...) V = Builder.CreateInsertElement(V, Elt, Idx); ``` Like above, this patch changes the placeholder V to poison. The patch will be separated into several commits. Reviewed By: aqjune Differential Revision: https://reviews.llvm.org/D110311	2021-09-28 22:29:16 +09:00
Jingu Kang	f85d8a5bed	Revert "[AArch64] Split bitmask immediate of bitwise AND operation" This reverts commit `864b206796`. Reverting due to error on buildbots.	2021-09-28 13:28:09 +01:00
Jingu Kang	864b206796	[AArch64] Split bitmask immediate of bitwise AND operation MOVi32imm + ANDWrr ==> ANDWri + ANDWri MOVi64imm + ANDXrr ==> ANDXri + ANDXri The mov pseudo instruction could be expanded to multiple mov instructions later. In this case, try to split the constant operand of mov instruction into two bitmask immediates. It makes only two AND instructions instead of multiple mov + and instructions. Added a peephole optimization pass on MIR level to implement it. Differential Revision: https://reviews.llvm.org/D109963	2021-09-28 11:57:43 +01:00
Liu, Chen3	57e8f840b6	[X86][FP16] Fix a bug when Combine the FADD(A, FMA(B, C, 0)) to FMA(B, C, A). This bug was introduced by D109953. The operand order of generated FMA is wrong. Differential Revision: https://reviews.llvm.org/D110606	2021-09-28 11:38:53 +08:00
Jozef Lawrynowicz	6cfb4d46ba	[llvm-readobj] Support dumping of MSP430 ELF attributes The MSP430 ABI supports build attributes for specifying the ISA, code model, data model and enum size in ELF object files. Differential Revision: https://reviews.llvm.org/D107969	2021-09-28 00:56:11 +03:00
Roman Lebedev	2a7a768dad	[X86][Costmodel] Load/store i16 Stride=4 VF=32 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For this tuple, measuring becomes problematic since there's a lot of spilling going on, but apparently all these memory ops do not affect worst-case estimate at all here. For load we have: https://godbolt.org/z/zP4hd8MT6 - for intels `Block RThroughput: =150.0`; for ryzens, `Block RThroughput: <=59` So pick cost of `150`. For store we have: https://godbolt.org/z/vKb8zTK8E - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=24.0` So pick cost of `64`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110548	2021-09-27 22:20:01 +03:00
Roman Lebedev	ee5a050e2e	[X86][Costmodel] Load/store i16 Stride=4 VF=16 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =75.0`; for ryzens, `Block RThroughput: <=29.5` So pick cost of `75`. (note that `# 32-byte Reload` does not affect throughput there.) For store we have: https://godbolt.org/z/Wd9cKab83 - for intels `Block RThroughput: =32.0`; for ryzens, `Block RThroughput: <=12.0` So pick cost of `32`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110543	2021-09-27 22:20:01 +03:00
Roman Lebedev	5615d6a6dd	[X86][Costmodel] Load/store i16 Stride=4 VF=8 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/dd8T5P471 - for intels `Block RThroughput: =33.0`; for ryzens, `Block RThroughput: <=14.5` So pick cost of `33`. For store we have: https://godbolt.org/z/zPxcKWhn4 - for intels `Block RThroughput: =10.0`; for ryzens, `Block RThroughput: <=6.0` So pick cost of `10`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110541	2021-09-27 22:20:01 +03:00
Roman Lebedev	df2b42d12e	[X86][Costmodel] Load/store i16 Stride=4 VF=4 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/rnsf639Wh - for intels `Block RThroughput: =17.0`; for ryzens, `Block RThroughput: <=7.5` So pick cost of `17`. For store we have: https://godbolt.org/z/565KKrcY6 - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: =2.0` So pick cost of `6`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110537	2021-09-27 22:20:01 +03:00
Roman Lebedev	45caac91c4	[X86][Costmodel] Load/store i16 Stride=4 VF=2 interleaving costs The only sched models that for cpu's that support avx2 but not avx512 are: haswell, broadwell, skylake, zen1-3 For load we have: https://godbolt.org/z/5EYc6r9nh - for intels `Block RThroughput: =6.0`; for ryzens, `Block RThroughput: <=3.0` So pick cost of `6`. For store we have: https://godbolt.org/z/z61e5d6GE - for intels `Block RThroughput: =2.0`; for ryzens, `Block RThroughput: <=1.0` So pick cost of `2`. I'm directly using the shuffling asm the llc produced, without any manual fixups that may be needed to ensure sequential execution. Reviewed By: RKSimon Differential Revision: https://reviews.llvm.org/D110536	2021-09-27 22:20:01 +03:00
Praveen Velliengiri	e90b512c4d	[AMDGPU] Change ASAN init/fini kernels linkage to external. HSA runtime fails to find the symbols for Init and Fini kernels as they are marked with internal linkage, so change the linkage to external to fix those errors. Differential Revision: https://reviews.llvm.org/D110054	2021-09-27 11:50:37 -06:00
Quinn Pham	682e15f371	[PowerPC] Fix td pattern for P10 VSLDBI and VSRDBI This patch fixes the pattern for the P10 instructions Vector Shift Left Double by Bit Immediate VN-form and Vector Shift Right Double by Bit Immediate VN-form. The third argument should be a target constant (`timm`) instead of an `i32` because an immediate is expected. Reviewed By: lei Differential Revision: https://reviews.llvm.org/D109920	2021-09-27 12:36:18 -05:00
Craig Topper	a2a07e8db3	[RISCV] Fold store of vmv.x.s to a vse with VL=1. This can avoid a loss of decoupling with the scalar unit on cores with decoupled scalar and vector units. We should support FP too, but those use extract_element and not a custom ISD node so it is a little different. I also left a FIXME in the test for i64 extract and store on RV32. Reviewed By: frasercrmck Differential Revision: https://reviews.llvm.org/D109482	2021-09-27 09:54:46 -07:00
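
The M68k disassembler fix in e1e3b6ee72 rests on a general C++ rule: shifting a 32-bit value by 32 or more bits is undefined behaviour. The sketch below illustrates the idea of replacing the whole scratch value when all 32 bits are read. The helper name `appendBits`, its signature, and the left-shift direction are assumptions made for illustration; this is not the code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the M68k backend): append the low
// `numBits` bits of `bits` to a 32-bit scratch value.
inline uint32_t appendBits(uint32_t scratch, uint32_t bits, unsigned numBits) {
  assert(numBits >= 1 && numBits <= 32 && "bit count out of range");
  if (numBits == 32) {
    // `scratch << 32` would be undefined behaviour for a 32-bit type.
    // Every old bit would be shifted out anyway, so replace the value.
    return bits;
  }
  const uint32_t mask = (uint32_t{1} << numBits) - 1;
  return (scratch << numBits) | (bits & mask);
}
```

Special-casing the full-width read keeps the common path branch-light and avoids widening the scratch value to 64 bits, which would be the other obvious way to make the shift well defined.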

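
The AMDGPU fix in 52785989e9 describes a per-lane select that reads a condition mask in which only bit 0 is set. The following sketch models that situation on the host to show why the scalar condition has to be broadcast to every active lane. The 64-lane wave size, the function names, and the use of an explicit `exec` mask are assumptions for illustration and do not mirror the backend code.

```cpp
#include <array>
#include <cstdint>

constexpr unsigned kLanes = 64; // assumed wave size

// Models a per-lane select such as v_cndmask_b32: each lane takes
// `ifSet` when its bit in `mask` is 1, otherwise `ifClear`.
std::array<uint32_t, kLanes> selectPerLane(uint64_t mask, uint32_t ifClear,
                                           uint32_t ifSet) {
  std::array<uint32_t, kLanes> out{};
  for (unsigned lane = 0; lane < kLanes; ++lane)
    out[lane] = ((mask >> lane) & 1u) ? ifSet : ifClear;
  return out;
}

// Problematic form: the scalar condition becomes the integer 0 or 1,
// so only lane 0 can ever see the condition as true.
uint64_t sccAsInteger(bool scc) { return scc ? 1u : 0u; }

// Explicit broadcast: every active lane (bit set in `exec`) sees the
// scalar condition.
uint64_t sccBroadcast(bool scc, uint64_t exec) { return scc ? exec : 0u; }
```

With a true condition, `selectPerLane(sccAsInteger(true), 0, v)` yields `v` in lane 0 only, while `selectPerLane(sccBroadcast(true, exec), 0, v)` yields `v` in every lane enabled in `exec`.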
64355 Commits