This is an alternative to D120395 and D120411.
Previously we used `__bfloat16` as a typedef of `unsigned short`. The
name may give users the impression that it is a brand new type representing
BF16, so they may use it in arithmetic operations and we have no good way
to block that.
To solve the problem, we introduced `__bf16` into the X86 psABI and landed
the support in Clang via D130964. Now we can fix the issue by switching the
intrinsics to the new type.
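A minimal usage sketch (assuming the AVX-512 BF16 intrinsic `_mm_cvtness_sbh` and the `-mavx512bf16 -mavx512vl` flags; not taken from the patch):
```
#include <immintrin.h>

// With the old typedef (__bfloat16 == unsigned short), the result of a BF16
// intrinsic could silently be used in integer arithmetic on the raw bits.
// With __bf16 the intrinsic returns a real BF16 storage type, so such misuse
// can be diagnosed by the compiler.
__bf16 to_bf16(float f) {
  return _mm_cvtness_sbh(f); // convert one float to BF16
}
```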
Reviewed By: LuoYuanke, RKSimon
Differential Revision: https://reviews.llvm.org/D132329
[This Godbolt link](https://godbolt.org/z/s17Kv1s9T) shows different codegen between clang and gcc for a transpose operation.
clang result:
```
vmovdqu xmm0, xmmword ptr [rcx + rax]
vmovdqu xmm1, xmmword ptr [rcx + rax + 16]
vmovdqu xmm2, xmmword ptr [r8 + rax]
vmovdqu xmm3, xmmword ptr [r8 + rax + 16]
vpunpckhbw xmm4, xmm2, xmm0
vpunpcklbw xmm0, xmm2, xmm0
vpunpcklbw xmm2, xmm3, xmm1
vpunpckhbw xmm1, xmm3, xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm4
```
gcc result:
```
vmovdqu ymm3, YMMWORD PTR [rdi+rax]
vpunpcklbw ymm1, ymm3, YMMWORD PTR [rsi+rax]
vpunpckhbw ymm0, ymm3, YMMWORD PTR [rsi+rax]
vperm2i128 ymm2, ymm1, ymm0, 32
vperm2i128 ymm1, ymm1, ymm0, 49
vmovdqu YMMWORD PTR [rcx+rax*2], ymm2
vmovdqu YMMWORD PTR [rcx+32+rax*2], ymm1
```
clang's code is roughly 15% slower than gcc's when evaluated on an internal compression benchmark.
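The kernel being vectorized is a byte interleave along these lines (a hedged sketch of the access pattern, not the exact benchmark source):
```
// Interleave two byte arrays: even output bytes come from a, odd ones from b.
// The loop vectorizer widens this into the 64-element shufflevector below.
void interleave(const unsigned char *a, const unsigned char *b,
                unsigned char *out, int n) {
  for (int i = 0; i < n; ++i) {
    out[2 * i] = a[i];
    out[2 * i + 1] = b[i];
  }
}
```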
The loop vectorizer generates the following shufflevector instruction:
```
%interleaved.vec = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
```
which is lowered to SelectionDAG:
```
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t6: v64i8 = concat_vectors t2, undef:v32i8
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t7: v64i8 = concat_vectors t4, undef:v32i8
t8: v64i8 = vector_shuffle<0,64,1,65,2,66,3,67,4,68,5,69,6,70,7,71,8,72,9,73,10,74,11,75,12,76,13,77,14,78,15,79,16,80,17,81,18,82,19,83,20,84,21,85,22,86,23,87,24,88,25,89,26,90,27,91,28,92,29,93,30,94,31,95> t6, t7
```
So far this `vector_shuffle` is good enough for us to pattern-match and transform, but as we go down the SelectionDAG pipeline, it gets split into smaller shuffles. During dagcombine1, the shuffle is split by `foldShuffleOfConcatUndefs`.
```
// shuffle (concat X, undef), (concat Y, undef), Mask -->
// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t19: v32i8 = vector_shuffle<0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47> t2, t4
t15: ch,glue = CopyToReg t0, Register:v32i8 $ymm0, t19
t20: v32i8 = vector_shuffle<16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63> t2, t4
t17: ch,glue = CopyToReg t15, Register:v32i8 $ymm1, t20, t15:1
```
With `foldShuffleOfConcatUndefs` commented out, the vector is still split later by the type legalizer, which comes after dagcombine1, because v64i8 is not a legal type with AVX2 (64 * 8 = 512 bits while ymm = 256 bits). There doesn't seem to be a good way to avoid this split. Lowering the `vector_shuffle` into unpck and perm during dagcombine1 is too early. Therefore, although somewhat inconvenient, we decided to go with pattern-matching a pair of vector shuffles later in the SelectionDAG pipeline, as part of `lowerV32I8Shuffle`.
The code looks at the two operands of the first shuffle it encounters, iterates through the users of the operands, and tries to find two shuffles that are consecutive interleaves. Once the pattern is found, it lowers them into unpcks and perms. It returns the perm for the shuffle that's currently being lowered (letting ISel modify the DAG) and replaces the other shuffle in place.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D134477
In the Linux PIC model, there are four cases of value/label addressing:
Case 1: Function call or Label jmp inside the module.
Case 2: Data access (such as global variable, static variable) inside the module.
Case 3: Function call or Label jmp outside the module.
Case 4: Data access (such as global variable) outside the module.
Because the current LLVM inline asm architecture is designed not to "recognize"
the asm code, it is quite troublesome to treat memory addressing differently for
the same value/address used in different instructions.
For example, in the PIC model, a function call may go through the PLT or be
directly PC-relative, while lea/mov of a function address may use the GOT.
This patch fixes/refines case 1 and case 2 in inline asm.
Since inline asm currently does not support jumping to an outside label, this
patch mainly focuses on fixing the function call addressing bugs in inline asm.
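A hedged illustration of case 1 and case 2 in inline asm (the operand modifier and constraints are common GCC-style idioms, shown only to highlight the addressing difference; they are not taken from the patch):
```
static void local_func(void) {}
static int local_data;

void demo(void) {
  // Case 1: a function call inside the module should be emitted PC-relative
  // (or via the PLT), not through a GOT load of the address.
  asm volatile("call %P0" : : "i"(local_func));
  // Case 2: a data access inside the module; the compiler picks the
  // appropriate PIC addressing for the memory operand.
  int tmp;
  asm volatile("movl %1, %0" : "=r"(tmp) : "m"(local_data));
}
```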
Reviewed By: Pengfei, RKSimon
Differential Revision: https://reviews.llvm.org/D133914
Allows us to use combineX86ShuffleChainWithExtract to combine targetshuffle(low_subvector(x),high_subvector(x)) -> low_subvector(targetshuffle(x)) style patterns
This is currently very limited (it must have a v2i64/v2f64 result), but while triaging I noticed we might be able to extend this to allow more types for targets with suitable variable cross lane shuffle support.
Fixes #58339
Fix crash issue of D129537 and reopen it.
Currently the X86 shuffle lowering widens the element type of a shuffle when
adjacent mask elements are consecutive. For the example below:
```
%t2 = add nsw <16 x i32> %t0, %t1
%t3 = sub nsw <16 x i32> %t0, %t1
%t4 = shufflevector <16 x i32> %t2, <16 x i32> %t3,
      <16 x i32> <i32 16, i32 17, i32 2, i32 3, i32 4,
                  i32 5, i32 6, i32 7, i32 8, i32 9, i32 10,
                  i32 11, i32 12, i32 13, i32 14, i32 15>
ret <16 x i32> %t4
```
the compiler transforms the shuffle into
```
%t4 = shufflevector <8 x i64> %t2, <8 x i64> %t3,
      <8 x i32> <i32 8, i32 1, i32 2, i32 3, i32 4,
                 i32 5, i32 6, i32 7>
```
This can lose the opportunity for ISel to select a masked instruction when
AVX512 is enabled.
This patch prevents the transform when the AVX512 feature is enabled.
Thanks to Simon for the idea.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130830
Clang may optimize conditional tailcall blocks with the following layout:
```
cmp <condition>
je tailcall_target
ret
```
When retpoline is in place, indirect calls are converted into direct calls to a retpoline thunk. When these indirect calls are tail calls, they may be subject to the optimization described above (there is no indirect JCC, but since the jump is now direct it can be made conditional). The above layout is non-ideal for the Linux kernel scenario because the branches into thunks may be patched back into indirect branches at runtime depending on the underlying CPU features, which would not be feasible if the binary were emitted with the optimized layout above.
Thus, prevent clang from emitting it if CodeModel is Kernel.
Feature request from the respective kernel mailing list: https://lore.kernel.org/llvm/Yv3uI%2FMoJVctmBCh@worktop.programming.kicks-ass.net/
Reviewed By: nickdesaulniers, pengfei
Differential Revision: https://reviews.llvm.org/D134915
This stops reporting CostPerUse 1 for `R8`-`R15` and `XMM8`-`XMM31`.
This was previously done because instructions using these registers require a
REX prefix, resulting in longer instruction encodings. I found that this
regresses the quality of the register allocation, as the costs impose an
ordering on eviction candidates. I also feel that there is a bit of an
impedance mismatch: the actual costs occur when encoding instructions that use
those registers, but the order of VReg assignments is not primarily ordered by
the number of Defs+Uses.
I did extensive measurements with the llvm-test-suite with SPEC2006 +
SPEC2017 included; internal services showed similar patterns. Generally
there are a lot of improvements but also a lot of regressions, but on
average the allocation quality seems to improve at the cost of a small
code size regression.
Results for measuring static and dynamic instruction counts:
Dynamic Counts (scaled by execution frequency) / Optimization Remarks:
Spills+FoldedSpills -5.6%
Reloads+FoldedReloads -4.2%
Copies -0.1%
Static / LLVM Statistics:
regalloc.NumSpills mean -1.6%, geomean -2.8%
regalloc.NumReloads mean -1.7%, geomean -3.1%
size..text mean +0.4%, geomean +0.4%
Static / LLVM Statistics:
regalloc.NumSpills mean -2.2%, geomean -3.1%
regalloc.NumReloads mean -2.6%, geomean -3.9%
size..text mean +0.6%, geomean +0.6%
Static / LLVM Statistics:
regalloc.NumSpills mean -3.0%
regalloc.NumReloads mean -3.3%
size..text mean +0.3%, geomean +0.3%
Differential Revision: https://reviews.llvm.org/D133902
In combineOr (X86ISelLowering.cpp) there is a DAG combine that rewrites
a "(0 - SetCC) | C" pattern into something simpler, given that an LEA
can be used. Another requirement is that C has some specific value,
for example 1 or 7. When checking those requirements the code used a
32-bit unsigned variable to store the value of C. So for a 64-bit OR
this could miscompile in case any of the 32 most significant bits in
C were non-zero.
This patch fixes the bug by using a sufficiently large type for the
C value.
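A hedged illustration of the bug class (not the actual LLVM code):
```
#include <cstdint>

// Storing the 64-bit constant C in a 32-bit variable silently drops the high
// bits, so a check like "C == 1 || C == 7" can pass for a value that is not
// actually 1 or 7, and the rewrite then miscompiles the 64-bit OR.
bool looksLikeSeven(uint64_t C) {
  uint32_t Truncated = static_cast<uint32_t>(C); // high 32 bits lost
  return Truncated == 7; // true for C == 0x100000007, which is not 7
}
```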
The faulty code seems to have been introduced by commit 9bceb8981d
(D131358).
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D134892
This has the advantage of dealing with live EFLAGS, using LEA instead of
SUB when needed to avoid clobbering them. It also respects the "lea-sp" feature.
We could allow unrolled stack probing from blocks with live EFLAGS if
canUseAsEpilogue learns when emitStackProbeInlineGeneric will be used.
Differential Revision: https://reviews.llvm.org/D134495
The mul by constant costmodels handle power-of-2 constants, but not negated-power-of-2, despite the backends handling both.
This patch adds the OperandValueProperties::OP_NegatedPowerOf2 enum and wires it for use for basic mul cost analysis and SLP handling.
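A small illustration of the case being costed (hedged; not from the patch itself):
```
// A multiply by a negated power of two is typically lowered as a shift plus a
// negate, so its cost should track the positive power-of-two case rather than
// a generic multiply.
long long mulByNeg8(long long x) {
  return x * -8; // usually becomes -(x << 3)
}
```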
Fixes #50778
Differential Revision: https://reviews.llvm.org/D111968
This mainly just adds costs for the targets where we have actual funnelshift/rotate instructions (VBMI2/XOP etc.) - the cases where we expand still need addressing, although for many the default shift+or expansion, especially for uniform cases, isn't that bad.
This was achieved with the 'cost-tables vs llvm-mca' script D103695
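For reference, rotates are a common source of these intrinsics (a small sketch using a clang builtin; not part of the patch):
```
// A rotate becomes llvm.fshl with both value operands equal; targets with
// native rotate/funnel-shift instructions (XOP, VBMI2, ...) get cheap costs,
// the rest pay for the shift+or expansion.
unsigned rotl32(unsigned x, unsigned n) {
  return __builtin_rotateleft32(x, n);
}
```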
Add costs for the funnel shift instructions - fixes some discrepancies I was hitting with cost numbers from the 'cost-tables vs llvm-mca' script D103695
The change adds support for the case when the return value is passed in
memory rather than in registers.
Differential Revision: https://reviews.llvm.org/D134181
The LEA optimization pass visits each basic block of a given machine
function. In each basic block, for each pair of LEAs that differ only
in their displacement fields, we replace all uses of the second LEA
with the first LEA while adjusting the displacement.
Now, without this patch, after all the replacements are made, the
following assert triggers:
```
assert(MRI->use_empty(LastVReg) &&
       "The LEA's def register must have no uses");
```
The replacement loop uses:
```
for (MachineOperand &MO :
     llvm::make_early_inc_range(MRI->use_operands(LastVReg))) {
```
which is equivalent to:
```
for (auto UI = MRI->use_begin(LastVReg), UE = MRI->use_end();
     UI != UE;) {
  MachineOperand &MO = *UI++; // <-- Look!
```
That is, immediately after the post-increment, make_early_inc_range
already holds the iterator for the next iteration.
The problem is that in one iteration of the loop, we could replace two
uses in a debug instruction like:
DBG_VALUE_LIST !"r", !DIExpression(DW_OP_LLVM_arg, 0), %0:gr64, %0:gr64, ...
So, the iterator for the next iteration becomes invalid. We end up
traversing a garbage use list from that point on. In turn, we don't
get to visit remaining uses.
The patch fixes the problem by switching to a "draining" while loop:
```
while (!MRI->use_empty(LastVReg)) {
  MachineOperand &MO = *MRI->use_begin(LastVReg);
  MachineInstr &MI = *MO.getParent();
```
The credit goes to Simon Pilgrim for reducing the test case.
Fixes https://github.com/llvm/llvm-project/issues/57673
Differential Revision: https://reviews.llvm.org/D133631
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (and recent fixes to the bdver2 + alderlake models)
Adding full CostKinds costs affects some other tests, as they make assumptions about SizeLatency costs, so those need addressing first
Although unsupported on HSW, we reuse this model for KNL which does require them
Noticed when running the cost model fuzz script from D103695 with -mcpu=knl
These were based off a mixture of vector integer add/sub costs and the numbers from the 'cost-tables vs llvm-mca' script from D103695 - the extra costs for different predicates are still proving tricky to implement, but I've gotten most costs to within +/-1 now - the AVX512 ones are tricky as we still don't handle predicate results properly, so most of these were done by hand.
All in-tree targets pass pointer-sized ConstantSDNodes to the
method. This overload reduces the amount of boilerplate code a bit. This
also makes getCALLSEQ_END consistent with getCALLSEQ_START, which
already takes uint64_ts.
These are the worst case generic vector shift costs, where nothing is known about the shift amounts - in particular this should stop us using the default sizelatency cost of 1 for so many pre-AVX2 vector shifts that can often actually expand during lowering to 20+ uops, just for 128-bit vectors, resulting in some horrible inline/unroll decisions.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Vector shift by const uniform is the cheapest shift instruction we have; non-const uniform shifts have a marginally higher cost - some targets 'splat' the amount internally to use the shift-per-element instruction, others see a higher cost for the explicit zeroing of the upper bits for the (64-bit) shift amount.
This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (I'll update the patch soon for reference)
Corrects the shift-by-constant costs to better account for them being converted to multiplies during lowering - which demonstrates that we should probably be trying harder NOT to convert these to multiplies for some CPUs (v4i32 in particular).
__declspec(safebuffers) is equivalent to
__attribute__((no_stack_protector)). This information is recorded in
CodeView.
While we are here, add support for strict_gs_check.
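For reference, a minimal usage sketch (assuming clang with -fms-extensions; not taken from the patch):
```
// __declspec(safebuffers) disables stack protection for this one function,
// equivalent to __attribute__((no_stack_protector)); the choice is recorded
// in the CodeView debug info.
__declspec(safebuffers) void copy_untrusted(char *dst, const char *src, int n) {
  for (int i = 0; i < n; ++i)
    dst[i] = src[i];
}
```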
I'm planning to deprecate and eventually remove llvm::empty.
I thought about replacing llvm::empty(x) with std::empty(x), but it
turns out that all uses can be converted to x.empty(). That is, no
use requires the ability of std::empty to accept C arrays and
std::initializer_list.
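A before/after sketch of the mechanical change (illustrative only):
```
#include <vector>

// Illustrative only: every remaining use of llvm::empty(x) can be written as
// x.empty(), since none of them pass a C array or std::initializer_list.
bool isDone(const std::vector<int> &Worklist) {
  return Worklist.empty(); // previously: llvm::empty(Worklist)
}
```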
Differential Revision: https://reviews.llvm.org/D133677
For remainder:
If (1 << (BitWidth / 2)) % Divisor == 1, we can add the high and low halves
together and use a (BitWidth / 2) urem. If (BitWidth / 2) is a legal integer
type, this urem will be expanded by DAGCombiner using a multiply by magic
constant. We do have to take into account that adding high and low
together can produce a carry, making it a (BitWidth / 2)+1 bit number.
So we need to also add back in the carry from the first addition.
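A hedged sketch of the expansion for a 64-bit urem by 3 (the names and the concrete divisor are illustrative):
```
#include <cstdint>

// (1ull << 32) % 3 == 1, so x % 3 == ((x >> 32) + (x & 0xffffffff)) % 3 once
// the carry from the 32-bit addition is folded back in. The final 32-bit urem
// can then be expanded with a multiply by magic constant.
uint64_t urem3(uint64_t x) {
  uint64_t sum = (x & 0xffffffffu) + (x >> 32); // up to 33 bits
  sum = (sum & 0xffffffffu) + (sum >> 32);      // add the carry back in
  return static_cast<uint32_t>(sum) % 3;
}
```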
For division:
We can use the above trick to compute the remainder, subtract that
remainder from the dividend, then multiply by the multiplicative
inverse of the Divisor modulo (1 << BitWidth).
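A matching sketch for the division side (again with divisor 3 on 64 bits; the inverse constant is specific to that choice):
```
#include <cstdint>

// Compute the remainder with the summing trick, subtract it so the dividend
// becomes an exact multiple of 3, then multiply by the multiplicative inverse
// of 3 modulo 2^64 (0xAAAAAAAAAAAAAAAB), which recovers the quotient.
uint64_t udiv3(uint64_t x) {
  uint64_t sum = (x & 0xffffffffu) + (x >> 32);
  sum = (sum & 0xffffffffu) + (sum >> 32);
  uint64_t rem = static_cast<uint32_t>(sum) % 3;
  return (x - rem) * 0xAAAAAAAAAAAAAAABull;
}
```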
This is based on the section "Remainder by Summing Digits" in
Hacker's Delight.
The remainder trick is similar to a trick you may have learned for
determining if a decimal number is divisible by 3. You can add all the
digits together and see if the sum is divisible by 3. If you're not sure
if the sum is divisible by 3, you can add its digits together. This
can be repeated until you have a single decimal digit. If that digit
is 3, 6, or 9, then the original number is divisible by 3. This works
because 10 % 3 == 1.
gcc already does this same trick. There are additional tricks gcc
does for urem as well as srem, udiv, and sdiv that I plan to add in
future patches.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D130862
Also remove the new-pass-manager version of ExpandLargeDivRem because there is
not yet a way to access TargetLowering in the new pass manager.
Differential Revision: https://reviews.llvm.org/D133691
They shouldn't be happening after XOP shift costs - AVX2 shift support takes preference over XOP for everything but vXi8 shifts - the improvement is pretty limited as it only affects bdver4 targets, but it does help clean up a fraction of the messy shift cost logic...
Noticed while trying to get vector ctpop/ctlz/cttz costs fixed using the script from D103695 - all of these are full-rate but the throughput costs were weirdly high for bdver2
Matches AMD 15h SoG, Agner and instlatx64
Noticed while trying to get vector shifts costs fixed using the script from D103695 - all of these are full-rate but the throughput costs were weirdly high for bdver2
Matches AMD 15h SoG, Agner and instlatx64