This isn't the final fix for our reduction/horizontal codegen, but it takes care
of a lot of the problems. After we narrow the shuffle, existing combines for
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.
The avg and mul reduction tests show an existing shuffle lowering hole for
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.
Differential Revision: https://reviews.llvm.org/D57156
llvm-svn: 352209
This target-independent code won't trigger for cases such as RV32FD where
custom SelectionDAG nodes are generated. These new tests demonstrate such
cases. Additionally, float-arith.ll was updated so that fneg.s, fsgnjn.s, and
fabs.s selection patterns are actually exercised.
llvm-svn: 352199
Same as ARM.
On this occasion we split some of the instruction select tests for more
complicated instructions into their own files, so we can reuse them for
ARM and Thumb mode. Likewise for the legalizer tests.
llvm-svn: 352188
Fast selection of llvm icmp and fcmp instructions is not handled well about VSX instruction support.
We'd use VSX float comparison instruction instead of non-vsx float comparison instruction
if the operand register class is VSSRC or VSFRC because i32 and i64 are mapped to VSSRC and
VSFRC correspondingly if VSX feature is opened.
If the target does not have corresponding VSX instruction comparison for some type,
just copy VSX-related register to common float register class and use non-vsx comparison instruction.
Differential Revision: https://reviews.llvm.org/D57078
llvm-svn: 352174
Follow the same custom legalisation strategy as used in D57085 for
variable-length shifts (see that patch summary for more discussion). Although
we may lose out on some late-stage DAG combines, I think this custom
legalisation strategy is ultimately easier to reason about.
There are some codegen changes in rv64m-exhaustive-w-insts.ll but they are all
neutral in terms of the number of instructions.
Differential Revision: https://reviews.llvm.org/D57096
llvm-svn: 352171
The previous DAG combiner-based approach had an issue with infinite loops
between the target-dependent and target-independent combiner logic (see
PR40333). Although this was worked around in rL351806, the combiner-based
approach is still potentially brittle and can fail to select the 32-bit shift
variant when profitable to do so, as demonstrated in the pr40333.ll test case.
This patch instead introduces target-specific SelectionDAG nodes for
SHLW/SRLW/SRAW and custom-lowers variable i32 shifts to them. pr40333.ll is a
good example of how this approach can improve codegen.
This adds DAG combine that does SimplifyDemandedBits on the operands (only
lower 32-bits of first operand and lower 5 bits of second operand are read).
This seems better than implementing SimplifyDemandedBitsForTargetNode as there
is no guarantee that would be called (and it's not for e.g. the anyext return
test cases). Also implements ComputeNumSignBitsForTargetNode.
There are codegen changes in atomic-rmw.ll and atomic-cmpxchg.ll but the new
instruction sequences are semantically equivalent.
Differential Revision: https://reviews.llvm.org/D57085
llvm-svn: 352169
This patch exploits the instructions that store a single element from a vector
to preform a (store (extract_elt)). We already have code that does this with
ISA 3.0 instructions that were added to handle i8/i16 types. However, we had
never exploited the existing ones that handle f32/f64/i32/i64 types.
Differential revision: https://reviews.llvm.org/D56175
llvm-svn: 352131
https://reviews.llvm.org/D57178
Now add a hook in TargetPassConfig to query if CSE needs to be
enabled. By default this hook returns false only for O0 opt level but
this can be overridden by the target.
As a consequence of the default of enabled for non O0, a few tests
needed to be updated to not use CSE (by passing in -O0) to the run
line.
reviewed by: arsenm
llvm-svn: 352126
This patch adds support for vector @llvm.ceil intrinsics when full 16 bit
floating point support isn't available.
To do this, this patch...
- Implements basic isel for G_UNMERGE_VALUES
- Teaches the legalizer about 16 bit floats
- Teaches AArch64RegisterBankInfo to respect floating point registers on
G_BUILD_VECTOR and G_UNMERGE_VALUES
- Teaches selectCopy about 16-bit floating point vectors
It also adds
- A legalizer test for the 16-bit vector ceil which verifies that we create a
G_UNMERGE_VALUES and G_BUILD_VECTOR when full fp16 isn't supported
- An instruction selection test which makes sure we lower to G_FCEIL when
full fp16 is supported
- A test for selecting G_UNMERGE_VALUES
And also updates arm64-vfloatintrinsics.ll to show that the new ceiling types
work as expected.
https://reviews.llvm.org/D56682
llvm-svn: 352113
It should be emitted when any floating-point operations (including
calls) are present in the object, not just when calls to printf/scanf
with floating point args are made.
The difference caused by this is very subtle: in static (/MT) builds,
on x86-32, in a program that uses floating point but doesn't print it,
the default x87 rounding mode may not be set properly upon
initialization.
This commit also removes the walk of the types pointed to by pointer
arguments in calls. (To assist in opaque pointer types migration --
eventually the pointee type won't be available.)
That latter implies that it will no longer consider a call like
`scanf("%f", &floatvar)` as sufficient to emit _fltused on its
own. And without _fltused, `scanf("%f")` will abort with error R6002. This
new behavior is unlikely to bite anyone in practice (you'd have to
read a float, and do nothing with it!), and also, is consistent with
MSVC.
Differential Revision: https://reviews.llvm.org/D56548
llvm-svn: 352076
Select zero extending and sign extending load for MIPS32.
Use size from MachineMemOperand to determine number of bytes to load.
Differential Revision: https://reviews.llvm.org/D57099
llvm-svn: 352038
Use CombinerHelper to combine extending load instructions.
G_LOAD combined with G_ZEXT, G_SEXT or G_ANYEXT gives G_ZEXTLOAD,
G_SEXTLOAD or G_LOAD with same type as def of extending instruction
respectively.
Similarly G_ZEXTLOAD combined with G_ZEXT gives G_ZEXTLOAD and
G_SEXTLOAD combined with G_SEXT gives G_SEXTLOAD with same type
as def of extending instruction.
Differential Revision: https://reviews.llvm.org/D56914
llvm-svn: 352037
After creating new PHI instructions during isel pseudo expansion, the NoPHIs
property of MF should be reset in case it was previously set.
Review: Ulrich Weigand
llvm-svn: 352030
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means, this patch is only a functional change
for the Jaguar cpu model only.
Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the actual latency is actually contributed by the
so-called operand-latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.
When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when flag -print-schedule is speified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
Differential Revision: https://reviews.llvm.org/D57056
llvm-svn: 351965
Summary:
With XNACK, an smem load whose result is coalesced with an operand (thus
it overwrites its own operand) cannot appear in a clause, because some
other instruction might XNACK and restart the whole clause.
The clause breaker already realized that an smem that overwrites an
operand cannot appear in a clause, and broke the clause. The problem
that this commit fixes is that the SIFormMemoryClauses optimization
formed a bundle with early clobber, which caused the earlier code that
set up the coalesced operand to be removed as dead.
Differential Revision: https://reviews.llvm.org/D57008
Change-Id: I703c4d5b0bf7d6060222bec491f45c18bb3c0016
llvm-svn: 351950
Currently in Arm code, we allocate LR first, under the assumption that
it needs to be saved anyway. Unfortunately this has the disadvantage
that it will require any instructions using it to be the longer thumb2
instructions, not the shorter thumb1 ones.
This switches the order when we are optimising for minsize, returning to
the default order so that more lower registers can be used. It can end
up requiring more pushed registers, but on average produces smaller code.
Differential Revision: https://reviews.llvm.org/D56008
llvm-svn: 351938
In the last stage of type promotion, we replace any zext that uses a
new trunc with the operand of the trunc. This is okay when we only
allowed one type to be optimised, but now its the case that the trunc
maybe needed to produce a more narrow type than the one we were
optimising for. So we need to check this before doing the replacement.
Differential Revision: https://reviews.llvm.org/D57041
llvm-svn: 351935
The current check in CombineToPreIndexedLoadStore is too
conversative, preventing a pre-indexed store when the base pointer
is a predecessor of the value being stored. Instead, we should check
the pointer operand of the store.
Differential Revision: https://reviews.llvm.org/D56719
llvm-svn: 351933
As part of speculation hardening, the stack pointer gets masked with the
taint register (X16) before a function call or before a function return.
Since there are no instructions that can directly mask writing to the
stack pointer, the stack pointer must first be transferred to another
register, where it can be masked, before that value is transferred back
to the stack pointer.
Before, that temporary register was always picked to be x17, since the
ABI allows clobbering x17 on any function call, resulting in the
following instruction pattern being inserted before function calls and
returns/tail calls:
mov x17, sp
and x17, x17, x16
mov sp, x17
However, x17 can be live in those locations, for example when the call
is an indirect call, using x17 as the target address (blr x17).
To fix this, this patch looks for an available register just before the
call or terminator instruction and uses that.
In the rare case when no register turns out to be available (this
situation is only encountered twice across the whole test-suite), just
insert a full speculation barrier at the start of the basic block where
this occurs.
Differential Revision: https://reviews.llvm.org/D56717
llvm-svn: 351930
Two backend optimizations failed to handle cases when compiled with -g, due
to failing to consider DBG_VALUE instructions. This was in
SystemZTargetLowering::emitSelect() and
SystemZElimCompare::getRegReferences().
This patch makes sure that DBG_VALUEs are recognized so that they do not
affect these optimizations.
Tests for branch-on-count, load-and-trap and consecutive selects.
Review: Ulrich Weigand
https://reviews.llvm.org/D57048
llvm-svn: 351928
#pragma clang loop pipeline(disable)
Disable SWP optimization for the next loop.
“disable” is the only possible value.
#pragma clang loop pipeline_initiation_interval(number)
Set value of initiation interval for SWP
optimization to specified number value for
the next loop. Number is the positive value
greater than 0.
These pragmas could be used for debugging or reducing
compile time purposes. It is possible to disable SWP for
concrete loops to save compilation time or to find bugs
by not doing SWP to certain loops. It is possible to set
value of initiation interval to concrete number to save
compilation time by not doing extra pipeliner passes or
to check created schedule for specific initiation interval.
That is llvm part of the fix
Clang part of fix: https://reviews.llvm.org/D55710
Patch by Alexey Lapshin!
Differential Revision: https://reviews.llvm.org/D56403
llvm-svn: 351923
Each hwasan check requires emitting a small piece of code like this:
https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html#memory-accesses
The problem with this is that these code blocks typically bloat code
size significantly.
An obvious solution is to outline these blocks of code. In fact, this
has already been implemented under the -hwasan-instrument-with-calls
flag. However, as currently implemented this has a number of problems:
- The functions use the same calling convention as regular C functions.
This means that the backend must spill all temporary registers as
required by the platform's C calling convention, even though the
check only needs two registers on the hot path.
- The functions take the address to be checked in a fixed register,
which increases register pressure.
Both of these factors can diminish the code size effect and increase
the performance hit of -hwasan-instrument-with-calls.
The solution that this patch implements is to involve the aarch64
backend in outlining the checks. An intrinsic and pseudo-instruction
are created to represent a hwasan check. The pseudo-instruction
is register allocated like any other instruction, and we allow the
register allocator to select almost any register for the address to
check. A particular combination of (register selection, type of check)
triggers the creation in the backend of a function to handle the check
for specifically that pair. The resulting functions are deduplicated by
the linker. The pseudo-instruction (really the function) is specified
to preserve all registers except for the registers that the AAPCS
specifies may be clobbered by a call.
To measure the code size and performance effect of this change, I
took a number of measurements using Chromium for Android on aarch64,
comparing a browser with inlined checks (the baseline) against a
browser with outlined checks.
Code size: Size of .text decreases from 243897420 to 171619972 bytes,
or a 30% decrease.
Performance: Using Chromium's blink_perf.layout microbenchmarks I
measured a median performance regression of 6.24%.
The fact that a perf/size tradeoff is evident here suggests that
we might want to make the new behaviour conditional on -Os/-Oz.
But for now I've enabled it unconditionally, my reasoning being that
hwasan users typically expect a relatively large perf hit, and ~6%
isn't really adding much. We may want to revisit this decision in
the future, though.
I also tried experimenting with varying the number of registers
selectable by the hwasan check pseudo-instruction (which would result
in fewer variants being created), on the hypothesis that creating
fewer variants of the function would expose another perf/size tradeoff
by reducing icache pressure from the check functions at the cost of
register pressure. Although I did observe a code size increase with
fewer registers, I did not observe a strong correlation between the
number of registers and the performance of the resulting browser on the
microbenchmarks, so I conclude that we might as well use ~all registers
to get the maximum code size improvement. My results are below:
Regs | .text size | Perf hit
-----+------------+---------
~all | 171619972 | 6.24%
16 | 171765192 | 7.03%
8 | 172917788 | 5.82%
4 | 177054016 | 6.89%
Differential Revision: https://reviews.llvm.org/D56954
llvm-svn: 351920
It might be a bit nicer to use the fancy .legalIf and co. predicates,
but this was requiring more boilerplate and disables the coverage
assertions.
llvm-svn: 351886
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.
X86 uses i8, but seemed to be hacking around this before.
llvm-svn: 351882
The existing test unintentionally shows that we have prematurely
optimized the shuffle into a vector concat and lost the undef info,
so it is not affected by a basic improvement to
SimplifyDemandedVectorElts.
llvm-svn: 351834
vecbo (insertsubv undef, X, Z), (insertsubv undef, Y, Z) --> insertsubv VecC, (vecbo X, Y), Z
This is another step in generic vector narrowing. It's also a step towards more horizontal op
formation specifically for x86 (although we still failed to match those in the affected tests).
The scalarization cases are also not optimal (we should be scalarizing those), but it's still
an improvement to use a narrower vector op when we know part of the result must be constant
because both inputs are undef in some vector lanes.
I think a similar match but checking for a constant operand might help some of the cases in
D51553.
Differential Revision: https://reviews.llvm.org/D56875
llvm-svn: 351825
For constant bit select patterns, replace one AND with a ANDNP, allowing us to reuse the constant mask. Only do this if the mask has multiple uses (to avoid losing load folding) or if we have XOP as its VPCMOV can handle most folding commutations.
This also requires computeKnownBitsForTargetNode support for X86ISD::ANDNP and X86ISD::FOR to prevent regressions in fabs/fcopysign patterns.
Differential Revision: https://reviews.llvm.org/D55935
llvm-svn: 351819
Similar to horizontal ops on D56777, the sse2 (but not mmx) bit shift ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57026
llvm-svn: 351817
Similar to horizontal ops on D56777, the vpermilpd/vpermilps variable mask ops has local forwarding disabled, adding +1cy to the use latency for the result.
Differential Revision: https://reviews.llvm.org/D57022
llvm-svn: 351815
When we are inserting 1 "inline" element, and zeroing 2 of the other elements then we can safely commute the insertps source inputs to improve memory folding.
Differential Revision: https://reviews.llvm.org/D56843
llvm-svn: 351807
Avoid the infinite loop caused by the target DAG combine converting ANYEXT to
SIGNEXT and the target-independent DAG combine logic converting back to
ANYEXT. Do this by not adding the new node to the worklist.
Committing directly as this definitely doesn't make the problem any worse, and
I intend to follow-up with a patch that avoids this custom combiner logic
altogether and just lowers the i32 operations to a target-specific
SelectionDAG node. This should be easier to reason about and improve codegen
quality in some cases (though may miss out on some later DAG combines).
llvm-svn: 351806
This patch may seem familiar... but my previous patch handled the
equivalent lsls+and, not this case. Usually instcombine puts the
"and" after the shift, so this case doesn't come up. However, if the
shift comes out of a GEP, it won't get canonicalized by instcombine,
and DAGCombine doesn't have an equivalent transform.
This also modifies isDesirableToCommuteWithShift to suppress DAGCombine
transforms which would make the overall code worse.
I'm not really happy adding a bunch of code to handle this, but it would
probably be tricky to substantially improve the behavior of DAGCombine
here.
Differential Revision: https://reviews.llvm.org/D56032
llvm-svn: 351776
Not sure this is the best fix, but it saves an instruction for certain
constructs involving variable shifts.
Differential Revision: https://reviews.llvm.org/D55572
llvm-svn: 351768
Fixes two problems with GCNHazardRecognizer:
1. It only scans up to 5 instructions emitted earlier.
2. It does not take control flow into account. An earlier instruction
from the previous basic block is not necessarily a predecessor.
At the same time a real predecessor block is not scanned.
The patch provides a way to distinguish between scheduler and
hazard recognizer mode. It is OK to work with emitted instructions
in the scheduler because we do not really know what will be emitted
later and its order. However, when pass works as a hazard recognizer
the schedule is already finalized, and we have full access to the
instructions for the whole function, so we can properly traverse
predecessors and their instructions.
Differential Revision: https://reviews.llvm.org/D56923
llvm-svn: 351759
D56777 added +1cy local forwarding penalty for horizontal operations, but this penalty only affects sse2/xmm variants, the mmx variants don't suffer the penalty.
Confirmed with @andreadb
llvm-svn: 351755
These are copied from the sibling x86 file. I'm not sure which
of the current outputs (if any) is considered optimal, but
someone more familiar with AArch may want to take a look.
llvm-svn: 351754
The regression test is reduced from the example shown in D56281.
This does raise a question as noted in the test file: do we want
to handle this pattern? I don't have a motivating example for
that on x86 yet, but it seems like we could have that pattern
there too, so we could avoid the back-and-forth using a shuffle.
llvm-svn: 351753
r327630 introduced new write definitions for float/vector loads.
Before that revision, WriteLoad was used by both integer/float (scalar/vector)
load. So, WriteLoad had to conservatively declare a latency to 5cy. That is
because the load-to-use latency for float/vector load is 5cy.
Now that we have dedicated writes for float/vector loads, there is no reason why
we should keep the latency of WriteLoad to 5cy. At the moment, WriteLoad is only
used by scalar integer loads only; we can assume an optimstic 3cy latency for
them.
This patch changes that latency from 5cy to 3cy, and regenerates the affected
scheduling/mca tests.
Differential Revision: https://reviews.llvm.org/D56922
llvm-svn: 351742
This updates the AVR Select8/Select16 expansion code so that, when
inserting the two basic blocks for true and false conditions, any
existing fallthrough on the previous block is preserved.
Prior to this patch, if the block before the Select pseudo fell through
to the subsequent block, two new basic blocks would be inserted at the
prior fallthrough point, changing the fallthrough destination.
The predecessor or successor lists were not updated, causing the
BranchFolding pass at -O1 and above the rearrange basic blocks, causing
an infinite loop. Not to mention the unconditional fallthrough to the
true block is incorrect in of itself.
This patch modifies the Select8/16 expansion so that, if inserting true
and false basic blocks at a fallthrough point, the implicit branch is
preserved by means of an explicit, unconditional branch to the previous
fallthrough destination.
Thanks to Carl Peto for reporting this bug.
This fixes avr-rust bug https://github.com/avr-rust/rust/issues/123.
llvm-svn: 351721
This reverts commit r351718.
Carl pointed out that the unit test could be improved.
This patch will be recommitted once the test is made more resilient.
llvm-svn: 351719
This updates the AVR Select8/Select16 expansion code so that, when
inserting the two basic blocks for true and false conditions, any
existing fallthrough on the previous block is preserved.
Prior to this patch, if the block before the Select pseudo fell through
to the subsequent block, two new basic blocks would be inserted at the
prior fallthrough point, changing the fallthrough destination.
The predecessor or successor lists were not updated, causing the
BranchFolding pass at -O1 and above the rearrange basic blocks, causing
an infinite loop. Not to mention the unconditional fallthrough to the
true block is incorrect in of itself.
This patch modifies the Select8/16 expansion so that, if inserting true
and false basic blocks at a fallthrough point, the implicit branch is
preserved by means of an explicit, unconditional branch to the previous
fallthrough destination.
Thanks to Carl Peto for reporting this bug.
This fixes avr-rust bug https://github.com/avr-rust/rust/issues/123.
llvm-svn: 351718
There is a combine that was hiding these tests
not actually testing what they should be, although
they were producing the expected end result.
llvm-svn: 351698
This causes a couple of changes in the upgrade tests as signed/unsigned eq/ne are equivalent and we constant fold true/false codes, these changes are the same as what we already do for avx512 cmp/ucmp.
Noticed while cleaning up vector integer comparison costs for PR40376.
llvm-svn: 351697
This was crashing in the predicate function assuming the value
is a vector.
Copy more of what AArch64 uses. This probably needs more refinement
later, but I don't exactly understand what it means in some cases,
particularly since any legalization for these seems to be missing.
llvm-svn: 351693
We were upgrading these to the new style VPCOM/VPCOMU intrinsics (which includes the condition code immediate), but we'll be getting rid of those shortly, so convert these to generics first.
This causes a couple of changes in the upgrade tests as signed/unsigned eq/ne are equivalent and we constant fold true/false codes, these changes are the same as what we already do for avx512 cmp/ucmp.
Noticed while cleaning up vector integer comparison costs for PR40376.
llvm-svn: 351690
These intrinsics can always be replaced with generic integer comparisons without any regression in codegen, even for -O0/-fast-isel cases.
Noticed while cleaning up vector integer comparison costs for PR40376.
A future commit will remove/autoupgrade the existing VPCOM/VPCOMU llvm intrinsics.
llvm-svn: 351688
Prior to this patch, the AVR::LDWRdPtr instruction was always lowered to
instructions of this pattern:
ld $GPR8, [PTR:XYZ]+
ld $GPR8, [PTR]+1
This has a problem; the [PTR] is incremented in-place once, but never
decremented.
Future uses of the same pointer will use the now clobbered value,
leading to the pointer being incorrect by an offset of one.
This patch modifies the expansion code of the LDWRdPtr pseudo
instruction so that the pointer variable is not silently clobbered in
future uses in the same live range.
Bug first reported by Keshav Kini.
Patch by Kaushik Phatak.
llvm-svn: 351673
This reverts commit r351544.
In that commit, I had mistakenly misattributed the issue submitter as
the patch author, Kaushik Phatak.
The patch will be recommitted immediately with the correct attribution.
llvm-svn: 351672
The existing tests already show a sub-optimal transform,
but this should make it clear that we can't just match
an 'and' op when creating movmsk instructions.
llvm-svn: 351590
This sends these intrinsics through isel in a much more normal way. This should allow addressing mode matching in isel to make better use of the displacement field.
Differential Revision: https://reviews.llvm.org/D56827
llvm-svn: 351570
Prior to this patch, the AVR::LDWRdPtr instruction was always lowered to
instructions of this pattern:
ld $GPR8, [PTR:XYZ]+
ld $GPR8, [PTR]+1
This has a problem; the [PTR] is incremented in-place once, but never
decremented.
Future uses of the same pointer will use the now clobbered value,
leading to the pointer being incorrect by an offset of one.
This patch modifies the expansion code of the LDWRdPtr pseudo
instruction so that the pointer variable is not silently clobbered in
future uses in the same live range.
Patch by Keshav Kini.
llvm-svn: 351544
We should not pre-scheduled the node has ADJCALLSTACKDOWN parent,
or else, when bottom-up scheduling, ADJCALLSTACKDOWN and
ADJCALLSTACKUP may hold CallResource too long and make other
calls can't be scheduled. If there's no other available node
to schedule, the scheduler will try to rename the register by
creating copy to avoid the conflict which will fail because
CallResource is not a real physical register.
llvm-svn: 351527
This change modifies the LLVM ISel lowering settings so that
8-bit/16-bit multiplication is expanded to calls into the compiler
runtime library if the MCU being targeted does not support
multiplication in hardware.
Before this, MUL instructions would be generated on CPUs like the
ATtiny85, triggering a CPU reset due to an illegal instruction at
runtime.
First raised in https://github.com/avr-rust/rust/issues/124.
llvm-svn: 351523
The callee address is added as an optional operand (MCSymbol) in
AdjustInstrPostInstrSelection() and then used by asm printer to insert:
'.reloc tmplabel, R_MIPS_JALR, symbol
tmplabel:'.
Controlled with '-mips-jalr-reloc', default is true.
Differential revision: https://reviews.llvm.org/D56694
llvm-svn: 351485
EXPENSIVE_CHECKS buildbots are failing due to r351404.
Add x1 as live in to the funclet basic block for SEH funclets, as well as
-verify-machineinstrs to the test case that triggered the failure.
llvm-svn: 351472
Allow varargs functions to be called, both in arm and thumb mode. This
boils down to choosing the correct calling convention, which we can
easily test by making sure arm_aapcscc is used instead of
arm_aapcs_vfpcc when the callee is variadic.
llvm-svn: 351424
In order to support codegen RV64A, this patch:
* Introduces masked atomics intrinsics for atomicrmw operations and cmpxchg
that use the i64 type. These are ultimately lowered to masked operations
using lr.w/sc.w, but we need to use these alternate intrinsics for RV64
because i32 is not legal
* Modifies RISCVExpandPseudoInsts.cpp to handle PseudoAtomicLoadNand64 and
PseudoCmpXchg64
* Modifies the AtomicExpandPass hooks in RISCVTargetLowering to sext/trunc as
needed for RV64 and to select the i64 intrinsic IDs when necessary
* Adds appropriate patterns to RISCVInstrInfoA.td
* Updates test/CodeGen/RISCV/atomic-*.ll to show RV64A support
This ends up being a fairly mechanical change, as the logic for RV32A is
effectively reused.
Differential Revision: https://reviews.llvm.org/D53233
llvm-svn: 351422
There are cases where we have multiple epilogues that have the exact same unwind
code sequence. In that case, the epilogues can share the same unwind codes in
the .xdata section. This should get us past the assert "SEH unwind data
splitting not yet implemented" in many cases.
We still need to add support for generating multiple .pdata/.xdata sections for
those functions that need to be split into fragments.
Differential Revision: https://reviews.llvm.org/D56813
llvm-svn: 351421
Summary:
Everything before the word "version" is the tool, and everything after
the word "version" is the version.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, sunfish, llvm-commits
Differential Revision: https://reviews.llvm.org/D56742
llvm-svn: 351399
Previously we used ISD::SHL and ISD::SRL to represent these in SelectionDAG. ISD::SHL/SRL interpret an out of range shift amount as undefined behavior and will constant fold to undef. While the intrinsics are defined to return 0 for out of range shift amounts. A previous patch added a special node for VPSRAV to produce all sign bits.
This was previously believed safe because undefs frequently get turned into 0 either from the constant pool or a desire to not have a false register dependency. But undef is treated specially in some optimizations. For example, its ignored in detection of vector splats. So if the ISD::SHL/SRL can be constant folded and all of the elements with in bounds shift amounts are the same, we might fold it to single element broadcast from the constant pool. This would not put 0s in the elements with out of bounds shift amounts.
We do have an existing InstCombine optimization to use shl/lshr when the shift amounts are all constant and in bounds. That should prevent some loss of constant folding from this change.
Patch by zhutianyang and Craig Topper
Differential Revision: https://reviews.llvm.org/D56695
llvm-svn: 351381
Summary:
For these loads that write to the HI part of a register, we should chain them to the op that writes to the LO part
of the register to maintain the appropriate order.
Reviewers:
rampitec, arsenm
Differential Revision:
https://reviews.llvm.org/D56454
llvm-svn: 351379
If we're going to generate a new inverted setcc, we should make sure we will be able to remove the old setcc.
Differential Revision: https://reviews.llvm.org/D56765
llvm-svn: 351378
Summary:
This patch supports MS SEH extensions __try/__except/__finally. The intrinsics localescape and localrecover are responsible for communicating escaped static allocas from the try block to the handler.
We need to preserve frame pointers for SEH. So we create a new function/property HasLocalEscape.
Reviewers: rnk, compnerd, mstorsjo, TomTan, efriedma, ssijaric
Reviewed By: rnk, efriedma
Subscribers: smeenai, jrmuizel, alex, majnemer, ssijaric, ehsan, dmajor, kristina, javed.absar, kristof.beyls, chrib, llvm-commits
Differential Revision: https://reviews.llvm.org/D53540
llvm-svn: 351370
On Jaguar, horizontal adds/subs have local forwarding disable.
That means, we pay a compulsory extra cycle of write-back stage, and the value
is not available until the end of that stage.
This patch changes the latency of horizontal operations by adding an extra
cycle. With this patch, latency numbers now match what is reported by perf.
I plan to send another patch to also 'fix' the latency of shuffle operations (on
Jaguar, local forwarding is disabled for vector shuffles too).
Differential Revision: https://reviews.llvm.org/D56777
llvm-svn: 351366
I was trying to prevent shuffle regressions while matching more horizontal ops
and ended up here:
shuf (extract X, 0), (extract X, 4), Mask --> extract (shuf X, undef, Mask'), 0
The affected tests were added for:
https://bugs.llvm.org/show_bug.cgi?id=34380
This patch won't change the examples in the bug report itself, but we should be
able to extend this to catch more types.
Differential Revision: https://reviews.llvm.org/D56756
llvm-svn: 351346
This is LLVM part of D56663
Linker scripts shipped by TI require to have every
interrupt vector in a separate section with a specific name:
SECTIONS
{
__interrupt_vector_XX : { KEEP (*(__interrupt_vector_XX )) } > VECTXX
...
}
Follow the requirement emit the section for every vector
which contain address of interrupt handler:
.section __interrupt_vector_XX,"ax",@progbits
.word %isr%
Patch by Kristina Bessonova!
Differential Revision: https://reviews.llvm.org/D56664
llvm-svn: 351345
ReduceLoadWidth can trigger using a shifted mask is used and this
requires that the function return a shl node to correct for the
offset. However, the way that this was implemented meant that the
returned result could be an existing node, which would be incorrect.
This fixes the method of inserting the new node and replacing uses.
Differential Revision: https://reviews.llvm.org/D50432
llvm-svn: 351310
https://reviews.llvm.org/D52803
This patch adds support to continuously CSE instructions during
each of the GISel passes. It consists of a GISelCSEInfo analysis pass
that can be used by the CSEMIRBuilder.
llvm-svn: 351283
Summary:
Make recoverfp intrinsic target-independent so that it can be implemented for AArch64, etc.
Refer D53541 for the context. Clang counterpart D56748.
Reviewers: rnk, efriedma
Reviewed By: rnk, efriedma
Subscribers: javed.absar, kristof.beyls, llvm-commits
Differential Revision: https://reviews.llvm.org/D56747
llvm-svn: 351281
We're trying to have the vXi1 types in IR as much as possible. This prevents the need for bitcasts when the producer of the mask was already a vXi1 value like an icmp. The bitcasts can be subject to code motion and interfere with basic block at a time isel in bad ways.
llvm-svn: 351275
Summary:
We have seen performance regression when v_add3 is generated. The major reason is that the v_mad pattern
is broken when v_add3 is generated. We also see the register pressure increased. While we could not properly
estimate register pressure during instruction selection, we can give mad a higher priority.
In this work, we raise the priority for mad24 in selection and resolve the performance regression.
Reviewers:
rampitec
Differential Revision:
https://reviews.llvm.org/D56745
llvm-svn: 351273
In keeping with our general direction of having the vXi1 type present in IR, this patch converts the mask argument for avx512 gather to vXi1. This can avoid k-register to GPR to k-register transitions late in codegen.
I left the existing intrinsics behind because they have many out of tree users such as ISPC. They generate their own code and don't go through the autoupgrade path which only works for bitcode and ll parsing. Ideally we will get them to migrate to target independent intrinsics, but it might be easier for them to migrate to these new intrinsics.
I'll work on scatter and gatherpf/scatterpf next.
Differential Revision: https://reviews.llvm.org/D56527
llvm-svn: 351234
If we're shuffling with a zero vector, then we are better off not doing VECTOR_SHUFFLE(UNPCK()) as we lose track of those zero elements.
We were already doing this for SSSE3 targets as we have PSHUFB, but its worth doing for all targets.
llvm-svn: 351203
compiler identification lines in test-cases.
(Doing so only because it's then easier to search for references which
are actually important and need fixing.)
llvm-svn: 351200
The motivating case for this is shown in the first regression test. We are
transferring to scalar and back rather than just zero-extending with 'vpmovzxdq'.
That's a special-case for a more general pattern as shown here. In all tests,
we're avoiding the vector-scalar-vector moves in favor of vector ops.
We aren't producing optimal shuffle code in some cases though, so the patch is
limited to reduce regressions.
Differential Revision: https://reviews.llvm.org/D56281
llvm-svn: 351198
Summary:
V8 currently implements SIMD shifts as taking an immediate operation,
which disagrees with the spec proposal and the toolchain
implementation. As a stopgap measure to get things working, unroll all
vector shifts. Since this is a temporary measure, there are no tests.
Reviewers: aheejin, dschuff
Subscribers: sbc100, jgravelle-google, sunfish, dmgreen, llvm-commits
Differential Revision: https://reviews.llvm.org/D56520
llvm-svn: 351151
Summary:
This allows moving the condition from the intrinsic to the standard ICmp
opcode, so that LLVM can do simplifications on it. The icmp.i1 intrinsic
is an identity for retrieving the SGPR mask.
And we can also get the mask from and i1, or i1, xor i1.
Reviewers: arsenm, nhaehnle
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D52060
llvm-svn: 351150
Summary:
In r345197 ESP and RSP were added to GR32_TC/GR64_TC, allowing them to
be used for tail calls, but this also caused `findDeadCallerSavedReg` to
think they were acceptable targets for clobbering. Filter them out.
Fixes PR40289.
Patch by Geoffry Song!
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D56617
llvm-svn: 351146
A block ending in an unconditional branch can have two successors if one
is a landing pad. In practice, I think this only has an effect on
Windows because landing pads are never empty for Itanium unwinding.
(Alternatively, I could add a check to
AArch64InstrInfo::canInsertSelect, but this seems more obvious.)
Differential Revision: https://reviews.llvm.org/D56468
llvm-svn: 351142
Otherwise, with D56544, the intrinsic will be expanded to an integer
csel, which is probably not what the user expected. This matches the
general convention of using "v1" types to represent scalar integer
operations in vector registers.
While I'm here, also add some error checking so we don't generate
illegal ABS nodes.
Differential Revision: https://reviews.llvm.org/D56616
llvm-svn: 351141
This feature enables the fusion of some arithmetic and logic instructions
together.
Differential revision: https://reviews.llvm.org/D56572
llvm-svn: 351139