Despite the fact that the localizer's original motivation was to fix horrendous
constant spilling at -O0, shortening live ranges still has net benefits even
with optimizations enabled.
On an -Os build of CTMark, doing this improves code size by 0.5% geomean.
There are a few regressions, bullet increasing in size by 0.5%. One example from
bullet where code size increased slightly was due to GlobalISel actually now
generating the same code as SelectionDAG. So we actually have an opportunity
in future to implement better heuristics for localization and therefore be
*better* than SDAG in some cases. In relation to other optimizations though that
one is relatively minor.
Differential Revision: https://reviews.llvm.org/D67303
llvm-svn: 371266
Add the new method `LibCallSimplifier::substituteInParent()` that calls
`LibCallSimplifier::replaceAllUsesWith()' and
`LibCallSimplifier::eraseFromParent()` back to back, simplifying the
resulting code.
llvm-svn: 371264
Summary:
There's an unspoken invariant of callbr that the list of BlockAddress
Constants in the "function args" list match the BasicBlocks in the
"other labels" list. (This invariant is being added to the LangRef in
https://reviews.llvm.org/D67196).
When modifying the any of the indirect destinations of a callbr
instruction (possible jump targets), we need to update the function
arguments if the argument is a BlockAddress whose BasicBlock refers to
the indirect destination BasicBlock being replaced. Otherwise, many
transforms that modify successors will end up violating that invariant.
A recent change to the arm64 Linux kernel exposed this bug, which
prevents the kernel from booting.
I considered maintaining a mapping from indirect destination BasicBlock
to argument operand BlockAddress, but this ends up being a one to
potentially many (though usually one) mapping. Also, the list of
arguments to a function (or more typically inline assembly) ends up
being less than 10. The implementation is significantly simpler to just
rescan the full list of arguments. Because of the one to potentially
many relationship, the full arg list must be scanned (we can't stop at
the first instance).
Thanks to the following folks that reported the issue and helped debug
it:
* Nathan Chancellor
* Will Deacon
* Andrew Murray
* Craig Topper
Link: https://bugs.llvm.org/show_bug.cgi?id=43222
Link: https://github.com/ClangBuiltLinux/linux/issues/649
Link: https://lists.infradead.org/pipermail/linux-arm-kernel/2019-September/678330.html
Reviewers: craig.topper, chandlerc
Reviewed By: craig.topper
Subscribers: void, javed.absar, kristof.beyls, hiraditya, llvm-commits, nathanchance, srhines
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67252
llvm-svn: 371262
We can use a MOVSX16 here then rely on FixupBWInst to change to
MOVSX32 if the upper bits are dead. With a special case to
not promote if it could be turned into CBW.
Then we can rely on X86MCInstLower to turn the MOVSX into CBW
very late if register allocation worked out.
Using MOVSX gives an opportunity to use the MOVSX as a both a
copy and a sign extend since the input and output register aren't
tied together.
Differential Revision: https://reviews.llvm.org/D67192
llvm-svn: 371243
We can rely on X86FixupBWInsts to turn these into MOVZX32. This
simplifies a follow up commit to use MOVSX for i8 sdivrem with
a late optimization to use CBW when register allocation works out.
llvm-svn: 371242
In order to keep remarks around, we need to make them tied to a string
table.
Users then can delete the parser and rely on the string table to keep
the memory of the strings alive and deduplicated.
llvm-svn: 371233
-tailcallopt requires that we perform different stack adjustments than with
sibling calls. For example, the `@caller_to0_from8` function in
test/CodeGen/AArch64/tail-call.ll requires that we adjust SP. Without
-tailcallopt, this adjustment does not happen. With it, however, it is expected.
So, to ensure that adding sibling call support doesn't break -tailcallopt,
make CallLowering always fall back on possible tail calls when -tailcallopt
is passed in.
Update test/CodeGen/AArch64/tail-call.ll with a GlobalISel line to make sure
that we don't differ from the SDAG implementation at any point.
Differential Revision: https://reviews.llvm.org/D67245
llvm-svn: 371227
This patch sinks add/mul(shufflevector(insertelement())) into the basic block in which they are used so that they can then be selected together.
This is useful for various MVE instructions, such as vmla and others that take R registers.
Loop tests have been added to the vmla test file to make sure vmlas are generated in loops.
Differential revision: https://reviews.llvm.org/D66295
llvm-svn: 371218
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: nemanjai, javed.absar, hiraditya, kbarton, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, s.egerton, pzheng, ychen, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67267
llvm-svn: 371212
Summary:
This was discovered while introducing the llvm::Align type.
The original setMinFunctionAlignment used to take alignment as log2, looking at the comment it seems like instructions are to be 2-bytes aligned and not 4-bytes aligned.
Reviewers: uweigand
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67271
llvm-svn: 371204
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: jyknight, sdardis, nemanjai, javed.absar, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, s.egerton, pzheng, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67229
llvm-svn: 371200
This patch allows the DFAPacketizer to be queried after a packet is formed to work out which
resources were allocated to the packetized instructions.
This is particularly important for targets that do their own bundle packing - it's not
sufficient to know simply that instructions can share a packet; which slots are used is
also required for encoding.
This extends the emitter to emit a side-table containing resource usage diffs for each
state transition. The packetizer maintains a set of all possible resource states in its
current state. After packetization is complete, all remaining resource states are
possible packetization strategies.
The sidetable is only ~500K for Hexagon, but the extra tracking is disabled by default
(most uses of the packetizer like MachinePipeliner don't care and don't need the extra
maintained state).
Differential Revision: https://reviews.llvm.org/D66936
llvm-svn: 371198
If a stack spill location is overwritten by another spill instruction,
any variable locations pointing at that slot should be terminated. We
cannot rely on spills always being restored to registers or variable
locations being moved by a DBG_VALUE: the register allocator is entitled
to spill a value and then forget about it when it goes out of liveness.
To address this, scan for memory writes to spill locations, even those we
don't consider to be normal "spills". isSpillInstruction and
isLocationSpill distinguish the two now. After identifying spill
overwrites, terminate the open range, and insert a $noreg DBG_VALUE for
that variable.
Differential Revision: https://reviews.llvm.org/D66941
llvm-svn: 371193
Summary:
This fixes poor scheduling in a function containing a barrier and a few
load instructions.
Without this fix, ScheduleDAGInstrs::buildSchedGraph adds an artificial
edge in the dependency graph from the barrier instruction to the exit
node representing live-out latency, with a latency of about 500 cycles.
Because of this it thinks the critical path through the graph also has
a latency of about 500 cycles. And because of that it does not think
that any of the load instructions are on the critical path, so it
schedules them with no regard for their (80 cycle) latency, which gives
poor results.
Reviewers: arsenm, dstuttard, tpr, nhaehnle
Subscribers: kzhuravl, jvesely, wdng, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67218
llvm-svn: 371192
`struct Elf*_Shdr` has a field `sh_offset`, named `ShOffset` in
llvm::ELFYAML::Section. Rename SHOffset (e_shoff) to SHOff to prevent confusion.
Reviewed By: grimar
Differential Revision: https://reviews.llvm.org/D67254
llvm-svn: 371185
The MVE and LOB extensions of Armv8.1m can be combined to enable
'tail predication' which removes the need for a scalar remainder
loop after vectorization. Lane predication is performed implicitly
via a system register. The effects of predication is described in
Section B5.6.3 of the Armv8.1-m Arch Reference Manual, the key points
being:
- For vector operations that perform reduction across the vector and
produce a scalar result, whether the value is accumulated or not.
- For non-load instructions, the predicate flags determine if the
destination register byte is updated with the new value or if the
previous value is preserved.
- For vector store instructions, whether the store occurs or not.
- For vector load instructions, whether the value that is loaded or
whether zeros are written to that element of the destination
register.
This patch implements a pass that takes a hardware loop, containing
masked vector instructions, and converts it something that resembles
an MVE tail predicated loop. Currently, if we had code generation,
we'd generate a loop in which the VCTP would generate the predicate
and VPST would then setup the value of VPR.PO. The loads and stores
would be placed in VPT blocks so this is not tail predication, but
normal VPT predication with the predicate based upon a element
counting induction variable. Further work needs to be done to finally
produce a true tail predicated loop.
Because only the loads and stores are predicated, in both the LLVM IR
and MIR level, we will restrict support to only lane-wise operations
(no horizontal reductions). We will perform a final check on MIR
during loop finalisation too.
Another restriction, specific to MVE, is that all the vector
instructions need operate on the same number of elements. This is
because predication is performed at the byte level and this is set
on entry to the loop, or by the VCTP instead.
Differential Revision: https://reviews.llvm.org/D65884
llvm-svn: 371179
Summary:
Fix a bug of not update the jump table and recommit it again.
In `block-placement` pass, it will create some patterns for unconditional we can do the simple early retrun.
But the `early-ret` pass is before `block-placement`, we don't want to run it again.
This patch is to do the simple early return to optimize the blocks at the last of `block-placement`.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D63972
llvm-svn: 371177
Summary: It says [[ http://www.sco.com/developers/gabi/latest/ch4.eheader.html | here ]] that if there are no program headers than e_phoff should be 0, but currently it is always set after the header. GNU's `readelf` (but not `llvm-readelf`) complains about this: `readelf: Warning: possibly corrupt ELF header - it has a non-zero program header offset, but no program headers`.
Reviewers: jhenderson, grimar, MaskRay, rupprecht
Reviewed By: jhenderson, grimar, MaskRay
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67054
llvm-svn: 371162
Passing INT64_MIN to MCInstPrinter::formatHex triggers undefined
behavior because the negation of -9223372036854775808 cannot be
represented in type 'int64_t' (aka 'long long'). This patch puts a
workaround in place to just print the hex value directly.
A possible alternative involves using a small helper functions that uses
(implementation) defined conversions to achieve the desirable value:
static int64_t helper(int64_t V) {
auto U = static_cast<uint64_t>(V);
return V < 0 ? -U : U;
}
The underlying problem is that MCInstPrinter::formatHex(int64_t) returns
a format_object<int64_t> and should really return a
format_object<uint64_t>. However, that's not possible because formatImm
needs to be able to print both as decimal (where a signed is required)
and hex (where we'd prefer to always have an unsigned).
format_object<int64_t> formatImm(int64_t Value) const {
return PrintImmHex ? formatHex(Value) : formatDec(Value);
}
Differential revision: https://reviews.llvm.org/D67236
llvm-svn: 371159
See http://lists.llvm.org/pipermail/llvm-dev/2019-February/130583.html
and D60242 for the lld partition feature.
This patch:
* Teaches yaml2obj to parse the 3 section types.
* Teaches llvm-readobj/llvm-readelf to dump the 3 section types.
There is no test for SHT_LLVM_DEPENDENT_LIBRARIES in llvm-readobj. Add
it as well.
Reviewed By: thakis
Differential Revision: https://reviews.llvm.org/D67228
llvm-svn: 371157
The same stack is loaded for each workitem ID, and each use. Nothing
prevents you from creating multiple fixed stack objects with the same
offsets, so this was creating a load for each unique frame index,
despite them being the same offset. Re-use the same frame index so the
loads are CSEable.
llvm-svn: 371148
Summary:
Here we try to avoid issues with "explicit branch" with SimplifyBranchOnICmpChain
which can check on undef. Msan by design reports branches on uninitialized
memory and undefs, so we have false report here.
In general msan does not like when we convert
```
// If at least one of them is true we can MSAN is ok if another is undefs
if (a || b)
return;
```
into
```
// If 'a' is undef MSAN will complain even if 'b' is true
if (a)
return;
if (b)
return;
```
Example
Before optimization we had something like this:
```
while (true) {
bool maybe_undef = doStuff();
while (true) {
char c = getChar();
if (c != 10 && c != 13)
continue
break;
}
// we know that c == 10 || c == 13 if we get here,
// so msan know that branch is not affected by maybe_undef
if (maybe_undef || c == 10 || c == 13)
continue;
return;
}
```
SimplifyBranchOnICmpChain will convert that into
```
while (true) {
bool maybe_undef = doStuff();
while (true) {
char c = getChar();
if (c != 10 && c != 13)
continue;
break;
}
// however msan will complain here:
if (maybe_undef)
continue;
// we know that c == 10 || c == 13, so either way we will get continue
switch(c) {
case 10: continue;
case 13: continue;
}
return;
}
```
Reviewers: eugenis, efriedma
Reviewed By: eugenis, efriedma
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67205
llvm-svn: 371138
Approximately 30% of the time was spent in the std::vector
constructor. In one testcase this pushes the scheduler to being the
second slowest pass.
I'm not sure I understand why these vector are necessary. The default
scheduler initCandidate seems to use some pre-existing vectors for the
pressure.
llvm-svn: 371136
This patch reuses the MIR vreg renamer from the MIRCanonicalizerPass to cleanup
names of vregs in a MIR file for MIR test authors. I found it useful when
writing a regression test for a globalisel failure I encountered recently and
thought it might be useful for other folks as well.
Differential Revision: https://reviews.llvm.org/D67209
llvm-svn: 371121
Now that we look through copies, it's possible to visit registers that
have a register class constraint but not a type constraint. Avoid looking
through copies when this occurs as the SrcReg won't be able to determine
it's bit width or any known bits.
Along the same lines, if the initial query is on a register that doesn't
have a type constraint then the result is a default-constructed KnownBits,
that is, a 1-bit fully-unknown value.
llvm-svn: 371116
Recommit basic sibling call lowering (https://reviews.llvm.org/D67189)
The issue was that if you have a return type other than void, call lowering
will emit COPYs to get the return value after the call.
Disallow sibling calls other than ones that return void for now. Also
proactively disable swifterror tail calls for now, since there's a similar issue
with COPYs there.
Update call-translator-tail-call.ll to include test cases for each of these
things.
llvm-svn: 371114
The code was incorrectly counting the number of identical instructions,
and therefore tried to predicate an instruction which should not have
been predicated. This could have various effects: a compiler crash,
an assembler failure, a miscompile, or just generating an extra,
unnecessary instruction.
Instead of depending on TargetInstrInfo::removeBranch, which only
works on analyzable branches, just remove all branch instructions.
Fixes https://bugs.llvm.org/show_bug.cgi?id=43121 and
https://bugs.llvm.org/show_bug.cgi?id=41121 .
Differential Revision: https://reviews.llvm.org/D67203
llvm-svn: 371111
As noted in PR43197, we can use test+add+cmov+sra to implement
signed division by a power of 2.
This is based off the similar version in AArch64, but I've
adjusted it to use target independent nodes where AArch64 uses
target specific CMP and CSEL nodes. I've also blocked INT_MIN
as the transform isn't valid for that.
I've limited this to i32 and i64 on 64-bit targets for now and only
when CMOV is supported. i8 and i16 need further investigation to be
sure they get promoted to i32 well.
I adjusted a few tests to enable cmov to demonstrate the new
codegen. I also changed twoaddr-coalesce-3.ll to 32-bit mode
without cmov to avoid perturbing the scenario that is being
set up there.
Differential Revision: https://reviews.llvm.org/D67087
llvm-svn: 371104
A follow-up for r329011.
This may be changed to produce @llvm.sub.with.overflow in a later patch,
but for now just make things more consistent overall.
A few observations stem from this:
* There does not seem to be a similar one-instruction fold for uadd-overflow
* I'm not sure we'll want to canonicalize `B u> A` as `usub.with.overflow`,
so since the `icmp` here no longer refers to `sub`,
reconstructing `usub.with.overflow` will be problematic,
and will likely require standalone pass (similar to DivRemPairs).
https://rise4fun.com/Alive/Zqs
Name: (A - B) u> A --> B u> A
%t0 = sub i8 %A, %B
%r = icmp ugt i8 %t0, %A
=>
%r = icmp ugt i8 %B, %A
Name: (A - B) u<= A --> B u<= A
%t0 = sub i8 %A, %B
%r = icmp ule i8 %t0, %A
=>
%r = icmp ule i8 %B, %A
Name: C u< (C - D) --> C u< D
%t0 = sub i8 %C, %D
%r = icmp ult i8 %C, %t0
=>
%r = icmp ult i8 %C, %D
Name: C u>= (C - D) --> C u>= D
%t0 = sub i8 %C, %D
%r = icmp uge i8 %C, %t0
=>
%r = icmp uge i8 %C, %D
llvm-svn: 371101
If we have:
bb5:
br i1 %arg3, label %bb6, label %bb7
bb6:
%tmp = getelementptr inbounds i32, i32* %arg1, i64 2
store i32 3, i32* %tmp, align 4
br label %bb9
bb7:
%tmp8 = getelementptr inbounds i32, i32* %arg1, i64 2
store i32 3, i32* %tmp8, align 4
br label %bb9
bb9: ; preds = %bb4, %bb6, %bb7
...
We can't sink stores directly into bb9.
This patch creates new BB that is successor of %bb6 and %bb7
and sinks stores into that block.
SplitFooterBB is the parameter to the pass that controls
that behavior.
Change-Id: I7fdf50a772b84633e4b1b860e905bf7e3e29940f
Differential: https://reviews.llvm.org/D66234
llvm-svn: 371089
Summary:
Avoid visiting an instruction more than once by using a map.
This is similar to https://reviews.llvm.org/rL361416.
Reviewers: davidxl
Reviewed By: davidxl
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67198
llvm-svn: 371086
A number of inline assembly constraints are currently supported by LLVM, but rejected as invalid by Clang:
Target independent constraints:
s: An integer constant, but allowing only relocatable values
ARM specific constraints:
j: An immediate integer between 0 and 65535 (valid for MOVW)
x: A 32, 64, or 128-bit floating-point/SIMD register: s0-s15, d0-d7, or q0-q3
N: An immediate integer between 0 and 31 (Thumb1 only)
O: An immediate integer which is a multiple of 4 between -508 and 508. (Thumb1 only)
This patch adds support to Clang for the missing constraints along with some checks to ensure that the constraints are used with the correct target and Thumb mode, and that immediates are within valid ranges (at least where possible). The constraints are already implemented in LLVM, but just a couple of minor corrections to checks (V8M Baseline includes MOVW so should work with 'j', 'N' and 'O' shouldn't be valid in Thumb2) so that Clang and LLVM are in line with each other and the documentation.
Differential Revision: https://reviews.llvm.org/D65863
Change-Id: I18076619e319bac35fbb60f590c069145c9d9a0a
llvm-svn: 371079
As discussed on D64551 and PR43227, we don't correctly handle cases where the base load has a non-zero byte offset.
Until we can properly handle this, we must bail from EltsFromConsecutiveLoads.
llvm-svn: 371078
Linkers (ld.bfd/gold/lld) place the section header table at the very
end. This allows tools to strip it, which is optional in executable/shared objects.
In addition, if we add or section, the size of the section header table
will change. Placing the section header table in the end keeps section
offsets unchanged.
yaml2obj currently places the section header table immediately after the
program header. Follow what linkers do to make offset updating easier.
Reviewed By: grimar
Differential Revision: https://reviews.llvm.org/D67221
llvm-svn: 371074
This attempts to just fix the creation of VPT blocks, fixing up the iterating,
which instructions are considered in the bundle, and making sure that we do not
overrun the end of the block.
Differential Revision: https://reviews.llvm.org/D67219
llvm-svn: 371064
G_FENCE comes form fence instruction. For MIPS fence is generated in
AtomicExpandPass when atomic instruction gets surrounded with fence
instruction when needed.
G_FENCE arguments don't have LLT, because of that there is no job for
legalizer and regbankselect. Instruction select G_FENCE for MIPS32.
Differential Revision: https://reviews.llvm.org/D67181
llvm-svn: 371056
Select G_INTRINSIC_W_SIDE_EFFECTS for Intrinsic::trap for MIPS32
via legalizeIntrinsic.
Differential Revision: https://reviews.llvm.org/D67180
llvm-svn: 371055
Instead of returning structure by value clang usually adds pointer
to that structure as an argument. Pointers don't require special
handling no matter the SRet flag. Remove unsuccessful exit from
lowerCall for arguments with SRet flag if they are pointers.
Differential Revision: https://reviews.llvm.org/D67179
llvm-svn: 371054
This adds support for basic sibling call lowering in AArch64. The intent here is
to only handle tail calls which do not change the ABI (hence, sibling calls.)
At this point, it is very restricted. It does not handle
- Vararg calls.
- Calls with outgoing arguments.
- Calls whose calling conventions differ from the caller's calling convention.
- Tail/sibling calls with BTI enabled.
This patch adds
- `AArch64CallLowering::isEligibleForTailCallOptimization`, which is equivalent
to the same function in AArch64ISelLowering.cpp (albeit with the restrictions
above.)
- `mayTailCallThisCC` and `canGuaranteeTCO`, which are identical to those in
AArch64ISelLowering.cpp.
- `getCallOpcode`, which is exactly what it sounds like.
Tail/sibling calls are lowered by checking if they pass target-independent tail
call positioning checks, and checking if they satisfy
`isEligibleForTailCallOptimization`. If they do, then a tail call instruction is
emitted instead of a normal call. If we have a sibling call (which is always the
case in this patch), then we do not emit any stack adjustment operations. When
we go to lower a return, we check if we've already emitted a tail call. If so,
then we skip the return lowering.
For testing, this patch
- Adds call-translator-tail-call.ll to test which tail calls we currently lower,
which ones we don't, and which ones we shouldn't.
- Updates branch-target-enforcement-indirect-calls.ll to show that we fall back
as expected.
Differential Revision: https://reviews.llvm.org/D67189
........
This fails on EXPENSIVE_CHECKS builds due to a -verify-machineinstrs test failure in CodeGen/AArch64/dllimport.ll
llvm-svn: 371051
Handle the remaining cases also by handling asm goto in
SystemZInstrInfo::getBranchInfo().
Review: Ulrich Weigand
https://reviews.llvm.org/D67151
llvm-svn: 371048
Fixes clang static-analyzer warning.
Technically the MachineInstr *Sub might still be null if we're comparing zero (IsCmpZero == true), although this probably won't happen as SrcReg2 is probably == 0.
llvm-svn: 371047
Summary:
This patch renames functions that takes or returns alignment as log2, this patch will help with the transition to llvm::Align.
The renaming makes it explicit that we deal with log(alignment) instead of a power of two alignment.
A few renames uncovered dubious assignments:
- `MirParser`/`MirPrinter` was expecting powers of two but `MachineFunction` and `MachineBasicBlock` were using deal with log2(align). This patch fixes it and updates the documentation.
- `MachineBlockPlacement` exposes two flags (`align-all-blocks` and `align-all-nofallthru-blocks`) supposedly interpreted as power of two alignments, internally these values are interpreted as log2(align). This patch updates the documentation,
- `MachineFunctionexposes` exposes `align-all-functions` also interpreted as power of two alignment, internally this value is interpreted as log2(align). This patch updates the documentation,
Reviewers: lattner, thegameg, courbet
Subscribers: dschuff, arsenm, jyknight, dylanmckay, sdardis, nemanjai, jvesely, nhaehnle, javed.absar, hiraditya, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, dexonsmith, PkmX, jocewei, jsji, Jim, s.egerton, llvm-commits, courbet
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65945
llvm-svn: 371045
-ftime-trace could break flame-graph assumptions on Windows, with an
inner scope overrunning outer scopes. This was due to the way that times
were truncated. Changed this so time_points for the flame-graph are
truncated instead of durations, preserving the relative order of event
starts and ends.
I have tried to retain the extra precision for the totals, which count
thousands or millions of events.
Added assert to check this property holds in future.
Fixes PR43043
Differential Revision: https://reviews.llvm.org/D66411
llvm-svn: 371039
After r361885, realPathFromHandle() ends up getting called on the working
directory on each Clang invocation. This unveiled that the code didn't work for
paths on network shares.
For example, if one maps the local dir c:\src\tmp to x:
net use x: \\localhost\c$\tmp
and run e.g. "clang -c foo.cc" in x:\, realPathFromHandle will get
\\?\UNC\localhost\c$\src\tmp\ back from GetFinalPathNameByHandleW, and would
strip off the initial \\?\ prefix, ending up with a path that doesn't work.
This patch makes the prefix stripping a little smarter to handle this case.
Differential revision: https://reviews.llvm.org/D67166
llvm-svn: 371035
In D62809 I accidentally added "ELFState<ELFT> &State" as the
first parameter to two methods. There is no reason for having that.
I removed this argument and also moved finalizeStrings declaration to
remove an excessive 'private:' tag.
Differential revision: https://reviews.llvm.org/D67157
llvm-svn: 371033
Fix: added missing return "return 0;"
Original commit message:
This eliminates one of the error(1) call in this lib.
It is different from the others because happens on a fields mapping stage
and can be easily fixed.
Differential revision: https://reviews.llvm.org/D67150
llvm-svn: 371030
This eliminates one of the error(1) call in this lib.
It is different from the others because happens on a fields mapping stage
and can be easily fixed.
Differential revision: https://reviews.llvm.org/D67150
llvm-svn: 371023
As DW_AT_rnglists_base points after the header and headers have
different sizes for DWARF32 and DWARF64, we have to use the format
of the CU to adjust the offset correctly in order to extract
the referenced range list table.
The patch also changes the type of RangeSectionBase because in DWARF64
it is 8-bytes long.
Differential Revision: https://reviews.llvm.org/D67098
llvm-svn: 371016
This adds support for basic sibling call lowering in AArch64. The intent here is
to only handle tail calls which do not change the ABI (hence, sibling calls.)
At this point, it is very restricted. It does not handle
- Vararg calls.
- Calls with outgoing arguments.
- Calls whose calling conventions differ from the caller's calling convention.
- Tail/sibling calls with BTI enabled.
This patch adds
- `AArch64CallLowering::isEligibleForTailCallOptimization`, which is equivalent
to the same function in AArch64ISelLowering.cpp (albeit with the restrictions
above.)
- `mayTailCallThisCC` and `canGuaranteeTCO`, which are identical to those in
AArch64ISelLowering.cpp.
- `getCallOpcode`, which is exactly what it sounds like.
Tail/sibling calls are lowered by checking if they pass target-independent tail
call positioning checks, and checking if they satisfy
`isEligibleForTailCallOptimization`. If they do, then a tail call instruction is
emitted instead of a normal call. If we have a sibling call (which is always the
case in this patch), then we do not emit any stack adjustment operations. When
we go to lower a return, we check if we've already emitted a tail call. If so,
then we skip the return lowering.
For testing, this patch
- Adds call-translator-tail-call.ll to test which tail calls we currently lower,
which ones we don't, and which ones we shouldn't.
- Updates branch-target-enforcement-indirect-calls.ll to show that we fall back
as expected.
Differential Revision: https://reviews.llvm.org/D67189
llvm-svn: 370996
Moving MIRCanonicalizerPass vreg renaming code to MIRVRegNamerUtils so that it
can be reused in another pass (ie planing to write a standalone mir-namer pass).
I'm going to write a mir-namer pass so that next time someone has to author a
test in MIR, they can use it to cleanup the naming and make it more readable by
having the numbered vregs swapped out with named vregs.
Differential Revision: https://reviews.llvm.org/D67114
llvm-svn: 370985
In mingw environments, resources are normally compiled to resource
object files directly, instead of letting the linker convert them to
COFF format.
Since some time, GCC supports the notion of a default manifest object.
When invoking the linker, GCC looks for the default manifest object
file, and if found in the expected path, it is added to linker commands.
The default manifest is one that indicates support for the latest known
versions of windows, to implicitly unlock the modern behaviours of certain
APIs.
Not all mingw/gcc distributions include this file, but e.g. in msys2,
the default manifest object is distributed in a separate package (which
can be but might not always be installed).
This means that even if user projects only use one single resource
object file, the linker can end up with two resource object files,
and thus needs to support merging them.
The default manifest has a language id of zero, and GNU ld has got
logic for dropping a manifest with a zero language id, if there's
another manifest present with a nonzero language id. If there are
multiple manifests with a nonzero language id, the merging process
errors out.
Differential Revision: https://reviews.llvm.org/D66825
llvm-svn: 370974
This patch merges the sancov module and funciton passes into one module pass.
The reason for this is because we ran into an out of memory error when
attempting to run asan fuzzer on some protobufs (pc.cc files). I traced the OOM
error to the destructor of SanitizerCoverage where we only call
appendTo[Compiler]Used which calls appendToUsedList. I'm not sure where precisely
in appendToUsedList causes the OOM, but I am able to confirm that it's calling
this function *repeatedly* that causes the OOM. (I hacked sancov a bit such that
I can still create and destroy a new sancov on every function run, but only call
appendToUsedList after all functions in the module have finished. This passes, but
when I make it such that appendToUsedList is called on every sancov destruction,
we hit OOM.)
I don't think the OOM is from just adding to the SmallSet and SmallVector inside
appendToUsedList since in either case for a given module, they'll have the same
max size. I suspect that when the existing llvm.compiler.used global is erased,
the memory behind it isn't freed. I could be wrong on this though.
This patch works around the OOM issue by just calling appendToUsedList at the
end of every module run instead of function run. The same amount of constants
still get added to llvm.compiler.used, abd we make the pass usage and logic
simpler by not having any inter-pass dependencies.
Differential Revision: https://reviews.llvm.org/D66988
llvm-svn: 370971
This patch adds the ability to encode and decode InlineInfo objects and adds test coverage. Error handling is introduced in the encoding and decoding which will be used from here on out for remaining patches.
Differential Revision: https://reviews.llvm.org/D66600
llvm-svn: 370936
Since an add instruction must produce an unused carry out, this
requires additional SGPRs. This can be avoided by keeping the entire
offset computation in SGPRs. If one SGPR is still available, this only
costs one extra mov. If none are available, the entire computation can
be done in place and reversed.
This does assume the use is a VGPR operand. This was already assumed,
and we currently only select frame indexes to VALU instructions. This
should probably be fixed at some point to handle more possible MIR.
llvm-svn: 370929
Summary:
Instead of building attributes for internal functions which we do not
update as long as we assume they are dead, we now do not create
attributes until we assume the internal function to be live. This
improves the number of required iterations, as well as the number of
required updates, in real code. On our tests, the results are mixed.
Reviewers: sstefan1, uenoku
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66914
llvm-svn: 370924
Summary:
We create attributes on-demand so we need to check the white list
on-demand. This also unifies the location at which we create,
initialize, and eventually invalidate new abstract attributes.
The tests show mixed results, a few more call site attributes are
determined which can cause more iterations.
Reviewers: uenoku, sstefan1
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66913
llvm-svn: 370922
Summary:
Before we tried to rule out non-exact definitions early but that lead to
on-demand attributes created for them anyway. As a consequence we needed
to look at the definition in the initialize of each attribute again.
This patch centralized this lookup and tightens the condition under
which we give up on non-exact definitions.
Reviewers: uenoku, sstefan1
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67115
llvm-svn: 370917
SROA pass processes debug info incorrecly if applied twice.
Specifically, after SROA works first time, instcombine converts dbg.declare
intrinsics into dbg.value. Inlining creates new opportunities for SROA,
so it is called again. This time it does not handle correctly previously
inserted dbg.value intrinsics.
Differential Revision: https://reviews.llvm.org/D64595
llvm-svn: 370906
Apologies, due to a git SNAFU this fix (dump doesn't exist and silence unused variables) stayed in my index rather than applying to rL370893.
llvm-svn: 370894
This is the beginnings of a reimplementation of ModuloScheduleExpander. It works
by generating a single-block correct pipelined kernel and then peeling out the
prolog and epilogs.
This patch implements kernel generation as well as a validator that will
confirm the number of phis added is the same as the ModuloScheduleExpander.
Prolog and epilog peeling will come in a different patch.
Differential Revision: https://reviews.llvm.org/D67081
llvm-svn: 370893
When comparing variable locations, LiveDebugValues currently considers only
the machine location, ignoring any DIExpression applied to it. This is a
problem because that DIExpression can do pretty much anything to the machine
location, for example dereferencing it.
This patch adds DIExpressions to that comparison; now variables based on the
same register/memory-location but with different expressions will compare
differently, and be dropped if we attempt to merge them between blocks. This
reduces variable coverage-range a little, but only because we were producing
broken locations.
Differential Revision: https://reviews.llvm.org/D66942
llvm-svn: 370877
On release builds, 'MI' isn't used by anything (it's already inserted into a
block by BuildMI), while on non-release builds it's used by a LLVM_DEBUG
statement. Mark as explicitly used to avoid the warning.
llvm-svn: 370870
Summary:
While fixing the handling of some error cases, r370363 introduced new
problems -- assertion failures due to unchecked errors (my excuse is that a very
early version of that patch used Optional<T> instead of Expected).
This patch adds proper handling of parsing errors encountered when
dumping location lists from inside DWARF DIEs, and adds a bunch of
additional tests.
I reorder the arguments of the location list dumping functions to make
them consistent, and also be able to dump the two kinds of location
lists generically.
Reviewers: JDevlieghere, dblaikie, probinson
Subscribers: aprantl, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67102
llvm-svn: 370868
PT_GNU_STACK is used in an llvm-objcopy test.
I plan to use PT_GNU_RELRO in a patch to improve nested segment
processing in llvm-objcopy (PR42963).
Reviewed By: grimar
Differential Revision: https://reviews.llvm.org/D67146
llvm-svn: 370857
For any unpaired muls, we accumulate them as an input to the
reduction. Check the type of the mul and perform a sext if the
existing accumlator input type is not the same.
Differential Revision: https://reviews.llvm.org/D66993
llvm-svn: 370851
Summary: Previously module pass printer pass prints the banner even when the module doesn't include any function provided with `-filter-print-funcs` option. This introduced a lot of noise, especailly with ThinLTO. This diff addresses the issue and makes the banner printed only when the module includes functions in `-filter-print-funcs` list.
Reviewers: fedor.sergeev
Subscribers: mehdi_amini, hiraditya, dexonsmith, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66560
llvm-svn: 370849
Similar to the issue with G_ZEXT that was fixed earlier, this is a quick
to fall back if the source type is not exactly half of the dest type.
Fixes the clang-cmake-aarch64-lld bot build.
llvm-svn: 370847
This reverts r370525 (git commit 0bb1630685)
Also reverts r370543 (git commit 185ddc08ee)
The approach I took only works for functions marked `noreturn`. In
general, a call that is not known to be noreturn may be followed by
unreachable for other reasons. For example, there could be multiple call
sites to a function that throws sometimes, and at some call sites, it is
known to always throw, so it is followed by unreachable. We need to
insert an `int3` in these cases to pacify the Windows unwinder.
I think this probably deserves its own standalone, Win64-only fixup pass
that runs after block placement. Implementing that will take some time,
so let's revert to TrapUnreachable in the mean time.
llvm-svn: 370829
Summary:
This removes all string constants for function names and compares
functions by string directly when needed. Many of these constants are
used only once or twice so the benefit of defining them separately is
not very clear, and this actually fixes a bug.
When we already have a `malloc` declaration which is an alias to
something else within the module,
```
@malloc = weak hidden alias i8* (i32), i8* (i32)* @dlmalloc
```
(this happens compiling with emscripten with `-s WASM_OBJECT_FILES=0`
because all bc files are merged before being fed into `wasm-ld` which
runs the backend optimizations as LTO)
`Module::getFunction("malloc")` in `canLongjmp` returns `nullptr`
because `Module::getFunction` dyncasts pointer into `Function`, but the
alias is a `GlobalValue` but not a `Function`. This makes `canLongjmp`
return false for `malloc` in this case, and we end up adding a lot of
longjmp handling code around malloc. This is not only a code size
increase but actually a bug because `malloc` is used in the entry block
when preparing for setjmp tables for emscripten sjlj handling, and this
makes initial setjmp preparation, which has to happen in the entry
block, move to another split block, and this interferes with SSA update
later.
This also adds two more functions, `getTempRet0` and `setTempRet0`, in
the list of not longjmp-able functions.
Fixes https://github.com/emscripten-core/emscripten/issues/8935.
Reviewers: sbc100
Subscribers: mehdi_amini, jgravelle-google, hiraditya, sunfish, dexonsmith, dschuff, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67129
llvm-svn: 370828
The check needs to validate a counter offset before performing pointer
arithmetic with the (potentially corrupt) offset.
Found by UBSan's pointer overflow check.
rdar://54843625
Differential Revision: https://reviews.llvm.org/D66979
llvm-svn: 370826
When I dug into this, it turns out to be *much* more involved than I'd realized and doesn't actually simplify anything.
The general purpose of the leader table is that we want to find the most-dominating definition quickly. The problem for equivalance folding is slightly different; we want to find the most dominating *value* whose definition block dominates our use quickly.
To make this change, we'd end up having to restructure the leader table (either the sorting thereof, or maybe even introducing multiple leader tables per value) and that complexity is just not worth it.
llvm-svn: 370824
Now that we have the infrastructure to support s128 types as parameters
we can expand these to libcalls.
Differential Revision: https://reviews.llvm.org/D66185
llvm-svn: 370823
On AArch64, s128 types have to be split into s64 GPRs when passed as arguments.
This change adds the generic support in call lowering for dealing with multiple
registers, for incoming and outgoing args.
Support for splitting for return types not yet implemented.
Differential Revision: https://reviews.llvm.org/D66180
llvm-svn: 370822
Add the no-capture argument attribute deduction to the Attributor
fixpoint framework.
The new string attributed "no-capture-maybe-returned" is introduced to
allow deduction of no-capture through functions that "capture" an
argument but only by "returning" it. It is only used by the Attributor
for testing.
Differential Revision: https://reviews.llvm.org/D59922
llvm-svn: 370817
Summary:
Simplify the right shift of the intermediate result (given
in four parts) by using funnel shift.
There are some impact on lit tests, but that seems to be
related to register allocation differences due to how FSHR
is expanded on X86 (giving a slightly different operand order
for the OR operations compared to the old code).
Reviewers: leonardchan, RKSimon, spatel, lebedev.ri
Reviewed By: RKSimon
Subscribers: hiraditya, asb, rbar, johnrusso, simoncook, apazos, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, s.egerton, pzheng, bevinh, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67036
llvm-svn: 370813
These flags should simply be passed through to the target, which will do
the right thing. Add an MC/X86 test that uses these directives with the
three primary object file formats and shows that they disassemble the
same everywhere.
There is a missing test for .code32 on Windows ARM, since I'm not sure
exactly how to construct one.
Fixes PR43203
llvm-svn: 370805
This extends the existing logic for propagating constant expressions in an analogous manner for what we do across basic blocks. The core point is that we chose some order of operands, and canonicalize uses towards that one.
The heuristic used is inspired by the one used across blocks; in a follow up change, I'd plan to common them so that the cross block version uses the slightly stronger ordering herein.
As noted by the TODOs in the code, there's a good amount of room for improving the existing code and making it more powerful. Some follow up work planned.
Differential Revision: https://reviews.llvm.org/D66977
llvm-svn: 370791
This pattern, when imported at -O0 adds an extra copy via the SUBREG_TO_REG.
This is because the SUBREG_TO_REG is not eliminated. At all other opt levels,
it is eliminated.
This is a 1% geomean code size savings at -O0 on CTMark.
Differential Revision: https://reviews.llvm.org/D67027
llvm-svn: 370789
The code here seems to date back to r134705, when tablegen lowering was first
being added. I don't believe that we need to include CPSR implicit operands on
the MCInst. This now works more like other backends (like AArch64), where all
implicit registers are skipped.
This allows the AliasInst for CSEL's to match correctly, as can be seen in the
test changes.
Differential revision: https://reviews.llvm.org/D66703
llvm-svn: 370745
This moves ConstantMaterializationCost into ARMBaseInstrInfo so that it can
also be used in ISel Lowering, adding codesize values to the computed costs, to
be able to compare either approximate instruction counts or codesize costs.
It also adds a HasLowerConstantMaterializationCost, which compares the
ConstantMaterializationCost of two values, returning true if the first is
smaller either in instruction count/codesize, or falling back to the other in
the case that they are equal.
This is used in constant CSEL lowering to invert the predicate if the opposite
is easier to materialise.
Differential revision: https://reviews.llvm.org/D66701
llvm-svn: 370741
Arm 8.1-M adds a number of related CSEL instructions, including CSINC, CSNEG and CSINV. These choose between two values given the content in CPSR and a condition, performing an increment, negation or inverse of the false value.
This adds some selection for them, either from constant values or patterns. It does not include CSEL directly, which is currently not always making code better. It is still useful, but we will have to check more carefully where it should and shouldn't be used.
Code by Ranjeet Singh and Simon Tatham, with some modifications from me.
Differential revision: https://reviews.llvm.org/D66483
llvm-svn: 370739
Now the last `.section` directive in the MIPS asm file preamble
is the `.section .mdebug.abi`. If assembler code injected for example
by the LLVM `module asm` or the C ` __asm` directives do not contain
explicit switching to the `.text` section it goes to the `.mdebug.abi`
section. It might be unexpected to the user and in fact for example
breaks building some existing code like FreeBSD libc [1].
The patch forces switching to the `.text` section after emitting MIPS
assembler file preamble.
[1] https://bugs.llvm.org/show_bug.cgi?id=43119
Fix PR43119.
Differential Revision: https://reviews.llvm.org/D67014
llvm-svn: 370735
We were using isShiftedInt<7, Shift>(RHSC) to detect the ranges of offsets to
fold into MVE loads/stores. The instructions actually take a 7 bit unsigned
integer which is either added or subtracted. So something more like
isShiftedUInt<7, Shift>(abs(RHSC)).
Instead I've changes this to use the isScaledConstantInRange method, same as in
SelectT2AddrModeImm7Offset used by pre/post inc, which seemed to already be
getting this correct.
Differential revision: https://reviews.llvm.org/D66997
llvm-svn: 370731
Decoding of VMSR doesn't diagnose some unpredictable encodings, as the unpredictable bits are not correctly set.
Diff-reduce this instruction's internals WRT VMRS so I can see the differences better. Mostly this is s/src/Rt/g.
Fill in the "should-be-(0)" bits.
Designate the Unpredictable{} bits for both VMRS and VMSR.
Patch by Mark Murray!
Differential revision: https://reviews.llvm.org/D66938
llvm-svn: 370729
To save a 'add sp,#val' instruction by adding registers to the final pop instruction,
the first register transferred by this pop instruction need to be found.
If the function to be optimized has a non-void return value, the operand list contains
r0 (implicit) which prevents the optimization to take place.
Therefore implicit register references should be skipped in the search loop,
because this registers are never popped from the stack.
Patch by Rainer Herbertz (rOptimizer)!
Differential revision: https://reviews.llvm.org/D66730
llvm-svn: 370728
Summary:
Fold-tail currently supports reduction last-vector-value live-out's,
but has yet to support last-scalar-value live-outs, including
non-header phi's. As it relies on AllowedExit in order to detect
them and bail out we need to add the non-header PHI nodes to
AllowedExit, otherwise we end up with miscompiles.
Solves https://bugs.llvm.org/show_bug.cgi?id=43166
Reviewers: fhahn, Ayal
Reviewed By: fhahn, Ayal
Subscribers: anna, hiraditya, rkruppe, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D67074
llvm-svn: 370721
Now that we allow tail-folding, not only when we optimise for size, make
sure we do not run in this assert.
Differential revision: https://reviews.llvm.org/D66932
llvm-svn: 370711
The loop vectorizer was running in an assert when it tried to fold the tail and
had to emit runtime memory disambiguation checks.
Differential revision: https://reviews.llvm.org/D66803
llvm-svn: 370707
Emitting a schedule is really hard. There are lots of corner cases to take care of; in fact, of the 60+ SWP-specific testcases in the Hexagon backend most of those are testing codegen rather than the schedule creation itself.
One issue is that to test an emission corner case we must craft an input such that the generated schedule uses that corner case; sometimes this is very hard and convolutes testcases. Other times it is impossible but we want to test it anyway.
This patch adds a simple test pass that will consume a module containing a loop and generate pipelined code from it. We use post-instr-symbols as a way to annotate instructions with the stage and cycle that we want to schedule them at.
We also provide a flag that causes the MachinePipeliner to generate these annotations instead of actually emitting code; this allows us to generate an input testcase with:
llc < %s -stop-after=pipeliner -pipeliner-annotate-for-testing -o test.mir
And run the emission in isolation with:
llc < test.mir -run-pass=modulo-schedule-test
llvm-svn: 370705
This merges the 32-bit and 64-bit mode code to just use Custom
for both i32 and i64. We already had most of the handling in
the custom handling due to the AVX512 having legal fp_to_uint.
Just needed to add the i32->i64 promotion handling. Refactor
the fp_to_uint code in the custom handler to simplify the
number of times we check things.
Tweak cost model tables to match the default handling we were
getting due to Expand before.
llvm-svn: 370700
Use Custom lowering instead. Fall back to default expansion only
when the scalar FP type belongs in an XMM register. This improves
lowering for i32 to fp80, and also i32 to double on SSE1 only.
llvm-svn: 370699
FP128 values are passed in xmm registers so should be asssociated
with an SSE feature rather than MMX which uses a different set
of registers.
llc enables sse1 and sse2 by default with x86_64. But does not
enable mmx. Clang enables all 3 features by default.
I've tried to add command lines to test with -sse
where possible, but any test that returns a value in an xmm
register fails with a fatal error with -sse since we have no
defined ABI for that scenario.
llvm-svn: 370682
We should be using MQPR, and if we don't we can get COPYs and PHIs created for
QPR. These get folded into instructions, failing verification checks.
Differential revision: https://reviews.llvm.org/D66214
llvm-svn: 370676
Now that constrained fpto[su]i intrinsic are available,
add codegen support to the SystemZ backend.
In addition to pure back-end changes, I've also needed
to add the strict_fp_to_[su]int and any_fp_to_[su]int
pattern fragments in the obvious way.
llvm-svn: 370674
Summary:
Adds the following inline asm constraints for SVE:
- w: SVE vector register with full range, Z0 to Z31
- x: Restricted to registers Z0 to Z15 inclusive.
- y: Restricted to registers Z0 to Z7 inclusive.
This change also adds the "z" modifier to interpret a register as an SVE register.
Not all of the bitconvert patterns added by this patch are used, but they have been included here for completeness.
Reviewers: t.p.northover, sdesmalen, rovka, momchil.velikov, rengolin, cameron.mcinally, greened
Reviewed By: sdesmalen
Subscribers: javed.absar, tschuett, rkruppe, psnobl, cfe-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66302
llvm-svn: 370673
Fix: add a 'consumeError()' call to ObjectFile.cpp.
This error was never checked.
Original commit message:
It adds a test case for a problem fixed by D66976 <https://reviews.llvm.org/D66976>.
It was introduced by me in D66089 <https://reviews.llvm.org/D66089>.
The error reported was never consumed because of a wrong variable name used,
so it could fail when LLVM_ENABLE_ABI_BREAKING_CHECKS is used.
Differential revision: https://reviews.llvm.org/D67002
llvm-svn: 370669
The motivating bugs are:
https://bugs.llvm.org/show_bug.cgi?id=41340https://bugs.llvm.org/show_bug.cgi?id=42697
As discussed there, we could view this as a failure of IR canonicalization,
but then we would need to implement a backend fixup with target overrides
to get this right in all cases. Instead, we can just view this as a codegen
opportunity. It's not even clear for x86 exactly when we should favor
test+set; some CPUs have better theoretical throughput for the ALU ops than
bt/test.
This patch is made more complicated than I expected because there's an early
DAGCombine for 'and' that can change types of the intermediate ops via
trunc+anyext.
Differential Revision: https://reviews.llvm.org/D66687
llvm-svn: 370668
Summary:
D61491 caused us to use relocs when they're not strictly necessary, to
refer to symbols in the text section. This is a pessimization and it's a
problem for some loaders that don't support relocs yet.
Reviewers: nhaehnle, arsenm, tpr
Subscribers: kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65813
llvm-svn: 370667
Summary:
Commit r366897 introduced the possibility to set a variable from an
expression, such as [[#VAR2:VAR1+3]]. While introducing this feature, it
introduced extra logic to allow using such a variable on the same line
later on. Unfortunately that extra logic is flawed as it relies on a
mapping from variable to expression defining it when the mapping is from
variable definition to expression. This flaw causes among other issues
PR42896.
This commit avoids the problem by forbidding all use of a variable
defined on the same line, and removes the now useless logic. Redesign
will be done in a later commit because it will require some amount of
refactoring first for the solution to be clean. One example is the need
for some sort of transaction mechanism to set a variable temporarily and
from an expression and rollback if the CHECK pattern does not match so
that diagnostics show the right variable values.
Reviewers: jhenderson, chandlerc, jdenny, probinson, grimar, arichardson, rnk
Subscribers: JonChesterfield, rogfer01, hfinkel, kristina, rnk, tra, arichardson, grimar, dblaikie, probinson, llvm-commits, hiraditya
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66141
llvm-svn: 370663
bitcast <N x i8> (shuf X, undef, <N, N-1,...0>) to i{N*8} --> bswap (bitcast X to i{N*8})
In PR43146:
https://bugs.llvm.org/show_bug.cgi?id=43146
...we have a more complicated case where SLP is making a mess of bswap. This patch won't
do anything for that currently, but we need to improve bswap recognition in instcombine,
SLP, and/or a standalone pass to avoid that problem.
This is limited using the data-layout so we don't try to do this transform with actual
vector types. The backend does not appear to have folds to convert in either direction,
so we don't want to mess up something that is actually better lowered as a shuffle.
On x86, we're trading something like this:
vmovd %edi, %xmm0
vpshufb LCPI0_0(%rip), %xmm0, %xmm0 ## xmm0 = xmm0[3,2,1,0,u,u,u,u,u,u,u,u,u,u,u,u]
vmovd %xmm0, %eax
For:
movl %edi, %eax
bswapl %eax
Differential Revision: https://reviews.llvm.org/D66965
llvm-svn: 370659
I don't see GNU dlltool supporting doing this; with only a -d option
and no -l option, GNU dlltool runs successfully but doesn't write any
output file.
Differential Revision: https://reviews.llvm.org/D65645
llvm-svn: 370655
On BtVer2 conditional SIMD stores are heavily microcoded.
The latency is directly proportional to the number of packed elements extracted
from the input vector. Also, according to micro-benchmarks, most of the
computation seems to be done in the integer unit.
Only a minority of the uOPs is executed by the FPU. The observed behaviour on
the FPU looks similar to this:
- The input MASK value is moved to the Integer Unit
-- [ a VMOVMSK-like uOP-executed on JFPU0].
- In parallel, each element of the input XMM/YMM is extracted and then sent to
the IntegerUnit through JFPU1.
As expected, a (conditional) store is executed for every extracted element.
Interestingly, a (speculative) load is executed for every extracted element too.
It is as-if a "LOAD - BIT_EXTRACT- CMOV" sequence of uOPs is repeated by the
integer unit for every contionally stored element.
VMASKMOVDQU is a special case: the number of speculative loads is always 2
(presumably, one load per quadword). That means, extra shifts and masking is
performed on (one of) the loaded quadwords before each conditional store (that
also explains the big number of non-FP uOPs retired).
This patch replaces the existing writes for conditional SIMD stores (i.e.
WriteFMaskedStore, and WriteFMaskedStoreY) with the following new writes:
WriteFMaskedStore32 [ XMM Packed Single ]
WriteFMaskedStore32Y [ YMM Packed Single ]
WriteFMaskedStore64 [ XMM Packed Double ]
WriteFMaskedStore64Y [ YMM Packed Double ]
Added a wrapper class named X86SchedWriteMaskMove in X86Schedule.td to describe
both RM and MR variants for conditional SIMD moves in a single tablegen
definition.
Instances of that class are then passed in input to multiclass avx_movmask_rm
when constructing MASKMOVPS/PD definitions.
Since this patch introduces new writes, I had to update all the X86 scheduling
models.
Differential Revision: https://reviews.llvm.org/D66801
llvm-svn: 370649
The missing line added by this patch ensures that only spilt variable
locations are candidates for being restored from the stack. Otherwise,
register or constant-value information can be interpreted as a spill
location, through a union.
The added regression test replicates a scenario where this occurs: the
stack load from [rsp] causes the register-location DBG_VALUE to be
"restored" to rsi, when it should be left alone. See PR43058 for details.
Un x-fail a test that was suffering from this from a previous patch.
Differential Revision: https://reviews.llvm.org/D66895
llvm-svn: 370648
This is in line with the previous changes which allowed to
override the sh_offset/sh_size and useful for writing test cases.
Differential revision: https://reviews.llvm.org/D66998
llvm-svn: 370633
Verify that the call site DWARF symbols (added during the implementation
of the debug entry values feature) are generated properly.
Differential Revision: https://reviews.llvm.org/D66865
llvm-svn: 370631
MachineLICM can hoist an invariant load, but if that load is folded it needs to be unfolded. On AVX512 sometimes this load is an broadcast load which we were previously unable to unfold. This patch adds initial support for that with a very basic list of supported instructions as a starting point.
Differential Revision: https://reviews.llvm.org/D67017
llvm-svn: 370620
The motivating case for this is a long way from here:
https://bugs.llvm.org/show_bug.cgi?id=43146
...but I think this is where we have to start.
We need to canonicalize/optimize sequences of shift and logic to ease
pattern matching for things like bswap and improve perf in general.
But without the artificial limit of '!LegalTypes' (early combining),
there are a lot of test diffs, and not all are good.
In the minimal tests added for this proposal, x86 should have better
throughput in all cases. AArch64 is neutral for scalar tests because
it can fold shifts into bitwise logic ops.
There are 3 shift opcodes and 3 logic opcodes for a total of 9 possible patterns:
https://rise4fun.com/Alive/VlIhttps://rise4fun.com/Alive/n1mhttps://rise4fun.com/Alive/1Vn
Differential Revision: https://reviews.llvm.org/D67021
llvm-svn: 370617
Rename to lowerShuffleAsLanePermuteAndShuffle to make it clear that not just blends are performed.
Cleanup the in-lane shuffle mask generation to make it more obvious what's going on.
Some prep work noticed while investigating the poor shuffle code mentioned in D66004.
llvm-svn: 370613
These were never enabled correctly and are causing other problems. Taking them
out for the moment, whilst we work on the issues.
This reverts r370329.
llvm-svn: 370607
Summary:
This fixes the bugzilla id 43183 which triggerd by the following commit:
[RISCV] Avoid generating AssertZext for LP64 ABI when lowering floating LibCall
llvm-svn: 370604
This is what the copies will eventually be turned into. We don't
use COPY_TO_REGCLASS for scalar_to_vector patterns. So we should
use the real instruction here too.
llvm-svn: 370601
Summary:
Back-end currently expands mempcpy, but middle-end should work with memcpy instead of mempcpy to enable more memcpy-optimization.
GCC backend emits mempcpy, so LLVM backend could form it too, if we know mempcpy libcall is better than memcpy + n.
https://godbolt.org/z/dOCG96
Reviewers: efriedma, spatel, craig.topper, RKSimon, jdoerfert
Reviewed By: efriedma
Subscribers: hjl.tools, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65737
llvm-svn: 370593
EltsFromConsecutiveLoads was assuming that the number of input elts was the same as the number of elements in the output vector type when creating a zeroing shuffle, causing an assert when subvectors were being combined instead of just scalars.
llvm-svn: 370592
Narrowing stores when the target doesn't support the narrow version
forces the target to expand into a load-modify-store sequence, which
is highly suboptimal. The information narrowing throws away (legality
of the inverse transform) is hard to re-analyze. If the target doesn't
support a store of the narrow type, don't narrow even in pre-legalize
mode.
No test as this is DAGCombiner and depends on target bits.
llvm-svn: 370576
Use a { iN undef, i1 false } struct as the base, and only insert
the first operand, instead of using { iN undef, i1 undef } as the
base and inserting both. This is the same as what we do in InstCombine.
Differential Revision: https://reviews.llvm.org/D67034
llvm-svn: 370573
Restructured the code a little bit in preparation for adding
UMULFIXSAT. I think it will be easier to understand the code
if not interleaving the codegen for signed/unsigned/saturated
cases that much.
llvm-svn: 370569
1. zlib::compress accept &size_t but the param is an uint64_t.
2. Some systems don't have zlib installed. Don't use compression by default.
llvm-svn: 370564
cold versus function being newly added.
This is the second half of https://reviews.llvm.org/D66374.
Profile symbol list is the collection of function symbols showing up in
the binary which generates the current profile. It is used to discriminate
function being cold versus function being newly added. Profile symbol list
is only added for profile with ExtBinary format.
During profile use compilation, when profile-sample-accurate is enabled,
a function without profile will be regarded as cold only when it is
contained in that list.
Differential Revision: https://reviews.llvm.org/D66766
llvm-svn: 370563
Summary:
Adds clang builtins and LLVM intrinsics for these experimental
instructions. They are not implemented in engines yet, but that is ok
because the user must opt into using them by calling the builtins.
Reviewers: aheejin, dschuff
Reviewed By: aheejin
Subscribers: sbc100, jgravelle-google, hiraditya, sunfish, cfe-commits, llvm-commits
Tags: #clang, #llvm
Differential Revision: https://reviews.llvm.org/D67020
llvm-svn: 370556
This is an updated version of https://reviews.llvm.org/D66909 to fix PR42605.
Basically, current phi translatation translates an old value number to an new
value number for a call instruction based on the literal equality of call
expression, without verifying there is no clobber in between. This is incorrect.
To get a finegrain check, use MachineDependence analysis to do the job. However,
this is still not ideal. Although given a call instruction,
`MemoryDependenceResults::getCallDependencyFrom` returns identical call
instructions without clobber in between using MemDepResult with its DepType to
be `Def`. However, identical is too strict here and we want it to be relaxed a
little to consider phi-translation -- callee is the same, param operands can be
different. That means changing the semantic of `MemDepResult::Def` and I don't
know the potential impact.
So currently the patch is still conservative to only handle
MemDepResult::NonFuncLocal, which means the current call has no function local
clobber. If there is clobber, even if the clobber doesn't stand in between the
current call and the call with the new value, we won't do phi-translate.
Differential Revision: https://reviews.llvm.org/D67013
llvm-svn: 370547
Also improve assembler parser register validation for .seh_ directives.
This requires moving X86-specific seh directive handling into the x86
backend, which addresses some assembler FIXMEs.
Differential Revision: https://reviews.llvm.org/D66625
llvm-svn: 370533
Users have complained llvm.trap produce two ud2 instructions on Win64,
one for the trap, and one for unreachable. This change fixes that.
TrapUnreachable was added and enabled for Win64 in r206684 (April 2014)
to avoid poorly understood issues with the Windows unwinder.
There seem to be two major things in play:
- the unwinder
- C++ EH, _CxxFrameHandler3 & co
The unwinder disassembles forward from the return address to scan for
epilogues. Inserting a ud2 had the effect of stopping the unwinder, and
ensuring that it ran the EH personality function for the current frame.
However, it's not clear what the unwinder does when the return address
happens to be the last address of one function and the first address of
the next function.
The Visual C++ EH personality, _CxxFrameHandler3, needs to figure out
what the current EH state number is. It does this by consulting the
ip2state table, which maps from PC to state number. This seems to go
wrong when the return address is the last PC of the function or catch
funclet.
I'm not sure precisely which system is involved here, but in order to
address these real or hypothetical problems, I believe it is enough to
insert int3 after a call site if it would otherwise be the last
instruction in a function or funclet. I was able to reproduce some
similar problems locally by arranging for a noreturn call to appear at
the end of a catch block immediately before an unrelated function, and I
confirmed that the problems go away when an extra trailing int3
instruction is added.
MSVC inserts int3 after every noreturn function call, but I believe it's
only necessary to do it if the call would be the last instruction. This
change inserts a pseudo instruction that expands to int3 if it is in the
last basic block of a function or funclet. I did what I could to run the
Microsoft compiler EH tests, and the ones I was able to run showed no
behavior difference before or after this change.
Differential Revision: https://reviews.llvm.org/D66980
llvm-svn: 370525
This is the first stage in refactoring the pipeliner and making it more
accessible for backends to override and control. This separates the logic and
state required to *emit* a scheudule from the logic that *computes* and
validates a schedule.
This will enable (a) new schedule emitters and (b) new modulo scheduling
implementations to coexist.
NFC.
Differential Revision: https://reviews.llvm.org/D67006
llvm-svn: 370500
Just disable NSW/NUW flags. This matches what we're already doing for the other situations for these nodes, it was just missed for the demanded constant case.
Noticed by inspection - confirmed in offline discussion with @spatel. I've checked we have test coverage in the x86 extract-bits.ll and extract-lowbits.ll tests
llvm-svn: 370497
gcc and icc pass these types in zmm registers in zmm registers.
This patch implements a quick hack to override the register
type before calling convention handling to one that is legal.
Longer term we might want to do something similar to 256-bit
integer registers on AVX1 where we just split all the operations.
Fixes PR42957
Differential Revision: https://reviews.llvm.org/D66708
llvm-svn: 370495
Summary:
MTE allows memory access to bypass tag check iff the address argument
is [SP, #imm]. This change takes advantage of this to demote uses of
tagged addresses to regular FrameIndex operands, reducing register
pressure in large functions.
MO_TAGGED target flag is used to signal that the FrameIndex operand
refers to memory that might be tagged, and needs to be handled with
care. Such operand must be lowered to [SP, #imm] directly, without a
scratch register.
The transformation pass attempts to predict when the offset will be
out of range and disable the optimization.
AArch64RegisterInfo::eliminateFrameIndex has an escape hatch in case
this prediction has been wrong, but it is quite inefficient and should
be avoided.
Reviewers: pcc, vitalybuka, ostannard
Subscribers: mgorny, javed.absar, kristof.beyls, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66457
llvm-svn: 370490
I'm looking at unfolding broadcast loads on AVX512 which will
require refactoring this code to select broadcast opcodes instead
of regular load/stores in some cases. Merging them to avoid
further complicating their interfaces.
llvm-svn: 370484
Summary:
Instead of recomputing information for call sites we now use the
function information directly. This is always valid and once we have
call site specific information we can improve here.
This patch also bootstraps attributes that are created on-demand through
an initial update call. Information that is known will then directly be
available in the new attribute without causing an iteration delay.
The tests show how this improves the iteration count.
Reviewers: sstefan1, uenoku
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66781
llvm-svn: 370480
Summary:
Any pointer could have load/store users not only floating ones so we
move the manifest logic for alignment into the AAAlignImpl class.
Reviewers: uenoku, sstefan1
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66922
llvm-svn: 370479
Currenly we can encode the 'st_other' field of symbol using 3 fields.
'Visibility' is used to encode STV_* values.
'Other' is used to encode everything except the visibility, but it can't handle arbitrary values.
'StOther' is used to encode arbitrary values when 'Visibility'/'Other' are not helpfull enough.
'st_other' field is used to encode symbol visibility and platform-dependent
flags and values. Problem to encode it is that it consists of Visibility part (STV_* values)
which are enumeration values and the Other part, which is different and inconsistent.
For MIPS the Other part contains flags for all STO_MIPS_* values except STO_MIPS_MIPS16.
(Like comment in ELFDumper says: "Someones in their infinite wisdom decided to make
STO_MIPS_MIPS16 flag overlapped with other ST_MIPS_xxx flags."...)
And for PPC64 the Other part might actually encode any value.
This patch implements custom logic for handling the st_other and removes
'Visibility' and 'StOther' fields.
Here is an example of a new YAML style this patch allows:
- Name: foo
Other: [ 0x4 ]
- Name: bar
Other: [ STV_PROTECTED, 4 ]
- Name: zed
Other: [ STV_PROTECTED, STO_MIPS_OPTIONAL, 0xf8 ]
Differential revision: https://reviews.llvm.org/D66886
llvm-svn: 370472
This is hidden behind a (scalar-only) isOneConstant(N1) check at the moment, but once we get around to adding vector support we need to ensure we're dealing with the scalar bitwidth, not the total.
llvm-svn: 370468
Summary:
Found a couple of places in the code where all the PHI nodes
of a MBB is updated, replacing references to one MBB by
reference to another MBB instead.
This patch simply refactors the code to use a common helper
(MachineBasicBlock::replacePhiUsesWith) for such PHI node
updates.
Reviewers: t.p.northover, arsenm, uabelho
Subscribers: wdng, hiraditya, jsji, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66750
llvm-svn: 370463
Return a proper zero vector, just in case some elements are undef.
Noticed by inspection after dealing with a similar issue in PR43159.
llvm-svn: 370460
Summary:
@mclow.lists brought up this issue up in IRC.
It is a reasonably common problem to compare some two values for equality.
Those may be just some integers, strings or arrays of integers.
In C, there is `memcmp()`, `bcmp()` functions.
In C++, there exists `std::equal()` algorithm.
One can also write that function manually.
libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for
various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ
libc++ does not do anything like that, it simply relies on simple C++'s
`operator==()`. https://godbolt.org/z/er0Zwf (GOOD!)
So likely, there exists a certain performance opportunities.
Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that
is using `memcmp()` (in this case, compiled with modified compiler). {F8768213}
```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>
#include "benchmark/benchmark.h"
template <class T>
bool equal(T* a, T* a_end, T* b) noexcept {
for (; a != a_end; ++a, ++b) {
if (*a != *b) return false;
}
return true;
}
template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
std::numeric_limits<T>::max());
std::vector<T> v;
v.reserve(count);
std::generate_n(std::back_inserter(v), count,
[&dis, &gen]() { return dis(gen); });
assert(v.size() == count);
return v;
}
struct Identical {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
auto Tmp = getVectorOfRandomNumbers<T>(count);
return std::make_pair(Tmp, std::move(Tmp));
}
};
struct InequalHalfway {
template <typename T>
static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
auto V0 = getVectorOfRandomNumbers<T>(count);
auto V1 = V0;
V1[V1.size() / size_t(2)]++; // just change the value.
return std::make_pair(std::move(V0), std::move(V1));
}
};
template <class T, class Gen>
void BM_bcmp(benchmark::State& state) {
const size_t Length = state.range(0);
const std::pair<std::vector<T>, std::vector<T>> Data =
Gen::template Gen<T>(Length);
const std::vector<T>& a = Data.first;
const std::vector<T>& b = Data.second;
assert(a.size() == Length && b.size() == a.size());
benchmark::ClobberMemory();
benchmark::DoNotOptimize(a);
benchmark::DoNotOptimize(a.data());
benchmark::DoNotOptimize(b);
benchmark::DoNotOptimize(b.data());
for (auto _ : state) {
const bool is_equal = equal(a.data(), a.data() + a.size(), b.data());
benchmark::DoNotOptimize(is_equal);
}
state.SetComplexityN(Length);
state.counters["eltcnt"] =
benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
state.counters["eltcnt/sec"] =
benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
const size_t BytesRead = 2 * sizeof(T) * Length;
state.counters["bytes_read/iteration"] =
benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
benchmark::Counter::OneK::kIs1024);
state.counters["bytes_read/sec"] = benchmark::Counter(
BytesRead, benchmark::Counter::kIsIterationInvariantRate,
benchmark::Counter::OneK::kIs1024);
}
template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
const size_t L2SizeBytes = []() {
for (const benchmark::CPUInfo::CacheInfo& I :
benchmark::CPUInfo::Get().caches) {
if (I.level == 2) return I.size;
}
return 0;
}();
// What is the largest range we can check to always fit within given L2 cache?
const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
/*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}
BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical)
->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical)
->Apply(CustomArguments<uint64_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway)
->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway)
->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway)
->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway)
->Apply(CustomArguments<uint64_t>);
```
{F8768210}
```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench
RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx
2019-04-25 21:17:11
Running build-old/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 0.65, 3.90, 4.14
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 432131 ns 432101 ns 1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s
BM_bcmp<uint8_t, Identical>_BigO 0.86 N 0.86 N
BM_bcmp<uint8_t, Identical>_RMS 8 % 8 %
<...>
BM_bcmp<uint16_t, Identical>/256000 161408 ns 161409 ns 4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s
BM_bcmp<uint16_t, Identical>_BigO 0.67 N 0.67 N
BM_bcmp<uint16_t, Identical>_RMS 25 % 25 %
<...>
BM_bcmp<uint32_t, Identical>/128000 81497 ns 81488 ns 8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s
BM_bcmp<uint32_t, Identical>_BigO 0.71 N 0.71 N
BM_bcmp<uint32_t, Identical>_RMS 42 % 42 %
<...>
BM_bcmp<uint64_t, Identical>/64000 50138 ns 50138 ns 10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s
BM_bcmp<uint64_t, Identical>_BigO 0.84 N 0.84 N
BM_bcmp<uint64_t, Identical>_RMS 27 % 27 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 192405 ns 192392 ns 3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO 0.38 N 0.38 N
BM_bcmp<uint8_t, InequalHalfway>_RMS 3 % 3 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 127858 ns 127860 ns 5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO 0.50 N 0.50 N
BM_bcmp<uint16_t, InequalHalfway>_RMS 0 % 0 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 49140 ns 49140 ns 14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO 0.40 N 0.40 N
BM_bcmp<uint32_t, InequalHalfway>_RMS 18 % 18 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 32101 ns 32099 ns 21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO 0.50 N 0.50 N
BM_bcmp<uint64_t, InequalHalfway>_RMS 1 % 1 %
RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0
2019-04-25 21:19:29
Running build-new/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
L1 Data 16K (x8)
L1 Instruction 64K (x4)
L2 Unified 2048K (x4)
L3 Unified 8192K (x1)
Load Average: 1.01, 2.85, 3.71
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 18593 ns 18590 ns 37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s
BM_bcmp<uint8_t, Identical>_BigO 0.04 N 0.04 N
BM_bcmp<uint8_t, Identical>_RMS 37 % 37 %
<...>
BM_bcmp<uint16_t, Identical>/256000 18950 ns 18948 ns 37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s
BM_bcmp<uint16_t, Identical>_BigO 0.08 N 0.08 N
BM_bcmp<uint16_t, Identical>_RMS 34 % 34 %
<...>
BM_bcmp<uint32_t, Identical>/128000 18627 ns 18627 ns 37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s
BM_bcmp<uint32_t, Identical>_BigO 0.16 N 0.16 N
BM_bcmp<uint32_t, Identical>_RMS 35 % 35 %
<...>
BM_bcmp<uint64_t, Identical>/64000 18855 ns 18855 ns 37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s
BM_bcmp<uint64_t, Identical>_BigO 0.32 N 0.32 N
BM_bcmp<uint64_t, Identical>_RMS 33 % 33 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 9570 ns 9569 ns 73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO 0.02 N 0.02 N
BM_bcmp<uint8_t, InequalHalfway>_RMS 29 % 29 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 9547 ns 9547 ns 74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO 0.04 N 0.04 N
BM_bcmp<uint16_t, InequalHalfway>_RMS 29 % 29 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 9396 ns 9394 ns 73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO 0.08 N 0.08 N
BM_bcmp<uint32_t, InequalHalfway>_RMS 30 % 30 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 9499 ns 9498 ns 73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO 0.16 N 0.16 N
BM_bcmp<uint64_t, InequalHalfway>_RMS 28 % 28 %
Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000 -0.9570 -0.9570 432131 18593 432101 18590
<...>
BM_bcmp<uint16_t, Identical>/256000 -0.8826 -0.8826 161408 18950 161409 18948
<...>
BM_bcmp<uint32_t, Identical>/128000 -0.7714 -0.7714 81497 18627 81488 18627
<...>
BM_bcmp<uint64_t, Identical>/64000 -0.6239 -0.6239 50138 18855 50138 18855
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000 -0.9503 -0.9503 192405 9570 192392 9569
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000 -0.9253 -0.9253 127858 9547 127860 9547
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000 -0.8088 -0.8088 49140 9396 49140 9394
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000 -0.7041 -0.7041 32101 9499 32099 9498
```
What can we tell from the benchmark?
* Performance of naive equality check somewhat improves with element size,
maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s
for uint64_t. I think, that instability implies performance problems.
* Performance of `memcmp()`-aware benchmark always maxes out at around
bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the
naive variant!
* eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at
eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and
linearly decreases with element size.
For uint64_t, it's ~4x+ the elements/second.
* The call obvious is more pricey than the loop, with small element count.
As it can be seen from the full output {F8768210}, the `memcmp()` is almost
universally worse, independent of the element size (and thus buffer size) when
element count is less than 8.
So all in all, bcmp idiom does indeed pose untapped performance headroom.
This diff does implement said idiom recognition. I think a reasonable test
coverage is present, but do tell if there is anything obvious missing.
Now, quality. This does succeed to build and pass the test-suite, at least
without any non-bundled elements. {F8768216} {F8768217}
This transform fires 91 times:
```
$ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json
Tests: 1149
Metric: loop-idiom.NumBCmp
Program result-new
MultiSourc...Benchmarks/7zip/7zip-benchmark 79.00
MultiSource/Applications/d/make_dparser 3.00
SingleSource/UnitTests/vla 2.00
MultiSource/Applications/Burg/burg 1.00
MultiSourc.../Applications/JM/lencod/lencod 1.00
MultiSource/Applications/lemon/lemon 1.00
MultiSource/Benchmarks/Bullet/bullet 1.00
MultiSourc...e/Benchmarks/MallocBench/gs/gs 1.00
MultiSourc...gs-C/TimberWolfMC/timberwolfmc 1.00
MultiSourc...Prolangs-C/simulator/simulator 1.00
```
The size changes are:
I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look.
```
$ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash
Tests: 1149
Same hash: 907 (filtered out)
Remaining: 242
Metric: size..text
Program result-old result-new diff
test-suite...ingleSource/UnitTests/vla.test 753.00 833.00 10.6%
test-suite...marks/7zip/7zip-benchmark.test 1001697.00 966657.00 -3.5%
test-suite...ngs-C/simulator/simulator.test 32369.00 32321.00 -0.1%
test-suite...plications/d/make_dparser.test 89585.00 89505.00 -0.1%
test-suite...ce/Applications/Burg/burg.test 40817.00 40785.00 -0.1%
test-suite.../Applications/lemon/lemon.test 47281.00 47249.00 -0.1%
test-suite...TimberWolfMC/timberwolfmc.test 250065.00 250113.00 0.0%
test-suite...chmarks/MallocBench/gs/gs.test 149889.00 149873.00 -0.0%
test-suite...ications/JM/lencod/lencod.test 769585.00 769569.00 -0.0%
test-suite.../Benchmarks/Bullet/bullet.test 770049.00 770049.00 0.0%
test-suite...HMARK_ANISTROPIC_DIFFUSION/128 NaN NaN nan%
test-suite...HMARK_ANISTROPIC_DIFFUSION/256 NaN NaN nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/64 NaN NaN nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/32 NaN NaN nan%
test-suite...ENCHMARK_BILATERAL_FILTER/64/4 NaN NaN nan%
Geomean difference nan%
result-old result-new diff
count 1.000000e+01 10.00000 10.000000
mean 3.152090e+05 311695.40000 0.006749
std 3.790398e+05 372091.42232 0.036605
min 7.530000e+02 833.00000 -0.034981
25% 4.243300e+04 42401.00000 -0.000866
50% 1.197370e+05 119689.00000 -0.000392
75% 6.397050e+05 639705.00000 -0.000005
max 1.001697e+06 966657.00000 0.106242
```
I don't have timings though.
And now to the code. The basic idea is to completely replace the whole loop.
If we can't fully kill it, don't transform.
I have left one or two comments in the code, so hopefully it can be understood.
Also, there is a few TODO's that i have left for follow-ups:
* widening of `memcmp()`/`bcmp()`
* step smaller than the comparison size
* Metadata propagation
* more than two blocks as long as there is still a single backedge?
* ???
Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet
Reviewed By: courbet
Subscribers: hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61144
llvm-svn: 370454
Summary:
Change LiveDebugValues so that it inserts entry values after the bundle
which contains the clobbering instruction. Previously it would insert
the debug value after the bundle head using insertAfter(), breaking the
bundle.
Reviewers: djtodoro, NikolaPrica, aprantl, vsk
Reviewed By: vsk
Subscribers: hiraditya, llvm-commits
Tags: #debug-info, #llvm
Differential Revision: https://reviews.llvm.org/D66888
llvm-svn: 370448
Extend WindowsResourceParser to support using a ResourceSectionRef for
loading resources from an object file.
Only allow merging resource object files in mingw mode; keep the
existing error on multiple resource objects in link mode.
If there only is one resource object file and no .res resources,
don't parse and recreate the .rsrc section, but just link it in without
inspecting it. This allows users to produce any .rsrc section (outside
of what the parser supports), just like before. (I don't have a specific
need for this, but it reduces the risk of this new feature.)
Separate out the .rsrc section chunks in InputFiles.cpp, and only include
them in the list of section chunks to link if we've determined that there
only was one single resource object. (We need to keep other chunks from
those object files, as they can legitimately contain other sections as
well, in addition to .rsrc section chunks.)
Differential Revision: https://reviews.llvm.org/D66824
llvm-svn: 370436
Instead of updating a global variable counter for the next index of
strings and data blobs, pass along a reference to actual data/string
vectors and let the TreeNode insertion methods add their data/strings to
the vectors when a new entry is needed.
Additionally, if the resource tree had duplicates, that were ignored
with -force:multipleres in lld, we no longer store all versions of the
duplicated resource data, now we only keep the one that actually ends
up referenced.
Differential Revision: https://reviews.llvm.org/D66823
llvm-svn: 370435
This allows llvm-readobj to print the contents of each resource
when printing resources from an object file or executable, like it
already does for plain .res files.
This requires providing the whole COFFObjectFile to ResourceSectionRef.
This supports both object files and executables. For executables,
the DataRVA field is used as is to look up the right section.
For object files, ideally we would need to complete linking of them
and fix up all relocations to know what the DataRVA field would end up
being. In practice, the only thing that makes sense for an RVA field
is an ADDR32NB relocation. Thus, find a relocation pointing at this
field, verify that it has the expected type, locate the symbol it
points at, look up the section the symbol points at, and read from the
right offset in that section.
This works both for GNU windres object files (which use one single
.rsrc section, with all relocations against the base of the .rsrc
section, with the original value of the DataRVA field being the
offset of the data from the beginning of the .rsrc section) and
cvtres object files (with two separate .rsrc$01 and .rsrc$02 sections,
and one symbol per data entry, with the original pre-relocated DataRVA
field being set to zero).
Differential Revision: https://reviews.llvm.org/D66820
llvm-svn: 370433
Add lower for G_FPTOUI. Algorithm is similar to the SDAG version
in TargetLowering::expandFP_TO_UINT.
Lower G_FPTOUI for MIPS32.
Differential Revision: https://reviews.llvm.org/D66929
llvm-svn: 370431
When the number of return values exceeds the number of registers available,
SelectionDAGBuilder::visitRet transforms a function's return to use a
pointer to a buffer to hold return values. When the returned value is an
operator such as extractvalue, the value may have a non-zero result number.
Add that number to the indexing when obtaining the values to store.
This fixes https://bugs.llvm.org/show_bug.cgi?id=43132.
Differential Revision: https://reviews.llvm.org/D66978
llvm-svn: 370430
Unlike ppc64, which has ADDISgotTprelHA+LDgotTprelL pairs,
ppc32 just uses LDgotTprelL32, so it does not make lots of sense to use
_LO without a paired _HA.
Emit R_PPC_GOT_TPREL16 instead R_PPC_GOT_TPREL16_LO to match GCC, and
get better linker relocation check. Note, R_PPC_GOT_TPREL16_{HA,LO}
don't have good linker support:
(a) lld does not support R_PPC_GOT_TPREL16_{HA,LO}.
(b) Top of tree ld.bfd does not support R_PPC_GOT_REL16_HA Initial-Exec -> Local-Exec relaxation:
// a.o
addis 3, 3, tsd_tls@got@tprel@ha
lwz 3, tsd_tls@got@tprel@l(3)
add 3, 3, tsd_tls@tls
// b.o
.section .tdata,"awT"; .globl tsd_tls; tsd_tls:
// ld/ld-new a.o b.o
internal error, aborting at ../../bfd/elf32-ppc.c:7952 in ppc_elf_relocate_section
Reviewed By: adalava
Differential Revision: https://reviews.llvm.org/D66925
llvm-svn: 370426
Add a default with an llvm_unreachable for anything we don't expect.
This seems safer that just blindly returning true for anything
missing from the switch.
llvm-svn: 370424
Add an WASM_SYMBOL_NO_STRIP flag, so that __attribute__((used)) doesn't
need to imply exporting. When targeting Emscripten, have
WASM_SYMBOL_NO_STRIP imply exporting.
Differential Revision: https://reviews.llvm.org/D62542
llvm-svn: 370415
Summary:
Reported in https://github.com/opencv/opencv/issues/15413.
We have serveral extended mnemonics for Move To/From Vector-Scalar Register Instructions
eg: mffprd,mtfprd etc.
We only support one of them, this patch add the others.
Reviewers: nemanjai, steven.zhang, hfinkel, #powerpc
Reviewed By: hfinkel
Subscribers: wuzish, qcolombet, hiraditya, kbarton, MaskRay, shchenz, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66963
llvm-svn: 370411
This teaches GISel to select patterns which fold an extend plus optional shift
into the addressing mode. In particular, adds and subs.
Factor out the arith extended register ComplexPatterns in AArch64InstrFormats.td
and create GISel equivalents.
Add some equivalent functions to the ones in AArch64ISelDAGToDAG:
- `selectArithExtendedRegister`
- `narrowExtendRegIfNeeded`
- `getExtendTypeForInst`
`getExtendTypeForInst` includes the checks for loads and stores. This will be
used for WRO addressing modes in loads + stores.
Teach selectCopy to properly handle subregister copies on the same bank in
order to support `narrowExtendRegIfNeeded`. The extended register must be a
GPR32, so we need to support same-bank subregister copies.
Fix a bug in getSubRegForClass which would cause registers on things like
GPR32common to end up getting ssub. Just change the check to look for FPR32
rather than GPR32.
For tests:
- Add select-arith-extended-reg.mir
- Update addsub_ext.ll to include GlobalISel checks
Differential Revision: https://reviews.llvm.org/D66835
llvm-svn: 370410
Summary:
This is a minor improvement on our past attempts to do this. Fixes
PR43155.
Reviewers: hans
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66905
llvm-svn: 370409
Summary:
There is no reason to differ in assembler behavior here between -msvc
and -gnu targets. Without this setting, the text after the '@' is
interpreted as a symbol variable, like foo@IMGREL.
Reviewers: mstorsjo
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66974
llvm-svn: 370408
ISD::isBuildVectorAllZeros permits undef elements to be present, which means we can't return it as a zero vector. PMULDQ/PMULUDQ is an extending multiply so a multiply by zero of the lower 32-bits should result in a zero 64-bit element.
llvm-svn: 370404
-Deprecate -mmpx and -mno-mpx command line options
-Remove CPUID detection of mpx for -march=native
-Remove MPX from all CPUs
-Remove MPX preprocessor define
I've left the "mpx" string in the backend so we don't fail on old IR, but its not connected to anything.
gcc has also deprecated these command line options. https://www.phoronix.com/scan.php?page=news_item&px=GCC-Patch-To-Drop-MPX
Differential Revision: https://reviews.llvm.org/D66669
llvm-svn: 370393
We can also apply the earlier updates to the lazy DTU, instead of
applying them directly.
Reviewers: kuhar, brzycki, asbirlea, SjoerdMeijer
Reviewed By: brzycki, asbirlea, SjoerdMeijer
Differential Revision: https://reviews.llvm.org/D66918
llvm-svn: 370391
AMDGPU uses this for some addressing mode selection patterns. The
analysis run itself doesn't do anything so it seems easier to just
always require this than adding a way to opt in.
llvm-svn: 370388
Summary:
I'm not planning to check this in at the moment, but feedback is very welcome, in particular how this affects performance.
The feedback obtains here will guide the next steps towards enabling this.
This patch enables the use of MemorySSA in the loop pass manager.
Passes that currently use MemorySSA:
- EarlyCSE
Passes that use MemorySSA after this patch:
- EarlyCSE
- LICM
- SimpleLoopUnswitch
Loop passes that update MemorySSA (and do not use it yet, but could use it after this patch):
- LoopInstSimplify
- LoopSimplifyCFG
- LoopUnswitch
- LoopRotate
- LoopSimplify
- LCSSA
Loop passes that do *not* update MemorySSA:
- IndVarSimplify
- LoopDelete
- LoopIdiom
- LoopSink
- LoopUnroll
- LoopInterchange
- LoopUnrollAndJam
- LoopVectorize
- LoopReroll
- IRCE
Reviewers: chandlerc, george.burgess.iv, davide, sanjoy, gberry
Subscribers: jlebar, Prazek, dmgreen, jdoerfert, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D58311
llvm-svn: 370384
Add a GISelPredicateCode to the stxr_* PatFrags in AArch64InstrAtomics.td.
This allows us to select these intrinsics.
Differential Revision: https://reviews.llvm.org/D65779
llvm-svn: 370382
Remove manual selection code for this intrinsic and use a GISelPredicateCode
instead.
This allows us to fully select this intrinsic without any tricky custom C++
matching.
Differential Revision: https://reviews.llvm.org/D65780
llvm-svn: 370380
Same thing as D66897, but for ldxr.* instead. Add a GISelPredicateCode to the
ldxr_* definitions, which allows us to import them.
Add select-ldxr-intrin.mir, and update arm64-ldxr-stxr.ll.
Differential Revision: https://reviews.llvm.org/D66898
llvm-svn: 370378
Add a GISelPredicateCode to ldaxr_*. This allows us to import the patterns for
@llvm.aarch64.ldaxr.*, and thus select them.
Add `isLoadStoreOfNumBytes` for the GISelPredicateCode, since each of these
intrinsics involves the same check.
Add select-ldaxr-intrin.mir, and update arm64-ldxr-stxr.ll.
Differential Revision: https://reviews.llvm.org/D66897
llvm-svn: 370377
Summary:
- Similar to the workaround in fix of PR30188, skip sinking common
lifetime markers of `alloca`. They are mostly left there after
inlining functions in branches.
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66950
llvm-svn: 370376
Summary:
While examining this class for possible use in lldb, I noticed two
things:
- it spits out parsing errors directly to stderr
- the loclists parser can incorrectly return valid location lists when
parsing malformed (truncated) data
I improve the stderr situation by making the parseOneLocationList
functions return Expected<T>s. The errors are still dumped to stderr by
their callers, so this is only a partial fix, but it is enough for my
use case, as I intend to parse the locations lists one by one.
I fix the behavior in the truncated scenario by using the newly
introduced DataExtractor Cursor API.
I also add tests for handling the error cases, as they currently have no
coverage.
Reviewers: dblaikie, JDevlieghere, probinson
Subscribers: lldb-commits, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D63591
llvm-svn: 370363
Both methods `MipsTargetStreamer::emitStoreWithSymOffset` and
`MipsTargetStreamer::emitLoadWithSymOffset` are almost the same and
differ argument names only. These methods are used in the single place
so it's better to inline their code and remove original methods.
llvm-svn: 370354
When a "base" in the `lw/sw $reg1, symbol($reg2)` instruction is
a register and generated code is position independent, backend
does not add the "base" value to the symbol address.
```
lw $reg1, %got(symbol)($gp)
lw/sw $reg1, 0($reg1)
```
This patch fixes the bug and adds the missed `addu` instruction by
passing `BaseReg` into the `loadAndAddSymbolAddress` routine and handles
the case when the `BaseReg` is the zero register to escape redundant
`move reg, reg` instruction:
```
lw $reg1, %got(symbol)($gp)
addu $reg1, $reg1, $reg2
lw/sw $reg1, 0($reg1)
```
Differential Revision: https://reviews.llvm.org/D66894
llvm-svn: 370353
Summary:
Now that with D65143/D65144 we've produce `@llvm.umul.with.overflow`,
and with D65147 we've flattened the CFG, we now can see that
the guard may have been there to prevent division by zero is redundant.
We can simply drop it:
```
----------------------------------------
Name: no overflow or zero
%iszero = icmp eq i4 %y, 0
%umul = smul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%umul.ov.not = xor %umul.ov, -1
%retval.0 = or i1 %iszero, %umul.ov.not
ret i1 %retval.0
=>
%iszero = icmp eq i4 %y, 0
%umul = smul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%umul.ov.not = xor %umul.ov, -1
%retval.0 = or i1 %iszero, %umul.ov.not
ret i1 %umul.ov.not
Done: 1
Optimization is correct!
```
Note that this is inverted from what we have in a previous patch,
here we are looking for the inverted overflow bit.
And that inversion is kinda problematic - given this particular
pattern we neither hoist that `not` closer to `ret` (then the pattern
would have been identical to the one without inversion,
and would have been handled by the previous patch), neither
do the opposite transform. But regardless, we should handle this too.
I've filled [[ https://bugs.llvm.org/show_bug.cgi?id=42720 | PR42720 ]].
Reviewers: nikic, spatel, xbolva00, RKSimon
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65151
llvm-svn: 370351
Summary:
Now that with D65143/D65144 we've produce `@llvm.umul.with.overflow`,
and with D65147 we've flattened the CFG, we now can see that
the guard may have been there to prevent division by zero is redundant.
We can simply drop it:
```
----------------------------------------
Name: no overflow and not zero
%iszero = icmp ne i4 %y, 0
%umul = umul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%retval.0 = and i1 %iszero, %umul.ov
ret i1 %retval.0
=>
%iszero = icmp ne i4 %y, 0
%umul = umul_overflow i4 %x, %y
%umul.ov = extractvalue {i4, i1} %umul, 1
%retval.0 = and i1 %iszero, %umul.ov
ret %umul.ov
Done: 1
Optimization is correct!
```
Reviewers: nikic, spatel, xbolva00
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65150
llvm-svn: 370350
Summary:
As it can be seen in the tests in D65143/D65144, even though we have formed an '@llvm.umul.with.overflow'
and got rid of potential for division-by-zero, the control flow remains, we still have that branch.
We have this condition:
```
// Don't fold i1 branches on PHIs which contain binary operators
// These can often be turned into switches and other things.
if (PN->getType()->isIntegerTy(1) &&
(isa<BinaryOperator>(PN->getIncomingValue(0)) ||
isa<BinaryOperator>(PN->getIncomingValue(1)) ||
isa<BinaryOperator>(IfCond)))
return false;
```
which was added back in rL121764 to help with `select` formation i think?
That check prevents us to flatten the CFG here, even though we know
we no longer need that guard and will be able to drop everything
but the '@llvm.umul.with.overflow' + `not`.
As it can be seen from tests, we end here because the `not` is being
sinked into the PHI's incoming values by InstCombine,
so we can't workaround this by hoisting it to after PHI.
Thus i suggest that we relax that check to not bailout if we'd get to hoist the `not`.
Reviewers: craig.topper, spatel, fhahn, nikic
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D65147
llvm-svn: 370349
The missing line added by this patch ensures that only spilt variable
locations are candidates for being restored from the stack. Otherwise,
register or constant-value information can be interpreted as a spill
location, through a union.
The added regression test replicates a scenario where this occurs: the
stack load from [rsp] causes the register-location DBG_VALUE to be
"restored" to rsi, when it should be left alone. See PR43058 for details.
Un x-fail a test that was suffering from this from a previous patch.
Differential Revision: https://reviews.llvm.org/D66895
llvm-svn: 370334
This allows us to produce broken binaries with local
symbols placed after global in '.dynsym'/'.symtab'
Also, simplifies the code.
Differential revision: https://reviews.llvm.org/D66799
llvm-svn: 370331
Masked loads and store fit naturally with MVE, the instructions being easily
predicated. This adds lowering for the simple cases of masked loads and stores.
It does not yet deal with widening/narrowing or pre/post inc.
The llvm masked load intrinsic will accept a "passthru" value, dictating the
values used for the zero masked lanes. In MVE the instructions write 0 to the
zero predicated lanes, so we need to match a passthru that isn't 0 (or undef)
with a select instruction to pull in the correct data after the load.
We also need to do something with unaligned loads/stores. Currently this uses a
similar method used in big endian, using an VLDRB.8 (and potentially a VREV in
BE). This does mean that the predicate mask is converted from, for example, a
v4i1 to a v16i1. The VLDR instructions are defined as using the first bit of
the relevant mask lane, so this could potentially load different results if the
predicate is little odd. As the input is a v4i1 however, I believe this is OK
and all the bits required should be set in the predicate, making the VLDRB.8
load the same data.
Differential Revision: https://reviews.llvm.org/D66534
llvm-svn: 370329
The "join" method in LiveDebugValues does not attempt to join unseen
predecessor blocks if their out-locations aren't yet initialized, instead
the block should be re-visited later to see if any locations have changed
validity. However, because the set of blocks were all being "process"'d
once before "join" saw them, that logic in "join" was actually ignoring
legitimate out-locations on the first pass through. This meant that some
invalidated locations were not removed from the head of loops, allowing
illegal locations to persist.
Fix this by removing the run of "process" before the main join/process loop
in ExtendRanges. Now the unseen predecessors that "join" skips truly are
uninitialized, and we come back to the block at a later time to re-run
"join", see the @baz function added.
This also fixes another fault where stack/register transfers in the entry
block (or any other before-any-loop-block) had their tranfers initially
ignored, and were then never revisited. The MIR test added tests for this
behaviour.
XFail a test that exposes another bug; a fix for this is coming in D66895.
Differential Revision: https://reviews.llvm.org/D66663
llvm-svn: 370328
Summary:
We were previously doing it in DAGCombine.
But we also want to do `sub %x, C` -> `add %x, (sub 0, C)` for vectors in DAGCombine.
So if we had `sub %x, -1`, we'll transform it to `add %x, 1`,
which `combineIncDecVector()` will immediately transform back into `sub %x, -1`,
and here we go again...
I've marked this as NFC since not a single test changes,
but since that 'changes' DAGCombine, probably this isn't fully NFC.
Reviewers: RKSimon, craig.topper, spatel
Reviewed By: craig.topper
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D62327
llvm-svn: 370327
Summary: This is beneficial when the shuffle is only used once and end up being generated in a few places when some node is combined into a shuffle.
Reviewers: craig.topper, efriedma, RKSimon, lebedev.ri
Subscribers: llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66718
llvm-svn: 370326
Summary:
Finally, the fold i was looking forward to :)
The legality check is muddy, i doubt i've groked the full generalization,
but it handles all the cases i care about, and can come up with:
https://rise4fun.com/Alive/26j
I.e. we can perform the fold if **any** of the following is true:
* The shift amount is either zero or one less than widest bitwidth
* Either of the values being shifted has at most lowest bit set
* The value that is being shifted by `shl` (which is not truncated) should have no less leading zeros than the total shift amount;
* The value that is being shifted by `lshr` (which **is** truncated) should have no less leading zeros than the widest bit width minus total shift amount minus one
I strongly suspect there is some better generalization, but i'm not aware of it as of right now.
For now i also avoided using actual `computeKnownBits()`, but restricted it to constants.
Reviewers: spatel, nikic, xbolva00
Reviewed By: spatel
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66383
llvm-svn: 370324
Instead of blindly incrementing pointers in llvm-readobj, use this
helper, which does bounds checking against the available section
data.
Differential Revision: https://reviews.llvm.org/D66818
llvm-svn: 370310
Previously, the expression (Reader.readFoo()) was expanded twice,
triggering asserts as one of the Error types ends up not checked
(and as it was expanded twice, the method would end up called twice
if it failed first).
Differential Revision: https://reviews.llvm.org/D66817
llvm-svn: 370309
We had an isel pattern to perform this, but its better to
do it in DAG combine as a simplification. This also fixes the lack
of patterns for AVX512 targets.
llvm-svn: 370294
Including a type legalizer fix to make bitcast operand promotion
work correctly when getSoftenedFloat returns f128 instead of i128.
Fixes PR43157
llvm-svn: 370293
We do not access the DT in the loop, so we do not have to apply updates
eagerly. We can apply them lazyly and flush them after we are done
merging blocks.
As follow-up work, we might be able to use the DTU above as well,
instead of manually updating the DT.
This brings the example from PR43134 from ~100s to ~4s for a relase +
assertions build on my machine.
Reviewers: efriedma, kuhar, asbirlea, brzycki
Reviewed By: kuhar, brzycki
Differential Revision: https://reviews.llvm.org/D66911
llvm-svn: 370292
SGPR spills aren't really handled after SILowerSGPRSpills. In order to
directly control what happens if the scavenger needs to spill, the
scavenger needs to be used directly. There is an alternative to
spilling in these contexts anyway since the frame register can be
increment and restored.
This does present another possible issue if spilling is needed for the
unused carry out if an add is needed. I think this can be avoided by
using a scalar add (although that clobbers SCC, which happens anyway).
llvm-svn: 370281
This is a special case because one node maps to two different G_
instructions, and the operand order is changed.
This mostly enables G_FCMP for AMDPGPU. G_ICMP is still manually
selected for now since it has the SALU and VALU complication to deal
with.
llvm-svn: 370280
Also add a FIXME because I'm not sure why these patterns exist. Looks like a missing combine.
And another FIXME because the AVX512 equivalent one of the patterns is missing.
llvm-svn: 370276
The patch fixed the issue that RV64 didn't clear the upper bits
when return complex floating value with lp64 ABI.
float _Complex
complex_add(float _Complex a, float _Complex b)
{
return a + b;
}
RealResult = zero_extend(RealA + RealB)
ImageResult = ImageA + ImageB
Return (RealResult | (ImageResult << 32))
The patch introduces shouldExtendTypeInLibCall target hook to suppress
the AssertZext generation when lowering floating LibCall.
Thanks to Eli's comments from the Bugzilla
https://bugs.llvm.org/show_bug.cgi?id=42820
Differential Revision: https://reviews.llvm.org/D65497
llvm-svn: 370275
If result of 64-bit address loading combines with 32-bit mask, LLVM
tries to optimize the code and remove "redundant" loading of upper
32-bits of the address. It leads to incorrect code on MIPS64 targets.
MIPS backend creates the following chain of commands to load 64-bit
address in the `MipsTargetLowering::getAddrNonPICSym64` method:
```
(add (shl (add (shl (add %highest(sym), %higher(sym)),
16),
%hi(sym)),
16),
%lo(%sym))
```
If the mask presents, LLVM decides to optimize the chain of commands. It
really does not make sense to load upper 32-bits because the 0x0fffffff
mask anyway clears them. After removing redundant commands we get this
chain:
```
(add (shl (%hi(sym), 16), %lo(%sym))
```
There is no patterns matched `(MipsHi (i64 symbol))`. Due a bug in `SYM_32`
predicate definition, backend incorrectly selects a pattern for a 32-bit
symbols and uses the `lui` instruction for loading `%hi(sym)`.
As a result we get incorrect set of instructions with unnecessary 16-bit
left shifting:
```
lui at,0x0
R_MIPS_HI16 foo
dsll at,at,0x10
daddiu at,at,0
R_MIPS_LO16 foo
```
This patch resolves two problems:
- Fix `SYM_32/SYM_64` predicates to prevent selection of patterns dedicated
to 32-bit symbols in case of using N64 ABI.
- Add missed patterns for 64-bit symbols for `%hi/%lo`.
Fix PR42736.
Differential Revision: https://reviews.llvm.org/D66228
llvm-svn: 370268
...cloning a function from a different module
Currently when a function with debug info is cloned from a different module, the
cloned function may have hanging DICompileUnits, so that the module with the
cloned function fails debug info verification.
The proposed fix inserts all DICompileUnits reachable from the cloned function
to "llvm.dbg.cu" metadata operands of the cloned function module.
Reviewed By: aprantl, efriedma
Differential Revision: https://reviews.llvm.org/D66510
Patch by Oleg Pliss (Oleg.Pliss@azul.com)
llvm-svn: 370265
By default ASan calls a versioned function
`__asan_version_mismatch_check_vXXX` from the ASan module constructor to
check that the compiler ABI version and runtime ABI version are
compatible. This ensures that we get a predictable linker error instead
of hard-to-debug runtime errors.
Sometimes, however, we want to skip this safety guard. This new command
line option allows us to do just that.
rdar://47891956
Reviewed By: kubamracek
Differential Revision: https://reviews.llvm.org/D66826
llvm-svn: 370258
Before this change, if multiple binary files were presented, all of them must have been instrumented or the load would fail with coverage_map_error::no_data_found.
Patch by Dean Sturtevant.
Differential Revision: https://reviews.llvm.org/D66763
llvm-svn: 370257
Stop counting explicitly disabled user_spgr's in the user_sgpr_count field of the kernel descriptor.
Differential Revision: https://reviews.llvm.org/D66900
llvm-svn: 370250
Always true/false checks were flagged by static analysis;
https://bugs.llvm.org/show_bug.cgi?id=43143
I have not confirmed the logic difference in propagating nsw vs. nuw,
but presumably we would have noticed a bug by now if that was wrong.
llvm-svn: 370248
Summary:
This functionality was added when Mapper::mapMetadata was recursive. It
is no longer needed after r265456, which switched it to be iterative.
Reviewers: dexonsmith, srhines
Subscribers: hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66860
llvm-svn: 370236
As dependences between abstract attributes can become stale, e.g., if
one was sufficient to imply another one at some point but it has since
been wakened to the point it is not usable for the formerly implied one.
To weed out spurious dependences, and thereby eliminate unneeded
updates, we introduce an option to determine how often the dependence
cache is cleared and recomputed during the fixpoint iteration.
Note that the initial value was determined such that we see a positive
result on our tests.
Differential Revision: https://reviews.llvm.org/D63315
llvm-svn: 370230
This implements constrained floating point intrinsics for FP to signed and
unsigned integers.
Quoting from D32319:
The purpose of the constrained intrinsics is to force the optimizer to
respect the restrictions that will be necessary to support things like the
STDC FENV_ACCESS ON pragma without interfering with optimizations when
these restrictions are not needed.
Reviewed by: Andrew Kaylor, Craig Topper, Hal Finkel, Cameron McInally, Roman Lebedev, Kit Barton
Approved by: Craig Topper
Differential Revision: http://reviews.llvm.org/D63782
llvm-svn: 370228
These are currently translated as normal functions calls in AArch64.
Until we have proper tail call lowering, we shouldn't translate these.
Differential Revision: https://reviews.llvm.org/D66842
llvm-svn: 370225
This reduces the number of SGPRs due to some concerns about running
out of SGPRs if you make all the SGPRs that aren't reserved available
for the calling convention.
Change-Id: Idb4ca4dc72f5b6808cb524ff7270915a8de5b4c1
llvm-svn: 370215
Summary:
Until we have proper call-site information we should not recompute
liveness and return information for each call site. This patch directly
uses the function versions and introduces TODOs at the usage sites.
The required iterations to get to the fixpoint are most of the time
reduced by this change and we always avoid work duplication.
Reviewers: sstefan1, uenoku
Subscribers: hiraditya, bollu, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66562
llvm-svn: 370208
This relands this commit, I mistakenly reverted the original change
thinking it was the cause of the observed MSan failures but it was not.
llvm-svn: 370206
Neither libgcc or compiler-rt are usually used on Windows, so these
functions can't be called.
Differential revision: https://reviews.llvm.org/D66880
llvm-svn: 370204
There is no pattern matched `add hi, (MipsLo texternalsym)`. As a result,
loading an address of 32-bit symbol requires two registers and one more
additional instruction:
```
addiu $1, $zero, %lo(foo)
lui $2, %hi(foo)
addu $25, $2, $1
```
This patch adds the missed pattern and enables generation more effective
set of instructions:
```
lui $1, %hi(foo)
addiu $25, $1, %lo(foo)
```
Differential Revision: https://reviews.llvm.org/D66771
llvm-svn: 370196
Summary: There are at least 2 ways to express the same shuffle. Various pieces of code explicit check for both option, but other places do not when they would benefit from doing it. This patches refactor the codebase to use buildLegalVectorShuffle in order to make that behavior more consistent.
Reviewers: craig.topper, efriedma, RKSimon, lebedev.ri
Subscribers: javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D66804
llvm-svn: 370190
This just pulls the MVEVPTBlockPass into a separate file, as opposed to being
wrapped up in Thumb2ITBlockPass.
Differential revision: https://reviews.llvm.org/D66579
llvm-svn: 370187
This adds fp16 VMOVX patterns, using the same patterns as rL362482 with some
adjustments for MVE. It allows us to move fp16 registers without going into and
out of gprs.
VMOVX is able to move the top bits from a fp16 in a fp reg into the bottom bits
of another register, zeroing the rest. This can be used for odd MVE register
lanes. The top bits are not read by fp16 instructions, so no move is required
there if we are dealing with even lanes.
Differential revision: https://reviews.llvm.org/D66793
llvm-svn: 370184
With the introduction of the typed byval attribute change there was no
way that the LLVM-C API could create the correct class Attribute. If a
program that uses the C API creates a ByVal attribute and annotates a
function with that attribute LLVM will crash when it assembles or write
that module containing the function out as bitcode.
This change is a minimal fix to at least allow code to work, this is
because the byval change is on the 9.0 and I don't want to introduce new
LLVM-C API this late in the release cycle.
By Jakob Bornecrantz!
Differential revision: https://reviews.llvm.org/D66144
llvm-svn: 370176
Allow vectorizing loops that have reductions when tail is folded by masking.
A select is introduced in VPlan, choosing between the last value carried by the
loop-exit/live-out instruction of the reduction, and the penultimate value
carried by the reduction phi, according to the "i < n" mask of fold-tail.
This select replaces the last value as the live-out value of the loop.
Differential Revision: https://reviews.llvm.org/D66720
llvm-svn: 370173
rL369567 reverted a couple of recent changes made to ARMParallelDSP
because of a miscompilation error: PR43073.
The issue stemmed from an underlying bug that was caused by adding
muls into a reduction before it was proved that they could be executed
in parallel with another mul.
Most of the changes here are from the previously reverted commits.
The additional changes have been made area:
1) The Search function now doesn't insert any muls into the Reduction
object. That now happens once the search has successfully finished.
2) For any muls added into the reduction but that weren't paired, we
accumulate their values as an input into the smlad.
Differential Revision: https://reviews.llvm.org/D66660
llvm-svn: 370171