Summary: This is needed for later SDWA support in CodeGen.
Reviewers: vpykhtin, tstellarAMD
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27412
llvm-svn: 290338
Summary: Real instruction should copy constraints from real instruction. This allows auto-generated disassembler to correctly process tied operands.
Reviewers: nhaustov, vpykhtin, tstellarAMD
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27847
llvm-svn: 290336
Replacing the memory operand in the ymm version of VPMADDWD from i128mem to i256mem.
Differential Revision: https://reviews.llvm.org/D28024
llvm-svn: 290333
declarations.
We're using a custom class here instead of the helper template, these
bits just didn't get deleted when the other bits did get deleted. This
was found by a really nice MSVC warning about explicitly instantiating
a template where some member functions aren't defined and thus can't be
instantiatied.
llvm-svn: 290327
from the old pass manager in the new one.
I'm not trying to support (initially) the numerous options that are
currently available to customize the pass pipeline. If we end up really
wanting them, we can add them later, but I suspect many are no longer
interesting. The simplicity of omitting them will help a lot as we sort
out what the pipeline should look like in the new PM.
I've also documented to the best of my ability *why* each pass or group
of passes is used so that reading the pipeline is more helpful. In many
cases I think we have some questionable choices of ordering and I've
left FIXME comments in place so we know what to come back and revisit
going forward. But for now, I've left it as similar to the current
pipeline as I could.
Lastly, I've had to comment out several places where passes are not
ported to the new pass manager or where the loop pass infrastructure is
not yet ready. I did at least fix a few bugs in the loop pass
infrastructure uncovered by running the full pipeline, but I didn't want
to go too far in this patch -- I'll come back and re-enable these as the
infrastructure comes online. But I'd like to keep the comments in place
because I don't want to lose track of which passes need to be enabled
and where they go.
One thing that seemed like a significant API improvement was to require
that we don't build pipelines for O0. It seems to have no real benefit.
I've also switched back to returning pass managers by value as at this
API layer it feels much more natural to me for composition. But if
others disagree, I'm happy to go back to an output parameter.
I'm not 100% happy with the testing strategy currently, but it seems at
least OK. I may come back and try to refactor or otherwise improve this
in subsequent patches but I wanted to at least get a good starting point
in place.
Differential Revision: https://reviews.llvm.org/D28042
llvm-svn: 290325
When DwarfExpression is emitting a fragment that is located in a
register and that fragment is smaller than the register, and the
register must be composed from sub-registers (are you still with me?)
the last DW_OP_piece operation must not be larger than the size of the
fragment itself, since the last piece of the fragment could be smaller
than the last subregister that is being emitted.
rdar://problem/29779065
llvm-svn: 290324
There are helpers for testing for constant or constant build_vector,
and for splat ConstantFP vectors, but not for a constantfp or
non-splat ConstantFP vector.
llvm-svn: 290317
This is to put the vector into a well defined state. Apparently the state of a
vector after being moved from is valid but unspecified. Found with clang-tidy.
llvm-svn: 290298
This adds a basic tablegen backend that analyzes the SelectionDAG
patterns to find simple ones that are eligible for GlobalISel-emission.
That's similar to FastISel, with one notable difference: we're not fed
ISD opcodes, so we need to map the SDNode operators to generic opcodes.
That's done using GINodeEquiv in TargetGlobalISel.td.
Otherwise, this is mostly boilerplate, and lots of filtering of any kind
of "complicated" pattern. On AArch64, this is sufficient to match G_ADD
up to s64 (to ADDWrr/ADDXrr) and G_BR (to B).
Differential Revision: https://reviews.llvm.org/D26878
llvm-svn: 290284
Each function summary has an attached list of type identifier GUIDs. The
idea is that during the regular LTO phase we would match these GUIDs to type
identifiers defined by the regular LTO module and store the resolutions in
a top-level "type identifier summary" (which will be implemented separately).
Differential Revision: https://reviews.llvm.org/D27967
llvm-svn: 290280
In order for the llvm DWARF parser to be used in LLDB we will need to be able to get the parent of a DIE. This patch adds that functionality by changing the DWARFDebugInfoEntry class to store a depth field instead of a sibling index. Using a depth field allows us to easily calculate the sibling and the parent without increasing the size of DWARFDebugInfoEntry.
I tested llvm-dsymutil on a debug version of clang where this fully parses DWARF in over 1200 .o files to verify there was no serious regression in performance.
Added a full suite of unit tests to test this functionality.
Differential Revision: https://reviews.llvm.org/D27995
llvm-svn: 290274
RTDyldMemoryManager.cpp describes the differing __register_frame
API between libunwind and libgcc, with a mailing list posting URL.
The original link was 404; replace it with what I believe is the
intended post, as well as a reference to the "OS X" implementation in
libunwind.
Differential Revision: https://reviews.llvm.org/D27965
llvm-svn: 290269
The constantexpr parsing was too constrained and rejected legal vector GEPs.
This relaxes it to be similar to the ones for instruction parsing.
This fixes PR30816.
Differential Revision: https://reviews.llvm.org/D28013
llvm-svn: 290261
For vector GEPs, CastGEPIndices can end up in an infinite recursion, because
we compare the vector type to the scalar pointer type, find them different,
and then try to cast a type to itself.
Differential Revision: https://reviews.llvm.org/D28009
llvm-svn: 290260
I added API for creation a target specific memory node in DAG. Today, all memory nodes are common for all targets and their constructors are located in SelectionDAG.cpp.
There are some cases in X86 where we need to create a special node - truncation-with-saturation store, float-to-half-store.
In the current patch I added truncation-with-saturation nodes and I'm using them for intrinsics. In the future I plan to implement DAG lowering for truncation-with-saturation pattern.
Differential Revision: https://reviews.llvm.org/D27899
llvm-svn: 290250
The vectorcall calling convention specifies that arguments to functions are to be passed in registers, when possible.
vectorcall uses more registers for arguments than fastcall or the default x64 calling convention use.
The vectorcall calling convention is only supported in native code on x86 and x64 processors that include Streaming SIMD Extensions 2 (SSE2) and above.
The current implementation does not handle Homogeneous Vector Aggregates (HVAs) correctly and this review attempts to fix it.
This aubmit also includes additional lit tests to cover better HVAs corner cases.
Differential Revision: https://reviews.llvm.org/D27392
llvm-svn: 290240
In r267672, where the loop distribution pragma was introduced, I tried
it hard to keep the old behavior for opt: when opt is invoked
with -loop-distribute, it should distribute the loop (it's off by
default when ran via the optimization pipeline).
As MichaelZ has discovered this has the unintended consequence of
breaking a very common developer work-flow to reproduce compilations
using opt: First you print the pass pipeline of clang
with -debug-pass=Arguments and then invoking opt with the returned
arguments.
clang -debug-pass will include -loop-distribute but the pass is invoked
with default=off so nothing happens unless the loop carries the pragma.
While through opt (default=on) we will try to distribute all loops.
This changes opt's default to off as well to match clang. The tests are
modified to explicitly enable the transformation.
llvm-svn: 290235
we used to print UNKNOWN instructions when the instruction to be printer was not
yet inserted in any BB: in that case the pretty printer would not be able to
compute a TII as the instruction does not belong to any BB or function yet.
This patch explicitly passes the TII to the pretty-printer.
Differential Revision: https://reviews.llvm.org/D27645
llvm-svn: 290228
No existing client is passing a non-null value here. This will come back
in a slightly different form as part of the type identifier summary work.
Differential Revision: https://reviews.llvm.org/D28006
llvm-svn: 290222
We're currently doing nearly the same thing for @llvm.objectsize in
three different places: two of them are missing checks for overflow,
and one of them could subtly break if InstCombine gets much smarter
about removing alloc sites. Seems like a good idea to not do that.
llvm-svn: 290214
GlobPattern is a class to handle glob pattern matching. Currently
only LLD is using that, but technically that feature is not specific
to linkers, so in this patch I move that file to LLVM.
Differential Revision: https://reviews.llvm.org/D27969
llvm-svn: 290212
Summary:
In getRangeForAffineAR we compute ranges for affine exprs E = A + B*C,
where ranges for A, B, and C are known. To avoid overflow, we need to
operate on a bigger bitwidth, and originally we chose 2*x+1 for this
(x being the original bitwidth). However, it is safe to use just 2*x:
A+B*C <= (2^x - 1) + (2^x - 1)*(2^x - 1) =
= 2^x - 1 + 2^2x - 2^x - 2^x + 1 =
= 2^2x - 2^x <= 2^2x - 1
Unnecessary extending of bitwidths results in noticeable slowdowns: ranges
perform arithmetic operations using APInt, which are much slower when bitwidths
are bigger than 64.
Reviewers: sanjoy, majnemer, chandlerc
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27795
llvm-svn: 290211
This patch adds support for YAML<->DWARF for debug_info sections.
This re-lands r290147, after fixing the issue that caused bots to fail (thank you UBSan!).
llvm-svn: 290204
Also make the summary ref and call graph vectors immutable. This means
a smaller API surface and fewer places to audit for non-determinism.
Differential Revision: https://reviews.llvm.org/D27875
llvm-svn: 290200
Make it clear that TripCount is the upper bound of the iteration on which
control exits LatchBlock.
Differential Revision: https://reviews.llvm.org/D26675
llvm-svn: 290199
See https://reviews.llvm.org/D6678 for the history of
isExtractSubvectorCheap. Essentially the same considerations apply
to ARM.
This temporarily breaks the formation of vpadd/vpaddl in certain cases;
AddCombineToVPADDL essentially assumes that we won't form VUZP shuffles.
See https://reviews.llvm.org/D27779 for followup fix.
Differential Revision: https://reviews.llvm.org/D27774
llvm-svn: 290198
When the instruction is processed the first time, it may be
deleted resulting in crashes. While the new test adds the same
user to the worklist twice, this particular case doesn't crash
but I'm not sure why.
llvm-svn: 290191
I haven't managed to get this to fail yet but its technically possible for the AND -> shuffle decomposition to result in illegal types.
llvm-svn: 290183
Summary:
Without a MachineMemOperand, the scheduler was assuming MIMG instructions
were ordered memory references, so no loads or stores could be reordered
across them.
Reviewers: arsenm
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27536
llvm-svn: 290179
Include of llvm/IR/Verifier.h was removed from HexagonCommonGEP.cpp in r289604
as unused. In fact it is required when expensive checks are enabled, because
it declared function `verifyFunction`, which is called in conditionally compiled
part of the file.
llvm-svn: 290170
clear. The current RefSCC can occur in exactly one position so we should
just enforce that and leverage the property rather than checking for it
anywhere.
This addresses review comments made on another patch.
llvm-svn: 290162
This doesn't implement *every* feature of the existing inliner, but
tries to implement the most important ones for building a functional
optimization pipeline and beginning to sort out bugs, regressions, and
other problems.
Notable, but intentional omissions:
- No alloca merging support. Why? Because it isn't clear we want to do
this at all. Active discussion and investigation is going on to remove
it, so for simplicity I omitted it.
- No support for trying to iterate on "internally" devirtualized calls.
Why? Because it adds what I suspect is inappropriate coupling for
little or no benefit. We will have an outer iteration system that
tracks devirtualization including that from function passes and
iterates already. We should improve that rather than approximate it
here.
- Optimization remarks. Why? Purely to make the patch smaller, no other
reason at all.
The last one I'll probably work on almost immediately. But I wanted to
skip it in the initial patch to try to focus the change as much as
possible as there is already a lot of code moving around and both of
these *could* be skipped without really disrupting the core logic.
A summary of the different things happening here:
1) Adding the usual new PM class and rigging.
2) Fixing minor underlying assumptions in the inline cost analysis or
inline logic that don't generally hold in the new PM world.
3) Adding the core pass logic which is in essence a loop over the calls
in the nodes in the call graph. This is a bit duplicated from the old
inliner, but only a handful of lines could realistically be shared.
(I tried at first, and it really didn't help anything.) All told,
this is only about 100 lines of code, and most of that is the
mechanics of wiring up analyses from the new PM world.
4) Updating the LazyCallGraph (in the new PM) based on the *newly
inlined* calls and references. This is very minimal because we cannot
form cycles.
5) When inlining removes the last use of a function, eagerly nuking the
body of the function so that any "one use remaining" inline cost
heuristics are immediately refined, and queuing these functions to be
completely deleted once inlining is complete and the call graph
updated to reflect that they have become dead.
6) After all the inlining for a particular function, updating the
LazyCallGraph and the CGSCC pass manager to reflect the
function-local simplifications that are done immediately and
internally by the inline utilties. These are the exact same
fundamental set of CG updates done by arbitrary function passes.
7) Adding a bunch of test cases to specifically target CGSCC and other
subtle aspects in the new PM world.
Many thanks to the careful review from Easwaran and Sanjoy and others!
Differential Revision: https://reviews.llvm.org/D24226
llvm-svn: 290161
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.
Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:
(1) The DIGlobalVariable should describe the source level variable,
not how to get to its location.
(2) It makes it unsafe/hard to update the expressions when we call
replaceExpression on the DIGLobalVariable.
(3) It makes it impossible to represent a global variable that is in
more than one location (e.g., a variable with multiple
DW_OP_LLVM_fragment-s). We also moved away from attaching the
DIExpression to DILocalVariable for the same reasons.
This reapplies r289902 with additional testcase upgrades and a change
to the Bitcode record for DIGlobalVariable, that makes upgrading the
old format unambiguous also for variables without DIExpressions.
<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769
llvm-svn: 290153
Summary:
The expression for computing the return value of getMemOpBaseRegImmOfs has only
one possible value. The other value would result in a return earlier in the
function. This patch replaces the expression with its only possible value.
Reviewers: sanjoy
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27437
llvm-svn: 290133
DWARF 4 and later supports encoding the PC as an address or as as offset from the low PC. Clients using DWARFDie should be insulated from how to extract the high PC value. This function takes care of extracting the form value and looking for the correct form.
Differential Revision: https://reviews.llvm.org/D27885
llvm-svn: 290131
This allows lowering i8 and i16 arguments if they can fit in the registers. Note
that the lowering is incomplete - ABI extensions are handled in a subsequent
patch.
(Last part of)
Differential Revision: https://reviews.llvm.org/D27704
llvm-svn: 290106
Teach the instruction selector and legalizer that it's ok to have adds with 8 or
16-bit integers.
This is the second part of https://reviews.llvm.org/D27704
llvm-svn: 290105
Teach the instruction selector that it's ok to copy small values from physical
registers.
First part of https://reviews.llvm.org/D27704
llvm-svn: 290104
PWR9 processor model for instruction scheduling. A subsequent patch will migrate
PWR9 to Post RA MIScheduler.
https://reviews.llvm.org/D24525
llvm-svn: 290102
This adds support for lowering more than 4 arguments (although still i32 only).
It uses the handleAssignments / ValueHandler infrastructure extracted from
the AArch64 backend in r288658.
Differential Revision: https://reviews.llvm.org/D27195
llvm-svn: 290098
Summary:
Added pair of directives .hsa_code_object_metadata/.end_hsa_code_object_metadata.
Between them user can put YAML string that would be directly put to the generated note. E.g.:
'''
.hsa_code_object_metadata
{
amd.MDVersion: [ 2, 0 ]
}
.end_hsa_code_object_metadata
'''
Based on D25046
Reviewers: vpykhtin, nhaustov, yaxunl, tstellarAMD
Subscribers: arsenm, kzhuravl, wdng, nhaehnle, mgorny, tony-tye
Differential Revision: https://reviews.llvm.org/D27619
llvm-svn: 290097
Add support for selecting simple G_LOAD and G_FRAME_INDEX instructions (32-bit
scalars only). This will be useful for functions that need to pass arguments on
the stack.
First part of https://reviews.llvm.org/D27195.
llvm-svn: 290096
Summary:
MachineInstr::isIdenticalTo() is for some reason not
symmetric when comparing bundles, which gives us the
property:
I1->isIdenticalTo(*I2) != I2->isIdenticalTo(*I1)
when comparing bundles where one bundle is longer than
the other.
This patch makes sure that bundles of different length
always are considered as not being identical. Thus, the
result of the comparison will be the same regardless of
which side that happens to be to the left.
Reviewers: dexonsmith, jonpa, andrew.w.kaylor
Subscribers: llvm-commits, mehdi_amini
Differential Revision: https://reviews.llvm.org/D27508
llvm-svn: 290095
The original version of the code in XRayInstrumentation.cpp assumed that
functions may not have empty machine basic blocks (or that the first one
couldn't be). This change addresses that by special-casing that specific
situation.
We provide two .mir test-cases to make sure we're handling this
appropriately.
Fixes llvm.org/PR31424.
Reviewers: chandlerc
Subscribers: varno, llvm-commits
Differential Revision: https://reviews.llvm.org/D27913
llvm-svn: 290091
Long is not the same size across a number of the platforms we support.
Use unsigned int here instead, it is more appropriate because
overflow/wrap-around is possible and, in this case, expected.
llvm-svn: 290068
Background/motivation - I was circling back around to:
https://llvm.org/bugs/show_bug.cgi?id=28296
I made a simple patch for that and noticed some regressions, so added test cases for
those with rL281055, and this is hopefully the minimal fix for just those cases.
But as you can see from the surrounding untouched folds, we are missing commuted patterns
all over the place, and of course there are no regression tests to cover any of those cases.
We could sprinkle "m_c_" dust all over this file and catch most of the missing folds, but
then we still wouldn't have test coverage, and we'd still miss some fraction of commuted
patterns because they require adjustments to the match order.
I'm aware of the concern about the potential compile-time performance impact of adding
matches like this (currently being discussed on llvm-dev), but I don't think there's any
evidence yet to suggest that handling commutative pattern matching more thoroughly is not
a worthwhile goal of InstCombine.
Differential Revision: https://reviews.llvm.org/D24419
llvm-svn: 290067
Not sure whether it causes and ASAN false positive or whether it
actually leads to incorrect code or whether it even exposes bad code.
Hans, I'll get you instructions to reproduce this.
llvm-svn: 290066
Commit on behalf of Gadi Haber
Removed EVEX_V512 prefix from scalar EVEX instructions since HW ignores L'L bits anyway (LIG). 4 instructions are modified.
The changed encodings are validated with XED.
Rviewers: delena, igorb
Differential revision: https://reviews.llvm.org/D27802
llvm-svn: 290065
These nodes are only emitted for lowering FABS/FNEG/FNABS/FCOPYSIGN. Ideally we just wouldn't create these nodes if SSE2 or higher is available, but it was simple to just convert them in DAG combine.
For SSE2, AVX, and AVX512 with DQI this is no functional change as the execution domain fixing pass ensures the right domain is selected regardless of the ISD opcode.
For AVX-512 without DQI we end up using integer instructions since the floating point versions aren't available. But we were already doing that for any logical operations in code that didn't come from FABS/FNEG/FNABS/FCOPYSIGN so this seems no worse. And we get the benefit of being able to fold broadcasts now.
llvm-svn: 290060
Patch implements parser of pubnames/pubtypes tables instead of static
function used before. It is now should be possible to reuse it
in LLD or other projects and clean up the duplication code.
Differential revision: https://reviews.llvm.org/D27851
llvm-svn: 290040
Summary:
PseudoSourceValue can be used to attach a target specific value for "well behaved" side-effects lowered from target specific intrinsics.
This is useful whenever there is not an LLVM IR Value around when representing such "well behaved" side-effected operations in backends by attaching a MachineMemOperand with a custom PseudoSourceValue as this makes the scheduler not treating them as "GlobalMemoryObjects" which triggers a logic that makes the operation act like a barrier in the Schedule DAG.
This patch adds another Kind to the PseudoSourceValue object which is "TargetCustom". It indicates a type of PseudoSourceValue that has a target specific meaning (aka. LLVM shouldn't assume any specific usage for such a PSV).
It supports the possibility of having many different kinds of "TargetCustom" PseudoSourceValues.
We had a discussion about if this was valuable or not (in particular because there was a believe that PSV were going away sooner or later) but seems like they are not going anywhere and I think they are useful backend side.
It is not clear the interaction of this with MIRParser (do we need a target hook to parse these?) and I would like a comment from Alex about that :)
Reviewers: arphaman, hfinkel, arsenm
Subscribers: Eugene.Zelenko, llvm-commits
Patch By: Marcello Maggioni
Differential Revision: https://reviews.llvm.org/D13575
llvm-svn: 290037
Re-apply r288561: Liveness tracking should be correct now after r290014.
Previously this pass was using up to 5% compile time in some cases which
is a bit much for what it is doing. The pass featured a full blown
data-flow analysis which in the default configuration was restricted to a
single block.
This rewrites the pass under the assumption that we only ever work on a
single block. This is done in a single pass maintaining a state machine
per general purpose register to catch LOH patterns.
Differential Revision: https://reviews.llvm.org/D27329
llvm-svn: 290026
BPI may trigger signed overflow UB while computing branch probabilities for
cold calls or to unreachables. For example, with our current choice of weights,
we'll crash if there are >= 2^12 branches to an unreachable.
Use a safer BranchProbability constructor which is better at handling fractions
with large denominators.
Changes since the initial commit:
- Use explicit casts to ensure that multiplication operands are 64-bit
ints.
rdar://problem/29368161
Differential Revision: https://reviews.llvm.org/D27862
llvm-svn: 290022
This reverts commit r290016. It breaks this bot, even though the test
passes locally:
http://bb.pgr.jp/builders/ninja-x64-msvc-RA-centos6/builds/32956/
AnalysisTests: /home/bb/ninja-x64-msvc-RA-centos6/llvm-project/llvm/lib/Support/BranchProbability.cpp:52: static llvm::BranchProbability llvm::BranchProbability::getBranchProbability(uint64_t, uint64_t): Assertion `Numerator <= Denominator && "Probability cannot be bigger than 1!"' failed.
llvm-svn: 290019
BPI may trigger signed overflow UB while computing branch probabilities
for cold calls or to unreachables. For example, with our current choice
of weights, we'll crash if there are >= 2^12 branches to an unreachable.
Use a safer BranchProbability constructor which is better at handling
fractions with large denominators.
rdar://problem/29368161
Differential Revision: https://reviews.llvm.org/D27862
llvm-svn: 290016
The Mach-O command line flag like "-arch armv7m" does not match the
arch name part of its llvm Triple which is "thumbv7m-apple-darwin”.
I think the best way to fix this is to have
llvm::object::MachOObjectFile::getArchTriple() optionally return the
name of the Mach-O arch flag that would be used with -arch that
matches the CPUType and CPUSubType. Then change
llvm::object::MachOUniversalBinary::ObjectForArch::getArchTypeName()
to use that and change it to getArchFlagName() as the type name is
really part of the Triple and the -arch flag name is a Mach-O thing
for a specific Triple with a specific Mcpu value.
rdar://29663637
llvm-svn: 290001
Summary:
When reading the metadata bitcode, create a type declaration when
possible for composite types when we are importing. Doing this in
the bitcode reader saves memory. Also it works naturally in the case
when the type ODR map contains a definition for the same composite type
because it was used in the importing module (buildODRType will
automatically use the existing definition and not create a type
declaration).
For Chromium built with -g2, this reduces the aggregate size of the
generated native object files by 66% (from 31G to 10G). It reduced
the time through the ThinLTO link and backend phases by about 20% on
my machine.
Reviewers: mehdi_amini, dblaikie, aprantl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27775
llvm-svn: 289993
This is recommit of r287553 after fixing the invalid loop info after eliminating an empty block and unit test failures in AVR and WebAssembly :
Summary: Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, joerg, davidxl
Subscribers: joerg, qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 289988
This reverts commit 289920 (again).
I forgot to implement a Bitcode upgrade for the case where a DIGlobalVariable
has not DIExpression. Unfortunately it is not possible to safely upgrade
these variables without adding a flag to the bitcode record indicating which
version they are.
My plan of record is to roll the planned follow-up patch that adds a
unit: field to DIGlobalVariable into this patch before recomitting.
This way we only need one Bitcode upgrade for both changes (with a
version flag in the bitcode record to safely distinguish the record
formats).
Sorry for the churn!
llvm-svn: 289982
This is the 3rd of 3 patches to get reading and writing of
CodeView symbol and type records to use a single codepath.
Differential Revision: https://reviews.llvm.org/D26427
llvm-svn: 289978
This patch reapplies r289863. The original patch was reverted because it
exposed a bug causing the loop vectorizer to crash in the Python runtime on
PPC. The underlying issue was fixed with r289958.
llvm-svn: 289975
`dropUnknownNonDebugMetadata` takes a list of "known" metadata IDs. The
only reason it worked at all is that `getMetadataID` returns something
unrelated -- it returns the subclass ID of the receiver (which is used
in `dyn_cast` etc.). That does not numerically match
`LLVMContext::MD_invariant_group` and ends up dropping `invariant_group`
along with every other metadata that does not numerically match
`LLVMContext::MD_invariant_group`.
llvm-svn: 289973
Currently, there are substantial problems forming vld1_dup even if the
VDUP survives legalization. The lack of an actual node
leads to terrible results: not only can we not form post-increment vld1_dup
instructions, but we form scalar pre-increment and post-increment
loads which force the loaded value into a GPR. This patch fixes that
by combining the vdup+load into an ARMISD node before DAGCombine
messes it up.
Also includes a crash fix for vld2_dup (see testcase @vld2dupi8_postinc_variable).
Recommiting with fix to avoid forming vld1dup if the type of the load
doesn't match the type of the vdup (see
https://llvm.org/bugs/show_bug.cgi?id=31404).
Differential Revision: https://reviews.llvm.org/D27694
llvm-svn: 289972
Replace sleep() posix function by a more portable sleep_for() function
from std. Also, ignore memmem() and strcasestr() on Windows.
Differential Revision: https://reviews.llvm.org/D27729
llvm-svn: 289964
Reverting because this breaks lld's gdb_index support - it's probably
double counting the abbrev relocation offset.
This reverts commit r289954.
llvm-svn: 289961
After r288909, instructions feeding predicated instructions may be scalarized
if profitable. Since these instructions will remain scalar, we shouldn't
attempt to type-shrink them. We should only truncate vector types to their
minimal bit widths. This bug was exposed by enabling the vectorization of loops
containing conditional stores by default.
llvm-svn: 289958
Summary: ThinLTO needs to invoke SampleProfileLoader pass during link time in order to annotate profile correctly after module importing.
Reviewers: davidxl, mehdi_amini, tejohnson
Subscribers: pcc, davide, llvm-commits, mehdi_amini
Differential Revision: https://reviews.llvm.org/D27790
llvm-svn: 289957
atomic_load_add returns the value before addition, but sets EFLAGS based on the
result of the addition. That means it's setting the flags based on effectively
subtracting C from the value at x, which is also what the outer cmp does.
This targets a pattern that occurs frequently with reference counting pointers:
void decrement(long volatile *ptr) {
if (_InterlockedDecrement(ptr) == 0)
release();
}
Clang would previously compile it (for 32-bit at -Os) as:
00000000 <?decrement@@YAXPCJ@Z>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: 31 c9 xor %ecx,%ecx
6: 49 dec %ecx
7: f0 0f c1 08 lock xadd %ecx,(%eax)
b: 83 f9 01 cmp $0x1,%ecx
e: 0f 84 00 00 00 00 je 14 <?decrement@@YAXPCJ@Z+0x14>
14: c3 ret
and with this patch it becomes:
00000000 <?decrement@@YAXPCJ@Z>:
0: 8b 44 24 04 mov 0x4(%esp),%eax
4: f0 ff 08 lock decl (%eax)
7: 0f 84 00 00 00 00 je d <?decrement@@YAXPCJ@Z+0xd>
d: c3 ret
(Equivalent variants with _InterlockedExchangeAdd, std::atomic<>'s fetch_add
or pre-decrement operator generate the same code.)
Differential Revision: https://reviews.llvm.org/D27781
llvm-svn: 289955
Input can be produced by ld -r, for example (a normal LLVM workflow
never hits this - LLVM only ever produces a single abbrev table in an
object (shared by multiple CUs), so the reloc's always 0, and when it's
linked together the relocation's resolved so it doesn't need to be
handled)
llvm-svn: 289954
This is recommit of r287553 after fixing the invalid loop info after eliminating an empty block:
Summary: Merging an empty case block into the header block of switch could cause ISel to add COPY instructions in the header of switch, instead of the case block, if the case block is used as an incoming block of a PHI. This could potentially increase dynamic instructions, especially when the switch is in a loop. I added a test case which was reduced from the benchmark I was targetting.
Reviewers: t.p.northover, mcrosier, manmanren, wmi, joerg, davidxl
Subscribers: joerg, qcolombet, danielcdh, hfinkel, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D22696
llvm-svn: 289951
It currently is in an unnamed namespace and then it shouldn't be used
from something in the header file. This actually triggers a warning with
GCC:
../include/llvm/IR/Verifier.h:39:7: warning: ‘llvm::TBAAVerifier’ has a field ‘llvm::TBAAVerifier::Diagnostic’ whose type uses the anonymous namespace [enabled by default]
llvm-svn: 289942
Add the minimal support necessary to select a function that returns the sum of
two i32 values.
This includes some support for argument/return lowering of i32 values through
registers, as well as the handling of copy and add instructions throughout the
GlobalISel pipeline.
Differential Revision: https://reviews.llvm.org/D26677
llvm-svn: 289940
stores by default
This uncovers a crasher in the loop vectorizer on PPC when building the
Python runtime. I'll send the testcase to the review thread for the
original commit.
llvm-svn: 289934
Summary:
This commits moves skipDebugInstructionsForward and
skipDebugInstructionsBackward from lib/CodeGen/IfConversion.cpp
to include/llvm/CodeGen/MachineBasicBlock.h and updates
some codgen files to use them.
This refactoring was suggested in https://reviews.llvm.org/D27688
and I thought it's best to do the refactoring in a separate
review, but I could also put both changes in a single review
if that's preferred.
Also, the names for the functions aren't the snappiest and
I would be happy to rename them if anybody has suggestions.
Reviewers: eli.friedman, iteratee, aprantl, MatzeB
Subscribers: MatzeB, llvm-commits
Differential Revision: https://reviews.llvm.org/D27782
llvm-svn: 289933
Add two public methods to ARMTargetLowering: CCAssignFnForCall and
CCAssignFnForReturn, which are just calling the already existing private method
CCAssignFnForNode. These will come in handy for GlobalISel on ARM.
We also replace all calls to CCAssignFnForNode in ARMISelLowering.cpp, because
the new methods are friendlier to the reader.
llvm-svn: 289932
This patch appears to result in trampolines in vtables being miscompiled
when they in turn tail call a method.
I've posted some preliminary details about the failure on the thread for
this commit and talked to Hal. He was comfortable going ahead and
reverting until we sort out what is wrong.
llvm-svn: 289928
This is intended to be used (in a later patch) by the BitcodeReader
to detect invalid TBAA and drop them when loading bitcode, so that
we don't break client that have legacy bitcode with possible invalid
TBAA.
Differential Revision: https://reviews.llvm.org/D27838
llvm-svn: 289927
One more attempt to re-commit the patch r285355, which I had to revert in r285362, because some tests were failing (the reason is because the size of the line_table varied depending on the full file name).
In the past the compiler always emitted .debug_line version 2, though some opcodes from DWARF 3 (e.g. DW_LNS_set_prologue_end, DW_LNS_set_epilogue_begin or DW_LNS_set_isa) and from DWARF 4 could be emitted by the compiler.
This patch changes version information of .debug_line to exactly match the DWARF version. For .debug_line version 4, a new field maximum_operations_per_instruction is emitted.
Differential Revision: https://reviews.llvm.org/D16697
llvm-svn: 289925
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.
Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:
(1) The DIGlobalVariable should describe the source level variable,
not how to get to its location.
(2) It makes it unsafe/hard to update the expressions when we call
replaceExpression on the DIGLobalVariable.
(3) It makes it impossible to represent a global variable that is in
more than one location (e.g., a variable with multiple
DW_OP_LLVM_fragment-s). We also moved away from attaching the
DIExpression to DILocalVariable for the same reasons.
This reapplies r289902 with additional testcase upgrades.
<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769
llvm-svn: 289920
Summary:
Instead of checking whether a global referenced by a function being
imported is defined in the same module, speculatively always add the
referenced globals to the module's export list. After all imports are
computed, for each module prune any not in its defined set from its
export list.
For a huge C++ app with aggressive importing thresholds, even with
D27687 we spent a lot of time invoking modulePath() from
exportGlobalInModule (modulePath() was still the 2nd hottest routine in
profile). The reason is that with comdat/linkonce the summary lists for
each GUID can be long. For the app in question, for example, we were
invoking exportGlobalInModule almost 2 million times, and we traversed
an average of 63 entries in the summary list each time.
This patch reduced the thin link time for the app by about 10% (on top
of D27687) when using aggressive importing thresholds, and about 3.5% on
average with default importing thresholds.
Reviewers: mehdi_amini
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27755
llvm-svn: 289918
idiom.
r289538: Match load by bytes idiom and fold it into a single load
r289540: Fix a buildbot failure introduced by r289538
r289545: Use more detailed assertion messages in the code ...
r289646: Add a couple of assertions to the load combine code ...
This DAG combine has a bad crash in it that is quite hard to trigger
sadly -- it relies on sneaking code with UB through the SDAG build and
into this particular combine. I've responded to the original commit with
a test case that reproduces it.
However, the code also has other problems that will require substantial
changes to address and so I'm going ahead and reverting it for now. This
should unblock us and perhaps others that are hitting the crash in the
wild and will let a fresh patch with updated approach come in cleanly
afterward.
Sorry for any trouble or disruption!
llvm-svn: 289916
This patch implements PR31013 by introducing a
DIGlobalVariableExpression that holds a pair of DIGlobalVariable and
DIExpression.
Currently, DIGlobalVariables holds a DIExpression. This is not the
best way to model this:
(1) The DIGlobalVariable should describe the source level variable,
not how to get to its location.
(2) It makes it unsafe/hard to update the expressions when we call
replaceExpression on the DIGLobalVariable.
(3) It makes it impossible to represent a global variable that is in
more than one location (e.g., a variable with multiple
DW_OP_LLVM_fragment-s). We also moved away from attaching the
DIExpression to DILocalVariable for the same reasons.
<rdar://problem/29250149>
https://llvm.org/bugs/show_bug.cgi?id=31013
Differential Revision: https://reviews.llvm.org/D26769
llvm-svn: 289902
This pass prepares a module containing type metadata for ThinLTO by splitting
it into regular and thin LTO parts if possible, and writing both parts to
a multi-module bitcode file. Modules that do not contain type metadata are
written unmodified as a single module.
All globals with type metadata are added to the regular LTO module, and
the rest are added to the thin LTO module.
Differential Revision: https://reviews.llvm.org/D27324
llvm-svn: 289899
Summary:
We were reinvoking exportGlobalInModule numerous times redundantly.
No need to re-export globals referenced by a global that was already
imported from its module. This resulted in a large speedup in the thin
link for a big application, particularly when importing aggressiveness
was cranked up.
Reviewers: mehdi_amini
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27687
llvm-svn: 289896
The IRTranslator uses an additional block before the LLVM-IR entry block
to perform all the ABI lowering and the constant hoisting. Thus, this
block is the actual entry block and it falls through the LLVM-IR entry
block. However, with such representation, we end up with two basic
blocks that are not maximal.
Therefore, this patch adds a bit of canonicalization by merging both the
LLVM-IR entry block and the ABI lowering/constants hoisting into one
block, making the resulting block more likely to be maximal (indeed the
LLVM-IR entry block might not have been maximal).
llvm-svn: 289891
Lowering to llvm.cttz() will result in constant folding anyway
if the argument to ffs is a constant. Pointed out by Eli for
fls() in D14590.
llvm-svn: 289888
Summary: The relocation is missing mask so an address that has non-zero bits in 47:43 may overwrite the register number. (Frequently shows up as target register changed to `xzr`....)
Reviewers: t.p.northover, lhames
Subscribers: davide, aemerson, rengolin, llvm-commits
Differential Revision: https://reviews.llvm.org/D27609
llvm-svn: 289880
The code change for D27687 accidentally got committed along with the
main change in r289843. Revert it temporarily, so that I can recommit it
along with its test as intended.
llvm-svn: 289875
This used to be allowed before r289402 by default (before r289402 you
could have TBAA metadata on any instruction), and while I'm not sure
that it helps, it does sound reasonable enough to not fail the verifier
and we have out-of-tree users who use this.
llvm-svn: 289872
Summary:
Thin link efficiency improvement. After adding an importing candidate to
the worklist we might have later added it again with a higher threshold.
Skip it when popped from the worklist if we recorded a higher threshold
than the current worklist entry, it will get processed again at the
higher threshold when that entry is popped.
This required adding the summary's GUID to the worklist, so that it can
be used to query the recorded highest threshold for it when we pop from the
worklist.
Reviewers: mehdi_amini
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27696
llvm-svn: 289867
This patch sets the default value of the "-enable-cond-stores-vec" command line
option to "true".
Differential Revision: https://reviews.llvm.org/D27814
llvm-svn: 289863
Now that a new API to merge debug locations has been committed at r289661 (see
review D26256 for more details), we can use it to "improve" the code added by
revision r280995.
Instead of nulling the debugloc of a commoned instruction, we use the 'merged'
debug location. At the moment, this is just a no functional change since
function `DILocation::getMergedLocation()` is just a stub and would always
return a null location.
Differential Revision: https://reviews.llvm.org/D27804
llvm-svn: 289862
The assert could potentially fire (though no cases have been
encountered), so just check that the instruction we're handling
specially for rematerialization only has one def to begin with.
Reviewed by Wei Mi over email.
llvm-svn: 289861
It seems pointless to add a resource to an archive because it won't have
any symbols to link against (and link.exe doesn't have an equivalent of
--whole-archive), but lib.exe allows it for some reason.
llvm-svn: 289859
Min/max canonicalization (r287585) exposes the fact that we're missing combines for min/max patterns.
This patch won't solve the example that was attached to that thread, so something else still needs fixing.
The line between InstCombine and InstSimplify gets blurry here because sometimes the icmp instruction that
we want to fold to already exists, but sometimes it's the swapped form of what we want.
Corresponding changes for smax/umin/umax to follow.
Differential Revision: https://reviews.llvm.org/D27531
llvm-svn: 289855
MachineLegalizer used to be the name of both the class and the member,
causing GCC errors. r276522 fixed that by renaming the member to just
'Legalizer'. The 'class' workaround isn't necessary anymore; drop it.
llvm-svn: 289848
This patch checks that the SlowMisaligned128Store subtarget feature is set
when penalizing such stores in getMemoryOpCost.
Differential Revision: https://reviews.llvm.org/D27677
llvm-svn: 289845
This is split out from D27696, since it turned out to be a bug fix and
not part of the NFC efficiency change.
Keep the same adjusted (possibly decayed) threshold in both the worklist
and the ImportList. Otherwise if we encountered it first along a cold
path, the callee would be added to the worklist with a lower decayed
threshold than when it is later encountered along a hot path. But the
logic uses the threshold recorded in the ImportList entry to check if
we should re-add it, and without this patch the threshold recorded there
is the same along both paths so we don't re-add it. Using the
same possibly decayed threshold in the ImportList ensures we re-add it
later with the higher non-decayed hot path threshold.
llvm-svn: 289843
This is a tiny patch with a big pile of test changes.
This partially fixes PR27885:
https://llvm.org/bugs/show_bug.cgi?id=27885
My motivating case looks like this:
- vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,2]
- vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
- vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
+ vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
And this happens several times in the diffs. For chips with domain-crossing penalties,
the instruction count and size reduction should usually overcome any potential
domain-crossing penalty due to using an FP op in a sequence of int ops. For chips such
as recent Intel big cores and Atom, there is no domain-crossing penalty for shufps, so
using shufps is a pure win.
So the test case diffs all appear to be improvements except one test in
vector-shuffle-combining.ll where we miss an opportunity to use a shift to generate
zero elements and one test in combine-sra.ll where multiple uses prevent the expected
shuffle combining.
Differential Revision: https://reviews.llvm.org/D27692
llvm-svn: 289837
Move the check for the code model into isGlobalInSmallSectionImpl and return false (not in small section) for variables placed in sections prefixed with .ldata (workaround for a tool limitation).
llvm-svn: 289832
Simplify CFG will try to sink the last instruction in a series of basic blocks,
creating a "common" instruction in the successor block (sinkLastInstruction).
When it does this, the debug location of the single instruction should be the
merged debug locations of the commoned instructions.
Differential Revision: https://reviews.llvm.org/D27590
llvm-svn: 289828
Add the missing domain equivalences for movss, movsd, movd and movq zero extending loading instructions.
Differential Revision: https://reviews.llvm.org/D27684
llvm-svn: 289825
Specifically avoid implicit conversions from/to integral types to
avoid potential errors when changing the underlying type. For example,
a typical initialization of a "full" mask was "LaneMask = ~0u", which
would result in a value of 0x00000000FFFFFFFF if the type was extended
to uint64_t.
Differential Revision: https://reviews.llvm.org/D27454
llvm-svn: 289820
A number of new patterns for simplifying and/xor of icmp:
(icmp ne %x, 0) ^ (icmp ne %y, 0) => icmp ne %x, %y if the following is true:
1- (%x = and %a, %mask) and (%y = and %b, %mask)
2- %mask is a power of 2.
(icmp eq %x, 0) & (icmp ne %y, 0) => icmp ult %x, %y if the following is true:
1- (%x = and %a, %mask1) and (%y = and %b, %mask2)
2- Let %t be the smallest power of 2 where %mask1 & %t != 0. Then for any
%s that is a power of 2 and %s & %mask2 != 0, we must have %s <= %t.
For example if %mask1 = 24 and %mask2 = 16, setting %s = 16 and %t = 8
violates condition (2) above. So this optimization cannot be applied.
llvm-svn: 289813
In some situations, the BUILD_VECTOR node that builds a v18i8 vector by
a splat of an i8 constant will end up with signed 8-bit values and other
situations, it'll end up with unsigned ones. Handle both situations.
Fixes PR31340.
llvm-svn: 289804
This is essentially a recommit of r285893, but with a correctness fix. The
problem of the original commit was that this:
bic r5, r7, #31
cbz r5, .LBB2_10
got rewritten into:
lsrs r5, r7, #5
beq .LBB2_10
The result in destination register r5 is not the same and this is incorrect
when r5 is not dead. So this fix includes checking the uses of the AND
destination register. And also, compared to the original commit, some regression
tests didn't need changing anymore because of this extra check.
For completeness, this was the original commit message:
For the common pattern (CMPZ (AND x, #bitmask), #0), we can do some more
efficient instruction selection if the bitmask is one consecutive sequence of
set bits (32 - clz(bm) - ctz(bm) == popcount(bm)).
1) If the bitmask touches the LSB, then we can remove all the upper bits and
set the flags by doing one LSLS.
2) If the bitmask touches the MSB, then we can remove all the lower bits and
set the flags with one LSRS.
3) If the bitmask has popcount == 1 (only one set bit), we can shift that bit
into the sign bit with one LSLS and change the condition query from NE/EQ to
MI/PL (we could also implement this by shifting into the carry bit and
branching on BCC/BCS).
4) Otherwise, we can emit a sequence of LSLS+LSRS to remove the upper and lower
zero bits of the mask.
1-3 require only one 16-bit instruction and can elide the CMP. 4 requires two
16-bit instructions but can elide the CMP and doesn't require materializing a
complex immediate, so is also a win.
Differential Revision: https://reviews.llvm.org/D27761
llvm-svn: 289794
Summary:
GAS already allows flags for sections to be specified directly as a
numeric value. This functionality is particularly useful for setting
processor or application-specific values that may not be directly
supported or understood by LLVM. This patch allows LLVM to use numeric
section flag values verbatim if specified by the assembly file.
Reviewers: grosbach, rafael, t.p.northover, rengolin
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27451
llvm-svn: 289785
This implements execute-only support for ARM code generation, which
prevents the compiler from generating data accesses to code sections.
The following changes are involved:
* Add the CodeGen option "-arm-execute-only" to the ARM code generator.
* Add the clang flag "-mexecute-only" as well as the GCC-compatible
alias "-mpure-code" to enable this option.
* When enabled, literal pools are replaced with MOVW/MOVT instructions,
with VMOV used in addition for floating-point literals. As the MOVT
instruction is required, execute-only support is only available in
Thumb mode for targets supporting ARMv8-M baseline or Thumb2.
* Jump tables are placed in data sections when in execute-only mode.
* The execute-only text section is assigned section ID 0, and is
marked as unreadable with the SHF_ARM_PURECODE flag with symbol 'y'.
This also overrides selection of ELF sections for globals.
llvm-svn: 289784
CS.doesNotAccessMemory(ArgNo) and CS.onlyReadsMemory(ArgNo) calls
dataOperandHasImpliedAttr, so revert this part of r289765 because
it should not be necessary.
llvm-svn: 289768
When iterating over data operands in AA, don't make argument-attribute-specific
queries on bundle operands. Trying to fix self hosting...
llvm-svn: 289765
Summary:
This fixes an issue with MachineBlockPlacement due to a badly timed call
to `analyzeBranch` with `AllowModify` set to true. The timeline is as
follows:
1. `MachineBlockPlacement::maybeTailDuplicateBlock` calls
`TailDup.shouldTailDuplicate` on its argument, which in turn calls
`analyzeBranch` with `AllowModify` set to true.
2. This `analyzeBranch` call edits the terminator sequence of the block
based on the physical layout of the machine function, turning an
unanalyzable non-fallthrough block to a unanalyzable fallthrough
block. Normally MBP bails out of rearranging such blocks, but this
block was unanalyzable non-fallthrough (and thus rearrangeable) the
first time MBP looked at it, and so it goes ahead and decides where
it should be placed in the function.
3. When placing this block MBP fails to analyze and thus update the
block in keeping with the new physical layout.
Concretely, before (1) we have something like:
```
LBL0:
< unknown terminator op that may branch to LBL1 >
jmp LBL1
LBL1:
... A
LBL2:
... B
```
In (2), analyze branch simplifies this to
```
LBL0:
< unknown terminator op that may branch to LBL2 >
;; jmp LBL1 <- redundant jump removed
LBL1:
... A
LBL2:
... B
```
In (3), MachineBlockPlacement goes ahead with its plan of putting LBL2
after the first block since that is profitable.
```
LBL0:
< unknown terminator op that may branch to LBL2 >
;; jmp LBL1 <- redundant jump
LBL2:
... B
LBL1:
... A
```
and the program now has incorrect behavior (we no longer fall-through
from `LBL0` to `LBL1`) because MBP can no longer edit LBL0.
There are several possible solutions, but I went with removing the teeth
off of the `analyzeBranch` calls in TailDuplicator. That makes thinking
about the result of these calls easier, and breaks nothing in the lit
test suite.
I've also added some bookkeeping to the MachineBlockPlacement pass and
used that to write an assert that would have caught this.
Reviewers: chandlerc, gberry, MatzeB, iteratee
Subscribers: mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D27783
llvm-svn: 289764
Inserting a new key into a DenseMap potentially invalidates iterators into that
map. Trying to fix an issue from r289755 triggering this assertion:
Assertion `isHandleInSync() && "invalid iterator access!"' failed.
llvm-svn: 289757
After r289755, the AssumptionCache is no longer needed. Variables affected by
assumptions are now found by using the new operand-bundle-based scheme. This
new scheme is more computationally efficient, and also we need much less
code...
llvm-svn: 289756
There was an efficiency problem with how we processed @llvm.assume in
ValueTracking (and other places). The AssumptionCache tracked all of the
assumptions in a given function. In order to find assumptions relevant to
computing known bits, etc. we searched every assumption in the function. For
ValueTracking, that means that we did O(#assumes * #values) work in InstCombine
and other passes (with a constant factor that can be quite large because we'd
repeat this search at every level of recursion of the analysis).
Several of us discussed this situation at the last developers' meeting, and
this implements the discussed solution: Make the values that an assume might
affect operands of the assume itself. To avoid exposing this detail to
frontends and passes that need not worry about it, I've used the new
operand-bundle feature to add these extra call "operands" in a way that does
not affect the intrinsic's signature. I think this solution is relatively
clean. InstCombine adds these extra operands based on what ValueTracking, LVI,
etc. will need and then those passes need only search the users of the values
under consideration. This should fix the computational-complexity problem.
At this point, no passes depend on the AssumptionCache, and so I'll remove
that as a follow-up change.
Differential Revision: https://reviews.llvm.org/D27259
llvm-svn: 289755
Most of the PowerPC64 code generation for the ELF ABI is already PIC.
There are four main exceptions:
(1) Constant pointer arrays etc. should in writeable sections.
(2) The TOC restoration NOP after a call is needed for all global
symbols. While GNU ld has a workaround for questionable GCC self-calls,
we trigger the checks for calls from COMDAT sections as they cross input
sections and are therefore not considered self-calls. The current
decision is questionable and suboptimal, but outside the scope of the
change.
(3) TLS access can not use the initial-exec model.
(4) Jump tables should use relative addresses. Note that the current
encoding doesn't work for the large code model, but it is more compact
than the default for any non-trivial jump table. Improving this is again
beyond the scope of this change.
At least (1) and (3) are assumptions made in target-independent code and
introducing additional hooks is a bit messy. Testing with clang shows
that a -fPIC binary is 600KB smaller than the corresponding -fno-pic
build. Separate testing from improved jump table encodings would explain
only about 100KB or so. The rest is expected to be a result of more
aggressive immediate forming for -fno-pic, where the -fPIC binary just
uses TOC entries.
This change brings the LLVM output in line with the GCC output, other
PPC64 compilers like XLC on AIX are known to produce PIC by default
as well. The relocation model can still be provided explicitly, i.e.
when using MCJIT.
One test case for case (1) is included, other test cases with relocation
mode sensitive behavior are wired to static for now. They will be
reviewed and adjusted separately.
Differential Revision: https://reviews.llvm.org/D26566
llvm-svn: 289743
I've chosen to remove NVPTXInstrInfo::CanTailMerge but not
NVPTXInstrInfo::isLoadInstr and isStoreInstr (which are also dead)
because while the latter two are reasonably useful utilities, the former
cannot be used safely: It relies on successful address space inference
to identify writes to shared memory, but addrspace inference is a
best-effort thing.
llvm-svn: 289740
The original motivation for this patch comes from wanting to canonicalize
more IR to selects and also canonicalizing min/max.
If we're going to do that, we need more backend fixups to undo select codegen
when simpler ops will do. I chose AArch64 for the tests because that shows the
difference in the simplest way. This should fix:
https://llvm.org/bugs/show_bug.cgi?id=31175
Differential Revision: https://reviews.llvm.org/D27489
llvm-svn: 289738
When getting attributes it is sometimes nicer to use Optional<T> some of the time instead of magic values. I tried to cut over to only using the Optional values but it made many of the call sites very messy, so it makes sense the leave in the calls that can return a default value. Otherwise code that looks like this:
uint64_t CallColumn = Die.getAttributeValueAsAddress(DW_AT_call_line, 0);
Has to be turned into:
uint64_t CallColumn = 0;
if (auto CallColumnValue = Die.getAttributeValueAsAddress(DW_AT_call_line))
CallColumn = *CallColumnValue;
The first snippet of code looks much better. But in cases where you want an offset that may or may not be there, the following code looks better:
if (auto StmtOffset = Die.getAttributeValueAsSectionOffset(DW_AT_stmt_list)) {
// Use StmtOffset
}
Differential Revision: https://reviews.llvm.org/D27772
llvm-svn: 289731
Summary:
Previously they were defined as a 2D char array in a header file. This
is kind of overkill -- we can let the linker lay out these strings
however it pleases. While we're at it, we might as well just inline
these constants where they're used, as each of them is used only once.
Also move NVPTXUtilities.{h,cpp} into namespace llvm.
Reviewers: tra
Subscribers: jholewinski, mgorny, llvm-commits
Differential Revision: https://reviews.llvm.org/D27636
llvm-svn: 289728
Summary: SampleProfileLoader pass may be invoked twice by LTO. The 2nd pass should not append more summary info as it is already preset by the 1st pass.
Reviewers: eraman, davidxl
Subscribers: mehdi_amini, llvm-commits
Differential Revision: https://reviews.llvm.org/D27733
llvm-svn: 289725
Also, udpate the ~60 failing tests in the tree which did
not contain a valid datalayout.
This fixes PR31123. lld will be updated in a following patch,
immediately after this is committed.
Differential Revision: https://reviews.llvm.org/D27082
llvm-svn: 289719
Summary: We used to create SampleProfileLoader pass in clang. This makes LTO/ThinLTO unable to add this pass in the linker plugin. This patch moves the SampleProfileLoader pass creation from clang to llvm pass manager builder.
Reviewers: tejohnson, davidxl, dnovillo
Subscribers: llvm-commits, mehdi_amini
Differential Revision: https://reviews.llvm.org/D27743
llvm-svn: 289714
Given that INSERT_VECTOR_ELT operates on D registers anyway, combining
64-bit vectors into a 128-bit vector is basically free. Therefore, try
to split BUILD_VECTOR nodes before giving up and lowering them to a series
of INSERT_VECTOR_ELT instructions. Sometimes this allows dramatically
better lowerings; see testcases for examples. Inspired by similar code
in the x86 backend for AVX.
Differential Revision: https://reviews.llvm.org/D27624
llvm-svn: 289706
If all the operands to a phi node are compares that have a RHS constant,
instcombine will try to pull them through the phi node, combining them into
a single operation. When it does this, the debug location of the new op
should be the merged debug locations of the phi node arguments.
Patch 8 of 8 for D26256. Folding of a compare that has a RHS constant.
Differential Revision: https://reviews.llvm.org/D26256
llvm-svn: 289704
Currently, there are substantial problems forming vld1_dup even if the
VDUP survives legalization. The lack of an actual node
leads to terrible results: not only can we not form post-increment vld1_dup
instructions, but we form scalar pre-increment and post-increment
loads which force the loaded value into a GPR. This patch fixes that
by combining the vdup+load into an ARMISD node before DAGCombine
messes it up.
Also includes a crash fix for vld2_dup (see testcase @vld2dupi8_postinc_variable).
Differential Revision: https://reviews.llvm.org/D27694
llvm-svn: 289703
This way it will be easier to expand DIFile (e.g., to contain checksum) without the need to modify the createCompileUnit() API.
Reviewers: llvm-commits, rnk
Differential Revision: https://reviews.llvm.org/D27762
llvm-svn: 289702
If all the operands to a phi node are a binop with a RHS constant, instcombine
will try to pull them through the phi node, combining them into a single
operation. When it does this, the debug location of the new op should be the
merged debug locations of the phi node arguments.
Patch 7 of 8 for D26256. Folding of a binop with RHS constant.
Differential Revision: https://reviews.llvm.org/D26256
llvm-svn: 289699
Summary:
Move GVNHoist to later in the optimization pipeline, specifically, to
the function simplification part of the pipeline. The new pipeline
location allows GVNHoist to run on a function after its callees have
been inlined but before the function has been considered for inlining
into its callers, exposing more opportunities for hoisting.
Performance results on AArch64 kryo:
Improvements:
Benchmarks/CoyoteBench/fftbench -24.952%
spec2006/bzip2 -4.071%
internal bmark -3.177%
Benchmarks/PAQ8p/paq8p -1.754%
spec2000/perlbmk -1.328%
spec2006/h264ref -1.140%
Regressions:
internal bmark +1.818%
Benchmarks/mafft/pairlocalalign +1.084%
Reviewers: sebpop, dberlin, hiraditya
Subscribers: aemerson, mehdi_amini, mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D27722
llvm-svn: 289696
If all the operands to a phi node are a cast, instcombine will try to pull
them through the phi node, combining them into a single cast. When it does
this, the debug location of the new cast should be the merged debug locations
of the phi node arguments.
Patch 6 of 8 for D26256. Folding of a cast operation.
Differential Revision: https://reviews.llvm.org/D26256
llvm-svn: 289693
This patch fixes the linkage for __crashtracer_info__, making it have the proper mangling (extern "C") and linkage (private extern).
It also adds a new PrettyStackTrace type, allowing LLDB to adopt this instead of Host::SetCrashDescriptionWithFormat().
Without this patch, CrashTracer on macOS won't pick up pretty stack traces from any LLVM client.
An LLDB commit adopting this API will follow shortly.
Differential Revision: https://reviews.llvm.org/D27683
llvm-svn: 289689
If all the operands to a phi node are a load, instcombine will try to pull
them through the phi node, combining them into a single load. When it does
this, the debug location of the new load should be the merged debug locations
of the phi node arguments.
Patch 5 of 8 for D26256. Folding of a load operation.
Differential Revision: https://reviews.llvm.org/D26256
llvm-svn: 289688
If all the operands to a phi node are getelementptr, instcombine
will try to pull them through the phi node, combining them into a single
operation. When it does this, the debug location of the new getelementptr
should be the merged debug locations of the phi node arguments.
Patch 4 of 8 for D26256. Folding of a getelementptr operation.
Differential Revision: https://reviews.llvm.org/D26256
llvm-svn: 289684
Adds a "Discriminator" field to struct DILineInfo, which defaults to 0.
Fills out the "Discriminator" field in DILineInfo in DWARFDebugLine::LineTable::getFileLineInfoForAddress().
in order to have a slightly nicer interface in getFileLineInfoForAddress.
Patch by Simon Que!
Differential Revision: https://reviews.llvm.org/D27649
llvm-svn: 289683
If all the operands to a phi node are of the same operation, instcombine
will try to pull them through the phi node, combining them into a single
operation. When it does this, the debug location of the operation should
be the merged debug locations of the phi node arguments.
Patch 3 of 8 for D26256. Folding of a compare operation.
Differential Revision: https://reviews.llvm.org/D26256
llvm-svn: 289681
If all the operands to a phi node are of the same operation, instcombine
will try to pull them through the phi node, combining them into a single
operation. When it does this, the debug location of the operation should
be the merged debug locations of the phi node arguments.
Patch 2 of 8 for D26256. Folding of a binary operation.
Differential Revision: https://reviews.llvm.org/D26256
llvm-svn: 289679
Since SGPRs should spill to VGPRs, they should be allocated first.
I don't think this is sufficient for SGPRs to always spill to
VGPRs though.
llvm-svn: 289671
Summary: We used to create SampleProfileLoader pass in clang. This makes LTO/ThinLTO unable to add this pass in the linker plugin. This patch moves the SampleProfileLoader pass creation from clang to llvm pass manager builder.
Reviewers: tejohnson, davidxl, dnovillo
Subscribers: llvm-commits, mehdi_amini
Differential Revision: https://reviews.llvm.org/D27743
llvm-svn: 289669
Retrying after fixing after removing load-store factoring through
token factors in favor of improved token factor operand pruning
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search which only checks for parallel stores
through the chain subgraph. This is cleaner as the separation of
non-interfering loads/stores from the store-merging logic.
Whem merging stores, search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.lli -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and
merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 289659
Generalize sdiv/udiv/srem/urem combines using APInt::isPowerOf2, which only works for const/splat-const values, to call SelectionDAG::isKnownToBeAPowerOfTwo instead which recognises many more cases.
Added a DAGCombiner::BuildLogBase2 helper since PowerOf2 combines often involve taking the log2 of such a value.
Differential Revision: https://reviews.llvm.org/D27714
llvm-svn: 289654
adding new optimization opportunity by adding new X86ISelLowering pattern. The test case was shown in https://llvm.org/bugs/show_bug.cgi?id=30945.
Test explanation:
Select gets three arguments mask, op and op2. In this case, the Mask is a result of ICMP. The ICMP instruction compares (with equal operand) the zero initializer vector and the result of the first ICMP.
In general, The result of "cmp eq, op1, zero initializers" is "not(op1)" where op1 is a mask. By rearranging of the two arguments inside the Select instruction, we can get the same result. Without the necessary of the middle phase ("cmp eq, op1, zero initializers").
Missed optimization opportunity:
vpcmpled %zmm0, %zmm1, %k0
knotw %k0, %k1
can be combine to
vpcmpgtd %zmm0, %zmm2, %k1
Reviewers:
1. delena
2. igorb
Commited after check all
Differential Revision: https://reviews.llvm.org/D27160
llvm-svn: 289653
At least the plugin used by the LibreOffice build
(<https://wiki.documentfoundation.org/Development/Clang_plugins>) indirectly
uses those members (through inline functions in LLVM/Clang include files in turn
using them), but they are not exported by utils/extract_symbols.py on Windows,
and accessing data across DLL/EXE boundaries on Windows is generally
problematic.
Differential Revision: https://reviews.llvm.org/D26671
llvm-svn: 289647
Currently, the error messages we emit for the .org directive when the
expression is not absolute or is out of range do not include the line
number of the directive, so it can be hard to track down the problem if
a file contains many .org directives.
This patch stores the source location in the MCOrgFragment, so that it
can be used for diagnostics emitted during layout.
Since layout is an iterative process, and the errors are detected during
each iteration, it would have been possible for errors to be reported
multiple times. To prevent this, I've made the assembler bail out after
each iteration if any errors have been reported. This will still allow
multiple unrelated errors to be reported in the common case where they
are all detected in the first round of layout.
Differential Revision: https://reviews.llvm.org/D27411
llvm-svn: 289643
This change aims to unify and correct our logic for when we need to allow for
the possibility of the linker adding a TOC restoration instruction after a
call. This comes up in two contexts:
1. When determining tail-call eligibility. If we make a tail call (i.e.
directly branch to a function) then there is no place for the linker to add
a TOC restoration.
2. When determining when we need to add a nop instruction after a call.
Likewise, if there is a possibility that the linker might need to add a
TOC restoration after a call, then we need to put a nop after the call
(the bl instruction).
First problem: We were using similar, but different, logic to decide (1) and
(2). This is just wrong. Both the resideInSameModule function (used when
determining tail-call eligibility) and the isLocalCall function (used when
deciding if the post-call nop is needed) were supposed to be determining the
same underlying fact (i.e. might a TOC restoration be needed after the call).
The same logic should be used in both places.
Second problem: The logic in both places was wrong. We only know that two
functions will share the same TOC when both functions come from the same
section of the same object. Otherwise the linker might cause the functions to
use different TOC base addresses (unless the multi-TOC linker option is
disabled, in which case only shared-library boundaries are relevant). There are
a number of factors that can cause functions to be placed in different sections
or come from different objects (-ffunction-sections, explicitly-specified
section names, COMDAT, weak linkage, etc.). All of these need to be checked.
The existing logic only checked properties of the callee, but the properties of
the caller must also be checked (for example, calling from a function in a
COMDAT section means calling between sections).
There was a conceptual error in the resideInSameModule function in that it
allowed tail calls to functions with weak linkage and protected/hidden
visibility. While protected/hidden visibility does prevent the function
implementation from being replaced at runtime (via interposition), it does not
prevent the linker from using an alternate implementation at link time (i.e.
using some strong definition to replace the provided weak one during linking).
If this happens, then we're still potentially looking at a required TOC
restoration upon return.
Otherwise, in general, the post-call nop is needed wherever ELF interposition
needs to be supported. We don't currently support ELF interposition at the IR
level (see http://lists.llvm.org/pipermail/llvm-dev/2016-November/107625.html
for more information), and I don't think we should try to make it appear to
work in the backend in spite of that fact. This will yield subtle bugs if
interposition is attempted. As a result, regardless of whether we're in PIC
mode, we don't assume that we need to add the nop to support the possibility of
ELF interposition. However, the necessary check is in place (i.e. calling
GV->isInterposable and TM.shouldAssumeDSOLocal) so when we have functions for
which interposition is allowed at the IR level, we'll add the nop as necessary.
In the mean time, we'll generate more tail calls and fewer nops when compiling
position-independent code.
Differential Revision: https://reviews.llvm.org/D27231
llvm-svn: 289638
Summary:
The motivation is to support better the -object_path_lto option on
Darwin. The linker needs to write down the generate object files on
disk for later use by lldb or dsymutil (debug info are not present
in the final binary). We're moving this into libLTO so that we can
be smarter when a cache is enabled and hard-link when possible
instead of duplicating the files.
Reviewers: tejohnson, deadalnix, pcc
Subscribers: dexonsmith, llvm-commits
Differential Revision: https://reviews.llvm.org/D27507
llvm-svn: 289631
Now we only pass bit 0 of the DemandedElts to optimize operand 1 as we recurse since the upper bits are unused. Similarly we clear bit 0 for optimizing operand 0.
Also calculate UndefElts correctly.
Simplify InstCombineCalls for these instrinics to just call SimplifyDemandedVectorElts for the call instrution to reuse this support.
llvm-svn: 289629
Now we only pass bit 0 of the DemandedElts to optimize operand 1 as we recurse since the upper bits are unused.
Also calculate UndefElts correctly.
Simplify InstCombineCalls for these instrinics to just call SimplifyDemandedVectorElts for the call instrution to reuse this support.
llvm-svn: 289628
Bots are broken and needs to be fixed before having this on by default.
The feature was committed in r289619.
I tried to disable it in r289624 and failed because it was initialized in two places.
llvm-svn: 289626
Follow-up to r289256, address a FIXME to avoid resetting the column
number. This reduced .debug_line by 2.6% in a RelWithDebInfo
self-build of clang.
llvm-svn: 289620
Summary:
Given a flag (-mllvm -reverse-iterate) this patch will enable iteration of SmallPtrSet in reverse order.
The idea is to compile the same source with and without this flag and expect the code to not change.
If there is a difference in codegen then it would mean that the codegen is sensitive to the iteration order of SmallPtrSet.
This is enabled only with LLVM_ENABLE_ABI_BREAKING_CHECKS.
Reviewers: chandlerc, dexonsmith, mehdi_amini
Subscribers: mgorny, emaste, llvm-commits
Differential Revision: https://reviews.llvm.org/D26718
llvm-svn: 289619
Summary:
This patch will add loop metadata on the pre and post loops generated by IRCE.
Currently, we have metadata for disabling optimizations such as vectorization,
unrolling, loop distribution and LICM versioning (and confirmed that these
optimizations check for the metadata before proceeding with the transformation).
The pre and post loops generated by IRCE need not go through loop opts (since
these are slow paths).
Added two test cases as well.
Reviewers: sanjoy, reames
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D26806
llvm-svn: 289588
We currently check if the exact trip count is known and is smaller than the
"tiny loop" bound. We should be checking the maximum bound on the trip count
instead.
Differential Revision: https://reviews.llvm.org/D27690
llvm-svn: 289583
Summary:
This patch aims to generalize matching of the strided store accesses to more general masks.
The more general rule is to have consecutive accesses based on the stride:
[x, y, ... z, x+1, y+1, ...z+1, x+2, y+2, ...z+2, ...]
All elements in the masks need not form a contiguous space, there may be gaps.
As before, undefs are allowed and filled in with adjacent element loads.
Reviewers: HaoLiu, mssimpso
Subscribers: mkuper, delena, llvm-commits
Differential Revision: https://reviews.llvm.org/D23646
llvm-svn: 289573
This is not always behaving as expected as it turns out block live-in
lists are only correct most of the time. Still waiting for reviews on
https://reviews.llvm.org/D27559 to have them correct all of the time.
See also http://llvm.org/PR31361, rdar://25117107
This reverts commit r288567.
This reverts commit r288561.
llvm-svn: 289570
We were using the correct pseudo-instruction, but because the operand's flags
weren't set correctly we still ended up emitting incorrect relocations during
MC lowering.
llvm-svn: 289566
Many places pass around a DWARFDebugInfoEntryMinimal and a DWARFUnit. It is easy to get things wrong by using the wrong DWARFUnit with a DWARFDebugInfoEntryMinimal. This patch creates a DWARFDie class that contains the DWARFUnit and DWARFDebugInfoEntryMinimal objects so that they can't get out of sync. All attribute extraction has been moved out of DWARFDebugInfoEntryMinimal and into DWARFDie. DWARFDebugInfoEntryMinimal was also renamed to DWARFDebugInfoEntry.
DWARFDie objects are temporary objects that are used by clients and contain 2 pointers that you always need to have anyway. Keeping them grouped will avoid errors and simplify many of the attribute extracting APIs by not having to pass in a DWARFUnit.
Differential Revision: https://reviews.llvm.org/D27634
llvm-svn: 289565
Windows uses some macros to replace DeleteFile() by DeleteFileA() or
DeleteFileW(). This was causing an error at link time.
DeleteFile was renamed to RemoveFile().
Differential Revision: https://reviews.llvm.org/D27577
llvm-svn: 289563
Implement DirName from scratch to avoid dependencies on external libraries.
It's based on MSDN documentation for Naming Files, Paths, and Namespaces.
The algorithm can't simply start from the end and look backwards for the
first separator, because we need to preserve the prefix that represent
the root location. We shouldn't remove anything there. In Windows we
have many different options, like:
\\Server\Share\ , \ , C: , C:\ , \\?\C:\ , \\?\UNC\Server\Share\
We remove the last separator in the rest of the path, if it exists.
It was implemented to have a similar behaviour to dirname() in linux,
removing trailing separators, returning "." when the path doesn't
contain separators, etc.
Differential Revision: https://reviews.llvm.org/D27579
llvm-svn: 289562
I added a new flag RunningCB to know if the Fuzzer's main thread is
running the CB function, instead of using (!CurrentUnitSize).
(!CurrentUnitSize) doesn't work properly. For example, in FuzzerLoop.cpp,
inside ShuffleAndMinimize() function, we execute the callback with an
empty string (size=0). Previous implementation failed to detect timeouts
in that execution.
Also, I add a regression test for that case.
Differential Revision: https://reviews.llvm.org/D27433
llvm-svn: 289561
Reorganize #includes to follow LLVM Coding Standards.
Include some missing headers. Required to use `Printf()`.
Aside from that, this patch contains no functional change.
It is purely a re-organization.
Differential Revision: https://reviews.llvm.org/D27363
llvm-svn: 289560
std:🧵:hardware_concurrency() returns an unsigned, so I modify
NumberOfCpuCores() to return unsigned too.
The number of cpus is used to define the number of workers, so I decided
to update the worker and jobs flags to be declared as unsigned too.
Differential Revision: https://reviews.llvm.org/D27685
llvm-svn: 289559
Use unsigned for PID instead of signed int. GetCurrentProcessId() returns
an unsigned (DWORD) so we must be sure we can deal with all possible values.
I use a long unsigned to be sure it can hold a 32 bit unsigned (DWORD).
Differential Revision: https://reviews.llvm.org/D27281
llvm-svn: 289558
Add new flags to FuzzingOptions to represent the different conditions
on the signal handling. These options are passed when calling
SetSignalHandler().
This changes simplify the implementation of Windows's exception
handling. Now we can define a unique handler for all the exceptions.
Differential Revision: https://reviews.llvm.org/D27238
llvm-svn: 289557
Summary:
This is last in of a series of patches to evolve ADCE.cpp to support
removing of unnecessary control flow.
This patch adds the code to update the control and data flow graphs
to remove the dead control flow.
Also update unit tests to test the capability to remove dead,
may-be-infinite loop which is enabled by the switch
-adce-remove-loops.
Previous patches:
D23824 [ADCE] Add handling of PHI nodes when removing control flow
D23559 [ADCE] Add control dependence computation
D23225 [ADCE] Modify data structures to support removing control flow
D23065 [ADCE] Refactor anticipating new functionality (NFC)
D23102 [ADCE] Refactoring for new functionality (NFC)
Reviewers: dberlin, majnemer, nadav, mehdi_amini
Subscribers: llvm-commits, david2050, freik, twoh
Differential Revision: https://reviews.llvm.org/D24918
llvm-svn: 289548
Match a pattern where a wide type scalar value is loaded by several narrow loads and combined by shifts and ors. Fold it into a single load or a load and a bswap if the targets supports it.
Assuming little endian target:
i8 *a = ...
i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
=>
i32 val = *((i32)a)
i8 *a = ...
i32 val = (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | a[3]
=>
i32 val = BSWAP(*((i32)a))
This optimization was discussed on llvm-dev some time ago in "Load combine pass" thread. We came to the conclusion that we want to do this transformation late in the pipeline because in presence of atomic loads load widening is irreversible transformation and it might hinder other optimizations.
Eventually we'd like to support folding patterns like this where the offset has a variable and a constant part:
i32 val = a[i] | (a[i + 1] << 8) | (a[i + 2] << 16) | (a[i + 3] << 24)
Matching the pattern above is easier at SelectionDAG level since address reassociation has already happened and the fact that the loads are adjacent is clear. Understanding that these loads are adjacent at IR level would have involved looking through geps/zexts/adds while looking at the addresses.
The general scheme is to match OR expressions by recursively calculating the origin of individual bits which constitute the resulting OR value. If all the OR bits come from memory verify that they are adjacent and match with little or big endian encoding of a wider value. If so and the load of the wider type (and bswap if needed) is allowed by the target generate a load and a bswap if needed.
Reviewed By: hfinkel, RKSimon, filcab
Differential Revision: https://reviews.llvm.org/D26149
llvm-svn: 289538
N32 relocations are only correct for individual relocations at the moment.
Support for relocation composition will follow in a later patch.
Patch By: Daniel Sanders
Reviwers: vkalintiris, atanasyan
Differential Revision: https://reviews.llvm.org/D27467
llvm-svn: 289532
In certain cases it is possible that transient instructions such as
%reg = IMPLICIT_DEF as a single instruction in a basic block to reach
the MipsHazardSchedule pass. This patch teaches MipsHazardSchedule to
properly look through such cases.
Reviewers: vkalintiris, zoran.jovanovic
Differential Revision: https://reviews.llvm.org/D27209
llvm-svn: 289529
Only the lower bits of the input element are used. And only the lower element can be undef since the upper bits are zeroed.
Have InstCombineCalls call SimplifyDemandedVectorElts for these intrinsics to reuse this support.
llvm-svn: 289523
Summary:
Since we don't break BBs for function calls. We might get some insane counts
(wrap of unsigned) in the presence of noreturn calls.
This patch sets these counts to zero instead of the wrapped number.
Reviewers: davidxl
Subscribers: xur, eraman, llvm-commits
Differential Revision: https://reviews.llvm.org/D27602
llvm-svn: 289521
Summary:
This pass will be used to relax instructions which use out of bounds
memory accesses to equivalent operations that can work with the
addresses.
The pass currently implements relaxation for the STDWPtrQRr instruction.
Without this pass, an assertion error would be hit in the pseudo expansion pass.
In the future, we will need to add more instructions to this pass. We can do
that on a case-by-case basic.
Reviewers: arsenm, kparzysz
Subscribers: wdng, llvm-commits, mgorny
Differential Revision: https://reviews.llvm.org/D27650
llvm-svn: 289517
The general idea here is to get enough of the existing restrictions out of the way that the already existing folding logic in foldMemoryOperand can kick in for STATEPOINTs and fold references to immutable stack slots. The key changes are:
Support for folding multiple operands at once which reference the same load
Support for folding multiple loads into a single instruction
Walk all the operands of the instruction for varidic instructions (this is a bug fix!)
Once this lands, I'll post another patch which refactors the TII interface here. There's nothing actually x86 specific about the x86 code used here.
Differential Revision: https://reviews.llvm.org/D24103
llvm-svn: 289510
The stack slot reuse code had a really amusing bug. We ended up only reusing a stack slot exact once (initial use + reuse) within a basic block. If we had a third statepoint to process, we ended up allocating a new set of stack slots. If we crossed a basic block boundary, the set got cleared. As a result, code which is invoke heavy doesn't see the problem, but multiple calls within a basic block does. Net result: as we optimize invokes into calls, lowering gets worse.
The root error here is that the bitmap uses by the custom allocator wasn't kept in sync. The result was that we ended up resizing the bitmap on the next statepoint (to handle the cross block case), reset the bit once, but then never reset it again.
Differential Revision: https://reviews.llvm.org/D25243
llvm-svn: 289509
Implemented timeouts for Windows using TimerQueueTimers.
Timers are used to supervise the time of execution of the
callback function that is being fuzzed.
Differential Revision: https://reviews.llvm.org/D27237
llvm-svn: 289495
Power8 has MTVSRWZ but no LXSIBZX/LXSIHZX, so move 1 or 2 bytes to VSR through MTVSRWZ is much faster than store the extended value into stack and load it with LXSIWZX.
This patch fixes pr31144.
Differential Revision: https://reviews.llvm.org/D27287
llvm-svn: 289473
Summary:
I looked at libgcc's implementation (which is based on the paper,
Software for Doubled-Precision Floating-Point Computations", by Seppo Linnainmaa,
ACM TOMS vol 7 no 3, September 1981, pages 272-283.) and made it generic to
arbitrary IEEE floats.
Differential Revision: https://reviews.llvm.org/D26817
llvm-svn: 289472
This patch ensures the correct minimum bit width during type-shrinking.
Previously when type-shrinking, we always sign-extended values back to their
original width. However, if we are going to sign-extend, and the sign bit is
unknown, we have to increase the minimum bit width by one bit so the
sign-extend will fill the upper bits correctly. If the sign bit is known to be
zero, we can perform a zero-extend instead. This should fix PR31243.
Reference: https://llvm.org/bugs/show_bug.cgi?id=31243
Differential Revision: https://reviews.llvm.org/D27466
llvm-svn: 289470
DWARF specifies that "line 0" really means "no appropriate source
location" in the line table. By default, use this for branch targets
and some other cases that have no specified source location, to
prevent inheriting unfortunate line numbers from physically preceding
instructions (which might be from completely unrelated source).
Updated patch allows enabling or suppressing this behavior for all
unspecified source locations.
Differential Revision: http://reviews.llvm.org/D24180
llvm-svn: 289468
Summary:
I'm planning on changing the way we load metadata to enable laziness.
I'm getting lost in this gigantic files, and gigantic class that is the bitcode
reader. This is a first toward splitting it in a few coarse components that
are more easily understandable.
Reviewers: pcc, tejohnson
Subscribers: mgorny, llvm-commits, dexonsmith
Differential Revision: https://reviews.llvm.org/D27646
llvm-svn: 289461
Summary:
Compiling with GCC 5 or later can fail with a bogus error "constructor
required before non-static data member for
llvm::ValueEnumerator::MDRange::First has been parsed".
This was originally fixed upstream in GCC PR 70528, but later this fix
was reverted, and released versions of GCC still show the bogus error.
To work around this, replace MDRange's declaration of a default
constructor with a definition.
Reviewers: dexonsmith, rsmith, rivanvx
Subscribers: llvm-commits, dim, dexonsmith
Differential Revision: https://reviews.llvm.org/D18730
llvm-svn: 289454
Reverts r289412. It caused an OOB PHI operand access in instcombine when
ASan is enabled. Reduction in progress.
Also reverts "[SCEVExpander] Add a test case related to r289412"
llvm-svn: 289453
Summary:
While the result is constant across a single primitive, each pixel
shader wave can have pixels from multiple primitives.
Reviewers: tstellarAMD, arsenm
Subscribers: kzhuravl, wdng, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D27572
llvm-svn: 289447
We could truncate the condition and then try to fold the add into the
original condition value causing wrong case constants to be used.
Move the offset transform ahead of the truncate transform and return
after each transform, so there's no chance of getting confused values.
Fix for:
https://llvm.org/bugs/show_bug.cgi?id=31260
llvm-svn: 289442
Summary:
As discussed on mailing list, for ThinLTO importing we don't need
to import all the fields of the DICompileUnit. Don't import enums,
macros, retained types lists. Also only import local scoped imported
entities. Since we don't currently import any global variables,
we also don't need to import the list of global variables (added an
assert to verify none are being imported).
This is being done by pre-populating the value map entries to map
the unneeded metadata to nullptr. For the imported entities, we can
simply replace the source module's list with a new list containing
only those needed imported entities. This is done in the IRLinker
constructor so that value mapping automatically does the desired
mapping.
Reviewers: mehdi_amini, dexonsmith, dblaikie, aprantl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27635
llvm-svn: 289441
PMULDQ returns the 64-bit result of the signed multiplication of the lower 32-bits of vXi64 vector inputs, we can lower with this if the sign bits stretch that far.
Differential Revision: https://reviews.llvm.org/D27657
llvm-svn: 289426
These intrinsics only load a single element. We should use sse_loadf32/f64 to give more options of what loads it can match.
Currently these instructions are often only getting their load folded thanks to the load folding in the peephole pass. I plan to add more types of loads to sse_load_f32/64 so we can match without the peephole.
llvm-svn: 289423
Summary:
These intrinsic instructions are all selected from intrinsics that have well defined behavior for where the upper bits come from. It's not the same place as the lower bits.
As you can see we were suppressing load folding for these instructions in some cases. In none of the cases was the separate load helping avoid a partial dependency on the destination register. So we should just go ahead and allow the load to be folded.
Only foldMemoryOperand was suppressing folding for these. They all have patterns for folding sse_load_f32/f64 that aren't gated with OptForSize, but sse_load_f32/f64 doesn't allow 128-bit vector loads. It only allows scalar_to_vector and vzmovl of scalar loads to match. There's no reason we can't allow a 128-bit vector load to be narrowed so I would like to fix sse_load_f32/f64 to allow that. And if I do that it changes some of these same test cases to fold the load too.
Reviewers: spatel, zvi, RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27611
llvm-svn: 289419
SCEVExpand computes the insertion point for the components of a SCEV to be code
generated. When it comes to generating code for a division, SCEVexpand would
not be able to check (at compilation time) all the conditions necessary to avoid
a division by zero. The patch disables hoisting of expressions containing
divisions by anything other than non-zero constants in order to avoid hoisting
these expressions past conditions that should hold before doing the division.
The patch passes check-all on x86_64-linux.
Differential Revision: https://reviews.llvm.org/D27216
llvm-svn: 289412
When the load node which the broadcast instruction broadcasts has multiple uses, it cannot be folded.
A fallback pattern is added to catch these cases and provide another solution.
Differential Revision: https://reviews.llvm.org/D27661
llvm-svn: 289404
Summary:
Fix a corner case in `MDNode::getMostGenericTBAA` where we can sometimes
generate invalid TBAA metadata.
Reviewers: chandlerc, hfinkel, mehdi_amini, manmanren
Subscribers: mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D26635
llvm-svn: 289403
Summary:
This change adds some verification in the IR verifier around struct path
TBAA metadata.
Other than some basic sanity checks (e.g. we get constant integers where
we expect constant integers), this checks:
- That by the time an struct access tuple `(base-type, offset)` is
"reduced" to a scalar base type, the offset is `0`. For instance, in
C++ you can't start from, say `("struct-a", 16)`, and end up with
`("int", 4)` -- by the time the base type is `"int"`, the offset
better be zero. In particular, a variant of this invariant is needed
for `llvm::getMostGenericTBAA` to be correct.
- That there are no cycles in a struct path.
- That struct type nodes have their offsets listed in an ascending
order.
- That when generating the struct access path, you eventually reach the
access type listed in the tbaa tag node.
Reviewers: dexonsmith, chandlerc, reames, mehdi_amini, manmanren
Subscribers: mcrosier, llvm-commits
Differential Revision: https://reviews.llvm.org/D26438
llvm-svn: 289402
We have found that -- when the selected subarchitecture has a scheduling model
and we are not optimizing for size -- the machine-instruction combiner uses a
too-simple algorithm to compute the cost of one of the two alternatives [before
and after running a combining pass on a section of code], and therefor it throws
away the combination results too often.
This fix has the potential to help any ISA with the potential to combine
instructions and for which at least one subarchitecture has a scheduling model.
As of now, this is only known to definitely affect AArch64 subarchitectures with
a scheduling model.
Regression tested on AMD64/GNU-Linux, new test case tested to fail on an
unpatched compiler and pass on a patched compiler.
Patch by Abe Skolnik and Sebastian Pop.
llvm-svn: 289399
This is NFC today, but won't be once D27216 (or an equivalent patch) is
in.
This change fixes a design problem in SCEVExpander -- it relied on a
hoisting optimization to generate correct code for add recurrences.
This meant changing the hoisting optimization to not kick in under
certain circumstances (to avoid speculating faulting instructions, say)
would break correctness.
The fix is to make the correctness requirements explicit, and have it
not rely on the hoisting optimization for correctness.
llvm-svn: 289397
Regcall calling convention passes mask types arguments in x86 GPR registers.
The review includes the changes required in order to support v32i1, v16i1 and v8i1.
Differential Revision: https://reviews.llvm.org/D27148
llvm-svn: 289383
This teaches SimplifyDemandedElts that the FMA can be removed if the lower element isn't used. It also teaches it that if upper elements of the first operand aren't used then we can simplify them.
llvm-svn: 289377
iteration.
Instead, load the byte at the needle length, compare it directly, and
save it to use in the lookup table of lengths we can skip forward.
I also added an annotation to expect that the comparison fails so that
the loop gets laid out contiguously without the call to memcpy (and the
substantial register shuffling that the ABI requires of that call).
Finally, because this behaves especially badly with a needle length of
one (by calling memcmp with a zero length) special case that to directly
call memchr, which is what we should have been doing anyways.
This was motivated by the fact that there are a large number of test
cases in 'check-llvm' where FileCheck's performance is dominated by
calls to StringRef::find (in a release, no-asserts build). I'm working
on patches to generally improve matters there, but this alone was worth
a 12.5% improvement in one test case where FileCheck spent 92% of its
time in this routine.
I experimented a bunch with different minor variations on this theme,
for example setting the pointer *at* the last byte and indexing
backwards for the call to memcmp. That didn't improve anything on this
version and seemed more complex. I also tried other things to make the
loop flow more nicely and none worked. =/ It is a bit unfortunate, the
generated code here remains pretty gross, but I don't see any obvious
ways to improve it. At this point, most of my ideas would be really
elaborate:
1) While the remainder of the string is long enough, we could load
a 16-byte or 32-byte vector at the address of the last byte and use
palignr to rotate that and check the first 15- or 31-bytes at the
front of the next segment, essentially pre-loading the first several
bytes of the next iteration so we could quickly detect a mismatch in
those bytes without an additional memory access. Down side would be
the code complexity, having a fallback loop, and likely misaligned
vector load. Plus it would make the common case of the last byte not
matching somewhat slower (need some extraction from a vector).
2) While we have space, we could do an aligned load of a 16- or 32-byte
vector that *contains* the end byte, and use any peceding bytes to
have a more precise "no" test, and any subsequent bytes could be
saved for the next iteration. This remove any unaligned load penalty,
but still requires us to pay the overhead of vector extraction for
the cases where we didn't need to do anything other than load and
compare the last byte.
3) Try to walk from the last byte in a way that is more friendly to
cache and/or memory pre-fetcher considering we have to poke the last
byte anyways.
No idea if any of these are really worth pursuing though. They all seem
somewhat unlikely to yield big wins in practice and to be a lot of work
and complexity. So I settled here, which at least seems like a strict
improvement over the previous version.
llvm-svn: 289373
These intrinsics don't read the upper elements of their first and second input. These are slightly different the the SSE version which does use the upper bits of its first element as passthru bits since the result goes to an XMM register. For AVX-512 the result goes to a mask register instead.
llvm-svn: 289371
These intrinsics don't read the upper bits of their second input. And the third input is the passthru for masking and that only uses the lower element as well.
llvm-svn: 289370
There was a bug where we would hit an assertion if 'Q' was used as a
constraint.
I also removed hardcoded register names to prefer regexes so the tests
don't break when the register allocator changes.
llvm-svn: 289325
Summary: This gets rid of the hardcoded 'r0' that was used previously.
Reviewers: asl
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27567
llvm-svn: 289322
Summary:
This never really got implemented, and was very hard to test before
a lot of the refactoring changes to make things more robust. But now we
can test it thoroughly and cleanly, especially at the CGSCC level.
The core idea is that when an inner analysis manager proxy receives the
invalidation event for the outer IR unit, it needs to walk the inner IR
units and propagate it to the inner analysis manager for each of those
units. For example, each function in the SCC needs to get an
invalidation event when the SCC gets one.
The function / module interaction is somewhat boring here. This really
becomes interesting in the face of analysis-backed IR units. This patch
effectively handles all of the CGSCC layer's needs -- both invalidating
SCC analysis and invalidating function analysis when an SCC gets
invalidated.
However, this second aspect doesn't really handle the
LoopAnalysisManager well at this point. That one will need some change
of design in order to fully integrate, because unlike the call graph,
the entire function behind a LoopAnalysis's results can vanish out from
under us, and we won't even have a cached API to access. I'd like to try
to separate solving the loop problems into a subsequent patch though in
order to keep this more focused so I've adapted them to the API and
updated the tests that immediately fail, but I've not added the level of
testing and validation at that layer that I have at the CGSCC layer.
An important aspect of this change is that the proxy for the
FunctionAnalysisManager at the SCC pass layer doesn't work like the
other proxies for an inner IR unit as it doesn't directly manage the
FunctionAnalysisManager and invalidation or clearing of it. This would
create an ever worsening problem of dual ownership of this
responsibility, split between the module-level FAM proxy and this
SCC-level FAM proxy. Instead, this patch changes the SCC-level FAM proxy
to work in terms of the module-level proxy and defer to it to handle
much of the updates. It only does SCC-specific invalidation. This will
become more important in subsequent patches that support more complex
invalidaiton scenarios.
Reviewers: jlebar
Subscribers: mehdi_amini, mcrosier, mzolotukhin, llvm-commits
Differential Revision: https://reviews.llvm.org/D27197
llvm-svn: 289317
Ideally ISD::FP_TO_SINT and ISD::FP_TO_UINT would only be used for cases with the same number of input and output elements.
Similar things have already been done for other convert intrinsics.
llvm-svn: 289316
These should've been checking whether the immediate is a 6-bit unsigned
integer.
If the immediate was '63', this would cause an assertion error which
shouldn't have occurred.
llvm-svn: 289315
Since 32-bit instructions with 32-bit input immediate behavior
are used to materialize 16-bit constants in 32-bit registers
for 16-bit instructions, determining the legality based
on the size is incorrect. Change operands to have the size
specified in the type.
Also adds a workaround for a disassembler bug that
produces an immediate MCOperand for an operand that
is supposed to be OPERAND_REGISTER.
The assembler appears to accept out of bounds immediates and
truncates them, but this seems to be an issue for 32-bit
already.
llvm-svn: 289306
LLVM's use of DW_OP_bit_piece is incorrect and a based on a
misunderstanding of the wording in the DWARF specification. The offset
argument of DW_OP_bit_piece refers to the offset into the location
that is on the top of the DWARF expression stack, and not an offset
into the source variable. This has since also been clarified in the
DWARF specification.
This patch fixes all uses of DW_OP_bit_piece to emit the correct
offset and simplifies the DwarfExpression class to semi-automaticaly
emit empty DW_OP_pieces to adjust the offset of the source variable,
thus simplifying the code using DwarfExpression.
While this is an incompatible bugfix, in practice I don't expect this
to be much of a problem since LLVM's old interpretation and the
correct interpretation of DW_OP_bit_piece differ only when there are
gaps in the fragmented locations of the described variables or if
individual fragments are smaller than a byte. LLDB at least won't
interpret locations with gaps in them because is has no way to present
undefined bits in a variable, and there is a high probability that an
old-form expression will be malformed when interpreted correctly,
because the DW_OP_bit_piece offset will be outside of the location at
the top of the stack.
As a nice side-effect, this patch enables us to use a more efficient
encoding for subregisters: In order to express a sub-register at a
non-zero offset we now use a DW_OP_bit_piece instead of shifting the
value into place manually.
This patch also adds missing test coverage for code paths that weren't
exercised before.
<rdar://problem/29335809>
Differential Revision: https://reviews.llvm.org/D27550
llvm-svn: 289266
Summary:
There is no point in setting SGPRS=104, because VI allocates SGPRs
in multiples of 16, so 104 -> 112. That enables us to use all 102 SGPRs
for general purposes.
Reviewers: tstellarAMD
Subscribers: qcolombet, arsenm, kzhuravl, wdng, nhaehnle, yaxunl, tony-tye
Differential Revision: https://reviews.llvm.org/D27149
llvm-svn: 289260
Like DBG_VALUE, these emit nothing to the .text section, and sometimes
have no source location specified. Just ignore them.
Differential Revision: http://reviews.llvm.org/D27492
llvm-svn: 289256
Reapplied with fix for PR31323 - X86 SSE2 vXi16 multiplies for illegal types were creating CONCAT_VECTORS nodes with vector inputs that might not total the number of elements in the result type.
llvm-svn: 289232
Retrying after fixing overly aggressive load-store forwarding optimization.
Simplify Consecutive Merge Store Candidate Search
Now that address aliasing is much less conservative, push through
simplified store merging search which only checks for parallel stores
through the chain subgraph. This is cleaner as the separation of
non-interfering loads/stores from the store-merging logic.
Whem merging stores, search up the chain through a single load, and
finds all possible stores by looking down from through a load and a
TokenFactor to all stores visited. This improves the quality of the
output SelectionDAG and generally the output CodeGen (with some
exceptions).
Additional Minor Changes:
1. Finishes removing unused AliasLoad code
2. Unifies the the chain aggregation in the merged stores across
code paths
3. Re-add the Store node to the worklist after calling
SimplifyDemandedBits.
4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
arbitrary, but seemed sufficient to not cause regressions in
tests.
This finishes the change Matt Arsenault started in r246307 and
jyknight's original patch.
Many tests required some changes as memory operations are now
reorderable. Some tests relying on the order were changed to use
volatile memory operations
Noteworthy tests:
CodeGen/AArch64/argument-blocks.ll -
It's not entirely clear what the test_varargs_stackalign test is
supposed to be asserting, but the new code looks right.
CodeGen/AArch64/arm64-memset-inline.lli -
CodeGen/AArch64/arm64-stur.ll -
CodeGen/ARM/memset-inline.ll -
The backend now generates *worse* code due to store merging
succeeding, as we do do a 16-byte constant-zero store efficiently.
CodeGen/AArch64/merge-store.ll -
Improved, but there still seems to be an extraneous vector insert
from an element to itself?
CodeGen/PowerPC/ppc64-align-long-double.ll -
Worse code emitted in this case, due to the improved store->load
forwarding.
CodeGen/X86/dag-merge-fast-accesses.ll -
CodeGen/X86/MergeConsecutiveStores.ll -
CodeGen/X86/stores-merging.ll -
CodeGen/Mips/load-store-left-right.ll -
Restored correct merging of non-aligned stores
CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
Improved. Correctly merges buffer_store_dword calls
CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
Improved. Sidesteps loading a stored value and
merges two stores
CodeGen/X86/pr18023.ll -
This test has been removed, as it was asserting incorrect
behavior. Non-volatile stores *CAN* be moved past volatile loads,
and now are.
CodeGen/X86/vector-idiv.ll -
CodeGen/X86/vector-lzcnt-128.ll -
It's basically impossible to tell what these tests are actually
testing. But, looks like the code got better due to the memory
operations being recognized as non-aliasing.
CodeGen/X86/win32-eh.ll -
Both loads of the securitycookie are now merged.
Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
Differential Revision: https://reviews.llvm.org/D14834
llvm-svn: 289221
Part of the work for PR31323 - add extra asserts checking that the input vectors are of consistent type and result in the correct number of vector elements.
llvm-svn: 289214
Adds support for bitcasting a little endian 'small element' vector to 'large element' scalar/vector (e.g. v16i8 to v4i32 or v2i32 to i64), which is required for PR30845. We extract the knownbits for each 'small element' part and concatenate the results together.
We can add support for big endian and 'large element' scalar/vector to 'small element' vector bitcasting once we have test cases for them.
Differential Revision: https://reviews.llvm.org/D27129
llvm-svn: 289200
This reverts commit r288916 as it is currently causing a crasher in
Halide. Reproducer on llvm.org/PR31323. While it might be that halide is
generating invalid IR, llc shouldn't crash.
llvm-svn: 289194
sse_load_f32/f64 can also match loads that are zero extended to vectors. We shouldn't match that because we wouldn't be able to get the instruction to zero the upper bits like the intrinsic semantics would require for such a case.
There is a test case that does depend on this behavior.
llvm-svn: 289193
We could previously select an integer which would hit an assertion error
in pseudo expansion.
The new type will also generate the appropriate fixups if needed, which
wasn't done beforehand.
llvm-svn: 289192
Summary:
Scalar intrinsics have specific semantics about the which input's upper bits are passed through to the output. The same input is also supposed to be the input we use for the lower element when the mask bit is 0 in a masked operation. We aren't currently keeping these semantics with instruction selection.
This patch corrects this by introducing new scalar FMA ISD nodes that indicate whether operand 1(one of the multiply inputs) or operand 3(the additon/subtraction input) should pass thru its upper bits.
We use this information to select 213/132 form for the operand 1 version and the 231 form for the operand 3 version.
We also use this information to suppress combining FNEG operations on the passthru input since semantically the passthru bits aren't negated. This is stronger than the earlier check added for a user being SELECTS so we can remove that.
This fixes PR30913.
Reviewers: delena, zvi, v_klochkov
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D27144
llvm-svn: 289190
These were selecting directly to the VOP2 form instead
of VOP3 like the i32 instructions. Fixes regressions in
future commits where an immediate isn't folded because it was
initially used for the second operand.
Because uniform 16-bit operations are promoted to i32, it's
difficult to get a simple testcase where this matters. Fold
failures in SIFoldOperands here tend to be hidden by commute
and fold in SIShrinkInstructions.
llvm-svn: 289189
The motivating example is:
extern int patatino;
int goo() {
int x = 0;
for (int i = 0; i < 1000000; ++i) {
x *= patatino;
}
return x;
}
Currently SCCP will not realize that this function returns always zero,
therefore will try to unroll and vectorize the loop at -O3 producing an
awful lot of (useless) code. With this change, it will just produce:
0000000000000000 <g>:
xor %eax,%eax
retq
llvm-svn: 289175
This just hoists the check for declarations up a layer which allows
various sets used in the walk to be smaller. Also moves the relevant
comments to match, and catches a few other cleanups in this code.
llvm-svn: 289163
Supporting them properly is a reasonably complex chunk of work, so to allow bot
testing before then we should at least be able to fall back to DAG ISel.
llvm-svn: 289150
We were falsely claiming that we had an LSDA for the relevant EH
personality before this change, which could lead to the EH machinery
interpreting random adjacent data as an LSDA.
Fixes PR31317
This change is safe because cleanups can't contain exception handlers
today. We do these things to maintain that invariant:
- C++ destructors are naturally out-of-line
- __finally blocks are outlined in clang
- LLVM's inliner will not inline EH constructs into cleanups
llvm-svn: 289101
This was exposed by some code that used more than one level of sub-
registers. There is no testcase, because there is no such code in the
Hexagon backend.
llvm-svn: 289099
Not having this legal led to combine failures, resulting
in dumb things like bitcasts of constants not being folded
away.
The only reason I'm leaving the v_mov_b32 hack that f32
already uses is to avoid madak formation test regressions.
PeepholeOptimizer has an ordering issue where the immediate
fold attempt is into the sgpr->vgpr copy instead of the actual
use. Running it twice avoids that problem.
llvm-svn: 289096
The correct commutable opcode was set to itself, so this
was simply swapping the operands to commute instead of also
changing the opcode to v_subrev_u16.
llvm-svn: 289093
Multiple metadata values for records such as opencl.ocl.version, llvm.ident
and similar are created after linking several modules. For some of them, notably
opencl.ocl.version, this creates semantic problem because we cannot tell which
version of OpenCL the composite module conforms.
Moreover, such repetitions of identical values often create a huge list of
unneeded metadata, which grows bitcode size both in memory and stored on disk.
It can go up to several Mb when linked against our OpenCL library. Lastly, such
long lists obscure reading of dumped IR.
The pass unifies metadata after linking.
Differential Revision: https://reviews.llvm.org/D25381
llvm-svn: 289092
Summary:
Attaching !absolute_symbol to a global variable does two things:
1) Marks it as an absolute symbol reference.
2) Specifies the value range of that symbol's address.
Teach the X86 backend to allow absolute symbols to appear in place of
immediates by extending the relocImm and mov64imm32 matchers. Start using
relocImm in more places where it is legal.
As previously proposed on llvm-dev:
http://lists.llvm.org/pipermail/llvm-dev/2016-October/105800.html
Differential Revision: https://reviews.llvm.org/D25878
llvm-svn: 289087
Summary:
LC can currently select scalar load for uniform memory access
basing on readonly memory address space only. This restriction
originated from the fact that in HW prior to VI vector and scalar caches
are not coherent. With MemoryDependenceAnalysis we can check that the
memory location corresponding to the memory operand of the LOAD is not
clobbered along the all paths from the function entry.
Reviewers: rampitec, tstellarAMD, arsenm
Subscribers: wdng, arsenm, nhaehnle
Differential Revision: https://reviews.llvm.org/D26917
llvm-svn: 289076
ConstantFolding tried to cast one of the scalar indices to a vector
type. Instead, use the vector type only for the first index (which
is the only one allowed to be a vector) and use its scalar type
otherwise.
Fixes PR31250.
Reviewers: majnemer
Differential Revision: https://reviews.llvm.org/D27389
llvm-svn: 289073
Summary:
Most targets set the action for these nodes to Expand even though there
isn't actually any code for them in ExpandNode. Instead, targets simply
relied on the fact that no code generates these nodes as long as the
nodes aren't legal or custom.
However, generating these nodes can be useful e.g. for divide-by-constant
in wider integer types.
Expand of [US]MUL_LOHI will use MULH[US] when legal or custom, and
a sequence of half-width multiplications otherwise. Promote uses a wider
multiply.
This patch intends to not change the generated code, but indirect effects
are possible since expansions/promotions that were previously done in
DAGCombine may now be done in LegalizeDAG.
See D24822 for a change that actually uses the new expansion.
Reviewers: spatel, bkramer, venkatra, efriedma, hfinkel, ast, nadav, tstellarAMD
Subscribers: arsenm, jyknight, nemanjai, wdng, nhaehnle, llvm-commits
Differential Revision: https://reviews.llvm.org/D24956
llvm-svn: 289050
Summary:
Without the fix to isFrameOffsetLegal to consider the instruction's
immediate offset, the new test case hits the corresponding assertion in
resolveFrameIndex, because the LocalStackSlotAllocation pass re-uses a
different base register.
With only the fix to isFrameOffsetLegal, code quality reduces in a bunch of
places because frame base registers are added where they're not needed.
This is addressed by properly implementing needsFrameBaseReg, which also
helps to avoid unnecessary zero frame indices in a bunch of other places.
Fixes piglit glsl-1.50/execution/variable-indexing/gs-output-array-vec4-index-wr.shader_test
Reviewers: arsenm, tstellarAMD
Subscribers: qcolombet, kzhuravl, wdng, yaxunl, tony-tye, llvm-commits
Differential Revision: https://reviews.llvm.org/D27344
llvm-svn: 289048
So far it creates a test helper and so it should be moved there. It also
create a layering cycle between CodeGen and CodeGen/AsmPrinter, which
should be avoided.
Review: https://reviews.llvm.org/D27570
llvm-svn: 289044
When trying to vectorize trees that start at insertelement instructions
function tryToVectorizeList() uses vectorization factor calculated as
MinVecRegSize/ScalarTypeSize. But sometimes it does not work as tree
cost for this fixed vectorization factor is too high.
Patch tries to improve the situation. It tries different vectorization
factors from max(PowerOf2Floor(NumberOfVectorizedValues),
MinVecRegSize/ScalarTypeSize) to MinVecRegSize/ScalarTypeSize and tries
to choose the best one.
Differential Revision: https://reviews.llvm.org/D27215
llvm-svn: 289043
This allows clients to register an AsmCommentConsumer with the MCAsmLexer,
which receives a callback each time a comment is parsed.
Differential Revision: https://reviews.llvm.org/D27511
llvm-svn: 289036
Most importantly, we need to hash the relocation model, otherwise we can
end up trying to link non-PIC object files into PIEs or DSOs.
Differential Revision: https://reviews.llvm.org/D27556
llvm-svn: 289024
The relocations for `DIEEntry::EmitValue` were wrong for Win64
(emitting FK_Data_4 instead of FK_SecRel_4). This corrects that
oversight so that the DWARF data is correct in Win64 COFF files.
Fixes PR15393.
Patch by Jameson Nash <jameson@juliacomputing.com> based on a patch
by David Majnemer.
Differential Revision: https://reviews.llvm.org/D21731
llvm-svn: 289013
The only tests we have for the DWARF parser are the tests that use llvm-dwarfdump and expect output from textual dumps.
More DWARF parser modification are coming in the next few weeks and I wanted to add tests that can verify that we can encode and decode all form types, as well as test some other basic DWARF APIs where we ask DIE objects for their children and siblings.
DwarfGenerator.cpp was added in the lib/CodeGen directory. This file contains the code necessary to easily create DWARF for tests:
dwarfgen::Generator DG;
Triple Triple("x86_64--");
bool success = DG.init(Triple, Version);
if (!success)
return;
dwarfgen::CompileUnit &CU = DG.addCompileUnit();
dwarfgen::DIE CUDie = CU.getUnitDIE();
CUDie.addAttribute(DW_AT_name, DW_FORM_strp, "/tmp/main.c");
CUDie.addAttribute(DW_AT_language, DW_FORM_data2, DW_LANG_C);
dwarfgen::DIE SubprogramDie = CUDie.addChild(DW_TAG_subprogram);
SubprogramDie.addAttribute(DW_AT_name, DW_FORM_strp, "main");
SubprogramDie.addAttribute(DW_AT_low_pc, DW_FORM_addr, 0x1000U);
SubprogramDie.addAttribute(DW_AT_high_pc, DW_FORM_addr, 0x2000U);
dwarfgen::DIE IntDie = CUDie.addChild(DW_TAG_base_type);
IntDie.addAttribute(DW_AT_name, DW_FORM_strp, "int");
IntDie.addAttribute(DW_AT_encoding, DW_FORM_data1, DW_ATE_signed);
IntDie.addAttribute(DW_AT_byte_size, DW_FORM_data1, 4);
dwarfgen::DIE ArgcDie = SubprogramDie.addChild(DW_TAG_formal_parameter);
ArgcDie.addAttribute(DW_AT_name, DW_FORM_strp, "argc");
// ArgcDie.addAttribute(DW_AT_type, DW_FORM_ref4, IntDie);
ArgcDie.addAttribute(DW_AT_type, DW_FORM_ref_addr, IntDie);
StringRef FileBytes = DG.generate();
MemoryBufferRef FileBuffer(FileBytes, "dwarf");
auto Obj = object::ObjectFile::createObjectFile(FileBuffer);
EXPECT_TRUE((bool)Obj);
DWARFContextInMemory DwarfContext(*Obj.get());
This code is backed by the AsmPrinter code that emits DWARF for the actual compiler.
While adding unit tests it was discovered that DIEValue that used DIEEntry as their values had bugs where DW_FORM_ref1, DW_FORM_ref2, DW_FORM_ref8, and DW_FORM_ref_udata forms were not supported. These are all now supported. Added support for DW_FORM_string so we can emit inlined C strings.
Centralized the code to unique abbreviations into a new DIEAbbrevSet class and made both the dwarfgen::Generator and the llvm::DwarfFile classes use the new class.
Fixed comments in the llvm::DIE class so that the Offset is known to be the compile/type unit offset.
DIEInteger now supports more DW_FORM values.
There are also unit tests that cover:
Encoding and decoding all form types and values
Encoding and decoding all reference types (DW_FORM_ref1, DW_FORM_ref2, DW_FORM_ref4, DW_FORM_ref8, DW_FORM_ref_udata, DW_FORM_ref_addr) including cross compile unit references with that go forward one compile unit and backward on compile unit.
Differential Revision: https://reviews.llvm.org/D27326
llvm-svn: 289010
Replace @progbits in the section directive with %progbits, because "@" starts a comment on arm/thumb.
Use b.w branch instruction.
Use .thumb_function and .thumb_set for proper arm/thumb interwork. This way jumptable entry addresses on thumb have bit 0 set (correctly). This does not affect CFI check math, because the address of the jumptable start also has that bit set.
This does not work on thumbv5, because it does not support b.w, and the linker would not insert a veneer (trampoline?) to extend the range of b.n. We may need to do full-range plt-style jumptables on thumbv54, which are 12 bytes per entry. Another option is "push lr; bl; pop pc" (4 bytes) but that needs unwinding instructions, etc.
Differential Revision: https://reviews.llvm.org/D27499
llvm-svn: 289008
The fix committed in r288851 doesn't cover all the cases.
In particular, if we have an instruction with side effects
which has a no non-dbg use not depending on the bits, we still
perform RAUW destroying the dbg.value's first argument.
Prevent metadata from being replaced here to avoid the issue.
Differential Revision: https://reviews.llvm.org/D27534
llvm-svn: 288987
ConstantExpr instances were emitting code into the current block rather than
the entry block. This meant they didn't necessarily dominate all uses, which is
clearly wrong.
llvm-svn: 288985
Since DWARF formatting is agnostic to the object file it is stored in, it doesn't make sense for this to be in the MachOYAML implementation. Pulling it into its own namespace means we could modify the ELF and COFF YAML tools to emit DWARF as well.
In a follow-up patch I will better abstract this in obj2yaml and yaml2obj so that the DWARF bits in the tools can be re-used too.
llvm-svn: 288984
Having to ask the MIRBuilder for the current function is a little awkward, and
I'm intending to improve how that's threaded through anyway.
llvm-svn: 288983
MachineIRBuilder had weird before/after and beginning/end flags for the insert
point. Unfortunately the non-default means that instructions will be inserted
in reverse order which is almost never what anyone wants.
Really, I think we just want (like IRBuilder has) the ability to insert at any
C++ iterator-style point (i.e. before any instruction or before MBB.end()). So
this fixes MIRBuilders to behave like IRBuilders in this respect.
llvm-svn: 288980
If we don't skip over DEBUG_VALUEs, we get differences between -g and non-g
code.
This fixes PR31242.
Differential Revision: https://reviews.llvm.org/D27485
llvm-svn: 288965
The second operand of an "ri" instruction may be an immediate, but it may
also be a globalvariable, so we should make any assumptions.
This fixes PR31271.
Differential Revision: https://reviews.llvm.org/D27481
llvm-svn: 288964
This is now performed more generally by the target shuffle combine code.
Already covered by tests that were originally added in D7666/rL229480 to support combineVectorZext (or VectorZextCombine as it was known then....).
Differential Revision: https://reviews.llvm.org/D27510
llvm-svn: 288918
This patch attempts to scalarize the operand expressions of predicated
instructions if they were conditionally executed in the original loop. After
scalarization, the expressions will be sunk inside the blocks created for the
predicated instructions. The transformation essentially performs
un-if-conversion on the operands.
The cost model has been updated to determine if scalarization is profitable. It
compares the cost of a vectorized instruction, assuming it will be
if-converted, to the cost of the scalarized instruction, assuming that the
instructions corresponding to each vector lane will be sunk inside a predicated
block, possibly avoiding execution. If it's more profitable to scalarize the
entire expression tree feeding the predicated instruction, the expression will
be scalarized; otherwise, it will be vectorized. We only consider the cost of
the entire expression to accurately estimate the cost of the required
insertelement and extractelement instructions.
Differential Revision: https://reviews.llvm.org/D26083
llvm-svn: 288909
In the case of a fully redundant load LI dominated by an equivalent load V, GVN
should always preserve the original debug location of V. Otherwise, we risk to
introduce an incorrect stepping.
If V has debug info, then clearly it should not be modified. If V has a null
debugloc, then it is still potentially incorrect to propagate LI's debugloc
because LI may not post-dominate V.
Differential Revision: https://reviews.llvm.org/D27468
llvm-svn: 288903
We are being inconsistent with these instructions (and all their variants.....) with a random mix of them using the default float domain.
Differential Revision: https://reviews.llvm.org/D27419
llvm-svn: 288902
The non-constant pool version of DecodeVPERMIL2PMask was not offsetting correctly for the second input. I've updated the code to match the implementation in the constant-pool version.
Annoyingly this bug was hidden for so long as it's tricky to combine to useful variable shuffle masks that don't become constant-pool entries.
llvm-svn: 288898
When a function F is inlined, InlineFunction extends the debug location of every
instruction inlined from F by adding an InlinedAt.
However, if an instruction has a 'null' debug location, InlineFunction would
propagate the callsite debug location to it. This behavior existed since
revision 210459.
Revision 210459 was originally committed specifically to workaround the lack of
debug information for instructions inlined from intrinsic functions (which are
usually declared with attributes `__always_inline__, __nodebug__`).
The problem with revision 210459 is that it doesn't make any sort of distinction
between instructions inlined from a 'nodebug' function and instructions which
are inlined from a function built with debug info. This issue may lead to
incorrect stepping in the debugger.
This patch works under the assumption that a nodebug function does not have a
DISubprogram. When a function F is inlined into another function G,
InlineFunction checks if F has debug info associated with it.
For nodebug functions, the InlineFunction logic is unchanged (i.e. it would
still propagate the callsite debugloc to the inlined instructions). Otherwise,
InlineFunction no longer propagates the callsite debug location.
Differential Revision: https://reviews.llvm.org/D27462
llvm-svn: 288895
I believe this is the cause of the failure, but have not been able to confirm. Note that this is a speculative fix; I'm still waiting for a full build to finish as I synced and ended up doing a clean build which takes 20+ minutes on my machine.
llvm-svn: 288886
Requesting metadata for a global is a relatively expensive operation as it
involves a map lookup, but it's one that we need to do relatively frequently in
this pass to collect the list of type metadata nodes associated with a global.
This change improves the performance of type metadata queries by prebuilding
data structures that keep the global together with its list of type metadata,
and changing the pass to use that data structure wherever we were previously
passing global references around.
This change also eliminates some O(N^2) behavior by collecting the list of
globals associated with each type identifier during the first pass over the
list of globals rather than visiting each global to compute that list every
time we add a new type identifier.
Reduces pass runtime on a module containing Chrome's vtables from over 60s
to 0.9s.
Differential Revision: https://reviews.llvm.org/D27484
llvm-svn: 288859
On some platforms (like MSP430) the second element of the result
structure for SMULO/UMULO may have a shorter type than the one
returned by SetCC. We need to truncate it to the right type, or
else some incorrect code may be generated later on.
This fixes issue https://github.com/rust-lang/rust/issues/37829
Patch by Vadzim Dambrouski!
Differential Revision: https://reviews.llvm.org/D27154
llvm-svn: 288857
As Eli noted in the post-commit thread for r288833, the use of
swapOperands() may not be allowed in InstSimplify, so I'm
removing those calls here pending further review.
The swap mutates the icmp, and there doesn't appear to be precedent
for instruction mutation in InstSimplify.
I didn't actually have any tests for those cases, so I'm adding
a few here.
llvm-svn: 288855
BDCE has two phases:
1. It asks SimplifyDemandedBits if all the bits of an instruction are dead, and if so,
replaces all its uses with the constant zero.
2. Then, it asks SimplifyDemandedBits again if the instruction is really dead
(no side effects etc..) and if so, eliminates it.
Now, in 1) if all the bits of an instruction are dead, we may end up replacing a dbg use:
%call = tail call i32 (...) @g() #4, !dbg !15
tail call void @llvm.dbg.value(metadata i32 %call, i64 0, metadata !8, metadata !16), !dbg !17
->
%call = tail call i32 (...) @g() #4, !dbg !15
tail call void @llvm.dbg.value(metadata i32 0, i64 0, metadata !8, metadata !16), !dbg !17
but not eliminating the call because it may have arbitrary side effects.
In other words, we lose some debug informations.
This patch fixes the problem making sure that BDCE does nothing with the instruction if
it has side effects and no non-dbg uses.
Differential Revision: https://reviews.llvm.org/D27471
llvm-svn: 288851
Summary:
If we write an immediate to a VGPR and then copy the VGPR to an
SGPR, we can replace the copy with a S_MOV_B32 sgpr, imm, rather than
moving the copy to the SALU.
Reviewers: arsenm
Subscribers: kzhuravl, wdng, nhaehnle, yaxunl, llvm-commits, tony-tye
Differential Revision: https://reviews.llvm.org/D27272
llvm-svn: 288849
We were rounding size in bits down rather than up, leading to 0-sized slots for
i1 (assert!) and bugs for other types not byte-aligned.
llvm-svn: 288848
Summary:
Prefer expansions such as: pmullw,pmulhw,unpacklwd,unpackhwd over pmulld.
On Silvermont [source: Optimization Reference Manual]:
PMULLD has a throughput of 1/11 [instruction/cycles].
PMULHUW/PMULHW/PMULLW have a throughput of 1/2 [instruction/cycles].
Fixes pr31202.
Analysis of this issue was done by Fahana Aleen.
Reviewers: wmi, delena, mkuper
Subscribers: RKSimon, llvm-commits
Differential Revision: https://reviews.llvm.org/D27203
llvm-svn: 288844
Handle the case where a sign extension has ended up being split into separate stages (typically to get around vector legal ops) and a zext + sext_in_reg gets inserted.
Differential Revision: https://reviews.llvm.org/D27461
llvm-svn: 288842
All of these (and a few more) are already handled by InstCombine,
but we shouldn't have to wait until then to simplify these because
they're cheap to deal with here in InstSimplify.
This is the 'and' sibling of the earlier 'or' patch:
https://reviews.llvm.org/rL288833
llvm-svn: 288841