Bitcast and certain Ptr2Int/Int2Ptr instructions will not alter the
value of their operand and can therefore be looked through when we
determine non-nullness.
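As a standalone illustration, here is a minimal sketch of the look-through idea (hypothetical Op/Value types, not LLVM's actual isKnownNonZero machinery; the "certain" qualifier, e.g. matching pointer/integer widths, is elided):
```
#include <cassert>

enum class Op { BitCast, PtrToInt, IntToPtr, Other };

struct Value {
  Op Opcode = Op::Other;
  Value *Operand = nullptr; // source of the cast, when applicable
  bool NonNull = false;     // known non-null fact about this value
};

// These casts cannot change the bits of their operand, so they cannot
// change its null-ness; skip over them and ask about the source.
bool isKnownNonNull(const Value *V) {
  while (V->Operand && (V->Opcode == Op::BitCast ||
                        V->Opcode == Op::PtrToInt ||
                        V->Opcode == Op::IntToPtr))
    V = V->Operand;
  return V->NonNull;
}

int main() {
  Value Src;
  Src.NonNull = true;            // e.g. an alloca
  Value Cast{Op::BitCast, &Src}; // bitcast of a non-null pointer
  assert(isKnownNonNull(&Cast));
}
```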
Differential Revision: https://reviews.llvm.org/D54956
llvm-svn: 352293
Fix an issue noted in D57281: the code tested the single use of just the SDValue (the result flag), not the entire SUB node.
I've added the getNode() call to make the intent clearer than the bare -> redirection.
llvm-svn: 352291
As discussed on PR24545, we should try to commute X86::COND_A 'icmp ugt' cases to X86::COND_B 'icmp ult' to more optimally bind the carry flag output to an SBB instruction.
Differential Revision: https://reviews.llvm.org/D57281
llvm-svn: 352289
We often generate X86ISD::SBB(X, 0) for carry flag arithmetic.
I tried to create test cases for the ADC equivalent (which often uses the same pattern) but haven't managed to find any yet.
Differential Revision: https://reviews.llvm.org/D57169
llvm-svn: 352288
This reduces a bit of duplication between the combining and
lowering places that use it, but the primary motivation is
to make it easier to rearrange the lowering logic and solve
PR40434:
https://bugs.llvm.org/show_bug.cgi?id=40434
llvm-svn: 352280
For the power9 CPU, vector operations consume a pair of execution units rather
than one execution unit like a scalar operation. Update the target transform
cost functions to reflect the higher cost of vector operations when targeting
Power9.
Patch by RolandF.
Differential revision: https://reviews.llvm.org/D55461
llvm-svn: 352261
Summary: We have isel patterns for this, but we're missing some load patterns and all broadcast patterns. A DAG combine seems like a better fit for this.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56971
llvm-svn: 352260
Summary:
I'm not sure why we were using SEXTLOAD. EXTLOAD seems more appropriate since we don't care about the upper bits.
This patch changes this and then modifies the X86 post-legalization combine to emit an extending shuffle instead of a sign_extend_vector_inreg. We could maybe use an any_extend_vector_inreg, but I just did what we already do in LowerLoad. I think we can actually get rid of this code entirely if we switch to -x86-experimental-vector-widening-legalization.
On AVX512 targets I think we might be able to use a masked vpmovzx and not have to expand this at all.
Reviewers: RKSimon, spatel
Reviewed By: RKSimon
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D57186
llvm-svn: 352255
DAGCombiner::visitBITCAST will perform:
fold (bitconvert (fneg x)) -> (xor (bitconvert x), signbit)
fold (bitconvert (fabs x)) -> (and (bitconvert x), (not signbit))
As shown in double-bitmanip-dagcombines.ll, this can be advantageous. But
RV32FD doesn't use bitcast directly (as i64 isn't a legal type), and instead
uses RISCVISD::SplitF64. This patch adds an equivalent DAG combine for
SplitF64.
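For intuition, the folds are sound because fneg/fabs on an IEEE-754 value are exactly a flip/clear of its sign bit. A standalone C++20 check (illustrative only, unrelated to the RISC-V lowering itself):
```
#include <bit>
#include <cassert>
#include <cstdint>

int main() {
  double X = -3.5;
  uint64_t Bits = std::bit_cast<uint64_t>(X);
  const uint64_t SignBit = 1ULL << 63;
  assert(std::bit_cast<double>(Bits ^ SignBit) == 3.5);  // (fneg x)
  assert(std::bit_cast<double>(Bits & ~SignBit) == 3.5); // (fabs x)
}
```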
llvm-svn: 352247
Summary:
Currently, if an instruction with a memory operand has no debug information,
X86DiscriminateMemOps will generate one based on the first line of the
enclosing function, or the last seen debug info.
This may cause confusion in certain debugging scenarios. The long-term
approach would be to use the line number '0' in such cases; however, that
brings in challenges: the base discriminator value range is limited
(4096 values).
For the short term, this patch adds an opt-in flag for the feature.
See bug 40319 (https://bugs.llvm.org/show_bug.cgi?id=40319)
Reviewers: dblaikie, jmorse, gbedwell
Reviewed By: dblaikie
Subscribers: aprantl, eraman, hiraditya
Differential Revision: https://reviews.llvm.org/D57257
llvm-svn: 352246
Summary:
Set the default value for retrieved attributes to 1, since the check is against 1.
This eliminates the warning noise generated when the attributes are not present.
Reviewers: sanjoy
Subscribers: jlebar, llvm-commits
Differential Revision: https://reviews.llvm.org/D57253
llvm-svn: 352238
If the bottom of block BB has only one successor OldTop, in most cases it is profitable to move it before OldTop, except in the following case:
-->OldTop<-
| . |
| . |
| . |
---Pred |
| |
BB-----
Moving BB before OldTop can't reduce the number of taken branches; this patch detects this case and prevents the move.
Differential Revision: https://reviews.llvm.org/D57067
llvm-svn: 352236
We also need to combine to masked truncating-with-saturation stores, but I'm leaving that for a future patch.
This does regress some tests that used truncate with saturation followed by a masked store. Those now use a truncating store and use min/max to saturate.
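For reference, a minimal standalone sketch of that min/max-then-truncate form for the unsigned case (illustrative C++, not the X86 lowering):
```
#include <algorithm>
#include <cassert>
#include <cstdint>

// Clamp to the destination range with min, then do a plain truncate.
uint8_t truncateUSat(uint32_t X) {
  return static_cast<uint8_t>(std::min<uint32_t>(X, 255));
}

int main() {
  assert(truncateUSat(300) == 255); // saturates
  assert(truncateUSat(42) == 42);   // passes through
}
```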
Differential Revision: https://reviews.llvm.org/D57218
llvm-svn: 352230
The main goal of the model is to avoid *increasing* function size, as
that would eradicate any memory locality benefits from splitting. This
happens when:
- There are too many inputs or outputs to the cold region. Argument
materialization and reloads of outputs have a cost.
- The cold region has too many distinct exit blocks, causing a large
switch to be formed in the caller.
- The code size cost of the split code is less than the cost of a
set-up call.
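A toy sketch of that size check, with invented weights and names rather than the pass's actual cost model:
```
#include <cassert>

// Toy size-benefit check: splitting pays off only if the code moved out
// of the hot function outweighs the overhead splitting introduces.
bool shouldSplit(int RegionSize, int NumInputs, int NumOutputs,
                 int NumExitBlocks, int Threshold) {
  const int CallSetupCost = 1 + NumInputs + NumOutputs; // call + arg setup
  const int ExitSwitchCost = NumExitBlocks > 1 ? NumExitBlocks : 0;
  const int Benefit = RegionSize - CallSetupCost - ExitSwitchCost;
  return Benefit > Threshold;
}

int main() {
  // A tiny region with many inputs/outputs is not worth splitting.
  assert(!shouldSplit(/*RegionSize=*/2, 3, 2, 1, /*Threshold=*/2));
  // A large region with few inputs is.
  assert(shouldSplit(/*RegionSize=*/20, 1, 0, 1, /*Threshold=*/2));
}
```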
A secondary goal is to prevent excessive overall binary size growth.
With the cost model in place, I experimented to find a splitting
threshold that works well in practice. To make warm & cold code easily
separable for analysis purposes, I moved split functions to a "cold"
section. I experimented with thresholds in [0, 4] and set the
default to the threshold which minimized geomean __text size.
Experiment data from building LNT+externals for X86 (N = 639 programs,
all sizes in bytes):
| Configuration | __text geom size | __cold geom size | TEXT geom size |
| **-Os** | 1736.3 | 0, n=0 | 10961.6 |
| -Os, thresh=0 | 1740.53 | 124.482, n=134 | 11014 |
| -Os, thresh=1 | 1734.79 | 57.8781, n=90 | 10978.6 |
| -Os, thresh=2 | ** 1733.85 ** | 65.6604, n=61 | 10977.6 |
| -Os, thresh=3 | 1733.85 | 65.3071, n=61 | 10977.6 |
| -Os, thresh=4 | 1735.08 | 67.5156, n=54 | 10965.7 |
| **-Oz** | 1554.4 | 0, n=0 | 10153 |
| -Oz, thresh=2 | ** 1552.2 ** | 65.633, n=61 | 10176 |
| **-O3** | 2563.37 | 0, n=0 | 13105.4 |
| -O3, thresh=2 | ** 2559.49 ** | 71.1072, n=61 | 13162.4 |
Picking thresh=2 reduces the geomean __text section size by 0.14% at
-Os, -Oz, and -O3 and causes ~0.2% growth in the TEXT segment. Note that
TEXT size is page-aligned, whereas section sizes are byte-aligned.
Experiment data from building LNT+externals for ARM64 (N = 558 programs,
all sizes in bytes):
| Configuration | __text geom size | __cold geom size | TEXT geom size |
| **-Os** | 1763.96 | 0, n=0 | 42934.9 |
| -Os, thresh=2 | ** 1760.9 ** | 76.6755, n=61 | 42934.9 |
Picking thresh=2 reduces the geomean __text section size by 0.17% at
-Os and causes no growth in the TEXT segment.
Measurements were done with D57082 (r352080) applied.
Differential Revision: https://reviews.llvm.org/D57125
llvm-svn: 352228
N_FUNC_COLD is a new MachO symbol attribute. It's a hint to the linker
to order a symbol towards the end of its section, to improve locality.
Example:
```
void a1() {}
__attribute__((cold)) void a2() {}
void a3() {}
int main() {
  a1();
  a2();
  a3();
  return 0;
}
```
A linker that supports N_FUNC_COLD will order _a2 to the end of the text
section. From `nm -njU` output, we see:
```
_a1
_a3
_main
_a2
```
Differential Revision: https://reviews.llvm.org/D57190
llvm-svn: 352227
This seems unnecessarily complicated: we gave names to
opposite-polarity bools, and the code comments don't really
line up with the logic.
Step 1: remove UndefUpper and assert that it is the opposite of
UndefLower after the initial early exit.
llvm-svn: 352217
This patch adds a new type, StringBlockVal, which can be used to emit a
YAML block scalar that preserves newlines in a multiline string. It
also updates MappingTraits<DiagnosticInfoOptimizationBase::Argument> to
use it for argument values with more than a single newline.
This is helpful for remarks that want to display more in-depth
information in a more structured way.
Reviewers: thegameg, anemet
Reviewed By: anemet
Subscribers: hfinkel, hiraditya, llvm-commits
Differential Revision: https://reviews.llvm.org/D57159
llvm-svn: 352216
Simplify to the generic ISD::ADD/SUB if we don't make use of the result flag.
This mainly helps with ADDCARRY/SUBBORROW intrinsics which get expanded to X86ISD::ADD/SUB but could be simplified further.
Noticed in some of the test cases in PR31754
Differential Revision: https://reviews.llvm.org/D57234
llvm-svn: 352210
This isn't the final fix for our reduction/horizontal codegen, but it takes care
of a lot of the problems. After we narrow the shuffle, existing combines for
insert/extract and binops kick in, and we end up with cheaper 128-bit ops.
The avg and mul reduction tests show an existing shuffle lowering hole for
AVX2/AVX512. I think in its most minimal form this is:
https://bugs.llvm.org/show_bug.cgi?id=40434
...but we might need multiple fixes to get it right.
Differential Revision: https://reviews.llvm.org/D57156
llvm-svn: 352209
Same as ARM.
On this occasion we split some of the instruction select tests for more
complicated instructions into their own files, so we can reuse them for
ARM and Thumb mode. Likewise for the legalizer tests.
llvm-svn: 352188
This patch extends the TableGen language with the !cond operator.
Instead of embedding !if inside !if, which can get cumbersome,
one can now use !cond.
Below is an example to convert an integer 'x' into a string:
!cond(!lt(x,0) : "Negative",
      !eq(x,0) : "Zero",
      !eq(x,1) : "One",
      1        : "MoreThanOne")
Reviewed By: hfinkel, simon_tatham, greened
Differential Revision: https://reviews.llvm.org/D55758
llvm-svn: 352185
Fast selection of llvm icmp and fcmp instructions did not handle VSX instruction support well.
We would use a VSX float comparison instruction instead of a non-VSX float comparison instruction
if the operand register class was VSSRC or VSFRC, because f32 and f64 are mapped to VSSRC and
VSFRC respectively when the VSX feature is enabled.
If the target does not have a corresponding VSX comparison instruction for some type,
just copy the VSX-related register to a common float register class and use a non-VSX comparison instruction.
Differential Revision: https://reviews.llvm.org/D57078
llvm-svn: 352174
Follow the same custom legalisation strategy as used in D57085 for
variable-length shifts (see that patch summary for more discussion). Although
we may lose out on some late-stage DAG combines, I think this custom
legalisation strategy is ultimately easier to reason about.
There are some codegen changes in rv64m-exhaustive-w-insts.ll but they are all
neutral in terms of the number of instructions.
Differential Revision: https://reviews.llvm.org/D57096
llvm-svn: 352171
2nd part of D57095, for the same reason, just in another place. We never
fold branches that are not immediately in the current loop, but this check
is missing in `IsEdgeLive`. As a result, it may think that an edge in a subloop is
dead while it's live. This is a pessimization as the code currently stands.
Differential Revision: https://reviews.llvm.org/D57147
Reviewed By: rupprecht
llvm-svn: 352170
The previous DAG combiner-based approach had an issue with infinite loops
between the target-dependent and target-independent combiner logic (see
PR40333). Although this was worked around in rL351806, the combiner-based
approach is still potentially brittle and can fail to select the 32-bit shift
variant when profitable to do so, as demonstrated in the pr40333.ll test case.
This patch instead introduces target-specific SelectionDAG nodes for
SHLW/SRLW/SRAW and custom-lowers variable i32 shifts to them. pr40333.ll is a
good example of how this approach can improve codegen.
This adds a DAG combine that does SimplifyDemandedBits on the operands (only the
lower 32 bits of the first operand and the lower 5 bits of the second operand are read).
This seems better than implementing SimplifyDemandedBitsForTargetNode as there
is no guarantee that would be called (and it's not for e.g. the anyext return
test cases). Also implements ComputeNumSignBitsForTargetNode.
There are codegen changes in atomic-rmw.ll and atomic-cmpxchg.ll but the new
instruction sequences are semantically equivalent.
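A standalone check of that demanded-bits fact for an SLLW-style shift (a plain C++ model of the RISC-V semantics, not the DAG combine itself):
```
#include <cassert>
#include <cstdint>

// SLLW-style shift: only the low 32 bits of rs1 and the low 5 bits of
// rs2 affect the (sign-extended) 32-bit result.
int64_t sllw(int64_t Rs1, int64_t Rs2) {
  return static_cast<int64_t>(
      static_cast<int32_t>(static_cast<uint32_t>(Rs1) << (Rs2 & 31)));
}

int main() {
  // Garbage in the upper bits of either operand doesn't change the result.
  assert(sllw(0x123456780000000FLL, 0xFF00000002LL) == sllw(0xF, 2));
}
```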
Differential Revision: https://reviews.llvm.org/D57085
llvm-svn: 352169
While a cold invoke itself and its unwind destination can't be
extracted, code which unconditionally executes before/after the invoke
may still be profitable to extract.
With cost model changes from D57125 applied, this gives a 3.5% increase
in split text across LNT+externals on arm64 at -Os.
llvm-svn: 352160
Otherwise they are treated as dynamic allocas, which ends up increasing
code size significantly. This reduces size of Chromium base_unittests
by 2MB (6.7%).
Differential Revision: https://reviews.llvm.org/D57205
llvm-svn: 352152
This patch exploits the instructions that store a single element from a vector
to perform a (store (extract_elt)). We already have code that does this with
ISA 3.0 instructions that were added to handle i8/i16 types. However, we had
never exploited the existing ones that handle f32/f64/i32/i64 types.
Differential revision: https://reviews.llvm.org/D56175
llvm-svn: 352131
As noted in D57156, we want to check at least part of
this pattern earlier (in combining), so this will allow
the code to be shared instead of duplicated.
llvm-svn: 352127
https://reviews.llvm.org/D57178
Now add a hook in TargetPassConfig to query if CSE needs to be
enabled. By default this hook returns false only for the O0 opt level, but
this can be overridden by the target.
As a consequence of defaulting to enabled for non-O0, a few tests
needed to be updated to not use CSE (by passing -O0 on their RUN
lines).
reviewed by: arsenm
llvm-svn: 352126
PDBs contain several serialized hash tables. In the microsoft-pdb
repo published to support LLVM implementing PDB support, the
provided implementation initializes the bucket count for the TPI and IPI
streams to the maximum size. This occurs in tpi.cpp L33 and tpi.cpp L398.
In the LLVM code for generating PDBs, these streams are created with the
minimum number of buckets. This difference makes LLVM-generated
PDBs slower when used for debugging.
Patch by C.J. Hebert
Differential Revision: https://reviews.llvm.org/D56942
llvm-svn: 352117
This patch adds support for vector @llvm.ceil intrinsics when full 16 bit
floating point support isn't available.
To do this, this patch...
- Implements basic isel for G_UNMERGE_VALUES
- Teaches the legalizer about 16 bit floats
- Teaches AArch64RegisterBankInfo to respect floating point registers on
G_BUILD_VECTOR and G_UNMERGE_VALUES
- Teaches selectCopy about 16-bit floating point vectors
It also adds
- A legalizer test for the 16-bit vector ceil which verifies that we create a
G_UNMERGE_VALUES and G_BUILD_VECTOR when full fp16 isn't supported
- An instruction selection test which makes sure we lower to G_FCEIL when
full fp16 is supported
- A test for selecting G_UNMERGE_VALUES
And also updates arm64-vfloatintrinsics.ll to show that the new ceiling types
work as expected.
https://reviews.llvm.org/D56682
llvm-svn: 352113
Summary:
Using COFF's .def directive in module assembly used to crash ThinLTO
with "this directive only supported on COFF targets" when getting
symbol information in ModuleSymbolTable. This change allows
ModuleSymbolTable to process such code and adds a test to verify that
the .def directive has the desired effect on the native object file,
with and without ThinLTO.
Fixes https://bugs.llvm.org/show_bug.cgi?id=36789
Reviewers: rnk, pcc, vlad.tsyrklevich
Subscribers: mehdi_amini, eraman, hiraditya, dexonsmith, llvm-commits
Differential Revision: https://reviews.llvm.org/D57073
llvm-svn: 352112
A volatile operation cannot be used to prove an address points to normal
memory. (LangRef was recently updated to state it explicitly.)
Differential Revision: https://reviews.llvm.org/D57040
llvm-svn: 352109
Summary:
guessLibraryShortName() separates a full Mach-O dylib install name path
into a short name and a dyld image suffix. The short name is the name
of the dylib without its path or extension. The dyld image suffix is a
string used by dyld to load variants of dylibs if available at runtime;
for example, "when binding this process, load 'debug' variants of all
required dylibs." dyld knows exactly what the image suffix is, but
by convention diagnostic tools such as llvm-nm attempt to guess suffix
names by looking at the install name path.
These dyld image suffixes are separated from the short name by a '_'
character. Because the '_' character is commonly used to separate words
in filenames, guessLibraryShortName() cannot reliably separate a dylib's
short name from an arbitrary image suffix; imagine if both the short
name and the suffix contain an '_' character! To better deal with this
ambiguity, guessLibraryShortName() will recognize only "_debug" and
"_profile" as valid Suffix values. Calling code needs to be tolerant of
guessLibraryShortName() guessing incorrectly.
The previous implementation of guessLibraryShortName() did not allow
'_' characters to appear in short names. When present, the short name
would be truncated, e.g., "libcompiler_rt" => "libcompiler". This
change allows "libcompiler_rt" and "libcompiler_rt_debug" to both be
recognized as "libcompiler_rt".
rdar://47412244
Reviewers: kledzik, lhames, pete
Reviewed By: pete
Subscribers: llvm-commits
Differential Revision: https://reviews.llvm.org/D56978
llvm-svn: 352104
Summary:
MemorySSA needs updating each time an instruction is moved.
LICM and control flow hoisting re-hoist instructions, thus needing another update when those instructions are moved again.
Pending cleanup: the MSSA update is duplicated and should be moved inside moveInstructionBefore.
Reviewers: jnspaulsson
Subscribers: sanjoy, jlebar, Prazek, george.burgess.iv, llvm-commits
Differential Revision: https://reviews.llvm.org/D57176
llvm-svn: 352092
Performing splitting early has several advantages:
- Inhibiting inlining of cold code early improves code size. Compared
to scheduling splitting at the end of the pipeline, this cuts code
size growth in half within the iOS shared cache (0.69% to 0.34%).
- Inhibiting inlining of cold code improves compile time. There's no
need to inline split cold functions, or to inline as much *within*
those split functions, since they are marked `minsize`.
- During LTO, extra work is only done in the pre-link step. Less code
must be inlined during cross-module inlining.
An additional motivation here is that the most common cold regions
identified by the static/conservative splitting heuristic can (a) be
found before inlining and (b) do not grow after inlining. E.g.
__assert_fail, os_log_error.
The disadvantages are:
- Some opportunities for splitting out cold code may be missed. This
gap can potentially be narrowed by adding a worklist algorithm to the
splitting pass.
- Some opportunities to reduce code size may be lost (e.g. store
sinking, when one side of the CFG diamond is split). This does not
outweigh the code size benefits of splitting earlier.
On net, splitting early in the pipeline has substantial code size
benefits, and no major effects on memory locality or performance. We
measured memory locality using ktrace data, and consistently found that
10% fewer pages were needed to capture 95% of text page faults in key
iOS benchmarks. We measured performance on frequency-stabilized iOS
devices using LNT+externals.
This reverses course on the decision made to schedule splitting late in
r344869 (D53437).
Differential Revision: https://reviews.llvm.org/D57082
llvm-svn: 352080
It should be emitted when any floating-point operations (including
calls) are present in the object, not just when calls to printf/scanf
with floating point args are made.
The difference caused by this is very subtle: in static (/MT) builds,
on x86-32, in a program that uses floating point but doesn't print it,
the default x87 rounding mode may not be set properly upon
initialization.
This commit also removes the walk of the types pointed to by pointer
arguments in calls. (To assist in opaque pointer types migration --
eventually the pointee type won't be available.)
The latter implies that it will no longer consider a call like
`scanf("%f", &floatvar)` as sufficient to emit _fltused on its
own. And without _fltused, `scanf("%f")` will abort with error R6002. This
new behavior is unlikely to bite anyone in practice (you'd have to
read a float, and do nothing with it!), and also, is consistent with
MSVC.
Differential Revision: https://reviews.llvm.org/D56548
llvm-svn: 352076
After submitting https://reviews.llvm.org/D57138, I realized it was slightly more conservative than needed. The scalar indices don't appear to be a problem on a vector gep; we even had a test for that.
Differential Revision: https://reviews.llvm.org/D57161
llvm-svn: 352061
This is an alternative to https://reviews.llvm.org/D57103. After discussion, we decided to check this in as a temporary workaround and pursue a true fix under the original thread.
The issue at hand is that the base rewriting algorithm doesn't consider the fact that GEPs can turn a scalar input into a vector of outputs. We had handling for scalar GEPs and fully vector GEPs (i.e. all vector operands), but not the scalar-base + vector-index forms. A true fix here requires treating GEP analogously to extractelement or shufflevector.
This patch is merely a workaround. It simply hides the crash at the cost of some ugly codegen for this presumably very rare pattern.
Differential Revision: https://reviews.llvm.org/D57138
llvm-svn: 352059
Select zero extending and sign extending load for MIPS32.
Use size from MachineMemOperand to determine number of bytes to load.
Differential Revision: https://reviews.llvm.org/D57099
llvm-svn: 352038
Use CombinerHelper to combine extending load instructions.
G_LOAD combined with G_ZEXT, G_SEXT or G_ANYEXT gives G_ZEXTLOAD,
G_SEXTLOAD or G_LOAD with the same type as the def of the extending
instruction, respectively.
Similarly, G_ZEXTLOAD combined with G_ZEXT gives G_ZEXTLOAD, and
G_SEXTLOAD combined with G_SEXT gives G_SEXTLOAD, with the same type
as the def of the extending instruction.
Differential Revision: https://reviews.llvm.org/D56914
llvm-svn: 352037
Instead of manually computing DT and PDT, we can get them from the pass
manager, which ideally has them already cached. With the new pass
manager, we could even preserve DT/PDT on a per function basis in a
module pass.
I think this also addresses the TODO about re-using the computed DTs for
BFI. IIUC, GetBFI will fetch the DT from the pass manager, and we
will then fetch the cached version later.
Reviewers: vsk, hiraditya, tejohnson, thegameg, sebpop
Reviewed By: vsk
Differential Revision: https://reviews.llvm.org/D57092
llvm-svn: 352036
This reapplies commit r351987 with a failed test fix. Now the test
accepts both DW_OP_GNU_push_tls_address and DW_OP_form_tls_address
opcode.
Original commit message:
```
This is a fix for a regression introduced by the rL348194 commit. In
that change new type (MEK_DTPREL) of MipsMCExpr expression was added,
but in some places of the code this type of expression considered as
unexpected.
This change fixes the bug. The MEK_DTPREL type of expression is used for
marking TLS DIEExpr only and contains a regular sub-expression. Where we
need to handle the expression, we retrieve the sub-expression and
handle it in a common way.
```
llvm-svn: 352034
After creating new PHI instructions during isel pseudo expansion, the NoPHIs
property of MF should be reset in case it was previously set.
Review: Ulrich Weigand
llvm-svn: 352030
When we choose whether or not we should mark a block as dead, we have
inconsistent logic in the markup of live blocks.
- We take a candidate IF its terminator branches on a constant AND it is immediately
in the current loop;
- We mark a successor live IF its terminator doesn't branch on a constant OR it branches
on a constant and the successor is its always-taken block.
What we are missing here is that when the terminator branches on a constant but is
not taken as a candidate because it is not immediately in the current loop, we will
mark only one (always-taken) successor as live. Therefore, we do NOT do the actual
folding but may NOT mark one of the successors as live. So the result of the markup is
wrong in this case, and we may then hit various asserts.
Thanks to Jordan Rupprecht for reporting this!
Differential Revision: https://reviews.llvm.org/D57095
Reviewed By: rupprecht
llvm-svn: 352024
Summary:
UBSan wants to detect when unreachable code is actually reached, so it
adds instrumentation before every `unreachable` instruction. However,
the optimizer will remove code after calls to functions marked with
`noreturn`. To avoid this UBSan removes `noreturn` from both the call
instruction as well as from the function itself. Unfortunately, ASan
relies on this annotation to unpoison the stack by inserting calls to
`_asan_handle_no_return` before `noreturn` functions. This is important
for functions that do not return but access the the stack memory, e.g.,
unwinder functions *like* `longjmp` (`longjmp` itself is actually
"double-proofed" via its interceptor). The result is that when ASan and
UBSan are combined, the `noreturn` attributes are missing and ASan
cannot unpoison the stack, so it has false positives when stack
unwinding is used.
Changes:
1. UBSan now adds the `expect_noreturn` attribute whenever it removes
the `noreturn` attribute from a function.
2. ASan additionally checks for the presence of this attribute.
Generated code:
```
call void @__asan_handle_no_return // Additionally inserted to avoid false positives
call void @longjmp
call void @__asan_handle_no_return
call void @__ubsan_handle_builtin_unreachable
unreachable
```
The second call to `__asan_handle_no_return` is redundant. This will be
cleaned up in a follow-up patch.
rdar://problem/40723397
Reviewers: delcypher, eugenis
Tags: #sanitizers
Differential Revision: https://reviews.llvm.org/D56624
llvm-svn: 352003
Summary:
Profile sample files include the number of times each entry or inlined
call site is sampled. This is translated into the entry count metadata
on functions.
When sample data is being read, if a call site that was inlined
in the sample program is considered cold and not inlined, then
the entry count of the out-of-line function does not reflect
the current compilation.
In this patch, we note call sites where the function was not inlined
and as a last action of the sample profile loading, we update the
called function's entry count to reflect the calls from these
call sites which are not included in the profile file.
Reviewers: danielcdh, wmi, Kader, modocache
Reviewed By: wmi
Subscribers: davidxl, eraman, llvm-commits
Differential Revision: https://reviews.llvm.org/D52845
llvm-svn: 352001
Summary:
Renamed setBaseDiscriminator to cloneWithBaseDiscriminator, to match
similar APIs. Also changed its behavior to copy over the other
discriminator components, instead of eliding them.
Renamed cloneWithDuplicationFactor to
cloneByMultiplyingDuplicationFactor, which more closely matches what
this API does.
Reviewers: dblaikie, wmi
Reviewed By: dblaikie
Subscribers: zzheng, llvm-commits
Differential Revision: https://reviews.llvm.org/D56220
llvm-svn: 351996
Summary:
Previously no client of ilist traits has needed to know about transfers
of nodes within the same list, so as an optimization, ilist doesn't call
transferNodesFromList in that case. However, now there are clients that
want to use ilist traits to cache instruction ordering information to
optimize dominance queries of instructions in the same basic block.
This change updates the existing ilist traits users to detect in-list
transfers and do nothing in that case.
After this change, we can start caching instruction ordering information
in LLVM IR data structures. There are two main ways to do that:
- by putting an order integer into the Instruction class
- by maintaining order integers in a hash table on BasicBlock
I plan to implement and measure both, but I wanted to commit this change
first to enable other out of tree ilist clients to implement this
optimization as well.
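A toy sketch of the first option, an order integer kept in the instruction (invented types; assumes renumbering on structural changes):
```
#include <cassert>
#include <list>

struct Instr {
  unsigned Order = 0; // cached position within the block
};

// Renumber after any structural change; same-block ordering queries are
// then O(1) instead of a linear scan of the block.
void renumber(std::list<Instr> &Block) {
  unsigned N = 0;
  for (Instr &I : Block)
    I.Order = N++;
}

bool comesBefore(const Instr &A, const Instr &B) {
  return A.Order < B.Order;
}

int main() {
  std::list<Instr> Block(3);
  renumber(Block);
  assert(comesBefore(Block.front(), Block.back()));
}
```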
Reviewers: lattner, hfinkel, chandlerc
Subscribers: hiraditya, dexonsmith, llvm-commits
Differential Revision: https://reviews.llvm.org/D57120
llvm-svn: 351992
VPlan-based predication for the VPlan-native path.
Context: Patch Series #2 for outer loop vectorization support in LV
using VPlan. (RFC:
http://lists.llvm.org/pipermail/llvm-dev/2017-December/119523.html).
Patch series #2 checks that inner loops are still trivially lock-step
among all vector elements. Non-loop branches are blindly assumed as
divergent.
Changes here implement VPlan based predication algorithm to compute
predicates for blocks that need predication. Predicates are computed
for the VPLoop region in reverse post order. A block's predicate is
computed as OR of the masks of all incoming edges. The mask for an
incoming edge is computed as AND of predecessor block's predicate and
either predecessor's Condition bit or NOT(Condition bit) depending on
whether the edge from predecessor block to the current block is true
or false edge.
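A toy standalone sketch of that computation (invented Block type, symbolic predicates as strings; not the VPlan API):
```
#include <iostream>
#include <string>
#include <vector>

// Preds holds (predecessor, is-true-edge) pairs; Condition is empty for
// blocks that end in an unconditional branch.
struct Block {
  std::string Name, Predicate, Condition;
  std::vector<std::pair<Block *, bool>> Preds;
};

// Assumes blocks are visited in reverse post order, so every
// predecessor's predicate is already computed.
void computePredicate(Block &B) {
  if (B.Preds.empty()) {
    B.Predicate = "true"; // region entry
    return;
  }
  std::string Expr;
  for (auto &P : B.Preds) {
    Block *Pred = P.first;
    // Edge mask = predecessor's predicate AND (cond or NOT cond).
    std::string Edge =
        Pred->Condition.empty()
            ? Pred->Predicate
            : "(" + Pred->Predicate + " & " + (P.second ? "" : "!") +
                  Pred->Condition + ")";
    // Block predicate = OR over all incoming edge masks.
    Expr = Expr.empty() ? Edge : Expr + " | " + Edge;
  }
  B.Predicate = Expr;
}

int main() {
  Block Entry{"entry", "", "c", {}};
  Block Then{"then", "", "", {{&Entry, true}}};
  Block Else{"else", "", "", {{&Entry, false}}};
  Block Merge{"merge", "", "", {{&Then, true}, {&Else, true}}};
  for (Block *B : {&Entry, &Then, &Else, &Merge}) {
    computePredicate(*B);
    std::cout << B->Name << ": " << B->Predicate << "\n";
  }
}
```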
Reviewers: fhahn, rengolin, hsaito, dcaballe
Reviewed By: fhahn
Patch by Satish Guggilla, thanks!
Differential Revision: https://reviews.llvm.org/D53349
llvm-svn: 351990
This saves a cbz+cold call in the interceptor ABI, as well as a realign
in both ABIs, trading off a dcache entry against some branch predictor
entries and some code size.
Unfortunately the functionality is hidden behind a flag because ifunc is
known to be broken on static binaries on Android.
Differential Revision: https://reviews.llvm.org/D57084
llvm-svn: 351989
This is a fix for a regression introduced by the rL348194 commit. In
that change new type (MEK_DTPREL) of MipsMCExpr expression was added,
but in some places of the code this type of expression considered as
unexpected.
This change fixes the bug. The MEK_DTPREL type of expression is used for
marking TLS DIEExpr only and contains a regular sub-expression. Where we
need to handle the expression, we retrieve the sub-expression and
handle it in a common way.
llvm-svn: 351987
Enable full support for the debug info. Recommit to fix the emission of
a closing brace that was not required.
Differential revision: https://reviews.llvm.org/D46189
llvm-svn: 351972
This patch adds a new ReadAdvance definition named ReadInt2Fpu.
ReadInt2Fpu allows x86 scheduling models to accurately describe delays caused by
data transfers from the integer unit to the floating point unit.
ReadInt2Fpu currently defaults to a delay of zero cycles (i.e. no delay) for all
x86 models excluding BtVer2. That means this patch is a functional change
for the Jaguar cpu model only.
Tablegen definitions for instructions (V)PINSR* have been updated to account for
the new ReadInt2Fpu. That read is mapped to the GPR input operand.
On Jaguar, int-to-fpu transfers are modeled as a +6cy delay. Before this patch,
that extra delay was added to the opcode latency. In practice, the insert opcode
only executes for 1cy. Most of the observed latency is actually contributed by the
so-called operand latency. According to the AMD SOG for family 16h, (V)PINSR*
latency is defined by expression f+1, where f is defined as a forwarding delay
from the integer unit to the fpu.
When printing instruction latency from MCA (see InstructionInfoView.cpp) and LLC
(only when the flag -print-schedule is specified), we now need to account for any
extra forwarding delays. We do this by checking if scheduling classes declare
any negative ReadAdvance entries. Quoting a code comment in TargetSchedule.td:
"A negative advance effectively increases latency, which may be used for
cross-domain stalls". When computing the instruction latency for the purpose of
our scheduling tests, we now add any extra delay to the formula. This avoids
regressing existing codegen and mca schedule tests. It comes with the cost of an
extra (but very simple) hook in MCSchedModel.
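As a toy arithmetic sketch of the rule quoted above (an invented function, not the actual MCSchedModel hook):
```
#include <cassert>

// Observed latency of a value read through a ReadAdvance: a positive
// advance lets the operand be read early, while a negative advance
// models a cross-domain stall that adds to the producer's latency.
int observedLatency(int OpcodeLatency, int ReadAdvanceCycles) {
  return OpcodeLatency - ReadAdvanceCycles;
}

int main() {
  // Jaguar (V)PINSR*: 1cy opcode + 6cy int-to-fpu forwarding = 7cy,
  // i.e. the "f + 1" expression with f = 6.
  assert(observedLatency(1, -6) == 7);
}
```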
Differential Revision: https://reviews.llvm.org/D57056
llvm-svn: 351965
This patch replaces the existing LLVMVectorSameWidth matcher with LLVMScalarOrSameVectorWidth.
The matching args must be either scalars or vectors with the same number of elements, but in either case the scalar/element type can differ, specified by LLVMScalarOrSameVectorWidth.
I've updated the _overflow intrinsics to demonstrate this - allowing them to return an i1 or <N x i1> overflow result, matching the scalar/vector width of the other (add/sub/mul) result type.
The masked load/store/gather/scatter intrinsics have also been updated to use this, although as we specify the reference type to be llvm_anyvector_ty we guarantee the mask will be <N x i1>, so there is no change in behaviour.
Differential Revision: https://reviews.llvm.org/D57090
llvm-svn: 351957
Summary:
With XNACK, an smem load whose result is coalesced with an operand (thus
it overwrites its own operand) cannot appear in a clause, because some
other instruction might XNACK and restart the whole clause.
The clause breaker already realized that an smem that overwrites an
operand cannot appear in a clause, and broke the clause. The problem
that this commit fixes is that the SIFormMemoryClauses optimization
formed a bundle with early clobber, which caused the earlier code that
set up the coalesced operand to be removed as dead.
Differential Revision: https://reviews.llvm.org/D57008
Change-Id: I703c4d5b0bf7d6060222bec491f45c18bb3c0016
llvm-svn: 351950
Currently in Arm code, we allocate LR first, under the assumption that
it needs to be saved anyway. Unfortunately this has the disadvantage
that it will require any instructions using it to be the longer thumb2
instructions, not the shorter thumb1 ones.
This switches the order when we are optimising for minsize, returning to
the default order so that more lower registers can be used. It can end
up requiring more pushed registers, but on average produces smaller code.
Differential Revision: https://reviews.llvm.org/D56008
llvm-svn: 351938
In the last stage of type promotion, we replace any zext that uses a
new trunc with the operand of the trunc. This was okay when we only
allowed one type to be optimised, but now the trunc may be needed to
produce a narrower type than the one we were optimising for. So we need
to check this before doing the replacement.
Differential Revision: https://reviews.llvm.org/D57041
llvm-svn: 351935
The current check in CombineToPreIndexedLoadStore is too
conservative, preventing a pre-indexed store when the base pointer
is a predecessor of the value being stored. Instead, we should check
the pointer operand of the store.
Differential Revision: https://reviews.llvm.org/D56719
llvm-svn: 351933
As part of speculation hardening, the stack pointer gets masked with the
taint register (X16) before a function call or before a function return.
Since there are no instructions that can directly mask writing to the
stack pointer, the stack pointer must first be transferred to another
register, where it can be masked, before that value is transferred back
to the stack pointer.
Before, that temporary register was always picked to be x17, since the
ABI allows clobbering x17 on any function call, resulting in the
following instruction pattern being inserted before function calls and
returns/tail calls:
mov x17, sp
and x17, x17, x16
mov sp, x17
However, x17 can be live in those locations, for example when the call
is an indirect call, using x17 as the target address (blr x17).
To fix this, this patch looks for an available register just before the
call or terminator instruction and uses that.
In the rare case when no register turns out to be available (this
situation is only encountered twice across the whole test-suite), just
insert a full speculation barrier at the start of the basic block where
this occurs.
Differential Revision: https://reviews.llvm.org/D56717
llvm-svn: 351930
Two backend optimizations failed to handle cases when compiled with -g, due
to failing to consider DBG_VALUE instructions. This was in
SystemZTargetLowering::emitSelect() and
SystemZElimCompare::getRegReferences().
This patch makes sure that DBG_VALUEs are recognized so that they do not
affect these optimizations.
Tests for branch-on-count, load-and-trap and consecutive selects.
Review: Ulrich Weigand
https://reviews.llvm.org/D57048
llvm-svn: 351928
This patch relaxes restrictions on the types of the latch condition and range check.
In the current implementation, they must match. This patch allows handling
wide range checks against a narrow condition. The motivating example is the
following:
  int N = ...
  for (long i = 0; (int) i < N; i++) {
    if (i >= length) deopt;
  }
In this patch, the option that enables this support is turned off by
default. We'll switch it to true later.
Differential Revision: https://reviews.llvm.org/D56837
Reviewed By: reames
llvm-svn: 351926
#pragma clang loop pipeline(disable)
Disable SWP optimization for the next loop.
“disable” is the only possible value.
#pragma clang loop pipeline_initiation_interval(number)
Set the initiation interval of the SWP
optimization for the next loop to the
specified number. The number must be a
positive value greater than 0.
These pragmas can be used for debugging or for reducing
compile time. It is possible to disable SWP for
specific loops to save compilation time, or to find bugs
by not doing SWP on certain loops. It is also possible to set
the initiation interval to a concrete number to save
compilation time by not doing extra pipeliner passes, or
to check the created schedule for a specific initiation interval.
This is the LLVM part of the fix.
Clang part of the fix: https://reviews.llvm.org/D55710
Patch by Alexey Lapshin!
Differential Revision: https://reviews.llvm.org/D56403
llvm-svn: 351923
Each hwasan check requires emitting a small piece of code like this:
https://clang.llvm.org/docs/HardwareAssistedAddressSanitizerDesign.html#memory-accesses
The problem with this is that these code blocks typically bloat code
size significantly.
An obvious solution is to outline these blocks of code. In fact, this
has already been implemented under the -hwasan-instrument-with-calls
flag. However, as currently implemented this has a number of problems:
- The functions use the same calling convention as regular C functions.
This means that the backend must spill all temporary registers as
required by the platform's C calling convention, even though the
check only needs two registers on the hot path.
- The functions take the address to be checked in a fixed register,
which increases register pressure.
Both of these factors can diminish the code size effect and increase
the performance hit of -hwasan-instrument-with-calls.
The solution that this patch implements is to involve the aarch64
backend in outlining the checks. An intrinsic and pseudo-instruction
are created to represent a hwasan check. The pseudo-instruction
is register allocated like any other instruction, and we allow the
register allocator to select almost any register for the address to
check. A particular combination of (register selection, type of check)
triggers the creation in the backend of a function to handle the check
for specifically that pair. The resulting functions are deduplicated by
the linker. The pseudo-instruction (really the function) is specified
to preserve all registers except for the registers that the AAPCS
specifies may be clobbered by a call.
To measure the code size and performance effect of this change, I
took a number of measurements using Chromium for Android on aarch64,
comparing a browser with inlined checks (the baseline) against a
browser with outlined checks.
Code size: Size of .text decreases from 243897420 to 171619972 bytes,
or a 30% decrease.
Performance: Using Chromium's blink_perf.layout microbenchmarks I
measured a median performance regression of 6.24%.
The fact that a perf/size tradeoff is evident here suggests that
we might want to make the new behaviour conditional on -Os/-Oz.
But for now I've enabled it unconditionally, my reasoning being that
hwasan users typically expect a relatively large perf hit, and ~6%
isn't really adding much. We may want to revisit this decision in
the future, though.
I also tried experimenting with varying the number of registers
selectable by the hwasan check pseudo-instruction (which would result
in fewer variants being created), on the hypothesis that creating
fewer variants of the function would expose another perf/size tradeoff
by reducing icache pressure from the check functions at the cost of
register pressure. Although I did observe a code size increase with
fewer registers, I did not observe a strong correlation between the
number of registers and the performance of the resulting browser on the
microbenchmarks, so I conclude that we might as well use ~all registers
to get the maximum code size improvement. My results are below:
Regs | .text size | Perf hit
-----+------------+---------
~all | 171619972  | 6.24%
  16 | 171765192  | 7.03%
   8 | 172917788  | 5.82%
   4 | 177054016  | 6.89%
Differential Revision: https://reviews.llvm.org/D56954
llvm-svn: 351920
Previously, MemoryBlock automatically extended a requested buffer size to a
multiple of the page size because (I believe) doing so was thought to be harmless
and, with that, you could get more memory (on average 2KiB on 4KiB-page systems)
"for free".
"for free".
That programming interface turned out to be error-prone. If you request N
bytes, you usually expect that a resulting object returns N for `size()`.
That's not the case for MemoryBlock.
It looks like there is only one place where we take advantage of
allocating more memory than the requested size. So, with this patch, I
simply removed the automatic size expansion feature from MemoryBlock
and do it on the caller side when needed. MemoryBlock now always
returns a buffer whose size is equal to the requested size.
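For callers that still want page-granular sizes, the rounding is simple (a minimal sketch; the 4KiB page size in the example is an assumption):
```
#include <cassert>
#include <cstddef>

// Round Size up to the next multiple of PageSize (a power of two in
// practice; plain arithmetic is used here for clarity).
size_t roundUpToPageSize(size_t Size, size_t PageSize) {
  return (Size + PageSize - 1) / PageSize * PageSize;
}

int main() {
  assert(roundUpToPageSize(1, 4096) == 4096);
  assert(roundUpToPageSize(4096, 4096) == 4096);
  assert(roundUpToPageSize(4097, 4096) == 8192);
}
```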
Differential Revision: https://reviews.llvm.org/D56941
llvm-svn: 351916
Summary:
`CodeViewDebug::lowerTypeMemberFunction` used to default to a `Void`
return type if the function's type array was empty. After D54667, it
started blindly indexing the 0th item for the return type, which fails
in `getOperand` for empty arrays if assertions are enabled.
This patch restores the `Void` return type for empty type arrays, and
adds a test generated by Rust in line-only debuginfo mode.
Reviewers: zturner, rnk
Reviewed By: rnk
Subscribers: hiraditya, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D57070
llvm-svn: 351910
The splitting pass does not need BFI unless the Module actually has a profile
summary. Do not calculate BFI unless the summary is present.
For the sqlite3 amalgamation, this reduces time spent in the splitting pass
from 0.4% of the total to under 0.1%.
llvm-svn: 351894
The splitting pass does not need (post)domtrees until after it's found a
cold block. Defer domtree calculation until a cold block is found.
For the sqlite3 amalgamation, this reduces time spent in the splitting
pass from 0.8% of the total to 0.4%.
llvm-svn: 351892
Also add debug prints in the default case of the switches in these routines.
Most if not all of the type legalization handlers already do this, so this makes promoting floats consistent.
llvm-svn: 351890
It might be a bit nicer to use the fancy .legalIf and co. predicates,
but this required more boilerplate and disabled the coverage
assertions.
llvm-svn: 351886
If the underlying filesystem does not support the mmap system call,
FileOutputBuffer may fail when it attempts to mmap an output temporary
file. This patch handles that situation.
Unfortunately, it looks like it is very hard to test this functionality
with llvm-lit without a filesystem that doesn't support mmap. I tested
this locally by passing an invalid parameter to mmap so that it fails and
falls back to the in-memory buffer. Maybe that's all we can do.
I believe it is reasonable to submit this without a test.
Differential Revision: https://reviews.llvm.org/D56949
llvm-svn: 351883
For AMDGPU the shift amount is never 64-bit, and
this needs to use a 32-bit shift.
X86 uses i8, but seemed to be hacking around this before.
llvm-svn: 351882
The old diagnostic form of the trace produced by -v and -vv looks
like:
```
check1:1:8: remark: CHECK: expected string found in input
CHECK: abc
^
<stdin>:1:3: note: found here
; abc def
^~~
```
When dumping annotated input is requested (via -dump-input), I find
that this old trace is not useful and is sometimes harmful:
1. The old trace is mostly redundant because the same basic
information also appears in the input dump's annotations.
2. The old trace buries any error diagnostic between it and the input
dump, but I find it useful to see any error diagnostic up front.
3. FILECHECK_OPTS=-dump-input=fail requests annotated input dumps only
for failed FileCheck calls. However, I have to also add -v or -vv
to get a full set of annotations, and that can produce massive
output from all FileCheck calls in all tests. That's a real
problem when I run this in the IDE I use, which grinds to a halt as
it tries to capture all that output.
When -dump-input=fail|always, this patch suppresses the old trace from
-v or -vv. Error diagnostics still print as usual. If you want the
old trace, perhaps to see variable expansions, you can set
-dump-input=none (the default).
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D55825
llvm-svn: 351881
I was honestly a bit surprised that we didn't do this before. This
patch handles "-" as stdout, so that if you pass `-o -` to
lld, for example, it writes its output to stdout instead of to a file named `-`.
I thought that we might want to handle this at a higher level than
FileOutputBuffer, because if we land this patch, we can no longer
create a file whose name is `-` (there's a workaround though; you can
pass `./-` instead of `-`). However, because raw_fd_ostream already
handles `-` as a special file name, I think it's okay and actually
consistent to handle `-` as a special name in FileOutputBuffer.
Differential Revision: https://reviews.llvm.org/D56940
llvm-svn: 351852
Summary: Enable full support for the debug info.
Reviewers: echristo
Subscribers: jholewinski, aprantl, JDevlieghere, llvm-commits
Differential Revision: https://reviews.llvm.org/D46189
llvm-svn: 351846
Summary: Initial function labels must follow the debug location for the correct relocation info generation.
Reviewers: tra, jlebar, echristo
Subscribers: jholewinski, llvm-commits
Differential Revision: https://reviews.llvm.org/D45784
llvm-svn: 351843