Commit Graph

9624 Commits

Author SHA1 Message Date
Florian Hahn 1bd58870e5 [LoopUnroll] Use LoopSize+1 as threshold, to allow unrolling loops matching LoopSize.
We use `< UP.Threshold` later on, so we should use LoopSize + 1, to
allow unrolling if the result won't exceed to loop size.

Fixes PR43305.

Reviewers: efriedma, dmgreen, paquette

Reviewed By: dmgreen

Differential Revision: https://reviews.llvm.org/D67594

llvm-svn: 372084
2019-09-17 09:02:48 +00:00
Alina Sbirlea 6943472d45 [MemorySSA] Pass (for update) MSSAU when hoisting instructions.
Summary: Pass MSSAU to makeLoopInvariant in order to properly update MSSA.

Reviewers: george.burgess.iv

Subscribers: Prazek, sanjoy.google, uabelho, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D67470

llvm-svn: 371748
2019-09-12 17:12:51 +00:00
Eli Friedman 403e08d4cf [ConstantHoisting] Fix non-determinism.
Differential Revision: https://reviews.llvm.org/D66114

llvm-svn: 371644
2019-09-11 18:55:00 +00:00
Petr Hosek 7bdad08429 Reland "clang-misexpect: Profile Guided Validation of Performance Annotations in LLVM"
This patch contains the basic functionality for reporting potentially
incorrect usage of __builtin_expect() by comparing the developer's
annotation against a collected PGO profile. A more detailed proposal and
discussion appears on the CFE-dev mailing list
(http://lists.llvm.org/pipermail/cfe-dev/2019-July/062971.html) and a
prototype of the initial frontend changes appear here in D65300

We revised the work in D65300 by moving the misexpect check into the
LLVM backend, and adding support for IR and sampling based profiles, in
addition to frontend instrumentation.

We add new misexpect metadata tags to those instructions directly
influenced by the llvm.expect intrinsic (branch, switch, and select)
when lowering the intrinsics. The misexpect metadata contains
information about the expected target of the intrinsic so that we can
check against the correct PGO counter when emitting diagnostics, and the
compiler's values for the LikelyBranchWeight and UnlikelyBranchWeight.
We use these branch weight values to determine when to emit the
diagnostic to the user.

A future patch should address the comment at the top of
LowerExpectIntrisic.cpp to hoist the LikelyBranchWeight and
UnlikelyBranchWeight values into a shared space that can be accessed
outside of the LowerExpectIntrinsic pass. Once that is done, the
misexpect metadata can be updated to be smaller.

In the long term, it is possible to reconstruct portions of the
misexpect metadata from the existing profile data. However, we have
avoided this to keep the code simple, and because some kind of metadata
tag will be required to identify which branch/switch/select instructions
are influenced by the use of llvm.expect

Patch By: paulkirth
Differential Revision: https://reviews.llvm.org/D66324

llvm-svn: 371635
2019-09-11 16:19:50 +00:00
Florian Hahn e79381c3f7 [LoopInterchange] Drop unused splitInnerLoopHeader declaration.
llvm-svn: 371601
2019-09-11 10:32:15 +00:00
Dmitri Gribenko 57256af307 Revert "clang-misexpect: Profile Guided Validation of Performance Annotations in LLVM"
This reverts commit r371584. It introduced a dependency from compiler-rt
to llvm/include/ADT, which is problematic for multiple reasons.

One is that it is a novel dependency edge, which needs cross-compliation
machinery for llvm/include/ADT (yes, it is true that right now
compiler-rt included only header-only libraries, however, if we allow
compiler-rt to depend on anything from ADT, other libraries will
eventually get used).

Secondly, depending on ADT from compiler-rt exposes ADT symbols from
compiler-rt, which would cause ODR violations when Clang is built with
the profile library.

llvm-svn: 371598
2019-09-11 09:16:17 +00:00
Florian Hahn e4961218fd [LoopInterchange] Properly move condition, induction increment and ops to latch.
Currently we only rely on the induction increment to come before the
condition to ensure the required instructions get moved to the new
latch.

This patch duplicates and moves the required instructions to the
newly created latch. We move the condition to the end of the new block,
then process its operands. We stop at operands that are defined
outside the loop, or are the induction PHI.

We duplicate the instructions and update the uses in the moved
instructions, to ensure other users remain intact. See the added
test2 for such an example.

Reviewers: efriedma, mcrosier

Reviewed By: efriedma

Differential Revision: https://reviews.llvm.org/D67367

llvm-svn: 371595
2019-09-11 08:23:23 +00:00
Petr Hosek 394a8ed8f1 clang-misexpect: Profile Guided Validation of Performance Annotations in LLVM
This patch contains the basic functionality for reporting potentially
incorrect usage of __builtin_expect() by comparing the developer's
annotation against a collected PGO profile. A more detailed proposal and
discussion appears on the CFE-dev mailing list
(http://lists.llvm.org/pipermail/cfe-dev/2019-July/062971.html) and a
prototype of the initial frontend changes appear here in D65300

We revised the work in D65300 by moving the misexpect check into the
LLVM backend, and adding support for IR and sampling based profiles, in
addition to frontend instrumentation.

We add new misexpect metadata tags to those instructions directly
influenced by the llvm.expect intrinsic (branch, switch, and select)
when lowering the intrinsics. The misexpect metadata contains
information about the expected target of the intrinsic so that we can
check against the correct PGO counter when emitting diagnostics, and the
compiler's values for the LikelyBranchWeight and UnlikelyBranchWeight.
We use these branch weight values to determine when to emit the
diagnostic to the user.

A future patch should address the comment at the top of
LowerExpectIntrisic.cpp to hoist the LikelyBranchWeight and
UnlikelyBranchWeight values into a shared space that can be accessed
outside of the LowerExpectIntrinsic pass. Once that is done, the
misexpect metadata can be updated to be smaller.

In the long term, it is possible to reconstruct portions of the
misexpect metadata from the existing profile data. However, we have
avoided this to keep the code simple, and because some kind of metadata
tag will be required to identify which branch/switch/select instructions
are influenced by the use of llvm.expect

Patch By: paulkirth
Differential Revision: https://reviews.llvm.org/D66324

llvm-svn: 371584
2019-09-11 01:09:16 +00:00
Dmitri Gribenko 2bf8d77453 Revert "Reland "r364412 [ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline.""
This reverts commit r371502, it broke tests
(clang/test/CodeGenCXX/auto-var-init.cpp).

llvm-svn: 371507
2019-09-10 10:39:09 +00:00
Clement Courbet 612c260ec3 Reland "r364412 [ExpandMemCmp][MergeICmps] Move passes out of CodeGen into opt pipeline."
With a fix for sanitizer breakage (see explanation in D60318).

llvm-svn: 371502
2019-09-10 09:18:00 +00:00
Petr Hosek 7d1757aba8 Revert "clang-misexpect: Profile Guided Validation of Performance Annotations in LLVM"
This reverts commit r371484: this broke sanitizer-x86_64-linux-fast bot.

llvm-svn: 371488
2019-09-10 06:25:13 +00:00
Petr Hosek a10802fd73 clang-misexpect: Profile Guided Validation of Performance Annotations in LLVM
This patch contains the basic functionality for reporting potentially
incorrect usage of __builtin_expect() by comparing the developer's
annotation against a collected PGO profile. A more detailed proposal and
discussion appears on the CFE-dev mailing list
(http://lists.llvm.org/pipermail/cfe-dev/2019-July/062971.html) and a
prototype of the initial frontend changes appear here in D65300

We revised the work in D65300 by moving the misexpect check into the
LLVM backend, and adding support for IR and sampling based profiles, in
addition to frontend instrumentation.

We add new misexpect metadata tags to those instructions directly
influenced by the llvm.expect intrinsic (branch, switch, and select)
when lowering the intrinsics. The misexpect metadata contains
information about the expected target of the intrinsic so that we can
check against the correct PGO counter when emitting diagnostics, and the
compiler's values for the LikelyBranchWeight and UnlikelyBranchWeight.
We use these branch weight values to determine when to emit the
diagnostic to the user.

A future patch should address the comment at the top of
LowerExpectIntrisic.cpp to hoist the LikelyBranchWeight and
UnlikelyBranchWeight values into a shared space that can be accessed
outside of the LowerExpectIntrinsic pass. Once that is done, the
misexpect metadata can be updated to be smaller.

In the long term, it is possible to reconstruct portions of the
misexpect metadata from the existing profile data. However, we have
avoided this to keep the code simple, and because some kind of metadata
tag will be required to identify which branch/switch/select instructions
are influenced by the use of llvm.expect

Patch By: paulkirth
Differential Revision: https://reviews.llvm.org/D66324

llvm-svn: 371484
2019-09-10 03:11:39 +00:00
Teresa Johnson 9c27b59cec Change TargetLibraryInfo analysis passes to always require Function
Summary:
This is the first change to enable the TLI to be built per-function so
that -fno-builtin* handling can be migrated to use function attributes.
See discussion on D61634 for background. This is an enabler for fixing
handling of these options for LTO, for example.

This change should not affect behavior, as the provided function is not
yet used to build a specifically per-function TLI, but rather enables
that migration.

Most of the changes were very mechanical, e.g. passing a Function to the
legacy analysis pass's getTLI interface, or in Module level cases,
adding a callback. This is similar to the way the per-function TTI
analysis works.

There was one place where we were looking for builtins but not in the
context of a specific function. See FindCXAAtExit in
lib/Transforms/IPO/GlobalOpt.cpp. I'm somewhat concerned my workaround
could provide the wrong behavior in some corner cases. Suggestions
welcome.

Reviewers: chandlerc, hfinkel

Subscribers: arsenm, dschuff, jvesely, nhaehnle, mehdi_amini, javed.absar, sbc100, jgravelle-google, eraman, aheejin, steven_wu, george.burgess.iv, dexonsmith, jfb, asbirlea, gchatelet, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66428

llvm-svn: 371284
2019-09-07 03:09:36 +00:00
Denis Bakhvalov 58f172f05a [MergedLoadStoreMotion] Sink stores to BB with more than 2 predecessors
If we have:

bb5:
  br i1 %arg3, label %bb6, label %bb7

bb6:
  %tmp = getelementptr inbounds i32, i32* %arg1, i64 2
  store i32 3, i32* %tmp, align 4
  br label %bb9

bb7:
  %tmp8 = getelementptr inbounds i32, i32* %arg1, i64 2
  store i32 3, i32* %tmp8, align 4
  br label %bb9

bb9:  ; preds = %bb4, %bb6, %bb7
  ...

We can't sink stores directly into bb9.
This patch creates new BB that is successor of %bb6 and %bb7
and sinks stores into that block.

SplitFooterBB is the parameter to the pass that controls
that behavior.

Change-Id: I7fdf50a772b84633e4b1b860e905bf7e3e29940f
Differential: https://reviews.llvm.org/D66234
llvm-svn: 371089
2019-09-05 17:00:32 +00:00
Philip Reames 4228245e41 [NFC] Switch last couple of invariant_load checks to use hasMetadata
llvm-svn: 370948
2019-09-04 18:27:31 +00:00
Philip Reames 27820f9909 [Instruction] Add hasMetadata(Kind) helper [NFC]
It's a common idiom, so let's add the obvious wrapper for metadata kinds which are basically booleans.

llvm-svn: 370933
2019-09-04 17:28:48 +00:00
Sanjay Patel 4a2cd7be5a [InstSimplify] guard against unreachable code (PR43218)
This would crash:
https://bugs.llvm.org/show_bug.cgi?id=43218

llvm-svn: 370911
2019-09-04 15:12:55 +00:00
Philip Reames 30dc2da827 [GVN] Remove a todo introduced w/rL370791
When I dug into this, it turns out to be *much* more involved than I'd realized and doesn't actually simplify anything.  

The general purpose of the leader table is that we want to find the most-dominating definition quickly.  The problem for equivalance folding is slightly different; we want to find the most dominating *value* whose definition block dominates our use quickly.

To make this change, we'd end up having to restructure the leader table (either the sorting thereof, or maybe even introducing multiple leader tables per value) and that complexity is just not worth it.

llvm-svn: 370824
2019-09-03 21:56:17 +00:00
Philip Reames 37e2f5f125 [GVN] Propagate simple equalities from assumes within the tail of the block
This extends the existing logic for propagating constant expressions in an analogous manner for what we do across basic blocks. The core point is that we chose some order of operands, and canonicalize uses towards that one.

The heuristic used is inspired by the one used across blocks; in a follow up change, I'd plan to common them so that the cross block version uses the slightly stronger ordering herein. 

As noted by the TODOs in the code, there's a good amount of room for improving the existing code and making it more powerful.  Some follow up work planned.

Differential Revision: https://reviews.llvm.org/D66977

llvm-svn: 370791
2019-09-03 17:31:19 +00:00
Roman Lebedev bdd65351d3 Revert r370454 "[LoopIdiomRecognize] BCmp loop idiom recognition"
https://bugs.llvm.org/show_bug.cgi?id=43206 was filed,
claiming that there is a miscompilation.
Reverting until i investigate.

This reverts commit r370454

llvm-svn: 370788
2019-09-03 17:14:56 +00:00
Nikita Popov b9e668f2e7 [CVP] Generate simpler code for elided with.overflow intrinsics
Use a { iN undef, i1 false } struct as the base, and only insert
the first operand, instead of using { iN undef, i1 undef } as the
base and inserting both. This is the same as what we do in InstCombine.

Differential Revision: https://reviews.llvm.org/D67034

llvm-svn: 370573
2019-08-31 09:58:37 +00:00
Wei Mi 5ef5829fb0 [GVN] Verify value equality before doing phi translation for call instruction
This is an updated version of https://reviews.llvm.org/D66909 to fix PR42605.

Basically, current phi translatation translates an old value number to an new
value number for a call instruction based on the literal equality of call
expression, without verifying there is no clobber in between. This is incorrect.

To get a finegrain check, use MachineDependence analysis to do the job. However,
this is still not ideal. Although given a call instruction,
`MemoryDependenceResults::getCallDependencyFrom` returns identical call
instructions without clobber in between using MemDepResult with its DepType to
be `Def`. However, identical is too strict here and we want it to be relaxed a
little to consider phi-translation -- callee is the same, param operands can be
different. That means changing the semantic of `MemDepResult::Def` and I don't
know the potential impact.

So currently the patch is still conservative to only handle
MemDepResult::NonFuncLocal, which means the current call has no function local
clobber. If there is clobber, even if the clobber doesn't stand in between the
current call and the call with the new value, we won't do phi-translate.

Differential Revision: https://reviews.llvm.org/D67013

llvm-svn: 370547
2019-08-30 23:01:22 +00:00
Haojian Wu ed170c9bf9 Remove an extra ";", NFC.
llvm-svn: 370465
2019-08-30 12:09:31 +00:00
Roman Lebedev 5c9f3cfec7 [LoopIdiomRecognize] BCmp loop idiom recognition
Summary:
@mclow.lists brought up this issue up in IRC.
It is a reasonably common problem to compare some two values for equality.
Those may be just some integers, strings or arrays of integers.

In C, there is `memcmp()`, `bcmp()` functions.
In C++, there exists `std::equal()` algorithm.
One can also write that function manually.

libstdc++'s `std::equal()` is specialized to directly call `memcmp()` for
various types, but not `std::byte` from C++2a. https://godbolt.org/z/mx2ejJ

libc++ does not do anything like that, it simply relies on simple C++'s
`operator==()`. https://godbolt.org/z/er0Zwf (GOOD!)

So likely, there exists a certain performance opportunities.
Let's compare performance of naive `std::equal()` (no `memcmp()`) with one that
is using `memcmp()` (in this case, compiled with modified compiler). {F8768213}

```
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

#include "benchmark/benchmark.h"

template <class T>
bool equal(T* a, T* a_end, T* b) noexcept {
  for (; a != a_end; ++a, ++b) {
    if (*a != *b) return false;
  }
  return true;
}

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct Identical {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    auto Tmp = getVectorOfRandomNumbers<T>(count);
    return std::make_pair(Tmp, std::move(Tmp));
  }
};

struct InequalHalfway {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    auto V0 = getVectorOfRandomNumbers<T>(count);
    auto V1 = V0;
    V1[V1.size() / size_t(2)]++;  // just change the value.
    return std::make_pair(std::move(V0), std::move(V1));
  }
};

template <class T, class Gen>
void BM_bcmp(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    const bool is_equal = equal(a.data(), a.data() + a.size(), b.data());
    benchmark::DoNotOptimize(is_equal);
  }
  state.SetComplexityN(Length);
  state.counters["eltcnt"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["eltcnt/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = []() {
    for (const benchmark::CPUInfo::CacheInfo& I :
         benchmark::CPUInfo::Get().caches) {
      if (I.level == 2) return I.size;
    }
    return 0;
  }();
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical)
    ->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical)
    ->Apply(CustomArguments<uint64_t>);

BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway)
    ->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway)
    ->Apply(CustomArguments<uint64_t>);
```
{F8768210}
```
$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench
RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx
2019-04-25 21:17:11
Running build-old/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.65, 3.90, 4.14
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000           432131 ns       432101 ns         1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s
BM_bcmp<uint8_t, Identical>_BigO               0.86 N          0.86 N
BM_bcmp<uint8_t, Identical>_RMS                   8 %             8 %
<...>
BM_bcmp<uint16_t, Identical>/256000          161408 ns       161409 ns         4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s
BM_bcmp<uint16_t, Identical>_BigO              0.67 N          0.67 N
BM_bcmp<uint16_t, Identical>_RMS                 25 %            25 %
<...>
BM_bcmp<uint32_t, Identical>/128000           81497 ns        81488 ns         8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s
BM_bcmp<uint32_t, Identical>_BigO              0.71 N          0.71 N
BM_bcmp<uint32_t, Identical>_RMS                 42 %            42 %
<...>
BM_bcmp<uint64_t, Identical>/64000            50138 ns        50138 ns        10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s
BM_bcmp<uint64_t, Identical>_BigO              0.84 N          0.84 N
BM_bcmp<uint64_t, Identical>_RMS                 27 %            27 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000      192405 ns       192392 ns         3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.38 N          0.38 N
BM_bcmp<uint8_t, InequalHalfway>_RMS              3 %             3 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000     127858 ns       127860 ns         5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint16_t, InequalHalfway>_RMS             0 %             0 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000      49140 ns        49140 ns        14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.40 N          0.40 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            18 %            18 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000       32101 ns        32099 ns        21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint64_t, InequalHalfway>_RMS             1 %             1 %
RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0
2019-04-25 21:19:29
Running build-new/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.01, 2.85, 3.71
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000            18593 ns        18590 ns        37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s
BM_bcmp<uint8_t, Identical>_BigO               0.04 N          0.04 N
BM_bcmp<uint8_t, Identical>_RMS                  37 %            37 %
<...>
BM_bcmp<uint16_t, Identical>/256000           18950 ns        18948 ns        37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s
BM_bcmp<uint16_t, Identical>_BigO              0.08 N          0.08 N
BM_bcmp<uint16_t, Identical>_RMS                 34 %            34 %
<...>
BM_bcmp<uint32_t, Identical>/128000           18627 ns        18627 ns        37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s
BM_bcmp<uint32_t, Identical>_BigO              0.16 N          0.16 N
BM_bcmp<uint32_t, Identical>_RMS                 35 %            35 %
<...>
BM_bcmp<uint64_t, Identical>/64000            18855 ns        18855 ns        37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s
BM_bcmp<uint64_t, Identical>_BigO              0.32 N          0.32 N
BM_bcmp<uint64_t, Identical>_RMS                 33 %            33 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000        9570 ns         9569 ns        73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.02 N          0.02 N
BM_bcmp<uint8_t, InequalHalfway>_RMS             29 %            29 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000       9547 ns         9547 ns        74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.04 N          0.04 N
BM_bcmp<uint16_t, InequalHalfway>_RMS            29 %            29 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000       9396 ns         9394 ns        73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.08 N          0.08 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            30 %            30 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000        9499 ns         9498 ns        73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.16 N          0.16 N
BM_bcmp<uint64_t, InequalHalfway>_RMS            28 %            28 %
Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench
Benchmark                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000                      -0.9570         -0.9570        432131         18593        432101         18590
<...>
BM_bcmp<uint16_t, Identical>/256000                     -0.8826         -0.8826        161408         18950        161409         18948
<...>
BM_bcmp<uint32_t, Identical>/128000                     -0.7714         -0.7714         81497         18627         81488         18627
<...>
BM_bcmp<uint64_t, Identical>/64000                      -0.6239         -0.6239         50138         18855         50138         18855
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000                 -0.9503         -0.9503        192405          9570        192392          9569
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000                -0.9253         -0.9253        127858          9547        127860          9547
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000                -0.8088         -0.8088         49140          9396         49140          9394
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000                 -0.7041         -0.7041         32101          9499         32099          9498
```

What can we tell from the benchmark?
* Performance of naive equality check somewhat improves with element size,
  maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s
  for uint64_t. I think, that instability implies performance problems.
* Performance of `memcmp()`-aware benchmark always maxes out at around
  bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the
  naive variant!
* eltcnt/sec metric for the `memcmp()`-aware benchmark maxes out at
  eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and
  linearly decreases with element size.
  For uint64_t, it's ~4x+ the elements/second.
* The call obvious is more pricey than the loop, with small element count.
  As it can be seen from the full output {F8768210}, the `memcmp()` is almost
  universally worse, independent of the element size (and thus buffer size) when
  element count is less than 8.

So all in all, bcmp idiom does indeed pose untapped performance headroom.
This diff does implement said idiom recognition. I think a reasonable test
coverage is present, but do tell if there is anything obvious missing.

Now, quality. This does succeed to build and pass the test-suite, at least
without any non-bundled elements. {F8768216} {F8768217}
This transform fires 91 times:
```
$ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json
Tests: 1149
Metric: loop-idiom.NumBCmp

Program                                         result-new

MultiSourc...Benchmarks/7zip/7zip-benchmark    79.00
MultiSource/Applications/d/make_dparser         3.00
SingleSource/UnitTests/vla                      2.00
MultiSource/Applications/Burg/burg              1.00
MultiSourc.../Applications/JM/lencod/lencod     1.00
MultiSource/Applications/lemon/lemon            1.00
MultiSource/Benchmarks/Bullet/bullet            1.00
MultiSourc...e/Benchmarks/MallocBench/gs/gs     1.00
MultiSourc...gs-C/TimberWolfMC/timberwolfmc     1.00
MultiSourc...Prolangs-C/simulator/simulator     1.00
```
The size changes are:
I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look.
```
$ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash
Tests: 1149
Same hash: 907 (filtered out)
Remaining: 242
Metric: size..text

Program                                        result-old result-new diff
test-suite...ingleSource/UnitTests/vla.test   753.00     833.00     10.6%
test-suite...marks/7zip/7zip-benchmark.test   1001697.00 966657.00  -3.5%
test-suite...ngs-C/simulator/simulator.test   32369.00   32321.00   -0.1%
test-suite...plications/d/make_dparser.test   89585.00   89505.00   -0.1%
test-suite...ce/Applications/Burg/burg.test   40817.00   40785.00   -0.1%
test-suite.../Applications/lemon/lemon.test   47281.00   47249.00   -0.1%
test-suite...TimberWolfMC/timberwolfmc.test   250065.00  250113.00   0.0%
test-suite...chmarks/MallocBench/gs/gs.test   149889.00  149873.00  -0.0%
test-suite...ications/JM/lencod/lencod.test   769585.00  769569.00  -0.0%
test-suite.../Benchmarks/Bullet/bullet.test   770049.00  770049.00   0.0%
test-suite...HMARK_ANISTROPIC_DIFFUSION/128    NaN        NaN        nan%
test-suite...HMARK_ANISTROPIC_DIFFUSION/256    NaN        NaN        nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/64    NaN        NaN        nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/32    NaN        NaN        nan%
test-suite...ENCHMARK_BILATERAL_FILTER/64/4    NaN        NaN        nan%
Geomean difference                                                   nan%
         result-old    result-new       diff
count  1.000000e+01  10.00000      10.000000
mean   3.152090e+05  311695.40000  0.006749
std    3.790398e+05  372091.42232  0.036605
min    7.530000e+02  833.00000    -0.034981
25%    4.243300e+04  42401.00000  -0.000866
50%    1.197370e+05  119689.00000 -0.000392
75%    6.397050e+05  639705.00000 -0.000005
max    1.001697e+06  966657.00000  0.106242
```

I don't have timings though.

And now to the code. The basic idea is to completely replace the whole loop.
If we can't fully kill it, don't transform.
I have left one or two comments in the code, so hopefully it can be understood.

Also, there is a few TODO's that i have left for follow-ups:
* widening of `memcmp()`/`bcmp()`
* step smaller than the comparison size
* Metadata propagation
* more than two blocks as long as there is still a single backedge?
* ???

Reviewers: reames, fhahn, mkazantsev, chandlerc, craig.topper, courbet

Reviewed By: courbet

Subscribers: hiraditya, xbolva00, nikic, jfb, gchatelet, courbet, llvm-commits, mclow.lists

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D61144

llvm-svn: 370454
2019-08-30 09:51:23 +00:00
Fangrui Song 6964027315 [LoopFusion] Fix another -Wunused-function in -DLLVM_ENABLE_ASSERTIONS=off build
llvm-svn: 370156
2019-08-28 03:12:40 +00:00
Philip Reames 2694522f13 [Loads/SROA] Remove blatantly incorrect code and fix a bug revealed in the process
The code we had isSafeToLoadUnconditionally was blatantly wrong. This function takes a "Size" argument which is supposed to describe the span loaded from. Instead, the code use the size of the pointer passed (which may be unrelated!) and only checks that span. For any Size > LoadSize, this can and does lead to miscompiles.

Worse, the generic code just a few lines above correctly handles the cases which *are* valid. So, let's delete said code.

Removing this code revealed two issues:
1) As noted by jdoerfert the removed code incorrectly handled external globals.  The test update in SROA is to stop testing incorrect behavior.
2) SROA was confusing bytes and bits, but this wasn't obvious as the Size parameter was being essentially ignored anyway.  Fixed.

Differential Revision: https://reviews.llvm.org/D66778

llvm-svn: 370102
2019-08-27 19:34:43 +00:00
Fangrui Song eb70ac0249 [LoopFusion] Fix -Wunused-function in -DLLVM_ENABLE_ASSERTIONS=off build
llvm-svn: 369836
2019-08-24 02:50:42 +00:00
Benjamin Kramer dc5f805d31 Do a sweep of symbol internalization. NFC.
llvm-svn: 369803
2019-08-23 19:59:23 +00:00
Cameron McInally 688f3bc240 [Reassoc] Small fix to support unary FNeg in NegateValue(...)
Differential Revision: https://reviews.llvm.org/D66612

llvm-svn: 369772
2019-08-23 15:49:38 +00:00
Philip Reames 2a52583d67 [IndVars] Fix a bug noticed by inspection
We were computing the loop exit value, but not ensuring the addrec belonged to the loop whose exit value we were computing.  I couldn't actually trip this; the test case shows the basic setup which *might* trip this, but none of the variations I've tried actually do.

llvm-svn: 369730
2019-08-23 04:03:23 +00:00
Fangrui Song 3fc933af8b [AlignmentFromAssumptions] getNewAlignmentDiff(): use getURemExpr()
The alignment is calculated incorrectly, thus sometimes it doesn't generate aligned mov instructions, as shown by the example below:

```
// b.cc
typedef long long index;

extern "C" index g_tid;
extern "C" index g_num;

void add3(float* __restrict__ a, float* __restrict__ b, float* __restrict__ c) {
    index n = 64*1024;
    index m = 16*1024;
    index k = 4*1024;
    index tid = g_tid;
    index num = g_num;
    __builtin_assume_aligned(a, 32);
    __builtin_assume_aligned(b, 32);
    __builtin_assume_aligned(c, 32);
    for (index i0=tid*k; i0<m; i0+=num*k)
        for (index i1=0; i1<n*m; i1+=m)
            for (index i2=0; i2<k; i2++)
                c[i1+i0+i2] = b[i0+i2] + a[i1+i0+i2];
}
```

Compile with `clang b.cc -Ofast -march=skylake -mavx2 -S`

```
vmovaps -224(%rdi,%rbx,4), %ymm0
vmovups -192(%rdi,%rbx,4), %ymm1         # should be movaps
vmovups -160(%rdi,%rbx,4), %ymm2         # should be movaps
vmovups -128(%rdi,%rbx,4), %ymm3         # should be movaps
vaddps  -224(%rsi,%rbx,4), %ymm0, %ymm0
vaddps  -192(%rsi,%rbx,4), %ymm1, %ymm1
vaddps  -160(%rsi,%rbx,4), %ymm2, %ymm2
vaddps  -128(%rsi,%rbx,4), %ymm3, %ymm3
vmovaps %ymm0, -224(%rdx,%rbx,4)
vmovups %ymm1, -192(%rdx,%rbx,4)         # should be movaps
vmovups %ymm2, -160(%rdx,%rbx,4)         # should be movaps
vmovups %ymm3, -128(%rdx,%rbx,4)         # should be movaps
```

Differential Revision: https://reviews.llvm.org/D66575
Patch by Dun Liang

llvm-svn: 369723
2019-08-23 02:17:04 +00:00
Florian Hahn b5e52bfd83 [GVN] Do PHI translations across all edges between the load and the unavailable pred.
Currently we do not properly translate addresses with PHIs if LoadBB !=
LI->getParent(), because PHITranslateAddr expects a direct predecessor as argument,
because it considers all instructions outside of the current block to
not requiring translation.

The amount of cases that trigger this should be very low, as most single
predecessor blocks should be folded into their predecessor by GVN before
we actually start with value numbering. It is still not guaranteed to
happen, so we should do PHI translation along all edges between the
loads' block and the predecessor where we have to place a load.

There are a few test cases showing current limits of the PHI translation, which
could be improved later.

Reviewers: spatel, reames, efriedma, john.brawn

Reviewed By: efriedma

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65020

llvm-svn: 369570
2019-08-21 20:06:50 +00:00
Evgeniy Stepanov 55ccd16354 Refactor isPointerOffset (NFC).
Summary:
Simplify the API using Optional<> and address comments in
         https://reviews.llvm.org/D66165

Reviewers: vitalybuka

Subscribers: hiraditya, llvm-commits, ostannard, pcc

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66317

llvm-svn: 369300
2019-08-19 21:08:04 +00:00
Alina Sbirlea 1a3fdaf6a6 [MemorySSA] Rename uses when inserting memory uses.
Summary:
When inserting uses from outside the MemorySSA creation, we don't
normally need to rename uses, based on the assumption that there will be
no inserted Phis (if  Def existed that required a Phi, that Phi already
exists). However, when dealing with unreachable blocks, MemorySSA will
optimize away Phis whose incoming blocks are unreachable, and these Phis end
up being re-added when inserting a Use.
There are two potential solutions here:
1. Analyze the inserted Phis and clean them up if they are unneeded
(current method for cleaning up trivial phis does not cover this)
2. Leave the Phi in place and rename uses, the same way as whe inserting
defs.
This patch use approach 2.

Resolves first test in PR42940.

Reviewers: george.burgess.iv

Subscribers: Prazek, sanjoy.google, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66033

llvm-svn: 369291
2019-08-19 18:57:40 +00:00
Alina Sbirlea f92109dc01 [MemorySSA] Loop passes should mark MSSA preserved when available.
This patch applies only to the new pass manager.
Currently, when MSSA Analysis is available, and pass to each loop pass, it will be preserved by that loop pass.
Hence, mark the analysis preserved based on that condition, vs the current `EnableMSSALoopDependency`. This leaves the global flag to affect only the entry point in the loop pass manager (in FunctionToLoopPassAdaptor).

llvm-svn: 369181
2019-08-17 01:02:12 +00:00
Evgeniy Stepanov 75344955fc Move isPointerOffset function to ValueTracking (NFC).
Summary: To be reused in MemTag sanitizer.

Reviewers: pcc, vitalybuka, ostannard

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66165

llvm-svn: 369062
2019-08-15 22:58:28 +00:00
Jonas Devlieghere 0eaee545ee [llvm] Migrate llvm::make_unique to std::make_unique
Now that we've moved to C++14, we no longer need the llvm::make_unique
implementation from STLExtras.h. This patch is a mechanical replacement
of (hopefully) all the llvm::make_unique instances across the monorepo.

llvm-svn: 369013
2019-08-15 15:54:37 +00:00
Philip Reames 7b0515176b [SCEV] Rename getMaxBackedgeTakenCount to getConstantMaxBackedgeTakenCount [NFC]
llvm-svn: 368930
2019-08-14 21:58:13 +00:00
Philip Reames 6cca3ad43e [RLEV] Rewrite loop exit values for multiple exit loops w/o overall loop exit count
We already supported rewriting loop exit values for multiple exit loops, but if any of the loop exits were not computable, we gave up on all loop exit values. This patch generalizes the existing code to handle individual computable loop exits where possible.

As discussed in the review, this is a starting point for figuring out a better API.  The code is a bit ugly, but getting it in lets us test as we go.  

Differential Revision: https://reviews.llvm.org/D65544

llvm-svn: 368898
2019-08-14 18:27:57 +00:00
Matt Arsenault dbc1f207fa InferAddressSpaces: Move target intrinsic handling to TTI
I'm planning on handling intrinsics that will benefit from checking
the address space enums. Don't bother moving the address collection
for now, since those won't need th enums.

llvm-svn: 368895
2019-08-14 18:13:00 +00:00
Matt Arsenault 0eac2a2963 InferAddressSpaces: Remove unnecessary check for ConstantInt
The IR is invalid if this isn't a constant since immarg was added.

llvm-svn: 368893
2019-08-14 18:01:42 +00:00
Bill Wendling cc2bebe039 Ignore indirect branches from callbr.
Summary:
We can't speculate around indirect branches: indirectbr and invoke. The
callbr instruction needs to be included here.

Reviewers: nickdesaulniers, manojgupta, chandlerc

Reviewed By: chandlerc

Subscribers: llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D66200

llvm-svn: 368873
2019-08-14 16:44:07 +00:00
David L. Jones d4edd9d97e Revert '[LICM] Make Loop ICM profile aware' and 'Fix pass dependency for LICM'
This reverts r368526 (git commit 7e71aa24bc)
This reverts r368542 (git commit cb5a90fd31)

llvm-svn: 368800
2019-08-14 04:50:33 +00:00
Wenlei He cb5a90fd31 Fix pass dependency for LICM
Expected to address buildbot failure http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/16285 caused by D65060.

llvm-svn: 368542
2019-08-11 22:54:05 +00:00
Wenlei He 7e71aa24bc [LICM] Make Loop ICM profile aware
Summary:
Hoisting/sinking instruction out of a loop isn't always beneficial. Hoisting an instruction from a cold block inside a loop body out of the loop could hurt performance. This change makes Loop ICM profile aware - it now checks block frequency to make sure hoisting/sinking anly moves instruction to colder block.

Test Plan:

ninja check

Reviewers: asbirlea, sanjoy, reames, nikic, hfinkel, vsk

Reviewed By: asbirlea

Subscribers: fhahn, vsk, davidxl, xbolva00, hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65060

llvm-svn: 368526
2019-08-11 06:05:35 +00:00
Sanjay Patel 21c15ef384 [Reassociate] try harder to convert negative FP constants to positive
This is an extension of a transform that tries to produce positive floating-point
constants to improve canonicalization (and hopefully lead to more reassociation
and CSE).

The original patches were:
D4904
D5363 (rL221721)

But as the test diffs show, these were limited to basic patterns by walking from
an instruction to its single user rather than recursively moving up the def-use
sequence. No fast-math is required here because we're only rearranging implicit
FP negations in intermediate ops.

A motivating bug is:
https://bugs.llvm.org/show_bug.cgi?id=32939

Differential Revision: https://reviews.llvm.org/D65954

llvm-svn: 368512
2019-08-10 13:17:54 +00:00
Bjorn Pettersson d218a3326e [InstSimplify] Report "Changed" also when only deleting dead instructions
Summary:
Make sure that we report that changes has been made
by InstSimplify also in situations when only trivially
dead instructions has been removed. If for example a call
is removed the call graph must be updated.

Bug seem to have been introduced by llvm-svn r367173
(commit 02b9e45a7e), since the code in question
was rewritten in that commit.

Reviewers: spatel, chandlerc, foad

Reviewed By: spatel

Subscribers: hiraditya, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D65973

llvm-svn: 368401
2019-08-09 07:08:25 +00:00
Cameron McInally 8416f20f2f [LICM] Support unary FNeg in LICM
Differential Revision: https://reviews.llvm.org/D65908

llvm-svn: 368350
2019-08-08 21:38:31 +00:00
Tim Corringham 4f64f1ba3c Add llvm.licm.disable metadata
For some targets the LICM pass can result in sub-optimal code in some
cases where it would be better not to run the pass, but it isn't
always possible to suppress the transformations heuristically.

Where the front-end has insight into such cases it is beneficial
to attach loop metadata to disable the pass - this change adds the
llvm.licm.disable metadata to enable that.

Differential Revision: https://reviews.llvm.org/D64557

llvm-svn: 368296
2019-08-08 13:46:17 +00:00
Cameron McInally 303b6dbfb4 [EarlyCSE] Add support for unary FNeg to EarlyCSE
Differential Revision: https://reviews.llvm.org/D65815

llvm-svn: 368171
2019-08-07 14:34:41 +00:00