Commit Graph

94 Commits

Author SHA1 Message Date
Philip Reames 47afaf2eb0 [llvm-exegesis] Initialize all supported targets
Enable registration of multiple exegesis targets at once. Use idiomatic approach to defining target select macros, but leave code in the llvm-mca sub-directories for now.

This is a stepping stone towards allowing llvm-exegesis benchmarking via simulator or testing in non-target dependent tests.

Differential Revision: https://reviews.llvm.org/D133605
2022-09-22 09:02:22 -07:00
Clement Courbet e52f8406e8 Re-land "[llvm-exegesis] Support analyzing results from a different target."
With Mips fixes.

This reverts commit 7daf60e344.
2022-09-22 11:39:52 +02:00
Clement Courbet 7daf60e344 Revert "[llvm-exegesis] Support analyzing results from a different target."
Breaks MIPS compile.

This reverts commit cc61c822e0.
2022-09-22 11:19:01 +02:00
Clement Courbet cc61c822e0 [llvm-exegesis] Support analyzing results from a different target.
We were using the native triple to parse the benchmarks. Use the triple
from the benchmarks file.

Right now this still only allows analyzing files produced by the current
target until D133605 is in.

This also makes the `Analysis` class much less ad-hoc.

Differential Revision: https://reviews.llvm.org/D133697
2022-09-22 11:11:18 +02:00
Clement Courbet 7053e863a1 [llvm-exegesis][NFC] Use factory function for LlvmState.
This allows failing more gracefully.
2022-09-12 14:19:33 +02:00
Philip Reames 4d50a39240 [llvm-exegesis] Cross compile all enabled targets
llvm-exegesis is rather odd in the LLVM ecosystem in code is selectively compiled based on the native machine. LLVM is cross compiler by default, so this stands out as odd. It's also less then helpful when working on code for a target other than your native dev environment.

This change only changes the build setup. A later change will enable -march support to allow actual benchmarking under e.g. simulators in a cross compilation environment.

Differential Revision: https://reviews.llvm.org/D133150
2022-09-09 09:04:55 -07:00
Reid Kleckner 89b57061f7 Move TargetRegistry.(h|cpp) from Support to MC
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.

This allows us to ensure that Support doesn't have includes from MC/*.

Differential Revision: https://reviews.llvm.org/D111454
2021-10-08 14:51:48 -07:00
Roman Lebedev e030f808ec
[Exegesis] Native clusterization: sub-partition by sched class id
Currently native clusterization simply groups all benchmarks
by the opcode of key instruction, but that is suboptimal in certain cases,
e.g. where we can already tell that the particular instructions
already resolve into different sched classes.
2021-09-07 17:54:37 +03:00
Roman Lebedev 78eaff2ef8
[llvm-exegesis] Loop unrolling for loop snippet repetitor mode
I really needed this, like, factually, yesterday,
when verifying dependency breaking idioms for AMD Zen 3 scheduler model.

Consider the following example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error:           ''
info:            ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...

```
What does it tell us?
So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle?
That doesn't seem right. That's even less than there are pipes supporting this type of op.

Now, second example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
Now that's just worse. Due to the looping, the throughput completely plummeted,
and now we can only do a single instruction/cycle!?

That's not great.
And final example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'VPXORYrr YMM0 YMM0 YMM0'
  config:          ''
  register_initial_values: []
cpu_name:        znver3
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
  - { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error:           ''
info:            ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```

So if we merge the previous two approaches, do duplicate this single-instruction snippet 1000x
(loop-body-size/instruction count in snippet), and run a loop with 1000 iterations
over that duplicated/unrolled snippet, the measured throughput goes through the roof,
up to 5.9 instructions/cycle, which finally tells us that this idiom is zero-cycle!

Reviewed By: courbet

Differential Revision: https://reviews.llvm.org/D102522
2021-05-25 12:08:27 +03:00
Nico Weber ba7a92c01e [Support] Don't include VirtualFileSystem.h in CommandLine.h
CommandLine.h is indirectly included in ~50% of TUs when building
clang, and VirtualFileSystem.h is large.

(Already remarked by jhenderson on D70769.)

No behavior change.

Differential Revision: https://reviews.llvm.org/D100957
2021-04-21 10:19:01 -04:00
Clement Courbet 9e9f991ac0 [llvm-exegesis] Honor -mcpu in analysis mode.
This is useful to set the baseline model for an unknown CPU.

Fixes PR50013.

Differential Revision: https://reviews.llvm.org/D100743
2021-04-19 10:44:28 +02:00
Qiu Chaofan 9d2f06445f [llvm-exegesis] Ignore instructions using custom inserter
Some instructions defined in table-gen files sets usesCustomInserter
bit, which means it has to be lowered by target code and isn't actually
valid instruction at MC level. So we should treat them like pseudo
instructions.

Reviewed By: gchatelet

Differential Revision: https://reviews.llvm.org/D94898
2021-02-19 17:04:27 +08:00
Ella Ma 1756d67934 [llvm][clang][mlir] Add checks for the return values from Target::createXXX to prevent protential null deref
All these potential null pointer dereferences are reported by my static analyzer for null smart pointer dereferences, which has a different implementation from `alpha.cplusplus.SmartPtr`.

The checked pointers in this patch are initialized by Target::createXXX functions. When the creator function pointer is not correctly set, a null pointer will be returned, or the creator function may originally return a null pointer.

Some of them may not make sense as they may be checked before entering the function, but I fixed them all in this patch. I submit this fix because 1) similar checks are found in some other places in the LLVM codebase for the same return value of the function; and, 2) some of the pointers are dereferenced before they are checked, which may definitely trigger a null pointer dereference if the return value is nullptr.

Reviewed By: tejohnson, MaskRay, jpienaar

Differential Revision: https://reviews.llvm.org/D91410
2020-11-21 21:04:12 -08:00
Vy Nguyen cb3fd715f3 Reland rG4fcd1a8e6528:[llvm-exegesis] Add option to check the hardware support for a given feature before benchmarking.
This is mostly for the benefit of the LBR latency mode.
Right now, it performs no checking. If this is run on non-supported hardware, it will produce all zeroes for latency.

      Differential Revision: https://reviews.llvm.org/D85254

New change: Updated lit.local.cfg to use pass the right argument to llvm-exegesis to actually request the LBR mode.

Differential Revision: https://reviews.llvm.org/D88670
2020-10-01 12:21:16 -04:00
Michael Liao 2c9dc7bbbf Revert "[llvm-exegesis] Add option to check the hardware support for a given feature before benchmarking."
This reverts commit 4fcd1a8e65 as
`llvm/test/tools/llvm-exegesis/X86/lbr/mov-add.s` failed on hosts
without LBR supported if the build has LIBPFM enabled. On that host,
`perf_event_open` fails with `EOPNOTSUPP` on LBR config. That change's
basic assumption

> If this is run on a non-supported hardware, it will produce all zeroes for latency.

could not stand as `perf_event_open` system call will fail if the
underlying hardware really don't have LBR supported.
2020-09-30 23:15:35 -04:00
Vy Nguyen 4fcd1a8e65 [llvm-exegesis] Add option to check the hardware support for a given feature before benchmarking.
This is mostly for the benefit of the LBR latency mode.
Right now, it performs no checking. If this is run on non-supported hardware, it will produce all zeroes for latency.

Differential Revision: https://reviews.llvm.org/D85254
2020-09-30 12:25:59 -04:00
Jinsong Ji 29ec5901c9 [llvm-exegesis] Add whitespace between words in error message 2020-09-24 18:20:57 +00:00
Vy Nguyen ee7caa7593 Reland [llvm-exegesis] Add benchmark latency option on X86 that uses LBR for more precise measurements.
Starting with Skylake, the LBR contains the precise number of cycles between the two
        consecutive branches.
        Making use of this will hopefully make the measurements more precise than the
        existing methods of using RDTSC.

                Differential Revision: https://reviews.llvm.org/D77422

New change: check for existence of field `cycles` in perf_branch_entry before enabling this mode.
This should prevent compilation errors when building for older kernel whose headers don't support it.
2020-07-27 12:38:05 -04:00
Clement Courbet 6bddd099ac Revert "[llvm-exegesis] Add benchmark latency option on X86 that uses LBR for more precise measurements."
From @erichkeane:
```
This patch doesn't seem to build for me:
/iusers/ekeane1/workspaces/llvm-project/llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp: In function ‘llvm::Error llvm::exegesis::parseDataBuffer(const char*, size_t, const void*, const void*, llvm::SmallVector<long int, 4>*)’:
/iusers/ekeane1/workspaces/llvm-project/llvm/tools/llvm-exegesis/lib/X86/X86Counter.cpp:99:37: error: ‘struct perf_branch_entry’ has no member named ‘cycles’

CycleArray->push_back(Entry.cycles);
I'm on RHEL7, so I have kernel 3.10, so it doesn't have 'cycles'.

According ot this: https://elixir.bootlin.com/linux/v4.3/source/include/uapi/linux/perf_event.h#L963 kernel 4.3 is the first time that 'cycles' appeared in this structure.
```
2020-07-17 16:55:17 +02:00
Vy Nguyen 1360e140cc [llvm-exegesis] Add benchmark latency option on X86 that uses LBR for more precise measurements.
Starting with Skylake, the LBR contains the precise number of cycles between the two
    consecutive branches.
    Making use of this will hopefully make the measurements more precise than the
    existing methods of using RDTSC.

            Differential Revision: https://reviews.llvm.org/D77422
2020-07-16 12:12:46 -04:00
Vy Nguyen e086a39c11 [llvm-exegesis] Let Counter returns up to 16 entries
LBR contains (up to) 16 entries for last x branches and the X86LBRCounter (from D77422) should be able to return all those.
    Currently, it just returns the latest entry, which could lead to mis-leading measurements.
    This patch aslo changes the LatencyBenchmarkRunner to accommodate multi-value readings.

         https://reviews.llvm.org/D81050
2020-06-26 10:57:20 -04:00
Roman Lebedev de22d7154b
[llvm-exegesis] 'Min' repetition mode
Summary:
As noted in documentation, different repetition modes have different trade-offs:

> .. option:: -repetition-mode=[duplicate|loop]
>
>  Specify the repetition mode. `duplicate` will create a large, straight line
>  basic block with `num-repetitions` copies of the snippet. `loop` will wrap
>  the snippet in a loop which will be run `num-repetitions` times. The `loop`
>  mode tends to better hide the effects of the CPU frontend on architectures
>  that cache decoded instructions, but consumes a register for counting
>  iterations.

Indeed. Example:

>>! In D74156#1873657, @lebedev.ri wrote:
> At least for `CMOV`, i'm seeing wildly different results
> |           | Latency | RThroughput |
> | duplicate | 1       | 0.8         |
> | loop      | 2       | 0.6         |
> where latency=1 seems correct, and i'd expect the througput to be close to 1/2 (since there are two execution units).

This isn't great for analysis, at least for schedule model development.

As discussed in excruciating detail in

>>! In D74156#1924514, @gchatelet wrote:
>>>! In D74156#1920632, @lebedev.ri wrote:
>> ... did that explanation of the question i'm having made any sense?
>
> Thx for digging in the conversation !
> Ok it makes more sense now.
>
> I discussed it a bit with @courbet:
>  - We want the analysis tool to stay simple so we'd rather not make it knowledgeable of the repetition mode.
>  - We'd like to still be able to select either repetition mode to dig into special cases
>
> So we could add a third `min` repetition mode that would run both and take the minimum. It could be the default option.
> Would you have some time to look what it would take to add this third mode?

there appears to be an agreement that it is indeed sub-par,
and that we should provide an optional, measurement (not analysis!) -time
way to rectify the situation.

However, the solutions isn't entirely straight-forward.

We can just add an actual 'multiplexer' `MinSnippetRepetitor`, because
if we just concatenate snippets produced by `DuplicateSnippetRepetitor`
and `LoopSnippetRepetitor` and run+measure that, the measurement will
naturally be different from what we'd get by running+measuring
them separately and taking the min.
([[ https://www.wolframalpha.com/input/?i=%28x%2By%29%2F2+%21%3D+min%28x%2C+y%29 | `time(D+L)/2 != min(time(D), time(L))` ]])

Also, it seems best to me to have a single snippet instead of generating
a snippet per repetition mode, since the only difference here is that the
loop repetition mode reserves one register for loop counter.

As far as i can tell, we can either teach `BenchmarkRunner::runConfiguration()`
to produce a single report given multiple repetitors (as in the patch),
or do that one layer higher - don't modify `BenchmarkRunner::runConfiguration()`,
produce multiple reports, don't actually print each one, but aggregate them somehow
and only print the final one.

Initially i've gone ahead with the latter approach, but it didn't look like a natural fit;
the former (as in the diff) does seem like a better fit to me.

There's also a question of the test coverage. It sure currently does work here:
```
$ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-8fb949.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'CMOV64rr RAX RAX R11 i_0x0'
    - 'CMOV64rr RBP RBP R15 i_0x0'
    - 'CMOV64rr RBX RBX RBX i_0x0'
    - 'CMOV64rr RCX RCX RBX i_0x0'
    - 'CMOV64rr RDI RDI R10 i_0x0'
    - 'CMOV64rr RDX RDX RAX i_0x0'
    - 'CMOV64rr RSI RSI RAX i_0x0'
    - 'CMOV64rr R8 R8 R8 i_0x0'
    - 'CMOV64rr R9 R9 RDX i_0x0'
    - 'CMOV64rr R10 R10 RBX i_0x0'
    - 'CMOV64rr R11 R11 R14 i_0x0'
    - 'CMOV64rr R12 R12 R9 i_0x0'
    - 'CMOV64rr R13 R13 R12 i_0x0'
    - 'CMOV64rr R14 R14 R15 i_0x0'
    - 'CMOV64rr R15 R15 R13 i_0x0'
  config:          ''
  register_initial_values:
    - 'RAX=0x0'
    - 'R11=0x0'
    - 'EFLAGS=0x0'
    - 'RBP=0x0'
    - 'R15=0x0'
    - 'RBX=0x0'
    - 'RCX=0x0'
    - 'RDI=0x0'
    - 'R10=0x0'
    - 'RDX=0x0'
    - 'RSI=0x0'
    - 'R8=0x0'
    - 'R9=0x0'
    - 'R14=0x0'
    - 'R12=0x0'
    - 'R13=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: inverse_throughput, value: 0.819, per_snippet_value: 12.285 }
error:           ''
info:            instruction has tied variables, using static renaming.
assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BF000000000000000048BB000000000000000048B9000000000000000048BF000000000000000049BA000000000000000048BA000000000000000048BE000000000000000049B8000000000000000049B9000000000000000049BE000000000000000049BC000000000000000049BD0000000000000000490F40C3490F40EF480F40DB480F40CB490F40FA480F40D0480F40F04D0F40C04C0F40CA4C0F40D34D0F40DE4D0F40E14D0F40EC4D0F40F74D0F40FD490F40C35B415C415D415E415F5DC3
...
$ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-051eb3.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'CMOV64rr RAX RAX R11 i_0x0'
    - 'CMOV64rr RBP RBP RSI i_0x0'
    - 'CMOV64rr RBX RBX R9 i_0x0'
    - 'CMOV64rr RCX RCX RSI i_0x0'
    - 'CMOV64rr RDI RDI RBP i_0x0'
    - 'CMOV64rr RDX RDX R9 i_0x0'
    - 'CMOV64rr RSI RSI RDI i_0x0'
    - 'CMOV64rr R9 R9 R12 i_0x0'
    - 'CMOV64rr R10 R10 R11 i_0x0'
    - 'CMOV64rr R11 R11 R9 i_0x0'
    - 'CMOV64rr R12 R12 RBP i_0x0'
    - 'CMOV64rr R13 R13 RSI i_0x0'
    - 'CMOV64rr R14 R14 R14 i_0x0'
    - 'CMOV64rr R15 R15 R10 i_0x0'
  config:          ''
  register_initial_values:
    - 'RAX=0x0'
    - 'R11=0x0'
    - 'EFLAGS=0x0'
    - 'RBP=0x0'
    - 'RSI=0x0'
    - 'RBX=0x0'
    - 'R9=0x0'
    - 'RCX=0x0'
    - 'RDI=0x0'
    - 'RDX=0x0'
    - 'R12=0x0'
    - 'R10=0x0'
    - 'R13=0x0'
    - 'R14=0x0'
    - 'R15=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: inverse_throughput, value: 0.6083, per_snippet_value: 8.5162 }
error:           ''
info:            instruction has tied variables, using static renaming.
assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000048BE000000000000000048BB000000000000000049B9000000000000000048B9000000000000000048BF000000000000000048BA000000000000000049BC000000000000000049BA000000000000000049BD000000000000000049BE000000000000000049BF000000000000000049B80200000000000000490F40C3480F40EE490F40D9480F40CE480F40FD490F40D1480F40F74D0F40CC4D0F40D34D0F40D94C0F40E54C0F40EE4D0F40F64D0F40FA4983C0FF75C25B415C415D415E415F5DC3
...
$ ./bin/llvm-exegesis --opcode-name=CMOV64rr --mode=inverse_throughput --repetition-mode=min
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c7a47d.o
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2581f1.o
---
mode:            inverse_throughput
key:
  instructions:
    - 'CMOV64rr RAX RAX R11 i_0x0'
    - 'CMOV64rr RBP RBP R10 i_0x0'
    - 'CMOV64rr RBX RBX R10 i_0x0'
    - 'CMOV64rr RCX RCX RDX i_0x0'
    - 'CMOV64rr RDI RDI RAX i_0x0'
    - 'CMOV64rr RDX RDX R9 i_0x0'
    - 'CMOV64rr RSI RSI RAX i_0x0'
    - 'CMOV64rr R9 R9 RBX i_0x0'
    - 'CMOV64rr R10 R10 R12 i_0x0'
    - 'CMOV64rr R11 R11 RDI i_0x0'
    - 'CMOV64rr R12 R12 RDI i_0x0'
    - 'CMOV64rr R13 R13 RDI i_0x0'
    - 'CMOV64rr R14 R14 R9 i_0x0'
    - 'CMOV64rr R15 R15 RBP i_0x0'
  config:          ''
  register_initial_values:
    - 'RAX=0x0'
    - 'R11=0x0'
    - 'EFLAGS=0x0'
    - 'RBP=0x0'
    - 'R10=0x0'
    - 'RBX=0x0'
    - 'RCX=0x0'
    - 'RDX=0x0'
    - 'RDI=0x0'
    - 'R9=0x0'
    - 'RSI=0x0'
    - 'R12=0x0'
    - 'R13=0x0'
    - 'R14=0x0'
    - 'R15=0x0'
cpu_name:        bdver2
llvm_triple:     x86_64-unknown-linux-gnu
num_repetitions: 10000
measurements:
  - { key: inverse_throughput, value: 0.6073, per_snippet_value: 8.5022 }
error:           ''
info:            instruction has tied variables, using static renaming.
assembled_snippet: 5541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BA000000000000000048BB000000000000000048B9000000000000000048BA000000000000000048BF000000000000000049B9000000000000000048BE000000000000000049BC000000000000000049BD000000000000000049BE000000000000000049BF0000000000000000490F40C3490F40EA490F40DA480F40CA480F40F8490F40D1480F40F04C0F40CB4D0F40D44C0F40DF4C0F40E74C0F40EF4D0F40F14C0F40FD490F40C3490F40EA5B415C415D415E415F5DC35541574156415541545348B8000000000000000049BB00000000000000004883EC08C7042400000000C7442404000000009D48BD000000000000000049BA000000000000000048BB000000000000000048B9000000000000000048BA000000000000000048BF000000000000000049B9000000000000000048BE000000000000000049BC000000000000000049BD000000000000000049BE000000000000000049BF000000000000000049B80200000000000000490F40C3490F40EA490F40DA480F40CA480F40F8490F40D1480F40F04C0F40CB4D0F40D44C0F40DF4C0F40E74C0F40EF4D0F40F14C0F40FD4983C0FF75C25B415C415D415E415F5DC3
...
```
but i open to suggestions as to how test that.

I also have gone with the suggestion to default to this new mode.
This was irking me for some time, so i'm happy to finally see progress here.
Looking forward to feedback.

Reviewers: courbet, gchatelet

Reviewed By: courbet, gchatelet

Subscribers: mstojanovic, RKSimon, llvm-commits, courbet, gchatelet

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D76921
2020-04-02 09:28:35 +03:00
Roman Lebedev cc5549dbc2
[NFC][llvm-exegesis] Docs/help: opcode-index=-1 means measure everything 2020-02-13 12:46:12 +03:00
Roman Lebedev 6030fe01f4
[llvm-exegesis] Exploring X86::OperandType::OPERAND_COND_CODE
Summary:
Currently, we only have nice exploration for LEA instruction,
while for the rest, we rely on `randomizeUnsetVariables()`
to sometimes generate something interesting.
While that works, it isn't very reliable in coverage :)

Here, i'm making an assumption that while we may want to explore
multi-instruction configs, we are most interested in the
characteristics of the main instruction we were asked about.

Which we can do, by taking the existing `randomizeMCOperand()`,
and turning it on it's head - instead of relying on it to randomly fill
one of the interesting values, let's pregenerate all the possible interesting
values for the variable, and then generate as much `InstructionTemplate`
combinations of these possible values for variables as needed/possible.

Of course, that requires invasive changes to no longer pass just the
naked `Instruction`, but sometimes partially filled `InstructionTemplate`.

As it can be seen from the test, this allows us to explore
`X86::OperandType::OPERAND_COND_CODE` for instructions
that take such an operand.
I'm hoping this will greatly simplify exploration.

Reviewers: courbet, gchatelet

Reviewed By: gchatelet

Subscribers: orodley, mgorny, sdardis, tschuett, jrtc27, atanasyan, mstojanovic, andreadb, RKSimon, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D74156
2020-02-12 21:33:52 +03:00
Miloš Stojanović 205292740d [llvm-exegesis] Improve error reporting in BenchmarkRunner.cpp
Followup to D74085.
Replace the use of `report_fatal_error()` with returning the error to
`llvm-exegesis.cpp` and handling it there.
To facilitate this, a new `Error` type has been added which is only used
to log errors to the yaml output.

Differential Revision: https://reviews.llvm.org/D74215
2020-02-07 16:29:52 +01:00
Miloš Stojanović 4bd40f71a7 Recommit: "[llvm-exegesis] Improve error reporting in Target.cpp"
Summary: Commit 141915963b was reverted in
abe01e17f6 because it broke builds testing
without libpfm. A preparatory commit <commit_sha1> was added to enable
this recommit.

Original commit message:

Followup to D74085.
Replace the use of `report_fatal_error()` with returning the error to
`llvm-exegesis.cpp` and handling it there.

Differential Revision: https://reviews.llvm.org/D74113
2020-02-07 14:34:58 +01:00
Miloš Stojanović 830af528a5 Recommit: "[llvm-exegesis] Improve error reporting"
Summary: Commit b3576f60eb was reverted in
abe01e17f6 because it broke builds testing
without libpfm. A preparatory commit <commit_sha1> was added to enable
this recommit.

Original commit message:

Fix inconsistencies in error reporting created by mixing
`report_fatal_error()` and `ExitOnErr()`, and add additional information
to the error message to make it more user friendly. Minimize the use
`report_fatal_error()` because it's meant for use in very rare cases and
it results in low information density of the error messages.

Summary of the new design:

 * For command line argument errors output `llvm-exegesis: <error_message>`,
   which is consistent with the error output format emitted by the backend
   which checks correctness of the command line arguments.
 * For other errors the format `llvm-exegesis error: <error_message>` is used.
 ** If the error occurred during file access `<error_message>` will have
    of two parts: `'<file_name>': <rest_of_the_error_message>`

Differential Revision: https://reviews.llvm.org/D74085
2020-02-07 14:34:58 +01:00
Miloš Stojanović 446268a223 [llvm-exegesis] Add a custom error for clustering
All errors of type `Failure` are `StringError`s. In order for exit code
mapping to detect that specifically a clustering error has occurred it
needs to have a different type.

This patch also prepares D74085 where termination `report_fatal_error()`
will be replaced with emitting `StringError`s.

Differential Revision: https://reviews.llvm.org/D74124
2020-02-07 14:34:57 +01:00
Hans Wennborg abe01e17f6 Revert "[llvm-exegesis] Improve error reporting" and follow-up.
It broke e.g. all tests under tools/llvm-exegesis/X86/ when libpfm is
not available, see comment on D74085.

This reverts commit b3576f60eb and
141915963b.
2020-02-06 12:53:16 +01:00
Miloš Stojanović 141915963b [llvm-exegesis] Improve error reporting in Target.cpp
Followup to D74085.
Replace the use of `report_fatal_error()` with returning the error to
`llvm-exegesis.cpp` and handling it there.

Differential Revision: https://reviews.llvm.org/D74113
2020-02-06 12:26:08 +01:00
Miloš Stojanović b3576f60eb [llvm-exegesis] Improve error reporting
Fix inconsistencies in error reporting created by mixing
`report_fatal_error()` and `ExitOnErr()`, and add additional information
to the error message to make it more user friendly. Minimize the use
`report_fatal_error()` because it's meant for use in very rare cases and
it results in low information density of the error messages.

Summary of the new design:

 * For command line argument errors output `llvm-exegesis: <error_message>`,
   which is consistent with the error output format emitted by the backend
   which checks correctness of the command line arguments.
 * For other errors the format `llvm-exegesis error: <error_message>` is used.
 ** If the error occurred during file access `<error_message>` will have
    of two parts: `'<file_name>': <rest_of_the_error_message>`

Differential Revision: https://reviews.llvm.org/D74085
2020-02-06 12:26:08 +01:00
Miloš Stojanović c7dc4734d2 [llvm-exegesis] Check counters before running
Check if the appropriate counters for the specified mode are defined on
the target. This is checked before any other work is done.

Differential Revision: https://reviews.llvm.org/D71927
2019-12-31 14:17:24 +01:00
Mark de Wever 536c9a604e [Tools] Fixes -Wrange-loop-analysis warnings
This avoids new warnings due to D68912 adds -Wrange-loop-analysis to -Wall.

Differential Revision: https://reviews.llvm.org/D71808
2019-12-22 19:11:17 +01:00
Guillaume Chatelet 32d384c020 [llvm-exegesis][NFC] internal changes
Summary:
BitVectors are now cached to lower memory utilization.
Instructions have reference semantics.

Reviewers: courbet

Subscribers: sdardis, tschuett, jrtc27, atanasyan, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D71653
2019-12-18 17:24:07 +01:00
Simon Pilgrim aedb528d43 llvm-exegesis - fix shadow variable warnings. NFCI. 2019-11-09 13:43:09 +00:00
Clement Courbet 50cdd56beb [llvm-exegesis][NFC] Remove extra `llvm::` qualifications.
Summary: Second patch: in the lib.

Reviewers: gchatelet

Subscribers: nemanjai, tschuett, MaskRay, mgrang, jsji, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68692

llvm-svn: 374158
2019-10-09 11:58:42 +00:00
Clement Courbet 2cd0f28959 [llvm-exegesis] Add options to SnippetGenerator.
Summary:
This adds a `-max-configs-per-opcode` option to limit the number of
configs per opcode.

Reviewers: gchatelet

Subscribers: tschuett, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68642

llvm-svn: 374054
2019-10-08 14:30:24 +00:00
Clement Courbet 03a3d29541 [llvm-exegesis][NFC] Move BenchmarkFailure to own file.
Summary: And rename to exegesis::Failure, as it's used everytwhere.

Reviewers: gchatelet

Subscribers: tschuett, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68217

llvm-svn: 373209
2019-09-30 13:53:50 +00:00
Clement Courbet 3e13816be2 [llvm-exegesis][NFC] Refactor snippet file reading out of tool main.
Summary: Add unit tests.

Reviewers: gchatelet

Subscribers: mgorny, tschuett, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68212

llvm-svn: 373202
2019-09-30 12:50:25 +00:00
Clement Courbet 9431b72ce9 [llvm-exegesis] Add loop mode for repeating the snippet.
Summary:
Before this change the Executable function was made by duplicating the
snippet. This change adds a --repetion-mode={loop|duplicate} flag that
allows choosing between this behaviour and wrapping the snippet instructions
in a loop.

The new mode can help measurements when the snippet fits in the DSB by
short-cirtcuiting decoding. The loop adds a dec + jmp to the measurements, but
since these are not part of the critical path, they execute in parallel
with the measured code and do not impact measurements in practice.

Overview of the change:
 - New SnippetRepetitor abstraction that handles repeating the snippet.
   The assembler delegates repeating the instructions to this class.
 - ExegesisTarget learns how to decrement loop counter and jump.
 - Some refactoring of the assembler into FunctionFiller/BasicBlockFiller.

Reviewers: gchatelet

Subscribers: mgorny, tschuett, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D68125

llvm-svn: 373083
2019-09-27 12:56:24 +00:00
Sylvestre Ledru 375297f38f fix a typo unavaliable=>unavailable
llvm-svn: 362878
2019-06-08 15:07:55 +00:00
Clement Courbet b9274f2694 [llvm-exegesis] Move native target initialization code to a separate file.
Summary: This helps building internal tools on top of the library.

Reviewers: gchatelet

Subscribers: tschuett, llvm-commits, bdb, ondrasej

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D62239

llvm-svn: 361385
2019-05-22 13:50:16 +00:00
Roman Lebedev eb1a156d7f [llvm-exegesis] benchmarkMain(): less cryptic error if built w/o libpfm
Wanted to check if inablility to measure latency of CMOV32rm
is a regression from D60041 / D60138, but unable to do that
because the llvm-exegesis-{8,9} from debian sid fails
with that cryptic, unhelpful error.

I suspect this will be a better error.

llvm-svn: 357900
2019-04-08 10:50:31 +00:00
Guillaume Chatelet 848df5b509 Add an option do not dump the generated object on disk
Reviewers: courbet

Subscribers: llvm-commits, bdb

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D60317

llvm-svn: 357769
2019-04-05 15:18:59 +00:00
Roman Lebedev c2423fe689 [llvm-exegesis] Introduce a 'naive' clustering algorithm (PR40880)
Summary:
This is an alternative to D59539.

Let's suppose we have measured 4 different opcodes, and got: `0.5`, `1.0`, `1.5`, `2.0`.
Let's suppose we are using `-analysis-clustering-epsilon=0.5`.
By default now we will start processing the `0.5` point, find that `1.0` is it's neighbor, add them to a new cluster.
Then we will notice that `1.5` is a neighbor of `1.0` and add it to that same cluster.
Then we will notice that `2.0` is a neighbor of `1.5` and add it to that same cluster.
So all these points ended up in the same cluster.
This may or may not be a correct implementation of dbscan clustering algorithm.

But this is rather horribly broken for the reasons of comparing the clusters with the LLVM sched data.
Let's suppose all those opcodes are currently in the same sched cluster.
If i specify `-analysis-inconsistency-epsilon=0.5`, then no matter
the LLVM values this cluster will **never** match the LLVM values,
and thus this cluster will **always** be displayed as inconsistent.

The solution is obviously to split off some of these opcodes into different sched cluster.
But how do i do that? Out of 4 opcodes displayed in the inconsistency report,
which ones are the "bad ones"? Which ones are the most different from the checked-in data?
I'd need to go in to the `.yaml` and look it up manually.

The trivial solution is to, when creating clusters, don't use the full dbscan algorithm,
but instead "pick some unclustered point, pick all unclustered points that are it's neighbor,
put them all into a new cluster, repeat". And just so as it happens, we can arrive
at that algorithm by not performing the "add neighbors of a neighbor to the cluster" step.

But that won't work well once we teach analyze mode to operate in on-1D mode
(i.e. on more than a single measurement type at a time), because the clustering would
depend on the order of the measurements.

Instead, let's just create a single cluster per opcode, and put all the points of that opcode into said cluster.
And simultaneously check that every point in that cluster is a neighbor of every other point in the cluster,
and if they are not, the cluster (==opcode) is unstable.

This is //yet another// step to bring me closer to being able to continue cleanup of bdver2 sched model..

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40880 | PR40880 ]].

Reviewers: courbet, gchatelet

Reviewed By: courbet

Subscribers: tschuett, jdoerfert, RKSimon, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D59820

llvm-svn: 357152
2019-03-28 08:55:01 +00:00
Roman Lebedev 23629385f1 [llvm-exegesis] Separate tool options into three categories.
Results in much nicer -help output:
```
$ ./bin/llvm-exegesis -help
USAGE: llvm-exegesis [options]

OPTIONS:

Color Options:

  -color                                         - Use colors in output (default=autodetect)

General options:

  -enable-cse-in-irtranslator                    - Should enable CSE in irtranslator
  -enable-cse-in-legalizer                       - Should enable CSE in Legalizer

Generic Options:

  -help                                          - Display available options (-help-hidden for more)
  -help-list                                     - Display list of available options (-help-list-hidden for more)
  -version                                       - Display the version of this program

llvm-exegesis analysis options:

  -analysis-clustering-epsilon=<number>          - dbscan epsilon for benchmark point clustering
  -analysis-clusters-output-file=<string>        -
  -analysis-display-unstable-clusters            - if there is more than one benchmark for an opcode, said benchmarks may end up not being clustered into the same cluster if the measured performance characteristics are different. by default all such opcodes are filtered out. this flag will instead show only such unstable opcodes
  -analysis-inconsistencies-output-file=<string> -
  -analysis-inconsistency-epsilon=<number>       - epsilon for detection of when the cluster is different from the LLVM schedule profile values
  -analysis-numpoints=<uint>                     - minimum number of points in an analysis cluster

llvm-exegesis benchmark options:

  -ignore-invalid-sched-class                    - ignore instructions that do not define a sched class
  -mode=<value>                                  - the mode to run
    =latency                                     -   Instruction Latency
    =inverse_throughput                          -   Instruction Inverse Throughput
    =uops                                        -   Uop Decomposition
    =analysis                                    -   Analysis
  -num-repetitions=<uint>                        - number of time to repeat the asm snippet
  -opcode-index=<int>                            - opcode to measure, by index
  -opcode-name=<string>                          - comma-separated list of opcodes to measure, by name
  -snippets-file=<string>                        - code snippets to measure

llvm-exegesis options:

  -benchmarks-file=<string>                      - File to read (analysis mode) or write (latency/uops/inverse_throughput modes) benchmark results. “-” uses stdin/stdout.
  -mcpu=<string>                                 - cpu name to use for pfm counters, leave empty to autodetect
```

llvm-svn: 356364
2019-03-18 11:32:37 +00:00
Roman Lebedev 542e5d7bb5 [llvm-exegesis] Split Epsilon param into two (PR40787)
Summary:
This eps param is used for two distinct things:
* initial point clusterization
* checking clusters against the llvm values

What if one wants to only look at highly different clusters, without changing
the clustering itself? In particular, this helps to weed out noisy measurements
(since the clusterization epsilon is still small, so there is a better chance
that noisy measurements from the same opcode will go into different clusters)

By splitting it into two params it is now possible.

This is nearly-free performance-wise:
Old:
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency-1.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 10099 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'
...
 Performance counter stats for './bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency-1.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html' (25 runs):

            390.01 msec task-clock                #    0.998 CPUs utilized            ( +-  0.25% )
                12      context-switches          #   31.735 M/sec                    ( +- 27.38% )
                 0      cpu-migrations            #    0.000 K/sec
              4745      page-faults               # 12183.732 M/sec                   ( +-  0.54% )
        1562711900      cycles                    # 4012303.327 GHz                   ( +-  0.24% )  (82.90%)
         185567822      stalled-cycles-frontend   #   11.87% frontend cycles idle     ( +-  0.52% )  (83.30%)
         392106234      stalled-cycles-backend    #   25.09% backend cycles idle      ( +-  1.31% )  (33.79%)
        1839236666      instructions              #    1.18  insn per cycle
                                                  #    0.21  stalled cycles per insn  ( +-  0.15% )  (50.37%)
         407035764      branches                  # 1045074878.710 M/sec              ( +-  0.12% )  (66.80%)
          10896459      branch-misses             #    2.68% of all branches          ( +-  0.17% )  (83.20%)

          0.390629 +- 0.000972 seconds time elapsed  ( +-  0.25% )
```
```
$ perf stat -r 9 ./bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency.yml -analysis-inconsistencies-output-file=/tmp/clusters-old.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 50572 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'
...
 Performance counter stats for './bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency.yml -analysis-inconsistencies-output-file=/tmp/clusters-old.html' (9 runs):

           6803.36 msec task-clock                #    0.999 CPUs utilized            ( +-  0.96% )
               262      context-switches          #   38.546 M/sec                    ( +- 23.06% )
                 0      cpu-migrations            #    0.065 M/sec                    ( +- 76.03% )
             13287      page-faults               # 1953.206 M/sec                    ( +-  0.32% )
       27252537904      cycles                    # 4006024.257 GHz                   ( +-  0.95% )  (83.31%)
        1496314935      stalled-cycles-frontend   #    5.49% frontend cycles idle     ( +-  0.97% )  (83.32%)
       16128404524      stalled-cycles-backend    #   59.18% backend cycles idle      ( +-  0.30% )  (33.37%)
       17611143370      instructions              #    0.65  insn per cycle
                                                  #    0.92  stalled cycles per insn  ( +-  0.05% )  (50.04%)
        3894906599      branches                  # 572537147.437 M/sec               ( +-  0.03% )  (66.69%)
         116314514      branch-misses             #    2.99% of all branches          ( +-  0.20% )  (83.35%)

            6.8118 +- 0.0689 seconds time elapsed  ( +-  1.01%)
```
New:
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency-1.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 10099 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new.html'
...
 Performance counter stats for './bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency-1.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new.html' (25 runs):

            400.14 msec task-clock                #    0.998 CPUs utilized            ( +-  0.66% )
                12      context-switches          #   29.429 M/sec                    ( +- 25.95% )
                 0      cpu-migrations            #    0.100 M/sec                    ( +-100.00% )
              4714      page-faults               # 11796.496 M/sec                   ( +-  0.55% )
        1603131306      cycles                    # 4011840.105 GHz                   ( +-  0.66% )  (82.85%)
         199538509      stalled-cycles-frontend   #   12.45% frontend cycles idle     ( +-  2.40% )  (83.10%)
         402249109      stalled-cycles-backend    #   25.09% backend cycles idle      ( +-  1.19% )  (34.05%)
        1847783963      instructions              #    1.15  insn per cycle
                                                  #    0.22  stalled cycles per insn  ( +-  0.18% )  (50.64%)
         407162722      branches                  # 1018925730.631 M/sec              ( +-  0.12% )  (67.02%)
          10932779      branch-misses             #    2.69% of all branches          ( +-  0.51% )  (83.28%)

           0.40077 +- 0.00267 seconds time elapsed  ( +-  0.67% )

lebedevri@pini-pini:/build/llvm-build-Clang-release$ perf stat -r 9 ./bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency.yml -analysis-inconsistencies-output-file=/tmp/clusters-new.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 50572 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new.html'
...
 Performance counter stats for './bin/llvm-exegesis -mode=analysis -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-latency.yml -analysis-inconsistencies-output-file=/tmp/clusters-new.html' (9 runs):

           6947.79 msec task-clock                #    1.000 CPUs utilized            ( +-  0.90% )
               217      context-switches          #   31.236 M/sec                    ( +- 36.16% )
                 1      cpu-migrations            #    0.096 M/sec                    ( +- 50.00% )
             13258      page-faults               # 1908.389 M/sec                    ( +-  0.34% )
       27830796523      cycles                    # 4006032.286 GHz                   ( +-  0.89% )  (83.30%)
        1504554006      stalled-cycles-frontend   #    5.41% frontend cycles idle     ( +-  2.10% )  (83.32%)
       16716574843      stalled-cycles-backend    #   60.07% backend cycles idle      ( +-  0.65% )  (33.38%)
       17755545931      instructions              #    0.64  insn per cycle
                                                  #    0.94  stalled cycles per insn  ( +-  0.09% )  (50.04%)
        3897255686      branches                  # 560980426.597 M/sec               ( +-  0.06% )  (66.70%)
         117045395      branch-misses             #    3.00% of all branches          ( +-  0.47% )  (83.34%)

            6.9507 +- 0.0627 seconds time elapsed  ( +-  0.90% )
```

I.e. it's +2.6% slowdown for one whole sweep, or +2% for 5 whole sweeps.
Within noise i'd say.

Should help with [[ https://bugs.llvm.org/show_bug.cgi?id=40787 | PR40787 ]].

Reviewers: courbet, gchatelet

Reviewed By: courbet

Subscribers: tschuett, RKSimon, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58476

llvm-svn: 354767
2019-02-25 09:36:12 +00:00
Roman Lebedev 69716394f3 [llvm-exegesis] Opcode stabilization / reclusterization (PR40715)
Summary:
Given an instruction `Opcode`, we can make benchmarks (measurements) of the
instruction characteristics/performance. Then, to facilitate further analysis
we group the benchmarks with *similar* characteristics into clusters.
Now, this is all not entirely deterministic. Some instructions have variable
characteristics, depending on their arguments. And thus, if we do several
benchmarks of the same instruction `Opcode`, we may end up with *different*
performance characteristics measurements. And when we then do clustering,
these several benchmarks of the same instruction `Opcode` may end up being
clustered into *different* clusters. This is not great for further analysis.

We shall find every `Opcode` with benchmarks not in just one cluster, and move
*all* the benchmarks of said `Opcode` into one new unstable cluster per `Opcode`.

I have solved this by making `ClusterId` a bit field, adding a `IsUnstable` bit,
and introducing `-analysis-display-unstable-clusters` switch to toggle between
displaying stable-only clusters and unstable-only clusters.

The reclusterization is deterministically stable, produces identical reports
between runs. (Or at least that is what i'm seeing, maybe it isn't)

Timings/comparisons:
old (current trunk/head) {F8303582}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-old.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-old.html' (25 runs):

           6624.73 msec task-clock                #    0.999 CPUs utilized            ( +-  0.53% )
               172      context-switches          #   25.965 M/sec                    ( +- 29.89% )
                 0      cpu-migrations            #    0.042 M/sec                    ( +- 56.54% )
             31073      page-faults               # 4690.754 M/sec                    ( +-  0.08% )
       26538711696      cycles                    # 4006230.292 GHz                   ( +-  0.53% )  (83.31%)
        2017496807      stalled-cycles-frontend   #    7.60% frontend cycles idle     ( +-  0.93% )  (83.32%)
       13403650062      stalled-cycles-backend    #   50.51% backend cycles idle      ( +-  0.33% )  (33.37%)
       19770706799      instructions              #    0.74  insn per cycle
                                                  #    0.68  stalled cycles per insn  ( +-  0.04% )  (50.04%)
        4419821812      branches                  # 667207369.714 M/sec               ( +-  0.03% )  (66.69%)
         121741669      branch-misses             #    2.75% of all branches          ( +-  0.28% )  (83.34%)

            6.6283 +- 0.0358 seconds time elapsed  ( +-  0.54% )
```

patch, with reclustering but without filtering (i.e. outputting all the stable *and* unstable clusters) {F8303586}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-all.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-all.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-all.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-all.html' (25 runs):

           6475.29 msec task-clock                #    0.999 CPUs utilized            ( +-  0.31% )
               213      context-switches          #   32.952 M/sec                    ( +- 23.81% )
                 1      cpu-migrations            #    0.130 M/sec                    ( +- 43.84% )
             31287      page-faults               # 4832.057 M/sec                    ( +-  0.08% )
       25939086577      cycles                    # 4006160.279 GHz                   ( +-  0.31% )  (83.31%)
        1958812858      stalled-cycles-frontend   #    7.55% frontend cycles idle     ( +-  0.68% )  (83.32%)
       13218961512      stalled-cycles-backend    #   50.96% backend cycles idle      ( +-  0.29% )  (33.37%)
       19752995402      instructions              #    0.76  insn per cycle
                                                  #    0.67  stalled cycles per insn  ( +-  0.04% )  (50.04%)
        4417079244      branches                  # 682195472.305 M/sec               ( +-  0.03% )  (66.70%)
         121510065      branch-misses             #    2.75% of all branches          ( +-  0.19% )  (83.34%)

            6.4832 +- 0.0229 seconds time elapsed  ( +-  0.35% )
```
Funnily, *this* measurement shows that said reclustering actually improved performance.

patch, with reclustering, only the stable clusters {F8303594}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-stable.html
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-stable.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-stable.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-stable.html' (25 runs):

           6387.71 msec task-clock                #    0.999 CPUs utilized            ( +-  0.13% )
               133      context-switches          #   20.792 M/sec                    ( +- 23.39% )
                 0      cpu-migrations            #    0.063 M/sec                    ( +- 61.24% )
             31318      page-faults               # 4903.256 M/sec                    ( +-  0.08% )
       25591984967      cycles                    # 4006786.266 GHz                   ( +-  0.13% )  (83.31%)
        1881234904      stalled-cycles-frontend   #    7.35% frontend cycles idle     ( +-  0.25% )  (83.33%)
       13209749965      stalled-cycles-backend    #   51.62% backend cycles idle      ( +-  0.16% )  (33.36%)
       19767554347      instructions              #    0.77  insn per cycle
                                                  #    0.67  stalled cycles per insn  ( +-  0.04% )  (50.03%)
        4417480305      branches                  # 691618858.046 M/sec               ( +-  0.03% )  (66.68%)
         118676358      branch-misses             #    2.69% of all branches          ( +-  0.07% )  (83.33%)

            6.3954 +- 0.0118 seconds time elapsed  ( +-  0.18% )
```
Performance improved even further?! Makes sense i guess, less clusters to print.

patch, with reclustering, only the unstable clusters {F8303601}
```
$ perf stat -r 25 ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-unstable.html -analysis-display-unstable-clusters
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-unstable.html'
...
no exegesis target for x86_64-unknown-linux-gnu, using default
Parsed 43970 benchmark points
Printing sched class consistency analysis results to file '/tmp/clusters-new-unstable.html'

 Performance counter stats for './bin/llvm-exegesis -mode=analysis -analysis-epsilon=0.5 -benchmarks-file=/home/lebedevri/PileDriver-Sched/benchmarks-inverse_throughput.yaml -analysis-inconsistencies-output-file=/tmp/clusters-new-unstable.html -analysis-display-unstable-clusters' (25 runs):

           6124.96 msec task-clock                #    1.000 CPUs utilized            ( +-  0.20% )
               194      context-switches          #   31.709 M/sec                    ( +- 20.46% )
                 0      cpu-migrations            #    0.039 M/sec                    ( +- 49.77% )
             31413      page-faults               # 5129.261 M/sec                    ( +-  0.06% )
       24536794267      cycles                    # 4006425.858 GHz                   ( +-  0.19% )  (83.31%)
        1676085087      stalled-cycles-frontend   #    6.83% frontend cycles idle     ( +-  0.46% )  (83.32%)
       13035595603      stalled-cycles-backend    #   53.13% backend cycles idle      ( +-  0.16% )  (33.36%)
       18260877653      instructions              #    0.74  insn per cycle
                                                  #    0.71  stalled cycles per insn  ( +-  0.05% )  (50.03%)
        4112411983      branches                  # 671484364.603 M/sec               ( +-  0.03% )  (66.68%)
         114066929      branch-misses             #    2.77% of all branches          ( +-  0.11% )  (83.32%)

            6.1278 +- 0.0121 seconds time elapsed  ( +-  0.20% )
```
This tells us that the actual `-analysis-inconsistencies-output-file=` outputting only takes ~0.4 sec for 43970 benchmark points (3 whole sweeps)
(Also, wow this is fast, it used to take several minutes originally)

Fixes [[ https://bugs.llvm.org/show_bug.cgi?id=40715 | PR40715 ]].

Reviewers: courbet, gchatelet

Reviewed By: courbet

Subscribers: tschuett, jdoerfert, llvm-commits, RKSimon

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D58355

llvm-svn: 354441
2019-02-20 09:14:04 +00:00
Andrea Di Biagio edbf06a767 [AsmPrinter] Remove hidden flag -print-schedule.
This patch removes hidden codegen flag -print-schedule effectively reverting the
logic originally committed as r300311
(https://llvm.org/viewvc/llvm-project?view=revision&revision=300311).

Flag -print-schedule was originally introduced by r300311 to address PR32216
(https://bugs.llvm.org/show_bug.cgi?id=32216). That bug was about adding "Better
testing of schedule model instruction latencies/throughputs".

These days, we can use llvm-mca to test scheduling models. So there is no longer
a need for flag -print-schedule in LLVM. The main use case for PR32216 is
now addressed by llvm-mca.
Flag -print-schedule is mainly used for debugging purposes, and it is only
actually used by x86 specific tests. We already have extensive (latency and
throughput) tests under "test/tools/llvm-mca" for X86 processor models. That
means, most (if not all) existing -print-schedule tests for X86 are redundant.

When flag -print-schedule was first added to LLVM, several files had to be
modified; a few APIs gained new arguments (see for example method
MCAsmStreamer::EmitInstruction), and MCSubtargetInfo/TargetSubtargetInfo gained
a couple of getSchedInfoStr() methods.

Method getSchedInfoStr() had to originally work for both MCInst and
MachineInstr. The original implmentation of getSchedInfoStr() introduced a
subtle layering violation (reported as PR37160 and then fixed/worked-around by
r330615).
In retrospect, that new API could have been designed more optimally. We can
always query MCSchedModel to get the latency and throughput. More importantly,
the "sched-info" string should not have been generated by the subtarget.
Note, r317782 fixed an issue where "print-schedule" didn't work very well in the
presence of inline assembly. That commit is also reverted by this change.

Differential Revision: https://reviews.llvm.org/D57244

llvm-svn: 353043
2019-02-04 12:51:26 +00:00
Roman Lebedev 21193f4b7e [llvm-exegesis] Don't default to running&dumping all analyses to '-'
Summary:
Up until the point i have looked in the source, i didn't even understood that
i can disable 'cluster' output. I have always silenced it via ` &> /dev/null`.
(And hoped it wasn't contributing much of the run time.)

While i expect that it has it's use-cases i never once needed it so far.
If i forget to silence it, console is completely flooded with that output.

How about not expecting users to opt-out of analyses,
but to explicitly specify the analyses that should be performed?

Reviewers: courbet, gchatelet

Reviewed By: courbet

Subscribers: tschuett, RKSimon, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D57648

llvm-svn: 353021
2019-02-04 09:12:08 +00:00