| # Benchmarking `llvm-libc`'s memory functions
 | |
| 
 | |
| ## Foreword
 | |
| 
 | |
| Microbenchmarks are valuable tools to assess and compare the performance of
 | |
| isolated pieces of code. However, they don't capture all interactions of complex
 | |
| systems, so other metrics can be equally important:
 | |
| 
 | |
| -   **code size** (to reduce instruction cache pressure),
 | |
| -   **Profile Guided Optimization** friendliness,
 | |
| -   **hyperthreading / multithreading** friendliness.
 | |
| 
 | |
| ## Rationale
 | |
| 
 | |
| The goal here is to satisfy the [Benchmarking
 | |
| Principles](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Benchmarking_Principles).
 | |
| 
 | |
| 1.  **Relevance**: Benchmarks should measure relatively vital features.
 | |
| 2.  **Representativeness**: Benchmark performance metrics should be broadly
 | |
|     accepted by industry and academia.
 | |
| 3.  **Equity**: All systems should be fairly compared.
 | |
| 4.  **Repeatability**: Benchmark results can be verified.
 | |
| 5.  **Cost-effectiveness**: Benchmark tests are economical.
 | |
| 6.  **Scalability**: Benchmark tests should measure from single server to
 | |
|     multiple servers.
 | |
| 7.  **Transparency**: Benchmark metrics should be easy to understand.
 | |
| 
 | |
| Benchmarking is a [subtle
 | |
| art](https://en.wikipedia.org/wiki/Benchmark_\(computing\)#Challenges) and
 | |
| benchmarking memory functions is no exception. Here we'll dive into
 | |
| peculiarities of designing good microbenchmarks for `llvm-libc` memory
 | |
| functions.
 | |
| 
 | |
| ## Challenges
 | |
| 
 | |
| As seen in the [README.md](README.md#benchmarking-regimes) the microbenchmarking
 | |
| facility should focus on measuring **low latency code**. If copying a few bytes
 | |
| takes on the order of a few cycles, the benchmark should be able to **measure
 | |
| accurately down to the cycle**.
 | |
| 
 | |
| ### Measuring instruments
 | |
| 
 | |
| There are different sources of time in a computer (ordered from high to low resolution):
 | |
|  - [Performance
 | |
|    Counters](https://en.wikipedia.org/wiki/Hardware_performance_counter): used to
 | |
|    introspect the internals of the CPU,
 | |
|  - [High Precision Event
 | |
|    Timer](https://en.wikipedia.org/wiki/High_Precision_Event_Timer): used to
 | |
|    trigger short lived actions,
 | |
|  - [Real-Time Clocks (RTC)](https://en.wikipedia.org/wiki/Real-time_clock): used
 | |
|    to keep track of the computer's time.
 | |
| 
 | |
| In theory **Performance Counters** provide cycle accurate measurement via the
 | |
| `cpu cycles` event. But as we'll see, they are not really practical in this
 | |
| context.
 | |
| 
 | |
| ### Performance counters and modern processor architecture
 | |
| 
 | |
| Modern CPUs are [out of
 | |
| order](https://en.wikipedia.org/wiki/Out-of-order_execution) and
 | |
| [superscalar](https://en.wikipedia.org/wiki/Superscalar_processor). As a
 | |
| consequence, it is [hard to know what is included when the counter is
 | |
| read](https://en.wikipedia.org/wiki/Hardware_performance_counter#Instruction_based_sampling):
 | |
| some instructions may still be **in flight**, while others may be executing
 | |
| [**speculatively**](https://en.wikipedia.org/wiki/Speculative_execution). As a
 | |
| matter of fact, **on the same machine, measuring the same piece of code twice
 | |
| will yield different results.**
 | |
| 
 | |
| ### Performance counters semantics inconsistencies and availability
 | |
| 
 | |
| Although they have the same name, the exact semantics of performance counters
 | |
| are micro-architecture dependent: **it is generally not possible to compare two
 | |
| micro-architectures exposing the same performance counters.**
 | |
| 
 | |
| Each vendor decides which performance counters to implement and their exact
 | |
| meaning. Although we want to benchmark `llvm-libc` memory functions for all
 | |
| available [target
 | |
| triples](https://clang.llvm.org/docs/CrossCompilation.html#target-triple), there
 | |
| are **no guarantees that the counter we're interested in is available.** 
 | |
| 
 | |
| ### Additional imprecisions
 | |
| 
 | |
| -   Reading performance counters is done through kernel [system
 | |
|     calls](https://en.wikipedia.org/wiki/System_call). The system call itself
 | |
|     is costly (hundreds of cycles) and will perturb the counter's value.
 | |
| -   [Interrupts](https://en.wikipedia.org/wiki/Interrupt#Processor_response)
 | |
|     can occur during measurement.
 | |
| -   If the system is already under monitoring (virtual machines or system wide
 | |
|     profiling) the kernel can decide to multiplex the performance counters
 | |
|     leading to lower precision or even completely missing the measurement.
 | |
| -   The kernel can decide to [migrate the
 | |
|     process](https://en.wikipedia.org/wiki/Process_migration) to a different
 | |
|     core.
 | |
| -   [Dynamic frequency
 | |
|     scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can kick
 | |
|     in during the measurement and change the duration of a cycle. **Ultimately we
 | |
|     care about the amount of work over a period of time**. This weakens the
 | |
|     case for measuring cycles rather than **raw time**.
 | |
| 
 | |
| ### Cycle accuracy conclusion
 | |
| 
 | |
| We have seen that performance counters are: not widely available, semantically
 | |
| inconsistent across micro-architectures and imprecise on modern CPUs for small
 | |
| snippets of code.
 | |
| 
 | |
| ## Design decisions
 | |
| 
 | |
| In order to achieve the needed precision we need to resort to more widely
 | |
| available counters and derive the time from a high number of runs: going from a
 | |
| single deterministic measure to a probabilistic one.
 | |
| 
 | |
| **To get a good signal to noise ratio we need the running time of the piece of
 | |
| code to be orders of magnitude greater than the measurement precision.**
 | |
| 
 | |
| For instance, if the measurement precision is 10 cycles, the function runtime
 | |
| needs to exceed 1000 cycles to keep the measurement error below 1%, that is a
 | |
| [signal-to-noise ratio](https://en.wikipedia.org/wiki/Signal-to-noise_ratio) of 100.
 | |
| 
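The rule of thumb above reduces to a multiplication. The sketch below spells it out; the helper names `min_runtime_cycles` and `min_iterations` and the `min_ratio` parameter are hypothetical and only illustrate the arithmetic, they are not part of the benchmarking framework.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: the smallest total runtime (in cycles) for which a
// timer with `precision_cycles` of uncertainty yields the requested ratio
// between signal (runtime) and noise (precision).
inline double min_runtime_cycles(double precision_cycles, double min_ratio) {
  assert(precision_cycles > 0.0 && min_ratio >= 1.0);
  return precision_cycles * min_ratio;
}

// Hypothetical helper: number of back-to-back calls needed to reach that
// runtime when a single call costs roughly `cycles_per_call`.
inline std::uint64_t min_iterations(double precision_cycles, double min_ratio,
                                    double cycles_per_call) {
  assert(cycles_per_call > 0.0);
  const double total = min_runtime_cycles(precision_cycles, min_ratio);
  return static_cast<std::uint64_t>(std::ceil(total / cycles_per_call));
}
```

With a precision of 10 cycles and a ratio of 100, `min_runtime_cycles` returns 1000 cycles, so a copy that takes about 4 cycles would have to be repeated at least 250 times.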
 | |
| ### Repeating code N-times until precision is sufficient
 | |
| 
 | |
| The algorithm is as follows:
 | |
| 
 | |
| -   We measure the time it takes to run the code _N_ times (initially _N_ is 10,
 | |
|     for instance).
 | |
| -   We deduce an approximation of the runtime of one iteration (= _runtime_ /
 | |
|     _N_).
 | |
| -   We increase _N_ by _X%_ and repeat the measurement (geometric progression).
 | |
| -   We keep track of the _one iteration runtime approximation_ and build a
 | |
|     weighted mean of all the samples so far (weight is proportional to _N_).
 | |
| -   We stop the process when the difference between the weighted mean and the
 | |
|     last estimation is smaller than _ε_ or when other stopping conditions are
 | |
|     met (total runtime, maximum iterations or maximum sample count).
 | |
| 
 | |
| This method allows us to be as precise as needed provided that the measured
 | |
| runtime is proportional to _N_. Longer run times also smooth out imprecision
 | |
| related to _interrupts_ and _context switches_.
 | |
| 
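The steps above can be sketched as follows. This is a minimal illustration, not the framework's implementation: `AdaptiveOptions`, `Estimate` and `estimate_runtime` are hypothetical names, the default values are arbitrary, _ε_ is assumed to be relative to the current mean, and only the maximum sample count is used as an additional stopping condition. The timing source is passed in as a callable so the logic stays independent of any particular clock.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>

// Hypothetical knobs; the defaults are illustrative only.
struct AdaptiveOptions {
  std::size_t initial_iterations = 10;
  double growth = 1.4;       // N grows by 40% per sample (geometric progression)
  double epsilon = 0.01;     // assumed to be relative to the current mean
  std::size_t max_samples = 1000;
};

struct Estimate {
  double per_iteration;  // weighted mean of the one-iteration estimates
  std::size_t samples;   // number of samples taken
};

// `measure(n)` must return the total time spent running the code n times.
inline Estimate estimate_runtime(
    const std::function<double(std::size_t)>& measure,
    const AdaptiveOptions& opt = AdaptiveOptions()) {
  assert(opt.initial_iterations > 0 && opt.growth > 1.0);
  double weighted_sum = 0.0;  // sum of N * (one-iteration estimate)
  double weight_total = 0.0;  // sum of N
  double n = static_cast<double>(opt.initial_iterations);
  double mean = 0.0;
  std::size_t samples = 0;
  while (samples < opt.max_samples) {
    const std::size_t iterations = static_cast<std::size_t>(n);
    const double last = measure(iterations) / iterations;
    weighted_sum += last * iterations;
    weight_total += iterations;
    mean = weighted_sum / weight_total;
    ++samples;
    // Stop once the latest estimate agrees with the weighted mean.
    if (samples > 1 && std::fabs(mean - last) <= opt.epsilon * mean) break;
    n *= opt.growth;
  }
  return {mean, samples};
}
```

Feeding it a synthetic cost model such as `5.0 * n + 100.0` (a fixed measurement overhead plus a per-iteration cost) shows the estimate approaching the per-iteration cost as _N_ grows and the overhead is amortized.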
 | |
| Note: When measuring longer runtimes (e.g. copying several megabytes of data)
 | |
| the above assumption doesn't hold anymore and the _ε_ precision cannot be
 | |
| reached by increasing iterations. The whole benchmarking process becomes
 | |
| prohibitively slow. In this case the algorithm is limited to a single sample and
 | |
| repeated several times to get a decent 95% confidence interval.
 | |
| 
 | |
| ### Effect of branch prediction
 | |
| 
 | |
| When measuring code with branches, repeating the same call again and again will
 | |
| allow the processor to learn the branching patterns and perfectly predict all
 | |
| the branches, leading to unrealistic results.
 | |
| 
 | |
| **Decision: When benchmarking small buffer sizes, the function parameters should
 | |
| be randomized between calls to prevent perfect branch predictions.**
 | |
| 
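One way to follow this decision is to precompute a list of random sizes before the timed region, so that the cost of the random number generator is not measured. The sketch below assumes this approach; `random_sizes` and `copy_with_random_sizes` are hypothetical helpers, and `std::memcpy` merely stands in for the function under test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <random>
#include <vector>

// Hypothetical helper: `count` sizes drawn uniformly from [0, max_size].
// A fixed seed keeps runs repeatable.
inline std::vector<std::size_t> random_sizes(std::size_t count,
                                             std::size_t max_size,
                                             std::uint32_t seed) {
  assert(count > 0);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> dist(0, max_size);
  std::vector<std::size_t> sizes(count);
  for (std::size_t& size : sizes) size = dist(rng);
  return sizes;
}

// The timed region: each call sees a different size, so size-dependent
// branches cannot be learned from a single repeated pattern.
// Both buffers must hold at least the largest size in `sizes`.
inline void copy_with_random_sizes(char* dst, const char* src,
                                   const std::vector<std::size_t>& sizes) {
  for (std::size_t size : sizes) std::memcpy(dst, src, size);
}
```

A real harness would also need to keep the compiler from optimizing the copies away, which this sketch does not address.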
 | |
| ### Effect of the memory subsystem
 | |
| 
 | |
| The CPU is tightly coupled to the memory subsystem. It is common to see `L1`,
 | |
| `L2` and `L3` data caches.
 | |
| 
 | |
| We may be tempted to randomize data accesses widely to exercise all the caching
 | |
| layers down to RAM but the [cost of accessing lower layers of
 | |
| memory](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html)
 | |
| completely dominates the runtime for small sizes.
 | |
| 
 | |
| So to respect **Equity** and **Repeatability** principles we should make sure we
 | |
| **do not** depend on the memory subsystem.
 | |
| 
 | |
| **Decision: When benchmarking small buffer sizes, the data accessed by the
 | |
| function should stay in `L1`.**
 | |
| 
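As an illustration, the accessed ranges can be constrained to buffers sized from the L1 data cache. The sketch below assumes a 32 KiB L1 data cache split evenly between source and destination; both the value and the split are assumptions (the real size should come from the detected cache hierarchy), and `Access` and `accesses_within_l1` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Access {
  std::size_t offset;  // start of the accessed range inside the buffer
  std::size_t size;    // number of bytes accessed
};

// Assumption: 32 KiB of L1 data cache; query the actual value in practice.
constexpr std::size_t kAssumedL1Bytes = 32 * 1024;

// Hypothetical helper: random (offset, size) pairs that never leave a buffer
// of l1_bytes / 2, so that source and destination together fit in L1.
inline std::vector<Access> accesses_within_l1(
    std::size_t count, std::size_t max_size, std::uint32_t seed,
    std::size_t l1_bytes = kAssumedL1Bytes) {
  const std::size_t buffer_bytes = l1_bytes / 2;
  assert(max_size <= buffer_bytes);
  std::mt19937 rng(seed);
  std::uniform_int_distribution<std::size_t> size_dist(0, max_size);
  std::vector<Access> accesses(count);
  for (Access& access : accesses) {
    access.size = size_dist(rng);
    // Pick an offset such that offset + size stays inside the buffer.
    std::uniform_int_distribution<std::size_t> offset_dist(
        0, buffer_bytes - access.size);
    access.offset = offset_dist(rng);
  }
  return accesses;
}
```

Note that the access list itself also occupies cache lines, so a real harness would have to budget for it as well.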
 | |
| ### Effect of prefetching
 | |
| 
 | |
| In case of small buffer sizes,
 | |
| [prefetching](https://en.wikipedia.org/wiki/Cache_prefetching) should not kick
 | |
| in but in case of large buffers it may introduce a bias.
 | |
| 
 | |
| **Decision: When benchmarking large buffer sizes, the data should be accessed in
 | |
| a random fashion to lower the impact of prefetching between calls.**
 | |
| 
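A simple way to get such an access pattern is to split the large buffer into fixed-size blocks and visit them in a shuffled order. The sketch below assumes this scheme; `shuffled_block_offsets` is a hypothetical helper and the block size is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: offsets of all whole blocks of `block_size` bytes in a
// buffer of `buffer_size` bytes, in a pseudo-random order. Consecutive calls
// then touch unrelated addresses instead of a predictable sequential stream.
inline std::vector<std::size_t> shuffled_block_offsets(std::size_t buffer_size,
                                                       std::size_t block_size,
                                                       std::uint32_t seed) {
  assert(block_size > 0 && block_size <= buffer_size);
  std::vector<std::size_t> offsets(buffer_size / block_size);
  for (std::size_t i = 0; i < offsets.size(); ++i) offsets[i] = i * block_size;
  std::mt19937 rng(seed);
  std::shuffle(offsets.begin(), offsets.end(), rng);
  return offsets;
}
```

Each call of the function under test would then operate on `buffer + offset` for the next offset in the list.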
 | |
| ### Effect of dynamic frequency scaling
 | |
| 
 | |
| Modern processors implement [dynamic frequency
 | |
| scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling). In so-called
 | |
| `performance` mode the CPU will increase its frequency and run faster than usual
 | |
| within [some limits](https://en.wikipedia.org/wiki/Intel_Turbo_Boost): _"The
 | |
| increased clock rate is limited by the processor's power, current, and thermal
 | |
| limits, the number of cores currently in use, and the maximum frequency of the
 | |
| active cores."_
 | |
| 
 | |
| **Decision: When benchmarking we want to make sure the dynamic frequency scaling
 | |
| is always set to `performance`. We also want to make sure that the time based
 | |
| events are not impacted by frequency scaling.**
 | |
| 
 | |
| See [README.md](README.md) on how to set this up.
 | |
| 
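On Linux systems that expose the cpufreq interface through sysfs, the active governor of a core can be read from its `scaling_governor` file. The sketch below shows a sanity check a harness could run before measuring; it is Linux specific, the function names are hypothetical, and it is not a substitute for the setup described in the README.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Returns true when `contents` (the text read from a scaling_governor file)
// names the `performance` governor. Trailing whitespace is ignored because
// sysfs values end with a newline.
inline bool is_performance_governor(const std::string& contents) {
  const std::size_t end = contents.find_last_not_of(" \t\r\n");
  if (end == std::string::npos) return false;
  return contents.compare(0, end + 1, "performance") == 0;
}

// Hypothetical check for one core. Returns false when the file is missing,
// for example when cpufreq is not available on the machine.
inline bool cpu_uses_performance_governor(unsigned cpu) {
  const std::string path = "/sys/devices/system/cpu/cpu" +
                           std::to_string(cpu) + "/cpufreq/scaling_governor";
  std::ifstream file(path);
  std::string contents;
  if (!std::getline(file, contents)) return false;
  return is_performance_governor(contents);
}
```

A harness could refuse to run, or at least print a warning, when `cpu_uses_performance_governor` returns false for the core it is pinned to.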
 | |
| ### Reserved and pinned cores
 | |
| 
 | |
| Some operating systems allow [core
 | |
| reservation](https://stackoverflow.com/questions/13583146/whole-one-core-dedicated-to-single-process).
 | |
| It removes a set of perturbation sources like: process migration, context
 | |
| switches and interrupts. When a core is hyperthreaded, both of its logical
 | |
| cores should be reserved.
 | |
| 
 | |
| ## Microbenchmarks limitations
 | |
| 
 | |
| As stated in the Foreword section, a number of effects do play a role in
 | |
| production but are not directly measurable through microbenchmarks. The code
 | |
| size of the benchmark is (much) smaller than the hot code of real applications
 | |
| and **doesn't exhibit instruction cache pressure as much**.
 | |
| 
 | |
| ### iCache pressure
 | |
| 
 | |
| Fundamental functions that are called frequently will occupy the L1 iCache
 | |
| ([illustration](https://en.wikipedia.org/wiki/CPU_cache#Example:_the_K8)). If
 | |
| they are too big they will prevent other hot code from staying in the cache and incur
 | |
| [stalls](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls). So the memory
 | |
| functions should be as small as possible.
 | |
| 
 | |
| ### iTLB pressure
 | |
| 
 | |
| The same reasoning applies to the instruction Translation Lookaside Buffer
 | |
| ([iTLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer)): large
 | |
| functions put pressure on it and incur [TLB
 | |
| misses](https://en.wikipedia.org/wiki/Translation_lookaside_buffer#TLB-miss_handling).
 | |
| 
 | |
| ## FAQ
 | |
| 
 | |
| 1.  Why don't you use Google Benchmark directly?
 | |
| 
 | |
|     We reuse some parts of Google Benchmark (detection of frequency scaling, CPU
 | |
|     cache hierarchy information) but when it comes to measuring memory
 | |
|     functions Google Benchmark has a few issues:
 | |
| 
 | |
|     -   Google Benchmark favors code-based configuration via macros and
 | |
|         builders. It is typically done in a static manner. In our case the
 | |
|         parameters we need to set up are a mix of what's usually controlled by
 | |
|         the framework (number of trials, maximum number of iterations, size
 | |
|         ranges) and parameters that are more tied to the function under test
 | |
|         (randomization strategies, custom values). Achieving this with Google
 | |
|         Benchmark is cumbersome as it involves templated benchmarks and
 | |
|         duplicated code. In the end, the configuration would be spread across
 | |
|         command line flags (via the framework's options or custom flags), and code
 | |
|         constants.
 | |
|     -   Output of the measurements is done through a `BenchmarkReporter` class,
 | |
|         which makes it hard to access the parameters discussed above.
 |