298 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
			
		
		
	
	
			298 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
=====================================
 | 
						||
Performance Tips for Frontend Authors
 | 
						||
=====================================
 | 
						||
 | 
						||
.. contents::
 | 
						||
   :local:
 | 
						||
   :depth: 2
 | 
						||
 | 
						||
Abstract
 | 
						||
========
 | 
						||
 | 
						||
The intended audience of this document is developers of language frontends
 | 
						||
targeting LLVM IR. This document is home to a collection of tips on how to
 | 
						||
generate IR that optimizes well.
 | 
						||
 | 
						||
IR Best Practices
 | 
						||
=================
 | 
						||
 | 
						||
As with any optimizer, LLVM has its strengths and weaknesses.  In some cases,
 | 
						||
surprisingly small changes in the source IR can have a large effect on the
 | 
						||
generated code.
 | 
						||
 | 
						||
Beyond the specific items on the list below, it's worth noting that the most
 | 
						||
mature frontend for LLVM is Clang.  As a result, the further your IR gets from
 | 
						||
what Clang might emit, the less likely it is to be effectively optimized. It
 | 
						||
can often be useful to write a quick C program with the semantics you're trying
 | 
						||
to model and see what decisions Clang's IRGen makes about what IR to emit.
 | 
						||
Studying Clang's CodeGen directory can also be a good source of ideas.  Note
 | 
						||
that Clang and LLVM are explicitly version locked so you'll need to make sure
 | 
						||
you're using a Clang built from the same git revision or release as the LLVM
 | 
						||
library you're using.  As always, it's *strongly* recommended that you track
 | 
						||
tip of tree development, particularly during bring up of a new project.
 | 
						||
 | 
						||
The Basics
 | 
						||
^^^^^^^^^^^
 | 
						||
 | 
						||
#. Make sure that your Modules contain both a data layout specification and
 | 
						||
   target triple. Without these pieces, non of the target specific optimization
 | 
						||
   will be enabled.  This can have a major effect on the generated code quality.
 | 
						||
 | 
						||
#. For each function or global emitted, use the most private linkage type
 | 
						||
   possible (private, internal or linkonce_odr preferably).  Doing so will
 | 
						||
   make LLVM's inter-procedural optimizations much more effective.
 | 
						||
 | 
						||
#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
 | 
						||
   of predecessors).  Among other issues, the register allocator is known to
 | 
						||
   perform badly with confronted with such structures.  The only exception to
 | 
						||
   this guidance is that a unified return block with high in-degree is fine.
 | 
						||
 | 
						||
Use of allocas
 | 
						||
^^^^^^^^^^^^^^
 | 
						||
 | 
						||
An alloca instruction can be used to represent a function scoped stack slot,
 | 
						||
but can also represent dynamic frame expansion.  When representing function
 | 
						||
scoped variables or locations, placing alloca instructions at the beginning of
 | 
						||
the entry block should be preferred.   In particular, place them before any
 | 
						||
call instructions. Call instructions might get inlined and replaced with
 | 
						||
multiple basic blocks. The end result is that a following alloca instruction
 | 
						||
would no longer be in the entry basic block afterward.
 | 
						||
 | 
						||
The SROA (Scalar Replacement Of Aggregates) and Mem2Reg passes only attempt
 | 
						||
to eliminate alloca instructions that are in the entry basic block.  Given
 | 
						||
SSA is the canonical form expected by much of the optimizer; if allocas can
 | 
						||
not be eliminated by Mem2Reg or SROA, the optimizer is likely to be less
 | 
						||
effective than it could be.
 | 
						||
 | 
						||
Avoid loads and stores of large aggregate type
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
 | 
						||
LLVM currently does not optimize well loads and stores of large :ref:`aggregate
 | 
						||
types <t_aggregate>` (i.e. structs and arrays).  As an alternative, consider
 | 
						||
loading individual fields from memory.
 | 
						||
 | 
						||
Aggregates that are smaller than the largest (performant) load or store
 | 
						||
instruction supported by the targeted hardware are well supported.  These can
 | 
						||
be an effective way to represent collections of small packed fields.
 | 
						||
 | 
						||
Prefer zext over sext when legal
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
 | 
						||
On some architectures (X86_64 is one), sign extension can involve an extra
 | 
						||
instruction whereas zero extension can be folded into a load.  LLVM will try to
 | 
						||
replace a sext with a zext when it can be proven safe, but if you have
 | 
						||
information in your source language about the range of an integer value, it can
 | 
						||
be profitable to use a zext rather than a sext.
 | 
						||
 | 
						||
Alternatively, you can :ref:`specify the range of the value using metadata
 | 
						||
<range-metadata>` and LLVM can do the sext to zext conversion for you.
 | 
						||
 | 
						||
Zext GEP indices to machine register width
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
 | 
						||
Internally, LLVM often promotes the width of GEP indices to machine register
 | 
						||
width.  When it does so, it will default to using sign extension (sext)
 | 
						||
operations for safety.  If your source language provides information about
 | 
						||
the range of the index, you may wish to manually extend indices to machine
 | 
						||
register width using a zext instruction.
 | 
						||
 | 
						||
When to specify alignment
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
LLVM will always generate correct code if you don’t specify alignment, but may
 | 
						||
generate inefficient code.  For example, if you are targeting MIPS (or older
 | 
						||
ARM ISAs) then the hardware does not handle unaligned loads and stores, and
 | 
						||
so you will enter a trap-and-emulate path if you do a load or store with
 | 
						||
lower-than-natural alignment.  To avoid this, LLVM will emit a slower
 | 
						||
sequence of loads, shifts and masks (or load-right + load-left on MIPS) for
 | 
						||
all cases where the load / store does not have a sufficiently high alignment
 | 
						||
in the IR.
 | 
						||
 | 
						||
The alignment is used to guarantee the alignment on allocas and globals,
 | 
						||
though in most cases this is unnecessary (most targets have a sufficiently
 | 
						||
high default alignment that they’ll be fine).  It is also used to provide a
 | 
						||
contract to the back end saying ‘either this load/store has this alignment, or
 | 
						||
it is undefined behavior’.  This means that the back end is free to emit
 | 
						||
instructions that rely on that alignment (and mid-level optimizers are free to
 | 
						||
perform transforms that require that alignment).  For x86, it doesn’t make
 | 
						||
much difference, as almost all instructions are alignment-independent.  For
 | 
						||
MIPS, it can make a big difference.
 | 
						||
 | 
						||
Note that if your loads and stores are atomic, the backend will be unable to
 | 
						||
lower an under aligned access into a sequence of natively aligned accesses.
 | 
						||
As a result, alignment is mandatory for atomic loads and stores.
 | 
						||
 | 
						||
Other Things to Consider
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
 | 
						||
#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing
 | 
						||
   analysis), prefer GEPs
 | 
						||
 | 
						||
#. Prefer globals over inttoptr of a constant address - this gives you
 | 
						||
   dereferencability information.  In MCJIT, use getSymbolAddress to provide
 | 
						||
   actual address.
 | 
						||
 | 
						||
#. Be wary of ordered and atomic memory operations.  They are hard to optimize
 | 
						||
   and may not be well optimized by the current optimizer.  Depending on your
 | 
						||
   source language, you may consider using fences instead.
 | 
						||
 | 
						||
#. If calling a function which is known to throw an exception (unwind), use
 | 
						||
   an invoke with a normal destination which contains an unreachable
 | 
						||
   instruction.  This form conveys to the optimizer that the call returns
 | 
						||
   abnormally.  For an invoke which neither returns normally or requires unwind
 | 
						||
   code in the current function, you can use a noreturn call instruction if
 | 
						||
   desired.  This is generally not required because the optimizer will convert
 | 
						||
   an invoke with an unreachable unwind destination to a call instruction.
 | 
						||
 | 
						||
#. Use profile metadata to indicate statically known cold paths, even if
 | 
						||
   dynamic profiling information is not available.  This can make a large
 | 
						||
   difference in code placement and thus the performance of tight loops.
 | 
						||
 | 
						||
#. When generating code for loops, try to avoid terminating the header block of
 | 
						||
   the loop earlier than necessary.  If the terminator of the loop header
 | 
						||
   block is a loop exiting conditional branch, the effectiveness of LICM will
 | 
						||
   be limited for loads not in the header.  (This is due to the fact that LLVM
 | 
						||
   may not know such a load is safe to speculatively execute and thus can't
 | 
						||
   lift an otherwise loop invariant load unless it can prove the exiting
 | 
						||
   condition is not taken.)  It can be profitable, in some cases, to emit such
 | 
						||
   instructions into the header even if they are not used along a rarely
 | 
						||
   executed path that exits the loop.  This guidance specifically does not
 | 
						||
   apply if the condition which terminates the loop header is itself invariant,
 | 
						||
   or can be easily discharged by inspecting the loop index variables.
 | 
						||
 | 
						||
#. In hot loops, consider duplicating instructions from small basic blocks
 | 
						||
   which end in highly predictable terminators into their successor blocks.
 | 
						||
   If a hot successor block contains instructions which can be vectorized
 | 
						||
   with the duplicated ones, this can provide a noticeable throughput
 | 
						||
   improvement.  Note that this is not always profitable and does involve a
 | 
						||
   potentially large increase in code size.
 | 
						||
 | 
						||
#. When checking a value against a constant, emit the check using a consistent
 | 
						||
   comparison type.  The GVN pass *will* optimize redundant equalities even if
 | 
						||
   the type of comparison is inverted, but GVN only runs late in the pipeline.
 | 
						||
   As a result, you may miss the opportunity to run other important
 | 
						||
   optimizations.  Improvements to EarlyCSE to remove this issue are tracked in
 | 
						||
   Bug 23333.
 | 
						||
 | 
						||
#. Avoid using arithmetic intrinsics unless you are *required* by your source
 | 
						||
   language specification to emit a particular code sequence.  The optimizer
 | 
						||
   is quite good at reasoning about general control flow and arithmetic, it is
 | 
						||
   not anywhere near as strong at reasoning about the various intrinsics.  If
 | 
						||
   profitable for code generation purposes, the optimizer will likely form the
 | 
						||
   intrinsics itself late in the optimization pipeline.  It is *very* rarely
 | 
						||
   profitable to emit these directly in the language frontend.  This item
 | 
						||
   explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.
 | 
						||
 | 
						||
#. Avoid using the :ref:`assume intrinsic <int_assume>` until you've
 | 
						||
   established that a) there's no other way to express the given fact and b)
 | 
						||
   that fact is critical for optimization purposes.  Assumes are a great
 | 
						||
   prototyping mechanism, but they can have negative effects on both compile
 | 
						||
   time and optimization effectiveness.  The former is fixable with enough
 | 
						||
   effort, but the later is fairly fundamental to their designed purpose.
 | 
						||
 | 
						||
 | 
						||
Describing Language Specific Properties
 | 
						||
=======================================
 | 
						||
 | 
						||
When translating a source language to LLVM, finding ways to express concepts
 | 
						||
and guarantees available in your source language which are not natively
 | 
						||
provided by LLVM IR will greatly improve LLVM's ability to optimize your code.
 | 
						||
As an example, C/C++'s ability to mark every add as "no signed wrap (nsw)" goes
 | 
						||
a long way to assisting the optimizer in reasoning about loop induction
 | 
						||
variables and thus generating more optimal code for loops.
 | 
						||
 | 
						||
The LLVM LangRef includes a number of mechanisms for annotating the IR with
 | 
						||
additional semantic information.  It is *strongly* recommended that you become
 | 
						||
highly familiar with this document.  The list below is intended to highlight a
 | 
						||
couple of items of particular interest, but is by no means exhaustive.
 | 
						||
 | 
						||
Restricted Operation Semantics
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
#. Add nsw/nuw flags as appropriate.  Reasoning about overflow is
 | 
						||
   generally hard for an optimizer so providing these facts from the frontend
 | 
						||
   can be very impactful.
 | 
						||
 | 
						||
#. Use fast-math flags on floating point operations if legal.  If you don't
 | 
						||
   need strict IEEE floating point semantics, there are a number of additional
 | 
						||
   optimizations that can be performed.  This can be highly impactful for
 | 
						||
   floating point intensive computations.
 | 
						||
 | 
						||
Describing Aliasing Properties
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
 | 
						||
#. Add noalias/align/dereferenceable/nonnull to function arguments and return
 | 
						||
   values as appropriate
 | 
						||
 | 
						||
#. Use pointer aliasing metadata, especially tbaa metadata, to communicate
 | 
						||
   otherwise-non-deducible pointer aliasing facts
 | 
						||
 | 
						||
#. Use inbounds on geps.  This can help to disambiguate some aliasing queries.
 | 
						||
 | 
						||
 | 
						||
Modeling Memory Effects
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
 | 
						||
#. Mark functions as readnone/readonly/argmemonly or noreturn/nounwind when
 | 
						||
   known.  The optimizer will try to infer these flags, but may not always be
 | 
						||
   able to.  Manual annotations are particularly important for external
 | 
						||
   functions that the optimizer can not analyze.
 | 
						||
 | 
						||
#. Use the lifetime.start/lifetime.end and invariant.start/invariant.end
 | 
						||
   intrinsics where possible.  Common profitable uses are for stack like data
 | 
						||
   structures (thus allowing dead store elimination) and for describing
 | 
						||
   life times of allocas (thus allowing smaller stack sizes).
 | 
						||
 | 
						||
#. Mark invariant locations using !invariant.load and TBAA's constant flags
 | 
						||
 | 
						||
Pass Ordering
 | 
						||
^^^^^^^^^^^^^
 | 
						||
 | 
						||
One of the most common mistakes made by new language frontend projects is to
 | 
						||
use the existing -O2 or -O3 pass pipelines as is.  These pass pipelines make a
 | 
						||
good starting point for an optimizing compiler for any language, but they have
 | 
						||
been carefully tuned for C and C++, not your target language.  You will almost
 | 
						||
certainly need to use a custom pass order to achieve optimal performance.  A
 | 
						||
couple specific suggestions:
 | 
						||
 | 
						||
#. For languages with numerous rarely executed guard conditions (e.g. null
 | 
						||
   checks, type checks, range checks) consider adding an extra execution or
 | 
						||
   two of LoopUnswitch and LICM to your pass order.  The standard pass order,
 | 
						||
   which is tuned for C and C++ applications, may not be sufficient to remove
 | 
						||
   all dischargeable checks from loops.
 | 
						||
 | 
						||
#. If your language uses range checks, consider using the IRCE pass.  It is not
 | 
						||
   currently part of the standard pass order.
 | 
						||
 | 
						||
#. A useful sanity check to run is to run your optimized IR back through the
 | 
						||
   -O2 pipeline again.  If you see noticeable improvement in the resulting IR,
 | 
						||
   you likely need to adjust your pass order.
 | 
						||
 | 
						||
 | 
						||
I Still Can't Find What I'm Looking For
 | 
						||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | 
						||
 | 
						||
If you didn't find what you were looking for above, consider proposing a piece
 | 
						||
of metadata which provides the optimization hint you need.  Such extensions are
 | 
						||
relatively common and are generally well received by the community.  You will
 | 
						||
need to ensure that your proposal is sufficiently general so that it benefits
 | 
						||
others if you wish to contribute it upstream.
 | 
						||
 | 
						||
You should also consider describing the problem you're facing on `llvm-dev
 | 
						||
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ and asking for advice.
 | 
						||
It's entirely possible someone has encountered your problem before and can
 | 
						||
give good advice.  If there are multiple interested parties, that also
 | 
						||
increases the chances that a metadata extension would be well received by the
 | 
						||
community as a whole.
 | 
						||
 | 
						||
Adding to this document
 | 
						||
=======================
 | 
						||
 | 
						||
If you run across a case that you feel deserves to be covered here, please send
 | 
						||
a patch to `llvm-commits
 | 
						||
<http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.
 | 
						||
 | 
						||
If you have questions on these items, please direct them to `llvm-dev
 | 
						||
<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_.  The more relevant
 | 
						||
context you are able to give to your question, the more likely it is to be
 | 
						||
answered.
 | 
						||
 |