297 lines
		
	
	
		
			15 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
			
		
		
	
	
			297 lines
		
	
	
		
			15 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
| =====================================
 | ||
| Performance Tips for Frontend Authors
 | ||
| =====================================
 | ||
| 
 | ||
| .. contents::
 | ||
|    :local:
 | ||
|    :depth: 2
 | ||
| 
 | ||
| Abstract
 | ||
| ========
 | ||
| 
 | ||
| The intended audience of this document is developers of language frontends 
 | ||
| targeting LLVM IR. This document is home to a collection of tips on how to 
 | ||
| generate IR that optimizes well.  
 | ||
| 
 | ||
| IR Best Practices
 | ||
| =================
 | ||
| 
 | ||
| As with any optimizer, LLVM has its strengths and weaknesses.  In some cases, 
 | ||
| surprisingly small changes in the source IR can have a large effect on the 
 | ||
| generated code.  
 | ||
| 
 | ||
| Beyond the specific items on the list below, it's worth noting that the most 
 | ||
| mature frontend for LLVM is Clang.  As a result, the further your IR gets from what Clang might emit, the less likely it is to be effectively optimized.  It 
 | ||
| can often be useful to write a quick C program with the semantics you're trying
 | ||
| to model and see what decisions Clang's IRGen makes about what IR to emit.  
 | ||
| Studying Clang's CodeGen directory can also be a good source of ideas.  Note 
 | ||
| that Clang and LLVM are explicitly version locked so you'll need to make sure 
 | ||
| you're using a Clang built from the same svn revision or release as the LLVM 
 | ||
| library you're using.  As always, it's *strongly* recommended that you track 
 | ||
| tip of tree development, particularly during bring up of a new project.
 | ||
| 
 | ||
| The Basics
 | ||
| ^^^^^^^^^^^
 | ||
| 
 | ||
| #. Make sure that your Modules contain both a data layout specification and 
 | ||
|    target triple. Without these pieces, non of the target specific optimization
 | ||
|    will be enabled.  This can have a major effect on the generated code quality.
 | ||
| 
 | ||
| #. For each function or global emitted, use the most private linkage type
 | ||
|    possible (private, internal or linkonce_odr preferably).  Doing so will 
 | ||
|    make LLVM's inter-procedural optimizations much more effective.
 | ||
| 
 | ||
| #. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
 | ||
|    of predecessors).  Among other issues, the register allocator is known to 
 | ||
|    perform badly with confronted with such structures.  The only exception to 
 | ||
|    this guidance is that a unified return block with high in-degree is fine.
 | ||
| 
 | ||
| Use of allocas
 | ||
| ^^^^^^^^^^^^^^
 | ||
| 
 | ||
| An alloca instruction can be used to represent a function scoped stack slot, 
 | ||
| but can also represent dynamic frame expansion.  When representing function 
 | ||
| scoped variables or locations, placing alloca instructions at the beginning of 
 | ||
| the entry block should be preferred.   In particular, place them before any 
 | ||
| call instructions. Call instructions might get inlined and replaced with 
 | ||
| multiple basic blocks. The end result is that a following alloca instruction 
 | ||
| would no longer be in the entry basic block afterward.
 | ||
| 
 | ||
| The SROA (Scalar Replacement Of Aggregates) and Mem2Reg passes only attempt
 | ||
| to eliminate alloca instructions that are in the entry basic block.  Given 
 | ||
| SSA is the canonical form expected by much of the optimizer; if allocas can 
 | ||
| not be eliminated by Mem2Reg or SROA, the optimizer is likely to be less 
 | ||
| effective than it could be.
 | ||
| 
 | ||
| Avoid loads and stores of large aggregate type
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| LLVM currently does not optimize well loads and stores of large :ref:`aggregate
 | ||
| types <t_aggregate>` (i.e. structs and arrays).  As an alternative, consider 
 | ||
| loading individual fields from memory.
 | ||
| 
 | ||
| Aggregates that are smaller than the largest (performant) load or store 
 | ||
| instruction supported by the targeted hardware are well supported.  These can 
 | ||
| be an effective way to represent collections of small packed fields.  
 | ||
| 
 | ||
| Prefer zext over sext when legal
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| On some architectures (X86_64 is one), sign extension can involve an extra 
 | ||
| instruction whereas zero extension can be folded into a load.  LLVM will try to
 | ||
| replace a sext with a zext when it can be proven safe, but if you have 
 | ||
| information in your source language about the range of a integer value, it can 
 | ||
| be profitable to use a zext rather than a sext.  
 | ||
| 
 | ||
| Alternatively, you can :ref:`specify the range of the value using metadata 
 | ||
| <range-metadata>` and LLVM can do the sext to zext conversion for you.
 | ||
| 
 | ||
| Zext GEP indices to machine register width
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| Internally, LLVM often promotes the width of GEP indices to machine register
 | ||
| width.  When it does so, it will default to using sign extension (sext) 
 | ||
| operations for safety.  If your source language provides information about 
 | ||
| the range of the index, you may wish to manually extend indices to machine 
 | ||
| register width using a zext instruction.
 | ||
| 
 | ||
| When to specify alignment
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| LLVM will always generate correct code if you don’t specify alignment, but may
 | ||
| generate inefficient code.  For example, if you are targeting MIPS (or older 
 | ||
| ARM ISAs) then the hardware does not handle unaligned loads and stores, and 
 | ||
| so you will enter a trap-and-emulate path if you do a load or store with 
 | ||
| lower-than-natural alignment.  To avoid this, LLVM will emit a slower 
 | ||
| sequence of loads, shifts and masks (or load-right + load-left on MIPS) for 
 | ||
| all cases where the load / store does not have a sufficiently high alignment 
 | ||
| in the IR.
 | ||
| 
 | ||
| The alignment is used to guarantee the alignment on allocas and globals, 
 | ||
| though in most cases this is unnecessary (most targets have a sufficiently 
 | ||
| high default alignment that they’ll be fine).  It is also used to provide a 
 | ||
| contract to the back end saying ‘either this load/store has this alignment, or
 | ||
| it is undefined behavior’.  This means that the back end is free to emit 
 | ||
| instructions that rely on that alignment (and mid-level optimizers are free to 
 | ||
| perform transforms that require that alignment).  For x86, it doesn’t make 
 | ||
| much difference, as almost all instructions are alignment-independent.  For 
 | ||
| MIPS, it can make a big difference.
 | ||
| 
 | ||
| Note that if your loads and stores are atomic, the backend will be unable to 
 | ||
| lower an under aligned access into a sequence of natively aligned accesses.  
 | ||
| As a result, alignment is mandatory for atomic loads and stores.
 | ||
| 
 | ||
| Other Things to Consider
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| #. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing 
 | ||
|    analysis), prefer GEPs
 | ||
| 
 | ||
| #. Prefer globals over inttoptr of a constant address - this gives you 
 | ||
|    dereferencability information.  In MCJIT, use getSymbolAddress to provide 
 | ||
|    actual address.
 | ||
| 
 | ||
| #. Be wary of ordered and atomic memory operations.  They are hard to optimize 
 | ||
|    and may not be well optimized by the current optimizer.  Depending on your
 | ||
|    source language, you may consider using fences instead.
 | ||
| 
 | ||
| #. If calling a function which is known to throw an exception (unwind), use 
 | ||
|    an invoke with a normal destination which contains an unreachable 
 | ||
|    instruction.  This form conveys to the optimizer that the call returns 
 | ||
|    abnormally.  For an invoke which neither returns normally or requires unwind
 | ||
|    code in the current function, you can use a noreturn call instruction if 
 | ||
|    desired.  This is generally not required because the optimizer will convert
 | ||
|    an invoke with an unreachable unwind destination to a call instruction.
 | ||
| 
 | ||
| #. Use profile metadata to indicate statically known cold paths, even if 
 | ||
|    dynamic profiling information is not available.  This can make a large 
 | ||
|    difference in code placement and thus the performance of tight loops.
 | ||
| 
 | ||
| #. When generating code for loops, try to avoid terminating the header block of
 | ||
|    the loop earlier than necessary.  If the terminator of the loop header 
 | ||
|    block is a loop exiting conditional branch, the effectiveness of LICM will
 | ||
|    be limited for loads not in the header.  (This is due to the fact that LLVM 
 | ||
|    may not know such a load is safe to speculatively execute and thus can't 
 | ||
|    lift an otherwise loop invariant load unless it can prove the exiting 
 | ||
|    condition is not taken.)  It can be profitable, in some cases, to emit such 
 | ||
|    instructions into the header even if they are not used along a rarely 
 | ||
|    executed path that exits the loop.  This guidance specifically does not 
 | ||
|    apply if the condition which terminates the loop header is itself invariant,
 | ||
|    or can be easily discharged by inspecting the loop index variables.
 | ||
| 
 | ||
| #. In hot loops, consider duplicating instructions from small basic blocks 
 | ||
|    which end in highly predictable terminators into their successor blocks.  
 | ||
|    If a hot successor block contains instructions which can be vectorized 
 | ||
|    with the duplicated ones, this can provide a noticeable throughput
 | ||
|    improvement.  Note that this is not always profitable and does involve a 
 | ||
|    potentially large increase in code size.
 | ||
| 
 | ||
| #. When checking a value against a constant, emit the check using a consistent
 | ||
|    comparison type.  The GVN pass *will* optimize redundant equalities even if
 | ||
|    the type of comparison is inverted, but GVN only runs late in the pipeline.
 | ||
|    As a result, you may miss the opportunity to run other important 
 | ||
|    optimizations.  Improvements to EarlyCSE to remove this issue are tracked in 
 | ||
|    Bug 23333.
 | ||
| 
 | ||
| #. Avoid using arithmetic intrinsics unless you are *required* by your source 
 | ||
|    language specification to emit a particular code sequence.  The optimizer 
 | ||
|    is quite good at reasoning about general control flow and arithmetic, it is
 | ||
|    not anywhere near as strong at reasoning about the various intrinsics.  If 
 | ||
|    profitable for code generation purposes, the optimizer will likely form the 
 | ||
|    intrinsics itself late in the optimization pipeline.  It is *very* rarely 
 | ||
|    profitable to emit these directly in the language frontend.  This item
 | ||
|    explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.
 | ||
| 
 | ||
| #. Avoid using the :ref:`assume intrinsic <int_assume>` until you've 
 | ||
|    established that a) there's no other way to express the given fact and b) 
 | ||
|    that fact is critical for optimization purposes.  Assumes are a great 
 | ||
|    prototyping mechanism, but they can have negative effects on both compile 
 | ||
|    time and optimization effectiveness.  The former is fixable with enough 
 | ||
|    effort, but the later is fairly fundamental to their designed purpose.
 | ||
| 
 | ||
| 
 | ||
| Describing Language Specific Properties
 | ||
| =======================================
 | ||
| 
 | ||
| When translating a source language to LLVM, finding ways to express concepts 
 | ||
| and guarantees available in your source language which are not natively 
 | ||
| provided by LLVM IR will greatly improve LLVM's ability to optimize your code. 
 | ||
| As an example, C/C++'s ability to mark every add as "no signed wrap (nsw)" goes
 | ||
| a long way to assisting the optimizer in reasoning about loop induction 
 | ||
| variables and thus generating more optimal code for loops.  
 | ||
| 
 | ||
| The LLVM LangRef includes a number of mechanisms for annotating the IR with 
 | ||
| additional semantic information.  It is *strongly* recommended that you become 
 | ||
| highly familiar with this document.  The list below is intended to highlight a 
 | ||
| couple of items of particular interest, but is by no means exhaustive.
 | ||
| 
 | ||
| Restricted Operation Semantics
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| #. Add nsw/nuw flags as appropriate.  Reasoning about overflow is 
 | ||
|    generally hard for an optimizer so providing these facts from the frontend 
 | ||
|    can be very impactful.  
 | ||
| 
 | ||
| #. Use fast-math flags on floating point operations if legal.  If you don't 
 | ||
|    need strict IEEE floating point semantics, there are a number of additional 
 | ||
|    optimizations that can be performed.  This can be highly impactful for 
 | ||
|    floating point intensive computations.
 | ||
| 
 | ||
| Describing Aliasing Properties
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| #. Add noalias/align/dereferenceable/nonnull to function arguments and return 
 | ||
|    values as appropriate
 | ||
| 
 | ||
| #. Use pointer aliasing metadata, especially tbaa metadata, to communicate 
 | ||
|    otherwise-non-deducible pointer aliasing facts
 | ||
| 
 | ||
| #. Use inbounds on geps.  This can help to disambiguate some aliasing queries.
 | ||
| 
 | ||
| 
 | ||
| Modeling Memory Effects
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| #. Mark functions as readnone/readonly/argmemonly or noreturn/nounwind when
 | ||
|    known.  The optimizer will try to infer these flags, but may not always be
 | ||
|    able to.  Manual annotations are particularly important for external 
 | ||
|    functions that the optimizer can not analyze.
 | ||
| 
 | ||
| #. Use the lifetime.start/lifetime.end and invariant.start/invariant.end 
 | ||
|    intrinsics where possible.  Common profitable uses are for stack like data 
 | ||
|    structures (thus allowing dead store elimination) and for describing 
 | ||
|    life times of allocas (thus allowing smaller stack sizes).  
 | ||
| 
 | ||
| #. Mark invariant locations using !invariant.load and TBAA's constant flags
 | ||
| 
 | ||
| Pass Ordering
 | ||
| ^^^^^^^^^^^^^
 | ||
| 
 | ||
| One of the most common mistakes made by new language frontend projects is to 
 | ||
| use the existing -O2 or -O3 pass pipelines as is.  These pass pipelines make a
 | ||
| good starting point for an optimizing compiler for any language, but they have 
 | ||
| been carefully tuned for C and C++, not your target language.  You will almost 
 | ||
| certainly need to use a custom pass order to achieve optimal performance.  A 
 | ||
| couple specific suggestions:
 | ||
| 
 | ||
| #. For languages with numerous rarely executed guard conditions (e.g. null 
 | ||
|    checks, type checks, range checks) consider adding an extra execution or 
 | ||
|    two of LoopUnswith and LICM to your pass order.  The standard pass order, 
 | ||
|    which is tuned for C and C++ applications, may not be sufficient to remove 
 | ||
|    all dischargeable checks from loops.
 | ||
| 
 | ||
| #. If you language uses range checks, consider using the IRCE pass.  It is not 
 | ||
|    currently part of the standard pass order.
 | ||
| 
 | ||
| #. A useful sanity check to run is to run your optimized IR back through the 
 | ||
|    -O2 pipeline again.  If you see noticeable improvement in the resulting IR, 
 | ||
|    you likely need to adjust your pass order.
 | ||
| 
 | ||
| 
 | ||
| I Still Can't Find What I'm Looking For
 | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 | ||
| 
 | ||
| If you didn't find what you were looking for above, consider proposing an piece
 | ||
| of metadata which provides the optimization hint you need.  Such extensions are
 | ||
| relatively common and are generally well received by the community.  You will 
 | ||
| need to ensure that your proposal is sufficiently general so that it benefits 
 | ||
| others if you wish to contribute it upstream.
 | ||
| 
 | ||
| You should also consider describing the problem you're facing on `llvm-dev 
 | ||
| <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ and asking for advice.  
 | ||
| It's entirely possible someone has encountered your problem before and can 
 | ||
| give good advice.  If there are multiple interested parties, that also 
 | ||
| increases the chances that a metadata extension would be well received by the
 | ||
| community as a whole.  
 | ||
| 
 | ||
| Adding to this document
 | ||
| =======================
 | ||
| 
 | ||
| If you run across a case that you feel deserves to be covered here, please send
 | ||
| a patch to `llvm-commits
 | ||
| <http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.
 | ||
| 
 | ||
| If you have questions on these items, please direct them to `llvm-dev 
 | ||
| <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_.  The more relevant 
 | ||
| context you are able to give to your question, the more likely it is to be 
 | ||
| answered.
 | ||
| 
 |