# Quantization

This document outlines the design of the MLIR quantization system. While the
term "quantization" is highly overloaded, in this case, it refers to a fairly
narrow scope of techniques in use to enable conversion of floating-point
computations to corresponding and plausible variants expressed in integer math
for inference, as has historically been supported by low-bit depth inference
engines such as TFLite, various accelerator hardware, and many DSPs.

Much of this is inspired by the approach taken
[in this paper](https://arxiv.org/abs/1712.05877) with many extensions and
adaptations folded in. It specifically documents the positions that MLIR has
taken on the topic, and is not a general reference.

[TOC]

## Uniform quantization

The primary quantization mechanism supported by MLIR is a scheme which can
express fixed point and affine transformations via uniformly spaced points on
the [Real](https://en.wikipedia.org/wiki/Real_number) number line.

Further, the scheme can be applied:

*   *per-layer* : Applying to every value within the target type.
*   *per-axis* (also called *per-channel*) : Applying individually to each index
    along a specific axis of a tensor type (the two are contrasted in the
    sketch after this list).

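To make the distinction concrete, the following is a minimal Python/NumPy
sketch (illustrative only; the tensor and variable names are assumptions, not
MLIR constructs) contrasting the two:

```python
import numpy as np

# A 2x3 "weight" tensor whose rows (axis 0) have very different ranges.
w = np.array([[0.01, -0.02, 0.03],
              [10.0, -20.0, 30.0]])

# Per-layer: a single scale chosen from the whole tensor's range.
per_layer_scale = np.abs(w).max() / 127.0
per_layer_q = np.round(w / per_layer_scale).astype(np.int8)

# Per-axis: one scale per index along axis 0; the small-valued row keeps
# far more precision than it does under the single per-layer scale.
per_axis_scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
per_axis_q = np.round(w / per_axis_scale).astype(np.int8)

print(per_layer_q[0])  # [0 0 0]: the small row collapses to zeros
print(per_axis_q[0])   # [ 42 -85 127]: per-axis preserves it
```
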
### Fixed point values

[Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are a
[Real](https://en.wikipedia.org/wiki/Real_number) number divided by a *scale*.
We will call the result of this division the *scaled value*.

$$ real\_value = scaled\_value * scale $$

The scale can be interpreted as the distance, in real units, between neighboring
scaled values. For example, if the scale is $$ \pi $$, then fixed point values
with this scale can only represent multiples of $$ \pi $$, and nothing in
between. The maximum rounding error to convert an arbitrary Real to a fixed
point value with a given $$ scale $$ is $$ \frac{scale}{2} $$. Continuing the
previous example, when $$ scale = \pi $$, the maximum rounding error will be
$$ \frac{\pi}{2} $$.

Multiplication can be performed on scaled values with different scales, using
the same algorithm as multiplication of real values (note that the product's
scaled value has $$ scale_{product} = scale_{left \mbox{ } operand} *
scale_{right \mbox{ } operand} $$). Addition can be performed on scaled values,
so long as they have the same scale, using the same algorithm as addition of
real values. This makes it convenient to represent scaled values on a computer
as signed integers, and perform arithmetic on those signed integers, because
the results will be correct scaled values.
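
As a minimal illustration of these rules (a Python sketch using plain integers,
not MLIR code; the values are arbitrary):

```python
# Two fixed point values with different scales.
a_scale, a_scaled = 0.5, 6      # represents 3.0
b_scale, b_scaled = 0.25, 8     # represents 2.0

# Multiplication: multiply the scaled integers; the product's scale is
# the product of the operand scales.
prod_scaled = a_scaled * b_scaled            # 48
prod_scale = a_scale * b_scale               # 0.125
assert prod_scaled * prod_scale == 3.0 * 2.0

# Addition: only valid when both operands share the same scale, so first
# re-express a (scale 0.5) in b's finer scale (0.25).
a_rescaled = a_scaled * 2                    # 12 at scale 0.25
sum_scaled = a_rescaled + b_scaled           # 20
assert sum_scaled * b_scale == 3.0 + 2.0
```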

### Affine values

Mathematically speaking, affine values are the result of
[adding a Real-valued *zero point* to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
Alternatively (and equivalently), subtracting a zero point from an affine value
results in a scaled value:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Essentially, affine values are a shift of the scaled values by some constant
amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
cannot, in general, be directly performed on affine values; they must first be
[converted](#affine-to-fixed-point) to the equivalent scaled values.

As alluded to above, the motivation for using affine values is to more
efficiently represent real values that will actually be encountered during
computation. Frequently, real values that will be encountered are not
symmetric around the real zero. We also make the assumption that the real zero
is encountered during computation, and should thus be represented.

In this case, it is inefficient to store scaled values represented by signed
integers, as some of the signed integers will never be used. In effect, the bit
patterns corresponding to those signed integers are going to waste.

In order to exactly represent the real zero with an integral-valued affine
value, the zero point must be an integer between the minimum and maximum affine
value (inclusive). For example, given an affine value represented by an 8 bit
unsigned integer, we have: $$ 0 \leq zero\_point \leq 255 $$. This is important,
because in convolution-like operations of deep neural networks, we frequently
need to zero-pad inputs and outputs, so zero must be exactly representable, or
the result will be biased.
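
For example, the following Python sketch derives a scale and an integral zero
point from an observed real range for an 8-bit unsigned target. The helper name
and its defaults are illustrative assumptions (in the spirit of the paper
referenced above), not an MLIR API:

```python
def choose_scale_and_zero_point(rmin, rmax, qmin=0, qmax=255):
    """Pick scale/zero_point so [rmin, rmax] maps into [qmin, qmax] and
    the real value 0.0 is exactly representable."""
    # Per the assumption above, the represented range must contain zero.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    # Rounding to an integral zero point lets 0.0 round-trip exactly.
    zero_point = round(qmin - rmin / scale)
    return scale, zero_point

scale, zero_point = choose_scale_and_zero_point(-1.0, 5.0)
q_of_zero = round(0.0 / scale) + zero_point      # quantize real 0.0
assert q_of_zero == zero_point                   # ... lands on the zero point
assert (q_of_zero - zero_point) * scale == 0.0   # ... and dequantizes exactly
```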

### Relation

Real values, fixed point values, and affine values relate through the following
equation, which demonstrates how to convert one type of number to another:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Note that computers generally store mathematical values using a finite number of
bits. Thus, while the above conversions are exact, to store the result in a
finite number of bits, we must, in general, round the result of the conversion
(this applies to both cases: storing using floating point and storing using
fixed point). Note that a full discussion of rounding behavior is outside the
scope of this document, and it is safe to assume unless otherwise stated that
rounding should be according to the IEEE754 default of RNE (round to nearest,
ties to even), where hardware permits.

### Converting between real and fixed point or affine

To convert a real value to a fixed point value, we must know the scale. To
convert a real value to an affine value, we must know the scale and the zero
point.

#### Real to affine

To convert an input tensor of real-valued elements (usually represented by a
floating point format, frequently
[Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
to a tensor of affine elements represented by an integral type (e.g. 8-bit
unsigned integer), the following conversion can be performed (note that it is
not required that all representable values of the integral type are used):

$$
\begin{align*}
af&fine\_value_{uint8 \, or \, uint16} \\
      &= clampToTargetSize(roundToNearestInteger( \frac{real\_value_{Single}}{scale_{Single}})_{sint32} + zero\_point_{uint8 \, or \, uint16})
\end{align*}
$$

In the above, we assume that $$real\_value$$ is a Single, $$scale$$ is a Single,
$$roundToNearestInteger$$ returns a signed 32-bit integer, and $$zero\_point$$
is an unsigned 8-bit or 16-bit integer. Note that the bit depths here are
indicative of common types on typical hardware; the scheme is not constrained
to particular bit depths, nor is it required that the entire range of an N-bit
integer be used.
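
A minimal scalar rendering of this conversion in Python (the function and
parameter names are illustrative assumptions, not an MLIR API; Python's
round() ties to even, matching the RNE default noted earlier):

```python
def quantize(real_value, scale, zero_point, qmin=0, qmax=255):
    """Real (float) -> affine (here uint8), per the formula above."""
    rounded = round(real_value / scale)   # roundToNearestInteger
    affine = rounded + zero_point         # still a wide (signed) integer
    return min(max(affine, qmin), qmax)   # clampToTargetSize

print(quantize(1.23, 0.1, 128))    # 140
print(quantize(100.0, 0.1, 128))   # 255 (clamped to the uint8 maximum)
```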

#### Affine to real

To convert an output tensor of affine elements represented by uint8
or uint16 to a tensor of real-valued elements (usually represented with a
floating point format, frequently Single precision), the following conversion
can be performed:

$$
\begin{align*}
re&al\_value_{Single} \\
      &= roundToNearestFloat((affine\_value_{uint8 \, or \, uint16} - zero\_point_{uint8 \, or \, uint16})_{sint32})_{Single} * scale_{Single}
\end{align*}
$$

In the above, we assume that the result of subtraction is in 32-bit signed
integer format, and that $$roundToNearestFloat$$ returns a Single.
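
And the reverse direction, again as an illustrative Python sketch rather than
an MLIR API (Python's arbitrary-precision int subtraction stands in for the
widening sint32 subtraction):

```python
def dequantize(affine_value, scale, zero_point):
    """Affine (integer) -> real (float), per the formula above."""
    return float(affine_value - zero_point) * scale

print(dequantize(140, 0.1, 128))   # ~1.2
print(dequantize(255, 0.1, 128))   # ~12.7, the top of the representable range
```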

#### Affine to fixed point

When the affine and fixed point scales are the same, subtract the zero point
from the affine value to get the equivalent fixed point value.

$$
scaled\_value = affine\_value_{non\mbox{-}negative} - zero\_point_{non\mbox{-}negative}
$$

#### Fixed point to affine

When the affine and fixed point scales are the same, add the zero point to the
fixed point value to get the equivalent affine value.

$$
affine\_value_{non\mbox{-}negative} = scaled\_value + zero\_point_{non\mbox{-}negative}
$$

## Usage within MLIR

There are several components to the quantization system being developed within
MLIR:

*   *Quantization* dialect containing:

    *   A family of [QuantizedTypes](#quantized-type) which represent the
        mapping between *expressed* values (typically of a floating point
        computer type) and *storage* values (typically of an integral computer
        type).
    *   [Type conversion ops](#quantized-type-conversion-ops) for converting
        between types based on a QuantizedType and its *expressed* and *storage*
        sub-types.
    *   [Instrumentation ops](#instrumentation-and-constraint-ops) for assigning
        instrumentation points within the computation where runtime statistics
        may help guide the quantization process.

*   [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)

*   [TFLite native quantization](#tflite-native-quantization)

    *   The TFLite op-set natively supports uniform-quantized variants.
    *   Passes and tools exist to convert directly from the *TensorFlow* dialect
        to the TFLite quantized operation set.

Not every application of quantization will use all of these facilities.
Specifically, the TensorFlow to TensorFlow Lite conversion uses the
QuantizedTypes but has its own operations for type conversion and expression of
the supporting math.

## Quantization Dialect

### Quantized type

TODO: Flesh this section out.

*   QuantizedType base class
*   UniformQuantizedType

### Quantized type conversion operations

*   qcast : Convert from an expressed type to QuantizedType
*   dcast : Convert from a QuantizedType to its expressed type
*   scast : Convert between a QuantizedType and its storage type

### Instrumentation and constraint operations

*   const_fake_quant : Emulates the logic of the historic TensorFlow
    fake_quant_with_min_max_args operation.
*   stats_ref : Declares that statistics should be gathered at this point with a
    unique key and made available to future passes of the solver.
*   stats : Declares inline statistics (per layer and per axis) for the point in
    the computation. stats_ref ops are generally converted to statistical
    operations once trial runs have been performed.
*   coupled_ref : Declares points in the computation to be coupled from a type
    inference perspective based on a unique key.

## Integration with simulated quantization at training time

TensorFlow has historically used the
[tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
family of operations to simulate the effect of quantization at training time.

As originally implemented, TensorFlow Lite was the primary user of such
operations at inference time. When quantized inference was enabled, if every
eligible tensor passed through an appropriate fake_quant node (the rules of
which tensors can have fake_quant applied are somewhat involved), then
TensorFlow Lite would use the attributes of the fake_quant operations to make a
judgment about how to convert to use kernels from its quantized operations
subset.

In MLIR-based quantization, fake_quant_\* operations are handled by converting
them to a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with
an appropriate *UniformQuantizedType* as the target of the qcast operation.

This allows subsequent compiler passes to preserve the knowledge that
quantization was simulated in a certain way, while giving the compiler
flexibility to move the casts as it simplifies the computation and converts it
to a form based on integral arithmetic.
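
Numerically, the effect of such a qcast/dcast pair can be sketched in Python
using the illustrative quantize and dequantize helpers defined earlier (this
models the simulation only, not the MLIR operations themselves):

```python
def fake_quant(real_value, scale, zero_point):
    """Quantize-then-dequantize round trip: the value stays in floating
    point, but only quantized-representable values survive."""
    return dequantize(quantize(real_value, scale, zero_point),
                      scale, zero_point)

print(fake_quant(1.23, 0.1, 128))   # ~1.2: snapped to the quantized grid
print(fake_quant(-50.0, 0.1, 128))  # ~-12.8: clamped at affine value 0
```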

This scheme also naturally allows computations that are *partially quantized*,
where the parts which could not be reduced to integral operations are still
carried out in floating point with appropriate conversions at the boundaries.

## TFLite native quantization

TODO: Flesh this out

### General algorithm

1.  Take input min/max information and set the ArrayInfo (which really is
    InputOrOutputArrayInfo).
1.  In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
    nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ -> tfl.Q
    -> tfl.DQ).
1.  Hardcoded logic/propagation needs to happen here.
1.  Run TF constant folding.
1.  In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
1.  Run a quantization pass that takes (tfl.DQ (for both input and weights) ->
    op -> tfl.Q) and replaces it with (op). Also replace (constant_float ->
    tfl.Q) with (constant_quant).
 |