416 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Markdown
		
	
	
	
			
		
		
	
	
			416 lines
		
	
	
		
			17 KiB
		
	
	
	
		
			Markdown
		
	
	
	
| # MLIR: The case for a simplified polyhedral form
 | |
| 
 | |
| MLIR embraces polyhedral compiler techniques for their many advantages
 | |
| representing and transforming dense numerical kernels, but it uses a form that
 | |
| differs significantly from other polyhedral frameworks.
 | |
| 
 | |
| **Disclaimer / Warning**
 | |
| 
 | |
| This document is a very early design proposal (which has since been accepted)
 | |
| that explored the tradeoffs of using this simplified form vs the traditional
 | |
| polyhedral schedule list form. At some point, this document could be dusted off
 | |
| and written as a proper academic paper, but until now, it is better to included
 | |
| it in this crafty form than not to. Beware that this document uses archaic
 | |
| syntax and should not be considered a canonical reference to modern MLIR.
 | |
| 
 | |
| ## Introduction
 | |
| 
 | |
| This document discusses general goals of the project, introduces context and the
 | |
| two alternatives, then talks about the tradeoffs of these designs. Written by
 | |
| Chris Lattner.
 | |
| 
 | |
| ## General goals of an IR, and goals of mlfunc's specifically
 | |
| 
 | |
| Our currently planned representation for MLIR consists of two kinds of
 | |
| functions: an LLVM-like "CFG Function" and an "ML Function": a function
 | |
| represented in multidimensional loop form. The idea is that a CFG function is
 | |
| capable of full generality for expressing arbitrary computation, but is awkward
 | |
| for loop transformations. In contrast, mlfunc's are limited (e.g. to control
 | |
| flow involving loop nests over affine spaces) but these limitations make it much
 | |
| easier to transform and analyze, particularly for the set of computations in a
 | |
| machine learning kernel.
 | |
| 
 | |
| The design of an intermediate representations is an optimization problem, which
 | |
| makes intentional tradeoffs that aim to make certain kinds of compiler
 | |
| transformations simple. After all, it is "possible" to do almost any
 | |
| transformation on any IR: we could theoretically do loop transformations on
 | |
| assembly language. OTOH, such transformations would take too long to write,
 | |
| would be fragile due to irrelevant changes, would be difficult to maintain, and
 | |
| difficult to make target independent. Performing transformations on the "right
 | |
| level" of IR makes it much easier to do analysis and transformation of code, and
 | |
| can make them faster by reducing the size of the IR, and eliminating
 | |
| possibilities that would have otherwise have to be considered.
 | |
| 
 | |
| This is the reason we're interested in adding polyhedral techniques to an IR in
 | |
| the first place: though our base "CFG function" representation is fully capable
 | |
| of expressing any computation, it is "too" expressive. The limitations imposed
 | |
| by polyhedral techniques (e.g. on affine loop bounds and array subscripts)
 | |
| define a closed algebra that can represent an interesting range of
 | |
| transformations and their compositions, and because of their simplicity, we can
 | |
| perform (e.g.) dependence analysis more efficiently and more reliably.
 | |
| 
 | |
| This raises an important question that this document examines: given we are
 | |
| introducing a redundant and limited way to express code and transformations,
 | |
| exactly what form is best to perform the analyses and transformations we want?
 | |
| 
 | |
| We explore two different design points that are capable of expressing the same
 | |
| class of affine loop computations, but which use different representational
 | |
| forms. These forms trade off verbosity, ease of transformation, and ease of
 | |
| analysis in interesting ways.
 | |
| 
 | |
| ## Context: Traditional Polyhedral Form
 | |
| 
 | |
| We started by discussing a representation that uses the traditional polyhedral
 | |
| schedule set + domain representation, e.g. consider C-like code like:
 | |
| 
 | |
| ```c
 | |
|   void simple_example(...) {
 | |
|     for (int i = 0; i < N; ++i) {
 | |
|       for (int j = 0; j < N; ++j) {
 | |
|          float tmp = X[i,j]    // S1
 | |
|          A[i,j] = tmp + 1      // S2
 | |
|          B[i,j] = tmp * 42     // S3
 | |
|        }
 | |
|     }
 | |
|   }
 | |
| ```
 | |
| 
 | |
| The polyhedral representation doesn't care about the actual computation, so we
 | |
| will abstract them into S1/S2/S3 in the discussion below. Originally, we planned
 | |
| to represent this with a classical form like (syntax details are not important
 | |
| and probably slightly incorrect below):
 | |
| 
 | |
| ```mlir
 | |
|   mlfunc @simple_example(... %N) {
 | |
|     %tmp = call @S1(%X, %i, %j)
 | |
|       domain: (0 <= %i < %N), (0 <= %j < %N)
 | |
|       schedule: (i, j, 0)
 | |
| 
 | |
|     call @S2(%tmp, %A, %i, %j)
 | |
|       domain: (0 <= %i < %N), (0 <= %j < %N)
 | |
|       schedule: (i, j, 1)
 | |
| 
 | |
|     call @S3(%tmp, %B, %i, %j)
 | |
|       domain: (0 <= %i < %N), (0 <= %j < %N)
 | |
|       schedule: (i, j, 2)
 | |
|   }
 | |
| ```
 | |
| 
 | |
| In this design, an mlfunc is an unordered bag of instructions whose execution
 | |
| order is fully controlled by their schedule.
 | |
| 
 | |
| However, we recently agreed that a more explicit schedule tree representation is
 | |
| a better fit for our needs, because it exposes important structure that will
 | |
| make analyses and optimizations more efficient, and also makes the scoping of
 | |
| SSA values more explicit. This leads us to a representation along the lines of:
 | |
| 
 | |
| ```mlir
 | |
|   mlfunc @simple_example(... %N) {
 | |
|     d0/d1 = mlspace
 | |
|     for S1(d0), S2(d0), S3(d0) {
 | |
|       for S1(d1), S2(d1), S3(d1) {
 | |
| 
 | |
|         %tmp = call @S1(%X, d0, d1)      ;; S1
 | |
|           domain: (0 <= d0 < %N), (0 <= d1 < %N)
 | |
| 
 | |
|         call @S2(%tmp, %A, d0, d1)      ;; S2
 | |
|           domain: (0 <= d0 < %N), (0 <= d1 < %N)
 | |
| 
 | |
|         call @S3(%tmp, %B, d0, d1)      ;; S3
 | |
|           domain: (0 <= d0 < %N), (0 <= d1 < %N)
 | |
|       }
 | |
|     }
 | |
|   }
 | |
| ```
 | |
| 
 | |
| This change makes the nesting structure of the loops an explicit part of the
 | |
| representation, and makes lexical ordering within a loop significant
 | |
| (eliminating the constant 0/1/2 of schedules).
 | |
| 
 | |
| It isn't obvious in the example above, but the representation allows for some
 | |
| interesting features, including the ability for instructions within a loop nest
 | |
| to have non-equal domains, like this - the second instruction ignores the outer
 | |
| 10 points inside the loop:
 | |
| 
 | |
| ```mlir
 | |
|   mlfunc @reduced_domain_example(... %N) {
 | |
|     d0/d1 = mlspace
 | |
|     for S1(d0), S2(d0) {
 | |
|       for S1(d1), S2(d1) {
 | |
|         %tmp = call @S1(%X, d0, d1)    ;; S1
 | |
|           domain: (0 <= d0 < %N), (0 <= d1 < %N)
 | |
| 
 | |
|         call @S2(%tmp, %A, d0, d1)      ;; S2
 | |
|           domain: (10 <= d0 < %N-10), (10 <= d1 < %N-10)
 | |
|       }
 | |
|     }
 | |
|   }
 | |
| ```
 | |
| 
 | |
| It also allows schedule remapping within the instruction, like this example that
 | |
| introduces a diagonal skew through a simple change to the schedules of the two
 | |
| instructions:
 | |
| 
 | |
| ```mlir
 | |
|   mlfunc @skewed_domain_example(... %N) {
 | |
|     d0/d1 = mlspace
 | |
|     for S1(d0), S2(d0+d1) {
 | |
|       for S1(d0+d1), S2(d1) {
 | |
|         %tmp = call @S1(%X, d0, d1)    ;; S1
 | |
|           domain: (0 <= d0 < %N), (0 <= d1 < %N)
 | |
| 
 | |
|         call @S2(%tmp, %A, d0, d1)      ;; S2
 | |
|           domain: (0 <= d0 < %N), (0 <= d1 < %N)
 | |
|       }
 | |
|     }
 | |
|   }
 | |
| ```
 | |
| 
 | |
| This form has great power, and the polyhedral code generator (which lowers from
 | |
| an mlfunc to a cfgfunc representation) handles this power so things that
 | |
| introduce loop transformations don't have to explicitly manipulate the looping
 | |
| structure.
 | |
| 
 | |
| ## Proposal: Simplified Polyhedral Form
 | |
| 
 | |
| This document proposes and explores the idea of going one step further, moving
 | |
| all of the domain and schedule information into the "schedule tree". In this
 | |
| form, we would have a representation where all instructions inside of a given
 | |
| for-loop are known to have the same domain, which is maintained by the loop. In
 | |
| the simplified form, we also have an "if" instruction that takes an affine
 | |
| condition.
 | |
| 
 | |
| Our simple example above would be represented as:
 | |
| 
 | |
| ```mlir
 | |
|   mlfunc @simple_example(... %N) {
 | |
|     affine.for %i = 0 ... %N step 1 {
 | |
|       affine.for %j = 0 ... %N step 1 {
 | |
|         // identity noop in this case, but can exist in general.
 | |
|         %0,%1 = affine.apply #57(%i, %j)
 | |
| 
 | |
|         %tmp = call @S1(%X, %0, %1)
 | |
| 
 | |
|         call @S2(%tmp, %A, %0, %1)
 | |
| 
 | |
|         call @S3(%tmp, %B, %0, %1)
 | |
|       }
 | |
|     }
 | |
|   }
 | |
| ```
 | |
| 
 | |
| The example with the reduced domain would be represented with an if instruction:
 | |
| 
 | |
| ```mlir
 | |
|   mlfunc @reduced_domain_example(... %N) {
 | |
|     affine.for %i = 0 ... %N step 1 {
 | |
|       affine.for %j = 0 ... %N step 1 {
 | |
|         // identity noop in this case, but can exist in general.
 | |
|         %0,%1 = affinecall #57(%i, %j)
 | |
| 
 | |
|         %tmp = call @S1(%X, %0, %1)
 | |
| 
 | |
|         if (10 <= %i < %N-10), (10 <= %j < %N-10) {
 | |
| 
 | |
|           %2,%3 = affine.apply(%i, %j)    // identity noop in this case
 | |
| 
 | |
|           call @S2(%tmp, %A, %2, %3)
 | |
|         }
 | |
|       }
 | |
|     }
 | |
|   }
 | |
| ```
 | |
| 
 | |
| These IRs represent exactly the same information, and use a similar information
 | |
| density. The 'traditional' form introduces an extra level of abstraction
 | |
| (schedules and domains) that make it easy to transform instructions at the
 | |
| expense of making it difficult to reason about how those instructions will come
 | |
| out after code generation. With the simplified form, transformations have to do
 | |
| parts of code generation inline with their transformation: instead of simply
 | |
| changing a schedule to **(i+j, j)** to get skewing, you'd have to generate this
 | |
| code explicitly (potentially implemented by making polyhedral codegen a library
 | |
| that transformations call into):
 | |
| 
 | |
| ```mlir
 | |
| mlfunc @skewed_domain_example(... %N) {
 | |
|   affine.for %t1 = 0 ... 2*N-2 step 1 {
 | |
|     affine.for %t2 = max(0, t1-N+1) ... min(N, t1) step 1 {
 | |
|       (%i, %j) = (%t1-%t2, %t2)
 | |
|       ...
 | |
|     }
 | |
|   }
 | |
| }
 | |
| ```
 | |
| 
 | |
| ## Evaluation
 | |
| 
 | |
| Both of these forms are capable of expressing the same class of computation:
 | |
| multidimensional loop nests with affine loop bounds and affine memory
 | |
| references. That said, they pose very different tradeoffs in other ways.
 | |
| 
 | |
| ### Commonality: can express same computation
 | |
| 
 | |
| Both of these can express the same sorts of computation, e.g. kernels written in
 | |
| one form are representable in the other form in all cases.
 | |
| 
 | |
| ### Commonality: dependence analysis
 | |
| 
 | |
| These representations both use affine functions for data layout mapping and
 | |
| access subscripts, and dependence analysis works the same way.
 | |
| 
 | |
| ### Commonality: difficulty of determining optimal transformation series
 | |
| 
 | |
| One major challenge in performance of optimization of this sort of code is
 | |
| choosing the ordering and behavior of various loop transformations that get
 | |
| applied. There are non-local effects of every decision, and neither
 | |
| representation helps solve this inherently hard problem.
 | |
| 
 | |
| ### Commonality: compactness of IR
 | |
| 
 | |
| In the cases that are most relevant to us (hyper rectangular spaces) these forms
 | |
| are directly equivalent: a traditional instruction with a limited domain (e.g.
 | |
| the "reduced_domain_example" above) ends up having one level of ML 'if' inside
 | |
| its loops. The simplified form pays for this by eliminating schedules and
 | |
| domains from the IR. Both forms allow code duplication to reduce dynamic
 | |
| branches in the IR: the traditional approach allows instruction splitting, the
 | |
| simplified form supports instruction duplication.
 | |
| 
 | |
| It is important to point out that the traditional form wins on compactness in
 | |
| the extreme cases: e.g. the loop skewing case. These cases will be rare in
 | |
| practice for our workloads, and are exactly the cases that downstream
 | |
| transformations want to be explicit about what they are doing.
 | |
| 
 | |
| ### Simplicity of code generation
 | |
| 
 | |
| A key final stage of an mlfunc is its conversion to a CFG function, which is
 | |
| required as part of lowering to the target machine. The simplified form has a
 | |
| clear advantage here: the IR has a direct correspondence to the structure of the
 | |
| generated code.
 | |
| 
 | |
| In contrast, the traditional form has significant complexity in the lowering
 | |
| process to a CFG function, because the verbosity not imbued in the IR needs to
 | |
| come out during code generation. Code generation from ISL shows that it is
 | |
| possible to do this, but it is a non-trivial transformation.
 | |
| 
 | |
| ### Ease of transformation
 | |
| 
 | |
| An advantage for the traditional form is that it is easier to perform certain
 | |
| transformations on it: skewing and tiling are just transformations on the
 | |
| schedule of the instructions in question, it doesn't require changing the loop
 | |
| structure.
 | |
| 
 | |
| In practice, the simplified form requires moving the complexity of code
 | |
| generation into the transformations themselves - this is sometimes trivial,
 | |
| sometimes involved. The author believes that this should be possible by making
 | |
| the code generation algorithms themselves be library functions that
 | |
| transformations call into, instead of an opaque block that happens at the end of
 | |
| the mlfunc processing.
 | |
| 
 | |
| Also, the sorts of transformations performed today by XLA (including tiling,
 | |
| padding, unrolling, and other rectangular transformations) should be easy enough
 | |
| to implement on either representation. The only cases that are a challenge are
 | |
| more advanced cases like skewing, e.g. for DMA data movement generation.
 | |
| 
 | |
| ### Ease of analysis: Cost models
 | |
| 
 | |
| The simplified form is much easier for analyses and transformations to build
 | |
| cost models for (e.g. answering the question of "how much code bloat will be
 | |
| caused by unrolling a loop at this level?"), because it is easier to predict
 | |
| what target code will be generated. With the traditional form, these analyses
 | |
| will have to anticipate what polyhedral codegen will do to a set of instructions
 | |
| under consideration: something that is non-trivial in the interesting cases in
 | |
| question (see "Cost of code generation").
 | |
| 
 | |
| ### Cost of code generation
 | |
| 
 | |
| State of the art polyhedral code generation is
 | |
| [expensive and complicated](https://lirias.kuleuven.be/bitstream/123456789/497238/1/toplas-astgen.pdf),
 | |
| sometimes exponential time complexity. We expect that most machine learning
 | |
| workloads will be hyper-rectangular, and thus it should be easy to specialize in
 | |
| important cases. That said, the traditional polyhedral representation makes it
 | |
| very easy to introduce complicated and expensive schedules, and provides no way
 | |
| to understand and project a cost model for using them. All downstream clients of
 | |
| the IR need to be prepared to handle the full generality of IR that may come to
 | |
| them.
 | |
| 
 | |
| The simplified form defines this away: the concepts in the IR remain simple, and
 | |
| the code much more directly reflects the cost model for lowering to CFG
 | |
| functions and machine code. This is expected to be very important in the late
 | |
| stages of a code generator for an accelerator.
 | |
| 
 | |
| ### SSA in ML Functions
 | |
| 
 | |
| We agree already that values defined in an mlfunc can include scalar values and
 | |
| they are defined based on traditional dominance. In the simplified form, this is
 | |
| very simple: arguments and induction variables defined in for-loops are live
 | |
| inside their lexical body, and linear series of instructions have the same "top
 | |
| down" dominance relation that a basic block does.
 | |
| 
 | |
| In the traditional form though, this is not the case: it seems that a lot of
 | |
| knowledge about how codegen will emit the code is necessary to determine if SSA
 | |
| form is correct or not. For example, this is invalid code:
 | |
| 
 | |
| ```mlir
 | |
|   %tmp = call @S1(%X, %0, %1)
 | |
|     domain: (10 <= %i < %N), (0 <= %j < %N)
 | |
|     schedule: (i, j)
 | |
| 
 | |
|   call @S2(%tmp, %A, %0, %1)
 | |
|     domain: (0 <= %i < %N), (0 <= %j < %N)
 | |
|     schedule: (i, j)
 | |
| ```
 | |
| 
 | |
| Because `%tmp` isn't defined on some iterations of the %i loop.
 | |
| 
 | |
| This matters because it makes the verifier more complicated, but more
 | |
| significantly, it means that load promotion and other optimizations that will
 | |
| produce SSA form will need to be aware of this and be able to model what codegen
 | |
| does.
 | |
| 
 | |
| An emergent property of this that we discussed recently is that PHI nodes in
 | |
| mlfunc's (if we support them) will also have to have domains.
 | |
| 
 | |
| ### Lack of redundancy in IR
 | |
| 
 | |
| The traditional form has multiple encodings for the same sorts of behavior: you
 | |
| end up having bits on `affine.for` loops to specify whether codegen should use
 | |
| "atomic/separate" policies, unroll loops, etc. Instructions can be split or can
 | |
| generate multiple copies of their instruction because of overlapping domains,
 | |
| etc.
 | |
| 
 | |
| This is a problem for analyses and cost models, because they each have to reason
 | |
| about these additional forms in the IR.
 | |
| 
 | |
| ### Suitability to purpose: lowering to machine code
 | |
| 
 | |
| One of the main drivers for this work is lowering to low-level accelerator code,
 | |
| including two-dimensional vectorization, insertion of DMAs, and other
 | |
| utilization of the matrix accelerator units. In the author's opinion, the extra
 | |
| compactness of the traditional form is a negative for this purpose: reasoning
 | |
| about the generated machine code will require understanding the mapping from
 | |
| mlfunc to lowered code, which means that it must understand what code generation
 | |
| will do.
 | |
| 
 | |
| In the simplified form, the effect of "code generation" is always obvious from
 | |
| the IR itself, which should make it easier to perform vectorization to target
 | |
| instructions and other analyses we need to perform.
 | |
| 
 | |
| ## Third Alternative: two different levels of mlfunc
 | |
| 
 | |
| One hybrid alternative is to support both the traditional and simplified forms
 | |
| of mlfunc in our IR.
 | |
| 
 | |
| The stages could look like this, for example:
 | |
| 
 | |
| 1.  Early performance transformations could be done on the traditional form.
 | |
| 1.  Partial code generation lowers to the simplified form
 | |
| 1.  Target specific lowering phases for tiling, and vectorization and other 2D
 | |
|     transforms that don't benefit much from the traditional form could be run.
 | |
| 1.  Final codegen to a cfg func can be done when all of the instructions are
 | |
|     replaced with ones valid on the target.
 | |
| 
 | |
| While this is possible, it isn't clear what would justify the complexity of this
 | |
| approach. Unless there is a super compelling reason for this, it would be nice
 | |
| to not do this. **Update:** we discussed this as a design team and agreed that
 | |
| this wouldn't be a good way to go.
 |