This refactors the implementation of block signature(type) conversion to not insert fake cast operations to perform the type conversion, but to instead create a new block containing the proper signature. This has the benefit of enabling the use of pre-computed analyses that rely on mapping values. It also leads to a much cleaner implementation overall. The major user facing change is that applySignatureConversion will now replace the entry block of the region, meaning that blocks generally shouldn't be cached over calls to applySignatureConversion.
PiperOrigin-RevId: 280226936
This also previously triggered the warning:
warning: missing field 'isRecursivelyLegal' initializer [-Wmissing-field-initializers]
legalOperations[op] = {action};
^
PiperOrigin-RevId: 279399175
A pattern rewriter hook, mergeBlock, is added that allows for merging the operations of one block into the end of another. This is used to support a canonicalization pattern for branch operations that folds the branch when the successor has a single predecessor(the branch block).
Example:
^bb0:
%c0_i32 = constant 0 : i32
br ^bb1(%c0_i32 : i32)
^bb1(%x : i32):
return %x : i32
becomes:
^bb0:
%c0_i32 = constant 0 : i32
return %c0_i32 : i32
PiperOrigin-RevId: 278677825
The current lowering of loops to GPU only supports lowering of loop
nests where the loops mapped to workgroups and workitems are perfectly
nested. Here a new lowering is added to handle lowering of imperfectly
nested loop body with the following properties
1) The loops partitioned to workgroups are perfectly nested.
2) The loop body of the inner most loop partitioned to workgroups can
contain one or more loop nests that are to be partitioned across
workitems. Each individual loops nests partitioned to workitems should
also be perfectly nested.
3) The number of workgroups and workitems are not deduced from the
loop bounds but are passed in by the caller of the lowering as values.
4) For statements within the perfectly nested loop nest partitioned
across workgroups that are not loops, it is valid to have all threads
execute that statement. This is NOT verified.
PiperOrigin-RevId: 277958868
Rewrite patterns may make modifications to the CFG, including dropping edges between blocks. This change adds a simple unreachable block elimination run at the end of each iteration to ensure that the CFG remains valid.
PiperOrigin-RevId: 277545805
When we removed a pattern, we removed it from worklist but not from
worklistMap. Then, when we tried to add a new pattern on the same Operation
again, the pattern wasn't added since it already existed in the
worklistMap (but not in the worklist).
Closestensorflow/mlir#211
PiperOrigin-RevId: 277319669
In some cases, it may be desirable to mark entire regions of operations as legal. This provides an additional granularity of context to the concept of "legal". The `ConversionTarget` supports marking operations, that were previously added as `Legal` or `Dynamic`, as `recursively` legal. Recursive legality means that if an operation instance is legal, either statically or dynamically, all of the operations nested within are also considered legal. An operation can be marked via `markOpRecursivelyLegal<>`:
```c++
ConversionTarget &target = ...;
/// The operation must first be marked as `Legal` or `Dynamic`.
target.addLegalOp<MyOp>(...);
target.addDynamicallyLegalOp<MySecondOp>(...);
/// Mark the operation as always recursively legal.
target.markOpRecursivelyLegal<MyOp>();
/// Mark optionally with a callback to allow selective marking.
target.markOpRecursivelyLegal<MyOp, MySecondOp>([](Operation *op) { ... });
/// Mark optionally with a callback to allow selective marking.
target.markOpRecursivelyLegal<MyOp>([](MyOp op) { ... });
```
PiperOrigin-RevId: 277086382
This allows for them to be used on other non-function, or even other function-like, operations. The algorithms are already generic, so this is simply changing the derived pass type. The majority of this change is just ensuring that the nesting of these passes remains the same, as the pass manager won't auto-nest them anymore.
PiperOrigin-RevId: 276573038
This allows mixing linalg operations with vector transfer operations (with additional modifications to affine ops) and is a step towards solving tensorflow/mlir#189.
PiperOrigin-RevId: 275543361
This Chapter now introduces and makes use of the Interface concept
in MLIR to demonstrate ShapeInference.
END_PUBLIC
Closestensorflow/mlir#191
PiperOrigin-RevId: 275085151
The current SignatureConversion framework (part of DialectConversion)
allows remapping input arguments to a function from 1->0, 1->1 or
1->many arguments during conversion. Another case is where the
argument itself is dropped, but it's use are remapped to another
Value*.
An example of this is: The Vulkan/SPIR-V spec requires entry functions
to be of type void(void). The GPU -> SPIR-V conversion implemented
this without having the DialectConversion framework track the
remapping that lead to some undefined behavior. The changes here
addresses that.
PiperOrigin-RevId: 275059656
b843cc5d5a introduced a new op LICM transformation and a LoopLike interface,
but missed the CMake aspects of it. This should fix the build.
PiperOrigin-RevId: 275038533
When dealing with regions, or other patterns that need to generate temporary operations, it is useful to be able to replace other operations than the root op being matched. Before this PR, these operations would still be considered for legalization meaning that the conversion would either fail, erroneously need to mark these ops as legal, or add unnecessary patterns.
PiperOrigin-RevId: 274598513
This will allow for inlining newly devirtualized calls, as well as give a more accurate cost model(when we have one). Currently canonicalization will only run for nodes that have no child edges, as the child nodes may be erased during canonicalization. We can support this in the future, but it requires more intricate deletion tracking.
PiperOrigin-RevId: 274011386
When an operation with regions gets replaced, we currently require that all of the remaining nested operations are still converted even though they are going to be replaced when the rewrite is finished. This cl adds a tracking for a minimal set of operations that are known to be "dead". This allows for ignoring the legalization of operations that are won't survive after conversion.
PiperOrigin-RevId: 274009003
This PR is a stepping stone towards supporting generic multi-store
source loop nests in affine loop fusion. It extends the algorithm to
support fusion of multi-store loop nests that:
1. have only one store that writes to a function-local live out, and
2. the remaining stores are involved in loop nest self dependences
or no dependences within the function.
Closestensorflow/mlir#162
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/162 from dcaballe:dcaballe/multi-output-fusion 7fb7dec6fe8b45f5ce176f018bfe37b256420c45
PiperOrigin-RevId: 273773907
This is similar to the `inlineRegionBefore` hook, except the original blocks are unchanged. The region to be cloned *must* not have been modified during the conversion process at the point of cloning, i.e. it must belong an operation that has yet to be converted, or the operation that is currently being converted.
PiperOrigin-RevId: 273622533
- bodies would earlier appear in the order (i, i+3, i+2, i+1) instead of
(i, i+1, i+2, i+3) for example for factor 4.
- clean up hardcoded test cases
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#170
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/170 from bondhugula:ujam b66b405b2b1894a03b376952e32a9d0292042665
PiperOrigin-RevId: 273613131
Some dialects have implicit conversions inherent in their modeling, meaning that a call may have a different type that the type that the callable expects. To support this, a hook is added to the dialect interface that allows for materializing conversion operations during inlining when there is a mismatch. A hook is also added to the callable interface to allow for introspecting the expected result types.
PiperOrigin-RevId: 272814379
This allows for the inliner to work on arbitrary call operations. The updated inliner will also work bottom-up through the callgraph enabling support for multiple levels of inlining.
PiperOrigin-RevId: 272813876
The generated build methods have result type before the arguments (operands and attributes, which are also now adjacent in the explicit create method). This also results in changing the create method's ordering to match most build method's ordering.
PiperOrigin-RevId: 271755054
- also remove stale terminology/references in docs
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#148
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/148 from bondhugula:cleanup e846b641a3c2936e874138aff480a23cdbf66591
PiperOrigin-RevId: 271618279
The strided MemRef RFC discusses a normalized descriptor and interaction with library calls (https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/MaL8m2nXuio).
Lowering of nested LLVM structs as value types does not play nicely with externally compiled C/C++ functions due to ABI issues.
Solving the ABI problem generally is a very complex problem and most likely involves taking
a dependence on clang that we do not want atm.
A simple workaround is to pass pointers to memref descriptors at function boundaries, which this CL implement.
PiperOrigin-RevId: 271591708
- fix store to load forwarding for a certain set of cases (where
forwarding shouldn't have happened); use AffineValueMap difference
based MemRefAccess equality checking; utility logic is also greatly
simplified
- add missing equality/inequality operators for AffineExpr ==/!= ints
- add == != operators on MemRefAccess
Closestensorflow/mlir#136
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/136 from bondhugula:store-load-forwarding d79fd1add8bcfbd9fa71d841a6a9905340dcd792
PiperOrigin-RevId: 270457011
computeDepth calls itself recursively, which may insert into minPatternDepth. minPatternDepth is a DenseMap, which invalidates iterators on insertion, so this may lead to asan failures.
PiperOrigin-RevId: 270374203
- allow symbols in index remapping provided for memref replacement
- fix memref normalize crash on cases with layout maps with symbols
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Reported by: Alex Zinenko
Closestensorflow/mlir#139
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/139 from bondhugula:memref-rep-symbols 2f48c1fdb5d4c58915bbddbd9f07b18541819233
PiperOrigin-RevId: 269851182
- add canonicalization pattern to compose maps into affine loads/stores;
templatize the pattern and reuse it for affine.apply as well
- rename getIndices -> getMapOperands() (getIndices is confusing since
these are no longer the indices themselves but operands to the map
whose results are the indices). This also makes the accessor uniform
across affine.apply/load/store. Change arg names on the affine
load/store builder to avoid confusion. Drop an unused confusing build
method on AffineStoreOp.
- update incomplete doc comment for canonicalizeMapAndOperands (this was
missed from a previous update).
Addresses issue tensorflow/mlir#121
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#122
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/122 from bondhugula:compose-load-store e71de1771e56a85c4282c10cb43f30cef0701c4f
PiperOrigin-RevId: 269619540
When performing A->B->C conversion, an operation may still refer to an operand of A. This makes it necessary to unmap through multiple levels of replacement for a specific value.
PiperOrigin-RevId: 269367859
- turn copy/dma generation method into a utility in LoopUtils, allowing
it to be reused elsewhere.
- no functional/logic change to the pass/utility
- trim down header includes in files affected
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#124
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/124 from bondhugula:datacopy 9f346e62e5bd9dd1986720a30a35f302eb4d3252
PiperOrigin-RevId: 269106088
- take care of symbolic operands with alloc
- add missing check for compose map failure and a test case
- add test cases on strides
- drop incorrect check for one-to-one'ness
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#132
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/132 from bondhugula:normalize-memrefs 8aebf285fb0d7c19269d85255aed644657e327b7
PiperOrigin-RevId: 269105947
* Add GraphTraits that treat a block as a graph, Operation* as node and use-relationship for edges;
- Just basic graph output;
* Add use iterator to iterate over all uses of an Operation;
* Add testing pass to generate op graph;
This does not support arbitrary operations other than function nor nested regions yet.
PiperOrigin-RevId: 268121782
This will allow clients to implement a different collection strategy on these
values, including collecting each uses within the region for example.
PiperOrigin-RevId: 267803978
This defines a set of initial utilities for inlining a region(or a FuncOp), and defines a simple inliner pass for testing purposes.
A new dialect interface is defined, DialectInlinerInterface, that allows for dialects to override hooks controlling inlining legality. The interface currently provides the following hooks, but these are just premilinary and should be changed/added to/modified as necessary:
* isLegalToInline
- Determine if a region can be inlined into one of this dialect, *or* if an operation of this dialect can be inlined into a given region.
* shouldAnalyzeRecursively
- Determine if an operation with regions should be analyzed recursively for legality. This allows for child operations to be closed off from the legality checks for operations like lambdas.
* handleTerminator
- Process a terminator that has been inlined.
This cl adds support for inlining StandardOps, but other dialects will be added in followups as necessary.
PiperOrigin-RevId: 267426759
- address remaining comments from PR tensorflow/mlir#87 for better test coverage for
pipeline-data-transfer/replaceAllMemRefUsesWith
- remove dead tag allocs the same way they are removed for the replaced buffers
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#106
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/106 from bondhugula:followup 9e868666d047e8d43e5f82f43e4093b838c710fa
PiperOrigin-RevId: 267144774
- introduce utility to convert memrefs with non-identity layout maps to
ones with identity layout maps: convert the type and rewrite/remap all
its uses
- add this utility to -simplify-affine-structures pass for testing
purposes
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#104
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/104 from bondhugula:memref-normalize f2c914aa1890e8860326c9e33f9aa160b3d65e6d
PiperOrigin-RevId: 266985317
- the [begin, end) range identified for copying could end in between the
block, which makes hoisting invalid in some cases. Change the range
identification to always end with end of block.
- add test case to exercise these (with fast mem capacity set to minimal so
that single element memref buffers are generated at the innermost loop)
- the location of begin/end of the block range for data copying was
being confused with the insert points for copy in and copy out code.
In cases, where we choose to hoist transfers, these are separate.
- when copy loops are single iteration ones, promote their bodies at
the end of the pass.
- change default fast mem space to 1 (setting it to zero made it
generate DMA op's that won't verify in the default case - since the
DMA ops have a check for src/dest memref spaces being different).
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Co-Authored-By: Mehdi Amini <joker.eph@gmail.com>
Closestensorflow/mlir#88
COPYBARA_INTEGRATE_REVIEW=https://github.com/tensorflow/mlir/pull/88 from bondhugula:datacopy 88697267c45e850c3ced87671e16e4a930c02a42
PiperOrigin-RevId: 266980911
This interface will allow for providing hooks to interrop with operation folding. The first hook, 'shouldMaterializeInto', will allow for controlling which region to insert materialized constants into. The folder will generally materialize constants into the top-level isolated region, this allows for materializing into a lower level ancestor region if it is more profitable/correct.
PiperOrigin-RevId: 266702972
This is done by providing a walk callback that returns a WalkResult. This result is either `advance` or `interrupt`. `advance` means that the walk should continue, whereas `interrupt` signals that the walk should stop immediately. An example is shown below:
auto result = op->walk([](Operation *op) {
if (some_invariant)
return WalkResult::interrupt();
return WalkResult::advance();
});
if (result.wasInterrupted())
...;
PiperOrigin-RevId: 266436700
This change refactors and cleans up the implementation of the operation walk methods. After this refactoring is that the explicit template parameter for the operation type is no longer needed for the explicit op walks. For example:
op->walk<AffineForOp>([](AffineForOp op) { ... });
is now accomplished via:
op->walk([](AffineForOp op) { ... });
PiperOrigin-RevId: 266209552
Refactor replaceAllMemRefUsesWith to split it into two methods: the new
method does the replacement on a single op, and is used by the existing
one.
- make the methods return LogicalResult instead of bool
- Earlier, when replacement failed (due to non-deferencing uses of the
memref), the set of ops that had already been processed would have
been replaced leaving the IR in an inconsistent state. Now, a
pass is made over all ops to first check for non-deferencing
uses, and then replacement is performed. No test cases were affected
because all clients of this method were first checking for
non-deferencing uses before calling this method (for other reasons).
This isn't true for a use case in another upcoming PR (scalar
replacement); clients can now bail out with consistent IR on failure
of replaceAllMemRefUsesWith. Add test case.
- multiple deferencing uses of the same memref in a single op is
possible (we have no such use cases/scenarios), and this has always
remained unsupported. Add an assertion for this.
- minor fix to another test pipeline-data-transfer case.
Signed-off-by: Uday Bondhugula <uday@polymagelabs.com>
Closestensorflow/mlir#87
PiperOrigin-RevId: 265808183
Switch to C++14 standard method as llvm::make_unique has been removed (
https://reviews.llvm.org/D66259). Also mark some targets as c++14 to ease next
integrates.
PiperOrigin-RevId: 263953918
Often we want to ensure that block arguments are converted before operations that use them. This refactors the current implementation to be cleaner/less frequent by triggering conversion when a set of blocks are moved/inlined; or when legalization is successful.
PiperOrigin-RevId: 263795005
Since raw pointers are always passed around for IR construct without
implying any ownership transfer, it can be error prone to have implicit
ownership transferred the same way.
For example this code can seem harmless:
Pass *pass = ....
pm.addPass(pass);
pm.addPass(pass);
pm.run(module);
PiperOrigin-RevId: 263053082
There are currently several different terms used to refer to a parent IR unit in 'get' methods: getParent/getEnclosing/getContaining. This cl standardizes all of these methods to use 'getParent*'.
PiperOrigin-RevId: 262680287
This will allow for reusing the same pattern list, which may be costly to continually reconstruct, on multiple invocations.
PiperOrigin-RevId: 262664599
This CL is step 2/n towards building a simple, programmable and portable vector abstraction in MLIR that can go all the way down to generating assembly vector code via LLVM's opt and llc tools.
This CL adds the vector.extractelement operation to the MLIR vector dialect as well as the appropriate roundtrip test. Lowering to LLVM will occur in the following CL.
PiperOrigin-RevId: 262545089
These methods will allow replacing the uses of results with an existing operation, with the same number of results, or a range of values. This removes a number of hand-rolled result replacement loops and simplifies replacement for operations with multiple results.
PiperOrigin-RevId: 262206600
This CL modifies the LowerLinalgToLoopsPass to use RewritePattern.
This will make it easier to inline Linalg generic functions and regions when emitting to loops in a subsequent CL.
PiperOrigin-RevId: 261894120
This allows for proper forward declaration, as opposed to leaking the internal implementation via a using directive. This also allows for all pattern building to go through 'insert' methods on the OwningRewritePatternList, replacing uses of 'push_back' and 'RewriteListBuilder'.
PiperOrigin-RevId: 261816316
When inlining the declaration for llvm::DenseSet/DenseMap in the mlir
namespace from a forward declaration, clang does not take the default
for the template parameters if their are declared later.
namespace llvm {
template<typename Foo>
class DenseMap;
}
namespace mlir {
using llvm::DenseMap;
}
namespace llvm {
template<typename Foo = int>
class DenseMap {};
}
namespace mlir {
DenseMap<> map;
}
PiperOrigin-RevId: 261495612
AffineDataCopyGeneration pass relied on command line flags for internal logic
in several places, which makes it unusable in a library context (i.e. outside a
standalone mlir-opt binary that does the command line parsing). Define
configuration flags in the constructor instead, and set them up to command
line-based defaults to maintain the original behavior.
PiperOrigin-RevId: 261322364
Clipping creates non-affine memory accesses, use std_load and std_store instead of affine_load and affine_store.
In the future we may also want a fill with the neutral element rather than clip, this would make the accesses affine if we wanted more analyses and transformations to happen post lowering to pointwise copies.
PiperOrigin-RevId: 260110503
This mode analyzes which operations are legalizable to the given target if a conversion were to be applied, i.e. no rewrites are ever performed even on success. This mode is useful for device partitioning or other utilities that may want to analyze the effect of conversion to different targets before performing it.
The analysis method currently just fills a provided set with the operations that were found to be legalizable. This can be extended in the future to capture more information as necessary.
PiperOrigin-RevId: 259987105
This cl enforces that the conversion of the type signatures for regions, and thus their entry blocks, is handled via ConversionPatterns. A new hook 'applySignatureConversion' is added to the ConversionPatternRewriter to perform the desired conversion on a region. This also means that the handling of rewriting the signature of a FuncOp is moved to a pattern. A default implementation is provided via 'mlir::populateFuncOpTypeConversionPattern'. This removes the hacky implicit 'dynamically legal' status of FuncOp that was present previously, and leaves it up to the user to decide when/how to convert the signature of a function.
PiperOrigin-RevId: 259161999
This CL introduces a simple loop utility function which rewrites the bounds and step of a loop so as to become mappable on a regular grid of processors whose identifiers are given by SSA values.
A corresponding unit test is added.
For example, using CUDA terminology, and assuming a 2-d grid with processorIds = [blockIdx.x, threadIdx.x] and numProcessors = [gridDim.x, blockDim.x], the loop:
```
loop.for %i = %lb to %ub step %step {
...
}
```
is rewritten into a version resembling the following pseudo-IR:
```
loop.for %i = %lb + threadIdx.x + blockIdx.x * blockDim.x to %ub
step %gridDim.x * blockDim.x {
...
}
```
PiperOrigin-RevId: 258945942
This CL adapts the recently introduced parametric tiling to have an API matching the tiling
of AffineForOp. The transformation using stripmineSink is more general and produces imperfectly nested loops.
Perfect nesting invariants of the tiled version are obtained by selectively applying hoisting of ops to isolate perfectly nested bands. Such hoisting may fail to produce a perfect loop nest in cases where ForOp transitively depend on enclosing induction variables. In such cases, the API provides a LogicalResult return but the SimpleParametricLoopTilingPass does not currently use this result.
A new unit test is added with a triangular loop for which the perfect nesting property does not hold. For this example, the old behavior was to produce IR that did not verify (some use was not dominated by its def).
PiperOrigin-RevId: 258928309
This allows for providing specific handling for dynamically legal operations/dialects without overriding the general 'isDynamicallyLegal' hook. This also means that a derived ConversionTarget class need not always be defined when some operations are dynamically legal.
Example usage:
ConversionTarget target(...);
target.addDynamicallyLegalOp<ReturnOp>([](ReturnOp op) {
return ...
};
target.addDynamicallyLegalDialect<StandardOpsDialect>([](Operation *op) {
return ...
};
PiperOrigin-RevId: 258884753
This specific PatternRewriter will allow for exposing hooks in the future that are only useful for the conversion framework, e.g. type conversions.
PiperOrigin-RevId: 258818122
This cl begins a large refactoring over how signature types are converted in the DialectConversion infrastructure. The signatures of blocks are now converted on-demand when an operation held by that block is being converted. This allows for handling the case where a region is created as part of a pattern, something that wasn't possible previously.
This cl also generalizes the region signature conversion used by FuncOp to work on any region of any operation. This generalization allows for removing the 'apply*Conversion' functions that were specific to FuncOp/ModuleOp. The implementation currently uses a new hook on TypeConverter, 'convertRegionSignature', but this should ideally be removed in favor of using Patterns. That depends on adding support to the PatternRewriter used by ConversionPattern to allow applying signature conversions to regions, which should be coming in a followup.
PiperOrigin-RevId: 258645733
This explicit tag is useful is several ways:
*) This simplifies how to mark sub sections of a dialect as explicitly unsupported, e.g. my target supports all operations in the foo dialect except for these select few. This is useful for partial lowerings between dialects.
*) Partial conversions will now verify that operations that were explicitly marked as illegal must be converted. This provides some guarantee that the operations that need to be lowered by a specific pass will be.
PiperOrigin-RevId: 258582879
Users generally want several different modes of conversion. This cl refactors DialectConversion to provide two:
* Partial (applyPartialConversion)
- This mode allows for illegal operations to exist in the IR, and does not fail if an operation fails to be legalized.
* Full (applyFullConversion)
- This mode fails if any operation is not properly legalized to the conversion target. This allows for ensuring that the IR after a conversion only contains operations legal for the target.
PiperOrigin-RevId: 258412243
These methods don't compose well with the rest of conversion framework, and create artificial breaks in conversion. Replace these methods with two(populateAffineToStdConversionPatterns and populateLoopToStdConversionPatterns respectively) that populate a list of patterns to perform the same behavior.
PiperOrigin-RevId: 258219277
This cl changes the way that operations/blocks to convert are collected/traversed so that parent region operations can be legalized before their bodies. Most RewritePatterns for region operations assume that the entry arguments to each region are yet to be converted. Given that the bodies are currently converted first, this makes it difficult to fit these patterns into the same run as one converting types.
The operations/blocks to convert are now collected before any legalization has run, which simplifies the conversion logic itself, as legalization may insert new operations, move blocks, etc.
PiperOrigin-RevId: 258170158
ConversionPattern should ideally only be used when the types of the operands are changing, which in this case they aren't. Using OpRewritePattern also lends to much simpler code.
PiperOrigin-RevId: 258158474
Multiple (perfectly) nested loops with independent bounds can be combined into
a single loop and than subdivided into blocks of arbitrary size for load
balancing or more efficient parallelism exploitation. However, MLIR wants to
preserve the multi-dimensional multi-loop structure at higher levels of
abstraction. Introduce a transformation that coalesces nested loops with
independent bounds so that they can be further subdivided by tiling.
PiperOrigin-RevId: 258151016
These ops should not belong to the std dialect.
This CL extracts them in their own dialect and updates the corresponding conversions and tests.
PiperOrigin-RevId: 258123853
When using a RewritePattern and replacing an operation with an existing value, that value may have already been replaced by something else. This cl ensures that only the final value is used when applying rewrites.
PiperOrigin-RevId: 258058488
This field wasn't updated as the insertion point changed, making it potentially dangerous given the multi-level of MLIR(e.g. 'createBlock' would always insert the new block in 'region'). This also allows for building an OpBuilder with just a context.
PiperOrigin-RevId: 257829135
The GreedyPatternRewriteDriver currently does not notify the OperationFolder when constants are removed as part of a pattern match. This materializes in a nasty bug where a different operation may be allocated to the same address. This causes an assertion in the OperationFolder when it gets notified of the new operations removal.
PiperOrigin-RevId: 257817627
This CL splits the lowering of affine to LLVM into 2 parts:
1. affine -> std
2. std -> LLVM
The conversions mostly consists of splitting concerns between the affine and non-affine worlds from existing conversions.
Short-circuiting of affine `if` conditions was never tested or exercised and is removed in the process, it can be reintroduced later if needed.
LoopParametricTiling.cpp is updated to reflect the newly added ForOp::build.
PiperOrigin-RevId: 257794436
This allows for the attribute to hold symbolic references to other operations than FuncOp. This also allows for removing the dependence on FuncOp from the base Builder.
PiperOrigin-RevId: 257650017
Standard load and store operations are evolving to be separated from the Affine
constructs. Special affine.load/store have been introduced to uphold the
restrictions of the Affine control flow constructs on their operands.
EDSC-produced loads and stores were originally intended to uphold those
restrictions as well so they should use affine.load/store instead of
std.load/store.
PiperOrigin-RevId: 257443307
Parametric tiling can be used to extract outer loops with fixed number of
iterations. This in turn enables mapping to GPU kernels on a fixed grid
independently of the range of the original loops, which may be unknown
statically, making the kernel adaptable to different sizes. Provide a utility
function that also computes the parametric tile size given the range of the
loop. Exercise the utility function through a simple pass that applies it to
all top-level loop nests. Permutability or parallelism checks must be
performed before calling this utility function in actual passes.
Note that parametric tiling cannot be implemented in a purely affine way,
although it can be encoded using semi-affine maps. The choice to implement it
on standard loops is guided by them being the common representation between
Affine loops, Linalg and GPU kernels.
PiperOrigin-RevId: 257180251
Modules can now contain more than just Functions, this just updates the iteration API to reflect that. The 'begin'/'end' methods have also been updated to iterate over opaque Operations.
PiperOrigin-RevId: 257099084
These methods assume that a function is a valid builtin top-level operation, and removing these methods allows for decoupling FuncOp and IR/. Utility "getParentOfType" methods have been added to Operation/OpState to allow for querying the first parent operation of a given type.
PiperOrigin-RevId: 257018913
In most places, this is just a name change (with the exception of affine.dma_start swapping the operand positions of its tag memref and num_elements operands).
Significant code changes occur here:
*) Vectorization: LoopAnalysis.cpp, Vectorize.cpp
*) Affine Transforms: Transforms/Utils/Utils.cpp
PiperOrigin-RevId: 256395088
As with Functions, Module will soon become an operation, which are value-typed. This eases the transition from Module to ModuleOp. A new class, OwningModuleRef is provided to allow for owning a reference to a Module, and will auto-delete the held module on destruction.
PiperOrigin-RevId: 256196193
Move the data members out of Function and into a new impl storage class 'FunctionStorage'. This allows for Function to become value typed, which will greatly simplify the transition of Function to FuncOp(given that FuncOp is also value typed).
PiperOrigin-RevId: 255983022
Type conversion does not necessarily affect all types, some of them may remain
untouched. The type conversion tool from the dialect conversion framework will
unconditionally insert a temporary cast operation from the type to itself
anyway, and will try to materialize it to a real conversion operation if there
are remaining uses. Simply use the original value instead.
PiperOrigin-RevId: 255975450
During conversion, if a type conversion has dangling uses a type conversion must persist after conversion has finished to maintain valid IR. In these cases, we now query the TypeConverter to materialize a conversion for us. This allows for the default case of a full conversion to continue working as expected, but also handle the degenerate cases more robustly.
PiperOrigin-RevId: 255637171
Now that Locations are attributes, they have direct access to the MLIR context. This allows for simplifying error emission by removing unnecessary context lookups.
PiperOrigin-RevId: 255112791
The OperationFolder currently just inserts into the entry block of a Function, but regions may be isolated above, i.e. explicit capture only, and blindly inserting constants may break the invariants of these regions.
PiperOrigin-RevId: 254987796
Now that Locations are Attributes they contain a direct reference to the MLIRContext, i.e. the context can be directly accessed from the given location instead of being explicitly passed in.
PiperOrigin-RevId: 254568329
* Support for 1->0 type mappings, i.e. when the argument is being removed.
* Reordering types when converting a type signature.
* Adding new inputs when converting a type signature.
This cl also lays down the initial foundation for supporting 1->N type mappings, but full support will come in a followup.
Moving forward, function signature changes will be driven by populating a SignatureConversion instance. This class contains all of the necessary information for adding/removing/remapping function signatures; e.g. addInputs, addResults, remapInputs, etc.
PiperOrigin-RevId: 254064665
Arguably, this function is only useful for transformations and should not
pollute the main IR. Also make sure it accepts a the resulting container
by-reference instead of returning it.
PiperOrigin-RevId: 253622981
This converts entire loops into threads/blocks. No check on the size of the
block or grid, or on the validity of parallelization is performed, it is under
the responsibility of the caller to strip-mine the loops and to perform the
dependence analysis before calling the conversion.
PiperOrigin-RevId: 253189268
1) Lowest minimum pattern stack depth when legalizing.
- This leads the system to favor patterns that have lower legalization stacks, i.e. represent a more direct mapping to the target.
2) Pattern benefit.
- When considering multiple patterns with the same legalization depth, this favors patterns with a larger specified benefit.
PiperOrigin-RevId: 252713470
This introduces the support for region-containing operations to the dialect
conversion framework in order to support the conversion of affine control-flow
operations into the standard control flow with branches. Regions that belong
to an operation are converted before the operation itself. The
DialectConversionPattern can therefore access the converted regions of the
original operation and process them further if necessary. In particular, the
conversion is allowed to move the blocks from the original region to other
regions and to split blocks into multiple blocks. All block manipulations must
be performed through the PatternRewriter to ensure they will be undone if the
conversion fails.
Port the pass converting from the affine dialect (loops and ifs with bodies as
regions) to the standard dialect (branch-based cfg) to use DialectConversion in
order to exercise this new functionality. The modification to the lowering
functions are minor and are focused on using the PatterRewriter instead of
directly modifying the IR.
PiperOrigin-RevId: 252625169
- added a typed walk to Block (matching the equivalent on Function)
- added token parsers (incl optional variants) for : and (
- added applyConversionPatterns that takes a list of functions to apply patterns to
PiperOrigin-RevId: 251481608
To accomplish this, moving forward users will need to provide a legalization target that defines what operations are legal for the conversion. A target can mark an operation as legal by providing a specific legalization action. The initial actions are:
* Legal
- This action signals that every instance of the given operation is legal,
i.e. any combination of attributes, operands, types, etc. is valid.
* Dynamic
- This action signals that only some instances of a given operation are legal. This
allows for defining fine-tune constraints, like say std.add is only legal when
operating on 32-bit integers.
An example target is shown below:
struct MyTarget : public ConversionTarget {
MyTarget(MLIRContext &ctx) : ConversionTarget(ctx) {
// All operations in the LLVM dialect are legal.
addLegalDialect<LLVMDialect>();
// std.constant op is always legal on this target.
addLegalOp<ConstantOp>();
// std.return op has dynamic legality constraints.
addDynamicallyLegalOp<ReturnOp>();
}
/// Implement the custom legalization handler to handle
/// std.return.
bool isLegal(Operation *op) override {
// Process the dynamic handling for a std.return op (and any others that were
// marked "dynamic").
...
}
};
PiperOrigin-RevId: 251289374
* the 'empty' method should be used to check for emptiness instead of 'size'
* using decl 'CapturableHandle' is unused
* redundant get() call on smart pointer
* using decl 'apply' is unused
* using decl 'ScopeGuard' is unused
--
PiperOrigin-RevId: 250623863
*) Factors slice union computation out of LoopFusion into Analysis/Utils (where other iteration slice utilities exist).
*) Generalizes slice union computation to take the union of slices computed on all loads/stores pairs between source and destination loop nests.
*) Fixes a bug in FlatAffineConstraints::addSliceBounds where redundant constraints were added.
*) Takes care of a TODO to expose FlatAffineConstraints::mergeAndAlignIds as a public method.
--
PiperOrigin-RevId: 250561529
Fix Block::splitBlock and Block::eraseFromFunction that erronously assume
blocks belong to functions. They now belong to regions. When splitting, new
blocks should be created in the same region as the existing block. When
erasing a block, it should be removed from the region rather than from the
function body that transitively contains the region.
Also rename Block::eraseFromFunction to Block::erase for consistency with other
IR containers.
--
PiperOrigin-RevId: 250278272
The lowering from the Affine dialect to the Standard dialect was originally
implemented as a standalone pass. However, it may be used by other passes
willing to lower away some of the affine constructs as a part of their
operation. Decouple the transformation functions from the pass infrastructure
and expose the entry point for the lowering.
Also update the lowering functions to use `LogicalResult` instead of bool for
return values.
--
PiperOrigin-RevId: 250229198
*) Adds LoopFusionUtils which will expose a set of loop fusion utilities (e.g. dependence checks, fusion cost/storage reduction, loop fusion transformation) for use by loop fusion algorithms. Support for checking block-level fusion-preventing dependences is added in this CL (additional loop fusion utilities will be added in subsequent CLs).
*) Adds TestLoopFusion test pass for testing LoopFusionUtils at a fine granularity.
*) Adds unit test for testing dependence check for block-level fusion-preventing dependences.
--
PiperOrigin-RevId: 249861071
Originally, FunctionConverter::convertRegion in the DialectConversion framework
was implemented as a function template because it was creating a new region in
the parent object, which could have been an op or a function. Since
DialectConversion now operates in place, new region is no longer created so
there is no need for convertRegion to be aware of the parent, only of the error
reporting location.
--
PiperOrigin-RevId: 249826392
* There is no longer a need to explicitly remap function attrs.
- This removes a potentially expensive call from the destructor of Function.
- This will enable some interprocedural transformations to now run intraprocedurally.
- This wasn't scalable and forces dialect defined attributes to override
a virtual function.
* Replacing a function is now a trivial operation.
* This is a necessary first step to representing functions as operations.
--
PiperOrigin-RevId: 249510802
Using ArrayRef introduces issues with the order of evaluation between a constructor and
the arguments of the subsequent calls to the `operator()`.
As a consequence the order of captures is not well-defined can go wrong with certain compilers (e.g. gcc-6.4).
This CL fixes the issue by using lambdas in lieu of ArrayRef.
--
PiperOrigin-RevId: 249114775
The converter now works by inserting fake producer operations when replacing the results of an existing operation with values of a different, now legal, type. These fake operations are guaranteed to never escape the converter.
--
PiperOrigin-RevId: 248969130
generates remarks for testing, it isn't itself a transformation.
While there, upgrade its diagnostic emission to use the streaming interface.
Prune some unnecessary #includes.
--
PiperOrigin-RevId: 247768062
The string was referenced but not captured in the lambda, which causes
a failure when compiling with MSVC.
This issue was discovered by @loic-joly-sonarsource with a proposed fix
in https://github.com/tensorflow/mlir/pull/22.
--
PiperOrigin-RevId: 247085897
Trying to activate both LLVM and MLIR passes in mlir-cpu-runner showed name collisions when registering pass names.
One possible way of disambiguating that should also work across dialects is to prepend the dialect name to the passes that specifically operate on that dialect.
With this CL, mlir-cpu-runner tests still run when both LLVM and MLIR passes are registered
--
PiperOrigin-RevId: 246539917
This CL implements the previously unsupported parsing for Range, View and Slice operations.
A pass is introduced to lower to the LLVM.
Tests are moved out of C++ land and into mlir/test/Examples.
This allows better fitting within standard developer workflows.
--
PiperOrigin-RevId: 245796600
During the pattern rewrite, if the function is changed, i.e. ops created,
deleted or swapped, the pattern rewriter needs to re-scan the function entirely
and apply the patterns again, so the patterns whose root ops have been popped
out from the working list nor an immediate users of the changed ops can be
reconsidered.
A command line flag is added to set the max number of iterations rescanning the
function for pattern match. If the rewrite doesn' converge after this number,
this compiling will continue and the result can be sub-optimal.
One unit test is updated because this change fixed the missing optimization opportunities.
--
PiperOrigin-RevId: 244754190
* dyn_cast_or_null
- This will first check if the operation is null before trying to 'dyn_cast':
Value *v = ...;
if (auto forOp = dyn_cast_or_null<AffineForOp>(v->getDefiningOp()))
...
* isa_nonnull
- This will first check if the pointer is null before trying to 'isa':
Value *v = ...;
if (isa_nonnull<AffineForOp>(v->getDefiningOp());
...
--
PiperOrigin-RevId: 242171343
Note: This now means that we cannot fold chains of operations, i.e. where constant foldable operations feed into each other. Given that this is a testing pass solely for constant folding, this isn't really something that we want anyways. Constant fold tests should be simple and direct, with more advanced folding/feeding being tested with the canonicalizer.
--
PiperOrigin-RevId: 242011744
There are two places containing constant folding logic right now: the ConstantFold
pass and the GreedyPatternRewriteDriver. The logic was not shared and started to
drift apart. We were testing constant folding logic using the ConstantFold pass,
but lagged behind the GreedyPatternRewriteDriver, where we really want the constant
folding to happen.
This CL pulled the logic into utility functions and classes for sharing between
these two places. A new ConstantFoldHelper class is created to help constant fold
and de-duplication.
Also, renamed the ConstantFold pass to TestConstantFold to make it clear that it is
intended for testing purpose.
--
PiperOrigin-RevId: 241971681
This CL fixes the non-determinism across compilers in an edsc::select expression used in LowerVectorTransfers. This is achieved by factoring the expression out of the function call to ensure a deterministic order of evaluation.
Since the expression is now factored out, fewer IR is generated and the test is updated accordingly.
--
PiperOrigin-RevId: 241679962
This is making up for some differences in standard library and linker flags.
It also get rid of the requirement to build with RTTI.
--
PiperOrigin-RevId: 241348845
This CL allows the programmatic control of the target hardware vector size when creating a MaterializeVectorsPass.
This is useful for registering passes for the tutorial.
PiperOrigin-RevId: 240996136
This CL removes the reliance of the vectorize pass on the specification of a `fastestVaryingDim` parameter. This parameter is a restriction meant to more easily target a particular loop/memref combination for vectorization and is mainly used for testing.
This also had the side-effect of restricting vectorization patterns to only the ones in which all memrefs were contiguous along the same loop dimension. This simple restriction prevented matmul to vectorize in 2-D.
this CL removes the restriction and adds the matmul test which vectorizes in 2-D along the parallel loops. Support for reduction loops is left for future work.
PiperOrigin-RevId: 240993827
Dialect conversion currently clones the operations that did not match any
pattern. This includes cloning any regions that belong to these operations.
Instead, apply conversion recursively to the nested regions.
Note that if an operation matched one of the conversion patterns, it is up to
the pattern rewriter to fill in the regions of the converted operation. This
may require calling back to the converter and is left for future work.
PiperOrigin-RevId: 240872410
This CL allows vectorization to be called and configured in other ways than just via command line arguments.
This allows triggering vectorization programmatically.
PiperOrigin-RevId: 240638208
Due to legacy reasons (ML/CFG function separation), regions in affine control
flow operations require contained blocks not to have terminators. This is
inconsistent with the notion of the block and may complicate code motion
between regions of affine control operations and other regions.
Introduce `affine.terminator`, a special terminator operation that must be used
to terminate blocks inside affine operations and transfers the control back to
he region enclosing the affine operation. For brevity and readability reasons,
allow `affine.for` and `affine.if` to omit the `affine.terminator` in their
regions when using custom printing and parsing format. The custom parser
injects the `affine.terminator` if it is missing so as to always have it
present in constructed operations.
Update transformations to account for the presence of terminator. In
particular, most code motion transformation between loops should leave the
terminator in place, and code motion between loops and non-affine blocks should
drop the terminator.
PiperOrigin-RevId: 240536998
Currently, regions can only be constructed by passing in a `Function` or an
`Instruction` pointer referencing the parent object, unlike `Function`s or
`Instruction`s themselves that can be created without a parent. It leads to a
rather complex flow in operation construction where one has to create the
operation first before being able to work with its regions. It may be
necessary to work with the regions before the operation is created. In
particular, in `build` and `parse` functions that are executed _before_ the
operation is created in cases where boilerplate region manipulation is required
(for example, inserting the hypothetical default terminator in affine regions).
Allow creating standalone regions. Such regions are meant to own a list of
blocks and transfer them to other regions on demand.
Each instruction stores a fixed number of regions as trailing objects and has
ownership of them. This decreases the size of the Instruction object for the
common case of instructions without regions. Keep this behavior intact. To
allow some flexibility in construction, make OperationState store an owning
vector of regions. When the Builder creates an Instruction from
OperationState, the bodies of the regions are transferred into the
instruction-owned regions to minimize copying. Thus, it becomes possible to
fill standalone regions with blocks and move them to an operation when it is
constructed, or move blocks from a region to an operation region, e.g., for
inlining.
PiperOrigin-RevId: 240368183
a pointer. This makes it consistent with all the other methods in
FunctionPass, as well as with ModulePass::getModule(). NFC.
PiperOrigin-RevId: 240257910
This combined match/rewrite functionality allows simplifying the majority of existing RewritePatterns, as they do not benefit from separate match and rewrite functions.
Some of the existing canonicalization patterns in StandardOps have been modified to take advantage of this functionality.
PiperOrigin-RevId: 240187856
inherited constructors, which is cleaner and means you can now use DimOp()
to get a null op, instead of having to use Instruction::getNull<DimOp>().
This removes another 200 lines of code.
PiperOrigin-RevId: 240068113
We just need a way to unpack ArrayRef<ValueHandle> to ArrayRef<Value*>.
No need to expose this to the user.
This reduces the cognitive overhead for the tutorial.
PiperOrigin-RevId: 240037425
This also eliminates some incorrect reinterpret_cast logic working around it, and numerous const-incorrect issues (like block argument iteration).
PiperOrigin-RevId: 239712029
This eliminate ConstOpPointer (but keeps OpPointer for now) by making OpPointer
implicitly launder const in a const incorrect way. It will eventually go away
entirely, this is a progressive step towards the new const model.
PiperOrigin-RevId: 239512640
This CL fixes an issue where cloned loop induction variables were not properly
propagated and beefs up the corresponding test.
PiperOrigin-RevId: 239422961
This CL introduces a ValueArrayHandle helper to manage the implicit conversion
of ArrayRef<ValueHandle> -> ArrayRef<Value*> by converting first to ValueArrayHandle.
Without this, boilerplate operations that take ArrayRef<Value*> cannot be removed easily.
This all seems to boil down to decoupling Value from Type.
Alternative solutions exist (e.g. MLIR using Value by value everywhere) but they would be very intrusive. This seems to be the lowest impedance change.
Intrinsics are also lowercased by popular demand.
PiperOrigin-RevId: 238974125
This CL removes the dependency of LowerVectorTransfers on the AST version of EDSCs which will be retired.
This exhibited a pretty fundamental staging difference in AST-based vs declarative based emission.
Since the delayed creation with an AST was staged, the loop order came into existence after the clipping expressions were computed.
This now changes as the loops first need to be created declaratively in fixed order and then the clipping expressions are created.
Also, due to lack of staging, coalescing cannot be done on the fly anymore and
needs to be done either as a pre-pass (current implementation) or as a local transformation on the generated IR (future work).
Tests are updated accordingly.
PiperOrigin-RevId: 238971631
- this is really not a hard error; emit a warning instead (for inability to compute
footprint due to the union failing due to unimplemented cases)
- remove a misleading warning from LoopFusion.cpp
PiperOrigin-RevId: 238118711
- fix for getConstantBoundOnDimSize: floordiv -> ceildiv for extent
- make getConstantBoundOnDimSize also return the identifier upper bound
- fix unionBoundingBox to correctly use the divisor and upper bound identified by
getConstantBoundOnDimSize
- deal with loop step correctly in addAffineForOpDomain (covers most cases now)
- fully compose bound map / operands and simplify/canonicalize before adding
dim/symbol to FlatAffineConstraints; fixes false positives in -memref-bound-check; add
test case there
- expose mlir::isTopLevelSymbol from AffineOps
PiperOrigin-RevId: 238050395
multi-result upper bounds, complete TODOs, fix/improve test cases.
- complete TODOs for loop unroll/unroll-and-jam. Something as simple as
"for %i = 0 to %N" wasn't being unrolled earlier (unless it had been written
as "for %i = ()[s0] -> (0)()[%N] to %N"; addressed now.
- update/replace getTripCountExpr with buildTripCountMapAndOperands; makes it
more powerful as it composes inputs into it
- getCleanupLowerBound and getUnrolledLoopUpperBound actually needed the same
code; refactor and remove one.
- reorganize test cases, write previous ones better; most of these changes are
"label replacements".
- fix wrongly labeled test cases in unroll-jam.mlir
PiperOrigin-RevId: 238014653
- compute tile sizes based on a simple model that looks at memory footprints
(instead of using the hardcoded default value)
- adjust tile sizes to make them factors of trip counts based on an option
- update loop fusion CL options to allow setting maximal fusion at pass creation
- change an emitError to emitWarning (since it's not a hard error unless the client
treats it that way, in which case, it can emit one)
$ mlir-opt -debug-only=loop-tile -loop-tile test/Transforms/loop-tiling.mlir
test/Transforms/loop-tiling.mlir:81:3: note: using tile sizes [4 4 5 ]
for %i = 0 to 256 {
for %i0 = 0 to 256 step 4 {
for %i1 = 0 to 256 step 4 {
for %i2 = 0 to 250 step 5 {
for %i3 = #map4(%i0) to #map11(%i0) {
for %i4 = #map4(%i1) to #map11(%i1) {
for %i5 = #map4(%i2) to #map12(%i2) {
%0 = load %arg0[%i3, %i5] : memref<8x8xvector<64xf32>>
%1 = load %arg1[%i5, %i4] : memref<8x8xvector<64xf32>>
%2 = load %arg2[%i3, %i4] : memref<8x8xvector<64xf32>>
%3 = mulf %0, %1 : vector<64xf32>
%4 = addf %2, %3 : vector<64xf32>
store %4, %arg2[%i3, %i4] : memref<8x8xvector<64xf32>>
}
}
}
}
}
}
PiperOrigin-RevId: 237461836
* bool succeeded(Status)
- Return if the status corresponds to a success value.
* bool failed(Status)
- Return if the status corresponds to a failure value.
PiperOrigin-RevId: 237153884
Adds utility to convert slice bounds to a FlatAffineConstraints representation.
Adds utility to FlatAffineConstraints to promote loop IV symbol identifiers to dim identifiers.
PiperOrigin-RevId: 236973261
- change this for consistency - everything else similar takes/returns a
Function pointer - the FuncBuilder ctor,
Block/Value/Instruction::getFunction(), etc.
- saves a whole bunch of &s everywhere
PiperOrigin-RevId: 236928761
This fixes a bug: previously, during conversion function argument
attributes were neither beings passed through nor converted. This fix
extends DialectConversion to allow for simultaneous conversion of the
function type and the argument attributes.
This was important when lowering MLIR to LLVM where attribute
information (e.g. noalias) needs to be preserved in MLIR(LLVMDialect).
Longer run it seems reasonable that we want to convert both the
function attribute and its type and the argument attributes, but that
requires a small refactoring in Function.h to aggregate these three
fields in an inner struct, which will require some discussion.
PiperOrigin-RevId: 236709409
An analysis can be any class, but it must provide the following:
* A constructor for a given IR unit.
struct MyAnalysis {
// Compute this analysis with the provided module.
MyAnalysis(Module *module);
};
Analyses can be accessed from a Pass by calling either the 'getAnalysisResult<AnalysisT>' or 'getCachedAnalysisResult<AnalysisT>' methods. A FunctionPass may query for a cached analysis on the parent module with 'getCachedModuleAnalysisResult'. Similary, a ModulePass may query an analysis, it doesn't need to be cached, on a child function with 'getFunctionAnalysisResult'.
By default, when running a pass all cached analyses are set to be invalidated. If no transformation was performed, a pass can use the method 'markAllAnalysesPreserved' to preserve all analysis results. As noted above, preserving specific analyses is not yet supported.
PiperOrigin-RevId: 236505642
This CL changes dialect op source files (.h, .cpp, .td) to follow the following
convention:
<full-dialect-name>/<dialect-namespace>Ops.{h|cpp|td}
Builtin and standard dialects are specially treated, though. Both of them do
not have dialect namespace; the former is still named as BuiltinOps.* and the
latter is named as Ops.*.
Purely mechanical. NFC.
PiperOrigin-RevId: 236371358
*) Breaks fusion pass into multiple sub passes over nodes in data dependence graph:
- first pass fuses single-use producers into their unique consumer.
- second pass enables fusing for input-reuse by fusing sibling nodes which read from the same memref, but which do not share dependence edges.
- third pass fuses remaining producers into their consumers (Note that the sibling fusion pass may have transformed a producer with multiple uses into a single-use producer).
*) Fusion for input reuse is enabled by computing a sibling node slice using the load/load accesses to the same memref, and fusion safety is guaranteed by checking that the sibling node memref write region (to a different memref) is preserved.
*) Enables output vector and output matrix computations from KFAC patches-second-moment operation to fuse into a single loop nest and reuse input from the image patches operation.
*) Adds a generic loop utilitiy for finding all sequential loops in a loop nest.
*) Adds and updates unit tests.
PiperOrigin-RevId: 236350987
LoopFusion
- getConstDifference in LoopFusion is pending a refactoring to handle bounds
with min's and max's; it currently asserts on some useful test cases that we
want to experiment with. This CL changes getSliceBounds to be more
conservative so as to not trigger the assertion. Filed b/126426796 to track this.
PiperOrigin-RevId: 235826538
- clean up loop fusion CL options for promoting local buffers to fast memory
space
- add parameters to loop fusion pass instantiation
PiperOrigin-RevId: 235813419
This CL adds a primitive to perform stripmining of a loop by a given factor and
sinking it under multiple target loops.
In turn this is used to implement imperfectly nested loop tiling (with interchange) by repeatedly calling the stripmineSink primitive.
The API returns the point loops and allows repeated invocations of tiling to achieve declarative, multi-level, imperfectly-nested tiling.
Note that this CL is only concerned with the mechanical aspects and does not worry about analysis and legality.
The API is demonstrated in an example which creates an EDSC block, emits the corresponding MLIR and applies imperfectly-nested tiling:
```cpp
auto block = edsc::block({
For(ArrayRef<edsc::Expr>{i, j}, {zero, zero}, {M, N}, {one, one}, {
For(k1, zero, O, one, {
C({i, j, k1}) = A({i, j, k1}) + B({i, j, k1})
}),
For(k2, zero, O, one, {
C({i, j, k2}) = A({i, j, k2}) + B({i, j, k2})
}),
}),
});
// clang-format on
emitter.emitStmts(block.getBody());
auto l_i = emitter.getAffineForOp(i), l_j = emitter.getAffineForOp(j),
l_k1 = emitter.getAffineForOp(k1), l_k2 = emitter.getAffineForOp(k2);
auto indicesL1 = mlir::tile({l_i, l_j}, {512, 1024}, {l_k1, l_k2});
auto l_ii1 = indicesL1[0][0], l_jj1 = indicesL1[1][0];
mlir::tile({l_jj1, l_ii1}, {32, 16}, l_jj1);
```
The edsc::Expr for the induction variables (i, j, k_1, k_2) provide the programmatic hooks from which tiling can be applied declaratively.
PiperOrigin-RevId: 235548228
Analysis - NFC
- refactor AffineExprFlattener (-> SimpleAffineExprFlattener) so that it
doesn't depend on FlatAffineConstraints, and so that FlatAffineConstraints
could be moved out of IR/; the simplification that the IR needs for
AffineExpr's doesn't depend on FlatAffineConstraints
- have AffineExprFlattener derive from SimpleAffineExprFlattener to use for
all Analysis/Transforms purposes; override addLocalFloorDivId in the derived
class
- turn addAffineForOpDomain into a method on FlatAffineConstraints
- turn AffineForOp::getAsValueMap into an AffineValueMap ctor
PiperOrigin-RevId: 235283610
- compute slices precisely where the destination iteration depends on multiple source
iterations (instead of over-approximating to the whole source loop extent)
- update unionBoundingBox to deal with input with non-matching symbols
- reenable disabled backend test case
PiperOrigin-RevId: 234714069
- hoist DMAs past all loops immediately surrounding the region that the latter
is invariant on - do this at DMA generation time itself
PiperOrigin-RevId: 234628447
Expose the result types of edsc::Expr, which are now stored for all types of
Exprs and not only for the variadic ones. Require return types when an Expr is
constructed, if it will ever have some. An empty return type list is
interpreted as an Expr that does not create a value (e.g. `return` or `store`).
Conceptually, all edss::Exprs are now typed, with the type being a (potentially
empty) tuple of return types. Unbound expressions and Bindables must now be
constructed with a specific type they will take. This makes EDSC less
evidently type-polymorphic, but we can still write generic code such as
Expr sumOfSquares(Expr lhs, Expr rhs) { return lhs * lhs + rhs * rhs; }
and use it to construct different typed expressions as
sumOfSquares(Bindable(IndexType::get(ctx)), Bindable(IndexType::get(ctx)));
sumOfSquares(Bindable(FloatType::getF32(ctx)),
Bindable(FloatType::getF32(ctx)));
On the positive side, we get the following.
1. We can now perform type checking when constructing Exprs rather than during
MLIR emission. Nevertheless, this is still duplicates the Op::verify()
until we can factor out type checking from that.
2. MLIREmitter is significantly simplified.
3. ExprKind enum is only used for actual kinds of expressions. Data structures
are converging with AbstractOperation, and the users can now create a
VariadicExpr("canonical_op_name", {types}, {exprs}) for any operation, even
an unregistered one without having to extend the enum and make pervasive
changes to EDSCs.
On the negative side, we get the following.
1. Typed bindables are more verbose, even in Python.
2. We lose the ability to do print debugging for higher-level EDSC abstractions
that are implemented as multiple MLIR Ops, for example logical disjunction.
This is the step 2/n towards making EDSC extensible.
***
Move MLIR Op construction from MLIREmitter::emitExpr to Expr::build since Expr
now has sufficient information to build itself.
This is the step 3/n towards making EDSC extensible.
Both of these strive to minimize the amount of irrelevant changes. In
particular, this introduces more complex pretty-printing for affine and binary
expression to make sure tests continue to pass. It also relies on string
comparison to identify specific operations that an Expr produces.
PiperOrigin-RevId: 234609882
EDSC currently implement a block as a statement that is itself a list of
statements. This suffers from two modeling problems: (1) these blocks are not
addressable, i.e. one cannot create an instruction where thus constructed block
is a successor; (2) they support block nesting, which is not supported by MLIR
blocks. Furthermore, emitting such "compound statement" (misleadingly named
`Block` in Python bindings) does not actually produce a new Block in the IR.
Implement support for creating actual IR Blocks in EDSC. In particular, define
a new StmtBlock EDSC class that is neither an Expr nor a Stmt but contains a
list of Stmts. Additionally, StmtBlock may have (early-) typed arguments.
These arguments are Bindable expressions that can be used inside the block.
Provide two calls in the MLIREmitter, `emitBlock` that actually emits a new
block and `emitBlockBody` that only emits the instructions contained in the
block without creating a new block. In the latter case, the instructions must
not use block arguments.
Update Python bindings to make it clear when instruction emission happens
without creating a new block.
PiperOrigin-RevId: 234556474
generation pass to make it drop certain assumptions, complete TODOs.
- multiple fixes for getMemoryFootprintBytes
- pass loopDepth correctly from getMemoryFootprintBytes()
- use union while computing memory footprints
- bug fixes for addAffineForOpDomain
- take into account loop step
- add domains of other loop IVs in turn that might have been used in the bounds
- dma-generate: drop assumption of "non-unit stride loops being tile space loops
and skipping those and recursing to inner depths"; DMA generation is now purely
based on available fast mem capacity and memory footprint's calculated
- handle memory region compute failures/bailouts correctly from dma-generate
- loop tiling cleanup/NFC
- update some debug and error messages to use emitNote/emitError in
pipeline-data-transfer pass - NFC
PiperOrigin-RevId: 234245969
*) Adds utility to LoopUtils to perform loop interchange of two AffineForOps.
*) Adds utility to LoopUtils to sink a loop to a specified depth within a loop nest, using a series of loop interchanges.
*) Computes dependences between all loads and stores in the loop nest, and classifies each loop as parallel or sequential.
*) Computes loop interchange permutation required to sink sequential loops (and raise parallel loop nests) while preserving relative order among them.
*) Checks each dependence against the permutation to make sure that dependences would not be violated by the loop interchange transformation.
*) Calls loop interchange in LoopFusion pass on consumer loop nests before fusing in producers, sinking loops with loop carried dependences deeper into the consumer loop nest.
*) Adds and updates related unit tests.
PiperOrigin-RevId: 234158370
Function types are built-in in MLIR and affect the validity of the IR itself.
However, advanced target dialects such as the LLVM IR dialect may include
custom function types. Until now, dialect conversion was expecting function
types not to be converted to the custom type: although the signatures was
allowed to change, the outer type must have been an mlir::FunctionType. This
effectively prevented dialect conversion from creating instructions that
operate on values of the custom function type.
Dissociate function signature conversion from general type conversion.
Function signature conversion must still produce an mlir::FunctionType and is
used in places where built-in types are required to make IR valid. General
type conversion is used for SSA values, including function and block arguments
and function results.
Exercise this behavior in the LLVM IR dialect conversion by converting function
types to LLVM IR function pointer types. The pointer to a function is chosen
to provide consistent lowering of higher-order functions: while it is possible
to have a value of function type, it is not possible to create a function type
accepting a returning another function type.
PiperOrigin-RevId: 234124494
for dma-generate, loop-unroll.
- add -tile-sizes command line option for loop tiling to specify different tile
sizes for loops in a band
- clean up command line options for loop-unroll, dma-generate (remove
cl::hidden)
PiperOrigin-RevId: 234006232
- for the DMA transfers being pipelined through double buffering, generate
deallocs for the double buffers being alloc'ed
This change is along the lines of cl/233502632. We initially wanted to experiment with
scoped allocation - so the deallocation's were usually not necessary; however, they are
needed even with scoped allocations in some situations - for eg. when the enclosing loop
gets unrolled. The dealloc serves as an end of lifetime marker.
PiperOrigin-RevId: 233653463
In the current state, edsc::Expr and edsc::Stmt overload operators to construct
other Exprs and Stmts. This includes some unconventional overloads of the
`operator==` to create a comparison expression and of the `operator!` to create
a negation expression. This situation could lead to unpleasant surprises where
the code does not behave like expected. Make all Expr and Stmt construction
operators free functions and move them to the `edsc::op` namespace. Callers
willing to use these operators must explicitly include them with the `using`
declaration. This can be done in some local scope.
Additionally, we currently emit signed comparisons for order-comparison
operators. With namespaces, we can later introduce two sets of operators in
different namespace, e.g. `edsc::op::sign` and `edsc::op::unsign` to clearly
state which kind of comparison is implied.
PiperOrigin-RevId: 233578674
- for the DMA buffers being allocated (and their tags), generate corresponding deallocs
- minor related update to replaceAllMemRefUsesWith and PipelineDataTransfer pass
Code generation for DMA transfers was being done with the initial simplifying
assumption that the alloc's would map to scoped allocations, and so no
deallocations would be necessary. Drop this assumption to generalize. Note that
even with scoped allocations, unrolling loops that have scoped allocations
could create a series of allocations and exhaustion of fast memory. Having a
end of lifetime marker like a dealloc in fact allows creating new scopes if
necessary when lowering to a backend and still utilize scoped allocation.
DMA buffers created by -dma-generate are guaranteed to have either
non-overlapping lifetimes or nested lifetimes.
PiperOrigin-RevId: 233502632
*) Adds parameter to public API of MemRefRegion::compute for passing in the slice loop bounds to compute the memref region of the loop nest slice.
*) Exposes public method MemRefRegion::getRegionSize for computing the size of the memref region in bytes.
PiperOrigin-RevId: 232706165
* AffineStructures has moved to IR.
* simplifyAffineExpr/simplifyAffineMap/getFlattenedAffineExpr have moved to IR.
* makeComposedAffineApply/fullyComposeAffineMapAndOperands have moved to AffineOps.
* ComposeAffineMaps is replaced by AffineApplyOp::canonicalize and deleted.
PiperOrigin-RevId: 232586468
*) After a private memref buffer is created for a fused loop nest, dependences on the old memref are reduced, which can open up fusion opportunities. In these cases, users of the old memref are added back to the worklist to be reconsidered for fusion.
*) Fixed a bug in fusion insertion point dependence check where the memref being privatized was being skipped from the check.
PiperOrigin-RevId: 232477853
- use getAccessMap() instead of repeating it
- fold getMemRefRegion into MemRefRegion ctor (more natural, avoid heap
allocation and unique_ptr where possible)
- change extractForInductionVars - MutableArrayRef -> ArrayRef for the
arguments. Since the method is just returning copies of 'Value *', the client
can't mutate the pointers themselves; it's fine to mutate the 'Value''s
themselves, but that doesn't mutate the pointers to those.
- change the way extractForInductionVars returns (see b/123437690)
PiperOrigin-RevId: 232359277
loops), (2) take into account fast memory space capacity and lower 'dmaDepth'
to fit, (3) add location information for debug info / errors
- change dma-generate pass to work on blocks of instructions (start/end
iterators) instead of 'for' loops; complete TODOs - allows DMA generation for
straightline blocks of operation instructions interspersed b/w loops
- take into account fast memory capacity: check whether memory footprint fits
in fastMemoryCapacity parameter, and recurse/lower the depth at which DMA
generation is performed until it does fit in the provided memory
- add location information to MemRefRegion; any insufficient fast memory
capacity errors or debug info w.r.t dma generation shows location information
- allow DMA generation pass to be instantiated with a fast memory capacity
option (besides command line flag)
- change getMemRefRegion to return unique_ptr's
- change getMemRefFootprintBytes to work on a 'Block' instead of 'ForInst'
- other helper methods; add postDomInstFilter option for
replaceAllMemRefUsesWith; drop forInst->walkOps, add Block::walkOps methods
Eg. output
$ mlir-opt -dma-generate -dma-fast-mem-capacity=1 /tmp/single.mlir
/tmp/single.mlir:9:13: error: Total size of all DMA buffers' for this block exceeds fast memory capacity
for %i3 = (d0) -> (d0)(%i1) to (d0) -> (d0 + 32)(%i1) {
^
$ mlir-opt -debug-only=dma-generate -dma-generate -dma-fast-mem-capacity=400 /tmp/single.mlir
/tmp/single.mlir:9:13: note: 8 KiB of DMA buffers in fast memory space for this block
for %i3 = (d0) -> (d0)(%i1) to (d0) -> (d0 + 32)(%i1) {
PiperOrigin-RevId: 232297044
- fusion already includes the necessary analysis to create small/local buffers
post fusion; allocate these buffers in a higher memory space if the necessary
pass parameters are provided (threshold size, memory space id)
- although there will be a separate utility at some point to directly detect
and promote small local buffers to higher memory spaces, doing it while fusion
when possible is much less expensive, comes free with fusion analysis, and covers
a key common case.
PiperOrigin-RevId: 232063894
This CL applies the following simplifications to EDSCs:
1. Rename Block to StmtList because an MLIR Block is a different, not yet
supported, notion;
2. Rework Bindable to drop specific storage and just use it as a simple wrapper
around Expr. The only value of Bindable is to force a static cast when used by
the user to bind into the emitter. For all intended purposes, Bindable is just
a lightweight check that an Expr is Unbound. This simplifies usage and reduces
the API footprint. After playing with it for some time, it wasn't worth the API
cognition overhead;
3. Replace makeExprs and makeBindables by makeNewExprs and copyExprs which is
more explicit and less easy to misuse;
4. Add generally useful functionality to MLIREmitter:
a. expose zero and one for the ubiquitous common lower bounds and step;
b. add support to create already bound Exprs for all function arguments as
well as shapes and views for Exprs bound to memrefs.
5. Delete Stmt::operator= and replace by a `Stmt::set` method which is more
explicit.
6. Make Stmt::operator Expr() explicit.
7. Indexed.indices assertions are removed to pave the way for expressing slices
and views as well as to work with 0-D memrefs.
The CL plugs those simplifications with TableGen and allows emitting a full MLIR function for
pointwise add.
This "x.add" op is both type and rank-agnostic (by allowing ArrayRef of Expr
passed to For loops) and opens the door to spinning up a composable library of
existing and custom ops that should automate a lot of the tedious work in
TF/XLA -> MLIR.
Testing needs to be significantly improved but can be done in a separate CL.
PiperOrigin-RevId: 231982325
A performance issue was reported due to the usage of NestedMatcher in
ComposeAffineMaps. The main culprit was the ubiquitous copies that were
occuring when appending even a single element in `matchOne`.
This CL generally simplifies the implementation and removes one level of indirection by getting rid of
auxiliary storage as well as simplifying the API.
The users of the API are updated accordingly.
The implementation was tested on a heavily unrolled example with
ComposeAffineMaps and is now close in performance with an implementation based
on stateless InstWalker.
As a reminder, the whole ComposeAffineMaps pass is slated to disappear but the bug report was very useful as a stress test for NestedMatchers.
Lastly, the following cleanups reported by @aminim were addressed:
1. make NestedPatternContext scoped within runFunction rather than at the Pass level. This was caused by a previous misunderstanding of Pass lifetime;
2. use defensive assertions in the constructor of NestedPatternContext to make it clear a unique such locally scoped context is allowed to exist.
PiperOrigin-RevId: 231781279
a trivial inst walker :-) (reduces pass time from several minutes non-terminating to 120ms) - (fixes b/123541184)
- use a simple 7-line inst walker to collect affine_apply op's instead of the nested
matcher; -compose-affine-maps pass runs in 120ms now instead of 5 minutes + (non-
terminating / out of memory) - on a realistic test case that is 20,000 lines 12-d
loop nest
- this CL is also pushing for simple existing/standard patterns unless there
is a real efficiency issue (OTOH, fixing nested matcher to address this issue requires
cl/231400521)
- the improvement is from swapping out the nested walker as opposed to from a bug
or anything else that this CL changes
- update stale comment
PiperOrigin-RevId: 231623619
Cleanup a usage of functional::map that is deemed too obscure in
`reindexAffineIndices`. Also fix a stale comment in `reindexAffineIndices`.
PiperOrigin-RevId: 231211184
Addresses b/122486036
This CL addresses some leftover crumbs in AffineMap and IntegerSet by removing
the Null method and cleaning up the constructors.
As the ::Null uses were tracked down, opportunities appeared to untangle some
of the Parsing logic and make it explicit where AffineMap/IntegerSet have
ambiguous syntax. Previously, ambiguous cases were hidden behind the implicit
pointer values of AffineMap* and IntegerSet* that were passed as function
parameters. Depending the values of those pointers one of 3 behaviors could
occur.
This parsing logic convolution is one of the rare cases where I would advocate
for code duplication. The more proper fix would be to make the syntax
unambiguous or to allow some lookahead.
PiperOrigin-RevId: 231058512
This CL follows up on a memory leak issue related to SmallVector growth that
escapes the BumpPtrAllocator.
The fix is to properly use ArrayRef and placement new to define away the
issue.
The following renaming is also applied:
1. MLFunctionMatcher -> NestedPattern
2. MLFunctionMatches -> NestedMatch
As a consequence all allocations are now guaranteed to live on the BumpPtrAllocator.
PiperOrigin-RevId: 231047766
Example:
dma-generate options:
-dma-fast-mem-capacity - Set fast memory space ...
-dma-fast-mem-space=<uint> - Set fast memory space ...
loop-fusion options:
-fusion-compute-tolerance=<number> - Fractional increase in ...
-fusion-maximal - Enables maximal loop fusion
loop-tile options:
-tile-size=<uint> - Use this tile size for ...
loop-unroll options:
-unroll-factor=<uint> - Use this unroll factor ...
-unroll-full - Fully unroll loops
-unroll-full-threshold=<uint> - Unroll all loops with ...
-unroll-num-reps=<uint> - Unroll innermost loops ...
loop-unroll-jam options:
-unroll-jam-factor=<uint> - Use this unroll jam factor ...
PiperOrigin-RevId: 231019363
index remapping
- generate a sequence of single result affine_apply's for the index remapping
(instead of one multi result affine_apply)
- update dma-generate and loop-fusion test cases; while on this, change test cases
to use single result affine apply ops
- some fusion comment fix/cleanup
PiperOrigin-RevId: 230985830
- Update createAffineComputationSlice to generate a sequence of single result
affine apply ops instead of one multi-result affine apply
- update pipeline-data-transfer test case; while on this, also update the test
case to use only single result affine maps, and make it more robust to
change.
PiperOrigin-RevId: 230965478
This commit introduces a generic dialect conversion/lowering/legalization pass
and illustrates it on StandardOps->LLVMIR conversion.
It partially reuses the PatternRewriter infrastructure and adds the following
functionality:
- an actual pass;
- non-default pattern constructors;
- one-to-many rewrites;
- rewriting terminators with successors;
- not applying patterns iteratively (unlike the existing greedy rewrite driver);
- ability to change function signature;
- ability to change basic block argument types.
The latter two things required, given the existing API, to create new functions
in the same module. Eventually, this should converge with the rest of
PatternRewriter. However, we may want to keep two pass versions: "heavy" with
function/block argument conversion and "light" that only touches operations.
This pass creates new functions within a module as a means to change function
signature, then creates new blocks with converted argument types in the new
function. Then, it traverses the CFG in DFS-preorder to make sure defs are
converted before uses in the dominated blocks. The generic pass has a minimal
interface with two hooks: one to fill in the set of patterns, and another one
to convert types for functions and blocks. The patterns are defined as
separate classes that can be table-generated in the future.
The LLVM IR lowering pass partially inherits from the existing LLVM IR
translator, in particular for type conversion. It defines a conversion pattern
template, instantiated for different operations, and is a good candidate for
tablegen. The lowering does not yet support loads and stores and is not
connected to the translator as it would have broken the existing flows. Future
patches will add missing support before switching the translator in a single
patch.
PiperOrigin-RevId: 230951202
- introduce a way to compute union using symbolic rectangular bounding boxes
- handle multiple load/store op's to the same memref by taking a union of the regions
- command-line argument to provide capacity of the fast memory space
- minor change to replaceAllMemRefUsesWith to not generate affine_apply if the
supplied index remap was identity
PiperOrigin-RevId: 230848185
canonicalizations of operations. The ultimate important user of this is
going to be a funcBuilder->foldOrCreate<YourOp>(...) API, but for now it
is just a more convenient way to write certain classes of canonicalizations
(see the change in StandardOps.cpp).
NFC.
PiperOrigin-RevId: 230770021
- switch some debug info to emitError
- use a single constant op for zero index to make it easier to write/update
test cases; avoid creating new constant op's for common zero index cases
- test case cleanup
This is in preparation for an upcoming major update to this pass.
PiperOrigin-RevId: 230728379
- update fusion cost model to fuse while tolerating a certain amount of redundant
computation; add cl option -fusion-compute-tolerance
evaluate memory footprint and intermediate memory reduction
- emit debug info from -loop-fusion showing what was fused and why
- introduce function to compute memory footprint for a loop nest
- getMemRefRegion readability update - NFC
PiperOrigin-RevId: 230541857
- unrolling a single iteration loop by a factor of one should promote its body
into its parent; this makes it consistent with the behavior/expectation that
unrolling a loop by a factor equal to its trip count makes the loop go away.
PiperOrigin-RevId: 230426499
- ForInst::walkOps will also be used in an upcoming CL (cl/229438679); better to have
this instead of deriving from the InstWalker
PiperOrigin-RevId: 230413820
- the size of the private memref created for the slice should be based on
the memref region accessed at the depth at which the slice is being
materialized, i.e., symbolic in the outer IVs up until that depth, as opposed
to the region accessed based on the entire domain.
- leads to a significant contraction of the temporary / intermediate memref
whenever the memref isn't reduced to a single scalar (through store fwd'ing).
Other changes
- update to promoteIfSingleIteration - avoid introducing unnecessary identity
map affine_apply from IV; makes it much easier to write and read test cases
and pass output for all passes that use promoteIfSingleIteration; loop-fusion
test cases become much simpler
- fix replaceAllMemrefUsesWith bug that was exposed by the above update -
'domInstFilter' could be one of the ops erased due to a memref replacement in
it.
- fix getConstantBoundOnDimSize bug: a division by the coefficient of the identifier was
missing (the latter need not always be 1); add lbFloorDivisors output argument
- rename getBoundingConstantSizeAndShape -> getConstantBoundingSizeAndShape
PiperOrigin-RevId: 230405218
*) Do not remove loop nests which write to memrefs which escape the function.
*) Do not remove memrefs which escape the function (e.g. are used in the return instruction).
PiperOrigin-RevId: 230398630
This CL performs a bunch of cleanups related to EDSCs that are generally
useful in the context of using them with a simple wrapping C API (not in this
CL) and with simple language bindings to Python and Swift.
PiperOrigin-RevId: 230066505
*) Enables reduction of private memref size based on MemRef region accessed by fused slice.
*) Enables maximal fusion by creating a private memref to break a fusion-preventing dependence.
*) Adds maximal fusion flag to enable fusing as much as possible (though it still fuses the minimum cost computation slice).
PiperOrigin-RevId: 229936698
This CL fixes a misunderstanding in how to build DimOp which triggered
execution issues in the CPU path.
The problem is that, given a `memref<?x4x?x8x?xf32>`, the expressions to
construct the dynamic dimensions should be:
`dim %arg, 0 : memref<?x4x?x8x?xf32>`
`dim %arg, 2 : memref<?x4x?x8x?xf32>`
and
`dim %arg, 4 : memref<?x4x?x8x?xf32>`
Before this CL, we wold construct:
`dim %arg, 0 : memref<?x4x?x8x?xf32>`
`dim %arg, 1 : memref<?x4x?x8x?xf32>`
`dim %arg, 2 : memref<?x4x?x8x?xf32>`
and expect the other dimensions to be constants.
This assumption seems consistent at first glance with the syntax of alloc:
```
%tensor = alloc(%M, %N, %O) : memref<?x4x?x8x?xf32>
```
But this was actuallyincorrect.
This CL also makes the relevant functions available to EDSCs and removes
duplication of the incorrect function.
PiperOrigin-RevId: 229622766
*) Adds support for fusing into consumer loop nests with multiple loads from the same memref.
*) Adds support for reducing slice loop trip count by projecting out destination loop IVs greater than destination loop depth.
*) Removes dependence on src loop depth and simplifies cost model computation.
PiperOrigin-RevId: 229575126
This allows load, store and ForNest to be used with both Expr and Bindable.
This simplifies writing generic pieces of MLIR snippet.
For instance, a generic pointwise add can now be written:
```cpp
// Different Bindable ivs, one per loop in the loop nest.
auto ivs = makeBindables(shapeA.size());
Bindable zero, one;
// Same bindable, all equal to `zero`.
SmallVector<Bindable, 8> zeros(ivs.size(), zero);
// Same bindable, all equal to `one`.
SmallVector<Bindable, 8> ones(ivs.size(), one);
// clang-format off
Bindable A, B, C;
Stmt scalarA, scalarB, tmp;
Stmt block = edsc::Block({
ForNest(ivs, zeros, shapeA, ones, {
scalarA = load(A, ivs),
scalarB = load(B, ivs),
tmp = scalarA + scalarB,
store(tmp, C, ivs)
}),
});
// clang-format on
```
This CL also adds some extra support for pretty printing that will be used in
a future CL when we introduce standalone testing of EDSCs. At the momen twe
are lacking the basic infrastructure to write such tests.
PiperOrigin-RevId: 229375850
*) LoopFusion: Adds fusion cost function which compares the cost of the fused loop nest, with the cost of the two unfused loop nests to determine if it is profitable to fuse the candidate loop nests. The fusion cost function is run for various combinations for src/dst loop depths attempting find the minimum cost setting for src/dst loop depths which does not increase the computational cost when the loop nests are fused. Combinations of src/dst loop depth are evaluated attempting to maximize loop depth (i.e. take a bigger computation slice from the source loop nest, and insert it deeper in the destination loop nest for better locality).
*) LoopFusion: Adds utility to compute op instance count for loop nests, sliced loop nests, and to compute the cost of a loop nest fused with another sliced loop nest.
*) LoopFusion: canonicalizes slice bound AffineMaps (and updates related tests).
*) Analysis::Utils: Splits getBackwardComputationSlice into two functions: one which calculates and returns the slice loop bounds for analysis by LoopFusion, and the other for insertion of the computation slice (ones fusion has calculated the min-cost src/dst loop depths).
*) Test: Adds multiple unit tests to test the new functionality.
PiperOrigin-RevId: 229219757
This CL adds a short term remedy to an issue that was found during execution
tests.
Lowering of vector transfer ops uses the permutation map to determine which
ForInst have been super-vectorized. During materialization to HW vector sizes
however, some of those dimensions may be fully unrolled and do not appear in
the permutation map.
Such dimensions were then not clipped and may have accessed out of bounds.
This CL conservatively clips all dimensions to ensure no out of bounds access.
The longer term solution is still up for debate but will probably require
either passing more information between Materialization and lowering, or just
merging the 2 passes.
PiperOrigin-RevId: 228980787
This CL is the 6th and last on the path to simplifying AffineMap composition.
This removes `AffineValueMap::forwardSubstitutions` and replaces it by simple
calls to `fullyComposeAffineMapAndOperands`.
PiperOrigin-RevId: 228962580
This CL is the 5th on the path to simplifying AffineMap composition.
This removes the distinction between normalized single-result AffineMap and
more general composed multi-result map.
One nice byproduct of making the implementation driven by single-result is
that the multi-result extension is a trivial change: the implementation is
still single-result and we just use:
```
unsigned idx = getIndexOf(...);
map.getResult(idx);
```
This CL also fixes an AffineNormalizer implementation issue related to symbols.
Namely it stops performing substitutions on symbols in AffineNormalizer and
instead concatenates them all to be consistent with the call to
`AffineMap::compose(AffineMap)`. This latter call to `compose` cannot perform
simplifications of symbols coming from different maps based on positions only:
i.e. dims are applied and renumbered but symbols must be concatenated.
The only way to determine whether symbols from different AffineApply are the
same is to look at the concrete values. The canonicalizeMapAndOperands is thus
extended with behavior to support replacing operands that appear multiple
times.
Lastly, this CL demonstrates that the implementation is correct by rewriting
ComposeAffineMaps using only `makeComposedAffineApply`. The implementation
uses a matcher because AffineApplyOp are introduced as composed operations on
the fly instead of iteratively forwardSubstituting. For this purpose, a walker
would revisit freshly introduced AffineApplyOp. Regardless, ComposeAffineMaps
is scheduled to disappear, this CL replaces the implementation based on
iterative `forwardSubstitute` by a composed-by-construction
`makeComposedAffineApply`.
Remaining calls to `forwardSubstitute` will be removed in the next CL.
PiperOrigin-RevId: 228830443
This implements the lowering of `floordiv`, `ceildiv` and `mod` operators from
affine expressions to the arithmetic primitive operations. Integer division
rules in affine expressions explicitly require rounding towards either negative
or positive infinity unlike machine implementations that round towards zero.
In the general case, implementing `floordiv` and `ceildiv` using machine signed
division requires computing both the quotient and the remainder. When the
divisor is positive, this can be simplified by adjusting the dividend and the
quotient by one and switching signs.
In the current use cases, we are unlikely to encounter affine expressions with
negative divisors (affine divisions appear in loop transformations such as
tiling that guarantee that divisors are positive by construction). Therefore,
it is reasonable to use branch-free single-division implementation. In case of
affine maps, divisors can only be literals so we can check the sign and
implement the case for negative divisors when the need arises.
The affine lowering pass can still fail when applied to semi-affine maps
(division or modulo by a symbol).
PiperOrigin-RevId: 228668181
- the double buffer should be indexed (iv floordiv step) % 2 and NOT (iv % 2);
step wasn't being accounted for.
- fix test cases, enable failing test cases
PiperOrigin-RevId: 228635726
Supervectorization does not plan on handling multi-result AffineMaps and
non-canonical chains of > 1 AffineApplyOp.
This CL uses the simpler single-result unbounded AffineApplyOp in the
MaterializeVectors pass.
PiperOrigin-RevId: 228469085
This CL is the 2nd on the path to simplifying AffineMap composition.
This CL uses the now accepted `AffineExpr::compose(AffineMap)` to
implement `AffineMap::compose(AffineMap)`.
Implications of keeping the simplification function in
Analysis are documented where relevant.
PiperOrigin-RevId: 228276646
- refactor toAffineFromEq and the code surrounding it; refactor code into
FlatAffineConstraints::getSliceBounds
- add FlatAffineConstraints methods to detect identifiers as mod's and div's of other
identifiers
- add FlatAffineConstraints::getConstantLower/UpperBound
- Address b/122118218 (don't assert on invalid fusion depths cmdline flags -
instead, don't do anything; change cmdline flags
src-loop-depth -> fusion-src-loop-depth
- AffineExpr/Map print method update: don't fail on null instances (since we have
a wrapper around a pointer, it's avoidable); rationale: dump/print methods should
never fail if possible.
- Update memref-dataflow-opt to add an optimization to avoid a unnecessary call to
IsRangeOneToOne when it's trivially going to be true.
- Add additional test cases to exercise the new support
- update a few existing test cases since the maps are now generated uniformly with
all destination loop operands appearing for the backward slice
- Fix projectOut - fix wrong range for getBestElimCandidate.
- Fix for getConstantBoundOnDimSize() - didn't show up in any test cases since
we didn't have any non-hyperrectangular ones.
PiperOrigin-RevId: 228265152
- Detect 'mod' to replace the combination of floordiv, mul, and subtract when
possible at construction time; when 'c' is a power of two, this reduces the number of
operations; also more compact and readable. Update simplifyAdd for this.
On a side note:
- with the affine expr flattening we have, a mod expression like d0 mod c
would be flattened into d0 - c * q, c * q <= d0 <= c*q + c - 1, with 'q'
being added as the local variable (q = d0 floordiv c); as a result, a mod
was turned into a floordiv whenever the expression was reconstructed back,
i.e., as d0 - c * (d0 floordiv c); as a result of this change, we recover
the mod back.
- rename SimplifyAffineExpr -> SimplifyAffineStructures (pass had been renamed but
the file hadn't been).
PiperOrigin-RevId: 228258120
- when SSAValue/MLValue existed, code at several places was forced to create additional
aggregate temporaries of SmallVector<SSAValue/MLValue> to handle the conversion; get
rid of such redundant code
- use filling ctors instead of explicit loops
- for smallvectors, change insert(list.end(), ...) -> append(...
- improve comments at various places
- turn getMemRefAccess into MemRefAccess ctor and drop duplicated
getMemRefAccess. In the next CL, provide getAccess() accessors for load,
store, DMA op's to return a MemRefAccess.
PiperOrigin-RevId: 228243638
Supervectorization does not plan on handling multi-result AffineMaps and
non-canonical chains of > 1 AffineApplyOp.
This CL introduces a simpler abstraction and composition of single-result
unbounded AffineApplyOp by using the existing unbound AffineMap composition.
This CL adds a simple API call and relevant tests:
```c++
OpPointer<AffineApplyOp> makeNormalizedAffineApply(
FuncBuilder *b, Location loc, AffineMap map, ArrayRef<Value*> operands);
```
which creates a single-result unbounded AffineApplyOp.
The operands of AffineApplyOp are not themselves results of AffineApplyOp by
consrtuction.
This represent the simplest possible interface to complement the composition
of (mathematical) AffineMap, for the cases when we are interested in applying
it to Value*.
In this CL the composed AffineMap is not compressed (i.e. there exist operands
that are not part of the result). A followup commit will compress to normal
form.
The single-result unbounded AffineApplyOp abstraction will be used in a
followup CL to support the MaterializeVectors pass.
PiperOrigin-RevId: 227879021
Even though it is unexpected except in pathological cases, a nullptr clone may
be returned. This CL handles the nullptr return gracefuly.
PiperOrigin-RevId: 227764615
The strict requirement (i.e. at least 2 HW vectors in a super-vector) was a
premature optimization to avoid interfering with other vector code potentially
introduced via other means.
This CL avoids this premature optimization and the spurious errors it causes
when super-vector size == HW vector size (which is a possible corner case).
This may be revisited in the future.
PiperOrigin-RevId: 227763966
This corner was found when stress testing with a functional end-to-end CPU
path. In the case where the hardware vector size is 1x...x1 the `keep` vector
is empty and would result a crash.
While there is no reason to expect a 1x...x1 HW vector in practice, this case
can just gracefully degrade to scalar, which is what this CL allows.
PiperOrigin-RevId: 227761097
This change is mechanical and merges the LowerAffineApplyPass and
LowerIfAndForPass into a single LowerAffinePass. It makes a step towards
defining an "affine dialect" that would contain all polyhedral-related
constructs. The motivation for merging these two passes is based on retiring
MLFunctions and, eventually, transforming If and For statements into regular
operations. After that happens, LowerAffinePass becomes yet another
legalization.
PiperOrigin-RevId: 227566113
Existing implementation was created before ML/CFG unification refactoring and
did not concern itself with further lowering to separate concerns. As a
result, it emitted `affine_apply` instructions to implement `for` loop bounds
and `if` conditions and required a follow-up function pass to lower those
`affine_apply` to arithmetic primitives. In the unified function world,
LowerForAndIf is mostly a lowering pass with low complexity. As we move
towards a dialect for affine operations (including `for` and `if`), it makes
sense to lower `for` and `if` conditions directly to arithmetic primitives
instead of relying on `affine_apply`.
Expose `expandAffineExpr` function in LoweringUtils. Use this function
together with `expandAffineMaps` to emit primitives that implement loop and
branch conditions directly.
Also remove tests that become unnecessary after transforming LowerForAndIf into
a function pass.
PiperOrigin-RevId: 227563608
In LoweringUtils, extract out `expandAffineMap`. This function takes an affine
map and a list of values the map should be applied to and emits a sequence of
arithmetic instructions that implement the affine map. It is independent of
the AffineApplyOp and can be used in places where we need to insert an
evaluation of an affine map without relying on a (temporary) `affine_apply`
instruction. This prepares for a merge between LowerAffineApply and
LowerForAndIf passes.
Move the `expandAffineApply` function to the LowerAffineApply pass since it is
the only place that must be aware of the `affine_apply` instructions.
PiperOrigin-RevId: 227563439
The entire compiler now looks at structural properties of the function (e.g.
does it have one block, does it contain an if/for stmt, etc) so the only thing
holding up this difference is round tripping through the parser/printer syntax.
Removing this shrinks the compile by ~140LOC.
This is step 31/n towards merging instructions and statements. The last step
is updating the docs, which I will do as a separate patch in order to split it
from this mostly mechanical patch.
PiperOrigin-RevId: 227540453
This CL introduces a simple set of Embedded Domain-Specific Components (EDSCs)
in MLIR components:
1. a `Type` system of shell classes that closely matches the MLIR type system. These
types are subdivided into `Bindable` leaf expressions and non-bindable `Expr`
expressions;
2. an `MLIREmitter` class whose purpose is to:
a. maintain a map of `Bindable` leaf expressions to concrete SSAValue*;
b. provide helper functionality to specify bindings of `Bindable` classes to
SSAValue* while verifying comformable types;
c. traverse the `Expr` and emit the MLIR.
This is used on a concrete example to implement MemRef load/store with clipping in the
LowerVectorTransfer pass. More specifically, the following pseudo-C++ code:
```c++
MLFuncBuilder *b = ...;
Location location = ...;
Bindable zero, one, expr, size;
// EDSL expression
auto access = select(expr < zero, zero, select(expr < size, expr, size - one));
auto ssaValue = MLIREmitter(b)
.bind(zero, ...)
.bind(one, ...)
.bind(expr, ...)
.bind(size, ...)
.emit(location, access);
```
is used to emit all the MLIR for a clipped MemRef access.
This simple EDSL can easily be extended to more powerful patterns and should
serve as the counterpart to pattern matchers (and could potentially be unified
once we get enough experience).
In the future, most of this code should be TableGen'd but for now it has
concrete valuable uses: make MLIR programmable in a declarative fashion.
This CL also adds Stmt, proper supporting free functions and rewrites
VectorTransferLowering fully using EDSCs.
The code for creating the EDSCs emitting a VectorTransferReadOp as loops
with clipped loads is:
```c++
Stmt block = Block({
tmpAlloc = alloc(tmpMemRefType),
vectorView = vector_type_cast(tmpAlloc, vectorMemRefType),
ForNest(ivs, lbs, ubs, steps, {
scalarValue = load(scalarMemRef, accessInfo.clippedScalarAccessExprs),
store(scalarValue, tmpAlloc, accessInfo.tmpAccessExprs),
}),
vectorValue = load(vectorView, zero),
tmpDealloc = dealloc(tmpAlloc.getLHS())});
emitter.emitStmt(block);
```
where `accessInfo.clippedScalarAccessExprs)` is created with:
```c++
select(i + ii < zero, zero, select(i + ii < N, i + ii, N - one));
```
The generated MLIR resembles:
```mlir
%1 = dim %0, 0 : memref<?x?x?x?xf32>
%2 = dim %0, 1 : memref<?x?x?x?xf32>
%3 = dim %0, 2 : memref<?x?x?x?xf32>
%4 = dim %0, 3 : memref<?x?x?x?xf32>
%5 = alloc() : memref<5x4x3xf32>
%6 = vector_type_cast %5 : memref<5x4x3xf32>, memref<1xvector<5x4x3xf32>>
for %i4 = 0 to 3 {
for %i5 = 0 to 4 {
for %i6 = 0 to 5 {
%7 = affine_apply #map0(%i0, %i4)
%8 = cmpi "slt", %7, %c0 : index
%9 = affine_apply #map0(%i0, %i4)
%10 = cmpi "slt", %9, %1 : index
%11 = affine_apply #map0(%i0, %i4)
%12 = affine_apply #map1(%1, %c1)
%13 = select %10, %11, %12 : index
%14 = select %8, %c0, %13 : index
%15 = affine_apply #map0(%i3, %i6)
%16 = cmpi "slt", %15, %c0 : index
%17 = affine_apply #map0(%i3, %i6)
%18 = cmpi "slt", %17, %4 : index
%19 = affine_apply #map0(%i3, %i6)
%20 = affine_apply #map1(%4, %c1)
%21 = select %18, %19, %20 : index
%22 = select %16, %c0, %21 : index
%23 = load %0[%14, %i1, %i2, %22] : memref<?x?x?x?xf32>
store %23, %5[%i6, %i5, %i4] : memref<5x4x3xf32>
}
}
}
%24 = load %6[%c0] : memref<1xvector<5x4x3xf32>>
dealloc %5 : memref<5x4x3xf32>
```
In particular notice that only 3 out of the 4-d accesses are clipped: this
corresponds indeed to the number of dimensions in the super-vector.
This CL also addresses the cleanups resulting from the review of the prevous
CL and performs some refactoring to simplify the abstraction.
PiperOrigin-RevId: 227367414
on this to merge together the classes, but there may be other simplification
possible. I'll leave that to riverriddle@ as future work.
This is step 29/n towards merging instructions and statements.
PiperOrigin-RevId: 227328680
simplifying them in minor ways. The only significant cleanup here
is the constant folding pass. All the other changes are simple and easy,
but this is still enough to shrink the compiler by 45LOC.
The one pass left to merge is the CSE pass, which will be move involved, so I'm
splitting it out to its own patch (which I'll tackle right after this).
This is step 28/n towards merging instructions and statements.
PiperOrigin-RevId: 227328115
Remove an unnecessary restriction in forward substitution. Slightly
simplify LLVM IR lowering, which previously would crash if given an ML
function, it should now produce a clean error if given a function with an
if/for instruction in it, just like it does any other unsupported op.
This is step 27/n towards merging instructions and statements.
PiperOrigin-RevId: 227324542
representation, shrinking by 70LOC. The PatternRewriter class can probably
also be simplified as well, but one step at a time.
This is step 26/n towards merging instructions and statements. NFC.
PiperOrigin-RevId: 227324218
- introduce PostDominanceInfo in the right/complete way and use that for post
dominance check in store-load forwarding
- replace all uses of Analysis/Utils::dominates/properlyDominates with
DominanceInfo::dominates/properlyDominates
- drop all redundant copies of dominance methods in Analysis/Utils/
- in pipeline-data-transfer, replace dominates call with a much less expensive
check; similarly, substitute dominates() in checkMemRefAccessDependence with
a simpler check suitable for that context
- fix a bug in properlyDominates
- improve doc for 'for' instruction 'body'
PiperOrigin-RevId: 227320507
function pass, and eliminating the need to copy over code and do
interprocedural updates. While here, also improve it to make fewer empty
blocks, and rename it to "LowerIfAndFor" since that is what it does. This is
a net reduction of ~170 lines of code.
As drive-bys, change the splitBlock method to *not* insert an unconditional
branch, since that behavior is annoying for all clients. Also improve the
AsmPrinter to not crash when a block is referenced that isn't linked into a
function.
PiperOrigin-RevId: 227308856
- the load/store forwarding relies on memref dependence routines as well as
SSA/dominance to identify the memref store instance uniquely supplying a value
to a memref load, and replaces the result of that load with the value being
stored. The memref is also deleted when possible if only stores remain.
- add methods for post dominance for MLFunction blocks.
- remove duplicated getLoopDepth/getNestingDepth - move getNestingDepth,
getMemRefAccess, getNumCommonSurroundingLoops into Analysis/Utils (were
earlier static)
- add a helper method in FlatAffineConstraints - isRangeOneToOne.
PiperOrigin-RevId: 227252907
Function::walk functionality into f->walkInsts/Ops which allows visiting all
instructions, not just ops. Eliminate Function::getBody() and
Function::getReturn() helpers which crash in CFG functions, and were only kept
around as a bridge.
This is step 25/n towards merging instructions and statements.
PiperOrigin-RevId: 227243966
consistent and moving the using declarations over. Hopefully this is the last
truly massive patch in this refactoring.
This is step 21/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227178245
The last major renaming is Statement -> Instruction, which is why Statement and
Stmt still appears in various places.
This is step 19/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227163082
StmtResult -> InstResult, StmtOperand -> InstOperand, and remove the old names.
This is step 17/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227121537
OperationInst derives from it. This allows eliminating some forwarding
functions, other complex code handling multiple paths, and the 'isStatement'
bit tracked by Operation.
This is the last patch I think I can make before the big mechanical change
merging Operation into OperationInst, coming next.
This is step 15/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227077411
StmtSuccessorIterator/StmtSuccessorIterator, and rename and move the
CFGFunctionViewGraph pass to ViewFunctionGraph.
This is step 13/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227069438
FuncBuilder class. Also rename SSAValue.cpp to Value.cpp
This is step 12/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227067644
is the new base of the SSA value hierarchy. This CL also standardizes all the
nomenclature and comments to use 'Value' where appropriate. This also eliminates a large number of cast<MLValue>(x)'s, which is very soothing.
This is step 11/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227064624
This *only* changes the internal data structures, it does not affect the user visible syntax or structure of MLIR code. Function gets new "isCFG()" sorts of predicates as a transitional measure.
This patch is gross in a number of ways, largely in an effort to reduce the amount of mechanical churn in one go. It introduces a bunch of using decls to keep the old names alive for now, and a bunch of stuff needs to be renamed.
This is step 10/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227044402
making it more similar to the CFG side of things. It is true that in a deeply
nested case that this is not a guaranteed O(1) time operation, and that 'get'
could lead compiler hackers to think this is cheap, but we need to merge these
and we can look into solutions for this in the future if it becomes a problem
in practice.
This is step 9/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 226983931
graph specializations for doing CFG traversals of ML Functions, making the two
sorts of functions have the same capabilities.
This is step 8/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 226968502
Supervectorization uses null pointers to SSA values as a means of communicating
the failure to vectorize. In operation vectorization, all operations producing
the values of operation arguments must be vectorized for the given operation to
be vectorized. The existing check verified if any of the value "def"
statements was vectorized instead, sometimes leading to assertions inside `isa`
called on a null pointer. Fix this to check that all "def" statements were
vectorized.
PiperOrigin-RevId: 226941552
from it. This is necessary progress to squaring away the parent relationship
that a StmtBlock has with its enclosing if/for/fn, and makes room for functions
to have more than one block in the future. This also removes IfClause and ForStmtBody.
This is step 5/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 226936541
for SSA values in terminators, but easily worked around. At the same time,
move the StmtOperand list in a OperationStmt to the end of its trailing
objects list so we can *reduce* the number of operands, without affecting
offsets to the other stuff in the allocation.
This is important because we want OperationStmts to be consequtive, including
their operands - we don't want to use an std::vector of operands like
Instructions have.
This is patch 4/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 226865727
clients to use OperationState instead. This makes MLFuncBuilder more similiar
to CFGFuncBuilder. This whole area will get tidied up more when cfg and ml
worlds get unified. This patch is just gardening, NFC.
PiperOrigin-RevId: 226701959
StmtBlock. This is more consistent with IfStmt and also conceptually makes
more sense - a forstmt "isn't" its body, it contains its body.
This is step 1/N towards merging BasicBlock and StmtBlock. This is required
because in the new regime StmtBlock will have a use list (just like BasicBlock
does) of operands, and ForStmt already has a use list for its induction
variable.
This is a mechanical patch, NFC.
PiperOrigin-RevId: 226684158
reuse existing ones.
- drop IterationDomainContext, redundant since FlatAffineConstraints has
MLValue information associated with its dimensions.
- refactor to use existing support
- leads to a reduction in LOC
- as a result of these changes, non-constant loop bounds get naturally
supported for dep analysis.
- update test cases to include a couple with non-constant loop bounds
- rename addBoundsFromForStmt -> addForStmtDomain
- complete TODO for getLoopIVs (handle 'if' statements)
PiperOrigin-RevId: 226082008
This introduces a generic lowering pass for ML functions. The pass is
parameterized by template arguments defining individual pattern rewriters.
Concrete lowering passes define individual pattern rewriters and inherit from
the generic class that takes care of allocating rewriters, traversing ML
functions and performing the actual rewrite.
While this is similar to the greedy pattern rewriter available in
Transform/Utils, it requires adjustments due to the ML/CFG duality. In
particular, ML function rewriters must be able to create statements, not only
operations, and need access to an MLFuncBuilder. When we move to using the
unified function type, the ML-specific rewriting will become unnecessary.
Use LowerVectorTransfers as a testbed for the generic pass.
PiperOrigin-RevId: 225887424
This operation is produced and used by the super-vectorization passes and has
been emitted as an abstract unregistered operation until now. For end-to-end
testing purposes, it has to be eventually lowered to LLVM IR. Matching
abstract operation by name goes into the opposite direction of the generic
lowering approach that is expected to be used for LLVM IR lowering in the
future. Register vector_type_cast operation as a part of the SuperVector
dialect.
Arguably, this operation is a special case of the `view` operation from the
Standard dialect. The semantics of `view` is not fully specified at this point
so it is safer to rely on a custom operation. Additionally, using a custom
operation may help to achieve clear dialect separation.
PiperOrigin-RevId: 225887305
provide unroll factors, and a cmd line argument to specify number of
innermost loop unroll repetitions.
- add function callback parameter for outside targets to provide unroll factors
- add a cmd line parameter to repeatedly apply innermost loop unroll a certain
number of times (to avoid using -loop-unroll -loop-unroll ...; instead
-unroll-num-reps=2).
- implement the callback for a target
- update test cases / usage
PiperOrigin-RevId: 225843191
*) Adds simple greedy fusion algorithm to drive experimentation. This algorithm greedily fuses loop nests with single-writer/single-reader memref dependences to improve locality.
*) Adds support for fusing slices of a loop nest computation: fusing one loop nest into another by adjusting the source loop nest's iteration bounds (after it is fused into the destination loop nest). This is accomplished by solving for the source loop nest's IVs in terms of the destination loop nests IVs and symbols using the dependece polyhedron, then creating AffineMaps of these functions for the loop bounds of the fused source loop.
*) Adds utility function 'insertMemRefComputationSlice' which computes and inserts computation slice from loop nest surrounding a source memref access into the loop nest surrounding the destingation memref access.
*) Adds FlatAffineConstraints::toAffineMap function which returns and AffineMap which represents an equality contraint where one dimension identifier is represented as a function of all others in the equality constraint.
*) Adds multiple fusion unit tests.
PiperOrigin-RevId: 225842944
- use addBoundsForForStmt
- getLoopIVs can return a vector of ForStmt * instead of const ForStmt *; the
returned things aren't owned / part of the stmt on which it's being called.
- other minor API cleanup
PiperOrigin-RevId: 225774301
From the beginning, vector_transfer_read and vector_transfer_write opreations
were intended as a mid-level vectorization abstraction. In particular, they
are lowered to the StandardOps dialect before further processing. As such, it
does not make sense to keep them at the same level as StandardOps. Introduce
the new SuperVectorOps dialect and move vector_transfer_* operations there.
This will be used as a testbed for the generic lowering/legalization pass.
PiperOrigin-RevId: 225554492
Originally, loop steps were implemented using `addi` and `constant` operations
because `affine_apply` was not handled in the first implementation. The
support for `affine_apply` has been added, use it to implement the update of
the loop induction variable. This is more consistent with the lower and upper
bounds of the loop that are also implemented as `affine_apply`, removes the
dependence of the converted function on the StandardOps dialect and makes it
clear from the CFG function that all operations on the loop induction variable
are purely affine.
PiperOrigin-RevId: 225165337
- loop step wasn't handled and there wasn't a TODO or an assertion; fix this.
- rename 'delay' to shift for consistency/readability.
- other readability changes.
- remove duplicate attribute print for DmaStartOp; fix misplaced attribute
print for DmaWaitOp
- add build method for AddFOp (unrelated to this CL, but add it anyway)
PiperOrigin-RevId: 224892958
- adding a conservative check for now (TODO: use the dependence analysis pass
once the latter is extended to deal with DMA ops). resolve an existing bug on
a test case.
- update test cases
PiperOrigin-RevId: 224869526
- add method normalizeConstraintsByGCD
- call normalizeConstraintsByGCD() and GCDTightenInequalities() at the end of
projectOut.
- remove call to GCDTightenInequalities() from getMemRefRegion
- change isEmpty() to check isEmptyByGCDTest() / hasInvalidConstraint() each
time an identifier is eliminated (to detect emptiness early).
- make FourierMotzkinEliminate, gaussianEliminateId(s),
GCDTightenInequalities() private
- improve / update stale comments
PiperOrigin-RevId: 224866741
- fix replaceAllMemRefUsesWith call to replace only inside loop body.
- handle the case where DMA buffers are dynamic; extend doubleBuffer() method
to handle dynamically shaped DMA buffers (pass the right operands to AllocOp)
- place alloc's for DMA buffers at the depth at which pipelining is being done
(instead of at top-level)
- add more test cases
PiperOrigin-RevId: 224852231
This was missing from the original commit. The implementation of
createLowerAffineApply was defined in the default namespace but declared in the
`mlir` namespace, which could lead to linking errors when it was used. Put the
definition in `mlir` namespace.
PiperOrigin-RevId: 224830894
are a max/min of several expressions.
- Extend loop tiling to handle non-constant loop bounds and bounds that
are a max/min of several expressions, i.e., bounds using multi-result affine
maps
- also fix b/120630124 as a result (the IR was in an invalid state when tiled
loop generation failed; SSA uses were created that weren't plugged into the IR).
PiperOrigin-RevId: 224604460
- generate DMAs correctly now using strided DMAs where needed
- add support for multi-level/nested strides; op still supports one level of
stride for now.
Other things
- add test case for symbolic lower/upper bound; cases where the DMA buffer
size can't be bounded by a known constant
- add test case for dynamic shapes where the DMA buffers are however bounded by
constants
- refactor some of the '-dma-generate' code
PiperOrigin-RevId: 224584529
This CL adds a pass that lowers VectorTransferReadOp and VectorTransferWriteOp
to a simple loop nest via local buffer allocations.
This is an MLIR->MLIR lowering based on builders.
A few TODOs are left to address in particular:
1. invert the permutation map so the accesses to the remote memref are coalesced;
2. pad the alloc for bank conflicts in local memory (e.g. GPUs shared_memory);
3. support broadcast / avoid copies when permutation_map is not of full column rank
4. add a proper "element_cast" op
One notable limitation is this does not plan on supporting boundary conditions.
It should be significantly easier to use pre-baked MLIR functions to handle such paddings.
This is left for future consideration.
Therefore the current CL only works properly for full-tile cases atm.
This CL also adds 2 simple tests:
```mlir
for %i0 = 0 to %M step 3 {
for %i1 = 0 to %N step 4 {
for %i2 = 0 to %O {
for %i3 = 0 to %P step 5 {
vector_transfer_write %f1, %A, %i0, %i1, %i2, %i3 {permutation_map: (d0, d1, d2, d3) -> (d3, d1, d0)} : vector<5x4x3xf32>, memref<?x?x?x?xf32, 0>, index, index, index, index
```
lowers into:
```mlir
for %i0 = 0 to %arg0 step 3 {
for %i1 = 0 to %arg1 step 4 {
for %i2 = 0 to %arg2 {
for %i3 = 0 to %arg3 step 5 {
%1 = alloc() : memref<5x4x3xf32>
%2 = "element_type_cast"(%1) : (memref<5x4x3xf32>) -> memref<1xvector<5x4x3xf32>>
store %cst, %2[%c0] : memref<1xvector<5x4x3xf32>>
for %i4 = 0 to 5 {
%3 = affine_apply (d0, d1) -> (d0 + d1) (%i3, %i4)
for %i5 = 0 to 4 {
%4 = affine_apply (d0, d1) -> (d0 + d1) (%i1, %i5)
for %i6 = 0 to 3 {
%5 = affine_apply (d0, d1) -> (d0 + d1) (%i0, %i6)
%6 = load %1[%i4, %i5, %i6] : memref<5x4x3xf32>
store %6, %0[%5, %4, %i2, %3] : memref<?x?x?x?xf32>
dealloc %1 : memref<5x4x3xf32>
```
and
```mlir
for %i0 = 0 to %M step 3 {
for %i1 = 0 to %N {
for %i2 = 0 to %O {
for %i3 = 0 to %P step 5 {
%f = vector_transfer_read %A, %i0, %i1, %i2, %i3 {permutation_map: (d0, d1, d2, d3) -> (d3, 0, d0)} : (memref<?x?x?x?xf32, 0>, index, index, index, index) -> vector<5x4x3xf32>
```
lowers into:
```mlir
for %i0 = 0 to %arg0 step 3 {
for %i1 = 0 to %arg1 {
for %i2 = 0 to %arg2 {
for %i3 = 0 to %arg3 step 5 {
%1 = alloc() : memref<5x4x3xf32>
%2 = "element_type_cast"(%1) : (memref<5x4x3xf32>) -> memref<1xvector<5x4x3xf32>>
for %i4 = 0 to 5 {
%3 = affine_apply (d0, d1) -> (d0 + d1) (%i3, %i4)
for %i5 = 0 to 4 {
for %i6 = 0 to 3 {
%4 = affine_apply (d0, d1) -> (d0 + d1) (%i0, %i6)
%5 = load %0[%4, %i1, %i2, %3] : memref<?x?x?x?xf32>
store %5, %1[%i4, %i5, %i6] : memref<5x4x3xf32>
%6 = load %2[%c0] : memref<1xvector<5x4x3xf32>>
dealloc %1 : memref<5x4x3xf32>
```
PiperOrigin-RevId: 224552717
This simplifies call-sites returning true after emitting an error. After the
conversion, dropped braces around single statement blocks as that seems more
common.
Also, switched to emitError method instead of emitting Error kind using the
emitDiagnostic method.
TESTED with existing unit tests
PiperOrigin-RevId: 224527868
This CLs adds proper error emission, removes NYI assertions and documents
assumptions that are required in the relevant functions.
PiperOrigin-RevId: 224377207
This CL adds the following free functions:
```
/// Returns the AffineExpr e o m.
AffineExpr compose(AffineExpr e, AffineMap m);
/// Returns the AffineExpr f o g.
AffineMap compose(AffineMap f, AffineMap g);
```
This addresses the issue that AffineMap composition is only available at a
distance via AffineValueMap and is thus unusable on Attributes.
This CL thus implements AffineMap composition in a more modular and composable
way.
This CL does not claim that it can be a good replacement for the
implementation in AffineValueMap, in particular it does not support bounded
maps atm.
Standalone tests are added that replicate some of the logic of the AffineMap
composition pass.
Lastly, affine map composition is used properly inside MaterializeVectors and
a standalone test is added that requires permutation_map composition with a
projection map.
PiperOrigin-RevId: 224376870
This CL hooks up and uses permutation_map in vector_transfer ops.
In particular, when going into the nuts and bolts of the implementation, it
became clear that cases arose that required supporting broadcast semantics.
Broadcast semantics are thus added to the general permutation_map.
The verify methods and tests are updated accordingly.
Examples of interest include.
Example 1:
The following MLIR snippet:
```mlir
for %i3 = 0 to %M {
for %i4 = 0 to %N {
for %i5 = 0 to %P {
%a5 = load %A[%i4, %i5, %i3] : memref<?x?x?xf32>
}}}
```
may vectorize with {permutation_map: (d0, d1, d2) -> (d2, d1)} into:
```mlir
for %i3 = 0 to %0 step 32 {
for %i4 = 0 to %1 {
for %i5 = 0 to %2 step 256 {
%4 = vector_transfer_read %arg0, %i4, %i5, %i3
{permutation_map: (d0, d1, d2) -> (d2, d1)} :
(memref<?x?x?xf32>, index, index) -> vector<32x256xf32>
}}}
````
Meaning that vector_transfer_read will be responsible for reading the 2-D slice:
`%arg0[%i4, %i5:%15+256, %i3:%i3+32]` into vector<32x256xf32>. This will
require a transposition when vector_transfer_read is further lowered.
Example 2:
The following MLIR snippet:
```mlir
%cst0 = constant 0 : index
for %i0 = 0 to %M {
%a0 = load %A[%cst0, %cst0] : memref<?x?xf32>
}
```
may vectorize with {permutation_map: (d0) -> (0)} into:
```mlir
for %i0 = 0 to %0 step 128 {
%3 = vector_transfer_read %arg0, %c0_0, %c0_0
{permutation_map: (d0, d1) -> (0)} :
(memref<?x?xf32>, index, index) -> vector<128xf32>
}
````
Meaning that vector_transfer_read will be responsible of reading the 0-D slice
`%arg0[%c0, %c0]` into vector<128xf32>. This will require a 1-D vector
broadcast when vector_transfer_read is further lowered.
Additionally, some minor cleanups and refactorings are performed.
One notable thing missing here is the composition with a projection map during
materialization. This is because I could not find an AffineMap composition
that operates on AffineMap directly: everything related to composition seems
to require going through SSAValue and only operates on AffinMap at a distance
via AffineValueMap. I have raised this concern a bunch of times already, the
followup CL will actually do something about it.
In the meantime, the projection is hacked at a minimum to pass verification
and materialiation tests are temporarily incorrect.
PiperOrigin-RevId: 224376828
The recently introduced `select` operation enables ConvertToCFG to support
min(max) in loop bounds. Individual min(max) is implemented as
`cmpi "lt"`(`cmpi "gt"`) followed by a `select` between the compared values.
Multiple results of an `affine_apply` operation extracted from the loop bounds
are reduced using min(max) in a sequential manner. While this may decrease the
potential for instruction-level parallelism, it is easier to recognize for the
following passes, in particular for the vectorizer.
PiperOrigin-RevId: 224376233
The implementation of OpPointer<OpType> provides an implicit conversion to
Operation *, but not to the underlying OpType *. This has led to
awkward-looking code when an OpPointer needs to be passed to a function
accepting an OpType *. For example,
if (auto someOp = genericOp.dyn_cast<OpType>())
someFunction(&*someOp);
where "&*" makes it harder to read. Arguably, one does not want to spell out
OpPointer<OpType> in the line with dyn_cast. More generally, OpPointer is now
being used as an owning pointer to OpType rather than to operation.
Replace the implicit conversion to Operation* with the conversion to OpType*
taking into account const-ness of the type. An Operation* can be obtained from
an OpType with a simple call. Since an instance of OpPointer owns the OpType
value, the pointer to it is never null. However, the OpType value may not be
associated with any Operation*. In this case, return nullptr when conversion
is attempted to maintain consistency with the existing null checks.
PiperOrigin-RevId: 224368103
cl/224246657); eliminate repeated evaluation of exprs in loop upper bounds.
- while on this, sweep through and fix potential repeated evaluation of
expressions in loop upper bounds
PiperOrigin-RevId: 224268918
update/improve/clean up API.
- update FlatAffineConstraints::getConstBoundDifference; return constant
differences between symbolic affine expressions, look at equalities as well.
- fix buffer size computation when generating DMAs symbolic in outer loops,
correctly handle symbols at various places (affine access maps, loop bounds,
loop IVs outer to the depth at which DMA generation is being done)
- bug fixes / complete some TODOs for getMemRefRegion
- refactor common code b/w memref dependence check and getMemRefRegion
- FlatAffineConstraints API update; added methods employ trivial checks /
detection - sufficient to handle hyper-rectangular cases in a precise way
while being fast / low complexity. Hyper-rectangular cases fall out as
trivial cases for these methods while other cases still do not cause failure
(either return conservative or return failure that is handled by the caller).
PiperOrigin-RevId: 224229879
The condition of the "if" statement is an integer set, defined as a conjunction
of affine constraints. An affine constraints consists of an affine expression
and a flag indicating whether the expression is strictly equal to zero or is
also allowed to be greater than zero. Affine maps, accepted by `affine_apply`
are also formed from affine expressions. Leverage this fact to implement the
checking of "if" conditions. Each affine expression from the integer set is
converted into an affine map. This map is applied to the arguments of the "if"
statement. The result of the application is compared with zero given the
equality flag to obtain the final boolean value. The conjunction of conditions
is tested sequentially with short-circuit branching to the "else" branch if any
of the condition evaluates to false.
Create an SESE region for the if statement (including its "then" and optional
"else" statement blocks) and append it to the end of the current region. The
conditional region consists of a sequence of condition-checking blocks that
implement the short-circuit scheme, followed by a "then" SESE region and an
"else" SESE region, and the continuation block that post-dominates all blocks
of the "if" statement. The flow of blocks that correspond to the "then" and
"else" clauses are constructed recursively, enabling easy nesting of "if"
statements and if-then-else-if chains.
Note that MLIR semantics does not require nor prohibit short-circuit
evaluation. Since affine expressions do not have side effects, there is no
observable difference in the program behavior. We may trade off extra
operations for operation-level parallelism opportunity by first performing all
`affine_apply` and comparison operations independently, and then performing a
tree pattern reduction of the resulting boolean values with the `muli i1`
operations (in absence of the dedicated bit operations). The pros and cons are
not clear, and since MLIR does not include parallel semantics, we prefer to
minimize the number of sequentially executed operations.
PiperOrigin-RevId: 223970248
This CL implements and uses VectorTransferOps in lieu of the former custom
call op. Tests are updated accordingly.
VectorTransferOps come in 2 flavors: VectorTransferReadOp and
VectorTransferWriteOp.
VectorTransferOps can be thought of as a backend-independent
pseudo op/library call that needs to be legalized to MLIR (whiteboxed) before
it can be lowered to backend-dependent IR.
Note that the current implementation does not yet support a real permutation
map. Proper support will come in a followup CL.
VectorTransferReadOp
====================
VectorTransferReadOp performs a blocking read from a scalar memref
location into a super-vector of the same elemental type. This operation is
called 'read' by opposition to 'load' because the super-vector granularity
is generally not representable with a single hardware register. As a
consequence, memory transfers will generally be required when lowering
VectorTransferReadOp. A VectorTransferReadOp is thus a mid-level abstraction
that supports super-vectorization with non-effecting padding for full-tile
only code.
A vector transfer read has semantics similar to a vector load, with additional
support for:
1. an optional value of the elemental type of the MemRef. This value
supports non-effecting padding and is inserted in places where the
vector read exceeds the MemRef bounds. If the value is not specified,
the access is statically guaranteed to be within bounds;
2. an attribute of type AffineMap to specify a slice of the original
MemRef access and its transposition into the super-vector shape. The
permutation_map is an unbounded AffineMap that must represent a
permutation from the MemRef dim space projected onto the vector dim
space.
Example:
```mlir
%A = alloc(%size1, %size2, %size3, %size4) : memref<?x?x?x?xf32>
...
%val = `ssa-value` : f32
// let %i, %j, %k, %l be ssa-values of type index
%v0 = vector_transfer_read %src, %i, %j, %k, %l
{permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
(memref<?x?x?x?xf32>, index, index, index, index) ->
vector<16x32x64xf32>
%v1 = vector_transfer_read %src, %i, %j, %k, %l, %val
{permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
(memref<?x?x?x?xf32>, index, index, index, index, f32) ->
vector<16x32x64xf32>
```
VectorTransferWriteOp
=====================
VectorTransferWriteOp performs a blocking write from a super-vector to
a scalar memref of the same elemental type. This operation is
called 'write' by opposition to 'store' because the super-vector
granularity is generally not representable with a single hardware register. As
a consequence, memory transfers will generally be required when lowering
VectorTransferWriteOp. A VectorTransferWriteOp is thus a mid-level
abstraction that supports super-vectorization with non-effecting padding
for full-tile only code.
A vector transfer write has semantics similar to a vector store, with
additional support for handling out-of-bounds situations.
Example:
```mlir
%A = alloc(%size1, %size2, %size3, %size4) : memref<?x?x?x?xf32>.
%val = `ssa-value` : vector<16x32x64xf32>
// let %i, %j, %k, %l be ssa-values of type index
vector_transfer_write %val, %src, %i, %j, %k, %l
{permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
(vector<16x32x64xf32>, memref<?x?x?x?xf32>, index, index, index, index)
```
PiperOrigin-RevId: 223873234
The check for whether the memref was used in a non-derefencing context had to
be done inside, i.e., only for the op stmt's that the replacement was specified
to be performed on (by the domStmtFilter arg if provided). As such, it is
completely fine for example for a function to return a memref while the replacement
is being performed only a specific loop's body (as in the case of DMA
generation).
PiperOrigin-RevId: 223827753
The algorithm collects defining operations within a scoped hash table. The scopes within the hash table correspond to nodes within the dominance tree for a function. This cl only adds support for simple operations, i.e non side-effecting. Such operations, e.g. load/store/call, will be handled in later patches.
PiperOrigin-RevId: 223811328
class. This change is NFC, but allows for new kinds of patterns, specifically
LegalizationPatterns which will be allowed to change the types of things they
rewrite.
PiperOrigin-RevId: 223243783
Several things were suggested in post-submission reviews. In particular, use
pointers in function interfaces instead of references (still use references
internally). Clarify the behavior of the pass in presence of MLFunctions.
PiperOrigin-RevId: 222556851