Currently, regions can only be constructed by passing in a `Function` or an
`Instruction` pointer referencing the parent object, unlike `Function`s or
`Instruction`s themselves that can be created without a parent. It leads to a
rather complex flow in operation construction where one has to create the
operation first before being able to work with its regions. It may be
necessary to work with the regions before the operation is created. In
particular, in `build` and `parse` functions that are executed _before_ the
operation is created in cases where boilerplate region manipulation is required
(for example, inserting the hypothetical default terminator in affine regions).
Allow creating standalone regions. Such regions are meant to own a list of
blocks and transfer them to other regions on demand.
Each instruction stores a fixed number of regions as trailing objects and has
ownership of them. This decreases the size of the Instruction object for the
common case of instructions without regions. Keep this behavior intact. To
allow some flexibility in construction, make OperationState store an owning
vector of regions. When the Builder creates an Instruction from
OperationState, the bodies of the regions are transferred into the
instruction-owned regions to minimize copying. Thus, it becomes possible to
fill standalone regions with blocks and move them to an operation when it is
constructed, or move blocks from a region to an operation region, e.g., for
inlining.
PiperOrigin-RevId: 240368183
a pointer. This makes it consistent with all the other methods in
FunctionPass, as well as with ModulePass::getModule(). NFC.
PiperOrigin-RevId: 240257910
inherited constructors, which is cleaner and means you can now use DimOp()
to get a null op, instead of having to use Instruction::getNull<DimOp>().
This removes another 200 lines of code.
PiperOrigin-RevId: 240068113
- change this for consistency - everything else similar takes/returns a
Function pointer - the FuncBuilder ctor,
Block/Value/Instruction::getFunction(), etc.
- saves a whole bunch of &s everywhere
PiperOrigin-RevId: 236928761
This CL changes dialect op source files (.h, .cpp, .td) to follow the following
convention:
<full-dialect-name>/<dialect-namespace>Ops.{h|cpp|td}
Builtin and standard dialects are specially treated, though. Both of them do
not have dialect namespace; the former is still named as BuiltinOps.* and the
latter is named as Ops.*.
Purely mechanical. NFC.
PiperOrigin-RevId: 236371358
A performance issue was reported due to the usage of NestedMatcher in
ComposeAffineMaps. The main culprit was the ubiquitous copies that were
occuring when appending even a single element in `matchOne`.
This CL generally simplifies the implementation and removes one level of indirection by getting rid of
auxiliary storage as well as simplifying the API.
The users of the API are updated accordingly.
The implementation was tested on a heavily unrolled example with
ComposeAffineMaps and is now close in performance with an implementation based
on stateless InstWalker.
As a reminder, the whole ComposeAffineMaps pass is slated to disappear but the bug report was very useful as a stress test for NestedMatchers.
Lastly, the following cleanups reported by @aminim were addressed:
1. make NestedPatternContext scoped within runFunction rather than at the Pass level. This was caused by a previous misunderstanding of Pass lifetime;
2. use defensive assertions in the constructor of NestedPatternContext to make it clear a unique such locally scoped context is allowed to exist.
PiperOrigin-RevId: 231781279
This CL follows up on a memory leak issue related to SmallVector growth that
escapes the BumpPtrAllocator.
The fix is to properly use ArrayRef and placement new to define away the
issue.
The following renaming is also applied:
1. MLFunctionMatcher -> NestedPattern
2. MLFunctionMatches -> NestedMatch
As a consequence all allocations are now guaranteed to live on the BumpPtrAllocator.
PiperOrigin-RevId: 231047766
Example:
dma-generate options:
-dma-fast-mem-capacity - Set fast memory space ...
-dma-fast-mem-space=<uint> - Set fast memory space ...
loop-fusion options:
-fusion-compute-tolerance=<number> - Fractional increase in ...
-fusion-maximal - Enables maximal loop fusion
loop-tile options:
-tile-size=<uint> - Use this tile size for ...
loop-unroll options:
-unroll-factor=<uint> - Use this unroll factor ...
-unroll-full - Fully unroll loops
-unroll-full-threshold=<uint> - Unroll all loops with ...
-unroll-num-reps=<uint> - Unroll innermost loops ...
loop-unroll-jam options:
-unroll-jam-factor=<uint> - Use this unroll jam factor ...
PiperOrigin-RevId: 231019363
Function::walk functionality into f->walkInsts/Ops which allows visiting all
instructions, not just ops. Eliminate Function::getBody() and
Function::getReturn() helpers which crash in CFG functions, and were only kept
around as a bridge.
This is step 25/n towards merging instructions and statements.
PiperOrigin-RevId: 227243966
consistent and moving the using declarations over. Hopefully this is the last
truly massive patch in this refactoring.
This is step 21/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227178245
StmtResult -> InstResult, StmtOperand -> InstOperand, and remove the old names.
This is step 17/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227121537
FuncBuilder class. Also rename SSAValue.cpp to Value.cpp
This is step 12/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227067644
is the new base of the SSA value hierarchy. This CL also standardizes all the
nomenclature and comments to use 'Value' where appropriate. This also eliminates a large number of cast<MLValue>(x)'s, which is very soothing.
This is step 11/n towards merging instructions and statements, NFC.
PiperOrigin-RevId: 227064624
Supervectorization uses null pointers to SSA values as a means of communicating
the failure to vectorize. In operation vectorization, all operations producing
the values of operation arguments must be vectorized for the given operation to
be vectorized. The existing check verified if any of the value "def"
statements was vectorized instead, sometimes leading to assertions inside `isa`
called on a null pointer. Fix this to check that all "def" statements were
vectorized.
PiperOrigin-RevId: 226941552
clients to use OperationState instead. This makes MLFuncBuilder more similiar
to CFGFuncBuilder. This whole area will get tidied up more when cfg and ml
worlds get unified. This patch is just gardening, NFC.
PiperOrigin-RevId: 226701959
From the beginning, vector_transfer_read and vector_transfer_write opreations
were intended as a mid-level vectorization abstraction. In particular, they
are lowered to the StandardOps dialect before further processing. As such, it
does not make sense to keep them at the same level as StandardOps. Introduce
the new SuperVectorOps dialect and move vector_transfer_* operations there.
This will be used as a testbed for the generic lowering/legalization pass.
PiperOrigin-RevId: 225554492
This CLs adds proper error emission, removes NYI assertions and documents
assumptions that are required in the relevant functions.
PiperOrigin-RevId: 224377207
This CL hooks up and uses permutation_map in vector_transfer ops.
In particular, when going into the nuts and bolts of the implementation, it
became clear that cases arose that required supporting broadcast semantics.
Broadcast semantics are thus added to the general permutation_map.
The verify methods and tests are updated accordingly.
Examples of interest include.
Example 1:
The following MLIR snippet:
```mlir
for %i3 = 0 to %M {
for %i4 = 0 to %N {
for %i5 = 0 to %P {
%a5 = load %A[%i4, %i5, %i3] : memref<?x?x?xf32>
}}}
```
may vectorize with {permutation_map: (d0, d1, d2) -> (d2, d1)} into:
```mlir
for %i3 = 0 to %0 step 32 {
for %i4 = 0 to %1 {
for %i5 = 0 to %2 step 256 {
%4 = vector_transfer_read %arg0, %i4, %i5, %i3
{permutation_map: (d0, d1, d2) -> (d2, d1)} :
(memref<?x?x?xf32>, index, index) -> vector<32x256xf32>
}}}
````
Meaning that vector_transfer_read will be responsible for reading the 2-D slice:
`%arg0[%i4, %i5:%15+256, %i3:%i3+32]` into vector<32x256xf32>. This will
require a transposition when vector_transfer_read is further lowered.
Example 2:
The following MLIR snippet:
```mlir
%cst0 = constant 0 : index
for %i0 = 0 to %M {
%a0 = load %A[%cst0, %cst0] : memref<?x?xf32>
}
```
may vectorize with {permutation_map: (d0) -> (0)} into:
```mlir
for %i0 = 0 to %0 step 128 {
%3 = vector_transfer_read %arg0, %c0_0, %c0_0
{permutation_map: (d0, d1) -> (0)} :
(memref<?x?xf32>, index, index) -> vector<128xf32>
}
````
Meaning that vector_transfer_read will be responsible of reading the 0-D slice
`%arg0[%c0, %c0]` into vector<128xf32>. This will require a 1-D vector
broadcast when vector_transfer_read is further lowered.
Additionally, some minor cleanups and refactorings are performed.
One notable thing missing here is the composition with a projection map during
materialization. This is because I could not find an AffineMap composition
that operates on AffineMap directly: everything related to composition seems
to require going through SSAValue and only operates on AffinMap at a distance
via AffineValueMap. I have raised this concern a bunch of times already, the
followup CL will actually do something about it.
In the meantime, the projection is hacked at a minimum to pass verification
and materialiation tests are temporarily incorrect.
PiperOrigin-RevId: 224376828
This CL implements and uses VectorTransferOps in lieu of the former custom
call op. Tests are updated accordingly.
VectorTransferOps come in 2 flavors: VectorTransferReadOp and
VectorTransferWriteOp.
VectorTransferOps can be thought of as a backend-independent
pseudo op/library call that needs to be legalized to MLIR (whiteboxed) before
it can be lowered to backend-dependent IR.
Note that the current implementation does not yet support a real permutation
map. Proper support will come in a followup CL.
VectorTransferReadOp
====================
VectorTransferReadOp performs a blocking read from a scalar memref
location into a super-vector of the same elemental type. This operation is
called 'read' by opposition to 'load' because the super-vector granularity
is generally not representable with a single hardware register. As a
consequence, memory transfers will generally be required when lowering
VectorTransferReadOp. A VectorTransferReadOp is thus a mid-level abstraction
that supports super-vectorization with non-effecting padding for full-tile
only code.
A vector transfer read has semantics similar to a vector load, with additional
support for:
1. an optional value of the elemental type of the MemRef. This value
supports non-effecting padding and is inserted in places where the
vector read exceeds the MemRef bounds. If the value is not specified,
the access is statically guaranteed to be within bounds;
2. an attribute of type AffineMap to specify a slice of the original
MemRef access and its transposition into the super-vector shape. The
permutation_map is an unbounded AffineMap that must represent a
permutation from the MemRef dim space projected onto the vector dim
space.
Example:
```mlir
%A = alloc(%size1, %size2, %size3, %size4) : memref<?x?x?x?xf32>
...
%val = `ssa-value` : f32
// let %i, %j, %k, %l be ssa-values of type index
%v0 = vector_transfer_read %src, %i, %j, %k, %l
{permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
(memref<?x?x?x?xf32>, index, index, index, index) ->
vector<16x32x64xf32>
%v1 = vector_transfer_read %src, %i, %j, %k, %l, %val
{permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
(memref<?x?x?x?xf32>, index, index, index, index, f32) ->
vector<16x32x64xf32>
```
VectorTransferWriteOp
=====================
VectorTransferWriteOp performs a blocking write from a super-vector to
a scalar memref of the same elemental type. This operation is
called 'write' by opposition to 'store' because the super-vector
granularity is generally not representable with a single hardware register. As
a consequence, memory transfers will generally be required when lowering
VectorTransferWriteOp. A VectorTransferWriteOp is thus a mid-level
abstraction that supports super-vectorization with non-effecting padding
for full-tile only code.
A vector transfer write has semantics similar to a vector store, with
additional support for handling out-of-bounds situations.
Example:
```mlir
%A = alloc(%size1, %size2, %size3, %size4) : memref<?x?x?x?xf32>.
%val = `ssa-value` : vector<16x32x64xf32>
// let %i, %j, %k, %l be ssa-values of type index
vector_transfer_write %val, %src, %i, %j, %k, %l
{permutation_map: (d0, d1, d2, d3) -> (d3, d1, d2)} :
(vector<16x32x64xf32>, memref<?x?x?x?xf32>, index, index, index, index)
```
PiperOrigin-RevId: 223873234
This CL refactors a few things in Vectorize.cpp:
1. a clear distinction is made between:
a. the LoadOp are the roots of vectorization and must be vectorized
eagerly and propagate their value; and
b. the StoreOp which are the terminals of vectorization and must be
vectorized late (i.e. they do not produce values that need to be
propagated).
2. the StoreOp must be vectorized late because in general it can store a value
that is not reachable from the subset of loads defined in the
current pattern. One trivial such case is storing a constant defined at the
top-level of the MLFunction and that needs to be turned into a splat.
3. a description of the algorithm is given;
4. the implementation matches the algorithm;
5. the last example is made parametric, in practice it will fully rely on the
implementation of vector_transfer_read/write which will handle boundary
conditions and padding. This will happen by lowering to a lower-level
abstraction either:
a. directly in MLIR (whether DMA or just loops or any async tasks in the
future) (whiteboxing);
b. in LLO/LLVM-IR/whatever blackbox library call/ search + swizzle inventor
one may want to use;
c. a partial mix of a. and b. (grey-boxing)
5. minor cleanups are applied;
6. mistakenly disabled unit tests are re-enabled (oopsie).
With this CL, this MLIR snippet:
```
mlfunc @vector_add_2d(%M : index, %N : index) -> memref<?x?xf32> {
%A = alloc (%M, %N) : memref<?x?xf32>
%B = alloc (%M, %N) : memref<?x?xf32>
%C = alloc (%M, %N) : memref<?x?xf32>
%f1 = constant 1.0 : f32
%f2 = constant 2.0 : f32
for %i0 = 0 to %M {
for %i1 = 0 to %N {
// non-scoped %f1
store %f1, %A[%i0, %i1] : memref<?x?xf32>
}
}
for %i4 = 0 to %M {
for %i5 = 0 to %N {
%a5 = load %A[%i4, %i5] : memref<?x?xf32>
%b5 = load %B[%i4, %i5] : memref<?x?xf32>
%s5 = addf %a5, %b5 : f32
// non-scoped %f1
%s6 = addf %s5, %f1 : f32
store %s6, %C[%i4, %i5] : memref<?x?xf32>
}
}
return %C : memref<?x?xf32>
}
```
vectorized with these arguments:
```
-vectorize -virtual-vector-size 256 --test-fastest-varying=0
```
vectorization produces this standard innermost-loop vectorized code:
```
mlfunc @vector_add_2d(%arg0 : index, %arg1 : index) -> memref<?x?xf32> {
%0 = alloc(%arg0, %arg1) : memref<?x?xf32>
%1 = alloc(%arg0, %arg1) : memref<?x?xf32>
%2 = alloc(%arg0, %arg1) : memref<?x?xf32>
%cst = constant 1.000000e+00 : f32
%cst_0 = constant 2.000000e+00 : f32
for %i0 = 0 to %arg0 {
for %i1 = 0 to %arg1 step 256 {
%cst_1 = constant splat<vector<256xf32>, 1.000000e+00> : vector<256xf32>
"vector_transfer_write"(%cst_1, %0, %i0, %i1) : (vector<256xf32>, memref<?x?xf32>, index, index) -> ()
}
}
for %i2 = 0 to %arg0 {
for %i3 = 0 to %arg1 step 256 {
%3 = "vector_transfer_read"(%0, %i2, %i3) : (memref<?x?xf32>, index, index) -> vector<256xf32>
%4 = "vector_transfer_read"(%1, %i2, %i3) : (memref<?x?xf32>, index, index) -> vector<256xf32>
%5 = addf %3, %4 : vector<256xf32>
%cst_2 = constant splat<vector<256xf32>, 1.000000e+00> : vector<256xf32>
%6 = addf %5, %cst_2 : vector<256xf32>
"vector_transfer_write"(%6, %2, %i2, %i3) : (vector<256xf32>, memref<?x?xf32>, index, index) -> ()
}
}
return %2 : memref<?x?xf32>
}
```
Of course, much more intricate n-D imperfectly-nested patterns can be emitted too in a fully declarative fashion, but this is enough for now.
PiperOrigin-RevId: 222280209
This CL adds some vector support in prevision of the upcoming vector
materialization pass. In particular this CL adds 2 functions to:
1. compute the multiplicity of a subvector shape in a supervector shape;
2. help match operations on strict super-vectors. This is defined for a given
subvector shape as an operation that manipulates a vector type that is an
integral multiple of the subtype, with multiplicity at least 2.
This CL also adds a TestUtil pass where we can dump arbitrary testing of
functions and analysis that operate at a much smaller granularity than a pass
(e.g. an analysis for which it is convenient to write a bit of artificial MLIR
and write some custom test). This is in order to keep using Filecheck for
things that essentially look and feel like C++ unit tests.
PiperOrigin-RevId: 222250910
This CL adds support for and a vectorization test to perform scalar 2-D addf.
The support extension notably comprises:
1. extend vectorizable test to exclude vector_transfer operations and
expose them to LoopAnalysis where they are needed. This is a temporary
solution a concrete MLIR Op exists;
2. add some more functional sugar mapKeys, apply and ScopeGuard (which became
relevant again);
3. fix improper shifting during coarsening;
4. rename unaligned load/store to vector_transfer_read/write and simplify the
design removing the unnecessary AllocOp that were introduced prematurely:
vector_transfer_read currently has the form:
(memref<?x?x?xf32>, index, index, index) -> vector<32x64x256xf32>
vector_transfer_write currently has the form:
(vector<32x64x256xf32>, memref<?x?x?xf32>, index, index, index) -> ()
5. adds vectorizeOperations which traverses the operations in a ForStmt and
rewrites them to their vector form;
6. add support for vector splat from a constant.
The relevant tests are also updated.
PiperOrigin-RevId: 221421426
Value type abstraction for locations differ from others in that a Location can NOT be null. NOTE: dyn_cast returns an Optional<T>.
PiperOrigin-RevId: 220682078
The passID is not currently stored in Pass but this avoids the unused variable warning. The passID is used to uniquely identify passes, currently this is only stored/used in PassInfo.
PiperOrigin-RevId: 220485662
Add static pass registration and change mlir-opt to use it. Future work is needed to refactor the registration for PassManager usage.
Change build targets to alwayslink to enforce registration.
PiperOrigin-RevId: 220390178
This CL adds support for vectorization using more interesting 2-D and 3-D
patterns. Note in particular the fact that we match some pretty complex
imperfectly nested 2-D patterns with a quite minimal change to the
implementation: we just add a bit of recursion to traverse the matched
patterns and actually vectorize the loops.
For instance, vectorizing the following loop by 128:
```
for %i3 = 0 to %0 {
%7 = affine_apply (d0) -> (d0)(%i3)
%8 = load %arg0[%c0_0, %7] : memref<?x?xf32>
}
```
Currently generates:
```
#map0 = ()[s0] -> (s0 + 127)
#map1 = (d0) -> (d0)
for %i3 = 0 to #map0()[%0] step 128 {
%9 = affine_apply #map1(%i3)
%10 = alloc() : memref<1xvector<128xf32>>
%11 = "n_d_unaligned_load"(%arg0, %c0_0, %9, %10, %c0) :
(memref<?x?xf32>, index, index, memref<1xvector<128xf32>>, index) ->
(memref<?x?xf32>, index, index, memref<1xvector<128xf32>>, index)
%12 = load %10[%c0] : memref<1xvector<128xf32>>
}
```
The above is subject to evolution.
PiperOrigin-RevId: 219629745
Introduce analysis to check memref accesses (in MLFunctions) for out of bound
ones. It works as follows:
$ mlir-opt -memref-bound-check test/Transforms/memref-bound-check.mlir
/tmp/single.mlir:10:12: error: 'load' op memref out of upper bound access along dimension tensorflow/mlir#1
%x = load %A[%idxtensorflow/mlir#0, %idxtensorflow/mlir#1] : memref<9 x 9 x i32>
^
/tmp/single.mlir:10:12: error: 'load' op memref out of lower bound access along dimension tensorflow/mlir#1
%x = load %A[%idxtensorflow/mlir#0, %idxtensorflow/mlir#1] : memref<9 x 9 x i32>
^
/tmp/single.mlir:10:12: error: 'load' op memref out of upper bound access along dimension tensorflow/mlir#2
%x = load %A[%idxtensorflow/mlir#0, %idxtensorflow/mlir#1] : memref<9 x 9 x i32>
^
/tmp/single.mlir:10:12: error: 'load' op memref out of lower bound access along dimension tensorflow/mlir#2
%x = load %A[%idxtensorflow/mlir#0, %idxtensorflow/mlir#1] : memref<9 x 9 x i32>
^
/tmp/single.mlir:12:12: error: 'load' op memref out of upper bound access along dimension tensorflow/mlir#1
%y = load %B[%idy] : memref<128 x i32>
^
/tmp/single.mlir:12:12: error: 'load' op memref out of lower bound access along dimension tensorflow/mlir#1
%y = load %B[%idy] : memref<128 x i32>
^
#map0 = (d0, d1) -> (d0, d1)
#map1 = (d0, d1) -> (d0 * 128 - d1)
mlfunc @test() {
%0 = alloc() : memref<9x9xi32>
%1 = alloc() : memref<128xi32>
for %i0 = -1 to 9 {
for %i1 = -1 to 9 {
%2 = affine_apply #map0(%i0, %i1)
%3 = load %0[%2tensorflow/mlir#0, %2tensorflow/mlir#1] : memref<9x9xi32>
%4 = affine_apply #map1(%i0, %i1)
%5 = load %1[%4] : memref<128xi32>
}
}
return
}
- Improves productivity while manually / semi-automatically developing MLIR for
testing / prototyping; also provides an indirect way to catch errors in
transformations.
- This pass is an easy way to test the underlying affine analysis
machinery including low level routines.
Some code (in getMemoryRegion()) borrowed from @andydavis cl/218263256.
While on this:
- create mlir/Analysis/Passes.h; move Pass.h up from mlir/Transforms/ to mlir/
- fix a bug in AffineAnalysis.cpp::toAffineExpr
TODO: extend to non-constant loop bounds (straightforward). Will transparently
work for all accesses once floordiv, mod, ceildiv are supported in the
AffineMap -> FlatAffineConstraints conversion.
PiperOrigin-RevId: 219397961
This is done by changing Type to be a POD interface around an underlying pointer storage and adding in-class support for isa/dyn_cast/cast.
PiperOrigin-RevId: 219372163
This CL is a first in a series that implements early vectorization of
increasingly complex patterns. In particular, early vectorization will support
arbitrary loop nesting patterns (both perfectly and imperfectly nested), at
arbitrary depths in the loop tree.
This first CL builds the minimal support for applying 1-D patterns.
It relies on an unaligned load/store op abstraction that can be inplemented
differently on different HW.
Future CLs will support higher dimensional patterns, but 1-D patterns already
exhibit interesting properties.
In particular, we want to separate pattern matching (i.e. legality both
structural and dependency analysis based), from profitability analysis, from
application of the transformation.
As a consequence patterns may intersect and we need to verify that a pattern
can still apply by the time we get to applying it.
A non-greedy analysis on profitability that takes into account pattern
intersection is left for future work.
Additionally the CL makes the following cleanups:
1. the matches method now returns a value, not a reference;
2. added comments about the MLFunctionMatcher and MLFunctionMatches usage by
value;
3. added size and empty methods to matches;
4. added a negative vectorization test with a conditional, this exhibited a
but in the iterators. Iterators now return nullptr if the underlying storage
is nullpt.
PiperOrigin-RevId: 219299489
distinction. FunctionPasses can now choose to get called on all functions, or
have the driver split CFG/ML Functions up for them. NFC.
PiperOrigin-RevId: 218775885
This CL implements a very simple loop vectorization **test** and the basic
infrastructure to support it.
The test simply consists in:
1. matching the loops in the MLFunction and all the Load/Store operations
nested under the loop;
2. testing whether all the Load/Store are contiguous along the innermost
memory dimension along that particular loop. If any reference is
non-contiguous (i.e. the ForStmt SSAValue appears in the expression), then
the loop is not-vectorizable.
The simple test above can gradually be extended with more interesting
behaviors to account for the fact that a layout permutation may exist that
enables contiguity etc. All these will come in due time but it is worthwhile
noting that the test already supports detection of outer-vetorizable loops.
In implementing this test, I also added a recursive MLFunctionMatcher and some
sugar that can capture patterns
such as `auto gemmLike = Doall(Doall(Red(LoadStore())))` and allows iterating
on the matched IR structures. For now it just uses in order traversal but
post-order DFS will be useful in the future once IR rewrites start occuring.
One may note that the memory management design decision follows a different
pattern from MLIR. After evaluating different designs and how they quickly
increase cognitive overhead, I decided to opt for the simplest solution in my
view: a class-wide (threadsafe) RAII context.
This way, a pass that needs MLFunctionMatcher can just have its own locally
scoped BumpPtrAllocator and everything is cleaned up when the pass is destroyed.
If passes are expected to have a longer lifetime, then the contexts can easily
be scoped inside the runOnMLFunction call and storage lifetime reduced.
Lastly, whatever the scope of threading (module, function, pass), this is
expected to also be future-proof wrt concurrency (but this is a detail atm).
PiperOrigin-RevId: 217622889