Preface
Recently I did some OneFlow IR related development with the help of my colleague Shenghang, and gained some new understanding of the execution side of MLIR, which I will try to share here. I spent a lot of time understanding the overall architecture of OneFlow IR (see my Toy Tutorials series), but I always had doubts about the JIT implementation of OneFlow IR. Recently I reorganized the part of OneFlow that plugs the Job (OneFlow's Job can be understood as a computation graph that does not consider devices) into MLIR, and went through the whole process under Shenghang's guidance. So in this article I will introduce how OneFlow and MLIR are combined, how to add a graph-level Pass in OneFlow IR, how OneFlow's Operations automatically become MLIR Operations, and why OneFlow IR can use MLIR to accelerate computation. I do not know MLIR very deeply, having started with it only about two months ago, so please point out and correct any mistakes. The article relates to https://github.com/OneFlow-Inc/OneFlow and https://github.com/BBuf/tvm_mlir_learn; if you are interested, you are welcome to star and follow them.
Op and Operation in this article refer to the same thing; no strict distinction is made between them.
How does OneFlow work with MLIR?
Introducing MLIR into OneFlow as OneFlow's IR has many advantages. It can not only replace the Operation definitions handwritten in C++ inside OneFlow and reduce the development difficulty, but also reduce the container-related overhead in those Operation definitions. In addition, we can use the infrastructure maintained by MLIR (i.e. its multiple Dialects) to accelerate the execution of computation graphs. The computation graph here can be either an eager computation graph or a lazy computation graph. Since accelerating eager computation graphs with MLIR (i.e. oneflow.jit.xxx) has not been officially released yet, I will use the lazy computation graph (the Job) as the example to explain how OneFlow and MLIR are combined.
First, we need to build OneFlow with MLIR enabled. The build commands are as follows:
git clone git@github.com:Oneflow-Inc/oneflow.git
cd oneflow && mkdir build && cd build
cmake -C ../cmake/caches/cn/fast/mlir-cuda-75.cmake -DBUILD_TESTING=ON .. && ninja
Then you can write an example to test:
import os
import unittest
import numpy as np
import oneflow as flow
import oneflow.unittest

os.environ["ONEFLOW_MLIR_ENABLE_ROUND_TRIP"] = '1'
os.environ["ONEFLOW_MLIR_ENABLE_CODEGEN_FUSERS"] = '1'

@flow.unittest.skip_unless_1n1d()
class TestFuseBiasAddGeLUCPUMLIR(oneflow.unittest.TestCase):
    def test_fused_bias_add_gelu_graph(test_case):
        data = np.random.randn(1, 2, 3)
        bias_data = np.random.randn(2)
        x = flow.tensor(data, dtype=flow.float32)
        bias = flow.tensor(bias_data, dtype=flow.float32)
        # Eager reference result
        y_eager = flow.gelu(flow._C.bias_add(x, bias, axis=1))

        class FuseBiasAddGeLUGraph(flow.nn.Graph):
            def __init__(self):
                super().__init__()

            def build(self, x):
                return flow.gelu(flow._C.bias_add(x, bias, axis=1))

        bias_add_gelu = FuseBiasAddGeLUGraph()
        y_lazy = bias_add_gelu(x)
        test_case.assertTrue(np.array_equal(y_eager.numpy(), y_lazy.numpy()))

if __name__ == "__main__":
    unittest.main()
After running this example, a log folder will be generated in the current working directory. Its ir_pass subfolder records the computation graphs (*.prototxt) before and after OneFlow's MLIR optimization, the MLIR expressions (*.mlir), and *.mlir.dot files that can be opened with Graphviz to visualize the computation graph of the MLIR expression. Note that if OneFlow is running a training task, this log folder contains not only the forward computation graphs and MLIR expressions, but also the backward computation graphs and MLIR expressions. Therefore MLIR can take effect over the whole run of the neural network, which is an important difference from inference-only frameworks: training can also be accelerated.
In oneflow/api/python/ir.cpp there are the following two lines of code:
REGISTER_JOB_PASS("IRRoundTripBeforeAD", IRRoundTrip<kBeforeAD>);
REGISTER_JOB_PASS("IRRoundTrip", IRRoundTrip<kAfterAD>);
RoundTrip means a round trip; kBeforeAD can be understood as before autodiff (before the backward pass is generated), and kAfterAD as after autodiff. Here the connection between the OneFlow computation graph and MLIR is established by registering the mutual conversion between the OneFlow Job and MLIR as a OneFlow Job Pass. When running a OneFlow script, if you want MLIR to act on the OneFlow computation graph, set the environment variable ONEFLOW_MLIR_ENABLE_ROUND_TRIP=1.
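To make the round trip concrete, here is a minimal sketch, under assumptions, of how such a Job Pass can be structured; it is not the actual IRRoundTrip source. JobPass, JobPassCtx and Maybe<void> follow OneFlow's Job pass convention, and the enum is only a placeholder wrapping the stage tags that appear in the registration above.

// Minimal sketch (assumed, not the real IRRoundTrip implementation).
enum RoundTripStage { kBeforeAD, kAfterAD };  // placeholder enum for the stage tags above

template<RoundTripStage stage>
class IRRoundTrip final : public JobPass {
 public:
  Maybe<void> Apply(Job* job, JobPassCtx* ctx) const override {
    // 1. Translate the Job (a device-agnostic computation graph) into an MLIR
    //    module in the OneFlow Dialect.
    // 2. Run the MLIR pass pipeline on it (canonicalization, outlining, fusion, ...).
    // 3. Rewrite the Job from the optimized MLIR module and hand it back to
    //    the rest of OneFlow's compilation pipeline.
    return Maybe<void>::Ok();
  }
};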
Next, connecting the OneFlow computation graph with MLIR amounts to a one-to-one conversion between Operations in the OneFlow computation graph and Operations in MLIR. MLIR's Operations are defined under the various Dialects. Following MLIR's general integration principle, we implemented a OneFlow Dialect and realized the one-to-one mapping from OneFlow Operations to Operations under the OneFlow Dialect. How to define the OneFlow Dialect and its Operations is not covered here; you can refer to the Dialects and ODS sections of the official MLIR documentation (https://mlir.llvm.org/docs/OpDefinitions/) or my previous articles, which are based on TableGen rules. On the definition of MLIR Operations, I also summarized a document that draws on the Op definitions of the OneFlow Dialect (in https://github.com/BBuf/tvm_mlir_learn). Besides the definitions of the Dialect and Operations, other things also need to be defined: for example, oneflow/ir/include/OneFlow/OneFlowEnums.td defines the mapping from OneFlow data types to MLIR data types, and some common front-end interfaces of OneFlow Dialect Operations (such as the UserOpCompatibleInterface used below) are defined in .td files under the same directory. Here we take the Reshape Operation as an example to briefly explain the components of an Operation:
def OneFlow_ReshapeOp : OneFlow_BaseOp<"reshape", [NoSideEffect, DeclareOpInterfaceMethods<UserOpCompatibleInterface>]> {
  let input = (ins
    AnyType:$in
  );
  let output = (outs
    AnyType:$out
  );
  let attrs = (ins
    AnyI64ElementsAttr:$shape
  );
}
The name OneFlow_ReshapeOp is prefixed with the Dialect name, followed by the name of the Operation under that Dialect. This Operation inherits from the OneFlow_BaseOp base class and declares its constraints and front-end interfaces. Then the input, output and attributes of the Operation are defined. You can see that the definition of the OneFlow Dialect Operation is completely consistent with the definition of the OneFlow User Op, which guarantees the legality of the conversion between OneFlow and MLIR. The OneFlow Reshape Operation is defined as follows:
REGISTER_USER_OP("reshape")
    .Input("in")
    .Output("out")
    .Attr<Shape>("shape")
    ...
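One practical consequence of the ODS definition is that TableGen generates a typed C++ class for each Operation, so passes can build and inspect these ops with the usual MLIR APIs. Below is a hedged sketch assuming the generated class is mlir::oneflow::ReshapeOp and using the generic collective builder that ODS emits for every op; `input`, `resultType` and the attribute list are placeholders, and the real op carries more attributes (op_name, device_tag, ...) than shown.

// Hedged sketch: building a OneFlow Dialect reshape op through ODS's generic
// builder (result types, operands, attributes). Not taken from OneFlow source.
mlir::OpBuilder builder(context);
mlir::NamedAttrList attrs;
attrs.set("shape", builder.getI64TensorAttr({2, 3}));  // the AnyI64ElementsAttr:$shape attribute
auto reshape = builder.create<mlir::oneflow::ReshapeOp>(
    builder.getUnknownLoc(), mlir::TypeRange{resultType},
    mlir::ValueRange{input}, attrs.getAttrs());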
The mutual transformation of the OneFlow Job and MLIR is implemented in oneflow/ir/oneflow-translate. The main work is to traverse the OpGraph of the Job, process its nodes and edges, and finally convert them into an MLIR expression; conversely, after optimization finishes, the Job can be rewritten from the MLIR expression. The overall logic here is rather involved, because we need to handle the conversion of various kinds of operations and edges in the OneFlow Job OpGraph. We will not go further here, since it is not the point I want to discuss in this article; those interested can read the code directly.
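The translation roughly follows the pattern sketched below. This is only an outline under assumptions, not the actual oneflow-translate code: OpGraph::ForEachNode is OneFlow's graph traversal API, and the conversion steps are described in comments rather than real logic.

// Hedged sketch of the Job -> MLIR import loop (assumed structure).
void ImportJob(const OpGraph& op_graph, mlir::OpBuilder& builder) {
  op_graph.ForEachNode([&](OpNode* node) {
    // 1. Look up the MLIR Values already created for this node's input edges.
    // 2. Translate the node's user op conf (op type, attributes, device tag, ...)
    //    into attributes of the corresponding OneFlow Dialect Operation.
    // 3. Create the Operation with the builder and remember its results so that
    //    downstream nodes can reference them as operands.
  });
}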
How does OneFlow IR perform?
Following the Reshape example in the Operation definitions above, if you browse oneflow/ir/include/OneFlow/OneFlowOps.td it is easy to find that an OneFlow_MlirJitOp is also defined there. This custom Op is used to execute MLIR expressions: it implements CPU and GPU Kernels (the source code is in oneflow/ir/oneflow-extension/extension.cpp) that use the JIT execution engine provided by MLIR to run the LLVM IR obtained at the end. Where does this LLVM IR come from? It is obtained by lowering OneFlow's MLIR expression step by step. The specific lowering process is as follows:
void AddLowerToLinalgMemRefPasses(PassManager& pm) {
  pm.addPass(createLowerOneFlowToTosaPass());                 // lower-oneflow-to-tosa
  pm.addPass(createCSEPass());                                // cse
  pm.addNestedPass<FuncOp>(tosa::createTosaToLinalg());       // tosa-to-linalg-on-tensors
  auto p = createLinalgElementwiseOpFusionPass();
  assert(p->initializeOptions("allow-folding-unit-dim-reshapes=true").succeeded());
  pm.addNestedPass<FuncOp>(std::move(p));                     // linalg-fuse-elementwise-ops
  pm.addNestedPass<FuncOp>(createLinalgBufferizePass());      // linalg-bufferize
  pm.addNestedPass<FuncOp>(createTensorBufferizePass());      // tensor-bufferize
  pm.addPass(createTensorConstantBufferizePass());            // tensor-constant-bufferize
  pm.addPass(createFuncBufferizePass());                      // func-bufferize
  pm.addPass(createBufferResultsToOutParamsPass());           // buffer-results-to-out-params
  pm.addPass(createCanonicalizerPass());                      // canonicalize
  pm.addNestedPass<FuncOp>(createFinalizingBufferizePass());  // finalizing-bufferize
}

LogicalResult LowerModuleToLLVM(mlir::MLIRContext* context, ModuleOp module) {
  mlir::PassManager pm(context);
  AddLowerToLinalgMemRefPasses(pm);
  pm.addNestedPass<FuncOp>(createConvertLinalgToLoopsPass());  // convert-linalg-to-loops
  pm.addNestedPass<FuncOp>(createLowerToCFGPass());            // convert-scf-to-std
  pm.addPass(createConvertLinalgToLLVMPass());                 // convert-linalg-to-llvm
  pm.addPass(createMemRefToLLVMPass());                        // convert-memref-to-llvm
  pm.addPass(createLowerToLLVMPass());                         // convert-std-to-llvm
  pm.addPass(createReconcileUnrealizedCastsPass());
  return pm.run(module);
}
You can see that the OneFlow Dialect is first lowered to the Tosa Dialect, then to the Linalg Dialect, then to loops (the SCF Dialect), and finally to LLVM IR. During this step-by-step lowering, we can enjoy the optimization opportunities brought by the nested loop transformations of the Linalg Dialect to improve the performance of the final IR. The lowering here is triggered when OneFlow calls the Kernel of MlirJitOp (oneflow/ir/oneflow-extension/extension.cpp). The call itself is also added to the optimization process as an MLIR Pass. The implementation of the Pass that sets up this JIT call can be reduced to:
class OutlineJitFunctionPass : public OutlineJitFunctionPassBase<OutlineJitFunctionPass> {
  void runOnOperation() override {
    Operation* op = getOperation();
    RewritePatternSet patterns(op->getContext());
    oneflow::populateFuserPasses(patterns);
    (void)applyPatternsAndFoldGreedily(op, std::move(patterns));
  }
};

std::unique_ptr<Pass> createOutlineJitFunctionPass() {
  return std::make_unique<OutlineJitFunctionPass>();
}

LogicalResult ApplyRoundTripPatterns(RoundTripOneFlowJobWrapperInterface& job_wrapper,
                                     MLIRContext* context, OwningModuleRef& module) {
  mlir::PassManager pm(context);
  pm.addNestedPass<mlir::FuncOp>(::mlir::createCanonicalizerPass());
  if (job_wrapper.IsLastIRPass() && std::getenv("ONEFLOW_MLIR_ENABLE_CODEGEN_FUSERS") != nullptr) {
    pm.addPass(oneflow::createOutlineJitFunctionPass());
  }
  ...
}
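As mentioned above, the Kernel of MlirJitOp eventually runs the lowered module through MLIR's JIT execution engine. The snippet below is a minimal sketch of that step using the standard mlir::ExecutionEngine API under assumptions: the function name is a placeholder, the argument packing is only hinted at in comments, and the real wiring into OneFlow's kernel interface lives in oneflow/ir/oneflow-extension/extension.cpp.

// Hedged sketch: JIT-compile the lowered (LLVM Dialect) module and call into it.
llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();
mlir::registerLLVMDialectTranslation(*module.getContext());
auto maybeEngine = mlir::ExecutionEngine::create(module);
assert(maybeEngine && "failed to create the MLIR JIT execution engine");
auto engine = std::move(maybeEngine.get());
// The jit op's kernel packs pointers to the OneFlow tensors' buffers
// (memref descriptors) into `args` before the call.
llvm::SmallVector<void*, 4> args;
llvm::Error error = engine->invokePacked("some_outlined_function", args);
if (error) llvm::errs() << "JIT invocation failed\n";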
However, there are still two problems to be solved in this process:
- The first question is how to do Op fusion. The JIT execution process above only considers lowering step by step. If some Operations in the OneFlow Dialect can be fused, what should be done? Very simply, following MLIR's DRR rules, i.e. using TableGen syntax, we just write a series of fusion patterns in oneflow/ir/include/OneFlow/OneFlowPatterns.td. For example, bias_add + gelu can be fused into OneFlow's fused_bias_add_gelu Op, so we can write the following rule:
def IsGPU: Constraint<CPred<"$0.getValue().equals(\"gpu\")">, "is GPU device">;

def FusedBiasAddGeluPattern : Pat<
  (
    OneFlow_GeluOp : $gelu_op
    (
      OneFlow_BiasAddOp
        $a,
        $b,
        $bias_add_op_name,
        $bias_add_device_tag,
        $bias_add_device_name,
        $bias_add_scope_symbol_id,
        $bias_add_hierarchy,
        $axis
    ),
    $gelu_op_name,
    $gelu_device_tag,
    $gelu_device_name,
    $gelu_scope_symbol_id,
    $gelu_hierarchy
  ),
  (OneFlow_FusedBiasAddGeluOp
    $a,
    $b,
    $gelu_op_name,
    $gelu_device_tag,
    $gelu_device_name,
    $gelu_scope_symbol_id,
    $gelu_hierarchy,
    $axis
  ),
  [
    (IsGPU $bias_add_device_tag),
    (IsGPU $gelu_device_tag)
  ]
>;
Here we do pattern matching and rewriting based on MLIR's DRR rules. If the current device is a GPU and a bias_add Op is immediately followed by a gelu Op, the two are merged into a single fused_bias_add_gelu Op, which reduces global memory reads and writes on CUDA and improves execution efficiency.
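To make it easier to see what such a DRR rule does, here is a hedged C++ sketch of the equivalent rewrite pattern written by hand. It is only an illustration under assumptions: the ODS-generated accessor names (in, a, b, device_tag, ...) and the way attributes are carried over (e.g. the axis attribute from bias_add) may differ from the real generated code.

// Hedged sketch of the rewrite the DRR rule above corresponds to conceptually.
struct FusedBiasAddGeluPattern : public mlir::OpRewritePattern<GeluOp> {
  using OpRewritePattern<GeluOp>::OpRewritePattern;

  mlir::LogicalResult matchAndRewrite(GeluOp gelu_op,
                                      mlir::PatternRewriter& rewriter) const override {
    // Match gelu(bias_add(a, b)): the gelu input must be produced by a bias_add.
    auto bias_add_op = gelu_op.in().getDefiningOp<BiasAddOp>();
    if (!bias_add_op) return mlir::failure();
    // The fused kernel only exists on GPU, mirroring the IsGPU constraints above.
    if (gelu_op.device_tag() != "gpu" || bias_add_op.device_tag() != "gpu")
      return mlir::failure();
    // Replace the pair with one fused op via the generic collective builder;
    // copying the remaining attributes (op_name, axis, ...) is glossed over here.
    rewriter.replaceOpWithNewOp<FusedBiasAddGeluOp>(
        gelu_op, gelu_op->getResultTypes(),
        mlir::ValueRange{bias_add_op.a(), bias_add_op.b()}, gelu_op->getAttrs());
    return mlir::success();
  }
};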
- The second question is how to let some of OneFlow's Operations enjoy more optimizations from the MLIR infrastructure. During the level-by-level Dialect lowering above, each sub-function of OneFlow's MLIR expression is lowered, and the first step lowers it to the Tosa Dialect. At this point, if some Operation in a sub-function has no conversion to the Tosa Dialect defined, that function cannot be lowered to Tosa, and naturally cannot be lowered further to the Linalg Dialect, so it cannot enjoy the optimizations brought by loop transformations (which I think can be compared to TVM's schedule optimizations). To handle this situation, we define an extra Pass that outlines the Ops or patterns that can be converted to Tosa into a separate function, so that those OneFlow Ops can be lowered to Tosa, and then generates an oneflow mlir_jit Op to call this function:
def IsNotNestedInJit: Constraint<CPred<"(!$0.getDefiningOp()->getParentOfType<::mlir::FuncOp>()->hasAttr(\"llvm.emit_c_interface\"))">, "">;
def OutlineMulCast : NativeCodeCall<"::mlir::oneflow::OutlineMulCast($_builder, $0, $1)">;
// TODO: remove attr binding if possible
def MulCastPattern : Pat<
  (
    OneFlow_ScalarMulByTensorOp : $mul_op
    (
      OneFlow_CastOp : $cast_op
        $cast_x,
        $cast_op_name,
        $cast_device_tag,
        $cast_device_name,
        $cast_scope_symbol_id,
        $cast_hierarchy,
        $cast_dtype
    ),
    $scalar,
    $mul_op_name,
    $mul_device_tag,
    $mul_device_name,
    $mul_scope_symbol_id,
    $mul_hierarchy
  ),
  (OutlineMulCast $mul_op, $cast_op),
  [
    (IsNotNestedInJit $mul_op)
  ]
>;

::llvm::SmallVector<::mlir::Value, 4> OutlineMulCast(::mlir::PatternRewriter& rewriter,
                                                     mlir::OpResult mul_res,
                                                     mlir::OpResult cast_res) {
  if (auto mul_op = llvm::dyn_cast<ScalarMulByTensorOp>(mul_res.getDefiningOp())) {
    if (auto cast_op = llvm::dyn_cast<CastOp>(cast_res.getDefiningOp())) {
      // TODO: extract a function to generate op name for jit op from ops being fused
      SmallString<64> op_name_storage;
      auto op_name =
          (cast_op.op_name() + "__FUSE__" + mul_op.op_name()).toStringRef(op_name_storage);
      SmallVector<::mlir::Value, 2> operands;
      operands.push_back(cast_op.in());
      operands.push_back(mul_op.scalar());
      SmallVector<::mlir::Value, 1> results;
      results.push_back(mul_op.y());
      NamedAttrList attributes =
          GetJitOpAttributes(rewriter, op_name, operands.size(), results.size(), mul_op);
      SmallVector<Operation*, 4> ops = {cast_op, mul_op};
      auto function = GetOrInsertFuncOp(rewriter, mul_op->getLoc(), op_name, operands, results, ops);
      auto created = rewriter.create<MlirJitOp>(mul_op.getLoc(), function, attributes, operands);
      assert(DumpAssembly(rewriter, created).succeeded());
      cast_op->dropAllUses();
      cast_op.erase();
      return created->getResults();
    }
  }
  return {};
}

void populateFuserPasses(::mlir::RewritePatternSet& patterns) {
  patterns.add<MulCastPattern>(patterns.getContext());
}
Here we have, in effect, manually carved out the MulCast pattern so that it can go down the path from the OneFlow Dialect towards the Tosa Dialect. Finally, this Pass is added to the optimization pipeline, so that the matched patterns in the MLIR expression are outlined and can then pass through Tosa and Linalg to obtain optimization opportunities.
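For an Operation to take this lowering path at all, a conversion from the OneFlow Dialect to the Tosa Dialect has to exist for it; the real conversions sit behind the createLowerOneFlowToTosaPass seen earlier. Below is a hedged sketch of what one such per-Op conversion could look like, using cast as an example; the OneFlow-side accessor names (in, out) are assumptions.

// Hedged sketch of a single OneFlow -> Tosa rewrite; not the actual pass source.
struct CastOpLowering : public mlir::OpRewritePattern<oneflow::CastOp> {
  using OpRewritePattern<oneflow::CastOp>::OpRewritePattern;

  mlir::LogicalResult matchAndRewrite(oneflow::CastOp op,
                                      mlir::PatternRewriter& rewriter) const override {
    // tosa.cast expresses the dtype conversion through its result tensor type,
    // so OneFlow's dtype attribute is already reflected in the result type here.
    rewriter.replaceOpWithNewOp<mlir::tosa::CastOp>(op, op.out().getType(), op.in());
    return mlir::success();
  }
};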
Summary
In this article we took OneFlow as an example to explain part of how MLIR actually runs, that is, how the computation graph of a deep learning framework is executed and accelerated through MLIR. My understanding inevitably has shortcomings, and criticism and corrections are welcome.