ONNX to TensorRT accelerated model reasoning


TensorRT is an efficient deep learning model reasoning framework launched by NVIDIA. It includes deep learning reasoning optimizer and runtime, which can make deep learning reasoning applications have the advantages of low latency and high throughput.

In essence, it is to accelerate the reasoning speed of the whole network by fusing some operators in the model, selecting better implementation methods for operators with specific sizes, and using mixed accuracy.

After using PyTorch to train the network model, we hope to accelerate the model reasoning through TensorRT during model deployment. Then we can first convert the PyTorch model to ONNX, and then turn ONNX to TensorRT engine.

Implementation steps

PyTorch model to ONNX

The specific process can be referred to PyTorch model to ONNX format_ TracelessLe's column - CSDN blog

ONNX to TensorRT engine

Method 1: trtexec

Directly use trtexec command line to convert ONNX model to TensorRT engine:

trtexec --onnx=net_bs8_v1_simple.onnx --tacticSources=-cublasLt,+cublas --workspace=2048 --fp16 --saveEngine=net_bs8_v1.engine  --verbose

Note: (Reference: TensorRT-trtexec-README)
① -- ONNX specifies the ONNX file path
② -- tacticSources specifies the method library to use
③ -- workspace specifies the size of the workspace in MB
④ -- FP16 enable FP16 mode
⑤ -- saveEngine specifies the save path of the generated engine
⑥ -- verbose open verbose mode for more printing information.

Under normal circumstances, the engine file can be generated normally after a long waiting time. In case of any problem, it needs to be analyzed according to the specific problem.

Method 2: engine generation based on Python API

TensorRT provides API s such as C + + and Python, which can be used to generate engine.

__author__ = 'TracelessLe'

import os
import tensorrt as trt

ONNX_SIM_MODEL_PATH = 'net_bs8_v1_simple.onnx'
TENSORRT_ENGINE_PATH_PY = 'net_bs8_v1_fp16_py.engine'

def build_engine(onnx_file_path, engine_file_path, flop=16):
    trt_logger = trt.Logger(trt.Logger.VERBOSE)  # trt.Logger.ERROR
    builder = trt.Builder(trt_logger)
    network = builder.create_network(
        1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    parser = trt.OnnxParser(network, trt_logger)
    # parse ONNX
    with open(onnx_file_path, 'rb') as model:
        if not parser.parse(model.read()):
            print('ERROR: Failed to parse the ONNX file.')
            for error in range(parser.num_errors):
            return None
    print("Completed parsing ONNX file")
    builder.max_workspace_size = 2 << 30
    # default = 1 for fixed batch size
    builder.max_batch_size = 1
    # set mixed flop computation for the best performance
    if builder.platform_has_fast_fp16 and flop == 16:
        builder.fp16_mode = True

    if os.path.isfile(engine_file_path):
        except Exception:
            print("Cannot remove existing file: ",

    print("Creating Tensorrt Engine")

    config = builder.create_builder_config()
    config.set_tactic_sources(1 << int(trt.TacticSource.CUBLAS))
    config.max_workspace_size = 2 << 30
    engine = builder.build_engine(network, config)
    with open(engine_file_path, "wb") as f:
    print("Serialized Engine Saved at: ", engine_file_path)
    return engine

if __name__ == "__main__":    

Under normal circumstances, this method can generate the same engine file as trtexec, and has the same file size and reasoning speed.

Result comparison

Compare the results obtained by engine reasoning with the results obtained by ONNX:

__author__ = 'TracelessLe'

import os
import time
import onnxruntime
import pycuda.driver as cuda
import tensorrt as trt
import numpy as np

ONNX_SIM_MODEL_PATH = 'net_bs8_v1_simple.onnx'
TENSORRT_ENGINE_PATH_PY = 'net_bs8_v1_fp16_py.engine'

def get_numpy_data():
    batch_size = 8
    img_input = np.ones((batch_size, 3, 128, 128), dtype=np.float32)
    return img_input

def test_onnx(inputs, loop=100):
    inputs = inputs.astype(np.float32)
    sess = onnxruntime.InferenceSession(ONNX_SIM_MODEL_PATH)
    batch_size = 8
    time1 = time.time()
    for i in range(loop):
        time_bs1 = time.time()
        out_ort_img = sess.run(None, {sess.get_inputs()[0].name: inputs,})
        time_bs2 = time.time()
        time_use_onnx_bs = time_bs2 - time_bs1
        print(f'ONNX use time {time_use_onnx_bs} for bs8')
    time2 = time.time()
    time_use_onnx = time2-time1
    print(f'ONNX use time {time_use_onnx} for loop {loop}, FPS={loop*batch_size//time_use_onnx}')
    return out_ort_img

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def _load_engine(engine_file_path):
    trt_logger = trt.Logger(trt.Logger.ERROR)
    with open(engine_file_path, 'rb') as f:
        with trt.Runtime(trt_logger) as runtime:
            engine = runtime.deserialize_cuda_engine(f.read())
            print('_load_engine ok.')
    return engine

def _allocate_buffer(engine):
    binding_names = []
    for idx in range(100):
        bn = engine.get_binding_name(idx)
        if bn:

    inputs = []
    outputs = []
    bindings = [None] * len(binding_names)
    stream = cuda.Stream()

    for binding in binding_names:
        binding_idx = engine[binding]
        if binding_idx == -1:
            print("Error Binding Names!")
        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings[binding_idx] = int(device_mem)

        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
            outputs.append(HostDeviceMem(host_mem, device_mem))
    return inputs, outputs, bindings, stream

def _test_engine(engine_file_path, data_input, num_times=100):
    # Code from blog.csdn.net/TracelessLe
    engine = _load_engine(engine_file_path)
    # print(engine)
    input_bufs, output_bufs, bindings, stream = _allocate_buffer(engine)
    batch_size = 8
    context = engine.create_execution_context()
    input_bufs[0].host = data_input
    trt_outputs = [output_bufs[0].host.copy()]
    start = time.time()
    for _ in range(num_times):
        time_bs1 = time.time()
        input_bufs[0].host = data_input
        trt_outputs = [output_bufs[0].host.copy()]
        time_bs2 = time.time()
        time_use_bs = time_bs2 - time_bs1
        print(f'TRT use time {time_use_bs} for bs8')
    end = time.time()
    time_use_trt = end - start
    print(f"TRT use time {(time_use_trt)}for loop {num_times}, FPS={num_times*batch_size//time_use_trt}")
    return trt_outputs

def test_engine(data_input, loop=100):
    engine_file_path = TENSORRT_ENGINE_PATH_PY
    cuda_ctx = cuda.Device(0).make_context()
    trt_outputs = None
        trt_outputs = _test_engine(engine_file_path, data_input, loop)
    return trt_outputs

if __name__ == "__main__":
    img_input = get_numpy_data()
    trt_outputs = test_engine(img_input, 100)
    trt_outputs = trt_outputs[0].reshape((8,3,128,128))
    trt_image_numpy = (np.transpose(trt_outputs[0], (1, 2, 0)) + 1) / 2.0 * 255.0
    trt_image_numpy = np.clip(trt_image_numpy, 0, 255)
    out_ort_img = test_onnx(img_input, loop=1)[0]
    onnx_image_numpy = (np.transpose(out_ort_img[0], (1, 2, 0)) + 1) / 2.0 * 255.0
    onnx_image_numpy = np.clip(onnx_image_numpy, 0, 255)
    mse = np.square(np.subtract(onnx_image_numpy, trt_image_numpy)).mean()
    print('mse between onnx and trt result: ', mse)

Other instructions

(1) PyTorch to TensorRT engine
Methods in addition to the conventional PyTorch - > onnx - > tensorrt, there are other methods, such as NVIDIA-AI-IOT torch2trt And NVIDIA TRTorch , you can also try.

(2) ONNX operator support
TensorRT does not support all ONNX operators. See the specific support list Related documents.

(3)engine debug
TensorRT provides a set of tools that can be used for debug ging during engine generation, including Polygraphy, ONNX GraphSurgeon and pytorch quantification. These gadgets are very useful and worthy of further research.

Copyright description

This article is an original article, which is exclusively published in blog.csdn.net/TracelessLe No reprint is allowed without personal permission. If you need help, please email tracelessle@163.com.

reference material

[1] NVIDIA TensorRT | NVIDIA Developer
[2] TensorRT-trtexec-README
[3] PyTorch model to ONNX format _tracelesslecolumn - CSDN blog
[4] How to use TensorRT to accelerate the trained PyTorch model? - cloud + community - Tencent cloud
[5] How to Convert a Model from PyTorch to TensorRT and Speed Up Inference | LearnOpenCV #
[6] How to use TensorRT to accelerate the trained PyTorch model? - what do you know
[7] TensorRT tutorial 3: converting to engine using the trtexec tool
[8] onnx-tensorrt/operators.md at master · onnx/onnx-tensorrt
[9] mmediting/onnx2tensorrt.py at master · open-mmlab/mmediting
[10] TensorRT detailed introduction to the north. If you don't know TensorRT, come and have a look! - SegmentFault
[11] Share some valuable in-depth learning about NVIDIA new technology for AI deployment - what do you know

Keywords: neural networks Pytorch Deep Learning TensorRT

Added by rolwong on Thu, 30 Sep 2021 23:51:37 +0300