TensorRT: a high-performance inference optimizer and runtime for accelerating AI models on NVIDIA GPUs

What it solves

TensorRT accelerates deep learning inference on NVIDIA GPUs. It provides tools to import trained models from various frameworks and optimize them for efficient execution at deployment time.

How it works

TensorRT accepts models through several import paths: ONNX, Torch-TensorRT, HuggingFace/Optimum, and the Network Definition API. It supports a wide range of model types, including LLMs, encoder-NLP, vision, audio, diffusion, and multimodal models. The open-source components of the project include the ONNX parser and the TensorRT plugins, which developers can use to extend the platform's capabilities.
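The ONNX import path described above can be sketched with the TensorRT Python API. This is a minimal illustration, not the project's reference workflow: it assumes the tensorrt Python package (roughly version 8.5 or newer) and an NVIDIA GPU are available, and the function name build_engine_from_onnx, the file name model.onnx, and the 1 GiB workspace limit are hypothetical choices made for the example.

```python
from pathlib import Path


def build_engine_from_onnx(onnx_path, workspace_bytes=1 << 30):
    """Parse an ONNX file and return a serialized TensorRT engine.

    Hypothetical helper for illustration; assumes TensorRT >= 8.5.
    """
    # Read the model first so a bad path fails before TensorRT is loaded.
    onnx_bytes = Path(onnx_path).read_bytes()

    # Imported lazily so this module can be loaded without TensorRT installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # The ONNX parser needs an explicit-batch network. Newer releases treat
    # this flag as the default and mark it deprecated, but still accept it.
    flags = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    network = builder.create_network(flags)

    # The open-source ONNX parser populates the network definition.
    parser = trt.OnnxParser(network, logger)
    if not parser.parse(onnx_bytes):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    # Cap the scratch memory the optimizer may use (assumed value: 1 GiB).
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT engine build failed")
    return serialized
```

A caller might write the result to disk with something like Path("model.plan").write_bytes(build_engine_from_onnx("model.onnx")), where both file names are placeholders. The Network Definition API route skips the parser and adds layers to the network object directly.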

Who it’s for

It is intended for AI developers and engineers who need to deploy high-performance inference on NVIDIA GPUs, on x86_64 hosts as well as aarch64 (Jetson/DriveOS) platforms.

Highlights

Broad Import Support: Accepts models via ONNX, Torch-TensorRT, HuggingFace/Optimum, and the Network Definition API.
Diverse Model Compatibility: Supports LLMs, encoder-NLP, vision, audio, diffusion, and multimodal models.
Flexible Deployment: Provides prebuilt Python packages for easy installation and extensive build options for various OS and hardware targets.
Extensible: Includes open-source plugins and an ONNX parser to customize and optimize model execution.
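
For the prebuilt Python packages mentioned above, a quick way to confirm that an installation succeeded is to query the package version. The sketch below assumes the package is importable under the name tensorrt (as installed by pip install tensorrt); the helper name tensorrt_version is hypothetical.

```python
import importlib.util


def tensorrt_version():
    """Return the installed TensorRT version string, or None if absent.

    Hypothetical helper for illustration only.
    """
    # find_spec checks for the package without importing it.
    if importlib.util.find_spec("tensorrt") is None:
        return None
    try:
        import tensorrt as trt
    except ImportError:
        # The package is present but its native libraries could not load.
        return None
    return trt.__version__


if __name__ == "__main__":
    version = tensorrt_version()
    print("TensorRT not available" if version is None else f"TensorRT {version}")
```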
