TensorRT: a high-performance inference optimizer and runtime for accelerating AI models on NVIDIA GPUs
What it solves
TensorRT accelerates deep learning inference on NVIDIA GPUs. It provides tools to import trained models from various frameworks and optimize them for efficient execution at deployment time.
How it works
Models enter TensorRT through one of several import paths: ONNX, Torch-TensorRT, HuggingFace/Optimum, or the Network Definition API. TensorRT then optimizes the imported model for execution on the target GPU. It supports a wide range of model types, including LLMs, encoder NLP models, vision, audio, diffusion, and multimodal models. The open-source components of the project include the ONNX parser and the TensorRT plugins, which developers can use to extend the platform's capabilities.
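To make the ONNX import path concrete, here is a minimal sketch using the TensorRT Python API as commonly documented (`Builder`, `OnnxParser`, `build_serialized_network`). The function name `build_engine_from_onnx` and the file paths are hypothetical, and exact flags can differ between TensorRT releases, so treat this as an illustration under those assumptions rather than the project's reference workflow.

```python
from pathlib import Path


def build_engine_from_onnx(onnx_path, engine_path):
    """Parse an ONNX file and write a serialized TensorRT engine to disk.

    Hypothetical helper for illustration; not part of TensorRT itself.
    """
    # Read the model first so a bad path fails before any GPU work starts.
    onnx_bytes = Path(onnx_path).read_bytes()

    # Imported lazily so this module still loads on machines without TensorRT.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # Older releases need the explicit-batch flag for ONNX models; newer ones
    # deprecate it, so the flag is looked up defensively (an assumption about
    # version differences, not a documented recipe).
    flags = 0
    explicit_batch = getattr(trt.NetworkDefinitionCreationFlag, "EXPLICIT_BATCH", None)
    if explicit_batch is not None:
        flags |= 1 << int(explicit_batch)
    network = builder.create_network(flags)

    # The open-source ONNX parser populates the network definition.
    parser = trt.OnnxParser(network, logger)
    if not parser.parse(onnx_bytes):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    # Default builder configuration; precision and memory limits are left untouched.
    config = builder.create_builder_config()
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT could not build an engine for this network")

    Path(engine_path).write_bytes(bytes(serialized))
    return engine_path
```

A call such as `build_engine_from_onnx("model.onnx", "model.plan")` would run on a machine with an NVIDIA GPU and the `tensorrt` package installed; both file names are placeholders.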
Who it’s for
It is intended for AI developers and engineers who need to deploy high-performance inference on NVIDIA hardware, including x86_64 and aarch64 (Jetson/DriveOS) platforms.
Highlights
- Broad Import Support: Compatible with ONNX, Torch-TensorRT, and HuggingFace/Optimum.
- Diverse Model Compatibility: Supports LLMs, vision, audio, and multimodal models.
- Flexible Deployment: Provides prebuilt Python packages for easy installation and extensive build options for various OS and hardware targets.
- Extensible: Includes open-source plugins and an ONNX parser to customize and optimize model execution.
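As a quick check after installing one of the prebuilt Python packages, the sketch below reports the installed TensorRT version, if any. The helper name `tensorrt_version` is hypothetical, and the snippet assumes the package is importable as `tensorrt` and exposes `__version__`.

```python
import importlib


def tensorrt_version():
    """Return the installed TensorRT version string, or None if unavailable."""
    try:
        trt = importlib.import_module("tensorrt")
    except ImportError:
        # Package missing, or its native libraries could not be loaded.
        return None
    return getattr(trt, "__version__", None)


if __name__ == "__main__":
    version = tensorrt_version()
    print(version if version else "TensorRT Python package not found")
```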
Sources
- NVIDIA/TensorRT