serving: a high-performance production system for managing and serving versioned machine learning model inference
What it solves
TensorFlow Serving addresses the challenge of deploying machine learning models into production environments. Specifically, it handles the inference phase: it takes a trained model and makes it available to clients reliably and efficiently, without requiring changes to client code when the model is updated.
How it works
TensorFlow Serving manages the lifetime of models using a high-performance, reference-counted lookup table. It provides versioned access to models and exposes inference endpoints via gRPC and HTTP. To optimize performance, it includes a scheduler that batches individual requests for joint execution on GPUs and supports a variety of "servables," including TensorFlow models, embeddings, and vocabularies.
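As a rough illustration of the HTTP endpoint and versioned access described above, the following sketch builds a predict request using only the Python standard library. The host, port 8501, the model name `my_model`, and the input row are assumptions made for this example; the URL shape follows the REST predict route (`/v1/models/<name>[/versions/<n>]:predict`), and the request is only constructed, not sent.

```python
import json
import urllib.request


def predict_url(host, model, version=None, port=8501):
    """Build the REST predict URL; omitting `version` lets the server choose."""
    path = f"/v1/models/{model}"
    if version is not None:
        path += f"/versions/{version}"
    return f"http://{host}:{port}{path}:predict"


def build_predict_request(url, instances):
    """Wrap input rows in the JSON body used by the predict endpoint."""
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model name and input row (assumptions for illustration only).
# Nothing is sent over the network until urllib.request.urlopen(request) is called.
request = build_predict_request(predict_url("localhost", "my_model"), [[1.0, 2.0]])
```

Because the version segment is optional, a client that leaves it out keeps working when a new model version is loaded, which is the property the summary refers to as not needing client code changes.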
Who it’s for
It is designed for developers and ML engineers who need to deploy machine learning models to production environments where high performance, low latency, and the ability to manage multiple model versions (including A/B testing and canarying) are required.
Highlights
- Supports simultaneous serving of multiple models or multiple versions of the same model.
- Provides both gRPC and HTTP inference endpoints.
- Enables deployment of new model versions without updating client code.
- Supports canarying and A/B testing for experimental models.
- Includes a request scheduler for GPU batching with configurable latency controls.
- Offers an extensible architecture that can also serve non-TensorFlow models and data.
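The batching highlight can be pictured with a toy model of the trade-off between batch size and latency. This is a conceptual sketch only, not TensorFlow Serving's scheduler: the function name `form_batches` and the parameters `max_batch_size` and `batch_timeout_us` are chosen for this example. It assumes a batch closes either when it is full or when the next request arrives more than the timeout after the batch's first request.

```python
def form_batches(arrivals_us, max_batch_size, batch_timeout_us):
    """Group ascending request arrival times (microseconds) into batches.

    A batch is closed when it is full, or when the next request arrives
    more than batch_timeout_us after the first request in the batch.
    """
    batches = []
    current = []
    for t in arrivals_us:
        full = len(current) >= max_batch_size
        timed_out = bool(current) and t - current[0] > batch_timeout_us
        if full or timed_out:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Five requests: three arrive close together, two arrive much later.
batches = form_batches([0, 100, 200, 5000, 5100], max_batch_size=2, batch_timeout_us=1000)
# batches == [[0, 100], [200], [5000, 5100]]
```

In this sketch, a larger timeout lets more requests share one execution at the cost of extra waiting for the earliest request in each batch, which is the kind of latency control the highlight mentions.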
Sources
- tensorflow/serving