KubeRay: a Kubernetes operator for managing the lifecycle of Ray clusters, jobs, and services

What it solves

KubeRay simplifies the deployment and management of Ray applications on Kubernetes, removing the complexity of manually configuring clusters for distributed AI workloads like training and inference.

How it works

It operates as a Kubernetes operator providing three primary Custom Resource Definitions (CRDs) to manage different workload types:

  • RayCluster: Manages the full lifecycle of a Ray cluster, including creation, deletion, autoscaling, and fault tolerance.
  • RayJob: Automates the creation of a cluster, submits a specific job, and can automatically delete the cluster upon completion.
  • RayService: Combines a RayCluster with a Ray Serve deployment graph to enable high availability and zero-downtime upgrades.
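
As a rough illustration of how these resources relate, the sketch below builds a RayCluster manifest and a RayJob manifest that embeds the same cluster spec, then prints the RayJob as JSON. It is a minimal sketch under stated assumptions, not KubeRay's reference configuration: the helper function names, resource names, container image tag, and CPU/memory sizes are all illustrative, and the field names follow the ray.io/v1 CRDs as commonly documented, so check them against the KubeRay version you run.

```python
import json

# Illustrative values only: image tag and sizes are assumptions, not KubeRay defaults.
RAY_IMAGE = "rayproject/ray:2.9.0"


def pod_template(image: str, cpu: str, memory: str) -> dict:
    """Build a minimal pod template containing a single Ray container."""
    resources = {"cpu": cpu, "memory": memory}
    return {
        "spec": {
            "containers": [
                {
                    "name": "ray-node",
                    "image": image,
                    "resources": {"requests": resources, "limits": resources},
                }
            ]
        }
    }


def ray_cluster_spec(image: str, min_workers: int, max_workers: int) -> dict:
    """Cluster spec usable in a RayCluster or embedded as RayJob.rayClusterSpec."""
    return {
        # Asks the operator to run the Ray autoscaler alongside the head pod.
        "enableInTreeAutoscaling": True,
        "headGroupSpec": {
            "rayStartParams": {},
            "template": pod_template(image, "1", "2Gi"),
        },
        "workerGroupSpecs": [
            {
                "groupName": "workers",
                "replicas": min_workers,
                "minReplicas": min_workers,
                "maxReplicas": max_workers,
                "rayStartParams": {},
                "template": pod_template(image, "1", "2Gi"),
            }
        ],
    }


def ray_cluster_manifest(name: str, spec: dict) -> dict:
    """Wrap a cluster spec in a standalone RayCluster resource."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name},
        "spec": spec,
    }


def ray_job_manifest(name: str, entrypoint: str, spec: dict) -> dict:
    """Wrap a cluster spec in a RayJob that tears the cluster down when done."""
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            "entrypoint": entrypoint,
            "shutdownAfterJobFinishes": True,
            "rayClusterSpec": spec,
        },
    }


if __name__ == "__main__":
    spec = ray_cluster_spec(RAY_IMAGE, min_workers=1, max_workers=4)
    job = ray_job_manifest("demo-job", "python -c \"import ray; ray.init()\"", spec)
    print(json.dumps(job, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped into `kubectl apply -f -` on a cluster where the KubeRay operator and its CRDs are already installed. Sharing one spec between the two manifests reflects the description above: a RayJob carries its own cluster definition rather than pointing at a separate resource.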

Additionally, it offers a kubectl ray plugin for simplified workflows, an API server for configuration management, and an experimental dashboard for resource visualization.

Who it’s for

It is designed for developers and platform engineers who need to run distributed machine learning and AI applications (such as LLM online inference or batch training) at scale on Kubernetes.

Highlights

  • Automated Lifecycle Management: Handles cluster creation, scaling, and fault tolerance automatically.
  • Integrated Ecosystem: Integrates with Prometheus, Grafana, Nginx, and queuing systems like Volcano and Kueue.
  • Workload-Specific Resources: Dedicated resources for long-running services (RayService) and one-off jobs (RayJob).
  • Scalability: Used by organizations like Apple, Google, and Spotify to scale AI infrastructure to thousands of nodes.