scenic: a JAX-based research library for prototyping large-scale attention-based computer vision models

What it solves

Scenic provides a streamlined framework for researching and prototyping large-scale attention-based models for computer vision. It reduces the effort required to build complex vision models by providing shared libraries for common training tasks, optimized training loops, and input pipelines, all designed for multi-device and multi-host environments.

How it works

Built using JAX and Flax, Scenic splits its architecture into two levels:

  1. Library-level code: Minimal, well-tested shared libraries including dataset_lib (scalable IO pipelines), model_lib (abstract model interfaces, attention/transformer layers, and bipartite matchers), train_lib (optimized training loops), and common_lib (general utilities).
  2. Project-level code: Customizable implementations for specific tasks. Researchers can use existing configs or fork library components to redefine architectures, losses, and metrics based on their needs.
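The two-level split can be sketched in a few lines of JAX. This is an illustration only, not Scenic's actual API: the class names BaseClassifier and LabelSmoothedClassifier, the method names loss_fn and metrics_fn, and the smoothing parameter are all hypothetical. The idea shown is that a library-level interface supplies a default loss and metric, and a project-level model redefines only the piece it needs.

```python
import jax
import jax.numpy as jnp


class BaseClassifier:
    """Hypothetical library-level interface (not Scenic's real class)."""

    def loss_fn(self, logits, labels):
        # Default loss: softmax cross-entropy averaged over the batch.
        log_probs = jax.nn.log_softmax(logits, axis=-1)
        one_hot = jax.nn.one_hot(labels, logits.shape[-1])
        return -jnp.mean(jnp.sum(one_hot * log_probs, axis=-1))

    def metrics_fn(self, logits, labels):
        # Default metric: top-1 accuracy.
        return {"accuracy": jnp.mean(jnp.argmax(logits, axis=-1) == labels)}


class LabelSmoothedClassifier(BaseClassifier):
    """Hypothetical project-level model that redefines only the loss."""

    def __init__(self, smoothing=0.1):
        self.smoothing = smoothing

    def loss_fn(self, logits, labels):
        # Cross-entropy against label-smoothed targets.
        num_classes = logits.shape[-1]
        log_probs = jax.nn.log_softmax(logits, axis=-1)
        one_hot = jax.nn.one_hot(labels, num_classes)
        targets = one_hot * (1.0 - self.smoothing) + self.smoothing / num_classes
        return -jnp.mean(jnp.sum(targets * log_probs, axis=-1))


# Toy inputs: two examples, three classes.
logits = jnp.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = jnp.array([0, 2])

base = BaseClassifier()
project = LabelSmoothedClassifier(smoothing=0.1)

base_loss = base.loss_fn(logits, labels)
smoothed_loss = project.loss_fn(logits, labels)
accuracy = project.metrics_fn(logits, labels)["accuracy"]  # inherited metric
```

The project-level class inherits the metric unchanged and swaps in its own loss, which mirrors the "redefine architectures, losses, and metrics" workflow described above without touching the shared code.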

Who it’s for

It is designed for AI researchers and developers working on computer vision, including those developing models for classification, segmentation, detection, and multimodal tasks involving images, video, and audio.

Highlights

  • Broad Modality Support: Successfully used for images, video, audio, and multimodal combinations.
  • Scalable Infrastructure: Built-in support for large-scale training across multiple devices and hosts.
  • Extensive Baseline Library: Includes implementations of state-of-the-art models such as ViT, DETR, CLIP, and SAM.
  • Philosophy: Prioritizes simplicity and rapid prototyping by favoring forking and copy-pasting over complex abstractions.
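As a rough illustration of the multi-device training mentioned in the highlights, the sketch below shows one common JAX pattern for a data-parallel training step: jax.pmap runs the step on every local device and jax.lax.pmean averages gradients across them. It is not taken from Scenic's train_lib; the toy linear model, the loss_fn and train_step names, and the fixed learning rate of 0.1 are assumptions made for the example. It also runs on a single CPU device.

```python
import functools

import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    # Mean squared error of a toy linear model (illustrative only).
    preds = batch["x"] @ params["w"] + params["b"]
    return jnp.mean((preds - batch["y"]) ** 2)


@functools.partial(jax.pmap, axis_name="batch")
def train_step(params, batch):
    # Each device computes gradients on its own shard of the batch.
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    # Average gradients and loss across devices so replicas stay in sync.
    grads = jax.lax.pmean(grads, axis_name="batch")
    loss = jax.lax.pmean(loss, axis_name="batch")
    # Plain SGD update with an assumed learning rate of 0.1.
    params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
    return params, loss


n_dev = jax.local_device_count()

# Replicate parameters along a leading device axis.
params = {"w": jnp.zeros((3, 1)), "b": jnp.zeros((1,))}
replicated = jax.tree_util.tree_map(lambda p: jnp.stack([p] * n_dev), params)

# Synthetic data with shape (devices, per-device batch, features).
x = jax.random.normal(jax.random.PRNGKey(0), (n_dev, 8, 3))
y = x @ jnp.array([[1.0], [-2.0], [0.5]])
batch = {"x": x, "y": y}

losses = []
for _ in range(20):
    replicated, loss = train_step(replicated, batch)
    losses.append(float(loss[0]))  # loss is identical on every device
```

Because the gradients are averaged before the update, every replica holds the same parameters after each step, which is what lets the same step function scale from one device to many.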
