claude-real-video: Enabling LLMs to Watch Videos via Scene-Aware Frame Extraction
Overview
claude-real-video is an open-source tool designed to let Large Language Models (LLMs) like Claude, ChatGPT, and Gemini "watch" videos by converting them into a format the models can easily ingest. Unlike many AI tools that rely solely on transcripts or fixed-interval frame sampling, claude-real-video extracts only the frames that matter—specifically scene changes and key visual shifts—and pairs them with a transcript, all while running locally on the user's machine.
Solving the Limitations of Fixed-Interval Sampling
Most existing video-to-LLM pipelines, including Gemini's native video processing, typically sample frames at a fixed interval (e.g., 1 frame per second). This approach creates two primary inefficiencies:
- Over-sampling static content: A 10-minute static screencast would result in 600 nearly identical frames, wasting context window space and increasing costs.
- Under-sampling fast cuts: Rapid visual changes occurring between the fixed sample points are often missed entirely.
claude-real-video addresses this by combining scene-change detection with a density floor. Scene detection captures each visual transition, while the floor (set by --fps-floor) maintains a minimum frame frequency so that no long stretch of video goes unsampled.
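The interplay between scene cuts and the density floor can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name `select_timestamps` and the parameter `max_gap_seconds` are hypothetical, and how `--fps-floor` maps onto a maximum gap is an assumption made here for clarity.

```python
def select_timestamps(scene_changes, duration, max_gap_seconds):
    """Merge scene-change timestamps with floor samples.

    Hypothetical sketch: keeps every detected cut and adds extra
    samples so that no gap between kept frames reaches max_gap_seconds.
    """
    if max_gap_seconds <= 0:
        raise ValueError("max_gap_seconds must be positive")

    # Always sample the first frame, plus every cut inside the video.
    anchors = sorted({0.0, *(t for t in scene_changes if 0.0 <= t < duration)})

    selected = []
    # Walk each segment between consecutive anchors (the last ends at duration).
    for start, end in zip(anchors, anchors[1:] + [duration]):
        selected.append(start)
        # Density floor: fill long, visually static segments at a fixed step.
        t = start + max_gap_seconds
        while t < end:
            selected.append(t)
            t += max_gap_seconds
    return selected


if __name__ == "__main__":
    # Cuts at 2.0 s, 2.5 s, and 10.0 s in a 12-second clip, floor gap of 3 s.
    print(select_timestamps([2.0, 2.5, 10.0], 12.0, 3.0))
```

In this sketch the two rapid cuts at 2.0 s and 2.5 s are both kept, while the long static stretch between 2.5 s and 10.0 s receives only the floor samples needed to bridge it.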
Key Technical Features
Intelligent Frame Deduplication
To further optimize the context window, the tool employs a sliding-window deduplication process. Instead of simple perceptual hashes, it uses downscaled RGB pixel differences to determine if a frame is unique.
By comparing a new frame against a sliding window of the last several kept frames (controlled by --dedup-window), the tool prevents the same shot from being sent to the LLM multiple times, even if the video uses A-B-A cutaways. Users can tune the sensitivity of this process using the --dedup-threshold flag.
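The sliding-window comparison described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the tool's code: frames are modeled as equal-length lists of already-downscaled `(r, g, b)` tuples, the helper names `changed_percent` and `dedup` are hypothetical, and the per-channel `tolerance` value is invented for the example. Only the `threshold` and `window` defaults mirror the documented `--dedup-threshold` and `--dedup-window` values.

```python
from collections import deque


def changed_percent(a, b, tolerance=16):
    """Percentage of pixels that differ by more than `tolerance` in any channel.

    `tolerance` is an assumed noise margin, not a documented setting.
    """
    if not a or len(a) != len(b):
        raise ValueError("frames must be non-empty and the same size")
    changed = sum(
        1
        for pa, pb in zip(a, b)
        if max(abs(ca - cb) for ca, cb in zip(pa, pb)) > tolerance
    )
    return 100.0 * changed / len(a)


def dedup(frames, threshold=8.0, window=4):
    """Return indices of frames that differ from all recently kept frames."""
    recent = deque(maxlen=window)  # the last `window` kept frames
    kept = []
    for index, frame in enumerate(frames):
        # Keep the frame only if it is new relative to every frame in the window.
        if all(changed_percent(frame, prev) >= threshold for prev in recent):
            kept.append(index)
            recent.append(frame)
    return kept


if __name__ == "__main__":
    shot_a = [(0, 0, 0)] * 100
    shot_b = [(255, 255, 255)] * 100
    # A-B-A cutaway: the return to shot A is dropped because A is still in the window.
    print(dedup([shot_a, shot_b, shot_a]))
```

Comparing against a window rather than only the previous kept frame is what handles the A-B-A case: with `window=1` the return to shot A would look new again, since only shot B would be remembered.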
Multi-Modal Data Extraction
Beyond visuals, claude-real-video provides a comprehensive data package for the LLM:
- Transcripts: The tool first attempts to use existing subtitles (SRT/VTT) for maximum accuracy. If none exist, it falls back to OpenAI's Whisper for local audio transcription.
- Audio Extraction: With the --keep-audio flag, the tool saves the full soundtrack as an audio.m4a file, allowing models with native audio capabilities (like GPT-4o or Gemini) to analyze tone and music.
- Manifest File: A MANIFEST.txt file is generated to summarize the extracted content, providing the LLM with a structured map of the video's components.
Local Processing and Privacy
All processing—fetching, extraction, and deduplication—happens locally. The tool uses yt-dlp for URL handling and ffmpeg for frame and audio extraction, ensuring that video data is not uploaded to a cloud service for preprocessing.
Installation and Usage
System Requirements
claude-real-video requires Python 3.10+ and ffmpeg installed on the system path.
Installation
Users can install the core functionality or the full suite including transcription:
# Core frames and deduplication
pip install claude-real-video
# Full suite including audio transcription
pip install "claude-real-video[whisper]"
Basic Commands
- From a URL: crv "https://www.youtube.com/watch?v=..."
- From a local file: crv lecture.mp4 -o out --lang en
- Frames only: crv clip.mp4 --no-transcribe
- Login-gated content: crv "https://..." --cookies cookies.txt
Summary of Configuration Options
| Flag | Default | Description |
|---|---|---|
| --scene | 0.30 | Sensitivity for scene-change detection (lower = more frames) |
| --fps-floor | 1.0 | Minimum frame capture frequency (one frame every N seconds) |
| --max-frames | 150 | Maximum total frames to extract |
| --dedup-threshold | 8 | Pixel change percentage required to count as a new frame |
| --dedup-window | 4 | Number of previous frames to compare against for deduplication |
| --report | off | Generates report.html to visualize keep/drop decisions |