claude-real-video: Enabling LLMs to Watch Videos via Scene-Aware Frame Extraction

Overview

claude-real-video is an open-source tool that lets Large Language Models (LLMs) such as Claude, ChatGPT, and Gemini "watch" videos by converting them into frames and text the models can ingest. Unlike tools that rely solely on transcripts or fixed-interval frame sampling, it extracts only the frames that matter (scene changes and key visual shifts) and pairs them with a transcript, all while running locally on the user's machine.

Solving the Limitations of Fixed-Interval Sampling

Most existing video-to-LLM pipelines, including Gemini's native video processing, sample frames at a fixed interval (e.g., 1 frame per second). This approach has two main problems:

  1. Over-sampling static content: A 10-minute static screencast sampled at 1 frame per second yields 600 nearly identical frames, wasting context window space and increasing costs.
  2. Under-sampling fast cuts: Rapid visual changes occurring between the fixed sample points are often missed entirely.
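Both problems can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names and the shot boundaries are made up for the example and are not part of claude-real-video.

```python
def fixed_interval_samples(duration_s, interval_s=1.0):
    """Timestamps (in seconds) a fixed-interval sampler would capture."""
    count = int(duration_s // interval_s)
    return [i * interval_s for i in range(count)]


def missed_shots(shots, samples):
    """Return the (start, end) shots that contain no sampled timestamp."""
    return [(a, b) for a, b in shots if not any(a <= t < b for t in samples)]


# Over-sampling: a 10-minute video at 1 frame per second.
print(len(fixed_interval_samples(600)))  # 600

# Under-sampling: a hypothetical half-second shot falls between two samples.
shots = [(0.0, 2.2), (2.2, 2.7), (2.7, 5.0)]
print(missed_shots(shots, fixed_interval_samples(5)))  # [(2.2, 2.7)]
```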

claude-real-video addresses this by combining scene-change detection with a density floor. Frames are captured at detected visual transitions, and a minimum frame frequency (set by --fps-floor) keeps long static stretches from being skipped entirely.
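One way to picture the combination is as a merge of two timestamp lists: the detected scene changes, plus filler frames wherever the gap between scene changes grows too long. The following is a minimal sketch under that assumption, not the tool's actual implementation. The function name and the max_gap_s parameter (maximum seconds allowed between kept frames) are hypothetical stand-ins for whatever --fps-floor controls internally.

```python
def select_timestamps(scene_changes, duration_s, max_gap_s):
    """Merge scene-change times with filler frames so no gap exceeds max_gap_s.

    scene_changes: timestamps (seconds) reported by a scene detector.
    max_gap_s: hypothetical density floor, expressed as a maximum gap.
    """
    # Always start at 0.0 and ignore detections outside the video.
    anchors = sorted(set([0.0] + [t for t in scene_changes if 0.0 <= t < duration_s]))
    selected = []
    for start, end in zip(anchors, anchors[1:] + [duration_s]):
        selected.append(start)          # the scene change itself
        t = start + max_gap_s
        while t < end:                  # filler frames inside a long shot
            selected.append(t)
            t += max_gap_s
    return selected


# Two quick cuts followed by a long static shot.
print(select_timestamps([2.5, 3.0], duration_s=10, max_gap_s=4))
# [0.0, 2.5, 3.0, 7.0]

# A fully static 10-minute video with a 60-second floor: 10 frames, not 600.
print(len(select_timestamps([], duration_s=600, max_gap_s=60)))
```

Short shots are kept because each one starts at a scene change, while static content contributes only as many frames as the floor demands.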

Key Technical Features

Intelligent Frame Deduplication

To further optimize the context window, the tool employs a sliding-window deduplication process. Instead of simple perceptual hashes, it uses downscaled RGB pixel differences to determine if a frame is unique.

By comparing a new frame against a sliding window of the last several kept frames (controlled by --dedup-window), the tool prevents the same shot from being sent to the LLM multiple times, even if the video uses A-B-A cutaways. Users can tune the sensitivity of this process using the --dedup-threshold flag.
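The sliding-window idea can be sketched in pure Python. This is an illustration under stated assumptions rather than the tool's code: frames are represented as flat lists of already-downscaled (r, g, b) tuples, the per-channel tolerance is an invented parameter, and the defaults for threshold and window simply mirror the documented --dedup-threshold and --dedup-window defaults.

```python
from collections import deque


def changed_percent(a, b, tolerance=16):
    """Percentage of pixels that differ by more than `tolerance` on any channel."""
    changed = sum(
        1
        for pa, pb in zip(a, b)
        if max(abs(ca - cb) for ca, cb in zip(pa, pb)) > tolerance
    )
    return 100.0 * changed / len(a)


def dedup(frames, threshold=8.0, window=4):
    """Return indices of frames that differ from every recently kept frame."""
    recent = deque(maxlen=window)  # the last `window` kept frames
    kept = []
    for index, frame in enumerate(frames):
        if all(changed_percent(frame, prev) >= threshold for prev in recent):
            kept.append(index)
            recent.append(frame)
    return kept


# An A-A-B-A-B sequence of two hypothetical 8x8 shots.
shot_a = [(200, 30, 30)] * 64
shot_b = [(30, 30, 200)] * 64
frames = [shot_a, shot_a, shot_b, shot_a, shot_b]

print(dedup(frames, window=4))  # [0, 2]
print(dedup(frames, window=1))  # [0, 2, 3, 4]
```

With a window of 1 the comparison only sees the previous kept frame, so each return to shot A or B looks new. A wider window remembers both shots and keeps one frame of each.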

Multi-Modal Data Extraction

Beyond visuals, claude-real-video provides a comprehensive data package for the LLM:

  • Transcripts: The tool first attempts to use existing subtitles (SRT/VTT) for maximum accuracy. If none exist, it falls back to OpenAI's Whisper for local audio transcription.
  • Audio Extraction: With the --keep-audio flag, the tool saves the full soundtrack as an audio.m4a file, allowing models with native audio capabilities (like GPT-4o or Gemini) to analyze tone and music.
  • Manifest File: A MANIFEST.txt file is generated to summarize the extracted content, providing the LLM with a structured map of the video's components.
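The subtitles-first fallback for transcripts can be sketched as a simple lookup. The example below assumes subtitles sit next to the video as a sidecar file with the same base name; that layout and the function name are assumptions for illustration, and the tool itself may locate subtitles differently (for example, through yt-dlp).

```python
from pathlib import Path


def pick_transcript_source(video_path):
    """Prefer an existing SRT/VTT sidecar file; otherwise fall back to Whisper.

    Returns a (kind, path) pair, where kind is "subtitles" or "whisper".
    """
    video = Path(video_path)
    for ext in (".srt", ".vtt"):
        candidate = video.with_suffix(ext)  # e.g. lecture.mp4 -> lecture.srt
        if candidate.exists():
            return ("subtitles", candidate)
    # No subtitles found: the audio would need local transcription.
    return ("whisper", None)
```

A caller would branch on the returned kind: parse the subtitle file directly, or hand the audio track to a local Whisper model.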

Local Processing and Privacy

All processing—fetching, extraction, and deduplication—happens locally. The tool uses yt-dlp for URL handling and ffmpeg for frame and audio extraction, ensuring that video data is not uploaded to a cloud service for preprocessing.

Installation and Usage

System Requirements

claude-real-video requires Python 3.10+ and an ffmpeg binary available on the system PATH.

Installation

Users can install the core functionality or the full suite including transcription:

# Core frames and deduplication
pip install claude-real-video

# Full suite including audio transcription
pip install "claude-real-video[whisper]"

Basic Commands

  • From a URL: crv "https://www.youtube.com/watch?v=..."
  • From a local file: crv lecture.mp4 -o out --lang en
  • Frames only: crv clip.mp4 --no-transcribe
  • Login-gated content: crv "https://..." --cookies cookies.txt

Summary of Configuration Options

Flag               Default  Description
--scene            0.30     Sensitivity for scene-change detection (lower = more frames)
--fps-floor        1.0      Minimum frame capture frequency (one frame every N seconds)
--max-frames       150      Maximum total frames to extract
--dedup-threshold  8        Pixel change percentage required to count as a new frame
--dedup-window     4        Number of previous frames to compare against for deduplication
--report           off      Generates report.html to visualize keep/drop decisions
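For scripted runs, the options can be assembled into a command line programmatically. The helper below is a hypothetical convenience wrapper, not part of the tool. It assumes each flag takes its value as a separate argument and that --report is a plain switch, neither of which the table confirms. It only builds the argument list; passing the result to subprocess.run would execute it.

```python
import shlex


def build_crv_command(source, scene=0.30, max_frames=150,
                      dedup_threshold=8, dedup_window=4, report=False):
    """Build an argument list for the crv CLI using the documented defaults."""
    cmd = [
        "crv", source,
        "--scene", str(scene),
        "--max-frames", str(max_frames),
        "--dedup-threshold", str(dedup_threshold),
        "--dedup-window", str(dedup_window),
    ]
    if report:
        cmd.append("--report")  # assumed to be a boolean switch
    return cmd


# A more sensitive scene threshold plus the HTML report.
print(shlex.join(build_crv_command("lecture.mp4", scene=0.2, report=True)))
# crv lecture.mp4 --scene 0.2 --max-frames 150 --dedup-threshold 8 --dedup-window 4 --report
```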
