How does this skill let Claude 'watch' videos when Anthropic has no video model?
It splits a video into two parts: frame-by-frame screenshots and a timestamped transcript. Both are fed to Claude, which reads the visual context alongside the spoken words.
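As a rough illustration of how frames and transcript text can be lined up by timestamp, here is a minimal sketch. The `Segment` type and the `caption_for_frame` and `interleave` helpers are hypothetical names invented for this example; the skill's actual data structures are not described here.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    """One transcript line (hypothetical shape): start/end in seconds plus text."""
    start: float
    end: float
    text: str


def caption_for_frame(t: float, segments: list[Segment]) -> Optional[str]:
    """Return the text of the segment whose [start, end) range contains t.

    Assumes segments are sorted by start time and do not overlap.
    """
    starts = [s.start for s in segments]
    i = bisect_right(starts, t) - 1  # last segment starting at or before t
    if i >= 0 and t < segments[i].end:
        return segments[i].text
    return None  # frame falls in a gap with no speech


def interleave(frame_times: list[float], segments: list[Segment]) -> list[tuple[float, Optional[str]]]:
    """Pair each frame timestamp with the transcript text spoken at that moment."""
    return [(t, caption_for_frame(t, segments)) for t in frame_times]
```

Each resulting pair can then be sent as an image plus its accompanying text, so a frame is always read next to what was being said when it was captured.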
Which tools power the pipeline and where do they run?
The pipeline uses yt-dlp to download media and ffmpeg to extract frames and audio; both run locally on your machine. Transcription uses Whisper (often via Groq), and Claude does the multimodal analysis.
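The local download and extraction steps can be sketched as command builders. This is an assumption-laden outline, not the skill's actual code: the function names, the one-frame-per-second default, the 16 kHz mono audio settings, and the file names are all illustrative choices.

```python
from pathlib import Path


def download_cmd(url: str, out_template: str = "video.%(ext)s") -> list[str]:
    """yt-dlp command; -o sets the output template and yt-dlp fills in the extension."""
    return ["yt-dlp", "-o", out_template, url]


def frames_cmd(video: Path, frames_dir: Path, fps: float = 1.0) -> list[str]:
    """ffmpeg command that samples `fps` frames per second into numbered JPEGs.

    The 1 fps default is an assumption for this sketch.
    """
    pattern = frames_dir / "frame_%04d.jpg"
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps}", str(pattern)]


def audio_cmd(video: Path, audio: Path) -> list[str]:
    """ffmpeg command that drops the video stream (-vn) and writes mono 16 kHz audio.

    Mono 16 kHz is an assumed, speech-friendly setting.
    """
    return ["ffmpeg", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(audio)]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once yt-dlp and ffmpeg are installed; keeping the commands as plain lists avoids shell quoting problems with unusual URLs or file names.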
What are the costs involved?
Most costs are minimal: YouTube captions are free when available, and Groq's free tier for Whisper covers many cases. The main expense is Claude usage for analysis; demo runs cost about a dollar each.
What sites and file types are supported?
It works with any URL that yt-dlp supports (over 1,000 sites, including YouTube, Instagram, and Loom), as well as local files.
How can I use the output in my workflow?
Save Claude's structured summaries and timestamped notes into tools like Obsidian to build a searchable second brain and compound insights over time.
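One way to turn a summary and timestamped notes into an Obsidian-friendly Markdown file is sketched below. The note layout, the `video-notes` tag, and the `fmt_ts` and `render_note` helpers are hypothetical; adapt them to your own vault conventions.

```python
def fmt_ts(seconds: float) -> str:
    """Format seconds as MM:SS, or H:MM:SS once the video passes an hour."""
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:d}:{m:02d}:{s:02d}" if h else f"{m:02d}:{s:02d}"


def render_note(title: str, source: str, summary: str,
                notes: list[tuple[float, str]]) -> str:
    """Build a Markdown note with YAML front matter and a timestamped bullet list."""
    lines = [
        "---",
        f"source: {source}",
        "tags: [video-notes]",  # assumed tag name, for illustration only
        "---",
        "",
        f"# {title}",
        "",
        summary,
        "",
        "## Timestamped notes",
        "",
    ]
    lines += [f"- [{fmt_ts(t)}] {text}" for t, text in notes]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a `.md` file inside your vault folder makes the note searchable alongside the rest of your notes.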