Clip Search Shipped: Find Any Moment Locally

A while back I wrote a post about what was next for Obskura: clip search, auto-clipping, and the careful work of figuring out what was actually feasible before promising anything. That work has shipped. Here's what changed, and where things are heading from here.

Everything below runs on your device. No recording content leaves the machine. Models download once on first use with checksum verification, then they're yours.

Smart review: a 10-hour stream, reviewable in 30 seconds

The first thing you see when you open a recording is an activity timeline. Silent stretches, low-motion lulls, and scene cuts are detected with plain ffmpeg passes, no ML model needed for this part. The timeline gives you a heat track per signal type so you can spot where the action is without scrubbing.

The same passes feed a Smart Cleanup view that surfaces recordings that are mostly silence or mostly idle. It suggests; it never auto-deletes.

Visual search: type what you remember

Type "red dress dancing" or "crowd reaction" and get timestamped thumbnails back. This runs on CLIP embeddings cached per frame, so the first search on a recording takes a minute or two and every subsequent search is instant, even searches you never thought to run when you first embedded the file.

You can search a single recording, an entire streamer's folder, or your whole library. Inference picks the best backend for your hardware automatically: CUDA on NVIDIA, DirectML on Windows GPUs, CoreML on Apple Silicon, CPU otherwise. Same model file, runs on whatever you've got. For a practical walkthrough, see how to find a specific moment in hours of footage.

Speech search: find the exact phrase

Visual search handles "what does it look like". Speech search handles "what did they say". Local Whisper transcription writes a timestamped script to SQLite, which is then queryable with full-text search: phrase quoting, AND/OR/NOT, prefix wildcards, the works. Click a result and the modal player seeks to the line.

A keyword-trigger list lets you flag words or phrases that, when spoken, light up the activity timeline and feed the clip synthesizer, the same pathway visual matches take.

Clip suggestions and one-click export

When enough signals cluster in time, that cluster becomes a clip candidate. Boundaries snap to nearby scene cuts so cuts land cleanly. From there, one click exports it: stream-copy when boundaries fall on keyframes, re-encode when they don't.

Three presets ship by default: high-quality archival, 1080×1920 vertical for social, and a low-bitrate preview proxy. Bulk export hands a folder to ffmpeg and writes N clips in parallel. A compilation export stitches several candidates into one file. The whole thing is built on a generic job queue, so the same pipeline that runs visual search also runs your exports. (Here's how to clip a highlight without scrubbing the whole VOD.)

What's next

Two pieces are headed into research and development next, and a couple of platform items round out the shorter list.

A self-learning ranker. The candidates list reorders itself based on which clips you keep, export, or delete. A feature-weighted linear ranker comes first; a lightweight gradient-boosted ranker follows once there's enough signal volume. Stays on-device. Your behavior is your training data, not anyone else's.

Event search. Visual search can find a frame that looks like "red dress". Event search aims for things that unfold across time, like "streamer jumped into the pool", "funny reaction", or "crowd erupted". Hybrid recipes fuse visual, speech, audio, and motion signals together, with optional VLM verification on top-N candidates only so it stays affordable on consumer hardware.

Optional GPU acceleration for speech. The CPU whisper.cpp binary ships in the installer today. For users with discrete GPUs, an opt-in CUDA or Vulkan build will be available as an in-app download. The binary resolver already prefers a user-installed build over the bundled one.

macOS and Linux speech bundling. The engineering runbook exists. It's gated on having real hardware to verify against, rather than shipping untested cross-platform binaries.

The same bar as the recorder

Obskura's whole point is reliability: a recorder you can trust to be there when you need it. Everything added on top should meet the same bar: local-first, predictable, no surprises on your bandwidth or your privacy.

The best place to hear about progress is our Discord community (available for active subscribers). And if you're curious about the bigger picture, why this project exists and where the subscription money goes, the About page has more.