We recently built Clipto. It’s a tool that lets you search over terabytes of video, audio, and images on your computer, without relying on the cloud.
Motivation: we've probably all been there. You know a moment exists somewhere in a video or audio recording, but finding it takes hours of scrubbing through the timeline. You can send all your media to the cloud for processing, but that's slow and expensive, and it raises privacy concerns. So we decided to build our own on-device media search engine.
How it works (high level):
1. We ingest video, audio, and images, normalize formats via ffmpeg, and run content analysis to downsample the frames that go on to deeper analysis.
2. A local ASR pipeline (optimized Whisper) transcribes speech to text, and speakers are identified; faces are detected and, if the person is known, assigned a person ID; a vision model (optimized Qwen3.5) runs on the downsampled frames to detect scenes, actions, and objects, read on-screen text (OCR), and generate visual descriptions.
3. A graph data structure ties everything together into a searchable memory.
4. At query time, a lightweight local language model interprets the user's query and intent. A graph search retrieves all matching clip candidates, and a reranking model orders them.
5. All processing happens on your computer, without touching our servers.
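To make steps 3 and 4 a bit more concrete, here is a toy sketch of the general idea, not Clipto's actual code. It assumes a simple two-sided graph where clip nodes link to label nodes (spoken words, people, objects). The names `Clip`, `MediaGraph`, `add`, and `search` are made up for illustration, and a plain match count stands in for the language model and the reranking model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    path: str     # source media file (hypothetical field names)
    start: float  # clip start, in seconds
    end: float    # clip end, in seconds


class MediaGraph:
    """Toy graph: each clip node is linked to (kind, value) label nodes."""

    def __init__(self):
        self.label_to_clips = defaultdict(set)
        self.clip_to_labels = defaultdict(set)

    def add(self, clip, kind, value):
        # kind is e.g. "speech", "person", or "object"; values are lowercased.
        label = (kind, value.lower())
        self.label_to_clips[label].add(clip)
        self.clip_to_labels[clip].add(label)

    def search(self, terms, top_k=5):
        # Stand-in for query understanding: treat the query as a bag of terms.
        wanted = {t.lower() for t in terms}
        scores = defaultdict(int)
        for (_kind, value), clips in self.label_to_clips.items():
            if value in wanted:
                for clip in clips:
                    scores[clip] += 1
        # Stand-in for the reranker: most matching labels first,
        # ties broken by file path and start time.
        ranked = sorted(
            scores.items(),
            key=lambda kv: (-kv[1], kv[0].path, kv[0].start),
        )
        return [clip for clip, _ in ranked[:top_k]]


graph = MediaGraph()
beach = Clip("trip.mp4", 12.0, 20.0)
office = Clip("meeting.mp4", 0.0, 30.0)
graph.add(beach, "object", "dog")
graph.add(beach, "speech", "frisbee")
graph.add(office, "speech", "dog")

# The beach clip matches both terms, so it ranks ahead of the office clip.
for clip in graph.search(["dog", "frisbee"]):
    print(clip.path, clip.start, clip.end)
```

A real system would match on meaning rather than exact strings, but the shape is the same: analysis results become edges, and a query becomes a walk from label nodes back to clips.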
Right now, it runs best on Apple Silicon Macs with 24GB+ memory, but we are working on broader support as well as an API/MCP for other agents to call.
We’d love to hear your feedback. Feel free to ask anything!