We all have thousands of buried screenshots, notes, files, bookmarks, saved posts, etc. we'll never find again. The obvious way to make AI understand all of it is to upload everything to the cloud -- a privacy nightmare, and way too expensive at scale. And on paper it shouldn't be possible on-device either: small models are too dumb, and phones are too slow for thousands of LLM inference runs.
So I spent a year deeply optimizing every layer of the on-device inference stack to make it possible anyway.
Sentient OS runs a custom multimodal vision LLM on your phone and laptop while they charge overnight. It understands your entire digital life -- every screenshot, note, file, email, bookmark, plus integrations for external services -- with nothing ever leaving your device.
This gives you three things that weren't possible before:
-> Talk to your entire digital life in natural language: "what was that wine I liked?" / "who did I wanna meet next week?" [on-device RAG]. And with MCP, your existing LLM (ChatGPT, Claude, etc.) can talk to your digital life too -- so it actually understands you.
-> Proactive reminders surfaced from your own data: "that tax return in your Downloads is due next week" / "tickets for that concert you screenshotted open tomorrow"
-> Knowledge graphs of your entire digital life: tap any node to find what you buried!
Here's what I had to build to make this possible:
Inference speed:
- KV cache reuse: the system prompt + few-shot examples are identical across all 3,000 analysis calls. I run inference on that prefix once, cache the KV state, and reuse it for every image, so prefill drops to just processing the image itself (rough sketch after this list).
- Thermal-aware scheduling: I throttle the moment iOS reports a thermal state above fair. I have all night, so I trade speed for not cooking the device.
- iOS jetsam awareness: iOS kills any app whose memory footprint crosses a per-device threshold. I profiled that threshold across different iPhones and push right up to the edge. (Both guardrails are sketched below.)
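
For those curious, here's roughly what the prefix-reuse loop looks like -- a minimal Swift sketch where `VisionLLM` and `KVCache` are hypothetical stand-ins for my actual MLX-backed runner, not a public API:

```swift
import Foundation

// Minimal sketch of prefix KV-cache reuse. `VisionLLM` and `KVCache` are
// hypothetical stand-ins for the real MLX-backed runner, not a public API.
struct KVCache {
    var keys: [[Float]] = []
    var values: [[Float]] = []
}

protocol VisionLLM {
    // Prefill over text tokens, appending to the cache in place.
    func prefill(tokens: [Int], cache: inout KVCache)
    // Prefill over the image, then decode an answer, continuing from the cache.
    func analyze(image: Data, cache: inout KVCache) -> String
}

func analyzeAll(images: [Data], model: any VisionLLM, promptTokens: [Int]) -> [String] {
    // Pay for the system prompt + few-shot examples exactly once.
    var prefixCache = KVCache()
    model.prefill(tokens: promptTokens, cache: &prefixCache)

    // Every image starts from a copy of that cache, so per-image prefill
    // is only the image tokens themselves.
    return images.map { image -> String in
        var cache = prefixCache   // value copy; the shared prefix is never recomputed
        return model.analyze(image: image, cache: &cache)
    }
}
```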
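The overnight-batch guardrails lean on two real iOS signals, `ProcessInfo.thermalState` and `os_proc_available_memory()`; the exact policy and the headroom margin below are illustrative placeholders, not the shipped values:

```swift
import Foundation
import os  // os_proc_available_memory() (iOS 13+)

/// Decides whether the next inference call should run, pause, or stop.
/// The thresholds below are illustrative, not the shipped configuration.
enum BatchDecision { case run, cooldown, stopForMemory }

func nextStep() -> BatchDecision {
    // Thermal-aware scheduling: back off as soon as iOS reports anything
    // hotter than .fair. We have all night, so waiting is free.
    switch ProcessInfo.processInfo.thermalState {
    case .serious, .critical:
        return .cooldown
    default:
        break
    }

    // Jetsam awareness: os_proc_available_memory() reports how much more the
    // process can allocate before iOS considers killing it. Keep a safety
    // margin below the profiled per-device limit.
    let headroomBytes = os_proc_available_memory()
    if headroomBytes < 300 * 1024 * 1024 {   // ~300 MB margin (placeholder)
        return .stopForMemory
    }
    return .run
}
```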
Model quality at small size:
- Vision transplant: a 2B Qwen model has terrible vision, so I transplanted Qwen 3.5 9B's multimodal projector onto the 2B base. Staying within the same architecture family is what makes this possible (sketch below).
- Selective quantization on MLX: MLX doesn't support k-quant style mixed precision, so I built it manually: higher precision on the first/last layers and high-activation layers, more aggressive quantization on the rest (sketch below).
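
The transplant itself is checkpoint surgery: copy the vision tower + projector tensors from the big checkpoint over the small one, and keep the 2B language stack. A conceptual Swift sketch (the checkpoint type and key prefixes here are hypothetical, not the real Qwen weight names):

```swift
import Foundation

// Conceptual sketch of the "vision transplant": graft the donor's vision
// encoder + multimodal projector weights onto the small language model.
// Checkpoints are modeled as [name: tensorBytes] dictionaries; the key
// prefixes are hypothetical placeholders.
typealias Checkpoint = [String: Data]

func transplantVision(from donor: Checkpoint, onto base: Checkpoint,
                      visionPrefixes: [String] = ["visual.", "mm_projector."]) -> Checkpoint {
    // Keep the 2B language stack, drop whatever vision weights it had.
    var merged = base.filter { entry in
        !visionPrefixes.contains(where: { entry.key.hasPrefix($0) })
    }
    // Graft the donor's vision tower + projector in unchanged. This only
    // works because both checkpoints share the same architecture family.
    for (key, tensor) in donor where visionPrefixes.contains(where: { key.hasPrefix($0) }) {
        merged[key] = tensor
    }
    return merged
}
```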
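The mixed-precision part boils down to a per-layer precision policy applied before quantizing; something like this, with illustrative bit widths and layer picks rather than the shipped config:

```swift
// Sketch of the per-layer precision policy behind the "manual" mixed
// precision. Bit widths, group sizes, and the high-activation layer list
// are illustrative values, not the shipped configuration.
struct QuantSpec {
    let bits: Int
    let groupSize: Int
}

func quantSpec(layerIndex: Int, totalLayers: Int, highActivationLayers: Set<Int>) -> QuantSpec {
    // Keep the first/last layers at higher precision: quantization error
    // there compounds through everything downstream.
    if layerIndex < 2 || layerIndex >= totalLayers - 2 {
        return QuantSpec(bits: 8, groupSize: 64)
    }
    // Layers with large activation magnitudes (found by profiling on a
    // calibration set) also stay at higher precision.
    if highActivationLayers.contains(layerIndex) {
        return QuantSpec(bits: 8, groupSize: 64)
    }
    // Everything else gets quantized aggressively.
    return QuantSpec(bits: 4, groupSize: 64)
}

// Example: build the per-layer plan for a 28-layer model.
let plan = (0..<28).map { quantSpec(layerIndex: $0, totalLayers: 28, highActivationLayers: [5, 17]) }
```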
The alpha processes ~3,000 screenshots entirely on-device on a 6-year-old iPhone. Coming to Mac and iPhone!
Previously I researched Apple's neural accelerators: https://www.reddit.com/r/LocalLLaMA/comments/1ohrn20/
And I love OSS! I built https://github.com/theJayTea/WritingTools (2K+ stars, ~30 press features). I'm considering making Sentient OS OSS under AGPL (so no one else can profit off of my work haha).
I think this is one of the coolest consumer use cases for on-device LLMs. I'd love to hear what you all think, and I'm happy to answer any questions (I love geeking out about the deep work that's gone into optimizing models and inference!) :D