You give it one prompt. It browses the real App Store in Chrome, installs the app on a physical iPhone through macOS iPhone Mirroring (not a simulator), opens the app and explores it (it had never seen Snapseed, the demo app, before), records clips and screenshots, composites a narrated review video locally with FFmpeg, uploads it to YouTube, then deletes the app. The whole run takes about an hour, and I never touch the keyboard.
The exploration part is what I'm happiest with. The agent reads the App Store description, goes "they say background removal works, let me try that," and then figures out an unfamiliar app on its own. It regrounds from a live screenshot before every action, so unexpected dialogs or UI changes don't derail it.
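The reground loop can be sketched like this. Everything here is an illustrative stand-in, not the project's real API: `capture_screenshot`, `decide_next_action`, and `Action` are hypothetical names, and the stubs just terminate immediately. The point is the shape: every step starts from fresh pixels, so a surprise dialog simply becomes the next observation instead of invalidating a stale plan.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "tap", "swipe", "type", or "done"
    target: str = ""   # description of the UI element to act on

def capture_screenshot(step: int) -> str:
    # Stand-in: the real system grabs a frame from iPhone Mirroring.
    return f"screen-{step}"

def decide_next_action(goal: str, screen: str) -> Action:
    # Stand-in: the real system sends the current screenshot to the model
    # and parses its chosen action. Here we just finish immediately.
    return Action(kind="done")

def explore(goal: str, max_steps: int = 50) -> list[str]:
    history = []
    for step in range(max_steps):
        screen = capture_screenshot(step)            # reground: fresh pixels each step
        action = decide_next_action(goal, screen)    # decide from current screen only
        history.append(f"{screen} -> {action.kind}")
        if action.kind == "done":
            break
    return history

print(explore("try background removal"))  # -> ['screen-0 -> done']
```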
The reason it can sustain an hour of work is that each of the six stages runs as a separate child session with its own context. An hour of screenshots won't fit in one context window, so the isolation is necessary. Stages are typed: "workers" are deterministic (browser automation, device control), "skills" are agentic (the agent decides what to do), and a "playbook" orchestrates both.
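A minimal sketch of that worker/skill/playbook split, with hypothetical names rather than the project's real API. Each stage carries its kind and its own callable; in the real system each one would additionally get a fresh child session so screenshots never pile up in a single context.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    kind: str                      # "worker" (deterministic) or "skill" (agentic)
    run: Callable[[], str]         # in reality, runs inside its own child session

def run_playbook(stages: list[Stage]) -> list[str]:
    # The playbook just sequences stages; isolation between them is what
    # keeps an hour-long run from overflowing any one context window.
    return [f"{s.kind}:{s.name} -> {s.run()}" for s in stages]

playbook = [
    Stage("install_app", "worker", lambda: "ok"),    # browser/device automation
    Stage("explore_app", "skill", lambda: "ok"),     # agent decides what to do
    Stage("compose_video", "worker", lambda: "ok"),  # FFmpeg compositing
]
print(run_playbook(playbook))
# -> ['worker:install_app -> ok', 'skill:explore_app -> ok', 'worker:compose_video -> ok']
```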
Result video (what the agent published): https://youtube.com/shorts/jliTvpTnsKY?feature=share
Process video (how it was built): https://youtu.be/gYMYI0bxkJs
X: https://x.com/LiangSong850509/status/2037612742392357218?s=2...
MIT license.