Level 1 → launchd auto-restart (seconds) Level 2 → watchdog health check every 60s Level 3 → Claude Code AI diagnosis + repair (fires after 30+ min failure) Level 4 → Discord/Telegram human escalation
The system looked great on paper. But v3.0.0 had a critical bug: Level 2 logged "Escalating to Level 3..." but never actually called the script. The chain was broken at the most important handoff — silently, for weeks.
v3.1.0 fixes this plus adds: - Chain verification in the installer (each level is tested end-to-end) - Graceful .env loading so Level 3 doesn't die silently without API keys - Watchdog Catch-up mode for missed intervals
GitHub: https://github.com/Ramsbaby/openclaw-self-healing
The meta-lesson: "I have monitoring" ≠ "my monitoring chain is actually connected." Testing each component starts is not the same as testing the system works end-to-end.