HN – LLM agent workflows fail silently. Here's the reliability layer we wish existed

For the past few months my co-founder and I have been building complex agentic workflows, and we kept hitting the same reliability issues: inconsistent shared state, silent failures, agents diverging from each other, and no clean way to recover without restarting the entire workflow.

It became clear that most of these failures weren’t “LLM problems” but classic distributed-systems problems (state consistency, failure detection, recovery) showing up in multi-agent setups.

Since nothing in the current ecosystem addressed this properly, we started building a reliability layer for agent workflows: something that adds structure, safety, and predictable recovery to multi-agent systems without forcing developers to rewrite their stack.

We’re looking to connect with people who have run into similar issues or are building production-grade agent workflows. The goal is to understand how others think about reliability, failure recovery, and workflow consistency in these systems.

If you’re working in this space or want to try the early-access version, here’s the link: https://tally.so/r/LZDb0j

Would appreciate any thoughts or experiences others have had around agent reliability, especially failure cases or pain points.
