The Method: I used adversarial mutations (22+ types like prompt injection, encoding attacks, context manipulation) to simulate real-world hostile inputs, checking for failures in latency, safety, and correctness.
The Result: The agent scored a 5.2% robustness score. 57 out of 60 adversarial tests failed. Key failures:
Encoding Attacks: 0% pass rate. The agent would decode malicious Base64 inputs instead of rejecting them—a major security oversight.
Prompt Injection: 0% pass rate. Basic "ignore previous instructions" attacks succeeded every time.
Severe Performance Degradation: Latency spiked to ~30 seconds under stress, far exceeding reasonable timeouts.
This isn't about one bad agent. It's a pattern suggesting our default "happy path" testing is insufficient. Agents that seem fine in demos can be fragile and insecure under real-world conditions.
I'm sharing this to start a discussion:
Are we underestimating the adversarial robustness needed for production AI agents?
What testing strategies beyond static evals are proving effective?
Is chaos engineering or adversarial testing a necessary new layer in the LLM dev stack?
[1] Flakestorm GitHub (the tool used for testing): https://github.com/flakestorm/flakestorm