Show HN: Cheddar-bench – unsupervised benchmark for coding agents

  • Posted by przadka
https://github.com/przadka/cheddar-bench
I built a small benchmark to test CLI coding agents on blind bug detection.

A challenger agent injects bugs into a repo and writes the ground truth (`bugs.json`). A separate reviewer agent then audits the repo without access to that ground truth, and an LLM matcher scores the assignment of injected bugs to reviewer findings.
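Conceptually, the matching step looks something like this. This is a simplified sketch, not the actual implementation: the `bugs.json` field layout, the `llm_same_bug()` judge stub, and the greedy one-to-one assignment are illustrative assumptions.

```python
# Simplified sketch of the matching step (illustrative, not the repo's API).
import json
from itertools import product

def llm_same_bug(bug: dict, finding: dict) -> float:
    """Stub for the LLM matcher: return a 0-1 confidence that
    `finding` describes the injected `bug`."""
    raise NotImplementedError  # wire this up to your judge model

def match(bugs_path: str, findings: list[dict], threshold: float = 0.5):
    with open(bugs_path) as f:
        bugs = json.load(f)  # ground truth written by the challenger
    # Score every (bug, finding) pair, then assign greedily one-to-one,
    # best scores first, so each finding can credit at most one bug.
    scored = sorted(
        ((llm_same_bug(b, fnd), i, j)
         for (i, b), (j, fnd) in product(enumerate(bugs), enumerate(findings))),
        reverse=True,
    )
    used_bugs, used_findings, pairs = set(), set(), []
    for score, i, j in scored:
        if score >= threshold and i not in used_bugs and j not in used_findings:
            used_bugs.add(i)
            used_findings.add(j)
            pairs.append((i, j, score))
    return pairs  # raw detection rate = len(pairs) / len(bugs)
```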

Current run: 50 repos, 150 challenges, 450 reviews, 2,603 injected bugs.

Weighted detection rates: Claude 58.05%, Codex 37.84%, Gemini 27.81%.
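Roughly, weighted detection is computed like this (minimal sketch; the per-bug weight field, here a hypothetical `"severity"`, is an assumption, not the repo's schema):

```python
# Sketch of the weighted detection metric: the weight-summed share of
# injected bugs that were matched to a reviewer finding.
def weighted_detection(bugs: list[dict], matched_ids: set[int]) -> float:
    total = sum(b.get("severity", 1.0) for b in bugs)
    found = sum(b.get("severity", 1.0) for b in bugs if b["id"] in matched_ids)
    return found / total  # e.g. 0.5805 -> reported as 58.05%
```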

LLM-judge benchmarks are easy to get wrong, so I’d really appreciate critical feedback on benchmark fairness, scoring/matching methodology, and obvious failure modes I’m missing.

The full dataset is linked in the docs.
