Show HN: Cheddar-bench – unsupervised benchmark for coding agents

  • Posted by przadka
https://github.com/przadka/cheddar-bench
I built a small benchmark to test CLI coding agents on blind bug detection.

A challenger agent injects bugs into a repo and writes the ground truth (`bugs.json`). A separate reviewer agent then audits the repo without access to that ground truth, and an LLM matcher scores the assignment of injected bugs to reviewer findings.
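Conceptually, the matching step looks something like this. This is a simplified sketch, not the actual implementation: the `bugs.json` field layout, the `llm_same_bug()` judge stub, and the greedy one-to-one assignment are illustrative assumptions.

```python
# Simplified sketch of the matching step (illustrative, not the repo's API).
import json
from itertools import product

def llm_same_bug(bug: dict, finding: dict) -> float:
    """Stub for the LLM matcher: return a 0-1 confidence that
    `finding` describes the injected `bug`."""
    raise NotImplementedError  # wire this up to your judge model

def match(bugs_path: str, findings: list[dict], threshold: float = 0.5):
    with open(bugs_path) as f:
        bugs = json.load(f)  # ground truth written by the challenger
    # Score every (bug, finding) pair, then assign greedily one-to-one,
    # best scores first, so each finding can credit at most one bug.
    scored = sorted(
        ((llm_same_bug(b, fnd), i, j)
         for (i, b), (j, fnd) in product(enumerate(bugs), enumerate(findings))),
        reverse=True,
    )
    used_bugs, used_findings, pairs = set(), set(), []
    for score, i, j in scored:
        if score >= threshold and i not in used_bugs and j not in used_findings:
            used_bugs.add(i)
            used_findings.add(j)
            pairs.append((i, j, score))
    return pairs  # raw detection rate = len(pairs) / len(bugs)
```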

Current run: 50 repos, 150 challenges, 450 reviews, 2,603 injected bugs.

Weighted detection rates: Claude 58.05%, Codex 37.84%, Gemini 27.81%.
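Roughly, weighted detection is computed like this (minimal sketch; the per-bug weight field, here a hypothetical `"severity"`, is an assumption, not the repo's schema):

```python
# Sketch of the weighted detection metric: the weight-summed share of
# injected bugs that were matched to a reviewer finding.
def weighted_detection(bugs: list[dict], matched_ids: set[int]) -> float:
    total = sum(b.get("severity", 1.0) for b in bugs)
    found = sum(b.get("severity", 1.0) for b in bugs if b["id"] in matched_ids)
    return found / total  # e.g. 0.5805 -> reported as 58.05%
```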

LLM-judge benchmarks are easy to get wrong, so I’d really appreciate critical feedback on benchmark fairness, scoring/matching methodology, and obvious failure modes I’m missing.

The full dataset is linked in the docs.
