85.0% on 4,326 questions. where that lands:
ROMA (357B): 93.9% OpenDeepSearch (671B): 88.3% Sonar Pro: 85.8% Llama 3.2 3B + Keiro: 85.0%
the systems ahead of us are running models 100-200x larger. that's why they're ahead. not better retrieval, not better prompting — just way more parameters.
the interesting part is how small the gap is despite that. 3 points behind a 671B model. 0.8 behind Sonar Pro. at some point you have to ask what you're actually buying with all that compute for this class of task.
Want to know how low the reader model can go before it starts mattering. in this setup it clearly wasn't the limiting factor and also if smaller models with web enabled will perform as good( if not better) as larger models for a lot of non coding tasks
Full benchmark script + results --> https://github.com/h-a-r-s-h-s-r-a-h/benchmark
Keiro research -- https://www.keirolabs.cloud/docs/api-reference/research