Is there a good agent leaderboard for real-life tasks other than coding?

  • Posted 7 hours ago by tototozip
  • 1 point
I feel like the benchmark space is quite crowded when it comes to coding agents. We have some remarkable projects like TerminalBench, SWE-bench, RepoBench, etc., and I actually think we are close to a gold standard here. I also know that we have general web/computer-control benchmarks like GAIA, WebArena, and OSWorld, but these feel like "general purpose" tests.

People want AI agents to help them with many different tasks, and I can find almost no interesting benchmarks outside of the web vertical. Are there any projects addressing "real world" business challenges, or is everyone just focusing on coding and general web browsing right now?
