In my day-to-day as a Product Manager on a team that ships AI products, I often found myself wanting to run 'quick and dirty' LLM-based evaluations on conversation transcripts and traces. I didn't need anything fancy, just 'did the agent answer the question?', 'did the agent cover the 5 things it needed to?' - that type of thing.
My first attempt was 'Gemini in Google Sheets', but it was too slow and cumbersome, and it didn't handle eval changes well - particularly when trying to associate evals with ground truth. And because I was exploring new and experimental features, it didn't make sense to set up something more robust with the team.
To work around this I eventually learned to call the OpenAI API from Python, but I kept wanting a 'product' to help me - and potentially others who need answers fast - without building infrastructure and pipelines.
So over the last few weeks I built: https://beval.space
It has:

- LLM-as-judge evals: boolean checks (yes/no), scores (1-5), categories, and freeform comments
- Reusable eval definitions you can run across different datasets
- Ground truth labelling so you can compare eval versions against human judgments
- Per-trace reasoning so you can see why the judge scored something the way it did
- An example dataset so you can try it without having your own traces ready
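For anyone who hasn't done this before, a boolean LLM-as-judge check is roughly the following. This is a minimal sketch in the style of the 'call the OpenAI API in Python' approach I started with, not beval's actual implementation; the model name and prompt wording are assumptions I picked for illustration.

```python
JUDGE_PROMPT = (
    "You are an evaluator. Read the transcript below and answer with a "
    "single word, YES or NO: did the agent answer the user's question?\n\n"
    "Transcript:\n{transcript}"
)

def parse_verdict(text: str) -> bool:
    """Map the judge's freeform reply onto a boolean, defaulting to False."""
    return text.strip().upper().startswith("YES")

def judge(transcript: str, model: str = "gpt-4o-mini") -> bool:
    """Run one boolean check over one transcript via the OpenAI API."""
    # Imported here so the prompt/parsing logic works without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": JUDGE_PROMPT.format(transcript=transcript)},
        ],
    )
    return parse_verdict(resp.choices[0].message.content)
```

The fiddly parts the product handles for you are everything around this loop: versioning the prompt, re-running it across datasets, and lining the verdicts up against ground-truth labels.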
One of our early users described it as 'quick n dirty evals when you don't want to touch a shit load of infra.' I'm trying to figure out if that's a common need or just a niche thing.
Free during beta. Would love HN's take — what's missing, and would you actually use something like this?