HN – Show HN: AptSelect – A local LLM client for parallel testing and evaluation

I built AptSelect to stop writing throwaway scripts every time I needed to test how different LLMs handle specific instructions and prompt edge cases.

What it does:

Parallel Execution: Send a single prompt to OpenAI, Anthropic, Mistral, and Gemini simultaneously. Compare the outputs, latency, and exact token usage side-by-side.

Batch Evaluations: Upload a CSV dataset to run bulk tests across multiple models at once.

Manual Diagnostics: Grade outputs manually (Pass/Fail) and assign diagnostic tags (e.g., Hallucination, Format Error) to build a human-verified performance leaderboard.

Local-first: API keys are encrypted with your OS keyring, history is stored in a local SQLite DB, and there is no telemetry.
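
For anyone curious what the parallel fan-out and the pass/fail leaderboard amount to, here is a minimal Python sketch. It is not AptSelect's actual code: `fake_provider`, `timed_call`, `fan_out`, `leaderboard`, and the `grades` table are hypothetical names made up for illustration, and the provider call is a stub that sleeps instead of hitting a real API.

```python
import asyncio
import sqlite3
import time


async def fake_provider(name: str, prompt: str, delay: float) -> str:
    """Hypothetical stand-in for a real provider SDK call; sleeps, then echoes."""
    await asyncio.sleep(delay)
    return f"[{name}] reply to: {prompt}"


async def timed_call(name: str, prompt: str, delay: float) -> dict:
    """Call one (stub) provider and record wall-clock latency."""
    start = time.perf_counter()
    output = await fake_provider(name, prompt, delay)
    return {"model": name, "output": output,
            "latency_s": time.perf_counter() - start}


async def fan_out(prompt: str, providers: dict) -> list:
    """Send the same prompt to every provider concurrently.

    `providers` maps a model name to a simulated delay in seconds.
    asyncio.gather returns results in the order the calls were passed in.
    """
    calls = (timed_call(name, prompt, delay) for name, delay in providers.items())
    return await asyncio.gather(*calls)


def leaderboard(grades: list) -> list:
    """Aggregate manual (model, passed) grades into per-model pass rates.

    Uses an in-memory SQLite DB here; a local file path would work the same way.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE grades (model TEXT, passed INTEGER)")
    conn.executemany("INSERT INTO grades VALUES (?, ?)",
                     [(model, int(passed)) for model, passed in grades])
    rows = conn.execute(
        "SELECT model, AVG(passed) AS pass_rate, COUNT(*) AS n "
        "FROM grades GROUP BY model ORDER BY pass_rate DESC, model"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    results = asyncio.run(fan_out("Say hi", {"model-a": 0.05, "model-b": 0.10}))
    for r in results:
        print(r["model"], f"{r['latency_s']:.3f}s", r["output"])
    print(leaderboard([("model-a", True), ("model-a", False), ("model-b", True)]))
```

Because the calls run concurrently, total wall time is roughly that of the slowest provider rather than the sum of all of them, which is the main reason to fan out instead of looping.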

I’m looking for technical feedback. What do you think current LLM testing/evaluation tools get most wrong?
