Ask HN: How do you interpret P99 latency without being misled?

  • Posted 5 hours ago by danelrfoster
  • 1 point
I’ve seen many teams rely heavily on P50/P95/P99 latency numbers, but still miss real user pain or misdiagnose incidents.

Recently I tried to write down a more systematic way to reason about latency distributions in production: how different distribution shapes behave, why aggregation and sampling often lie to us, and why segmentation (by endpoint, tenant, region, workload) usually matters more than adding more percentiles.
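To make the aggregation point concrete, here is a toy sketch in Python (standard library only; the host counts, tenant split, and latency numbers are invented for illustration and aren't taken from the write-up). It shows two ways a percentile summary can mislead: averaging per-host P99s, and reading a single global P99 across tenants.

    import random

    def p99(samples):
        """Nearest-rank 99th percentile of a list of latencies (ms)."""
        s = sorted(samples)
        return s[max(0, int(0.99 * len(s)) - 1)]

    random.seed(42)

    # Pitfall 1: averaging per-host P99s is not the fleet P99.
    # Nine healthy hosts plus one fully degraded host (say, stuck in GC).
    healthy = [[random.gauss(50, 5) for _ in range(10_000)] for _ in range(9)]
    degraded = [random.gauss(800, 50) for _ in range(10_000)]
    hosts = healthy + [degraded]

    avg_of_p99s = sum(p99(h) for h in hosts) / len(hosts)
    pooled_p99 = p99([x for h in hosts for x in h])
    print(f"average of per-host P99s: {avg_of_p99s:7.1f} ms")  # ~150 ms, looks tolerable
    print(f"true pooled fleet P99:    {pooled_p99:7.1f} ms")   # ~860 ms; 1 in 10 requests is slow

    # Pitfall 2: a healthy-looking global P99 can hide one small tenant.
    # 99.5% of traffic is a fast tenant; 0.5% is a tenant that is ~10x slower.
    fast_tenant = [random.gauss(40, 4) for _ in range(99_500)]
    slow_tenant = [random.gauss(400, 40) for _ in range(500)]

    print(f"global P99:       {p99(fast_tenant + slow_tenant):7.1f} ms")  # ~50 ms, looks fine
    print(f"fast tenant P99:  {p99(fast_tenant):7.1f} ms")
    print(f"slow tenant P99:  {p99(slow_tenant):7.1f} ms")  # real pain, invisible in the blend

Both failures point at the same remedy: compute percentiles from pooled or per-segment raw data rather than averaging summaries, and break the number down by host/tenant/endpoint before trusting it.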

I’m curious how others here approach this in practice:

Do you have a mental model for interpreting P99 during incidents?

What charts or breakdowns have actually helped you debug latency issues?

Have you been burned by “good-looking” percentiles that hid real problems?

I wrote up my notes here for reference: https://optyxstack.com/performance/latency-distributions-in-practice-reading-p50-p95-p99-without-fooling-yourself

Would love to hear how people handle this in real systems.

0 comments