Show HN: ChonkLM – Tiny language models running offline in the browser

  • Posted 2 hours ago by bilalba
  • 4 points
https://chonklm.com
I had been looking to try <500M-parameter language models, but I couldn't find an API that serves them anywhere, so I built this Cloudflare-hosted static site that hosts the weights, plus an inference runtime that uses WebGPU to run these models directly in your browser.

These models are only so useful in a multi-turn conversation, but it's still interesting to see what you can pack into a <250 MB model.

I tried ONNX versions of these models first, but they had too many quirks with language models and the tokens per second weren't impressive. Inspired by svenflow/webgpu-gemma, I put Codex and Claude to the task of writing WGSL kernels that run inference on GGUF versions of these models.
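ChonkLM's loader isn't shown here, but any GGUF reader starts the same way: the format opens with a small fixed little-endian header (magic bytes "GGUF", a version, a tensor count, and a metadata key-value count) before the tensors themselves. A minimal sketch of parsing that header, with the function name and returned dict shape being my own choices:

```python
import struct

def parse_gguf_header(buf: bytes) -> dict:
    # GGUF header layout (little-endian), per the ggml GGUF spec:
    #   4-byte magic "GGUF", uint32 version,
    #   uint64 tensor_count, uint64 metadata_kv_count.
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }

# Synthetic header for demonstration only (not a real model file).
header = struct.pack("<4sIQQ", b"GGUF", 3, 201, 24)
print(parse_gguf_header(header))
# → {'version': 3, 'tensor_count': 201, 'metadata_kv_count': 24}
```

The metadata key-value section that follows the header carries the tokenizer and hyperparameters, which is what makes GGUF a convenient single-file target for a browser runtime.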

Once you've loaded the site and a model, it should work offline too, until your browser evicts the model from its cache.
