Show HN: Wordchipper – Rust BPE tokenizer, 9x faster than tiktoken

  • Posted 2 hours ago by antimora
Hey HN,

We're ZSpaceLabs, Burn framework contributors working to make Rust a first-class AI/ML stack.

We just released wordchipper, a Rust-native BPE tokenizer supporting the OpenAI tokenizer family (r50k, cl100k, o200k). On a 64-core machine with the o200k vocab (the GPT-4o / GPT-5 tokenizer), we measured 2.4 GiB/s, about 9.2× faster than tiktoken-rs. Through the Python bindings it is typically 2–4× faster than tiktoken, depending on thread count.

The main design goal was to make the internals easy to swap. The tokenizer is split into two parts: pre-tokenization (lexer) and BPE span encoding. Each part can be replaced independently, which makes it easy to experiment with different combinations of lexer backends and span encoding algorithms.

Right now there are three lexer implementations. One uses fancy-regex and is fully compatible with tiktoken. Another uses regex-automata with a runtime DFA and is about 4–8× faster. The third uses logos with a compile-time DFA and is about 14–21× faster on cl100k and o200k.

Write-up with more details: https://zspacelabs.ai/wordchipper/articles/substitutable/

GitHub: https://github.com/zspacelabs/wordchipper

Happy to hear feedback, especially from people working on tokenization, large-scale inference pipelines, or Rust ML tooling.
