Show HN: Mandarin Word Segmenter with Translation

  • Posted 22 hours ago by routerl
  • 25 points
https://mandobot.netlify.app/
I've built mandoBot, a web app that segments and translates Mandarin Chinese text. This is a Django API (using Django-Ninja and PostgreSQL) and a NextJS front-end (with Typescript and Chakra). For a sample of what this app does, head to https://mandobot.netlify.app/?share_id=e8PZ8KFE5Y. This is my presentation of the first chapter of a classic story from the Republican era of Chinese fiction, Diary of a Madman by Lu Xun. Other chapters are located in the "Reading Room" section of the app.

This app exists because reading Mandarin is very hard for learners (like me), since Mandarin text does not separate words using spaces in the same way Western languages do. But extensive reading is the most effective way to learn vocabulary and grammar. Thus, learning Mandarin by reading requires first memorizing hundreds or thousands of words, before you can even know where one word ends and the next word begins.

I'm solving this problem by allowing users to input Mandarin text, which is then computationally segmented and machine translated by my server, which also adds dictionary definitions for each word and character. The hard part is the segmentation: it turns out that "Chinese Word Segmentation"[0] is the central problem in Chinese Natural Language Processing; no current solutions reach 100% accuracy, whether they're from Stanford[1], Academia Sinica[2], or Tsing Hua University[3]. This includes every LLM currently available.

I could talk about this for hours, but the bottom line is that this app is a way to develop my full-stack skills; the backend should be fast, accurate, secure, well-tested, and well-documented, and the front-end should be pretty, secure, well-tested, responsive, and accessible. I am the sole developer, and I'm open to any comments and suggestions: roberto.loja+hn@gmail.com

Thanks HN!

[0] https://en.wikipedia.org/wiki/Chinese_word-segmented_writing

[1] https://nlp.stanford.edu/software/segmenter.shtml

[2] https://ckip.iis.sinica.edu.tw/project/ws

[3] http://thulac.thunlp.org/

5 comments

    Loading..
    Loading..
    Loading..
    Loading..
    Loading..