Current methodology:
GPT-5-nano forward translation + back-translation
text-embedding-3-small cosine similarity on source vs. back-translated text.
Threshold: ≥0.92 = auto-approved
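For concreteness, here's a stripped-down sketch of the gate (OpenAI Python SDK; prompt wording, batching, retries, and error handling are all simplified relative to what we actually run):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
THRESHOLD = 0.92  # cosine score at/above this auto-approves the string

def translate(text: str, src: str, tgt: str) -> str:
    """One translation hop (forward or back) via the chat endpoint."""
    resp = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{
            "role": "user",
            "content": f"Translate this UI string from {src} to {tgt}. "
                       f"Return only the translation.\n\n{text}",
        }],
    )
    return resp.choices[0].message.content.strip()

def cosine(a, b) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def round_trip_score(source: str) -> tuple[str, float]:
    """Forward-translate, back-translate, score source vs. back-translation."""
    forward = translate(source, "English", "Spanish")
    back = translate(forward, "Spanish", "English")
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=[source, back]
    ).data
    return forward, cosine(emb[0].embedding, emb[1].embedding)

translation, score = round_trip_score("Add Attachment")
auto_approved = score >= THRESHOLD  # otherwise routed to human review
```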
On a recent ~970-string Spanish localization run:
~75% of strings passed automatically
We then had two human translators review the outputs, and both flagged several problematic cases:
"Add Attachment" → Agregar Adjunto
Better: Adjuntar Archivo
"Pay Grades" → Grados de Pago
Better: Escalas salariales
"Sub Unit" → Subunidad
Better: Departamento
All three examples still scored 0.94+ cosine similarity.
Google Translate also back-translates Adjunto more like “Please attach,” which suggests
the issue isn’t just subjective reviewer preference.
Also, we currently pass a note with each XLIFF trans-unit, so the model has proper context for each string.
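The note gets injected into the prompt roughly like this (the note text here is a made-up example; the real pipeline reads the <note> element out of each trans-unit):

```python
def build_prompt(source: str, note: str) -> str:
    """Thread the per-string XLIFF note into the translation prompt."""
    return (
        "Translate this UI string from English to Spanish.\n"
        f"Context note from the trans-unit: {note}\n"
        "Keep UI register: short, imperative, title case where natural.\n"
        "Return only the translation.\n\n"
        f"{source}"
    )

prompt = build_prompt(
    "Add Attachment",
    "Button label on the file-upload panel of the request form.",
)
```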
Questions:
How can we make the pipeline more context-aware, so translations read naturally as UI strings rather than as literal word-for-word renderings?
What routing metrics actually correlate best with human acceptance for UI localization?
Has anyone quantified improvements from using cross-engine back-translation (e.g., OpenAI + Google/DeepL) versus single-engine loops?
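To be concrete about what we mean by cross-engine (sketch only, nothing measured; `translate`, `cosine`, and `client` are from the first snippet, and the deepl package's `translate_text` does the back hop):

```python
import deepl

deepl_client = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

def cross_engine_score(source: str) -> float:
    """OpenAI forward hop, DeepL back hop, same embedding score as before."""
    forward = translate(source, "English", "Spanish")  # OpenAI forward hop
    back = deepl_client.translate_text(forward, target_lang="EN-US").text
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=[source, back]
    ).data
    return cosine(emb[0].embedding, emb[1].embedding)
```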
Would appreciate insights from teams running MT/localization pipelines at scale.