You found a tool. It’s fast, it’s cheap, and it speaks the language you need — Japanese, Cantonese, Arabic, whatever the project calls for. You paste in your script, it reads it back sounding completely sure of itself, and you ship it. The catch is the one thing you can’t do from where you’re sitting: tell whether it’s actually right. If you don’t speak the language, you’re trusting a machine that is, in effect, also guessing — it just guesses in a very confident voice. And confident is not the same as correct.
The numbers make the gap concrete. Mandarin is full of polyphones — characters whose pronunciation changes with context, like 乾 (gān or qián) or 行 (xíng or háng) — and choosing the wrong reading can change what a sentence means. Even the best published models for resolving them top out at around 94% accuracy (Polyphone BERT, 2022). That sounds high until you translate it into practice: roughly one wrong reading in every seventeen polyphonic characters. A single paragraph can contain dozens, so the errors don’t stay isolated — they accumulate. And this is the best-case scenario; researchers building text-to-speech specifically for Taiwanese Mandarin (BreezyVoice, 2025) still describe polyphone disambiguation as an open, unsolved problem.
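To see how that adds up, here is a back-of-the-envelope sketch in Python. It takes the 94.1% figure from the sources below and assumes, purely for illustration, that every polyphonic character has the same chance of being misread and that misreadings are independent of each other. Real errors cluster around hard characters, so treat the output as a rough guide, not a measurement. The function names and the 24-character paragraph are hypothetical.

```python
# Rough arithmetic only. Assumption: each polyphonic character is misread
# independently, with the same probability (a simplification).

ACCURACY = 0.941  # per-character accuracy cited for Polyphone BERT (2022)


def one_in(accuracy: float) -> float:
    """Average number of polyphonic characters per wrong reading."""
    return 1.0 / (1.0 - accuracy)


def expected_errors(accuracy: float, polyphones: int) -> float:
    """Expected number of wrong readings among `polyphones` characters."""
    return (1.0 - accuracy) * polyphones


def p_at_least_one(accuracy: float, polyphones: int) -> float:
    """Chance that at least one reading is wrong, under the assumption above."""
    return 1.0 - accuracy ** polyphones


if __name__ == "__main__":
    n = 24  # hypothetical paragraph containing two dozen polyphonic characters
    print(f"one wrong reading per ~{one_in(ACCURACY):.0f} polyphones")
    print(f"expected wrong readings in {n}: {expected_errors(ACCURACY, n):.2f}")
    print(f"chance of at least one: {p_at_least_one(ACCURACY, n):.0%}")
```

Under those assumptions, a paragraph with two dozen polyphones is more likely than not to contain at least one wrong reading, which is the sense in which the errors accumulate.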
A mispronunciation in a language you can’t hear might feel like a cosmetic detail. It isn’t. CSA Research’s well-known “Can’t Read, Won’t Buy” study, which surveyed 8,709 consumers across 29 countries, found that 76% of people prefer to buy in their own language, and 40% won’t buy from websites in other languages at all. Your audience hears the error you can’t (a brand name said wrong, a polyphone that quietly flips a sentence), and what reads to you as “good enough AI output” reads to them as a company that didn’t care enough to get their language right. The cost isn’t the glitch; it’s the trust you lose in the exact market you paid to enter.
And pronunciation is only one of the things a non-speaker can’t catch. The register can be wrong — formal where it should be warm. The prosody can be subtly off — the rhythm and stress that tell a native listener “a person made this.” The accent can miss entirely, delivering mainland Mandarin when the brief called for Taiwan. A native speaker notices all of it in a single listen. The person who pressed “generate” notices none of it.
One rule came with us when we moved from running a voice studio (which Onyx has done since 2008, with more than 1,500 voice actors) into AI, and it is the one pure-AI tools quietly skip: every Onyx delivery, in any language where accuracy matters, is checked by a native speaker before it reaches the client. Not re-recorded. Verified. A native proofreader confirms the pronunciation of names, brands and numbers, checks that every polyphone is read correctly in context, and makes sure the meaning is intact, the tone matches the brief, and the rhythm sounds natural to someone who actually speaks the language. If it passes, it ships. If it doesn’t, we fix it before the client ever hears it. It isn’t glamorous, but it is the entire difference between “AI-generated” and “ready to broadcast.”
Here is why that matters to you specifically. You did not come to AI voice to become an expert in it — you came because you have an ad, a course, an audiobook to get out the door, fast and on budget, and you would rather not spend the week auditioning tools or re-listening to a language you can’t even parse. That is completely reasonable. But “fast and cheap” quietly hands you a second job you never asked for: quality control in a language you don’t speak. Most tools leave that job sitting on your desk. We take it off.
That human layer is only as strong as the people in it, and we are expanding it — building a Language QA network of fast, reliable native proofreaders across Mandarin, Cantonese, Japanese, Korean, Thai, Spanish and more. If you have a native ear for how a line should really land, or you work as a translator or proofreader and turn jobs around quickly, we’d like to meet you.
So here is what it comes down to. You turned to AI for speed and a price that works — not to become the person who has to verify whether the Cantonese is right. You shouldn’t have to be. Send us the script; what comes back is already checked by someone who speaks it, and ready to use. That is the difference between a tool you have to babysit and a studio you can hand things to: you stay out of the weeds, and the voice still lands. Tell us what you need voiced — we’ll take it from there.
Sources
1. Polyphone BERT: Mandarin polyphone disambiguation tops out around 94.1% accuracy (Interspeech 2022)
2. CSA Research, “Can’t Read, Won’t Buy”: 8,709 consumers across 29 countries (76% prefer their own language; 40% won’t buy without it)
3. BreezyVoice: text-to-speech built for Taiwanese Mandarin; treats polyphone disambiguation as still unsolved (2025)
