Multilingual Speech Recognition
Builder question: How well does AI handle multilingual and accented speech?
BrassTranscripts draws on these six primary sources to characterize multilingual and accented speech performance: what the coverage map looks like across the 102 languages FLEURS covers, where accent-related performance gaps are documented in peer-reviewed research, and which underrepresented populations are most affected by training data gaps.
Contents — 6 entries
🧪 FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech
Conneau, Ma, Khanuja et al. (Google), 2022 · arxiv.org/abs/2205.12446
BrassTranscripts uses FLEURS as its reference for which languages the underlying model covers confidently and which it does not. The 102-language coverage map (approximately 12 hours per language, built on FLoRes-101 MT data) shows that high-resource European and East Asian languages perform substantially better than low-resource African and indigenous languages. If your audio is in one of those underrepresented languages, expect higher WER than headline benchmark numbers suggest. Builders of multilingual products should use FLEURS WER by language as the primary reference for coverage gaps, not marketing claims of "99+ languages."
What it's good for: Language-by-language coverage mapping across 102 languages in a consistent evaluation framework. Where BrassTranscripts draws from it: Setting per-language accuracy expectations and communicating honestly about which languages have stronger versus weaker coverage.
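To make "WER by language" concrete, here is a minimal sketch of how a per-language word error rate could be computed from reference and hypothesis transcripts. It is not the FLEURS evaluation code: the function names and the `(language_code, reference, hypothesis)` tuple layout are assumptions for illustration, and real evaluations normally apply text normalization (casing, punctuation, language-specific tokenization) before scoring.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming Levenshtein distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_language(samples):
    """samples: iterable of (language_code, reference, hypothesis) tuples.

    Errors and reference words are pooled per language before dividing,
    so long utterances weigh more than short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for lang, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[lang] += e
        words[lang] += n
    return {lang: errors[lang] / words[lang] for lang in errors if words[lang] > 0}
```

Whitespace splitting is a simplification that does not suit languages written without spaces between words; those are usually scored with character error rate or a language-specific tokenizer instead.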
🧪 Mozilla Common Voice (v20)
Ardila, Branson, Davis et al. (Mozilla), LREC 2020 · commonvoice.mozilla.org/en/datasets · arXiv: 1912.06670
BrassTranscripts monitors Common Voice as a proxy for how much training data is available in each language. Languages with large Common Voice corpora generally have better coverage in multilingual ASR models: a file in a language with 5 hours of training data will produce materially higher WER than one in a language with 500 hours. Version 20 (December 2024) covers 133 languages with 33,150+ hours total under a CC0 license. The distribution is heavily skewed: English dominates, and a small set of high-resource European languages accounts for most of the data. Builders assessing language support should check the per-language hour count in Common Voice v20 as a first-order predictor of ASR quality for that language.
What it's good for: Estimating training data availability as a predictor of ASR quality for a given language. Where BrassTranscripts draws from it: Explaining per-language accuracy variation to users who ask why one language produces better results than another.
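As a rough sketch of the "check the per-language hour count" step, the snippet below ranks languages by available hours and flags those under a cutoff. The function name, the input dictionary, and the cutoff are all hypothetical: the source gives no threshold, so `min_hours` is something each builder would have to choose, and the hour counts must come from the actual Common Voice release statistics.

```python
def rank_by_hours(hours_by_language: dict[str, float], min_hours: float):
    """Sort languages from least to most data and flag the sparse ones.

    Returns a list of (language, hours, is_below_cutoff) tuples.
    `min_hours` is a builder-chosen cutoff, not a value from Common Voice.
    """
    ranked = sorted(hours_by_language.items(), key=lambda item: item[1])
    return [(lang, hours, hours < min_hours) for lang, hours in ranked]


# Placeholder language names using the 5-hour and 500-hour figures from the
# text above; substitute real per-language hour counts before relying on this.
example = rank_by_hours({"lang_a": 500.0, "lang_b": 5.0}, min_hours=50.0)
```

Hour count is only a first-order signal. It says nothing about speaker diversity, recording quality, or how much of that data a given model actually saw in training.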
📄 Earnings-22: A Practical Benchmark for Accents in the Wild
Del Rio, Ha, McNamara, Miller, Chandra, 2022 · arxiv.org/abs/2203.15591
BrassTranscripts cites Earnings-22 when users ask about accented English. The key finding is that accent of origin produces measurable WER variation even within English. The 119-hour corpus of earnings calls from global companies includes WER broken down by speaker country of origin, documenting significant variation across 4 commercial ASR systems. A file from a South Asian multinational's call will have different accuracy characteristics than one from a US company, even though both are in English. Earnings-22 is also included in both the Open ASR Leaderboard long-form track and the Artificial Analysis composite, making it the single dataset with the widest cross-benchmark coverage for real-world business audio evaluation.
What it's good for: Documenting accent-related WER variation in a real-world business audio context across multiple ASR systems. Where BrassTranscripts draws from it: Setting per-accent accuracy expectations for business audio customers, particularly for multinational and non-US-English speaker contexts.
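A breakdown by speaker country of origin can be sketched as below, assuming each utterance has already been scored and labeled. This is not the Earnings-22 tooling: the function names and the `(group, error_count, reference_word_count)` row layout are illustrative assumptions.

```python
from collections import defaultdict


def pooled_wer_by_group(rows):
    """rows: iterable of (group, error_count, reference_word_count).

    Pools errors and words within each group before dividing, which weights
    each utterance by its length rather than averaging per-utterance WERs.
    """
    totals = defaultdict(lambda: [0, 0])
    for group, errors, n_words in rows:
        totals[group][0] += errors
        totals[group][1] += n_words
    return {g: e / n for g, (e, n) in totals.items() if n > 0}


def gap_to_baseline(wer_by_group: dict[str, float], baseline: str) -> dict[str, float]:
    """Absolute WER difference of every group relative to a chosen baseline group."""
    base = wer_by_group[baseline]
    return {g: w - base for g, w in wer_by_group.items()}
```

Pooling versus averaging is a design choice worth stating alongside any published figure: a group dominated by a few long calls can look quite different under the two methods.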
📄 Evaluating Whisper ASR: Performance Across Diverse Accents and Speaker Traits
Graham, Roll (JASA Express Letters, 2024) · pubs.aip.org
BrassTranscripts treats this as the canonical peer-reviewed evidence for accent performance variation in the AI speech recognition architecture it runs on. Testing 18 English accent variants from the Speech Accent Archive, the study finds that American English outperforms British and Australian accents; native accents outperform non-native accents; English language experience is a statistically significant negative predictor of WER (more experience means lower WER, so less fluent speakers produce higher WER); and female speakers show significantly lower WER than male speakers. The female/male WER finding is counterintuitive but robust: it reflects how acoustic characteristics (fundamental frequency, formant patterns) are represented in the training data rather than any difference in linguistic complexity. Builders reporting accuracy should specify the accent distribution of their evaluation set.
What it's good for: Peer-reviewed quantification of accent and speaker-trait effects on ASR WER across a controlled accent variant set. Where BrassTranscripts draws from it: The gender and accent performance findings that inform how BrassTranscripts communicates accuracy expectations to users with non-American-English audio.
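One way to "specify the accent distribution" of an evaluation set is to report counts and shares per accent label next to the headline WER. The sketch below assumes each evaluation clip already carries an accent label; the function name and the label strings are hypothetical.

```python
from collections import Counter


def accent_distribution(labels):
    """Return {accent: (count, share)} ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {accent: (n, n / total) for accent, n in counts.most_common()}


# Hypothetical labels for a four-clip evaluation set.
summary = accent_distribution(["us", "us", "in", "ng"])
for accent, (n, share) in summary.items():
    print(f"{accent}: {n} clips ({share:.0%})")
```

Publishing this table with the WER lets readers judge whether a result measured mostly on one accent is likely to transfer to their own audio.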
📄 Svarah: Evaluating English ASR Systems on Indian Accents
Javed, Joshi, Nagarajan et al. (AI4Bharat / IIT Madras), Interspeech 2023 · arxiv.org/abs/2305.15760
BrassTranscripts uses Svarah as the reference for Indian-accented English performance — the largest underrepresented English accent population in standard benchmarks. India has approximately 130 million English speakers, yet Indian English is severely underrepresented in LibriSpeech and other standard training corpora. The benchmark covers 9.6 hours from 117 speakers across 65 Indian geographic locations spanning 19 constitutional languages. Users transcribing recordings from Indian business, academic, or government settings should expect measurably higher WER than US-English baseline numbers suggest. Builders deploying ASR systems for Indian-English contexts should evaluate against Svarah, not LibriSpeech.
What it's good for: Documenting the performance gap for Indian-accented English — the single largest underrepresented population in standard ASR benchmarks. Where BrassTranscripts draws from it: Communicating accuracy expectations to customers transcribing Indian-English content, and supporting the broader point that benchmark WER numbers may not reflect performance on specific regional accents.
📄 AfriSpeech-200: Pan-African Accented Speech Dataset
Olatunji, Afonja, Yadavalli, Emezue et al. (TACL 2023) · arxiv.org/abs/2310.00274
BrassTranscripts cites AfriSpeech-200 as evidence that racial bias in ASR systems is documented and measurable, not hypothetical. The 200-hour dataset covers 2,463 speakers across 120 indigenous accents from 13 African countries in both clinical and general domain contexts. The key finding: state-of-the-art diarization systems show 10%+ DER degradation for African-accented long-form speech versus native-accent baselines — the largest single accent-related performance gap in the peer-reviewed literature. Builders whose users include African business, healthcare, or academic contexts should treat this gap as a known limitation when setting accuracy expectations.
What it's good for: Quantifying the diarization performance gap for African-accented English across 120 indigenous accent variants and 13 countries. Where BrassTranscripts draws from it: The largest documented accent-related accuracy gap in the literature — supporting honest communication about ASR system limitations for African-English-speaker contexts.
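A figure such as "10%+ degradation" can be read as an absolute gap (percentage points) or as a relative increase over the baseline, and the two readings differ a lot. The sketch below shows both calculations. The function names are illustrative, the example rates are made up, and the text above does not say which reading applies, so check the paper before quoting the number either way.

```python
def absolute_gap(baseline_rate: float, observed_rate: float) -> float:
    """Difference between two error rates, in the same units as the inputs."""
    return observed_rate - baseline_rate


def relative_gap(baseline_rate: float, observed_rate: float) -> float:
    """Increase over the baseline, expressed as a fraction of the baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (observed_rate - baseline_rate) / baseline_rate


# Made-up rates purely to show the difference between the two readings:
# a baseline error rate of 0.10 against an observed rate of 0.20.
abs_points = absolute_gap(0.10, 0.20)   # 10 percentage points
rel_change = relative_gap(0.10, 0.20)   # a 100% relative increase
```

When setting accuracy expectations for users, stating the baseline rate alongside the gap removes the ambiguity.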
Frequently Asked Questions
Does AI transcription accuracy vary by language?
Yes, substantially. High-resource languages (English, Spanish, French, Mandarin, German, Japanese) with large training corpora produce lower WER than low-resource languages where training data is scarce. The FLEURS benchmark covers 102 languages and reveals systematic gaps: European and East Asian languages consistently outperform African and indigenous languages. Within English, accent of origin produces measurable WER variation — the Earnings-22 benchmark documents this across speaker countries of origin in a business audio context.
What does "accented English" mean for AI transcription accuracy?
Accent effects on ASR accuracy are documented and statistically significant. A peer-reviewed JASA Express Letters study (Graham and Roll, 2024) tested 18 English accent variants from the Speech Accent Archive and found that American English outperforms British and Australian accents, native accents outperform non-native accents, and — counterintuitively — female speakers show significantly lower WER than male speakers. English language experience (fluency) is a statistically significant negative predictor of WER: less-fluent English speakers produce higher WER even when their speech is intelligible to human listeners.
What is FLEURS and why does it matter for language coverage?
FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) covers 102 languages with approximately 12 hours of audio per language. It was created by Google to expose multilingual ASR coverage gaps by providing a consistent evaluation methodology across a broad language set. The 102-language coverage map reveals that model performance is not uniform: high-resource languages covered by large training corpora produce substantially better WER than low-resource languages. FLEURS is the standard reference for understanding where any multilingual ASR system has confident coverage versus uncertain coverage.
Why does Common Voice dataset size predict ASR accuracy?
Common Voice provides a proxy for training data availability: languages with large Common Voice corpora also tend to have large corpora in other sources used to train multilingual ASR models. A language with 5 hours of training data will produce materially higher WER than one with 500 hours — the model has fewer examples from which to learn the phonological patterns of that language. Common Voice v20 covers 133 languages with 33,150+ hours total, but the distribution is heavily skewed toward English and a small number of other high-resource languages.
What does the AfriSpeech-200 finding mean for AI transcription equity?
AfriSpeech-200 documents that African-accented English produces 10%+ diarization degradation compared to native-accent baselines for state-of-the-art systems. This is the largest single accent-related performance gap in the peer-reviewed literature. It means that users transcribing content from African business, academic, or government settings — or any setting with speakers from African countries using English — should expect measurably higher error rates than users transcribing American or British English content. The gap is not hypothetical: it is documented, measured, and reproducible.
What is Svarah and how does it relate to Indian-accented English transcription?
Svarah is a 9.6-hour benchmark covering 117 speakers from 65 Indian geographic locations spanning 19 constitutional languages, published by AI4Bharat and IIT Madras at Interspeech 2023. India has approximately 130 million English speakers, yet Indian English is severely underrepresented in LibriSpeech and other standard benchmarks. Svarah documents that ASR systems trained primarily on American English show higher WER on Indian-accented English than their LibriSpeech numbers would predict. Users transcribing recordings from Indian business, academic, or government settings should expect measurably higher WER than US-English baseline numbers suggest.