Multilingual Speech Recognition

Builder question: How well does AI handle multilingual and accented speech?

BrassTranscripts draws from these eight sources to characterize multilingual and accented speech performance — what the coverage map looks like across 102+ languages, where accent-related performance gaps are documented in peer-reviewed research, which underrepresented populations are most affected by training data gaps, how far open models now extend the language frontier, and where real paid demand actually concentrates across languages.

Contents — 8 entries

🧪 FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Conneau, Ma, Khanuja et al. (Google), 2022 · arxiv.org/abs/2205.12446

BrassTranscripts uses FLEURS as its reference for understanding which languages the underlying model covers confidently and which it does not. The 102-language coverage map (approximately 12 hours per language, built on FLoRes-101 MT data) reveals that high-resource European and East Asian languages perform substantially better than low-resource African and indigenous languages. If your audio is in one of those underrepresented languages, expect higher WER than headline benchmark figures suggest. Builders of multilingual products should use FLEURS WER by language, not marketing claims of "99+ languages," as the primary reference for coverage gaps.

What it's good for: Language-by-language coverage mapping across 102 languages in a consistent evaluation framework. Where BrassTranscripts draws from it: Setting per-language accuracy expectations and communicating honestly about which languages have stronger versus weaker coverage.
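The per-language WER comparison described above can be sketched in a few lines. This is a minimal illustration, not the FLEURS evaluation code: the function names `word_errors` and `wer_by_language` are hypothetical, and it assumes whitespace tokenization with text normalization already applied.

```python
from collections import defaultdict

def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)

def wer_by_language(samples):
    """samples: iterable of (language_code, reference, hypothesis)."""
    errors, words = defaultdict(int), defaultdict(int)
    for lang, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[lang] += e
        words[lang] += n
    # Pool errors over all utterances of a language before dividing.
    return {lang: errors[lang] / words[lang] for lang in errors if words[lang]}
```

Whitespace splitting is an assumption that fails for languages written without spaces between words; character error rate is the usual substitute there.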

🧪 Mozilla Common Voice (v20)

Ardila, Branson, Davis et al. (Mozilla), LREC 2020 · commonvoice.mozilla.org/en/datasets · arXiv: 1912.06670

BrassTranscripts monitors Common Voice as a proxy for per-language training data availability. Languages with large Common Voice corpora generally have better coverage in multilingual ASR models: a file in a language with 5 hours of training data will produce materially higher WER than one in a language with 500 hours. Version 20 (December 2024) covers 133 languages with 33,150+ hours total under a CC0 license. The distribution is heavily skewed: English dominates, and a small set of high-resource European languages accounts for most of the data. Builders assessing language support should check the per-language hour count in Common Voice v20 as a first-order predictor of ASR quality for that language.

What it's good for: Estimating training data availability as a predictor of ASR quality for a given language. Where BrassTranscripts draws from it: Explaining per-language accuracy variation to users who ask why one language produces better results than another.
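The hour-count check suggested above can be sketched as follows. The function names and the figures in `example_hours` are placeholders invented for illustration, not real Common Voice counts, and any floor passed to `below_threshold` is a judgment call rather than a published cutoff.

```python
def top_k_share(hours_by_language: dict[str, float], k: int) -> float:
    """Fraction of total hours held by the k largest languages."""
    total = sum(hours_by_language.values())
    if total == 0:
        return 0.0
    top = sorted(hours_by_language.values(), reverse=True)[:k]
    return sum(top) / total

def below_threshold(hours_by_language: dict[str, float], min_hours: float):
    """Languages whose hour count falls under a chosen floor, smallest first."""
    low = [(lang, h) for lang, h in hours_by_language.items() if h < min_hours]
    return sorted(low, key=lambda item: item[1])

# Placeholder figures for illustration only, not real Common Voice counts.
example_hours = {"lang_a": 3000.0, "lang_b": 500.0, "lang_c": 45.0, "lang_d": 5.0}

skew = top_k_share(example_hours, 1)           # share held by the largest language
at_risk = below_threshold(example_hours, 50)   # languages under a 50-hour floor
```

Replacing `example_hours` with the per-language totals from the dataset's own statistics would give a quick view of both the skew and the languages most likely to show elevated WER.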

📄 Earnings-22: A Practical Benchmark for Accents in the Wild

Del Rio, Ha, McNamara, Miller, Chandra, 2022 · arxiv.org/abs/2203.15591

BrassTranscripts cites Earnings-22 when users ask about accented English. The key finding is that accent of origin produces measurable WER variation even within English. The 119-hour corpus of earnings calls from global companies includes WER broken down by speaker country of origin, documenting significant variation across 4 commercial ASR systems. A file from a South Asian multinational's call will have different accuracy characteristics than one from a US company's call, even though both are in English. Earnings-22 is also included in both the Open ASR Leaderboard long-form track and the Artificial Analysis composite, making it the single dataset with the widest cross-benchmark coverage for real-world business audio evaluation.

What it's good for: Documenting accent-related WER variation in a real-world business audio context across multiple ASR systems. Where BrassTranscripts draws from it: Setting per-accent accuracy expectations for business audio customers, particularly for multinational and non-US-English speaker contexts.

📄 Evaluating Whisper ASR: Performance Across Diverse Accents and Speaker Traits

Graham, Roll, JASA Express Letters, 2024 · pubs.aip.org

BrassTranscripts treats this as the canonical peer-reviewed evidence for accent performance variation in the AI speech recognition architecture it runs on. Testing 18 English accent variants from the Speech Accent Archive, the study finds that American English outperforms British and Australian accents; that native accents outperform non-native accents; that English language experience is a statistically significant negative predictor of WER (speakers with less experience produce higher WER); and that female speakers show significantly lower WER than male speakers. The female/male finding is counterintuitive but robust: it reflects acoustic characteristics (fundamental frequency, formant patterns) and the composition of the training data rather than linguistic complexity. Builders reporting accuracy should specify the accent distribution of their evaluation set.

What it's good for: Peer-reviewed quantification of accent and speaker-trait effects on ASR WER across a controlled accent variant set. Where BrassTranscripts draws from it: The gender and accent performance findings that inform how BrassTranscripts communicates accuracy expectations to users with non-American-English audio.
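Reporting the accent distribution alongside accuracy, as recommended above, can be sketched like this. The function name `accent_report` and the record layout are hypothetical; the sketch assumes per-utterance word error counts have already been computed by whatever WER tool is in use.

```python
from collections import defaultdict

def accent_report(records):
    """records: iterable of (accent_label, word_errors, reference_words).

    Returns (per-accent report, pooled WER). Each per-accent entry holds the
    accent's share of reference words and its own WER.
    """
    errs, words = defaultdict(int), defaultdict(int)
    for accent, e, n in records:
        errs[accent] += e
        words[accent] += n
    total_words = sum(words.values())
    report = {
        accent: {
            "share": words[accent] / total_words,
            "wer": errs[accent] / words[accent],
        }
        for accent in words
        if words[accent]  # skip groups with no reference words
    }
    pooled_wer = sum(errs.values()) / total_words if total_words else 0.0
    return report, pooled_wer

# Made-up counts for illustration: one accent dominates the evaluation set.
report, pooled = accent_report([("accent_a", 40, 900), ("accent_b", 20, 100)])
```

With these made-up counts the pooled WER is 6%, while the minority accent sits at 20%. A single headline number hides that gap unless the accent shares are published next to it.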

📄 Svarah: Evaluating English ASR Systems on Indian Accents

Javed, Joshi, Nagarajan et al. (AI4Bharat / IIT Madras), Interspeech 2023 · arxiv.org/abs/2305.15760

BrassTranscripts uses Svarah as the reference for Indian-accented English performance — the largest underrepresented English accent population in standard benchmarks. India has approximately 130 million English speakers, yet Indian English is severely underrepresented in LibriSpeech and other standard training corpora. The benchmark covers 9.6 hours from 117 speakers across 65 Indian geographic locations spanning 19 constitutional languages. Users transcribing recordings from Indian business, academic, or government settings should expect measurably higher WER than US-English baseline numbers suggest. Builders deploying ASR systems for Indian-English contexts should evaluate against Svarah, not LibriSpeech.

What it's good for: Documenting the performance gap for Indian-accented English — the single largest underrepresented population in standard ASR benchmarks. Where BrassTranscripts draws from it: Communicating accuracy expectations to customers transcribing Indian-English content, and supporting the broader point that benchmark WER numbers may not reflect performance on specific regional accents.

📄 AfriSpeech-200: Pan-African Accented Speech Dataset

Olatunji, Afonja, Yadavalli, Emezue et al. (TACL 2023) · arxiv.org/abs/2310.00274

BrassTranscripts cites AfriSpeech-200 as evidence that racial bias in ASR systems is documented and measurable, not hypothetical. The 200-hour dataset covers 2,463 speakers across 120 indigenous accents from 13 African countries in both clinical and general domain contexts. The key finding: state-of-the-art diarization systems show 10%+ DER degradation for African-accented long-form speech versus native-accent baselines — the largest single accent-related performance gap in the peer-reviewed literature. Builders whose users include African business, healthcare, or academic contexts should treat this gap as a known limitation when setting accuracy expectations.

What it's good for: Quantifying the diarization performance gap for African-accented English across 120 indigenous accent variants and 13 countries. Where BrassTranscripts draws from it: The largest documented accent-related accuracy gap in the literature — supporting honest communication about ASR system limitations for African-English-speaker contexts.
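For readers unfamiliar with DER, the sketch below uses the standard definition (missed speech plus false alarm plus speaker confusion, divided by total reference speech time). The summary above does not say whether the "10%+" figure is an absolute or a relative change, so the helper returns both; the function names and sample numbers are illustrative assumptions, not values from the paper.

```python
def der(missed: float, false_alarm: float, confusion: float, total_speech: float) -> float:
    """Diarization error rate from its three components, all in seconds."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech

def degradation(baseline_der: float, observed_der: float) -> tuple[float, float]:
    """Return (absolute change, relative change) between two DER values."""
    absolute = observed_der - baseline_der
    relative = absolute / baseline_der
    return absolute, relative

# Illustrative numbers only: 100 s of reference speech in each condition.
baseline = der(missed=3.0, false_alarm=2.0, confusion=5.0, total_speech=100.0)
observed = der(missed=4.0, false_alarm=2.0, confusion=6.0, total_speech=100.0)
absolute_change, relative_change = degradation(baseline, observed)
```

In this example a move from 10% to 12% DER is a 2-point absolute change but a 20% relative change, which is why it matters which of the two a reported degradation refers to.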

📄 Scaling Speech Technology to 1,000+ Languages (MMS)

Pratap, Tjandra, Shi, Tomasello, Babu et al. (Meta AI / FAIR), 2023 · arxiv.org/abs/2305.13516

BrassTranscripts monitors the Massively Multilingual Speech project as the frontier of low-resource language coverage. The single MMS model transcribes 1,107 languages and identifies 4,017, extending supported-language counts 10-40x beyond Whisper-class models. For the long tail of languages that standard models return as unsupported, this is the open-weights reference point. Builders fielding non-English requests outside the high-resource set should check MMS coverage before telling a user their language cannot be transcribed.

What it's good for: Open-weights ASR and language identification covering 1,107 and 4,017 languages respectively — far beyond the Whisper-class language set. Where BrassTranscripts draws from it: The reference for what coverage is achievable in the long tail, and honest framing of where high-resource models stop.

📊 Global AI Transcription Trends 2026: Languages by Demand

BrassTranscripts — first-party production data, May 2026 · brasstranscripts.com/blog

BrassTranscripts draws on this to show which languages carry real production demand rather than theoretical coverage. Across 515 paid jobs and 252 hours of audio in the 180 days ending May 2026, BrassTranscripts processed 30 distinct languages: English drove 63% of jobs, Portuguese led non-English volume at 85 jobs, and Norwegian Nynorsk ranked third by total hours despite only 16 jobs — averaging 96.9 minutes and 8.19 speakers per file, an institutional usage signal. Builders deciding which languages to localize for should weight demonstrated paid demand over raw speaker-population counts.

What it's good for: A first-party demand map showing non-English jobs at 37% of volume and small-population languages producing outsized professional demand. Where BrassTranscripts draws from it: Its multilingual roadmap priorities and its honest framing that "99+ languages" means real but uneven production demand, not uniform quality.
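The demand figures above can be cross-checked with simple arithmetic. The sketch uses only the numbers quoted in this entry; because the 96.9-minute average is itself rounded, the derived hours and shares are approximate.

```python
TOTAL_JOBS = 515            # paid jobs in the 180-day window
TOTAL_HOURS = 252           # hours of audio in the same window
PORTUGUESE_JOBS = 85
NYNORSK_JOBS = 16
NYNORSK_AVG_MINUTES = 96.9  # reported per-file average, already rounded

# Total Nynorsk audio implied by job count times average length.
nynorsk_hours = NYNORSK_JOBS * NYNORSK_AVG_MINUTES / 60

# Share of jobs versus share of hours for the same language.
nynorsk_job_share = NYNORSK_JOBS / TOTAL_JOBS
nynorsk_hour_share = nynorsk_hours / TOTAL_HOURS
portuguese_job_share = PORTUGUESE_JOBS / TOTAL_JOBS

print(f"Nynorsk: {nynorsk_hours:.1f} h, "
      f"{nynorsk_job_share:.1%} of jobs, {nynorsk_hour_share:.1%} of hours")
print(f"Portuguese: {portuguese_job_share:.1%} of jobs")
```

Norwegian Nynorsk works out to roughly 3% of jobs but about 10% of total hours, which is the pattern the entry describes as an institutional usage signal.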

Frequently Asked Questions

Does AI transcription accuracy vary by language?

Yes, substantially. High-resource languages (English, Spanish, French, Mandarin, German, Japanese) with large training corpora produce lower WER than low-resource languages where training data is scarce. The FLEURS benchmark covers 102 languages and reveals systematic gaps: European and East Asian languages consistently outperform African and indigenous languages. Within English, accent of origin produces measurable WER variation — the Earnings-22 benchmark documents this across speaker countries of origin in a business audio context.

What does "accented English" mean for AI transcription accuracy?

Accent effects on ASR accuracy are documented and statistically significant. A peer-reviewed JASA Express Letters study (Graham and Roll, 2024) tested 18 English accent variants from the Speech Accent Archive and found that American English outperforms British and Australian accents, native accents outperform non-native accents, and — counterintuitively — female speakers show significantly lower WER than male speakers. English language experience (fluency) is a statistically significant negative predictor of WER: less-fluent English speakers produce higher WER even when their speech is intelligible to human listeners.

What is FLEURS and why does it matter for language coverage?

FLEURS (Few-shot Learning Evaluation of Universal Representations of Speech) covers 102 languages with approximately 12 hours of audio per language. It was created by Google to expose multilingual ASR coverage gaps by providing a consistent evaluation methodology across a broad language set. The 102-language coverage map reveals that model performance is not uniform: high-resource languages covered by large training corpora produce substantially better WER than low-resource languages. FLEURS is the standard reference for understanding where any multilingual ASR system has confident coverage versus uncertain coverage.

Why does Common Voice dataset size predict ASR accuracy?

Common Voice provides a proxy for training data availability: languages with large Common Voice corpora also tend to have large corpora in other sources used to train multilingual ASR models. A language with 5 hours of training data will produce materially higher WER than one with 500 hours — the model has fewer examples from which to learn the phonological patterns of that language. Common Voice v20 covers 133 languages with 33,150+ hours total, but the distribution is heavily skewed toward English and a small number of other high-resource languages.

What does the AfriSpeech-200 finding mean for AI transcription equity?

AfriSpeech-200 documents that African-accented English produces 10%+ diarization degradation compared to native-accent baselines for state-of-the-art systems. This is the largest single accent-related performance gap in the peer-reviewed literature. It means that users transcribing content from African business, academic, or government settings — or any setting with speakers from African countries using English — should expect measurably higher error rates than users transcribing American or British English content. The gap is not hypothetical: it is documented, measured, and reproducible.

What is Svarah and how does it relate to Indian-accented English transcription?

Svarah is a 9.6-hour benchmark covering 117 speakers from 65 Indian geographic locations spanning 19 constitutional languages, published by AI4Bharat and IIT Madras at Interspeech 2023. India has approximately 130 million English speakers, yet Indian English is severely underrepresented in LibriSpeech and other standard corpora. Svarah documents that ASR systems trained primarily on American English show higher WER on Indian-accented English than their LibriSpeech numbers would predict. Users transcribing recordings from Indian business, academic, or government settings should expect measurably higher WER than US-English baseline numbers suggest.

Explore Related Research

Transcription Accuracy → Whisper architecture, long-form inference paper, Open ASR Leaderboard

Speaker Diarization → Neural diarization, overlap-aware diarization, ETH Zurich benchmark

Audio Quality → Reverberation benchmarks, SNR thresholds, CHiME-7

ASR Benchmarks → MLPerf v5.1, NIST SCTK, benchmark methodology
