AI Transcription Accuracy Research

Builder question: How accurate is AI transcription for real-world audio?

BrassTranscripts draws from these eight sources to characterize AI transcription accuracy honestly: what the best available benchmarks show, where those benchmarks diverge from production conditions, what the architecture properties mean for real-world audio quality, and how distillation trades speed against accuracy.

Contents — 8 entries

📄 Robust Speech Recognition via Large-Scale Weak Supervision

Radford et al. (OpenAI), ICML 2023 · arxiv.org/abs/2212.04356

BrassTranscripts runs on the architecture this paper introduced. The key insight is that training on 680,000 hours of weakly supervised web audio produces robustness properties that clean-speech WER numbers cannot capture — the model substantially outperforms fine-tuned alternatives on out-of-distribution real-world datasets even when its LibriSpeech WER appears similar. Builders evaluating ASR systems should test on their own audio, not delegate that decision to clean-speech benchmarks.

What it's good for: Understanding why Whisper-class models generalize across audio conditions when fine-tuned models do not. Where BrassTranscripts draws from it: Justification for using a large pretrained architecture over domain-specific fine-tuning for general-purpose transcription use cases.

📄 WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Bain, Huh, Han, Zisserman (Oxford VGG), 2023 · arxiv.org/abs/2303.00747

BrassTranscripts uses the WhisperX architecture as its inference engine. VAD pre-segmentation is what makes long-form transcription (1h+ audio) reliable: without it, standard processing decodes consecutive 30-second windows in sequence, and errors at window boundaries can carry forward as repetition and hallucination artifacts. WhisperX's VAD Cut & Merge strategy allows segments to be transcribed in batches, delivering approximately 12x faster inference than standard processing while improving accuracy on long-form audio. Builders processing recordings over 15 minutes should prioritize VAD-segmented inference pipelines over naive sliding-window approaches.

What it's good for: Understanding the long-form inference architecture that makes 1h+ recordings tractable. Where BrassTranscripts draws from it: The inference pipeline design, word-level timestamp alignment, and long-form accuracy properties that distinguish production-grade transcription from naive API calls.

🧪 Open ASR Leaderboard

Srivastav, Zheng, Bezzam et al. (Hugging Face), 2025 · huggingface.co (Open ASR Leaderboard) · arXiv: 2510.06961

BrassTranscripts uses this leaderboard to track where its inference engine stands against alternatives. The long-form track (Earnings-21/22) is the most relevant column — not the LibriSpeech column — because it reflects actual production conditions. The leaderboard evaluates 86 ASR systems across 12 datasets covering English short-form, long-form, and multilingual tracks. Builders comparing ASR systems should sort by the Earnings-22 column rather than the aggregate score, which is dominated by LibriSpeech performance.

What it's good for: Reproducible cross-system WER comparison across short-form, long-form, and multilingual tracks. Where BrassTranscripts draws from it: Long-form track rankings as the primary accuracy reference for production audio conditions.

📊 Artificial Analysis Speech-to-Text Model Comparison

Artificial Analysis (independent), continuously updated 2024–2026 · artificialanalysis.ai/speech-to-text

BrassTranscripts monitors this benchmark for price-performance changes across the 49 systems it tracks — particularly the hosting-provider comparison for the same underlying model weights, which reveals that latency and cost vary dramatically (up to 10x) across providers running identical models. The composite weighting (50% AA-AgentTalk conversational audio, 25% VoxPopuli, 25% Earnings-22) makes it more predictive of real-world customer files than academic benchmarks alone. Builders evaluating cost efficiency should consult this benchmark before selecting an inference provider.

What it's good for: Comparing price, latency, and WER across 49 STT systems simultaneously, including provider-level variation for the same model. Where BrassTranscripts draws from it: Infrastructure cost-efficiency analysis and ongoing model landscape monitoring.

🧪 LibriSpeech

Panayotov, Chen, Povey, Khudanpur (ICASSP 2015) · openslr.org/12

BrassTranscripts treats LibriSpeech WER as a floor, not a ceiling. When a vendor cites an accuracy percentage, the first question to ask is whether that number came from LibriSpeech test-clean. It is the most favorable benchmark condition available (carefully read audiobook speech from LibriVox volunteers, clean recordings, predominantly US English, no overlap) and is not representative of real-world customer audio. The approximately 1,000-hour corpus of audiobook recordings remains the standard reference point for historical comparison but is a poor predictor of performance on meeting recordings, depositions, interviews, or any uncontrolled environment.

What it's good for: Historical baseline comparison across the ASR literature — almost every published WER number traces to LibriSpeech. Where BrassTranscripts draws from it: Explaining why lab accuracy numbers should not be the primary selection criterion for production transcription services.

📄 Distil-Whisper: Knowledge Distillation via Large-Scale Pseudo-Labelling

Gandhi, von Platen, Rush (Hugging Face / Cornell), 2023 · arxiv.org/abs/2311.00430

BrassTranscripts draws on Distil-Whisper to explain the speed-accuracy trade-off in production transcription. The distilled model runs 5.8x faster with 51% fewer parameters while staying within 1% WER of full Whisper on out-of-distribution audio. It can also serve as the assistant model for speculative decoding with full Whisper, which gives a 2x speed-up while producing the same outputs as the original model. The takeaway for builders is that most accuracy ceilings are set by audio conditions, not model size, so a smaller distilled model is often the right call when latency or GPU cost is the binding constraint. A faster tier need not mean a meaningfully worse transcript on clear audio.

What it's good for: Quantifying how far distillation can cut latency and cost while holding WER within 1% on out-of-distribution audio. Where BrassTranscripts draws from it: The evidence that speed and accuracy are not strictly traded one-for-one, informing infrastructure and model-tier decisions.

📄 Conformer: Convolution-augmented Transformer for Speech Recognition

Gulati et al. (Google), 2020 · arxiv.org/abs/2005.08100

BrassTranscripts treats the Conformer architecture as the methodological reason modern speech recognition holds up on real-world audio, not only clean studio recordings. By combining convolution with self-attention, it captures local acoustic detail and long-range context together, reaching 1.9%/3.9% word error rate on LibriSpeech test-clean and test-other with a language model, and 2.1%/4.3% without one. For builders, this explains why current engines stay accurate on conversational and accented speech where attention-only models slipped, and why the achievable accuracy floor is set by architecture, not training-data volume alone.

What it's good for: Understanding why convolution-augmented encoders beat attention-only models on noisy, conversational audio. Where BrassTranscripts draws from it: The architectural basis for accuracy claims about real-world, non-studio recordings.

📊 AI Transcription Accuracy Claims: A BrassTranscripts Investigation

BrassTranscripts — first-party investigation, November 2025 · brasstranscripts.com/blog

BrassTranscripts draws on this to ground every accuracy claim it publishes in documented evidence rather than marketing figures. The investigation traced the widely repeated "98% accuracy" claim to a single MLCommons LibriSpeech result (97.93% on clean audiobook reads) and compiled peer-reviewed studies showing real-world accuracy spans roughly 33% to 97.9% — a 65-point range driven by audio quality, accent, speaker demographics, and language. Builders evaluating any vendor's accuracy percentage should ask which condition produced it before treating it as representative.

What it's good for: A sourced, condition-by-condition accuracy range — phone calls 46–57%, accented speech 71–77%, clean audiobooks 97%+ — instead of one context-free number. Where BrassTranscripts draws from it: The factual basis for its 30-word preview model and its refusal to publish unqualified accuracy percentages.
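To make the condition-by-condition range concrete, the sketch below converts the accuracy percentages quoted in this entry into expected word errors per 100 reference words. It assumes accuracy is reported as 100 minus WER, which is a common convention but not a universal one; the function name `errors_per_100_words` and the `CONDITIONS` table are illustrative and not part of the investigation.

```python
def errors_per_100_words(accuracy_pct: float) -> float:
    """Convert an accuracy percentage into word errors per 100 reference words.

    Assumption: accuracy is defined as 100 - WER. Vendors do not always say so.
    """
    if not 0.0 <= accuracy_pct <= 100.0:
        raise ValueError("accuracy must be between 0 and 100")
    return round(100.0 - accuracy_pct, 2)


# Accuracy figures quoted in the entry above, as (low, high) percentages.
CONDITIONS = {
    "phone calls": (46.0, 57.0),
    "accented speech": (71.0, 77.0),
    "clean audiobooks (MLCommons LibriSpeech result)": (97.93, 97.93),
}

for name, (low, high) in CONDITIONS.items():
    worst = errors_per_100_words(low)   # lowest accuracy gives the most errors
    best = errors_per_100_words(high)   # highest accuracy gives the fewest
    print(f"{name}: {best} to {worst} errors per 100 words")
```

Read this way, the 97.93% result corresponds to about 2 errors per 100 words, while the phone-call range corresponds to 43 to 54 errors per 100 words.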

Frequently Asked Questions

What does WER (word error rate) actually measure?

WER counts the minimum number of word-level edits (substitutions, deletions, insertions) needed to transform a hypothesis transcript into the reference transcript, divided by the total number of words in the reference. A 5% WER means roughly 1 error per 20 words — but the distribution matters: a single wrong word in a name or number can invalidate an entire sentence while contributing only one error to the WER count.
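The definition above can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration, not a replacement for an evaluation toolkit: it splits on whitespace only and applies no text normalization (casing, punctuation, number formatting), which real evaluations usually do. The function name `wer` and the example sentences are made up for illustration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: minimum word edits to match the reference, over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "please wire four thousand dollars to alex morgan by friday"
hypothesis = "please wire forty thousand dollars to alex morgan by friday"
print(wer(reference, hypothesis))  # one substitution in ten words
```

The example shows the distribution problem in miniature: a single substituted word yields a WER of 10%, yet the error lands on the amount, which is the one word the reader cannot afford to get wrong.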

Why is LibriSpeech WER misleading for real-world audio?

LibriSpeech is built from audiobook recordings: carefully read, scripted speech from LibriVox volunteer narrators, little background noise, predominantly US English accents, and no overlapping speech. Real-world customer audio (meetings, interviews, phone calls, depositions) typically violates several of these conditions at once. WER on LibriSpeech test-clean is the most favorable benchmark condition available; the same model will produce substantially higher WER on actual customer files.

What is the Open ASR Leaderboard long-form track?

The Open ASR Leaderboard (Hugging Face) evaluates ASR systems across multiple tracks. The long-form track covers Earnings-21, Earnings-22, and TED-LIUM — datasets of 20-minute-to-2-hour recordings from real business and conference settings. This track is more predictive of production performance than the short-form LibriSpeech columns because it reflects the file lengths and acoustic conditions that professional transcription customers actually upload.

How does VAD pre-segmentation affect long-form accuracy?

Standard Whisper transcribes long audio by sliding a 30-second window across the recording and decoding each window in sequence, conditioned on the previous output. On long recordings, errors at window boundaries can carry forward, producing repetition artifacts, missed speech, or hallucinated text. VAD (voice activity detection) pre-segmentation cuts the audio at natural silence boundaries before feeding segments to the model, which avoids most of these sliding-window artifacts and produces substantially more accurate output on recordings over 15 minutes.
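The segmentation idea can be sketched in a few lines. The code below is a simplified illustration in the spirit of cut-and-merge chunking, not WhisperX's implementation: it assumes a VAD has already produced speech regions as (start, end) times in seconds, it cuts over-long regions into equal parts rather than at low voice-activity points, and the name `cut_and_merge` and the 30-second default are assumptions made for this example.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def cut_and_merge(speech: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Turn VAD speech regions into chunks no longer than max_len seconds."""
    # Cut: split any speech region longer than max_len into equal parts.
    pieces: List[Segment] = []
    for start, end in speech:
        duration = end - start
        n_parts = max(1, math.ceil(duration / max_len))
        step = duration / n_parts
        for i in range(n_parts):
            pieces.append((start + i * step, start + (i + 1) * step))

    # Merge: greedily join neighbouring pieces while the combined span
    # (including the silence between them) still fits in max_len.
    chunks: List[Segment] = []
    for start, end in pieces:
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


vad_regions = [(0.0, 4.0), (5.0, 12.0), (14.0, 29.0), (31.0, 40.0)]
print(cut_and_merge(vad_regions))
```

Because every chunk starts and ends at a speech boundary and fits within the model's window, chunks can be transcribed independently and in batches, which is where the speed-up in a VAD-segmented pipeline comes from.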

What does Artificial Analysis measure that the Open ASR Leaderboard does not?

Artificial Analysis adds latency and cost dimensions to the accuracy evaluation. Its composite WER benchmark covers 49 systems and weights toward conversational audio (50% AA-AgentTalk, 25% VoxPopuli, 25% Earnings-22). Critically, it evaluates the same underlying model weights deployed by different hosting providers — revealing that latency and price vary up to 10x across providers running identical models, something no pure-accuracy benchmark captures.
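The weighting can be illustrated with a short sketch. It assumes the composite is a plain weighted average of per-dataset WER using the percentages quoted above; Artificial Analysis may aggregate differently, and the per-dataset WER values in the example are placeholders, not published results.

```python
from typing import Dict

# Weights as quoted above: 50% AA-AgentTalk, 25% VoxPopuli, 25% Earnings-22.
WEIGHTS: Dict[str, float] = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli": 0.25,
    "Earnings-22": 0.25,
}


def composite_wer(per_dataset_wer: Dict[str, float]) -> float:
    """Weighted average of per-dataset WER (in percent), assuming simple linear weighting."""
    missing = WEIGHTS.keys() - per_dataset_wer.keys()
    if missing:
        raise KeyError(f"missing WER for: {sorted(missing)}")
    return sum(WEIGHTS[name] * per_dataset_wer[name] for name in WEIGHTS)


# Placeholder WER values in percent, chosen only to show the arithmetic.
example = {"AA-AgentTalk": 8.0, "VoxPopuli": 6.0, "Earnings-22": 12.0}
print(composite_wer(example))
```

With half the weight on conversational audio, a system that does well on read speech but poorly on AA-AgentTalk is penalized more heavily than it would be under an unweighted mean.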

Explore Related Research

Speaker Diarization → Neural diarization, overlap-aware diarization, ETH Zurich benchmark

Multilingual Speech → FLEURS, Common Voice, accent peer review

Audio Quality → Reverberation benchmarks, SNR thresholds, CHiME-7

ASR Benchmarks → MLPerf v5.1, NIST SCTK, benchmark methodology
