AI Transcription Accuracy Research
Builder question: How accurate is AI transcription for real-world audio?
BrassTranscripts draws from these five primary sources to characterize AI transcription accuracy honestly — what the best available benchmarks show, where those benchmarks diverge from production conditions, and what the architecture properties mean for real-world audio quality.
Contents — 5 entries
📄 Robust Speech Recognition via Large-Scale Weak Supervision
Radford et al. (OpenAI), ICML 2023 · arxiv.org/abs/2212.04356
BrassTranscripts runs on the architecture this paper introduced. The key insight is that training on 680,000 hours of weakly supervised web audio produces robustness properties that clean-speech WER numbers cannot capture — the model substantially outperforms fine-tuned alternatives on out-of-distribution real-world datasets even when its LibriSpeech WER appears similar. Builders evaluating ASR systems should test on their own audio, not delegate that decision to clean-speech benchmarks.
What it's good for: Understanding why Whisper-class models generalize across audio conditions when fine-tuned models do not. Where BrassTranscripts draws from it: Justification for using a large pretrained architecture over domain-specific fine-tuning for general-purpose transcription use cases.
📄 WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
Bain, Huh, Han, Zisserman (Oxford VGG), 2023 · arxiv.org/abs/2303.00747
BrassTranscripts uses the pipeline described in this paper as its inference engine. VAD pre-segmentation is what makes long-form transcription (audio of an hour or more) reliable: standard Whisper decodes long audio as a sequence of 30-second windows, and a timestamp or decoding error in one window can carry into the next, producing repetition and hallucination artifacts. WhisperX's VAD Cut & Merge strategy lets segments be transcribed in batches, which the paper reports as approximately 12x faster than standard processing while improving accuracy on long-form audio. Builders processing recordings over 15 minutes should prioritize VAD-segmented inference pipelines over naive sliding-window approaches.
What it's good for: Understanding the long-form inference architecture that makes 1h+ recordings tractable. Where BrassTranscripts draws from it: The inference pipeline design, word-level timestamp alignment, and long-form accuracy properties that distinguish production-grade transcription from naive API calls.
🧪 Open ASR Leaderboard
Srivastav, Zheng, Bezzam et al. (Hugging Face), 2025 · huggingface.co (Open ASR Leaderboard) · arXiv: 2510.06961
BrassTranscripts uses this leaderboard to track where its inference engine stands against alternatives. The long-form track (Earnings-21/22) is the most relevant column — not the LibriSpeech column — because it reflects actual production conditions. The leaderboard evaluates 86 ASR systems across 12 datasets covering English short-form, long-form, and multilingual tracks. Builders comparing ASR systems should sort by the Earnings-22 column rather than the aggregate score, which is dominated by LibriSpeech performance.
What it's good for: Reproducible cross-system WER comparison across short-form, long-form, and multilingual tracks. Where BrassTranscripts draws from it: Long-form track rankings as the primary accuracy reference for production audio conditions.
📊 Artificial Analysis Speech-to-Text Model Comparison
Artificial Analysis (independent), continuously updated 2024–2026 · artificialanalysis.ai/speech-to-text
BrassTranscripts monitors this benchmark for price-performance changes across the 49 systems it tracks — particularly the hosting-provider comparison for the same underlying model weights, which reveals that latency and cost vary dramatically (up to 10x) across providers running identical models. The composite weighting (50% AA-AgentTalk conversational audio, 25% VoxPopuli, 25% Earnings-22) makes it more predictive of real-world customer files than academic benchmarks alone. Builders evaluating cost efficiency should consult this benchmark before selecting an inference provider.
What it's good for: Comparing price, latency, and WER across 49 STT systems simultaneously, including provider-level variation for the same model. Where BrassTranscripts draws from it: Infrastructure cost-efficiency analysis and ongoing model landscape monitoring.
🧪 LibriSpeech
Panayotov, Chen, Povey, Khudanpur (ICASSP 2015) · openslr.org/12
BrassTranscripts treats LibriSpeech WER as a floor on error rate, not a ceiling: it is the best case, not the typical one. When a vendor cites an accuracy percentage, the first question to ask is whether that number came from LibriSpeech test-clean. It is among the most favorable benchmark conditions available (scripted read speech from volunteer-narrated LibriVox audiobooks, the cleaner recordings in the corpus, predominantly US-accented English, no overlapping speech) and is not representative of real-world customer audio. The approximately 1,000-hour corpus of audiobook recordings remains the standard reference point for historical comparison but is a poor predictor of performance on meeting recordings, depositions, interviews, or any uncontrolled environment.
What it's good for: Historical baseline comparison across the ASR literature — almost every published WER number traces to LibriSpeech. Where BrassTranscripts draws from it: Explaining why lab accuracy numbers should not be the primary selection criterion for production transcription services.
Frequently Asked Questions
What does WER (word error rate) actually measure?
WER counts the minimum number of word-level edits (substitutions, deletions, insertions) needed to transform a hypothesis transcript into the reference transcript, divided by the total number of words in the reference. A 5% WER means roughly 1 error per 20 words — but the distribution matters: a single wrong word in a name or number can invalidate an entire sentence while contributing only one error to the WER count.
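The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used by any benchmark listed here: the function name `word_error_rate` is hypothetical, and the normalization (lowercasing and whitespace splitting only) is an assumption. Published benchmarks usually apply heavier text normalization before scoring, so numbers from this sketch will not match leaderboard figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 100%, and a wrong name or number counts the same as a wrong filler word.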
Why is LibriSpeech WER misleading for real-world audio?
LibriSpeech is built from LibriVox audiobook recordings: scripted, carefully read speech from volunteer narrators, with little background noise, mostly US-accented English, and no overlapping speech. Real-world customer audio (meetings, interviews, phone calls, depositions) typically violates several of these conditions at once. WER on LibriSpeech test-clean is among the most favorable benchmark conditions available; the same model will usually produce substantially higher WER on actual customer files.
What is the Open ASR Leaderboard long-form track?
The Open ASR Leaderboard (Hugging Face) evaluates ASR systems across multiple tracks. The long-form track covers Earnings-21, Earnings-22, and TED-LIUM — datasets of 20-minute-to-2-hour recordings from real business and conference settings. This track is more predictive of production performance than the short-form LibriSpeech columns because it reflects the file lengths and acoustic conditions that professional transcription customers actually upload.
How does VAD pre-segmentation affect long-form accuracy?
Standard Whisper transcribes long audio as a sequence of 30-second windows, shifting each window according to the timestamps it predicted for the previous one. On long recordings, a timestamp or decoding error in one window can carry into the next, producing repetition artifacts, missed speech, or hallucinated text. VAD (voice activity detection) pre-segmentation cuts the audio at silence or low-speech-activity boundaries before feeding segments to the model, so each segment is transcribed independently. This removes most of those window-boundary artifacts and produces substantially more accurate output on recordings over 15 minutes.
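VAD pre-segmentation can be sketched as a small function over speech regions. This is a simplified illustration, not WhisperX's implementation: the function name `cut_and_merge` and the input format (a sorted list of `(start, end)` times in seconds from some VAD model) are assumptions. The sketch also cuts over-long regions at fixed 30-second marks, whereas the WhisperX paper cuts at the point of lowest voice activity.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of detected speech


def cut_and_merge(speech: List[Segment], max_len: float = 30.0) -> List[Segment]:
    """Turn VAD speech regions into chunks no longer than max_len seconds."""
    # Cut: split any speech region longer than max_len.
    # Simplification: fixed-length cuts instead of minimum-activity cuts.
    pieces: List[Segment] = []
    for start, end in speech:
        while end - start > max_len:
            pieces.append((start, start + max_len))
            start += max_len
        pieces.append((start, end))

    # Merge: greedily join neighbouring pieces while the combined span,
    # including the silence between them, still fits in max_len.
    chunks: List[Segment] = []
    for start, end in pieces:
        if chunks and end - chunks[-1][0] <= max_len:
            chunks[-1] = (chunks[-1][0], end)
        else:
            chunks.append((start, end))
    return chunks


# Hypothetical VAD output for a 110-second recording.
regions = [(0.0, 4.0), (5.0, 12.0), (14.0, 29.0), (31.0, 40.0), (45.0, 110.0)]
for chunk in cut_and_merge(regions):
    print(chunk)
```

Each resulting chunk fits inside one model window, so chunks can be transcribed independently and in batches instead of one window at a time.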
What does Artificial Analysis measure that the Open ASR Leaderboard does not?
Artificial Analysis adds latency and cost dimensions to the accuracy evaluation. Its composite WER benchmark covers 49 systems and weights toward conversational audio (50% AA-AgentTalk, 25% VoxPopuli, 25% Earnings-22). Critically, it evaluates the same underlying model weights deployed by different hosting providers — revealing that latency and price vary up to 10x across providers running identical models, something no pure-accuracy benchmark captures.
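The composite weighting can be illustrated with a weighted mean. This is a sketch under stated assumptions: the 50/25/25 weights come from the text above, but the function name `composite_wer`, the assumption that the composite is a simple weighted mean of per-dataset WERs, and the example WER values are all hypothetical rather than taken from Artificial Analysis.

```python
from typing import Dict

# Weights as described in the text above.
WEIGHTS: Dict[str, float] = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli": 0.25,
    "Earnings-22": 0.25,
}


def composite_wer(per_dataset_wer: Dict[str, float]) -> float:
    """Weighted mean of per-dataset WERs (assumed aggregation method)."""
    missing = set(WEIGHTS) - set(per_dataset_wer)
    if missing:
        raise KeyError(f"missing WER for: {sorted(missing)}")
    return sum(WEIGHTS[name] * per_dataset_wer[name] for name in WEIGHTS)


# Hypothetical per-dataset WERs for one system, for illustration only.
example = {"AA-AgentTalk": 0.08, "VoxPopuli": 0.06, "Earnings-22": 0.12}
print(f"{composite_wer(example):.3f}")
```

With half the weight on conversational audio, a system that is strong on read or prepared speech but weak on conversation is pulled toward its conversational score.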