ASR Benchmarks & Independent Testing

Builder question: What do independent benchmarks say about AI transcription accuracy?

BrassTranscripts draws on these eight benchmarks, toolkits, and datasets to evaluate AI transcription accuracy claims independently. Together they cover the leaderboards and scoring tools behind most WER numbers published in the literature, and the methodology for interpreting those numbers correctly.

🧪 Open ASR Leaderboard

Srivastav, Zheng, Bezzam et al. (Hugging Face), 2025 · huggingface.co (Open ASR Leaderboard) · arXiv: 2510.06961

BrassTranscripts uses the Open ASR Leaderboard's long-form track — Earnings-21, Earnings-22, TED-LIUM — as its primary accuracy reference, not the short-form LibriSpeech column. The long-form track is the only one that reflects performance on the 20-minute-to-2-hour files that BrassTranscripts customers actually upload. Evaluating 86 ASR systems across 12 datasets, the leaderboard provides reproducible cross-system comparison with transparent methodology. Builders comparing systems should sort by the Earnings-22 long-form column as their primary decision criterion for production audio conditions.

What it's good for: Reproducible cross-system WER comparison across short-form, long-form, and multilingual tracks with transparent methodology. Where BrassTranscripts draws from it: The long-form track as the primary production accuracy reference, updated continuously as new models are added.

📊 Artificial Analysis Speech-to-Text Model Comparison

Artificial Analysis (independent), continuously updated 2024–2026 · artificialanalysis.ai/speech-to-text

BrassTranscripts monitors Artificial Analysis for price-performance changes across the 49 systems it tracks — particularly the hosting-provider comparison for the same model weights, which reveals that latency and cost vary dramatically (up to 10x) across providers running identical models. The composite weighting (50% AA-AgentTalk conversational audio, 25% VoxPopuli, 25% Earnings-22) is more predictive of real-world customer files than academic benchmarks alone. Builders evaluating cost efficiency should consult this benchmark before selecting an inference provider — the cheapest provider running the same model as a more expensive provider may have 3-5x higher latency, a trade-off invisible in pure-accuracy benchmarks.

What it's good for: Provider-level price, latency, and WER comparison for 49 STT systems — including multiple providers running identical model weights. Where BrassTranscripts draws from it: Infrastructure cost and latency decisions, and ongoing monitoring of the model landscape for price-performance shifts.
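The composite weighting described above can be sketched as a weighted mean. This is a minimal illustration that assumes the composite is a plain linear combination of per-dataset WER; the dataset keys, the `composite_wer` helper, and the example values are hypothetical and are not taken from Artificial Analysis.

```python
# Weights as stated in the text; the dictionary keys are illustrative labels.
COMPOSITE_WEIGHTS = {
    "aa_agenttalk": 0.50,
    "voxpopuli": 0.25,
    "earnings22": 0.25,
}


def composite_wer(per_dataset_wer: dict[str, float]) -> float:
    """Weighted mean of per-dataset WER values (in percent).

    Assumes a simple linear combination; the real composite may be
    computed differently.
    """
    missing = COMPOSITE_WEIGHTS.keys() - per_dataset_wer.keys()
    if missing:
        raise ValueError(f"missing WER for: {sorted(missing)}")
    return sum(
        weight * per_dataset_wer[name]
        for name, weight in COMPOSITE_WEIGHTS.items()
    )


# Placeholder numbers, not measurements of any real system.
example_scores = {"aa_agenttalk": 8.0, "voxpopuli": 6.0, "earnings22": 12.0}
print(composite_wer(example_scores))  # 0.5*8 + 0.25*6 + 0.25*12 = 8.5
```

Because conversational audio carries half the weight, a system that is weak on that dataset cannot fully offset it with strong scores on the other two.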

📄 Benchmarking Diarization Models (ETH Zurich, 2025)

Lanzendörfer, Grötschla, Blaser, Wattenhofer (ETH Zurich), 2025 · arxiv.org/abs/2509.26177

BrassTranscripts cites this as the most current independent multi-model diarization comparison. It is relevant here because the end-to-end quality of a speaker-labeled transcript depends on diarization accuracy, not just WER. Evaluating 5 state-of-the-art models across 4 multilingual datasets (196.6 hours), the benchmark establishes that missed speech, not speaker confusion, is the dominant error source, and that PyannoteAI leads at 11.2% DER, with the open-source DiariZen at 13.3% DER. Builders should treat this as a complement to ASR WER benchmarks: a transcript with low WER but high DER is inaccurate along a dimension that WER alone cannot reveal.

What it's good for: The most current side-by-side comparison of diarization systems across consistent multilingual datasets. Where BrassTranscripts draws from it: Diarization quality monitoring and communicating what missed-speech errors look like in practice versus speaker confusion errors.
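To make the distinction between missed speech and speaker confusion concrete, here is a minimal sketch of how DER is conventionally assembled from its three components (missed speech, false alarm, speaker confusion) divided by total reference speech time. The `DiarizationErrors` class and the durations are hypothetical placeholders, not figures from the benchmark, and real scorers add details such as collars and overlap handling that are omitted here.

```python
from dataclasses import dataclass


@dataclass
class DiarizationErrors:
    """Error durations in seconds, measured against a reference annotation."""

    missed_speech: float      # reference speech with no speaker in the hypothesis
    false_alarm: float        # hypothesis speech where the reference has none
    speaker_confusion: float  # speech attributed to the wrong speaker
    total_reference: float    # total reference speech duration

    def der(self) -> float:
        """Diarization error rate as a fraction of reference speech time."""
        if self.total_reference <= 0:
            raise ValueError("total_reference must be positive")
        errors = self.missed_speech + self.false_alarm + self.speaker_confusion
        return errors / self.total_reference

    def dominant_error(self) -> str:
        """Name of the component contributing the most error time."""
        parts = {
            "missed_speech": self.missed_speech,
            "false_alarm": self.false_alarm,
            "speaker_confusion": self.speaker_confusion,
        }
        return max(parts, key=parts.get)


# Placeholder durations, not results from the benchmark.
example_errors = DiarizationErrors(
    missed_speech=70.0,
    false_alarm=20.0,
    speaker_confusion=30.0,
    total_reference=1000.0,
)
print(f"DER = {example_errors.der():.1%}")  # DER = 12.0%
print(example_errors.dominant_error())      # missed_speech
```

Reporting the breakdown alongside the headline DER shows whether a system is dropping speech entirely or merely mislabeling who said it.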

🏢 MLPerf Inference v5.1 — Whisper Benchmark

MLCommons, September 2025 · mlcommons.org (MLPerf Inference v5.1)

BrassTranscripts treats Whisper's inclusion in MLPerf as formal industry endorsement of the model as a production ASR reference workload. MLPerf Inference v5.1 provides standardized, hardware-independent evaluation of accuracy and throughput, using the same framework applied to image classification and large language model inference benchmarks. When a vendor claims "Whisper-quality" accuracy, MLPerf v5.1 is the methodology for verifying that claim reproducibly across different hardware. Without it, performance numbers measured on one vendor's hardware cannot be meaningfully compared with another's. Builders evaluating infrastructure partners should ask for MLPerf compliance data alongside proprietary benchmarks.

What it's good for: Hardware-independent standardized evaluation of ASR accuracy and throughput under the same framework used across ML workloads. Where BrassTranscripts draws from it: The formal benchmark framework for verifying "Whisper-quality" claims from infrastructure and model providers.

🔧 NIST SCTK / sclite — Speech Recognition Scoring Toolkit

U.S. National Institute of Standards and Technology · github.com/usnistgov/SCTK · 241 stars (widely deployed via package managers)

BrassTranscripts treats SCTK/sclite as the methodological foundation for interpreting any WER claim. Understanding how sclite's dynamic programming alignment works — and what counts as a substitution versus deletion versus insertion — is prerequisite knowledge for reading any accuracy table in any benchmark, including this one. The toolkit contains sclite (WER scorer), sc_stats (significance testing), rover (system combination / ensemble voting), and asclite (multi-speaker alignment). Virtually every published ASR WER number traces to SCTK conventions. Builders who want to compare their own transcript quality to published benchmarks must use the same scoring methodology; jiwer (see Audio Quality page) provides a Python-friendly interface to the same underlying alignment algorithms.

What it's good for: The reference implementation of WER scoring, significance testing, and multi-speaker alignment underlying virtually all published ASR accuracy numbers. Where BrassTranscripts draws from it: The methodological standard that all accuracy comparisons must conform to for apples-to-apples interpretation of WER numbers across systems.
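The alignment idea behind WER scoring can be sketched in a few lines. This is an illustrative, simplified scorer and not sclite itself: it assumes unit costs for every edit, whitespace tokenization, and no text normalization, whereas sclite's cost weighting and normalization rules may differ. The `WerCounts` class and `score` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WerCounts:
    substitutions: int
    deletions: int
    insertions: int
    reference_words: int

    @property
    def wer(self) -> float:
        """(S + D + I) / N, where N is the reference word count."""
        if self.reference_words == 0:
            raise ValueError("reference must contain at least one word")
        edits = self.substitutions + self.deletions + self.insertions
        return edits / self.reference_words


def score(reference: str, hypothesis: str) -> WerCounts:
    """Minimum-edit word alignment via dynamic programming (unit costs)."""
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (edits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    table = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # reference words with no hypothesis: deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # hypothesis words with no reference: insertions
    for i in range(1, rows):
        for j in range(1, cols):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match costs nothing
                continue
            e, s, d, n = table[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = table[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            table[i][j] = min(substitution, deletion, insertion, key=lambda c: c[0])
    _, subs, dels, ins = table[-1][-1]
    return WerCounts(subs, dels, ins, len(ref))


counts = score(
    "the quarterly revenue rose by ten percent",
    "the quarterly revenues rose ten percent today",
)
print(counts)                 # 1 substitution, 1 deletion, 1 insertion over 7 words
print(round(counts.wer, 4))   # 0.4286
```

In the example, "revenue" becomes "revenues" (substitution), "by" is dropped (deletion), and "today" is added (insertion), so three edits over seven reference words give a WER of about 42.9%. When several alignments share the same minimum cost, the split between substitutions, deletions, and insertions can depend on tie-breaking, which is one reason comparisons should use the same scorer.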

🧪 Earnings-22: A Practical Benchmark for Accents in the Wild

Del Rio, Ha, McNamara, Miller, Chandra, 2022 · arxiv.org/abs/2203.15591

BrassTranscripts uses Earnings-22 as its primary real-world accuracy reference for English-language business audio. It is now included in both the Open ASR Leaderboard long-form track and the Artificial Analysis composite (25% weight) — making it the single dataset with the widest cross-benchmark coverage for real-world audio evaluation. The 119-hour corpus of earnings calls from global companies documents significant WER variation across speaker countries of origin, making it the canonical reference for accented English business audio performance. Builders selecting between ASR systems for business use cases should prioritize Earnings-22 WER as the primary selection criterion over LibriSpeech or any proprietary internal benchmark.

What it's good for: Cross-framework real-world English business audio evaluation, using a dataset that appears in both the Open ASR Leaderboard long-form track and the Artificial Analysis composite. Where BrassTranscripts draws from it: The primary real-world accuracy reference for English business audio, covering accented speech conditions that other benchmarks underrepresent.

🧪 GigaSpeech: 10,000 Hours of Multi-Domain Transcribed Audio

Chen, Chai, Wang et al. (GigaSpeech consortium), Interspeech 2021 · arxiv.org/abs/2106.06909

BrassTranscripts uses GigaSpeech as a multi-domain accuracy reference closer to real uploads than read-speech benchmarks. It provides 10,000 hours of labeled audio (40,000 hours total) spanning audiobooks, podcasts, and YouTube, so its test sets carry the spontaneous, messy acoustics that LibriSpeech lacks. Builders comparing systems should read GigaSpeech WER alongside LibriSpeech; a model that looks strong on clean read speech can rank differently once podcast and video audio enter the mix.

What it's good for: A large multi-domain corpus (audiobooks, podcasts, YouTube) for evaluating WER on spontaneous, real-world audio. Where BrassTranscripts draws from it: A production-representative accuracy reference that complements the read-speech LibriSpeech floor.

🧪 The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset

Galvez et al. (MLCommons), 2021 · arxiv.org/abs/2111.09344

BrassTranscripts uses The People's Speech as a reference point for what large-scale, commercially usable training data actually looks like. It is a supervised conversational English corpus of 30,000 hours and growing, released under CC-BY-SA with a CC-BY subset, and a model trained on it reaches 9.98% word error rate on LibriSpeech test-clean. For builders, it matters for two reasons: conversational, real-world audio at this scale is what helps models generalize beyond clean read speech to the messy recordings users actually upload, and its permissive license is what makes such data usable for training commercial systems.

What it's good for: Gauging the scale and licensing of open conversational training data behind modern speech recognition. Where BrassTranscripts draws from it: Evidence that real-world conversational corpora, not just read-speech, underpin accuracy on everyday recordings.

Frequently Asked Questions

Which benchmark column should I look at when comparing ASR systems?

For production transcription use cases — meeting recordings, interviews, podcasts, business calls — look at the Earnings-22 column in the Open ASR Leaderboard long-form track, not the LibriSpeech test-clean column. LibriSpeech test-clean is the most favorable benchmark condition available (professional readers, studio acoustics) and systematically underestimates real-world WER. Earnings-22 (119 hours of earnings call audio with accented speakers and financial terminology) is more predictive of actual production performance because its acoustic conditions more closely match real customer files.
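A small sketch shows why the choice of column matters: the same systems can rank in a different order depending on which dataset is used for sorting. The system names and WER values below are made-up placeholders chosen only to illustrate the effect, and `rank_by` is a hypothetical helper, not part of any leaderboard API.

```python
# Placeholder WER values (percent) for three hypothetical systems.
leaderboard_rows = [
    {"name": "system_a", "librispeech_clean": 1.8, "earnings22": 14.5},
    {"name": "system_b", "librispeech_clean": 2.4, "earnings22": 11.2},
    {"name": "system_c", "librispeech_clean": 2.1, "earnings22": 12.9},
]


def rank_by(column: str, rows: list[dict]) -> list[str]:
    """Return system names ordered from lowest to highest WER in `column`."""
    return [row["name"] for row in sorted(rows, key=lambda row: row[column])]


print(rank_by("librispeech_clean", leaderboard_rows))
# ['system_a', 'system_c', 'system_b']
print(rank_by("earnings22", leaderboard_rows))
# ['system_b', 'system_c', 'system_a']
```

In this invented example, the system that leads on clean read speech finishes last on earnings-call audio, which is the kind of reversal the long-form column is meant to expose.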

What is MLPerf and why does it matter for AI transcription claims?

MLPerf is the industry-standard, hardware-independent benchmark for machine learning inference, maintained by MLCommons. Whisper was added as an official MLPerf Inference v5.1 benchmark in September 2025, within the same framework used to benchmark image classification and large language model inference. When a vendor claims "Whisper-quality" accuracy or performance, MLPerf v5.1 is the methodology for verifying that claim reproducibly across different hardware. Without MLPerf or an equivalent standardized benchmark, performance claims measured on one vendor's hardware cannot be meaningfully compared with another's.

What is NIST SCTK and what does sclite do?

The NIST Speech Recognition Scoring Toolkit (SCTK) is the reference implementation for computing WER using dynamic programming alignment. Its core tool, sclite, counts substitutions, deletions, and insertions by finding the minimum-edit alignment between a hypothesis transcript and a reference transcript. Virtually every published ASR WER number traces to SCTK conventions. Understanding how sclite's alignment works — and what counts as a substitution versus deletion versus insertion — is prerequisite knowledge for reading any accuracy table in any benchmark.

How does Artificial Analysis differ from the Open ASR Leaderboard?

The Open ASR Leaderboard focuses on WER accuracy across a reproducible multi-dataset framework and is primarily research-oriented. Artificial Analysis adds price and latency dimensions: it tracks cost-per-minute, median latency, and throughput alongside WER for 49 systems, including multiple hosting providers running the same underlying model weights. This provider-level comparison reveals that price and latency vary up to 10x across providers running identical models — something no pure-accuracy benchmark captures. Builders selecting an inference provider (rather than selecting a model architecture) should use Artificial Analysis.

Why does Earnings-22 appear in multiple benchmark frameworks?

Earnings-22 (119 hours of earnings call audio, published 2022) has become the de facto real-world English business audio reference because it combines realistic conditions (accented English, financial terminology, variable recording quality) with a publicly available evaluation set. It is included in the Open ASR Leaderboard long-form track and the Artificial Analysis composite weighting (25%), and it is one of the most cited datasets in the ASR benchmark literature. This cross-framework presence means Earnings-22 performance numbers can be compared across benchmark systems, unlike benchmark-specific internal datasets that are not publicly reproducible.

Explore Related Research

Transcription Accuracy → Whisper architecture, long-form inference paper, Open ASR Leaderboard

Speaker Diarization → Neural diarization, overlap-aware diarization, ETH Zurich benchmark

Multilingual Speech → FLEURS, Common Voice, accent peer review

Audio Quality → Reverberation benchmarks, SNR thresholds, CHiME-7
