12 min read · BrassTranscripts Team

The 98% Accuracy Claim: An Investigation into AI Transcription Performance

In October 2025, while researching industry claims about AI transcription accuracy, we encountered a statement on Grokipedia's transcription service page that caught our attention: "OpenAI's Whisper and similar systems, have achieved near-human accuracy rates of up to 98% in 2025."

The claim appeared alongside other factual information about transcription technology, but it cited no source. This raised questions. Where did "98%" come from? What does it actually measure? And most importantly—can it be verified?

We investigated this claim through official documentation, peer-reviewed academic studies, and independent industry benchmarks. What we found was a significant gap between marketing language and documented performance.

The Investigation

We approached this research with a straightforward question: Is "98% accuracy" for Whisper factually supported?

Our methodology involved:

  • Reviewing OpenAI's official Whisper documentation and research papers
  • Analyzing peer-reviewed studies from academic conferences (Interspeech 2023, 2025; JASA Express Letters)
  • Examining independent industry benchmarks (MLCommons, Deepgram, AssemblyAI, Soniox)
  • Investigating WhisperX, the optimized implementation used by commercial transcription services
  • Cross-referencing claims against authoritative sources

We conducted this research in parallel—two independent investigators examining the same sources to verify findings.

What OpenAI Actually Documents

OpenAI's official Whisper research paper and GitHub repository report specific performance metrics. The most commonly cited number comes from testing on LibriSpeech, a dataset of read audiobook recordings.

Documented performance:

  • LibriSpeech clean-test: 2.5-2.7% Word Error Rate (WER), equivalent to 97.3-97.5% accuracy
  • LibriSpeech other-test: 5.2% WER, equivalent to 94.8% accuracy

These numbers appear in OpenAI's official documentation. Nowhere in these documents does the term "98% accuracy" appear.

The LibriSpeech dataset consists of professionally recorded audiobooks—single speakers reading prepared text in quiet environments. OpenAI's paper explicitly describes Whisper's LibriSpeech performance as "unremarkable" while noting the model's "very different robustness properties" across varied conditions.

The documentation also includes explicit warnings. The paper states that models "perform unevenly across languages," show "lower accuracy on low-resource languages," demonstrate "disparate performance on different accents and dialects," and "may have higher word error rate across different demographics."

The MLPerf Benchmark: Where "98%" Likely Originated

The closest we found to "98%" appears in MLCommons' Whisper benchmarking results. Their testing on LibriSpeech showed 97.9329% word accuracy—97.93% when rounded to two decimal places.

This appears to be the source: 97.93% rounded up to "98%" and repeated across industry marketing without context about test conditions.

The problem: This benchmark measures performance on clean, single-speaker audiobooks with standard American accents. It does not represent the varied audio conditions most users encounter.
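The rounding itself is easy to verify. A quick Python sanity check of how 97.9329% becomes "98%" (the variable name here is ours, not MLCommons'):

```python
mlperf_word_accuracy = 97.9329  # MLCommons LibriSpeech result, in percent

print(f"{mlperf_word_accuracy:.2f}%")             # two decimal places: 97.93%
print(f"{round(mlperf_word_accuracy)}%")          # nearest whole percent: 98%
print(f"WER: {100 - mlperf_word_accuracy:.2f}%")  # implied error rate: 2.07%
```

The "98%" figure only emerges once the benchmark's own precision is discarded, and with it the context that the test audio was clean, single-speaker read speech.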

Real-World Performance: What Peer-Reviewed Research Shows

Academic studies testing Whisper on diverse audio conditions reveal significant performance variation.

Conversational Audio

AssemblyAI's comparison study tested Whisper large-v3 and turbo models on conversational audio. Results: 7.75-7.88% WER, equivalent to 92.12-92.25% accuracy.

Soniox's March 2023 benchmark compared Whisper Large against human transcription. Whisper achieved 17.6% WER (82.4% accuracy) compared to humans at 6.8% WER (93.2% accuracy).

Phone Calls and Low-Quality Audio

Deepgram's 2024 analysis of Whisper v3 on real-world phone calls and conversational audio showed a median WER of 53.4% (46.6% accuracy) overall and 42.9% WER (57.1% accuracy) on phone calls specifically.

Forensic and Challenging Audio

A 2024 Frontiers in Communication study by Loakes tested Whisper on forensic-quality audio. The system transcribed 69% of 116 words, with 72.5% of those transcribed correctly. Overall WER: 50% (50% accuracy).

The study concluded: "good performance is relative; the overall WER is still 50%."

Accents and Demographic Variation

A 2024 JASA Express Letters study by Graham and Roll examined Whisper performance across accents and speaker traits. Findings: "Superior recognition in American vs British/Australian accents" and "Native English accents demonstrate higher accuracy than non-native."

An Interspeech 2025 study on accent variation documented a WER of 29% (71% accuracy) for the ICE Nigeria corpus and 26% WER (74% accuracy) for the ICE Scotland corpus.

Spontaneous Speech with Disfluencies

An Interspeech 2025 paper evaluated Whisper on spontaneous speech containing natural errors and disfluencies. Performance:

  • Fluent speech: 23-24% WER (76-77% accuracy)
  • Stuttered speech: 41-49% WER (51-59% accuracy)

WhisperX: Understanding the Implementation

Many commercial transcription services use WhisperX, an optimized implementation developed by researchers at the University of Oxford's Visual Geometry Group (Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman).

WhisperX is not a new model. It's a wrapper around OpenAI's Whisper that adds voice activity detection (VAD), batched inference for speed, forced phoneme alignment for word-level timestamps, and speaker diarization.

Official WhisperX Documentation

The WhisperX GitHub repository documents specific performance claims:

  • "70× real-time" transcription speed for large-v2
  • "Accurate word-level timestamps via phoneme alignment"
  • "VAD preprocessing reduces hallucination & batching with no WER degradation"

The repository makes no claims about improving transcription accuracy. The emphasis is on speed and timestamp precision while maintaining—not improving—the accuracy of the underlying Whisper model.

Peer-Reviewed WhisperX Performance

The authoritative WhisperX benchmark appears in the paper "WhisperX: Time-Accurate Speech Transcription of Long-Form Audio" (Bain et al., 2023), presented at Interspeech 2023. This is the only peer-reviewed study specifically benchmarking WhisperX.

Results on long-form datasets:

| Dataset | Base Whisper WER | WhisperX WER | Improvement | WhisperX Accuracy |
|---|---|---|---|---|
| TED-LIUM 3 (talks) | 10.5% | 9.7% | 0.8 points | 90.3% |
| Kincaid46 (YouTube) | 7.7% | 6.7% | 1.0 points | 93.3% |
| AMI (meetings) | 12.5% | 11.8% | 0.7 points | 88.2% |
| Switchboard (phone) | 11.0% | 11.8% | ~similar | ~88-89% |

WhisperX improved accuracy by 0.8 to 1.0 percentage points while achieving 9.7× to 11.8× faster processing.

The paper's Table 3 reveals a critical finding: Without VAD segmentation, batched inference actually degrades accuracy. VAD preprocessing is essential to maintaining performance while achieving speed improvements.
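The core of that VAD "cut & merge" step can be sketched in a few lines. This is an illustrative simplification under our own assumptions (the function name and segment timings are invented), not the authors' implementation; the 30-second limit reflects Whisper's fixed input window:

```python
MAX_CHUNK = 30.0  # Whisper consumes fixed 30-second audio windows


def merge_vad_segments(speech_segments, max_chunk=MAX_CHUNK):
    """Greedily merge VAD speech segments (start, end) into chunks no longer
    than max_chunk, so batched inference never cuts speech mid-utterance.
    (A single segment longer than max_chunk would need an in-segment cut,
    omitted here for brevity.)"""
    chunks = []
    cur_start, cur_end = None, None
    for start, end in speech_segments:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_chunk:
            cur_end = end  # extend the current chunk to cover this segment
        else:
            chunks.append((cur_start, cur_end))  # close the full chunk
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks


# Example: silence between 27.5s and 29.0s forces a chunk boundary there,
# rather than an arbitrary cut at exactly 30.0s.
segments = [(0.0, 4.2), (5.1, 12.0), (13.0, 27.5), (29.0, 41.0)]
print(merge_vad_segments(segments))  # → [(0.0, 27.5), (29.0, 41.0)]
```

Because chunk boundaries fall in detected silence rather than mid-word, batching gains its speedup without the accuracy degradation the paper observes when VAD is removed.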

WhisperX on Spontaneous Speech

The same Interspeech 2025 study on spontaneous speech errors tested WhisperX on the SFU Speech Error Database containing 5,300 human speech errors.

Results:

  • Sound errors: 83% accuracy
  • Word errors: 74% accuracy

The study concluded that "WhisperX struggles with speech errors and that accuracy varies widely across error types and conditions."

The Range of Documented Performance

Compiling peer-reviewed and independently verified results, we documented the following accuracy range for Whisper and WhisperX:

| Condition | Source | Accuracy |
|---|---|---|
| Clean audiobooks (LibriSpeech) | OpenAI, MLPerf | 97.3-97.9% |
| Conversational audio | AssemblyAI, Soniox | 82.4-92.2% |
| WhisperX on benchmarks | Interspeech 2023 | 88.2-93.3% |
| Fluent spontaneous speech | Interspeech 2025 | 76-77% |
| Accented speech (Nigeria, Scotland) | Interspeech 2025 | 71-74% |
| Phone calls | Deepgram | 46.6-57.1% |
| Forensic audio | Frontiers in Communication | 50% |
| Stuttered speech | Interspeech 2025 | 51-59% |
| Low-resource languages | Various studies | 33-67% |

The documented range spans from 33% to 97.9%—a 65-percentage-point gap between best and worst conditions.

Demographic Variation

Multiple studies document performance variation across demographic groups. One analysis found:

  • Male speakers: Over 90% accuracy
  • Female speakers: Around 80% accuracy
  • Children: As low as 40% accuracy

This variation raises questions about any universal accuracy claim that doesn't account for speaker demographics.

Marketing Claims vs. Documented Evidence

During our research, we encountered numerous online sources claiming "95-98% accuracy" for Whisper and WhisperX. We traced these claims to their sources.

What we found: marketing content from service providers and technology blogs, without citations to peer-reviewed studies or documented methodology. Many sources cited each other in circular fashion, but none linked to authoritative benchmarks supporting the claimed range.

The gap between documented evidence and marketing claims ranges from 5 to 12 percentage points, depending on audio conditions.

What "Accuracy" Actually Measures

AI transcription accuracy is typically measured as 100% minus Word Error Rate (WER). WER counts substitutions (wrong word), deletions (missed word), and insertions (added word) as a percentage of total words.

This measurement has limitations:

  • It doesn't account for meaning preservation
  • Homophones are counted as errors even when the meaning is clear
  • It doesn't measure punctuation, capitalization, or formatting
  • It is context-dependent (technical terminology and proper nouns inflate error counts)

A 10% WER means 90% accuracy, but whether that's "good" depends on use case. For legal or medical transcription, 90% accuracy means significant error rates. For casual note-taking, it may be acceptable. Learn more about how to choose the right transcription service for your specific needs.
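Concretely, WER is a word-level edit distance. A minimal sketch (our own illustration, not any vendor's scoring code) also shows the homophone limitation mentioned above:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words,
    computed as Levenshtein distance over words via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)


# Two homophone slips count as full errors even though meaning survives:
ref = "their results were four percent higher"
hyp = "there results were for percent higher"
print(round(wer(ref, hyp), 3))  # → 0.333 (2 substitutions over 6 words)
```

A transcript that reads perfectly well to a human can thus score 33% WER, and the reverse also holds: a low WER does not guarantee that names, numbers, or technical terms came through correctly.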

What We Can Verify

Based on authoritative sources, these statements are factually supported:

  1. OpenAI's Whisper achieves 97.3% accuracy on LibriSpeech clean test set (clean, read speech)
  2. Real-world Whisper accuracy ranges from 50-92% depending on audio quality, speaker accent, and language
  3. WhisperX improves Whisper accuracy by 0.8-1.0 percentage points on long-form audio
  4. WhisperX achieves 88.2-93.3% accuracy on clean benchmark datasets
  5. Performance drops significantly on phone calls (46-57%), accented speech (71-77%), and spontaneous speech with disfluencies (74-83%)
  6. WhisperX's primary innovations are processing speed (70× faster) and word-level timestamps, not a transformation in transcription accuracy

What We Cannot Verify

The following claims lack authoritative support:

  1. "98% accuracy" for Whisper or WhisperX in general use
  2. "95-98% accuracy" as a typical performance range
  3. "Near-human accuracy" (humans benchmark at 93-97% accuracy; AI systems are documented at 50-93% depending on conditions)
  4. Universal accuracy claims without context about audio quality, speaker demographics, or language

The Context Problem

The fundamental issue with "98% accuracy" is not that the number is entirely fabricated—it's that a benchmark result from one specific test condition (clean audiobooks with standard American accents) has been extracted from context and applied universally.

This would be like testing a car's fuel efficiency on a flat highway at 55 mph, then claiming that same efficiency applies in city traffic, mountain roads, and stop-and-go conditions.

OpenAI's documentation explicitly warns about this. The paper describes varied performance across conditions and demographics. Removing these warnings and citing only the best-case number misrepresents the technology's actual capabilities.

Why Accuracy Varies

AI transcription accuracy depends on multiple factors:

Audio quality: Background noise, recording equipment, compression, and audio format all affect performance. Each 10dB increase in background noise can reduce accuracy by 8-12%. For tips on improving your audio quality, see our comprehensive guide.

Speaker characteristics: Native vs. non-native speakers, regional accents, speech impediments, speaking pace, and clarity all impact results.

Language: High-resource languages (extensive training data) perform better than low-resource languages. Whisper supports 99 languages but with widely varying accuracy.

Content type: Technical terminology, proper nouns, industry jargon, and domain-specific vocabulary reduce accuracy compared to general conversation.

Audio length and segmentation: How audio is chunked for processing affects results. WhisperX's VAD segmentation improves accuracy specifically by optimizing chunk boundaries.

Number of speakers: Single-speaker audio performs better than multi-speaker. Overlapping speech, cross-talk, and rapid speaker changes increase error rates. Learn more about how speaker identification works and its impact on accuracy.

What Consumers Should Know

When evaluating AI transcription services, ask:

  1. Under what conditions was accuracy tested? Clean studio recordings, phone calls, meetings, interviews?
  2. What audio quality? Professional equipment or consumer devices?
  3. Which languages and accents? Performance varies significantly.
  4. What content type? General conversation or specialized terminology?
  5. Is methodology documented? Can results be independently verified?

Be skeptical of universal accuracy percentages without context. "95% accuracy" on audiobooks differs dramatically from "95% accuracy" on phone calls; in fact, peer-reviewed studies show phone calls achieve 46-57% accuracy, not 95%. If you're experiencing AI transcription errors, understanding these factors can help identify the root cause.

The Role of Post-Processing

Some transcription services improve raw AI output through additional processing:

  • Vocabulary customization (medical terms, company names, technical jargon)
  • Language model fine-tuning on domain-specific text
  • Post-processing rules (common error corrections, formatting)
  • Human review for critical applications

These improvements can boost accuracy, but they're service-specific enhancements, not inherent AI model capabilities. Claims about improved accuracy should specify whether they measure raw AI output or post-processed results.
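As a toy illustration of the rule-based correction step, a service might maintain a table of known confusions and rewrite raw output before delivery. The correction entries here are invented for the example; real vocabularies are domain-specific (medical terms, product names, jargon):

```python
import re

# Hypothetical correction rules: common ASR confusions → preferred forms.
CORRECTIONS = {
    "whisper x": "WhisperX",
    "libre speech": "LibriSpeech",
    "word error rate": "Word Error Rate",
}


def post_process(transcript: str) -> str:
    """Apply simple case-insensitive correction rules to raw ASR output."""
    for wrong, right in CORRECTIONS.items():
        transcript = re.sub(re.escape(wrong), right, transcript,
                            flags=re.IGNORECASE)
    return transcript


raw = "we measured the word error rate of whisper x on libre speech"
print(post_process(raw))
# → "we measured the Word Error Rate of WhisperX on LibriSpeech"
```

Because such rules operate on top of the model's output, they improve a specific service's delivered accuracy without changing what the underlying model can do, which is exactly why the distinction between raw and post-processed results matters.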

Conclusion

The claim that AI transcription systems achieve "98% accuracy" is not supported by authoritative sources. The highest documented performance is 97.3-97.9% on LibriSpeech clean audio—a specific benchmark of read speech in ideal conditions. This result has been rounded up and repeated without the critical context about test conditions.

Peer-reviewed research and independent benchmarks document accuracy ranging from 50% to 93% depending on audio quality, speaker characteristics, and language. WhisperX, the optimized implementation used by many transcription services, achieves 88-93% accuracy on clean benchmarks and 74-83% on spontaneous speech with natural disfluencies.

The 65-percentage-point gap between best and worst documented conditions demonstrates why context-free accuracy claims are misleading. AI transcription technology has made remarkable progress, but accuracy remains highly dependent on conditions.

For consumers and businesses evaluating transcription services, understanding this variability matters more than memorizing a single percentage. The technology works extremely well under certain conditions and poorly under others. Making informed decisions requires knowing which conditions apply to your specific use case.

We continue to use WhisperX at BrassTranscripts because peer-reviewed evidence shows it delivers the best available accuracy for long-form audio while providing word-level timestamps and speaker identification. We offer a 30-word preview before purchase specifically so users can verify quality on their actual audio before paying—because we recognize that performance varies based on your specific recording conditions.




This investigation was conducted in November 2025 through independent research of peer-reviewed academic studies, official documentation, and industry benchmarks. All accuracy figures cited are from authoritative sources with documented testing methodology.
