8 min read | BrassTranscripts Team

BrassTranscripts Launches WhisperX Large-v3 with Automatic Speaker Identification

BrassTranscripts today announces professional multi-speaker transcription powered by WhisperX large-v3 with automatic speaker identification. The service combines OpenAI's Whisper large-v3 model with Pyannote 3.1 speaker diarization to deliver accurate transcripts with automatic speaker labels across 99+ languages.

What's New in WhisperX Large-v3

WhisperX large-v3 represents the latest generation of speech recognition technology, offering professional-grade transcription accuracy for clear audio recordings. The model processes audio files in 1-3 minutes per hour of content, delivering fast turnaround times for time-sensitive projects.

Key Capabilities:

  • 99+ Language Support - Automatic language detection across global languages
  • Professional Accuracy - Optimized for clear audio recording conditions
  • Fast Processing - 1-3 minutes per hour of audio
  • Four Output Formats - TXT, SRT, VTT, and JSON included with every transcription

The large-v3 model builds on years of speech recognition research and training, providing improved handling of diverse accents, technical terminology, and conversational speech patterns. For tips on maximizing accuracy, read our guide on audio quality secrets for perfect transcription.

Automatic Speaker Identification Technology

Speaker diarization, the process of determining who spoke when in an audio recording, runs automatically on all BrassTranscripts uploads using Pyannote 3.1. Combined with the transcript, it shows who said what. Learn more about how speaker identification works and explore different speaker diarization models.

How Speaker Identification Works

  1. Audio Analysis - The system analyzes voice characteristics including pitch, tone, and speaking patterns
  2. Speaker Clustering - Distinct voices are grouped into separate speaker labels (Speaker 1, Speaker 2, etc.)
  3. Timestamp Attribution - Each spoken segment is attributed to the identified speaker
  4. Label Integration - Speaker labels appear in all four output formats (TXT, SRT, VTT, JSON)
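
The timestamp attribution step can be illustrated with a minimal sketch. This is not BrassTranscripts' or Pyannote's actual code: the `Turn` and `Segment` types and the `assign_speakers` helper are hypothetical names, and the largest-overlap rule is one common way to attach diarization results to transcribed segments, assumed here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization result: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """A transcribed span of text with its start and end time (seconds)."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Return the length in seconds of the intersection of two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each segment with the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in segments:
        # Hypothetical fallback label for segments no turn overlaps.
        best_speaker, best_overlap = "Unknown", 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, seg.text))
    return labeled


turns = [Turn(0.0, 5.0, "Speaker 1"), Turn(5.0, 9.0, "Speaker 2")]
segments = [Segment(0.2, 4.8, "Welcome to today's meeting."),
            Segment(4.9, 8.0, "Thanks for having me.")]
for speaker, text in assign_speakers(segments, turns):
    print(f"[{speaker}]: {text}")
```

A segment that straddles a speaker change goes to whichever speaker covers more of it, which is why short interjections are the most likely place for attribution errors.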

What You Get

Every transcript includes speaker labels automatically:

[Speaker 1]: Welcome to today's meeting. Let's start with the quarterly review.

[Speaker 2]: Thanks for having me. I'd like to begin with our revenue results.

[Speaker 1]: Please go ahead. The team is eager to hear the update.

No additional configuration or manual tagging required—speaker identification runs automatically on every upload. If you need to fix speaker labels after processing, see our guide on how to correct speaker attribution errors.
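
For downstream processing, a labeled transcript can be split back into speaker turns. The sketch below assumes the TXT output follows the `[Speaker N]: text` layout shown in the example above. The `parse_transcript` helper is a hypothetical name, and treating an unlabeled line as a continuation of the previous turn is an assumption.

```python
import re

# Matches lines such as "[Speaker 1]: Welcome to today's meeting."
LINE_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]:\s*(?P<text>.*)$")


def parse_transcript(text):
    """Return a list of (speaker, text) tuples from a labeled transcript."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between turns
        match = LINE_RE.match(line)
        if match:
            turns.append((match["speaker"], match["text"]))
        elif turns:
            # Assumption: an unlabeled line continues the previous turn.
            speaker, previous = turns[-1]
            turns[-1] = (speaker, previous + " " + line)
    return turns


sample = """[Speaker 1]: Welcome to today's meeting. Let's start with the quarterly review.

[Speaker 2]: Thanks for having me. I'd like to begin with our revenue results.

[Speaker 1]: Please go ahead. The team is eager to hear the update.
"""
parsed = parse_transcript(sample)
```

From the parsed list it is straightforward to count turns per speaker or to rename `Speaker 1` to a real name.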

Professional Use Cases

Business Meetings

Document team discussions, client calls, and strategy sessions with clear attribution of who made each point. Ideal for meeting minutes, decision tracking, and accountability documentation. Learn more about corporate meeting documentation workflows and how to create executive summaries from meeting transcripts.

Common Applications:

  • Board meetings and executive sessions
  • Client consultation calls
  • Team planning discussions
  • Performance review conversations

Research Interviews

Academic researchers, journalists, and qualitative analysts benefit from automatic speaker labeling in interview transcripts, eliminating hours of manual speaker tagging. See our comprehensive guide on interview transcription for qualitative research and learn expert interview techniques.

Research Applications:

  • Academic qualitative research
  • Ethnographic interviews
  • Focus group discussions
  • Expert interview documentation

Content Creation

Podcasters, video creators, and media producers receive ready-to-edit transcripts with speaker identification for show notes, captions, and content repurposing. Explore our podcast transcription workflow for content creators and learn how to build a content empire from podcast transcripts.

Creator Applications:

  • Podcast episode transcripts
  • YouTube video captions
  • Panel discussion documentation
  • Interview show transcripts

Legal Documentation

Legal professionals working with depositions, witness interviews, and client meetings receive accurate speaker-attributed transcripts for case documentation.

Legal Applications:

  • Deposition transcription
  • Witness interview documentation
  • Client consultation records
  • Legal proceeding documentation

Technical Specifications

File Support

  • Audio Formats: MP3, M4A, WAV, AAC, FLAC, OGG, Opus, WebM, MPGA
  • Video Formats: MP4, MPEG (audio automatically extracted)
  • Total Supported Formats: 11

File Limits:

  • Maximum file size: 250MB
  • Maximum duration: 2 hours
  • Minimum duration: 5 minutes
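
To check a file against these limits before uploading, a small pre-flight check might look like the sketch below. It is not the service's own validation code. The `check_upload` helper is a hypothetical name, the lowercase extension spellings are assumed from the format names listed above, and reading 250MB as 250 × 1024 × 1024 bytes is also an assumption.

```python
from pathlib import Path

# Assumption: extensions mirror the format names listed under File Support.
SUPPORTED_EXTENSIONS = {
    "mp3", "m4a", "wav", "aac", "flac", "ogg", "opus", "webm", "mpga",  # audio
    "mp4", "mpeg",                                                      # video
}
MAX_SIZE_BYTES = 250 * 1024 * 1024   # 250MB, assuming binary megabytes
MIN_DURATION_S = 5 * 60              # 5 minutes
MAX_DURATION_S = 2 * 60 * 60         # 2 hours


def check_upload(filename, size_bytes, duration_s):
    """Return a list of problems; an empty list means the file passes."""
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'none'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 250MB")
    if duration_s < MIN_DURATION_S:
        problems.append("recording is shorter than 5 minutes")
    if duration_s > MAX_DURATION_S:
        problems.append("recording is longer than 2 hours")
    return problems


print(check_upload("meeting.mp3", 10_000_000, 3600))
```

Returning every problem at once, rather than stopping at the first, lets a user fix the file in a single pass.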

Processing Specifications

  • Processing Speed: 1-3 minutes per hour of audio
  • Language Detection: Automatic across 99+ supported languages
  • Speaker Identification: Automatic using Pyannote 3.1
  • Output Formats: TXT, SRT, VTT, JSON (all included)

AI Technology Stack

  • Transcription Engine: WhisperX with Whisper large-v3 model
  • Speaker Diarization: Pyannote 3.1
  • Language Support: 99+ languages with automatic detection
  • Infrastructure: RunPod serverless GPU processing

How It Works

1. Upload Your File

Drag and drop your audio or video file (up to 250MB, 2 hours maximum). The system validates format and duration automatically.

2. Automatic Processing

WhisperX large-v3 transcribes the audio while Pyannote 3.1 simultaneously identifies and labels speakers. Processing completes in 1-3 minutes per hour of content.

3. Preview Your Results

Receive a 30-word preview of your transcript to verify quality before payment. Preview includes speaker labels so you can confirm identification accuracy.

4. Download All Formats

After payment, download your complete transcript in four formats:

  • TXT - Clean text with speaker labels
  • SRT - Video subtitle format with speakers
  • VTT - Web video captions with speakers
  • JSON - Structured data with timestamps and speakers

Learn more about choosing the right transcript format and explore multi-speaker transcript format options.
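
To show how timestamps and speaker labels fit together in a subtitle file, here is a minimal sketch that renders labeled segments as SRT. The cue layout (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is the standard SubRip structure. Prefixing each cue with `[Speaker N]:` is an assumption about how BrassTranscripts embeds speakers, and `srt_timestamp` and `to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, speaker, text) tuples as SRT cues."""
    cues = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"[{speaker}]: {text}\n"
        )
    # Cues are separated by a single blank line.
    return "\n".join(cues)


segments = [
    (0.0, 2.5, "Speaker 1", "Welcome to today's meeting."),
    (2.5, 5.0, "Speaker 2", "Thanks for having me."),
]
print(to_srt(segments))
```

VTT uses a similar cue structure but starts with a `WEBVTT` header line and writes the millisecond separator as a period instead of a comma.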

Pricing and Availability

  • Tier 1: $2.25 flat rate for files 1-15 minutes
  • Tier 2: $0.15 per minute for files 16+ minutes

Example: 60-minute meeting = $9.00 total
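
The two tiers can be expressed as a short calculation. This sketch uses the rates stated above. The `price_usd` function is a hypothetical name, and rounding partial minutes up to the next whole minute is an assumption, since the billing granularity is not stated.

```python
import math
from decimal import Decimal

FLAT_RATE = Decimal("2.25")       # Tier 1: files of 1-15 minutes
PER_MINUTE = Decimal("0.15")      # Tier 2: files of 16+ minutes
FLAT_RATE_MAX_MINUTES = 15


def price_usd(duration_minutes):
    """Return the price in US dollars for a file of the given length."""
    # Assumption: partial minutes are billed as a full minute.
    billable = math.ceil(duration_minutes)
    if billable < 1:
        raise ValueError("duration must be positive")
    if billable <= FLAT_RATE_MAX_MINUTES:
        return FLAT_RATE
    return PER_MINUTE * billable


print(price_usd(60))  # 9.00, matching the 60-minute example
```

At exactly 15 minutes the flat rate equals the per-minute rate (15 × $0.15 = $2.25), so the price does not jump at the tier boundary.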

All four output formats (TXT, SRT, VTT, JSON) included with every transcription. Speaker identification runs automatically at no additional cost.

100% Satisfaction Guarantee

Try the 30-word preview before payment. If you're not satisfied with your transcript quality, contact support@brasstranscripts.com for a full refund—no questions asked. Learn more about our preview-before-purchase guarantee system.

Data Privacy and Security

  • Audio File Retention: Automatically deleted after 24 hours
  • Transcript Retention: Available for download for 48 hours, then deleted
  • No Personal Data Collection: Anonymous processing with industry-standard encryption

Frequently Asked Questions

What is speaker diarization?

Speaker diarization is the AI process of identifying who spoke when in an audio recording, which makes it possible to show who said what in the transcript. The system analyzes voice characteristics to distinguish between different speakers and assigns labels (Speaker 1, Speaker 2, etc.) to each spoken segment with timestamps.

Does speaker identification work automatically?

Yes. Every file uploaded to BrassTranscripts automatically receives speaker identification processing using Pyannote 3.1 technology. No configuration or manual tagging required—speaker labels appear in all output formats automatically.

How many speakers can the system identify?

The system can identify multiple speakers in a recording. For best results, we recommend 2-4 distinct speakers with clear audio quality. Learn more about multi-speaker transcription.

What if the speaker labels are wrong?

If speaker labels need correction after processing, you can use AI prompts to fix attribution errors. See our guide on correcting wrong speaker labels for detailed instructions.

Which file formats support speaker identification?

All 11 supported formats receive speaker identification: MP3, M4A, WAV, AAC, FLAC, OGG, Opus, WebM, MPGA, MP4, and MPEG. Speaker labels appear in all four output formats (TXT, SRT, VTT, JSON).

How accurate is WhisperX large-v3?

WhisperX large-v3 delivers professional-grade transcription accuracy for clear audio recordings. Accuracy depends on audio quality, background noise, accents, and speaker clarity. Use the 30-word preview to verify quality for your specific audio before payment.

Can BrassTranscripts be used for legal transcription?

Yes. Legal professionals use BrassTranscripts for depositions, witness interviews, and client consultation documentation. However, verify transcripts for accuracy before submitting them in legal proceedings. See our legal toolkit for deposition analysis.

What languages does WhisperX large-v3 support?

The system supports 99+ languages with automatic detection. Upload your file and the AI automatically identifies the language and transcribes accordingly—no manual language selection needed.

How long does processing take?

Processing completes in 1-3 minutes per hour of audio. Example: a 60-minute meeting processes in approximately 1-3 minutes total.
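
For other file lengths, the same rate gives a rough range. The sketch below assumes processing time scales linearly with audio length, which the stated rate suggests but does not guarantee. The `estimate_processing_minutes` function is a hypothetical name.

```python
def estimate_processing_minutes(audio_minutes, low_rate=1.0, high_rate=3.0):
    """Return a (low, high) estimate of processing time in minutes.

    low_rate and high_rate are processing minutes per hour of audio.
    """
    hours = audio_minutes / 60
    return (hours * low_rate, hours * high_rate)


low, high = estimate_processing_minutes(90)
print(f"A 90-minute file: roughly {low:.1f} to {high:.1f} minutes")
```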

Is there a satisfaction guarantee?

Yes. BrassTranscripts offers a 100% satisfaction guarantee with no-questions-asked refunds. Try the 30-word preview before payment, and if you're unsatisfied with results, contact support@brasstranscripts.com for a full refund.

Getting Started

Visit BrassTranscripts.com to upload your first file. No account required—simply upload, preview, and download.

For technical questions or enterprise volume inquiries, contact support@brasstranscripts.com.


About BrassTranscripts

BrassTranscripts provides professional AI-powered transcription services for businesses, researchers, content creators, and legal professionals. Built on WhisperX large-v3 with automatic speaker identification, the service processes audio and video files in 99+ languages with fast turnaround times and transparent pricing.
