Multi-Speaker Transcript Formats: SRT, VTT, JSON with Speaker Names
Multi-speaker transcripts come in different formats, each designed for specific use cases. Whether you need subtitles for video, data for software integration, or simple readable text, choosing the right format matters.
This guide covers the four most common transcript formats with speaker identification: TXT, SRT, VTT, and JSON.
Quick Navigation
- Why Transcript Format Matters
- TXT Format: Simple and Readable
- SRT Format: Video Subtitles
- VTT Format: Web Video Captions
- JSON Format: Structured Data for Software
- Choosing the Right Format
- Converting Between Formats
- Adding Speaker Labels to Existing Transcripts
- Frequently Asked Questions
Why Transcript Format Matters
Different Formats for Different Needs
You've transcribed your multi-speaker audio, but now you need to use that transcript for a specific purpose:
- Video subtitles? You need SRT or VTT format
- Podcast show notes? You need readable TXT format
- Software integration? You need structured JSON format
- Video editing? You need SRT with timecodes
Each format has specific structure, capabilities, and use cases.
What Makes Multi-Speaker Formats Different
Basic transcripts contain just the spoken text, sometimes with timestamps. Multi-speaker formats add speaker identification:
Basic transcript (no speakers):
Hello everyone, welcome to the meeting.
Thanks for having me.
Multi-speaker transcript:
[00:00:03] Speaker 0: Hello everyone, welcome to the meeting.
[00:00:07] Speaker 1: Thanks for having me.
This guide focuses on formats that support speaker labels.
TXT Format: Simple and Readable
What is TXT Format?
Plain text format designed for human readability. No special syntax or metadata - just text with timestamps and speaker labels.
Best for:
- Reading and reviewing transcripts
- Creating meeting notes
- Podcast show notes
- Blog post transcripts
- Email sharing
TXT Format Structure
Basic structure:
[HH:MM:SS] Speaker Label: Text spoken
[00:00:03] Speaker 0: Hello everyone, welcome to today's product meeting.
[00:00:07] Speaker 1: Thanks for having me. I'd like to start by discussing the Q4 roadmap.
[00:00:15] Speaker 0: Great, let's dive in. What are the top priorities?
[00:00:19] Speaker 1: The analytics dashboard redesign is our primary focus.
Key elements:
- Timestamp: [HH:MM:SS] format showing when speech begins
- Speaker label: Speaker 0:, Speaker 1:, or actual names
- Text: Spoken words transcribed
- Blank lines: Separate speaker turns for readability
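Because every line follows the same pattern, a TXT transcript is easy to read back into structured data. The sketch below is one possible approach, not a standard parser; the function name `parse_txt_line` is made up for illustration, and it assumes speaker labels never contain a colon.

```python
import re

# Matches lines like "[00:00:03] Speaker 0: Hello everyone"
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s*(.*)$")

def parse_txt_line(line):
    """Return (seconds, speaker, text), or None if the line doesn't match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    hours, minutes, secs, speaker, text = match.groups()
    seconds = int(hours) * 3600 + int(minutes) * 60 + int(secs)
    return seconds, speaker.strip(), text

sample = "[00:00:07] Speaker 1: Thanks for having me."
print(parse_txt_line(sample))  # (7, 'Speaker 1', 'Thanks for having me.')
```

Lines that don't match (blank lines, headings) come back as None, so a caller can skip them.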
TXT with Real Names
After identifying speakers, replace generic labels with real names:
[00:00:03] Sarah Martinez: Hello everyone, welcome to today's product meeting.
[00:00:07] Michael Chen: Thanks for having me. I'd like to start by discussing the Q4 roadmap.
[00:00:15] Sarah Martinez: Great, let's dive in. What are the top priorities?
[00:00:19] Michael Chen: The analytics dashboard redesign is our primary focus.
TXT Format Advantages
Pros:
- Universal compatibility (any text editor, email, messaging app)
- Human-readable without special software
- Small file size
- Easy to edit
- No formatting restrictions
Cons:
- No timing synchronization (can't sync with video/audio automatically)
- No styling options (bold, italic, color)
- Manual formatting required for different uses
When to Use TXT
Choose TXT format when:
- You need to read and review the content
- You're creating meeting notes or summaries
- You're sharing transcripts via email or messaging
- You're copying transcript sections into other documents
- You don't need video/audio synchronization
SRT Format: Video Subtitles
What is SRT Format?
SubRip Subtitle format (.srt) is the most widely supported subtitle format for video. Used by video players, editing software, and streaming platforms.
Best for:
- Video subtitles
- YouTube captions
- Video editing (Premiere, Final Cut, DaVinci Resolve)
- Social media videos (Instagram, Facebook, TikTok)
- Video players (VLC, QuickTime, Windows Media Player)
SRT Format Structure
Basic structure:
1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone, welcome to
today's product meeting.

2
00:00:07,000 --> 00:00:15,000
Speaker 1: Thanks for having me.
I'd like to start by discussing the Q4 roadmap.

3
00:00:15,000 --> 00:00:19,000
Speaker 0: Great, let's dive in.
What are the top priorities?

4
00:00:19,000 --> 00:00:25,000
Speaker 1: The analytics dashboard
redesign is our primary focus.
Key elements:
- Sequence number: 1, 2, 3 (each subtitle block numbered)
- Timing: HH:MM:SS,MS --> HH:MM:SS,MS (start time --> end time)
- Text: Subtitle text (typically 1-2 lines, including speaker label)
- Blank line: Separates each subtitle block
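The block structure above can be read back with a few lines of Python. This is a minimal sketch rather than a full SRT parser: `parse_srt` is a hypothetical helper, it ignores styling tags, and it assumes that any "Name: " prefix at the start of a cue is a speaker label.

```python
import re

TIMING_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_srt(content):
    """Parse SRT text into dicts with index, start, end, speaker, and text."""
    cues = []
    # Subtitle blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # not a complete block: number, timing, text
        match = TIMING_RE.match(lines[1].strip())
        if match is None:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        # Rejoin wrapped subtitle lines, then split off a "Name: " prefix.
        text = " ".join(line.strip() for line in lines[2:])
        speaker, sep, spoken = text.partition(": ")
        cues.append({
            "index": int(lines[0]),
            "start": h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
            "end": h2 * 3600 + m2 * 60 + s2 + ms2 / 1000,
            "speaker": speaker if sep else None,
            "text": spoken if sep else text,
        })
    return cues
```

A cue whose text happens to begin with something like "Note: " would be misread as a speaker, so treat the speaker field as a best guess.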
SRT with Speaker Styling
SRT supports basic styling tags for speaker differentiation:
Using color tags:
1
00:00:03,000 --> 00:00:07,000
<font color="#00FF00">Sarah:</font> Hello everyone,
welcome to today's product meeting.

2
00:00:07,000 --> 00:00:15,000
<font color="#00FFFF">Michael:</font> Thanks for having me.
I'd like to start by discussing the Q4 roadmap.
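Using position tags ({\an7} is an ASS/SSA-style override that pins the cue to the top-left; it is not part of SRT proper, so support depends on the player):
1
00:00:03,000 --> 00:00:07,000
{\an7}Sarah: Hello everyone,
welcome to today's product meeting.

2
00:00:07,000 --> 00:00:15,000
{\an7}Michael: Thanks for having me.
I'd like to start by discussing the Q4 roadmap.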
Styling support varies by player - test with your target platform.
SRT Format Advantages
Pros:
- Near-universal subtitle support (most players, editors, and platforms accept it)
- Synchronizes automatically with video
- Supported by all major video editing software
- Accepted by YouTube, Vimeo, and streaming platforms
- Simple text-based format (easy to edit)
Cons:
- Limited styling options
- Must manually break long sentences into readable chunks
- Timing must be precise for good viewing experience
- No metadata storage (language, author, etc.)
When to Use SRT
Choose SRT format when:
- You're adding subtitles to video content
- You're creating YouTube captions
- You're working with video editing software
- You need universal video player compatibility
- You're creating social media videos with captions
VTT Format: Web Video Captions
What is VTT Format?
WebVTT (Web Video Text Tracks, .vtt) is the modern web standard for video captions. Similar to SRT but with more features and better web browser support.
Best for:
- HTML5 video players
- Web-based video platforms
- Accessibility compliance (WCAG, Section 508)
- Interactive video experiences
- Modern streaming applications
VTT Format Structure
Basic structure:
WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Speaker 0>Hello everyone, welcome to today's product meeting.

2
00:00:07.000 --> 00:00:15.000
<v Speaker 1>Thanks for having me. I'd like to start by discussing the Q4 roadmap.

3
00:00:15.000 --> 00:00:19.000
<v Speaker 0>Great, let's dive in. What are the top priorities?

4
00:00:19.000 --> 00:00:25.000
<v Speaker 1>The analytics dashboard redesign is our primary focus.
Key elements:
- Header: WEBVTT (required first line)
- Cue identifier: 1, 2, 3 (optional but helpful)
- Timing: HH:MM:SS.MS --> HH:MM:SS.MS (periods instead of commas)
- Voice tags: <v Speaker Name> (identifies speakers)
- Text: Caption text
- Blank line: Separates cues
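To see how these elements fit together, here is a small sketch that writes a WebVTT document from a list of segments. The helper names (`vtt_timestamp`, `build_vtt`) and the tuple layout are assumptions made for this example. It does not escape special characters such as & or < in the caption text, which a production writer would need to handle.

```python
def vtt_timestamp(seconds):
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a period before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"

def build_vtt(segments):
    """Render (speaker, start, end, text) tuples as a WebVTT document."""
    parts = ["WEBVTT"]  # required first line
    for number, (speaker, start, end, text) in enumerate(segments, start=1):
        parts.append(
            f"{number}\n"
            f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n"
            f"<v {speaker}>{text}"
        )
    # A blank line separates the header and each cue.
    return "\n\n".join(parts) + "\n"

print(build_vtt([
    ("Speaker 0", 3.0, 7.0, "Hello everyone, welcome to today's product meeting."),
    ("Speaker 1", 7.0, 15.0, "Thanks for having me."),
]))
```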
VTT with Named Speakers
Using voice tags for speaker identification:
WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Sarah Martinez>Hello everyone, welcome to today's product meeting.

2
00:00:07.000 --> 00:00:15.000
<v Michael Chen>Thanks for having me. I'd like to start by discussing the Q4 roadmap.

3
00:00:15.000 --> 00:00:19.000
<v Sarah Martinez>Great, let's dive in. What are the top priorities?
Browser rendering: Voice tags expose the speaker name to CSS, so an HTML5 player can give each speaker a distinct color with ::cue(v[voice="..."]) rules. Browsers don't assign colors on their own.
VTT Advanced Features
Styling classes:
WEBVTT

STYLE
::cue(.sarah) { color: cyan; }
::cue(.michael) { color: yellow; }

1
00:00:03.000 --> 00:00:07.000
<v.sarah Sarah>Hello everyone, welcome to today's product meeting.

2
00:00:07.000 --> 00:00:15.000
<v.michael Michael>Thanks for having me.
Positioning:
1
00:00:03.000 --> 00:00:07.000 align:start position:10%
<v Sarah>Hello everyone, welcome to today's product meeting.
Metadata:
WEBVTT

NOTE
This transcript was created on 2025-01-15
Speakers: Sarah Martinez, Michael Chen

1
00:00:03.000 --> 00:00:07.000
<v Sarah>Hello everyone...
VTT Format Advantages
Pros:
- Modern web standard supported by all current browsers
- Built-in speaker identification (<v> tags)
- Advanced styling capabilities
- Metadata support
- Accessibility features (language tags, audio descriptions)
- UTF-8 encoding required by the spec (SRT has no defined encoding)
Cons:
- Less universal than SRT (some older software doesn't support it)
- More complex syntax
- Requires web browser or modern video player
When to Use VTT
Choose VTT format when:
- You're creating web-based video content
- You need accessibility compliance
- You want to style speakers differently
- You're building interactive video experiences
- You're using HTML5 video players
- You need metadata in the subtitle file
JSON Format: Structured Data for Software
What is JSON Format?
JavaScript Object Notation (.json) is a structured data format for software integration. Ideal for developers building applications, data analysis, or custom processing.
Best for:
- Software integration and APIs
- Data analysis and processing
- Custom application development
- Search and indexing systems
- Machine learning training data
- Archival with rich metadata
JSON Format Structure
Basic multi-speaker structure:
{
  "metadata": {
    "duration": 125.5,
    "language": "en",
    "speakers": ["Speaker 0", "Speaker 1"],
    "created": "2025-01-15T10:30:00Z"
  },
  "segments": [
    {
      "id": 1,
      "speaker": "Speaker 0",
      "start": 3.0,
      "end": 7.0,
      "text": "Hello everyone, welcome to today's product meeting."
    },
    {
      "id": 2,
      "speaker": "Speaker 1",
      "start": 7.0,
      "end": 15.0,
      "text": "Thanks for having me. I'd like to start by discussing the Q4 roadmap."
    },
    {
      "id": 3,
      "speaker": "Speaker 0",
      "start": 15.0,
      "end": 19.0,
      "text": "Great, let's dive in. What are the top priorities?"
    },
    {
      "id": 4,
      "speaker": "Speaker 1",
      "start": 19.0,
      "end": 25.0,
      "text": "The analytics dashboard redesign is our primary focus."
    }
  ]
}
Key elements:
- metadata: Overall file information
  - duration: Total audio length in seconds
  - language: Language code
  - speakers: Array of speaker labels
  - created: Timestamp
- segments: Array of speech segments
  - id: Unique identifier for segment
  - speaker: Speaker label
  - start: Start time in seconds
  - end: End time in seconds
  - text: Transcribed text
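Once the transcript is structured like this, simple analysis takes only a few lines. As an illustration, the sketch below totals speaking time per speaker. It assumes the field names shown above (`segments`, `speaker`, `start`, `end`), which vary by service, and `talk_time_by_speaker` is a name invented for this example.

```python
import json
from collections import defaultdict

def talk_time_by_speaker(transcript):
    """Sum segment durations (in seconds) per speaker label."""
    totals = defaultdict(float)
    for segment in transcript["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)

# Inline sample in the same shape as the structure above.
raw = """
{
  "segments": [
    {"id": 1, "speaker": "Speaker 0", "start": 3.0, "end": 7.0, "text": "Hello everyone."},
    {"id": 2, "speaker": "Speaker 1", "start": 7.0, "end": 15.0, "text": "Thanks for having me."},
    {"id": 3, "speaker": "Speaker 0", "start": 15.0, "end": 19.0, "text": "Great, let's dive in."}
  ]
}
"""
print(talk_time_by_speaker(json.loads(raw)))  # {'Speaker 0': 8.0, 'Speaker 1': 8.0}
```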
Extended JSON with Rich Metadata
Full-featured structure:
{
  "metadata": {
    "version": "1.0",
    "duration": 125.5,
    "language": "en",
    "source_file": "meeting_2025-01-15.mp3",
    "transcription_service": "BrassTranscripts",
    "transcription_date": "2025-01-15T10:30:00Z",
    "speakers": [
      {
        "id": "Speaker 0",
        "name": "Sarah Martinez",
        "role": "Product Manager"
      },
      {
        "id": "Speaker 1",
        "name": "Michael Chen",
        "role": "Engineering Manager"
      }
    ]
  },
  "segments": [
    {
      "id": 1,
      "speaker_id": "Speaker 0",
      "speaker_name": "Sarah Martinez",
      "start": 3.0,
      "end": 7.0,
      "text": "Hello everyone, welcome to today's product meeting.",
      "confidence": 0.94,
      "words": [
        {"word": "Hello", "start": 3.0, "end": 3.2, "confidence": 0.98},
        {"word": "everyone", "start": 3.2, "end": 3.6, "confidence": 0.96},
        {"word": "welcome", "start": 3.7, "end": 4.1, "confidence": 0.95}
      ]
    }
  ]
}
Additional fields:
- confidence: Transcription confidence score (0.0-1.0)
- words: Word-level timestamps and confidence
- speaker_name: Actual names after identification
- Rich metadata for archival and analysis
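Word-level confidence is useful for review: it points a proofreader at the words most likely to be wrong. The sketch below lists words under a chosen threshold. It assumes the `words` and `confidence` fields shown above; the 0.9 threshold and the sample values are made up for illustration.

```python
def low_confidence_words(transcript, threshold=0.9):
    """Yield (start, word, confidence) for words scored below the threshold."""
    for segment in transcript["segments"]:
        # "words" is optional; not every service emits word-level data.
        for word in segment.get("words", []):
            if word["confidence"] < threshold:
                yield word["start"], word["word"], word["confidence"]

# Illustrative sample; the 0.72 score is invented to produce a hit.
sample = {
    "segments": [
        {
            "speaker_name": "Sarah Martinez",
            "words": [
                {"word": "Hello", "start": 3.0, "end": 3.2, "confidence": 0.98},
                {"word": "everyone", "start": 3.2, "end": 3.6, "confidence": 0.72},
            ],
        }
    ]
}
for start, word, confidence in low_confidence_words(sample):
    print(f"{start:.1f}s  {word}  ({confidence:.2f})")
```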
JSON Format Advantages
Pros:
- Structured data for programmatic access
- Store rich metadata
- Easy to parse in any programming language
- Supports nested data structures
- Word-level timestamps and confidence scores
- Searchable and indexable
- Version control friendly (git diff works well)
Cons:
- Harder to read than TXT (raw JSON is plain text, but the nesting and punctuation get in the way)
- Larger file size than TXT/SRT
- No standardized schema (varies by service)
- Requires programming knowledge for custom processing
When to Use JSON
Choose JSON format when:
- You're building custom applications
- You need programmatic access to transcript data
- You're performing data analysis or research
- You're integrating with APIs or databases
- You need word-level timestamps
- You're archiving with rich metadata
- You're training machine learning models
Choosing the Right Format
Decision Matrix
| Use Case | Best Format | Why |
|---|---|---|
| Reading transcript | TXT | Simple, readable, universal |
| Video subtitles | SRT | Universal video player support |
| Web video captions | VTT | Modern web standard, styling |
| YouTube upload | SRT or VTT | Both supported |
| Video editing | SRT | Adobe, Final Cut, DaVinci support |
| Podcast show notes | TXT | Readable text for blog posts |
| Software integration | JSON | Structured data, APIs |
| Data analysis | JSON | Programmatic access, metadata |
| Accessibility compliance | VTT | Native caption format for HTML5 video; helps meet WCAG captioning requirements |
| Social media videos | SRT | Instagram, Facebook, TikTok |
Multiple Format Strategy
Best practice: Keep all formats.
Most professional transcription services (including BrassTranscripts) provide all formats simultaneously:
- TXT for reading
- SRT for video subtitles
- VTT for web captions
- JSON for software integration
No need to choose - use the format that fits each specific need.
Converting Between Formats
Manual Conversion
Simple formats can be converted manually:
TXT to SRT
Process:
- Add sequence numbers (1, 2, 3...)
- Convert timestamps: [00:00:03] → 00:00:03,000 --> 00:00:07,000 (TXT has no end time, so use the next line's start time)
- Add blank lines between blocks
- Break long text into 1-2 line chunks
Before (TXT):
[00:00:03] Speaker 0: Hello everyone, welcome to today's product meeting.
After (SRT):
1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone, welcome to
today's product meeting.
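The same steps can be scripted. The following is a rough sketch, not a complete converter: `txt_to_srt` is a hypothetical helper, each cue ends where the next one starts, the last cue gets an assumed 4-second duration (TXT carries no end times), and the 42-character line width is an arbitrary choice.

```python
import re
import textwrap

LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+(.*)$")

def srt_timestamp(seconds):
    """Format whole seconds as an SRT timestamp (HH:MM:SS,000)."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"

def txt_to_srt(txt, last_duration=4, width=42):
    """Convert '[HH:MM:SS] Speaker: text' lines to SRT blocks."""
    entries = []
    for line in txt.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            h, m, s, rest = match.groups()
            entries.append((int(h) * 3600 + int(m) * 60 + int(s), rest))
    blocks = []
    for i, (start, rest) in enumerate(entries):
        # End at the next cue's start; the final cue uses last_duration.
        end = entries[i + 1][0] if i + 1 < len(entries) else start + last_duration
        wrapped = "\n".join(textwrap.wrap(rest, width=width))
        blocks.append(f"{i + 1}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{wrapped}")
    return "\n\n".join(blocks) + "\n"
```

Long turns will wrap to more than two lines here; splitting them into several timed cues needs word-level timing that TXT does not provide.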
SRT to VTT
Process:
- Add WEBVTT header at top
- Change comma to period in timestamps: 00:00:03,000 → 00:00:03.000
- Replace speaker labels with voice tags: Speaker 0: → <v Speaker 0>
Before (SRT):
1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone.
After (VTT):
WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Speaker 0>Hello everyone.
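These three steps are mechanical enough to script. The sketch below is one way to do it with regular expressions; `srt_to_vtt` is a name chosen for this example, and treating a leading "Name: " (up to 40 characters) on the first text line as the speaker is a heuristic, not something SRT guarantees.

```python
import re

TIMING_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})")
SPEAKER_PREFIX = re.compile(r"^([^:\n]{1,40}): ")

def srt_to_vtt(srt):
    """Convert SRT text to WebVTT, turning 'Name: ' prefixes into <v Name> tags."""
    out = ["WEBVTT", ""]  # header, then a blank line before the first cue
    after_timing = False
    for line in srt.strip().splitlines():
        if TIMING_LINE.match(line):
            # Commas become periods in the two timestamps only.
            line = TIMING_LINE.sub(r"\1.\2 --> \3.\4", line)
            after_timing = True
        elif after_timing:
            # First text line of the cue: treat a leading "Name: " as the speaker.
            line = SPEAKER_PREFIX.sub(r"<v \1>", line, count=1)
            after_timing = False
        out.append(line)
    return "\n".join(out) + "\n"
```

Commas inside the caption text are left alone because only the timing line is rewritten.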
Automated Conversion Tools
Converters (desktop and online):
- Subtitle Edit (free, Windows/Mac/Linux)
- Rev.com Subtitle Converter
- Kapwing Subtitle Converter
- HandBrake (for video + subtitles)
Command-line tools:
- FFmpeg (universal media processing)
- ccextractor (subtitle extraction)
- pysrt (Python library for subtitle manipulation)
Programming libraries:
- Python: pysrt, webvtt-py
- JavaScript: subtitle npm package
- Ruby: webvtt-ruby gem
FFmpeg Conversion Examples
Extract subtitles from video:
ffmpeg -i input_video.mp4 -map 0:s:0 subtitles.srt
Convert SRT to VTT:
ffmpeg -i subtitles.srt subtitles.vtt
Burn subtitles into video:
ffmpeg -i video.mp4 -vf subtitles=subs.srt output.mp4
Adding Speaker Labels to Existing Transcripts
You Have: Basic Transcript Without Speakers
Problem: Your transcript has text and timestamps but no speaker labels.
Solution: Add speaker identification:
Option 1: Re-Transcribe with Speaker Diarization
Fastest approach:
- Use transcription service with speaker diarization
- Upload original audio
- Get transcript with speaker labels in all formats
- Typical cost: $0.15/minute
Time: 5-10 minutes total
Option 2: Manually Add Labels
Process:
- Listen to audio with transcript open
- Identify speaker changes
- Add labels before each speaker's text
Time: 4-8 hours per hour of audio
Reality check: Re-transcribing is almost always faster and more cost-effective than manual labeling.
You Have: Generic Labels ("Speaker 0, 1, 2")
Problem: Transcript has speaker labels but they're generic.
Solution: Replace with real names:
- Identify speakers - See Who Said What? How to Get Speaker Names in Transcripts
- Find and replace:
- Find: "Speaker 0"
- Replace: "Sarah Martinez"
- Replace all
- Repeat for each speaker
Time: 5-15 minutes
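With ten or more speakers, plain find and replace has a pitfall: replacing "Speaker 1" also rewrites the start of "Speaker 10". A small script avoids this. The sketch below is one possible approach; `rename_speakers` and the name mapping are illustrative.

```python
import re

def rename_speakers(text, names):
    """Replace generic labels with real names using whole-label matches."""
    if not names:
        return text
    # The (?!\d) guard stops "Speaker 1" from matching inside "Speaker 10".
    labels = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(label) + r"(?!\d)" for label in labels))
    return pattern.sub(lambda match: names[match.group(0)], text)

names = {"Speaker 0": "Sarah Martinez", "Speaker 1": "Michael Chen"}
transcript = "[00:00:03] Speaker 0: Hello everyone.\n[00:00:07] Speaker 1: Thanks for having me."
print(rename_speakers(transcript, names))
```

The same function works on TXT, SRT, and VTT text, since it only touches the label strings.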
Frequently Asked Questions
Which format is best for YouTube captions?
Both SRT and VTT work for YouTube.
SRT is more common and slightly simpler:
- More tools support SRT export
- Easier to edit manually
- Standard format for most video editing software
VTT offers more features:
- Better styling options
- Native web format
- Easier to add metadata
Recommendation: Use SRT unless you specifically need VTT features.
Can I include speaker names in video subtitles?
Yes, include speaker labels in the subtitle text:
SRT format:
1
00:00:03,000 --> 00:00:07,000
Sarah: Hello everyone, welcome to
today's product meeting.
VTT format with voice tags:
WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Sarah>Hello everyone, welcome to today's product meeting.
Web players can style each speaker differently based on VTT voice tags, provided the page or file supplies matching CSS rules.
How do I edit speaker labels in subtitle files?
Simple edits: Use text editor
- Open .srt or .vtt file in any text editor
- Find speaker labels (e.g., "Speaker 0:")
- Replace with real names
- Save file
Complex edits: Use subtitle software
- Subtitle Edit (free, powerful)
- Aegisub (free, advanced styling)
- Adobe Premiere Pro (professional, paid)
What's the difference between SRT and VTT?
Technical differences:
| Feature | SRT | VTT |
|---|---|---|
| File extension | .srt | .vtt |
| Header | None | Required "WEBVTT" |
| Timestamp format | HH:MM:SS,MS | HH:MM:SS.MS |
| Speaker tags | No | Yes (<v> tags) |
| Styling | Limited | Extensive (CSS-like) |
| Metadata | No | Yes |
| Web standard | No | Yes (W3C) |
| Video player support | Universal | Modern players |
Practical differences:
- SRT: Universal compatibility, simpler
- VTT: Better for web, more features
Can JSON transcripts be converted to SRT?
Yes, JSON contains all necessary data (timestamps, text, speakers).
Python example:
import json

def format_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Round to whole milliseconds first to avoid float truncation errors
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Load JSON transcript
with open('transcript.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to SRT: one numbered block per segment
blocks = []
for i, segment in enumerate(data['segments'], start=1):
    start = format_timestamp(segment['start'])
    end = format_timestamp(segment['end'])
    blocks.append(f"{i}\n{start} --> {end}\n{segment['speaker']}: {segment['text']}\n")

# Save SRT file (a blank line separates blocks)
with open('transcript.srt', 'w', encoding='utf-8') as f:
    f.write("\n".join(blocks))
Many transcription services provide conversion tools so you don't need to code this yourself.
Do all transcription services provide multiple formats?
No, format availability varies by service:
Single format services:
- Some basic services provide TXT only
- May require manual conversion to SRT/VTT
Multiple format services:
- Professional services provide TXT, SRT, VTT, JSON
- BrassTranscripts provides all 4 formats automatically
- Otter.ai, Rev, Descript provide multiple formats
Check before choosing a service:
- What formats are included?
- Are all formats available for multi-speaker transcripts?
- Is there an extra charge for specific formats?
How do I handle very long speaker names in subtitles?
Long names reduce space for actual text in subtitles.
Solutions:
1. Use shortened names:
- "Sarah Martinez" → "Sarah"
- "Dr. Jennifer Thompson" → "Dr. Thompson"
2. Use initials:
- "Sarah Martinez" → "SM:"
- "Michael Chen" → "MC:"
3. Use roles:
- "Product Manager:"
- "Engineering Manager:"
4. Omit names in burned-in subtitles, use full names in transcript file:
- On-screen: No names, just text
- Downloadable file: Full names for reference
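The first two options are easy to automate when generating captions. The sketch below is illustrative only: `short_label` and the `TITLES` set are assumptions for this example, and the title list is far from complete.

```python
TITLES = {"Dr.", "Prof.", "Mr.", "Ms.", "Mrs."}  # illustrative, not exhaustive

def short_label(full_name, style="short"):
    """Shorten a speaker name for on-screen captions."""
    words = full_name.split()
    if not words:
        return full_name
    if style == "initials":
        # First letter of each word, skipping titles: "Sarah Martinez" -> "SM"
        return "".join(word[0].upper() for word in words if word not in TITLES)
    if words[0] in TITLES and len(words) > 1:
        # Keep the title with the surname: "Dr. Jennifer Thompson" -> "Dr. Thompson"
        return f"{words[0]} {words[-1]}"
    return words[0]  # "Sarah Martinez" -> "Sarah"

print(short_label("Sarah Martinez"))                    # Sarah
print(short_label("Dr. Jennifer Thompson"))             # Dr. Thompson
print(short_label("Michael Chen", style="initials"))    # MC
```

If two speakers share a first name or the same initials, the shortened labels collide, so check for duplicates before applying them.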
Can speaker labels be color-coded in video subtitles?
Yes, but implementation varies by platform:
SRT with color tags:
1
00:00:03,000 --> 00:00:07,000
<font color="#00FF00">Sarah:</font> Hello everyone.

2
00:00:07,000 --> 00:00:15,000
<font color="#00FFFF">Michael:</font> Thanks for having me.
Support varies:
- Some video players honor color tags
- Others ignore them
- YouTube doesn't support color in uploaded captions
VTT with CSS styling:
WEBVTT

STYLE
::cue(v[voice="Sarah"]) { color: cyan; }
::cue(v[voice="Michael"]) { color: yellow; }

1
00:00:03.000 --> 00:00:07.000
<v Sarah>Hello everyone.
Better support in:
- HTML5 video players
- Custom web video platforms
- Modern browsers
Limitation: YouTube and most social platforms don't support custom styling.
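For transcripts with many speakers, the VTT STYLE block can be generated rather than typed by hand. This is a minimal sketch; `vtt_style_block` and the color choices are made up for the example, and it assumes speaker names contain no double quotes.

```python
def vtt_style_block(colors):
    """Build a WebVTT STYLE block with one ::cue rule per speaker voice."""
    rules = [
        f'::cue(v[voice="{speaker}"]) {{ color: {color}; }}'
        for speaker, color in colors.items()
    ]
    return "STYLE\n" + "\n".join(rules)

print(vtt_style_block({"Sarah": "cyan", "Michael": "yellow"}))
```

The result goes after the WEBVTT header and before the first cue, with a blank line on either side.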
Conclusion
Multi-speaker transcript formats serve different purposes: TXT for reading, SRT for video subtitles, VTT for web captions, and JSON for software integration.
Key takeaways:
- TXT is for humans - Simple, readable, universal sharing
- SRT is for video - Universal subtitle format for video players and editing software
- VTT is for web - Modern web standard with speaker tagging and styling
- JSON is for software - Structured data for integration and analysis
- Keep all formats - Professional services provide all formats simultaneously
Quick reference guide:
- Adding subtitles to video? → SRT
- Creating YouTube captions? → SRT or VTT
- Web-based video player? → VTT
- Reading transcript? → TXT
- Building software integration? → JSON
- Podcast show notes? → TXT
- Data analysis? → JSON
Best practice: Use a transcription service that provides all formats automatically, then use whichever format fits each specific need.
For multi-speaker transcription with all formats included (TXT, SRT, VTT, JSON), visit BrassTranscripts - automatic speaker identification and all formats delivered simultaneously.
Related Guides:
- How to Transcribe Multiple Speakers [Complete Guide] - Complete guide to multi-speaker transcription methods
- Who Said What? How to Get Speaker Names in Transcripts - Identifying speakers and assigning real names
- Speaker Labels Wrong? How to Fix Transcript Speaker Errors - Troubleshooting speaker identification issues
- Speaker Identification Complete Guide - Comprehensive guide with AI prompts