17 min read · BrassTranscripts Team

Multi-Speaker Transcript Formats: SRT, VTT, JSON with Speaker Names

Multi-speaker transcripts come in different formats, each designed for specific use cases. Whether you need subtitles for video, data for software integration, or simple readable text, choosing the right format matters.

This guide covers the four most common transcript formats with speaker identification: TXT, SRT, VTT, and JSON.

Why Transcript Format Matters

Different Formats for Different Needs

You've transcribed your multi-speaker audio, but now you need to use that transcript for a specific purpose:

  • Video subtitles? You need SRT or VTT format
  • Podcast show notes? You need readable TXT format
  • Software integration? You need structured JSON format
  • Video editing? You need SRT with timecodes

Each format has specific structure, capabilities, and use cases.

What Makes Multi-Speaker Formats Different

Basic transcripts contain only the spoken text, sometimes with timestamps. Multi-speaker formats add speaker identification:

Basic transcript (no speakers):

Hello everyone, welcome to the meeting.
Thanks for having me.

Multi-speaker transcript:

[00:00:03] Speaker 0: Hello everyone, welcome to the meeting.
[00:00:07] Speaker 1: Thanks for having me.

This guide focuses on formats that support speaker labels.

TXT Format: Simple and Readable

What is TXT Format?

Plain text format designed for human readability. No special syntax or metadata - just text with timestamps and speaker labels.

Best for:

  • Reading and reviewing transcripts
  • Creating meeting notes
  • Podcast show notes
  • Blog post transcripts
  • Email sharing

TXT Format Structure

Basic structure:

[HH:MM:SS] Speaker Label: Text spoken

[00:00:03] Speaker 0: Hello everyone, welcome to today's product meeting.

[00:00:07] Speaker 1: Thanks for having me. I'd like to start by discussing the Q4 roadmap.

[00:00:15] Speaker 0: Great, let's dive in. What are the top priorities?

[00:00:19] Speaker 1: The analytics dashboard redesign is our primary focus.

Key elements:

  • Timestamp: [HH:MM:SS] format showing when speech begins
  • Speaker label: Speaker 0:, Speaker 1:, or actual names
  • Text: Spoken words transcribed
  • Blank lines: Separate speaker turns for readability
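Because every line follows the same pattern, this layout is also easy to read with a script. The sketch below is one possible way to do it; the regular expression and the `parse_txt_line` name are assumptions made for this example, based only on the structure shown above.

```python
import re

# Matches "[HH:MM:SS] Speaker Label: Text spoken" (layout assumed from the example above).
LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s*(.*)$")

def parse_txt_line(line):
    """Return (start_seconds, speaker, text), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    hours, minutes, seconds, speaker, text = match.groups()
    start = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    return start, speaker.strip(), text

sample = "[00:00:07] Speaker 1: Thanks for having me."
print(parse_txt_line(sample))
```

The speaker label is taken as everything up to the first colon after the timestamp, so colons inside the spoken text are left alone.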

TXT with Real Names

After identifying speakers, replace generic labels with real names:

[00:00:03] Sarah Martinez: Hello everyone, welcome to today's product meeting.

[00:00:07] Michael Chen: Thanks for having me. I'd like to start by discussing the Q4 roadmap.

[00:00:15] Sarah Martinez: Great, let's dive in. What are the top priorities?

[00:00:19] Michael Chen: The analytics dashboard redesign is our primary focus.

TXT Format Advantages

Pros:

  • Universal compatibility (any text editor, email, messaging app)
  • Human-readable without special software
  • Small file size
  • Easy to edit
  • No formatting restrictions

Cons:

  • No timing synchronization (can't sync with video/audio automatically)
  • No styling options (bold, italic, color)
  • Manual formatting required for different uses

When to Use TXT

Choose TXT format when:

  • You need to read and review the content
  • You're creating meeting notes or summaries
  • You're sharing transcripts via email or messaging
  • You're copying transcript sections into other documents
  • You don't need video/audio synchronization

SRT Format: Video Subtitles

What is SRT Format?

SubRip Subtitle format (.srt) is the most widely supported subtitle format for video. Used by video players, editing software, and streaming platforms.

Best for:

  • Video subtitles
  • YouTube captions
  • Video editing (Premiere, Final Cut, DaVinci Resolve)
  • Social media videos (Instagram, Facebook, TikTok)
  • Video players (VLC, QuickTime, Windows Media Player)

SRT Format Structure

Basic structure:

1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone, welcome to
today's product meeting.

2
00:00:07,000 --> 00:00:15,000
Speaker 1: Thanks for having me.
I'd like to start by discussing the Q4 roadmap.

3
00:00:15,000 --> 00:00:19,000
Speaker 0: Great, let's dive in.
What are the top priorities?

4
00:00:19,000 --> 00:00:25,000
Speaker 1: The analytics dashboard
redesign is our primary focus.

Key elements:

  1. Sequence number: 1, 2, 3 (each subtitle block numbered)
  2. Timing: HH:MM:SS,MS --> HH:MM:SS,MS (start time --> end time)
  3. Text: Subtitle text (typically 1-2 lines, including speaker label)
  4. Blank line: Separates each subtitle block
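To make the four elements concrete, here is a minimal sketch of reading SRT blocks in Python. `parse_srt` is a hypothetical helper written for this guide, and it assumes well-formed blocks like the ones above (real-world files can be messier).

```python
import re

TIMING_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_srt(content):
    """Split SRT text into dicts with index, start/end in seconds, and text."""
    cues = []
    # Blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        match = TIMING_RE.match(lines[1]) if len(lines) >= 3 else None
        if match is None or not lines[0].strip().isdigit():
            continue  # skip blocks that do not follow the four-part layout
        h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
        cues.append({
            "index": int(lines[0]),
            "start": h1 * 3600 + m1 * 60 + s1 + ms1 / 1000,
            "end": h2 * 3600 + m2 * 60 + s2 + ms2 / 1000,
            "text": " ".join(lines[2:]),  # rejoin the wrapped subtitle lines
        })
    return cues

sample = """1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone, welcome to
today's product meeting.

2
00:00:07,000 --> 00:00:15,000
Speaker 1: Thanks for having me.
"""
for cue in parse_srt(sample):
    print(cue["index"], cue["start"], cue["end"], cue["text"])
```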

SRT with Speaker Styling

SRT has no formal specification, but many players honor a small set of HTML-like styling tags that can help differentiate speakers:

Using color tags:

1
00:00:03,000 --> 00:00:07,000
<font color="#00FF00">Sarah:</font> Hello everyone,
welcome to today's product meeting.

2
00:00:07,000 --> 00:00:15,000
<font color="#00FFFF">Michael:</font> Thanks for having me.
I'd like to start by discussing the Q4 roadmap.

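Using position tags ({\an7} is an ASS/SSA-style alignment code for top-left that some players also accept in SRT files):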

1
00:00:03,000 --> 00:00:07,000
{\an7}Sarah: Hello everyone,
welcome to today's product meeting.

2
00:00:07,000 --> 00:00:15,000
{\an7}Michael: Thanks for having me.
I'd like to start by discussing the Q4 roadmap.

Styling support varies by player - test with your target platform.

SRT Format Advantages

Pros:

  • Near-universal subtitle support (most players, editors, and platforms)
  • Synchronizes automatically with video
  • Supported by all major video editing software
  • Accepted by YouTube, Vimeo, and streaming platforms
  • Simple text-based format (easy to edit)

Cons:

  • Limited styling options
  • Must manually break long sentences into readable chunks
  • Timing must be precise for good viewing experience
  • No metadata storage (language, author, etc.)

When to Use SRT

Choose SRT format when:

  • Adding subtitles to video content
  • Creating YouTube captions
  • Working with video editing software
  • Need universal video player compatibility
  • Creating social media videos with captions

VTT Format: Web Video Captions

What is VTT Format?

WebVTT (Web Video Text Tracks, .vtt) is the modern web standard for video captions. Similar to SRT but with more features and better web browser support.

Best for:

  • HTML5 video players
  • Web-based video platforms
  • Accessibility compliance (WCAG, Section 508)
  • Interactive video experiences
  • Modern streaming applications

VTT Format Structure

Basic structure:

WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Speaker 0>Hello everyone, welcome to today's product meeting.

2
00:00:07.000 --> 00:00:15.000
<v Speaker 1>Thanks for having me. I'd like to start by discussing the Q4 roadmap.

3
00:00:15.000 --> 00:00:19.000
<v Speaker 0>Great, let's dive in. What are the top priorities?

4
00:00:19.000 --> 00:00:25.000
<v Speaker 1>The analytics dashboard redesign is our primary focus.

Key elements:

  1. Header: WEBVTT (required first line)
  2. Cue identifier: 1, 2, 3 (optional but helpful)
  3. Timing: HH:MM:SS.MS --> HH:MM:SS.MS (periods instead of commas)
  4. Voice tags: <v Speaker Name> (identifies speakers)
  5. Text: Caption text
  6. Blank line: Separates cues
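Voice tags are what make VTT speaker labels machine-readable. As a rough sketch, the snippet below pulls speaker names out of cue text with a regular expression; `voices_in_cue` and the pattern are assumptions for illustration, not a full WebVTT parser (text containing other inline tags such as `<i>` would be cut short).

```python
import re

# <v Name> or <v.class Name>; the closing </v> is optional when a cue has one speaker.
VOICE_RE = re.compile(r"<v(?:\.[^\s>]+)?\s+([^>]+)>([^<]*)")

def voices_in_cue(cue_text):
    """Return a list of (speaker, text) pairs found in one cue's text."""
    return [(name.strip(), text.strip()) for name, text in VOICE_RE.findall(cue_text)]

print(voices_in_cue("<v Sarah Martinez>Hello everyone, welcome to today's product meeting."))
```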

VTT with Named Speakers

Using voice tags for speaker identification:

WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Sarah Martinez>Hello everyone, welcome to today's product meeting.

2
00:00:07.000 --> 00:00:15.000
<v Michael Chen>Thanks for having me. I'd like to start by discussing the Q4 roadmap.

3
00:00:15.000 --> 00:00:19.000
<v Sarah Martinez>Great, let's dive in. What are the top priorities?

Browser rendering: Voice tags let HTML5 players style each speaker differently, for example with a distinct color per voice, though the styles usually have to be defined with CSS rather than being applied automatically.

VTT Advanced Features

Styling classes:

WEBVTT

STYLE
::cue(.sarah) { color: cyan; }
::cue(.michael) { color: yellow; }

1
00:00:03.000 --> 00:00:07.000
<v.sarah Sarah>Hello everyone, welcome to today's product meeting.

2
00:00:07.000 --> 00:00:15.000
<v.michael Michael>Thanks for having me.

Positioning:

1
00:00:03.000 --> 00:00:07.000 align:start position:10%
<v Sarah>Hello everyone, welcome to today's product meeting.

Metadata (stored in NOTE comment blocks, which players ignore):

WEBVTT

NOTE
This transcript was created on 2025-01-15
Speakers: Sarah Martinez, Michael Chen

1
00:00:03.000 --> 00:00:07.000
<v Sarah>Hello everyone...

VTT Format Advantages

Pros:

  • Modern web standard supported by all major browsers
  • Built-in speaker identification (<v> tags)
  • Advanced styling capabilities
  • Metadata support
  • Accessibility features (language tags, audio descriptions)
  • Consistent Unicode handling (WebVTT files must be UTF-8; SRT has no defined encoding)

Cons:

  • Less universal than SRT (some older software doesn't support it)
  • More complex syntax
  • Requires web browser or modern video player

When to Use VTT

Choose VTT format when:

  • Creating web-based video content
  • Need accessibility compliance
  • Want to style speakers differently
  • Building interactive video experiences
  • Using HTML5 video players
  • Need metadata in subtitle file

JSON Format: Structured Data for Software

What is JSON Format?

JavaScript Object Notation (.json) is a structured data format for software integration. Ideal for developers building applications, data analysis, or custom processing.

Best for:

  • Software integration and APIs
  • Data analysis and processing
  • Custom application development
  • Search and indexing systems
  • Machine learning training data
  • Archival with rich metadata

JSON Format Structure

Basic multi-speaker structure:

{
  "metadata": {
    "duration": 125.5,
    "language": "en",
    "speakers": ["Speaker 0", "Speaker 1"],
    "created": "2025-01-15T10:30:00Z"
  },
  "segments": [
    {
      "id": 1,
      "speaker": "Speaker 0",
      "start": 3.0,
      "end": 7.0,
      "text": "Hello everyone, welcome to today's product meeting."
    },
    {
      "id": 2,
      "speaker": "Speaker 1",
      "start": 7.0,
      "end": 15.0,
      "text": "Thanks for having me. I'd like to start by discussing the Q4 roadmap."
    },
    {
      "id": 3,
      "speaker": "Speaker 0",
      "start": 15.0,
      "end": 19.0,
      "text": "Great, let's dive in. What are the top priorities?"
    },
    {
      "id": 4,
      "speaker": "Speaker 1",
      "start": 19.0,
      "end": 25.0,
      "text": "The analytics dashboard redesign is our primary focus."
    }
  ]
}

Key elements:

  • metadata: Overall file information
    • duration: Total audio length in seconds
    • language: Language code
    • speakers: Array of speaker labels
    • created: Timestamp
  • segments: Array of speech segments
    • id: Unique identifier for segment
    • speaker: Speaker label
    • start: Start time in seconds
    • end: End time in seconds
    • text: Transcribed text
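Once the transcript is structured this way, simple analysis takes only a few lines. The sketch below totals speaking time per speaker; it assumes the field names shown above (`segments`, `speaker`, `start`, `end`), which may differ between transcription services.

```python
import json
from collections import defaultdict

def speaking_time(transcript):
    """Sum (end - start) per speaker label, in seconds."""
    totals = defaultdict(float)
    for segment in transcript["segments"]:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)

# Inline sample using the same field names as the structure above.
sample = json.loads("""
{"segments": [
  {"id": 1, "speaker": "Speaker 0", "start": 3.0, "end": 7.0, "text": "Hello everyone."},
  {"id": 2, "speaker": "Speaker 1", "start": 7.0, "end": 15.0, "text": "Thanks for having me."},
  {"id": 3, "speaker": "Speaker 0", "start": 15.0, "end": 19.0, "text": "Great, let's dive in."}
]}
""")
print(speaking_time(sample))
```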

Extended JSON with Rich Metadata

Full-featured structure:

{
  "metadata": {
    "version": "1.0",
    "duration": 125.5,
    "language": "en",
    "source_file": "meeting_2025-01-15.mp3",
    "transcription_service": "BrassTranscripts",
    "transcription_date": "2025-01-15T10:30:00Z",
    "speakers": [
      {
        "id": "Speaker 0",
        "name": "Sarah Martinez",
        "role": "Product Manager"
      },
      {
        "id": "Speaker 1",
        "name": "Michael Chen",
        "role": "Engineering Manager"
      }
    ]
  },
  "segments": [
    {
      "id": 1,
      "speaker_id": "Speaker 0",
      "speaker_name": "Sarah Martinez",
      "start": 3.0,
      "end": 7.0,
      "text": "Hello everyone, welcome to today's product meeting.",
      "confidence": 0.94,
      "words": [
        {"word": "Hello", "start": 3.0, "end": 3.2, "confidence": 0.98},
        {"word": "everyone", "start": 3.2, "end": 3.6, "confidence": 0.96},
        {"word": "welcome", "start": 3.7, "end": 4.1, "confidence": 0.95}
      ]
    }
  ]
}

Additional fields:

  • confidence: Transcription confidence score (0.0-1.0)
  • words: Word-level timestamps and confidence
  • speaker_name: Actual names after identification
  • Rich metadata for archival and analysis
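Word-level confidence is useful for review: instead of proofreading a whole transcript, you can jump to the words the engine was least sure about. The sketch below assumes the `words` layout shown above; the `low_confidence_words` name and the threshold value are illustrative choices, not recommendations from any service.

```python
def low_confidence_words(segment, threshold=0.9):
    """Return (word, start) pairs whose confidence falls below the threshold."""
    return [
        (w["word"], w["start"])
        for w in segment.get("words", [])
        if w["confidence"] < threshold
    ]

segment = {
    "speaker_name": "Sarah Martinez",
    "words": [
        {"word": "Hello", "start": 3.0, "end": 3.2, "confidence": 0.98},
        {"word": "everyone", "start": 3.2, "end": 3.6, "confidence": 0.96},
        {"word": "welcome", "start": 3.7, "end": 4.1, "confidence": 0.95},
    ],
}
print(low_confidence_words(segment, threshold=0.96))
```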

JSON Format Advantages

Pros:

  • Structured data for programmatic access
  • Store rich metadata
  • Easy to parse in any programming language
  • Supports nested data structures
  • Word-level timestamps and confidence scores
  • Searchable and indexable
  • Version control friendly (git diff works well)

Cons:

  • Harder for people to read than TXT (it is plain text, but cluttered with syntax)
  • Larger file size than TXT/SRT
  • No standardized schema (varies by service)
  • Requires programming knowledge for custom processing

When to Use JSON

Choose JSON format when:

  • Building custom applications
  • Need programmatic access to transcript data
  • Performing data analysis or research
  • Integrating with APIs or databases
  • Need word-level timestamps
  • Archiving with rich metadata
  • Training machine learning models

Choosing the Right Format

Decision Matrix

| Use Case | Best Format | Why |
|---|---|---|
| Reading transcript | TXT | Simple, readable, universal |
| Video subtitles | SRT | Universal video player support |
| Web video captions | VTT | Modern web standard, styling |
| YouTube upload | SRT or VTT | Both supported |
| Video editing | SRT | Adobe, Final Cut, DaVinci support |
| Podcast show notes | TXT | Readable text for blog posts |
| Software integration | JSON | Structured data, APIs |
| Data analysis | JSON | Programmatic access, metadata |
| Accessibility compliance | VTT | WCAG standard |
| Social media videos | SRT | Instagram, Facebook, TikTok |

Multiple Format Strategy

Best practice: Keep all formats.

Most professional transcription services (including BrassTranscripts) provide all formats simultaneously:

  • TXT for reading
  • SRT for video subtitles
  • VTT for web captions
  • JSON for software integration

No need to choose - use the format that fits each specific need.

Converting Between Formats

Manual Conversion

Simple formats can be converted manually:

TXT to SRT

Process:

  1. Add sequence numbers (1, 2, 3...)
  2. Convert timestamps: [00:00:03] → 00:00:03,000 --> 00:00:07,000
  3. Add blank lines between blocks
  4. Break long text into 1-2 line chunks

Before (TXT):

[00:00:03] Speaker 0: Hello everyone, welcome to today's product meeting.

After (SRT):

1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone, welcome to
today's product meeting.
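The same TXT-to-SRT steps can be scripted. The sketch below is one way to do it, under a stated assumption: TXT stores no end times, so each subtitle is assumed to end when the next speaker starts, and the final one gets a made-up `last_duration`. The function names and the 42-character line width are likewise illustrative.

```python
import re
import textwrap

LINE_RE = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s*(.*)$")

def to_srt_time(total_seconds):
    """Whole seconds -> HH:MM:SS,000 (TXT timestamps carry no milliseconds)."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},000"

def txt_to_srt(txt, last_duration=4, width=42):
    """Convert '[HH:MM:SS] Speaker: text' lines to SRT blocks."""
    entries = []
    for line in txt.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            h, m, s, speaker, text = match.groups()
            entries.append((int(h) * 3600 + int(m) * 60 + int(s), speaker.strip(), text))
    blocks = []
    for i, (start, speaker, text) in enumerate(entries):
        # Assumed end time: the next entry's start, or start + last_duration.
        end = entries[i + 1][0] if i + 1 < len(entries) else start + last_duration
        body = "\n".join(textwrap.wrap(f"{speaker}: {text}", width=width))
        blocks.append(f"{i + 1}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{body}\n")
    return "\n".join(blocks)

txt = """[00:00:03] Speaker 0: Hello everyone, welcome to today's product meeting.

[00:00:07] Speaker 1: Thanks for having me."""
print(txt_to_srt(txt))
```

Long speaker turns will still wrap into more than two lines here; splitting them into several timed subtitles needs word-level timing that TXT does not contain.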

SRT to VTT

Process:

  1. Add WEBVTT header at top
  2. Change comma to period in timestamps: 00:00:03,000 → 00:00:03.000
  3. Replace speaker labels with voice tags: Speaker 0: → <v Speaker 0>
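These three steps can be sketched in a few lines of Python. `srt_to_vtt` is a hypothetical helper, and it assumes that a short "Name:" prefix at the start of a subtitle is a speaker label; subtitles that happen to open with text like "Note:" would be mis-tagged, so review the output.

```python
import re

TIMING_RE = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")
LABEL_RE = re.compile(r"^([^:<>\n]{1,40}):\s*")

def srt_to_vtt(srt):
    """Convert SRT text to WebVTT, turning 'Name: ' prefixes into <v Name> tags."""
    out = ["WEBVTT", ""]  # step 1: required header plus a blank line
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip anything that is not a complete subtitle block
        timing = TIMING_RE.sub(r"\1.\2", lines[1])  # step 2: comma -> period
        text = " ".join(lines[2:])
        # Step 3 (assumption): a short 'Name:' prefix is treated as the speaker label.
        text = LABEL_RE.sub(lambda m: f"<v {m.group(1).strip()}>", text, count=1)
        out += [lines[0], timing, text, ""]
    return "\n".join(out)

srt = """1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone.
"""
print(srt_to_vtt(srt))
```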

Before (SRT):

1
00:00:03,000 --> 00:00:07,000
Speaker 0: Hello everyone.

After (VTT):

WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Speaker 0>Hello everyone.

Automated Conversion Tools

Converters (online and desktop):

  • Subtitle Edit (free, Windows/Mac/Linux)
  • Rev.com Subtitle Converter
  • Kapwing Subtitle Converter
  • HandBrake (for video + subtitles)

Command-line tools:

  • FFmpeg (universal media processing)
  • ccextractor (subtitle extraction)
  • pysrt (Python library for subtitle manipulation)

Programming libraries:

  • Python: pysrt, webvtt-py
  • JavaScript: subtitle npm package
  • Ruby: webvtt-ruby gem

FFmpeg Conversion Examples

Extract subtitles from video:

ffmpeg -i input_video.mp4 -map 0:s:0 subtitles.srt

Convert SRT to VTT:

ffmpeg -i subtitles.srt subtitles.vtt

Burn subtitles into video:

ffmpeg -i video.mp4 -vf subtitles=subs.srt output.mp4

Adding Speaker Labels to Existing Transcripts

You Have: Basic Transcript Without Speakers

Problem: Your transcript has text and timestamps but no speaker labels.

Solution: Add speaker identification:

Option 1: Re-Transcribe with Speaker Diarization

Fastest approach:

  1. Use a transcription service with speaker diarization
  2. Upload original audio
  3. Get transcript with speaker labels in all formats
  4. Typical cost: $0.15/minute

Time: 5-10 minutes total

Option 2: Manually Add Labels

Process:

  1. Listen to audio with transcript open
  2. Identify speaker changes
  3. Add labels before each speaker's text

Time: 4-8 hours per hour of audio

Reality check: Re-transcribing is almost always faster and more cost-effective than manual labeling.

You Have: Generic Labels ("Speaker 0, 1, 2")

Problem: Transcript has speaker labels but they're generic.

Solution: Replace with real names:

  1. Identify speakers - See Who Said What? How to Get Speaker Names in Transcripts
  2. Find and replace:
    • Find: "Speaker 0"
    • Replace: "Sarah Martinez"
    • Replace all
  3. Repeat for each speaker

Time: 5-15 minutes
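One pitfall with plain find-and-replace: in recordings with ten or more speakers, replacing "Speaker 1" also hits the start of "Speaker 10". The sketch below shows one way to avoid that with a regular expression; `rename_speakers` and the name mapping are hypothetical examples.

```python
import re

def rename_speakers(text, names):
    """Replace generic labels (keys of `names`) with real names (values)."""
    if not names:
        return text
    # Longest labels first, plus a (?!\d) lookahead, so "Speaker 1"
    # never matches the beginning of "Speaker 10".
    labels = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"(?:%s)(?!\d)" % "|".join(re.escape(label) for label in labels))
    return pattern.sub(lambda m: names[m.group(0)], text)

transcript = "[00:00:03] Speaker 1: Hi.\n[00:00:05] Speaker 10: Hello."
names = {"Speaker 1": "Sarah Martinez", "Speaker 10": "Michael Chen"}
print(rename_speakers(transcript, names))
```

Because it works on plain text, the same function can be applied to TXT, SRT, and VTT files alike.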

Frequently Asked Questions

Which format is best for YouTube captions?

Both SRT and VTT work for YouTube.

SRT is more common and slightly simpler:

  • More tools support SRT export
  • Easier to edit manually
  • Standard format for most video editing software

VTT offers more features:

  • Better styling options
  • Native web format
  • Easier to add metadata

Recommendation: Use SRT unless you specifically need VTT features.

Can I include speaker names in video subtitles?

Yes, include speaker labels in the subtitle text:

SRT format:

1
00:00:03,000 --> 00:00:07,000
Sarah: Hello everyone, welcome to
today's product meeting.

VTT format with voice tags:

WEBVTT

1
00:00:03.000 --> 00:00:07.000
<v Sarah>Hello everyone, welcome to today's product meeting.

Some video players can style different speakers automatically based on VTT voice tags.

How do I edit speaker labels in subtitle files?

Simple edits: Use a text editor

  • Open .srt or .vtt file in any text editor
  • Find speaker labels (e.g., "Speaker 0:")
  • Replace with real names
  • Save file

Complex edits: Use subtitle software

  • Subtitle Edit (free, powerful)
  • Aegisub (free, advanced styling)
  • Adobe Premiere Pro (professional, paid)

What's the difference between SRT and VTT?

Technical differences:

| Feature | SRT | VTT |
|---|---|---|
| File extension | .srt | .vtt |
| Header | None | Required "WEBVTT" |
| Timestamp format | HH:MM:SS,MS | HH:MM:SS.MS |
| Speaker tags | No | Yes (<v> tags) |
| Styling | Limited | Extensive (CSS-like) |
| Metadata | No | Yes |
| Web standard | No | Yes (W3C) |
| Video player support | Universal | Modern players |

Practical differences:

  • SRT: Universal compatibility, simpler
  • VTT: Better for web, more features

Can JSON transcripts be converted to SRT?

Yes, JSON contains all necessary data (timestamps, text, speakers).

Python example:

import json

def format_timestamp(seconds):
    """Convert seconds (float) to the SRT form HH:MM:SS,mmm."""
    # Defined before use; rounding once avoids float truncation errors.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

# Load JSON transcript
with open('transcript.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to SRT: one numbered block per segment
srt_content = ""
for i, segment in enumerate(data['segments'], start=1):
    start = format_timestamp(segment['start'])
    end = format_timestamp(segment['end'])
    speaker = segment['speaker']
    text = segment['text']

    srt_content += f"{i}\n"
    srt_content += f"{start} --> {end}\n"
    srt_content += f"{speaker}: {text}\n\n"

# Save SRT file
with open('transcript.srt', 'w', encoding='utf-8') as f:
    f.write(srt_content)

Many transcription services provide conversion tools so you don't need to code this yourself.

Do all transcription services provide multiple formats?

No, format availability varies by service:

Single format services:

  • Some basic services provide TXT only
  • May require manual conversion to SRT/VTT

Multiple format services:

  • Professional services provide TXT, SRT, VTT, JSON
  • BrassTranscripts provides all 4 formats automatically
  • Otter.ai, Rev, Descript provide multiple formats

Check before choosing a service:

  • What formats are included?
  • Are all formats available for multi-speaker transcripts?
  • Is there an extra charge for specific formats?

How do I handle very long speaker names in subtitles?

Long names reduce space for actual text in subtitles.

Solutions:

1. Use shortened names:

  • "Sarah Martinez" → "Sarah"
  • "Dr. Jennifer Thompson" → "Dr. Thompson"

2. Use initials:

  • "Sarah Martinez" → "SM:"
  • "Michael Chen" → "MC:"

3. Use roles:

  • "Product Manager:"
  • "Engineering Manager:"

4. Omit names in burned-in subtitles, use full names in transcript file:

  • On-screen: No names, just text
  • Downloadable file: Full names for reference

Can speaker labels be color-coded in video subtitles?

Yes, but implementation varies by platform:

SRT with color tags:

1
00:00:03,000 --> 00:00:07,000
<font color="#00FF00">Sarah:</font> Hello everyone.

2
00:00:07,000 --> 00:00:15,000
<font color="#00FFFF">Michael:</font> Thanks for having me.

Support varies:

  • Some video players honor color tags
  • Others ignore them
  • YouTube doesn't support color in uploaded captions

VTT with CSS styling:

WEBVTT

STYLE
::cue(v[voice="Sarah"]) { color: cyan; }
::cue(v[voice="Michael"]) { color: yellow; }

1
00:00:03.000 --> 00:00:07.000
<v Sarah>Hello everyone.

Better support in:

  • HTML5 video players
  • Custom web video platforms
  • Modern browsers

Limitation: YouTube and most social platforms don't support custom styling.

Conclusion

Multi-speaker transcript formats serve different purposes: TXT for reading, SRT for video subtitles, VTT for web captions, and JSON for software integration.

Key takeaways:

  1. TXT is for humans - Simple, readable, universal sharing
  2. SRT is for video - Universal subtitle format for video players and editing software
  3. VTT is for web - Modern web standard with speaker tagging and styling
  4. JSON is for software - Structured data for integration and analysis
  5. Keep all formats - Professional services provide all formats simultaneously

Quick reference guide:

  • Adding subtitles to video? → SRT
  • Creating YouTube captions? → SRT or VTT
  • Web-based video player? → VTT
  • Reading transcript? → TXT
  • Building software integration? → JSON
  • Podcast show notes? → TXT
  • Data analysis? → JSON

Best practice: Use a transcription service that provides all formats automatically, then use whichever format fits each specific need.

For multi-speaker transcription with all formats included (TXT, SRT, VTT, JSON), visit BrassTranscripts - automatic speaker identification and all formats delivered simultaneously.

