YouTube Caption Formats: SRT vs VTT vs SBV Guide

You uploaded a YouTube video. Studio asks you for a caption file. The dropdown shows SRT, VTT, SBV, and several formats you've never heard of. Which one do you pick, and does it actually matter?

For most creators, it doesn't matter much — YouTube converts every format to the same internal representation. But three formats dominate real-world use, and knowing which one to grab when affects your workflow on other platforms. This guide walks through SRT, VTT, and SBV in plain English, shows the same caption in all three side by side, and tells you which to pick for which job.

YouTube Wants Captions. Which File?
The Three Formats in Plain English
Side-by-Side Format Example
Which Format YouTube Studio Accepts
Which Format Renders Best in YouTube Player
When SRT Is the Right Choice
When VTT Is the Right Choice
When SBV Is the Right Choice
Common Caption Mistakes
Converting Between Formats
Frequently Asked Questions

YouTube Wants Captions. Which File?

BrassTranscripts returns SRT and VTT with every transcription, and YouTube Studio accepts both directly with no conversion step. SBV is YouTube's own simplified caption format and is also supported, but it's a legacy format that most transcription services no longer produce. If you're uploading a fresh transcript today, SRT is the safe default because it works on every video platform you'd touch next.

The short answer: download the SRT file, drag it into YouTube Studio under Subtitles, done. The longer answer covers why VTT exists, why SBV still exists, and when you'd actually pick something other than SRT. The rest of this guide is the longer answer.

The Three Formats in Plain English

The three formats every YouTuber encounters are SRT, VTT, and SBV. They all do the same job: pair lines of text with start and end timestamps. The differences are in syntax, age, and what advanced features each one supports.

SRT (SubRip)

SRT is the industry standard. It was born in the late 1990s as the output format of a Windows tool called SubRip that pulled subtitles off DVD images. Today, almost every video platform speaks SRT: Netflix, broadcast TV workflows, Vimeo, Facebook, Instagram, TikTok, LinkedIn, X, and YouTube. If you only learn one caption format, learn this one.

The structure is dead simple. Each caption is a numbered block with a timestamp range on its own line, then one or more lines of text, then a blank line before the next block. Timestamps look like 00:00:01,500 --> 00:00:04,000 — note the comma before the milliseconds. That comma is SRT's signature.
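To make that block structure concrete, here is a minimal Python sketch that reads SRT text into a list of cues. The names `Cue` and `parse_srt` are made up for this illustration, and the parser assumes well-formed blocks (number, timing line, text); it is not a full SRT implementation.

```python
import re
from dataclasses import dataclass

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,000".
SRT_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


@dataclass
class Cue:
    """One caption block (hypothetical helper type for this sketch)."""
    index: int
    start_ms: int
    end_ms: int
    text: str


def _to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert timestamp fields to a total number of milliseconds."""
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues, skipping blocks that do not look like cues."""
    cues = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        match = SRT_TIMING.match(lines[1].strip())
        if not match:
            continue
        parts = match.groups()
        cues.append(
            Cue(
                index=int(lines[0]),
                start_ms=_to_ms(*parts[:4]),
                end_ms=_to_ms(*parts[4:]),
                text="\n".join(lines[2:]),
            )
        )
    return cues
```

Calling `parse_srt` on the contents of an SRT file gives you one `Cue` per block, with times in milliseconds, which is usually the easiest form to work with in scripts.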

SRT has no styling features. No colors, no positioning, no font choices. Plain text only. For most caption use cases, that's a feature, not a limitation: the player decides how captions look, and they look consistent across every platform.

VTT (WebVTT)

VTT, short for Web Video Text Tracks, is the W3C standard for captions in HTML5 video. If you embed a <video> tag on a web page and want captions, VTT is what the browser expects via the <track> element. Modern web video players, including the ones built by Vimeo, Wistia, and most custom video CMS setups, prefer VTT.

VTT files start with the literal text WEBVTT on the first line; that header is how parsers identify the format. Timestamps look almost like SRT but use a period instead of a comma: 00:00:01.500 --> 00:00:04.000. Along with the header and the absence of required cue numbers, that single character is what visually separates a VTT file from an SRT file.

The real upgrade is the feature set. VTT supports styling cues (colors, italic, bold), positioning (caption can appear in any corner of the frame), karaoke-style timing, and metadata tracks. Most creators never use those features, but they're available if you need them. YouTube ignores the advanced features on upload and just reads the timestamps and text.

SBV (SubViewer)

SBV is YouTube's legacy native caption format. When Google launched closed captions for YouTube in the late 2000s, SBV was one of the formats Studio's caption editor produced natively for download. It's a simplified format: no numbered blocks like SRT, no WEBVTT header like VTT, just a timestamp line followed by text lines, separated by blank lines.

Timestamps use a period for milliseconds (like VTT) and a comma to separate start and end (unique to SBV): 0:00:01.500,0:00:04.000. Two text lines maximum per cue. No styling, no positioning, no advanced features.
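If you have an old SBV file and want an SRT out of it, the timestamp rules above are enough to sketch a converter. The following Python is a minimal illustration, not an official tool: `sbv_to_srt` is a hypothetical name, and it assumes each block starts with a timing line in the `0:00:01.500,0:00:04.000` shape described here.

```python
import re

# SBV timing line: start and end separated by a comma, period before milliseconds.
SBV_TIMING = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)


def sbv_to_srt(sbv_text: str) -> str:
    """Convert SBV text to SRT text, numbering cues from 1."""
    srt_blocks = []
    number = 0
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", sbv_text.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue
        match = SBV_TIMING.match(lines[0].strip())
        if not match:
            continue  # Skip anything that does not start with a timing line.
        number += 1
        h1, m1, s1, ms1, h2, m2, s2, ms2 = match.groups()
        # SRT pads hours to two digits and uses a comma before milliseconds.
        timing = f"{int(h1):02d}:{m1}:{s1},{ms1} --> {int(h2):02d}:{m2}:{s2},{ms2}"
        srt_blocks.append("\n".join([str(number), timing, *lines[1:]]))
    return "\n\n".join(srt_blocks) + "\n"
```

The only real work is adding cue numbers, padding the hour, and swapping the two separators; the caption text passes through untouched.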

YouTube still accepts SBV uploads today. Almost no transcription service produces SBV anymore — most creators encounter it only if they downloaded captions from YouTube Studio years ago, or if they hand-edited a caption file in Notepad and an older guide told them to use SBV. There's no practical reason to choose SBV for a new project in 2026.

Side-by-Side Format Example

Here's the same four-cue caption sequence, rendered in all three formats, so you can see exactly what they look like.

The original captions, in plain reading order:

0:01.5–4.0: Welcome back to the channel.
4.0–7.5: Today we're testing three caption formats.
7.5–11.0: SRT, VTT, and SBV — same content.
11.0–14.5: Different syntax. Same result on YouTube.

As SRT

1
00:00:01,500 --> 00:00:04,000
Welcome back to the channel.

2
00:00:04,000 --> 00:00:07,500
Today we're testing three caption formats.

3
00:00:07,500 --> 00:00:11,000
SRT, VTT, and SBV — same content.

4
00:00:11,000 --> 00:00:14,500
Different syntax. Same result on YouTube.

As VTT

WEBVTT

00:00:01.500 --> 00:00:04.000
Welcome back to the channel.

00:00:04.000 --> 00:00:07.500
Today we're testing three caption formats.

00:00:07.500 --> 00:00:11.000
SRT, VTT, and SBV — same content.

00:00:11.000 --> 00:00:14.500
Different syntax. Same result on YouTube.

As SBV

0:00:01.500,0:00:04.000
Welcome back to the channel.

0:00:04.000,0:00:07.500
Today we're testing three caption formats.

0:00:07.500,0:00:11.000
SRT, VTT, and SBV — same content.

0:00:11.000,0:00:14.500
Different syntax. Same result on YouTube.

Three details to notice. SRT has numbered cues (1, 2, 3, 4) while VTT and SBV don't. VTT requires a WEBVTT header line; the other two don't. SBV uses a comma between start and end timestamps; SRT and VTT use --> (space, two hyphens, greater-than, space).
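Those details are distinctive enough that a script can tell the formats apart from the file contents alone. The Python below is a rough sketch under that assumption; `detect_caption_format` is a made-up helper, and it only recognizes the three formats covered in this guide.

```python
import re

# SRT timing line: comma before milliseconds, "-->" between start and end.
SRT_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s+-->\s+\d{2}:\d{2}:\d{2},\d{3}")
# SBV timing line: period before milliseconds, comma between start and end.
SBV_LINE = re.compile(r"^\d{1,2}:\d{2}:\d{2}\.\d{3},\d{1,2}:\d{2}:\d{2}\.\d{3}$")


def detect_caption_format(content: str) -> str:
    """Return 'vtt', 'srt', 'sbv', or 'unknown' based on the text itself."""
    # Drop a leading byte-order mark and whitespace before checking the header.
    text = content.lstrip("\ufeff").lstrip()
    if text.startswith("WEBVTT"):
        return "vtt"
    for line in text.splitlines():
        line = line.strip()
        if SBV_LINE.match(line):
            return "sbv"
        if SRT_LINE.match(line):
            return "srt"
    return "unknown"
```

The header check comes first because it is the least ambiguous signal; the timing-line patterns only run when no WEBVTT header is present.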

Which Format YouTube Studio Accepts

YouTube Studio accepts a longer list of caption formats than most creators realize. SRT, VTT (WebVTT), and SBV (SubViewer) are the three plain-text formats. Studio also accepts SCC, MCC, TTML, and DFXP. SCC and MCC are broadcast closed-caption formats, while TTML and DFXP are XML-based timed-text formats; professional captioning houses use them for compliance with FCC closed-captioning rules.

The current accepted-formats list is documented in YouTube Help: Caption file types. Check that page if you're working with an unusual format and need to know whether to convert. For 99% of creators, SRT or VTT is what you'll have, and either uploads cleanly.

A practical note: YouTube auto-detects the format from the file contents, not the file extension. If you rename a VTT file to .srt, the upload still works because YouTube reads the WEBVTT header inside. The reverse is also true. That said, keeping the extension accurate makes life easier for every other tool in your workflow.

Which Format Renders Best in YouTube Player

YouTube converts every uploaded caption file to its own internal representation on ingest, then renders that internal version through the player. Choosing SRT, VTT, or SBV makes no visible difference in the YouTube player: same fonts, same positioning, same timing, same colors.

Format choice affects two things outside the YouTube player: what advanced features survive the upload (VTT styling and positioning cues are discarded, so don't rely on them), and what file you have left over to use elsewhere (the SRT you uploaded to YouTube is the same SRT you can drop into Vimeo or TikTok an hour later).

If you only ever publish on YouTube and never use captions anywhere else, format choice is irrelevant. Pick any of the three. SRT is still the default we recommend because the leftover file is useful everywhere else.

When SRT Is the Right Choice

SRT is the right choice when the same captions will appear on more than one platform. Vimeo, Facebook, Instagram, TikTok, LinkedIn, X, and most podcast players that show captions all speak SRT natively. A single SRT file uploaded to all of those platforms produces identical captions, with no conversion step.

SRT is also the right choice when a human will edit the captions in a text editor. The numbered cues make it easy to find a specific line ("there's a typo in cue 47"), and the syntax is forgiving — extra blank lines and minor formatting variations rarely break parsers.

For BrassTranscripts customers uploading transcripts to YouTube as the primary destination but with secondary distribution to other platforms, SRT is the recommended default. It's the format that travels best.

When VTT Is the Right Choice

VTT is the right choice when the captions will play in a custom web video player. If you're embedding video on your own site with an HTML5 <video> element, the <track> tag expects WebVTT, and using anything else means writing a converter. Most self-hosted video setups, including Plyr, Video.js, and JW Player, default to VTT.

VTT is also the right choice when you need styling, positioning, or speaker-color cues to survive playback. Those features don't work on YouTube but do work in most modern web players, and a single VTT file can serve both a YouTube upload (which ignores the cues) and a custom embed (which honors them).

For a creator running a personal site or course platform alongside a YouTube channel, downloading VTT and using the same file in both places is a clean workflow.

When SBV Is the Right Choice

SBV is almost never the right choice in 2026. The legitimate use case is creators who learned the format years ago in YouTube Studio's caption editor and have a workflow built around hand-editing SBV files in a text editor — that habit still works, and there's no reason to break it.

For anyone starting fresh, skip SBV. Transcription services produce SRT and VTT, every modern caption editor reads SRT and VTT, and every video platform reads at least one of SRT or VTT. SBV adds nothing that the other two don't already cover.

The one technical edge SBV has is simplicity for manual editing — no cue numbers to renumber when you insert or delete a line, no WEBVTT header to preserve. If you write captions from scratch in vim, that simplicity is real. For everyone else, it's a curiosity.

Common Caption Mistakes

Format choice is the easy part. The hard part is writing captions that are actually readable, and most beginner caption mistakes apply equally to SRT, VTT, and SBV.

Timing too fast for reading speed. A common guideline is about 17 characters per second, or roughly 160 words per minute on screen. Captions that flash by faster than the viewer can read are worse than no captions at all. If a single cue is more than two lines or stays on screen less than 1.5 seconds, the timing is off.

Line length too long. Each line should fit comfortably in the player without wrapping. The industry guideline is 32 to 42 characters per line, two lines maximum per cue. Captions that wrap unexpectedly because a line was 50 characters look amateur on every platform.

No speaker labels in multi-person videos. If two or more people speak on camera and the captions don't identify who's talking, viewers with the sound off lose the conversation. SRT and VTT both let you prefix lines with the speaker name (SARAH: That's exactly what I meant). BrassTranscripts automatic speaker identification produces transcripts with these labels already in place.

No punctuation. Auto-generated captions often skip commas and periods entirely, which makes long sentences run together. Captions should be punctuated like written text — periods at the end of sentences, commas where natural pauses occur, question marks for questions.

Missing music and sound effect cues for accessibility. Captions are an accessibility tool first. Viewers who can't hear the audio need to know when music plays, when a door slams, when there's applause. The convention is square brackets: [upbeat music], [door slams], [laughter]. WCAG 2.1 success criterion 1.2.2 requires captions to include non-speech sounds that are relevant to understanding the content.
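The pacing and line-length guidelines above are easy to check mechanically. Here is a small Python sketch that flags cues breaking them. The function name `check_cue` and its default thresholds (17 characters per second, 1.5 seconds, two lines, 42 characters per line) are assumptions taken from the numbers in this section, so adjust them to whatever style guide you follow.

```python
def check_cue(
    text: str,
    start_s: float,
    end_s: float,
    max_cps: float = 17.0,
    min_duration_s: float = 1.5,
    max_lines: int = 2,
    max_line_chars: int = 42,
) -> list[str]:
    """Return a list of problems for one cue; an empty list means it passes."""
    duration = end_s - start_s
    if duration <= 0:
        return ["end time must be after start time"]

    problems = []
    lines = text.splitlines()

    if duration < min_duration_s:
        problems.append(f"on screen for {duration:.2f}s, under {min_duration_s}s")

    # Reading speed: characters shown (line breaks excluded) per second on screen.
    chars = sum(len(line) for line in lines)
    cps = chars / duration
    if cps > max_cps:
        problems.append(f"{cps:.1f} characters per second, over {max_cps}")

    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines, over {max_lines}")

    for number, line in enumerate(lines, start=1):
        if len(line) > max_line_chars:
            problems.append(f"line {number} is {len(line)} characters, over {max_line_chars}")

    return problems
```

For example, `check_cue("Welcome back to the channel.", 1.5, 4.0)` returns an empty list, because 28 characters over 2.5 seconds is well under the reading-speed limit.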

Converting Between Formats

Most creators never need to convert between caption formats because BrassTranscripts produces both SRT and VTT with every transcription, and YouTube accepts both. If you need a third format, conversion is usually unnecessary too — YouTube auto-converts on upload, and most modern players accept multiple formats.

If you do need to convert, the easiest tool is ffmpeg, which handles caption format conversion natively. The command for SRT to VTT is ffmpeg -i input.srt output.vtt. ffmpeg detects the input format and picks the output format from the output file's extension. The same syntax works in reverse for VTT to SRT.
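If ffmpeg isn't installed, the SRT to VTT direction is simple enough to do as a plain text transform. The Python below is a minimal sketch of that idea, not a reproduction of what ffmpeg does internally: `srt_to_vtt` is a hypothetical helper that adds the WEBVTT header and swaps the comma for a period on timing lines. It leaves the cue numbers in place, since WebVTT allows an optional identifier line before each cue's timing.

```python
import re

# An SRT timing line, captured so the comma before milliseconds can be replaced.
SRT_TIMING_LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})\s*$"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT text with a header and period-style timestamps."""
    # WebVTT needs the header on the first line, followed by a blank line.
    out = ["WEBVTT", ""]
    for line in srt_text.lstrip("\ufeff").splitlines():
        match = SRT_TIMING_LINE.match(line)
        if match:
            # Only timing lines change; commas inside caption text stay as they are.
            line = f"{match.group(1)}.{match.group(2)} --> {match.group(3)}.{match.group(4)}"
        out.append(line)
    return "\n".join(out).rstrip("\n") + "\n"
```

Because only lines that match the timing pattern are rewritten, a caption like "SRT, VTT, and SBV" keeps its commas.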

Free online converters work too. Search for "SRT to VTT converter" and several browser-based tools will appear. Avoid pasting transcripts that contain sensitive information into random web converters — for confidential content, use ffmpeg locally or download the format you need directly from your transcription service. BrassTranscripts delivers SRT, VTT, TXT, and JSON together for exactly this reason: pick what you need without uploading transcripts to a third party.

For a deeper comparison of every transcript format and when to use each, see Which Transcript Format? TXT vs SRT vs VTT vs JSON. For multi-speaker workflows specifically, see Multi-Speaker Transcript Formats: SRT, VTT, and JSON. For a broader format decision guide, see Transcription File Formats: Decision Guide 2026. For YouTube-specific workflows, see How to Transcribe YouTube to Text: 5 Methods Compared and Video Transcription Complete Guide for YouTube Content.

Frequently Asked Questions

Does YouTube accept VTT files, or do I have to use SRT?

YouTube Studio accepts both. The full list of supported caption formats is documented on YouTube Help and includes SRT, VTT (WebVTT), SBV (SubViewer), and several broadcast-industry formats like SCC, MCC, TTML, and DFXP. SRT and VTT are the two formats almost everyone uses. BrassTranscripts produces both with every transcription, so you can upload either to YouTube without converting.

Is SBV better for YouTube because it's YouTube's native format?

No. YouTube converts every uploaded caption file to its own internal representation regardless of input format, so the on-screen rendering is identical whether you upload SRT, VTT, or SBV. SBV is simpler to hand-edit in a text editor than SRT, but that's the only practical advantage, and most modern caption editors handle SRT and VTT natively.

Can I edit captions after uploading them to YouTube?

Yes. YouTube Studio has a built-in caption editor under Subtitles for each video. You can adjust timing, fix text, and re-publish without re-uploading a file. For larger edits, most creators edit the SRT or VTT file locally, then re-upload the corrected version, which replaces the previous caption track.

Why does my SRT file have timing like 00:00:01,500 but my VTT uses 00:00:01.500?

That's the format spec, not a bug. SRT uses a comma as the millisecond separator (00:00:01,500), while VTT uses a period (00:00:01.500). The comma-vs-period difference is the single most reliable way to tell SRT and VTT apart at a glance. Both represent the same one-and-a-half-second timestamp.

Does YouTube auto-generate captions if I don't upload a file?

Yes, YouTube auto-generates captions for most videos using its own speech recognition, but the accuracy varies widely with audio quality, accents, and technical vocabulary. Uploaded captions from a dedicated transcription service give you control over timing, speaker labels, punctuation, and corrections that auto-captions often miss.

I have a TXT transcript with no timestamps. Can I upload that to YouTube?

Yes. YouTube accepts plain TXT files and auto-times them against your video's audio, which works reasonably well for solo narration but struggles with multi-speaker content and natural pauses. For better results, upload an SRT or VTT file with real timestamps from a transcription service.