About Captions

Tom discusses the history and implementations of captions with Andy Beach, video engineer and author of Video Compression Handbook.

Featuring Tom Merritt and Andy Beach.



A special thanks to all our supporters–without you, none of this would be possible.

Thanks to Kevin MacLeod of Incompetech.com for the theme music.

Thanks to Garrett Weinzierl for the logo!

Thanks to our mods, Kylde, Jack_Shid, KAPT_Kipper, and scottierowland on the subreddit

Send us email to [email protected]

Episode Script
I turned on captions on my TV but they didn’t turn on for my Netflix…
Then I turned on captions for Hulu and they turned on everywhere…?
How the heck do I make captions only show up where I want them to…?
Are you confused?
Don’t be.
Let’s help you Know a Little more about captions.
It’s Know A Little More’s first two-time guest. Please welcome back Andy Beach, author of Video Compression Handbook and video engineer and architect at Microsoft.
After reading an email about the confusion caused by captions settings on my show Cordkillers, Andy Beach was nice enough to send me an email about how they work and then even NICER enough to agree to record all this for Know A Little More.
Andy’s email:

Before I actually answer your question, a few fun captioning facts:
There are also open captions. Closed captions are called that because the text is hidden until it is turned on by the user. Open captions (also sometimes called “burned in captions”) are captions that are ever present in the video. Captioning in other parts of the world, particularly Europe, is referred to as Teletext, but it often uses the same tech that the US-based nomenclature uses.
Closed captions as we used to think of them in analog days were also called 608 or line 21, as they were visibly delivered to the TV set in the part of the video image that falls outside the normal display area (the vertical blanking interval, which happens to be 21 lines long).
When we moved to digital broadcast, we needed a new method, so the 708 standard was created (CEA-708 for ATSC TV signal, if you want to be pedantic). Instead of a visible image that had to be read by the system, now we have a digital signal embedded directly in the MPEG transport stream, which has approx 9600 bits per second in order to encapsulate the data. (And just a reminder for everyone who thinks this is ancient history: digital TV was not federally mandated in the US until June 12, 2009.)
And now we come to the internet, which had no standard for how text was transmitted or rendered for end users until approx 2015. Up to that point the closest we had to a standard was TTML (Timed Text Markup Language), an XML-based format that was proposed in the early 2000s by the W3C (and was the culmination of a good amount of early work done around SMIL, the Synchronized Multimedia Integration Language).
Around 2010, in conjunction with HTML5, there was nearly a standard based around SRT subtitles that was called WebVTT (also a W3C standard), but it never gained wide enough adoption. FINALLY, in 2015, the W3C, HBO, and Netflix shared an Emmy for their work around standardizing TTML, which was published as the TTML2 spec and by and large has become how CC is delivered to web clients today. TTML is fairly straightforward, as it is an extension of XML work and generally looks like human readable text broken up by time code that tells the system when to render it. You will occasionally find systems where the CC is delivered as a completely separate track from the video; however, a video container is a highly versatile thing.
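To make “human readable text broken up by time code” concrete, here is a minimal Python sketch. It is an illustration only: the TTML sample is made up for this example (it is not from Andy’s email or from any real service), and the helper names to_seconds, parse_cues, and captions_at are hypothetical. It also assumes every cue is a p element with begin and end attributes in HH:MM:SS.mmm form, which is just one of the timing styles TTML allows.

```python
import xml.etree.ElementTree as ET

# Hypothetical TTML sample written for this sketch, not taken from a real stream.
SAMPLE_TTML = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body>
    <div>
      <p begin="00:00:01.000" end="00:00:03.500">Are you confused?</p>
      <p begin="00:00:03.500" end="00:00:05.000">Don't be.</p>
    </div>
  </body>
</tt>
"""

# TTML elements live in this XML namespace.
TTML_NS = {"tt": "http://www.w3.org/ns/ttml"}


def to_seconds(clock):
    """Convert an HH:MM:SS.mmm clock time to seconds (assumed format)."""
    hours, minutes, seconds = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_cues(ttml_text):
    """Return a list of (begin_seconds, end_seconds, text), one per <p> cue."""
    root = ET.fromstring(ttml_text)
    cues = []
    for p in root.iterfind(".//tt:p", TTML_NS):
        text = "".join(p.itertext()).strip()
        cues.append((to_seconds(p.get("begin")), to_seconds(p.get("end")), text))
    return cues


def captions_at(cues, playback_seconds):
    """Return the caption text a player would show at the given playback time."""
    return [text for begin, end, text in cues if begin <= playback_seconds < end]


if __name__ == "__main__":
    cues = parse_cues(SAMPLE_TTML)
    for begin, end, text in cues:
        print(f"{begin:6.2f} -> {end:6.2f}  {text}")
    print(captions_at(cues, 4.0))
```

A real player does much more than this (styling, regions, other time expressions), but the core idea is the same: each block of text carries the time window in which it should be rendered.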
One last fact: while most platforms have the ability to call a generic video player (particularly if it’s an HTML5 based app), it is rare that an app would use this. The bigger a service, the more they want to “own” the whole playback experience, implementing a customized player (some license a player from a company that builds them, some build their own off of open source options) so they can tweak the quality and playback experience to be as great as possible, plus bake in all the tooling to send back telemetry about how the user viewed the content (what you played, what you skipped, where you paused, where you rewound, etc.) in order to continually improve the user experience.
OKAY – now with this base level of info, I feel like I can answer the question of why you have to activate some captions one way versus another.
As I mentioned above, there are three likely devices in the playback loop that could render your content. If it’s over the air, it’s the TV itself, which would know how to render 708 CC (or, if we time traveled to the days of analog, 608 line 21 data). The next in the chain is the device itself: a TiVo, an Xbox, a Roku or a Fire TV Stick. All these devices have their own OS and some notion of a default media player, and a service might choose to utilize that, in which case both in the app and probably at a global settings level, you could turn captions on and off. And finally there is the app itself. If they are using a custom player, it likely isn’t really connected to the system reference media player and therefore doesn’t know how to respect those settings, so instead, the devs need to add the ability in the app to turn the CC on and off. For both of these last options (device and app), these are most likely some variant of TTML2 captions, and their specs are closely aligned with whatever protocol is delivering it (the two most common right now are HLS and DASH, though others are out there, particularly for low latency use cases).
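The device-versus-app part of that answer can be sketched as a tiny Python model. Everything here is hypothetical and simplified for illustration: the StreamingApp class, its fields, and the captions_visible function are invented names, not any real platform API. The sketch assumes an app either uses the device’s default player, and so follows the device-wide caption setting, or ships a custom player that only looks at its own in-app toggle.

```python
from dataclasses import dataclass


@dataclass
class StreamingApp:
    """Hypothetical model of an app running on a streaming device."""
    name: str
    uses_system_player: bool       # True if the app relies on the device's default player
    in_app_captions: bool = False  # the app's own caption toggle (custom players only)


def captions_visible(app, device_captions_on):
    """Decide whether captions show up, under the simplified assumptions above."""
    if app.uses_system_player:
        # The system player respects the device-wide caption setting.
        return device_captions_on
    # A custom player is not wired to the system setting, so only its own toggle counts.
    return app.in_app_captions


if __name__ == "__main__":
    shared = StreamingApp("UsesSystemPlayer", uses_system_player=True)
    custom = StreamingApp("CustomPlayer", uses_system_player=False)

    # Turning captions on at the device level only affects the first app.
    print(captions_visible(shared, device_captions_on=True))
    print(captions_visible(custom, device_captions_on=True))

    # The custom-player app needs its own in-app switch flipped.
    custom.in_app_captions = True
    print(captions_visible(custom, device_captions_on=False))
```

This is why flipping one switch can turn captions on “everywhere” for some apps while another app ignores it entirely: the two kinds of player are reading different settings.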

Thank you Andy!
I hope now you know a little more about captions.
Three helpful links for learning more about closed captions:

A History of Closed Captioning