
As someone who writes to live and lives to read, I might be biased when it comes to this—but in my proclivity for turning on closed captions, I’m also not alone. More than half of millennials told surveyors last year that they watch TV with subtitles most of the time. Gen Z respondents were even more pro-caption (70%).
And it’s not just that AirPods have wrecked our eardrums. More than 100 studies have shown captioning improves “comprehension of, attention to, and memory” of video content. Rather than distracting from the visuals, captions allow viewers to take in the texture of each scene without devoting their energy to deciphering every uttered word. The textual companion becomes even more of a must when watching something in public, or when trying to keep the volume down for the sake of a roommate.
That is, until the game comes on. Then, for me, the captions have to be turned off.
During live events, the blocky text of closed captions left at default settings can obstruct key areas of the screen, its constant movement diverting attention from the action. The lines are often littered with errors, and they almost always lag well behind the play. Videos describing how to turn the feature off on various platforms have racked up thousands of views from frustrated fans. And for viewers who don’t have the option of listening to the audio feed, the situation is even more maddening. But other than that, they’re great!
As Sportico’s media watcher Anthony Crupi pointed out to me, broadcasters actually have some disincentive to improve their caption offering. The current method Nielsen uses to track out-of-home viewership requires panelists’ devices to pick up an auditory tone emanating from the broadcast. By that metric, someone watching on mute might as well be watching something else.
But if closed captioning for live sports continues to pale in comparison to pre-taped alternatives, the new generation of viewers could be less likely to stick with sports.
Governments may help push leagues into the present. In the name of accessibility, municipalities are now requiring establishments to present captions on their TVs. Meanwhile, a number of individuals have filed lawsuits against venues and websites for not providing ADA-compliant offerings.
Of course, presenting high-quality captions in real time is harder than ever, now that broadcasters are airing thousands of events online in addition to linear programming.
VITAC general manager Doug Karlovits said his company, the oldest commercial provider of captioning services, handles 600,000 hours of live video a year. Much of that work is managed by automated speech recognition (ASR) programs that are pre-trained to keep a computerized ear out for subject-specific dictionaries of words. But, Karlovits added, “sports is very difficult for an ASR engine to do well,” mainly due to the wide array of proper nouns that may appear in a broadcast.
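To make the proper-noun problem concrete, here is a minimal sketch of one common mitigation: fuzzy-matching transcript words against a game-specific name list and patching near-misses. The roster, the garbled transcript, and the `correct_proper_nouns` helper are all invented for illustration; nothing here reflects VITAC’s actual pipeline.

```python
import difflib

# Hypothetical game-day dictionary an engine might be primed with.
ROSTER = ["Giannis Antetokounmpo", "Jrue Holiday", "Bobby Portis"]

def correct_proper_nouns(transcript: str, vocabulary: list[str], cutoff: float = 0.6) -> str:
    """Swap likely-misheard words for the closest known name (illustrative only)."""
    # Map each lowercase word from the vocabulary back to its proper spelling.
    name_words = {w.lower(): w for name in vocabulary for w in name.split()}
    fixed = []
    for word in transcript.split():
        # cutoff is a tunable similarity threshold, not an industry standard.
        match = difflib.get_close_matches(word.lower(), list(name_words), n=1, cutoff=cutoff)
        fixed.append(name_words[match[0]] if match else word)
    return " ".join(fixed)

# "Adetokunbo" and "Porteous" stand in for plausible ASR mishearings.
print(correct_proper_nouns("Adetokunbo feeds Porteous for the slam", ROSTER))
# -> "Antetokounmpo feeds Portis for the slam"
```

Even a crude pass like this shows why the dictionaries matter: without a roster to lean on, the engine has no way to distinguish a mangled surname from an ordinary word.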
Humans who have achieved sports-specific certification handle many of the broadcasts, either by typing on a modified stenotype or by “respeaking” the announcer’s words into a microphone feeding an ASR engine tuned to their voice. (The other industry that often relies on human transcribers, it turns out, is corporate America, where acronyms abound.) This process generally leads to a delay of three to seven seconds.
The industry clearly needs a near-infinite supply of supremely accurate stenography services with superhuman speed. If you’re thinking what I’m thinking, know that Disney has already thought it, too: This sounds like a problem for AI.
After launching ESPN+, and as it worked to make games available on Hulu, the company recognized it would now need to produce captions for thousands upon thousands of additional contests. Rather than continuing to rely on individual vendor relationships, Disney integrated access to an ASR system into its programming apparatus, allowing schedulers to directly assign captioning responsibilities to the computer for certain events.
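Disney hasn’t published how that assignment works, but conceptually it reduces to a routing decision made at scheduling time. A toy sketch, with invented event fields and tiers, and note how it mirrors the human-for-marquee-events split described below:

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    tier: int  # hypothetical: 1 = marquee broadcast, 3 = long-tail streaming inventory

def assign_captioner(event: Event) -> str:
    """Route marquee events to human captioners and long-tail events to ASR."""
    return "human_stenographer" if event.tier == 1 else "asr_engine"

slate = [Event("NBA Finals Game 7", tier=1), Event("College Volleyball", tier=3)]
for event in slate:
    print(event.name, "->", assign_captioner(event))
```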
Once implemented, the system proved to be within 1% of human accuracy and up to five seconds faster. It has already been used close to 30,000 times, and the tech is now being extended to ABC News programming.
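The story doesn’t specify how that accuracy was scored, but the industry’s standard yardstick is word error rate (WER): the word-level edit distance between the caption and a reference transcript, divided by the reference’s length. A minimal implementation, with invented sample sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

ref = "hello friends welcome back to Augusta"
hyp = "hello friend welcome back to Augusta"
print(f"WER: {word_error_rate(ref, hyp):.1%}")  # 1 substitution over 6 words = 16.7%
```

On a scale like this, “within 1% of human accuracy” would mean the machine’s error rate lands within a percentage point of the human captioner’s.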
“Through work with partners and by smartly using the capability of AI, we’re able to enhance exponentially more of our world-class content with captions,” Disney Entertainment & ESPN Technology SVP of technical services Dave Johnson said in a statement. “And it will continue to improve over the course of its development and implementation.”
ESPN is still leaning on humans, though, particularly for marquee events.
“When I first came in… it was seven years away from not needing humans,” Karlovits said of speech-to-text technology. “That was 30 years ago…. Our company always believes that it’s good to have humans involved in the process.”
Still, Karlovits is among the many grappling with recent AI advances. “I sit here every day, and I think, for my younger kids, like what kind of jobs are gonna be out there five years from now?”
The biggest gains will come for hard-of-hearing viewers who rely on captions. But advancing technology could also allow broadcasters to expand the role captions play for the entire audience, if the words were integrated more deeply into the video feed. Jim Nantz’s “Hello, friends” could appear across the bottom of the screen like sing-along lyrics. Mike Breen’s “Bang!”s could vibrate with his excitement. High-quality audio recognition could also produce searchable transcripts and summaries of telecasts, or even of everything players say near an on-field mic (if player unions were ever crazy enough to allow such a log).
Other broadcast augmentations, like Amazon’s player-tracking Prime Vision feed and its stat-tallying X-Ray offering, have already improved the muted experience. Those offerings could be expanded as well, turning AI tools from mere transcribers into fully functional sports-specific assistants.
Interactive broadcasts will also open the door for more customization of existing closed captioning. YouTube, for example, already lets users drag the captions around a given video, to wherever is most convenient or least obtrusive. TV services offer a variety of caption options these days, but navigating those settings can become a nightmare in itself.
Why stop there? Future AI algorithms will be able to “view” game content and directly create audio descriptions—no human announcer needed—to improve accessibility for people with vision loss, or to create a listen-only feed for those hoping to track games on the go. IBM VP of sports and entertainment partnerships Noah Syken said his team is already working on similar functionality, after unveiling AI-generated highlight commentary produced from data feeds for this year’s Masters.
“With each kind of progression of the technology, we think that there’s going to be additional facets of the game, additional facets of the competition, that are going to feed into the large language models that feed into the captioning ultimately,” Syken said. “So I don’t think we’ve seen the ultimate state of affairs here at all.”