DVR-MS: Adventures in Closed Captioning
I finished the code for this project months and months ago, and I had every
intention of writing a full MSDN article describing the ins and outs of what I’d
accomplished, but time seems to have gotten away from me. Rather than let the
code languish any longer, I’ve decided to simply write up a (relatively) short
blog post, making the code available for all to explore.
For those of you who have read my previous MSDN articles about Media
Center,
you know I’m an avid fan. I’m also really interested in exploring and expanding
on the developer scenario for working with Media Center as well as with the
DVR-MS files in which it saves recorded video. There is so much interesting
information there, be it the actual video or audio content, or be it the
metadata surrounding that content. However, there is a large amount of
information available in DVR-MS files that has largely gone unnoticed and
unsung, and with this post I hope to rectify that inequity.
Closed captions are an amazing source of information about recorded
television shows. The metadata headers in DVR-MS files provide a great deal of
information about the show, most of it based on information captured from the
electronic program guide. But closed captions detail everything that goes on in
a show, the lines that were spoken, interesting dialog, when music is played,
and so on. Being able to harness this data makes it possible to write a wide
variety of just plain cool applications that make watching and working with
recorded television much more enjoyable.
To exemplify this, I’ve written a managed class library in C# (extending that
which I created for my original Fun
with DVR-MS article on MSDN), that parses and exposes all of the closed
captioning data from NTSC non-HD DVR-MS files in a developer-friendly fashion.
I’ve then layered on top of this library a few sample applications to
demonstrate the power of such a library and the types of application you can
have fun writing yourself.
How do I use the library?
You can download the complete sample source code and compiled binaries at http://toub.members.winisp.net/Code/ClosedCaptions.zip.
It includes Toub.MediaCenter.Dvrms.dll, which provides the classes for parsing
out the captions and working with them. The central class to this effort is the
abstract class ClosedCaptionsParser, available in the
Toub.MediaCenter.Dvrms.ClosedCaptions namespace. From this class derives the
concrete NtscClosedCaptionsParser class which is used to extract the closed
captioning data from NTSC files. (You’ll notice that the library also includes a
PalClosedCaptionsParser, but if you look inside it, you’ll see that it’s simply
a shell waiting for you to implement it.) To use an NtscClosedCaptionsParser,
one simply instantiates the class, passing to the constructor the path to the
DVR-MS file whose captions are to be extracted, and then calls the parser’s
GetCaptions method. GetCaptions returns a ClosedCaptionCollection, which
contains the parsed ClosedCaption instances. Each ClosedCaption instance exposes
four properties, three of which are TimeSpans and one of which is a string. The
string is the text of the caption (most likely an individual word or sentence
from the captioning), and the TimeSpans represent the times at which the data
for the caption began to be received, was displayed to the screen, and was
cleared from the screen. The NtscClosedCaptionsParser actually creates instances
of a class derived from ClosedCaption, NtscClosedCaption, that provides two
additional properties. The first of these additional properties is of the
enumeration type NtscClosedCaptionType and describes the type of the caption:
roll-up, pop-on, or paint-on. To understand the distinction between these types
of captions, I suggest you read the free Code
of Federal Regulations document 47CFR15.119. This will provide you with a
PDF of the subsection titled “Closed caption decoder requirements for analog
television receivers,” which is part of the book covering the Federal
Communications Commission (FCC) and the section covering radio frequency devices.
The other property exposed by NtscClosedCaption is Channel, which tells you with
which channel of captioning this particular caption is associated (you’ll often
find multiple languages on different channels interwoven, and this allows you to
separate them out from each other).
So, as an example of how this can be used, here is a snippet of code that
parses the captions from a DVR-MS file and writes them out to the console in a
tab-delimited fashion:
    // Parse every caption from the DVR-MS file at the given path.
    NtscClosedCaptionsParser parser = new NtscClosedCaptionsParser(filename);
    ClosedCaptionCollection ccs = parser.GetCaptions();

    // Write a header row, then one tab-delimited row per caption.
    Console.WriteLine("Start\tDisplay\tClear\tText\tType\tChannel");
    foreach (NtscClosedCaption cc in ccs)
    {
        Console.WriteLine("{0}\t{1}\t{2}\t{3}\t{4}\t{5}",
            cc.StartTimecode, cc.DisplayTimecode, cc.ClearTimecode,
            cc.Text, cc.CaptionType, cc.Channel);
    }
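Since the Channel property exists so that interwoven channels can be separated, here is a minimal sketch of doing that with LINQ. To keep it self-contained it uses a hypothetical stand-in class, SampleCaption, rather than the library's NtscClosedCaption, and it assumes Channel is an int; the real property's type may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for NtscClosedCaption, with only the members used here.
public sealed class SampleCaption
{
    public TimeSpan DisplayTimecode { get; set; }
    public string Text { get; set; }
    public int Channel { get; set; } // assumption: the library's Channel type may differ
}

public static class ChannelSplitter
{
    // Groups captions by channel and returns each channel's text in display order.
    public static Dictionary<int, List<string>> SplitByChannel(
        IEnumerable<SampleCaption> captions)
    {
        return captions
            .GroupBy(c => c.Channel)
            .ToDictionary(
                g => g.Key,
                g => g.OrderBy(c => c.DisplayTimecode)
                      .Select(c => c.Text)
                      .ToList());
    }
}
```

With the real library, the same GroupBy call should work directly over the captions returned by GetCaptions, provided the collection is cast to NtscClosedCaption first.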
A Few Words About NTSC Closed Captions
A brief tour of NTSC closed captions is probably appropriate. Note that there
are a lot of intricacies involved with parsing and rendering closed captioning
data, and I’ve chosen to ignore most of them. Thus, I’ve coded up a simplified
parser that just deals with the text included in the stream (most of the
specification is dedicated to the commands that are also included in the data,
which dictate actions such as moving the cursor around the screen and changing
the font color). This works very well for most, if not all, of the recorded
programs I’ve tested it with, but there could very well be some recordings that
cause this code to blow up. If it destroys your computer, I take no
responsibility (though I would like to know about it, as I’d find it pretty
amazing).
DVR-MS files are created by the Stream Buffer Engine (SBE) introduced in
Windows XP Service Pack 1, and are used by Media Center for storing recorded
television. They contain metadata describing the contents of the recording
(title, episode title, actors, etc.) as well as a variety of data streams. A
typical DVR-MS file is made up of two or three of these streams, depending on
what version of Media Center created it, the type of signal being recorded, and
the actual data being recorded. Most, if not all, DVR-MS files will have both an
audio and a video stream. The third stream is the one that may or may not exist,
though it will for almost all NTSC content recorded with a recent version of
Media Center. This third stream contains the closed captioning data for the
recorded television show. As with the audio and video stream, the closed
captioning stream is encrypted/tagged, so it must first pass through a
Decrypter/Detagger filter. If you use GraphEdit to look at the Out pin on this
filter, you'll see that its major type is AUXLine21Data and that its subtype is
Line21_BytePair (this may vary based on whether the content is NTSC, PAL, HD,
etc.). Closed captioning in NTSC television shows is encoded into line 21 of the
Vertical Blanking Interval (VBI).
There are a few approaches one could take to extract the closed captioning
data from the CC stream. One approach would be to hook up a Dump filter to the
CC stream, saving the CC stream’s raw contents to a file. This file could then
be opened and parsed through a managed FileStream. This approach, while very
simple, has some significant downsides. The streams that make up DVR-MS files
are actually split up into a series of samples, where each sample contains data
but also some metadata about that data. One important piece of metadata for each
sample is the timecode at which that sample is supposed to be rendered when the
file is played. However, when you use the Dump filter to dump the data contained
within a stream, you’re simply dumping the data contents of each sample, not the
metadata. Thus, there’s no way by examining the dumped bytes to determine with
certainty what timecode each dumped byte corresponds to in the original file,
especially if the CC stream isn’t continuous (for example, if commercials in the
show aren’t captioned).
There’s another way to access the closed caption data, and this approach
provides full fidelity, as it provides you access to the metadata for each
sample. The Windows
Media 9 Series Format SDK provides classes and interfaces that allow you to
read Windows Media files using synchronous calls. These interfaces can also be
used with DVR-MS files, and as such the Format SDK allows you to process DVR-MS
files using the IWMSyncReader interface. The WMCreateSyncReader function is used
to create one of these synchronous reader objects.
In order to parse out the closed captions, I use an IWMSyncReader to find and
walk through the closed captions stream in the DVR-MS file. This can involve
reading in and looking through gigabytes of data, which means that this isn’t
currently a fast process. In fact, parsing the captions from a half-hour
show can take a minute or more on my laptop (though I’m running on battery power
right now, on the plane on the way back from PDC 2005, so that might be slowing
down the process further).
The closed caption stream is divided into a series of two-byte instructions,
some of which are data and some of which are command codes that detail how to
process the data. For example, one command might inform the processor to render
the current caption to the screen, and another command might ask the processor
to clear the currently displayed caption. My parser is a very simple state
machine that loops through the data looking at each byte pair, processing and
reacting to each as they occur. There is a whole list of command codes which
I’ve special cased in a switch statement: if any of those are found, the command
itself is ignored and instead a space is added to the output text. If the
command is 0x942f (end of caption) or 0x942c (erase displayed memory), it is
treated as the end of the current caption, and whatever text seen up to this
point is stored into a new ClosedCaption instance. This instance is then added
to the collection of captions, and the text buffer is erased to prepare for the
next caption. With the exception of special characters (which occupy two bytes
and begin with either 0x11 or 0x19, both of which I’ve ignored and treated as
commands) such as musical notes and registered marks, each text byte is
inclusively between 0x20 and 0x7A and represents its ASCII equivalent. As such,
if a byte isn’t part of a command, I simply cast it to Char and add it to the
current text buffer (which will eventually be stored in a ClosedCaption instance
when an end of caption or erase displayed memory command is received).
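The loop described above can be sketched roughly as follows. This is not the library's actual code: the class and method names are made up for illustration, timecodes are left out, and two simplifying assumptions are made. First, the high bit of each byte is treated as a parity bit and masked off before the text range check. Second, instead of a switch over a list of known command codes, any pair whose first byte falls below 0x20 after masking is treated as a command and replaced with a space.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Line21Sketch
{
    // Walks the raw byte pairs and returns the text of each completed caption.
    public static List<string> ExtractCaptionText(byte[] data)
    {
        var captions = new List<string>();
        var buffer = new StringBuilder();

        for (int i = 0; i + 1 < data.Length; i += 2)
        {
            int pair = (data[i] << 8) | data[i + 1];

            if (pair == 0x8080) continue; // filler: no captioning in this frame

            // End of caption or erase displayed memory: flush the buffer.
            if (pair == 0x942f || pair == 0x942c)
            {
                string text = buffer.ToString().Trim();
                if (text.Length > 0) captions.Add(text);
                buffer.Clear();
                continue;
            }

            // Assumption: bit 7 is parity, so mask it off before inspecting.
            byte first = (byte)(data[i] & 0x7F);
            byte second = (byte)(data[i + 1] & 0x7F);

            // Simplification: anything starting below 0x20 (including the
            // 0x11 and 0x19 special-character prefixes) counts as a command.
            if (first < 0x20)
            {
                buffer.Append(' ');
                continue;
            }

            AppendIfText(buffer, first);
            AppendIfText(buffer, second);
        }
        return captions;
    }

    // Text bytes in the 0x20..0x7A range map straight to their ASCII character.
    private static void AppendIfText(StringBuilder buffer, byte b)
    {
        if (b >= 0x20 && b <= 0x7A) buffer.Append((char)b);
    }
}
```

A fuller version would also record the frame index at which each caption started, was displayed, and was cleared, so that the three timecodes can be computed afterwards.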
Two bytes of closed captioning data are sent with every frame of the video,
so the time code at which to display a caption can be calculated
from the frame in which it was sent (frames that contain no useful closed
captioning will contain 0x8080 as a filler). There are two ways to compute the
time code based on the frame: dropframe and non-dropframe. Non-dropframe is
computed simply by dividing the frame number in which the closed caption
instruction was sent by the number of frames per second; for NTSC, this is 29.97
frames per second, and for PAL, 25 frames per second. Dropframe, on the other
hand, is used for most broadcast signals and attempts to account for the
non-integer 29.97 value using a scheme similar to that employed for leap years
(i.e. the earth actually spins 365.25 times in a year, so rather than worry
about the quarter day each year, it’s simply rounded off and turned into a 366th
day once every four years). Instead of computing the time code based on 29.97
frames per second, it’s computed based on 30 frames per second. This results in
18,000 frames in ten minutes, as opposed to using 29.97, which results in 17,982
frames in ten minutes, a difference of 18 frames every ten minutes. So, when
computing timecodes with dropframe, the first two frame numbers of every minute
are skipped, except in every tenth minute, thus eliminating the problem of having
an extra 18 frames. Thus, if you use non-dropframe to decode a broadcast that used
dropframe, by the end of each ten-minute cycle the timecodes could be off by a
little more than half a second. For simplicity, I’ve decided this is an
acceptable discrepancy and have coded my timecode computation method to use
non-dropframe. Feel free to change it if this bothers you.
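To make the arithmetic concrete, here is a small sketch of both computations. The names are illustrative and none of this is taken from the library. NonDropFrame divides the frame number by 29.97, as described above. DropFrameLabel uses the usual drop-frame numbering scheme, in which frame numbers 00 and 01 are skipped at the start of every minute except each tenth minute. The 18-frame difference works out to 18 / 29.97, or about 0.6 seconds per ten minutes.

```csharp
using System;

public static class TimecodeSketch
{
    private const double NtscFramesPerSecond = 29.97;

    // Non-dropframe: elapsed time is simply frame number / frame rate.
    public static TimeSpan NonDropFrame(long frameNumber)
    {
        return TimeSpan.FromSeconds(frameNumber / NtscFramesPerSecond);
    }

    // Drop-frame: label frames as if running at 30 fps, skipping two frame
    // numbers at the start of each minute except every tenth minute.
    public static string DropFrameLabel(long frameNumber)
    {
        const long framesPerMinute = 1798;      // 60 * 30 - 2
        const long framesPerTenMinutes = 17982; // 10 * 60 * 30 - 18

        long tenMinuteBlocks = frameNumber / framesPerTenMinutes;
        long remainder = frameNumber % framesPerTenMinutes;

        // 18 numbers are skipped per full ten-minute block, plus 2 for each
        // minute boundary already crossed inside the current block.
        long skipped = 18 * tenMinuteBlocks;
        if (remainder > 2)
        {
            skipped += 2 * ((remainder - 2) / framesPerMinute);
        }

        long n = frameNumber + skipped;
        long frames = n % 30;
        long seconds = (n / 30) % 60;
        long minutes = (n / 1800) % 60;
        long hours = n / 108000;

        return string.Format("{0:00}:{1:00}:{2:00};{3:00}",
            hours, minutes, seconds, frames);
    }
}
```

For example, frame 1800 is labeled 00:01:00;02 because labels ;00 and ;01 are skipped at the one-minute mark, while frame 17982 lands exactly on 00:10:00;00.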
To help with the speed issue I mentioned previously, by default the
ClosedCaptionsParser caches a serialized version of the parsed
ClosedCaptionCollection into an NTFS Alternate Data Stream associated with the
DVR-MS file. After the captions have been parsed successfully once, by default
any attempts to parse the captions in the future will first attempt to retrieve
the cached captions from this stream. So, while this operation is very fast on
all subsequent parsing operations, parsing a DVR-MS file the first time can cause
a significant wait.
Navigation
For me, one of the most interesting scenarios this capability presents is
enhanced navigation. Most folks today with PVR capabilities are stuck in the
television-watching mindset of fast-forward and rewind. But what about search?
Search is huge! What if you could jump to a place in the video where a
particular line was said? What if you want to show someone that really funny
joke in the episode of Friends you recorded last night? Closed captions make
that possible.
I’ve dubbed this first sample application I’ve implemented “Search and View,”
and it does exactly what its name implies. The application hosts the Windows
Media Player ActiveX control in order to play a DVR-MS file. When a file is
selected to be played, the captions are parsed from the file and are displayed
in a list box, allowing individual captions to be selected. When a caption is
double-clicked, the video jumps to the location in the video where that caption
was displayed to the screen, allowing you instant access to any spoken dialogue
in the video. Moreover, there’s a search box on the form that provides very
simple searching capabilities. You can enter a search term and have the list of
captions narrowed to only those captions that include the search term.
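The narrowing step is easy to picture in code. The following is a minimal sketch, not the sample application's source: CaptionEntry is a hypothetical stand-in for the library's caption class, and the filter is a plain case-insensitive substring match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for ClosedCaption, with only the members used here.
public sealed class CaptionEntry
{
    public TimeSpan DisplayTimecode { get; set; }
    public string Text { get; set; }
}

public static class CaptionSearch
{
    // Returns the captions whose text contains the term, ignoring case.
    // An empty term returns every caption, which resets the list.
    public static List<CaptionEntry> Filter(
        IEnumerable<CaptionEntry> captions, string term)
    {
        if (string.IsNullOrWhiteSpace(term)) return captions.ToList();

        return captions
            .Where(c => c.Text != null &&
                        c.Text.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }
}
```

Jumping to a selected caption then comes down to handing its DisplayTimecode.TotalSeconds to the player, for example by setting Ctlcontrols.currentPosition on the Windows Media Player control.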
Obviously, there are a plethora of ways in which this application can be
expanded upon and improved, but I happen to think it’s pretty darn cool as is.
Thanks to Derek Del Conte for working with me to flesh out the idea for this
sample.
Search
If you’re like me, you record many episodes of a few different shows, and
sometimes it can be difficult finding the show you’re really interested in
watching. What I really needed was a way to do an intelligent full-text search
on all of the videos on my hard disk in order to narrow down my files to only
those I’m interested in... oh wait, I already have that: Windows Desktop Search.
In order to use Windows Desktop Search to allow me to search for recorded videos
based on that funny dialogue I want to play for my fiance, I implemented an
IFilter for DVR-MS files. This IFilter is written in managed code using COM
interop to expose the necessary functionality to Windows Desktop Search so that
it can index all of the closed captions contained in my recorded DVR-MS files. I
can now simply type into the Desktop Search text box a phrase from a show I
previously recorded, and voila, I’m instantly provided with the DVR-MS file I
should view. Wow.
Note: I mentioned earlier that it can be very slow to do the initial captions
parse for a DVR-MS file. The way it’s currently implemented, this can cause
problems for Desktop Search, which expects the IFilters it uses to be timely in
their responses. If an IFilter takes too long to process a file (some number of
minutes), Desktop Search does the right thing and assumes the IFilter has hung,
aborting the indexing for that file. Additionally, the way I currently parse the
file, I do all of the parsing in one fell swoop, which precludes Desktop Search
from throttling the indexing; a robust IFilter would handle this much better,
but, well, this is sample code. The IFilter will run very quickly if the
captions have already been cached from a previous parse. Also note that the
Desktop Search team recently released their own sample for how to implement
managed IFilters.
Saving Captions When Converting To Other Formats
In my Fun
with DVR-MS article, I demonstrated how it’s possible to use DirectShow to
convert from DVR-MS files to other media formats such as WMV and WMA. However,
those samples did not preserve the closed captions. And how could they; after
all, WMV and WMA files don’t contain streams for captions, right? They do
something even better. Both WMV and WMA files allow you to use the
WM/Lyrics_Synchronised metadata header to store a collection of strings, each of
which is associated with a particular time in the video file at which the string
should be displayed. Sound familiar? Synchronized lyrics are visible in two
different ways in Windows Media Player. The first, and most obvious, is through
the synchronized lyrics editor that’s part of the Advanced Tag Editor in Media
Player.

But these lyrics wouldn’t do any good if you couldn’t view them
along with the media at the appropriate time. In fact, you can. If you enable
captions/subtitles for a video (the option is available from the Play menu in
Media Player), Media Player will show you the synchronized lyrics along with the
video at the appropriate time. So, I’ve augmented the ConvertToWmv and
ConvertToWma applications I provided in the Fun with DVR-MS article to extract
the closed captions from the original DVR-MS file and to save them as
synchronized lyrics into the metadata headers for the generated Windows Media
files.

Summarizing Video Files
This one is admittedly a bit far-fetched, but I still think the idea is neat
and wanted to see how it would fare. Microsoft Word has the ability to provide
automated summaries for documents. You specify how much the text in the document
should be summarized, Word analyzes the textual content, and it provides a new
document that is some percentage in size of the original text (25% by default, I
believe). Wouldn’t it be neat if we could do the same thing for video files? I
decided it’d be fun to use Word’s AutoSummarize feature to implement this for
videos. First, I extract the captioning from a DVR-MS file. I then
programmatically dump that textual content into a Word document and ask Word to
summarize it for me. The summarized text is then mapped back to the original
captions (in a slightly haphazard fashion, I admit) as parsed from the file in
order to determine which captions should be kept as part of the summary. The
RecComp class, as described in the Fun
with DVR-MS article, is then used to create a video summary including the
segments that contain the summary captions. Useful? Unsure. Cool? Yup, or at
least I think so.
Other Ideas
At over 3500 words, this post didn’t end up being as short as I’d planned,
but hey, the more the better I guess. There are so many neat things you can do
with captions, I’m sure this only scratches the surface. Some additional ideas
for things I’d implement if I had the time:
- An add-in for Media Center that displays a list of captions for the
current video and lets you jump to a caption. You can discover the recorded
show that’s currently playing using code from my Time
Travel with Media Center article.
- An app that combines Desktop Search and my Search and View app, allowing
you to search your whole disk for a particular phrase, show the video, and
jump right to the phrase in the video.
- A really powerful Tablet PC-based remote control for Media Center. An
add-in in Media Center could expose through remoting not only the AddInHost,
but also all of the captioning information. The Tablet could then expose an
interface that allows you to navigate the show on your Media Center based on
traditional navigation controls but also based on searching and selecting
closed captions.
- A speech-based navigation engine that builds up a grammar based on the
closed captions and lets you navigate the current show purely by speaking a
line from the show.
- A transcript generator. Grabs images from the video and includes them in a
Word document alongside the closed captions at the correct point within the
document.
- A Windows service that monitors for when new shows have been recorded and
automatically starts the process of extracting the closed captions and saving
them. This will then make it very fast for other applications to work with the
captions, as they’ll already be parsed and available. Of course, in a sense
this is one feature the IFilter provides in concert with Desktop Search.
- In some circumstances, you can distinguish commercials from actual content
based on the type and content of the closed captions... not that I'm
suggesting anything.
- A summary video generator that searches for keywords and uses
IStreamBufferRecComp to generate a new summary video containing portions of
the video with the specified keyword.
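As a starting point for that last idea, here is a rough sketch of the step that turns keyword hits into time ranges. Everything in it is an assumption made for illustration: TimedCaption stands in for the library's caption class, Segment is a made-up type, and the padding value is arbitrary. Overlapping ranges are merged so the same stretch of video isn't included twice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for ClosedCaption, with only the members used here.
public sealed class TimedCaption
{
    public TimeSpan StartTimecode { get; set; }
    public TimeSpan ClearTimecode { get; set; }
    public string Text { get; set; }
}

// Hypothetical type describing one stretch of video to keep.
public sealed class Segment
{
    public TimeSpan Start { get; set; }
    public TimeSpan End { get; set; }
}

public static class KeywordSegments
{
    // Finds captions containing the keyword, pads each one on both sides,
    // and merges any ranges that overlap or touch.
    public static List<Segment> Build(
        IEnumerable<TimedCaption> captions, string keyword, TimeSpan padding)
    {
        var hits = captions
            .Where(c => c.Text != null &&
                        c.Text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            .OrderBy(c => c.StartTimecode);

        var segments = new List<Segment>();
        foreach (var c in hits)
        {
            // Clamp the padded start so it never goes below zero.
            TimeSpan start = c.StartTimecode > padding
                ? c.StartTimecode - padding
                : TimeSpan.Zero;
            TimeSpan end = c.ClearTimecode + padding;

            Segment last = segments.Count > 0 ? segments[segments.Count - 1] : null;
            if (last != null && start <= last.End)
            {
                if (end > last.End) last.End = end; // extend the previous range
            }
            else
            {
                segments.Add(new Segment { Start = start, End = end });
            }
        }
        return segments;
    }
}
```

Each resulting start and end pair could then be handed to the RecComp class from the Fun with DVR-MS article to stitch the summary together.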
I’d love to hear what else you come up with and any feedback you might have
(again, though, this is all unsupported, so while I’ll try to help where and
when I can, I make no guarantees about anything). In the meantime, I hope this
is helpful.
Happy programming!
-Steve