accessible and universal web design, development, and consultation

Captioning YouTube Videos

Back in March 2010, I rather gleefully blogged about YouTube’s latest feature called “automatic captioning.” Since that time, I have become bemused and amused by the state of this “service.” It seems Google – the owner and operator of YouTube – has been using our videos as fodder for its new Google Voice speech-to-text (S-t-T) translation machine. Google claims, “It (Google Voice transcripts) will improve over time as our transcription engine gets smarter.” It is not clear how the Google transcription engine will get “smarter,” but I’m figuring the more the system is used, the more it will learn, and the smarter it will become…make sense?

Whoever perfects S-t-T stands to make billions in the first year, so it stands to reason Google would be interested in tapping into that treasure chest. But perfecting S-t-T has always been an elusive goal, and anyone worth their salt in the captioning or transcription business knows that human beings still make the best captionists.

That said, at the Accessibility Unconference a few weeks ago, the issue of S-t-T came up and there was lots of interest in YouTube’s “automatic captioning” service. I should note here that YouTube currently calls this a “machine transcription” service and offers it with some caveats. They also seem, in some ways, to be more interested in the language translation tool that was delivered on YouTube at the same time. Perhaps there is more money to be made in the translation of Chinese to English than in S-t-T.

At the Unconference, there was one gentleman who represented a transcription service company in Massachusetts that used a system based upon a combination of automated S-t-T and human power. He claimed that his system was much faster than regular human-only transcription because machines take the first cut at the translation and humans complete the final edits. He also claimed it was flawless. Lastly, he noted that the fee for this service is on a sliding scale based on the quality of the audio. Apparently, the poorer the quality of the speech, the more human intervention is necessary, and the higher the price tag.

So all this got me thinking about the experimental YouTube video I created and posted back in early March. The “automatic captioning,” eh, machine translation, of my video was indeed a bit hilarious. Sharing it with friends, we all howled at the bizarre transcripts that were produced by the system. It was a bit like playing that children’s game, “Telephone,” where you whisper something into someone’s ear and they whisper it into the next person and so on down the line until the last person says it out loud. The final product never comes out correctly and is usually quite funny. And indeed, the YouTube “machine transcription” was much the same.

For my test video, I purposely read a printed text – as opposed to spontaneous speech – so I would have an exact copy of the content against which to compare the transcript. The results were marginal at best and, honestly, the transcript made no logical sense. It was also amazing what YouTube’s machine translation failed to recognize. The machine translation had a particularly difficult time with the words “accessibility” and “web design.” Go figure.

I recently learned that you could download the YouTube machine translation, edit it, and then re-post it to the original YouTube video. So, today I finally got around to trying this and though successful, the process was not without pain.

First, the machine transcript is saved in some unique YouTubian format (.SBV). The content is readable using a simple text editor and looks like this:

   okay so am I- of doing it tested video here
   it and I'm going to read this to see if the
   captioning system works well

Fortunately, my MovCaptioner software could import the file and provide an easy way to edit the content. But after editing the text, I could not export the transcript without first merging it with a video. I had to grab the original video from YouTube (which I downloaded in .MP4 format) and then load that into MovCaptioner. Once the editing was finished (see note below about time), I was able to save and export the file in another format (.SUB for Subtitle format) and then upload that transcript file to YouTube.
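For anyone who would rather script the format conversion than round-trip through a captioning app, here is a minimal sketch in Python. It rests on several assumptions that are not part of my own workflow above: that each .SBV cue is a timing line of the form H:MM:SS.mmm,H:MM:SS.mmm followed by one or more lines of text, that cues are separated by blank lines, and that the target is SubRip (.srt) rather than the .SUB format I exported. The function names and the timestamps in the sample are made up for illustration.

```python
import re

# Assumed SBV timing line: start,end written as H:MM:SS.mmm,H:MM:SS.mmm
TIMING = re.compile(
    r"^(\d+):(\d{2}):(\d{2})\.(\d{3}),(\d+):(\d{2}):(\d{2})\.(\d{3})$"
)


def _srt_time(hours, minutes, seconds, millis):
    """Format one timestamp the SubRip way: HH:MM:SS,mmm."""
    return f"{int(hours):02d}:{minutes}:{seconds},{millis}"


def sbv_to_srt(sbv_text):
    """Convert SBV-style caption text to SubRip text (hypothetical helper)."""
    cues = []
    # Cues are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", sbv_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        match = TIMING.match(lines[0])
        if not match:
            continue  # skip blocks that do not start with a timing line
        parts = match.groups()
        cues.append((_srt_time(*parts[:4]), _srt_time(*parts[4:]), lines[1:]))

    # SubRip numbers each cue and uses "-->" between start and end times.
    out = [
        f"{number}\n{start} --> {end}\n" + "\n".join(text)
        for number, (start, end, text) in enumerate(cues, 1)
    ]
    return "\n\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    # Timestamps below are invented for the demo.
    sample = (
        "0:00:00.599,0:00:04.160\n"
        "Okay so I am doing a test\n"
        "\n"
        "0:00:04.160,0:00:06.770\n"
        "video here and I'm going to\n"
    )
    print(sbv_to_srt(sample))
```

The script only rewrites the timing lines and numbers the cues; the caption text still has to be corrected by hand, which is where the real time goes.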

The final edited .SUB file looks like this:

   Okay so I am doing a test
   video here and I'm going to
   read this to see if the
   captioning system works well

As predicted, the most strenuous part of the process is the actual editing of the transcript. Even though the machine transcript had gotten about 50% of the content correct, it still took close to 45 minutes for me to edit the three minutes of video. It is clear that I talk pretty fast, as there were 75 lines of text that had to be edited. I can’t imagine doing this for anything longer.
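My “about 50%” is an eyeball estimate. If you have the exact script, as I did, one common way to put a number on it is word accuracy: one minus the word-level edit distance divided by the number of words in the script. The sketch below is only an illustration of that idea under my own assumptions; the function names are hypothetical, the tokenizing is deliberately crude, and it is not how the figure above was arrived at.

```python
import re


def words(text):
    """Lowercase the text and keep only letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_edit_distance(ref, hyp):
    """Levenshtein distance counted in whole words (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from transcript
                            curr[j - 1] + 1,      # extra word in transcript
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, transcript):
    """Return 1 - (word errors / reference length), floored at zero."""
    ref, hyp = words(reference), words(transcript)
    if not ref:
        raise ValueError("reference text is empty")
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))


if __name__ == "__main__":
    # The opening lines of my script and of the machine transcript.
    script = ("Okay so I am doing a test video here and I'm going to "
              "read this to see if the captioning system works well")
    machine = ("okay so am I- of doing it tested video here it and I'm going to "
               "read this to see if the captioning system works well")
    print(f"word accuracy: {word_accuracy(script, machine):.0%}")
```

A single opening sentence says little about a whole video, so run it on the full script and the full downloaded transcript before drawing conclusions.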

So, I’ve learned a few things here:

First, YouTube’s “automatic captioning/machine translation” is far from perfect and should not be used, at this point, for anything other than amusement. I am not sure if Google has a timeline for when this will get better, but until it reaches an accuracy of 85% or higher, I would not rely on it as a usable transcription.

Second, while machine translation followed by human editing is clearly more accurate than machine translation alone, the time savings may not be all that one might imagine. I’m guessing that a professional transcriptionist using state-of-the-art equipment would have been able to transcribe the three minutes of video a lot faster than I was able to edit the machined version.

Last, we are still a long way from fully accurate S-t-T and if you are going to use videos on your websites, and want them to be accessible, you are probably still going to have to pay someone to create a transcript/caption file for you.

Note: jeremykemp has posted a YouTube video comparing human vs. machine translation on several video clips. You can see the errors produced by the machine transcription.

Second Note: Human-transcribed captioning is still your best bet. The company we have used for transcription/captioning is AST Sync out of California. They are very easy to work with, provide great customer service, are fast and very reasonably priced (about $185 per one hour of video; less if you already have a transcript). If you do it yourself, count on it taking you a minimum of 3-to-1 in staff time to do a complete transcription with edits and time marks. In other words, a one hour video will take three hours of your – or someone’s – time to get a quality final product. You can do the math.
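For those who would rather let the computer do the math, here is a small sketch in Python. The $185 rate, the 3-to-1 ratio, and my own 45 minutes for three minutes of video come from this post; the staff hourly rate is a placeholder you supply yourself, and the function names are made up for illustration.

```python
VENDOR_RATE_PER_VIDEO_HOUR = 185.0  # dollars, the rate quoted in this post
DIY_STAFF_RATIO = 3.0               # staff hours per hour of video (the minimum)
EDIT_MINUTES = 45.0                 # my time editing the machine transcript
VIDEO_MINUTES = 3.0                 # length of my test video


def vendor_cost(video_hours, rate=VENDOR_RATE_PER_VIDEO_HOUR):
    """Cost of paying a captioning service for the given hours of video."""
    return video_hours * rate


def diy_cost(video_hours, staff_hourly_rate, ratio=DIY_STAFF_RATIO):
    """Cost of doing it in-house; staff_hourly_rate is your own assumption."""
    return video_hours * ratio * staff_hourly_rate


def break_even_staff_rate(ratio=DIY_STAFF_RATIO, rate=VENDOR_RATE_PER_VIDEO_HOUR):
    """Staff hourly rate at which in-house work costs the same as the vendor."""
    return rate / ratio


if __name__ == "__main__":
    print(f"Break-even staff rate: ${break_even_staff_rate():.2f} per hour")
    print(f"My editing ratio: {EDIT_MINUTES / VIDEO_MINUTES:.0f}-to-1")
```

If your staff time costs more per hour than the break-even figure, paying the service is the cheaper route even at the minimum 3-to-1 ratio.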


  6 comments for “Captioning YouTube Videos”

  1. dawn
    June 2, 2010 at 2:32 pm

    thanks for confirming my it’s-not-ready-for-prime-time opinion of YouTube’s transcription service. I use Overstream to transcribe videos. Right after YT came out with their “transcription” service they blocked Overstream from accessing their videos, but acquiesced after much pleading and it’s working again – and honestly it’s easier to use (I think) than CaptionTube.

  2. Dave
    June 3, 2010 at 9:24 am

    It’s nice to see a write-up of someone’s experiences, but you seem to take a harsh slant because the auto-captions aren’t a flawless, one-button process.

    YouTube’s auto-captioning does the hard part – the time codes. A year ago, organizations could say “our videos aren’t captioned because we don’t have the expertise and don’t have the resources to hire someone”. Now, the only expertise needed is for someone to take the auto-captions, watch the video, and type the words they hear. Any organization that can produce and distribute a video can type the words they hear. There aren’t excuses any more!

    The auto-captions don’t sell accessibility as a priority, but they make accessibility achievable for people who already had it as a priority. That’s still a tremendous step.

  3. Jake
    June 17, 2010 at 6:15 pm

    I’ve come to the same conclusion with YouTube’s captions. Simply not worth the editing time. I’m also a MovCaptioner user (I think it’s the best captioning app for the money), but the way I use it now is with MacSpeech Dictate. I had heard that Dictate works with MovCaptioner and it is much easier to say the captions than to type them. Just open MovCaptioner and Dictate. Dictate usually opens a text editor but just close that so it will type into MovCaptioner’s text box instead. Then hit the Start button after enabling Dictate and repeat what you hear in your headphones into your microphone and it will type it all for you. Just hit the Return key to move on to the next bit. This is much less tiring than trying to type it for me. They have a video on their website that shows a demo of this:

  4. Gina
    October 12, 2010 at 10:59 am

    Thank you. Automatic captioning for YouTube videos is useful up to a certain extent. And the same is useful for audio transcription also.

  5. November 11, 2011 at 3:06 pm

    Their auto timing for txt files feature is pretty awesome tho. Just uploading a plain txt doc with transcript and it times it out to what’s on screen.

  6. jeb
    November 11, 2011 at 3:53 pm

    Cool. Nice thing to know. Thanks for the feedback.
