Back in March 2010, I rather gleefully blogged about YouTube’s latest feature called “automatic captioning.” Since that time, I have become bemused and amused by the state of this “service.” It seems Google – the owner and operator of YouTube – has been using our videos as fodder for its new Google Voice speech-to-text (S-t-T) translation machine. Google claims, “It (Google Voice transcripts) will improve over time as our transcription engine gets smarter.” It is not clear how the Google transcription engine will get “smarter,” but I’m figuring the more the system is used, the more it will learn, and the smarter it will become…make sense?
Whoever perfects S-t-T stands to make billions in the first year, so it stands to reason Google would be interested in tapping into that treasure chest. But perfecting S-t-T has always been an elusive goal, and anyone worth their salt in the captioning or transcription business knows that human beings still make the best captionists.
That said, at the Accessibility Unconference a few weeks ago, the issue of S-t-T came up and there was lots of interest in YouTube’s “automatic captioning” service. I should note here that YouTube currently calls this a “machine transcription” service and offers it with some caveats. They also seem, in some ways, to be more interested in the language translation tool that was delivered on YouTube at the same time. Perhaps there is more money to be made in the translation of Chinese to English than in S-t-T.
At the Unconference, there was one gentleman who represented a transcription service company in Massachusetts that used a system based upon a combination of automated S-t-T and human power. He claimed that his system was much faster than regular human-only transcription because machines take the first cut at the translation and humans complete the final edits. He also claimed it was flawless. Lastly, he noted that the fee for this service ranged on a scale based upon the quality of the audio. Apparently, the poorer the quality of the speech, the more human intervention is necessary, and the higher the price tag.
So all this got me thinking about the experimental YouTube video I created and posted back in early March. The “automatic captioning,” eh, machine translation, of my video was indeed a bit hilarious. Sharing it with friends, we all howled at the bizarre transcripts that were produced by the system. It was a bit like playing that children’s game, “Telephone,” where you whisper something into someone’s ear and they whisper it into the next person and so on down the line until the last person says it out loud. The final product never comes out correctly and is usually quite funny. And indeed, the YouTube “machine transcription” was much the same.
For my test video, I purposely read a printed text – as opposed to spontaneous speech – so I would have an exact copy of the content against which to compare the transcript. The results were marginal at best and, honestly, the transcript made no logical sense. It was also amazing what YouTube’s machine translation failed to recognize. The machine translation had a particularly difficult time with the words “accessibility” and “web design.” Go figure.
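For anyone who wants to put a number on that kind of comparison, here is a minimal sketch in Python. It assumes you have the text you read and the machine transcript as plain strings, and it measures accuracy as the share of the original words that show up, in order, in the transcript. That definition, and the function names normalize and word_accuracy, are my own assumptions for illustration; this is not how YouTube or any captioning vendor scores a transcript. The sample strings are the short excerpt quoted in this post.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and drop punctuation (keeping apostrophes),
    then split it into a list of words."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() or ch == "'" else " "
        for ch in text.lower()
    )
    return cleaned.split()


def word_accuracy(reference, transcript):
    """Fraction of reference words that also appear, in order, in the
    transcript. This is an assumed, simplified measure, not a standard
    word error rate."""
    ref_words = normalize(reference)
    hyp_words = normalize(transcript)
    if not ref_words:
        return 0.0
    matcher = SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


if __name__ == "__main__":
    what_i_read = (
        "Okay so I am doing a test video here and I'm going to read "
        "this to see if the captioning system works well"
    )
    what_the_machine_heard = (
        "okay so am I- of doing it tested video here it and I'm going "
        "to read this to see if the captioning system works well"
    )
    score = word_accuracy(what_i_read, what_the_machine_heard)
    print(f"Word accuracy: {score:.0%}")
```

A score computed on a short excerpt like this says little about a whole video, so treat it as a rough yardstick rather than a verdict.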
I recently learned that you could download the YouTube machine translation, edit it, and then re-post it to the original YouTube video. So, today I finally got around to trying this and though successful, the process was not without pain.
First, the machine transcript is saved in some unique YouTubian format (.SBV). The content is readable using a simple text editor and looks like this:
0:00:02.179,0:00:07.740
okay so am I- of doing it tested video here it and I'm going to read this to see if the

0:00:07.740,0:00:09.959
captioning system works well
Fortunately, my MovCaptioner software could import the file and provided an easy way to edit the content. But after editing the text, I could not export the transcript without first merging it with a video. I had to grab the original video from YouTube (which I downloaded in .MP4 format) and then load that into MovCaptioner. Once the editing was finished (see the note below about time), I was able to save and export the file in another format (.SUB for Subtitle format) and then upload that transcript file to YouTube.
The final edited .SUB file looks like this:
00:00:02.17,00:00:07.72
Okay so I am doing a test video here and I'm going to read this to see if the

00:00:07.74,00:00:09.94
captioning system works well
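Since both files are just text with a start and end time in front of each caption, they are easy to read with a short script. What follows is a rough sketch in Python, assuming only the layout visible in the two excerpts above (a "start,end" pair of timestamps followed by the caption text); I have not checked it against any official description of the .SBV or .SUB formats, and the names to_seconds and parse_cues are made up for this example.

```python
import re

# A "start,end" pair such as 0:00:02.179,0:00:07.740 or 00:00:02.17,00:00:07.72
TIMESTAMP_PAIR = re.compile(
    r"(\d+:\d{2}:\d{2}\.\d+),(\d+:\d{2}:\d{2}\.\d+)"
)


def to_seconds(stamp):
    """Convert an H:MM:SS.fff timestamp string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_cues(text):
    """Return a list of (start_seconds, end_seconds, caption) tuples.

    Splitting on the timestamp pattern (which has two capture groups)
    yields [prefix, start, end, caption, start, end, caption, ...].
    """
    parts = TIMESTAMP_PAIR.split(text)
    cues = []
    for i in range(1, len(parts) - 2, 3):
        start, end, body = parts[i], parts[i + 1], parts[i + 2]
        # Collapse line breaks and extra spaces inside the caption text.
        cues.append((to_seconds(start), to_seconds(end), " ".join(body.split())))
    return cues


if __name__ == "__main__":
    sample = (
        "0:00:02.179,0:00:07.740 okay so am I- of doing it tested video "
        "here it and I'm going to read this to see if the "
        "0:00:07.740,0:00:09.959 captioning system works well"
    )
    for start, end, caption in parse_cues(sample):
        print(f"{start:7.2f} -> {end:7.2f}  {caption}")
```

Because the script only looks for the timestamp pairs, it does not care whether each caption sits on its own line or the whole file has been run together, which makes it handy for counting how many captions need editing before you start.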
As predicted, the most strenuous part of the process was the actual editing of the transcript. Even though the machine transcript had gotten about 50% of the content correct, it still took close to 45 minutes for me to edit the three minutes of video. It is clear that I talk pretty fast, as there were 75 lines of text that had to be edited. I can’t imagine doing this for anything longer.
So, I’ve learned a few things here:
First, YouTube’s “automatic captioning/machine translation” is far from perfect and should not be used, at this point, for anything other than amusement. I am not sure if Google has a timeline for when this will get better, but until it reaches an accuracy of 85% or higher, I would not rely on it as a usable transcription.
Second, while machine translation followed by human editing is clearly more accurate than machine translation alone, the time savings may not be all that one might imagine. I’m guessing that a professional transcriptionist using state-of-the-art equipment would have been able to transcribe the three minutes of video a lot faster than I was able to edit the machined version.
Last, we are still a long way from fully accurate S-t-T, and if you are going to use videos on your websites and want them to be accessible, you are probably still going to have to pay someone to create a transcript/caption file for you.
Note: jeremykemp has posted a YouTube video comparing human vs. machine translation on several video clips. You can see the errors produced by the machine transcription.
Second Note: Human-transcribed captioning is still your best bet. The company we have used for transcription/captioning is AST Sync out of California. They are very easy to work with, provide great customer service, are fast, and are very reasonably priced (about $185 per hour of video; less if you already have a transcript). If you do it yourself, count on a minimum of 3-to-1 in staff time to do a complete transcription with edits and time marks. In other words, a one-hour video will take three hours of your – or someone’s – time to get a quality final product. You can do the math.
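If you would rather let a computer do that math, here is a small sketch in Python. The $185 rate and the 3-to-1 ratio come from the note above, and the 45 minutes for three minutes of video is my own editing time from this experiment; the function names and the idea of a break-even hourly wage are assumptions added for illustration, and the figures are rough estimates rather than a quote.

```python
def staff_hours(video_minutes, ratio):
    """Hours of staff time needed at a given work-to-video ratio."""
    return video_minutes * ratio / 60


def vendor_cost(video_minutes, rate_per_hour=185.0):
    """Approximate vendor fee, assuming the rate scales with video length."""
    return video_minutes * rate_per_hour / 60


def break_even_wage(ratio, rate_per_hour=185.0):
    """Hourly staff cost at which doing it yourself costs the same as
    paying the vendor (an assumed, simplified comparison)."""
    return rate_per_hour / ratio


if __name__ == "__main__":
    my_ratio = 45 / 3  # minutes spent editing per minute of video
    print(f"My editing ratio: {my_ratio:.0f}-to-1")
    print(f"One-hour video at 3-to-1: {staff_hours(60, 3):.1f} staff hours")
    print(f"One-hour video at my ratio: {staff_hours(60, my_ratio):.1f} staff hours")
    print(f"Vendor fee for one hour of video: ${vendor_cost(60):.2f}")
    print(f"Break-even wage at 3-to-1: ${break_even_wage(3):.2f} per hour")
```

The comparison ignores things like turnaround time and the quality of the audio, so use it only to get a feel for the trade-off.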