After implementing the “Import Text” from Workspace feature, I tried to eat my own dog food by captioning lots of songs with the tool. It turns out that while the Import Text function is extremely useful (it cuts down the typing significantly!), it could be a lot better. The idea of the “Import Text” from Workspace feature is that for every line selected in the text box, a new segment is created and dumped into the timeline track, saving me the typing. However, the duration of each segment was a fixed constant, 1 second. I still had to adjust the start time and the duration of every caption in another play-through to make sure the captions were synchronized correctly.
To improve the feature, I played with a few ideas for how the text importer could estimate the duration of a segment and use that value instead of the default 1 second. I thought of phonetic analysis, got a few PDF papers from Google, and skimmed through them. Right from the start, I knew that I wouldn’t need very high accuracy for the estimation. More importantly, I had a very limited time budget (an hour or two) to implement the feature. Digging through complex phonetic algorithms would easily take a few days, and would probably be overkill for Captionizer.
I thought some more, but not too hard, because I had been working on so many captions over the past few days that the music kept playing in my head. Then suddenly I realized that I could just count the vowels in a word and, from there, estimate how long the word takes to say. Even though I’m not a linguistics expert, the algorithm can still come up with a reasonable guesstimate of how long a sentence takes to say!
Estimating how long a sentence takes to say
Let’s take an example. We have this sentence (read it aloud yourself):
Read This Sentence Aloud
Counting the vowel groups (2 vowels next to each other can be grouped together as one phonetic sound), we will have
[ ea, i, e, e, e, a, ou ]
That is 7 vowel groups. With normal phonetic stress and regular gaps between the words when spoken, I assume that it takes an average person 0.3 seconds to pronounce a vowel (or vowel group). So the above sentence would take 7 x 0.3 = 2.1 seconds at average speed.
There are of course exceptions: words such as “there”, “where”, and “these” contain more than one vowel group but are still pronounced as one sound. However, the overall spoken time of a sentence averages out because it contains other words as well.
The implementation is a mere 3 lines in JavaScript:
var wordSpokenLength = 0.3; // seconds per vowel group
var vowels = text.trim().split( /[AEIOU]{1,}/i ); // pieces of text between the vowel groups
var totalDuration = vowels.length * wordSpokenLength;
(totalDuration is the suggested value for how long the segment should be)
The crux of the algorithm is splitting the whole sentence into groups. The estimated duration is the number of groups multiplied by the average spoken length. Instead of using a regular string delimiter for the splitting, I used a small regex that matches a run of the standard Latin vowels A, E, I, O, U, with possible repetitions. For example, “aloud” should be split into 2 pronounceable groups: “a” and “ou”. If you execute the split in the JavaScript console, it returns 3 pieces, ["", "l", "d"], instead of 2, because split returns the text around the vowel groups rather than the groups themselves, so the count is always one higher than the number of vowel groups. That’s okay here, since the result is only an estimate.
This method worked surprisingly well when I tested it with a few songs. I chose Rihanna’s new song, Take a Bow, because it is a slower-paced song. For the most part, the suggested duration was in good agreement with the singing, even though Rihanna stretches her voice longer at the end of a line, so I still needed to stretch out some imported caption segments.
I tested again with Boys Like Girls – Thunder, a more popular rock song, and Kardinal Offishall/Akon – Dangerous, a hip-hop/rap song. 0.3 seconds per average spoken word seems to work well in all of those cases. However, with “Dangerous”, the rapping part is a bit faster than the suggested values, but this is an exception because rappers go for speed! I will probably need to lower the average value to 0.1 second to get a better estimate.
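To make the off-by-one concrete, here is a small sketch that compares the split-based count with a direct count of the vowel groups using match. The helper names countBySplit and countVowelGroups are made up for this illustration and are not part of Captionizer.

```javascript
// Hypothetical helper names, for illustration only.

// The approach described above: count the pieces left after splitting on vowel groups.
function countBySplit(text) {
  return text.trim().split(/[aeiou]+/i).length;
}

// Alternative: count the vowel groups themselves.
function countVowelGroups(text) {
  var groups = text.match(/[aeiou]+/gi); // null when the text has no vowel at all
  return groups ? groups.length : 0;
}

var sentence = "Read This Sentence Aloud";
console.log(countBySplit(sentence));     // 8
console.log(countVowelGroups(sentence)); // 7
```

With the 0.3 second constant, the split version suggests one extra group (0.3 seconds more) per line. Whether that small padding helps or hurts probably depends on the song.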
Since the vowels being used are specific to English, the algorithm won’t work as well with languages that have different or more vowels, such as my beloved Vietnamese. In Vietnamese, we have vowels such as â, ă, ê, ơ, ô, ư, with different accents such as ầ, ả, à, á, ê, ề, ế, ơ, ờ, ư, etc. The current pattern would fail to split the sentence at those vowels, and the estimated duration would be shorter than it should be. To overcome this, the splitter pattern needs to be updated to include all those different vowels so that we get better split results. And for non-Latin languages such as Chinese, Korean, Japanese, Arabic, Mayan, etc., my method also fails miserably.
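One possible way to handle accented Latin vowels, sketched below, is not to list every accented letter in the pattern but to fold each letter down to its base vowel first and then reuse the same A, E, I, O, U pattern. This is an alternative to extending the pattern, not what the importer currently does. It assumes a JavaScript engine that supports String.prototype.normalize (ES2015 or later), and the function names are hypothetical.

```javascript
// Sketch under assumptions: requires String.prototype.normalize (ES2015+).
// Function names are hypothetical, for illustration only.

// NFD decomposes a letter such as "ề" into "e" plus combining marks.
// The range U+0300 to U+036F is the Combining Diacritical Marks block,
// so removing it leaves only the base letters.
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Count vowel groups after folding accented vowels to a, e, i, o, u.
function countVowelGroupsFolded(text) {
  var groups = stripDiacritics(text).match(/[aeiou]+/gi);
  return groups ? groups.length : 0;
}

console.log(countVowelGroupsFolded("Cảm ơn")); // 2
```

This only addresses accented Latin letters. It is a starting point for Vietnamese at best, and it does nothing for the non-Latin scripts mentioned above.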
Another idea I’m thinking about is that the importer could learn the pace of the current captions by keeping some kind of statistic of the average spoken word length. “Dangerous” has a higher word density per caption, so its average should be 0.2 instead of 0.3, while Rihanna’s song has fewer words per caption and longer durations, so its value should be 0.35 instead of 0.3. However, this would be a far more complicated implementation. And right now, from my hands-on experiments, 0.3 seconds seems to do the job.
I am quite pleased with this tweak and how it helps make the captioning process easier and faster. As we develop TubeCaption further, we have collected more feedback from friends and family, and we have also learned more about the process. It has been a long development road since the first prototype of Captionizer. We added more and more features, tools, shortcuts, etc., with one single goal: to make captioning less tedious and more fun. We hope you enjoy it as well!
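A rough sketch of that pace-learning idea could look like the following. It assumes the captions that were already timed by hand are available as plain objects with a text and a duration in seconds; that data shape and all the function names are assumptions for illustration, not Captionizer’s actual data model.

```javascript
// Hypothetical data shape: each timed caption is { text: string, duration: seconds }.
var DEFAULT_SECONDS_PER_GROUP = 0.3;

function countVowelGroups(text) {
  var groups = text.match(/[aeiou]+/gi); // null when there is no vowel
  return groups ? groups.length : 0;
}

// Average seconds per vowel group over the captions timed so far.
// Falls back to the fixed 0.3 when there is nothing to learn from.
function estimatePace(timedCaptions) {
  var totalSeconds = 0;
  var totalGroups = 0;
  timedCaptions.forEach(function (caption) {
    totalSeconds += caption.duration;
    totalGroups += countVowelGroups(caption.text);
  });
  return totalGroups > 0 ? totalSeconds / totalGroups : DEFAULT_SECONDS_PER_GROUP;
}

// Suggested duration for a newly imported line at the learned pace.
function suggestDuration(text, secondsPerGroup) {
  return countVowelGroups(text) * secondsPerGroup;
}
```

For example, calling estimatePace on the segments adjusted so far and passing the result to suggestDuration for the remaining lines would let a fast rap verse and a slow ballad each get their own constant.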

