The ability to make computers read text has been a commonplace feature of daily life for quite some time now. We have grown accustomed to hearing metallic voices in our smartphones read monotone summaries of the day’s weather and headlines. Recently, however, artificial voice technology has been going through a renaissance in output quality. Content creators now have access not only to traditional TTS (Text-to-Speech) services, but to incredibly human-like voiceover software that mimics natural speech. In 2005, futurist Ray Kurzweil predicted that the price-performance of speech synthesizers would improve exponentially, which would allow creators to cut significant production costs and democratize access to previously industry-exclusive sound processing tools.
It would not be remiss to say that we have now reached, or are close to reaching, the future that Kurzweil predicted, with machine-read audiobooks and virtual influencers. However, before we arrive at such a judgement of the present, let us take a look back on the technological advancements that enabled us to enter this new period of TTS growth.
Origins & Methodology
The foundation of voice cloning technology is the synthesis of speech using a TTS engine. TTS is in fact a decades-old technology that dates back to the 1960s, when researchers like Noriko Umeda of the Electrotechnical Laboratory and physicist John Larry Kelly Jr. developed the first versions of computer-based speech synthesis. Curiously enough, it was Kelly’s synthesizer demonstration, which recreated the song “Daisy Bell”, that inspired Stanley Kubrick and Arthur C. Clarke to have an electronic voice sing the same song in the climax of the film “2001: A Space Odyssey”.
In the past, before the development of the neural-network-powered TTS models we use today, there were two main approaches to synthesizing voice: Concatenative TTS and Parametric TTS. Both attempted to maximize naturalness and intelligibility, the two most important characteristics of synthesized speech.
Concatenative TTS describes the process of amassing a database of short recorded sound units, ranging from roughly 10 milliseconds to 1 second, which are then directly selected and merged to generate specific sequences of sound. These sequences can be assembled into audible, intelligible sentences, but because the units are uniform and static, Concatenative TTS lacked the phonetic variation and idiosyncrasies that make speech sound natural and emotionally expressive. In addition, producing the datasets required for a fully functioning Concatenative TTS system was incredibly time-consuming.
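To make the idea concrete, here is a minimal, purely illustrative Python sketch: a tiny unit database keyed by phoneme labels, with the selected units joined by a short cross-fade. The database contents and labels are hypothetical stand-ins; real concatenative systems choose among many candidate units using target and join costs.

```python
# Minimal sketch of concatenative synthesis: look up pre-recorded units
# for a target phoneme sequence and join them with a short cross-fade.
# The unit database below is a hypothetical placeholder (random noise);
# real systems store actual recorded diphones/phonemes.
import numpy as np

SAMPLE_RATE = 16_000
CROSSFADE_MS = 10  # short overlap to soften the joins between units

def crossfade_concat(units, sample_rate=SAMPLE_RATE, fade_ms=CROSSFADE_MS):
    """Concatenate waveform units, blending each boundary with a linear fade."""
    fade = int(sample_rate * fade_ms / 1000)
    out = units[0].astype(np.float32)
    for unit in units[1:]:
        unit = unit.astype(np.float32)
        if fade and len(out) >= fade and len(unit) >= fade:
            ramp = np.linspace(0.0, 1.0, fade)
            out[-fade:] = out[-fade:] * (1 - ramp) + unit[:fade] * ramp
            out = np.concatenate([out, unit[fade:]])
        else:
            out = np.concatenate([out, unit])
    return out

# Hypothetical database: phoneme label -> 10 ms..1 s recorded snippet.
unit_db = {
    "h": np.random.randn(800), "e": np.random.randn(1600),
    "l": np.random.randn(800), "o": np.random.randn(2400),
}

phonemes = ["h", "e", "l", "o"]  # target sequence, e.g. for "hello"
waveform = crossfade_concat([unit_db[p] for p in phonemes])
```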
Parametric TTS instead uses statistical models to predict the parameters that make up speech. Once a voice actor is recorded reading a script, researchers can train a generative model to learn the distributions of the recording’s acoustic parameters (such as fundamental frequency, magnitude spectrum, prosody, and spectrogram) together with the linguistic features of the text, and then have the TTS reproduce artificial speech whose parameters resemble the original recording, a step known as vocoding. This means the data footprint is significantly smaller than for Concatenative TTS, and the resulting model is far more flexible in adapting to specific vocal expressions and accents. The statistical averaging also “oversmooths” the output, so audible discontinuities are rare, but it tends to make the speech flat and monotone, and therefore easy to distinguish from a natural voice.
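The pipeline can be sketched roughly as follows, under some simplifying assumptions: a plain least-squares regression stands in for the statistical model, the feature dimensions are arbitrary, and the `vocode` function is only a placeholder for a real vocoder such as Griffin-Lim or WORLD.

```python
# Rough sketch of the parametric idea: learn a statistical mapping from
# linguistic features of the text to per-frame acoustic parameters, then
# hand the predicted parameters to a vocoder. Shapes and data are
# illustrative assumptions, not a production recipe.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: one row per 5 ms frame of the actor's recording.
# X = linguistic/context features per frame, Y = acoustic parameters
# (e.g. 80 spectral bins + fundamental frequency + energy).
X_train = rng.normal(size=(10_000, 40))
Y_train = rng.normal(size=(10_000, 82))

# "Train" a simple statistical model: linear least-squares regression.
# Classical parametric systems used decision-tree-clustered HMMs/GMMs instead.
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

def predict_acoustics(linguistic_frames: np.ndarray) -> np.ndarray:
    """Predict acoustic parameter trajectories for unseen text."""
    return linguistic_frames @ W

def vocode(acoustic_frames: np.ndarray, sample_rate: int = 16_000) -> np.ndarray:
    """Placeholder vocoder: a real system (e.g. Griffin-Lim, WORLD, or a
    neural vocoder) would turn the parameter trajectories into a waveform."""
    n_samples = acoustic_frames.shape[0] * sample_rate // 200  # 5 ms frames
    return np.zeros(n_samples, dtype=np.float32)

new_text_features = rng.normal(size=(600, 40))  # ~3 s of frames for new text
waveform = vocode(predict_acoustics(new_text_features))
```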
Even with their limitations, it was the development of such TTS methods, built on Linear Predictive Coding (LPC), that made possible iconic consumer speech synthesizers such as the one Stephen Hawking began using in the 1980s and talking games like Milton.
AI-Powered TTS (LOVO)
Today, TTS is going through a rapid new stage of innovation, dominated primarily by the Deep Neural Network (DNN) approach. By leveraging artificial intelligence and machine learning algorithms, the DNN method attempts to remove human intervention from the voice cloning process, fully automating tasks such as smoothing and parameter generation. Of course, science has not yet reached a stage of full automation, but we are getting there.
Some early pioneers of the DNN approach include Google DeepMind’s WaveNet, an autoregressive model built from dilated causal convolutions, and Baidu’s Deep Voice, which is likewise built on convolutional neural networks. These technical descriptions can be dense, but the developmental stage they have brought us to is simple to state: with minimal audio samples, we can now create human-like AI voiceovers to accelerate content production.
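For the curious, the sketch below illustrates the core WaveNet-style idea of stacked dilated causal convolutions in PyTorch. The layer sizes and the simple residual wiring are assumptions made for illustration; a real model adds gated activations, skip connections, and a distribution over quantized audio samples.

```python
# Toy illustration of causal, dilated convolutions: each output sample may
# only depend on past samples, and stacking dilations grows the receptive
# field exponentially with depth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 2):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad so no future leaks in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.pad(x, (self.pad, 0))  # pad only on the left (the past)
        return torch.relu(self.conv(x))

class TinyWaveNet(nn.Module):
    def __init__(self, channels: int = 32, n_layers: int = 6):
        super().__init__()
        self.input = nn.Conv1d(1, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            CausalConv1d(channels, dilation=2 ** i) for i in range(n_layers)
        )
        self.output = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        h = self.input(waveform)
        for layer in self.layers:
            h = h + layer(h)          # simple residual connection
        return self.output(h)         # one next-sample prediction per time step

x = torch.randn(1, 1, 16_000)         # 1 s of audio at 16 kHz
next_sample_preds = TinyWaveNet()(x)
```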
This is where LOVO comes in. Through our advanced HD speech synthesis technology, we not only provide 180+ voice skins in 34 languages so creators can effortlessly build artificial narration in our Studio platform; users can also create natural-sounding clones of their own voice with just 15 minutes of recording data.
Don’t believe us? Check out our demos: