

WaveNet is a deep neural network for generating raw audio waveforms that utilizes probabilistic, autoregressive models, designed by DeepMind, a company acquired by Google in 2014. With recent advancements in deep learning and neural networks, WaveNet's text-to-speech has moved from mere concatenative text-to-speech to a near-fluid, human-like synthesis process - at least as per my comparison with conventional text-to-speech applications.

Text-to-speech synthesis, TTS for short, is the artificial production of human speech. In the past it was largely performed using a process called concatenative text-to-speech. To remember how this actually sounded: at some point, some of you may have dabbled with Adobe's PDF reader's text-to-speech feature and can relate to how incredibly unbearable and robotic it was to listen to. Speech produced this way has largely been outdated for a reason - it mostly sounded incredibly monotone and somewhat robotic.

Concatenative TTS relied on a very large database of short speech fragments recorded from a single speaker, which were then recombined to form complete utterances. Whenever text needed to be converted to speech, the engine would search this large database for speech units matching the input text, then concatenate them to derive the final audio file - think of this as stitching audio fragments together.
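To make that stitching step concrete, here is a minimal sketch of the concatenative idea in Python. Everything in it is illustrative rather than real: production systems stored thousands of recorded units (typically diphones, not whole words), and the word-level keys, sine-tone "recordings", crossfade width, and sample rate here are all assumptions.

```python
import numpy as np

SR = 16_000  # sample rate in Hz (assumed for this sketch)

def tone(freq: float, dur: float) -> np.ndarray:
    """Stand-in for a recorded speech fragment: a short sine tone.
    A real concatenative engine would store fragments recorded
    from a single speaker, not synthetic tones."""
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t).astype(np.float32)

# Hypothetical unit database, keyed by word for readability.
unit_db = {"hello": tone(220, 0.4), "world": tone(330, 0.4)}

def synthesize(text: str, crossfade: int = 160) -> np.ndarray:
    """Look up a unit for each word and stitch the fragments
    together, crossfading across each join so the seam is less
    abrupt (those audible joins are part of what made the
    output sound robotic)."""
    out = np.zeros(0, dtype=np.float32)
    for word in text.lower().split():
        unit = unit_db[word]
        if out.size < crossfade:
            out = np.concatenate([out, unit])
            continue
        fade = np.linspace(0.0, 1.0, crossfade, dtype=np.float32)
        seam = out[-crossfade:] * (1 - fade) + unit[:crossfade] * fade
        out = np.concatenate([out[:-crossfade], seam, unit[crossfade:]])
    return out

audio = synthesize("hello world")  # roughly 0.8 s of stitched audio
```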

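WaveNet's departure from this pipeline is what "autoregressive" means in practice: the probability of a waveform is factored into a product over time steps, p(x) = product over t of p(x_t | x_1, ..., x_{t-1}), so every new audio sample is drawn from a distribution conditioned on all the samples generated before it (in the original paper, a 256-way softmax over mu-law quantized amplitudes). The sketch below shows only that sampling loop; the dummy `next_sample_dist` is an assumption standing in for the actual network of dilated causal convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_sample_dist(history: np.ndarray) -> np.ndarray:
    """Dummy stand-in for the network: returns a probability
    distribution over 256 quantized amplitude levels. WaveNet
    computes this from `history` with stacked dilated causal
    convolutions; this placeholder ignores it."""
    logits = rng.normal(size=256)
    e = np.exp(logits - logits.max())  # softmax over 256 levels
    return e / e.sum()

def generate(n_samples: int) -> np.ndarray:
    """Autoregressive sampling: build the waveform one sample at
    a time, each draw conditioned on everything generated so far."""
    samples = np.zeros(0, dtype=np.int64)
    for _ in range(n_samples):
        p = next_sample_dist(samples)
        samples = np.append(samples, rng.choice(256, p=p))
    return samples

codes = generate(100)  # 100 quantized samples; a real system would
                       # decode these mu-law codes back to raw audio
```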