Can you tell AI-generated computer speech from the live speech of a real human being? The question is getting harder to answer: Google engineers have unveiled a text-to-speech artificial intelligence (AI) system called Tacotron 2, whose human-like articulation can be confusing.
According to tech site Inc.com, Tacotron 2 produces AI-generated speech that almost matches the human voice.
In his address at the Google I/O 2017 developers’ conference, the company’s Indian-origin CEO, Sundar Pichai, talked about how the internet giant is shifting its focus from mobile-first to “AI first” and launched several products and features, including Google Lens, Smart Reply for Gmail and Google Assistant for iPhone.
According to the paper published on arXiv.org, the technology consists of two deep neural networks. The first network translates the text into a spectrogram, a visual representation of audio frequencies over time, and passes it to Google’s existing WaveNet algorithm, which uses that image to generate speech, bringing AI closer than ever to indiscernibly mimicking human speech. The algorithm can read text in different voices and can even generate artificial breaths.
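The two-stage flow described above can be pictured with a toy sketch. This is not Google's code: both functions are hypothetical stand-ins that only mimic the shape of the data passed from one stage to the next, and the constants (channels per frame, frames per character, samples per frame) are illustrative assumptions rather than figures from the article.

```python
from typing import List

# All three constants are illustrative assumptions, not values from the article.
N_CHANNELS = 80          # assumed number of frequency bands per spectrogram frame
FRAMES_PER_CHAR = 5      # assumed number of frames emitted per input character
SAMPLES_PER_FRAME = 300  # assumed number of audio samples produced per frame


def text_to_spectrogram(text: str) -> List[List[float]]:
    """Hypothetical stand-in for the first network: text in, spectrogram frames out."""
    frames: List[List[float]] = []
    for ch in text:
        level = (ord(ch) % 32) / 32.0  # dummy value, no acoustic meaning
        frames.extend([[level] * N_CHANNELS for _ in range(FRAMES_PER_CHAR)])
    return frames


def spectrogram_to_waveform(frames: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for the WaveNet stage: frames in, audio samples out."""
    samples: List[float] = []
    for frame in frames:
        amplitude = sum(frame) / len(frame)  # dummy value derived from the frame
        samples.extend([amplitude] * SAMPLES_PER_FRAME)
    return samples


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> spectrogram -> waveform."""
    return spectrogram_to_waveform(text_to_spectrogram(text))
```

The point of the sketch is only the hand-off: the second stage never sees the text, just the spectrogram produced by the first.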
If the researchers’ claims are to be believed, the model achieved a mean opinion score (MOS) of 4.53, compared with a MOS of 4.58 for professionally recorded speech.
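A mean opinion score is simply the average of listener ratings. The sketch below shows the arithmetic, assuming a 1-to-5 rating scale; the function name and the sample ratings are made up for illustration, and only the 4.53 and 4.58 figures come from the reported results.

```python
from statistics import mean
from typing import Sequence


def mean_opinion_score(ratings: Sequence[float]) -> float:
    """Average listener ratings, assuming each rating lies on a 1-to-5 scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the assumed 1-to-5 scale")
    return float(mean(ratings))


# Made-up ratings for illustration only; these are not data from the paper.
example_ratings = [5, 4.5, 4, 5, 4.5]
example_mos = mean_opinion_score(example_ratings)

# Gap between the two reported scores: recorded speech minus Tacotron 2.
reported_gap = round(4.58 - 4.53, 2)
```

On the reported numbers, the gap between synthesized and professionally recorded speech is 0.05 points.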
Based on its audio samples, Google claims that Tacotron 2 can tell the noun “desert” from the verb “desert,” and the noun “present” from the verb “present,” and alter its pronunciation accordingly. The paper also reveals that it can place emphasis on capitalized words and apply the proper inflection when asking a question rather than making a statement.
Though Google’s engineers are reluctant to reveal much, they did leave a hint that lets developers gauge how close the new system has come to human speech.
In these recordings, the voice says “That girl did a video about Star Wars lipstick.”
Tacotron 2 works well on out-of-domain and complex words.
“Generative adversarial network or variational auto-encoder.”
“Basilar membrane and otolaryngology are not auto-correlations.”
According to the report, each of the ‘.wav’ sample files has a filename containing either the term “gen” or “gt”. Based on the paper, it’s highly likely that “gen” indicates speech generated by Tacotron 2 and “gt” is real human speech.
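A reader who wants to sort the samples by that naming clue could use something like the sketch below. It assumes the “gen” and “gt” terms appear as separate tokens in the filename; the example filenames are hypothetical, and the mapping of “gt” to human speech is the article's inference, not a confirmed fact.

```python
import re
from pathlib import Path


def label_sample(filename: str) -> str:
    """Guess a sample's origin from its filename, following the gen/gt clue."""
    stem = Path(filename).stem.lower()
    # Split on anything that is not a letter or digit (underscores, dashes, dots).
    tokens = re.split(r"[^a-z0-9]+", stem)
    if "gen" in tokens:
        return "generated"  # presumed Tacotron 2 output
    if "gt" in tokens:
        return "human"      # presumed real human recording
    return "unknown"


# Hypothetical filenames, for illustration only.
for name in ["sample_01_gen.wav", "sample_01_gt.wav", "intro.wav"]:
    print(name, "->", label_sample(name))
```

Matching whole tokens rather than substrings avoids mislabeling a file whose name merely contains those letters, such as a hypothetical "gentle.wav".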