Microsoft's new language model, Vall-E, is said to be able to mimic any voice using only a three-second recording sample.
The recently announced AI tool was trained on 60,000 hours of English speech data. It can replicate the emotions and tone of a speaker, the researchers said in a paper posted to the arXiv preprint server.
That appears to hold even when the model generates words the original speaker never actually said.
“Vall-E demonstrates in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a three-second recording of an unseen speaker as an acoustic prompt. Experiment results show that Vall-E significantly outperforms the state-of-the-art zero-shot [text-to-speech] system in terms of speech naturalness and speaker similarity,” the authors wrote. “In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.”
Vall-E samples shared on GitHub sound eerily similar to the speakers they are meant to mimic, though they range in quality.
In one sample synthesized from the Emotional Voices Database, Vall-E calmly says the sentence: “We have to reduce the number of plastic bags.”
The research into text-to-speech AI comes with a warning, however.
“Since Vall-E can synthesize speech that preserves the speaker's identity, it may carry potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker,” the researchers say. “We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. When the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model.”
Currently, Vall-E, which Microsoft describes as a “neural codec language model,” is not available to the public.