In the past, making a digital copy of a person’s voice was a complex, time-consuming and pricey process that served only those at risk of losing their speech due to sickness or injury. It meant recording many phrases, each spoken many times and in different contexts. Fast forward to today and the gap between human and computer speech is closing, as powerful software and hardware can now analyze and clone any person’s voice in less than a minute.
Lyrebird, a Canadian company, specializes in speech synthesis software. It has developed software that it claims can copy anyone’s voice and make it say anything.
The founders tell me that if they can get a high-quality recording of you speaking for just five minutes, their software can replicate your voice to say almost anything with very high accuracy.
The recreated voices sound artificial, but you can definitely tell who’s who. The company has posted samples on its website of statements that sound very much like Barack Obama and Donald Trump.
To do that, they acquired speeches both presidents have made in the past and had their software recreate the two voices holding a made-up conversation. Again, it’s a bit robotic sounding, but give them a few more years and it will be a perfect copy.
Last year, Adobe gave a sneak peek of a feature it calls “VoCo,” short for “voice conversion.” Both VoCo and Lyrebird work in a similar way. They analyze a recording of someone’s voice and break it into component parts called phonemes. Then they present you with a text box, where you can type anything you want. The system uses the voice model to construct new words and phrases, even if they weren’t in the original recording.
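To make the idea concrete, here is a toy sketch in Python of building a "voice model" from phonemes and then assembling a word that was never recorded. It is only an illustration under heavy assumptions: the tiny lexicon, the function names and the made-up number lists standing in for audio are all hypothetical, and neither Lyrebird nor Adobe has said their software works this way internally.

```python
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon: word -> sequence of phoneme labels.
LEXICON: Dict[str, List[str]] = {
    "hi": ["HH", "AY"],
    "ham": ["HH", "AE", "M"],
    "my": ["M", "AY"],
}


def build_voice_model(clips: List[Tuple[str, List[float]]]) -> Dict[str, List[float]]:
    """Keep the first audio snippet seen for each phoneme label."""
    model: Dict[str, List[float]] = {}
    for phoneme, samples in clips:
        model.setdefault(phoneme, samples)
    return model


def synthesize(text: str, model: Dict[str, List[float]]) -> List[float]:
    """Look up each word's phonemes and join their snippets end to end."""
    audio: List[float] = []
    for word in text.lower().split():
        for phoneme in LEXICON[word]:  # raises KeyError for unknown words
            audio.extend(model[phoneme])
    return audio


# Pretend the speaker was recorded saying "hi" and "ham".
# The short number lists are placeholders for real audio samples.
clips = [
    ("HH", [0.1, 0.2]), ("AY", [0.3]),
    ("HH", [0.9]), ("AE", [0.4]), ("M", [0.5, 0.6]),
]
voice = build_voice_model(clips)

# "my" was never spoken, but both of its phonemes were.
new_audio = synthesize("my", voice)
```

Simply gluing snippets together like this would sound choppy in practice, which is part of why the samples available today still come across as robotic.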
The idea is that with a properly trained voice model, you can make anyone say anything. What’s more, Lyrebird’s creators say you can also manipulate the tone and emotion of an artificial voice. So in addition to changing what someone says, you can also change how they say it.
Whether it’s Lyrebird, Adobe or some other company leading the way, it seems we’re entering a future where people’s voices can be easily copied or forged. This opens up a big can of worms.
In some ways, the ethical concerns of manipulating people’s voices to make them say anything are analogous to the ethical concerns of using photo editing software to alter a photo or adding CGI special effects to a video. I expect this technology to advance over the next 10 years to the point where we won’t be able to tell the difference between an artificial voice and a real voice. Just think: what if you could have anyone say anything? It’s going to be a big concern.
The immediate application for this technology would be to provide more accurate speech synthesis for people with disabilities or traumatic brain injuries. Actors with the means to copyright their voices could also cash in on it.
I’d also make a bet on the future of voice-controlled computing. I imagine a world where more and more of our interactions with the digital world are done through voice. That means an increased focus on voice recognition and voice synthesis.
We use our voices to build trust in so many aspects of our lives, whether it’s your bank verifying your identity with a voiceprint, or a friend recognizing your voice on the other end of a phone line. But soon, the ability to forge or mimic someone’s voice may be as commonplace as retouching photos or adding filters in Snapchat.