Voice-replicating technology has improved impressively in the last few years, thanks to new generative speech models.
Various products now let people generate relatively convincing copies of someone’s voice from surprisingly little input. Voice Engine from OpenAI, for example, claims to use “text input and a single 15-second audio sample to generate natural-sounding speech that closely resembles the original speaker.”
The results are pretty good, if occasionally straying into the uncanny valley.
But with cool new technology come people hoping to exploit it for nefarious purposes. The USA, for instance, has already seen one scam involving a robocall impersonating President Joe Biden, urging Democrats in New Hampshire not to vote in the presidential primaries. Not every scheme is as ambitious as trying to influence who ends up in the White House: other people have received scam calls supposedly from loved ones, aimed at extracting some good old-fashioned money.
It’s a problem, but cybersecurity researchers are working on a solution in the form of audio watermarking. Meta – the parent company of Facebook and Instagram – has created a product called AudioSeal, which it calls “the first audio watermarking technique designed specifically for localized detection of AI-generated speech”.
At the moment, detecting synthesized audio generally relies upon training algorithms to distinguish it from normal speech. In a different approach, the team looked at ways that AI-generated speech could be “watermarked” with imperceptible noise.
“Watermarking emerges as a strong alternative. It embeds a signal in the generated audio, imperceptible to the ear but robustly detectable by specific algorithms,” the team behind the technique explains in a paper posted to pre-print server arXiv (meaning it’s yet to be peer-reviewed). “It is based on a generator/detector architecture that can generate and extract watermarks at the audio sample level. This removes the dependency on slow brute force algorithms, traditionally used to encode and decode audio watermarks.”
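To make that idea concrete, here is a deliberately simplified sketch of the embed-and-detect pattern: a faint pseudorandom signal is mixed into the waveform, and a detector later checks short windows of audio for its presence. This is an illustrative spread-spectrum-style toy, not AudioSeal’s neural generator/detector – the function names, key scheme, and parameters are all assumptions made for the example.

```python
import numpy as np

# Toy illustration of additive audio watermarking (not AudioSeal's method):
# a faint pseudorandom carrier, derived from a secret key, is mixed into the
# waveform; detection correlates short windows of audio against that carrier.

def embed_watermark(audio, key, strength=0.005):
    """Add a low-amplitude pseudorandom carrier to the waveform."""
    rng = np.random.default_rng(key)  # hypothetical scheme: the key seeds the carrier
    carrier = rng.choice([-1.0, 1.0], size=len(audio))
    return audio + strength * carrier

def detect_watermark(audio, key, window=16000, threshold=0.002):
    """Score each window by correlating it with the expected carrier.

    Watermarked windows score near the embedding strength; unmarked
    windows score near zero, which is what allows localized detection.
    """
    rng = np.random.default_rng(key)
    carrier = rng.choice([-1.0, 1.0], size=len(audio))
    flags = []
    for start in range(0, len(audio) - window + 1, window):
        score = float(np.dot(audio[start:start + window],
                             carrier[start:start + window]) / window)
        flags.append(score > threshold)
    return flags

# Quick check on a 5-second, 16 kHz test tone
sr = 16000
clean = 0.1 * np.sin(2 * np.pi * 220 * np.arange(5 * sr) / sr)
marked = embed_watermark(clean, key=42)
print(detect_watermark(marked, key=42))   # [True, True, True, True, True]
print(detect_watermark(clean, key=42))    # [False, False, False, False, False]
```

AudioSeal itself replaces the fixed carrier and correlation step with a trained neural generator and detector, per the description quoted above.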
The team told MIT Technology Review that the system is effective at picking up its watermarks, identifying them with between 90 and 100 percent accuracy. However, detection via this method would require voice-generating technology companies to place watermarks within their audio files, something that isn’t necessarily going to happen any time soon.
“Watermarking in general can have a set of potential misuses such as government surveillance of dissidents or corporate identification of whistle blowers,” the team adds in the paper. “Additionally, the watermarking technology might be misused to enforce copyright on user-generated content, and its ability to detect AI-generated audio could increase skepticism about digital communication authenticity, potentially undermining trust in digital media and AI.
“However, despite these risks, ensuring the detectability of AI-generated content is important, along with advocating for robust security measures and legal frameworks to govern the technology’s use.”
The paper is posted on the pre-print server arXiv, while AudioSeal itself is available on GitHub.