A new audio system confuses smart devices that try to eavesdrop

You may know them as Siri or Alexa. Called personal assistants, these smart devices are attentive listeners. Say a few words, and they’ll play a favorite song or direct you to the nearest gas station. But all of this listening carries a privacy risk. To help protect people from eavesdropping, a new system plays soft, carefully calculated sounds. These mask conversations to confuse the devices.

Mia Chiquier is a graduate student at Columbia University in New York. She works in a computer research lab run by Carl Vondrick.

Chiquier explains that smart devices use automatic speech recognition, or ASR, to translate sound waves into text. The new program deceives ASR by playing sound waves that vary with your speech. These added waves muddle the sound signal, making it hard for the ASR to pick out your speech sounds. “It completely confuses this transcription system,” says Chiquier.
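To picture what the ASR is up against, consider this toy sketch (not the authors’ code; the sample values are made up): the smart speaker never hears clean speech, only the sum of the speaker’s waveform and the camouflage waveform.

```python
def mix(speech, mask):
    """Return the element-wise sum of two equal-length lists of
    audio samples -- the mixture a smart speaker actually hears."""
    return [s + m for s, m in zip(speech, mask)]

speech = [0.2, -0.5, 0.7, 0.1]   # hypothetical speech samples
mask = [0.05, 0.1, -0.3, 0.2]    # hypothetical camouflage samples
mixture = mix(speech, mask)      # the ASR must decode this mixture
```

The camouflage works because the ASR has no way to separate the two added signals back out when the mask is tailored to the speech.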

She and her colleagues describe their new system as “voice camouflage.”

The volume of the masking sounds is not what matters. In fact, these sounds are soothing. Chiquier compares them to a small air conditioner humming in the background. The trick to being effective, she says, is to match those sound waves, called “attacks,” to what someone is saying. To do that, the system predicts the sounds a speaker will make next. It then quietly plays sounds chosen to confuse a smart speaker’s interpretation of those words.

Chiquier described it on April 25 at the International Conference on Learning Representations, held virtually.

Getting to know you

The first step in creating good voice camouflage: Get to know the speaker.

If you text a lot, your phone starts predicting the next letters or words in a message. It gets used to the kinds of messages you send and the words you use. The new algorithm works much the same way.

“Our system listens to the last two seconds of your speech,” Chiquier explains. “Based on that speech, it anticipates the sounds you might make in the future.” And not far in the future — just half a second ahead. The prediction is based on the characteristics of your voice and your speech patterns. That data helps the algorithm calculate what the group calls a “predictive attack.”
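The timing described above can be sketched as a sliding window (a minimal illustration; the sample rate is arbitrary and the echo “predictor” is a placeholder, not the real learned model):

```python
from collections import deque

SAMPLE_RATE = 100            # hypothetical samples per second
CONTEXT = 2 * SAMPLE_RATE    # keep the last two seconds of audio
HORIZON = SAMPLE_RATE // 2   # predict half a second into the future

def predict_future(context):
    """Stand-in for the learned model: just echo the newest samples.
    The real system predicts from voice traits and speech patterns."""
    return list(context)[-HORIZON:]

buffer = deque(maxlen=CONTEXT)          # rolling two-second window
prediction = []
for sample in range(3 * SAMPLE_RATE):   # fake incoming audio stream
    buffer.append(sample)
    if len(buffer) == CONTEXT:
        prediction = predict_future(buffer)  # 0.5 s of guessed sound
```

Each time new audio arrives, the window slides forward and a fresh half-second prediction is made, which is what lets the attack keep pace with live speech.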

That attack is the sound the system plays along with the speaker’s words. And it changes with each new sound someone speaks. When the attack plays at the same time as the words the algorithm predicted, the combined sound waves form an audible acoustic mixture — one that confuses any ASR system.

Predictive attacks are also hard for a smart ASR system to overcome, says Chiquier. If someone tries to stump an ASR by playing a single sound in the background, for example, the device can filter that noise out of the speech sounds. That’s true even if the masking sound changes over time.

The new system creates sound waves based on what a speaker just said. So its attack sounds change constantly, and in unpredictable ways. That, Chiquier says, “makes it very difficult [for an ASR device] to defend against.”

Attacks in action

To test their algorithm, the researchers simulated a real-life situation. They played a recording of someone speaking English in a room with typical background noise. An ASR device transcribed what it heard. The team repeated this test after adding white noise in the background. Finally, they turned on the voice-camouflage system.

With the voice-camouflage algorithm running, the ASR misheard words 80 percent of the time. Common words like “the” and “our” were the hardest to mask. But those words carry little information, the researchers note. Their system was much more effective than white noise. It also worked well against ASR systems designed to remove background noise.
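A crude way to score such a test can be sketched as follows (an assumption-laden simplification: real evaluations use word error rate, which also aligns insertions and deletions rather than comparing position by position):

```python
def fraction_misheard(reference, transcript):
    """Fraction of reference words the transcript got wrong,
    compared position by position (a simplification of the
    word-error-rate metric used in real ASR evaluations)."""
    wrong = sum(r != t for r, t in zip(reference, transcript))
    wrong += abs(len(reference) - len(transcript))
    return wrong / len(reference)

reference = "play our favorite song".split()   # what was actually said
heard = "play our nearest gong".split()        # what the ASR wrote down
print(fraction_misheard(reference, heard))     # 0.5: two of four words wrong
```

The higher this fraction climbs while a masking system runs, the better the camouflage is working.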

The algorithm could someday be built into an app for real-world use, Chiquier says. To keep an ASR system from eavesdropping reliably, “you would just open the app,” she says. “That’s it.” It could be added to any device that emits sound.

That’s still a ways off, though. First come more tests.

This is “nice work,” says Bhiksha Raj. He is an electrical and computer engineer at Carnegie Mellon University in Pittsburgh, Pa. He was not involved in this research. But he also studies how people can use technology to protect their speech and voice privacy.

Raj says the companies behind smart devices currently control how a user’s voice and conversations are protected. But he believes that control should belong to the speaker.

“There are so many aspects to a voice,” Raj explains. Words are one aspect. But a voice can also reveal other personal information, such as someone’s accent, gender, health, emotional state, or physical size. Companies could exploit these features to target users with different content, ads, or prices. They could also sell voice information to others, he says.

When it comes to the voice, “the challenge is how we can hide it,” Raj says. “But we need to have control over at least some of its parts.”
