Figure from our paper: given any waveform, we can modify it slightly to produce another (similar) waveform that transcribes as any different target phrase.
We have constructed targeted audio adversarial examples on speech-to-text transcription neural networks: given an arbitrary waveform, we can make a small perturbation that, when added to the original waveform, causes it to transcribe as any phrase we choose.
In prior work, we constructed hidden voice commands: audio that sounded like noise to a human but transcribed as phrases chosen by an adversary. Our new attack improves on this by making an arbitrary waveform transcribe as any target phrase.
What does this sound like? Below are two audio files. One is the original, which a state-of-the-art automatic speech recognition neural network transcribes as the sentence “without the dataset the article is useless”. The other transcribes as the sentence “okay google, browse to evil.com”. The difference is subtle, but listen closely and you can hear it.
Not only can we make speech recognize as a different phrase, we can also make non-speech recognize as speech. Below is a four-second clip from Bach's Cello Suite No. 1 (which transcribes as nothing), along with an adversarial version that transcribes as “speech can be embedded in music”.
How does this attack work? At a high level, we first construct a special “loss function” based on CTC loss that takes an audio file and a desired transcription as input and returns a real number: the output is small when the audio transcribes as the desired phrase, and large otherwise. We then minimize this loss function by making slight changes to the input through gradient descent. After running for several minutes, gradient descent returns an audio waveform that minimizes the loss, and therefore transcribes as the desired phrase.
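The structure of that optimization loop can be sketched in miniature. The snippet below is a toy illustration, not the real attack: a simple quadratic loss stands in for CTC loss, and the gradient is estimated numerically instead of by backpropagating through a speech model. All names (`ctc_like_loss`, `attack`, the sample values) are illustrative.

```python
def ctc_like_loss(audio, target):
    # Stand-in for CTC loss: small when `audio` matches `target` sample-wise.
    return sum((a - t) ** 2 for a, t in zip(audio, target))

def gradient(loss_fn, audio, target, eps=1e-5):
    # Numerical estimate of the gradient of the loss w.r.t. each audio sample.
    base = loss_fn(audio, target)
    grads = []
    for i in range(len(audio)):
        bumped = list(audio)
        bumped[i] += eps
        grads.append((loss_fn(bumped, target) - base) / eps)
    return grads

def attack(original, target, steps=500, lr=0.1):
    # Optimize a perturbation `delta` so that original + delta minimizes the loss.
    delta = [0.0] * len(original)
    for _ in range(steps):
        perturbed = [o + d for o, d in zip(original, delta)]
        g = gradient(ctc_like_loss, perturbed, target)
        delta = [d - lr * gi for d, gi in zip(delta, g)]
    return [o + d for o, d in zip(original, delta)]

original = [0.2, -0.1, 0.4]
target = [0.25, -0.05, 0.35]   # stand-in for "the desired transcription"
adversarial = attack(original, target)
```

The real attack has the same shape: a differentiable loss that scores how close the current transcription is to the target, and gradient descent over the perturbation until the loss is small.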
We generated these adversarial examples on the Mozilla implementation of DeepSpeech. (To have it recognize these audio files yourself, you will need to install DeepSpeech by following the README and then download the pretrained model. After extracting the tgz, the output_graph.pb file should have an MD5 sum of 08a9e6e8dc450007a0df0a37956bc795.)
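To verify the checkpoint, you can hash the extracted file yourself. The snippet below is a generic sketch; the `models/output_graph.pb` path is illustrative and depends on where you extracted the tgz.

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    # Stream the file in chunks so a large model never has to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# print(md5sum("models/output_graph.pb"))
# should print: 08a9e6e8dc450007a0df0a37956bc795
```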
More Audio Adversarial Examples
Below are examples of our attacks at three different distortion levels. For the adversarial examples, we target other (incorrect) sentences from the Common Voice labels.
First Set (50dB distortion between original and adversarial)
Second Set (35dB distortion between original and adversarial)
Third Set (20dB distortion between original and adversarial)
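One common way to measure this kind of relative level is to compare the peak amplitude of the perturbation against the peak amplitude of the original waveform on a logarithmic (decibel) scale, so that higher numbers mean a quieter perturbation. The sketch below illustrates that convention; the exact metric used in the paper may differ, and the function names are illustrative.

```python
import math

def peak_db(samples):
    # Peak amplitude of a waveform, expressed in decibels.
    return 20 * math.log10(max(abs(s) for s in samples))

def relative_distortion_db(original, adversarial):
    # How many dB quieter the perturbation is than the original signal.
    delta = [a - o for a, o in zip(adversarial, original)]
    return peak_db(original) - peak_db(delta)

# A perturbation 100x smaller in amplitude than the signal is ~40 dB down:
# relative_distortion_db([1.0, -0.5], [1.01, -0.5]) is approximately 40.0
```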