WHEN SPEAKERS ARE ALL EARS

Understanding when smart speakers mistakenly record conversations

Daniel J. Dubois (Northeastern University), Roman Kolcun (Imperial College London), Anna Maria Mandalari (Imperial College London), Muhammad Talha Paracha (Northeastern University), David Choffnes (Northeastern University), Hamed Haddadi (Imperial College London)

Last updated: 07/06/2020

This page contains our preliminary findings. More recent findings are available.


Summary

Voice assistants such as Amazon’s Alexa, Google Assistant, Apple’s Siri, and Microsoft’s Cortana are becoming increasingly pervasive in our homes, offices, and public spaces. While convenient, these systems also raise important privacy concerns: what exactly are these systems recording from their surroundings, and does that include sensitive and personal conversations that were never meant to be shared with companies or their contractors? These aren’t just hypothetical concerns from paranoid users: there have been a slew of recent reports about devices constantly recording audio, and about cloud providers outsourcing to contractors the transcription of audio recordings of private and intimate interactions.


Anyone who has used voice assistants knows that they accidentally wake up and record when the “wake word” isn’t spoken—for example, “Seriously” sounds like the wake word “Siri” and often causes Apple’s Siri-enabled devices to start listening. There are many other anecdotal reports of everyday words in normal conversation being mistaken for wake words. For the past six months, our team has been conducting research that goes beyond anecdotes, using repeatable, controlled experiments to shed light on what causes voice assistants to mistakenly wake up and record. Below, we provide a brief summary of our approach, our findings so far, and their implications. This is ongoing research, and we will update this page as we learn more.

Goals and Approach

Goals: The main goals of our research are to detect if, how, when, and why smart speakers are unexpectedly recording audio from their environment (we call this activation). We are also interested in whether there are trends based on certain non-wake words, type of conversation, location, and other factors. 

Approach: To figure out what smart speakers listen to and wake up to, we need to expose them to spoken words. And if we are to uncover any patterns in what causes devices to wake up, we further need repeatable, native-speaker, conversational audio—along with the corresponding text that was spoken at each moment. In theory, we could accomplish this using researchers who read from scripts, but this would take an enormous amount of time and would cover only a small number of people’s voices.

Instead, we came up with a much simpler approach: we turn to popular TV shows containing reasonably large amounts of dialogue. Namely, our experiments use 125 hours of Netflix content from a variety of themes/genres, and we repeat the tests multiple times to understand which non-wake words consistently lead to activations and voice recording.

Show                  Category
Gilmore Girls         Comedy, Drama
Grey’s Anatomy        Medical drama
The L Word            Drama, Romance
The Office            Comedy
Greenleaf             Drama
Dear White People     Comedy, Drama
Riverdale             Crime, Drama, Mystery
Jane the Virgin       Comedy
Friday Night Tykes    Reality TV
Big Bang Theory       Comedy, Romance
The West Wing         Political Drama
Narcos                Crime drama

We also need ways to detect when smart speakers are recording audio. For this we use several approaches, including capturing video feeds of the devices (to detect lighting up when activated), network traffic (to detect audio data sent to the cloud), and self-reported recordings from smart speakers’ cloud services (when available).  We remove cases where the wake word was spoken in TV shows. Finally, we use closed caption text from each TV show episode to automatically extract which spoken words caused each activation.
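To illustrate the last step, here is a minimal sketch of how an activation timestamp could be mapped back to the closed-caption line being spoken at that moment. It assumes captions in SubRip (.srt) format; the parser, sample captions, and function names are our own illustration, not the study’s actual tooling.

```python
import re
from datetime import timedelta

def parse_srt(text):
    """Parse SRT subtitle text into (start, end, line) triples."""
    pattern = re.compile(
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
        r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\n(.+?)(?:\n\n|\Z)",
        re.S,
    )
    entries = []
    for m in pattern.finditer(text):
        h1, m1, s1, ms1, h2, m2, s2, ms2, line = m.groups()
        start = timedelta(hours=int(h1), minutes=int(m1),
                          seconds=int(s1), milliseconds=int(ms1))
        end = timedelta(hours=int(h2), minutes=int(m2),
                        seconds=int(s2), milliseconds=int(ms2))
        entries.append((start, end, " ".join(line.split())))
    return entries

def words_at(entries, t):
    """Return the caption lines on screen at activation time t."""
    return [line for start, end, line in entries if start <= t <= end]

# Toy captions for illustration.
srt = """1
00:00:01,000 --> 00:00:03,500
I don't like the cold.

2
00:00:04,000 --> 00:00:06,000
Okay, but not today.
"""
print(words_at(parse_srt(srt), timedelta(seconds=2)))  # → ["I don't like the cold."]
```

In the real pipeline, the activation time would come from the camera or network-traffic logs rather than a hard-coded value.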

Testbed:

We focused only on voice assistants installed on the following stand-alone smart speakers:  

  • Google Home Mini, 1st generation (wake words: OK/Hey/Hi Google)
  • Apple HomePod, 1st generation (wake word: Hey Siri)
  • Harman Kardon Invoke by Microsoft (wake word: Cortana)
  • 2x Amazon Echo Dot, 2nd generation (wake words: Alexa, Amazon, Echo, Computer)
  • 2x Amazon Echo Dot, 3rd generation (wake words: Alexa, Amazon, Echo, Computer)

To conduct our measurements, we needed to build a custom monitoring system consisting of smart speakers, a camera to detect when they light up, a speaker to play the audio from TV shows, a microphone to monitor what audio the speakers play (such as responses to commands), and a wireless access point that records all network traffic between the devices and the Internet. 
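As a rough illustration of the camera-based detection step, the sketch below flags video frames in which a device’s light ring is noticeably brighter than its baseline. It assumes grayscale frames represented as 2D lists of pixel values; the region-of-interest coordinates, threshold, and toy frames are our own assumptions, not the actual monitoring pipeline.

```python
def roi_brightness(frame, x0, y0, x1, y1):
    """Mean pixel value inside the region of interest (the device's LED area)."""
    region = [row[x0:x1] for row in frame[y0:y1]]
    total = sum(sum(r) for r in region)
    count = sum(len(r) for r in region)
    return total / count

def detect_activations(frames, roi, baseline, threshold=40):
    """Return indices of frames where the ROI is much brighter than baseline."""
    x0, y0, x1, y1 = roi
    return [i for i, f in enumerate(frames)
            if roi_brightness(f, x0, y0, x1, y1) - baseline > threshold]

# Toy 4x4 grayscale frames; the LED occupies the top-left 2x2 region.
dark = [[10] * 4 for _ in range(4)]
lit = [[10] * 4 for _ in range(4)]
for y in range(2):
    for x in range(2):
        lit[y][x] = 200

frames = [dark, dark, lit, lit, dark]
print(detect_activations(frames, (0, 0, 2, 2), baseline=10))  # → [2, 3]
```

A real implementation would read frames from the camera feed (e.g. with OpenCV) and calibrate one region of interest and baseline per device.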


A picture of our testbed: camera on the top to detect activations, speakers on the left to play video material, smart speakers under test on the right.


An example of a video capture of an activation: the Amazon Echo Dot device in the center is lighting up, signaling a voice activation on 11/24/2019 at 09:52:22.

Key initial findings

Below is a list of some of our initial findings. Everything described below is based on activations when the wake word was not spoken. Of course, all our findings pertain only to the source material (audio from the selected TV shows), and we cannot make claims about more general trends.

  • Are these devices constantly recording our conversations? In short, we found no evidence to support this. The devices do wake up frequently, but often for short intervals (with some exceptions).
  • How frequently do devices activate? The average rate of activations per device is between 1.5 and 19 times per day (24 hours) during our experiments. HomePod and Cortana devices activate the most, followed by Echo Dot series 2, Google Home Mini, and Echo Dot series 3. 
  • How consistently do they activate during a conversation? The majority of activations do not occur consistently. We repeated our experiments 12 times (4 times for Cortana), and only 8.44% of activations occurred consistently (at least 75% of tests). This could be due to some randomness in the way smart speakers detect wake words, or the smart speakers may learn from previous mistakes and change the way they detect wake words.
  • Are there specific TV shows that cause more overall activations than others? If so, why? Gilmore Girls and The Office were responsible for the majority of activations. These two shows contain more dialogue than the others, suggesting that the number of activations is at least in part related to the amount of dialogue.
  • Do specific TV shows cause more activations for a given wake word? Yes. For each wake word, a different show causes the most activations. 
  • Are there any TV shows that do not cause activations? No. All shows cause at least one device to wake up at least once. Almost every TV show causes multiple devices to wake up.
  • Are activations long enough to record sensitive audio from the environment? Yes, we have found several cases of long activations. Echo Dot 2nd generation and Invoke devices have the longest activations (20-43 seconds). For the HomePod and the majority of Echo devices, more than half of the activations last 6 seconds or more. 
  • What kind of non-wake words consistently cause long activations? We found several patterns for non-wake words causing activations that are 5 seconds or longer.
    • For instance, with the Google Home Mini, these activations commonly occurred when the dialogue included words rhyming with “hey” (such as the letter “A” or “they”) followed by something that starts with a hard “G”, or that contains “ol” (such as “cold” and “told”). Examples include “A-P girl”, “Okay, and what”, “I can work”, “What kind of”, “Okay, but not”, “I can spare”, “I don’t like the cold”.
    • For the Apple Homepod, activations occurred with words rhyming with Hi or Hey, followed by something that starts with S+vowel, or when a word includes a syllable that rhymes with “ri” in Siri. Examples include “He clearly”, “They very”, “Hey sorry”, “Okay, Yeah”, “And seriously”, “Hi Mrs”, “Faith’s funeral”, “Historians”, “I see”, “I’m sorry”, “They say”.
    • For Amazon devices, we found activations with words that contain “k” and sound similar to “Alexa,” such as “exclamation”, “Kevin’s car”, “congresswoman”. When using the “Echo” wake word, we saw activations from words containing a vowel plus “k” or “g” sounds. Examples include “pickle”, “that cool”, “back to”, “a ghost”. When using the “Computer” wake word, we saw activations from words containing “co” or “go” followed by a nasal sound, such as “cotton”, “got my GED”, “cash transfers”. Finally, when using the “Amazon” wake word, we saw activations from words containing combinations of “I’m” / “my” or “az”. Examples include: “I’m saying”, “my pants on”, “I was on”, “he wasn’t”.
    • For Invoke (powered by Cortana), we found activations with words starting with “co”, such as “Colorado”, “consider”, “coming up”.
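The consistency measure used above (an activation counts as consistent if it appears in at least 75% of repeated runs of the same material) can be sketched as follows. The run data and function names here are made up for illustration and are not the study’s actual data.

```python
from collections import Counter

def consistent_activations(runs, min_fraction=0.75):
    """
    runs: one set per repeated experiment, each containing identifiers
    for the activations observed in that run (e.g. the caption text
    that triggered the device).
    Returns the activations seen in at least min_fraction of the runs.
    """
    counts = Counter(a for run in runs for a in set(run))
    needed = min_fraction * len(runs)
    return {a for a, c in counts.items() if c >= needed}

# Hypothetical results from 4 repeated runs of the same episode.
runs = [
    {"okay, but not", "i can spare", "pickle"},
    {"okay, but not", "pickle"},
    {"okay, but not", "i can spare"},
    {"okay, but not"},
]
print(sorted(consistent_activations(runs)))  # → ['okay, but not']
```

Here only one phrase triggers the device in at least 3 of the 4 runs, so it is the only activation counted as consistent.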


Ongoing work

This report is just a preview of our larger study on smart speakers. There are several other important open questions that we are in the process of answering, such as:

  • How many activations lead to audio recordings being sent to the cloud vs. processed only on the smart speaker?
  • Do cloud providers correctly show all cases of audio recording to users?
  • Do activations depend on the TV show character’s accent, ethnicity, gender, or other factors?
  • Do smart speakers adapt to observed audio and change whether they activate in response to certain words over time?

We will update this page when we have more details to share.