The Twenty-first Annual Interactive Audio Conference
BBQ Group Report:
The Future of Voice Interfaces
A.K.A. "2016: A Voice Odyssey: I'm sorry, Dave. I'm afraid I can't do that...yet."
Participants:
Dafydd Roche, Dialog Semiconductor
David Berol, Bosch (Akustica)
Adam Smith Kipnis, Samsung Design America
Mohamed El-Hage, Conexant
Randy Stephens, Cirrus Logic
Leng Ooi, Google
Elia Shenberger, Waves
Rich Herr, Magic Leap
Larry Przywara, Cadence
Dave Dashefsky, Analog Devices
Eran Belaish, Ceva
Scott McNeese, Cirrus Logic
Facilitator: Doug Peeler  


Define the future opportunity spaces for seamless (environment-to-environment), human-like, cognitive and affordable voice interfaces in the home.


  • Perfect AI in terms of natural language processing and context awareness

Application areas

  • Home
  • Wearable (as a subset of home)
  • Business
  • Industrial
  • Automotive


  • “Seamless” from environment to environment really is an Iron-Man Jarvis experience. Walking from his car, to his house, to his suit-- the context follows across multiple devices and locations.
  • Walled Gardens - seamless within vs seamless across AI providers. Walled gardens are the cloud voice service providers.

Baseline: What is/are the interface(s)

  • Can be both ways.
  • Subscribe to services
  • Pull Data from your cloud account
  • In quiet conditions, short sentences (commands etc.) such as “What is the weather in Tokyo” are matched to the user's intention with high accuracy by the device (Phone, Echo etc.)
  • Echo Cancellation / Barge-In at reasonable levels
  • Consumer Grade Biometrics
  • Multi Factor Authentication - Face, Fingerprint etc.

High Level - How do we voice enable everyday life?
Tradeoffs - security vs convenience
User definable.


The group agreed that one of the primary and most complex markets (in terms of products, use cases, service providers etc.) is the home. Other markets to be addressed, with their own complex use cases, would be automotive, business and industrial applications. Many of these other systems would be a subset of the smarthome use case.

We split Smarthomes into the needs for interoperability, general future needs, future technology needs and then split into 4 use cases - Commerce, Search, Control and Communication.

However, to enable such technologies, there are common needs and common technologies required.

Issues with today’s system

  • In today’s systems, there are No Multi-User experiences. No way for Mom to buy something from her account, and Dad to buy something from his. The future needs to resolve this with Speaker ID technology.
  • No “Dad” control to stop the kids using equipment.
  • Can’t use Echo to make a phone call.
    • If you ask Alexa to make a call, then audio is routed to the phone, not through Alexa.
    • This raises a fundamental issue.
  • No Intercom type functionality.
  • Lack of Peer to Peer Audio streaming.
  • “Fog” instead of “Cloud”. How can each system share and stream audio between each other? Can one set of microphones work alongside others?
  • Cellphone calls to the ASRA, and noise suppression while on a phone call: issuing a voice command from within a phone call. E.g., “We should call Grandma while you're on the line. Call Grandma!”
    • How does it sound to other folks on the phone call if you ‘deny’ a call?
  • We can’t Whisper Voice Commands today. Other people in the room hear you yelling commands. You look like a mad man. Technical term is Sub Ambient Voice Commands.

Common Technologies Required to Deliver the Experiences discussed below.

  • Biometrics for identification
    • Multi-user capabilities
    • Presence tokens (i.e., wearables, Bluetooth beacons that detect presence of users by seeing their smartphones etc.)
  • Perfect AI
  • Speech Recognition
  • Computer Vision
  • Peer to peer backend. This is needed to allow local streaming of mic and audio data.
  • “Fog” hub for local computing.
    • Most devices assume they are alone in the house. Future collections of systems will need Enumeration and registration within the fog for each device in the home.
    • Fog could connect to multiple walled gardens (WGs) depending on the service required, e.g. search is done by Google, but purchases are done by Amazon.
    • Fog-Host would likely have to be at the router level.
    • Fog on the audio playback (streaming) side is not a concern.
    • Fog on the microphone recording side, together with the voice-to-text service, makes things much harder.
  • Each piece of equipment needs a two-way street from the appliance to the voice agent. Both ends need to make an API to talk to the equipment. However, a local hub could be used to talk to some simple switches etc. (Think of a Wi-Fi to Z-Wave bridge)
  • Smart/soft/networked control elements (i.e., appliance switches)
  • Cloud services (i.e., Google, Amazon, Apple, etc.)
  • There should be a way to disable all or part of conversations
  • Interoperability
    • Each Device has its own name
    • Households desire a “single” AI client
    • API for cross platform interoperability
  • Classes of Devices
    • White Goods Appliances
    • HVAC
    • Safety - Smoke Alarms etc.
    • PDA - Personal Digital Assistant
    • Personal Control - control of entertainment system etc. - local control
    • Wireless Cameras with Audio
  • Voice Capture “layer” as control hub
  • Is there a commercial purpose/value for walled garden cooperation?
  • Services have to develop across WG’s (Walled Gardens)
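
The enumeration, registration and per-service routing ideas in the list above can be sketched roughly as follows. This is only an illustration under our own assumptions: the `FogHub` class, its method names, the `kind-N` naming scheme and the provider strings are hypothetical and do not correspond to any existing product or API.

```python
from typing import Dict, Optional


class FogHub:
    """Hypothetical local hub: enumerates devices and routes services to walled gardens."""

    def __init__(self, routes: Dict[str, str]) -> None:
        # service name -> walled garden provider (assumed configuration)
        self.routes = dict(routes)
        # device id -> description of the registered device
        self.devices: Dict[str, Dict[str, str]] = {}

    def register(self, kind: str, room: str) -> str:
        """Register a device and give it a unique, enumerated id such as 'tv-2'."""
        index = sum(1 for d in self.devices.values() if d["kind"] == kind) + 1
        device_id = f"{kind}-{index}"
        self.devices[device_id] = {"kind": kind, "room": room}
        return device_id

    def provider_for(self, service: str) -> Optional[str]:
        """Return the walled garden that handles a service, or None if unrouted."""
        return self.routes.get(service)


hub = FogHub({"search": "google", "purchase": "amazon"})
first_tv = hub.register("tv", "living room")
second_tv = hub.register("tv", "bedroom")
```

In this sketch each device no longer assumes it is alone in the house: the hub hands out distinct ids for devices of the same kind and decides which cloud service a request would be forwarded to.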


In the area of Commerce, we see five areas of opportunity.

  • Authentication
  • Advertisement
  • Explicit Purchasing
  • Implicit Purchasing
  • Fulfilment

There are three steps to securing a voice system, Identification, Authentication, and Authorization:

Identification:  We recognize your voice and have an idea of your identity.
Authentication: We have validated your identity and have ensured we know who you are.
Authorization: We grant you access based on known settings
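
The three steps can be sketched as a small pipeline. This is a minimal illustration, not a description of any shipping product: the `User` record, the `identify`, `authenticate` and `authorize` functions, the similarity threshold, the spoken PIN and the spending limits are all hypothetical, and the voice-print scores stand in for whatever a real speaker-ID engine would return.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class User:
    name: str
    pin: str            # explicit pre-agreed token (hypothetical second factor)
    spend_limit: float  # per-user authorization rule (hypothetical)


USERS = {
    "mom": User("mom", pin="4821", spend_limit=200.0),
    "kid": User("kid", pin="0000", spend_limit=0.0),
}


def identify(scores: Dict[str, float], threshold: float = 0.8) -> Optional[User]:
    """Identification: pick the best-matching enrolled voice print, if any.

    `scores` maps user name -> similarity in [0, 1] and is assumed to come
    from a speaker-ID engine outside the scope of this sketch.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    if scores[best] < threshold or best not in USERS:
        return None
    return USERS[best]


def authenticate(user: User, spoken_pin: str) -> bool:
    """Authentication: validate the claimed identity with a second factor."""
    return spoken_pin == user.pin


def authorize(user: User, amount: float) -> bool:
    """Authorization: apply the rules configured for this user."""
    return amount <= user.spend_limit


def try_purchase(scores: Dict[str, float], spoken_pin: str, amount: float) -> str:
    user = identify(scores)
    if user is None:
        return "unknown speaker"
    if not authenticate(user, spoken_pin):
        return "authentication failed"
    if not authorize(user, amount):
        return "not authorized"
    return f"purchase approved for {user.name}"
```

The point of the ordering is that a request is rejected at the earliest step that fails, so an unrecognized voice never reaches the authorization rules.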

Currently, devices in the market are authorized by a user, rather than a user being authorized to make transactions.  To enable financial transactions and experience customization, future solutions will need to be able to solve the problem of identification and authentication of the end user.

How do you authorize the person rather than the device? 
As a first step, the end user wakes up the device via voice command and makes a request. Identifying the user will need to rely on solutions that offer a high level of accuracy both in wakeup and in matching the voice print of the end user. This can be accomplished by a variety of methods, such as user-trained keyword entry or other techniques. This is akin to inserting the bank card into an ATM.

Authentication is the natural second step, guaranteeing that the identity of the user is certain. This is akin to entering the PIN, and there are many techniques, both explicit and implicit, commonly referred to as Multi-Factor Authentication. Implicit techniques rely on contextual data and common behaviour to generate an authentication challenge, which can take the form of a query addressed to the end user. Explicit techniques are pre-agreed tokens such as passwords, fingerprints, etc.

Other dimensions of authentication to keep under consideration are:

  • Multi-User Authentication
    • It is highly desirable that devices in homes with multiple inhabitants are capable of distinguishing between them seamlessly.
  • Continuous Authentication
    • If a device could understand us and authenticate us just based on the sound of our voice, this could add to a more natural user experience.
  • Wearable authentication token
  • In scenarios where multiple devices from varying manufacturers and ecosystems exist within the same space, a unification of multiple device endpoint’s authentication schemes may be required.

Authorization is primarily the rules and boundaries set for each user when making a purchase. These rules shall be unique and custom to each individual user.

Advertisement and Market Research is one of the key areas of opportunity; however, eavesdropping technology creates an incentive for decreased privacy that could be disadvantageous to the consumer if it is abused. The key to preserving trust with consumers is to give them transparency into, and control over, the interactions that were captured and stored.

  1. Always Listening for Context-Based Purchase Intent: A user could opt-in for an always-listening experience that monitors all their conversations for purchase intent, and then re-targets them for advertising across platforms.
  2. Assume that a user should always have transparency and control over content that was captured.
  3. If all conversations were captured, then this would essentially be an auto-transcriptionist, and allow for searching “Who said what when.”  How would this affect human interactions?

Explicit purchasing is where a user directly specifies the exact thing that they want, and the device enables the transaction.  Within this domain, there are several opportunity spaces:

  1. Buy Apple Vs. Buy an Apple:  How do we establish certainty?
  2. Stock Market Agent or Grocery Store Agent.  Identify an agent to create context for purchasing.  Include a default context. E.g. Assume grocery unless I say I’m talking about stocks.
  3. Detect stress, and make experiential decisions based on that.
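
The default-context idea in item 2 can be illustrated with a toy rule. This is a sketch under our own assumptions: the cue words and the `purchase_context` function are hypothetical, and a real system would rely on natural language understanding rather than keyword matching.

```python
# Hypothetical cue words that switch the agent away from the default context.
STOCK_CUES = {"stock", "stocks", "share", "shares"}


def purchase_context(utterance: str, default: str = "grocery") -> str:
    """Assume the default agent unless the user says they are talking about stocks."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    return "stocks" if words & STOCK_CUES else default
```

With these assumptions, “Buy an apple” and “Buy Apple” both fall into the grocery context, while “Buy ten shares of Apple” is handed to the stock market agent.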

Implicit purchasing is where the device notices something that wasn’t directed to it, yet it proactively takes action to be helpful. 

  1. Overheard you saying “We need Milk” so I added it to a shopping list
  2. Overheard you saying “We need Milk”, so I purchased it for you
  3. Overheard you saying “We need Milk”, so I said “Should I purchase milk for you?”.
  4. Observed purchase pattern and history, so I added it to a shopping list
  5. Observed purchase pattern and history, so I said “Should I purchase milk for you?”
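
The options above differ only in how far the device goes on its own, which suggests a user-selectable policy. The sketch below is a hypothetical illustration: the `ImplicitPolicy` enum and the `react` function are names we made up, and the returned strings stand in for real actions.

```python
from enum import Enum


class ImplicitPolicy(Enum):
    ADD_TO_LIST = "add_to_list"      # least intrusive
    ASK_FIRST = "ask_first"          # confirm with the user before buying
    AUTO_PURCHASE = "auto_purchase"  # most intrusive


def react(item: str, policy: ImplicitPolicy) -> str:
    """Decide what to do after detecting an implicit need for `item`."""
    if policy is ImplicitPolicy.ADD_TO_LIST:
        return f"added {item} to shopping list"
    if policy is ImplicitPolicy.ASK_FIRST:
        return f"Should I purchase {item} for you?"
    return f"purchased {item}"
```

Making the policy an explicit setting keeps the security versus convenience tradeoff user definable.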

Fulfilment is an area where a user should be able to get information on purchases that they’ve made.

  1. Where’s my stuff?  Has it been shipped?
  2. What’s the ETA?
  3. What was delivered?

To ensure privacy and security, sensitive material used to complete authentication shall be stored in a secure environment. The voice print and authentication criteria can be stored in secure encrypted cores or, more likely, in the cloud. Another consideration when characterizing a user is the dynamics under which their voice print may change; it is preferred that the final solution include features that allow the user’s authentication criteria to be updated over time.

Additionally, it is recommended that alternative authentication methods be enabled in cases where voice input is not feasible, such as illness or a noisy environment.

Many of the techniques described above have been used and are established. Forums such as the FIDO Alliance (A General Security Standards Body) have been specifying the requirements necessary to use voice for financial transactions. As an example: USAA uses voice authentication on their mobile app.


  • Smart Intercom Peer to Peer - not all rooms need to sound when someone speaks. If we know where people are, we can ring to only that room.
  • Phone Call
  • Location in the home detection for call direction
  • Call follows user room to room
  • Challenge - it’s not just being in any room to start a dialog. I want to roam from room to room. Many challenges in echo/acoustic compensation and cancellation.
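
The presence-based routing described above can be sketched as follows. This is a simplified illustration: the `presence` map (room name to the set of people detected there) is assumed to come from some presence-detection system, and the function names are hypothetical. The acoustic handoff challenges (echo cancellation, compensation) are not modelled.

```python
from typing import Dict, List, Optional, Set


def rooms_to_ring(presence: Dict[str, Set[str]], callee: str) -> List[str]:
    """Ring only the rooms where the callee is currently detected."""
    return sorted(room for room, people in presence.items() if callee in people)


def follow_user(active_room: Optional[str],
                presence: Dict[str, Set[str]],
                callee: str) -> Optional[str]:
    """Pick the endpoint for an ongoing call as the user roams room to room."""
    rooms = rooms_to_ring(presence, callee)
    if not rooms:
        return active_room   # user not detected anywhere: keep current endpoint
    if active_room in rooms:
        return active_room   # user has not moved
    return rooms[0]          # hand off to a room where the user now is
```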


Problem Statement:  Natural, intelligible, seamless, and private worldwide human to human communication available anywhere on residential property, in all likely noise cases.

For communication, we see the solution space falling into the following high-level categories with details for each use case listed beneath:

  1. Intercom
    1. Person-person local
    2. Quality not necessarily high (typically Wideband)
    3. Smart Routing (Keep conversations private within house)
    4. Node only starts listening with trigger word
    5. User-definable trust mode
    6. Presence detection
      1. Room to room person tracking
      2. Dial in speakers and mics for location...handoffs between endpoints
    7. Personal ID
    8. Full Duplex
    9. Local processing
    10. Noise and echo suppression, de-reverb
    11. Call control / Selective transmission
    12. Intelligibility>quality
  2. VOIP:  
    1. Call quality more important than intercom (SWB?)
    2. Far end cleanup
    3. Quality>intelligibility
    4. (Also includes Intercom requirements)
  3. Audio to remote security: (Security/safety)
    1. Battery backup
    2. 2 establishment methods:
      1. Hazard trigger from record side to trigger recording, group texts, door lock/unlock, etc.
        1. Turn off not under duress (voice stress) for alarm
        2. Selectable biometrics to turn off
      2. Manual trigger from listening side
    3. Hard to defeat/break connection (non-internet path...landline or cell)
  4. Non-real-time communication (speech to email, social)
    1. Distinguish between message + other speech
    2. Capture inflection in text, and do this in local hub (privacy)
    3. High accuracy transcription using as many mics as possible
    4. Speech processing where needed (noise reduction, echo cancel, de-reverb)
  5. Remote collaboration (Music, party, presentation etc.)
    1. High bandwidth (~48 kHz)
    2. Minimal noise reduction (wind/breath)
    3. Selective voice processing (Noise reduction, echo cancel, de-reverb)
    4. Low latency important
  6. Non-discussed topics
    1. Account selection
    2. Gaming voice
    3. Speech-to-text-to-speech in different voice (games or other)


Search is one of the most commonly used features in existing voice systems, such as Siri and Cortana. From resolving drinking disputes to helping kids with homework, it’s a feature typically used by the whole family, at any time.

This creates use-case and privacy issues around which user is using the system at a particular time, along with understanding “who” is giving the results and “where”. Voice technology is a shared technology, unlike the mobile phone or personal computer. The fundamental mindset should be to expect multiple users, with partitions that keep each user well isolated from the others.

The group pulled together highlights of the technology that would be required to enable the following desired use cases.

  • Speaker Identification
    • Should be performed locally, either in the device, or in the local “fog”
    • Speaker identification to a registered user.
    • Ability to identify non-users and limit guest access accordingly.
    • Admin/Dad/Parental tiered control mode, locking of the system (e.g. allow control of home lights, but no music selection etc.).
    • Multi-user capable - Independent search trails, isolation in history cache? Cross-corruption of search history.
  • Search Context should be stored in the cloud
    • Needs to consistently recognize the user. (see above)
    • Emotional context of voice changes the reaction of the search, e.g. the urgency of toilet paper: “I need toilet paper NOW!!” vs. “Where can I find toilet paper on sale?” could result in an Amazon Prime Now purchase versus a list of nearby places where toilet paper is available.
    • Multi-user with context; search history based on user, parental control, instant privacy button, incognito mode for voice search, history caching for searching back on the content of discussions?
    • Birthday/relative/friend search mode, e.g. ask the assistant what your significant other prefers, based on his/her search history. Privacy concerns? This needs a permissions scheme that sets the level of access others have to your search results (this does not seem to be limited to voice).
    • (Crossover with entertainment) - Who’s the guy in the scene? How long has this movie been on? Have I seen this before?
      • Shall not interrupt experience and should be a real time overlay of search results on your personal display or TV.
  • Search for devices in the home
    • Still leveraging user id
      • Where are my keys? (May require Tile or other device tag)
      • Where is my phone?
  • Completion of transaction
    • This is to aid with privacy and
    • Verbal response - “Thanks” or “Thank you” or UI tone with descending notes.
    • System needs to Timeout after a certain time if no more requests are made of it.
  • Different search servers, handoff protocol, global settings, alternate services:
    • Assistant tells you the source of the answer, i.e. which service provides the answer, using different characters or voices to identify the source or service provider. Doing so is much faster than “Google said this and Amazon found that…”
    • Handover of rendering or served results to other devices or context relevant services; send map to car, send calls to phones, pass calls to another person in other part of the house or geographical locations (ease of use)
    • There are large search and service providers in other countries that should not be ignored in these integrations (Baidu, WeChat, Yandex… etc.)
  • Need a graceful plan for how the system handles searches that end in a failed task completion. “I'm sorry I put a search result in the app” is not a great user experience. The system should understand the difference between errors in recognition, context errors, false triggers and cross-corruption of history, and recommend next steps.
  • Marketing/Directed/Targeted messages over use of service?
    • Negatives
      • Ethical and privacy challenges
      • Unprompted suggestions could be perceived as annoying if handled poorly or too aggressively, in a Clippy type situation.
    • Positives
      • Knowing the demographic of the room, e.g. children under 8, and don't show violent movie ads
      • Ability to introduce new features and updated services to the user
  • The functionality has to meet some required response time per use case. Push to talk Intercom is different than doing an across house jam session. There has to be minimal round trip latency.
  • Hub vs. ad hoc network of voice devices: should one device or family of devices do both far-field and close-range listening (a bunch of Dots), or will there be a network of microphones from various devices listening at the same time (TV, Remote Control, Set Top Box, Hub, Thermostat, Wall Switch, etc.)?
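
The multi-user isolation and incognito ideas from the list above can be sketched as a per-user history store. This is only an illustration under our own assumptions: the `SearchHistory` class and its methods are hypothetical, and the user id is assumed to come from speaker identification.

```python
from collections import defaultdict
from typing import Dict, List


class SearchHistory:
    """Hypothetical per-user search trails with an incognito option."""

    def __init__(self) -> None:
        # Each identified user gets an independent history; nothing is shared.
        self._by_user: Dict[str, List[str]] = defaultdict(list)

    def record(self, user: str, query: str, incognito: bool = False) -> None:
        """Store the query under the speaker's own trail unless incognito."""
        if not incognito:
            self._by_user[user].append(query)

    def history(self, user: str) -> List[str]:
        """Return a copy so callers cannot corrupt another user's trail."""
        return list(self._by_user.get(user, []))


store = SearchHistory()
store.record("mom", "birthday gift ideas")
store.record("dad", "bbq rub recipe")
store.record("dad", "surprise party venues", incognito=True)
```

Keying every entry by the identified speaker is what prevents the cross-corruption of history mentioned above; an unidentified guest would simply map to a separate, limited id.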


  • Stripping away your voice signature and background (like text to speech) and putting another voice signature on it, e.g. playing a game using the voice of a celebrity.


We envision voice as a primary interface for controlling a network of smart home devices in a few years’ time.  Examples of smart devices include the following:

  • Kitchen and laundry room appliances
  • Robots, for example automatic room vacuum cleaners
  • Audio and visual entertainment (TV, stereo)
  • Window shades
  • Irrigation and outdoor lighting
  • Alarms/security, locks
  • Smoke alarm
  • Automotive: starting the car, or the car’s AC/heat
  • BBQ pit (preheat to 300 degrees)

User Experience (UX)

The smart home should be simple to set up for voice control of the smart devices.

Interaction through voice control should be human-like and rely on natural language for specifying commands rather than a limited set of pre-defined commands. Also, voice control should not be limited to close proximity to the controlled device, i.e. the user should be able to control all smart devices using voice commands from anywhere in the house. In terms of feedback, we would expect to get voice feedback whenever we are not in proximity to the controlled device and cannot see the effect of the command.

Unfortunately, we envision a compromise in terms of user experience due to the walled garden nature of future voice interfaces, which stems from the business considerations of major players like Amazon, Apple and Google. This will result in interoperability issues unless users ensure they purchase smart devices compatible with their walled garden. That said, there should be many choices for users as more and more manufacturers adopt one or more providers of voice services, e.g. Sonos has implemented the Alexa API into their products.

We envision central processing, possibly in the cloud, that can control multiple devices (e.g. turn off everything in one command when leaving the house) and also make centralized decisions when multiple mics receive the same command. Suppose you have 6 rooms. The Amazon Dot in each room knows which room it is in and, if the house has multiple instances of a device, treats the ones in its own room as the defaults. For example, a house has 2 TVs, each in a different room. If you’re in the living room (tv1), then “turn on the TV” means turn on the TV in the living room. You will have to specify the other TV by name if you want to turn it on.
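
The room-default rule in the TV example can be sketched as follows. This is a minimal illustration under our own assumptions: the `DEVICES` table and the `resolve` function are hypothetical, and the speaker's room is assumed to be known from whichever microphone endpoint heard the command.

```python
from typing import Optional

# Hypothetical home profile held by the central hub.
DEVICES = [
    {"name": "tv1", "kind": "tv", "room": "living room"},
    {"name": "tv2", "kind": "tv", "room": "bedroom"},
    {"name": "shades", "kind": "shades", "room": "bedroom"},
]


def resolve(kind: str, speaker_room: str, name: Optional[str] = None) -> Optional[str]:
    """Map a spoken command such as "turn on the TV" to one device name."""
    if name is not None:
        # An explicit name always wins, wherever the speaker is.
        return name if any(d["name"] == name for d in DEVICES) else None
    candidates = [d for d in DEVICES if d["kind"] == kind]
    if len(candidates) == 1:
        # Only one such device in the house: no ambiguity to resolve.
        return candidates[0]["name"]
    in_room = [d for d in candidates if d["room"] == speaker_room]
    # Several instances: default to the one in the speaker's room, if unique.
    return in_room[0]["name"] if len(in_room) == 1 else None
```

Returning `None` in the ambiguous case leaves room for the assistant to ask which device was meant instead of guessing.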

We have identified the following technical challenges:

Common challenges:

  • As mentioned above a centralized hub is required to provide adequate user experience (i.e., specific home profile).
  • Removing the voice trigger for a more natural user experience and human-like interaction
  • Security
  • Speaker separation
  • Voice biometrics

Challenges specific for control:

  • User profiles and privileges for control purposes e.g. little Joe shouldn’t turn on the BBQ pit
  • Smart devices need to be enumerated in case of multiple devices of the same kind
  • Interoperability - as mentioned above we don’t envision a complete solution for this challenge but rather a walled garden approach in near future
  • Power consumption- battery operated devices (e.g. smoke alarm) would require an efficient implementation
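
The user profiles and privileges challenge can be illustrated with a simple role table. This is a hypothetical sketch: the role names, device names and the `may_control` function are our own, and the speaker's role is assumed to be established by voice biometrics beforehand.

```python
# Hypothetical role -> controllable devices table for one household.
PRIVILEGES = {
    "admin": {"lights", "music", "locks", "bbq_pit"},
    "child": {"lights"},
    "guest": {"lights", "music"},
}


def may_control(role: str, device: str) -> bool:
    """Unknown roles get no access; known roles get only what is listed."""
    return device in PRIVILEGES.get(role, set())
```

Under this table, little Joe (role `child`) can switch the lights but cannot turn on the BBQ pit.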


Real Time Notes:

  • User Experience Focus
  • What feedback establishes context and acknowledgements
  • Timeouts
  • Naturalness of trigger words
  • What other sensors are necessary?
  • Identity
  • Communications
  • Search Control (Home Automation)
  • Entertainment
  • Existing voice interfaces
    • Starting point
  • Usage Models
  • Technical capabilities
  • Business Model
  • Security
    • Multi factor authentication
  • Modelling human interaction
  • Pitch level detection - detect tone, mood (frustration etc.)
  • Cloud vs. local distribution of processing
  • Commerce, communication, control and search
  • Multi-channel participant

Other reference material:

http://www.projectbarbq.com/reports/bbq13/bbq13r6.htm - Group Report: Using Sensor Data to Improve the User Experience of Audio Applications
http://www.projectbarbq.com/reports/bbq14/bbq14r7.htm - Group Report: Audio opportunities in the Internet of Things
http://www.projectbarbq.com/reports/bbq15/bbq15r4.htm - Group Report: Audio Of Things: Audio Features and Security for Smart Homes/Internet of Things
