The Twentieth Annual Interactive Audio Conference

Group Report: Audio Of Things: Audio Features and Security
for Smart Homes/Internet of Things

Participants: A.K.A. "Audio Of Things"

David Berol, Knowles
Konstantin Merkher, CEVA, Inc.
Avi Keren, DSP Group
Andy Lambrecht, Analog Devices
Mikko Suvanto, Akustica
Whit Hutson, Conexant
Trausti Thormundsson, Conexant
Jack Joseph Puig, Waves
Facilitator: Linda Law, Project Bar-B-Q  

Problem Statement

The Internet of Things (IoT) has become a catch-all for many new technologies and products that are becoming increasingly connected. Many of these technologies address voice-controlled (and, more generally, audio- or sound-controlled) interaction with home and personal systems for home automation or other lifestyle- or safety-enhancing purposes. A number of devices in our lives are already capable of always listening, and this will only increase as more everyday items become part of the IoT. As IoT devices proliferate, voice coverage within the home is expected to increase. This raises concerns regarding safety and security. The audio aspects of the IoT, the Audio of Things (AoT), need careful consideration for security, privacy, interoperability, and collaboration/cooperation.

Adding to the security concerns is the increasing reliance on the cloud for voice and audio processing. There are well-known benefits to performing significant portions of audio processing in the cloud. The business models of Google, Apple, and Amazon steer them to perform much of this processing in the cloud. Each of these companies has introduced one or more devices with the potential to become the centerpiece of a home automation solution. There may be opportunities for alternative solutions that partition the audio problem such that sensitive information can be contained within the home.

Opportunity Statement

The IoT is rapidly evolving and there are numerous opportunities available:

  • Alternative partitioning – Partitioning sound processing such that sensitive information stays within the home presents an opportunity to serve the portion of the market that is concerned about privacy.
  • Privacy and trust – There is an opportunity for a business model focused on privacy and trust.
  • Technology improvements – There are numerous opportunities for technology improvements.

AoT Background

The IoT is an expansive concept that encompasses the interconnection of physical objects in industrial, commercial, and residential environments for the purpose of collecting and exchanging information. A key aspect of the residential IoT is the constant sensing of the environment, such that data collected from numerous sensors can be processed to determine the specific situation and perform desired actions automatically as a result. Sound sensors (microphones) are key elements in the residential IoT and are capable of “always listening”. The ability to be always listening raises privacy concerns among many.

Questions that were raised and addressed include:

  • Are there issues with processing everything in the cloud?
  • When should some data go to the cloud and when should it stay local?  How can the consumer set privacy levels?
  • Is privacy a generational issue?  Do younger generations care less about privacy than older generations?
  • When multiple devices are listening, how are they coordinated such that the right device is collecting sound?
  • How does the consumer provide guidance as to which informati

Looking 5-10 years into the future, processor and memory technologies will improve predictably and, when coupled with new and improved sensors embedded in everyday objects, will enable highly intelligent homes.

Ten years from now, when one walks into the house, the arrival is expected to be anticipated and/or recognized, and the environment automatically set to known preferences. Patterns and preferences will be learned for each resident and actions performed automatically. When multiple residents with differing preferences are present, learned compromises must take place.

AoT Issues, Elements and Solutions

The AoT workgroup looked at a number of general questions and issues as well as the elements of a residential AoT system.

Do sensors within the home generate a complete audio stream that can be mined in the cloud? Or, is a decision made locally and sent point-to-point?

Recommendation: A hybrid system can be ideal, where simple commands (“turn on the light”) are handled locally while cloud processing is used for richer feature sets. The local system can learn and adapt, limiting what needs to be sent to the cloud for further processing. Data should leave the house only when allowed, and should be encrypted; the exception is emergency situations, where emergency handling has been allowed.
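
The hybrid split recommended above can be sketched as a small routing function. This is only an illustration under stated assumptions: the command set, the `PrivacySettings` fields, and the `route` function are hypothetical names, and the emergency rule reflects one reading of the recommendation (data may leave the house in an emergency if the resident has allowed emergency handling).

```python
from dataclasses import dataclass

# Hypothetical set of commands simple enough to handle inside the home.
LOCAL_COMMANDS = {"turn on the light", "turn off the light"}


@dataclass
class PrivacySettings:
    cloud_allowed: bool = False      # resident opt-in for cloud processing
    emergency_override: bool = True  # resident allows uploads in emergencies


def route(utterance: str, settings: PrivacySettings, emergency: bool = False) -> str:
    """Return 'local', 'cloud', or 'blocked' for a recognized utterance."""
    text = utterance.strip().lower()
    if text in LOCAL_COMMANDS:
        return "local"    # simple commands never leave the house
    if settings.cloud_allowed:
        return "cloud"    # richer requests go out only with permission
    if emergency and settings.emergency_override:
        return "cloud"    # assumed emergency exception
    return "blocked"
```

With the default settings, "turn on the light" is handled locally and any richer request is blocked until the resident opts in to cloud processing.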

What happens when access to the cloud is limited or nonexistent?

Recommendation: A local system and/or manual backup should provide functionality during failure states.

Security around Audio Scene Monitoring

    1. Audio scrobbling (non-optional) / active advertising

Recommendation: Be a gentleman about it (Iron Man’s J.A.R.V.I.S.)
Opportunity: A holistic integrator without an agenda may be a business model. The ability to handle search needs through anonymous aggregation to Google may present an opportunity for privacy-conscious homeowners.
Recommendation: Cloud services security concerns might be placated with transparency in usage.

There are two key areas of concern: Command and Control, and Information Privacy.

  • How is it decided what activity is creating a sound?  How are the sounds of a video game differentiated from the sounds of struggle with an intruder? 
  • In a surveillance use-case, how are audio analytics used to decide on a particular action? For example, if a baby is crying, how is a decision made to start recording audio and video vs. alerting a parent vs. calling the police?
  • It now seems impractical to record all of the audio and send the audio all of the time someplace else, but future storage and communication capabilities may make this possible. How is privacy ensured?
  • Artificial Intelligence (AI): How are decisions made?  How is it trained? What data is important?
    • Recommendation: A hybrid system is proposed where some of the feature set is kept local and some is cloud-based. The system should allow for user preference and should comprehend the complexity of decisions to be made. Training can be done automatically (self-adaptation) or mentored (“From now on …”)
    • Recommendation: The system should provide some ability to set preferences on notification for certain events and to perform certain actions on certain events. It should come preloaded with related templates of notifications and use-cases. (“You have a water leak in the basement. What should I do about it: shut off the water? Call a technician?”)
  • The system should allow decisions to be based on both internally created data and external data (e.g., weather conditions, traffic, personal calendar)
  • Each sensor by itself might not be creating information that needs to be protected but the combination might be. 
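
The preference and template recommendation in the list above could look something like the following sketch. The event names, action names, and the `EventRule` and `plan_response` helpers are all hypothetical, chosen only to mirror the water-leak and crying-baby examples; a real system would need a far richer decision model.

```python
from dataclasses import dataclass, field


@dataclass
class EventRule:
    notify: list = field(default_factory=list)   # who to alert
    actions: list = field(default_factory=list)  # candidate actions
    ask_first: bool = True                       # confirm with the resident before acting


# Hypothetical preloaded templates; event and action names are illustrative.
DEFAULT_TEMPLATES = {
    "water_leak": EventRule(notify=["owner"], actions=["shut_off_water", "call_technician"]),
    "baby_crying": EventRule(notify=["parent"], actions=["start_recording"], ask_first=False),
}


def plan_response(event, user_rules=None):
    """User preferences override the preloaded template for the same event."""
    rules = {**DEFAULT_TEMPLATES, **(user_rules or {})}
    rule = rules.get(event)
    if rule is None:
        return {"notify": [], "actions": [], "prompt": None}
    if rule.ask_first:
        # Offer the candidate actions as a question instead of executing them.
        return {"notify": rule.notify, "actions": [], "prompt": rule.actions}
    return {"notify": rule.notify, "actions": rule.actions, "prompt": None}
```

A resident who always wants the water shut off without being asked would supply their own `EventRule` for `"water_leak"` with `ask_first=False`.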

Elements of AoT Audio Subsystem

The AoT workgroup discussed the elements of an AoT audio subsystem and what improvements are needed over the next 5-10 years to achieve the vision of a highly intelligent home.

Microphone, Source Isolation

Large improvements are needed in far field source selection. Far field source selection is a much different problem than speaking directly into the microphone of a smartphone. The presence of multiple microphones coupled with algorithms such as beamforming can be utilized to provide the necessary improvement. Although smartphones aren’t currently optimized for far field use, the audio picked up by a smartphone can be processed with audio picked up by other microphones in a room to isolate the desired audio source. Improvements in fidelity (lower noise, better THD+N) will also contribute.
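
As a rough illustration of the beamforming idea mentioned above, the following is a minimal delay-and-sum sketch. It assumes the per-microphone arrival delays of the target source are already known and are whole numbers of samples; the function name and signal layout are illustrative, and practical beamformers use fractional delays, calibration, and adaptive weighting.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay of the target source at each microphone, in samples.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for i in range(n - d):
            # Advance the channel by d samples so the target lines up.
            out[i] += samples[i + d]
    return [v / len(channels) for v in out]
```

When the delays match the target, its samples add coherently across microphones, while sound arriving from other directions is misaligned and averages down.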

Keyword Detector

There are a number of improvements recommended for the keyword detector. The keyword detector is a key function which is always listening for select words and phrases (“OK Google”, “Alexa”, etc.) to switch to a more active mode in which action can be taken. Improvements and recommendations for the keyword detector include:

  • Ultra low power – IoT objects won’t all be wall powered or even battery powered.  Alternative energy sources, including energy harvesting methods, will be used to power IoT objects.  Embedded keyword detectors should be capable of operating at very low power levels. 
  • Personalization – Keywords should be end user programmable.
  • User trained – the system should be trainable to recognize specific residents and/or visitors. 
  • Improved performance – fast and accurate.
  • User dependence and user independence – Keywords should be programmable such that they can be user independent or user dependent. 
  • User recognition – the system should be able to distinguish which resident is speaking  (parent, child, etc.).  This should not be confused with access authorization for activities such as banking. 

Speaker System

AoT is not just about voice/audio input.  The highly intelligent home will need to interact with residents.  Recommended improvements and capabilities include:

  • Efficiency – improved amplifier efficiency is needed to include speakers in very low powered IoT objects. 
  • Integration – existing speaker systems, including wireless speakers, should be integrated into the AoT subsystem. 
  • Multiroom, Room Dependency – interactions should be whole house or only within the room where a specific resident is present depending on the type of interaction.  AoT should be able to use the location of a specific resident to isolate speakers in that room.
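
The room-dependent behaviour described in the last item above can be sketched as a simple lookup. The scope labels, the room-to-speaker mapping, and the `speakers_for` function are hypothetical, and the sketch assumes the resident's current room is already known from some other part of the system.

```python
def speakers_for(scope, resident_room, speakers_by_room):
    """Pick speakers for an interaction: the whole house, or only the resident's room."""
    if scope == "whole_house":
        return sorted(s for group in speakers_by_room.values() for s in group)
    # Room-scoped interaction: only speakers where the resident currently is.
    return sorted(speakers_by_room.get(resident_room, []))
```

A smoke alarm would use the `"whole_house"` scope, while a personal calendar reminder would be limited to the room where that resident is present.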

Speech Synthesis

Improvements in speech synthesis are needed such that it is natural and not robotic.

Interoperable Communication and Connectivity

Improved standards are needed to allow for interoperable connection and communication of various sensor types.  Communication standards need to allow for cooperation and collaboration among sensors (sensor fusion).  For example, in a room with a large number of microphones, a certain subset of microphones might be in the best position to isolate a source.  Furthermore, additional sensor types might be used to determine the stress level of a resident to determine additional context.
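
To make the microphone-subset example concrete, here is a minimal sketch that assumes each microphone can report an estimated signal-to-noise ratio (in dB) for the current source over some shared protocol. The device names and the `select_microphones` function are hypothetical; the report does not specify a selection criterion.

```python
def select_microphones(snr_db, k=2):
    """Return the ids of the k microphones with the highest estimated SNR."""
    ranked = sorted(snr_db, key=snr_db.get, reverse=True)
    return ranked[:k]
```

For example, `select_microphones({"kitchen": 3.0, "lamp": 12.5, "tv": 9.0})` keeps the lamp and TV microphones for source isolation and ignores the kitchen one.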

Converters (Analog-to-Digital converters and Digital-to-Analog converters)

Lower price and lower power consumption are needed for use in ultra low power objects.

Input Sources

Input sources include residents, visitors, sounds within the house, and sound generating devices such as TVs.  These sources are what they are.  Knowledge of the audio generated by sources such as the TV can be used to remove that audio source in the process of source selection. Similarly, knowledge of audio generated by sources such as TVs and computer games can be used in sound scene analysis to distinguish between gaming activity and actual in home situations that could require assistance.  Metadata added to TV audio signals could be used to communicate with the AoT system.
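
Removing a known source such as TV audio can be illustrated with a deliberately simplified, single-gain cancellation. It assumes the AoT system has a time-aligned copy of the TV signal and that the room only scales it; `cancel_reference` is a hypothetical helper, and real systems would use an adaptive multi-tap filter, as in acoustic echo cancellation.

```python
def cancel_reference(mic, reference):
    """Subtract the best-fit scaled copy of a known reference from the mic signal."""
    energy = sum(r * r for r in reference)
    if energy == 0:
        return list(mic)  # silent reference: nothing to remove
    # Least-squares gain of the reference as heard at the microphone.
    gain = sum(m * r for m, r in zip(mic, reference)) / energy
    return [m - gain * r for m, r in zip(mic, reference)]
```

What remains after subtraction is the part of the microphone signal that is not explained by the TV, which is what source selection and sound scene analysis would then work on.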

Noise/Distractor Elimination; Source Isolation

See the “Microphone, Source Isolation” element for specific improvements and recommendations for far field noise suppression/source selection. A privacy and trust issue needs to be addressed: residents and visitors need to be comfortable that the AoT system does not solve the cocktail party problem in a way that leaves all conversations monitored and possibly recorded. The system does, however, need to be capable of detecting owners’ commands in a cocktail party environment.

Speaker identification/authentication/verification

Improved price/performance is expected from existing products providing this capability.

Sensors, sensor fusion

New sensor types are expected to emerge and current sensor types are expected to evolve in performance and capability.  Price/performance ratio is expected to improve over time.  The ability to combine data from disparate sensor types (sensor fusion) to improve scene analysis and context awareness will improve as residential AI improves.

Digital Signal Processing/Processor (DSP)

Ultra-low power DSPs will allow more objects to become audio enabled. Price/performance ratio is expected to improve along typical timelines. These developments will allow for improved local command recognition. In some cases, analog signal processing may be used in combination with DSP to optimize a product’s performance.

Application processor

Application processor performance is expected to improve along a predictable trajectory.  A more open API is recommended to facilitate the addition of external capabilities.

An open question was raised: “How does vertical integration impact the evolution of AoT?” Companies such as Apple, and to a lesser degree Samsung, are becoming increasingly vertically integrated through the internal development of their own processors, OS, and end products. How does this impact, positively or negatively, the evolution of AoT?


Memory

Memory density is expected to improve along predictable trajectories.  There are some architectural improvements that were identified that can facilitate some aspects of AoT.  When external memory devices are needed, cost and/or performance can be greatly impacted.  A choice often needs to be made between slower, pin-efficient memory devices and faster, high pin-count memory devices.  Higher pin-counts typically translate to higher power.  A fast, narrow interface is recommended to provide improved support for embedded real-time audio processing. 

AI/decision maker

Great improvements in Artificial Intelligence (AI) are expected over the next several years. An AI that learns quickly and easily is needed. Distribution of the AI across multiple devices in the home may be necessary. Collaboration between devices that are part of the home and devices that are part of the person (e.g., smartphones) may be needed to eliminate conflicting or even counteractive actions. The system should account for the fact that multiple objects/devices contribute to decisions. The system needs to be adaptive to learn the habits and needs of individual residents.

Appliance Control interfaces

IoT is expected to evolve such that control interfaces with common appliances will become standardized. 

System security

Given the ever-growing threat of malicious attacks and the sensitive private information and activity present in virtually every home, system security is of extreme importance. There is tremendous opportunity here for the creation of technology and products that can isolate and protect the home. Key aspects of IoT security that need to be addressed are:

  • Prevention of malicious hacking
  • Encrypted transmission of data to cloud
  • Firewall improvements that isolate from the internet those processes and information that are supposed to remain local
  • Updates to system components need to be trusted and secure to prevent Trojan Horses or other malicious content
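
As one illustration of the trusted-update item above, the sketch below checks an update image against an authentication tag before it is installed. It uses a shared-secret HMAC from the Python standard library purely for brevity; that is an assumption of this example, and deployed systems would more typically rely on public-key signatures so that devices never hold the signing key. The function names are hypothetical.

```python
import hashlib
import hmac


def sign_update(image: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the update image."""
    return hmac.new(key, image, hashlib.sha256).hexdigest()


def verify_update(image: bytes, tag: str, key: bytes) -> bool:
    """Accept the update only if its tag matches; comparison is constant-time."""
    expected = sign_update(image, key)
    return hmac.compare_digest(expected, tag)
```

Any change to the image after signing, even a single appended byte, causes `verify_update` to return `False`, so the device can refuse to install it.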

Power source, power backup

IoT devices are largely expected to be powered by something other than line power.  Objects that are plugged in today (toaster) will continue to be line powered.  But objects that are unpowered today are likely to be powered through batteries or alternate energy sources such as solar or energy harvesting. Batteries are expected to improve along predictable paths.  Issues with battery level or energy sources need to result in resident friendly alerts. 

Access to cloud

The more functionality relies on connectivity to the cloud, the more important reliable access to the cloud becomes. Improvements to continuity of access are needed. Access bandwidth is expected to improve over time. A backup access method (e.g., a mobile phone network) would be attractive to some owners. It is recommended that the system retain some degree of capability when access to the cloud is not available. This requires some amount of local capability that does not rely on the cloud. Additionally, manual override is recommended as a failsafe.
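
The fallback order described above (primary access, backup access, then local capability) can be sketched as follows. The `process` function and the idea of modelling each access path as a callable that raises `ConnectionError` or `TimeoutError` on failure are assumptions made for this example only.

```python
def process(command, access_paths, local_handler):
    """Try each cloud access path in order; fall back to local handling."""
    for path in access_paths:
        try:
            return path(command)
        except (ConnectionError, TimeoutError):
            continue  # this path is down, try the next one
    # No cloud path worked: use the reduced local capability.
    return local_handler(command)
```

A caller might pass `[broadband, mobile_network]` as `access_paths`, so the mobile network is used only when broadband fails, and local handling only when both are unavailable.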

Cloud Based Processing, Trust and Security

With the sensitivity of voice/audio data that can be collected and then processed and stored in the cloud, there is great opportunity to create trust in how this data will be handled.  Existing businesses can create differentiation in the areas of trust and security.  If existing businesses don’t adequately address this, it is expected that there could be significant opportunity for businesses that create capability based on a transparency and trust model.  Recommendations here include:

  • Reasonable and easily identifiable data retention policy. Storage of specific audio content should not be long term.  Storage of extracted features for the purpose of improved performance is considered reasonable.
  • Transparency and trust – improve trust in cloud-based processing of personal data
  • Audit of data retention, transparency, and trust policies to ensure compliance.
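
The retention recommendation above (short-lived raw audio, longer-lived extracted features) might be enforced with something like the following sketch. The record layout, the `apply_retention` name, and the seven-day window are hypothetical; the report does not state a specific retention period.

```python
from datetime import datetime, timedelta


def apply_retention(records, now, raw_audio_days=7):
    """Drop raw audio older than the retention window; keep extracted features."""
    cutoff = now - timedelta(days=raw_audio_days)
    kept = []
    for rec in records:
        if rec["kind"] == "raw_audio" and rec["created"] < cutoff:
            continue  # expired raw audio is not retained
        kept.append(rec)
    return kept
```

Because the function is a pure filter over records, an auditor could run the same rule against a storage snapshot to check that no expired raw audio remains.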

It is possible that sensitivity to sending personal information to the cloud depends on age/generation, with younger generations less worried about, or even apathetic to, privacy issues.

Access to mobile network for security alerts – likely commercial choice

Use of a mobile phone network for security alerts and backup access could be a differentiating capability. The presence of this capability is likely to be a commercial choice of the system provider.

User Interface (UI)

UI is always an important aspect of technology products.  Specific recommendations with respect to UI include:

  • Natural and unobtrusive – The system should be functional without access to screen-based UI
  • Self configuration – Addition of new devices should be discovered and configured automatically, i.e. plug-n-play
  • Voice and gesture control should cooperate with today’s analog controls
  • Preconfigured settings templates – system should include preconfigured templates which represent basic habits and use cases and are based on demographic information
  • Easy system reset – a single reset action should reset a malfunctioning system.  It should not be required to reset multiple devices. 
  • Easy system override – It should be easy to override an automatic action.

Interoperability protocol/OS

Siri and Alexa need to be friends.  A homeowner should not be put in the position of deciding up front that a home should be an Apple home, a Google home, or an Amazon home.  There is significant opportunity for a business to create a system level framework that is OS independent and will allow devices from different OEMs to cooperate and even collaborate.  Standards are likely necessary to achieve this as a number of different companies are currently vying to be the home’s center of intelligence. 


Price

It is expected that the price of AoT components will follow typical consumer pricing trends by declining significantly over time, allowing AoT to become more affordable to the mass market.

Additional Project BBQ Items Worth Mentioning

The workgroup considers the following report from a previous year’s Project BBQ worth reviewing:
BBQ report - 2014: Audio opportunities in the Internet of Things
