The audio user experience is often compromised by the surrounding environment, the user, and the context. To cope with this multitude of scenarios, we believe that fusing audio with non-audio sensors can significantly improve the user experience of audio applications.
Example: Your child is presenting on stage, and you are in the back row of the audience with your camera. You zoom in on your child and also want to capture the audio as he says his lines. Several functions could be added to the system to support this. One example is picking up the signal from the microphone he is wearing by tapping into the environment's resources (the house sound system). Another example is an audio zoom function linked to the camera, which combines microphone beamforming with shake compensation, position compensation, etc.
Sensors are being widely deployed on smart devices. On today's smartphones, for example, the accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, speaker, and front- and back-facing cameras have become common. Despite the presence of these sensors, mobile phone sensing is still in its infancy.
We want to raise industry awareness of the end-user benefits of combining multiple sensor domains. In this report we limit the scope to benefits for audio-related applications/use cases and their relationship to sensor data. In addition, we analyzed the bandwidth requirements for the sensors that enable these benefits.
The report is organized in the following sections:
- Sensors
- Multilayer approach
- Audio applications
- Challenges
- Conclusions
The figure below gives an overview of different sensors. Some of these are widely deployed in mobile phones, smartphones and wearables today, such as the accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, speaker, and front- and back-facing cameras.
The output of the sensor layer is a raw data signal with structured properties, such as information about the current data, the sampling frequency, the number of dimensions and the size of each dimension. Most sensors yield one-dimensional data, for example audio signals or temperature. Other sensors provide multi-dimensional data, for example an accelerometer.
The physical limitations and measurement range of each sensor bound what can be measured. The system architect also needs to know the power and processing requirements.
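To make the notion of a raw data signal with structured properties concrete, the following is a minimal sketch in Python of a per-sensor data frame; the class and field names are hypothetical and only illustrate the kind of metadata described above.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SensorFrame:
        """One block of raw samples plus the structural metadata described above."""
        sensor_id: str             # e.g. "accelerometer", "microphone", "temperature"
        sampling_rate_hz: float    # sampling frequency of the raw signal
        dims: Tuple[int, ...]      # number of dimensions and the size of each dimension
        data: List                 # the raw samples themselves

    # A 1-D temperature reading and a 3-axis accelerometer block, for illustration:
    temp_frame = SensorFrame("temperature", 1.0, (1,), [21.4])
    accel_frame = SensorFrame("accelerometer", 200.0, (3, 200), [[0.0] * 200 for _ in range(3)])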
1. Mechanical
a. Acceleration
Accelerometer
Stimulus range | 0.0005 → 16 | ±g / axis
Dynamic range | 90 | dB
Frequency range | 10 → 500 | Hz
PGA | 0 → 24 | dB
ADC resolution | ≥ 12 | bits
Arithmetic | 16 | bits

Sampling mode | Rate | Power
Continuous | 200 Hz | 0.2 mW
Sync'ed | 200 Hz | 0.2 mW
Triggered | N/A | N/A
Event triggered | 0 | 0.00025 mW
Suspend | 0 | 0.00002 mW
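As a rough illustration of the bandwidth analysis mentioned in the introduction, the continuous-mode figures above translate into a raw data-rate estimate as follows (a sketch: the 200 Hz rate and 16-bit arithmetic word come from the table, and three axes are assumed):

    # Rough raw-bandwidth estimate for the accelerometer in continuous mode.
    axes = 3                  # assumed 3-axis part
    sample_rate_hz = 200      # continuous-mode rate from the table above
    bits_per_sample = 16      # arithmetic word size from the table above

    bits_per_second = axes * sample_rate_hz * bits_per_sample
    print(f"{bits_per_second / 1000:.1f} kbit/s")   # prints 9.6 kbit/s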
b. Rotation
Gyroscope
Stimulus range | 0.1 → 50 000 | ±°/s
Dynamic range | 115 | dB
Frequency range | 1 → 5000 | Hz
PGA | N/A | dB
ADC resolution | ≥ 20 | bits
Arithmetic | ≥ 24 | bits

Sampling mode | Rate | Power
Continuous | 10 000 Hz | 2 mW
Sync'ed | 500 Hz | 5 mW
Triggered | N/A | N/A
Event triggered | 0 | 0.00025 mW
Suspend | 0 | 0.00002 mW
c. Atmospheric Pressure
Barometer
Stimulus range | 200 → 1500 | mbar RMS
Dynamic range | 100 | dB
Frequency range | 10 | Hz
PGA | N/A | dB
ADC resolution | ≥ 24 | bits
Arithmetic | ≥ 24 | bits

Sampling mode | Rate | Power
Continuous | 10 Hz | 0.01 mW
Sync'ed | 1 Hz | 0.05 mW
Triggered | N/A | N/A
Event triggered | 0 | N/A
Suspend | 0 | 0.000001 mW
d. Sound pressure
Microphone
Stimulus range | 20 → 140 | dB SPL
Dynamic range | 120 | dB
Frequency range | 10 → 20 000 | Hz
PGA | N/A | dB
ADC resolution | ≥ 20 | bits
Arithmetic | ≥ 24 | bits

Sampling mode | Rate | Power
Continuous | 3 MHz | 2 mW
Sync'ed | 3 MHz | 2 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
e. Ultrasonic wave pressure
Ultrasonic Microphone
Stimulus range | 20 → 100 | dB SPL
Dynamic range | 80 | dB
Frequency range | 20k → 80k | Hz
PGA | N/A | dB
ADC resolution | ≥ 14 | bits
Arithmetic | ≥ 16 | bits

Sampling mode | Rate | Power
Continuous | 3 MHz | 2 mW
Sync'ed | 3 MHz | 2 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
f. Gasflow
g. Speaker
h. Temperature
2. Electromagnetic
a. Ambient light
Ambient Light Sensor
Stimulus range | 0.002 → 65k | lux
Dynamic range | 150 | dB
Frequency range | N/A | Hz
PGA | 0 → 36 | dB
ADC resolution | ≥ 16 | bits
Arithmetic | ≥ 32 | bits

Sampling mode | Rate | Power
Continuous | 10 Hz | 0.5 mW
Sync'ed | 10 Hz | 0.5 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
b. Infrared light
c. Magnetism
Magnetometer
Stimulus range | 0.005 → 16 | ±gauss
Dynamic range | 70 | dB
Frequency range | N/A | Hz
PGA | 0 → 12 | dB
ADC resolution | ≥ 12 | bits
Arithmetic | ≥ 16 | bits

Sampling mode | Rate | Power
Continuous | 50 Hz | 0.05 mW
Sync'ed | 20 Hz | 0.5 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
d. GPS
e. Camera
3. Human
a. Blood pressure
b. Hand Grip
c. Skin conductivity
d. Fingerprint detection
4. Connectivity
a. Bluetooth
b. WLAN
The raw data from the sensors are typically interpreted in a multi-layer approach in order to make higher-level, context-aware decisions. Reasons against always streaming the raw sensor data are:
- Privacy: sending raw data to the cloud
- Bandwidth
- Energy consumption: e.g., application processor processing high-bandwidth raw data
- CPU usage
In this report, we follow a three-layered architecture:
- Functions: the raw sensor data (which sensors are available in the system?);
- Features: compressed summaries or cues interpreted from (multiple) raw sensor data (what can we learn from the sensors?);
- Applications or user benefits: the decision level (how can features be combined into a tangible benefit for the user?).
Example 1: You pick up a call in your office and need to have a conversation with a group of people. You walk to a meeting room where more people are present, you put down your phone, and the phone switches into a desktop conferencing mode.
- Functions: Ultrasonic microphone, accelerometer, gyro, magnetometer, proximity, grip detect
- Feature: Local device mode (Near Field vs. Hands Free vs. Far Field)
- Application: Automatic switching from earpiece to speakerphone mode during a call
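A minimal sketch of how the device-mode Feature of Example 1 could fuse some of the listed Functions; the thresholding logic and all names are hypothetical, purely to illustrate the Function → Feature → Application flow.

    def local_device_mode(proximity_near, grip_detected, face_up_on_table):
        """Feature: fuse proximity, grip and orientation cues into a device mode
        (near field vs. hands free vs. far field)."""
        if proximity_near and grip_detected:
            return "near_field"   # held against the ear
        if face_up_on_table and not grip_detected:
            return "far_field"    # lying on the meeting-room table
        return "hands_free"

    # Application: switch from earpiece to speakerphone / conferencing mode.
    mode = local_device_mode(proximity_near=False, grip_detected=False, face_up_on_table=True)
    if mode == "far_field":
        print("Switch to desktop conferencing mode")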
Example 2: You are on a conference call, and one participant is causing manipulation noise (rubbing the phone on the table, touching buttons, etc.), adding all kinds of background junk to the call. These clicks can be very annoying for the far-end listener. This can be resolved using audio content only, but it becomes much easier if input from the accelerometer, etc., can also be used to detect these manipulation sounds more robustly.
- Functions: Microphone array, accelerometer, gyro
- Feature: Manipulation noise detection
- Application: Improved noise suppression during voice calling
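A sketch of how accelerometer input could make the manipulation-noise Feature of Example 2 more robust; the thresholds and scores are hypothetical.

    def manipulation_noise_detected(mic_click_score, accel_jerk):
        """Feature: flag manipulation noise only when an audio click coincides with
        a mechanical jerk on the device, reducing audio-only false alarms."""
        AUDIO_CLICK_THRESHOLD = 0.6   # hypothetical normalized click likelihood
        JERK_THRESHOLD = 2.0          # hypothetical jerk threshold from the accelerometer
        return mic_click_score > AUDIO_CLICK_THRESHOLD and accel_jerk > JERK_THRESHOLD

    # Application: the noise suppressor attenuates frames flagged by this Feature.
    print(manipulation_noise_detected(mic_click_score=0.8, accel_jerk=3.5))  # True -> suppress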
Example 3: Make a noise map of a city. For this, an application would probably want to measure sound levels only when the phone is out of the pocket or bag. To make in-pocket detection robust, multiple sensors can be combined:
- Functions: GPS, microphone, accelerometer, ambient light sensor
- Feature: Sound level, pocket detection
- Application: Noise map of a city
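A sketch of the pocket-detection Feature of Example 3 and the Application-level decision built on top of it; the thresholds and sample values are hypothetical.

    def in_pocket(lux, accel_variance):
        """Feature: pocket detection = dark plus very little device movement."""
        return lux < 5.0 and accel_variance < 0.05   # hypothetical thresholds

    def log_noise_sample(lux, accel_variance, sound_level_db, gps_fix):
        """Application: only contribute to the city noise map when the phone is out of the pocket."""
        if in_pocket(lux, accel_variance):
            return None
        return {"position": gps_fix, "level_db": sound_level_db}

    print(log_noise_sample(lux=300.0, accel_variance=0.3, sound_level_db=68.0, gps_fix=(51.02, 4.48)))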
Although it is not the goal of this working group to focus on the details of the system architecture, a high-level proposal can be investigated. If the split into Functions, Features and Applications is done in a smart way, it enables the architecture pictured below.
The sensors (Functions) are connected to a sensor hub or to a sensor-specific DSP/CPU core in the application processor. To minimize data transfer, the Feature routines need to run locally in this sensor hub or dedicated DSP/CPU core. Application functions run on the main application processor and call the Feature routines to obtain key information: the application calls a specific Feature routine in the hub, and that routine returns simple information based on the sensors it reads.
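As a sketch of this split (all names are hypothetical), the hub could expose the Feature routines behind a small query interface, so that only condensed feature values, never raw sensor streams, cross over to the application processor:

    # Hypothetical hub-side registry: Feature routines run locally on the sensor hub
    # and return condensed values rather than raw sensor streams.
    FEATURE_ROUTINES = {
        "device_mode":   lambda: "far_field",   # would fuse ultrasonic mic, accelerometer, gyro, ...
        "pocket_detect": lambda: False,         # would fuse ambient light + accelerometer
    }

    def hub_query(feature_name):
        """Runs on the hub; only the small result crosses to the application processor."""
        return FEATURE_ROUTINES[feature_name]()

    # Application-processor side: ask for the Feature instead of pulling raw data.
    if hub_query("device_mode") == "far_field":
        print("Switch the call to speakerphone / conferencing mode")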
For a detailed list of the Functions, Features and Applications please refer to the tables in this document.
Advantages of proposed architecture:
- Reduced data transfer between Hub and Application processor
- Optimized for power management
- Standardization possible to support distributed systems (Ubiquitous Network)
- Independent of the main operating system used
Redundant and complementary sensors can be fused and integrated in order to enhance system reliability and accuracy. Multi-sensor fusion can bring benefits in a wide range of applications, such as robotics, military and biomedical applications.
In this section, we analyze how fusing audio and non-audio sensors using the multi-layer approach can benefit the experience of audio applications. In a first step, we list typical challenges in audio use cases. In a next step, sensor functions are mapped to resolving these challenges.
For this analysis, audio applications were classified according to use cases:
- Two-way communication (human-human)
- One-way communication (human-machine)
- Multimedia recording
- Multimedia playback
- Objective audio
- Idle case
1. Two-way communication (human - human)
Two-way communication happens in adverse acoustic conditions:
- Noisy environments
- Echo: sound of the speaker is captured by the microphone resulting in echo for the far-end talker;
- Room acoustics: reverberation and reflections of the audio signals
- Varying signal levels;
- Unknown user handling: strange device positions, covered microphones, the position of the speaker towards the ear, the pressure of the speaker on the ear, etc.
These conditions are improved by active voice-processing techniques, e.g., acoustic echo cancellation (AEC) and multi-microphone noise suppression (NS) algorithms.
Challenges that benefit from sensor fusion, per mode:

Close talk (earpiece):
- Positional/orientation robustness: provide the same acoustic performance independent of the device position relative to the user;
- Microphone coverage: the user can cover one or multiple microphones with a hand or the face, greatly influencing the captured signal;
- Speaker leakage/coverage: depending on the pressure of the earpiece speaker against the ear, the loudness, captured echo and frequency response vary;
- Manipulation noise: e.g., the user tapping the phone;
- Seamless transition between NF and HH modes.

Handheld speaker:
- The close-talk challenges, plus:
- Loudness: get more loudness out of small speakers (keeping distortion to a minimal level);
- Changing room dynamics;
- Privacy.

Conference mode:
- The handheld-speaker challenges, plus:
- Multiple talkers: who is talking and who is desired?

Headset:
- Manipulation noise;
- Intelligently enabling environmental noise.

Far talk:
- Reverberation;
- Talker location.

Automotive:
- Reverberation;
- Multiple speakers: are there other passengers?
2. One-way communication (human - machine)
One-way communication happens in adverse conditions similar to those of two-way communication. Solutions, however, can be distinct, as automatic speech recognition engines do not necessarily react the same way as human listeners.
The per-mode challenges that benefit from sensor fusion are the same as those listed for two-way communication above (close talk, handheld speaker, conference mode, headset, far talk and automotive).
3. Multimedia recording
Challenges that benefit from sensor fusion, per mode:

Camcording:
- Audio zoom: changing the audio processing depending on the camera focal length and on who is in focus (see the sketch after this list);
- Stereo/mono selection based upon device orientation;
- Motor noise cancellation;
- Attach metadata: GPS location, noise, talkers, etc.

Voice recording:
- Automatic microphone selection based upon device orientation;
- Attach metadata: GPS location, noise, talkers, etc.;
- Manipulation noise.

Sound/music recording:
- Attach metadata: GPS location, noise, talkers, etc.;
- Manipulation noise.
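As an illustration of the audio-zoom item above, the desired audio pickup angle can be made to follow the camera's shrinking field of view as the user zooms in. The following is a rough sketch; the wide-angle field of view and the mapping itself are assumptions rather than values from this report.

    import math

    def target_pickup_angle_deg(zoom_factor, wide_fov_deg=70.0):
        """Narrow the desired audio pickup angle as the camera zooms in,
        by tracking the shrinking optical field of view."""
        half_fov = math.radians(wide_fov_deg / 2.0)
        return 2.0 * math.degrees(math.atan(math.tan(half_fov) / zoom_factor))

    for zoom in (1, 2, 4, 8):
        print(f"{zoom}x zoom -> narrow the pickup angle to about {target_pickup_angle_deg(zoom):.0f} degrees")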
4. Multimedia playback
Multimedia playback can happen in a variety of environments, from quiet, to consistent noise (airplane), to varying noise, and across a variety of output devices: headphones/earbuds, internal and external speakers.
Challenges / benefits from sensor fusion, per mode:

Mobile device:

Internal speaker:
- Orientation: render stereo vs. mono;
- Equalization based on placement;
- Loudness boost;
- Multi-device synchronous playback (group play);
- Pocket detection;
- Location detection w.r.t. listener and room characteristics;
- Sweet-spot creation based on the listener location relative to the device.

Headset/headphone:
- Environmental noise reduction;
- Head and device position tracking;
- Playback and stop.

Push (AirPlay, etc.):
- Positional location w.r.t. speaker resources.

At home (stationary):
- Orientation: render stereo vs. mono;
- Equalization based on placement;
- Multi-device synchronous playback (group play);
- Location detection w.r.t. listener and room characteristics;
- Sweet-spot creation based on the listener location relative to the device;
- Playback and stop;
- User identification: voice or visual.
5. Objective audio (gaming)
Challenges / benefits from sensor fusion, per mode:

Headset/headphone:
- Head rotation tracking;
- Manipulation noise suppression, e.g., noise from the game controller;
- Spatialization of device to device;
- Intelligently enabling environmental noise.

Handheld (internal speaker):
- Privacy;
- Echo cancellation;
- Mic coverage;
- Loudness: get more loudness out of small speakers (keeping distortion to a minimal level);
- Changing room dynamics;
- Speaker coverage;
- Manipulation noise suppression, e.g., noise from the game controller.

TV/living room (external speaker):
- Privacy;
- Echo cancellation;
- Mic coverage;
- Talker location;
- Loudness boost if the external speakers are limited in response;
- Speaker coverage.
6. Idle case
The idle case is when the device is in a low-power, always-on state waiting for a wake-up event. The lowest possible power is ideal, so only those sensors that are absolutely needed are left on. Always-on listening can be accomplished by combining a sound/speech detector with a full voice trigger that is initiated after the sound/speech threshold is tripped; a sketch of such a staged pipeline follows the table below. For proximity detection, ultrasonic detection and the accelerometer can be used and are sufficient for an always-on mode.
Challenges / benefits from sensor fusion, per mode:

Always-on low-power listening:
- Hands-free operation with single-mic input;
- VAD or sound detect;
- Wake-up word / hot word;
- Speaker ID, authentication.

Proximity detection:
- Hands-free operation with accelerometer or ultrasonic mic input.
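Building on the staged always-on approach described above, the following minimal sketch lets a cheap sound/speech detector gate a more expensive wake-word stage, so the costly trigger only runs after the threshold is tripped; the thresholds and the placeholder keyword spotter are hypothetical.

    def sound_detected(frame_energy):
        """Stage 1: very low-power sound/speech detector."""
        ENERGY_THRESHOLD = 0.01            # hypothetical normalized energy threshold
        return frame_energy > ENERGY_THRESHOLD

    def wake_word_detected(audio_frame):
        """Stage 2: full voice trigger, only invoked after stage 1 trips."""
        return "hot word" in audio_frame   # placeholder for a real keyword spotter

    def idle_loop(frames):
        for energy, audio in frames:
            if sound_detected(energy) and wake_word_detected(audio):
                return "wake up the application processor"
        return "stay in low-power idle"

    print(idle_loop([(0.001, "silence"), (0.2, "hot word please")]))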
1. From Functions to Features
In a first step, we map the functions to the features (or cues) that are needed to address the challenges discussed above. Multiple sensors can be used to provide a more reliable or more accurate feature: this is sensor data fusion.
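As a generic illustration of sensor data fusion at this step (a textbook complementary filter, not a specific Feature from the tables), a drifting gyroscope angle and a noisy accelerometer tilt estimate can be fused into one more reliable orientation cue:

    def complementary_filter(prev_angle, gyro_rate, accel_angle, dt, alpha=0.98):
        """Fuse the gyroscope (smooth but drifting) and the accelerometer (noisy but
        drift-free) into a single, more reliable tilt-angle feature."""
        return alpha * (prev_angle + gyro_rate * dt) + (1.0 - alpha) * accel_angle

    angle = 0.0
    for gyro_rate, accel_angle in [(0.5, 0.01), (0.4, 0.02), (0.3, 0.02)]:
        angle = complementary_filter(angle, gyro_rate, accel_angle, dt=0.005)
    print(round(angle, 4))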
2. From Features to User Benefits
In a second step, we map the features to improvements in user experience. From this table we see a second level of sensor fusion, as multiple features are combined: 'feature fusion'.
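A sketch of feature fusion at this step: two independently computed Features are combined into a single user-visible decision (the feature names reuse the earlier hypothetical sketches):

    def tune_noise_suppression(device_mode, manipulation_noise):
        """Combine the device-mode and manipulation-noise Features into one user
        benefit: better noise suppression during a voice call."""
        if manipulation_noise:
            return "aggressively suppress transient clicks"
        if device_mode == "far_field":
            return "multi-microphone beamforming with reverberation control"
        return "standard near-field noise suppression"

    print(tune_noise_suppression(device_mode="far_field", manipulation_noise=False))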
The system is rather complex, and the infrastructure is currently not available to make optimum use of all the sensors and systems present in a room, or even in a one-box system.
There are also challenges in standardization:
- To make use of this system in the most effective way, it is important to standardize on:
- the Features and their interfacing command structure;
- the interfaces/bus to the sensors, but also to the application processor;
- Ultrasonic:
- standardization of identification (different frequencies for different devices in the room, adding metadata, etc.);
- who is pinging;
- ultrasonic pollution;
- Software architecture
- Fusing the sensor data can improve the user experience by increasing the contextual awareness of the device.
- Audio (microphones and speakers) can be considered a sensor as well.
- The most valuable sensors to fuse with common audio processing seem to be:
- Accelerometer
- Ultrasonic-microphones
- Proximity detector
- Speaker as sensor
- A layered architecture is needed
- Multiple feature routines are running simultaneously on the sensor hub
- Higher-level functions run on the application processor
- The sensor hub can have its own OS (lower power and performance)
- Standardization is required:
- In the software API
- In the sensor bus
- Minimization of ultrasonic pollution
- Identification (authentication) of the ultrasonic source