The Eighteenth Annual Interactive Audio Conference
PROJECT BAR-B-Q 2013
BBQ Group Report: Using Sensor Data to Improve the User Experience of Audio Applications
   
Participants: A.K.A. "Senseless Confusion"

Larry Przywara, Tensilica, Inc.

Howard Brown, IDT, Inc.
Michael Pate, Audience

Rob Goyens, NXP Software

Jan-Paul Huijser, NXP

Mikko Suvanto, Akustica, Inc.

Cyril Martin, Analog Devices

Michael Townsend, Harman Embedded Audio

Scott McNeese, Cirrus Logic

Facilitator: Ron Kuper, Sonos, Inc.
 

 


Problem Statement

The audio user experience is often compromised by the surrounding environment, the user, and the context. To overcome the multitude of scenarios, we believe that fusing audio and many non-audio sensors can significantly improve the user experience of audio applications.

Example: Your child is presenting on stage, and you are in the back row of the audience with your camera. You zoom in on your child and also want to capture the audio as he says his lines. For this, several functions could be added to the system. One example is picking up the signal from the microphone he is wearing by tapping into the environment's resources (the house sound system). Another is an audio zoom function linked to the camera that combines microphone beamforming with shake compensation, position compensation, etc.

Sensors are becoming widely deployed on smart devices. For example, on today's smartphones, accelerometers, gyroscopes, GPS, proximity sensors, microphone arrays, speakers, and front- and back-facing cameras have become common. Despite the presence of these sensors, mobile phone sensing is still in its infancy.

We want to raise industry awareness of the end-user benefits of combining multiple sensor domains. In this report we limit the scope to benefits for audio-related applications and use cases and their relationship to sensor data. In addition, we analyze the bandwidth requirements for sensors to enable these benefits.

The report is organized into the following sections:

  • Sensors
  • Multilayer approach
  • Audio applications
  • Challenges
  • Conclusions

I. Sensors

The figure below shows an overview of different sensors. Some of these are already widely deployed in mobile phones, smartphones, and wearables today, such as accelerometers, gyroscopes, GPS, proximity sensors, microphone arrays, speakers, and front- and back-facing cameras.

[Figure: overview of sensor types]

The output of the sensor layer is a raw data signal with structured properties, such as information about the current data, sampling frequency, number of dimensions, and the size of each dimension. Most sensors yield one-dimensional data, for example audio signals or temperature. Other sensors provide multi-dimensional data, for example an accelerometer.
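To make "raw data with structured properties" concrete, here is a minimal sketch (not from the report) of how one block of sensor readings and its metadata might be described; the type and field names are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SensorFrame:
    """One block of raw samples plus the structural metadata described above."""
    sensor: str                          # e.g., "accelerometer", "microphone"
    sample_rate_hz: float                # sampling frequency of the raw stream
    dimensions: int                      # 1 for audio/temperature, 3 for an accelerometer
    samples: Sequence[Sequence[float]]   # shape: (number of samples, dimensions)


# Example: a 3-axis accelerometer frame at 200 Hz (the values are placeholders).
accel_frame = SensorFrame(
    sensor="accelerometer",
    sample_rate_hz=200.0,
    dimensions=3,
    samples=[(0.01, -0.02, 0.98), (0.00, -0.01, 0.99)],
)
```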

A. Sensor Details

The physical limitations and measurement range of each sensor bound what can be measured; the system architect also needs to know each sensor's power and processing requirements.

     1. Mechanical

          a. Acceleration

Accelerometer

  Stimulus range      0.0005 → 16     ±g / axis
  Dynamic range       90              dB
  Frequency range     10 → 500        Hz
  PGA                 0 → 24          dB
  ADC resolution      ≥ 12            bits
  Arithmetic          16              bits

  Sampling mode       Rate            Power
  Continuous          200 Hz          0.2 mW
  Sync’ed             200 Hz          0.2 mW
  Triggered           N/A             N/A
  Event triggered     0               0.00025 mW
  Suspend             0               0.00002 mW

          b. Rotation

Gyroscope

  Stimulus range      0.1 → 50 000    ±°/s
  Dynamic range       115             dB
  Frequency range     1 → 5000        Hz
  PGA                 N/A             dB
  ADC resolution      ≥ 20            bits
  Arithmetic          ≥ 24            bits

  Sampling mode       Rate            Power
  Continuous          10 000 Hz       2 mW
  Sync’ed             500 Hz          5 mW
  Triggered           N/A             N/A
  Event triggered     0               0.00025 mW
  Suspend             0               0.00002 mW

          c. Atmospheric Pressure

Barometer

  Stimulus range      200 → 1500      mbar
  Dynamic range       100             dB
  Frequency range     10              Hz
  PGA                 N/A             dB
  ADC resolution      ≥ 24            bits
  Arithmetic          ≥ 24            bits

  Sampling mode       Rate            Power
  Continuous          10 Hz           0.01 mW
  Sync’ed             1 Hz            0.05 mW
  Triggered           N/A             N/A
  Event triggered     0               N/A
  Suspend             0               0.000001 mW

          d. Sound pressure

Microphone

  Stimulus range      20 → 140        dB SPL
  Dynamic range       120             dB
  Frequency range     10 → 20 000     Hz
  PGA                 N/A             dB
  ADC resolution      ≥ 20            bits
  Arithmetic          ≥ 24            bits

  Sampling mode       Rate            Power
  Continuous          3 MHz           2 mW
  Sync’ed             3 MHz           2 mW
  Triggered           N/A             N/A
  Event triggered     0               TBD
  Suspend             0               0.001 mW

          e. Ultrasonic wave pressure

Ultrasonic Microphone

  Stimulus range      20 → 100        dB SPL
  Dynamic range       80              dB
  Frequency range     20k → 80k       Hz
  PGA                 N/A             dB
  ADC resolution      ≥ 14            bits
  Arithmetic          ≥ 16            bits

  Sampling mode       Rate            Power
  Continuous          3 MHz           2 mW
  Sync’ed             3 MHz           2 mW
  Triggered           N/A             N/A
  Event triggered     0               TBD
  Suspend             0               0.001 mW

          f. Gasflow

          g. Speaker

          h. Temperature

     2. Electromagnetic

          a. Ambient light

Ambient Light Sensor

  Stimulus range      0.002 → 65k     lux
  Dynamic range       150             dB
  Frequency range     N/A             Hz
  PGA                 0 → 36          dB
  ADC resolution      ≥ 16            bits
  Arithmetic          ≥ 32            bits

  Sampling mode       Rate            Power
  Continuous          10 Hz           0.5 mW
  Sync’ed             10 Hz           0.5 mW
  Triggered           N/A             N/A
  Event triggered     0               TBD
  Suspend             0               0.001 mW

          b. Infrared light

          c. Magnetism

Magnetometer

  Stimulus range      0.005 → 16      ±gauss
  Dynamic range       70              dB
  Frequency range     N/A             Hz
  PGA                 0 → 12          dB
  ADC resolution      ≥ 12            bits
  Arithmetic          ≥ 16            bits

  Sampling mode       Rate            Power
  Continuous          50 Hz           0.05 mW
  Sync’ed             20 Hz           0.5 mW
  Triggered           N/A             N/A
  Event triggered     0               TBD
  Suspend             0               0.001 mW

          d. GPS

          e. Camera

     3. Human

          a. Blood pressure

          b. Hand Grip

          c. Skin conductivity

          d. Fingerprint detection

     4. Connectivity

          a. Bluetooth

          b. WLAN

II. Multilayer Approach

The raw data from the sensors are typically interpreted in a multi-layer approach in order to make higher-level, context-aware decisions. Reasons against always streaming the raw sensor data include:

  • Privacy: sending raw data to the cloud
  • Bandwidth (a rough comparison of raw data rates follows this list)
  • Energy consumption: e.g., the application processor processing high-bandwidth raw data
  • CPU usage
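As a rough illustration of the bandwidth point, the raw data rates implied by the tables in Section I differ by orders of magnitude between sensors. The figures below are back-of-the-envelope estimates, assuming 16-bit accelerometer words and treating the microphone's 3 MHz continuous rate as a 1-bit PDM stream (an assumption, not a stated spec).

```python
# Back-of-the-envelope raw data rates, derived from the Section I tables.
accel_bps = 200 * 16 * 3      # 200 Hz x 16 bits x 3 axes  = 9,600 bit/s
mic_bps = 3_000_000 * 1       # 3 MHz x 1-bit PDM stream   = 3,000,000 bit/s
print(f"accelerometer ~ {accel_bps / 1e3:.1f} kbit/s, "
      f"microphone ~ {mic_bps / 1e6:.1f} Mbit/s")
```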

In this report, we follow a three-layered architecture (a minimal sketch of this layering follows the list):

  • Functions: the raw sensor data; which sensors are available in the system?
  • Features: compressed summaries or cues interpreted from (multiple) raw sensor data; what can we learn from the sensors?
  • Applications (user benefits): the decision level; how do we combine features into a tangible benefit for the user?
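The following is a minimal sketch of the three layers using a simple sound-level example; every function and threshold here is an illustrative assumption, not a defined interface.

```python
import math


# Functions layer: a raw sensor stream (here a stubbed microphone read).
def read_mic_block() -> list[float]:
    """Return one block of raw audio samples in the range [-1, 1]."""
    return [0.01, -0.02, 0.015, -0.005]   # placeholder data


# Features layer: compress raw data into a compact cue.
def sound_level_dbfs(samples: list[float]) -> float:
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-9))


# Applications layer: turn features into a user-visible decision.
def playback_boost_needed(ambient_level_dbfs: float, threshold: float = -30.0) -> bool:
    return ambient_level_dbfs > threshold


level = sound_level_dbfs(read_mic_block())
print(playback_boost_needed(level))
```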

A. Examples

Example 1: You pick up a call in your office and need to have a conversation with a group of people. You walk to a meeting room where more people are present, put down your phone, and the phone switches into a desktop conferencing mode.

  • Functions: Ultrasonic microphone, accelerometer, gyro, magnetometer, proximity, grip detect
  • Feature: Local device mode (Near Field vs. Hands Free vs. Far Field)
  • Application: Automatic switching from earpiece to speakerphone mode during a call

Example 2: You are on a conference call, and one participant is causing manipulation noise (rubbing his phone on the table, touching buttons, etc.), adding background junk to the call. These clicks can be very annoying for the far-end listener. We can resolve this using audio content only, but it gets much easier if we can also use input from the accelerometer, etc., to detect these manipulation sounds more robustly (a rough sketch of such accelerometer-assisted detection follows the list below).

  • Functions: Microphone array, accelerometer, gyro
  • Feature: Manipulation noise detection
  • Application: Improved noise suppression during voice calling
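A minimal sketch of how an accelerometer cue could gate a manipulation-noise decision; the function names and thresholds are illustrative assumptions rather than anything specified in the report.

```python
def manipulation_noise_detected(audio_transient_score: float,
                                accel_jerk_g_per_s: float,
                                audio_thresh: float = 0.5,
                                jerk_thresh: float = 5.0) -> bool:
    """Flag a handling/manipulation event.

    audio_transient_score: 0..1 output of an audio click/transient detector.
    accel_jerk_g_per_s: magnitude of the accelerometer's rate of change.
    Requiring both cues makes the decision more robust than audio alone.
    """
    return audio_transient_score > audio_thresh and accel_jerk_g_per_s > jerk_thresh


# A strong audio click coinciding with a sharp accelerometer spike:
print(manipulation_noise_detected(0.8, 12.0))   # True  -> suppress this segment
print(manipulation_noise_detected(0.8, 0.3))    # False -> likely acoustic, keep it
```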

Example 3: Make a noise map of a city. For this, an application would probably want to measure sound levels only when the phone is out of the pocket or bag. To make in-pocket detection robust, multiple sensors can be combined (a sketch of such fusion follows the list below):

  • Functions: GPS, Microphone, Accelerometer, ambient light sensor
  • Feature: Sound level, Pocket detection
  • Application: noise map of a city
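A minimal sketch of such multi-sensor in-pocket detection; all names and threshold values below are illustrative assumptions.

```python
def in_pocket(lux: float, accel_variance_g2: float, mic_high_freq_ratio: float) -> bool:
    """Fuse three weak cues into a more robust in-pocket decision.

    lux: ambient light level (a pocket is dark).
    accel_variance_g2: recent accelerometer variance (walking motion while carried).
    mic_high_freq_ratio: fraction of microphone energy above ~4 kHz (fabric muffles highs).
    """
    dark = lux < 5.0
    moving = accel_variance_g2 > 0.01
    muffled = mic_high_freq_ratio < 0.1
    # Require at least two of the three cues to agree.
    return sum([dark, moving, muffled]) >= 2


def log_noise_sample(lux: float, accel_var: float, hf_ratio: float, level_db_spl: float):
    if not in_pocket(lux, accel_var, hf_ratio):
        print(f"noise map sample: {level_db_spl:.1f} dB SPL")


log_noise_sample(lux=300.0, accel_var=0.002, hf_ratio=0.4, level_db_spl=68.2)
```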

B. Architecture

Although it is not the goal of this working group to focus on the details of the system architecture, a high-level proposal can be investigated. If the split into Functions, Features, and Applications is done well, it enables the architecture pictured below.

The sensors (Functions) are connected to a sensor hub or to a sensor-specific DSP/CPU core in the application processor. To minimize data transfer, the Feature routines need to run locally on this sensor hub or dedicated DSP/CPU core. Application functions run on the main application processor and call the Feature routines to obtain key information: the application calls a specific Feature routine in the hub, and that routine returns simple information derived from the sensors it uses (a minimal sketch of this interaction follows the figure below).

[Figure: proposed system architecture with a sensor hub]
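A minimal sketch of how this split might look from the application processor's side, assuming a hypothetical feature-routine interface on the hub; none of these names is a standard API.

```python
class SensorHub:
    """Stand-in for a sensor hub running feature routines locally.

    Only compact feature values cross the hub / application-processor
    boundary; raw sensor streams stay on the hub side.
    """

    def query_feature(self, name: str):
        # In a real system this would be an IPC or driver call into hub firmware.
        canned = {"in_pocket": False, "sound_level_db": 62.0}
        return canned[name]


# Application code on the main processor: no raw data, only feature queries.
hub = SensorHub()
if hub.query_feature("sound_level_db") > 55 and not hub.query_feature("in_pocket"):
    print("raising playback volume to compensate for ambient noise")
```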

For a detailed list of the Functions, Features and Applications please refer to the tables in this document.

Advantages of proposed architecture:

  • Reduced data transfer between Hub and Application processor
  • Optimized for power management
  • Standardization possible to support distributed systems (Ubiquitous Network)
  • Independent of the main operating system used

III. Audio Applications

Redundant and complementary sensors can be fused and integrated in order to enhance system reliability and accuracy. Multi-sensor fusion can bring benefits in a wide range of applications, such as robotics, military, and biomedical systems.

In this section, we analyze how fusing audio and non-audio sensors using the multi-layer approach can benefit the experience of audio applications. In a first step, we list typical challenges in audio use cases. In a next step, sensor functions are mapped to resolving these challenges.

A. Use Cases and Challenges

For this analysis, audio applications were classified according to use cases:

  • Two-way communication (human-human)
  • One-way communication (human-machine)
  • Multimedia recording
  • Multimedia playback
  • Objective audio
  • Idle case

1. Two-way communication (human - human)

Two-way communication happens in adverse acoustic conditions:

  • Noisy environments
  • Echo: sound of the speaker is captured by the microphone resulting in echo for the far-end talker;
  • Room acoustics: reverberation and reflections of the audio signals
  • Varying signal levels;
  • Unknown user handling: strange device positions, covering microphones, position of speaker towards ear, pressure of speaker on the ear, etc.

These conditions are improved by active voice processing techniques, e.g., acoustic echo cancelling (AEC) and multi-microphone noise suppression (NS) algorithms.

Mode

Challenges that benefit from sensor fusion

Close talk (earpiece)

  • Positional/Orientation robustness: provide same acoustic performance independent from device position relative to the user;
  • Microphone coverage: user can cover one or multiple microphones by hand or face, influencing greatly the captured signal;
  • Speaker leakage/coverage: depending on pressure applied from earpiece speaker towards the ear, the loudness, captured echo, frequency response vary;
  • Manipulation noise: e.g., user tapping the phone
  • Seamless transition between NF and HH modes

Handheld speaker

  • Close talk challenges +
  • Loudness: get more loudness out of small speakers (keeping distortion to minimal level);
  • Changing room dynamics
  • Privacy

Conference mode

  • Handheld speaker challenges +
  • Multiple talkers: who is talking and who is desired?

Headset

  • Manipulation noise
  • Intelligently enabling environmental noise

Far talk

  • Reverberation
  • Talker location

Automotive

  • Reverberation
  • Multiple speakers: are there other passengers?

2. One-way communication (human - machine)

One-way communication happens in adverse conditions similar to those of two-way communication. Solutions, however, can be distinct, as automatic speech recognition engines do not necessarily react the same way as human listeners.

Mode

Challenges that benefit from sensor fusion

Close talk (earpiece)

  • Positional/Orientation robustness: provide same acoustic performance independent from device position relative to the user;
  • Microphone coverage: user can cover one or multiple microphones by hand or face, influencing greatly the captured signal;
  • Speaker leakage/coverage: depending on pressure applied from earpiece speaker towards the ear, the loudness, captured echo, frequency response vary;
  • Manipulation noise: e.g., user tapping the phone
  • Seamless transition between NF and HH modes

Handheld speaker

  • Close talk challenges +
  • Loudness: get more loudness out of small speakers (keeping distortion to minimal level);
  • Changing room dynamics
  • Privacy

Conference mode

  • Handheld speaker challenges +
  • Multiple talkers: who is talking and who is desired?

Headset

  • Manipulation noise
  • Intelligently enabling environmental noise

Far talk

  • Reverberation
  • Talker location

Automotive

  • Reverberation
  • Multiple speakers: are there other passengers?

3. Multimedia recording

Mode

Challenges that benefit from sensor fusion

Camcording

  • Audio zoom: changing audio processing depending on camera focal length and on who is in focus
  • Stereo-mono selection based upon device orientation
  • Motor noise cancellation
  • Attach Meta data: GPS location, noise, talkers, etc.

Voice recording

  • Automatic microphone selection upon device orientation
  • Attach Meta data: GPS location, noise, talkers, etc.
  • Manipulation noise

Sound/music recording

  • Attach Meta data: GPS location, noise, talkers, etc.
  • Manipulation noise

4. Multimedia playback

Multimedia playback can happen in a variety of environments, from quiet to consistent noise (an airplane) to varying noise, and across a variety of output devices: headphones/earbuds, internal speakers, and external speakers.

Mode

Challenges / benefits from sensor fusion

Mobile device

 

Internal speaker

  • Orientation – Render stereo vs mono
  • Equalization based on placement
  • Loudness boost
  • Multi device synchronous playback – group play
  • Pocket detection
  • Location detection w.r.t. listener and room characteristics
  • Sweet spot creation based on listener location relative to the device

Headset/Headphone

  • Environmental noise reduction
  • Head and device position tracking
  • Playback and stop

Push – Airplay, Etc.

  • Positional location w.r.t. speaker resources

At home - Stationary

  • Orientation – Render stereo vs mono
  • Equalization based on placement
  • Multi device synchronous playback – group play
  • Location detection w.r.t. listener and room characteristics
  • Sweet spot creation based on listener location relative to the device
  • Playback and stop
  • User identification – voice or visual

5. Objective audio (gaming)

Mode

Challenges / benefits from sensor fusion

Headset/Headphone

  • Head rotation tracking
  • Manipulation noise suppression e.g., noise from the game controller
  • Spatialization of device to device
  • Intelligently enabling environmental noise

Handheld - Internal speaker

  • Privacy
  • Echo cancellation
  • Mic coverage
  • Loudness: get more loudness out of small speakers (keeping distortion to minimal level);
  • Changing room dynamics
  • Speaker coverage
  • Manipulation noise suppression e.g., noise from the game controller

TV/Living room- External speaker

  • Privacy
  • Echo cancellation
  • Mic coverage
    • Far talk
    • Reverberation
  • Talker location
  • Loudness boost if external speakers are limited in response
  • Speaker coverage

6. Idle case

The idle case is when the device is in a low-power, always-on state waiting for a wake-up event. Lowest power is ideal, so only the sensors that are absolutely needed are left on. Always-listening operation can be accomplished by combining a sound/speech detector with a full voice trigger that is initiated after the sound/speech threshold is tripped (a sketch of this staged wake-up follows the table below). For proximity detection, ultrasonic detection and the accelerometer can be used and are sufficient for always-on operation.

Mode

Challenges / benefits from sensor fusion

Always on low power listening

  • Hands free operation with single mic input
    • VAD or sound detect
    • Wake up Word / Hot Word
    • Speaker ID, authentication

Proximity detection

  • Hands free operation with accelerometer or ultrasonic mic input
    • Device wake up
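A minimal sketch of the staged always-listening flow described above: the two-stage structure (cheap detector gating a heavier voice trigger) is from the report, while the function names, block size, and threshold are assumptions.

```python
def sound_exceeds_threshold(block) -> bool:
    """Stage 1: cheap, always-on sound/speech detector (could run on the hub)."""
    energy = sum(s * s for s in block) / len(block)
    return energy > 1e-4


def wake_word_detected(block) -> bool:
    """Stage 2: heavier voice trigger, only evaluated after stage 1 fires."""
    ...   # placeholder for a keyword-spotting model
    return False


def idle_loop(next_block):
    """Process one low-power audio block; wake the system only when needed."""
    block = next_block()
    if sound_exceeds_threshold(block) and wake_word_detected(block):
        print("waking application processor")


idle_loop(lambda: [0.0] * 160)   # 10 ms of silence at 16 kHz: stays asleep
```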

B. Exploiting Sensor Fusion

1. From Functions to Features

In a first step, we map the functions to the features (or cues) needed to address the challenges discussed above. Multiple sensors can be used to provide a more reliable or more accurate feature; this is sensor data fusion.

[Figure: mapping of functions to features]

2. From Features to User Benefits

In a second step, we map the features to improvements in user experience. In this table we see a second level of sensor fusion, as multiple features are combined: "feature fusion" (a minimal sketch of such feature fusion follows the figure below).

[Figure: mapping of features to user experience]
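A minimal sketch of such feature fusion, loosely following Example 1 from Section II; the feature names and decision rules are illustrative assumptions.

```python
def choose_call_mode(on_table: bool, face_near_ear: bool, talker_count: int) -> str:
    """Fuse several features into one user-visible decision for a voice call.

    on_table: e.g., derived from accelerometer stillness and device orientation.
    face_near_ear: e.g., derived from the proximity sensor / ultrasonic microphone.
    talker_count: e.g., derived from microphone-array source counting.
    """
    if face_near_ear:
        return "earpiece"
    if on_table and talker_count > 1:
        return "conference speakerphone"
    return "handheld speakerphone"


print(choose_call_mode(on_table=True, face_near_ear=False, talker_count=3))
```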

IV. Challenges

The system is rather complex, and the infrastructure is currently not available to make optimum use of all the sensors and systems available in a room or even in a one-box system.
There are also challenges in standardization:

  • To make use of this system in the most effective way, it is important to standardize:
    • the Features and their interfacing command structure
    • the interfaces/bus to the sensors and to the application processor
  • Ultrasonic:
    • standardized identification (different frequencies for different devices in the room, adding metadata, etc.)
    • who is pinging
    • ultrasonic pollution
  • Software architecture

V. Conclusions

  • Fusing of the sensor data can improve the user experience by increasing the contextual awareness of the device.
  • Audio (microphone and speakers) can be considered a sensor as well.
  • Most valuable sensors to fuse with common audio processing seem to be:
    • Accelerometer
    • Ultrasonic-microphones
    • Proximity detector
    • Speaker as sensor
  • A layered architecture is needed 
    • Multiple feature routines are running simultaneously on the sensor hub
    • Higher-level functions run on the application processor
  • The sensor hub can have its own OS (lower power and performance)
  • Standardization is required:
    • in the software API
    • in the sensor bus
    • minimization of ultrasonic pollution
    • identification (authentication) of the ultrasonic source
