The Eighteenth Annual Interactive Audio Conference
PROJECT BAR-B-Q 2013
Group Report: Using Sensor Data to Improve the User Experience of Audio Applications
Participants: A.K.A. "Senseless Confusion"
Larry Przywara, Tensilica, Inc.
Howard Brown, IDT, Inc.
Michael Pate, Audience
Rob Goyens, NXP Software
Jan-Paul Huijser, NXP
Mikko Suvanto, Akustica, Inc.
Cyril Martin, Analog Devices
Michael Townsend, Harman Embedded Audio
Scott McNeese, Cirrus Logic
Facilitator: Ron Kuper, Sonos, Inc.
The audio user experience is often compromised by the surrounding environment, the user, and the context. To cope with this multitude of scenarios, we believe that fusing audio with many non-audio sensors can significantly improve the user experience of audio applications.
Example: Your child is presenting on stage, and you are in the back row of the audience with your camera. You zoom in on your child and also want to capture the audio as he says his lines. Several functions could be added to the system for this. One example is to pick up the signal from the microphone he is wearing by tapping into the environment's resources (the house sound system). Another is an audio zoom function linked to the camera, which combines microphone beamforming with shake compensation, position compensation, and so on.
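As a purely illustrative sketch of the audio zoom idea, the fragment below blends a two-microphone first-order differential beamformer with an omnidirectional mix in proportion to the camera zoom factor. The mic spacing, sample rate, and linear zoom-to-blend mapping are assumptions; a real implementation would also equalize the differential path and add the shake and position compensation mentioned above.

    /* Hypothetical "audio zoom": directivity blended with camera zoom.
       All constants are illustrative assumptions. */
    #include <stddef.h>

    #define DELAY_SAMPLES 2  /* ~ mic spacing (14 mm) / speed of sound * 48 kHz */

    /* zoom: 0.0 = wide (omni mix), 1.0 = full tele (maximum directivity). */
    void audio_zoom(const float *front, const float *back,
                    float *out, size_t n, float zoom)
    {
        for (size_t i = 0; i < n; ++i) {
            float omni = 0.5f * (front[i] + back[i]);
            /* Back-facing null: subtract the delayed rear-mic sample. */
            float rear = (i >= DELAY_SAMPLES) ? back[i - DELAY_SAMPLES] : 0.0f;
            float cardioid = front[i] - rear;  /* first-order differential */
            out[i] = (1.0f - zoom) * omni + zoom * cardioid;
        }
    }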
Sensors are now widely deployed on smart devices. On today's smartphones, for example, an accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, a speaker, and front- and back-facing cameras have become common. Despite the presence of these sensors, mobile phone sensing is still in its infancy.
We want to raise industry awareness of the end-user benefits of combining multiple sensor domains. In this report we limit the scope to benefits for audio-related applications and use cases and their relationship to sensor data. In addition, we analyzed the bandwidth requirements sensors must meet to enable these benefits.
The report is organized into the following sections: a sensor overview, the multilayer approach, and audio applications.
I. Sensor Overview
The figure below shows an overview of different sensors. Some of these are widely deployed in mobile phones, smartphones, and wearables today, such as the accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, speaker, and front- and back-facing cameras.
The output of the sensor layer is a raw data signal with structured properties, such as information about the current data, the sampling frequency, the number of dimensions, and the size of each dimension. Most sensors yield one-dimensional data, for example audio signals or temperature. Other sensors provide multi-dimensional data, for example an accelerometer.
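To make this concrete, a minimal descriptor for such a raw sensor block might look as follows; the field names are illustrative assumptions, not a standard API.

    #include <stdint.h>

    #define MAX_DIMS 4

    typedef struct {
        uint32_t    sensor_id;          /* which physical sensor produced the data */
        uint32_t    sample_rate_hz;     /* sampling frequency */
        uint8_t     num_dims;           /* 1 for audio or temperature, 3 for an accelerometer */
        uint16_t    dim_size[MAX_DIMS]; /* size of each dimension in this block */
        uint64_t    timestamp_us;       /* capture time of the first sample */
        const void *data;               /* raw samples; layout is sensor-specific */
    } sensor_block_t;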
A. Sensor Details
The physical limitations and measurement range of each sensor bound what can be measured. The system architect also needs to know its power and processing requirements.
Pressure:
c. Atmospheric pressure
d. Sound pressure
e. Ultrasonic wave pressure
Light:
a. Ambient light
b. Infrared light
Biometric:
a. Blood pressure
b. Hand grip
c. Skin conductivity
d. Fingerprint detection
II. Multilayer Approach
The raw data from the sensors is typically interpreted in a multi-layer approach in order to make higher-level, context-aware decisions. The main reasons against always streaming the raw sensor data are the power consumption and the data bandwidth this would require.
In this report, we follow a three-layered architecture: Functions (the raw sensors), Features (cues fused from the sensor data), and Applications (the use cases that consume those features).
Example 1: You pick up a call in your office and need to have a conversation with a group of people. You walk to a meeting room where more people are present, you put down your phone, and your phone switches into a desktop conferencing mode.
Example 2: You are on a conference call, and one participant is causing manipulation noise (rubbing his phone on the table, touching buttons, etc.), adding all kinds of background junk to the call. These clicks can be very annoying for the far-end listener. We can resolve this using audio content only, but it becomes much easier if we can also use input from the accelerometer and other sensors to detect these manipulation sounds more robustly.
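A minimal sketch of such a fusion rule follows: an audio click candidate is only treated as manipulation noise when the accelerometer also saw a mechanical transient in the same short window. The event interface and window length are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define COINCIDENCE_WINDOW_MS 50

    typedef struct {
        uint64_t last_audio_click_ms; /* last click-like event from the audio path */
        uint64_t last_accel_bump_ms;  /* last transient from the accelerometer */
    } manip_detector_t;

    void on_audio_click(manip_detector_t *d, uint64_t now_ms) { d->last_audio_click_ms = now_ms; }
    void on_accel_bump(manip_detector_t *d, uint64_t now_ms)  { d->last_accel_bump_ms  = now_ms; }

    /* Both cues firing within the window is much stronger evidence of
       handling noise than the audio cue alone. */
    bool is_manipulation_noise(const manip_detector_t *d)
    {
        uint64_t a = d->last_audio_click_ms, b = d->last_accel_bump_ms;
        if (a == 0 || b == 0)
            return false; /* one of the cues has never fired */
        uint64_t delta = (a > b) ? a - b : b - a;
        return delta <= COINCIDENCE_WINDOW_MS;
    }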
Example 3: Make a noise map of a city. For this, an application would probably want to measure sound levels only when the phone is out of the pocket or bag. To make a robust in-pocket detection, multiple sensors could be combined, such as the ambient light sensor, the proximity sensor, and the accelerometer, as sketched below.
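One possible shape for such a fused decision is a simple majority vote over the cues; the sensor set and thresholds below are assumptions for illustration only.

    #include <stdbool.h>

    typedef struct {
        float ambient_lux;      /* ambient light sensor reading */
        bool  proximity_near;   /* proximity sensor covered */
        float motion_var;       /* accelerometer variance (gait, handling) */
        float audio_damping_db; /* high-frequency loss typical of fabric */
    } pocket_cues_t;

    bool in_pocket(const pocket_cues_t *c)
    {
        int votes = 0;
        if (c->ambient_lux < 5.0f)      ++votes; /* dark */
        if (c->proximity_near)          ++votes; /* something close to the screen */
        if (c->motion_var > 0.2f)       ++votes; /* body motion present */
        if (c->audio_damping_db > 6.0f) ++votes; /* muffled microphone */
        return votes >= 3; /* require agreement of most cues */
    }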
Although it is not the goal of this working group to focus on the details of the system architecture, a high-level proposal can be sketched. If the split into Functions, Features, and Applications is done in a smart way, it enables the architecture pictured below.
The sensors (Functions) are connected to a sensor hub or a sensor-specific DSP/CPU core in the application processor. To minimize data transfer, the Feature routines need to run locally on this sensor hub or dedicated DSP/CPU core. Applications run on the main application processor and call the Feature routines in the hub to get key information; each Feature routine returns simple results derived from the specific sensors it reads.
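A sketch of what that split could look like in code is given below. The feature identifiers and the hub call are hypothetical; the point is that the application processor asks the hub for a computed Feature instead of pulling raw sensor streams.

    #include <stdint.h>

    typedef enum {                /* Features computed locally on the hub */
        FEATURE_IN_POCKET,
        FEATURE_DEVICE_ON_TABLE,
        FEATURE_SPEECH_PRESENT
    } feature_id_t;

    typedef struct {
        feature_id_t id;
        int32_t      value;        /* boolean or scaled scalar result */
        uint64_t     timestamp_us; /* when the hub last updated it */
    } feature_result_t;

    /* Stub standing in for the real hub driver; it would return the latest
       cached result so the main processor never touches the raw data path. */
    int hub_query_feature(feature_id_t id, feature_result_t *out)
    {
        out->id = id;
        out->value = 0;
        out->timestamp_us = 0;
        return 0; /* 0 = success */
    }

    /* Application side: switch to desktop conferencing mode (Example 1)
       when the hub reports the phone lying flat on a table. */
    void maybe_enter_conference_mode(void)
    {
        feature_result_t r;
        if (hub_query_feature(FEATURE_DEVICE_ON_TABLE, &r) == 0 && r.value) {
            /* enable the desktop conferencing audio profile here */
        }
    }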
For a detailed list of the Functions, Features and Applications please refer to the tables in this document.
Advantages of the proposed architecture include lower power consumption, reduced data traffic between the hub and the application processor, and offloading of the application processor.
III. Audio Applications
Redundant and complementary sensors can be fused and integrated in order to enhance system reliability and accuracy. Multi-sensor fusion brings benefits in a wide range of applications, such as robotics, military systems, and biomedical devices.
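For redundant sensors, the textbook fusion rule is inverse-variance weighting: each measurement is weighted by how much it can be trusted, and the fused estimate is more accurate than either sensor alone. A minimal sketch:

    /* Fuse two redundant measurements x1, x2 with (assumed independent)
       noise variances var1, var2. */
    float fuse_redundant(float x1, float var1, float x2, float var2)
    {
        float w1 = 1.0f / var1;
        float w2 = 1.0f / var2;
        return (w1 * x1 + w2 * x2) / (w1 + w2);
    }

The fused variance is 1/(w1 + w2), which is always smaller than either input variance; that reduction is exactly the reliability gain fusion is after.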
In this section, we analyze how fusing audio and non-audio sensors using the multi-layer approach can benefit the experience of audio applications. As a first step, we list typical challenges in audio use cases. Next, sensor functions are mapped to resolving these challenges.
A. Use Cases and Challenges
For this analysis, audio applications were classified according to use cases:
1. Two-way communication (human - human)
Two-way communication often happens in adverse acoustic conditions, such as background noise and acoustic echo.
These conditions are mitigated by active voice processing techniques, e.g., acoustic echo cancellation (AEC) and multi-microphone noise suppression (NS) algorithms.
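As one concrete instance of such processing, the sketch below is a bare-bones normalized LMS (NLMS) echo canceller: it adapts an FIR estimate of the speaker-to-microphone echo path and subtracts the predicted echo from the microphone signal. The tap count and step size are illustrative, and a real AEC would also freeze adaptation during double-talk.

    #include <string.h>

    #define TAPS 256
    #define MU   0.5f   /* NLMS step size, 0 < MU < 2 */
    #define EPS  1e-6f  /* regularizer against divide-by-zero */

    typedef struct {
        float w[TAPS]; /* adaptive echo-path estimate */
        float x[TAPS]; /* recent far-end (speaker) samples, newest first */
    } aec_t;

    void aec_init(aec_t *a) { memset(a, 0, sizeof *a); }

    /* far = sample sent to the speaker, mic = sample captured by the mic;
       returns the echo-suppressed microphone sample. */
    float aec_process(aec_t *a, float far, float mic)
    {
        memmove(&a->x[1], &a->x[0], (TAPS - 1) * sizeof(float));
        a->x[0] = far;

        float y = 0.0f, energy = EPS;
        for (int i = 0; i < TAPS; ++i) {
            y += a->w[i] * a->x[i];        /* predicted echo */
            energy += a->x[i] * a->x[i];
        }
        float e = mic - y;                 /* error = mic minus echo estimate */
        float g = MU * e / energy;         /* normalized update gain */
        for (int i = 0; i < TAPS; ++i)
            a->w[i] += g * a->x[i];        /* NLMS coefficient update */
        return e;
    }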
2. One-way communication (human - machine)
One-way communication happens in similarly adverse conditions as two-way communication. Solutions, however, can differ, as automatic speech recognition engines do not necessarily react the same way as human listeners.
3. Multimedia recording
4. Multimedia playback
Multimedia playback can happen in a variety of environments, from quiet, to consistent noise (an airplane), to varying noise, and across a variety of output devices: headphones/earbuds, internal and external speakers.
5. Objective audio (gaming)
6. Idle case
The idle case is when the device is in a low-power, always-on state waiting for a wake-up event. The lowest possible power is ideal, so only the sensors that are absolutely needed are left on. Always-listening can be accomplished by combining a sound/speech detector with a full voice trigger that is initiated only after the sound/speech threshold is tripped, as sketched below. For proximity detection, ultrasonic detection and the accelerometer can be utilized and are sufficient for always-on operation.
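The staged wake-up described above can be captured as a small state machine; the states, threshold, and interface below are illustrative assumptions.

    #include <stdbool.h>

    typedef enum { IDLE_LISTEN, VERIFY_TRIGGER, WAKE } wake_state_t;

    typedef struct {
        wake_state_t state;
        float        energy_thresh; /* threshold of the cheap sound detector */
    } wake_fsm_t;

    /* frame_energy comes from the low-power sound/speech detector;
       trigger_match comes from the full voice trigger, which is only
       powered while we are in VERIFY_TRIGGER. */
    wake_state_t wake_step(wake_fsm_t *f, float frame_energy, bool trigger_match)
    {
        switch (f->state) {
        case IDLE_LISTEN:    /* only the sound detector is powered */
            if (frame_energy > f->energy_thresh)
                f->state = VERIFY_TRIGGER;
            break;
        case VERIFY_TRIGGER: /* voice trigger engine powered up */
            f->state = trigger_match ? WAKE : IDLE_LISTEN;
            break;
        case WAKE:           /* hand off to the application processor */
            break;
        }
        return f->state;
    }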
B. Exploiting Sensor Fusion
1. From Functions to Features
As a first step, we map the functions to the features (or cues) needed to address the challenges discussed above. Multiple sensors can be used to provide a more reliable or more accurate feature; this is sensor data fusion.
2. From Features to User Benefits
In a second step, we map the features to improvements in user experience. In this table we see a second level of sensor fusion, as multiple features are combined: 'feature fusion'.
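As a small illustration of feature fusion, the noise-map application from Example 3 could gate its measurements on several already-computed features; the feature names and threshold are assumptions that echo the earlier sketches.

    #include <stdbool.h>

    /* Log an ambient-level sample only when the mic is unobstructed,
       nobody is talking over it, and wind noise is low. */
    bool noise_map_sample_valid(bool in_pocket, bool speech_present,
                                float wind_level)
    {
        return !in_pocket && !speech_present && wind_level < 0.3f;
    }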
The system is rather complex, and the infrastructure is currently not available to make optimal use of all the sensors and systems available in a room or even in a one-box system.