The audio user experience is often compromised by the surrounding environment, the user, and the context. To cope with this multitude of scenarios, we believe that fusing audio with non-audio sensors can significantly improve the user experience of audio applications.
Example: Your child is presenting on stage, and you are in the back row of the audience with your camera. You zoom in on your child and also want to capture the audio as he says his lines. Several functions could be added to the system to support this. One example is picking up the signal from the microphone he is wearing by tapping into the environment's resources (the house sound system). Another example is an audio zoom function linked to the camera, which combines microphone beamforming with shake compensation, position compensation, etc.
Sensors are being widely deployed on smart devices. On today's smartphones, for example, the accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, speaker, and front- and back-facing cameras have become common. Despite the presence of these sensors, mobile phone sensing is still in its infancy.
We want to raise industry awareness of the end-user benefits of combining multiple sensor domains. In this report we limit the scope to benefits for audio-related applications/use cases and their relationship to sensor data. In addition, we analyzed the bandwidth requirements for the sensors that enable these benefits.
The report is organized in the following sections:
- Sensors
- Multilayer approach
- Audio applications
- Challenges
- Conclusions
The figure below gives an overview of different sensors. Some of these are widely deployed in mobile phones, smartphones and wearables today, such as the accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, speaker, and front- and back-facing cameras.
The output of the sensor layer is a raw data signal with structured properties, such as information about the current data, the sampling frequency, the number of dimensions and the size of each dimension. Most sensors yield one-dimensional data, for example audio signals or temperature. Other sensors provide multi-dimensional data, for example an accelerometer.
The physical limitations and measurement range of each sensor bound what can be measured. The system architect also needs to know the power and processing requirements.
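To make the notion of a raw data signal with structured properties concrete, the following is a minimal sketch in Python of a per-sensor data frame; the class and field names are hypothetical and only illustrate the kind of metadata described above.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SensorFrame:
        """One block of raw samples plus the structural metadata described above."""
        sensor_id: str             # e.g. "accelerometer", "microphone", "temperature"
        sampling_rate_hz: float    # sampling frequency of the raw signal
        dims: Tuple[int, ...]      # number of dimensions and the size of each dimension
        data: List                 # the raw samples themselves

    # A 1-D temperature reading and a 3-axis accelerometer block, for illustration:
    temp_frame = SensorFrame("temperature", 1.0, (1,), [21.4])
    accel_frame = SensorFrame("accelerometer", 200.0, (3, 200), [[0.0] * 200 for _ in range(3)])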
1. Mechanical
a. Acceleration
Accelerometer
Stimulus range | 0.0005 → 16 | ±g / axis
Dynamic range | 90 | dB
Frequency range | 10 → 500 | Hz
PGA | 0 → 24 | dB
ADC resolution | ≥ 12 | bits
Arithmetic | 16 | bits

Sampling mode | Rate | Power
Continuous | 200 Hz | 0.2 mW
Sync'ed | 200 Hz | 0.2 mW
Triggered | N/A | N/A
Event triggered | 0 | 0.00025 mW
Suspend | 0 | 0.00002 mW
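As a rough illustration of the bandwidth analysis mentioned in the introduction, the continuous-mode figures above translate into a raw data-rate estimate as follows (a sketch: the 200 Hz rate and 16-bit arithmetic word come from the table, and three axes are assumed):

    # Rough raw-bandwidth estimate for the accelerometer in continuous mode.
    axes = 3                  # assumed 3-axis part
    sample_rate_hz = 200      # continuous-mode rate from the table above
    bits_per_sample = 16      # arithmetic word size from the table above

    bits_per_second = axes * sample_rate_hz * bits_per_sample
    print(f"{bits_per_second / 1000:.1f} kbit/s")   # prints 9.6 kbit/s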
b. Rotation
Gyroscope
Stimulus range | 0.1 → 50 000 | ±°/s
Dynamic range | 115 | dB
Frequency range | 1 → 5000 | Hz
PGA | N/A | dB
ADC resolution | ≥ 20 | bits
Arithmetic | ≥ 24 | bits

Sampling mode | Rate | Power
Continuous | 10 000 Hz | 2 mW
Sync'ed | 500 Hz | 5 mW
Triggered | N/A | N/A
Event triggered | 0 | 0.00025 mW
Suspend | 0 | 0.00002 mW
c. Atmospheric Pressure
Barometer
Stimulus range | 200 → 1500 | mbar RMS
Dynamic range | 100 | dB
Frequency range | 10 | Hz
PGA | N/A | dB
ADC resolution | ≥ 24 | bits
Arithmetic | ≥ 24 | bits

Sampling mode | Rate | Power
Continuous | 10 Hz | 0.01 mW
Sync'ed | 1 Hz | 0.05 mW
Triggered | N/A | N/A
Event triggered | 0 | N/A
Suspend | 0 | 0.000001 mW
d. Sound pressure
Microphone
Stimulus range | 20 → 140 | dB SPL
Dynamic range | 120 | dB
Frequency range | 10 → 20 000 | Hz
PGA | N/A | dB
ADC resolution | ≥ 20 | bits
Arithmetic | ≥ 24 | bits

Sampling mode | Rate | Power
Continuous | 3 MHz | 2 mW
Sync'ed | 3 MHz | 2 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
e. Ultrasonic wave pressure
Ultrasonic Microphone
Stimulus range | 20 → 100 | dB SPL
Dynamic range | 80 | dB
Frequency range | 20k → 80k | Hz
PGA | N/A | dB
ADC resolution | ≥ 14 | bits
Arithmetic | ≥ 16 | bits

Sampling mode | Rate | Power
Continuous | 3 MHz | 2 mW
Sync'ed | 3 MHz | 2 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
f. Gasflow
g. Speaker
h. Temperature
2. Electromagnetic
a. Ambient light
Ambient Light Sensor
Stimulus range | 0.002 → 65k | lux
Dynamic range | 150 | dB
Frequency range | N/A | Hz
PGA | 0 → 36 | dB
ADC resolution | ≥ 16 | bits
Arithmetic | ≥ 32 | bits

Sampling mode | Rate | Power
Continuous | 10 Hz | 0.5 mW
Sync'ed | 10 Hz | 0.5 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
b. Infrared light
c. Magnetism
Magnetometer
Stimulus range | 0.005 → 16 | ±gauss
Dynamic range | 70 | dB
Frequency range | N/A | Hz
PGA | 0 → 12 | dB
ADC resolution | ≥ 12 | bits
Arithmetic | ≥ 16 | bits

Sampling mode | Rate | Power
Continuous | 50 Hz | 0.05 mW
Sync'ed | 20 Hz | 0.5 mW
Triggered | N/A | N/A
Event triggered | 0 | TBD
Suspend | 0 | 0.001 mW
d. GPS
e. Camera
3. Human
a. Blood pressure
b. Hand Grip
c. Skin conductivity
d. Fingerprint detection
4. Connectivity
a. Bluetooth
b. WLAN
The raw data from the sensors are typically interpreted in a multi-layer approach in order to make higher-level, context-aware decisions. Reasons against always streaming the raw sensor data are:
- Privacy: sending raw data to the cloud
- Bandwidth
- Energy consumption: e.g., application processor processing high-bandwidth raw data
- CPU usage
In this report, we follow a three-layered architecture:
- Functions: the raw sensor data (which sensors are available in the system?);
- Features: compressed summaries or cues interpreted from (multiple) raw sensor data (what can we learn from the sensors?);
- Applications or user benefits: the decision level (how can features be combined into a tangible benefit for the user?).
Example 1: You pick up a call in your office and need to have a conversation with a group of people. You walk to a meeting room where more people are present, you put down your phone, and the phone switches into a desktop conferencing mode.
- Functions: Ultrasonic microphone, accelerometer, gyro, magnetometer, proximity, grip detect
- Feature: Local device mode (Near Field vs. Hands Free vs. Far Field)
- Application: Automatic switching from earpiece to speakerphone mode during a call
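A minimal sketch of how the device-mode Feature of Example 1 could fuse some of the listed Functions; the thresholding logic and all names are hypothetical, purely to illustrate the Function → Feature → Application flow.

    def local_device_mode(proximity_near, grip_detected, face_up_on_table):
        """Feature: fuse proximity, grip and orientation cues into a device mode
        (near field vs. hands free vs. far field)."""
        if proximity_near and grip_detected:
            return "near_field"   # held against the ear
        if face_up_on_table and not grip_detected:
            return "far_field"    # lying on the meeting-room table
        return "hands_free"

    # Application: switch from earpiece to speakerphone / conferencing mode.
    mode = local_device_mode(proximity_near=False, grip_detected=False, face_up_on_table=True)
    if mode == "far_field":
        print("Switch to desktop conferencing mode")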
Example 2: You are on a conference call, and one participant is causing manipulation noise (rubbing the phone on the table, touching buttons, etc.), adding all kinds of background junk to the call. These clicks can be very annoying for the far-end listener. This can be resolved using audio content only, but it becomes much easier if input from the accelerometer, etc., can also be used to detect these manipulation sounds more robustly.
- Functions: Microphone array, accelerometer, gyro
- Feature: Manipulation noise detection
- Application: Improved noise suppression during voice calling
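A sketch of how accelerometer input could make the manipulation-noise Feature of Example 2 more robust; the thresholds and scores are hypothetical.

    def manipulation_noise_detected(mic_click_score, accel_jerk):
        """Feature: flag manipulation noise only when an audio click coincides with
        a mechanical jerk on the device, reducing audio-only false alarms."""
        AUDIO_CLICK_THRESHOLD = 0.6   # hypothetical normalized click likelihood
        JERK_THRESHOLD = 2.0          # hypothetical jerk threshold from the accelerometer
        return mic_click_score > AUDIO_CLICK_THRESHOLD and accel_jerk > JERK_THRESHOLD

    # Application: the noise suppressor attenuates frames flagged by this Feature.
    print(manipulation_noise_detected(mic_click_score=0.8, accel_jerk=3.5))  # True -> suppress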
Example 3: Make a noise map of a city. For this, an application would probably want to measure sound levels only when the phone is out of the pocket or bag. To make in-pocket detection robust, multiple sensors can be combined:
- Functions: GPS, microphone, accelerometer, ambient light sensor
- Feature: Sound level, pocket detection
- Application: Noise map of a city
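A sketch of the pocket-detection Feature of Example 3 and the Application-level decision built on top of it; the thresholds and sample values are hypothetical.

    def in_pocket(lux, accel_variance):
        """Feature: pocket detection = dark plus very little device movement."""
        return lux < 5.0 and accel_variance < 0.05   # hypothetical thresholds

    def log_noise_sample(lux, accel_variance, sound_level_db, gps_fix):
        """Application: only contribute to the city noise map when the phone is out of the pocket."""
        if in_pocket(lux, accel_variance):
            return None
        return {"position": gps_fix, "level_db": sound_level_db}

    print(log_noise_sample(lux=300.0, accel_variance=0.3, sound_level_db=68.0, gps_fix=(51.02, 4.48)))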
Although it is not the goal of this working group to focus on the details of the system architecture, a high-level proposal can be investigated. If the split into Functions, Features and Applications is done in a smart way, it enables the architecture pictured below.
The sensors (Functions) are connected to a sensor hub or to a sensor-specific DSP/CPU core in the application processor. To minimize data transfer, the Feature routines need to run locally in this sensor hub or dedicated DSP/CPU core. Application functions run on the main application processor and call the Feature routines to obtain key information: the application calls a specific Feature routine in the hub, and that routine returns simple information based on the sensors it reads.
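As a sketch of this split (all names are hypothetical), the hub could expose the Feature routines behind a small query interface, so that only condensed feature values, never raw sensor streams, cross over to the application processor:

    # Hypothetical hub-side registry: Feature routines run locally on the sensor hub
    # and return condensed values rather than raw sensor streams.
    FEATURE_ROUTINES = {
        "device_mode":   lambda: "far_field",   # would fuse ultrasonic mic, accelerometer, gyro, ...
        "pocket_detect": lambda: False,         # would fuse ambient light + accelerometer
    }

    def hub_query(feature_name):
        """Runs on the hub; only the small result crosses to the application processor."""
        return FEATURE_ROUTINES[feature_name]()

    # Application-processor side: ask for the Feature instead of pulling raw data.
    if hub_query("device_mode") == "far_field":
        print("Switch the call to speakerphone / conferencing mode")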
For a detailed list of the Functions, Features and Applications please refer to the tables in this document.
Advantages of proposed architecture:
- Reduced data transfer between Hub and Application processor
- Optimized for power management
- Standardization possible to support distributed systems (Ubiquitous Network)
- Independent of the main operating system used
Redundant and complementary sensors can be fused and integrated in order to enhance system reliability and accuracy. Multi-sensor fusion can bring benefits in a wide range of applications, such as robotics, military and biomedical applications.
In this section, we analyze how fusing audio and non-audio sensors using the multi-layer approach can benefit the experience of audio applications. In a first step, we list typical challenges in audio use cases. In a next step, sensor functions are mapped to resolving these challenges.
For this analysis, audio applications were classified according to use cases:
- Two-way communication (human-human)
- One-way communication (human-machine)
- Multimedia recording
- Multimedia playback
- Objective audio
- Idle case
1. Two-way communication (human - human)
Two-way communication happens in adverse acoustic conditions:
- Noisy environments
- Echo: sound of the speaker is captured by the microphone resulting in echo for the far-end talker;
- Room acoustics: reverberation and reflections of the audio signals
- Varying signal levels;
- Unknown user handling: strange device positions, covered microphones, the position of the speaker towards the ear, the pressure of the speaker on the ear, etc.
These conditions are improved by active voice-processing techniques, e.g., acoustic echo cancellation (AEC) and multi-microphone noise suppression (NS) algorithms.
Challenges that benefit from sensor fusion, per mode:

Close talk (earpiece):
- Positional/orientation robustness: provide the same acoustic performance independent of the device position relative to the user;
- Microphone coverage: the user can cover one or multiple microphones with a hand or the face, greatly influencing the captured signal;
- Speaker leakage/coverage: depending on the pressure of the earpiece speaker against the ear, the loudness, captured echo and frequency response vary;
- Manipulation noise: e.g., the user tapping the phone;
- Seamless transition between NF and HH modes.

Handheld speaker:
- The close-talk challenges, plus:
- Loudness: get more loudness out of small speakers (keeping distortion to a minimal level);
- Changing room dynamics;
- Privacy.

Conference mode:
- The handheld-speaker challenges, plus:
- Multiple talkers: who is talking and who is desired?

Headset:
- Manipulation noise;
- Intelligently enabling environmental noise.

Far talk:
- Reverberation;
- Talker location.

Automotive:
- Reverberation;
- Multiple speakers: are there other passengers?
2. One-way communication (human - machine)
One-way communication happens in adverse conditions similar to those of two-way communication. Solutions, however, can be distinct, as automatic speech recognition engines do not necessarily react the same way as human listeners.
The per-mode challenges that benefit from sensor fusion are the same as those listed for two-way communication above (close talk, handheld speaker, conference mode, headset, far talk and automotive).
3. Multimedia recording
Challenges that benefit from sensor fusion, per mode:

Camcording:
- Audio zoom: changing the audio processing depending on the camera focal length and on who is in focus (see the sketch after this list);
- Stereo/mono selection based upon device orientation;
- Motor noise cancellation;
- Attach metadata: GPS location, noise, talkers, etc.

Voice recording:
- Automatic microphone selection based upon device orientation;
- Attach metadata: GPS location, noise, talkers, etc.;
- Manipulation noise.

Sound/music recording:
- Attach metadata: GPS location, noise, talkers, etc.;
- Manipulation noise.
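As an illustration of the audio-zoom item above, the desired audio pickup angle can be made to follow the camera's shrinking field of view as the user zooms in. The following is a rough sketch; the wide-angle field of view and the mapping itself are assumptions rather than values from this report.

    import math

    def target_pickup_angle_deg(zoom_factor, wide_fov_deg=70.0):
        """Narrow the desired audio pickup angle as the camera zooms in,
        by tracking the shrinking optical field of view."""
        half_fov = math.radians(wide_fov_deg / 2.0)
        return 2.0 * math.degrees(math.atan(math.tan(half_fov) / zoom_factor))

    for zoom in (1, 2, 4, 8):
        print(f"{zoom}x zoom -> narrow the pickup angle to about {target_pickup_angle_deg(zoom):.0f} degrees")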
4. Multimedia playback
Multimedia playback can happen in a variety of environments, from quiet, to consistent noise (airplane), to varying noise, and across a variety of output devices: headphones/earbuds, internal and external speakers.
Challenges / benefits from sensor fusion, per mode:

Mobile device:

Internal speaker:
- Orientation: render stereo vs. mono;
- Equalization based on placement;
- Loudness boost;
- Multi-device synchronous playback (group play);
- Pocket detection;
- Location detection w.r.t. listener and room characteristics;
- Sweet-spot creation based on the listener location relative to the device.

Headset/headphone:
- Environmental noise reduction;
- Head and device position tracking;
- Playback and stop.

Push (AirPlay, etc.):
- Positional location w.r.t. speaker resources.

At home (stationary):
- Orientation: render stereo vs. mono;
- Equalization based on placement;
- Multi-device synchronous playback (group play);
- Location detection w.r.t. listener and room characteristics;
- Sweet-spot creation based on the listener location relative to the device;
- Playback and stop;
- User identification: voice or visual.
5. Objective audio (gaming)
Challenges / benefits from sensor fusion, per mode:

Headset/headphone:
- Head rotation tracking;
- Manipulation noise suppression, e.g., noise from the game controller;
- Spatialization of device to device;
- Intelligently enabling environmental noise.

Handheld (internal speaker):
- Privacy;
- Echo cancellation;
- Mic coverage;
- Loudness: get more loudness out of small speakers (keeping distortion to a minimal level);
- Changing room dynamics;
- Speaker coverage;
- Manipulation noise suppression, e.g., noise from the game controller.

TV/living room (external speaker):
- Privacy;
- Echo cancellation;
- Mic coverage;
- Talker location;
- Loudness boost if the external speakers are limited in response;
- Speaker coverage.
6. Idle case
The idle case is when the device is in a low-power, always-on state waiting for a wake-up event. The lowest possible power is ideal, so only those sensors that are absolutely needed are left on. Always-on listening can be accomplished by combining a sound/speech detector with a full voice trigger that is initiated after the sound/speech threshold is tripped; a sketch of such a staged pipeline follows the table below. For proximity detection, ultrasonic detection and the accelerometer can be used and are sufficient for an always-on mode.
Challenges / benefits from sensor fusion, per mode:

Always-on low-power listening:
- Hands-free operation with single-mic input;
- VAD or sound detect;
- Wake-up word / hot word;
- Speaker ID, authentication.

Proximity detection:
- Hands-free operation with accelerometer or ultrasonic mic input.
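Building on the staged always-on approach described above, the following minimal sketch lets a cheap sound/speech detector gate a more expensive wake-word stage, so the costly trigger only runs after the threshold is tripped; the thresholds and the placeholder keyword spotter are hypothetical.

    def sound_detected(frame_energy):
        """Stage 1: very low-power sound/speech detector."""
        ENERGY_THRESHOLD = 0.01            # hypothetical normalized energy threshold
        return frame_energy > ENERGY_THRESHOLD

    def wake_word_detected(audio_frame):
        """Stage 2: full voice trigger, only invoked after stage 1 trips."""
        return "hot word" in audio_frame   # placeholder for a real keyword spotter

    def idle_loop(frames):
        for energy, audio in frames:
            if sound_detected(energy) and wake_word_detected(audio):
                return "wake up the application processor"
        return "stay in low-power idle"

    print(idle_loop([(0.001, "silence"), (0.2, "hot word please")]))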
1. From Functions to Features
In a first step, we map the functions to the features (or cues) that are needed to address the challenges discussed above. Multiple sensors can be used to provide a more reliable or more accurate feature: this is sensor data fusion.
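As a generic illustration of sensor data fusion at this step (a textbook complementary filter, not a specific Feature from the tables), a drifting gyroscope angle and a noisy accelerometer tilt estimate can be fused into one more reliable orientation cue:

    def complementary_filter(prev_angle, gyro_rate, accel_angle, dt, alpha=0.98):
        """Fuse the gyroscope (smooth but drifting) and the accelerometer (noisy but
        drift-free) into a single, more reliable tilt-angle feature."""
        return alpha * (prev_angle + gyro_rate * dt) + (1.0 - alpha) * accel_angle

    angle = 0.0
    for gyro_rate, accel_angle in [(0.5, 0.01), (0.4, 0.02), (0.3, 0.02)]:
        angle = complementary_filter(angle, gyro_rate, accel_angle, dt=0.005)
    print(round(angle, 4))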
2. From Features to User Benefits
In a second step, we map the features to improvements in user experience. From this table we see a second level of sensor fusion, as multiple features are combined: 'feature fusion'.
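A sketch of feature fusion at this step: two independently computed Features are combined into a single user-visible decision (the feature names reuse the earlier hypothetical sketches):

    def tune_noise_suppression(device_mode, manipulation_noise):
        """Combine the device-mode and manipulation-noise Features into one user
        benefit: better noise suppression during a voice call."""
        if manipulation_noise:
            return "aggressively suppress transient clicks"
        if device_mode == "far_field":
            return "multi-microphone beamforming with reverberation control"
        return "standard near-field noise suppression"

    print(tune_noise_suppression(device_mode="far_field", manipulation_noise=False))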
The system is rather complex, and the infrastructure is currently not available to make optimum use of all the sensors and systems present in a room, or even in a one-box system.
There are also challenges in standardization:
- To make use of this system in the most effective way, it is important to standardize on:
- the Features and their interfacing command structure;
- the interfaces/bus to the sensors, but also to the application processor;
- Ultrasonic:
- standardization of identification (different frequencies for different devices in the room, adding metadata, etc.);
- who is pinging;
- ultrasonic pollution;
- Software architecture
- Fusing the sensor data can improve the user experience by increasing the contextual awareness of the device.
- Audio (microphones and speakers) can be considered a sensor as well.
- The most valuable sensors to fuse with common audio processing seem to be:
- Accelerometer
- Ultrasonic-microphones
- Proximity detector
- Speaker as sensor
- A layered architecture is needed
- Multiple feature routines are running simultaneously on the sensor hub
- Higher-level functions run on the application processor
- The sensor hub can have its own OS (lower power and performance)
- Standardization is required:
- In the software API
- In the sensor bus
- Minimization of ultrasonic pollution
- Identification (authentication) of the ultrasonic source