|   The audio user experience is often compromised by the surrounding environment, the user, and the context. To handle the multitude of scenarios, we believe that fusing audio with the many non-audio sensors can significantly improve the user experience of audio applications.

Example: Your child is presenting on stage, and you are in the back row of the audience with your camera. You zoom in on your child and also want to capture the audio as he says his lines. Several functions could be added to the system to support this. One is to pick up the signal from the microphone he is wearing by tapping into the environment's resources (the house sound system). Another is an audio zoom function linked to the camera, which combines microphone beamforming with shake compensation, position compensation, etc.

Sensors are being widely deployed on smart devices. On today's smartphones, for example, an accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, speaker, and front- and back-facing cameras have become common. Despite the presence of these sensors, mobile phone sensing is still in its infancy. We want to raise industry awareness of the end-user benefits of combining multiple sensor domains. In this report we limit the scope to the benefits for audio-related applications and use cases and their relationship to sensor data. In addition, we analyzed the bandwidth requirements the sensors must meet to enable these benefits.

The report is organized in the following sections: 
        - Sensors
        - Multilayer approach
        - Audio applications
        - Challenges
        - Conclusions

The figure below shows an overview of different sensors. Some of these are already widely deployed in mobile phones, smartphones and wearables, such as an accelerometer, gyroscope, GPS, proximity sensor, microphone arrays, speaker, and front- and back-facing cameras.

The output of the sensor layer is a raw data signal with structured properties, such as information about the current data, the sampling frequency, the number of dimensions and the size of each dimension. Most sensors yield one-dimensional data, for example audio signals or temperature. Other sensors provide multi-dimensional data, for example an accelerometer.

The physical limitations and measurement range of the sensors bound what can be measured. The system architect also needs to know the power and processing requirements.

     1. Mechanical
          a. Acceleration 
        
          | Accelerometer |  
          | Stimulus range | 0.0005 → 16 | ±g / axis |  
          | Dynamic range | 90 | dB |  
          | Frequency range | 10 → 500 | Hz |  
          | PGA | 0 → 24 | dB |  
          | ADC resolution | ≥ 12 | Bits |  
          | Arithmetic | 16 | Bits |  
        
          | Sampling mode | Rate | Power |  
          | Continuous | 200 Hz | 0.2 mW |  
          | Sync’ed | 200 Hz | 0.2 mW |  
          | Triggered | N/A | N/A |  
          | Event triggered | 0 | 0.00025 mW |  
          | Suspend | 0 | 0.00002 mW |

          b. Rotation 
        
          | Gyroscope |  
          | Stimulus range | 0.1 → 50 000 | ±°/s |  
          | Dynamic range | 115 | dB |  
          | Frequency range | 1 → 5000 | Hz |  
          | PGA | N/A | dB |  
          | ADC resolution | ≥ 20 | Bits |  
          | Arithmetic | ≥ 24 | Bits |  
        
          | Sampling mode | Rate | Power |  
          | Continuous | 10 000 Hz | 2 mW |  
          | Sync’ed | 500 Hz | 5 mW |  
          | Triggered | N/A | N/A |  
          | Event triggered | 0 | 0.00025 mW |  
          | Suspend | 0 | 0.00002 mW |

          c. Atmospheric pressure 
        
          | Barometer |  
          | Stimulus range | 200 → 1500 | mbar RMS |  
          | Dynamic range | 100 | dB |  
          | Frequency range | 10 | Hz |  
          | PGA | N/A | dB |  
          | ADC resolution | ≥ 24 | Bits |  
          | Arithmetic | ≥ 24 | Bits |  
        
          | Sampling mode | Rate | Power |  
          | Continuous | 10 Hz | 0.01 mW |  
          | Sync’ed | 1 Hz | 0.05 mW |  
          | Triggered | N/A | N/A |  
          | Event triggered | 0 | N/A |  
          | Suspend | 0 | 0.000001 mW |

          d. Sound pressure 
        
          | Microphone |  
          | Stimulus range | 20 → 140 | dB SPL |  
          | Dynamic range | 120 | dB |  
          | Frequency range | 10 → 20 000 | Hz |  
          | PGA | N/A | dB |  
          | ADC resolution | ≥ 20 | Bits |  
          | Arithmetic | ≥ 24 | Bits |  
        
          | Sampling mode | Rate | Power |  
          | Continuous | 3 MHz | 2 mW |  
          | Sync’ed | 3 MHz | 2 mW |  
          | Triggered | N/A | N/A |  
          | Event triggered | 0 | TBD |  
          | Suspend | 0 | 0.001 mW |

          e. Ultrasonic wave pressure 
        
          | Ultrasonic Microphone |  
          | Stimulus range | 20 → 100 | dB SPL |  
          | Dynamic range | 80 | dB |  
          | Frequency range | 20k → 80k | Hz |  
          | PGA | N/A | dB |  
          | ADC resolution | ≥ 14 | Bits |  
          | Arithmetic | ≥ 16 | Bits |  
        
          | Sampling mode | Rate | Power |  
          | Continuous | 3 MHz | 2 mW |  
          | Sync’ed | 3 MHz | 2 mW |  
          | Triggered | N/A | N/A |  
          | Event triggered | 0 | TBD |  
          | Suspend | 0 | 0.001 mW |

          f. Gas flow
          g. Speaker
          h. Temperature

     2. Electromagnetic
          a. Ambient light 
        
          | Ambient Light Sensor |  
          | Stimulus range | 0.002 → 65k | Lux |  
          | Dynamic range | 150 | dB |  
          | Frequency range | N/A | Hz |  
          | PGA | 0 → 36 | dB |  
          | ADC resolution | ≥ 16 | Bits |  
          | Arithmetic | ≥ 32 | Bits |  
        
          | Sampling mode | Rate | Power |  
          | Continuous | 10 Hz | 0.5 mW |  
          | Sync’ed | 10 Hz | 0.5 mW |  
          | Triggered | N/A | N/A |  
          | Event triggered | 0 | TBD |  
          | Suspend | 0 | 0.001 mW |

          b. Infrared light
          c. Magnetism 
        
          | Magnetometer |  
          | Stimulus range | 0.005 → 16 | ± gauss |  
          | Dynamic range | 70 | dB |  
          | Frequency range | N/A | Hz |  
          | PGA | 0 → 12 | dB |  
          | ADC resolution | ≥ 12 | Bits |  
          | Arithmetic | ≥ 16 | Bits |  
        
          | Sampling mode | Rate | Power |  
          | Continuous | 50 Hz | 0.05 mW |  
          | Sync’ed | 20 Hz | 0.5 mW |  
          | Triggered | N/A | N/A |  
          | Event triggered | 0 | TBD |  
          | Suspend | 0 | 0.001 mW |

          d. GPS
          e. Camera

     3. Human
          a. Blood pressure
          b. Hand grip
          c. Skin conductivity
          d. Fingerprint detection

     4. Connectivity
          a. Bluetooth
          b. WLAN

The raw data from the sensors is typically interpreted in a multi-layer approach in order to make higher-level, context-aware decisions. Reasons against always streaming the raw sensor data are: 
        - Privacy: sending raw data to the cloud
        - Bandwidth
        - Energy consumption: e.g., the application processor processing high-bandwidth raw data
        - CPU usage

In this report, we follow a three-layered architecture: 
        - Functions: the raw sensor data; which sensors are available in the system?
        - Features: compressed summaries or cues interpreted from (multiple) raw sensor data; what can we learn from the sensors?
        - Applications or user benefits: the decision level; how do we combine features into a tangible benefit for the user?

Example 1: You pick up a call in your office and need to have a conversation with a group of people. You walk to a meeting room where more people are present and put down your phone, and the phone goes into a desktop conferencing mode. 
        - Functions: ultrasonic microphone, accelerometer, gyro, magnetometer, proximity, grip detect
        - Feature: local device mode (Near Field vs. Hands Free vs. Far Field)
        - Application: automatic switching from earpiece to speakerphone mode during a call

Example 2: You are on a conference call, and one participant is causing manipulation noise (rubbing his phone on the table, touching buttons, etc.), adding all kinds of background junk to the call. These clicks can be very annoying for the far-end listener. We can resolve this using audio content only, but detecting these manipulation sounds robustly becomes much easier if we can also use input from the accelerometer and other sensors. 
        - Functions: microphone array, accelerometer, gyro
        - Feature: manipulation noise detection
        - Application: improved noise suppression during voice calling

Example 3: Make a noise map of a city. For this, an application would probably want to measure sound levels only when the phone is out of the pocket or bag. To make in-pocket detection robust, multiple sensors could be combined: 
        - Functions: GPS, microphone, accelerometer, ambient light sensor
        - Feature: sound level, pocket detection
        - Application: noise map of a city

Although it is not the goal of this working group to focus on the details of the system architecture, a high-level proposal can be investigated. Splitting the system into Functions, Features and Applications enables the architecture pictured below. The sensors (Functions) are connected to a sensor hub or to a sensor-specific DSP/CPU core in the application. To minimize data transfer, the Feature routines need to run locally in this sensor hub or DSP/CPU core. Application functions run on the main application processor and call the Feature routines in the hub to get key information. Each Feature routine returns simple information derived from the specific sensors it uses.
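
The split between sensor hub and application processor described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SensorHub`, `SensorFrame`, `sound_level_db`, `in_pocket` and `noise_map_sample`, the 5 lux darkness threshold, and the use of ambient light alone for pocket detection are all hypothetical and not part of the proposal. The point it shows is that raw frames stay inside the hub and the application only receives compact Feature results, here for the noise-map case of Example 3.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class SensorFrame:
    """One block of raw samples delivered by a sensor (a Function)."""
    sensor: str
    sample_rate_hz: float
    samples: List[float]


Frames = Dict[str, SensorFrame]


class SensorHub:
    """Keeps raw frames local and exposes only compact Feature results."""

    def __init__(self) -> None:
        self._frames: Frames = {}
        self._features: Dict[str, Callable[[Frames], Any]] = {}

    def push(self, frame: SensorFrame) -> None:
        # Called by the sensor drivers; only the latest frame per sensor is kept.
        self._frames[frame.sensor] = frame

    def register_feature(self, name: str, routine: Callable[[Frames], Any]) -> None:
        self._features[name] = routine

    def query(self, name: str) -> Any:
        # The only call the application makes; no raw data crosses this boundary.
        return self._features[name](self._frames)


def sound_level_db(frames: Frames) -> float:
    """Feature: RMS level of the latest microphone frame in dB re full scale."""
    samples = frames["microphone"].samples
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))


def in_pocket(frames: Frames, dark_lux: float = 5.0) -> bool:
    """Feature: crude pocket detection from ambient light alone (hypothetical threshold)."""
    lux = frames["ambient_light"].samples
    return sum(lux) / len(lux) < dark_lux


def noise_map_sample(hub: SensorHub) -> Optional[float]:
    """Application: report a sound level only when the phone is out of the pocket."""
    if hub.query("in_pocket"):
        return None
    return hub.query("sound_level_db")


hub = SensorHub()
hub.register_feature("sound_level_db", sound_level_db)
hub.register_feature("in_pocket", in_pocket)
hub.push(SensorFrame("microphone", 16000.0, [0.1, -0.1, 0.1, -0.1]))
hub.push(SensorFrame("ambient_light", 10.0, [300.0, 320.0]))
level = noise_map_sample(hub)  # a float here, because the light level is far above dark_lux
```

A more robust pocket detector would combine several sensors inside `in_pocket`, as the example suggests; the application code would not need to change, because it only sees the Feature result.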
For a detailed list of the Functions, Features and Applications, please refer to the tables in this document.

Advantages of the proposed architecture: 
        - Reduced data transfer between hub and application processor
        - Optimized for power management
        - Standardization possible to support distributed systems (Ubiquitous Network)
        - Independent of the main operating system used

Redundant and complementary sensors can be fused and integrated in order to enhance system reliability and accuracy. Multi-sensor fusion can bring benefits in a wide range of applications, such as robotics, military and biomedical. In this section, we analyze how fusing audio and non-audio sensors using the multi-layer approach can benefit the experience of audio applications. In a first step, we list typical challenges in audio use cases. In a next step, sensor functions are mapped towards resolving these challenges. For this analysis, audio applications were classified according to use cases: 
        - Two-way communication (human-human)
        - One-way communication (human-machine)
        - Multimedia recording
        - Multimedia playback
        - Objective audio
        - Idle case

1. Two-way communication (human - human)

Two-way communication happens in adverse acoustic conditions: 
        - Noisy environments
        - Echo: sound from the speaker is captured by the microphone, resulting in echo for the far-end talker
        - Room acoustics: reverberation and reflections of the audio signals
        - Varying signal levels
        - Unknown user handling: strange device positions, covering microphones, position of speaker towards ear, pressure of speaker on the ear, etc.

These conditions are improved by active voice processing techniques, e.g., acoustic echo cancelling (AEC) and multi-microphone noise suppression (NS) algorithms. 
        
          | Mode | Challenges / Benefit by sensor fusion |  
          | Close talk (earpiece) | Positional/orientation robustness: provide the same acoustic performance independent of device position relative to the user; Microphone coverage: the user can cover one or more microphones by hand or face, greatly influencing the captured signal; Speaker leakage/coverage: depending on the pressure applied from the earpiece speaker towards the ear, the loudness, captured echo and frequency response vary; Manipulation noise: e.g., user tapping the phone; Seamless transition between NF and HH modes |  
          | Handheld speaker | Close talk challenges, plus: Loudness: get more loudness out of small speakers (keeping distortion to a minimal level); Changing room dynamics; Privacy |  
          | Conference mode | Handheld speaker challenges, plus: Multiple talkers: who is talking and who is desired? |  
          | Headset | Manipulation noise; Intelligently enabling environmental noise |  
          | Far talk | Reverberation; Talker location |  
          | Automotive | Reverberation; Multiple speakers: are there other passengers? |

2. One-way communication (human - machine)

One-way communication happens in adverse conditions similar to those of two-way communication. Solutions, however, can be distinct, as automatic speech recognition engines do not necessarily react the same way as human listeners. 
        
          | Mode | Challenges / Benefit by sensor fusion |  
          | Close talk (earpiece) | Positional/orientation robustness: provide the same acoustic performance independent of device position relative to the user; Microphone coverage: the user can cover one or more microphones by hand or face, greatly influencing the captured signal; Speaker leakage/coverage: depending on the pressure applied from the earpiece speaker towards the ear, the loudness, captured echo and frequency response vary; Manipulation noise: e.g., user tapping the phone; Seamless transition between NF and HH modes |  
          | Handheld speaker | Close talk challenges, plus: Loudness: get more loudness out of small speakers (keeping distortion to a minimal level); Changing room dynamics; Privacy |  
          | Conference mode | Handheld speaker challenges, plus: Multiple talkers: who is talking and who is desired? |  
          | Headset | Manipulation noise; Intelligently enabling environmental noise |  
          | Far talk | Reverberation; Talker location |  
          | Automotive | Reverberation; Multiple speakers: are there other passengers? |

3. Multimedia recording 
        
          | Mode | Challenges / Benefit by sensor fusion |  
          | Camcording | Audio zoom: changing the audio processing depending on the camera focal length and on who is in focus; Stereo-mono selection based upon device orientation; Motor noise cancellation; Attach metadata: GPS location, noise, talkers, etc. |  
          | Voice recording | Automatic microphone selection based upon device orientation; Attach metadata: GPS location, noise, talkers, etc.; Manipulation noise |  
          | Sound/music recording | Attach metadata: GPS location, noise, talkers, etc.; Manipulation noise |

4. Multimedia playback

Multimedia playback can happen in a variety of environments, from quiet to consistent noise (airplane) to varying noise, and across a variety of output devices: headphones/earbuds and internal and external speakers. 
        
          | Mode | Challenges / Benefit by sensor fusion |  
          | Mobile device |   |  
          | Internal speaker | Orientation – render stereo vs. mono; Equalization based on placement; Loudness boost; Multi-device synchronous playback – group play; Pocket detection; Location detection w.r.t. listener and room characteristics; Sweet spot creation based on listener location relative to the device |  
          | Headset/Headphone | Environmental noise reduction; Head and device position tracking; Playback and stop |  
          | Push – AirPlay, etc. | Positional location w.r.t. speaker resources |  
          | At home - Stationary | Orientation – render stereo vs. mono; Equalization based on placement; Multi-device synchronous playback – group play; Location detection w.r.t. listener and room characteristics; Sweet spot creation based on listener location relative to the device; Playback and stop; User identification – voice or visual |

5. Objective audio (gaming) 
        
          | Mode | Challenges / Benefit by sensor fusion |  
          | Headset/Headphone | Head rotation tracking; Manipulation noise suppression, e.g., noise from the game controller; Spatialization of device to device; Intelligently enabling environmental noise |  
          | Handheld - Internal speaker | Privacy; Echo cancellation; Mic coverage; Loudness: get more loudness out of small speakers (keeping distortion to a minimal level); Changing room dynamics; Speaker coverage; Manipulation noise suppression, e.g., noise from the game controller |  
          | TV/Living room - External speaker | Privacy; Echo cancellation; Mic coverage; Talker location; Loudness boost if external speakers are limited in response; Speaker coverage |

6. Idle case

The idle case is when the device is in the low-power, always-on state, waiting for a wake-up event. Lowest power is the goal, so only those sensors that are absolutely needed are left on. Always listening can be accomplished by combining a sound/speech detector with a full voice trigger that is started once the sound/speech threshold is tripped. For proximity detection, ultrasonic detection and the accelerometer can be used and are sufficient for always-on operation. 
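
The two-stage always-listening scheme described above can be sketched as follows. This is a minimal illustration, not a reference design: the function names, the -40 dB threshold, and the `voice_trigger` callback (a stand-in for a real wake-word engine) are assumptions. The cheap level detector runs on every frame, and the expensive trigger is invoked only for frames that cross the threshold.

```python
import math
from typing import Callable, Iterable, List, Optional


def frame_level_db(frame: List[float]) -> float:
    """Stage 1: cheap RMS level of one audio frame in dB re full scale."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-12))


def always_listening(
    frames: Iterable[List[float]],
    voice_trigger: Callable[[List[float]], bool],
    threshold_db: float = -40.0,  # hypothetical sound/speech threshold
) -> Optional[int]:
    """Return the index of the first frame accepted by the voice trigger, else None."""
    for index, frame in enumerate(frames):
        if frame_level_db(frame) < threshold_db:
            continue  # stay in the low-power state; stage 2 is not run
        if voice_trigger(frame):  # stage 2: full (expensive) voice trigger
            return index
    return None


calls: List[List[float]] = []


def stub_trigger(frame: List[float]) -> bool:
    # Hypothetical stand-in for a wake-word engine: accepts every frame it sees.
    calls.append(frame)
    return True


quiet, soft, loud = [0.0] * 4, [0.001] * 4, [0.5] * 4
woke_at = always_listening([quiet, soft, loud], stub_trigger)
```

In this run only the last frame reaches `stub_trigger`; the two quieter frames are rejected by the level detector alone, which is where the power saving comes from.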
        
          | Mode | Challenges / Benefit by sensor fusion |  
          | Always on low power listening | Hands-free operation with single mic input: VAD or sound detect; Wake-up word / hot word; Speaker ID, authentication |  
          | Proximity detection | Hands-free operation with accelerometer or ultrasonic mic input |

1. From Functions to Features

In a first step, we map the functions to the features (or cues) that we need to address the challenges discussed above. Multiple sensors can be used to provide a more reliable or more accurate feature. This is sensor data fusion.

2. From Features to User Benefits

In a second step, we map the features to an improvement in user experience. From this table we see a second level of sensor fusion, as multiple features are combined: ‘feature fusion’.
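
The two levels of fusion can be illustrated with the manipulation-noise case of Example 2. The sketch below is hypothetical: the peak thresholds, the gain values and the function names are assumptions chosen for illustration only. The first function fuses two sensors into one feature; the second fuses two features (manipulation noise and local device mode) into a decision a noise suppressor could act on.

```python
from typing import Sequence


def peak(frame: Sequence[float]) -> float:
    """Largest absolute sample in a frame (0.0 for an empty frame)."""
    return max((abs(s) for s in frame), default=0.0)


def manipulation_noise(
    mic: Sequence[float],
    accel: Sequence[float],
    mic_threshold: float = 0.5,    # hypothetical, full-scale units
    accel_threshold: float = 0.2,  # hypothetical, in g
) -> bool:
    """Sensor data fusion: flag a click only if both sensors see a transient."""
    return peak(mic) > mic_threshold and peak(accel) > accel_threshold


def suppression_gain(manipulation: bool, hands_free: bool) -> float:
    """Feature fusion: combine two features into one gain for the current frame."""
    if not manipulation:
        return 1.0  # leave the signal untouched
    # Hypothetical policy: attenuate harder when the device lies on a table.
    return 0.1 if hands_free else 0.3


click = manipulation_noise([0.0, 0.9, 0.1], [0.0, 0.5, 0.0])    # both sensors react
speech = manipulation_noise([0.0, 0.9, 0.1], [0.0, 0.01, 0.0])  # loud audio, no shock
gain = suppression_gain(click, hands_free=True)
```

Requiring agreement between the microphone and the accelerometer is what keeps loud speech from being mistaken for handling noise in this sketch.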
        - The system is rather complex, and the infrastructure is currently not available to make optimum use of all the sensors and systems available in a room, or even in a one-box system.
        - There are also challenges in standardization. To make use of this system in the most effective way, it is important to standardize on:
            - the Features and their interfacing command structure
            - interfaces / bus to the sensors but also to the application processor
            - Ultrasonic: standardize on identification (different frequencies for different devices in the room, adding metadata, …); who is pinging; ultrasonic pollution
            - Software architecture

        - Fusing the sensor data can improve the user experience by increasing the contextual awareness of the device.
        - Audio (microphone and speakers) can be considered a sensor as well.
        - The most valuable sensors to fuse with common audio processing seem to be:
            - Accelerometer
            - Ultrasonic microphones
            - Proximity detector
            - Speaker as sensor
        - A layered architecture is needed:
            - Multiple feature routines run simultaneously on the sensor hub
            - Higher-level functions run on the application processor
        - The sensor hub can have its own OS (lower power & performance)
        - Standardization is required:
            - In the software API
            - In the sensor bus
            - Minimization of ultrasonic pollution
            - Identification (authentication) of the ultrasonic source
 |