Reading the room: Today’s Augmented Reality (AR, also MR1, XR2) solutions do not consider listening and source contexts.
- Audio sources, when delivered to the listener, are “contaminated” by the acoustic environment where they were produced.
- Audio sources are not presented to the listener in a way that will be perceived as residing in the listener's acoustic environment.
Both shortcomings lead to cognitive dissonance and unconvincing AR audio.
1: Mixed Reality
2: XR is meant to be a catch-all acronym that includes Virtual Reality, Augmented Reality, and Mixed Reality. In this document, AR refers to those modes that transport others into the listener’s environment.
This leads to the following questions:
- How do we divorce a sound source from its surroundings (its “context”)?
- Who does this (source? app/service?), and how dynamically?
- How do we adapt sources to the listener's room/surroundings?
- How do we capture room profiles (both source and listener)?
- Can we use pre-existing microphones or hardware (even if only briefly), along with the AR hardware worn by the user?
- For all four of these tasks (source/context separation, source/destination placement, characterization of the source environment, and characterization of the listener environment), who does this work (app/service? destination?)?
- Are there sliding levels of fidelity at which this work can be done, depending on available resources (versus the low-latency immediacy of local processing)?
- What information does a renderer (listener end) require to be effective?
- What info about each sound source?
- What information about its own listening environment?
- Does the renderer need any information about the source's “context”?
Information needed by the renderer (depending on the use case) might include:
- Sources: positions, rotation, geometry/occlusion/propagation pattern
- Listener environment: materials (reflection, absorption), geometry (reverberation), overall impulse response
- Listener: position, rotation, HRTF
- (Note that the source’s environment is completely “canceled” and need not be conveyed to listeners)
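Grouped as data structures, the renderer inputs listed above might look like the following sketch. This is a rough illustration only; every type and field name here is an assumption for discussion, not a proposed schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Source:
    """One remote sound source, as seen by the listener-side renderer."""
    position: tuple            # (x, y, z) in meters, listener-space coordinates
    rotation: tuple            # (yaw, pitch, roll) in degrees
    propagation_pattern: list = None   # sampled polar response, if available

@dataclass
class ListenerEnvironment:
    """Acoustic description of the listener's room."""
    surface_absorption: dict = field(default_factory=dict)  # material -> coefficient
    reverb_time_s: float = 0.3         # RT60-style estimate from room geometry
    impulse_response: list = None      # measured overall IR, if available

@dataclass
class Listener:
    position: tuple
    rotation: tuple
    hrtf_profile: str = "generic"      # identifier for an HRTF set
```

The optional fields reflect the sliding-fidelity question above: a naive renderer can ignore everything beyond position and rotation.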
More detailed answers to these questions will be use-case specific, as described below.
We considered three broad categories of use cases:
- Social Interactions in AR
- Instructional and Training Applications
- Gaming in AR
- “They are here”
- Dissonance caused by sources
- E.g. their room profile leaks into our space
- Dynamic noise sources (e.g. faucet as you wash hands during a call)
- Source head movement
- Where are they looking when they talk?
- Who selects the rendering mode?
- Render movement only with voice
- “What do you think, Dave?” refers to which BarBQ Dave?
- Device/capture
- Device/rendering
- Prioritize audio vs. video (A vs. V)
How to solve listener dissonance, when interacting with remote source(s):
- Divorce the source from its room
- Remove room at the remote source.
- A dry signal is required to ensure that the local rendering is clean1
- A static room profile is needed, so that it can be cancelled
- Dynamic room profile/noise sources should be removed as well, as they arise
- Put the source in the listener's space, with a realistic position and rotation (taking into account the listener's environment, and ideally the source's 3D propagation pattern)
- Listener's room response
- Metadata needed to be transmitted to listener (might be ignored by naive listener clients):
- Position/location of sound sources in source space
- Rotation (reference and delta)
- Polar response of the source
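The "reference and delta" rotation idea could work roughly as in this sketch: send a full reference state once (or on keyframes), then only compact rotation deltas. The message layout and field names are invented for illustration, not a proposed wire format.

```python
import json

def reference_message(source_id, position, rotation):
    """Full state: sent once (or on a keyframe) to establish the reference."""
    return json.dumps({"id": source_id, "type": "ref",
                       "pos": position, "rot": rotation})

def delta_message(source_id, rotation_delta):
    """Compact update: only the change in rotation since the last state."""
    return json.dumps({"id": source_id, "type": "delta", "drot": rotation_delta})

def apply_messages(messages):
    """Listener side: reconstruct current source state from ref + deltas."""
    state = {}
    for raw in messages:
        msg = json.loads(raw)
        if msg["type"] == "ref":
            state[msg["id"]] = {"pos": msg["pos"], "rot": list(msg["rot"])}
        else:
            s = state[msg["id"]]
            s["rot"] = [r + d for r, d in zip(s["rot"], msg["drot"])]
    return state
```

A naive client can simply drop the delta messages and render the source at its reference position.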
Solutions (progressively more nuanced and expensive):
- Telephone-style dry voice, placed into a static 3D location by the listener's renderer
- Incorporate source rotation data, using standard HRTF profiles to generate rendered 3D position and rotation that account for listener-environment reflection/occlusion/absorption
- Generate custom 3D profiles for each source, via the ad hoc multi-mic room array of existing mics in the source environment (phone, Echo, etc.).
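A listener-side renderer might fall back across these three tiers depending on which metadata actually arrived. A minimal selection sketch; the tier names are ours, not part of any standard:

```python
def choose_rendering_tier(has_position, has_rotation, has_custom_profile):
    """Pick the richest of the three solution tiers the metadata supports."""
    if has_position and has_rotation and has_custom_profile:
        return "custom-3d-profile"   # per-source measured propagation pattern
    if has_position and has_rotation:
        return "standard-hrtf"       # generic HRTF + listener-room modeling
    return "static-placement"        # telephone-style dry voice at a fixed spot
```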
- Audio “brain” useful when visual brain is busy
- Audio good for 360 degree awareness, vision for foveated front awareness
“Where is the Hammer?”
- Item beyond view
- Item behind you etc
- You don’t always know where north is
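For the "item behind you" case, the renderer only needs the item's bearing relative to where the listener is currently facing, never absolute north. A minimal 2-D sketch, with assumed coordinate conventions (x/z ground plane, 0 degrees = straight ahead, positive = to the right):

```python
import math

def item_azimuth_deg(listener_pos, listener_yaw_deg, item_pos):
    """Bearing of the item relative to the listener's facing direction.

    0 = straight ahead, +90 = to the right, +/-180 = directly behind,
    so the audio cue works even though the user doesn't know where north is.
    Positions are (x, z) pairs on the ground plane.
    """
    dx = item_pos[0] - listener_pos[0]
    dz = item_pos[1] - listener_pos[1]
    world_bearing = math.degrees(math.atan2(dx, dz))  # 0 deg = +z axis
    # Wrap the relative bearing into (-180, 180]
    return (world_bearing - listener_yaw_deg + 180.0) % 360.0 - 180.0
```

The relative azimuth (plus elevation, in a full 3-D version) is exactly what an HRTF-based renderer needs to pan the cue.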
Visual de-clutter: move notifications off the screen
- Room context aware notifications; e.g., quiet vs loud environment
- Too many notifications to handle visually
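A room-context-aware notification router could be as simple as the following sketch. The 70 dBA masking threshold and the clutter limit of five pending items are illustrative assumptions, not measured values:

```python
def notification_modality(ambient_dba, pending_visual,
                          loud_db=70.0, clutter_limit=5):
    """Choose 'audio' or 'visual' for one notification,
    given room loudness and current visual clutter."""
    if ambient_dba >= loud_db:
        return "visual"   # a chime would be masked by room noise
    if pending_visual >= clutter_limit:
        return "audio"    # de-clutter the display
    return "visual"       # quiet room, uncluttered screen: default
```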
Object-based audio solves sources (e.g., dry, positional, etc.)
- Other “real players” need to “be here”
- Dissonance etc. (see social)
- Small room, lots of players, how do you make it “sound right”?
- Pitch shift voices to make everyone fit?
- Same room audio pass-through
- Room info (static)
- Dynamic room info (voice from local folks)
- How do we get room info?
Recommendations for an industry standard to solve static room profiling (and eventually, 3D characterization of sound source propagation) “Inspired by Sonos” 2
- Make systems partially open for the greater good -- time-bounded
- “Alexa for 2 minutes, please share your microphones”
- Could be Roku, Samsung etc.
- Time stamps/sync will make this much easier
- Mic/source characteristics
- Security issues
- Trusted hardware groups?
- Aggregation /API
- Where are the room models and source models / libraries stored?
- The edge (not the U2 guitarist). If not on-device, then on a more powerful local device, or in a relatively-close, low-latency cloud (“fog”?)
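Once borrowed microphones are even coarsely time-stamped, a shared test signal plus cross-correlation can estimate the remaining per-device sample offset. A pure-Python brute-force sketch (a real system would use FFT-based correlation and would need to handle clock drift, not just offset):

```python
def estimate_offset(ref, other, max_lag):
    """Estimate by how many samples 'other' lags 'ref'.

    Brute-force cross-correlation over lags in [-max_lag, max_lag];
    returns the lag that best aligns the two recordings of the
    same room test signal.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(ref[i] * other[i + lag]
                    for i in range(len(ref))
                    if 0 <= i + lag < len(other))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

With per-device offsets known, the ad hoc devices behave as a loose microphone array, which is what the room-profiling step above requires.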
As a general thought, we noted that the only change needed in order to adapt this AR architecture to a VR architecture is to make the listener context a synthetic one (rather than being forced to reflect the actual listener environment). Elements such as the reflection/absorption/geometry of the VR environment remain required elements.
1: Dereverberation techniques are being actively developed, and if effective will do much of the heavy lifting in removing the “source” environment influence. Discussed here for example: https://www.vocal.com/dereverberation/speech-dereverberation-using-channel-inversion-and-equalization/
Reports from Prior Project Bar-B-Q Meetings
“Mode and Nodes – Enabling Consumer Use of Heterogeneous Wireless Speaker Devices” https://www.projectbarbq.com/reports/bbq17/bbq17r7.htm
- Relevant work on Cooperating Devices
“A spatial audio format with 6 Degrees of Freedom”
- A proposed standard for communicating metadata for six-degrees-of-freedom information.
“Creating Immersive Music with Audio Objects”
- Realistic rendering of audio for AR/VR
“Audio Sensor Opportunities: Market Requirements and Technology Challenges for the next Decade”
- Detailed survey of acoustic sensors currently available in devices.