Everything that you perceive with your ears is coloring every other perception you have and every conscious thought that you have. Sound gets in so fast that it modifies all other input and sets the stage for it.
- Seth Horowitz
Most audio is hyper-real. Audio for games, TV, and movies goes beyond ‘realistic’ sound to suspend your disbelief and immerse you in an experience. Even meticulously recorded music is mixed into a hyper-real presentation. As of today, audio that attempts to be truly ‘real’ is limited to certain musical recordings and a handful of other applications.
AR experiences are currently a small but growing subset. Distractions are far more difficult to deal with in AR, since both virtual video and virtual audio objects must be modeled accurately. It is much harder to suspend disbelief when we are living in the real world.
As AR/MR rises in popularity, it is now more important than ever to be able to convey real audio. Hyper-real sounds pull you out of an augmented environment. But it is not sufficient to simply make binaural recordings of environments; the sound is interactive, and aspects of it have to be procedurally generated. So what is needed, and what are the obstacles?
- Virtual objects must match the real world acoustics
- The content must be balanced with the real world to be believable
- The content must not sound arbitrarily generated (e.g., footsteps should vary and never repeat)
Out-of-scope topics for this paper
- Real world objects interacting with virtual occlusions
- Reproduction of content coming out of virtual speakers
- Speech synthesis
- Synthesis techniques
Uncanny Valley: the phenomenon whereby a computer-generated figure or humanoid robot bearing a near-identical resemblance to a human being arouses a sense of unease or revulsion in the person viewing it.
Figure. The Uncanny Valley
To date, there is only a small amount of literature on an “aural” uncanny valley. This will be an area of great interest in the future because, for an audio presentation to be real, the uncanny valley must be avoided. While it is beyond the scope of this document to attempt to fully define the aural uncanny valley, here are some references:
From “The audio Uncanny Valley: Sound, fear and the horror game”:
- Certain amplitude envelopes applied to sound affect perceptions of urgency.
- Frequency might have an effect on the unpleasantness of sound and this might lead to negative affect.
- Familiar or iconic sounds can be defamiliarized and this can lead to perceptions of uncanniness.
- Uncertainty about the location of a sound source, its cause or its meaning in the virtual world increases the fear emotion.
- An aural resolution that is lower than a high quality, human-like visual resolution might lead to the uncanny.
- An exaggerated articulation of the mouth whilst speaking might lead to the uncanny.
- A lack of synchronization between lips and voice for photo-realistic virtual characters leads to a perception of the uncanny. In particular, sound that precedes associated video can be very unsettling.
- Industrial design constraints
  - Speaker/transducer location - closer to the ear is better, but not on or blocking the ear
  - Speaker/transducer size - larger is usually better
  - Transducer quality - low distortion
  - Number of transducers - minimum of 2 speakers required for spatial audio
- Frequency response at ear
  - Speaker/transducer placement - closer to ear is better, but not on or blocking the ear
- HRTF accuracy
  - Object-to-ear HRTF - personalized is best
  - Speaker-to-ear HRTF - needs to be considered for near-ear speakers
- Computing resources - limited on battery powered devices
  - M/AR is not a tethered experience
Virtual objects must match the real world acoustics
- Need a means to measure and match the room’s reverb
- Virtual audio objects must be occluded in the same way that real objects are
- Ideally real-world audio would also be occluded by virtual visual objects
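As one illustration of what measuring the room’s reverb could involve, the sketch below estimates the reverberation time (RT60) from a room impulse response using Schroeder backward integration and a straight-line fit to the -5 dB to -25 dB part of the decay. The function names, the fit range, and the assumption that an impulse response is already available as a list of samples are illustrative and not taken from this paper; a real system would probably work per frequency band and would have to obtain the impulse response some other way.

```python
import math


def schroeder_decay_db(impulse_response):
    """Return the Schroeder decay curve in dB, normalized to 0 dB at t = 0."""
    energy = [s * s for s in impulse_response]
    total = sum(energy)
    if total <= 0.0:
        raise ValueError("impulse response contains no energy")
    curve = []
    remaining = total  # energy from the current sample to the end
    for e in energy:
        curve.append(10.0 * math.log10(max(remaining, 1e-300) / total))
        remaining -= e
    return curve


def estimate_rt60(impulse_response, sample_rate, start_db=-5.0, end_db=-25.0):
    """Fit a line to the decay between start_db and end_db, extrapolate to -60 dB."""
    curve = schroeder_decay_db(impulse_response)
    points = [(i / sample_rate, level) for i, level in enumerate(curve)
              if end_db <= level <= start_db]
    if len(points) < 2:
        raise ValueError("decay range not covered by the impulse response")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_l = sum(l for _, l in points) / n
    # Least-squares slope of level versus time, in dB per second (negative).
    slope = (sum((t - mean_t) * (l - mean_l) for t, l in points)
             / sum((t - mean_t) ** 2 for t, _ in points))
    return -60.0 / slope


# Synthetic check: an exponential decay built to have an RT60 of 0.5 s.
decay_rate = 3.0 * math.log(10.0) / 0.5  # amplitude falls 60 dB in 0.5 s
ir = [math.exp(-decay_rate * n / 8000) for n in range(8000)]
print(round(estimate_rt60(ir, 8000), 2))
```

The estimate could then drive the decay time of whatever reverb is applied to virtual audio objects, so that they decay at roughly the same rate as real sounds in the room.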
The content must be balanced with the real world to be believable
- Eliminate visual or physical distractions from the experience (e.g. hiding speakers, using near-ear speakers instead of headphones)
- Ensure that content is not being masked by real world sounds
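A rough way to check whether content is at risk of being masked is to compare the level of the content against the level of the ambient sound at a few frequencies. The sketch below does this with the Goertzel algorithm and flags any frequency where the content does not exceed the ambient level by a chosen margin. This is a crude level comparison, not a psychoacoustic masking model; the 6 dB margin, the silence threshold, and all function names are assumptions made for illustration.

```python
import math


def goertzel_power(samples, sample_rate, freq_hz):
    """Normalized power of `samples` at `freq_hz` using the Goertzel algorithm."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2
    return power / (len(samples) ** 2)


def level_db(power, floor=1e-12):
    """Convert power to dB, clamping very small values to a floor."""
    return 10.0 * math.log10(max(power, floor))


def masked_bands(content, ambient, sample_rate, band_freqs_hz,
                 margin_db=6.0, silence_db=-90.0):
    """Return frequencies where content is less than margin_db above ambient."""
    masked = []
    for freq in band_freqs_hz:
        content_db = level_db(goertzel_power(content, sample_rate, freq))
        if content_db <= silence_db:
            continue  # content has no meaningful energy at this frequency
        ambient_db = level_db(goertzel_power(ambient, sample_rate, freq))
        if content_db - ambient_db < margin_db:
            masked.append(freq)
    return masked


# Hypothetical signals: content is mostly a 1 kHz tone with a weak 250 Hz part,
# while the room contributes a stronger 250 Hz hum.
fs, n = 8000, 800
content = [0.5 * math.sin(2 * math.pi * 1000 * i / fs)
           + 0.01 * math.sin(2 * math.pi * 250 * i / fs) for i in range(n)]
ambient = [0.1 * math.sin(2 * math.pi * 250 * i / fs) for i in range(n)]
print(masked_bands(content, ambient, fs, [250, 1000]))
```

In this example only the 250 Hz component is flagged, since the ambient hum is about 20 dB above the content there. In an M/AR device the ambient signal would presumably come from a microphone on the device, and the result could be used to raise the content level or shift its spectrum.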
The content must not sound arbitrarily generated (e.g., footsteps should vary and never repeat)
- How can we test this?
- This brings us closer to the original Turing test: is a believable AI required to achieve this?
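A common way to reduce audible repetition in triggered sounds such as footsteps is to pick each sample at random while excluding the most recently played ones, and to apply a small random pitch and gain offset to every trigger. The sketch below illustrates that idea only. The class name, the sample file names, and the jitter ranges are hypothetical, and this kind of variation is a mitigation rather than a guarantee that listeners will never notice a repeat.

```python
import collections
import random


class FootstepVariator:
    """Choose footstep samples without recent repeats and jitter pitch/gain."""

    def __init__(self, sample_names, history=2, pitch_cents=50.0,
                 gain_db=2.0, rng=None):
        if len(sample_names) <= history:
            raise ValueError("need more samples than the history length")
        self._samples = list(sample_names)
        self._recent = collections.deque(maxlen=history)
        self._pitch_cents = pitch_cents
        self._gain_db = gain_db
        self._rng = rng or random.Random()

    def next_step(self):
        # Exclude the samples played in the last `history` triggers.
        candidates = [s for s in self._samples if s not in self._recent]
        name = self._rng.choice(candidates)
        self._recent.append(name)
        cents = self._rng.uniform(-self._pitch_cents, self._pitch_cents)
        gain = self._rng.uniform(-self._gain_db, self._gain_db)
        return {
            "sample": name,
            "playback_rate": 2.0 ** (cents / 1200.0),  # cents to rate ratio
            "gain": 10.0 ** (gain / 20.0),             # dB to linear gain
        }


# Hypothetical sample names; a fixed seed keeps the demo reproducible.
variator = FootstepVariator(
    ["step_a.wav", "step_b.wav", "step_c.wav", "step_d.wav"],
    rng=random.Random(7),
)
for _ in range(5):
    print(variator.next_step())
```

A property such as "no sample is reused within the last N triggers" is easy to verify automatically, which offers a partial answer to the testing question above; whether the result sounds natural still needs listening tests.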
What is good enough to be believable?
- We need to better understand and quantify the threshold for what sounds real vs fake
- It often depends on the listener’s intention: “I want to be excited” (hyper-real) vs “I want it to be real”
- We need to support both “They are here” and “We are there” use cases
- Is it a background or foreground experience?
- Believability is affected by the listener’s experience, expectations, and age
- How realistic are any synthetic “organic” sounds?
- Audio cues extend the perceived field of view and the graphics depth of field
- If the related video is not realistic, then audio may be less believable
Below are two examples of subjective test methods that may be useful to rate the realness of an M/AR audio system. They are presented as introductory ideas; other test methodologies may be equal to or better than these.
In front of the listener is an acoustically-transparent screen (or the subject is blindfolded). Behind the screen is an acoustic sound source. Also present is a playback system. During the test, either the acoustic source or the playback system plays a sound.
- If the alternation is random, the listener may be asked to identify the sound source each time.
- If the alternation is not random, the listener may be asked to identify the real source, after a set of iterations.
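For the random-alternation variant, one possible way to score the results is a one-sided binomial test: if the listener cannot tell the acoustic source from the playback system, the number of correct identifications should look like coin flips. The sketch below computes the probability of getting at least the observed number of correct answers by guessing alone. The function name and the 50% chance level are assumptions for illustration, and the paper does not prescribe any particular statistic.

```python
from math import comb


def binomial_p_value(correct, trials, chance=0.5):
    """Probability of at least `correct` right answers out of `trials` by guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * chance ** k * (1.0 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Example: a listener identifies the source correctly in 14 of 20 trials.
print(round(binomial_p_value(14, 20), 3))
```

With 14 correct out of 20, the probability is about 0.058, so at a conventional 0.05 threshold this listener is not shown to perform better than chance. A result like this does not prove the two sources are indistinguishable; it only means the test did not detect a difference, so more trials or more listeners may be needed.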
What would the speaker-produced audio need to do?
- Would it need to make mistakes/variances as any acoustic source would?
- Would it need to express movement?
- Would it need to make ‘human’ or ‘natural’ like sounds in addition to the desired sound?
- Can one distinguish synthetic content from acoustically generated content?
It is clear that if only pre-recorded or synthesized sounds are used, a large corpus of them would be needed to convey the variance expected of a natural sound. Perhaps one practical way to run the test would be to procedurally generate the sounds.
A measure of audio realness is the ability of the audio to contribute to an immersive experience without inadvertently breaking the suspension of disbelief. A test of this attribute would necessarily be passive, as the audio must blend into the experience. This can help us quantify the threshold of believability.
- Create an immersive experience containing sounds/events to be tested
- One experience might contain background sounds, intended to be experienced inattentively
- Another might contain foreground sounds intended to draw attention
- Could have multiple versions of the experience with parameters changed (reverberance, sound design, A/V sync, etc.)
- Place the subject in the experience
- Interview the subject after the test about its realism, or ask the listener to score it
- Alternatively, measure the subject’s attentiveness to sound in a non-subjective manner
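The steps above can be organized as a small experiment plan: enumerate every combination of the parameters being varied, present the versions to each subject in a shuffled order, and average the realism scores per version. The sketch below shows one way to do that bookkeeping. The parameter names and values, the per-subject seeding scheme, and the use of a plain mean are all assumptions for illustration rather than a recommended protocol.

```python
import itertools
import random
import statistics
from collections import defaultdict


def build_conditions(parameters):
    """Return one dict per combination of the given parameter values."""
    names = list(parameters)
    return [dict(zip(names, values))
            for values in itertools.product(*(parameters[n] for n in names))]


def presentation_order(conditions, subject_id, seed=0):
    """Shuffle the conditions reproducibly for one subject."""
    rng = random.Random(f"{seed}-{subject_id}")
    order = list(conditions)
    rng.shuffle(order)
    return order


def mean_scores(ratings):
    """Average scores per condition; `ratings` holds (condition, score) pairs."""
    grouped = defaultdict(list)
    for condition, score in ratings:
        grouped[tuple(sorted(condition.items()))].append(score)
    return {key: statistics.mean(scores) for key, scores in grouped.items()}


# Hypothetical parameters for the versions of the experience.
conditions = build_conditions({
    "reverb": ["matched", "dry"],
    "av_sync_ms": [0, 80],
})
for condition in presentation_order(conditions, subject_id="s01"):
    print(condition)
```

Shuffling per subject helps keep order effects from being confused with the effect of a parameter, and seeding by subject makes each session reproducible.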
The Bandwidth of Human Perception and its Implications for Pro Audio by Thomas Lund (AES Library)
The McGurk Effect
Carnegie Mellon: The challenges of testing in a non deterministic world
Will Virtual Reality Get Lost in the Uncanny Valley Of Sound?
Gamasutra: Virtual Reality in the Uncanny Aural Valley
Examples of ‘uncanny’ sounds generated by ML