We often talk about ‘immersive audio’, where the listener feels like they are in the middle of a game, orchestra or movie. Spatial audio (HRTFs, room models, BRIRs, etc.) is usually the ‘go-to’ way to render these immersive scenes. Some of the problems with synthetic spatial audio, and with binaural field recordings, are:
1) The visual cues are missing or wrong.
2) Head motion is not taken into account.
3) HRTFs are generic and not individualized.
4) The listener’s environment is not taken into account.
That last point is particularly important. If you have a binaural recording made in a small room, but you listen to it in a large room, it will sound terribly colored. In fact, if the room you are listening in is not taken into account, any synthetic rendering or binaural recording will have coloration.
Another big issue is that, if the visual cue is missing, the listener tends to localize the sound behind them (or at least somewhere outside of their field of vision).
So what can be done to mitigate these issues? Is this something that we can engineer (i.e., build me some new, celebrity-endorsed headphones), or is it a matter of getting the signal processing just right (can you say ‘head tracker’, hallelujah!), or are there limitations at the cognitive level that need to be addressed?
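To make the ‘head tracker’ option a bit more concrete, here is a minimal sketch of the bookkeeping behind point 2: as the head turns, the renderer has to pick the HRTF for the source direction relative to the head, so the source stays put in the room. Everything here is an assumption for illustration: the function names are made up, azimuth and yaw are both in degrees and positive in the same rotational direction, and the HRTF set is assumed to be measured on a uniform azimuth grid (5 degrees by default) in the horizontal plane only.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def head_relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Source direction as seen from the head, given the tracked head yaw.

    Assumes both angles use the same sign convention (hypothetical choice).
    """
    return wrap_degrees(source_azimuth_deg - head_yaw_deg)


def nearest_hrtf_index(relative_azimuth_deg, grid_step_deg=5.0):
    """Index of the closest measurement on an assumed uniform azimuth grid.

    Index 0 is straight ahead; indices increase with azimuth and wrap at 360.
    """
    n_directions = int(round(360.0 / grid_step_deg))
    index = int(round((relative_azimuth_deg % 360.0) / grid_step_deg))
    return index % n_directions


if __name__ == "__main__":
    # A source fixed at 30 degrees while the listener turns their head.
    for yaw in (0.0, 30.0, 90.0):
        rel = head_relative_azimuth(30.0, yaw)
        print(yaw, rel, nearest_hrtf_index(rel))
```

A real renderer would also have to interpolate or crossfade between neighbouring HRTFs as the head moves and handle elevation; this sketch only covers the angle arithmetic, and it does nothing about points 1, 3 or 4.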