The Twenty-fourth Annual Interactive Audio Conference
PROJECT BAR-B-Q 2019

Group Report:
Using Machine Learning to improve
the experience in consumer audio devices

   
Participants: A.K.A. "Not Hot Dog"
Rick Cohen, Qubiq
TVB Subbu, Analog Devices
Scott McNeese, Surfaceink
Jonathan Bailey, iZotope
Craig Linssen, Apple
Larry Przywara, Cadence
   

Brief statement of the problem(s) on which the group worked:

All of us have had the experience of sitting down to watch a movie or other A/V content we craved, only to have the experience degraded by audio rendering issues. Perhaps the dialogue is barely intelligible, the music soundtrack is overbearing, or the sound effects are lost due to a mismatch between encoding and decoding, or to limitations of the acoustic playback system. Perhaps you forgot to change the audio settings on your device from “movies” to “music” before listening to a concert. How often has your immersion been shattered by a blaring commercial in the middle of a dramatic scene?

Is machine learning a viable cure to these and other audio rendering issues?

Can machine learning also make the experience more personalized to the user and optimized for the class of media?

Can machine learning also address future home entertainment challenges as technology unlocks more interactive and immersive experiences?

Expanded problem statement:

The problem can be segmented as follows:

  1. Content is not being rendered as intended
    1. Mismatch between media encoding and device decoding
    2. Mismatch between intended acoustic reproduction system and actual system (number and type of speaker channels, speaker characteristics)
  2. Rendering not optimized for user preferences and content genre (personalization)
    1. For example, “Rock” EQ inappropriately applied to “Classical” music
    2. For example, “blaring” commercials in contrast to “soft” movie dialogue
    3. In the future, may include real-time ML learning of individual user preferences
    4. May leverage integration of biometric speaker ID into voice control

Expanded Solution Description

Proposed Systems Architecture

The block diagram above shows a typical home entertainment system.
  • OTT: “Over the Top”. An internet stream decoder.
  • STB: “Set-top box”. Used with a cable or satellite system.
  • AVR: “Audio Video Receiver”.
  • ARC: “Audio Return Channel”. An audio stream sent, after decoding within the TV, back to the AVR or soundbar.

HDMI signals contain multiplexed video and audio content. Current TVs, AVRs, and Soundbars include a decoder for the HDMI signal. The decoder then passes the raw audio (and video) streams to rendering devices (downmixing to the actual number of speakers). The renderer is then connected to post-processing, which performs tasks such as selecting a soundfield mode, applying EQ, and similar functions.
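As one concrete illustration of the downmixing step, the sketch below folds a 5.1 PCM block down to stereo using typical ITU-style coefficients; the channel ordering and function name are assumptions for illustration, not part of the proposed system.

```python
import numpy as np

def downmix_5_1_to_stereo(pcm: np.ndarray) -> np.ndarray:
    """Fold a 5.1 block (shape: samples x 6, channel order L, R, C, LFE, Ls, Rs)
    to stereo using typical ITU-style coefficients (assumed here)."""
    L, R, C, LFE, Ls, Rs = (pcm[:, i] for i in range(6))
    left = L + 0.707 * C + 0.707 * Ls
    right = R + 0.707 * C + 0.707 * Rs
    stereo = np.stack([left, right], axis=1)
    peak = np.max(np.abs(stereo))          # avoid clipping after summation
    return stereo / peak if peak > 1.0 else stereo
```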

In our proposed architecture, a Machine Learning processor is inserted between the decoder + renderer and the post-processor. The ML processor analyzes the audio data, and creates control signals which can be used to manipulate the parameters of the post-processor.
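A minimal sketch of that insertion point is shown below; the classifier call, the preset table, and the post-processor interface are hypothetical placeholders, since the actual parameter names are vendor-specific.

```python
# Hypothetical glue between the renderer output and the post-processor.
# classify_frame() stands in for the neural-network classifier described
# later in this report; the preset table is an illustrative assumption.
PRESETS = {
    "movie":  {"dialog_boost_db": 3.0, "soundfield": "cinema"},
    "music":  {"dialog_boost_db": 0.0, "soundfield": "stereo_wide"},
    "sports": {"dialog_boost_db": 2.0, "soundfield": "stadium"},
    "news":   {"dialog_boost_db": 4.0, "soundfield": "voice"},
}

def ml_control_stage(pcm_block, classify_frame, post_processor):
    """Analyze one rendered PCM block and steer the post-processor."""
    label = classify_frame(pcm_block)           # e.g. "news", "sports", ...
    params = PRESETS.get(label, PRESETS["movie"])
    post_processor.set_parameters(**params)     # control signals only
    return post_processor.process(pcm_block)    # audio path itself is unchanged
```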

Renderers often contain proprietary algorithms, so for practical reasons we cannot add machine learning into the path before the renderer. In an “open world”, placing ML ahead of the renderer would allow additional capabilities in classifying the audio.

See the “Proposed Neural Network Architecture” section of this report for information on how the ML device analyzes the audio data to create the control signals.

Software Requirements

Two types of algorithms are utilized in the ML processing. The first is signal processing: feature extraction, filtering, transforms, etc. The second is neural-network based: sound classification, context understanding, etc. The first type uses 32-bit fixed-point MACs and/or single-precision floating point; the second uses NN MACs supporting 16x8, 8x8, or lower weight sizes. DSPs supporting a large number of 32-bit MACs, 16xN-bit MACs, and multiple floating-point operations per cycle will provide the most energy- and cost-efficient solution.
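To make the two flavors concrete, the NumPy sketch below pairs a single-precision feature-extraction step with an 8-bit-weight multiply-accumulate of the kind an NN MAC unit would execute; sizes and scale factors are illustrative assumptions.

```python
import numpy as np

# Type 1: signal-processing feature extraction in single-precision float.
def log_power_spectrum(frame: np.ndarray) -> np.ndarray:
    windowed = frame.astype(np.float32) * np.hanning(len(frame)).astype(np.float32)
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    return np.log10(spectrum + 1e-10)

# Type 2: neural-network layer with 8-bit weights (quantization scheme assumed).
def int8_dense(x_q: np.ndarray, w_q: np.ndarray, x_scale: float, w_scale: float):
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)    # integer MACs
    return acc.astype(np.float32) * (x_scale * w_scale)  # dequantize the result
```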

Machine Learning Terminology

Machine learning / Deep Learning (RW)

“Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned.”

“Deep learning structures algorithms in layers to create an “artificial neural network” that can learn and make intelligent decisions on its own.”

“Deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence.”

[https://www.zendesk.com/blog/machine-learning-and-deep-learning/]

“A simple neural network is composed of an input layer, a single hidden layer, and an output layer.
A deep neural network has one key difference: instead of having a single hidden layer, it has multiple hidden layers. This allows the network to understand and emulate more complex and abstract behaviors.”

[https://www.quora.com/What-is-the-difference-between-neural-networks-and-deep-neural-networks]

“Within the field of machine learning, there are two main types of tasks: supervised, and unsupervised. The main difference between the two types is that supervised learning is done using a ground truth, or in other words, we have prior knowledge of what the output values for our samples should be. Therefore, the goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data. Unsupervised learning, on the other hand, does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points.”

Convolutional Neural Networks, or CNNs, were designed to map image data to an output variable. The CNN input is traditionally two-dimensional, a field or matrix, but can also be changed to be one-dimensional, allowing it to develop an internal representation of a one-dimensional sequence.

Proposed Neural Network Architecture

For much of the past decade, deep neural networks have offered solutions to the problems associated with inferring metadata from content, particularly image, video, and audio. Frequently used for image “classification” (or labeling) and especially for facial recognition, the convolutional neural network can be used to classify the audio stream and assign a label that depends on the content, such as “News” or “Sports.” Once the system has assigned a label to the content stream, that information can be provided to the post-processing and/or rendering engines to select an appropriate audio preset or mode, which will then process the audio based on the determined content type and output the processed audio via the user’s home entertainment system.

A proposed block diagram for this use case is shown below. This block diagram begins once the audio has been decoded. That PCM stream of audio should be converted into a spectral representation and fed into a trained convolutional neural network, which will output a label. That label will be fed to the DSP system to set the appropriate preset or mode for audio processing and subsequent amplification.
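A rough sketch of that path follows, assuming a log-mel spectrogram front end and a small 2-D CNN; the layer sizes and label set are illustrative, not a specification.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

LABELS = ["movie", "music", "sports", "news"]   # illustrative label set

class AudioClassifier(nn.Module):
    """Small 2-D CNN over a log-mel spectrogram; layer sizes are assumptions."""
    def __init__(self, n_labels: int = len(LABELS)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_labels)

    def forward(self, x):                        # x: (batch, 1, mels, frames)
        return self.classifier(self.features(x).flatten(1))

def classify_pcm(pcm: np.ndarray, sr: int, model: AudioClassifier) -> str:
    """Decoded PCM -> spectral representation -> label for the DSP preset."""
    logmel = librosa.power_to_db(librosa.feature.melspectrogram(y=pcm, sr=sr, n_mels=64))
    x = torch.from_numpy(logmel).float()[None, None]     # (1, 1, mels, frames)
    with torch.no_grad():
        return LABELS[model(x).argmax(dim=1).item()]
```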

In the future, we can contemplate additional options that can process the audio more surgically. Neural networks have also demonstrated their usefulness at unmixing audio using spectral masking. An enhanced neural network could make use of such an approach to, for example, separate the dialogue audio content from the rest of the sound effects, music, ambience and other background audio tracks, for separate processing. The diagram below demonstrates a potential systems architecture for such a system.

In this example the audio is labeled (“classified”) as before, but we have added an additional neural network that can perform a spectral masking step to “separate” the audio signals into dialogue and “background” (or all other) audio tracks for independent processing via the post processing and rendering systems.
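A minimal sketch of the masking step, assuming the separation network has already produced a dialogue mask over the STFT magnitude (the mask prediction itself is left as a placeholder):

```python
import numpy as np
import librosa

def separate_dialogue(pcm: np.ndarray, dialogue_mask: np.ndarray, n_fft: int = 2048):
    """Apply a 0..1 spectral mask (assumed to come from a neural network and to
    match the STFT shape) to split the mix into dialogue and background stems."""
    stft = librosa.stft(pcm, n_fft=n_fft)
    dialogue = librosa.istft(stft * dialogue_mask, length=len(pcm))
    background = librosa.istft(stft * (1.0 - dialogue_mask), length=len(pcm))
    return dialogue, background
```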

Personalization and Customization

We imagine that, in addition to the pre-built processing (e.g. categorization/classification) performed by this system, there will be a number of user parameters which can be customized in order to fine tune the learning model.

“Fine-tuning means taking some machine learning model that has already learned something before (i.e. been trained on some data) and then training that model (i.e. training it some more, possibly on different data).”

[https://www.quora.com/What-is-the-difference-between-transfer-learning-and-fine-tuning]

Some simple examples:

  • skewing the master EQ to boost or cut the bass
  • changing the sound field mode from the one selected by the ML system
  • muting a TV commercial

These parameters could be directly set by the user, or learned by the system while observing the listening behavior of the user.

The system could then feed back the user’s changes into the model, to improve the classifications. This is known as “on-line fine tuning”.
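One way such an on-line fine-tuning loop might look, assuming the classifier sketched earlier and treating the user’s manual corrections as new labels (the optimizer settings are illustrative):

```python
import torch
import torch.nn as nn

def fine_tune_on_user_feedback(model, feedback, lr: float = 1e-4):
    """feedback: list of (logmel_tensor, corrected_label_index) pairs collected
    whenever the user overrides the automatically chosen preset."""
    for p in model.features.parameters():        # freeze the feature extractor
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for logmel, label_idx in feedback:
        optimizer.zero_grad()
        loss = loss_fn(model(logmel[None, None]), torch.tensor([label_idx]))
        loss.backward()
        optimizer.step()
    model.eval()
```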

“Preference learning is a subfield in machine learning, which is a classification method based on observed preference information. In the view of supervised learning, preference learning trains on a set of items which have preferences toward labels or other items and predicts the preferences for all items.”

[https://en.wikipedia.org/wiki/Preference_learning]

The data set (source domain) of another prototypical user can also be adapted to the current user (target domain). This is called domain adaptation or transfer learning. 

“Domain adaptation is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning from a source data distribution a well-performing model on a different (but related) target data distribution. For instance, one of the tasks of the common spam filtering problem consists in adapting a model from one user (the source distribution) to a new one who receives significantly different emails (the target distribution). Domain adaptation has also been shown to be beneficial for learning unrelated sources. Note that when more than one source distribution is available, the problem is referred to as multi-source domain adaptation.”

[https://en.wikipedia.org/wiki/Domain_adaptation]

A single system might serve a collection of separate users: for example, different family members, or combinations of family members, playing a game or watching a sporting event, concert, or newscast on their entertainment system. The system could therefore store a separate customization for each situation. This is known as personalization.

[https://digitalgrowthunleashed.com/the-future-of-personalization-with-ai-and-machine-learning/]
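One lightweight way to sketch that per-user (or per-group) customization is a profile store keyed by the identified audience, layered on top of the content-class preset; all names and values here are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    """Per-user (or per-group) overrides layered on top of the content preset."""
    eq_bass_db: float = 0.0
    preferred_soundfield: dict = field(default_factory=dict)   # label -> mode

profiles = {
    "dad": Profile(eq_bass_db=2.0),
    "kids+dad": Profile(preferred_soundfield={"sports": "stadium"}),
}

def resolve_parameters(audience: str, label: str, base_preset: dict) -> dict:
    """Merge the content-class preset with whatever this audience has taught us."""
    params = dict(base_preset)
    profile = profiles.get(audience)
    if profile:
        params["bass_db"] = params.get("bass_db", 0.0) + profile.eq_bass_db
        params["soundfield"] = profile.preferred_soundfield.get(
            label, params.get("soundfield"))
    return params
```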

What’s next?

  • Automatic transcription of voices from audio (speech to text). Once the text is available, it can be displayed as closed captioning, or used as input to a text-to-speech system, a translation system, an animated sign-language video generation system, and more.
  • Lip reading from video, leading to automatic transcription of the text (same post-processing as above).
  • Prototyping of an ML system using a mobile device (e.g. an iPhone) to process the audio and output remote control signals via Bluetooth to an existing audio processing system.
  • Classification can also serve other purposes, such as a “mood input” for a visualization system. See “Semantic Audio” or “Music Information Retrieval”.
  • Segmentation of a song into its components, such as “verse”, “chorus” and “bridge”. These components can be used as input for a visualization system.
  • There (could be) an app for that!
    • Recent iPhones include a dedicated "Neural Engine" processor.
      • This processor can perform up to 600 billion operations per second!
    • The Core ML framework in iOS makes it easy to prototype and integrate machine learning models into an app (see the conversion sketch after this list).
      • Models can use personal data to make predictions, and can be trained or fine-tuned, all on the user’s device rather than through time-consuming operations in the cloud.
    • App ideas:
      • Use the built-in iPhone microphone or existing speaker mic arrays to analyze audio, make decisions and help the iPhone to function as a remote control
      • Analyze user behavior to notice things like whether the volume is being turned up or down and use that data to enhance the UX.
      • Use meta-data about what’s being watched to improve the accuracy of the classification system.
      • Process audio content in advance to build a deeper, segmented analysis of audio files and create richer audio/video visualizers.
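For prototyping, one plausible path is to train the classifier in Python and convert it to a Core ML model with Apple’s coremltools package, as sketched below; the PyTorch model and input shape are the illustrative ones from earlier in this report, not a prescribed workflow.

```python
import torch
import coremltools as ct

# Assume `model` is the trained AudioClassifier sketched earlier in this report.
model.eval()
example = torch.rand(1, 1, 64, 128)    # (batch, channel, mels, frames), illustrative
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="logmel", shape=example.shape)],
)
mlmodel.save("AudioClassifier.mlmodel")    # ready to drop into an Xcode project
```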

Conclusions:

  • TODAY ML can elevate the home entertainment experience by matching playback processing to the class of content
    • iPhones and Core ML in iOS can be used both to rapidly prototype ideas and to develop stand-alone apps, or apps that integrate with existing hardware
  • SOON ML can also . . .
    • Remix content to optimize for the capabilities of the acoustic playback system
    • Enhance the entertainment experience through learning and personalization
    • Audio playback can be further enhanced by:
      • Adding or utilizing existing microphone arrays (e.g. in speakers, iPhones, etc.) for voice control, acoustic correction, and identifying users
      • Speaker ID (user) for personalization and content control
      • Multi-modal ML to classify visual as well as audio content

Inspiration / Links

Machine learning for synthesis (Project Bar-B-Q 2018 report)
https://www.projectbarbq.com/reports/bbq18/bbq18r3.htm

https://valossa.com/
https://www.youtube.com/watch?v=pqTntG1RXSY (Hotdog or not hotdog)
https://towardsdatascience.com/audio-ai-isolating-vocals-from-stereo-music-using-convolutional-neural-networks-210532383785?gi=d14bcb0bfa34 (Source Separation using Spectral Masking)
https://www.youtube.com/watch?v=FwoggcbA_Sw (iZotope Neutron Track Assistant)
https://medium.com/@CVxTz/audio-classification-a-convolutional-neural-network-approach-b0a4fce8f6c (Audio classification using neural networks)

