Project Bar-B-Q 2017 report section 6

The Twenty-second Annual Interactive Audio Conference PROJECT BAR-B-Q 2017

Group Report: CAAML: Creative Audio Applications of Machine Learning

Participants: A.K.A. "Pigmoid and the Funky Softmaxes"

Marcus Altman, Dolby

TVB Subbu, Analog Devices

Chris Bauerlein, Magic Leap

Alex Westner, Cakewalk

Yipeng Liu, Cadence

Owen Vallis, Kadenze

Konstantin Merkher, CEVA DSP

Anatoly Savchenkov, Synopsys

Avi Keren, DSP Group

Michael Vulikh, P Product

Adeel Aslam, Intel

Elia Shenberger, Waves

Matthew Altman, DTS/Xperi

Facilitator: Aaron Higgins


Problem Statement

In recent years, machine learning has provided state-of-the-art solutions for computer vision¹, text translation², speech recognition³, and a host of other problems. It is clear that machine learning is a powerful tool for solving a wide range of problems, and that it has the potential to provide new solutions for audio applications. However, many of these audio applications already have classical solutions, or have latency and compute constraints that severely limit the use of deeper neural networks. What is the set of audio problems that are amenable to machine learning? What is the set of audio problems for which classical solutions exist and are sufficient? What problems would benefit from a mix of classical solutions and machine learning? And finally, are there limitations or constraints that would exclude the use of a machine learning approach to the problem?

A brief statement of the group’s solutions to those problems

Many of the use cases discussed by the group already have classical solutions, and it was unclear what additional benefit would be gained by replacing them with a machine learning approach. Additionally, many audio applications have latency or compute constraints that make it difficult or impossible to use the larger state-of-the-art machine learning networks. It quickly became clear that machine learning is a powerful tool but not a panacea for all audio applications.

However, further discussions revealed many opportunities where hybrid systems can combine classical solutions and machine learning to improve existing audio applications. We categorized examples of machine learning use to better understand these opportunities. There are three categories based on how machine learning is used within an application: machine learning as a partial solution within a larger pipeline, machine learning as a full solution that stands by itself, and machine learning in parallel with a classical solution. Additionally, we split each category further based on whether the solution improves an existing application or solves a new problem.

Definitions
Partial: Machine learning as part of an application
Full: A machine learning based application
Parallel: Machine learning in parallel to a conventional system, perhaps for redundancy
New: No practical solution exists
Improve: Practical solutions exist, but machine learning improves on them, e.g. in power consumption, computation, accuracy, or speed

	Partial	Full	Parallel
New	Context-aware noise reduction	Speaker diarization; auto-generated subtitles; “Babelfish”, real-time translation	Context-aware noise reduction; self-driving cars: audio sensors for safety, also multi-modal
Improve	Double-talk detector; restoration	Anomaly detection; voice control	Double-talk detector
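
As an illustration only, the categorization above can be encoded as a small lookup structure, so that a use case can be tagged and queried by cell. The names `Role`, `Novelty`, `USE_CASES`, and `use_cases_in` are hypothetical and not part of the group's work; the entries simply mirror the table.

```python
from enum import Enum


class Role(Enum):
    """How machine learning is used within the application."""
    PARTIAL = "partial"    # ML as part of an application
    FULL = "full"          # an ML-based application
    PARALLEL = "parallel"  # ML alongside a conventional system


class Novelty(Enum):
    """Whether a practical solution already exists."""
    NEW = "new"
    IMPROVE = "improve"


# Each use case maps to the set of table cells it appears in.
USE_CASES = {
    "Context-aware noise reduction": {(Novelty.NEW, Role.PARTIAL),
                                      (Novelty.NEW, Role.PARALLEL)},
    "Speaker diarization": {(Novelty.NEW, Role.FULL)},
    "Auto-generated subtitles": {(Novelty.NEW, Role.FULL)},
    "Real-time translation": {(Novelty.NEW, Role.FULL)},
    "Self-driving cars: audio sensors": {(Novelty.NEW, Role.PARALLEL)},
    "Double-talk detector": {(Novelty.IMPROVE, Role.PARTIAL),
                             (Novelty.IMPROVE, Role.PARALLEL)},
    "Restoration": {(Novelty.IMPROVE, Role.PARTIAL)},
    "Anomaly detection": {(Novelty.IMPROVE, Role.FULL)},
    "Voice control": {(Novelty.IMPROVE, Role.FULL)},
}


def use_cases_in(novelty, role):
    """Return the use cases listed in one cell of the table, sorted by name."""
    return sorted(name for name, cells in USE_CASES.items()
                  if (novelty, role) in cells)


print(use_cases_in(Novelty.IMPROVE, Role.PARTIAL))
```

A new candidate application could be added to `USE_CASES` with one or more cells, which makes explicit the two questions the group asked of each use case: how machine learning is used, and whether a practical solution already exists.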

Example Details

Context-aware noise reduction

Noise reduction that can optimize itself for the current environment.
Auto detection of external sounds of interest, e.g. police sirens, someone calling the user’s name.
Estimation system reduces computation time.
Upload user noise sources for offline training and general improvement of the algorithm.

Auto-generated subtitles

Speaker identification and diarization → Understand who says what
Speech recognition → speech to text
Emotion recognition → add punctuation
Natural language processing → analyze the result and give feedback to earlier stages

Self-driving cars: Audio sensors for safety, also multi-modal
Acoustic context awareness can:

detect approaching sirens
serve as a second sense for object detection

Audio Restoration

Models are able to synthesize an appropriate replacement for the missing or corrupted audio.
There is precedent for this in image deep neural networks, where they are able to upsample low resolution images by synthesizing new pixels at the higher resolution.⁴ ⁵ ⁶

Double-Talk Detector

Parallel improvement case
Acts as a “controller” for the adaptation of the conventional LMS engine
Runs in parallel with the conventional double-talk detector (DTD) to improve detection
Classical NN usage: detects simultaneous speech (double talk)
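
The parallel arrangement described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the group's design: the conventional detector is taken to be a Geigel-style level comparison, the adaptive filter is a normalized LMS (NLMS) echo canceller, the two detector decisions are combined with a simple OR, and the neural network is replaced by a stand-in list of flags (`nn_flags`). All function names, signals, and parameter values are hypothetical.

```python
import math
import random


def geigel_dtd(far_history, mic_sample, threshold=0.5):
    """Conventional Geigel-style test: declare double talk when the microphone
    level exceeds a fraction of the recent far-end peak."""
    return abs(mic_sample) > threshold * max(abs(x) for x in far_history)


def run_echo_canceller(far, mic, nn_flags, order=4, mu=0.5, use_dtd=True):
    """NLMS echo canceller whose adaptation is frozen whenever either detector
    reports double talk. Returns the final filter weights."""
    weights = [0.0] * order
    history = [0.0] * order  # most recent far-end sample first
    for x, d, nn_says_double_talk in zip(far, mic, nn_flags):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        err = d - echo_estimate
        # Assumption: the NN decision is OR-ed with the conventional one.
        double_talk = use_dtd and (geigel_dtd(history, d) or nn_says_double_talk)
        if not double_talk:
            norm = sum(h * h for h in history) + 1e-8
            weights = [w + mu * err * h / norm for w, h in zip(weights, history)]
    return weights


def misalignment(weights, true_path):
    """Euclidean distance between the estimated and the true echo path."""
    return math.sqrt(sum((w - h) ** 2 for w, h in zip(weights, true_path)))


# Synthetic test signals (all values are illustrative assumptions).
rng = random.Random(0)
echo_path = [0.3, 0.15, -0.05, 0.0]
n_total, dt_start = 3000, 2000
far = [rng.uniform(-1.0, 1.0) for _ in range(n_total)]
near = [0.4 * math.sin(2 * math.pi * 300 * n / 8000) if n >= dt_start else 0.0
        for n in range(n_total)]
mic = [sum(echo_path[k] * far[n - k] for k in range(len(echo_path)) if n - k >= 0)
       + near[n] for n in range(n_total)]
# Stand-in for the output of a trained double-talk classifier.
nn_flags = [n >= dt_start for n in range(n_total)]

gated = misalignment(run_echo_canceller(far, mic, nn_flags), echo_path)
ungated = misalignment(run_echo_canceller(far, mic, nn_flags, use_dtd=False), echo_path)
print(f"misalignment with DTD gating: {gated:.2e}, without: {ungated:.2e}")
```

In this toy setup the near-end talker is active from sample 2000 to the end, so freezing adaptation during that interval should leave the filter close to the true echo path, while the ungated filter is disturbed by the near-end speech. How the two detector outputs are best combined in practice is an open design choice.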

Anomaly Detection

An ML system that has memory can detect significant deviations from the predicted output.
A long short-term memory (LSTM) network can be used to predict the next sample of an audio stream. The difference between the predicted and the observed audio can then serve as an anomaly detection signal.

(image from https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265)
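
The prediction-error idea can be illustrated with a much simpler predictor than an LSTM. The sketch below swaps in an adaptive linear (NLMS) one-step predictor purely to keep the example short and dependency-free; the residual-thresholding step is the part that carries over. The function names, the test signal, and the threshold rule (mean plus `k` standard deviations of a calibration segment, with a small floor) are all assumptions for illustration.

```python
import math


def prediction_errors(signal, order=8, mu=0.5):
    """Absolute one-step-ahead prediction error of an NLMS linear predictor.
    (Stand-in for the LSTM predictor mentioned in the text.)"""
    weights = [0.0] * order
    history = [0.0] * order  # most recent past sample first
    errors = []
    for x in signal:
        predicted = sum(w * h for w, h in zip(weights, history))
        err = x - predicted
        errors.append(abs(err))
        norm = sum(h * h for h in history) + 1e-8
        weights = [w + mu * err * h / norm for w, h in zip(weights, history)]
        history = [x] + history[:-1]
    return errors


def flag_anomalies(errors, calib_start=500, calib_end=1000, k=6.0, floor=1e-3):
    """Flag samples after the calibration segment whose prediction error exceeds
    mean + k * std of the calibration errors (never below `floor`)."""
    calib = errors[calib_start:calib_end]
    mean = sum(calib) / len(calib)
    std = math.sqrt(sum((e - mean) ** 2 for e in calib) / len(calib))
    threshold = max(mean + k * std, floor)
    return [i for i in range(calib_end, len(errors)) if errors[i] > threshold]


# Illustrative signal: a 440 Hz tone at 16 kHz with a click injected at sample 3000.
fs = 16000
signal = [0.5 * math.sin(2 * math.pi * 440 * n / fs) for n in range(4000)]
signal[3000] += 1.0

errors = prediction_errors(signal)
flagged = flag_anomalies(errors)
print("first flagged sample:", flagged[0] if flagged else None)
```

The floor on the threshold is there because a clean, highly predictable calibration segment yields an almost zero error, which would otherwise make the detector react to numerical noise. A learned predictor such as an LSTM would replace `prediction_errors`, while the thresholding could stay the same.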

Brainstorm of Audio Applications for Machine Learning & Artificial Intelligence

Voice User Interface

Voice controlled devices
Voice to text in a meeting environment (taking notes)
Room correction for voice capture based on voice signature, pattern
Bootstrapped network, analogous to networks pretrained on ImageNet, but trained on speech.
Speaker identification. One example is in the meeting room, conference call.
Auto generated subtitles
Voice identification

Context awareness

AI that works like a human mind, e.g. filling in missing or incorrect spelling or conversation.
Context awareness, on steroids. Through learning, the system can understand context such as each speaker's preference, personality, emotions, etc.
Human-machine, machine-machine conversation in gaming.
Multimodal biometric.
Context-aware noise reduction, source separation

Music synthesis (creating audio)

Auto generated game score – score generated based on personal preference/history
Music post-production is complicated and tedious, and existing solutions are painful. Can machine learning help, e.g. to find the right sound effect or reverb?
Help a student correct what’s wrong, like a music teacher
Music composition (example: Flow)
Synthesize new sound
Polyphonic pitch correction
Mimic style of artist, instrument – like mimic Van Gogh in painting
Digital instrument with style and context awareness
Resynthesize missing audio caused by packet loss
Transducers, physical music instruments designed by machine

User experience

Out of the box recommendation
Organize, sort, navigate sounds
Tuning sound system for live performances
Tuning of devices
Noise canceling using NN
QC of audio, e.g. when converting film to digital
Correcting distortion and nonlinearity caused by transducers
Automotive self-driving applications

Conclusion
Machine learning has many uses, but not every use is appropriate. This report identifies a categorization methodology. Machine learning is a tool and, like all tools, it must be applied selectively. A hammer is great, but sometimes you need a screwdriver, too.

Other reference material:

A number of creative tools are also emerging. Some examples are:

Automatic mastering service - https://www.landr.com/en/
Audio loop remixing tools - https://accusonus.com/products/regroover
Audio VST tools that analyze the audio - https://www.izotope.com/en/products/mix/neutron.html

¹ (2015, December 8). [1512.02325] SSD: Single Shot MultiBox Detector - arXiv. Retrieved December 1, 2017, from https://arxiv.org/abs/1512.02325

² (2016, September 26). Google's Neural Machine Translation System: Bridging the Gap .... Retrieved December 1, 2017, from https://arxiv.org/abs/1609.08144

³ (2017, July 24). Exploring Neural Transducers for End-to-End Speech Recognition. Retrieved December 1, 2017, from https://arxiv.org/abs/1707.07413

⁴ (2016, August 9). Image Completion with Deep Learning in TensorFlow - Brandon Amos. Retrieved December 1, 2017, from http://bamos.github.io/2016/08/09/deep-completion/

⁵ (n.d.). Deep Image Inpainting - CS231n. Retrieved December 1, 2017, from http://cs231n.stanford.edu/reports/2017/pdfs/328.pdf

⁶ (2015, July 31). Image Super-Resolution Using Deep Convolutional Networks - arXiv. Retrieved December 1, 2017, from https://arxiv.org/pdf/1501.00092
