The Twenty-second Annual Interactive Audio Conference
BBQ Group Report: CAAML: Creative Audio Applications of Machine Learning
Participants: A.K.A. "Pigmoid and the Funky Softmaxes"
Marcus Altman, Dolby
TVB Subbu, Analog Devices
Chris Bauerlein, Magic Leap
Alex Westner, Cakewalk
Yipeng Liu, Cadence
Owen Vallis, Kadenze
Konstantin Merkher, CEVA DSP
Anatoly Savchenkov, Synopsys
Avi Keren, DSP Group
Michael Vulikh, P Product
Adeel Aslam, Intel
Elia Shenberger, Waves
Matthew Altman, DTS/Xperi
Facilitator: Aaron Higgins  

Problem Statement

In recent years, machine learning has provided a number of state-of-the-art solutions for computer vision [1], text translation [2], speech recognition [3], and a host of other problems. It is clear that machine learning is a powerful tool for solving a wide range of problems, and that it has the potential to provide new solutions for audio applications. However, many of these audio applications have existing classical solutions, or have latency and compute constraints that severely limit the use of deeper neural networks. What is the set of audio problems that are amenable to machine learning? What is the set of audio problems for which classical solutions exist and are sufficient? What problems would benefit from a mix of classical solutions and machine learning? And finally, are there limitations or constraints that would exclude the use of a machine learning approach to the problem?

A brief statement of the group’s solutions to those problems

Many of the use cases discussed by the group already have classical solutions, and it was unclear what additional benefit would be gained by replacing these existing solutions with a machine learning approach. Additionally, many of the audio applications have latency or compute constraints that make it difficult or impossible to use the larger state-of-the-art machine learning networks. It quickly became clear that machine learning is a powerful tool but not a panacea for all audio applications.

However, further discussions revealed many opportunities where hybrid systems can utilize classical solutions and machine learning to improve existing audio applications. We have categorized examples of machine learning to better understand these opportunities. There are three categories based on the use of machine learning within an application: machine learning as a partial solution within a larger pipeline, machine learning as a full solution that stands by itself, and the use of machine learning in parallel with a classical solution. Additionally, we split the groups further based on whether the solution was an improvement to an existing application, or whether it was solving a new problem.

Partial: Machine learning as part of an application
Full: A machine learning based application
Parallel: Machine learning in parallel to a conventional system, perhaps for redundancy
New: No practical solution exists
Improve: Practical solutions exist, but machine learning improves on them, e.g. in power consumption, computation, accuracy, or speed






  • Context-aware noise reduction
  • Speaker Diarization
  • Auto-generated subtitles
  • “Babelfish”, real-time translation
  • Context-aware noise reduction
  • Self-driving cars: audio sensors for safety, also multi-modal


  • Double Talk Detector
  • Restoration
  • Anomaly Detection
  • Voice control
  • Double Talk Detector

Example Details
Context aware noise reduction

  • Noise reduction that can optimize itself for the current environment.
  • Auto detection of external sounds of interest, e.g. police sirens, someone calling the user’s name.
  • Estimation system reduces computation time.
  • Upload user noise sources for offline training and general improvement of the algorithm.

Auto-generated subtitles

  • Speaker identification and diarization → Understand who says what
  • Speech recognition → speech to text
  • Emotion recognition → add punctuation
  • Natural language processing → analyze the result and give feedback to earlier stages
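The first two stages above can be combined once each has produced timestamps. The following is a minimal sketch of that merge step only, assuming the diarization stage yields speaker segments and the speech recognizer yields timed words; the `SpeakerSegment` and `Word` types, the midpoint rule, and the sample data are illustrative assumptions, not something the group specified.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    """Hypothetical diarization output: who speaks from start to end (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass
class Word:
    """Hypothetical speech-recognition output: one word with its timing (seconds)."""
    text: str
    start: float
    end: float


def assign_speakers(segments, words):
    """Label each word with the speaker whose segment contains the word's midpoint."""
    labelled = []
    for word in words:
        midpoint = (word.start + word.end) / 2
        speaker = next(
            (s.speaker for s in segments if s.start <= midpoint < s.end),
            "unknown",  # no segment covers this word
        )
        labelled.append((speaker, word.text))
    return labelled


def to_subtitle_lines(labelled):
    """Group consecutive words from the same speaker into one subtitle line."""
    groups = []
    for speaker, text in labelled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(text)
        else:
            groups.append((speaker, [text]))
    return [f"{speaker}: {' '.join(texts)}" for speaker, texts in groups]


if __name__ == "__main__":
    # Hand-written example data standing in for real model outputs.
    segments = [SpeakerSegment("A", 0.0, 2.0), SpeakerSegment("B", 2.0, 4.0)]
    words = [Word("hello", 0.1, 0.5), Word("there", 0.6, 1.0), Word("hi", 2.2, 2.5)]
    for line in to_subtitle_lines(assign_speakers(segments, words)):
        print(line)
```

Punctuation from emotion recognition and feedback from the language-processing stage would sit after this merge; they are left out here because the report does not describe how they would work.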

Self-driving cars: Audio sensors for safety, also multi-modal
Acoustical context awareness can

  • detect approaching sirens
  • serve as a second sense for object detection

Audio Restoration

  • Models are able to synthesize an appropriate replacement for the missing or corrupted audio.
  • There is precedent for this in deep neural networks for images, which can upsample low-resolution images by synthesizing new pixels at the higher resolution. [4] [5] [6]


Double-Talk Detector

  • Parallel improvement case
  • Acts as a “controller” for the adaptation of the conventional LMS engine
  • Runs in parallel with the conventional double-talk detector (DTD) to improve detection
  • Classical neural network (NN) usage: detects simultaneous speech
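The "controller" role described above can be sketched as a gate on the adaptive filter's update step. The code below is a rough illustration under several assumptions that are not in the report: the echo canceller is a normalized LMS (NLMS) filter, the conventional detector is a Geigel-style level comparison with a hangover, and `ml_detector` is a hypothetical callable standing in for a trained network whose decision is OR-ed with the conventional one.

```python
import random


def run_echo_canceller(far, mic, taps=16, mu=0.5, eps=1e-6,
                       geigel_threshold=0.5, hangover=32,
                       ml_detector=None, gate_adaptation=True):
    """NLMS echo canceller whose adaptation freezes while double talk is detected."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent far-end sample first
    hold = 0                # remaining hangover samples after a detection
    residual = []
    for x, d in zip(far, mic):
        history = [x] + history[:-1]
        err = d - sum(w * h for w, h in zip(weights, history))
        residual.append(err)

        # Conventional (Geigel-style) test: mic level high relative to far-end peak.
        far_peak = max(abs(h) for h in history)
        detected = abs(d) > geigel_threshold * far_peak
        # Hypothetical ML detector running in parallel; either one may freeze adaptation.
        if ml_detector is not None:
            detected = detected or bool(ml_detector(history, d, err))

        if detected:
            hold = hangover
        elif hold > 0:
            hold -= 1

        if not (gate_adaptation and (detected or hold > 0)):
            norm = sum(h * h for h in history) + eps
            step = mu * err / norm
            weights = [w + step * h for w, h in zip(weights, history)]
    return weights, residual


def simulate(n=3000, talk_start=2000, seed=0):
    """Synthetic far-end noise, a short made-up echo path, and late near-end talk."""
    rng = random.Random(seed)
    echo_path = [0.3, -0.1, 0.05]  # illustrative values only
    far = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mic = []
    for i in range(n):
        echo = sum(g * far[i - k] for k, g in enumerate(echo_path) if i - k >= 0)
        near = rng.uniform(-1.0, 1.0) if i >= talk_start else 0.0
        mic.append(echo + near)
    return far, mic, echo_path


def misalignment(weights, echo_path):
    """Squared distance between the learned filter and the true echo path."""
    padded = echo_path + [0.0] * (len(weights) - len(echo_path))
    return sum((w - p) ** 2 for w, p in zip(weights, padded))


if __name__ == "__main__":
    far, mic, path = simulate()
    gated, _ = run_echo_canceller(far, mic)
    ungated, _ = run_echo_canceller(far, mic, gate_adaptation=False)
    print("misalignment with gating:   ", misalignment(gated, path))
    print("misalignment without gating:", misalignment(ungated, path))
```

In this toy setup, freezing adaptation during double talk should keep the filter close to the echo path it learned beforehand, while the ungated filter is pulled away by the near-end speech. A learned detector would plug in through `ml_detector` without changing the LMS engine itself, which is the sense in which it runs in parallel.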

Anomaly Detection

  • An ML system that has memory can detect significant deviations from the predicted output.
  • LSTM (Long Short Term Memory) can be used to predict the next sample from an audio stream. The difference between the predicted output and the observed audio can be used as an anomaly detection signal.
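The predict-and-compare idea above can be illustrated without a deep learning framework. The sketch below swaps the LSTM for a much simpler adaptive linear predictor (NLMS) so that it stays self-contained; only the residual-as-anomaly-signal logic carries over. The sample rate, tone frequency, click position, and threshold factor are all assumptions chosen for the demonstration.

```python
import math
import random


def prediction_errors(signal, order=8, mu=0.5, eps=1e-2):
    """Predict each sample from the previous `order` samples; return |error| per sample.

    A stand-in for the LSTM predictor: an adaptive linear (NLMS) predictor.
    """
    weights = [0.0] * order
    history = [0.0] * order  # most recent past sample first
    errors = []
    for sample in signal:
        predicted = sum(w * h for w, h in zip(weights, history))
        err = sample - predicted
        errors.append(abs(err))
        # Adapt the predictor toward the observed sample.
        norm = sum(h * h for h in history) + eps
        step = mu * err / norm
        weights = [w + step * h for w, h in zip(weights, history)]
        history = [sample] + history[:-1]
    return errors


def flag_anomalies(errors, baseline_start=100, baseline_end=200, factor=10.0):
    """Flag samples whose prediction error exceeds `factor` times a baseline mean error."""
    window = errors[baseline_start:baseline_end]
    threshold = factor * sum(window) / len(window)
    return [i for i in range(baseline_end, len(errors)) if errors[i] > threshold]


def synthetic_signal(n=1500, click_at=1000, seed=0):
    """A 440 Hz tone at an assumed 8 kHz rate with slight noise and one injected click."""
    rng = random.Random(seed)
    signal = [
        0.5 * math.sin(2 * math.pi * 440 * i / 8000) + rng.uniform(-0.01, 0.01)
        for i in range(n)
    ]
    signal[click_at] += 0.8  # the anomaly
    return signal


if __name__ == "__main__":
    errors = prediction_errors(synthetic_signal())
    flagged = flag_anomalies(errors)
    print("first flagged sample:", flagged[0] if flagged else None)
```

A recurrent model would replace `prediction_errors` and could capture far longer context than eight samples; the thresholding step would stay much the same.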

(image from https://www.slideshare.net/arunkejariwal/live-anomaly-detection-80287265)


Brainstorm of Audio Applications for Machine Learning & Artificial Intelligence

Voice User Interface

  1. Voice controlled devices
  2. Voice to text in a meeting environment (taking notes)
  3. Room correction for voice capture based on voice signature, pattern
  4. Bootstrapped network, such as ImageNet, trained on speech.
  5. Speaker identification. One example is in the meeting room, conference call.
  6. Auto generated subtitles
  7. Voice identification

Context awareness

  1. AI that works like the human mind, e.g. filling in missing or incorrect spelling or conversation.
  2. Context awareness, on steroids. Through learning, the system can understand context such as each speaker's preference, personality, emotions, etc.
  3. Human-machine, machine-machine conversation in gaming.
  4. Multimodal biometric.
  5. Context aware noise reduction, source separation

Music synthesis (creating audio)

  1. Auto generated game score – score generated based on personal preference/history
  2. Music post-production is complicated and tedious, and existing solutions are painful. Can machine learning help, e.g. by finding the right sound effect, reverb, etc.?
  3. Help students correct mistakes, like a music teacher
  4. Music composition, e.g. Flow
  5. Synthesize new sound
  6. Polyphonic pitch correction
  7. Mimic the style of an artist or instrument, like mimicking Van Gogh in painting
  8. Digital instrument with style and context awareness
  9. Resynthesize missing audio caused by packet loss
  10. Transducers, physical music instruments designed by machine

User experience

  1. Out of the box recommendation
  2. Organize, sort, navigate sounds
  3. Tuning sound system for live performances
  4. Tuning of devices
  5. Noise canceling using NN
  6. QC of audio, e.g. when converting film to digital
  7. Correcting distortion and nonlinearity caused by transducer
  8. Automotive self driving application

Machine learning has many uses, but not every use is an appropriate one. This report identifies a categorization methodology. Machine learning is a tool and, like all tools, it must be applied selectively. A hammer is great, but sometimes you need a screwdriver, too.

Other reference material:

A number of creative tools are also emerging. Some examples are:

  1. Automatic mastering service - https://www.landr.com/en/
  2. Audio loop remixing tools - https://accusonus.com/products/regroover
  3. Audio VST tools that analyze the audio - https://www.izotope.com/en/products/mix/neutron.html

[1] (2015, December 8). SSD: Single Shot MultiBox Detector. arXiv:1512.02325. Retrieved December 1, 2017, from https://arxiv.org/abs/1512.02325

[2] (2016, September 26). Google's Neural Machine Translation System: Bridging the Gap ... Retrieved December 1, 2017, from https://arxiv.org/abs/1609.08144

[3] (2017, July 24). Exploring Neural Transducers for End-to-End Speech Recognition. Retrieved December 1, 2017, from https://arxiv.org/abs/1707.07413

[4] (2016, August 9). Image Completion with Deep Learning in TensorFlow. Brandon Amos. Retrieved December 1, 2017, from http://bamos.github.io/2016/08/09/deep-completion/

[5] (n.d.). Deep Image Inpainting. CS231n. Retrieved December 1, 2017, from http://cs231n.stanford.edu/reports/2017/pdfs/328.pdf

[6] (2015, July 31). Image Super-Resolution Using Deep Convolutional Networks. arXiv. Retrieved December 1, 2017, from https://arxiv.org/pdf/1501.00092
