The Twenty-third Annual Interactive Audio Conference

Group Report:
An Exploration of Machine Learning and the use cases
where it might provide the most benefit for Audio Synthesis

Participants: A.K.A. "Frankensynth and the machines"
Jean-Marc Jot, Magic Leap
Rhonda Wilson, Dolby Labs
Rick Cohen, Qubiq
Dave Hodder, Novation
Howard Brown, Owl Labs
Lior Maimon, Waves
Francis Preve, Symplesound
Jon Bailey, iZotope
Martin Puryear, Google
Facilitator: Aaron Higgins, 1010Music

Problem Statement

Machine learning represents a huge opportunity. How do we explore this in the context of synthesis?

Executive Summary

During Project BBQ 2018, we explored potential applications of Deep Learning for audio synthesis, focusing primarily on musical and mixed reality applications. Within this report, we explore potential synthesis use cases that could benefit from current approaches in Deep Learning, zeroing in on three specific examples (Mixed Reality, Virtual Sound Designer, and Unsupervised Waveform Synthesis) and theorizing how each might be tackled with specific neural network architectures and data requirements.


Synthesis (FP)

  • Definition

The above diagram illustrates a common mixing signal flow for music/movie production or games/VR/AR audio.  Sound generation is performed in a first stage, including synthesis and optional insert effects, and produces a single or dual-channel audio signal.  This signal is then subject to a spatialization process including spatial “panning” and artificial reverberation.

  • Existing sound synthesis techniques
    • Subtractive: The basis for “analog”.
      • Tone generator (osc) into filter followed by amplifier, with envelopes and LFOs for modulation tasks.
    • Additive
      • Multiple oscillators, each with their own amplitude envelope, act as the harmonics of a sound.
      • Components for “resynthesis” might be measured using FFT or other means.
    • Frequency Modulation: Oscillators, configurable as “algorithms” with one or more oscillators modulating frequency of one or more carriers. Originally based on John Chowning’s Stanford research and pioneered by Yamaha.
    • Sampling: Recording, editing, and pitch control of digital audio, often configured in a subtractive synthesis format.
      • Detailed manipulation, rearrangement, and modulation of audio fragments/slices is also referred to as “granular synthesis”
    • Wavetable: Variable set of single-cycle waves arranged in a lookup table (array) that can be scanned and/or modulated in real-time and then used for one or more oscillators, often configured in a subtractive synthesis format.
      • Various methodologies exist for creating wavetables.
    • Physical modeling: Often based on the Karplus-Strong algorithm, which uses a very short delay line and a feedback loop whose structure resembles a comb filter (and can be analyzed via the z-transform). Waveguides are then applied in conjunction with the feedback loop to reproduce the frequency content and envelope characteristics of acoustic instruments.
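
The Karplus-Strong approach mentioned above is compact enough to sketch directly. The following is a minimal, illustrative NumPy version (parameter values such as the damping factor are our own choices, not from any particular product): a noise burst circulates through a short delay line whose length sets the pitch, with a damped two-point average acting as the feedback lowpass.

```python
import numpy as np

def karplus_strong(freq, duration, sample_rate=44100, damping=0.996):
    """Pluck: fill a short delay line with noise, then circulate it through
    a damped two-point averaging filter; the delay length sets the pitch."""
    n_samples = int(duration * sample_rate)
    delay_len = int(sample_rate / freq)
    buf = np.random.default_rng(0).uniform(-1.0, 1.0, delay_len)
    out = np.empty(n_samples)
    for i in range(n_samples):
        j = i % delay_len
        out[i] = buf[j]
        # feedback: lowpassed, slightly damped average of adjacent samples
        buf[j] = damping * 0.5 * (buf[j] + buf[(i + 1) % delay_len])
    return out

tone = karplus_strong(440.0, 0.5)   # half a second of a plucked A4
```

The averaging in the feedback loop is what gives the characteristic decay of high harmonics relative to low ones, mimicking a plucked string.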

Machine learning / Deep Learning (RW)

“Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned.”

“Deep learning structures algorithms in layers to create an “artificial neural network” that can learn and make intelligent decisions on its own.”

“Deep learning is a subfield of machine learning. While both fall under the broad category of artificial intelligence, deep learning is what powers the most human-like artificial intelligence.”


“A simple neural network is composed of an input layer, a single hidden layer, and an output layer.
A deep neural network has one key difference: instead of having a single hidden layer, it has multiple hidden layers. This allows the network to understand and emulate more complex and abstract behaviors.”
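
The shallow-versus-deep distinction quoted above can be made concrete with a small sketch (our own toy example; layer sizes and the tanh activation are arbitrary choices): the only structural difference between the two networks below is the number of hidden layers.

```python
import numpy as np

def mlp_forward(x, layers):
    """Forward pass through fully connected (weights, bias) pairs with tanh
    hidden activations; hidden-layer count is len(layers) - 1."""
    for w, b in layers[:-1]:
        x = np.tanh(x @ w + b)          # hidden layer(s)
    w, b = layers[-1]
    return x @ w + b                    # linear output layer

rng = np.random.default_rng(0)
def dense(n_in, n_out):
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

shallow = [dense(4, 8), dense(8, 2)]                          # one hidden layer
deep = [dense(4, 8), dense(8, 8), dense(8, 8), dense(8, 2)]   # three hidden layers

x = rng.normal(size=(5, 4))
out_shallow = mlp_forward(x, shallow)
out_deep = mlp_forward(x, deep)
```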


“Within the field of machine learning, there are two main types of tasks: supervised, and unsupervised. The main difference between the two types is that supervised learning is done using a ground truth, or in other words, we have prior knowledge of what the output values for our samples should be. Therefore, the goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data. Unsupervised learning, on the other hand, does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points.”
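
The supervised/unsupervised distinction can be sketched in a few lines (a toy illustration of our own, not from the cited article): the first half fits a function to labeled samples, the second infers cluster structure from unlabeled points.

```python
import numpy as np

rng = np.random.default_rng(1)

# Supervised: every sample x comes with a ground-truth label y, and we
# learn the mapping (here, a least-squares fit to a known linear rule).
x = rng.uniform(-1.0, 1.0, (100, 1))
y = 3.0 * x[:, 0] + 0.5
X = np.hstack([x, np.ones((100, 1))])
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # recovers slope 3.0, offset 0.5

# Unsupervised: no labels at all; infer structure in the data itself
# (here, two clusters in 1-D found by a few k-means iterations).
data = np.concatenate([rng.normal(0.0, 0.1, 50), rng.normal(5.0, 0.1, 50)])
centers = np.array([data.min(), data.max()])
for _ in range(10):
    assign = np.abs(data[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([data[assign == k].mean() for k in (0, 1)])
```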

[https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d ]

“An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction.”
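
As a toy illustration of that idea, the linear autoencoder below (our own minimal sketch, with arbitrarily chosen sizes, learning rate, and iteration count) learns, without labels, to compress 10-dimensional data that secretly lives on a 2-dimensional subspace and reconstruct it from the 2-dimensional code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 ten-dimensional points that secretly live on a 2-D subspace.
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 10))
data = latent @ basis

# Linear autoencoder: encoder (10 -> 2) and decoder (2 -> 10), trained by
# gradient descent to reconstruct its own input (no labels involved).
W_enc = rng.normal(0.0, 0.1, (10, 2))
W_dec = rng.normal(0.0, 0.1, (2, 10))
lr = 0.02
for _ in range(1000):
    code = data @ W_enc            # encoding: dimensionality reduction
    recon = code @ W_dec           # decoding: reconstruction
    err = recon - data
    W_dec -= lr * code.T @ err / len(data)
    W_enc -= lr * data.T @ (err @ W_dec.T) / len(data)

loss = np.mean((data @ W_enc @ W_dec - data) ** 2)
```

Real audio autoencoders (e.g. NSynth, discussed later in this report) use deep nonlinear networks, but the encode/bottleneck/decode shape is the same.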


A number of different networks/structures/architectures have been developed in deep learning.

GAN : generative adversarial network. Generative adversarial networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented by a system of two neural networks contesting with each other in a zero-sum game framework. One network generates candidates (generative) and the other evaluates them (discriminative)


Multilayer Perceptrons, or MLPs, are suitable for classification prediction problems where inputs are assigned a class or label. They are also suitable for regression prediction problems where a real-valued quantity is predicted given a set of inputs.

Convolutional Neural Networks, or CNNs, were designed to map image data to an output variable. The CNN input is traditionally two-dimensional, a field or matrix, but can also be changed to be one-dimensional, allowing it to develop an internal representation of a one-dimensional sequence.

Recurrent Neural Networks, or RNNs, were designed to work with sequence prediction problems. Recurrent neural networks were traditionally difficult to train. The Long Short-Term Memory, or LSTM, network is perhaps the most successful RNN because it overcomes the problems of training a recurrent network and in turn has been used on a wide range of applications. RNNs in general and LSTMs in particular have received the most success when working with sequences of words and paragraphs, generally called natural language processing. This includes both sequences of text and sequences of spoken language represented as a time series.


Batch normalization (BN) is a technique for improving the performance and stability of artificial neural networks. It is a technique to provide any layer in a neural network with inputs that are zero mean/unit variance. It is used to normalize the input layer by adjusting and scaling the activations.
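
The zero mean/unit variance transform described above is simple to write out. A minimal sketch (using fixed rather than learned gamma/beta, which in a real network would be trainable parameters):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean / unit variance per
    feature, then apply a scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
acts = rng.normal(5.0, 3.0, (64, 16))   # batch of 64 sixteen-unit activations
normed = batch_norm(acts)               # per-feature mean ~0, variance ~1
```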


The quantity and quality of the input data have a large impact on the performance of deep learning algorithms. Building a sufficiently large and relevant data set for a given application can take more time and work than building the network architecture itself.

Explored Use Cases

As part of our discussions, we focused a conversation on use cases for either existing or (potentially) new forms of audio synthesis that could benefit from approaches based on Deep Learning, resulting in the following list:

  • Musical Instruments
    • Acoustic & Representational
    • Abstract & Impressionistic
    • Singing voice synthesis
      • E.g. Yamaha Vocaloid
  • Environmental
    • Environmental scene creation
    • Sound effects, immersive
  • Recommenders
    • “Complementary Sound” - Create/find a sound that works well with an existing sound already in use.
    • “Scene Recommender” - Create/find a sound that works within a given scene, landscape or emotion.
    • “Semantic Synthesis” - Sound from high level descriptors (“bright”, “buzzy”, “fat”) 
    • “Sound classification” - Work backwards from a sound to a set of synthesis parameters
  • Style transfer (audio example, video example)
    • Cloning
    • Morphing
    • Transfer of specific musical and sonic features
        • ...how do we control which elements of musical style we’re working with? In the context of a synth patch, isn’t it all just style?
      • Articulation
      • Timbre
  • Resynthesis for source separation
    • Separate a mixture into components (tracks)
    • Use synthesis to reconstruct those original components
  • Personal Synth
    • Use personalization to adapt the user interface in response to the user’s preferences
  • Adaptive Synth
    • Synth that adjusts weightings based on actual usage data
  • Excluded, Deferred, Ignored
    • Speech Synthesis (Google has this covered)
    • Data sonification
    • Other uses of synthesis such as environmental masking, etc

Use Case Focus

To better explore the potential of Deep Synthesis, we zeroed in on three use cases that broadly covered different creative domains (Music, Mixed Reality) and made use of different Deep Learning approaches (Supervised and Unsupervised learning).

Use Case #1: Deep Synthesis for Mixed Reality

The problem

Synthesize “plausible” elementary sounds for virtual object interactions in VR or AR, for example:


  • cheers with two virtual beer mugs (more or less carefully)
  • contact of a virtual coffee cup being set on a real table (more or less “softly”)
  • rubbing of a virtual piece of chalk against a real or virtual blackboard
  • hitting a virtual or real drum with a virtual stick

Sample based approach

Advances in computer-generated imagery have brought vivid, realistic animations to life, but the sounds associated with what we see simulated on screen, such as two objects colliding, are often recordings.

Most sounds associated with animations rely on pre-recorded clips, which require vast manual effort to synchronize with the action on-screen. These clips are also restricted to noises that exist – they can’t predict anything new.

Audio for virtual worlds is often generated using simple sample-based techniques. These leave much to be desired in terms of sound realism, especially where the sound is closely linked with visual cues. There has been much research into modeling natural sounds, but this has not yet developed into a comprehensive methodology for producing modeled audio content in virtual worlds. Physics engines are now routinely used to interactively simulate the motion of rigid bodies, deformable bodies, flexible surfaces and liquids. This sophistication only highlights the relative inadequacy of conventional audio techniques.

Desired features, requirements

  • Ability to simulate environmental sound in interactive virtual worlds
  • Using the physical state of objects as control parameters
  • Integration with physics simulation engines
  • Wide range of behaviours can be simulated
  • Flexible and practical system, minimize runtime compute complexity
  • Maintain intuitive controls for sound design professionals.

Physical modeling approach

Wave-based Sound Synthesis for Computer Animation (Prof. Doug James, Stanford Univ., USA)  SIGGRAPH 2018

“There’s been a Holy Grail in computing of being able to simulate reality for humans. We can animate scenes and render them visually with physics and computer graphics, but, as for sounds, they are usually made up.” “Currently there exists no way to generate realistic synchronized sounds for complex animated content, such as splashing water or colliding objects, automatically. This fills that void.”

Informed by geometry and physical motion, the system figures out the vibrations of each object and how, like a loudspeaker, those vibrations excite sound waves. It computes the pressure waves cast off by rapidly moving and vibrating surfaces but does not replicate room acoustics. So, although it does not recreate the echoes in a grand cathedral, it can resolve detailed sounds from scenarios like a crashing cymbal, an upside-down bowl spinning to a stop, a glass filling up with water or a virtual character talking into a megaphone.

The core of our approach is a sharp-interface finite-difference time-domain (FDTD) wavesolver, with a series of supporting algorithms to handle rapidly deforming and vibrating embedded interfaces arising in physics-based animation sound. Once the solver rasterizes these interfaces, it must evaluate acceleration boundary conditions (BCs) that involve model and phenomena-specific computations. We introduce acoustic shaders as a mechanism to abstract away these complexities, and describe a variety of implementations for computer animation: near-rigid objects with ringing and acceleration noise, deformable (finite element) models such as thin shells, bubble-based water, and virtual characters. Since time-domain wave synthesis is expensive, we only simulate pressure waves in a small region about each sound source, then estimate a far-field pressure signal. To further improve scalability beyond multi-threading, we propose a fully time-parallel sound synthesis method that is demonstrated on commodity cloud computing resources.



General approach.  Supports a wide variety of physics-based simulation models and computer-animated phenomena.

Highly detailed results.  Perhaps the most significant improvements are in complex nonlinear phenomena, such as bubble-based water, where no prior methods can effectively resolve the complex acoustic emissions.


Compute complexity.  The most obvious limitation of our approach is that it is slow. Our CPU-based prototype allowed us to explore the numerical methods needed to support general animated phenomena, but the sound system “screams out” for GPU acceleration.

Although explicit physical models are controlled directly by physical parameters, they are often difficult to calibrate to a desired sound behaviour.

Phya, physically based environmental sound synthesis (Dr Dylan Menzies, Univ. of Southampton, UK)  2011

Phya is a C++ library and set of tools for efficiently generating natural sounding collision sound within a virtual world. Sound types include impacts, scraping, dragging, rolling, and associated resonance of objects, solid and deformable. Loose or particulate surface sound like gravel, grass, foil can also be generated. Phya is designed to integrate with physics engines.



  • Efficient dynamic body/surface/contact/impact/generator/resonator framework.
  • Scene management to minimize and manage cost.
  • Variety of surface models for fixed and loose surfaces
  • Resonator models, including non-linear distortions, rattling.
  • Modal analyzer, produces compact files to configure resonators.
  • Utility functions for interfacing with Physics Engines, geometry etc.
  • Look-ahead limiter class for output streams.

Lightweight C++ library and tools to facilitate the addition of modeled audio into virtual worlds, using a physics engine to provide macro-dynamic information about contacts and impacts. The project also includes an ongoing effort to develop audio models. The aim is to generate a practical, flexible and efficient system that can be adapted to a wide range of scenarios, while making consistent compromises. Once object audio properties and their links to physical objects are specified, the system can generate audio without further intervention.

The properties describing the sound objects can be extracted from real recordings using analysis tools, a process sometimes called physical sampling. For instance, a recording of an oil drum being hit can be analyzed, then used in a world where an oil drum is being rolled and hit. Instead of playing back that same sample again and again, we hear the variation in collision sound that matches its detailed motion. Another advantage is that the memory footprint of a physical sample is a small fraction of that of even one short audio sample. Physical samples can also be edited in interesting ways not possible with direct samples.

Components in a Phya application. Arrows point in the direction of function calls.

Main objects in Phya, with arrows pointing to referenced objects.

Sound spatialization

It is preferable to keep spatialization as separate as possible from sound generation. A large body of algorithms and software exists for spatializing, and the best approach depends on the context of the application. Output from Phya is available as a simple mono or stereo mix, or separately from each body so that external spatialization can be applied.


Physical principles guide the system design, combined with judgments about what is perceptually most relevant. This has previously been a successful approach in physical modeling of acoustic systems. A simple observation can lead to a feature that has a big impact.

A source can be given directionality by filtering the mono signal to produce a signal that varies with direction from the source. This technique is often used in computer games, and can be applied as part of the external spatialization process. Mono synthesis followed by external filtering can reproduce directional sound correctly only when the directionality at each frequency is fixed; for sources in general, the directionality at each frequency can vary over time.


  • Necessary to separate/categorize different types (“profiles”) of interaction and resonators.
  • Control interface not readily intuitive for sound designers.
  • Compute complexity still high.

A machine learning approach?

Existing work

Speech synthesis using neural networks

  • example: Google Wavenet

A suggested direction

Apply machine learning in the context of “traditional” synthesis techniques -- see below.

Use Case #2: Deep Francis aka the Virtual Sound Designer

This discussion imagined that we could train a neural network to replicate the facility of an experienced sound designer (Francis Preve) when recreating an existing sound via audio synthesis.  

  • Control of existing synthesizers vs creating a new form of synthesis (RC)
    • Existing synths will allow composers and sound designers to leverage pre-existing work. Tying into this would be the use of existing synth parameters (or plugin automation parameters) as the data sources.
    • New synthesis module can be integrated into existing “modular” or “plugin” systems, for expansion of sound palettes.  A learning synth might be able to appear as a low-level module, or as a higher level construct with its own internal modulators.
    • Adding “tags” to sounds will enable intelligent filtering of the sounds (ref: Akai VIP, Native Instruments Kontrol). Conversely, generated sounds could be tagged automatically by the system.
    • Intelligent mapping for plugin parameters or synth parameters to ML parameters would be useful. Generated sounds could have their parameters mapped automatically to real-time controllers (if they don’t disrupt the sound). Ref: same as above, Nektar Tech Panorama.

Possible Neural Network Architectures

We imagine several different neural network architectures, relying on different input sources and producing different outputs, to build our Virtual Sound Designer.

  • Stacked GAN (Audio equivalent of StackGAN)
    • Input: Label(s)
    • Output: Synthesis parameters
  • Convolutional Neural Network (CNN)
    • Example A
      • Input: Featurised rendered waveform
        • What features do we use? MFCCs and many more options.
      • Output: Synthesis parameters
    • Example B
      • Input: Featurised rendered waveform
      • Output: Category label
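
As a concrete (and deliberately rough) illustration of featurising a rendered waveform, the NumPy-only sketch below computes MFCC-like features: framed power spectrum, triangular mel filterbank, log, then a DCT-II. A real pipeline would more likely use an audio analysis library, and the frame sizes here are arbitrary choices of ours.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_coeffs=13):
    """Rough MFCC featurisation: framed power spectrum -> triangular mel
    filterbank -> log -> DCT-II, keeping the first n_coeffs coefficients."""
    window = np.hanning(n_fft)
    frames = np.array([signal[i:i + n_fft] * window
                       for i in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):
            fbank[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - mid, 1)

    log_mel = np.log(power @ fbank.T + 1e-10)

    # DCT-II basis to decorrelate the log-mel energies.
    n = np.arange(n_mels)
    basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) *
                   np.arange(n_coeffs)[:, None])
    return log_mel @ basis.T

sr = 16000
t = np.arange(sr) / sr                      # one second of a 440 Hz sine
feats = mfcc(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

The resulting matrix (one 13-dimensional vector per frame) is the kind of input the CNN examples above would consume.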

We discovered that the SAFE (Semantic Audio Feature Extraction) project can provide considerable insight into several of these concepts, as it is already extracting descriptor language from audio features. We also discussed the need for the articulation data that generated the audio examples, so that we can extract temporal parameters as features.

Data Needs

What data do we need to gather, to seed the database? Here are some proposed requirements:

  • Audio samples can be useful for analysis of sound characteristics
    • Potentially rendered at different velocity levels of varying durations
    • MIDI event data may be needed to determine “note off” for the purposes of release envelope calculation etc
  • Metadata such as “tags” (e.g. Bass, Slow, Plucked, Ambient, music genres)
  • Synth parameters used by the source engine to generate the sound
    • Sysex dump
    • Preset saved file
    • VST “FXB” file for automation parameters

A hypothetical approach

  • Constrain the problem by choosing either a prototypical synth (e.g. Minimoog / “2 osc into a filter into an envelope”) or a prototypical sound (e.g. “Synth bass”)
  • Choose a particular synth architecture (collection of parameters and wiring). 
  • The ML dataset will be seeded with examples of these sounds (upload your patch set and any meta data).
  • Generate audio data with a standardized MIDI file, for consistency of scanning.  The MIDI data might include different note numbers, velocities, parameter settings, or other articulation data. This is necessary for synth presets which vary widely depending on their inputs.
  • Graded sounds (crowd sourced or the choices of the sound designer/composer) will guide the learning system to the “best” choices being offered first in future searches.
  • A subset of these parameters will be varied to achieve the desired sound.
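
The "standardized MIDI file" step above can be sketched as a probe grid (the specific note numbers, velocities, and durations below are our own placeholder values, not from the report): every preset gets rendered with the same events so the resulting audio is directly comparable across the dataset.

```python
import itertools

# Illustrative probe grid; values are placeholders chosen for this sketch.
NOTES = [36, 48, 60, 72, 84]        # MIDI note numbers spanning the keyboard
VELOCITIES = [32, 64, 96, 127]      # soft through maximum
DURATIONS = [0.1, 0.5, 2.0]         # seconds held before note-off

def probe_events():
    """Yield (note, velocity, duration) tuples covering the full grid."""
    yield from itertools.product(NOTES, VELOCITIES, DURATIONS)

events = list(probe_events())       # 5 * 4 * 3 = 60 renders per preset
```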

Use Case #3: An Unsupervised Synthesis Approach

    •   NSynth / Goldsmiths demo
      • An unsupervised approach to generating sound directly (sample by sample) from a machine learning algorithm. Using variations on the autoencoder network. Broadly - feed the system with examples of audio, and the autoencoder builds a compressed representation of that audio. This representation is called the “latent space” and consists of hundreds of dimensions. These are unlikely to correspond to meaningful musical or sonic concepts that a human could use to steer the synth.
      • Instead, the synth is controlled by interpolating between points in the latent space. In the case of NSynth, this is mapped to a square control surface with familiar sound examples at each corner.
      • In theory an interpolation in latent space is more meaningful than, for example, a linear cross-fade between samples. The synth has learned something about the nature of sound.
      • A visual analogy - the difference between cross-fading images of a full and empty glass, versus generating a sequence of images where the glass is drained in a realistic manner.
      • NSynth’s quirks
        • Low bit depth and sample rate lead to unsatisfying results
        • Not immediately obvious how it differs from the linear cross-fade / vector synthesis approach. Probably due to choice of training data and demo examples
      • Goldsmiths quirks
        • Very broad training data leads to a pretty rough ride through latent space
        • Wouldn’t it be amazing to train on a more focused dataset, e.g. a narrower genre of music or an artist’s catalogue?
  • Next steps
    • Can we clarify what style transfer would mean in the context of a synth sound? It’s possible to have intuition about style transfer for a piece of music or environment, but does it make sense for a single sound/texture/timbre?
    • Explore ways to make unsupervised models more controllable and interpretable. Who is doing primary research in this area, perhaps outside the field of audio? Semi-supervised approaches for more control / insight into what these algorithms are doing, and how to get more predictable results.
    • Explore GAN approaches that can handle discrete data (e.g. synth parameters like oscillator type) versus typical vision processing GANs which only really work with continuous data (pixels)
    • Get MORE DATA!
      • Speak to the QMUL / FAST Impact team about extending the semantic audio database to include synth-friendly features and descriptors, e.g. temporal features
      • Is there a way to share / re-use the Semantic Audio team’s work to gather more data about synthetic and sampled sound?
      • Who else has a stake in access to good semantic audio data with a synth-friendly twist? Can we collaborate?
      • Are there privacy-friendly ways to use customer data for building this kind of training set? (see Google’s federated learning approach, which keeps customer data private while still contributing insight to a centralized machine learning system)   
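
The square control surface described above for NSynth can be sketched as bilinear interpolation between four corner points in latent space (the dimensionality and corner vectors below are placeholders, not NSynth's actual values):

```python
import numpy as np

def surface_interpolate(corners, x, y):
    """Bilinear interpolation over a square control surface: (x, y) in
    [0, 1] x [0, 1] blends the four corner latent vectors."""
    top = (1.0 - x) * corners['tl'] + x * corners['tr']
    bottom = (1.0 - x) * corners['bl'] + x * corners['br']
    return (1.0 - y) * top + y * bottom

rng = np.random.default_rng(0)
dim = 16                                 # placeholder latent dimensionality
corners = {k: rng.normal(size=dim) for k in ('tl', 'tr', 'bl', 'br')}

at_tl = surface_interpolate(corners, 0.0, 0.0)   # exactly the 'tl' corner
center = surface_interpolate(corners, 0.5, 0.5)  # equal blend of all four
```

The interpolated point would then be decoded back to audio by the trained network; the claim in the text is that this decode step makes the blend perceptually richer than a linear cross-fade of the corner waveforms.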


Object collision sound modeling

Music composition, synthesis

Music audio classification


Semantic audio

Dylan Menzies, “Physically motivated environmental sound synthesis for virtual worlds.” EURASIP Journal on Audio, Speech, and Music Processing, 2010.  pdf

Dylan Menzies, “Phya and vfoley, physically motivated audio for virtual environments.” Audio Engineering Society Conference: 35th International Conference: Audio for Games. Audio Engineering Society, 2009.   pdf

Jui-Hsien Wang, Ante Qu, Timothy R. Langlois, Doug L. James, “Wave-based Sound Synthesis for Computer Animation.” ACM SIGGRAPH, 2018.  pdf

Leon Fedden, “Using AI to Defeat Software Synthesisers.” 2017
