The Sixth Annual Interactive Music Conference

Group Report: Towards Interactive XMF

Participants: A.K.A. "The Neighbors of the Beast"

  • Chris Grigg; Beatnik
  • Larry the O; LucasArts
  • Rob Rampley; Line 6
  • George A. Sanger; The Fat Man
  • Steve Horowitz; Nickelodeon Online
  • Bob Starr; Beatnik


Editor's Note: During preparation of this group's report, Chris Grigg began writing a separate narrative combining a description of the group experience with various insights and historical references to other interactive audio systems and design theories. The members of the group have decided to submit Chris' narrative (as shaped by their feedback) in lieu of a more conventional report. Since the piece was originally intended to be published elsewhere, it is © 2001 by Chris Grigg, all rights reserved, and published by Project BBQ with the author's permission.


I had a dream.

No, not that kind of dream - I started this Rogue Group because of an actual dream. On the second night of BBQ 2001, after a day full of talk of XMFs and interactive audio and Integrators and so forth, I dreamed a System diagram. When I woke up I knew what I had to do. I knew I'd feel I hadn't made the most of my BBQ 2001 if I didn't at least try to take advantage of the rare and precious opportunity to gather together some of the world-class loonies, er, I mean, interactive audio visionaries in attendance, and pound out at least a basic design pointing the way to an XMF-based, nonproprietary interactive audio framework in a venerable vein: the 1999 Big Picture group, the 2000 Q group, LucasArts iMuse™, and George's Integrator, tossing in some of the best bits of my own designs going back to pre-PC computer games.

In recent years most of the attention in the game sound arena has been focused on 3D audio, the latest sound cards, 5.1 surround, and other rendering features. While the systems we have in mind would certainly use that stuff, technically it doesn't have very much to do with them. In plain terms, the goal of these systems is to make sure that, in the audio artists' view, the right sound plays at the right time. I realize that sounds so basic it's silly, but the harsh truth is that in the ordinary way of doing interactive audio, where the sound artists depend on programmers outside (or occasionally inside) the audio department to make sure that the audio files are used in the intended way, the audio artist's vision (or whatever the aural equivalent of 'vision' would be) is rarely achieved. It just doesn't happen because it's too labor- and communication-intensive, and sound work has always had a funny way of getting pushed down the priority list over the course of a project.

In my view, this is a - no, make that THE - fundamental problem of our field and has never (yet) been adequately solved. For many years I have devoted a small piece of my brain and time to contributing to a nonproprietary solution. So at Saturday breakfast, at the urging of several BBQ brothers and both BBQ sisters, I stood up and announced my schism. Nobody threw food (or worse), the right people's eyes lit up, and a 10:00 rendezvous was set for the overlook.

The day turned out even better than I'd hoped. By the time we were done we not only had a PowerPoint presentation, we also had GUI tool mockups, a timeline to reduce our design to implementable data structures, commitments for an IA-SIG working group to standardize the work, and a promise of a review by the LucasArts team. Now it looks like there could be a session at the Game Developers Conference, and that this work will help drive George's book/tool project. Not too bad for six hours' work. We have now moved considerably farther "Toward Interactive XMF" than expected.

Why so much success? Because, it seems, the time for this particular form of this particular idea has finally come. Game development has matured to the point where more developers realize that platform-specific interactive audio tools are not as cost-effective as cross-platform tools. Some developers may feel that any competitive advantage from proprietary interactive audio systems may be in danger of being overshadowed by their high cost and functionality limitations (amortization base too small to justify additional development). Also, the importance of human factors such as production process and the working interfaces between the audio and programming departments is now more widely recognized than before. It remains to be seen whether there is a business to be had developing this system (George thinks there is, I'm not so sure), but to developers even an open/shared development model looks more promising than having every developer continue to go it alone, or depend on single-platform tools.

Another reason for the success is the maturity of the design we have in mind. Forget for a moment the XMF part. Each member of the group brought a unique (not to mention esteemed) body of experience to the overlook, both in terms of breadth of game and audio styles and breadth of interactive audio systems used (or designed), yet independently we had each come to essentially the same view of what the design concepts for a cue-oriented interactive audio system should be. There's ample wiggle room when it comes down to details and implementation, but the value of our group's work product is in the expression of those core concepts, with a practical standards-based path to implementation.


As I outlined in my talk Friday morning, previous BBQ groups have done work articulating parts of this vision. It is perhaps best appreciated by contrast with the usual working method for interactive audio, which unfortunately in most cases can be stated simply as "I talk to the programmer, we decide what sounds are needed, I make a file for each sound, I give him the files, he puts them in the game but often doesn't use them the right way." This, as the kids like to say, sucks.

We'd rather work according to these principles:

  • Defend against misuse of delivered audio
  • Use MIDI & audio together
  • Simplify tool development
  • Standard formats lead to better tools
  • Avoid least-common-denominator functionality limits

The following tactics have been identified to help put the audio artist in control:

  • Use data-driven runtime audio systems, not code-driven
  • Abstract the interface between coding and audio teams
  • Have the game request audio services by symbol, not filename
    (Which relates to Brian Schmidt's immortal comment "Anyone who still thinks there's a 1:1 relationship between a sound and a WAV file just doesn't get it.")
  • Put audio guy in control of the data (i.e. by providing an editor)

Going further, some niceties we might like to see in an ideal world might include:

  • Simplify & formalize content management & hand-off procedures
  • Attach audio artist notes to individual sounds
  • Generate reports & databases easily from sound, & vice-versa
  • Facilitate collaboration among multiple audio artists

Of course this is all high-level goal stuff; there are additional objectives like controlling sound parameters while a sound is playing, making it easy to do commonly needed things like randomization and spatial placement, synchronizing new pieces of music to already-playing pieces of music, providing callbacks so events can be triggered when a sound gets to a certain point, etc. These will all pop up below, once we get into implementation, but they also point up a couple issues.


It's likely we could identify a basic set of core functionality for this system that would satisfy 80% of its users 80% of the time. If such a baseline system were built, it would be a great boon to the industry, and a great gift to the audio artist community. But there's also a whole world of further improvements that some few folks would like to have; excluding these would keep some amazing soundtracks from ever happening, which would (as the kids say) suck. So do we wait to figure out all the cool stuff before we start building?

George has a flower. Or rather, an analogy involving a flower, which inspired our group logo and goes something like this: All plants need roots, a stem and some leaves. You can draw a dotted line just below the flower, and cut, and you'll still have a perfectly viable plant. The analogy is: If you want to get anywhere with this thing, figure out what's basic vs. what's attractive but nonessential, and go with basics because you'll never ever get to the bottom of what would be cool to add.

FLOWER V. 2.0:

We thought this over and said: "Naaaaaah." We're sensitive guys and we like flowers, or in other words the purpose of the system is to empower audio artists to make the coolest things possible. But George is really right about not waiting to figure out all the possibilities up front, so, taking an inspiration from the X in XMF, we decided to make our system eXtensible. In other words, we added as a new working design principle the idea that, in any area of functionality where someone might conceivably want to do more than the basics, we should provide a mechanism for extending the standard functionality.


XMF, it turns out, helps achieve all that. It's designed for bundling media files together, and allows content to be organized as files & folders like a computer file system. It also lets you attach metadata to files and folders in a simple and consistent way, so you can attach name tags or composer comments, so simple programs can generate reports based on the contents. Or slightly more elaborate programs could create XMF files based on entries made in a database. And it's extensible, so we can easily add new media formats, data formats, and metadata fields as needed.

Because XMF already handles so much of the organization and housekeeping, it helps us focus our design concepts into a concrete design, and allows us to focus just on the parts of the problem that are left. But it's important to understand that the current XMF standards don't include anything like the high-level interactivity functionality we want to see. XMF technology per se is just about providing the vessel that the data rides in. All of that new functionality needs to be invented and represented in data structures before it can be carried inside XMF.

So one useful way to look at the work of our Rogue Group is an attempt to answer the question Larry the O posed: "What do XMF and BBQ-style interactive audio mean for each other?" Of course, that begs another question: "What exactly does 'BBQ-style interactive audio' mean?"


I believe all of us in the rogue group have the same basic thing in mind, but as with the story of the blind gents and the elephant, each of us conceives it from a different standpoint. To George, for years it's looked like a tool the audio artist can use to unambiguously demonstrate the audio media and the way it should be used. Lately it's begun to look like a standard for a data format that would be carried in an XMF file together with the audio media; the audio artist would create that chunk of data with a tool as per George's vision, and then the game would use that data. As one who's worked on both sides of the artist/programmer fence, my view is that we can't reduce the tool or the data format to practice until we make some design decisions about the program that's going to read that data at runtime. Like a razor and its blades, you need both parts to get the job done, and each part has to be designed from the start to work with the other.


This is where the dream comes in. Dreams have a funny way of crystallizing things we already know into a new, usually symbolic, form. In this case the symbols were literally symbols: blocks in a diagram. (Hey, if you think that's weird, I once had a girlfriend whose dad had this recurring dream of huge colored numbers raining out of the sky, the air thick with them like a downpour.) The dream showed me how all the parts would fit together, in such a way that the media and interactivity rules could be carried in an XMF file, with complete portability (meaning that the same XMF file could be used on practically any platform, and would behave the same everywhere).

I think this is the model for the thing we need to build. If we can accept this despite our differing points of view of the elephant, we'll have a basis for moving forward with designing everything else we need for real-world implementations. The rest of this document is just filling in the details of this diagram.


If you (or your publisher) develop for more than one platform, sooner or later you start to care about platform independence. If you intersect with lots of platforms, you start caring a lot about platform independence, real fast.

The key to the system's platform independence is the combination of a platform-independent Soundtrack Manager (SM) with a platform-specific Adapter Layer (AL). The SM would have to be written by somebody, and that's where the bulk of the work will be. It would be written in transportable C or C++ (maybe ported to Java at some point), and talk to each platform's AL in exactly the same way, using a few simple commands like Load File, Allocate Memory, Play File, Set File Volume, Locate File To, and so on. For each platform, an implementation of the AL would turn those commands into whatever native API calls are required. It helps portability that only very basic native MIDI and audio file playback abilities are needed (play from/to, pause/resume, set volume, set pan/spatial location, set callback), which is possible because all the advanced features are handled at the SM level.

Because it's the platform-independent SM and not the platform-specific playback APIs that reads the Interactivity Rules, the same set of content (media files and interactivity rules) responds to a given set of stimuli in exactly the same way on all platforms where the system runs.
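To make the SM/AL split a little more concrete, here is a minimal C++ sketch. Nothing in it is a settled design: the class names, the method names (`loadFile`, `playFile`, `setFileVolume`), and the integer file handle are all assumptions made for illustration, and the `LoggingAdapter` merely records commands where a real Adapter Layer would make native API calls.

```cpp
#include <string>
#include <vector>

// Hypothetical platform-specific layer: one implementation per platform.
class AdapterLayer {
public:
    virtual ~AdapterLayer() = default;
    virtual int  loadFile(const std::string& path) = 0;  // returns a file handle
    virtual void playFile(int handle) = 0;
    virtual void setFileVolume(int handle, float volume) = 0;
};

// Stand-in adapter that only records the commands it receives.
// A real adapter would translate each one into native API calls.
class LoggingAdapter : public AdapterLayer {
public:
    std::vector<std::string> log;

    int loadFile(const std::string& path) override {
        log.push_back("load " + path);
        return static_cast<int>(log.size());
    }
    void playFile(int handle) override {
        log.push_back("play " + std::to_string(handle));
    }
    void setFileVolume(int handle, float /*volume*/) override {
        log.push_back("volume " + std::to_string(handle));
    }
};

// Platform-independent side: it only ever sees the abstract interface,
// so the same logic runs unchanged on every platform.
class SoundtrackManager {
public:
    explicit SoundtrackManager(AdapterLayer& al) : al_(al) {}

    void playAtVolume(const std::string& path, float volume) {
        int handle = al_.loadFile(path);
        al_.setFileVolume(handle, volume);
        al_.playFile(handle);
    }

private:
    AdapterLayer& al_;
};
```

Porting to a new platform would then mean writing one more `AdapterLayer` subclass and leaving the Soundtrack Manager alone.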

Now that we have platform independence nailed, we can start considering the creative features and behaviors the audio artist and the game programmer will be working with.


Whether you're a programmer or a sound artist, if you're going to be a user of this system the first thing to do is to get your head around the idea of a Cue, which has a couple parts. Basically a Cue is an event that the game (or application or performer or whatever) signals to the Soundtrack Manager, and to which the Soundtrack Manager responds with a corresponding, predefined action designed by the sound artist.

We call it a Cue because it resonates with preexisting terminology in two related fields:

    • In film, each discrete unit of scored music is referred to as a Cue
    • In theater, stagecraft events driven by the action, and 'called' (to the stage crew, via headset intercom) by the stage manager during the performance, are referred to as Cues

So the term Cue combines both sides of Stimulus and Response: the game signals a Cue to the Soundtrack Manager (that'd be the stimulus part), and the Soundtrack Manager responds by doing the right thing.

This, by the way, is not a completely new idea in game sound. Early 8-bit computer and console games typically wrapped up all the music and sound effects for a given level into 'modules' that started with tables of 16-bit offsets to the start of each tune or sound effect, allowing the game programmer to request each one by number. The sound artist and the programmer had previously agreed on the number for each needed sound, so those numbers served as Cue signals. ("Play cue number 15.")

A more modern view of cue IDs would be to use ASCII text strings (maybe qualified with namespaces) because they're descriptive, readable by both humans and machines, and easily attached to objects (e.g. stored in a game's geometry database). So from here on out we'll assume every cue is identified by an ASCII nametag. To request a Cue from the Soundtrack Manager, the game would call a function, supplying the nametag of the desired Cue. ("Play the cue named 'FootstepGrassSoftLeft'.")
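In code, requesting a Cue by nametag might look like the sketch below. The names `defineCue` and `playCue` are hypothetical, and the response is reduced to a plain function object; in the real system the response would come from data authored by the sound artist, not from game code.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

class SoundtrackManager {
public:
    using Response = std::function<void()>;

    // Associates a nametag with the response the sound artist designed.
    void defineCue(const std::string& tag, Response response) {
        cues_[tag] = std::move(response);
    }

    // Called by the game. Returns false if no Cue has that nametag.
    bool playCue(const std::string& tag) const {
        auto it = cues_.find(tag);
        if (it == cues_.end()) return false;
        it->second();
        return true;
    }

private:
    std::map<std::string, Response> cues_;
};
```

The game only ever says `playCue("FootstepGrassSoftLeft")`; which file or files that turns into is not its business.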


This is a fundamentally different way for the game to address the audio system, and it changes the way in which the sound department and the game designers interact during game design and production. Before, the two sides got together and agreed on what sounds were needed, how the sound files should be prepared and used, and then the sound artists created and delivered the files - but it was up to the programmers to do the (in many cases manual C++ coding) work of seeing that the sound files actually got used in the agreed upon ways.

With a cue-based Soundtrack Manager system, the game designers and sound artists get together and decide on what sounds are needed, what Cue signals are needed (remember, a single Cue tag may be able to access any number of sound files), and agree on a nametag for each of those signals; and then the sound artists create the sound files and the cue responses, and deliver them all in an XMF bundle. The programmers just have to plug the XMF files in for the right sounds to happen at the right time, without doing any additional coding.

Because the sound artists in this scenario have interactive tools for testing out how the sounds and interactivity rules respond to the cue signals, they have better confidence that the game will sound right than they did when the programmers had to interpret every individual artistic intention into concrete code. And the engineering burden from each new delivery shifts from heavy-duty programming to just verifying that the sound files and cue responses have been authored in a way that won't overburden the game engine memory-wise or CPU-wise (resource hogging).

We foresee way less antacid consumption all around.


So. The next questions are: What exactly are these Interactivity Rules (IR), how do they relate to media files like AIFF and MIDI files, how are they represented in data, and how do the IRs and media get bundled into XMF files? The answer is: Cue Sheets.

A Cue Sheet is a collection of Cue data chunks, where each Cue data chunk contains both the media files and the interactivity rules for a given Cue signal that the Soundtrack Manager can receive from the game. Here again we borrow from preexisting terminology. In film sound post-production, a Cue Sheet is a timeline showing every piece of sound that will be mixed together into the soundtrack, laid out onto a channels vs. time grid.

So a Cue Sheet is the thing that tells you what media is there, and when each piece gets played. Unlike film where the 'when' is unambiguously tied to elapsed time within the film reel (at time 12:03:08, hear "bonk"), in our interactive case the 'when' is usually determined by the gameplay events that happen at runtime (whenever player head hits tree, hear "bonk"). But you see the basic similarity. For any given scene or subsection, the Cue Sheet collects all the info on all the Cues in the soundtrack.

Once we invent a few data structures, a Cue Sheet can be represented with an XMF file.

In a Cue Editor that a sound artist might use, a Cue Sheet might look something like this:


Each row in a Cue Sheet represents one Cue. That is, the media files and interactivity rules that get used when the game sends the corresponding Cue signal to the Soundtrack Manager. Each Cue has several elements (or, if you're more of an object-oriented-programming kind of person, 'data members').

    • The signal and the Cue are linked by name tag, so each Cue needs a Name field. This is just ASCII text.
    • To make life easier for the sound artists, each Cue should include a Comments field (not shown in the illustration), which could be used for anything the sound artist wants. A simple program could skim this info into a database or report generator. If there's a need to save space in the final product, this field can be automatically stripped out of the delivered XMF file.
    • We've said a Cue states the media files it uses, so each Cue needs a pool (AKA a list) of media files. This could be done in many ways, but since we're using XMF, this should be stored as a list of references to media files stored in XMF files (see XMF spec for details).
    • Experience shows that mixing a game is usually hard to do, so in our vision we include a Fader (AKA volume control - actually, it's more like a Trim than a Fader, but 'fader' sounds cooler) for every Cue, to make it easier for the sound artist to balance Cues against each other. Don't worry about extra DSP overhead - this just becomes a coefficient in volume scalar calculations for the media files in this Cue, so the Soundtrack Manager only sends one final setVolume command to the native playback APIs. (For the same reason, we put a fader on each Media File in the pool too.)
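The "no extra DSP" point about faders comes down to one multiplication. Here is a small sketch, assuming every fader is a linear gain between 0.0 and 1.0 (the text does not say whether faders are linear or in dB); `Faders` and `finalVolume` are illustrative names only.

```cpp
// Assumption: all faders are linear gains in the range 0.0 to 1.0.
struct Faders {
    float cue;   // per-Cue trim set by the sound artist
    float file;  // per-Media-File trim within the Cue's pool
};

// Collapses every trim into the single value that would be handed to the
// native playback API's set-volume call.
float finalVolume(const Faders& f, float requestedVolume) {
    return f.cue * f.file * requestedVolume;
}
```

With both trims at 0.5 and a requested volume of 1.0, the one value sent to the native API is 0.25.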

We've also been doing a lot of hand waving about 'Interactivity Rules', which at this stage we need to get more specific about:

    • What actually happens when the Cue signal is received? If there are multiple media files for a cue, which one(s) get played? Can the Cue do different things depending on what else is happening at the time? We combine all such logical, decision-oriented stuff into an Action, which is so much fun - and so involved - that we won't talk about it just yet; see below. (You'll have noticed that what we call an 'Action' has a lot of similarities to what in other systems are called 'Scripts'.)
    • Another important part of the interactive response is the manner in which new sound elements (media files) are introduced into the soundtrack. For example, if you just start one music file while another one is already playing, you'll get cacophony because both will be playing at the same time, and the rhythms won't align. (Usually I mean 'cacophony' in a good way, but in this case I mean it in a bad way.) It would be much better to synchronize the new music to the old music, and switch over at a musically graceful point which, depending on the style, could be a phrase, bar, or beat. In many situations the game design will call for a gradual change of scene (this can be true for either music or sound effects), in which case a crossfade of given duration would be better than a hard cut. We can generalize these different modes of introduction for new elements as Transitions, and Transitions should be part of a good Cue.
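For the synchronized kind of Transition, the core arithmetic is working out how long to wait before switching. The sketch below assumes a constant tempo and a fixed number of beats per unit (1 for a beat, 4 for a bar of 4/4, and so on); real music with tempo changes would need a tempo map instead. The function name is hypothetical.

```cpp
#include <cmath>

// Seconds to wait before the next musical boundary, given the current
// playback position. Assumes constant tempo and a fixed-length unit.
double secondsToNextBoundary(double positionSec, double bpm, int beatsPerUnit) {
    const double unitSec = beatsPerUnit * 60.0 / bpm;   // length of one unit
    const double into = std::fmod(positionSec, unitSec); // how far into the unit we are
    return into == 0.0 ? 0.0 : unitSec - into;           // already on a boundary: go now
}
```

At 120 bpm in 4/4 a bar lasts 2 seconds, so a Cue arriving 3 seconds into the old music would wait 1 second and come in on the downbeat of the next bar.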

Remember George's Flower? I think the only two areas of the Cue where we need to provide eXtensibility mechanisms are the Action and the Transition. The other parts (Name tag, Comment, Media Pool, and Fader) seem to be dead simple, and don't need extension.

So a Cue includes:

    • Name tag
    • Media Pool
    • Action (with extensibility mechanism)
    • Transition (with extensibility mechanism)
    • Fader
    • Comment

These are the items, collected into Cue Sheets, which need to be represented in a data structure in order to be carried in the XMF file. This list is complicated enough it might take more than one screen to edit a Cue in a real-world Cue Editor. For example, the Media Pool editor might look something like this (fader column not shown in illustration):

The Action and Transition are more complicated than the Media Pool, and will probably require separate editor screens.
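As a rough starting point for the data structures, the Cue elements listed above could be laid out like this in memory. Every field name and type here is an assumption; in particular the Action and Transition are shown as plain strings only as placeholders, since their real representation still has to be designed, and the actual on-disk layout inside XMF is a separate question.

```cpp
#include <string>
#include <vector>

// One entry in a Cue's Media Pool: a reference to a file carried in XMF,
// plus its own trim. (Field names are illustrative.)
struct MediaRef {
    std::string xmfPath;
    float fader = 1.0f;
};

struct Cue {
    std::string name;                 // ASCII nametag the game signals
    std::string comment;              // free-form sound artist notes
    std::vector<MediaRef> mediaPool;
    std::string action;               // placeholder for the Action
    std::string transition;           // placeholder for the Transition
    float fader = 1.0f;
};

using CueSheet = std::vector<Cue>;

// Clears the Comments fields before final delivery, as suggested for
// saving space in the shipped XMF file.
void stripComments(CueSheet& sheet) {
    for (Cue& cue : sheet) cue.comment.clear();
}
```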


We said there was a lot more to Actions; here are the grisly details.

First off, let's go contemplate that flower again. For many Cues, in fact, probably most Cues, an Action needs to do no more than choose and play the right file from the Media Pool. There are only a handful of common basic rules for choosing which file to play. We can call these Playback Modes, and George offers up the following four (colorfully named!) ones:

  • Gunshot - Each time the cue is called, a different media file from the pool is played once, all the way through, then stops.
  • Sampler - Each time the cue is called, a different media file from the pool is played all the way through and looped, like a sampler imitating a flute.
  • Clatter - Plays each one of the files in the pool once through, in order. When all files have been played once, audio stops. This would be handy to create a random-sounding "everything falls out of Fibber McGee's Closet" sound.
  • Jukebox - Each time the cue is called, a different file plays through once and stops. After all files in the pool have been played once, the pool is reshuffled (if "random" is selected), and the pool is played again, ad infinitum.
  • And I would add: Singleton - When there's only one file in the Media Pool, there's no decision to make - you just play that file.

There may be a few more good basic Playback Modes as well; we'll let the IA-SIG working group ferret that out. These basic modes are the stem of the flower, and they should be made totally simple to access and use.
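To show how little machinery a basic Playback Mode needs, here is a sketch of the file-choosing part of Jukebox with "random" selected: every file in the pool gets played once before any repeats, then the pool is reshuffled. The class name and interface are hypothetical, the pool is assumed to be non-empty, and the actual playback is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Chooses which Media Pool index to play on each call of the Cue.
// Precondition: poolSize > 0.
class JukeboxPicker {
public:
    JukeboxPicker(std::size_t poolSize, unsigned seed)
        : order_(poolSize), next_(0), rng_(seed) {
        std::iota(order_.begin(), order_.end(), std::size_t{0});
        std::shuffle(order_.begin(), order_.end(), rng_);
    }

    std::size_t pick() {
        if (next_ == order_.size()) {
            // Every file has been played once: reshuffle and start over.
            std::shuffle(order_.begin(), order_.end(), rng_);
            next_ = 0;
        }
        return order_[next_++];
    }

private:
    std::vector<std::size_t> order_;  // current play order of pool indices
    std::size_t next_;                // position within order_
    std::mt19937 rng_;
};
```

One quirk of this simple version: the last file of one pass can turn up again as the first file of the next, which a real implementation might want to prevent.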

But there will also be cases where a crafty (or wiseacre) sound artist wants more control than that. That's the blossom on the flower, and it kind of calls for a simple scripting language. I think we can accept that any sound artist using such a language would have to be a bit of a propellerhead.

How do we reconcile stem and blossom? In this case, we cheat. We can make the Action system script-driven under the hood, and put two faces on it in the GUI editor: Basic level, where you just pick one of the preset Playback Modes (which under the hood is just a script, but a built-in, standard one), and Advanced level, where you get access to a full script editor and can get your hands dirty (virtually speaking).

And now, a brief side-trip. Before we talk about the innards of the scripting system, we need to introduce the two-part system that lets a Cue do 2-way communication with the game (or application, or whatever's talking to the Soundtrack Manager) while the Cue is running, to wit: Mailboxes and Callbacks.


There are sounds, and then there are sounds. Some only need do one thing, but others are called upon to change depending on what's happening in the game at the time. The classic example is the vehicle engine, where the RPM, load, and other simulation parameters are constantly careening around unpredictably in response to game action. A less chaotic (but still classic) example might be a piece of music where every 8 measures the next phrase to be played is selected, based loosely on how much activity is going on in the gameplay. In both cases, information passes from the game to the soundtrack, influencing playback while it's still in progress. Not allowing this kind of thing would be just plain lame. This means our Soundtrack Manager needs to provide some sort of conduit between the game and the Cues that are running at any given time.

The Rogue Group discussed a few ways of handling this kind of communication, and agreed that the mailbox metaphor that quite a few game sound systems have used is hard to beat for simplicity and generality. Imagine a post office, with a row of numbered mailboxes. Incoming mail is addressed to a particular box according to box number. To receive mail a customer just has to look in the right numbered box. In our sound system, the Soundtrack Manager maintains a numbered array of mailboxes, each holding one (probably numeric) value, and the game can set the contents of any mailbox to any value at any time.

This simple mechanism can be used for several kinds of communication. A mailbox can be used like a knob, so that whenever the game changes the value, some sound parameter is changed proportionally. Or a mailbox can be used as a signal that a condition or event has happened - for example a cue may watch to see whether mailbox 33 contains 0 or 1, because the game programmer and the sound artist agreed that would signal whether the hero's radioactive eye sockets are visible (because you have a musical line that goes extra-cool with the animation effect).


Although many uses of mailboxes are closely tied to individual running cues (like a dynamic volume control), some uses involve communication to all cues at once (like a Dim or Mute function), or communication from one running cue to another. That's why I'd like to see a new refinement of the mailbox system: one set of 'Global' mailboxes, plus a separate set of mailboxes for each running cue. So instead of setting mailboxes solely by mailbox number, we would do it by cue and number, or else 'global' and number.
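A sketch of the global-plus-per-cue mailbox idea follows. The class and method names are made up for illustration, values are assumed to be numeric, and an unset mailbox is assumed to read as 0; none of that is decided.

```cpp
#include <map>
#include <string>
#include <utility>

class Mailboxes {
public:
    void setGlobal(int box, double value) { global_[box] = value; }

    void setForCue(const std::string& cue, int box, double value) {
        perCue_[std::make_pair(cue, box)] = value;
    }

    // Assumption: a mailbox nobody has written to reads as 0.
    double global(int box) const {
        auto it = global_.find(box);
        return it == global_.end() ? 0.0 : it->second;
    }

    double forCue(const std::string& cue, int box) const {
        auto it = perCue_.find(std::make_pair(cue, box));
        return it == perCue_.end() ? 0.0 : it->second;
    }

private:
    std::map<int, double> global_;                          // 'global' + number
    std::map<std::pair<std::string, int>, double> perCue_;  // cue + number
};
```

So the game might set box 3 of a running "EngineLoop" cue to the current RPM without disturbing box 3 of any other cue, while a Dim or Mute value goes into a global box that every cue can read.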


One of the jobs we leave to the IA-SIG Working Group is to determine what sound parameters should by default be controlled by mailboxes, and what mailbox number to use for each. Volume, stereo pan (for stereo sounds), and spatial location (for spatialized sounds) are likely candidates, as would be effects sends. There should also be an eXtensible way to drive other, new, and different parameters from mailboxes, including platform-specific rendering features. These sound parameters would for the most part be attached to the per-cue mailbox set, not the global mailboxes, although it might be useful to control a few global sound parameters in a similar way from the global mailboxes (global volume, for instance). In both cases, a goodly number of the mailboxes should be left unassigned to avoid cramping anyone's style.


Mailboxes can also be used to pass information from the soundtrack back to the game, for the same kinds of reasons, and can support the same types of communication (continuous control, events). The sound artist should be provided with Mailbox-setting commands via Markers, which can be embedded in MIDI and audio media, and via Steps in the Action scripting language. Of course you can also use these commands to talk to other Cues, not just the game; imagine several pieces of music running in parallel at the same time, with tracks being dynamically muted and un-muted based on a mailbox value controlled by the 'master' song.


So far we've been talking about making sound things happen in response to instructions from the game or host application, but a good sound system also needs to furnish some way to let the soundtrack drive game things when necessary. For example, you might want to trigger an explosion animation in sync with a cymbal crash at the end of a piece of music. Since the timing of the cymbal crash is determined by the music playback, not the game code, there has to be some sort of music-event-driven signaling mechanism. In general these are called Callbacks because the game leaves the Soundtrack Manager a function to call when the desired event happens, and the Soundtrack Manager uses that to "call back" to the game at the right time. (Think of it as the SM making a phone call to let the game know that an interesting sound-based event has happened and the terminology makes more sense.)

There's probably no limit to what an interesting sound-based event might be, but two intensely practical occasions where it would be useful to trigger a callback would be 1) when playback of a piece of media reaches a particular point, and 2) when a particular step in an Action is reached.


One thing audio files and MIDI files have in common is that they are both (at least in their simplest forms) linear media: playback begins at the beginning, runs through the file in linear order, and ends at the end of the file. Another thing they have in common is Markers, which are non-sound events that are pegged to particular times during playback, with space for a scrap of text you can fill in. You can place markers in any good MIDI or audio file editor. (In MIDI files a marker is a "Meta Event" appearing in the MIDI event stream at the appropriate time, and in audio files a marker is a special data chunk stored outside the audio data chunk.)

We can set a rule that in our sound system, any marker with text in the form "Callback:YourTagHere" will cause the Soundtrack Manager to call the game's callback function that's tagged 'YourTagHere'. If the game hasn't set up such a callback, then nothing happens.

I bet that was too propellerheaded to understand, so here's an example. Joe Programmer has a function called DoCrashAnimation() and wants the Soundtrack Manager to call it when a certain piece of music reaches its crescendo. Jane Soundgrrl chats with Joe, and they agree the tag for this callback should be "TimeToCrash". Jane goes back to her MIDI sequencer in the basement, opens the song in question, and inserts a new marker with the text "Callback:TimeToCrash" at the appropriate place, then redelivers the media (in an XMF file) to Joe. Meanwhile Joe's been adding code to the game to tell the Soundtrack Manager that his function DoCrashAnimation() should be called when the callback event tagged "TimeToCrash" occurs [which would look something like this: theSoundtrackManager.setCallback( "TimeToCrash", &DoCrashAnimation ) ]. Joe rebuilds the game, incorporating the new XMF file, and the next time that piece of music plays, the animation happens when the song hits the marker.
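Here is roughly what the Soundtrack Manager's side of that story could look like. `setCallback` follows the call shown in the example; `onMarker` is a hypothetical hook that the playback code would invoke whenever it reaches a marker, and callbacks are assumed to take no arguments.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

class SoundtrackManager {
public:
    // The game registers a function under a tag.
    void setCallback(const std::string& tag, std::function<void()> fn) {
        callbacks_[tag] = std::move(fn);
    }

    // Hypothetical hook: playback calls this with each marker's text.
    // Returns true only if a registered callback was actually called.
    bool onMarker(const std::string& markerText) {
        const std::string prefix = "Callback:";
        if (markerText.compare(0, prefix.size(), prefix) != 0) {
            return false;  // an ordinary marker, not a callback request
        }
        auto it = callbacks_.find(markerText.substr(prefix.size()));
        if (it == callbacks_.end()) {
            return false;  // the game never set this one up: nothing happens
        }
        it->second();
        return true;
    }

private:
    std::map<std::string, std::function<void()>> callbacks_;
};
```

Markers without the prefix, and tags nobody registered, fall through silently, which matches the "then nothing happens" rule.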


It would also be useful (not to mention easy) to provide a Step in the Action scripting language to trigger a callback. For example, with a combination of mailbox-setting markers and an Action script, you could watch for unusual musical states (like when multiple loops of different length all restart at the same time), and send the game a callback to say it's time to move on to a new setup.


The intrepid few who venture into creating their own Action scripts will need a basic language with enough control statements and media manipulation statements to get the job done. Most non-programmers get discouraged by the whole Syntax Error thing, so to make it all seem just incredibly artist-friendly, we'll call the statements 'Steps' and provide a button-driven, error-proof editor looking something like this:

To add a step, you just put the cursor at the desired line and click the 'Add Step' button for the desired step. To delete a step, select it and click the Delete Step button. To set a parameter for a step, click on it and enter a new value. Or, where it makes sense, pick a mailbox number and the script will take the parameter from whatever's in that mailbox at the time the script runs. You can see where this is leading, right? It means any part of any script can be driven by any mailbox - and since the game or any piece of sound media can drive any mailbox, well... things could get pretty interesting.
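One way to picture the "take the parameter from a mailbox" idea is the sketch below. It assumes a step parameter is either a literal value or a reference to a numbered mailbox that is looked up when the script runs; the Mailbox class and resolve() function are made-up names for illustration, not part of any proposed format.

```python
from dataclasses import dataclass
from typing import Dict, Union

Value = Union[int, float, str]


@dataclass(frozen=True)
class Mailbox:
    """A parameter that reads its value from a numbered mailbox at run time."""
    number: int


def resolve(param: Union[Value, Mailbox], mailboxes: Dict[int, Value]) -> Value:
    """Return the literal itself, or the current contents of the mailbox."""
    if isinstance(param, Mailbox):
        return mailboxes[param.number]
    return param


mailboxes: Dict[int, Value] = {3: 0.25}

fader_literal = resolve(0.8, mailboxes)          # fixed value typed into the step
fader_from_box = resolve(Mailbox(3), mailboxes)  # whatever mailbox 3 holds right now

mailboxes[3] = 0.9                               # the game, or a marker, updates the mailbox
fader_later = resolve(Mailbox(3), mailboxes)     # the same step now gets the new value
```

The step itself never changes; only the mailbox contents do, which is what lets the game or a piece of media steer a script from outside.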


Designing a scripting language can bring on practically religious differences of opinion, so we'll leave the bulk of this potentially controversial task to the IA-SIG Working Group, and offer the following short list of steps (statements) as a starting point. Note that the steps treat all types of media files (MIDI and audio) in the same way, not as special cases.

Preload File

Start File

Stop File

Play From / To

Set Fader

Fade In

Fade Out


Play Nth File Of...

Play Random File Of...

Set Mailbox X to Y

Execute Callback 'CallbackTagName'
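To make the list a little more concrete, here is a rough sketch of how a few of these steps might be stored as plain data and run in order. Everything about it is an assumption for illustration: the tuple layout, the zero-based index for Play Nth File Of, and the fact that "playing" a file just writes a line to a log. Note that the MIDI file and the audio files go through exactly the same Start File step.

```python
from typing import Callable, Dict, List


def run_action(steps: list, mailboxes: Dict[int, object],
               callbacks: Dict[str, Callable[[], None]], log: List[str]) -> None:
    """Run the steps in order. Playback is simulated by appending to `log`."""
    for name, *args in steps:
        if name == "Start File":
            log.append("start " + args[0])
        elif name == "Stop File":
            log.append("stop " + args[0])
        elif name == "Play Nth File Of":
            files, n = args              # n is zero-based here (an assumption)
            log.append("start " + files[n])
        elif name == "Set Mailbox":
            box, value = args
            mailboxes[box] = value
        elif name == "Execute Callback":
            func = callbacks.get(args[0])
            if func is not None:         # unregistered tag: nothing happens
                func()
        else:
            raise ValueError("unknown step: " + name)


log: List[str] = []
mailboxes: Dict[int, object] = {}
callbacks = {"SceneDone": lambda: log.append("callback SceneDone")}

action = [
    ("Start File", "theme.mid"),
    ("Start File", "wind.wav"),
    ("Play Nth File Of", ["hit1.wav", "hit2.wav", "hit3.wav"], 1),
    ("Set Mailbox", 7, 42),
    ("Stop File", "wind.wav"),
    ("Execute Callback", "SceneDone"),
]
run_action(action, mailboxes, callbacks, log)
```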


What happens when some genius absolutely needs a new step we haven't thought of? Flower Time. The data format for the scripting steps needs to include an extensibility mechanism so that in the future, when we'll be smarter than we are now, anyone can add new stuff without breaking the old stuff. (Hint to implementors: Look at XMF's ResourceFormatID non-collision mechanism.)
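As a sketch of what "adding new stuff without breaking the old stuff" could mean in practice, the example below keys every step by an (organization, step name) pair so that two vendors can never collide, and lets a player skip steps it doesn't recognize instead of falling over. This is a hypothetical scheme in the spirit of the hint, not a description of how XMF actually encodes its IDs.

```python
from typing import Callable, Dict, List, Tuple

StepKey = Tuple[str, str]                    # (organization, step name): hypothetical scheme
Handler = Callable[[list, List[str]], None]

registry: Dict[StepKey, Handler] = {}


def register_step(org: str, name: str, handler: Handler) -> None:
    """Add a step handler; refuse to overwrite an existing one."""
    key = (org, name)
    if key in registry:
        raise ValueError(f"step already registered: {key}")
    registry[key] = handler


def run(steps: List[Tuple[str, str, list]], log: List[str]) -> List[StepKey]:
    """Run known steps in order and return the keys of any steps that were skipped."""
    skipped: List[StepKey] = []
    for org, name, args in steps:
        handler = registry.get((org, name))
        if handler is None:
            skipped.append((org, name))      # unknown extension: ignore, don't crash
            continue
        handler(args, log)
    return skipped


register_step("standard", "Start File", lambda args, log: log.append("start " + args[0]))
register_step("acme", "Reverse File", lambda args, log: log.append("reverse " + args[0]))

log: List[str] = []
script = [
    ("standard", "Start File", ["theme.mid"]),
    ("acme", "Reverse File", ["wind.wav"]),      # a vendor extension this player knows
    ("zeta", "Start File", ["other.wav"]),       # same step name, unknown vendor
]
skipped = run(script, log)
```

The "zeta" step shares a name with a standard step but cannot be mistaken for it, and an older player simply reports it as skipped.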


Notice how much better speech synthesizers have gotten in the last 10 years? The early ones played steady-state phonemes one after another, end-to-end, and they sounded like bad sci-fi robots. Then someone realized that in actual speech, most of the time the vocal tract isn't standing still, it's changing shape as it moves from one phoneme to the next - so for natural-sounding speech, what you really need to model is the transitions, not just the steady states.

Mixing sound for film or games is similar, in that if you simply hard-cut from one sound to another - for example, when the scene or location changes - the transition usually feels wrong: too sudden, jarring, and unnatural. The alternative is to transition between settings more smoothly, which from a mixing standpoint means having the tracks for both scenes playing, and then cross-fading from one scene's elements to the next one's over an appropriate interval of time. This opens the door to a whole world of much more natural, more cinematic, or more musical effects, and by doing so reinforces the production's immersive effect. By contrast, a jarring transition frequently detracts from the player's suspension of disbelief.
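For the curious, the gain math behind a crossfade can be sketched in a few lines. The text doesn't say what fade curve to use; the example below assumes the common equal-power curve (cosine for the outgoing scene, sine for the incoming one), and the function name is invented for illustration.

```python
import math
from typing import Tuple


def crossfade_gains(elapsed: float, duration: float) -> Tuple[float, float]:
    """Return (outgoing_gain, incoming_gain) at `elapsed` seconds into the fade."""
    if duration <= 0:
        return (0.0, 1.0)                    # zero-length fade is just a hard cut
    t = min(max(elapsed / duration, 0.0), 1.0)   # progress, clamped to 0..1
    angle = t * math.pi / 2
    return (math.cos(angle), math.sin(angle))


# Gains at the start, midpoint, and end of a 4-second crossfade.
start = crossfade_gains(0.0, 4.0)
middle = crossfade_gains(2.0, 4.0)
end = crossfade_gains(4.0, 4.0)
```

With this curve the two gains squared always add up to 1, so the combined level doesn't dip in the middle the way a straight linear fade can.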

But there's more to transitions than just crossfading. We could define 'transition' as anything relating to the way a newly introduced sound relates to the other sounds running at the time. In some cases, the entrance of a new cue should kill off other cues (e.g. the classic whistle fall vs. bomb explosion). In some cases calling a cue when the same cue is already running should stop playback of the previous sound (flush toilet); other times you'll want the old sound to continue, so that it overlaps the new sound (banging on several crash cymbals). So sometimes a Transition includes killing other cues, or media files within the same cue.
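The three cases above (kill another cue, restart the same cue, overlap the same cue) can be sketched as a single small rule. The function and its parameter names are hypothetical, and the list of strings merely stands in for whatever is actually sounding.

```python
from typing import Iterable, List


def call_cue(playing: List[str], cue: str,
             kills: Iterable[str] = (), restart: bool = False) -> List[str]:
    """Return the cues sounding after `cue` is called.

    kills:   other cues that the new cue silences on entry.
    restart: if True, running instances of the same cue are stopped first;
             if False, the new instance overlaps them.
    """
    doomed = set(kills)
    if restart:
        doomed.add(cue)
    return [c for c in playing if c not in doomed] + [cue]


# Whistle fall vs. bomb explosion: the explosion kills the whistle.
bomb = call_cue(call_cue([], "whistle"), "explosion", kills=["whistle"])

# Flush toilet: calling the cue again stops the previous flush.
toilet = call_cue(call_cue([], "flush", restart=True), "flush", restart=True)

# Crash cymbals: each new hit overlaps the ones still ringing.
cymbals = call_cue(call_cue([], "crash"), "crash")
```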


While smooth transitions can be very important in creating convincing sound effects that continue across scene changes, smooth musical transitions probably offer the best payoff, in terms of a perceived increase in production values (see for example Peter McConnell's soundtrack for LucasArts' "Grim Fandango"). From a technical perspective, however, musical transitions bring the additional requirement that if the transition is going to sound OK, the new piece of music and the old piece of music must be both a) playing at the same speed, and b) synchronized to the same musical point (or at least a compatible point) in the song. This can be tricky. Perhaps the simplest way to achieve this is to place markers in both music files, for example exactly on the 1st beat of each music phrase, and let the Soundtrack Manager use the markers' times to figure out when to start the newer piece. Depending on the musical style, finer-grained markers (8th notes, 16th notes, etc.) could be used, making faster musical transitions possible.

Marker alignment may also be useful for rhythmic or looped sound effects. For example, transitions among different angles or intensity levels on a continuous piledriver or motor would have to be synchronized to the rhythm to sound right.
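Here is a minimal sketch of the marker-alignment idea, assuming marker times are available as a sorted list of seconds from the start of the file. It only answers "when is the next safe point to bring in the new piece?"; matching tempo between the two pieces (requirement a) is not addressed, and the function name is invented for illustration.

```python
from bisect import bisect_left
from typing import List, Optional


def next_marker(markers: List[float], position: float) -> Optional[float]:
    """Return the first marker time at or after `position`, or None if none remain.

    `markers` must be sorted in ascending order.
    """
    i = bisect_left(markers, position)
    return markers[i] if i < len(markers) else None


# Coarse markers on the first beat of each phrase (made-up times, in seconds).
phrase_markers = [0.0, 8.0, 16.0, 24.0]
coarse = next_marker(phrase_markers, 9.5)      # next phrase starts at 16.0
wait = coarse - 9.5                            # 6.5 seconds until the transition

# Finer-grained markers every half second allow a much quicker transition.
fine_markers = [i * 0.5 for i in range(49)]    # 0.0, 0.5, ... 24.0
fine = next_marker(fine_markers, 9.3)          # next marker is at 9.5

past_the_end = next_marker(phrase_markers, 30.0)
```

The same lookup would serve the piledriver or motor case: the new loop starts on the next rhythm marker rather than the instant the game asks for it.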


Again the IA-SIG WG can decide what functionality is necessary, but here's our initial stab at the essential creative controls for a good Transition:


Likely the above controls come nowhere near exhausting the possibilities for useful transition rules, so here again the system should be designed with an eye to extensibility. I confess this seems to me to be an area where it may be difficult to foresee all the possibilities, so designing an extensibility mechanism for transitions might be a little tougher than some of the other areas.


So. What does all the above add up to?

  • To the audio artist, all the above appears as markers placed into the audio and MIDI media using the existing audio and MIDI editors, and interactivity rules created with a new Cue Editor application. The Cue Editor would include auditioning tools, for example a way to load multiple Cue Sheets and a separate button to call every contained Cue. Final files would be delivered in XMF format, exported by the Cue Editor.
  • To the audio integration programmer, all the above appears as the audio artist delivering media in XMF format, and way less custom programming required.
  • To the game engine architect, all the above appears as a better-behaved, drop-in way to achieve systematized soundtrack control. World objects can be easily tied to sound behaviors that the audio artist can adjust independently.
  • To a programmer maintaining a proprietary in-house interactive audio system, all the above may appear as an interesting alternative approach, or it may appear as some useful ideas to incorporate into the next version. At a minimum, bundling content with XMF file technology might seem worth looking into.
  • To the game design team, all the above appears as a way to let the audio artist get the creative job done that avoids the usual resource bottleneck of getting precious and expensive programmer support time.
  • To the craft of interactive audio, all the above appears to have promise as a basis for a standard generalized working methodology that would foster tool development, facilitate collaboration and cross-platform content creation, and make better interactive soundtracks much, much easier to create.


This document has laid out a pretty Big Picture, but only in sketch form. Many details remain to be filled in, and there will be a great deal of implementation work to be done. The Rogue Group members would like to offer our encouragement and our support to those who decide to continue this work, including but not limited to the IA-SIG Working Group. We all hope to participate directly in whatever comes next, but if for any reason that's not possible then you can be sure we'll be there in spirit.

