The Seventh Annual Interactive Music Conference
PROJECT BAR-B-Q 2002

Group Report: Proposal for Latency and Uncertainty (Jitter) Management By Enumerating Renderers and Sources

Participants:
A.K.A. "The Plumbers"

Dan Bogard; SigmaTel
David Zicarelli; Cycling '74
Chris Grigg; Beatnik
Mak Jukic; Yamaha
Ron Kuper; Cakewalk
Steve Pitzel; Intel
Jim Rippie; Sonic Network
Brian Smithers; Full Sail
Keith Weiner; DiamondWare
Devon Worrell; Intel
Nathan Yeakel; Gibson Guitar
Facilitator: Linda Law; Fat Labs, Inc.
 

Executive Summary
Complicated audio and video technology has found its way from recording studios and multichannel surround movie theaters to the consumer household. As traditional distinctions between computing and consumer electronic products begin to fade, consumers will confront the same limitations in media performance that have hampered professional users for years: unacceptably high levels of latency (delays in responsiveness), difficulty in synchronizing media playback, and interoperability problems.

Left unaddressed, these limitations risk customer dissatisfaction and slow adoption of our next generation consumer products. We propose a problem-solving approach that applies principles of object-oriented computing to computing hardware and software. We also explore the challenges inherent in applying aggressive Digital Rights Management technologies to next generation products.

By creating a software-based management system that treats all elements of the audio chain as interconnected components, we can control and reduce latency, ensure that all media is properly synchronized, and provide a new generation of successful products for the emerging digital lifestyle in the home, and for the professionals who create that content.


Introduction
Standard computing architectures lie at the heart of contemporary media delivery systems, such as personal computers, home digital video recorders, and some set-top box and other living room entertainment products. Just a few years ago these products were rare, but the rapid adoption of DVD technologies and the increasing numbers of consumers who create their own media with video cameras and home computers have whetted consumer appetites and permanently altered the landscape of consumer electronic and computing products. The line between computers and common consumer electronic products is evaporating.

Existing media streaming architectures are already very mature and full featured, and they continue to evolve and improve. Recent steps towards media convergence, e.g., the computer in the living room controlling the TV set, present a new proliferation of multiple input and output streams and disparate media types. The market will reveal new and no doubt unpredicted ways to use these multiple streams.

The market for these products, however, appears to be developing faster than the products themselves. While there are many and varied reasons for slow product innovation, including the lack of a widely accepted market solution for dealing with copyrighted (and possibly copy-protected) content, there are clearly technical limitations hampering our efforts. In this report, we will deal with two principal issues: latency and synchronization.

In particular, existing architectures do not adequately solve the synchronization and latency management problems that arise in the converged home media environment. At the same time, the unique needs of the professional media content creator have not been adequately met. Major software and hardware manufacturers (who shall remain nameless) are currently wrestling with issues of latency and synchronization in next generation platforms, so there is an opportunity at BBQ 2002 to provide some guidance to their efforts.

This report addresses synchronization and latency in a framework for real-time media. It is platform agnostic and addresses the needs of the consumer as well as the professional content creator. We hope that the general principles we outline will form the basis for new hardware platform and operating system development, and when necessary, software standards that provide the media infrastructure for home entertainment and professional production.

Definitions and Assumptions
We assume a fairly technical audience, one accustomed to concepts of contemporary software design and issues confronting media developers. We will use jargon commonly used in media software and hardware design, like "renderer" (a device that provides final output, like a video card or a digital-to-analog converter on a sound card). We explain some complicated concepts, like synchronization, only to the degree necessary to propose our approach.

This paper describes a media system that represents streaming components as blocks connected in an acyclic graph, in the manner of AVStream in Windows. We will use the term "component" to describe an element of the graph, such as an audio renderer or codec. We will use the term "graph manager" to describe the presumed system element that manages the connection and lifetime of components in a graph.
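
To make these terms concrete, the following minimal sketch (in C++, purely illustrative; the interface names are invented for this report and belong to no existing API) shows one way the component and graph manager roles could be expressed:

    // Illustrative only: one possible shape for the "component" and
    // "graph manager" roles described above. All names are invented here.
    #include <memory>

    class IComponent {                    // an element of the graph, e.g. a codec or renderer
    public:
        virtual ~IComponent() = default;
        virtual const char* Name() const = 0;
    };

    class IGraphManager {                 // owns connection and lifetime of components
    public:
        virtual ~IGraphManager() = default;
        virtual IComponent* Add(std::unique_ptr<IComponent> component) = 0;
        virtual void Connect(IComponent* upstream, IComponent* downstream) = 0;  // must remain acyclic
        virtual void Remove(IComponent* component) = 0;
    };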

Latency: Background
To gain an understanding of the issues surrounding latency, examine the following usage scenarios.

Case 1: Gaming. When you push the button on your joystick, you need to hear the gunshot sound within 50 milliseconds for it to be convincing. In this case the system latency needs to be reasonably low, but any value below 50 msec is tolerable; 20 msec or less provides a heightened immediacy that furthers the gameplay.

Case 2: Audio production. Vocalists want to hear themselves in headphones with a reverb while recording to a computer-based workstation. If the reverb is being run on the host computer, the full-duplex system latency (audio input, through the main processor, to the audio output) must be below 12 msec to be musically useful, and below 5 msec to be non-distracting.

Case 3: Acoustic Echo Cancellation. This is perhaps one of the more demanding cases for latency, and is a key feature of telephony software. AEC algorithms require low, consistent, and predictable latency to work properly.

Based on the usage scenarios, we see two key requirements for the latency of a media system on common hardware platforms:

  • Latency must be low, preferably below 5 msec
  • Latency changes must be managed by communication with the various components in the media system

Latency: Proposed Solutions
We propose that both components and the graph manager work together to manage latency throughout the system. Components provide reporting and notification fielded by the graph manager. The graph manager uses this information to manage the overall system latency intelligently, both at "discovery time" and at "run time."

"Discovery time" describes when new media components come to life. For example, if an end user plugs a DV camera into a 1394 port on a desktop computer, a new media component appears on the graph, perhaps as a DV input and decoder, with video rendered on the computer screen and audio rendered through the computer's speakers. A user downloading and installing a new audio processing component, like a three-dimensional spatializing or reverberation processor, represents another kind of media component appearing in the same system context.

At discovery time, the graph manager asks each component to report its latency characteristics (a sketch of such a reporting interface follows this list). These characteristics include, but are not limited to:

  • Intrinsic latency. This latency value is a description of the worst-case upper bound latency for the component. The component will never do worse than this under normal conditions.
  • Jitter. This describes how much the actual latency may vary at runtime. Transport layers such as USB have inherent jitter; a component may mask that jitter with internal buffering, in which case the cost shows up in its intrinsic latency value.
  • Amount of latency added by compensating for jitter. (See above: if a USB device is configured to compensate for jitter by buffering, it needs to report how much latency this would introduce.)
  • Preferred buffer alignment and granularity. Some components, such as an FFT-based audio plug-in, operate most efficiently when given buffers that are exact powers of two, and incur a latency penalty for other buffer sizes. We therefore allow the component to report its desired buffer geometry, and the cost incurred if the host fails to provide it.
  • Permissible buffer alignment and granularity. These represent the range of acceptable values that will permit the component to operate normally, even if its performance isn't ideal.
  • Read-only values vs. read-write values. Some components may allow the amount of extra buffering for jitter to be controlled by the host (graph manager), so they would expose this particular value as being a read-write value.
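
As a concrete illustration, these characteristics could be reported through a structure like the one sketched below. This is only a sketch; the field names and the reporting method are invented for this report and come from no existing framework.

    // Sketch of a discovery-time latency report. All names are illustrative.
    #include <cstdint>

    struct LatencyCharacteristics {
        double   intrinsicLatencyMs;     // worst-case upper bound under normal conditions
        double   jitterMs;               // how much actual latency may vary at runtime
        double   jitterCompLatencyMs;    // extra latency added if jitter is buffered away
        uint32_t preferredBufferFrames;  // e.g. an exact power of two for an FFT-based plug-in
        double   nonPreferredPenaltyMs;  // cost if the host cannot honor the preference
        uint32_t minBufferFrames;        // permissible range the component can still operate in
        uint32_t maxBufferFrames;
        bool     jitterCompWritable;     // true if the graph manager may adjust the extra buffering
    };

    // A component would answer the graph manager's query at discovery time, e.g.:
    //   virtual LatencyCharacteristics ReportLatency() const = 0;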

To avoid complicated logic among components, it is assumed that only the graph manager or (optionally) an application should need to traverse the graph. In other words, any determination of "global" latency is the sole responsibility of the graph manager.
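
That determination can be illustrated with a short sketch: assuming the per-component figures reported at discovery time, the graph manager alone walks the graph, summing worst-case latency along each source-to-renderer path. The names and structure below are purely illustrative.

    // Sketch: only the graph manager walks the graph. The worst-case
    // end-to-end latency of a path is the sum of each stage's intrinsic
    // latency plus any jitter-compensation buffering it has enabled.
    #include <algorithm>
    #include <vector>

    struct Stage {
        double intrinsicLatencyMs;
        double jitterCompLatencyMs;
        std::vector<const Stage*> downstream;   // acyclic connections
    };

    double WorstCasePathLatencyMs(const Stage& stage) {
        double worstBelow = 0.0;
        for (const Stage* next : stage.downstream)
            worstBelow = std::max(worstBelow, WorstCasePathLatencyMs(*next));
        return stage.intrinsicLatencyMs + stage.jitterCompLatencyMs + worstBelow;
    }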

During runtime, the latency of the system can change. To allow for this (a sketch of the notification mechanism follows this list):

  • Components must be able to report their current actual latency at any time.
  • Components must signal "runtime events" to the graph manager, such as a transient change in latency (e.g., "the user downloaded a video picture across the USB bus, so my audio is 200 msec late"), or a permanent change in latency (e.g., "my FFT block size just changed and I need 2k frames instead of 1k frames"). Other kinds of runtime events would be changes in jitter or error conditions (e.g., the user unplugged a USB cable).
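
One way such a notification path might look, again with invented names:

    // Sketch of runtime notifications from a component to the graph manager.
    // Event names and the callback signature are illustrative only.
    enum class RuntimeEvent {
        TransientLatencyChange,   // e.g. a bus transfer delayed audio by 200 msec
        PermanentLatencyChange,   // e.g. an FFT block size grew from 1k to 2k frames
        JitterChange,
        Error                     // e.g. the user unplugged a USB cable
    };

    class IRuntimeEventSink {     // implemented by the graph manager
    public:
        virtual ~IRuntimeEventSink() = default;
        virtual void OnRuntimeEvent(RuntimeEvent what, double newLatencyMs) = 0;
    };

    // Components must also be able to answer, at any time:
    //   virtual double CurrentLatencyMs() const = 0;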

The graph manager is a suitable mechanism for global latency management because it has a bird's-eye view of the graph and can make good decisions about how to minimize latency. For example, suppose a USB component can optionally have its buffering disabled, thereby exposing all of its jitter. If this USB component lives upstream from an FFT plug-in that requires large buffer sizes, then there is no reason for the USB device to buffer too; this additional stage of buffering is redundant and adds unnecessary latency to the system with no benefit. Since the graph manager is responsible for latency management, it could disable buffering on the USB device and minimize overall latency.
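
The decision in that example might look like the following sketch, assuming the USB component exposes its jitter-compensation buffering as a read-write value (all names are illustrative):

    // Sketch: switch off redundant upstream buffering when a downstream
    // stage already imposes buffers large enough to mask the upstream jitter.
    struct StageLatencyInfo {
        double jitterMs;             // reported jitter of the stage
        double imposedBufferMs;      // buffering the stage imposes on the path
        bool   jitterCompWritable;   // may the graph manager change its buffering?
        bool   jitterCompEnabled;
    };

    void DisableRedundantBuffering(StageLatencyInfo& usbSource,
                                   const StageLatencyInfo& fftPlugin) {
        // The FFT stage's large buffers already absorb the USB jitter, so
        // USB-side buffering adds latency with no benefit.
        if (usbSource.jitterCompWritable &&
            fftPlugin.imposedBufferMs >= usbSource.jitterMs)
            usbSource.jitterCompEnabled = false;
    }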

Synchronization: Background
A modern media system will employ multiple sources, multiple renderers, and different clocks playing streams across different hardware. The user expects all of these streams to play or record in sync in a system that does not necessarily have the luxury of a hardware solution such as word clock. (Word clock provides a common timing reference to any device connected to its output.)

At the same time, whatever solution is devised cannot degrade the experience for professional content creators; professionals and consumers will increasingly run different applications on the same basic platforms.

As with latency, usage scenarios shed some light on synchronization problems confronting professionals and consumers:

  • The user is playing a DVD on a computer connected to a TV via HDMI. At the same time, the user wants to play the DVD's audio through the computer's headphone output. Since the computer runs on a different timing clock than the TV, the two media streams will drift unless sync is employed.
  • The previous example becomes much more difficult if one or both of the streams is encrypted, because there might not be a "trusted" component to perform Sample Rate Conversion (SRC). Untrusted components can see only encrypted blocks of data, which are totally opaque and cannot be processed in any useful way.
  • A professional content creator is recording an orchestra from an array of 32 microphones, patched into 2 different PCI cards, each having 16 analog inputs. These 2 capture streams must remain perfectly sample-aligned.

Scope and Goals for a Synchronization Solution
To achieve synchronization, you must be able to "expand" or "shrink" the length of media streams when necessary to bring them into alignment. Minimizing the use of asynchronous sample rate converters (ASRCs) is desirable, especially in a production system. Although ASRCs can be quite processor efficient (keeping processor overhead low is an ongoing requirement), they can noticeably degrade audio quality.

The system must be able to tolerate multiple simultaneous clocks, such as an audio card's sample clock and video frame rate clock.

The system must support multiple media types at once, such as audio and video and animations and MIDI.

If hardware-based synchronization is available, as would be the case in a professional authoring environment, it should be utilized. In a similar vein, because software-based ASRCs are lossy, there needs to be a way to configure the graph without them.

Some kinds of media streams are inside a "closed pipe", for example, a DVD movie that is being rendered directly to a TV set. The synchronization system needs to allow and account for these kinds of streams.

The Relationship between Synchronization and Latency
ASRCs add latency to the system, because they will grow or shrink output buffer sizes relative to the input buffer size. Whenever you change a buffer size, all components downstream may need to start buffering to compensate. For example, an FFT-based plug-in that wants power-of-two buffer sizes, and warned you about the penalty of giving it anything else, will begin to howl if it sits downstream from an ASRC.

Also, latency that exists in the processing chain makes the task of actually implementing synchronization much more complicated for the developer. For example, suppose your processing chain has 1 second of latency. When you deliver a buffer into your graph, you are doing so based on the clock skew defined as "right now," even though the buffer won't actually be heard until a second from now. If your sync algorithm doesn't allow for this delayed effect, you will oscillate as you try to synchronize and never achieve useful sync.
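
As a hedged illustration of that point, a sync correction can be computed against the error predicted at the time the buffer will actually be heard, rather than the error measured at delivery time. The function below is purely illustrative.

    // Sketch: skew correction that accounts for pipeline latency. Correcting
    // against the error measured "right now" while a second of audio is
    // already queued would over- and under-shoot, producing oscillation.
    double PredictedErrorMs(double measuredErrorMs,      // renderer vs. reference, measured now
                            double driftMsPerSec,        // estimated skew rate between the clocks
                            double pipelineLatencySec)   // delay before this buffer is heard
    {
        // Project the error forward to the moment of playback and correct that.
        return measuredErrorMs + driftMsPerSec * pipelineLatencySec;
    }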

Synchronization and DRM
Technologies that provide Digital Rights Management on the system run the risk of compromising successful synchronization of media in the home, not to mention the professional studio. A concern was expressed in the PLUMBERS group that if a tradeoff had to be made between DRM integrity and maintaining sync, platform vendors would probably defer to DRM exclusively, by default.

This would have tragic consequences for content creators, the very people that DRM purports to protect. Product adoption rates will slow dramatically as these customers turn instead to products that allow them to do their jobs on spec and on deadline. Since synchronization problems affect both consumers and creators, consumers will also reject products that don't behave the way they expect. While the group acknowledges the challenges faced by copyright owners, successful DRM cannot be the sole criterion for any new platform development, and should not be allowed to compromise the core platform's sync capabilities.

Synchronization Solutions
As with latency, the graph manager must have the intelligence to manage synchronization among the media components. Part of the graph manager's role is to designate a "reference clock" when necessary, choosing from among the components that can provide a clock. This reference clock serves as the absolute time reference against which all other clocks' skew rates and position errors are determined.

The graph manager must actively detect clock skew, e.g., by receiving a periodic notification from the master clock that an "abstract tick" has occurred. On each abstract tick, the graph manager can compare the reference clock to all other clocks and determine skew values.
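
A sketch of that comparison, with invented names and units:

    // Sketch: on each abstract tick, compare every other clock's progress to
    // the reference clock's progress and derive a skew ratio (1.0 == locked).
    #include <cstddef>
    #include <vector>

    struct TrackedClock {
        double lastPositionSec;   // position recorded at the previous tick
        double skewRatio;         // observed rate relative to the reference
    };

    void OnAbstractTick(double referenceElapsedSec,
                        const std::vector<double>& currentPositionsSec,
                        std::vector<TrackedClock>& clocks) {
        for (std::size_t i = 0; i < clocks.size(); ++i) {
            double elapsed = currentPositionsSec[i] - clocks[i].lastPositionSec;
            clocks[i].skewRatio = elapsed / referenceElapsedSec;
            clocks[i].lastPositionSec = currentPositionsSec[i];
        }
    }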

Based on the choice of the master clock, the graph manager must insert ASRCs at the appropriate path(s) before the renderer. To match the buffer size variation that will occur at the outputs of the ASRCs, the graph manager may optionally require that the sources vary the output buffers that they produce upstream.

In a professional production environment, all of the above mechanisms must be able to defer to external or hardware sync.

Because ASRCs will vary buffer sizes, the actual latency of the system may vary dynamically in cases where synchronization is being used. Fortunately, the latency management scheme described above already keeps components of the graph notified about these changes. Nonetheless, ASRCs should be avoided when possible.

Some leverage in sync can be obtained by using the error correcting capabilities of certain formats to avoid ASRC. For example, in AC3 you can remove a sample and the codec will reconstruct that missing data (transport error correction).

Similarly, phase vocoder techniques can be used during the decompression of compressed sources (such as MP3) to avoid ASRC.

Finally, any DRM or trusted components must expose control inputs for synchronization, for cases where the unprotected data stream is not directly accessible for ASRC.

Conclusion and Action Items
The group laid a good foundation for further work, which includes advocating this approach to specific platform developers and standards organizations. The action items and owners are:

1. Finalize Report - Jim
2. Present this group's report to the MMA - Chris
3. Get our report to Microsoft - Devon
4. Drive Microsoft to present response at WinHEC - Devon
5. BBQ wants to go to the WinHEC board to present - Keith
6. Propose Audio Component Working Group at NAMM MMA meeting - Ron

Appendix: A Call for a Standardized Audio Component Framework
[For additional discussion of audio component architectures, the reader is strongly encouraged to see a formal attachment to this report, Chris Grigg's presentation at Project BarBQ 1999, available on the Project BarBQ web site.]

Before moving to the specific problems of synchronization and latency management, the PLUMBERS group discussed another barrier to new product development in the home and professional markets: software audio components. This general term can be applied to a wide range of product types, from 3D expansion processors in hardware or software, to sophisticated host-based audio reverberation and synthesis processors.

The professional audio market offers a variety of audio component formats, some hardware and some software based, all entirely incompatible. While they can be considered "standards," they are proprietary, and the companies responsible for their development assume a documentation and support burden. In other industries, web standards for example, organizations independent of any single company take responsibility for the growth and development of the standard.

In the absence of any industry wide audio software component standards organization, audio component developers are required to develop multiple formats, or take the business risk of focusing on a single format. Application developers are frequently required to support and host multiple formats to satisfy customer demand.

We discussed the definition of a component framework and the characteristics of components, including their roles, responsibilities, and duties (an illustrative sketch follows these lists):

  • A discovery mechanism that seeks out and makes connections
  • Flow control duties
  • Connection, where components are located in the graph
  • Capability to exchange trust
  • Clocking, clock mismatch, converters
  • Parameter flow
  • Parameter constraint
  • Persistence (saving and retrieving of a component's state)
  • Error handling

A component framework:

  • is a plug-in format
  • has an abstraction model
  • has a discovery and enumeration mechanism
  • is transport neutral
  • is scalable: distributed, multi-instance, low-overhead, adaptive to hardware
  • is programming language agnostic
  • is a virtual wire
  • has an exportable UI
  • has a standardized persistence mechanism
  • is compatible with XMF: extend XMF to include a mix chunk type

A component framework is not:

  • a parameterized sink

Furthermore, this component framework is:

  • not a definition of functionality within a component
  • not burdened with IP and licensing fees
  • not tied to platform specific language variations
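
To give a flavor of what such a framework's plug-in surface might cover, here is a heavily hedged sketch touching a few of the duties listed above (discovery, parameter flow and constraint, persistence, processing). None of these names come from any shipping standard; a real specification would be the working group's job.

    // Purely illustrative; all names are invented for this report.
    #include <cstddef>
    #include <string>
    #include <vector>

    struct ParameterInfo {                 // parameter flow and constraint
        std::string name;
        double minValue, maxValue, defaultValue;
    };

    class IAudioComponent {
    public:
        virtual ~IAudioComponent() = default;

        virtual std::string Identifier() const = 0;                       // discovery / enumeration
        virtual std::vector<ParameterInfo> Parameters() const = 0;
        virtual void SetParameter(std::size_t index, double value) = 0;

        // Persistence: saving and retrieving the component's state
        virtual std::vector<unsigned char> SaveState() const = 0;
        virtual void RestoreState(const std::vector<unsigned char>& state) = 0;

        // Processing: transport-neutral, buffer in / buffer out
        virtual void Process(const float* input, float* output, std::size_t frames) = 0;
    };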

Several members of the group will assume the subtask of further defining and promoting such a component framework with an appropriate standards organization, such as MMA. (See "Action Items," above.)

Benefits to Consumers, Software Developers, Pro Users, and Hardware Manufacturers
If the industry adopts a component solution that accomplishes these goals, we see the following benefits:

Consumers

  • Consistent experience
  • Increased choice
  • Portability of their investment (i.e., when you upgrade your system you can keep your old stuff)
  • No audio processing format wars
  • Enables clustering

Pro users

  • All of the consumer benefits, plus...
  • Reduced downtime
  • Interoperability

Software developers

  • One source
  • Increased market reach
  • Lowered development cost
  • Customers are not forced to abandon your product to get a component that is incompatible with it
  • Lowering the barriers to entry

Hardware manufacturers

  • Can bundle software with speakers, headphones
  • Leverage existing software through simple porting
  • DSP chip or board vendors have a ready market
  • No lengthy industry evangelism
  • Finished hardware goes immediately to market
  • Provides a migration path to host-based processing
  • Interoperability
  • More hardware acceleration options
  • Enables clustering
