The Eighteenth Annual Interactive Audio Conference
PROJECT BAR-B-Q 2013
BBQ Group Report: Ubiquitous Networked Audio
   
Participants (a.k.a. "The Kissingers"):

Ethan Schwartz, Conexant
Tomer Elbaz, Waves
Jack Joseph Puig, Waves
Sergio Liberman, Freescale
Chris Grigg, MMA
Desheet Mehta, Beats By Dre
Doug Gabel, Intel
Peter Frith, Linn
Alex Kovacs, Cirrus
Med Dyer, Harman
Facilitator: Doug Peeler, Dell
 

The Problem

Currently there are multiple, incompatible methods for network connecting audio devices.
 
As a result, products from different manufacturers cannot be used together, negatively impacting the user experience and so reducing consumer uptake of connected audio devices.

Below we identify shortcomings in each of the existing systems. Due to these shortcomings, none of them can be recommended as a preferred standard going forward.

We recommend that a new standard be developed (or that one or more of the existing standards be significantly revised) to incorporate the additional functionality detailed below, enabling the nascent rich marketplace that awaits interoperable, connected home audio input and output devices.

Solution Summary

We recommend that a standardized solution be created to enable all networked audio devices to interoperate.  Our recommendation requires support for both audio input and output, reflecting our belief that applications such as Skype-style communication are becoming as important as playback-only applications in future homes.  To accomplish this we propose to:
A - Engage with existing standards bodies, e.g. uPnP & DLNA, to encourage enhancements
B - Engage with device manufacturers to encourage compatibility
C - Propose ways to fix existing/legacy devices (FW upgrades, SW translation, the "Kissinger Box" - Kissinger being a 'diplomat who encourages negotiation over conflict'!)

Expanded Problem Statement

At this point in time (BBQ2013), connectivity of audio devices, be they portable or static, is beginning to shift towards wireless or networked connections, the RCA jack and 3.5mm headphone connector finally, slowly, being consigned to the junk drawer.

However, the (perhaps inevitable) development of multiple connection methods and standards has led to an environment where the leading commercially available connectivity solutions will not interoperate, and none of the commonly available systems satisfies all application needs.

This has led to the unsatisfactory situation where the user might have, for example, up to four sets of compromised loudspeakers in his living room: one each for the TV, hi-fi, dock and speakerphone, instead of one set of networked high-quality speakers that any of these four 'applications' can utilise.

Furthermore, these audio systems only consider playback. We are today at the point where users are beginning to consider IP telephony, and the associated full-audio-bandwidth communication, the norm; as such, future networked audio resources must support audio capture as well as playback.

Expanded Solution Discussion

Nirvana
In an ideal future networked audio environment the user will be surrounded by audio rendering resources (playback and capture) that any of his control devices can connect to. This allows any data, stored or live, to be streamed to and from any renderer. Any new device, from any manufacturer, running any OS, brought into the home will automatically connect to and make use of this ubiquitous networked audio environment.

In Practice Today
More users are listening to more music of higher quality, in more locations and on more devices, than ever before. It seems every other pedestrian or jogger is listening through white earbuds, but when the user gets home and wants to remove the earbuds and listen on speakers, his choice of connection is determined and limited by the manufacturer of his mobile device:

Apple devices will 'AirPlay' only to Apple-compliant renderers, and then only at a 16-bit, 44.1 kS/s sample rate, so the user cannot experience studio-quality 24/96 playback.
If the user wishes to communicate in high quality via FaceTime, he must do so via the small speakers in his portable device and cannot utilise the capability of his AirPlay playback devices.

Android device users in the same household must have their own solutions - they cannot use the AirPlay devices of their friends or family.  Some Android devices are beginning to support higher-resolution audio, but when the user wants to watch a YouTube video on his device he finds he is unable to stream the audio to his networked high-quality speakers whilst having the video play on his device or stream to his monitor/TV display.
Users of proprietary networked audio solutions have the same problems, and in addition must install proprietary control application software on each of their different devices.

And none of the solutions has considered the emerging desire to capture audio in the home to support IP telephony - who wants to hold a handset to their ear whilst conducting a family telepresence call to Grandma in Australia using the big-screen TV display and the high-quality networked audio playback speakers?

Standards
Ironically, the underlying hardware used to implement the rudimentary streaming solutions on sale today seems fundamentally capable of satisfying the need for ubiquitous networked audio (albeit without capture capability today).

What is missing are standards to make all the hardware solutions talk nicely to each other. The closest thing we have to  a standard appears to be the uPnP discovery solution combined with the DLNA (Digital Living Network Alliance) set of recommendations.

uPnP defines the three primary components of today's audio distribution devices and DLNA describes how they should interact. The uPnP specification describes an audio system comprising connected Control, Content, and Render (playback) devices. This system can conveniently be expressed as a 'triangle' of connections with the three functions at its corners:

[Diagram: a triangle with Render/Capture, Control and Content at its corners]

Control
What is a Control device?  Any device that can set up communication channel(s) between Content and Render/Capture devices and either runs an application or controls an application on another device.  Any Control device can access any Render/Capture device and control it (with permission).  It is the agent in charge of discovery and configuration.

Examples
- Smartphone
- Laptop/PC/Tablet
- Future Undefined Device (e.g. smart universal device, Wii-U… anything that runs the applications)

Content
What is content?  It is a collection of stored or streamed audio data located either locally or remotely.

Examples
- Data in the cloud, either from live streaming source or content repository
- NAS, network storage or other devices on the network
- Local repository, on the local device
- Real time audio content (VoIP/communication/control/archiving) applications

Render/Capture
What is a render or capture device?  These devices are user-local, the interface to the analog world. 

A capture device is any source of audio, primarily a microphone and its associated digitizer.  Data from a capture device becomes Content, available to multiple applications potentially at the same time. 

A render device is any means of playing Content audio to the end user, primarily a DAC, amplifier and speaker(s).  There may be one to many render devices on a particular network.  The Control may logically group the render devices by function or proximity.

Examples
- Home Theater receiver (render)
- Smartphone (render/capture)
- Networked Speaker (render)
- Airport Express (render)
- IP Camera microphone (capture)
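
To make the division of responsibilities concrete, here is a minimal Python sketch of the triangle. Every class and method name is our own illustration, not part of any actual uPnP or DLNA API: a Control device wires a Content source to a chosen Render device and drives the stream.

    # Minimal sketch of the uPnP-style "triangle": Control wires Content to
    # a Renderer. All names are illustrative, not real uPnP/DLNA APIs.

    class Content:
        """A stored or streamed audio source (cloud, NAS, local file, live mic)."""
        def __init__(self, uri):
            self.uri = uri

        def chunks(self):
            # Stand-in for reading real audio data from self.uri.
            for n in range(3):
                yield f"chunk-{n} of {self.uri}"

    class Renderer:
        """A user-local playback device: DAC, amplifier, speaker(s)."""
        def __init__(self, name):
            self.name = name

        def play(self, chunk):
            # Stand-in for decoding and playing real audio.
            print(f"[{self.name}] playing {chunk}")

    class Control:
        """Discovers devices and sets up Content -> Renderer channels."""
        def __init__(self, renderers):
            self.renderers = renderers  # in reality, discovered via uPnP

        def stream(self, content, renderer_name):
            renderer = next(r for r in self.renderers if r.name == renderer_name)
            for chunk in content.chunks():
                renderer.play(chunk)

    control = Control([Renderer("living-room"), Renderer("kitchen")])
    control.stream(Content("nas://music/track1.flac"), "living-room")

In a real system the Control would discover the renderers over the network rather than being handed a list, and the chunks would be actual audio data.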

The Current Network State
Currently multiple system ‘triangles’ or playback systems exist in a given room/home.

Examples of "self-contained" and currently non-interoperable triangles include the Apple/AirPlay, Android, and proprietary networked audio ecosystems described above.

Clearly there are redundant Render/Capture devices in the same room.  Render/Capture devices from one system triangle don't currently communicate with other triangles. It may make more sense to play content back on another system triangle, but currently there is no easy method to route the stream. The end result is multiple duplicated devices, each of compromised quality.
When the system triangles are able to communicate with each other, all the resources become available to all the controllers, provided they are all connected to the same network. Duplication is avoided and the user is more inclined to install better-quality devices, knowing they will be utilised across multiple applications.

Recommendations for Solutions
Project BBQ working groups are not standards bodies. We recognize that DLNA is a generally inclusive body representing the majority of manufacturers of multimedia network equipment, and as such it ought to be open to hearing the concerns of industry parties as represented by BBQ attendees.

As such we would like to influence the solution specification owners, i.e. DLNA or hardware manufacturers. In the case of DLNA, we would like to motivate the creation of a DLNA spec 2.0 with extensions to address the problems described above.

Recommendations for Next Generation Networked Audio

1.0  Support audio capture

a.     The majority of today's CE devices can capture audio. These devices are on the network, but have no mechanism to stream the captured content to rendering devices.

b.     In the near future we anticipate the emergence of applications such as telepresence, voice control and VoIP telephony, which require multiple audio capture devices. These might be built into existing devices or added as standalone network devices.

c.     We anticipate microphones all around us will stream onto a network and be available simultaneously to multiple applications, as sketched below.
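
A minimal sketch of the fan-out anticipated in point (c), assuming a simple subscriber list (all names invented for illustration): one microphone frame becomes Content for every subscribed application at once.

    # Sketch: one capture device fanned out to several applications at once
    # (telepresence, voice control, archiving). Illustrative names only.

    class CaptureDevice:
        def __init__(self, name):
            self.name = name
            self.subscribers = []

        def subscribe(self, app):
            self.subscribers.append(app)

        def on_audio_frame(self, frame):
            # Each microphone frame becomes Content for every subscriber.
            for app in self.subscribers:
                app(self.name, frame)

    mic = CaptureDevice("kitchen-mic")
    mic.subscribe(lambda src, f: print(f"voip       <- {src}: {f}"))
    mic.subscribe(lambda src, f: print(f"voice-ctrl <- {src}: {f}"))
    mic.on_audio_frame("frame-0")
    mic.on_audio_frame("frame-1")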

2.0  Support for both ‘push’ and ‘pull’ streaming
           
a.     Pull - Pull is already implemented in DLNA today. In the pull case the rendering device grabs chunks of data from the content device.

b.     Push - In the push case the control device sends the content; push is used today for live streaming scenarios. The new requirements must support a push model as well as pull. For the push scenario, consider a TV acting as a video rendering device that needs to stream the program audio, lip-synched, to the speaker renderer: specifications need creating that minimise latency yet ensure quality of service, i.e. no drop-outs in the audio playback. Both models are sketched below.
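
A minimal sketch contrasting the two models (illustrative names only; no real network transfer is performed). In pull, the renderer drives the transfer at its own pace; in push, the sender drives it and the renderer buffers just enough to absorb jitter while keeping latency low enough for lip sync.

    # Sketch of pull vs. push streaming at a renderer.

    import queue

    class Renderer:
        def __init__(self):
            self.buffer = queue.Queue()

        # PULL: the renderer fetches chunks from the content device
        # at its own pace (the DLNA model today).
        def pull_from(self, content_chunks):
            for chunk in content_chunks:
                self.render(chunk)

        # PUSH: the sender delivers chunks as they are produced; the
        # renderer buffers briefly to absorb network jitter.
        def push(self, chunk):
            self.buffer.put(chunk)

        def render(self, chunk):
            print("rendering", chunk)

    r = Renderer()
    r.pull_from(["file-chunk-0", "file-chunk-1"])  # pull: stored content
    r.push("live-tv-audio-0")                      # push: live TV audio
    r.render(r.buffer.get())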

3.0  Support for ‘split destination’ streaming

a.     Currently multimedia streams can only be sent from a source to a single renderer, for example a TV. In future it is desirable to be able to send the video component to a display screen whilst simultaneously sending the audio part of the program to a separate audio rendering device (speaker).
b.    For example, the user wants to watch the video component of a YouTube video on his laptop screen whilst having the audio played back, in sync, from his networked loudspeakers (see the sketch below). This is precluded in current systems; instead, the audio MUST play on the laptop speakers.
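
As a rough sketch of how split-destination streaming might work (all names and fields invented for illustration): the source tags video and audio frames with a common presentation timestamp, sends video to the display and audio to the network speaker, and the shared timestamps give the speaker what it needs to stay lip-synched.

    # Sketch: split one A/V stream between a display and a network speaker.

    def split_stream(av_frames, display, speaker):
        for pts, video, audio in av_frames:
            display.append((pts, video))  # video rendered locally
            speaker.append((pts, audio))  # audio sent to the network renderer

    frames = [(0.00, "v0", "a0"), (0.04, "v1", "a1")]  # 25 fps example
    display, speaker = [], []
    split_stream(frames, display, speaker)
    print("display got:", display)
    print("speaker got:", speaker)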

4.0  Local tracklist storage

a.     The current solutions require the content device to provide a single track, which gets rendered. When the track finishes, the controller must instruct playback of the next track; if the controller has gone to sleep (flat batteries, for example) playback stops.
b.     A future DLNA requirement would be for the rendering device to store the playlist (sketched below).
c.     A future DLNA requirement would be to allow other devices to view the current playlist stored on the rendering device.
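
A minimal sketch of a renderer that owns the playlist (illustrative names only): the controller hands over the whole list once, the renderer advances on its own when a track finishes, and other devices can read the stored playlist back, covering both (b) and (c).

    # Sketch: playlist stored on the renderer, not the controller.

    class Renderer:
        def __init__(self):
            self.playlist = []
            self.position = 0

        def set_playlist(self, tracks):
            # The controller hands over the whole list once.
            self.playlist = list(tracks)
            self.position = 0

        def get_playlist(self):
            # Other devices may view the stored playlist.
            return list(self.playlist)

        def on_track_finished(self):
            # Advance without any help from the controller.
            self.position += 1
            if self.position < len(self.playlist):
                print("now playing", self.playlist[self.position])

    r = Renderer()
    r.set_playlist(["track-1", "track-2", "track-3"])
    r.on_track_finished()  # controller may be asleep; playback continues
    print(r.get_playlist())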

5.0  Volume control support

a. A future uPnP/DLNA spec needs to track and allow networked volume control, as sketched below.
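
A rough sketch of what tracked, networked volume could mean (invented names, not actual uPnP RenderingControl actions): the renderer owns the volume value and notifies every subscribed controller of changes, so all control UIs stay consistent.

    # Sketch: renderer-owned volume with change notification.

    class Renderer:
        def __init__(self):
            self.volume = 50     # 0..100
            self.listeners = []  # controller callbacks

        def set_volume(self, value):
            self.volume = max(0, min(100, value))
            for notify in self.listeners:  # eventing keeps every UI in sync
                notify(self.volume)

    r = Renderer()
    r.listeners.append(lambda v: print("phone UI shows volume", v))
    r.listeners.append(lambda v: print("tablet UI shows volume", v))
    r.set_volume(72)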

6.0  Multi channel/room synchronized playback

a.    A future DLNA spec should support synchronized playback when multiple rendering devices are available; one possible approach is sketched below.
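
One possible approach, as a hedged sketch (names invented): rather than telling each renderer "play now", the controller schedules a common start time against a shared clock. A real implementation would also need a clock-synchronization protocol and sample-accurate buffering.

    # Sketch: synchronized start across multiple renderers.

    import time

    class Renderer:
        def __init__(self, name):
            self.name = name

        def play_at(self, track, start_time):
            # A real device would wait on a network-synchronized clock.
            delay = max(0.0, start_time - time.monotonic())
            print(f"{self.name}: starts {track} in {delay:.3f}s")

    renderers = [Renderer("kitchen"), Renderer("living-room")]
    start = time.monotonic() + 0.5  # common start time, 500 ms from now
    for r in renderers:
        r.play_at("track-1", start)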

7.0  Non audio device support (or at least tolerance!)

a.    It is anticipated that networked non-audio devices such as security and lighting controls, and many more, will connect to home networks in future. The new specification must support (or tolerate) the discovery mechanisms of these non-audio devices, and potentially support their streaming specifications, as sketched below.
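
A minimal sketch of tolerant discovery (device records and type strings are invented for illustration): unknown device types are recorded but never break audio enumeration.

    # Sketch: discovery that tolerates non-audio devices on the network.

    AUDIO_TYPES = {"renderer", "capture"}

    def handle_discovery(announcements):
        audio, other = [], []
        for dev in announcements:
            (audio if dev["type"] in AUDIO_TYPES else other).append(dev)
        return audio, other

    found = [
        {"name": "living-room-speaker", "type": "renderer"},
        {"name": "porch-light", "type": "lighting"},  # tolerated, not used
        {"name": "kitchen-mic", "type": "capture"},
    ]
    audio, other = handle_discovery(found)
    print("audio devices:", [d["name"] for d in audio])
    print("tolerated:", [d["name"] for d in other])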


Appendix: General notes captured during brainstorming activity
 
Content
- Productization opportunity
- capture will create content

Control
- Location in the environment & relative to other devices
- Where is the audio processing done (Application)
- User friendliness?  User experience
- Productization opportunity
- Smart house -- voice control, multiple mics & speakers multiple rooms
- Creation based on endpoint understanding -- Endpoint capability to be known, can the content be tailored for various known end point types
- simultaneous access
- OS support  -  system should be OS agnostic
- what algorithms will be needed? different types of processing needed, technology & innovation opportunities
- how to determine orientation of user in a room with multiple speakers
- prioritization schemes -- multiple users in a single location
- synchronization with other audio devices and devices of other types (other systems)
- feedback of devices (thermal, power protection, etc)

Capture/Render
- Location in the environment & relative to other devices
- Productization opportunity
- Smart house -- voice control, multiple mics & speakers multiple rooms
- inputs & outputs
- microcontroller is an end point

Target Experience
- Group experiences (converting people back to having a set of speakers, communal experience … jambox++)
- Intra house communication (old house intercoms), internet of things
- Smart house -- voice control, multiple mics & speakers multiple rooms
- voice assistant -- what's the weather, how is traffic
- voice communication -- voip, speech recognition
- Must be “just works” and “where do you want to send it” … cannot have more than a couple of clicks.  Out of Box Experience (OOBE) -- Discovery & Configuration
- User experience -- Content in hand or cloud, want to send it to single device, group of devices?
- Pointing device at target, or say location
- replacing legacy technology like speakerphone / conference phones
- use voip type controls to ensure audio is being delivered (summarized: make sure the device is actually working)
- how to determine orientation of user in a room with multiple speakers

Business Opportunity
- Consider current features and prices for bluetooth speaker devices -- There is a desire to have the abilities, people will want ability to use more than one
- multiple devices possible? (Nest, Belkin WeMo -- can these all live together?)
- biz opportunities / challenges
- standardization?
- how to reduce cost, while keeping connection reliable & available (comm. over AC?)
- hardware is here; the key will be software
- what are the necessary parameters to communicate/control
- how to justify these products--why it would be profitable & desired by customer
