Ever since the simplest “beep” sound made by the very first
personal computers, audio alerts have been at the core of every major
PC operating system. Much to the disappointment of computer audio folks,
however, PC users are increasingly turning off these sounds. To audio
people, this is a clear symptom of a deeper issue. Why has computer audio
alerting become so irritating and counter-productive? Why bother with
audio as an attention-getter anyway? Can audio alerting actually enhance
our computing experience? With the continuing emergence of the PC as a
central device both at home and in the office, our group of audio experts
figured that now was a good time to try and answer these questions.
The use of audio on PCs is currently quite primitive. User control over
the audio environment, for example, is very limited – basically
just volume control. There were the old “Sound Schemes” that
have fallen into disuse, mainly because application vendors did not embrace
the paradigm that Microsoft envisaged in the early 1990s. In many
ways we have gone backward in audio use, even as the capabilities of PCs
have dramatically increased.
People tend to turn off audio alerts because the alerts interrupt their workflow.
We assert that the concept of “Flow” is important in that
it is the desired state for most people – when someone is in flow,
they feel very productive. Consider the following discussion on flow:
Over and over again, as people describe how it feels when they
thoroughly enjoy themselves, they mention eight distinct dimensions
of experience. These same aspects are reported by Hindu yogis and Japanese
teenagers who race motorcycles, by American surgeons and basketball
players, by Australian sailors and Navajo shepherds, by champion figure
skaters and by chess masters. These are the characteristic dimensions
of the flow experience:
1. Clear goals: an objective is distinctly defined; immediate feedback:
one knows instantly how well one is doing.
2. The opportunities for acting decisively are relatively high,
and they are matched by one's perceived ability to act. In other words,
personal skills are well suited to given challenges.
3. Action and awareness merge; one-pointedness of mind.
4. Concentration on the task at hand; irrelevant stimuli disappear
from consciousness; worries and concerns are temporarily suspended.
5. A sense of potential control.
6. Loss of self-consciousness, transcendence of ego boundaries,
a sense of growth and of being part of some greater entity.
7. Altered sense of time, which usually seems to pass faster.
8. Experience becomes autotelic: If several of the previous conditions
are present, what one does becomes autotelic, or worth doing for its
own sake.
--MIHALY CSIKSZENTMIHALYI, The Evolving Self, 178-179
We want to enable more people to enter the flow state, possibly even
during the working day, but people are constantly annoyed and distracted
by incoming interruptions. Audio alerts and other sounds coming from their
PCs annoy people so much that many turn the audio off, and
some never turn it back on again.
In 2000, Gloria Mark was hired as a professor at the University
of California at Irvine. Until then, she was working as a researcher,
living a life of comparative peace. She would spend her days in her
lab, enjoying the sense of serene focus that comes from immersing yourself
for hours at a time in a single project. But when her faculty job began,
that all ended. Mark would arrive at her desk in the morning, full of
energy and ready to tackle her to-do list - only to suffer an endless
stream of interruptions. No sooner had she started one task than a colleague
would e-mail her with an urgent request; when she went to work on that,
the phone would ring. At the end of the day, she had been so constantly
distracted that she would have accomplished only a fraction of what
she set out to do. "Madness," she thought. "I'm trying
to do 30 things at once."
--CLIVE THOMPSON, The New York Times, October 16, 2005, “Meet
the Life Hackers”
Our group would like to overturn the bad reputation that currently plagues
PC audio alerting by arguing that sound is in fact the best way to alert
PC users of a wide variety of events requiring their attention. After
some discussion, we decided that the problem isn’t with sound itself,
but rather with the primitive way that sounds have been used.
In Defense of Sound
In order to establish the framework over which an effective and “flow-friendly”
sound alert paradigm can be built, it is important to highlight the many
advantages of sound as an alert mechanism:
- The available sound palette is both enormous and varied.
There is a vast assortment of sound types that can be used to alert a
computer user. Sound qualities can have many dimensions such as volume,
timbre, pitch, consonance/dissonance, and percussiveness. Sounds can
range from being totally artificial (electronically synthesized) to
completely natural (animal sounds, weather, human speech). Just the
category of musical sounds alone illustrates the vastness of the sound
palette. As an added bonus, for most of these sound dimensions, there
is a way of moving smoothly along the dimension. This is obvious for
properties such as pitch and volume, but even artificial sounds can
be smoothly “morphed” into natural ones.
- Sound can be placed in space.
The fact that we can perceive sound in three-dimensional space offers
yet another mechanism for varying and crafting sounds. We can make use
of distance, spatial position, trajectory and speed. For example, humans
tend to have a personal space that can be exploited – by bringing
sounds right into someone’s “comfort zone”, we can
grab the attention of the user. And by the same token, by keeping the
distance of a sound at the periphery of a user’s comfort zone,
we can lessen the impact of the interruption.
- Sound can profoundly affect our attention.
Sound can affect our attention in a wide variety of ways. It can range
from relaxing (flowing water) to jarring (thunder, the growl of a panther,
or the rattle of a snake). In particular, speech is especially important
to humans. For example, many people who are asleep will stop snoring
if their partner simply speaks their name. Speech is so fundamental
to human development that humans have evolved a range of information
extraction mechanisms relating to the spoken word. For example, most
of us can determine the emotional state of someone by the sound of their
voice (technically called “prosody”, which includes all the
information in speech other than the words themselves, for example
the timbre, pitch, and tempo).
- Sound and vision can be processed independently.
Our brains can process sound independently of visual input, because
our sound and vision processing systems have parallel channels to our
consciousness. This means that we can be engaged in an activity such
as typing a report, and without interrupting our flow, be made aware
of some external event via our auditory system. And because of the available
palette of sounds described above, the level of evoked awareness can
be precisely tuned.
- We can process many streams of sound simultaneously.
Human hearing is unique among the senses in that it allows both
low- and high-level attentive monitoring of the environment. That is,
you can be intensely engaged in a conversation with someone while at
the same time being aware of things like weather conditions or musical
sounds. In fact, many of us have seen an ambassador at the United Nations
continue speaking to the assembly while an aide whispers some important
information in his ear.
Using Sound to Alert Users
By considering the many capabilities of sound, it is possible to come
up with a much more compelling sound experience on the PC for the purpose
of getting the attention of users. The following examples illustrate some
more advanced ways of using sound:
- Create beautiful sounds! Get rid of badly quantized 8-bit sounds that
do nothing but irritate the user.
- Take advantage of the sophisticated 3D processing capability of the
human brain. Here are a few examples:
a. Use 3D processing to make the sound corresponding to a popup
dialog box appear as if it were coming from the same location on
the screen as the dialog box. This is especially important as screens
become larger and multiple screens are commonly used, making it very
easy to miss an important visual alert.
b. Use 3D processing to move a sound closer as the importance increases.
For example, an approaching thunderstorm could be used to warn
an engineer of that dreaded meeting with marketing.
- Take advantage of our ability to process multiple channels of sound.
For example, if a user is on a Skype call and an important alert is
required, the user may miss a visual alert because he or she may not
be looking at the screen. Using a whispered voice coming from the side
to provide the alert may be a good alternative.
- Vary sounds over different dimensions in order to move smoothly into
the user’s consciousness, thereby reducing the chance of jarring
the user out of flow. For example, create a meeting soundtrack for each
type of meeting. The computer will start the music gently playing in
the background, and then slowly increase the volume / proximity as the
meeting start time approaches.
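As a rough illustration of examples (a) and (b) above, the sketch below maps a dialog box's horizontal screen position to stereo gains and maps importance to apparent distance. It is only a sketch under stated assumptions: constant-power stereo panning stands in for true 3D processing, and the function names, the 0..1 importance scale, and the near and far distances are hypothetical choices, not part of any existing audio API.

```python
import math

def pan_gains(dialog_x: float, desktop_width: float) -> tuple[float, float]:
    """Constant-power (left, right) gains for a dialog centred at dialog_x pixels."""
    if desktop_width <= 0:
        raise ValueError("desktop_width must be positive")
    # Clamp to the desktop, then map 0..width onto a pan angle of 0..pi/2.
    position = min(max(dialog_x / desktop_width, 0.0), 1.0)
    angle = position * math.pi / 2
    return math.cos(angle), math.sin(angle)

def proximity_gain(importance: float, far_m: float = 4.0, near_m: float = 0.5) -> float:
    """Inverse-distance gain: higher importance moves the source closer.

    importance is a hypothetical 0..1 scale; far_m and near_m are assumed
    distances for the edge and the inside of the user's comfort zone.
    """
    importance = min(max(importance, 0.0), 1.0)
    distance = far_m - importance * (far_m - near_m)
    return near_m / distance  # 1.0 at near_m, near_m / far_m at far_m

if __name__ == "__main__":
    left, right = pan_gains(dialog_x=2800, desktop_width=3840)  # dialog on the right-hand monitor
    volume = proximity_gain(importance=0.8)
    print(left * volume, right * volume)
```

Multiplying the pan gains by the proximity gain gives per-channel levels for the alert sound, so a low-importance alert stays at the periphery while an important one is louder and placed where the dialog appears.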
Going Further: Flow Friendly Computing
In order to take the next step in friendly computing, we propose a new
PC software application that can sense the user’s state of flow
and appropriately modulate the audio that they hear. The software would
wait for pauses in the work flow and use a level of interruption that is appropriate
for the situation and flow. It would also understand privacy: when the
requested privacy level is high, only a very few interruptions should
make it to the user.
For example, consider a system that detects that I am working on a Bar-BQ
PowerPoint presentation and I am deeply in a flow state. However, the
deadline is approaching and I need to go to the meeting room in 5 minutes.
The system has held all my calls and stopped instant messages from reaching
me. With 3 minutes to go, it starts gently playing the 1812 Overture,
which rises gradually in volume until a crescendo of cannon fire informs
me that I need to leave.
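The gradual rise in this scenario could be driven by a simple gain curve. The sketch below is one possible shape, assuming a three-minute ramp as in the example; the function name, the quiet starting level, and the squared curve are illustrative assumptions rather than a tested design.

```python
def alert_gain(seconds_to_deadline: float, ramp_seconds: float = 180.0,
               floor: float = 0.05) -> float:
    """Playback gain in 0..1 for a reminder soundtrack.

    Silent until the ramp begins, 'floor' when the ramp starts, and full
    volume at the deadline. All defaults are assumed values.
    """
    if seconds_to_deadline > ramp_seconds:
        return 0.0
    if seconds_to_deadline <= 0:
        return 1.0
    progress = 1.0 - seconds_to_deadline / ramp_seconds
    # Squaring keeps the music quiet for most of the ramp and saves the
    # crescendo for the final moments.
    return floor + (1.0 - floor) * progress ** 2

if __name__ == "__main__":
    for remaining in (240, 180, 120, 60, 10, 0):
        print(remaining, round(alert_gain(remaining), 3))
```

A curve that starts flat and steepens late is meant to let the music enter awareness without pulling the user out of flow until it really matters.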
When designing new technology products, interaction researcher Bill
Buxton advocates leveraging the existing skills that people have because
there is only a finite number of skills that each of us can have at
any one time (skills take time to learn and maintain, and our available
time is finite). For example, I shouldn’t need to learn how to forward
my phone to voicemail – the act of closing my door indicates that
I do not wish to be disturbed.
In terms of audio, human beings already have lots of built-in ‘skills’
that allow us to be aware of events and information from the world around
us, just from their sound. We would like the audio presented to the user
to be empathetic to them. That is, we would like the user’s state
of mind to be taken into consideration as the system decides what audio
to play to them. How can we deduce the user’s state of mind? We
are not talking about artificial intelligence (right now), nor some kind
of ‘digital psychic’ but initially a set of simple heuristic
rules that can be applied to a richer set of information about the user
and their history and patterns of behavior.
What information do humans generate that can be used to derive their
state of mind? We are proposing a general-purpose flow detector that recognizes
whether a user is in “flow” and, if so, avoids disturbing them
unless necessary. And if it is necessary, then we should alert them with
audio, and we should do this in a polite way. The following tables provide
some ideas about how we might detect the state of a user:
- [...] and movement (shifting in your seat, or not): mics in furniture,
sonar or radar, personal sensors, fabric tension [...]
- Gaze, head orientation: cameras, headphones with head tracking
- [...] logger: vigorous typing may indicate flow (may have to be
application [...])
- [...] button presses: [...] drivers, other OS [...]
- [...] rate, respiration (CO2), skin resistance
- Equipment state (phone, [...]): [...] switches, phone system status
messages, pressure sensors in chairs
In a sense, we are proposing a kind of ‘executive producer’
for your PC audio. It mutes and un-mutes applications, changes voices
to whispers, senses your state and generally tries to keep you in flow
unless you need to be moved on to another task or meeting.
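The kind of simple heuristic rules described above might look something like the following sketch. Everything in it is an assumption made for illustration: the field names, the typing-rate and pause thresholds, and the three delivery levels are hypothetical, and a real system would need to tune or learn them per user and per application.

```python
from dataclasses import dataclass
from enum import Enum

class Delivery(Enum):
    HOLD = "hold"        # queue the alert until a pause in the work
    WHISPER = "whisper"  # quiet cue at the periphery of attention
    NORMAL = "normal"    # ordinary alert

@dataclass
class UserState:
    keys_per_minute: float      # recent typing rate
    seconds_since_input: float  # time since the last key or mouse event
    on_call: bool               # e.g. the user is on a voice call
    door_closed: bool           # "do not disturb" signalled by an existing habit

def in_flow(state: UserState, typing_threshold: float = 120.0,
            pause_seconds: float = 20.0) -> bool:
    """Heuristic only: sustained vigorous typing with no recent pause."""
    return (state.keys_per_minute >= typing_threshold
            and state.seconds_since_input < pause_seconds)

def choose_delivery(state: UserState, urgent: bool) -> Delivery:
    """Decide how an alert should reach the user, if at all."""
    busy = in_flow(state) or state.door_closed
    if urgent:
        # Necessary interruptions still get through, but politely.
        return Delivery.WHISPER if (busy or state.on_call) else Delivery.NORMAL
    if busy or state.on_call:
        return Delivery.HOLD
    return Delivery.NORMAL

if __name__ == "__main__":
    typing_hard = UserState(keys_per_minute=200, seconds_since_input=1,
                            on_call=False, door_closed=False)
    print(choose_delivery(typing_hard, urgent=False))  # Delivery.HOLD
    print(choose_delivery(typing_hard, urgent=True))   # Delivery.WHISPER
```

Keeping the rules this explicit makes them easy to inspect and adjust, which fits the stated goal of starting with heuristics rather than artificial intelligence.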
What else is being done in this field?
An excellent starting point in the field is W.
Wayt Gibbs, “Considerate Computing,” Scientific American, Jan
2005. This article mentions several key research groups.
The Web site http://interruptions.net/
also has a large bibliography of papers and a list of researchers in the
field of interruptions relating to human computer interactions.
While much progress has been made in this nascent field in recent times,
there seems to be a good opportunity to make the research and commercial
community aware of the potential benefits of using the power of audio
to present information in ways that other mediums cannot.
What existing Windows/MacOS infrastructure can be leveraged?
To build a ‘considerate audio’ system in which the user’s
state could be modeled as described here, existing components such as
the following can be leveraged:
- Standardized mouse and keyboard interfaces. There is no standard
OS-wide “keyboard activity” API, and most presence-sensing
systems rely on tracking or logging mouse movements and keystrokes to
sense whether there is a human being sitting at the PC.
- Standardized camera interfaces – USB video class devices and
Windows and MacOS camera APIs allow video data to be accessed easily.
- Nascent ‘standards’ for sensor networks and security system
integration with PCs. Microsoft and companies like Zensys (Z-Wave),
Intellon and others are working on ways to bring sensor data into PCs
for security, energy management and home automation. These could be
leveraged for a considerate audio system.
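Because there is no standard OS-wide activity API, a considerate audio system would have to keep its own record of input events, as the first item above suggests. The sketch below assumes some platform-specific hook (not shown, and not a real API here) supplies a timestamp for each key or mouse event; the class name and the 60-second window are hypothetical.

```python
from collections import deque
from typing import Optional

class ActivityWindow:
    """Sliding-window record of input-event timestamps (in seconds)."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events: deque = deque()
        self._last: Optional[float] = None

    def record(self, timestamp: float) -> None:
        """Call once per keystroke or mouse event, in time order."""
        self._events.append(timestamp)
        self._last = timestamp

    def rate_per_minute(self, now: float) -> float:
        """Events per minute over the most recent window."""
        cutoff = now - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) * 60.0 / self.window_seconds

    def seconds_idle(self, now: float) -> Optional[float]:
        """Time since the last event, or None if nothing was ever recorded."""
        return None if self._last is None else now - self._last

if __name__ == "__main__":
    window = ActivityWindow()
    for t in range(30):            # one event per second for 30 seconds
        window.record(float(t))
    print(window.rate_per_minute(now=30.0), window.seconds_idle(now=30.0))
```

The event rate and idle time produced this way are the sort of raw signals that presence sensing or flow detection could build on.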
There does not appear to be any suitable presence-sensing infrastructure
available in the common operating systems themselves. However, several
applications such as Skype and Windows Messenger do attempt to sense the
‘presence’ of the user at the PC. These applications can be
interrogated through standard, documented APIs.
Again, the operating systems themselves do not offer anything at such
a high level, but the above infrastructure and sensor hardware now becoming
available can be integrated to derive levels of busyness – although
research and effort is still required to robustly recognize the user’s