Humans communicate with each other not just through pure speech but through speech augmented by a variety of cues, including audio cues, facial expressions, posture, and gestures. Do any of these cues also add value in human-to-machine communication? Could they serve as another form of contextual awareness, making ASR more accurate and machine responses more meaningful?
Some suggest these cues are too individual in nature to augment human-to-machine communication. Others point out that experts can identify and interpret these audio and visual cues in anyone they observe, which suggests computer intelligence could likewise be programmed to interpret them.
Let’s identify specific applications where a multifactor interface (speech plus other audio and visual cues) would add value in a world where your voice becomes the primary interface to consumer products.