| Define the future opportunity spaces for seamless (environment-to-environment), human-like, cognitive, affordable voice interfaces in the home.
 
        - Perfect AI in terms of natural language processing and context awareness
        - Home, Wearable (as a subset of home), Business, Industrial, Automotive
        - “Seamless” from environment to environment really is an Iron-Man Jarvis experience: walking from his car, to his house, to his suit, the context follows across multiple devices and locations.
        - Walled Gardens: seamless within vs. seamless across AI providers. Walled gardens are the cloud voice service providers.
        - Baseline: What is/are the interface(s)?
          - Can be both ways.
          - Subscribe to services
          - Pull data from your cloud account
        - In quiet, short sentences (commands etc.) such as “What is the weather in Tokyo” are interpreted with high accuracy by the device (phone, Echo etc.)
        - Echo Cancellation / Barge-In at reasonable levels
        - Consumer Grade Biometrics
        - Multi-Factor Authentication: Face, Fingerprint etc.
        - High Level: How do we voice enable everyday life?
        - Tradeoffs: security vs. convenience
          - User definable.
The group agreed that one of the primary and most complex markets (in terms of products, use cases, service providers etc.) is the home. Other markets to be addressed, each with its own complex use cases, would be automotive, business and industrial applications. Many of these other systems would be a subset of the smart home use case.

We split smart homes into the needs for interoperability, general future needs and future technology needs, and then into four use cases: Commerce, Search, Control and Communication.

However, to enable such technologies, there are common needs and common technologies required.
        - In today’s systems, there are no multi-user experiences. There is no way for Mom to buy something from her account and Dad to buy something from his. The future needs to resolve this with Speaker ID technology.
        - No “Dad” control to stop the kids using equipment.
        - Can’t use Echo to make a phone call.
          - If you ask Alexa to make a call, then audio is routed to the phone, not through Alexa. This raises a fundamental issue.
        - No intercom-type functionality.
        - Lack of peer-to-peer audio streaming. “Fog” instead of “Cloud”. How can each system share and stream audio between each other? Can one set of microphones work alongside others?
        - Cellphone calls to the ASRA and noise suppression while on a phone call: issuing a voice command within a phone call, e.g. “We should call Grandma while you're on the line. Call Grandma!”
          - How does it sound to other folks on the phone call if you ‘deny’ a call?
        - We can’t whisper voice commands today. Other people in the room hear you yelling commands. You look like a mad man. The technical term is Sub Ambient Voice Commands.
        - Biometrics for identification
          - Multi-user capabilities
          - Presence tokens (i.e., wearables, Bluetooth beacons that detect presence of users by seeing their smartphones etc.)
        - Perfect AI
        - Speech Recognition
        - Computer Vision
        - Peer-to-peer backend. This is needed to allow local streaming of mic and audio data.
        - “Fog” hub for local computing.
          - Most devices assume they are alone in the house. Future collections of systems will need enumeration and registration within the fog for each device in the home.
          - Fog could connect to multiple WGs (walled gardens) depending on the service required. E.g. search is done by Google, but purchases are done by Amazon etc.
          - Fog-Host would likely have to be at the router level.
          - Fog on the playback (audio streaming) side is not a concern.
          - Fog on the microphone recording side, together with the voice-to-text service, makes things much harder.
        - Each piece of equipment needs a two-way street from the appliance to the voice agent. Both ends need to make an API to talk to the equipment. However, a local hub could be used to talk to some simple switches etc. (think of a Wi-Fi to Z-Wave bridge).
        - Smart/soft/networked control elements (i.e., appliance switches)
        - Cloud services (i.e., Google, Amazon, Apple, etc.)
        - There should be a way to disable all or part of conversations
        - Interoperability
          - Each device has its own name
          - Households desire a “single” AI client
          - API for cross-platform interoperability
        - Classes of Devices
          - White Goods Appliances
          - HVAC
          - Safety: Smoke Alarms etc.
          - PDA: Personal Digital Assistant
          - Personal Control: control of entertainment system etc. (local control)
          - Wireless Cameras with Audio
        - Voice Capture “layer” as control hub
        - Is there a commercial purpose/value for walled garden cooperation?
        - Services have to develop across WGs (Walled Gardens)

In the area of Commerce, we see five areas of opportunity.
        - Authentication
        - Advertisement
        - Explicit Purchasing
        - Implicit Purchasing
        - Fulfilment

AUTHENTICATION

There are three steps to securing a voice system: Identification, Authentication, and Authorization.

        - Identification: We recognize your voice and have an idea of your identity.
        - Authentication: We have validated your identity and have ensured we know who you are.
        - Authorization: We grant you access based on known settings.
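The three steps can be sketched as a small pipeline. This is only an illustrative sketch, not a proposed design: the enrolled voice prints, the match score and its threshold, the PIN as second factor, and the per-user spend limit are all assumptions made up for the example.

```python
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserProfile:
    name: str
    pin: str            # hypothetical explicit second factor
    spend_limit: float  # hypothetical per-user authorization rule


# Hypothetical enrolled voice prints: voice print id -> profile.
ENROLLED = {
    "vp-mom": UserProfile("Mom", pin="1234", spend_limit=200.0),
    "vp-kid": UserProfile("Kid", pin="0000", spend_limit=0.0),
}


def identify(voice_print_id: str, score: float, threshold: float = 0.8) -> Optional[UserProfile]:
    """Step 1: we have an idea of who is speaking."""
    if score < threshold:
        return None
    return ENROLLED.get(voice_print_id)


def authenticate(user: UserProfile, supplied_pin: str) -> bool:
    """Step 2: validate the claimed identity with a second factor."""
    return hmac.compare_digest(user.pin, supplied_pin)


def authorize(user: UserProfile, amount: float) -> bool:
    """Step 3: apply the rules set for this particular user."""
    return amount <= user.spend_limit


def request_purchase(voice_print_id: str, score: float, supplied_pin: str, amount: float) -> str:
    user = identify(voice_print_id, score)
    if user is None:
        return "unidentified"
    if not authenticate(user, supplied_pin):
        return "not authenticated"
    if not authorize(user, amount):
        return "not authorized"
    return "approved"
```

Note that the person, not the device, is what gets authorized here: the same device gives different answers to Mom and to the kid.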
Currently, devices in the market are authorized by a user, rather than a user being authorized to make transactions. To enable financial transactions and experience customization, future solutions will need to solve the problem of identification and authentication of the end user.

How do you authorize the person rather than the device?

As a first step, the end user wakes up the device via voice command and makes a request. Identifying the user will need to rely on solutions that have a high level of accuracy in both wakeup and identifying the voice print of the end user. This can be accomplished by a variety of methods, such as user-trained keyword entry or other techniques. This is akin to inserting the bank card into an ATM.

Authentication is the natural second step, to guarantee that the identity of the user is certain. This is akin to entering the PIN. There are many techniques, both explicit and implicit, commonly referred to as Multi-Factor Authentication. Implicit techniques rely on contextual data and common behaviour to generate an authentication challenge that can take the form of a query addressed to the end user. Explicit techniques are pre-agreed tokens such as passwords, fingerprints, etc.

Other dimensions of authentication to keep under consideration are:
        - Multi-User Authentication
          - It is highly desirable that devices in homes with multiple inhabitants are capable of distinguishing between each of them seamlessly.
        - Continuous Authentication
          - If a device could understand us and authenticate us just based on the sound of our voice, this could add to a more natural user experience.
        - Wearable authentication token
        - In scenarios where multiple devices from varying manufacturers and ecosystems exist within the same space, a unification of the authentication schemes of multiple device endpoints may be required.

Authorization is primarily the rules and boundaries set for each user when making a purchase. These rules shall be unique and custom to each individual user.

ADVERTISEMENT

Advertisement and Market Research is one of the key areas of opportunity. However, eavesdropping technology presents an incentive model for decreased privacy that could be disadvantageous to the consumer if it is abused. The key to preserving trust with consumers is to give them transparency and control over the interactions that were captured and stored.
 
        - Always Listening for Context-Based Purchase Intent: A user could opt in for an always-listening experience that monitors all their conversations for purchase intent, and then re-targets them for advertising across platforms.
          - Assume that a user should always have transparency and control over content that was captured.
        - If all conversations were captured, then this would essentially be an auto-transcriptionist, and allow for searching “Who said what when.” How would this affect human interactions?

EXPLICIT PURCHASING

Explicit purchasing is where a user directly specifies the exact thing that they want, and the device enables the transaction. Within this domain, there are several opportunity spaces:
 
        - Buy Apple vs. Buy an Apple: How do we establish certainty?
        - Stock Market Agent or Grocery Store Agent. Identify an agent to create context for purchasing. Include a default context, e.g. assume grocery unless I say I’m talking about stocks.
        - Detect stress, and make experiential decisions based on that.

IMPLICIT PURCHASING

Implicit purchasing is where the device notices something that wasn’t directed to it, yet it proactively takes action to be helpful.
 
        - Overheard you saying “We need Milk”, so I added it to a shopping list
        - Overheard you saying “We need Milk”, so I purchased it for you
        - Overheard you saying “We need Milk”, so I said “Should I purchase milk for you?”
        - Observed purchase pattern and history, so I added it to a shopping list
        - Observed purchase pattern and history, so I said “Should I purchase milk for you?”

FULFILMENT

Fulfilment is an area where a user should be able to get information on purchases that they’ve made.
 
        - Where’s my stuff? Has it been shipped?
        - What’s the ETA?
        - What was delivered?

SYSTEM CONSIDERATIONS

To ensure privacy and security, sensitive material needed to complete authentication shall be stored in a secure environment. The voice print and authentication criteria can be stored in secure encrypted cores or, more likely, in the cloud. Another consideration when characterizing a user is the dynamics under which their voice print may change. It is preferred that the final solution include features that allow the user’s authentication criteria to be updated over time.

Additionally, it is recommended that alternative authentication methods be enabled for cases where voice input is not feasible, such as illness or a noisy environment.

Many of the techniques described above have been used and are established. Forums such as the FIDO Alliance (a general security standards body) have been specifying the requirements necessary to use voice for financial transactions. As an example, USAA uses voice authentication on their mobile app.
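As a hedged illustration of the explicit and implicit purchasing ideas in this section, the sketch below picks a purchasing agent with a default context ("assume grocery unless I say I'm talking about stocks") and maps an overheard need to one of the three implicit behaviours listed earlier. The cue words, the `Policy` names and the substring matching are assumptions chosen for brevity; a real system would rely on proper language understanding.

```python
from enum import Enum


class Agent(Enum):
    GROCERY = "grocery"
    STOCKS = "stocks"


# Hypothetical cue words that switch the context away from the default.
STOCK_CUES = ("stock", "shares")


def pick_agent(utterance: str, default: Agent = Agent.GROCERY) -> Agent:
    """Use the default context unless the user signals another one."""
    text = utterance.lower()
    if any(cue in text for cue in STOCK_CUES):
        return Agent.STOCKS
    return default


class Policy(Enum):
    ADD_TO_LIST = 1  # least intrusive
    ASK_FIRST = 2
    PURCHASE = 3     # most intrusive, user must have opted in


def on_overheard_need(item: str, policy: Policy) -> str:
    """Turn an overheard need ("We need milk") into a device response."""
    if policy is Policy.PURCHASE:
        return f"I purchased {item} for you."
    if policy is Policy.ASK_FIRST:
        return f"Should I purchase {item} for you?"
    return f"I added {item} to your shopping list."
```

With these assumptions, "Buy Apple" falls back to the grocery agent, while "Buy Apple stock" is routed to the stock agent. Which implicit policy applies would be part of each user's authorization settings.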
        - Smart Intercom Peer to Peer: not all rooms need to sound when someone speaks. If we know where people are, we can ring only that room.
        - Phone Call
          - Location-in-the-home detection for call direction
          - Call follows user room to room
          - Challenge: it’s not just being in any room to start a dialog. I want to roam from room to room. Many challenges in echo/acoustic compensation and cancellation.

WORKGROUP STUFF:

Problem Statement: Natural, intelligible, seamless, and private worldwide human-to-human communication available anywhere on residential property, in all likely noise cases.

For communication, we see the solution space falling into the following high-level categories, with details for each use case listed beneath:
        - Intercom:
          - Person-person local
          - Quality not necessarily high (typically Wideband)
          - Smart Routing (keep conversations private within house)
          - Node only starts listening with trigger word
          - User-definable trust mode
          - Presence detection
            - Room to room person tracking
            - Dial in speakers and mics for location...handoffs between endpoints
          - Personal ID
          - Full Duplex
          - Local processing
          - Noise and echo suppression, de-reverb
          - Call control / Selective transmission
          - Intelligibility > quality
        - VOIP:
          - Call quality more important than intercom (SWB?)
          - Far end cleanup
          - Quality > intelligibility
          - (Also includes Intercom requirements)
        - Audio to remote security: (Security/safety)
          - Battery backup
          - 2 establishment methods:
            - Hazard trigger from record side to trigger recording, group texts, door lock/unlock, etc.
              - Turn off not under duress (voice stress) for alarm
              - Selectable biometrics to turn off
            - Manual trigger from listening side
          - Hard to defeat/break connection (non-internet path...landline or cell)
        - Non-real-time communication (speech to email, social)
          - Distinguish between message + other speech
          - Capture inflection in text, and do this in local hub (privacy)
          - High accuracy transcription using as many mics as possible
          - Speech processing where needed (noise reduction, echo cancel, de-reverb)
        - Remote collaboration (Music, party, presentation etc.)
          - High bandwidth (~48 kHz)
          - Minimal noise reduction (wind/breath)
          - Selective voice processing (noise reduction, echo cancel, de-reverb)
          - Low latency important
        - Non-discussed topics
          - Account selection
          - Gaming voice
          - Speech-to-text-to-speech in different voice (games or other)

Search is one of the most commonly used features in existing voice systems, such as Siri and Cortana. From resolving drinking disputes to helping kids with homework, it’s a feature typically used by all the family, at any time. This creates use-case and privacy issues around which user is using the system at which particular time, along with understanding “who” is giving the results and “where”. Voice technology is a shared technology, unlike the mobile phone or personal computer. The fundamental mindset is to expect multiple users, and the system should ensure that users are well isolated and partitioned from one another.

The group pulled together highlights of the technology that would be required to enable the following capabilities, which represent desired use cases.
        - Speaker Identification
          - Should be performed locally, either in the device or in the local “fog”
          - Speaker identification to a registered user.
          - Ability to identify non-users and limit guest access accordingly.
          - Admin/Dad/Parental tiered control mode, locking of the system (e.g. allow control of home lights, but no music selection etc.).
          - Multi-user capable: independent search trails, isolation in history cache? Cross-corruption of search history.
        - Search Context should be stored in the cloud
          - Needs to consistently recognize the user (see above).
          - Emotional context of voice changes the reaction of the search, e.g. urgency of toilet paper: “I need toilet paper NOW!!” vs. “Where can I find toilet paper on sale?” could result in an Amazon Prime Now purchase being made versus a list of places nearby where toilet paper is available.
          - Multi-user with context: search history based on user, parental control, instant privacy button, incognito mode for voice search, history caching for searching back on the content of discussions?
          - Birthday/relative/friend search mode, e.g. ask the assistant what the significant other prefers based on his/her search history. Privacy concerns? Permissions scheme, level of access for others to your search results (does not seem like it’s limited to voice).
          - (Crossover with entertainment) Who’s the guy in the scene? How long has this movie been on? Have I seen this before?
            - Shall not interrupt experience and should be a real-time overlay of search results on your personal display or TV.
        - Search for devices in the home
          - Still leveraging user id
            - Where are my keys? (May require Tile or other device tag)
            - Where is my phone?
        - Completion of transaction
          - This is to aid with privacy and
          - Verbal response: “Thanks” or “Thank you” or UI tone with descending notes.
          - System needs to time out after a certain time if no more requests are made of it.
        - Different search servers, handoff protocol, global settings, alternate services:
          - Assistant tells you the source of the answer and which services provide the answers, using different characters or voices to identify the source or service provider. Doing so is much faster than “Google said this and Amazon found that…”
          - Handover of rendering or served results to other devices or context-relevant services: send map to car, send calls to phones, pass calls to another person in another part of the house or in other geographical locations (ease of use)
          - There are large search and service providers in other countries that should not be ignored in these integrations (Baidu, WeChat, Yandex… etc.)
        - Need a graceful plan for how the system handles search results that are a failed task completion. “I'm sorry I put a search result in the app” is not a great user experience. The system should understand the difference between errors in recognition, context errors, false triggers and cross-corruption of history, and recommend next steps.
        - Marketing/Directed/Targeted messages over use of service?
          - Negatives
            - Ethically and privacy challenged
            - Unprompted suggestions could be perceived as annoying if handled poorly or too aggressively, in a Clippy-type situation.
          - Positives
            - Knowing the demographic of the room, e.g. children under 8, and not showing violent movie ads
            - Ability to introduce new features and updated services to the user
        - The functionality has to meet some required response time per use case. Push-to-talk intercom is different from doing an across-house jam session. There has to be minimal round-trip latency.
        - Hub vs. ad hoc network of voice: should one device or family of devices do far-field and close-range listening (a bunch of Dots), or will there be a network of microphones from various devices listening at the same time (TV, Remote Control, Set Top Box, Hub, Thermostat, Wall Switch, etc.)?

Entertainment
        - Stripping away your voice signature and background (like text to speech) and putting another voice signature on it. Playing a game using the voice of a celebrity.

We envision voice as a primary interface for controlling a network of smart home devices in a few years’ time. Examples of smart devices include the following:

 - lights
 - kitchen and laundry room appliances
 - robots, for example automatic room vacuum cleaners
 - audio and visual entertainment (TV, stereo)
 - window shades
 - irrigation and outdoor lighting
 - alarms/security, locks
 - HVAC
 - smoke alarm
 - automotive: starting the car, or the car’s AC/heat
 - BBQ pit (preheat to 300 degrees)
The smart home should be simple to set up for voice control of the smart devices.

Interaction through voice control should be human-like and rely on natural language for specifying commands rather than a limited set of pre-defined commands. Also, voice control should not be limited to close proximity to the controlled device, i.e. the user should be able to control all smart devices using voice commands from anywhere in the house. In terms of feedback, we would expect to get voice feedback whenever we are not in proximity to the controlled device and cannot see the effect of the command.

Unfortunately, we envision a compromise in terms of user experience due to the walled garden nature of future voice interfaces, which stems from business considerations of major players like Amazon, Apple and Google. This will result in interoperability issues unless users ensure they purchase smart devices compatible with the walled garden. That said, there should be many choices for users as more and more manufacturers adopt one or more providers of voice services, e.g. Sonos has implemented the Alexa API into their products.

We envision central processing, possibly in the cloud, that can control multiple devices (e.g. turn off everything in one command when leaving the house) and also make centralized decisions in cases of multiple mics receiving the same command. Suppose you have 6 rooms. The Amazon Dot in each room knows its room and knows that its default devices are the ones in the current room, if the house has multiple instances of that device. For example, a house has 2 TVs, each in a different room. If you’re in the living room (TV 1), then “turn on the TV” means turn on the TV in the living room. You will have to specify the other TV by name if you want to turn it on.

Common challenges:
        - As mentioned above, a centralized hub is required to provide an adequate user experience (i.e., a specific home profile).
        - Removing the voice trigger for a more natural user experience and human-like interaction
        - Security
        - Speaker separation
        - Voice biometrics

Challenges specific for control:

        - User profiles and privileges for control purposes, e.g. little Joe shouldn’t turn on the BBQ pit
        - Smart devices need to be enumerated in case of multiple devices of the same kind
        - Interoperability: as mentioned above, we don’t envision a complete solution for this challenge but rather a walled garden approach in the near future
        - Power consumption: battery-operated devices (e.g. smoke alarm) would require an efficient implementation
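The enumeration and privilege challenges can be made concrete with a small sketch of how a central hub might resolve “turn on the TV”. It is illustrative only: the device registry, the `BLOCKED_KINDS` table and the response strings are hypothetical, and the rule used (an explicit name wins, otherwise the device of that kind in the room of the receiving microphone) is just one reading of the room-default behaviour described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Device:
    name: str  # unique, enumerated name, e.g. "bedroom tv"
    kind: str  # e.g. "tv"
    room: str


# Hypothetical home profile held by the central hub.
DEVICES = [
    Device("living room tv", "tv", "living room"),
    Device("bedroom tv", "tv", "bedroom"),
    Device("bbq pit", "bbq", "patio"),
]

# Hypothetical privileges: user -> device kinds they may not control.
BLOCKED_KINDS = {"little joe": {"bbq"}}


def resolve(target: str, mic_room: str, devices: List[Device] = DEVICES) -> Optional[Device]:
    """Return the device a command refers to, or None if unknown or ambiguous."""
    target = target.lower()
    # An explicit unique name always wins.
    for device in devices:
        if device.name == target:
            return device
    # Otherwise treat the target as a kind of device.
    matches = [d for d in devices if d.kind == target]
    if len(matches) == 1:
        return matches[0]
    # Several devices of the same kind: default to the one in the current room.
    in_room = [d for d in matches if d.room == mic_room]
    return in_room[0] if len(in_room) == 1 else None


def may_control(user: str, device: Device) -> bool:
    return device.kind not in BLOCKED_KINDS.get(user.lower(), set())


def turn_on(user: str, target: str, mic_room: str) -> str:
    device = resolve(target, mic_room)
    if device is None:
        return "Which device do you mean?"
    if not may_control(user, device):
        return f"Sorry, {user} is not allowed to control the {device.name}."
    return f"Turning on the {device.name}."
```

For example, `turn_on("Dad", "tv", "living room")` picks the living room TV, while the same request heard in the kitchen is ambiguous and prompts a clarifying question. This assumes speaker identification has already told the hub who is talking.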
Backup Real Time Notes:

        - User Experience Focus
        - What feedback establishes context and acknowledgements
        - Timeouts
        - Naturalness of trigger words
        - What other sensors are necessary?
        - Identity
        - Communications
        - Search
        - Control (Home Automation)
        - Entertainment
        - Existing voice interfaces
        - Usage Models
        - Technical capabilities
        - Business Model
        - Security
          - Multi-factor authentication
        - Modelling human interaction
        - Pitch level detection: detect tone, mood (frustration etc.)
        - Cloud vs. local distribution of processing
        - Commerce, communication, control and search
        - Multi-channel participant
        http://www.projectbarbq.com/reports/bbq13/bbq13r6.htm - Group Report: Using Sensor Data to Improve the User Experience of Audio Applications
        http://www.projectbarbq.com/reports/bbq14/bbq14r7.htm - Group Report: Audio opportunities in the Internet of Things
        http://www.projectbarbq.com/reports/bbq15/bbq15r4.htm - Group Report: Audio Of Things: Audio Features and Security for Smart Homes/Internet of Things
 |