Current and near-term speech assistants leave much to be desired. The aim of our group is to profile the ideal user experience in the 5-7 year range for hearable-nearable-earable voice assistance.
In addition, we would like to evaluate the available technologies that will support the ideal user experience and identify technologies that need further development to meet our goals.
Addressing these technology gaps will require privacy and other trade-offs in exchange for a compelling user experience and product adoption.
– This is the essence of our topic.
Facilitating Technology:
- The more the machine knows about you, the more it can act as a companion and anticipate your needs and wants. NLU/NLG: natural language understanding and generation need to improve to enable smart answers and grasp context, and to eliminate the uncanny valley.
- Automatic Speech Recognition (ASR) and speech synthesis are becoming quite good.
- True sensor hub functionality to improve context awareness is lacking
- People “want” privacy, so is this a feature you can monetize: a private Google, for example? Or does your in-home privacy live on your home server; not everything needs to go to the cloud. The service can be free if you don’t care.
- Privacy mode, just like airplane mode
Trade-offs:
- The more the machine knows about you, the more personal information you give up. The best experience means you have no privacy.
– Minimal latency at conversational speed.
- What is “real time”? No more than 1 s of latency: “UN interpreter” mode. Predictive models help reduce latency. Re-shuffling phrases due to differing language structures (“The market, I go to…” vs. “I am going to the market”).
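As a sanity check on the 1 s target, a conversational pipeline can be budgeted per stage; the stage names and timings below are illustrative assumptions, not measurements:

```python
# Hypothetical latency budget for a conversational voice pipeline.
# Every stage name and timing here is an assumption for illustration.
BUDGET_MS = 1000  # "real time": no more than 1 s end-to-end

stages = {
    "capture_and_vad": 50,
    "asr": 250,
    "nlu_and_prediction": 300,  # predictive models start before the utterance ends
    "synthesis": 200,
    "playback": 50,
}

total_ms = sum(stages.values())
headroom_ms = BUDGET_MS - total_ms
print(f"total={total_ms} ms, headroom={headroom_ms} ms, ok={total_ms <= BUDGET_MS}")
```

The headroom is what absorbs network jitter when any stage runs in the cloud, which is one argument for the edge processing below.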
Facilitating Technology:
- Edge processing: memory and compute in the phone and the end device. Push the momentum toward less on the cloud and more on the device.
Trade-offs:
- The robot anticipates your needs before you do. Privacy! Latency vs. conversational nature.
(adjustable based on environment) – Dad mode vs. Office mode
- Habits/Lifestyle integrated.
- The building you enter talks to you: “Welcome to the XX building, coffee shop on the left…” (because your device told it you were slightly tired as you walked in the door)
Facilitating Technology:
- True Sensor Hub: data aggregation, with device-to-device data sharing on the individual (phone, watch, earbuds) and with friendly systems. Google and Alexa (and sometimes Siri) need to intermingle and connect to intelligent infrastructure. The megacorporations must open their systems to grow the ecosystem.
- Endpoint sensors: no ubiquitous interface, and processor I/O is limited (will we all adopt SoundWire or the like?). For the system to work we need a sensor output data standard covering communication, control, and data. How many damn wires can I add? How many pins can I have? ***Please form a workgroup to foster innovation here.
- The body is the personal area network. The building is the local area network. The cloud is the wide area network, and both the PAN and the LAN tether into it: it is the cream of the Oreo.
- Developer monetization: downloadable, context-aware assistance. I just landed in Osaka; I open an app that links to my personal profile and offers suggestions based on “me” for $5.99.
Trade-offs:
People actually have to work together; walled gardens must fall. Industries must collaborate to develop open standards on both the assistant side and the component side.
– Let Neo from The Matrix guide you while walking down the street… let your mom scold you for consuming too many calories… Neural Speech Synthesis
- Uploadable digital avatar, real or fictional: I can choose the companion in my ear. I can mix Darth Vader with my husband.
- Capture and distribute myself/essence - For example when I am out of town, I can send my avatar to my daughter to remind her that I exist.
- “Neural Personality Synthesis” - you have a coach whispering in your ear
- Unique assistants based on need - your virtual friend depending on what you need: fun/office/life coach - you talk to different aspects of your personality based on the situation.
Facilitating Technology:
Next-generation neural speech synthesis, likely tethered to the phone. Models live on your local device (e.g., “I am a 20-year-old American who lives in Texas” lives on my phone); there is still much cloud processing, but what gets pushed is limited to preserve bandwidth. The individual user model is continuously improved, and it is portable: you log into it when you buy an iPhone v14.
- Where we are today: Samuel L. Jackson can tell you dirty jokes on Alexa, but perhaps can’t handle your shopping list.
- Multi-modal capture of speech: mic + bone conduction seems a reasonable combination to minimize power consumption if needed.
Trade-offs:
- Self upload: Privacy - all of “you” is given to a 3rd party to generate realistic avatar.
- Sensory overload - unwelcome assistant intrusion. Ad Spam.
- Safeword: the auto-generated push notifications become overwhelming; we need a quick “STFU”.
- Always on-ish. Battery life.
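The notification-overload trade-off above can be made concrete with a simple rate limiter; a minimal token-bucket sketch, in which the per-minute rate, burst size, and method names are all illustrative assumptions:

```python
class NotificationThrottle:
    """Token-bucket throttle for assistant push notifications, plus a
    mute 'safeword' override. The caller supplies timestamps in seconds
    (e.g. from time.monotonic()), which keeps the sketch deterministic."""

    def __init__(self, per_minute=3, burst=3, now=0.0):
        self.rate = per_minute / 60.0  # tokens replenished per second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now
        self.muted = False

    def safeword(self):
        """User said the quiet word: drop everything until unmuted."""
        self.muted = True

    def unmute(self):
        self.muted = False

    def allow(self, now):
        """Return True if a notification may be delivered at time `now`."""
        if self.muted:
            return False
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

After a burst is spent, further notifications are silently dropped until tokens refill, and `safeword()` overrides everything regardless of the bucket state.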
- Can I whisper? Sub-ambient input (bone conduction, multi-modal format).
- Non-audio commands such as next-generation proximity gesture support or novel bone conduction control input (tooth chatter)
Facilitating Technology:
Bone conduction coupled with next-generation ANC. Processors must accommodate ultra-wide-dynamic-range signals. Directive speakers to enable sound bubbles.
Trade-offs:
More sensors, more problems: shared interface, data aggregation, power consumption.
– What replaces and improves upon speech-to-text?
- Immune to wind and other environmental corruption.
Facilitating Technology:
- We need to move beyond Bluetooth: current bandwidth limitations degrade quality. And, regardless of bandwidth, the models need to be trained on something beyond a quiet room.
- Move beyond speech to text.
- Insertion of contextual metadata onto the speech signal - what is the transmitted voice metadata of the future?
- When you throw away the wind noise, you also throw away some of the voice. Near-term models must be updated.
Trade-offs:
User data is needed to improve the models, so privacy may be compromised. Additional power will be consumed as sensors become higher performance.
We want to avoid artificial experiential triggers. How long does the session last once invoked, to preserve privacy?
Facilitating Technology:
- Give the assistant a name based on the persona you desire.
- Social engagement: how do you talk differently to the assistant vs. the person sitting next to you? Directional cues: in a room full of people I look at the person I want to talk to and can easily work around the room. How do I direct speech to the assistant? Training the device: the inflection and tone must adjust to the assistant, similar to how I talk differently to my mother versus my child.
- In-ear EEG sensors - think to invoke (beyond 10 years)
Trade-offs:
- If we take away their branding (“Alexa/Amazon”), they will invest less in the ecosystem. Uncontrolled wake-word models may lead to false wake-ups (like a bad VAD).
- Privacy: I don’t want this device always listening, collecting everything I say, and giving it to the service provider.
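The false-wake-up concern comes down to voice activity detection. A minimal energy-threshold VAD sketch (threshold and hangover values are arbitrary assumptions) shows the mechanism: any frame whose energy exceeds the threshold counts as speech, so a badly chosen threshold wakes on noise:

```python
def frame_energy(frame):
    """Mean-square energy of a frame of PCM samples (floats in [-1, 1])."""
    return sum(s * s for s in frame) / len(frame)

def vad(frames, threshold=0.01, hangover=3):
    """Classify each frame as speech (True) or silence (False).

    A frame is speech if its energy exceeds `threshold`; the `hangover`
    keeps the decision active for a few frames after energy drops, so
    brief pauses inside a word are not cut off. Both values are
    illustrative assumptions, not tuned numbers.
    """
    decisions, active = [], 0
    for frame in frames:
        if frame_energy(frame) > threshold:
            active = hangover
        elif active > 0:
            active -= 1
        decisions.append(active > 0)
    return decisions
```

A production wake-word model replaces the energy test with a learned classifier, but the same trade-off remains: a looser threshold means more false wake-ups, a tighter one means missed invocations.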
– full biometrics / sensor fusion
- Blood glucose monitoring is highly desirable.
- Security – Payment authentication, data protection, encryption.
- EEG as a mode of input and as a biometric (rudimentary functionality within the scope of this study): a mode of invoking, emotional state, etc.
Facilitating Technology:
- EEG technology must advance in terms of form factor and power consumption for even very basic functionality.
- Biometric voiceprint technology must be scalable to in-ear devices.
- Modeling the human ear form is highly individualized, which is required for proper mechanical coupling.
- Blood glucose monitoring needs much advancement.
- 6DOF Audio both in the hardware and the XR context is mandatory
- True Sensor hub implementation for data aggregation and processing
- Endpoint sensors again lack a ubiquitous interface and a sensor output data standard (see the sensor hub discussion above).
Trade-offs:
Power consumption and size - we want the best performance but batteries are a challenge.
– as natural as wearing glasses or earrings. It should be comfortable and non-obtrusive.
- Battery should last all day…
- In-ear, on-ear, over-ear. A proper hearing aid gets really gunky and will turn people off. Earbuds get dirty. Wearing something all day should be comfortable and low maintenance.
- Can we completely avoid insertion?
- I have enough holes in my head, no more. No implantable (yet).
Facilitating Technology:
- Advances in materials science: what kind of insertion material repels ear gunk?
- Battery technology and capacitor advancement - both in deep discharge and capacity
- Offload some functionality to other devices - necklace/ring/clothing/etc to maintain reasonable industrial design of headphone.
- Social interaction: “I’ve got something in my ear, therefore I can’t hear you,” so people won’t talk to you. Audio transparency is not currently a given.
- New social norms must evolve: people now know how to interact with others while operating a cell phone, but with hearable devices the rules of engagement are not standardized. Social lighting: a blue light means “I’m busy, so don’t talk to me.”
Trade-offs:
Small size lends itself to a small battery. Social stigma: you look like a dork. Sound leakage due to ear proximity of auxiliary devices (e.g., a necklace speaker). Don’t be a glasshole.
- Automatic transparency (someone calls your name)
- Dad passthrough – make my daughter hear me
- Selectable alert modes (to mute or unmute, disable ANC, based on the nature of the event source)
- Scene processing: Is this person angry? Am I in danger?
- Context/Scene/Location based automated prompts (similar to a push notification) - You walk into the building. Building tells you hello and some specifics based on the personal data you share with the building.
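Selectable alert modes reduce to a policy table mapping (mode, event) to a playback action; a minimal sketch, in which the mode names, event types, and actions are all hypothetical:

```python
# Hypothetical policy: what the hearable does when an external event is
# detected, per user-selected mode. All names here are illustrative.
POLICY = {
    "focus":  {"name_called": "transparency", "siren": "transparency", "ad": "ignore"},
    "social": {"name_called": "transparency", "siren": "transparency", "ad": "notify"},
    "dnd":    {"name_called": "ignore",       "siren": "transparency", "ad": "ignore"},
}

def react(mode, event):
    """Return the playback action for an event, defaulting to 'ignore'."""
    return POLICY.get(mode, {}).get(event, "ignore")
```

Note that a safety-critical event like a siren overrides even "do not disturb": the policy table makes that decision explicit and auditable rather than buried in code.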
Facilitating Technology:
- Data aggregation and capture models need improvement for accurate personality of the custom generated assistant.
- Endpoint processors must enable local machine learning functionality, backed by a huge, open database of accessible context training material.
- Democratize (and anonymize) the input / content creation for more models, more events, more context.
Trade-offs:
- Privacy and ownership of your personal data.
- The companies with the large pools of data are not willing to open to the competition.
(and building-to-device) communication / connectivity with integrated owner discernment
- File sharing, point to point communication “hey Joe, listen to this song”. “Hey building, where is meeting room 52?”
- Air Drop for earbuds
- Targeted communication bubble / passthrough to facilitate conversation in crowded environments.
- Location-based data/context dump. I walk into McDonald’s and the menu is pushed to my device, then deleted when I leave.
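The “deleted when I leave” behavior is essentially an ephemeral, geofence-scoped cache; a minimal sketch, with hypothetical venue IDs and payloads:

```python
class EphemeralContextStore:
    """Location-scoped context cache: data pushed by a venue (a menu, a
    floor map, meeting rooms) lives only while you are inside its
    geofence and is deleted on exit. Venue IDs and payloads below are
    illustrative assumptions."""

    def __init__(self):
        self._data = {}       # venue_id -> pushed payload
        self._inside = set()  # venues currently occupied

    def enter(self, venue_id, payload):
        """Geofence entry: accept the venue's context dump."""
        self._inside.add(venue_id)
        self._data[venue_id] = payload

    def leave(self, venue_id):
        """Geofence exit: drop the dump, preserving privacy by default."""
        self._inside.discard(venue_id)
        self._data.pop(venue_id, None)

    def get(self, venue_id):
        return self._data.get(venue_id)
```

Deleting on exit keeps the device from accumulating a trail of everywhere you have been, which is the privacy point of the trade-off discussed below.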
Facilitating Technology:
- Security vs. convenience doesn’t need to be a major undertaking: how about a simple head nod to access non-critical data? I also don’t want to walk into a room and share all of my data.
- Protocol must be ubiquitous across different device brands: My Apple buds talk to my Google buds.
- What comes after Bluetooth?
Trade-offs:
Security still is a challenge.
– You can’t damage your ears. Adjustable user calibration/EQ. Custom HRTF. Couple with the previously mentioned scene re-profiling.
- Nanny mode: you’ve been listening to death metal for 5 hours… cut it out.
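Nanny mode can be grounded in standard noise-dose math: the NIOSH criterion allows 8 hours at 85 dBA and halves the allowance for every 3 dB increase. A sketch of the dose calculation (the listening-day segments are hypothetical):

```python
def allowed_hours(level_dba, criterion=85.0, exchange=3.0):
    """Permissible exposure time at a given level, NIOSH-style:
    8 h at 85 dBA, halving for every +3 dB."""
    return 8.0 * 2.0 ** ((criterion - level_dba) / exchange)

def daily_dose(segments):
    """Fraction of the daily allowance used, from (level_dBA, hours) pairs.
    A value of 1.0 means 100% of the allowed dose; nanny mode should
    intervene well before that."""
    return sum(hours / allowed_hours(level) for level, hours in segments)

# Hypothetical listening day: 5 h of loud music plus quieter background.
day = [(94.0, 5.0), (70.0, 3.0)]
```

At 94 dBA only one hour is permitted, so the five-hour death-metal session alone is five times the daily dose; the quiet background listening barely registers.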
Facilitating Technology:
- My hearable must test me to understand my level of fatigue and exposure to automatically adjust / compensate.
- The active electronics used for device protection might co-exist with the hearing protection functionality.
- Advances in headphone personalization/customization - playback reprofiling.
Trade-offs:
It could be too quiet, limiting performance.
– compensating for hearing loss. Bionic hearing ability.
- Augmented hearing detection: “That bird you were looking for is 4 meters to your left.” “Watch out, a car is coming from behind you and is about to hit you.”
- Snoop mode - I want to eavesdrop. Selective hearing.
Facilitating Technology:
- Scene detection to re-profile the playback path based on environment.
- Self contained individual ear calibration.
Trade-offs:
Potentially intrusive to others (creepy listener/eavesdropper)
(lofty goal for 5 years)
- The entire conversation should feel natural and “local,” even remotely. If grandma is in her living room and her grandson is in his echoing basement, they each sound like they are in the other’s environment… or another location, like a beach!
Facilitating Technology:
- Understanding room mapping and dynamic impulse response based upon position in the room. A major challenge for a low-power, low-profile “earable” device.
Trade-offs:
- This opens a big can of worms technology-wise. What is in the metadata?