A voice user interface (VUI) is a technology that allows people to interact with a computer or device using spoken commands. Think of Captain Kirk standing on the bridge of the starship Enterprise asking the computer for an analysis. Once the stuff of science fiction, the VUI is now one of the fastest-growing technologies on the planet.
One billion voice searches are performed every month, and 72% of people who use voice search do so daily. Google’s voice technology now recognizes over 100 languages, and recent analytics suggest its natural language processing stack is over 95% accurate. There is no denying that VUIs are making huge gains in both adoption and accuracy. One might even argue that the gains in adoption are a result of the gains in accuracy. There is some truth to that, but it’s not the whole story.
How VUI helps future-proof digital products
Human beings are hardwired for speech. As a species, humans have been communicating with the spoken word for at least 50,000 years. On average, we can speak 125 to 150 words per minute: That’s over three times the average typing speed. When you put it that way, you begin to wonder if future generations will bother learning to type at all.
If you are building a digital product or service, there is a good chance a VUI is or will be on your roadmap. Twenty years ago, adding a voice user interface to an application required a team of specialized engineers and expensive hardware, and it frequently resulted in something that sounded like the Speak & Spell.
Today, even as a beginner, you can build your first voice application in under an hour using something like the Alexa Skills Kit. But it’s not just the technology that will make or break your VUI. To build a voice user interface that will elevate your digital offering to the next level, you’ll need to understand some best practices and philosophies.
VUI best practices
Start with the ideal interaction
You’ll want to start designing your voice interaction by mapping an end-to-end dialog flow. Start with the golden path, then work on filling in the branches and edge cases. Watch out for dead ends in your conversation trees. Just like when talking with humans, awkward silence is a conversation killer.
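One way to catch dead ends early is to write the dialog flow down as a graph and check it mechanically. The following is a minimal sketch, not a prescribed method: the state names, the intents and the DIALOG_FLOW structure are all hypothetical and stand in for whatever your design tool or framework uses.

```python
from collections import deque

# Hypothetical dialog flow: each state maps a user intent to the next state.
DIALOG_FLOW = {
    "welcome": {"order_pizza": "choose_size", "track_order": "ask_order_id"},
    "choose_size": {"small": "confirm", "large": "confirm"},
    "ask_order_id": {},  # dead end: the user has no way forward
    "confirm": {"yes": "goodbye", "no": "welcome"},
    "goodbye": {},
}
# States where the conversation is allowed to end.
TERMINAL_STATES = {"goodbye"}


def find_dead_ends(flow, terminal_states):
    """Return non-terminal states that offer the user no next step."""
    return sorted(
        state
        for state, transitions in flow.items()
        if not transitions and state not in terminal_states
    )


def find_unreachable(flow, start):
    """Return states that can never be reached from the start state."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for next_state in flow.get(state, {}).values():
            if next_state not in seen:
                seen.add(next_state)
                queue.append(next_state)
    return sorted(set(flow) - seen)


dead_ends = find_dead_ends(DIALOG_FLOW, TERMINAL_STATES)
unreachable = find_unreachable(DIALOG_FLOW, "welcome")
print("Dead ends:", dead_ends)
print("Unreachable:", unreachable)
```

Running a check like this whenever the flow changes makes it harder for a new branch to leave users stranded in silence.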
More options do not mean more value
Remember that users start with no clear indication of what options are available, so proper onboarding is essential. Start with an overview of what the interface can do. Keep lists short: usually three or fewer options. Consider prefacing those options with numeric identifiers so your users have less to remember. It’s also important to keep in mind that text-to-speech engines typically recite information much more slowly than people read, so keep your menu options concisely worded.
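The menu guidance above can be enforced in code rather than left to convention. This is a small sketch under stated assumptions: the function name, the wording of the prompt and the "more" escape hatch are illustrative choices, not part of any particular voice platform.

```python
MAX_OPTIONS = 3  # mirrors the "three or fewer" guideline


def build_menu_prompt(options, max_options=MAX_OPTIONS):
    """Build a short, numbered spoken menu from a list of option labels."""
    if not options:
        raise ValueError("A menu needs at least one option.")
    spoken = options[:max_options]
    # Numeric identifiers give users something short to say back.
    parts = [f"{number}, {label}" for number, label in enumerate(spoken, start=1)]
    prompt = "You can say: " + "; ".join(parts) + "."
    if len(options) > max_options:
        # Offer the rest on request instead of reciting a long list.
        prompt += " Or say 'more' to hear other options."
    return prompt


short_menu = build_menu_prompt(["check balance", "pay a bill"])
long_menu = build_menu_prompt(
    ["check balance", "pay a bill", "find a branch", "talk to an agent"]
)
print(short_menu)
print(long_menu)
```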
Context, context, context
Programmatically deciphering and maintaining context is difficult both within a single session and across multiple sessions. When humans interact with each other, we are privy to a number of non-verbal cues. Pitch, tone and even facial expressions all provide additional context. Most commercial VUI software is ignorant of these cues. Interestingly enough, however, nearly all can convey some additional context in the response via Speech Synthesis Markup Language (SSML). SSML allows a developer to introduce pauses, pitch changes and even some emotion into responses, increasing the conversational feeling of your VUI.
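As an illustration of the SSML point above, the sketch below assembles a response with a pause and a pitch change. The speak, break and prosody elements come from the W3C SSML specification, but the helper functions are hypothetical, and the attribute values each voice platform accepts vary, so check your platform's documentation before relying on specific values.

```python
from xml.sax.saxutils import escape


def text(content):
    """Escape plain text so characters like & and < stay valid inside SSML."""
    return escape(content)


def pause(milliseconds):
    """Insert a silent pause using the SSML break element."""
    return f'<break time="{int(milliseconds)}ms"/>'


def prosody(content, pitch="medium", rate="medium"):
    """Wrap text in a prosody element to adjust pitch and speaking rate."""
    return f'<prosody pitch="{pitch}" rate="{rate}">{escape(content)}</prosody>'


def speak(*parts):
    """Join already-built SSML fragments inside the root speak element."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = speak(
    text("Your order is confirmed."),
    pause(400),
    prosody("It should arrive in 30 minutes.", pitch="+10%", rate="slow"),
)
print(ssml)
```

Escaping the text matters because a stray ampersand in a product name can otherwise produce invalid markup that the speech engine may reject.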
Voice-specific error handling
Error handling on a VUI has specific challenges. Error messages need to be specific and suggest a next course of action to the user. For example: “I’m afraid I don’t know how to help you with that. As a reminder, I can assist you with the following…”
You’ll also want to be wary of a generic try-catch type error handler pushing a system level error all the way up to your TTS. You don’t want your voice assistant telling users “socket closed by remote host” or some other common low-level occurrence. Logging is your best friend when it comes to debugging a VUI. Just remember your logs contain what the VUI heard, not necessarily what the user said.
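A sketch of how this separation might look in practice follows. It assumes a hypothetical UnknownIntentError raised by your own intent handlers; the prompts, the respond function and the example handlers are illustrative only. The idea is that the user hears a specific, actionable message while the low-level detail goes to the log.

```python
import logging

logger = logging.getLogger("vui")

FALLBACK_PROMPT = (
    "I'm afraid I don't know how to help you with that. "
    "As a reminder, I can assist you with the following: "
)
SYSTEM_ERROR_PROMPT = (
    "Sorry, something went wrong on my end. Please try again in a moment."
)


class UnknownIntentError(Exception):
    """Hypothetical error raised when no intent matches what was heard."""


def respond(heard_text, handler, capabilities):
    """Return text that is safe to send to the text-to-speech engine."""
    try:
        return handler(heard_text)
    except UnknownIntentError:
        # Log what the VUI heard, which may differ from what the user said.
        logger.info("Unrecognized request; heard=%r", heard_text)
        return FALLBACK_PROMPT + ", ".join(capabilities) + "."
    except Exception:
        # Keep the stack trace in the log and never read it aloud.
        logger.exception("System error; heard=%r", heard_text)
        return SYSTEM_ERROR_PROMPT


def broken_handler(heard_text):
    """Example handler that fails with a low-level network error."""
    raise ConnectionResetError("socket closed by remote host")


def confused_handler(heard_text):
    """Example handler that cannot match the request to an intent."""
    raise UnknownIntentError(heard_text)


capabilities = ["weather", "timers", "news"]
system_reply = respond("play some jazz", broken_handler, capabilities)
fallback_reply = respond("play some jazz", confused_handler, capabilities)
print(system_reply)
print(fallback_reply)
```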
Crowdsource utterances
One of the more challenging aspects of creating a good VUI is training your model on all the different ways your users may ask for the same thing. You’ll never be able to think of all the variations on your own, and surveys don’t generally work because people write differently than they speak.
Instead, you’ll need to observe — and, when possible, record — users in real life to understand a reasonable number of user inputs at launch. Ensure you are observing users who are representative of your target users: Doctors use a very different set of shorthand and abbreviations than mechanics or soldiers.
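Once you have transcripts from those observation sessions, a small script can turn them into per-intent sample utterances. The sketch below assumes each observation has already been labeled with an intent by a person; the intent names and sample phrases are invented for illustration, and the output format your training tool expects will differ.

```python
import re
from collections import defaultdict


def normalize(utterance):
    """Lowercase, replace punctuation (except apostrophes), collapse spaces."""
    cleaned = re.sub(r"[^\w\s']", " ", utterance.lower())
    return " ".join(cleaned.split())


def group_utterances(observations):
    """Map each intent to its sorted, de-duplicated utterance variants."""
    grouped = defaultdict(set)
    for intent, utterance in observations:
        cleaned = normalize(utterance)
        if cleaned:
            grouped[intent].add(cleaned)
    return {intent: sorted(variants) for intent, variants in grouped.items()}


# Hypothetical (intent, transcribed utterance) pairs from observed sessions.
observations = [
    ("check_balance", "What's my balance?"),
    ("check_balance", "what's my balance"),
    ("check_balance", "How much money do I have"),
    ("pay_bill", "Pay my electric bill."),
]

samples = group_utterances(observations)
for intent, variants in samples.items():
    print(intent, variants)
```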
Don’t forget about privacy and security
When developing a VUI, it’s your responsibility to understand privacy and security concerns. Commercial smart speakers are constantly scanning for a wake word. Once engaged, however, they usually record and decipher everything said, and can wait up to eight seconds after a command before going back to passive listening.
Developers need to be aware of any sensitive information that might be required for a particular use case, and of the policies and regulations that govern the handling of that data. Also keep in mind that it’s impossible to know who might walk into a room between the time information is requested and the time the response is actually spoken.
How to choose the right VUI tech
Today there is an extensive list of options to jump-start development of your voice user interface. Before selecting a specific solution, make sure you have a firm grasp on your non-functional requirements:
- Connectivity
  - Will the device be connected to the internet all the time?
- Speed and accuracy
  - Does the translation need to happen in real time?
  - What is the trade-off between speed and accuracy?
- Domain data models
  - How well trained are the models in your domain?
  - Do you need to understand full sentences or just pick out keywords?
- Fallbacks
  - Is there a keyboard or touchscreen in case the voice input fails?
- Consequences
  - Will an incorrectly processed voice command result in an irreversible action?
- Environment
  - What surrounding conditions does your solution need to perform under?
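The accuracy and consequences questions above often meet in a single decision: when should the system act, ask for confirmation, or ask again? The sketch below is one possible policy, assuming your speech stack exposes a recognition confidence score between 0.0 and 1.0. The Command type and both thresholds are hypothetical and would need tuning against real usage data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    intent: str
    confidence: float  # assumed recognizer confidence, 0.0 to 1.0
    irreversible: bool  # True if the action cannot be undone


def next_step(command, execute_threshold=0.85, reject_threshold=0.5):
    """Decide whether to execute, confirm, or reprompt for a command."""
    if command.confidence < reject_threshold:
        # Too uncertain to act on or even to confirm: ask again.
        return "reprompt"
    if command.irreversible or command.confidence < execute_threshold:
        # Irreversible actions are confirmed no matter how confident we are.
        return "confirm"
    return "execute"


decisions = [
    next_step(Command("play_music", 0.92, irreversible=False)),
    next_step(Command("play_music", 0.60, irreversible=False)),
    next_step(Command("delete_account", 0.99, irreversible=True)),
    next_step(Command("delete_account", 0.30, irreversible=True)),
]
print(decisions)
```

Confirming every irreversible action costs the user a few seconds, which is usually a fair price when a misheard command cannot be undone.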
VUI represents a fundamental shift in human-computer interaction. When building a voice-powered application, designers and developers must rethink their approach. Focus on voice-first, truly conversational experiences, and your customers will thank you.