Designing great voice experiences using natural language cues

Written by akshaykore | Published 2018/05/25
Tech Story Tags: artificial-intelligence | voice-assistant | user-experience | voice-interfaces | conversational-ui


Echo, Siri and Cortana are popular VUIs. However, these assistants were designed with different goals in mind: Echo was designed as a voice-first interface, while Siri was designed as just another way of interacting with your iPhone. This is changing.

Speech recognition refers to what the recognition system hears: the recognition engine returns a set of candidate transcriptions for every user query. As this technology improves, the challenge of designing a great VUI lies in how the system responds. Natural language understanding (NLU) is how the VUI interprets those transcriptions; it is about how the input is handled, rather than the accuracy of transcribing what was said.

Handling user responses

When you say "Read me this article," I could interpret it as "Read me this article" or "Reed me this article." The second interpretation does not really make sense. As an intelligent person, I am expected to use the context and the logic of the statement to respond to it. Likewise, an intelligent system needs to know what the user meant and respond accordingly. Here are the different types of responses that the system needs to handle:

Constrained/finite responses

For many of the questions a system asks, only a finite set of responses is logically valid. For example, when the system asks "What is your favorite animal?" or "Do you want to go outside?", saying "orange" is not a logical response. In the first case, the VUI will have a list of accepted animal names, while in the second it only needs to look for variations of "yes" and "no." Capturing and interpreting responses while discarding irrelevant ones is especially challenging for a voice-based system, where the user can say seemingly anything. If the user says something outside the finite set, it can be handled with a null response like "I'm sorry, I don't understand."

Here are some examples that require constrained responses:

  1. What is your favorite fruit? : Mango, Apple, Banana, etc.
  2. What country are you from? : India, USA, China, etc.
  3. What song do you want to hear? : Fix you, Lasya, Roja, etc.
  4. Would you like to book the tickets? : Yes, No, Naah, Oh yeah, etc.

The response sets might be lengthy; however, they are still finite.
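To make this concrete, here is a minimal Python sketch of constrained response handling. The word lists and replies are made up purely for illustration; a production VUI would define these as grammars or training phrases in its NLU tooling.

```python
# Toy word lists for illustration; a real VUI would use a much larger grammar.
ACCEPTED_FRUITS = {"mango", "apple", "banana", "orange"}
YES_VARIANTS = {"yes", "yeah", "yep", "oh yeah", "sure"}
NO_VARIANTS = {"no", "nope", "naah", "not really"}

def handle_fruit_answer(utterance: str) -> str:
    answer = utterance.strip().lower()
    if answer in ACCEPTED_FRUITS:
        return f"Nice, {answer} is a great choice!"
    # Anything outside the finite set falls back to a null response.
    return "I'm sorry, I don't understand. Which fruit do you like?"

def handle_booking_answer(utterance: str) -> str:
    answer = utterance.strip().lower()
    if answer in YES_VARIANTS:
        return "Great, booking your tickets now."
    if answer in NO_VARIANTS:
        return "Okay, I won't book anything."
    return "I'm sorry, I don't understand. Should I book the tickets?"

print(handle_fruit_answer("Mango"))
print(handle_booking_answer("Naah"))
```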

Open Speech

Sometimes you want the conversation to have a natural flow but do not need to explicitly handle the input. For example, the assistant might say "Hey there! Long time no see!" to which the user might reply "Nothing much" or "Gotta go work now." In that case, the assistant can give a generic reply like "Hmm.. I see" or "Ok, alright." The user's response is not critical for the conversation to continue; the user could say anything and the logic of the reply would still hold. Generic replies also work well as confirmations, like "Alright! Done!" or "I'll send this information to our customer service team."
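A tiny sketch of the same idea: whatever the user says, the reply is drawn from a small pool of generic acknowledgements (the phrases here are just placeholders).

```python
import random

# Generic acknowledgements; the reply does not depend on what was said.
GENERIC_ACKS = ["Hmm.. I see.", "Ok, alright.", "Got it."]

def acknowledge(utterance: str) -> str:
    # The utterance is intentionally ignored; any reply keeps the flow going.
    return random.choice(GENERIC_ACKS)

print(acknowledge("Gotta go work now."))
```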

Categorization of input

A good strategy for handling user input is to sort it into broadly defined buckets such as positive/negative, happy/sad/excited, good/bad, etc. The VUI simply maps the input to a category rather than trying to match an exact response.

For example, the VUI can ask “How was your experience at our restaurant?”:

  • Good: Good, Amazing, Terrific, Awesome, etc.
  • Bad: Depressing, Bad, Irritating, etc.

The assistant can then respond accordingly.

_Protip:_ Try not to announce back what the user is already feeling. For example, when the user says the experience was bad, do not say "It seems you've had a bad experience. Let us know how we can improve." The user has already indicated their mood; it is unnatural to repeat it. Instead, try saying something reassuring like "I'm sorry to hear that. Want to tell us more?"
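Here is a minimal sketch of this bucketing approach, with the reply for the "bad" bucket following the protip above. The keyword lists are illustrative only.

```python
# Map free-form feedback to broad buckets instead of exact responses.
BUCKETS = {
    "good": {"good", "amazing", "terrific", "awesome", "great"},
    "bad": {"bad", "depressing", "irritating", "awful", "terrible"},
}

REPLIES = {
    "good": "Glad to hear that! Hope to see you again soon.",
    "bad": "I'm sorry to hear that. Want to tell us more?",
    "unknown": "Thanks for the feedback!",
}

def categorize(utterance: str) -> str:
    words = set(utterance.lower().split())
    for bucket, keywords in BUCKETS.items():
        if words & keywords:
            return bucket
    return "unknown"

print(REPLIES[categorize("It was absolutely terrific")])
```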

Logical expressions

Looking for specific keywords or phrases is the simpler method; however, it is important for a voice-based system to allow for more complex queries. For example, the intent behind the following queries is the same:

  1. “My computer is really slow”
  2. “My computer is really really slow”
  3. “Computer is slow. What to do?”

Booking a cab or ordering food are simple intents; however, there are many variations in the way a user can ask for the same thing. Writing a condition for each of these variations would be a huge task. Instead, the system can recognize common patterns, as in the sketch below.
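This is a minimal sketch, assuming a regular-expression pattern stands in for the kind of flexible matching a real NLU engine would do:

```python
import re

# One pattern covers many phrasings of the same "slow computer" intent.
SLOW_COMPUTER = re.compile(
    r"\bcomputer\b.*\bslow\b|\bslow\b.*\bcomputer\b", re.IGNORECASE
)

queries = [
    "My computer is really slow",
    "My computer is really really slow",
    "Computer is slow. What to do?",
]

for q in queries:
    if SLOW_COMPUTER.search(q):
        print(f"{q!r} -> intent: troubleshoot_slow_computer")
```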

Negation

Imagine a VUI asking you "How was your experience at the restaurant?" and you say "Not very good." If the designers have not accounted for this response, the VUI might pick up the keyword "good" and respond with "Awesome! Thanks!" The VUI instantly sounds stupid, and the user may become wary of trusting the assistant. Handling negation is a much more difficult task, but the cost of ignoring it is high.
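One simple (and deliberately naive) guard is to check for a negation word near the keyword before trusting it; real NLU models handle negation far more robustly. The word lists below are assumptions for illustration.

```python
NEGATIONS = {"not", "never", "no", "hardly"}
POSITIVE = {"good", "great", "awesome", "amazing"}

def feedback_category(utterance: str) -> str:
    words = utterance.lower().split()
    for i, word in enumerate(words):
        if word in POSITIVE:
            # A negation just before the keyword flips the meaning.
            if any(w in NEGATIONS for w in words[max(0, i - 2):i]):
                return "negative"
            return "positive"
    return "unknown"

print(feedback_category("Not very good"))  # negative
print(feedback_category("Really good"))    # positive
```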

Disambiguation

The word disambiguation literally means removing uncertainty, and it is arguably one of the most important problems that voice interfaces need to tackle. A simple example is placing a phone call: if you ask Siri to "call John" and there are multiple Johns in your contact list, it will ask "Which John?" followed by the full names of the matching contacts.

The system also needs to disambiguate when the user provides insufficient or excessive information. For example, if a user says "I'd like a large pizza," a natural follow-up question could be "What kind of pizza would you like?" This is a case of insufficient information, which the assistant can resolve by asking a leading question. When the user gives more information than the system is built to handle, the system can ask the user to provide one piece of information at a time; however, it is more beneficial to account for multiple pieces of information at once.
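Here is a minimal sketch of the "which John?" flow, using a hypothetical contact list:

```python
CONTACTS = ["John Smith", "John Doe", "Jane Miller"]  # hypothetical contacts

def resolve_contact(name: str) -> str:
    matches = [c for c in CONTACTS if name.lower() in c.lower()]
    if not matches:
        return f"I couldn't find {name} in your contacts."
    if len(matches) == 1:
        return f"Calling {matches[0]}..."
    # Ambiguous: ask a follow-up question instead of guessing.
    return f"Which {name} did you mean: {', '.join(matches)}?"

print(resolve_contact("John"))  # Which John did you mean: John Smith, John Doe?
```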

Capturing intent

For more complex VUIs, you need smarter ways of handling speech input. Take messaging: there are multiple things you can do with a messaging app. You could say "Send a message to mom," "Read my last message," or "Have I got any messages?" In each case the intent is different, and handling these queries by searching for the keyword "message" is not the best strategy. Instead, the VUI's NLU model should be trained to treat each of these queries as a separate intent.
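A sketch of treating these as separate intents: the phrase lists below are stand-ins, since in practice you would train the NLU model on many example utterances per intent.

```python
# Hand-written phrases stand in for a trained intent classifier.
INTENT_PATTERNS = {
    "send_message": ["send a message", "message to"],
    "read_message": ["read my last message", "read my messages"],
    "check_messages": ["any messages", "got any messages"],
}

def classify_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, phrases in INTENT_PATTERNS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "fallback"

print(classify_intent("Send a message to mom"))     # send_message
print(classify_intent("Have I got any messages?"))  # check_messages
```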

Capturing objects

In cases where the user utters multiple pieces of information at once, the NLU model should be able to handle the query and capture the objects (often called entities or slots) needed for the intent. For example, if a user says "Order me a large cappuccino from Starbucks at home," the user has already specified the type of coffee, the size, the store and the place of delivery. The system should be able to pre-fill this information.

You can use tools like Api.ai, Microsoft LUIS, Nuance Mix, Wit.ai, etc. to build and test these models.
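Below is a toy sketch of capturing those objects from a single utterance. The word lists are invented for illustration; the tools mentioned above do this with trained entity models rather than keyword lookups.

```python
import re

SIZES = {"small", "medium", "large"}
DRINKS = {"cappuccino", "latte", "espresso"}
PLACES = {"home", "office", "work"}

def extract_order(utterance: str) -> dict:
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    return {
        "size": next((s for s in SIZES if s in words), None),
        "drink": next((d for d in DRINKS if d in words), None),
        "store": "Starbucks" if "starbucks" in words else None,
        "deliver_to": next((p for p in PLACES if p in words), None),
    }

print(extract_order("Order me a large cappuccino from Starbucks at home"))
# {'size': 'large', 'drink': 'cappuccino', 'store': 'Starbucks', 'deliver_to': 'home'}
```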

Wake words

Wake words are often used to invoke a VUI system. For example, "Alexa" is the wake word for Amazon Echo, while "Hey Google" or "Ok Google" are the wake words for Google Assistant. Using a wake word lets the user start an interaction with the VUI without having to touch any device.

Here are some things to keep in mind when designing a wake word:

  1. It should be easily recognizable. Short words like “Jim” or “Will” are difficult to recognize.
  2. It should be easy for users to say it.
  3. Use words with multiple syllables. Take note of Alexa's and Siri's wake words: they all have multiple syllables.
  4. Don’t choose words that people might say regularly in conversations.

Another important thing to note is that wake words should be handled locally. Your device should always listen for the wake word and only then start recording the user's voice to send to the cloud for processing. Always recording and sending data to the cloud is unethical and will lead to serious distrust among users.
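Here is a sketch of that control flow. The text-based "detector" below is only a stand-in for an on-device keyword-spotting model; the point is that nothing is uploaded until the wake word fires.

```python
WAKE_WORD = "alexa"

def detect_wake_word_locally(frame: str) -> bool:
    # Stand-in for on-device keyword spotting; nothing leaves the device here.
    return WAKE_WORD in frame.lower()

def send_to_cloud(utterance: str) -> None:
    print(f"Uploading for processing: {utterance!r}")

def listen_loop(frames: list) -> None:
    recording = False
    for frame in frames:  # continuous local listening
        if not recording and detect_wake_word_locally(frame):
            recording = True  # only now does recording begin
            continue
        if recording:
            send_to_cloud(frame)

listen_loop(["background chatter", "Alexa", "what's the weather today?"])
```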

TTS versus Recorded voice

Another important decision you need to make is whether to use text-to-speech (TTS) or a recorded voice to answer user queries. Although a recorded voice feels more natural, it is expensive and time consuming to record every answer. TTS, on the other hand, works in real time but sounds robotic. Although it is improving, TTS still has difficulty pronouncing certain words, and emotion is difficult to convey.

TTS can be improved by applying Speech Synthesis Markup Language (SSML), which can help add more natural-sounding intonations and pronunciations on the fly. Despite this, there are still words and phrases that the TTS engine might have difficulty with, and it might be necessary for you to build a pronunciation dictionary.
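For example, an SSML prompt can slow a phrase down or insert a pause. The snippet below is a minimal sketch wrapped in a Python string, since the exact call for submitting SSML depends on which TTS engine you use.

```python
# A small SSML prompt: most cloud TTS engines accept markup like this.
ssml_prompt = """
<speak>
  Your table is booked for <say-as interpret-as="date">5/25</say-as>.
  <break time="400ms"/>
  <prosody rate="slow">See you soon!</prosody>
</speak>
""".strip()

# Pass ssml_prompt to whichever TTS engine you use instead of plain text.
print(ssml_prompt)
```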

As a rule of thumb, it is generally a good strategy to use a combination of TTS and recorded voice. Recordings can be used for the most common responses, like confirmations and greetings. If you are using such a hybrid model, it is also a good idea to build a voice font from your recording artist so the synthesized voice matches the recordings.

Voice biometric authentication

Voice biometric authentication, also known as Voice ID or speaker verification, is a mechanism that allows users to authenticate themselves with just their voice. Although the technology is improving, it is generally not advisable to use voice ID alone for authentication (I'm sure I'll have to eat my words about this in the future). Speaker verification can also be used for speaker identification rather than authentication: the VUI can identify who is speaking and respond accordingly.

Sentiment analysis

Another way of making your VUI smarter is to use sentiment analysis to detect emotions. Sentiment analysis refers to identifying and categorizing opinions in a piece of text. Although it sounds complicated, you can start doing sentiment analysis quite easily: first define your categories (e.g. positive, negative or neutral), then compare what the user said against them to pick one. You can use open source tools like KNIME to get started.
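A crude sketch of that comparison, scoring an utterance against small hand-written word lists; tools like KNIME replace these lists with proper models and larger lexicons.

```python
POSITIVE = {"great", "love", "awesome", "happy", "good"}
NEGATIVE = {"bad", "hate", "awful", "sad", "terrible"}

def sentiment(utterance: str) -> str:
    words = utterance.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this assistant"))  # positive
```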

Emotion detection

This is a fairly new field. Companies like Affectiva have begun using facial characteristics to detect users' emotions, which the system can then use to give an appropriate response. Emotion detection can also be done by analyzing voice samples in real time: Beyond Verbal detects emotions from a voice stream, which its Moodies app analyzes to display the primary emotion.

One key thing to remember when using techniques to detect emotions is to err on the side of caution. Getting the emotional state right is good; however, getting it wrong might have disastrous consequences.

**References:**

1. Being Digital — Nicholas Negroponte
2. Designing Voice User Interfaces — Cathy Pearl
3. Design for Voice Interfaces — Laura Klein

