Understanding speech recognition to design better voice interfaces

When designing a good voice user interface, it is always advantageous to know how the technology works.

Knowing what goes on behind the scenes enables you to make design decisions that take into account the current limitations and advantages of the technology.

Speech recognition technology

One key component of a voice user interface (VUI) is automated speech recognition (ASR), which translates users’ speech into text. There are a number of free and paid services that provide ASR engines. When choosing an engine, keep the following two things in mind:

Robustness and accuracy of data: As a rule of thumb, the more data the company has, the better its speech recognition will be. Incumbent companies are generally good at amassing large quantities of data.
Good endpoint detection: Endpoint detection refers to how the ASR engine knows when the user has begun or finished speaking. Try to go for a service that provides the best endpoint detection.

Not all ASR tools offer advanced features such as N-best lists, end-of-speech timeouts or the ability to incorporate custom vocabularies. It might be quicker to start with the cheapest ASR tool; however, if the recognition accuracy is sub-standard or the endpoint detection does not work well, your users will get frustrated and eventually give up on the product.

Barge-in

Barge-in is the ability, built into a VUI, for the user to interrupt the system at any time during the conversation. Whether to enable it depends greatly on the type of VUI you are planning. It is advantageous if your VUI is going to read out a long list of menu items, tell a story or generally be verbose, because users might want to interrupt and stop it midway.

When deciding on a barge-in strategy, consider whether barge-in should be triggered by anything the user says or only by the wake word. Most common VUIs use the latter strategy. When Alexa is playing a song, the user can barge in at any time to stop it. Without barge-in, there would be no way to stop playback with a voice command.
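The two barge-in strategies can be sketched in a few lines of Python. This is only an illustration: `BargeInMode`, `should_interrupt` and the default wake word are hypothetical names chosen for this example, not part of any real ASR product. The function simply decides whether text recognized while the system is talking should interrupt it.

```python
from enum import Enum


class BargeInMode(Enum):
    DISABLED = "disabled"      # the user cannot interrupt
    ANY_SPEECH = "any_speech"  # anything the user says interrupts
    WAKE_WORD = "wake_word"    # only the wake word interrupts


def should_interrupt(
    heard_text: str,
    mode: BargeInMode,
    wake_word: str = "alexa",  # hypothetical default, for illustration only
) -> bool:
    """Decide whether speech heard while the system is talking should stop it."""
    words = heard_text.lower().split()
    if mode is BargeInMode.DISABLED or not words:
        return False
    if mode is BargeInMode.ANY_SPEECH:
        return True
    # WAKE_WORD mode: ignore side speech unless it starts with the wake word.
    return words[0].strip(",.!?") == wake_word.lower()
```

With the wake-word strategy, `should_interrupt("Alexa, stop", BargeInMode.WAKE_WORD)` is true, while background chatter such as "turn it down" is ignored.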

Timeouts

VUIs need to know when a user starts speaking as well as when the speech has ended. A timeout is how the system decides that the user’s turn is over. Choosing an optimum timeout is critical to a good VUI experience. Think of a video call where the other person’s voice lagged and it was difficult to follow the conversation.

There are different conditions under which the ASR engine can decide to time out:

End-of-speech timeout: Knowing when the user has finished talking, i.e. finished their turn in the conversation, is one of the most important characteristics of a good VUI system. This is sometimes referred to as endpoint detection. Responding the instant the user stops speaking feels unnatural. The system needs to pause for some time before continuing the conversation; it is basic conversational etiquette. Some ASR engines allow you to configure this pause, known as the end-of-speech timeout. A 1.5-second end-of-speech timeout is a good rule of thumb. However, there might be cases when you’d need a longer one, such as when the user is saying a long string of characters or numbers. For cases where the user only needs to give a one-word response such as ‘yes’ or ‘no’, a shorter timeout works fine.
No-speech timeout (NSP): As the name indicates, this timeout fires when no speech is detected at all. It is different from an end-of-speech timeout, where there is a concrete beginning and end to the user’s speech. This timeout is usually longer, at 10 seconds. It can be handled in different ways, ranging from showing the user a list of things or actions that can be done to doing nothing at all.
Too much speech (TMS): This is a rare case in which the user talks for too long without any pauses. In most scenarios, you don’t need to handle TMS timeouts. However, if you want to incorporate one, a good rule of thumb is for the system to time out after 7–10 seconds.

Incorporating timeouts is essential to know when the user has stopped speaking.
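To make the three timeouts concrete, here is a minimal Python sketch. The names `TimeoutConfig` and `check_timeout` are hypothetical and do not correspond to any specific ASR engine’s settings. The defaults follow the rules of thumb above (1.5 seconds, 10 seconds, and the upper end of 7–10 seconds); the 1.0 and 3.0 second values at the bottom are assumptions, picked only to illustrate "shorter" and "longer".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimeoutConfig:
    end_of_speech_s: float = 1.5     # pause allowed after the user stops talking
    no_speech_s: float = 10.0        # how long to wait for the user to start
    too_much_speech_s: float = 10.0  # cap on one uninterrupted utterance


def check_timeout(
    now_s: float,
    listening_since_s: float,
    speech_started_s: Optional[float],
    last_speech_s: Optional[float],
    config: TimeoutConfig = TimeoutConfig(),
) -> Optional[str]:
    """Return which timeout fired, or None if the system should keep listening.

    All arguments are timestamps in seconds on the same clock.
    """
    if speech_started_s is None:
        # The user has not said anything yet.
        if now_s - listening_since_s >= config.no_speech_s:
            return "no_speech"
        return None
    if last_speech_s is not None and now_s - last_speech_s >= config.end_of_speech_s:
        return "end_of_speech"
    if now_s - speech_started_s >= config.too_much_speech_s:
        return "too_much_speech"
    return None


# Assumed example values: shorter for yes/no answers, longer for digit strings.
YES_NO = TimeoutConfig(end_of_speech_s=1.0)
LONG_NUMBER = TimeoutConfig(end_of_speech_s=3.0)
```

Keeping the values in one small config object makes it easy to swap timeouts per prompt, for instance passing `YES_NO` after a confirmation question.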

N-best lists

When a user speaks with the VUI, the speech recognition system returns more than one response to what was said. It assigns a confidence value to each result and usually picks the one that has the highest confidence value. In simple terms, a confidence value is a percentage that indicates how confident the system is about a particular result. For example, when you say “Read me a book,” the system can interpret it as follows:

Read me a book : 95% confidence
Reed me a book : 70% confidence
Rid mia boo : 30% confidence

If you’ve designed your VUI to read books, the system would pick the first result.

A recognition engine often does not return only one result. It returns an N-best list, which is a list of what it thinks the user might’ve said, in order of likelihood. It usually contains the top 5–10 results along with a confidence score for each.

N-best lists are useful in cases where you’ve designed the system to answer in a narrow domain. For example, in a VUI that gives information about animals, when you say “Show me a badger,” the ASR tool might interpret it as follows:

Show me a badge her : 92% confidence
Show me a badger : 89% confidence

Since you already know that this VUI is about animal information, it can look for animal names and pick the second result even though it does not have the highest confidence value.

Another use of the N-best list is correcting information when the first answer turns out not to be valid: the system can try the next result on the list instead.
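The badger example can be sketched in Python as follows. This is an illustration under stated assumptions, not a real ASR API: `Hypothesis`, `pick_hypothesis` and the `ANIMALS` vocabulary are hypothetical names, confidences are expressed as fractions between 0 and 1, and the 0.5 minimum confidence is an arbitrary cut-off chosen for the example.

```python
from typing import NamedTuple, Optional, Sequence, Set


class Hypothesis(NamedTuple):
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a (hypothetical) ASR engine


def pick_hypothesis(
    n_best: Sequence[Hypothesis],
    domain_words: Set[str],
    min_confidence: float = 0.5,  # assumed cut-off, for illustration only
) -> Optional[Hypothesis]:
    """Prefer the most confident hypothesis that mentions a domain word.

    Falls back to the overall most confident hypothesis, or returns None
    when nothing clears min_confidence.
    """
    candidates = sorted(
        (h for h in n_best if h.confidence >= min_confidence),
        key=lambda h: h.confidence,
        reverse=True,
    )
    for hyp in candidates:
        if domain_words & set(hyp.text.lower().split()):
            return hyp
    return candidates[0] if candidates else None


ANIMALS = {"badger", "otter", "heron"}
n_best = [
    Hypothesis("Show me a badge her", 0.92),
    Hypothesis("Show me a badger", 0.89),
]
best = pick_hypothesis(n_best, ANIMALS)
print(best.text)  # Show me a badger
```

Because the function falls back to the top-ranked result when no domain word matches, it behaves like plain "pick the highest confidence" outside the narrow domain.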

Challenges for Automated speech recognition tools (ASRT)

Many studies show that automated speech recognition tools have an accuracy of more than 90%; however, this is under ideal conditions. An ideal condition for an ASR tool is an adult male in a quiet room with a good microphone. Rest assured, most real-life conditions are not ideal.

Noise

Handling noise is currently one of the most difficult challenges for ASR engines. Noise covers any noisy environment: a television in the background, multiple people talking at the same time, or side speech, when the user talks to another person while the VUI is listening. Some VUIs can detect noise and ask the user to move to a less noisy environment. Beyond that, there’s not much you can do apart from waiting for the technology to improve.

Children

It is much more difficult for ASR tools to accurately recognize children’s voices. Children have higher-pitched voices owing to shorter vocal tracts, and as yet there is less data available for recognizing that type of speech. Children also often stutter and repeat words, which is another challenge for ASR tools. As with noise, things are improving.

Names, spelling and alphanumeric

It is easier for ASR tools to recognize longer phrases like “yes I will” rather than shorter responses like “yes.” Names, spellings and alphanumeric strings are also tough to recognize and it often makes sense to have a GUI input for these. The name ‘Karan’ is often misinterpreted as ‘Karen.’ Asking the VUI to call Karan might accidentally call the wrong person. Again, the technology is improving.

User Data Privacy

When designing voice interfaces, it is important to make sure that you do not store private data unless absolutely necessary. When you do store this data, make sure the user is aware and has the choice to deny access. It might be tempting to store conversations and use this information to improve the experience. Even if the device is constantly listening, any data from before the user has said the wake word should not be stored or sent to the cloud. Users expect privacy, and breaking your users’ trust is the worst thing you can do.
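One way to picture the wake-word rule is a filter that sits between the microphone and anything that stores or uploads audio. The Python sketch below is only an illustration: `frames_after_wake_word` is a hypothetical name, and the on-device wake word detector is assumed rather than implemented (it is represented by a boolean flag on each chunk).

```python
from typing import Iterable, Iterator, Tuple


def frames_after_wake_word(
    frames: Iterable[Tuple[bytes, bool]],
) -> Iterator[bytes]:
    """Yield only audio captured after the wake word was detected.

    Each item is (audio_chunk, wake_word_detected), where the flag comes
    from an on-device detector (assumed here, not implemented). Chunks seen
    before the first detection, and the detection chunk itself, are never
    yielded, so the caller has nothing to store or upload from them.
    """
    awake = False
    for chunk, wake_word_detected in frames:
        if awake:
            yield chunk
        elif wake_word_detected:
            awake = True  # start forwarding from the next chunk


stream = [(b"tv noise", False), (b"alexa", True), (b"play", False), (b"jazz", False)]
to_upload = list(frames_after_wake_word(stream))
```

Here `to_upload` contains only the two chunks that follow the wake word; the background audio never leaves the filter.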

**References:**
1. Being Digital, Nicholas Negroponte
2. Designing Voice User Interfaces, Cathy Pearl
3. Design for Voice Interfaces, Laura Klein
