Guidelines for voice user interfaces

Home assistants have become the in thing. As I argued in my previous story, I still believe that, for now, it is better to develop for the unsexy phone systems than for the sexy new home assistants. But home assistants are the future, and lots of startups have realized that.

Just look at the list of startups that have built interfaces for Alexa and Google Home. But if you use any of these apps, you will immediately realize that not much thought has gone into the interface design. Most probably, this is the pipeline being followed:

1. Identify data to be exposed for voice apps.
2. Build APIs to expose that data.
3. Use the toolkits provided by the assistants, like the Alexa skill builder.
4. Test one flow and launch in production.
5. Announce to the world they have an Alexa integration. Job done.

Having used many of these apps, I have to say that almost none of them show any real thought about interface design.

The thing is, voice apps need a lot of thought on interface design, just like regular apps. In fact, I would say they need a lot more: in a visual app the screen compensates for a lot of missing information, whereas in speech a lot is left to the imagination.

Speech apps have to be designed with a completely different approach. We cannot apply the design principles that are applied to regular visual apps.

Speech is the most natural form of communication and as such it has been highly optimized over centuries of communication.

Consider the following use case, booking a flight ticket:

Speech: I want to book a ticket from San Francisco to New York on 20th October.

Time taken to specify intent: A couple of seconds.

Web/App: Minimum of 4 clicks (not counting the clicks needed to close the pop-ups) and a couple of screens.

Time taken to specify intent: Minimum of 20 seconds.

Speech is highly optimized for these kinds of scenarios. But there is a downside too.

Specifying intent is fast in speech, but choosing between options is slow. In the example above, everything is fine if you say you want a ticket and the system simply picks one. But what if there are multiple options and you have to choose? Which airline, how many stops, what price? Theoretically you could still form a single sentence that captures all of this, but it would not be natural. We fine-tune our choices through a conversation.

But making a choice visually is easier. We can scan the options faster and make a choice. In speech systems, we have to listen to the choices, remember them and then make a choice.
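
The speech side of this comparison can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `SLOT_PATTERNS`, the slot names, `extract_slots` and `missing_slots` are all hypothetical, the transcript is assumed to arrive with city and month names capitalized, and a real system would use a trained slot-filling model rather than regular expressions.

```python
import re

# Hypothetical patterns for the flight-booking example; real systems
# would use a trained NLU model instead of regular expressions.
SLOT_PATTERNS = {
    "origin": re.compile(r"\bfrom ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "destination": re.compile(r"\bto ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    "date": re.compile(r"\bon (\d{1,2}(?:st|nd|rd|th)? [A-Z][a-z]+)"),
}


def extract_slots(utterance):
    """Return every slot that the single utterance already fills."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[name] = match.group(1)
    return slots


def missing_slots(slots, required=("origin", "destination", "date")):
    """List the required slots the system still has to ask about."""
    return [name for name in required if name not in slots]


utterance = "I want to book a ticket from San Francisco to New York on 20th October."
filled = extract_slots(utterance)
print(filled)                 # the intent, captured from one sentence
print(missing_slots(filled))  # nothing left to ask for the basic intent
```

One sentence fills all three slots, which is why stating intent is fast. Airline, stops and price are not in the sentence at all, so each of them costs an extra conversational turn.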

So it’s very clear that designing interfaces for voice-based applications is different.

Based on my experience designing voice solutions, these are a few guidelines that I follow when designing VUIs (voice user interfaces).

1. Don’t try to cheat the user: In most cases, the user just wants some information. Don’t try to be too smart and convince the user that they are conversing with a human. You will fail. In fact, it helps to inform the user up front that they are dealing with a bot.
2. Have a backup plan: In most cases, the backup plan is to connect to a human. No matter how well you design your system, it will fail in some scenarios. So, rather than lose a customer, give the user the option to connect to a real agent so that their query can be answered.
3. Avoid long menus: People can’t remember more than 4 items. In fact, I am so bad that I have trouble remembering the previous 3 items in a menu. Break the menu up into smaller blocks and build a tree structure. Make sure the most common options are mentioned first.
4. Avoid menus if possible: With the advancements in speech technology, we are close enough to avoid menus in most cases. Have a good slot-filling natural language processing system along with a good speech recognition system and you should be able to avoid menus. But remember, this technology is still in a beta stage, so make sure point 2 above (the backup plan) is covered.
5. Give feedback: Silence is your biggest enemy when designing voice user interfaces. Remember, your users are completely blind, and if you don’t give auditory feedback, they become deaf too. And they will hate you for it. Fill silence with something. When you are waiting for a long-running process to return the info, fill the silence with hold music or with prompts like “Please wait, we are trying to fetch your information.” Anything is better than silence.
6. Use other channels: No matter how well you design your system, sometimes the voice channel is just not sufficient. So don’t hesitate to use other channels to complete the transaction. For example, in the airline booking scenario above, after the user has indicated their intent, you could say, “Thanks for the information. The list of flights from San Francisco to New York on 20th October has been sent to you as an SMS. Please choose an option, I will wait till you make a choice.” Then play hold music till the user makes a choice.
7. Give context: Remember, in a voice flow, users can get lost if the flow is big enough. So make sure you also repeat the options chosen. For example, rather than saying, “Thanks for choosing. Shall I confirm?”, it’s better to say, “Thanks for choosing. Shall I confirm your flight from San Francisco to New York on 20th October at 8 am?”
8. Give examples: When users first encounter your system, they may not know how to interact with it. Left to their own devices, they will phrase sentences in their own way, and you will have to build a system that can anticipate all the ways in which users can converse with your app. Instead, it is better to give examples of what they can say. For example, you can start by saying, “Welcome to Hello Travels. You can book your tickets by saying, I want to book a ticket from San Francisco to New York on 20th October.” You will be surprised to find how many people mimic the sentence in the exact same way.
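
A few of these guidelines translate directly into small helpers. The sketch below is only one possible shape, not tied to any assistant toolkit: `chunk_menu`, `confirmation_prompt`, `next_action`, and the two limits are hypothetical names and values chosen for illustration.

```python
MAX_MENU_ITEMS = 3  # assumption: keep each spoken menu block short
MAX_FAILURES = 2    # assumption: hand off to a human after two misses


def chunk_menu(options, size=MAX_MENU_ITEMS):
    """Avoid long menus: split options into short blocks.

    The input order is preserved, so pass the most common options first.
    """
    return [options[i:i + size] for i in range(0, len(options), size)]


def confirmation_prompt(slots):
    """Give context: repeat the chosen options when asking to confirm."""
    return (
        "Thanks for choosing. Shall I confirm your flight from "
        f"{slots['origin']} to {slots['destination']} "
        f"on {slots['date']} at {slots['time']}?"
    )


def next_action(failed_attempts):
    """Have a backup plan: stop re-prompting once the limit is reached."""
    if failed_attempts >= MAX_FAILURES:
        return "transfer_to_agent"
    return "reprompt"


booking = {
    "origin": "San Francisco",
    "destination": "New York",
    "date": "20th October",
    "time": "8 am",
}
print(chunk_menu(["book", "cancel", "status", "baggage", "refund"]))
print(confirmation_prompt(booking))
print(next_action(failed_attempts=2))
```

The limits are deliberately plain constants so they can be tuned against real call data instead of being guessed once and forgotten.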

Go ahead. Implement these guidelines in your own voice applications and let me know your feedback.