Testing Voice Apps Is Remarkably Difficult

Written by Terren_in_VA | Published 2017/01/14
Tech Story Tags: amazon-echo | software-testing | test-voice-apps | how-to-test-voice-apps | alexa


I’ve been authoring Alexa skills for the past year, and am recognized as an Alexa Champion by Amazon. During this time, I’ve made many mistakes, so please learn from my blunders!

Authoring voice apps for Amazon Alexa centers on designing the user flow and developing responses. That’s a good place to start, but don’t forget to include some early test cycles to validate your design. Spend time testing with your voice, not just traditional text-based messages to the Alexa API. Voice apps are a huge challenge to test — here’s why.

Testing Voice Apps != Testing Mobile/Web Apps

There are many different types of testing — unit, system, integration, performance, endurance, etc. These all carry over to Voice, but there are challenges to be ready for.

Challenge #1 — standard tooling is text-based

If you’re not familiar with testing an Alexa skill, basic testing starts in the developer console. It’s a great tool, although it steers you down a familiar road that you’ll need to get off of quickly.

The console facilitates testing the basic request-response model you’d use to exercise any RESTful API. Unfortunately, that’s not a full simulation: it uses text and only touches the backend of the app. That works for unit testing, but it doesn’t mean the skill is ready to publish.

Use your voice to execute tests early in the process. Think of it as a type of integration testing of the different layers of technology involved. Remember, nobody is ever going to use your skill with text. So while the console is an easy way to get started, it’s not a true representation of the real world.
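To make that concrete, here’s a minimal sketch of the kind of text-only request the console exercises, assuming a Python backend. The envelope is trimmed down for brevity, and lambda_handler is a hypothetical entry-point name; the PlayNote intent and Note slot come from the intent model shown later.

# A trimmed-down version of the text request that reaches the backend.
# Field names follow the Alexa IntentRequest format; session and context
# details are omitted, and lambda_handler is a hypothetical entry point.
import json

sample_request = {
    "version": "1.0",
    "session": {"new": True, "sessionId": "test-session-001"},
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "PlayNote",
            "slots": {"Note": {"name": "Note", "value": "C Sharp"}}
        }
    }
}

# In a unit test this envelope is passed straight to the skill's entry point,
# skipping the microphone, speech recognition, and intent resolution entirely.
# response = lambda_handler(sample_request, None)
# assert "outputSpeech" in response["response"]
print(json.dumps(sample_request, indent=2))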

Challenge #2 — Voice generated input data creates infinite test cases

Building voice applications leverages powerful machine learning capabilities that convert speech to text. It’s a thrill to be able to tap into that power, but it’s a different model than ones that rely on a keyboard, mouse, or touchscreen. Let’s explore why.

The web has drop-down boxes, radio buttons, freeform text fields, etc., driven by a keyboard and mouse. Mobile has similar controls, plus the ability to track the position of a finger on a frame of pixels. Each of these patterns has a finite number of permutations, so full test coverage is possible.

The framework for a voice platform like Alexa starts with an intent model. This model maps all possible outcomes for a voice invocation. Publishing the skill loads the model into the platform. This allows the Alexa machine learning to pick the most likely intent based on the vocal request at runtime. The requests are then translated into text before passing the data over to the API. Here’s an example of what an intent model looks like.

{
  "intents": [
    {
      "intent": "PlayNote",
      "slots": [
        { "name": "Note", "type": "LIST_OF_NOTES" }
      ]
    },
    {
      "intent": "TeachSong",
      "slots": [
        { "name": "SongTitle", "type": "LIST_OF_SONGS" }
      ]
    },
    { "intent": "ListSongs" },
    { "intent": "Help" },
    ...
  ]
}

For variables, slots are contained within the intent and get parsed out of the request. The model also holds the possible slot values, as well as the different utterances that should trigger each intent. Now remember, these custom slots don’t force the data into patterns — they’re just used to train the machine learning models on what’s most likely to be heard. If the user clearly asks for a value that isn’t spelled out, Alexa can still pass it through to the API. For example, if someone asks to play the note “T Sharp” (something not on any piano keyboard), there had better be exception handling written to cover this case.
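Here’s a minimal sketch of what that exception handling could look like, assuming a Python backend. The VALID_NOTES set, handler function, and response builder are illustrative names, not code from the published skill.

# Hypothetical guard for the PlayNote intent: the custom slot type does not
# enforce values, so anything the recognizer hears can arrive in the Note slot.
VALID_NOTES = {"A", "A Sharp", "B", "C", "C Sharp", "D", "D Sharp",
               "E", "F", "F Sharp", "G", "G Sharp"}

def handle_play_note(intent):
    slot = intent.get("slots", {}).get("Note", {})
    note = slot.get("value")  # may be missing, or something like "T Sharp"
    if note is None or note.title() not in VALID_NOTES:
        # Reprompt instead of crashing when the value is off the keyboard.
        return build_response("I didn't catch a note I can play. "
                              "Try a note like C Sharp.", end_session=False)
    return build_response("Playing " + note + ".", end_session=True)

def build_response(text, end_session):
    # Minimal Alexa-style response envelope.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

# handle_play_note({"slots": {"Note": {"name": "Note", "value": "T Sharp"}}})
# returns a reprompt rather than raising an exception.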

An analogy from the touch world: on a mobile device, when a user is given a screen with a few buttons and misses a button, hitting gray space instead, nothing happens. The GUI framework enforces precision and limits the number of potential testing scenarios. Voice is different. Because there are no such constraints, the variations in user behavior are infinite: the voice gets translated into text, and statistical models decide which intent the request will flow into.

Okay — now ready for the red pill? Try testing your skill by coughing, letting your cell phone ring, or creating other non-verbal sounds. These are also potential audio inputs that may occur during a session. How will your skill handle them? Will it throw an exception? Better yet, if you’re in the middle of a session, how forgiving is your skill?

Voice applications need something like a browser’s “back arrow” to handle real-world situations like this. I haven’t seen a natural pattern emerge yet that solves every scenario, and I’d like to see the platform handle more of the filtering before invoking the API. What I’ve done in my skills is to save session state between utterances: a cough can’t disrupt the flow of the skill, because if the response is nonsense I can reload the session and give the user another try. This is a good test of user experience, and these things do surface in the real world!
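A rough sketch of that session-state approach, again assuming a Python backend; the attribute names and helper functions are invented for illustration rather than taken from the skill.

# Sketch of keeping the last known good state in sessionAttributes so a cough
# or other nonsense utterance gets a reprompt instead of derailing the lesson.
def handle_request(event):
    session_attrs = event.get("session", {}).get("attributes") or {}
    intent = event["request"].get("intent", {})

    if not is_recognized_intent(intent):
        # Reload the saved state and give the user another try.
        return build_response(
            "Sorry, I missed that. We were on step {}. "
            "Say the note when you're ready.".format(session_attrs.get("step", 1)),
            attributes=session_attrs,
            end_session=False)

    session_attrs["step"] = session_attrs.get("step", 0) + 1
    return build_response("Great, moving on.", attributes=session_attrs,
                          end_session=False)

def is_recognized_intent(intent):
    # Placeholder check: the real skill would validate the intent name and
    # slot values against what the current lesson step expects.
    return intent.get("name") in {"PlayNote", "TeachSong", "ListSongs", "Help"}

def build_response(text, attributes, end_session):
    # Echo the session attributes back so state survives to the next turn.
    return {
        "version": "1.0",
        "sessionAttributes": attributes,
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }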

Challenge #3 — your best intentions can be your worst enemy

When creating an intent file, one might think that coming up with as many paths as possible is helpful. Here’s a nasty bug I found in one of my recent skills. First, for context: this is a skill that teaches how to play the piano, and it has a song catalog behind the “ListSongs” intent.

ListSongs list songs
ListSongs list available songs
ListSongs what songs can you play
ListSongs what songs do you have
ListSongs which songs are there

Seems reasonable? The challenge is that by adding extra utterances with very broad terms or phrases like “what”, “which”, “can you”, “do you”, and “are there”, you can accidentally collide with speech patterns that also appear in other utterances. The broader the set of possible combinations mapped to an intent, the greater the likelihood of Alexa picking that path. I like to think of this in terms of coverage on a map: the space is limited, and adding more paths defaults more possible combinations toward that feature.

In my skill, I found that users asking broad questions about other functionality (for example, “Can you play Beethoven”) would instead be picked up as requests to list songs. Once I removed these extra utterances, accuracy improved for the other features in the skill.

So be careful being “helpful” in providing many paths — especially for non-core functionality.

Challenge #4 — how to scale volume of test cases

By now everyone understands that quality product development assumes some level of test automation. The more features you add to an application, the more coverage is required — but how do you automate with voice? The unit test cases can be automated easily enough, but I’ve already called out the risk of being too focused on a backend that doesn’t exercise the rest of the voice platform’s assembly.
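Automating the text layer is still worth doing, though. A rough sketch, assuming a Python backend and a hypothetical lambda_handler entry point; note that this only covers the backend, not speech recognition or intent resolution.

# Rough sketch of batch-running text-level test cases against the skill backend.
# The intents come from the model above; the expected phrases and the
# lambda_handler entry point are illustrative.
TEST_CASES = [
    ("PlayNote",  {"Note": "C Sharp"}, "playing"),
    ("PlayNote",  {"Note": "T Sharp"}, "didn't catch"),
    ("ListSongs", {},                  "songs"),
]

def make_request(intent_name, slot_values):
    slots = {name: {"name": name, "value": value}
             for name, value in slot_values.items()}
    return {
        "version": "1.0",
        "session": {"new": True, "sessionId": "test"},
        "request": {"type": "IntentRequest",
                    "intent": {"name": intent_name, "slots": slots}},
    }

def run_suite(handler):
    for intent_name, slot_values, expected_phrase in TEST_CASES:
        response = handler(make_request(intent_name, slot_values), None)
        speech = response["response"]["outputSpeech"]["text"]
        assert expected_phrase.lower() in speech.lower(), (intent_name, speech)

# run_suite(lambda_handler)  # wire this up to the skill's real entry point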

The problem with testing voice systems is that playing back real audio takes time if you want a quality test using verbal data. If the audible utterance for each test case takes 5–10 seconds, plus another 10–20 seconds to process a response, 100 test cases could take the better part of an hour to run. 1,000+ test cases become unmanageable without frameworks that I haven’t yet found on the marketplace.

My current thinking is to take an old mobile device, use its voice playback function, and stick it next to the Alexa to play back the test data. I can record the test data once, then play it back again and again (potentially in an overnight run). Of course, that brings up one of my favorite Star Wars quotes.

“Machines making machines. Hmm. How perverse!” — C-3PO on Geonosis, after seeing the droid factory.
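Speaking of machines making machines: instead of recording the clips by hand, a text-to-speech service such as Amazon Polly could synthesize them once so they can be replayed endlessly. Here’s a rough sketch using boto3; the invocation phrase and file names are illustrative, and this is a variation on the recording idea, not a full framework.

# Hedged sketch: pre-generate the spoken test utterances once as MP3 files,
# then play them back next to the device on a loop. Amazon Polly is one way
# to produce the clips; the invocation phrase below is illustrative.
import boto3

UTTERANCES = [
    "Alexa, ask piano teacher to list songs",
    "Alexa, ask piano teacher to play C Sharp",
]

polly = boto3.client("polly")

for i, text in enumerate(UTTERANCES):
    result = polly.synthesize_speech(Text=text, OutputFormat="mp3",
                                     VoiceId="Joanna")
    with open("testcase_{:03d}.mp3".format(i), "wb") as f:
        f.write(result["AudioStream"].read())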

Is Guerrilla Testing the Answer?

I recently read a good book on VUI design, Designing Voice User Interfaces by Cathy Pearl, and two of its chapters cover testing and explore this topic in some detail. One idea it suggests is guerrilla testing — getting random users to actually use the product and see if they break it. This seems promising, since such testers could cover many different random cases and use different variations of voice, but the challenge is how to find them. A large corporation might have a department that can handle this — but how does a 2–5 person startup do it?

How are you testing your skills?

So these are the main unexpected challenges that I’ve run into after one year of developing on the platform. This is a topic where I’m still scratching my head some and looking for ways to improve. I welcome thoughts and ideas!

