Everyone’s Doing It, But That Doesn’t Make It Easy: What To Consider When Building A Good Voice Or Gesture Product

Written by marcinkloda | Published 2019/02/08
Tech Story Tags: artificial-intelligence | ai | product-development | product-design


This article is written by Marcin Kloda, the Vice President of High Technology and GM Americas at intive. intive is a software company focused on digital product development, with more than 18 years of experience and 150+ apps delivered.

We’ve always been fascinated by communicating with machines — yes, even before Siri and Alexa. In fact, conversational systems go back way further than most people might think. Consider Audrey, for example. Born in 1952, Audrey was a six-foot-tall computing system developed by Bell Labs that could recognize spoken digits from zero to nine — at least from voices it was familiar with.

Almost 70 years later, we’ve come a long way. Voice products are no longer a novelty: more and more companies are developing them to cater to consumers’ need for convenience. In fact, the global voice recognition market is expected to reach $126.5 billion by 2023. The gesture recognition industry is growing as well, although it is significantly smaller; that market is set to reach $33.05 billion by 2025.

Both technologies have entered the mainstream. But that doesn’t mean building a voice or gesture product is easy. In 19 years on the market, our software development firm has seen plenty of companies fail to implement good voice and gesture recognition products. Consumers have gotten used to having the tech on their devices and quickly get frustrated when their commands are not understood, yet they rarely consider the difficulty of building a voice or gesture product. To make something consumers will love, newcomers should follow these tips:

Reduce the noise

If voice or gesture products are tested in a controlled environment — that is, in a quiet, empty room — most won’t have a problem recognizing what a user is trying to say or signal. But out in the real world, many struggle to decipher language or movements. Kids talking in the backseat? If the product isn’t up to par, don’t expect the voice tool to understand a thing.

Voice and gesture recognition technology is made to be used on the go, so before launching a product on the market, companies must ensure it performs well in the real world. In short: they need to find a way to filter out all the noise.

Outside noise is a problem in almost all machine learning solutions, but it hits radar and voice (both one-dimensional signals) hardest, because the data carries little redundant information. The classical approach to the noise problem is to use a microphone array and perform beamforming for selective listening: thanks to the array of mics and AI algorithms, you can separate voices from each other.
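
As an illustration, here is a minimal delay-and-sum beamformer in Python, the simplest form of the selective-listening idea. The array geometry, sample rate, and steering angle in the usage comment are illustrative assumptions; real systems use fractional-delay filters and adaptive weighting rather than whole-sample shifts.

```python
import numpy as np

def delay_and_sum(mic_signals, sample_rate, mic_positions, direction):
    """Steer a mic array toward `direction` by delaying each channel
    so that sound arriving from that direction adds up in phase.

    mic_signals:   (n_mics, n_samples) array of recordings
    mic_positions: (n_mics,) mic offsets along the array axis, in meters
    direction:     arrival angle in radians (0 = broadside)
    """
    speed_of_sound = 343.0  # m/s at room temperature
    n_mics, n_samples = mic_signals.shape
    output = np.zeros(n_samples)
    for m in range(n_mics):
        # Relative time for the wavefront to reach this mic.
        delay = mic_positions[m] * np.sin(direction) / speed_of_sound
        shift = int(round(delay * sample_rate))
        # Align the channel; sound from `direction` now overlaps constructively,
        # while noise from other directions stays misaligned and averages out.
        output += np.roll(mic_signals[m], -shift)
    return output / n_mics

# Illustrative use: a 4-mic linear array with 5 cm spacing, steered 30 degrees off-axis.
# signals = np.stack([ch0, ch1, ch2, ch3])
# enhanced = delay_and_sum(signals, 16000, np.arange(4) * 0.05, np.deg2rad(30))
```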

Unless a company has voice and gesture experts in house, it’s prudent to find a software vendor to test the end device in a range of real environments. Voice recognition products must be tested in settings that simulate the interruptions that may occur: children in the car, external noises, and music or conversations in the background are among the variables in the simulations. Other factors should also be taken into account, such as the accents of a particular region. This can prove difficult, as some states, such as California, are home to accents from all over the world.
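
One simple way to simulate such conditions is to mix recorded background noise into clean test commands at a controlled signal-to-noise ratio. The sketch below assumes plain NumPy arrays of audio samples; the function and variable names are made up for illustration.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Overlay `noise` on `speech` at the requested signal-to-noise ratio.

    Both inputs are float arrays of audio samples; the noise is looped
    or truncated to match the speech length.
    """
    noise = np.resize(noise, speech.shape)  # loop/trim noise to fit
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: test the same command against chatter at 5 dB SNR (hard)
# and 20 dB SNR (easy), then compare recognition accuracy.
# noisy_sample = mix_at_snr(clean_command, backseat_chatter, snr_db=5)
```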

For gesture recognition products, it’s critical to embed a Digital Signal Processor (DSP) that can recognize when a command has started and then pass it on to a larger DSP to interpret the gesture. The DSP also needs to filter out the “noise” people make with their gestures: users make unrecognizable movements, and it’s up to the DSP to catch the correct movement and interpret it. To anticipate this, the DSP is tested with the full flow of the movement and with deliberate interruptions, as sketched below.
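
Here is a minimal sketch of that start-detection step, assuming frames of sensor samples and a hand-picked energy threshold (both invented for illustration): a single loud frame is dismissed as noise, and only a sustained rise wakes the larger processor.

```python
import numpy as np

def detect_gesture_start(frames, energy_threshold, min_active_frames=3):
    """Return the index of the frame where a gesture plausibly begins.

    frames: (n_frames, frame_len) windows of sensor samples.
    A single loud frame is treated as noise; several consecutive frames
    above the threshold are required before waking the big DSP.
    """
    active_run = 0
    for i, frame in enumerate(frames):
        energy = np.mean(frame ** 2)
        if energy > energy_threshold:
            active_run += 1
            if active_run >= min_active_frames:
                return i - min_active_frames + 1  # gesture started here
        else:
            active_run = 0  # an isolated spike: reset, it was just noise
    return None  # nothing that looks like a deliberate gesture
```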

A DSP is a processor whose architecture and instruction set are optimized for efficient computation on long numerical sequences that represent signals, including acoustic and visual ones. The meaning behind what is being said can then be extracted by applying Natural Language Processing, and gestures can be interpreted from the cleaned-up signal.

DSPs are valuable in gesture recognition because radar-based solutions such as Google Soli have not been fully developed yet. In the future, radar could become the go-to for gesture recognition products, but it has its limitations — for example, it only works within a very close range of the sensor. DSPs are designed to process various digital signals at a high rate: some process audio at very low power, while other, more power-hungry ones are capable of real-time video decoding.

Translating accents

The biggest challenge for voice recognition is to properly interpret what people say. Voice recognition engines need to learn each user’s English and transcribe it accurately. A newly purchased smart speaker initially has no experience with the person who bought it; it needs to adjust to each individual’s accent.

If the user is a native speaker, little adjustment is needed, because the digital assistant is already trained on native English. Otherwise, it adapts by progressively updating its default voice model with information about how the user pronounces words and phrases, and by learning the nuances that distinguish his or her voice from those of other people. Since adjustments to the specific voice model are stored in the cloud, they can easily be shared with other devices in the user’s household registered to the same account, shortening the overall training process.

For speech recognition, adaptation happens at a very low level; higher-order features of speech, such as tempo, are already accounted for in a sequence model. Adaptation also has a disadvantage: if a speech model adapts to one speaker, it becomes less accurate at recognizing other speakers. One way to overcome this is to run a general model and a speaker-dependent model simultaneously, but that increases computational requirements and energy usage, so it is used only in specific scenarios such as mobile assistants or text dictation.
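
A rough sketch of that dual-model idea, with `general_model` and `adapted_model` standing in as hypothetical recognizer objects that return a transcript and a confidence score:

```python
def recognize(audio, general_model, adapted_model):
    """Run two recognizers in parallel and keep the more confident result.

    Both arguments are hypothetical recognizer objects whose transcribe()
    returns a (transcript, confidence) pair; doubling the decoding work
    is exactly the computational cost mentioned above.
    """
    general_text, general_conf = general_model.transcribe(audio)
    adapted_text, adapted_conf = adapted_model.transcribe(audio)
    # The speaker-adapted model wins for its owner; the general model
    # covers guests whose voices it has never heard.
    if adapted_conf >= general_conf:
        return adapted_text
    return general_text
```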

Most research goes into creating a general model able to recognize various styles of pronunciation. Alexa, for example, may ask the user to read ten sentences in order to recognize them. Distinguishing different speakers is an area that demands a lot of research. The most popular approach models the human vocal tract, which yields features a speaker cannot easily alter; this makes it possible to catch someone faking another person’s voice, and even to detect a voice actor.
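
As a deliberately crude sketch of vocal-tract-style features, the snippet below uses MFCCs (a classic stand-in for vocal-tract shape) via librosa. Production speaker verification relies on far stronger embeddings, and the distance threshold here is an arbitrary placeholder.

```python
import librosa
import numpy as np

def speaker_embedding(wav_path):
    """Summarize a recording as averaged MFCCs, which roughly capture
    the vocal-tract filter a speaker cannot easily fake."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    return mfccs.mean(axis=1)  # one 13-dim vector per recording

def same_speaker(path_a, path_b, threshold=25.0):
    """Crude comparison: a small distance between embeddings suggests
    the same vocal tract. The threshold is purely illustrative."""
    dist = np.linalg.norm(speaker_embedding(path_a) - speaker_embedding(path_b))
    return dist < threshold
```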

Develop an easy UX

The voice and gesture recognition market is highly competitive. Human-machine interfaces are considered an area where a lot remains to be done, and high hopes are pinned on making use of humans’ most basic communication abilities. The development of touch input, for example, was a breakthrough that hugely accelerated the smartphone revolution, but it has limitations — such as bandwidth and the need to be near the input surface — that voice and gesture understanding does not. If those limitations are overcome, another revolution could be in the making. And while this is great for industry innovation, it has a downside: since everyone on the market is trying to set themselves apart with unique functionalities, apps are becoming overly complicated.

In the race to the top, many companies have developed a clouded view of what customers really want and need. The typical customer of a sports app, for example, wants a well-developed social environment: comparing scores, challenges, and achievements with other users. Companies spend huge amounts of money building products that fail in the end anyway. That’s what happens when a company has great engineers but not great designers.

To build voice or gesture products that people will actually use, it’s important to partner with vendors that understand UX. Be sure to request case studies or testimonials from other clients, and ask the designers to analyze the product to find its pain points. Designers should go through the whole installation process and then provide a report stating what discouraged them about using the product — and therefore, what would discourage the end customer too.

The thing is, there’s no clear-cut set of features that makes up a great UX for voice and gesture; that’s why it’s so difficult to build a great product. But one great example of ease of use is Alexa. The voice assistant has many built-in features, including more than 15,000 “skills”, the equivalent of apps on a smartphone. These skills let Alexa learn new capabilities, such as ordering an Uber or reading recipes; it can also add events to a calendar and place orders from Amazon.

Third-party developers can add new skills using the Alexa Skills Kit, which exposes programmatic interfaces to the assistant (and to other parts of the Amazon ecosystem, like its comprehensive back-end and easy payments), enabling the creation of custom, voice-based services without having to dive into the details of voice recognition or natural language understanding.
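
As a sketch, here is what a minimal custom skill handler can look like with the ASK SDK for Python; the intent name and the spoken response are invented for illustration, not taken from a real published skill.

```python
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_intent_name

class ReadRecipeHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        # Fires when Alexa's NLU maps the user's utterance to our custom intent.
        return is_intent_name("ReadRecipeIntent")(handler_input)

    def handle(self, handler_input):
        # The developer only decides what to say; voice recognition and
        # language understanding have already happened on Amazon's side.
        speech = "Step one: preheat the oven to 180 degrees."
        return handler_input.response_builder.speak(speech).response

sb = SkillBuilder()
sb.add_request_handler(ReadRecipeHandler())
lambda_handler = sb.lambda_handler()  # deployable as an AWS Lambda entry point
```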

Look for partners instead of suppliers

Software development suppliers work on a project-to-project basis, so once an app is complete, they wipe their hands clean of it. It’s better to look for vendors that will be with you for the long haul; long-term partners tend to be more invested in the product and more willing to suggest creative changes based on the needs of the end user.

While it’s always difficult to create a product that’s truly unique, an invested partner will go through the product and, well, be blatantly honest about it. They’ll state why the product is better or worse than a competitor’s, and offer suggestions on where to improve it if necessary.

Partners should also be open to producing a first iteration of the product. This is less than an MVP, but it can help determine whether voice or gesture capabilities will actually add real value. If they don’t, the company has no obligation to continue the relationship, which saves it from spending thousands of dollars on a product with little chance of success.

Just because everyone is implementing voice and gesture capabilities doesn’t mean they’re doing it right. But companies that pay special attention to developing a simple UX, testing the product in a range of environments, and choosing a long-term vendor will have a much higher chance of success.

