Wave Hello to the Future: Designing Intuitive Gesture Recognition Systems for Smart Devices

Written by dkmath | Published 2023/05/10
Tech Story Tags: product-management | ai | ml | gesture-recognition | hand-gesture-recognition | machine-learning | smart-devices | hardware

TLDR: In this article, we'll dive into gesture recognition systems, why and how these services are created, and what product and technical teams need to consider in their development.

Have you ever found yourself in a situation where you need to interact with a smart home device, Smart TV, or Smart Display but can't touch it? Imagine you are in the kitchen with wet or dirty hands and need to switch to the next episode on your Smart TV. Touching the remote or the device isn't practical in this situation. Similarly, if your phone starts ringing and you need to quickly pause the music playing on it, it's much more convenient to use a voice command than to touch the device. Thanks to gesture recognition and voice recognition technology, users can now control their devices multimodally: via gestures or voice.

My name is Daria, and I am a Senior Product Manager with 5+ years of experience in the tech industry, working on products related to hardware, computer vision, and voice recognition. In this article, we'll dive into gesture recognition systems, why and how these services are created, and what both product and technical teams need to consider in their development. I will share personal examples, my own experience, and reflections on the mistakes my team and I made while developing multimodal navigation for smart devices at SberDevices.

Why?

Enter touchscreens, the game-changer for tech interaction. Swipe, tap, and voila: complex actions made easy and user experiences amped up. But it wasn't always like this; before the iPhone was released, the only way to interact with a device was through a series of buttons.

The iPhone's release marked a turning point, launching us into the touchscreen era, fueled by Apple's extensive research and development efforts. With its debut, the iPhone introduced an intuitive touch interface that eliminated the need for a stylus or buttons, setting it apart from competitors. But now, we're on the brink of another transition: from touchscreens to touchless interfaces.

Touchscreens no longer suffice for users' needs, leading to the emergence of gesture control as an additional modality for interaction. Designing this new experience is an incredibly creative task. Touchscreen concepts appeared in Star Trek long before real prototypes existed, while Blade Runner and Black Mirror showed gesture-driven interfaces. Our sci-fi-inspired imaginations have envisioned touchless interfaces for years, and as technology advances, we eagerly anticipate which other futuristic concepts will become reality.

What exactly?

To be specific, let's examine interaction examples and use cases that arise. For instance, smart TV devices with cameras, such as Facebook Portal or SberBox Top, are designed with voice-first interfaces, thanks to their onboard virtual assistants that create an exceptional assisted experience. However, these devices still come with a traditional remote. The presence of a camera adds another dimension of interaction – gestures.

Users can make hand movements in the air without touching the screen or remote, creating a blend of touchscreen and remote control where the system responds to gestures. While I believe that we'll eventually transition to fully touchless interfaces, current computational and recognition technology limitations place us in a transitional period. We're developing multimodal interfaces, allowing users to choose the most convenient method – voice, remote, or gesture – for accomplishing their tasks.

How?

I will share insights and recommendations derived from my team's work in developing touchless interfaces. These guidelines aim to help others avoid unnecessary mistakes when tackling similar challenges. By using these recommendations as a reference point, you can streamline your process and make more informed decisions.

  1. Understand Usage Contexts: Identify situations where users may need touchless interactions, such as when their hands are wet or dirty, or when interacting from a distance.
  2. Understand Technical Limitations: Consider device constraints like recognition methods, hardware load limits, acceptable delays, and operating ranges under various lighting conditions (a small profiling sketch follows this list).
  3. Formulate a Gesture Basket: Create a list of user-friendly gestures based on usage contexts and technical limitations. Determine intuitive gestures for specific tasks, like adjusting the volume or pausing a video.
  4. Conduct Iterative Testing: Refine the gesture recognition system through extensive testing under different conditions and by gathering user feedback to enhance accuracy and user experience.
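
To give item 2 a concrete shape, here is a minimal sketch of how a team might profile per-frame recognition latency against a real-time budget before committing to a gesture set. The 30 FPS budget and the stand-in recognizer are assumptions for illustration; on a real device you would wrap your actual model call.

```python
import time
import statistics
from typing import Callable

def profile_inference(infer: Callable, frame, target_fps: float = 30.0,
                      warmup: int = 10, runs: int = 100) -> bool:
    """Measure per-frame latency of a recognizer and compare it to a real-time budget."""
    budget_ms = 1000.0 / target_fps              # e.g. ~33 ms per frame at 30 FPS

    for _ in range(warmup):                      # let caches and lazy initialization settle
        infer(frame)

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(frame)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    p95 = statistics.quantiles(latencies_ms, n=20)[-1]   # 95th-percentile latency
    print(f"median={statistics.median(latencies_ms):.2f} ms, "
          f"p95={p95:.2f} ms, budget={budget_ms:.2f} ms")
    return p95 <= budget_ms                      # True if the model fits the frame budget

if __name__ == "__main__":
    dummy_frame = [[0] * 224 for _ in range(224)]        # placeholder for a camera frame
    fits = profile_inference(lambda f: sum(map(sum, f)), dummy_frame)
    print("fits real-time budget:", fits)
```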

When designing a gesture recognition system, select movements that are easy to execute, culturally appropriate, consistent, and unique. The gestures should be logically connected and adaptable for future additions.

Criteria for our gesture basket:

  1. Ease of execution: Gestures should be simple to perform with one hand.
  2. Cultural appropriateness: Gestures should not be offensive or have negative cultural connotations.
  3. Consistency: Gestures should be recognizable and consistent across users.
  4. Uniqueness: Gestures should not resemble common, accidental movements made in daily life.
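
To make the gesture basket idea concrete, here is a minimal sketch of how such a basket could be declared in code and sanity-checked against the criteria above. The gesture names, actions, and checks are illustrative assumptions, not the actual SberDevices gesture set.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    name: str                                  # e.g. "open_palm"
    action: str                                # what the device does when it sees the gesture
    one_handed: bool = True                    # criterion 1: easy to execute with one hand
    resembles_everyday_motion: bool = False    # criterion 4: risk of accidental triggers

# Hypothetical basket; the real gesture-to-action mapping comes from product research.
GESTURE_BASKET = [
    Gesture("open_palm",   "pause_playback"),
    Gesture("thumbs_up",   "like_current_item"),
    Gesture("swipe_right", "next_episode"),
]

def validate_basket(basket):
    """Flag gestures that violate the basket criteria before they reach data collection."""
    problems = []
    names = [g.name for g in basket]
    if len(names) != len(set(names)):
        problems.append("duplicate gesture names")        # criterion 3: consistency
    for g in basket:
        if not g.one_handed:
            problems.append(f"{g.name}: needs two hands")
        if g.resembles_everyday_motion:
            problems.append(f"{g.name}: resembles an accidental everyday movement")
    return problems

print(validate_basket(GESTURE_BASKET))   # an empty list means the basket passes these checks
```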

The technical and data collection side

One of the toughest challenges we faced in creating our gesture recognition system was working with data on two fronts: how to collect it and how to annotate it.

When collecting datasets from paid respondents, they tended to perform the movements more accurately and mechanically, as they followed specific instructions. However, in real life, people behave quite differently – they may be lounging on a couch or in bed, resulting in more relaxed and imprecise movements. This created a significant gap between the dataset domains and real-world scenarios. To address this issue, we collaborated with actors who could take their time to get into character and exhibit more natural behaviour, allowing us to gather a more diverse and representative set of data.

But that wasn't the end of our problems! After collecting the data, we had to correctly label each movement, which was a daunting task in itself, as it was often difficult to determine where the movement began and ended.
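
To make the annotation problem tangible, here is a minimal sketch of the kind of temporal labels involved: each recorded clip gets a gesture label plus explicit start and end frames, which is exactly where annotators tend to disagree. The field names and the overlap metric are illustrative assumptions, not our production schema.

```python
import json

# One annotated gesture segment inside a recorded clip (illustrative schema).
annotation = {
    "clip_id": "session_042_cam0",
    "label": "swipe_right",
    "start_frame": 118,      # first frame where the hand clearly starts moving
    "end_frame": 140,        # last frame of the movement, before the hand relaxes
    "annotator": "a17",
}

def segment_overlap(a, b):
    """Frames two annotators both marked as the gesture; a rough agreement measure."""
    overlap = min(a["end_frame"], b["end_frame"]) - max(a["start_frame"], b["start_frame"]) + 1
    return max(0, overlap)

other = dict(annotation, start_frame=121, end_frame=145, annotator="a03")
print(json.dumps(annotation, indent=2))
print("frames both annotators agree on:", segment_overlap(annotation, other))
```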

We faced issues like defining consistent movements, accounting for user backgrounds and lighting conditions, and balancing movement complexity with model retraining. Iterative testing helped us refine our system as we collected data from different angles and under different lighting conditions.

The key aspect of our work was the iterative beta testing that our team started piloting in the early stages when the recognition network was not yet perfect. We conducted a closed beta with respondents, using a false positive detection system. When the network recognized a movement, it saved that frame on the device, and only the device owner had access to these frames. This allowed us to quickly receive feedback on unique real-life cases where we performed poorly. Immediately after receiving feedback, we collected new data on a larger scale to cover that particular case. For example, at the very beginning, the network recognized holding a cup in one's hand as a 'like' gesture, and we collected data from people holding cups to retrain the network.
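
Below is a minimal sketch of that on-device feedback loop, assuming a hypothetical `recognize(frame)` call that returns a label and a confidence score and a `save_frame` helper; the storage path and threshold are illustrative. The key property is that triggering frames stay on the device, matching the constraint that only the owner can access them.

```python
import os
import time

SAVE_DIR = "/data/local/gesture_feedback"    # illustrative on-device storage path
CONFIDENCE_THRESHOLD = 0.8                   # assumed trigger threshold

def on_frame(frame, recognize, save_frame):
    """Run the recognizer on one camera frame and keep a local copy of anything it fires on."""
    label, confidence = recognize(frame)
    if label is not None and confidence >= CONFIDENCE_THRESHOLD:
        # Store the triggering frame locally so the device owner can review false
        # positives later (e.g. a cup held in the hand recognized as a 'like').
        os.makedirs(SAVE_DIR, exist_ok=True)
        path = os.path.join(SAVE_DIR, f"{int(time.time() * 1000)}_{label}.jpg")
        save_frame(frame, path)
        return label
    return None
```

Frames collected this way became the seed for the next round of targeted data collection, such as gathering clips of people holding cups after the 'like' false positives.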

Designing a gesture recognition system is no easy task, and we encountered several unexpected challenges along the way:

  1. Defining a consistent set of movements proved difficult, as people executed even the most basic gestures in different ways.
  2. The diversity of users' backgrounds and lighting conditions made it challenging for our system to differentiate between movements and background elements.
  3. We faced a difficult decision: should we make the movements more complex for users or retrain the model to account for movement diversity?
  4. Iterative testing was crucial to refining our gesture recognition system, and we had to collect data from different angles and lighting conditions to ensure its effectiveness.

Key takeaways

  1. Context is key: When designing gesture recognition interfaces, it's important to consider the context in which users will be interacting with the device. For example, users might be in different physical positions, have different levels of lighting, or be wearing different types of clothing. To account for these factors, we need to design interfaces that are adaptable and responsive to these varying contexts.
  2. Iterative testing is essential: One of the biggest challenges in developing gesture recognition systems is the high degree of variability in how users perform gestures. This means that it's essential to test and refine our systems iteratively, both in controlled settings and in real-world scenarios. By continuously gathering feedback from users and refining our algorithms, we can improve the accuracy and effectiveness of our gesture recognition systems over time.
  3. Data annotation is crucial: In order to train machine learning algorithms to recognize gestures, we need large amounts of annotated data that accurately represent the range of movements that users might make. However, annotating this data can be a time-consuming and labour-intensive process. To address this challenge, we can use automated annotation tools, or leverage crowdsourcing platforms to enlist the help of a large pool of annotators.
  4. User feedback is essential for success: Finally, in order to create truly successful gesture recognition products, we need to continually gather feedback from our users. This can involve conducting user surveys, analyzing usage data, or even directly observing users as they interact with our products. By listening to our users and incorporating their feedback into our product design and development process, we can create gesture recognition interfaces that are intuitive, effective, and truly responsive to their needs.



Written by dkmath | Senior Product Manager. I've worked on products related to hardware, computer vision, and voice recognition.
Published by HackerNoon on 2023/05/10