Smart Gesture Recognition in iOS 11 with Core ML and TensorFlow

Written by AdamDanielKing | Published 2017/09/06

In part 1, I showed how one can use deep learning to recognize complex gestures like heart shapes, check marks or happy faces on mobile devices. I also explained how apps might benefit from using such gestures, and went a little bit into the UX side of things.

This time around I’ll be walking you through the technical process of implementing those ideas in your own app. I’ll also introduce and use Apple’s Core ML framework (new in iOS 11).

Real-time recognition of complex gestures, at the end of each stroke

This technique uses machine learning to recognize gestures robustly. In the interest of reaching as many developers as possible, I won’t assume any understanding of the field. Some of this article is iOS-specific but Android developers should still find value.

The source code for the completed project is available here.

What We’re Building

By the end of the tutorial we’ll have a setup that allows us to pick totally custom gestures and recognize them with high accuracy in an iOS app. The components involved are:

  1. An app to collect some examples of each gesture (draw some check marks, draw some hearts, etc.)
  2. Some Python scripts to train a machine learning algorithm (explained below) to recognize the gestures. We’ll be using TensorFlow, but we’ll get to that later.
  3. The app in which to use the custom gestures. It records the user’s strokes on the screen and uses the machine learning algorithm to figure out what gesture, if any, they represent.

Gestures we draw will be used to train a machine learning algorithm that we’ll evaluate in-app using Core ML.

In part 1 I went over why it’s necessary to use a machine learning algorithm. In short, it’s much harder than you’d think to write code to explicitly detect that a stroke the user made is in the shape of a heart, for example.

What’s a Machine Learning Algorithm?

A machine learning algorithm learns from a set of data in order to make inferences given incomplete information about other data.

In our case, the data are strokes made on the screen by the user and their associated gesture classes (“heart”, “check mark”, etc.). What we want to make inferences about are new strokes made by a user for which we don’t know the gesture class (incomplete information).

Allowing an algorithm to learn from data is called “training” it. The resulting inference machine that models the data is aptly called a “model”.

What’s Core ML?

Machine learning models can be complex and (especially on a mobile device) slow to evaluate. With iOS 11, Apple introduces Core ML, a new framework that makes them fast and easy to implement. With Core ML, implementing a model comes down primarily to saving it in the Core ML model format (.mlmodel). Xcode 9 makes the rest easy.

An official Python package, coremltools, is available that makes it easy to save mlmodel files. It has converters for Caffe, Keras, LIBSVM, scikit-learn and XGBoost models, as well as a lower-level API for when those don’t suffice (e.g. when using TensorFlow). Unfortunately, coremltools currently requires Python 2.7.

Supported formats can be automatically converted into Core ML models using coremltools. Unsupported formats like TensorFlow require more manual work.
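
For instance, converting a trained Keras model takes only a few lines. The snippet below is just an illustrative sketch (this project uses TensorFlow, so it can’t take this shortcut); the file name, feature names and class labels are placeholders:

```python
import coremltools

# Sketch only: convert a Keras model saved as an HDF5 file into a Core ML model.
# 'gestures.h5', the feature names and the class labels are placeholders.
coreml_model = coremltools.converters.keras.convert(
    'gestures.h5',
    input_names=['image'],
    output_names=['probabilities'],
    class_labels=['check mark', 'heart', 'x mark'])

coreml_model.save('GestureModel.mlmodel')
```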

Note: Core ML only enables evaluating models on-device, not training new ones.

1. Making the Data Set

First let’s make sure we have some data (gestures) for our machine learning algorithm to learn from. To generate a realistic data set, I wrote an iOS app called GestureInput to enter the gestures on-device.

If your use case isn’t much different from mine, you may be able to use GestureInput as-is. It allows you to enter a number of strokes, preview the resulting image and add it to the data set. You can also modify the associated classes (called labels) and delete examples.

GestureInput randomly chooses gesture classes for you to draw examples of so that you get roughly equal numbers of each. When I want to change the frequencies with which they show up (e.g. when adding a new class to an existing data set), I change the hard-coded values and recompile. Not pretty, but it works.

Generating data for the machine learning algorithm to learn from

The readme for this project explains how to modify the set of gesture classes, which include check marks, x marks, ascending diagonals, “scribbles” (rapid side-to-side motion while moving either up or down), circles, U shapes, hearts, plus signs, question marks, capital A, capital B, happy faces and sad faces. A sample data set is also included which you can use by transferring it to your device.

How many gestures should you draw? As I mentioned in part 1, I was able to get 99.4% accuracy with 60 examples of each gesture, but I would actually recommend making about 100. Try to draw your gestures in a variety of ways so that the algorithm can learn them all.

Exporting For Training

A “Rasterize” button in GestureInput converts the user’s strokes into images and saves them into a file called data.trainingset. These images are what we’ll input to the algorithm.

As covered in part 1, I scale and translate the user’s gesture (“drawing”) to fit in a fixed-size box before converting it into a grayscale image. This helps make our gesture recognition independent of where and how big the user makes their gesture. It also minimizes the number of pixels in the image that represent empty space.

Converting the user’s strokes into a grayscale image for input into our machine learning algorithm

Note that I still store the raw time sequence of touch positions for each stroke in another file. That way I can change the way gestures are converted into images in the future, or even use a non-image-based approach to recognition, without having to draw all the gestures again.
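
For reference, the scale-and-translate step boils down to something like the sketch below. It isn’t GestureInput’s actual code; it just illustrates the idea on raw (x, y) touch points, with box_size and margin as arbitrary placeholder values:

```python
# Sketch only, not GestureInput's actual code: fit a drawing into a fixed-size box.
def normalize_strokes(strokes, box_size=28, margin=2):
    """strokes: list of strokes, each a list of (x, y) touch positions.

    Returns the strokes scaled and translated into a box_size x box_size
    coordinate space, preserving the drawing's aspect ratio.
    """
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    # Uniform scale so the drawing's longer dimension fills the box (minus a margin).
    scale = (box_size - 2.0 * margin) / max(max_x - min_x, max_y - min_y, 1e-6)

    return [[((x - min_x) * scale + margin, (y - min_y) * scale + margin)
             for x, y in stroke]
            for stroke in strokes]
```

The normalized points can then be rendered as connected line segments into a box_size × box_size grayscale bitmap.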

GestureInput saves the data set in the documents folder of its container. The easiest way to get the data set off your device is by downloading the container through Xcode.

2. Training a Neural Network

In step 1 we converted our data set into a set of images (with class labels). This converts our gesture classification problem into an image classification problem — just one (simple) approach to recognizing the gestures. A different approach might use velocity or acceleration data.

I mentioned that we’d be using a machine learning algorithm. It turns out the state-of-the-art class of machine learning algorithms for image classification right now is convolutional neural networks (CNNs). See this excellent beginner-friendly introduction to CNNs. We’ll train one with TensorFlow and use it in our app.

If you’re not familiar with TensorFlow, you can learn about it here, but this article has all the instructions you’ll need to train a model. My neural network is based on the one used in the Deep MNIST for Experts TensorFlow tutorial.
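
To give a feel for that architecture, here’s a rough TensorFlow sketch of a Deep-MNIST-style network: two convolution/pooling blocks followed by a fully connected layer and a softmax over the gesture classes. It isn’t the exact gesturelearner graph, and the image size, filter counts and class count are placeholders:

```python
import tensorflow as tf

IMAGE_SIZE = 28   # placeholder: side length of the rasterized gesture image
NUM_CLASSES = 13  # placeholder: number of gesture classes

# Grayscale input images, one channel.
images = tf.placeholder(tf.float32, [None, IMAGE_SIZE, IMAGE_SIZE, 1])

# Two convolution + max-pooling blocks.
conv1 = tf.layers.conv2d(images, filters=32, kernel_size=5, padding='same', activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2)
conv2 = tf.layers.conv2d(pool1, filters=64, kernel_size=5, padding='same', activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)

# Fully connected layer, then a softmax over the gesture classes.
flat = tf.reshape(pool2, [-1, (IMAGE_SIZE // 4) * (IMAGE_SIZE // 4) * 64])
fc1 = tf.layers.dense(flat, 1024, activation=tf.nn.relu)
logits = tf.layers.dense(fc1, NUM_CLASSES)
probabilities = tf.nn.softmax(logits)
```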

The scripts I used to train and export a model are in a folder called gesturelearner. I’ll be going over the typical use case, but they have some extra command-line options that might be useful. Start by setting up a virtualenv:

cd /path/to/gesturelearner

# Until coremltools supports Python 3, use Python 2.7.
virtualenv -p $(which python2.7) venv
source venv/bin/activate

pip install -r requirements.txt

Preparing the Data Set

First, I use filter.py to split the data set into a 15% “test set” and an 85% “training set”.

# Activate the virtualenv.
source /path/to/gesturelearner/venv/bin/activate

# Split the data set.
python /path/to/gesturelearner/filter.py --test-fraction=0.15 data.trainingset

The training set is of course used to train the neural network. The purpose of the test set is to show how well what the network has learned generalizes to new data (i.e. is the network just memorizing the labels of the gestures in the training set, or is it discovering an underlying pattern?).

I chose to set aside 15% of the data for the test set. If you only have a few hundred gesture examples in total then 15% will be a rather small number. That means the accuracy on the test set will only give you a rough idea of how well the algorithm is doing.

This part is optional. Ultimately the best way to find out how well the network performs is probably to just put it in your app and try it out.

Training

After converting my custom .trainingset format into the TFRecords format that TensorFlow likes, I use train.py to train a model. This is where the magic happens: the neural network learns from the examples we gave it so that it can robustly classify new gestures it encounters in the future.

train.py prints its progress, periodically saving a TensorFlow checkpoint file and testing its accuracy on the test set (if specified).

# Convert the generated files to the TensorFlow TFRecords format.
python /path/to/gesturelearner/convert_to_tfrecords.py data_filtered.trainingset
python /path/to/gesturelearner/convert_to_tfrecords.py data_filtered_test.trainingset

# Train the neural network.
python /path/to/gesturelearner/train.py --test-file=data_filtered_test.tfrecords data_filtered.tfrecords
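
For reference, the conversion step essentially wraps each rasterized gesture and its label in a tf.train.Example. Roughly (the real convert_to_tfrecords.py may structure its records differently):

```python
import tensorflow as tf

# Rough sketch of the conversion: wrap each rasterized gesture and its label
# in a tf.train.Example and write it to a TFRecords file.
def write_tfrecords(examples, path):
    """examples: iterable of (pixels, label); pixels is a flat list of floats."""
    with tf.python_io.TFRecordWriter(path) as writer:
        for pixels, label in examples:
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': tf.train.Feature(float_list=tf.train.FloatList(value=pixels)),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())
```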

Training should be quick, reaching about 98% accuracy in a minute and settling after about 10 minutes.

Training the neural network

If you quit train.py during training, you can start again later and it will load the checkpoint file to pick up where it left off. It has options for where to load the model from and where to save it.

Training With Lopsided Data

If you have significantly more examples of some gestures than other gestures, the network will tend to learn to recognize the better-represented gestures at the expense of the others. There are a few different ways to cope with this:

  • The neural network is trained by minimizing a cost function associated with making errors. To avoid neglecting certain classes, you can increase the cost of misclassifying them (see the sketch below).
  • Include duplicates of the less-represented gestures so that you have equal numbers of all gestures.
  • Remove some examples of the more-represented gestures.

My code doesn’t do these things out-of-the-box, but they should be relatively easy to implement.
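
As an example of the first approach, here’s a sketch of weighting the cross-entropy loss per class. The tiny linear “network” and the weight values are placeholders for your own graph:

```python
import tensorflow as tf

NUM_CLASSES = 13
IMAGE_PIXELS = 28 * 28

images = tf.placeholder(tf.float32, [None, IMAGE_PIXELS])
labels = tf.placeholder(tf.int64, [None])

# Tiny stand-in model; substitute your own network's logits here.
W = tf.Variable(tf.zeros([IMAGE_PIXELS, NUM_CLASSES]))
b = tf.Variable(tf.zeros([NUM_CLASSES]))
logits = tf.matmul(images, W) + b

# Example weights: mistakes on the last (under-represented) class cost 4x as much.
class_weights = tf.constant([1.0] * (NUM_CLASSES - 1) + [4.0])
example_weights = tf.gather(class_weights, labels)

losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
weighted_loss = tf.reduce_mean(example_weights * losses)
train_step = tf.train.AdamOptimizer(1e-4).minimize(weighted_loss)
```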

Exporting to Core ML

As I alluded to earlier, coremltools does not have a converter for TensorFlow models the way it does for Caffe and scikit-learn, for example. This leaves us with two options for turning our neural network into an MLModel:

  • Use coremltools’ lower-level API to build an equivalent model layer by layer, copying in the trained weights.
  • Write out the MLModel protobuf specification directly.

So far there don’t seem to be any examples of either method on the web, other than in the internal code of the existing converters. Here’s a condensed version of my example using coremltools:
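
The sketch below assumes a simplified network with a single convolution layer, pooling layer and fully connected layer; the zero-filled arrays stand in for weights read out of the TensorFlow checkpoint, and the names and shapes are placeholders:

```python
import numpy as np
import coremltools
from coremltools.models import datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder

IMAGE_SIZE = 28   # placeholder: side length of the rasterized gesture image
NUM_CLASSES = 13  # placeholder: number of gesture classes

# Placeholder weights; a real script reads these out of the TensorFlow checkpoint.
# Note: TensorFlow flattens H x W x C while Core ML flattens C x H x W, so the
# first fully connected layer's weights need reordering (and Core ML expects
# the matrix as (output_channels, input_channels)).
conv_W = np.zeros((5, 5, 1, 32))               # (height, width, in_channels, out_channels)
conv_b = np.zeros(32)
fc_W = np.zeros((NUM_CLASSES, 32 * 14 * 14))
fc_b = np.zeros(NUM_CLASSES)

input_features = [('image', datatypes.Array(1, IMAGE_SIZE, IMAGE_SIZE))]
output_features = [('probabilities', datatypes.Array(NUM_CLASSES))]
builder = NeuralNetworkBuilder(input_features, output_features)

builder.add_convolution(name='conv1', kernel_channels=1, output_channels=32,
                        height=5, width=5, stride_height=1, stride_width=1,
                        border_mode='same', groups=1, W=conv_W, b=conv_b,
                        has_bias=True, input_name='image', output_name='conv1')
builder.add_activation(name='relu1', non_linearity='RELU',
                       input_name='conv1', output_name='relu1')
builder.add_pooling(name='pool1', height=2, width=2, stride_height=2,
                    stride_width=2, layer_type='MAX', padding_type='SAME',
                    input_name='relu1', output_name='pool1')
builder.add_flatten(name='flatten1', mode=0, input_name='pool1', output_name='flat1')
builder.add_inner_product(name='fc1', W=fc_W, b=fc_b,
                          input_channels=32 * 14 * 14, output_channels=NUM_CLASSES,
                          has_bias=True, input_name='flat1', output_name='fc1')
builder.add_softmax(name='softmax', input_name='fc1', output_name='probabilities')

coremltools.models.MLModel(builder.spec).save('GestureModel.mlmodel')
```

The repository’s save_mlmodel.py does the equivalent for the full network.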

To use it:

# Save a Core ML .mlmodel file from the TensorFlow checkpoint model.ckpt.
python /path/to/gesturelearner/save_mlmodel.py model.ckpt

The full code can be found here. If for some reason you prefer to skip coremltools and work directly with the MLModel protobuf specification, you can also see how to do that there.

One ugly side effect of having to write this conversion code ourselves is that we describe our entire network in two places (the TensorFlow code, and the conversion code). Any time we change the TensorFlow graph, we have to synchronize the conversion code to make sure our models export properly.

Hopefully in the future Apple will develop a better method for exporting TensorFlow models.

On Android, you can use the official TensorFlow API. Google will also be releasing a mobile-optimized version of TensorFlow called TensorFlow Lite.

3. Recognizing Gestures In-App

Finally, let’s put our model to work in a user-facing app. This part of the project is GestureRecognizer, the app you saw in action at the beginning of the article.

Once you have an mlmodel file, you can add it to a target in Xcode. You’ll need to be running Xcode 9. At the moment it’s in public beta, but its release will likely coincide with that of the new iPhone and iOS 11 next week.

Xcode 9 will compile any mlmodel files that you add to your target and generate Swift classes for them. I named my model GestureModel so Xcode generated GestureModel, GestureModelInput and GestureModelOutput classes.

We’ll need to convert the user’s gesture (Drawing) into the format that GestureModel accepts. That means converting the gesture into a grayscale image exactly the same way we did in step 1. Core ML then requires us to convert the array of grayscale values to its multidimensional array type, [MLMultiArray](https://developer.apple.com/documentation/coreml/mlmultiarray).

MLMultiArray is like a wrapper around a raw array that tells Core ML what type it contains and what its shape (i.e. dimensions) is. With an MLMultiArray in hand, we can evaluate our neural network.

I use a shared instance of GestureModel since each instance seems to take a noticeable length of time to allocate. In fact, even after the instance is created, the model is slow to evaluate for the first time. I evaluate the network once with an empty image when the application starts so that the user doesn’t see a delay when they start gesturing.

Interpreting the Network’s Output

The prediction function outputs an array of “probabilities”, one for each possible gesture class (label). Higher values generally represent higher confidence, but many gestures that don’t belong to any of the classes will counterintuitively receive high scores.

In part 1 I talked about how to reliably distinguish invalid gestures from valid ones. One solution is to create an “invalid gesture” category containing a variety of gestures that don’t belong to any of the other categories. For this project I simply consider a gesture valid if the network classifies it with a “probability” above a certain threshold (0.8).

Avoiding Conflicts Between Gestures

Since some of the gesture classes I used contain each other (happy faces contain U shape mouths, x marks contain ascending diagonals), it’s possible to prematurely recognize the simpler gesture when the user actually intends to draw the more complex one.

To reduce conflicts, I used two simple rules:

  • If a gesture could make up part of a more complex gesture, delay its recognition briefly to see if the user draws that larger gesture.
  • Given the number of strokes the user has made, don’t recognize a gesture that can’t sensibly have been drawn yet (e.g. a happy face requires at least three strokes: one for the mouth and one for each eye).

In general though, for high robustness and responsiveness you should probably choose gestures that don’t contain each other.

And that’s it! With this setup, you can add a completely new gesture to your iOS app in about 20 minutes (input 100 images, train to 99.5+% accuracy, and export model).

To see how the pieces fit together or use them in your own project, see the full source code.

