Creating insanely fast image classifiers with MobileNet in TensorFlow

Written by harvitronix | Published 2017/07/10
Tech Story Tags: machine-learning | deep-learning | mobilenet | tensorflow | artificial-intelligence


“It’s like hot dog, not hot dog, but for roads.”

MobileNets are a new family of convolutional neural networks that are set to blow your mind, and today we’re going to train one on a custom dataset.

There are a few things that make MobileNets awesome:

  1. They’re insanely small
  2. They’re insanely fast
  3. They’re remarkably accurate
  4. They’re easy to tune for resources vs. accuracy

Why is this important? Many mobile deep learning tasks are actually performed in the cloud. When you want to classify an image, that image is sent to a web service, it’s classified on a remote server, and the result is sent back to your phone.

That’s changing quickly. The computational power on your phone is increasing rapidly, and the network complexity required for computer vision is shrinking (thanks to architectures like SqueezeNet and MobileNet).

Besides the clear benefit of AI without an internet connection, sending images to the cloud is impractical in speed-hungry situations, like in the vehicle safety apps we’re developing at Coastline.

With that, let’s learn the following:

  • What are MobileNets?
  • How to build a custom dataset to train a MobileNet with TensorFlow
  • How to train a MobileNet that’s pretrained on ImageNet with TensorFlow
  • How MobileNets perform against Inception V3
  • How to use your retrained MobileNet to classify images

What are MobileNets?

MobileNets are a class of convolutional neural networks designed by researchers at Google. They are called “mobile-first” in that they’re architected from the ground up to be resource-friendly and run quickly, right on your phone.

The main difference between the MobileNet architecture and a “traditional” CNN is that instead of a single 3x3 convolution layer followed by batch norm and ReLU, MobileNets split the convolution into a 3x3 depthwise convolution and a 1x1 pointwise convolution. The details of why this is so significant can be found in the MobileNet paper, which I strongly encourage you to read.
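If you want to see what that looks like in code, here’s a minimal sketch of one MobileNet-style block using the Keras layers bundled with current TensorFlow releases (the layer sizes are just for illustration, not the exact configuration from the paper):

import tensorflow as tf
from tensorflow.keras import layers

def depthwise_separable_block(x, pointwise_filters, stride=1):
    # Depthwise: one 3x3 filter per input channel, so no mixing across channels.
    x = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Pointwise: a 1x1 convolution that mixes channels and sets the output depth.
    x = layers.Conv2D(pointwise_filters, 1, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

# Example: apply one block to a 224x224 RGB input.
inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = depthwise_separable_block(inputs, pointwise_filters=64, stride=2)
model = tf.keras.Model(inputs, outputs)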

So what’s the catch? Accuracy. MobileNets are not usually as accurate as the bigger, more resource-intensive networks we’ve come to know and love. But finding that resource/accuracy trade-off is where MobileNets really shine.

MobileNets surface two parameters that we can tune to fit the resource/accuracy trade-off of our exact problem: width multiplier and resolution multiplier. The width multiplier allows us to thin the network, while the resolution multiplier changes the input dimensions of the image, reducing the internal representation at every layer.
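To make the trade-off concrete, here’s a back-of-the-envelope sketch based on the per-layer cost formula in the MobileNet paper (the layer sizes below are made up for illustration):

def mult_adds(dk, m, n, df, alpha=1.0, rho=1.0):
    # Multiply-adds for one depthwise separable layer: dk x dk kernel,
    # m input channels, n output channels, df x df feature map.
    m, n, df = alpha * m, alpha * n, rho * df  # thin the channels, shrink the feature map
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise

# A 3x3 layer taking 512 channels to 512 channels on a 14x14 feature map:
full = mult_adds(3, 512, 512, 14)
small = mult_adds(3, 512, 512, 14, alpha=0.5, rho=0.5)
print(full / small)  # roughly 16x fewer multiply-adds for this layer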

Google open-sourced the MobileNet architecture and released 16 ImageNet checkpoints, each corresponding to a different parameter configuration. This gives us an excellent starting point for training our own classifiers that are insanely small and insanely fast.

To learn more about how MobileNets work, read MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.

How to build your own dataset to train a MobileNet

Our challenge today is to build an image classifier that can detect if an image is a road or not a road. It’s like hot dog, not hot dog, but for roads.

Why road, not road? At Coastline, we’re developing safety features for your car with mobile apps that use computer vision. As with all vision problems, user privacy is critical. So one of the first checks we do when a user turns on the camera in our app is whether it sees a road. If it doesn’t, we turn off the camera. We want to do this fast and with as little computational cost to the user as possible.

Since we’re tackling a custom problem, we need to start with creating our dataset. Our target is to collect 10,000 images, split roughly evenly 50/50 road/not road.

We’ll get data from a few different places:

  • 4,000 obviously road images sampled randomly from the Coastline driving dataset
  • 2,000 obviously not road images sampled randomly from the ImageNet dataset
  • 3,000 less obviously not road scenics sampled from the internet to make sure the classifier doesn’t just learn sky, not sky
  • 1,000 less obviously road scenics also sampled from the internet to make sure the classifier doesn’t recognize windshield reflections or other commonalities in the Coastline data

We’ll place each image into one of two folders, each representing the class of that image: road and not-road. That’s all we have to do to prepare our images for retraining!
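Once the folders are populated, a quick sanity check confirms the class balance (a minimal sketch, assuming the data/road and data/not-road layout shown below):

import os

DATA_DIR = 'data'  # adjust to wherever you keep the dataset
for label in ('road', 'not-road'):
    folder = os.path.join(DATA_DIR, label)
    images = [f for f in os.listdir(folder) if f.lower().endswith(('.jpg', '.jpeg', '.png'))]
    print('{}: {} images'.format(label, len(images)))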

However, while grabbing CC images from the internet is a great place to add diversity to your dataset, it comes with a drawback: the labels are noisy. For instance, an image found by searching “road landscape” could have a road front and center with a nice scene in the background, or it could be a mountain scene with a tiny road off in the distance.

Two images both labeled “road”. For our problem, the first should be “not road” and the second should be “road”. Left: “Scenic Mountain Road” by Tyler Thompson is licensed under CC BY 2.0 Right: “150927-road-scenic-bypass-mountains.jpg” by r. nial bradshaw is licensed under CC BY 2.0

We could solve this by going through each image and hand-labeling it, but why do that when we have deep learning?! Instead, we retrain a big network (like Inception V3) on all our data, paying special attention not to overfit on our training data by using early stopping and heavy data augmentation. Then we run every image in our dataset (even the images we just used to train!) through the network and keep track of the images it classifies incorrectly or with little confidence. Then we go through each of those images and move them to their proper classes, if applicable. This significantly reduces the number of images we have to clean up manually. Doing multiple passes of this technique helped us increase our accuracy by seven percentage points on Inception.
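Here’s a rough sketch of that cleanup pass, written against the TF 1.x graph and labels file that retrain.py writes out (the default /tmp paths, the tensor names, and the 0.80 confidence cutoff are all assumptions for illustration):

import glob
import os

import numpy as np
import tensorflow as tf  # TF 1.x style, matching the retrain.py workflow in this post

GRAPH_PATH = '/tmp/output_graph.pb'     # retrain.py's default output graph
LABELS_PATH = '/tmp/output_labels.txt'  # retrain.py's default labels file
CONFIDENCE_CUTOFF = 0.80                # arbitrary threshold for "not confident enough"

# Load the retrained graph.
graph_def = tf.GraphDef()
with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())
graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')

labels = [line.strip() for line in tf.gfile.GFile(LABELS_PATH)]

with tf.Session(graph=graph) as sess:
    # These tensor names match the Inception retraining graph; other architectures
    # use different names (see the label_image flags later in this post).
    input_tensor = graph.get_tensor_by_name('DecodeJpeg/contents:0')
    output_tensor = graph.get_tensor_by_name('final_result:0')

    for folder in ('road', 'not-road'):
        for path in glob.glob(os.path.join('data', folder, '*.jpg')):
            jpeg_bytes = tf.gfile.GFile(path, 'rb').read()
            scores = sess.run(output_tensor, {input_tensor: jpeg_bytes})[0]
            predicted = labels[int(np.argmax(scores))]
            # Flag anything the network disagrees with, or isn't sure about, for manual review.
            # (Exact label text depends on how retrain.py normalizes folder names.)
            if predicted.replace(' ', '-') != folder or scores.max() < CONFIDENCE_CUTOFF:
                print('review: {} (predicted {} @ {:.2f})'.format(path, predicted, scores.max()))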

Now that we have 5,000 road images and 5,000 not road images, in a structure like this…

data/
    road/
        [images...]
    not-road/
        [images...]

…we’ll use TensorFlow and transfer learning to fine-tune MobileNets on our custom dataset.

How to retrain a MobileNet that’s pretrained on ImageNet

TensorFlow comes packaged with great tools that you can use to retrain MobileNets without having to actually write any code. This stuff is fresh off the presses: Retraining support for MobileNet was added less than a week ago!

A couple of very obviously road images from the Coastline dataset.

If you don’t have it downloaded already, fork and/or clone the TensorFlow repo:

git clone https://github.com/tensorflow/tensorflow.git

Now you can use the scripts in the examples folder to retrain MobileNet on your own data.

But wait! Which MobileNet should you use? That’s a good question. Let’s retrain a small assortment and see how they perform. To kick off training, we’ll run the following command from the root of the TensorFlow repo:

python tensorflow/examples/image_retraining/retrain.py \
    --image_dir ~/ml/blogs/road-not-road/data/ \
    --learning_rate=0.0001 \
    --testing_percentage=20 \
    --validation_percentage=20 \
    --train_batch_size=32 \
    --validation_batch_size=-1 \
    --flip_left_right True \
    --random_scale=30 \
    --random_brightness=30 \
    --eval_step_interval=100 \
    --how_many_training_steps=600 \
    --architecture mobilenet_1.0_224

The architecture flag is where we tell the retraining script which version of MobileNet we want to use. The 1.0 corresponds to the width multiplier, and can be 1.0, 0.75, 0.50 or 0.25. The 224 corresponds to image resolution, and can be 224, 192, 160 or 128. For example, to train the smallest version, you’d use --architecture mobilenet_0.25_128.

Some other important parameters:

  • learning_rate: This is something you’ll want to play with. I found 0.0001 to work well.
  • testing and validation percentage: The script will split your data into train/val/test for you. It uses the train split for training, the val split to report performance every eval_step_interval steps, and the test split after how_many_training_steps to give you your final score.
  • validation_batch_size: Setting this to -1 tells the script to use all your data to validate on. When you don’t have a lot of data (like only 10,000 images), it’s a good idea to use -1 here to reduce variance between evaluation steps.
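To retrain that small assortment without launching each run by hand, you can sweep a few configurations; here’s a minimal sketch (the per-architecture output paths are just a suggestion for keeping the runs separate):

import os
import subprocess

# A few width/resolution combinations to compare; tweak to taste.
ARCHITECTURES = ['mobilenet_1.0_224', 'mobilenet_0.75_192',
                 'mobilenet_0.50_160', 'mobilenet_0.25_128']

for arch in ARCHITECTURES:
    subprocess.check_call([
        'python', 'tensorflow/examples/image_retraining/retrain.py',
        '--image_dir', os.path.expanduser('~/ml/blogs/road-not-road/data/'),
        '--learning_rate=0.0001',
        '--testing_percentage=20',
        '--validation_percentage=20',
        '--train_batch_size=32',
        '--validation_batch_size=-1',
        '--how_many_training_steps=600',
        '--architecture', arch,
        # Write each run's graph to its own file so the models can be compared later.
        '--output_graph', '/tmp/output_graph_{}.pb'.format(arch),
        '--output_labels', '/tmp/output_labels.txt',
    ])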

After retraining on several model architectures, let’s see how they compare.

Comparing MobileNet parameters and their performance against Inception

After just 600 steps of training Inception to get a baseline (by setting the --architecture flag to inception_v3), we hit 95.9%. Training took 18 minutes. (There is a lot of room for improvement here, but we don’t have all day!) The resulting checkpoint landed at 84 MB. And a quick speed test, running 1,000 images through it, shows it can classify images on an NVIDIA GeForce 960M GPU at ~19 fps.

Aside: Why “only” 95.9% and not 100%? It seems like a pretty simple problem, right? Well, besides the ample tuning we could do to the training parameters (we actually achieved 98.9% with the same data using a different configuration in another go), it turns out the distinction between classes is a bit more subtle than it seems on the surface. Take these cases:

  • The image is a one-lane dirt road in the woods: Is it a road or a trail? I don’t even know.
  • It’s a landscape scenic with a road in the distance. Is it a picture of a road, or is there just a road in the picture? At what point does the scenic change classes?
  • It’s an artsy tilt-shift photo of a couple holding hands in the foreground and a time-lapsed street in the background. Road or not road? Flip a coin.

So, how do the MobileNets perform? Not surprisingly, not quite as well. However, the tradeoff benefit is astounding.

Using the biggest MobileNet (1.0, 224), we were able to achieve 95.5% accuracy with just 4 minutes of training. The resulting model size was just 17 MB, and it can run on the same GPU at ~135 fps.

For those keeping score, that’s roughly 7 times faster and about a fifth the size. All for a drop of just 0.4 percentage points in accuracy.

How about the smallest MobileNet (0.25, 128), using quantized weights? Big accuracy tradeoff, achieving just 89.2%. But get this: 450 frames per second, and the model takes just 930 KB of memory. That’s kilobytes!
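If you want to reproduce these rough throughput numbers on your own hardware, here’s a quick benchmark sketch. It feeds random tensors rather than real images and uses the MobileNet tensor names from the label_image command below; the graph path assumes the per-architecture outputs from the sweep above:

import time

import numpy as np
import tensorflow as tf  # TF 1.x style

GRAPH_PATH = '/tmp/output_graph_mobilenet_1.0_224.pb'  # whichever retrained graph you want to time
INPUT_SIZE = 224
RUNS = 1000

graph_def = tf.GraphDef()
with tf.gfile.GFile(GRAPH_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())
graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')

with tf.Session(graph=graph) as sess:
    input_tensor = graph.get_tensor_by_name('input:0')          # MobileNet's resized-image input
    output_tensor = graph.get_tensor_by_name('final_result:0')  # the retrained classifier head
    batch = np.random.rand(1, INPUT_SIZE, INPUT_SIZE, 3).astype(np.float32)

    sess.run(output_tensor, {input_tensor: batch})  # warm-up run
    start = time.time()
    for _ in range(RUNS):
        sess.run(output_tensor, {input_tensor: batch})
    print('{:.1f} images/sec'.format(RUNS / (time.time() - start)))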

Using your retrained MobileNet to classify images

Now that you have your MobileNet retrained on your custom dataset, it’s time to give it a try. Not surprisingly, TensorFlow comes with a script to do that, too.

python3 tensorflow/examples/label_image/label_image.py \
    --graph=/tmp/mobilenet_0.50_192.pb \
    --labels=/tmp/output_labels.txt \
    --image=/home/harvitronix/ml/blogs/road-not-road/test-image.jpg \
    --input_layer=input \
    --output_layer=final_result \
    --input_mean=128 \
    --input_std=128 \
    --input_width=192 \
    --input_height=192

Our network classifies this as road, with a confidence of 0.686811. Not too confident, but hey, that’s a FAST road!

Aside: It should be noted that in our fairly simple two-class problem, the accuracy trade-off is not that big. In the case of ImageNet with its 1,001 classes, the accuracy tradeoffs are much more significant. See the table here.

Next steps

Okay, so the whole point of MobileNets is to run on mobile, right? Stay tuned! In our next post, we’ll create some purpose-built training data, fine-tune again, and load our retrained MobileNet into an Android app. We’ll see how fast it can run on a mobile device and how accurate it is in the real world.

UPDATE: Part 2 is now live! Don’t miss Building an insanely fast image classifier on Android with MobileNets in TensorFlow.

