Building scalable visual search with machine learning

Written by madstreetden | Published 2017/11/13

There’s an actress on TV wearing an outfit that you must have. How do you find it? If you know some details, you could toss a word salad into Google and hope that someone has blogged about it. A search like “Blake Lively blue dress Cannes” is popular enough to work, but “That cute blue jacket that Sue wore on The Great British Bake Off” is a nonstarter. You need visual search — the ability to search for content in visual media. With the vast amount of video and image content on the web, there is an increasing need for effective visual search engines.

So what does creating a visual search engine look like?

Visual Search Pipeline, and the Shopper Experience

A typical visual search pipeline needs to be able to identify objects in an image, and search against a catalog to find items that are visually similar. This system needs to be robust to a variety of different visual contexts in both images and video. Identifying a pair of shoes in a high-quality studio image with a clean white background is a pretty straightforward task, but if the shoes were on a person facing away from the camera, in the background of a video scene, the task becomes much more difficult (even for humans).
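
To make those stages concrete, here is a minimal sketch of that flow. The functions `detect_objects`, `embed`, and `catalog_index.query` are hypothetical placeholders for the detection, feature-extraction, and similarity-search steps, not references to any specific library.

```python
# Hypothetical end-to-end flow: detect objects, embed each crop,
# then look up visually similar catalog items.

def visual_search(image, catalog_index, top_k=5):
    """Return catalog items visually similar to each object found in the image."""
    results = []
    for obj in detect_objects(image):                   # step 1: localize objects (boxes + labels)
        crop = image.crop(obj.box)                      # step 2: isolate the detected object
        vector = embed(crop)                            # step 3: map the crop to a feature vector
        matches = catalog_index.query(vector, k=top_k)  # step 4: nearest-neighbour catalog search
        results.append((obj.label, matches))
    return results
```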

Our visual search pipeline must also scale well and run on a variety of deployment architectures. It needs to be fast on both a GPU and a CPU (for rapid, cost-effective deployment). For example, if we want to make our videos shoppable, we need to detect objects in video accurately and in real time so that visual search results can be served quickly.

Consider how we personalize shopping experiences through an engine that recommends visually similar products: object detection is a vital piece of that computer vision pipeline, letting us find similar-looking products across a variety of catalogs, whether it’s fashion, furniture, beauty products, or more. For visual search, an effective image recognition engine needs two main components: determining where the objects are in an image (detection), and what they are (classification).
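
As a rough illustration of the “visually similar” half, the sketch below ranks catalog items by cosine similarity between feature vectors. The `catalog_vectors` array and `catalog_ids` list are assumed, precomputed inputs; this is not a description of our production system.

```python
import numpy as np

# Assumed inputs: catalog_vectors is an (N, D) array of precomputed feature
# vectors, one row per catalog item; catalog_ids maps each row to a product id.

def most_similar(query_vector, catalog_vectors, catalog_ids, top_k=5):
    """Rank catalog items by cosine similarity to a query embedding."""
    q = query_vector / np.linalg.norm(query_vector)
    c = catalog_vectors / np.linalg.norm(catalog_vectors, axis=1, keepdims=True)
    scores = c @ q                                # cosine similarity to every catalog item
    best = np.argsort(scores)[::-1][:top_k]       # indices of the top_k highest scores
    return [(catalog_ids[i], float(scores[i])) for i in best]
```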

Deep Learning for Image Recognition

In recent years, deep learning has taken over as the predominant method of solving these types of problems. State-of-the-art deep learning algorithms for image recognition use convolutional neural networks (ConvNets) trained with large amounts of data. ConvNets are made up of many connected computational units that together form a powerful network to identify objects by learning directly from human-labeled images. They are special in that they learn to identify objects from the data, whereas previous algorithms required many steps and domain-specific heuristics.
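
For readers unfamiliar with ConvNets, here is a toy definition in PyTorch showing the basic structure: stacked convolution and pooling layers that learn visual features from labeled images, followed by a classifier. The layer sizes are arbitrary choices for illustration, not a production architecture.

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """A toy ConvNet: convolutional feature extractor plus a linear classifier."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 inputs

    def forward(self, x):
        x = self.features(x)                  # learn visual features directly from pixels
        return self.classifier(x.flatten(1))  # map features to class scores

# Usage: logits = TinyConvNet()(torch.randn(1, 3, 224, 224))
```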

Whole-Image Classification

Image classification is the process of assigning a single label to an image. For example, the following image might be labeled “woman,” “summer,” “couch,” etc. Algorithms that do this type of classification are fast and simple to train, but they cannot detect more than one object or tell us where objects are located. ConvNets trained for image classification typically take an image as input and produce a single label, selected from a set of possible labels, as output. For visual search to work, we need to know something about all of the objects in a scene and where they are located.

(Image: https://goo.gl/4DA9wf)
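
A whole-image classifier can be sketched in a few lines with a pretrained torchvision model; note that it returns exactly one label for the entire scene and says nothing about where anything is.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing for a pretrained classifier.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(pretrained=True).eval()
image = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)  # add batch dim
with torch.no_grad():
    probabilities = torch.softmax(model(image), dim=1)
label_index = int(probabilities.argmax())  # a single ImageNet class id for the whole image
```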

Pixel Classification

Semantic segmentation is the process of labeling every pixel in an image according to its class. For example, in the following image, all of the pixels that make up the shoes might be labeled as “shoe”. Although this is a much richer output than whole-image classification, it is often desirable to separately identify multiple instances of an object. For example, in the image below we would prefer to have a label for each shoe: “shoe_1” and “shoe_2”. This can be accomplished by a type of semantic segmentation network called instance segmentation. This works well, but pixel-wise segmentation algorithms are computationally expensive and require densely annotated training data that is difficult to create, which can be a blocker for implementing this method.
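
For comparison, a pixel-wise prediction can be obtained from a pretrained segmentation model; torchvision’s DeepLabV3 is used below purely as a readily available example, and the per-pixel output hints at why this approach is computationally heavier than producing a single image-level label.

```python
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()
image = preprocess(Image.open("scene.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = model(image)["out"]   # shape: (1, num_classes, H, W)
pixel_labels = logits.argmax(1)    # one class id per pixel
```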

Object Localization

Object localization is the process of predicting a bounding box for each object in a scene. In the image below, the relevant objects are well contained in the bounding boxes that were predicted. This is generally slower than methods that only predict an image-level label, but much faster than making a prediction for every pixel, and it still accomplishes the general task of finding and labeling all the objects.

There are several popular approaches to this type of object detection. One method is to slide a small window over the image, processing each region within that window; if the window has a high probability of containing an object, the network calls that window a bounding box. A similar method attempts to narrow down the candidate windows using a pre-processing heuristic. These methods produce reasonable results, but they are slow because each individual window has to be processed as though it were a separate image.

An alternative is to use the whole image to make a prediction, but have the network predict a size and location in addition to an object class. This method requires training data with class and box information for each object in each image, but it runs much more quickly than networks that must evaluate multiple windows to generate object locations. It is well suited as the first step in a visual search pipeline: it is powerful enough to find and classify multiple objects in images, and fast enough to process video frames quickly.

Object localization
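
The bounding-box output a detector produces looks like the sketch below. A pretrained torchvision detector is used only to show the box-plus-label-plus-score format; it is a region-proposal model rather than the single-pass approach described above, but the output has the same shape.

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
image = transforms.ToTensor()(Image.open("scene.jpg").convert("RGB"))  # (C, H, W) in [0, 1]
with torch.no_grad():
    detections = model([image])[0]  # one dict of boxes/labels/scores per input image

for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.8:                    # keep only confident detections
        x1, y1, x2, y2 = box.tolist()  # corner coordinates in pixels
        print(int(label), float(score), (x1, y1, x2, y2))
```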

Building a Pipeline That Makes Smarter Predictions

ConvNets have proven effective at narrow, well-defined tasks. However, they require massive amounts of data and are inflexible to changes in detection requirements. Most of these models also ignore a lot of the implicit context information that humans use to identify objects. For example, if you see snow in the background of a video, you expect that anyone in that scene is more likely to be wearing a coat and boots than a swimsuit. Exploiting these relationships improves our ability to accurately detect objects and increases the performance of our pipeline, because it allows us to narrow our search based on information we can glean from the data.
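
The snow example can be made concrete with a toy reranking step, where scene context scales each detection’s confidence by a class prior. The prior table and numbers below are invented for illustration and are not our actual topology.

```python
# Hypothetical context priors: how plausible each class is in a given scene.
CONTEXT_PRIORS = {
    "snow":  {"coat": 1.3, "boots": 1.2, "swimsuit": 0.2},
    "beach": {"coat": 0.3, "boots": 0.5, "swimsuit": 1.4},
}

def rerank_with_context(detections, scene):
    """Scale each (label, score) detection by how plausible the label is in this scene."""
    priors = CONTEXT_PRIORS.get(scene, {})
    reranked = [(label, score * priors.get(label, 1.0))  # unknown classes are left unchanged
                for label, score in detections]
    return sorted(reranked, key=lambda item: item[1], reverse=True)

# Usage: rerank_with_context([("swimsuit", 0.7), ("coat", 0.6)], scene="snow")
```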

At Mad Street Den, we are continuously developing topologies that leverage these relationships, in order to provide a computer vision pipeline that makes smarter, more sensible predictions with less of a need for massive amounts of training data.

~ Will Shainin

Will is a Computer Vision and Machine Learning Engineer at Mad Street Den

