How We Automated the Verification of Car Photos

Written by indrivetech | Published 2023/07/05
Tech Story Tags: data-science | data-scientist | verification | object-detection | image-segmentation | ocr-machine-learning | ocr | encoding

TL;DR: inDrive asks drivers to take a picture of their car every month, and our moderators manually check the before and after photos. The problem with this approach is that it is harder to increase the number of moderators than it is to scale up the infrastructure. We also want to give users clues as to why the photo provided doesn't "fit the bill".

In this article, I will tell you how we automated the user vehicle verification process: the components we use and how we organize the process.

At inDrive, we work with a great deal of visual content. We operate in a wide range of regions with very different mindsets, and we handle an abundance of documentation: passports, certificates, and vehicle documents.

Apart from that, there are the drivers themselves and their vehicles to deal with.

When talking about travel comfort and improved quality of service for our users, safety and eliminating the unexpected are absolute essentials. For example, when the car that pulls up is not the one you booked.

Here you will find out how we at inDrive currently handle regular vehicle verifications: we ask drivers to take a picture of their car every month, and our moderators then manually check the before and after photos. Of course, the verification process involves other things as well, but here we will focus only on this aspect.

The problem with the current approach is that it is more difficult to increase the number of moderators than it is to scale up the infrastructure. Especially when it comes to dealing with users' personal data.

About the Task at Hand

Let's put the problem very simply, as though for a child: We have two photos of a car — are they both of the same car? Obviously, anyone could handle this question, but things become much more complicated once we add a comparison criterion.

For example, making sure that this is not a screenshot from a cell phone or that the license plate numbers are a perfect match.

Broadly speaking, this problem can be solved in two ways: with a single end-to-end (E2E) model, or with a set of smaller models.

The E2E approach means using one large model (most likely a neural network) that answers our questions based on a pair of images, such as "Is it the same vehicle in the photos or not?", "Do the license plate numbers match up or not?", etc.

The problem with such models is that they require a lot of data to train on, and they cannot explain why the answer is what it is.

For instance, if we train a model to answer "yes"/"no" based on a pair of photos, it is no longer possible to figure out the reasons for the answer.

So the users will be unable to understand what we want from them and we will have to bring in moderators.

This end-to-end approach does not suit us. We want to give the user clues as to why the photo provided doesn’t “fit the bill”: "It's best to take the photo in a well-lit location," "It looks like we don’t see the entire car in the picture," or "The license plate number doesn't match the information provided."

Model Quality and Base

It’s very important to us that the model does not answer "yes" when the photos show different vehicles. Let's name this metric the "FPR" (false positive rate) and use it to show the share of "yes" responses among all negative examples (pairs where the vehicles don't match up).

Now, let's introduce another metric, the TPR (true positive rate), to measure the share of "yes" answers among all positive examples (pairs where the vehicles do match).

Basically, these two metrics are already enough to describe our task when optimizing the model: To minimize the FPR and ensure that the TPR is not degraded too much.
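
To make these definitions concrete, here is a small sketch in plain Python of how FPR and TPR could be computed for a labelled set of photo pairs (the helper name and the toy data are purely illustrative):

```python
from typing import Sequence

def fpr_tpr(y_true: Sequence[bool], y_pred: Sequence[bool]) -> tuple[float, float]:
    """y_true: whether the two photos really show the same car; y_pred: the model's "yes"/"no"."""
    # FPR: share of "yes" answers among all negative pairs (different cars)
    negatives = [p for t, p in zip(y_true, y_pred) if not t]
    # TPR: share of "yes" answers among all positive pairs (same car)
    positives = [p for t, p in zip(y_true, y_pred) if t]
    fpr = sum(negatives) / len(negatives) if negatives else 0.0
    tpr = sum(positives) / len(positives) if positives else 0.0
    return fpr, tpr

# Three matching pairs and two non-matching pairs
print(fpr_tpr([True, True, True, False, False], [True, True, False, True, False]))
# -> FPR 0.5, TPR ~0.67: we said "yes" to one of the two different-car pairs
```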

We adopted the model set approach, which lets us add our ideas onto the skeleton of the solution (we will refer to this skeleton as the "baseline"). Let's explore what it looks like and break it down into parts.

In essence, it consists of several models that independently process the two input images and output each vehicle's license plate number and its embedding vector. A verification decision about the two photos under review is then made by comparing these outputs.
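
To make the structure clearer, here is a minimal sketch of that skeleton in Python. The names here (VehicleFeatures, extract) are hypothetical stand-ins for the real detector, OCR, and encoder components, and the similarity threshold is illustrative:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class VehicleFeatures:
    plate_number: str        # output of the license plate pipeline (detection + OCR)
    embedding: np.ndarray    # L2-normalized vector from the car encoder

def verify_pair(
    before: np.ndarray,
    after: np.ndarray,
    extract: Callable[[np.ndarray], VehicleFeatures],  # runs independently on each photo
    similarity_threshold: float = 0.98,
) -> bool:
    """Each photo is processed independently; the decision is made by comparing the outputs."""
    f1, f2 = extract(before), extract(after)
    plates_match = f1.plate_number == f2.plate_number
    same_car = float(f1.embedding @ f2.embedding) >= similarity_threshold
    return plates_match and same_car
```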

This is the skeleton of the algorithm onto which we add multiple other models, placing them in different nodes of the pipeline: for example, models that evaluate the quality of a photo, such as its clarity, lighting, and saturation levels.

Sometimes, we need to detect attempts at submitting photos that are not genuine. To do this, we add a preprocessor that checks the photo for spoofing attacks.

When using such an approach, it's crucial to have clear product iteration cycles and a good test data set. Whenever we add a new model, we assess two things: "Does it solve the problem assigned to it?" and "How much does it change the metrics of the previous solution?"

Let's now talk about the basic building blocks of the solution.

Detection and Segmentation Are All You Need

Let's move on to the terminology. Below, I will explain the meaning of terms such as "bounding box" and "segmentation mask."

A bounding box is a rectangular shape used to enclose a specific object of interest. For example, if we want to identify the cat's face, we would define a bounding box by outlining it in red. It is defined by the coordinates of its lower-left and upper-right points within the image.

Segmentation refers to the task of assigning a class label to each individual pixel in an input image. In the context of our discussion, we isolated the cat by segmenting it from the background.

In our model, the background of the vehicle is of no interest to us because this provides no relevant information for shaping our target solution. But that doesn't mean we have no ideas about how to improve our models by using backgrounds.

To address the challenge of separating vehicles from the background, we will take a model of the YOLO family (You Only Look Once) and train it to segment car images. The problem here is that we have a huge number of photos from users with more than one car in the picture.

To overcome this issue, we can employ the following approach (sketched in code after the list):

  • Calculate the distance between the center of mass of the bounding box and the center of the photo.

  • Determine the size of the bounding box.

  • Sort the bounding boxes in ascending order based on the distance from the center and in descending order based on the size of the box.

  • Select the first object in the sorted list. Consequently, we obtain the bounding box that is closest to the center of the picture while being the largest.
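
Putting these steps together, here is a rough sketch using the ultralytics YOLOv8 segmentation API. The checkpoint name, the COCO class index for "car", and the result fields follow the library's documented interface, but treat this as illustrative rather than our exact production code:

```python
import numpy as np
from ultralytics import YOLO  # assumption: the ultralytics YOLOv8 package

model = YOLO("yolov8n-seg.pt")  # a small pretrained segmentation checkpoint

def pick_main_car(image_path: str):
    """Return the index of the detected car closest to the image centre and largest in area."""
    result = model(image_path)[0]
    img_h, img_w = result.orig_shape
    cx, cy = img_w / 2, img_h / 2

    candidates = []
    for i, (box, cls) in enumerate(zip(result.boxes.xyxy.cpu().numpy(),
                                       result.boxes.cls.cpu().numpy())):
        if int(cls) != 2:                              # 2 is the "car" class in COCO
            continue
        x1, y1, x2, y2 = box
        bx, by = (x1 + x2) / 2, (y1 + y2) / 2          # centre of mass of the bounding box
        dist = float(np.hypot(bx - cx, by - cy))       # distance to the image centre
        area = float((x2 - x1) * (y2 - y1))            # size of the box
        candidates.append((dist, -area, i))            # nearest to the centre first, then largest

    if not candidates:
        return None
    _, _, best = min(candidates)                       # the first object in the sorted list
    return best  # index into result.boxes / result.masks for the main vehicle
```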

Great, we've got our first input.

The next step is to find the vehicle’s license plate number. In almost all countries, the license plate is located on the front. Those rare cases where vehicle registration plates are located in unusual places are outside the scope of this article.

The most common approach to establishing the license plate number is to detect the bounding box and apply OCR to the resulting patch.

But as our experiments have shown, OCR is much less effective (and takes longer in some models) if the vehicle’s registration number plate is not parallel to the horizon.

This is especially relevant for our data, where we ask that drivers take photos at an angle.

The solution we decided on was to segment the number and then smooth the contour line obtained. In our case, the segmentation task was approached similarly to how vehicles are segmented, with the following result:

Next, we drew a contour line using the mask and applied ConvexHull to smooth it. This simple algorithm straightens out the concavities of the contour, which can be described as reducing the number of vertices in the contour polygon.

In an ideal world, this operation would get us a rectangle defined by four points.

Once we have the borders aligned, we correct the perspective so that the registration plate comes out flat, well-presented, and clearly visible. This approach is especially helpful when a car is photographed at a sharp angle where the license plate is hardly visible.

What is a perspective correction? I remember from my algebra class how a rotation matrix works. If you take a square in the Cartesian coordinate system and multiply each coordinate by a rotation matrix of 30 degrees, then your square in the new coordinate system will be rotated by 30 degrees.
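
As a quick refresher, rotating points is just multiplying them by the rotation matrix; a tiny NumPy illustration:

```python
import numpy as np

theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])  # corners of a unit square
rotated = square @ R.T                               # each corner rotated by 30 degrees about the origin
print(rotated.round(3))
```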

Here, we are dealing with a similar task — let’s take the contour line and move all the points to the new coordinate system. The problem is to find a suitable transformation matrix.

All of these algorithms are already well established, so the only thing we have to do is make sure that they are correctly configured for the task at hand.
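
For illustration, here is roughly how those established pieces can be wired up with OpenCV. The contour comes from the plate segmentation mask; the output size of 520x112 pixels and the approximation epsilon are assumptions, not our exact settings:

```python
import cv2
import numpy as np

def order_corners(pts: np.ndarray) -> np.ndarray:
    """Order four points as top-left, top-right, bottom-right, bottom-left."""
    s, d = pts.sum(axis=1), np.diff(pts, axis=1).ravel()
    return np.array([pts[s.argmin()], pts[d.argmin()], pts[s.argmax()], pts[d.argmax()]],
                    dtype=np.float32)

def warp_plate(image: np.ndarray, plate_mask: np.ndarray, out_w: int = 520, out_h: int = 112):
    """Straighten the license plate given its binary segmentation mask."""
    contours, _ = cv2.findContours(plate_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea)

    hull = cv2.convexHull(contour)                     # smooth out the concavities of the contour
    # Simplify the hull down to (ideally) the four plate corners
    quad = cv2.approxPolyDP(hull, 0.02 * cv2.arcLength(hull, True), True)
    if len(quad) != 4:
        return None                                    # fall back to OCR on the raw crop

    src = order_corners(quad.reshape(4, 2).astype(np.float32))
    dst = np.array([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)          # the "suitable transformation matrix"
    return cv2.warpPerspective(image, M, (out_w, out_h))
```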

The result was just great: this increased the TPR by nearly 15 percentage points. Next, we apply some lightweight, high-quality OCR software, such as the PARSeq architecture.
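
For reference, here is roughly how the recognition step could look with the public PARSeq implementation. This follows the usage shown in the baudm/parseq README (the torch.hub entry point and the strhub helper), so the exact function names are that project's, not ours, and may change between versions:

```python
import torch
from PIL import Image
from strhub.data.module import SceneTextDataModule  # from the baudm/parseq project

# Pretrained PARSeq text-recognition model via torch.hub
parseq = torch.hub.load('baudm/parseq', 'parseq', pretrained=True).eval()
img_transform = SceneTextDataModule.get_transform(parseq.hparams.img_size)

plate = Image.open('warped_plate.jpg').convert('RGB')   # the perspective-corrected plate crop
batch = img_transform(plate).unsqueeze(0)

with torch.no_grad():
    logits = parseq(batch)
labels, confidences = parseq.tokenizer.decode(logits.softmax(-1))
print(labels[0])  # recognized plate string
```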

Car Encoder

As of now, this is the newest neural network component in our pipeline for processing vehicle photos. Embeddings are a widely adopted technique in various machine learning fields, including search, recommendations, and data compression.

In the context of our task, embeddings are employed to assess the similarity between vehicles.

Let's look at an example where I took a picture of my car first from the right-hand side, and then from the left-hand side. Now, I can calculate the embeddings (vectors) for these pictures, and if these vectors are close in space, this indicates that it is the same vehicle.

But embeddings provide another useful property that can be used in the product: if your embedding model works well, you can search for the nearest neighbors among the stored embeddings, for example, to find non-unique vehicles in the system.
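
As a sketch of both uses, here is how the cosine similarity and a brute-force nearest-neighbor search over stored embeddings might look with plain NumPy (in production an approximate-nearest-neighbor index would normally be used instead; the 0.98 threshold is the one discussed below):

```python
import numpy as np

def find_duplicates(query: np.ndarray, gallery: np.ndarray,
                    threshold: float = 0.98, top_k: int = 5):
    """Return (index, similarity) of stored embeddings that look like the same vehicle."""
    gallery_norm = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = gallery_norm @ query_norm                   # cosine similarity against every stored car
    best = np.argsort(-sims)[:top_k]                   # the closest candidates
    return [(int(i), float(sims[i])) for i in best if sims[i] >= threshold]

# Toy example: three stored cars, one query that is close to the first
gallery = np.array([[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]])
query = np.array([0.95, 0.05])
print(find_duplicates(query, gallery))  # -> [(0, 0.998...)]
```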

When training our embeddings model using inDrive data, we took meticulous precautions. We diligently removed any personal data from the photos and ensured the dataset was normalized, encompassing images from all the countries in which we operate and varying quality levels.

This approach aims to prevent discrimination against individuals who may not have access to expensive smartphones for capturing high-quality photos.

Consequently, we obtained a dataset grouped by vehicle make and model. After running several experiments, we realized that we would have to do without comparing the colors of the vehicles for the time being.

When selecting the architecture for our model, we sought a backbone that strikes a balance between performance and computational efficiency. It was crucial to avoid using an excessively large backbone, as it could significantly slow down the baseline running time.

After careful consideration, we opted for efficientnet_b2 as our backbone architecture, complemented by the Additive Angular Margin Loss (ArcFace) for metric learning.
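
As an illustration, here is a minimal sketch of such an encoder and its training head in PyTorch. The embedding dimension, scale, and margin values are assumptions, and timm is used simply as a convenient way to get an efficientnet_b2 backbone:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm

class CarEncoder(nn.Module):
    """efficientnet_b2 backbone that maps a car crop to an L2-normalized embedding."""
    def __init__(self, embedding_dim: int = 512):
        super().__init__()
        # num_classes=0 removes the classifier and returns pooled features
        self.backbone = timm.create_model("efficientnet_b2", pretrained=True, num_classes=0)
        self.head = nn.Linear(self.backbone.num_features, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.head(self.backbone(x)), dim=-1)

class ArcMarginHead(nn.Module):
    """Additive Angular Margin (ArcFace) head, used only at training time."""
    def __init__(self, embedding_dim: int, num_classes: int, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embedding_dim))
        self.s, self.m = s, m

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between embeddings and the learned class centers
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # Add the angular margin only for the target class, then rescale
        logits = torch.where(target, torch.cos(theta + self.m), cosine)
        return self.s * logits  # feed into nn.CrossEntropyLoss
```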

The goal of our model is to learn vector representations where vehicles of the same make and model, such as all Audi A4s, are positioned closely together in the vector space.

In contrast, Audi A5s would be positioned somewhat farther away from Audi A4s but still closer compared to, for instance, a Toyota Camry.

Now, let's delve into a few examples of vehicle comparisons:

At the top, we have two identical cars, while at the bottom, we have two different ones. Let's revisit the similarity scores: the top pair has a score of 0.989, while the bottom pair has a score of 0.697. By setting a threshold value of 0.98, we can classify vehicles as identical.

However, it's important to note that our model is not yet functioning flawlessly. It still shows a bias in cases like this one:

The model produces a result of 0.751, whereas we ideally want a value close to zero for different vehicles.

The main issue here stems from training our model on datasets primarily focused on vehicle makes and models. Consequently, the model became proficient at distinguishing between different makes and models, but it struggles to tell apart two different vehicles of the same class.

The second problem we have encountered is that the vehicle may be shown at different angles, which negatively affects the quality of our embeddings due to the limited data set.

As the first step, on the client side, we add masks and prompt the driver on how to take a photo of their car. The second step will be to detect different parts of the vehicle for positioning it in space and estimating its rotation.

A lot of heuristics can be developed here to choose the correct angle of rotation. And most importantly, these models can be later reused for evaluating the condition of the vehicle. But that's a story for another time.

Posted by Ilya Kaftanov.


Written by indrivetech | Team of inDrive developers who know how to experiment and learn from their mistakes for growth.
Published by HackerNoon on 2023/07/05