7 Steps To Prepare A Dataset For An Image-Based AI Project

Written by olegkokorin | Published 2023/09/05
Tech Story Tags: ai | machine-learning | datasets | computer-vision | guide | software-development | datasets-for-machine-learning | ai-generated-art

TLDR: The 7 steps one should take when putting together a dataset for a computer vision project: from image augmentation and annotation to preventing data leaks and building a database for quick troubleshooting.

A dataset might be the most overlooked part of any machine-learning project. Most people see it simply as a collection of images that you quickly put together or even download premade.

With the dataset being the cornerstone of any image-based AI project, this view is often very damaging to the final production quality.

Creating and curating a well-balanced and structured dataset is essential to any machine learning project that aims to achieve high accuracy.

Creating a dataset is not as easy as collecting a few hundred images together. There are a lot of hidden pitfalls one may encounter when trying to get an AI project going.

The seven steps outlined below will guide you through everything you need to know when creating your own dataset. You will learn why size actually matters, how to prevent data leaking, why to turn a dataset into a database, and more.

Note: These steps are applicable for object detection and classification projects with datasets consisting of images. Other project types, like NLP or graph projects, require a different approach.

Step 1: Image size

Neural networks have a particular image size they work with, and any image over the threshold will get downsized. Before you do anything with your dataset, choose which neural network you will work with and adjust the size of your images accordingly.

Size matters, as downsizing images may lead to significant reductions in accuracy. Downsizing makes small objects disappear, which can be detrimental.

Let’s say you want to detect car license plates on images from security cameras. License plates take up a small portion of the image, and when the image is downsized to be processed by a neural network, the license plate may become so small it will no longer be recognizable.

Knowing the image size your network works with will help you crop your dataset images appropriately.

Many neural networks work with a rather small input size; however, some recent networks, like YOLOv5x6, can process images up to 1280 pixels wide.
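To make the effect concrete, here is a minimal sketch of how much an object shrinks when an image is downsized to a network's input width. The resolutions and object sizes below are hypothetical, chosen only to illustrate the license-plate example:

```python
def scaled_size(object_px: float, original_w: int, target_w: int) -> float:
    """Pixel width of an object after the whole image is resized to target_w."""
    return object_px * target_w / original_w

# A 120 px wide license plate in a 1920 px security-camera frame:
print(scaled_size(120, 1920, 640))   # shrinks to 40.0 px at a 640 px input
print(scaled_size(120, 1920, 1280))  # stays at 80.0 px at a 1280 px input
```

At 40 pixels wide, the characters on the plate are likely unreadable to the detector; doubling the input resolution keeps twice as much detail.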

Step 2: Know your environment

The dataset must reflect the real images your neural network will recognize in the course of its operation. Here are a few factors to keep in mind when collecting your dataset:

  • camera type — smartphone camera, security camera
  • image size
  • camera placement — inside, outside
  • weather conditions — lighting, rain, fog, snow, etc.

After you have a clear understanding of the real images your neural network will process, you will be able to create a dataset that will accurately reflect the environment and how your object of interest looks in said environment.

Collecting generic images found on Google may be the easiest and fastest way to put together a large dataset, but the resulting system will hardly have high accuracy. Images found on Google or in photo databases are usually ‘too pretty’ compared to the images produced by a real camera.

A ‘pretty’ dataset may result in high test accuracy, meaning the network will work well on test data (a collection of images taken from the dataset), but low real accuracy, meaning the network will work poorly in real conditions.

Step 3: Annotations and format

Another important aspect to pay attention to is the format your images are in. Before starting your project, check which formats the framework you’ve chosen works with and see if your images comply. Modern frameworks work with a large variety of image formats. However, there are still some problematic formats, like .jfif.

Annotations, i.e., the data detailing the bounding box and file name, can also be structured differently. Different neural networks and frameworks require different annotation formats: some require absolute coordinates of the bounding box, others relative coordinates; some require a separate .txt file of annotations per image, others a single file containing all of them.

Even if your dataset has great images, it won’t matter if your framework won’t be able to process the annotations.
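As an illustration of the absolute-vs-relative distinction, here is a small sketch converting an absolute (x_min, y_min, x_max, y_max) bounding box into the relative (x_center, y_center, width, height) format that YOLO-style annotations use. The box and image dimensions are hypothetical:

```python
def to_yolo(box, img_w, img_h):
    """Convert an absolute (x_min, y_min, x_max, y_max) box to the
    relative (x_center, y_center, width, height) format YOLO expects."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2 / img_w,
            (y_min + y_max) / 2 / img_h,
            (x_max - x_min) / img_w,
            (y_max - y_min) / img_h)

# A 200x200 box in a 400x400 image:
print(to_yolo((100, 50, 300, 250), 400, 400))  # (0.5, 0.375, 0.5, 0.5)
```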

Step 4: Train and Validation subsets

For training purposes, the dataset is usually separated into two subsets:

  • Train subset — the set of images the neural network trains on, usually 70 to 80% of the total image count
  • Validation subset — a smaller set of images used to check how well the neural network is learning during training, usually 20 to 30% of the total image count

The neural network uses the train subset to extract object features and learn what the object looks like. After each epoch (essentially one training cycle), the network is evaluated on the Validation subset and tries to predict what objects it ‘sees’. These predictions are not used to update the network’s weights; they show how well the network is learning and help catch problems like overfitting early.

While this approach is used widely and has proven to give good results, we prefer to do things a bit differently and divide the dataset into the following subsets:

  • Train subset — 70% of the total image count
  • Validation subset — 20% of the total image count
  • Test dataset — about 10% of the total image count

The test subset contains images from the dataset that the network has not seen before. This subset lets developers manually test the model, see how well it works, and find which images it has trouble with.

In other words, this subset helps find where the network makes mistakes before launch, avoiding excessive retraining after project launch.
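A 70/20/10 split like the one above can be sketched in a few lines. This is a minimal illustration (the file names and seed are hypothetical); the key detail is shuffling with a fixed seed so the split is reproducible between runs:

```python
import random

def split_dataset(image_paths, seed=42):
    """Shuffle once with a fixed seed, then split 70/20/10 into
    train, validation, and test subsets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * 0.7)
    n_val = int(n * 0.2)
    return (paths[:n_train],                    # train
            paths[n_train:n_train + n_val],     # validation
            paths[n_train + n_val:])            # test

train, val, test = split_dataset([f"img_{i}.jpg" for i in range(100)])
print(len(train), len(val), len(test))  # 70 20 10
```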

Step 5: Data leaking

Data leaking is extremely detrimental to the quality of your neural network. Data leakage happens when the data you are using to train a machine learning algorithm happens to have the information you are trying to predict.

Simply put, data leaking from the image recognition point of view happens when both the train subset and validation subset have very similar photos of the same object.

Essentially, the model sees an image in the training dataset, extracts its features, then goes to the validation dataset and sees exactly the same (or very similar) image. Instead of actually learning, the model simply memorizes the information.

This leads to unnaturally high accuracy scores on the validation dataset, often upwards of 98%, but very poor scores in production.

A common approach to dataset segmentation is shuffling the data randomly, then putting the first 70% of the images into the Train subset and the remaining 30% into the Validation subset.

This approach often leads to data leaking. It is imperative to remove all duplicates from the dataset and check that no similar photos are present in the two subsets.

Duplicate removal can be performed automatically using simple scripts. You can adjust the similarity threshold: remove only exact duplicates, images that are 90% similar, and so on.

The more duplicates you remove, the better the network's production accuracy.
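One common way such scripts work is by comparing perceptual hashes of the images, where near-duplicates differ by only a few bits. In practice you would compute the hashes with a library such as imagehash; to keep this sketch dependency-free, the hashes below are precomputed 64-bit integers, and only the grouping logic is shown:

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit perceptual hashes."""
    return bin(a ^ b).count("1")

def find_duplicates(hashes, max_distance=6):
    """Return pairs of image names whose hashes are within max_distance bits.
    Distance 0 means exact perceptual duplicates; a small positive threshold
    also catches slightly edited or re-encoded copies."""
    names = list(hashes)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if hamming(hashes[names[i]], hashes[names[j]]) <= max_distance:
                pairs.append((names[i], names[j]))
    return pairs

# Hypothetical hashes: a.jpg and b.jpg differ by a single bit.
hashes = {"a.jpg": 0b1111000011110000, "b.jpg": 0b1111000011110001, "c.jpg": 0}
print(find_duplicates(hashes, max_distance=2))  # [('a.jpg', 'b.jpg')]
```

The pairwise comparison is quadratic, which is fine for a few thousand images; for very large datasets you would bucket the hashes first.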

Step 6: Database for large datasets

If your dataset is large, and we mean over 100K images and dozens of object classes and subclasses, we recommend you create a simple database to store your dataset information in.

The reason behind this is rather simple. With large datasets, it is hard to keep track of all of the data, and it is impossible to accurately analyze it without structuring the data in some way.

A database will allow you to quickly diagnose your dataset: low image count for a particular class, causing the network to struggle to recognize the object; uneven image distribution among classes; high amount of Google Photos images in a particular class, causing low accuracy scores for this class, etc.

The database can be quite simple and include just the following information:

  • file name
  • file path
  • annotation data
  • class data
  • data source (production, Google, etc.)
  • relevant information about the object — object type, name, etc.

A database is an indispensable tool when it comes to collecting dataset statistics. You can quickly and easily see how balanced your dataset is and how many good quality (from the neural network perspective) images are present in each class.

You can easily present this data visually to analyze it more quickly and compare it to recognition results to find the reason for low accuracy scores.

Low accuracy could result from a low image count or a higher percentage of Google photos in a particular class.

The work it takes to create a database like this is absolutely worth it, as it will cut down on production, testing, and model retraining time significantly.
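Assuming a Python workflow, such a database can be as simple as a single table in the standard library's sqlite3. The table layout, field names, and sample rows below are illustrative:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path in practice
con.execute("""
    CREATE TABLE images (
        file_name  TEXT PRIMARY KEY,
        file_path  TEXT,
        class      TEXT,
        source     TEXT,   -- 'production', 'google', ...
        annotation TEXT    -- e.g. JSON-encoded bounding boxes
    )
""")
con.executemany(
    "INSERT INTO images VALUES (?, ?, ?, ?, ?)",
    [("cat_001.jpg", "/data/cats/cat_001.jpg", "cat", "production", "[]"),
     ("cat_002.jpg", "/data/cats/cat_002.jpg", "cat", "google", "[]")],
)

# Quick diagnosis: image count and share of Google images per class.
rows = con.execute("""
    SELECT class,
           COUNT(*) AS total,
           SUM(source = 'google') * 1.0 / COUNT(*) AS google_share
    FROM images
    GROUP BY class
""").fetchall()
print(rows)  # [('cat', 2, 0.5)]
```

A query like the last one immediately surfaces the problems mentioned above: classes with too few images, or classes dominated by ‘too pretty’ stock photos.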

Step 7: Dataset augmentation

Data augmentation is a technique used to increase the image count. It is the process of applying simple or complex transformations, such as flipping or style transfer, to your data to increase its effectiveness.

It makes the dataset more effective without requiring loads of additional training data. A transformation can be as simple as rotating an image 90 degrees or as complex as adding a sun flare to imitate a backlit photo or a lens flare.

These augmentations are usually performed automatically: there are Python libraries, such as Albumentations, dedicated to data augmentation.

There are two types of data augmentation:

  1. Pretraining augmentation — Before the training process begins, the data is augmented and added to the Train subset. This augmentation should only be performed after the dataset has been divided into Train and Validation subsets to avoid data leaking
  2. In-training augmentation — image transformations built into the framework’s ecosystem, like the transforms that ship with PyTorch (torchvision)

It is important not to get too carried away when augmenting your data. Increasing the dataset size ten times will not make your network ten times more efficient; in fact, it can make it perform more poorly than before. Only use augmentations relevant to your production environment. For example, don’t add ‘rain’ augmentation to images if the camera is positioned inside a building where, under normal operations, no rain can occur.

Summing Up

Now you know how to treat your dataset right and get the best results out of your AI project:

  1. Crop or resize images to fit the requirements of your neural network
  2. Collect realistic images, keeping in mind weather and lighting conditions
  3. Structure annotations according to neural network requirements
  4. Don’t use all images to train the network. Save some for testing
  5. Remove duplicates between the train and validation subsets to avoid data leaking
  6. Create a database for quick dataset diagnosis
  7. Use data augmentation sparingly to increase image count

The dataset is the most important part of any image recognition project, even though it is the least exciting part for those looking to bring AI into their operations.

In the majority of image recognition projects, dataset management and curation take the most time. It usually takes a while to even get to the training process and testing, as you need to lay a good foundation first.


Written by olegkokorin | CEO of Businessware Technologies, machine learning engineer
Published by HackerNoon on 2023/09/05