7 Gotchas(!) Data Engineers Need to Watch Out for in an ML Project

Newsflash: 87% of ML projects fail!

This article covers the top 7 data engineering gotchas in an ML project. The list is sorted in descending order based on the number of times I have encountered the issue multiplied by the impact of each occurrence on the overall project.

1. “I Thought This Dataset Attribute Means Something Else”

Prior to the big data era, data was curated before being added to the central data warehouse. This is known as schema-on-write. Today, the approach with data lakes is to first aggregate the data and then infer the meaning of data at the time of consumption. This is known as schema-on-read.

As a data engineer, be wary of using datasets that lack proper documentation of attribute details or a clear data steward responsible for keeping those details up to date!

2. “5 Definitions Exist for the Same Business Metric — Which One to Use”

Derived data or metrics can have multiple sources of truth!

For instance, I have seen even basic metrics such as “count of new customers” defined differently across business units. As a data engineer, if a business metric is used in the model, make sure to look for all the available definitions and their corresponding ETL implementations.
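To make the point concrete, here is a minimal sketch of two plausible definitions of “new customers in January” applied to the same records. The field names (`signed_up`, `first_order`), the sample data, and both definitions are hypothetical and only serve as an illustration.

```python
from datetime import date

# Hypothetical customer records; the field names are illustrative only.
customers = [
    {"id": 1, "signed_up": date(2024, 1, 5), "first_order": date(2024, 1, 20)},
    {"id": 2, "signed_up": date(2024, 1, 25), "first_order": date(2024, 2, 3)},
    {"id": 3, "signed_up": date(2023, 12, 30), "first_order": date(2024, 1, 2)},
]

def in_january_2024(d):
    """True if the date falls in January 2024."""
    return d is not None and d.year == 2024 and d.month == 1

# Definition A (assumed): a customer is "new" in the month they signed up.
new_by_signup = {c["id"] for c in customers if in_january_2024(c["signed_up"])}

# Definition B (assumed): a customer is "new" in the month of their first order.
new_by_first_order = {c["id"] for c in customers if in_january_2024(c["first_order"])}

# Customers counted by exactly one of the two definitions.
disagreement = new_by_signup ^ new_by_first_order
print(sorted(new_by_signup), sorted(new_by_first_order), sorted(disagreement))
```

In this toy data both definitions report two new customers, yet they agree on only one of them, so comparing counts alone would hide the discrepancy.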

3. “Looks Like the Data Source Schema Changed”

This one is extremely common in large distributed teams. Changes at the source database are typically not coordinated with downstream ETL processing teams. They can range from structural changes (which break existing pipelines) to semantic changes, where a field keeps its name and type but its meaning shifts, which are extremely difficult to debug.

Also, business definitions are rarely versioned, so when a business metric changes, the historic data becomes inconsistent.
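A simple way to catch the structural kind of change early is to compare incoming records against the schema the pipeline was written for. The sketch below assumes records arrive as Python dictionaries; the `EXPECTED_SCHEMA` contents and the `find_schema_drift` helper are hypothetical names of my own. Note that a check like this does nothing for semantic changes.

```python
# Hypothetical expected schema for an upstream table (illustrative names).
EXPECTED_SCHEMA = {"order_id": int, "customer_id": int, "amount": float}

def find_schema_drift(row, expected=EXPECTED_SCHEMA):
    """Return a list of differences between one record and the expected schema."""
    problems = []
    # Columns the pipeline relies on: check presence and type.
    for column, expected_type in expected.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(
                f"type changed: {column} is {actual}, expected {expected_type.__name__}"
            )
    # Columns that appeared upstream without notice.
    for column in row:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems

sample = {"order_id": 1, "customer_id": 2, "amount": "9.99", "currency": "USD"}
for problem in find_schema_drift(sample):
    print(problem)
```

Running a check like this on a sample of each new batch turns a silent upstream change into an explicit alert before the pipeline breaks.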

4. “ETL Pipeline Logic for Training and Serving is Identical... Not Really!”

Model performance skew between training and inference is typically caused by discrepancies between the training and serving pipelines. While the logic may start off identical, fixes made in one pipeline may not be reflected in the other. In particular, avoid scenarios where the training and serving pipelines are written in different languages.
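One common way to reduce this risk is to keep the feature logic in a single function that both pipelines call, so a fix lands in both places at once. This is a minimal sketch under that assumption; the feature names and the `build_features`, `training_rows`, and `serving_features` functions are hypothetical.

```python
import math

def build_features(raw):
    """Single source of truth for feature logic, shared by both pipelines."""
    amount = max(raw.get("amount", 0.0), 0.0)  # clamp negatives to zero
    return {
        "log_amount": math.log1p(amount),
        # Assumes day_of_week uses 0 = Monday ... 6 = Sunday.
        "is_weekend": 1 if raw.get("day_of_week") in (5, 6) else 0,
    }

def training_rows(records):
    """Batch path: applies the shared logic to historical records."""
    return [build_features(record) for record in records]

def serving_features(request):
    """Online path: applies the same shared logic to a single request."""
    return build_features(request)

record = {"amount": 120.0, "day_of_week": 6}
print(training_rows([record])[0] == serving_features(record))
```

A test that feeds the same record through both paths and compares the results is a cheap guard against the two drifting apart.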

5. “Slow Poisoning of the Models”

All-or-nothing errors in data pipelines, where a job either works or fails outright, are the easier ones to detect. The problems that are most difficult to debug are the ones where a table is updated only intermittently, or is joined to tables that are not being updated correctly. In such scenarios, the models will gradually adjust to the bad data and degrade without any obvious failure. The key is to build appropriate circuit breakers that detect bad-quality data and prevent it from being ingested.
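As a rough sketch of what such a circuit breaker could look like, the function below refuses a batch when its row count or null rate is far from what is expected. The thresholds, the `DataQualityError` exception, and the `check_batch` helper are all assumptions for illustration; real thresholds would come from the history of the table.

```python
class DataQualityError(Exception):
    """Raised to stop a pipeline run before bad data reaches the model."""

def check_batch(rows, expected_rows, required_field,
                max_row_drop=0.5, max_null_rate=0.1):
    """Return the batch unchanged if it looks healthy, otherwise raise."""
    # Trip the breaker if the batch is empty or much smaller than usual,
    # which is typical of a table that was only partially updated.
    if not rows or len(rows) < expected_rows * (1 - max_row_drop):
        raise DataQualityError(
            f"row count {len(rows)} is far below the expected {expected_rows}"
        )
    # Trip the breaker if a required field is missing too often,
    # which is typical of a join against a stale table.
    nulls = sum(1 for row in rows if row.get(required_field) is None)
    null_rate = nulls / len(rows)
    if null_rate > max_null_rate:
        raise DataQualityError(
            f"{required_field} is null in {null_rate:.0%} of rows"
        )
    return rows

batch = [{"user_id": 1, "segment": "a"}, {"user_id": 2, "segment": None}]
try:
    check_batch(batch, expected_rows=2, required_field="segment")
except DataQualityError as error:
    print(f"ingestion halted: {error}")
```

The important design choice is that the check fails loudly and stops ingestion, instead of logging a warning and letting the batch through.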

6. “All the Datasets Managed by a Given Team Have the Same Quality”

This is a classic mistake. Datasets from the same team are not all equally reliable. Some of them are updated and managed very closely, while others might be irregularly updated or have poorly written ETL pipelines!

Always develop validation rules for any input data used by the models.
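Validation rules do not need heavy tooling to get started. Here is a minimal sketch that expresses rules as one small predicate per column and reports every violation; the column names, the rules themselves, and the `validate` helper are hypothetical.

```python
# Hypothetical per-column rules; each returns True when the value is acceptable.
RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "country": lambda v: isinstance(v, str) and len(v) == 2,
    "age": lambda v: v is None or (isinstance(v, int) and 0 <= v <= 120),
}

def validate(rows, rules=RULES):
    """Return (row_index, column) pairs for every rule violation."""
    failures = []
    for index, row in enumerate(rows):
        for column, rule in rules.items():
            if not rule(row.get(column)):
                failures.append((index, column))
    return failures

rows = [
    {"customer_id": 1, "country": "US", "age": 30},
    {"customer_id": -5, "country": "USA", "age": None},
]
print(validate(rows))
```

Keeping the rules in one declarative table makes it easy to apply the same checks to every dataset, regardless of which team owns it.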

7. “Systematic Data Issues Causing Bias in the Overall Dataset”

If errors in the dataset are random, they are less harmful to model training. But if a bug causes specific rows or columns to be systematically missing, it can lead to a bias in the dataset.

For instance, if device details of customer clicks are missing for Android users due to a bug, the dataset will be biased toward iPhone user activity. Similarly, sudden distribution changes in the data are important to track.
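One way to spot this kind of systematic gap is to compute the missing-value rate per segment instead of for the dataset as a whole. The sketch below assumes click records are dictionaries; the `os` and `device_model` field names and the `missing_rate_by_group` helper are hypothetical.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_field, value_field):
    """Return the fraction of rows with a missing value_field, per group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row.get(group_field)
        totals[group] += 1
        if row.get(value_field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical click records.
clicks = [
    {"os": "Android", "device_model": None},
    {"os": "Android", "device_model": None},
    {"os": "iOS", "device_model": "iPhone 12"},
    {"os": "iOS", "device_model": None},
]
print(missing_rate_by_group(clicks, "os", "device_model"))
```

A missing rate that is uniform across groups points to random noise, while a rate concentrated in one group, as with Android here, points to a systematic bug worth fixing before training.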

To wrap up, I believe ML projects are a team sport involving Data Engineers, Data Scientists, Statisticians, DataOps / MLOps engineers, and Business Domain experts.

Each player needs to play their role in making the project successful.

This article is a subset of a broader list of "98 things that can go wrong in an ML project."