Being Data Science Ready

How to accelerate your startup, build data equity, and control data debt

You can be data science ready even if you are not doing data science yet

Let’s say you’re a small startup. Maybe you aren’t doing any data science just yet. You probably don’t have much data anyway. You expect that you’ll be doing awesome AI down the line. But for now, you have lots of other things to worry about.

Six months go by. You hire your first data scientist, Dahlia. She’s awesome. She joins the company… and not much happens.

It turns out that Dahlia is spending several months cleaning and reorganizing the database. She needs some help from engineers, so that slows down the dev team too. Everybody is frustrated.

Things don’t have to be like that. You can be data science ready.

And if you do it right, you’ll be doing data science even before you hire Dahlia. You will do it without sacrificing any of the focus and speed that are critical in an early stage company.

Being data science ready is an accelerator.

Understanding Data Debt

Why did Dahlia have to spend months reorganizing the data in the first place?

Data is subject to the second law of thermodynamics: without active investment, it gets messy. Here are some familiar examples:

As new tables or data sources are added, the database gets cumbersome. Answering even simple questions requires slow, complex queries.
Data becomes inconsistent. Queries that are supposed to get to the same number in two different ways return different results.
Edge cases are handled inconsistently. For example, invalid values are sometimes recorded as nulls and sometimes as ‘invalid’ or -1. This makes it very difficult for anyone without a lot of institutional knowledge to use the data effectively.

These are all examples of data debt: lack of investment in the data leading to a big data mess (pun intended).
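To make the edge-case example concrete, here is a minimal sketch of auditing and normalizing inconsistent "invalid" encodings. The marker values and the sample column are hypothetical; your own data will have its own list, and in some columns -1 may be a perfectly legitimate value.

```python
from collections import Counter

# Assumption: these are the encodings of "no valid value" found in one column.
# They are illustrative only; build the real list by inspecting your data.
INVALID_MARKERS = {"invalid", "n/a", ""}
INVALID_NUMBERS = {-1}


def normalize(value):
    """Map every known encoding of 'no valid value' to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in INVALID_MARKERS:
        return None
    if (
        isinstance(value, (int, float))
        and not isinstance(value, bool)
        and value in INVALID_NUMBERS
    ):
        return None
    return value


def audit(values):
    """Count how often each invalid encoding appears, to size the problem."""
    counts = Counter()
    for v in values:
        if normalize(v) is None:
            counts[repr(v)] += 1
    return counts


# Made-up sample column mixing several encodings of "invalid".
raw_ages = [34, None, "invalid", -1, 27, "Invalid", -1]
clean_ages = [normalize(v) for v in raw_ages]
encodings = audit(raw_ages)
```

Running the audit first tells you how widespread each encoding is before you decide on a single convention, such as always storing nulls.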

Data debt and technical debt

Good software developers are keenly aware of accumulating technical debt. Like data, software requires active maintenance: testing, documentation, refactoring. If you don’t invest in these seemingly mundane tasks, you will be accruing technical debt.

Like other types of debt, technical debt compounds. Eventually your development cycles slow down. It takes longer to develop new features. Your software becomes buggy and difficult to maintain.

Exactly the same thing happens with data. By the time Dahlia joins your company, you will have incurred a lot of data debt, and that is what slows her down. She must pay down the debt first.

Data debt sticks around for a long time and is hard to handle. Don’t accumulate too much of it.

Sometimes development teams discover that they have accumulated so much technical debt that the only solution is to tear down large parts of the codebase and rebuild them from scratch.

That is very painful, but it can work. That’s because you only care about what’s in your code now. If an old version was cumbersome and buggy, that doesn’t matter anymore. As in a game of chess, what matters in software is the current state of the board.

Data is not like that. The value of your data comes in part from quantity and history. If you revamp your data practices and improve them, that’s great. But you’re not going to throw away all your data, wipe down the database, and sit around and wait until you’ve collected some new data, are you?

Data debt sticks around for a long time. It’s hard to handle, and you can’t solve the problem by starting from scratch. So you should be careful about accumulating it.

The benefits of being data science ready

So data debt is bad. But it accrues on its own, almost like a law of nature, and reducing or preventing it takes effort.

So why not just worry about it later, when you hire your first data scientist? Or, how much should you invest in controlling your data debt now? What are the signs that you are data science ready?

Being data science ready has two immediate benefits: acceleration and equity.

Focusing your efforts on maximizing them will help you collect the right data, keep it in good shape, and prevent excessive data debt — while staying focused on the priorities of an early stage company.

Acceleration: you can’t learn what you don’t measure

In an early stage startup, the primary goal is to learn: understand the market need, compare distribution channels, experiment with improving conversion, etc.

One of the best-known frameworks for accelerating learning is the lean startup, which relies on a quick succession of build-measure-learn experiments, each designed to test a hypothesis about the market.

The key word here is measure. Acceleration comes from learning. Learning comes from measuring. Measuring requires data. Obviously, you have to collect the right data to properly learn from your experiments. Fortunately, that is the key to being data science ready.

The Proof of the Pudding is in the Analytics

Acceleration doesn’t require fancy data science. All it takes is rock solid analytics.

Let’s say you want to run an A/B test on a new feature of your product. Here are some questions you will want to answer:

Which users exactly were in arm A and which in arm B?
How many times did each user interact with the feature?
For each of these interactions, how did the user respond?
Are there differences between demographic groups, geographic areas, or other user characteristics?
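The questions above map quite directly onto tables and queries. Here is a rough sketch using an in-memory SQLite database. The table names, column names, experiment name, and sample rows are all made up for illustration; the point is only that arm assignment and every interaction are recorded per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per assignment, one row per interaction.
conn.executescript("""
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    country TEXT NOT NULL
);
CREATE TABLE experiment_assignments (
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    experiment TEXT NOT NULL,
    arm        TEXT NOT NULL CHECK (arm IN ('A', 'B')),
    PRIMARY KEY (user_id, experiment)
);
CREATE TABLE feature_interactions (
    interaction_id INTEGER PRIMARY KEY,
    user_id        INTEGER NOT NULL REFERENCES users(user_id),
    experiment     TEXT NOT NULL,
    response       TEXT NOT NULL,
    occurred_at    TEXT NOT NULL
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "US"), (2, "US"), (3, "FR"), (4, "FR")])
conn.executemany("INSERT INTO experiment_assignments VALUES (?, ?, ?)",
                 [(1, "new_button", "A"), (2, "new_button", "B"),
                  (3, "new_button", "A"), (4, "new_button", "B")])
conn.executemany(
    "INSERT INTO feature_interactions (user_id, experiment, response, occurred_at) "
    "VALUES (?, ?, ?, ?)",
    [(1, "new_button", "clicked", "2024-01-01T10:00:00"),
     (1, "new_button", "dismissed", "2024-01-02T10:00:00"),
     (2, "new_button", "clicked", "2024-01-01T11:00:00"),
     (4, "new_button", "clicked", "2024-01-03T09:00:00")])

# Who was in which arm, and how often did each user interact?
# The LEFT JOIN keeps users with zero interactions in the result.
interactions_per_user = conn.execute("""
    SELECT a.user_id, a.arm, COUNT(i.interaction_id) AS n_interactions
    FROM experiment_assignments a
    LEFT JOIN feature_interactions i
           ON i.user_id = a.user_id AND i.experiment = a.experiment
    WHERE a.experiment = 'new_button'
    GROUP BY a.user_id, a.arm
    ORDER BY a.user_id
""").fetchall()

# Do responses differ by arm and geography?
clicks_by_arm_and_country = conn.execute("""
    SELECT a.arm, u.country, COUNT(i.interaction_id) AS clicks
    FROM experiment_assignments a
    JOIN users u ON u.user_id = a.user_id
    LEFT JOIN feature_interactions i
           ON i.user_id = a.user_id
          AND i.experiment = a.experiment
          AND i.response = 'clicked'
    WHERE a.experiment = 'new_button'
    GROUP BY a.arm, u.country
    ORDER BY a.arm, u.country
""").fetchall()
```

Storing assignments separately from interactions is a design choice in this sketch: it lets you count users who were exposed to an arm but never touched the feature, which is easy to lose if you only log events.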

Without understanding the answers to such questions you can’t learn from the experiment. So capturing them in your database is a great investment:

Make a list of questions like these that should be easy to answer. Make them very granular.
Write the relevant queries. Make sure that they are tested and documented.
When you make changes to the database, test the queries for slowdowns or altered output.
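One possible way to follow the three steps above is to keep each query in your codebase under a name, with a one-line description, and to run it against a small fixture database whenever the schema changes. The sketch below assumes a hypothetical `users` table and two hypothetical queries; the one-second time budget is an arbitrary placeholder.

```python
import sqlite3
import time

# Hypothetical registry of documented analytics queries.
QUERIES = {
    "signups_total": {
        "doc": "Total number of registered users.",
        "sql": "SELECT COUNT(*) FROM users",
    },
    "signups_by_country": {
        "doc": "Registered users per country; must sum to signups_total.",
        "sql": "SELECT country, COUNT(*) FROM users "
               "GROUP BY country ORDER BY country",
    },
}


def make_fixture():
    """Build a tiny database with known contents (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [(1, "US"), (2, "US"), (3, "FR")])
    return conn


def run(conn, name):
    """Run a registered query and return its rows and elapsed seconds."""
    started = time.perf_counter()
    rows = conn.execute(QUERIES[name]["sql"]).fetchall()
    return rows, time.perf_counter() - started


def check_queries(conn, max_seconds=1.0):
    total_rows, t_total = run(conn, "signups_total")
    by_country, t_country = run(conn, "signups_by_country")

    # Altered output: results on the fixture must match known values.
    assert total_rows == [(3,)]
    assert by_country == [("FR", 1), ("US", 2)]

    # Consistency: two ways of reaching the same number must agree.
    assert sum(n for _, n in by_country) == total_rows[0][0]

    # Slowdowns: a crude guard; real thresholds depend on your data volume.
    assert max(t_total, t_country) < max_seconds
    return True


result = check_queries(make_fixture())
```

A check like this can run in your normal test suite, so a schema change that breaks or silently alters a query is caught by the same process that catches code regressions.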

As more and more granular analytics become available at your fingertips, it will become very easy to do deep dives into user behavior. You will get easy access to data-oriented insights that will clarify your thinking about your product. That’s actually one of the main roles of a data scientist, and you’ll be doing it before you even hire one!

Equity: Your data is an asset

The second benefit of being data science ready is equity. For any modern company, data is a primary asset. It is taken into account in the valuation of your company. Many companies have been acquired just for their data. Access to vast amounts of data is one of the main reasons why tech giants like Google and Facebook are so powerful.

To be valuable, data has to be actionable.

If a data scientist joins your company and spends six months cleaning data before she can do anything with it, then your data is not actionable. Your company is literally worth less because of that.

Give your future data scientist a voice

Think about your data from the perspective of your future data scientist

As a product developer, you are used to thinking about things from the point of view of the user. You do that even if you don’t personally interact with most of your users, or before you even have any users.

Sometimes the user is not the client paying for your product. The paying client could instead be the user’s employer or an advertiser. As you build your product, you think about how to provide value to them as well.

“Your data is an asset” means that at some point someone will be willing to pay for your data. That person is also a client. You should keep them in mind.

That client will be paying you because they can answer questions that they care about based on your data. Well, answering questions based on your data is exactly the job of a data scientist!

By thinking about your future data scientist you are building data equity.

Remember Dahlia? The best way to maintain data equity is to think about things from her perspective. Write user stories for her. Prioritize them in your backlog.

It doesn’t matter if you don’t know anything about machine learning. Dahlia wants to answer questions about your data. And that’s great, because you’re already answering many questions like that by having solid analytics!

All you have to do now is think about how Dahlia can learn from the analytics that you already have and use them to answer other questions about the data. So as you are writing all those queries, ask yourself:

How easy would it be for someone who doesn’t know the database to understand these queries?
Is all the information necessary to understand them contained in the database? Or do they rely on institutional knowledge, old documents on Google Drive, etc.?
Is the work that led to these queries properly recorded, ideally under version control in your codebase?
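For the second question, one option is to keep the documentation inside the database itself and check it automatically. The sketch below assumes a hypothetical `data_dictionary` table and a hypothetical `orders` table in SQLite; it lists every column that has no description yet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: a business table plus a table describing its columns.
conn.executescript("""
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    status       TEXT,
    amount_cents INTEGER
);
CREATE TABLE data_dictionary (
    table_name  TEXT NOT NULL,
    column_name TEXT NOT NULL,
    description TEXT NOT NULL,
    PRIMARY KEY (table_name, column_name)
);
""")

# Made-up descriptions; amount_cents is deliberately left undocumented.
conn.executemany("INSERT INTO data_dictionary VALUES (?, ?, ?)", [
    ("orders", "order_id", "Unique id, assigned at checkout."),
    ("orders", "status", "One of 'paid' or 'refunded'; NULL means unknown."),
])


def undocumented_columns(conn):
    """Return (table, column) pairs that have no data_dictionary entry."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name != 'data_dictionary' "
        "ORDER BY name")]
    documented = set(conn.execute(
        "SELECT table_name, column_name FROM data_dictionary"))
    missing = []
    for table in tables:
        # Table names come from sqlite_master, not from user input.
        for row in conn.execute(f'PRAGMA table_info("{table}")'):
            column = row[1]  # second field of table_info is the column name
            if (table, column) not in documented:
                missing.append((table, column))
    return missing


missing = undocumented_columns(conn)
```

Here `missing` flags `amount_cents` as the one column a newcomer would have to ask someone about. Other databases offer their own mechanisms for this, such as column comments, so treat the dictionary table as one illustration rather than a recommendation.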

Dahlia will thank you for thinking about her, and when she finally joins the team, she will be a force multiplier.

Bottom line

Do you have actionable analytics? Are they granular and well tested?
If a new data scientist joins your team, would she be able to quickly understand the queries behind the analytics? Would she be able to change them without breaking things?

If the answer to these questions is yes, then congratulations: you are data science ready. You are doing a great job controlling data debt. You are accelerating your learning cycles and building data equity, adding to the value of your company.

And if the answer is no? Then you should start thinking about reducing data debt. I will address how to do so in a future post.