David vs Goliath: Why Small Data Can Win

Written by karanveerm | Published 2017/10/25
Tech Story Tags: big-data | david-vs-goliath | small-data | scaling | software-engineering


The head of engineering at a billion-dollar company asks candidates to tell him the scale of data they’ve encountered. I heard anecdotally that one engineer was rejected because “he had only worked with gigabytes, not terabytes, of data”. As an engineer and an entrepreneur, I make a living by specializing in exactly this kind of scale. I have advised companies on how to run algorithms on terabytes of data, and these days big data is de rigueur: companies seem to consider the size of their data proportional to the value it can generate.

Consequently, companies practically ignore data that isn’t big. Engineers aren’t interested in working with small data because the infrastructure isn’t technically challenging. Data scientists ignore small data because running simple regressions doesn’t sound as impressive as working with deep learning algorithms. Business executives don’t use small data because they want the credibility that comes from handling complexity associated with scale and managing large data teams. So, everyone disregards small data.

However, this neglect stems from internal biases; the reasoning is not tied to business outcomes at all. In fact, I’d argue that there is little correlation between the scale of a dataset and the insights it can generate. Just because your data isn’t big doesn’t mean it’s useless. We worked with small data until the 2010s, when Hadoop and big data became mainstream. Statistics existed long before the era of big data and distributed computing, and you can continue to mine valuable insights from small data using algorithms and visualizations.

Small data isn’t just useful; it’s also easier to work with. That ease is exactly why engineers and data scientists have ignored it, but it’s also part of what makes small data valuable. You can extract information without building a large team around data analysis. You don’t need a distributed computing engine like Apache Spark, and you don’t need to spend millions buying distributed data solutions from vendors. You also don’t need to hire PhDs with expertise in deep learning.

Surprisingly, some companies that deal with massive amounts of data have trouble using their small data. I vividly remember when a marketing analyst at one of my clients complained, “All my data is lying somewhere in this massive infrastructure and I need to ask a data engineer to run a Hadoop job for me, when all I want to do is download a CSV and use Excel.” So, how can you derive value from small data?

First, realize that you already have small data, and you don’t need new infrastructure, initiatives, or recruits to derive value from it. There are straightforward ways to maximize its value. Remember the age-old principle: “Keep it simple, stupid.”

Next, start using descriptive analytics again. This includes summarizations, “slicing and dicing”, tabulations, and other exploratory techniques. Not every analysis needs to be backed by an underlying AI model. When the dataset is small enough, every point can be visualized. The human eye is underestimated; patterns often jump out once the data has been plotted.
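To make this concrete, here is a minimal sketch of that exploratory workflow in pandas. It assumes a hypothetical small CSV of daily sales; the file name daily_sales.csv and the date, region, category, units_sold, and revenue columns are made-up names for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical small dataset: a few thousand rows of daily sales,
# easily loaded into memory on a laptop.
df = pd.read_csv("daily_sales.csv", parse_dates=["date"])

# Summarize: descriptive statistics for every numeric column.
print(df.describe())

# "Slice and dice": tabulate total revenue by region and category.
print(df.pivot_table(values="revenue", index="region",
                     columns="category", aggfunc="sum"))

# Visualize: plot every point and let the eye look for patterns.
df.plot.scatter(x="units_sold", y="revenue")
plt.show()
```

No cluster, no model, no team: a summary table, a pivot, and a scatter plot already answer a surprising share of business questions.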

Third, clean the data. Smaller datasets are more susceptible to errors, missing values, and noise. If your dataset is small enough, you can manually remove the glaring discrepancies. It’s true that this manual method won’t scale at the big data level, but most models that work on a larger scale account for these problems through regularization or other methods.
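A short sketch of that cleanup, again in pandas and again with assumed names (survey_responses.csv and the age column are placeholders for illustration):

```python
import pandas as pd

# Hypothetical small dataset of survey responses.
df = pd.read_csv("survey_responses.csv")

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Print rows with missing values; at this scale it is practical
# to review them by hand.
print(df[df.isnull().any(axis=1)])

# Fill remaining missing numeric values with the column median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Remove glaring discrepancies, e.g. impossible ages from typos.
df = df[df["age"].between(0, 120)]
```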

And finally, stay away from complex machine learning models. As companies build infrastructure to support trillions of data points and develop deep learning models, they find that these techniques do not carry over to smaller datasets. For example, a deep model is easy to over-fit to a small dataset, and that over-fitting makes the model more sensitive, especially to measurement errors. Ask yourself whether you can analyze the dataset with a simpler model, like logistic or linear regression, instead.
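For instance, a plain linear regression with cross-validation is often all a small dataset needs, and the cross-validation keeps you honest about over-fitting. The sketch below uses scikit-learn; ad_spend.csv and the column names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical small dataset: monthly marketing spend and revenue.
df = pd.read_csv("ad_spend.csv")
X = df[["ad_spend", "email_campaigns"]]
y = df["monthly_revenue"]

model = LinearRegression()

# Cross-validation on a small dataset gives an honest estimate of fit
# and flags over-fitting that a single in-sample score would hide.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("Mean R^2 across folds:", scores.mean())

# Fit on all the data and read off interpretable coefficients.
model.fit(X, y)
print("Coefficients:", dict(zip(X.columns, model.coef_)))
```

A bonus of the simple model: the coefficients are directly interpretable, which is often worth more to the business than a marginal accuracy gain.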

I’m not alone in praising small data; it is coming back into the spotlight. Stanford is offering a Fall 2017 course titled “Small” Data, and Martin Lindstrom has written a book about how companies can create superior products and services from small observations rather than relying exclusively on big data. Lindstrom proclaims in the book that “[in] the top 100 biggest innovations of our time, perhaps around 60 to 65 percent are really based on Small Data.” Even technologies traditionally built on big data may be able to leverage small data: Eric Schmidt, Alphabet’s executive chairman, says that AI may usher in the era of small data because smarter systems can learn more with less training. Not all of us have big data, but we all have small data, and we can start using it now.

