No PhD? No problem! Machine Learning for the rest of us

Written by dlite | Published 2019/02/15
Tech Story Tags: machine-learning | data-science | ai | applied-machine-learning | artificial-intelligence



I’ve been writing code since I was an awkward middle schooler. I’ll never forget creating my AOL homepage with a seizure-inducing repeating background, under construction gif, and faux visitor counter. I begged my dad to drop me off at school early the next day so I could try and access the page from the library computer.

Now that I’m an awkward adult, there have been a couple more magical moments in front of a computer screen. My most recent? A year ago, I solved my first (trivial) machine learning problem. I was blown away at the ability to connect something so human to bits and bytes.

Since that machine learning awakening I’ve been diving in hard. I’m not an expert by any means, but I’d love to give you a smoother onramp than my own.

There are three things I hope to leave you with:

  1. The vocabulary to comfortably talk about machine learning in conversation.
  2. A look at what it takes to solve a machine learning problem in code.
  3. A couple of hand-picked resources to get you going.

Defining the machine learning ecosystem is as complicated as defining love

Love means different things to different people, and plenty of smart people have defined it in great ways. The ecosystem around machine learning can generate the same confusion. Thankfully, I think if you master just five topics, you’ll be in the 90th percentile of conversational ML:

  1. Machine Learning
  2. Data Science
  3. Artificial Intelligence
  4. Software 2.0
  5. Applied Machine Learning

I’ll start with the most concrete definition: machine learning.

Machine Learning

Machine learning is the study of computer algorithms that improve automatically through experience.

Professor and Former Chair of the Machine Learning Department at Carnegie Mellon University, Tom M. Mitchell

The better the data you feed into a machine learning algorithm, the better the algorithm will perform. We’re not modifying machine learning algorithms to improve our results: we’re modifying the data.

Machine learning isn’t new: in 1952, Arthur Samuel wrote the first computer learning program. It played checkers. So, why do you hear so much about machine learning today?

It goes back to data. We’re able to store a lot of data very cheaply today. Our computers can process this data very efficiently. This is making our ML models better and more widespread.

Data Science

A data scientist is an expert at extracting nuggets of knowledge from a lot of information. And, they can do this very quickly.

A data scientist will use machine learning, but it’s only one of the tools in their tool set.

Artificial Intelligence

Like love, smart people define Artificial Intelligence (AI) in different ways.


AI is akin to building a rocket ship. You need a huge engine and a lot of fuel.

Andrew Ng, Co-Founder Google Brain

Some folks tie AI tightly to machine learning. This is understandable in today’s climate: much of the recent innovation in AI is driven by increasingly powerful machine learning models. In fact, it’s not uncommon to hear folks use AI and ML interchangeably.

Then there’s a broader definition:

Artificial intelligence is the science and engineering of making computers behave in ways that, until recently, we thought required human intelligence.

Andrew Moore, Dean of the School of Computer Science at Carnegie Mellon University


By this definition, wouldn’t a calculator be AI at the time it was introduced? Adding numbers was certainly something that we thought required human intelligence.

Today, a calculator isn’t considered AI, but a self-driving car is. In thirty years, a self-driving car will likely be as commonplace as a pocket calculator.

Which definition is correct?

I don’t know! Just be aware that some folks will go broad and others will align AI more tightly with the ML-fueled AI boom of today.

Software 2.0

Software 1.0 is code we write. Software 2.0 is code written by the optimization based on an evaluation criterion (such as “classify this training data correctly”).

Andrej Karpathy, Director of AI @ Tesla

By background, I’m a Software 1.0 kind of guy. There, the hard work is in maintaining a growing nest of algorithms. In Software 2.0, the work shifts from the algorithms, which we don’t create, to the data we feed in for training and evaluation.

While I agree there’s a real difference between these two styles of software, I find the name Software 2.0 unfortunate. Software 2.0 is being applied to new problems (detecting cancer, driving cars, identifying sentiment), not replacing old work.

Applied vs. Research Machine Learning

Are you baking bread or building ovens?

Imagine hiring a chef to build you an oven or an electrical engineer to bake bread for you…that’s the kind of mistake I see…over and over.

Cassie Kozyrkov, Chief Decision Intelligence Engineer, Google

For years, companies have preferred to hire folks with PhDs in machine-learning-related fields to solve problems with machine learning. Today, many problems can be solved by open-source ML algorithms. The challenge — as always in ML — is in the data.

Having a post-grad degree in an ML-related field is still a great asset. However, if you’re more interested in applying ML than learning how models work, you probably don’t need to go back to school.

Hello World — classifying handwritten digits

A sample of the dataset

Classifying handwritten digits is one of the most famous “hello world” problems in machine learning. With solid accuracy, you can solve this problem in just a few lines of code. It’s magical.

There are many Kaggle kernels that solve this problem. I’m going to skip the plumbing (importing libraries) and get right into the meat of the problem.

The dataset

We’re given a collection of 70k handwritten digits and their associated labels. Each digit is actually an array of 784 integers. Each integer is a grayscale value from 0–255: the higher the number, the darker the pixel. This array can be arranged into a grid 28 pixels wide and 28 pixels tall:

Each instance of our dataset is an array of 784 values. The higher the value, the darker the pixel in the image.
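I’m skipping the plumbing, but for reference, here’s a minimal sketch of loading the data, assuming scikit-learn’s fetch_openml helper (many Kaggle kernels read a CSV instead):

    # Load the 70k digits dataset. Each row is a flattened 28 x 28 image:
    # 784 grayscale values, one per pixel.
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X, y = mnist.data, mnist.target

    print(X.shape)                # (70000, 784)
    digit = X[0].reshape(28, 28)  # one instance, arranged into the pixel grid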

Splitting the dataset into a training and test set

The first step in nearly every ML problem is splitting the dataset into a training set and a test set. We train the model only on the training set.
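Here’s a minimal sketch of that split; MNIST conventionally reserves the final 10,000 instances for testing (X and y come from the loading sketch above):

    # Hold out the last 10,000 digits as a test set the model never sees
    # during training. The first 60,000 become the training set.
    X_train, X_test = X[:60000], X[60000:]
    y_train, y_test = y[:60000], y[60000:]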

Why would we exclude data when an ML model gets better with more data?

If we trained our model with all 70k handwritten digits, we’d still need a way to evaluate its accuracy on data it hasn’t seen. Think about how much work (and time) it would take to digitize a fresh batch of handwritten digits! By never fitting our model to the test set, we can immediately see how it performs against a stand-in for real-world data.

Picking a model

Now that we have our data properly split, it’s time to train a model! A good model to start with is a Random Forest Classifier. This is a fairly simple model that produces solid results with little tuning.

Random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Niklas Donges in The Random Forest Algorithm.

Training a Random Forest Classifier

We can initialize and train our model in just two lines:
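Here’s a sketch of those two lines, assuming scikit-learn’s RandomForestClassifier. Note that I’m only fitting on the first 1,000 training instances; more on that in a moment:

    # Initialize the classifier, then fit it to a slice of the training set.
    # random_state is pinned for reproducibility (an arbitrary choice).
    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train[:1000], y_train[:1000])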

…and it worked.

Accuracy

We can see how the model performs against the test set by measuring its accuracy (the percentage of time the model correctly classified a 3 as a 3, a 4 as 4, etc.):
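A sketch, assuming scikit-learn’s accuracy_score:

    # Compare the model's predictions on the unseen test set to the true labels.
    from sklearn.metrics import accuracy_score

    predictions = clf.predict(X_test)
    print(accuracy_score(y_test, predictions))  # roughly 0.80 with 1,000 instances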

For two lines of code, we’re about 80% accurate. If I wrote a Software 1.0-style algorithm to do this, I doubt I’d reach that accuracy, and it’d take me a lot longer! But we can do better!

The astute among you might have noticed I only trained the model on 1,000 instances of the training set. Remember how ML algorithms get better with more data? Let’s use all 60,000 training instances and get a new accuracy score:
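The same two lines as before, now fit to the full training set (a sketch; training will take noticeably longer):

    # Retrain on all 60,000 instances and re-measure accuracy on the test set.
    clf = RandomForestClassifier(random_state=42)
    clf.fit(X_train, y_train)
    print(accuracy_score(y_test, clf.predict(X_test)))  # prints about 0.95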

The ML Hype Roller Coaster

Right now, we’re at peak hype in this article: two lines of code, 95% accuracy!

Why was this so effortless? What’s the usual answer for anything in ML? The data! If I asked you to solve this problem without a dataset, you’d need a lot of time to put one together. You might start with 5k images, notice the model doesn’t classify 3s well, add more images, hit a new issue, and so on. Training and evaluating ML models is pretty easy today. Data is tedious and hard.

ML Diving Board

If I haven’t scared you off, there are a couple of resources I recommend to get you going:

In Closing

The breadth of use cases for machine learning is so large it can be difficult to choose where to begin. I hope the above is enough to help you focus and get started!

Oh — and it’s all about the data.


Written by dlite | Working on booklet.ai, co-founded Scout Server Monitoring (acq. 2017) & ScoutAPM (acq. 2018).
Published by HackerNoon on 2019/02/15