Fast AI Machine Learning Lecture 1 Notes

These are the (Unofficial) Lecture Notes of the Fast.ai Machine Learning for Coders MOOC.

You can find the Official Thread here

This is Part 1/12 Lecture Notes.

Introduction

Course taught at USF, available as MOOC.
[Course Website is yet to be launched, I’ll update the link once it is. Meanwhile You can find the Official Thread here]
Tip: Check for “Cards” on the Video.

Alternatives to Local Setup: (With Fast AI Support)

Crestle: Pricing is approximately 3 cents an hour. The Jupyter notebook opens in the browser.
Paperspace

Local Setup instructions:

Assuming you have your GPU and Anaconda Setup (Preferably CUDA ≥9):

$ git clone https://github.com/fastai/fastai
$ cd fastai
$ conda env update

Use the Setup Script provided Here

$ wget -O - files.fast.ai/setup/paperspace | bash

Approach to Learning

Follow Along.
Watch first and follow later (loose recommendation). You might miss some important information, and you can experiment.

Teaching Approach:

Dive into Code
Build Models
Theory comes later, at a point when you’re already an effective coder.
Try with More Datasets.
The More Coding you do, The Better (Recommended by Alumni)
Write Blog Posts.

“Hey I Just Learned this Concept, and I’ll share about it”

Good Technical Blogs:

Peter Norvig (more here)
Stephen Merity
Julia Evans (more here)
Julia Ferraioli
Edwin Chen
Slav Ivanov (fast.ai student)
Brad Kenstler (fast.ai and USF MSAN student)

Imports

Auto reload commands:

%load_ext autoreload
%autoreload 2

Without these, if you modify the source code of an imported module, you would have to restart the kernel for the changes to take effect.

These two lines automatically reload the imported modules in the notebook in case you change their source.

%matplotlib inline

To plot Figures inline

from fastai.imports import *

Data Science is not Software Engineering. Prototyping models needs things to be done interactively.

import * brings everything into the namespace, so we don’t need to decide up front which specific names we need.

Jupyter Tricks

fn_name: Shows which library the function comes from.
?fn_name: Shows the documentation of the function.
??fn_name: Shows the source code of the function.

Getting the Data

Kaggle: Problems posted by a company or institute. These are really close to real-world problems and allow you to check yourself against other competitors.

TL;DR: Perfect place to check your skillset.

Jeremy: “I’ve learnt more from Kaggle competitions than anything else I’ve done in my Entire Life”

Go to Competition page.
Accept Terms and Conditions.

Download Dataset

Setup Official Kaggle API

Use The Terminal to Download the Dataset.

Use CurlWget Chrome Extension.
Start Download and Cancel it.
Click on the Extension.
Paste the Copied Command into a Terminal.

Note: Prefer techniques that will also be useful for downloading data to your cloud compute instance. Crestle and Paperspace will have most of the datasets pre-downloaded.

Good Practice: Create a data folder for all of your data.

To run Bash commands in Jupyter:

!BASH_COMMAND

To use Python variables or expressions inside the command, wrap them in curly braces:

!BASH_COMMAND {python_expression}

Blue Book for Bulldozers:

Goal:

The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.

Fast Iron is creating a “blue book for bull dozers,” for customers to value what their heavy equipment fleet is worth at auction.

Look at Data to Get Started.

!head data/bulldozers/Train.csv

Gives the First Few lines.

Structured Data:

(Unofficial definition) Columns of data having varying types of data.

Pandas: The best library for working with tabular data.
fastai.imports imports the pandas library by default.
Reading CSV

df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False, parse_dates=["saledate"])

low_memory=False: Makes pandas read the whole file in one go, so it can infer the column types more reliably, at the cost of using more memory.

Python 3.6 Formatting:

var = 'abc'
f'ABC {var}'

The f prefix makes Python evaluate the code inside the {} and insert the result into the string.
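As a small illustration of how the f-string in the read_csv call is built, here is a sketch. The PATH value is only a placeholder assumed for this example.

```python
# PATH is a placeholder directory assumed only for this illustration.
PATH = "data/bulldozers/"

# Anything inside {} is evaluated as a Python expression.
train_file = f"{PATH}Train.csv"
summary = f"2 + 3 = {2 + 3}"

print(train_file)
print(summary)
```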

Display data:

df_raw

Simply writing this would truncate the output

display_all():

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)

This allows The Complete df to be printed.

display_all(df_raw.tail().T)

Since there are a lot of columns, we have taken Transpose.

Evaluation:

Since the metric is RMSLE, we take the log of the sale price, so that a plain RMSE on the log values gives us the RMSLE.

Root mean squared log error: between the actual and predicted auction prices.
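To make the relationship concrete, here is a minimal sketch showing that the RMSE of the log prices is the RMSLE of the prices. The prices are made up for illustration, and the use of the plain natural log (rather than log(1 + x)) is an assumption of this sketch.

```python
import numpy as np

def rmse(x, y):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(((x - y) ** 2).mean()))

def rmsle(pred, actual):
    """RMSE computed on the natural logs of the values (plain log is assumed here)."""
    return rmse(np.log(pred), np.log(actual))

# Made-up prices, for illustration only.
actual = np.array([10000.0, 20000.0, 40000.0])
pred = np.array([11000.0, 18000.0, 40000.0])

# If the model is trained on log prices, RMSE on the logs is the RMSLE on the prices.
log_actual, log_pred = np.log(actual), np.log(pred)
assert np.isclose(rmse(log_pred, log_actual), rmsle(pred, actual))
print(rmsle(pred, actual))
```

Because the error is measured on logs, being off by 10% costs the same for a cheap machine as for an expensive one.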

Random Forests:

Introduction: A universal machine learning technique that can be used to predict both categorical and continuous variables.
It can work with any kind of columns, including pixel values.
In general, it doesn’t overfit, and it’s easy to stop it from overfitting.
It doesn’t generally require a separate validation set.
It requires no statistical assumptions.

TL;DR: It’s a great Start.

Curse of Dimensionality:

With a greater number of columns, the data occupies an increasingly empty mathematical space in which most data points sit on the edges (a mathematical property of high dimensions).

The argument is that this makes the distance between points meaningless.

In general, this is false:

Data points still have meaningful distances between them, even when they sit on the edges.
Theoretical research dominated the field in the ’90s.
In practice, building models on lots of columns works really well.

No Free Lunch Theorem:

There is no universal kind of model that works well for every kind of dataset.

In practice, however, we look at data that was created by some cause or structure. There are techniques that work well for nearly all of the datasets we actually work with, and ensembles of decision trees are the most widely used of them.

Passing the raw data straight to a model fails, because it still contains strings:

ValueError: could not convert string to float: 'Conventional'

SKLearn isn’t the Best library, but it’s good for our purposes.

RandomForest:

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

Regressor: Predicts continuous values.
Classifier: Predicts categorical values (classes).

Note: Regression!=Linear Regression.

Feature Engineering

The RandomForest Algorithm expects numerical data.

We Need to Convert Everything to Numbers.

DataSet:

Continuous variables.
Categorical variables:
- Numbers
- Strings
- Dates

df_raw.saledate

Information inside a Date:

Is it a holiday?
Is it a weekend?
The weather.
Event(s) Information.

??add_datepart

To look at the source code.

This grabs the field named by the variable fldname.

Note: df.fldname would literally look up a field named “fldname”.

df[fldname] is the safer option in general: it works with any column name and doesn’t give weird errors in case we make a mistake. Don’t be lazy and fall back on df.fldname.

Also, df[fldname] returns a Series.
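A minimal sketch of the difference between the two lookups, using a tiny made-up DataFrame:

```python
import pandas as pd

# Tiny made-up frame, for illustration only.
df = pd.DataFrame({"saledate": pd.to_datetime(["2011-01-03", "2011-07-16"])})
fldname = "saledate"

by_brackets = df[fldname]    # looks up the column named by the variable
by_attribute = df.saledate   # works only because the name is typed literally

try:
    df.fldname               # looks for a column literally called "fldname"
    lookup_failed = False
except AttributeError:
    lookup_failed = True

print(type(by_brackets).__name__, lookup_failed)
```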

The function goes through a list of attribute-name strings and, for each one, looks inside the date object and finds the attribute with that name. It is designed to create any column that might be relevant to our case. (This is the exact opposite of worrying about the curse of dimensionality: we are creating more columns.)

There is no harm in adding more columns to your data.

Python’s built-in getattr() does this lookup by name.

pandas groups type-specific methods and attributes under accessors.

All of the datetime-specific ones are available under .dt (for example, df_raw.saledate.dt.year).

Finally we drop the column.
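The idea can be sketched in a few lines. This is a simplified, hypothetical stand-in and not the fastai source: the function name add_date_columns, the attribute list, and the new column names are all assumptions made for this example, and the sale prices are made up.

```python
import pandas as pd

def add_date_columns(df, fldname,
                     attrs=("year", "month", "day", "dayofweek",
                            "dayofyear", "is_month_end")):
    """Hypothetical, simplified stand-in for add_datepart (modifies df in place)."""
    fld = df[fldname]
    for attr in attrs:
        # getattr(fld.dt, "year") is the same as writing fld.dt.year
        df[f"{fldname}_{attr}"] = getattr(fld.dt, attr)
    # Finally, drop the original date column.
    df.drop(columns=[fldname], inplace=True)

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "saledate": pd.to_datetime(["2011-01-03", "2011-07-31"]),
    "SalePrice": [66000, 57000],
})
add_date_columns(df, "saledate")
print(df.T)
```

Using getattr with a list of names keeps the function short: supporting another date attribute only means adding one more string to the list.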

Dealing with Strings

UsageBand has Low, High, Medium.
pandas has a categorical data type, but it does not use it for string columns by default.

train_cats

Creates categorical variables for the string columns. Each such column is stored as numbers, along with the mapping between the strings and the numbers.

Make sure you use the same mapping for training dataset and testing dataset.

Similar to .dt, .cat gives access to Categorical data.

Since a decision tree works by splitting columns, it is better for the categories to have a “logical” order.

An RF consists of trees that make splits. With the order High, Medium, Low, the splits could be High vs. Low+Medium, followed by Low vs. Medium.
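Here is a minimal sketch of the same idea in plain pandas, without train_cats. The four values are made up; only the column name UsageBand comes from the dataset.

```python
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({"UsageBand": ["Low", "High", None, "Medium"]})

# Convert the string column to the pandas categorical dtype.
df["UsageBand"] = df["UsageBand"].astype("category")
print(df.UsageBand.cat.categories)   # default order is alphabetical
default_codes = df.UsageBand.cat.codes.tolist()

# Impose a logical order, so that a split such as High vs. (Medium + Low)
# becomes a single threshold on the codes.
df["UsageBand"] = df.UsageBand.cat.set_categories(
    ["High", "Medium", "Low"], ordered=True)
ordered_codes = df.UsageBand.cat.codes.tolist()   # missing values get code -1
print(default_codes, ordered_codes)
```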

Missing Values

display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

.isnull() returns True/False for each value, depending on whether it is missing.
.sum() adds up the missing values in each column.
We then sort by column name and divide by the number of rows to get the fraction of missing values per column.
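The same chain on a tiny made-up DataFrame shows what each step contributes (the values are for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only.
df = pd.DataFrame({
    "UsageBand": ["Low", None, None, "High"],
    "YearMade": [2004, 1996, np.nan, 2001],
    "SalePrice": [66000, 57000, 10000, 38500],
})

# True/False per cell -> count of True per column -> sorted by name -> fraction.
missing_fraction = df.isnull().sum().sort_index() / len(df)
print(missing_fraction)
```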

Saving

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')

Feather: Saves the file in a format similar to the one used in RAM. In layman’s terms: it’s fast.

Pro-Tip: Use Temporary folder for all actions/needs that pop up while you’re working.

Final Steps

proc_df

A function in fastai.structured. It:

Grabs a copy of the df.
Grabs the dependent column.
Drops the dependent column from the df.
Fixes missing values:
- Numeric columns: if a column has missing values, create a new Boolean column named Col_na and replace the missing values with the median.
- Non-numeric and categorical columns: replace the values with their category codes plus 1.

df, y, nas = proc_df(df_raw, 'SalePrice')
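The steps can be sketched as follows. This is a hypothetical, simplified stand-in and not fastai’s proc_df: the name simple_proc is made up, it returns only two values, and the sample rows are invented for illustration.

```python
import numpy as np
import pandas as pd

def simple_proc(df, y_fld):
    """Hypothetical, simplified sketch of the steps (not fastai's code)."""
    df = df.copy()                          # work on a copy
    y = df[y_fld].values                    # grab the dependent variable
    df.drop(columns=[y_fld], inplace=True)  # and drop it from the features
    for name in list(df.columns):
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            if col.isnull().any():
                df[name + "_na"] = col.isnull()       # Boolean marker column
                df[name] = col.fillna(col.median())   # fill with the median
        else:
            # Category codes start at 0 and use -1 for missing values,
            # so adding 1 maps missing values to 0.
            df[name] = col.astype("category").cat.codes + 1
    return df, y

# Made-up rows, for illustration only.
raw = pd.DataFrame({
    "UsageBand": ["Low", None, "High", "Medium"],
    "YearMade": [2000.0, np.nan, 2004.0, 2008.0],
    "SalePrice": [66000, 57000, 10000, 38500],
})
X, y = simple_proc(raw, "SalePrice")
print(X)
```

After this step every column is numeric, which is what the random forest needs.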

Running Regressor

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)

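Random forests are parallelizable: n_jobs=-1 runs a separate job on every CPU we have.
m.fit creates the model.
m.score returns the score, which for a regressor is R².

1 is the best possible score.

0 means the model is no better than always predicting the mean; the score can even go negative for models that are worse than that.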

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
           m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
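To see what m.score reports, here is a small sketch on synthetic data (not the bulldozers data). The data-generating formula, n_estimators=20 and the random seeds are arbitrary choices for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 3))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)

m = RandomForestRegressor(n_estimators=20, n_jobs=-1, random_state=0)
m.fit(X, y)
train_r2 = m.score(X, y)   # R^2 on the data the model was trained on

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
pred = m.predict(X)
manual_r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(train_r2, manual_r2)
```

A high score on the training data alone says little about generalization, which is why a validation set is needed to check for overfitting.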

Checking Overfitting

We can create a Validation Dataset.
With the data sorted by date, the most recent 12,000 rows form the validation set.

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape
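A toy run of split_vals makes the slicing explicit. The ten-element array is only a stand-in for rows that are already sorted by date.

```python
import numpy as np

def split_vals(a, n):
    # First n rows go to training, the rest to validation.
    return a[:n].copy(), a[n:].copy()

data = np.arange(10)   # stand-in for rows already sorted by date
n_valid = 3
n_trn = len(data) - n_valid
train, valid = split_vals(data, n_trn)
print(train, valid)
```

Because the rows are in date order, the validation set contains only the most recent rows, which mirrors how the model will be used on future sales.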

Final Score

If you’re in the Top Half of the Kaggle LB, it’s a great start.

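print_score(m)
[0.09044244804386327, 0.2508166961122146, 0.98290459302099709, 0.88765316048270615]

The values are, in order: training RMSE, validation RMSE, training R², validation R².

A validation RMSE of 0.25 would get a leaderboard position in the top 25%.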

Appreciation: Without any thinking or intensive feature engineering, and without defining or worrying about any statistical assumptions, we get a decent score.

If you found this article to be useful and would like to stay in touch, you can find me on Twitter here.