Essential Machine Learning Libraries — I

Written by razzormandar | Published 2018/03/16
Tech Story Tags: data-science | numpy | matplotlib | machine-learning | ml-libraries

TLDRvia the TL;DR App

This post is part 1 of the Essential Machine Learning libraries series

Hello guys,

Now that you have covered the some basics of the theory behind Machine learning through the previous posts(1 ,2), it’s time to jump into some essential python libraries used for implementing ML and/or data science projects.

Python being the choice of most developers, major data science and ml/deep learning libraries used have been developed in Python. Though R is also be by many industry researchers for analytics and data science, my experience in this domain is limited to python.

This post will encompass only the libraries majorly used for data importing, preprocessing and visualization. The libraries used for ml model creation will be covered in the next post.

1. Numpy

Numpy library adds support for multidimentional arrays and matrices in a manner similar to MATLAB. Along with a large collection of high-level mathematical functions to operate on these arrays. The core functionality of NumPy is its “ndarray”, for n-dimensional array, data structure.

Sample Operations:

Array creation

import numpy as np x = np.array([1, 2, 3]) x

Output — array([1, 2, 3])

y = np.arange(10) # like Python's range, but returns an array y

Output — array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Basic operations

a = np.array([1, 2, 3, 6]) b = np.linspace(0, 2, 4) # create an array with four equally spaced points starting with 0 and ending with 2. c = a - b c

Output — array([ 1., 1.33333333, 1.66666667, 4.])

Output — array([ 1, 4, 9, 36])

Universal Functions

a = np.linspace(-np.pi, np.pi, 100) b = np.sin(a) c = np.cos(a)

Linear Algebra

from numpy.random import rand from numpy.linalg import solve, inv a = np.array([[1, 2, 3], [3, 4, 6.7], [5, 9.0, 5]]) a.transpose()

Output — array([[ 1. , 3. , 5. ], [ 2. , 4. , 9. ], [ 3. , 6.7, 5. ]])

Output — array([[-2.27683616, 0.96045198, 0.07909605], [ 1.04519774, -0.56497175, 0.1299435 ], [ 0.39548023, 0.05649718, -0.11299435]])

b = np.array([3, 2, 1]) solve(a, b) # solve the equation ax = b

Output — array([-4.83050847, 2.13559322, 1.18644068])

c = rand(3, 3) * 20 # create a 3x3 random matrix of values within [0,1] scaled by 20 c

Output — array([[ 3.98732789, 2.47702609, 4.71167924], [ 9.24410671, 5.5240412 , 10.6468792 ], [ 10.38136661, 8.44968437, 15.17639591]])

np.dot(a, c) # matrix multiplication

Output — array([[ 53.61964114, 38.8741616 , 71.53462537], [ 118.4935668 , 86.14012835, 158.40440712], [ 155.04043289, 104.3499231 , 195.26228855]])

For more detailed examples and tutorials follow this link.

2. Scipy

SciPy is a Python-based ecosystem of open-source software for mathematics, science, and engineering. It has a lot of packages for scientific computing including but not exhaustive — optimization, integration, fast fourier transform, ODE and signal and image processing.

SciPy is organized into subpackages covering different scientific computing domains. These are summarized in the following table:

Subpackage Description cluster Clustering algorithms constants Physical and mathematical constants fftpack Fast Fourier Transform routines integrate Integration and ordinary differential equation solvers interpolate Interpolation and smoothing splines io Input and Output linalg Linear algebra ndimage N-dimensional image processing odr Orthogonal distance regression optimize Optimization and root-finding routines signal Signal processing sparse Sparse matrices and associated routines spatial Spatial data structures and algorithms special Special functions stats Statistical distributions and functions

Since there are a lot of examples to be covered under scipy, I will be writing a seperate post after I have tried working with most of them. Till then you can check out the following link

3. Matplotlib

Matplotlib is a Python plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shell, the jupyter notebook, web application servers, and four graphical user interface toolkits.

It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+. There is also a procedural “pylab” interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB.

Line Plot

import matplotlib.pyplot as plt import numpy as np a = np.linspace(0,10,100) b = np.exp(-a) plt.plot(a,b) plt.show()

Histogram

import matplotlib.pyplot as plt from numpy.random import normal,rand x = normal(size=200) plt.hist(x,bins=30) plt.show()

Scatter Plot

import matplotlib.pyplot as plt from numpy.random import rand a = rand(100) b = rand(100) plt.scatter(a,b) plt.show()

For more detailed examples and tutorials follow this link.

4. Pandas

Pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Few of the important features of pandas are :

  • DataFrame object for data manipulation with integrated indexing.
  • Tools for reading and writing data between in-memory data structures and different file formats.
  • Data alignment and integrated handling of missing data.
  • Reshaping and pivoting of data sets.
  • Label-based slicing, fancy indexing,and subsetting of large data sets.
  • Data structure column insertion and deletion.
  • Group by engine allowing split-apply-combine operations on data sets.
  • Data set merging and joining.

Some sample examples:

Creating data

from pandas import DataFrame, read_csv import pandas as pd from numpy.random import rand names = ['Bob','Jessica','Mary','John','Mel'] births = [968, 155, 77, 578, 973]

Merging 2 lists using zip function

BabyDataSet = list(zip(names,births)) BabyDataSet

Output — [('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]

Creating dataframe object for storing data in a manner similar to sql

df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])

Importing and dividing dataset

dataset = pd.read_csv('Dataset_name.csv') #modified as per need X = dataset.iloc[index of input values].values y = dataset.iloc[index of target/ouput values].values

For more detailed examples and tutorials follow this link. Also this is a great article for a quick walkthrough.

Till now we have covered major python libraries required for importing and preprocessing data before actual machine learning model creation.

In the next post I will be covering ml and deep learning libraries essential for creating, training and testing various models. So stay tuned!

Originally published at mandroid6.github.io on June 23, 2017.

Checkout my other posts on machine learning and deep learning : https://medium.com/@razzormandar


Published by HackerNoon on 2018/03/16