Fundamental Python Data Science Libraries: A Cheatsheet (Part 1/4)

Written by laurenjglass9 | Published 2017/12/07
Tech Story Tags: data-science | python | python-data-science | data-science-libraries | data

If you are a developer who wants to integrate data manipulation or data science into your product, or you are just starting your journey in data science, here are the Python libraries you need to know.

  1. NumPy
  2. Pandas
  3. Matplotlib
  4. Scikit-Learn

The goal of this series is to provide introductions, highlights, and demonstrations of how to use the must-have libraries so you can pick what to explore more in depth.

NumPy

Just as it is written on NumPy’s website, this library is fundamental for scientific computing in Python. It includes powerful manipulation and mathematical functionality at super fast speeds.

Focus of the Library

This library is all about the multidimensional array. It is similar in appearance to a list & indexes like a list, but carries a much more powerful set of tools.
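As a rough sketch of what "indexes like a list" looks like in practice (the variable names here are illustrative, not taken from the walkthrough), the same bracket syntax works on both, and arrays additionally accept one index per dimension:

```python
import numpy as np

values = [10, 20, 30, 40]
arr = np.array(values)

# The same positional and slice syntax works on lists and arrays
first_item = arr[0]        # same element as values[0]
last_item = arr[-1]        # same element as values[-1]
middle = arr[1:3]          # slicing an array gives back an array

# A 2-dimensional array also accepts one index per dimension
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
corner = grid[1, 2]        # row 1, column 2
first_column = grid[:, 0]  # every row, column 0
```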

Installation

Open a command line and type in:

pip install numpy

Windows: in the past I have found installing NumPy to be a headache, so I encourage all you Windows users to download Anaconda’s distribution of Python which already comes with all the mathematical and scientific libraries installed.

Details

A NumPy array differs from a list in a couple of ways.

  1. All data in a NumPy array must be of the same data type, whereas a list can hold multiple types.

  2. A NumPy array is more memory efficient & faster! See a detailed explanation here

  3. Lists don’t have as many powerful mathematical methods and attributes built in, and those are super useful for data exploration and development.
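To make the first and third points a little more concrete, here is a minimal sketch (the variable names are my own, chosen for illustration). It shows a list holding mixed types, an array settling on a single data type, and arithmetic applied to a whole array without a loop:

```python
import numpy as np

# A list keeps each element's own type
mixed_list = [1, 2.5, "three"]
list_types = [type(item).__name__ for item in mixed_list]

# An array settles on one data type for every element
int_array = np.array([1, 2, 3])
upcast_array = np.array([1, 2.5, 3])   # the ints are upcast to float

# Doubling every element: a list needs a loop or comprehension,
# an array applies the operation element by element
doubled_list = [x * 2 for x in [1, 2, 3]]
doubled_array = int_array * 2
```

Note that multiplying a plain list by 2 repeats the list rather than doubling its values, which is one reason arrays are more convenient for math.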

Let’s dive in!

import numpy as np

Creation

You can create an array in a couple of different ways.

From a list or tuple

# 1 dimensional array (you can pick more)
future_array1 = [1,2,3,4,5]
array1 = np.array(future_array1)

>>> array1
array([1, 2, 3, 4, 5])

With placeholder content

# there are other placeholder options, see jupyter notebook below
# np.int was removed from recent NumPy releases, so use the built-in int
placeholder_zero = np.zeros((3,4), dtype=int) # default dtype is float

>>> placeholder_zero
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
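Beyond np.zeros, a few other placeholder constructors are worth knowing. This is only a small sample, with shapes and fill values picked arbitrarily for illustration:

```python
import numpy as np

# An array filled with ones (float by default)
placeholder_ones = np.ones((2, 3))

# An array filled with any constant you choose
placeholder_sevens = np.full((2, 2), 7)

# An identity matrix: ones on the diagonal, zeros elsewhere
placeholder_identity = np.eye(3)

print(placeholder_ones)
print(placeholder_sevens)
print(placeholder_identity)
```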

With a sequence

# sequence based on steps (start, end, step_value)
sequence_array1 = np.arange(10, 30, 4)

# evenly spaced sequence based on number specified
# (start, end, number_of_elements)
sequence_array2 = np.linspace(10, 30, 4)

>>> sequence_array1
array([10, 14, 18, 22, 26])

>>> sequence_array2
array([10.        , 16.66666667, 23.33333333, 30.        ])
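One detail that is easy to miss: np.arange stops before the end value, while np.linspace includes it by default. A short sketch of the difference, using the same start and end values as above (the endpoint=False variant is an extra illustration of my own):

```python
import numpy as np

stepped = np.arange(10, 30, 4)        # end value 30 is excluded
spaced = np.linspace(10, 30, 4)       # end value 30 is included

# linspace can also leave the end value out
spaced_open = np.linspace(10, 30, 4, endpoint=False)

print(stepped[-1])     # last value produced by arange
print(spaced[-1])      # last value produced by linspace
print(spaced_open)     # four values starting at 10, stopping short of 30
```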

Upload data

# csv = comma separated values
data = np.genfromtxt("your_file_here.csv", delimiter=",")
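If you don't have a CSV file handy, you can try np.genfromtxt on an in-memory string instead. In this sketch the CSV contents and column names are made up for illustration, and skip_header=1 is used on the assumption that the first row holds column names rather than data:

```python
import io
import numpy as np

# Hypothetical CSV contents standing in for a real file
csv_text = "price,quantity\n1.5,2\n3.0,4\n"

# genfromtxt accepts a file-like object as well as a filename
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(data.shape)   # rows x columns
print(data)
```

Values are read as floats by default, so the quantity column comes back as 2.0 and 4.0.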

Makes Math Easy

You can do all sorts of mathematical operations on the whole array. No looping required! A new array will be made with the results.

# np.int was removed from recent NumPy releases, so use the built-in int
a1 = np.array([1,2,3,4])
a2 = np.ones((1,4), dtype=int)
a3 = np.zeros((1,4), dtype=int)

# Addition/Subtraction
A = a1 + a2 - a3

# Multiplication/Division
B = a2 * a3 / a2

>>> A
array([[2, 3, 4, 5]])

>>> B
array([[0., 0., 0., 0.]])
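Two things happen quietly in operations like these: arrays of different shapes are stretched to match each other (NumPy calls this broadcasting), and in Python 3 the / operator always returns floats. A minimal sketch, reusing the a1 and a2 definitions with the built-in int as the dtype:

```python
import numpy as np

a1 = np.array([1, 2, 3, 4])           # shape (4,)
a2 = np.ones((1, 4), dtype=int)       # shape (1, 4)

# Broadcasting: the 1-D array is treated as a single row
total = a1 + a2
print(a1.shape, a2.shape, total.shape)

# True division gives floats, floor division keeps integers
quotient = a1 / a2
floor_quotient = a1 // a2
print(quotient.dtype, floor_quotient.dtype)
```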

Attributes & Methods

Beyond just mathematical operations, NumPy comes with a plethora of powerful functionality that you can leverage to save yourself time & increase readability.

Summary Statistics

>>> array1.mean()
3.0

>>> array2.mean(axis=0)
array([5., 6., 7., 8.])

>>> array2.mean(axis=1)
array([ 2.5,  6.5, 10.5])

Additionally, there are .max(), .min(), .sum(), and plenty more.
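Since array2 is not defined in the snippets above, here is a sketch that assumes it is a 3x4 array holding the numbers 1 through 12. That is one array consistent with the outputs shown, but it is my assumption, not necessarily the one used in the walkthrough. It also shows what the axis argument does:

```python
import numpy as np

# Assumption: a 3x4 array of 1..12, which matches the means shown above
array2 = np.arange(1, 13).reshape(3, 4)

column_means = array2.mean(axis=0)   # collapse the rows: one mean per column
row_means = array2.mean(axis=1)      # collapse the columns: one mean per row

# The other summary methods work the same way
largest = array2.max()
smallest = array2.min()
total = array2.sum()
column_totals = array2.sum(axis=0)
```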

Reshape

A = np.array([1,2,3,4,5,6,7,8,9,100])
B = A.reshape((2,5)) # takes a tuple of dimensions
C = B.T # transpose

>>> B
array([[  1,   2,   3,   4,   5],
       [  6,   7,   8,   9, 100]])

>>> C
array([[  1,   6],
       [  2,   7],
       [  3,   8],
       [  4,   9],
       [  5, 100]])
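Two small extras about reshape, sketched with the same A as above: passing -1 for one dimension asks NumPy to work that dimension out for you, and a target shape that doesn't fit the number of elements raises a ValueError.

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# -1 means "infer this dimension from the array's size"
B = A.reshape((2, -1))
print(B.shape)      # two rows, columns inferred
print(B.T.shape)    # transposing swaps the dimensions

# 10 elements cannot fill a 3x4 grid, so this fails
reshape_failed = False
try:
    A.reshape((3, 4))
except ValueError:
    reshape_failed = True
print(reshape_failed)
```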

More Math

D = A.reshape((1,10))

# D has shape (1, 10) and A has shape (10,), so D goes on the left
>>> D.dot(A)
array([10285])

There are many more (too many to list) mathematical methods available. Dot is just my favorite.
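For dot, the shapes have to line up: the last dimension of the left array must match the first dimension of the right one. A short sketch with the same A and D (the @ operator is shown as an equivalent spelling for this case):

```python
import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])   # shape (10,)
D = A.reshape((1, 10))                            # shape (1, 10)

# (1, 10) dot (10,) -> shape (1,)
row_result = D.dot(A)

# (10,) dot (10,) -> a single number, the sum of the squares
scalar_result = A.dot(A)

# The @ operator performs the same matrix product here
operator_result = D @ A

print(row_result, scalar_result, operator_result)
```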

I’m providing here a link to download my NumPy walkthrough using a Jupyter Notebook for everything we covered and more!

Never used Jupyter notebooks before? Visit their website here.

Overall, if you have complex transformations you need to do on lists of data, I recommend searching for a NumPy solution before coding something yourself. This will save you many a headache.

Applications

Let’s look at a scenario. Say I was able to export trading transactions: buys & sells. I want to see how much cash I had on hand after each transaction.

import numpy as np

# deposited $100,000 and then started buying and selling
trades = [100000, -10000, 10500, -100000, 175000]

# convert my trades to a numpy array
trades_array = np.array(trades)

>>> trades_array.cumsum()
array([100000,  90000, 100500,    500, 175500])
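Once the running balance is an array, follow-up questions take one line each. As a sketch on the same fictional trades (the questions themselves are my own additions for illustration): what was the lowest balance, after which transaction did it happen, and how much went out on buys in total?

```python
import numpy as np

# deposited $100,000 and then started buying and selling
trades_array = np.array([100000, -10000, 10500, -100000, 175000])
balances = trades_array.cumsum()

# Lowest cash on hand and the position of the transaction that caused it
lowest_balance = balances.min()
lowest_position = balances.argmin()

# Boolean indexing keeps only the negative entries (the buys)
total_spent = trades_array[trades_array < 0].sum()

print(lowest_balance, lowest_position, total_spent)
```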

This is a version with very simple, fictional data. However, what if we wanted to work with the data shown above but with the dates next to them? That’s possible too: check out my next article on pandas.

Thanks for reading! If you have questions feel free to comment & I will try to get back to you.

Connect with me on Instagram @lauren__glass & LinkedIn

Published by HackerNoon on 2017/12/07