A Gentle Introduction to Data Science

Written by keshavdhandhania | Published 2018/05/23


Trillions of gigabytes of data are produced every year, and the amount is still growing exponentially. It is estimated that by 2020, 1.7 megabytes of data will be produced every second for every person, and that the total accumulation of digital data will reach about 44 zettabytes, or 44 trillion gigabytes. This explosion of data is illustrated in the graph below.

Growth of Data. Source: Patrick Cheesman

Data is only a raw material, and extracting information from it requires further work. Our society is becoming increasingly data dependent, and data science is the field that helps us make sense of this huge quantity of data.

Data Science is an interdisciplinary field that makes use of methods and technologies from computer science, databases, mathematics, statistics and machine learning. It is concerned with the collection, preparation, analysis, visualization, management and preservation of data. This data is often available in very large quantities and covers a wide variety of types.

Examples of Data Science around us

Data science is widely used by companies and other organizations to get insights about their customers, staff, products and processes.

For example, Google uses data science in products like AdSense to personalize the advertisements shown to people browsing the web, based on the website they are on and other data Google has collected about the user in the past. Uber uses data science to calculate how much to charge for a particular ride, which riders to give discounts to, and to test which loyalty programs work best for its drivers. Airbnb (an online marketplace which connects people looking to rent out their homes with people looking for accommodation) uses data science to help hosts estimate the prices they should charge. For any data-centric organization, the data is the voice of the customer, and data science is the interpretation of that voice.

Besides the commercial sector, government and non-government organizations also depend heavily on data science to make sense of the huge quantities of data they generate. By using data science, governments can detect fraud and criminal activity, optimize investment and funding, and much more. Similarly, NGOs use data science to strengthen their cause by providing reliable evidence. For example, the World Wildlife Fund (WWF) increases the effectiveness of its fundraising by using data science to surface information about different wild animals and birds. Beyond these institutions, many other organizations are using data science for a multitude of tasks, and its use will only increase with time.

Opportunities in Data Science

The exponential growth of data has also led to exponential growth in the number of data science jobs. An analysis by LinkedIn, based on its huge database of professional profiles, shows the growth of Data Analyst and Data Science roles as a whole (see figure below).

Growth of Data Science and Data Analysis jobs

By the way, as a small aside: this tutorial is taken from the Data Science Course on Commonlounge. The course includes many hands-on assignments and projects, and 80% of the course contents are available for free. If you're interested in learning Data Science, we definitely recommend checking it out.

Key Components of Data Science

Programming (Python, R)

As mentioned before, data science deals with large amounts of data. In data science, this data is managed and analyzed using computer programming. Other non-programming ways to analyze the data are studied in a field known as Data Analytics / Business Analytics.

In the data science community, the following two programming languages are most popular:

Python: The availability of a large number of third-party packages such as NumPy, SciPy, scikit-learn and Matplotlib makes data science projects easier to implement and has led to Python's immense popularity. In addition, editors and IDEs like PyCharm, Vim and Emacs, and interactive Python environments like IPython and Jupyter, have made Python easier to use than many other languages.
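
To give a flavor of what working with these packages looks like, here is a minimal sketch (assuming NumPy and Matplotlib are installed; the data is synthetic, generated purely for illustration) that computes summary statistics and plots a histogram:

```python
# A quick look at the Python data stack: NumPy for numerical work,
# Matplotlib for plotting. The data here is simulated for illustration.
import numpy as np
import matplotlib.pyplot as plt

# Simulate 1,000 observations, e.g. daily sales figures
data = np.random.normal(loc=100, scale=15, size=1000)

print("mean:", data.mean())
print("standard deviation:", data.std())

# Visualize the distribution
plt.hist(data, bins=30)
plt.title("Distribution of simulated daily sales")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()
```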

R: R is a programming language developed specifically to carry out a variety of statistical and graphical techniques; in other words, it was designed and created by statisticians, for statistics. R too has packages for data wrangling, data visualization and machine learning. It is an open-source language with an active community of statisticians and programmers who constantly enrich it by adding new libraries for new statistical methods.

Data (and its various types)

Data science uses programming to analyze data, and this data can be of various types. Some important categories of data are discussed below:

Structured Data: Data that is easy to represent in tabular form, and to store and manipulate in databases and Excel files. Such data has a clearly defined data model. For example, Airbnb has a database of places available for rent, which consists of variables like the size of the home (in square feet), the number of guests it can accommodate, the number of beds, the number of bathrooms, the per-day cost of renting the home, and so on.
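
As a small illustration, structured data like the Airbnb example above maps naturally onto a table. The sketch below (assuming pandas is installed; the listings and column names are made up) builds such a table and queries it:

```python
# Structured data fits naturally into a table with a fixed schema.
# The rows and columns below are invented for illustration.
import pandas as pd

listings = pd.DataFrame({
    "size_sqft":     [450, 800, 1200],
    "max_guests":    [2, 4, 6],
    "num_beds":      [1, 2, 3],
    "num_baths":     [1, 1, 2],
    "price_per_day": [60, 110, 180],
})

# Because the data model is well defined, queries are straightforward,
# e.g. the average nightly price of listings that sleep at least 4 guests.
print(listings[listings["max_guests"] >= 4]["price_per_day"].mean())
```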

Unstructured Data: Data which doesn’t fit into a data model easily is called unstructured data. Examples of unstructured data include emails, PDF files, images, videos, etc.

Natural language: Data that is directly written in languages humans use to communicate with each other such as English, Chinese, French, etc. Natural language data is a sub-type of unstructured data.

Image, Video, Audio: Images, videos and audio are widely generated by sensors such as cameras and microphones. They are unstructured in nature, and extracting information from them is quite a challenge.

Graph-based Data: A graph is a mathematical structure that models pairwise relations between entities. It uses nodes, edges and properties to store information. For example, information about Facebook friendships can be represented as a graph, where people are nodes and an edge between two nodes denotes that the two people are friends.
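
A minimal way to see this in code: the sketch below represents a tiny, made-up friendship graph as an adjacency list using plain Python dictionaries and sets.

```python
# A tiny friendship graph stored as an adjacency list:
# each person (node) maps to the set of people they are friends with (edges).
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice"},
    "carol": {"alice", "dave"},
    "dave":  {"carol"},
}

def friends_of_friends(person):
    """People reachable in exactly two hops, excluding the person and direct friends."""
    direct = friends[person]
    two_hops = set()
    for friend in direct:
        two_hops |= friends[friend]
    return two_hops - direct - {person}

print(friends_of_friends("bob"))   # {'carol'}
```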

Machine Generated: Machine-generated data is any information created by a computer, application or machine without human involvement, such as server logs or automated sensor readings.

Statistics and Probability

Statistics: Statistics is the branch of mathematics that deals with the collection, organization, analysis and interpretation of data. Statistical methods and techniques are implemented via programming to analyze data. Some commonly used concepts include the mean, mode, median, standard deviation, hypothesis testing, skewness, etc.
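
The sketch below uses Python's built-in statistics module to compute a few of these descriptive statistics on a small made-up sample:

```python
# Descriptive statistics with Python's standard library.
# The sample values are invented for illustration.
import statistics

sample = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

print("mean:", statistics.mean(sample))                 # 20.1
print("median:", statistics.median(sample))             # 21.0
print("mode:", statistics.mode(sample))                 # 22
print("standard deviation:", statistics.stdev(sample))  # sample standard deviation
```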

Probability: Probability is used to mathematically describe the likelihood that an event occurs. It quantifies randomness and uncertainty. For example, probability tells us the chance of it raining on a particular day, or of someone winning a lottery. The probability of an event is always between 0 and 1, where 1 represents absolute certainty and 0 represents complete impossibility. Some commonly used concepts include random variables, probability distributions, conditional probability, Bayes' theorem, z-tests, etc.
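
As a small worked example of Bayes' theorem, the sketch below uses hypothetical numbers for a diagnostic test to compute the probability that a person actually has a condition given a positive test result:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# All numbers below are hypothetical, chosen only to illustrate the calculation.
p_disease = 0.01            # P(disease): 1% of the population has the condition
p_pos_given_disease = 0.95  # P(positive | disease): test sensitivity
p_pos_given_healthy = 0.05  # P(positive | no disease): false positive rate

# Total probability of testing positive
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Probability of having the disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161, much lower than intuition suggests
```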

Relation with Data Science: Data Science is all about manipulating data to extract information from it. Statistics and probability form the mathematical foundation of data science. Without a clear understanding of statistics and probability, it is very easy to mis-interpret data and reach incorrect conclusions.

Machine Learning

Introduction: Arthur Samuel defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed. A machine learns whenever it changes its structure or program in a way that improves its expected future performance. The change can occur in response to its inputs or to external information. For example, when the performance of a machine learning model being trained for object recognition improves after it sees several pictures of the object, it is reasonable to say that the machine has learned to identify the object.

In simple terms, machine learning involves three goals: change, generalization and improvement.

  • Learning changes the learner: for machine learning, the problem is determining the nature of these changes and how best to represent them.
  • Learning leads to generalization: performance must improve not only on the same task but also on similar tasks.
  • Learning leads to improvement: machine learning must address the possibility that changes may degrade performance, and find ways to prevent this.

Machine learning systems perform a variety of tasks that involve recognition, diagnosis, planning, robot control, prediction, etc.

Machine Learning in Data Science: Data scientists use machine learning algorithms, and regression and classification methods in particular are popular in data science. Machine learning comes in handy when data scientists need to predict things from the available data. For example, using a shopping mall's sales data from previous years, we can predict approximate sales for the coming years with regression methods like linear regression. Similarly, classifying data into known classes, like classifying birds based on their calls, requires machine learning algorithms like logistic regression, decision trees, etc.
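
To make this concrete, here is a minimal sketch (assuming scikit-learn and NumPy are installed; the yearly sales figures are invented) that fits a linear regression and predicts sales for the following year:

```python
# Fit a simple linear regression on made-up yearly sales data
# and predict sales for the following year.
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[2013], [2014], [2015], [2016], [2017]])   # feature: year
sales = np.array([1.2, 1.5, 1.9, 2.4, 2.8])                  # target: sales in millions

model = LinearRegression()
model.fit(years, sales)

predicted_2018 = model.predict(np.array([[2018]]))
print(f"Predicted 2018 sales: {predicted_2018[0]:.2f} million")
```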

Big Data

Introduction: When a dataset becomes so large or so complex that it is difficult to process using traditional data management approaches, we turn to Big Data. Storing or processing this data usually requires a large number of computers (from tens of machines for small companies to tens of thousands for large ones). Big Data is characterized by three Vs:

  • Volume: Big Data is large in volume: it can range from terabytes to zettabytes.
  • Variety: Big Data is diverse in nature. It can be in different formats and types. Most companies have a mix of structured and unstructured data.
  • Velocity: Large amounts of data are generated on an ongoing basis. For example, data may be streaming in from users interacting with a website, or from sensors that are constantly collecting measurements.

Three Vs of Big Data: Volume, Velocity and Variety

Big Data and Data Science: The emergence of big data has raised the importance of data science. Data is often likened to crude oil, a raw material, and by applying data science we can extract different kinds of information from it, much like refining crude oil. Data scientists use different tools to process big data, such as Hadoop, Spark, R, Pig and Java, as their needs dictate. As our technology and society become more data-driven, big data and data science will become even more intricately related.
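
As one illustration of what these tools look like in practice, the sketch below uses PySpark (assuming Spark is installed; the file name rides.csv and its columns are hypothetical) to aggregate a dataset that may be far too large to fit on a single machine:

```python
# A minimal PySpark sketch: the same groupby-style analysis scales from
# a laptop to a cluster. The file and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# Spark splits the file across the cluster and processes partitions in parallel
rides = spark.read.csv("rides.csv", header=True, inferSchema=True)

# Average fare per city, computed in a distributed fashion
rides.groupBy("city").avg("fare").show()

spark.stop()
```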


Co-authored by Bishal Lakha and Keshav Dhandhania.

Originally published as a tutorial on www.commonlounge.com as part of the Data Science Course.

