Introduction to Web Scraping using Python

Written by srimanikanta | Published 2019/02/23
Tech Story Tags: web-scraping | data-science | python | web-development | data-mining


Summary: A quick tutorial on how to scrape the web with the help of the Python libraries Requests and bs4 (Beautiful Soup).

One of the most efficient ways for a data scientist to collect data is web scraping.

Web Scraping

It is the technique of capturing data from the web onto our local machine in order to perform data analysis or data visualization on it and get useful insights. It is also called Web Harvesting or Data Extraction.

In this article, we are going to scrape the web with the help of two libraries named Requests and bs4 (Beautiful Soup). The reason for choosing these two is that they are more powerful and flexible than many of the alternatives available. Both libraries also have a good community on Stack Overflow to help you if you are new to this web scraping journey.

By the end of this article, you will have a good amount of working knowledge of web scraping.

Prerequisites

A basic knowledge of HTML tags and CSS selectors is required.

Capturing data from the web starts with sending a request to the website from which you want to collect the data. This task is done with the help of the Requests module.

To make a request to a website, we first need to import the requests module in Python. It is not a built-in module, so we need to install the package with the help of pip.
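For example, both packages can be installed from PyPI in one line (run this in your terminal, not inside the Python shell):

pip install requests beautifulsoup4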

>>> import requests  # Module imported successfully.

# To make a request

>>> response = requests.get('https://www.hackthissite.org/')

# The response variable now holds the Response object for that request.

To check the status of the request, we need to use the status_code attribute of the response object.

>>> response.status_code

If the status code value is 200, you got a successful response from the website. Otherwise, you got a bad response; the problem might be a wrong URL or a server issue.
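As a small sketch, you can either branch on the status code yourself or let Requests raise an error for you with raise_for_status():

>>> if response.status_code == 200:
...     print('Success!')
... else:
...     print('Something went wrong:', response.status_code)
...

# raise_for_status() raises requests.HTTPError for 4xx/5xx responses and does nothing on success
>>> response.raise_for_status()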

Types of Requests

There are mainly six types of requests possible with the help of the requests module.

  1. get() request
  2. post() request
  3. put() request
  4. delete() request
  5. head() request
  6. options() request

The syntax for making any kind of request to a webpage is —

requests.methodName('url')

But the most popular methods for making a request to a web page are get() and post().
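For instance, a get() request can also carry query parameters; passing them through the params argument lets Requests build the query string for you (the URL and keys below are only placeholders for illustration):

>>> payload = {'q': 'web scraping', 'page': 2}
>>> r = requests.get('https://httpbin.org/get', params=payload)
>>> r.url  # the parameters are appended to the URL, e.g. ...?q=web+scraping&page=2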

To send any kind of sensitive data along with a request, such as login credentials, a post() request is preferable, because a get() request appends the data to the URL itself, where it can show up in browser history and server logs. A post() request sends the data in the request body instead (and is only truly protected when the site uses HTTPS).

>>> r = requests.post('https://facebook.com/post', data={'key': 'value'})
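As a quick, runnable sketch (httpbin.org simply echoes back whatever you send it, so it is safe for practice), a post() call with form data looks like this:

>>> r = requests.post('https://httpbin.org/post', data={'username': 'demo', 'password': 'secret'})
>>> r.status_code    # 200 on success
>>> r.json()['form'] # httpbin echoes the submitted form fields back in its JSON response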

Response Results

The response content for our request is obtained with the help of the text attribute.

>>> response.text

The above statement will dump the page's raw HTML.

Don't be afraid of that result; it is just the initial, unparsed response from the webpage.

From that output, we can see that we have downloaded the webpage content onto our local machine. To make that content easier to work with, we take the help of the Beautiful Soup library, which helps us get useful insights from the downloaded data.

# Importing the Beautiful Soup library
>>> import bs4

To make our raw HTML data more readable, we need to parse the content with the help of a parser. The parsers that are most often used with Beautiful Soup are —

  1. lxml
  2. html5lib
  3. lxml's XML parser (lxml-xml)
  4. html.parser

But the most flexible and popular one is the lxml parser; it parses data very quickly and effectively.
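Keep in mind that lxml is a separate package, so it has to be installed with pip before Beautiful Soup can use it; if it is not available, Python's built-in html.parser works as a drop-in fallback:

pip install lxml   # run in your terminal, not the Python shell

>>> bs4.BeautifulSoup(response.text, 'html.parser')  # fallback parser, no extra install required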

>>> soup_obj = bs4.BeautifulSoup(response.text, 'lxml')
# The soup_obj will help us fetch our required results

# To make our previous data more understandable we will use prettify() on soup_obj

>>> print(soup_obj.prettify())  # print() is needed so the newlines render instead of showing as \n

The result will be the page's HTML, neatly indented by prettify().

So finally, we have moved one step further. The extraction of data starts here —

To extract the title of a webpage, we need to use a selector along with the appropriate HTML tag to obtain our result.

>>> soup_obj.select('title')[0].getText()

The output is the text of the page's <title> tag.
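Because select() accepts any CSS selector, the same pattern works for other elements too. For example (the selectors below are generic; which ones return results depends on the page's markup):

>>> for heading in soup_obj.select('h1, h2'):  # all first- and second-level headings
...     print(heading.get_text(strip=True))
...
>>> soup_obj.select('p')[0].get_text()  # text of the first paragraph, if the page has one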

To extract all the links from the webpage, you need to find all the anchor tags on the page and store the result in a variable. Then, with the help of a for loop, iterate over it and print the result.

>>> links = soup_obj.find_all('a')
# find_all() fetches every occurrence of the selected tag.

>>> for link in links:
...     print(link.get('href'))
...

# get() is used to extract a specific attribute (here, href) from the tag

The output is the list of href values (links) found on the page.

This is a simple example of fetching the links from a webpage. If you want to extract some other data from the webpage, select the appropriate tag related to your content and grab the result with the help of the soup object. Initially it may feel difficult, but once you have worked with it for a while you will be able to scrape almost any type of website within minutes.
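As one more hedged sketch, the same soup object can be reused to pull other attributes, for example image sources, and the standard-library urljoin() can turn relative links into absolute ones (the base URL here is simply the page we requested earlier):

>>> from urllib.parse import urljoin
>>> base_url = 'https://www.hackthissite.org/'
>>> for img in soup_obj.find_all('img'):
...     src = img.get('src')
...     if src:  # skip <img> tags without a src attribute
...         print(urljoin(base_url, src))
...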

Disclaimer

✋ You shouldn't perform web scraping on a website without the permission of its administrator; it may amount to illegal activity. Data scientists in companies usually scrape their own web pages and business data; they don't perform any illegal actions on other companies' websites. So be careful. I'm not responsible for any damage you cause. ✋

The web page I used in this tutorial is free to scrape, so there is no problem at all. Use that kind of website to learn or enhance your skills.

Conclusion

From all the above examples, I think you can now use the Requests and bs4 libraries easily. You can extract different types of content from the web with the help of different HTML tags and store that data in a CSV file or text file; simply apply Python file operations on the data to save it to your local machine.
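For instance, a minimal sketch for saving the scraped links to a CSV file with Python's built-in csv module (the file name and column header are just examples) could look like this:

>>> import csv
>>> with open('links.csv', 'w', newline='', encoding='utf-8') as f:
...     writer = csv.writer(f)
...     writer.writerow(['href'])  # header row
...     for link in soup_obj.find_all('a'):
...         writer.writerow([link.get('href')])
...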

Hope this helps you learn about web scraping in Python in an easy way.

If you liked this article please click on the clap and leave me your valuable feedback.


Written by srimanikanta | Problem Solver || Started Journey as a Programmer || Techie Guy || Bibliophile || Love to write blog
Published by HackerNoon on 2019/02/23