Scraping Amazon Reviews using Scrapy in Python [Tutorial]

Written by sandra-moraes | Published 2019/11/20

TLDR Scrapy is a web crawling framework in Python. Web scraping lets users collect data for needs such as online merchandising, price monitoring and driving marketing decisions. Scrapy's most significant feature is that it is built on Twisted, an asynchronous networking library, which makes spiders perform very well. In this tutorial, we will look at the different stages involved in scraping Amazon reviews, with a short description of each. We will start by creating a Scrapy project, which collates the different components of the crawlers into a single folder.

Are you looking for a way to scrape Amazon reviews but do not know where to begin? In that case, you may find this blog useful. In it, we will discuss scraping Amazon reviews using Scrapy in Python. Web scraping is a simple means of collecting data from different websites, and Scrapy is a web crawling framework in Python.
Web scraping allows users to collect data for their own requirements, for example online merchandising, price monitoring and driving marketing decisions. In case you are wondering whether this process is even legal, you can find the answer to this query here.
Before digging into scraping Amazon for product reviews, let us first look at a few use cases for scraping Amazon reviews in the first place.

Why the need for scraping Amazon reviews?
Sentiment analysis can be performed on the reviews scraped from products on Amazon. Such a study helps in identifying users' emotions towards a particular product. This can help sellers, and even other prospective buyers, understand the public sentiment related to the product.
Drop shipping is a business model that allows a company to operate without an inventory or a warehouse for storing its products. You can use web scraping to get product pricing and user opinions, understand the needs of customers and keep up with trends.
It is difficult for large companies to monitor the reputation of their products. Web scraping can help extract relevant review data, which can act as input to different analysis tools that measure users' sentiment towards the organisation.
What is Scrapy?
Scrapy is a web crawling framework in which a developer writes spiders, pieces of code that define how a particular site (or a group of websites) will be scraped. Its most significant feature is that it is built on Twisted, an asynchronous networking library, which makes spiders perform very well.
Let us now have a look at the basic pipeline for scraping Amazon reviews.
Scraping Amazon reviews Pipeline
I always feel that it is essential to have a holistic idea of the work before you start doing it, which in our case is scraping Amazon reviews. Hence, before we begin the coded implementation with Scrapy, let us take a high-level look at the complete pipeline for scraping Amazon reviews.
In this section, we will look at the different stages involved in scraping Amazon reviews, with a short description of each. This will give you an overall idea of the task that we are going to carry out in Python in the later sections.
1. Analysing HTML structure of the webpage
Scraping is about finding patterns in web pages and extracting them. Before starting to write a scraper, we need to understand the HTML structure of the target web page and identify patterns in it. A pattern can be the repeated use of particular classes, ids and other HTML elements.
2. Scrapy parser implementation in Python
After analysing the structure of the target web page, we work on the coded implementation in Python. The Scrapy parser's responsibility is to visit the targeted web page and extract the information according to the rules we define.
3. Collection and Storage of Information
The parser can dump the results in any format you wish, be it CSV or JSON. This is the final output file in which your scraped data resides.
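To make the storage stage concrete, here is a minimal sketch, using only Python's standard library, of how a list of scraped records could be written out as CSV or JSON. The two sample records and the helper names to_csv and to_json are illustrative assumptions, not part of Scrapy; later in this tutorial Scrapy's own output option does this work for us.

```python
import csv
import io
import json

# Hypothetical records in the shape a spider might yield: one dict per review.
sample_reviews = [
    {"stars": "5.0 out of 5 stars", "comment": "Great laptop, very light."},
    {"stars": "3.0 out of 5 stars", "comment": "Battery life is average."},
]


def to_csv(records):
    """Serialise a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["stars", "comment"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialise a list of dicts as a JSON array."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(to_csv(sample_reviews))
    print(to_json(sample_reviews))
```

CSV is convenient for spreadsheet tools and pandas, while JSON keeps nested data intact if you later scrape more fields.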
Python code implementation for scraping Amazon reviews
Installing Scrapy 
We will start by installing Scrapy on our system. There are two cases here. If you are using conda, you can install Scrapy from conda-forge using the following command:
conda install -c conda-forge scrapy
If you are not using conda, you can use pip and install it directly on your system using the command below:
pip install scrapy
Next, we create a Scrapy project. A Scrapy project lets you collate the different components of the crawlers into a single folder. To create a Scrapy project, use the following command:
scrapy startproject amazon_reviews_scraping
Once you have created the project, you will find two things in it. One is a folder that contains your Scrapy code, and the other is your Scrapy configuration file. The Scrapy configuration file helps in running and deploying the Scrapy project on a server.
Once we have the project in place, we need to create a spider. A spider is a chunk of Python code that determines how a web page will be scraped. It is the main component, which crawls different web pages and extracts content from them. In our case, this will be the code chunk that performs the task of visiting Amazon and scraping Amazon reviews. To create a spider, you can use the following command:
scrapy genspider amazon_reviews your-link-here
The spider is created in the spiders folder inside the project directory. Once you go into the Scrapy project, you will see a directory structure like the one below

Scrapy files description

Let us understand the Scrapy project structure and its supporting files in a bit more detail. The main files inside the Scrapy project directory include:
items.py
Items are containers that will be loaded with the scraped data.

middlewares.py
The spider middleware is a framework of hooks into Scrapy’s spider processing mechanism where you can plug custom functionality to process the responses that are sent to Spiders for processing and to handle the requests and items that are generated from spiders.

pipelines.py
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially. Each item pipeline component is a Python class.

settings.py
It allows you to customise the behaviour of all Scrapy components, including the core, extensions, pipelines and the spiders themselves.

spiders folder
The spiders folder is a directory that contains all spiders/crawlers as Python classes. Whenever you run a spider, Scrapy looks into this directory and tries to find the spider with the name provided by the user. Spiders define how a certain site or a group of sites will be scraped, including how to perform the crawl and how to extract data from their pages.
For more detailed information on Scrapy components, you can refer to this link
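As an illustration of two of the files described above, here is a minimal sketch of what items.py and pipelines.py could contain for this project. The names ReviewItem and CleanWhitespacePipeline are hypothetical, and the item is written as a dataclass, which assumes Scrapy 2.2 or later; the spider later in this tutorial yields plain dictionaries instead, so treat this as an optional variation rather than the method used here.

```python
from dataclasses import dataclass


# items.py: a container for one scraped review.
# Assumption: Scrapy 2.2 or later, which accepts dataclass objects as items.
@dataclass
class ReviewItem:
    stars: str
    comment: str


# pipelines.py: a pipeline component is a plain Python class that
# exposes a process_item(item, spider) method.
class CleanWhitespacePipeline:
    def process_item(self, item, spider):
        # Collapse runs of spaces and newlines left over from the HTML.
        item.stars = " ".join(item.stars.split())
        item.comment = " ".join(item.comment.split())
        # Returning the item passes it on to the next pipeline component.
        return item
```

A pipeline only runs once it is listed in the ITEM_PIPELINES setting in settings.py, for example {"amazon_reviews_scraping.pipelines.CleanWhitespacePipeline": 300}, where the number decides the order in which components run.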
Analysing HTML structure of the webpage
Before we actually start writing the spider implementation in Python for scraping Amazon reviews, we need to identify patterns in the target web page. Below is the page we are trying to scrape, which contains different reviews of the MacBook Air on Amazon.
We start by opening the web page with the inspect-element feature of the browser. There you can see the HTML code of the web page. After a little bit of exploration, I found the following HTML structure, which renders the reviews on the web page
On the reviews page, there is a division with the id cm_cr-review_list. This division contains multiple sub-divisions within which the review content resides. We are planning to extract both the star rating and the review comment from the web page. We need to go one level deeper into one of these sub-divisions to prepare a scheme for fetching both the star rating and the review comment.
Upon further inspection, we can see that every review sub-division is further divided into multiple blocks. One of these blocks contains the required star rating, and another contains the review text. Looking more closely, we can see that the star rating is marked by the class attribute “review-rating” and the review text by the class “review-text”.
All we need to do now is pick these patterns up using our Scrapy parser.
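To see the pattern in isolation before involving Scrapy, the sketch below picks out the two classes from a small piece of markup using only Python's standard library. The HTML sample is a simplified, hypothetical stand-in for the real page (which is far larger), and ReviewPatternParser and extract_reviews are illustrative names.

```python
from html.parser import HTMLParser

# Simplified, hypothetical markup that mimics the structure described above.
SAMPLE_HTML = """
<div id="cm_cr-review_list">
  <div class="review">
    <i class="a-icon review-rating"><span>5.0 out of 5 stars</span></i>
    <span class="review-text"><span>Great laptop, very light.</span></span>
  </div>
  <div class="review">
    <i class="a-icon review-rating"><span>3.0 out of 5 stars</span></i>
    <span class="review-text"><span>Battery life is average.</span></span>
  </div>
</div>
"""

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class ReviewPatternParser(HTMLParser):
    """Collect the text inside elements carrying the two review classes."""

    def __init__(self):
        super().__init__()
        self.ratings = []
        self.comments = []
        self._target = None  # list currently receiving text, if any
        self._depth = 0      # nesting depth inside the matched element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._target is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "review-rating" in classes:
            self._target = self.ratings
        elif "review-text" in classes:
            self._target = self.comments
        if self._target is not None:
            self._target.append("")
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._target is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._target = None

    def handle_data(self, data):
        if self._target is not None:
            self._target[-1] += data


def extract_reviews(html):
    """Return one dict per review, pairing ratings and comments by position."""
    parser = ReviewPatternParser()
    parser.feed(html)
    parser.close()
    return [
        {"stars": stars.strip(), "comment": comment.strip()}
        for stars, comment in zip(parser.ratings, parser.comments)
    ]


if __name__ == "__main__":
    for review in extract_reviews(SAMPLE_HTML):
        print(review)
```

For brevity this sketch does not check the enclosing cm_cr-review_list id. Scrapy's CSS selectors express the same idea in a single line each, which is why the spider itself is much shorter.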
Defining Scrapy Parser in Python
Now that we have our spider template ready and have analysed the pattern in the target web page, we can start writing the logic for extracting reviews from Amazon. We begin by extending the Spider class and listing the URLs we plan to scrape. The variable start_urls contains the list of URLs to be crawled by the spider.
Then we need to define a parse function, which is called whenever our spider visits a new page. In the parse function, we identify patterns in the targeted page structure. The spider then looks for these patterns and extracts them from the web page.
Below is a code sample of a Scrapy parser for scraping Amazon reviews:
# -*- coding: utf-8 -*-

# Importing Scrapy Library
import scrapy

# Creating a new class to implement Spider
class AmazonReviewsSpider(scrapy.Spider):

    # Spider name
    name = 'amazon_reviews'

    # Domain names to scrape
    allowed_domains = ['amazon.in']

    # Base URL for the MacBook Air reviews
    # (adjacent string literals are joined into a single string)
    myBaseUrl = ("https://www.amazon.in/Apple-MacBook-Air-13-3-inch-MQD32HN/product-"
                 "reviews/B073Q5R6VR/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
                 "&pageNumber=")
    start_urls = []

    # Creating list of urls to be scraped by appending page number at the end of base url
    for i in range(1, 121):
        start_urls.append(myBaseUrl + str(i))

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('#cm_cr-review_list')

        # Collecting product star ratings
        star_rating = data.css('.review-rating')

        # Collecting user reviews
        comments = data.css('.review-text')

        # Combining the results: pair each rating with the comment at the same position
        for rating, comment in zip(star_rating, comments):
            yield {'stars': ''.join(rating.xpath('.//text()').extract()),
                   'comment': ''.join(comment.xpath('.//text()').extract())}
Storing Scraped Results
Finally, we have successfully built our spider. The only task left is to run it. We can run the spider using the runspider command, which takes as input the spider file to run and the output file in which to store the collected results. In the case below, the spider file is amazon_reviews.py and the output file is reviews.csv
scrapy runspider amazon_reviews_scraping/amazon_reviews_scraping/spiders/amazon_reviews.py -o reviews.csv

EDA on Amazon reviews

In this section, we will do some exploratory data analysis on the data obtained after scraping Amazon reviews. We will count how often each rating was given to the product and look at the most common words used in the reviews. Using pandas, we can read the CSV containing the scraped data.
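import pandas as pd
import matplotlib.pyplot as plt

# Load the scraped reviews into a DataFrame
dataset = pd.read_csv("reviews.csv")

# Count how many reviews gave each star rating
summarised_results = dataset["stars"].value_counts()
plt.bar(summarised_results.keys(), summarised_results.values)
plt.show()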
The above code counts how many times each rating occurs and then plots a bar chart to visualise the findings. We have used the matplotlib library here to visualise the results.
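The scraped ratings are text rather than numbers, so computing an average needs one extra parsing step. The sketch below shows one way to do it with the standard library. It assumes each scraped value looks like "4.0 out of 5 stars", which you should verify against your own reviews.csv; parse_rating and rating_distribution are illustrative helper names.

```python
import re
from collections import Counter

# Assumption: each scraped 'stars' value looks like "4.0 out of 5 stars".
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?) out of 5")


def parse_rating(text):
    """Return the numeric rating from a 'stars' string, or None if absent."""
    match = RATING_PATTERN.search(str(text))
    return float(match.group(1)) if match else None


def rating_distribution(star_strings):
    """Count how many reviews gave each numeric rating."""
    ratings = [parse_rating(s) for s in star_strings]
    return Counter(r for r in ratings if r is not None)


# Hypothetical sample values, including one that cannot be parsed.
sample = ["5.0 out of 5 stars", "4.0 out of 5 stars", "5.0 out of 5 stars", "n/a"]
distribution = rating_distribution(sample)
average = sum(r * n for r, n in distribution.items()) / sum(distribution.values())
print(dict(distribution), round(average, 2))
```

With pandas, the same helper can be applied to the whole column through dataset["stars"].map(parse_rating).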

Let us now try to visualise some of the keywords present in the scraped reviews. We can visualise these keywords using a word cloud. A word cloud works on the principle that the most frequent words in the text should appear larger and bolder than the other words. The code snippet below can help you make a word cloud in Python:
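import matplotlib.pyplot as plt
from wordcloud import WordCloud

def visualise_word_map():
    # Join all review comments (from the dataset DataFrame) into one lower-case string
    words = " "
    for msg in dataset["comment"]:
        msg = str(msg).lower()
        words = words + msg + " "
    wordcloud = WordCloud(width=3000, height=2500, background_color='white').generate(words)
    # Set the figure size, then render the word cloud as an image
    plt.figure(figsize=(14, 7))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()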
The image below is a word cloud generated by the above code snippet. Words like laptop, Apple, product and Amazon appear in much larger and bolder fonts, indicating that they are used very frequently. This word cloud makes sense, because we scraped MacBook Air user reviews from Amazon. You can also see words like amazing, good, awesome and excellent, indicating that many of the users actually liked the product.
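The principle behind the word cloud, that more frequent words are drawn larger, comes down to counting words. A rough sketch of that counting step with the standard library is shown below; the stop-word list and the sample comments are small hypothetical examples, and top_words is an illustrative helper name.

```python
import re
from collections import Counter

# A small, hypothetical stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "is", "and", "it", "this", "very", "for"}


def top_words(comments, n=3):
    """Return the n most frequent non-stop-words across all comments."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z']+", str(comment).lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Hypothetical review comments.
sample_comments = [
    "Amazing laptop, the battery is good",
    "Good laptop and very good screen",
    "The laptop is amazing",
]
print(top_words(sample_comments))
```

Printing the top words alongside the word cloud is a quick way to check that the picture reflects the actual counts.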

Conclusion

Using Scrapy, we were able to devise a method for scraping Amazon reviews with Python. There can be some roadblocks, however, as Amazon tends to block IPs if you scrape it too frequently. This can be a hindrance to your work. In such cases, make sure you rotate your IPs periodically and make less frequent requests to the Amazon server to avoid getting blocked. You can read more about it here.
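One way to make requests less frequent is through the project's settings.py. The fragment below is a sketch using Scrapy's built-in throttling settings; the numeric values are illustrative assumptions, not tuned recommendations.

```python
# settings.py (fragment): illustrative values, not tuned recommendations.

# Wait between consecutive requests to the same site, in seconds.
DOWNLOAD_DELAY = 5

# Vary each wait randomly between 0.5x and 1.5x of DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Send only one request at a time to each domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
```

Slower crawls take longer to finish, so it can also help to reduce the number of pages listed in start_urls while experimenting.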
Additionally, you can use proxy servers, which protect your home IP from being blocked while scraping Amazon reviews.
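A rough sketch of proxy rotation is a small downloader middleware that assigns a proxy to every outgoing request. The proxy addresses below are hypothetical placeholders and RandomProxyMiddleware is an illustrative name; the only Scrapy behaviour relied on is that a request's meta["proxy"] value selects the proxy used for that request.

```python
import random
from types import SimpleNamespace

# Hypothetical proxy endpoints; replace with proxies you are allowed to use.
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]


class RandomProxyMiddleware:
    """Downloader middleware sketch: send each request through a random proxy."""

    def process_request(self, request, spider):
        # Scrapy uses request.meta["proxy"] as the proxy for this request.
        request.meta["proxy"] = random.choice(PROXIES)
        # Returning None lets Scrapy continue processing the request.
        return None


# Stand-in for a scrapy.Request, just to exercise the method outside Scrapy.
fake_request = SimpleNamespace(meta={})
RandomProxyMiddleware().process_request(fake_request, spider=None)
print(fake_request.meta["proxy"])
```

To use it in the project, the class would go into middlewares.py and be registered in the DOWNLOADER_MIDDLEWARES setting, for example {"amazon_reviews_scraping.middlewares.RandomProxyMiddleware": 350}.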

Written by sandra-moraes | Data Scientist
Published by HackerNoon on 2019/11/20