How POST Requests with Python Make Web Scraping Easier

Written by otavioss | Published 2021/06/24
Tech Story Tags: python | web-scraping | requests | web-scraping-with-python | data-science | data-collection | python-tutorials | data-scraping

TLDR: When scraping a website with Python, it’s common to send GET requests to the server to retrieve its content. Sometimes, though, you need to send information to the site first, for example to log in. Selenium is a frequently used tool for such interactions, but it is slow and can be unstable. The alternative is to send a POST request containing the information the website needs, using the Requests library. In this article, we’ll see a brief introduction to the POST method and how it can be implemented to improve your web scraping routines.

When scraping a website with Python, it’s common to use the urllib or the Requests libraries to send GET requests to the server in order to receive its information.
However, you’ll eventually need to send some information to the website yourself before receiving the data you want, perhaps because you need to log in or otherwise interact with the page.
To execute such interactions, Selenium is a frequently used tool. However, it comes with some downsides: it’s a bit slow and can also be quite unstable at times. The alternative is to send a POST request containing the information the website needs, using the Requests library.
In fact, when compared to Requests, Selenium is a very slow approach, since it does the entire work of actually opening a browser and navigating through the websites you’ll collect data from. Depending on the problem, you’ll eventually need it, but in many other situations a POST request is your best option, which makes it an important tool for your web scraping toolbox.
In this article, we’ll see a brief introduction to the POST method and how it can be implemented to improve your web scraping routines.

Web Scraping 

Although POST requests are commonly used to interact with APIs, they are also useful to fill HTML forms on a website or to perform other actions automatically.
Being able to perform such tasks is an important skill for web scraping, as it’s not rare to have to interact with the web page before reaching the data you’re aiming to scrape.
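As a rough illustration of the difference (the endpoints below are made up), a request to an API usually sends its body as JSON, while filling a form sends it as form-encoded key/value pairs:
import requests

# Hypothetical endpoints, for illustration only
# Talking to an API: the body is typically sent as JSON
api_response = requests.post('https://api.example.com/items', json={'name': 'widget'})

# Filling an HTML form: the body is sent as form-encoded key/value pairs
form_response = requests.post('https://example.com/search', data={'query': 'web scraping'})

print(api_response.status_code, form_response.status_code)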

Identifying an HTML Form

Before you start sending information to the website, you first need to understand how it will receive that information. Let’s say the idea is to log in to your account. If the site receives the username and password through an HTML form (like the one we’ll inspect below), all you have to do is send the username and the password within your POST request.
But how do you identify and even see what the HTML form looks like? For this, we can go back to our old friend: the GET request. With a GET request and BeautifulSoup to parse the HTML, it’s easy to see all the HTML forms on the page and what each of them looks like.
This is a simple piece of code for this task:
import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
page = requests.get('http://website.com').text
soup = BeautifulSoup(page, 'html.parser')

# Find and print every form on the page
forms = soup.find_all('form')
for form in forms:
    print(form)
And this is our simple login form, which will be the output of the code above:
<form action="login.html" method="post"> 
User Name: <input name="username" type="text"/><br/> 
Password: <input name="password" type="text"/><br/> 
<input id="submit" type="submit" value="Submit"/>
</form>
In a form like this, the “action” attribute is where on the website you should send your request, and “username” and “password” are the names of the fields you want to fill. You can also notice that the type for these values is specified as text.
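If you’d rather not read the raw HTML yourself, you can also pull these details out of the parsed form programmatically. Here’s a minimal sketch that parses the login form above from a string, just for illustration:
from bs4 import BeautifulSoup

# The login form shown above, stored as a string for illustration
form_html = '''<form action="login.html" method="post">
User Name: <input name="username" type="text"/><br/>
Password: <input name="password" type="text"/><br/>
<input id="submit" type="submit" value="Submit"/>
</form>'''

form = BeautifulSoup(form_html, 'html.parser').find('form')
action = form.get('action')  # where to send the POST, e.g. 'login.html'
fields = [i.get('name') for i in form.find_all('input') if i.get('name')]
print(action, fields)  # login.html ['username', 'password']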

Submitting Your First POST 

Now it’s time to send your first POST request. A basic request will contain only two arguments: the URL that will receive the request and the data that you’re sending.
The data is usually a dictionary where the keys are the names of the fields you intend to fill, and the values are what you’re going to fill the fields with. The data can also be passed in other ways, but that’s a more complex approach that’s out of scope for this article.
The code is pretty simple. Actually, you can get it done with only two lines of code:
payload = {'username': 'user', 'password': '1234'}
r = requests.post('http://website.com/login.html', data=payload)
print(r.status_code)
The third line of code is just so you can see the status code of your request. You want to see a status code of 200, which means everything is OK.
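As a side note, if you’d rather get an exception than check the code manually, Requests can turn any error status into one:
payload = {'username': 'user', 'password': '1234'}
r = requests.post('http://website.com/login.html', data=payload)
r.raise_for_status()  # raises requests.exceptions.HTTPError for any 4xx or 5xx response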
We can now make this process more sophisticated by implementing the POST request we just created into a function. Here’s how it’ll work:
1. The post_request function will receive two arguments: the URL and the payload to send with the request.
2. Inside the function, we’ll use a try and an except clause to have our code ready to handle a possible error.
3. If the code doesn’t crash and we receive a response from the server, we’ll then check if this response is the one we’re expecting. If so, the function will return it.
4. If we get a different status code, nothing will be returned, and the status will be printed.
5. If the code raises an exception, we’ll want to see what happened, and so the function will print this exception.
And this is the code for all this:
def post_request(url, payload):
    try:
        r = requests.post(url, data=payload)
        # Return the response only if the request succeeded
        if r.status_code == 200:
            return r
        else:
            print(r.status_code)
    except Exception as e:
        # Print the exception so we can see what went wrong
        print(e)
Depending on the website, however, you’ll need to deal with other issues in order to actually perform a login. The good news is that the Requests library provides resources to deal with cookies, HTTP authentication, and more, so you’ll be covered. The goal here was just to use a common type of form as an example that’s easy to understand for someone who has never used a POST request before.
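For instance, many login flows set a session cookie that has to be carried into the requests that follow. A requests.Session handles that automatically; the sketch below uses the same placeholder URLs as before, plus a hypothetical profile page:
import requests

payload = {'username': 'user', 'password': '1234'}

# A Session keeps cookies (and other settings) between requests,
# so the login persists for the pages scraped afterwards
with requests.Session() as session:
    login = session.post('http://website.com/login.html', data=payload)
    if login.status_code == 200:
        # This request reuses the session cookie set during login
        profile = session.get('http://website.com/profile.html')
        print(profile.status_code)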

Final Considerations

Especially if you’re sending a lot of requests to a particular website, you might want to insert some random pauses in your code so you don’t overload the server. You may also want to use more try and except clauses throughout your code, not only in the post_request function, to make sure it’s prepared to handle any other exceptions it may find along the way.
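A simple way to add those pauses is to sleep for a random interval between submissions; the range below is arbitrary, and the loop just reuses the post_request function defined earlier:
import random
import time

# Placeholder list of (url, payload) pairs to submit
requests_to_send = [
    ('http://website.com/login.html', {'username': 'user', 'password': '1234'}),
]

for url, payload in requests_to_send:
    response = post_request(url, payload)  # the function defined earlier
    # Pause for a random interval so the server isn't flooded with requests
    time.sleep(random.uniform(1, 5))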
Of course, it’s also a good practice to take advantage of a proxy provider, such as Infatica, to make sure your code will keep running as long as there are requests left to submit and data to be collected, and that you and your connection are protected.
The idea of this article is to be only an introduction to POST requests and how they can be useful for collecting data on the web. We basically went through how to fill out a form automatically and even how to log in to a website, but there are other possibilities as well, such as ticking a checkbox or selecting items from a dropdown list, which could be the subject of an entirely new article.
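Just to give a rough idea of those cases (the field names here are made up), a ticked checkbox or a dropdown selection still ends up as an ordinary key/value pair in the payload:
# Hypothetical field names, for illustration only
payload = {
    'username': 'user',
    'password': '1234',
    'remember_me': 'on',  # a ticked checkbox usually submits its value (often 'on')
    'country': 'BR',      # a dropdown submits the value of the selected <option>
}
r = requests.post('http://website.com/login.html', data=payload)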
I hope you’ve enjoyed this and that it can be useful somehow. If you have a question, a suggestion, or just want to be in touch, feel free to reach out.

Written by otavioss | Economist and data scientist
Published by HackerNoon on 2021/06/24