Scraping Google Search Console Backlinks

Written by alexbobes | Published 2022/07/16
Tech Story Tags: python-programming | web-scraping | google-search-console | python-tutorials | beautiful-soup | data-scraping | scraping-using-python | website-scraping-tools

TL;DR: Python, Beautiful Soup, and pandas allow you to scrape GSC and pull the backlink data you need automatically. In order to scrape Google Search Console for backlink information, we need to emulate a normal user, which in effect means taking the information from the request header and transforming it into two dictionaries.

If you’re a webmaster or SEO specialist, you most likely need to do backlink audits regularly. There are situations when you are forced to find toxic backlinks and disavow them. However, it’s very hard to manually export and correlate all the backlink data from Google Search Console.

If the websites you’re working with are large, getting this data out of GSC involves a lot of clicking and exporting. It’s simply not doable by hand.

Google Search Console – Links Section

Here is where Python, Beautiful Soup, and Pandas come in — they will allow you to scrape GSC and pull the data you need automatically.

First things first:

  • Install the third-party packages bs4 (Beautiful Soup), requests, and pandas using pip, as shown below; the re and csv modules are part of Python’s standard library and need no installation.
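For example, a single command covers the third-party dependencies (the installation name for bs4 is beautifulsoup4):

pip install beautifulsoup4 requests pandas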

1. Emulate a user session

In order to scrape Google Search Console for backlink information, we need to emulate a normal user. We do this by simply going into your browser of choice, opening the Links section in GSC and selecting the Top Linking Sites section. Once here, we need to inspect the source code of the page by right-clicking and choosing Inspect.

In the developer tools, go to the Network tab and select the first request that appears with the type document. It should be a request for a URL of the following type: https://search.google.com/search-console/links?resource_id=sc-domain%3A{YourDomainName}

Click on the URL and look under Headers for the Request Headers section, as in the image below:

In order to emulate a normal session, we will need to add to our Python requests the request information that we see in the request header.

A few notes on this process:

You will notice that your request header will also contain cookie information. Python-wise, for the requests library, this information will be stored in a dictionary called cookies. The rest of the information will be stored in a dictionary named headers.

In effect, what we are doing is taking the information from the header and transforming it into two dictionaries as per the code below.

* replace [your-info] with your actual data

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import csv

headers = {
    "authority": "search.google.com",
    "method":"GET",
    "path":'"[your-info]"',
    "scheme":"https",  
    "accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-encoding":"gzip, deflate, br",
    "accept-language":"en-GB,en-US;q=0.9,en;q=0.8,ro;q=0.7",
    "cache-control":"no-cache",
    "pragma":"no-cache",
    "sec-ch-ua":"navigate",
    "sec-fetch-site":"same-origin",
    "sec-fetch-user":"?1",
    "upgrade-insecure-requests":"1",
    "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "x-client-data":"[your-info]",
    "sec-ch-ua":'" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    "sec-ch-ua-mobile": "?0",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate"
}
cookies = {
    "__Secure-1PAPISID":"[your-info]",
    "__Secure-1PSID":"[your-info]",
    "__Secure-3PAPISID":"[your-info]",
    "__Secure-3PSID":"[your-info]",
    "__Secure-3PSIDCC":"[your-info]",
    "1P_JAR":"[your-info]",
    "NID":"[your-info]",
    "APISID":"[your-info]",
    "CONSENT":"[your-info]",
    "HSID":"[your-info]",
    "SAPISID":"[your-info]",
    "SID":"[your-info]",
    "SIDCC":"[your-info]",
    "SSID":"[your-info]",
    "_ga":"[your-info]",
    "OTZ":"[your-info]",
    "OGPC":"[your-info]"
}

The information displayed in your request header might be different in your case; don’t worry about the differences as long as you can create the two dictionaries.

Once this is done, execute the cell with the header and cookies information as it’s time to start working on the first part of the actual script — collecting a list of referring domains that link back to your website.

* replace [your-domain] with your actual domain

genericURL = "https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target=&domain="
req = requests.get(genericURL, headers=headers, cookies=cookies)
soup = BeautifulSoup(req.content, 'html.parser')

The above URL is in effect the URL in the Top linking sites section, so please ensure that you update it accordingly.
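If your property is a domain property, the resource_id is the URL-encoded sc-domain: value shown earlier. A minimal sketch of building the URL, where example.com is a placeholder for your verified property:

from urllib.parse import quote

# Build the Top linking sites drilldown URL for a domain property
# ("example.com" is a placeholder for your verified property)
resource_id = quote("sc-domain:example.com", safe="")
genericURL = ("https://search.google.com/search-console/links/drilldown"
              f"?resource_id={resource_id}&type=EXTERNAL&target=&domain=")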

You can test that you are bypassing the login by running the following code:

g_data = soup.find_all("div", {"class": "OOHai"})
for example in g_data:
    print(example)
    break

The output of the above code should be a div with the class “OOHai”. If you see output like this, you can continue.
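If nothing is printed, the login was probably not bypassed. A quick way to check, assuming the req and soup objects from above, is to look at the response status and the number of matched divs:

# Sanity check: if no "OOHai" divs are found, the session cookies have
# probably expired and need to be copied again from the browser
print(req.status_code)
print(len(soup.find_all("div", {"class": "OOHai"})))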

2. Create a List of Referring Domains

The next step in this process will be to leverage Python and Pandas to return a list with all of the referring domains that point at your domain.

g_data = soup.find_all("div", {"class": "OOHai"})

# Extract the domain name from each "OOHai" div and build a DataFrame
finalList = []
for externalDomain in g_data:
    out = re.search(r'<div class="OOHai">(.*?(?=<))', str(externalDomain))
    if out:
        finalList.append([out.group(1)])

dfdomains = pd.DataFrame(finalList, columns=["External Domains"])

domainsList = dfdomains["External Domains"].to_list()

The above code runs through the entire HTML, identifies all of the divs in the “OOHai” class, extracts the domain name from each one, and loads the results into a pandas DataFrame (dfdomains) and a plain list (domainsList) for the next step.
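As a quick sanity check, you can preview what was collected before moving on, for example:

# Preview the referring domains collected above
print(f"Found {len(domainsList)} referring domains")
print(dfdomains.head())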

3. Extract Backlink information for each Domain

Next, we will extract the backlink information for all domains: the top sites linking to each page, and also the top linking pages (practically the 3rd level from GSC, only the first value).

def extractBacklinks():
    for domain in domainsList[:]:
        url = f"https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target={domain}&domain="

        request = requests.get(url, headers=headers, cookies=cookies)
        soup = BeautifulSoup(request.content, 'html.parser')

        for row in soup.find_all("div", {"class": "OOHai"}):
            output = row.text
            # Strip the stray characters Beautiful Soup returns (the replaced
            # character is not visible here)
            stripped_output = output.replace("", "")

            # Assumes GSC reports the referring domain with an https:// prefix
            domain_stripped = str(domain.split('https://')[1].split('/')[0])

            print("---------")
            print("Domain: " + domain)
            print("---------")

            # Drill one level deeper: top linking pages for this domain/target pair
            url_secondary = f"https://search.google.com/search-console/links/drilldown?resource_id=[your-domain]&type=EXTERNAL&target={domain}&domain={stripped_output}"
            request_secondary = requests.get(url_secondary, headers=headers, cookies=cookies)
            soup_secondary = BeautifulSoup(request_secondary.content, 'html.parser')

            stripped_output_last = ""  # default if no third-level result is returned
            for row_secondary in soup_secondary.find_all("div", {"class": "OOHai"}):
                output_last = row_secondary.text
                stripped_output_last = output_last.replace("", "")
                break

            # Append one row per result to a per-domain .csv file
            with open(f"{domain_stripped}.csv", 'a') as file:
                writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
                writer.writerow([domain, stripped_output, stripped_output_last])

extractBacklinks()

Because Beautiful Soup returns some strange characters, we strip them using Python’s .replace() method.

All the URLs are written to .csv files, one per referring domain, located in the same directory as the script.
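If you later want to analyse everything in one place, a minimal sketch for combining the per-domain files back into a single pandas DataFrame could look like this (the column names are illustrative, matching the three values written by extractBacklinks()):

import glob
import pandas as pd

# Combine the per-domain .csv files produced by extractBacklinks()
# into a single DataFrame (column names are illustrative)
frames = [
    pd.read_csv(path, header=None,
                names=["Referring Domain", "Linked Target", "Top Linking Page"])
    for path in glob.glob("*.csv")
]
all_backlinks = pd.concat(frames, ignore_index=True)
print(all_backlinks.head())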

Have fun!

