PHP Web Scraping Using Goutte

Written by mubin | Published 2020/08/17
Tech Story Tags: php | goutte | web-scraping | backend | php7 | data-scraping | html

TLDRvia the TL;DR App

When you talk about web scraping, PHP is the last thing most people think about.
But this tool based on Symphony framework is about to change your mind.
According to it’s documentation on https://goutte.readthedocs.io/en/latest/:
“Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.”
This also means you can login to a website, submit a form, upload a file and whatnot all using this library siting at your server (or even on your local system) and still do all the things you would normally do.
To scrap any website you basically need to first visit the website and do recon.
And how do you do that? Simple — Inspector tool. Most modern day web browsers have “inspect” tool when you right click a web page (For Safari, you need to enable the option “Show Develop menu in menu bar” under the Advanced tab of Preference).
Once you select inspect tool, the browser’s developer window (or console window) will open up docked on one side of window. The first tab is the “Elements” tab. Inside this tab you get to see the elements and their styling.
This is very important to see the “name” attribute of the elements, to see if there are any hidden elements, find the form or its action attribute etc., you can use this “inspect” tool.
This is recon.
Ok, now let’s see this in action. Suppose your objective is to find the list of repositories that a user has created on their GitHub account (you could do more, but let’s just play nice ;-) ).
Let’s create a folder in your localhost directory by the name “crawler” (name is irrelevant).
Then head over to the library’s repo page on GitHub to download from https://github.com/FriendsOfPHP/Goutte.git or
After unzipping (if you downloaded) or git cloning, you’ll see a directory named “Goutte”.
You’ll be needing composer installed as well to make this library work. Download it from https://getcomposer.org/download/
Now execute this command to add dependency libraries — composer require fabpot/goutte
Give this command a minute or so finish. After successfully executing this command vendor directory (containing dependencies) and composer file is added.
Now we are ready. Let’s create an interface for the user. Nothing fancy, simple HTML login. Create an index.php file with the following code.
<form method="POST">
<div class=“form-group">
<label for=“git_email">Email address</label>
<input type="email" class="form-control" id=“git_email"  name=“git_email"  placeholder="Enter email">
</div>
<div class="form-group">
<label for=“git_pwd">Password</label>
<input type="password" class="form-control" id="git_pwd" name="git_pwd" placeholder="Password">
</div>  <button type="submit" class="btn btn-primary">Submit</button>
</form>
Then when the user enters the credential (I’m not doing any client validation — I leave that to you ), the page gets submitted to itself.
So, here our logic begins. Check if the page is getting submitted with the data.
if(isset($_POST["“git_email"]) && isset($_POST["git_pwd"]) && !empty($_POST["“git_email"]) && !empty($_POST["git_pwd"])){
//check if the email and password params are there
}
Then include the library and initialise it.
require_once("vendor/autoload.php");
require_once(“Goutte/Goutte/Client.php");
use Goutte\Client;
$client = new Client();
If we inspect the login form at https://gitlab.com/users/sign_in we see the following:
It shows the name attribute of the email and password fields
//initialise the crawler with request type get and the login url
$crawler = $client->request('GET', 'https://github.com/login');
Find the form, set the post params and submit form
//find the form
$form = $crawler->selectButton('Sign in')->form();
//set the post params
$form->setValues(['login' => $_POST["git_email"], 'password' => $_POST[“git_pwd"]]);
// submit the form
$crawler = $client->submit($form);
Now let’s check if login was successful or not. The way to check it is if the response/redirected page has meta tag having name “octolytics-actor-login”. This meta tag contains the username.
//variable to store the logged in username
$username = “";
//Find all the meta tag
$crawler->filter('meta')->each(function ($node) {
global $username;//This is important, since you need to set the variable
//Stop at the meta tag that has “octolytics-actor-login" as it’s name
if(trim($node->attr("name")) == "octolytics-actor-login"){
//set the variable value and return
$username = ($node->attr("content"));
return;
}
});
If the variable value is blank then we can inform the user that the credentials they entered is invalid.
if($username == “”){
//give error message
echo “<center>The credentials entered are invalid</center>”;
}
Finally, let’s navigate to the url that contains the list of repositories associated with the username
$crawler = $client->request('GET', 'https://github.com/'.$username.'?tab=repositories');
Once we are on the repo page we need to loop through all the repository names (when you inspect any one repo link in inspector tool you’ll find that they are inside list tag with class “source” ) displayed as an anchor link:
//loop through all “a” tag contained inside list element with “source” class
$crawler->filter('li.source a')->each(function ($node) {
//This additional check is to determine we only get repo name
if(is_numeric($node->text()) === false){
echo $node->text();
echo "<br/>";
}
});
And that’s it, we are printing all the repository belonging to the user who tried to login.
This is the first article I’ve written ever. If you like it, I’d appreciate your claps!
Happy Hacking :)

Written by mubin | A developer who keeps trying new things
Published by HackerNoon on 2020/08/17