Web Scraping Use Cases for Technical Marketers

Written by dancarmel | Published 2019/09/02
Tech Story Tags: marketing | growth-hacking | web-scraping | programming | javascript | nodejs | software-development | coding

TLDR: Compared to APIs, scraping is a less stable and less reliable way to pull data, but from a technical marketer's perspective, scraping and automation libraries are extremely important to learn. Here's an introduction to two of the most widely used web scraping libraries in Node.js, Puppeteer and Cheerio, both of which let you use Node for automation and scraping in ways that marketers usually attribute to Python.

From a technical marketer's perspective, scraping and automation libraries are extremely important to learn. Here's an introduction to two of the most widely used web scraping libraries in Node.js.
When I talk to developers, they always find it weird that I love web scraping, mainly for two reasons:
* Scraping is an unstable and unreliable solution for pulling data from data sources, compared to APIs.
* In terms of code, building scrapers means writing code that doesn't necessarily follow best practices such as reusability; scraper code is usually tightly coupled to a specific use case.
But the thing is, when a marketer starts learning code, a ton of scraping use cases immediately come to mind. Much of the work that marketers dream of being able to automate can’t be achieved with official APIs.
Getting rid of some of the manual work of extracting information from the web is very tempting. I would argue that for a marketer, scraping and automation are among the most common use-cases for coding skills.
Recently I had the chance to work with Puppeteer and Cheerio, and switch between the two, so here’s a marketer’s perspective on when to use each of them.

Puppeteer

Puppeteer is an open-source Node library developed by Google. It is essentially a way to launch Chrome (or Chromium) from Node and automate actions in the browser.
The main use case of Puppeteer is automation.
It’s not always simple to scrape data. Take, for example, my Product Hunt scraper, Hunt.
In Product Hunt, upvoter information is not readily available in the page’s HTML when you first load it. Before you can access the full upvoter list, you have to:
* Click the upvoters panel.
* Scroll all the way to the end of the list.
To do so, you need a tool that can automate actions in the browser – that's what Puppeteer is for. Use Puppeteer when you need to log in to get data, or when you need to perform automated actions in the browser.
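To make this concrete, here is a minimal sketch of that kind of flow with Puppeteer. The URL and CSS selectors below are placeholders for illustration, not Product Hunt's actual markup.
const puppeteer = require("puppeteer");

// A rough sketch: open a page, click a panel, and scroll a list until it stops growing.
// The URL and selectors are placeholders, not Product Hunt's real markup.
const scrapeUpvoters = async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://www.producthunt.com/posts/some-product", {
    waitUntil: "networkidle2"
  });

  // open the upvoters panel (placeholder selector)
  await page.click(".upvoters-panel-button");
  await page.waitForSelector(".upvoters-list");

  // keep scrolling the list until its height stops changing
  let previousHeight = 0;
  let currentHeight = await page.evaluate(
    () => document.querySelector(".upvoters-list").scrollHeight
  );
  while (currentHeight > previousHeight) {
    previousHeight = currentHeight;
    await page.evaluate(() => {
      const list = document.querySelector(".upvoters-list");
      list.scrollTo(0, list.scrollHeight);
    });
    await new Promise(resolve => setTimeout(resolve, 1000)); // give new items time to load
    currentHeight = await page.evaluate(
      () => document.querySelector(".upvoters-list").scrollHeight
    );
  }

  // once everything is loaded, read the upvoter names from the DOM
  const upvoters = await page.evaluate(() =>
    Array.from(document.querySelectorAll(".upvoters-list .upvoter-name")).map(
      el => el.textContent.trim()
    )
  );

  await browser.close();
  return upvoters;
};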

Cheerio

Cheerio is another npm library, often described as "jQuery for Node". It allows you to scrape data with a lightweight, simple and fast framework. Cheerio works on raw HTML that you feed into it, much like Python's Beautiful Soup, if you're familiar with that. That means that if the data you need can be fetched directly from a URL as HTML, it is very simple to work with in Cheerio.
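To give a sense of what "jQuery for Node" means in practice, here's a minimal sketch (the HTML string and class names are made up for illustration):
const cheerio = require("cheerio");

// load any HTML string and query it with familiar jQuery-style selectors
const html = "<ul><li class='name'>Ada</li><li class='name'>Grace</li></ul>";
const $ = cheerio.load(html);

$(".name").each((i, el) => {
  console.log($(el).text()); // prints "Ada", then "Grace"
});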
Below is code that can be used to extract information from Twitter about a list of users (by Twitter tag).
const axios = require("axios");
const cheerio = require("cheerio");

// This function uses axios to get the HTML of a given URL; it's also possible to do the same using fetch
const getHtml = async url => {
  const response = await axios.get(url);
  return response.data;
};

// this is a node module that uses the above function and Cheerio to extract Twitter data from a list of user tags (used in the backend of Hunt)

module.exports = async function run(userList) {
  const enrichedUsers = [];

  for (const user of userList) {
    try {
      const $ = cheerio.load(await getHtml(`https://twitter.com/${user.tag}`));

      // here we extract the relevant information from each Twitter page - followers number, description, and Twitter URL

      let followers = $(
        ".ProfileNav-item.ProfileNav-item--followers > a > span.ProfileNav-value"
      ).text();
      const description = $(".ProfileHeaderCard-bio").text();
      const url = `https://twitter.com/${user.tag}`;

      // create a user object with existing info and the new info we've extracted from twitter
      const enrichedUser = {
        tag: user.tag,
        name: user.name,
        profile: user.profile,
        twitterDescription: description,
        twitterFollowers: followers,
        pageUrl: url,
        messagedAndFollowed: false
      };
      enrichedUsers.push(enrichedUser); // push the new user object into the enrichedUsers array
    } catch (e) {
      continue; // this is not a good way to handle errors, but I didn't want to get into error handling here and it works for the sake of this tutorial.
    }
  }
  return enrichedUsers;
};
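To use the module above, you could require it from another file and pass it a list of users. The filename twitterScraper.js and the sample user data are just assumptions for this example.
// Assuming the module above was saved as twitterScraper.js
const enrichTwitterData = require("./twitterScraper");

const users = [
  { tag: "producthunt", name: "Product Hunt", profile: "https://www.producthunt.com" }
];

enrichTwitterData(users)
  .then(enrichedUsers => console.log(enrichedUsers))
  .catch(console.error);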

Cheerio vs. Puppeteer

The two libraries have different use cases but are often seen as the two main options for JS scraping. If I had to choose, I would argue that when there's no need for Puppeteer's automation capabilities, it is more efficient and better practice to use Cheerio.
While working on Hunt, I built two scrapers: one for Product Hunt and one for Twitter. I initially built both with Puppeteer and noticed a lot of performance issues when trying to scrape a large list of users from Twitter (including memory errors on the Heroku server): it took Puppeteer about 10 minutes to finish scraping 1,000 upvoters. I then rewrote the Twitter bot in Cheerio (as described above) and saw a performance boost of around 5x or more: the new code took about 2 minutes (or less) to finish scraping.

Summary

Both tools allow you to use Node for automation and scraping in ways that marketers usually attribute to Python. They are another example of how learning JavaScript might be a pain in the ass, but can eventually give you a deeper and more holistic knowledge of web development.
As a marketer, you can probably think of many ways to use both, and I recommend you go for it. If you're learning something new, you might as well create something useful!

Written by dancarmel | Marketing Strategist and a Hobbyist Developer. VP of Strategy @ OMD Israel
Published by HackerNoon on 2019/09/02