Web Scraping with Javascript and Node.js

Written by anderrv | Published 2021/10/01

TLDR: Using Node v12, we will build a simple scraper and crawler from scratch using Javascript. We'll use scrapeme.live as an example, a fake website prepared for scraping. We use Axios to get the HTML. Then we will pass the HTML to cheerio and query it as we would in a browser environment. We'll query for the two things we want right now: paginator links and products. For cases when we want to run JS, Playwright will do. Once everything is working fine, we will scale it by launching crawls async.

Javascript and web scraping are both on the rise. We will combine them to build a simple scraper and crawler from scratch using Javascript in Node.js.

Avoiding blocks is an essential part of website scraping, so we will also add some features to help in that regard. Finally, we will parallelize the tasks to go faster thanks to Node's event loop.

Prerequisites

For the code to work, you will need Node (or nvm) and npm installed. Some systems have it pre-installed. After that, install all the necessary libraries by running npm install.

npm install axios cheerio playwright 

Introduction

We are using Node v12, but you can always check the compatibility of each feature.

Axios is a "promise based HTTP client" that we will use to get the HTML from a URL. It allows several options such as headers and proxies, which we will cover later. If you use TypeScript, they "include TypeScript definitions and a type guard for Axios errors."

Cheerio is a "fast, flexible & lean implementation of core jQuery." It lets us find nodes with selectors, get text or attributes, and many other things. We will pass the HTML to cheerio and then query it as we would in a browser environment.

Playwright "is a Node.js library to automate Chromium, Firefox and WebKit with a single API." When Axios is not enough, we will get the HTML using a headless browser to execute Javascript and wait for the async content to load.

Scraping the Basics

The first thing we need is the HTML. We installed Axios for that, and its usage is straightforward. We'll use scrapeme.live as an example, a fake website prepared for scraping.

const axios = require('axios');
axios.get('https://scrapeme.live/shop/')
    .then(({ data }) => console.log(data));

Nice! Then, using cheerio, we can query for the two things we want right now: paginator links and products. To know how to do that, we will look at the page with Chrome DevTools open. All modern browsers offer developer tools such as these. Pick your favorite.

We marked the interesting parts in red, but you can go on your own and try it yourselves. In this case, all the CSS selectors are straightforward and do not need nesting. Check the guide if you are looking for a different outcome or cannot select it. You can also use DevTools to get the selector.

On the Elements tab, right-click on the node ➡ Copy ➡ Copy selector. But the outcome is usually very coupled to the HTML, as in this case: #main > div:nth-child(2) > nav > ul > li:nth-child(2) > a. This approach might be a problem in the future because it will stop working after any minimal change. Besides, it will only capture one of the pagination links, not all of them.

We could capture all the links on the page and then filter them by content. If we were to write a full-site crawler, that would be the right approach. In our case, we only want the pagination links. Using the provided class, the selector .page-numbers a captures all of them, and we then extract the URLs (hrefs) from those nodes. The selector matches every link node with an ancestor containing the class page-numbers.

const axios = require('axios');
const cheerio = require('cheerio');

const extractLinks = $ => [
  ...new Set(
    $('.page-numbers a') // Select pagination links
      .map((_, a) => $(a).attr('href')) // Extract the href (url) from each link
      .toArray() // Convert cheerio object to array
  ),
];


axios.get('https://scrapeme.live/shop/').then(({ data }) => {
  const $ = cheerio.load(data); // Initialize cheerio
  const links = extractLinks($);

  console.log(links);
  // ['https://scrapeme.live/shop/page/2/', 'https://scrapeme.live/shop/page/3/', ... ]
});
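For the full-site approach mentioned above (capture every link, then filter by content), a minimal sketch could look like the following. It relies only on Node's built-in URL class. The helper name filterPaginationLinks, the sample hrefs, and the /shop/page/N/ pattern are assumptions made for illustration; they are not part of the original code.

```javascript
// Hypothetical helper: keep only same-site pagination URLs from a list of hrefs.
// The /shop/page/<number>/ pattern is an assumption based on the example URLs above.
const filterPaginationLinks = (hrefs, baseUrl) => {
  const base = new URL(baseUrl);
  const urls = hrefs
    .map(href => {
      try {
        return new URL(href, base); // Resolves relative hrefs against the base URL
      } catch (error) {
        return null; // Ignore malformed hrefs
      }
    })
    .filter(url => url !== null)
    .filter(url => url.origin === base.origin) // Drop external links
    .filter(url => /^\/shop\/page\/\d+\/?$/.test(url.pathname)) // Keep pagination only
    .map(url => url.href);

  return [...new Set(urls)]; // Remove duplicates, keeping first-seen order
};

// Sample input (made up for the example): duplicates, a relative link,
// a non-pagination page, and an external site
const hrefs = [
  'https://scrapeme.live/shop/page/2/',
  '/shop/page/3/',
  'https://scrapeme.live/shop/page/2/',
  'https://scrapeme.live/shop/some-product/',
  'https://example.com/shop/page/4/',
];

console.log(filterPaginationLinks(hrefs, 'https://scrapeme.live/shop/'));
// [ 'https://scrapeme.live/shop/page/2/', 'https://scrapeme.live/shop/page/3/' ]
```

With cheerio, the hrefs array would come from selecting every a node instead of only the ones under .page-numbers.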

As for the products (Pokémon in this case), we will get id, name, and price. Check the image below for details on selectors, or try again on your own. We will only log the content for now. Check the final code for adding them to an array.

As you can see above, all the products contain the class product, which makes our job easier. And for each of them, the h2 tag and price node hold the content we want. As for the product ID, we need to match an attribute instead of a class or node type. That can be done using the syntax node[attribute="value"]. We are looking only for the node with the attribute, so there is no need to match it to any particular value.

const extractContent = $ =>
  $('.product')
    .map((_, product) => {
      const $product = $(product);

      return {
        id: $product.find('a[data-product_id]').attr('data-product_id'),
        title: $product.find('h2').text(),
        price: $product.find('.price').text(),
      };
    })
    .toArray();

// ...

const content = extractContent($);
console.log(content);
// [{ id: '759', title: 'Bulbasaur', price: '£63.00' }, ...]

There is no error handling, as you can see above. We will omit it for brevity in the snippets, but take it into account in real life. Most of the time, returning a default value (e.g., an empty array) should do the trick.
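As a rough idea of what "returning a default value" could look like, here is a minimal sketch. The wrapper name withFallback and the failing brokenExtractor are hypothetical and exist only for this example; the same wrapper could be placed around extractContent or an Axios call.

```javascript
// Hypothetical helper: wrap a sync or async function so that it returns
// a default value instead of throwing
const withFallback = (fn, fallback) => async (...args) => {
  try {
    return await fn(...args);
  } catch (error) {
    console.error('Falling back to default value:', error.message);
    return fallback; // Same reference on every failure, so do not mutate it
  }
};

// Stand-in for an extractor or request that fails
const brokenExtractor = () => {
  throw new Error('selector not found');
};

const safeExtract = withFallback(brokenExtractor, []);

safeExtract().then(content => console.log(content)); // []
```

Note that the wrapped function always returns a promise, even when the original one is synchronous, so callers need to await it.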

Following Links

Now that we have some pagination links, we should also visit them. If you run the whole code, you'll see that they appear twice - there are two pagination bars.

We will add two sets to keep track of what we already visited and the newly discovered links. We are using sets instead of arrays to avoid dealing with duplicates, but either one would work. To avoid crawling too much, we'll also include a maximum.

const maxVisits = 5;
const visited = new Set();
const toVisit = new Set();
toVisit.add('https://scrapeme.live/shop/page/1/'); // Add initial URL

For the next part, we will use async/await to avoid callbacks and nesting. An async function is an alternative to writing promise-based functions as chains. In this case, the Axios call will remain asynchronous. It might take around 1 second per page, but we write the code sequentially, with no need for callbacks.

There is a small gotcha with this: await is only valid inside an async function. That forces us to wrap the initial code inside a function, concretely in an IIFE (Immediately Invoked Function Expression). The syntax is a bit weird: it creates a function and then calls it immediately.

const crawl = async url => {
  visited.add(url);
  const { data } = await axios.get(url);
  const $ = cheerio.load(data);
  const content = extractContent($);
  const links = extractLinks($);
  links
    .filter(link => !visited.has(link)) // Filter out already visited links
    .forEach(link => toVisit.add(link));
};

(async () => { // IIFE
  // Loop over a set's values
  for (const next of toVisit.values()) {
    if (visited.size >= maxVisits) {
      break;
    }

    toVisit.delete(next);
    await crawl(next);
  }

  console.log(visited);
  // Set { 'https://scrapeme.live/shop/page/1/', '.../2/', ... }
  console.log(toVisit);
  // Set { 'https://scrapeme.live/shop/page/47/', '.../48/', ... }
})(); // The final set of parentheses will call the function
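The loop above relies on a detail of Javascript sets: values added while iterating are still visited by the same for...of loop. The sketch below reproduces the same logic without any network access, using a made-up in-memory site (fakeSite) and a hypothetical crawlAll function, so the behavior is easy to check in isolation.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on that page
const fakeSite = {
  '/page/1/': ['/page/2/', '/page/3/'],
  '/page/2/': ['/page/1/', '/page/3/', '/page/4/'],
  '/page/3/': ['/page/4/'],
  '/page/4/': [],
};

const crawlAll = (start, maxVisits) => {
  const visited = new Set();
  const toVisit = new Set([start]);

  // Links added to toVisit inside the loop are picked up by this same loop
  for (const next of toVisit) {
    if (visited.size >= maxVisits) {
      break;
    }

    toVisit.delete(next);
    visited.add(next);
    (fakeSite[next] || [])
      .filter(link => !visited.has(link)) // Filter out already visited links
      .forEach(link => toVisit.add(link)); // The set ignores duplicates
  }

  return { visited: [...visited], pending: [...toVisit] };
};

console.log(crawlAll('/page/1/', 3));
// { visited: [ '/page/1/', '/page/2/', '/page/3/' ], pending: [ '/page/4/' ] }
```

Raising maxVisits above the number of pages makes the loop stop on its own once toVisit runs out of new values.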

Avoid Blocks

As said before, we need mechanisms to avoid blocks, captchas, login walls, and several other defensive techniques. It is complicated to prevent them 100% of the time. But we can achieve a high success rate with simple efforts. We will apply two tactics: adding proxies and full-set headers.

Free proxies exist, although we do not recommend them. They might work for testing but are not reliable. We can use some of those for testing, as we'll see in some examples.

Note that these free proxies might not work for you. They are short-lived.

Paid proxy services, on the other hand, offer IP Rotation. Meaning that our service will work the same, but the target website will see a different IP. In some cases, they rotate for every request or every few minutes. In any case, they are much harder to ban. And when it happens, we'll get a new IP after a short time.

We will use httpbin for testing. It offers several endpoints that will respond with headers, IP addresses, and many more.

const axios = require('axios');

const proxy = {
  protocol: 'http',
  host: '202.212.123.44', // Free proxy from the list
  port: 80,
};

(async () => {
  const { data } = await axios.get('https://httpbin.org/ip', { proxy });

  console.log(data);
  // { origin: '202.212.123.44' }
})();
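Since free proxies fail often, one possible mitigation is to try several of them in order until one answers. The following is only a sketch under that assumption: requestWithProxies is a hypothetical helper, and the request is simulated with fakeRequest so the example runs without network access.

```javascript
// Hypothetical helper: try each proxy in order until one request succeeds.
// `request` is any async function that receives a proxy and returns the response data.
const requestWithProxies = async (request, proxies) => {
  let lastError = new Error('No proxies provided');
  for (const proxy of proxies) {
    try {
      return await request(proxy);
    } catch (error) {
      lastError = error; // Remember the failure and move on to the next proxy
    }
  }
  throw lastError; // Every proxy failed
};

// With Axios, the request function could look roughly like this (not executed here):
// proxy => axios.get('https://httpbin.org/ip', { proxy, timeout: 5000 }).then(res => res.data)

const proxies = [
  { protocol: 'http', host: '202.212.123.44', port: 80 },
  { protocol: 'http', host: '91.216.164.251', port: 80 },
];

// Simulated request: the first proxy is "dead", the second one answers
const fakeRequest = async proxy => {
  if (proxy.host === '202.212.123.44') {
    throw new Error('connect ETIMEDOUT');
  }
  return { origin: proxy.host };
};

requestWithProxies(fakeRequest, proxies).then(data => console.log(data));
// { origin: '91.216.164.251' }
```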

The next step would be to check our request headers. The best-known one is User-Agent (UA for short), but there are many more. Many software tools send their own, for example, Axios (axios/0.21.1). In general, it is good practice to send a realistic set of headers along with the UA. That means we need a real-world set of headers because not all browsers and versions use the same ones. We include two in the snippet: Chrome 92 and Firefox 90 on a Linux machine.

const axios = require('axios');

// Helper function to get a random item from an array
const sample = array => array[Math.floor(Math.random() * array.length)];

const headers = [
  {
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Sec-Ch-Ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
    'Sec-Ch-Ua-Mobile': '?0',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
  },
  {
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.5',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
  },
];

(async () => {
  const { data } = await axios.get('https://httpbin.org/anything', { headers: sample(headers) });

  console.log(data);
  // { 'User-Agent': '...Chrome/92...', ... }
})();
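Both tactics can be combined per request. The sketch below picks one proxy and one complete header set at random, instead of mixing individual headers from different browsers. The helper buildRequestConfig is hypothetical, and the header sets are shortened here to keep the example small.

```javascript
const sample = array => array[Math.floor(Math.random() * array.length)];

const proxies = [
  { protocol: 'http', host: '202.212.123.44', port: 80 },
  { protocol: 'http', host: '91.216.164.251', port: 80 },
];

// Shortened header sets; in practice, use the complete ones from the snippet above
const headerSets = [
  {
    'Accept-Language': 'en-US,en;q=0.9',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
  },
  {
    'Accept-Language': 'en-US,en;q=0.5',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0',
  },
];

// Hypothetical helper: build the options object for a single axios.get call
const buildRequestConfig = (proxyList, headerList) => ({
  proxy: sample(proxyList),
  headers: { ...sample(headerList) }, // Copy so callers can tweak it safely
});

const config = buildRequestConfig(proxies, headerSets);
console.log(config.proxy.host, config.headers['User-Agent']);

// Usage with Axios would be: axios.get(url, buildRequestConfig(proxies, headerSets))
```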

Headless Browsers

Until now, we fetched every page with axios.get, which can be inadequate in some cases. Say we need Javascript to load and execute, or to interact with the browser in some way (via mouse or keyboard). Avoiding headless browsers is preferable for performance reasons, but sometimes there is no other choice. Selenium, Puppeteer, and Playwright are the best-known and most widely used libraries. The snippet below shows only the User-Agent, but since it is a real browser, the headers will include the entire set (Accept, Accept-Encoding, etc.).

const playwright = require('playwright');

(async () => {
  for (const browserType of ['chromium', 'firefox']) { // 'webkit' is also supported, but there is a problem on Linux
    const browser = await playwright[browserType].launch();
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('https://httpbin.org/headers');
    console.log(await page.locator('pre').textContent());
    await browser.close();
  }
})();

// "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/94.0.4595.0 Safari/537.36",
// "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0",

This approach comes with its own problem: take a look at the User-Agents. The Chromium one includes "HeadlessChrome," which will tell the target website, well, that it is a headless browser. They might act upon that.

As with Axios, we can provide extra headers, proxies, and many other options to customize every request. That is an excellent way to hide our "HeadlessChrome" User-Agent. And since this is a real browser, we can intercept requests, block others (like CSS files or images), take screenshots or videos, and more.

const playwright = require('playwright');

(async () => {
  const browser = await playwright.firefox.launch({
    proxy: { server: 'http://91.216.164.251:80' }, // Another free proxy from the list
  });
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.setExtraHTTPHeaders({ referrer: 'https://news.ycombinator.com/' });
  await page.goto('http://httpbin.org/anything');
  console.log(await page.locator('pre').textContent()); // Print the complete response
  await browser.close();
})();

// "Referrer": "https://news.ycombinator.com/"
// "origin": "91.216.164.251"
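To illustrate the other options mentioned above (overriding the User-Agent and blocking CSS files or images), here is a minimal sketch. The names BLOCKED_TYPES, blockHeavyResources, and newCustomPage are our own, and the list of blocked resource types is an assumption. The functions receive the Playwright objects as parameters, so nothing is launched by the snippet itself.

```javascript
// Resource types we assume are not needed to read the HTML content
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);

// Route handler: abort requests for blocked resource types, let the rest through
const blockHeavyResources = route =>
  BLOCKED_TYPES.has(route.request().resourceType()) ? route.abort() : route.continue();

// Hypothetical wiring: `browser` is expected to be a launched Playwright browser
const newCustomPage = async (browser, userAgent) => {
  const context = await browser.newContext({ userAgent }); // Overrides the default UA
  const page = await context.newPage();
  await page.route('**/*', blockHeavyResources); // Intercept every request
  return page;
};

// Usage (not executed here):
// const browser = await playwright.chromium.launch();
// const page = await newCustomPage(browser, '<a User-Agent without "HeadlessChrome">');
// await page.goto('https://httpbin.org/headers');
```

Blocking resources usually makes pages load faster, but some sites might need their stylesheets or scripts to render the content, so the list should be adjusted per target.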

Now we can separate getting the HTML into a couple of functions, one using Playwright and the other Axios. We would then need a way to select which one is appropriate for the case at hand. For now, it is hardcoded. The output, by the way, is the same, but Axios is considerably faster.

const playwright = require('playwright');
const axios = require('axios');
const cheerio = require('cheerio');

const getHtmlPlaywright = async url => {
  const browser = await playwright.firefox.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto(url);
  const html = await page.content();
  await browser.close();

  return html;
};

const getHtmlAxios = async url => {
  const { data } = await axios.get(url);

  return data;
};

(async () => {
  const html = await getHtmlPlaywright('https://scrapeme.live/shop/page/1/');
  const $ = cheerio.load(html);
  const content = extractContent($);
  console.log('getHtmlPlaywright', content);
})();

(async () => {
  const html = await getHtmlAxios('https://scrapeme.live/shop/page/1/');
  const $ = cheerio.load(html);
  const content = extractContent($);
  console.log('getHtmlAxios', content);
})();
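One possible way to select between the two, instead of hardcoding it, is to try the fast option first and fall back to the headless browser only when the HTML lacks what we need. This is just a sketch of that idea: createGetHtml and looksComplete are hypothetical names, and the fetchers are simulated so the example runs without Axios or Playwright.

```javascript
// Hypothetical strategy: try the fast fetcher first and fall back to the headless
// browser only when the request fails or the HTML does not contain what we need
const createGetHtml = ({ fast, headless, looksComplete }) => async url => {
  try {
    const html = await fast(url);
    if (looksComplete(html)) {
      return html;
    }
  } catch (error) {
    // Ignore the error and fall through to the headless browser
  }

  return headless(url);
};

// Simulated fetchers: the "fast" one returns a page whose content needs Javascript
const fakeAxios = async () => '<div id="app"></div>';
const fakePlaywright = async () => '<ul><li class="product">Bulbasaur</li></ul>';

const getHtml = createGetHtml({
  fast: fakeAxios,
  headless: fakePlaywright,
  looksComplete: html => html.includes('class="product"'),
});

getHtml('https://scrapeme.live/shop/page/1/').then(html => console.log(html));
// <ul><li class="product">Bulbasaur</li></ul>
```

With the real functions, fast would be getHtmlAxios, headless would be getHtmlPlaywright, and looksComplete could check that extractContent returns at least one product.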

Using Javascript's Async

We already introduced async/await when crawling several links sequentially. If we wanted to crawl them in parallel, just removing the await would be enough, right? Well... not so fast.

The function would call the first crawl and immediately take the following item from the toVisit set. The problem is that the set is empty since the crawling of the first page didn't occur yet. So we added no new links to the list. The function keeps running in the background, but we already exited from the main one.
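The problem just described can be reproduced without any network access. In this sketch, fakeCrawl and crawlWithoutAwait are made-up stand-ins: "crawling" the first page discovers a second one after a short simulated delay, and the loop does not wait for it.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Simulation with no network: "crawling" page 1 discovers page 2 after a short delay
const crawlWithoutAwait = () => {
  const visited = new Set();
  const toVisit = new Set(['/page/1/']);
  const inFlight = [];

  const fakeCrawl = async url => {
    visited.add(url);
    await sleep(20); // Simulated network latency
    toVisit.add('/page/2/'); // The new link arrives only after the delay
  };

  for (const next of toVisit) {
    toVisit.delete(next);
    inFlight.push(fakeCrawl(next)); // No await: the loop moves on right away
  }

  // The loop has already finished here because toVisit was empty on the second pass
  return { visited, toVisit, inFlight };
};

const state = crawlWithoutAwait();
console.log(state.visited.size, state.toVisit.size); // 1 0

Promise.all(state.inFlight).then(() => {
  // The link finally shows up, but nothing is left running to visit it
  console.log(state.visited.size, state.toVisit.size); // 1 1
});
```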

To do this properly, we need to create a queue that will execute tasks when available. To avoid many requests at the same time, we will limit its concurrency.

const queue = (concurrency = 4) => {
  let running = 0;
  const tasks = [];

  return {
    enqueue: async (task, ...params) => {
      tasks.push({ task, params }); // Add task to the list
      if (running >= concurrency) {
        return; // Do not run if we are above the concurrency limit
      }

      running += 1; // "Block" one concurrent task
      while (tasks.length > 0) {
        const { task, params } = tasks.shift(); // Take task from the list
        await task(...params); // Execute task with the provided params
      }
      running -= 1; // Release a spot
    },
  };
};

// Just a helper function, Javascript has no sleep function
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const printer = async num => {
  await sleep(2000);
  console.log(num, Date.now());
};

const q = queue();
// Add 8 tasks that will sleep and print a number
for (let num = 0; num < 8; num++) {
  q.enqueue(printer, num);
}

If you run the code above, it will print numbers 0 to 3 after about two seconds (each with a timestamp) and 4 to 7 roughly two seconds later. It might be the hardest snippet to understand, so take your time reviewing it.

We define queue in lines 1-20. It returns an object with the function enqueue, which adds a task to the list and then checks whether we are at the concurrency limit. If we are not, it adds one to running and enters a loop that takes a task and runs it with the provided params, repeating until the task list is empty; then it subtracts one from running. This variable marks whether we can execute more tasks, which is only allowed below the concurrency limit. Lines 23-28 contain the helper functions sleep and printer. We instantiate the queue in line 30 and enqueue items in lines 32-34 (the first four start running right away).
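To check that the limit actually holds, we can instrument the tasks and record how many run at the same time. The queue below is the same one as above; measureConcurrency and its counters are hypothetical additions made only for this sketch.

```javascript
// Same queue as in the snippet above
const queue = (concurrency = 4) => {
  let running = 0;
  const tasks = [];

  return {
    enqueue: async (task, ...params) => {
      tasks.push({ task, params });
      if (running >= concurrency) {
        return;
      }

      running += 1;
      while (tasks.length > 0) {
        const { task, params } = tasks.shift();
        await task(...params);
      }
      running -= 1;
    },
  };
};

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Run `total` fake tasks and record the highest number running at the same time
const measureConcurrency = async (concurrency, total) => {
  const q = queue(concurrency);
  let active = 0;
  let maxActive = 0;
  let finished = 0;

  const task = async () => {
    active += 1;
    maxActive = Math.max(maxActive, active);
    await sleep(20); // Simulated work
    active -= 1;
    finished += 1;
  };

  // All tasks are enqueued upfront, so waiting for every enqueue call
  // is enough to know that the whole list has been processed
  const jobs = [];
  for (let i = 0; i < total; i++) {
    jobs.push(q.enqueue(task));
  }
  await Promise.all(jobs);

  return { maxActive, finished };
};

measureConcurrency(2, 6).then(result => console.log(result));
// { maxActive: 2, finished: 6 }
```

Waiting on the enqueue promises works here only because every task is added before any of them finishes. Tasks enqueued later, as the crawler does, would need a different completion signal.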

To run several pages concurrently, we now have to use the queue instead of a for loop. The code below is partial and shows only the parts that change.

const crawl = async url => {
  // ...
  links
    .filter(link => !visited.has(link))
    .forEach(link => {
      q.enqueue(crawlTask, link); // Add to queue instead of to the list
    });
};

// Helper function that will call crawl after some checks
const crawlTask = async url => {
  if (visited.size >= maxVisits) {
    console.log('Over Max Visits, exiting');
    return;
  }

  if (visited.has(url)) {
    return;
  }

  await crawl(url);
};

const q = queue();
// Add the first link to the process
q.enqueue(crawlTask, url);

Remember that Node runs in a single thread, so we can take advantage of its event loop but cannot use multiple CPUs/threads. What we've seen works fine because the thread is idle most of the time - network requests do not consume CPU time.

To build this further, we need to use some storage (database) or distributed queue system. Right now, we rely on variables, which are not shared between threads in Node. It is not overly complicated, but we covered enough ground in this blog post.
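As a first small step in that direction, results could be moved out of in-memory variables and into a file. The sketch below is not a database or a distributed queue, only an illustration of the idea. The helpers saveProducts and loadProducts, the JSON Lines format, and the temporary file location are all assumptions, and the second product is placeholder data.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: append products to a JSON Lines file, one object per line
const saveProducts = (file, products) => {
  const lines = products.map(product => JSON.stringify(product) + '\n').join('');
  fs.appendFileSync(file, lines);
};

// Hypothetical helper: read every stored product back
const loadProducts = file =>
  fs
    .readFileSync(file, 'utf8')
    .split('\n')
    .filter(Boolean) // Skip the empty string after the last newline
    .map(line => JSON.parse(line));

// Temporary location, used only for this example
const file = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'crawl-')), 'products.jsonl');

saveProducts(file, [{ id: '759', title: 'Bulbasaur', price: '£63.00' }]);
saveProducts(file, [{ id: '0', title: 'Placeholder', price: '£0.00' }]);

console.log(loadProducts(file).length); // 2
```

In the crawler, saveProducts could be called where the final code pushes content into allProducts.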

Final Code

const axios = require('axios');
const playwright = require('playwright');
const cheerio = require('cheerio');

const url = 'https://scrapeme.live/shop/page/1/';
const useHeadless = false; // "true" to use playwright
const maxVisits = 30; // Arbitrary number for the maximum of links visited
const visited = new Set();
const allProducts = [];

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

const getHtmlPlaywright = async url => {
  const browser = await playwright.firefox.launch();
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto(url);
  const html = await page.content();
  await browser.close();

  return html;
};

const getHtmlAxios = async url => {
  const { data } = await axios.get(url);

  return data;
};

const getHtml = async url => {
  return useHeadless ? await getHtmlPlaywright(url) : await getHtmlAxios(url);
};

const extractContent = $ =>
  $('.product')
    .map((_, product) => {
      const $product = $(product);

      return {
        id: $product.find('a[data-product_id]').attr('data-product_id'),
        title: $product.find('h2').text(),
        price: $product.find('.price').text(),
      };
    })
    .toArray();

const extractLinks = $ => [
  ...new Set(
    $('.page-numbers a')
      .map((_, a) => $(a).attr('href'))
      .toArray()
  ),
];

const crawl = async url => {
  visited.add(url);
  console.log('Crawl: ', url);
  const html = await getHtml(url);
  const $ = cheerio.load(html);
  const content = extractContent($);
  const links = extractLinks($);
  links
    .filter(link => !visited.has(link))
    .forEach(link => {
      q.enqueue(crawlTask, link);
    });
  allProducts.push(...content);

  // We can see how the list grows. Gotta catch 'em all!
  console.log(allProducts.length);
};

// Change the default concurrency or pass it as param
const queue = (concurrency = 4) => {
  let running = 0;
  const tasks = [];

  return {
    enqueue: async (task, ...params) => {
      tasks.push({ task, params });
      if (running >= concurrency) {
        return;
      }

      ++running;
      while (tasks.length) {
        const { task, params } = tasks.shift();
        await task(...params);
      }
      --running;
    },
  };
};

const crawlTask = async url => {
  if (visited.size >= maxVisits) {
    console.log('Over Max Visits, exiting');
    return;
  }

  if (visited.has(url)) {
    return;
  }

  await crawl(url);
};

const q = queue();
q.enqueue(crawlTask, url);

Conclusion

We'd like you to leave with four main points:

  1. Understand the basics of website parsing and crawling.
  2. Separate responsibilities and use abstractions when necessary.
  3. Apply the required techniques to avoid blocks.
  4. Be able to figure out the following steps to scale up.

We can build a custom web scraper with Javascript and Node.js using the pieces we've seen. It might not scale to thousands of websites, but it will run perfectly for a few. And moving to distributed crawling is not that far from here.

And if you liked the content, please share it!


Also published on: https://www.zenrows.com/blog/web-scraping-with-javascript-and-nodejs


Written by anderrv | Web developer who has been working for startups for +10 years. Engineer turned entrepreneur.
Published by HackerNoon on 2021/10/01