Tips and Tricks for Web Scraping with Puppeteer

Written by dan-ni | Published 2018/08/02
Tech Story Tags: web-scraping | web-crawling | javascript | nodejs | puppeteer | latest-tech-stories | browserless | using-proxies


The Google Chrome team made waves last year when it released Puppeteer, a Node.js API for running headless Chrome instances. It represents a marked improvement in both speed and stability over existing solutions like PhantomJS and Selenium, and was named one of the ten best web scraping tools of 2018. However, it is not without its warts, and getting Puppeteer running smoothly for large web scraping jobs brings its own set of complexities (at Scraper API, we use Puppeteer to scrape and render JavaScript from millions of web pages each month). Here are a few lessons we’ve learned.

Using Browserless

Running headless Chrome instances on the same server as your application code is generally a bad idea, as CPU and RAM usage can be unpredictable. To keep a spike in CPU usage from taking down your application server as well, it is a good idea to run headless Chrome on its own server. Luckily, this is easy with Browserless, which ships headless Chrome as a Docker image. Here are the settings we use in production:
docker pull browserless/chrome
docker run -p 3000:3000 \
    -e "MAX_CONCURRENT_SESSIONS=5" \
    -e "MAX_QUEUE_LENGTH=0" \
    -e "PREBOOT_CHROME=true" \
    -e "TOKEN=YOURTOKEN" \
    -e "ENABLE_DEBUGGER=false" \
    -e "CONNECTION_TIMEOUT=300000" \
    --restart always browserless/chrome
These settings time out Chrome sessions after 5 minutes (CONNECTION_TIMEOUT is given in milliseconds), which prevents stray sessions from running indefinitely and eventually crashing your server, and they allow up to 5 sessions at any given time. Five concurrent sessions seems to be a sweet spot that runs comfortably on a $5 DigitalOcean VPS.
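Since the server only allows 5 sessions at once and, as far as we understand MAX_QUEUE_LENGTH=0, does not queue the overflow, it can help to cap in-flight jobs on the client side too. Below is a minimal sketch of such a cap. The createLimiter helper is a hypothetical name of our own, not part of Puppeteer or Browserless.

```javascript
// Minimal client-side limiter: runs at most `limit` async tasks at a time.
// `createLimiter` is a hypothetical helper, not a Puppeteer or Browserless API.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= limit || waiting.length === 0) return;
    active += 1;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        // free the slot and start the next waiting task, if any
        active -= 1;
        next();
      });
  };

  // Returns a promise that settles with the task's own result or error.
  return task =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}

// Matches MAX_CONCURRENT_SESSIONS=5 from the docker settings.
const limit = createLimiter(5);
```

Each scraping job would then be wrapped as limit(() => scrape(url)), where scrape stands for whatever async function opens a session and returns the HTML.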

Browser Settings

There are a few browser-level Puppeteer settings you should know about to speed up your browser instances:
// with browserless
browser = await puppeteer.connect({
  browserWSEndpoint:
  'ws://' +
  browserless.ip +
  ':' +
  browserless.port +
  '?TOKEN=' +
  browserless.token +
  '&--proxy-server=' + proxy + 
  '&--window-size=1920,1080' +
  '&--no-sandbox=true' +
  '&--disable-setuid-sandbox=true' +
  '&--disable-dev-shm-usage=true' +
  '&--disable-accelerated-2d-canvas=true' +
  '&--disable-gpu=true'
});

// without browserless
browser = await puppeteer.launch({
  args: [
  '--proxy-server=' + proxy,
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-accelerated-2d-canvas',
  '--disable-gpu',
  '--window-size=1920,1080',
  ],
});
Because the Puppeteer library is still quite young and very actively developed, some of these flags may already be on by default by the time you read this. They are sensible defaults that we’ve found in GitHub issues like this and this while debugging errors, and they should help you avoid the cross-platform, hard-to-debug memory errors that we ran into.
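The long string concatenation in the Browserless example is easy to get wrong, so here is a small sketch that builds the same endpoint with URLSearchParams. The buildBrowserlessEndpoint helper and its argument names are our own invention, and the sketch assumes Browserless decodes percent-encoded query values in the usual way.

```javascript
// Hypothetical helper: builds the Browserless WebSocket endpoint.
// URLSearchParams escapes values such as the proxy URL for us.
function buildBrowserlessEndpoint({ ip, port, token, proxy }) {
  const params = new URLSearchParams({
    TOKEN: token,
    '--proxy-server': proxy,
    '--window-size': '1920,1080', // Chromium expects width,height
    '--no-sandbox': 'true',
    '--disable-setuid-sandbox': 'true',
    '--disable-dev-shm-usage': 'true',
    '--disable-accelerated-2d-canvas': 'true',
    '--disable-gpu': 'true',
  });
  return `ws://${ip}:${port}?${params.toString()}`;
}
```

The result can be passed straight to puppeteer.connect({ browserWSEndpoint: ... }), which keeps the flag list in one place.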

Page Settings

Scraping a web page requires creating a new Page (this is what Puppeteer calls creating a new browser tab), navigating to the correct page, and returning the HTML. Here are the Page-level settings we are using.
const blockedResourceTypes = [
  'image',
  'media',
  'font',
  'texttrack',
  'object',
  'beacon',
  'csp_report',
  'imageset',
];

const skippedResources = [
  'quantserve',
  'adzerk',
  'doubleclick',
  'adition',
  'exelator',
  'sharethrough',
  'cdn.api.twitter',
  'google-analytics',
  'googletagmanager',
  'google',
  'fontawesome',
  'facebook',
  'analytics',
  'optimizely',
  'clicktale',
  'mixpanel',
  'zedo',
  'clicksor',
  'tiqcdn',
];

const page = await browser.newPage();
await page.setRequestInterception(true);
await page.setUserAgent(userAgent);
page.on('request', request => {
  const requestUrl = request.url().split('?')[0].split('#')[0];
  if (
    blockedResourceTypes.indexOf(request.resourceType()) !== -1 ||
    skippedResources.some(resource => requestUrl.indexOf(resource) !== -1)
  ) {
    request.abort();
  } else {
    request.continue();
  }
});
const response = await page.goto(url, {
  timeout: 25000,
  waitUntil: 'networkidle2',
});
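// goto() can resolve to null, so check the response before reading its status
if (response && response.status() < 400) {
  // give the last in-flight requests a few seconds to finish
  await new Promise(resolve => setTimeout(resolve, 3000));
  const html = await page.content();
  return html;
}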
There are a few things to notice here. Puppeteer has a waitUntil option that lets you define when a page counts as finished loading. ‘networkidle2’ means that there have been no more than 2 open network connections for at least 500 ms. This is a good setting because some websites (e.g. websites using websockets) always keep connections open, so with ‘networkidle0’ the navigation would time out every time. Here is the full documentation for waitUntil. Once only 2 connections are left, we wait an additional 3 seconds to let those last requests finish, and then return the HTML (after checking that the response status code is not an error).
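As a rough sketch, the navigate, wait, and return flow described above could be wrapped in one function that also closes the tab when it is done, so failed pages do not pile up. The fetchHtml name and the settleMs option are our own labels, not Puppeteer APIs. The only Puppeteer calls used are newPage, goto, status, content, and close.

```javascript
// Small promise-based delay helper.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical wrapper: returns the page HTML, or null on an error status.
async function fetchHtml(browser, url, { timeout = 25000, settleMs = 3000 } = {}) {
  const page = await browser.newPage();
  try {
    const response = await page.goto(url, { timeout, waitUntil: 'networkidle2' });
    // goto() can resolve to null; treat that like a failed load
    if (!response || response.status() >= 400) {
      return null;
    }
    // let the last in-flight requests finish before reading the DOM
    await sleep(settleMs);
    return await page.content();
  } finally {
    // always close the tab, even if goto() throws on timeout
    await page.close();
  }
}
```

Returning null on error statuses is just one possible design choice. Throwing an error instead would work equally well if the caller wants to retry.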
When scraping at scale, you may not want to download all of the files on each web page, especially larger files like images. You can intercept requests with the setRequestInterception command and block the ones you don’t need. You can see the documentation for Puppeteer resource types here. You can block any domain or subdomain just by adding it to the skippedResources list. Keep in mind that entries are matched as substrings of the request URL, so a short entry like 'google' blocks every URL that contains it.
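To make the blocking rule easier to reason about (and to test without a browser), the decision can be pulled out into a pure function. This is a minimal sketch with shortened lists. The names blockedTypes, blockedFragments, and shouldBlock are ours.

```javascript
// Shortened example lists; extend them as needed.
const blockedTypes = new Set(['image', 'media', 'font']);
const blockedFragments = ['doubleclick', 'google-analytics', 'google'];

// Hypothetical helper: decides whether a request should be aborted.
function shouldBlock(resourceType, url) {
  // ignore the query string and fragment, as in the handler above
  const bareUrl = url.split('?')[0].split('#')[0];
  return (
    blockedTypes.has(resourceType) ||
    blockedFragments.some(fragment => bareUrl.includes(fragment))
  );
}
```

Inside the request handler this becomes shouldBlock(request.resourceType(), request.url()) ? request.abort() : request.continue(). Note that the 'google' entry also catches hosts such as fonts.googleapis.com, which may or may not be what you want.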

Using Proxies

When scraping a large number of pages on a single website, it may be necessary to use a proxy service to avoid blocks. One common issue with Puppeteer is that proxies can only be set at the Browser level, not the Page level, so every Page (browser tab) must use the same proxy. To use a different proxy for each page, you will need the proxy-chain module, which runs a local proxy server that forwards each request to an upstream proxy of your choice. The local server needs some way to tell which Page a request came from. Because Puppeteer/Chromium have some issues with stripping headers, it is safest to use the User-Agent header, which is reliably set on each request. Simply set up your proxy server to read the User-Agent from the request, and use a different proxy for each User-Agent. Here is a sample proxy server.
const ProxyChain = require('proxy-chain');

// maps each User-Agent string to the upstream proxy it should use
const proxies = {
  'useragent1': 'http://proxyusername1:proxypassword1@proxyhost1:proxyport1',
  'useragent2': 'http://proxyusername2:proxypassword2@proxyhost2:proxyport2',
  'useragent3': 'http://proxyusername3:proxypassword3@proxyhost3:proxyport3',
};

const server = new ProxyChain.Server({
  port: 8000,
  prepareRequestFunction: ({request}) => {
    // pick the upstream proxy based on the request's User-Agent header
    const userAgent = request.headers['user-agent'];
    const proxy = proxies[userAgent];
    return {
      upstreamProxyUrl: proxy,
    };
  },
});

server.listen(() => console.log('proxy server started'));
You can connect to this proxy server by following the example in the Browser Settings section above, setting the proxy to the address of this server (port 8000 in the sample). This will allow you to use a different upstream proxy for each new Page based on the Page’s User-Agent, and will also allow you to connect to proxies that require password authentication (Chromium’s --proxy-server flag does not accept credentials in the proxy URL).
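On the Puppeteer side, each new Page then needs a User-Agent that matches one of the keys in the proxies map. One simple way to hand them out is round-robin, sketched below. The createRotator helper is hypothetical, and the placeholder strings stand in for real User-Agent values.

```javascript
// These must match the keys of the `proxies` map used by the proxy server.
const userAgents = ['useragent1', 'useragent2', 'useragent3'];

// Hypothetical helper: returns a function that cycles through `items`.
function createRotator(items) {
  let index = 0;
  return () => {
    const item = items[index];
    index = (index + 1) % items.length;
    return item;
  };
}

const nextUserAgent = createRotator(userAgents);

// Usage inside an async function, with `browser` started using
// '--proxy-server=http://localhost:8000' (the proxy-chain server):
//   const page = await browser.newPage();
//   await page.setUserAgent(nextUserAgent());
```

With three entries, every third Page shares the same upstream proxy. A longer list spreads the load further.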
Hopefully this helps some of you avoid the painful edge cases we’ve encountered with Puppeteer. Happy scraping!

Published by HackerNoon on 2018/08/02