Programmatic Screenshot Capture Using Playwright and Node.js
As engineers, we are always looking for shiny new greenfield projects or rewriting ageing legacy applications. We rarely get the opportunity to rebuild an application from the days we learned to code and see how far we have come as developers. I’m lucky enough to be in exactly that situation.
Moving back into the search domain through my role as a Developer Advocate at Elastic has inspired me to revisit the Fu-Finder game that I built for my master’s thesis. One of the first tasks in rebuilding the game is revisiting the URL dataset from the original version and capturing screenshots to show in the game. A large number of the URLs have changed or no longer exist, but for those that still resolve I need updated screenshots for players to search for.
Previously, I used WebKit and Python to build the original game. The majority of e2e testing and automation frameworks have screenshot capabilities, including Cypress and Selenium. But since I’ve been playing with Playwright and @elastic/synthetics, I’ll show how Playwright and JavaScript can be used to capture these screenshots.
Pictures of You
The final version of the script, get-screenshot-images.js, is available in the repo carlyrichmond/elastic-fu-finder. The process for generating the screenshots involved three key steps:
- Extract the URLs I want to capture. Following an experiment using the Elastic Web Crawler, I have a set of URLs present in the index search-elastic-fu-finder-pages that I can pull out using a search call that returns all results (a sketch of this search follows the list).
- Set up the Playwright browser configuration.
- Capture a screenshot for each URL and save it.
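For context, the first step looks a little like the sketch below. This isn’t the exact code from the repo: it assumes the v8 @elastic/elasticsearch JavaScript client and hypothetical environment variables for the connection details.
const { Client } = require('@elastic/elasticsearch');
// Hypothetical connection settings; substitute your own cluster details
const client = new Client({
  cloud: { id: process.env.ELASTIC_CLOUD_ID },
  auth: { apiKey: process.env.ELASTIC_API_KEY }
});
// Return every document from the crawled index (capped at 1000 for illustration)
async function getHits() {
  const response = await client.search({
    index: 'search-elastic-fu-finder-pages',
    size: 1000,
    query: { match_all: {} }
  });
  return response.hits.hits; // each hit contains _id and _source.url
}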
There are many great resources out there that show how to capture screenshots using any of the myriad automation frameworks. I came across this amazing article by Chris Roebuck that covers seven different approaches. I used the Playwright example as a basis, ending up with code that looked a little bit like this:
const { chromium } = require('playwright');
const imageFolder = 'images/screenshots';
// Elasticsearch URL extraction is executed before getScreenshotsFromHits
// param: hits, results consisting of document id (_id)
// and document source (_source) containing the url
async function getScreenshotsFromHits(hits) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.setViewportSize({ width: 3840, height: 1080 });

  for (const hit of hits) {
    const id = hit._id;
    const url = hit._source.url;
    console.log(`Obtaining screenshot for ${url}`);

    try {
      await page.goto(url);
      await page.screenshot({ path: `${imageFolder}/${id}.png` });
    }
    catch (e) {
      // Typically a navigation timeout; skip this URL and move on
      console.log(`Timeout received for ${url}`);
      continue;
    }
  }

  await browser.close();
}
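If you’re wondering how this function is driven, a minimal entry point might look like the snippet below, reusing the hypothetical getHits helper sketched earlier:
// Hypothetical entry point: fetch the hits, then capture the screenshots
getHits()
  .then(getScreenshotsFromHits)
  .catch(console.error);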
Playwright gives us access to several different browser engines. However, I found a rather troubling issue when using Chromium against a subset of my websites.
Don’t Stand So Close to Me
Some websites blocked my script, which resulted in Playwright capturing a screenshot of a 403 error in the browser, exactly like the one below.

This was an issue I didn’t encounter when using WebKit way back in the day. It’s not surprising, given the expansion of the web over the past decade, that websites are protecting themselves from automated agents, especially if they expose a public API through which the data can be extracted in a controlled way.
A bit of Googling led me to good old Stack Overflow, and this thread on Python access with Chromium. This old yet detailed post mentions detection of the User Agent to block traffic, which makes sense.
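For completeness, Playwright does let you present a different User Agent via a browser context, which is one way to experiment with this kind of blocking. The snippet below is a hedged sketch rather than anything I shipped, and the User Agent string is purely illustrative:
const { chromium } = require('playwright');
// Launch Chromium and open a page on a context with an overridden User Agent
async function newPageWithUserAgent() {
  const browser = await chromium.launch();
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0 Safari/537.36'
  });
  return context.newPage();
}
Whether a site accepts that spoofed agent is another matter, which is partly why I went a different way.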
Local Boy in the Photograph
I could have simply excluded the impacted URLs from my dataset. I also considered using the new Headless Chromium option based on this other Stack Overflow post to address the limitation.
In the end, I decided to use an alternative browser. I also played around with the fullPage option, which was a tad too large, and the clip settings to get the content that I wanted from my pages:
const { firefox } = require('playwright');
const imageFolder = 'images/screenshots';
// Elasticsearch URL extraction is executed before getScreenshotsFromHits
// param: hits, results consisting of document id (_id)
// and document source (_source) containing the url
async function getScreenshotsFromHits(hits) {
  const browser = await firefox.launch();
  const page = await browser.newPage();
  await page.setViewportSize({ width: 3840, height: 1080 });

  for (const hit of hits) {
    const id = hit._id;
    const url = hit._source.url;
    console.log(`Obtaining screenshot for ${url}`);

    try {
      await page.goto(url);
      await page.screenshot({
        path: `${imageFolder}/${id}.png`,
        fullPage: true,
        clip: { x: 0, y: 0, width: 3840, height: 1080 }
      });
    }
    catch (e) {
      // Typically a navigation timeout; skip this URL and move on
      console.log(`Timeout received for ${url}`);
      continue;
    }
  }

  await browser.close();
}
With this trusty code, we’re now able to extract screenshots successfully. Woohoo!

Conclusions
I’m excited to say the Fu-Finder sequel is in progress, thanks to Playwright helping me extract screenshots from a set of pages. I have limited execution of this script to a one-off extraction to respect these sites. The last thing I want to do is initiate a denial of service against them!
When extracting content at scale, such as with a crawler, I would respect the restrictions put in place by a site’s robots.txt, as I did with my original list of URLs. The majority of open source and commercial crawlers will respect this file, and so should you!
Happy screenshot capture!