GitHub Actions is a powerful tool for building code, running tests, and automating other repetitive tasks in software development. It’s also a powerful, if somewhat underutilized, tool for deploying web scrapers written in R and automatically publishing a version-controlled copy of the scraped data using GitHub.

In this post, I’ll show how I used GitHub Actions to automate a scraper, written in R, that checks whether New Jersey Governor Phil Murphy signed a new executive order on a given day. The scraper pulls a table from a state website using {rvest}, and the workflow commits an updated CSV file to a GitHub repository whenever the scraper finds new data.

The Workflow

Actions are defined by a simple YAML file that lives in the .github/workflows directory in your repository. Here is the one I wrote for this project:

on:
  schedule:
    - cron: '0 4 * * *'
  push:
    branches: main

name: Scrape Executive Orders

jobs:
  render:
    name: Scrape Executive Orders
    runs-on: macOS-latest
    steps:
      - uses: actions/checkout@v2
      - uses: r-lib/actions/setup-r@v1
      - name: Install dependencies
        run: Rscript -e 'install.packages(c("rvest","dplyr","lubridate"))'
      - name: Scrape the data
        run: Rscript scrape_exec_orders.R
      - name: Commit results
        run: |
          # Set a committer identity so git commit can succeed on the runner
          git config --local user.name "github-actions[bot]"
          git config --local user.email "github-actions[bot]@users.noreply.github.com"
          git add -A
          git commit -m 'New Executive Order signed - data updated!' || echo "No changes to commit"
          git push origin || echo "No changes to commit"

How It Works

This workflow defines a set of steps that run each time a change is pushed to the main branch of the repository and on a daily schedule at 4:00 AM UTC (roughly midnight Eastern time, since the cron expression is evaluated in UTC). On each run, it will:

  1. Install R and the package dependencies on one of GitHub’s macOS runners
  2. Run the R script that scrapes the data using {rvest}
  3. Check for changes to the CSV file storing the data
  4. Commit and push if there are updates

Each time the R script runs, it produces a CSV file with the latest data on the governor’s executive orders. Using system-level Git commands, the workflow then determines whether anything was added to the file since the previous commit.
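The scraping script itself lives outside the workflow file. A minimal sketch of what it might look like is below, assuming a hypothetical page URL, a single HTML table, and illustrative column and file names; the real scrape_exec_orders.R and the state site’s markup may differ.

# scrape_exec_orders.R (illustrative sketch)
library(rvest)
library(dplyr)
library(lubridate)

# Hypothetical URL for the state's listing of executive orders
url <- "https://nj.gov/infobank/eo/056murphy/"

# Read the page and extract the first HTML table as a data frame
orders <- read_html(url) %>%
  html_element("table") %>%
  html_table()

# Tidy up: assumed column positions and names; parse the signing date
orders <- orders %>%
  rename(eo_number = 1, date_signed = 2, description = 3) %>%
  mutate(date_signed = mdy(date_signed)) %>%
  arrange(desc(date_signed))

# Overwrite the tracked CSV; Git will notice whether anything changed
write.csv(orders, "executive_orders.csv", row.names = FALSE)

Because the CSV is simply overwritten on every run, the git add and git commit commands in the workflow are effectively a no-op on days when the table hasn’t gained a row, so the commit history doubles as a change log of new executive orders.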

Why GitHub Actions?

Critics of this approach may argue that a similar level of automation could have been achieved by just setting up a cron job on a server, but that introduces additional complexity:

  • Most Linux distributions ship with outdated versions of R
  • Regular maintenance is needed for updates and security
  • Hardware failures and security threats are ever-present

Scheduling and running web scrapers via GitHub Actions avoids these pitfalls: each run happens on a fresh, ephemeral virtual machine that GitHub maintains, much like other serverless and public cloud offerings.

The Value of Automation

Balancing side projects with a full-time job, I don’t have hours of free time to devote to manually maintaining infrastructure. Automating the deployment of the scraper via GitHub Actions frees me from those chores and the pitfalls that come with running my own servers.

The time I would have spent manually setting up a server to host and automate this scraper can now be spent on something else, since I can be reasonably confident that this approach is more easily maintainable and less likely to break.

Resources