Real-World Git Scraping Examples

See how git-scraping is being used in production to track important data changes

Production Examples

COVID-19 Data Tracking

Track daily COVID-19 statistics from Johns Hopkins University

By: Simon Willison

Monitoring public health data changes over time

US Congress Votes

Archive congressional voting records

By: United States Project

Tracking legislative activity and voting patterns

Hacker News Front Page

Archive Hacker News front page stories

By: Max Woolf

Analyzing trending topics and discussion patterns

Template Examples

API Endpoint Monitoring

Track changes to a JSON API endpoint over time

Category: API

Configuration:

  • URL: https://api.github.com/repos/simonw/datasette
  • Schedule: 0 */6 * * *
  • Output: data/api-response.json

Generated Workflow:

name: API Endpoint Monitoring
"on":
  schedule:
    - cron: "0 */6 * * *"
  workflow_dispatch: null
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch data
        run: |-
          curl -L "https://api.github.com/repos/simonw/datasette" \
            -H "User-Agent: Git-Scraping-Bot/1.0" \
            -o temp_data
      - name: Move data to output location
        run: mv temp_data data/api-response.json
      - name: Commit and push if changed
        run: |-
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/api-response.json
          timestamp=$(date -u)
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update API data - $timestamp" && git push)
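The final step's one-liner is what makes the workflow idempotent: it commits only when the fetched file actually differs from what is already in the repository. A minimal local sketch of that pattern, run in a throwaway repository (file contents and paths here are illustrative, and the `git push` is omitted since there is no remote):

```shell
#!/bin/sh
# Demonstrate the "commit only if changed" pattern in a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"

mkdir -p data
echo '{"stars": 1}' > data/api-response.json
git add data/api-response.json
git commit -qm "Initial data"

# Simulated fetch that returned identical data: no commit is created.
echo '{"stars": 1}' > data/api-response.json
git add data/api-response.json
git diff --quiet && git diff --staged --quiet || git commit -qm "Update API data"

# Simulated fetch that returned new data: a commit is created.
echo '{"stars": 2}' > data/api-response.json
git add data/api-response.json
git diff --quiet && git diff --staged --quiet || git commit -qm "Update API data"

git rev-list --count HEAD   # 2 commits: the no-op fetch added nothing
```

`git diff --quiet` checks the working tree against the index and `git diff --staged --quiet` checks the index against HEAD; only when one of them reports a difference does the `||` branch commit.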

Website Archiving

Save HTML snapshots of a webpage for historical tracking

Category: Web

Configuration:

  • URL: https://example.com
  • Schedule: 0 0 * * *
  • Output: archive/page.html

Generated Workflow:

name: Website Archiving
"on":
  schedule:
    - cron: "0 0 * * *"
  workflow_dispatch: null
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch data
        run: |-
          curl -L "https://example.com" \
            -H "User-Agent: Mozilla/5.0 (compatible; Git-Scraping-Bot/1.0)" \
            -o temp_data
      - name: Move data to output location
        run: mv temp_data archive/page.html
      - name: Commit and push if changed
        run: |-
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add archive/page.html
          timestamp=$(date -u)
          git diff --quiet && git diff --staged --quiet || (git commit -m "Archive webpage snapshot - $timestamp" && git push)
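One caveat with archiving HTML verbatim: dynamic pages often embed CSRF tokens, build timestamps, or cache-busting hashes that change on every fetch, so the workflow would commit a "change" on every run. A hedged sketch of a normalization step you could add between fetching and committing; the sample page and the sed patterns are assumptions you would adapt to the real page:

```shell
#!/bin/sh
# Mask volatile fragments in a fetched page before committing, so the
# git history only records real content changes. Patterns are illustrative.
set -e
cat > temp_data <<'HTML'
<html><head>
<meta name="csrf-token" content="a91f3c">
</head><body>
<p>Example Domain</p>
<p>Rendered at 2024-01-01T00:00:00Z</p>
</body></html>
HTML

sed -e 's/content="[0-9a-f]\{6\}"/content="REDACTED"/' \
    -e 's/Rendered at [^<]*/Rendered at REDACTED/' \
    temp_data > page.normalized.html

grep -c REDACTED page.normalized.html   # both volatile fields were masked
```

Committing the normalized file instead of the raw fetch keeps the history quiet between genuine edits to the page.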

RSS Feed Tracking

Monitor blog or news RSS feeds for new entries

Category: Content

Configuration:

  • URL: https://simonwillison.net/atom/entries/
  • Schedule: 0 */12 * * *
  • Output: feeds/feed.xml

Generated Workflow:

name: RSS Feed Tracking
"on":
  schedule:
    - cron: "0 */12 * * *"
  workflow_dispatch: null
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch data
        run: |-
          curl -L "https://simonwillison.net/atom/entries/" \
            -H "User-Agent: Git-Scraping-Bot/1.0" \
            -o temp_data
      - name: Move data to output location
        run: mv temp_data feeds/feed.xml
      - name: Commit and push if changed
        run: |-
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add feeds/feed.xml
          timestamp=$(date -u)
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update RSS feed - $timestamp" && git push)
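Once snapshots accumulate, the payoff is reading them back. A rough sketch of pulling entry titles out of an archived feed; the feed content below is a stand-in for what the workflow would have committed to feeds/feed.xml:

```shell
#!/bin/sh
# Extract entry titles from an archived Atom feed snapshot.
# The sample feed is illustrative, not real data.
set -e
mkdir -p feeds
cat > feeds/feed.xml <<'XML'
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>
XML

# Crude extraction; a real XML parser is safer for arbitrary feeds.
grep -o '<title>[^<]*</title>' feeds/feed.xml | sed 's/<[^>]*>//g'
```

In a live repository you could combine this with `git log -p feeds/feed.xml` to see exactly which entries appeared between any two scheduled runs.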

Common Use Cases

Ideas for what you can track with git-scraping

Government & Public Data

  • Legislative changes and votes
  • Public health statistics
  • Environmental data
  • Open data portals

Business Intelligence

  • Competitor pricing
  • Product availability
  • Stock market data
  • Exchange rates

Content Monitoring

  • News articles
  • Blog posts (RSS)
  • Social media trends
  • Forum discussions

Technical Monitoring

  • API availability
  • Service status pages
  • Software releases
  • Documentation changes