Real-World Git Scraping Examples
See how git-scraping is used in production to track changes to important data
Production Examples
COVID-19 Data Tracking
Track daily COVID-19 statistics from Johns Hopkins University
By: Simon Willison
Monitoring public health data changes over time
US Congress Votes
Archive congressional voting records
By: United States Project
Tracking legislative activity and voting patterns
Hacker News Front Page
Archive Hacker News front page stories
By: Max Woolf
Analyzing trending topics and discussion patterns
Template Examples
API Endpoint Monitoring
Track changes to a JSON API endpoint over time
Configuration:
- URL: https://api.github.com/repos/simonw/datasette
- Schedule: 0 */6 * * *
- Output: data/api-response.json
Generated Workflow:
name: API Endpoint Monitoring
"on":
  schedule:
    - cron: 0 */6 * * *
  workflow_dispatch: null
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch data
        run: |-
          curl -L "https://api.github.com/repos/simonw/datasette" \
            -H "User-Agent: Git-Scraping-Bot/1.0" \
            -o temp_data
      - name: Move data to output location
        run: mv temp_data data/api-response.json
      - name: Commit and push if changed
        run: |-
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/api-response.json
          timestamp=$(date -u)
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update API data - $timestamp" && git push)
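API responses are often minified or returned with unstable key order, which can make the committed diffs hard to read. The following is a minimal Python sketch of a normalization step that could run between fetching and committing, assuming the response body is valid JSON. The function name `normalize_json` and the sample payload are hypothetical and are not part of the generated workflow.

```python
import json


def normalize_json(raw: str) -> str:
    """Re-serialize JSON text with sorted keys and two-space indentation."""
    parsed = json.loads(raw)
    # sort_keys gives a stable key order; indent puts each field on its own
    # line, so a git diff shows one changed field per line.
    return json.dumps(parsed, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


if __name__ == "__main__":
    # Illustrative payload only, not a real API response.
    print(normalize_json('{"name": "example", "count": 1}'), end="")
```

With the output normalized this way, a commit should only appear when a value actually changes, not when the server happens to reorder keys.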
Website Archiving
Save HTML snapshots of a webpage for historical tracking
Configuration:
- URL: https://example.com
- Schedule: 0 0 * * *
- Output: archive/page.html
Generated Workflow:
name: Website Archiving
"on":
  schedule:
    - cron: 0 0 * * *
  workflow_dispatch: null
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch data
        run: |-
          curl -L "https://example.com" \
            -H "User-Agent: Mozilla/5.0 (compatible; Git-Scraping-Bot/1.0)" \
            -o temp_data
      - name: Move data to output location
        run: mv temp_data archive/page.html
      - name: Commit and push if changed
        run: |-
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add archive/page.html
          timestamp=$(date -u)
          git diff --quiet && git diff --staged --quiet || (git commit -m "Archive webpage snapshot - $timestamp" && git push)
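The final step of the workflow commits only when git detects a difference, so unchanged snapshots produce no history. The same "write only if changed" idea can be sketched in Python for local experiments. This is an illustration under assumptions, not what the generator emits: `write_if_changed` is a hypothetical helper, and unlike the `mv` step above it creates the output directory if it is missing.

```python
import tempfile
from pathlib import Path


def write_if_changed(path: Path, new_content: bytes) -> bool:
    """Write new_content to path only when it differs from what is on disk.

    Returns True if the file was created or updated, False if it was unchanged.
    """
    if path.exists() and path.read_bytes() == new_content:
        return False
    # Create the parent directory (for example archive/) if it does not exist.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(new_content)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = Path(workdir) / "archive" / "page.html"
        print(write_if_changed(target, b"<html>v1</html>"))  # first snapshot
        print(write_if_changed(target, b"<html>v1</html>"))  # identical content
```

Pages that embed timestamps or session tokens will differ on every fetch, so a comparison like this (or git's own diff) would register a change each run unless that volatile content is stripped first.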
RSS Feed Tracking
Monitor blog or news RSS feeds for new entries
Configuration:
- URL: https://simonwillison.net/atom/entries/
- Schedule: 0 */12 * * *
- Output: feeds/feed.xml
Generated Workflow:
name: RSS Feed Tracking
"on":
  schedule:
    - cron: 0 */12 * * *
  workflow_dispatch: null
jobs:
  scrape:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Fetch data
        run: |-
          curl -L "https://simonwillison.net/atom/entries/" \
            -H "User-Agent: Git-Scraping-Bot/1.0" \
            -o temp_data
      - name: Move data to output location
        run: mv temp_data feeds/feed.xml
      - name: Commit and push if changed
        run: |-
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add feeds/feed.xml
          timestamp=$(date -u)
          git diff --quiet && git diff --staged --quiet || (git commit -m "Update RSS feed - $timestamp" && git push)
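The three templates use the schedules `0 */6 * * *`, `0 0 * * *`, and `0 */12 * * *`. As a rough aid for reading them, the sketch below expands the minute and hour fields into daily run times. It is a deliberately simplified illustration: the helper names are hypothetical, only `*`, `*/N`, and plain numbers are handled, and the day and month fields are assumed to be `*`. GitHub Actions evaluates schedules in UTC.

```python
def expand_cron_field(field: str, low: int, high: int) -> list[int]:
    """Expand one cron field limited to '*', '*/N', or a plain number."""
    if field == "*":
        return list(range(low, high + 1))
    if field.startswith("*/"):
        step = int(field[2:])
        return list(range(low, high + 1, step))
    return [int(field)]


def daily_run_times(cron: str) -> list[str]:
    """List the HH:MM (UTC) times at which a five-field cron expression fires.

    Only the minute and hour fields are interpreted; the remaining three
    fields are ignored and assumed to be '*'.
    """
    minute_field, hour_field, *_ = cron.split()
    minutes = expand_cron_field(minute_field, 0, 59)
    hours = expand_cron_field(hour_field, 0, 23)
    return [f"{h:02d}:{m:02d}" for h in hours for m in minutes]


if __name__ == "__main__":
    for expression in ("0 */6 * * *", "0 0 * * *", "0 */12 * * *"):
        print(expression, "->", daily_run_times(expression))
```

These are nominal times only; scheduled workflow runs on GitHub Actions can start later than the listed minute when the service is busy.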
Common Use Cases
Ideas for what you can track with git-scraping
Government & Public Data
- Legislative changes and votes
- Public health statistics
- Environmental data
- Open data portals
Business Intelligence
- Competitor pricing
- Product availability
- Stock market data
- Exchange rates
Content Monitoring
- News articles
- Blog posts (RSS)
- Social media trends
- Forum discussions
Technical Monitoring
- API availability
- Service status pages
- Software releases
- Documentation changes