exiguus.blog

Personal blog

Archive websites (on GitHub)

Created: 2023-12-30

Introduction

Archiving websites is an important practice for preserving digital content. This article explores how to archive a website and transform it into a static site that can be published on GitHub Pages. We'll examine three practical examples from real archive projects. The process involves downloading the website, processing the content to create a static site, testing it locally with Docker, and finally deploying it to GitHub Pages.

Legal & Ethical Note: All websites archived in the examples were personally owned or had explicit permission for archiving. Always ensure you have the right to archive a website and respect copyright and privacy laws.

Save a website

To download and save a website, the wget command is the primary tool. It allows recursive downloading of all website resources with various options:

wget --wait 1 \
  --recursive \
  --page-requisites \
  --convert-links \
  --span-hosts \
  --no-clobber \
  --no-parent \
  -e robots=off \
  --no-check-certificate \
  --keep-session-cookies \
  --continue \
  --accept-regex='[pattern]' \
  --user-agent='Mozilla/5.0 (compatible; archiv/SCAN; v.0.1)' \
  --header 'Accept-encoding: identity' \
  [website-url]

Key flags explained:

  • --wait 1: pause one second between requests to avoid overloading the server
  • --recursive: follow links and download the whole site
  • --page-requisites: also fetch the images, CSS, and scripts needed to render each page
  • --convert-links: rewrite links so the local copy works offline
  • --span-hosts: allow fetching from related hosts (e.g., an image subdomain)
  • --no-clobber: skip files that already exist locally
  • --no-parent: never ascend above the start directory
  • -e robots=off: ignore robots.txt (only acceptable on sites you own or have permission to archive)
  • --no-check-certificate: skip TLS certificate validation
  • --keep-session-cookies: keep session cookies across requests
  • --continue: resume partially downloaded files
  • --accept-regex: only download URLs matching the pattern
  • --user-agent: identify the crawler with a custom user-agent string
  • --header 'Accept-encoding: identity': request uncompressed responses so files are saved as-is

Create a Static Site from the website

After downloading, the website needs significant post-processing to become a proper static site. This involves:

  1. URL rewriting: converting absolute URLs into relative ones
  2. File naming: adding missing .html extensions so a static server resolves pages
  3. Cleanup: removing dynamic leftovers such as query strings, RSS links, and server-side scripts
  4. Additions: injecting any custom scripts, styles, or fonts the archive needs

The sed command performs bulk text replacements on HTML files:

find ./output -type f -exec sed -i 's/old-pattern/new-pattern/g' {} \;
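As a minimal, hypothetical illustration (the ./demo-output directory and URL are made up), the same find/sed combination can be exercised on a throwaway file. Note that sed -i without a suffix argument is GNU sed syntax:

```shell
# Create a throwaway file containing an absolute URL (hypothetical example).
mkdir -p ./demo-output
printf '<a href="https://example.com/page">link</a>\n' > ./demo-output/index.html

# Rewrite the absolute URL prefix into a relative one in every file found.
find ./demo-output -type f -exec sed -i 's|https://example.com|.|g' {} \;

cat ./demo-output/index.html   # <a href="./page">link</a>

# Clean up the demo directory.
rm -rf ./demo-output
```

Using | as the sed delimiter avoids having to escape every / in the URL.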

Common transformations include:

  1. Rewriting href and src attributes from absolute to relative URLs
  2. Stripping query parameters from asset references (style.css?v=1 becomes style.css)
  3. Escaping stray characters that would otherwise break the HTML

Makefile for orchestration

The Makefile orchestrates the entire workflow with targets for downloading, processing, testing, and building. Here's the basic structure:

fetch:
 cd ./input && wget [options] [url]

create:
 rm -rf ./output/*
 cp -R ./input/* ./output
 # ... sed replacements and file transformations ...

build:
 rm -rf ./build/*
 cp -R ./output/* ./build
 docker build --no-cache -t archiv.[domain] .

format:
 prettier --write . # Code style formatting

test:
 wget --recursive --spider localhost:8080 2>&1 | \
 grep '^--' | awk '{ print $$3 }' | sort | uniq > tmp-url-list.txt
 diff tmp-url-list.txt url-list.txt

run:
 docker run -p 8080:80 archiv.[domain]

all:
 make create
 make format
 make build -B
 make run

Real-world example: archiv.berlinics.de

While the basic Makefile structure works for simple sites, real-world archives require significant customization. The archiv.berlinics.de project demonstrates a production-grade Makefile with extensive transformations for a complex website (a Zenphoto photo gallery):

# fetch input
fetch:
 cd ./input &&\
 wget --wait 1 \
   --recursive \
   --page-requisites \
   --convert-links \
   --span-hosts \
   --no-clobber \
   --no-parent \
   --force-html \
   -e robots=off \
   --no-check-certificate \
   --keep-session-cookies \
   --adjust-extension \
   --tries=3 \
   --continue \
   --accept-regex='http:\/\/(img|dl|archiv|css)\.berlinics\.de\/.*|http:\/\/berlinics\.de\/.*' \
   --user-agent='Mozilla/5.0 (compatible; berlinics.de/SCAN; de-DE,de,en-US,en; v.0.1)' \
   --header 'Accept-encoding: identity' \
   --header 'Accept-language: de-DE' \
   http://berlinics.de/

# create output
create:
 rm -rf ./output/*
 cp -R ./input/* ./output

 # Add .html extension to pages without it
 find ./output/berlinics.de/page -type f -not -name '*.html' -exec bash -c 'mv "$$1" "$${1%.html}.html"' _ {} \;

 # Remove images that use zp-core gallery system
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<img[^>]*src="[^"]*zp-core\/c\.php[^"]*"[^>]*>//g' {} \;
 rm -rf ./output/berlinics.de/zp-core

 # Remove RSS links (broken in static copy)
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<a[^>]*href="?[^"]*\/rss\.php[^"]*"[^>]*>([^<]*)<\/a>/\1/g' {} \;

 # Convert absolute URLs to relative
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/href="https:\/\/berlinics.de/href="/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/src="https:\/\/berlinics.de/src="/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/href="http:\/\/berlinics.de/href="/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/src="http:\/\/berlinics.de/src="/g' {} \;

 # Handle CSS query parameters
 find ./output/berlinics.de -type f -name 'style.css?*' -exec bash -c 'mv "$$1" "$${1%\?*}"' _ {} \;
 find ./output/berlinics.de -type f -name 'style.css%3F*' -exec bash -c 'mv "$$1" "$${1%\%3F*}"' _ {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/style\.css\?[^"]*"/style\.css"/g' {} \;

 # Fix special cases
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<Liked/\&lt;Liked/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<Loved/\&lt;Loved/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/href="([^"]*\.photo)"/href="\1.html"/g' {} \;

 # Add custom scripts
 cp -R ./addition/js/* ./output/berlinics.de/themes/berlinics/js
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<\/body>/<script src="\/themes\/berlinics\/js\/archiv.js"><\/script><\/body>/g' {} \;

 # Add custom fonts
 cp -R ./addition/fonts ./output/berlinics.de/themes/berlinics/fonts
 cp ./addition/css/font.css ./output/berlinics.de/themes/berlinics/css/font.css
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<\/head>/<link rel="stylesheet" href="\/themes\/berlinics\/css\/font.css"><\/head>/g' {} \;
 find ./output/berlinics.de -type f -name 'style.css' -exec sed -i -E 's/"Trebuchet MS"/"PT Serif"/g' {} \;
 find ./output/berlinics.de -type f -name 'style.css' -exec sed -i -E 's/Georgia/"PT Serif"/g' {} \;

build:
 rm -rf ./build/*
 cp -R ./output/* ./build
 docker build --no-cache -t archiv.berlinics.de .

run:
 docker run -p 8080:80 archiv.berlinics.de

test:
 output_file="tmp-url-list.txt"; \
 wget --recursive --spider localhost:8080 2>&1 | grep '^--' | awk '{ print $$3 }' | sort | uniq > "$$output_file";\
 expected_file="url-list.txt"; \
 if diff "$$output_file" "$$expected_file"; then \
     echo "Test passed: Output matches expected content."; \
     rm tmp-url-list.txt; \
     exit 0; \
 else \
     echo "Test failed: Output does not match expected content."; \
     exit 1; \
 fi; \
 prettier --check . --ignore-path .prettierignore --config .prettierrc

format:
 prettier --write . --ignore-path .prettierignore --config .prettierrc

all:
 make create
 make format
 make build -B
 make run

Key transformations in this real-world example:

  1. Fetch phase: Uses multiple headers and language preferences for a German website
  2. Page handling: Adds .html extension to files without it (Zenphoto installation)
  3. Gallery removal: Strips out images from the Zenphoto gallery system (zp-core)
  4. RSS cleanup: Removes broken RSS feeds from the static copy
  5. URL conversion: Changes both https:// and http:// absolute URLs to relative paths
  6. CSS fixes: Removes query parameters from CSS files (common in dynamic sites)
  7. Character escaping: Escapes stray < characters as HTML entities (<Liked becomes &lt;Liked)
  8. Photo extensions: Adds .html to photo URLs
  9. Script injection: Adds custom archival scripts and fonts
  10. Font replacement: Replaces unavailable fonts with PT Serif

This demonstrates how real websites require custom post-processing to become proper static sites. Each website's archive needs specific handling based on the original site's technology.
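The rename used in the Makefile above relies on shell parameter expansion: ${1%.html} strips a trailing .html if present. A standalone sketch with hypothetical file names (in a Makefile the dollar signs are doubled, $$1, so make passes a literal $ through to the shell):

```shell
# Two demo files: one without an extension, one already ending in .html.
mkdir -p ./demo-pages
touch ./demo-pages/about ./demo-pages/index.html

# Rename only the files that do not already end in .html.
find ./demo-pages -type f -not -name '*.html' \
  -exec bash -c 'mv "$1" "${1%.html}.html"' _ {} \;

ls ./demo-pages   # both files now end in .html
rm -rf ./demo-pages
```

The -not -name '*.html' filter matters: without it, mv would try to rename index.html onto itself.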

Docker image and nginx configuration (Testing Only)

Important note: For archiv.berlinics.de, the Docker image is used for local testing purposes only. The actual production deployment uses GitHub Pages, not Docker. The Docker setup allows developers to verify the archive works correctly before pushing to GitHub Pages.

The Dockerfile for archiv.berlinics.de is minimal but effective for testing:

# Use the official Nginx image as the base image
FROM nginx:latest

# Copy your Nginx configuration file into the container
COPY nginx.conf /etc/nginx/nginx.conf

# Copy your static files (e.g., HTML, CSS, JavaScript) to the Nginx default serving directory
COPY ./build/berlinics.de /usr/share/nginx/html

# Expose the default Nginx port (usually 80)
EXPOSE 80

# Start Nginx when the container starts
CMD ["nginx", "-g", "daemon off;"]

The nginx configuration serves the static site with proper routing:

user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type text/html;
    access_log /var/log/nginx/access.log;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    include /etc/nginx/conf.d/*.conf;

    server {
        listen 80 default_server;
        listen [::]:80 default_server;

        root /usr/share/nginx/html;
        index index.html;
        server_name _;

        location / {
            try_files $uri $uri/ =404;
        }
    }
}

How Docker build and test work together (local testing workflow):

  1. Build phase (make build):

    • Copies the processed ./output/* to ./build/
    • Builds Docker image: docker build --no-cache -t archiv.berlinics.de .
    • This creates a new image every time to ensure fresh dependencies
  2. Run phase (make run):

    • Starts container: docker run -p 8080:80 archiv.berlinics.de
    • Maps localhost:8080 to container's port 80
    • nginx serves the static site from /usr/share/nginx/html
    • Developer can verify the site works locally before deployment
  3. Test phase (make test):

    • Uses wget --recursive --spider localhost:8080 to crawl all links
    • Extracts all found URLs with grep and awk
    • Compares against the expected url-list.txt
    • Tests pass only if all crawled URLs match expected list
    • Also runs prettier code style check
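The URL-extraction step can be tried in isolation by feeding the pipeline a few sample lines in the format wget prints (the timestamps and URLs below are made up):

```shell
# Simulate three lines of wget --spider output; the URL is the third field.
printf '%s\n' \
  '--2023-12-30 12:00:00--  http://localhost:8080/' \
  '--2023-12-30 12:00:01--  http://localhost:8080/about.html' \
  '--2023-12-30 12:00:02--  http://localhost:8080/' |
  grep '^--' | awk '{ print $3 }' | sort | uniq
# http://localhost:8080/
# http://localhost:8080/about.html
```

The duplicate URL is collapsed by sort | uniq, so the resulting list is directly comparable with diff against url-list.txt.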

This Docker-based testing approach ensures:

  1. Completeness: every page and asset resolves locally, with no links leaking back to the original host
  2. Realism: the archive is exercised through a real web server rather than the local filesystem
  3. Repeatability: the crawl is compared against a known-good url-list.txt, so regressions are caught before deployment

After successful local testing with Docker, the ./build/ directory is deployed to GitHub Pages via GitHub Actions.

Save the website and Static Site

Once the static site is built and tested locally, it needs to be version controlled. This preserves the archive history and enables automated deployment:

Git and Git LFS

Important Note: Avoid Git LFS if possible. While some of the example projects use Git LFS, it should generally be avoided due to:

  1. Quotas and cost: GitHub's free LFS storage and bandwidth allowances are limited, and overages cost money
  2. Tooling friction: every contributor and CI job needs git-lfs installed and configured
  3. GitHub Pages limitations: Pages does not resolve LFS pointers itself, so workflows must check out the real files (hence lfs: true in the Actions workflow)

Instead of Git LFS, consider these alternatives:

  1. Optimize assets: Compress images, minify HTML/CSS/JS before committing
  2. Use .gitignore: Don't commit unnecessary files (build artifacts, cache files)
  3. Separate large assets: Host images on a CDN or separate image hosting service
  4. Regular Git is sufficient: For most static sites, standard Git works fine

If you absolutely must use Git LFS (very large media files):

git lfs track "*.jpg"       # Only for exceptionally large files
git lfs track "*.jpeg"

The example projects used Git LFS due to their specific circumstances (e.g., archiv.berlinics.de's extensive photo gallery), but modern static sites typically don't need it. GitHub's standard repository limits (recommended < 1GB) are sufficient for most website archives when properly optimized.
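Before reaching for Git LFS at all, it is worth checking whether the repository actually contains files large enough to need it. A sketch with made-up files and an arbitrary 5 MB threshold:

```shell
# Create a demo tree with one large file and one small file.
mkdir -p ./demo-repo
dd if=/dev/zero of=./demo-repo/big.jpg bs=1M count=6 2>/dev/null
dd if=/dev/zero of=./demo-repo/small.css bs=1K count=1 2>/dev/null

# List files over 5 MB; only these would even be candidates for LFS.
find ./demo-repo -type f -size +5M
# ./demo-repo/big.jpg

rm -rf ./demo-repo
```

If the list is empty or short, plain Git with optimized assets is almost certainly the simpler choice.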

GitHub repositories

The three example projects (archiv.gedit.net, archiv.berlinics.de, archiv.her0.be) each live in their own dedicated GitHub repository, keeping every archive's history, build tooling, and deployment configuration self-contained.

Publish the Static Site on GitHub Pages

GitHub Pages provides free hosting for static sites. There are multiple approaches; the following sections cover an automated GitHub Actions workflow, custom domains, and manual deployment.

GitHub Actions Workflow

Modern projects use automated GitHub Actions workflows for deployment:

name: Deploy static content to Pages

on:
  push:
    branches: [main]
    paths:
      - "build/[domain]"
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          lfs: true # Remove this line if not using Git LFS
      - name: Setup Pages
        uses: actions/configure-pages@v3
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          path: "./build/[domain]"
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v1

Key features:

  1. Automatic trigger: deploys on pushes to main that touch the build directory, plus manual runs via workflow_dispatch
  2. Minimal permissions: only the contents, pages, and id-token scopes the deployment needs
  3. Concurrency control: the pages group prevents overlapping deployments
  4. LFS support: the checkout step pulls LFS objects (drop lfs: true if the repository does not use Git LFS)

Custom domains

GitHub Pages supports custom domains via CNAME file:

archiv.berlinics.de
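The CNAME file is just a one-line text file committed to the published directory. A sketch using a hypothetical domain and demo path (for the custom domain to actually resolve, the DNS record must also point at GitHub Pages):

```shell
# Write the custom domain into the directory that gets deployed (demo paths).
mkdir -p ./demo-build/example.com
printf 'archiv.example.com\n' > ./demo-build/example.com/CNAME

cat ./demo-build/example.com/CNAME   # archiv.example.com
rm -rf ./demo-build
```

Because the deploy workflow uploads the whole build directory, the CNAME file travels with every deployment and does not need to be re-set in the repository settings.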

Manual deployment alternative

For simpler workflows or testing before GitHub Actions:

git checkout gh-pages              # Switch to GitHub Pages branch
make all                           # Rebuild everything
rm -rf ./docs/* && cp -r ./build/[domain]/* docs/  # Update docs folder
git add docs/
git commit -m "feat(docs): update docs"
git push origin gh-pages           # Deploy to GitHub Pages

This approach requires manual execution but gives full control over when deployments happen.

Complete Workflow Summary

Archiving websites on GitHub Pages involves a complete workflow combining multiple tools:

Download → Process → Test → Commit → Deploy
  (wget)   (sed)    (Docker) (git)  (GitHub Actions)

Step 1 - Download (make fetch): wget recursively downloads the entire website

Step 2 - Process (make create): sed and bash scripts transform the content to static format

Step 3 - Test (make test): Local Docker container verifies all links work

Step 4 - Commit (git commit): Version control preserves the archive snapshot

Step 5 - Deploy (GitHub Actions or manual): Pushes to GitHub Pages for public access

Each step is atomic and can be run independently, making the workflow resilient and maintainable. The three real-world projects (archiv.gedit.net, archiv.berlinics.de, archiv.her0.be) demonstrate this approach in practice, ensuring important websites remain accessible indefinitely.

Why Archive Websites?

Archiving websites preserves digital history and personal work. The web is ephemeral—domains expire, hosting services shut down, and content disappears. By creating static archives on GitHub Pages, we ensure:

These three archive projects preserve websites that were important to me but no longer actively maintained. Rather than letting them disappear, they continue to exist as static snapshots—a digital time capsule of past work.

There's also an article about the journey of my abandoned blog if you're interested in the personal story behind these archives and the process of letting go while preserving history.

Common Issues and Troubleshooting

Problem: JavaScript-heavy sites don't work

Solution: wget only saves server-rendered HTML, so content built client-side will be missing from the archive. For such sites, consider a browser-based capture tool instead (a headless browser or a single-page snapshot tool).

Problem: Authentication/login-required pages

Solution: Export your session cookies and pass them to wget with --load-cookies (combined with --keep-session-cookies). Only archive accounts and content you own.

Problem: Rate limiting or blocked by server

Solution: Increase --wait, add --random-wait, lower --tries, and use an honest --user-agent string. If you control the server, whitelist your crawler instead of fighting the limits.

Problem: SSL certificate errors

Solution: --no-check-certificate skips validation, which is acceptable for a one-off archive of your own site; fixing the certificate is the better long-term answer.

Problem: Site is too large

Solution: Narrow the crawl with --accept-regex/--reject-regex or --exclude-directories, limit recursion depth with --level, and archive only the parts that matter.

Alternatives to consider:

  1. HTTrack: a dedicated website copier with a GUI and fine-grained filters
  2. ArchiveBox: a self-hosted archiving tool that captures multiple formats per page
  3. The Internet Archive's Wayback Machine: Save Page Now for one-off public snapshots

Feedback

Have thoughts or experiences you'd like to share? I'd love to hear from you! Whether you agree, disagree, or have a different perspective, your feedback is always welcome. Drop me an email and let's start a conversation.

<archive-websites-on-github@exiguus.blog>
