exiguus.blog

Personal blog

Archive websites (on GitHub)

Created: 2023-12-30

Introduction

Archiving websites is an important practice for preserving digital content. This article explores how to archive a website and transform it into a static site that can be published on GitHub Pages. We'll examine three practical examples from real archive projects. The process involves downloading the website, processing the content to create a static site, testing it locally with Docker, and finally deploying it to GitHub Pages.

Legal & Ethical Note: All websites archived in the examples were personally owned or had explicit permission for archiving. Always ensure you have the right to archive a website and respect copyright and privacy laws.

Save a website

To download and save a website, the wget command is the primary tool. It allows recursive downloading of all website resources with various options:

wget --wait 1 \
  --recursive \
  --page-requisites \
  --convert-links \
  --span-hosts \
  --no-clobber \
  --no-parent \
  -e robots=off \
  --no-check-certificate \
  --keep-session-cookies \
  --continue \
  --accept-regex='[pattern]' \
  --user-agent='Mozilla/5.0 (compatible; archiv/SCAN; v.0.1)' \
  --header 'Accept-encoding: identity' \
  [website-url]

Key flags explained:

  • --wait 1: pause one second between requests to avoid overloading the server
  • --recursive: follow links and download the whole site
  • --page-requisites: also fetch the images, CSS, and scripts needed to render each page
  • --convert-links: rewrite links so the local copy works offline
  • --span-hosts: allow fetching from related hosts (e.g., an image subdomain)
  • --no-clobber: skip files that already exist locally
  • --no-parent: never ascend above the start directory
  • -e robots=off: ignore robots.txt (only acceptable on sites you own or have permission to archive)
  • --no-check-certificate: skip TLS certificate validation
  • --keep-session-cookies: keep session cookies across requests
  • --continue: resume partially downloaded files
  • --accept-regex: only download URLs matching the pattern
  • --user-agent: identify the crawler with a custom user-agent string
  • --header 'Accept-encoding: identity': request uncompressed responses so files are saved as-is

Create a Static Site from the website

After downloading, the website needs significant post-processing to become a proper static site. This involves:

  1. URL rewriting: converting absolute URLs into relative ones
  2. File naming: adding missing .html extensions so a static server resolves pages
  3. Cleanup: removing dynamic leftovers such as query strings, RSS links, and server-side scripts
  4. Additions: injecting any custom scripts, styles, or fonts the archive needs

The sed command performs bulk text replacements on HTML files:

find ./output -type f -exec sed -i 's/old-pattern/new-pattern/g' {} \;
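As a minimal, hypothetical illustration (the ./demo-output directory and URL are made up), the same find/sed combination can be exercised on a throwaway file. Note that sed -i without a suffix argument is GNU sed syntax:

```shell
# Create a throwaway file containing an absolute URL (hypothetical example).
mkdir -p ./demo-output
printf '<a href="https://example.com/page">link</a>\n' > ./demo-output/index.html

# Rewrite the absolute URL prefix into a relative one in every file found.
find ./demo-output -type f -exec sed -i 's|https://example.com|.|g' {} \;

cat ./demo-output/index.html   # <a href="./page">link</a>

# Clean up the demo directory.
rm -rf ./demo-output
```

Using | as the sed delimiter avoids having to escape every / in the URL.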

Common transformations include:

  1. Rewriting href and src attributes from absolute to relative URLs
  2. Stripping query parameters from asset references (style.css?v=1 becomes style.css)
  3. Escaping stray characters that would otherwise break the HTML

Makefile for orchestration

The Makefile orchestrates the entire workflow with targets for downloading, processing, testing, and building. Here's the basic structure:

fetch:
 cd ./input && wget [options] [url]

create:
 rm -rf ./output/*
 cp -R ./input/* ./output
 # ... sed replacements and file transformations ...

build:
 rm -rf ./build/*
 cp -R ./output/* ./build
 docker build --no-cache -t archiv.[domain] .

format:
 prettier --write . # Code style formatting

test:
 wget --recursive --spider localhost:8080 2>&1 | \
 grep '^--' | awk '{ print $$3 }' | sort | uniq > tmp-url-list.txt
 diff tmp-url-list.txt url-list.txt

run:
 docker run -p 8080:80 archiv.[domain]

all:
 make create
 make format
 make build -B
 make run

Real-world example: archiv.berlinics.de

While the basic Makefile structure works for simple sites, real-world archives require significant customization. The archiv.berlinics.de project demonstrates a production-grade Makefile with extensive transformations for a complex website (a Zenphoto photo gallery):

# fetch input
fetch:
 cd ./input &&\
 wget --wait 1 \
   --recursive \
   --page-requisites \
   --convert-links \
   --span-hosts \
   --no-clobber \
   --no-parent \
   --force-html \
   -e robots=off \
   --no-check-certificate \
   --keep-session-cookies \
   --adjust-extension \
   --tries=3 \
   --continue \
   --accept-regex='http:\/\/(img|dl|archiv|css)\.berlinics\.de\/.*|http:\/\/berlinics\.de\/.*' \
   --user-agent='Mozilla/5.0 (compatible; berlinics.de/SCAN; de-DE,de,en-US,en; v.0.1)' \
   --header 'Accept-encoding: identity' \
   --header 'Accept-language: de-DE' \
   http://berlinics.de/

# create output
create:
 rm -rf ./output/*
 cp -R ./input/* ./output

 # Add .html extension to pages without it
 find ./output/berlinics.de/page -type f -not -name '*.html' -exec bash -c 'mv "$$1" "$${1%.html}.html"' _ {} \;

 # Remove images that use zp-core gallery system
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<img[^>]*src="[^"]*zp-core\/c\.php[^"]*"[^>]*>//g' {} \;
 rm -rf ./output/berlinics.de/zp-core

 # Remove RSS links (broken in static copy)
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<a[^>]*href="?[^"]*\/rss\.php[^"]*"[^>]*>([^<]*)<\/a>/\1/g' {} \;

 # Convert absolute URLs to relative
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/href="https:\/\/berlinics.de/href="/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/src="https:\/\/berlinics.de/src="/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/href="http:\/\/berlinics.de/href="/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/src="http:\/\/berlinics.de/src="/g' {} \;

 # Handle CSS query parameters
 find ./output/berlinics.de -type f -name 'style.css?*' -exec bash -c 'mv "$$1" "$${1%\?*}"' _ {} \;
 find ./output/berlinics.de -type f -name 'style.css%3F*' -exec bash -c 'mv "$$1" "$${1%\%3F*}"' _ {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/style\.css\?[^"]*"/style\.css"/g' {} \;

 # Fix special cases
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<Liked/\&lt;Liked/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<Loved/\&lt;Loved/g' {} \;
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/href="([^"]*\.photo)"/href="\1.html"/g' {} \;

 # Add custom scripts
 cp -R ./addition/js/* ./output/berlinics.de/themes/berlinics/js
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<\/body>/<script src="\/themes\/berlinics\/js\/archiv.js"><\/script><\/body>/g' {} \;

 # Add custom fonts
 cp -R ./addition/fonts ./output/berlinics.de/themes/berlinics/fonts
 cp ./addition/css/font.css ./output/berlinics.de/themes/berlinics/css/font.css
 find ./output/berlinics.de -type f -name '*.html' -exec sed -i -E 's/<\/head>/<link rel="stylesheet" href="\/themes\/berlinics\/css\/font.css"><\/head>/g' {} \;
 find ./output/berlinics.de -type f -name 'style.css' -exec sed -i -E 's/"Trebuchet MS"/"PT Serif"/g' {} \;
 find ./output/berlinics.de -type f -name 'style.css' -exec sed -i -E 's/Georgia/"PT Serif"/g' {} \;

build:
 rm -rf ./build/*
 cp -R ./output/* ./build
 docker build --no-cache -t archiv.berlinics.de .

run:
 docker run -p 8080:80 archiv.berlinics.de

test:
 output_file="tmp-url-list.txt"; \
 wget --recursive --spider localhost:8080 2>&1 | grep '^--' | awk '{ print $$3 }' | sort | uniq > "$$output_file";\
 expected_file="url-list.txt"; \
 if diff "$$output_file" "$$expected_file"; then \
     echo "Test passed: Output matches expected content."; \
     rm tmp-url-list.txt; \
     exit 0; \
 else \
     echo "Test failed: Output does not match expected content."; \
     exit 1; \
 fi; \
 prettier --check . --ignore-path .prettierignore --config .prettierrc

format:
 prettier --write . --ignore-path .prettierignore --config .prettierrc

all:
 make create
 make format
 make build -B
 make run

Key transformations in this real-world example:

  1. Fetch phase: Uses multiple headers and language preferences for a German website
  2. Page handling: Adds .html extension to files without it (Zenphoto installation)
  3. Gallery removal: Strips out images from the Zenphoto gallery system (zp-core)
  4. RSS cleanup: Removes broken RSS feeds from the static copy
  5. URL conversion: Changes both https:// and http:// absolute URLs to relative paths
  6. CSS fixes: Removes query parameters from CSS files (common in dynamic sites)
  7. Character escaping: Escapes stray < characters as HTML entities (<Liked becomes &lt;Liked)
  8. Photo extensions: Adds .html to photo URLs
  9. Script injection: Adds custom archival scripts and fonts
  10. Font replacement: Replaces unavailable fonts with PT Serif

This demonstrates how real websites require custom post-processing to become proper static sites. Each website's archive needs specific handling based on the original site's technology.
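The rename used in the Makefile above relies on shell parameter expansion: ${1%.html} strips a trailing .html if present. A standalone sketch with hypothetical file names (in a Makefile the dollar signs are doubled, $$1, so make passes a literal $ through to the shell):

```shell
# Two demo files: one without an extension, one already ending in .html.
mkdir -p ./demo-pages
touch ./demo-pages/about ./demo-pages/index.html

# Rename only the files that do not already end in .html.
find ./demo-pages -type f -not -name '*.html' \
  -exec bash -c 'mv "$1" "${1%.html}.html"' _ {} \;

ls ./demo-pages   # both files now end in .html
rm -rf ./demo-pages
```

The -not -name '*.html' filter matters: without it, mv would try to rename index.html onto itself.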

Docker image and nginx configuration (Testing Only)

Important note: For archiv.berlinics.de, the Docker image is used for local testing purposes only. The actual production deployment uses GitHub Pages, not Docker. The Docker setup allows developers to verify the archive works correctly before pushing to GitHub Pages.

The Dockerfile for archiv.berlinics.de is minimal but effective for testing:

# Use the official Nginx image as the base image
FROM nginx:latest

# Copy your Nginx configuration file into the container
COPY nginx.conf /etc/nginx/nginx.conf

# Copy your static files (e.g., HTML, CSS, JavaScript) to the Nginx default serving directory
COPY ./build/berlinics.de /usr/share/nginx/html

# Expose the default Nginx port (usually 80)
EXPOSE 80

# Start Nginx when the container starts
CMD ["nginx", "-g", "daemon off;"]

The nginx configuration serves the static site with proper routing:

user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type text/html;
    access_log /var/log/nginx/access.log;

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;

    include /etc/nginx/conf.d/*.conf;

    server {
        listen 80 default_server;
        listen [::]:80 default_server;

        root /usr/share/nginx/html;
        index index.html;
        server_name _;

        location / {
            try_files $uri $uri/ =404;
        }
    }
}

How Docker build and test work together (local testing workflow):

  1. Build phase (make build):

    • Copies the processed ./output/* to ./build/
    • Builds Docker image: docker build --no-cache -t archiv.berlinics.de .
    • This creates a new image every time to ensure fresh dependencies
  2. Run phase (make run):

    • Starts container: docker run -p 8080:80 archiv.berlinics.de
    • Maps localhost:8080 to container's port 80
    • nginx serves the static site from /usr/share/nginx/html
    • Developer can verify the site works locally before deployment
  3. Test phase (make test):

    • Uses wget --recursive --spider localhost:8080 to crawl all links
    • Extracts all found URLs with grep and awk
    • Compares against the expected url-list.txt
    • Tests pass only if all crawled URLs match expected list
    • Also runs prettier code style check
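The URL-extraction step can be tried in isolation by feeding the pipeline a few sample lines in the format wget prints (the timestamps and URLs below are made up):

```shell
# Simulate three lines of wget --spider output; the URL is the third field.
printf '%s\n' \
  '--2023-12-30 12:00:00--  http://localhost:8080/' \
  '--2023-12-30 12:00:01--  http://localhost:8080/about.html' \
  '--2023-12-30 12:00:02--  http://localhost:8080/' |
  grep '^--' | awk '{ print $3 }' | sort | uniq
# http://localhost:8080/
# http://localhost:8080/about.html
```

The duplicate URL is collapsed by sort | uniq, so the resulting list is directly comparable with diff against url-list.txt.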

This Docker-based testing approach ensures:

  1. Completeness: every page and asset resolves locally, with no links leaking back to the original host
  2. Realism: the archive is exercised through a real web server rather than the local filesystem
  3. Repeatability: the crawl is compared against a known-good url-list.txt, so regressions are caught before deployment

After successful local testing with Docker, the ./build/ directory is deployed to GitHub Pages via GitHub Actions.

Save the website and Static Site

Once the static site is built and tested locally, it needs to be version controlled. This preserves the archive history and enables automated deployment:

Git and Git LFS

Important Note: Avoid Git LFS if possible. While some of the example projects use Git LFS, it should generally be avoided due to:

  1. Quotas and cost: GitHub's free LFS storage and bandwidth allowances are limited, and overages cost money
  2. Tooling friction: every contributor and CI job needs git-lfs installed and configured
  3. GitHub Pages limitations: Pages does not resolve LFS pointers itself, so workflows must check out the real files (hence lfs: true in the Actions workflow)

Instead of Git LFS, consider these alternatives:

  1. Optimize assets: Compress images, minify HTML/CSS/JS before committing
  2. Use .gitignore: Don't commit unnecessary files (build artifacts, cache files)
  3. Separate large assets: Host images on a CDN or separate image hosting service
  4. Regular Git is sufficient: For most static sites, standard Git works fine

If you absolutely must use Git LFS (very large media files):

git lfs track "*.jpg"       # Only for exceptionally large files
git lfs track "*.jpeg"

The example projects used Git LFS due to their specific circumstances (e.g., archiv.berlinics.de's extensive photo gallery), but modern static sites typically don't need it. GitHub's standard repository limits (recommended < 1GB) are sufficient for most website archives when properly optimized.
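Before reaching for Git LFS at all, it is worth checking whether the repository actually contains files large enough to need it. A sketch with made-up files and an arbitrary 5 MB threshold:

```shell
# Create a demo tree with one large file and one small file.
mkdir -p ./demo-repo
dd if=/dev/zero of=./demo-repo/big.jpg bs=1M count=6 2>/dev/null
dd if=/dev/zero of=./demo-repo/small.css bs=1K count=1 2>/dev/null

# List files over 5 MB; only these would even be candidates for LFS.
find ./demo-repo -type f -size +5M
# ./demo-repo/big.jpg

rm -rf ./demo-repo
```

If the list is empty or short, plain Git with optimized assets is almost certainly the simpler choice.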

GitHub repositories

The three example projects (archiv.gedit.net, archiv.berlinics.de, archiv.her0.be) each live in their own dedicated GitHub repository, keeping every archive's history, build tooling, and deployment configuration self-contained.

Publish the Static Site on GitHub Pages

GitHub Pages provides free hosting for static sites. There are multiple approaches; the following sections cover an automated GitHub Actions workflow, custom domains, and manual deployment.

GitHub Actions Workflow

Modern projects use automated GitHub Actions workflows for deployment:

name: Deploy static content to Pages

on:
  push:
    branches: [main]
    paths:
      - "build/[domain]"
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: "pages"
  cancel-in-progress: false

jobs:
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
        with:
          lfs: true # Remove this line if not using Git LFS
      - name: Setup Pages
        uses: actions/configure-pages@v3
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          path: "./build/[domain]"
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v1

Key features:

  1. Automatic trigger: deploys on pushes to main that touch the build directory, plus manual runs via workflow_dispatch
  2. Minimal permissions: only the contents, pages, and id-token scopes the deployment needs
  3. Concurrency control: the pages group prevents overlapping deployments
  4. LFS support: the checkout step pulls LFS objects (drop lfs: true if the repository does not use Git LFS)

Custom domains

GitHub Pages supports custom domains via CNAME file:

archiv.berlinics.de
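The CNAME file is just a one-line text file committed to the published directory. A sketch using a hypothetical domain and demo path (for the custom domain to actually resolve, the DNS record must also point at GitHub Pages):

```shell
# Write the custom domain into the directory that gets deployed (demo paths).
mkdir -p ./demo-build/example.com
printf 'archiv.example.com\n' > ./demo-build/example.com/CNAME

cat ./demo-build/example.com/CNAME   # archiv.example.com
rm -rf ./demo-build
```

Because the deploy workflow uploads the whole build directory, the CNAME file travels with every deployment and does not need to be re-set in the repository settings.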

Manual deployment alternative

For simpler workflows or testing before GitHub Actions:

git checkout gh-pages              # Switch to GitHub Pages branch
make all                           # Rebuild everything
rm -rf ./docs/* && cp -r ./build/[domain]/* docs/  # Update docs folder
git add docs/
git commit -m "feat(docs): update docs"
git push origin gh-pages           # Deploy to GitHub Pages

This approach requires manual execution but gives full control over when deployments happen.

Complete Workflow Summary

Archiving websites on GitHub Pages involves a complete workflow combining multiple tools:

Download → Process → Test → Commit → Deploy
  (wget)   (sed)    (Docker) (git)  (GitHub Actions)

Step 1 - Download (make fetch): wget recursively downloads the entire website

Step 2 - Process (make create): sed and bash scripts transform the content to static format

Step 3 - Test (make test): Local Docker container verifies all links work

Step 4 - Commit (git commit): Version control preserves the archive snapshot

Step 5 - Deploy (GitHub Actions or manual): Pushes to GitHub Pages for public access

Each step is atomic and can be run independently, making the workflow resilient and maintainable. The three real-world projects (archiv.gedit.net, archiv.berlinics.de, archiv.her0.be) demonstrate this approach in practice, ensuring important websites remain accessible indefinitely.

Why Archive Websites?

Archiving websites preserves digital history and personal work. The web is ephemeral—domains expire, hosting services shut down, and content disappears. By creating static archives on GitHub Pages, we ensure:

These three archive projects preserve websites that were important to me but no longer actively maintained. Rather than letting them disappear, they continue to exist as static snapshots—a digital time capsule of past work.

There's also an article about the journey of my abandoned blog if you're interested in the personal story behind these archives and the process of letting go while preserving history.

Common Issues and Troubleshooting

Problem: JavaScript-heavy sites don't work

Solution: wget only saves server-rendered HTML, so content built client-side will be missing from the archive. For such sites, consider a browser-based capture tool instead (a headless browser or a single-page snapshot tool).

Problem: Authentication/login-required pages

Solution: Export your session cookies and pass them to wget with --load-cookies (combined with --keep-session-cookies). Only archive accounts and content you own.

Problem: Rate limiting or blocked by server

Solution: Increase --wait, add --random-wait, lower --tries, and use an honest --user-agent string. If you control the server, whitelist your crawler instead of fighting the limits.

Problem: SSL certificate errors

Solution: --no-check-certificate skips validation, which is acceptable for a one-off archive of your own site; fixing the certificate is the better long-term answer.

Problem: Site is too large

Solution: Narrow the crawl with --accept-regex/--reject-regex or --exclude-directories, limit recursion depth with --level, and archive only the parts that matter.

Alternatives to consider:

  1. HTTrack: a dedicated website copier with a GUI and fine-grained filters
  2. ArchiveBox: a self-hosted archiving tool that captures multiple formats per page
  3. The Internet Archive's Wayback Machine: Save Page Now for one-off public snapshots

Feedback

Have thoughts or experiences you'd like to share? I'd love to hear from you! Whether you agree, disagree, or have a different perspective, your feedback is always welcome. Drop me an email and let's start a conversation.

<archive-websites-on-github@exiguus.blog>
