Import MediaWiki dumps into YaCy search engine
Introduction
Importing MediaWiki dumps can be a great way to create an initial index for a YaCy search engine.
This is especially useful if you want to create a search engine for a specific corpus; this post uses Wikipedia, Wikibooks, and Wikiquote as examples.
This can save you a lot of time and resources compared to crawling the web to create an index from scratch.
This post will show you how to import MediaWiki dumps into YaCy and how to automate the process to keep your search engine up to date with the latest dumps.
Download Wikimedia dumps
Wikimedia offers dumps of all Wikimedia projects, including Wikipedia, Wikibooks and Wikiquote at https://dumps.wikimedia.org/. The dumps are available in different formats, such as XML, SQL and JSON.
Wikimedia also provides a list of mirror sites to download the dumps from at https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps.
Wikimedia currently limits each IP address to a maximum of 3 concurrent downloads, so you may want to use a download manager (such as the script in the automation section below) or a mirror site to download the dumps.
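If you script the downloads yourself, xargs can enforce the 3-connection limit. A minimal sketch (the leading echo makes it a dry run that only prints the commands; remove it to actually download):

```shell
#!/usr/bin/env bash
# Dry-run sketch: fetch a list of dump URLs with at most 3 parallel wget
# processes, matching Wikimedia's per-IP connection limit.
# Remove the "echo" before wget to perform real downloads.
printf '%s\n' \
  "https://dumps.wikimedia.org/dewikiquote/latest/dewikiquote-latest-pages-articles.xml.bz2" \
  "https://dumps.wikimedia.org/enwikiquote/latest/enwikiquote-latest-pages-articles.xml.bz2" \
  | xargs -P3 -I{} echo wget --continue {}
```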
Import Wikimedia dumps
Available Wikimedia XML dumps [1] can be imported in gzip or bz2-compressed format via the YaCy menu under YaCy Packs & Import/Export -> MediaWiki Dump [2], using either a local file path (e.g. file:///opt/yacy_search_server/dumps/dewikibooks-latest-pages-articles.xml.bz2) or a URL (e.g. https://dumps.wikimedia.org/dewikibooks/latest/dewikibooks-latest-pages-articles.xml.bz2).
The import process can take some time depending on the size of the dump and the resources of your machine.
For example, on my machines the first import of the German Wikipedia dump (~2 million pages, ~7 GB bzipped [3]) took around 2 hours to decompress and import and another 4 hours to process with 4 vCores and 5 GB RAM, while the first import of the English Wikipedia dump (~6 million pages, ~24 GB bzipped [4]) took around 4 hours to decompress and import and another 10 hours to process with 8 vCores and 12 GB RAM.
YaCy extracts the archive into multiple files (e.g. enwiki-latest-pages-articles.xml.bz2.231.xml.gz) in the PACKS/load directory, then imports and processes each file and moves it to the PACKS/loaded directory.
You can monitor the extract and import progress using the YaCy logs, web interface under YaCy Packs & Import/Export -> MediaWiki Dump and the Crawler Monitor, or by checking the PACKS/load and PACKS/loaded directories.
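A quick shell check of those directories also works from the command line. A sketch, assuming a default data directory (set YACY_DATA to match your installation):

```shell
#!/usr/bin/env bash
# Count dump fragments still waiting in PACKS/load versus those already
# moved to PACKS/loaded. YACY_DATA is an assumed default path; override it
# for your installation.
YACY_DATA="${YACY_DATA:-/opt/yacy_search_server/DATA}"
pending=$(find "$YACY_DATA/PACKS/load" -type f 2>/dev/null | wc -l)
imported=$(find "$YACY_DATA/PACKS/loaded" -type f 2>/dev/null | wc -l)
printf 'pending: %d, imported: %d\n' "$pending" "$imported"
```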
Other interesting dumps to import are from Wikiquote, which contains quotes from famous people and can be used to create a search engine for quotes. The dumps can be found at https://dumps.wikimedia.org/dewikiquote/latest/ and https://dumps.wikimedia.org/enwikiquote/latest/.
Similarly, the dumps of Wikibooks contain books and manuals on various topics and can be used to create a search engine for educational materials. The dumps can be found at https://dumps.wikimedia.org/dewikibooks/latest/ and https://dumps.wikimedia.org/enwikibooks/latest/.
My personal recommendation is to start with smaller dumps like Wikibooks and Wikiquote to understand the import process before working with larger dumps. Download the dumps manually first and then import them into YaCy to avoid any download issues during the import process. It is also possible to automate the download and import process.
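When downloading manually, it is also worth testing the archive for truncation before importing, since an interrupted download produces an invalid bz2 stream. A small sketch (the path is just an example):

```shell
#!/usr/bin/env bash
# Test a downloaded bz2 dump for corruption or truncation before importing.
# The path below is an example; point it at your own download.
dump="/opt/yacy_search_server/dumps/dewikiquote-latest-pages-articles.xml.bz2"
if bzip2 -t "$dump" 2>/dev/null; then
  echo "archive OK: $dump"
else
  echo "archive missing, corrupt, or incomplete: $dump"
fi
```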
Automate the download and import
I currently import the latest dumps of the German and English Wikipedia, Wikibooks and Wikiquote weekly to keep my search engine up to date. You can automate the download and import process with a shell script and a cron job.
Cron job configuration
Schedule the download script to run every 5 days at 5:30 AM using a cron job:
# /etc/cron.d/wiki-dumps
30 5 */5 * * root nice -n10 /path/to/script/wiki-article-download.sh -f /path/to/script/wiki-download-urls.txt -d /path/to/yacy/dumps -j 3 -m 5 -l /var/log/wiki-download.log && /usr/bin/chown -R yacy:yacy /path/to/yacy/dumps
Key features:
- `*/5` runs every 5 days, giving you enough time between imports
- `nice -n10` reduces CPU priority to avoid overwhelming the system
- `-f` points to a URL file containing the list of dumps to download
- `-d` specifies the destination directory for downloads
- `-j 3` limits concurrent downloads to 3 (respects Wikimedia's 3-concurrent-downloads-per-IP limit)
- `-m 5` only re-downloads files older than 5 days (skips recent ones)
- `-l` logs all activity to a file for monitoring
- `chown` ensures YaCy can read the files (use `yacy:yacy` for native installs, or an appropriate UID:GID like `100:101` for Docker)
Download script
The wiki-article-download.sh script handles robust downloading with retries, parallel job management, and logging:
#!/usr/bin/env bash
# wiki-article-download.sh - robust downloader for Wikimedia XML dumps
# - follows Google Shell Style Guide recommendations
# - suitable for interactive and cron usage (lockfile, logging, retries, job limit)
#
# Examples:
# Normal use (interactive with URL file):
# /path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 4 -r 5 -l /var/log/wiki-download.log
#
# Cron (daily at 03:30, append logs; uses URL file):
# 30 3 * * * /path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 3 -l /var/log/wiki-download.log
set -o errexit
set -o nounset
set -o pipefail
readonly DEFAULT_JOBS=3
readonly DEFAULT_RETRIES=3
readonly DEFAULT_DEST="${PWD:-/tmp}"
readonly DEFAULT_TIMEOUT=30
URLS=(
"https://dumps.wikimedia.org/enwikibooks/latest/enwikibooks-latest-pages-articles.xml.bz2"
"https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
"https://dumps.wikimedia.org/enwikiquote/latest/enwikiquote-latest-pages-articles.xml.bz2"
)
# Mark URLS as used for static analyzers (used by run_jobs via nameref)
: "${URLS[*]:-}"
usage() {
cat <<EOF
Download Wikimedia XML dumps with retries, logging, and parallel jobs.
Usage: $(basename "$0") [options]
Options:
-d DIR Destination directory (default: ${DEFAULT_DEST})
-f FILE File containing URLs (one per line). Lines starting with # are ignored.
-j N Parallel jobs (default: ${DEFAULT_JOBS})
-r N Retries per file on failure (default: ${DEFAULT_RETRIES})
-t SEC wget timeout in seconds (default: ${DEFAULT_TIMEOUT})
-m DAYS Max age of existing files in days (default: no age check)
-l FILE Log file (appended). If omitted, logs go to stdout.
-n Dry-run: show actions but do not download
-h Show this help and exit
Examples:
Normal use (interactive with URL file):
/path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 4 -r 5 -l /var/log/wiki-download.log
Using a URL file:
/path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki
Cron (daily at 03:30, append logs; uses URL file):
30 3 * * * /path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 3 -m 5 -l /var/log/wiki-download.log
EOF
}
log() {
local msg="$1"
local ts
ts=$(date --iso-8601=seconds 2>/dev/null || date +"%Y-%m-%dT%H:%M:%S%z")
if [[ -n "${LOGFILE:-}" ]]; then
printf '%s %s\n' "$ts" "$msg" >> "$LOGFILE"
else
printf '%s %s\n' "$ts" "$msg"
fi
}
cleanup_lock() {
if [[ -n "${LOCKDIR:-}" && -d "$LOCKDIR" ]]; then
rmdir -- "$LOCKDIR" 2>/dev/null || true
fi
}
trap cleanup_lock EXIT INT TERM
run_jobs() {
local destdir=$1
local jobs=$2
local retries=$3
local timeout=$4
local maxage="${5:-}"
log "INFO: starting downloads to $destdir (jobs=$jobs, retries=$retries, timeout=$timeout, maxage=${maxage:-none})"
for url in "${URLS[@]}"; do
# wait until background job count is below limit
while (( $(jobs -rp 2>/dev/null | wc -l) >= jobs )); do
sleep 0.5
done
# run download in a background subshell (simple, robust)
(
fname=$(basename "${url}")
target="${destdir}/${fname}"
# check if file already exists and is recent enough
if [[ -f "$target" ]]; then
if [[ -n "$maxage" ]]; then
if find "$target" -mtime "-$maxage" -print -quit | grep -q .; then
log "INFO: skipping ${fname}, exists and is recent (<${maxage} days)"
exit 0
else
log "INFO: ${fname} exists but is older than ${maxage} days, will re-download"
if [[ "${DRY_RUN:-false}" == "false" ]]; then
rm -f -- "$target"
else
log "DRY-RUN: would remove old file ${target}"
exit 0
fi
fi
else
log "INFO: skipping ${fname}, already exists"
exit 0
fi
fi
if [[ "${DRY_RUN:-false}" == "true" ]]; then
log "DRY-RUN: would download ${url} -> ${target}"
exit 0
fi
log "INFO: downloading ${url}"
if wget --continue --tries="$retries" --timeout="$timeout" --waitretry=5 --retry-connrefused --no-verbose -O "$target" "$url" >> "${LOGFILE:-/dev/stdout}" 2>&1; then
log "INFO: completed ${fname}"
exit 0
else
log "WARN: download failed for ${fname}"
exit 1
fi
) &
pid="$!"
log "INFO: started PID ${pid} for ${url}"
done
# wait for all background jobs
wait || true
}
# Default values
DEST="${DEFAULT_DEST}"
JOBS=${DEFAULT_JOBS}
RETRIES=${DEFAULT_RETRIES}
LOGFILE=""
DRY_RUN=false
TIMEOUT=${DEFAULT_TIMEOUT}
URL_FILE=""
MAX_AGE_DAYS=
while getopts ":d:f:j:r:l:t:m:nh" opt; do
case "$opt" in
d) DEST="$OPTARG" ;;
f) URL_FILE="$OPTARG" ;;
j) JOBS="$OPTARG" ;;
r) RETRIES="$OPTARG" ;;
l) LOGFILE="$OPTARG" ;;
t) TIMEOUT="$OPTARG" ;;
m) MAX_AGE_DAYS="$OPTARG" ;;
n) DRY_RUN=true ;;
h) usage; exit 0 ;;
:) printf 'Missing argument for -%s\n' "$OPTARG"; usage; exit 2 ;;
*) usage; exit 2 ;;
esac
done
# Validate numeric options
is_positive_int() {
[[ "$1" =~ ^[1-9][0-9]*$ ]]
}
if ! is_positive_int "$JOBS"; then
log "ERROR: jobs (-j) must be a positive integer"
exit 2
fi
if ! is_positive_int "$RETRIES"; then
log "ERROR: retries (-r) must be a positive integer"
exit 2
fi
if ! is_positive_int "$TIMEOUT"; then
log "ERROR: timeout (-t) must be a positive integer"
exit 2
fi
if [[ -n "${MAX_AGE_DAYS:-}" ]]; then
if ! is_positive_int "$MAX_AGE_DAYS"; then
log "ERROR: max age (-m) must be a positive integer"
exit 2
fi
fi
# If a URL file was provided, load URLs from it (ignore blank lines and comments)
if [[ -n "${URL_FILE:-}" ]]; then
if [[ ! -r "$URL_FILE" ]]; then
log "ERROR: cannot read URL file $URL_FILE"
exit 2
fi
# read non-empty lines, strip comments, surrounding whitespace and CRs
mapfile -t URLS < <(
sed -e 's/#.*$//' "$URL_FILE" \
| tr -d '\r' \
| sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
| sed '/^$/d'
)
if [[ ${#URLS[@]} -eq 0 ]]; then
log "ERROR: no URLs found in $URL_FILE"
exit 2
fi
log "INFO: loaded ${#URLS[@]} URLs from $URL_FILE"
fi
mkdir -p "$DEST"
# Use an atomic lock directory to prevent overlapping cron runs
LOCKDIR="$DEST/.wiki-download.lock"
if mkdir "$LOCKDIR" 2>/dev/null; then
log "INFO: acquired lock $LOCKDIR"
else
log "INFO: lock exists, another instance is running. Exiting."
exit 0
fi
# Run downloads
log "INFO: starting downloads to $DEST (jobs=$JOBS, retries=$RETRIES)"
run_jobs "$DEST" "$JOBS" "$RETRIES" "$TIMEOUT" "$MAX_AGE_DAYS"
log "INFO: all downloads finished"
# Call cleanup explicitly so static analyzers see the function is reachable.
cleanup_lock || true
exit 0
URL file configuration
Create a file with the list of dumps to download. Lines starting with # are ignored:
# wiki-download-urls.txt
# German dumps
https://dumps.wikimedia.org/dewikibooks/latest/dewikibooks-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/dewikiquote/latest/dewikiquote-latest-pages-articles.xml.bz2
# English dumps
https://dumps.wikimedia.org/enwikibooks/latest/enwikibooks-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/enwikiquote/latest/enwikiquote-latest-pages-articles.xml.bz2
Automating the import in YaCy
After downloads complete, you need to configure YaCy to automatically import the dumps. In the YaCy web interface:
- Navigate to Administration → Automation [5]
- Add a scheduled task that imports from the local file path (the dump, e.g. file:///path/to/yacy/dumps/dewiki-latest-pages-articles.xml.bz2, must first be added manually to the YaCy import list under YaCy Packs & Import/Export → MediaWiki Dump)
- Set the schedule to run after your download cron job (e.g., daily or weekly)
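Since each dump has to be registered once in the import form, a small loop can print the file:// URLs ready to paste. A sketch (the dump directory is an assumption; match it to your cron setup):

```shell
#!/usr/bin/env bash
# Print a file:// URL for every downloaded dump, ready to paste into the
# MediaWiki Dump import form. DUMP_DIR is the assumed download directory
# from the cron example; adjust it to your setup.
DUMP_DIR="${DUMP_DIR:-/path/to/yacy/dumps}"
for f in "$DUMP_DIR"/*-pages-articles.xml.bz2; do
  [ -e "$f" ] && printf 'file://%s\n' "$f"
done
```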
This creates a complete hands-off workflow: downloads happen via cron, and imports happen via YaCy's automation.
How it works
- The cron job runs the script every 5 days at 5:30 AM
- The script reads URLs from the file and downloads them to the specified directory using up to 3 parallel jobs
- Files older than 5 days are re-downloaded (controlled by `-m 5`); recent files are skipped
- All activity is logged to `/var/log/wiki-download.log`
- After successful downloads, file ownership is changed so YaCy can access them
- The script uses a lock directory to prevent overlapping runs if a download takes longer than expected
- Each download is retried automatically, respecting Wikimedia's rate limits
- YaCy's automation feature periodically checks for new dumps and imports them automatically
This approach ensures your YaCy search engine stays updated with the latest Wikipedia, Wikibooks, and Wikiquote dumps without manual intervention.
About the YaCy search engine
YaCy [6] is a free and open-source search engine that allows you to create your own distributed search index. It is based on a peer-to-peer network and uses Solr to index content and provide search results.
YaCy is written in Java and can run on any platform that supports Java. It's a great alternative to commercial search engines and can be used for personal search, topic-specific indexing, or community-based search projects.
Footnotes
[1] https://dumps.wikimedia.org/dewiki/latest/: all XML gzip or bz2 files ending in -pages-articles.xml.bz2 or -pages-articles.xml.gz contain the content of the pages, but not their edit history, which makes them ideal for search engine indexing
[2] http://localhost:8090/IndexImportMediawiki_p.html: the MediaWiki index import page of your local YaCy instance
[3] https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2 (~7 GB bzipped, ~2 million pages)
[4] https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 (~24 GB bzipped, ~6 million pages)
[5] http://localhost:8090/Automation_p.html: the automation page of your local YaCy instance
[6] https://yacy.net/: the official website of the YaCy search engine, with more information about the project, the documentation, and the source code
Feedback
Have thoughts or experiences you'd like to share? I'd love to hear from you! Whether you agree, disagree, or have a different perspective, your feedback is always welcome. Drop me an email and let's start a conversation.