Import MediaWiki dumps into YaCy search engine
Introduction
Importing MediaWiki dumps can be a great way to create an initial index for a YaCy search engine.
This is especially useful if you want to create a search engine for a specific corpus; this post uses Wikipedia, Wikibooks, and Wikiquote as examples.
This can save you a lot of time and resources compared to crawling the web to create an index from scratch.
This post will show you how to import MediaWiki dumps into YaCy and how to automate the process to keep your search engine up to date with the latest dumps.
Download Wikimedia dumps
Wikimedia offers dumps of all Wikimedia projects, including Wikipedia, Wikibooks and Wikiquote at https://dumps.wikimedia.org/. The dumps are available in different formats, such as XML, SQL and JSON.
Wikimedia also provides a list of mirror sites to download the dumps from at https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps.
Wikimedia currently limits each IP address to a maximum of 3 concurrent downloads, so you may want to use a download manager (such as the script in the automation section below) or a mirror site to download the dumps.
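If you script the downloads yourself, xargs can enforce the 3-connection limit. A minimal sketch (the leading echo makes it a dry run that only prints the commands; remove it to actually download):

```shell
#!/usr/bin/env bash
# Dry-run sketch: fetch a list of dump URLs with at most 3 parallel wget
# processes, matching Wikimedia's per-IP connection limit.
# Remove the "echo" before wget to perform real downloads.
printf '%s\n' \
  "https://dumps.wikimedia.org/dewikiquote/latest/dewikiquote-latest-pages-articles.xml.bz2" \
  "https://dumps.wikimedia.org/enwikiquote/latest/enwikiquote-latest-pages-articles.xml.bz2" \
  | xargs -P3 -I{} echo wget --continue {}
```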
Import Wikimedia dumps
Available Wikimedia XML dumps [1] can be imported in gzip or bz2-compressed format via the YaCy menu under YaCy Packs & Import/Export -> MediaWiki Dump [2], using either a local file path (e.g. file:///opt/yacy_search_server/dumps/dewikibooks-latest-pages-articles.xml.bz2) or a URL (e.g. https://dumps.wikimedia.org/dewikibooks/latest/dewikibooks-latest-pages-articles.xml.bz2).
The import process can take some time depending on the size of the dump and the resources of your machine.
For example, on my machines the first import of the German Wikipedia dump (~2 million pages, ~7 GB bzipped [3]) took around 2 hours to decompress and import and another 4 hours to process with 4 vCores and 5 GB RAM, while the first import of the English Wikipedia dump (~6 million pages, ~24 GB bzipped [4]) took around 4 hours to decompress and import and another 10 hours to process with 8 vCores and 12 GB RAM.
YaCy extracts the archive into multiple files (e.g. enwiki-latest-pages-articles.xml.bz2.231.xml.gz) in the PACKS/load directory, then imports and processes each file and moves it to the PACKS/loaded directory.
You can monitor the extract and import progress using the YaCy logs, web interface under YaCy Packs & Import/Export -> MediaWiki Dump and the Crawler Monitor, or by checking the PACKS/load and PACKS/loaded directories.
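A quick shell check of those directories also works from the command line. A sketch, assuming a default data directory (set YACY_DATA to match your installation):

```shell
#!/usr/bin/env bash
# Count dump fragments still waiting in PACKS/load versus those already
# moved to PACKS/loaded. YACY_DATA is an assumed default path; override it
# for your installation.
YACY_DATA="${YACY_DATA:-/opt/yacy_search_server/DATA}"
pending=$(find "$YACY_DATA/PACKS/load" -type f 2>/dev/null | wc -l)
imported=$(find "$YACY_DATA/PACKS/loaded" -type f 2>/dev/null | wc -l)
printf 'pending: %d, imported: %d\n' "$pending" "$imported"
```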
Other interesting dumps to import are from Wikiquote, which contains quotes from famous people and can be used to create a search engine for quotes. The dumps can be found at https://dumps.wikimedia.org/dewikiquote/latest/ and https://dumps.wikimedia.org/enwikiquote/latest/.
Similarly, the dumps of Wikibooks contain books and manuals on various topics and can be used to create a search engine for educational materials. The dumps can be found at https://dumps.wikimedia.org/dewikibooks/latest/ and https://dumps.wikimedia.org/enwikibooks/latest/.
My personal recommendation is to start with smaller dumps like Wikibooks and Wikiquote to understand the import process before working with larger dumps. Download the dumps manually first and then import them into YaCy to avoid any download issues during the import process. It is also possible to automate the download and import process.
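When downloading manually, it is also worth testing the archive for truncation before importing, since an interrupted download produces an invalid bz2 stream. A small sketch (the path is just an example):

```shell
#!/usr/bin/env bash
# Test a downloaded bz2 dump for corruption or truncation before importing.
# The path below is an example; point it at your own download.
dump="/opt/yacy_search_server/dumps/dewikiquote-latest-pages-articles.xml.bz2"
if bzip2 -t "$dump" 2>/dev/null; then
  echo "archive OK: $dump"
else
  echo "archive missing, corrupt, or incomplete: $dump"
fi
```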
Automate the download and import
I currently import the latest dumps of the German and English Wikipedia, Wikibooks and Wikiquote weekly to keep my search engine up to date. You can automate the download and import process with a shell script and a cron job.
Cron job configuration
Schedule the download script to run every 5 days at 5:30 AM using a cron job:
# /etc/cron.d/wiki-dumps
30 5 */5 * * root nice -n10 /path/to/script/wiki-article-download.sh -f /path/to/script/wiki-download-urls.txt -d /path/to/yacy/dumps -j 3 -m 5 -l /var/log/wiki-download.log && /usr/bin/chown -R yacy:yacy /path/to/yacy/dumps
Key features:
- `*/5` runs every 5 days, giving you enough time between imports
- `nice -n10` reduces CPU priority to avoid overwhelming the system
- `-f` points to a URL file containing the list of dumps to download
- `-d` specifies the destination directory for downloads
- `-j 3` limits concurrent downloads to 3 (respects Wikimedia's 3-concurrent-downloads-per-IP limit)
- `-m 5` only re-downloads files older than 5 days (skips recent ones)
- `-l` logs all activity to a file for monitoring
- `chown` ensures YaCy can read the files (use `yacy:yacy` for native installs, or an appropriate UID:GID like `100:101` for Docker)
Download script
The wiki-article-download.sh script handles robust downloading with retries, parallel job management, and logging:
#!/usr/bin/env bash
# wiki-article-download.sh - robust downloader for Wikimedia XML dumps
# - follows Google Shell Style Guide recommendations
# - suitable for interactive and cron usage (lockfile, logging, retries, job limit)
#
# Examples:
# Normal use (interactive with URL file):
# /path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 4 -r 5 -l /var/log/wiki-download.log
#
# Cron (daily at 03:30, append logs; uses URL file):
# 30 3 * * * /path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 3 -l /var/log/wiki-download.log
set -o errexit
set -o nounset
set -o pipefail
readonly DEFAULT_JOBS=3
readonly DEFAULT_RETRIES=3
readonly DEFAULT_DEST="${PWD:-/tmp}"
readonly DEFAULT_TIMEOUT=30
URLS=(
"https://dumps.wikimedia.org/enwikibooks/latest/enwikibooks-latest-pages-articles.xml.bz2"
"https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
"https://dumps.wikimedia.org/enwikiquote/latest/enwikiquote-latest-pages-articles.xml.bz2"
)
# Mark URLS as used for static analyzers (used by run_jobs via nameref)
: "${URLS[*]:-}"
usage() {
cat <<EOF
Download Wikimedia XML dumps with retries, logging, and parallel jobs.
Usage: $(basename "$0") [options]
Options:
-d DIR Destination directory (default: ${DEFAULT_DEST})
-f FILE File containing URLs (one per line). Lines starting with # are ignored.
-j N Parallel jobs (default: ${DEFAULT_JOBS})
-r N Retries per file on failure (default: ${DEFAULT_RETRIES})
-t SEC wget timeout in seconds (default: ${DEFAULT_TIMEOUT})
-m DAYS Max age of existing files in days (default: no age check)
-l FILE Log file (appended). If omitted, logs go to stdout.
-n Dry-run: show actions but do not download
-h Show this help and exit
Examples:
Normal use (interactive with URL file):
/path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 4 -r 5 -l /var/log/wiki-download.log
Using a URL file:
/path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki
Cron (daily at 03:30, append logs; uses URL file):
30 3 * * * /path/to/scripts/wiki-article-download.sh -f /path/to/wiki-download-urls.txt -d /var/tmp/wiki -j 3 -m 5 -l /var/log/wiki-download.log
EOF
}
log() {
local msg="$1"
local ts
ts=$(date --iso-8601=seconds 2>/dev/null || date +"%Y-%m-%dT%H:%M:%S%z")
if [[ -n "${LOGFILE:-}" ]]; then
printf '%s %s\n' "$ts" "$msg" >> "$LOGFILE"
else
printf '%s %s\n' "$ts" "$msg"
fi
}
cleanup_lock() {
if [[ -n "${LOCKDIR:-}" && -d "$LOCKDIR" ]]; then
rmdir -- "$LOCKDIR" 2>/dev/null || true
fi
}
trap cleanup_lock EXIT INT TERM
run_jobs() {
local destdir=$1
local jobs=$2
local retries=$3
local timeout=$4
local maxage="${5:-}"
log "INFO: starting downloads to $destdir (jobs=$jobs, retries=$retries, timeout=$timeout, maxage=${maxage:-none})"
for url in "${URLS[@]}"; do
# wait until background job count is below limit
while (( $(jobs -rp 2>/dev/null | wc -l) >= jobs )); do
sleep 0.5
done
# run download in a background subshell (simple, robust)
(
fname=$(basename "${url}")
target="${destdir}/${fname}"
# check if file already exists and is recent enough
if [[ -f "$target" ]]; then
if [[ -n "$maxage" ]]; then
if find "$target" -mtime "-$maxage" -print -quit | grep -q .; then
log "INFO: skipping ${fname}, exists and is recent (<${maxage} days)"
exit 0
else
log "INFO: ${fname} exists but is older than ${maxage} days, will re-download"
if [[ "${DRY_RUN:-false}" == "false" ]]; then
rm -f -- "$target"
else
log "DRY-RUN: would remove old file ${target}"
exit 0
fi
fi
else
log "INFO: skipping ${fname}, already exists"
exit 0
fi
fi
if [[ "${DRY_RUN:-false}" == "true" ]]; then
log "DRY-RUN: would download ${url} -> ${target}"
exit 0
fi
log "INFO: downloading ${url}"
if wget --continue --tries="$retries" --timeout="$timeout" --waitretry=5 --retry-connrefused --no-verbose -O "$target" "$url" >> "${LOGFILE:-/dev/stdout}" 2>&1; then
log "INFO: completed ${fname}"
exit 0
else
log "WARN: download failed for ${fname}"
exit 1
fi
) &
pid="$!"
log "INFO: started PID ${pid} for ${url}"
done
# wait for all background jobs
wait || true
}
# Default values
DEST="${DEFAULT_DEST}"
JOBS=${DEFAULT_JOBS}
RETRIES=${DEFAULT_RETRIES}
LOGFILE=""
DRY_RUN=false
TIMEOUT=${DEFAULT_TIMEOUT}
URL_FILE=""
MAX_AGE_DAYS=
while getopts ":d:f:j:r:l:t:m:nh" opt; do
case "$opt" in
d) DEST="$OPTARG" ;;
f) URL_FILE="$OPTARG" ;;
j) JOBS="$OPTARG" ;;
r) RETRIES="$OPTARG" ;;
l) LOGFILE="$OPTARG" ;;
t) TIMEOUT="$OPTARG" ;;
m) MAX_AGE_DAYS="$OPTARG" ;;
n) DRY_RUN=true ;;
h) usage; exit 0 ;;
:) printf 'Missing argument for -%s\n' "$OPTARG"; usage; exit 2 ;;
*) usage; exit 2 ;;
esac
done
# Validate numeric options
is_positive_int() {
[[ "$1" =~ ^[1-9][0-9]*$ ]]
}
if ! is_positive_int "$JOBS"; then
log "ERROR: jobs (-j) must be a positive integer"
exit 2
fi
if ! is_positive_int "$RETRIES"; then
log "ERROR: retries (-r) must be a positive integer"
exit 2
fi
if ! is_positive_int "$TIMEOUT"; then
log "ERROR: timeout (-t) must be a positive integer"
exit 2
fi
if [[ -n "${MAX_AGE_DAYS:-}" ]]; then
if ! is_positive_int "$MAX_AGE_DAYS"; then
log "ERROR: max age (-m) must be a positive integer"
exit 2
fi
fi
# If a URL file was provided, load URLs from it (ignore blank lines and comments)
if [[ -n "${URL_FILE:-}" ]]; then
if [[ ! -r "$URL_FILE" ]]; then
log "ERROR: cannot read URL file $URL_FILE"
exit 2
fi
# read non-empty lines, strip comments, surrounding whitespace and CRs
mapfile -t URLS < <(
sed -e 's/#.*$//' "$URL_FILE" \
| tr -d '\r' \
| sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
| sed '/^$/d'
)
if [[ ${#URLS[@]} -eq 0 ]]; then
log "ERROR: no URLs found in $URL_FILE"
exit 2
fi
log "INFO: loaded ${#URLS[@]} URLs from $URL_FILE"
fi
mkdir -p "$DEST"
# Use an atomic lock directory to prevent overlapping cron runs
LOCKDIR="$DEST/.wiki-download.lock"
if mkdir "$LOCKDIR" 2>/dev/null; then
log "INFO: acquired lock $LOCKDIR"
else
log "INFO: lock exists, another instance is running. Exiting."
exit 0
fi
# Run downloads
log "INFO: starting downloads to $DEST (jobs=$JOBS, retries=$RETRIES)"
run_jobs "$DEST" "$JOBS" "$RETRIES" "$TIMEOUT" "$MAX_AGE_DAYS"
log "INFO: all downloads finished"
# Call cleanup explicitly so static analyzers see the function is reachable.
cleanup_lock || true
exit 0
URL file configuration
Create a file with the list of dumps to download. Lines starting with # are ignored:
# wiki-download-urls.txt
# German dumps
https://dumps.wikimedia.org/dewikibooks/latest/dewikibooks-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/dewikiquote/latest/dewikiquote-latest-pages-articles.xml.bz2
# English dumps
https://dumps.wikimedia.org/enwikibooks/latest/enwikibooks-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/enwikiquote/latest/enwikiquote-latest-pages-articles.xml.bz2
Automating the import in YaCy
After downloads complete, you need to configure YaCy to automatically import the dumps. In the YaCy web interface:
- Navigate to Administration → Automation [5]
- Add a scheduled task that imports from the local file path (the dump, e.g. file:///path/to/yacy/dumps/dewiki-latest-pages-articles.xml.bz2, must first be added manually to the YaCy import list under YaCy Packs & Import/Export → MediaWiki Dump)
- Set the schedule to run after your download cron job (e.g., daily or weekly)
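Since each dump has to be registered once in the import form, a small loop can print the file:// URLs ready to paste. A sketch (the dump directory is an assumption; match it to your cron setup):

```shell
#!/usr/bin/env bash
# Print a file:// URL for every downloaded dump, ready to paste into the
# MediaWiki Dump import form. DUMP_DIR is the assumed download directory
# from the cron example; adjust it to your setup.
DUMP_DIR="${DUMP_DIR:-/path/to/yacy/dumps}"
for f in "$DUMP_DIR"/*-pages-articles.xml.bz2; do
  [ -e "$f" ] && printf 'file://%s\n' "$f"
done
```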
This creates a complete hands-off workflow: downloads happen via cron, and imports happen via YaCy's automation.
How it works
- The cron job runs the script every 5 days at 5:30 AM
- The script reads URLs from the file and downloads them to the specified directory using up to 3 parallel jobs
- Files older than 5 days are re-downloaded (controlled by `-m 5`); recent files are skipped
- All activity is logged to `/var/log/wiki-download.log`
- After successful downloads, file ownership is changed so YaCy can access them
- The script uses a lock directory to prevent overlapping runs if a download takes longer than expected
- Each download is retried automatically, respecting Wikimedia's rate limits
- YaCy's automation feature periodically checks for new dumps and imports them automatically
This approach ensures your YaCy search engine stays updated with the latest Wikipedia, Wikibooks, and Wikiquote dumps without manual intervention.
About the YaCy search engine
YaCy [6] is a free and open-source search engine that allows you to create your own distributed search index. It is based on a peer-to-peer network and uses Solr to index content and provide search results.
YaCy is written in Java and can run on any platform that supports Java. It's a great alternative to commercial search engines and can be used for personal search, topic-specific indexing, or community-based search projects.
Footnotes
[1] https://dumps.wikimedia.org/dewiki/latest/: all XML gzip or bz2 files ending in -pages-articles.xml.bz2 or -pages-articles.xml.gz contain the content of the pages, but not their edit history, which makes them ideal for search engine indexing
[2] http://localhost:8090/IndexImportMediawiki_p.html: the MediaWiki index import page of your local YaCy instance
[3] https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2 (~7 GB bzipped, ~2 million pages)
[4] https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 (~24 GB bzipped, ~6 million pages)
[5] http://localhost:8090/Automation_p.html: the automation page of your local YaCy instance
[6] https://yacy.net/: the official website of the YaCy search engine, with more information about the project, the documentation, and the source code
Feedback
Have thoughts or experiences you'd like to share? I'd love to hear from you! Whether you agree, disagree, or have a different perspective, your feedback is always welcome. Drop me an email and let's start a conversation.