exiguus.blog

Personal blog

Create blacklists for YaCy search engine with Hagezi blocklists

Created: 2025-10-02

Introduction

Writing your own blacklists for the YaCy search engine can be challenging, especially when you want to block a large number of domains or URLs. The blacklist syntax is not very user-friendly, and it can be difficult to keep track of all the entries.

Writing blacklists for YaCy

Basically, YaCy blacklists are text files that contain a list of domains or URLs that you want to block from being crawled, indexed, searched or shared. Each line in the blacklist file represents a single entry, typically a domain or URL pattern.

The syntax used for the blacklists is listed on the blacklist page of your own YaCy instance (see: http://localhost:8090/Blacklist_p.html).

The right * after the / can be replaced by a regular expression.

Based on my testing, the most common use case is to block entire domains with the syntax domain.net/* or *.domain.net/*. This will block all URLs that start with http://domain.net/ or http://www.domain.net/ and all subdomains of domain.net.

Unfortunately, apart from the leading *. shown above, it is not possible to apply wildcards or a regular expression to the domain or subdomain part of the URL. For example, something like (www\.)?domain\.net/*, (newsletter|blog)\.domain\.net/* or *.*(share|link|short|img)*.*/* is not possible. You would have to write two entries for the first example and two entries for the second example. The third example would require a lot of entries and would be very difficult to maintain.
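The entries for the second example can at least be generated instead of typed by hand. The following is a minimal sketch; expand_entries is a hypothetical helper name and not part of YaCy:

```shell
# Hypothetical helper: print one YaCy blacklist entry per host prefix.
# Usage: expand_entries DOMAIN PREFIX...
expand_entries() {
    local domain="$1"
    local prefix
    shift
    for prefix in "$@"; do
        printf '%s.%s/*\n' "${prefix}" "${domain}"
    done
}

# Stands in for the unsupported pattern (newsletter|blog)\.domain\.net/*
expand_entries domain.net newsletter blog
```

The call prints newsletter.domain.net/* and blog.domain.net/*, one entry per line, ready to be pasted into a blacklist file.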

Hagezi blacklists

Hagezi is a great resource for creating blacklists for YaCy. It provides DNS blocklists, and you might be familiar with similar blocklists used in AdGuard Home or Pi-hole. Hagezi provides a large number of blocklists for different categories, such as ads, trackers, malware, phishing and more. You can use these blocklists to create your own blacklists for YaCy.

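The interesting blacklists for YaCy are the ones that appear in the URL file further down: ultimate, doh-vpn-proxy-bypass, dyndns, hoster, urlshortener, anti.piracy, gambling, social and nsfw.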

Both wildcard and onlydomains versions are available and serve different use cases. For example, an onlydomains entry will block domain.tld but not sub.domain.tld, while the wildcard version will block sub.domain.tld, which is what we need for YaCy with the syntax *.domain.tld/*.
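If you would rather derive both entry forms from a single list, the conversion can be sketched as follows. This is only an illustration: domains_to_yacy_entries is a hypothetical helper, and it assumes that you want the bare domain and all of its subdomains blocked.

```shell
# Hypothetical helper: read domains (one per line) on stdin and print the two
# YaCy entries that cover the domain itself and all of its subdomains.
# A leading "*." (wildcard format) is stripped first, so both formats work.
domains_to_yacy_entries() {
    awk '
        $0 == "" || /^#/ { next }   # skip blank lines and comments
        {
            sub(/^\*\./, "")        # "*.domain.tld" becomes "domain.tld"
            print $0 "/*"           # entry for the domain itself
            print "*." $0 "/*"      # entry for every subdomain
        }
    '
}

# Example (needs network access):
# curl --fail --location --silent --show-error \
#     'https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/ultimate-onlydomains.txt' \
#     | domains_to_yacy_entries > blacklist.txt
```

With this approach one download per category is enough, at the cost of dropping the comment headers of the source list.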

Create a blacklist for YaCy with Hagezi

To create a blacklist for YaCy with Hagezi, you can follow these steps:

  1. Choose the blocklists that you want to use from the Hagezi blocklists list above.
  2. Download the blocklists and save them as text files on your local machine.
  3. Create a new blacklist file for YaCy and add the entries from the Hagezi blocklists to the file. You have to add /* at the end of each entry to block all URLs that start with the domain. For example, if you want to block example.com, you would add example.com/* to the blacklist file. If you want to block all subdomains of example.com, you would add *.example.com/* to the blacklist file.
  4. Save the blacklist file and import it into YaCy as a plain text file.
  5. Test the blacklist in YaCy and make sure that the entries are blocked as expected.
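Before importing the file, a quick sanity check can catch lines where the /* suffix is missing. The sketch below assumes a generated list in which every entry ends in /* and comments start with #; check_blacklist is a hypothetical helper name.

```shell
# Hypothetical helper: report lines that are neither blank, comments, nor
# entries ending in "/*". Returns non-zero if any such line is found.
check_blacklist() {
    local file="$1"
    local bad
    # grep -v with several -e patterns keeps only lines matching none of them.
    bad="$(grep -v -e '^$' -e '^#' -e '/\*$' -- "${file}" || true)"
    if [ -n "${bad}" ]; then
        printf 'Malformed entries in %s:\n%s\n' "${file}" "${bad}" >&2
        return 1
    fi
    return 0
}

# Example: check_blacklist blacklist.txt && echo "blacklist.txt looks fine"
```

This only checks the shape of the lines. Whether YaCy really blocks the entries still has to be tested in YaCy itself.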

To automate the process of creating and importing the blacklist, you can use a script that downloads the Hagezi blocklists and formats them for YaCy (add the /* at the end of each entry) and then overwrites the current list in data/LISTS/[your_blacklist_name].black.

Automate the update of the blacklist

To automate the update of the blacklist, you can use a cron job that runs the script on a regular basis (e.g. weekly or monthly) to download the latest Hagezi blocklists and update the blacklist for YaCy.

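The cron job looks like this (system crontab format with a user field; this example runs every day at 02:30, so adjust the schedule if you prefer weekly or monthly updates):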

30 2 * * * root  nice -n10 /usr/share/scripts/blacklist.sh --url-file /path/to/urls-blacklist.txt --output /path/to/yacy/data/LISTS/hagezi.blocklist.black && /usr/bin/chown -R yacy:yacy /path/to/yacy/data/LISTS/hagezi.blocklist.black

If you create a blacklist in YaCy with the name hagezi.blocklist in Filter & Blacklists, the list is automatically saved in the data/LISTS/hagezi.blocklist.black file and you can overwrite this file with the new one created by the script.
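Note that blacklist.sh empties its output file before it starts downloading, so a failed run can leave YaCy with an empty or partial list. One way to avoid this is to write to a temporary file first and only replace the live file on success. The wrapper below is a sketch under that assumption; update_blacklist_atomically is a hypothetical name, and it expects a generator command that accepts --output FILE, as blacklist.sh does.

```shell
# Hypothetical wrapper: run a generator command that takes "--output FILE",
# let it write to a temporary file next to the target, and only replace the
# target when the generator succeeded and produced a non-empty file.
update_blacklist_atomically() {
    local target="$1"
    local tmp
    shift
    tmp="$(mktemp "${target}.XXXXXX")" || return 1
    if "$@" --output "${tmp}" && [ -s "${tmp}" ]; then
        # Same directory, so the rename replaces the old list in one step.
        mv -f -- "${tmp}" "${target}"
    else
        rm -f -- "${tmp}"
        echo "Update failed, keeping existing ${target}" >&2
        return 1
    fi
}

# Example call (paths are placeholders, as in the cron job):
# update_blacklist_atomically /path/to/yacy/data/LISTS/hagezi.blocklist.black \
#     /usr/share/scripts/blacklist.sh --url-file /path/to/urls-blacklist.txt
```

mktemp creates the file with restrictive permissions, so the chown from the cron job is still needed afterwards.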

The script that downloads the Hagezi blocklists and formats them for YaCy can look like this:

#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

readonly DEFAULT_URL='https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/ultimate.txt'
readonly DEFAULT_OUTPUT_FILE='blacklist.txt'

usage() {
    cat <<'EOF'
Usage: blacklist.sh [OPTIONS]

Downloads one or more blocklist URLs, merges them, and appends '/*' to non-empty,
non-comment lines.

Options:
    -u, --url URL         Add a URL to download (can be passed multiple times).
    -f, --url-file FILE   File with URLs to download, one link per line.
    -o, --output FILE     Output file (default: blacklist.txt).
    -h, --help            Show this help.

If neither --url nor --url-file is provided, the default URL is used.

Examples:
    blacklist.sh
    blacklist.sh --url-file urls.txt
    blacklist.sh --url https://example.com/list1.txt --url https://example.com/list2.txt
    blacklist.sh --url-file urls.txt --output blacklists/blacklist.txt
EOF
}

require_command() {
    local cmd="$1"
    if ! command -v "${cmd}" >/dev/null 2>&1; then
        echo "Error: Required command not found: ${cmd}" >&2
        exit 1
    fi
}

read_urls_from_file() {
    local file_path="$1"
    local -n urls_ref="$2"
    local line=''

    if [[ ! -f "${file_path}" ]]; then
        echo "Error: URL file does not exist: ${file_path}" >&2
        exit 1
    fi

    while IFS= read -r line || [[ -n "${line}" ]]; do
        if [[ -z "${line//[[:space:]]/}" ]]; then
            continue
        fi
        if [[ "${line}" =~ ^[[:space:]]*# ]]; then
            continue
        fi
        urls_ref+=("${line}")
    done < "${file_path}"
}

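download_urls_to_file() {
    local output_file="$1"
    shift
    local urls=("$@")
    local url=''

    # Start from an empty output file; every list is appended below.
    : > "${output_file}"

    for url in "${urls[@]}"; do
        echo "Downloading blocklist from ${url}..."

        # Keep blank lines and comments as they are and append '/*' to every
        # other line. With pipefail set, a failed curl fails the whole pipeline.
        if ! curl --fail --location --silent --show-error "${url}" | \
            awk '{
                if ($0 == "" || $0 ~ /^#/) {
                    print
                } else {
                    print $0 "/*"
                }
            }' >> "${output_file}"; then
            echo "Error: Failed to download URL: ${url}" >&2
            exit 1
        fi

        # Separate the lists with a blank line.
        echo >> "${output_file}"
    done
}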

main() {
    require_command curl

    local output_file="${DEFAULT_OUTPUT_FILE}"
    local url_file=''
    local -a urls=()

    while (($# > 0)); do
        case "$1" in
            -u|--url)
                if (($# < 2)); then
                    echo 'Error: --url requires a value.' >&2
                    usage
                    exit 1
                fi
                urls+=("$2")
                shift 2
                ;;
            -f|--url-file)
                if (($# < 2)); then
                    echo 'Error: --url-file requires a value.' >&2
                    usage
                    exit 1
                fi
                url_file="$2"
                shift 2
                ;;
            -o|--output)
                if (($# < 2)); then
                    echo 'Error: --output requires a value.' >&2
                    usage
                    exit 1
                fi
                output_file="$2"
                shift 2
                ;;
            -h|--help)
                usage
                exit 0
                ;;
            *)
                echo "Error: Unknown argument: $1" >&2
                usage
                exit 1
                ;;
        esac
    done

    if [[ -n "${url_file}" ]]; then
        read_urls_from_file "${url_file}" urls
    fi

    if ((${#urls[@]} == 0)); then
        urls=("${DEFAULT_URL}")
    fi

    echo "Downloading and processing ${#urls[@]} blocklist(s)..."
    download_urls_to_file "${output_file}" "${urls[@]}"

    if [[ ! -s "${output_file}" ]]; then
        echo "Error: Downloaded content is empty: ${output_file}" >&2
        exit 1
    fi

    echo "Done! Processed blocklist saved to ${output_file}"
}

main "$@"

The urls-blacklist.txt file can look like this:

https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/ultimate.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/doh-vpn-proxy-bypass.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/dyndns.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/hoster.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/urlshortener.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/anti.piracy.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/gambling.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/social.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/nsfw.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/ultimate-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/doh-vpn-proxy-bypass-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/dyndns-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/hoster-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/urlshortener-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/anti.piracy-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/gambling-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/social-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/nsfw-onlydomains.txt

With this setup, you will add roughly 1.14 million entries (1141642 at the time of writing) to your YaCy blacklist, covering the most common categories of domains that you want to block from being crawled, indexed, searched or shared.
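The exact number changes with every Hagezi update, and merged lists may overlap. To check the numbers for your own file, something like the following sketch can be used; blacklist_stats is a hypothetical helper name.

```shell
# Hypothetical helper: count entries (non-empty, non-comment lines) in a
# blacklist file and report how many of them are duplicates.
blacklist_stats() {
    local file="$1"
    local total unique
    total="$(awk '$0 != "" && $0 !~ /^#/' "${file}" | wc -l | tr -d ' ')"
    unique="$(awk '$0 != "" && $0 !~ /^#/' "${file}" | sort -u | wc -l | tr -d ' ')"
    printf 'entries=%s unique=%s duplicates=%s\n' \
        "${total}" "${unique}" "$((total - unique))"
}

# Example: blacklist_stats /path/to/yacy/data/LISTS/hagezi.blocklist.black
```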

About the YaCy search engine

YaCy is a free and open-source search engine that allows you to create your own distributed search index. It is based on a peer-to-peer network and uses Solr to index content and provide search results.

YaCy is written in Java and can run on any platform that supports Java. It's a great alternative to commercial search engines and can be used for personal search, topic-specific indexing, or community-based search projects.


