Create blacklists for YaCy search engine with Hagezi blocklists
Introduction
Writing your own blacklists for the YaCy [1] search engine can be challenging, especially when you want to block a large number of domains or URLs. The blacklist syntax is not very user-friendly, and it can be difficult to keep track of all the entries.
Writing blacklists for YaCy
Basically, YaCy blacklists are text files that contain a list of domains or URLs that you want to block from being crawled, indexed, searched or shared. Each line in the blacklist file represents a single entry, typically a domain or URL pattern.
The following syntax is used for the blacklists (see: http://localhost:8090/Blacklist_p.html):
- domain.net/fullpath
- domain.net/*
- *.domain.net/*
- *.sub.domain.net/*
- sub.domain.*/*
- domain.*/*
- domain and path regular expressions separated by a / (slow)
The right * after the / can be replaced by a regular expression.
Based on my testing, the most common use case is to block entire domains with the syntax domain.net/* or *.domain.net/*. The first form blocks all URLs that start with http://domain.net/; the second blocks http://www.domain.net/ and all other subdomains of domain.net.
Unfortunately, beyond the leading *. and trailing .* wildcard forms, it does not seem possible to freely apply wildcards or a regular expression to the domain or subdomain part of the URL.
For example, something like (www\.)?domain\.net/*, (newsletter|blog)\.domain\.net/* or *.*(share|link|short|img)*.*/* is not possible. You would have to write two entries for the first example and two entries for the second example. The third example would require a lot of entries and would be very difficult to maintain.
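Because each alternative has to be written out as its own entry, a small helper can generate them. The following is a minimal sketch: expand_entries is a hypothetical function name, not part of YaCy, and an empty label is assumed to stand for the bare domain.

```shell
# Hypothetical helper, not part of YaCy: print one blacklist entry per host label.
# Usage: expand_entries DOMAIN LABEL...
# An empty LABEL ("") stands for the bare domain itself.
expand_entries() {
  local domain="$1" label
  shift
  for label in "$@"; do
    if [ -z "${label}" ]; then
      printf '%s/*\n' "${domain}"
    else
      printf '%s.%s/*\n' "${label}" "${domain}"
    fi
  done
}

# Stands in for (www\.)?domain\.net/* with two plain entries.
expand_entries domain.net "" www
# Stands in for (newsletter|blog)\.domain\.net/* with two plain entries.
expand_entries domain.net newsletter blog
```

The two calls print domain.net/*, www.domain.net/*, newsletter.domain.net/* and blog.domain.net/*, one per line, ready to be pasted into a blacklist file.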
Hagezi blacklists
Hagezi [2] is a great resource for creating blacklists for YaCy. It provides DNS blocklists, and you might be familiar with similar blocklists used in AdGuard Home [3] or Pi-hole [4]. Hagezi provides a large number of blocklists for different categories, such as ads, trackers, malware, phishing and more. You can use these blocklists to create your own blacklists for YaCy.
The interesting blacklists for YaCy are:
- Multi ULTIMATE as a Base
- DoH/VPN/TOR/Proxy
- Dynamic DNS
- Badware Hoster
- URL Shortener
- Anti Piracy
- Gambling
- Social Media
- NSFW
Both wildcard and onlydomains versions are available and serve different use cases. For example, the onlydomains version blocks domain.tld but not sub.domain.tld, while the wildcard version blocks sub.domain.tld, which is what we need for YaCy with the syntax *.domain.tld/*.
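To make the difference concrete, here is a small sketch that turns sample lines from both list flavors into YaCy entries. The sample lines are made up, and the sketch assumes that the wildcard lists carry a leading *. while the onlydomains lists carry bare domains. to_yacy_entries is a hypothetical helper name.

```shell
# Hypothetical helper: read DNS blocklist lines on stdin and print YaCy entries.
# Comment lines (starting with '#') and blank lines are dropped, and a trailing
# carriage return is stripped in case a list uses Windows line endings.
to_yacy_entries() {
  awk '{
    sub(/\r$/, "")
    if (NF == 0 || $0 ~ /^#/) next
    print $0 "/*"
  }'
}

# Illustrative sample lines, not taken from a real list.
printf '%s\n' '# wildcard flavor' '*.example.com' | to_yacy_entries
printf '%s\n' '# onlydomains flavor' 'example.com' | to_yacy_entries
```

The first call yields *.example.com/* (subdomains) and the second yields example.com/* (the domain itself), which is why combining both flavors covers a domain and its subdomains.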
Create a blacklist for YaCy with Hagezi
To create a blacklist for YaCy with Hagezi, you can follow these steps:
- Choose the blocklists that you want to use from the Hagezi blocklists list above.
- Download the blocklists and save them as text files on your local machine.
- Create a new blacklist file for YaCy and add the entries from the Hagezi blocklists to the file. You have to add /* at the end of each entry to block all URLs that start with the domain. For example, if you want to block example.com, you would add example.com/* to the blacklist file. If you want to block all subdomains of example.com, you would add *.example.com/* to the blacklist file.
- Save the blacklist file and import it into YaCy as a plain text file.
- Test the blacklist in YaCy and make sure that the entries are blocked as expected.
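Before importing, a quick format check on the generated file can catch malformed lines. This is only a sketch under the assumption that every entry should have the host/* shape used in this article; legitimate full-path entries such as domain.net/fullpath would be flagged too. check_blacklist_format is a hypothetical name.

```shell
# Hypothetical pre-import check: report lines that do not look like "host/*".
# Returns 0 when every non-empty, non-comment line ends in "/*" and contains
# no spaces or tabs; otherwise prints the offending lines and returns 1.
check_blacklist_format() {
  awk '
    NF == 0 || /^#/ { next }
    ($0 !~ /\/\*$/) || ($0 ~ /[ \t]/) { printf "line %d: %s\n", NR, $0; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Demo on a throwaway file with one deliberately broken line.
sample="$(mktemp)"
printf '%s\n' '# sample' 'example.com/*' '*.example.org/*' 'broken.example.net' > "${sample}"
check_blacklist_format "${sample}" || echo "Fix the lines above before importing."
rm -f "${sample}"
```

This only validates the file; whether YaCy actually blocks a URL still has to be verified in YaCy itself.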
To automate the process of creating and importing the blacklist, you can use a script that downloads the Hagezi blocklists and formats them for YaCy (add the /* at the end of each entry) and then overwrites the current list in data/LISTS/[your_blacklist_name].black.
Automate the update of the blacklist
To automate the update of the blacklist, you can use a cron job that runs the script on a regular basis (e.g. weekly or monthly) to download the latest Hagezi blocklists and update the blacklist for YaCy.
A cron entry in /etc/crontab or /etc/cron.d format (note the user field), here running daily at 02:30, looks like this:
30 2 * * * root nice -n10 /usr/share/scripts/blacklist.sh --url-file /path/to/urls-blacklist.txt --output /path/to/yacy/data/LISTS/hagezi.blocklist.black && /usr/bin/chown -R yacy:yacy /path/to/yacy/data/LISTS/hagezi.blocklist.black
If you create a blacklist in YaCy with the name hagezi.blocklist in Filter & Blacklists, the list is automatically saved in the data/LISTS/hagezi.blocklist.black file and you can overwrite this file with the new one created by the script.
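Overwriting the live file directly means a failed or interrupted run can leave it empty or half-written. One way to avoid that, sketched here under the assumption that the new list is first generated somewhere else, is to copy it next to the target and then rename it into place. replace_blacklist is a hypothetical helper, and the paths in the usage comment are placeholders.

```shell
# Hypothetical helper: replace TARGET with NEW_FILE via a rename, so TARGET is
# never left half-written. Refuses to install an empty file.
# Usage: replace_blacklist NEW_FILE TARGET
replace_blacklist() {
  local new_file="$1" target="$2" tmp
  if [ ! -s "${new_file}" ]; then
    echo "Error: refusing to install empty file: ${new_file}" >&2
    return 1
  fi
  # Create the temporary file next to the target so that mv is a rename
  # within the same filesystem.
  tmp="$(mktemp "${target}.XXXXXX")" || return 1
  if cp -- "${new_file}" "${tmp}" && chmod 0644 "${tmp}" && mv -f -- "${tmp}" "${target}"; then
    return 0
  fi
  rm -f -- "${tmp}"
  return 1
}

# Example call (placeholder paths):
# replace_blacklist /tmp/hagezi.new /path/to/yacy/data/LISTS/hagezi.blocklist.black
```

Ownership still has to be fixed afterwards, for example with the chown call from the cron entry.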
The script that downloads the Hagezi blocklists and formats them for YaCy can look like this:
#!/usr/bin/env bash

set -o errexit
set -o nounset
set -o pipefail

readonly DEFAULT_URL='https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/ultimate.txt'
readonly DEFAULT_OUTPUT_FILE='blacklist.txt'

usage() {
  cat <<'EOF'
Usage: blacklist.sh [OPTIONS]

Downloads one or more blocklist URLs, merges them, and appends '/*' to non-empty,
non-comment lines.

Options:
  -u, --url URL         Add a URL to download (can be passed multiple times).
  -f, --url-file FILE   File with URLs to download, one link per line.
  -o, --output FILE     Output file (default: blacklist.txt).
  -h, --help            Show this help.

If neither --url nor --url-file is provided, the default URL is used.

Examples:
  blacklist.sh
  blacklist.sh --url-file urls.txt
  blacklist.sh --url https://example.com/list1.txt --url https://example.com/list2.txt
  blacklist.sh --url-file urls.txt --output blacklists/blacklist.txt
EOF
}

# Abort with an error if the given command is not available in PATH.
require_command() {
  local cmd="$1"

  if ! command -v "${cmd}" >/dev/null 2>&1; then
    echo "Error: Required command not found: ${cmd}" >&2
    exit 1
  fi
}

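# Append every URL from FILE (first argument) to the array whose name is
# passed as the second argument. Blank lines and '#' comments are skipped.
read_urls_from_file() {
  local file_path="$1"
  local -n urls_ref="$2"
  local line=''

  if [[ ! -f "${file_path}" ]]; then
    echo "Error: URL file does not exist: ${file_path}" >&2
    exit 1
  fi

  # The "|| [[ -n ... ]]" part also handles a last line without a trailing newline.
  while IFS= read -r line || [[ -n "${line}" ]]; do
    if [[ -z "${line//[[:space:]]/}" ]]; then
      continue
    fi
    if [[ "${line}" =~ ^[[:space:]]*# ]]; then
      continue
    fi
    urls_ref+=("${line}")
  done < "${file_path}"
}
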
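# Download every URL, append '/*' to each entry, and write the merged result
# to the output file. Blank lines and comment lines are passed through unchanged.
download_urls_to_file() {
  local output_file="$1"
  shift
  local urls=("$@")
  local url=''

  # Truncate (or create) the output file before appending to it.
  : > "${output_file}"

  for url in "${urls[@]}"; do
    echo "Downloading blocklist from ${url}..."
    if ! curl --fail --location --silent --show-error "${url}" | \
      awk '{
        if ($0 == "" || $0 ~ /^#/) {
          print
        } else {
          print $0 "/*"
        }
      }' >> "${output_file}"; then
      echo "Error: Failed to download URL: ${url}" >&2
      exit 1
    fi
    # Separate the individual lists with a blank line.
    echo >> "${output_file}"
  done
}
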
main() {
  require_command curl
  require_command awk

  local output_file="${DEFAULT_OUTPUT_FILE}"
  local url_file=''
  local -a urls=()

  # Parse command-line options.
  while (($# > 0)); do
    case "$1" in
      -u|--url)
        if (($# < 2)); then
          echo 'Error: --url requires a value.' >&2
          usage
          exit 1
        fi
        urls+=("$2")
        shift 2
        ;;
      -f|--url-file)
        if (($# < 2)); then
          echo 'Error: --url-file requires a value.' >&2
          usage
          exit 1
        fi
        url_file="$2"
        shift 2
        ;;
      -o|--output)
        if (($# < 2)); then
          echo 'Error: --output requires a value.' >&2
          usage
          exit 1
        fi
        output_file="$2"
        shift 2
        ;;
      -h|--help)
        usage
        exit 0
        ;;
      *)
        echo "Error: Unknown argument: $1" >&2
        usage
        exit 1
        ;;
    esac
  done

  if [[ -n "${url_file}" ]]; then
    read_urls_from_file "${url_file}" urls
  fi

  # Fall back to the default list if no URL was given.
  if ((${#urls[@]} == 0)); then
    urls=("${DEFAULT_URL}")
  fi

  echo "Downloading and processing ${#urls[@]} blocklist(s)..."
  download_urls_to_file "${output_file}" "${urls[@]}"

  if [[ ! -s "${output_file}" ]]; then
    echo "Error: Downloaded content is empty: ${output_file}" >&2
    exit 1
  fi

  echo "Done! Processed blocklist saved to ${output_file}"
}

main "$@"
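The merged output keeps the upstream comment headers and, since several lists are combined, may contain the same entry more than once. As an optional post-processing step, a minimal sketch for cleaning that up could look like this. dedupe_entries is a hypothetical helper and is not part of the script above.

```shell
# Hypothetical post-processing step: drop comments and blank lines from FILE,
# then remove duplicate entries. The result is printed sorted in byte order.
dedupe_entries() {
  awk 'NF > 0 && $0 !~ /^#/' "$1" | LC_ALL=C sort -u
}

# Demo on a throwaway file with one duplicate entry.
merged="$(mktemp)"
printf '%s\n' '# Title: sample' 'example.com/*' '' '*.example.com/*' 'example.com/*' > "${merged}"
dedupe_entries "${merged}"
rm -f "${merged}"
```

When applying this to the real file, write the result to a temporary file first and then move it over the original, because redirecting into the file that is being read would truncate it.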
The urls-blacklist.txt file can look like this:
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/ultimate.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/doh-vpn-proxy-bypass.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/dyndns.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/hoster.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/urlshortener.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/anti.piracy.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/gambling.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/social.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/nsfw.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/ultimate-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/doh-vpn-proxy-bypass-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/dyndns-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/hoster-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/urlshortener-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/anti.piracy-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/gambling-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/social-onlydomains.txt
https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/wildcard/nsfw-onlydomains.txt
With this setup, you will add approximately 1,141,642 entries (about 1.14 million; the exact number changes as the lists are updated) to your YaCy blacklist, giving you good coverage of the most common categories of domains that you want to block from being crawled, indexed, searched or shared.
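To see how many entries your own run produced, you can count the non-comment lines in the generated file. This is a small sketch; count_entries is a hypothetical helper and the path in the usage comment is a placeholder.

```shell
# Hypothetical helper: count entries in a generated blacklist file.
# Prints the total number of non-empty, non-comment lines and how many of
# them are distinct.
count_entries() {
  awk '
    NF > 0 && $0 !~ /^#/ {
      total++
      if (!($0 in seen)) { seen[$0] = 1; unique++ }
    }
    END { printf "total=%d unique=%d\n", total, unique }
  ' "$1"
}

# Example call (placeholder path):
# count_entries /path/to/yacy/data/LISTS/hagezi.blocklist.black
```

A large gap between the two numbers suggests that the selected lists overlap.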
About the YaCy search engine
YaCy [1] is a free and open-source search engine that allows you to create your own distributed search index. It is based on a peer-to-peer network and uses Solr to index content and provide search results.
YaCy is written in Java and can run on any platform that supports Java. It's a great alternative to commercial search engines and can be used for personal search, topic-specific indexing, or community-based search projects.
Footnotes
[1] YaCy: https://yacy.net/
[2] Hagezi DNS blocklists: https://github.com/hagezi/dns-blocklists
[3] AdGuard Home: https://adguard.com/en/adguard-home/overview.html
[4] Pi-hole: https://pi-hole.net/