# Wikilyze

Analyze Wikipedia dumps obtained from https://dumps.wikimedia.org/backup-index.html.

## Structure

This project has two main parts: `sift` and `brood`.

### Sift

`sift` is written in Python. It sifts through a Wikipedia article dump (`*-pages-articles.xml.bz2`), parsing and analyzing individual articles and printing interesting data.

It takes a (decompressed) XML article dump on stdin. For each article in the dump, it prints a single-line JSON object to stdout.
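The output contract can be sketched in a few lines of Python (the field names below are illustrative, not `sift`'s actual schema):

```python
import json

def to_json_line(article):
    """Serialize one parsed article as a single JSON Lines record."""
    # ensure_ascii=False keeps non-ASCII article titles readable in the output.
    return json.dumps(article, ensure_ascii=False)

# One line per article on stdout:
print(to_json_line({"title": "Example article", "n_links": 3}))
```

Emitting one self-contained JSON object per line lets downstream tools stream the data without holding the whole dump in memory.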

### Brood

`brood` is written in Rust and analyzes the data produced by `sift`.
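`brood` itself is Rust, but purely to illustrate the data flow it consumes, here is a hedged Python sketch of reading `sift`'s JSON-lines output (the counting logic is a stand-in, not `brood`'s real analysis):

```python
import json
import sys

def count_articles(lines):
    """Count non-empty JSON lines, as a stand-in for an analysis pass."""
    total = 0
    for line in lines:
        if line.strip():
            json.loads(line)  # each line must be one complete JSON object
            total += 1
    return total

if __name__ == "__main__":
    print(count_articles(sys.stdin))
```

This mirrors the pipeline shape: `sift` streams records line by line, and the analyzer can process them one at a time.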