# Wikilyze

Analyze Wikipedia dumps obtained from https://dumps.wikimedia.org/backup-index.html.

## Structure

This project has two main parts: `sift` and `brood`.

### Sift

`sift` is written in Python. It sifts through a Wikipedia article dump (`*-pages-articles.xml.bz2`), parsing and analyzing individual articles and printing interesting data.

It takes a (decompressed) XML article dump on stdin. For each article in the dump, it prints a single-line JSON object to stdout.
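The output contract can be sketched in a few lines of Python (the field names below are illustrative, not `sift`'s actual schema):

```python
import json

def to_json_line(article):
    """Serialize one parsed article as a single JSON Lines record."""
    # ensure_ascii=False keeps non-ASCII article titles readable in the output.
    return json.dumps(article, ensure_ascii=False)

# One line per article on stdout:
print(to_json_line({"title": "Example article", "n_links": 3}))
```

Emitting one self-contained JSON object per line lets downstream tools stream the data without holding the whole dump in memory.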

### Brood

`brood` is written in Rust and analyzes the data produced by `sift`.
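`brood` itself is Rust, but purely to illustrate the data flow it consumes, here is a hedged Python sketch of reading `sift`'s JSON-lines output (the counting logic is a stand-in, not `brood`'s real analysis):

```python
import json
import sys

def count_articles(lines):
    """Count non-empty JSON lines, as a stand-in for an analysis pass."""
    total = 0
    for line in lines:
        if line.strip():
            json.loads(line)  # each line must be one complete JSON object
            total += 1
    return total

if __name__ == "__main__":
    print(count_articles(sys.stdin))
```

This mirrors the pipeline shape: `sift` streams records line by line, and the analyzer can process them one at a time.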