
# Wikilyze

Analyze Wikipedia dumps obtained from https://dumps.wikimedia.org/backup-index.html.

## Structure

This project has two main parts: `sift` and `brood`.

## Sift

`sift` is written in Python and sifts through a Wikipedia article dump (`*-pages-articles.xml.bz2`), parsing and analyzing individual articles and printing interesting data.

It takes a (decompressed) XML article dump on stdin. For each article in the dump, it prints a single-line JSON object to stdout.
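The stdin-to-JSON-lines contract can be sketched as follows. This is an illustrative sketch only: the element handling and the output fields (`title`, `chars`) are assumptions for demonstration, not sift's actual schema.

```python
import json
import sys
import xml.etree.ElementTree as ET

def _local(tag):
    # Strip any XML namespace, e.g. "{http://...}page" -> "page".
    return tag.rsplit("}", 1)[-1]

def emit_articles(stream, out=sys.stdout):
    # Stream-parse <page> elements so the full dump never sits in memory.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if _local(elem.tag) == "page":
            fields = {_local(child.tag): (child.text or "") for child in elem.iter()}
            # One single-line JSON object per article, as sift does;
            # the fields here are illustrative, not sift's real output.
            out.write(json.dumps({"title": fields.get("title", ""),
                                  "chars": len(fields.get("text", ""))}) + "\n")
            elem.clear()  # release memory for the processed page

if __name__ == "__main__":
    emit_articles(sys.stdin.buffer)
```

Because each article becomes exactly one line of JSON, downstream tools can process the output with ordinary line-oriented pipelines.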

## Brood

`brood` is written in Rust and analyzes the data produced by `sift`.
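brood's actual analyses live in the Rust sources. Purely as an illustration of what consuming sift's JSON-lines output looks like, a downstream aggregation pass might be shaped like this (the `chars` field is an assumed example, not brood's real input schema):

```python
import json
import sys

def summarize(lines):
    # Fold sift's one-JSON-object-per-line output into summary statistics.
    count = total = 0
    for line in lines:
        record = json.loads(line)
        count += 1
        # "chars" is a hypothetical per-article field used for illustration.
        total += record.get("chars", 0)
    return {"articles": count, "total_chars": total}

if __name__ == "__main__":
    # Reads JSON lines on stdin, e.g. piped directly from sift.
    print(json.dumps(summarize(sys.stdin)))
```

A streaming pass like this keeps memory flat regardless of dump size, which matters when the input covers every Wikipedia article.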