Wikilyze

Analyze Wikipedia dumps obtained from https://dumps.wikimedia.org/backup-index.html.

Structure

This project has two main parts: sift and brood.

Sift

sift is written in Python. It sifts through a Wikipedia article dump (*-pages-articles.xml.bz2), parsing and analyzing each article and printing data of interest.

It takes a (decompressed) XML article dump on stdin. For each article in the dump, it prints a single-line JSON object to stdout.
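
The stdin-to-stdout flow described above can be sketched roughly as follows. This is a minimal illustration and not sift's actual code: the function names (local_name, iter_articles, main) and the output fields ("title", "length") are assumptions chosen for the example. It streams the XML with the standard library's ElementTree.iterparse, so the dump is not loaded in one piece.

```python
import json
import sys
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield one dict per <page> element, reading the dump as a stream.

    The "title" and "length" fields are hypothetical; the real tool
    may extract different data.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield {"title": title, "length": len(text)}
        elem.clear()  # drop the children of the page we just processed


def main():
    """Read the decompressed dump from stdin, print one JSON object per line."""
    for article in iter_articles(sys.stdin.buffer):
        print(json.dumps(article))
```

Calling main() at the end of the file would turn this into a filter that reads a decompressed dump on stdin. Because json.dumps emits no newlines by default, each article ends up on exactly one line, which keeps the output easy to consume line by line.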

Brood

brood is written in Rust and analyzes the data obtained by sift.