
Wikilyze

Analyze Wikipedia dumps obtained from https://dumps.wikimedia.org/backup-index.html.

Structure

This project has two main parts, sift and brood.

Sift

sift is written in Python. It sifts through a Wikipedia article dump (*-pages-articles.xml.bz2), parsing and analyzing each article and printing interesting data about it.

It takes a (decompressed) XML article dump on stdin. For each article in the dump, it prints a single-line JSON object to stdout.
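
The stdin-to-stdout contract described above can be sketched roughly as follows. This is not sift's actual code, only a minimal illustration using the standard library. The function names (`local`, `iter_articles`, `main`), the output fields (`title`, `length`), and the choice to keep only pages in namespace 0 are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield one dict per <page> element in namespace 0 (assumed filter)."""
    # iterparse reads the stream incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = ns = text = None
        for node in elem.iter():
            name = local(node.tag)
            if name == "title":
                title = node.text
            elif name == "ns":
                ns = node.text
            elif name == "text":
                text = node.text or ""
        # Drop the page's children so memory use stays bounded on large dumps.
        elem.clear()
        if ns == "0":
            # Hypothetical output fields; the real tool decides what to report.
            yield {"title": title, "length": len(text or "")}


def main(stream, out):
    """Write one single-line JSON object per article to `out`."""
    for article in iter_articles(stream):
        print(json.dumps(article), file=out)
```

To use the sketch as a filter, one would call `main(sys.stdin.buffer, sys.stdout)` and pipe a decompressed dump into the script, for example with `bzcat` on the `.xml.bz2` file.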

Brood

brood is written in Rust and analyzes the data obtained by sift.