wikimeasure

This project hosts software for measuring various things Wikipedian.

The software is open source and released under the terms of the BSD license.

Overview

A fascinating fact about Wikipedia is that its contents are available for download as a series of database dumps. Although their sizes range from the big to the huge, they are amenable to parsing and analysis, even on a laptop computer.

The Software

The software consists of a Python script that parses a Wikipedia dump.

To use the program, type

bzip2 -c -d enwiki-20100130-pages-articles.xml.bz2 | python parser.py > parser-output.txt

The actual name of the Wikipedia dump file may, of course, be different.

The output contains three kinds of lines:

  • lines containing word counts
  • lines containing links
  • lines containing redirects

To get the word counts, we run:

grep '.*;.*;.*;.*;.*;' parser-output.txt > word-counts.txt

The fields in each line are: the title, username, timestamp, word count, article class (if found), importance class (if found), and the categories the article belongs to. The categories are output in array form ([a, b, c]).
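A word-count line can be split into these fields with a few lines of Python. The exact layout below (semicolon-separated fields, categories in square brackets) is an assumption based on the description above; titles containing semicolons would need more careful handling.

```python
def parse_word_count_line(line):
    """Split one word-count line into a dict of fields.

    Assumed layout (not verified against parser.py's actual output):
    title;username;timestamp;word count;class;importance;[cat1, cat2, ...]
    """
    title, user, ts, words, art_class, importance, cats = \
        line.rstrip("\n").split(";")[:7]
    return {
        "title": title,
        "username": user,
        "timestamp": ts,
        "word_count": int(words),
        "article_class": art_class or None,       # empty field means "not found"
        "importance": importance or None,
        "categories": [c.strip()
                       for c in cats.strip("[]").split(",") if c.strip()],
    }
```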

To get the links we need to do:

grep ' => ' parser-output.txt > links.txt

And to get the redirects:

grep ' #REDIRECT ' parser-output.txt > redirects.txt

The links include links to parts of pages as well as links to pages in namespaces outside the main Wikipedia corpus. Also, the links do not resolve redirects. To remove links outside the main corpus and to resolve redirects, we process the links file with a Perl script and a helper module:

perl process-links.pl redirects.txt links.txt > processed-links.txt
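The actual processing is done by process-links.pl, which is not shown here, but the idea can be sketched as follows. The sketch assumes redirect lines look like "Source #REDIRECT Target" and link lines like "Source => Target" (matching the grep patterns above), and it uses the presence of a colon in a title as a crude namespace test; both are assumptions, not the Perl script's actual rules.

```python
def process_links(redirect_lines, link_lines):
    """Drop links outside the main corpus and resolve redirects
    (an illustrative Python sketch of what process-links.pl does)."""
    redirects = {}
    for line in redirect_lines:
        src, _, dst = line.rstrip("\n").partition(" #REDIRECT ")
        redirects[src] = dst
    for line in link_lines:
        src, _, dst = line.rstrip("\n").partition(" => ")
        dst = dst.split("#", 1)[0]      # drop the in-page section part
        if ":" in dst:                  # crude namespace test (assumption)
            continue
        seen = set()                    # guard against redirect cycles
        while dst in redirects and dst not in seen:
            seen.add(dst)
            dst = redirects[dst]
        yield f"{src} => {dst}"
```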

We are not finished yet. There may be bogus links to articles that do not exist. We assume that only the articles appearing in the word counts exist. To filter out the links that do not point to them, we use a filter script:

perl filter-links.pl word-counts.txt processed-links.txt > filtered-links.txt
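The logic of this last step is simple enough to sketch: collect the set of titles that appear in the word counts, then keep only the links whose target is in that set. This Python version is illustrative only; filter-links.pl is the authoritative implementation, and the field layouts are the assumed ones described above.

```python
def filter_links(word_count_lines, link_lines):
    """Keep only links whose target has a word-count entry
    (an illustrative sketch of what filter-links.pl does)."""
    # The title is assumed to be the first ';'-separated field.
    known = {line.split(";", 1)[0] for line in word_count_lines}
    for line in link_lines:
        _, _, dst = line.rstrip("\n").partition(" => ")
        if dst in known:
            yield line.rstrip("\n")
```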

Support

For questions, comments, suggestions, etc., you may contact the author, Panos Louridas. The address is the author's surname at the hosting organisation (grnet in Greece).