wikimeasure
This project hosts software for measuring various things Wikipedian.
The software is open source and released under the terms of the BSD license.
Overview
A fascinating fact about Wikipedia is that it is available for download as a series of Wikipedia dumps. Although their size ranges from the big to the huge, they are amenable to parsing and analysis, even on a laptop computer.
The Software
The software consists of a Python script, parser.py, that parses a Wikipedia dump.
To use the program, type:
bzip2 -c -d enwiki-20100130-pages-articles.xml.bz2 | python parser.py > parser-output.txt
The actual name of the Wikipedia dump file may of course be different.
The output contains three kinds of lines:
- lines containing word counts
- lines containing links
- lines containing redirects
To get the word counts, run:
grep '.*;.*;.*;.*;.*;' parser-output.txt > word-counts.txt
The fields in each line are: title, username, timestamp, word count, the article class (if found), the importance class (if found), and the categories the article belongs to. The categories are output as an array ([a, b, c]).
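As a rough illustration of how such a line could be read back, here is a minimal Python sketch. It assumes that the fields are separated by semicolons (as the grep pattern suggests), that there are exactly seven of them, and that category names contain no commas. The names WordCount and parse_word_count, and the sample line, are hypothetical and are not part of parser.py.

```python
from typing import List, NamedTuple


class WordCount(NamedTuple):
    title: str
    username: str
    timestamp: str
    words: int
    article_class: str
    importance_class: str
    categories: List[str]


def parse_word_count(line: str) -> WordCount:
    """Parse one word-count line (assumed ';'-separated, seven fields)."""
    # Split at most six times so that any further semicolons stay in the
    # last (categories) field. This is an assumption about the format.
    parts = [p.strip() for p in line.rstrip("\n").split(";", 6)]
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {line!r}")
    title, username, timestamp, words, art_class, imp_class, cats = parts
    # Categories are assumed to look like "[a, b, c]".
    if cats.startswith("[") and cats.endswith("]"):
        cats = cats[1:-1]
    categories = [c.strip() for c in cats.split(",") if c.strip()]
    return WordCount(title, username, timestamp, int(words),
                     art_class, imp_class, categories)


# Hypothetical sample line, made up for illustration only.
sample = "Ohio;SomeUser;2010-01-30T12:00:00Z;1234;B;High;[States, Midwest]"
record = parse_word_count(sample)
print(record.title, record.words, record.categories)
```

If titles in the real output can contain semicolons, the split would have to be anchored differently, for example from the right-hand side.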
To get the links, run:
grep ' => ' parser-output.txt > links.txt
And to get the redirects:
grep ' #REDIRECT ' parser-output.txt > redirects.txt
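The three grep commands above each read the whole output file. For a large dump it may be convenient to do the split in a single pass. The following Python sketch mirrors the three patterns: at least five semicolons for word counts, ' => ' for links, and ' #REDIRECT ' for redirects. The function names kinds and split_output are hypothetical, and the sample lines are made up.

```python
from typing import Dict, Iterable, List, Set


def kinds(line: str) -> Set[str]:
    """Return the kinds of output a line matches, using the grep patterns."""
    text = line.rstrip("\n")
    found = set()
    # grep '.*;.*;.*;.*;.*;' matches any line with at least five semicolons.
    if text.count(";") >= 5:
        found.add("word-counts")
    if " => " in text:
        found.add("links")
    if " #REDIRECT " in text:
        found.add("redirects")
    return found


def split_output(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group parser output lines by kind in one pass."""
    groups = {"word-counts": [], "links": [], "redirects": []}
    for line in lines:
        # Like the independent greps, a line may land in several groups.
        for kind in kinds(line):
            groups[kind].append(line.rstrip("\n"))
    return groups


# Hypothetical sample lines, for illustration only.
sample = [
    "Ohio;SomeUser;2010-01-30T12:00:00Z;1234;B;High;[States]",
    "Ohio => United States",
    "USA #REDIRECT United States",
]
for kind, matched in split_output(sample).items():
    print(kind, matched)
```

To run it on real data, the lines of parser-output.txt could be passed in directly, for example split_output(open("parser-output.txt", encoding="utf-8")).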
The links include links to parts of pages as well as links to pages in namespaces outside the main Wikipedia corpus. Also, the links do not resolve redirects. To remove links outside the main Wikipedia corpus and resolve redirects, we must process the links file as follows, using a Perl script and a helper module:
perl process-links.pl redirects.txt links.txt > processed-links.txt
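To make the idea behind this step concrete, here is a minimal Python sketch of redirect resolution and namespace filtering. It is not the logic of process-links.pl, which is not shown here. It assumes that link lines look like "source => target", that redirect lines look like "source #REDIRECT target", that a part of a page is written as "Page#part", and that a namespace link starts with one of a small, incomplete set of prefixes. All function names are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Illustrative, incomplete set of non-article namespace prefixes.
NAMESPACES = {"Category", "File", "Image", "Template", "Wikipedia",
              "Help", "Portal", "User", "Talk", "Special", "MediaWiki"}


def load_redirects(lines: Iterable[str]) -> Dict[str, str]:
    """Build a source -> target map from assumed 'A #REDIRECT B' lines."""
    redirects = {}
    for line in lines:
        source, sep, target = line.rstrip("\n").partition(" #REDIRECT ")
        if sep:
            redirects[source.strip()] = target.strip()
    return redirects


def resolve(title: str, redirects: Dict[str, str]) -> str:
    """Follow redirect chains, stopping if a cycle is detected."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title


def process_link(line: str, redirects: Dict[str, str]) -> Optional[str]:
    """Return the cleaned link line, or None if the link is dropped."""
    source, sep, target = line.rstrip("\n").partition(" => ")
    if not sep:
        return None
    # Drop the part-of-page suffix (assumed to follow a '#').
    target = target.split("#", 1)[0].strip()
    if not target:
        return None
    # Drop links whose prefix names a non-article namespace.
    prefix, colon, _ = target.partition(":")
    if colon and prefix in NAMESPACES:
        return None
    return f"{source.strip()} => {resolve(target, redirects)}"


# Hypothetical sample data, for illustration only.
redirects = load_redirects(["USA #REDIRECT United States",
                            "US #REDIRECT USA"])
print(process_link("Ohio => US#History", redirects))
print(process_link("Ohio => Category:States", redirects))
```

Testing the prefix against a fixed set, rather than rejecting every title that contains a colon, keeps ordinary article titles that happen to include one.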
We are not finished yet. There may be bogus links to articles that do not exist. We assume that only the articles listed in the word counts file exist. To filter out the links that do not point to them we need to use a filter script:
perl filter-links.pl word-counts.txt processed-links.txt > filtered-links.txt
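The filtering idea can be sketched in Python as follows. This is an illustration only, not the logic of filter-links.pl. It assumes that the title is the first semicolon-separated field of each word-count line and that link lines look like "source => target". The function names load_titles and filter_links are hypothetical.

```python
from typing import Iterable, Iterator, Set


def load_titles(word_count_lines: Iterable[str]) -> Set[str]:
    """Collect article titles (assumed to be the first ';' field)."""
    return {line.split(";", 1)[0].strip()
            for line in word_count_lines if ";" in line}


def filter_links(link_lines: Iterable[str], titles: Set[str]) -> Iterator[str]:
    """Yield only the links whose target is a known article title."""
    for line in link_lines:
        text = line.rstrip("\n")
        source, sep, target = text.partition(" => ")
        if sep and target.strip() in titles:
            yield text


# Hypothetical sample data, for illustration only.
titles = load_titles([
    "Ohio;SomeUser;2010-01-30T12:00:00Z;1234;B;High;[States]",
    "United States;SomeUser;2010-01-30T12:00:00Z;5678;A;Top;[Countries]",
])
for link in filter_links(["Ohio => United States", "Ohio => Atlantis"], titles):
    print(link)
```

Holding the titles in a set keeps each membership test cheap, which matters when the links file has many millions of lines.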
Support
For questions, comments, suggestions, etc., you may contact the author, Panos Louridas. The address is the author's surname at the hosting organisation (grnet in greece).