Wikipedia history diff as a revision corpus

(As this is of interest not only to the Polish-speaking community, this post is in English.)

Recently, after some discussions on the lingucomponent list at OpenOffice.org about methods of finding frequent typos, I ran some experiments on Wikipedia revision history logs.

Background. Developers of grammar checkers and autocorrect lists have a hard time finding relevant corpora. Revision history is an excellent source of information about native speakers' perception of linguistic norms. Typos that are frequently revised are perceived as errors that need to be corrected, so putting them on autocorrect lists is justified. The same goes for style, grammar and usage errors.

Method. Experiments involved three steps:
  1. Clean the history dump (??wiki-latest-pages-meta-history.xml) to keep only its relevant parts. Using XML tools isn't recommended (I tried XSLT, forget it). With a simple awk script, I was able to clean the >30 GB dump in an hour or so, and got a >17 GB file. The script is simple and ugly: it removes most wiki markup (though this part needs more work) and tokenizes all words (puts each word on its own line). The resulting file had 2440 million words. Nice ;)
  2. Get revisions from the dump and compare them. I used another simple script that sequentially wrote out separate files for all versions of one Wikipedia article and ran diff on them. The diff output was redirected to a file. After 1300 minutes, the corpus was ready. Note that some other method could be used instead of diff.
  3. Query the corpus using the properties of the unified diff format (gawk scripts and sort). I was able to get lists of frequent one-word-to-one-word corrections, two-word-to-one-word corrections, and corrections of two or more words to one or more words. I will post some of the results in Polish.
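
Steps 2 and 3 can be illustrated with a minimal Python sketch. This is not the awk/gawk code used in the experiment; it only shows the idea of diffing tokenized revisions and reading one-word-to-one-word corrections off the unified diff hunks. The function names (tokenize, substitutions, count_corrections) are made up for this illustration, the tokenizer is deliberately naive, and Python's difflib stands in for the diff command.

```python
import difflib
import re
from collections import Counter
from itertools import islice


def tokenize(text):
    # Naive stand-in for the cleaning step: one token per list item.
    # Real wiki markup removal is not attempted here.
    return re.findall(r"\w+", text)


def substitutions(old_tokens, new_tokens):
    """Yield (old, new) pairs for hunks that replace exactly one token with one token."""
    # n=0 means no context lines, so every hunk is one contiguous change.
    diff = difflib.unified_diff(old_tokens, new_tokens, lineterm="", n=0)
    removed, added = [], []
    # Skip the two file header lines ("---" and "+++").
    for line in islice(diff, 2, None):
        if line.startswith("@@"):
            # A new hunk starts: flush the previous one.
            if len(removed) == 1 and len(added) == 1:
                yield removed[0], added[0]
            removed, added = [], []
        elif line.startswith("-"):
            removed.append(line[1:])
        elif line.startswith("+"):
            added.append(line[1:])
    # Flush the last hunk.
    if len(removed) == 1 and len(added) == 1:
        yield removed[0], added[0]


def count_corrections(revisions):
    """Count one-word corrections between consecutive revisions of one article."""
    counts = Counter()
    for old, new in zip(revisions, revisions[1:]):
        counts.update(substitutions(tokenize(old), tokenize(new)))
    return counts


if __name__ == "__main__":
    revisions = [
        "I recieved a letter and teh parcel",
        "I received a letter and the parcel",
    ]
    for (old, new), freq in count_corrections(revisions).most_common():
        print(freq, old, "->", new)
```

Summing such counters over all articles and sorting by frequency would give a candidate list similar in spirit to the one described above. Hunks with more than one word on either side are simply skipped in this sketch.
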
Possible improvements. (1) The corpus format should probably be changed. However, there is no standard corpus format and/or annotation format for revision corpora. The easiest way would be to tag the corpus before using diff. As the corpus will be used for LanguageTool, and we already have fast taggers for that, we could use the taggers to annotate the words.
(2) Use something better than diff, because diff doesn't display enough context. Even if I changed the context to 5 or 7 lines, it could easily cross sentence boundaries, and the sentence, segmented according to language-specific rules, is really the unit that grammar checkers are interested in. This isn't as important for building autocorrect lists.
(3) Get a faster method for querying the corpus (using some standard corpus tools). For XML lovers, we could have an XML version of the difference corpus.
(4) Get more serious statistical tools, like those used for finding collocations. For now I use only frequencies. Using, for example, the Levenshtein distance between the original and the corrected form could be a way of separating style corrections from typo corrections. Word order corrections are also currently disregarded.
(5) Use the comment lines in the Wikipedia history when querying the corpus. For now they are ignored. Note, however, that some typo corrections are made without any comment.
(6) Convert the revision history to an error corpus format, by annotating it manually or automatically (using the comment tag in Wikipedia and/or the results of the statistical analyses in case the comment is not relevant).
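
As a rough illustration of improvement (4), here is a minimal Python sketch of a Levenshtein-based filter. It is only a sketch under stated assumptions: the function names and the threshold of 2 edits are hypothetical choices made for this example, not something tested in the experiment, and a real threshold would need tuning per language.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def looks_like_typo(old, new, max_distance=2):
    # Assumption: a small, non-zero edit distance suggests a typo fix,
    # while a larger one suggests a style or wording change.
    return 0 < levenshtein(old, new) <= max_distance


if __name__ == "__main__":
    for old, new in [("recieved", "received"), ("big", "large")]:
        kind = "typo?" if looks_like_typo(old, new) else "style?"
        print(old, "->", new, kind)
```

Note that plain Levenshtein distance counts a swap of two adjacent letters as two edits, which is why the hypothetical threshold here is 2 rather than 1.
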

Other sources of similar information. Queries to online dictionaries, spell-checkers, thesauri, and the like contain frequent typos. Google also has a method for suggesting better queries, and this is based on such analyses. There are also some web diff tools for tracking updates to web sites.

Linguistic relevance. Big revision corpora, including those from newspaper proofreaders and translation agencies, could be used to justify normative judgments not only on the grounds of frequent usage but also on the grounds of what competent speakers of the language consider a mistake or bad wording.

I need to dig deeper into error corpus tools, as my proof-of-concept experiments used only crude tools. Using easily available material such as Wikipedia could be a great advantage for minority languages in building better computer-aided writing tools.

2 comments:

Anonymous writes...

Did you use the decompressed XML file or the compressed 7z/bz2 archive for your experiment?

Marcin Miłkowski writes...

Actually, both :) I downloaded the 7bz file and decompressed it to the standard output, feeding the gawk cleaning script with it. So of course you'd have to add the 7bz decompression time to the equation.

I also had a run on a decompressed XML corpus but I didn't notice much change in speed. Probably it was because I had it on a rather old drive and it was surely fragmented (it was large).

BTW, I have a publication forthcoming on this experiment in PALC 2007 proceedings. Here's a presentation: http://marcinmilkowski.pl/downloads/korpus_bledow.pdf
(The presentation is in Polish; the publication will be in English.)