Comments on Morfologik: "Wikipedia history diff as a revision corpus"

Anonymous (2008-07-30, 11:11 +02:00):
Did you use the decompressed XML file or the compressed 7z/bz2 archive for your experiment?

Marcin Miłkowski (2008-08-03, 23:06 +02:00, https://www.blogger.com/profile/11617540925216664775):
Actually, both :) I downloaded the compressed file and decompressed it to standard output, feeding the gawk cleaning script with it. So of course you'd have to add the decompression time to the equation.

I also did a run on a decompressed XML corpus, but I didn't notice much change in speed. That was probably because the file sat on a rather old drive and was surely fragmented (it was large).

By the way, I have a publication forthcoming on this experiment in the PALC 2007 proceedings. Here's a presentation: http://marcinmilkowski.pl/downloads/korpus_bledow.pdf (in Polish; the publication will be in English).