Bootstrapping the rules for LanguageTool

This post is related to many languages, so I'm posting in English.

Recently, during PALC 2009, I gave a talk on unsupervised generation of rules for LanguageTool. The idea is that when you have an error corpus (and you can create one from Wikipedia revision history; by the way, here's a draft of my paper on creating the error corpus from Wikipedia), you can use transformation-based learning (TBL) techniques to create rules that can be used to bootstrap rule creation for new languages in LanguageTool.
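To make the idea of transformation-based learning a bit more concrete, here is a minimal sketch in Python. It is not my actual scripts and not how muTBL or fnTBL work internally; the rule template ("replace token X with Y if the previous/next token is Z"), the function names and the toy corpus are all assumptions made for illustration. The learner greedily picks the rule that fixes the most tokens in the error corpus, applies it, and repeats.

```python
def candidate_rules(current, target):
    """Propose (old, new, side, context) rules wherever current differs from target."""
    rules = set()
    for cur, tgt in zip(current, target):
        for i, (c, t) in enumerate(zip(cur, tgt)):
            if c != t:
                if i > 0:
                    rules.add((c, t, "prev", cur[i - 1]))
                if i + 1 < len(cur):
                    rules.add((c, t, "next", cur[i + 1]))
    return rules


def apply_rule(rule, tokens):
    """Return a copy of tokens with the rule applied at every matching position."""
    old, new, side, ctx = rule
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok != old:
            continue
        j = i - 1 if side == "prev" else i + 1
        if 0 <= j < len(tokens) and tokens[j] == ctx:
            out[i] = new
    return out


def score(rule, current, target):
    """Tokens fixed by the rule minus correct tokens it would break."""
    total = 0
    for cur, tgt in zip(current, target):
        changed = apply_rule(rule, cur)
        for c, n, t in zip(cur, changed, tgt):
            if c != n:
                if n == t:
                    total += 1
                elif c == t:
                    total -= 1
    return total


def learn(errors, corrections, min_score=1):
    """Greedy TBL loop: pick the best rule, apply it, repeat until nothing helps."""
    current = [list(s) for s in errors]
    learned = []
    while True:
        rules = sorted(candidate_rules(current, corrections))
        if not rules:
            break
        best = max(rules, key=lambda r: score(r, current, corrections))
        if score(best, current, corrections) < min_score:
            break
        learned.append(best)
        current = [apply_rule(best, s) for s in current]
    return learned


# Toy error corpus (made up): token-aligned erroneous and corrected sentences.
errors = [
    ["I", "could", "of", "gone"],
    ["they", "could", "of", "been"],
    ["a", "piece", "of", "cake"],
]
corrections = [
    ["I", "could", "have", "gone"],
    ["they", "could", "have", "been"],
    ["a", "piece", "of", "cake"],
]

print(learn(errors, corrections))  # [('of', 'have', 'prev', 'could')]
```

The learned rule generalizes over both erroneous sentences and leaves "a piece of cake" alone, which is roughly the kind of context-sensitive pattern that could then be turned into a LanguageTool rule by hand.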

Of course, what I have right now are only quick hacks and script prototypes, but as you can see in my presentation, I'm planning to make the process a bit easier to use. First of all, the extraction of the error corpus from Wikipedia revision history can be fully ported to Java (I will add filters to remove synonym-for-synonym revisions, but some of the most frequent changes only adapt the text to editorial conventions, so they would have to be filtered manually). Currently, I'm using two packages for TBL machine learning: muTBL and fnTBL (both are free, but muTBL is not open-source). muTBL is written in Prolog and has serious memory limitations, but it allows for iterative, incremental learning of rules on the stack. fnTBL is written in C++ and can process much larger corpora, but it crashes a lot, and its newest version (1.2) didn't work for me at all. Fortunately, there is a Java implementation of the TBL toolkit that works in GATE, and it's GPL-licensed, so it could be adapted to our needs.
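As a rough illustration of the extraction step, the sketch below compares two revisions of the same sentence and keeps only short token replacements as correction candidates. It is a simplified stand-in, not my actual extraction scripts: the function name, the `max_span` limit and the hand-made `ignore` set (standing in for manually filtered editorial-convention changes) are assumptions. It uses only `difflib` from the Python standard library.

```python
import difflib


def extract_corrections(old_text, new_text, max_span=2, ignore=frozenset()):
    """Return (before, after) pairs for short replacements between two revisions."""
    old, new = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep only small in-place replacements; long rewrites are unlikely
        # to be simple error corrections.
        if tag != "replace" or i2 - i1 > max_span or j2 - j1 > max_span:
            continue
        pair = (" ".join(old[i1:i2]), " ".join(new[j1:j2]))
        if pair not in ignore:
            pairs.append(pair)
    return pairs


# Hypothetical example revisions.
print(extract_corrections("I could of gone there yesterday",
                          "I could have gone there yesterday"))
# [('of', 'have')]

# A manually maintained list of editorial-convention changes to skip.
conventions = {("color", "colour")}
print(extract_corrections("the color is nice", "the colour is nice",
                          ignore=conventions))
# []
```

A synonym filter could plug into the same `ignore` mechanism, although in practice it would need a lexical resource rather than a hand-written set.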

With these tools, it should be fairly easy to add some relevant rules for new languages, even for people with limited linguistic competence. Yet some of the rules require morphosyntactic lexicons, which aren't so easy to create or find.

Of course, the use of TBL is not limited to bootstrapping new rules for new languages - TBL learning can enhance existing rules quite significantly, as I observed. The process is not very time-consuming even with dirty hacks, so I'm quite optimistic about the practical application.

So if anyone wants to help with porting my scripts to Java, just let me know - this way, we'll have the tool a bit faster :)