Paper on developing LanguageTool available

My paper on developing LanguageTool, which focuses mostly on the new features needed to support many different languages in LT, has just been published in Software: Practice and Experience (if you have no access to SPE, here is the final uncorrected draft). The paper contains a section with empirical results: I tested LanguageTool on Polish and compared the results with the Microsoft Word grammar checker (the only one that exists for Polish besides LanguageTool). The results are pretty good: LanguageTool rules seem to create few false alarms (precision is around 90% for most test samples), whereas MS Word tends to produce a lot of them (precision is around 50% in the best cases, and in many cases much lower).

Sample text	MS Word matches	MS Word precision	LT matches	LT precision
Frequency Dictionary Corpus	4572	22%	8552	92%
Camera-ready book	586	1%	323	54%
Culinary forum	75	17%	186	90%
Catholic church notices	24	0%	140	87%
Left-wing political commentary	58	1%	242	97%
Right-wing political commentary	51	5%	274	93%
Left-wing professional politician blog	91	13%	238	90%
Right-wing professional politician blog	59	55%	98	94%
Stock-market analyst blog	43	4%	134	95%
Political blog	12	50%	67	98%
Popular personal blog 1	67	8%	127	88%
Popular personal blog 2	57	47%	124	98%

As promised in the paper, I am making my testing set freely available, since the results were evaluated manually and only by me. The methodology was very simple: instead of trying to figure out the total number of errors in the text, I simply checked whether the rules created any problems, i.e., offered suggestions that would result in grammar errors. This way, I avoid the possible objection that I treated some corrections as useless just because I thought the text could be left as is. Instead of arguing over such cases, I treat them as correct.
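To make the precision figures concrete, here is a minimal sketch of how such a number can be computed from a manual review. The function name and the counts in the usage line are hypothetical; they are for illustration only and are not taken from the paper or the test set. The sketch assumes that every match which does not introduce a grammar error counts as correct, as described above.

```python
def precision(total_matches: int, false_alarms: int) -> float:
    """Return the share of rule matches that were not false alarms.

    total_matches: number of places the checker flagged in a sample.
    false_alarms:  flagged places where the suggestion would introduce
                   a grammar error (judged manually).
    """
    if total_matches < 0 or false_alarms < 0:
        raise ValueError("counts must be non-negative")
    if false_alarms > total_matches:
        raise ValueError("false alarms cannot exceed total matches")
    if total_matches == 0:
        # With no matches at all, precision is undefined.
        raise ValueError("precision is undefined for zero matches")
    return (total_matches - false_alarms) / total_matches


# Hypothetical example: 200 matches, 20 of them judged to be false alarms.
print(f"{precision(200, 20):.0%}")  # prints "90%"
```

Note that this only measures how often the checker is wrong when it does flag something; it says nothing about errors the checker missed, which is exactly the part I did not try to count.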

There are two testing samples: files from LanguageTool, in a text format (zipped), and files from Microsoft Word (also zipped). Note: to see errors as marked by Microsoft Word, you actually need to have Microsoft Word 2007 or newer (some files are in docx format). I tested the files with Microsoft Word 2007. I excluded one file from the testing sample due to copyright restrictions; the text contained within the samples is from the Frequency Dictionary of Polish Corpus or from web blogs.

The paper has a longer section that describes the syntax of XML rules (now slightly out of date, as I have since made unification a bit simpler to use by relying on native XML constructs) and describes the architecture of the checker. People who want to become language maintainers might therefore find it interesting (I hope!).
