Evaluating Grammar Checkers

As a developer of LanguageTool, I would love to see clear guidelines for evaluating grammar checkers. Some discussions of flaws in grammar checkers - such as the influential analysis by Daniel Kies - rely on quite controversial principles and make assumptions that seem unwarranted at times.

Kies used analysis of a corpus of 3000 college essays in English to evaluate the checkers. Now, the analysis contained top 20 errors, ordered by frequency. Of course, frequency of the error seems to be an important factor but what about the severity of error? Let me explain. I really doubt that "No comma after introductory element" is the top error in college essays if they're written on a computer keyboard. The most common error is to have multiple whitespace or formatting with whitespace. Of course, this is not strictly a grammar error, but punctuation is also quite far from syntactic principles of the language, and multiple whitespace is also a punctuation problem. If they had manually written essays, then they don't have a chance to count whitespace errors as they don't seem to occur in handwriting. So a lot depends on the medium of the corpus: the space of possible errors is biased by the representation of the linguistic content. In a corpus of OCRed books you would expect more errors that are due to incorrect recognition of badly printed characters, for example.

Another problem is that "No comma after introductory element" is actually quite controversial. Modern writers don't seem to care about the comma, so using this rule as a baseline is not a reliable way to evaluate the checkers. For example, I don't want to implement this rule and make it turned on by default exactly for this reason: most people don't think a missing comma after an introductory element is actually a severe problem. If you want to evaluate grammar checkers, you'd better ignore such problems and focus on things which are uncontroversially severe, such as confusing "it's" and "its". Implementing a rule for detecting a phrase at a sentence start that is not a subject of the sentence is not exactly rocket science, but I don't think the rule is really so important. Also, if you see some error really often in a corpus, then it may mean that it's no longer an error but rather an emerging standard usage.

The second error on the list is "vague pronoun reference". This is actually much harder to implement. In computational linguistics, we still don’t have a good solution for anaphora resolution: a failure of anaphora resolution would mean that there is a vague pronoun reference. But as anaphora resolution is actually quite hard (especially for such languages as English, which is much harder than Slavonic or Romance languages with their richer morphology that contains gender information), developers may decide not to focus on it but implement things which are much easier. The question is whether you want your grammar checker to complain at Kant's prose (Kant notoriously used vague pronouns) or you want it to find simpler typos etc. If you think that computer-aided proofreading tools shouldn't deal with the deep structure of language and actually help edit the text, then you will not count this error as a good way to evaluate grammar checkers.

Another problem with the list used for evaluation is that it's based on college essays. College students don't tend to make mistakes typical for people that use English as their second language (I make different mistakes than native speakers, for example). As more people use English as their second language than there are native speakers of English, you would expect the evaluation to be based at least not only on native users' problems. It seems obvious that a Chinese or Polish student needs the computer tools to correct her English much more than a native speaker.

If you use a biased list of errors, you can get biased results. I think it would be interesting to see the evaluation of grammar checkers based on different corpora, as all specialized corpora are biased this way or another. But as long as you try to balance it, you might get better insight, both into the state of the field of grammar checking (or computer-aided proofreading) and into the errors as stable patterns of incorrect usage of language. So a more exhaustive analysis could reuse:
  • Language learners error corpora (especially ESL)
  • Corpora of texts marked by professional proofreaders
  • Automatically generated corpora from Wikipedia
Any ideas what to use as baseline measures for evaluation? What kinds of data would be needed?

It's also worth noting that it's much easier to evaluate grammar checkers in terms of their precision by counting false positives among all alarms raised by them. It is the recall, or the real number of true positives, that remains really hard to evaluate.