Evaluating Grammar Checkers

As a developer of LanguageTool, I would love to see clear guidelines for evaluating grammar checkers. Some discussions of flaws in grammar checkers - such as the influential analysis by Daniel Kies - rely on quite controversial principles and make assumptions that seem unwarranted at times.

Kies used an analysis of a corpus of 3,000 college essays in English to evaluate the checkers. The analysis listed the top 20 errors, ordered by frequency. Frequency certainly seems to be an important factor, but what about the severity of an error? Let me explain. I really doubt that "No comma after introductory element" is the top error in college essays if they were written on a computer keyboard. The most common error there is multiple whitespace, or formatting with whitespace. Of course, this is not strictly a grammar error, but punctuation is also quite far from the syntactic principles of a language, and multiple whitespace is a punctuation problem as well. If the essays were handwritten, there was no chance to count whitespace errors, as they simply do not occur in handwriting. So a lot depends on the medium of the corpus: the space of possible errors is biased by the representation of the linguistic content. In a corpus of OCRed books, for example, you would expect more errors caused by incorrect recognition of badly printed characters.
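
To see how trivial the whitespace case is compared to "real" grammar, here is a minimal sketch of such a check in Python. This is a hypothetical illustration, not LanguageTool's actual implementation (LanguageTool's real rules are written in Java and XML):

```python
import re

def find_multiple_spaces(text):
    """Return (start, end) spans of runs of two or more spaces.

    A purely typographic error: it can occur in typed text,
    but never in a handwritten essay.
    """
    return [m.span() for m in re.finditer(r" {2,}", text)]

print(find_multiple_spaces("This  sentence has  two problems."))
# [(4, 6), (18, 20)]
```

A rule like this needs no linguistic knowledge at all, which is exactly why its frequency in a corpus says little about how hard or important grammar checking proper is.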

Another problem is that "No comma after introductory element" is actually quite controversial. Modern writers don't seem to care about this comma, so using the rule as a baseline is not a reliable way to evaluate the checkers. For example, I don't want to implement this rule and enable it by default for exactly this reason: most people don't think a missing comma after an introductory element is actually a severe problem. If you want to evaluate grammar checkers, you had better ignore such cases and focus on errors that are uncontroversially severe, such as confusing "it's" and "its". Implementing a rule that detects a sentence-initial phrase which is not the subject of the sentence is not exactly rocket science, but I don't think the rule is all that important. Also, if you see some error really often in a corpus, it may mean that it is no longer an error but rather an emerging standard usage.
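
For contrast, an uncontroversial error like "its"/"it's" confusion can be caught, at least in clear-cut cases, with a very crude heuristic. The sketch below is hypothetical and far simpler than what a real checker does (no part-of-speech tagging, just a hand-picked word list), but it shows the kind of severe error that makes a better evaluation target:

```python
# Words that rarely follow the possessive "its" but commonly follow
# the contraction "it's" -- a hand-picked, illustrative list.
SUSPICIOUS_AFTER_ITS = {"a", "the", "been", "not", "going"}

def check_its(sentence):
    """Flag likely "its" -> "it's" confusions with a naive heuristic."""
    words = sentence.lower().split()
    warnings = []
    for i, w in enumerate(words[:-1]):
        if w == "its" and words[i + 1] in SUSPICIOUS_AFTER_ITS:
            warnings.append(f"possible its/it's confusion: 'its {words[i + 1]}'")
    return warnings

print(check_its("I think its a problem"))   # one warning
print(check_its("The dog wagged its tail")) # no warnings
```

Even this toy version shows why such errors are good evaluation material: when the rule fires, almost everyone agrees something is wrong, which is not true for the introductory comma.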

The second error on the list is "vague pronoun reference". This is actually much harder to implement. In computational linguistics, we still don't have a good solution for anaphora resolution, and a failure of anaphora resolution is exactly what a vague pronoun reference amounts to. But as anaphora resolution is quite hard (especially for a language such as English, where it is much harder than in Slavonic or Romance languages, whose richer morphology encodes gender information), developers may decide not to focus on it and instead implement things that are much easier. The question is whether you want your grammar checker to complain about Kant's prose (Kant notoriously used vague pronouns) or to find simpler typos and the like. If you think that computer-aided proofreading tools should help edit the text rather than deal with the deep structure of language, then you will not count this error as a good way to evaluate grammar checkers.

Another problem with the list used for the evaluation is that it is based on college essays. College students don't tend to make the mistakes typical of people who use English as their second language (I make different mistakes than native speakers do, for example). As there are more people who use English as a second language than there are native speakers, you would expect the evaluation not to be based solely on native speakers' problems. It seems obvious that a Chinese or Polish student needs computer tools to correct her English far more than a native speaker does.

If you use a biased list of errors, you get biased results. I think it would be interesting to see an evaluation of grammar checkers based on different corpora, as all specialized corpora are biased one way or another. But as long as you try to balance them, you might get better insight, both into the state of the field of grammar checking (or computer-aided proofreading) and into errors as stable patterns of incorrect language usage. So a more exhaustive analysis could use:
  • Language learners error corpora (especially ESL)
  • Corpora of texts marked by professional proofreaders
  • Automatically generated corpora from Wikipedia
Any ideas what to use as baseline measures for evaluation? What kinds of data would be needed?

It's also worth noting that it is much easier to evaluate grammar checkers in terms of precision, by counting the false positives among all the alarms they raise. It is recall, the share of all real errors that a checker actually detects, that remains really hard to evaluate.
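
The asymmetry can be made concrete with the standard definitions. Note the inputs each metric needs: precision requires only a reviewer to classify the checker's own alarms, while recall requires the total number of real errors in the corpus, which only exhaustive manual annotation can provide:

```python
def precision(true_positives, false_positives):
    """Fraction of the checker's alarms that are genuine errors.

    Computable by reviewing just the alarms the checker raised.
    """
    return true_positives / (true_positives + false_positives)

def recall(true_positives, total_errors_in_corpus):
    """Fraction of all real errors that the checker found.

    The denominator requires a fully annotated corpus -- the
    hard-to-obtain quantity discussed above.
    """
    return true_positives / total_errors_in_corpus

# A reviewer can judge 100 alarms (say, 80 genuine) in an afternoon:
print(precision(80, 20))  # 0.8
# But recall stays unknown until someone counts every error by hand.
```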

3 comments:

kiesdan writes...

Dear Marcin,

Thank you for mentioning my research on evaluating grammar checkers.

However, I think you misrepresent my research on two points. First, the twenty most common usage errors are based on the tradition of current usage as it is. Please don't confuse my own discussion of the usage error as evidence that the error type is "controversial" (and hence dismissible).

That was not what I meant to suggest at all.

Secondly, when you write "If you use a biased list of errors, ...," I would like to know what bias you see. The data collected by Lunsford and Connors accurately represent the writing of first-year college/university students. If you have evidence otherwise, present it.

I do not mean to discredit what you and Daniel Naber and the entire LanguageTool community have done. In fact, I support you. Remember that I volunteered my students' work to aid LanguageTool's progress?

However, to measure progress in the development of a grammar checker, we do need a metric. The Connors and Lunsford research is as close to a measure for English as I could find, which is why I have used it for twenty years of grammar checker evaluations.

There must be other possibilities too, and I would love to learn about them.

If you have other metrics, please let me know.

Marcin Miłkowski writes...

Dear Daniel, sorry for not answering for so long. Somehow I missed this comment.

I think partial evidence is sometimes much worse than no evidence. In the case of Lunsford and Connors, the problem is that they count as an error something that is easily seen as correct English, as many usage dictionaries would also confirm. So the top error is actually based on a prescriptive account that has no justification in the data, e.g., in the correction patterns that users actually display. The missing introductory comma therefore does not seem to be an error that we should ever base our metrics on.

You could also use error corpora to measure the progress of grammar checkers. The problem is that these are pretty scarce.

Anyway, we need metrics, but we also need to know what to measure, and I don't think finding such metrics is easy. There is probably no way to get an unbiased measure of the recall of grammar checkers (the number of errors found divided by the total number of errors in the document).

I did attempt a simple analysis in my recent paper on LanguageTool, though, by manually checking not whether the checker detects all errors (which is not feasible for a single person) but whether the suggestions it makes are correct (that is, whether applying them introduces no new errors). This tests only the precision of checkers but seems a little less biased.
