
Reference Corpus or Monitor Corpus?

Linguists who run (computational) experiments favour reference corpora: corpora whose composition is fixed once and for all, and which are publicly available. Because everyone works on the same data, they can compare their results with those of other workers in a meaningful way. For example, P.F. Brown's group suggest using the cross-entropy of a language model, measured on the Brown corpus, as a benchmark for language modelling.

Lexicographers are more interested in language change, because a large part of their task is deciding when an existing dictionary needs to change, or when a range of related dictionaries each need their own, related changes. They need corpora which grow as their in-house readers find new usages. Corpora like this are called monitor corpora. The Cobuild Bank of English (http://titania.cobuild.collins.co.uk/) is a publicly accessible monitor corpus.

A middle ground is occupied by corpora which are assumed to be comparable, such as successive years of the AP Newswire, or the monthly postings to a given Usenet newsgroup. It is not clear, however, what the criteria for comparability are.


Chris Brew
8/7/1998