
Size

The larger the better. For a long time corpora like LOB and Brown, at 1,000,000 words each, were considered large. The BNC, at 100,000,000 words, is what now counts as large. You can create ad-hoc corpora larger than this from electronically available text. There is a trade-off between size and the amount of effort in collection and manual annotation that is practical. The part-of-speech tagging of the Brown corpus is now pretty good, though still not perfect. The parse trees in Penn Treebank 1 are not wonderful; those in the second version are a lot better. The POS tags on the 100,000,000 word BNC are pretty terrible: the tagger used, while good, isn't perfect, and serious post-editing is impractical for a corpus of that size.
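To make the size trade-off concrete, here is a rough back-of-the-envelope sketch in Python. The per-word accuracy used below is purely an assumed figure for illustration, not a measured value for any real tagger, and the function name is made up for this example. It only shows how the same error rate turns into very different amounts of post-editing work at 1,000,000 and 100,000,000 words.

```python
# Illustrative arithmetic only. ASSUMED_ACCURACY is a hypothetical figure,
# not a reported accuracy for the taggers used on Brown, LOB or the BNC.
CORPUS_SIZES = {
    "Brown/LOB": 1_000_000,   # sizes as given in the text
    "BNC": 100_000_000,
}

ASSUMED_ACCURACY = 0.97  # hypothetical per-word tagging accuracy


def expected_tag_errors(num_words: int, tagger_accuracy: float) -> int:
    """Expected number of mis-tagged words, assuming a uniform per-word accuracy."""
    if not 0.0 <= tagger_accuracy <= 1.0:
        raise ValueError("tagger_accuracy must be between 0 and 1")
    return round(num_words * (1.0 - tagger_accuracy))


if __name__ == "__main__":
    for name, size in CORPUS_SIZES.items():
        errors = expected_tag_errors(size, ASSUMED_ACCURACY)
        print(f"{name}: {size:,} words, about {errors:,} expected tag errors")
```

Whatever the true accuracy is, the number of errors left for a human to find grows in proportion to corpus size, which is why hand correction that is feasible for a million-word corpus stops being feasible at a hundred million words.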



Chris Brew
8/7/1998