Next: Copyright and legal matters
Up: Choices in corpus design/collection
Previous: Reference Corpus or Monitor
There are several possible styles of corpus collection.
They are not all mutually exclusive.
- Sample frame: take some manageable proportion of
the entire holdings of a library as the sample
frame (done for Brown, LOB, others).
- Stratified: take steps to ensure that particular
interesting cells are filled in. For example:
- Scientific writing
- Belles lettres
- News reports
- Cowboy fiction
- ...
- Solicited: Henry Thompson invited the readers of
sci.lang etc. to translate two French texts into
English. He got 40-odd self-selected translations for each
text, including ones from Systran and from a self-proclaimed
non-speaker of French. Simone Teufel and Byron Georgantopoulos
solicited LaTeX documents from the CogSci/HCRC community to get
a test set for summarization projects.
- Generated under controlled (or at least specified)
conditions: Henry Thompson and
Chris Brew arranged for translations to be made by the 4th year
translation class at Heriot-Watt University.
- Special purpose: often provided by the partners in a project,
e.g. the collection of abstracts to be assigned keywords
in the SISTA project.
- Generated under editorial control: Yearly transcripts of
newspapers. The Science Citation Index. Will probably need a
lawyer's advice and/or a well-tried consortium agreement.
- Test sets: e.g. the corpora generated for NSF-funded competitions
such as TREC and MUC. Usually a fragment will be held back until
shortly before the competition to prevent cheating. Very costly
in terms of annotator time, since the test sets have to be a reliable
benchmark against which competing systems can be evaluated.
- Opportunistic: if it's available we'll have it. The ECI disk
falls into this category. Also Canadian Hansard and other
documents of public record. Old text which is out of
copyright (Conan Doyle, Shakespeare, Bible ...).
- Secret corpora: We don't see these ...
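
The stratified style above can be made concrete with a short sketch. This is only an illustration under stated assumptions, not a description of how any particular corpus was built: the function name `stratified_sample`, the `(doc_id, genre)` input format and the per-cell quotas are all hypothetical. It draws up to a fixed number of documents from each genre cell and reports the cells that could not be filled, which are the ones that would need extra collection effort.

```python
import random
from collections import defaultdict


def stratified_sample(documents, per_cell, seed=0):
    """Draw up to per_cell[genre] documents from each genre cell.

    documents: iterable of (doc_id, genre) pairs (hypothetical format).
    per_cell:  dict mapping genre -> number of documents wanted.
    Returns (sample, shortfalls), where sample maps genre -> chosen
    doc_ids and shortfalls maps genre -> number of documents missing.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable

    # Group the available documents by genre cell.
    cells = defaultdict(list)
    for doc_id, genre in documents:
        cells[genre].append(doc_id)

    sample = {}
    shortfalls = {}
    for genre, quota in per_cell.items():
        pool = sorted(cells.get(genre, []))
        k = min(quota, len(pool))
        sample[genre] = rng.sample(pool, k)
        if k < quota:
            # This cell is under-filled: more material must be found.
            shortfalls[genre] = quota - k
    return sample, shortfalls


# Illustrative holdings only; the identifiers are made up.
holdings = (
    [(f"sci{i}", "Scientific writing") for i in range(5)]
    + [(f"news{i}", "News reports") for i in range(3)]
    + [("cowboy0", "Cowboy fiction")]
)
quotas = {
    "Scientific writing": 2,
    "News reports": 2,
    "Cowboy fiction": 2,
    "Belles lettres": 1,
}
sample, shortfalls = stratified_sample(holdings, quotas)
```

With these made-up holdings, the well-stocked cells are filled by random choice, while "Cowboy fiction" and "Belles lettres" show up in `shortfalls`. A plain sample-frame approach would skip the quotas and draw one random subset from the whole list.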
Chris Brew
8/7/1998