An annotated list of corpora

The Map Task Corpus
The DCIEM Corpus
The Brown Corpus
The Penn Treebank
The Penn Treebank version 2
The British National Corpus
WordNet
SemCor
COMLEX and its corpus
Moby corpora
Conan Doyle
Moby Dick
Shakespeare
Laurie Bauer's Corpus of New Zealand English
Courtaulds
ECI
MLCC
CELEX
LOB
Susanne
Wall Street Journal
AP Newswire
London-Lund
Helsinki Historical English
Kolhapur Corpus of Indian English
CHILDES
Canadian Hansard
ITU Corpora
ACL/DCI: Association for Computational Linguistics
PENN TREEBANK: The Penn Treebank Project - Release 2
TIPSTER: Information Retrieval Text Research Collection
TIPSTER Volume 1
TIPSTER Volume 2
TIPSTER Volume 3
UN: United Nations Parallel Text Corpus (Complete): English, French, Spanish
CSR-III: Text Corpus Language Model Training Data
JAPANESE NEWS: Japanese Business News Text
SPANISH NEWS: Spanish News Text Collection
ECI/MCI Euro
ACL/DCI
Penn-Helsinki Parsed Corpus of Middle English
Freely available corpora
Middle English
The Linguistics Department at the University of Pennsylvania offers the Penn-Helsinki Parsed Corpus of Middle English, a database of 510,000 words of syntactically parsed Middle English text for use by historical linguists.
Spanish
Three Spanish corpora are freely available on the Internet for research purposes:
Spoken Peninsular Spanish (1 million words)
Written Argentinian Spanish (2 million words)
Written Chilean Corpus (2 million words)
These corpora carry basic tagging in an SGML, TEI-related format that is easy to convert to the latest versions. They are available at http://www.lllf.uam.es/
Institutions
Norwegian Computing Centre for the Humanities (NCCH) with the International Computer Archive of Modern English (ICAME)
ELSNET
Projects
TELRI
Distribution Institutions
Linguistic Data Consortium (LDC)
ELRA
Others
The British National Corpus (BNC)
Cobuild Direct (BOE)
Encyclopedia Britannica (beta)
Speech
ShATR - A Corpus for Auditory Scene Analysis

Speech corpora

Chris Brew
8/7/1998