Alignment of Annotated Corpora
with Original Sources

Colloquium
Center for Information and Language Processing
LMU Munich
04/28/2014

Frank Zengea

Table of Contents

  1. Introduction
  2. Word alignment
  3. Current work on the thesis
    • Alignment of annotated data with original sources
    • Automatic generation of the Penn Treebank tokenization
  4. Remaining work
  5. Conclusion

1. Introduction

2. Word alignment

3. Current work on the thesis

3.1. Alignment

3.1.1. Preparation

  1. Determine text boundaries and save texts to separate files
  2. Search ProQuest Archiver database by keywords for original source files
  3. Decide from the preview if it is the right article
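
Step 1 could be sketched roughly as follows. This is a minimal illustration, not the procedure used in the thesis: it assumes that each text in the corpus file is introduced by a line consisting only of a boundary marker (here ".START", an assumption), and the helper names split_texts and save_texts as well as the output file naming are hypothetical.

```python
from pathlib import Path


def split_texts(lines, marker=".START"):
    """Group lines into texts; a line equal to `marker` opens a new text.

    The marker value is an assumption about the corpus layout.
    Blank lines are dropped.
    """
    texts, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped == marker:
            # A boundary closes the text collected so far (if any).
            if current:
                texts.append(current)
            current = []
        elif stripped:
            current.append(line.rstrip("\n"))
    if current:
        texts.append(current)
    return texts


def save_texts(texts, out_dir):
    """Write each text to its own file (hypothetical naming scheme)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, text in enumerate(texts, start=1):
        path = out_dir / f"text_{i:04d}.txt"
        path.write_text("\n".join(text) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

Keeping one file per text makes it possible to search for, and later align, each article independently of the others.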

3.1.2. Extraction of raw text from original sources

3.1.3. Alignment and evaluation

3.1.4. Evaluation results of the alignment

Manually extracted data vs. Penn Treebank corpus

OCR-extracted data vs. Penn Treebank corpus

3.2. Tokenizer

3.2.1. Features of the Penn Treebank tokenizer

3.2.2. Development and evaluation of the tokenizer

4. Remaining work

5. Conclusion