Alignment of Annotated Corpora
with Original Sources

Colloquium
Center for Information and Language Processing
LMU Munich
04/28/2014

Frank Zengea

Introduction
Word Alignment
Current work on the thesis
- Alignment of annotated data with original sources
- Automatic generation of the Penn Treebank tokenization
Remaining work
Conclusion

1. Introduction

Standard data sets in computational linguistics are usually 'cleaned', which involves:
- Correcting OCR errors
- Resolving misrecognized objects (e.g. tables)
- Removing metadata (e.g. author names or page numbers)
- etc.
Word alignment of the cleaned data with the original sources reveals these changes in the annotated text
If the annotated data is tokenized, the alignment can also be used to recreate the tokenizer
Resources / tools used in this thesis:
- Penn Treebank corpus (for Wall Street Journal articles)
- ProQuest Archiver (search engine for original sources)
- Python with NLTK

2. Word alignment

Most often used in statistical machine translation
- Bilingual translation
- Word sense disambiguation
- Translation lexicons
Used in this thesis: simple word alignment of annotated data with original sources to reveal changes made to the original data

3. Current work on the thesis

Thesis can be divided into two parts:
1. Alignment of annotated corpora with original sources
2. Automatic generation of Penn Treebank tokenization

3.1. Alignment

3.1.1. Preparation

Preconditions:
Penn Treebank corpus is one large text file
- Tokenized and tagged
- No titles or other text boundaries

Original newspaper articles can be found in ProQuest Archiver database
- Search results only show a preview of the article
- Articles have to be purchased for the full view

Preprocessing:

Determine text boundaries and save texts to separate files
Search ProQuest Archiver database by keywords for original source files
Decide from the preview if it is the right article

Original source files are PDF files containing photographs of the newspaper articles
Text extraction is therefore needed to obtain plain text

3.1.2. Extraction of raw text from original sources

Manual extraction
- Unknown which parts of the original text were removed for the Penn Treebank corpus
- Therefore copy everything, including subtitles, author names, etc.

Automatic extraction with OCR tool
- Tesseract OCR with gImageReader (GUI)
- Mark and extract text from distinct text blocks (newspaper articles are often written in columns)
- Resolve hyphenation and delete redundant lines and spaces
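
The clean-up step for OCR output can be sketched as follows. This is a minimal illustration, not the script used in the thesis: the function name `clean_ocr_text` is hypothetical, and the sketch assumes that every hyphen at the end of a line marks a word that was split by the line break.

```python
import re

def clean_ocr_text(raw):
    """Rejoin words hyphenated across line breaks and drop redundant whitespace."""
    # Assumption: a hyphen directly before a line break is a soft hyphen,
    # so "news-\npaper" becomes "newspaper". Genuine compounds such as
    # "well-\nknown" would be joined incorrectly and need manual review.
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", raw)
    # Blank lines separate paragraphs; single line breaks are layout only.
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse runs of spaces and line breaks inside each paragraph.
    cleaned = [" ".join(paragraph.split()) for paragraph in paragraphs]
    # Drop paragraphs that are empty after cleaning (redundant lines).
    return "\n".join(paragraph for paragraph in cleaned if paragraph)
```

Text blocks marked in gImageReader would be passed through such a function one column at a time before alignment.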

3.1.3. Alignment and evaluation

Normalize text files:
1. Reverse tokenization: transform tokenized and annotated Penn Treebank text into plain text
2. Apply identical tokenization to all text files
Align extracted original data with Penn Treebank data
Evaluation: calculate the Levenshtein distance between the two words of each aligned pair
- e.g.: distance between 'work' and 'word' is 1
Calculate the average Levenshtein distance to quantify the difference between two texts
Write a log file containing each word pair and its distance for detailed information
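
The alignment and evaluation steps above can be sketched in a few lines of Python. The slides do not say which alignment algorithm is used, so this sketch substitutes `difflib.SequenceMatcher` from the standard library as an assumption. The function names `levenshtein`, `align_words` and `average_distance` are hypothetical, and pairing unmatched words with an empty string is one possible convention, not necessarily the one used in the thesis.

```python
from difflib import SequenceMatcher

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or keep
            ))
        previous = current
    return previous[-1]

def align_words(source, target):
    """Pair the words of two token lists; unmatched words are paired with ''."""
    pairs = []
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = list(source[i1:i2]), list(target[j1:j2])
        # Pad the shorter side so that every word lands in exactly one pair.
        width = max(len(left), len(right))
        left += [""] * (width - len(left))
        right += [""] * (width - len(right))
        pairs.extend(zip(left, right))
    return pairs

def average_distance(pairs):
    """Mean Levenshtein distance over all aligned word pairs."""
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)
```

A log in the spirit of the last bullet could then be produced with `for a, b in pairs: print(a, b, levenshtein(a, b))`. Under this convention a word that exists in only one text contributes its full length to the average, which is why removed titles or author names raise the score noticeably.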

3.1.4. Evaluation results of the alignment

Manually extracted data vs. Penn Treebank corpus

Average Levenshtein distance: 5.41
Penn Treebank corpus has been modified: removal of
- Titles
- Subheadings
- Author names
- Page numbers
- etc.
Average Levenshtein distance after adapting the texts for these removals: 1.71
Reason for the remaining distance: punctuation marks are translated, e.g. "(" becomes "-LRB-"
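
One way to account for the translated punctuation marks before alignment is a simple lookup table, sketched below. Only "-LRB-" and "-LSB-" appear in these slides; the remaining entries follow the usual Penn Treebank bracket and quote conventions and are included as an assumption. The names `PTB_TO_PLAIN` and `restore_punctuation` are hypothetical.

```python
# Mapping from Penn Treebank tokens back to plain-text punctuation.
# Assumption: entries other than -LRB- and -LSB- are taken from the
# common Penn Treebank conventions, not from the slides.
PTB_TO_PLAIN = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
    "``": '"', "''": '"',
}

def restore_punctuation(tokens):
    """Replace translated punctuation tokens and leave all other tokens unchanged."""
    return [PTB_TO_PLAIN.get(token, token) for token in tokens]
```

Applied to the Penn Treebank side before alignment, such a mapping removes distances that stem only from the notation of brackets and quotation marks.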

OCR extracted data vs. Penn Treebank

Average Levenshtein distance: 5.33
Average Levenshtein distance after adapting: 5.01
Main reason: recognition errors for:
- Symbols at the beginning and end of lines
- Uppercase words
- Punctuation marks
- etc.
Conclusion: the image quality of the original data is not sufficient for automated text extraction

3.2. Tokenizer

First approach:
- Use the existing Penn Treebank tokenizer from NLTK
- But the tokenizer assumes the text is already segmented into sentences
- Problem: the data sample is too small to train a sentence tokenizer
Next approach:
- The Penn Treebank corpus is already tokenized
- Use the alignment to extract features of the tokenizer
- Create a tokenizer from scratch using regular expressions

3.2.1. Features of the Penn Treebank tokenizer

Punctuation marks are not removed but translated, e.g. "[" → "-LSB-"
Special tokenization of quotation marks: "text" → `` + text + ''
Named entities are tokenized differently:
- "Amadou-Mahtar" → "Amadou" + "Mahtar"
- "M'Bow" → "M'Bow"
Some words are split without the split being marked (which makes the tokenization difficult to reverse)
- "won't" → "wo" + "n't"
- "gonna" → "gon" + "na"
Additional newlines after each full sentence → normalization needed

3.2.2. Development and evaluation of the tokenizer

Use the Levenshtein distance again for evaluation
But this time compare against the original tokenized version of the Penn Treebank corpus
Development phase:
1. Define or extend tokenizer rules (based on the evaluation)
2. Alignment and evaluation
3. Repeat
Evaluation result: average Levenshtein distance: 0.27
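
A rule-based tokenizer of this kind can be sketched with a handful of regular expressions. The example below covers only some of the features listed above (bracket translation, quotation marks, "n't" and "gonna" splits, sentence-final period) and is an illustration under these assumptions, not the tokenizer developed in the thesis. The names `BRACKETS` and `tokenize` are hypothetical, and the splitting of hyphenated named entities is left out because it cannot be decided by a regular expression alone.

```python
import re

# Bracket translations; only "[" and "(" are shown in the slides, the
# others follow the usual Penn Treebank convention (assumption).
BRACKETS = {"(": "-LRB-", ")": "-RRB-", "[": "-LSB-",
            "]": "-RSB-", "{": "-LCB-", "}": "-RCB-"}

def tokenize(sentence):
    """Tokenize one sentence with a few Penn-Treebank-style rules."""
    text = sentence
    # Opening quotes (at the start or after whitespace/bracket) become ``.
    text = re.sub(r'(^|[\s(\[{])"', r"\1 `` ", text)
    # All remaining double quotes are closing quotes and become ''.
    text = text.replace('"', " '' ")
    # Translate brackets instead of removing them.
    text = re.sub(r"[()\[\]{}]", lambda m: " " + BRACKETS[m.group()] + " ", text)
    # Separate common punctuation marks from the neighbouring words.
    text = re.sub(r"([,;:?!])", r" \1 ", text)
    # Split off a sentence-final period, also when a closing quote follows.
    text = re.sub(r"\.(?=\s*(?:''\s*)?$)", " . ", text)
    # Unmarked splits: "won't" -> "wo" + "n't", "gonna" -> "gon" + "na".
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    text = re.sub(r"\b(gon)(na)\b", r"\1 \2", text, flags=re.IGNORECASE)
    return text.split()
```

In the development loop described above, the output of such a function would be compared with the original Penn Treebank tokens, and each rule added or adjusted according to the word pairs with the largest distances.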

5. Remaining Work

Increase the sample size for more accurate results and testing
Train a sentence tokenizer on a larger sample so that the Penn Treebank tokenizer from NLTK can be used
Use statistical methods to improve OCR results and general automation

6. Conclusion

Alignment shows which elements were modified in the Penn Treebank corpus:
- Titles, author names, subheadings, etc. are removed
- Punctuation marks are translated
- Sentence boundaries are marked (additional newlines)
Alignment can also be used to extract features of the Penn Treebank tokenizer
Recreated tokenizer achieves an average Levenshtein distance of 0.27
But automation can be improved by implementing statistical methods
