Data-Intensive Linguistics

Contents
What is Data-Intensive Linguistics?
Introduction
Aims of the book
Recommended reading
Prerequisites
Chapters
The heart of our approach
A first language model
Ambiguity for beginners
Summary
Exercises
Historical roots of Data-Intensive Linguistics
Why provide a history?
The history
Earliest times
Pāṇini
Public availability of text
The Rosetta stone
Käding
Motivations for the scientific study of communication
Telegraphy and Telephony
Information Theory
Summary
Key ideas
Key applications
Questions
Finding Information in Text
Tools for finding and displaying text
Search tools for data-intensive linguistics
Using the UNIX tools
Sorting and counting text tokens
Lemmatization
Making n-grams
Filtering: grep
Selecting fields
AWK commands
AWK as a programming language
PERL programs
Summary
A final exercise
A compendium of UNIX tools
Text processing
Data analysis
Concordances and Collocations
Concordances
Keyword-in-Context index
Collocations
Stuttgart corpus tools
Getting started
Queries
Manipulating the results
Other useful things
Collecting and Annotating Corpora
Corpus design
Introduction
Choices in corpus design/collection
Reference Corpus or Monitor Corpus?
Where to get the data?
Copyright and legal matters
Choosing your own corpus
Size
Generating your own corpus
Planning
Subject population and Treatment
Language
Extraneous factors
Human factors in annotation
Be conservative (small c)
Which annotation scheme?
The semantics of annotations
External format of annotations
An annotated list of corpora
Speech corpora
SGML for Computational Linguists
Annotation Tools
Statistics for Data-Intensive Linguistics
Probability and Language Models
Events and probabilities
Events:
Random variables:
Probabilities:
Conditional probabilities and independence:
Bayes' rule
Medical diagnosis:
Statistical models of language
Case study: Language Identification
Unique strings
Common words
Markov models
Bayesian Decision Rules
Choice of priors may not matter:
Estimating Model Parameters
Results
Summary
Applying probabilities to Data-Intensive Linguistics
Contingency Tables
Text preparation
Contingency tables
Counting words in documents
Introduction
Bigram probabilities
Words and documents
Probability and information
Introduction
Data-intensive grocery selection
Entropy
Cross entropy
Why is the cross-entropy always more than the entropy?
Summary and self-check
Questions:
Hidden Markov Models and Part-of-Speech Tagging
Graphical presentations of HMMs
Example
Transcript
Applications of Data-Intensive Linguistics
Statistical Parsing
Introduction
The need for structure
Why statistical parsing?
The components of a parser
The standard probabilistic parser
Varieties of probabilistic parser
Exhaustive search
Beam search
Left incremental parsers
Alternative figures of merit
Probabilistic LR Parsing
Left-Corner Language Models
Data-Oriented Parsing
Word Statistics
Lexicalized grammars
Categorial Grammars
Tree-adjoining Grammars
Dependency-based Grammars
Link Grammar
Parsing as statistical pattern recognition
Lexical Dependency Parsing
Decision-tree parsing
Maximum entropy parsing
Oracle-based parsing
Conventional techniques for shallow parsing
Summary
Exercises
References
Chris Brew
8/7/1998