Index-based Statistical Analysis of Large Text Corpora

Summary. The statistical analysis of large text corpora is a fundamental method for gaining insights into the structure of language. Insights obtained from intelligent corpus analysis can be used for a large number of specific applications in natural language processing (NLP), such as grammar development, machine translation, terminology and named entity extraction, text correction, semantic text analysis, and others. Progress in these fields leads to improvements in related applications in information science (search engine technology) and many other text-oriented disciplines. For the EU, these applications are of major interest: the need to translate documents between 24 European languages, the goal of multilingual access to textual European cultural heritage, and the necessity of establishing an independent European web search technology make long-term improvements in these fields crucial.
The core contribution of this project is a new methodology aimed at fundamentally improving the statistical analysis of large text corpora. A weakness of current methods in corpus analysis is their insufficient use of contextual information. Properly understanding the role, function and meaning of a phrase or word (which is important for many applications, e.g., translation and search) is often only possible when sentence or paragraph contexts are taken into account. Building on the latest developments in text indexing, we intend to develop and study a new representation of corpora that is superior to present formats in three respects. Firstly and most importantly, it offers much better access to contextual information. Secondly, it yields better criteria for distinguishing between arbitrary and meaningful parts of text. Thirdly, it gives hints on how to compose and decompose phrases. With all these features, the new index-based corpus representation provides a solid basis for fundamentally improving the statistical analysis of corpora. Current advances in hardware ensure that an index rich in information on phrases and contexts can be computed, processed and analyzed in practice for larger and larger input corpora, which explains why this research should be undertaken now.
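To make the idea of index-based access to contexts more concrete, the following minimal Python sketch may help. It is not the representation proposed in this project, whose details are not given here. It assumes a plain token-level suffix array as a stand-in index, and the function names (build_suffix_array, find_occurrences, context_counts) and the toy corpus are hypothetical, chosen only for illustration. The sketch shows how an index can return all occurrences of a phrase together with counts of its immediate left and right neighbours.

```python
from collections import Counter


def build_suffix_array(tokens):
    """Return the start positions of all token suffixes in lexicographic order.

    Naive construction for illustration only; large corpora would need a
    dedicated suffix array construction algorithm.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, phrase):
    """Return the sorted start positions of `phrase` (a list of tokens)."""
    n = len(phrase)

    def prefix(rank):
        start = suffix_array[rank]
        return tokens[start:start + n]

    # Lower bound: first suffix whose n-token prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose n-token prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


def context_counts(tokens, positions, phrase_length):
    """Count the tokens immediately left and right of each occurrence."""
    left = Counter(tokens[p - 1] for p in positions if p > 0)
    right = Counter(
        tokens[p + phrase_length]
        for p in positions
        if p + phrase_length < len(tokens)
    )
    return left, right


# Hypothetical toy corpus, already tokenized on whitespace.
corpus = (
    "the new index gives context . "
    "the new index gives hints . "
    "a new index helps"
).split()

suffix_array = build_suffix_array(corpus)
phrase = ["new", "index"]
positions = find_occurrences(corpus, suffix_array, phrase)
left, right = context_counts(corpus, positions, len(phrase))
```

In this toy setting, a phrase whose neighbours vary (here "new index", preceded by both "the" and "a") behaves differently from a fragment that is always followed by the same token. Neighbour statistics of this kind are one simple example of the contextual information such an index can expose.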