Project Summary

Keywords: character sets, multilingual electronic dictionaries, morphological models, text alignment

The construction of large-scale electronic dictionaries, together with systematic recourse to large text corpora, has been a widespread tendency in language research over the past few years. It has become clear that all natural language processing tasks presuppose detailed coverage of the lexical strata of the natural languages in question. All the partners in the present project have made the construction and use of electronic dictionaries a central goal of their language research and have produced significant lexical databases and corpus tools.

The goals of the BILEDITA project are essentially threefold. First, we propose to provide a uniform dictionary format, at the level of character representation, for all of the existing electronic dictionaries constructed by the partners, since character representation is a principal obstacle to any systematic integration of Slavic and Latin character sets. Second, we propose to provide a uniform lexical encoding scheme for the electronic dictionaries, covering both the form and the content of the entries. This task has already been dealt with systematically for the French and German dictionaries; it remains to be accomplished for the other languages in the project. It is largely, but not exclusively, a matter of format conversion; the other important part of it is the unification of morphological models. Once this is done, the third and most important goal of the project can be undertaken: the systematic construction and exploitation of bilingual corpora for the purpose of building bilingual basic dictionaries, terminology dictionaries, and phrasal dictionaries.

The project is directly relevant to all areas of systematic language research in computational settings, and in particular to various approaches to automatic translation, multilingual information retrieval, and computer-assisted language learning. Without detailed, large-scale bilingual dictionaries, none of these tasks has any hope of succeeding in a realistic setting.

Many concrete applications are expected to follow from the material to be prepared in the present project; a typical result, to be realized in two instances, will be the systematic extraction of legal or business terminology from existing bilingual text databases. Clearly, such applications will also have many practical effects in the future integration of the Eastern and Western parts of Europe.

The Main Work Packages and Tasks

(excluding organizational and technical tasks)

WP1: A Uniform Representation Format of Different Character Sets and Primary Comparison of Dictionaries. Tasks:

1.1. Description by each partner of its local electronic dictionary, including the morphological model used.
1.2. Comparison of the dictionaries and establishment of their optimal common denominator.
1.3. A proposal for a standard character set representation appropriate for multilingual documents in the European languages.
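
As an illustration of what Task 1.3 involves, the sketch below converts dictionary entries from several legacy encodings into one common representation and back. It is only a minimal sketch under stated assumptions: the project summary does not name the partners' encodings or the proposed standard, so the encoding table, the choice of Unicode (NFC) as the common form, and the function names `to_common_form` and `from_common_form` are all hypothetical.

```python
import unicodedata

# Hypothetical per-language source encodings. The encodings actually used
# by the partners' dictionaries are not stated in the project summary.
SOURCE_ENCODINGS = {
    "russian": "koi8_r",
    "bulgarian": "cp1251",
    "french": "latin_1",
    "german": "latin_1",
    "polish": "iso8859_2",
}


def to_common_form(raw: bytes, language: str) -> str:
    """Decode a legacy-encoded entry and normalize it to Unicode NFC.

    Unicode is an assumption made for this sketch, not the format
    proposed by the project.
    """
    text = raw.decode(SOURCE_ENCODINGS[language])
    return unicodedata.normalize("NFC", text)


def from_common_form(text: str, language: str) -> bytes:
    """Encode a common-form entry back into the language's legacy encoding."""
    return text.encode(SOURCE_ENCODINGS[language])


if __name__ == "__main__":
    legacy = "слово".encode("koi8_r")  # simulated entry from a Russian lexicon
    entry = to_common_form(legacy, "russian")
    print(entry, from_common_form(entry, "russian") == legacy)
```

Once every entry is in a single representation, Slavic and Latin headwords can be stored, compared, and sorted in the same file, which is the precondition for the dictionary comparison in Task 1.2 and the lexicon port in Task 2.1.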

WP2: Exemplary Bilingual Electronic Dictionaries (BEDs) with Retrieval Tools. Tasks:

2.1. Local lexicon port - by each partner, into the character set representation of 1.3 and into a format unified to the greatest degree possible.
2.2. Optimal unification of morphological models for the German-Russian and French-Bulgarian pairs of languages.
2.3. Development of exemplary BEDs with a limited number of word entries for these language pairs, including implementation of information retrieval tools for them (e.g., sort, search).
2.4. Extension of the relatively small Polish dictionary.
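
The retrieval tools mentioned in Task 2.3 (sort, search) can be pictured with a small sketch. Everything in it is an assumption made for illustration: the toy German-Russian entries, the case-insensitive sort key (real collation would be language-specific), and the function names `build_index` and `prefix_search` do not come from the project.

```python
from bisect import bisect_left

# Toy German-Russian entries (headword, translation); illustrative only.
ENTRIES = [
    ("Vertrag", "договор"),
    ("Gesetz", "закон"),
    ("Geschäft", "сделка"),
    ("Gericht", "суд"),
]


def sort_key(headword: str) -> str:
    """Case-insensitive key; a real tool would use language-specific collation."""
    return headword.casefold()


def build_index(entries):
    """Return the entries sorted by headword, ready for binary search."""
    return sorted(entries, key=lambda entry: sort_key(entry[0]))


def prefix_search(index, prefix: str):
    """Return all entries whose headword starts with the given prefix."""
    key = sort_key(prefix)
    keys = [sort_key(headword) for headword, _ in index]
    start = bisect_left(keys, key)  # first headword not smaller than the prefix
    hits = []
    for headword, translation in index[start:]:
        if not sort_key(headword).startswith(key):
            break  # matching headwords are contiguous in sorted order
        hits.append((headword, translation))
    return hits


if __name__ == "__main__":
    index = build_index(ENTRIES)
    for headword, translation in prefix_search(index, "Ges"):
        print(headword, translation)
```

Keeping the entries sorted makes prefix lookup a binary search followed by a short scan, which is adequate for the limited number of entries foreseen for the exemplary BEDs.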

WP3: Alignment of Bilingual Corpora and Its Use for BED Extension. Tasks:

3.1. Representative bilingual text corpora - creation and/or purchase in pairwise cooperation of partners for the Russian-German, Bulgarian-French and Polish-French language pairs.
3.2. Improved alignment algorithms using BEDs - development, implementation and testing.
3.3. Computer-aided incrementation of BEDs - investigation and development of methods.
3.4. Development of an exemplary Polish-French BED, as in 2.3.
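
To make the idea behind Task 3.2 concrete, the following sketch aligns sentences of two texts by rewarding sentence pairs in which a BED translation of a source word occurs in the target sentence. It is not the project's algorithm, which the summary does not describe; it is a simple dynamic-programming illustration, and the scoring function, the one-to-one-with-skips alignment model, and the toy German-Russian data are all assumptions.

```python
def lexical_score(src: str, tgt: str, bed: dict) -> float:
    """Fraction of source tokens that have a BED translation in the target."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for word in src_tokens if bed.get(word, set()) & tgt_tokens)
    return hits / len(src_tokens)


def align(src_sents, tgt_sents, bed):
    """Pair sentences one-to-one, allowing skips, maximizing total score."""
    n, m = len(src_sents), len(tgt_sents)
    # best[i][j]: best total score for the first i source and j target sentences
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = lexical_score(src_sents[i - 1], tgt_sents[j - 1], bed)
            best[i][j] = max(best[i - 1][j - 1] + score,
                             best[i - 1][j], best[i][j - 1])
    # Trace back through the table to recover the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        score = lexical_score(src_sents[i - 1], tgt_sents[j - 1], bed)
        if score > 0 and best[i][j] == best[i - 1][j - 1] + score:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1  # source sentence i-1 left unaligned
        else:
            j -= 1  # target sentence j-1 left unaligned
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    bed = {"vertrag": {"договор"}, "gericht": {"суд"}, "gesetz": {"закон"}}
    german = ["Der Vertrag ist gültig", "Das Gericht tagt heute", "Das Gesetz gilt"]
    russian = ["договор действителен", "суд заседает сегодня",
               "это лишнее предложение", "закон действует"]
    print(align(german, russian, bed))
```

In this toy run the third Russian sentence has no counterpart and is skipped. A realistic aligner would also need morphological normalization (the unified models of Task 2.2) and probably sentence-length information, neither of which is modelled here.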

WP4: Full-fledged Bilingual Terminological Dictionary and Textual Database. Tasks:

4.1. Extension of 3.2. - research and development on the further improvement of alignment methods, using e.g. phrase dictionaries and other types of higher-level linguistic information.
4.2. Extension of 3.3. - research and development on the further improvement of BED extension methods, using e.g. phrase-based and other types of "intelligent" alignment.
4.3. Full-fledged bilingual terminological dictionary and textual database creation, using all the methods and tools developed in the course of the whole project and thus providing a realistic testing ground and practical application for them. For the Russian-German and Bulgarian-French language pairs, creation of realistic BEDs in the areas of business, technology, or legislation, together with bilingual textual databases for these areas.
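
The computer-aided extension of BEDs (Tasks 3.3 and 4.2) can be illustrated by proposing candidate word pairs from already aligned sentences. The sketch below uses the Dice coefficient on co-occurrence counts, a common association measure chosen here as an assumption; the summary does not say which methods the project will use. The function name `propose_entries`, the thresholds, and the toy French-Bulgarian data are hypothetical.

```python
from collections import Counter
from itertools import product


def propose_entries(aligned_pairs, min_dice=0.8, min_count=2):
    """Propose (source word, target word, Dice score) candidates for a BED."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_freq.update(src_words)  # sentence-level frequencies
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))  # co-occurrences
    candidates = []
    for (s, t), joint in pair_freq.items():
        if joint < min_count:
            continue  # too rare to judge
        dice = 2 * joint / (src_freq[s] + tgt_freq[t])
        if dice >= min_dice:
            candidates.append((s, t, dice))
    # Best candidates first; ties broken alphabetically for stable output.
    return sorted(candidates, key=lambda c: (-c[2], c[0], c[1]))


if __name__ == "__main__":
    pairs = [
        ("le contrat est signé", "договорът е подписан"),
        ("le contrat expire", "договорът изтича"),
        ("la loi est votée", "законът е приет"),
    ]
    for s, t, dice in propose_entries(pairs):
        print(s, t, dice)
```

On this tiny sample the article "le" scores as well as "contrat" against the same Bulgarian word, which shows why such output can only be a list of suggestions for a lexicographer to review, and why phrase-level information of the kind named in Tasks 4.1 and 4.2 would be needed to filter it.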




Tourovski Vladimir - Ludwig Kuffer
Wed Nov 8 13:46:46 MET 1995