Typology and Dialect Dynamism: Analysis of Data from Contemporary Media
This project is a cooperation between the Center for Information and Language Processing (Centrum für Informations- und Sprachverarbeitung, CIS) at the Ludwig Maximilian University (LMU) Munich (Dr. Desislava Zhekova) and the Department of Romance Linguistics at LMU (Prof. Thomas Krefeld). The project also cooperates with the Information Science Department at the University of Groningen (Prof. John Nerbonne).
The project poses two main research questions. The first is whether a novel type of dialect data, which is not primed and consists of written rather than orally produced dialect texts (e.g. Wikipedia articles and Twitter microblogs), can be applied to the problem of quantitative morpho-syntactic typology. The second is whether state-of-the-art metrics for measuring the distance between varieties can, on data that is not word-aligned, achieve a good separation of the dialects with respect to both morphology and syntax.
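As a rough illustration of what measuring distance on non-word-aligned data can look like, the sketch below compares two texts through their character trigram profiles and the cosine distance between them. This is a stand-in chosen for brevity, not one of the metrics evaluated in the project; the function names and the toy sentences are assumptions made for the example, and character n-grams capture morpho-syntactic differences only indirectly.

```python
from collections import Counter
import math


def ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (illustrative helper)."""
    # Lowercase, collapse whitespace, and pad so word edges form n-grams too.
    padded = f" {' '.join(text.lower().split())} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity of two sparse profiles; 1.0 if either is empty."""
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


# Toy strings for illustration only, not project data.
standard = ngram_profile("la casa è bella")
variety = ngram_profile("a casa è bedda")
print(cosine_distance(standard, variety))
```

No word alignment is needed because each text is reduced to a frequency profile before comparison; the same property makes the measure insensitive to word order beyond the n-gram window.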
We are also interested in exploring distance both microdiachronically and synchronically. Wikipedia delivers a novel type of data in that each version of an article is often edited by a new author. This suggests that a newly created article may be typologically close to either the dialect or the standard, but that with every further revision it becomes a mixture of text produced by multiple authors and thus reflects the typological features of a dialect article that is written collaboratively. A microdiachronic investigation of the changes in dialect specificity is therefore highly appropriate for this data. To our knowledge, a study of this dimension on this novel data type has not yet been performed.
In the course of this work, a novel dialect corpus for Italian varieties will be created and syntactically annotated. This corpus will not be subjected to syntactic priming as described in section 1. The corpus will also be made publicly available through free web access and distribution to digital catalogues.
SiMoN - a preliminary release of the morphological parser SiMoN, developed by Simeon Herteis in the context of his Bachelor's thesis. SiMoN is an extension of AnIta, a morphological parser for Italian (Tamburini and Melandri, 2012). Currently, SiMoN adds about 60 regular Sicilian verbs that were manually integrated into the vocabulary. It also includes a number of automatically collected verbs extracted from Wikipedia.
tweetNorm - the tweet normalizer for Italian, developed by Daniel Weber in the context of his Bachelor's thesis. The implementation is a rule-based approach following state-of-the-art techniques for text normalization.
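To give a sense of what a rule-based normalizer of this kind does, here is a minimal sketch. It is not tweetNorm's code: the rule set, the placeholder tokens `<URL>` and `<USER>`, and the small abbreviation table are all assumptions chosen for illustration.

```python
import re

# Hypothetical lookup table of common Italian texting abbreviations.
ABBREVIATIONS = {
    "nn": "non",
    "cmq": "comunque",
    "xché": "perché",
    "tvb": "ti voglio bene",
}

# Ordered rewrite rules, applied one after another to the whole tweet.
RULES = [
    (re.compile(r"https?://\S+"), "<URL>"),   # mask links
    (re.compile(r"@\w+"), "<USER>"),          # mask user mentions
    # Collapse a character repeated 3+ times to a single one. Genuine
    # double letters survive, but a stretched double ("belllla") would
    # be over-shortened; a real system needs a dictionary check here.
    (re.compile(r"(\w)\1{2,}"), r"\1"),
]


def normalize(tweet):
    """Apply the regex rules, then expand abbreviations token by token."""
    text = tweet
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    tokens = [ABBREVIATIONS.get(tok.lower(), tok) for tok in text.split()]
    return " ".join(tokens)


print(normalize("cmq nn vengo, ciaooooo @mario http://t.co/x"))
# prints: comunque non vengo, ciao <USER> <URL>
```

The order of the rules matters: links are masked first so that later rules cannot alter characters inside a URL.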