XML/SGML and Computational Linguistics
This page collects links to pages related to the XML (and SGML) standards
and their use in the field of (computational) linguistics.
Comments and descriptions taken from the respective web pages are meant to
simplify orientation. (Many thanks to Eduardo Torres for his help in preparing this page.)
1. Homepages of groups, standards, organizations and initiatives
- LDC Linguistic Data Consortium: Linguistic Annotation
The LDC supports language-related
education, research and technology development by creating and sharing
linguistic resources: data, tools and standards. The page describes tools and
formats for creating and managing linguistic annotations. `Linguistic
annotation' covers any descriptive or analytic notations applied to raw
language data. The basic data may be in the form of time functions -- audio,
video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
"named entity" identification, co-reference annotation, and so on. The focus
is on tools which have been widely used for constructing annotated linguistic
databases, and on the formats commonly adopted by such tools and databases.
(The page includes a large index with many links to workshops, projects and
research groups.)
- EAGLES Expert Advisory Group on Language Engineering Standards
The Expert Advisory Group on Language Engineering Standards (EAGLES) is an
initiative of the European Commission, within the DG XIII Linguistic Research and
Engineering programme, which aims to accelerate the provision of standards
for:
- Very large-scale language resources (such as text corpora,
computational lexicons and speech corpora);
- Means of manipulating such knowledge, via computational linguistic
formalisms, markup languages and various software tools;
- Means of assessing and evaluating resources, tools and products.
Numerous well-known companies, research centres, universities and professional
bodies across the European Union are collaborating under the aegis of EC
DGXIII to produce the EAGLES Guidelines which set out recommendations for de
facto standards and for good practice in the above areas of language
engineering.
The EAGLES initiative is coordinated by Consorzio Pisa Ricerche, Pisa,
Italy, which also manages the EAGLES home page and the EAGLES ftp server.
The work towards common specifications is carried out by five working groups:
- Text Corpora
- Computational Lexicons
- Grammar Formalisms
- Evaluation
- Spoken Language
These groups first establish common methodologies for the five areas of
concern within EAGLES, in order to arrive subsequently at de facto
standards.
The Corpus Encoding Standard (CES) and its XML version (XCES), both linked
below, are standards developed by EAGLES.
- CES Corpus Encoding Standard
This document is the first version of the
Corpus Encoding Standard (CES), which is part of the EAGLES
Guidelines developed by the Expert Advisory Group on Language Engineering
Standards (EAGLES). The CES is designed to be optimally suited for use in
language engineering research and applications, in order to serve as a widely
accepted set of encoding standards for corpus-based work in natural language
processing applications. The CES is an application of SGML (ISO 8879:1986,
Information Processing--Text and Office Systems--Standard Generalized Markup
Language) compliant with the specifications of the TEI Guidelines for
Electronic Text Encoding and Interchange of the Text Encoding Initiative.
- XCES
The XML version of the CES (see above). Provides links to
DTDs for documents, aligned data and annotated data.
- OASIS Organization for the Advancement of Structured Information Standards
OASIS is a non-profit, international consortium that creates interoperable
industry specifications based on public standards such as XML and SGML.
OASIS members include organizations and individuals who provide, use and
specialize in implementing the technologies that make these standards work
in practice.
- OASIS links to XML applications and projects
Offers links to an enormous number of projects and applications, including
many of the links to be found here.
- NLSML Natural Language Semantics Markup Language
See also NLSML
- XML for bibliographic references
- VocML Vocabulary Markup Language
- Taxonomic Markup Language
- Description Logics Markup Language
- Text Encoding Initiative
The Text Encoding Initiative (TEI) is an international project to develop
guidelines for the preparation and interchange of electronic texts
for scholarly research, and to satisfy a broad range of uses by the language
industries more generally. The web page provides information on
history, organization, TEI applications, guidelines, tutorials, etc.
- XML and TEI
OASIS page with links to XML topics within TEI. See also Academic Applications
- XLT: XML representation of Lexicons and Terminologies
XLT is an XML-based application
developed with the intent of facilitating the exchange of lexicons and
terminologies. The primary member of the XLT family of formats is Default
XLT Format (DXLT). (The page includes links to the specification, tutorials
and projects, including SALT: Standards-based Access service to multilingual
Lexicons and Terminologies. The SALT page includes deliverables and papers.)
- LACITO Linguistic Data Archiving Project
The goals of the LACITO linguistic data archiving project are the
conservation and the distribution of speech data. To these ends it has
developed norms for the preparation and exploitation of documents
incorporating sound and text using internationally recognized standards (SGML
and XML).
- SABLE V1.0
The draft SABLE specification is an initiative to establish a standard system for
marking up text input to speech synthesizers. The current draft (version 1.0)
is being circulated for comment by users, developers and researchers of speech
synthesis.
- MATE
The MATE (Multilevel Annotation, Tools Engineering) project aims to facilitate
re-use of language resources by addressing the problems of creating,
acquiring, and maintaining language corpora. The problems are addressed along
two lines:
- through the development of a standard for annotating resources;
- through the provision of tools which will make the processes of
knowledge acquisition and extraction more efficient.
Specifically, MATE will treat spoken dialogue corpora at multiple levels,
focusing on prosody, (morpho-) syntax, co-reference, dialogue acts, and
communicative difficulties, as well as inter-level interaction. The results of
the project will be of particular benefit to developers of spoken language
dialogue systems but will also be directly useful for other applications of
language engineering.
Partners:
Odense, Torino, Barcelona, DFKI Saarbruecken, Edinburgh, Stuttgart, Pisa,
Madrid.
Page includes links to deliverables, publications, partners, project
description etc.
- CELLAR -- Computing
Environment for Linguistic, Literary, and Anthropological Research
CELLAR is an object-oriented database system that is being developed
by the Academic Computing Department of SIL to meet the data management needs
of our field workers. Two of its special features are the ability to cope
simultaneously with data in many languages, and a design which separates the
conceptual model of a data set from multiple (interchangeable) views for
display and encoding formats for import and export. While important aspects of
the design were motivated by the needs of linguistic research, the system is
fully programmable and can be used to develop text-related (as opposed to
number-crunching) applications for any discipline. Applications include
phonological analysis, interlinear text analysis, lexical database management
and others. (Page includes links to technical overview and papers.)
- SGML: ECI (European Corpus Initiative)
"The European Corpus Initiative was founded to oversee the
acquisition and preparation of a large multi-lingual corpus to be made
available in digital form for scientific research at cost and without
royalties. We believe that widespread easy access to such material would be a
great stimulus to scientific research and technology development as regards
language and language technology. . . . No amount of abstract argument as to
the value of corpus material is as powerful as the experience of actually
having access to some in one's laboratory."
- ATLAS Architecture and Tools for Linguistic Analysis Systems
ATLAS is a recent
initiative involving NIST, LDC and MITRE. ATLAS addresses an array of
application needs spanning corpus construction, evaluation infrastructure,
and multi-modal visualization.
The principal goal of ATLAS is to provide powerful abstractions over annotation
tools and formats in order to maximize flexibility and extensibility. Our
approach has been to isolate and abstract over the physical and logical levels
of annotation tools and formats, leaving application- and domain-specific
issues to the side.
ATLAS Level 0, also known as Annotation Graphs, provides a data model
for working with linear signals (such as text and audio) indexed by intervals.
ATLAS Level 1 is a generalized model, suitable for annotating signals of
essentially arbitrary dimensionality with annotations having essentially
arbitrary structure. An early application of ATLAS Level 1 is OCR annotation,
where textual images are indexed using bounding boxes.
Both models utilize an XML-based interchange format for data storage
and exchange and a set of Application Programming Interfaces for data
manipulation. Currently, there is a beta release of the API to support access
to, and manipulation of, Level 0 data structures. Development of the API to
support Level 1 data is in its initial stages.
(Page includes links to persons, documents, papers, tutorial.)
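As a rough illustration of the Level 0 idea described above (annotations as
labels attached to intervals of a linear signal), here is a minimal Python
sketch. It is not the ATLAS API or its XML interchange format: the class
names, the "tier" field and the sample labels are all hypothetical, and the
full annotation graph model is simplified here to a flat list of labelled
intervals.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arc:
    """One annotation: a typed label spanning an interval of the signal."""
    start: float  # offset into the signal (seconds for audio, characters for text)
    end: float
    tier: str     # annotation type, e.g. "word" or "pos" (illustrative names)
    label: str


@dataclass
class AnnotationGraph:
    """Hypothetical container for interval-indexed annotations on one signal."""
    arcs: list = field(default_factory=list)

    def add(self, start, end, tier, label):
        if end < start:
            raise ValueError("interval end precedes start")
        self.arcs.append(Arc(start, end, tier, label))

    def overlapping(self, start, end, tier=None):
        """Return arcs overlapping [start, end), optionally limited to one tier."""
        return [
            a for a in self.arcs
            if a.start < end and start < a.end and (tier is None or a.tier == tier)
        ]


# Two annotation tiers over the same stretch of (imaginary) audio.
graph = AnnotationGraph()
graph.add(0.00, 0.32, "word", "the")
graph.add(0.32, 0.80, "word", "corpus")
graph.add(0.00, 0.32, "pos", "DET")
graph.add(0.32, 0.80, "pos", "NOUN")

words = [a.label for a in graph.overlapping(0.25, 0.50, tier="word")]
tags = [a.label for a in graph.overlapping(0.40, 0.50, tier="pos")]
```

Because every tier is indexed by the same offsets, different kinds of
annotation (words, part-of-speech tags, and so on) can be queried together
without being nested inside one another.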
- Transcriber
A tool for segmenting, labeling and transcribing speech
- The XML Cover Pages - Home Page
(XML page including an overview and links on XML applications in many fields.)
- The official site of RELAX
RELAX (REgular LAnguage description for XML) is a specification
for describing XML-based languages. XHTML 1.0, for example, can be described
in RELAX.
A description written in RELAX is called a RELAX grammar.
An XML document can be verified against a RELAX grammar.
Compared with DTD (Document Type Definition), RELAX has new features:
- RELAX grammars are represented in the XML instance syntax
- RELAX borrows rich datatypes of XML Schema Part 2
- RELAX is namespace-aware
RELAX is standardized by INSTAC XML SWG of Japan. Under the auspices
of the Japanese Standards Association (JSA), this committee develops Japanese
national standards for XML.
- Academia Sinica Computing Centre's Schematron Home Page
The Schematron differs in basic concept from other schema
languages in that it is not based on grammars but on finding tree patterns in
the parsed document.
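The pattern-based approach mentioned for Schematron can be illustrated with
a minimal Python sketch using only the standard library. This is not
Schematron itself and does not use its schema syntax: the rule table, the
sample document and the element and attribute names are hypothetical. It
only shows the general idea of selecting nodes with a tree pattern and
asserting something about each match, rather than checking the document
against a grammar.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are made up.
DOC = """
<corpus>
  <s id="s1"><w pos="DET">the</w><w pos="NOUN">corpus</w></s>
  <s><w>untagged</w></s>
</corpus>
"""

# Each rule: (pattern selecting context nodes, assertion on a node, message).
RULES = [
    (".//s", lambda e: e.get("id") is not None, "sentence lacks an id attribute"),
    (".//w", lambda e: e.get("pos") is not None, "word lacks a pos attribute"),
]


def check(root, rules):
    """Apply every rule to every node its pattern matches; collect failures."""
    failures = []
    for pattern, assertion, message in rules:
        for node in root.findall(pattern):
            if not assertion(node):
                failures.append(f"<{node.tag}>: {message}")
    return failures


failures = check(ET.fromstring(DOC), RULES)
for line in failures:
    print(line)
```

Note that the rules say nothing about the overall document structure; a
document may contain any other elements and still pass, which is the main
contrast with grammar-based validation such as a DTD.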
2. Some Articles
On the Internet
Local
In the directory ~torres/XML, in PS and PDF format.
- The MATE Workbench - an annotation tool for XML coded speech corpora. D. McKelvie et al.
- XML tools and architecture for Named Entity recognition. A. Mikheev et al.
- Structured Document Transformations. Greger Linden
- SGML & XML Content Models. Pekka Kilpeläinen.
- Automatic Hypertext Link Typing, James Allan
- Die integrierte Repräsentation linguistischer Daten. A. Mengel.
- Automatically generating hypertext by computing semantic similarity, Green
- Hedge automata: a formal model for XML schemata, M. Murata
- Modellierung multilingualer Ressourcen, Heyer, Wolff
- Towards a minimal standard for dialogue transcripts: a new SGML architecture for the HCRC Map Task Corpus, A. Isard et al.
- ATLAS: A Flexible and Extensible Architecture for Linguistic Annotation, S. Bird et al.
- An XML-based representation format for syntactically annotated corpora, A. Mengel, W. Lezius
- PAT expressions: an algebra for text search, A. Salminen
- CELLAR: A Data Modeling System for Linguistic Annotation
- SSML: a Speech Synthesis Markup Language, P. Taylor, A. Isard
- Automatic Hypertext Construction, Allan
- SGML: Mathematical and Philosophical Issues, Wood
- Transcriber: development and use of a tool for assisting speech corpora production, C. Barras et al.
- XCES: an XML-Based Encoding Standard for Linguistic Corpora, Nancy Ide et al.
3. Further links
(e.g. proceedings, literature, pages including collections of links, etc.)
General
Homepages of some Researchers