
Search tools for data-intensive linguistics

The general setup is this: we have a large corpus of (maybe) many millions of words of text, from which we wish to extract data which bears on a research question. We might, for example, contra Fillmore, be genuinely interested in the distribution of parts of speech in the first and second positions of the sentence. What we would like is a tool which offers a flexible and expressive query language, a fast and accurate query engine, beautiful and informative displays, and powerful statistical tests. But we can't always get exactly what we want.

The first design choice we face is that of deciding over which units the query language operates. It might process text a word at a time, a line at a time, a sentence at a time, or slurp up whole documents and match queries against the whole thing. In the UNIX tools it was decided that most tools should work either a character at a time or (more commonly) a line at a time. If we want to work with this convention we will need tools which can re-arrange text to fit it. We might, for example, want a tool which ensures that each word occurs on a separate line, or a similar one which does the same job for sentences. Such tools are usually called filters, and they are used to prepare the data before the real business of selecting the text units in which we are interested begins. The design strategy of using filters comes into its own when the corpus format changes: in the absence of filters we might have to change the main program, but if the main business is sheltered from the messiness of the real data by a filter, we may need to do nothing more than change the filter.
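To make the idea of a filter concrete, here is a minimal sketch of a one-word-per-line filter. It is written in Python purely by way of illustration; the text above does not commit to any particular implementation, and a real filter would need a more careful tokenisation policy.

    import re
    import sys

    # Read running text on standard input and write one word per line,
    # where a "word" is crudely taken to be a maximal run of letters.
    for line in sys.stdin:
        for word in re.findall(r"[A-Za-z]+", line):
            print(word)

A sentence-per-line filter works in the same spirit, although deciding where sentences end is a harder problem than it first appears.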

The filter-and-processor architecture can also work when units are marked out not by line-ends but by some form of bracketing. This style is much used when the information associated with the text is highly structured. The Penn Treebank (described in Marcus et al. (1993)) uses this style. The current extreme of this style is SGML (Standard Generalized Markup Language), essentially a form of the bracketing-delimits-unit style of markup, with all sorts of extra facilities for indicating attributes of units and relationships which hold between them.
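By way of illustration, a fragment marked up in this bracketing style might look roughly as follows. The element names and attribute values are invented for the example, not drawn from any particular corpus.

    <s id="s1">
      <w pos="AT0">The</w>
      <w pos="NN1">profit</w>
      <w pos="VVD">rose</w>
    </s>

The brackets delimit the units (here a sentence and its words), and the attributes carry the extra information associated with them.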

The next design choice is the query language itself. We could demand that users fully specify the items which they want, typing in (for example) the word in which they are interested, but that is of very limited utility, since you have to know ahead of time exactly what you want. Much better is to allow a means of specifying (for example) "every sentence containing the word 'profit'", or again "every word containing two or more consecutive vowels". There is a tradeoff here, since the more general you make the query language, the more demanding is the computational task of matching it against the data.
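To take the second of these queries: in a query language based on regular expressions, "every word containing two or more consecutive vowels" might be expressed with a pattern like [aeiou]{2,}. The following is a small Python sketch, again purely illustrative, and ignoring orthographic subtleties such as "y".

    import re
    import sys

    # Print words containing at least two consecutive vowels,
    # e.g. "profit" does not match but "increased" does.
    vowels = re.compile(r"[aeiou]{2,}", re.IGNORECASE)
    for line in sys.stdin:
        for word in re.findall(r"[A-Za-z]+", line):
            if vowels.search(word):
                print(word)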

Performance can often be dramatically improved by indexing the corpus with information which will speed up the process of query interpretation. The index of a book works like this, freeing you from the need to read the whole book in order to find the topic of interest. But of course the indexing strategy can break down disastrously if the class of queries built into the index fails to match the class of queries which the user wants to pose; hence the existence of reversed-spelling dictionaries and rhyming dictionaries to complement the usual style. One widely used tool relies heavily on indexing, but in doing so it restricts the class of queries which it is able to service. What it can do is done so fast that it is a great tool for interactive exploration by linguists and lexicographers.
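The essence of such an index is easy to sketch (once more in Python, as an illustration only): record, for each word form, the positions at which it occurs, so that looking a word up becomes a dictionary access rather than a scan of the whole corpus.

    import sys
    from collections import defaultdict

    # Build a toy inverted index mapping each word form to the list of
    # token positions at which it occurs.
    index = defaultdict(list)
    for position, token in enumerate(sys.stdin.read().split()):
        index[token.lower()].append(position)

    # Queries the index was built for are now cheap lookups...
    print(index.get("profit", []))
    # ...but a query it was not designed for (say, "words with two
    # consecutive vowels") still forces a pass over all the keys.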

One of the most general solutions to the query-language problem is to allow users to specify a full-fledged formal grammar for the text fragments in which they are interested. This gives a great deal of power, and Corley et al. (1997) have shown that it is possible to achieve adequately efficient implementations. Their tool requires pre-tagged text, but otherwise imposes few constraints on input. Filters exist for the BNC, for the Penn Treebank, and for Susanne.

Note for Edinburgh readers: it is an Edinburgh in-house product, for which it is pretty easy to write new filters, and Frank Keller knows about it...

We will be working with Gsearch for the first assessed exercise.

See http://www.ltg.ed.ac.uk/~keller/corset/ on the Web for more documentation. A sample grammar is in Table 3.1.


Table 3.1: a sample Gsearch grammar

    % File: Grammar
    % Purpose: A fairly simple Corset...
    ...

    conj --> <CJ.*>      % Conjunction

    ofp --> of np

Finally, there are toolkits for corpus processing like that described in McKelvie et al. (1997), which we call LT-NSL or LT-XML, depending (roughly) on the wind direction, and which offer great flexibility and powerful query languages for those who are able and willing to write their own tools. Packaged in the form of sggrep, the query language is ideally suited for search over corpora which have been pre-loaded with large amounts of reliable and hierarchically structured annotation.

Note for Edinburgh readers: it is an Edinburgh in-house product whose development you can influence. The query language adds some things to that of xkwic, but also lacks some of what xkwic has.
See http://www.ltg.hcrc.ed.ac.uk/software/index.html for further documentation.

It probably isn't worth going into great detail about ways of displaying matched data, beyond the comment that visualisation methods are important if you want human beings to make much sense of what is provided.


Chris Brew
8/7/1998