
Tools for finding and displaying text

 

This chapter will introduce some basic techniques and operations for use in data-intensive linguistics. We will show how existing tools (mainly standard UNIX tools) and some limited programming can be used to carry out these operations.

Some of the material on standard UNIX tools is a revised and extended version of notes taken at a tutorial given by Ken Church at Coling 90 in Helsinki, entitled "Unix for Poets". We use some of his exercises, adding variations of our own as appropriate.

One of Church's examples was the creation of a KWIC index, which we first encountered in chapter 5 of Aho, Kernighan and Weinberger (1988). In section 4 we discuss this example, present variations on the program (including a version in Perl), and compare it with other (more elaborate, but arguably less flexible) concordance generation tools.
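By way of preview, the sketch below shows a rotation-based approach in the spirit of that program: each line is rotated once per word so that every word in turn comes first, the rotations are sorted, and the original word order is restored for display. The file name corpus.txt, the 30-character context window and the output format are illustrative choices of ours, not details of the version discussed in section 4.

  #!/bin/sh
  # kwic.sh -- rotation-based KWIC sketch; corpus.txt and the 30-character
  # context window are illustrative choices only
  awk '{
      # rotate the line once per word; a tab marks where it was cut
      for (i = 1; i <= NF; i++) {
          out = ""
          for (j = i; j <= NF; j++) out = out $j " "
          out = out "\t"
          for (j = 1; j < i; j++) out = out $j " "
          print out
      }
  }' corpus.txt |
  sort -f |
  awk -F"\t" '{
      # unrotate: show the tail of the left context, then the keyword
      # with its right context
      left = $2
      if (length(left) > 30) left = substr(left, length(left) - 29)
      printf("%30s  %s\n", left, $1)
  }'

Because each stage of the pipeline can be inspected and replaced on its own, a script like this is easy to adapt, which is precisely the skill the exercises at the end of the chapter are meant to develop.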

There are more advanced, off-the-shelf tools that can be used for these operations, and several of them will be described later in these notes. In theory, they can be used without any programming skills. But more often than not, available tools will not do quite what you need, and will have to be adapted. When you do data-intensive linguistics, you will often spend time experimentally adapting a program written in a language which you don't necessarily know all that well. You do need to know the basics of these programming languages and a text editor, and to bring some humility, a positive attitude, a degree of luck and, if some or all of that is missing, a justified belief in your ability to find good manuals. The exercises towards the end of this chapter concentrate on this kind of adaptation of other people's code.

This chapter therefore has two aims:

1. to introduce some very basic but very useful word operations, like counting words or finding common neighbours of certain words (a pipeline illustrating both appears just after this list);
2. to introduce publicly available utilities which can be used to carry out these operations, or which can be adapted to suit your needs.
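
Both kinds of operation can be done with short pipelines of standard UNIX tools, in the spirit of Church's exercises. The sketch below is illustrative only: the file name genesis.txt and the target word "the" are arbitrary choices, and the tokenisation (anything that is not a letter ends a word) is deliberately crude.

  # count the word tokens in a text, most frequent first
  tr -sc 'A-Za-z' '\n' < genesis.txt |  # one word per line
      tr 'A-Z' 'a-z' |                  # fold case to lower
      sort | uniq -c | sort -rn |       # count and rank the words
      head                              # the ten most frequent words

  # common right-hand neighbours of "the": pair each word with the next
  # by pasting the word stream against a copy of itself shifted by one
  tr -sc 'A-Za-z' '\n' < genesis.txt | tr 'A-Z' 'a-z' > words
  tail -n +2 words > nextwords
  paste words nextwords | awk '$1 == "the"' | sort | uniq -c | sort -rn | head

Any prefix of these pipelines, and the intermediate files words and nextwords, can be inspected directly, which makes it easy to see what each stage contributes.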

The main body of this chapter focuses on a particularly important and successful suite of tools: the UNIX tools. Almost everybody who works with language data will find themselves using these tools at some point, so it is worth understanding what is special about them and how they differ from other tools. Before doing this we will take time out to give a more general overview of the tasks to which all such tools are dedicated, since these are the main tasks of data-intensive linguistics.



 