
Sorting and counting text tokens

  A tokeniser takes input text and divides it into ``tokens''. These are usually words, although we will see later that the issue of tokenisation is far more complex than that. In this chapter we will take tokenisation to mean the identification of words in the text. This is often a useful first step, because it means one can then count the number of words in a text, or count the number of different words in a text, or extract all the words that occur exactly 3 times in a text, etc.

UNIX has some facilities which allow you to do this tokenisation. We start with tr. This ``translates'' characters. Typical usage is

tr  chars1 chars2 < inputfile > outputfile
which means ``copy the characters from inputfile to outputfile, replacing every character that occurs in chars1 by the corresponding character in chars2''.

For example, tr allows you to change all the lowercase characters in the input file into uppercase characters:

tr 'a-z' 'A-Z' < exatext1 | more
This just says ``translate every a into A, every b into B, etc.''
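If you want to experiment with tr without creating files, you can also feed it from a pipe. The following line is just a toy illustration (the echoed text is not one of the tutorial files):

echo 'hello world' | tr 'a-z' 'A-Z'

This prints HELLO WORLD on your terminal.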

Note for Edinburgh readers:
On different UNIX systems the options for tr are slightly different. Read the man-page to get the details. For the purposes of this tutorial, you should use the UNIX command alias tr /usr/ucb/tr (or equivalent if you use a non-standard shell) to get a version of tr which works as described here.

Similarly,

tr 'aiou' e < exatext1 | more
changes all the vowels in exatext1 into es.

You can also use tr to display all the words in the text on separate lines. You do this by ``translating'' everything which isn't part of a word (every space or punctuation mark) into a newline (octal ASCII code 012). Of course, if you just type

tr 'A-Za-z' '\012' < exatext1
each letter in exatext1 is replaced by a newline, and the result (as you can easily verify) is just a long list of newlines, with only the spaces and punctuation marks remaining.

What we want is exactly the opposite--we are not interested in the punctuation marks, but in everything else. The option -c provides this:

tr -c 'A-Za-z' '\012' < exatext1
Here the complement of letters (i.e. everything which isn't a letter) is mapped into a newline. The result now looks as follows:
Text
in
a
class
of
its
own

The
HCRC
Language
Technology
Group

LTG

is
a
technology
transfer
...
There are some blank lines in this file. That is because the full stop after ``in a class of its own'' is translated into a newline, and the space after the full stop is also translated into a newline. So after own we have two newlines in the current output. The option -s ensures that a run of consecutive replacements like this is squeezed into a single occurrence of the replacement character (in this case: the newline). So with that option, the blank lines in the current output will disappear.
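To see what -s does in isolation, here is a toy illustration (the echoed text is not one of the tutorial files). The run of three commas is translated into three newlines, which -s then squeezes into one:

echo 'one,,,two' | tr -s ',' '\012'

The output is one and two on two consecutive lines, with no blank line between them.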
Exercise:

Create a file exa_words with each word in exatext1 on a separate line.

Solution:

Just type tr -cs 'A-Za-z' '\012' < exatext1 > exa_words
You can combine these commands, using UNIX pipelines (|). For example, to map all the words in the example text into lower case and then display them one word per line, you can type:
tr 'A-Z' 'a-z' < exatext1 | tr -sc 'a-z' '\012' > exa_tokens
The reason for calling this file exa_tokens will become clear later on. We will refer back to files created here and in exercises, so it's useful to follow these naming conventions.

Another useful UNIX operation is sort. It sorts lines from the input file, typically in alphabetical order. Since the output of tr was one word per line, sort can be used to put these lines in alphabetical order, resulting in an alphabetical list of all the words in the text. Check the man-page for sort to find out about other possible options.
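For example, two options worth trying straight away on the exa_words file you created in the exercise above are -r, which reverses the sort order, and -f, which folds upper and lower case together while sorting:

sort -r exa_words | more
sort -f exa_words | more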

Exercise:

Sort all the words in exatext1 in alphabetical order.

Solution:

Just pipeline the tr command with sort: i.e. type
tr -cs 'A-Za-z' '\012' < exatext1 | sort | more
Or to get an alphabetical list of all words in lowercase, you can just type
sort exa_tokens > exa_tokens_alphab.
The file exa_tokens_alphab now contains an alphabetical list of all the word tokens occurring in exatext1.
The output so far is an alphabetical list of all words in exatext1, including duplicates, each on a separate line. You can also produce an alphabetical list which strips out the duplicates, using sort -u.
Exercise:

Create a file exa_types_alphab, containing each word in exatext1 exactly once.

Solution:

Just type
sort -u exa_tokens > exa_types_alphab
Sorted lists like this are useful input for a number of other UNIX tools. For example, comm can be used to check what two sorted lists have in common. Have a look at the file stoplist: it contains an alphabetical list of very common words of the English language. If you type
comm stoplist exa_types_alphab | more
you will get a 3-column output, displaying in column 1 all the words that only occur in the file stoplist, in column 2 all words that occur only in exa_types_alphab, and in column 3 all the words the two files have in common. Option -1 suppresses column 1, -2 suppresses column 2, etc.
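For instance, to display only the words that the two files have in common (i.e. only column 3), you can suppress the first two columns; remember that comm expects both input files to be sorted:

comm -12 stoplist exa_types_alphab | more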
Exercise:

Display all the non-common words in exatext1.

Solution:

Just type comm -1 -3 stoplist exa_types_alphab | more
That compares the two files, but only prints the second column, i.e. those words which are in exatext1 but not in the list of common words.
The difference between word types and word tokens should now be clear. A word token is an occurrence of a word. In exatext1 there are 1,206 word tokens. You can use the UNIX command wc (for word count) to find this out: just type wc -w exatext1.

However, in exatext1 there are only 427 different words or word types. (Again, you can find this out by doing wc -w exa_types_alphab).
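Since exa_tokens has one word per line, you can also count tokens by counting lines with wc -l. The two counts will not necessarily be identical, because wc -w splits only at white space, whereas our tr pipeline also splits words at hyphens and apostrophes:

wc -w exatext1     # word tokens in the running text
wc -l exa_tokens   # one word per line, so the line count is the token count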

There is another tool that can be used to create a list of word types, namely uniq. This is a UNIX tool which can be used to remove duplicate adjacent lines in a file. If we use it to strip duplicate lines out of exa_tokens_alphab we will be left with an alphabetical list of all word types in exatext1--just as was achieved by using sort -u. Try it by typing

uniq exa_tokens_alphab | more
The complete chain of commands (or pipeline) is:

tr 'A-Z' 'a-z' < exatext1 | tr -sc 'a-z' '\012' | sort | uniq | more
Exercise:

Can you check whether the following pipeline will achieve the same result?
tr 'A-Z' 'a-z' < exatext1 | tr -sc 'a-z' '\012' | uniq | sort | more

Solution:

It won't: uniq strips out adjacent lines that are identical. To ensure that identical words end up on adjacent lines, the words have to be put in alphabetical order first. This means that sort has to precede uniq in the pipeline.
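A toy illustration of why the order matters (the printf lines are not part of the tutorial files):

printf 'b\na\nb\n' | uniq | sort
printf 'b\na\nb\n' | sort | uniq

In the first pipeline the two b lines are not adjacent when uniq sees them, so both survive and the output is a, b, b. In the second, sort brings them together first, so uniq reduces them to one and the output is a, b.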
An important option which uniq allows (do check the man-page) is uniq -c: this still strips out adjacent lines that are identical, but also tells you how often each line occurred. This means you can use it to turn a sorted alphabetical list of words and their duplicates into a sorted alphabetical list of words without duplicates but with their frequency counts. Try
 
uniq -c exa_tokens_alphab > exa_alphab_frequency
The file exa_alphab_frequency contains information like the following:
   3 also
   5 an
  35 and
   2 appear
   3 appears
   1 application
   5 applications
   1 approach
In other words, there are 3 tokens of the word ``also'', 5 tokens of the word ``an'', etc.
Exercise:

Can you see what is odd about the following frequency list?
tr -cs 'A-Za-z' '\012' < exatext1 | sort | uniq -c | more
How would you correct this pipeline?

Solution:

The odd thing is that it counts uppercase and lowercase words separately. For example, it says there are 11 occurrences of ``The'' and 74 occurrences of ``the'' in exatext1. That is usually not what you want in a frequency list. If you look in exa_alphab_frequency you will see that that correctly gives ``the'' a frequency of occurrence of 85. The complete pipeline to achieve this is
tr 'A-Z' 'a-z' < exatext1 | tr -sc 'a-z' '\012' | sort | uniq -c | more
It may be useful to save a version of exatext1 with all words in lower case. Just type
tr 'A-Z' 'a-z' < exatext1 > exatext1_lc
Now that you have a list of all word types in exatext1 and the frequency with which each word occurs, you can use sort to order the list according to frequency. The option for numerical ordering is sort -n; and if you add the option -r it will display the word list in reverse numerical order (i.e. the most frequent words first).
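A toy illustration of the difference (the printf lines are not part of the tutorial files): plain sort compares lines character by character, so it puts 10 before 9 because the character 1 precedes the character 9; sort -n compares the lines as numbers:

printf '9\n10\n' | sort
printf '9\n10\n' | sort -n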
Exercise:

Generate a frequency list for exatext1. Call it exa_freq.

Solution:

One solution is to type
sort -nr < exa_alphab_frequency > exa_freq
The complete pipeline to achieve this was
tr -cs 'a-z' '\012' < exatext1_lc | sort | uniq -c | sort -nr

To recap: first we use tr to map each word onto its own line. Then we sort the words alphabetically. Next we remove identical adjacent lines using uniq and use the -c option to mark how often that word occurred in the text. Finally, we sort that file numerically in reverse order, so that the word which occurred most often in the text appears at the top of the list.

When you get these long lists of words, it is sometimes useful to use head or tail to inspect part of the files, or to use the stream editor sed. For example, head -12 exa_freq or sed 12q exa_freq will display just the first 12 lines of exa_freq; tail -5 will display the last 5 lines; tail +14 will display everything from line 14 onwards. sed '/indexer/q' exa_freq will display the file exa_freq up to and including the line with the first occurrence of the string ``indexer''.
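Another handy sed idiom, offered here as an extra example (it is not in the list above): sed -n '5,8p' prints only lines 5 to 8 of its input, so

sed -n '5,8p' exa_freq

displays just those four lines of the frequency list.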

Exercise:

List the top 10 words in exatext1, with their frequency count.

Solution:

Your list should look as follows:
  85 the          34 in
  42 to           22 text
  39 of           18 for
  37 a            15 is
  35 and          14 this
With the files you already have, the easiest way of doing it is to say
head -10 exa_freq.
The complete pipeline is
tr -cs 'a-z' '\012' < exatext1_lc | sort | uniq -c | sort -nr | head -10

Chris Brew
8/7/1998