
Making n-grams

To find out what a word's most common neighbours are, it is useful to make a list of bigrams (or trigrams, 4-grams, etc.), i.e. to list every cluster of two (or three, four, etc.) consecutive words in a text.

Using the UNIX tools introduced in section 3.1 it is possible to create such n-grams. The starting point is again exa_tokens, the list of all words in the text, one on each line, all in lowercase. Then we use tail to create the tail end of that list:

 
tail -n +2 exa_tokens > exa_tail2
This creates a list just like exa_tokens, except that the first item in the new list is the second item in the old list.
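Judging from the bigram listing further down, exatext begins with the words "text in a class of its own", so the two files start like this:

exa_tokens   exa_tail2
text         in
in           a
a            class
class        of
of           its
...

We now paste these two lists together: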
 
paste -d ' ' exa_tokens exa_tail2 > exa_bigrams
paste puts files together "horizontally": the first line in the first file is pasted to the first line in the second file, and so on. (Contrast this with cat, which puts files together "vertically": it outputs the whole of the first file and then appends the second.) Each time paste joins two items, it puts a tab between them; you can change this delimiter to anything else with the option -d.
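As a quick illustration, suppose a and b are two short files (hypothetical names), each with one word per line:

cat a b            # all of a, then all of b, in one column
paste a b          # line 1 of a, a tab, line 1 of b; and so on
paste -d ';' a b   # the same, but with ';' instead of the tab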

If we use paste on exa_tokens and exa_tail2, the n-th word in the first list is pasted to the n-th word in the second list, which means that the n-th word in the text is pasted to the (n+1)-th word in the text. With the option -d ' ', the separator between the words is a simple space. This is the result:

text in
in a
a class
class of
of its
its own
...
Note that the last line in exa_bigrams contains a single word rather than a bigram.
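If you want to get rid of that incomplete final line, one way (not part of the original recipe) is to delete the last line with sed; the output name exa_bigrams_clean is just an example:

sed '$d' exa_bigrams > exa_bigrams_clean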

Exercise:

What are the 5 most frequent trigrams in exatext1?

Solution:

This is the list:
  4   the human indexer
  4   in the document
  3   categorisation and routing
  2   work on text
  2   we have developed
To create the trigrams, start again from exa_tokens and exa_tail2 as before. Then create a third file containing all the words, but starting at the third word of the original list:
tail -n +3 exa_tokens > exa_tail3
Finally paste all this together:
paste -d ' ' exa_tokens exa_tail2 exa_tail3 > exa_trigrams
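With the same opening words as before, exa_trigrams begins:

text in a
in a class
a class of
class of its
of its own
...

(As with the bigrams, the last two lines of the file are incomplete trigrams.)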
Since all trigrams are on separate lines, you can sort and count them the same way we did for words:
sort exa_trigrams | uniq -c | sort -nr | head -5
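Step by step, this pipeline does the following:

sort exa_trigrams   # bring identical trigrams next to each other
uniq -c             # collapse repeats, prefixing each line with its count
sort -nr            # sort numerically by count, largest first
head -5             # keep the five most frequent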
Exercise:

How many 4-grams are there in exatext? How many different ones are there? (Hint: use wc -l to display a count of lines.)

Solution:

Creating 4-grams should be obvious now:
tail -n +4 exa_tokens > exa_tail4
paste -d ' ' exa_tokens exa_tail2 exa_tail3 exa_tail4 > exa_fourgrams
A wc -l on exa_fourgrams will reveal that it has 1,213 lines, which means there are 1,210 4-grams (the last 3 lines in the file are not complete 4-grams). When you sort and uniq that file, a wc shows that the result has 1,200 lines, i.e. there are 1,197 different 4-grams. Counting and sorting in the usual way results in the following list:
 2  underlined in the text
 2  the system displays the
 2  the number of documents
 2  the figure to the
 2  should be assigned to
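The counts quoted above come straight from the hint's wc -l:

wc -l exa_fourgrams                  # 1,213 lines
sort exa_fourgrams | uniq | wc -l    # 1,200 distinct lines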
Of course, there was a simpler way of calculating how many 4-grams there are: exa_tokens contains 1,213 tokens, which means there will be 1,212 bigrams, 1,211 trigrams, 1,210 4-grams, and so on; in general, a text of N tokens contains N - n + 1 n-grams.
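The whole recipe generalises mechanically, so you may prefer to script it. The following POSIX shell sketch is our own, not part of the text above; the function name make_ngrams is made up, and it names its intermediate files exa_tokens_tail2, exa_tokens_tail3, etc.:

make_ngrams () {
    n=$1
    file=$2
    cmd="paste -d ' ' $file"
    i=2
    while [ "$i" -le "$n" ]; do
        # the word list starting at the i-th word
        tail -n +"$i" "$file" > "${file}_tail$i"
        cmd="$cmd ${file}_tail$i"
        i=$((i + 1))
    done
    eval "$cmd"
}

For example, make_ngrams 3 exa_tokens > exa_trigrams reproduces the trigram file (including the incomplete lines at the end).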

Chris Brew
8/7/1998