To find out what a word's most common neighbours are, it is useful to make a list of bigrams (or trigrams, 4-grams, etc.), i.e. to list every cluster of two (or three, four, etc.) consecutive words in a text.
Using the UNIX tools introduced in section 3.1, it is possible to create such n-grams. The starting point is again exa_tokens, the list of all words in the text, one on each line, all in lowercase. Then we use tail to create the tail end of that list:
tail -n +2 exa_tokens > exa_tail2
This creates a list just like exa_tokens, except that the first item in the new list is the second item in the old list. We now paste these two lists together:
paste -d ' ' exa_tokens exa_tail2 > exa_bigrams
paste puts files together ``horizontally'': the first line in the first file is pasted to the first line in the second file, etc. (Contrast this with cat, which puts files together ``vertically'': it first takes the first file, and then adds to it the second file.) Each time paste puts two items together, it puts a tab between them. You can change this delimiter to anything else by using the option -d.
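On two small throwaway files (h1 and h2 are our own names, used only for this illustration), the contrast between paste and cat looks like this:

```shell
# Two small files, just for illustration:
printf 'a\nb\nc\n' > h1
printf '1\n2\n3\n' > h2

# paste joins the files line by line ("horizontally");
# -d ' ' replaces the default tab with a space:
paste -d ' ' h1 h2
# a 1
# b 2
# c 3

# cat appends one file after the other ("vertically"):
cat h1 h2
# a b c 1 2 3, one item per line
```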
If we use paste on exa_tokens and exa_tail2, the n-th word in the first list will be pasted to the n-th word in the second list, which actually means that the n-th word in the text is pasted to the (n+1)-th word in the text. With the option -d ' ', the separator between the words will be a simple space. This is the result:
text in
in a
a class
class of
of its
its own
...
Note that the last line in exa_bigrams contains a single word rather than a bigram.
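If that trailing one-word line gets in the way of later counting, sed '$d' deletes the last line of a file. A miniature sketch (demo_bigrams is our own file, not part of the text's data):

```shell
# A miniature version of exa_bigrams, ending in a lone word:
printf 'text in\nin a\na class\nown\n' > demo_bigrams

# sed '$d' drops the final line, leaving only true bigrams:
sed '$d' demo_bigrams > demo_bigrams_clean
cat demo_bigrams_clean
# text in
# in a
# a class
```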
What are the 5 most frequent trigrams in exatext? This is the list:
4 the human indexer
4 in the document
3 categorisation and routing
2 work on text
2 we have developed
For creating the trigrams, start again from exa_tokens and exa_tail2 as before. Then create another file with all words, but starting at the third word of the original list:
tail -n +3 exa_tokens > exa_tail3
Finally paste all this together:
paste -d ' ' exa_tokens exa_tail2 exa_tail3 > exa_trigrams
Since all trigrams are on separate lines, you can sort and count
them the same way we did for words:
sort exa_trigrams | uniq -c | sort -nr | head -5
How many 4-grams are there in exatext?
How many different ones are there? (Hint: use wc -l to
display a count of lines.)
Creating 4-grams should be obvious now:
tail -n +4 exa_tokens > exa_tail4
paste -d ' ' exa_tokens exa_tail2 exa_tail3 exa_tail4 > exa_fourgrams
A wc -l on exa_fourgrams will reveal that it has 1,213 lines, which means there are 1,210 4-grams (the last 3 lines in the file are not 4-grams). When you sort and uniq that file, a wc reveals that there are still 1,200 lines in the resulting file, i.e. there are 1,197 different 4-grams. Counting and sorting in the usual way results in the following list:
2 underlined in the text
2 the system displays the
2 the number of documents
2 the figure to the
2 should be assigned to
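On a toy scale (demo_4grams is our own file, not part of exatext), the two counts come from a wc -l before and after sort | uniq:

```shell
# Three lines, two of them identical:
printf 'a b c d\na b c d\nb c d e\n' > demo_4grams

wc -l < demo_4grams              # total lines: 3
sort demo_4grams | uniq | wc -l  # distinct lines: 2
```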
Of course, there was a simpler way of calculating how many 4-grams there were: there are 1,213 tokens in exa_tokens, which means that there will be 1,212 bigrams, 1,211 trigrams, 1,210 4-grams, etc.
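The tail-and-paste recipe generalises to any n. The following shell function is our own sketch (the name ngrams and its temporary tail files are not from the text); it simply repeats the steps above for an arbitrary n:

```shell
# Build n-grams from a one-word-per-line file.
# Usage: ngrams N tokenfile   (a sketch; the name is our own)
ngrams() {
    n=$1; file=$2
    args=$file
    i=2
    while [ "$i" -le "$n" ]; do
        # each tail file starts one word further into the text
        tail -n +"$i" "$file" > "$file.tail$i"
        args="$args $file.tail$i"
        i=$((i + 1))
    done
    # $args is intentionally unquoted so it splits into file names
    paste -d ' ' $args
}

# e.g.: ngrams 3 exa_tokens | sort | uniq -c | sort -nr | head -5
```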