UNIX has some facilities which allow you to do this tokenisation.
We start with tr.
This ``translates'' characters.
Typical usage is
tr chars1 chars2 < inputfile > outputfile
which means ``copy the characters from inputfile into outputfile, replacing each character listed in chars1 by the corresponding character in chars2''.
For example, tr allows you to change all the characters in the input file into uppercase characters:
tr 'a-z' 'A-Z' < exatext1 | more
This just says ``translate every a into A, every b into B, etc.''
Note for Edinburgh readers:
On different UNIX systems the options for tr are slightly different. Read the man-page to get the details. For the purposes of this tutorial, you should use the UNIX command alias tr /usr/ucb/tr (or equivalent if you use a non-standard shell) to get a version of tr which works as described here.
Similarly,
tr 'aiou' e < exatext1 | more
changes all the vowels in exatext1 into es.
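If you want to experiment without the example file, you can pipe a made-up sample string straight into tr (the sample text below is invented, not taken from exatext1):

```shell
# Uppercase a sample string with tr, exactly as in the exatext1 example:
printf 'Hello, world\n' | tr 'a-z' 'A-Z'
# HELLO, WORLD
```

Any command that reads from standard input can be tried out this way before running it on a real file.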
You can also use tr to display all the words in the text on separate lines. You do this by ``translating'' everything which isn't a word (every space or punctuation mark) into newline (ASCII code 012). Of course, if you just type
tr 'A-Za-z' '\012' < exatext1
then each letter in exatext1 is replaced by a newline, and the result (as you can easily verify) is just a long list of newlines, with only the punctuation marks remaining.
What we want is exactly the opposite--we are not interested in the punctuation marks, but in everything else. The option -c provides this:
tr -c 'A-Za-z' '\012' < exatext1
Here the complement of the set of letters (i.e. everything which isn't a letter) is mapped into a newline. The result now looks as follows:
Text
in
a
class
of
its
own

The
HCRC
Language
Technology
Group
LTG
is
a
technology
transfer
...
There are some blank lines in this file. That is because the full stop after ``a class of its own'' is translated into a newline, and the space after the full stop is also translated into a newline. So after ``own'' we have two newlines in the current output. The option -s ensures that multiple consecutive replacements like this are squeezed into just a single occurrence of the replacement character (in this case: the newline). So with that option, the blank lines in the current output will disappear.
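The combined effect of -c and -s is easy to check on a small made-up sample (the sentence below is invented):

```shell
# -c complements the letter set, so every non-letter becomes a newline;
# -s squeezes each run of newlines down to a single one.
printf 'Text in a class.\n' | tr -cs 'A-Za-z' '\012'
# Text
# in
# a
# class
```

Without -s, the full stop and the final newline would each contribute a newline of their own, producing a blank line after ``class''.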
Exercise: Create a file exa_words with each word in exatext1 on a separate line.
Answer:
tr -cs 'A-Za-z' '\012' < exatext1 > exa_words
Several commands can also be chained together with the pipe symbol (|).
For example,
to map all words in the example text in lower case, and then display it
one word per line, you can type:
tr 'A-Z' 'a-z' < exatext1 | tr -sc 'a-z' '\012' > exa_tokens
The reason for calling this file exa_tokens will become clear later on. We will refer back to files created here and in exercises, so it's useful to follow these naming conventions.
Another useful UNIX operation is sort. It sorts lines from the
input file, typically in alphabetical order. Since the output of
tr was one word per line, sort can be used to put these lines
in alphabetical order, resulting in an alphabetical list of all the
words in the text. Check the man-page for sort to find out about
other possible options.
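A quick illustration on a few invented lines of input:

```shell
# sort arranges its input lines; with no options the order is alphabetical.
printf 'pear\napple\nbanana\napple\n' | sort
# apple
# apple
# banana
# pear
```

Note that duplicate lines survive plain sorting; they simply end up next to each other, which will matter below.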
Exercise: Sort all the words in exatext1 in alphabetical order.
Answer: Just pipeline the tr command with sort, i.e. type
tr -cs 'A-Za-z' '\012' < exatext1 | sort | more
Or, to get an alphabetical list of all the words in lowercase, you can just type
sort exa_tokens > exa_tokens_alphab
The output so far is an alphabetical list of all words in exatext1, including duplicates, each on a separate line. You can also produce an alphabetical list which strips out the duplicates, using sort -u.
The file exa_tokens_alphab now contains an alphabetical list of all the word tokens occurring in exatext1.
Exercise: Create a file exa_types_alphab, containing each word in exatext1 exactly once.
Answer: Just type
sort -u exa_tokens > exa_types_alphab
Sorted lists like this are useful input for a number of other UNIX tools. For example,
comm can be used to check what two sorted lists have in common. Have a look at the file stoplist: it contains an alphabetical list of very common words of the English language. If you type
comm stoplist exa_types_alphab | more
you will get a 3-column output, displaying in column 1 all the words that occur only in the file stoplist, in column 2 all the words that occur only in exa_types_alphab, and in column 3 all the words the two files have in common. Option -1 suppresses column 1, -2 suppresses column 2, etc.
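A minimal sketch with two tiny made-up files standing in for stoplist and exa_types_alphab (the file names and word lists here are invented):

```shell
# comm expects both of its inputs to be sorted.
printf 'a\nof\nthe\n' > /tmp/stop_demo          # stand-in for stoplist
printf 'language\nof\ntext\n' > /tmp/types_demo # stand-in for exa_types_alphab

# Suppress columns 1 and 3: only words unique to the second file remain.
comm -1 -3 /tmp/stop_demo /tmp/types_demo
# language
# text
```

Here ``of'' appears in both files, so it lands in the suppressed third column and does not show up in the output.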
Exercise: Display all the words that occur in exatext1 but not in the list of common words.
Answer:
comm -1 -3 stoplist exa_types_alphab | more
In exatext1
there
are 1,206 word tokens. You can use the UNIX command wc (for word count) to find this out: just type wc -w exatext1. However, in exatext1 there are only 427 different words or word types. (Again, you can find this out by typing wc -w exa_types_alphab.)
There is another tool that can be used to create a list of word types,
namely uniq. This
is a UNIX tool which can be used to remove duplicate adjacent lines
in a file. If we use it to strip duplicate lines out of
exa_tokens_alphab
we will be left with an alphabetical list of all
word types in exatext1--just as was achieved by using sort -u. Try it by typing
uniq exa_tokens_alphab | more
The complete chain of commands (or pipeline) is:
tr -cs 'a-z' '\012' < exa_tokens | sort | uniq | more
Note that the order of sort and uniq matters: since uniq only removes adjacent duplicate lines, the variant
tr -cs 'a-z' '\012' < exa_tokens | uniq | sort | more
would leave duplicates in the output.
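You can see why uniq has to come after sort with a three-line invented input:

```shell
# uniq only removes *adjacent* duplicates, so unsorted input defeats it.
printf 'the\ncat\nthe\n' | uniq
# the
# cat
# the

# Sorting first brings the duplicates next to each other.
printf 'the\ncat\nthe\n' | sort | uniq
# cat
# the
```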
The option -c of uniq marks each line with the number of times it occurred. Typing
uniq -c exa_tokens_alphab > exa_alphab_frequency
produces a file exa_alphab_frequency which contains information like the following:
      3 also
      5 an
     35 and
      2 appear
      3 appears
      1 application
      5 applications
      1 approach
In other words, there are 3 tokens of the word ``also'', 5 tokens of the word ``an'', etc.
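On a small made-up sorted word list the counting looks like this (the exact width of the count column varies between systems):

```shell
# uniq -c collapses adjacent duplicate lines and prefixes each
# surviving line with the number of times it occurred.
printf 'an\nan\nand\nappear\n' | uniq -c
# prints one line per distinct word: 2 an, 1 and, 1 appear
```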
Exercise: Produce a frequency count of all the words in exatext1.
Answer:
tr -cs 'A-Za-z' '\012' < exatext1 | sort | uniq -c | more
Note that this pipeline counts ``The'' and ``the'' as two different words. If you look at exa_alphab_frequency, which was built from the lowercased tokens, you will see that it correctly gives ``the'' a frequency of occurrence of 85. The complete pipeline to achieve this is
tr 'A-Z' 'a-z' < exatext1 | tr -sc 'a-z' '\012' | sort | uniq -c | more
Exercise: Create a version of exatext1 with all words in lower case.
Answer: Just type
tr 'A-Z' 'a-z' < exatext1 > exatext1_lc
Exercise: Create a frequency-sorted word list of exatext1 and call it exa_freq.
Answer:
sort -nr < exa_alphab_frequency > exa_freq
The complete pipeline is
tr -cs 'a-z' '\012' < exatext1_lc | sort | uniq -c | sort -nr
To recap: first we use tr to map each word onto its own line. Then we sort the words alphabetically. Next we remove identical adjacent lines using uniq and use the -c option to mark how often that word occurred in the text. Finally, we sort that file numerically in reverse order, so that the word which occurred most often in the text appears at the top of the list.
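The whole recap pipeline can be tried on an invented one-line sample, small enough to check by hand:

```shell
# tokenise, lowercase, sort, count, then order by descending frequency
printf 'The cat and the dog and the bird.\n' \
  | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\012' | sort | uniq -c | sort -nr
# the most frequent word (here ``the'', 3 tokens) comes out on top
```

Because the first tr lowercases everything, ``The'' and ``the'' are counted as the same word type, just as in the full pipeline above.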
When you get these long lists of words, it is sometimes useful to use head or tail to inspect part of the files, or to use the stream editor sed. For example,
head -12 exa_freq
or
sed 12q exa_freq
will display just the first 12 lines of exa_freq;
tail -5 exa_freq
will display the last 5 lines, and
tail +14 exa_freq
will display everything from line 14 (on some systems you need tail -n +14).
sed '/indexer/q' exa_freq
will display the file exa_freq up to and including the line with the first occurrence of the item ``indexer''.
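The same family of commands can be tried on a throwaway numbered file (generated here with seq; the file name is invented):

```shell
seq 1 20 > /tmp/demo_lines    # a 20-line file containing 1, 2, ..., 20

head -n 3 /tmp/demo_lines     # first three lines: 1 2 3
sed 3q /tmp/demo_lines        # the same effect, with sed
tail -n 2 /tmp/demo_lines     # last two lines: 19 20
tail -n +18 /tmp/demo_lines   # everything from line 18 onwards
sed '/7/q' /tmp/demo_lines    # up to and including the first line containing a 7
```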
Exercise: List the top 10 words in exatext1, with their frequency count. Your list should look as follows:
     85 the
     42 to
     39 of
     37 a
     35 and
     34 in
     22 text
     18 for
     15 is
     14 this
Answer: With the files you already have, the easiest way of doing it is to say
head -10 exa_freq
The complete pipeline is
tr -cs 'a-z' '\012' < exatext1_lc | sort | uniq -c | sort -nr | head -10