First, let's get the Conan Doyle story ready for further
manipulation. We include this step here to give you an idea
as to the sort of odd problems you may be up against when dealing with
``real'' text.
How many words are there in sherlock? Create a file with each
word in lowercase on a separate line and call it sherlock_words.
How many words are there in that file? If there is a difference,
can you see what is causing it?
If you just run wc on sherlock, you will get the reply
that there are 7009 words. If you type
tr '[A-Z]' '[a-z]' < sherlock | tr -cs '[a-z]' '\012' > sherlock_words
and then do a word count on that, you will see that there are slightly
more words (7070). The difference arises for a number of reasons.
One is that the original text has a lot of hyphenated words.
wc sherlock counts each compound as a single word, but tr puts
the components of each compound on separate lines, so that a word
count on the result counts them as separate words.
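To see the effect on a small scale, here is a minimal sketch that runs
both counts on a made-up two-line sample rather than on sherlock itself.
The sample text, the scratch directory and the variable names are
assumptions for illustration only; the tr pipeline is the one given above.

```shell
# Work in a scratch directory so none of your own files are overwritten.
tmp=$(mktemp -d)

# Made-up sample text (not taken from sherlock) with three compounds.
printf '%s\n' 'He wore a top-hat and said good-evening.' \
              'The test-tubes were empty.' > "$tmp/sample"

# wc -w splits on white space only, so each compound counts once.
wc_count=$(wc -w < "$tmp/sample" | tr -d ' ')

# The tr pipeline turns every non-letter, hyphens included, into a newline.
tr '[A-Z]' '[a-z]' < "$tmp/sample" | tr -cs '[a-z]' '\012' > "$tmp/sample_words"
tr_count=$(wc -w < "$tmp/sample_words" | tr -d ' ')

echo "wc: $wc_count words, tr: $tr_count words"   # wc: 11 words, tr: 14 words
```

Each of the three compounds contributes one extra word after tr has
split it, which accounts for the difference of three in this sample.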
Can you list all the hyphenated words in sherlock? How many are
there?
If you inspect the hyphenated words, you will see that they include words
like ``test-tubes'', ``good-evening'' and ``top-hat''. For purposes of
examining which words in English go with what other words, it
may be more useful to separate out these words (because it is useful
to know that ``good'' goes with ``evening'' or ``top'' with ``hat'').
Compounds are
separated out like that in
You could use grep to find lines containing words connected by
hyphens:
grep -c '[A-Za-z]-[A-Za-z]' sherlock
However, this only tells you how
many lines there are with hyphenated words (35 lines), not how
many instances there are. For example, it will find lines like
"Well, she had a slate-coloured, broad-brimmed straw hat,
and a general air of being fairly well-to-do in a vulgar, comfortable,
easy-going way."
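The difference between lines and instances can be checked on the
sentence just quoted. The sketch below is only an illustration: it
assumes a grep that supports the -o option (print each match on its own
line), which is a common extension in GNU and BSD grep rather than one of
the commands used in this section, and the file and variable names are
made up.

```shell
tmp=$(mktemp -d)

# The sentence quoted above, with the same line breaks.
cat > "$tmp/sample" <<'EOF'
"Well, she had a slate-coloured, broad-brimmed straw hat,
and a general air of being fairly well-to-do in a vulgar, comfortable,
easy-going way."
EOF

# grep -c counts matching lines: the first line holds two compounds
# but is counted only once.
line_count=$(grep -c '[A-Za-z]-[A-Za-z]' "$tmp/sample")

# grep -o (an extension, assumed available) prints every match separately,
# so counting its output lines counts instances.
instance_count=$(grep -oE '[A-Za-z]+(-[A-Za-z]+)+' "$tmp/sample" | wc -l | tr -d ' ')

echo "lines: $line_count, compounds: $instance_count"   # lines: 3, compounds: 4
```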
To count the individual compounds, grep can only be used
if each compound occurs on a separate line. So we have to use
tr in such a way that hyphens remain in place. You can do
that by typing
tr -cs 'A-Za-z\-' '\012' < sherlock > sherlock_hyphened
Adding \- to the set of characters that tr keeps ensures that
the hyphenated words remain in the output as single items.
If you now type
grep '[A-Za-z]-[A-Za-z]' sherlock_hyphened | more
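you will see all the hyphenated words; grep -c will tell you
that there are 37 of them. For the purposes of these exercises, we
will treat the components of compounds as separate words, as in
sherlock_words. So from now on, we will take that to be the word
list, which means there are 7070 words in sherlock.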
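The hyphen-preserving tr command and the grep count described above can
be tried on a small made-up sentence before running them on sherlock.
This is a sketch under assumptions: the sample sentence, the scratch
directory and the names sample and sample_hyphened are hypothetical.

```shell
tmp=$(mktemp -d)

# Made-up sentence (not from sherlock) containing three compounds.
printf '%s\n' 'A well-to-do man in a top-hat said good-evening.' > "$tmp/sample"

# Keep letters and hyphens; squeeze everything else into single newlines.
tr -cs 'A-Za-z\-' '\012' < "$tmp/sample" > "$tmp/sample_hyphened"

# Every compound now sits on its own line, so counting matching lines
# is the same as counting compounds.
compound_count=$(grep -c '[A-Za-z]-[A-Za-z]' "$tmp/sample_hyphened")
token_count=$(wc -l < "$tmp/sample_hyphened" | tr -d ' ')

echo "tokens: $token_count, compounds: $compound_count"   # tokens: 8, compounds: 3
```

Note that ``well-to-do'' counts as one compound here, whereas the
hyphen-stripping version would split it into three words.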
This is of course a fairly arbitrary decision--there may be arguments for leaving hyphens in. Similarly, when one wants to see which words have which neighbours, then perhaps one wants to keep sentence boundaries in place, rather than stripping them out, since the last word of a sentence and the first word of the next sentence are not really ``neighbours''. Or perhaps the full stop is a neighbour and should appear in the bigrams, and be included in the word count. And there are some words in square brackets in this text:
I held the little printed slip to the light. "Missing [it said] on the morning of the fourteenth, a gentleman named Hosmer Angel. About five feet seven...
Again, one may want to make some kind of principled decision about how one deals with such interjections.
But for the purposes of the exercises in this section, we will
use the file sherlock_words as created above, and agree that
the original file sherlock has 7070 words in it.