Text preparation


First, let's get the Conan Doyle story ready for further manipulation. We include this step here to give you an idea of the sort of odd problems you may come up against when dealing with "real" text.

Exercise: How many words are there in sherlock? Create a file with each word in lowercase on a separate line and call it sherlock_words. How many words are there in that file? If there is a difference, can you see what is causing it?

Solution: If you just run wc on sherlock, it will report that there are 7009 words. If you type
tr '[A-Z]' '[a-z]' <sherlock|tr -cs '[a-z]' '\012' > sherlock_words
and then do a word count on that, you will see that there are slightly more words (7070). The difference arises for a number of reasons. One is that the original text has a lot of hyphenated words. wc splits only on white space, so on sherlock it counts each compound as a single word. tr, however, turns every non-letter (including the hyphen) into a line break, so the components of each compound end up on separate lines in sherlock_words and are counted as separate words.
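The effect can be reproduced on a small scale. The sketch below uses a made-up two-line file, sample.txt, as a stand-in for sherlock (the file name and its contents are assumptions for illustration only) and runs the same pipeline on it:

```shell
# sample.txt is a hypothetical stand-in for sherlock
printf 'A well-to-do man,\nin a top-hat.\n' > sample.txt

# wc -w splits on white space only, so "well-to-do" and "top-hat." count once each
raw_count=$(wc -w < sample.txt | tr -d ' ')

# Same pipeline as above: lowercase, then turn each run of non-letters into one newline
tr '[A-Z]' '[a-z]' < sample.txt | tr -cs '[a-z]' '\012' > sample_words
split_count=$(wc -w < sample_words | tr -d ' ')

echo "white-space delimited: $raw_count, after tr: $split_count"
```

The second count is larger because each hyphenated compound contributes one word per component.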

Exercise: Can you list all the hyphenated words in sherlock? How many are there?

Solution:

You could use grep to find occurrences of lines containing words connected by means of hyphens:
grep -c '[A-Za-z]-[A-Za-z]' sherlock
However, this only tells you how many lines contain hyphenated words (35 lines), not how many instances there are, because a line is counted once no matter how many hyphenated words it contains. For example, it will find lines like

"Well, she had a slate-coloured, broad-brimmed straw hat,
and a general air of being fairly well-to-do in a vulgar, comfortable,
easy-going way."

To count the hyphenated compounds themselves, grep can only be used if each compound occurs on a separate line. That means using tr in such a way that hyphens remain in place. You can do that by typing
tr -cs 'A-Za-z\-' '\012' < sherlock > sherlock_hyphened
The addition of \- ensures that the output has all the hyphenated words still in it as single items. If you now type
grep '[A-Za-z]-[A-Za-z]' sherlock_hyphened | more
you will see all the hyphenated words; grep -c will tell you that there are 37 of them.
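To make the difference between counting lines and counting instances concrete, here is a minimal sketch on a made-up two-line file (sample_hyph.txt and its contents are assumptions for illustration, loosely based on the passage quoted above):

```shell
# sample_hyph.txt is a hypothetical sample, not part of sherlock
printf 'a slate-coloured, broad-brimmed straw hat,\nfairly well-to-do and easy-going.\n' > sample_hyph.txt

# Number of lines that contain at least one hyphenated word
line_count=$(grep -c '[A-Za-z]-[A-Za-z]' sample_hyph.txt)

# One token per line with hyphens kept, then count the tokens that contain a hyphen
tr -cs 'A-Za-z\-' '\012' < sample_hyph.txt > sample_hyphened
word_count=$(grep -c '[A-Za-z]-[A-Za-z]' sample_hyphened)

echo "lines: $line_count, hyphenated words: $word_count"
```

Because each token now sits on its own line, grep -c on sample_hyphened counts compounds rather than lines of running text.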

If you inspect the hyphenated words, you will see that they include words like "test-tubes", "good-evening" and "top-hat". For purposes of examining which words in English go with what other words, it may be more useful to separate out these words (because it is useful to know that "good" goes with "evening" or "top" with "hat"). Compounds are separated out like that in sherlock_words. So from now on, we will take that to be the word list, which means there are 7070 words in sherlock.

This is of course a fairly arbitrary decision; there may be arguments for leaving hyphens in. Similarly, when one wants to see which words have which neighbours, one may want to keep sentence boundaries in place rather than stripping them out, since the last word of a sentence and the first word of the next sentence are not really "neighbours". Or perhaps the full stop is itself a neighbour and should appear in the bigrams and be included in the word count. And there are some words in square brackets in this text:

I held the little printed slip to the light.
"Missing [it said] on the morning of the fourteenth, a
gentleman named Hosmer Angel. About five feet seven...

Again, one may want to make some kind of principled decision about how one deals with such interjections.
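One practical detail is worth checking on your own system. With a POSIX-style tr (GNU tr, for example), the square brackets in '[a-z]' are ordinary members of the character set, so the pipeline used for sherlock_words leaves brackets attached to words; some older tr implementations instead required the brackets around ranges. The sketch below, on a made-up one-line file (sample_brackets.txt is an assumption for illustration), compares the two spellings:

```shell
# sample_brackets.txt is a hypothetical one-line sample
printf 'Missing [it said] on the morning\n' > sample_brackets.txt

# Pipeline as used for sherlock_words: brackets are inside the quoted sets
tr '[A-Z]' '[a-z]' < sample_brackets.txt | tr -cs '[a-z]' '\012' > with_brackets

# Variant with plain ranges: brackets now count as non-letters
tr 'A-Z' 'a-z' < sample_brackets.txt | tr -cs 'a-z' '\012' > without_brackets

# Count the tokens that still contain a square bracket
kept=$(grep -c '[][]' with_brackets)
# grep exits non-zero when nothing matches, hence the "|| true"
stripped=$(grep -c '[][]' without_brackets || true)

echo "tokens with brackets: $kept (bracketed sets), $stripped (plain ranges)"
```

Either spelling yields the same number of tokens here; the only difference is whether the brackets survive as part of a token.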

But for the purposes of the exercises in this section, we will use the file sherlock_words as created above, and agree that the original file sherlock has 7070 words in it.
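For readers who do want to keep sentence boundaries, one possible approach is to treat the full stop as a token of its own. This is only a sketch under simplifying assumptions: the file sample_sent.txt is made up for illustration, and every "." is taken to end a sentence, which is not true of abbreviations or ellipses.

```shell
# sample_sent.txt is a hypothetical sample
printf 'He sat down. She stood up.\n' > sample_sent.txt

# Put spaces around each full stop so that it becomes a separate token,
# lowercase, then break on anything that is neither a letter nor a full stop
sed 's/\./ . /g' < sample_sent.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z.' '\012' > sample_tokens

# Six words plus two full stops
token_count=$(wc -l < sample_tokens | tr -d ' ')
echo "tokens: $token_count"
```

The rest of this section does not use such a file; it continues to rely on sherlock_words.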


Chris Brew
8/7/1998