next up previous contents
Next: Counting words in documents Up: Applying probabilities to Data-Intensive Previous: Text preparation

Contingency tables

To build contingency tables, we first need to create a file with the bigrams in sherlock.

Exercise:

How many bigrams are there in sherlock? How many different or unique bigrams are there in sherlock? List them in a file called sherlock_ubigrams.

Solution:

The first question doesn't require any real calculation: a moment's reflection should make it obvious that there are 7069 bigrams (i.e. one less than there are words in the text).
To create the file with unique bigrams, you can take the following steps:
tail +2 sherlock_words > sherlock_words2
paste sherlock_words sherlock_words2 > sherlock_bitemp
sort sherlock_bitemp | uniq -c > sherlock_bigrams
We now want to separate the bigrams into four groupings:

 
Figure 8.3: Randomly generated strings from several orders of Markov model
sherlock followed by holmes
sherlock followed by anything (not holmes)
anything (not sherlock) followed by holmes
anything (not sherlock) followed by anything (not holmes)

This covers every possibility (every contingency) of the immediate neighbours of the words sherlock and holmes. You can put the results in a table as follows:

  followed by holmes followed by not holmes TOTAL
sherlock      
not sherlock      
TOTAL:      

You can use awk to find the various values for the table. For example

awk '$2=="sherlock" && $3=="holmes" {print "sherlock followed by
holmes:", $1}' < sherlock_bigrams
will result in the message
sherlock followed by holmes:
That means that 7 is the value for the top left corner.
Exercise:

Complete the contingency table for sherlock and holmes.

Solution:

There are different ways of doing this. Here is one very simple one:
#!/bin/nawk -f
$2=="sherlock" && $3=="holmes" {freq1=freq1+$1}
        END {print "sherlock followed by holmes:", freq1}
$2=="sherlock" && $3=!"holmes" {freq2=freq2+$1}
        END {print "sherlock followed by not holmes:", freq2}
$2!="sherlock" && $3=="holmes" {freq3=freq3+$1}
        END {print "not sherlock followed by holmes:", freq3}
$2!="sherlock" && $3!="holmes" {freq4=freq4+$1}
        END {print "not sherlock followed by not holmes:", freq4}
It is not a very satisfactory way of doing it, since it doesn't generalise very well, as we will see shortly. But it gets the job done. If you run that over sherlock_bigrams you will get the following result:
sherlock followed by holmes: 7
sherlock followed by not holmes: 
not sherlock followed by holmes: 39
not sherlock followed by not holmes: 7023
You don't get a value for sherlock followed by not holmes; that's because awk variables do not start of as zero, they start of as nothing. You can correct that by adding a BEGIN statement to your file: BEGIN{freq2=0}.

If your final figure (not sherlock followed by not holmes) was 7024 instead of 7023, then you may have forgotten to strip out the last line in sherlock_bitemp, which was not a bigram.

This is what the contingency table should look like:

1|r|followed by holmes not holmes TOTAL
sherlock 7   7
not sherlock 39 7023 7062
TOTAL: 46 7023 7069
There is a simpler way of writing down the result: instead of the above table, one can just write
sherlock holmes 7 0 39 7023

Here is an awk script that will get you that result:

#!/bin/nawk -f
BEGIN{freq1=0; freq2=0; freq3=0; freq4=0}         
$2=="sherlock" && $3=="holmes" {freq1=freq1+$1} 
$2=="sherlock" && $3=!"holmes" {freq2=freq2+$1} 
$2!="sherlock" && $3=="holmes" {freq3=freq3+$1} 
$2!="sherlock" && $3!="holmes" {freq4=freq4+$1} 
        END{print "sherlock", "holmes", freq1, freq2, freq3, freq4}

And it is possible to build contingency tables like this for every pair of words in the text.

Exercise:

Write an awk script that will produce the contingency information for every word pair in sherlock_words and print it in the linear format. Hint: don't try to generalise from the way the sherlock and +verb+holmes+ case was handled in the previous exercise. Instead, read in the awk book the section on arrays and split functions.

Solution:

Here is one way of writing the awk code:
1 #!/bin/nawk -f
2 # for constructing contingency tables
3 # takes as input a file with bigrams (output of uniq -c)
4 {total +=$1;  
5  bigrams[$2 "followed by" $3] += $1; 
6  first[$2] += $1; 
7  second[$3] += $1}
8 END{
9   for (bigram in bigrams)
10    {split(bigram,arr,"followed by");
11      var1=arr[1];
12      var2=arr[2];
13      print var1, var2,
14        bigrams[bigram],
15        first[var1]-bigrams[bigram], 
16        second[var2]-bigrams[bigram],
17        total+bigrams[bigram]-first[var1]-second[var2]}
18      }
As before, the line numbers are only there to make discussion easier; you have to remove them, or the program won't actually run. The intuition behind the code is as follows. By the time you reach the end of the file (and therefore also the END part of the awk program) you will need to print out four cells of the contingency table. To do that, you keep track of four things
  • total - the total number of bigrams seen so far.
  • first - the total number of times each word occurs as first part of a bigram.
  • second - the total number of times each word occurs as second part of a bigram.
  • bigrams - the count for each bigram
The first three of these are easy, a minor variation of idea in the word-counting program. The fourth is slightly tricky.

Although the output of uniq -c contains bigrams, they are spread across two different fields (in the case of sherlock_bigrams they are in fields $2 and $3). So the first thing to do (cf. line 5) is to create an array called bigrams which combines the items from the second and third field (in line 5). Then you can write the for-loop which takes every element in bigrams in turn (i.e. every bigram--cf. line 9).

Since this bigram consists of two items, you can separate them out again using split (in line 10). The values are stored in arrays called arr[1] and arr[2]. We give these the names var1 and var2 respectively (lines 11 and 12).

We calculate the total numer of times var1 occurred in the second field (line 6), and the total number of times var2 occurred in the third field (line 7).

Finally we print for each instance of var1 and var2 (i.e. for each bigram):

  • The two words in the bigram (i.e. var1 and var2 in line 13).
  • The number of times the bigram occurred (i.e. the total value of bigrams[bigram] in line 14).
  • The number of times var1 was found in first place but var2 was not in second place. This is calculated by taking the value of first[var1] (i.e. the total number of times var1 occurred in first position in the bigram) and subtracting the number of times it occurred in first position with var2 in second position (line 15).
  • The number of times var2 was found in second position in the bigram and var1 was not in first position (again by taking the total number of times var2 occurs in second position and subtracting those occasions where it occurred second and var1 occurred first (line 16).
  • The remainder, i.e. total number of bigrams (the total value of total)
    minus first[var1]-bigrams[bigram]
    minus second[var2]-bigrams[bigram]
    equals total-first[var1]-second[var2]+bigrams[bigram].


next up previous contents
Next: Counting words in documents Up: Applying probabilities to Data-Intensive Previous: Text preparation
Chris Brew
8/7/1998