To build contingency tables, we first need to create a file with
the bigrams in sherlock.
How many bigrams are there in sherlock? How many
different or unique bigrams are there in sherlock? List
them in a file called sherlock_ubigrams.
We now want to separate the bigrams into four groupings:
The first question doesn't require any real calculation: a moment's
reflection should make it obvious that there are 7069 bigrams (i.e. one less than there are words in the text).
To create the file with unique bigrams, you can take the following
steps:
tail +2 sherlock_words > sherlock_words2
paste sherlock_words sherlock_words2 > sherlock_bitemp
sort sherlock_bitemp | uniq -c > sherlock_bigrams
sherlock |
followed by | holmes |
sherlock |
followed by | anything (not holmes) |
anything (not sherlock) |
followed by | holmes |
anything (not sherlock) |
followed by | anything (not holmes) |
This covers every
possibility (every contingency) of the immediate neighbours of the
words sherlock and holmes. You can put the results in a
table as follows:
followed by holmes |
followed by not holmes |
TOTAL | |
sherlock |
|||
not sherlock |
|||
| TOTAL: |
You can use awk
to find the various values for the table. For example
awk '$2=="sherlock" && $3=="holmes" {print "sherlock followed by
holmes:", $1}' < sherlock_bigrams
will result in the message
sherlock followed by holmes:That means that 7 is the value for the top left corner.
sherlock and holmes.
#!/bin/nawk -f
$2=="sherlock" && $3=="holmes" {freq1=freq1+$1}
END {print "sherlock followed by holmes:", freq1}
$2=="sherlock" && $3=!"holmes" {freq2=freq2+$1}
END {print "sherlock followed by not holmes:", freq2}
$2!="sherlock" && $3=="holmes" {freq3=freq3+$1}
END {print "not sherlock followed by holmes:", freq3}
$2!="sherlock" && $3!="holmes" {freq4=freq4+$1}
END {print "not sherlock followed by not holmes:", freq4}
It is not a very satisfactory way of doing it, since it doesn't generalise very well,
as we will see shortly. But it gets the job done.
If you run that over sherlock_bigrams you will get the following
result:
sherlock followed by holmes: 7 sherlock followed by not holmes: not sherlock followed by holmes: 39 not sherlock followed by not holmes: 7023You don't get a value for
sherlock followed by
not holmes; that's because awk variables do not start of as zero,
they start of as nothing. You can correct that by adding a BEGIN
statement to your file: BEGIN{freq2=0}.
If your final figure (not sherlock followed by
not holmes) was 7024 instead of 7023, then you may have forgotten to
strip out the last line in sherlock_bitemp, which was not a
bigram.
This is what the contingency table should look like:
| 1|r|followed by | holmes |
not holmes |
TOTAL |
sherlock |
7 | 7 | |
not sherlock |
39 | 7023 | 7062 |
| TOTAL: | 46 | 7023 | 7069 |
sherlock holmes 7 0 39 7023
Here is an awk script that will get you that result:
#!/bin/nawk -f
BEGIN{freq1=0; freq2=0; freq3=0; freq4=0}
$2=="sherlock" && $3=="holmes" {freq1=freq1+$1}
$2=="sherlock" && $3=!"holmes" {freq2=freq2+$1}
$2!="sherlock" && $3=="holmes" {freq3=freq3+$1}
$2!="sherlock" && $3!="holmes" {freq4=freq4+$1}
END{print "sherlock", "holmes", freq1, freq2, freq3, freq4}
And it is possible to build contingency tables like this for every pair of
words in the text.
Although the output of
Since this bigram consists of two items, you can separate them out
again using
We calculate the total numer of times
Finally we print for each instance of
Write an awk script that will produce the contingency
information for every word pair in sherlock_words and print it in the
linear format. Hint: don't try to generalise from the way the sherlock
and +verb+holmes+ case was handled in the previous exercise. Instead, read
in the awk book the section on
arrays and split functions.
Here is one way of writing the awk code:
1 #!/bin/nawk -f
2 # for constructing contingency tables
3 # takes as input a file with bigrams (output of uniq -c)
4 {total +=$1;
5 bigrams[$2 "followed by" $3] += $1;
6 first[$2] += $1;
7 second[$3] += $1}
8 END{
9 for (bigram in bigrams)
10 {split(bigram,arr,"followed by");
11 var1=arr[1];
12 var2=arr[2];
13 print var1, var2,
14 bigrams[bigram],
15 first[var1]-bigrams[bigram],
16 second[var2]-bigrams[bigram],
17 total+bigrams[bigram]-first[var1]-second[var2]}
18 }
As before, the line numbers are only there to make discussion easier;
you have to remove them, or the program won't actually run.
The intuition behind the code is as follows. By the time
you reach the end of the file (and therefore also the
END part of the awk program) you will need to
print out four cells of the contingency table.
To do that, you keep track of four things
The first three of these are easy, a minor variation
of idea in the
word-counting program. The fourth is slightly tricky.
uniq -c contains
bigrams, they are spread across two different fields (in the case of
sherlock_bigrams they are in fields $2 and $3).
So the first thing to do (cf. line 5) is to create an array called
bigrams
which combines the items from the second and third field (in line 5).
Then you can write the for-loop which takes every element in
bigrams in turn (i.e. every bigram--cf. line 9).
split (in line 10). The values are stored in arrays called
arr[1] and arr[2]. We give these the names var1 and var2
respectively (lines 11 and 12).
var1 occurred in the second field
(line 6), and the total number of times var2 occurred in the third field
(line 7).
var1 and var2 (i.e. for each bigram):
var1 and var2 in line 13).
bigrams[bigram] in line 14).
var1 was found in first place but var2 was
not in second place. This is calculated by
taking the value of first[var1] (i.e. the total
number of times var1 occurred in first position in the bigram) and subtracting
the number of times it occurred in first position with var2 in second position
(line 15).
var2 was found in second position in the bigram
and var1 was not in first position (again by taking the total number of times
var2 occurs in second position and subtracting those occasions where it
occurred second and var1 occurred first (line 16).
total)
minus first[var1]-bigrams[bigram]
minus second[var2]-bigrams[bigram]
equals total-first[var1]-second[var2]+bigrams[bigram].