To build contingency tables, we first need to create a file with the bigrams in sherlock. How many bigrams are there in sherlock? How many different or unique bigrams are there in sherlock? List them in a file called sherlock_ubigrams.
The first question doesn't require any real calculation: a moment's reflection should make it obvious that there are 7069 bigrams (i.e. one fewer than there are words in the text).
To create the file with unique bigrams, you can take the following steps:
    tail -n +2 sherlock_words > sherlock_words2
    paste sherlock_words sherlock_words2 > sherlock_bitemp
    sort sherlock_bitemp | uniq -c > sherlock_bigrams
We now want to separate the bigrams into four groupings:
    sherlock                | followed by | holmes
    sherlock                | followed by | anything (not holmes)
    anything (not sherlock) | followed by | holmes
    anything (not sherlock) | followed by | anything (not holmes)
This covers every possibility (every contingency) of the immediate neighbours of the words sherlock and holmes. You can put the results in a table as follows:
                 | followed by holmes | followed by not holmes | TOTAL
    sherlock     |                    |                        |
    not sherlock |                    |                        |
    TOTAL:       |                    |                        |
You can use awk to find the various values for the table. For example
    awk '$2=="sherlock" && $3=="holmes" {print "sherlock followed by holmes:", $1}' < sherlock_bigrams
will result in the message
    sherlock followed by holmes: 7
That means that 7 is the value for the top left corner.
A script such as the following finds all four values for sherlock and holmes:
    #!/bin/nawk -f
    $2=="sherlock" && $3=="holmes" {freq1=freq1+$1}
    END {print "sherlock followed by holmes:", freq1}
    $2=="sherlock" && $3!="holmes" {freq2=freq2+$1}
    END {print "sherlock followed by not holmes:", freq2}
    $2!="sherlock" && $3=="holmes" {freq3=freq3+$1}
    END {print "not sherlock followed by holmes:", freq3}
    $2!="sherlock" && $3!="holmes" {freq4=freq4+$1}
    END {print "not sherlock followed by not holmes:", freq4}
It is not a very satisfactory way of doing it, since it doesn't generalise very well, as we will see shortly. But it gets the job done. If you run that over sherlock_bigrams you will get the following result:
    sherlock followed by holmes: 7
    sherlock followed by not holmes:
    not sherlock followed by holmes: 39
    not sherlock followed by not holmes: 7023
You don't get a value for sherlock followed by not holmes. That's because no bigram matched that pattern, so freq2 was never assigned, and an awk variable that has never been assigned prints as an empty string rather than as 0. You can correct that by adding a BEGIN statement to your file: BEGIN{freq2=0}.
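The behaviour of unassigned variables is easy to check in isolation. The lines below are only a small sketch (the variable name freq is arbitrary, and /dev/null stands in for an empty input file): an awk variable that has never been assigned prints as nothing, while adding 0 to it forces it to be treated as a number.

```shell
# An awk variable that was never assigned prints as an empty string...
unset_output=$(awk 'END {print "[" freq "]"}' /dev/null)

# ...but it counts as 0 in arithmetic, so freq+0 prints 0.
numeric_output=$(awk 'END {print freq+0}' /dev/null)

echo "$unset_output"     # prints []
echo "$numeric_output"   # prints 0
```

Printing freq2+0 in the END block is therefore an alternative to initialising freq2 in a BEGIN statement.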
If your final figure (not sherlock followed by not holmes) was 7024 instead of 7023, then you may have forgotten to strip out the last line in sherlock_bitemp, which was not a bigram.
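One possible way of stripping out that last line is sketched below on a toy word list. The five words, the temporary directory and the file names inside it are made up for the illustration, and sed '$d' (delete the last line) is just one of several ways of doing this step.

```shell
# Toy word list standing in for sherlock_words (made-up data).
workdir=$(mktemp -d)
printf '%s\n' sherlock holmes was sherlock holmes > "$workdir/words"

# Pair each word with its successor, as in the steps for sherlock_bitemp.
tail -n +2 "$workdir/words" > "$workdir/words2"
paste "$workdir/words" "$workdir/words2" > "$workdir/bitemp"

# The last line of bitemp holds only the final word, so drop it
# before sorting and counting.
sed '$d' "$workdir/bitemp" | sort | uniq -c > "$workdir/bigrams"

# Summing the counts gives the number of bigrams,
# which should be one fewer than the number of words.
n_words=$(awk 'END {print NR}' "$workdir/words")
n_bigrams=$(awk '{n += $1} END {print n}' "$workdir/bigrams")
echo "$n_words words, $n_bigrams bigrams"   # prints: 5 words, 4 bigrams
```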
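This is what the contingency table should look like:
                 | followed by holmes | followed by not holmes | TOTAL
    sherlock     | 7                  | 0                      | 7
    not sherlock | 39                 | 7023                   | 7062
    TOTAL:       | 46                 | 7023                   | 7069
The same information can also be written in a linear format, with the two words followed by the four cell values:
    sherlock holmes 7 0 39 7023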
Here is an awk script that will get you that result:
    #!/bin/nawk -f
    BEGIN{freq1=0; freq2=0; freq3=0; freq4=0}
    $2=="sherlock" && $3=="holmes" {freq1=freq1+$1}
    $2=="sherlock" && $3!="holmes" {freq2=freq2+$1}
    $2!="sherlock" && $3=="holmes" {freq3=freq3+$1}
    $2!="sherlock" && $3!="holmes" {freq4=freq4+$1}
    END{print "sherlock", "holmes", freq1, freq2, freq3, freq4}
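The row and column totals of the table do not need to be stored, because they can be recovered from the four cell values. As a small check (the variable names row1, row2, col1 and col2 are chosen just for this sketch), the following takes the line sherlock holmes 7 0 39 7023 and adds the cells up again:

```shell
# The two words followed by the four cell counts, as given in the text.
line='sherlock holmes 7 0 39 7023'

# Row totals add across a row, column totals add down a column.
totals=$(echo "$line" | awk '{
    row1 = $3 + $4; row2 = $5 + $6    # sherlock, not sherlock
    col1 = $3 + $5; col2 = $4 + $6    # holmes, not holmes
    print row1, row2, col1, col2, row1 + row2
}')
echo "$totals"   # prints: 7 7062 46 7023 7069
```

The last figure, 7069, is the total number of bigrams again.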
And it is possible to build contingency tables like this for every pair of
words in the text.
Write an awk script that will produce the contingency information for every word pair in sherlock_words and print it in the linear format. Hint: don't try to generalise from the way the sherlock and holmes case was handled in the previous exercise. Instead, read the sections of the awk book on arrays and on the split function.
Here is one way of writing the awk code:
     1  #!/bin/nawk -f
     2  # for constructing contingency tables
     3  # takes as input a file with bigrams (output of uniq -c)
     4  {total += $1;
     5   bigrams[$2 "followed by" $3] += $1;
     6   first[$2] += $1;
     7   second[$3] += $1}
     8  END{
     9   for (bigram in bigrams)
    10    {split(bigram,arr,"followed by");
    11     var1=arr[1];
    12     var2=arr[2];
    13     print var1, var2,
    14      bigrams[bigram],
    15      first[var1]-bigrams[bigram],
    16      second[var2]-bigrams[bigram],
    17      total+bigrams[bigram]-first[var1]-second[var2]}
    18  }
As before, the line numbers are only there to make discussion easier; you have to remove them, or the program won't actually run.
The intuition behind the code is as follows. By the time you reach the end of the file (and therefore also the END part of the awk program) you will need to print out four cells of the contingency table. To do that, you keep track of four things: the total number of bigrams (total), the number of times each word occurs in first position (first), the number of times each word occurs in second position (second), and the number of times each bigram occurs (bigrams). The first three of these are easy, a minor variation of the idea in the word-counting program. The fourth is slightly tricky.
Although the output of uniq -c contains bigrams, they are spread across two different fields (in the case of sherlock_bigrams they are in fields $2 and $3). So the first thing to do is to create an array called bigrams which combines the items from the second and third field (line 5). Then you can write the for-loop which takes every element in bigrams in turn (i.e. every bigram; cf. line 9).
Since this bigram consists of two items, you can separate them out again using split (line 10). The values are stored in the array elements arr[1] and arr[2]. We give these the names var1 and var2 respectively (lines 11 and 12).
We calculate the total number of times var1 occurred in the second field (line 6), and the total number of times var2 occurred in the third field (line 7).
Finally we print, for each instance of var1 and var2 (i.e. for each bigram):
  - the values of var1 and var2 (line 13);
  - the number of times they occur together, bigrams[bigram] (line 14);
  - the number of times var1 was found in first place but var2 was not in second place. This is calculated by taking the value of first[var1] (i.e. the total number of times var1 occurred in first position in the bigram) and subtracting the number of times it occurred in first position with var2 in second position (line 15);
  - the number of times var2 was found in second position in the bigram and var1 was not in first position, again by taking the total number of times var2 occurs in second position and subtracting those occasions where it occurred second and var1 occurred first (line 16);
  - the number of bigrams in which var1 was not first and var2 was not second (line 17). This is the total number of bigrams (total) minus bigrams[bigram], minus first[var1]-bigrams[bigram], minus second[var2]-bigrams[bigram], which equals total-first[var1]-second[var2]+bigrams[bigram].
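To see the four cells come out, the script can be tried on a very small bigram file. The sketch below uses four made-up bigram counts (not taken from the sherlock text) and the same program as above with the line numbers removed. Since awk does not guarantee any particular order for a for (... in ...) loop, the output is piped through sort.

```shell
# Made-up bigram counts in the format produced by uniq -c: 10 bigrams in all.
result=$(printf '%s\n' \
    '3 sherlock holmes' \
    '1 sherlock said' \
    '2 mr holmes' \
    '4 he said' |
awk '
# Accumulate the grand total, the bigram counts and the
# counts for each word in first and in second position.
{ total += $1
  bigrams[$2 "followed by" $3] += $1
  first[$2] += $1
  second[$3] += $1 }
END {
  for (bigram in bigrams) {
    # Recover the two words from the combined array key.
    split(bigram, arr, "followed by")
    var1 = arr[1]
    var2 = arr[2]
    print var1, var2,
      bigrams[bigram],
      first[var1] - bigrams[bigram],
      second[var2] - bigrams[bigram],
      total + bigrams[bigram] - first[var1] - second[var2]
  }
}' | sort)
echo "$result"
```

With this input the line for sherlock and holmes reads sherlock holmes 3 1 2 4. The four cells add up to 10, the total number of bigrams, as they should on every line.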