
Bigram probabilities

Now we want to calculate the probability of bigram occurrences. Every word token in the document except the last is the first word of exactly one bigram, so the 7070 tokens yield 7070-1=7069 bigrams. We can then calculate the following bigram probabilities, where $\neg{\tt holmes}$ stands for any word other than holmes:

$p({\tt sherlock},{\tt holmes}) = 7/7069 = 0.00099$
$p({\tt sherlock},\neg{\tt holmes}) = 0/7069 = 0.0$
$p(\neg {\tt sherlock},{\tt holmes}) = 39/7069 = 0.00552$
$p(\neg {\tt sherlock},\neg {\tt holmes}) = 7023/7069 = 0.99349$
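
The counting step behind these figures can be sketched in Python. This is a minimal illustration, not the procedure actually used to produce the numbers above: the function name `contingency_counts` and the toy token list are invented here, and only the four counts 7, 0, 39 and 7023 come from the text.

```python
from collections import Counter

def contingency_counts(tokens, first, second):
    """Sort every bigram into one of four cells, keyed by
    (first word matches?, second word matches?)."""
    bigrams = list(zip(tokens, tokens[1:]))  # n tokens give n - 1 bigrams
    cells = Counter()
    for w1, w2 in bigrams:
        cells[(w1 == first, w2 == second)] += 1
    return cells, len(bigrams)

# Toy token sequence, invented for illustration (not the document used above).
tokens = "mr sherlock holmes said holmes to watson".split()
cells, n_bigrams = contingency_counts(tokens, "sherlock", "holmes")
print(n_bigrams, dict(cells))

# The counts reported in the text, converted to probabilities.
text_counts = {(True, True): 7, (True, False): 0,
               (False, True): 39, (False, False): 7023}
total = sum(text_counts.values())  # 7069 bigrams
probs = {cell: count / total for cell, count in text_counts.items()}
for cell, p in probs.items():
    print(cell, round(p, 5))
```

Keying the cells by a pair of booleans keeps the four cases of the table in one dictionary, so the same structure works for any word pair.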

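We can lay these results out in a table. Note the marginal totals: each row total is the probability of the first word, and each column total is the probability of the second word, whatever the other word in the bigram is.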

                       holmes     $\neg {\tt holmes}$    Total
sherlock               0.00099    0.00000                0.00099
$\neg {\tt sherlock}$  0.00552    0.99349                0.99901
Total                  0.00651    0.99349                1.00000

If text really were word confetti, we could assume that the probability of the second word is unaffected by which word comes first; that is, the two words would be independent. In that case the probability of each cell is the product of its row and column marginal probabilities:

                       holmes                      $\neg {\tt holmes}$         Total
sherlock               $0.00651 \times 0.00099$    $0.99349 \times 0.00099$    0.00099
$\neg {\tt sherlock}$  $0.00651 \times 0.99901$    $0.99349 \times 0.99901$    0.99901
Total                  0.00651                     0.99349                     1.00000
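
As a rough sketch of the same calculation, the marginals and their products can be computed directly from the four observed counts. The variable names (`joint`, `p_first`, `p_second`, `independent`) are chosen here for illustration; working from the raw counts rather than the rounded table entries avoids accumulating rounding error.

```python
# Observed bigram counts from the text, keyed by
# (first word is sherlock?, second word is holmes?).
counts = {
    (True, True): 7,
    (True, False): 0,
    (False, True): 39,
    (False, False): 7023,
}
total = sum(counts.values())  # 7069 bigrams

# Joint probabilities: the cells of the first table.
joint = {cell: c / total for cell, c in counts.items()}

# Marginals: sum the joint table along each row and each column.
p_first = {a: joint[(a, True)] + joint[(a, False)] for a in (True, False)}
p_second = {b: joint[(True, b)] + joint[(False, b)] for b in (True, False)}

# Under independence, each cell is the product of its two marginals.
independent = {(a, b): p_first[a] * p_second[b]
               for a in (True, False) for b in (True, False)}

for cell in counts:
    print(cell, f"observed={joint[cell]:.5f}",
          f"independent={independent[cell]:.5f}")
```

The products still sum to 1, so the independence table is itself a valid probability distribution over the four cells.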

To turn these probabilities into expected frequencies, multiply each cell by the number of bigrams, 7069:

                       holmes    $\neg {\tt holmes}$    Total
sherlock               0.05      6.95                   7
$\neg {\tt sherlock}$  45.95     7016.05                7062
Total                  46        7023                   7069
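
The expected frequencies can be checked with a short Python sketch. It assumes the observed counts are held in a 2x2 nested list, and the helper name `expected_counts` is invented here. Multiplying the two marginal probabilities by the number of bigrams is the same as computing row total times column total divided by the grand total, which is the form used below.

```python
def expected_counts(observed):
    """Expected cell counts under independence for a table of observed counts.

    Each expected count is row_total * col_total / grand_total, which equals
    grand_total * p(row) * p(column).
    """
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: sherlock, not sherlock. Columns: holmes, not holmes.
observed = [[7, 0], [39, 7023]]
expected = expected_counts(observed)

for obs_row, exp_row in zip(observed, expected):
    print(obs_row, [round(x, 2) for x in exp_row])
```

Printing observed and expected counts side by side makes the gap visible: the bigram sherlock holmes occurs 7 times, against an expectation of about 0.05 under independence.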

Chris Brew
8/7/1998