Bigram probabilities

Now we want to calculate the probability of bigram occurrences. Every word token except the last is the first word of exactly one bigram, so a document of 7070 tokens contains 7070-1=7069 bigrams. Dividing each bigram count by this total gives the following probabilities, where $\neg$ stands for any word other than the one named:

$p({\tt sherlock},{\tt holmes}) = 7/7069 = 0.00099$
$p({\tt sherlock},\neg{\tt holmes}) = 0/7069 = 0.0$
$p(\neg {\tt sherlock},{\tt holmes}) = 39/7069 = 0.00552$
$p(\neg {\tt sherlock},\neg {\tt holmes}) = 7023/7069 = 0.99349$
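
The counts behind these probabilities can be gathered in a single pass over adjacent token pairs. The following is a minimal sketch, assuming the document is already available as a list of lowercase tokens; the function name `bigram_table` and the toy token list are illustrative and do not come from the text analysed here.

```python
from collections import Counter


def bigram_table(tokens, first, second):
    """Count bigrams by whether word 1 is `first` and word 2 is `second`."""
    counts = Counter()
    # zip pairs each token with its successor, giving len(tokens) - 1 bigrams
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[(w1 == first, w2 == second)] += 1
    return counts


# Toy data (hypothetical, not the document analysed above)
tokens = "mr sherlock holmes said that holmes knew sherlock holmes".split()
counts = bigram_table(tokens, "sherlock", "holmes")
total = sum(counts.values())  # number of bigrams = number of tokens - 1

# Print each cell of the 2x2 table as (key, count, probability)
for key in [(True, True), (True, False), (False, True), (False, False)]:
    print(key, counts[key], round(counts[key] / total, 5))
```

Each key is a pair of booleans, so `(True, False)` corresponds to the cell for `sherlock` followed by $\neg {\tt holmes}$.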

We can lay these results out in a table. Note the marginal totals: each entry in the Total row or column is the sum of the cells in its column or row.

	`holmes`	$\neg {\tt holmes}$	Total
`sherlock`	0.00099	0.00000	0.00099
$\neg {\tt sherlock}$	0.00552	0.99349	0.99901
Total	0.00651	0.99349	1.00000
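
The marginal totals can be checked by summing the joint probabilities along rows and columns. This is a small sketch using the four bigram counts quoted above (7, 0, 39 and 7023); the variable names are our own.

```python
# Observed bigram counts from the text above:
# rows    = first word is / is not "sherlock"
# columns = second word is / is not "holmes"
observed = [[7, 0],
            [39, 7023]]
n = sum(sum(row) for row in observed)  # total number of bigrams

# Joint probabilities: each count divided by the number of bigrams
joint = [[c / n for c in row] for row in observed]

# Marginals: sum across each row, and down each column
row_marginals = [sum(row) for row in joint]        # p(sherlock), p(not sherlock)
col_marginals = [sum(col) for col in zip(*joint)]  # p(holmes), p(not holmes)

print([round(p, 5) for p in row_marginals])
print([round(p, 5) for p in col_marginals])
```

Working from counts rather than from the rounded probabilities avoids accumulating rounding error in the totals.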

If text really were word confetti, the second word of a bigram would be independent of the first, so the probability of each pair would simply be the product of the two words' individual probabilities. We can represent this in the table by filling each cell with the product of its column and row marginal probabilities.

	`holmes`	$\neg {\tt holmes}$	Total
`sherlock`	$0.00651 \times 0.00099$	$0.99349 \times 0.00099$	0.00099
$\neg {\tt sherlock}$	$0.00651 \times 0.99901$	$0.99349 \times 0.99901$	0.99901
Total	0.00651	0.99349	1.00000
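
As a worked check, the four products evaluate approximately as follows (computed from the rounded marginals, so the last digit is only indicative):

$p({\tt holmes}) \times p({\tt sherlock}) = 0.00651 \times 0.00099 \approx 0.0000064$
$p(\neg {\tt holmes}) \times p({\tt sherlock}) = 0.99349 \times 0.00099 \approx 0.00098$
$p({\tt holmes}) \times p(\neg {\tt sherlock}) = 0.00651 \times 0.99901 \approx 0.00650$
$p(\neg {\tt holmes}) \times p(\neg {\tt sherlock}) = 0.99349 \times 0.99901 \approx 0.99251$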

To turn these expected probabilities into expected frequencies, multiply each cell by the number of bigrams, 7069:

	`holmes`	$\neg {\tt holmes}$	Total
`sherlock`	0.05	6.95	7
$\neg {\tt sherlock}$	45.95	7016.05	7062
Total	46	7023	7069
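
The expected frequencies can also be computed directly from the observed row and column totals, since multiplying the product of two marginal probabilities by the number of bigrams simplifies to row total times column total divided by the number of bigrams. The following is a sketch under that assumption, again using the counts quoted above with variable names of our own choosing.

```python
observed = [[7, 0],
            [39, 7023]]
n = sum(sum(row) for row in observed)               # 7069 bigrams
row_totals = [sum(row) for row in observed]         # sherlock, not sherlock
col_totals = [sum(col) for col in zip(*observed)]   # holmes, not holmes

# Under independence:
# expected count = n * (row_total / n) * (col_total / n) = row_total * col_total / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Show observed counts next to expected counts, row by row
for obs_row, exp_row in zip(observed, expected):
    print(obs_row, [round(e, 2) for e in exp_row])
```

Setting the two tables side by side shows the point of the exercise: the bigram `sherlock holmes` is observed 7 times, against an expected frequency of about 0.05 if the words were independent.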

Chris Brew
8/7/1998