
A first language model

Here is a first illustration of what we mean by the term probabilistic language model, one of the core concepts of Data-Intensive Linguistics.

Imagine a cup of word confetti made by cutting up a copy of ``A Case of Identity'' (or sherlock_words). Now imagine picking words out of the cup, one at a time. On each occasion, you note the word and put it back.

Given that there are 7070 words in the cup, and 7 of them are sherlock, the probability of picking sherlock out of the cup is $p(\mbox{\tt sherlock}) = 7/7070 = 0.00099$. This is the fraction of draws on which you expect to see sherlock if you repeat the experiment many times. Similarly, $p(\mbox{\tt holmes}) = 46/7070 = 0.0065$.
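
The cup experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cup is a stand-in list built from the counts quoted above (7 sherlock and 46 holmes among 7070 tokens, with every other token replaced by the placeholder word other), not the real text of the story, and the function name unigram_probabilities is our own invention.

```python
import random
from collections import Counter


def unigram_probabilities(words):
    """Return the relative frequency of each word: its count / total tokens."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


# Stand-in cup reproducing only the counts quoted in the text.
# "other" is a placeholder for all remaining tokens (an assumption).
cup = ["sherlock"] * 7 + ["holmes"] * 46 + ["other"] * (7070 - 7 - 46)

probs = unigram_probabilities(cup)
print(round(probs["sherlock"], 5))  # 7/7070
print(round(probs["holmes"], 4))    # 46/7070

# Simulate the experiment: draw a word, note it, put it back.
# random.Random.choices samples with replacement; the seed is arbitrary.
rng = random.Random(0)
draws = rng.choices(cup, k=100_000)
observed = Counter(draws)["holmes"] / len(draws)
print(observed)  # should lie close to probs["holmes"]
```

Because each word is put back before the next draw, every draw sees the same cup, so the observed fraction of holmes settles near 46/7070 as the number of draws grows.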

If you are not already comfortable with the ideas of probability and randomness, rest assured that we go into these matters in much more depth later in the book.



Chris Brew
8/7/1998