
Estimating Model Parameters

  The obvious way to estimate the transition probabilities, which does not work, is to take the ratio of the counts of (k+1)-grams to the counts of the corresponding k-grams. That is, we form the ratio:

\begin{displaymath}
p(w_{k+1} \mid w_{1} \ldots w_{k}, M_{lang}) = \frac{T(w_{1} \ldots w_{k+1}, T_{lang})}{T(w_{1} \ldots w_{k}, T_{lang})}
\end{displaymath}
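
As a concrete illustration (not part of the original notes), here is a small Python sketch of this count-ratio estimate for character k-grams; the function names and the toy training string are invented for the example.

\begin{verbatim}
from collections import Counter

def ngram_counts(text, n):
    """Count all character n-grams of length n in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ml_transition_prob(history, symbol, text):
    """Maximum likelihood estimate of p(symbol | history) as a ratio of
    raw (k+1)-gram and k-gram counts.  Any (k+1)-gram unseen in the
    training text gets probability 0.0."""
    k = len(history)
    numerator = ngram_counts(text, k + 1)[history + symbol]
    denominator = ngram_counts(text, k)[history]
    return numerator / denominator if denominator else 0.0

training = "the cat sat on the mat"              # toy training text
print(ml_transition_prob("th", "e", training))   # 1.0: every "th" is followed by "e"
print(ml_transition_prob("th", "a", training))   # 0.0: "tha" never occurs
\end{verbatim}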

This ratio is the maximum likelihood estimator for the transition probabilities given the training set. It is a great model for the training set, but fails catastrophically in the real world: any (k+1)-gram that never occurred in the training data is assigned probability zero, which brings back all the brittleness of the unique string model. Fortunately, that is not the end of the story, because alternatives are available. In particular, there is a more stable expression:

\begin{displaymath}
\hat{p}(w_{k+1} \mid w_{1} \ldots w_{k}, M_{lang}) = \frac{T(w_{1} \ldots w_{k+1}, T_{lang}) + 1}{T(w_{1} \ldots w_{k}, T_{lang}) + \vert M \vert}
\end{displaymath}

where |M| is the size of the alphabet of symbols which might be generated. Intuitively, this formula, which is called the Laplace sample-size correction, adds one to every count in the numerator, ensuring that none is ever quite zero, and ``balances'' it by adding a corresponding amount to the count in the denominator. There is a proper derivation of this formula in Dunning's paper, which you can read later if it appeals, but various approximations are involved. For some applications, notably word-based n-grams, you need a more refined approach, but for language identification it is quite acceptable. Dunning shows that using the Laplace formula to estimate (k+1)-grams directly amounts to using a decision rule which incorporates evidence from Markov models of order 0 to k. Everybody should read up to page 18 of Dunning's paper and look at the graphs; masochists can read the maths too.
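
A minimal sketch of the corrected estimate follows, assuming an alphabet of 27 symbols (the 26 lowercase letters plus space) for |M|; again, the names and toy data are invented for illustration rather than taken from Dunning's paper.

\begin{verbatim}
from collections import Counter

def laplace_transition_prob(history, symbol, text, alphabet_size):
    """Laplace sample-size corrected estimate of p(symbol | history):
    add 1 to the (k+1)-gram count and |M| to the k-gram count, so the
    estimate is never exactly zero."""
    k = len(history)
    numerator = Counter(text[i:i + k + 1]
                        for i in range(len(text) - k))[history + symbol] + 1
    denominator = Counter(text[i:i + k]
                          for i in range(len(text) - k + 1))[history] + alphabet_size
    return numerator / denominator

training = "the cat sat on the mat"                       # toy training text
print(laplace_transition_prob("th", "e", training, 27))   # (2+1)/(2+27): below the raw 1.0
print(laplace_transition_prob("th", "a", training, 27))   # (0+1)/(2+27): small but non-zero
\end{verbatim}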



 