
Estimating Model Parameters

  The obvious way to estimate the transition probabilities, which does not work, is to take the ratio of the counts of (k+1)-grams to the counts of the corresponding k-grams. That is, we form the ratio:

\begin{displaymath}
p(w_{k+1} \mid w_{1} \ldots w_{k}, M_{lang}) = \frac{T(w_{1} \ldots w_{k+1}, T_{lang})}{T(w_{1} \ldots w_{k}, T_{lang})}
\end{displaymath}
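
As a concrete illustration (not part of the original notes), here is a small Python sketch of this count-ratio estimate for character k-grams; the function names and the toy training string are invented for the example.

\begin{verbatim}
from collections import Counter

def ngram_counts(text, n):
    """Count all character n-grams of length n in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ml_transition_prob(history, symbol, text):
    """Maximum likelihood estimate of p(symbol | history) as a ratio of
    raw (k+1)-gram and k-gram counts.  Any (k+1)-gram unseen in the
    training text gets probability 0.0."""
    k = len(history)
    numerator = ngram_counts(text, k + 1)[history + symbol]
    denominator = ngram_counts(text, k)[history]
    return numerator / denominator if denominator else 0.0

training = "the cat sat on the mat"              # toy training text
print(ml_transition_prob("th", "e", training))   # 1.0: every "th" is followed by "e"
print(ml_transition_prob("th", "a", training))   # 0.0: "tha" never occurs
\end{verbatim}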

This ratio is the maximum likelihood estimator for the transition probabilities given the training set. It is a great model for the training set, but fails catastrophically in the real world: any (k+1)-gram that never occurred in the training data is assigned probability zero, which brings back all the brittleness of the unique string model. Fortunately, that is not the end of the story, because alternatives are available. In particular, there is a more stable expression:

\begin{displaymath}
\hat{p}(w_{k+1} \mid w_{1} \ldots w_{k}, M_{lang}) = \frac{T(w_{1} \ldots w_{k+1}, T_{lang}) + 1}{T(w_{1} \ldots w_{k}, T_{lang}) + \vert M \vert}
\end{displaymath}

where |M| is the size of the alphabet of symbols which might be generated. Intuitively, this formula, which is called the Laplace sample-size correction, adds one to every count in the numerator, ensuring that none is ever quite zero, and ``balances'' it by adding a corresponding amount to the count in the denominator. There is a proper derivation of this formula in Dunning's paper, which you can read later if it appeals, but various approximations are involved. For some applications, notably word-based n-grams, you need a more refined approach, but for language identification it is quite acceptable. Dunning shows that using the Laplace formula to estimate (k+1)-grams directly amounts to using a decision rule which incorporates evidence from Markov models of order 0 to k. Everybody should read up to page 18 of Dunning's paper and look at the graphs; masochists can read the maths too.
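
A minimal sketch of the corrected estimate follows, assuming an alphabet of 27 symbols (the 26 lowercase letters plus space) for |M|; again, the names and toy data are invented for illustration rather than taken from Dunning's paper.

\begin{verbatim}
from collections import Counter

def laplace_transition_prob(history, symbol, text, alphabet_size):
    """Laplace sample-size corrected estimate of p(symbol | history):
    add 1 to the (k+1)-gram count and |M| to the k-gram count, so the
    estimate is never exactly zero."""
    k = len(history)
    numerator = Counter(text[i:i + k + 1]
                        for i in range(len(text) - k))[history + symbol] + 1
    denominator = Counter(text[i:i + k]
                          for i in range(len(text) - k + 1))[history] + alphabet_size
    return numerator / denominator

training = "the cat sat on the mat"                       # toy training text
print(laplace_transition_prob("th", "e", training, 27))   # (2+1)/(2+27): below the raw 1.0
print(laplace_transition_prob("th", "a", training, 27))   # (0+1)/(2+27): small but non-zero
\end{verbatim}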



 