The obvious way to do this, which doesn't work, is to estimate the
transition probabilities by taking the ratio of the counts of k+1-grams
and the corresponding k-grams. That is, we form the ratio:

\[ \hat{P}(w_{k+1} \mid w_1 \ldots w_k) = \frac{C(w_1 \ldots w_k w_{k+1})}{C(w_1 \ldots w_k)} \]

where C(.) is the number of times the gram in question occurs in the
training data.
This is the maximum likelihood estimator for the transition
probabilities given the training set. It's a great model for the
training set, but fails catastrophically in the real world. This
is because:
- There may be k+1-grams in the test data which are
absent from the training data. This brusquely zeroes the
probability of the whole test string (the sketch after this list
makes the failure concrete).
- By bad luck, a k+1-gram may appear in the training data
for only one language, even though it is in fact rare in all the
languages. If this happens, all strings containing that k+1-gram will
be judged to be from that language, because the probability under
every other language will be 0.
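To make the first failure mode concrete, here is a minimal sketch in
Python with k = 2. The toy training and test strings, and helper names
like ngram_counts, are invented for illustration; none of this comes
from Dunning's paper.

    from collections import Counter

    def ngram_counts(text, n):
        """Count the n-grams (tuples of characters) occurring in a string."""
        return Counter(tuple(text[i:i + n]) for i in range(len(text) - n + 1))

    def mle_string_prob(text, train, k):
        """Probability of `text` under the maximum likelihood (k+1)-gram model."""
        hi = ngram_counts(train, k + 1)   # counts of (k+1)-grams
        lo = ngram_counts(train, k)       # counts of the corresponding k-grams
        prob = 1.0
        for i in range(len(text) - k):
            event = tuple(text[i:i + k + 1])
            context = tuple(text[i:i + k])
            # Ratio of counts; a single unseen (k+1)-gram zeroes the whole
            # product (and an unseen context leaves the ratio undefined, so
            # we treat that as zero here too).
            prob *= hi[event] / lo[context] if lo[context] else 0.0
        return prob

    print(mle_string_prob("abca", "abcabcabc", 2))   # all trigrams seen: 2/3
    print(mle_string_prob("abza", "abcabcabc", 2))   # "abz" unseen: 0.0

The second call shows the brittleness: one unseen trigram wipes out the
probability of the whole string.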
This means that using the maximum likelihood estimator brings back all
the brittleness of the unique string model. Fortunately, that's not the
end of the story, because alternatives are available. In particular,
there is a more stable expression:

\[ \hat{P}(w_{k+1} \mid w_1 \ldots w_k) = \frac{C(w_1 \ldots w_k w_{k+1}) + 1}{C(w_1 \ldots w_k) + |M|} \]
where |M| is the size of the alphabet of symbols which might be
generated. Intuitively, this formula, which is called the Laplace
sample-size correction, adds one to every count in the
numerator, ensuring that none is ever quite zero, and ``balances'' this
by adding a corresponding number to the counts in the denominator.
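Continuing the sketch above (again illustrative, not Dunning's own
code), the correction is a one-line change: add 1 to the count in the
numerator and |M| to the count in the denominator.

    def laplace_string_prob(text, train, k, alphabet_size):
        """Probability of `text` under the Laplace-corrected (k+1)-gram model."""
        hi = ngram_counts(train, k + 1)
        lo = ngram_counts(train, k)
        prob = 1.0
        for i in range(len(text) - k):
            event = tuple(text[i:i + k + 1])
            context = tuple(text[i:i + k])
            # (count + 1) / (count + |M|): strictly positive, even when the
            # (k+1)-gram or its context was never seen in training.
            prob *= (hi[event] + 1) / (lo[context] + alphabet_size)
        return prob

    # The string containing the unseen trigram "abz" now scores low, not zero:
    print(laplace_string_prob("abza", "abcabcabc", 2, alphabet_size=26))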
There is a proper derivation of this formula in Dunning's paper,
which you can read later if it appeals, though various approximations
are involved. For some applications, notably word-based n-grams, you
need a more refined approach, but for language identification it is
quite acceptable. Dunning
shows that using the Laplace formula to estimate
k+1-grams directly amounts to using a decision
rule which incorporates evidence from Markov models with orders
0 to k.
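Putting the pieces together, the decision rule is then just an argmax
over per-language log probabilities, with logs used to avoid underflow
on long strings. The two-language training set below is invented for
illustration; a real identifier trains on far more text.

    import math

    def log_prob(text, train, k, alphabet_size):
        """Sum of log Laplace-corrected transition probabilities for `text`."""
        hi = ngram_counts(train, k + 1)
        lo = ngram_counts(train, k)
        return sum(
            math.log((hi[tuple(text[i:i + k + 1])] + 1)
                     / (lo[tuple(text[i:i + k])] + alphabet_size))
            for i in range(len(text) - k))

    def identify(text, training_data, k=2, alphabet_size=26):
        """Pick the language whose smoothed model gives `text` the highest score."""
        return max(training_data,
                   key=lambda lang: log_prob(text, training_data[lang],
                                             k, alphabet_size))

    toy = {"en": "the cat sat on the mat", "fr": "le chat est sur le tapis"}
    print(identify("the hat", toy))   # -> 'en' on this toy data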
Everybody should read up to page 18 of Dunning's paper and look
at the graphs; masochists can read the maths too.
Chris Brew
8/7/1998