Strictly, the probability of observing a particular test string $S = s_1 s_2 \ldots s_n$ given a $k$-th order Markov model like $M_{\mathrm{Spanish}}$ or $M_{\mathrm{English}}$ is:

\[
P(S \mid M) = p(s_1 \ldots s_k) \prod_{i=k+1}^{n} p(s_i \mid s_{i-k} \ldots s_{i-1})
\]
but for practical purposes it is just as good to drop the leading term: variations in it are massively outweighed by the contribution of the terms in the product. You can rearrange the product by grouping together terms which involve the same $(k+1)$-gram (for example, pulling together all instances of ``th''), to get

\[
P(S \mid M) \approx \prod_{w} p(w)^{T(w)}
\]

For instance, in the test string ``the thin moth'' the bigram ``th'' occurs three times, so with $k=1$ the single factor $p(\mathrm{h} \mid \mathrm{t})$ appears cubed.
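The grouping step is just a matter of counting $(k+1)$-gram occurrences. A minimal sketch in Python (the function name `kplus1_grams` is illustrative, not from the text):

```python
from collections import Counter

def kplus1_grams(s, k):
    """Map each (k+1)-gram of s to the number of times it occurs."""
    return Counter(s[i:i + k + 1] for i in range(len(s) - k))

# With k = 1 (a bigram model), all instances of "th" group together:
counts = kplus1_grams("the thin moth", 1)
# counts["th"] is 3, so p(h | t) would be raised to the third power
```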
where $T(w)$ is the number of times the $(k+1)$-gram $w$ occurs in the test string. (NB: Dunning gets this formula wrong, using a product instead of an exponent; the next one is right.) As is usually the case when working with probabilities, taking logarithms helps to keep the numbers stable. This gives:

\[
\log P(S \mid M) \approx \sum_{w} T(w) \log p(w)
\]

We can compare these log probabilities for different languages, and choose the language model which is most likely to have generated the given string. If the language models sufficiently reflect the languages, comparing the models will lead us to the right conclusions about the languages.

The question remaining is how to get reliable estimates of the $p$s, and this is where statistical language modellers really spend their lives. Everything up to now is common ground shared, in one way or another, by almost all applications; what remains is task-specific and crucially important to the success of the enterprise.
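The whole comparison can be sketched end to end. This is an illustrative implementation under stated assumptions, not the text's prescribed one: character bigrams ($k=1$), and add-one smoothing over a fixed alphabet as a stand-in for the estimation problem the text defers:

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Bigram counts and context (first-character) counts from training text."""
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    contexts = Counter(text[i] for i in range(len(text) - 1))
    return bigrams, contexts

def log_likelihood(test, model, vocab_size):
    """Sum over distinct bigrams w of T(w) * log p(w), where p(w) is an
    add-one-smoothed conditional probability (our illustrative choice)."""
    bigrams, contexts = model
    t = Counter(test[i:i + 2] for i in range(len(test) - 1))
    return sum(
        count * math.log((bigrams[w] + 1) / (contexts[w[0]] + vocab_size))
        for w, count in t.items()
    )

# Toy demonstration with made-up "languages" over the alphabet {a, b, c, d},
# so vocab_size = 4; the model trained on "ab..." strings should win on "abab".
model_ab = train_bigram_model("ababababab")
model_cd = train_bigram_model("cdcdcdcdcd")
winner = max(["ab", "cd"],
             key=lambda name: log_likelihood(
                 "abab", {"ab": model_ab, "cd": model_cd}[name], 4))
```

With real data one would train each model on a corpus of the relevant language and pick the model whose log likelihood on the test string is largest; the smoothing scheme is exactly the kind of task-specific estimation choice the text is pointing at.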