
Case study: Language Identification

This section illustrates the issues in statistical language modelling in a very simple context. Language identification is relatively easy, yet demanding enough to serve as an illustration. The same principles apply to speech recognition and part-of-speech tagging, but those applications involve more machinery, which can be distracting. The following few pages are based on Dunning's paper on Statistical Language Identification, which is strongly recommended.

  
Figure 8.1: Language strings to identify
\begin{figure}
\begin{verbatim}
e preebas bioquimica
man immunodeficiency 
faits se sont produi
\end{verbatim}
\end{figure}

The examples in figure 8.1 (the first is Spanish, the second English, the third French) make it obvious that you don't need to understand a language in order to identify it. It is not immediately clear, however, how to do the identification automatically. Dunning's paper reviews various less effective alternatives.

Dunning asks the following questions:

No linguistically motivated heuristics are needed beyond the assumption that we have a probabilistic (low-order Markov) process generating characters.
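
To make the Markov assumption concrete, here is a minimal sketch of a character-level identifier in Python. It is not Dunning's implementation. The function names (`train`, `log_likelihood`, `identify`), the add-one smoothing, the fixed alphabet size of 256 and the tiny training sentences are all assumptions made for illustration. Each language gets a table of counts of which character follows which context, and a test string is assigned to the language whose model gives it the highest log-probability.

```python
import math
from collections import Counter

ALPHABET_SIZE = 256  # assumed number of distinct characters, used only for smoothing


def train(text, order=1):
    """Collect counts for an order-`order` character Markov model."""
    pair_counts = Counter()     # (context, next character) -> count
    context_counts = Counter()  # context -> count
    padded = " " * order + text  # pad so the first character has a context
    for i in range(order, len(padded)):
        context = padded[i - order:i]
        pair_counts[(context, padded[i])] += 1
        context_counts[context] += 1
    return pair_counts, context_counts


def log_likelihood(text, model, order=1):
    """Log-probability of `text` under `model`, with add-one smoothing."""
    pair_counts, context_counts = model
    padded = " " * order + text
    total = 0.0
    for i in range(order, len(padded)):
        context = padded[i - order:i]
        numerator = pair_counts[(context, padded[i])] + 1
        denominator = context_counts[context] + ALPHABET_SIZE
        total += math.log(numerator / denominator)
    return total


def identify(text, models, order=1):
    """Return the label of the model that scores `text` highest."""
    return max(models, key=lambda label: log_likelihood(text, models[label], order))


# Toy training text, invented for this sketch; a real system needs far more data.
samples = {
    "spanish": "las pruebas de laboratorio se hicieron en la ciudad",
    "english": "the laboratory tests were carried out in the city",
    "french": "les essais de laboratoire ont eu lieu dans la ville",
}
models = {label: train(text) for label, text in samples.items()}

for fragment in ["e preebas bioquimica", "man immunodeficiency", "faits se sont produi"]:
    print(repr(fragment), "->", identify(fragment, models))
```

With training text this small the decisions are not reliable, but the code shows that nothing beyond character counts is involved. Raising `order` gives each character a longer context, at the price of needing more training data to fill the larger table.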