
Questions:

1.
Exercise:

How many different single character sequences are there in English text?

Solution:

There are various sensible answers to this.
\begin{trivlist}
\item $26+26+10$\space (all different alphanumerics)
\item $26+\ldots$
 \ldots\ \texttt{cgram} gets you 76 distinct
 characters from a 13Mb BNC extract.\end{trivlist}
2.
Exercise:

How many different two character sequences are there in English text?

Solution:

Assume we said 76 for the previous question. There are the same number of choices for the second character, so there are $76 \times 76 = 5776$ possibilities. Or are there? Some sequences of characters may never occur. For example, ``sb'' is either rare or impossible in English (though not in Italian). In fact there are only 1833 distinct two-character sequences in my extract, about a third of the 5776 that a free choice would allow. What does this mean?
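The comparison between possible and observed sequences can be sketched in a few lines of Python. This is only an illustration: the helper name \texttt{distinct\_ngrams} and the toy sample string are assumptions made here, and the code is not the \texttt{cgram} tool mentioned above, nor is the sample drawn from the BNC.

```python
from collections import Counter


def distinct_ngrams(text, n):
    """Count how often each n-character sequence occurs in text.

    Hypothetical helper for illustration; overlapping sequences are counted.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Toy sample (an assumption); a real experiment would read a large corpus.
sample = "the cat sat on the mat"

unigrams = distinct_ngrams(sample, 1)
bigrams = distinct_ngrams(sample, 2)

# If the second character were chosen freely, every pair would be possible.
possible = len(unigrams) ** 2

print("distinct characters:", len(unigrams))
print("possible pairs:     ", possible)
print("observed pairs:     ", len(bigrams))
```

On any realistic text the observed count falls well short of the square of the alphabet size, which is the gap between 5776 and 1833 noted in the solution.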
3.
Exercise:

How many different syllables are there in English?

Solution:

A rough cut is to assume that English syllables are of the form
C?C?VC?C?
If we assume that there are roughly 10 distinct vowels and 20 distinct consonants, and that we have a free choice at every position, then we get an upper bound of about $20 \times 20 \times 10 \times 20 \times 20 = 1{,}600{,}000$ possible syllables. Typical syllabic writing systems have 50--200 distinct signs (Japanese, which has a particularly simple syllabary, with nearly all open syllables like ``ma'', ``ka'', ``no'', makes do with 48). Clearly the assumption of independence is unwarranted in this case.
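The arithmetic behind this bound is easy to check mechanically. The sketch below assumes the rough inventory sizes quoted in the solution (20 consonants, 10 vowels) and one factor per slot: two consonants, a vowel, two consonants. The second figure, which lets each consonant slot also be empty, is an extra variant added here for comparison and is not part of the original argument.

```python
import math

# Rough inventory sizes taken from the solution text (assumptions, not data).
N_CONSONANTS = 20
N_VOWELS = 10

# One factor per slot: consonant, consonant, vowel, consonant, consonant.
slots = [N_CONSONANTS, N_CONSONANTS, N_VOWELS, N_CONSONANTS, N_CONSONANTS]
upper_bound = math.prod(slots)

# Variant: each consonant slot may also be empty, giving 21 choices per slot.
with_optional = (N_CONSONANTS + 1) ** 4 * N_VOWELS

print("free choice in every slot:", upper_bound)
print("with optional consonants: ", with_optional)
```

Either way the count runs to millions, several orders of magnitude above the size of real syllabaries.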
4.
Exercise:

How many different words are there in English?

Solution:

There may not be a good way of answering this, but it is worth thinking about. One way is to repeat the sort of argument that I just used for syllables, assuming, say, that few words are longer than five syllables.
5.
Exercise:

This question is about ``identical'' twins. It isn't always possible to tell by inspection whether twins are monozygotic or dizygotic. But monozygotic twins are always of the same sex. Derive a formula for the proportion of twins which are monozygotic from sex-ratio data alone. (Borrowed from ``Bayesian Statistics'' by Peter M. Lee.)

Solution:

Each pair of twins is either monozygotic M, or dizygotic D, and either two girls GG, two boys BB, or a girl and a boy GB.

\begin{displaymath}
\begin{array}{ccc}
P(GG\vert M) = \frac{1}{2} & P(BB\vert M) = \frac{1}{2} & P(GB\vert M) = 0 \\
P(GG\vert D) = \frac{1}{4} & P(BB\vert D) = \frac{1}{4} & P(GB\vert D) = \frac{1}{2}
\end{array}
\end{displaymath}

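from which, using $P(D) = 1 - P(M)$, you can deduce that

\begin{displaymath}
P(GG) = P(GG\vert M)P(M) + P(GG\vert D)P(D) = \frac{1}{2}P(M) + \frac{1}{4}\left(1 - P(M)\right)
\end{displaymath}

and thence that

\begin{displaymath}
P(M) = 4P(GG)-1
\end{displaymath}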

It's worth pointing out that if you were unlucky with your sample (say, your provider of twins works at a single-sex boys' school) you would get a strange estimate of $P(GG)$, and this could make your estimate of $P(M)$ not only wrong but (being negative) nonsensical.
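A small numerical sketch makes the point concrete. The function name \texttt{estimate\_p\_monozygotic} and the counts fed to it are invented for illustration; the only thing taken from the derivation is the formula $P(M) = 4P(GG)-1$, with $P(GG)$ estimated as the observed fraction of girl-girl pairs.

```python
def estimate_p_monozygotic(n_gg, n_pairs):
    """Plug the observed girl-girl fraction into P(M) = 4 P(GG) - 1.

    Hypothetical helper; n_gg and n_pairs are counts of twin pairs.
    """
    p_gg = n_gg / n_pairs
    return 4 * p_gg - 1


# Made-up counts: 30 girl-girl pairs among 100 pairs of twins.
plausible = estimate_p_monozygotic(30, 100)

# A sample gathered at a boys' school contains no girl-girl pairs at all.
biased = estimate_p_monozygotic(0, 40)

print("mixed sample estimate:        ", plausible)
print("boys' school sample estimate: ", biased)
```

The second estimate comes out negative, so the formula is only as trustworthy as the sample behind the estimate of $P(GG)$.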

Chris Brew
8/7/1998