We have in fact already seen the definition of entropy, but to see this requires a slight change of point of view. Instead of the scenario with the djinn, imagine watching a sequence of symbols go past on a ticker tape. You have seen the symbols so far and you are waiting for s_i to arrive. You ask yourself the following question:
How much information will I gain when I see s_i?

Another way to express the same thing is:

How predictable is s_i from its context?

The way to answer this is to enumerate the possible next symbols, which we'll call w_1, w_2, ..., w_K. On the basis of the context we have estimates of the probabilities P(w_1), P(w_2), ..., P(w_K), where

    P(w_1) + P(w_2) + ... + P(w_K) = 1

Each such outcome w_j will gain -log2 P(w_j) bits of information. To answer our question we need the sum over all the outcomes, weighted by their probability:

    H = - Σ_j P(w_j) log2 P(w_j)

This is the formula which we used to choose questions for the decision tree. But now the scenario is more passive. Each time we see a symbol we are more or less surprised, depending on which symbol turns up. Large information gain goes with extreme surprise. If you can reliably predict the next symbol from context, you will not be surprised, and the information gain will be low. The entropy will be highest when you know least about the next symbol, and lowest when you know most.
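To make the weighted sum concrete, here is a small sketch in Python; the two distributions over next symbols are invented for illustration:

```python
import math

def entropy(probs):
    """Entropy in bits: H = -sum_j P(w_j) * log2 P(w_j).
    Terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A predictable context: one next symbol is almost certain.
predictable = [0.97, 0.01, 0.01, 0.01]
# An unpredictable context: all four symbols equally likely.
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(predictable))  # low: little surprise on average
print(entropy(uniform))      # 2.0 bits for 4 equally likely symbols
```

As expected, the uniform distribution, where we know least about the next symbol, gives the highest entropy, while the near-certain distribution gives an entropy close to zero.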
A good language model is one which provides reliable predictions, and it therefore tends to have low entropy. In the next section we develop the formal apparatus for using cross entropy to evaluate language models.