Cross entropy

In the previous section we developed the idea that entropy is a measure of the expected information gain from seeing the next symbol of a ticker tape. The formula for this quantity, the entropy, is:

\begin{displaymath}
H(w) = - \sum_w p(w) \log p(w)\end{displaymath}
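As a quick check on this formula, consider the simplest possible tape, one whose symbols are produced by a fair coin with $p(\mathrm{heads}) = p(\mathrm{tails}) = \frac{1}{2}$; taking logarithms to base 2 (an assumption carried through the examples below), the entropy is exactly one bit per symbol:

\begin{displaymath}
H(w) = - \left( \frac{1}{2} \log_2 \frac{1}{2} + \frac{1}{2} \log_2 \frac{1}{2} \right) = 1
\end{displaymath}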

Now we imagine that we are still watching a ticker tape whose behaviour is still controlled by $P(w)$, but that we have only imperfect knowledge $P_{M}(w)$ of the probabilities. That is, when we see w we assess our information gain as $-\log p_{M}(w)$, not as the correct $-\log p(w)$. Over time we will see symbols occurring with their true distribution, so our estimate of the information content of the signal will be:

\begin{displaymath}
H(w; P_{M}) = - \sum_w p(w) \log p_{M}(w)\end{displaymath}
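To see the definition in action, here is a short Python sketch (the three-symbol distributions p_true and p_model are invented for the example, and logarithms are taken to base 2) that computes both the entropy of a true distribution and its cross-entropy under an imperfect model:

\begin{verbatim}
import math

# Invented example: a true distribution P(w) over three symbols,
# and an imperfect model P_M(w) over the same symbols.
p_true = {"a": 0.5, "b": 0.25, "c": 0.25}
p_model = {"a": 0.6, "b": 0.3, "c": 0.1}

def entropy(p):
    """H(w) = - sum_w p(w) log2 p(w), in bits per symbol."""
    return -sum(p[w] * math.log2(p[w]) for w in p)

def cross_entropy(p, p_m):
    """H(w; P_M) = - sum_w p(w) log2 p_M(w): symbols arrive with their
    true probabilities p(w), but each one is scored as -log2 p_M(w)."""
    return -sum(p[w] * math.log2(p_m[w]) for w in p)

print("entropy       H(w)      =", entropy(p_true))                 # 1.5 bits
print("cross-entropy H(w; P_M) =", cross_entropy(p_true, p_model))  # about 1.63 bits
\end{verbatim}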

The quantity $H(w; P_{M})$ is called the cross-entropy of the signal with respect to the model $P_{M}$. It is a remarkable and important fact that the cross-entropy with respect to any incorrect probabilistic model is greater than the entropy with respect to the correct model.
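The standard argument for this fact is brief. Writing the logarithms as natural logarithms (changing the base only rescales everything by a positive constant), and using the inequality $\ln x \leq x - 1$, which holds with equality just at $x = 1$:

\begin{displaymath}
H(w; P_{M}) - H(w) = - \sum_w p(w) \ln \frac{p_{M}(w)}{p(w)}
\geq - \sum_w p(w) \left( \frac{p_{M}(w)}{p(w)} - 1 \right)
= \sum_w p(w) - \sum_w p_{M}(w) = 0
\end{displaymath}

with equality only when $p_{M}(w) = p(w)$ for every symbol w with $p(w) > 0$, that is, only when the model is correct.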

The reason that this fact is important is that it provides us with a justification for using cross-entropy as a tool for evaluating models. It lets you organize the search for a good model in the following way: start from some initial model, propose alterations to it, measure the cross-entropy of each altered model against the signal, and keep the alterations that lower it (a sketch of such a search appears below). If you are able to find a scheme which guarantees that the alterations to the model will improve the cross-entropy, then so much the better; but even if not every change is an improvement, the algorithm may still eventually yield good models.
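What follows is a minimal sketch of such a search in Python, under invented assumptions: the model is simply a probability distribution over three symbols, the true distribution is known so that the cross-entropy can be computed exactly rather than estimated from data, and each alteration moves a little probability mass from one randomly chosen symbol to another. Only the overall shape of the loop comes from the discussion above: propose a change, and keep it if it lowers the cross-entropy.

\begin{verbatim}
import math
import random

# Invented three-symbol example: the true distribution controlling
# the tape, and a deliberately poor uniform starting model.
p_true = {"a": 0.5, "b": 0.25, "c": 0.25}
model = {"a": 1.0 / 3, "b": 1.0 / 3, "c": 1.0 / 3}

def cross_entropy(p, p_m):
    """H(w; P_M) = - sum_w p(w) log2 p_M(w)."""
    return -sum(p[w] * math.log2(p_m[w]) for w in p)

def perturb(p_m, step=0.05):
    """Propose an altered model by moving a little probability
    mass from one randomly chosen symbol to another."""
    new = dict(p_m)
    src, dst = random.sample(list(new), 2)
    moved = min(step, new[src] - 1e-6)  # keep every probability positive
    new[src] -= moved
    new[dst] += moved
    return new

random.seed(0)
best = cross_entropy(p_true, model)
for _ in range(2000):
    candidate = perturb(model)
    score = cross_entropy(p_true, candidate)
    if score < best:  # keep only alterations that lower the cross-entropy
        model, best = candidate, score

print("final model:", model)
print("its cross-entropy:", best)
print("entropy of the true distribution:",
      -sum(p * math.log2(p) for p in p_true.values()))
\end{verbatim}

Because every accepted alteration lowers the cross-entropy, and the cross-entropy is bounded below by the entropy, a search of this kind cannot keep improving indefinitely; in a realistic setting the cross-entropy would of course be estimated from the observed signal rather than computed from a known $P(w)$.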



 
Chris Brew
8/7/1998