We have in fact already seen the definition of entropy, but to see this requires a slight change of point of view. Instead of the scenario with the djinn, imagine watching a sequence of symbols go past on a ticker tape. You have seen the symbols so far and you are waiting for s_i to arrive. You ask yourself the following question:
How much information will I gain when I see s_i?

Another way to express the same thing is:

How predictable is s_i from its context?

The way to answer this is to enumerate the possible next symbols, which we'll call w_1, w_2, ..., w_K. On the basis of the context we have estimates of the probabilities P(w_1), P(w_2), ..., P(w_K), where

    P(w_1) + P(w_2) + ... + P(w_K) = 1

Each such outcome w_j will gain -log2 P(w_j) bits of information. To answer our question we need the sum over all the outcomes, weighted by their probability:

    H = - Σ_j P(w_j) log2 P(w_j)

This is the formula which we used to choose questions for the decision tree. But now the scenario is more passive. Each time we see a symbol we are more or less surprised, depending on which symbol turns up. Large information gain goes with extreme surprise. If you can reliably predict the next symbol from context, you will not be surprised, and the information gain will be low. The entropy will be highest when you know least about the next symbol, and lowest when you know most.
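To make the weighted sum concrete, here is a small sketch in Python; the two distributions over next symbols are invented for illustration:

```python
import math

def entropy(probs):
    """Entropy in bits: H = -sum_j P(w_j) * log2 P(w_j).
    Terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A predictable context: one next symbol is almost certain.
predictable = [0.97, 0.01, 0.01, 0.01]
# An unpredictable context: all four symbols equally likely.
uniform = [0.25, 0.25, 0.25, 0.25]

print(entropy(predictable))  # low: little surprise on average
print(entropy(uniform))      # 2.0 bits for 4 equally likely symbols
```

As expected, the uniform distribution, where we know least about the next symbol, gives the highest entropy, while the near-certain distribution gives an entropy close to zero.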
A good language model is one which provides reliable predictions, and it therefore tends to have low entropy. In the next section we develop the formal apparatus for using cross entropy to evaluate language models.