
Data-intensive grocery selection

As with the medical example in the last chapter, it pays to start with a simple non-linguistic example. Imagine the problem of designing a structured questionnaire for classifying fruit. You start with a single question, and depending on the answer to that question you choose the next question. In AI this is usually called a decision tree. If there are 4 nutritious fruits and 4 poisonous ones, and you have hungry people on your hands, you need to set out your questionnaire so that it reliably separates the nutritious fruit from the poisonous ones, and, if you are that way inclined, you will want a design which does this with as few questions as possible.

There are lots of ways of measuring the complexity of such a design, but one of the best is information. We're going to define the information change caused by the answer to a question. The basic idea is that we focus on the probability of an outcome before and after the question is answered.

Now imagine a situation where a slightly capricious djinn is holding one of the pieces of fruit behind its back. It won't show you the fruit, but it will answer questions about the colour, size and skin type. Your task is to ask sensible questions. There is a risk that if you ask too many questions the djinn will just fly off with the fruit. So it pays to ask the right questions. You can do this by getting the decision tree right.

  
Figure 9.1: The training fruit. (The original table lists eight fruits, giving each one's Colour, Size and Skin type, together with whether it is safe to eat; four are positive examples and four are negative.)

Suppose we know that the fruit is chosen from eight possibilities as shown in Figure 9.1. Before we start, we can measure the amount of information which we would gain from magically knowing the correct answer. We'll call the safe fruit the positive examples and the unsafe fruit the negative examples.

In general, if the possible answers have probabilities $P(v_i)$, then the information content of the actual answer is

\begin{displaymath}
I(P(v_1) \ldots P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_{2}P(v_i)\end{displaymath}

Using $\log_{2}$ is standard in information theory: we think of messages as composed of sequences of binary digits, or bits.
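The formula is easy to check in code. The following is a minimal Python sketch; the function name information is ours, chosen just for this illustration:

\begin{verbatim}
import math

def information(probs):
    # Information content, in bits, of an answer whose possible
    # values have probabilities P(v_1) ... P(v_n).
    return sum(-p * math.log2(p) for p in probs if p > 0)
\end{verbatim}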
Example: To see that this formula is sensible, consider the case of a fair coin:

\begin{displaymath}
I\left(\frac{1}{2},\frac{1}{2}\right) = - \frac{1}{2}\log_{2}\frac{1}{2} -
\frac{1}{2}\log_{2}\frac{1}{2} = 1 \mbox{bit}\end{displaymath}

Now consider the case of a very biased coin, with the probability of heads being 0.98. What is the information content of a correct answer?

\begin{displaymath}
I\left(0.98,0.02\right) = - 0.98\log_{2}0.98 -
0.02\log_{2}0.02 = 0.14 \mbox{bits}\end{displaymath}

which agrees with the feeling that the outcome of tossing the biased coin is much less uncertain.
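Both coin calculations can be reproduced in a few lines of Python (a quick check, not part of the original text):

\begin{verbatim}
import math

def information(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))    # fair coin: 1.0 bit
print(information([0.98, 0.02]))  # biased coin: roughly 0.14 bits
\end{verbatim}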
The generalized version for a decision tree problem where there are p positive examples and n negative examples is:

\begin{displaymath}
I\left(\frac{p}{p+n},\frac{n}{p+n}\right) = - \frac{p}{p+n}\log_{2} \frac{p}{p+n} -
 \frac{n}{p+n}\log_{2} \frac{n}{p+n}\end{displaymath}

In the case of our example, p is 4 and n is 4, so the information in a correct answer is going to be:

\begin{displaymath}
I\left(\frac{4}{4+4},\frac{4}{4+4}\right) = - \frac{4}{4+4}\log_{2} \frac{4}{4+4} -
 \frac{4}{4+4}\log_{2} \frac{4}{4+4} = 1 \mbox{ bit}\end{displaymath}
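In code, the two-class version is just the same calculation applied to the proportions p/(p+n) and n/(p+n). A small sketch (the helper name is ours):

\begin{verbatim}
import math

def information_from_counts(p, n):
    # Information, in bits, needed to classify an example drawn from
    # a set containing p positive and n negative examples.
    total = p + n
    return sum(-c / total * math.log2(c / total)
               for c in (p, n) if c > 0)

print(information_from_counts(4, 4))  # the eight fruits: 1.0 bit
\end{verbatim}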

Now we're going to test an attribute. We might choose the ternary Colour test or either of the binary tests (Size or Skin). There is no test which will immediately give the whole answer, so none completely removes our uncertainty, but we can measure how close each one gets.

The way we do this is to consider the answers $a_1 \ldots a_v$ to the attribute test $A$. If we pick Colour, then $a_1 = \mbox{Red}$, $a_2 = \mbox{Green}$, $a_3 = \mbox{Brown}$. The answers split the training set into subsets: the subset $E_i$ contains $p_i$ positive examples and $n_i$ negative examples. For Colour, the Red subset contains 2 positive examples and 1 negative, the Green subset 1 positive and 1 negative, and the Brown subset 1 positive and 2 negatives.

The information that will be needed to finish the task depends on which answer we get. Suppose for the moment that we get the $i$th answer. In that case, by the previous reasoning, the information we still need will be:

\begin{displaymath}
I\left(\frac{p_{i}}{p_{i}+n_{i}},\frac{n_{i}}{p_{i}+n_{i}}\right) = - \frac{p_{i}}{p_{i}+n_{i}}\log_{2} \frac{p_{i}}{p_{i}+n_{i}} -
 \frac{n_{i}}{p_{i}+n_{i}}\log_{2} \frac{n_{i}}{p_{i}+n_{i}}\end{displaymath}

In the case of getting Brown as an answer to the attribute Colour, this would be:

\begin{displaymath}
I\left(\frac{1}{1+2},\frac{2}{1+2}\right) = 
- \frac{1}{1+2}\log_{2} \frac{1}{1+2} -
 \frac{2}{1+2}\log_{2} \frac{2}{1+2}\end{displaymath}

giving 0.92 bits as the information still needed (call this Remainder(Colour|Brown)). But since we don't know which answer is coming, we need to calculate an equivalent figure for each of the possible answers. This gives Remainder(Colour|Red) = 0.92 and Remainder(Colour|Green) = 1. To get the expected amount of information we will still need after the Colour question has been asked and answered, we take the sum of the three remainders, weighted by the probabilities of getting the answers. So we have

\begin{displaymath}
Remainder(Colour) = \frac{p_{red}+n_{red}}{p+n} I\left(\frac{2}{3},\frac{1}{3}\right) +
\frac{p_{green}+n_{green}}{p+n} I\left(\frac{1}{2},\frac{1}{2}\right) +
\frac{p_{brown}+n_{brown}}{p+n} I\left(\frac{1}{3},\frac{2}{3}\right)\end{displaymath}

or in numbers:

\begin{displaymath}
Remainder(Colour) = \frac{2+1}{8}\times 0.92 + \frac{1+1}{8} \times
1.0 + \frac{1+2}{8} \times 0.92\end{displaymath}

which is 0.94. We started with an information need of 1.0, so the information gain is

1.0 - 0.94 = 0.06 bits
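The same calculation as a Python sketch, using the (positive, negative) counts that each Colour answer picks out; the helper and variable names are ours:

\begin{verbatim}
import math

def information_from_counts(p, n):
    total = p + n
    return sum(-c / total * math.log2(c / total)
               for c in (p, n) if c > 0)

# (positive, negative) counts in each Colour subset.
colour_subsets = {'Red': (2, 1), 'Green': (1, 1), 'Brown': (1, 2)}

total = sum(p + n for p, n in colour_subsets.values())   # 8 fruits in all
rem = sum((p + n) / total * information_from_counts(p, n)
          for p, n in colour_subsets.values())

print(rem)        # about 0.94 bits still needed after asking about Colour
print(1.0 - rem)  # an information gain of about 0.06 bits
\end{verbatim}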

In general the expression for remaining information after choosing attribute A is

\begin{displaymath}
Remainder(A) = \sum_{i=1}^v
\frac{p_i+n_i}{p+n}I\left(\frac{p_i}{p_i+n_i},\frac{n_i}{p_i+n_i}\right)\end{displaymath}
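This general formula translates directly into code. In the sketch below an attribute is represented simply by the list of (p_i, n_i) counts that its answers induce; the function names are again ours:

\begin{verbatim}
import math

def information_from_counts(p, n):
    total = p + n
    return sum(-c / total * math.log2(c / total)
               for c in (p, n) if c > 0)

def remainder(subset_counts):
    # subset_counts holds one (p_i, n_i) pair per possible answer a_i.
    total = sum(p + n for p, n in subset_counts)
    return sum((p + n) / total * information_from_counts(p, n)
               for p, n in subset_counts)

print(remainder([(2, 1), (1, 1), (1, 2)]))  # Colour: about 0.94 bits
\end{verbatim}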

Plugging in the numbers, you get

\begin{displaymath}
Remainder(Size) = \frac{3}{8}I\left(\frac{1}{3},\frac{2}{3}\right) + \frac{5}{8}I\left(\frac{3}{5},\frac{2}{5}\right)\end{displaymath}

which is $\frac{3}{8}\times 0.92 + \frac{5}{8}\times 0.97 \approx 0.95$, so the information gain is about 0.05 bits. Similarly,

\begin{displaymath}
Remainder(Skin) = \frac{3}{8}I\left(\frac{1}{3},\frac{2}{3}\right) + \frac{5}{8}I\left(\frac{3}{5},\frac{2}{5}\right)\end{displaymath}

which is the same expression as the last one, again giving an information gain of about 0.05 bits.

Because it has the highest information gain, Colour is indeed the best attribute to pick, so you make it the root of the decision tree. You then do the same thing to the subsets at each branch of the Colour attribute, and in this way you build up the whole decision tree.
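As a final check, here is a sketch that recomputes all three remainders from the subset counts used above and picks the attribute with the highest gain. For Size and Skin only the counts are recoverable from the formulas, not which answer goes with which counts, so those lists are left unlabelled:

\begin{verbatim}
import math

def information_from_counts(p, n):
    total = p + n
    return sum(-c / total * math.log2(c / total)
               for c in (p, n) if c > 0)

def remainder(subset_counts):
    total = sum(p + n for p, n in subset_counts)
    return sum((p + n) / total * information_from_counts(p, n)
               for p, n in subset_counts)

# (positive, negative) counts for the subsets each attribute creates.
attributes = {
    'Colour': [(2, 1), (1, 1), (1, 2)],
    'Size':   [(1, 2), (3, 2)],
    'Skin':   [(1, 2), (3, 2)],
}

gains = {name: 1.0 - remainder(counts)
         for name, counts in attributes.items()}

print(gains)                      # Colour about 0.06, Size and Skin about 0.05
print(max(gains, key=gains.get))  # Colour: the root of the decision tree
\end{verbatim}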


Chris Brew
8/7/1998