As with the medical example in the last chapter, it pays to start with a simple non-linguistic example. Imagine the problem of designing a structured questionnaire for classifying fruit. You start with a single question, and depending on the answer to that question you choose the next question. In AI this is usually called a decision tree. If there are 4 nutritious fruits and 4 poisonous ones, and you have hungry people on your hands, you need to set out your questionnaire so that it separates the nutritious fruits from the poisonous ones, ideally using as few questions as possible.
Now imagine a situation where a slightly capricious djinn is holding one of the pieces of fruit behind its back. It won't show you the fruit, but it will answer questions about the colour, size and skin type. Your task is to ask sensible questions. There is a risk that if you ask too many questions the djinn will just fly off with the fruit. So it pays to ask the right questions. You can do this by getting the decision tree right.
In general, if the possible answers $v_i$ have probabilities $P(v_i)$, then the information content of the actual answer is

$$I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i)\log_2 P(v_i)$$

Example: To see that this formula is sensible, consider the case of a fair coin:

$$I\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$

Now consider the case of a very biased coin, with the probability of heads being 0.98. What is the information content of a correct answer?

$$I(0.98, 0.02) = -0.98\log_2 0.98 - 0.02\log_2 0.02 \approx 0.14 \text{ bits}$$

which agrees with the feeling that the outcome of tossing the biased coin is much less uncertain. The generalized version for a decision tree problem where there are $p$ positive examples and $n$ negative examples is:

$$I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
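As a quick sanity check on these formulas, here is a minimal Python sketch (not part of the original notes; the function names are my own) that computes the information content of an answer and reproduces the coin figures above.

```python
import math

def information(probabilities):
    """Information content (in bits) of an answer whose possible
    values occur with the given probabilities: sum of -P(v) * log2 P(v)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def information_pn(p, n):
    """Information needed to classify a set with p positive and
    n negative examples: I(p/(p+n), n/(p+n))."""
    total = p + n
    return information([p / total, n / total])

print(information([0.5, 0.5]))    # fair coin: 1.0 bit
print(information([0.98, 0.02]))  # very biased coin: about 0.14 bits
print(information_pn(4, 4))       # 4 nutritious vs 4 poisonous fruits: 1.0 bit
```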
Now we're going to test an attribute. We might choose the ternary Colour test or either of the binary tests. There aren't any tests which will immediately give the whole answer, so none completely removes our uncertainty, but we can measure how close each one gets.
The way we do this is to consider the answers to the attribute test A. If we pick Colour, then each possible answer picks out part of the training set: the answers split the training set into subsets, one subset per answer. In general the subset $E_i$ contains $p_i$ positive examples and $n_i$ negative examples; for Colour you can read these counts straight off the training set.
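To make the splitting step concrete, here is a continuation of the Python sketch above. The eight fruits below are made up for illustration (the original training set is not reproduced in these notes), so any numbers derived from them will not match the figures quoted later.

```python
# Hypothetical stand-ins for the eight fruits: 4 nutritious, 4 poisonous.
examples = [
    {"colour": "green",  "size": "large", "skin": "smooth", "nutritious": True},
    {"colour": "green",  "size": "small", "skin": "rough",  "nutritious": True},
    {"colour": "red",    "size": "large", "skin": "smooth", "nutritious": False},
    {"colour": "red",    "size": "small", "skin": "rough",  "nutritious": False},
    {"colour": "yellow", "size": "large", "skin": "smooth", "nutritious": True},
    {"colour": "yellow", "size": "large", "skin": "rough",  "nutritious": True},
    {"colour": "yellow", "size": "small", "skin": "smooth", "nutritious": False},
    {"colour": "yellow", "size": "small", "skin": "rough",  "nutritious": False},
]

def split_by(attribute, examples):
    """Split the training set into the subsets E_i, one per answer to the attribute test."""
    subsets = {}
    for ex in examples:
        subsets.setdefault(ex[attribute], []).append(ex)
    return subsets

def counts(subset):
    """Return (p_i, n_i): the numbers of positive and negative examples in a subset."""
    p_i = sum(ex["nutritious"] for ex in subset)
    return p_i, len(subset) - p_i

for value, subset in split_by("colour", examples).items():
    print("colour =", value, "->", counts(subset))
```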
The information that will be needed to finish the task depends on which answer we get. Suppose for the moment that we get the $i$th answer. In that case, by the previous reasoning, the information we still need will be:

$$I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$$

In general the expression for the remaining information after choosing attribute A is

$$\mathrm{Remainder}(A) = \sum_{i} \frac{p_i+n_i}{p+n}\, I\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)$$

and the information gain of the attribute is the information needed before the test minus this remainder:

$$\mathrm{Gain}(A) = I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{Remainder}(A)$$

Plugging in the numbers, you get a gain of 1.0 - 0.93 = 0.07 bits.
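Continuing the same illustrative sketch (it reuses information_pn, split_by, counts and the made-up examples from above), the remainder and gain formulas translate directly into code. The gains printed here come from the invented fruit data, not from the training set the notes refer to, so the values differ from the ones above.

```python
def remainder(attribute, examples):
    """Expected information still needed after testing the attribute:
    sum over answers i of (p_i + n_i)/(p + n) * I(p_i/(p_i+n_i), n_i/(p_i+n_i))."""
    total = len(examples)
    rem = 0.0
    for subset in split_by(attribute, examples).values():
        p_i, n_i = counts(subset)
        rem += (p_i + n_i) / total * information_pn(p_i, n_i)
    return rem

def gain(attribute, examples):
    """Information gain: uncertainty before the test minus the remainder."""
    p, n = counts(examples)
    return information_pn(p, n) - remainder(attribute, examples)

for attribute in ("colour", "size", "skin"):
    print(attribute, round(gain(attribute, examples), 3))
# With the invented data, colour has the highest gain, as in the notes.
```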
Because it has the higher information gain, Colour is indeed the best attribute to pick, so you make it the root of the decision tree. You can then do the same thing to the subsets at each branch of the Colour attribute, and you get the decision tree.
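The "do the same thing to the subsets" step is a straightforward recursion. The sketch below (again building on the functions and made-up data above, so the exact tree is only illustrative) picks the highest-gain attribute, splits on it, and recurses on each subset until a subset is all nutritious or all poisonous.

```python
def build_tree(examples, attributes):
    """Grow a decision tree: choose the best attribute by information gain,
    split the examples on it, and recurse on each resulting subset."""
    p, n = counts(examples)
    if n == 0:
        return "nutritious"   # all remaining examples are positive
    if p == 0:
        return "poisonous"    # all remaining examples are negative
    if not attributes:
        return "nutritious" if p >= n else "poisonous"   # fall back to the majority
    best = max(attributes, key=lambda a: gain(a, examples))
    rest = [a for a in attributes if a != best]
    return {best: {value: build_tree(subset, rest)
                   for value, subset in split_by(best, examples).items()}}

print(build_tree(examples, ["colour", "size", "skin"]))
# With the invented data this prints a tree rooted at colour, with the
# yellow branch split further on size; the real tree depends on the
# actual training set from the notes.
```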