You are stranded on a desert island, with only a pile of records and
a book of poetry, and so have no way to determine which of the many
types of fruit available are safe to eat. They are of various colours
and sizes, some have hairy skins and others are smooth, some have hard
flesh and others are soft. After a great deal of stomach ache, you compile
the table of data shown
in table 2. Looking this up each time
you feel hungry is pretty tedious; is there some kind of pattern to it?

This can be viewed as a set of examples, where each line states the values of certain attributes and a conclusion. On the other hand, it can be viewed as a set of rules: the first line is

if skin = hairy and colour = brown and size = large and flesh = hard then conclusion = safe

Such rules are not constrained to have a fixed number of preconditions, of course. Suppose that there was a rule which had no precondition about the size of the fruit. Then it could be replaced by two rules (there are two possible values for `size`): one in which size = large and one in which size = small, each with the same conclusion.
It is possible to devise a *decision tree* to replace this set of
rules, automatically. The decision tree is of the general form

if attribute1 = value1 then <subtree 1>
else if attribute1 = value2 then <subtree 2>
...........
else if attribute1 = valueN then <subtree N>

This corresponds directly to a set of rules, with as many rules as there are leaf nodes in the whole tree. Each rule is a tracing out of the path from the top of the tree to a leaf node. The process is not limited to preconditions of the form `attribute = value`.

The key question is which of the attributes is the most useful determiner
of the conclusion of the rules. Whichever it is, that attribute ought to
be the first or topmost one in the decision tree. Information theory provides
one answer, as well as providing a meaning for the notion of `most useful
determiner'. To formalise it, suppose that the conclusion C can have any of
the values c1, ..., cm (in the example above, m = 2, and the conclusions
are `safe' or `dangerous'). Suppose that a certain attribute A can take
values a1, ..., an. The quantity P(ci | aj) means `the
probability that C = ci, given A = aj' - that is, among the cases for
which A = aj is true, this is the probability that C = ci is also true.
For example, in table 2:

P(safe | hairy) = 6/8

because there are eight lines with `Skin=hairy` and among those,
six of them say `Conclusion=safe`.
Then the expression

- Σi P(ci | aj) log2 P(ci | aj)

is deemed to be a measure of the `amount of information' that the value aj has to offer about the conclusion C. Information theory defines this expression to be the *entropy* of C given A = aj.
This definition of entropy is not arbitrary. Suppose there is a set of possible events, but all that is known about them is their probabilities p1, ..., pn. Any measure H(p1, ..., pn) of the uncertainty about which event will occur should be some function of the probabilities, and should satisfy the following very reasonable conditions:

- It should be a continuous function of each argument.
- If the probabilities are all equal (each pi = 1/n) then the entropy should increase as n does. This is because, with equally likely events, when there are more events there is more uncertainty about the outcome.
- If the `choice' between events is split into a set of successive
`choices' then the entropy of the original set should be the same as
the weighted sum of the entropies at each step of the successive
`choices'. That is, the entropy should not depend on the `choice'
method. For example, this means that if the events are concerned
with throwing a die, and are specifically ``three or less'', ``four
or five'' and ``six'', then this could be cast as a two step
process: ``is it three or less, or not?'', then ``if high, is it six
or not?''. So, in mathematical terms, this independence means that

H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)
To return to the original problem now, the
lower the entropy of A with respect to C is, the more information A has to offer about C.
Using this expression,
the entropy of the attribute A with respect to C is

H(C | A) = Σj P(aj) · ( - Σi P(ci | aj) log2 P(ci | aj) )

and the attribute which has the lowest entropy is the most useful determiner.
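The formula above is straightforward to compute from a list of examples. A minimal sketch (the four records here are invented placeholders, not rows from table 2):

```python
import math
from collections import Counter, defaultdict

def attribute_entropy(records, attr, concl="conclusion"):
    """Entropy of the conclusion with respect to attribute `attr`:
    the sum over values aj of P(aj) * H(C | aj), in bits."""
    by_value = defaultdict(list)
    for r in records:
        by_value[r[attr]].append(r[concl])
    n = len(records)
    total = 0.0
    for conclusions in by_value.values():
        counts = Counter(conclusions)
        h = -sum((c / len(conclusions)) * math.log2(c / len(conclusions))
                 for c in counts.values())
        total += (len(conclusions) / n) * h   # weight by P(aj)
    return total

# Invented placeholder records:
records = [
    {"size": "large", "conclusion": "safe"},
    {"size": "large", "conclusion": "safe"},
    {"size": "small", "conclusion": "dangerous"},
    {"size": "small", "conclusion": "safe"},
]
print(attribute_entropy(records, "size"))  # 0.5
```

Here the `large` group is pure (entropy 0) and the `small` group is an even split (entropy 1), each with weight 1/2, giving 0.5 overall.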

Consider the attribute *Size* in the above example. From the table you can read off the probabilities P(safe | large), P(dangerous | large), P(safe | small) and P(dangerous | small), and from these compute the entropy for *Size*. However, since it is only necessary to compare entropies, it makes no difference if you use natural logarithms or logarithms to base 10 rather than base 2. If you do all the calculations you will find that *Colour* has the lowest entropy, and the resulting tree corresponds to this set of rules:
if colour = brown then conclusion = safe
if colour = green and size = large then conclusion = safe
if colour = green and size = small then conclusion = dangerous
if colour = red and skin = hairy then conclusion = safe
if colour = red and skin = smooth and size = large then conclusion = dangerous
if colour = red and skin = smooth and size = small then conclusion = safe

This set of six rules, each smaller than any of the originals, is an improvement on having those sixteen. They are all smaller because, as you might have guessed, the attribute *Colour* is the most useful determiner, and so sits at the top of the tree.

To summarise the ID3 algorithm:

1. For each attribute, compute its entropy with respect to the conclusion.
2. Select the attribute (say A) with lowest entropy.
3. Divide the data into separate sets so that within a set, A has a fixed value (eg Colour=green in one set, Colour=brown in another, etc).
4. Build a tree with branches:
   if A=a1 then ... (subtree1)
   if A=a2 then ... (subtree2)
   ...etc...
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute gets removed from consideration. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the conclusion (eg they all say Conclusion=safe).
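The steps above can be sketched compactly in Python. The miniature data set is invented for illustration (it is not table 2), and ties between entropies are ignored for simplicity:

```python
import math
from collections import Counter, defaultdict

def entropy(records, attr, concl):
    """Step 1: entropy of the conclusion with respect to `attr`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r[concl])
    n = len(records)
    return sum(
        (len(cs) / n) * -sum((k / len(cs)) * math.log2(k / len(cs))
                             for k in Counter(cs).values())
        for cs in groups.values())

def id3(records, attrs, concl="conclusion"):
    """Return a tree: either a conclusion value (leaf) or a pair
    (attribute, {value: subtree})."""
    conclusions = {r[concl] for r in records}
    if len(conclusions) == 1:          # step 6: all data agree
        return conclusions.pop()
    if not attrs:                      # step 6: no attributes left
        return Counter(r[concl] for r in records).most_common(1)[0][0]
    best = min(attrs, key=lambda a: entropy(records, a, concl))  # step 2
    branches = {}
    for value in {r[best] for r in records}:                     # step 3
        subset = [r for r in records if r[best] == value]
        branches[value] = id3(subset, attrs - {best}, concl)     # steps 4-5
    return (best, branches)

# Invented miniature data set:
data = [
    {"colour": "green", "size": "small", "conclusion": "dangerous"},
    {"colour": "green", "size": "large", "conclusion": "safe"},
    {"colour": "brown", "size": "small", "conclusion": "safe"},
    {"colour": "brown", "size": "large", "conclusion": "safe"},
    {"colour": "red",   "size": "small", "conclusion": "dangerous"},
]
tree = id3(data, {"colour", "size"})
print(tree)
```

On this data *colour* has the lower entropy, so it becomes the root; only the green branch needs a further test on *size*.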

The six generated rules can be even more succinctly put, however - only small green ones or large smooth red ones are dangerous. Of course, that word `only' conceals the fact that there are only two possible conclusions. Often such a further apparent compression of the rule set will not be easy to find, if it exists at all. One easy method is to try omitting each rule condition in turn, testing to see whether this results in any misclassification of data.
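The omission test just described can be sketched as follows. The records and rule here are invented placeholders; a rule is a pair of a condition dictionary and a conclusion:

```python
def matches(conditions, record):
    """True if the record satisfies every condition of the rule."""
    return all(record.get(k) == v for k, v in conditions.items())

def redundant_conditions(rule, records):
    """Conditions that can be dropped from `rule`, one at a time, without
    the loosened rule matching any record whose conclusion disagrees."""
    conditions, conclusion = rule
    dropable = []
    for key in conditions:
        loosened = {k: v for k, v in conditions.items() if k != key}
        if all(r["conclusion"] == conclusion
               for r in records if matches(loosened, r)):
            dropable.append(key)
    return dropable

# Invented placeholder records and rule:
records = [
    {"colour": "brown", "size": "large", "conclusion": "safe"},
    {"colour": "brown", "size": "small", "conclusion": "safe"},
    {"colour": "green", "size": "small", "conclusion": "dangerous"},
]
rule = ({"colour": "brown", "size": "large"}, "safe")
print(redundant_conditions(rule, records))
```

Note that each condition is tested for omission on its own; two conditions that are each individually redundant cannot necessarily both be dropped at once, so the check must be re-run after every actual deletion.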

The rule induction algorithm was first used by Hunt in his CLS system in 1962! Then, with extensions for handling numeric data too, it was used by Ross Quinlan for his ID3 system (in 1979). Since then it has been widely copied, and is the basis of a variety of commercial rule induction packages. ID3 and later packages did not apply the algorithm to the whole data set, since that might be wasteful; instead they would generate a set of rules from a small sampling of the data and then look for other data items which were misclassified by the current rules. A number of such anomalous data items would be added to the initial set and the rule induction algorithm re-run to generate a new set of rules. It has been reported that this iterative approach may not save much effort compared to running the algorithm on the whole set right at the start, unless the whole set is huge.

If you are considering using the algorithm then there are, as ever, points to bear in mind. First, if the table of data about edibility had included a record of which day of the week the sample had been eaten on, the algorithm would have blindly considered this as a factor. So, wholly spurious correlations are possible, since the algorithm takes no account of any meaning that the data it works on may have. Second, the algorithm considers just one attribute at a time. If two attributes turn out to have almost identical entropies it will still select the lower entropy, whereas it might have been better to consider them jointly (as though they made up a single attribute). Consider the following data:

| Outcome | X | Y |
|---------|---|---|
| yes     | 3 | 3 |
| no      | 2 | 1 |
| yes     | 4 | 4 |
| no      | 2 | 4 |
| no      | 1 | 3 |
| yes     | 1 | 1 |
| yes     | 2 | 2 |
| no      | 2 | 3 |

The algorithm cannot discover the underlying rule

if X = Y then outcome = yes else outcome = no

because it only ever considers one attribute at a time. Third, when inducing rules from large sets of examples in which there are a large number of possible outcomes (that is, when m, the number of conclusion values, is large), then the algorithm can be very sensitive to apparently trivial changes in the set of examples. Quinlan's ID3 tried to cut down on effort by inducing a set of rules from a small subset of data, and then testing to see if those rules explained other data. Data not explained were then added to the chosen subset, and new rules induced. This process continued until all the data was accounted for. The letters ID stood for `iterative dichotomiser', a fancy name for this simple algorithm. Fourth, the algorithm cannot generate uncertain rules or handle uncertain data.
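The one-attribute-at-a-time weakness can be seen by computing the entropies directly on the table above. Neither X nor Y alone is a good determiner, but the derived joint attribute "X equals Y" predicts the outcome perfectly (entropy zero) - a combination the algorithm never tries:

```python
import math
from collections import Counter, defaultdict

def attribute_entropy(rows, key):
    """Entropy of the outcome with respect to the attribute key(x, y)."""
    groups = defaultdict(list)
    for outcome, x, y in rows:
        groups[key(x, y)].append(outcome)
    n = len(rows)
    return sum(
        (len(os) / n) * -sum((c / len(os)) * math.log2(c / len(os))
                             for c in Counter(os).values())
        for os in groups.values())

# The (Outcome, X, Y) table from the text:
rows = [("yes", 3, 3), ("no", 2, 1), ("yes", 4, 4), ("no", 2, 4),
        ("no", 1, 3), ("yes", 1, 1), ("yes", 2, 2), ("no", 2, 3)]

print(attribute_entropy(rows, lambda x, y: x))       # X alone: well above 0
print(attribute_entropy(rows, lambda x, y: y))       # Y alone: well above 0
print(attribute_entropy(rows, lambda x, y: x == y))  # X == Y jointly: 0.0
```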

Although commercial rule induction systems have been used very successfully
to condense sets of examples into rules, they must be used with care.
You must be sure that the attributes you choose to include in describing
the examples (the *Colour*, *Skin* and so on in the example above)
include the `right' ones. If the algorithm is given a complete set
of examples then it can usually do a worthwhile job; if it is given only
a subset then it is up to you to make sure that the subset is somehow
representative of the whole. If a brown-skinned dangerous fruit were to
be added to the data table above, the generated rules would change
radically. For these reasons, rule induction systems are usually
used to suggest correlations that can be confirmed by logical analysis
afterward, rather than being trusted to get things right by themselves.

It is also important to remember that the generated rules are for solving the arbitrary classification task. Suppose you provided a set of attributes of motor cars, and used the algorithm to generate rules to classify an unspecified car. The rules might well have, as a first question, `How many wheels does it have?', then `how many cylinders?' and so on, ending up with, for instance, `does it have a statue of the Winged Victory on the bonnet? If yes, it is a Rolls Royce'. However, this last question is sufficient all by itself to determine whether a car is a Rolls Royce or not; if that is the only point at issue, all the earlier questions are unnecessary.

This version: 2000-10-30