You are stranded on a desert island, with only a pile of records and
a book of poetry, and so have no way to determine which of the many
types of fruit available are safe to eat. They are of various colours
and sizes, some have hairy skins and others are smooth, some have hard
flesh and others are soft. After a great deal of stomach ache, you compile
the table of data shown
in table 2. This is pretty tedious to look up each time
you feel hungry; is there some kind of pattern to it?
if skin = hairy and colour = brown and size = large and flesh = hard then conclusion = safeSuch rules are not constrained to have a fixed number of preconditions, of course. Suppose that there was a rule which had no precondition about the size of the fruit. Then it could be replaced by two rules (there are two possible values for
size), that is, one
in which size = large and one in which size = small were
extra preconditions, or it could be replaced by a rule with the single
equivalent extra precondition size = Anything or the precondition
(size = large or size = small).
It is possible to devise a decision tree to replace this set of rules, automatically. The decision tree is of the general form
if attribute1 = value1 then <subtree 1> else if attribute1 = value2 then <subtree 2> else if ... ........... else if attribute1 = valueN then <subtree N>This corresponds directly to a set of rules, with as many rules as there are leaf nodes in the whole tree. Each rule is a tracing out of the path from the top of the tree to a leaf node. The process is not limited to preconditions of the form This = That, either, although for the purpose of explaining the idea only this kind of precondition will be considered.
The key question is which of the attributes is the most useful determiner
of the conclusion of the rules. Whichever it is, that attribute ought to
be the first or topmost one in the decision tree. Information theory provides
one answer, as well as providing a meaning for the notion of `most useful
determiner'. To formalise it, suppose that the conclusion
can have any of
the values
(in the example above,
and the conclusions
are `safe' or `dangerous'). Suppose that a certain attribute
can take
values
. The quantity
means `the
probability that
, given
- that is, among the cases for
which
is true, this is the probability that
is also true.
For example, in figure 2:
Skin=hairy and among those,
six of them say Conclusion=safe.
Then the expression
This definition of entropy
is not arbitrary. Suppose there is a set of possible events,
but all that is known about them is their probabilities
.
Any measure of the uncertainty about which event will occur should be some
function
of the probabilities, and should satisfy
the following very reasonable conditions:
To return to the original problem now, the
lower the entropy of
with respect to
is, the more information that
has to offer about
.
Using this expression,
the entropy of the attribute A with respect to C is
Consider the attribute Size in the above example. From the table:
if colour = brown then conclusion = safe if colour = green and size = large then conclusion = safe if colour = green and size = small then conclusion = dangerous if colour = red and skin = hairy then conclusion = safe if colour = red and skin = smooth and size = large then conclusion = dangerous if colour = red and skin = smooth and size = small then conclusion = safeThis set of six rules, each smaller than any of the originals, is an improvement on having those sixteen. They are all smaller because, as you might have guessed, the attribute Flesh is irrelevant. If you look at the table, you will find some examples of lines in it which differ only in the value of this attribute. From such an observation you can only conclude that the attribute is, under certain combinations of the other attributes, irrelevant; it should be a little surprising to you that it is always so.
To summarise the ID3 algoithm:
1. For each attribute, compute its entropy with respect to the
conclusion
2. Select the attribute (say A) with lowest entropy.
3. Divide the data into separate sets so that within a set,
A has a fixed value (eg Colour=green in one set, Colour=brown
in another, etc).
4. Build a tree with branches:
if A=a1 then ... (subtree1)
if A=a2 then ... (subtree2)
...etc...
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute gets removed from
consideration. The process stops when there are
no attributes left to consider, or when all the data
being considered in a subtree have the same value for the
conclusion (eg they all say Conclusion=safe).
The six generated rules can be even more succinctly put, however - only small green ones or large smooth red ones are dangerous. Of course, that word `only' conceals the fact that there are only two possible conclusions. Often such a further apparent compression of the rule set will not be easy to find, if it exists at all. One easy method is to try omitting each rule condition in turn, testing to see whether this results in any misclassification of data.
The rule induction algorithm was first used by Hunt in his CLS system in 1962! Then, with extensions for handling numeric data too, it was used by Ross Quinlan for his ID3 system (in 1979). Since then it has been widely copied, and is the basis of a variety of commercial rule induction packages. ID3 and later packages did not apply the algorithm to the whole data set, since that might be wasteful; instead they would generate a set of rules from a small sampling of the data and then look for other data items which were misclassified by the current rules. A number of such anomalous data items would be added to the initial set and the rule induction algorithm re-run to generate a new set of rules. It has been reported that this iterative approach may not save much effort compared to running the algorithm on the whole set right at the start, unless the whole set is huge.
If you are considering using the algorithm then there are, as ever, points to bear in mind. First, if the table of data about edibility had included a record of which day of the week the sample had been eaten on, the algorithm would have blindly considered this as a factor. So, wholly spurious correlations are possible, since the algorithm takes no account of any meaning that the data it works on may have. Second, the algorithm considers just one attribute at a time. If two attributes turn out to have almost identical entropies it will still select the lower entropy, whereas it might have been better to consider them jointly (as though they made up a single attribute). Consider the following data:
| Outcome | X | Y |
| yes | 3 | 3 |
| no | 2 | 1 |
| yes | 4 | 4 |
| no | 2 | 4 |
| no | 1 | 3 |
| yes | 1 | 1 |
| yes | 2 | 2 |
| no | 2 | 3 |
if X = Y then outcome = yes else outcome = nobecause it only ever considers one attribute at a time. Third, when inducing rules from large sets of examples in which there are a large number of possible outcomes (that is,
Although commercial rule induction systems have been used very successfully to condense sets of examples into rules, they must be used with care. You must be sure that the attributes you choose to include in describing the examples (the Colour, Skin and so on in the example above) must include the `right' ones. If the algorithm is given a complete set of examples then it can usually do a worthwhile job; if it is given only a subset then it is up to you to make sure that the subset is somehow representative of the whole. If a brown-skinned dangerous fruit were to be added to the data table above, the generated rules would change radically. For these reasons, rule induction systems are usually used to suggest correlations that can be confirmed by logical analysis afterward, rather than being trusted to get things right by themselves.
It is also important to remember that the generated rules are for solving the arbitrary classification task. Suppose you provided a set of attributes of motor cars, and used the algorithm to generate rules to classify an unspecified car. The rules might well have, as a first question, `How many wheels does it have?', then `how many cylinders?' and so on, ending up with, for instance, `does it have a statue of the Winged Victory on the bonnet? If yes, it is a Rolls Royce'. However, this last question is sufficient all by itself to determine whether a car is a Rolls Royce or not; if that is the only point at issue, all the earlier questions are unnecessary.