their classes, a perfect classification has no entropy, and so, in that case, information
gain is due to a decrease in entropy:
$$\text{Information gain} = \sum_{\text{final}} p_i \ln(p_i) - \sum_{\text{initial}} p_i \ln(p_i) \tag{1.27}$$
A simple example of information gain is illustrated in Table 1.8. Information gain is
here determined based on how well three different classifiers (row results) do at
assigning samples to three different clusters A, B, and C, corresponding to Classes
a, b, and c. The original cluster has 10 samples each from Classes a, b, and c, and so
the entropy of three such clusters = 3 ln(3) ≈ 3.300. The maximum information gain is thus
3.300 if a classifier perfectly assigns each sample, generating cluster A =
{10,0,0}, cluster B = {0,10,0}, and cluster C = {0,0,10}. Actual (realized) information
gain is 3.300 minus the entropy sum of clusters A, B, and C: 0.507 for distribution 1,
1.081 for distribution 2, and 1.535 for distribution 3. Distribution 3 thus has the
maximum information gain and therefore performs the best classification. Note that
these values correlate with the observed accuracies of 60%, 70%, and 77% for
distributions 1, 2, and 3, respectively.
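The arithmetic behind Table 1.8 follows directly from Eq. (1.27). The following minimal Python sketch (the entropy helper and variable names are ours, for illustration only, not from the text) recomputes the per-cluster entropies, their sum, the resulting information gain, and the classification accuracy for each distribution. Because the text rounds the initial entropy 3 ln(3) ≈ 3.296 up to 3.300, its reported gains differ from the exact values in the third decimal place.

from math import log

def entropy(counts):
    """Shannon entropy (natural log) of a cluster's class counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)

# Initial state: three maximally mixed clusters, {10,10,10} each.
initial_entropy = 3 * entropy([10, 10, 10])  # 3 ln(3) = 3.296; the text rounds to 3.300

# {a,b,c} counts in clusters A, B, C for distributions 1-3 of Table 1.8.
distributions = {
    1: [[6, 2, 1], [2, 6, 3], [2, 2, 6]],
    2: [[7, 0, 3], [2, 8, 1], [1, 2, 6]],
    3: [[8, 1, 1], [0, 9, 3], [2, 0, 6]],
}

for label, clusters in distributions.items():
    final_entropy = sum(entropy(c) for c in clusters)  # "Sum of A,B,C" column
    gain = initial_entropy - final_entropy             # initial minus final, per Eq. (1.27)
    # Accuracy: cluster A should collect class a, B class b, C class c (30 samples total).
    accuracy = sum(c[i] for i, c in enumerate(clusters)) / 30
    print(f"Distribution {label}: entropy sum = {final_entropy:.3f}, "
          f"information gain = {gain:.3f}, accuracy = {accuracy:.0%}")

Running the sketch reproduces the entropy sums of Table 1.8 to rounding and confirms that the gain ordering (distribution 3 > 2 > 1) tracks the accuracy ordering (77% > 70% > 60%).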
1.7.5 Optimization and search
Much research has been performed in optimization, and it will be revisited in
Section 1.8 with regard to genetic algorithms. Suffice it to say that this is a huge area
of research, and its scope is often underestimated. Elements of optimization that are
often underappreciated include considerations of (a) sensitivity analysis of the optimal
algorithm or system design; (b) the robustness and relevance of the ground truth,
or training data; (c) determining the breadth of the search space and performing an
exhaustive “presearch” (usually as part of the training or validation); and (d) estimating
the lifetime of the system and periodically ensuring that prediction meets
Table 1.8 Information gain for a classification problem

Original cluster = {10,10,10} distribution among the three classes

                            Cluster A   Cluster B   Cluster C   Sum of A,B,C
{a,b,c} Distribution 1      {6,2,1}     {2,6,3}     {2,2,6}     {10,10,10}
Entropy of distribution 1   0.848       0.995       0.950       2.793
{a,b,c} Distribution 2      {7,0,3}     {2,8,1}     {1,2,6}     {10,10,10}
Entropy of distribution 2   0.611       0.760       0.848       2.219
{a,b,c} Distribution 3      {8,1,1}     {0,9,3}     {2,0,6}     {10,10,10}
Entropy of distribution 3   0.639       0.563       0.563       1.765
The original cluster comprises an equal number (10 each) of samples from each of three classes. Three
different classifiers result in three different distributions of assignment, where a perfect assignment would
be {10,0,0}, {0,10,0}, and {0,0,10}, with a resulting entropy of 0.000 for the distribution. Distribution 3
clearly moves the distribution entropy closest to 0.000 and so is judged the best classifier.