Foreword
Clustering is basically collecting together objects, which are alike, and
separating them from other groups of like objects, which are unalike; i.e.,
grouping and discriminating. We do this intuitively all the time in our daily lives.
Without it, every object we encountered would be new. By recognizing critical
features which make the new object like something we have seen before, we
can benefit from earlier encounters and be able to deal quickly with a fresh
experience. Equally, we are very efficient. For instance, we recognize a group
of friends and relatives, even if they are in a large crowd of people, by selecting
a subset of features, which separate them from the population in general.
Chemists, for example, even in the days when chemistry was emerging from
alchemy, were putting substances into classes such as “metals.” This “metal”
class contained things such as iron, copper, silver, and gold, but also mercury
which, even though it was liquid, still had enough properties in common
with the other members of its class to be included. In other words, scientists
were grouping together things that were related or similar, but not necessar-
ily identical, and separating them from the “non-metals.” As Hartigan [65]
has pointed out, today’s biologist would be reasonably happy with the way
Aristotle classified animals and plants, and would probably use only a slightly
modified version of a Linnaean-type hierarchy: kingdom, phylum, class, order,
family, genus, and species.
So, if we are so good at clustering and the related processes of classify-
ing and discrimination in our daily lives, why do we need a book to tell us
how to do this for chemistry and biology, which are, after all, fundamentally
classification sciences? The answer to this question is twofold.
Firstly, the datasets we are dealing with are very large and have high
dimensionality, so we cannot easily find patterns in data. Consider for example
one of the best known datasets in chemistry, the Periodic Table of Elements.
It is “exactly what it says on the tin.” Mendeleyev ordered the 60 or 70 known
elements by atomic weight, which, when set out in a table, grouped the elements
by properties. So the alkali metals all came in one row in his original version,
the halogens in another, and so on. He had enough belief in his model to leave
gaps for as yet undiscovered elements and to correct the atomic weights of elements
that did not fit. What he did not do was try and cluster the properties of the
elements. Even for such a small dataset, questions would arise as to how to
combine, say, a density value, which is real, with a valence value, which is an