We can define the goal in hard flat clustering as follows. Given
(i) a set of documents D = {d1, . . . , dN},
(ii) a desired number of clusters K, and
(iii) an objective function that evaluates the quality of a
clustering,
we want to compute an assignment γ : D → {1, . . . , K} that
minimizes (or, in other cases, maximizes) the objective function.
In most cases, we also demand that γ is surjective, i.e., that none
of the K clusters is empty. The objective function is often defined
in terms of similarity or distance between documents.
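
As a minimal sketch of this setup, the assignment γ can be stored as a list holding one cluster id per document. The function names, the toy vectors, and the choice of residual sum of squares (RSS) as the objective function are illustrative assumptions; the definition above does not prescribe any particular objective.

```python
from typing import Sequence


def is_surjective(gamma: Sequence[int], k: int) -> bool:
    """True if gamma uses exactly the cluster ids 1..k, so no cluster is empty."""
    return set(gamma) == set(range(1, k + 1))


def rss(docs: Sequence[Sequence[float]], gamma: Sequence[int], k: int) -> float:
    """Sum of squared distances of each document to its cluster centroid."""
    total = 0.0
    for cluster in range(1, k + 1):
        members = [d for d, g in zip(docs, gamma) if g == cluster]
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        centroid = [sum(m[i] for m in members) / len(members) for i in range(dim)]
        total += sum((m[i] - centroid[i]) ** 2 for m in members for i in range(dim))
    return total


# Four toy documents as 2-d vectors (hypothetical data).
docs = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
good = [1, 1, 2, 2]  # groups nearby documents together
bad = [1, 2, 1, 2]   # mixes distant documents

print(is_surjective(good, 2), rss(docs, good, 2))
print(is_surjective(bad, 2), rss(docs, bad, 2))
```

Since RSS is a distance-based objective that we minimize, the assignment `good` is preferred over `bad` here.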
Exhaustive and non-exhaustive clustering
Some researchers distinguish between exhaustive clusterings that
assign each document to a cluster and non-exhaustive clusterings,
in which some documents will be assigned to no cluster.
Non-exhaustive clusterings in which each document is a member of
either no cluster or one cluster are called exclusive.
Evaluation of clustering
Typical objective functions in clustering formalize the goal of
attaining high intra-cluster similarity (documents within a cluster
are similar) and low inter-cluster similarity (documents from
different clusters are dissimilar). This is an internal criterion for
the quality of a clustering.
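Purity, by contrast, is an external criterion that compares the
clustering with a set of given classes. Each cluster is assigned to
the class that is most frequent in it, and purity is the fraction of
the N documents that are then correctly assigned:
purity(Ω, C) = (1/N) Σk maxj |ωk ∩ cj|
where Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C =
{c1, c2, . . . , cJ} is the set of classes. We interpret ωk as the
set of documents in cluster ωk and cj as the set of documents in
class cj.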
Normalized Mutual Information
High purity is easy to achieve when the number of clusters is
large. In particular, purity is 1 if each document gets its own
cluster. Thus, we cannot use purity to trade off the quality of the
clustering against the number of clusters.
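
The weakness described above can be seen in a small sketch. It assumes the usual definition of purity: assign each cluster to its majority class, count the documents that agree with that class, and divide by the number of documents N. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def purity(clusters: Sequence[Hashable], classes: Sequence[Hashable]) -> float:
    """Fraction of documents that belong to the majority class of their cluster."""
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("need two non-empty label lists of equal length")
    per_cluster = {}
    for omega, c in zip(clusters, classes):
        # Count how often each class occurs inside each cluster.
        per_cluster.setdefault(omega, Counter())[c] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(clusters)


# Six toy documents with known classes (hypothetical data).
classes = ["x", "x", "x", "o", "o", "d"]
two_clusters = [1, 1, 2, 2, 2, 2]  # K = 2
singletons = [1, 2, 3, 4, 5, 6]    # K = N, one document per cluster

print(purity(two_clusters, classes))
print(purity(singletons, classes))
```

The singleton clustering reaches purity 1 although it carries no useful grouping, while the two-cluster solution scores lower.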
A measure that allows us to make this tradeoff is normalized
mutual information (NMI).
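
The following sketch shows how NMI penalizes a large number of clusters. It assumes one common normalization, NMI(Ω, C) = I(Ω; C) / ((H(Ω) + H(C)) / 2), with probabilities estimated as relative frequencies; other normalizations exist. The function names, the toy labels, and the convention of returning 1.0 when both entropies are zero are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Entropy (natural log) of the label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(clusters: Sequence[Hashable], classes: Sequence[Hashable]) -> float:
    """Mutual information divided by the mean of the two entropies."""
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    n_cluster = Counter(clusters)
    n_class = Counter(classes)
    mi = 0.0
    for (omega, c), n_kj in joint.items():
        # P(omega, c) * log( P(omega, c) / (P(omega) * P(c)) )
        mi += (n_kj / n) * math.log(n * n_kj / (n_cluster[omega] * n_class[c]))
    denom = (entropy(clusters) + entropy(classes)) / 2
    # Convention for this sketch: two trivial one-label partitions count as identical.
    return mi / denom if denom > 0 else 1.0


# Six toy documents in two equally sized classes (hypothetical data).
classes = ["x", "x", "x", "o", "o", "o"]
perfect = [1, 1, 1, 2, 2, 2]     # K = 2, matches the classes
singletons = [1, 2, 3, 4, 5, 6]  # K = N, purity would be 1

print(nmi(perfect, classes))
print(nmi(singletons, classes))
```

Both clusterings have purity 1, but the entropy of the clustering grows with the number of clusters, so the singleton solution receives a clearly lower NMI than the two-cluster solution.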