Definition
•Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
•Inter-cluster distances are maximized; intra-cluster distances are minimized
What is Cluster Analysis?
•Cluster: a collection of data objects
–Similar to one another within the same cluster
–Dissimilar to the objects in other clusters
•Cluster analysis
–Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
•Unsupervised learning: no predefined classes
•Typical applications
–As a stand-alone tool to get insight into data distribution
–As a preprocessing step for other algorithms
Applications
•Group related documents for browsing
•Group genes and proteins that have similar
functionality
•Group stocks with similar price fluctuations
•Group users with similar buying mentalities
Examples of Clustering Applications
•Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
•Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
•Land use: Identification of areas of similar land use in an earth
observation database
•City-planning: Identifying groups of houses according to their house
type, value, and geographical location
•Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Clustering is ambiguous
•There is no correct or incorrect solution for
clustering.
How many clusters? The same set of points can reasonably be grouped as
two, four, or six clusters (figure panels: Two Clusters, Four Clusters, Six Clusters)
Quality: What Is Good Clustering?
•A good clustering method will produce high quality clusters
with
–high intra-class similarity
–low inter-class similarity
•The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
•The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Requirements of Clustering in Data
Mining
•Scalability
•Ability to deal with different types of attributes
•Ability to handle dynamic data
•Discovery of clusters with arbitrary shape
•Minimal requirements for domain knowledge to determine
input parameters
•Able to deal with noise and outliers
•Insensitive to order of input records
•High dimensionality
•Incorporation of user-specified constraints
•Interpretability and usability
Major Clustering Approaches
•Partitioning approach:
–Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square errors
–Typical methods: k-means, k-medoids, CLARANS
•Hierarchical approach:
–Create a hierarchical decomposition of the set of data (or
objects) using some criterion either agglomerative or
divisive
–Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
•Density-based approach:
–Based on connectivity and density functions
–Typical methods: DBSCAN, OPTICS, DENCLUE
Partitional Clustering
(Figure: the original points and a partitional clustering of them)
K-Means Algorithm
•Each cluster is represented by the mean value of
the objects in the cluster
•Input: a set D of n objects and the number of clusters k
•Output: a set of k clusters
•Algorithm
–Arbitrarily select k objects from D and mark them as
initial cluster centers
–Repeat
•(re)assign each object to the cluster to which the object is
the most similar, based on the mean value of the objects in
the cluster;
• update the cluster means, that is, calculate the mean value
of the objects for each cluster;
•until no change;
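A minimal Python/numpy sketch of this procedure is given below. It is an illustrative implementation, not from the slides; the function name kmeans, the random-initialization seed, and the convergence test are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Arbitrarily select k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the cluster means (keep the old center if a cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # until no change
            break
        centers = new_centers
    return labels, centers
```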
K-Means (second method)
•Step 1: Randomly assign objects to k clusters
•Step 2: Find the mean of each cluster
•Step 3: Re-assign objects to the cluster with
closest mean.
•Step 4: Go to Step 2.
Repeat until there is no change.
Example 1
Given: {2,3,6,8,9,12,15,18,22} Assume k=3.
•Solution:
–Randomly partition given data set:
•K1 = 2,8,15 mean = 8.33
•K2 = 3,9,18 mean = 10
•K3 = 6,12,22 mean = 13.33
–Reassign
•K1 = 2,3,6,8,9 mean = 5.6
•K2 = { } (empty cluster; mean taken as 0)
•K3 = 12,15,18,22 mean = 16.75
–Reassign
•K1 = 3,6,8,9 mean = 6.5
•K2 = 2 mean = 2
•K3 = 12,15,18,22 mean = 16.75
–Reassign
•K1 = 6,8,9 mean = 7.67
•K2 = 2,3 mean = 2.5
•K3 = 12,15,18,22 mean = 16.75
–Reassign
•K1 = 6,8,9 mean = 7.67
•K2 = 2,3 mean = 2.5
•K3 = 12,15,18,22 mean = 16.75
•STOP
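A short numpy script (illustrative only, not from the slides) that starts from the same random partition and reproduces the iterations above; the convention of treating an empty cluster as having mean 0 follows the slide.

```python
import numpy as np

data = np.array([2, 3, 6, 8, 9, 12, 15, 18, 22], dtype=float)
# Start from the slide's random partition: K1={2,8,15}, K2={3,9,18}, K3={6,12,22}
means = np.array([np.mean([2, 8, 15]), np.mean([3, 9, 18]), np.mean([6, 12, 22])])

while True:
    # Reassign each value to the cluster with the closest mean
    labels = np.abs(data[:, None] - means[None, :]).argmin(axis=1)
    # Recompute means; an empty cluster gets mean 0, as in the slide
    new_means = np.array([data[labels == j].mean() if np.any(labels == j) else 0.0
                          for j in range(3)])
    if np.array_equal(new_means, means):   # stop when nothing changes
        break
    means = new_means

for j in range(3):
    print("K%d =" % (j + 1), sorted(data[labels == j]), "mean =", round(means[j], 2))
# Converges to K1={6,8,9}, K2={2,3}, K3={12,15,18,22}, matching the worked example
```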
Example 2
Given {2,4,10,12,3,20,30,11,25}
Assume k=2.
Solution:
K1 = 2,3,4,10,11,12
K2 = 20, 25, 30
K-Means (graph)
•Step1: Form k centroids, randomly
•Step2: Calculate distance between centroids and
each object
–Use the Euclidean distance to determine the closest centroid:
d(A,B) = √((x2 − x1)² + (y2 − y1)²)
•Step3: Assign objects based on min distance to k
clusters
•Step4: Calculate the centroid of each cluster using
C = ((x1 + x2 + … + xn)/n , (y1 + y2 + … + yn)/n)
–Go to step 2.
–Repeat until no change in centroids.
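The two formulas translate directly into code; a small Python sketch follows (the helper names euclidean and centroid are assumptions, not from the slides).

```python
import math

def euclidean(a, b):
    """d(A, B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)"""
    return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

def centroid(points):
    """C = ((x1 + ... + xn) / n, (y1 + ... + yn) / n)"""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

print(euclidean((1, 1), (4, 3)))            # ~3.61, distance A-C in the next example
print(centroid([(2, 1), (4, 3), (5, 4)]))   # (11/3, 8/3), centroid of B, C, D
```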
Example 1
•There are four types of medicines, each with
two attributes, as shown below. Find a
way to group them into 2 groups based on
their features.
Medicine Weight(x) pH(y)
A 1 1
B 2 1
C 4 3
D 5 4
Solution
•Plot the values on a graph
•Mark any k centroids; here take C1 = (1,1) (point A) as group 1
and C2 = (2,1) (point B) as group 2
•Calculate the Euclidean distance of each point A, B, C, D from
the centroids:
D =  [ 0     1     3.61  5    ]   (distances from C1)
     [ 1     0     2.83  4.24 ]   (distances from C2)
•Based on minimum distance, we assign points to
clusters: cluster 1 = {A}, cluster 2 = {B, C, D}
•Calculate new centroids:
–C1 = (1,1)
–C2 = ((2+4+5)/3 , (1+3+4)/3) = (11/3 , 8/3)
•Recompute the distances with the new centroids:
D1 = [ 0     1     3.61  5    ]   (distances from C1)
     [ 3.14  2.36  0.47  1.89 ]   (distances from C2)
•Object assignment:
–A, B = group 1
–C, D = group 2
•Calculate new centroids:
–C1 = (3/2, 1)
–C2 = (9/2, 7/2)
•Mark the new centroids on the graph
•Continue the iteration until there is no change in the
centroids or clusters.
Final solution
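One way to check the end state is scikit-learn's KMeans (assuming scikit-learn is available); on these four points it converges to the clusters {A, B} and {C, D} with the centroids shown above.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # A, B, C, D
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # A, B in one cluster; C, D in the other
print(km.cluster_centers_)  # [[1.5, 1.0], [4.5, 3.5]] (up to cluster ordering)
```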
Example 2
•Use K-means algorithm to create two clusters.
Given:
Example 3
•Group the points shown below into 3 clusters
K-Medoid (PAM)
•Also called Partitioning Around Medoids.
•Step1: choose k medoids
•Step2: assign all points to the closest medoid
•Step3: form a distance matrix for each cluster
and choose the next best medoid, i.e., the
point closest to all other points in the cluster
–Go to Step 2.
–Repeat until there is no change in any medoid
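A compact numpy sketch of these steps follows. It is a simplified k-medoids loop in the spirit of the description above, not the full swap-based PAM; the function name, seed, and stopping rule are assumptions.

```python
import numpy as np

def k_medoids(X, k, max_iter=50, seed=0):
    """Assign points to the nearest medoid, then replace each medoid by the
    cluster member closest to all other members of that cluster."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)        # Step 1: choose k medoids
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)                  # Step 2: closest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            # Step 3: point with minimum total distance to the rest of the cluster
            new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break                                              # no change in any medoid
        medoids = new_medoids
    return labels, X[medoids]
```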
Hierarchical methods
•Works by grouping data objects into a tree of
clusters
•Uses distance (similarity) matrix as clustering
criteria
•We provide a termination condition
•Classified further as :
–Agglomerative hierarchical clustering
–Divisive hierarchical clustering
Agglomerative hierarchical clustering
•Data objects are grouped in a bottom-up
fashion
•Initially each data object is in its own cluster
•Then merge these atomic clusters into larger
and larger clusters, until all objects are in a
single cluster or until certain termination
conditions are satisfied
•The termination condition can be specified by the
user, e.g., a desired number of clusters
•An HAC clustering is typically visualized as a
dendrogram
•Dendrogram: a tree data structure that
illustrates the hierarchical clustering
•Each level shows clusters for that level
–Leaf : individual clusters
–Root : one cluster
Agglomerative hierarchical clustering
(Figure: a traditional hierarchical clustering of points p1–p4 and the corresponding traditional dendrogram)
Agglomerative hierarchical clustering
Use the agglomerative algorithm on the
following data and draw the single-link
dendrogram.
Object X Y
A 2 2
B 3 2
C 1 1
D 3 1
E 1.5 0.5
Construct distance matrix
Use the Euclidean distance:
d(A,B) = √((x2 − x1)² + (y2 − y1)²)
A B C D E
A 0
B 1 0
C 1.41 2.24 0
D 1.41 1 2 0
E 1.58 2.12 0.71 1.58 0
•SINGLE LINK :
•Step 1:
•Since d(C, E) is the minimum, we merge
clusters C and E
A B C,E D
A 0
B 1 0
C,E 1.41 2.12 0
D 1.41 1 1.58 0
•Step 2:
•A, B has the minimum value, therefore we merge
these two clusters
A,B C,E D
A,B 0
C,E 1.41 0
D 1 1.58 0
•Step 3:
•(A, B) and D have the minimum value, therefore we
merge these two clusters
A,B,D C,E
A,B,D 0
C,E 1.41 0
•Step 4:
•We have two clusters to be combined
•Construct a dendrogram for the same
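The merge order above can be checked with SciPy's single-link implementation (assuming scipy is available, and matplotlib if the dendrogram is to be drawn).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

points = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # A, B, C, D, E
Z = linkage(pdist(points), method='single')   # single-link merge steps
print(np.round(Z, 2))
# Merges: (C, E) at 0.71, then (A, B) at 1.00 and D at 1.00 (tie order may vary),
# final merge of (A, B, D) with (C, E) at 1.41
dendrogram(Z, labels=list("ABCDE"))           # draws the dendrogram (needs matplotlib)
```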
•Different approaches to defining the distance
between clusters distinguish the different
algorithms:
•Single linkage clustering: use the minimum
distance
•Complete linkage: use the maximum distance
•Average linkage: the distance between two
clusters A and B is taken to be the average of all
distances between pairs of objects i in A and j in
B, i.e., the mean distance between elements of
the two clusters
Agglomerative
•Step1: Make each object as a cluster
•Step2: Calculate the Euclidean distance from
every point to every other point. i.e.,
construct a Distance Matrix
•Step3: Identify two clusters with shortest
distance.
–Merge them
–Go to Step 2
–Repeat until all objects are in one cluster
Example
•Use the single-link technique to find clusters in
the given database.
Data point   X      Y
1            0.4    0.53
2            0.22   0.38
3            0.35   0.32
4            0.26   0.19
5            0.08   0.41
6            0.45   0.3
A second dataset (used in the solution that follows):
Object X Y
A 1 1
B 1.5 1.5
C 5 5
D 3 4
E 4 4
F 3 3.5
Solution
•Merge D, F at distance 0.50
•Merge A, B at distance 0.71
•Merge E with (D, F) at distance 1.00
•Merge (D, F, E) with C at distance 1.41
•Merge (D, F, E, C) with (A, B) at distance 2.50
Divisive hierarchical clustering
•Data objects are grouped in a top-down manner
•Initially all objects are in one cluster
•The cluster is then subdivided into smaller pieces,
until each object forms a cluster on its own
or until some termination condition is satisfied
•Advantages:
–Simple, and outputs a hierarchy, a structure that
is more informative
–Does not require the number of clusters to be
specified in advance
•Disadvantages: the selection of a merge or split
point is critical, because once a group of objects is
merged or split, the algorithm operates on the newly
generated clusters and cannot undo what was
done previously
Agglomerative vs Divisive
Agglomerative: start with the points as individual clusters; at each step,
merge the closest pair of clusters until only one cluster is left.
Divisive: start with one, all-inclusive cluster; at each step, split a cluster
until each cluster contains a single point.
Extensions to Hierarchical Clustering
•Major weakness of agglomerative clustering methods
–Can never undo what was done previously
–Do not scale well: time complexity of at least O(n²), where n
is the total number of objects
•Integration of hierarchical & distance-based clustering
–BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
–CHAMELEON (1999): hierarchical clustering using dynamic
modeling
BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
•Zhang, Ramakrishnan & Livny, SIGMOD’96
•Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
–Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
–Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
•Scales linearly: finds a good clustering with a single scan and improves the
quality with a few additional scans
•Weakness: handles only numeric data, and is sensitive to the order of the data
records
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N points, LS = Σ_{i=1..N} X_i
SS: square sum of the N points, SS = Σ_{i=1..N} X_i²
Example (see figure): the 5 points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))
•CF = <n, LS, SS>
•where LS is the linear sum of the n points and SS is the square
sum of the n data points
•For two disjoint clusters C1 and C2, the CF of the cluster formed
by merging C1 and C2 is obtained by adding their CFs,
•i.e., CF1 + CF2 = <n1+n2, LS1+LS2, SS1+SS2>
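As a sketch, computing and merging CFs in Python (the helper names cf and cf_merge are illustrative assumptions):

```python
import numpy as np

def cf(points):
    """Clustering Feature of a set of d-dimensional points: CF = (N, LS, SS)."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), (X ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

print(cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, array([16., 30.]), array([ 54., 190.])) -- the CF from the slide example
```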
CF-Tree in BIRCH
•Clustering feature:
–Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
–Registers the crucial measurements for computing clusters and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
–A nonleaf node in a tree has descendants or “children”
–The nonleaf nodes store sums of the CFs of their children
•A CF tree has two parameters
–Branching factor: max no. of children
–Threshold: max diameter of sub-clusters stored at the leaf
nodes
•For each point in the input
–Find closest leaf entry
–Add point to leaf entry and update CF
–If entry diameter > max_diameter, then split leaf, and possibly parents
•Algorithm is O(n)
•Concerns
–Sensitive to insertion order of data points
–Since the size of leaf nodes is fixed, the clusters may not be natural
–Clusters tend to be spherical given the radius and diameter measures
Diameter of a sub-cluster: D = √( Σ_i Σ_j (x_i − x_j)² / (n(n−1)) )
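Because of the CF additivity, the diameter can be computed from the stored CF alone. A hedged numpy sketch is below; the identity D² = (2n·ΣSS − 2·‖LS‖²)/(n(n−1)) follows from expanding the double sum, and the function name is an assumption.

```python
import numpy as np

def diameter_from_cf(n, LS, SS):
    """D^2 = sum_i sum_j ||x_i - x_j||^2 / (n(n-1)) = (2n*sum(SS) - 2*||LS||^2) / (n(n-1))."""
    if n < 2:
        return 0.0
    return np.sqrt((2 * n * SS.sum() - 2 * (LS ** 2).sum()) / (n * (n - 1)))

X = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)
n, LS, SS = len(X), X.sum(axis=0), (X ** 2).sum(axis=0)
print(diameter_from_cf(n, LS, SS))   # compared against the threshold T to decide splits
```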
BIRCH: The Idea by Example
Clustering process (building the tree) for data objects 1–6:
•Object 1: start Cluster1 in a leaf node.
•Object 2: if Cluster1 becomes too large (not compact) by adding object 2,
then split the cluster; the leaf node now has two entries (Cluster1, Cluster2).
•Object 3: entry 1 is the closest to object 3; if Cluster1 becomes too large by
adding object 3, then split the cluster again; the leaf node now has three
entries (Cluster1, Cluster2, Cluster3).
•Object 4: entry 3 is the closest to object 4; Cluster2 remains compact when
adding object 4, so object 4 is added to Cluster2.
•Object 5: entry 2 is the closest to object 5; Cluster3 becomes too large by
adding object 5, so we would split Cluster3, BUT there is a limit to the number
of entries a node can have. Thus the node itself is split: a non-leaf node is
created with entries 1 and 2 pointing to two leaf nodes with entries 1.1, 1.2
and 2.1, 2.2 (creating Cluster4).
•Object 6: entry 1.2 is the closest to object 6; Cluster3 remains compact when
adding object 6, so object 6 is added to Cluster3.
Density-Based Clustering Methods
•Clustering based on density (local cluster criterion), such as
density-connected points
•Major features:
–Discover clusters of arbitrary shape
–Handle noise
–One scan
–Need density parameters as termination condition
•Several interesting studies:
–DBSCAN: Ester, et al. (KDD’96)
–OPTICS: Ankerst, et al (SIGMOD’99).
–DENCLUE: Hinneburg & D. Keim (KDD’98)
–CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)
Density-Based Clustering: Basic Concepts
•Two parameters:
–Eps: Maximum radius of the neighbourhood
–MinPts: Minimum number of points in an Eps-
neighbourhood of that point
•N_Eps(q) = {p belongs to D | dist(p,q) ≤ Eps}
•Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
–p belongs to N_Eps(q)
–core point condition: |N_Eps(q)| ≥ MinPts
(Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1 cm)
Density-Reachable and Density-Connected
•Density-reachable:
–A point p is density-reachable from a point q w.r.t. Eps, MinPts
if there is a chain of points p1, …, pn with p1 = q and pn = p
such that p(i+1) is directly density-reachable from p(i)
•Density-connected:
–A point p is density-connected to a point q w.r.t. Eps, MinPts
if there is a point o such that both p and q are
density-reachable from o w.r.t. Eps and MinPts
(Figures: a chain q = p1, …, p for density-reachability; points p and q both reachable from o for density-connectivity)
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
•Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
•Discovers clusters of arbitrary shape in spatial databases with
noise
(Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5)
DBSCAN: The Algorithm
•Arbitrarily select a point p
•Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
•If p is a core point, a cluster is formed
•If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
•Continue the process until all of the points have been
processed
•If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²)
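A minimal scikit-learn example (assuming scikit-learn is available; the synthetic blobs-plus-noise data are illustrative), with Eps and MinPts passed as eps and min_samples and label -1 marking noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# two dense blobs plus a few scattered noise points
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (10, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)    # Eps and MinPts
print(set(db.labels_))                        # cluster ids; -1 marks noise/outliers
print(np.sum(db.labels_ == -1), "points labelled as noise")
```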
DBSCAN: Sensitive to Parameters
OPTICS (Ordering Points to
Identify the Clustering
Structure)
–Based on DBSCAN.
– Does not produce clusters explicitly.
–Rather, it generates an ordering of the data objects representing
the density-based clustering structure.
•Produces a special order of the database w.r.t. its density-
based clustering structure
•This cluster-ordering contains information equivalent to the
density-based clusterings corresponding to a broad range of
parameter settings
•Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
•Can be represented graphically or using visualization
techniques
OPTICS: Some Extension from DBSCAN
•Two values need to be stored for each object—core-distance and
reachability-distance:
–The core-distance of an object p is the smallest Ɛ’ value that makes
{p} a core object. If p is not a core object, the core-distance of
p is undefined.
–The reachability-distance of an object q with respect to another
object p is the greater value of the core-distance of p and the
Euclidean distance between p and q. If p is not a core object,
the reachability-distance between p and q is undefined.
Core Distance & Reachability Distance
OPTICS: Some Extension from DBSCAN
•The OPTICS algorithm creates an ordering of the objects in a
database, additionally storing the core-distance and a suitable
reachability distance for each object.
•An algorithm was proposed to extract clusters based on the
ordering information produced by OPTICS.
•Such information is sufficient for the extraction of all density-
based clusterings with respect to any distance Ɛ’ that is smaller
than the distance Ɛ used in generating the order.
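A small scikit-learn sketch (assuming sklearn.cluster.OPTICS is available; the synthetic data are illustrative) that exposes the ordering, core-distances, and reachability-distances described above.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

opt = OPTICS(min_samples=5).fit(X)
order = opt.ordering_                     # the cluster-ordering of the objects
print(opt.core_distances_[order][:5])     # core-distances along the ordering
print(opt.reachability_[order][:5])       # reachability plot values (first is inf/undefined)
```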
(Figure: reachability plot; y-axis: reachability-distance, with undefined values and the threshold Ɛ’ marked; x-axis: cluster-order of the objects)
OPTICS
If we change MinPts we still get the same clusters; only the reachability
plot changes according to MinPts
Important Questions
•Explain K-means algorithm
•Difference between clustering and
classification
•Write a note on DBSCAN, Birch
•Difference between agglomerative and
divisive
•Numerical problems on:
–k-means (both one-dimensional and graphical examples)
–Agglomerative clustering