ch_5_dm: Clustering in Data Mining

Chapter 5: Clustering

Definition
•Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups

•Goal: intra-cluster distances are minimized, while
inter-cluster distances are maximized

What is Cluster Analysis?


•Cluster: a collection of data objects
–Similar to one another within the same cluster
–Dissimilar to the objects in other clusters
•Cluster analysis
–Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
•Unsupervised learning: no predefined classes
•Typical applications
–As a stand-alone tool to get insight into data distribution
–As a preprocessing step for other algorithms

Applications
•Group related documents for browsing

•Group genes and proteins that have similar
functionality

•Group stocks with similar price fluctuations

•Group users with similar buying behaviour

Examples of Clustering Applications
•Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
•Insurance: Identifying groups of motor insurance policy holders
with a high average claim cost
•Land use: Identification of areas of similar land use in an earth
observation database
•City-planning: Identifying groups of houses according to their house
type, value, and geographical location
•Earthquake studies: observed earthquake epicenters should be
clustered along continental faults

Clustering is ambiguous
•There is no correct or incorrect solution for
clustering.
How many clusters?
(Figure: the same set of points grouped into two, four, or six clusters.)

Quality: What Is Good Clustering?
•A good clustering method will produce high quality clusters
with
–high intra-class similarity
–low inter-class similarity
•The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
•The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

Requirements of Clustering in Data
Mining
•Scalability
•Ability to deal with different types of attributes
•Ability to handle dynamic data
•Discovery of clusters with arbitrary shape
•Minimal requirements for domain knowledge to determine
input parameters
•Ability to deal with noise and outliers
•Insensitive to order of input records
•High dimensionality
•Incorporation of user-specified constraints
•Interpretability and usability

Major Clustering Approaches
•Partitioning approach:
–Construct various partitions and then evaluate them by
some criterion, e.g., minimizing the sum of square errors
–Typical methods: k-means, k-medoids, CLARANS
•Hierarchical approach:
–Create a hierarchical decomposition of the set of data (or
objects) using some criterion, in either an agglomerative or a
divisive manner
–Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
•Density-based approach:
–Based on connectivity and density functions
–Typical methods: DBSCAN, OPTICS, DenClue

Partitional Clustering
(Figure: a set of original points and a partitional clustering of them.)

Hierarchical Clustering
(Figure: traditional and non-traditional hierarchical clusterings of points p1–p4, with the corresponding traditional and non-traditional dendrograms.)

Clustering Algorithms
•Partitional
–K-means
–K-medoids
•Hierarchical
–Agglomerative
–Divisive
–BIRCH
•Density-based methods:
–DBSCAN
–OPTICS

K-Means Algorithm
•Each cluster is represented by the mean value of
the objects in the cluster
•Input: set of n objects, number of clusters k
•Output: set of k clusters
•Algorithm
–Arbitrarily select k objects from D and mark them as
initial cluster centers
–Repeat
•(re)assign each object to the cluster to which the object is
the most similar, based on the mean value of the objects in
the cluster;
•update the cluster means, that is, calculate the mean value
of the objects for each cluster;
•until no change;
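Below is a minimal sketch of this loop in Python with NumPy (assumed available); the random initial-centre choice and the convergence test are illustrative details, not prescribed by the slides.

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Lloyd's k-means: pick k initial centres, then alternate the
    assignment step and the mean-update step until nothing changes."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # Arbitrarily select k objects as the initial cluster centres
    centres = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # (Re)assign each object to the cluster with the nearest centre
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no change -> converged
        labels = new_labels
        # Update each centre to the mean of the objects assigned to it
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return labels, centres

# The 1-D data from Example 1 below, reshaped to a column of points
points = np.array([2, 3, 6, 8, 9, 12, 15, 18, 22], dtype=float).reshape(-1, 1)
print(k_means(points, k=3))
```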

K-Means (second method)
•Step 1: Randomly assign objects to k clusters
•Step 2: Find the mean of each cluster
•Step 3: Re-assign objects to the cluster with
closest mean.
•Step 4: Go to Step 2.
Repeat until no change.

Example 1
Given: {2,3,6,8,9,12,15,18,22} Assume k=3.
•Solution:
–Randomly partition given data set:
•K1 = 2,8,15 mean = 8.3
•K2 = 3,9,18 mean = 10
•K3 = 6,12,22 mean = 13.3
–Reassign
•K1 = 2,3,6,8,9 mean = 5.6
•K2 = { } (empty), mean taken as 0
•K3 = 12,15,18,22 mean = 16.75

–Reassign
•K1 = 3,6,8,9 mean = 6.5
•K2 = 2 mean = 2
•K3 = 12,15,18,22 mean = 16.75
–Reassign
•K1 = 6,8,9 mean = 7.67
•K2 = 2,3 mean = 2.5
•K3 = 12,15,18,22 mean = 16.75
–Reassign
•K1 = 6,8,9 mean = 7.67
•K2 = 2,3 mean = 2.5
•K3 = 12,15,18,22 mean = 16.75
•STOP

Example 2
Given {2,4,10,12,3,20,30,11,25}
Assume k=2.




Solution:
K1 = 2,3,4,10,11,12
K2 = 20, 25, 30

K-Means (graph)
•Step1: Form k centroids, randomly
•Step2: Calculate distance between centroids and
each object
–Use the Euclidean distance to determine the minimum distance:
d(A,B) = √((x₂ − x₁)² + (y₂ − y₁)²)

•Step3: Assign objects based on min distance to k
clusters
•Step4: Calculate the centroid of each cluster using
C = ((x₁ + x₂ + … + xₙ)/n, (y₁ + y₂ + … + yₙ)/n)
–Go to step 2.
–Repeat until no change in centroids.

Example 1
•There are four types of medicines and each
has two attributes, as shown below. Find a
way to group them into 2 groups based on
their features.
Medicine Weight(x) pH(y)
A 1 1
B 2 1
C 4 3
D 5 4

Solution
•Plot the values on a graph
•Mark any k centroids, e.g. C1 = (1, 1) as group 1 and C2 = (2, 1) as group 2

•Calculate the Euclidean distance of each point from the centroids:
D = [0      1      3.61   5    ]   (distances from C1 to A, B, C, D)
    [1      0      2.83   4.24 ]   (distances from C2 to A, B, C, D)

•Based on minimum distance, we assign points to clusters:
C1 = {A}
C2 = {B, C, D}
•Calculate new centroids: C1 = (1, 1)
•C2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)

•Recompute the distances with the new centroids:
D1 = [0      1      3.61   5    ]   (distances from C1)
     [3.14   2.36   0.47   1.89 ]   (distances from C2)

•Objects clustering:
–A, B = group 1
–C, D = group 2

•Calculate new centroids:
•C1 = (3/2, 1)
•C2 = (9/2, 7/2)

•Marking the new centroids








•Continue the iteration, until there is no change in the
centroids or clusters.

Final solution
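As a cross-check of the final solution, the same medicine data can be clustered with scikit-learn's KMeans (library assumed installed); its k-means++ initialisation may take a different path, but it ends at the same two groups and centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

medicines = np.array([[1, 1],    # A: weight, pH
                      [2, 1],    # B
                      [4, 3],    # C
                      [5, 4]])   # D

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(medicines)
print(km.labels_)            # e.g. [0 0 1 1]  ->  {A, B} and {C, D}
print(km.cluster_centers_)   # approximately (1.5, 1) and (4.5, 3.5)
```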

Example 2
•Use K-means algorithm to create two clusters.
Given:

Example 3
Group the below points into 3 clusters

K-Medoids (PAM)
•Also called Partitioning Around Medoids.
•Step 1: choose k medoids
•Step 2: assign all points to the closest medoid
•Step 3: form the distance matrix for each cluster
and choose the next best medoid, i.e., the
point closest to all other points in the cluster
–Go to Step 2.
–Repeat until no change in any medoids
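A small Python sketch of this medoid-update loop (NumPy assumed); a full PAM implementation evaluates swap costs more carefully, so treat this as an illustration of the listed steps only.

```python
import numpy as np

def pam_like(data, k, max_iter=100, seed=0):
    """Simplified k-medoids following the steps above: assign points to the
    nearest medoid, then make the new medoid of each cluster the member with
    the smallest total distance to the other members."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    medoids = rng.choice(len(data), size=k, replace=False)   # Step 1
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest medoid
        dists = np.linalg.norm(data[:, None, :] - data[medoids][None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: within each cluster, pick the point closest to all the others
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            within = np.linalg.norm(data[members][:, None, :] - data[members][None, :, :], axis=2)
            new_medoids[j] = members[within.sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break   # no medoid changed
        medoids = new_medoids
    return labels, data[medoids]
```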

Hierarchical methods
•Works by grouping data objects into a tree of
clusters
•Uses distance (similarity) matrix as clustering
criteria
•We provide a termination condition
•Classified further as :
–Agglomerative hierarchical clustering
–Divisive hierarchical clustering

Agglomerative hierarchical clustering

•Data objects are grouped in a bottom-up
fashion
•Initially each data object is in its own cluster
•Then these atomic clusters are merged into larger
and larger clusters, until all objects are in a
single cluster or until certain termination
conditions are satisfied
•The termination condition can be specified by the
user, e.g., as a desired number of clusters

•An HAC clustering is typically visualized as a
dendrogram
•Dendrogram : is a tree data structure which
illustrates hierarchical clustering techniques
•Each level shows clusters for that level
–Leaf : individual clusters
–Root : one cluster


Agglomerative hierarchical clustering
(Figure: a traditional hierarchical clustering of points p1–p4 and the corresponding dendrogram.)

Use the agglomerative algorithm on the
following data and draw the single-link
dendrogram.
Object X Y
A 2 2
B 3 2
C 1 1
D 3 1
E 1.5 0.5

Construct the distance matrix
Use the Euclidean distance to determine the minimum distance:
d(A,B) = √((x₂ − x₁)² + (y₂ − y₁)²)




A B C D E
A 0
B 1 0
C 1.41 2.24 0
D 1.41 1 2 0
E 1.58 2.12 0.71 1.58 0
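The matrix above can be reproduced with SciPy (assumed installed):

```python
# pdist computes all pairwise Euclidean distances; squareform turns them
# into a full symmetric matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.array([(2, 2), (3, 2), (1, 1), (3, 1), (1.5, 0.5)])  # A, B, C, D, E
D = squareform(pdist(pts))
print(np.round(D, 2))
# Row/column order A..E; the smallest off-diagonal entry is d(C, E) = 0.71.
```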

•SINGLE LINK:
•Step 1:
•Since d(C, E) = 0.71 is the minimum entry, we combine
clusters C and E

A B C,E D
A 0
B 1 0
C,E 1.41 2.12 0
D 1.41 1 1.58 0

•Step 2:
•The pair (A, B) has the minimum value (1), therefore we merge
these two clusters
A,B C,E D
A,B 0
C,E 1.41 0
D 1 1.58 0

•Step 3:
•The cluster (A, B) and D have the minimum value (1), therefore we
merge these two clusters


A,B,D C,E
A,B,D 0
C,E 1.41 0

•Step 4:
•We have two clusters to be combined
•Construct a dendrogram for the same

•Different approaches to defining the distance
between clusters distinguish the different
algorithms:
•Single linkage clustering: use the minimum
distance between clusters
•Complete linkage: use the maximum distance
•Average linkage: the distance between any two clusters A and B
is taken to be the average of all distances between pairs of objects
“i” in A and “j” in B, that is, the mean distance
between elements of the two clusters

Agglomerative
•Step1: Make each object as a cluster
•Step2: Calculate the Euclidean distance from
every point to every other point. i.e.,
construct a Distance Matrix
•Step3: Identify two clusters with shortest
distance.
–Merge them
–Go to Step 2
–Repeat until all objects are in one cluster
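The same procedure is available in SciPy's hierarchical clustering module (SciPy and matplotlib assumed installed); method="single" selects single-link (minimum-distance) merging, and the example reuses the A–E points from above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

pts = np.array([(2, 2), (3, 2), (1, 1), (3, 1), (1.5, 0.5)])  # A, B, C, D, E

# Single-link agglomerative clustering: repeatedly merge the two clusters
# whose closest members are nearest to each other.
Z = linkage(pts, method="single")
print(Z)  # each row: indices of the merged clusters, merge distance, cluster size

dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()
```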

Example
•Use the single-link technique to find clusters in
the given database.
Data point   X      Y
1            0.4    0.53
2            0.22   0.38
3            0.35   0.32
4            0.26   0.19
5            0.08   0.41
6            0.45   0.3

Plot given data

Construct a distance matrix
      1      2      3      4      5      6
1     0
2     0.24   0
3     0.22   0.15   0
4     0.37   0.20   0.15   0
5     0.34   0.14   0.28   0.29   0
6     0.23   0.25   0.11   0.22   0.39   0

Identify two nearest clusters

Repeat process until all objects in
same cluster

Complete link
•Max distance matrix

Average link
•Average distance matrix

Object X Y
A 1 1
B 1.5 1.5
C 5 5
D 3 4
E 4 4
F 3 3.5

Solution
•D,F at distance 0.50
•A,B at distance 0.71
•E and (D,F) at 1
•(D, F , E) and C at distance 1.41
•(D,F,E,C) and (A,B) at distance 2.50

Divisive hierarchical clustering
•Data objects are grouped in a top-down manner
•Initially all objects are in one cluster
•Then the cluster is subdivided into smaller pieces,
until each object forms a cluster on its own
or until some termination condition is satisfied
•Advantages:
–It is simple and outputs a hierarchy, a structure that
is more informative
–It does not require the number of
clusters to be specified in advance

•Disadvantage: the selection of a merge or split
point is critical, because once a group of objects is
merged or split, the method operates on the newly
generated clusters and cannot undo what was
done previously

Agglomerative vs Divisive
•Agglomerative: start with each point as an individual cluster; at each step,
merge the closest pair of clusters until only one cluster is left.
•Divisive: start with one all-inclusive cluster; at each step, split a cluster
until each cluster contains a single point.

Extensions to Hierarchical Clustering
•Major weakness of agglomerative clustering methods
–Can never undo what was done previously
–Do not scale well: time complexity of at least O(n²), where n
is the number of total objects
•Integration of hierarchical & distance-based clustering
–BIRCH (1996): uses CF-tree and incrementally adjusts the
quality of sub-clusters
–CHAMELEON (1999): hierarchical clustering using dynamic
modeling

BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
•Zhang, Ramakrishnan & Livny, SIGMOD’96
•Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
–Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
–Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
•Scales linearly: finds a good clustering with a single scan and improves the
quality with a few additional scans
•Weakness: handles only numeric data, and is sensitive to the order of the data
records

Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: Number of data points
LS: linear sum of the N points: LS = Σᵢ Xᵢ (i = 1..N)
SS: square sum of the N points: SS = Σᵢ Xᵢ² (i = 1..N)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):
CF = (5, (16,30), (54,190))

•CF = <n, LS, SS>
•where LS is the linear sum of the n points and SS is the square
sum of the n data points
•For 2 disjoint clusters C1 and C2, the CF of their union is formed
by merging C1 and C2,
•i.e., CF1 + CF2 = <n1+n2, LS1+LS2, SS1+SS2>
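A tiny illustration of this additivity property in Python (NumPy assumed), using the five points from the CF example above:

```python
import numpy as np

def cf(points):
    """Clustering Feature of a set of points: (N, linear sum LS, square sum SS)."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def merge(cf1, cf2):
    """CF of the union of two disjoint clusters: component-wise addition."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

c1 = cf([(3, 4), (2, 6)])
c2 = cf([(4, 5), (4, 7), (3, 8)])
print(merge(c1, c2))  # (5, array([16., 30.]), array([ 54., 190.]))
```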

CF-Tree in BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies)

CF-Tree in BIRCH
•Clustering feature:
–Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point
of view
–Registers crucial measurements for computing cluster and
utilizes storage efficiently
•A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
–A nonleaf node in a tree has descendants or “children”
–The nonleaf nodes store sums of the CFs of their children
•A CF tree has two parameters
–Branching factor: max no. of children
–Threshold: max diameter of sub-clusters stored at the leaf
nodes

The CF Tree Structure
(Figure: a CF tree with a root of CF entries, non-leaf nodes whose CF entries point to child nodes, and leaf nodes whose CF entries are chained with prev/next pointers.)

The BIRCH Algorithm
•Cluster diameter: D = √( (1 / (n(n−1))) · Σᵢ Σⱼ (xᵢ − xⱼ)² )

•For each point in the input
–Find the closest leaf entry
–Add the point to the leaf entry and update its CF
–If the entry diameter > max_diameter, then split the leaf, and possibly its parents
•Algorithm is O(n)
•Concerns
–Sensitive to the insertion order of data points
–Since the size of leaf nodes is fixed, clusters may not be very natural
–Clusters tend to be spherical, given the radius and diameter measures
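scikit-learn includes a BIRCH implementation; the sketch below (library assumed installed, synthetic data for illustration) shows typical usage, where threshold roughly limits the size of leaf sub-clusters and branching_factor caps the number of CF entries per node.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs, 50 points each
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# threshold limits how large a leaf sub-cluster may grow; branching_factor
# caps the CF entries per node; n_clusters drives the final global step.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)
print(np.bincount(model.labels_))  # roughly 50 points in each cluster
```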

BIRCH: The Idea by Example
Clustering process (building the tree) for data objects 1–6:

•Object 1 starts Cluster1 in a leaf node.
•Object 2: Cluster1 would become too large (not compact) if object 2 were added, so the cluster is split — the leaf node now holds two entries, Cluster1 and Cluster2.
•Object 3: entry 1 is the closest to object 3, but Cluster1 would again become too large, so it is split — the leaf node now holds three entries, and Cluster3 is created.
•Object 4: entry 3 is the closest to object 4; Cluster2 remains compact when object 4 is added, so object 4 joins Cluster2.
•Object 5: entry 2 is the closest to object 5; Cluster3 would become too large, so it should be split — but there is a limit to the number of entries a node can have, so the node itself is split. The result is a non-leaf node with two entries whose children are leaf nodes (entries 1.1, 1.2, 2.1, 2.2), and Cluster4 is created.
•Object 6: entry 1.2 is the closest to object 6; Cluster3 remains compact when object 6 is added, so object 6 joins Cluster3.

Density-Based Clustering Methods
•Clustering based on density (local cluster criterion), such as
density-connected points
•Major features:
–Discover clusters of arbitrary shape
–Handle noise
–One scan
–Need density parameters as termination condition
•Several interesting studies:
–DBSCAN: Ester, et al. (KDD’96)
–OPTICS: Ankerst, et al (SIGMOD’99).
–DENCLUE: Hinneburg & D. Keim (KDD’98)
–CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

Density-Based Clustering: Basic Concepts
•Two parameters:
–Eps: Maximum radius of the neighbourhood
–MinPts: Minimum number of points in an Eps-
neighbourhood of that point
•N_Eps(q): {p belongs to D | dist(p,q) ≤ Eps}
•Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
–p belongs to N_Eps(q)
–core point condition: |N_Eps(q)| ≥ MinPts
(Figure: point p in the Eps-neighbourhood of a core point q, with MinPts = 5 and Eps = 1 cm.)

Density-Reachable and Density-Connected
•Density-reachable:
–A point p is density-reachable from a
point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, with p1 = q and pn = p,
such that p(i+1) is directly density-reachable from p(i)
•Density-connected:
–A point p is density-connected to a
point q w.r.t. Eps, MinPts if there is a
point o such that both p and q are
density-reachable from o w.r.t. Eps
and MinPts
(Figure: a chain of points from q to p illustrating density-reachability, and a point o from which both p and q are density-reachable, illustrating density-connectivity.)

DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
•Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points
•Discovers clusters of arbitrary shape in spatial databases with
noise
(Figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5.)

DBSCAN: The Algorithm
•Arbitrarily select a point p
•Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
•If p is a core point, a cluster is formed
•If p is a border point, no points are density-reachable from p
and DBSCAN visits the next point of the database
•Continue the process until all of the points have been
processed
•If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²)
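A minimal DBSCAN usage sketch with scikit-learn (library assumed installed, synthetic data for illustration); points labelled -1 are treated as noise/outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points that should become noise
X = np.vstack([rng.normal(0, 0.3, (40, 2)),
               rng.normal(4, 0.3, (40, 2)),
               rng.uniform(-2, 6, (5, 2))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(set(db.labels_))                 # cluster ids; -1 marks noise/outliers
print(int(np.sum(db.labels_ == -1)))   # number of points labelled as noise
```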

DBSCAN: Sensitive to Parameters

OPTICS (Ordering Points to
Identify the Clustering
Structure)

–Based on DBSCAN.
–Does not produce clusters explicitly.
–Rather, it generates an ordering of the data objects representing
the density-based clustering structure.
•Produces a special order of the database w.r.t. its density-
based clustering structure
•This cluster-ordering contains info equiv to the density-
based clusterings corresponding to a broad range of
parameter settings
•Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
•Can be represented graphically or using visualization
techniques
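A minimal OPTICS usage sketch with scikit-learn (library assumed installed, synthetic data for illustration); ordering_ gives the cluster order of the objects and reachability_ the values plotted in a reachability plot.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(4, 0.3, (40, 2))])

opt = OPTICS(min_samples=5).fit(X)
print(opt.ordering_[:10])                     # cluster order of the objects
print(opt.reachability_[opt.ordering_][:10])  # reachability values (first is undefined/inf)
```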

OPTICS: Some Extension from DBSCAN
•Two values need to be stored for each object—core-distance and
reachability-distance:

–The core-distance of an object p is the smallest Ɛ’ value that makes
{p} a core object. If p is not a core object, the core-distance of
p is undefined.

–The reachability-distance of an object q with respect to another
object p is the greater value of the core-distance of p and the
Euclidean distance between p and q. If p is not a core object,
the reachability-distance between p and q is undefined.

Core Distance & Reachability Distance


OPTICS: Some Extension from DBSCAN
•The OPTICS algorithm creates an ordering of the objects in a
database, additionally storing the core-distance and a suitable
reachability distance for each object.

•An algorithm was proposed to extract clusters based on the
ordering information produced by OPTICS.

•Such information is sufficient for the extraction of all density-
based clusterings with respect to any distance Ɛ’ that is smaller
than the distance Ɛ used in generating the order.

(Figure: OPTICS reachability plot — reachability-distance, including undefined values, plotted against the cluster order of the objects.)

OPTICS
If we change the minimum points we still get the same clusters; the
reachability plot changes according to MinPts.

Important Questions
•Explain the K-means algorithm
•Difference between clustering and
classification
•Write a note on DBSCAN and BIRCH
•Difference between agglomerative and
divisive clustering
•Numerical problems on:
–k-means (both single-attribute and graphical)
–Agglomerative clustering