Birch

malla23 2,598 views 12 slides Jun 10, 2013
Slide 1
Slide 1 of 12
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12

About This Presentation

No description available for this slideshow.


Slide Content

Birch Algorithm Presented by: Binod Malla 09BIM012

Introduction B alance I terative R educing & C lustering using H ierarchies

Definition Effective data clustering method for very large databases   Each clustering decision is made without scanning all data points unsupervised data mining algorithm used to perform hierarchical clustering 

Data Clustering It is portioning of a database into clusters Closely packed group Collection of data objects that are similar to one another & treated collectively in a group

Clustering Factor CF CF is compact storage for data on points in cluster Has enough information to calculate the intra-cluster distances Additive theorem allows to merge sub-clusters

BIRCH goal Minimizes running time and data scans Clustering decision made without scanning whole data Exploit the non uniformity of the data Treat dense areas as one and reduce noise

Algorithm Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk. Phase 2: Condense into desirable length by building a smaller CF tree. Phase 3: Global clustering. Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results.

Birch- Phase 1 Starts with initial threshold and inserts points into the tree. If it runs out of memory before it finishes scanning the data, it increases the threshold value and rebuilds a new, smaller CF tree, by re-inserting the leaf entries from the older tree and then other values Good initial threshold is important but hard to figure out. Outlier removal (when rebuilding tree).

Birch- Phase2 Optional Preparation for phase 3 Removes outliers and grouping clusters

Birch- Phase3 Problem after phase 1 Input order affects results. Splitting triggered by node size. Phase 3 Cluster all leaf node on the CF values according to an existing algorithm Algorithm used here: agglomerative hierarchical clustering

Birch-Phase4 Optional Additional scans of the datasets attaching each items to the centroids found. Recalculating the centroid and redistributing the items Always converges

Conclusion Birch performs faster than existing algorithms ( CLARANS and KMEANS) on large databases. Scans whole data only once. Handles outliers better. Superior to other algorithms in stability and scalability.
Tags