Introduction B alance I terative R educing & C lustering using H ierarchies
Definition Effective data clustering method for very large databases Each clustering decision is made without scanning all data points unsupervised data mining algorithm used to perform hierarchical clustering
Data Clustering It is portioning of a database into clusters Closely packed group Collection of data objects that are similar to one another & treated collectively in a group
Clustering Factor CF CF is compact storage for data on points in cluster Has enough information to calculate the intra-cluster distances Additive theorem allows to merge sub-clusters
BIRCH goal Minimizes running time and data scans Clustering decision made without scanning whole data Exploit the non uniformity of the data Treat dense areas as one and reduce noise
Algorithm Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk. Phase 2: Condense into desirable length by building a smaller CF tree. Phase 3: Global clustering. Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results.
Birch- Phase 1 Starts with initial threshold and inserts points into the tree. If it runs out of memory before it finishes scanning the data, it increases the threshold value and rebuilds a new, smaller CF tree, by re-inserting the leaf entries from the older tree and then other values Good initial threshold is important but hard to figure out. Outlier removal (when rebuilding tree).
Birch- Phase2 Optional Preparation for phase 3 Removes outliers and grouping clusters
Birch- Phase3 Problem after phase 1 Input order affects results. Splitting triggered by node size. Phase 3 Cluster all leaf node on the CF values according to an existing algorithm Algorithm used here: agglomerative hierarchical clustering
Birch-Phase4 Optional Additional scans of the datasets attaching each items to the centroids found. Recalculating the centroid and redistributing the items Always converges
Conclusion Birch performs faster than existing algorithms ( CLARANS and KMEANS) on large databases. Scans whole data only once. Handles outliers better. Superior to other algorithms in stability and scalability.