Classification & Clustering.pptx

About This Presentation

Machine Learning


Slide Content

Mining Big Data: Motivation. Today's digital society has seen enormous data growth in both commercial and scientific databases. Data mining is becoming a commonly used tool to extract information from large and complex datasets. For example, it helps provide better customer service in business/commercial settings and helps scientists in hypothesis formation. (Figure: data sources such as computational simulations, business data, sensor networks, geo-spatial data, homeland security data, and scientific data.)

Data Mining for Life and Health Sciences. Recent technological advances are helping to generate large amounts of both medical and genomic data: high-throughput experiments and techniques, gene and protein sequences, gene-expression data, biological networks and phylogenetic profiles, electronic medical records (an IBM-Mayo Clinic partnership has created a database of 5 million patients), and Single Nucleotide Polymorphisms (SNPs). Data mining offers a potential solution for the analysis of such large-scale data: automated analysis of patient history for customized treatment, prediction of the functions of anonymous genes, and identification of putative binding sites in protein structures for drug/chemical discovery. (Figure: protein interaction network.)

Origins of Data Mining. Data mining draws ideas from machine learning/AI, pattern recognition, statistics, and database systems. Traditional techniques may be unsuitable because of the enormity of the data, its high dimensionality, and its heterogeneous, distributed nature. (Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems.)

Data Mining as Part of the Knowledge Discovery Process

Data Mining Tasks: predictive modeling, clustering, association rules, and anomaly detection. (Figure: the four tasks illustrated on a sample data table.)

Data Mining Functions: Generalization. Materials to be covered in Chapters 2-4: information integration and data warehouse construction (data cleaning, transformation, integration, and the multidimensional data model); data cube technology (scalable methods for computing, i.e., materializing, multidimensional aggregates; OLAP, online analytical processing); and multidimensional concept description, i.e., characterization and discrimination (generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions).

Data Mining Functions: Association and Correlation. Frequent patterns (or frequent itemsets): what items are frequently purchased together in your Walmart? Association and correlation vs. causality. A typical association rule: Diaper → Beer [0.5%, 75%] (support, confidence). Are strongly associated items also strongly correlated? How do we mine such patterns and rules efficiently in large datasets? How can such patterns be used for classification, clustering, and other applications?
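To make the support and confidence numbers concrete, here is a minimal sketch in pure Python that counts how often a rule such as {Diaper} → {Beer} holds; the transaction list and the specific rule are illustrative assumptions, not data from the slides.

```python
# Minimal sketch: computing support and confidence of a rule X -> Y
# over a small, made-up list of market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional 'probability' of rhs given lhs: support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"diaper", "beer"}, transactions))      # 0.6
print(confidence({"diaper"}, {"beer"}, transactions))  # 0.75
```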

Data Mining Functions: Classification and Prediction. Classification and prediction construct models (functions) based on training examples and describe and distinguish classes or concepts for future prediction, e.g., classify countries based on climate, or classify cars based on gas mileage; they can also predict unknown or missing numerical values. Typical methods: decision trees, naïve Bayes classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, etc. Typical applications: credit card fraud detection, direct marketing, classifying stars, diseases, web pages, etc.

Data Mining Functions: Cluster and Outlier Analysis. Cluster analysis is unsupervised learning (the class label is unknown): group data to form new categories (clusters), e.g., cluster houses to find distribution patterns. Principle: maximize intra-class similarity and minimize inter-class similarity. There are many methods and applications. Outlier analysis: an outlier is a data object that does not comply with the general behavior of the data. Noise or exception? One person's garbage could be another person's treasure. Methods: a by-product of clustering or regression analysis, etc. Useful in fraud detection and rare-event analysis.

Data Mining Functions: Trend and Evolution Analysis. Sequence, trend, and evolution analysis includes: trend and deviation analysis (e.g., regression); sequential pattern mining (e.g., first buy a digital camera, then large SD memory cards); periodicity analysis; motif, time-series, and biological sequence analysis (approximate and consecutive motifs, similarity-based analysis); and mining data streams (ordered, time-varying, potentially infinite data streams).

Data Mining Functions: Structure and Network Analysis. Graph mining: finding frequent subgraphs (e.g., chemical compounds), trees (XML), and substructures (web fragments). Information network analysis: social networks consist of actors (objects, nodes) and relationships (edges), e.g., author networks in CS or terrorist networks; there are multiple heterogeneous networks (a person can appear in several information networks: friends, family, classmates, ...); links carry a lot of semantic information, which motivates link mining. Web mining: the Web is a big information network, from PageRank to Google; analysis of Web information networks includes Web community discovery, opinion mining, usage mining, etc.

Predictive Modeling: Classification

General Approach for Building a Classification Model. (Figure: a training set with categorical and quantitative attributes plus a class label is used to learn a classifier; the resulting model is then applied to a test set.)
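As a concrete illustration of this learn-then-apply workflow, here is a minimal sketch using scikit-learn (the library choice and the synthetic data are assumptions for illustration; the slides do not prescribe either): a decision tree is learned on a training set and then evaluated on a held-out test set.

```python
# Minimal sketch of the train/test classification workflow, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative synthetic data standing in for the attribute/class table.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)  # the model to be learned
model.fit(X_train, y_train)                                  # learn the classifier on the training set
y_pred = model.predict(X_test)                               # apply the model to the test set
print("test accuracy:", accuracy_score(y_test, y_pred))
```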

Examples of Classification Tasks: predicting tumor cells as benign or malignant; classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil; predicting the functions of proteins; classifying credit card transactions as legitimate or fraudulent; categorizing news stories as finance, weather, entertainment, sports, etc.; and identifying intruders in cyberspace.

Commonly Used Classification Models. Base classifiers: decision tree based methods, rule-based methods, nearest-neighbor, neural networks, naïve Bayes and Bayesian belief networks, support vector machines. Ensemble classifiers: boosting, bagging, random forests.

Classification Model: Decision Tree. (Figure: a decision tree for predicting credit worthiness, with internal nodes testing Employed (Yes/No), Education (Graduate vs. {High school, Undergrad}), and Number of years (> 7 yrs vs. < 7 yrs), and Yes/No leaves giving the predicted class.)

Constructing a Decision Tree. (Figure: candidate splits with class counts. Splitting on Employed: Yes gives Worthy: 4, Not Worthy: 3; No gives Worthy: 0, Not Worthy: 3. Splitting on Education: Graduate gives Worthy: 2, Not Worthy: 2; High School/Undergrad gives Worthy: 2, Not Worthy: 4.) The key computation is the class distribution (Worthy vs. Not Worthy) in each candidate child node.
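The class counts in each candidate child node are typically turned into an impurity score. The slide shows only the counts, not which measure is used, so as a hedged sketch the snippet below scores the two candidate splits from the figure with Gini impurity, one common choice.

```python
# Sketch: scoring the two candidate splits from the slide with Gini impurity.
# (The slide gives only the class counts; Gini is one common impurity measure.)

def gini(counts):
    """Gini impurity of a node given its class counts, e.g. (worthy, not_worthy)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(children):
    """Weighted average Gini impurity of the child nodes produced by a split."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)

# Split on Employed: Yes -> (4 worthy, 3 not worthy), No -> (0 worthy, 3 not worthy)
employed_split = [(4, 3), (0, 3)]
# Split on Education: Graduate -> (2, 2), High School/Undergrad -> (2, 4)
education_split = [(2, 2), (2, 4)]

print("Employed split :", weighted_gini(employed_split))   # ~0.343
print("Education split:", weighted_gini(education_split))  # ~0.467
# The split with the lower weighted impurity (Employed) would be chosen first.
```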

Constructing a Decision Tree (continued). (Figure: the data partitioned into Employed = Yes and Employed = No.)

Classification Errors. Training errors (apparent errors): errors committed on the training set. Test errors: errors committed on the test set. Generalization error: the expected error of a model over a random selection of records from the same distribution.

Example Data Set. A two-class problem: the + class has 5,200 instances (5,000 generated from a Gaussian centered at (10, 10), plus 200 noisy instances); the o class has 5,200 instances generated from a uniform distribution. 10% of the data is used for training and 90% for testing.
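A sketch of how such a data set could be generated with NumPy is shown below. The slide does not state the Gaussian's spread or the ranges of the uniform and noise distributions, so those values are assumptions for illustration only.

```python
# Sketch of generating the example data set (spread and ranges are assumed values).
import numpy as np

rng = np.random.default_rng(0)

# '+' class: 5,000 Gaussian points centered at (10, 10) plus 200 noisy points.
plus_gauss = rng.normal(loc=(10.0, 10.0), scale=1.0, size=(5000, 2))  # assumed unit std
plus_noise = rng.uniform(low=0.0, high=20.0, size=(200, 2))           # assumed range
plus = np.vstack([plus_gauss, plus_noise])

# 'o' class: 5,200 points from a uniform distribution (range assumed).
minus = rng.uniform(low=0.0, high=20.0, size=(5200, 2))

X = np.vstack([plus, minus])
y = np.array([1] * len(plus) + [0] * len(minus))

# 10% of the data for training, 90% for testing, as on the slide.
idx = rng.permutation(len(X))
n_train = int(0.1 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```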

Design Issues of Decision Tree Induction. How should training records be split? We need a method for specifying the test condition, depending on attribute types, and a measure for evaluating the goodness of a test condition. When should the splitting procedure stop? Stop splitting when all records belong to the same class or have identical attribute values, or use early termination.

Model Overfitting. Underfitting: when the model is too simple, both training and test errors are large. Overfitting: when the model is too complex, the training error is small but the test error is large.

Model Overfitting (using twice the number of data instances). If the training data is under-representative, test errors increase and training errors decrease as the number of nodes increases. Increasing the size of the training data reduces the difference between training and test errors at a given number of nodes.

Reasons for Model Overfitting: presence of noise, lack of representative samples, and the multiple comparison procedure.

Notes on Overfitting. Overfitting results in decision trees that are more complex than necessary. Training error does not provide a good estimate of how well the tree will perform on previously unseen records. We need ways to incorporate model complexity into model development.

Clustering Algorithms: K-means and its variants, hierarchical clustering, and other types of clustering.

K-means Clustering. A partitional clustering approach: the number of clusters, K, must be specified; each cluster is associated with a centroid (center point); each point is assigned to the cluster with the closest centroid. The basic algorithm is very simple.

Example of K-means Clustering

K-means Clustering: Details. The centroid is (typically) the mean of the points in the cluster. Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
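A minimal sketch of these details using scikit-learn's KMeans (an assumed library choice; the slides don't name one): K is specified up front, initial centroids are chosen randomly under the given seed, closeness is Euclidean distance, and the data below is purely illustrative.

```python
# Minimal K-means sketch, assuming scikit-learn; the data is illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)  # K must be specified
labels = kmeans.fit_predict(X)      # each point is assigned to the closest centroid
print(kmeans.cluster_centers_)      # centroids: the means of the assigned points
print(kmeans.n_iter_)               # I, the number of iterations actually run
```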

Evaluating K-means Clusters. The most common measure is the Sum of Squared Error (SSE): for each point, the error is the distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them: SSE = Σᵢ Σ_{x ∈ Cᵢ} dist(mᵢ, x)², where x is a data point in cluster Cᵢ and mᵢ is the representative point (centroid) of cluster Cᵢ. Given two sets of clusters, we prefer the one with the smallest error. One easy way to reduce SSE is to increase K, the number of clusters.
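The SSE formula can be computed directly from the cluster assignments; the short sketch below does so with NumPy. When applied to the K-means sketch above, the result should agree with scikit-learn's inertia_ attribute.

```python
# Sketch: Sum of Squared Error (SSE) for a set of clusters.
import numpy as np

def sse(X, labels, centroids):
    """SSE = sum over clusters i of sum over x in C_i of ||x - m_i||^2."""
    return sum(
        np.sum((X[labels == i] - m) ** 2)
        for i, m in enumerate(centroids)
    )

# Continuing the K-means sketch above:
# print(sse(X, labels, kmeans.cluster_centers_), kmeans.inertia_)  # the two agree
```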

Two Different K-means Clusterings. (Figure: the original points clustered two ways, a sub-optimal clustering and an optimal clustering.)

Limitations of K-means. K-means has problems when clusters have differing sizes, differing densities, or non-globular shapes, and when the data contains outliers.

Limitations of K-means: Differing Sizes. (Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Differing Density. (Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Non-globular Shapes. (Figure: original points vs. K-means with 2 clusters.)

Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequence of merges or splits.

Strengths of Hierarchical Clustering. We do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. The clusters may also correspond to meaningful taxonomies, for example in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...).

Hierarchical Clustering. There are two main types. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) remains. Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm. The more popular hierarchical clustering technique. The basic algorithm is straightforward: compute the proximity matrix; let each data point be a cluster; then repeatedly merge the two closest clusters and update the proximity matrix until only a single cluster remains. The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms.
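A compact sketch of this algorithm using SciPy (an assumed library choice) is given below: linkage builds the sequence of merges from a proximity matrix, and fcluster (or a dendrogram cut) reads clusters off the resulting tree. The data is illustrative.

```python
# Sketch of agglomerative clustering with SciPy (library choice is an assumption).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in [(0, 0), (3, 3)]])

D = pdist(X)                       # condensed proximity (distance) matrix
Z = linkage(D, method="average")   # repeatedly merge the two closest clusters
labels = fcluster(Z, t=2, criterion="maxclust")  # 'cut' the dendrogram into 2 clusters
# dendrogram(Z)  # visualize the sequence of merges (requires matplotlib)
```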

Starting Situation. Start with clusters of individual points and a proximity matrix. (Figure: points p1-p5 and their proximity matrix.)

Intermediate Situation. After some merging steps, we have some clusters. (Figure: clusters C1-C5 and their proximity matrix.)

Intermediate Situation. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. (Figure: clusters C1-C5 and their proximity matrix.)

After Merging. The question is: how do we update the proximity matrix? (Figure: the merged cluster C2 ∪ C5 and a proximity matrix whose entries between C2 ∪ C5 and C1, C3, C4 are still unknown.)

How to Define Inter-Cluster Distance. Options include MIN (single link), MAX (complete link), group average, the distance between centroids, and other methods driven by an objective function (Ward's method uses squared error). (Figure: points p1-p5 and their proximity matrix.)
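The first few definitions on this slide are easy to state directly in code; the sketch below computes the MIN, MAX, group-average, and centroid distances between two small, made-up clusters.

```python
# Sketch: MIN, MAX, group-average, and centroid inter-cluster distances
# between two toy clusters (the points are illustrative).
import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 4.0], [5.0, 5.0]])

pairwise = cdist(cluster_a, cluster_b)   # all pairwise distances between the clusters

d_min = pairwise.min()    # MIN / single link: closest pair of points
d_max = pairwise.max()    # MAX / complete link: farthest pair of points
d_avg = pairwise.mean()   # group average: mean of all pairwise distances
d_centroid = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))

print(d_min, d_max, d_avg, d_centroid)
```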

Other Types of Clustering Algorithms. There are hundreds of clustering algorithms. Some examples: K-means, hierarchical clustering, statistically based clustering (mixture-model-based clustering), fuzzy clustering, self-organizing maps (SOM), and density-based clustering (DBSCAN). The proper choice of algorithm depends on the type of clusters to be found, the type of data, and the objective.

Cluster Validity. For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall. For cluster analysis, the analogous question is how to evaluate the 'goodness' of the resulting clusters. But 'clusters are in the eye of the beholder'! Then why do we want to evaluate them? To avoid finding patterns in noise, to compare clustering algorithms, to compare two sets of clusters, and to compare two clusters.

Clusters Found in Random Data. (Figure: random points clustered by K-means, DBSCAN, and complete link.)

Different Aspects of Cluster Validation: distinguishing whether non-random structure actually exists in the data; comparing the results of a cluster analysis to externally known results, e.g., externally given class labels; evaluating how well the results of a cluster analysis fit the data without reference to external information; comparing the results of two different sets of cluster analyses to determine which is better; and determining the 'correct' number of clusters.

Using the Similarity Matrix for Cluster Validation. Order the similarity matrix with respect to cluster labels and inspect it visually.

Using the Similarity Matrix for Cluster Validation. Clusters in random data are not so crisp. (Figure: similarity matrix for DBSCAN on random data.)

Using the Similarity Matrix for Cluster Validation. Clusters in random data are not so crisp. (Figure: similarity matrix for K-means on random data.)

Using the Similarity Matrix for Cluster Validation. Clusters in random data are not so crisp. (Figure: similarity matrix for complete link on random data.)

Measures of Cluster Validity. Numerical measures applied to judge various aspects of cluster validity are classified into three types of indices. External index: measures the extent to which cluster labels match externally supplied class labels (e.g., entropy). Internal index: measures the goodness of a clustering structure without respect to external information (e.g., Sum of Squared Error, SSE). Relative index: compares two different clusterings or clusters; often an external or internal index is used for this purpose, e.g., SSE or entropy. For further details please see "Introduction to Data Mining", Chapter 8: http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf
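As one concrete example of an external index, the sketch below computes the entropy of each cluster with respect to externally supplied class labels; low entropy means a cluster is dominated by a single class. The labels are made up for illustration.

```python
# Sketch: entropy of clusters with respect to external class labels (an external index).
import numpy as np
from collections import Counter

def cluster_entropy(class_labels, cluster_labels):
    """Weighted average entropy of the class distribution inside each cluster."""
    n = len(class_labels)
    total = 0.0
    for c in set(cluster_labels):
        members = [cls for cls, cl in zip(class_labels, cluster_labels) if cl == c]
        probs = np.array(list(Counter(members).values())) / len(members)
        total += len(members) / n * -np.sum(probs * np.log2(probs))
    return total

# Illustrative labels: a clustering that matches the classes has entropy 0.
print(cluster_entropy(["a", "a", "b", "b"], [0, 0, 1, 1]))  # 0.0
print(cluster_entropy(["a", "b", "a", "b"], [0, 0, 1, 1]))  # 1.0
```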

Association Analysis

Association Analysis. Given a set of records, find dependency rules that will predict the occurrence of an item based on occurrences of other items in the record. Applications: marketing and sales promotion, supermarket shelf management, and traffic pattern analysis (e.g., rules such as "high congestion on Intersection 58 implies high accident rates for left-turning traffic"). Example rules discovered: {Milk} --> {Coke} (s = 0.6, c = 0.75); {Diaper, Milk} --> {Beer} (s = 0.4, c = 0.67).

Association Rule Mining Task. Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ a minsup threshold and confidence ≥ a minconf threshold. A brute-force approach has two steps: (1) frequent itemset generation, i.e., generate all itemsets whose support ≥ minsup; (2) rule generation, i.e., generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset. Frequent itemset generation is computationally expensive.

Efficient Pruning Strategy (Ref: Agrawal & Srikant, 1994). The Apriori principle: if an itemset is infrequent, then all of its supersets must also be infrequent. (Figure: an itemset lattice in which the supersets of an itemset found to be infrequent are pruned.)
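A minimal, level-wise sketch of this pruning idea in pure Python is shown below: candidate k-itemsets are generated only from frequent (k-1)-itemsets, so supersets of infrequent itemsets are never counted. The transactions and the minimum support count are illustrative assumptions.

```python
# Sketch of Apriori-style level-wise mining with superset pruning (illustrative data).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]
minsup = 3  # minimum support count

def count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = [{frozenset([i]) for i in items if count(frozenset([i])) >= minsup}]

k = 2
while frequent[-1]:
    # Generate candidate k-itemsets from frequent (k-1)-itemsets; prune any
    # candidate that has an infrequent (k-1)-subset (the Apriori principle).
    prev = frequent[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    frequent.append({c for c in candidates if count(c) >= minsup})
    k += 1

for level, sets in enumerate(frequent, start=1):
    print(level, [tuple(sorted(s)) for s in sets])
```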

Illustrating the Apriori Principle. With a minimum support of 3, first count items (1-itemsets), then pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs, and then triplets (3-itemsets). If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidate itemsets; with support-based pruning, only 6 + 6 + 1 = 13 are counted.