Classification in data mining

sulmanahmed1004 · 56 slides · Sep 19, 2017

About This Presentation

This course is about data mining and how to obtain optimized results. It covers the main types of data mining and how these techniques are used.


Slide Content

Data Mining Lecture – 03

Classification: Definition. Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
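
As a concrete illustration of this train/test workflow, here is a minimal sketch using scikit-learn (not part of the original slides; the dataset, the 70/30 split, and the use of a decision tree are assumptions for illustration):

```python
# Minimal train/test classification sketch with scikit-learn (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a labelled dataset: X holds the attribute values, y holds the class attribute.
X, y = load_breast_cancer(return_X_y=True)

# Split the records into a training set (to build the model) and a test set (to validate it).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Learn a model for the class as a function of the other attributes.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Estimate how accurately previously unseen records are classified.
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```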

Illustrating Classification Task

Examples of Classification Task: predicting tumor cells as benign or malignant; classifying credit card transactions as legitimate or fraudulent; classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil; categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques: Decision Tree based methods, Rule-based methods, Memory-based reasoning, Neural Networks, Naïve Bayes and Bayesian Belief Networks, Support Vector Machines.

Example of a Decision Tree. The training data has two categorical attributes (Refund, Marital Status), one continuous attribute (Taxable Income), and the class. The model splits first on Refund (Yes/No), then on Marital Status (Married vs. Single/Divorced), then on Taxable Income (< 80K vs. > 80K); each leaf is labeled YES or NO.

Another Example of a Decision Tree. On the same training data, a different tree splits first on Marital Status, then on Refund, then on Taxable Income. There can be more than one tree that fits the same data!

Decision Tree Classification Task

Apply Model to Test Data. Start from the root of the tree and follow the branch that matches the test record's attribute value at each node (Refund, then Marital Status, then Taxable Income) until a leaf is reached; the leaf's label is the prediction. For the example test record, the model assigns Cheat = "No".

Decision Tree Induction. Many algorithms: Hunt's Algorithm (one of the earliest); CART (Classification And Regression Tree); ID3 (Iterative Dichotomiser 3); C4.5 (successor of ID3); SLIQ (does not require loading the entire dataset into main memory); SPRINT (similar approach to SLIQ, induces decision trees relatively quickly); CHAID (CHi-squared Automatic Interaction Detector), which performs multi-level splits when computing classification trees; MARS, which extends decision trees to handle numerical data better; Conditional Inference Trees, a statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting.

General Structure of Hunt's Algorithm. Let D_t be the set of training records that reach a node t. General procedure: if D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t; if D_t is an empty set, then t is a leaf node labeled by the default class y_d; if D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
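
The following is a minimal, illustrative sketch of Hunt's procedure in Python (not from the slides; the record representation, the naive choice of test attribute, and the helper names are assumptions):

```python
# Illustrative sketch of Hunt's algorithm over records given as dicts
# {"attrs": {...}, "class": label}. Attribute choice here is naive (first unused
# attribute); real implementations pick the split that optimizes a purity criterion.
from collections import Counter

def hunt(records, attributes, default_class=None):
    # Empty node: label with the default (e.g., majority class of the parent).
    if not records:
        return {"leaf": True, "label": default_class}
    classes = [r["class"] for r in records]
    majority = Counter(classes).most_common(1)[0][0]
    # All records belong to the same class, or no attributes left: make a leaf.
    if len(set(classes)) == 1 or not attributes:
        return {"leaf": True, "label": majority}
    # Otherwise split on an attribute and recurse on each subset.
    attr = attributes[0]
    node = {"leaf": False, "attr": attr, "children": {}}
    for value in {r["attrs"][attr] for r in records}:
        subset = [r for r in records if r["attrs"][attr] == value]
        node["children"][value] = hunt(subset, attributes[1:], default_class=majority)
    return node
```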

Hunt's Algorithm, step by step on the cheat example: start with a single leaf labeled "Don't Cheat"; split on Refund (Yes -> Don't Cheat); split the No branch on Marital Status (Married -> Don't Cheat); split the Single/Divorced branch on Taxable Income (< 80K -> Don't Cheat, >= 80K -> Cheat).

Evaluation of a Classifier. How predictive is the model we learned? Which performance measure should we use? The natural performance measure for classification problems is the error rate on a test set. Success: an instance's class is predicted correctly. Error: an instance's class is predicted incorrectly. Error rate: the proportion of errors made over the whole set of instances. Accuracy: the proportion of correctly classified instances over the whole set of instances; accuracy = 1 - error rate.

Confusion Matrix. A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

                        PREDICTED Class = Yes    PREDICTED Class = No
ACTUAL Class = Yes      a: TP (true positive)    b: FN (false negative)
ACTUAL Class = No       c: FP (false positive)   d: TN (true negative)

Confusion Matrix - Example. What can we learn from this matrix? There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean the patient has the disease, and "no" would mean they don't. The classifier made a total of 165 predictions (e.g., 165 patients were tested for the presence of that disease). Out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times. In reality, 105 patients in the sample have the disease, and 60 patients do not.

Confusion Matrix – Confusion? False positives are actually negatives; false negatives are actually positives.

Confusion Matrix - Example. Let's now define the most basic terms, which are whole numbers (not rates): true positives (TP): cases in which we predicted yes (they have the disease), and they do have the disease; true negatives (TN): we predicted no, and they don't have the disease; false positives (FP): we predicted yes, but they don't actually have the disease (also known as a "Type I error"); false negatives (FN): we predicted no, but they actually do have the disease (also known as a "Type II error").

Confusion Matrix - Computations. This is a list of rates that are often computed from a confusion matrix. Accuracy: overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91. Misclassification Rate: overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09; equivalent to 1 minus accuracy; also known as "Error Rate". True Positive Rate: when it's actually yes, how often does it predict yes? TP/actual yes = 100/105 = 0.95; also known as "Sensitivity" or "Recall". False Positive Rate: when it's actually no, how often does it predict yes? FP/actual no = 10/60 = 0.17.

Confusion Matrix - Computations (continued). Specificity: when it's actually no, how often does it predict no? TN/actual no = 50/60 = 0.83; equivalent to 1 minus the False Positive Rate. Precision: when it predicts yes, how often is it correct? TP/predicted yes = 100/110 = 0.91. Prevalence: how often does the yes condition actually occur in our sample? actual yes/total = 105/165 = 0.64.
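
A small Python sketch (not from the slides) that computes these rates from the four raw counts; the numbers reproduce the disease example above:

```python
# Compute common confusion-matrix rates from the four raw counts (illustrative).
def confusion_rates(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    actual_yes, actual_no, predicted_yes = tp + fn, fp + tn, tp + fp
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,   # 1 - accuracy
        "true_positive_rate": tp / actual_yes,          # sensitivity / recall
        "false_positive_rate": fp / actual_no,
        "specificity": tn / actual_no,                  # 1 - false positive rate
        "precision": tp / predicted_yes,
        "prevalence": actual_yes / total,
    }

# Disease example from the slides: TP=100, FN=5, FP=10, TN=50 (165 predictions).
print(confusion_rates(tp=100, fn=5, fp=10, tn=50))
```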

Confusion Matrix – Example 2. Imagine that you have a dataset that consists of 33 patterns that are 'Spam' (S) and 67 patterns that are 'Non-Spam' (NS). Of the 33 'Spam' patterns, 27 were correctly predicted as 'Spam' while 6 were incorrectly predicted as 'Non-Spam'. On the other hand, out of the 67 'Non-Spam' patterns, 57 were correctly predicted as 'Non-Spam' while 10 were incorrectly classified as 'Spam'. http://aimotion.blogspot.com/2010/08/tools-for-machine-learning-performance.html

Confusion Matrix – Example 2. Accuracy = (TP+TN)/total = (27+57)/100 = 84%. Misclassification Rate = (FP+FN)/total = (6+10)/100 = 16%. True Positive Rate = TP/actual yes = 27/33 = 0.81. False Positive Rate = FP/actual no = 10/67 = 0.15. http://www.marcovanetti.com/pages/cfmatrix/?noc=1

Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: determine how to split the records (how to specify the attribute test condition? how to determine the best split?) and determine when to stop splitting.

How to Specify Test Condition? Depends on the attribute type: nominal, ordinal, or continuous. Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Nominal Attributes. Multi-way split: use as many partitions as there are distinct values (e.g., CarType -> Family, Sports, Luxury). Binary split: divides the values into two subsets, and we need to find the optimal partitioning (e.g., CarType -> {Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}).

Splitting Based on Ordinal Attributes. Multi-way split: use as many partitions as there are distinct values (e.g., Size -> Small, Medium, Large). Binary split: divides the values into two subsets, and we need to find the optimal partitioning (e.g., Size -> {Medium, Large} vs. {Small}, or {Small, Medium} vs. {Large}). What about the split {Small, Large} vs. {Medium}? It does not preserve the ordering of the values, so it is usually not allowed for ordinal attributes.
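
As an illustration (not from the slides), a short Python sketch that enumerates all candidate binary partitions of a nominal attribute's values; the attribute and value names come from the CarType example above:

```python
# Enumerate all binary partitions of a nominal attribute's values (illustrative).
from itertools import combinations

def binary_partitions(values):
    values = sorted(values)
    partitions = []
    # Choose every non-empty proper subset once; its complement forms the other side.
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if (right, left) not in partitions:   # skip mirrored duplicates
                partitions.append((left, right))
    return partitions

# CarType example: {Family} vs {Luxury, Sports}, {Luxury} vs {Family, Sports}, ...
for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(set(left), "vs", set(right))
```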

Splitting Based on Continuous Attributes. Different ways of handling: discretization to form an ordinal categorical attribute, either static (discretize once at the beginning) or dynamic (ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering); or a binary decision, (A < v) or (A >= v), which considers all possible splits, finds the best cut, and can be more compute intensive.
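
A minimal sketch (not from the slides) of the "consider all possible splits and find the best cut" idea, using the Gini impurity of the two resulting subsets; the data and helper names are made up for illustration:

```python
# Find the best binary cut (A < v vs. A >= v) on a continuous attribute
# by scanning candidate thresholds and minimizing the weighted Gini impurity.
from collections import Counter

def gini(labels):
    n = len(labels)
    return (1.0 - sum((c / n) ** 2 for c in Counter(labels).values())) if n else 0.0

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if weighted < best[1]:
            best = (v, weighted)
    return best  # (threshold, weighted Gini after the split)

# Toy example: taxable income vs. cheat label; the best cut falls at 80.0.
print(best_cut([60, 70, 75, 85, 90, 95], ["No", "No", "No", "Yes", "Yes", "Yes"]))
```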

Splitting Based on Continuous Attributes

Tree Induction (recap). Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: determine how to split the records (how to specify the attribute test condition? how to determine the best split?) and determine when to stop splitting.

How to Determine the Best Split. Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?

How to Determine the Best Split. Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity. A non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.

How to Measure Impurity? Given a data table that contains attributes and the class of the attributes, we can measure the homogeneity (or heterogeneity) of the table based on the classes. We say a table is pure or homogeneous if it contains only a single class. If a data table contains several classes, then we say that the table is impure or heterogeneous. http://people.revoledu.com/kardi/tutorial/DecisionTree/how-to-measure-impurity.htm

How to Measure Impurity? There are several indices that measure the degree of impurity quantitatively. The most well-known are: Entropy, Gini Index, and Misclassification Error. All of these formulas are expressed in terms of p_j, the probability of class j in the table.
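
The formulas themselves appear as images in the original slides; written out (these are the standard definitions, consistent with the numeric examples that follow), they are:

```latex
\mathrm{Entropy} = -\sum_{j} p_j \log_2 p_j, \qquad
\mathrm{Gini} = 1 - \sum_{j} p_j^{2}, \qquad
\mathrm{Misclassification\ Error} = 1 - \max_{j} p_j
```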

How to Measure Impurity? - Example. In our example, the classes of the Transportation Mode attribute consist of three groups: Bus, Car, and Train. In this case, we have 4 buses, 3 cars, and 3 trains (in short, 4B, 3C, 3T). The total data is 10 rows.

How to Measure Impurity? - Example. Based on the data, we can compute the probability of each class. Since probability is equal to relative frequency, we have Prob(Bus) = 4/10 = 0.4, Prob(Car) = 3/10 = 0.3, Prob(Train) = 3/10 = 0.3. Observe that when computing the probabilities, we only focus on the classes, not on the attributes. Having the probability of each class, we are now ready to compute the quantitative impurity indices.

How to Measure Impurity? - Entropy. One way to measure the impurity degree is using entropy. Example: given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the entropy as: Entropy = -0.4*log2(0.4) - 0.3*log2(0.3) - 0.3*log2(0.3) = 1.571.

How to Measure Impurity? - Entropy. The entropy of a pure table (consisting of a single class) is zero because the probability is 1 and log2(1) = 0. Entropy reaches its maximum value when all classes in the table have equal probability, p = 1/n for n classes; in that case the maximum entropy is -n*p*log2(p) = log2(n). Notice that the value of entropy is larger than 1 if the number of classes is more than 2.

How to Measure Impurity? - Gini. Another way to measure the impurity degree is using the Gini index. Example: given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the Gini index as: Gini Index = 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660.

How to Measure Impurity? - Gini. The Gini index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 - 1^2 = 0. Similar to entropy, the Gini index also reaches its maximum value when all classes in the table have equal probability, p = 1/n. Notice that the value of the Gini index is always between 0 and 1 regardless of the number of classes.

How to Measure Impurity? - Misclassification Error. Still another way to measure the impurity degree. Example: given that Prob(Bus) = 0.4, Prob(Car) = 0.3, and Prob(Train) = 0.3, we can compute the index as: Misclassification Error = 1 - max{0.4, 0.3, 0.3} = 1 - 0.4 = 0.60.

How to Measure Impurity? - Misclassification Error. The misclassification error index of a pure table (consisting of a single class) is zero because the probability is 1 and 1 - max{1} = 0. The value of the misclassification error index is always between 0 and 1. In fact, the maximum Gini index for a given number of classes is always equal to the maximum misclassification error index: for n classes with each probability p = 1/n, the maximum Gini index is 1 - n*(1/n)^2 = 1 - 1/n, while the maximum misclassification error index is 1 - max{1/n} = 1 - 1/n.
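
A small Python sketch (not from the slides) that computes the three impurity measures from class counts and reproduces the 4B/3C/3T numbers above:

```python
# Entropy, Gini index, and misclassification error from class counts (illustrative).
import math

def impurities(counts):
    n = sum(counts)
    probs = [c / n for c in counts]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    gini = 1 - sum(p ** 2 for p in probs)
    misclass = 1 - max(probs)
    return entropy, gini, misclass

# 4 Bus, 3 Car, 3 Train -> entropy 1.571, Gini 0.660, misclassification error 0.600
print(impurities([4, 3, 3]))
```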

Information Gain. The reason for computing impurity degrees for both the data table D and the subset tables S_i is that we would like to compare the impurity before we split the table (i.e., of data table D) and after we split the table according to the values of an attribute i (i.e., of the subset tables S_i). The measure that compares this difference in impurity degrees is called information gain. We would like to know what our gain is if we split the data table based on some attribute's values.

Information Gain - Example. For example, for the parent table we can compute the degree of impurity based on Transportation Mode. In this case we have 4 Buses, 3 Cars, and 3 Trains (in short, 4B, 3C, 3T).

Information Gain - Example. For example, we split using the Travel Cost attribute and compute the degree of impurity of each subset.

Information Gain - Example. Information gain is computed as the impurity degree of the parent table minus the weighted sum of the impurity degrees of the subset tables, where the weight is based on the number of records for each attribute value. Suppose we use entropy as the measure of impurity degree; then: Information gain(i) = Entropy of parent table D - sum_k (n_k/n * Entropy of subset table S_i for value k). The information gain of the attribute Travel Cost per km is computed as 1.571 - (5/10 * 0.722 + 2/10 * 0 + 3/10 * 0) = 1.210.
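
A Python sketch (not from the slides) of this computation. The per-value class counts for Travel Cost per km are reconstructed from the weights and entropies quoted above (a subset of 5 records with classes split 4/1 and two pure subsets of 2 and 3 records), so treat those counts as an assumption:

```python
# Information gain of a split, using entropy as the impurity measure (illustrative).
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, subset_counts):
    n = sum(parent_counts)
    weighted = sum(sum(sub) / n * entropy(sub) for sub in subset_counts)
    return entropy(parent_counts) - weighted

# Parent table: 4 Bus, 3 Car, 3 Train. Travel Cost per km subsets (reconstructed):
# one subset with a 4/1 class split, and two pure subsets of sizes 2 and 3.
parent = [4, 3, 3]
travel_cost_subsets = [[4, 1], [2], [3]]
print(information_gain(parent, travel_cost_subsets))   # ~1.210
```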

Information Gain - Example. You can also compute information gain based on the Gini index or the classification error in the same way. The results are summarized in a table (not reproduced here).

Information Gain – Example. Split using the "Gender" attribute.

Information Gain - Example. Split using the "Car Ownership" attribute.

Information Gain - Example. Split using the "Income Level" attribute.

Information Gain - Example. The table summarizes the information gain for all four attributes. In practice, you don't need to compute the impurity degree with all three methods; you can use any one of entropy, Gini index, or classification error. Now we find the optimum attribute, the one that produces the maximum information gain (i* = argmax_i {information gain of attribute i}). In our case, Travel Cost per km produces the maximum information gain.

Information Gain - Example. So we split using the "Travel Cost per km" attribute, as this produces the maximum information gain.