k-Nearest Neighbor (KNN)
• KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure.
• Lazy learning algorithm: no model is built until a query record has to be classified.
• Non-parametric learning algorithm: it does not assume anything about the underlying data distribution.
KNN Classification
• A lazy approach to classification.
• Uses the entire training set to perform classification.
• Uses distances between training and test records.
KNN Example
• The test sample (green circle) should be classified either into the first class (blue squares) or into the second class (red triangles).
• If k = 3, it is assigned to the second class, because there are 2 triangles and only 1 square inside the inner circle.
• If k = 5, it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle). A minimal voting sketch follows below.
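To make the voting concrete, here is a minimal Python sketch of KNN majority voting over Euclidean distances. The toy points, labels, and the helper names euclidean/knn_predict are illustrative assumptions, not part of the slides; they are chosen so that k = 3 and k = 5 give different answers, as in the figure.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by a majority vote among its k nearest training records."""
    # Sort training records by distance to the query point.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    # Take the class labels of the k closest records and vote.
    top_k_labels = [label for _, label in neighbors[:k]]
    return Counter(top_k_labels).most_common(1)[0][0]

# Toy data loosely mirroring the figure: squares vs. triangles around a query point.
train_X = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.2), (0.9, 1.3), (3.1, 2.9)]
train_y = ["triangle", "triangle", "square", "square", "square"]
query = (1.5, 1.5)

print(knn_predict(train_X, train_y, query, k=3))  # "triangle": 2 triangles vs. 1 square
print(knn_predict(train_X, train_y, query, k=5))  # "square": 3 squares vs. 2 triangles
```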
Classification Example
• An unknown instance is classified based on the class of its nearest instance(s).
Similarity Measures
• Euclidean distance (square root of the sum of squared differences):
  d_Euclidean(x, y) = sqrt( Σ_i (x_i − y_i)² )
• Manhattan (city-block) distance (sum of absolute differences):
  d_cityblock(x, y) = Σ_i |x_i − y_i|
• where x = (x_1, x_2, ..., x_m) and y = (y_1, y_2, ..., y_m) represent the m attribute values of two records (both distances are computed in the sketch below).
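A small sketch of both distance measures on one pair of records; the attribute values (age, height, weight) are assumed for illustration only.

```python
def euclidean_distance(x, y):
    """Square root of the sum of squared attribute differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def cityblock_distance(x, y):
    """Sum of absolute attribute differences (Manhattan distance)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

x = (50, 1.75, 120)   # age (years), height (m), weight (lb) -- illustrative values only
y = (20, 1.60, 150)

print(euclidean_distance(x, y))   # ~42.43
print(cityblock_distance(x, y))   # 60.15
```

Note how the large-scale attributes (age in years, weight in lb) dominate the small height difference, which motivates the scaling discussion on the next slide.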
KNN Scaling Issues
• Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
• Example:
  - height of a person may vary from 1.5 m to 1.8 m
  - weight of a person may vary from 90 lb to 300 lb
  - income of a person may vary from $10K to $1M
Scaling (continued)
• Normalize attributes if their scales differ.
• For continuous variables, min–max normalization or Z-score standardization may be used (see the sketch after this slide):
  - Min–max normalization: X* = (X − min(X)) / Range(X) = (X − min(X)) / (max(X) − min(X))
  - Z-score standardization: X* = (X − mean(X)) / SD(X)
• For categorical variables: different(x_i, y_i) = 0 if x_i = y_i, 1 otherwise.
  - Example: if x = male and y = female, then distance(x, y) = 1.
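A minimal sketch of the two normalizations and the categorical distance; the income mean and standard deviation used below are invented illustrative values, not from the slides.

```python
def min_max_normalize(x, x_min, x_max):
    """Rescale x into [0, 1] using the attribute's observed minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

def z_score_standardize(x, mean, sd):
    """Express x as the number of standard deviations above/below the mean."""
    return (x - mean) / sd

def categorical_distance(a, b):
    """0 if the categorical values match, 1 otherwise."""
    return 0 if a == b else 1

# Illustrative income attribute ranging from $10K to $1M.
print(min_max_normalize(55_000, 10_000, 1_000_000))   # ~0.045
print(z_score_standardize(55_000, 120_000, 90_000))   # ~-0.72  (assumed mean and SD)
print(categorical_distance("male", "female"))          # 1
```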
Example of K-NN Classification
• [Training data and test data tables; example from the reference book: Tan, Steinbach, Kumar.]
Exercise 1
Table 1: Variable values for Age and Gender for the given data set

  Record   Age   Gender
  A        50    Male
  B        20    Male
  C        50    Female

• Which record is more similar to A: B or C? Give comments.
Exercise 2
Table 2: Variable values for Age and Gender for the given data. Agemmn = min–max-normalized age; AgeZs = Z-score-standardized age.

  Record   Age   Agemmn   AgeZs   Gender
  A        50    ?        ?       Male
  B        20    ?        ?       Male
  C        50    ?        ?       Female

• Assume that for the variable Age the minimum is 10, the range is 50, the mean is 45, and the standard deviation is 15.
• Compute and fill in the table: Agemmn and AgeZs, using min–max normalization and Z-score standardization, respectively.
• Which record is more similar to A: B or C, using Agemmn and using AgeZs? First compute the Euclidean distance for both Agemmn and AgeZs.
• Compare your answer with the Exercise 1 (Table 1) question and give comments.
Next: Rule-based Methods
Rule-based Classifier
• Based on the reference book (Tan, Steinbach, Kumar) and partly on the textbook (J. Han).
Rule-based Classifier
• Classifies records by using a collection of "if…then…" rules.
• Rule: (Condition) → y
  - where Condition is a conjunction of attribute tests and y is the class label
  - LHS: rule antecedent or condition
  - RHS: rule consequent
• Examples of classification rules:
  - (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
  - (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
Application of a Rule-Based Classifier
• A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
  - R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
  - R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
  - R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
  - R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
  - R5: (Live in Water = sometimes) → Amphibians
• The rule R1 covers a hawk => Bird.
• The rule R3 covers the grizzly bear => Mammal.
Rule Coverage and Accuracy
• Coverage of a rule: fraction of records that satisfy the antecedent of the rule.
• Accuracy of a rule: fraction of records that satisfy the antecedent that also satisfy the consequent of the rule.
  - n_covers = number of tuples covered by R
  - n_correct = number of tuples correctly classified by R
  - coverage(R) = n_covers / |D|   (D: training data set)
  - accuracy(R) = n_correct / n_covers
• Example: R1: (Status = Single) → No, with Coverage = 40%, Accuracy = 50% (a short coverage/accuracy computation follows).
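A minimal sketch of coverage and accuracy; the five training records below are invented so that the rule (Status = Single) → No reproduces the slide's 40% coverage and 50% accuracy.

```python
def rule_covers(rule_condition, record):
    """A rule covers a record if every attribute test in its condition is satisfied."""
    return all(record.get(attr) == value for attr, value in rule_condition.items())

def coverage_and_accuracy(rule_condition, rule_class, dataset):
    covered = [r for r in dataset if rule_covers(rule_condition, r)]
    correct = [r for r in covered if r["class"] == rule_class]
    coverage = len(covered) / len(dataset)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Illustrative training data D and the rule (Status = Single) -> No.
D = [
    {"Status": "Single",   "class": "No"},
    {"Status": "Single",   "class": "Yes"},
    {"Status": "Married",  "class": "No"},
    {"Status": "Married",  "class": "No"},
    {"Status": "Divorced", "class": "Yes"},
]
print(coverage_and_accuracy({"Status": "Single"}, "No", D))  # (0.4, 0.5)
```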
Rule Extraction from a Decision Tree
• Rules are easier to understand than large trees.
• One rule is created for each path from the root to a leaf.
• Each attribute–value pair along a path forms a conjunct; the leaf holds the class prediction.
• [Decision-tree figure: the root splits on age (<=30, 31..40, >40), then on student and credit_rating; leaves are yes/no.]
• Example: rule extraction from our buys_computer decision tree (shown as code below):
  - IF age = young AND student = no THEN buys_computer = no
  - IF age = young AND student = yes THEN buys_computer = yes
  - IF age = mid-age THEN buys_computer = yes
  - IF age = old AND credit_rating = excellent THEN buys_computer = no
  - IF age = old AND credit_rating = fair THEN buys_computer = yes
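As a sketch, the five extracted rules can be applied directly as an if/elif chain; the function name buys_computer and the string encodings of the attribute values are assumptions made for illustration.

```python
def buys_computer(age, student, credit_rating):
    """Apply the five rules extracted from the buys_computer decision tree.

    age is one of "young" (<=30), "mid-age" (31..40), "old" (>40);
    student is "yes"/"no"; credit_rating is "fair"/"excellent".
    """
    if age == "young" and student == "no":
        return "no"
    if age == "young" and student == "yes":
        return "yes"
    if age == "mid-age":
        return "yes"
    if age == "old" and credit_rating == "excellent":
        return "no"
    if age == "old" and credit_rating == "fair":
        return "yes"
    return None  # no rule fires (should not happen for valid inputs)

print(buys_computer("young", "yes", "fair"))     # yes
print(buys_computer("old", "no", "excellent"))   # no
```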
Ensemble Methods: Increasing the Accuracy
Ensemble Methods: Increasing the Accuracy
• Ensemble methods:
  - Use a combination of models to increase accuracy.
  - Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*.
• Popular ensemble methods:
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
  - Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
• Analogy: diagnosis based on multiple doctors' majority vote.
• Training:
  - Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample).
  - A classifier model Mi is learned from each training set Di.
• Classification: to classify an unknown sample X:
  - Each classifier Mi returns its class prediction.
  - The bagged classifier M* counts the votes and assigns the class with the most votes to X (see the bagging sketch below).
• Prediction: can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple.
• Accuracy:
  - Often significantly better than a single classifier derived from D.
  - For noisy data: not considerably worse, and more robust.
  - Proven to improve accuracy in prediction.
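A sketch of the bagging mechanics, using scikit-learn decision trees as the base learners (scikit-learn also ships a ready-made BaggingClassifier; this manual version only illustrates the idea). The tiny data set is invented.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k, rng):
    """Learn k models, each from a bootstrap sample of the d training tuples."""
    d = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)          # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    """The bagged classifier M* assigns the class with the most votes."""
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]

# Tiny illustrative data set (invented for the sketch).
rng = np.random.default_rng(0)
X = np.array([[1.0], [1.2], [0.8], [3.0], [3.3], [2.9]])
y = np.array(["A", "A", "A", "B", "B", "B"])
models = bagging_fit(X, y, k=10, rng=rng)
print(bagging_predict(models, [1.1]))  # expected "A"
```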
Boosting
• Analogy: consult several doctors and combine their weighted diagnoses, with weights assigned based on previous diagnostic accuracy.
• How boosting works:
  - Weights are assigned to each training tuple.
  - A series of k classifiers is iteratively learned.
  - After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi.
  - The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy.
• The boosting algorithm can be extended to numeric prediction.
• Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data.
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples (X1, y1), …, (Xd, yd).
• Initially, all tuple weights are set to the same value (1/d).
• Generate k classifiers in k rounds. At round i:
  - Tuples from D are sampled (with replacement) to form a training set Di of the same size.
  - Each tuple's chance of being selected is based on its weight.
  - A classification model Mi is derived from Di.
  - Its error rate is calculated using Di as a test set.
  - If a tuple is misclassified, its weight is increased; otherwise it is decreased.
• Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). The error rate of classifier Mi is the weighted sum over the misclassified tuples:
  error(Mi) = Σ_j w_j × err(Xj)
• The weight of classifier Mi's vote is
  log( (1 − error(Mi)) / error(Mi) )
  (a simplified code sketch follows).
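A simplified sketch of the AdaBoost loop described above, using decision stumps from scikit-learn as base learners. It follows the slide's weight-update idea only loosely; for example, the clamping of a zero error rate and the evaluation of err(Xj) on the full set D are implementation conveniences, not part of the textbook algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k, rng):
    d = len(X)
    w = np.full(d, 1.0 / d)                              # all tuple weights start at 1/d
    models, alphas = [], []
    for _ in range(k):
        idx = rng.choice(d, size=d, replace=True, p=w)   # weight-driven bootstrap sample
        m = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        err = (m.predict(X) != y).astype(float)          # err(Xj): 1 if Xj is misclassified
        error = float(np.sum(w * err))                   # error(Mi) = sum_j w_j * err(Xj)
        if error >= 0.5:                                 # too weak: discard this round
            continue
        error = max(error, 1e-10)                        # guard against a perfect round
        # Decrease the weights of correctly classified tuples and renormalize,
        # which relatively increases the weights of misclassified tuples.
        w = w * np.where(err == 0.0, error / (1.0 - error), 1.0)
        w = w / w.sum()
        models.append(m)
        alphas.append(np.log((1.0 - error) / error))     # weight of Mi's vote
    return models, alphas

def adaboost_predict(models, alphas, x):
    scores = {}
    for m, alpha in zip(models, alphas):
        label = m.predict([x])[0]
        scores[label] = scores.get(label, 0.0) + alpha   # weighted vote per class
    return max(scores, key=scores.get)

rng = np.random.default_rng(1)
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array(["neg", "neg", "neg", "pos", "pos", "pos"])
models, alphas = adaboost_fit(X, y, k=5, rng=rng)
print(adaboost_predict(models, alphas, [3.2]))           # expected "pos"
```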
Random Forest (Breiman, 2001)
• Random forest:
  - Each classifier in the ensemble is a decision tree, generated using a random selection of attributes at each node to determine the split.
  - During classification, each tree votes and the most popular class is returned.
• Two methods to construct a random forest:
  - Forest-RI (random input selection): randomly select, at each node, F attributes as candidates for the split at that node; the CART methodology is used to grow the trees to maximum size (a scikit-learn sketch follows).
  - Forest-RC (random linear combinations): creates new attributes (features) that are linear combinations of the existing attributes, which reduces the correlation between individual classifiers.
• Comparable in accuracy to Adaboost, but more robust to errors and outliers.
• Insensitive to the number of attributes selected for consideration at each split, and faster than bagging or boosting.
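A brief Forest-RI-style sketch with scikit-learn, where the max_features parameter plays the role of F, the number of randomly selected candidate attributes per split; the data set is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data with 4 attributes; the class depends mainly on the first attribute.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# 50 trees, each split choosing the best of F = 2 randomly selected attributes.
forest = RandomForestClassifier(n_estimators=50, max_features=2, random_state=0)
forest.fit(X, y)

# Each tree votes; the most popular class is returned.
print(forest.predict([[1.5, 0.0, 0.0, 0.0]]))   # expected [1]
print(forest.predict([[-1.5, 0.0, 0.0, 0.0]]))  # expected [0]
```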
Improving Classification Accuracy on Class-Imbalanced Data Sets
• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud detection, oil spills, faults, etc.
• Traditional methods assume a balanced class distribution and equal error costs, so they are not suitable for class-imbalanced data.
• Typical methods for imbalanced data in two-class classification (a re-sampling sketch follows):
  - Oversampling: re-sampling of data from the positive class
  - Under-sampling: randomly eliminating tuples from the negative class
  - Threshold moving: moves the decision threshold t so that the rare-class tuples are easier to classify, hence reducing the chance of costly false-negative errors
  - Ensemble techniques: combine multiple classifiers, as introduced above
• The class-imbalance problem remains difficult for multiclass tasks.
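A minimal sketch of oversampling and under-sampling on an invented imbalanced data set (5 positive vs. 95 negative tuples); the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def oversample_positive(X, y, positive_label):
    """Re-sample (with replacement) positive tuples until the classes are balanced."""
    pos = np.where(y == positive_label)[0]
    neg = np.where(y != positive_label)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    return X[idx], y[idx]

def undersample_negative(X, y, positive_label):
    """Randomly eliminate negative tuples until the classes are balanced."""
    pos = np.where(y == positive_label)[0]
    neg = np.where(y != positive_label)[0]
    kept_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, kept_neg])
    return X[idx], y[idx]

# Imbalanced toy data: 5 positive vs. 95 negative tuples.
X = rng.normal(size=(100, 2))
y = np.array([1] * 5 + [0] * 95)
X_over, y_over = oversample_positive(X, y, positive_label=1)
X_under, y_under = undersample_negative(X, y, positive_label=1)
print(np.bincount(y_over), np.bincount(y_under))   # [95 95] and [5 5]
```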
Next: Classification (ANN and evaluation of models); 4. Clustering