Performance Measurement for Machine Learning.pptx

About This Presentation

Machine Learning


Slide Content

Performance Measurement Usman Khan

Confusion Matrix (1) • A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known • It allows the visualization of the performance of an algorithm

Confusion Matrix (2) It allows easy identification of confusion between classes, e.g. one class being commonly mislabeled as the other. Most performance measures are computed from the confusion matrix.

Confusion Matrix (3) A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values and broken down by each class. This is the key to the confusion matrix.

Confusion Matrix (4) The confusion matrix shows the ways in which your classification model is confused when it makes predictions • It gives us insight not only into the errors being made by a classifier but more importantly the types of errors that are being made

Confusion Matrix (5)

Confusion Matrix (6) Here, Class 1 : Positive Class 2 : Negative Definition of the Terms: • Positive (P) : Observation is positive (for example: is an apple). • Negative (N) : Observation is not positive (for example: is not an apple).

Confusion Matrix (7) • True Positive (TP) : Observation is positive, and is predicted to be positive • False Negative (FN) : Observation is positive, but is predicted negative. • True Negative (TN) : Observation is negative, and is predicted to be negative. • False Positive (FP) : Observation is negative, but is predicted positive.

Confusion Matrix (8) The total number of test samples is 165
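
As a sketch of how such a matrix is produced in practice, the snippet below builds a confusion matrix with scikit-learn; the label vectors are invented for illustration and are not the 165-sample test set referenced on the slide.

```python
# A minimal sketch of building a confusion matrix with scikit-learn.
# The label vectors below are invented for illustration; they are not
# the 165-sample data set referenced on the slide.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]   # ground-truth labels (1 = positive)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]   # model predictions

# Rows are actual classes, columns are predicted classes.
# With labels=[1, 0] the layout is [[TP, FN], [FP, TN]].
tp, fn, fp, tn = confusion_matrix(y_true, y_pred, labels=[1, 0]).ravel()
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```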

Classification Rate/Accuracy • Classification Rate or Accuracy is given by the relation: Accuracy = (TP + TN) / (TP + TN + FP + FN)
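
A minimal sketch of this relation in code; the TP/TN/FP/FN counts are assumptions chosen only so that they sum to the 165 test samples mentioned above, not values taken from the slide image.

```python
# Accuracy from confusion-matrix counts; the counts are hypothetical.
tp, tn, fp, fn = 100, 50, 10, 5           # illustrative counts (sum to 165)

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy = {accuracy:.3f}")       # 150 / 165 ~= 0.909
```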

Confusion Matrix (9)

Sensitivity and Specificity Sensitivity and specificity values can be used to quantify the performance of a case definition or the results of a diagnostic test. Even with a highly specific diagnostic test, if a disease is uncommon among those people tested, a large proportion of positive test results will be false positive, and the positive predictive value will be low.

Sensitivity and Specificity If the test is applied more selectively, such that the proportion of people tested who truly have the disease is greater, the test's predictive value will be improved. Thus, sensitivity and specificity are characteristics of the test, whereas predictive values depend both on test sensitivity and specificity and on the disease prevalence in the population in which the test is applied.
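
A quick numeric illustration of the prevalence effect described above; the 1% prevalence, 99% sensitivity and 95% specificity are invented values, used only to show why a rare disease yields a low positive predictive value.

```python
# Positive predictive value (PPV) when the disease is uncommon.
population = 100_000
prevalence = 0.01                          # disease is uncommon
sensitivity, specificity = 0.99, 0.95      # illustrative test characteristics

diseased = population * prevalence
healthy = population - diseased

tp = sensitivity * diseased                # diseased people who test positive
fp = (1 - specificity) * healthy           # healthy people who test positive
ppv = tp / (tp + fp)                       # positive predictive value
print(f"PPV at 1% prevalence: {ppv:.2f}")  # ~0.17: most positives are false
```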

Sensitivity and Specificity Sensitivity/Recall Sensitivity (Se) is defined as the proportion of individuals who truly have the condition that have a positive test result: Sensitivity = TP / (TP + FN).

Sensitivity and Specificity Specificity Specificity is defined as the proportion of individuals who truly do not have the condition that have a negative test result: Specificity = TN / (TN + FP).
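
A minimal sketch of both definitions computed from confusion-matrix counts; the counts themselves are hypothetical.

```python
# Sensitivity (recall) and specificity from confusion-matrix counts.
tp, fn, tn, fp = 80, 20, 90, 10           # illustrative counts

sensitivity = tp / (tp + fn)              # proportion of true positives detected
specificity = tn / (tn + fp)              # proportion of true negatives detected
print(f"Sensitivity = {sensitivity:.2f}, Specificity = {specificity:.2f}")
```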

Precision To get the value of precision we divide the total number of correctly classified positive examples by the total number of predicted positive examples: Precision = TP / (TP + FP). High precision indicates that an example labeled as positive is indeed positive (a small number of FP).

Precision: Precision is the fraction of true positive examples among the examples that the model classified as positive. In other words, the number of true positives divided by the number of true positives plus false positives.
Recall: Recall, also known as sensitivity, is the fraction of the positive examples that the model classified as positive. In other words, the number of true positives divided by the number of true positives plus false negatives.
TP: the number of true positives classified by the model.
FN: the number of false negatives classified by the model.
FP: the number of false positives classified by the model.
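
A minimal sketch of these two fractions computed from hypothetical TP, FP and FN counts.

```python
# Precision and recall from raw counts; the counts are hypothetical.
tp, fp, fn = 40, 10, 20                   # illustrative counts

precision = tp / (tp + fp)                # how many predicted positives are correct
recall = tp / (tp + fn)                   # how many actual positives were found
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```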

F1 Score The F-score, also called the F1-score, is a measure of a model’s accuracy on a dataset. It is used to evaluate binary classification systems, which classify examples into ‘positive’ or ‘negative’. The F-score is a way of combining the precision and recall of the model, and it is defined as the harmonic mean of the model’s precision and recall: F1 = 2 * (Precision * Recall) / (Precision + Recall).

Calculating F-score Let us imagine we have a tree with ten apples on it. Seven are ripe and three are still unripe, but we do not know which one is which. We have an AI which is trained to recognize which apples are ripe for picking, and pick all the ripe apples and no unripe apples. We would like to calculate the F-score, and we consider both precision and recall to be equally important, so we use the F1-score.

The AI picks five ripe apples but also picks one unripe apple.

Confusion Matrix for Model 1

           Ripe   Unripe
Picked       5       1
Unpicked     2       2

Precision and Recall for Model 1: Precision = 0.83, Recall = 0.71, F1 Score = 0.77

Confusion Matrix for Model 2

           Ripe   Unripe
Picked       4       1
Unpicked     2       3

Precision and Recall for Model 2: Precision = 0.8, Recall = 0.67, F1 Score = 0.73
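
A short sketch that reproduces the numbers above directly from the two confusion matrices, treating ripe apples as the positive class; the helper function prf is introduced here only for illustration.

```python
# Reproduce the apple example for both models from their confusion matrices.
def prf(tp, fp, fn):
    """Return precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Model 1: picked 5 ripe and 1 unripe apple, left 2 ripe apples unpicked.
print("Model 1: P=%.2f R=%.2f F1=%.2f" % prf(tp=5, fp=1, fn=2))
# Model 2: picked 4 ripe and 1 unripe apple, left 2 ripe apples unpicked.
print("Model 2: P=%.2f R=%.2f F1=%.2f" % prf(tp=4, fp=1, fn=2))
```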

Conclusion High recall, low precision: This means that most of the positive examples are correctly recognized (low FN) but there are a lot of false positives. Low recall, high precision: This shows that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP)

F-score vs Accuracy There are a number of metrics which can be used to evaluate a binary classification model, and accuracy is one of the simplest to understand. Accuracy is defined as simply the number of correctly categorized examples divided by the total number of examples. Accuracy can be useful but does not take into account the subtleties of class imbalances, or differing costs of false negatives and false positives. The F1-score is useful where there are either differing costs of false positives or false negatives, or where there is a large class imbalance, such as if 10% of apples on trees tend to be unripe. In this case the accuracy would be misleading, since a classifier that classifies all apples as ripe would automatically get 90% accuracy but would be useless for real-life applications. Accuracy has the advantage that it is very easily interpretable, but the disadvantage that it is not robust when the data is unevenly distributed, or where there is a higher cost associated with a particular type of error.

Mean Absolute Error or MAE We know that an error is basically the absolute difference between the actual (true) value and the predicted value. Absolute difference means that if the result has a negative sign, it is ignored. Hence, Error = |True value – Predicted value|. MAE takes the average of this error over every sample in the dataset: MAE = (1/n) * Σ |True value_i – Predicted value_i|.

Mean Squared Error or MSE MSE is calculated by taking the average of the square of the difference between the original and predicted values of the data. Hence, MSE = (1/n) * Σ (True value_i – Predicted value_i)².

Root Mean Squared Error or RMSE RMSE is the square root of the Mean Squared Error: RMSE = sqrt(MSE).

R Squared R² (the coefficient of determination) measures the proportion of the variance in the true values that is explained by the model: R² = 1 – (sum of squared residuals / total sum of squares). A value of 1 means a perfect fit.
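
A minimal sketch of the four regression metrics above using scikit-learn; the two arrays are invented example values, not data from the slides.

```python
# MAE, MSE, RMSE and R^2 on a small invented regression example.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.5, 5.0, 3.0, 8.0, 4.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # RMSE is just the square root of MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R^2={r2:.3f}")
```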

Where to use which Metric to determine the Performance of a Machine Learning Model?

MAE: It is not very sensitive to outliers in comparison to MSE since it doesn't punish huge errors (see the sketch after this list). It is usually used when the performance is measured on continuous variable data. It gives a linear value, which averages the weighted individual differences equally. The lower the value, the better the model's performance.

MSE: It is one of the most commonly used metrics, but least useful when a single bad prediction would ruin the entire model's predicting abilities, i.e. when the dataset contains a lot of noise. It is most useful when the dataset contains outliers, or unexpected values (too high or too low values).

RMSE: In RMSE, the errors are squared before they are averaged. This basically implies that RMSE assigns a higher weight to larger errors. This indicates that RMSE is much more useful when large errors are present and they drastically affect the model's performance. It avoids taking the absolute value of the error and this trait is useful in many mathematical calculations. In this metric also, the lower the value, the better the model's performance.
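
To illustrate the point about outliers, the short sketch below compares MAE and RMSE on invented predictions with and without a single very bad value; RMSE inflates far more than MAE.

```python
# One large error (an outlier) inflates RMSE much more than MAE.
import numpy as np

y_true = np.array([10.0, 10.0, 10.0, 10.0, 10.0])
y_clean = np.array([11.0, 9.0, 10.5, 9.5, 10.0])     # small errors only
y_outlier = np.array([11.0, 9.0, 10.5, 9.5, 30.0])   # one very bad prediction

for name, y_pred in [("clean", y_clean), ("with outlier", y_outlier)]:
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    print(f"{name:>12}: MAE={mae:.2f} RMSE={rmse:.2f}")
```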

Cross Validation Usman Khan

Cross Validation (1) The basic idea in machine learning is to not use the entire data set when training a learner: some of the data is removed before training begins. Then, when training is done, the data that was removed can be used to test the performance of the learned model on "new" data. This is the basic idea for a whole class of model evaluation methods called cross validation.

Cross Validation (2) • A method of estimating the expected prediction error • Helps in selecting the best-fit model • Helps in ensuring the model is not overfit

Cross Validation (3) 1) Holdout method 2) K-Fold CV 3) Leave one out CV 4) Bootstrap methods

Holdout method The holdout cross validation method is the simplest of all. In this method, you randomly assign data points to two sets, a training set and a test set. The relative sizes of the two sets are arbitrary.
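
A minimal sketch of the holdout method with scikit-learn's train_test_split; the data set (iris), classifier and 70/30 split are illustrative choices, not prescribed by the slides.

```python
# Holdout evaluation: train on one random subset, test on the held-out rest.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples as a test set the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```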

K-FOLD K-fold cross validation is one way to improve over the holdout method. The data set is divided into k subsets and the holdout method is repeated k times Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set

K - FOLD Disadvantages ??? Stratified K-Fold
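
A minimal sketch of k-fold and stratified k-fold cross validation with scikit-learn; k = 5, the data set and the classifier are illustrative choices.

```python
# K-fold vs stratified k-fold cross validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Plain k-fold: each of the 5 folds serves once as the test set.
kf_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified k-fold keeps the class proportions the same in every fold,
# which matters when the classes are imbalanced.
skf_scores = cross_val_score(
    model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("K-fold mean accuracy:", kf_scores.mean())
print("Stratified k-fold mean accuracy:", skf_scores.mean())
```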

Leave one out CV (1) Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that N separate times, the function approximator is trained on all the data except for one point and a prediction is made for that point. As before, the average error is computed and used to evaluate the model.

Leave one out CV (2) Specific case of K-fold validation

Leave one out CV (3) Disadvantages ???
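
A minimal sketch of leave-one-out cross validation with scikit-learn; the data set and classifier are illustrative.

```python
# Leave-one-out cross validation (LOOCV).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# One fit per sample: with N=150 points this trains the model 150 times,
# which is the main practical disadvantage of LOOCV.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", scores.mean())
```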

Bootstrap (1) • Randomly draw datasets, with replacement, from the training sample • Each bootstrap sample is the same size as the training sample • Refit the model on each bootstrap sample • Examine the model

Bootstrap (2)
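
A minimal sketch of the bootstrap procedure above; scoring each refit model on its out-of-bag points is one common way to "examine the model", and that detail is an assumption that goes beyond the slide.

```python
# Bootstrap resampling: refit on resampled data, score on out-of-bag points.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
scores = []
rng = np.random.RandomState(0)

for _ in range(100):
    # Draw a bootstrap sample: same size as the data, sampled with replacement.
    idx = resample(np.arange(len(X)), replace=True, random_state=rng)
    oob = np.setdiff1d(np.arange(len(X)), idx)      # points not drawn this round
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))      # evaluate on out-of-bag data

print("Bootstrap (out-of-bag) mean accuracy:", np.mean(scores))
```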