Confusion Matrix (1) • A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known • It allows the visualization of the performance of an algorithm
Confusion Matrix (2) It allows easy identification of confusion between classes, e.g., one class being commonly mislabeled as another. Most performance measures are computed from the confusion matrix.
Confusion Matrix (3) A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions is summarized with count values, broken down by class. This is the key idea behind the confusion matrix.
Confusion Matrix (4) The confusion matrix shows the ways in which your classification model is confused when it makes predictions • It gives us insight not only into the errors being made by a classifier but, more importantly, into the types of errors that are being made.
Confusion Matrix (5) [Figure: layout of the 2×2 confusion matrix referenced on the next slide]
Confusion Matrix (6) Here, Class 1: Positive, Class 2: Negative. Definition of the terms: • Positive (P): Observation is positive (for example: it is an apple). • Negative (N): Observation is not positive (for example: it is not an apple).
Confusion Matrix (7) • True Positive (TP): Observation is positive and is predicted positive. • False Negative (FN): Observation is positive but is predicted negative. • True Negative (TN): Observation is negative and is predicted negative. • False Positive (FP): Observation is negative but is predicted positive.
Confusion Matrix (8) The total number of test samples is 165.
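To make the four counts concrete, here is a minimal sketch of building a 2×2 confusion matrix with scikit-learn (a library choice assumed here, not named in the slides); the label vectors `y_true` and `y_pred` are invented placeholders, not the 165-sample example above.

```python
# Minimal sketch: building a 2x2 confusion matrix with scikit-learn.
# y_true and y_pred are illustrative placeholders, not the 165-sample example.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels (1 = positive, 0 = negative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # labels predicted by the classifier

# With labels=[1, 0] the matrix is laid out as:
# [[TP, FN],
#  [FP, TN]]
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```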
Classification Rate/Accuracy • Classification Rate or Accuracy is given by the relation: Accuracy = (TP + TN) / (TP + TN + FP + FN)
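As a quick illustration, accuracy can be computed directly from the four counts of the made-up example above (a sketch, not the slides' 165-sample figures):

```python
# Sketch: accuracy from the confusion-matrix counts of the previous snippet.
tp, tn, fp, fn = 4, 4, 1, 1   # counts from the illustrative example above

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy = {accuracy:.2f}")   # 0.80 for these made-up counts
```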
Sensitivity and Specificity Sensitivity and specificity values can be used to quantify the performance of a case definition or the results of a diagnostic test. Even with a highly specific diagnostic test, if a disease is uncommon among those people tested, a large proportion of positive test results will be false positive, and the positive predictive value will be low.
Sensitivity and Specificity If the test is applied more selectively, such that the proportion of people tested who truly have the disease is greater, the test's predictive value will be improved. Thus, sensitivity and specificity are characteristics of the test, whereas predictive values depend both on test sensitivity and specificity and on the disease prevalence in the population in which the test is applied.
Sensitivity and Specificity Sensitivity/Recall Sensitivity (Se) is defined as the proportion of truly positive individuals (e.g., those who have the disease) that have a positive test result: Se = TP / (TP + FN).
Sensitivity and Specificity Specificity Specificity (Sp) is defined as the proportion of truly negative individuals (e.g., those without the disease) that have a negative test result: Sp = TN / (TN + FP).
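A minimal sketch computing both quantities from confusion-matrix counts; the counts are invented placeholders, not values from the slides.

```python
# Sketch: sensitivity (recall) and specificity from confusion-matrix counts.
# The counts are illustrative placeholders.
tp, fn, tn, fp = 90, 10, 85, 15

sensitivity = tp / (tp + fn)   # proportion of true positives that test positive
specificity = tn / (tn + fp)   # proportion of true negatives that test negative
print(f"Sensitivity = {sensitivity:.2f}, Specificity = {specificity:.2f}")
```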
Precision To get the value of precision we divide the total number of correctly classified positive examples by the total number of predicted positive examples: Precision = TP / (TP + FP). High precision indicates that an example labeled as positive is indeed positive (a small number of FPs).
Precision and Recall • Precision: the fraction of true positive examples among the examples that the model classified as positive; in other words, the number of true positives divided by the number of true positives plus false positives. • Recall: also known as sensitivity, the fraction of examples classified as positive among the total number of positive examples; in other words, the number of true positives divided by the number of true positives plus false negatives. • TP: the number of true positives classified by the model. • FN: the number of false negatives classified by the model. • FP: the number of false positives classified by the model.
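A minimal sketch with scikit-learn, reusing the illustrative label vectors from the confusion-matrix example above (again invented data, not from the slides):

```python
# Sketch: precision and recall with scikit-learn (placeholder labels).
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
print(f"Precision = {precision:.2f}, Recall = {recall:.2f}")
```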
F1 Score The F-score, also called the F1-score, is a measure of a model’s accuracy on a dataset. It is used to evaluate binary classification systems, which classify examples into ‘positive’ or ‘negative’. The F-score combines the precision and recall of the model and is defined as their harmonic mean: F1 = 2 · (Precision · Recall) / (Precision + Recall).
Calculating F-score Let us imagine we have a tree with ten apples on it. Seven are ripe and three are still unripe, but we do not know which one is which. We have an AI which is trained to recognize which apples are ripe for picking, and pick all the ripe apples and no unripe apples. We would like to calculate the F-score, and we consider both precision and recall to be equally important, so we use the F1-score.
The AI picks five ripe apples but also picks one unripe apple.
Confusion Matrix for Model 1
           Ripe   Unripe
Picked       5       1
Unpicked     2       2
Precision and Recall for Model 1 Precision = 0.83, Recall = 0.71, F1 Score = 0.77
Confusion Matrix for Model 2
           Ripe   Unripe
Picked       4       1
Unpicked     2       3
Precision and Recall for Model 2 Precision = 0.80, Recall = 0.67, F1 Score = 0.73
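The numbers above can be checked with a short helper; the counts are read off the two confusion matrices (ripe = positive class, picked = predicted positive). This is a sketch for verification only.

```python
# Sketch: verifying the precision/recall/F1 numbers for the two apple-picking models.
def prf(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Model 1: picked 5 ripe (TP) and 1 unripe (FP); left 2 ripe unpicked (FN).
print("Model 1: P=%.2f R=%.2f F1=%.2f" % prf(tp=5, fp=1, fn=2))
# Model 2: picked 4 ripe (TP) and 1 unripe (FP); left 2 ripe unpicked (FN).
print("Model 2: P=%.2f R=%.2f F1=%.2f" % prf(tp=4, fp=1, fn=2))
```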
Conclusion • High recall, low precision: This means that most of the positive examples are correctly recognized (low FN) but there are a lot of false positives. • Low recall, high precision: This shows that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP).
F-score vs Accuracy There are a number of metrics which can be used to evaluate a binary classification model, and accuracy is one of the simplest to understand. Accuracy is defined as the number of correctly classified examples divided by the total number of examples. Accuracy can be useful, but it does not take into account the subtleties of class imbalance or the differing costs of false negatives and false positives. The F1-score is useful where false positives and false negatives have different costs, or where there is a large class imbalance, for example if only 10% of apples on trees tend to be unripe. In that case accuracy would be misleading, since a classifier that labels all apples as ripe would automatically get 90% accuracy but would be useless for real-life applications. Accuracy has the advantage of being very easy to interpret, but the disadvantage that it is not robust when the data is unevenly distributed or when one type of error carries a higher cost.
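To make the 90% point concrete, a minimal sketch with scikit-learn; the label vectors are invented for illustration and mirror the 10%-unripe example above.

```python
# Sketch: why accuracy can mislead under class imbalance (made-up labels).
from sklearn.metrics import accuracy_score, f1_score

# 90 ripe (1) and 10 unripe (0) apples; the "classifier" calls everything ripe.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))   # 0.90, looks good
# Treating the rare class (unripe) as positive exposes the problem: F1 = 0.0
# (sklearn warns about an undefined precision and returns 0).
print("F1 for the unripe class:", f1_score(y_true, y_pred, pos_label=0))
```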
Mean Absolute Error or MAE An error is basically the absolute difference between the actual (true) value and the predicted value. Absolute difference means that if the result has a negative sign, it is ignored. Hence, the per-sample error is |true value − predicted value|, and MAE = (1/n) Σ |yᵢ − ŷᵢ|; that is, MAE averages this absolute error over every sample in the dataset.
Mean Squared Error or MSE MSE is calculated by taking the average of the squared difference between the original and predicted values of the data. Hence, MSE = (1/n) Σ (yᵢ − ŷᵢ)².
Root Mean Squared Error or RMSE RMSE is the square root of the MSE: RMSE = √((1/n) Σ (yᵢ − ŷᵢ)²). Taking the square root expresses the error in the same units as the target variable.
R Squared R² (the coefficient of determination) is the proportion of the variance in the target variable that is explained by the model: R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)², where ȳ is the mean of the true values. The closer R² is to 1, the better the fit.
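A minimal sketch computing all four regression metrics with scikit-learn and NumPy; the true and predicted values are made up for illustration.

```python
# Sketch: MAE, MSE, RMSE and R^2 with scikit-learn / NumPy (made-up values).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # square root of the MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```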
Where to use which Metric to determine the Performance of a Machine Learning Model? • MAE: Not very sensitive to outliers compared with MSE, since it does not punish huge errors. It is usually used when performance is measured on continuous variable data. It gives a linear value that weights all individual differences equally. The lower the value, the better the model's performance. • MSE: One of the most commonly used metrics, but least useful when a single bad prediction would ruin the entire model's predictive ability, i.e., when the dataset contains a lot of noise. It is most useful when the dataset contains outliers or unexpected values (values that are too high or too low). • RMSE: The errors are squared before they are averaged, so RMSE assigns a higher weight to larger errors. This makes RMSE more useful when large errors are present and drastically affect the model's performance. It avoids taking the absolute value of the error, which is useful in many mathematical calculations. For this metric too, the lower the value, the better the model's performance.
Cross Validation Usman Khan
Cross Validation (1) A basic idea in machine learning is to not use the entire data set when training a learner: some of the data is removed before training begins. Then, when training is done, the data that was removed can be used to test the performance of the learned model on “new” data. This is the basic idea for a whole class of model evaluation methods called cross validation.
Cross Validation (2) • A method of estimating the expected prediction error • Helps select the best-fitting model • Helps ensure the model is not overfit
Cross Validation (3) 1) Holdout method 2) K-fold CV 3) Leave-one-out CV 4) Bootstrap methods
Holdout method The holdout cross validation method is the simplest of all. In this method, you randomly assign data points to two sets, a training set and a test set; the sizes of the two sets can be chosen freely, although the test set is typically the smaller one.
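A minimal sketch of the holdout method using scikit-learn's `train_test_split`; the data, the 70/30 split and the logistic-regression model are all assumptions made for illustration.

```python
# Sketch of the holdout method with scikit-learn (random data as a stand-in).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 4)              # 100 samples, 4 features (made up)
y = np.random.randint(0, 2, size=100)   # binary labels (made up)

# Hold out 30% of the data for testing; train only on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```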
K-Fold K-fold cross validation is one way to improve on the holdout method. The data set is divided into k subsets and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k−1 subsets are put together to form the training set; the error is then averaged over the k trials.
K-Fold Disadvantages? (e.g., the learning algorithm must be retrained k times, which can be computationally expensive.) Stratified K-Fold: a variant in which each fold preserves the class proportions of the full dataset.
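A minimal sketch of both variants with scikit-learn; the data, the choice of k = 5 and the logistic-regression model are illustrative assumptions.

```python
# Sketch: k-fold and stratified k-fold cross validation with scikit-learn.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.rand(100, 4)              # made-up data
y = np.random.randint(0, 2, size=100)

model = LogisticRegression()

# Plain k-fold: 5 splits, each sample lands in the test fold exactly once.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
print("K-fold scores:", cross_val_score(model, X, y, cv=kf))

# Stratified k-fold: each fold keeps (roughly) the original class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("Stratified scores:", cross_val_score(model, X, y, cv=skf))
```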
Leave one out CV (1) Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to N, the number of data points in the set. That means that, N separate times, the function approximator is trained on all the data except for one point, and a prediction is made for that point. As before, the average error is computed and used to evaluate the model.
Leave one out CV (2) A special case of K-fold cross validation, with K = N.
Leave one out CV (3) Disadvantages? (e.g., the model must be trained N times, which is very expensive for large datasets.)
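A minimal sketch of leave-one-out cross validation with scikit-learn; the small made-up dataset and the logistic-regression model are assumptions for illustration.

```python
# Sketch: leave-one-out cross validation with scikit-learn (small made-up dataset).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.rand(30, 4)               # kept small: LOOCV trains the model N times
y = np.random.randint(0, 2, size=30)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", scores.mean())   # average over the 30 single-sample folds
```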
Bootstrap (1) • Randomly draw datasets, with replacement, from the training sample • Each bootstrap sample is the same size as the training sample • Refit the model on each bootstrap sample • Examine the behaviour of the resulting fits
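A minimal sketch of bootstrap resampling with NumPy; the data, the 200 replicates, the logistic-regression model, and the out-of-bag evaluation used to "examine" each refit are all assumptions beyond what the slide specifies.

```python
# Sketch: bootstrap resampling with NumPy (made-up data and model).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 4))                 # made-up training sample
y = rng.integers(0, 2, size=100)

n = len(X)
scores = []
for _ in range(200):                     # 200 bootstrap replicates
    idx = rng.integers(0, n, size=n)     # draw n indices with replacement
    model = LogisticRegression().fit(X[idx], y[idx])
    # Examine the refit on the "out-of-bag" points not drawn in this replicate.
    oob = np.setdiff1d(np.arange(n), idx)
    scores.append(model.score(X[oob], y[oob]))

print("Mean out-of-bag accuracy:", np.mean(scores))
```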