Decision Tree Machine Learning Model for Classification


About This Presentation

A machine learning classification model based on decision trees.


Slide Content

Decision Tree

Classification: Decision Tree A decision tree is a popular machine learning algorithm used for both classification and regression tasks. It's a tree-like structure where each internal node represents a decision based on a feature, each branch represents an outcome of that decision, and each leaf node represents the final output (class label or numerical value). Decision trees are particularly useful for their simplicity, interpretability, and ease of visualization.

Decision Tree The process of building a decision tree involves recursively splitting the dataset into subsets based on the most informative features at each node. The algorithm selects the feature that provides the best separation of the data, typically using metrics like Gini impurity for classification tasks or mean squared error for regression tasks. This process continues until a stopping criterion is met, such as a predefined depth of the tree or when further splitting does not significantly improve the model.

Pros and Cons of Decision Tree Pros: Decision trees require little data preprocessing (no need for normalization or scaling) and can handle both numerical and categorical data. Decision trees are widely used in various applications due to their simplicity, interpretability, and effectiveness in capturing complex decision boundaries in the data. Cons: Decision trees are prone to overfitting, especially when the tree becomes too deep. Techniques like pruning or using ensemble methods like Random Forests can help mitigate overfitting and improve generalization performance.

Terminologies In decision trees, several key terminologies and concepts are used to describe different elements and aspects of the tree-building process. Root Node: The top node of the decision tree, from which the tree starts branching. Decision Nodes (Internal Nodes): Nodes in the tree where decisions are made based on the values of features. These nodes have branches leading to child nodes.

Terminologies Leaf Nodes (Terminal Nodes): End nodes of the tree where the final decision or prediction is made. They do not have any child nodes. Branches: The edges connecting nodes in the tree. Each branch corresponds to a possible outcome of a decision. Splitting: The process of dividing a node into two or more child nodes based on a certain feature and a threshold or condition.

Terminologies Decision Rule: The condition at each decision node that determines the direction of the branch. It is based on a feature and a threshold. Entropy: A measure of impurity or disorder in a set of data. Decision trees use entropy to find the best feature to split the data. Information Gain: The reduction in entropy or impurity achieved by splitting the data based on a particular feature. Decision trees aim to maximize information gain when choosing features for splitting.

Terminologies Gini Impurity: Another measure of impurity used in decision trees. It represents the probability of incorrectly classifying a randomly chosen element. Pruning: The process of removing branches from the tree to reduce overfitting. Pruning helps improve the generalization ability of the model. Depth of the Tree: The length of the longest path from the root node to a leaf node. It indicates how many decisions or splits are made before reaching a prediction.

Terminologies CART (Classification and Regression Trees): An algorithm used for constructing decision trees that can be applied to both classification and regression tasks. Random Forest: An ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting. These terminologies provide a foundation for understanding and discussing decision trees and their construction. Different algorithms and implementations may use variations of these terms, but the fundamental concepts remain consistent across the field of machine learning.

Example of a Decision Tree [Figure: a training data table with categorical attributes Refund and MarSt, the continuous attribute TaxInc, and the class label Cheat, shown alongside the resulting model.] The splitting attributes form the tree: Refund at the root (Yes → NO; No → MarSt), then MarSt (Married → NO; Single, Divorced → TaxInc), then TaxInc (< 80K → NO; > 80K → YES).

Example of a Decision Tree [Figure: an alternative tree for the same training data, with MarSt at the root and further splits on Refund and TaxInc.] There could be more than one tree that fits the same data!

Decision Tree Classification Task [Figure: the decision tree model within the overall classification task workflow.]

Apply Model to Test Data [Figure, repeated across several slides: the tree from the previous example traced against a test record.] Start from the root of the tree and, at each decision node (Refund, MarSt, TaxInc), follow the branch that matches the test record's attribute value until a leaf node is reached. For this test record the leaf reached assigns Cheat to "No".

Decision Tree Classification A decision tree is created in two phases. Tree Building Phase: repeatedly partition the training data until all the examples in each partition belong to one class or the partition is sufficiently small. Tree Pruning Phase: remove the dependency on statistical noise or variation that may be particular only to the training set.

Tree Building Phase General tree-growth algorithm (binary tree):
Partition(Data S)
    if all points in S are of the same class, then return;
    for each attribute A, evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1); Partition(S2);
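The same recursive procedure can be sketched in Python. This is a minimal illustration, not the algorithm from any particular library: rows are assumed to be (features, label) pairs, and find_best_split is a hypothetical helper that evaluates candidate splits on every attribute and returns the chosen test (or None if no useful split exists).

# Minimal sketch of the recursive tree-growth algorithm described above.
# Assumes rows are (features, label) pairs; find_best_split is a hypothetical helper.

def majority_class(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def partition(rows, find_best_split, min_size=1):
    # Stop when the partition is pure or sufficiently small
    if len({label for _, label in rows}) <= 1 or len(rows) <= min_size:
        return {"leaf": majority_class(rows)}
    split = find_best_split(rows)          # evaluates splits on every attribute
    if split is None:                      # no split improves purity
        return {"leaf": majority_class(rows)}
    left = [r for r in rows if split["test"](r[0])]
    right = [r for r in rows if not split["test"](r[0])]
    return {"attribute": split["attribute"],
            "test": split["test"],
            "left": partition(left, find_best_split, min_size),
            "right": partition(right, find_best_split, min_size)}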

Tree Building Phase (cont.) The form of the split depends on the type of the attribute. Splits for numeric attributes are of the form A ≤ v, where v is a real number. Splits for categorical attributes are of the form A ∈ S', where S' is a subset of all possible values of A. A sketch of how such candidate splits can be enumerated follows.
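As a rough illustration (the function names and example values are ours, not from the slides), candidate splits of both forms can be enumerated like this:

# Candidate splits for a single attribute, following the two forms above.
from itertools import combinations

def numeric_candidates(values):
    # Splits of the form A <= v: midpoints between consecutive distinct values
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

def categorical_candidates(values):
    # Splits of the form A in S': every non-empty proper subset of the values
    v = sorted(set(values))
    subsets = []
    for k in range(1, len(v)):
        subsets.extend(set(c) for c in combinations(v, k))
    return subsets

print(numeric_candidates([60, 80, 95]))                           # [70.0, 87.5]
print(categorical_candidates(["Single", "Married", "Divorced"]))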

Why decision trees Decision trees are powerful and popular tools for classification and prediction. Decision trees represent rules, which can be understood by humans and used in knowledge systems such as databases.

Key requirements Attribute-value description: each object or case must be expressible in terms of a fixed collection of properties or attributes (e.g., hot, mild, cold). Predefined classes (target values): the target function has discrete output values (boolean or multiclass). Sufficient data: enough training cases should be provided to learn the model.

Principled Criterion Selection of an attribute to test at each node means choosing the most useful attribute for classifying examples. Gini impurity: a measure of impurity or disorder in a set of data, used in decision tree algorithms, particularly for classification problems. Entropy: a measure of homogeneity of the set of examples. Information gain: measures how well a given attribute separates the training examples according to their target classification; this measure is used to select among the candidate attributes at each step while growing the tree.

Calculation of Gini Impurity Gini impurity is a measure of impurity or disorder in a set of data used in decision tree algorithms, particularly for classification problems. It indicates the likelihood of new, random data being misclassified if it were given a random class label according to the class distribution in the dataset. The Gini impurity for a node is calculated based on the distribution of class labels in that node. The formula for Gini impurity is: Gini = 1 - Σ (pi)^2, summed over i = 1 to c, where c is the number of classes and pi is the probability of the occurrence of class i in the node. For a two-class problem the Gini impurity ranges from 0 to 0.5, where 0 indicates perfect purity (all instances belong to a single class) and 0.5 indicates maximum impurity (an equal distribution of instances across the classes).
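A small Python sketch of this formula (the function name and the example label lists are ours, chosen only for illustration):

# Gini impurity of a list of class labels, following the formula above.
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["Yes"] * 5 + ["No"] * 5))   # 0.5 (maximum impurity for two classes)
print(gini_impurity(["Yes"] * 10))               # 0.0 (perfect purity)
print(gini_impurity(["Yes"] * 7 + ["No"] * 3))   # ≈ 0.42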

Calculation of Gini Impurity Here's a step-by-step example of calculating Gini impurity for a hypothetical dataset with two classes, "Yes" and "No". [Dataset table and calculation shown on the slide.]

Calculation of Gini Impurity The node with uniform class distribution has the highest impurity. The minimum impurity is obtained when all records belong to the same class. Several examples are given in the following table to demonstrate the Gini Impurity computation.

Calculation of Gini Impurity Gini Impurity is a measurement used to build Decision Trees to determine how the features of a dataset should split nodes to form the tree. In the context of decision trees, when selecting a feature for splitting, the one that results in the lowest Gini impurity after the split is chosen, as it indicates a more "pure" separation of the classes.

Example : Gini Impurity For example, say you want to build a classifier that determines if someone will default on their credit card. You have some labeled data with features, such as bins for age, income, credit rating, and whether or not each person is a student. To find the best feature for the first split of the tree – the root node – you could calculate how poorly each feature divided the data into the correct class, default ("yes") or didn't default ("no"). This calculation would measure the impurity of the split, and the feature with the lowest impurity would determine the best feature for splitting the current node. This process would continue for each subsequent node using the remaining features.

Example: Gini Impurity [Figure shown on the slide.]

An attribute with the smallest Gini Impurity is selected for splitting the node. If a data set D with n records is split on an attribute into two subsets D1 and D2 with sizes n1 and n2, respectively, the Gini Impurity of the split can be defined as: Gini_split(D) = (n1/n) Gini(D1) + (n2/n) Gini(D2). When training a decision tree, the attribute that provides the smallest weighted impurity is chosen to split the node. To obtain the information gain for an attribute, the weighted impurities of the branches are subtracted from the original impurity. The best split can also be chosen by maximizing the Gini gain, calculated as follows: GiniGain(D, A) = Gini(D) - Gini_split(D).
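Continuing the earlier sketch (again with illustrative names and made-up label lists, not values from the slides), the weighted impurity of a binary split and the corresponding Gini gain can be computed as:

# Weighted Gini impurity of a binary split D -> (D1, D2) and the resulting Gini gain.
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(d1, d2):
    n1, n2 = len(d1), len(d2)
    return (n1 / (n1 + n2)) * gini_impurity(d1) + (n2 / (n1 + n2)) * gini_impurity(d2)

def gini_gain(parent, d1, d2):
    return gini_impurity(parent) - gini_split(d1, d2)

parent = ["Yes"] * 5 + ["No"] * 5
left, right = ["Yes"] * 4 + ["No"], ["Yes"] + ["No"] * 4
print(gini_split(left, right))         # ≈ 0.32
print(gini_gain(parent, left, right))  # ≈ 0.18 (= 0.5 - 0.32)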

Gini Impurity and Gini Index The Gini Index is essentially the same quantity expressed in percentage terms; it is calculated as the Gini Impurity multiplied by 100: Gini Index = Gini Impurity × 100. Both Gini Impurity and Gini Index are widely used as criteria for evaluating the quality of splits in decision trees, and the choice between them often depends on the convention used in specific implementations or literature.

Entropy A measure of homogeneity of the set of examples. Given a set S of positive and negative examples of some target concept (a 2-class problem), the entropy of set S relative to this binary classification is E(S) = -p(P) log2 p(P) - p(N) log2 p(N).

Some Intuitions The entropy is 0 if the outcome is "certain". The entropy is maximum if we have no knowledge of the system (i.e., any outcome is equally possible). [Figure: entropy of a 2-class problem plotted against the proportion of one of the two groups.]

Information gain is simply the expected reduction in entropy caused by partitioning the examples according to an attribute. It is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A. For a two-class problem, the range of entropy is from 0 to 1, whereas the range of Gini Impurity is from 0 to 0.5.

Calculation of Entropy Entropy is a measure of impurity or disorder in a set of data. In the context of decision trees, entropy is used to determine the homogeneity of a node based on the distribution of class labels. The formula for entropy (H) is: H(S) = -Σ pi log2(pi), summed over the c classes, where pi is the probability of class i. The entropy is calculated for each node in a decision tree, and the goal is to minimize entropy by choosing the best features for splitting the data.
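In the same spirit as the Gini sketch earlier, a small illustrative Python function for this formula (names and example data are ours):

# Entropy of a list of class labels; classes with zero count simply do not appear,
# which matches the convention 0 * log2(0) = 0 discussed below.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Yes"] * 5 + ["No"] * 5))   # 1.0 (maximum uncertainty for two classes)
print(entropy(["Yes"] * 10))               # 0.0 (perfect purity)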

Example: Here's a step-by-step example of calculating entropy for a hypothetical dataset with two classes, "Yes" and "No". [Dataset and calculation shown on the slide: Gini Index = 0.48, Entropy = 0.29.]

Example: If all instances in a dataset belong to the same class (either all "Yes" or all "No"), then the dataset is perfectly pure, and there is no uncertainty or disorder in terms of class labels. In such a case, the entropy is 0, indicating complete homogeneity. (In the entropy formula, c is the number of classes and pi is the probability of class i in set S.)

Example: If all instances are of the same class, say "Yes," then p(Yes) = 1 and p(No) = 0. Substituting these values into the entropy formula gives H(S) = -1·log2(1) - 0·log2(0) = 0. Note that log2(0) is undefined, so we typically define 0·log2(0) as 0 for the purpose of the calculation. Similarly, if all instances are of the class "No," the entropy will also be 0. If a dataset contains only one class, the entropy is 0, indicating perfect homogeneity and no uncertainty. This is the ideal scenario for a node in a decision tree because it implies a clear and pure separation of classes.

Entropy measure: let us get a feeling As the data become purer and purer, the entropy value becomes smaller and smaller.

Difference Between Gini Impurity and Entropy Both Gini impurity and entropy are measures of impurity or disorder used in decision tree algorithms to determine the best split at each node. However, there are some differences in how they calculate impurity. Formula Gini Impurity: calculated as Gini(t) = 1 - Σ (pi)^2, where pi is the proportion of instances of class i in node t. Entropy: calculated as Entropy(t) = -Σ pi log2(pi), where pi is the proportion of instances of class i in node t.

Difference Between Gini Impurity and Entropy Scale Gini Impurity: values range from 0 to 0.5 (for a two-class problem), where 0 represents perfect purity (all instances belong to a single class) and 0.5 represents maximum impurity (instances are evenly distributed across the classes). Entropy: values range from 0 to log2(number of classes), i.e., 0 to 1 for two classes, where 0 represents perfect purity and log2(number of classes) represents maximum impurity. Decision Rule Gini Impurity: tends to favor larger partitions with dominant classes. It is less sensitive to imbalances in class distribution. Entropy: can be sensitive to imbalances in class distribution. It may result in slightly more balanced trees compared to Gini impurity.

Difference Between Gini Impurity and Entropy Computation Gini Impurity: computationally less expensive compared to entropy, as it doesn't involve logarithmic calculations. Entropy: requires logarithmic calculations, which can be relatively more computationally expensive. In practice, both Gini impurity and entropy are commonly used, and the choice between them often depends on the specific characteristics of the dataset and the desired properties of the resulting decision tree. Scikit-learn, a popular machine learning library in Python, allows users to choose between Gini impurity and entropy as the criterion for splitting in decision tree classifiers.
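For example, the criterion can be switched on scikit-learn's DecisionTreeClassifier. The toy feature values below (age, income) and labels are made up purely for illustration:

# Fitting decision trees with the Gini and entropy splitting criteria in scikit-learn.
from sklearn.tree import DecisionTreeClassifier

X = [[25, 40000], [45, 90000], [35, 60000], [50, 120000], [23, 30000], [40, 80000]]
y = ["no", "yes", "no", "yes", "no", "yes"]   # e.g., whether the person defaulted

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    clf.fit(X, y)
    print(criterion, clf.predict([[30, 50000]]))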

Information Gain Information gain measures the expected reduction in entropy, or uncertainty: Gain(S, A) = Entropy(S) - Σ over v in Values(A) of (|Sv| / |S|) Entropy(Sv). Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v, i.e., Sv = {s in S | A(s) = v}. The first term in the equation for Gain is just the entropy of the original collection S; the second term is the expected value of the entropy after S is partitioned using attribute A.

Relationship Between Entropy and Information Gain Here's how entropy and information gain are related: Entropy (H) Entropy is a measure of impurity or disorder in a set of data. It quantifies the uncertainty associated with the distribution of class labels in a node. Higher entropy indicates higher disorder and uncertainty.

Relationship Between Entropy and Information Gain Information Gain (IG) Information gain measures the reduction in entropy (uncertainty) achieved by splitting the data based on a particular feature. The feature with the highest information gain is chosen as the decision feature for the split, as it provides the most significant reduction in uncertainty.

Relationship Between Entropy and Information Gain There is a strong relationship between entropy and information gain in the context of decision trees. Information gain is a concept used to measure the effectiveness of a feature in reducing the uncertainty or entropy in a dataset. The higher the information gain, the more effective the feature is for making decisions in a decision tree.

Relationship Between Entropy and Information Gain The decision tree algorithm seeks to maximize information gain when selecting features for splitting. It chooses the feature that results in the highest reduction of entropy in the dataset, leading to more homogenous and well-separated subsets. This process is repeated recursively, building the decision tree by selecting features that maximize information gain at each decision node.

A simple example You want to guess the outcome of next week's game between the MallRats and the Chinooks. Available knowledge / attributes: was the game at home or away; was the starting time 5pm, 7pm, or 9pm; did Joe play center or forward; was the opponent's center tall or not; ...

Basketball data [Table shown on the slide.]

What we know The game will be away, at 9pm, and Joe will play center on offense... This is a classification problem: generalizing the learned rule to new examples.

Definition A decision tree is a classifier in the form of a tree structure. Decision node: specifies a test on a single attribute. Leaf node: indicates the value of the target attribute. Arc/edge: a split on one attribute. Path: a disjunction of tests to make the final decision. Decision trees classify instances or examples by starting at the root of the tree and moving through it until a leaf node is reached.

Illustration (1) Which attribute to start with? (the root) (2) Which node to proceed to? (3) When to stop / come to a conclusion?

Random split The tree can grow huge. Such trees are hard to understand, and larger trees are typically less accurate than smaller trees.

Example Before partitioning, the entropy is H(10/20, 10/20) = -10/20 log(10/20) - 10/20 log(10/20) = 1. Using the "where" attribute, divide into 2 subsets. Entropy of the first set: H(home) = -6/12 log(6/12) - 6/12 log(6/12) = 1. Entropy of the second set: H(away) = -4/8 log(4/8) - 4/8 log(4/8) = 1. Expected entropy after partitioning: 12/20 * H(home) + 8/20 * H(away) = 1. Information gain is 1 - 1 = 0.

Example Using the "when" attribute, divide into 3 subsets. Entropy of the first set: H(5pm) = -1/4 log(1/4) - 3/4 log(3/4). Entropy of the second set: H(7pm) = -9/12 log(9/12) - 3/12 log(3/12). Entropy of the third set: H(9pm) = -0/4 log(0/4) - 4/4 log(4/4) = 0. Expected entropy after partitioning: 4/20 * H(1/4, 3/4) + 12/20 * H(9/12, 3/12) + 4/20 * H(0/4, 4/4) = 0.65. Information gain: 1 - 0.65 = 0.35.

Decision Knowing the "when" attribute values provides larger information gain than "where". Therefore the "when" attribute should be chosen for testing prior to the "where" attribute. Similarly, we can compute the information gain for other attributes. At each node, choose the attribute with the largest information gain; a short sketch of this computation follows.
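A rough Python sketch of the information-gain computation above, assuming the two classes are win (W) and loss (L); the slides only give the class counts, so the label lists below simply reconstruct those counts:

# Information gain of the "where" and "when" splits from the worked example.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, subsets):
    n = len(parent)
    expected = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - expected

parent = ["W"] * 10 + ["L"] * 10                                     # 20 games overall
where  = [["W"] * 6 + ["L"] * 6, ["W"] * 4 + ["L"] * 4]              # home, away
when   = [["W"] * 1 + ["L"] * 3, ["W"] * 9 + ["L"] * 3, ["L"] * 4]   # 5pm, 7pm, 9pm

print(round(information_gain(parent, where), 2))  # 0.0
print(round(information_gain(parent, when), 2))   # 0.35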

Iris Data Set The Iris dataset is a well-known dataset in the field of machine learning and statistics. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936. The dataset is commonly used for practicing and demonstrating various classification algorithms, particularly in the context of supervised learning. The Iris dataset consists of measurements of four features (attributes) of iris flowers from three different species: setosa, versicolor, and virginica. The four features are: Sepal Length: length of the iris flower's sepal (the outermost whorl of a flower). Sepal Width: width of the iris flower's sepal. Petal Length: length of the iris flower's petal (the innermost whorl of a flower). Petal Width: width of the iris flower's petal.

Iris Data Set Each sample in the dataset represents an individual iris flower, and the corresponding label indicates the species to which the flower belongs (setosa, versicolor, or virginica). The Iris dataset is commonly used as a beginner's dataset for learning and practicing classification algorithms due to its simplicity and the clear separation of the three species based on the provided features. The Iris dataset is available in many machine learning libraries, including scikit-learn, making it easily accessible for experimentation and learning purposes.
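As a quick illustration (our own snippet, using only standard scikit-learn calls), the dataset can be loaded and a small decision tree fitted and printed as text rules:

# Loading the Iris dataset and fitting a depth-limited decision tree to it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print(export_text(clf, feature_names=list(iris.feature_names)))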

Iris Data Set [Table of sample measurements shown on the slide.]

Iris Data Set Setosa (Iris setosa) Setosa is one of the three species in the Iris dataset. It is characterized by its distinctive features, including smaller size, a unique petal shape, and color patterns. Setosa flowers typically have a shorter sepal length compared to the other two species. Versicolor (Iris versicolor) Versicolor is another species in the Iris dataset. It is known for its versatility in terms of petal and sepal characteristics. Versicolor flowers have medium-sized petals and sepals. Virginica (Iris virginica) Virginica is the third species included in the Iris dataset. It is characterized by larger petal and sepal sizes compared to Setosa and Versicolor. Virginica flowers often exhibit different color patterns.

Performance of Classification Algorithms Precision, recall, and F1-score are metrics commonly used to evaluate the performance of classification algorithms, particularly in the context of binary or multiclass classification problems. These metrics provide insights into how well a model is performing in terms of correctly classifying instances from different classes . Let's break down each metric. These metrics are commonly reported together to provide a comprehensive assessment of a model's performance, especially in situations where achieving a balance between false positives and false negatives is crucial.

Performance of Classification Algorithms Precision: "Of all instances predicted as positive, how many are actually positive?" Recall: "Of all actual positive instances, how many were correctly predicted?" F1-Score: "A balance between precision and recall. It is the harmonic mean of the two." Support: "The number of actual occurrences of the class in the dataset."

Performance of Classification Algorithms Let's use a hypothetical binary classification scenario to illustrate precision, recall, and F1-score. Consider the following confusion matrix for a binary classification problem [shown on the slide]. In this confusion matrix: True Positives (TP) = 85, True Negatives (TN) = 90, False Positives (FP) = 10, False Negatives (FN) = 15.

Performance of Classification Algorithms Let's create a simple example with two columns, "Actual Class" and "Predicted Class," each containing ten rows [table shown on the slide]. We will use binary classification, where the classes are either 0 or 1.

Performance of Classification Algorithms In this example: True Positives (TP): Rows where both "Actual Class" and "Predicted Class" are 1. (e.g., row 2, row 5, row 8, row 9) True Negatives (TN): Rows where both "Actual Class" and "Predicted Class" are 0. (e.g., row 6, row 10) False Positives (FP): Rows where "Actual Class" is 0, but "Predicted Class" is 1. (e.g., row 1, row 4) False Negatives (FN): Rows where "Actual Class" is 1, but "Predicted Class" is 0. (e.g., row 3, row 7)

Performance of Classification Algorithms Now, let's calculate Precision, Recall, and F1-Score: Precision = TP / (TP + FP) = 4 / (4 + 2) = 0.67 Recall = TP / (TP + FN) = 4 / (4 + 2) = 0.67 F1-Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.67 * 0.67) / (0.67 + 0.67) = 0.67 In this example, Precision and Recall are both 0.67, indicating that out of the instances predicted as positive, 67% are actually positive, and out of all actual positive instances, 67% were correctly predicted. The F1-Score provides a balanced measure, and in this case, it is also 0.67.
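The same numbers can be reproduced with scikit-learn's metric functions. The exact 0/1 label lists below are our reconstruction, chosen only to match the TP/TN/FP/FN rows identified above (4 TP, 2 TN, 2 FP, 2 FN):

# Precision, recall, and F1 for the ten-row example.
from sklearn.metrics import precision_score, recall_score, f1_score

actual    = [0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
predicted = [1, 1, 0, 1, 1, 0, 0, 1, 1, 0]

print(precision_score(actual, predicted))  # 4 / (4 + 2) ≈ 0.67
print(recall_score(actual, predicted))     # 4 / (4 + 2) ≈ 0.67
print(f1_score(actual, predicted))         # ≈ 0.67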

Performance of Classification Algorithms Returning to the earlier confusion matrix (TP = 85, TN = 90, FP = 10, FN = 15), let's calculate precision, recall, and F1-score using the formulas mentioned earlier: Precision = 85 / (85 + 10) ≈ 0.8947. Recall = 85 / (85 + 15) = 0.85. F1-Score = 2 × (0.8947 × 0.85) / (0.8947 + 0.85) ≈ 0.8718.

Precision Precision is the ratio of correctly predicted positive observations to the total predicted positives. It is calculated as the number of true positives divided by the sum of true positives and false positives. Precision is a measure of how many of the predicted positive instances are actually positive. The formula for precision is: Precision = TP / (TP + FP)

Recall (Sensitivity or True Positive Rate) Recall is the ratio of correctly predicted positive observations to the total actual positives. It is calculated as the number of true positives divided by the sum of true positives and false negatives. Recall measures the ability of the model to capture all the actual positive instances. The formula for recall is: Recall = TP / (TP + FN)

F1-Score F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is calculated as 2 × (Precision × Recall) / (Precision + Recall). F1-score reaches its best value at 1 (perfect precision and recall) and worst at 0. F1-score is particularly useful when the class distribution is imbalanced.

Support Support is the number of actual occurrences of the class in the specified dataset. It is the number of instances belonging to each class.

Practice Problem Scenario: Email Spam Classification Suppose you have a dataset of emails labeled as either spam or ham (non-spam), and you want to build a classification model to identify spam emails. The initial performance of the model may look like this [confusion matrix shown on the slide]. What is the total number of rows in the dataset?

Practice Problem In this confusion matrix: True Positives (TP): 80, True Negatives (TN): 950, False Positives (FP): 50, False Negatives (FN): 20.

Practice Problem The total number of rows in the dataset (total instances or observations) can be calculated by adding up all these values: Total Rows = TP + TN + FP + FN Using the values from the confusion matrix: Total Rows = 80 + 950 + 50 + 20 =1100 So, there are a total of 1100 rows or instances in the dataset based on the information provided in the confusion matrix. Each row corresponds to a specific instance (email in this case) that the model has made predictions on, and the confusion matrix summarizes how those predictions align with the true classes.
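The remaining metrics follow directly from these four counts; a short sketch (plain arithmetic, no library calls) that also checks the row total:

# Metrics for the spam-classification practice problem, from the confusion-matrix counts.
TP, TN, FP, FN = 80, 950, 50, 20

total     = TP + TN + FP + FN                               # 1100 rows
precision = TP / (TP + FP)                                  # 80 / 130  ≈ 0.615
recall    = TP / (TP + FN)                                  # 80 / 100  = 0.80
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.696
accuracy  = (TP + TN) / total                               # 1030 / 1100 ≈ 0.936

print(total, round(precision, 3), round(recall, 3), round(f1, 3), round(accuracy, 3))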

Practice Problem [Figure shown on the slide.]