Linear Regression and Decision Trees.pptx

ssemwogerere_rajab · 43 slides · Oct 13, 2025

About This Presentation

Linear Regression and Decision Trees


Slide Content

Linear Regression and Decision Trees. By Bashemera Brenda Birungi and Josephine Akumu

Linear Regression. This refers to the approximation of a causal relationship between two or more variables. Types of regression: simple linear regression and multiple linear regression.

Simple Linear Regression Equation. y = b0 + b1 x + e, where y is the dependent (predicted) variable, x is the independent variable, b0 is the intercept, b1 is the slope, and e is the error term.

Geometric representation of the simple linear equation. The regression line is the best-fitting line through the data points.
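For reference, with sample means x̄ and ȳ, the least-squares slope and intercept of this best-fitting line are:

b1 = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)^2
b0 = ȳ − b1 x̄

i.e. the coefficients that minimize the sum of squared vertical distances between the line and the data points.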

Overfitting and Underfitting. These are some of the deficiencies that any machine learning model may suffer from.

Illustration: an overfitted model versus an underfitted model.

Cost Function. This measures the performance of a machine learning model on a given dataset: it quantifies how far the model's predictions are from the correct values. For regression it is often referred to as the mean squared error (MSE) function.
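Written out for n training examples, with ŷ_i denoting the model's prediction for example i:

MSE = (1/n) * Σ (y_i − ŷ_i)^2, summing over i = 1..n

Fitting the regression model amounts to choosing b0 and b1 so that this quantity is as small as possible.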

Practice illustration. There are two ways to solve a linear regression problem: (1) create your own model from the linear regression equation, or (2) import the model from Scikit-Learn. Examples: predicting a student's GPA from their SAT results (simple linear regression), and predicting media sales from TV, radio, and newspaper advertising.
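A minimal sketch of the second approach (importing the model from Scikit-Learn); the SAT/GPA numbers below are toy placeholders, not real data:

# Minimal sketch of fitting simple linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data standing in for the SAT/GPA dataset mentioned above.
sat = np.array([[1200], [1350], [1450], [1550], [1600]])   # feature matrix, shape (n_samples, 1)
gpa = np.array([2.8, 3.1, 3.4, 3.7, 3.9])                  # target vector

model = LinearRegression()
model.fit(sat, gpa)

print("intercept b0:", model.intercept_)
print("slope b1:", model.coef_[0])
print("prediction for SAT=1500:", model.predict([[1500]])[0])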

Decision Trees.

Decision Trees. A decision tree is a tree-shaped diagram used to determine a course of action; each branch of the tree represents a possible decision, occurrence, or reaction. Equivalently, a decision tree is a tree in which each branch node represents a choice between alternatives and each leaf node represents a decision. It is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. It identifies the most significant variable, and the value of that variable, that gives the best homogeneous sets of the population.

Decision Trees • Classify data using the attributes. • A tree consists of decision nodes and leaf nodes. • A node can have two or more branches, each representing a value of the attribute tested. • Leaf nodes produce a homogeneous result.

A decision tree has the following constituents. Root Node: the topmost node; in the illustrated example the factor 'temperature' is the root. Internal Node: a node with one incoming edge and two or more outgoing edges. Leaf Node: a terminal node with no outgoing edge.

Decision tree representation

Decision tree representation. Decision trees represent disjunctions of conjunctions, e.g.: (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

When to use Decision Trees. Problem characteristics: • Instances can be described by attribute-value pairs • The target function is discrete-valued • A disjunctive hypothesis may be required • The training data may be noisy • Robust to errors in training data • Can handle missing attribute values

Different classification problems: • Equipment or medical diagnosis • Credit risk analysis • Classifying credit card transactions as legitimate or fraudulent • Categorizing news stories as finance, weather, entertainment, sports, etc.

Types of Decision Trees. • Categorical Variable Decision Tree: has a categorical target variable, e.g. "student will play cricket or not", i.e. YES or NO. Classification trees are used when the dependent variable is categorical (the main focus here). • Continuous Variable Decision Tree: has a continuous target variable. Regression trees are used when the dependent variable is continuous.
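In Scikit-Learn the two types correspond to two different estimators; a brief sketch on a made-up feature matrix X (the data and labels are illustrative assumptions):

# Categorical target -> classification tree; continuous target -> regression tree.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[0, 1], [1, 1], [1, 0], [0, 0]]        # toy attribute-value pairs

clf = DecisionTreeClassifier().fit(X, ["YES", "YES", "NO", "NO"])   # e.g. plays cricket or not
reg = DecisionTreeRegressor().fit(X, [3.1, 2.9, 1.2, 1.0])          # continuous target

print(clf.predict([[0, 1]]))   # expected: ['YES']
print(reg.predict([[0, 1]]))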

Illustrating Classification Learning

Top-Down Induction of Decision Trees (ID3)
Function ID3(Training-set, Attributes)
  If all elements in Training-set are in the same class, then return a leaf node labeled with that class
  Else if Attributes is empty, then return a leaf node labeled with the majority class in Training-set
  Else if Training-set is empty, then return a leaf node labeled with the default majority class
  Else
    Select and remove A from Attributes
    Make A the root of the current tree
    For each value V of A
      Create a branch of the current tree labeled by V
      Partition_V ← elements of Training-set with value V for A
      ID3(Partition_V, Attributes) and attach the result to branch V
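A compact Python sketch of this recursion, for illustration only; it assumes each training example is a dict of attribute values with an extra 'class' key, and it uses the entropy-based information gain defined on the following slides:

# Illustrative ID3 sketch: examples are dicts, e.g. {"Outlook": "Sunny", ..., "class": "No"}.
import math
from collections import Counter

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, attribute):
    total = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes, default=None):
    if not examples:                       # empty partition: fall back to parent's majority class
        return default
    classes = [e["class"] for e in examples]
    majority = Counter(classes).most_common(1)[0][0]
    if len(set(classes)) == 1:             # all examples in the same class
        return classes[0]
    if not attributes:                     # no attributes left to test
        return majority
    best = max(attributes, key=lambda a: info_gain(examples, a))
    tree = {best: {}}                      # internal node: {attribute: {value: subtree}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], majority)
    return tree

# Example call (hypothetical attribute names):
# tree = id3(training_examples, ["Outlook", "Temperature", "Humidity", "Wind"])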

ID3. ID3 measures the information gained by making each attribute the root of the current subtree, and then chooses the attribute that produces the greatest information gain. The entropy used in ID3 is minimum (0.0) when all records belong to one class, implying most information, and maximum (log C) when records are equally distributed among all C classes, implying least information. Intuitively, the smaller the entropy, the purer the partition.

Terminologies. Information theory defines a measure of the degree of disorganization in a system known as entropy. Entropy is a measure of the impurity of the sample set. Information gain determines which attribute goes into a decision node. To minimize the decision tree depth, the attribute with the largest entropy reduction is the best choice.

Which attribute is the best classifier? A statistical property called information gain measures how well a given attribute separates the training examples. Information gain uses the notion of entropy, commonly used in information theory. Information gain = expected reduction of entropy.

Which attribute is best? S is a sample of training examples; p+ is the proportion of positive examples and p− is the proportion of negative examples. Entropy measures the impurity of S: Entropy(S) = −p+ log2(p+) − p− log2(p−)

Entropy in binary classification. Entropy measures the impurity of a collection of examples; it depends on the distribution of the random variable p. S is a collection of training examples, p+ the proportion of positive examples in S, and p− the proportion of negative examples in S. Entropy(S) ≡ −p+ log2(p+) − p− log2(p−), with the convention 0 log2 0 = 0. Examples: Entropy([14+, 0−]) = −14/14 log2(14/14) − 0 log2(0) = 0; Entropy([9+, 5−]) = −9/14 log2(9/14) − 5/14 log2(5/14) = 0.94; Entropy([7+, 7−]) = −7/14 log2(7/14) − 7/14 log2(7/14) = 1/2 + 1/2 = 1 (since log2(1/2) = −1). Note: the log of a number < 1 is negative; for 0 ≤ p ≤ 1, 0 ≤ entropy ≤ 1.

Entropy in general. Entropy measures the amount of information in a random variable. For a binary variable X = {+, −}: H(X) = −p+ log2(p+) − p− log2(p−)
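The worked values above can be checked with a few lines of Python (an illustrative verification; the helper name binary_entropy is an assumption):

import math

def binary_entropy(p_pos, p_neg):
    # Convention from the slides: 0 * log2(0) is taken to be 0.
    terms = [p * math.log2(p) for p in (p_pos, p_neg) if p > 0]
    return -sum(terms) + 0.0   # "+ 0.0" avoids printing -0.0 for pure partitions

print(binary_entropy(14/14, 0/14))   # 0.0   for [14+, 0-]
print(binary_entropy(9/14, 5/14))    # ~0.94 for [9+, 5-]
print(binary_entropy(7/14, 7/14))    # 1.0   for [7+, 7-]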

Information gain as entropy reduction. Entropy specifies the minimum average length (in bits) of the message needed to transmit the outcome of a random variable; this depends on the probability distribution. An optimal-length code assigns ⌈−log2(p)⌉ bits to messages with probability p, so the most probable messages get shorter codes.

Information gain as entropy reduction. Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute. The higher the information gain, the more effective the attribute is in classifying the training data. Expected reduction in entropy knowing A: Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv), where Values(A) is the set of possible values for A and Sv is the subset of S for which A has value v.

Example: expected information gain. Let Values(Wind) = {Weak, Strong}, S = [9+, 5−], S_Weak = [6+, 2−], S_Strong = [3+, 3−]. Information gain due to knowing Wind: Gain(S, Wind) = Entropy(S) − 8/14 Entropy(S_Weak) − 6/14 Entropy(S_Strong) = 0.94 − 8/14 × 0.811 − 6/14 × 1.00 = 0.048
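The same arithmetic can be verified in a few lines of Python (illustrative only; the helper H is an assumed shorthand for binary entropy):

import math

def H(pos, neg):
    # Binary entropy of a partition with pos positive and neg negative examples.
    total = pos + neg
    return -sum(f * math.log2(f) for f in (pos / total, neg / total) if f > 0)

gain_wind = H(9, 5) - 8/14 * H(6, 2) - 6/14 * H(3, 3)
print(round(gain_wind, 3))   # 0.048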

Information Gain. Gain(S, A): the expected reduction in entropy due to sorting S on attribute A. Example with S = [29+, 35−]: attribute A1 splits S into [21+, 5−] and [8+, 30−]; attribute A2 splits S into [18+, 33−] (True) and [11+, 2−] (False). Entropy([29+, 35−]) = −29/64 log2(29/64) − 35/64 log2(35/64) = 0.99

Information Gain (continued). For A1: Entropy([21+, 5−]) = 0.71 and Entropy([8+, 30−]) = 0.74, so Gain(S, A1) = Entropy(S) − 26/64 × Entropy([21+, 5−]) − 38/64 × Entropy([8+, 30−]) = 0.27. For A2: Entropy([18+, 33−]) = 0.94 and Entropy([11+, 2−]) = 0.62, so Gain(S, A2) = Entropy(S) − 51/64 × Entropy([18+, 33−]) − 13/64 × Entropy([11+, 2−]) = 0.12. A1 therefore yields the higher information gain (0.27 > 0.12).

Example

Selecting the Next Attribute: Outlook. S = [9+, 5−], E = 0.940. Outlook = Sunny gives [2+, 3−] with E = 0.971; Outlook = Overcast gives [4+, 0−] with E = 0.0; Outlook = Rain gives [3+, 2−] with E = 0.971. Gain(S, Outlook) = 0.940 − (5/14) × 0.971 − (4/14) × 0.0 − (5/14) × 0.971 = 0.247

Selecting the Next Attribute: Humidity and Wind. S = [9+, 5−], E = 0.940. Humidity = High gives [3+, 4−] with E = 0.985; Humidity = Normal gives [6+, 1−] with E = 0.592. Gain(S, Humidity) = 0.940 − (7/14) × 0.985 − (7/14) × 0.592 = 0.151. Wind = Weak gives [6+, 2−] with E = 0.811; Wind = Strong gives [3+, 3−] with E = 1.0. Gain(S, Wind) = 0.940 − (8/14) × 0.811 − (6/14) × 1.0 = 0.048. Humidity provides greater information gain than Wind.

First step: which attribute should be tested at the root? Gain(S, Outlook) = 0.247, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029. Outlook provides the best prediction for the target. Let's grow the tree: add to the tree a successor for each possible value of Outlook, and partition the training samples according to the value of Outlook.

After the first step

Second step. Working on the Outlook = Sunny node: Gain(S_Sunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970; Gain(S_Sunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019; Gain(S_Sunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570. Humidity provides the best prediction for the target. Let's grow the tree: add to the tree a successor for each possible value of Humidity, and partition the training samples according to the value of Humidity.

Second and third steps

Overfitting and underfitting in Decision Trees. Overfitting results in decision trees that are more complex than necessary: tree growth went too far, and the number of instances gets smaller as we build the tree (e.g., several leaves match a single example). Training error no longer provides a good estimate of how well the tree will perform on previously unseen records. Underfitting: the model is too simple, so that both training and test errors are large.

Avoiding Tree Overfitting – Solution 1. Stop the algorithm before it becomes a fully-grown tree. Typical stopping conditions for a node: • Stop if all instances belong to the same class • Stop if all the attribute values are the same. More restrictive conditions: • Stop if the number of instances is less than some user-specified threshold • Stop if the class distribution of the instances is independent of the available features • Stop if expanding the current node does not improve impurity measures
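For illustration, several of these stopping conditions map onto constructor parameters of scikit-learn's DecisionTreeClassifier (note that scikit-learn builds CART-style trees rather than ID3); the threshold values below are arbitrary placeholders, not recommendations:

from sklearn.tree import DecisionTreeClassifier

# Early-stopping ("pre-pruning") knobs; the numbers are illustrative only.
clf = DecisionTreeClassifier(
    max_depth=4,                 # cap the tree depth
    min_samples_split=20,        # stop if a node has fewer instances than this threshold
    min_samples_leaf=5,          # require a minimum number of instances per leaf
    min_impurity_decrease=0.01,  # stop if the split does not improve the impurity measure enough
)
# clf.fit(X_train, y_train)      # X_train, y_train: your training data (hypothetical names)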

Avoiding Tree Overfitting – Solution 2: Post-pruning. • Split the dataset into training and validation sets • Grow the full decision tree on the training set • While the accuracy on the validation set increases: evaluate the impact of pruning each subtree, replacing its root by a leaf labeled with the majority class for that subtree, and replace the subtree that most increases validation-set accuracy (greedy approach)
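Scikit-learn does not offer this reduced-error pruning directly, but a related post-pruning mechanism, cost-complexity pruning, can be sketched as follows on synthetic stand-in data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data standing in for a real training/validation split.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow a full tree, then pick the pruning strength (ccp_alpha) that scores best on validation data.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

best_alpha = max(
    alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
    .fit(X_train, y_train)
    .score(X_val, y_val),        # validation-set accuracy
)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)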

Advantages • Easy to understand and visualize • Little effort required for data preparation • Can handle both numerical and categorical data • Non-linear relationships between parameters do not affect performance

Disadvantages • Overfitting • Not well suited to continuous variables: when working with continuous numerical variables, a decision tree loses information when it discretizes them into categories • Decision trees can be unstable, because small variations in the data might result in a completely different tree being generated

Case study Predicting the income of a given population
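A minimal sketch of how such a case study could be set up with scikit-learn; the feature names, toy values, and binary 'high_income' label are assumptions for illustration, not details from the presentation:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical census-style data: predict whether income exceeds some threshold.
data = pd.DataFrame({
    "age":             [25, 38, 47, 52, 29, 61, 33, 44],
    "education_years": [12, 16, 14, 18, 10, 12, 16, 13],
    "hours_per_week":  [40, 50, 45, 60, 30, 20, 55, 40],
    "high_income":     [0,  1,  1,  1,  0,  0,  1,  0],   # illustrative labels
})

X = data[["age", "education_years", "hours_per_week"]]
y = data["high_income"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))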