Linear Regression and Decision Trees. By Bashemera Brenda Birungi and Josephine Akumu
Linear Regression. Linear regression approximates the relationship between two or more variables with a straight-line model. Types of regression: simple linear regression and multiple linear regression.
Simple Linear Regression Equation. y = b0 + b1·x, where b0 is the intercept, b1 is the slope (coefficient), x is the independent variable and y the predicted value of the dependent variable.
Simple linear regression, geometric representation. The regression line is the best-fitting line through the data points.
Overfitting and Underfitting. These are two deficiencies that any machine learning model may suffer from.
[Figure: an overfitted fit vs. an underfitted fit]
Cost Function. This quantifies the performance of a machine learning model on a given dataset: it measures how far the model's predictions are from the correct values. For linear regression it is often the mean squared error (MSE) function.
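As a concrete illustration (a minimal sketch with made-up numbers, not part of the original slides), the MSE can be computed directly:

```python
import numpy as np

# Hypothetical true values and model predictions (made-up numbers for illustration)
y_true = np.array([3.2, 2.8, 3.9, 3.5])
y_pred = np.array([3.0, 3.0, 3.7, 3.6])

# Mean squared error: the average of the squared differences between predictions and targets
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.0325
```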
Practice illustration. There are two ways to solve a linear regression problem: (1) build your own model from the linear regression equation, or (2) import the model from scikit-learn. Examples: predicting a student's GPA from their SAT result, and predicting media sales from TV, radio and newspaper advertising (a sketch of option (2) follows below).
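A minimal sketch of option (2) using scikit-learn; the SAT/GPA numbers below are made up for illustration, and in practice the data would be loaded from a file:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical SAT scores (feature) and GPAs (target)
sat = np.array([[1200], [1350], [1450], [1100], [1550]])
gpa = np.array([2.9, 3.2, 3.5, 2.7, 3.8])

model = LinearRegression()
model.fit(sat, gpa)

print(model.intercept_, model.coef_)  # fitted b0 and b1
print(model.predict([[1400]]))        # predicted GPA for an SAT score of 1400
```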
Decision Trees.
Decision Trees. A decision tree is a tree-shaped diagram used to determine a course of action; each branch represents a possible decision, occurrence or reaction. Equivalently, it is a tree in which each branch node presents a choice between alternatives and each leaf node represents a decision. It is a supervised learning algorithm (with a pre-defined target variable) that is mostly used for classification problems. At each step it identifies the most significant variable, and the value of that variable, that gives the most homogeneous subsets of the population.
Decision Trees • Classify data using the attributes. • The tree consists of decision nodes and leaf nodes. • A node can have two or more branches, each representing a value of the attribute tested. • Leaf nodes produce a homogeneous result.
A decision tree has the following constituents. Root node: the attribute tested first (the factor 'Temperature' in this example). Internal node: a node with one incoming edge and two or more outgoing edges. Leaf node: a terminal node with no outgoing edge.
Decision tree representation
Decision trees represent disjunctions of conjunctions of constraints on attribute values, e.g.: (Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
When to use Decision Trees. Problem characteristics: instances can be described by attribute-value pairs; the target function is discrete-valued; a disjunctive hypothesis may be required; the training data may be noisy (decision trees are robust to errors in the training data) and may contain missing attribute values.
Different classification problems: Equipment or medical diagnosis Credit risk analysis Classifying credit card transactions as legitimate or fraudulent Categorizing news stories as finance, weather, entertainment, sports, etc.
Types of Decision Trees. Categorical-variable decision tree: the target variable is categorical, e.g. "Will the student play cricket?" (YES or NO); classification trees are used when the dependent variable is categorical (the main focus here). Continuous-variable decision tree: the target variable is continuous; regression trees are used when the dependent variable is continuous.
Illustrating Classification Learning
Top-Down Induction of Decision Trees: ID3
Function ID3(Training-set, Attributes):
  If all elements in Training-set are in the same class, return a leaf node labeled with that class
  Else if Attributes is empty, return a leaf node labeled with the majority class in Training-set
  Else if Training-set is empty, return a leaf node labeled with the default (majority) class
  Else
    Select and remove an attribute A from Attributes
    Make A the root of the current tree
    For each value V of A:
      Create a branch of the current tree labeled by V
      Partition_V := elements of Training-set with value V for A
      Attach ID3(Partition_V, Attributes) to branch V
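The pseudocode can be turned into a short Python sketch (an illustrative, simplified implementation for categorical attributes stored in dictionaries, not the original authors' code):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Expected reduction in entropy from partitioning rows on attr."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def id3(rows, labels, attributes, default=None):
    """Return a nested-dict decision tree; leaves are class labels."""
    if not rows:                                   # empty partition: fall back to parent majority
        return default
    if len(set(labels)) == 1:                      # all examples in the same class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                             # no attributes left to test
        return majority
    # Choose the attribute with the greatest information gain as the root of this subtree
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        branch_rows = [r for r in rows if r[best] == value]
        branch_labels = [lab for r, lab in zip(rows, labels) if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = id3(branch_rows, branch_labels, remaining, default=majority)
    return tree
```

Called on the play-tennis examples with attributes such as Outlook, Temperature, Humidity and Wind, this would grow the same kind of tree as the worked example in the following slides.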
ID3. ID3 measures the information gained by making each attribute the root of the current subtree, and chooses the attribute that produces the greatest information gain. The entropy used in this measure is minimum (0.0) when all records belong to one class (the partition carries the most information about the class) and maximum (log C) when records are equally distributed among C classes (the least information). Intuitively, the smaller the entropy, the purer the partition.
Terminologies. Information theory provides a measure of the degree of disorganization in a system, known as entropy. Entropy is a measure of the impurity of the sample set. Information gain determines which attribute goes into a decision node. To minimize the depth of the decision tree, the attribute with the largest entropy reduction is the best choice.
Which attribute is the best classifier? A statistical property called information gain measures how well a given attribute separates the training examples. Information gain uses the notion of entropy, commonly used in information theory. Information gain = expected reduction of entropy.
Which attribute is best? S is a sample of training examples; p+ is the proportion of positive examples and p− the proportion of negative examples. Entropy measures the impurity of S: Entropy(S) = −p+ log2(p+) − p− log2(p−)
Entropy in binary classification. Entropy measures the impurity of a collection of examples; it depends on the distribution of the random variable p. S is a collection of training examples, p+ the proportion of positive examples in S and p− the proportion of negative examples. Entropy(S) = −p+ log2(p+) − p− log2(p−), with the convention 0·log2(0) = 0. Examples: Entropy([14+, 0−]) = −(14/14) log2(14/14) − (0/14) log2(0/14) = 0; Entropy([9+, 5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94; Entropy([7+, 7−]) = −(7/14) log2(7/14) − (7/14) log2(7/14) = 1/2 + 1/2 = 1 (since log2(1/2) = −1). Note: the log of a number < 1 is negative; 0 ≤ p ≤ 1 and 0 ≤ entropy ≤ 1.
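These values can be checked with a few lines of Python (a quick sketch, not part of the original slides):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a set with pos positive and neg negative examples (0*log2(0) := 0)."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * math.log2(p)
    return result

print(entropy(14, 0))  # 0.0
print(entropy(9, 5))   # ~0.940
print(entropy(7, 7))   # 1.0
```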
Entropy in general. Entropy measures the amount of information in a random variable. For a binary variable X = {+, −}: H(X) = −p+ log2(p+) − p− log2(p−). For C classes this generalizes to H(X) = −Σ pi log2(pi).
Information gain as entropy reduction. Entropy specifies the average length (in bits) of the message needed to transmit the outcome of a random variable; this depends on the probability distribution. An optimal-length code assigns −log2(p) bits to a message with probability p, so the most probable messages get the shortest codes.
Information gain as entropy reduction. Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute; the higher the information gain, the more effective the attribute is at classifying the training data. Expected reduction in entropy from knowing A: Gain(S, A) = Entropy(S) − Σ over v in Values(A) of (|Sv| / |S|) · Entropy(Sv), where Values(A) is the set of possible values for A and Sv is the subset of S for which A has value v.
Example: expected information gain. Let Values(Wind) = {Weak, Strong}, S = [9+, 5−], S_Weak = [6+, 2−], S_Strong = [3+, 3−]. Information gain due to knowing Wind: Gain(S, Wind) = Entropy(S) − (8/14)·Entropy(S_Weak) − (6/14)·Entropy(S_Strong) = 0.94 − (8/14)·0.811 − (6/14)·1.00 = 0.048
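A quick sketch checking this arithmetic (hypothetical helper code, not the authors'):

```python
import math

def entropy(pos, neg):
    """Binary entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

# S = [9+, 5-], split by Wind into Weak = [6+, 2-] and Strong = [3+, 3-]
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))  # 0.048
```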
Information Gain. Gain(S, A): expected reduction in entropy due to sorting S on attribute A (E denotes entropy). Example: for S = [29+, 35−], attribute A1 (values G, H) splits S into [21+, 5−] and [8+, 30−], while attribute A2 (values True, False) splits S into [18+, 33−] and [11+, 2−]. Entropy([29+, 35−]) = −(29/64) log2(29/64) − (35/64) log2(35/64) = 0.99
Selecting the Next Attribute. Splitting S = [9+, 5−] (E = 0.940) on Outlook gives: Sunny = [2+, 3−] (E = 0.971), Overcast = [4+, 0−] (E = 0.0), Rain = [3+, 2−] (E = 0.971). Gain(S, Outlook) = 0.940 − (5/14)·0.971 − (4/14)·0.0 − (5/14)·0.971 = 0.247
Selecting the Next Attribute. Splitting S = [9+, 5−] (E = 0.940) on Humidity gives: High = [3+, 4−] (E = 0.985), Normal = [6+, 1−] (E = 0.592); Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151. Splitting on Wind gives: Weak = [6+, 2−] (E = 0.811), Strong = [3+, 3−] (E = 1.0); Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048. Humidity provides a greater information gain than Wind.
First step: which attribute to test at the root? Gain(S, Outlook) = 0.247, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, Gain(S, Temperature) = 0.029. Outlook provides the best prediction for the target. Let's grow the tree: add to the tree a successor for each possible value of Outlook, and partition the training samples according to the value of Outlook.
After the first step
Second step. Working on the Outlook = Sunny node: Gain(S_Sunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970; Gain(S_Sunny, Wind) = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019; Gain(S_Sunny, Temperature) = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570. Humidity provides the best prediction for the target. Let's grow the tree: add to the tree a successor for each possible value of Humidity, and partition the training samples according to the value of Humidity.
Second and third steps
Overfitting and underfitting in Decision Trees. Overfitting results in decision trees that are more complex than necessary: tree growth went too far, the number of instances gets smaller as we build the tree (e.g., several leaves may match only a single example), and the training error no longer provides a good estimate of how well the tree will perform on previously unseen records. Underfitting: the model is too simple, so both training and test errors are large.
Avoiding Tree Overfitting – Solution 1. Stop the algorithm before it becomes a fully grown tree. Typical stopping conditions for a node: stop if all instances belong to the same class, or if all the attribute values are the same. More restrictive conditions: stop if the number of instances is less than some user-specified threshold, stop if the class distribution of the instances is independent of the available features, or stop if expanding the current node does not improve the impurity measure. (A scikit-learn sketch follows below.)
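In scikit-learn these early-stopping conditions correspond to constructor parameters of DecisionTreeClassifier; a minimal sketch on the built-in iris data (the threshold values are arbitrary examples):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing when a node is too small or a split gains too little purity
tree = DecisionTreeClassifier(
    max_depth=3,                # cap the depth of the tree
    min_samples_split=10,       # don't split nodes with fewer than 10 instances
    min_impurity_decrease=0.01, # don't split unless impurity drops by at least this much
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```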
Avoiding Tree Overfitting – Solution 2: post-pruning. Split the dataset into training and validation sets, and grow the full decision tree on the training set. Then, while the accuracy on the validation set increases: evaluate the impact of pruning each subtree (replacing its root by a leaf labeled with the majority class for that subtree), and replace the subtree whose removal most increases validation-set accuracy (a greedy approach). (A related scikit-learn sketch follows below.)
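scikit-learn does not implement the reduced-error procedure described above, but its cost-complexity pruning (the ccp_alpha parameter) can be tuned on a held-out validation set in the same spirit; a hedged sketch:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths computed from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Pick the alpha whose pruned tree scores best on the validation set
best = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
                  .fit(X_train, y_train).score(X_val, y_val),
)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X_train, y_train)
print(best, pruned.score(X_val, y_val))
```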
Advantages. Easy to understand and visualize. Little effort required for data preparation. Can handle both numerical and categorical data. Non-linear relationships between parameters do not affect performance.
Disadvantages. Overfitting: decision trees tend to overfit the training data. Not well suited to continuous variables: when working with continuous numerical variables, a decision tree loses information when it discretizes them into categories. Instability: small variations in the data can result in a completely different tree being generated.
Case study: predicting the income of a given population (a sketch of one possible setup follows below).
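A hedged sketch of how such a case study might be set up with scikit-learn; the file name adult_income.csv and the column names are assumptions for illustration, not the authors' actual dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical census-style data: file and column names are assumptions
df = pd.read_csv("adult_income.csv")
X = pd.get_dummies(df.drop(columns=["income"]))  # one-hot encode categorical attributes
y = df["income"]                                 # e.g. "<=50K" vs ">50K"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```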