Classification and Pattern Recognition


About This Presentation

Classification


Slide Content

Learning
Deductive Reasoning
•Deductive Reasoning – A type of logic in which one goes
from a general statement to a specific instance.
•The classic example
All men are mortal. (premise-I)
Socrates is a man. (premise-II)
Therefore, Socrates is mortal. (conclusion)
Propositional Logic
•A simple language useful for showing key ideas and definitions
•A set of propositional symbols, like P and Q.

•P means “It is hot.”
•Q means “It is humid.”
•R means “It is raining.”
•(P ∧ Q) → R
•“If it is hot and humid, then it is raining”
•Q → P
•“If it is humid, then it is hot”
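A small Python sketch (an illustration added here, not from the slides) that evaluates the two implications above over every truth assignment of P, Q and R:

from itertools import product

def implies(a, b):
    # Material implication: "a -> b" is false only when a is true and b is false.
    return (not a) or b

for P, Q, R in product([True, False], repeat=3):
    rule1 = implies(P and Q, R)   # "If it is hot and humid, then it is raining"
    rule2 = implies(Q, P)         # "If it is humid, then it is hot"
    print(f"P={P} Q={Q} R={R}  (P and Q)->R = {rule1}  Q->P = {rule2}")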
First Order Logic
–“It is illegal for students to copy music.”
–“Joe is a student.”
–“Every student copies music.”
–Is Joe a criminal?

•Knowledge Base:
(1) ∀x ∀y Student(x) ∧ Music(y) ∧ Copies(x, y) ⇒ Criminal(x)
(2) Student(Joe)
(3) ∀x ∃y Student(x) ⇒ Music(y) ∧ Copies(x, y)

•From (3), by Universal Elimination:
∃y Student(Joe) ⇒ Music(y) ∧ Copies(Joe, y)
•By Existential Elimination (introducing the constant SomeSong):
Student(Joe) ⇒ Music(SomeSong) ∧ Copies(Joe, SomeSong)
•By Modus Ponens with (2), and then with (1):
Music(SomeSong) ∧ Copies(Joe, SomeSong), hence Criminal(Joe)
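The derivation can be mimicked with a tiny forward-chaining sketch in Python (my own illustration; the tuple representation of facts and the Skolem constant name are assumptions made for the example):

facts = {("Student", "Joe")}

# Rule (3): every student copies some piece of music. Existential elimination is
# modeled by introducing a fresh Skolem constant for the copied song.
def every_student_copies_music(facts):
    new = set()
    for fact in list(facts):
        if fact[0] == "Student":
            song = "SomeSong_" + fact[1]       # Skolem constant, e.g. SomeSong_Joe
            new.add(("Music", song))
            new.add(("Copies", fact[1], song))
    return new

# Rule (1): Student(x) ∧ Music(y) ∧ Copies(x, y) ⇒ Criminal(x)
def copying_is_illegal(facts):
    new = set()
    for fact in list(facts):
        if fact[0] == "Copies":
            x, y = fact[1], fact[2]
            if ("Student", x) in facts and ("Music", y) in facts:
                new.add(("Criminal", x))
    return new

facts |= every_student_copies_music(facts)
facts |= copying_is_illegal(facts)
print(("Criminal", "Joe") in facts)   # True: Joe is a criminal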
Inductive Reasoning
Inductive reasoning involves going from a series of specific cases to a general
statement. The conclusion of an inductive argument is never guaranteed.

Example: What is the next number in the sequence 6, 13, 20, 27,…
There is more than one correct answer.

Inductive Reasoning
•Here’s the sequence again 6, 13, 20, 27,…
•Look at the difference of each term.
•13 – 6 = 7, 20 – 13 = 7, 27 – 20 = 7
•Thus the next term is 34, because 34 – 27 = 7.
•However, what if the sequence represents dates? Then the next number could be 3 (after a 31-day month).
•The next number could be 4 (after a 30-day month)
•Or it could be 5 (after a 29-day February in a leap year)
•Or even 6 (after a 28-day February)
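The constant-difference rule used above is easy to express in code. A quick sketch (added here purely as an illustration):

seq = [6, 13, 20, 27]
diffs = [b - a for a, b in zip(seq, seq[1:])]    # [7, 7, 7]

if len(set(diffs)) == 1:
    # The "constant difference" hypothesis: 34, because 34 - 27 = 7.
    print("Next term under this hypothesis:", seq[-1] + diffs[0])
else:
    print("No single constant difference fits; the sequence stays ambiguous.")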
Two types of learning in AI
Deductive: Deduce rules/facts from already known rules/facts. (We have already
dealt with this)



Inductive: Learn new rules/facts from a data set D = {(x(n), y(n))}, n = 1, …, N.
Motivation
•Learning is important for agents to deal with
–Unknown environments (lacks omniscience)
–Changes
•In many cases, it is more efficient to train an agent via examples,
than to “manually” extract knowledge from the examples
•Agents capable of learning can improve their performance
Medical
•Task: Does a patient have heart disease (on a scale from 1 to 4)?
•Training data:
•Age, sex, cholesterol, chest pain location, chest pain type, resting blood pressure, smoker?, fasting blood sugar, etc.
•Characterization of heart disease (0, 1-4)
•Output:
•Given a new patient, classification by disease
Example 1: Text Classification
Text
classifier
New text
file
class
Classified text files
Text file 1 trade
Text file 2 ship
… …
Training
Example 2: Disease Diagnosis
Disease
classifier
New
patient’s
data
Presence or
absence
Database of medical records
Patient 1’s data Absence
Patient 2’s data Presence
… …
Training

Face Recognition
Types of Learning
•Supervised (inductive) learning
–Training data includes desired outputs
•Unsupervised learning
–Training data does not include desired outputs
•Semi-supervised learning
–Training data includes a few desired outputs
•Reinforcement learning
–Rewards from sequence of actions
Need for an intermediate
approach
•Unsupervised and Supervised learning
–Two extreme learning paradigms
–Unsupervised learning
•collection of documents without any labels
•easy to collect
–Supervised learning
•each object tagged with a class.
•laborious job
•Semi-supervised learning
–Real life applications are somewhere in between.
Semi Supervised Learning
•Document collection D
•A subset D_K ⊂ D (with |D_K| ≪ |D|) has known labels
•Goal: to label the rest of the collection.
•Approach
–Train a supervised learner using D_K, the labeled subset.
–Apply the trained learner on the remaining documents, D \ D_K.
•Idea
–Harness information in D \ D_K to enable better learning.
•The unsupervised portion of the corpus, D \ D_K, adds to
–Vocabulary
–Knowledge about the joint distribution of terms
–Unsupervised measures of inter-document similarity.
•E.g.: site name, directory path, hyperlinks
•Put together multiple sources of evidence of similarity and class membership into a label-learning system.
–combine different features with partial supervision
Semi Supervised Learning

•Train a supervised learner on the available labeled data D_K
•Label all documents in D \ D_K
•Retrain the classifier using the new labels for documents where the classifier was most confident
•Continue until labels do not change any more.
Semi Supervised Learning
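A minimal self-training sketch of the loop above (an illustration, not from the slides; the choice of logistic regression and the 0.9 confidence threshold are assumptions):

import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_rounds=10):
    X_l, y_l, X_u = X_labeled.copy(), y_labeled.copy(), X_unlabeled.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                           # train on the labeled subset D_K
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold  # most confident predictions only
        if not confident.any():                     # labels no longer change
            break
        new_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])      # move them into the labeled pool
        y_l = np.concatenate([y_l, new_labels])
        X_u = X_u[~confident]
    return clf.fit(X_l, y_l)

Any classifier exposing fit and predict_proba could stand in for logistic regression here.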
Reinforcement learning
•Task
Learn how to behave successfully to achieve a goal while
interacting with an external environment
–Learn via experiences!
•Examples
–Game playing: the player knows whether it wins or loses, but not how to move at each step
–Control: a traffic system can measure the delay of cars, but does not know how to decrease it.

Issues
•Representation
•How to map from a representation in the domain to a representation used
for learning?
•Training data
•How can training data be acquired?
•Amount of training data
•How well does the algorithm do as we vary the amount of data?
•Which attributes influence learning most?
•Does the learning algorithm provide insight into the
generalizations made?
MNIST Training Data
Medical
•Task: Does a patient have heart disease (on a scale from 1 to 4)?
•Training data:
•Age, sex, cholesterol, chest pain location, chest pain type, resting blood pressure, smoker?, fasting blood sugar, etc.
•Characterization of heart disease (0, 1-4)
•Output:
•Given a new patient, classification by disease
Inductive Learning
•Inductive learning or “Prediction”:
–Given examples of a function (X, F(X))
–Predict function F(X) for new examples X
•Classification: Learning categories
F(X) = Discrete
•Regression: Learning function values
F(X) = Continuous

Inductive learning
Simplest form: learn a function from examples

f is the target function

An example is a pair (x, f(x))

Problem: find a hypothesis h
such that h ≈ f
given a training set of examples
Inductive learning method
•Construct/adjust h to agree with f on training set
•(h is consistent if it agrees with f on all examples)

•E.g., curve fitting:
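A quick curve-fitting sketch (an illustration with made-up data): two hypotheses that agree with f on the training set to different degrees, fitted with NumPy polynomials.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 8)
fx = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.shape)   # noisy samples of f

h_linear = np.polyfit(x, fx, deg=1)    # simple hypothesis
h_poly7 = np.polyfit(x, fx, deg=7)     # fits the training points (almost) exactly

x_new = 6.0                            # a point outside the training set
print("linear hypothesis predicts  :", np.polyval(h_linear, x_new))
print("degree-7 hypothesis predicts:", np.polyval(h_poly7, x_new))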

Inductive learning
•Construct h so that it agrees with f.
•The hypothesis h is consistent if it agrees with f on all observations.

•How to achieve good generalization?
(Figure captions: consistent linear fit; consistent 7th-order polynomial fit; inconsistent linear fit; consistent 6th-order polynomial fit; consistent sinusoidal fit.)
Inductive learning – example
(Three figure-only slides: the training points plotted in the x–y plane.)
Inductive learning and bias
•Suppose that we want to learn a function h and we are given
some sample (x, f(x)) pairs, as in figure (a)
•There are several hypotheses we could make about this
function, e.g.: (b), (c) and (d)
•A preference for one over the others reveals the bias of our
learning technique, e.g.:
–prefer piece-wise functions
–prefer a smooth function
–prefer a simple function and treat outliers as noise
Inductive learning – example
(Figure-only slide: the training points plotted in the x–y plane.)
Sometimes a consistent hypothesis is worse than an inconsistent one.
Preference bias: Ockham’s Razor
•The simplest consistent explanation is the best
Linear Correlation
(Figures: scatter plots of Y versus X illustrating strong and weak linear relationships.)
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
Linear Correlation
(Figure: scatter plot of Y versus X showing no relationship.)
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall
Calculating by hand…

r̂ = cov(x, y) / √(var x · var y)
  = [ Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1) ] / √{ [ Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ] · [ Σᵢ₌₁ⁿ (yᵢ − ȳ)² / (n − 1) ] }
Covariance indicates how two variables are related. A positive covariance means the variables
are positively related, while a negative covariance means the variables are inversely related.

Correlation tells you the degree to which the variables tend to move together.
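A small numeric check of the formula above (an illustration; the data values are made up). The 1/(n − 1) factors cancel, so r comes out the same either way.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
cov_xy = ((x - x_bar) * (y - y_bar)).sum() / (len(x) - 1)
var_x = ((x - x_bar) ** 2).sum() / (len(x) - 1)
var_y = ((y - y_bar) ** 2).sum() / (len(y) - 1)

r = cov_xy / np.sqrt(var_x * var_y)
print(r)                           # close to 1: a strong positive linear relationship
print(np.corrcoef(x, y)[0, 1])     # NumPy's built-in value matches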

Linear regression
In correlation, the two variables are treated as equals. In regression, one
variable is considered independent (=predictor) variable (X) and the other the
dependent (=outcome) variable Y.

If you know something about X, this knowledge helps you predict something about Y.
Regression equation
Expected value of y at a given level of x:  E(y | x) = β₀ + β₁x
For an individual observation:  Yᵢ = β₀ + β₁Xᵢ + εᵢ
Linear Regression Model
Relationship between the variables is a linear function:

yᵢ = β₀ + β₁·xᵢ + εᵢ

–yᵢ: dependent (response) variable, the predicted value for an individual
–xᵢ: independent (explanatory) variable
–β₀: population Y-intercept
–β₁: population slope
–εᵢ: random error (follows a normal distribution)
β₀ + β₁·xᵢ is fixed, exactly on the line; adding the random error εᵢ gives the value for an individual.
Least Squares Graphically
(Figure: a fitted regression line through four observations, with residuals ε̂₁, ε̂₂, ε̂₃, ε̂₄ drawn as vertical distances from each point to the line.)
LS minimizes Σᵢ ε̂ᵢ² = ε̂₁² + ε̂₂² + ε̂₃² + ε̂₄²
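A short least-squares sketch (an illustration with made-up data); np.polyfit chooses the line that minimizes the sum of squared residuals shown above.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.2, 2.9, 4.1, 4.8])

beta1, beta0 = np.polyfit(x, y, deg=1)     # slope and intercept of the LS line
residuals = y - (beta0 + beta1 * x)        # the residuals drawn in the figure
print("intercept:", beta0, "slope:", beta1)
print("sum of squared residuals:", (residuals ** 2).sum())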
Multiple Linear Regression
•More than one predictor…

E(y) = α + β₁·X + β₂·W + β₃·Z + …

Each regression coefficient is the amount of change
in the outcome variable that would be expected per
one-unit change of the predictor, if all other
variables in the model were held constant.
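A compact multiple-regression sketch (an illustration; the data and the true coefficients are made up) showing that interpretation: the recovered coefficients are close to the ones used to generate the data.

import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))                      # three predictors: X, W, Z
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(0.0, 0.1, n)

A = np.column_stack([np.ones(n), X])             # add an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # [alpha, beta1, beta2, beta3]
print(coef.round(2))                             # close to [1.0, 2.0, -0.5, 3.0]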


Non Linear Regression
•A form of regression analysis in which observational
data are modeled by a function which is a nonlinear
combination of the model parameters and depends on
one or more independent variables.

•Nonlinear regression is characterized by the fact that the
prediction equation depends nonlinearly on one or more
unknown parameters.

Inductive learning - example
•f(x) is the target function
•An example is a pair [x, f(x)]
•Learning task: find a hypothesis h such that h(x) ≈ f(x), given a training set of examples D = {[xᵢ, f(xᵢ)]}, i = 1, 2, …, N

  

 1)(,
0
0
1
0
1
0
1
1
1




































 xx f
 
 
 1)(,
0
0
1
1
1
0
0
1
1




































 xx f
  

 0)(,
0
1
0
1
1
0
0
1
1




































 xx f
Etc...
Terminology
(Figure: a two-dimensional feature space, axes 0.0-6.0 and 0.0-3.0.)
Feature Space:
Properties that describe the problem
Terminology
(Figure: the feature space populated with training examples marked + and −.)
Example:
<0.5,2.8,+>
Terminology
(Figure: a hypothesis boundary separating a region labeled + from a region labeled −; points marked ? are new examples to be labeled.)
Hypothesis:
Function for labeling examples
Terminology
(Figure: the same labeled examples, over which many candidate hypotheses could be drawn.)
Hypothesis Space:
Set of legal hypotheses

The idealized inductive learning problem
(Figure: the hypothesis space H, the target function f(x) outside it, the best hypothesis h_opt(x) ∈ H, and the remaining error between them.)
Find an appropriate hypothesis space H and find h(x) ∈ H with minimum “distance” to f(x) (“error”)
Overfitting
•Irrelevant attributes, can result in overfitting the
training example data
•If hypothesis space has many dimensions (large
number of attributes), we may find meaningless
regularity in the data that is irrelevant to the true,
important, distinguishing features

Text Classification
•Is textᵢ a finance article?
Positive Negative
20 attributes
•Investors 2
•Dow 2
•Jones 2
•Industrial 1
•Average 3
•Percent 5
•Gain 6
•Trading 8
•Broader 5
•stock 5
•Indicators 6
•Standard 2
•Rolling 1
•Nasdaq 3
•Early 10
•Rest 12
•More 13
• first 11
•Same 12
•The 30
20 attributes
•Men’s
•Basketball
•Championship
•UConn Huskies
•Georgia Tech
•Women
•Playing
•Crown
•Titles
•Games
•Rebounds
•All-America
•early
•rolling
•Celebrates
•Rest
•More
•First
•The
•same
k Nearest-Neighbor Classification

Instance Based Classifiers
•Lazy learning
–No model is learned
–The stored training instances themselves represent the knowledge
–Training instances are searched for instance that most closely resembles
new instance
•Examples:
–Rote-learner
Memorizes entire training data and performs classification only if
attributes of record match one of the training examples exactly
Rote Learning
Nearest Neighbor Classification
Instance Based Classifiers
•Lazy learning
–No model is learned
–The stored training instances themselves represent the knowledge
–Training instances are searched for instance that most closely resembles
new instance
•Examples:
–Rote-learner
Memorizes entire training data and performs classification only if
attributes of record match one of the training examples exactly

–Nearest-neighbor classifier
Uses the k “closest” points (nearest neighbours) for performing classification
Nearest Neighbor Classifier

Classification
•The predicted class is determined from the nearest neighbor
list
•Take the majority vote of class labels among the k-nearest
neighbors
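A tiny k-nearest-neighbor sketch of the rule above (an illustration; the points and labels are made up): compute distances, keep the k closest training instances, and return the majority class.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)    # majority vote of their labels
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["-", "-", "+", "+"])
print(knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3))   # "+"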
Nearest-Neighbor Classifiers
Distance Functions
Distance Functions for Numerical Attributes
•Scaling issues
–Attributes may have to be scaled to prevent distance
measures from being dominated by one of the
attributes
–Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
–The proximity measure may then be dominated by differences in the weight and income attributes
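A short sketch of the scaling fix (an illustration; the attribute values are made up): min-max scale each attribute to [0, 1] before computing distances so no single attribute dominates.

import numpy as np

# columns: height (m), weight (lb), income ($)
X = np.array([
    [1.6, 120.0,  30000.0],
    [1.8, 200.0,  45000.0],
    [1.7, 150.0, 900000.0],
])

def minmax_scale(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

raw_dist = np.linalg.norm(X[0] - X[1])        # dominated by income and weight
X_scaled = minmax_scale(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(raw_dist, scaled_dist)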

Distance Functions for Symbolic
Attributes
Feature Weighting
Lazy Learning Algorithms
•kNN is considered a lazy learning algorithm
oDefers data processing until it receives a request to
classify an unlabelled example
oReplies to a request for information by combining its
stored training data
•Other names for lazy algorithms
oMemory-based, Instance-based , Exemplar-based , Case-
based
•This strategy is opposed to eager learning algorithms, which
oCompile their data into a compressed description or model
oDiscard the training data after compilation of the model
oClassify incoming patterns using the induced model

Efficiency of NN algorithms
Decision Trees
•An inductive learning task
–Use particular facts to make more generalized conclusions

•A predictive model based on a branching series of
Boolean tests
–These smaller Boolean tests are less complex than a one-stage
classifier

•A decision tree encodes a classifier or regression function in the form of a tree
Learning and Decision Trees
•Based on a set of attributes given as input, a decision (the predicted output value) is learned

•Boolean or binary classification
–output values are true or false
–The simplest case.

•making decisions
–a sequence of tests is performed, testing the value of one of the attributes in each step
–when a leaf node is reached, its value is returned
–good correspondence to human decision-making

Decision tree-induced partition – example
(Figure: a decision tree that tests Color at the root (red, green, blue), with Shape (round, square) and Size (big, small) tests at internal nodes and +/− labels at the leaves, shown together with the partition it induces.)
Learning decision trees
•Goal: Build a decision tree to classify examples as
positive or negative instances using supervised
learning from a training set
•A decision tree is a tree where
– each non-leaf node has associated
with it an attribute (feature)
–each leaf node has associated
with it a classification (+ or -)
–each arc has associated with it one
of the possible values of the attribute
at the node from which the arc is directed
•Generalization: allow for >2 classes
–e.g., for stocks, classify into {sell, hold, buy}
(Figure: the Color/Shape/Size example tree from the previous slide.)
Types of Variables
•Numerical: Domain is ordered and can be represented on
the real line (e.g., age, income)
•Nominal or categorical: Domain is a finite set without
any natural ordering (e.g., occupation, marital status,
race)
•Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
Internal Nodes
•Each internal node has an associated splitting predicate.
Most common are binary predicates.
Example predicates:
–Age <= 20
–Profession in {student, teacher}
–5000*Age + 3*Salary – 10000 > 0
Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1.Alternate: is there an alternative restaurant nearby?
2.Bar: is there a comfortable bar area to wait in?
3.Fri/Sat: is today Friday or Saturday?
4.Hungry: are we hungry?
5.Patrons: number of people in the restaurant (None, Some, Full)
6.Price: price range ($, $$, $$$)
7.Raining: is it raining outside?
8.Reservation: have we made a reservation?
9.Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations
•Examples described by attribute values
•E.g., situations where I will/won't wait for a table:











•Classification of examples is positive (T) or negative (F)

Restaurant Sample Set

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
X1      | Yes | No  | No  | Yes | Some | $$$   | No   | Yes | French  | 0-10  | Yes
X2      | Yes | No  | No  | Yes | Full | $     | No   | No  | Thai    | 30-60 | No
X3      | No  | Yes | No  | No  | Some | $     | No   | No  | Burger  | 0-10  | Yes
X4      | Yes | No  | Yes | Yes | Full | $     | No   | No  | Thai    | 10-30 | Yes
X5      | Yes | No  | Yes | No  | Full | $$$   | No   | Yes | French  | >60   | No
X6      | No  | Yes | No  | Yes | Some | $$    | Yes  | Yes | Italian | 0-10  | Yes
X7      | No  | Yes | No  | No  | None | $     | Yes  | No  | Burger  | 0-10  | No
X8      | No  | No  | No  | Yes | Some | $$    | Yes  | Yes | Thai    | 0-10  | Yes
X9      | No  | Yes | Yes | No  | Full | $     | Yes  | No  | Burger  | >60   | No
X10     | Yes | Yes | Yes | Yes | Full | $$$   | No   | Yes | Italian | 10-30 | No
X11     | No  | No  | No  | No  | None | $     | No   | No  | Thai    | 0-10  | No
X12     | Yes | Yes | Yes | Yes | Full | $     | No   | No  | Burger  | 30-60 | Yes
Decision trees
•One possible representation for hypotheses
•Here is a tree for deciding whether to wait:
Learning Decision Trees
•problem: find a decision tree that agrees with the training
set
•trivial solution: construct a tree with one branch for each
sample of the training set
–works perfectly for the samples in the training set
–may not work well for new samples (generalization)
–results in relatively large trees
•better solution: find a concise tree that still agrees with all
samples
–corresponds to the simplest hypothesis that is consistent with the
training set
Ockham’s Razor
The most likely hypothesis is the simplest one that is
consistent with all observations.
–general principle for inductive learning
–a simple hypothesis that is consistent with all observations is more
likely to be correct than a complex one
Restaurant Sample Set
(The sample set table is repeated on this slide; see above.)
•select best attribute
–candidate 1: Patrons; the Some and None values are in full agreement with the goal
–candidate 2: Type; no values are in agreement with the goal

Choosing an attribute
•Idea: a good attribute splits the examples into subsets that
are (ideally) "all positive" or "all negative"






•Patrons? is a better choice
Partial Decision Tree
•Patrons needs further discrimination only
for the Full value
•None and Some agree with the Will Wait
goal predicate
•the next step will be performed on the
remaining samples for the Full value of
Patrons
(Partial tree: at the root, Patrons? splits the examples (+: X1, X3, X4, X6, X8, X12; −: X2, X5, X7, X9, X10, X11) into
–None: X7, X11 → No
–Some: X1, X3, X6, X8 → Yes
–Full: X4, X12 (+) and X2, X5, X9, X10 (−), still mixed)
Splitting
examples
by testing
attributes
Restaurant Sample Set
(The sample set table is repeated on this slide; see above.)
•select next best attribute
–candidate 1: Hungry; the No value is in agreement with the goal
–candidate 2: Type; no values are in agreement with the goal
Partial Decision Tree
•Hungry needs further discrimination
only for the Yes value
•No agrees with the WillWait goal
predicate
•the next step will be performed on
the remaining samples for the Yes
value of Hungry
(Partial tree: Patrons? at the root as before; its Full branch is split further on Hungry?
–Yes: X4, X12 (+) and X2, X10 (−), still mixed
–No: X5, X9 → No)

Finding ‘compact’ decision trees
•Finding the smallest decision tree is difficult.
•There are heuristics that find reasonable decision trees in most
practical cases.
• Idea: test the most important attribute first
–attribute that makes the most difference for the classification of an example
–hopefully will yield the correct classification with few tests

Choosing the best attribute
•Key problem: choosing which attribute to split a given set
of examples
•Some possibilities are:
–Random: Select any attribute at random
–Least-Values: Choose the attribute with the smallest number of
possible values (e.g. Hungry)
–Most-Values: Choose the attribute with the largest number of
possible values (e.g. Type)
–Max-Gain: Choose the attribute that has the largest expected
information gain–i.e., attribute that results in smallest expected
size of sub-trees rooted at its children
•The ID3 algorithm uses the Max-Gain method of selecting
the best attribute
Choosing an attribute
Idea: a good attribute splits the examples into subsets
that are (ideally) “all positive” or “all negative”






Which is better: Patrons? or Type?
Decision tree learning
•Aim: find a small tree consistent with the training examples
•Idea: (recursively) choose "most significant" attribute as root of (sub)tree

Disorder is bad
Homogeneity is good
(Figure: example splits ranked from bad, through better, to good.)
•We want a measure that prefers attributes that have a high degree of “order”:
–Maximum order: All examples are of the same class
–Minimum order: All classes are equally likely
•Entropy is a measure for (un-)orderedness
–all examples of the same class → no information

•ID3 splits attributes based on their entropy.

Entropy (for two classes)
(Figure: entropy plotted against the percentage of examples that are positive; it peaks at a 50-50 class split, maximum disorder, and falls to zero for an all-positive, pure distribution.)
The entropy is maximal when all possibilities are equally likely.

The goal of the decision tree is to decrease the entropy in each node.

Entropy is zero in a pure ”yes” node (or pure ”no” node).
Attribute selection criteria:
•Create pure nodes whenever possible
•If pure nodes are not possible, choose the split that leads to the largest decrease in entropy.

Entropy = −P(yes)·ln[P(yes)] − P(no)·ln[P(no)]

General form: Entropy = −Σᵢ P(vᵢ)·ln P(vᵢ)
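A small sketch of the entropy measure (an illustration; note that the worked numbers on the following slides, e.g. 0.30 for a 6/6 split, come out with base-10 logarithms, so the base is a parameter here):

import math

def entropy(counts, base=10):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:                 # 0 * log(0) is taken to be 0
            p = c / total
            h -= p * math.log(p, base)
    return h

print(entropy([6, 6]))   # ~0.30  maximum disorder for two classes
print(entropy([4, 0]))   # 0.0    pure node
print(entropy([2, 4]))   # ~0.28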
Decision tree learning example
10 attributes:
1.Alternate: Is there a suitable alternative restaurant nearby? {yes,no}
2.Bar: Is there a bar to wait in? {yes,no}
3.Fri/Sat: Is it Friday or Saturday? {yes,no}
4.Hungry: Are you hungry? {yes,no}
5.Patrons: How many are seated in the restaurant? {none, some, full}
6.Price: Price level {$,$$,$$$}
7.Raining: Is it raining? {yes,no}
8.Reservation: Did you make a reservation? {yes,no}
9.Type: Type of food {French,Italian,Thai,Burger}
10.Wait: {0-10 min, 10-30 min, 30-60 min, >60 min}
Decision tree learning example
T = True, F = False
6 True, 6 False

Entropy = −(6/12)·log(6/12) − (6/12)·log(6/12) ≈ 0.30
(base-10 logarithms are used here, which gives the values shown)
Decision tree learning example

Alternate?
–Yes: 3 T, 3 F
–No: 3 T, 3 F

Entropy = (6/12)·[−(3/6)·log(3/6) − (3/6)·log(3/6)] + (6/12)·[−(3/6)·log(3/6) − (3/6)·log(3/6)] ≈ 0.30

Entropy decrease = 0.30 − 0.30 = 0
Decision tree learning example

Patrons?
–None: 2 F
–Some: 4 T
–Full: 2 T, 4 F

Entropy = (2/12)·[−(0/2)·log(0/2) − (2/2)·log(2/2)] + (4/12)·[−(4/4)·log(4/4) − (0/4)·log(0/4)] + (6/12)·[−(2/6)·log(2/6) − (4/6)·log(4/6)] ≈ 0.14

Entropy decrease = 0.30 − 0.14 = 0.16

Decision tree learning example

Price?
–$: 3 T, 3 F
–$$: 1 T, 3 F
–$$$: 2 T

Entropy = (6/12)·[−(3/6)·log(3/6) − (3/6)·log(3/6)] + (2/12)·[−(2/2)·log(2/2) − (0/2)·log(0/2)] + (4/12)·[−(1/4)·log(1/4) − (3/4)·log(3/4)] ≈ 0.23

Entropy decrease = 0.30 − 0.23 = 0.07
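The three splits above can be checked with the entropy() helper sketched earlier (an illustration; base-10 logarithms again):

import math

def entropy(counts, base=10):
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

def split_entropy(branches):
    # branches: one [positives, negatives] count pair per attribute value
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * entropy(b) for b in branches)

root = entropy([6, 6])                                   # ~0.30
for name, branches in [("Alternate", [[3, 3], [3, 3]]),
                       ("Patrons",   [[0, 2], [4, 0], [2, 4]]),
                       ("Price",     [[3, 3], [1, 3], [2, 0]])]:
    print(name, "entropy decrease ~", round(root - split_entropy(branches), 2))
# Alternate ~ 0.0, Patrons ~ 0.16, Price ~ 0.07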
ID3 Algorithm
•A greedy algorithm for decision tree construction developed by
Ross Quinlan circa 1987
•Top-down construction of decision tree by recursively selecting
“best attribute” to use at the current node in tree
–Once attribute is selected for current node, generate child
nodes, one for each possible value of selected attribute
–Partition examples using the possible values of this attribute,
and assign these subsets of the examples to the appropriate
child node
–Repeat for each child node until all examples associated with
a node are either all positive or all negative
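A compact recursive sketch of these steps (an illustration only; attribute selection here uses the entropy decrease described earlier, and ties, pruning, and missing values are ignored):

import math
from collections import Counter

def entropy(labels, base=10):
    n = len(labels)
    return -sum(c / n * math.log(c / n, base) for c in Counter(labels).values())

def id3(examples, labels, attributes):
    # Stop when all examples at this node are positive or all are negative.
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority label fallback

    def remainder(attr):
        # Expected entropy after splitting on attr.
        total = 0.0
        for value in set(ex[attr] for ex in examples):
            subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
            total += len(subset) / len(labels) * entropy(subset)
        return total

    best = min(attributes, key=remainder)             # largest entropy decrease
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in set(ex[best] for ex in examples):
        sub_ex = [ex for ex in examples if ex[best] == value]
        sub_lab = [lab for ex, lab in zip(examples, labels) if ex[best] == value]
        tree[best][value] = id3(sub_ex, sub_lab, rest)
    return tree

# Tiny usage example with two restaurant-style attributes (values made up here):
examples = [{"Pat": "Some", "Hun": "Yes"}, {"Pat": "Full", "Hun": "No"},
            {"Pat": "None", "Hun": "No"}, {"Pat": "Full", "Hun": "Yes"}]
labels = ["Yes", "No", "No", "Yes"]
print(id3(examples, labels, ["Pat", "Hun"]))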
Next step
Given Patrons as root node, the next attribute chosen is
Hungry?






Decision tree learning example
Induced tree (from examples)
Cross-validation
Use a “validation set”.

Split your data set into two parts: D_train for training your model and D_val for validating your model.

The error on the validation data is called the “validation error” (E_val), and it estimates the generalization error: E_gen ≈ E_val.
K-Fold Cross-validation
More accurate than using only one validation set.

Split the data into K folds; each fold serves once as the validation set D_val, with the remaining folds as D_train, giving validation errors E_val(1), E_val(2), …, E_val(K).

E_gen ≈ (1/K) · Σₖ₌₁ᴷ E_val(k)
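A short manual K-fold sketch of the estimate above (an illustration; the line-fitting model and the data are placeholders):

import numpy as np

def kfold_error(X, y, fit, error, K=3, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]                                   # D_val for this fold
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])                  # train on D_train
        errs.append(error(model, X[val], y[val]))        # E_val(k)
    return np.mean(errs)                                 # estimate of E_gen

# Usage with a 1-D least-squares line as the model:
X = np.linspace(0, 5, 30)
y = 2 * X + 1 + np.random.default_rng(1).normal(0, 0.3, 30)
fit = lambda Xtr, ytr: np.polyfit(Xtr, ytr, deg=1)
error = lambda m, Xv, yv: np.mean((np.polyval(m, Xv) - yv) ** 2)
print("estimated generalization error:", kfold_error(X, y, fit, error, K=5))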
Noisy data
•Two examples have same attribute/value pairs, but
different classifications

•Some values of attributes are incorrect because of errors
in the data acquisition process or the preprocessing phase
•The classification is wrong (e.g., + instead of -) because of
some error

•Some attributes are irrelevant to the decision-making
process, e.g., color of a die is irrelevant to its outcome