Lecture-32
What is classification? What is prediction?
What is Classification & Prediction
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Prediction:
models continuous-valued functions
predicts unknown or missing values
Applications
credit approval
target marketing
medical diagnosis
treatment effectiveness analysis
Classification—A Two-Step Process
Learning step: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is the training set
The model is represented as classification rules, decision trees, or mathematical formulae
Classification step: classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
The test set is independent of the training set, otherwise over-fitting will occur
Classification Process: Model Construction
Training data:
NAME     RANK            YEARS   TENURED
Mike     Assistant Prof  3       no
Mary     Assistant Prof  7       yes
Bill     Professor       2       yes
Jim      Associate Prof  7       yes
Dave     Assistant Prof  6       no
Anne     Associate Prof  3       no
A classification algorithm learns a classifier (model) from the training data, e.g.:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process: Use the Model in Prediction
The classifier is first applied to test data to estimate its accuracy, and then to unseen data.
Testing data:
NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes
Unseen data: (Jeff, Professor, 4). Tenured?
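To make the two-step process concrete, here is a minimal sketch in Python on the tenure example above; the integer encoding of RANK and the use of scikit-learn's DecisionTreeClassifier are illustrative assumptions, not part of the lecture.

```python
# Sketch of the two-step process on the tenure example above.
# The integer rank encoding and the use of scikit-learn are illustrative choices.
from sklearn.tree import DecisionTreeClassifier

rank_code = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}

train = [("Assistant Prof", 3, "no"), ("Assistant Prof", 7, "yes"),
         ("Professor", 2, "yes"), ("Associate Prof", 7, "yes"),
         ("Assistant Prof", 6, "no"), ("Associate Prof", 3, "no")]

X = [[rank_code[rank], years] for rank, years, _ in train]   # attribute values
y = [tenured for _, _, tenured in train]                     # class labels

model = DecisionTreeClassifier().fit(X, y)                   # step 1: learn the model

# Step 2: use the model on unseen data, e.g. (Jeff, Professor, 4)
print(model.predict([[rank_code["Professor"], 4]]))          # likely ['yes']
```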
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data are classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Lecture-33
Issues regarding classification and prediction
Issues regarding classification and prediction: Preparing the data for classification and prediction
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues regarding classification and prediction: Comparing Classification Methods
Accuracy
Speed and scalability
time to construct the model
time to use the model
Robustness
handling noise and missing values
Scalability
efficiency on disk-resident databases
Interpretability
understanding and insight provided by the model, e.g., decision tree size and compactness of classification rules
Lecture-34
Classification by decision tree induction
Classification by Decision Tree Induction
Decision tree
A flow-chart-like tree structure
Internal node denotes a test on an attribute
Branch represents an outcome of the test
Leaf nodes represent class labels or class distributions
Decision tree generation consists of two phases
Tree construction
At the start, all the training examples are at the root
Partition examples recursively based on selected attributes
Tree pruning
Identify and remove branches that reflect noise or outliers
Use of a decision tree: classifying an unknown sample
Test the attribute values of the sample against the decision tree
Training Dataset
age      income   student   credit_rating
<=30     high     no        fair
<=30     high     no        excellent
31…40    high     no        fair
>40      medium   no        fair
>40      low      yes       fair
>40      low      yes       excellent
31…40    low      yes       excellent
<=30     medium   no        fair
<=30     low      yes       fair
>40      medium   yes       fair
<=30     medium   yes       excellent
31…40    medium   no        excellent
31…40    high     yes       fair
>40      medium   no        excellent
This follows an example from Quinlan's ID3; the class label attribute is buys_computer.
Output: A Decision Tree for “buys_computer”
age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31…40: buys_computer = yes
  >40: credit_rating?
    excellent: buys_computer = yes
    fair: buys_computer = no
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
The tree is constructed in a top-down, recursive, divide-and-conquer manner
At the start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
There are no samples left
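A compact sketch of this greedy, top-down procedure, assuming categorical attributes and using information gain (defined on the following slides) as the selection measure; the data layout (rows as dicts keyed by attribute name) is an illustrative choice.

```python
# Sketch of the greedy, top-down induction loop described above,
# for categorical attributes, with information gain as the selection measure.
from collections import Counter
from math import log2

def entropy(labels):
    counts = Counter(labels)
    return sum(-c / len(labels) * log2(c / len(labels)) for c in counts.values())

def build_tree(rows, labels, attributes):
    """rows: list of dicts keyed by attribute name; labels: their class labels."""
    if len(set(labels)) == 1:                    # all samples belong to the same class
        return labels[0]
    if not attributes:                           # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]

    def gain(a):                                 # information gain of splitting on a
        remainder = 0.0
        for v in set(r[a] for r in rows):
            subset = [lab for r, lab in zip(rows, labels) if r[a] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)             # greedy choice of the test attribute
    tree = {}
    for v in set(r[best] for r in rows):         # partition the examples recursively
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attributes if a != best])
    return tree
```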
Attribute Selection Measure
Information gain (ID3/C4.5)
All attributes are assumed to be categorical
Can be modified for continuous-valued attributes
Gini index (IBM IntelligentMiner)
All attributes are assumed to be continuous-valued
Assume there exist several possible split values for each attribute
May need other tools, such as clustering, to get the possible split values
Can be modified for categorical attributes
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Assume there are two classes, P and N
Let the set of examples S contain p elements of class P and n elements of class N
The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}
Information Gain in Decision Tree Induction
Assume that using attribute A a set S will be partitioned into sets {S_1, S_2, …, S_v}
If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} I(p_i, n_i)

The encoding information that would be gained by branching on A is

Gain(A) = I(p, n) - E(A)
Attribute Selection by Information Gain Computation
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age:

age      p_i   n_i   I(p_i, n_i)
<=30     2     3     0.971
31…40    4     0     0
>40      3     2     0.971

E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69

Hence

Gain(age) = I(p, n) - E(age) = 0.940 - 0.69 = 0.25

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
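The same computation can be reproduced in a few lines of Python; the per-age counts are taken from the table above, and the function names are illustrative.

```python
# Sketch: the information-gain computation for age, reproducing the slide's numbers.
from math import log2

def info(p, n):
    """I(p, n): expected information for p examples of P and n examples of N."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

def expected_info(partitions):
    """E(A): weighted information after splitting into (p_i, n_i) partitions."""
    total = sum(p + n for p, n in partitions)
    return sum((p + n) / total * info(p, n) for p, n in partitions)

age_partitions = [(2, 3), (4, 0), (3, 2)]                     # <=30, 31..40, >40
print(round(info(9, 5), 3))                                   # 0.94
print(round(expected_info(age_partitions), 3))                # ~0.694
print(round(info(9, 5) - expected_info(age_partitions), 3))   # Gain(age) ~ 0.246
```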
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in T.
If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the gini index of the split data is defined as

gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)

The attribute that provides the smallest gini_{split}(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
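A small sketch of these two definitions in Python; the class counts and the example split are illustrative, not taken from the lecture.

```python
# Sketch: gini(T) and gini_split(T) as defined above; the counts are illustrative.
def gini(class_counts):
    """gini(T) = 1 - sum_j p_j^2, with p_j the relative frequency of class j in T."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    """gini_split(T) = N1/N * gini(T1) + N2/N * gini(T2)."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# e.g. 9 'yes' / 5 'no' examples split into (6 yes, 1 no) and (3 yes, 4 no)
print(round(gini([9, 5]), 3))                # ~0.459
print(round(gini_split([6, 1], [3, 4]), 3))  # ~0.367
```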
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
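The example rules above can be encoded directly as a function, one branch per root-to-leaf path; the attribute-value strings used below are purely illustrative.

```python
# Sketch: the example rules above written directly as a function,
# one branch per root-to-leaf path; the value strings are illustrative.
def classify(age, student, credit_rating):
    if age == "<=30":
        return "no" if student == "no" else "yes"
    if age == "31..40":
        return "yes"
    # age == ">40": the test is on credit_rating, as in the last two rules
    return "yes" if credit_rating == "excellent" else "no"

print(classify("<=30", "yes", "fair"))     # 'yes'
print(classify(">40", "no", "excellent"))  # 'yes'
```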
Avoid Overfitting in Classification
The generated tree may overfit the training data
Too many branches, some of which may reflect anomalies due to noise or outliers
The result is poor accuracy on unseen samples
Two approaches to avoid overfitting
Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
Approaches to Determine the Final Tree Size
Separate training and testing sets
Use cross-validation, e.g., 10-fold cross-validation
Use all the data for training
apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node improves the distribution
Use the minimum description length (MDL) principle:
halt growth of the tree when the encoding is minimized
Enhancements to basic decision tree induction
Allow for continuous-valued attributes
Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
Handle missing attribute values
Assign the most common value of the attribute
Assign a probability to each of the possible values
Attribute construction
Create new attributes based on existing ones that are sparsely represented
This reduces fragmentation, repetition, and replication
Classification in Large Databases
Classification is a classical problem extensively studied by statisticians and machine learning researchers
Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively fast learning speed (compared with other classification methods)
convertible to simple and easy-to-understand classification rules
can use SQL queries for accessing databases
classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT’96, Mehta et al.)
builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB’96, J. Shafer et al.)
constructs an attribute-list data structure
PUBLIC (VLDB’98, Rastogi & Shim)
integrates tree splitting and tree pruning: stops growing the tree earlier
RainForest (VLDB’98, Gehrke, Ramakrishnan & Ganti)
separates the scalability aspects from the criteria that determine the quality of the tree
builds an AVC-list (attribute, value, class label)
Lecture-35
Bayesian Classification
Statistical classifiers
Based on Bayes’ theorem
Naïve Bayesian classification
Class conditional independence
Bayesian belief networks
Bayesian Classification
Probabilistic learning
Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
Incremental
Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction
Predict multiple hypotheses, weighted by their probabilities
Standard
Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayes’ Theorem
Let X be a data tuple and H a hypothesis, such as the hypothesis that X belongs to a specific class C.
The posterior probability of H conditioned on X, P(H|X), follows Bayes’ theorem:

P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}
Naïve Bayes Classifier
Let D be a training set of tuples and their associated class labels
X = (x1, x2, x3, …, xn), and there are m classes C1, C2, C3, …, Cm
Bayes’ theorem:
P(Ci|X) = P(X|Ci)·P(Ci) / P(X)
Naïve Bayes predicts that X belongs to class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i
Naïve Bayes Classifier
P(X) is constant for all classes, so only P(X|Ci)·P(Ci) needs to be maximized
If the class priors are not known, it is commonly assumed that P(C1) = P(C2) = … = P(Cm)
Otherwise P(Ci) is estimated as |Ci,D| / |D|, the fraction of training tuples in class Ci
Naïve Bayesian Classification
Naïve assumption: attribute independence
P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)
If the attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C
If the attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
Computationally easy in both cases
Play-tennis example: estimating P(xi|C)

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

P(p) = 9/14, P(n) = 5/14

outlook: P(sunny|p) = 2/9, P(overcast|p) = 4/9, P(rain|p) = 3/9; P(sunny|n) = 3/5, P(overcast|n) = 0, P(rain|n) = 2/5
temperature: P(hot|p) = 2/9, P(mild|p) = 4/9, P(cool|p) = 3/9; P(hot|n) = 2/5, P(mild|n) = 2/5, P(cool|n) = 1/5
humidity: P(high|p) = 3/9, P(normal|p) = 6/9; P(high|n) = 4/5, P(normal|n) = 2/5
windy: P(true|p) = 3/9, P(false|p) = 6/9; P(true|n) = 3/5, P(false|n) = 2/5
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Sample X is classified in class n (don’t play)
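A short sketch that reproduces this computation from the play-tennis table, estimating each P(xi|C) as a relative frequency under the class-conditional independence assumption; the data structure is an illustrative choice.

```python
# Sketch: naive Bayes on the play-tennis data from the slide above.
# Each P(x_i|C) is estimated as a relative frequency, exactly as in the tables.
def score(x, cls, data):
    """P(X|cls) * P(cls) under the class-conditional independence assumption."""
    rows = [r for r in data if r[-1] == cls]
    prior = len(rows) / len(data)                      # P(cls), e.g. P(p) = 9/14
    likelihood = 1.0
    for i, value in enumerate(x):                      # product of P(x_i | cls)
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    return prior * likelihood

DATA = [  # (outlook, temperature, humidity, windy, class)
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

x = ("rain", "hot", "high", "false")
print(round(score(x, "P", DATA), 6), round(score(x, "N", DATA), 6))  # ~0.010582 vs ~0.018286
print("P" if score(x, "P", DATA) > score(x, "N", DATA) else "N")     # "N": don't play
```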
How effective are Bayesian classifiers?
The class conditional independence assumption
makes computation possible
yields optimal classifiers when it is satisfied
but it is seldom satisfied in practice, as attributes (variables) are often correlated
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first
Bayesian Belief Networks (I)
An example network relates the variables FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea.
The conditional probability table for the variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
Several cases of learning Bayesian belief networks
Given both the network structure and all the variables: easy
Given the network structure but only some variables
When the network structure is not known in advance
Lecture-36
Classification by Backpropagation
Classification by Backpropagation
A neural network learning algorithm
Originated with psychologists and neurobiologists
A neural network is a set of connected input/output units in which each connection has a weight associated with it.
Neural Networks
Advantages
prediction accuracy is generally high
robust: works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
fast evaluation of the learned target function
Limitations
long training time
difficult to understand the learned function (weights)
not easy to incorporate domain knowledge
A Neuron
The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping.
The inputs x_0, x_1, …, x_n are combined with the weight vector w into a weighted sum, a bias \mu_k is subtracted, and an activation function f produces the output:

y = f\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)
Network Training
The ultimate objective of training
obtain a set of weights that makes almost all the tuples in the training data classified correctly
Steps
Initialize weights with random values
Feed the input tuples into the network one by one
For each unit
Compute the net input to the unit as a linear combination of all the inputs to the unit
Compute the output value using the activation function
Compute the error
Update the weights and the bias
Multi-Layer Perceptron
The input vector x_i enters at the input nodes, is propagated through the hidden nodes via the weights w_ij, and produces the output vector at the output nodes.
Net input and output of unit j:

I_j = \sum_i w_{ij} O_i + \theta_j, \qquad O_j = \frac{1}{1 + e^{-I_j}}

Error of an output unit j with target value T_j, and of a hidden unit j:

Err_j = O_j (1 - O_j)(T_j - O_j), \qquad Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}

Weight and bias updates, with learning rate l:

w_{ij} = w_{ij} + (l)\, Err_j\, O_i, \qquad \theta_j = \theta_j + (l)\, Err_j
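A minimal numpy sketch of one forward and backward pass that follows these update equations; the layer sizes, learning rate, and random data are illustrative assumptions.

```python
# Sketch: one backpropagation update for a single hidden layer, following the
# equations above. Layer sizes, learning rate, and random data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 4, 3, 1, 0.5

W1, b1 = rng.normal(size=(n_in, n_hid)), np.zeros(n_hid)   # input -> hidden weights, biases
W2, b2 = rng.normal(size=(n_hid, n_out)), np.zeros(n_out)  # hidden -> output weights, biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                        # O_j = 1 / (1 + e^{-I_j})

x = rng.random(n_in)          # one input tuple
t = np.array([1.0])           # its target value T_j

# Forward pass: net input I_j, then output O_j for each unit
o_hid = sigmoid(x @ W1 + b1)
o_out = sigmoid(o_hid @ W2 + b2)

# Backward pass: errors of output and hidden units
err_out = o_out * (1 - o_out) * (t - o_out)
err_hid = o_hid * (1 - o_hid) * (W2 @ err_out)

# Updates: w_ij += l * Err_j * O_i, theta_j += l * Err_j
W2 += lr * np.outer(o_hid, err_out); b2 += lr * err_out
W1 += lr * np.outer(x, err_hid);     b1 += lr * err_hid
```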
Lecture-37
Classification based on concepts from association rule mining
Association-Based Classification
Several methods for association-based classification
ARCS: quantitative association mining and clustering of association rules (Lent et al. ’97)
It beats C4.5 mainly in scalability and also in accuracy
Associative classification (Liu et al. ’98)
It mines high-support, high-confidence rules of the form “cond_set => y”, where y is a class label
Association-Based Classification
CAEP: classification by aggregating emerging patterns (Dong et al. ’99)
Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
Mine EPs based on minimum support and growth rate
Lecture-38
Other Classification Methods
Other Classification Methods
k-nearest neighbor classifier
case-based reasoning
genetic algorithms
rough set approach
fuzzy set approaches
Instance-Based Methods
Instance-based learning:
Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
Approaches
k-nearest neighbor approach
Instances are represented as points in a Euclidean space
Locally weighted regression
Constructs a local approximation
Case-based reasoning
Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-dimensional space
The nearest neighbors are defined in terms of Euclidean distance
The target function may be discrete- or real-valued
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to x_q
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
Discussion on the k-NN Algorithm
The k-NN algorithm for continuous-valued target functions
Calculate the mean value of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
Weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors, e.g.

w \equiv \frac{1}{d(x_q, x_i)^2}

Similarly for real-valued target functions
Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
To overcome it, stretch the axes or eliminate the least relevant attributes
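A small sketch of distance-weighted k-NN classification using the 1/d(x_q, x_i)^2 weighting above; the toy training points are illustrative.

```python
# Sketch: distance-weighted k-NN classification with w = 1 / d(x_q, x_i)^2.
from collections import defaultdict
from math import dist

def knn_classify(query, examples, k=3):
    """examples: list of (point, label) pairs; weighted vote among the k nearest."""
    neighbors = sorted(examples, key=lambda e: dist(query, e[0]))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = dist(query, point)
        votes[label] += 1.0 / (d * d) if d > 0 else float("inf")  # exact match dominates
    return max(votes, key=votes.get)

train = [((1, 1), "+"), ((2, 1), "+"), ((5, 5), "-"), ((6, 5), "-"), ((6, 6), "-")]
print(knn_classify((1.5, 1.2), train, k=3))  # '+'
```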
Case-Based Reasoning
Also uses lazy evaluation and analysis of similar instances
Difference: instances are not “points in a Euclidean space”
Example: water faucet problem in CADET (Sycara et al. ’92)
Methodology
Instances are represented by rich symbolic descriptions (e.g., function graphs)
Multiple retrieved cases may be combined
Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Research issues
Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases
Lazy vs. Eager Learning
Instance-based learning: lazy evaluation
Decision-tree and Bayesian classification: eager evaluation
Key differences
A lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D
An eager method cannot, since it has already chosen its global approximation before seeing the query
Efficiency: lazy methods need less time for training but more time for predicting
Accuracy
A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
An eager method must commit to a single hypothesis that covers the entire instance space
Genetic Algorithms
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly generated rules
e.g., IF A1 AND NOT A2 THEN C2 can be encoded as 100 (see the sketch below)
Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
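A tiny sketch of the bit-string encoding with single-point crossover and bit-flip mutation; apart from the "100" encoding taken from the slide, the operator details are illustrative choices.

```python
# Sketch: bit-string rule encoding with single-point crossover and bit-flip mutation.
# "IF A1 AND NOT A2 THEN C2" -> "100" follows the slide; everything else is illustrative.
import random

random.seed(1)

def crossover(a, b):
    """Single-point crossover of two equal-length bit strings."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(rule, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return "".join(bit if random.random() > rate else str(1 - int(bit)) for bit in rule)

parent1, parent2 = "100", "011"
child1, child2 = crossover(parent1, parent2)
print(child1, child2, mutate(child1))
```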
Rough Set Approach
Rough sets are used to approximately or “roughly” define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes (for feature reduction) is NP-hard, but a discernibility matrix is used to reduce the computational intensity
Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as in a fuzzy membership graph)
Attribute values are converted to fuzzy values
e.g., income is mapped into the discrete categories {low, medium, high}, with fuzzy values calculated for each
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed
Lecture-39
Prediction
What Is Prediction?
Prediction is similar to classification
First, construct a model
Second, use the model to predict unknown values
The major method for prediction is regression
Linear and multiple regression
Non-linear regression
Prediction is different from classification
Classification refers to predicting categorical class labels
Prediction models continuous-valued functions
Predictive Modeling in Databases
Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction
Determine the major factors that influence the prediction
Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
Multi-level prediction: drill-down and roll-up analysis
Regression Analysis and Log-Linear Models in Prediction
Linear regression: Y = \alpha + \beta X
Two parameters, \alpha and \beta, specify the line and are to be estimated by using the data at hand
Apply the least squares criterion to the known values of Y1, Y2, …, X1, X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
The multi-way table of joint probabilities is approximated by a product of lower-order tables
Probability: p(a, b, c, d) = \alpha_{ab}\, \beta_{ac}\, \chi_{ad}\, \delta_{bcd}
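A minimal sketch of fitting the linear-regression line Y = \alpha + \beta X by least squares, using the closed-form estimates; the example points are made up for illustration.

```python
# Sketch: least-squares estimates of the two line parameters; the points are made up.
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]
print(fit_line(xs, ys))   # roughly (0.15, 1.95), i.e. Y = 0.15 + 1.95 X
```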
Locally Weighted Regression
Construct an explicit approximation to f over a local region surrounding the query instance x_q.
Locally weighted linear regression:
The target function f is approximated near x_q using the linear function

\hat{f}(x) = w_0 + w_1 a_1(x) + \dots + w_n a_n(x)

Minimize the squared error over the k nearest neighbors of x_q, with a distance-decreasing weight K:

E(x_q) \equiv \frac{1}{2} \sum_{x \in \text{k nearest neighbors of } x_q} \left(f(x) - \hat{f}(x)\right)^2 K\left(d(x_q, x)\right)

The gradient descent training rule:

\Delta w_j \equiv \eta \sum_{x \in \text{k nearest neighbors of } x_q} K\left(d(x_q, x)\right)\left(f(x) - \hat{f}(x)\right) a_j(x)

In most cases, the target function is approximated by a constant, linear, or quadratic function.
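A sketch of locally weighted linear regression around a query point; it uses a Gaussian kernel as the distance-decreasing weight K and a closed-form weighted least-squares solve in place of the gradient-descent rule above, which is a deliberate simplification.

```python
# Sketch: locally weighted linear regression around a query point x_q.
# A Gaussian kernel plays the role of the distance-decreasing weight K, and a
# closed-form weighted least-squares solve replaces the gradient-descent rule.
import numpy as np

def lwr_predict(xq, X, y, tau=1.0):
    """Predict f(xq) from 1-D training data (X, y), weighting points by K(d(xq, x))."""
    w = np.exp(-((X - xq) ** 2) / (2 * tau ** 2))      # distance-decreasing weights
    A = np.column_stack([np.ones_like(X), X])          # design matrix for w0 + w1 * x
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)   # weighted least squares
    return beta[0] + beta[1] * xq

X = np.linspace(0, 10, 50)
y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=50)
print(round(lwr_predict(3.0, X, y, tau=0.8), 3))       # close to sin(3.0)
```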
Lecture-40
Classification accuracy
Classification Accuracy: Estimating Error Rates
Partition: training-and-testing
use two independent data sets, a training set and a test set
used for data sets with a large number of samples
Cross-validation
divide the data set into k subsamples
use k-1 subsamples as training data and one subsample as test data (k-fold cross-validation)
for data sets of moderate size
Bootstrapping (leave-one-out)
for small data sets
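As a sketch, 10-fold cross-validation can be run with scikit-learn; the iris data and the decision-tree classifier are illustrative choices.

```python
# Sketch: 10-fold cross-validation with scikit-learn; iris and the tree are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)  # 10 folds
print(scores.mean())   # average accuracy over the 10 held-out folds
```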
Boosting and Bagging
Boosting increases classification accuracy
Applicable to decision trees or Bayesian classifiers
Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
Boosting requires only linear time and constant space
Boosting Technique — Algorithm
Assign every example an equal weight 1/N
For t = 1, 2, …, T do
Obtain a hypothesis (classifier) h(t) under the weights w(t)
Calculate the error of h(t) and re-weight the examples based on the error
Normalize w(t+1) to sum to 1
Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
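A sketch of this loop in the AdaBoost style, with depth-1 decision trees as the base hypotheses; the specific re-weighting and hypothesis-weight formulas are one common instantiation, not taken from the slide.

```python
# Sketch of the boosting loop above in the AdaBoost style, with depth-1 trees as
# the base hypotheses. The data and the exact re-weighting formula are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1                       # labels in {-1, +1}
w = np.full(len(y), 1 / len(y))     # every example starts with weight 1/N

hypotheses, alphas = [], []
for t in range(10):                                     # T rounds
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = w[pred != y].sum()                            # weighted error of h(t)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # weight h(t) by its accuracy
    w *= np.exp(-alpha * y * pred)                      # up-weight misclassified examples
    w /= w.sum()                                        # normalize w(t+1) to sum to 1
    hypotheses.append(h); alphas.append(alpha)

# Final classifier: weighted vote of all the hypotheses
votes = sum(a * h.predict(X) for a, h in zip(alphas, hypotheses))
print((np.sign(votes) == y).mean())                     # training accuracy of the ensemble
```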