Classification and prediction


Data Mining and Warehousing

Lecture-32
What is Classification & Prediction?

Classification:
- predicts categorical class labels
- classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Prediction:
- models continuous-valued functions
- predicts unknown or missing values

Applications:
- credit approval
- target marketing
- medical diagnosis
- treatment effectiveness analysis

Classification—A Two-Step Process

Learning step: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Classification step: classifying future or unknown objects
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model
- The accuracy rate is the percentage of test set samples that are correctly classified by the model
- The test set must be independent of the training set, otherwise over-fitting will occur
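
As an illustration of the two-step process, here is a minimal sketch assuming scikit-learn and its bundled iris data (an illustrative choice, not part of the lecture): a model is constructed on a training set, then its accuracy rate is estimated on an independent test set.

```python
# Sketch of the two-step process: learn a classifier on a training
# set, then estimate its accuracy on an independent test set.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1 (learning): construct the model from the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (classification): compare the known labels of the test
# samples with the classified results from the model.
print("accuracy rate:", accuracy_score(y_test, model.predict(X_test)))
```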

Classification Process: Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm learns the classifier (model) from the training data:

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classification Process: Use the Model in Prediction

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data, e.g. (Jeff, Professor, 4): Tenured?

Supervised vs. Unsupervised Learning

Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
- New data are classified based on the training set

Unsupervised learning (clustering):
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
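
A small sketch of the contrast, assuming scikit-learn (the particular estimators are illustrative choices): the supervised learner is given the labels y, while the unsupervised learner sees only the measurements X.

```python
# Supervised learning receives (X, y); unsupervised learning receives
# only X and must establish clusters on its own.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
classifier = DecisionTreeClassifier().fit(X, y)      # labels accompany the data
clustering = KMeans(n_clusters=3, n_init=10).fit(X)  # no labels given
print(classifier.predict(X[:3]), clustering.labels_[:3])
```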

Lecture-33
Issues regarding classification and prediction

Issues regarding classification and prediction: Preparing the data for classification and prediction

Data cleaning:
- Preprocess the data in order to reduce noise and handle missing values

Relevance analysis (feature selection):
- Remove irrelevant or redundant attributes

Data transformation:
- Generalize and/or normalize the data
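
A minimal numpy sketch of these preparation steps; the data values and the choices of mean imputation and min-max normalization are illustrative assumptions.

```python
# Data cleaning (impute a missing value) and data transformation
# (normalize each attribute to [0, 1]).
import numpy as np

X = np.array([[1.0, 200.0],
              [np.nan, 400.0],   # a missing value to handle
              [3.0, 600.0]])

# Cleaning: replace missing values with the column mean.
col_mean = np.nanmean(X, axis=0)
X_clean = np.where(np.isnan(X), col_mean, X)

# Transformation: min-max normalization of each attribute.
span = X_clean.max(axis=0) - X_clean.min(axis=0)
X_norm = (X_clean - X_clean.min(axis=0)) / span
print(X_norm)
```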

Issues regarding classification and prediction: Comparing Classification Methods

Accuracy

Speed and scalability:
- time to construct the model
- time to use the model

Robustness:
- handling noise and missing values

Scalability:
- efficiency for disk-resident databases

Interpretability:
- understanding and insight provided by the model
- measures of interpretability: decision tree size, compactness of classification rules

Lecture-34
Classification by decision tree induction

Classification by Decision Tree Induction

Decision tree:
- A flow-chart-like tree structure
- An internal node denotes a test on an attribute
- A branch represents an outcome of the test
- Leaf nodes represent class labels or class distributions

Decision tree generation consists of two phases:
- Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes
- Tree pruning: identify and remove branches that reflect noise or outliers

Use of a decision tree: classify an unknown sample by testing its attribute values against the decision tree

Training Dataset

This follows an example from Quinlan's ID3; the class label is buys_computer.

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no

Output: A Decision Tree for "buys_computer"

age?
- <=30: student?
  - no: no
  - yes: yes
- 31..40: yes
- >40: credit_rating?
  - excellent: no
  - fair: yes

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (continuous-valued attributes are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
- There are no samples left

Attribute Selection Measure

Information gain (ID3/C4.5):
- All attributes are assumed to be categorical
- Can be modified for continuous-valued attributes

Gini index (IBM IntelligentMiner):
- All attributes are assumed to be continuous-valued
- Assumes there exist several possible split values for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes

Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.

Assume there are two classes, P and N. Let the set of examples S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

I(p, n) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}
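
A direct transcription of I(p, n) into Python (a sketch; the term for a zero count is taken as 0):

```python
# Expected information I(p, n) for a set with p examples of class P
# and n examples of class N.
from math import log2

def info(p, n):
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

print(round(info(9, 5), 3))  # 0.94, i.e. I(9, 5) = 0.940 as used below
```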

Information Gain in Decision Tree Induction

Assume that using attribute A, a set S will be partitioned into sets {S_1, S_2, ..., S_v}.

If S_i contains p_i examples of P and n_i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S_i, is

E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} I(p_i, n_i)

The encoding information that would be gained by branching on A is

Gain(A) = I(p, n) - E(A)

Attribute Selection by Information Gain Computation

Class P: buys_computer = "yes"
Class N: buys_computer = "no"

I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
30…40   4    0    0
>40     3    2    0.971

E(age) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694

Hence

Gain(age) = I(p, n) - E(age) = 0.246

Similarly,

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
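
The same computation as a short sketch, using the (p_i, n_i) counts per age value from the table above:

```python
# Reproduce E(age) and Gain(age) for the buys_computer data
# (p = 9 "yes" examples, n = 5 "no" examples).
from math import log2

def info(p, n):
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

age_partitions = [(2, 3), (4, 0), (3, 2)]  # (p_i, n_i) for <=30, 30...40, >40
p, n = 9, 5

E_age = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in age_partitions)
print(f"E(age)    = {E_age:.3f}")               # ~0.694
print(f"Gain(age) = {info(p, n) - E_age:.3f}")  # ~0.246
```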

Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in T.

If a data set T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively (N = N_1 + N_2), the gini index of the split data is defined as

gini_{split}(T) = \frac{N_1}{N} gini(T_1) + \frac{N_2}{N} gini(T_2)

The attribute that provides the smallest gini_{split}(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
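
A sketch of both definitions; the class counts in the example calls are made up for illustration.

```python
# gini(T) = 1 - sum_j p_j^2, and the gini index of a two-way split.
def gini(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(counts1, counts2):
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

print(gini([9, 5]))                # impurity before splitting
print(gini_split([6, 1], [3, 4]))  # impurity after a split; smaller is better
```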

Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand

Example (from the tree above):

IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
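
With a library-grown tree, the same path-to-rule reading can be seen directly; a sketch assuming scikit-learn, whose export_text prints the tree's root-to-leaf paths (each leaf line completes one IF-THEN rule):

```python
# Print a fitted tree as nested conditions; conjoining the tests along
# one root-to-leaf path yields one IF-THEN rule.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2).fit(data.data, data.target)
print(export_text(tree, feature_names=list(data.feature_names)))
```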

Avoid Overfitting in Classification

The generated tree may overfit the training data:
- Too many branches, some of which may reflect anomalies due to noise or outliers
- The result is poor accuracy on unseen samples

Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size

- Separate training and testing sets
- Use cross-validation, e.g., 10-fold cross-validation
- Use all the data for training, and apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the fit to the entire distribution
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized

Enhancements to Basic Decision Tree Induction

Allow for continuous-valued attributes:
- Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals

Handle missing attribute values:
- Assign the most common value of the attribute
- Assign a probability to each of the possible values

Attribute construction:
- Create new attributes based on existing ones that are sparsely represented
- This reduces fragmentation, repetition, and replication

Classification in Large Databases

Classification is a classical problem extensively studied by statisticians and machine learning researchers.

Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.

Why decision tree induction in data mining?
- relatively fast learning speed (compared with other classification methods)
- convertible to simple, easy-to-understand classification rules
- can use SQL queries for accessing databases
- classification accuracy comparable with other methods

Scalable Decision Tree Induction Methods in Data Mining Studies

SLIQ (EDBT'96, Mehta et al.):
- builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (VLDB'96, J. Shafer et al.):
- constructs an attribute-list data structure

PUBLIC (VLDB'98, Rastogi & Shim):
- integrates tree splitting and tree pruning: stops growing the tree earlier

RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti):
- separates the scalability aspects from the criteria that determine the quality of the tree
- builds an AVC-list (attribute, value, class label)

Lecture-35
Bayesian Classification

Bayesian Classification

- Statistical classifiers
- Based on Bayes' theorem
- Naïve Bayesian classification
- Class conditional independence
- Bayesian belief networks

Bayesian Classification

Probabilistic learning:
- Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems

Incremental:
- Each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data

Probabilistic prediction:
- Predict multiple hypotheses, weighted by their probabilities

Standard:
- Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Bayes' Theorem

Let X be a data tuple and H a hypothesis, such as "X belongs to a specific class C".

The posterior probability of the hypothesis H given X, P(H|X), follows Bayes' theorem:

P(H|X) = \frac{P(X|H) P(H)}{P(X)}

Naïve Bayes Classifier

Let D be a training data set of tuples and their associated class labels. Each tuple is X = (x1, x2, x3, ..., xn), and there are m classes C1, C2, C3, ..., Cm.

Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Naïve Bayes predicts that X belongs to class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i

Naïve Bayes Classifier

- P(X) is constant for all classes, so only P(X|Ci) P(Ci) needs to be maximized
- If the class priors are equal, P(C1) = P(C2) = ... = P(Cm), only P(X|Ci) needs to be maximized
- Otherwise the priors are estimated from the training data as P(Ci) = |C_{i,D}| / |D|, where |C_{i,D}| is the number of training tuples of class Ci in D

Naïve Bayesian Classification

Naïve assumption: attribute independence

P(x_1, ..., x_k | C) = P(x_1|C) · ... · P(x_k|C)

If the attribute is categorical:
- P(x_i|C) is estimated as the relative frequency of samples having value x_i as the i-th attribute in class C

If the attribute is continuous:
- P(x_i|C) is estimated through a Gaussian density function

Computationally easy in both cases

Play-tennis example: estimating P(x_i|C)

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 2/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5

Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>

P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582

P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Sample X is classified in class n (don't play)
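
The arithmetic of this classification, reproduced as a short sketch:

```python
# Reproduce the play-tennis scores for X = <rain, hot, high, false>.
from math import prod

P_X_given_p = [3/9, 2/9, 3/9, 6/9]  # P(rain|p), P(hot|p), P(high|p), P(false|p)
P_X_given_n = [2/5, 2/5, 4/5, 2/5]  # the same attributes under class n

score_p = prod(P_X_given_p) * 9/14  # P(X|p) P(p)
score_n = prod(P_X_given_n) * 5/14  # P(X|n) P(n)
print(f"{score_p:.6f} {score_n:.6f}")  # 0.010582 0.018286
print("classified as:", "n (don't play)" if score_n > score_p else "p (play)")
```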

How effective are Bayesian classifiers?

The class conditional independence assumption:
- makes computation possible
- yields optimal classifiers when it is satisfied
- but is seldom satisfied in practice, as attributes (variables) are often correlated

Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
- Decision trees, which reason on one attribute at a time, considering the most important attributes first

Bayesian Belief Networks (I)

Network structure (causal links):
- FamilyHistory -> LungCancer
- Smoker -> LungCancer
- Smoker -> Emphysema
- LungCancer -> PositiveXRay
- LungCancer -> Dyspnea

The conditional probability table for the variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

Bayesian Belief Networks

- A Bayesian belief network allows a subset of the variables to be conditionally independent
- It is a graphical model of causal relationships

Several cases of learning Bayesian belief networks:
- Given both the network structure and all the variables: easy
- Given the network structure but only some variables
- When the network structure is not known in advance

Lecture-36
Classification by Backpropagation

Classification by Backpropagation

- Backpropagation is a neural network learning algorithm
- Neural networks originated with psychologists and neurobiologists
- A neural network is a set of connected input/output units in which each connection has a weight associated with it

Neural Networks

Advantages:
- prediction accuracy is generally high
- robust: works when training examples contain errors
- output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
- fast evaluation of the learned target function

Limitations:
- long training time
- difficult to understand the learned function (weights)
- not easy to incorporate domain knowledge

A Neuron

The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping:

y = f\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)

where x_0, ..., x_n is the input vector, w_0, ..., w_n the weight vector, \mu_k a bias, and f the activation function applied to the weighted sum.

Network Training

The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.

Steps:
- Initialize the weights with random values
- Feed the input tuples into the network one by one
- For each unit:
  - Compute the net input to the unit as a linear combination of all the inputs to the unit
  - Compute the output value using the activation function
  - Compute the error
  - Update the weights and the bias

Multi-Layer Perceptron

The input vector feeds the input nodes, which feed the hidden nodes, which feed the output nodes to produce the output vector.

Net input to unit j, with weights w_{ij} and bias \theta_j:

I_j = \sum_i w_{ij} O_i + \theta_j

Output of unit j (sigmoid activation):

O_j = \frac{1}{1 + e^{-I_j}}

Error at an output unit j with target value T_j:

Err_j = O_j (1 - O_j)(T_j - O_j)

Error at a hidden unit j, backpropagated from the units k of the next layer:

Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}

Weight and bias updates with learning rate l:

w_{ij} = w_{ij} + l \, Err_j \, O_i
\theta_j = \theta_j + l \, Err_j
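
A numpy sketch of one training step for a tiny 2-2-1 network, following the equations above; the initial weights, the training pair, and the learning rate l = 0.5 are arbitrary illustrative values.

```python
# One forward and backward pass for a 2-input, 2-hidden, 1-output
# network with sigmoid units.
import numpy as np

def sigmoid(I):
    return 1.0 / (1.0 + np.exp(-I))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)  # input -> hidden weights, biases
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)  # hidden -> output weights, biases
x, T, l = np.array([1.0, 0.0]), np.array([1.0]), 0.5

# Forward: I_j = sum_i w_ij O_i + theta_j ; O_j = 1 / (1 + e^-I_j)
O_h = sigmoid(x @ W1 + b1)
O_o = sigmoid(O_h @ W2 + b2)

# Backward: Err_j = O_j(1-O_j)(T_j-O_j) at the output unit,
# Err_j = O_j(1-O_j) * sum_k Err_k w_jk at a hidden unit.
err_o = O_o * (1 - O_o) * (T - O_o)
err_h = O_h * (1 - O_h) * (W2 @ err_o)

# Updates: w_ij += l * Err_j * O_i ; theta_j += l * Err_j
W2 += l * np.outer(O_h, err_o); b2 += l * err_o
W1 += l * np.outer(x, err_h);   b1 += l * err_h
print("output after one update:", sigmoid(sigmoid(x @ W1 + b1) @ W2 + b2))
```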

Lecture-37
Classification based on concepts from association rule mining

Association-Based Classification

Several methods for association-based classification:

ARCS: quantitative association mining and clustering of association rules (Lent et al. '97)
- It beats C4.5 in (mainly) scalability and also in accuracy

Associative classification (Liu et al. '98)
- It mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label

Association-Based Classification

CAEP: Classification by Aggregating Emerging Patterns (Dong et al. '99)
- Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
- Mine EPs based on minimum support and growth rate

Lecture-38
Other Classification Methods

Other Classification Methods

- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches

Instance-Based Methods

Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.

Approaches:
- k-nearest neighbor: instances represented as points in a Euclidean space
- Locally weighted regression: constructs a local approximation
- Case-based reasoning: uses symbolic representations and knowledge-based inference

The k-Nearest Neighbor Algorithm

- All instances correspond to points in the n-dimensional space
- The nearest neighbors are defined in terms of Euclidean distance
- The target function may be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

Discussion on the k-NN Algorithm

The k-NN algorithm for continuous-valued target functions:
- Calculate the mean value of the k nearest neighbors

Distance-weighted nearest neighbor algorithm:
- Weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors:

w \equiv \frac{1}{d(x_q, x_i)^2}

- Similarly for real-valued target functions

Robust to noisy data by averaging the k nearest neighbors.

Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
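
A sketch of the distance-weighted vote with w = 1 / d(x_q, x_i)^2; the points and labels are made up for illustration.

```python
# Distance-weighted k-NN: each of the k nearest neighbors votes for
# its class with weight 1 / d^2, so closer neighbors count for more.
import numpy as np
from collections import defaultdict

def knn_predict(X, y, x_q, k=3):
    d = np.linalg.norm(X - x_q, axis=1)  # Euclidean distances to x_q
    votes = defaultdict(float)
    for i in np.argsort(d)[:k]:          # the k nearest neighbors
        if d[i] == 0.0:                  # an exact match decides outright
            return y[i]
        votes[y[i]] += 1.0 / d[i] ** 2   # w = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = ["-", "-", "-", "+"]
print(knn_predict(X, y, np.array([0.5, 0.5])))  # "-"
```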

Case-Based Reasoning

- Also uses lazy evaluation and analysis of similar instances
- Difference: instances are not "points in a Euclidean space"
- Example: the water faucet problem in CADET (Sycara et al. '92)

Methodology:
- Instances represented by rich symbolic descriptions (e.g., function graphs)
- Multiple retrieved cases may be combined
- Tight coupling between case retrieval, knowledge-based reasoning, and problem solving

Research issues:
- Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases

Lazy vs. Eager Learning

Instance-based learning uses lazy evaluation; decision-tree and Bayesian classification use eager evaluation.

Key differences:
- A lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D
- An eager method cannot, since it has already chosen its global approximation by the time it sees the query

Efficiency: lazy methods spend less time training but more time predicting.

Accuracy:
- A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
- An eager method must commit to a single hypothesis that covers the entire instance space

Genetic Algorithms

- GA: based on an analogy to biological evolution
- Each rule is represented by a string of bits
- An initial population is created consisting of randomly generated rules; e.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100
- Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
- The fitness of a rule is measured by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation
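
A toy sketch of the GA operators on bit-string rules; the fitness function here is a stand-in for classification accuracy on a training set.

```python
# Evolve a small population of 3-bit rule strings by fitness-based
# selection, single-point crossover, and bit-flip mutation.
import random

random.seed(1)

def crossover(a, b):
    cut = random.randrange(1, len(a))        # single-point crossover
    return a[:cut] + b[cut:]

def mutate(rule, p=0.1):
    return "".join(c if random.random() > p else str(1 - int(c)) for c in rule)

def fitness(rule):                           # stand-in for accuracy
    return rule.count("1")

population = ["100", "011", "110", "001"]
for generation in range(5):
    population.sort(key=fitness, reverse=True)
    child = mutate(crossover(population[0], population[1]))  # fittest parents
    population[-1] = child                   # offspring replaces the weakest
print(population)
```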

Rough Set Approach

- Rough sets are used to approximately, or "roughly", define equivalence classes
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
- Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix is used to reduce the computational intensity

Fuzzy Set Approaches

- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g., via a fuzzy membership graph)
- Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
- For a given new sample, more than one fuzzy value may apply
- Each applicable rule contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed

Lecture-39
Prediction

What Is Prediction?

Prediction is similar to classification:
- First, construct a model
- Second, use the model to predict unknown values

The major method for prediction is regression:
- Linear and multiple regression
- Non-linear regression

Prediction is different from classification:
- Classification predicts categorical class labels
- Prediction models continuous-valued functions

Predictive Modeling in Databases

Predictive modeling: predict data values or construct generalized linear models based on the database data. One can only predict value ranges or category distributions.

Method outline:
- Minimal generalization
- Attribute relevance analysis
- Generalized linear model construction
- Prediction

Determine the major factors that influence the prediction:
- Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.

Multi-level prediction: drill-down and roll-up analysis.

Regression Analysis and Log-Linear Models in Prediction

Linear regression: Y = a + b X
- The two parameters, a and b, specify the line and are estimated using the data at hand
- They are estimated by applying the least squares criterion to the known values Y_1, Y_2, ..., X_1, X_2, ...

Multiple regression: Y = b_0 + b_1 X_1 + b_2 X_2
- Many nonlinear functions can be transformed into the above

Log-linear models:
- The multi-way table of joint probabilities is approximated by a product of lower-order tables
- Probability: p(a, b, c, d) = \alpha_{ab} \beta_{ac} \gamma_{ad} \delta_{bcd}
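
A least-squares sketch for Y = a + bX, with made-up data points:

```python
# Estimate a and b in Y = a + bX by the least squares criterion.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([1.9, 4.1, 6.0, 8.2, 9.9])  # roughly Y = 2X

b = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
a = Y.mean() - b * X.mean()
print(f"Y = {a:.2f} + {b:.2f} X")
```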

Locally Weighted Regression

Construct an explicit approximation to f over a local region surrounding the query instance x_q.

Locally weighted linear regression:
- The target function f is approximated near x_q using the linear function

\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)

- Minimize the squared error over the k nearest neighbors of x_q, with a distance-decreasing weight K:

E(x_q) = \frac{1}{2} \sum_{x \in kNN(x_q)} \left( f(x) - \hat{f}(x) \right)^2 K(d(x_q, x))

- The gradient descent training rule:

\Delta w_j = \eta \sum_{x \in kNN(x_q)} K(d(x_q, x)) \left( f(x) - \hat{f}(x) \right) a_j(x)

In most cases, the target function is approximated by a constant, linear, or quadratic function.
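
A sketch of locally weighted linear regression solved in closed form rather than by the gradient descent rule above (a standard alternative); the Gaussian kernel K and the data are illustrative assumptions.

```python
# Fit a local linear model around x_q, weighting each training point
# by a distance-decreasing kernel, then predict at x_q.
import numpy as np

def lwr_predict(X, y, x_q, tau=0.5):
    A = np.column_stack([np.ones(len(X)), X])       # design matrix [1, a_1(x)]
    K = np.exp(-((X - x_q) ** 2) / (2 * tau ** 2))  # distance-decreasing weights
    W = np.diag(K)
    # Weighted least squares minimizes the kernel-weighted squared error.
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return w[0] + w[1] * x_q

X = np.linspace(0.0, 3.0, 20)
y = np.sin(X)                      # a nonlinear target function
print(lwr_predict(X, y, x_q=1.5))  # close to sin(1.5) ~ 0.997
```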

Lecture-40
Classification accuracy

Classification Accuracy: Estimating Error Rates

Partition: training-and-testing
- Use two independent data sets: a training set and a test set
- Used for data sets with a large number of samples

Cross-validation
- Divide the data set into k subsamples
- Use k-1 subsamples as training data and one subsample as test data: k-fold cross-validation
- For data sets of moderate size

Bootstrapping (leave-one-out)
- For small data sets
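
A minimal k-fold cross-validation sketch, assuming scikit-learn; each of the k = 10 subsamples serves once as the test data.

```python
# 10-fold cross-validation: train on 9 subsamples, test on the tenth,
# and rotate; report the mean accuracy over the folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean().round(3))
```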

Boosting and Bagging

- Boosting increases classification accuracy
- Applicable to decision trees or Bayesian classifiers
- Learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor
- Boosting requires only linear time and constant space

Boosting Technique: Algorithm

- Assign every example an equal weight 1/N
- For t = 1, 2, ..., T do:
  - Obtain a hypothesis (classifier) h(t) under the weights w(t)
  - Calculate the error of h(t) and re-weight the examples based on the error
  - Normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
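
A schematic sketch of the weight bookkeeping in an AdaBoost-style round; the misclassification pattern of each hypothesis h(t) is simulated rather than produced by a real classifier, and misclassified examples gain weight whenever h(t) does better than chance.

```python
# Every example starts with weight 1/N; after each hypothesis, the
# misclassified examples are re-weighted and the weights are
# normalized to sum to 1.
import numpy as np

rng = np.random.default_rng(0)
N, T = 8, 3
w = np.full(N, 1.0 / N)                    # equal initial weights 1/N

for t in range(T):
    miss = rng.random(N) < 0.3             # stand-in for the errors of h(t)
    err = max(w[miss].sum(), 1e-12)        # weighted error of h(t)
    alpha = 0.5 * np.log((1 - err) / err)  # h(t)'s weight in the final vote
    w = w * np.exp(alpha * miss)           # re-weight the misclassified examples
    w = w / w.sum()                        # normalize w(t+1) to sum to 1
    print(f"round {t}: err = {err:.3f}")
```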