Process (1): Model Construction

Training data is fed to a classification algorithm, which constructs a classifier (model):

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is first evaluated on test data, then applied to unseen data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  6      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) -> Tenured?
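The two-step process can be sketched in code. A minimal sketch, assuming scikit-learn is available; the tenure tuples come from the slide above, while the numeric rank encoding, the variable names, and the choice of a decision-tree learner are illustrative assumptions.

```python
# Minimal sketch of model construction and prediction (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier

# Training data from the slide: (rank, years) -> tenured
# rank encoded as 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor (illustrative)
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]
y_train = ["no", "yes", "yes", "yes", "no", "no"]

# Process (1): model construction (training)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Process (2): use the model on unseen data, e.g. (Jeff, Professor, 4 years)
print(model.predict([[2, 4]]))  # the slide's rule would give 'yes' for this tuple
```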
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Classes or clusters are instead discovered from the data itself using clustering techniques
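A minimal sketch of the contrast, assuming scikit-learn; the toy measurements and class names below are illustrative only.

```python
# Supervised (labels given) vs. unsupervised (no labels given); assumes scikit-learn.
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]]   # illustrative measurements
y = ["A", "A", "B", "B"]                                # class labels (supervised case only)

clf = DecisionTreeClassifier().fit(X, y)     # supervised: learns from labeled data
print(clf.predict([[5.0, 5.0]]))             # classifies new data based on the training set

km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: finds clusters without labels
print(km.labels_)                            # cluster assignments discovered from X alone
```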
Issues regarding Classification & Prediction
Issues: Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing
values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
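A minimal sketch of these three preparation steps, assuming pandas and scikit-learn; the column names and the median fill strategy are illustrative assumptions.

```python
# Data cleaning, relevance analysis, and transformation (assumes pandas and scikit-learn).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [25, None, 47, 52],
                   "income": [30000, 48000, 54000, None],
                   "zip": ["10001", "10002", "10001", "10003"]})

# Data cleaning: handle missing values (here, filled with the column median)
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Relevance analysis: drop an attribute judged irrelevant for the classification task
df = df.drop(columns=["zip"])

# Data transformation: normalize the numeric attributes to [0, 1]
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
print(df)
```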
Issues: Evaluating Classification Methods
Accuracy: how well the classifier predicts the class label
Speed: time to construct the model (training time) and time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
14
Decision Tree Induction: Training Datasetageincomestudentcredit_ratingbuys_computer
<=30high nofair no
<=30high noexcellent no
31…40high nofair yes
>40 medium nofair yes
>40 low yesfair yes
>40 low yesexcellent no
31…40low yesexcellent yes
<=30medium nofair no
<=30low yesfair yes
>40 medium yesfair yes
<=30medium yesexcellent yes
31…40medium noexcellent yes
31…40high yesfair yes
>40 medium noexcellent no
Output: A Decision Tree for “buys_computer”

age?
  <=30   -> student?       (no -> no, yes -> yes)
  31…40  -> yes
  >40    -> credit_rating? (excellent -> no, fair -> yes)
How are decision trees used for classification?
Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree.
A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
Why are decision tree classifiers so popular?
The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional data.
Their representation of acquired knowledge in tree form is intuitive and generally easy for humans to understand.
The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifiers have good accuracy.
Information Gain
Information gain is the basic criterion for deciding whether a feature should be used to split a node.
Entropy: an information-theoretic metric that measures the impurity or uncertainty in a group of observations.
It determines how a decision tree chooses to split the data.
Attribute Selection Measure: Information Gain
Select the attribute with the highest information gain
Let p_i be the non-zero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|
Initial entropy (expected information needed to classify a tuple in D):

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

A log function to base 2 is used because the information is encoded in bits.
New entropy (after using A to split D into v partitions) needed to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
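A minimal sketch of these three formulas in code; the helper names are illustrative, and the example column and labels are the 'age' attribute and class labels from the buys_computer training data above.

```python
# Sketch: computing Info(D), Info_A(D), and Gain(A) on the buys_computer data.
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D), where values are the tuples' values of attribute A."""
    n = len(labels)
    partitions = {}
    for v, label in zip(values, labels):
        partitions.setdefault(v, []).append(label)
    info_a = sum(len(part) / n * info(part) for part in partitions.values())
    return info(labels) - info_a

# 'age' column and class labels from the training-data slide
age = ["<=30", "<=30", "31..40", ">40", ">40", ">40", "31..40",
       "<=30", "<=30", ">40", "<=30", "31..40", "31..40", ">40"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(f"{info(buys):.3f}")        # 0.940: Info(D) for 9 yes / 5 no tuples
print(f"{gain(age, buys):.3f}")   # 0.247: information gain of the 'age' attribute
```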
Attribute Selection Measure: Information Gain (Example)
The class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (i.e., m = 2).
There are nine tuples of class yes and five tuples of class no.

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
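Plugging these counts into the formula gives (a short worked step, consistent with the definition above):

Info(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940 \text{ bits}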
Bayes’ Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}
Derivation of Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x_1, x_2, …, x_n)
Suppose there are m classes C_1, C_2, …, C_m.
Classification is to derive the maximum posterior, i.e., the maximal P(C_i|X)
This can be derived from Bayes’ theorem:

P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}

Since P(X) is constant for all classes, only P(X|C_i)\,P(C_i) needs to be maximized
Derivation of Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., there is no dependence relation between attributes).
For instance, to compute P(X|C_i), we consider the following:

P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
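A minimal hand-rolled sketch of this computation; the frequency-based estimates follow the formulas above, while the tiny (age, student) data, function names, and encodings are illustrative assumptions (no smoothing is applied).

```python
# Sketch of the naive Bayes computation: maximize P(C_i) * prod_k P(x_k | C_i).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate P(C_i) and P(x_k | C_i) by relative frequencies."""
    class_count = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_count.items()}
    cond = defaultdict(Counter)                     # (class, attribute index) -> value counts
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            cond[(c, k)][v] += 1
    return priors, cond, class_count

def classify(x, priors, cond, class_count):
    """Pick the class maximizing P(C_i) * prod_k P(x_k | C_i)."""
    def score(c):
        s = priors[c]
        for k, v in enumerate(x):
            s *= cond[(c, k)][v] / class_count[c]   # P(x_k | C_i); zero if value unseen
        return s
    return max(priors, key=score)

# Tiny illustrative data: (age, student) -> buys_computer
rows = [("<=30", "no"), ("<=30", "yes"), (">40", "yes"), ("31..40", "no")]
labels = ["no", "yes", "yes", "yes"]
print(classify(("<=30", "yes"), *train_nb(rows, labels)))   # -> 'yes'
```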
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most cases
Disadvantages
Assumption of class-conditional independence, hence a loss of accuracy
In practice, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by a Naïve Bayesian classifier
Types of Naïve Bayes Classifier
Multinomial, Gaussian, and Bernoulli Naïve Bayes are the three types of Naïve Bayes classification.
Multinomial Naïve Bayes:
This is mostly used for document classification problems, i.e., whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.
Bernoulli Naïve Bayes:
This is similar to multinomial Naïve Bayes, but the predictors are Boolean variables. The parameters used to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.
Gaussian Naïve Bayes:
When the predictors take continuous values and are not discrete, we assume that these values are sampled from a Gaussian (normal) distribution. Since the way the values are distributed changes, the formula for the conditional probability changes to:

P(x_k|C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)

where \mu_{C_i} and \sigma_{C_i} are the mean and standard deviation of attribute x_k for the tuples of class C_i.
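A minimal sketch comparing the three variants, assuming scikit-learn (MultinomialNB, BernoulliNB, GaussianNB); the tiny feature matrices and labels are illustrative only.

```python
# The three Naive Bayes variants in scikit-learn on tiny illustrative data.
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

# Multinomial: word-count features for two short "documents" per class
X_counts = [[3, 0, 1], [2, 0, 2], [0, 4, 1], [0, 3, 0]]
y_topic  = ["sports", "sports", "politics", "politics"]
print(MultinomialNB().fit(X_counts, y_topic).predict([[1, 0, 1]]))

# Bernoulli: Boolean word-occurrence features
X_bool = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
print(BernoulliNB().fit(X_bool, y_topic).predict([[1, 0, 1]]))

# Gaussian: continuous features modeled with a normal distribution per class
X_cont = [[1.8, 70.0], [1.7, 65.0], [1.2, 30.0], [1.1, 25.0]]
y_size = ["adult", "adult", "child", "child"]
print(GaussianNB().fit(X_cont, y_size).predict([[1.75, 68.0]]))
```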
Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion
R1: IF age = youth AND student = yes THEN buys_computer = yes
The “IF” part (or left side) of a rule is known as the rule antecedent or precondition. The “THEN” part (or right side) is the rule consequent.
In the rule antecedent, the condition consists of one or more attribute tests, and the rule’s consequent contains a class prediction. R1 can also be written as:
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
Using IF-THEN Rules for Classification
If the condition holds true for a given tuple, we say that the rule antecedent is satisfied and that the rule covers the tuple.
Assessment of a rule: coverage and accuracy
n_covers = # of tuples covered by R
n_correct = # of tuples correctly classified by R
coverage(R) = n_covers / |D|   /* D: training data set */
accuracy(R) = n_correct / n_covers
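A minimal sketch of computing coverage and accuracy for one rule; the rule and tuples are the tenure example from the earlier slides, while the tuple layout and helper names are illustrative assumptions.

```python
# Sketch: coverage(R) and accuracy(R) for the tenure rule on the training set D.
def rule_r(rank, years):
    """R: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return rank == "professor" or years > 6

D = [("assistant", 3, "no"), ("assistant", 7, "yes"), ("professor", 2, "yes"),
     ("associate", 7, "yes"), ("assistant", 6, "no"), ("associate", 3, "no")]

covered = [(rank, years, label) for rank, years, label in D if rule_r(rank, years)]
n_covers = len(covered)
n_correct = sum(1 for _, _, label in covered if label == "yes")

print(n_covers / len(D))      # coverage(R): fraction of tuples the rule covers
print(n_correct / n_covers)   # accuracy(R): fraction of covered tuples classified correctly
```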
Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
Rules are mutually exclusive and exhaustive

Example: rule extraction from our buys_computer decision tree

age?
  <=30   -> student?       (no -> no, yes -> yes)
  31..40 -> yes
  >40    -> credit_rating? (excellent -> no, fair -> yes)

IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
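A minimal sketch of extracting one rule per root-to-leaf path from a learned tree, assuming scikit-learn; the integer encodings are illustrative, and because the learner treats them as ordered numbers, the printed paths need not match the slide's tree exactly.

```python
# Sketch: one IF-THEN rule per root-to-leaf path of a learned tree (assumes scikit-learn).
from sklearn.tree import DecisionTreeClassifier, export_text

# Encodings (illustrative): age 0="<=30",1="31..40",2=">40"; income 0=low,1=medium,2=high;
# student 0=no,1=yes; credit_rating 0=fair,1=excellent
X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0], [2, 0, 1, 0], [2, 0, 1, 1],
     [1, 0, 1, 1], [0, 1, 0, 0], [0, 0, 1, 0], [2, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 1],
     [1, 2, 1, 0], [2, 1, 0, 1]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes",
     "no", "yes", "yes", "yes", "yes", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Each printed root-to-leaf path corresponds to one IF-THEN rule
print(export_text(tree, feature_names=["age", "income", "student", "credit_rating"]))
```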
Neural Network as a Classifier
Weakness
Long training time
Requires a number of parameters that are typically best determined empirically, e.g., the network topology or “structure”
Strength
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the extraction of rules from
trained neural networks
A Neuron (= a perceptron)
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping.

[Figure: input vector x = (x_0, x_1, …, x_n) and weight vector w = (w_0, w_1, …, w_n) feed a weighted sum; the bias \mu_k is subtracted, and an activation function f produces the output y.]

For example:

y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i - \mu_k\right)
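A minimal sketch of this single neuron; the input values, weights, and bias are made-up illustrative numbers.

```python
# Sketch of one perceptron: weighted sum minus a bias, then a sign activation.
def perceptron(x, w, bias):
    """y = sign(sum_i w_i * x_i - bias)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return 1 if s >= 0 else -1

x = [1.0, 0.5, -0.2]   # input vector
w = [0.4, 0.3, 0.9]    # weight vector
print(perceptron(x, w, bias=0.2))   # -> 1, since 0.4 + 0.15 - 0.18 - 0.2 = 0.17 >= 0
```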
A Multi-Layer Feed-Forward Neural Network
How Does a Multi-Layer Neural Network Work?
The inputs to the network correspond to the attributes measured
for each training tuple
Inputs are fed simultaneously into the units making up the input
layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network's prediction
The network is feed-forwardin that none of the weights cycles
back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear
regression: Given enough hidden units and enough training
samples, they can closely approximate any function
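A minimal sketch of one feed-forward pass through a tiny 2-3-1 network; all weights, biases, and the input tuple are made-up illustrative numbers.

```python
# Sketch: a feed-forward pass, layer by layer, with sigmoid activations.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Each unit computes activation(sum_i w_i * input_i + bias)."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.7, 0.1]                                        # attribute values of one tuple
hidden = layer(x, weights=[[0.2, -0.5], [0.4, 0.1], [-0.3, 0.8]], biases=[0.0, 0.1, -0.2])
output = layer(hidden, weights=[[0.3, -0.6, 0.9]], biases=[0.05])
print(output)   # the network's prediction for this tuple, a value in (0, 1)
```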
Defining a Network Topology
First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
For a discrete-valued attribute, one input unit per domain value may be used, each initialized to 0
Output: for classification with more than two classes, one output unit per class is used
If a trained network's accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
Backpropagation
Iteratively process a set of training tuples & compare the network's
prediction with the actual known target value
For each training tuple, the weights are modified to minimize the mean
squared errorbetween the network's prediction and the actual target
value
Modifications are made in the “backwards” direction: from the output
layer, through each hidden layer down to the first hidden layer, hence
“backpropagation”
Steps
Initialize weights (to small random #s) and biases in the network
Propagate the inputs forward (by applying activation function)
Backpropagate the error (by updating weights and biases)
Terminating condition (when error is very small, etc.)
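A minimal sketch of this training loop, assuming scikit-learn's MLPClassifier, which trains a feed-forward network by propagating errors backward and updating weights and biases; the toy data and parameter choices are illustrative.

```python
# Sketch: training a small network with backpropagation-style weight updates (scikit-learn).
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

net = MLPClassifier(hidden_layer_sizes=(4,),   # one hidden layer with 4 units
                    activation="logistic",     # sigmoid activation function
                    solver="sgd", learning_rate_init=0.5,
                    max_iter=5000, random_state=0)
net.fit(X, y)                                  # weights initialized randomly, then updated
print(net.predict([[1, 0], [1, 1]]))           # predictions after training
```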
SVM
SVM—Support Vector Machines
A new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data into a higher dimension; with the new dimension, it searches for the linear optimal separating hyperplane (i.e., the “decision boundary” separating the tuples of one class from another)
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)
SVM—History and Applications
Vapnik and colleagues (1992)—groundwork from Vapnik
& Chervonenkis’ statistical learning theory in 1960s
Features: training can be slow but accuracy is high owing
to their ability to model complex nonlinear decision
boundaries (margin maximization)
Used both for classification and prediction
Applications:
handwritten digit recognition, object recognition,
speaker identification, benchmarking time-series
prediction tests
SVM—General Philosophy

[Figure: the same two classes separated by two different hyperplanes; the support vectors define the margin, contrasting a small-margin separator with a large-margin one.]
SVM—Margins and Support Vectors
SVM—When Data Is Linearly Separable
Let the data D be (X_1, y_1), …, (X_|D|, y_|D|), where X_i is a training tuple and y_i its associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM—Linearly Separable
A separating hyperplane can be written as
W · X + b = 0
where W = {w_1, w_2, …, w_n} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w_0 + w_1 x_1 + w_2 x_2 = 0
The hyperplanes defining the sides of the margin:
H_1: w_0 + w_1 x_1 + w_2 x_2 ≥ 1 for y_i = +1, and
H_2: w_0 + w_1 x_1 + w_2 x_2 ≤ –1 for y_i = –1
Any training tuples that fall on hyperplanes H_1 or H_2 (i.e., the sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
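A minimal sketch of finding the maximum-margin hyperplane on linearly separable toy points, assuming scikit-learn; the data, the large C value (to approximate a hard margin), and the variable names are illustrative.

```python
# Sketch: maximum-margin separator on linearly separable toy data (assumes scikit-learn).
from sklearn.svm import SVC

X = [[1, 1], [2, 1], [1, 2],      # class -1
     [4, 4], [5, 4], [4, 5]]      # class +1
y = [-1, -1, -1, +1, +1, +1]

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard-margin case

print(svm.support_vectors_)          # the "essential" training tuples lying on H1 and H2
print(svm.coef_, svm.intercept_)     # W and b of the separating hyperplane W·X + b = 0
print(svm.predict([[3, 3]]))         # which side of the hyperplane a new tuple falls on
```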
SVM—Linearly Inseparable
Transform the original input data into a higher dimensional space
Search for a linear separating hyperplane in the new space

[Figure: linearly inseparable data plotted on attribute axes A_1 and A_2.]
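A minimal sketch of this idea, assuming scikit-learn: an RBF kernel implicitly performs the nonlinear mapping, so a linear separator in the new space handles data that is not linearly separable in the original one. The ring-like toy points and the gamma value are illustrative.

```python
# Sketch: a kernel SVM on data that is not linearly separable in the original space.
from sklearn.svm import SVC

# Inner points are class 0, outer points class 1: no single line separates them in the plane
X = [[0, 0], [0.2, -0.1], [-0.1, 0.2], [0.1, 0.1],
     [3, 0], [-3, 0], [0, 3], [0, -3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

svm = SVC(kernel="rbf", gamma=0.5).fit(X, y)   # nonlinear mapping via the RBF kernel
print(svm.predict([[0.05, 0.0], [2.8, 0.5]]))  # likely [0 1]: separated in the new space
```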
SVM vs. Neural Network

SVM:
  Relatively new concept
  Deterministic algorithm
  Nice generalization properties
  Hard to learn; learned in batch mode using quadratic programming techniques
  Using kernels, can learn very complex functions

Neural Network:
  Relatively old
  Nondeterministic algorithm
  Generalizes well but doesn’t have a strong mathematical foundation
  Can easily be learned in incremental fashion
  To learn complex functions, use a multilayer perceptron (not that trivial)