Bayesian Classification
A statistical classifier
Probabilistic prediction
Predict class membership probabilities
Based on Bayes’ Theorem
Naive Bayesian classifier
comparable performance with decision tree and selected neural
network classifiers
Accuracy and speed are good when applied to large databases
Incremental
Bayesian Classification
Naïve Bayesian Classifier
Class Conditional Independence
Effect of an attribute value on a given class is
independent of the values of other attributes
Simplifies Computations
Bayesian Belief Networks
Graphical models
Represent dependencies among subsets of
attributes
Bayesian Theorem: Basics
Let X be a data sample whose class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that the
hypothesis holds given the observed data sample X
P(H|X) is the posterior probability of H conditioned on X
P(H) (prior probability): the initial probability of H
P(X): the probability that the sample data is observed
P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis H holds
Example: X is a round, red fruit and H is the hypothesis that X is an apple
Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes’ theorem:

    P(H|X) = \frac{P(X|H) \, P(H)}{P(X)}

Predicts that X belongs to class C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for the k classes
Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost
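As a small worked illustration of the formula, here is the round-and-red-fruit example with assumed numbers (the probabilities below are made up purely for illustration):

```python
# Assumed numbers for the fruit/apple example (illustrative only, not from the slides)
p_h = 0.20           # P(H): prior probability that a random fruit is an apple
p_x_given_h = 0.90   # P(X|H): probability that an apple is round and red
p_x = 0.30           # P(X): probability that any fruit is round and red

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.6 -> a round, red fruit is an apple with probability 0.6
```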
Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x_1, x_2, …, x_n)
Suppose there are m classes C_1, C_2, …, C_m
Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
This can be derived from Bayes’ theorem:

    P(C_i|X) = \frac{P(X|C_i) \, P(C_i)}{P(X)}
Naïve Bayesian Classifier
Since P(X) is constant for all classes, only P(X|C_i) P(C_i) needs to be maximized:

    P(C_i|X) \propto P(X|C_i) \, P(C_i)

If the class prior probabilities are not known, it can be assumed that all classes are equally likely, and only P(X|C_i) is maximized
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):

    P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)
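A minimal sketch of the resulting decision rule, assuming the priors P(C_i) and the per-attribute conditionals P(x_k|C_i) have already been estimated; the class names, attribute values, and numbers below are placeholders:

```python
# Placeholder probabilities (illustrative, not from the slides)
priors = {"C1": 0.6, "C2": 0.4}                 # P(C_i)
conditionals = {                                # P(x_k | C_i) for each observed attribute value
    "C1": {"x1=a": 0.5, "x2=b": 0.3},
    "C2": {"x1=a": 0.2, "x2=b": 0.7},
}

def score(cls, observed):
    """P(C_i) * prod_k P(x_k | C_i): the quantity to be maximized."""
    p = priors[cls]
    for value in observed:
        p *= conditionals[cls][value]
    return p

observed = ["x1=a", "x2=b"]
prediction = max(priors, key=lambda c: score(c, observed))
print(prediction)   # the class with the highest P(X|C_i) P(C_i)
```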
Derivation of Naïve Bayes Classifier
This greatly reduces the computation cost: Only counts the
class distribution
If A_k is categorical, P(x_k|C_i) = s_ik / s_i, where s_ik is the # of tuples in C_i having value x_k for A_k and s_i is the number of training samples belonging to C_i
If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

    g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

so that P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
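The two estimates above can be computed directly from a training set by counting and by fitting a per-class Gaussian; the following sketch uses illustrative variable names and toy data that are not from the slides:

```python
import math

def categorical_estimate(column, labels, value, cls):
    """P(x_k|C_i) = s_ik / s_i for a categorical attribute A_k."""
    s_i = sum(1 for y in labels if y == cls)
    s_ik = sum(1 for x, y in zip(column, labels) if y == cls and x == value)
    return s_ik / s_i

def gaussian_estimate(column, labels, value, cls):
    """P(x_k|C_i) = g(x_k, mu_Ci, sigma_Ci) for a continuous attribute A_k."""
    xs = [x for x, y in zip(column, labels) if y == cls]
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Toy usage (made-up values)
labels  = ["yes", "yes", "no", "yes", "no"]
student = ["yes", "no", "no", "yes", "yes"]
age     = [25, 38, 45, 30, 52]
print(categorical_estimate(student, labels, "yes", "yes"))   # 2/3
print(gaussian_estimate(age, labels, 28, "yes"))             # Gaussian density at age 28
```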
Example
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <=30,
Income = medium,
Student = yes,
Credit_rating = Fair)
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
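The naïve Bayesian classifier can be applied to this training set to classify the sample X; the sketch below uses only the counts from the table above (helper names are illustrative):

```python
from collections import Counter

# Training tuples from the table: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

x = ("<=30", "medium", "yes", "fair")            # the data sample X
class_counts = Counter(row[-1] for row in data)  # s_i for each class

def score(cls):
    """P(C_i) * prod_k P(x_k|C_i), with P(x_k|C_i) = s_ik / s_i."""
    s_i = class_counts[cls]
    p = s_i / len(data)                          # prior P(C_i)
    for k, value in enumerate(x):
        s_ik = sum(1 for row in data if row[-1] == cls and row[k] == value)
        p *= s_ik / s_i
    return p

for cls in class_counts:
    print(cls, score(cls))
print("prediction:", max(class_counts, key=score))   # buys_computer = 'yes' for this X
```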
Avoiding the 0-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability

    P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)

will be zero
Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
The “corrected” prob. estimates are close to their “uncorrected” counterparts
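A small sketch of the Laplacian correction applied to the income counts above (the helper function is illustrative):

```python
def laplace_corrected(counts):
    """Add 1 to each category count before normalizing, so no estimate is zero."""
    total = sum(counts.values()) + len(counts)   # 1000 + 3 = 1003 in this example
    return {value: (count + 1) / total for value, count in counts.items()}

income_counts = {"low": 0, "medium": 990, "high": 10}
print(laplace_corrected(income_counts))
# low = 1/1003, medium = 991/1003, high = 11/1003
```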
Naïve Bayesian Classifier
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
E.g., in hospital patient data: profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier
Bayesian Belief Networks
Models dependencies between variables
Defined by two components
Directed Acyclic Graph
Conditional Probability Table (CPT) for each variable
Bayesian belief network allows a subset of the
variables to be conditionally independent
Bayesian Belief Networks
A graphical model of causal relationships
Represents dependency among the variables
Gives a specification of joint probability distribution
(Figure: an example DAG with nodes X, Y, Z, and P)
Nodes: random variables
Links: dependency
X and Y are the parents of Z, and Y is
the parent of P
No dependency between Z and P
Has no loops or cycles
Bayesian Belief Network: An Example
(Figure: a belief network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea)

The conditional probability table (CPT) for variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination of its parents
Derivation of the probability of a particular combination of values of X, from CPT:

    P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))
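The CPT and the product formula above can be represented directly in code. In the sketch below, only the LungCancer CPT comes from the slide; the priors for FamilyHistory and Smoker are assumed placeholders, and FamilyHistory and Smoker are assumed to be root nodes:

```python
# CPT for LungCancer given its parents (FamilyHistory, Smoker), from the slide
p_lc = {
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

# Assumed placeholder priors for the parent nodes (not given on the slide)
p_fh, p_s = 0.1, 0.3

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S), i.e. the product of P(x_i | Parents(Y_i))."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    p_lc_given_parents = p_lc[(fh, s)]
    return p * (p_lc_given_parents if lc else 1 - p_lc_given_parents)

print(joint(fh=True, s=True, lc=True))   # 0.1 * 0.3 * 0.8 = 0.024
```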
Training Bayesian Networks
Several scenarios:
Given both the network structure and all variables observable:
learn only the CPTs
Network structure known, some hidden variables: gradient
descent (greedy hill-climbing) method, analogous to neural
network learning
Network structure unknown, all variables observable: search
through the model space to reconstruct network topology
Unknown structure, all hidden variables: No good algorithms
known for this purpose
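For the first scenario (structure known and all variables observable), the CPTs can be learned by simple relative-frequency counting; a minimal sketch with illustrative names and toy data:

```python
from collections import defaultdict

def learn_cpt(records, child, parents):
    """Estimate P(child | parents) from fully observed records by counting."""
    joint_counts = defaultdict(int)    # counts of (parent values, child value)
    parent_counts = defaultdict(int)   # counts of parent values
    for r in records:
        key = tuple(r[p] for p in parents)
        joint_counts[(key, r[child])] += 1
        parent_counts[key] += 1
    return {k: c / parent_counts[k[0]] for k, c in joint_counts.items()}

# Toy fully observed records (assumed for illustration)
records = [
    {"FamilyHistory": False, "Smoker": True, "LungCancer": True},
    {"FamilyHistory": False, "Smoker": True, "LungCancer": False},
    {"FamilyHistory": False, "Smoker": False, "LungCancer": False},
    {"FamilyHistory": True, "Smoker": True, "LungCancer": True},
]
print(learn_cpt(records, "LungCancer", ["FamilyHistory", "Smoker"]))
```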