An Introduction to Probabilistic Graphical Modeling



Introduction to Probabilistic Graphical Models
Eran Segal
Weizmann Institute

Logistics

Staff:

Instructor
Eran Segal ([email protected], room 149)

Teaching Assistants
Ohad Manor ([email protected], room 125)

Course information:
WWW: http://www.weizmann.ac.il/math/pgm

Course book:

“Bayesian Networks and Beyond”,
Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)

Course structure

One weekly meeting

Sun: 9am-11am

Homework assignments

2 weeks to complete each

40% of final grade

Final exam

3 hour class exam, date will be announced

60% of final grade

Probabilistic Graphical Models

Tool for representing complex systems and
performing sophisticated reasoning tasks

Fundamental notion: Modularity

Complex systems are built by combining simpler parts

Why have a model?

Compact and modular representation of complex systems

Ability to execute complex reasoning patterns

Make predictions

Generalize from the particular problem

Probabilistic Graphical Models

Increasingly important in Machine Learning

Many classical probabilistic problems in statistics,
information theory, pattern recognition, and
statistical mechanics are special cases of the
formalism

Graphical models provide a common framework

Advantage: specialized techniques developed in one
field can be transferred between research communities

Representation: Graphs

Intuitive data structure for modeling highly-
interacting sets of variables

Explicit model for modularity

Data structure that allows for design of efficient
general-purpose algorithms

Reasoning: Probability Theory

Well understood framework for modeling
uncertainty

Partial knowledge of the state of the world

Noisy observations

Phenomenon not covered by our model

Inherent stochasticity

Clear semantics

Can be learned from data

Probabilistic Reasoning

In this course we will learn:

Semantics of probabilistic graphical models (PGMs)

Bayesian networks

Markov networks

Answering queries in a PGM (“inference”)

Learning PGMs from data (“learning”)

Modeling temporal processes with PGMs

Hidden Markov Models (HMMs) as a special case

Course Outline
Week  Topic                                           Reading
1     Introduction, Bayesian network representation   1-3
2     Bayesian network representation cont.           1-3
3     Local probability models                        5
4     Undirected graphical models                     4
5     Exact inference                                  9,10
6     Exact inference cont.                            9,10
7     Approximate inference                            12
8     Approximate inference cont.                      12
9     Learning: Parameters                             16,17
10    Learning: Parameters cont.                       16,17
11    Learning: Structure                              18
12    Partially observed data                          19
13    Learning undirected graphical models             20
14    Template models                                  6
15    Dynamic Bayesian networks                        15

A Simple Example

We want to model whether our neighbor will
inform us of the alarm being set off

The alarm can go off if

There is a burglary

There is an earthquake

Whether our neighbor calls depends on whether
the alarm is set off

A Simple Example

Variables

Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
E B A N Prob.
F F F F 0.01
F F F T 0.04
F F T F 0.05
F F T T 0.01
F T F F 0.02
F T F T 0.07
F T T F 0.2
F T T T 0.1
T F F F 0.01
T F F T 0.07
T F T F 0.13
T F T T 0.04
T T F F 0.06
T T F T 0.05
T T T F 0.1
T T T T 0.05
2^4 - 1 = 15 independent parameters

A Simple Example

[Figure: network structure: Earthquake → Alarm, Burglary → Alarm, Alarm → NeighborCalls]

P(E):  E=F 0.9,  E=T 0.1

P(B):  B=F 0.7,  B=T 0.3

P(A | E, B):
  E  B    A=F   A=T
  F  F    0.99  0.01
  F  T    0.1   0.9
  T  F    0.3   0.7
  T  T    0.01  0.99

P(N | A):
  A    N=F   N=T
  F    0.9   0.1
  T    0.2   0.8

8 independent parameters (1 + 1 + 4 + 2)
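As a sanity check (not part of the original slides), here is a minimal Python sketch that encodes the CPTs above and answers a query by brute-force enumeration over the 2^4 joint states; the variable names and the choice of query are illustrative assumptions.

```python
from itertools import product

# CPDs from the slide: P(E), P(B), P(A | E, B), P(N | A)
p_e_true = 0.1
p_b_true = 0.3
p_a_true = {(False, False): 0.01, (False, True): 0.9,
            (True, False): 0.7,   (True, True): 0.99}   # keyed by (E, B)
p_n_true = {False: 0.1, True: 0.8}                       # keyed by A

def bern(p_true, value):
    """Probability of a binary value, given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) via the Bayesian-network factorization."""
    return (bern(p_e_true, e) * bern(p_b_true, b)
            * bern(p_a_true[(e, b)], a) * bern(p_n_true[a], n))

# Brute-force inference: P(Burglary = True | NeighborCalls = True)
num = sum(joint(e, True, a, True) for e, a in product([False, True], repeat=2))
den = sum(joint(e, b, a, True) for e, b, a in product([False, True], repeat=3))
print("P(B=T | N=T) =", num / den)
```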

Example Bayesian Network

The “Alarm” network for monitoring intensive care
patients

509 parameters (full joint: 2^37)

37 variables

[Figure: the ALARM network graph over 37 variables, including MINVOLSET, VENTMACH, DISCONNECT, INTUBATION, KINKEDTUBE, PULMEMBOLUS, ANAPHYLAXIS, LVFAILURE, HYPOVOLEMIA, CVP, PCWP, CO, HRBP, HREKG, HRSAT, BP, SAO2, FIO2, PVSAT, VENTLUNG, VENTALV, ARTCO2, EXPCO2, CATECHOL, and others]

Application: Clustering Users

Input: TV shows that each user watches

Output: TV show “clusters”

Assumption: shows watched by same users are similar
Class 1: Power Rangers, Animaniacs, X-Men, Tazmania, Spider-Man

Class 2: Young and the Restless, Bold and the Beautiful, As the World Turns, Price Is Right, CBS Evening News

Class 3: Tonight Show, Conan O’Brien, NBC Nightly News, Later with Kinnear, Seinfeld

Class 4: 60 Minutes, NBC Nightly News, CBS Evening News, Murder She Wrote, Matlock

Class 5: Seinfeld, Friends, Mad About You, ER, Frasier

App.: Recommendation Systems

Given user preferences, suggest recommendations

Example: Amazon.com

Input: movie preferences of many users

Solution: model correlations between movie
features

Users who like comedy often like drama

Users who like action often do not like cartoons

Users who like Robert De Niro films often like Al Pacino films

Given user preferences, can predict probability that new
movies match preferences

Diagnostic Systems

Diagnostic indexing for the home health site at Microsoft

Enter symptoms → recommend multimedia content

Online TroubleShooters

App.: Finding Regulatory Networks

Expression level in each module is a function of the expression of its regulators

[Figure: model sketch with variables Experiment, Module, Regulator 1, Regulator 2, Regulator 3, and expression Level; annotations "What module does gene g belong to?" and "Expression level of Regulator 1 in experiment"; example genes BMH1, GIC2, HAP4, CMK1; CPD P(Level | Module, Regulators)]

App.: Finding Regulatory Networks

[Figure: inferred regulatory network over gene modules, grouped into processes such as amino acid metabolism, energy and cAMP signaling, and nuclear DNA and RNA processing; the legend distinguishes regulators that are signaling molecules vs. transcription factors, regulation supported in the literature vs. inferred regulation, experimentally tested regulators, module numbers, and enriched cis-regulatory motifs]

Prerequisites

Probability theory

Conditional probabilities

Joint distribution

Random variables

Information theory

Function optimization

Graph theory

Computational complexity

Probability Theory

Probability distribution P over (Ω, S) is a mapping from events in S such that:

P(α) ≥ 0 for all α ∈ S

P(Ω) = 1

If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)

Conditional Probability:  P(β | α) = P(α ∩ β) / P(α)

Chain Rule:  P(α ∩ β) = P(α) P(β | α)

Bayes Rule:  P(α | β) = P(β | α) P(α) / P(β)

Conditional Independence:  P(α | β ∩ γ) = P(α | γ)
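A quick numeric illustration (not from the slides) of the chain rule, total probability, and Bayes rule; the disease-test numbers below are made up for illustration.

```python
# Hypothetical test for a rare condition.
p_disease = 0.01                 # P(D)
p_pos_given_disease = 0.95       # P(+ | D)
p_pos_given_healthy = 0.05       # P(+ | not D)

# Chain rule: P(D, +) = P(D) P(+ | D)
p_joint = p_disease * p_pos_given_disease

# Total probability: P(+) = P(+ | D) P(D) + P(+ | not D) P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes rule: P(D | +) = P(+ | D) P(D) / P(+)
print("P(D | +) =", p_joint / p_pos)   # roughly 0.16
```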

Random Variables & Notation

Random variable: Function from Ω to a value

Categorical / Ordinal / Continuous

Val(X) – set of possible values of RV X

Upper case letters denote RVs (e.g., X, Y, Z)

Upper case bold letters denote sets of RVs (e.g., X, Y)

Lower case letters denote RV values (e.g., x, y, z)

Lower case bold letters denote RV set values (e.g., x)

Values for categorical RVs with |Val(X)| = k: x_1, x_2, …, x_k

Marginal distribution over X: P(X)

Conditional independence: X is independent of Y given Z if
P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)

Expectation

Discrete RVs:  E_P[X] = Σ_x x P(x)

Continuous RVs:  E_P[X] = ∫ x p(x) dx

Linearity of expectation:
E_P[X + Y] = Σ_{x,y} (x + y) P(x,y)
           = Σ_{x,y} x P(x,y) + Σ_{x,y} y P(x,y)
           = Σ_x x P(x) + Σ_y y P(y)
           = E_P[X] + E_P[Y]

Expectation of products (when X ⊥ Y in P):
E_P[XY] = Σ_{x,y} xy P(x,y)
        = Σ_{x,y} xy P(x) P(y)        (independence assumption)
        = (Σ_x x P(x)) (Σ_y y P(y))
        = E_P[X] E_P[Y]
Variance

Variance of RV:

If X and Y are independent: Var[X+Y]=Var[X]+Var[Y]

Var[aX+b]=a
2
Var[X]
  
  
][][
][][][2][
]][[]][2[][
][][2
][][
22
22
22
22
2
XEXE
XEXEXEXE
XEEXXEEXE
XEXXEXE
XEXEXVar
PP
PPPP
PPPPP
PPP
PPP






Information Theory

Entropy:  H_P(X) = -Σ_x P(x) log P(x)

We use log base 2 to interpret entropy as bits of information

Entropy of X is a lower bound on the avg. # of bits needed to encode values of X

0 ≤ H_P(X) ≤ log|Val(X)| for any distribution P(X)

Conditional entropy:
H_P(X | Y) = H_P(X, Y) - H_P(Y) = -Σ_{x,y} P(x,y) log P(x | y)

Information only helps:  H_P(X | Y) ≤ H_P(X)

Mutual information:
I_P(X;Y) = H_P(X) - H_P(X | Y) = Σ_{x,y} P(x,y) log [P(x | y) / P(x)]

0 ≤ I_P(X;Y) ≤ H_P(X)

Symmetry: I_P(X;Y) = I_P(Y;X)

I_P(X;Y) = 0 iff X and Y are independent

Chain rule of entropies:
H_P(X_1,…,X_n) = H_P(X_1) + H_P(X_2 | X_1) + … + H_P(X_n | X_1,…,X_{n-1})
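A minimal sketch, using a made-up 2x2 joint distribution, that computes H(X), H(X|Y), and I(X;Y) in bits from the formulas above.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],    # rows: values of X
                 [0.20, 0.40]])   # cols: values of Y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):
    """Entropy in bits of a probability vector (ignoring zero entries)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy.ravel())
H_x_given_y = H_xy - H_y                 # conditional entropy H(X|Y)
I_xy = H_x - H_x_given_y                 # mutual information I(X;Y)

print(f"H(X)={H_x:.3f}  H(X|Y)={H_x_given_y:.3f}  I(X;Y)={I_xy:.3f}")
print(H_x_given_y <= H_x)                # information only helps
```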

Distances Between Distributions

Relative Entropy:  D(P‖Q) = Σ_x P(x) log [P(x) / Q(x)]

D(P‖Q) ≥ 0

D(P‖Q) = 0 iff P = Q

Not a distance metric (no symmetry and no triangle inequality)

L1 distance:  ‖P - Q‖_1 = Σ_x |P(x) - Q(x)|

L2 distance:  ‖P - Q‖_2 = (Σ_x (P(x) - Q(x))^2)^{1/2}

L∞ distance:  ‖P - Q‖_∞ = max_x |P(x) - Q(x)|
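A minimal sketch computing D(P‖Q) and the L1 / L2 / L∞ distances for two made-up distributions over the same support.

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl = (P * np.log2(P / Q)).sum()          # D(P||Q), assumes Q > 0 wherever P > 0
kl_rev = (Q * np.log2(Q / P)).sum()      # D(Q||P): generally differs (no symmetry)
l1 = np.abs(P - Q).sum()
l2 = np.sqrt(((P - Q) ** 2).sum())
linf = np.abs(P - Q).max()

print(f"D(P||Q)={kl:.4f}  D(Q||P)={kl_rev:.4f}")
print(f"L1={l1:.2f}  L2={l2:.3f}  Linf={linf:.2f}")
```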


Optimization Theory

Find values θ_1,…,θ_n such that  f(θ_1,…,θ_n) = max_{θ'_1,…,θ'_n} f(θ'_1,…,θ'_n)

Optimization strategies

Solve gradient analytically and verify local maximum

Gradient search: guess initial values, and improve iteratively

Gradient ascent

Line search

Conjugate gradient

Lagrange multipliers

Solve maximization problem with constraints C: ∀j = 1,…,n: c_j(θ) = 0

Maximize  L(θ, λ) = f(θ) + Σ_{j=1..n} λ_j c_j(θ)
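A minimal gradient-ascent sketch, one of the strategies listed above, on a made-up concave function f(θ_1, θ_2) = -(θ_1 - 1)^2 - (θ_2 + 2)^2; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def grad_f(theta):
    """Gradient of f(t1, t2) = -(t1 - 1)^2 - (t2 + 2)^2."""
    t1, t2 = theta
    return np.array([-2.0 * (t1 - 1.0), -2.0 * (t2 + 2.0)])

theta = np.array([5.0, 5.0])     # initial guess
step = 0.1
for _ in range(200):             # improve iteratively along the gradient
    theta = theta + step * grad_f(theta)

print(theta)                     # approaches [1, -2], the maximizer
```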

Graph Theory

Undirected graph

Directed graph

Complete graph (every two nodes connected)

Acyclic graph

Partially directed acyclic graph (PDAG)

Induced graph

Sub-graph

Graph algorithms

Shortest path from node X_1 to all other nodes (BFS)
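A minimal BFS sketch computing shortest path lengths (in edges) from a source node to all other nodes of an unweighted graph; the example graph below is made up.

```python
from collections import deque

graph = {"X1": ["X2", "X3"], "X2": ["X4"], "X3": ["X4"], "X4": ["X5"], "X5": []}

def bfs_distances(graph, source):
    """Return shortest-path distances (edge counts) from source via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:           # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

print(bfs_distances(graph, "X1"))       # {'X1': 0, 'X2': 1, 'X3': 1, 'X4': 2, 'X5': 3}
```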

Representing Joint Distributions

Random variables: X_1,…,X_n

P is a joint distribution over X_1,…,X_n

If X_1,…,X_n are binary, we need 2^n parameters to describe P

Can we represent P more compactly?

Key: Exploit independence properties

Independent Random Variables

Two variables X and Y are independent if

P(X=x|Y=y) = P(X=x) for all values x,y

Equivalently, knowing Y does not change predictions of X

If X and Y are independent then:

P(X, Y) = P(X|Y)P(Y) = P(X)P(Y)
If X_1,…,X_n are independent then:

P(X_1,…,X_n) = P(X_1) … P(X_n)

O(n) parameters

All 2^n probabilities are implicitly defined

Cannot represent many types of distributions

Conditional Independence

X and Y are conditionally independent given Z if

P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z

Equivalently, if we know Z, then knowing Y does not
change predictions of X

Notation: Ind(X;Y | Z) or (X ⊥ Y | Z)

Conditional Parameterization

S = Score on test, Val(S) = {s0, s1}

I = Intelligence, Val(I) = {i0, i1}

Joint parameterization P(I,S):
  I   S   P(I,S)
  i0  s0  0.665
  i0  s1  0.035
  i1  s0  0.06
  i1  s1  0.24

Conditional parameterization P(I,S) = P(I) P(S|I):
  P(I):    i0 0.7,  i1 0.3
  P(S|I):  I=i0: s0 0.95, s1 0.05
           I=i1: s0 0.2,  s1 0.8

Joint parameterization: 3 parameters    Conditional parameterization: 3 parameters

Alternative parameterization: P(S) and P(I|S)
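A minimal check (mine, not from the slides) that the conditional parameterization P(I) P(S|I) above reproduces the joint table P(I, S).

```python
p_i = {"i0": 0.7, "i1": 0.3}
p_s_given_i = {"i0": {"s0": 0.95, "s1": 0.05},
               "i1": {"s0": 0.2,  "s1": 0.8}}
joint_table = {("i0", "s0"): 0.665, ("i0", "s1"): 0.035,
               ("i1", "s0"): 0.06,  ("i1", "s1"): 0.24}

# P(i, s) = P(i) * P(s | i) should match every entry of the joint table
for (i, s), p in joint_table.items():
    reconstructed = p_i[i] * p_s_given_i[i][s]
    print(i, s, round(reconstructed, 3), "==", p)
```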

Conditional Parameterization

S = Score on test, Val(S) = {s0, s1}

I = Intelligence, Val(I) = {i0, i1}

G = Grade, Val(G) = {g0, g1, g2}

Assume that G and S are independent given I

Joint parameterization

2·2·3 - 1 = 11 independent parameters

Conditional parameterization has

P(I,S,G) = P(I) P(S|I) P(G|I,S) = P(I) P(S|I) P(G|I)

P(I) – 1 independent parameter

P(S|I) – 2·1 independent parameters

P(G|I) – 2·2 independent parameters

7 independent parameters

Naïve Bayes Model

Class variable C, Val(C) = {c_1,…,c_k}

Evidence variables X_1,…,X_n

Naïve Bayes assumption: evidence variables are conditionally independent given C

P(C, X_1,…,X_n) = P(C) Π_{i=1..n} P(X_i | C)

Applications in medical diagnosis, text classification

Used as a classifier:

P(C=c_1 | x_1,…,x_n) / P(C=c_2 | x_1,…,x_n) = [P(C=c_1) / P(C=c_2)] · Π_{i=1..n} [P(x_i | C=c_1) / P(x_i | C=c_2)]

Problem: Double counting correlated evidence
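A minimal naïve Bayes classifier sketch with made-up CPDs for three binary evidence variables; it simply applies the factorization above and normalizes.

```python
import numpy as np

p_class = {"c1": 0.6, "c2": 0.4}
# P(X_i = 1 | C) for three hypothetical binary evidence variables
p_x_true = {"c1": [0.8, 0.1, 0.7],
            "c2": [0.3, 0.6, 0.2]}

def posterior(x):
    """Return P(C | x_1,...,x_n) for a vector of 0/1 observations."""
    scores = {}
    for c, prior in p_class.items():
        likelihood = np.prod([p if xi else 1 - p
                              for p, xi in zip(p_x_true[c], x)])
        scores[c] = prior * likelihood          # P(C) * prod_i P(x_i | C)
    z = sum(scores.values())                    # normalize over classes
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0, 1]))    # strongly favors c1 under these made-up numbers
```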

Bayesian Network (Informal)

Directed acyclic graph G

Nodes represent random variables

Edges represent direct influences between random
variables

Local probability models
[Figure: Example 1: I → S;  Example 2: I → S, I → G;  Naïve Bayes: C → X_1, C → X_2, …, C → X_n]

Bayesian Network (Informal)

Represent a joint distribution

Specifies the probability for P(X=x)

Specifies the probability for P(X=x|E=e)

Allows for reasoning patterns

Prediction (e.g., intelligent → high scores)

Explanation (e.g., low score → not intelligent)

Explaining away (different causes for same effect interact)

[Figure: Example 2: I → S, I → G]

Bayesian Network Structure

Directed acyclic graph G

Nodes X_1,…,X_n represent random variables

G encodes local Markov assumptions:

X_i is independent of its non-descendants given its parents

Formally: (X_i ⊥ NonDesc(X_i) | Pa(X_i))

[Figure: example DAG over nodes A, B, C, D, E, F, G; here E ⊥ {A, C, D, F} | B]

Independency Mappings (I-Maps)

Let P be a distribution over X

Let I(P) be the independencies (X ⊥ Y | Z) in P

A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example 1 (no edge between I and S, so I(G) = {I ⊥ S}):
  I   S   P(I,S)
  i0  s0  0.25
  i0  s1  0.25
  i1  s0  0.25
  i1  s1  0.25
  I(P) = {I ⊥ S}

Example 2 (edge between I and S, so I(G) = ∅):
  I   S   P(I,S)
  i0  s0  0.4
  i0  s1  0.3
  i1  s0  0.2
  i1  s1  0.1
  I(P) = ∅

Factorization Theorem

If G is an I-Map of P, then  P(X_1,…,X_n) = Π_{i=1..n} P(X_i | Pa(X_i))

Proof:

wlog. X_1,…,X_n is an ordering consistent with G

By chain rule:  P(X_1,…,X_n) = Π_{i=1..n} P(X_i | X_1,…,X_{i-1})

From the assumption:  Pa(X_i) ⊆ {X_1,…,X_{i-1}}  and  {X_1,…,X_{i-1}} \ Pa(X_i) ⊆ NonDesc(X_i)

Since G is an I-Map, (X_i ⊥ NonDesc(X_i) | Pa(X_i)) ∈ I(P)

Hence  P(X_i | X_1,…,X_{i-1}) = P(X_i | Pa(X_i))
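A minimal numeric illustration of the theorem on a made-up chain X → Y → Z: build P via the factorization, then verify that the chain-rule term P(z | x, y) collapses to the local CPD P(z | y).

```python
from itertools import product

# Made-up CPDs for a chain X -> Y -> Z over binary variables
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

# Joint built from the factorization P(x) P(y|x) P(z|y)
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product([0, 1], repeat=3)}

# Check P(Z=1 | X=x, Y=y) == P(Z=1 | Y=y) for all x, y
for x, y in product([0, 1], repeat=2):
    p_xy = sum(joint[(x, y, z)] for z in [0, 1])
    p_z_given_xy = joint[(x, y, 1)] / p_xy
    print(x, y, round(p_z_given_xy, 3), "==", p_z_given_y[y][1])
```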

Factorization Implies I-Map

If P(X_1,…,X_n) = Π_{i=1..n} P(X_i | Pa(X_i)), then G is an I-Map of P

Proof:

Need to show (X_i ⊥ NonDesc(X_i) | Pa(X_i)) ∈ I(P), or that
P(X_i | NonDesc(X_i)) = P(X_i | Pa(X_i))

wlog. X_1,…,X_n is an ordering consistent with G in which the non-descendants of X_i precede it, so NonDesc(X_i) = {X_1,…,X_{i-1}}

P(X_i | NonDesc(X_i)) = P(X_i, NonDesc(X_i)) / P(NonDesc(X_i))
                      = Π_{k=1..i} P(X_k | Pa(X_k)) / Π_{k=1..i-1} P(X_k | Pa(X_k))
                      = P(X_i | Pa(X_i))
Bayesian Network Definition

A Bayesian network is a pair (G,P)

P factorizes over G

P is specified as a set of CPDs associated with G’s nodes

Parameters

Joint distribution: 2^n

Bayesian network (bounded in-degree k): n·2^k
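A one-off comparison of the two parameter counts above for binary variables, using illustrative values of n and k.

```python
# Full joint over n binary variables vs. a Bayesian network whose nodes
# each have at most k parents (so each CPD has at most 2^k rows).
n, k = 37, 4                                # e.g. an ALARM-sized network with small in-degree
print("full joint:", 2 ** n)                # 137438953472
print("BN (in-degree <= k):", n * 2 ** k)   # 592
```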

Bayesian Network Design

Variable considerations

Clarity test: can an omniscient being determine its
value?

Hidden variables?

Irrelevant variables

Structure considerations

Causal order of variables

Which independencies (approximately) hold?

Probability considerations

Zero probabilities

Orders of magnitude

Relative values