An Introduction to Probabilistic Graphical Modeling



Introduction to Probabilistic Graphical Models
Eran Segal
Weizmann Institute

Logistics

Staff:

Instructor
Eran Segal ([email protected], room 149)

Teaching Assistants
Ohad Manor ([email protected], room 125)

Course information:
WWW: http://www.weizmann.ac.il/math/pgm

Course book:

“Bayesian Networks and Beyond”,
Daphne Koller (Stanford) & Nir Friedman (Hebrew U.)

Course structure

One weekly meeting

Sun: 9am-11am

Homework assignments

2 weeks to complete each

40% of final grade

Final exam

3 hour class exam, date will be announced

60% of final grade

Probabilistic Graphical Models

Tool for representing complex systems and
performing sophisticated reasoning tasks

Fundamental notion: Modularity

Complex systems are built by combining simpler parts

Why have a model?

Compact and modular representation of complex systems

Ability to execute complex reasoning patterns

Make predictions

Generalize from the particular problem

Probabilistic Graphical Models

Increasingly important in Machine Learning

Many classical probabilistic problems in statistics,
information theory, pattern recognition, and
statistical mechanics are special cases of the
formalism

Graphical models provide a common framework

Advantage: specialized techniques developed in one
field can be transferred between research communities

Representation: Graphs

Intuitive data structure for modeling highly-
interacting sets of variables

Explicit model for modularity

Data structure that allows for design of efficient
general-purpose algorithms

Reasoning: Probability Theory

Well understood framework for modeling
uncertainty

Partial knowledge of the state of the world

Noisy observations

Phenomenon not covered by our model

Inherent stochasticity

Clear semantics

Can be learned from data

Probabilistic Reasoning

In this course we will learn:

Semantics of probabilistic graphical models (PGMs)

Bayesian networks

Markov networks

Answering queries in a PGM (“inference”)

Learning PGMs from data (“learning”)

Modeling temporal processes with PGMs

Hidden Markov Models (HMMs) as a special case

Course Outline
Week  Topic                                           Reading
1     Introduction, Bayesian network representation   1-3
2     Bayesian network representation cont.           1-3
3     Local probability models                        5
4     Undirected graphical models                     4
5     Exact inference                                  9,10
6     Exact inference cont.                            9,10
7     Approximate inference                            12
8     Approximate inference cont.                      12
9     Learning: Parameters                             16,17
10    Learning: Parameters cont.                       16,17
11    Learning: Structure                              18
12    Partially observed data                          19
13    Learning undirected graphical models             20
14    Template models                                  6
15    Dynamic Bayesian networks                        15

A Simple Example

We want to model whether our neighbor will
inform us of the alarm being set off

The alarm can go off if

There is a burglary

There is an earthquake

Whether our neighbor calls depends on whether
the alarm is set off

A Simple Example

Variables

Earthquake (E), Burglary (B), Alarm (A), NeighborCalls (N)
E B A N Prob.
F F F F 0.01
F F F T 0.04
F F T F 0.05
F F T T 0.01
F T F F 0.02
F T F T 0.07
F T T F 0.2
F T T T 0.1
T F F F 0.01
T F F T 0.07
T F T F 0.13
T F T T 0.04
T T F F 0.06
T T F T 0.05
T T T F 0.1
T T T T 0.05
2^4 - 1 = 15 independent parameters

A Simple Example

[Figure: network structure: Earthquake → Alarm, Burglary → Alarm, Alarm → NeighborCalls]

P(E):  E=F 0.9,  E=T 0.1

P(B):  B=F 0.7,  B=T 0.3

P(A | E, B):
  E  B    A=F   A=T
  F  F    0.99  0.01
  F  T    0.1   0.9
  T  F    0.3   0.7
  T  T    0.01  0.99

P(N | A):
  A    N=F   N=T
  F    0.9   0.1
  T    0.2   0.8

8 independent parameters (1 + 1 + 4 + 2)
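As a sanity check (not part of the original slides), here is a minimal Python sketch that encodes the CPTs above and answers a query by brute-force enumeration over the 2^4 joint states; the variable names and the choice of query are illustrative assumptions.

```python
from itertools import product

# CPDs from the slide: P(E), P(B), P(A | E, B), P(N | A)
p_e_true = 0.1
p_b_true = 0.3
p_a_true = {(False, False): 0.01, (False, True): 0.9,
            (True, False): 0.7,   (True, True): 0.99}   # keyed by (E, B)
p_n_true = {False: 0.1, True: 0.8}                       # keyed by A

def bern(p_true, value):
    """Probability of a binary value, given P(value = True)."""
    return p_true if value else 1.0 - p_true

def joint(e, b, a, n):
    """P(E=e, B=b, A=a, N=n) via the Bayesian-network factorization."""
    return (bern(p_e_true, e) * bern(p_b_true, b)
            * bern(p_a_true[(e, b)], a) * bern(p_n_true[a], n))

# Brute-force inference: P(Burglary = True | NeighborCalls = True)
num = sum(joint(e, True, a, True) for e, a in product([False, True], repeat=2))
den = sum(joint(e, b, a, True) for e, b, a in product([False, True], repeat=3))
print("P(B=T | N=T) =", num / den)
```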

Example Bayesian Network

The “Alarm” network for monitoring intensive care
patients

509 parameters (full joint: 2^37)

37 variables

[Figure: the ALARM network graph over 37 variables, including MINVOLSET, VENTMACH, DISCONNECT, INTUBATION, KINKEDTUBE, PULMEMBOLUS, ANAPHYLAXIS, LVFAILURE, HYPOVOLEMIA, CVP, PCWP, CO, HRBP, HREKG, HRSAT, BP, SAO2, FIO2, PVSAT, VENTLUNG, VENTALV, ARTCO2, EXPCO2, CATECHOL, and others]

Application: Clustering Users

Input: TV shows that each user watches

Output: TV show “clusters”

Assumption: shows watched by same users are similar
Class 1: Power Rangers, Animaniacs, X-Men, Tazmania, Spider-Man

Class 2: Young and the Restless, Bold and the Beautiful, As the World Turns, Price Is Right, CBS Evening News

Class 3: Tonight Show, Conan O’Brien, NBC Nightly News, Later with Kinnear, Seinfeld

Class 4: 60 Minutes, NBC Nightly News, CBS Evening News, Murder She Wrote, Matlock

Class 5: Seinfeld, Friends, Mad About You, ER, Frasier

App.: Recommendation Systems

Given user preferences, suggest recommendations

Example: Amazon.com

Input: movie preferences of many users

Solution: model correlations between movie
features

Users who like comedy often like drama

Users who like action often do not like cartoons

Users who like Robert De Niro films often like Al Pacino films

Given user preferences, can predict probability that new
movies match preferences

Diagnostic Systems

Diagnostic indexing for the home health site at Microsoft

Enter symptoms → recommend multimedia content

Online TroubleShooters

App.: Finding Regulatory Networks

Expression level in each module is a function of the expression of its regulators

[Figure: model sketch with variables Experiment, Module, Regulator 1, Regulator 2, Regulator 3, and expression Level; annotations "What module does gene g belong to?" and "Expression level of Regulator 1 in experiment"; example genes BMH1, GIC2, HAP4, CMK1; CPD P(Level | Module, Regulators)]

App.: Finding Regulatory Networks

[Figure: inferred regulatory network over gene modules, grouped into processes such as amino acid metabolism, energy and cAMP signaling, and nuclear DNA and RNA processing; the legend distinguishes regulators that are signaling molecules vs. transcription factors, regulation supported in the literature vs. inferred regulation, experimentally tested regulators, module numbers, and enriched cis-regulatory motifs]

Prerequisites

Probability theory

Conditional probabilities

Joint distribution

Random variables

Information theory

Function optimization

Graph theory

Computational complexity

Probability Theory

Probability distribution P over (Ω, S) is a mapping from events in S such that:

P(α) ≥ 0 for all α ∈ S

P(Ω) = 1

If α, β ∈ S and α ∩ β = ∅, then P(α ∪ β) = P(α) + P(β)

Conditional Probability:  P(β | α) = P(α ∩ β) / P(α)

Chain Rule:  P(α ∩ β) = P(α) P(β | α)

Bayes Rule:  P(α | β) = P(β | α) P(α) / P(β)

Conditional Independence:  P(α | β ∩ γ) = P(α | γ)
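A quick numeric illustration (not from the slides) of the chain rule, total probability, and Bayes rule; the disease-test numbers below are made up for illustration.

```python
# Hypothetical test for a rare condition.
p_disease = 0.01                 # P(D)
p_pos_given_disease = 0.95       # P(+ | D)
p_pos_given_healthy = 0.05       # P(+ | not D)

# Chain rule: P(D, +) = P(D) P(+ | D)
p_joint = p_disease * p_pos_given_disease

# Total probability: P(+) = P(+ | D) P(D) + P(+ | not D) P(not D)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes rule: P(D | +) = P(+ | D) P(D) / P(+)
print("P(D | +) =", p_joint / p_pos)   # roughly 0.16
```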

Random Variables & Notation

Random variable: Function from Ω to a value

Categorical / Ordinal / Continuous

Val(X) – set of possible values of RV X

Upper case letters denote RVs (e.g., X, Y, Z)

Upper case bold letters denote sets of RVs (e.g., X, Y)

Lower case letters denote RV values (e.g., x, y, z)

Lower case bold letters denote RV set values (e.g., x)

Values for categorical RVs with |Val(X)| = k: x_1, x_2, …, x_k

Marginal distribution over X: P(X)

Conditional independence: X is independent of Y given Z if
P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all x ∈ Val(X), y ∈ Val(Y), z ∈ Val(Z)

Expectation

Discrete RVs:  E_P[X] = Σ_x x P(x)

Continuous RVs:  E_P[X] = ∫ x p(x) dx

Linearity of expectation:
E_P[X + Y] = Σ_{x,y} (x + y) P(x,y)
           = Σ_{x,y} x P(x,y) + Σ_{x,y} y P(x,y)
           = Σ_x x P(x) + Σ_y y P(y)
           = E_P[X] + E_P[Y]

Expectation of products (when X ⊥ Y in P):
E_P[XY] = Σ_{x,y} xy P(x,y)
        = Σ_{x,y} xy P(x) P(y)        (independence assumption)
        = (Σ_x x P(x)) (Σ_y y P(y))
        = E_P[X] E_P[Y]
Variance

Variance of RV:

If X and Y are independent: Var[X+Y]=Var[X]+Var[Y]

Var[aX+b]=a
2
Var[X]
  
  
][][
][][][2][
]][[]][2[][
][][2
][][
22
22
22
22
2
XEXE
XEXEXEXE
XEEXXEEXE
XEXXEXE
XEXEXVar
PP
PPPP
PPPPP
PPP
PPP






Information Theory

Entropy:  H_P(X) = -Σ_x P(x) log P(x)

We use log base 2 to interpret entropy as bits of information

Entropy of X is a lower bound on the avg. # of bits needed to encode values of X

0 ≤ H_P(X) ≤ log|Val(X)| for any distribution P(X)

Conditional entropy:
H_P(X | Y) = H_P(X, Y) - H_P(Y) = -Σ_{x,y} P(x,y) log P(x | y)

Information only helps:  H_P(X | Y) ≤ H_P(X)

Mutual information:
I_P(X;Y) = H_P(X) - H_P(X | Y) = Σ_{x,y} P(x,y) log [P(x | y) / P(x)]

0 ≤ I_P(X;Y) ≤ H_P(X)

Symmetry: I_P(X;Y) = I_P(Y;X)

I_P(X;Y) = 0 iff X and Y are independent

Chain rule of entropies:
H_P(X_1,…,X_n) = H_P(X_1) + H_P(X_2 | X_1) + … + H_P(X_n | X_1,…,X_{n-1})
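A minimal sketch, using a made-up 2x2 joint distribution, that computes H(X), H(X|Y), and I(X;Y) in bits from the formulas above.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],    # rows: values of X
                 [0.20, 0.40]])   # cols: values of Y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):
    """Entropy in bits of a probability vector (ignoring zero entries)."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy.ravel())
H_x_given_y = H_xy - H_y                 # conditional entropy H(X|Y)
I_xy = H_x - H_x_given_y                 # mutual information I(X;Y)

print(f"H(X)={H_x:.3f}  H(X|Y)={H_x_given_y:.3f}  I(X;Y)={I_xy:.3f}")
print(H_x_given_y <= H_x)                # information only helps
```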

Distances Between Distributions

Relative Entropy:  D(P‖Q) = Σ_x P(x) log [P(x) / Q(x)]

D(P‖Q) ≥ 0

D(P‖Q) = 0 iff P = Q

Not a distance metric (no symmetry and no triangle inequality)

L1 distance:  ‖P - Q‖_1 = Σ_x |P(x) - Q(x)|

L2 distance:  ‖P - Q‖_2 = (Σ_x (P(x) - Q(x))^2)^{1/2}

L∞ distance:  ‖P - Q‖_∞ = max_x |P(x) - Q(x)|
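A minimal sketch computing D(P‖Q) and the L1 / L2 / L∞ distances for two made-up distributions over the same support.

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kl = (P * np.log2(P / Q)).sum()          # D(P||Q), assumes Q > 0 wherever P > 0
kl_rev = (Q * np.log2(Q / P)).sum()      # D(Q||P): generally differs (no symmetry)
l1 = np.abs(P - Q).sum()
l2 = np.sqrt(((P - Q) ** 2).sum())
linf = np.abs(P - Q).max()

print(f"D(P||Q)={kl:.4f}  D(Q||P)={kl_rev:.4f}")
print(f"L1={l1:.2f}  L2={l2:.3f}  Linf={linf:.2f}")
```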


Optimization Theory

Find values θ_1,…,θ_n such that  f(θ_1,…,θ_n) = max_{θ'_1,…,θ'_n} f(θ'_1,…,θ'_n)

Optimization strategies

Solve gradient analytically and verify local maximum

Gradient search: guess initial values, and improve iteratively

Gradient ascent

Line search

Conjugate gradient

Lagrange multipliers

Solve maximization problem with constraints C: ∀j = 1,…,n: c_j(θ) = 0

Maximize  L(θ, λ) = f(θ) + Σ_{j=1..n} λ_j c_j(θ)
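A minimal gradient-ascent sketch, one of the strategies listed above, on a made-up concave function f(θ_1, θ_2) = -(θ_1 - 1)^2 - (θ_2 + 2)^2; the step size and iteration count are arbitrary choices.

```python
import numpy as np

def grad_f(theta):
    """Gradient of f(t1, t2) = -(t1 - 1)^2 - (t2 + 2)^2."""
    t1, t2 = theta
    return np.array([-2.0 * (t1 - 1.0), -2.0 * (t2 + 2.0)])

theta = np.array([5.0, 5.0])     # initial guess
step = 0.1
for _ in range(200):             # improve iteratively along the gradient
    theta = theta + step * grad_f(theta)

print(theta)                     # approaches [1, -2], the maximizer
```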

Graph Theory

Undirected graph

Directed graph

Complete graph (every two nodes connected)

Acyclic graph

Partially directed acyclic graph (PDAG)

Induced graph

Sub-graph

Graph algorithms

Shortest path from node X_1 to all other nodes (BFS)
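A minimal BFS sketch computing shortest path lengths (in edges) from a source node to all other nodes of an unweighted graph; the example graph below is made up.

```python
from collections import deque

graph = {"X1": ["X2", "X3"], "X2": ["X4"], "X3": ["X4"], "X4": ["X5"], "X5": []}

def bfs_distances(graph, source):
    """Return shortest-path distances (edge counts) from source via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:           # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

print(bfs_distances(graph, "X1"))       # {'X1': 0, 'X2': 1, 'X3': 1, 'X4': 2, 'X5': 3}
```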

Representing Joint Distributions

Random variables: X_1,…,X_n

P is a joint distribution over X_1,…,X_n

If X_1,…,X_n are binary, we need 2^n parameters to describe P

Can we represent P more compactly?

Key: Exploit independence properties

Independent Random Variables

Two variables X and Y are independent if

P(X=x|Y=y) = P(X=x) for all values x,y

Equivalently, knowing Y does not change predictions of X

If X and Y are independent then:

P(X, Y) = P(X|Y)P(Y) = P(X)P(Y)
If X_1,…,X_n are independent then:

P(X_1,…,X_n) = P(X_1) … P(X_n)

O(n) parameters

All 2^n probabilities are implicitly defined

Cannot represent many types of distributions

Conditional Independence

X and Y are conditionally independent given Z if

P(X=x|Y=y, Z=z) = P(X=x|Z=z) for all values x, y, z

Equivalently, if we know Z, then knowing Y does not
change predictions of X

Notation: Ind(X;Y | Z) or (X ⊥ Y | Z)

Conditional Parameterization

S = Score on test, Val(S) = {s0, s1}

I = Intelligence, Val(I) = {i0, i1}

Joint parameterization P(I,S):
  I   S   P(I,S)
  i0  s0  0.665
  i0  s1  0.035
  i1  s0  0.06
  i1  s1  0.24

Conditional parameterization P(I,S) = P(I) P(S|I):
  P(I):    i0 0.7,  i1 0.3
  P(S|I):  I=i0: s0 0.95, s1 0.05
           I=i1: s0 0.2,  s1 0.8

Joint parameterization: 3 parameters    Conditional parameterization: 3 parameters

Alternative parameterization: P(S) and P(I|S)
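A minimal check (mine, not from the slides) that the conditional parameterization P(I) P(S|I) above reproduces the joint table P(I, S).

```python
p_i = {"i0": 0.7, "i1": 0.3}
p_s_given_i = {"i0": {"s0": 0.95, "s1": 0.05},
               "i1": {"s0": 0.2,  "s1": 0.8}}
joint_table = {("i0", "s0"): 0.665, ("i0", "s1"): 0.035,
               ("i1", "s0"): 0.06,  ("i1", "s1"): 0.24}

# P(i, s) = P(i) * P(s | i) should match every entry of the joint table
for (i, s), p in joint_table.items():
    reconstructed = p_i[i] * p_s_given_i[i][s]
    print(i, s, round(reconstructed, 3), "==", p)
```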

Conditional Parameterization

S = Score on test, Val(S) = {s0, s1}

I = Intelligence, Val(I) = {i0, i1}

G = Grade, Val(G) = {g0, g1, g2}

Assume that G and S are independent given I

Joint parameterization

2·2·3 - 1 = 11 independent parameters

Conditional parameterization has

P(I,S,G) = P(I) P(S|I) P(G|I,S) = P(I) P(S|I) P(G|I)

P(I) – 1 independent parameter

P(S|I) – 2·1 independent parameters

P(G|I) – 2·2 independent parameters

7 independent parameters

Naïve Bayes Model

Class variable C, Val(C) = {c_1,…,c_k}

Evidence variables X_1,…,X_n

Naïve Bayes assumption: evidence variables are conditionally independent given C

P(C, X_1,…,X_n) = P(C) Π_{i=1..n} P(X_i | C)

Applications in medical diagnosis, text classification

Used as a classifier:

P(C=c_1 | x_1,…,x_n) / P(C=c_2 | x_1,…,x_n) = [P(C=c_1) / P(C=c_2)] · Π_{i=1..n} [P(x_i | C=c_1) / P(x_i | C=c_2)]

Problem: Double counting correlated evidence
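A minimal naïve Bayes classifier sketch with made-up CPDs for three binary evidence variables; it simply applies the factorization above and normalizes.

```python
import numpy as np

p_class = {"c1": 0.6, "c2": 0.4}
# P(X_i = 1 | C) for three hypothetical binary evidence variables
p_x_true = {"c1": [0.8, 0.1, 0.7],
            "c2": [0.3, 0.6, 0.2]}

def posterior(x):
    """Return P(C | x_1,...,x_n) for a vector of 0/1 observations."""
    scores = {}
    for c, prior in p_class.items():
        likelihood = np.prod([p if xi else 1 - p
                              for p, xi in zip(p_x_true[c], x)])
        scores[c] = prior * likelihood          # P(C) * prod_i P(x_i | C)
    z = sum(scores.values())                    # normalize over classes
    return {c: s / z for c, s in scores.items()}

print(posterior([1, 0, 1]))    # strongly favors c1 under these made-up numbers
```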

Bayesian Network (Informal)

Directed acyclic graph G

Nodes represent random variables

Edges represent direct influences between random
variables

Local probability models
[Figure: Example 1: I → S;  Example 2: I → S, I → G;  Naïve Bayes: C → X_1, C → X_2, …, C → X_n]

Bayesian Network (Informal)

Represent a joint distribution

Specifies the probability for P(X=x)

Specifies the probability for P(X=x|E=e)

Allows for reasoning patterns

Prediction (e.g., intelligent → high scores)

Explanation (e.g., low score → not intelligent)

Explaining away (different causes for same effect interact)

[Figure: Example 2: I → S, I → G]

Bayesian Network Structure

Directed acyclic graph G

Nodes X_1,…,X_n represent random variables

G encodes local Markov assumptions:

X_i is independent of its non-descendants given its parents

Formally: (X_i ⊥ NonDesc(X_i) | Pa(X_i))

[Figure: example DAG over nodes A, B, C, D, E, F, G; here E ⊥ {A, C, D, F} | B]

Independency Mappings (I-Maps)

Let P be a distribution over X

Let I(P) be the independencies (X ⊥ Y | Z) in P

A Bayesian network structure G is an I-map (independency mapping) of P if I(G) ⊆ I(P)

Example 1 (no edge between I and S, so I(G) = {I ⊥ S}):
  I   S   P(I,S)
  i0  s0  0.25
  i0  s1  0.25
  i1  s0  0.25
  i1  s1  0.25
  I(P) = {I ⊥ S}

Example 2 (edge between I and S, so I(G) = ∅):
  I   S   P(I,S)
  i0  s0  0.4
  i0  s1  0.3
  i1  s0  0.2
  i1  s1  0.1
  I(P) = ∅

Factorization Theorem

If G is an I-Map of P, then  P(X_1,…,X_n) = Π_{i=1..n} P(X_i | Pa(X_i))

Proof:

wlog. X_1,…,X_n is an ordering consistent with G

By chain rule:  P(X_1,…,X_n) = Π_{i=1..n} P(X_i | X_1,…,X_{i-1})

From the assumption:  Pa(X_i) ⊆ {X_1,…,X_{i-1}}  and  {X_1,…,X_{i-1}} \ Pa(X_i) ⊆ NonDesc(X_i)

Since G is an I-Map, (X_i ⊥ NonDesc(X_i) | Pa(X_i)) ∈ I(P)

Hence  P(X_i | X_1,…,X_{i-1}) = P(X_i | Pa(X_i))
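A minimal numeric illustration of the theorem on a made-up chain X → Y → Z: build P via the factorization, then verify that the chain-rule term P(z | x, y) collapses to the local CPD P(z | y).

```python
from itertools import product

# Made-up CPDs for a chain X -> Y -> Z over binary variables
p_x = {0: 0.6, 1: 0.4}
p_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

# Joint built from the factorization P(x) P(y|x) P(z|y)
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x, y, z in product([0, 1], repeat=3)}

# Check P(Z=1 | X=x, Y=y) == P(Z=1 | Y=y) for all x, y
for x, y in product([0, 1], repeat=2):
    p_xy = sum(joint[(x, y, z)] for z in [0, 1])
    p_z_given_xy = joint[(x, y, 1)] / p_xy
    print(x, y, round(p_z_given_xy, 3), "==", p_z_given_y[y][1])
```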

Factorization Implies I-Map

If P(X_1,…,X_n) = Π_{i=1..n} P(X_i | Pa(X_i)), then G is an I-Map of P

Proof:

Need to show (X_i ⊥ NonDesc(X_i) | Pa(X_i)) ∈ I(P), or that
P(X_i | NonDesc(X_i)) = P(X_i | Pa(X_i))

wlog. X_1,…,X_n is an ordering consistent with G in which the non-descendants of X_i precede it, so NonDesc(X_i) = {X_1,…,X_{i-1}}

P(X_i | NonDesc(X_i)) = P(X_i, NonDesc(X_i)) / P(NonDesc(X_i))
                      = Π_{k=1..i} P(X_k | Pa(X_k)) / Π_{k=1..i-1} P(X_k | Pa(X_k))
                      = P(X_i | Pa(X_i))
Bayesian Network Definition

A Bayesian network is a pair (G,P)

P factorizes over G

P is specified as a set of CPDs associated with G’s nodes

Parameters

Joint distribution: 2^n

Bayesian network (bounded in-degree k): n·2^k
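A one-off comparison of the two parameter counts above for binary variables, using illustrative values of n and k.

```python
# Full joint over n binary variables vs. a Bayesian network whose nodes
# each have at most k parents (so each CPD has at most 2^k rows).
n, k = 37, 4                                # e.g. an ALARM-sized network with small in-degree
print("full joint:", 2 ** n)                # 137438953472
print("BN (in-degree <= k):", n * 2 ** k)   # 592
```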

Bayesian Network Design

Variable considerations

Clarity test: can an omniscient being determine its
value?

Hidden variables?

Irrelevant variables

Structure considerations

Causal order of variables

Which independencies (approximately) hold?

Probability considerations

Zero probabilities

Orders of magnitude

Relative values