Probability Review for Beginners

Probability Review
Thursday Sep 13

Probability Review
•Events and Event spaces
•Random variables
•Joint probability distributions
•Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
•Structural properties
•Independence, conditional independence
•Mean and Variance
•The big picture
•Examples

Sample space and Events
•Ω: Sample space, the set of possible results of an experiment
•If you toss a coin twice, Ω = {HH, HT, TH, TT}
•Event: a subset of Ω
•First toss is head = {HH, HT}
•S: event space, a set of events:
•Closed under finite union and complements
•Entails other binary operations: union, difference, etc.
•Contains the empty event and Ω

Probability Measure
•Defined over (Ω, S) s.t.
•P(a) ≥ 0 for all a in S
•P(Ω) = 1
•If a, b are disjoint, then
•P(a ∪ b) = P(a) + P(b)
•We can deduce other properties from the above axioms
•Ex: P(a ∪ b) for non-disjoint events:
P(a ∪ b) = P(a) + P(b) – P(a ∩ b)

Visualization
•We can go on and define conditional
probability, using the above visualization

Conditional Probability
P(F|H) = Fraction of worlds in which H is true that also
have F true
P(F | H) = P(F ∩ H) / P(H)

Rule of total probability
[Figure: the sample space partitioned into B1, …, B7, with event A overlapping the partition]
P(A) = Σ_i P(B_i) P(A | B_i)

From Events to Random Variable
•Almost all semester we will be dealing with RVs
•Concise way of specifying attributes of outcomes
•Modeling students (Grade and Intelligence):
•Ω = all possible students
•What are the events?
•Grade_A = all students with grade A
•Grade_B = all students with grade B
•Intelligence_High = … with high intelligence
•Very cumbersome
•We need “functions” that map from Ω to an
attribute space
•P(G = A) = P({student ∈ Ω : G(student) = A})

Random Variables
[Figure: Ω mapped by two random variables — I: Intelligence into {high, low} and G: Grade into {A+, A, B}]
P(I = high) = P({all students whose intelligence is high})

Discrete Random Variables
•Random variables (RVs) which may take on
only a countable number of distinct values
–E.g. the total number of tails X you get if you flip
100 coins
•X is an RV with arity k if it can take on exactly
one value out of {x_1, …, x_k}
–E.g. the possible values that X can take on are 0, 1,
2, …, 100

Probability of Discrete RV
•Probability mass function (pmf): P(X = x_i)
•Easy facts about pmfs:
Σ_i P(X = x_i) = 1
P(X = x_i ∩ X = x_j) = 0 if i ≠ j
P(X = x_i ∪ X = x_j) = P(X = x_i) + P(X = x_j) if i ≠ j
P(X = x_1 ∪ X = x_2 ∪ … ∪ X = x_k) = 1

Common Distributions
•Uniform: X ~ U[1, …, N]
X takes values 1, 2, …, N
P(X = i) = 1/N
E.g. picking balls of different colors from a box
•Binomial: X ~ Bin(n, p)
X takes values 0, 1, …, n
E.g. coin flips
P(X = i) = C(n, i) p^i (1 − p)^(n − i)

Continuous Random Variables
•Probability density function (pdf) instead of
probability mass function (pmf)
•A pdf is any function f(x) that describes the
probability density in terms of the input
variable x.

Probability of Continuous RV
•Properties of pdf:
f(x) ≥ 0, ∀x
∫ f(x) dx = 1
•Actual probability can be obtained by taking
the integral of the pdf
E.g. the probability of X being between 0 and 1 is
P(0 ≤ X ≤ 1) = ∫_0^1 f(x) dx

Cumulative Distribution Function
•F_X(v) = P(X ≤ v)
•Discrete RVs:
F_X(v) = Σ_{v_i ≤ v} P(X = v_i)
•Continuous RVs:
F_X(v) = ∫_{−∞}^{v} f(x) dx
d/dx F_X(x) = f(x)
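A small numerical sketch of the pdf/CDF relationship (an addition, using the hypothetical pdf f(x) = 2x on [0, 1]):

# Toy continuous RV supported on [0, 1] with pdf f(x) = 2x.
def f(x):
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def F(v, steps=100000):
    # CDF approximated by a simple Riemann sum of the pdf from 0 up to v.
    if v <= 0:
        return 0.0
    v = min(v, 1.0)
    dx = v / steps
    return sum(f(i * dx) * dx for i in range(steps))

print(F(1.0))   # total probability, ~1.0
print(F(0.5))   # P(X <= 0.5), ~0.25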

Common Distributions
•Normal: X ~ N(μ, σ²)
E.g. the height of the entire population
f(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²))

Multivariate Normal
•Generalization to higher dimensions of the
one-dimensional normal
f_X(x_1, …, x_d) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − μ)^T Σ^{−1} (x − μ))
Σ: covariance matrix, μ: mean
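A minimal sketch (not part of the slides, assuming NumPy is available) evaluating this density directly from the formula; the mean and covariance below are made-up values:

import numpy as np

def mvn_pdf(x, mu, Sigma):
    # f(x) = (2*pi)^(-d/2) * |Sigma|^(-1/2) * exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))
    d = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** -0.5
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.0, 0.0]), mu, Sigma))  # density evaluated at the mean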

Probability Review
•Events and Event spaces
•Random variables
•Joint probability distributions
•Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
•Structural properties
•Independence, conditional independence
•Mean and Variance
•The big picture
•Examples

Joint Probability Distribution
•Random variables encode attributes
•Not all possible combinations of attributes are equally
likely
•Joint probability distributions quantify this
•P(X = x, Y = y) = P(x, y)
•Generalizes to N RVs
•Discrete: Σ_x Σ_y P(X = x, Y = y) = 1
•Continuous: ∫_x ∫_y f_{X,Y}(x, y) dx dy = 1

Chain Rule
•Always true
•P(x, y, z) = p(x) p(y|x) p(z|x, y)
= p(z) p(y|z) p(x|y, z)
=…

Conditional Probability )
 )
 )
P X Y
P X Y
PY
xy
xy
y
=  =
= = =
= )
)(
),(
|
yp
yxp
yxP =
But we will always write it this way:
events

Marginalization
•We know P(X, Y); what is P(X = x)?
•We can use the law of total probability. Why?
p(x) = Σ_y P(x, y)
     = Σ_y P(y) P(x | y)
[Figure: the sample space partitioned into B1, …, B7, with event A overlapping the partition]
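A small sketch of marginalizing a discrete joint distribution (added here; the joint table is made up):

# Hypothetical joint distribution P(X, Y) over X in {0, 1}, Y in {0, 1, 2}.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20,
}

# Marginal: p(x) = sum_y P(x, y)
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

print(p_x)  # {0: 0.4, 1: 0.6}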

Marginalization Cont.
•Another example)  )
) )

=
=
yz
zy
zyxPzyP
zyxPxp
,
,
,|,
,,

Bayes Rule
•We know that P(rain) = 0.5
•If we also know that the grass is wet, how does this
affect our belief about whether it rained or not?
P(rain | wet) = P(rain) P(wet | rain) / P(wet)
P(x | y) = P(x) P(y | x) / P(y)

Bayes Rule cont.
•You can condition on more variables:
P(x | y, z) = P(x | z) P(y | x, z) / P(y | z)
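A small worked sketch of the rain/wet-grass example (an addition; the likelihoods P(wet | rain) and P(wet | no rain) below are made up for illustration):

p_rain = 0.5                  # prior, as on the slide
p_wet_given_rain = 0.9        # hypothetical likelihoods
p_wet_given_no_rain = 0.2

# Law of total probability for the evidence term P(wet).
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)

# Bayes rule: P(rain | wet) = P(rain) * P(wet | rain) / P(wet)
p_rain_given_wet = p_rain * p_wet_given_rain / p_wet
print(p_rain_given_wet)  # ~0.818: seeing wet grass raises our belief in rain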

Probability Review
•Events and Event spaces
•Random variables
•Joint probability distributions
•Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
•Structural properties
•Independence, conditional independence
•Mean and Variance
•The big picture
•Examples

Independence
•X is independent of Y means that knowing Y
does not change our belief about X.
•P(X|Y=y) = P(X)
•P(X=x, Y=y) = P(X=x) P(Y=y)
•The above should hold for all x, y
•It is symmetric and written as X ⊥ Y

Independence
•X_1, …, X_n are independent if and only if
P(X_1 ∈ A_1, …, X_n ∈ A_n) = Π_{i=1}^{n} P(X_i ∈ A_i)
•If X_1, …, X_n are independent and identically
distributed we say they are iid (or that they
are a random sample) and we write
X_1, …, X_n ~ P

CI: Conditional Independence
•RVs are rarely independent, but we can still
leverage local structural properties like
Conditional Independence.
•X ⊥ Y | Z if, once Z is observed, knowing the
value of Y does not change our belief about X
•P(rain ⊥ sprinkler’s on | cloudy)
•P(rain ⊥ sprinkler’s on | wet grass)

Conditional Independence
•P(X=x | Z=z, Y=y) = P(X=x | Z=z)
•P(Y=y | Z=z, X=x) = P(Y=y | Z=z)
•P(X=x, Y=y | Z=z) = P(X=x| Z=z) P(Y=y| Z=z)
We call these factors: a very useful concept!
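A minimal numeric check (an addition, with a made-up table) that a joint satisfying X ⊥ Y | Z obeys the factorization P(x, y | z) = P(x | z) P(y | z):

# Build a hypothetical P(X, Y, Z) that is conditionally independent by construction:
# P(x, y, z) = P(z) * P(x | z) * P(y | z).
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

# Check the factor form P(x, y | z) = P(x | z) * P(y | z) for every assignment.
for (x, y, z), p in joint.items():
    p_xy_given_z = p / p_z[z]
    assert abs(p_xy_given_z - p_x_given_z[z][x] * p_y_given_z[z][y]) < 1e-12
print("X is conditionally independent of Y given Z for this table")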

Probability Review
•Events and Event spaces
•Random variables
•Joint probability distributions
•Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
•Structural properties
•Independence, conditional independence
•Mean and Variance
•The big picture
•Examples

Mean and Variance
•Mean (Expectation): μ = E(X)
–Discrete RVs: E(X) = Σ_{v_i} v_i P(X = v_i)
–Continuous RVs: E(X) = ∫_{−∞}^{∞} x f(x) dx
•Expectation of a function g of X:
–Discrete RVs: E(g(X)) = Σ_{v_i} g(v_i) P(X = v_i)
–Continuous RVs: E(g(X)) = ∫_{−∞}^{∞} g(x) f(x) dx

Mean and Variance
•Variance: σ² = Var(X)
–Discrete RVs: Var(X) = Σ_{v_i} (v_i − μ)² P(X = v_i)
–Continuous RVs: Var(X) = ∫_{−∞}^{∞} (x − μ)² f(x) dx
Var(X) = E((X − μ)²) = E(X²) − μ²
•Covariance:
Cov(X, Y) = E((X − μ_x)(Y − μ_y)) = E(XY) − μ_x μ_y

Mean and Variance
•Correlation: ρ(X, Y) = Cov(X, Y) / (σ_x σ_y)
•−1 ≤ ρ(X, Y) ≤ 1

Properties
•Mean
–E(X + Y) = E(X) + E(Y)
–E(aX) = a E(X)
–If X and Y are independent, E(XY) = E(X) E(Y)
•Variance
–Var(aX + b) = a² Var(X)
–If X and Y are independent, Var(X + Y) = Var(X) + Var(Y)

Some more properties
•The conditional expectation of Y given X when
the value of X = x is:
E(Y | X = x) = ∫ y p(y | x) dy
•The Law of Total Expectation or Law of
Iterated Expectation:
E(Y) = E_X(E(Y | X)) = ∫ E(Y | X = x) p(x) dx

Some more properties
•The Law of Total Variance:
Var(Y) = Var(E(Y | X)) + E(Var(Y | X))
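A quick simulation sketch (not in the original slides, assuming NumPy) checking both laws on a toy model where X ~ Bernoulli(0.3) and Y | X = x ~ N(x, 1):

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.binomial(1, 0.3, size=n)      # X ~ Bernoulli(0.3)
y = rng.normal(loc=x, scale=1.0)      # Y | X = x  ~  N(x, 1)

# Conditional summaries estimated from the sample, one per value of X.
cond_mean = np.array([y[x == v].mean() for v in (0, 1)])
cond_var = np.array([y[x == v].var() for v in (0, 1)])
p_x = np.array([(x == v).mean() for v in (0, 1)])

# Law of total expectation: E[Y] = E_X[ E[Y | X] ]
print(y.mean(), (p_x * cond_mean).sum())             # both ~0.3

# Law of total variance: Var(Y) = Var(E[Y|X]) + E[Var(Y|X)]
var_of_cond_mean = (p_x * (cond_mean - (p_x * cond_mean).sum())**2).sum()
mean_of_cond_var = (p_x * cond_var).sum()
print(y.var(), var_of_cond_mean + mean_of_cond_var)  # both ~1.21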

Probability Review
•Events and Event spaces
•Random variables
•Joint probability distributions
•Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
•Structural properties
•Independence, conditional independence
•Mean and Variance
•The big picture
•Examples

The Big Picture
[Diagram: Model → Data via Probability; Data → Model via Estimation/learning]

Statistical Inference
•Given observations from a model
–What (conditional) independence assumptions
hold?
•Structure learning
–If you know the family of the model (e.g.,
multinomial), what are the values of the
parameters: MLE, Bayesian estimation
•Parameter learning

Probability Review
•Events and Event spaces
•Random variables
•Joint probability distributions
•Marginalization, conditioning, chain rule,
Bayes Rule, law of total probability, etc.
•Structural properties
•Independence, conditional independence
•Mean and Variance
•The big picture
•Examples

Monty Hall Problem
•You're given the choice of three doors: Behind one
door is a car; behind the others, goats.
•You pick a door, say No. 1
•The host, who knows what's behind the doors, opens
another door, say No. 3, which has a goat.
•Do you want to pick door No. 2 instead?

[Diagram: depending on where the car is, the host must reveal Goat B, must reveal Goat A, or may reveal either Goat A or Goat B]

Monty Hall Problem: Bayes Rule
•C_i: the car is behind door i, i = 1, 2, 3
•H_ij: the host opens door j after you pick door i
P(C_i) = 1/3
P(H_ij | C_k) = 0    if i = j (the host never opens your door)
             = 0    if j = k (the host never reveals the car)
             = 1/2  if i = k
             = 1    if i ≠ k and j ≠ k
Monty Hall Problem: Bayes Rule cont.
•WLOG, i = 1, j = 3
P(C_1 | H_13) = P(H_13 | C_1) P(C_1) / P(H_13)
P(H_13 | C_1) P(C_1) = 1/2 × 1/3 = 1/6

Monty Hall Problem: Bayes Rule cont.) ) ) )
 )) ))
13 13 1 13 2 13 3
13 1 1 13 2 2
, , ,
11
1
63
1
2
P H P H C P H C P H C
P H C P C P H C P C
=  
=
=  
=  )1 13
1 6 1
1 2 3
P C H ==

Monty Hall Problem: Bayes Rule cont. )1 13
1 6 1
1 2 3
P C H ==


You should switch! )  )2 13 1 13
12
1
33
P C H P C H=  = 
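A short simulation sketch (an addition) that reproduces the 1/3 vs 2/3 split with the standard library:

import random

def monty_hall_trial(switch):
    car = random.randrange(3)
    pick = random.randrange(3)
    # Host opens a door that is neither the player's pick nor the car.
    host = random.choice([d for d in range(3) if d != pick and d != car])
    if switch:
        pick = next(d for d in range(3) if d != pick and d != host)
    return pick == car

n = 100_000
print(sum(monty_hall_trial(False) for _ in range(n)) / n)  # ~1/3 if you stay
print(sum(monty_hall_trial(True) for _ in range(n)) / n)   # ~2/3 if you switch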

Information Theory
•P(X) encodes our uncertainty about X
•Some variables are more uncertain than others
•How can we quantify this intuition?
•Entropy: average number of bits required to encode X
[Figure: a broad distribution P(X) next to a peaked distribution P(Y); X is more uncertain than Y]
H_P(X) = E_P[log 1/p(x)] = Σ_x P(x) log 1/P(x) = −Σ_x P(x) log P(x)
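A minimal sketch of the entropy formula (an addition, with two hypothetical distributions, showing that the broader one has higher entropy):

from math import log2

def entropy(pmf):
    # H(X) = -sum_x P(x) log2 P(x); terms with P(x) = 0 contribute 0.
    return -sum(p * log2(p) for p in pmf if p > 0)

p_x = [0.25, 0.25, 0.25, 0.25]   # broad: maximally uncertain over 4 values
p_y = [0.85, 0.05, 0.05, 0.05]   # peaked: much less uncertain

print(entropy(p_x))  # 2.0 bits
print(entropy(p_y))  # ~0.85 bits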

Information Theory cont.
•Entropy: average number of bits required to encode X
H_P(X) = E_P[log 1/p(x)] = Σ_x P(x) log 1/P(x) = −Σ_x P(x) log P(x)
•We can define conditional entropy similarly:
H_P(X | Y) = E_P[log 1/p(x|y)] = H_P(X, Y) − H_P(Y)
•i.e. once Y is known, we only need H_P(X, Y) − H_P(Y) bits
•We can also define a chain rule for entropies (not surprising):
H_P(X, Y, Z) = H_P(X) + H_P(Y | X) + H_P(Z | X, Y)

Mutual Information: MI
•Remember independence?
•If X ⊥ Y then knowing Y won’t change our belief about X
•Mutual information can help quantify this! (not the only
way though)
•MI: I_P(X; Y) = H_P(X) − H_P(X | Y)
I(X; Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]
•“The amount of uncertainty in X which is removed by
knowing Y”
•Symmetric
•I(X; Y) = 0 iff X and Y are independent!
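A small sketch (an addition, with a made-up joint table) computing I(X; Y) from the definition:

from math import log2

# Hypothetical joint P(X, Y); keys are (x, y) pairs.
joint = {(0, 0): 0.3, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.5}

p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X; Y) = sum_{x,y} p(x, y) log2[ p(x, y) / (p(x) p(y)) ]
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)
print(mi)  # > 0 here, because this X and Y are not independent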

Chi Square Test for Independence
(Example)
            Republican  Democrat  Independent  Total
Male               200       150           50    400
Female             250       300           50    600
Total              450       450          100   1000
•State the hypotheses
H_0: Gender and voting preferences are independent.
H_a: Gender and voting preferences are not independent.
•Choose significance level
Say, 0.05

Chi Square Test for Independence
•Analyze sample data
•Degrees of freedom =
(|g| − 1) × (|v| − 1) = (2 − 1) × (3 − 1) = 2
•Expected frequency count:
E_{g,v} = (n_g × n_v) / n
E_{m,r} = (400 × 450) / 1000 = 180000/1000 = 180
E_{m,d} = (400 × 450) / 1000 = 180000/1000 = 180
E_{m,i} = (400 × 100) / 1000 = 40000/1000 = 40
E_{f,r} = (600 × 450) / 1000 = 270000/1000 = 270
E_{f,d} = (600 × 450) / 1000 = 270000/1000 = 270
E_{f,i} = (600 × 100) / 1000 = 60000/1000 = 60
            Republican  Democrat  Independent  Total
Male               200       150           50    400
Female             250       300           50    600
Total              450       450          100   1000

Chi Square Test for Independence
•Chi-square test statistic:
Χ² = Σ_{g,v} (O_{g,v} − E_{g,v})² / E_{g,v}
•Χ² = (200 − 180)²/180 + (150 − 180)²/180 + (50 − 40)²/40 +
(250 − 270)²/270 + (300 − 270)²/270 + (50 − 60)²/60
•Χ² = 400/180 + 900/180 + 100/40 + 400/270 + 900/270 +
100/60
•Χ² = 2.22 + 5.00 + 2.50 + 1.48 + 3.33 + 1.67 = 16.2
            Republican  Democrat  Independent  Total
Male               200       150           50    400
Female             250       300           50    600
Total              450       450          100   1000
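The same computation can be checked with SciPy (a sketch added here, assuming SciPy and NumPy are installed):

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = (Male, Female), columns = (Republican, Democrat, Independent).
observed = np.array([[200, 150, 50],
                     [250, 300, 50]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof)      # ~16.2 with 2 degrees of freedom
print(p_value)        # ~0.0003
print(expected)       # matches the E_{g,v} table above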

Chi Square Test for Independence
•P-value
–Probability of observing a sample statistic as
extreme as the test statistic
–P(Χ² ≥ 16.2) = 0.0003
•Since the P-value (0.0003) is less than the
significance level (0.05), we reject the
null hypothesis
•There is a relationship between gender and
voting preference

Acknowledgment
•Carlos Guestrin recitation slides:
http://www.cs.cmu.edu/~guestrin/Class/10708/recitations/r1/Probability_and_Statistics_Review.ppt
•Andrew Moore tutorial:
http://www.autonlab.org/tutorials/prob.html
•Monty Hall problem:
http://en.wikipedia.org/wiki/Monty_Hall_problem
•http://www.cs.cmu.edu/~guestrin/Class/10701-F07/recitation_schedule.html
•Chi-square test for independence:
http://stattrek.com/chi-square-test/independence.aspx