Machine Learning
Bayesian Decision Theory
1. Introduction – Bayesian Decision Theory (pure statistics, probabilities known, optimal decision)
2. Bayesian Decision Theory – Continuous Features
3. Minimum-Error-Rate Classification
4. Classifiers, Discriminant Functions, Decision Surfaces
5. The Normal Density
6. Discriminant Functions for the Normal Density
2 1. Introduction: the sea bass/salmon example. The state of nature ω is a random variable with a prior. If the catch of salmon and sea bass is equiprobable, then P(ω1) = P(ω2) (uniform priors), and P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity).
3 Decision rule with only the prior information: decide ω1 if P(ω1) > P(ω2), otherwise decide ω2. Use of the class-conditional information: p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.
5 Bayes formula. Suppose we know the prior probabilities P(ωj) and the conditional densities p(x | ωj) for j = 1, 2, and the lightness of a fish has value x. The joint probability density of finding a pattern that is in category ωj and has feature value x is: p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj).
6 Posterior, likelihood, evidence. Bayes rule: P(ωj | x) = p(x | ωj) P(ωj) / p(x), where in the case of two categories the evidence is p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj). In words: posterior = (likelihood × prior) / evidence.
8 Decision given the posterior probabilities. For an observation x: decide ω1 if P(ω1 | x) > P(ω2 | x); decide ω2 if P(ω1 | x) < P(ω2 | x). This rule minimizes the average probability of error, since P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]. Equivalently, because the evidence p(x) is common to both posteriors: decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2).
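Not part of the original slides: a minimal Python/NumPy sketch of this two-category decision rule, assuming (for illustration only) Gaussian class-conditional densities for lightness and made-up priors.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities for the lightness feature x:
# p(x | w1) for sea bass and p(x | w2) for salmon (parameters are made up).
priors = {"sea_bass": 0.5, "salmon": 0.5}          # P(w1), P(w2)
likelihoods = {
    "sea_bass": norm(loc=7.0, scale=1.0),          # p(x | w1)
    "salmon":   norm(loc=4.0, scale=1.2),          # p(x | w2)
}

def decide(x):
    """Decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise w2."""
    scores = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(scores.values())                # p(x), used to report posteriors
    posteriors = {w: s / evidence for w, s in scores.items()}
    label = max(scores, key=scores.get)
    p_error = min(posteriors.values())             # P(error | x) = min posterior
    return label, posteriors, p_error

print(decide(5.0))
print(decide(6.8))
```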
9 2. Bayesian Decision Theory – Continuous Features. Generalization of the preceding ideas:
- use more than one feature (a feature vector x);
- use more than two states of nature (c classes);
- allow actions other than deciding the state of nature;
- introduce a loss function that is more general than the probability of error.
10 Allowing actions other than classification primarily allows the possibility of rejection, i.e., abstention: don't make a decision if the alternatives are too close. This must be tempered by the cost of indecision. The loss function states how costly each action is.
11 Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories"). Let {α1, α2, …, αa} be the set of a possible actions. Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.
12 Conditional risk: R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) — the risk of taking action αi for the specific observation x. Overall risk: R = ∫ R(α(x) | x) p(x) dx, the expected loss associated with a decision rule α(x). Minimizing R is achieved by minimizing the conditional risk R(αi | x) over i = 1,…,a for every x.
13 Select the action αi for which R(αi | x) is minimum. The overall risk R is then minimized, and this minimum value is called the Bayes risk: the best performance that can be achieved!
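A sketch (not from the slides) of this general minimum-risk rule: given a loss matrix λ(αi | ωj) and the posteriors P(ωj | x), compute each conditional risk R(αi | x) and pick the smallest. The loss values and posteriors below are illustrative; the third action shows the rejection option mentioned on slide 10.

```python
import numpy as np

# Rows: actions a_i, columns: states of nature w_j; lam[i, j] = loss(a_i | w_j).
lam = np.array([[0.0, 2.0],    # a1: decide w1
                [1.0, 0.0],    # a2: decide w2
                [0.3, 0.3]])   # a3: reject (illustrative constant cost)

def bayes_action(posteriors):
    """Return the index of the action minimizing R(a_i | x) = sum_j lam[i,j] P(w_j | x)."""
    posteriors = np.asarray(posteriors)
    cond_risk = lam @ posteriors          # conditional risk of every action
    return int(np.argmin(cond_risk)), cond_risk

action, risks = bayes_action([0.55, 0.45])
print(action, risks)   # the reject action wins when the posteriors are too close
```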
14 Two-category classification. α1: deciding ω1; α2: deciding ω2. λij = λ(αi | ωj) is the loss incurred for deciding ωi when the true state of nature is ωj. Conditional risks: R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x); R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x).
15 Our rule is the following: if R(α1 | x) < R(α2 | x), take action α1 ("decide ω1"). Substituting the definition of R(·): decide ω1 if λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x), and decide ω2 otherwise.
16 We can rewrite λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x) as (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x).
17 Finally, using Bayes formula we can rewrite (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x) in terms of priors and likelihoods: decide ω1 if (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2), and decide ω2 otherwise.
18 If λ21 > λ11, we can express the rule as a likelihood-ratio test: take action α1 (decide ω1) if p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1); otherwise take action α2 (decide ω2).
19 Optimal decision property “If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions”
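A small sketch (not from the slides) of the likelihood-ratio form of the rule: the threshold θ is built from the losses and priors and does not depend on x. The losses, priors, and class-conditional densities below are illustrative assumptions.

```python
from scipy.stats import norm

# Illustrative losses lam_ij (loss for deciding w_i when the truth is w_j) and priors.
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 0.6, 0.4
p1, p2 = norm(7.0, 1.0), norm(4.0, 1.2)    # assumed class-conditional densities

# Threshold independent of x: theta = (lam12 - lam22)/(lam21 - lam11) * P2/P1
theta = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)          # likelihood ratio p(x|w1)/p(x|w2)
    return "w1" if ratio > theta else "w2"

print(theta)                  # fixed threshold
print(decide(5.5), decide(6.5))
```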
20 3. Minimum-Error-Rate Classification. Actions are decisions on classes: if action αi is taken and the true state of nature is ωj, the decision is correct if i = j and in error if i ≠ j. We seek a decision rule that minimizes the probability of error, i.e., the error rate.
21 Introduce the zero-one loss function: λ(αi | ωj) = 0 if i = j and 1 if i ≠ j, for i, j = 1,…,c. The conditional risk is then R(αi | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x). The risk corresponding to this loss function is the average probability of error.
22 Minimizing the risk requires maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x). For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i.
23 Decision regions and the zero-one loss function. With the likelihood-ratio rule, decide ω1 if p(x | ω1) / p(x | ω2) > θλ, where θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1). If λ is the zero-one loss function, the threshold reduces to θ = P(ω2) / P(ω1).
25 4. Classifiers, Discriminant Functions, and Decision Surfaces. The multi-category case: use a set of discriminant functions gi(x), i = 1,…,c. The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.
26 Discriminant functions for minimum error rate: gi(x) = P(ωi | x) (maximum discrimination corresponds to maximum posterior!). Equivalent choices: gi(x) = p(x | ωi) P(ωi), or gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural logarithm).
28 The feature space is divided into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi). The two-category case: a classifier is a "dichotomizer" with two discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
29 The computation of g(x): g(x) = P(ω1 | x) − P(ω2 | x), or equivalently g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)].
31 5. The Normal Density. The univariate density is analytically tractable and continuous, and many processes are asymptotically Gaussian: handwritten characters and speech sounds can be viewed as an ideal prototype corrupted by a large number of random processes (central limit theorem). p(x) = (1 / (√(2π) σ)) exp[ −½ ((x − μ)/σ)² ], where μ = mean (or expected value) of x and σ² = expected squared deviation, i.e., the variance.
33 Multivariate normal density: p(x) ~ N(μ, Σ). In d dimensions: p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[ −½ (x − μ)^t Σ^{−1} (x − μ) ], where x = (x1, x2, …, xd)^t (t stands for the transpose), μ = (μ1, μ2, …, μd)^t is the mean vector, Σ is the d×d covariance matrix, and |Σ| and Σ^{−1} are its determinant and inverse, respectively.
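A sketch (not from the slides) that evaluates this d-dimensional density directly from the formula with NumPy; the mean and covariance below are illustrative.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """p(x) = exp(-0.5 (x-mu)^t Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^t Sigma^-1 (x-mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf([1.5, 1.5], mu, Sigma))
```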
35 Properties of the covariance matrix Σ of a normal density: it is always symmetric; always positive semi-definite, but for our purposes positive definite (so its determinant is positive and it is invertible); and its eigenvalues are real and positive.
36 Linear combinations of normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^t x, then p(y) ~ N(A^t μ, A^t Σ A). Whitening transform: A_w = Φ Λ^{−1/2}, where Φ is the matrix of orthonormal eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues; then A_w^t Σ A_w = I.
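A sketch (not from the slides) of the whitening transform A_w = Φ Λ^{−1/2} built from the eigendecomposition of Σ; it checks that the whitened data have (approximately) identity covariance. The covariance and mean below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=10_000)   # samples of x ~ N(mu, Sigma)

# Eigendecomposition Sigma = Phi Lambda Phi^t (eigh is for symmetric matrices).
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)                  # whitening matrix Phi Lambda^{-1/2}

Y = (X - mu) @ A_w                                    # y = A_w^t (x - mu), row-vector form
print(np.cov(Y, rowvar=False))                        # ~ identity matrix
print(A_w.T @ Sigma @ A_w)                            # identity up to round-off
```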
38 The squared Mahalanobis distance from x to μ: r² = (x − μ)^t Σ^{−1} (x − μ). Points of constant density lie on hyperellipsoids of constant Mahalanobis distance from μ.
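A small sketch (not from the slides) computing r² = (x − μ)^t Σ^{−1} (x − μ) with NumPy; the values are illustrative and show how the covariance re-weights each direction.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^t Sigma^-1 (x - mu)."""
    diff = np.asarray(x, float) - np.asarray(mu, float)
    return float(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
# Same Euclidean distance from mu, different Mahalanobis distances:
print(mahalanobis_sq([2.0, 0.0], mu, Sigma))   # 1.0 (high-variance axis)
print(mahalanobis_sq([0.0, 2.0], mu, Sigma))   # 4.0 (low-variance axis)
```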
40 Entropy: H(p(x)) = −∫ p(x) ln p(x) dx measures the fundamental uncertainty in the values of points selected randomly from a distribution. It is measured in nats (natural logarithm) or bits (log base 2). The normal distribution has the maximum entropy of all distributions having a given mean and variance.
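Not on the slide, but a short worked derivation (in LaTeX) of the differential entropy of a univariate normal, which makes the maximum-entropy remark concrete:

```latex
H(p) = -\int p(x)\,\ln p(x)\,dx, \qquad
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]

\ln p(x) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}

H(p) = \tfrac{1}{2}\ln(2\pi\sigma^2) + \frac{\mathbb{E}\!\left[(x-\mu)^2\right]}{2\sigma^2}
     = \tfrac{1}{2}\ln(2\pi\sigma^2) + \tfrac{1}{2}
     = \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^2\right)\ \text{nats}
```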
41 6. Discriminant Functions for the Normal Density. We saw that minimum-error-rate classification can be achieved with the discriminant function gi(x) = ln p(x | ωi) + ln P(ωi). For the multivariate normal case p(x | ωi) ~ N(μi, Σi) this gives gi(x) = −½ (x − μi)^t Σi^{−1} (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi).
42 Case 1: Σi = σ²I (I stands for the identity matrix). What does Σi = σ²I say about the dimensions? The features are statistically independent, and each dimension has the same variance σ².
43 Case Σi = σ²I (continued). The discriminant reduces to gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi). We can further simplify by recognizing that the quadratic term x^t x implicit in the Euclidean norm is the same for all i; dropping it yields the linear form gi(x) = wi^t x + wi0, with wi = μi/σ² and wi0 = −μi^t μi / (2σ²) + ln P(ωi).
44 A classifier that uses linear discriminant functions is called a "linear machine". The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x), which can be written as w^t (x − x0) = 0. If the priors are equal for all classes, this reduces to the minimum-distance classifier, where an unknown x is assigned to the class of the nearest mean.
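A sketch (not from the slides) of the linear machine for the Σi = σ²I case, using gi(x) = wi^t x + wi0 with wi = μi/σ² and wi0 = −μi^t μi/(2σ²) + ln P(ωi); the means, variance, and priors are illustrative.

```python
import numpy as np

sigma2 = 1.0                                        # shared variance sigma^2
means = np.array([[0.0, 0.0],                       # mu_1
                  [3.0, 0.0],                       # mu_2
                  [0.0, 3.0]])                      # mu_3
priors = np.array([0.5, 0.25, 0.25])                # P(w_i)

W = means / sigma2                                  # w_i = mu_i / sigma^2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    """Assign x to the class with the largest linear discriminant g_i(x)."""
    g = W @ np.asarray(x, float) + w0
    return int(np.argmax(g)), g

print(classify([1.4, 1.4]))   # with equal priors this would be the nearest-mean rule
```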
45 The hyperplane separating Ri and Rj is always orthogonal to the line linking the means!
49 Case 2: Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!). The discriminant expands into a sum whose quadratic term x^t Σ^{−1} x is independent of i. After this term is dropped, the resulting discriminant function is again linear: gi(x) = wi^t x + wi0, with wi = Σ^{−1} μi and wi0 = −½ μi^t Σ^{−1} μi + ln P(ωi).
50 The resulting decision boundaries are again hyperplanes. If Ri and Rj are contiguous, the boundary between them has the equation w^t (x − x0) = 0, where w = Σ^{−1} (μi − μj).
51 Case 2: Σi = Σ (continued). The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! If the priors are equal, x0 is halfway between the means.
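A sketch (not from the slides) for the shared-covariance case, with gi(x) = wi^t x + wi0, wi = Σ^{−1} μi, and wi0 = −½ μi^t Σ^{−1} μi + ln P(ωi). The means, covariance, and priors are illustrative.

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                      # shared covariance
means = np.array([[0.0, 0.0], [2.0, 2.0]])          # mu_1, mu_2
priors = np.array([0.5, 0.5])

Sigma_inv = np.linalg.inv(Sigma)
W = means @ Sigma_inv                               # rows are w_i^t = mu_i^t Sigma^-1
w0 = -0.5 * np.einsum("ij,jk,ik->i", means, Sigma_inv, means) + np.log(priors)

def classify(x):
    g = W @ np.asarray(x, float) + w0               # g_i(x) = w_i^t x + w_i0
    return int(np.argmax(g)), g

print(classify([1.0, 0.5]))
```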
55 Case 3: Σi arbitrary. The covariance matrices are different for each category, so the discriminant functions are quadratic: gi(x) = x^t Wi x + wi^t x + wi0, with Wi = −½ Σi^{−1}, wi = Σi^{−1} μi, and wi0 = −½ μi^t Σi^{−1} μi − ½ ln |Σi| + ln P(ωi). The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
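A sketch (not from the slides) of this general case, evaluating the quadratic discriminant gi(x) = −½ (x − μi)^t Σi^{−1} (x − μi) − ½ ln |Σi| + ln P(ωi) (the −(d/2) ln 2π term is dropped since it is the same for all classes). The parameters are illustrative.

```python
import numpy as np

classes = [
    # (mean, covariance, prior) for each class -- illustrative numbers
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 1.0], [1.0, 2.0]]), 0.5),
]

def quadratic_discriminants(x):
    x = np.asarray(x, float)
    g = []
    for mu, Sigma, prior in classes:
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^t Sigma^-1 (x-mu)
        _, logdet = np.linalg.slogdet(Sigma)            # ln |Sigma|, numerically stable
        g.append(-0.5 * quad - 0.5 * logdet + np.log(prior))
    return np.array(g)

g = quadratic_discriminants([1.0, 1.5])
print(g, "-> decide class", int(np.argmax(g)))
```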
63 Bayes Decision Theory – Discrete Features. The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm. We are concerned with probabilities rather than probability densities, so Bayes formula becomes P(ωj | x) = P(x | ωj) P(ωj) / P(x), with P(x) = Σ_j P(x | ωj) P(ωj).
64 The conditional risk is defined as before: R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x). The approach is still to minimize risk: select the action αi for which R(αi | x) is minimum.
65 Case of independent binary features in a two-category problem. Let x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2).
66 Assuming conditional independence, P(x | ωi) can be written as a product of component probabilities: P(x | ω1) = Π_{i=1}^{d} pi^{xi} (1 − pi)^{1−xi} and P(x | ω2) = Π_{i=1}^{d} qi^{xi} (1 − qi)^{1−xi}.
67 Taking the likelihood ratio P(x | ω1) / P(x | ω2) = Π_{i=1}^{d} (pi/qi)^{xi} ((1 − pi)/(1 − qi))^{1−xi} and taking logarithms yields a linear discriminant g(x) = Σ_{i=1}^{d} wi xi + w0, with wi = ln[ pi(1 − qi) / (qi(1 − pi)) ] and w0 = Σ_{i=1}^{d} ln[ (1 − pi)/(1 − qi) ] + ln[ P(ω1)/P(ω2) ]; decide ω1 if g(x) > 0.
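A sketch (not from the slides) of this linear discriminant for independent binary features; the pi, qi, and priors below are illustrative.

```python
import numpy as np

p = np.array([0.8, 0.6, 0.3])    # p_i = P(x_i = 1 | w1), illustrative
q = np.array([0.2, 0.5, 0.7])    # q_i = P(x_i = 1 | w2), illustrative
P1, P2 = 0.5, 0.5                # priors P(w1), P(w2)

w = np.log(p * (1 - q) / (q * (1 - p)))                    # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)   # bias term

def decide(x):
    """x is a 0/1 feature vector; decide w1 if g(x) = w.x + w0 > 0."""
    g = float(np.dot(w, np.asarray(x)) + w0)
    return ("w1" if g > 0 else "w2"), g

print(decide([1, 1, 0]))   # features that favor w1
print(decide([0, 0, 1]))   # features that favor w2
```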
71 Belief network structure. A node represents a discrete variable; parent and child nodes indicate direct influence; each node carries a conditional probability table, set by an expert or learned from a training set (learning is not discussed here).
73–76 Examples of belief networks; evidence e (figure slides).
77 Ex. 4. Belief Network for Fish. Nodes: A (season) and B (locale) are parents of X (fish type); X is the parent of C (lightness) and D (thickness). The network needs P(a), P(b), P(x | a, b), P(c | x), and P(d | x).
P(a): a1 = winter 0.25, a2 = spring 0.25, a3 = summer 0.25, a4 = autumn 0.25
P(b): b1 = north Atlantic 0.6, b2 = south Atlantic 0.4
X: x1 = salmon, x2 = sea bass
P(x | a, b), entries give (P(x1), P(x2)):
  a1,b1: 0.5, 0.5   a1,b2: 0.7, 0.3
  a2,b1: 0.6, 0.4   a2,b2: 0.8, 0.2
  a3,b1: 0.4, 0.6   a3,b2: 0.1, 0.9
  a4,b1: 0.2, 0.8   a4,b2: 0.3, 0.7
P(c | x), c1 = light, c2 = medium, c3 = dark: x1: 0.6, 0.2, 0.2; x2: 0.2, 0.3, 0.5
P(d | x), d1 = wide, d2 = thin: x1: 0.3, 0.7; x2: 0.6, 0.4
78 Belief Network for Fish (example query). A fish was caught in the summer in the north Atlantic, and it is a sea bass that is dark and thin: P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2 | a3, b1) P(c3 | x2) P(d2 | x2) = 0.25 × 0.6 × 0.6 × 0.5 × 0.4 = 0.018.
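A small sketch (not from the slides) that recomputes this joint probability directly from the conditional probability tables on the previous slide:

```python
# Tables from the "Belief Network for Fish" slide.
P_a = {"a1": 0.25, "a2": 0.25, "a3": 0.25, "a4": 0.25}          # season
P_b = {"b1": 0.6, "b2": 0.4}                                     # locale
P_x_ab = {                                                       # P(x | a, b) as (x1, x2)
    ("a1", "b1"): (0.5, 0.5), ("a1", "b2"): (0.7, 0.3),
    ("a2", "b1"): (0.6, 0.4), ("a2", "b2"): (0.8, 0.2),
    ("a3", "b1"): (0.4, 0.6), ("a3", "b2"): (0.1, 0.9),
    ("a4", "b1"): (0.2, 0.8), ("a4", "b2"): (0.3, 0.7),
}
P_c_x = {"x1": (0.6, 0.2, 0.2), "x2": (0.2, 0.3, 0.5)}           # (c1, c2, c3)
P_d_x = {"x1": (0.3, 0.7), "x2": (0.6, 0.4)}                     # (d1, d2)

# P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2|a3,b1) P(c3|x2) P(d2|x2)
joint = (P_a["a3"] * P_b["b1"]
         * P_x_ab[("a3", "b1")][1]    # x2 = sea bass
         * P_c_x["x2"][2]             # c3 = dark
         * P_d_x["x2"][1])            # d2 = thin
print(joint)                          # 0.018
```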
82 Medical application: medical diagnosis with a belief network. Uppermost nodes: biological agents (virus or bacteria); intermediate nodes: diseases (flu or emphysema); lowermost nodes: symptoms (high temperature or coughing). By entering measured values as evidence, the network finds the most likely disease or cause.