Bayesian Decision Theory details ML.pptx

AmritaChaturvedi2 149 views 83 slides Jun 11, 2024

About This Presentation

These slides describe Bayesian decision theory.


Slide Content

Machine Learning

Bayesian Decision Theory 1. Introduction – Bayesian Decision Theory: pure statistics, probabilities known, optimal decision; 2. Bayesian Decision Theory – Continuous Features; 3. Minimum-Error-Rate Classification; 4. Classifiers, Discriminant Functions, Decision Surfaces; 5. The Normal Density; 6. Discriminant Functions for the Normal Density

2 1. Introduction The sea bass/salmon example: state of nature, prior. The state of nature is a random variable. The catch of salmon and sea bass is equiprobable: P(ω1) = P(ω2) (uniform priors), and P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity).

3 Decision rule with only the prior information: decide ω1 if P(ω1) > P(ω2), otherwise decide ω2. Use of the class-conditional information: p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.

4

5 Bayes Formula Suppose we know the prior probabilities P(ωj) and the conditional densities p(x | ωj) for j = 1, 2, and the lightness of a fish has value x. The joint probability density of finding a pattern that is in category ωj and has feature value x is: p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)

6 Posterior, likelihood, evidence: P(ωj | x) = p(x | ωj) P(ωj) / p(x) (Bayes rule), where in the case of two categories the evidence is p(x) = p(x | ω1) P(ω1) + p(x | ω2) P(ω2). In words: Posterior = (Likelihood × Prior) / Evidence.
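As a concrete illustration (not part of the original slides), here is a minimal Python sketch of Bayes rule for the two-category fish example; the Gaussian lightness models and the equal priors are made-up assumptions.

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.5, 0.5])          # P(w1), P(w2): equiprobable catch (assumed)

    def likelihoods(x):
        # p(x | w1), p(x | w2): hypothetical Gaussian lightness densities, parameters illustrative only
        return np.array([norm.pdf(x, loc=4.0, scale=1.0),
                         norm.pdf(x, loc=6.0, scale=1.0)])

    def posteriors(x):
        joint = likelihoods(x) * priors    # p(x | wj) P(wj)
        evidence = joint.sum()             # p(x) = sum_j p(x | wj) P(wj)
        return joint / evidence            # P(wj | x)

    print(posteriors(5.2))                 # decide the class with the larger posterior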

7

8 Decision given the posterior probabilities: x is an observation; if P(ω1 | x) > P(ω2 | x), decide the true state of nature is ω1, and if P(ω1 | x) < P(ω2 | x), decide ω2. This rule minimizes the average probability of error, since P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]. Equivalently: decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2).

9 2. Bayesian Decision Theory – Continuous Features Generalization of the preceding ideas: use of more than one feature (a feature vector x); more than two states of nature (c classes); allowing actions other than deciding the state of nature; introducing a loss function that is more general than the probability of error.

10 Allowing actions other than classification primarily allows the possibility of rejection, i.e., abstention: don't make a decision if the alternatives are too close. This must be tempered by the cost of indecision. The loss function states how costly each action taken is.

11 Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories"), and let {α1, α2, …, αa} be the set of possible actions. Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.

12 Conditional risk: R(αi | x) = Σj λ(αi | ωj) P(ωj | x) is the expected loss of taking action αi for the particular observation x. The overall risk R is the expected value of R(α(x) | x) over all x. Minimizing R is achieved by minimizing the conditional risk R(αi | x) for i = 1, …, a at each x.

13 Select the action αi for which R(αi | x) is minimum. The resulting overall risk R is then minimum, and R in this case is called the Bayes risk: the best performance that can be achieved!
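For illustration only (not from the slides), a minimal Python sketch of choosing the Bayes action from a loss matrix; the two-class, three-action loss values (including a reject option, echoing slide 10) are invented numbers.

    import numpy as np

    def bayes_action(posteriors, loss):
        """Pick the action with minimum conditional risk.
        posteriors: shape (c,), P(w_j | x); loss: shape (a, c), loss[i, j] = lambda(alpha_i | w_j)."""
        cond_risk = loss @ posteriors      # R(alpha_i | x) = sum_j lambda(alpha_i | w_j) P(w_j | x)
        return int(np.argmin(cond_risk)), cond_risk

    # Illustrative loss matrix: actions are 0: decide w1, 1: decide w2, 2: reject
    loss = np.array([[0.0, 10.0],          # deciding w1 is costly if the truth is w2
                     [5.0,  0.0],          # deciding w2 is costly if the truth is w1
                     [1.0,  1.0]])         # rejecting has a small fixed cost
    action, risks = bayes_action(np.array([0.45, 0.55]), loss)
    print(action, risks)                   # the close posteriors make rejection the cheapest action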

14 Two-category classification: α1 means deciding ω1, α2 means deciding ω2, and λij = λ(αi | ωj) is the loss incurred for deciding ωi when the true state of nature is ωj. Conditional risk: R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x); R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x).

15 Our rule is the following: if R(α1 | x) < R(α2 | x), take action α1 ("decide ω1"). Substituting the definition of R(·), we have: decide ω1 if λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x), and decide ω2 otherwise.

16 We can rewrite λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x) as (λ21 - λ11) P(ω1 | x) > (λ12 - λ22) P(ω2 | x).

17 Finally, we can rewrite (λ21 - λ11) P(ω1 | x) > (λ12 - λ22) P(ω2 | x) using the Bayes formula and the posterior probabilities to get: decide ω1 if (λ21 - λ11) p(x | ω1) P(ω1) > (λ12 - λ22) p(x | ω2) P(ω2), and decide ω2 otherwise.

18 If λ21 > λ11, we can express our rule as a likelihood ratio: the preceding rule is equivalent to deciding ω1 (taking action α1) if p(x | ω1) / p(x | ω2) > [(λ12 - λ22) P(ω2)] / [(λ21 - λ11) P(ω1)], and taking action α2 (deciding ω2) otherwise.

19 Optimal decision property “If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions”
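A small Python sketch of this fixed-threshold rule (not from the slides); the Gaussian likelihoods, priors, and loss values below are illustrative assumptions.

    from scipy.stats import norm

    # Hypothetical two-class problem; every number here is made up for illustration.
    P1, P2 = 0.6, 0.4                          # priors P(w1), P(w2)
    l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0    # losses lambda_ij for deciding w_i when truth is w_j
    theta = ((l12 - l22) * P2) / ((l21 - l11) * P1)   # threshold, independent of the pattern x

    def decide(x):
        ratio = norm.pdf(x, 4.0, 1.0) / norm.pdf(x, 6.0, 1.0)   # p(x | w1) / p(x | w2)
        return "w1" if ratio > theta else "w2"

    print(theta, decide(4.8))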

20 3. Minimum-Error-Rate Classification Actions are decisions on classes. If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j. We seek a decision rule that minimizes the probability of error, which is the error rate.

21 Introduction of the zero-one loss function: λ(αi | ωj) = 0 if i = j and 1 if i ≠ j, for i, j = 1, …, c. Therefore, the conditional risk is R(αi | x) = Σ_{j≠i} P(ωj | x) = 1 - P(ωi | x). "The risk corresponding to this loss function is the average probability of error."

22 Minimizing the risk requires maximizing P(ωi | x) (since R(αi | x) = 1 - P(ωi | x)). For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i.

23 Decision regions and the zero-one loss function: decide ω1 if the likelihood ratio p(x | ω1) / p(x | ω2) exceeds the threshold θλ = [(λ12 - λ22) P(ω2)] / [(λ21 - λ11) P(ω1)]. If λ is the zero-one loss function, this threshold reduces to θ = P(ω2) / P(ω1).

24

25 4. Classifiers, Discriminant Functions, and Decision Surfaces The multi-category case: a set of discriminant functions gi(x), i = 1, …, c. The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.

26 Discriminant function: gi(x) = P(ωi | x) (maximum discrimination corresponds to maximum posterior!); gi(x) = p(x | ωi) P(ωi); gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural logarithm!).

27

28 The feature space is divided into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi). The two-category case: a classifier is a "dichotomizer" that has two discriminant functions g1 and g2. Let g(x) ≡ g1(x) - g2(x). Decide ω1 if g(x) > 0; otherwise decide ω2.

29 The computation of g(x): for example g(x) = P(ω1 | x) - P(ω2 | x), or equivalently g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)].

30

31 5. The Normal Density Univariate density: a density that is analytically tractable and continuous. Many processes are asymptotically Gaussian; handwritten characters and speech sounds can be seen as an ideal prototype corrupted by a random process (central limit theorem). p(x) = (1 / (√(2π) σ)) exp[ -(x - μ)² / (2σ²) ], where μ = mean (or expected value) of x and σ² = expected squared deviation, or variance.

32

33 Multivariate normal density p(x) ~ N(μ, Σ). The multivariate normal density in d dimensions is p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ -(1/2) (x - μ)^t Σ^-1 (x - μ) ], where x = (x1, x2, …, xd)^t (t stands for the transpose vector form), μ = (μ1, μ2, …, μd)^t is the mean vector, Σ is the d × d covariance matrix, and |Σ| and Σ^-1 are its determinant and inverse respectively.
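A short Python sketch (not from the slides) that evaluates this density directly and cross-checks it against scipy; the mean, covariance, and query point are arbitrary illustrative values.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mvn_pdf(x, mu, sigma):
        """Evaluate the d-dimensional normal density at x from the formula above."""
        d = len(mu)
        diff = x - mu
        norm_const = 1.0 / (np.power(2 * np.pi, d / 2) * np.sqrt(np.linalg.det(sigma)))
        return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

    mu = np.array([1.0, 2.0])                    # illustrative parameters only
    sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    x = np.array([0.5, 1.5])
    print(mvn_pdf(x, mu, sigma), multivariate_normal(mu, sigma).pdf(x))   # the two values agree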

34

35 Properties of the covariance matrix Σ of the normal density: always symmetric; always positive semi-definite, but for our purposes Σ is positive definite, so its determinant is positive and Σ is invertible; its eigenvalues are real and positive.

36 Linear combinations of normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^t x, then p(y) ~ N(A^t μ, A^t Σ A). Whitening transform: A_w = Φ Λ^(-1/2), where Φ is the matrix of orthonormal eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues; the whitened variables have covariance equal to the identity matrix.
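A minimal Python sketch of the whitening transform (not from the slides); the covariance matrix below is an arbitrary illustrative choice.

    import numpy as np

    def whitening_transform(sigma):
        """A_w = Phi Lambda^(-1/2): maps variables with covariance sigma to covariance I."""
        eigvals, eigvecs = np.linalg.eigh(sigma)       # Phi = eigvecs, Lambda = diag(eigvals)
        return eigvecs @ np.diag(eigvals ** -0.5)

    sigma = np.array([[2.0, 0.5],                      # illustrative covariance only
                      [0.5, 1.0]])
    A_w = whitening_transform(sigma)
    print(A_w.T @ sigma @ A_w)                         # approximately the identity matrix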

37

38 Mahalanobis distance from x to μ: r² = (x - μ)^t Σ^-1 (x - μ).
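A small Python sketch of the squared Mahalanobis distance (not from the slides); the mean, covariance, and test point are illustrative values.

    import numpy as np

    def mahalanobis_sq(x, mu, sigma):
        """Squared Mahalanobis distance r^2 = (x - mu)^t Sigma^-1 (x - mu)."""
        diff = x - mu
        return float(diff @ np.linalg.solve(sigma, diff))   # solve avoids forming the explicit inverse

    mu = np.array([1.0, 2.0])                                # illustrative values only
    sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    print(mahalanobis_sq(np.array([0.5, 1.5]), mu, sigma))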

39

40 Entropy measures the fundamental uncertainty in the values of points selected randomly from a distribution: H(p) = -∫ p(x) ln p(x) dx, measured in nats (or bits if the logarithm is base 2). The normal distribution has the maximum entropy of all distributions having a given mean and variance.
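To make the "measured in nats" point concrete, a tiny check (not from the slides) that the closed-form entropy of a univariate normal, H = (1/2) ln(2πeσ²) nats, matches what scipy reports; σ = 2 is an arbitrary choice.

    import numpy as np
    from scipy.stats import norm

    sigma = 2.0
    closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)    # H = 1/2 ln(2*pi*e*sigma^2) in nats
    print(closed_form, norm(loc=0.0, scale=sigma).entropy())   # the two values agree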

41 6. Discriminant Functions for the Normal Density We saw that minimum-error-rate classification can be achieved by the discriminant function gi(x) = ln p(x | ωi) + ln P(ωi). For the multivariate normal case p(x | ωi) ~ N(μi, Σi), this becomes gi(x) = -(1/2)(x - μi)^t Σi^-1 (x - μi) - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi).

42 Case Σi = σ²I (I stands for the identity matrix). What does "Σi = σ²I" say about the dimensions? (They are statistically independent.) What about the variance of each dimension? (Each dimension has the same variance σ².)

43 Case Σi = σ²I: gi(x) = -‖x - μi‖² / (2σ²) + ln P(ωi). We can further simplify by recognizing that the quadratic term x^t x implicit in the Euclidean norm is the same for all i and can be dropped, leaving the linear discriminant gi(x) = wi^t x + wi0, where wi = μi / σ² and wi0 = -μi^t μi / (2σ²) + ln P(ωi).

44 A classifier that uses linear discriminant functions is called "a linear machine". The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x); the equation can be written as w^t (x - x0) = 0, where x0 is a point on the line between the means. If the priors are equal for all classes, this reduces to the minimum-distance classifier, where an unknown x is assigned to the class of the nearest mean.
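A Python sketch of this linear machine for the Σi = σ²I case (not from the slides); the means, priors, and σ² are illustrative, and with equal priors the rule coincides with the minimum-distance classifier.

    import numpy as np

    def linear_machine_sigma2I(x, means, priors, sigma2):
        """Case Sigma_i = sigma^2 I: g_i(x) = w_i^t x + w_i0 with w_i = mu_i / sigma^2."""
        scores = []
        for mu, prior in zip(means, priors):
            w = mu / sigma2
            w0 = -mu @ mu / (2 * sigma2) + np.log(prior)
            scores.append(w @ x + w0)
        return int(np.argmax(scores))                   # class with the largest discriminant

    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # illustrative parameters only
    priors = [0.5, 0.5]                                    # equal priors -> nearest-mean rule
    print(linear_machine_sigma2I(np.array([1.0, 1.0]), means, priors, sigma2=1.0))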

45 The hyperplane separating Ri and Rj is always orthogonal to the line linking the means!

46

47

48

49 2. Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!) The quadratic form (x - μi)^t Σ^-1 (x - μi) results in a sum involving the term x^t Σ^-1 x, which is independent of i. After this term is dropped, the resulting discriminant function is again linear: gi(x) = wi^t x + wi0, where wi = Σ^-1 μi and wi0 = -(1/2) μi^t Σ^-1 μi + ln P(ωi).

50 The resulting decision boundaries are again hyperplanes. If Ri and Rj are contiguous, the boundary between them has the equation w^t (x - x0) = 0, where w = Σ^-1 (μi - μj).

51 2. Case Σi = Σ (the covariance matrices of all classes are identical but arbitrary!) The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! If the priors are equal, x0 is halfway between the means.
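For illustration only (not from the slides), a minimal Python sketch of the linear discriminant for the shared-covariance case, using wi = Σ^-1 μi from slide 49; all parameter values are made up.

    import numpy as np

    def linear_machine_shared_cov(x, means, priors, sigma):
        """Case Sigma_i = Sigma: g_i(x) = w_i^t x + w_i0 with w_i = Sigma^-1 mu_i."""
        sigma_inv = np.linalg.inv(sigma)
        scores = []
        for mu, prior in zip(means, priors):
            w = sigma_inv @ mu
            w0 = -0.5 * mu @ sigma_inv @ mu + np.log(prior)
            scores.append(w @ x + w0)
        return int(np.argmax(scores))

    sigma = np.array([[1.0, 0.3],                      # illustrative shared covariance
                      [0.3, 2.0]])
    means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
    print(linear_machine_shared_cov(np.array([1.5, 0.5]), means, priors=[0.5, 0.5], sigma=sigma))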

52

53

54

55 3. Case Σi = arbitrary The covariance matrices are different for each category, so the discriminant functions are quadratic: gi(x) = x^t Wi x + wi^t x + wi0, with Wi = -(1/2) Σi^-1, wi = Σi^-1 μi, and wi0 = -(1/2) μi^t Σi^-1 μi - (1/2) ln |Σi| + ln P(ωi). The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids.
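A Python sketch of this quadratic (arbitrary-covariance) discriminant, evaluated directly from the Gaussian log-density (not from the slides); the two-class parameters are invented for illustration.

    import numpy as np

    def quadratic_discriminant(x, mu, sigma, prior):
        """g_i(x) = -1/2 (x-mu)^t Sigma_i^-1 (x-mu) - 1/2 ln|Sigma_i| + ln P(w_i), constants dropped."""
        diff = x - mu
        return (-0.5 * diff @ np.linalg.solve(sigma, diff)
                - 0.5 * np.log(np.linalg.det(sigma))
                + np.log(prior))

    params = [  # (mean, covariance, prior): made-up numbers for a two-class problem
        (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
        (np.array([2.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.5]]), 0.5),
    ]
    x = np.array([1.0, 1.5])
    scores = [quadratic_discriminant(x, m, s, p) for m, s, p in params]
    print(int(np.argmax(scores)))       # index of the predicted class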

56

57

58

59

60

61

62 Assume we have the following data

63 Bayes Decision Theory – Discrete Features The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm. We are therefore concerned with probabilities rather than probability densities, and the Bayes formula becomes P(ωj | x) = P(x | ωj) P(ωj) / P(x), with P(x) = Σj P(x | ωj) P(ωj).

64 Bayes Decision Theory – Discrete Features The conditional risk is defined as before: R(αi | x) = Σj λ(αi | ωj) P(ωj | x). The approach is still to minimize risk: select the action αi that minimizes R(αi | x).

65 Bayes Decision Theory – Discrete Features Case of independent binary features in a 2-category problem. Let x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2).

66 Bayes Decision Theory – Discrete Features Assuming conditional independence, P(x | ωi) can be written as a product of component probabilities: P(x | ω1) = Π_{i=1..d} pi^xi (1 - pi)^(1-xi) and P(x | ω2) = Π_{i=1..d} qi^xi (1 - qi)^(1-xi).

67 Bayes Decision Theory – Discrete Features Taking the likelihood ratio: P(x | ω1) / P(x | ω2) = Π_{i=1..d} (pi / qi)^xi ((1 - pi) / (1 - qi))^(1-xi).

68 The discriminant function in this case is g(x) = Σ_{i=1..d} wi xi + w0, where wi = ln [pi (1 - qi) / (qi (1 - pi))] and w0 = Σ_{i=1..d} ln [(1 - pi) / (1 - qi)] + ln [P(ω1) / P(ω2)]; decide ω1 if g(x) > 0, otherwise decide ω2.
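A minimal Python sketch of this binary-feature discriminant (not from the slides); the feature probabilities and priors below are illustrative assumptions.

    import numpy as np

    def binary_feature_discriminant(x, p, q, prior1, prior2):
        """g(x) = sum_i w_i x_i + w_0 for independent binary features (x_i in {0, 1})."""
        p, q, x = np.asarray(p), np.asarray(q), np.asarray(x)
        w = np.log(p * (1 - q) / (q * (1 - p)))                          # per-feature weights
        w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)
        return float(w @ x + w0)                                         # decide w1 if positive

    p = [0.8, 0.7, 0.6]        # P(x_i = 1 | w1), illustrative only
    q = [0.3, 0.4, 0.5]        # P(x_i = 1 | w2), illustrative only
    g = binary_feature_discriminant([1, 0, 1], p, q, prior1=0.5, prior2=0.5)
    print(g, "decide w1" if g > 0 else "decide w2")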

69 Bayesian Belief Networks Model the features and their causal relationships or statistical independence. Also called Bayesian belief nets, causal networks, or belief nets.

70 x1 and x3 are independent

71 Structure: each node represents a discrete variable; parent and child nodes indicate direct influence; each node has a conditional probability table, set by an expert or learned from a training set (learning is not discussed here).

72

73 Examples

74

75

76 Evidence e

77 Ex. 4. Belief Network for Fish Nodes A (season), B (locale), X (fish), C (lightness), D (thickness), with tables P(a), P(b), P(x | a, b), P(c | x), P(d | x):
P(a): a1 = winter 0.25, a2 = spring 0.25, a3 = summer 0.25, a4 = autumn 0.25
P(b): b1 = north Atlantic 0.6, b2 = south Atlantic 0.4
X: x1 = salmon, x2 = sea bass
P(x | a, b), listed as (x1, x2): a1b1 0.5, 0.5; a1b2 0.7, 0.3; a2b1 0.6, 0.4; a2b2 0.8, 0.2; a3b1 0.4, 0.6; a3b2 0.1, 0.9; a4b1 0.2, 0.8; a4b2 0.3, 0.7
P(c | x), listed as (c1 = light, c2 = medium, c3 = dark): x1 0.6, 0.2, 0.2; x2 0.2, 0.3, 0.5
P(d | x), listed as (d1 = wide, d2 = thin): x1 0.3, 0.7; x2 0.6, 0.4

78 Belief Network for Fish The fish was caught in the summer in the north Atlantic and is a sea bass that is dark and thin: P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2 | a3, b1) P(c3 | x2) P(d2 | x2) = 0.25 × 0.6 × 0.6 × 0.5 × 0.4 = 0.018
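A small Python sketch (not from the slides) of this joint-probability computation; the dictionaries are a hypothetical encoding of the slide-77 tables, with only the entries needed here.

    # Tables from slide 77, restricted to the entries used in this example.
    P_a = {"a1": 0.25, "a2": 0.25, "a3": 0.25, "a4": 0.25}
    P_b = {"b1": 0.6, "b2": 0.4}
    P_x_given_ab = {("a3", "b1"): {"x1": 0.4, "x2": 0.6}}
    P_c_given_x = {"x2": {"c1": 0.2, "c2": 0.3, "c3": 0.5}}
    P_d_given_x = {"x2": {"d1": 0.6, "d2": 0.4}}

    def joint(a, b, x, c, d):
        # P(a, b, x, c, d) = P(a) P(b) P(x | a, b) P(c | x) P(d | x)
        return P_a[a] * P_b[b] * P_x_given_ab[(a, b)][x] * P_c_given_x[x][c] * P_d_given_x[x][d]

    print(joint("a3", "b1", "x2", "c3", "d2"))    # 0.018, matching the slide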

79 Example query: given a light fish caught in the south Atlantic, which fish is it?

80 Normalize

81 Conditionally Independent

82 Medical Application Medical diagnosis: uppermost nodes represent biological agents (virus or bacteria), intermediate nodes represent diseases (flu or emphysema), and lowermost nodes represent symptoms (high temperature or coughing). By entering measured values, the network finds the most likely disease or cause.