Machine Learning
Bayesian Decision Theory
1. Introduction – Bayesian Decision Theory (pure statistics, probabilities known, optimal decision)
2. Bayesian Decision Theory – Continuous Features
3. Minimum-Error-Rate Classification
4. Classifiers, Discriminant Functions, Decision Surfaces
5. The Normal Density
6. Discriminant Functions for the Normal Density
2 1. Introduction: the sea bass/salmon example. The state of nature ω is a random variable with a prior. If the catch of salmon and sea bass is equiprobable, then P(ω1) = P(ω2) (uniform priors), and P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity).
3 Decision rule with only the prior information: decide ω1 if P(ω1) > P(ω2), otherwise decide ω2. Use of the class-conditional information: p(x | ω1) and p(x | ω2) describe the difference in lightness between the populations of sea bass and salmon.
5 Bayes formula. Suppose we know the prior probabilities P(ωj) and the conditional densities p(x | ωj) for j = 1, 2, and the lightness of a fish has value x. The joint probability density of finding a pattern that is in category ωj and has feature value x is: p(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj).
6 Posterior, likelihood, evidence. Bayes rule: P(ωj | x) = p(x | ωj) P(ωj) / p(x), where in the case of two categories the evidence is p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj). In words: posterior = (likelihood × prior) / evidence.
8 Decision given the posterior probabilities. For an observation x: decide ω1 if P(ω1 | x) > P(ω2 | x); decide ω2 if P(ω1 | x) < P(ω2 | x). This rule minimizes the average probability of error, since P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]. Equivalently, because the evidence p(x) is common to both posteriors: decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2).
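Not part of the original slides: a minimal Python/NumPy sketch of this two-category decision rule, assuming (for illustration only) Gaussian class-conditional densities for lightness and made-up priors.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities for the lightness feature x:
# p(x | w1) for sea bass and p(x | w2) for salmon (parameters are made up).
priors = {"sea_bass": 0.5, "salmon": 0.5}          # P(w1), P(w2)
likelihoods = {
    "sea_bass": norm(loc=7.0, scale=1.0),          # p(x | w1)
    "salmon":   norm(loc=4.0, scale=1.2),          # p(x | w2)
}

def decide(x):
    """Decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise w2."""
    scores = {w: likelihoods[w].pdf(x) * priors[w] for w in priors}
    evidence = sum(scores.values())                # p(x), used to report posteriors
    posteriors = {w: s / evidence for w, s in scores.items()}
    label = max(scores, key=scores.get)
    p_error = min(posteriors.values())             # P(error | x) = min posterior
    return label, posteriors, p_error

print(decide(5.0))
print(decide(6.8))
```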
9 2. Bayesian Decision Theory – Continuous Features. Generalization of the preceding ideas:
- use more than one feature (a feature vector x);
- use more than two states of nature (c classes);
- allow actions other than deciding the state of nature;
- introduce a loss function that is more general than the probability of error.
10 Allowing actions other than classification primarily allows the possibility of rejection, i.e., abstention: don't make a decision if the alternatives are too close. This must be tempered by the cost of indecision. The loss function states how costly each action is.
11 Let {ω1, ω2, …, ωc} be the set of c states of nature (or "categories"). Let {α1, α2, …, αa} be the set of a possible actions. Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.
12 Conditional risk: R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) — the risk of taking action αi for the specific observation x. Overall risk: R = ∫ R(α(x) | x) p(x) dx, the expected loss associated with a decision rule α(x). Minimizing R is achieved by minimizing the conditional risk R(αi | x) over i = 1,…,a for every x.
13 Select the action αi for which R(αi | x) is minimum. The overall risk R is then minimized, and this minimum value is called the Bayes risk: the best performance that can be achieved!
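A sketch (not from the slides) of this general minimum-risk rule: given a loss matrix λ(αi | ωj) and the posteriors P(ωj | x), compute each conditional risk R(αi | x) and pick the smallest. The loss values and posteriors below are illustrative; the third action shows the rejection option mentioned on slide 10.

```python
import numpy as np

# Rows: actions a_i, columns: states of nature w_j; lam[i, j] = loss(a_i | w_j).
lam = np.array([[0.0, 2.0],    # a1: decide w1
                [1.0, 0.0],    # a2: decide w2
                [0.3, 0.3]])   # a3: reject (illustrative constant cost)

def bayes_action(posteriors):
    """Return the index of the action minimizing R(a_i | x) = sum_j lam[i,j] P(w_j | x)."""
    posteriors = np.asarray(posteriors)
    cond_risk = lam @ posteriors          # conditional risk of every action
    return int(np.argmin(cond_risk)), cond_risk

action, risks = bayes_action([0.55, 0.45])
print(action, risks)   # the reject action wins when the posteriors are too close
```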
14 Two-category classification. α1: deciding ω1; α2: deciding ω2. λij = λ(αi | ωj) is the loss incurred for deciding ωi when the true state of nature is ωj. Conditional risks: R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x); R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x).
15 Our rule is the following: if R(α1 | x) < R(α2 | x), take action α1 ("decide ω1"). Substituting the definition of R(·): decide ω1 if λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x), and decide ω2 otherwise.
16 We can rewrite λ11 P(ω1 | x) + λ12 P(ω2 | x) < λ21 P(ω1 | x) + λ22 P(ω2 | x) as (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x).
17 Finally, using Bayes formula we can rewrite (λ21 − λ11) P(ω1 | x) > (λ12 − λ22) P(ω2 | x) in terms of priors and likelihoods: decide ω1 if (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2), and decide ω2 otherwise.
18 If λ21 > λ11, we can express the rule as a likelihood-ratio test: take action α1 (decide ω1) if p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1); otherwise take action α2 (decide ω2).
19 Optimal decision property “If the likelihood ratio exceeds a threshold value independent of the input pattern x, we can take optimal actions”
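A small sketch (not from the slides) of the likelihood-ratio form of the rule: the threshold θ is built from the losses and priors and does not depend on x. The losses, priors, and class-conditional densities below are illustrative assumptions.

```python
from scipy.stats import norm

# Illustrative losses lam_ij (loss for deciding w_i when the truth is w_j) and priors.
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0
P1, P2 = 0.6, 0.4
p1, p2 = norm(7.0, 1.0), norm(4.0, 1.2)    # assumed class-conditional densities

# Threshold independent of x: theta = (lam12 - lam22)/(lam21 - lam11) * P2/P1
theta = (lam12 - lam22) / (lam21 - lam11) * (P2 / P1)

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)          # likelihood ratio p(x|w1)/p(x|w2)
    return "w1" if ratio > theta else "w2"

print(theta)                  # fixed threshold
print(decide(5.5), decide(6.5))
```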
20 3. Minimum-Error-Rate Classification. Actions are decisions on classes: if action αi is taken and the true state of nature is ωj, the decision is correct if i = j and in error if i ≠ j. We seek a decision rule that minimizes the probability of error, i.e., the error rate.
21 Introduce the zero-one loss function: λ(αi | ωj) = 0 if i = j and 1 if i ≠ j, for i, j = 1,…,c. The conditional risk is then R(αi | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x). The risk corresponding to this loss function is the average probability of error.
22 Minimizing the risk requires maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x). For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i.
23 Decision regions and the zero-one loss function. With the likelihood-ratio rule, decide ω1 if p(x | ω1) / p(x | ω2) > θλ, where θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1). If λ is the zero-one loss function, the threshold reduces to θ = P(ω2) / P(ω1).
25 4. Classifiers, Discriminant Functions, and Decision Surfaces. The multi-category case: use a set of discriminant functions gi(x), i = 1,…,c. The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i.
26 Discriminant functions for minimum error rate: gi(x) = P(ωi | x) (maximum discrimination corresponds to maximum posterior!). Equivalent choices: gi(x) = p(x | ωi) P(ωi), or gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural logarithm).
28 The feature space is divided into c decision regions: if gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means assign x to ωi). The two-category case: a classifier is a "dichotomizer" with two discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
29 The computation of g(x): g(x) = P(ω1 | x) − P(ω2 | x), or equivalently g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)].
31 5. The Normal Density. The univariate density is analytically tractable and continuous, and many processes are asymptotically Gaussian: handwritten characters and speech sounds can be viewed as an ideal prototype corrupted by a large number of random processes (central limit theorem). p(x) = (1 / (√(2π) σ)) exp[ −½ ((x − μ)/σ)² ], where μ = mean (or expected value) of x and σ² = expected squared deviation, i.e., the variance.
33 Multivariate normal density: p(x) ~ N(μ, Σ). In d dimensions: p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[ −½ (x − μ)^t Σ^{−1} (x − μ) ], where x = (x1, x2, …, xd)^t (t stands for the transpose), μ = (μ1, μ2, …, μd)^t is the mean vector, Σ is the d×d covariance matrix, and |Σ| and Σ^{−1} are its determinant and inverse, respectively.
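A sketch (not from the slides) that evaluates this d-dimensional density directly from the formula with NumPy; the mean and covariance below are illustrative.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """p(x) = exp(-0.5 (x-mu)^t Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    d = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)^t Sigma^-1 (x-mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf([1.5, 1.5], mu, Sigma))
```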
35 Properties of the covariance matrix Σ of a normal density: it is always symmetric; always positive semi-definite, but for our purposes positive definite (so its determinant is positive and it is invertible); and its eigenvalues are real and positive.
36 Linear combinations of normally distributed random variables are normally distributed: if p(x) ~ N(μ, Σ) and y = A^t x, then p(y) ~ N(A^t μ, A^t Σ A). Whitening transform: A_w = Φ Λ^{−1/2}, where Φ is the matrix of orthonormal eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues; then A_w^t Σ A_w = I.
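A sketch (not from the slides) of the whitening transform A_w = Φ Λ^{−1/2} built from the eigendecomposition of Σ; it checks that the whitened data have (approximately) identity covariance. The covariance and mean below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=10_000)   # samples of x ~ N(mu, Sigma)

# Eigendecomposition Sigma = Phi Lambda Phi^t (eigh is for symmetric matrices).
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)                  # whitening matrix Phi Lambda^{-1/2}

Y = (X - mu) @ A_w                                    # y = A_w^t (x - mu), row-vector form
print(np.cov(Y, rowvar=False))                        # ~ identity matrix
print(A_w.T @ Sigma @ A_w)                            # identity up to round-off
```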
38 The squared Mahalanobis distance from x to μ: r² = (x − μ)^t Σ^{−1} (x − μ). Points of constant density lie on hyperellipsoids of constant Mahalanobis distance from μ.
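A small sketch (not from the slides) computing r² = (x − μ)^t Σ^{−1} (x − μ) with NumPy; the values are illustrative and show how the covariance re-weights each direction.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^t Sigma^-1 (x - mu)."""
    diff = np.asarray(x, float) - np.asarray(mu, float)
    return float(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
# Same Euclidean distance from mu, different Mahalanobis distances:
print(mahalanobis_sq([2.0, 0.0], mu, Sigma))   # 1.0 (high-variance axis)
print(mahalanobis_sq([0.0, 2.0], mu, Sigma))   # 4.0 (low-variance axis)
```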
40 Entropy: H(p(x)) = −∫ p(x) ln p(x) dx measures the fundamental uncertainty in the values of points selected randomly from a distribution. It is measured in nats (natural logarithm) or bits (log base 2). The normal distribution has the maximum entropy of all distributions having a given mean and variance.
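Not on the slide, but a short worked derivation (in LaTeX) of the differential entropy of a univariate normal, which makes the maximum-entropy remark concrete:

```latex
H(p) = -\int p(x)\,\ln p(x)\,dx, \qquad
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]

\ln p(x) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2}

H(p) = \tfrac{1}{2}\ln(2\pi\sigma^2) + \frac{\mathbb{E}\!\left[(x-\mu)^2\right]}{2\sigma^2}
     = \tfrac{1}{2}\ln(2\pi\sigma^2) + \tfrac{1}{2}
     = \tfrac{1}{2}\ln\!\left(2\pi e\,\sigma^2\right)\ \text{nats}
```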
41 6. Discriminant Functions for the Normal Density. We saw that minimum-error-rate classification can be achieved with the discriminant function gi(x) = ln p(x | ωi) + ln P(ωi). For the multivariate normal case p(x | ωi) ~ N(μi, Σi) this gives gi(x) = −½ (x − μi)^t Σi^{−1} (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi).
42 Case 1: Σi = σ²I (I stands for the identity matrix). What does Σi = σ²I say about the dimensions? The features are statistically independent, and each dimension has the same variance σ².
43 Case Σi = σ²I (continued). The discriminant reduces to gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi). We can further simplify by recognizing that the quadratic term x^t x implicit in the Euclidean norm is the same for all i; dropping it yields the linear form gi(x) = wi^t x + wi0, with wi = μi/σ² and wi0 = −μi^t μi / (2σ²) + ln P(ωi).
44 A classifier that uses linear discriminant functions is called a "linear machine". The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x), which can be written as w^t (x − x0) = 0. If the priors are equal for all classes, this reduces to the minimum-distance classifier, where an unknown x is assigned to the class of the nearest mean.
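A sketch (not from the slides) of the linear machine for the Σi = σ²I case, using gi(x) = wi^t x + wi0 with wi = μi/σ² and wi0 = −μi^t μi/(2σ²) + ln P(ωi); the means, variance, and priors are illustrative.

```python
import numpy as np

sigma2 = 1.0                                        # shared variance sigma^2
means = np.array([[0.0, 0.0],                       # mu_1
                  [3.0, 0.0],                       # mu_2
                  [0.0, 3.0]])                      # mu_3
priors = np.array([0.5, 0.25, 0.25])                # P(w_i)

W = means / sigma2                                  # w_i = mu_i / sigma^2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    """Assign x to the class with the largest linear discriminant g_i(x)."""
    g = W @ np.asarray(x, float) + w0
    return int(np.argmax(g)), g

print(classify([1.4, 1.4]))   # with equal priors this would be the nearest-mean rule
```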
45 The hyperplane separating Ri and Rj is always orthogonal to the line linking the means!
49 Case 2: Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!). The discriminant expands into a sum whose quadratic term x^t Σ^{−1} x is independent of i. After this term is dropped, the resulting discriminant function is again linear: gi(x) = wi^t x + wi0, with wi = Σ^{−1} μi and wi0 = −½ μi^t Σ^{−1} μi + ln P(ωi).
50 The resulting decision boundaries are again hyperplanes. If Ri and Rj are contiguous, the boundary between them has the equation w^t (x − x0) = 0, where w = Σ^{−1} (μi − μj).
51 Case 2: Σi = Σ (continued). The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! If the priors are equal, x0 is halfway between the means.
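A sketch (not from the slides) for the shared-covariance case, with gi(x) = wi^t x + wi0, wi = Σ^{−1} μi, and wi0 = −½ μi^t Σ^{−1} μi + ln P(ωi). The means, covariance, and priors are illustrative.

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                      # shared covariance
means = np.array([[0.0, 0.0], [2.0, 2.0]])          # mu_1, mu_2
priors = np.array([0.5, 0.5])

Sigma_inv = np.linalg.inv(Sigma)
W = means @ Sigma_inv                               # rows are w_i^t = mu_i^t Sigma^-1
w0 = -0.5 * np.einsum("ij,jk,ik->i", means, Sigma_inv, means) + np.log(priors)

def classify(x):
    g = W @ np.asarray(x, float) + w0               # g_i(x) = w_i^t x + w_i0
    return int(np.argmax(g)), g

print(classify([1.0, 0.5]))
```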
55 Case 3: Σi arbitrary. The covariance matrices are different for each category, so the discriminant functions are quadratic: gi(x) = x^t Wi x + wi^t x + wi0, with Wi = −½ Σi^{−1}, wi = Σi^{−1} μi, and wi0 = −½ μi^t Σi^{−1} μi − ½ ln |Σi| + ln P(ωi). The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids.
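A sketch (not from the slides) of this general case, evaluating the quadratic discriminant gi(x) = −½ (x − μi)^t Σi^{−1} (x − μi) − ½ ln |Σi| + ln P(ωi) (the −(d/2) ln 2π term is dropped since it is the same for all classes). The parameters are illustrative.

```python
import numpy as np

classes = [
    # (mean, covariance, prior) for each class -- illustrative numbers
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[3.0, 1.0], [1.0, 2.0]]), 0.5),
]

def quadratic_discriminants(x):
    x = np.asarray(x, float)
    g = []
    for mu, Sigma, prior in classes:
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^t Sigma^-1 (x-mu)
        _, logdet = np.linalg.slogdet(Sigma)            # ln |Sigma|, numerically stable
        g.append(-0.5 * quad - 0.5 * logdet + np.log(prior))
    return np.array(g)

g = quadratic_discriminants([1.0, 1.5])
print(g, "-> decide class", int(np.argmax(g)))
```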
63 Bayes Decision Theory – Discrete Features. The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, …, vm. We are concerned with probabilities rather than probability densities, so Bayes formula becomes P(ωj | x) = P(x | ωj) P(ωj) / P(x), with P(x) = Σ_j P(x | ωj) P(ωj).
64 The conditional risk is defined as before: R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x). The approach is still to minimize risk: select the action αi for which R(αi | x) is minimum.
65 Case of independent binary features in a two-category problem. Let x = [x1, x2, …, xd]^t, where each xi is either 0 or 1, with probabilities pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2).
66 Assuming conditional independence, P(x | ωi) can be written as a product of component probabilities: P(x | ω1) = Π_{i=1}^{d} pi^{xi} (1 − pi)^{1−xi} and P(x | ω2) = Π_{i=1}^{d} qi^{xi} (1 − qi)^{1−xi}.
67 Taking the likelihood ratio P(x | ω1) / P(x | ω2) = Π_{i=1}^{d} (pi/qi)^{xi} ((1 − pi)/(1 − qi))^{1−xi} and taking logarithms yields a linear discriminant g(x) = Σ_{i=1}^{d} wi xi + w0, with wi = ln[ pi(1 − qi) / (qi(1 − pi)) ] and w0 = Σ_{i=1}^{d} ln[ (1 − pi)/(1 − qi) ] + ln[ P(ω1)/P(ω2) ]; decide ω1 if g(x) > 0.
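A sketch (not from the slides) of this linear discriminant for independent binary features; the pi, qi, and priors below are illustrative.

```python
import numpy as np

p = np.array([0.8, 0.6, 0.3])    # p_i = P(x_i = 1 | w1), illustrative
q = np.array([0.2, 0.5, 0.7])    # q_i = P(x_i = 1 | w2), illustrative
P1, P2 = 0.5, 0.5                # priors P(w1), P(w2)

w = np.log(p * (1 - q) / (q * (1 - p)))                    # weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)   # bias term

def decide(x):
    """x is a 0/1 feature vector; decide w1 if g(x) = w.x + w0 > 0."""
    g = float(np.dot(w, np.asarray(x)) + w0)
    return ("w1" if g > 0 else "w2"), g

print(decide([1, 1, 0]))   # features that favor w1
print(decide([0, 0, 1]))   # features that favor w2
```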
71 Belief network structure. A node represents a discrete variable; parent and child nodes indicate direct influence; each node carries a conditional probability table, set by an expert or learned from a training set (learning is not discussed here).
73–76 Examples of belief networks; evidence e (figure slides).
77 Ex. 4. Belief Network for Fish. Nodes: A (season) and B (locale) are parents of X (fish type); X is the parent of C (lightness) and D (thickness). The network needs P(a), P(b), P(x | a, b), P(c | x), and P(d | x).
P(a): a1 = winter 0.25, a2 = spring 0.25, a3 = summer 0.25, a4 = autumn 0.25
P(b): b1 = north Atlantic 0.6, b2 = south Atlantic 0.4
X: x1 = salmon, x2 = sea bass
P(x | a, b), entries give (P(x1), P(x2)):
  a1,b1: 0.5, 0.5   a1,b2: 0.7, 0.3
  a2,b1: 0.6, 0.4   a2,b2: 0.8, 0.2
  a3,b1: 0.4, 0.6   a3,b2: 0.1, 0.9
  a4,b1: 0.2, 0.8   a4,b2: 0.3, 0.7
P(c | x), c1 = light, c2 = medium, c3 = dark: x1: 0.6, 0.2, 0.2; x2: 0.2, 0.3, 0.5
P(d | x), d1 = wide, d2 = thin: x1: 0.3, 0.7; x2: 0.6, 0.4
78 Belief Network for Fish (example query). A fish was caught in the summer in the north Atlantic, and it is a sea bass that is dark and thin: P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2 | a3, b1) P(c3 | x2) P(d2 | x2) = 0.25 × 0.6 × 0.6 × 0.5 × 0.4 = 0.018.
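A small sketch (not from the slides) that recomputes this joint probability directly from the conditional probability tables on the previous slide:

```python
# Tables from the "Belief Network for Fish" slide.
P_a = {"a1": 0.25, "a2": 0.25, "a3": 0.25, "a4": 0.25}          # season
P_b = {"b1": 0.6, "b2": 0.4}                                     # locale
P_x_ab = {                                                       # P(x | a, b) as (x1, x2)
    ("a1", "b1"): (0.5, 0.5), ("a1", "b2"): (0.7, 0.3),
    ("a2", "b1"): (0.6, 0.4), ("a2", "b2"): (0.8, 0.2),
    ("a3", "b1"): (0.4, 0.6), ("a3", "b2"): (0.1, 0.9),
    ("a4", "b1"): (0.2, 0.8), ("a4", "b2"): (0.3, 0.7),
}
P_c_x = {"x1": (0.6, 0.2, 0.2), "x2": (0.2, 0.3, 0.5)}           # (c1, c2, c3)
P_d_x = {"x1": (0.3, 0.7), "x2": (0.6, 0.4)}                     # (d1, d2)

# P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2|a3,b1) P(c3|x2) P(d2|x2)
joint = (P_a["a3"] * P_b["b1"]
         * P_x_ab[("a3", "b1")][1]    # x2 = sea bass
         * P_c_x["x2"][2]             # c3 = dark
         * P_d_x["x2"][1])            # d2 = thin
print(joint)                          # 0.018
```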
82 Medical application: medical diagnosis with a belief network. Uppermost nodes: biological agents (virus or bacteria); intermediate nodes: diseases (flu or emphysema); lowermost nodes: symptoms (high temperature or coughing). By entering measured values as evidence, the network finds the most likely disease or cause.