PRML Chapter 3


About This Presentation

Pattern Recognition and Machine Learning


Slide Content

Chapter 3
Reviewer: Sunwoo Kim
Christopher M. Bishop, Pattern Recognition and Machine Learning
Yonsei University, Department of Applied Statistics

Chapter 3.1. Basic Linear Regression
Common linear regression case: y(x, w) = w0 + w1 x1 + … + wD xD. Extending to basis functions: y(x, w) = w^T φ(x), i.e. a linear combination of fixed, possibly nonlinear basis functions φ_j(x). Notable fact: there exists a relationship that…
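As a small sketch of the design-matrix idea (the polynomial basis and the helper name are mine, not from the slides), the columns of Φ are the basis functions evaluated on the inputs:

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Design matrix Phi whose columns are phi_j(x) = x**j for j = 0..degree."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(0.0, 1.0, 10)
Phi = polynomial_design_matrix(x, degree=3)
print(Phi.shape)   # (10, 4): one column per basis function, including the bias phi_0(x) = 1
```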

Chapter 3.1. Basic Linear Regression
We may place a normal distribution around the fitted line, i.e. assume Gaussian noise about the linear mean function. The optimisation problem can then be treated as a maximum-likelihood (MLE) task, whose solution is the familiar least-squares estimate. The derivation was covered in undergraduate regression analysis.
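A hedged sketch of that MLE view on toy sine data (data and names are illustrative): ordinary least squares gives the ML weights, and the mean squared residual gives 1/β.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # noisy toy targets

Phi = np.vander(x, N=4, increasing=True)           # cubic polynomial basis
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # least squares == Gaussian MLE for w
beta_ml = len(t) / np.sum((t - Phi @ w_ml) ** 2)   # MLE noise precision: 1/beta = mean squared residual
print(w_ml, beta_ml)
```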

Chapter 3.1. Basic Linear Regression
Understanding from a geometrical perspective. By definition, the projection matrix H = A(A^T A)^{-1} A^T maps a vector b to Hb, its projection onto the column space of A. With the design matrix X in place of A, our estimated value is y = Ht, i.e. the target t projected onto the column space of X. (In the slide's figure, the green vector t is the target value and the blue vector y is the estimated value.) Linear regression also admits a sequential update of the weights, one observation at a time: a familiar form, just like gradient descent.
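A small illustrative sketch of both views (toy data and names are mine): the hat matrix projects t onto the column space of the design matrix, and a sequential LMS-style update approaches the same fit one sample at a time.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, N=4, increasing=True)

# Hat (projection) matrix: y = H t is t projected onto the column space of Phi.
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
y = H @ t

# Sequential (LMS) update: w <- w + eta * (t_n - w . phi_n) * phi_n, one sample at a time.
# With a small step size and enough passes, w approaches the batch least-squares solution.
w, eta = np.zeros(Phi.shape[1]), 0.1
for _ in range(500):
    for phi_n, t_n in zip(Phi, t):
        w += eta * (t_n - w @ phi_n) * phi_n
```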

Chapter 3.1. Basic Linear Regression
Regularization: preventing overfitting, also called weight decay. The most common choice is L2 regularization, which adds a quadratic penalty on the weights to the sum-of-squares loss. (The slide's figures contrast the minimum of the loss without the penalty and with the penalty, for the quadratic and lasso cases.) L1 regularization (the lasso) tends to shrink some coefficients all the way to zero, which gives a sparse solution, but the penalty has no usable first- and second-order derivatives at zero, so there is no closed-form solution and the lasso is fitted by numerical optimization.
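For the L2 case the penalised solution is still closed-form; a minimal sketch (helper name is mine):

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form minimiser of ||t - Phi w||^2 + lam * ||w||^2 (L2 penalty / weight decay)."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Larger lam shrinks the weights towards zero; lam = 0 recovers ordinary least squares.
```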

Chapter 3.1. Basic Linear Regression
Multiple outputs. This is a very interesting part: if we fit a multiple-output linear regression, how do we estimate the values? E.g. with the same inputs we predict house price and house year at the same time. The result shows that even when predicting multiple outputs we use the same design matrix and only change the target matrix T. Geometrically, each column of T is projected onto the column space of the design matrix, so we get the same answer as fitting the outputs separately, since the columns of T are modelled independently.
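A quick check of this decoupling on random data (names and data are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
Phi = rng.normal(size=(50, 4))      # shared design matrix
T = rng.normal(size=(50, 2))        # two target columns, e.g. price and year

W_joint, *_ = np.linalg.lstsq(Phi, T, rcond=None)   # all outputs in one solve
W_cols = np.column_stack(
    [np.linalg.lstsq(Phi, T[:, k], rcond=None)[0] for k in range(T.shape[1])]
)
print(np.allclose(W_joint, W_cols))   # True: each column of T is projected independently
```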

Chapter 3.3. Bayesian linear regression
Prior and posterior of the regression weights. Now we place a probability distribution on the weights (parameters). Let's take the simple conjugate prior for the Gaussian likelihood: we assume the parameter w follows a normal distribution, and to keep the whole process as simple as possible we use the simpler zero-mean isotropic prior. As in the univariate conjugate case for the normal distribution (Normal likelihood / Normal prior / Normal posterior), the posterior mean is a weighted average: one part is the weighted prior mean and the other is the weighted MLE mean.
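A minimal sketch of the posterior update under the zero-mean isotropic prior (the function name is mine): S_N^{-1} = αI + βΦ^TΦ and m_N = βS_NΦ^T t.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Gaussian posterior N(w | m_N, S_N) for the prior N(w | 0, alpha^{-1} I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```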

Chapter 3.3. Bayesian linear regression
Intrinsic regularization of Bayesian regression. We know that likelihood × prior is proportional to the posterior, so let's reconsider the posterior from this point of view (with α and β treated as fixed nuisance parameters). Maximising the log posterior is then the same as minimising a penalised sum-of-squares error: even though we did not set out to include regularization, the prior itself acts as a regularizer. The figure shows the sequential updating of the posterior/prior; the variance of the distribution shrinks gradually as observations arrive.
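A small sketch of the sequential updating the figure describes, on toy straight-line data (names and data are mine): the posterior after each point becomes the prior for the next, and the final result matches the batch posterior.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=20)
t = -0.3 + 0.5 * x + rng.normal(scale=0.2, size=x.shape)   # toy straight-line data
Phi = np.column_stack([np.ones_like(x), x])
alpha, beta = 2.0, 25.0

# Absorb one observation at a time: the posterior after n points is the prior for point n + 1.
m, S_inv = np.zeros(2), alpha * np.eye(2)
for phi_n, t_n in zip(Phi, t):
    S_inv_new = S_inv + beta * np.outer(phi_n, phi_n)
    m = np.linalg.solve(S_inv_new, S_inv @ m + beta * t_n * phi_n)
    S_inv = S_inv_new
# m and inv(S_inv) now match the batch posterior computed from all 20 points at once.
```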

Chapter 3.3. Bayesian linear regression
Predictive distribution of Bayesian linear regression. To get a predicted value we do not need to report the parameter distribution itself; we only need summaries of it, such as the Bayes estimator, because the predictive distribution integrates the likelihood over the posterior of w. The derivation of the corresponding equation will be covered in Chapter 8. The important thing is that as N → ∞ the posterior-variance term converges to zero and only the noise-variance term 1/β is left. (Figure: fitted line and samples generated from the predictive distribution.)
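A hedged sketch of the predictive distribution for a single new input, assuming m_N and S_N from the posterior above (the function name is mine):

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive N(t | m_N . phi(x), 1/beta + phi(x)^T S_N phi(x)) for one new input."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # noise term + posterior-uncertainty term
    return mean, var
# As N grows, phi(x)^T S_N phi(x) -> 0 and only the noise variance 1/beta remains.
```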

Chapter 3.3. Bayesian linear regression
Predictive distribution of Bayesian linear regression. Since we have so far studied all of linear regression from the frequentist perspective, this process can feel tricky. So let's implement the whole procedure in Python; a sketch follows below.
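Below is a minimal end-to-end sketch, assuming Gaussian basis functions and fixed α, β (all names and the toy data are mine, not from the reviewer's own code):

```python
import numpy as np

rng = np.random.default_rng(4)
N, alpha, beta = 25, 2.0, 25.0
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=1.0 / np.sqrt(beta), size=N)

def gauss_basis(x, centres, s=0.1):
    """Gaussian basis functions centred on a fixed grid."""
    return np.exp(-0.5 * ((x[:, None] - centres[None, :]) / s) ** 2)

centres = np.linspace(0.0, 1.0, 9)
Phi = gauss_basis(x, centres)

# Posterior over w, then the predictive mean and standard deviation on a test grid.
S_N = np.linalg.inv(alpha * np.eye(len(centres)) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(0.0, 1.0, 100)
Phi_new = gauss_basis(x_new, centres)
mean = Phi_new @ m_N
std = np.sqrt(1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new))
```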

Chapter 3.3. Bayesian linear regression
Equivalent kernel and its insight. Let's talk about the kernel. First, the predicted value of Bayesian regression can be written with the following equations, where the function k is called the smoother matrix or the equivalent kernel. What does this kernel indicate? It gives an important intuition about linear regression as a "weighted average of neighbours": the kernel acts as a similarity measure and is multiplied with the observed target values, so the estimate is a weighted mean of the observed targets, with higher weights given to training points whose inputs are more similar to the query input. The following equations yield similar intuitions.
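Continuing the sketch above (reusing Phi, Phi_new, S_N, m_N, beta and t defined there), the equivalent kernel and the weighted-average view can be checked directly:

```python
# Equivalent kernel: k(x, x_n) = beta * phi(x)^T S_N phi(x_n), so the predictive mean is
# a weighted average of the training targets, y(x) = sum_n k(x, x_n) t_n.
K = beta * Phi_new @ S_N @ Phi.T          # shape (len(x_new), N)
y_kernel = K @ t
print(np.allclose(y_kernel, Phi_new @ m_N))   # True: same prediction as m_N^T phi(x)
print(K.sum(axis=1)[:5])                      # the kernel weights sum to approximately one
```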

Chapter 3.4. Bayesian model comparison

Chapter 3.5. The evidence approximation
Fully Bayesian treatment. The real predictive distribution is given by the following marginalisation over the weights and the hyperparameters, and this integral is analytically intractable, so we take another approach: if the distribution p(α, β | t) is sharply peaked around α̂, β̂, we can replace the integral over α and β by plugging in the estimated values. That is:
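A reconstruction of the two equations the slide refers to (PRML Eqs. 3.74–3.75), written out in LaTeX:

```latex
% Fully Bayesian predictive distribution and its evidence approximation
\begin{align}
p(t \mid \mathbf{t})
  &= \iiint p(t \mid \mathbf{w}, \beta)\,
            p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,
            p(\alpha, \beta \mid \mathbf{t})\,
     \mathrm{d}\mathbf{w}\,\mathrm{d}\alpha\,\mathrm{d}\beta \\
  &\simeq \int p(t \mid \mathbf{w}, \hat{\beta})\,
               p(\mathbf{w} \mid \mathbf{t}, \hat{\alpha}, \hat{\beta})\,
          \mathrm{d}\mathbf{w}
   \quad \text{if } p(\alpha, \beta \mid \mathbf{t}) \text{ is sharply peaked at } (\hat{\alpha}, \hat{\beta}).
\end{align}
```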

Chapter 3.5. The evidence approximation
Evaluation of the evidence function. What we are trying to do is estimate the nuisance parameters α and β. The evidence can be recognised as the integral of likelihood × prior over the weights, and the overall equation can be rewritten using the quantities covered in the previous sections. Now, let's rewrite the evidence as follows. Why rewrite the equation? Because (1) we can perform the integral much more easily, (2) we can do model comparison, and (3) we can estimate the nuisance parameters.
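As a hedged sketch (the helper name is mine), the rewritten log evidence, ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − ½ ln|A| − (N/2) ln 2π, can be computed directly:

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear-Gaussian model (PRML Eq. 3.86)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi          # posterior precision matrix
    m_N = beta * np.linalg.solve(A, Phi.T @ t)          # posterior mean
    E_mN = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))
```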

Chapter 3.5. The evidence approximation
Re-writing the evidence function.

Chapter 3.5. The evidence approximation
Evidence function for model comparison. Which model is best for the data? The model that yields the highest evidence value. The difficult integration is computed easily with the rewritten equation.
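For example (a sketch reusing the log_evidence helper above; the data and the grid of polynomial degrees are illustrative), candidate models can be ranked by their log evidence:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Score each candidate model (here: polynomial degree) by its log evidence, keep the best.
scores = {}
for degree in range(1, 9):
    Phi = np.vander(x, N=degree + 1, increasing=True)
    scores[degree] = log_evidence(Phi, t, alpha=5e-3, beta=25.0)   # assumes helper above
best_degree = max(scores, key=scores.get)
print(best_degree, scores[best_degree])
```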

Chapter 3.5. The evidence approximation
Nuisance parameter estimation of α and β. Why does the rewritten form help here? Because the determinant is equal to the product of its eigenvalues (we covered this in multivariate analysis), which lets us differentiate the log-determinant term with respect to α and β. Here α governs the prior variance and β the likelihood (noise) variance (similar to …).
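A minimal sketch of the resulting re-estimation loop (the function name and defaults are mine; the updates follow PRML's γ, α, β equations, Eqs. 3.91, 3.92 and 3.95):

```python
import numpy as np

def evidence_approximation(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Iteratively re-estimate alpha and beta by maximising the evidence."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)               # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        lam = beta * eig0                                # eigenvalues of beta * Phi^T Phi
        gamma = np.sum(lam / (alpha + lam))              # effective number of parameters
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta
```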