PCA_2022-In_and_out.pptx

JuanManuelNasralaAlv1 20 views 58 slides May 26, 2024


Slide Content

Principal Component Analysis (PCA) – In vs. Out of Model
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
February 23, 2022

“A database is not a vault, it is a garden.” Rene Marie Pacella

Scores provide a MAP of samples.
Loadings provide a MAP of variable interrelationships.
Scores cannot be interpreted without loadings; loadings cannot be interpreted without scores.
Esbensen KH (2012) Multivariate Data Analysis – in practice. 5th edn. CAMO Software, page 44.

“Loadings describe the data structure in terms of variable contributions and correlations. Every variable analyzed has a loading on each PC, which reflects how much the individual variable contributes to that PC, and how well the PC takes into account the variation contained in a variable.”
[Figure panels: Loadings Score Plots; Loadings Line Plots]

Loadings – map of variables
“The PCs are in fact nothing but linear combinations of the original variables (unit vectors). Returning to a previous definition, the loadings provide a “weighting” of each variable’s contribution to a PC direction. When a weighting is high for a variable, i.e. the variable has a high loading (numerically), this variable contributes significantly to the variance expressed by that PC.”
$X = TP^{T} + E$
$TP^{T}$ describes the model. $E$ is not the model; it is the residual variation that is not included in the model.
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018, page 78.

Principal Components Analysis (PCA)
If the spectral noise is known to be 0.1%, how many principal components should be used to explain the data?
Here we have the opportunity to filter or reduce the spectral noise, since we do not have to keep all eight factors.

Principal Component Analysis
$X = TP^{T} + E$
$TP^{T}$ describes the model. $E$ is not the model; it is the residual variation that is not included in the model.
This is a description of PCA, but more importantly it follows that the data structure is correlated with the property of interest, while noise is everything else, such as instrumental noise (high-frequency noise, low-frequency noise, etc.). Noise is what is not included within the model (data structure); it is the unexplained variation.
Esbensen KH (2012) Multivariate Data Analysis – in practice. 5th edn. CAMO Software.
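
The decomposition $X = TP^{T} + E$ can be sketched numerically. The example below is a minimal illustration, not the course software: it uses NumPy's SVD on synthetic, mean-centered data (the data, the random seed, and the choice of two components are all assumptions made for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 samples x 50 variables built from 2 underlying
# sources of variation plus a small amount of random noise.
sources = rng.normal(size=(20, 2))
profiles = rng.normal(size=(2, 50))
X = sources @ profiles + 0.01 * rng.normal(size=(20, 50))

# Mean-center each variable (column) before PCA.
x_mean = X.mean(axis=0)
Xc = X - x_mean

# SVD of the centered data: Xc = U S Vt.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_pc = 2                        # number of components kept in the model
T = U[:, :n_pc] * S[:n_pc]      # scores   (samples x components)
P = Vt[:n_pc].T                 # loadings (variables x components)

model = T @ P.T                 # structured part, T P^T
E = Xc - model                  # residuals: variation left out of the model

residual_fraction = (E**2).sum() / (Xc**2).sum()
print("fraction of variance left in E:", residual_fraction)
```

Because the synthetic data contain only two real sources of variation, almost all of the variance ends up in $TP^{T}$ and only the added noise remains in $E$.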

95% Confidence Interval
The distance from the multivariate mean, commonly referred to as the Mahalanobis distance (MD), is computed with the scores. In Hotelling’s $T^2$ test, the squared MD values are called Hotelling’s $T^2$ values, which are compared with a table of critical values.
De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D. L., The Mahalanobis distance. Chemometrics Intellig. Lab. Syst. 2000, 50 (1), 1-18.

Understanding Hotelling’s $T^2$ Test
In univariate statistics the t-test is used to estimate the confidence interval for one variable. Hotelling’s $T^2$ test is the corresponding estimation of a confidence interval for a multivariate system.
Confidence limits are built using the training set, which should contain measurements that represent the normal (in-control) situation.
The computed MD (the so-called $T^2$ value) of each sample is compared to a critical $T^2$ value.
It is used to build multivariate process control charts using the original variables or PCs.
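
As a rough sketch of how a $T^2$ value can be computed from PCA scores, the example below scales each squared score by the variance of that component and sums over components, then confirms that this equals the squared Mahalanobis distance in score space. The data are synthetic and the two-component model is an assumption; the critical $T^2$ value, which comes from an F distribution, is deliberately not computed here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration data: 30 samples x 12 variables.
X = rng.normal(size=(30, 12))
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_pc = 2
T = U[:, :n_pc] * S[:n_pc]      # calibration scores (zero mean per component)

n_samples = T.shape[0]
score_var = (T**2).sum(axis=0) / (n_samples - 1)

# T^2 per sample: squared scores scaled by the variance of each component.
t2 = (T**2 / score_var).sum(axis=1)

# The same quantity written as a squared Mahalanobis distance in score space.
cov_inv = np.linalg.inv(np.cov(T, rowvar=False))
md2 = np.einsum("ij,jk,ik->i", T, cov_inv, T)

print("largest T^2 in the calibration set:", t2.max())
```

The two calculations agree because PCA scores are uncorrelated, so the covariance matrix of the scores is diagonal.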

Euclidean & Mahalanobis Distance
The Euclidean distance would tell us the distance traveled. However, if there is bad weather or there are many airplanes trying to land, the distance traveled would be greater. The distance traveled would not be the same for every flight.
In that case we could calculate a Mahalanobis distance based on the standard deviations of the actual travel distances. The Mahalanobis distance is thus a statistical distance.
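
A small numerical sketch of the difference: two points at the same Euclidean distance from the mean can have very different Mahalanobis distances when the variables have different spreads. The covariance matrix and the two points are made-up values chosen only for illustration.

```python
import numpy as np

mean = np.array([0.0, 0.0])
# Hypothetical covariance: variable 1 has std 2.0, variable 2 has std 0.5.
cov = np.array([[4.0, 0.0],
                [0.0, 0.25]])
cov_inv = np.linalg.inv(cov)

def euclidean(x):
    """Plain geometric distance from the mean."""
    return float(np.sqrt(((x - mean) ** 2).sum()))

def mahalanobis(x):
    """Distance from the mean measured in units of the data's spread."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

a = np.array([2.0, 0.0])   # along the high-variance direction
b = np.array([0.0, 2.0])   # along the low-variance direction

print(euclidean(a), euclidean(b))       # same geometric distance
print(mahalanobis(a), mahalanobis(b))   # b is statistically much farther
```

Point `a` is one standard deviation from the mean, while point `b` is four standard deviations away, even though both are two units away geometrically.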

Projection of Samples Into Model
Once the PCA model is developed you can project samples into this model.
“Projection is the method of taking a new object and “transforming” it through the loadings of an existing model to produce a new score value in multivariate space.” Esbensen, page 151.
“A new object can be projected onto an existing model and the position of the new object can be determined as being part of the model population, or as being distinctly different.” Esbensen, page 151.
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018.
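
Projection through the loadings can be sketched as centering the new sample with the calibration mean and multiplying by the loadings matrix. The helper names `fit_pca` and `project` are hypothetical, and the data are random numbers used only to show the mechanics; this is not the procedure of any particular software package.

```python
import numpy as np

def fit_pca(X, n_pc):
    """Return the column means, loadings P, and calibration scores T."""
    x_mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    return x_mean, Vt[:n_pc].T, U[:, :n_pc] * S[:n_pc]

def project(x_new, x_mean, P):
    """Project new sample(s) through the loadings of an existing model."""
    return (np.asarray(x_new) - x_mean) @ P

rng = np.random.default_rng(4)
X_cal = rng.normal(size=(12, 30))          # hypothetical calibration set
x_mean, P, T_cal = fit_pca(X_cal, n_pc=2)

# Projecting a calibration sample returns its own calibration score.
t_check = project(X_cal[3], x_mean, P)

# A new, unseen sample gets a score in the same coordinate system.
x_unknown = rng.normal(size=30)
t_unknown = project(x_unknown, x_mean, P)
print("score of the projected sample:", t_unknown)
```

The model itself (mean and loadings) is not refitted when a sample is projected; only the new score is calculated.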

PCA Score Plot
The calibration set (red points, 0–14.46% w/w) was used to predict blends of 20% w/w with a different API (blue points) and excipients, acquired under the same spectral acquisition parameters. These blends fall outside the 95% CI of the model.

What are Outliers?
Outliers are unusual samples that have a different pattern than the rest of the samples, and could require an extra factor to describe them. – Pirouette
“Any observation that does not fit a pattern.” – Miller
Outliers may be measurement errors, for example when there is no sample in front of the Raman or NIR probe and it just sees air.
“Some packages include elaborate diagnostics for so-called outliers, which may in many cases be perfectly good samples but ones whose correlation structure differs from that of the training set.” – Brereton
Pirouette Manual, page 5-23.
Chemometrics. Data Analysis for the Laboratory and Chemical Plant. Richard G. Brereton, Wiley, 2003, page 72.
Miller, Chemometrics in Process Analytical Chemistry.

PCA and outliers
PCA is a good tool for visualizing outliers in the X data: spectra or samples that do not follow the same pattern as the rest.
If a sample’s score is separated from the others, it may be an outlier.
We may not know the reason for the outlier, but pinpointing it is a start. Later on we can determine a physical or chemical reason for the outlier.
The score and error contributions indicate which variables may cause the sample to be different.
The PCA model can be used as a benchmark for future samples.
Pirouette Manual, pages 5-14 and 5-19. Infometrix.

In Model vs. Out of Model

In-Model vs. Out of Model
Hotelling’s $T^2$ test indicates whether a sample is an outlier based on scores (variation captured by the model).
The X-residuals provide an “out of model” assessment of outliers. The Q statistic* flags samples with unusually large residuals.
These are two complementary views, often used together in real-time process measurements.
* The Q statistic is also referred to as the squared prediction error (squared residual variance). The critical value of Q can be obtained from its approximate distribution. Pirouette manual, page 5-24.
Pirouette Manual 4.02, Infometrix, page 5-25.

Real Time Latent Variable Predictor
Analytical methods are not applicable to all materials; they are applicable to a certain formulation or product. A first test with PCA determines the applicability of the method.

Process Control Application – using the Calibration Model

Control hardware and software integration
Step 1 – Design
Step 2 – Method for RT analysis
Step 3 – Sensors integrated into plant
Step 4 – Signal to control platform
Computers and Chemical Engineering 66 (2014) 186–200

Industrial Application
Software such as SIMCA, Unscrambler, Pirouette, PLS Tool Box, PLS IQ, etc. is used to develop the calibration model.
Separate software is then used only to apply the calibration model and provide the information to a distributed control system or plant manufacturing data system.
A number of companies sell software for real-time predictions, that is, software that enables the use of the regression equation in the production environment.
This is a significant opportunity for quality improvement and process knowledge.

Principal Component Analysis (PCA) – In vs. Out of Model – Importance of Residuals
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
March 4, 2022

Projection of Samples Into Model
Use an existing model to predict unknown samples. Once the PCA model is developed you can project samples into this model.
“Projection is the method of taking a new object and “transforming” it through the loadings of an existing model to produce a new score value in multivariate space.” Esbensen, page 151.
“A new object can be projected onto an existing model and the position of the new object can be determined as being part of the model population, or as being distinctly different.” Esbensen, page 151, section 6.7.
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018.

Projection to 2-Factor PCA Model (in terms of Mahalanobis distance, but out in terms of residuals)

X-Residuals after 2 Factors and Projection to PCA Model
Plot of variation that is not included in the model.

X-Residuals after 3 Factors and Projection to PCA Model
Plot of variation that is not included in the model.

Projection of Samples Into Model
You used an existing model (0–14.46% w/w samples) to predict unknown samples. Once the PCA model is developed you can project samples into this model.
“Projection is the method of taking a new object and “transforming” it through the loadings of an existing model to produce a new score value in multivariate space.” Esbensen, page 151.
“A new object can be projected onto an existing model and the position of the new object can be determined as being part of the model population, or as being distinctly different.” Esbensen, page 151, section 6.7.
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018.

Prediction of 2 samples
Two spectra were predicted as unknowns:
1. The 7.29% (w/w) spectrum that is already in the calibration model.
2. A noise spectrum added to the 7.29% (w/w) spectrum that is already in the calibration model.
They were predicted with the 0–14.46% (w/w) calibration model developed in the laboratory session. The Prediction command was used to predict with a stored model.

Projection of Unknown Samples into PCA Model – Scores 3D

Predict Scores 2D

Predict Scores 2D Zoomed

In and Out PCA Model – 1 Factor

In and Out PCA Model – 2 Factors

In and Out PCA Model – 3 Factors

In and Out PCA Model – 4 Factors

Sample Residual in PCA
$X = TP^{T} + E$
$TP^{T}$ describes the model. $E$ is not the model; it is the residual variation that is not included in the model.
In model (scores based): a sample could be far away from the center of the model described by $TP^{T}$ and exceed the 95% confidence interval.
Out of model (residuals based): a sample could also have an $E$ larger than the 95% CI of the model. The sample residual then exceeds the residual variation calculated with all the samples in the model. This is what happened to the 7.29% (w/w) sample + noise.
Esbensen KH (2012) Multivariate Data Analysis – in practice. 5th edn. CAMO Software, and equation 5-23 in the Pirouette manual.
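
The "in model" and "out of model" views can be illustrated with a small simulation, sketched here under several assumptions: the spectra are synthetic (one Gaussian band scaled by concentration plus a sloping baseline), the concentration levels other than 7.29 and 14.46 are invented, a two-component model is used, and no critical limits are computed. The point is only that adding noise to a calibration spectrum changes its residual (Q) far more than its $T^2$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "spectra": a band scaled by concentration plus a sloping baseline.
w = np.linspace(0.0, 1.0, 100)
band = np.exp(-((w - 0.5) / 0.1) ** 2)
conc = np.array([0.0, 1.8, 3.6, 5.4, 7.29, 9.0, 10.8, 12.6, 14.46])
slope = np.array([0.5, -1.0, 1.5, -0.5, 0.0, 1.0, -1.5, 0.5, -1.0])
X = (np.outer(conc, band) + np.outer(slope, w)
     + 0.001 * rng.normal(size=(conc.size, w.size)))

# Two-component PCA model of the calibration set.
x_mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
n_pc = 2
P = Vt[:n_pc].T
score_var = S[:n_pc] ** 2 / (X.shape[0] - 1)

def t2_and_q(x):
    """Return (T^2, Q) for one spectrum projected into the model."""
    xc = x - x_mean
    t = xc @ P                  # scores: the "in model" part
    e = xc - t @ P.T            # residual: the "out of model" part
    return float((t**2 / score_var).sum()), float((e**2).sum())

x_clean = X[4]                                    # the 7.29% (w/w) spectrum
x_noisy = x_clean + 0.05 * rng.normal(size=w.size)

t2_clean, q_clean = t2_and_q(x_clean)
t2_noisy, q_noisy = t2_and_q(x_noisy)
print("T^2 clean / noisy:", t2_clean, t2_noisy)
print("Q   clean / noisy:", q_clean, q_noisy)
```

In this sketch the noisy spectrum stays close to the clean one in score space, so a scores-based test alone would not flag it; the residual-based view is what separates the two.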

Sample Residual in PCA
If a particular sample residual exceeds the residual of the model, it could be considered an outlier: “it might not belong to the same population as the other samples in the training set.”
An F-test is used ($s_i$ is for the sample, $s$ for the variance of the model). A threshold is then calculated as shown in the plot.
Pirouette manual, Sample Residual topic (5.23 – 5.28).

Q-Statistic
This is also referred to as the squared prediction error. Q is related to the sample residual variance, but without normalization. It is also used to evaluate whether a sample might be an outlier.
The critical value of Q can be calculated from its approximate distribution.
Pirouette manual, Chapter 5, Q-Statistic, equation 5.29.

Probability
Consider a probability cutoff of 95%. The null hypothesis is that the variances are equal. If the probability value corresponding to the F value exceeds 95%, the hypothesis is rejected and it is inferred that the sample was not drawn from the same population.
As the sample’s probability approaches 1, the chance that it is an outlier increases.
Pirouette manual, Chapter 5, Probability (page 5-24).

Mahalanobis Distance in PCA
Calculated for each sample, as a measure of variability. This is the distance from the multivariate mean.
If a sample’s distance exceeds $MD_{crit}$, it may be an outlier.
Pirouette manual, Chapter 5, page 5-25.

Principal Component Analysis (PCA) – Preamble to Quantitative Methods
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
February 2022

PCA Scores Plot – The Transition from Qualitative to Quantitative
The first PC shows that concentration is the main source of variation in the data. This qualitative observation is a positive indication that a good quantitative model can be developed.
A.U. Vanarase, M. Alcalà, J.I. Jerez Rozo, F.J. Muzzio and R.J. Romañach, “Real-time monitoring of drug concentration in a continuous powder mixing process using NIR spectroscopy,” Chemical Engineering Science, 2010, 65(21), 5728–5733.

Introduction to Principal Component Analysis (PCA) – Discussion of Exercise
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
February 2022

Spectra and PCA Score Plot

Spectra, PCA Score Plot, & Loadings Line Plot Orthogonality

What is the objective?
If the goal is to determine differences in baseline, then method development is completed.
If the goal is to differentiate between samples of different concentrations, then more work is needed. Pretreatment of the data becomes necessary.

Spectra, PCA Score Plot, & Loadings Line Plot
Orthogonality
Comp No.   M2.R2X     M2.R2X
Comp[1]    0.827293   0.827293
Comp[2]    0.146939   0.146939

PCA – Orthogonality
Obtained for spectra of 0–15% (w/w). No pretreatment was used, only mean centering.
Notice that the first factor summarizes most of the variation, and the variation explained then decreases with each factor.
Each factor summarizes new variation that is not included in the previous factors.
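
The two properties described above (decreasing explained variance and orthogonal factors) can be checked numerically. The sketch below assumes synthetic, mean-centered data and keeps three components; the numbers it prints are not the R2X values from the slide.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data with a few underlying sources of variation.
X = rng.normal(size=(15, 4)) @ rng.normal(size=(4, 40))
Xc = X - X.mean(axis=0)          # mean centering only, no other pretreatment

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of the total variance explained by each component.
explained = S**2 / (S**2).sum()
cumulative = np.cumsum(explained)

n_pc = 3
P = Vt[:n_pc].T                  # loadings
T = U[:, :n_pc] * S[:n_pc]       # scores

loading_gram = P.T @ P           # identity matrix if loadings are orthonormal
score_gram = T.T @ T             # diagonal matrix if scores are orthogonal

print("explained variance per component:", explained[:n_pc])
print("cumulative explained variance:", cumulative[:n_pc])
```

The off-diagonal zeros in `score_gram` are the numerical expression of "each factor summarizes new variation, not included in previous factors".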

Principal Component Analysis (PCA) – Introduction to Outlier Diagnostics & Projections
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
September 3, 2019, Class no. 7

Foods Example 2018
Spectra have been uploaded and category variables were added.
View the spectra. Look at all spectra. Look at all flour spectra.
Perform the 2nd derivative with 7 points.
Perform the second derivative of the spectra with Savitzky-Golay, 19 points.
Select just one spectrum to view spectral differences. Select a limited number of spectra. Look at all flour spectra.
Obtain a shortened spectral range by eliminating some of the high-frequency region.
Perform PCA of flour with the 2nd derivative, 19 points, in the shortened spectral region (194-1063 variables).
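
For readers without the course software, a Savitzky-Golay second derivative can be sketched with plain NumPy. The slides give only the window sizes (7 and 19 points); the second-order polynomial used below is an assumption, and the function names are hypothetical. The filter fits a polynomial by least squares inside each window and reports its derivative at the window center.

```python
import numpy as np
from math import factorial

def savgol_coeffs(window, polyorder, deriv):
    """Savitzky-Golay weights for the given derivative at the window center."""
    half = window // 2
    k = np.arange(-half, half + 1, dtype=float)
    A = np.vander(k, polyorder + 1, increasing=True)   # columns k^0, k^1, ...
    # Row `deriv` of the pseudo-inverse gives the least-squares estimate of
    # the polynomial coefficient a_deriv; the derivative is deriv! * a_deriv.
    return factorial(deriv) * np.linalg.pinv(A)[deriv]

def savgol_derivative(y, window=7, polyorder=2, deriv=2, spacing=1.0):
    """Apply the filter; only points with a full window are returned."""
    c = savgol_coeffs(window, polyorder, deriv)
    # np.convolve flips its second argument, so reverse c to correlate.
    return np.convolve(y, c[::-1], mode="valid") / spacing**deriv

# Check on a curve with a known second derivative: y = x^2 gives 2 everywhere.
x = np.arange(20, dtype=float)
d2_7pt = savgol_derivative(x**2, window=7)
d2_19pt = savgol_derivative(x**2, window=19)
print(d2_7pt)
```

A wider window (19 points instead of 7) smooths more strongly, which reduces noise in the derivative spectrum at the cost of broadening narrow features.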

Foods Example 2018 – 2
View the scores plot and Hotelling’s limits, the loadings plot (line and score plot), the influence plot, and the explained variance.
Examine sample no. 2 more closely.
Review In and Out for the flour model: residual vs. Hotelling.
Predict samples 6-10 of flour as if they were unknowns. These should be projected within the 95% confidence interval.
Perform a PCA of the pan criollo samples.
Project whole wheat into pan criollo.

Vector Multiplication
For the product of two matrices to be defined, the number of columns of the first matrix must equal the number of rows of the second matrix.
Show scores and loadings in software.

Vector Multiplication
Also called the inner product, dot product, or scalar product, seeing the vector as a one-column or one-row matrix.
The length of a vector is $\|x\| = \sqrt{x \cdot x}$.
The inner product is obtained with vectors that have the same dimensions:
x = [200 300 100 360]
y = [380 580 420 840]
x∙y = 200∙380 + 300∙580 + 100∙420 + 360∙840
x∙y = 594400
Matrix multiplication requires that the number of columns of the first matrix equals the number of rows of the second: (m × n)(n × p) = (m × p).
Chapter 9 Vectors and matrices. In Data Handling in Science and Technology, Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J., Eds. Elsevier: 1998; Vol. 20, pp 231-261.
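
The inner product from the slide can be reproduced in a few lines of plain Python. The vectors are the ones given above; the length calculation is added only as an illustration of the square root of a vector's inner product with itself.

```python
import math

x = [200, 300, 100, 360]
y = [380, 580, 420, 840]

# Inner (dot) product: multiply matching elements and sum the products.
dot_xy = sum(a * b for a, b in zip(x, y))
print(dot_xy)  # 594400

# Length of x: square root of the inner product of x with itself.
length_x = math.sqrt(sum(a * a for a in x))
print(length_x)
```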

Vector Multiplication
The second way of writing the product explains why it is called the dot product.
Matrix multiplication requires that the number of columns of the first matrix equals the number of rows of the second: (m × n)(n × p) = (m × p).
Matrix multiplication is done by summing the products of the ith element of a row with the ith element of a column.
Chapter 9 Vectors and matrices. In Data Handling in Science and Technology, Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J., Eds. Elsevier: 1998; Vol. 20, pp 231-261.
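
The row-by-column rule and the dimension requirement can be sketched as a short plain-Python function. The function name `matmul` and the small matrices are made up for the example; in practice a library routine would be used.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix, given as nested lists."""
    m, n = len(A), len(A[0])
    n_rows_b, p = len(B), len(B[0])
    if n != n_rows_b:
        raise ValueError("columns of first matrix must equal rows of second")
    # Element (i, j) is the dot product of row i of A with column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]           # 3 x 2

C = matmul(A, B)         # (2 x 3)(3 x 2) = (2 x 2)
print(C)                 # [[58, 64], [139, 154]]
```

The same rule is what makes $TP^{T}$ well defined: scores of size (samples × components) multiply transposed loadings of size (components × variables).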