JuanManuelNasralaAlv1
58 slides
May 26, 2024
Slide Content
Principal Component Analysis (PCA) – In vs. Out of Model
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
February 23, 2022
“a database is not a vault, it is a garden.” Rene Marie Pacella
Scores Provide a MAP of Samples
Loadings Provide a MAP of Variable Interrelationships
Scores cannot be interpreted without loadings; loadings cannot be interpreted without scores.
Esbensen, K. H. (2012) Multivariate Data Analysis – in practice. 5th edn. CAMO Software, page 44.
“Loadings describe the data structure in terms of variable contributions and correlations. Every variable analyzed has a loading on each PC, which reflects how much the individual variable contributes to that PC, and how well the PC takes into account the variation contained in a variable.”
[Figures: Loadings Score Plots; Loadings Line Plots]
Loadings
Loadings – map of variables.
“The PCs are in fact nothing but linear combinations of the original variables (unit vectors). Returning to a previous definition, the loadings provide a ‘weighting’ of each variable’s contribution to a PC direction. When a weighting is high for a variable, i.e. the variable has a high loading (numerically), this variable contributes significantly to the variance expressed by that PC.”
$X = TP^{T} + E$
$TP^{T}$ describes the model. $E$ is not the model; it is the residual variation that is not included in the model.
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018, page 78.
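The decomposition $X = TP^{T} + E$ can be checked numerically. The following is a minimal sketch, not the procedure used by any particular chemometrics package: it assumes mean centering, obtains the loadings from a singular value decomposition, and uses synthetic data. The function name `pca_decompose` and all data values are hypothetical.

```python
import numpy as np

def pca_decompose(X, n_factors):
    """Return mean, scores T, loadings P and residuals E of a mean-centered PCA."""
    mean = X.mean(axis=0)
    Xc = X - mean                                    # mean centering
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_factors].T                             # loadings: (variables, factors)
    T = Xc @ P                                       # scores: (samples, factors)
    E = Xc - T @ P.T                                 # residuals, not part of the model
    return mean, T, P, E

# Hypothetical data: 12 samples x 40 variables, two sources of variation plus noise
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 40)
X = (rng.uniform(0, 15, (12, 1)) * np.exp(-((axis - 0.5) / 0.1) ** 2)
     + rng.normal(0, 1, (12, 1)) * axis
     + rng.normal(0, 0.01, (12, 40)))

mean, T, P, E = pca_decompose(X, n_factors=2)

# The centered data equal the model part plus the residual part
reconstruction_ok = np.allclose(X - mean, T @ P.T + E)

# Variable with the numerically highest loading on the first PC
top_variable_pc1 = int(np.argmax(np.abs(P[:, 0])))
print(reconstruction_ok, T.shape, P.shape, E.shape, top_variable_pc1)
```

Each column of `P` holds the weights of the original variables for one PC, so the variable with the largest absolute entry in a column contributes most to that PC.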
Principal Components Analysis (PCA)
If the spectral noise is known to be 0.1%, how many principal components should be used to explain the data? Here we have the opportunity to filter or reduce the spectral noise, since we do not have to keep all eight factors.
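One way to reason about this question is sketched below. It assumes that the 0.1% noise level is expressed as a percentage of the total variance, so factors are kept until the unexplained variance drops to the noise level. That interpretation is an assumption, and the explained-variance values are hypothetical (the table from the slide is not reproduced here).

```python
import numpy as np

def factors_needed(explained_pct, noise_pct):
    """Smallest number of factors whose cumulative explained variance
    reaches 100 - noise_pct (both in percent of total variance)."""
    cumulative = np.cumsum(explained_pct)
    target = 100.0 - noise_pct
    for n, value in enumerate(cumulative, start=1):
        if value >= target:
            return n
    return len(cumulative)

# Hypothetical explained variance (%) for eight factors
explained_pct = [92.0, 6.5, 1.2, 0.25, 0.03, 0.01, 0.006, 0.004]

n_keep = factors_needed(explained_pct, noise_pct=0.1)
print(n_keep)  # factors beyond this one describe variation below the noise level
```

With these made-up numbers the first four factors are kept and the remaining four are treated as noise and left in the residuals.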
Principal Component Analysis
$X = TP^{T} + E$
$TP^{T}$ describes the model. $E$ is not the model; it is the residual variation that is not included in the model.
This is a description of PCA, but more importantly it follows that:
Data structure is correlated with the property of interest, while noise is everything else, such as instrumental noise (high-frequency noise, low-frequency noise, etc.).
Noise is what is not included within the model (data structure); it is the unexplained variation.
Esbensen, K. H. (2012) Multivariate Data Analysis – in practice. 5th edn. CAMO Software.
95% Confidence Interval
The distance from the multivariate mean, commonly referred to as the Mahalanobis distance (MD), is computed with the scores. In Hotelling’s $T^2$ test, the squared MD values are called Hotelling’s $T^2$ values, which are compared with a table of critical values.
De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D. L., The Mahalanobis distance. Chemometrics Intellig. Lab. Syst. 2000, 50 (1), 1-18.
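As an illustration of how $T^2$ values follow from the scores, the sketch below computes them on synthetic data and confirms that they equal the squared Mahalanobis distance in score space. The function name `hotelling_t2` and the data are hypothetical. The critical value is not computed here; it would be taken from a table or from the software.

```python
import numpy as np

def hotelling_t2(T_cal, T_eval=None):
    """Hotelling's T2 for each row of T_eval (default: the calibration scores),
    using the variance of each calibration score column as the scale."""
    T_eval = T_cal if T_eval is None else T_eval
    score_var = T_cal.var(axis=0, ddof=1)
    return np.sum(T_eval ** 2 / score_var, axis=1)

# Hypothetical calibration data: 20 samples x 6 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6)) * np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
T = Xc @ Vt[:2].T                       # scores of a 2-factor model

t2 = hotelling_t2(T)

# Same quantity written as a squared Mahalanobis distance from the score mean
cov = np.cov(T, rowvar=False)
md2 = np.einsum("ij,jk,ik->i", T, np.linalg.inv(cov), T)
print(np.allclose(t2, md2))

# A sample is flagged when its T2 exceeds a tabulated critical value:
# flagged = t2 > t2_crit   (t2_crit must be supplied; it is not computed here)
```

Because PCA scores are centered and uncorrelated, dividing each squared score by the variance of its PC gives the same result as the full covariance-based distance.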
Understanding Hotelling’s $T^2$ Test
In univariate statistics the t-test is used to estimate a confidence interval for one variable. Hotelling’s $T^2$ test is the corresponding estimation of a confidence interval for a multivariate system.
Confidence limits are built using the training set, which should contain measurements that represent the normal (in-control) situation.
The computed squared MD (the so-called $T^2$ value) of each sample is compared to a critical $T^2$ value.
The test is used to build multivariate process control charts using the original variables or PCs.
Euclidean & Mahalanobis Distance
The Euclidean distance would tell us the distance traveled. However, if there is bad weather or many airplanes are trying to land, the distance traveled would be greater; it would not be the same for every flight. In that case we could calculate a Mahalanobis distance based on the standard deviations of the actual travel distances. The Mahalanobis distance is thus a statistical distance.
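The difference between the two distances can be seen in a small numerical sketch. The covariance matrix and the two points below are hypothetical, chosen so that one variable varies much more than the other.

```python
import numpy as np

def euclidean(x, center):
    """Ordinary geometric distance between x and center."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(center)))

def mahalanobis(x, center, cov):
    """Distance from center measured in units of the data's own spread."""
    d = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

center = np.array([0.0, 0.0])
# Hypothetical covariance: variable 1 has standard deviation 10, variable 2 has 1
cov = np.array([[100.0, 0.0],
                [0.0, 1.0]])

a = np.array([10.0, 0.0])   # 10 units along the high-variance variable
b = np.array([0.0, 10.0])   # 10 units along the low-variance variable

print(euclidean(a, center), euclidean(b, center))
print(mahalanobis(a, center, cov), mahalanobis(b, center, cov))
```

Both points are equally far from the center in Euclidean terms, but `b` lies ten standard deviations away along a variable that normally changes little, so its Mahalanobis distance is ten times larger than that of `a`.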
Projection of Samples Into Model
Once the PCA model is developed you can project samples into this model.
“Projection is the method of taking a new object and ‘transforming’ it through the loadings of an existing model to produce a new score value in multivariate space.” (Esbensen, page 151)
“a new object can be projected onto an existing model and the position of the new object can be determined as being part of the model population, or as being distinctly different.” (Esbensen, page 151)
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018.
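Projection through the loadings amounts to centering the new spectrum with the calibration mean and multiplying by the stored loadings. The sketch below assumes a mean-centered PCA model; the function names `fit_pca` and `project` and the data are hypothetical.

```python
import numpy as np

def fit_pca(X, n_factors):
    """Fit a mean-centered PCA model and return (mean, loadings P)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_factors].T

def project(x_new, mean, P):
    """Project a new object into an existing model.
    Returns its scores and its residual (the part the model does not describe)."""
    xc = np.asarray(x_new, dtype=float) - mean   # center with the calibration mean
    t_new = xc @ P                               # new score values
    residual = xc - t_new @ P.T
    return t_new, residual

# Hypothetical calibration set: 15 samples x 8 variables
rng = np.random.default_rng(0)
X_cal = rng.normal(size=(15, 8))
mean, P = fit_pca(X_cal, n_factors=2)
T_cal = (X_cal - mean) @ P

# Projecting a calibration sample as if it were unknown returns its own scores
t_new, e_new = project(X_cal[3], mean, P)
print(np.allclose(t_new, T_cal[3]))
```

The model itself is not refitted: the mean and loadings stay fixed, and only the position of the new object in the score space and its residual are computed.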
PCA Score Plot
The calibration set (red points, 0-14.46% w/w) was used to predict blends of 20% w/w with a different API (blue) and excipients, acquired under the same spectral acquisition parameters. These blends fall outside the 95% CI of the model.
What are Outliers?
Outliers are unusual samples that have a different pattern than the rest of the samples, and could require an extra factor to describe them. (Pirouette)
“Any observation that does not fit a pattern” (Miller)
Outliers may be measurement errors, for example when there is no sample in front of the Raman or NIR probe and it just sees air.
“Some packages include elaborate diagnostics for so-called outliers, which may in many cases be perfectly good samples but ones whose correlation structure differs from that of the training set.” (Brereton)
Pirouette Manual, page 5-23.
Brereton, R. G., Chemometrics. Data Analysis for the Laboratory and Chemical Plant. Wiley, 2003, page 72.
Miller, Chemometrics in Process Analytical Chemistry.
PCA and outliers
PCA is a good tool for visualizing outliers in the X data, that is, spectra or samples that do not follow the same pattern as the rest.
If a sample’s score is separated from the others, it may be an outlier.
We may not know the reason for the outlier, but pinpointing it is a start. Later on we can determine a physical or chemical reason for the outlier.
The score and error contributions indicate which variables may cause the sample to be different.
The PCA model can be used as a benchmark for future samples.
Pirouette Manual, pages 5-14 and 5-19. Infometrix.
In Model vs. Out of Model
In-Model vs. Out of Model
Hotelling’s $T^2$ test indicates whether a sample is an outlier based on scores (variation captured by the model).
The X-residuals provide an “out of model” assessment of outliers. The Q statistic* flags samples with unusual residuals.
These are two complementary views, often used together in real-time process measurements.
Pirouette Manual 4.02, Infometrix, page 5-25.
* The Q statistic is also referred to as the squared prediction error (squared residual variance). The critical value of Q can be obtained from its approximate distribution. Pirouette manual, page 5-24.
Real Time Latent Variable Predictor
Analytical methods are not applicable to all materials; each is applicable to a certain formulation or product. A first test with PCA determines whether the method is applicable.
Process Control Application – Using the Calibration Model
Control hardware and software integration
Step 1 – Design
Step 2 – Method for RT analysis
Step 3 – Sensors integrated into plant
Step 4 – Signal to control platform
Computers and Chemical Engineering 66 (2014) 186–200
Industrial Application
Software such as SIMCA, Unscrambler, Pirouette, PLS Tool Box, PLS IQ, etc. is used to develop the calibration model.
A separate kind of software is used only to apply the calibration model and provide the information to a distributed control system or plant manufacturing data system.
A number of companies sell software for real-time predictions: software that enables the use of the regression equation in the production environment.
This is a significant opportunity for quality improvement and process knowledge.
Principal Component Analysis (PCA) – In vs. Out of Model – Importance of Residuals
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
March 4, 2022
Projection of Samples Into Model
Use an existing model to predict unknown samples. Once the PCA model is developed you can project samples into this model.
“Projection is the method of taking a new object and ‘transforming’ it through the loadings of an existing model to produce a new score value in multivariate space.” (Esbensen, page 151)
“a new object can be projected onto an existing model and the position of the new object can be determined as being part of the model population, or as being distinctly different.” (Esbensen, page 151, section 6.7)
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018.
Projection to 2-Factor PCA Model (in with respect to the Mahalanobis distance, but out in terms of residuals)
X-Residuals after 2 Factors and Projection to PCA Model
Plot of variation that is not included in the model.
X-Residuals after 3 Factors and Projection to PCA Model
Plot of variation that is not included in the model.
Projection of Samples Into Model
You used an existing model (0-14.46% w/w samples) to predict unknown samples. Once the PCA model is developed you can project samples into this model.
“Projection is the method of taking a new object and ‘transforming’ it through the loadings of an existing model to produce a new score value in multivariate space.” (Esbensen, page 151)
“a new object can be projected onto an existing model and the position of the new object can be determined as being part of the model population, or as being distinctly different.” (Esbensen, page 151, section 6.7)
Esbensen, K. H.; Swarbrick, B., Multivariate Data Analysis – in practice. An Introduction. 6th ed.; IMPublishing: 2018.
Prediction of 2 samples
Two spectra were predicted as unknowns:
1. The 7.29% (w/w) spectrum that is already in the calibration model.
2. A noise spectrum added to the 7.29% (w/w) spectrum that is already in the calibration model.
They were predicted with the 0-14.46% (w/w) calibration model developed in the laboratory session. The Prediction command was used to predict with a stored model.
Projection of Unknown Samples into PCA Model: Scores 3D
Predict Scores 2D
Predict Scores 2D Zoomed
In and Out PCA Model – 1 Factor
In and Out PCA Model – 2 Factors
In and Out PCA Model – 3 Factors
In and Out PCA Model – 4 Factors
Sample Residual in PCA
$X = TP^{T} + E$
$TP^{T}$ describes the model. $E$ is not the model; it is the residual variation that is not included in the model.
In model (scores based): a sample could have a large distance to the center of the model described by $TP^{T}$, that is, be far away from the center and exceed the 95% confidence interval.
Out of model (residuals based): a sample could also have an $E$ larger than the 95% CI of the model. The sample residual then exceeds the residual variation calculated with all the samples in the model. This is what happened to the 7.29% (w/w) sample + noise.
Esbensen, K. H. (2012) Multivariate Data Analysis – in practice. 5th edn. CAMO Software, and equation 5-23 in the Pirouette manual.
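The "sample + noise" situation can be imitated with synthetic data. The sketch below is only an illustration under stated assumptions: the spectra are simulated, the model has two factors, and the largest calibration residual is used as a simple reference instead of a 95% limit. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration set: 15 "spectra", 60 variables, two sources of
# variation (a peak that scales with concentration and a sloping baseline)
n_samples, n_vars = 15, 60
axis = np.linspace(0.0, 1.0, n_vars)
peak = np.exp(-((axis - 0.4) / 0.08) ** 2)
conc = np.linspace(0.0, 14.46, n_samples)[:, None]
baseline = rng.normal(0.0, 1.0, size=(n_samples, 1))
X = conc * peak + baseline * axis + rng.normal(0.0, 0.01, size=(n_samples, n_vars))

# Two-factor PCA model on mean-centered data
mean = X.mean(axis=0)
_, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[:2].T
score_var = s[:2] ** 2 / (n_samples - 1)

def in_and_out(x):
    """Return (T2, Q): the in-model distance and the out-of-model residual."""
    xc = x - mean
    t = xc @ P
    e = xc - t @ P.T
    return float(np.sum(t ** 2 / score_var)), float(e @ e)

clean = X[7]                                        # a mid-range calibration spectrum
noisy = clean + rng.normal(0.0, 0.5, size=n_vars)   # same spectrum plus added noise

t2_clean, q_clean = in_and_out(clean)
t2_noisy, q_noisy = in_and_out(noisy)
q_cal_max = max(in_and_out(x)[1] for x in X)

print(f"T2: clean {t2_clean:.2f}, noisy {t2_noisy:.2f}")
print(f"Q:  clean {q_clean:.4f}, noisy {q_noisy:.4f}, calibration max {q_cal_max:.4f}")
```

Most of the added noise is not described by the two loadings, so it ends up in the residual: the noisy spectrum's Q is far above every calibration residual even when its position in the score plot looks ordinary.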
Sample Residual in PCA
If a particular sample residual exceeds the residual of the model, the sample could be considered an outlier: “it might not belong to the same population as the other samples in the training set.” Pirouette manual, Sample Residual topic (5.23 – 5.28).
An F-test is used ($s_i$ is for the sample, $s$ for the variance of the model). A threshold is then calculated as shown in the plot.
Q-Statistic
Q is also referred to as the squared prediction error. It is related to the sample residual variance but without normalization, and it is also used to evaluate whether a sample might be an outlier. The critical value of Q can be calculated from its approximate distribution.
Pirouette manual, Chapter 5, Q-Statistic, equation 5.29.
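A sketch of how a critical Q might be obtained is given below. It uses the Jackson and Mudholkar approximation, which is one commonly cited choice; whether it matches equation 5.29 of the Pirouette manual is an assumption that has not been checked. The function names and the data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def q_residuals(X, mean, P):
    """Q (squared prediction error) of each row: sum of squared X-residuals."""
    Xc = X - mean
    E = Xc - (Xc @ P) @ P.T
    return np.sum(E ** 2, axis=1)

def q_critical(residual_eigenvalues, confidence=0.95):
    """Approximate critical Q (Jackson-Mudholkar) from the eigenvalues of the
    factors that were left out of the model."""
    lam = np.asarray(residual_eigenvalues, dtype=float)
    theta1, theta2, theta3 = lam.sum(), (lam ** 2).sum(), (lam ** 3).sum()
    h0 = 1.0 - 2.0 * theta1 * theta3 / (3.0 * theta2 ** 2)
    c = NormalDist().inv_cdf(confidence)          # standard normal quantile
    bracket = (c * np.sqrt(2.0 * theta2 * h0 ** 2) / theta1
               + 1.0
               + theta2 * h0 * (h0 - 1.0) / theta1 ** 2)
    return theta1 * bracket ** (1.0 / h0)

# Hypothetical data: 30 samples x 10 variables, two dominant directions
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10)) * np.array([6, 4, 1, 1, 1, 1, 1, 1, 1, 1.0])
mean = X.mean(axis=0)
_, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

n_factors = 2
P = Vt[:n_factors].T
eigenvalues = s ** 2 / (X.shape[0] - 1)           # variance along each PC

q = q_residuals(X, mean, P)
q_lim = q_critical(eigenvalues[n_factors:])
flagged = np.flatnonzero(q > q_lim)               # samples with unusual residuals
print(q_lim, flagged)
```

Q is left in the units of the data (no normalization), so the limit depends on how much variance remains in the factors that were not kept.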
Probability
Consider a probability cutoff of 95%. The null hypothesis is that the variances are equal. If the probability value corresponding to the F value exceeds 95%, the hypothesis is rejected, and it is inferred that the sample was not drawn from the same population. As the sample’s probability approaches 1, the chance that it is an outlier increases.
Pirouette manual, Chapter 5, Probability (page 5-24).
Mahalanobis Distance in PCA
Calculated for each sample as a measure of variability. This is the distance from the multivariate mean. If a sample’s distance exceeds $MD_{crit}$, it may be an outlier.
Pirouette manual, Chapter 5, page 5-25.
Principal Component Analysis (PCA) – Preamble to Quantitative Methods
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
February 2022
PCA Scores Plot – The Transition from Qualitative to Quantitative
The first PC shows that concentration is the main source of variation in the data. This qualitative observation is a positive indication that a good quantitative model can be developed.
A. U. Vanarase, M. Alcalà, J. I. Jerez Rozo, F. J. Muzzio and R. J. Romañach, “Real-time monitoring of drug concentration in a continuous powder mixing process using NIR spectroscopy,” Chemical Engineering Science, 2010, 65(21), 5728–5733.
Introduction to Principal Component Analysis (PCA) – Discussion of Exercise
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
February 2022
Spectra and PCA Score Plot
Spectra, PCA Score Plot, & Loadings Line Plot Orthogonality
What is the objective?
If the goal is to determine differences in baseline, then method development is complete. If the goal is to differentiate between samples of different concentrations, then more work is needed: pretreatment of the data becomes necessary.
PCA – Orthogonality
Results obtained for spectra of 0-15% (w/w). No pretreatment was used, only mean centering.
Notice that the first factor summarizes most of the variation, and the variation explained by each subsequent factor decreases.
Each factor summarizes new variation that is not included in the previous factors.
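Orthogonality can be verified numerically. The sketch below uses synthetic, mean-centered data (hypothetical values) and checks that the loadings are orthonormal, that the score vectors are mutually orthogonal, and that the explained variance never increases from one factor to the next.

```python
import numpy as np

# Hypothetical correlated data: 16 samples x 8 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8)) @ rng.normal(size=(8, 8))

Xc = X - X.mean(axis=0)                      # mean centering only, no pretreatment
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T                                     # loadings, one column per factor
T = Xc @ P                                   # scores, one column per factor

loadings_gram = P.T @ P                      # identity if loadings are orthonormal
scores_gram = T.T @ T                        # diagonal if scores are orthogonal
explained_pct = 100.0 * s ** 2 / np.sum(s ** 2)

print(np.round(explained_pct, 2))
```

The zero off-diagonal elements of `scores_gram` are the numerical form of the statement that each factor describes variation not already included in the previous ones.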
Principal Component Analysis (PCA)
Introduction to Outlier Diagnostics & Projections
Rodolfo J. Romañach, Ph.D.
QUIM 6835 Chemometrics
September 3, 2019
Class no. 7
Foods Example 2018
Spectra have been uploaded and category variables were added.
View the spectra. Look at all spectra. Look at all flour spectra.
Perform the 2nd derivative with 7 points.
Perform the second derivative (Savitzky-Golay, 19 points).
Select just one spectrum to view spectral differences. Select a limited number of spectra.
Look at all flour spectra.
Obtain a shortened spectral range, eliminating some of the high-frequency region.
Perform PCA of flour with the 2nd derivative (19 points) in the shortened spectral region (variables 194-1063).
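The Savitzky-Golay second derivative used in this exercise can be sketched with NumPy alone. The exercise states only the window sizes (7 and 19 points); the polynomial order of 2 used below is an assumption, and the function name is hypothetical. The derivative is expressed per data point, not per wavelength unit, and the edges of the spectrum are simply dropped.

```python
import numpy as np

def savgol_second_derivative(y, window=19, polyorder=2):
    """Second derivative by local least-squares polynomial fitting.
    Returns len(y) - window + 1 values (points near the edges are dropped)."""
    if window % 2 == 0 or window <= polyorder or polyorder < 2:
        raise ValueError("window must be odd and larger than polyorder >= 2")
    half = window // 2
    z = np.arange(-half, half + 1)
    A = np.vander(z, polyorder + 1, increasing=True)   # columns: z^0, z^1, z^2, ...
    fit = np.linalg.pinv(A)                            # maps window values to coefficients
    weights = 2.0 * fit[2]                             # d2/dz2 at the window center = 2*c2
    # np.convolve flips its second argument, so reverse the weights first
    return np.convolve(np.asarray(y, dtype=float), weights[::-1], mode="valid")

x = np.arange(40, dtype=float)
curved = x ** 2                 # second derivative is 2 everywhere
tilted = 3.0 * x + 5.0          # a sloping baseline: second derivative is 0

d2_curved = savgol_second_derivative(curved, window=19)
d2_tilted = savgol_second_derivative(tilted, window=7)
print(d2_curved[:3], d2_tilted[:3])
```

The sloping baseline disappears after the second derivative, which is one reason the pretreatment is applied before PCA; a wider window smooths more but removes more points at the edges.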
Foods Example 2018 – 2
View the scores plot and Hotelling’s limits, the loadings plot (line and score plot), the influence plot, and the explained variance.
Examine sample no. 2 more closely.
Review In and Out for the flour model: residual vs. Hotelling.
Predict samples 6-10 of flour as if they were unknowns. These should be projected within the 95% confidence interval.
Perform a PCA of the pan criollo samples.
Project whole wheat into pan criollo.
Vector Multiplication
For the product of two matrices to be defined, the number of columns of the first matrix must equal the number of rows of the second matrix.
Show scores and loadings in software.
Vector Multiplication
Also called the inner product, dot product, or scalar product; the vector is seen as a one-column or one-row matrix.
The length of a vector is $\|x\| = \sqrt{x \cdot x}$.
The inner product is obtained with vectors that have the same dimensions:
x = [200 300 100 360]
y = [380 580 420 840]
x∙y = 200∙380 + 300∙580 + 100∙420 + 360∙840
x∙y = 594400
Matrix multiplication requires that the number of columns of the first matrix equals the number of rows of the second: (m × n)(n × p) = (m × p).
Chapter 9 Vectors and matrices. In Data Handling in Science and Technology, Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J., Eds. Elsevier: 1998; Vol. 20, pp 231-261.
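The worked inner product can be reproduced in a few lines. This is a minimal sketch using NumPy; the vectors are the ones from the slide, and the variable names are arbitrary.

```python
import numpy as np

x = np.array([200, 300, 100, 360])
y = np.array([380, 580, 420, 840])

# Inner (dot) product: multiply element by element, then sum
dot_by_hand = int(sum(xi * yi for xi, yi in zip(x, y)))
dot_numpy = int(x @ y)

# Length of a vector: square root of its inner product with itself
length_x = float(np.sqrt(x @ x))

print(dot_by_hand, dot_numpy, length_x)
```

Both ways of computing the product give the value shown on the slide, and the length agrees with `np.linalg.norm(x)`.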
Vector Multiplication
The second way of writing the product explains why it is called the dot product.
Matrix multiplication requires that the number of columns of the first matrix equals the number of rows of the second: (m × n)(n × p) = (m × p).
Matrix multiplication is done by summing the products of the ith element of a row with the ith element of a column.
Chapter 9 Vectors and matrices. In Data Handling in Science and Technology, Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; De Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J., Eds. Elsevier: 1998; Vol. 20, pp 231-261.
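The row-by-column rule can be written out explicitly. The sketch below is plain Python with a hypothetical function name and made-up matrices; in practice a library routine would be used, but the loop shows where the dimension requirement comes from.

```python
def matmul(A, B):
    """Multiply an (m x n) matrix by an (n x p) matrix given as lists of rows.
    Each output element is the sum of products of a row of A with a column of B."""
    m, n = len(A), len(A[0])
    n_rows_b, p = len(B), len(B[0])
    if n != n_rows_b:
        raise ValueError("columns of the first matrix must equal rows of the second")
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[7, 8],
     [9, 10],
     [11, 12]]           # 3 x 2

C = matmul(A, B)         # 2 x 2
print(C)
```

Each element of `C` is the dot product of one row of `A` with one column of `B`, which is why the two inner dimensions must agree; multiplying `A` by itself (2 x 3 times 2 x 3) raises the error.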