Data integration strategies for chemical and sensory product spaces: a craft beers and gins case study
Mpho Mafata 1,2, Cody Williams 1, Markus Kruger 2,3, Jeanne Brand 1, Bruce Watson 3, and Astrid Buica 1
1 South African Grape and Wine Research Institute, Department of Viticulture and Oenology, Stellenbosch University, South Africa
2 School for Data Science and Computational Thinking, Stellenbosch University, South Africa
3 Department of Information Science, Stellenbosch University, South Africa
Motivation
Craft market allure and interest; investigation of new product spaces
No scientific studies exist on the chemical and sensory space of (South African) craft beers and gins.
How do you investigate an unknown product space where everybody tries to be as different as possible?
  Be highly inclusive (open-ended → unsupervised, non-confirmatory)
  Add as many samples as possible (variability), from as far and as different as possible (variation)
  Capture different types of data
Background
What is captured and how is it captured?
  Colour, smell, taste, mouthfeel
  Photometric measurements of colour
  Gas chromatography for volatile compounds
  Liquid chromatography for non-volatile compounds
What we know
  The chemistry is the source of sensory stimuli
  How to capture the sensory stimuli and chemical compounds
  Which compound classes are responsible for which stimuli
What we are trying to figure out
  Why the correlation between the two modes falls short of theoretical expectations
Hypotheses
  Due to the complex nature of sensory interactions between the chemical components
  Quantity: capture more (samples, measurements)
  Quality: capture the vital stuff
  Better modelling
Background: Multiblock Approaches
Designate data blocks
Treat blocks separately
Find joint or unique information
  e.g. optimize unique information between different classes, for discrimination analyses
  e.g. optimize joint information between dependent and independent variables, for regression analyses and prediction
Can be used for exploratory or confirmatory purposes
Phase 1: acquisition
Phase 2: pre-processing to remove artifacts, reduce noise, or address peak shifts
Phase 3: standardization, scaling, weighting, and rank
Phase 4: optimization
Phase 5: final model
Background: Multiblock Approaches
Purpose of exploratory analyses
  Reduce the number of dimensions
  *Used prior to confirmatory analyses
Purpose of confirmatory analyses
  Prediction
  *Calibration, validation, and testing
  *Large sample variation and variability
Examples
  Variations of PCA: sum-PCA, m-PCA, h-PCA, etc.
  Factor analysis: PARAFAC, PARADISe, and variations, MFA, ComDim, etc.
  Predictive analysis: PLS variations OPLS, OPLS-DA, P-ComDim, LDA, etc.
Practical optimization criteria
  Discard factors/dimensions with an eigenvalue less than 1
  Retain factors up to a cumulative variation of 70%
  Retain factors up to the dimension/point of first inflection in the eigenvalue decay (scree plot)
  Optimize a particular criterion, e.g. indices such as coefficients of fit (covariance, correlation, or regression)
References
  Howard, M. C. (2016). A review of exploratory factor analysis decisions and overview of current practices: What we are doing and how can we improve? International Journal of Human-Computer Interaction, 32(1), 51-62. DOI: 10.1080/10447318.2015.1087664
  Schreiber, J. B. (2021). Issues and recommendations for exploratory factor analysis and principal component analysis. Research in Social and Administrative Pharmacy, 17, 1004-1011.
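The first three practical optimization criteria can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function name `retention_counts`, the example eigenvalues, and the use of the largest second difference as a stand-in for the "first inflection" of the scree plot are all assumptions made here.

```python
import numpy as np

def retention_counts(eigenvalues, cum_threshold=0.70):
    """Number of factors retained under three common rules of thumb.

    Assumes at least three eigenvalues are supplied.
    """
    ev = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending order
    cum_share = np.cumsum(ev) / ev.sum()  # cumulative share of total eigenvalue

    # Rule 1: keep factors whose eigenvalue is at least 1.
    n_eigen = int(np.sum(ev >= 1.0))

    # Rule 2: smallest number of factors reaching the cumulative threshold.
    n_cum = min(int(np.searchsorted(cum_share, cum_threshold)) + 1, len(ev))

    # Rule 3 (heuristic, an assumption of this sketch): the factor at the
    # sharpest bend of the scree curve, found via the largest second difference.
    second_diff = ev[:-2] - 2.0 * ev[1:-1] + ev[2:]
    n_elbow = int(np.argmax(second_diff)) + 2  # 1-based factor number

    return {"eigenvalue_ge_1": n_eigen, "cumulative": n_cum, "elbow": n_elbow}

# Hypothetical eigenvalues, for illustration only (not from the study).
example = retention_counts([5.0, 3.0, 1.2, 0.4, 0.2, 0.2])
print(example)
```

The three rules generally disagree, which is why the choice of criterion is itself treated as a decision to be made per data set.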
Problem statement
Choice: large number of techniques and their variants
Execution: advanced options require programming skills
Optimization: highly experimental and laborious
Visualization: large fused data sets are difficult to interpret
Aim
Comprehensive data fusion of craft beer and gin sensory descriptors and chemical data
  Choice: must be comprehensive*, exploratory, and unsupervised
  Optimization: criteria must be exploratory (not blanketed)
  Interpretation: unambiguous visual representation
*Comprehensive meaning that it contains both joint and unique information
Materials and methods
Sampling
  Commercially available industrial and craft beers; local and international gins
  68 beers from 23 breweries: 36 ales and 32 lagers ('crafty' categorization, blurry lines)
  61 gins from 37 producers
  South Africa, UK, USA, Belgium, Denmark, Italy, and Japan
Materials and methods
Sensory data
  Mining for the products chemically analysed (poster presentation: P06.014)
  Distinguish flavour/aroma descriptors from emotional/marketing terms
  Data consolidation: crafty drinks, crafty descriptions
Materials and methods
Chemical data
  Headspace solid-phase microextraction gas chromatography-mass spectrometry (HS-SPME-GC-MS)
  Untargeted analysis, Scripps XCMS alignment
  Terpenoid analysis by Williams & Buica (2021)
Materials and methods
Data analysis
  Sensory attributes: multiple correspondence analysis (MCA)
  Chemical data: principal component analysis (PCA)
  Data fusion: multiple factor analysis (MFA)
  Cluster analysis: agglomerative hierarchical clustering (AHC)
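The core idea behind the MFA data fusion step can be sketched as follows: each block is standardized, divided by its own first singular value so that no block dominates through size alone, and the weighted blocks are then analysed together in one global PCA. This is a simplified illustration for quantitative blocks only; the handling of categorical sensory descriptors (as in MCA) is omitted, and the function name `mfa_sketch` and the random demonstration data are assumptions, not the study's software or data.

```python
import numpy as np

def mfa_sketch(blocks):
    """Simplified MFA for quantitative blocks sharing the same rows (samples).

    Returns global eigenvalues, sample scores, and the share of each
    dimension contributed by each block (each row sums to 1).
    Assumes every block has at least one non-constant column.
    """
    weighted, sizes = [], []
    for block in blocks:
        X = np.asarray(block, dtype=float)
        Z = X - X.mean(axis=0)                    # centre each variable
        sd = Z.std(axis=0)
        Z = Z / np.where(sd > 0, sd, 1.0)         # scale to unit variance
        first_sv = np.linalg.svd(Z, compute_uv=False)[0]
        weighted.append(Z / first_sv)             # block weighting step
        sizes.append(X.shape[1])

    G = np.hstack(weighted)                       # fused (global) table
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    eigenvalues = s ** 2
    scores = U * s                                # sample coordinates

    # Squared loadings summed per block give block contributions per dimension.
    edges = np.cumsum([0] + sizes)
    contributions = np.column_stack(
        [(Vt[:, a:b] ** 2).sum(axis=1) for a, b in zip(edges[:-1], edges[1:])]
    )
    return eigenvalues, scores, contributions

# Hypothetical data: 12 samples described by three blocks of variables.
rng = np.random.default_rng(0)
demo_blocks = [rng.normal(size=(12, 4)), rng.normal(size=(12, 6)), rng.normal(size=(12, 3))]
demo_ev, demo_scores, demo_contrib = mfa_sketch(demo_blocks)
```

With this weighting the first global eigenvalue lies between 1 and the number of blocks, which gives a quick sanity check on any implementation.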
Results
Overall data fusion model: variation captured

GIN (F19)
  Total eigenvalue                       19.7498
  Cumulative eigenvalue at F19           14.0803
  Cumulative %EV at F19                  71.2935
  Cumulative contribution (%) at F19
    Sensory attributes                   39.11
    Targeted                             13.04
    Untargeted                           19.13

BEER (F18)
  Total eigenvalue                       19.3407
  Cumulative eigenvalue at F18           13.6204
  Cumulative %EV at F18                  70.4235
  Cumulative contribution (%) at F18
    Sensory attributes                   40.96
    Targeted                             22.45
    Untargeted                            7.01

Optimization criteria
  Eigenvalue ≥ 1 at F3 (22.2 %EV) for beers and F3 (19.7 %EV) for gins
  Cumulative variation ≥ 70% at F18 for beers and F19 for gins
  Inflection point at F39 (95 %EV) for beers and F48 (98 %EV) for gins
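As a consistency check on the reported figures, the cumulative %EV is the cumulative eigenvalue divided by the total eigenvalue, and the three block contributions should add up to roughly the same percentage. The short Python sketch below recomputes both from the values reported on this slide; the helper name `cumulative_percent_ev` is an assumption introduced here for illustration.

```python
def cumulative_percent_ev(cumulative_eigenvalue, total_eigenvalue):
    """Cumulative explained variation as a percentage of the total eigenvalue."""
    return 100.0 * cumulative_eigenvalue / total_eigenvalue

# Values as reported for the gin (F19) and beer (F18) models.
gin_pct = cumulative_percent_ev(14.0803, 19.7498)    # approx. 71.29
beer_pct = cumulative_percent_ev(13.6204, 19.3407)   # approx. 70.42

# Block contributions (sensory, targeted, untargeted) summed per model.
gin_blocks = 39.11 + 13.04 + 19.13
beer_blocks = 40.96 + 22.45 + 7.01

print(round(gin_pct, 2), round(gin_blocks, 2))
print(round(beer_pct, 2), round(beer_blocks, 2))
```

Small differences in the last decimal place are expected, since the slide reports rounded values.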
Results
Interpretation: pattern recognition
After optimization, there are still many dimensions left to visualise ...
... so, to avoid ambiguous assignment by naked-eye observation:
Hierarchical Cluster Analysis on projected points biplots

BEER
18 dimensions, 805 projected points, 91 clusters
Cophenetic correlation: 0.7337
Variance decomposition for the optimal classification:
                     Absolute    Percent
  Within-class         0.1968      0.64%
  Between-classes     30.7731     99.36%
  Total               30.9699    100.00%
Results
Hierarchical Cluster Analysis on projected points biplots

GIN
19 dimensions, 1465 projected points, 149 clusters
Cophenetic correlation: 0.877
Variance decomposition for the optimal classification:
                     Absolute    Percent
  Within-class         0.0219      0.21%
  Between-classes     10.2381     99.79%
  Total               10.2601    100.00%
Results
Hierarchical Cluster Analysis on projected points biplots

GIN
  Cluster   Members
  58        Cassia-0, UT_1 to UT_4, UT_18, UT_22, UT_28, UT_32, UT_36, UT_59 to UT_84, ... (continued)
  88        UT_5 to UT_17, UT_19, UT_20, UT_21, UT_23, UT_24, UT_25, UT_26, ... (continued)
  89        UT_351, UT_526, UT_534, UT_547, UT_548, UT_671, UT_697, UT_725, ... (continued)
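The quantities reported for the cluster analysis (cophenetic correlation and the within-class/between-classes variance decomposition) can be reproduced in outline with SciPy. This is a minimal sketch under stated assumptions: the slides do not name the linkage method or software, so Ward linkage on Euclidean distances is assumed here, and the function name `cluster_projected_points` and the synthetic coordinates are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_projected_points(coords, n_clusters):
    """Agglomerative clustering of projected points with basic diagnostics."""
    coords = np.asarray(coords, dtype=float)
    distances = pdist(coords)                  # pairwise Euclidean distances
    tree = linkage(coords, method="ward")      # assumed linkage, see lead-in
    coph_corr, _ = cophenet(tree, distances)   # fidelity of the dendrogram
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")

    # Variance decomposition: total = within-class + between-classes.
    n = len(coords)
    total = ((coords - coords.mean(axis=0)) ** 2).sum() / n
    within = 0.0
    for label in np.unique(labels):
        members = coords[labels == label]
        within += ((members - members.mean(axis=0)) ** 2).sum()
    within /= n
    variance = {"within": within, "between": total - within, "total": total}
    return labels, coph_corr, variance

# Synthetic example: two well-separated groups of points in three dimensions.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
labels, coph_corr, variance = cluster_projected_points(points, n_clusters=2)
print(round(100 * variance["between"] / variance["total"], 2), "% between-classes")
```

A between-classes share close to 100%, as reported for both the beer and gin models, indicates compact and well-separated clusters.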
Conclusion
Created comprehensive MFA data fusion models of sensory descriptors, terpenoids, and untargeted GC-MS features of craft beers and gins
  Choice: the unsupervised approach allowed us to see common and unique products and their related chemical compounds/signals and sensory descriptors
  Optimization: practical optimization criteria (not blanketed) were used
  Interpretation: cluster analysis on projected points biplots allowed for unambiguous interpretation
Recommendation
  Confirmatory cluster analyses (e.g. k-NN or fuzzy c-means)
  If there is natural clustering by category (ale and lager), design further studies to exploit each category
*Not included here: contextual descriptions of the chemical and sensory space
Acknowledgements
Funding:
  Office of the Vice Rector, Stellenbosch University
  School for Data Science and Computational Thinking, Stellenbosch University