Exploratory Data Analysis in Python Using the Wine Quality Dataset

SrijitaGhoshal2 54 views 16 slides Sep 09, 2025

About This Presentation

This presentation, titled “Exploratory Data Analysis (EDA) in Python using the Wine Quality Dataset,” explores how Python’s data science ecosystem can be effectively applied to uncover patterns, correlations, and insights from real-world data. Prepared as part of the Introduction to R/Python c...


Slide Content

PRESENTED BY: SRIJITA GHOSHAL (H2024CG021)
INTRODUCTION TO R/PYTHON (CG-S3)
SUBMITTED TO: DR. DIPAK R. SAMAL

WHAT IS EXPLORATORY DATA ANALYSIS (EDA)?
Exploratory Data Analysis (EDA) is a critical starting point in any data science project. It gives us a clear understanding of the data's quality, patterns, and relationships. With essential Python libraries such as Pandas, Matplotlib, and Seaborn, we can efficiently load, clean, and visualize data, revealing insights that guide feature engineering and modeling decisions. EDA also helps us identify and address issues such as missing data and outliers, which supports data-driven decisions and clear communication of findings to stakeholders.

STEPS OF EDA:
EDA involves the following steps:
1. Importing libraries
2. Reading the dataset
3. Analyzing the data
4. Checking for duplicates
5. Calculating missing values
6. Exploratory data analysis:
   Univariate analysis
   Bivariate analysis
   Multivariate analysis

For this project, we use the “winequality-red” dataset. Before going further, let us look at the columns this dataset contains.
Column Descriptions:
fixed acidity: Amount of fixed acids in wine (e.g., tartaric, citric acids).
volatile acidity: Amount of volatile acids (e.g., acetic acid).
citric acid: Citric acid content.
residual sugar: Sugar left after fermentation (g/L).
chlorides: Salt content in wine.
free sulfur dioxide: SO₂ not chemically bound, acting as an antimicrobial.
total sulfur dioxide: Total SO₂ in wine.
density: Wine density (g/mL).
pH: Acidity or basicity.
sulphates: Sulphates level, contributing to wine preservation.
alcohol: Alcohol content (% by volume).
quality: Wine quality score (integer, typically 0–10).

STEP 1: IMPORTING REQUIRED LIBRARIES:-
NAME OF THE LIBRARY | USAGE
PANDAS | Used for data manipulation and analysis.
NUMPY | Used for numerical computing.
MATPLOTLIB AND SEABORN | Used for data visualization.
Here we import all the libraries required for the EDA.
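The slide's code is not reproduced in the text, so the following is only a minimal sketch of what this step typically looks like. The aliases (`np`, `pd`, `plt`, `sns`) are the usual community conventions and are assumed here.

```python
import numpy as np               # numerical computing
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # statistical visualization built on Matplotlib
```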

STEP 2: IMPORTING AND READING THE DATASET:-
Here, we import the dataset with the pandas read_csv() function. The head() function then shows the first five rows of the dataset, and the tail() function shows the last five rows.
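As a hedged illustration of this step, the sketch below reads a tiny in-memory CSV instead of the real file so that it runs anywhere. The rows are illustrative only, and the file name mentioned in the comment is an assumption.

```python
import io

import pandas as pd

# Illustrative rows only, laid out like the dataset's 12 columns.
# With the real file you would pass its path instead, for example
# pd.read_csv("winequality-red.csv") (file name assumed). Some copies of
# this dataset are semicolon-separated and need sep=";".
sample_csv = io.StringIO(
    "fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,"
    "free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality\n"
    "7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5\n"
    "7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5\n"
    "7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5\n"
    "11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6\n"
    "7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5\n"
    "7.9,0.60,0.06,1.6,0.069,15,59,0.9964,3.30,0.46,9.4,5\n"
)

df = pd.read_csv(sample_csv)

print(df.head())  # first five rows by default
print(df.tail())  # last five rows by default
```

Both `head()` and `tail()` accept a row count, so `df.head(10)` would show the first ten rows.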

STEP 3: ANALYZING THE DATA:-
NAME OF FUNCTION | USAGE
.shape | Displays the number of observations (rows) and features (columns) in the dataset. Note that shape is an attribute, so it is written without parentheses.
.info() | Summarizes the dataset: the number of non-null records in each column, the data type of each column, and the memory usage.
From .shape, there are 1599 observations and 12 features in the dataset.
From .info(), the dataset looks clean, with no missing values. It also confirms the shape of the dataset (12 columns and 1599 rows) and shows the data type of each column.
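A minimal sketch of these two calls follows. It uses a small synthetic DataFrame (random values, not the real wine data) so that it is self-contained; on the real dataset the printed shape would be the 1599 by 12 reported above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (values are random, not real).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, size=100),
    "pH": rng.normal(3.3, 0.15, size=100),
    "quality": rng.integers(3, 9, size=100),  # integers from 3 to 8
})

n_rows, n_cols = df.shape  # shape is an attribute, not a method
print(f"{n_rows} observations, {n_cols} features")

df.info()  # prints dtypes, non-null counts, and memory usage
```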

STEP 4: CHECKING FOR MISSING VALUES AND UNIQUE VALUES:-
NAME OF FUNCTION | USAGE
isnull().sum() | Counts the null values in each column; widely used to identify missing data.
nunique() | Counts the unique values in each column, which indicates how varied the data in each feature is.
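The sketch below shows both calls on a tiny hand-made DataFrame (hypothetical values) that deliberately contains one missing value and one duplicated row. It also includes `duplicated().sum()`, a standard pandas way to carry out the duplicate check listed in the EDA steps.

```python
import pandas as pd

# Hypothetical values: one missing pH and one repeated row (rows 0 and 3).
df = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26, 3.51, None],
    "alcohol": [9.4, 9.8, 9.8, 9.4, 10.0],
    "quality": [5, 5, 5, 5, 6],
})

missing = df.isnull().sum()          # null count per column
unique = df.nunique()                # distinct non-null values per column
n_duplicates = df.duplicated().sum() # rows identical to an earlier row

print(missing)
print(unique)
print("duplicate rows:", n_duplicates)
```

Note that `nunique()` ignores missing values by default, so the missing pH entry is not counted as a distinct value.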

STEP 5: STATISTICS SUMMARY:-
The statistics summary gives a high-level view of the data. It helps identify outliers and data entry errors, and it indicates the shape of each distribution, for example whether the data is roughly normally distributed or left/right skewed.
In Python, this is obtained with describe(), which provides summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for columns of numerical data types such as int and float.
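As a rough illustration, the sketch below runs `describe()` on a few hypothetical values and compares each column's maximum with its 75th percentile, which is one simple way to spot the kind of extreme values discussed in this deck. The `gap` variable is an illustrative helper, not part of the original analysis.

```python
import pandas as pd

# Hypothetical values; the last chlorides entry is deliberately extreme.
df = pd.DataFrame({
    "chlorides": [0.076, 0.098, 0.092, 0.075, 0.611],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
})

summary = df.describe()
print(summary)

# A maximum far above the 75th percentile hints at right skew or outliers.
gap = summary.loc["max"] - summary.loc["75%"]
print(gap)
```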

SOME IMPORTANT INTERPRETATIONS FROM STATS SUMMARY:-
1. Outliers in Chlorides and Sulfur Dioxide: The maximum values for chlorides (0.611) and total sulfur dioxide (289) are extreme compared to their 75th percentiles. These outliers could represent wines with unusual compositions or data entry issues and may need further inspection.
2. Skewness in Residual Sugar and Sulphates: Features like residual sugar and sulphates are strongly right-skewed, with maximum values far above the interquartile range. Logarithmic or power transformations might bring these distributions closer to normal for modeling.
3. Density and Alcohol Correlation: Density varies over a narrow range (0.9901–1.0037). Since higher alcohol generally lowers density, a relationship between density and alcohol content is worth checking. Such a correlation could help in identifying wine styles or quality levels.
4. Quality Distribution and Median Value: The quality scores are concentrated around the median value (6), with very few wines scoring below 4 or above 7. Exploring the features of high-quality wines (e.g., alcohol, acidity) may uncover patterns for premium classifications.
5. Citric Acid and Volatile Acidity Dynamics: Some wines have citric acid values of 0, which may indicate absence or degradation, while volatile acidity ranges widely, up to 1.58. These parameters are crucial for taste and aroma, and extreme values might correlate with lower quality.
6. pH and Acidity Balance: The pH range (2.74–4.01) highlights acidity variation among the wines. Wines with very high or low pH values could have distinct sensory profiles, potentially influencing consumer preference or quality ratings.
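Points 1 and 2 above can be sketched in code. The example below flags outliers with the 1.5 × IQR rule and applies a log transformation to reduce right skew. Both the 1.5 multiplier and the helper name `iqr_outlier_mask` are assumptions for illustration (the deck does not say which rule was used), and the data is synthetic.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Synthetic right-skewed data standing in for a feature like residual sugar.
rng = np.random.default_rng(seed=1)
sugar = pd.Series(rng.lognormal(mean=0.8, sigma=0.6, size=1000), name="residual sugar")

mask = iqr_outlier_mask(sugar)
print("values flagged as outliers:", int(mask.sum()))

# log1p computes log(1 + x), which is safe for zeros and compresses the right tail.
log_sugar = np.log1p(sugar)
print("skew before:", round(sugar.skew(), 2), "after:", round(log_sugar.skew(), 2))
```

Flagged values are candidates for inspection rather than automatic removal, since they may be genuine wines with unusual compositions.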

STEP 6: UNIVARIATE ANALYSIS:-
The plot used here is the Kernel Density Estimate (KDE) plot, also called a density plot. It shows the shape of each variable's distribution. The skewness value computed for each variable in the dataset quantifies how asymmetric that distribution is.
INTERPRETATION OF KDE:
The high positive skewness of chlorides (5.68) indicates that most wines have low salt content, with outliers at unusually high levels. This feature may need a transformation and further investigation for accuracy.
The near-normal distribution of density, with minimal skewness (0.07), indicates consistent density levels across the dataset, so it likely needs no transformation before linear modeling.
The boxplot of ‘chlorides’ shows the same pattern:
There are numerous outliers above the upper whisker, indicating a heavy right skew.
Some extreme outliers have values as high as 0.6, far greater than the majority of the dataset.
These outliers may represent wines with abnormally high chloride (salt) content, which could negatively affect taste or indicate quality issues.

STEP 7: BIVARIATE ANALYSIS:-
For bivariate analysis, the two variables considered are density and alcohol.
Results:
1. Scatter Plot: The scatter plot of density vs. alcohol shows a negative trend, indicating that wines with higher alcohol content generally have lower densities.
2. Regression Line:
Slope (coefficient): -280.16. For every unit increase in density, alcohol decreases by approximately 280.16 units on average. Because density only varies between about 0.99 and 1.00 g/mL, a more practical reading is that an increase of 0.001 g/mL in density corresponds to a decrease of about 0.28 percentage points of alcohol.
Intercept: 289.68. This represents the predicted alcohol value when density is zero (theoretical and outside the realistic range).
3. Correlation Coefficient: The correlation coefficient between alcohol and density is -0.4961, which suggests a moderate negative linear relationship. As alcohol increases, density tends to decrease.
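The regression line and correlation coefficient can be computed as in the sketch below. It uses `np.polyfit` for the least-squares line, which is one of several possible tools (the deck does not say which was used), and synthetic data built to have a moderate negative relationship, so the printed numbers will not match the slide's values.

```python
import numpy as np
import pandas as pd

# Synthetic data: density falls as alcohol rises, plus noise (not real wine data).
rng = np.random.default_rng(seed=3)
alcohol = rng.normal(10.4, 1.0, size=500)
density = 1.0065 - 0.0009 * alcohol + rng.normal(0.0, 0.0016, size=500)
df = pd.DataFrame({"density": density, "alcohol": alcohol})

# Least-squares line: alcohol = slope * density + intercept
slope, intercept = np.polyfit(df["density"], df["alcohol"], deg=1)

# Pearson correlation coefficient between the two variables
r = df["density"].corr(df["alcohol"])

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.4f}")
print(f"change in alcohol per 0.001 g/mL of density: {slope * 0.001:.3f}")
```

For a simple linear regression the slope equals r multiplied by the ratio of the standard deviations (alcohol over density), which is why a moderate correlation can still come with a very large slope when density varies so little.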

INTERPRETATION:
This bar graph shows the average alcohol content for each range (bin) of density.
1. Inverse Relationship: There is a noticeable decline in average alcohol content as density increases. This aligns with the scientific understanding that higher alcohol content generally reduces density in wines.
2. Distinct Wine Profiles: Wines with lower densities may correspond to lighter-bodied, higher-alcohol wines, which may be preferred for certain styles or quality levels and which differ in mouthfeel and body. Wines with higher densities may represent sweeter wines or those with more residual sugar and lower alcohol content.
3. Further Analysis: Analyzing other features, such as quality ratings, alongside the density bins might reveal how density and alcohol jointly relate to wine quality.
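One way to produce the numbers behind such a bar graph is to bin density with `pd.cut` and average alcohol within each bin, as sketched below. The number of bins (3) and the values are assumptions for illustration; the deck does not state how its bins were chosen.

```python
import pandas as pd

# Hypothetical values chosen so that alcohol falls as density rises.
df = pd.DataFrame({
    "density": [0.990, 0.992, 0.994, 0.996, 0.998, 1.000],
    "alcohol": [13.0, 12.0, 11.0, 10.0, 9.5, 9.0],
})

# Split the density range into equal-width bins (bin count is an assumption).
df["density_bin"] = pd.cut(df["density"], bins=3)

# Average alcohol content within each density bin.
mean_alcohol = df.groupby("density_bin", observed=True)["alcohol"].mean()
print(mean_alcohol)
```

Calling `mean_alcohol.plot(kind="bar")` would then draw the bar graph itself.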

STEP 8: MULTIVARIATE ANALYSIS:-
As the name suggests, multivariate analysis looks at more than two variables. It is one of the most useful methods for determining relationships and analyzing patterns in a dataset.
A heat map is widely used for multivariate analysis. A heat map of the correlation matrix shows the correlation between each pair of variables and whether it is positive or negative.
This is the heatmap for the dataset that we are using.
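A correlation heatmap of this kind can be sketched as follows. The data is synthetic (three columns with built-in relationships), and the styling choices such as the `coolwarm` colormap are assumptions rather than the deck's actual settings.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-ins with built-in relationships (not the real wine data).
rng = np.random.default_rng(seed=4)
alcohol = rng.normal(10.4, 1.0, size=500)
quality = np.clip(np.round(5.6 + 0.4 * (alcohol - 10.4) + rng.normal(0, 0.7, size=500)), 3, 8)
df = pd.DataFrame({
    "alcohol": alcohol,
    "density": 1.0065 - 0.0009 * alcohol + rng.normal(0.0, 0.0016, size=500),
    "quality": quality,
})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (synthetic data)")
fig.tight_layout()
```

Fixing `vmin=-1` and `vmax=1` keeps the color scale symmetric, so positive and negative correlations of the same strength appear equally intense.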

INTERPRETATION OF THE HEATMAP:
1. Strong Positive Correlations:
Fixed acidity is strongly positively correlated with density (0.67). This indicates that wines with higher fixed acidity tend to have higher density.
Citric acid is positively correlated with fixed acidity (0.67), suggesting that higher levels of citric acid are associated with higher acidity in wines.
2. Negative Correlations:
pH is negatively correlated with fixed acidity (-0.68). This aligns with the fact that higher acidity corresponds to lower pH values.
Volatile acidity shows a negative correlation with quality (-0.39), indicating that wines with higher volatile acidity generally have lower quality scores.
3. Moderate Positive Correlations:
Alcohol is moderately positively correlated with quality (0.48). This suggests that wines with higher alcohol content tend to receive higher quality scores.
Sulphates show a weak positive correlation with quality (0.25), indicating that sulphates might contribute positively to wine quality, but not strongly.
4. Weak or Negligible Correlations:
Residual sugar has very weak or negligible correlations with most features, including quality (0.01). This suggests that residual sugar has little linear association with wine quality.
Free sulfur dioxide has almost no correlation with quality (-0.05).
5. Density and Its Influence:
Density is moderately negatively correlated with alcohol (-0.50). Wines with higher alcohol content are less dense, which is expected because alcohol is less dense than water.
6. Sulfur Dioxide's Role:
Free sulfur dioxide and total sulfur dioxide are strongly positively correlated (0.67), indicating that wines with higher free sulfur dioxide also tend to have higher total sulfur dioxide.
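Readings like those above can also be pulled out of the correlation matrix directly. The sketch below ranks every feature by its correlation with a target column; the helper name `correlations_with` and the tiny data set are hypothetical, chosen so that the result is easy to verify by hand.

```python
import pandas as pd


def correlations_with(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every other column with `target`, sorted ascending."""
    return df.corr()[target].drop(target).sort_values()


# Hypothetical values: alcohol rises with quality, volatile acidity falls,
# and residual sugar is unrelated to it.
df = pd.DataFrame({
    "quality": [3, 4, 5, 6, 7],
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "volatile acidity": [1.0, 0.8, 0.6, 0.4, 0.2],
    "residual sugar": [2.0, 1.0, 2.0, 1.0, 2.0],
})

ranked = correlations_with(df, "quality")
print(ranked)
```

The most negative correlations appear first and the most positive last. A correlation coefficient measures only linear association, so a value near zero does not rule out a nonlinear relationship.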