BAD702_module1_statistical machine learning for Data Science
rahuls801107
31 slides
Sep 17, 2025
Module 1
Exploratory Data Analysis: The Foundation of Data Science
What is Exploratory Data
Analysis?
●First step in any data science
project
●Involves looking at and
summarizing data
●Pioneered by John Tukey in the 1960s and 1970s (his book Exploratory Data Analysis appeared in 1977)
●Forms foundation for data science
field
●How might exploratory analysis
differ from other types of data
analysis you've encountered?
Key Goals of EDA
●Gain intuition about the data
●Discover underlying patterns
●Check for anomalies and outliers
●Test hypotheses and assumptions
●Generate initial insights
●What do you think is the most important goal of exploratory data
analysis?
Types of Data
●Numeric: Continuous (e.g. height) or
Discrete (e.g. count)
●Categorical: Nominal, Binary, Ordinal
Ex (nominal): type of TV screen (plasma, LCD, LED, etc.), a state name (KA, MH, KL etc.)
Ex (binary): yes/no; Ex (ordinal): a rating from 1 to 5
●Structured vs Unstructured
●Can you think of examples of each data
type from your daily life?
Rectangular Data
●Most common format for analysis
●Rows = records/cases
●Columns = features/variables
●Similar to spreadsheets or database tables
●How is rectangular data similar to or different from data you've
worked with before?
Estimates of Location
●Mean: Average of all values
●Median: Middle value
●Mode: Most common value
●Trimmed mean: Average after
removing extremes
●Which measure do you think is
most useful, and why?
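The four estimates above can be compared on a small sample. The sketch below uses only Python's standard library; the data values, the `trimmed_mean` helper, and the 10% trim fraction are illustrative assumptions, not part of the original slides.

```python
import statistics

def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the smallest and largest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # how many values to drop from each end
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.mean(kept)

# Hypothetical sample with one extreme value (100)
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 100]

location = {
    "mean": statistics.mean(data),            # pulled upward by the extreme value
    "median": statistics.median(data),        # middle of the sorted values
    "mode": statistics.mode(data),            # most common value
    "trimmed_mean": trimmed_mean(data, 0.1),  # drops one value from each end here
}
print(location)
```

On this sample the mean is far from the bulk of the values, while the median and trimmed mean stay close to it, which is one reason robust estimates are often reported alongside the mean.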
Estimates of Variability
●Range: Difference between max and min
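●Variance: Average squared deviation from the mean (the sample variance divides by n - 1 rather than n)
●Standard deviation: Square root of the variance, in the same units as the data
●Percentiles: Values below which a given percentage of observations fall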
●How might understanding variability be useful in real-world
scenarios?
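As a rough illustration of these variability estimates, the following sketch computes each one with Python's standard `statistics` module. The sample values are made up, and the choice of the "inclusive" quantile method is an assumption; other quantile conventions give slightly different cut points.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

value_range = max(data) - min(data)

# Population versions divide by n; sample versions divide by n - 1
pop_variance = statistics.pvariance(data)
sample_variance = statistics.variance(data)
pop_std = statistics.pstdev(data)

# quantiles(n=4) returns the three quartile cut points (25th, 50th, 75th percentiles)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, a robust measure of spread

print(value_range, pop_variance, sample_variance, pop_std, (q1, q2, q3), iqr)
```

The interquartile range is included because it follows directly from the percentiles and is less sensitive to extreme values than the range.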
Visualizing Distributions
●Histograms: Show frequency of
data in bins
●Box Plots: Display quartiles and
outliers
●Density plots: Smoothed version of
histogram
●What additional insights can
visualizations provide compared to
numerical summaries?
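A histogram is, underneath, a table of counts per bin. The sketch below computes equal-width bin counts by hand and prints a text histogram; the `histogram_counts` helper and the sample data are hypothetical, and the code assumes the values are not all identical. Plotting libraries such as matplotlib do this binning internally.

```python
def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last bin instead of opening a new one
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical sample
edges, counts = histogram_counts(data, 4)

for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f} | {'#' * count}")
```

Changing the number of bins changes the picture, which is worth trying before drawing conclusions from a single histogram.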
Exploring Categorical Data
●Mode: Most frequent category
●Bar charts: Visual representation of frequencies
●Pie charts: Show proportions (use cautiously)
●When might a bar chart be more appropriate than a pie chart?
Correlation
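●Measures the strength and direction of the linear association between two numeric variables (Pearson's correlation coefficient)
●Ranges from -1 (perfect negative) to +1 (perfect positive); 0 means no linear association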
●Visualized with scatterplots
●Correlation ≠ Causation
●Can you think of two variables that
might be correlated but not causally
related?
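A minimal sketch of Pearson's correlation coefficient, written out from its definition (covariance divided by the product of the standard deviations). The slides do not name a specific coefficient, so Pearson's is an assumption here, and the variable names and values are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # assumes neither variable is constant

hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]    # rises exactly in step with hours
score_down = [10, 8, 6, 4, 2]  # falls exactly in step with hours

r_up = pearson_r(hours, score_up)
r_down = pearson_r(hours, score_down)
print(r_up, r_down)
```

The two perfectly linear examples hit the ends of the range; real data usually lands somewhere in between, and a value near zero rules out only a linear relationship.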
Correlation Matrix
●Table showing correlations between multiple variables
●Diagonal always 1 (correlation with self)
●Symmetric across diagonal
●Often visualized with heatmaps
●How might a correlation matrix be useful in a data science project?
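The properties listed above (unit diagonal, symmetry) can be checked on a small example. This sketch builds a correlation matrix as a nested dictionary in plain Python; the column names and values are invented, and in practice a call such as pandas `DataFrame.corr()` would typically be used instead.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map each pair of column names to the correlation between those columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical rectangular data, stored column by column
columns = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 88],
    "errors": [9, 7, 6, 3, 1],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```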
Exploring Two Variables:
Numeric vs Numeric
●Scatterplots: Good for small datasets
●Hexagonal binning: For large datasets
●Contour plots: Show density like a
topographical map
●Why might hexagonal binning be
preferred over scatterplots for large
datasets?
Exploring Two Variables: Categorical vs Categorical
●Contingency tables: Show counts for each combination
●Include row/column percentages
●Visualize with grouped bar charts or heatmaps
●How might contingency tables help identify relationships
between categories?
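A contingency table with row percentages can be sketched with the standard library alone. The two categorical variables here (region and a yes/no response) and their values are hypothetical.

```python
from collections import Counter

# Hypothetical records: (region, subscribed)
records = [
    ("urban", "yes"), ("urban", "no"), ("urban", "yes"), ("rural", "no"),
    ("rural", "no"), ("rural", "yes"), ("urban", "yes"), ("rural", "no"),
]

counts = Counter(records)
row_labels = sorted({region for region, _ in records})
col_labels = sorted({answer for _, answer in records})

# Counts for each combination of the two categories
table = {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

# Row percentages: each row sums to 100
row_pct = {
    r: {c: 100 * table[r][c] / sum(table[r].values()) for c in col_labels}
    for r in row_labels
}

for r in row_labels:
    print(r, table[r], row_pct[r])
```

Comparing the row percentages, rather than the raw counts, is what makes differences between groups of unequal size visible.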
Exploring Two Variables:
Categorical vs Numeric
●Boxplots: Compare distributions across
categories
●Violin plots: Show full distribution shape
●Side-by-side bar charts: Compare means
or medians
●What advantages does a violin plot have
over a traditional boxplot?
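The numbers a boxplot displays for each category can be computed directly. The sketch below groups a numeric variable by a categorical one and derives a five-number summary per group; the group labels, values, and the "inclusive" quantile method are assumptions made for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical observations: (category, numeric value)
observations = [
    ("A", 12), ("A", 15), ("A", 11), ("A", 14),
    ("B", 22), ("B", 19), ("B", 25), ("B", 30), ("B", 21),
]

groups = defaultdict(list)
for category, value in observations:
    groups[category].append(value)

summary = {}
for category, values in groups.items():
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    summary[category] = {
        "min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values),
    }

for category, stats in summary.items():
    print(category, stats)
```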
Visualizing Multiple Variables
●Use conditioning (faceting)
●Create small multiples of plots
●Group by one or more categorical variables
●Allows comparison across groups
●How might faceting help reveal complex relationships in data?
Tools for EDA
●R: ggplot2, lattice packages
●Python: matplotlib, seaborn, pandas
●Business intelligence: Tableau, Spotfire
●Interactive environments: Jupyter notebooks, RStudio
●Have you used any of these tools? What
was your experience?
Best Practices for EDA
●Start with simple summaries and plots
●Iterate and refine your analysis
●Look for patterns and anomalies
●Generate hypotheses for further testing
●Document your process and findings
●Why is it important to document your exploratory analysis
process?
Common Pitfalls in EDA
●Overlooking data quality issues
●Ignoring outliers or extreme values
●Assuming normality or linearity
●Over-interpreting correlations
●Failing to consider domain knowledge
●Can you think of a situation where domain knowledge might be
crucial in EDA?
From EDA to Modeling
●Use EDA insights to guide feature selection
●Identify potential transformations needed
●Understand relationships to inform model choice
●Generate hypotheses for predictive modeling
●How might EDA influence your choice of machine learning
algorithm?
The Importance of EDA in Data Science
●Builds understanding of data and problem
●Guides further analysis and modeling
●Helps communicate findings to stakeholders
●Uncovers insights that models might miss
●Why do you think EDA is considered a cornerstone of data
science projects?
Conclusion: EDA as an
Iterative Process
●EDA is not a one-time step
●Revisit throughout the project lifecycle
●Combine with domain expertise and
creativity
●Continually ask questions and explore data
●How might you apply EDA techniques in
your future projects or studies?
Missing Data Handling in
EDA
●Identify patterns in missing data
●Visualize missingness with heatmaps or
matrix plots
●Consider imputation techniques (mean,
median, mode, or advanced methods)
●Analyse the impact of missing data on your
findings
●How might missing data affect your
analysis and subsequent modelling?
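A first pass at missing data usually counts what is missing per column before any imputation. The sketch below does that and then applies median imputation to one column, keeping a flag that records which values were filled in. The records, column names, and the `age_was_missing` flag are hypothetical; the flag is one possible way to keep the impact of imputation visible later.

```python
import statistics

# Hypothetical records; None marks a missing value
rows = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Mysuru"},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": 58000, "city": "Pune"},
    {"age": None, "income": 47000, "city": "Kochi"},
]
columns = ["age", "income", "city"]

# How much is missing in each column, as a count and as a share of rows
missing_counts = {c: sum(r[c] is None for r in rows) for c in columns}
missing_share = {c: missing_counts[c] / len(rows) for c in columns}

# Median imputation for age, computed from the observed values only
observed_ages = [r["age"] for r in rows if r["age"] is not None]
age_median = statistics.median(observed_ages)

imputed = [
    {**r,
     "age": age_median if r["age"] is None else r["age"],
     "age_was_missing": r["age"] is None}
    for r in rows
]

print(missing_counts, missing_share, age_median)
```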
Time Series Analysis in
EDA
●Examine trends, seasonality, and cyclical
patterns
●Use line plots, seasonal plots, and
decomposition techniques
●Analyse autocorrelation and partial
autocorrelation
●Identify potential forecasting approaches
●Can you think of real-world scenarios
where time series EDA would be crucial?
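Autocorrelation measures how strongly a series correlates with a lagged copy of itself, which is one way seasonality shows up numerically. The sketch below uses a common sample estimator (lagged products divided by the total sum of squared deviations); the series is synthetic, with a built-in period of 4.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)  # assumes the series is not constant
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom

# Synthetic series that repeats every 4 steps
series = [10, 20, 30, 20] * 6

acf = {lag: autocorrelation(series, lag) for lag in range(0, 9)}
for lag, value in acf.items():
    print(lag, round(value, 3))
```

For this series the autocorrelation is strongly positive at lags 4 and 8 and strongly negative at lag 2, which is the signature of a period-4 seasonal pattern.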
Dimensionality Reduction Techniques
●Principal Component Analysis (PCA) for visualizing
high-dimensional data
●t-SNE for non-linear dimensionality reduction
●UMAP for non-linear reduction that preserves local structure and often retains more global structure than t-SNE
●Use these techniques to identify clusters and patterns
●How might dimensionality reduction help in understanding
complex datasets?
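To make the idea behind PCA concrete, the sketch below finds only the first principal component of a small dataset: it centers the data, builds the covariance matrix, and uses power iteration to approximate the direction of greatest variance. Power iteration is a stand-in chosen to keep the example dependency-free; library implementations such as scikit-learn's `PCA` use other numerical methods. The data points and function name are illustrative.

```python
import math

def first_principal_component(points, iterations=200):
    """Return (direction, scores) for the first principal component."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]

    # Sample covariance matrix (d x d)
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeated multiplication converges to the top eigenvector
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))  # assumes the data has some variance
        v = [x / norm for x in w]

    # Project each centered point onto the component
    scores = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, scores

# Hypothetical 2-D points lying close to the line y = 2x
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
direction, scores = first_principal_component(points)
print(direction, scores)
```

The recovered direction has a slope close to 2, so a single coordinate along it captures nearly all of the variation in these two variables.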
Geographic Data
Exploration
●Utilize choropleth maps for regional
comparisons
●Create heat maps to show density of
events or phenomena
●Use interactive maps to explore spatial
relationships
●Consider spatial autocorrelation in your
analysis
●What insights can geographic
visualizations provide that tables or charts
cannot?
Text Data Analysis in EDA
●Use word clouds to visualize frequent terms
●Analyse sentiment and emotion in text data
●Explore topic modelling techniques like LDA
●Examine n-grams and co-occurrence patterns
●How might text analysis complement traditional numeric and
categorical EDA?
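Term frequencies and n-gram counts are the raw material behind word clouds and co-occurrence analysis. The sketch below tokenizes a short made-up sentence with a simple regular expression and counts unigrams and bigrams; the tokenization rule is a deliberate simplification, and real text usually needs more careful handling.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical text
text = ("Data science starts with data exploration, "
        "and data exploration starts with questions.")

# Lowercase and keep runs of letters (and apostrophes) as tokens
tokens = re.findall(r"[a-z']+", text.lower())

unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```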
Multivariate Analysis
Techniques
●Use parallel coordinates plots for
high-dimensional data
●Explore Andrews curves for pattern
recognition
●Utilize radar charts for comparing multiple
variables across categories
●Consider Chernoff faces for creative
multivariate visualization
●Which multivariate technique do you find
most intuitive, and why?
Anomaly Detection in EDA
●Use statistical methods (e.g., Z-score, IQR)
●Explore clustering-based anomaly detection
●Consider isolation forests for high-dimensional data
●Visualize anomalies using scatter plots or box plots
●How might identifying anomalies early in EDA impact your
overall analysis?
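The two statistical methods named above can be sketched in a few lines. The thresholds used here (an absolute Z-score above 3, and 1.5 times the IQR beyond the quartiles) are common conventions rather than values given in the slides, and the data is invented. Note that in very small samples a single extreme value inflates the standard deviation enough that its own Z-score may never exceed 3.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the values are not all identical
    return [v for v in values if abs(v - mean) / std > threshold]

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious value (95)
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 11, 10, 13, 12, 11, 12, 10, 11, 95]

print(zscore_outliers(data))
print(iqr_outliers(data))
```

Whether a flagged value is an error or a genuine extreme is a question for domain knowledge; the code only points at candidates.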
Interactive EDA Tools
●Explore Plotly for interactive Python
visualizations
●Use Shiny for creating interactive R
dashboards
●Consider D3.js for custom web-based
visualizations
●Utilize Tableau or Power BI for business
intelligence EDA
●How might interactive tools enhance your
EDA process and communication of
findings?
EDA for Big Data
●Use sampling techniques to explore large datasets
●Consider distributed computing frameworks (e.g., Spark)
●Utilize data summarization techniques
●Explore incremental and online learning approaches
●What challenges might you face when performing EDA on
extremely large datasets?
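Two of the ideas above, sampling and incremental summarization, can be sketched for data that arrives as a stream and does not fit in memory. The example pairs reservoir sampling (a uniform random sample of fixed size from a stream of unknown length) with Welford's one-pass algorithm for the running mean and variance. The class name, function name, seed, and the small stand-in stream are illustrative assumptions.

```python
import random

class RunningStats:
    """One-pass mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # item i is kept with probability k / (i + 1)
            if j < k:
                sample[j] = item
    return sample

# Stand-in for a large stream: the integers 1..1000, read once per pass
stats = RunningStats()
for value in range(1, 1001):
    stats.update(value)

sample = reservoir_sample(range(1, 1001), k=5)
print(stats.n, stats.mean, stats.variance, sample)
```

Both pieces keep only a constant amount of state, which is what makes them usable when the full dataset cannot be loaded at once.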
Ethical Considerations in EDA
●Be aware of potential biases in your data and analysis
●Consider privacy implications when exploring sensitive data
●Ensure transparency in your EDA process and findings
●Reflect on the societal impact of your analysis and
visualizations
●How can you ensure your EDA practices are ethical and
responsible?