BAD702_module1_statistical machine learning for Data Science

rahuls801107 1 views 31 slides Sep 17, 2025

About This Presentation

Statistical ML for Data Science


Slide Content

Module 1
Exploratory Data Analysis: The Foundation of Data Science

What is Exploratory Data
Analysis?
●First step in any data science
project
●Involves looking at and
summarizing data
●Pioneered by John Tukey in the 1960s and 1970s
●Forms foundation for data science
field
●How might exploratory analysis
differ from other types of data
analysis you've encountered?

Key Goals of EDA
●Gain intuition about the data
●Discover underlying patterns
●Check for anomalies and outliers
●Test hypotheses and assumptions
●Generate initial insights
●What do you think is the most important goal of exploratory data
analysis?

Types of Data
●Numeric: Continuous (e.g. height) or
Discrete (e.g. count)

●Categorical: Nominal (unordered), Binary (two
categories), or Ordinal (ordered)
Ex: type of TV screen (plasma, LCD,
LED, etc.), a state name (KA, MH, KL,
etc.)

●Structured vs Unstructured

●Can you think of examples of each data
type from your daily life?

Rectangular Data
●Most common format for analysis
●Rows = records/cases
●Columns = features/variables
●Similar to spreadsheets or database tables
●How is rectangular data similar to or different from data you've
worked with before?

Estimates of Location
●Mean: Average of all values
●Median: Middle value
●Mode: Most common value
●Trimmed mean: Average after dropping a fixed
share of the smallest and largest values
●Which measure do you think is
most useful, and why?
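
The four estimates above can be compared on a small sample. This is a minimal sketch using only Python's standard library; the data values, the helper name `trimmed_mean`, and the 10% trim fraction are assumptions chosen for illustration, not part of the slides.

```python
import statistics


def trimmed_mean(values, trim_fraction=0.1):
    """Mean after dropping the lowest and highest trim_fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # values cut from each end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Hypothetical sample with one extreme value (100)
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 100]

mean_value = statistics.mean(data)       # pulled upward by the extreme value
median_value = statistics.median(data)   # middle of the sorted values
mode_value = statistics.mode(data)       # most common value
trimmed_value = trimmed_mean(data, 0.1)  # drops the 2 and the 100

print(mean_value, median_value, mode_value, trimmed_value)
```

On this sample the mean sits far above the median and the trimmed mean, which shows how a single extreme value can dominate the mean.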

Estimates of Variability
●Range: Difference between max and min
●Variance: Average squared deviation from the mean
●Standard deviation: Square root of the variance
●Percentiles: Values below which a given % of observations fall
●How might understanding variability be useful in real-world
scenarios?
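
A short sketch of these variability estimates, again with the standard library only. The sample values are made up. Note one assumption worth flagging: `pvariance` divides by n (matching "average squared deviation"), while `variance` divides by n - 1, the usual choice when the data are a sample.

```python
import statistics

# Hypothetical sample
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)            # max minus min
pop_variance = statistics.pvariance(data)     # divides by n
sample_variance = statistics.variance(data)   # divides by n - 1
std_dev = statistics.pstdev(data)             # square root of pop_variance
quartiles = statistics.quantiles(data, n=4)   # 25th, 50th, 75th percentiles
iqr = quartiles[2] - quartiles[0]             # interquartile range

print(data_range, pop_variance, sample_variance, std_dev, quartiles, iqr)
```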

Visualizing Distributions
●Histograms: Show frequency of
data in bins
●Box Plots: Display quartiles and
outliers
●Density plots: Smoothed version of
histogram
●What additional insights can
visualizations provide compared to
numerical summaries?
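
To make "frequency of data in bins" concrete, here is a rough sketch that counts values in equal-width bins and prints a text histogram. The function name `histogram_counts`, the data, and the choice of four bins are illustrative assumptions; plotting libraries such as matplotlib do the same binning before drawing.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    if width == 0:
        # All values identical: put everything in the first bin
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value is placed in the last bin, not a new one
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample with a long right tail
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = histogram_counts(data, num_bins=4)

for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```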

Exploring Categorical Data
●Mode: Most frequent category
●Bar charts: Visual representation of frequencies
●Pie charts: Show proportions (use cautiously)
●When might a bar chart be more appropriate than a pie chart?

Correlation
●Measures association between two
variables
●Ranges from -1 (perfect negative)
to +1 (perfect positive)
●Visualized with scatterplots
●Correlation ≠ Causation
●Can you think of two variables that
might be correlated but not causally
related?
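
The -1 to +1 range can be checked with a small sketch of the Pearson correlation coefficient, which is the most common correlation measure (the slides do not name a specific one, so this choice is an assumption). The variable names and data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical data: one perfectly increasing and one perfectly decreasing pair
hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]
score_down = [10, 8, 6, 4, 2]

r_up = pearson_r(hours, score_up)      # perfect positive association
r_down = pearson_r(hours, score_down)  # perfect negative association
print(r_up, r_down)
```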

Correlation Matrix
●Table showing correlations between multiple variables
●Diagonal always 1 (correlation with self)
●Symmetric across diagonal
●Often visualized with heatmaps
●How might a correlation matrix be useful in a data science project?
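
The diagonal and symmetry properties can be verified directly. This sketch assumes NumPy is available and uses simulated columns `x`, `y`, `z` that are purely illustrative: `y` is built from `x`, while `z` is independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variables: y depends on x, z is unrelated
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
z = rng.normal(size=200)

data = np.column_stack([x, y, z])        # rows = records, columns = variables
corr = np.corrcoef(data, rowvar=False)   # 3 x 3 correlation matrix

print(np.round(corr, 2))
print("diagonal is 1:", np.allclose(np.diag(corr), 1.0))
print("symmetric:", np.allclose(corr, corr.T))
```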

Exploring Two Variables:
Numeric vs Numeric
●Scatterplots: Good for small datasets
●Hexagonal binning: For large datasets
●Contour plots: Show density like a
topographical map
●Why might hexagonal binning be
preferred over scatterplots for large
datasets?

Exploring Two Variables: Categorical vs Categorical
●Contingency tables: Show counts for each combination
●Include row/column percentages
●Visualize with grouped bar charts or heatmaps
●How might contingency tables help identify relationships
between categories?
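
A contingency table with row percentages can be sketched with the standard library. The records below are invented, reusing the TV screen type and state examples from earlier in the deck.

```python
from collections import Counter

# Hypothetical records: (screen type, state)
records = [
    ("LED", "KA"), ("LED", "MH"), ("LCD", "KA"),
    ("LED", "KA"), ("plasma", "KL"), ("LCD", "MH"),
]

counts = Counter(records)
screen_types = sorted({screen for screen, _ in records})
states = sorted({state for _, state in records})

# Counts for every combination, including combinations that never occur
table = {s: {st: counts[(s, st)] for st in states} for s in screen_types}

# Row percentages: share of each state within a screen type
row_pct = {
    s: {st: 100 * n / sum(row.values()) for st, n in row.items()}
    for s, row in table.items()
}

for s in screen_types:
    print(s, table[s], {st: round(p, 1) for st, p in row_pct[s].items()})
```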

Exploring Two Variables:
Categorical vs Numeric
●Boxplots: Compare distributions across
categories
●Violin plots: Show full distribution shape
●Side-by-side bar charts: Compare means
or medians
●What advantages does a violin plot have
over a traditional boxplot?
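
A boxplot per category is drawn from a five-number summary of each group. The sketch below computes those summaries; the group labels and values are hypothetical, and the "inclusive" quantile method is one of several conventions, chosen here as an assumption.

```python
import statistics
from collections import defaultdict

# Hypothetical (category, numeric value) observations
observations = [
    ("A", 12), ("A", 15), ("A", 14), ("A", 40),
    ("B", 22), ("B", 25), ("B", 24), ("B", 23),
]

# Collect the numeric values of each category
groups = defaultdict(list)
for category, value in observations:
    groups[category].append(value)

# Five-number summary per category: the numbers a boxplot displays
summary = {}
for category, values in groups.items():
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    summary[category] = {
        "min": min(values), "q1": q1, "median": med,
        "q3": q3, "max": max(values),
    }

for category, stats in summary.items():
    print(category, stats)
```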

Visualizing Multiple Variables
●Use conditioning (faceting)
●Create small multiples of plots
●Group by one or more categorical variables
●Allows comparison across groups
●How might faceting help reveal complex relationships in data?

Tools for EDA
●R: ggplot2, lattice packages
●Python: matplotlib, seaborn, pandas
●Business intelligence: Tableau, Spotfire
●Interactive environments: Jupyter notebooks, RStudio
●Have you used any of these tools? What
was your experience?

Best Practices for EDA
●Start with simple summaries and plots
●Iterate and refine your analysis
●Look for patterns and anomalies
●Generate hypotheses for further testing
●Document your process and findings
●Why is it important to document your exploratory analysis
process?

Common Pitfalls in EDA
●Overlooking data quality issues
●Ignoring outliers or extreme values
●Assuming normality or linearity
●Over-interpreting correlations
●Failing to consider domain knowledge
●Can you think of a situation where domain knowledge might be
crucial in EDA?

From EDA to Modeling
●Use EDA insights to guide feature selection
●Identify potential transformations needed
●Understand relationships to inform model choice
●Generate hypotheses for predictive modeling
●How might EDA influence your choice of machine learning
algorithm?

The Importance of EDA in Data Science
●Builds understanding of data and problem
●Guides further analysis and modeling
●Helps communicate findings to stakeholders
●Uncovers insights that models might miss
●Why do you think EDA is considered a cornerstone of data
science projects?

Conclusion: EDA as an
Iterative Process
●EDA is not a one-time step
●Revisit throughout the project lifecycle
●Combine with domain expertise and
creativity
●Continually ask questions and explore data
●How might you apply EDA techniques in
your future projects or studies?

Missing Data Handling in
EDA
●Identify patterns in missing data
●Visualize missingness with heatmaps or
matrix plots
●Consider imputation techniques (mean,
median, mode, or advanced methods)
●Analyse the impact of missing data on your
findings
●How might missing data affect your
analysis and subsequent modelling?
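
Counting missing values and applying median imputation can be sketched as follows. The column `ages`, its values, and the helper `is_missing` are assumptions for illustration; median imputation is only one of the options listed above and is not always appropriate.

```python
import math
import statistics

# Hypothetical column with two kinds of missing markers
ages = [34, None, 29, 41, float("nan"), 38, 29]


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


missing_count = sum(is_missing(v) for v in ages)
missing_share = missing_count / len(ages)

# Median of the observed values is used as the fill value
observed = [v for v in ages if not is_missing(v)]
fill_value = statistics.median(observed)
imputed = [fill_value if is_missing(v) else v for v in ages]

print(f"missing: {missing_count} of {len(ages)} ({missing_share:.0%})")
print(imputed)
```

Comparing summaries before and after imputation is one simple way to judge how much the filled values change the findings.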

Time Series Analysis in
EDA
●Examine trends, seasonality, and cyclical
patterns
●Use line plots, seasonal plots, and
decomposition techniques
●Analyse autocorrelation and partial
autocorrelation
●Identify potential forecasting approaches
●Can you think of real-world scenarios
where time series EDA would be crucial?
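
Autocorrelation measures how strongly a series correlates with a lagged copy of itself. Below is a minimal sketch of the usual sample autocorrelation estimate; the function name and the artificial series with a period of 4 are assumptions made for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    numer = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return numer / denom


# Hypothetical series that repeats every 4 steps
series = [10, 20, 30, 20] * 6

acf = {lag: autocorrelation(series, lag) for lag in range(9)}
for lag, value in acf.items():
    print(f"lag {lag}: {value:+.2f}")
```

Peaks at lags 4 and 8 reflect the repeating pattern, which is the kind of signal that suggests seasonality.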

Dimensionality Reduction Techniques
●Principal Component Analysis (PCA) for visualizing
high-dimensional data
●t-SNE for non-linear dimensionality reduction
●UMAP for non-linear reduction that aims to preserve local and some global structure
●Use these techniques to identify clusters and patterns
●How might dimensionality reduction help in understanding
complex datasets?
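
As a rough illustration of the first technique, PCA can be computed with a singular value decomposition of the centered data. This sketch assumes NumPy is available; the function name `pca_project` and the simulated four-column data set (driven by one hidden factor) are hypothetical. t-SNE and UMAP need dedicated libraries and are not shown.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]             # principal directions
    explained_ratio = S ** 2 / np.sum(S ** 2)  # share of variance per component
    scores = X_centered @ components.T         # coordinates in the new space
    return scores, explained_ratio[:n_components]


rng = np.random.default_rng(42)

# Hypothetical data: four columns that all follow one latent factor plus noise
latent = rng.normal(size=(100, 1))
X = np.hstack([latent, 2 * latent, -latent, 0.5 * latent])
X = X + rng.normal(scale=0.05, size=(100, 4))

scores, ratio = pca_project(X, n_components=2)
print(scores.shape, np.round(ratio, 3))
```

Because the columns share one underlying factor, the first component should capture nearly all of the variance.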

Geographic Data
Exploration
●Utilize choropleth maps for regional
comparisons
●Create heat maps to show density of
events or phenomena
●Use interactive maps to explore spatial
relationships
●Consider spatial autocorrelation in your
analysis
●What insights can geographic
visualizations provide that tables or charts
cannot?

Text Data Analysis in EDA
●Use word clouds to visualize frequent terms
●Analyse sentiment and emotion in text data
●Explore topic modelling techniques like LDA
●Examine n-grams and co-occurrence patterns
●How might text analysis complement traditional numeric and
categorical EDA?
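
Counting n-grams needs little more than tokenizing and sliding a window over the tokens. The sketch below uses a deliberately simple regular-expression tokenizer and an invented snippet of review text; both are assumptions, and real text usually needs more careful cleaning.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count n-grams of lowercase word tokens in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n shifted copies of the token list yields sliding windows
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Hypothetical review text
reviews = (
    "The screen is bright and the screen is sharp. "
    "The sound is clear."
)

unigrams = ngram_counts(reviews, n=1)
bigrams = ngram_counts(reviews, n=2)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

This simple version lets bigrams cross sentence boundaries; splitting into sentences first would avoid that.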

Multivariate Analysis
Techniques
●Use parallel coordinates plots for
high-dimensional data
●Explore Andrews curves for pattern
recognition
●Utilize radar charts for comparing multiple
variables across categories
●Consider Chernoff faces for creative
multivariate visualization
●Which multivariate technique do you find
most intuitive, and why?

Anomaly Detection in EDA
●Use statistical methods (e.g., Z-score, IQR)
●Explore clustering-based anomaly detection
●Consider isolation forests for high-dimensional data
●Visualize anomalies using scatter plots or box plots
●How might identifying anomalies early in EDA impact your
overall analysis?
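
The two statistical rules named in the first bullet can be sketched with the standard library. The thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than values given in the slides, and the data and function names are made up. The z-score helper assumes the values are not all identical.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Values more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumes sd > 0
    return [v for v in values if abs(v - mean) / sd > threshold]


def iqr_outliers(values, k=1.5):
    """Values beyond k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical readings with one suspicious value
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]

z_flags = zscore_outliers(data)
iqr_flags = iqr_outliers(data)
print(z_flags, iqr_flags)
```

The extreme value inflates both the mean and the standard deviation, so the z-score rule tends to be less robust than the IQR rule on small samples.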

Interactive EDA Tools
●Explore Plotly for interactive Python
visualizations
●Use Shiny for creating interactive R
dashboards
●Consider D3.js for custom web-based
visualizations
●Utilize Tableau or Power BI for business
intelligence EDA
●How might interactive tools enhance your
EDA process and communication of
findings?

EDA for Big Data
●Use sampling techniques to explore large datasets
●Consider distributed computing frameworks (e.g., Spark)
●Utilize data summarization techniques
●Explore incremental and online learning approaches
●What challenges might you face when performing EDA on
extremely large datasets?
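
One way to summarize data that does not fit in memory is to update statistics one value at a time. The sketch below uses Welford's online algorithm for the mean and sample variance; the class name `RunningStats` and the chunked input are assumptions for illustration, standing in for batches read from a large file or stream.

```python
class RunningStats:
    """Single-pass mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance (divides by n - 1); 0.0 until two values are seen
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


# Hypothetical chunks, as if read one batch at a time
chunks = ([2, 4, 4], [4, 5, 5], [7, 9])

stats = RunningStats()
for chunk in chunks:
    for value in chunk:
        stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept in memory regardless of how many values are processed.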

Ethical Considerations in EDA
●Be aware of potential biases in your data and analysis
●Consider privacy implications when exploring sensitive data
●Ensure transparency in your EDA process and findings
●Reflect on the societal impact of your analysis and
visualizations
●How can you ensure your EDA practices are ethical and
responsible?