BAD702 Module 2: Statistical Machine Learning for Data Science
rahuls801107
26 slides
Sep 17, 2025
Module 2 - Sampling and Distributions in Statistics
Introduction to Sampling
●Sampling is selecting a subset of data from a larger population
●Random sampling gives each member an equal chance of selection
●Quality of sample often matters more than quantity
●Why do you think random sampling is important in statistics?
Types of Sampling
●Simple random sampling: Each element has equal chance of selection
●Stratified sampling: Population divided into subgroups (strata)
●Sampling with/without replacement
●What are some advantages of stratified sampling?
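The contrast between simple random and stratified sampling can be sketched in a few lines of Python. The population, group labels, and 10% sampling fraction below are made up purely for illustration:

```python
import random

random.seed(0)
# Toy population: 900 members of group "A", 100 of group "B"
population = [("A", i) for i in range(900)] + [("B", i) for i in range(100)]

def stratified_sample(pop, frac):
    # Split the population into strata by group label,
    # then draw a simple random sample within each stratum.
    strata = {}
    for group, member in pop:
        strata.setdefault(group, []).append((group, member))
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, int(len(members) * frac)))
    return sample

sample = stratified_sample(population, 0.10)
# Each stratum contributes exactly its proportional share: 90 "A"s and 10 "B"s,
# whereas a simple random sample of 100 could easily under-represent group "B".
```

This guaranteed representation of small subgroups is one of the main advantages of stratified sampling.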
Sample Bias
●Sample bias: When the sample differs from the population in a meaningful way
●Can lead to incorrect conclusions about the population
●Example: Literary Digest poll of 1936 predicting a Landon victory
●How might online surveys introduce sample bias?
Central Limit Theorem
●Means from multiple samples resemble a normal distribution
Imagine a population of anything: people's heights, the number of coffee cups sold per day at a cafe, the results of rolling a weirdly shaped die. The distribution of this population can be anything. It could be skewed, uniform, or completely irregular. The CLT says that if you take enough sufficiently large random samples from this population, calculate the mean of each sample, and then plot those sample means, the resulting distribution will be approximately normal.
●True even if the source population is not normally distributed
●Requires a large enough sample size
●Why is this theorem important for statistical inference?
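The CLT can be checked by simulation. The exponential population, sample size of 40, and 2,000 repetitions below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)

# Population: exponential with mean 1.0 (strongly right-skewed, far from normal).
# Draw 2,000 samples of size 40 and record each sample's mean.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(40))
    for _ in range(2000)
]

# The sample means cluster symmetrically around the population mean (1.0),
# with spread close to sigma / sqrt(n) = 1 / sqrt(40), about 0.158.
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
```

A histogram of `sample_means` would look bell-shaped even though the underlying population is heavily skewed.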
Standard Error
●Measures variability in the sampling distribution of a statistic
●Estimated using sample standard deviation (s) and sample size (n)
●Standard Error = s / √n
●How does increasing sample size affect standard error?
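A quick sketch of the formula and its key behaviour; the simulated population with mean 50 and SD 10 is invented for illustration:

```python
import math
import random
import statistics

random.seed(1)
population = [random.gauss(50, 10) for _ in range(100_000)]

def standard_error(sample):
    # SE = s / sqrt(n), with s the sample standard deviation
    return statistics.stdev(sample) / math.sqrt(len(sample))

se_100 = standard_error(population[:100])   # n = 100
se_400 = standard_error(population[:400])   # n = 400
# Quadrupling the sample size roughly halves the standard error,
# because SE shrinks with the square root of n.
```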
The Bootstrap
●Resampling technique to estimate the sampling distribution
●Draw samples with replacement from observed data
●Recalculate the statistic for each resample
●In the bootstrap method, a sample of size n is drawn from a population; call this sample S. Then, rather than using theory to determine all possible estimates, a sampling distribution is created by resampling observations from S with replacement m times, where each resampled set contains n observations. With proper sampling, S will be representative of the population. Thus, by resampling S m times with replacement, it is as if m samples were drawn from the original population, and the derived estimates approximate the sampling distribution given by the traditional theoretical approach.
●What advantages does bootstrapping have over traditional methods?
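The procedure described above is only a few lines of code. The ten observed values and m = 5,000 resamples below are made up for illustration:

```python
import random
import statistics

random.seed(42)
observed = [12, 15, 9, 22, 17, 14, 30, 11, 19, 16]  # the original sample S, n = 10

# Resample S with replacement m times; each resample also has n observations.
m = 5000
boot_means = [
    statistics.mean(random.choices(observed, k=len(observed)))
    for _ in range(m)
]

# The spread of the bootstrap means estimates the standard error of the mean,
# with no distributional theory required.
boot_se = statistics.stdev(boot_means)
```

One advantage over the theoretical route is that the same loop works unchanged for statistics with no simple standard-error formula, such as the median.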
Confidence Intervals
●Range of values likely to contain the true population parameter
●Usually expressed with a confidence level (e.g. 95%)
●Wider interval = more confident, but less precise
●How would you interpret a 95% confidence interval?
Confidence Interval = Point Estimate ± Margin of Error
Margin of Error = Critical Value × Standard Error
●Critical Value (z or t): A number from a statistical table (like the Z-table or T-table) that corresponds to your chosen confidence level. For a 95% confidence level, the common critical value is 1.96.
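Putting the two formulas together in code. The ten height measurements are invented, and 1.96 is the z critical value for 95% confidence (for a sample this small a t critical value would strictly be more appropriate):

```python
import math
import statistics

heights_cm = [170, 165, 180, 175, 168, 172, 178, 169, 174, 171]  # made-up sample

n = len(heights_cm)
point_estimate = statistics.mean(heights_cm)               # 172.2
std_error = statistics.stdev(heights_cm) / math.sqrt(n)    # s / sqrt(n)

margin_of_error = 1.96 * std_error   # critical value x standard error
ci = (point_estimate - margin_of_error, point_estimate + margin_of_error)
```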
Confidence Intervals
A confidence interval is a range of values that we are confident contains the true value of a population parameter, like the average or a proportion. Since it's often impossible to survey an entire population, we take a sample and use its data to estimate the population's true value.
Imagine you want to know the average height of all students at your school. You can't measure everyone, so you measure a random sample of 50 students. You find the average height of your sample is 5'7". This single number is called a point estimate. However, it's highly unlikely that the sample average is exactly the same as the true average for the entire school.
That's where confidence intervals come in. Instead of just giving a single number, you provide a range. For example, you might say, "I am 95% confident that the true average height of all students is between 5'5" and 5'9"." The range from 5'5" to 5'9" is your confidence interval.
Normal Distribution
●Bell-shaped, symmetric distribution
●Defined by mean and standard deviation
●About 68% of data falls within 1 SD of the mean, about 95% within 2 SD
●Why is the normal distribution important in statistics?
Standard Normal Distribution
●Normal distribution with mean = 0 and SD = 1
●Data converted to z-scores
●Used for comparisons across different scales
●How do you calculate a z-score?
z = (x − μ) / σ
●A positive z-score means the data point is above the mean.
●A negative z-score means the data point is below the mean.
●A z-score of 0 means the data point is exactly equal to the mean.
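The formula translates directly into code. The exam mean of 70 and SD of 8 are made-up numbers for illustration:

```python
def z_score(x, mu, sigma):
    # z = (x - mu) / sigma: how many SDs x lies from the mean
    return (x - mu) / sigma

# Exam with mean 70 and SD 8 (illustrative values):
z_score(86, 70, 8)   # 2.0  -> two SDs above the mean
z_score(62, 70, 8)   # -1.0 -> one SD below the mean
z_score(70, 70, 8)   # 0.0  -> exactly at the mean
```

Because z-scores are unit-free, a score of 86 on this exam can be compared directly with, say, a height measured in centimetres.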
QQ-Plots
●Used to visually assess if data follows a normal distribution
●Plots sample quantiles against theoretical quantiles
●Straight line indicates normal distribution
●What patterns in a QQ-plot suggest non-normality?
Long-Tailed Distributions
●Many real-world datasets are not normally
distributed
●Have "tails" extending further than normal
distribution
●Examples: income data, stock returns
●Why is it important to recognise long-tailed
distributions?
Student's t-Distribution
●Similar to normal but with thicker tails
●Used for small sample sizes
●Approaches normal distribution as sample size increases
●When would you use a t-distribution instead of normal?
Binomial Distribution
●Models number of successes in fixed
number of trials
●Each trial has two possible outcomes
(success/failure)
●Defined by number of trials (n) and
probability of success (p)
●Can you give an example of a binomial
experiment?
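The probability mass function follows directly from n and p. The coin-flip example below (a classic binomial experiment) uses illustrative numbers:

```python
import math

def binomial_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Binomial experiment: 10 fair coin flips, "success" = heads
p_five_heads = binomial_pmf(5, 10, 0.5)   # ~0.246
```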
Chi-Square Distribution
●Used to analyse categorical data
●Measures departure from expected frequencies
●Applications: goodness-of-fit tests, independence tests
●How might you use a chi-square test in research?
F-Distribution
●Used in analysis of variance (ANOVA)
●Compares variability between and within groups
●Ratio of two chi-square variables, each divided by its degrees of freedom
●What does a large F-statistic indicate?
Poisson Distribution
●Models number of events in fixed interval of time or space
●Events occur at constant average rate
●Mean = Variance = λ (lambda)
●Can you think of a real-world example of a Poisson process?
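A short check of the "Mean = Variance = λ" property. The cafe arrival rate of 4 customers per minute is an invented example of a Poisson process:

```python
import math

def poisson_pmf(k, lam):
    # P(X = k) = lam^k * e^(-lam) / k!
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 4.0  # e.g. a cafe averaging 4 customer arrivals per minute
p_two = poisson_pmf(2, lam)   # probability of exactly 2 arrivals in a minute

# Mean and variance computed from the pmf both come out to lambda.
mean = sum(k * poisson_pmf(k, lam) for k in range(100))
variance = sum((k - mean)**2 * poisson_pmf(k, lam) for k in range(100))
```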
Exponential Distribution
●Models time between events in a Poisson process
●Memoryless property: P(T > s + t | T > s) = P(T > t)
●Related to the Poisson distribution
●How might this distribution be used in reliability analysis?
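The memoryless property can be verified by simulation. The rate of 0.5 and the values s = 1, t = 2 below are arbitrary choices:

```python
import random

random.seed(7)
rate = 0.5  # lambda: average events per unit time
waits = [random.expovariate(rate) for _ in range(200_000)]

s, t = 1.0, 2.0
# P(T > s + t | T > s): among waits already longer than s,
# the fraction that also exceed s + t.
p_conditional = sum(w > s + t for w in waits) / sum(w > s for w in waits)
# P(T > t): the unconditional fraction exceeding t.
p_unconditional = sum(w > t for w in waits) / len(waits)
# Both come out close to exp(-rate * t): having already waited s units
# tells you nothing about the remaining wait.
```

In reliability terms, a component with exponentially distributed lifetime does not "age": its chance of surviving another t hours is the same whether it is new or has already run for s hours.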
Weibull Distribution
●Generalisation of the exponential distribution
●Models time-to-failure with a changing failure rate
●Shape parameter β determines if the failure rate increases or decreases
●Where might you encounter Weibull distributions in engineering?
Conclusion
●Sampling and distributions are fundamental to statistics
●Understanding these concepts is crucial for data analysis
●Real-world data often deviates from theoretical distributions
●How will you apply these concepts in your future work?
Sampling Distribution of the Sample Proportion
●Used when dealing with categorical data
●Approximates normal distribution for large samples
●Mean of distribution = population proportion (p)
●Standard error = √[p(1-p)/n]
●How does this distribution relate to the Central Limit Theorem?
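The formulas above in code; the poll result (62 "yes" out of 100) is invented for illustration:

```python
import math

yes, n = 62, 100
p_hat = yes / n                            # sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error: sqrt[p(1-p)/n]

# Large-sample 95% confidence interval, relying on the CLT-based
# normal approximation to the sampling distribution of p_hat:
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
```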
Multivariate Distributions
●Describe the behaviour of multiple random variables simultaneously
●Examples: multivariate normal, multivariate t-distribution
●Useful in fields like finance and environmental science
●Correlation between variables is a key consideration
●Can you think of a real-world scenario where multivariate distributions might be applied?
Kernel Density Estimation
●Non-parametric method to estimate a probability density function
●Smooths out the data to create a continuous distribution
●Kernel function determines the shape of the estimate
●Bandwidth parameter controls the smoothness
●How might this technique be useful when dealing with non-standard distributions?
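A minimal KDE with a Gaussian kernel makes the roles of the kernel and the bandwidth concrete. The bimodal toy data and the bandwidth of 0.5 are invented for illustration:

```python
import math

def gaussian_kernel(u):
    # Standard normal density: the "bump" placed at each observation
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, bandwidth):
    # Average of kernel bumps centred at each data point;
    # the bandwidth controls how wide (smooth) each bump is.
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

data = [1.2, 1.9, 2.1, 2.4, 7.8, 8.1, 8.4]  # bimodal toy sample
near_mode = kde(2.0, data, bandwidth=0.5)    # high density near the first cluster
between = kde(5.0, data, bandwidth=0.5)      # near-zero density between clusters
```

Because no parametric family is assumed, the estimate follows whatever shape the data has, which is exactly what makes KDE useful for non-standard distributions.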
Mixture Distributions
●Combination of two or more probability distributions
●Can model complex, multi-modal data
●Often used in clustering and classification problems
●Example: Gaussian Mixture Models
●Why might a single distribution be insufficient for some datasets?
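Sampling from a two-component Gaussian mixture shows how multi-modal data arises, and why a single distribution cannot describe it. The weights 0.7/0.3 and the component means 0 and 6 are arbitrary:

```python
import random
import statistics

random.seed(3)

def sample_mixture():
    # 70% chance of N(0, 1), 30% chance of N(6, 1): a bimodal distribution
    if random.random() < 0.7:
        return random.gauss(0, 1)
    return random.gauss(6, 1)

draws = [sample_mixture() for _ in range(50_000)]
# The overall mean is the weighted mean of the components:
# 0.7 * 0 + 0.3 * 6 = 1.8, a value that lies in a low-density
# region between the two modes.
overall_mean = statistics.mean(draws)
```

A single Gaussian fitted to `draws` would centre its peak near 1.8, where hardly any data actually falls; a Gaussian Mixture Model fits each mode separately.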
Copulas
●Functions that describe dependence between random variables
●Allow separation of marginal distributions from the dependency structure
●Widely used in finance and risk management
●Examples: Gaussian copula, t-copula, Archimedean copulas
●How do copulas extend our understanding of multivariate distributions?