Unit 4a- Sampling Distribution (Slides - up to slide 21).pdf


About This Presentation

Probability: Sampling Distribution


Slide Content

Sampling Distributions
UNIT 4
OPRE 6359
1

Statistical Inference and Sampling
When we study a population, we are often interested in estimating a parameter.
Because the population is typically a large group that is not easily accessible, an exhaustive census is difficult and expensive.
Thus, we use representative samples to study the characteristics of the population and then make inferences about the parameter based on the value of the sample statistic. Sampling is therefore critical to statistical inference.
2

Statistical Inference – Concept
Statistical inference allows us to use samples to draw conclusions about populations.
A sample provides an estimate of the population parameter.
Generally we are interested in the Mean, the Proportion, and the Variance/Standard Deviation of the population.
3

Sampling Distribution – an Example
Suppose we are sampling from a population that has a mean of μ = 5 and is right-skewed.

library(dplyr)
library(ggplot2)

# Population is an Exponential distribution with mean = 5 (rate = 0.2)
PopDist <- data.frame(x = seq(0, 20, length = 10000)) %>%
  mutate(density = dexp(x, rate = 0.2))
ggplot(PopDist, aes(x = x, y = density)) +
  geom_area(fill = 'salmon') +
  ggtitle('Population Distribution')
4

Sampling Distribution – an Example
Consider now a sample of size 2 from this population. Denote the two independent observations by X1 and X2, and let their average be X̄ = (X1 + X2)/2.
5
Rules about Expectations and Variances
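Using the rules for expectations and variances of independent random variables, a short derivation (a sketch; only the headline results appear on the later slides) gives the mean and variance of X̄ for n = 2:

\[
E(\bar{X}) = \tfrac{1}{2}\bigl(E(X_1) + E(X_2)\bigr) = \tfrac{1}{2}(\mu + \mu) = \mu,
\qquad
V(\bar{X}) = \tfrac{1}{4}\bigl(V(X_1) + V(X_2)\bigr) = \tfrac{1}{4}(\sigma^2 + \sigma^2) = \frac{\sigma^2}{2}.
\]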

Sampling Distribution – an Example
Now suppose we want to estimate the population mean μ by taking a random sample of n = 5.
Notice that the sample mean is never exactly 5, but it is close to it.

n <- 5  # our sample size
mosaic::do(3) * {
  Sample.Data <- data.frame(x = rexp(n, rate = 0.2))
  Sample.Data %>% summarise(xbar = mean(x))
}

      xbar
1 4.515265
2 5.430437
3 3.416277
6

Sampling Distribution – a Simulation

n <- 5
SampDist <- mosaic::do(10000) * {
  Sample.Data <- data.frame(x = rexp(n, rate = 0.2))
  Sample.Data %>% summarise(xbar = mean(x))
}
ggplot() +
  geom_area(data = PopDist, aes(x = x, y = density), fill = 'salmon') +
  geom_histogram(data = SampDist, aes(x = xbar, y = ..density..),
                 binwidth = 0.1, alpha = 0.6)
7

Conclusions from the Graphs
1. The sampling distribution of X̄ is centered at the population mean μ.
2. The sampling distribution of X̄ has less spread than the population distribution.
3. The sampling distribution of X̄ is less skewed than the population distribution.
Thus problems with skewness and departures from normality in the data are reduced or removed when working with sampling distributions.
8

Sampling Distribution of the Mean
What do you think will happen to the sampling distribution of the
mean as we increase the sample size, n?
Let us investigate…
9

Sampling Distribution and Sample Size
10
[Figure: sampling distributions of X̄ for n = 5, n = 20, n = 50, n = 100]

Sampling Distribution and Sample Size
11
[Figure: sampling distributions of X̄ for n = 5, n = 20, n = 50, n = 100]
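The panels above can be reproduced with a short simulation. This is a minimal sketch (the plotting code for these slides is not shown); it reuses the Exponential(rate = 0.2) population from the earlier slides and facets the simulated sampling distributions by sample size:

library(dplyr)
library(ggplot2)

# Simulate the sampling distribution of xbar for several sample sizes
sim_xbar <- function(n, reps = 10000) {
  data.frame(n = n, xbar = replicate(reps, mean(rexp(n, rate = 0.2))))
}
SimData <- bind_rows(lapply(c(5, 20, 50, 100), sim_xbar))

ggplot(SimData, aes(x = xbar)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~ n, labeller = label_both) +
  ggtitle('Sampling Distribution of the Mean by Sample Size')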

Some important facts
1. The expected value of the sample mean, E(X̄) = μ, does not depend on the sample size n.
2. The variance V(X̄) does depend on n, and it shrinks to zero as n approaches ∞.
3. These calculations of the mean and variance do not depend on the distribution of the population.
12

Sampling Distribution of the Mean
These relationships define the sampling distribution of X̄:
E(X̄) = μ
V(X̄) = σ²/n
Standard error: SE(X̄) = √(σ²/n) = σ/√n
13
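As a quick sanity check (a sketch, not on the slides), the standard deviation of the simulated xbar values from the do(10000) simulation above should be close to σ/√n; for the Exponential population with rate 0.2, σ = 5:

library(dplyr)
n <- 5
SampDist <- mosaic::do(10000) * {
  Sample.Data <- data.frame(x = rexp(n, rate = 0.2))
  Sample.Data %>% summarise(xbar = mean(x))
}
sd(SampDist$xbar)  # simulated standard error of the mean
5 / sqrt(n)        # theoretical sigma / sqrt(n) = 2.236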

Central Limit Theorem
Let X1, …, Xn be independent observations collected from a distribution with expectation μ and variance σ². Then the distribution of X̄ converges to a normal distribution with expectation μ and variance σ²/n as n → ∞.
In practice this means that if n is large (usually n > 30 is sufficient), then X̄ is approximately N(μ, σ²/n).
14

Central Limit Theorem
If the population from which successive samples are taken has a normal distribution, then X̄ ~ N(μ, σ²/n) exactly.
If the population is not normally distributed, then:
For any infinite population with mean μ and variance σ², the sampling distribution of X̄ is well approximated by the normal distribution with mean μ and variance σ²/n, provided that n is sufficiently large.
How large is "sufficiently large" depends on the extent of non-normality of X (heavily skewed, multimodal). In general, the larger the sample size, the more closely the sampling distribution of X̄ will resemble a normal distribution (n > 30 is a common rule of thumb).
15
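Equivalently (a standard restatement, not shown verbatim on the slides), the standardized sample mean is approximately standard normal for large n, which is the form behind the probability calculations that follow:

\[
Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \;\approx\; N(0, 1) \quad \text{for large } n.
\]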

Finite Population Correction Factor
For a finite population of size N, the standard error of X̄ should be corrected to:
SE(X̄) = (σ/√n) · √((N − n)/(N − 1))
The usual rule of thumb is to treat N as effectively infinite (no correction needed) if it is at least 20 times larger than n.
16
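A small numeric illustration of the correction (the values of σ, n, and N below are hypothetical, chosen only for this sketch; they are not from the slides):

# Hypothetical numbers chosen only to illustrate the correction factor
sigma <- 10; n <- 50; N <- 500
se_infinite <- sigma / sqrt(n)                          # sigma / sqrt(n)
se_corrected <- se_infinite * sqrt((N - n) / (N - 1))   # with finite population correction
c(se_infinite = se_infinite, se_corrected = se_corrected)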

Sampling Distribution of the Mean – Example
A researcher is studying salaries of recently graduated MS students from a very large population. She sampled 3 alumni and assumes their answers are independent. The population is N(μ = 70000, σ = 12000).
1. What is the probability that the first observation is greater than $80,000?

1 - pnorm(80000, mean = 70000, sd = 12000)
[1] 0.2023284
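Written out as a z-score calculation, this is the same computation as the pnorm call above:

\[
P(X_1 > 80000) = P\!\left(Z > \frac{80000 - 70000}{12000}\right) = P(Z > 0.833) \approx 0.2023.
\]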
17

Sampling Distribution of the Mean – Example
2. What is the probability that the sample mean is greater than $80,000?

std_error <- 12000 / sqrt(3)
std_error
[1] 6928.203
1 - pnorm(80000, mean = 70000, sd = std_error)
[1] 0.07445734
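The same calculation written out with the standard error of the mean:

\[
P(\bar{X} > 80000) = P\!\left(Z > \frac{80000 - 70000}{12000/\sqrt{3}}\right) = P(Z > 1.443) \approx 0.0745.
\]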
18

To Calculate in R – Example

# To create the graph with the Normal distribution of X (salaries in $1000s)
distr <- data.frame(x = seq(22, 118, length = 1000)) %>%
  mutate(density = dnorm(x, mean = 70, sd = 12),
         group = ifelse(x <= 80, 'Lower', 'Higher'))
ggplot(distr, aes(x = x, y = density, fill = group)) +
  geom_line() +
  geom_area() +
  theme_bw()

# To create the graph with the sampling distribution of xbar (in $1000s)
distr <- data.frame(x_bar = seq(22, 118, length = 1000)) %>%
  mutate(density = dnorm(x_bar, mean = 70, sd = 6.93),
         group = ifelse(x_bar <= 80, 'Lower', 'Higher'))
ggplot(distr, aes(x = x_bar, y = density, fill = group)) +
  geom_line() +
  geom_area() +
  theme_bw()
19

20
Compare X and X̄

21
Compare X and X̄

Sampling Distribution of the Proportion
The central limit theorem also applies to “sample proportions.”
Let X be a binomial random variable with parameters n and p. Since each trial results in either a “success” or a “failure,” we can define for trial i a variable Xi that equals 1 if we have a success and 0 otherwise.
Then the proportion of trials that resulted in a success is given by:
p̂ = (X1 + X2 + … + Xn)/n = X/n
22

Sampling Distribution of the Proportion
The sample proportion p̂ is approximately normally distributed with mean p, provided np and n(1−p) are both at least 10.
Standard error of the proportion: SE(p̂) = √(p(1−p)/n)
23

To Calculate in R – Example
A company hired 50 people from a pool of qualified candidates. If the pool contains 30% females, and only 5 out of the 50 hired were females, can we conclude that there is gender discrimination in hiring?
p̂ is defined as the proportion of hires that are female; here p̂ = 5/50 = 0.10.

sd_prop <- sqrt((0.3 * 0.7) / 50)
pnorm(0.10, mean = 0.30, sd = sd_prop)
[1] 0.001014116

This is the probability of observing 10% or fewer female hires when the proportion of females in the pool is 30%. Such a small probability makes this a very rare occurrence under fair hiring and provides evidence of gender discrimination.
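For comparison, the exact tail probability can be computed directly from the binomial distribution (a quick check, not shown on the slides):

# Exact probability of 5 or fewer female hires out of 50 when p = 0.30,
# to compare with the normal approximation above
pbinom(5, size = 50, prob = 0.30)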
24

To Calculate in R – Example

# Simulate 10,000 sample proportions with n = 50 and p = 0.30
successes100 <- rbinom(10000, size = 50, prob = 0.30)
proportion100 <- successes100 / 50
hist(proportion100, breaks = 20, right = FALSE, xlim = c(0, 0.7),
     col = "lightblue", xlab = "Sample proportion")
25

Alternative Solution with R – Example

prop.test(5, 50, p = 0.30, alternative = "less", correct = FALSE)

        1-sample proportions test without continuity correction

data:  5 out of 50, null probability 0.3
X-squared = 9.5238, df = 1, p-value = 0.001014
alternative hypothesis: true p is less than 0.3
95 percent confidence interval:
 0.0000000 0.1915375
sample estimates:
  p
0.1

Note that the p-value (0.001014) matches the earlier pnorm() calculation, because the test without continuity correction is based on the same normal approximation.
26