Statistics101: Numerical Measures

zahid-mian 519 views 28 slides Jul 28, 2015

About This Presentation

Intro to Statistics for non-statisticians (delivered as part of Six Sigma Green Belt training)


Slide Content

Zahid Mian
Part of the Brown-bag Series

Population
Sample
Variable
Statistic
Sample
Skew

Mean
Median
Mode
Range
Percentile
Variance
Standard Deviation
Covariance
Correlation Coefficient
Skewness

Why do we care about Central Tendency?
What is most valuable to you:
Average price of home in a neighborhood
Median price of home …
Range of prices …
Mode …
What does it say about neighborhood if:
Average price is $500K
Median price is $350K
Range is $750K

Mean for Population: $\mu = \frac{\sum x}{N}$
Mean for Sample: $\bar{x} = \frac{\sum x}{n}$
Data: 1,2,3,4,4,5,5,5,5,6
Mean = 4
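
A minimal R sketch of the mean calculation for the data above. The variable names are illustrative and not from the slides.

```r
# Data from the slide
x <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6)

# Mean by hand: sum of the values divided by how many there are
manual_mean <- sum(x) / length(x)

# Built-in equivalent
builtin_mean <- mean(x)

print(manual_mean)   # 4
print(builtin_mean)  # 4
```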

The “middle” value from sorted list
$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}}$ term
Data: 1,2,3,4,4,5,5,5,5,6 Median: 4.5
Data: 1,2,3,4,4,5,5,5,5,6,7 Median: 5
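
The two cases above can be sketched in R as follows. With an odd count the ((n + 1) / 2)th term exists directly; with an even count that position falls between two terms, so their average is taken. Variable names are illustrative.

```r
even_data <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6)      # 10 values
odd_data  <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 7)   # 11 values

# Odd count: take the ((n + 1) / 2)th term of the sorted data
n_odd <- length(odd_data)
odd_median <- sort(odd_data)[(n_odd + 1) / 2]

# Even count: (n + 1) / 2 = 5.5, so average the 5th and 6th terms
n_even <- length(even_data)
sorted_even <- sort(even_data)
even_median <- mean(sorted_even[c(n_even / 2, n_even / 2 + 1)])

print(c(odd_median, median(odd_data)))     # 5 5
print(c(even_median, median(even_data)))   # 4.5 4.5
```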

The number that occurs the most
Data: 1,2,3,4,4,5,5,5,5,6,7
Mode: 5 (appears 4 times)
> table(c(1,2,3,4,4,5,5,5,5,6,7))
1 2 3 4 5 6 7
1 1 1 2 4 1 1
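
Base R has no built-in function for the statistical mode, so one possible sketch reads it off the frequency table. Keeping every value that reaches the maximum count is a design choice made here (so ties are all reported), not something the slides specify.

```r
x <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 7)

# Frequency of each distinct value
counts <- table(x)

# Keep every value whose count equals the maximum (handles ties)
modes <- as.numeric(names(counts)[counts == max(counts)])

print(modes)         # 5
print(max(counts))   # 4
```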

Cuts off data by n percent
Quiz Scores:
67,72,88,82,80,90,95,60,77,89,99,85,77
What is the score that cuts off 30% of all scores? 50%?
> quantile(c(67,72,88,82,80,90,95,60,77,89,99,85,77), c(.3, .5))
30% 50%
 77  82
30% of all scores were 77 or below; 50% of all scores were 82 or below

Graph showing data within quartiles
Max Value
Min Value
Median
Q3 (75%)
Q1 (25%)
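
The five values labelled on the box plot can be computed with `fivenum`. This sketch assumes the quiz scores from the percentile slide, since the data behind the plot is not given. Note that `fivenum` returns Tukey's hinges, which happen to coincide with the 25% and 75% quantiles for this data set.

```r
# Assumption: reuse the quiz scores from the percentile example
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(scores)
names(five) <- c("Min", "Q1", "Median", "Q3", "Max")
print(five)

# Uncomment to draw the plot itself
# boxplot(scores, horizontal = TRUE)
```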

How to see the “dispersion” of data
Sets of quiz scores for different classes:
Set1: 80, 79, 80, 81, 80, 80, 79, 79
Set2: 75, 100, 60, 100, 100, 75, 75, 60
Just by looking at the numbers you should be able to state that Set2 is more dispersed
Standard deviation measures the dispersion around the mean
Other measures include range and variance
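
A quick R check of that impression. The two sets have similar means but very different spreads. The summary table layout is illustrative, and `sd` here is the sample standard deviation (it divides by n - 1).

```r
set1 <- c(80, 79, 80, 81, 80, 80, 79, 79)
set2 <- c(75, 100, 60, 100, 100, 75, 75, 60)

# Side-by-side summary of centre and spread for each set
spread <- data.frame(
  set   = c("Set1", "Set2"),
  mean  = c(mean(set1), mean(set2)),
  sd    = c(sd(set1), sd(set2)),
  range = c(diff(range(set1)), diff(range(set2)))
)
print(spread)
```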

difference between the highest and lowest
values in the data
Data: 1,2,3,4,4,5,5,5,5,6,7
Range: 6 (7-1)
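
One thing to watch in R: `range` returns the minimum and maximum rather than their difference, so a sketch of the calculation above needs `diff` as well.

```r
x <- c(1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 7)

# range() gives the two extremes, not the spread
limits <- range(x)
print(limits)        # 1 7

# The range as defined on the slide: highest minus lowest
data_range <- diff(limits)
print(data_range)    # 6
```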

Measures dispersion around the mean (how far values fall from the norm)
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Steps to calculate population variance ($\sigma^2$)
Calculate mean
For each number in set
▪Subtract the mean
▪Square the result (why square it?)
Get the average of the squared differences
Data:
67,72,88,82,80,90,95,60,77,89,99,85,77
Large variance indicates data is spread out; small indicates data is close to mean.
For Sample (note n-1)
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Mean: 81.62
$(67 - 81.62)^2 = 213.74$
$(72 - 81.62)^2 = 92.54$
…
$(77 - 81.62)^2 = 21.34$
Sum of 213.74, 92.54, …, 21.34, divided by n - 1 = 12
$s^2 = 123.09$
(Dividing by N = 13 instead gives the population variance, 113.62)
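
A short R sketch confirming that 123.09 is the sample variance of these scores, that is, the sum of squared differences divided by n - 1. Variable names are illustrative.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
n <- length(scores)

# Sample variance by hand: divide by n - 1 rather than n
sample_var <- sum((scores - mean(scores))^2) / (n - 1)

print(round(mean(scores), 2))   # 81.62
print(round(sample_var, 2))     # 123.09

# The built-in var() uses the same n - 1 denominator
print(isTRUE(all.equal(sample_var, var(scores))))   # TRUE
```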

Simply the square root of variance
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
For the quiz scores, using the sample variance: $s^2 = 123.09$, so $s = 11.09$
Standard deviation is useful in determining what is “normal”
The mean of scores was 81.62, so most scores are within 1 standard deviation (+/-11.09). All scores are within 2 standard deviations (+/-22.18).
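
The claim about 1 and 2 standard deviations can be checked directly. This sketch uses R's `sd`, which is the sample standard deviation, and counts how many quiz scores fall inside each band.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)

m <- mean(scores)
s <- sd(scores)          # square root of the sample variance

# Which scores lie within 1 and within 2 standard deviations of the mean?
within_1 <- abs(scores - m) <= s
within_2 <- abs(scores - m) <= 2 * s

print(round(s, 2))       # 11.09
print(sum(within_1))     # 9 of the 13 scores
print(all(within_2))     # TRUE
```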

[Figure: distribution marking the mean and the ranges within 1 and 2 standard deviations]

Measures how two variables (x,y) are linearly related. A positive value indicates that y tends to increase as x increases; a negative value indicates that y tends to decrease.
Test Scores (x):
67,72,88,82,80,90,95,60,77,89,99,85,77
Study Time (y):
30,45,80,85,75,85,120,30,45,75,85,110,40
Is there a relationship between Test Scores and Time spent studying?
$s_{xy} = 269.42$ (sample covariance, dividing by n - 1)
What if everyone studied for 30 minutes?
$s_{xy} = 0$ (so no linear relation)
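
A sketch of the covariance calculation in R. The built-in `cov` returns the sample covariance (n - 1 denominator); the hand calculation is included to show what it does. Variable names are illustrative.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
study  <- c(30, 45, 80, 85, 75, 85, 120, 30, 45, 75, 85, 110, 40)
n <- length(scores)

# By hand: sum of products of deviations, divided by n - 1
manual_cov <- sum((scores - mean(scores)) * (study - mean(study))) / (n - 1)
print(round(manual_cov, 2))            # 269.42
print(round(cov(scores, study), 2))    # 269.42

# If everyone studied for 30 minutes, study time never deviates from its mean
constant_study <- rep(30, n)
print(cov(scores, constant_study))     # 0
```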

A normalized measurement of how two variables are linearly related.
Sample: $r_{xy} = \frac{s_{xy}}{s_x s_y}$
Population: $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
From Previous Example: $r_{xy} = 0.83$ (the closer this value is to 1 or -1, the stronger the linear relationship)
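
The sample formula can be sketched in R by dividing the covariance by the product of the two standard deviations. The result should agree with the built-in `cor`.

```r
scores <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
study  <- c(30, 45, 80, 85, 75, 85, 120, 30, 45, 75, 85, 110, 40)

# Correlation = covariance normalized by both standard deviations
manual_r <- cov(scores, study) / (sd(scores) * sd(study))

print(round(manual_r, 2))             # 0.83
print(round(cor(scores, study), 2))   # 0.83
```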

Intuitively we would say there is no relation
But be careful …
Let’s say I have data for sales of ice vs. temp
> cor(temps, sales)
[1] -0.001413245
Correlation Coefficient is nearly 0, so no
relation, right?

The scatter plot clearly shows a strong, but not linear, relationship between sales and temps. Maybe when it’s too hot people just don’t want to leave the house.
Always visualize data!
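
The effect is easy to reproduce. The data below is hypothetical (the original `temps` and `sales` vectors are not shown in the slides): sales rise with temperature up to a point and then fall symmetrically, so sales are completely determined by temperature, yet the correlation coefficient is zero.

```r
# Hypothetical data, not the data behind the slide
temps <- seq(50, 110, by = 5)
sales <- 1000 - (temps - 80)^2   # peaks at 80 degrees, falls off on both sides

# The rise and the fall cancel out in the linear measure
r <- cor(temps, sales)
print(r)

# Uncomment to see the relationship that the coefficient misses
# plot(temps, sales)
```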

Different data sets can have identical simple statistical properties, so always visualize!

Measure of “symmetry” of data. Negative value typically indicates mean is less than median (left skewed). Positive value typically indicates mean is larger than median (right skewed).
> skewness(scores)
[1] -0.302365

Data:
60,62,62,62,65,65,65,75,82,96,99,100,100
> skewness(scores)
[1] 0.4652821
This means a few students did really well and lifted the overall mean score.
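
`skewness` is not part of base R, and the slides do not say which package it came from. The values shown appear consistent with the default of the `skewness` function in the e1071 package (third central moment divided by the cube of the sample standard deviation), but that is an assumption. A base R sketch of that formula, with the hypothetical helper name `skew`:

```r
# Hypothetical helper: third central moment over the cubed sample sd
skew <- function(x) {
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

quiz_scores  <- c(67, 72, 88, 82, 80, 90, 95, 60, 77, 89, 99, 85, 77)
right_skewed <- c(60, 62, 62, 62, 65, 65, 65, 75, 82, 96, 99, 100, 100)

print(skew(quiz_scores))    # negative: left skewed
print(skew(right_skewed))   # positive: right skewed
```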

Mean, Median, Mode exactly at center
About 99.7% of all data within $3\sigma$ of mean
Important for making inferences
Test scores generally follow a normal distribution
Human heights approximately follow a normal distribution
Need to be careful not to apply normal distribution rules against non-normal data
https://en.wikipedia.org/wiki/Normality_test
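
The share of a normal distribution that lies within 1, 2 and 3 standard deviations of the mean can be computed with `pnorm`, the normal cumulative distribution function. A minimal sketch:

```r
# Number of standard deviations on either side of the mean
k <- 1:3

# Area under the standard normal curve between -k and +k
share_within <- pnorm(k) - pnorm(-k)

print(round(share_within, 4))   # 0.6827 0.9545 0.9973
```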

z-Score indicates how far above or below the mean a given score in the distribution is
Scenario: On which exam did Scott do better?
Scott got a 65/100 on Exam1; $\mu$ is 60; $\sigma$ is 10
Scott got a 42/200 on Exam2; $\mu$ is 37; $\sigma$ is 5
First, need to standardize scores (Exam1 is out of 100; Exam2 out of 200)
This standardization is the z-score
$z = \frac{\text{raw score} - \text{mean}}{\text{standard deviation}}$ or $z = \frac{x - \mu}{\sigma}$
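
Applying the formula to Scott's two exams, as a minimal R sketch. The function name `z_score` is illustrative.

```r
# z = (raw score - mean) / standard deviation
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_exam1 <- z_score(65, mu = 60, sigma = 10)
z_exam2 <- z_score(42, mu = 37, sigma = 5)

print(z_exam1)   # 0.5
print(z_exam2)   # 1
```

Scott was 5 points above the mean on both exams, but on Exam2 that is a full standard deviation, so relative to the other students he did better on Exam2.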

z of -1.5 means student scored 1.5 standard deviations below the mean
In the case of test scores, positive numbers are good
Less than 10% scored worse
Which score marks the 97th percentile?
What percentage of population scored between score1 and score2 (say 75 and 90)?
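
These questions can be answered with `pnorm` and `qnorm`. The sketch below assumes the scores are normally distributed with a mean of 81.62 and a standard deviation of 11.09; those parameters are borrowed from the quiz example purely for illustration and are not stated on this slide.

```r
# Share of students scoring below z = -1.5
share_below <- pnorm(-1.5)
print(round(share_below, 3))   # 0.067, so less than 10%

# z-score that marks the 97th percentile
z_97 <- qnorm(0.97)
print(round(z_97, 2))          # 1.88

# Assumed (hypothetical) distribution parameters
mu <- 81.62
sigma <- 11.09

# Convert the 97th-percentile z-score back to a raw score
score_97 <- mu + z_97 * sigma

# Share of the population scoring between 75 and 90
share_between <- pnorm(90, mean = mu, sd = sigma) - pnorm(75, mean = mu, sd = sigma)
print(c(score_97, share_between))
```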

Measurements of Central Tendency and
Variability are critical to the study of statistics
Central Tendency tries to provide information
about the “central” value of your set
Variability tries to provide information about
the dispersion of data in your set
Covariance tries to provide information about
how two variables are related
z-Scores are useful with a normal distribution