What is statistics?
Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting, and presenting data.
There are two main types: descriptive and inferential.
Descriptive Statistics: Central Tendency
In applied statistics, a central tendency (or measure of central tendency) is a central or typical
value for a probability distribution. It may also be called a center or location of the distribution.
Colloquially, measures of central tendency are often called averages.
●Mean is the average of the values in a series.
●Median is the value in the middle when the values are sorted in ascending or descending order.
●Mode is the value that appears most often.
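As a quick illustration, all three measures are available in Python's built-in statistics module; a minimal sketch with made-up data:

```python
import statistics

data = [1, 2, 2, 3, 4, 7, 9]  # made-up example values

print(statistics.mean(data))       # 4.0  -> sum of the values divided by their count
print(statistics.median(data))     # 3    -> middle value of the sorted list
print(statistics.mode(data))       # 2    -> the most frequent value
print(statistics.multimode(data))  # [2]  -> all modes, handy when there are ties
```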
Mean
Source: Visually Inspecting Measures of Typical
Think of the mean as the balancing point of a distribution. That is, imagine you have a solid histogram of values and you must balance it on one finger. Where would you hold it? For all symmetric distributions the balancing point – the mean – is directly in the center.
Mode
The mode describes the value or category in a set of data that appears the most often.
The mode is specifically useful when asking questions about categorical (qualitative) variables.
What to do with two modes? Split into groups.
[Figure: the bimodal distribution split into two groups, Public Colleges and Private Colleges, each with its own mode]
Descriptive Statistics: Measures of Spread
Just like the measures of center, we also have measures of spread, which comprise the following:
●Range: It is a measure of how spread apart the values in a data set are: the difference between the largest and smallest values.
●Variance: It describes how much a random variable differs from its expected value. It entails computing the squares of the deviations, where a deviation is the difference between an element and the mean.
○Population Variance is the average of the squared deviations (the sum divided by N).
○Sample Variance divides the sum of squared deviations by n − 1 instead of n, which corrects for the bias introduced by estimating the mean from the same sample.
●Standard Deviation: It is the measure of the dispersion of a set of data from its mean.
●Interquartile Range (IQR): It is the measure of variability, based on dividing a data set into quartiles.
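A minimal sketch of all four measures using Python's standard statistics module (the data list is made up for the example):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example values

data_range = max(data) - min(data)    # range: largest minus smallest
pop_var = statistics.pvariance(data)  # population variance (divides by N)
samp_var = statistics.variance(data)  # sample variance (divides by n - 1)
samp_std = statistics.stdev(data)     # sample standard deviation

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1                                 # spread of the middle half

print(data_range, pop_var, samp_var, samp_std, iqr)
```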
Variance and standard deviation
The variance and standard deviation (the square root of the variance) are both measures of the spread of the data.
[Figure: two distributions, one with higher variance/std and one with lower variance/std]
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$
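In NumPy the same quantities come from np.var and np.std, where the ddof argument switches between the population form (ddof=0, the default) and the sample form (ddof=1); a small sketch with made-up values:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up example values

print(np.var(data))          # 4.0   population variance (divides by N)
print(np.var(data, ddof=1))  # ~4.57 sample variance (divides by n - 1)
print(np.std(data, ddof=1))  # ~2.14 sample standard deviation
```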
Interquartile Range (IQR)
Whereas the range gives you the spread of the whole data set, the interquartile range gives you the range of the middle half of a data set.
When is the interquartile range useful?
➔The interquartile range is an especially useful measure of variability for skewed distributions.
➔For these frequency distributions, the median is the best measure of central tendency because it’s the value exactly in the middle when all values are ordered from low to high.
➔Along with the median, the IQR can give you an overview of where most of your values lie and how clustered they are.
➔The IQR is also useful for datasets with outliers. Because it’s based on the middle half of the distribution, it’s less influenced by extreme values.
Source: How to find the interquartile range
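Since the section points to SciPy below, here is a sketch of the IQR via scipy.stats.iqr, or equivalently as the difference between the 75th and 25th percentiles; the data (with a deliberate outlier) are made up:

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # 100 is a deliberate outlier

q1, q3 = np.percentile(data, [25, 75])  # quartile boundaries
print(q3 - q1)                  # 4.5 -> IQR from percentiles
print(stats.iqr(data))          # 4.5 -> same result via SciPy
print(data.max() - data.min())  # 99  -> the range, stretched by the outlier
```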
Skewness
Source: https://www.scribbr.com/statistics/skewness/
Skewness is a measure of the asymmetry of a distribution. A distribution is asymmetrical when its left and right sides are not mirror images.
A distribution can have right (or positive), left (or negative), or zero skewness.
Notes on Skewness
Many statistical procedures assume that variables or residuals are normally distributed. You can address skewness by:
1.Do nothing. Many statistical tests, including t-tests, ANOVAs, and linear regressions, aren’t very sensitive to skewed data.
2.Use a different model. You may want to choose a model that doesn’t assume a normal distribution (e.g. non-parametric tests or generalized linear models).
3.Transform the variable. Another option is to transform a skewed variable so that it’s less skewed, as in the sketch below. (See types of transformations here)
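As a sketch of measuring and transforming skew, scipy.stats.skew quantifies the asymmetry, and a log transform often reduces right skew; the data here are randomly generated for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # a right-skewed sample

print(stats.skew(data))          # clearly positive -> right (positive) skew
print(stats.skew(np.log(data)))  # near zero after the log transform
```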
Correlation and Covariance
Let's say you are the new owner of a small ice-cream shop in a little village near the beach. You
noticed that there was more business in the warmer months than the cooler months. Before you
alter your purchasing pattern to match this trend, you want to be sure that the relationship is
real.
How can you be sure that the trend you noticed is real?
Covariance and correlation are two measures that can tell you, statistically, whether or not a real
relationship exists between the outside temperature and the number of customers you have. In
this way, you can make an informed choice about your purchasing pattern.
Example: Covariance
The covariance of this set of data is 25.132. The number is positive, so we can state that the two
variables do have a positive relationship; as temperature rises, the number of customers in the store also
rises. What this doesn't tell us is how strong this relationship is. To find the strength, we need to
continue on to correlation.
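The slide’s temperature/customer table isn’t reproduced here, but the computation itself is a single call to np.cov; the numbers below are made up, so the result won’t match the 25.132 above:

```python
import numpy as np

temperature = np.array([14, 18, 21, 25, 28, 31])  # made-up daily temperatures
customers = np.array([20, 27, 35, 40, 49, 55])    # made-up customer counts

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry
# is the sample covariance between the two variables.
cov_matrix = np.cov(temperature, customers)
print(cov_matrix[0, 1])  # positive -> the variables rise together
```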
Example: Correlation
To determine the strength of a LINEAR relationship, you must use the formula for the correlation coefficient. This formula will result in a number between -1 and 1, with:
●-1 being a perfect inverse correlation (the variables move in opposite directions reliably and consistently),
●0 indicating no relationship between the two variables,
●1 being a perfect positive correlation (the variables reliably and consistently move in the same direction as each other).
Notes on the Pearson Coefficient
One of the most common measures of linear correlation is the Pearson coefficient. The Pearson correlation coefficient is a good choice when all of the following are true:
➔Both variables are quantitative.
➔The variables are normally distributed: You can create a histogram of each variable to verify whether the distributions are approximately normal (they can be a little non-normal).
➔The data have no outliers: Outliers are observations that don’t follow the same patterns as the rest of the data. A scatterplot is one way to check for outliers: look for points that are far away from the others.
➔The relationship is linear: “Linear” means that the relationship between the two variables can be described reasonably well by a straight line. You can use a scatterplot to check whether the relationship between two variables is linear.
Source: Pearson Correlation Coefficient - Scribbr
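A quick way to eyeball the last three conditions is a histogram per variable plus a scatterplot; a minimal matplotlib sketch (the arrays are the same made-up values used in the covariance sketch above):

```python
import matplotlib.pyplot as plt
import numpy as np

temperature = np.array([14, 18, 21, 25, 28, 31])  # made-up values
customers = np.array([20, 27, 35, 40, 49, 55])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(temperature)                # approximately normal?
axes[1].hist(customers)                  # approximately normal?
axes[2].scatter(temperature, customers)  # linear? any far-away points?
plt.show()
```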
Example: Correlation
To determine the strength of a relationship, you must use the formula for the correlation coefficient, which is given by the covariance divided by the product of the standard deviations of the two variables:
$r = \frac{\mathrm{cov}(X, Y)}{s_X \, s_Y}$
The value of the coefficient of correlation is 0.9118. We can say that there is a strong positive relationship between the temperature and the number of customers.
And in Python?
SciPy library: https://docs.scipy.org/doc/scipy/reference/stats.html#association-correlation-tests
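For example, scipy.stats.pearsonr returns the Pearson coefficient along with a p-value; the data below are made up, so the coefficient won’t match the slide’s 0.9118:

```python
import numpy as np
from scipy import stats

temperature = np.array([14, 18, 21, 25, 28, 31])  # made-up values
customers = np.array([20, 27, 35, 40, 49, 55])

r, p_value = stats.pearsonr(temperature, customers)
print(r)        # close to +1 -> strong positive linear relationship
print(p_value)  # small -> unlikely to arise if there were no linear relationship
```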
Notes on correlation
Correlation means there is a statistical association between variables. Causation means that a change in one variable causes a change in another variable.
A correlation DOES NOT imply causation, but causation always implies correlation.
Why doesn’t correlation mean causation?
➔The third variable problem means that a confounding variable affects both variables to make them seem causally related when they are not.
➔A spurious correlation is when two variables appear to be related through hidden third variables or simply by coincidence.
Sources: Correlation vs. Causation - Scribbr, Causation vs. Correlation Explained With 10 Examples, Spurious Correlations Website.
A word on math formulas …
Math Glossary: Mean
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
●$x_i$ = i-th observation/data point
●$n$ = number of observations/data points
$\sum_{i=1}^{n}$ is shorthand for the sum from i = 1 to i = n. Think of a loop! Say x is a list of your data points:
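The slide’s code snippet didn’t survive extraction; a minimal Python version of the loop it describes:

```python
x = [4, 8, 6, 2]  # example data points

total = 0
for x_i in x:     # mirrors the summation: i = 1, ..., n
    total += x_i

mean = total / len(x)  # divide the sum by n
print(mean)            # 5.0
```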
Math Glossary: Variance
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
●$x_i$ = i-th observation/data point
●$n$ = number of observations/data points
●$\bar{x}$ = sample mean
In words, we are computing how far each data point is from the mean (spread) and taking an “average”. We square each term to make sure that the distance remains positive.
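Extending the loop idea from the mean, a sketch of the sample variance (the n − 1 divisor matches the formula above):

```python
x = [4, 8, 6, 2]        # example data points
mean = sum(x) / len(x)  # 5.0

squared_devs = 0
for x_i in x:           # accumulate squared distances from the mean
    squared_devs += (x_i - mean) ** 2

variance = squared_devs / (len(x) - 1)  # divide by n - 1 (sample variance)
print(variance)                         # 20 / 3 ≈ 6.67
```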
Math Glossary: Standard Deviation
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{s^2}$
●$x_i$ = i-th observation/data point
●$n$ = number of observations/data points
●$\bar{x}$ = sample mean
The standard deviation is simply the square root of the variance. This is a useful and
interpretable statistic because taking the square root of the variance puts the
standard deviation back into the original units of the measure we used.