CORRELATION AND REGRESSION
CERTIFIED STATISTICIAN SPECIALIST (CSS) PROGRAM
(VISIONARY RESEARCH ASSOCIATION, INC.)
Eugine B. Dodongan, MAEd
Agriculture Department
Davao de Oro State College
Objectives
1. Building prerequisite knowledge on correlation and regression
2. Establishing the desired knowledge on correlation and regression
3. Developing skills in analyzing data
CORRELATION ANALYSIS
Correlation analysis is a statistical technique that gives
you information about the relationship between variables.
The strength of the correlation is measured by the
correlation coefficient, which ranges from -1 to +1; its
sign gives the direction of the relationship. Correlation
analysis can thus be used to make a statement about both
the strength and the direction of a correlation.
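As a minimal sketch (not from the slides), the following Python example computes Pearson's correlation coefficient with scipy; the fertilizer/yield data and variable names are made-up assumptions for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements (illustrative assumption):
# fertilizer applied (kg/ha) vs. crop yield (t/ha)
fertilizer = np.array([10, 20, 30, 40, 50, 60, 70, 80])
crop_yield = np.array([1.2, 1.9, 2.8, 3.5, 4.1, 4.8, 5.2, 6.0])

# Pearson's r varies from -1 (perfect negative) to +1 (perfect positive)
r, p_value = stats.pearsonr(fertilizer, crop_yield)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```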
ISSUES OF CORRELATION
- Correlation DOES NOT IMPLY CAUSATION.
- The size of a correlation can be influenced by the size of your sample.
- LINEARITY of the relationship.
- RANGE of talent (VARIABILITY).
- Homoscedasticity (equal variability).
- Effect of discontinuous distributions (OUTLIERS); see the sketch after this list.
- Deciding what a “good” correlation is.
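To make the outlier issue concrete, here is a hedged sketch with simulated data showing how a single extreme point can inflate Pearson's r between two otherwise unrelated variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(0, 1, 30)
y = rng.normal(0, 1, 30)  # x and y are independent, so true r is near 0

r_clean, _ = stats.pearsonr(x, y)

# Add one extreme point far from the rest of the data
x_out = np.append(x, 10.0)
y_out = np.append(y, 10.0)
r_outlier, _ = stats.pearsonr(x_out, y_out)

print(f"r without outlier: {r_clean:.3f}")     # close to zero
print(f"r with one outlier: {r_outlier:.3f}")  # substantially inflated
```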
CORRELATION ANALYSIS: ASSUMPTIONS
- LEVEL OF MEASUREMENT: both variables should be measured on a continuous scale.
- RELATED PAIRS: each observation must include a pair of values, one for each variable.
- ABSENCE OF OUTLIERS: the data should not contain outliers in either variable.
- LINEARITY: the relationship between the variables should be linear (see the sketch after this list).
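A minimal sketch, assuming illustrative data, of how these four assumptions might be screened in Python before computing a correlation:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative continuous, paired measurements (assumption)
x = np.array([2.1, 3.4, 4.0, 5.2, 6.1, 7.3, 8.0, 9.4])
y = np.array([10.5, 12.1, 13.0, 15.2, 16.8, 18.1, 19.5, 21.0])

# RELATED PAIRS: every observation contributes one value to each variable
assert len(x) == len(y)

# ABSENCE OF OUTLIERS: flag points more than 3 SDs from the mean
z_x, z_y = np.abs(stats.zscore(x)), np.abs(stats.zscore(y))
print("potential outliers at indices:", np.where((z_x > 3) | (z_y > 3))[0])

# LINEARITY: inspect the scatter plot for a straight-line pattern
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```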
REGRESSION ANALYSIS
Regression is a statistical method for modeling the
relationship between a dependent variable and one or more
independent variables.
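For instance, here is a minimal sketch of fitting a simple linear regression with statsmodels; the rainfall/harvest data and variable names are assumptions for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data (assumption): rainfall as the independent variable,
# harvest as the dependent variable
rainfall = np.array([50, 80, 110, 140, 170, 200, 230], dtype=float)
harvest = np.array([1.1, 1.8, 2.6, 3.1, 3.9, 4.4, 5.2])

X = sm.add_constant(rainfall)      # add the intercept term
model = sm.OLS(harvest, X).fit()   # ordinary least squares fit
print(model.summary())             # coefficients, R-squared, p-values
```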
WHEN TO USE REGRESSION ANALYSIS?
TYPES OF REGRESSION ANALYSES
ASSUMPTIONS OF LINEAR REGRESSION
1
First, linear regression requires the relationship between the
independent and dependent variables to be linear. It is also
important to check for outliers, since linear regression is
sensitive to outlier effects. The linearity assumption is best
tested with scatter plots; the sketch below contrasts a roughly
linear case with a clearly non-linear one.
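A short sketch, using simulated data (an assumption), of the scatter-plot linearity check:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y_linear = 2 * x + rng.normal(0, 1, 50)   # roughly linear relationship
y_curved = x ** 2 + rng.normal(0, 2, 50)  # clearly non-linear relationship

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x, y_linear)
axes[0].set_title("linear: assumption holds")
axes[1].scatter(x, y_curved)
axes[1].set_title("non-linear: assumption violated")
plt.tight_layout()
plt.show()
```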
ASSUMPTIONS OF LINEAR REGRESSION
2
Secondly, linear regression analysis requires all
variables to be multivariate normal. This assumption is
best checked with a histogram or a Q-Q plot.
Normality can also be tested with a goodness-of-fit test,
e.g., the Kolmogorov-Smirnov test. When the data are not
normally distributed, a non-linear transformation (e.g.,
a log transformation) might fix the issue.
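A hedged sketch of these normality checks in Python, using simulated right-skewed data as an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.5, size=100)  # right-skewed, not normal

# Kolmogorov-Smirnov test against a normal with the sample's mean and SD
_, p = stats.kstest(data, "norm", args=(data.mean(), data.std()))
print(f"KS p-value before transform: {p:.4f}")  # small p suggests non-normality

# A log transformation often restores approximate normality for skewed data
logged = np.log(data)
_, p2 = stats.kstest(logged, "norm", args=(logged.mean(), logged.std()))
print(f"KS p-value after log transform: {p2:.4f}")

# Visual check: Q-Q plot of the transformed data
stats.probplot(logged, dist="norm", plot=plt)
plt.show()
```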
ASSUMPTIONS OF LINEAR REGRESSION
3
Thirdly, linear regression assumes that there is little or no
multicollinearity in the data. Multicollinearity occurs
when the independent variables are too highly correlated
with each other. It can be diagnosed with (see the sketch
after this list):
- the CORRELATION MATRIX of the predictors
- TOLERANCE (the reciprocal of VIF)
- the VARIANCE INFLATION FACTOR (VIF)
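A minimal sketch of these checks with statsmodels; the predictors below are simulated (an assumption) so that two of them are deliberately collinear:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)  # deliberately collinear with x1
x3 = rng.normal(size=100)                        # independent predictor

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
print(X[["x1", "x2", "x3"]].corr())  # correlation matrix of the predictors

for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the intercept's VIF is not meaningful
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```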
ASSUMPTIONS OF LINEAR REGRESSION
4
Fourthly, linear regression analysis requires that there is
little or no autocorrelation in the data. Autocorrelation
occurs when the residuals are not independent of each
other; in other words, when the value of y(x+1) is not
independent of the value of y(x).
You can test the linear regression model for autocorrelation
with the Durbin-Watson test. Durbin-Watson’s d tests the
null hypothesis that the residuals are not linearly
autocorrelated.
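A short sketch of the Durbin-Watson check using statsmodels on simulated data (an assumption for illustration):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
x = np.arange(100, dtype=float)
y = 2.0 * x + rng.normal(scale=5.0, size=100)  # independent errors

model = sm.OLS(y, sm.add_constant(x)).fit()
d = durbin_watson(model.resid)
print(f"Durbin-Watson d = {d:.2f}")  # d near 2 suggests no autocorrelation
```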
ASSUMPTIONS OF LINEAR REGRESSION
5
The last assumption of the linear regression analysis
is homoscedasticity. Homoscedasticity, or homogeneity
of variances, is an assumption of equal or similar
variances in different groups being compared. The
scatter plot is a good way to check whether the data are
homoscedastic (meaning the residuals are equal across
the regression line); a fan or cone shape in the residual
plot indicates heteroscedasticity, as in the sketch below.
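A minimal sketch of the residual scatter-plot check; the heteroscedastic dataset below is simulated as an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 100)
# The error variance grows with x, so the data are heteroscedastic
y = 3.0 * x + rng.normal(scale=x, size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("a widening fan shape indicates heteroscedasticity")
plt.show()
```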
SUMMARY OF ASSUMPTION CHECKS

Assumption        | Test                              | Interpretation
Linearity         | Scatter plot                      | Points should follow a roughly straight-line pattern.
Normality         | Shapiro-Wilk / Kolmogorov-Smirnov | A p-value > 0.05 indicates normality.
Multicollinearity | VIF                               | A VIF of 1 means the variables are not correlated; a VIF between 1 and 5 indicates moderate correlation; a VIF between 5 and 10 indicates high correlation.
Autocorrelation   | Durbin-Watson test                | The DW statistic ranges from 0 to 4, with 2.0 indicating zero autocorrelation; values below 2.0 indicate positive autocorrelation and values above 2.0 indicate negative autocorrelation.
Homoscedasticity  | Scatter plot of residuals         | Residual spread should be roughly constant across the regression line.
CORRELATION AND REGRESSION ANALYSIS USING JAMOVI AND JASP