Regression analysis

Slide Content

Regression Analysis

Regression: a technique concerned with predicting some variables by knowing others; it is the process of predicting a variable Y using a variable X. Regression tells you how values of Y change as a function of changes in the values of X.

Correlation and Regression: correlation describes the strength of a linear ("straight line") relationship between two variables. Regression tells us how to draw the straight line described by the correlation.

Regression calculates the "best-fit" line for a given set of data: the regression line makes the sum of the squares of the residuals smaller than for any other line. In other words, regression minimizes the squared residuals.

Regression lets us construct a best-fitting straight line through the scatter-diagram points and then formulate a regression equation of the form ŷ = b0 + b1x.

Simple Linear Regression: the output of a regression is a function that predicts the dependent variable (y) based upon values of the independent variable (x). Simple regression fits a straight line to the data: y′ = b0 + b1x ± ε, where b0 is the y-intercept, b1 = Δy/Δx is the slope, and ε is the error term.
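As an illustration (not part of the original deck), a minimal Python/NumPy sketch of fitting this straight line by least squares; the data values are made up purely for demonstration:

```python
import numpy as np

# Hypothetical data (not from the slides), purely to illustrate the fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x    # predicted values (y')
residuals = y - y_hat  # prediction errors (epsilon)
print(b0, b1)
```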

Simple Linear Regression: the function will make a prediction for each observed data point. The observation is denoted by y and the prediction is denoted by ŷ.

Simple Linear Regression: for each observation, the variation can be described as y = ŷ + ε, that is, Actual = Explained + Error, where ε is the prediction error.

Regression: a least squares regression selects the line with the lowest total sum of squared prediction errors. This value is called the Sum of Squares of Error, or SSE.

Calculating SSR: the Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of y (ȳ).

Regression Formulas: the Total Sum of Squares (SST) is equal to SSR + SSE. Mathematically,
SSR = ∑(ŷ − ȳ)²  (measure of explained variation)
SSE = ∑(y − ŷ)²  (measure of unexplained variation)
SST = SSR + SSE = ∑(y − ȳ)²  (measure of total variation in y)

The Coefficient of Determination: the proportion of total variation (SST) that is explained by the regression (SSR) is known as the Coefficient of Determination, often referred to as R². R² = SSR / SST = SSR / (SSR + SSE). The value of R² can range between 0 and 1, and the higher its value, the more accurate the regression model is. It is often expressed as a percentage.

Standard Error of Regression: the Standard Error of a regression is a measure of its variability. It can be used in a similar manner to a standard deviation, allowing for prediction intervals: ŷ ± 2 standard errors gives approximately 95% coverage, and ± 3 standard errors gives an approximately 99% confidence interval. The Standard Error is calculated by taking the square root of the average squared prediction error: Standard Error = √(SSE / (n − k)), where n is the number of observations in the sample and k is the total number of variables (estimated parameters) in the model.
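The sums of squares, R², and the standard error above can all be computed directly. A minimal NumPy sketch with made-up data follows (the choice k = 2, counting the intercept and slope of a simple regression, is an assumption consistent with the n − 2 degrees of freedom used later in the deck):

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)             # least-squares slope and intercept
y_hat = b0 + b1 * x                      # predictions

sse = np.sum((y - y_hat) ** 2)           # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)    # explained variation
sst = np.sum((y - y.mean()) ** 2)        # total variation; equals SSR + SSE for a least-squares fit

r_squared = ssr / sst                    # coefficient of determination, between 0 and 1

n, k = len(y), 2                         # k counts the estimated parameters (intercept and slope)
standard_error = np.sqrt(sse / (n - k))  # standard error of the regression
print(r_squared, standard_error)
```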

The output of a simple regression is the coefficient β and the constant A. The equation is then y = A + βx + ε, where ε is the residual error. β is the per-unit change in the dependent variable for each unit change in the independent variable; mathematically, β = Δy / Δx.

Multiple Linear Regression: more than one independent variable can be used to explain variance in the dependent variable, as long as they are not linearly related. A multiple regression takes the form y = A + β1X1 + β2X2 + … + βkXk + ε, where k is the number of variables, or parameters.
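As a sketch (not from the deck), a multiple regression with two hypothetical predictors can be fit by ordinary least squares using NumPy; all data values below are invented:

```python
import numpy as np

# Hypothetical data: two predictors X1, X2 and an outcome y (illustrative only)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.0])

# Design matrix with a column of ones for the constant A
X = np.column_stack([np.ones_like(X1), X1, X2])

# Ordinary least squares: solves for [A, beta1, beta2]
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
A, beta1, beta2 = coeffs
print(A, beta1, beta2)
```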

Multicollinearity: multicollinearity is a condition in which at least two independent variables are highly linearly correlated. It makes the coefficient estimates unstable and can cause numerical problems when fitting the model.

Example table of correlations:
        Y      X1     X2
Y     1.000
X1    0.802  1.000
X2    0.848  0.578  1.000

A correlations table can suggest which independent variables may be significant. Generally, an independent variable that is strongly correlated with the dependent variable and only weakly correlated with the other independent variables can be included as a possible predictor.
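A correlation matrix like the one above can be produced with NumPy. In this sketch the data are hypothetical and the 0.8 flagging threshold is an arbitrary choice, not a value given in the slides:

```python
import numpy as np

# Hypothetical predictor data (illustrative only)
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X2 = np.array([1.1, 2.2, 2.9, 4.2, 5.1, 5.9])   # nearly a copy of X1 -> multicollinearity

corr = np.corrcoef(np.vstack([X1, X2]))          # 2x2 correlation matrix of the predictors
print(corr)

# Flag highly correlated predictor pairs (threshold 0.8 is an arbitrary assumption)
if abs(corr[0, 1]) > 0.8:
    print("X1 and X2 are highly correlated; consider dropping one of them")
```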

Nonlinear Regression: nonlinear functions can also be fit as regressions. Common choices include power, logarithmic, exponential, and logistic functions, but any continuous function can be used.
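As an illustration only (the deck does not prescribe a tool), a nonlinear fit can be done with scipy.optimize.curve_fit; the exponential model and data below are hypothetical:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical exponential model y = a * exp(b * x)
def model(x, a, b):
    return a * np.exp(b * x)

# Made-up data that roughly doubles at each step
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.8, 7.9, 16.2])

params, covariance = curve_fit(model, x, y, p0=[1.0, 0.5])  # p0 is an initial guess
a_hat, b_hat = params
print(a_hat, b_hat)
```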

[Figure: side-by-side panels labelled "Linear" and "Not Linear", each showing Y plotted against x and the corresponding residuals plotted against x.]

Regression Output in Excel

Significance testing of the slope: the estimated slope follows a t distribution with n − 2 degrees of freedom. H0: β1 = 0 (no linear relationship); H1: β1 ≠ 0 (a linear relationship does exist). The test statistic is t(n−2) = b1 / SE(b1).
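One way to carry out this test in practice (an assumption, not shown in the deck) is scipy.stats.linregress, which reports the slope, its standard error, and the two-sided p-value for H0: β1 = 0; the data here are made up:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.9])

result = linregress(x, y)
t_stat = result.slope / result.stderr      # t statistic with n - 2 degrees of freedom
print(result.slope, result.stderr, t_stat, result.pvalue)
```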

Functions of multivariate analysis: control for confounders; test for interactions between predictors (effect modification); improve predictions.

Interpreting Regression 

Covariance: covariance is a measure of the joint variability of two random variables.

Interpreting Covariance:
cov(X, Y) > 0: X and Y are positively correlated.
cov(X, Y) < 0: X and Y are inversely correlated.
cov(X, Y) = 0: X and Y are uncorrelated (no linear relationship; independence implies zero covariance, but not the reverse).
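For reference, the sample covariance can be computed with NumPy; the paired observations below are hypothetical:

```python
import numpy as np

# Hypothetical paired observations (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

cov_matrix = np.cov(x, y)      # 2x2 sample covariance matrix
cov_xy = cov_matrix[0, 1]      # covariance of x and y
print(cov_xy)                  # positive here, so x and y tend to move together
```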

Types of variables to be analyzed and the corresponding statistical procedure or measure of association (predictor variable(s), outcome variable: procedure).

Cross-sectional / case-control studies:
- Categorical (>2 groups) predictor, continuous outcome: ANOVA
- Continuous predictor, continuous outcome: simple linear regression
- Multivariate (categorical and continuous) predictors, continuous outcome: multiple linear regression
- Categorical predictor, categorical outcome: chi-square test (or Fisher's exact)
- Binary predictor, binary outcome: odds ratio, risk ratio
- Multivariate predictors, binary outcome: logistic regression

Cohort studies / clinical trials:
- Binary predictor, binary outcome: risk ratio
- Categorical predictor, time-to-event outcome: Kaplan-Meier / log-rank test
- Multivariate predictors, time-to-event outcome: Cox proportional hazards regression, hazard ratio
- Binary (two groups) predictor, continuous outcome: t-test
- Binary predictor, ranks/ordinal outcome: Wilcoxon rank-sum test
- Categorical predictor, continuous outcome: repeated-measures ANOVA
- Multivariate predictors, continuous outcome: mixed models; GEE modeling

Alternative summary: statistics for various types of outcome data, by whether the observations are independent or correlated, and the assumptions involved.

Continuous outcome (e.g. pain scale, cognitive function):
- Independent observations: t-test, ANOVA, linear correlation, linear regression
- Correlated observations: paired t-test, repeated-measures ANOVA, mixed models / GEE modeling
- Assumptions: the outcome is normally distributed (important for small samples); the outcome and predictor have a linear relationship

Binary or categorical outcome (e.g. fracture yes/no):
- Independent observations: difference in proportions, relative risks, chi-square test, logistic regression
- Correlated observations: McNemar's test, conditional logistic regression, GEE modeling
- Assumptions: the chi-square test assumes sufficient numbers in each cell (>= 5)

Time-to-event outcome (e.g. time to fracture):
- Independent observations: Kaplan-Meier statistics, Cox regression
- Correlated observations: n/a
- Assumptions: Cox regression assumes proportional hazards between groups

Continuous outcome (means); HRP 259/HRP 262. Outcome variable: continuous (e.g. pain scale, cognitive function).

Independent observations:
- t-test: compares means between two independent groups
- ANOVA: compares means between more than two independent groups
- Pearson's correlation coefficient (linear correlation): shows the linear correlation between two continuous variables
- Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:
- Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
- Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
- Mixed models / GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; give the rate of change over time

Alternatives if the normality assumption is violated (and the sample size is small): non-parametric statistics
- Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
- Wilcoxon rank-sum test (= Mann-Whitney U test): non-parametric alternative to the t-test
- Kruskal-Wallis test: non-parametric alternative to ANOVA
- Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
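Most of the tests in this summary have direct counterparts in scipy.stats; the mapping below is an illustration under the assumption that SciPy is used, with made-up data:

```python
import numpy as np
from scipy import stats

# Hypothetical continuous outcome in two independent groups (illustrative only)
group_a = np.array([5.1, 6.2, 5.8, 6.5, 5.9])
group_b = np.array([6.8, 7.1, 6.9, 7.5, 7.2])
print(stats.ttest_ind(group_a, group_b))      # t-test for two independent means
print(stats.mannwhitneyu(group_a, group_b))   # Wilcoxon rank-sum / Mann-Whitney U alternative

# Hypothetical before/after measurements on the same subjects
before = np.array([5.0, 6.1, 5.7, 6.4, 5.8])
after  = np.array([5.6, 6.5, 6.0, 7.0, 6.1])
print(stats.ttest_rel(before, after))         # paired t-test
print(stats.wilcoxon(before, after))          # Wilcoxon signed-rank alternative

# Hypothetical pair of continuous variables
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(stats.pearsonr(x, y))                   # Pearson's linear correlation
print(stats.spearmanr(x, y))                  # Spearman rank correlation alternative
```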

Binary or categorical outcomes (proportions); HRP 259/HRP 261. Outcome variable: binary or categorical (e.g. fracture, yes/no).

Independent observations:
- Chi-square test: compares proportions between two or more groups
- Relative risks: odds ratios or risk ratios
- Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios

Correlated observations:
- McNemar's chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
- Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
- GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)

Alternatives to the chi-square test if cells are sparse:
- Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells < 5)
- McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells < 5)

Time-to-event outcome (survival data); HRP 262. Outcome variable: time-to-event (e.g., time to fracture).

Independent observations:
- Kaplan-Meier statistics: estimate survival functions for each group (usually displayed graphically); survival functions are compared with the log-rank test
- Cox regression: multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios

Correlated observations: n/a (the outcome is already measured over time)

Modifications to Cox regression if the proportional-hazards assumption is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)