Unit 7b: Regression Analysis
Introduction
Simple linear regression
Properties of regression coefficients
Coefficient of determination
Regression
Definition: Regression analysis is the process of determining how the value of a dependent variable changes for a unit change in an independent variable. According to Ya-lun Chou, "Regression analysis attempts to establish the nature of the relationship between variables, that is, to study the functional relationship between the variables and thereby provide a mechanism for prediction or forecasting." Regression analysis is applied to estimate the conditional expectation of the dependent variable; in simple language, it is used for prediction and forecasting in a business context. It also helps determine which of the independent variables are related to the dependent variable. It should be noted that the terms dependent and independent refer only to the mathematical or functional meaning of dependence; they do not imply that there is necessarily any cause-and-effect relationship between the variables.
Regression
Concept of Linear and Non-linear Regression: In this chapter we consider the simplest type of regression analysis, involving one independent variable and one dependent variable, in which the relationship between the variables is approximated by a straight line. This is called linear regression; otherwise the regression is non-linear. Although a wide variety of relationships is possible, we restrict the discussion in this section to linear equations, because they are relatively easy to work with and to interpret.
Regression
Regression Lines: Consider a simple situation where we are interested in the relationship between two variables X and Y, where X is the independent (causal) variable and Y depends on it. For bivariate data on X and Y there are two regression lines: (i) Y on X and (ii) X on Y. The regression line of Y on X gives the most probable value of Y for a given value of X, and the regression line of X on Y gives the most probable value of X for a given value of Y. It should be noted that the two regression lines intersect at the point whose coordinates are the mean values of X and Y: if a perpendicular is dropped from the point of intersection to the X-axis, it gives the mean value of X, and if a horizontal line is drawn from that point to the Y-axis, it gives the mean value of Y.
Regression
The simple linear regression equation of Y on X can be expressed as

Y = a + bX .................... (1)

Where,
Y = dependent variable
X = independent variable
a = Y-intercept; it indicates that the line intersects the Y-axis at the point Y = a
b = slope of the line; it indicates the rate of change in Y for a unit change in X

To fit this regression equation we estimate the parameters a and b by the method of least squares. In this method we minimise the sum of squared deviations of the observed values of the dependent variable from the values predicted by the line. Applying the least squares estimate (L.S.E.) to equation (1), we obtain the following two normal equations:

ΣY = na + bΣX .................... (i)
ΣXY = aΣX + bΣX² .................... (ii)

Solving these two normal equations gives the values of a and b; substituting them into (1) gives the required line of best fit (the estimated or fitted regression equation of Y on X).
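The two normal equations above can be solved directly in a few lines. The sketch below uses made-up illustrative data (the X and Y values are assumptions, not from the text) and fits Y on X by solving equations (i) and (ii):

```python
# Minimal sketch: least-squares fit of Y = a + bX via the normal equations.
# The data values are illustrative only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 6]
n = len(X)

Sx, Sy = sum(X), sum(Y)
Sxy = sum(x * y for x, y in zip(X, Y))   # ΣXY
Sxx = sum(x * x for x in X)              # ΣX²

# Solve  ΣY = na + bΣX  and  ΣXY = aΣX + bΣX²  for b, then back-substitute for a.
b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
a = (Sy - b * Sx) / n

print(a, b)  # fitted line: Y = a + bX
```

For this data the fitted line is Y = 1.8 + 0.8X, so the predicted Y at X = 6 would be 1.8 + 0.8 × 6 = 6.6.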
Regression
Similarly, the simple linear regression equation of X on Y is expressed as

X = a' + b'Y .................... (2)

Where,
X = dependent variable
Y = independent variable
a' = X-intercept; it indicates that the line intersects the X-axis at the point X = a'
b' = slope of the line; it indicates the rate of change in X for a unit change in Y

To estimate the parameters a' and b' by the rule of least squares (L.S.E.), the two normal equations are:

ΣX = na' + b'ΣY .................... (i)
ΣXY = a'ΣY + b'ΣY² .................... (ii)

Solving these two normal equations gives the values of a' and b'; substituting them into (2) gives the required line of best fit (X on Y).
Regression
The direct method of estimating the regression equations is tedious. To simplify it, we take the deviations of the X and Y series from their respective means; the resulting equations are called the regression equations by deviation from the arithmetic mean. In this case the two regression equations are written as

Y = a + b_yx X and X = a' + b_xy Y .................... (1)

Where b_yx = regression coefficient of Y on X and b_xy = regression coefficient of X on Y. Since the two lines of regression always pass through the mean point (X̄, Ȳ), we have

Ȳ = a + b_yx X̄ and X̄ = a' + b_xy Ȳ .................... (2)

Subtracting (2) from (1):

(Y − Ȳ) = b_yx (X − X̄) and (X − X̄) = b_xy (Y − Ȳ)

Thus, the regression line of Y on X is (Y − Ȳ) = b_yx (X − X̄) and the regression line of X on Y is (X − X̄) = b_xy (Y − Ȳ).
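The deviation-from-mean form above avoids solving simultaneous equations: each coefficient is a single ratio of sums of deviations. A minimal sketch, again on illustrative (assumed) data:

```python
# Sketch: both regression coefficients from deviations about the means.
# Regression of Y on X:  (Y - ybar) = b_yx (X - xbar), and symmetrically for X on Y.
X = [1, 2, 3, 4, 5]   # illustrative data
Y = [2, 4, 5, 4, 6]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

x = [xi - xbar for xi in X]   # deviations x = X - xbar
y = [yi - ybar for yi in Y]   # deviations y = Y - ybar

Sxy = sum(xi * yi for xi, yi in zip(x, y))   # Σxy
b_yx = Sxy / sum(xi * xi for xi in x)        # Σxy / Σx²
b_xy = Sxy / sum(yi * yi for yi in y)        # Σxy / Σy²

# Both lines pass through the mean point (xbar, ybar).
print(b_yx, b_xy)
```

Note that b_yx and b_xy differ only in the denominator, which is why the two regression lines generally differ unless the correlation is perfect.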
Regression
a. Individual series
The methods of finding the regression coefficients are:

i. Direct method: the regression coefficients are defined by

b_yx = (nΣXY − ΣX ΣY) / (nΣX² − (ΣX)²) and b_xy = (nΣXY − ΣX ΣY) / (nΣY² − (ΣY)²)

ii. Actual mean method: when deviations are taken from the actual means,

b_yx = Σxy / Σx² and b_xy = Σxy / Σy²

where x = X − X̄ and y = Y − Ȳ.
Regression
iii. Assumed mean method: if the actual means are decimals (or fractions), taking deviations from the actual means makes the calculation difficult, so the deviations are taken from assumed means. The regression coefficients by this method are computed by

b_yx = (nΣUV − ΣU ΣV) / (nΣU² − (ΣU)²) and b_xy = (nΣUV − ΣU ΣV) / (nΣV² − (ΣV)²)

where U = X − A and V = Y − B, with A and B the assumed means.

iv. Step deviation method: if the variable values are of continuous type or there is a common divisor for each observation, the regression coefficients are computed by

b_yx = [(nΣUV − ΣU ΣV) / (nΣU² − (ΣU)²)] × (k/h) and b_xy = [(nΣUV − ΣU ΣV) / (nΣV² − (ΣV)²)] × (h/k)

where U = (X − A)/h and V = (Y − B)/k, and h and k are the class intervals or common divisors of the X and Y series respectively.

v. When the standard deviations and the correlation coefficient are given, the regression coefficients are

b_yx = r·(σ_y/σ_x) and b_xy = r·(σ_x/σ_y)

where σ_x = standard deviation of X, σ_y = standard deviation of Y, and r = correlation coefficient.
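The direct formula in (i) and the σ-and-r form in (v) are algebraically identical, which gives a useful cross-check when both sets of quantities are available. A sketch on illustrative (assumed) data:

```python
import math

# Sketch: b_yx computed directly equals r * (sigma_y / sigma_x).
X = [1, 3, 6, 8, 5]   # illustrative data
Y = [1, 3, 2, 5, 4]
n = len(X)
Sx, Sy = sum(X), sum(Y)
Sxy = sum(a * b for a, b in zip(X, Y))
Sxx = sum(a * a for a in X)
Syy = sum(b * b for b in Y)

# Direct method
b_yx_direct = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)

# Via standard deviations and correlation coefficient
sigma_x = math.sqrt(Sxx / n - (Sx / n) ** 2)
sigma_y = math.sqrt(Syy / n - (Sy / n) ** 2)
r = (n * Sxy - Sx * Sy) / math.sqrt((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2))

b_yx_sigma = r * sigma_y / sigma_x
print(b_yx_direct, b_yx_sigma)  # the two values agree
```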
Properties of Regression Coefficients
1. Both regression coefficients have the same sign, i.e. either both are positive or both are negative. This is known as the signature property of regression coefficients.
2. Both regression coefficients cannot be greater than one at the same time, because the value of the correlation coefficient (r) cannot exceed one. This is known as the magnitude property of regression coefficients, i.e.

if b_yx > 1 then b_xy < 1, and if b_xy > 1 then b_yx < 1.
Properties of Regression Coefficients
3. The correlation coefficient is the geometric mean of the two regression coefficients. This is known as the fundamental property of regression coefficients, i.e.

r = ±√(b_yx · b_xy)

Note: if the regression coefficients are negative then r is negative, and if they are positive then r is also positive.
4. The regression coefficients are independent of a change of origin but not of a change of scale.
5. The arithmetic mean of the two regression coefficients b_yx and b_xy is greater than or equal to the correlation coefficient between the variables X and Y. This is known as the mean property of regression coefficients, i.e.

(b_yx + b_xy) / 2 ≥ r
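These properties can be verified numerically. The sketch below, on illustrative (assumed) data, checks the signature, fundamental and mean properties:

```python
import math

# Sketch: checking the properties of regression coefficients on sample data.
X = [1, 2, 3, 4, 5]   # illustrative data
Y = [2, 4, 5, 4, 6]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)

b_yx = Sxy / Sxx
b_xy = Sxy / Syy

# Fundamental property: r is the geometric mean, carrying the common sign.
r = math.copysign(math.sqrt(b_yx * b_xy), b_yx)

assert b_yx * b_xy >= 0            # signature property: same sign
assert (b_yx + b_xy) / 2 >= r      # mean property (AM of coefficients >= r)
print(b_yx, b_xy, r)
```

The mean property is just the arithmetic-mean/geometric-mean inequality applied to b_yx and b_xy when both are positive.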
Coefficient of Determination (r²)
The square of the correlation coefficient r is called the coefficient of determination. It is a more useful and readily comprehensible measure, indicating the percentage of variation in the dependent variable that is accounted for by the independent variable. In other words, the coefficient of determination gives the ratio of the explained variance to the total variance. Thus, the coefficient of determination is denoted by r² and expressed as:

r² = explained variance / total variance

The coefficient of determination is a more useful and better measure for interpreting the value of r. For example, if the value of r is 0.6, we cannot conclude that 60% of the variation in the dependent variable is due to the independent variable. The coefficient of determination in this case is r² = 0.36, which implies that only 36% of the variation in the dependent variable has been explained by the independent variable, the remaining 64% being due to other factors.

Limitations of the Coefficient of Determination
1. The coefficient of determination is always a positive quantity and is thus unable to indicate the direction of the relationship (positive or negative) between the two variables.
2. It is important to note that the value of r² decreases more rapidly than r as r decreases.
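The ratio definition above can be checked against r² computed from the fitted line. The sketch below fits Y on X on illustrative (assumed) data and shows that explained variance divided by total variance equals the square of the correlation coefficient:

```python
# Sketch: coefficient of determination as explained variance / total variance.
X = [1, 2, 3, 4, 5]   # illustrative data
Y = [2, 4, 5, 4, 6]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n

# Fit Y on X by the deviation method.
b = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
a = ybar - b * xbar
Yhat = [a + b * x for x in X]   # fitted values

explained = sum((yh - ybar) ** 2 for yh in Yhat)   # variation explained by the line
total = sum((y - ybar) ** 2 for y in Y)            # total variation in Y

r2 = explained / total
print(r2)
```

Here r² comes out to about 0.727, i.e. roughly 73% of the variation in Y is explained by X and the remaining 27% is due to other factors.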
Practice Work
1. Find r when b_XY = 3 and b_YX = 0.3.
2. You are given the following observations on X and Y:

X: 1 3 6 8 5
Y: 1 3 2 5 4

a. Fit a regression equation of Y on X and hence predict Y when X = 7.
b. Fit a regression equation of X on Y and hence predict X when Y = 4. Also calculate the correlation coefficient and the coefficient of determination.
3. Estimate Y when X = 42, given X̄ = 40, Ȳ = 20, σ_X = 2.5, σ_Y = 4 and r = 0.6.
4. The equations of the two regression lines obtained in a correlation analysis are 3X + 2Y = 26 and 6X + Y = 31. Obtain X̄, Ȳ, r and the coefficient of determination.
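As a worked illustration of question 1, the fundamental property gives r directly as the geometric mean of the two regression coefficients:

```python
import math

# Sketch of practice question 1: r is the geometric mean of b_XY and b_YX.
b_XY, b_YX = 3, 0.3

# Both coefficients are positive, so by the signature property r is positive.
r = math.sqrt(b_XY * b_YX)
print(round(r, 4))  # → 0.9487
```

Note that b_XY = 3 exceeds one, which is permitted by the magnitude property only because the other coefficient, b_YX = 0.3, is small enough that their product stays below one.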