Measures of Association


About This Presentation

Product moment correlation
Rank correlation
Test of Significance
Coefficient of determination
Linear regression


Slide Content

Measures of Association
Dr. Manoj Kumar Meher, Kalahandi University, [email protected]

The measures of association refer to a wide variety of coefficients that measure the statistical strength of the relationship between the variables of interest; these measures of strength, or association, can be described in several ways, depending on the analysis. There are certain statistical distinctions that a researcher should know in order to better understand the measures of statistical association.

1. The student should know that measures of association are not the same as measures of statistical significance. Tests of significance have a null hypothesis stating that there is no difference between the strength of an observed relationship and the strength of the relationship expected under simple random sampling. It is therefore possible to have a relationship with strong measures of association that is not statistically significant, and a relationship with weak measures of association that is highly significant.

2. A coefficient of association (which can vary depending on the analysis) with a value of zero signifies that no relationship exists. In correlation analysis, a coefficient (r) with a value of one signifies a perfect relationship between the variables of interest. In regression analysis, a standardized beta weight (β) with a value of one likewise signifies a perfect relationship. With regard to linear relationships, the measures of association are those which deal with strictly monotonic, ordered monotonic, predictive monotonic, and weak monotonic relationships. If a relationship is perfect under strict monotonicity, it is perfect under the other conditions as well; however, one cannot have perfect ordered and perfect predictive monotonicity at the same time. The researcher should note that these linear definitions of perfect relationships are inappropriate for curvilinear or discontinuous relationships.

3. The measures of association define the strength of a linear relationship in terms of its degree of monotonicity, which is assessed by counting various types of pairs of observations. There are four types of pairs: concordant pairs (pairs that agree with each other, i.e. the two variables move in the same direction), discordant pairs (pairs that disagree, moving in opposite directions), pairs tied on one variable, and pairs tied on the other variable. As the number of concordant pairs increases, the coefficient of association moves towards +1 under all the linear definitions of a perfect relationship; a minimal sketch of this pair counting is given below.
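The pair counting can be made concrete in code. Below is a minimal sketch in pure Python (the function name is illustrative, not from the presentation):

from itertools import combinations

def count_pairs(x, y):
    # Classify every pair of (x, y) observations as concordant,
    # discordant, or tied on x or on y.
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            tied_x += 1
        elif y1 == y2:
            tied_y += 1
        elif (x1 - x2) * (y1 - y2) > 0:
            concordant += 1   # both variables move in the same direction
        else:
            discordant += 1   # the variables move in opposite directions
    return concordant, discordant, tied_x, tied_y

# For perfectly increasing data, every pair is concordant:
print(count_pairs([1, 2, 3, 4], [10, 20, 30, 40]))  # (6, 0, 0, 0)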

Certain assumptions are made by the measures of association: they assume categorical (nominal or ordinal) or continuous levels of data; they assume either a symmetrical or an asymmetrical causal direction; those that define an ideal relationship in terms of strict monotonicity attain the value of one only if the two variables have evolved from the same marginal distribution; and they ignore rows and columns that contain null values.

Product moment correlation
Rank correlation
Test of significance
Coefficient of determination
Linear regression

Product Moment Correlation
The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the strength of a linear association between two variables and is denoted by r. Basically, a Pearson product-moment correlation attempts to draw a line of best fit through the data of two variables, and the Pearson correlation coefficient, r, indicates how far these data points lie from the line of best fit (i.e., how well the data points fit this new model/line of best fit).

Rule of thumb
r          Strength of relationship
< 0.2      Negligible
0.2 - 0.4  Low
0.4 - 0.7  Moderate
0.7 - 0.9  High
> 0.9      Very high

Formula: Pearson correlation coefficient (r)

r = Σ(x - x̄)(y - ȳ) / √( Σ(x - x̄)² · Σ(y - ȳ)² )

or, equivalently,

r = ( NΣxy - Σx·Σy ) / √( (NΣx² - (Σx)²)(NΣy² - (Σy)²) )
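A minimal sketch of the deviation-score form of the formula in pure Python (the function name is illustrative):

import math

def pearson_r(x, y):
    # r = sum((x - x̄)(y - ȳ)) / sqrt(sum((x - x̄)²) * sum((y - ȳ)²))
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mean_x) ** 2 for xi in x)
                    * sum((yi - mean_y) ** 2 for yi in y))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 for a perfect linear relationship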

Production of paddy in Qnt./Hact. (x); fertiliser used in Kg/Hact. (y)

x     y     x-x̄      y-ȳ      (x-x̄)²    (y-ȳ)²     (x-x̄)(y-ȳ)
40    90    -1.55    28.27    2.39      799.35     -43.69
42    65    0.45     3.27     0.21      10.71      1.49
37    56    -4.55    -5.73    20.66     32.80      26.03
28    47    -13.55   -14.73   183.48    216.89     199.49
75    59    33.45    -2.73    1119.21   7.44       -91.24
15    38    -26.55   -23.73   704.66    562.98     629.85
45    89    3.45     27.27    11.93     743.80     94.21
47    125   5.45     63.27    29.75     4003.44    345.12
49    25    7.45     -36.73   55.57     1348.89    -273.79
28    58    -13.55   -3.73    183.48    13.89      50.49
51    27    9.45     -34.73   89.39     1205.98    -328.33
x̄ = 41.55, ȳ = 61.73; Σ(x-x̄)² = 2400.73, Σ(y-ȳ)² = 8946.18, Σ(x-x̄)(y-ȳ) = 609.64

r = Σ(x-x̄)(y-ȳ) / √( Σ(x-x̄)² · Σ(y-ȳ)² ) = 609.64 / √(2400.73 × 8946.18) = 0.132

By the rule of thumb above, r = 0.132 indicates a negligible relationship.
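As a cross-check of the arithmetic above (assuming scipy is available):

from scipy import stats

paddy = [40, 42, 37, 28, 75, 15, 45, 47, 49, 28, 51]        # Qnt./Hact.
fertiliser = [90, 65, 56, 47, 59, 38, 89, 125, 25, 58, 27]  # Kg/Hact.

r, p_value = stats.pearsonr(paddy, fertiliser)
print(round(r, 3))  # ≈ 0.132, negligible by the rule of thumb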

Solve the problem:
X: 5   7   8   6   10  25  15  10  7   3
Y: 10  25  15  10  8   12  6   11  9   5

Exercise
Sl No   Urban Population   Literacy (%)
1       60                 73
2       35                 29
3       15                 36
4       22                 14
5       18                 20
6       38                 48
7       47                 45
8       5                  12
9       12                 13
10      9                  10

Rank Correlation
The Spearman's rank correlation coefficient is a non-parametric statistical measure used to study the strength of association between two ranked variables. The method is applied to ordinal data, i.e. data that can be arranged in order, one after the other, so that a rank can be given to each observation. In the rank correlation method, ranks are assigned to each individual on the basis of its quality or quantity; ranking starts from position 1 and goes to position N for the one ranked last in the group.

Formula
R = 1 - (6ΣD²) / (N(N² - 1))
Where,
R = rank coefficient of correlation
D = difference of ranks
N = number of observations

Equal ranks or ties in ranks: in case the same rank is assigned to two or more entities, the ranks are assigned on an average basis. For example, if two individuals are tied for the third position, they occupy positions 3 and 4, so each receives the rank (3 + 4)/2 = 3.5.
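Below is a sketch of this computation in Python (illustrative function names, not from the presentation). average_ranks implements the tie rule just described; spearman_r applies the D² formula, which strictly assumes untied ranks:

def average_ranks(values):
    # Assign ranks 1..N, giving tied values the average of the
    # positions they occupy (e.g. a tie for 3rd and 4th -> 3.5 each).
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2          # positions are 1-based: (i+1 .. j+1)
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    # R = 1 - 6*sum(D²) / (N*(N² - 1)), where D = rank(x) - rank(y)
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]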

Population Density   No. of Districts
100-120              1
120-140              3
140-160              4
160-180              6
180-200              8
200-220              14
220-240              12
240-260              11
260-280              15
280-300              7

Population Density (X)   No. of Districts (Y)   Order (X)   Order (Y)   D = Order(X) - Order(Y)   D²
100-120                  1                      1           10          -9                        81
120-140                  3                      2           9           -7                        49
140-160                  4                      3           8           -5                        25
160-180                  6                      4           7           -3                        9
180-200                  8                      5           5           0                         0
200-220                  14                     6           2           4                         16
220-240                  12                     7           3           4                         16
240-260                  11                     8           4           4                         16
260-280                  15                     9           1           8                         64
280-300                  7                      10          6           4                         16
ΣD² = 292

R = 1 - (6 × 292) / (10 × (10² - 1)) = 1 - 1.77 = -0.77

Note that X has been ranked in ascending order but Y in descending order, which reverses the sign of the coefficient; with both variables ranked in the same direction, R = +0.77, a strong positive association (see the check below).
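With both variables ranked in the same direction, a library check (assuming scipy is available) confirms the positive coefficient:

from scipy import stats

class_order = list(range(1, 11))                 # density classes 100-120 ... 280-300
districts = [1, 3, 4, 6, 8, 14, 12, 11, 15, 7]

r, p = stats.spearmanr(class_order, districts)
print(round(r, 2))  # 0.77, a strong positive association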

Exercise
Sl No   Urban Population ('000)   Literacy (%)
1       60                        73
2       35                        29
3       15                        36
4       22                        14
5       18                        20
6       38                        48
7       47                        45
8       5                         36
9       12                        13
10      22                        36

Exercise
X: 100   120   111   200   250   99    98    135   189   60
Y: 1025  3336  4258  150   589   7589  1587  987   687   1523

Coefficient of Determination
The coefficient of determination, denoted R² or r² and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses on the basis of related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of the total variation of outcomes explained by the model. There are several definitions of R² that are only sometimes equivalent. One class of such cases includes that of simple linear regression, where r² is used instead of R². When an intercept is included, r² is simply the square of the sample correlation coefficient (r) between the observed outcomes and the observed predictor values. If additional regressors are included, R² is the square of the coefficient of multiple correlation. In both such cases, the coefficient of determination normally ranges from 0 to 1.

Steps to find the coefficient of determination:
1. Find r, the correlation coefficient.
2. Square r.
3. Convert r² to a percentage (a trivial sketch follows).
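The three steps in code, using r = 0.132 from the paddy example earlier:

r = 0.132                     # step 1: the correlation coefficient found earlier
r_squared = r ** 2            # step 2: square it -> ≈ 0.0174
print(f"{r_squared:.1%}")     # step 3: as a percentage -> "1.7%"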

How to interpret the coefficient of determination? The coefficient of determination, or the R-squared value, is a value between 0.0 and 1.0 that expresses what proportion of the variance in Y can be explained by X:
If R² = 1, we have a perfect fit: the values of Y are fully determined (i.e., without any error) by the values of X, and all data points lie precisely on the estimated best-fit line.
If R² = 0, our model is no better at predicting the values of Y than a model which always returns the average value of Y as its prediction.
Multiplying R² by 100% gives the percentage of the variance in Y which is explained with the help of X. For instance, if R² = 0.8, then 80% of the variance in Y is predicted by X; if R² = 0.5, then half of the variance in Y can be explained by X.
The complementary percentage, (1 - R²) × 100%, quantifies the unexplained variance: if R² = 0.6, then 60% of the variance in Y has been explained with the help of X, while the remaining 40% remains unaccounted for.

Formula for the coefficient of determination, R². Here are a few (equivalent) formulae:
R² = SSR / SST
R² = 1 - SSE / SST
R² = SSR / (SSR + SSE)
(To be discussed after regression.)

The sum of squares of errors (SSE for short), also called the residual sum of squares:
SSE = Σ(yᵢ - ŷᵢ)²
SSE quantifies the discrepancy between the real values of Y and those predicted by our model.
The regression sum of squares (SSR), sometimes also called the explained sum of squares:
SSR = Σ(ŷᵢ - ȳ)²
SSR measures the difference between the values predicted by the regression model and those predicted in the most basic way, namely by ignoring X completely and using only the average value of Y as a universal predictor.
The total sum of squares (SST), which quantifies the total variability in Y:
SST = Σ(yᵢ - ȳ)²
It turns out that these three sums of squares satisfy SST = SSR + SSE, so you only need to calculate any two of them and the remaining one can easily be found.

Sum of squares of errors, regression sum of squares, and total sum of squares: worked example (fitted line: Ŷ = 1.05 + 0.15X)

Yi    Xi   Ŷ     (Ŷ-Yi)² [SSE]   (Ŷ-Ȳ)² [SSR]   (Yi-Ȳ)² [SST]
3.5   16   3.45  0.0025          0.2025         0.25
3.2   14   3.15  0.0025          0.0225         0.04
3.0   12   2.85  0.0225          0.0225         0.00
2.6   11   2.70  0.0100          0.0900         0.16
2.9   12   2.85  0.0025          0.0225         0.01
3.3   15   3.30  0.0000          0.0900         0.09
2.7   13   3.00  0.0900          0.0000         0.09
2.8   11   2.70  0.0100          0.0900         0.04
Sum              0.1400          0.5400         0.68
Means: Ȳ = 3.0, X̄ = 13

R² = SSR/SST = 0.54/0.68 = 0.7941
R² = 1 - SSE/SST = 1 - 0.14/0.68 = 0.7941
R² = SSR/(SSR + SSE) = 0.54/0.68 = 0.7941
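A quick check of this table in code (a sketch; the fitted line Ŷ = 1.05 + 0.15X is derived in the regression slides below):

y = [3.5, 3.2, 3.0, 2.6, 2.9, 3.3, 2.7, 2.8]
x = [16, 14, 12, 11, 12, 15, 13, 11]

y_hat = [1.05 + 0.15 * xi for xi in x]       # fitted values
y_bar = sum(y) / len(y)                      # mean of Y = 3.0

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # 0.14
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # 0.54
sst = sum((yi - y_bar) ** 2 for yi in y)                # 0.68

print(sse + ssr)   # equals SST = 0.68
print(ssr / sst)   # R² ≈ 0.7941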

A second example (fitted line: y = 0.8647x + 20.57, i.e. a = 20.57, b ≈ 0.86)

Yi      Xi      Ŷ      (Ŷ-Yi)² [SSE]   (Ŷ-mean)² [SSR]
25.00   12.00   30.89  34.69           36.24
35.00   18.00   36.05  1.10            0.74
58.00   22.00   39.49  342.62          6.66
37.00   15.00   33.47  12.46           11.83
27.00   25.00   42.07  227.10          26.63
45.00   17.00   35.19  96.24           2.96
62.00   22.00   39.49  506.70          6.66
32.00   32.00   48.09  258.89          124.99
12.00   8.00    27.45  238.70          89.49
Mean of the predicted values = 36.91; ΣSSE = 1718.51; ΣSSR = 306.19

R² = SSR/(SSR + SSE) = 306.19/(306.19 + 1718.51) = 0.1512

Linear Regression
In any bivariate distribution the line of best fit is known as the regression line. There are two regression lines, because there are two variables: if x and y are the two variables, we get the regression of y on x and of x on y, i.e. a set of values of y can be obtained for given values of x, and likewise a set of values of x can be obtained for given values of y. The line can be fitted by the method of least squares, i.e. so that the sum of the squared deviations from the expected values is a minimum. The least-squares method states that the curve that best fits a given set of observations is the curve having the minimum sum of squared residuals (or deviations, or errors) from the given data points.

Formula
Y = a + bX
Here we need to find the value of the slope of the line, b, plotted in the scatter plot, and the intercept, a:
b = (N ΣXY - ΣX ΣY) / (N ΣX² - (ΣX)²)
a = (ΣY ΣX² - ΣX ΣXY) / (N ΣX² - (ΣX)²)
Where
N = number of observations
X = variable X
Y = variable Y

X: 1  2  3  4  5
Y: 1  4  5  6

Sl   x     y     x²     y²      xy
1    16    3.5   256    12.25   56
2    14    3.2   196    10.24   44.8
3    12    3.0   144    9.00    36
4    11    2.6   121    6.76    28.6
5    12    2.9   144    8.41    34.8
6    15    3.3   225    10.89   49.5
7    13    2.7   169    7.29    35.1
8    11    2.8   121    7.84    30.8
Σ    104   24    1376   72.68   315.6

a = (24 × 1376 - 104 × 315.6) / (8 × 1376 - 104²) = 201.6/192 = 1.05
b = (8 × 315.6 - 104 × 24) / (8 × 1376 - 104²) = 28.8/192 = 0.15
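The same fit as a short Python sketch, implementing the two formulas directly:

x = [16, 14, 12, 11, 12, 15, 13, 11]
y = [3.5, 3.2, 3.0, 2.6, 2.9, 3.3, 2.7, 2.8]
n = len(x)

sum_x, sum_y = sum(x), sum(y)                   # 104, 24
sum_x2 = sum(xi ** 2 for xi in x)               # 1376
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # 315.6

denom = n * sum_x2 - sum_x ** 2                 # 192
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom   # intercept = 1.05
b = (n * sum_xy - sum_x * sum_y) / denom        # slope = 0.15

print(f"Y = {a:.2f} + {b:.2f}X")                # Y = 1.05 + 0.15X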

Find X if Y = 3.0. With a = 1.05 and b = 0.15:
Y = a + bX
3 = 1.05 + 0.15X
0.15X = 3 - 1.05 = 1.95
X = 1.95/0.15 = 13

Exercise: find Y if X = 20.

Predicted values from Y = 1.05 + 0.15X:
x     y
10    2.55
11    2.70
12    2.85
13    3.00
14    3.15
15    3.30
16    3.45
17    3.60
18    3.75
19    3.90
20    4.05

Observed values:
x     y
16    3.5
14    3.2
12    3.0
11    2.6
12    2.9
15    3.3
13    2.7
11    2.8
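The predicted values above can be generated directly from the fitted line; a two-line sketch:

a, b = 1.05, 0.15
for x in range(10, 21):
    print(x, round(a + b * x, 2))   # e.g. x=13 -> 3.0, x=20 -> 4.05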

Exercise
X     Y
12    25
18    35
22    58
15    37
25    27
17    45
22    62
32    32
8     12
Find Y if X = 25, 50, 75 & 100.