Chisquared test.pptx

KrishnaKrishKrish1 207 views 48 slides Jul 04, 2022

About This Presentation

A chi-squared test is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and its variants. The test is based on the differences between the observed and expected frequencies.


Slide Content

Chi-squared Test. Krishnakumar D, Biostatistician

Chi-Square (χ²) and Frequency Data. The test was proposed in 1900 by Karl Pearson. The data we analyze consist of frequencies, that is, the number of individuals falling into categories; in other words, the variables are measured on a nominal scale. The test statistic for such frequency data is the Pearson chi-square. Its magnitude reflects the amount of discrepancy between observed frequencies and expected frequencies.

Steps in a Test of Hypothesis
1. Determine the appropriate test
2. Establish the level of significance, α
3. Formulate the statistical hypothesis
4. Calculate the test statistic
5. Determine the degrees of freedom
6. Compare the calculated test statistic against a tabled/critical value

1. Determine the Appropriate Test. Chi-square is used when both variables are measured on a nominal scale. It can also be applied to interval or ratio data that have been categorized into a small number of groups. It assumes that the observations are randomly sampled from the population and that all observations are independent (an individual can appear only once in a table, and there are no overlapping categories). It makes no assumptions about the shape of the distribution or about the homogeneity of variances.

2. Establish the Level of Significance. α is a predetermined value; by convention α = .05, α = .01, or α = .001.

3. Formulate the Hypothesis: Is There an Association or Not? Ho: the two variables are independent. Ha: the two variables are associated.

4. Calculate the Test Statistic. The test contrasts the observed frequencies in each cell of a contingency table with the expected frequencies. The expected frequencies represent the number of cases that would be found in each cell if the null hypothesis were true (i.e., if the nominal variables were unrelated); they specify what the value of each cell of the table would be if there were no association between the two variables. The expected frequency of two unrelated events is the product of the row and column totals divided by the number of cases: Fe = Fr Fc / N.
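The formula Fe = Fr Fc / N can be sketched in a few lines of Python; the 2x2 counts below are made up purely for illustration:

```python
def expected_frequencies(table):
    """Compute Fe = Fr * Fc / N for every cell of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2x2 table of observed counts
observed = [[30, 10],
            [20, 40]]
# Row totals 40 and 60, column totals 50 and 50, N = 100
print(expected_frequencies(observed))  # -> [[20.0, 20.0], [30.0, 30.0]]
```

Each expected cell depends only on its row total, its column total, and the grand total, which is exactly what independence of the two variables implies.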

4. Calculate the Test Statistic (continued). The chi-square statistic sums, over all cells, the squared difference between the observed and expected frequencies divided by the expected frequency: χ² = Σ (O − E)² / E, where O is the observed frequency and E the expected frequency in each cell.

5. Determine the Degrees of Freedom. df = (R − 1)(C − 1), where R is the number of levels of the row variable and C is the number of levels of the column variable.

6. Compare the Computed Test Statistic Against a Tabled/Critical Value. The computed value of the Pearson chi-square statistic is compared with the critical value to determine whether the computed value is improbable under Ho. The critical tabled values are based on the sampling distribution of the Pearson chi-square statistic. If the calculated χ² is greater than the χ² table value, reject Ho.

Example: General Social Survey, 1991. Let X = income and Y = job satisfaction.

Hypothesis. Ho: X and Y are independent. H1: X and Y are dependent. The Pearson chi-square statistic is χ² = Σ (O − E)² / E. Degrees of freedom = 3 × 3 = 9 (four income levels and four satisfaction levels).

Observed Frequencies

Expected Frequencies

Income             Dissatisfied  Little satisf.  Mod. satisfied  Much satisfied
< 5,000                 0.8           3.0             13.3             4.9
5,000 to 15,000         1.3           4.6             20.6             7.5
15,000 to 25,000        0.9           3.2             14.5             5.3
> 25,000                0.9           3.2             14.5             5.3

Example calculations: (22 × 4)/104 and (24 × 23)/104.

Contributions to χ², (O − E)²/E, e.g. (8 − 5.3)²/5.3:

Income             Dissatisfied  Little satisf.  Mod. satisfied  Much satisfied
< 5,000                 1.6           0.4              0.0             0.7
5,000 to 15,000         0.4           0.4              0.1             1.6
15,000 to 25,000        0.9           1.5              0.0             1.4
> 25,000                0.9           0.0              0.2             1.4
Total                   3.8           2.4              0.3             5.1

χ² = 11.5

Table value (df = 9, α = 0.05) = 16.919. Since 11.5 < 16.919, the evidence against Ho is weak; it is possible that job satisfaction and income are independent.

An alternative, the likelihood-ratio test: it compares the observed values with the distribution of expected values based on the multinomial probability distribution, G² = 2 Σ O ln(O / E), giving 0.0866 here.

Pearson Statistic and Likelihood-Ratio Statistic. Like the Pearson statistic, G² takes its minimum value of 0 when all observed frequencies equal their expected frequencies, and larger values provide stronger evidence against Ho. Although the Pearson χ² and the likelihood-ratio G² are separate test statistics, they share many properties and usually lead to the same conclusion. When Ho is true and the expected frequencies are large, the two statistics have the same chi-squared distribution and their numerical values are similar.
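The similarity of the two statistics can be seen by computing both on the same table; the counts below are hypothetical, chosen only to illustrate the comparison:

```python
from math import log

def pearson_and_g2(observed):
    """Return (Pearson chi-square, likelihood-ratio G^2) for a two-way table.

    Assumes all observed counts are positive (a zero cell would need special
    handling in the G^2 term, since log(0) is undefined).
    """
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    chi2 = g2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n        # expected count under independence
            chi2 += (o - e) ** 2 / e         # Pearson term
            g2 += 2 * o * log(o / e)         # likelihood-ratio term
    return chi2, g2

chi2, g2 = pearson_and_g2([[30, 10], [20, 40]])
print(round(chi2, 3), round(g2, 3))  # the two values come out close together
```

Both statistics are referred to the same chi-squared distribution with (R − 1)(C − 1) degrees of freedom.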

Example 2. Two sample polls of votes for two candidates, A and B, for a public office are taken: one from among the residents of rural areas and one from among the residents of urban areas. The results are given in the adjoining table. Examine whether the nature of the area is related to voting preference in this election.

Hypothesis. Ho: the nature of the area is independent of voting preference in the election. H1: the nature of the area is dependent on voting preference in the election.

Interpretation. 1) The table value for 1 df at the 5% level of significance is 3.841; the calculated value is greater than the table value, so we conclude that the nature of the area is related to voting preference in the election. 2) The p-value (0.001 < 0.05) likewise leads to rejection of the null hypothesis. (The slide's figure shows the critical value 3.841 separating the acceptance region, 1 − α, from the rejection region, α.)
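The slide's poll counts are not reproduced in this transcript, so the decision rule is sketched below on made-up rural/urban counts; everything except the 3.841 table value is hypothetical:

```python
def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 contingency table (no continuity correction)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(2) for j in range(2))

# Hypothetical poll: rows = rural/urban residents, columns = votes for A/B
votes = [[620, 380],   # rural
         [550, 450]]   # urban
stat = chi_square_2x2(votes)
CRITICAL_5PCT_1DF = 3.841  # chi-square table value for 1 df at the 5% level
print(round(stat, 2), stat > CRITICAL_5PCT_1DF)  # reject Ho when True
```

With these made-up counts the statistic exceeds 3.841, so Ho would be rejected, mirroring the conclusion on the slide.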

Residuals. Testing for independence using the chi-square test infers whether an association between the two variables exists, based on the p-value. But it gives no information about the strength of the association. The strength of association is examined using 1) residual analysis and 2) partitioning the chi-square statistic.

Residual Analysis. This compares the observed (Oij) and expected (Eij) values. The difference between an observed value and its expected value is known as a residual: eij = Oij − Eij.

Pearson Residuals. The Pearson residual adjusts for the fact that larger values of Oij and Eij tend to have larger differences. One approach to adjusting for the variance is to divide the difference (Oij − Eij) by √Eij. Thus, define eij = (Oij − Eij) / √Eij as the Pearson residual. Note that the squared Pearson residuals sum to the Pearson statistic: Σ eij² = χ².

Standardised Pearson Residuals. Under Ho, the eij are asymptotically normal with mean 0; however, the variance of eij is less than 1. To compensate for this, one can use the standardized Pearson residuals, rij = (Oij − Eij) / √(Eij (1 − pi+)(1 − p+j)), where pi+ is the estimated row i marginal probability and p+j is the estimated column j marginal probability. rij is asymptotically distributed as a standard normal.

Standardised Pearson Residuals. As a rule of thumb, an absolute rij value greater than 2 or 3 indicates a lack of fit of Ho in that cell. However, as the number of cells increases, so does the likelihood that some cell has a value of 2 or 3 by chance; for example, with 20 cells you could expect 1 in 20 to exceed 2 just by chance (i.e., α = 0.05).
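A minimal sketch of the standardized-residual calculation, run on a made-up 2x2 table (the counts are hypothetical, not the slide's data):

```python
from math import sqrt

def standardized_residuals(table):
    """Standardized Pearson residuals r_ij = (O-E)/sqrt(E*(1-p_i+)*(1-p_+j))."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    result = []
    for i, row in enumerate(table):
        result.append([])
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n
            adj = sqrt(e * (1 - rows[i] / n) * (1 - cols[j] / n))
            result[-1].append((o - e) / adj)
    return result

# Hypothetical table: rows = rural/urban, columns = candidate A/B
r = standardized_residuals([[620, 380], [550, 450]])
print([[round(x, 2) for x in row] for row in r])
# Cells with |r_ij| > 2 depart markedly from what independence predicts
```

In a 2x2 table all four standardized residuals have the same absolute value, so a single large residual flags the whole pattern of association.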

Using SPSS

Output. We find large positive residuals for "rural preferred candidate A" and "urban preferred candidate B", and large negative residuals for "rural preferred candidate B" and "urban preferred candidate A". Thus, there were significantly more people in the first two cells, and fewer in the latter two, than the hypothesis of independence predicts.

Partitioning the Likelihood-Ratio Test. Motivation: 1) if you reject Ho and conclude that X and Y are dependent, the next question could be whether some individual comparisons are more significant than others; 2) partitioning (breaking a general I × J contingency table into smaller tables) may show that the association largely depends on certain categories or groupings of categories. Recall these basic properties of chi-square variables: 1) if X1 and X2 are both (independently) distributed as χ² with df = 1, then X = X1 + X2 ∼ χ² with df = 1 + 1 = 2; 2) in general, the sum of independent χ² random variables is distributed as χ² with df = Σ df(Xi).

General Rules for Partitioning. To completely partition an I × J contingency table, follow this three-step plan:
1. The df for the subtables must sum to the df for the full table
2. Each cell count in the full table must be a cell count in one and only one subtable
3. Each marginal total of the full table must be a marginal total for one and only one subtable

Example. Independent random samples of 83, 60, 56, and 62 faculty members of a state university system from four system universities were polled to determine which of three collective bargaining agents (i.e., unions) is preferred. Interest centers on whether there is evidence of a difference in the distribution of preference across the four universities.

Therefore, we see that there is a significant association between university and bargaining agent. Just by looking at the data: University 4 seems to prefer agent 103; universities 1 and 2 seem to prefer agent 101; university 3 may be undecided but leans towards agent 102. Partitioning will help examine these trends.

First Subtable. The association with University 4 appears the strongest, so we consider a subtable obtained by comparing the {4, 3} cell with the rest of the table. G² = 60.5440 on 1 df (p ≈ 0). We see strong evidence of an association between the universities (grouped accordingly) and the agents.

Second Subtable. Now consider just agents 101 and 102 with universities 1-3. G² = 1.6378 on 2 df (p = 0.4411). For universities 1-3 and agents 101 and 102, preference is homogeneous (the agents are preferred in similar proportions from one university to another).

Third Subtable. We can also consider the bargaining agents by dichotomized university. G² = 4.8441 on 1 df (p = 0.0277). There is an indication that the preference for agents changes with the introduction of University 4.

Subtable 3:

University    Agent 101   Agent 102   Total
1 to 3            99          80        179
4                  8          17         25
Total            107          97        204
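Subtable 3 is given in full above, so its G² value can be checked directly; this is a minimal sketch of the likelihood-ratio computation:

```python
from math import log

def g_squared(observed):
    """Likelihood-ratio statistic G^2 = 2 * sum O * ln(O/E) for a two-way table.

    E is the usual independence expectation, row total * column total / N.
    Assumes all observed counts are positive.
    """
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    n = sum(rows)
    return 2 * sum(o * log(o * n / (rows[i] * cols[j]))
                   for i, row in enumerate(observed) for j, o in enumerate(row))

# Subtable 3 from the slide: universities 1-3 vs university 4, agents 101 vs 102
subtable3 = [[99, 80],
             [8, 17]]
print(round(g_squared(subtable3), 3))  # close to the slide's 4.8441 on 1 df
```

The same function applied to each subtable would reproduce the partition sum checked later in the deck.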

Final Table. A final table we can construct adds agent 103 back into the summary: G² = 4.9660 on 2 df (p = 0.0835). With agent 103 included, universities 1-3 still show homogeneous preference.

What Have We Done? General notes: we created four subtables with df of 1, 2, 1, and 2. Rule 1: the df must sum to the total, and 1 + 2 + 1 + 2 = 6. Check! Rule 2: each cell count appears in only one subtable (42 was in subtable 2, 29 in subtable 2, ..., 37 in subtable 1). Check! Rule 3: each marginal total appears only once (83 in subtable 4, 60 in subtable 4, 56 in subtable 4, 62 in subtable 1, 107 in subtable 3, 97 in subtable 3, 57 in subtable 1). Check! Since we have partitioned according to the rules, the G² values sum: G² = 60.5440 + 1.6378 + 4.8441 + 4.9660 = 71.9919 on 6 df, which matches the value obtained from the original table.

Overall Summary of the Example. Now that we have verified the partitioning, we can draw inference from the subtables. From the partitioning we observe that the preference distribution is homogeneous among universities 1-3: preference for a bargaining agent is independent of the faculty member's university, except that a faculty member at University 4 is much more likely than would otherwise have been expected to prefer bargaining agent 103 (and vice versa).

Final Comments on Partitioning. For the likelihood-ratio test (G²), exact partitioning occurs: you can sum the fully partitioned subtables' G² values to arrive at the original G². Pearson's χ² does not have this property, so use the summation of G² to double-check your partitioning. You can have as many subtables as you have df; however, as in our example, you may have subtables with df > 1 (which yields fewer subtables). The selection of subtables is not unique; to initiate the process, you can use residual analysis to identify the most extreme cell and begin there (this is why the {4, 3} cell was isolated first). Partitioning is not easy and is an acquired knack, but the reward is the additional interpretation that is generally desired in a data summary.

Advantages and Limitations of Chi-Squared Tests. The Pearson chi-square statistic and the likelihood-ratio statistic do not change value under reorderings of rows or columns; that is, they treat both variables as nominal. If at least one variable is ordinal, these tests do not exploit that ordering. G² and χ² require large samples: when an expected frequency is small (< 5), the results from G² and χ² are unreliable. So whenever at least one expected frequency is less than 5, you can instead use a small-sample procedure (such as an exact test).

THANK YOU