TEST WORTHINESS: VALIDITY, RELIABILITY, PRACTICALITY

Slide Content

Chapter 5 Test Worthiness Neukrug/Fawcett, Essentials of Testing and Assessment: A Practical Guide for Counselors, Social Workers, and Psychologists , 3rd Edition. © 2015 Cengage. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.

Test Worthiness. Four cornerstones of test worthiness: validity, reliability, cross-cultural fairness, and practicality. But first, we must learn one statistical concept: the correlation coefficient.

Correlation Coefficient (1 of 5). Correlation: a statistical expression of the relationship between two sets of scores (or variables). Positive correlation: an increase in one variable is accompanied by an increase in the other (a “direct” relationship). Negative correlation: an increase in one variable is accompanied by a decrease in the other (an “inverse” relationship).

Correlation Coefficient (2 of 5) What is the relationship between: Gasoline prices and grocery prices? Grocery prices and good weather? Stress and depression? Depression and job productivity? Partying and grades? Study time and grades?

Correlation Coefficient (3 of 5). Correlation coefficient (r): a number between −1 and +1 that indicates the direction and strength of the relationship. As r approaches +1, strength increases in a direct, positive way. As r approaches −1, strength increases in an inverse, negative way. As r approaches 0, the relationship is weak, and at zero it is nonexistent.

Correlation Coefficient (4 of 5)

Correlation Coefficient (5 of 5)

Correlation Examples

SAT score    College GPA
930          3.0
750          2.9
1110         3.8
625          2.1
885          3.3
950          2.6
605          2.8
810          3.2
1045         3.0
910          3.5
r = .35

Missed classes    College GPA
3                 3.0
5                 2.9
2                 3.8
8                 2.1
1                 3.3
6                 2.6
3                 2.8
1                 3.2
3                 3.0
?                 3.5
r = −.67
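Since the correlation coefficient anchors everything that follows, here is a minimal Python sketch (assuming NumPy is available; variable names are illustrative) that computes r for the SAT/GPA pairs in the table above:

```python
import numpy as np

# SAT/GPA pairs from the table above
sat = np.array([930, 750, 1110, 625, 885, 950, 605, 810, 1045, 910])
gpa = np.array([3.0, 2.9, 3.8, 2.1, 3.3, 2.6, 2.8, 3.2, 3.0, 3.5])

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry is Pearson's r
r = np.corrcoef(sat, gpa)[0, 1]
print(f"r = {r:.2f}")
```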

Correlation Scatterplots (1 of 2). Plotting two sets of scores from the previous examples on a graph: place person A’s SAT score on the x-axis and his/her GPA on the y-axis, then continue for persons B, C, D, and so on. This process forms a scatterplot.
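A scatterplot of the same SAT/GPA pairs can be sketched with matplotlib (an illustrative sketch, not part of the original slides):

```python
import matplotlib.pyplot as plt

sat = [930, 750, 1110, 625, 885, 950, 605, 810, 1045, 910]
gpa = [3.0, 2.9, 3.8, 2.1, 3.3, 2.6, 2.8, 3.2, 3.0, 3.5]

# Each point is one person: SAT score on the x-axis, GPA on the y-axis
plt.scatter(sat, gpa)
plt.xlabel("SAT score")
plt.ylabel("College GPA")
plt.title("SAT scores vs. college GPA")
plt.show()
```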

Examples of Scatterplots

Correlation Scatterplots (2 of 2). What correlation (r) do you think this graph has? How about this correlation?

More Scatterplots (1 of 2) What might this correlation be? This correlation?

More Scatterplots (2 of 2) This correlation? Last one…

Coefficient of Determination (Shared Variance) (1 of 2). The square of the correlation (r²). A statement about the factors underlying the two variables that account for their relationship.

Coefficient of Determination (Shared Variance) (2 of 2). Correlation between depression and anxiety = .85; shared variance = .72. What factors might underlie both depression and anxiety?
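As a worked check of the figures on this slide, the shared variance is simply the squared correlation:

$$r^2 = (0.85)^2 = 0.7225 \approx 0.72$$

so roughly 72% of the variance in the two measures is shared.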

Validity. What is validity? The degree to which all accumulated evidence supports the intended interpretation of test scores for the intended purpose. Validity is a unitary concept; however, there are three general types of validity evidence: content validity, criterion-related validity, and construct validity.

Content Validity (1 of 3). Is the content valid for the kind of test it is? Developers must show evidence that the domain was systematically analyzed and that its concepts are covered in the correct proportion. Four-step process: Step 1, survey the domain; Step 2, match the content of the test to the domain; Step 3, match specific test items to the content; Step 4, analyze the relative importance of each objective (weighting).

Content Validity (2 of 3)

Content Validity (3 of 3). Face validity: not a real type of content validity, but a quick look at the “face” value of the questions. Sometimes questions may not seem to measure the content, but do. How might you show content validity for an instrument that measures depression?

Criterion-Related Validity. The relationship between the test and a criterion the test should be related to. Two types: concurrent validity (does the instrument relate to another criterion now, in the present?) and predictive validity (does the instrument relate to another criterion in the future?).

Criterion-Related Validity: Concurrent Validity. Example 1: 100 clients take the BDI; correlate their scores with clinicians’ ratings of depression for the same group of clients. Example 2: 500 people take a test of alcoholism tendency; correlate their scores with how significant others rate the amount of alcohol they drink.

Criterion-Related Validity: Predictive Validity (1 of 2). Examples: SAT scores correlated with how well students do in college; ASVAB scores correlated with success at jobs; GREs correlated with success in graduate school.

Criterion-Related Validity: Predictive Validity (2 of 2). TABLE 5.1: Average estimated correlations of GRE General Test (Verbal, Quantitative, and Analytical) scores and undergraduate grade point average with graduate first-year grade point average, by department type. V = GRE verbal, Q = GRE quantitative, A = GRE analytical, U = undergraduate grade point average. *Combination of individual predictors. Source: Graduate Record Examinations, 2004–2005 Guide to the Use of Scores, p. 22. Reprinted by permission of Educational Testing Service, the copyright owner. Copyright © 2013 Educational Testing Service. www.ets.org

Concepts Related to Predictive Validity. Standard error of the estimate: using a known value of one variable to predict a potential range of scores on a second variable (e.g., using a GRE score to predict a range of GPAs). False positive: the instrument predicts an attribute that does not exist. False negative: the instrument predicts that an attribute is absent when in fact it exists.
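For reference, the standard error of the estimate is conventionally computed from the criterion's standard deviation and the predictor-criterion correlation (a standard formula, not shown on the slide):

$$s_{est} = s_Y \sqrt{1 - r_{XY}^2}$$

so the stronger the correlation, the narrower the range of predicted scores.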

Construct Validity (1 of 2). Construct validity: the extent to which the instrument measures a theoretical or hypothetical trait. Many counseling and psychological constructs are complex, ambiguous, and not easily agreed upon: intelligence, self-esteem, empathy, and other personality characteristics.

Construct Validity (2 of 2). Four methods of gathering evidence for construct validity: experimental design, factor analysis, convergence with other instruments, and discrimination with other measures.

Construct Validity: Experimental Design (1 of 2). Creating hypotheses and research studies that show the instrument captures the correct concept.

Construct Validity: Experimental Design (2 of 2). Example: Hypothesis: The “Blank” depression test will discriminate between clinically depressed clients and “normals.” Method: Identify 100 clinically depressed clients and 100 “normal” clients and show the statistical analysis.

Construct Validity: Factor Analysis (1 of 2). The statistical relationship between subscales of a test: how similar or different are the subscales?

Construct Validity: Factor Analysis (2 of 2). Example: Develop a depression test with three subscales: self-esteem, suicidal ideation, and hopelessness. Correlate the subscales: self-esteem and suicidal ideation, .35; self-esteem and hopelessness, .25; hopelessness and suicidal ideation, .82. What implications might these correlations have for the test?

Construct Validity: Convergent Validity. Convergence evidence: comparing test scores to other, well-established tests. Example: Correlate a new depression test against the BDI. Is there a good correlation between the two? What are the implications if the correlation is extremely high? Extremely low?

Construct Validity: Discriminant Validity. Discriminant evidence: correlate test scores with tests that measure something different; the hope is to find a meager correlation. Example: Compare a new depression test with an anxiety test. What are the implications if the correlation is extremely high? Extremely low?

Validity Recap. Three types of validity evidence: content validity; criterion-related validity (concurrent and predictive); and construct validity (experimental design, factor analysis, convergent, and discriminant).

Reliability. The accuracy or consistency of test scores. Would you score the same if you took the test over, and over, and over again? Reported as a reliability (correlation) coefficient: the closer to r = 1.0, the less error in the test.

Three Ways of Determining Reliability: test-retest; alternate, parallel, or equivalent forms; and internal consistency (split-half or odd-even, coefficient alpha, Kuder-Richardson).

Types of Reliability

Test-Retest Reliability. Give the test twice to the same group of people: take the first test in this class and, very soon after, take it again. Are the scores about the same?

          Person 1   Person 2   Person 3   Person 4   Person 5
1st Test  35         42         43         34         38
2nd Test  36         44         41         34         37

Problem: A person can look up answers between testings.

Alternate, Parallel, or Equivalent Forms Reliability. Have two forms of the same test; give students both forms at the same time and correlate scores on the first form with scores on the second. Problem: Are two “equivalent” forms ever really equivalent?

Internal Consistency Reliability (1 of 2) How do individual items relate to each other and the test as a whole? Internal consistency reliability is going “within” the test rather than using multiple administrations

Internal Consistency Reliability (2 of 2). High-speed computers and the need for only one test administration have made internal consistency popular. Three types: split-half or odd-even, Cronbach’s coefficient alpha, and Kuder-Richardson.

Split-Half or Odd-Even Reliability. Correlate one half of the test with the other half for everyone who took it, and use the Spearman-Brown formula to correct for the shortness of the test. Example: P1 scores 16 on the first half of the test and 16 on the second half; P2 scores 14 on the first half and 18 on the second half; and so on. Correlate all persons’ scores on the first half with their scores on the second half. That correlation is the reliability estimate.
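The Spearman-Brown correction mentioned above takes the half-test correlation $r_{hh}$ and estimates the full-length reliability:

$$r_{full} = \frac{2\,r_{hh}}{1 + r_{hh}}$$

For example, a half-test correlation of .70 (an illustrative value, not from the slides) gives $2(.70)/(1 + .70) \approx .82$.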

Split-Half or Odd-Even Reliability (Internal Consistency)

Person   Score (1st Half)   Score (2nd Half)
1        16                 16
2        14                 18
3        12                 20
4        15                 17

Problem: Are any two halves really equivalent?

Cronbach’s Alpha and Kuder-Richardson (Internal Consistency). Other types of internal consistency take the average correlation of all of the possible split-half reliabilities. Two popular types: Cronbach’s alpha and Kuder-Richardson (KR-20, KR-21).
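A minimal sketch of how coefficient alpha can be computed from an examinee-by-item score matrix (the data here are hypothetical, assuming NumPy):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (examinees x items) score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 examinees answering 4 items scored 0-3
scores = np.array([
    [3, 2, 3, 2],
    [2, 2, 1, 2],
    [1, 0, 1, 1],
    [3, 3, 3, 2],
    [0, 1, 0, 1],
    [2, 2, 2, 3],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```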

Item Response Theory: Another Way of Looking at Reliability (1 of 3). An extension of classical test theory, which looks at the amount of error in the total test. IRT looks at the probability that individuals will answer each item correctly (or match the quality being assessed); that is, each item is assessed for its ability to measure the trait being examined.

Item Response Theory: Another Way of Looking at Reliability (2 of 3)

Item Response Theory: Another Way of Looking at Reliability (3 of 3). Individuals with lower ability have a lower probability of getting certain items correct; individuals with higher ability have a higher probability of getting more items correct. Each item is examined for its ability to discriminate based on the trait being measured. The better a test can discriminate, the more reliable it is.
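A sketch of an item characteristic curve under the two-parameter logistic (2PL) model commonly used in IRT (the parameter values here are illustrative):

```python
import numpy as np

def icc(theta, a=1.0, b=0.0):
    """2PL item characteristic curve: probability of a correct
    response given ability theta, item discrimination a, and
    item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Probability of success rises with ability; a larger a (steeper
# curve) means the item discriminates better between ability levels
abilities = np.linspace(-3, 3, 7)
print(icc(abilities, a=1.5, b=0.5))
```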

Cross-Cultural Fairness (1 of 7). Bias in testing did not get much attention until the civil rights movement of the 1960s. A series of court decisions established that it was unfair to use tests to track students in schools: Black and Hispanic students were being unfairly compared to whites, not to their norm group.

Cross-Cultural Fairness (2 of 7). Griggs v. Duke Power Company: tests used for hiring and advancement must show the ability to predict job performance. Example: You cannot give an intelligence test to those who want a job as a road worker.

Cross-Cultural Fairness (3 of 7). Americans with Disabilities Act: accommodations must be made for individuals taking tests for employment, and the tests must be shown to be relevant to the job in question. Family Educational Rights and Privacy Act (FERPA): the right to access school records, including test records; parents have the right to their child’s records.

Cross-Cultural Fairness (4 of 7). Carl Perkins Act: individuals with a disability have the right to vocational assessment, counseling, and placement. Civil Rights Acts: a series of laws concerned with tests used in employment and promotion.

Cross-Cultural Fairness (5 of 7). Freedom of Information Act: assures access to federal records, including test records; most states have expanded this law so that it also applies to state records. IDEA and PL 94-142: assure the right of students (ages 3–21) suspected of having a learning disability to be tested at the school’s expense; child study teams and IEPs are set up when necessary.

Cross-Cultural Fairness (6 of 7) Section 504 of the Rehabilitation Act: Relative to assessment, any instrument used to measure appropriateness for a program or service must measure the individual’s ability, not be a reflection of his or her disability

Cross-Cultural Fairness (7 of 7). BOX 5.1: The Use of Intelligence Tests with Minorities: Confusion and Bedlam. The use of intelligence tests with culturally diverse populations has long been an area of controversy. Over the years, states have found intelligence tests biased and banned their use in certain circumstances with some groups (Gold, 1987; Swenson, 1997). One case in California in 1987 highlighted this controversy. Ms. Mary Amaya was concerned that her son was being recommended for remedial courses he did not need. Having had an older son who was found not to need such assistance only after he was tested, Ms. Amaya requested testing with an intelligence test for her other son. However, since the incident with her first son, California had decided that intelligence tests were culturally biased and thus banned their use for members of certain groups. Despite the fact that Ms. Amaya was requesting the use of an intelligence test, it was found that she had no legislative right to have the test given to her son. Although California subsequently reversed its ban, concerns about racial bias in testing continue today (Ortiz, Ochoa, & Dynda, 2012).

Disparities in Ability. Cognitive differences between people exist; however, they are clouded over by issues of SES, prejudice, stereotyping, etc. Are there real differences? Why do differences exist, and what can be done to eliminate them? Differences are often seen as environmental (e.g., the premise behind No Child Left Behind).

Practicality. Several practical concerns: time; cost; format (clarity of print, print size, sequencing of questions, and types of questions); readability; and ease of administration, scoring, and interpretation.

Selecting and Administering Tests. Five steps: (1) determine your client’s goals; (2) choose instruments to reach those goals; (3) access information about possible instruments; (4) examine validity, reliability, cross-cultural fairness, and practicality; and (5) make a wise choice.