Validity in psychological testing


Slide Content

Reliability
Test reliability refers to the degree to which a test is consistent and stable in measuring what it is intended to measure. Most simply put, a test is reliable if it is consistent within itself and across time.
To understand the basics of test reliability, think of a bathroom scale that gave you drastically different readings every time you stepped on it, regardless of whether you had gained or lost weight. If such a scale existed, it would be considered not reliable.
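
The scale analogy can be made concrete. One common way to quantify consistency across time is to correlate scores from two administrations of the same test (test-retest reliability). A minimal sketch in Python; the scores and the helper function are invented for illustration, not taken from the slides:

```python
# Test-retest reliability: correlate scores from two administrations
# of the same test. Illustrative data; a real study needs a proper sample.

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

time1 = [12, 15, 9, 20, 17, 11, 14]   # scores at first administration
time2 = [13, 14, 10, 19, 18, 10, 15]  # same people, two weeks later

print(f"test-retest r = {pearson_r(time1, time2):.2f}")
# A stable test yields a high r; a "bathroom scale" that gave drastically
# different readings each time would yield a low r.
```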

Validity
Test validity refers to the degree to which the test actually measures what it claims to measure. Test validity is also the extent to which inferences, conclusions, and decisions made on the basis of test scores are appropriate and meaningful.

The Relationship of Reliability and Validity
Test validity is requisite to test reliability. If a test is not valid, then reliability is moot. In other words, if a test is not valid there is no point in discussing reliability, because test validity is required before reliability can be considered in any meaningful way. Likewise, if a test is not reliable it is also not valid.

Classical models divided the concept into various "validities," such as:
content validity
criterion validity
construct validity

The modern view is that validity is a single unitary construct.

Cronbach and Meehl's subsequent publication grouped predictive and concurrent validity into a "criterion-orientation," which eventually became criterion validity.

A single interpretation of any test may require several
propositions to be true (or may be questioned by any one
of a set of threats to its validity). Strong evidence in
support of a single proposition does not lessen the
requirement to support the other propositions.
Evidence to support (or question) the validity of an
interpretation can be categorized into one of five
categories:
Evidence based on test content
Evidence based on response processes
Evidence based on internal structure
Evidence based on relations to other variables
Evidence based on consequences of testing

1995: Samuel Messick's article described validity as a single construct composed of six "aspects." In his view, various inferences made from test scores may require different types of evidence, but not different validities.

In science and statistics, validity has no single agreed definition but generally refers to the extent to which a concept, conclusion or measurement is well-founded and corresponds accurately to the real world. The word "valid" is derived from the Latin validus, meaning strong. Validity of a measurement tool (i.e., a test in education) is considered to be the degree to which the tool measures what it claims to measure.
In psychometrics, validity has a particular application known as test validity: "the degree to which evidence and theory support the interpretations of test scores" ("as entailed by proposed uses of tests").[1]
In the area of scientific research design and experimentation, validity refers to whether a study is able to scientifically answer the questions it is intended to answer.
In clinical fields, the validity of a diagnosis and associated diagnostic tests may be assessed.

Construct validity
Convergent validity
Convergent validity refers to the degree to which a measure is correlated with other measures that it is theoretically predicted to correlate with.
Discriminant validity
Discriminant validity describes the degree to which the operationalization does not correlate with other operationalizations that it theoretically should not be correlated with.
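
To make the convergent/discriminant distinction concrete, here is a minimal sketch. All data are invented: a hypothetical new anxiety scale should correlate strongly with an established anxiety scale (convergent) and weakly with a theoretically unrelated trait such as vocabulary (discriminant).

```python
# Convergent vs. discriminant validity checks via simple correlations.
import numpy as np

new_anxiety = np.array([10, 14, 8, 17, 12, 15, 9, 13])    # hypothetical new scale
old_anxiety = np.array([11, 15, 7, 18, 11, 14, 10, 12])   # established anxiety scale
vocabulary  = np.array([30, 22, 28, 25, 31, 24, 27, 26])  # theoretically unrelated trait

r_convergent   = np.corrcoef(new_anxiety, old_anxiety)[0, 1]
r_discriminant = np.corrcoef(new_anxiety, vocabulary)[0, 1]

print(f"convergent r   = {r_convergent:.2f}  (expect high)")
print(f"discriminant r = {r_discriminant:.2f}  (expect near zero)")
```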

Content validity
Content validity is a non-statistical type of validity that involves "the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured" (Anastasi & Urbina, 1997, p. 114). For example, does an IQ questionnaire have items covering all areas of intelligence discussed in the scientific literature?

Content validity evidence involves the degree to which the content of the
test matches a content domain associated with the construct. For example, a
test of the ability to add two numbers should include a range of
combinations of digits. A test with only one-digit numbers, or only even
numbers, would not have good coverage of the content domain. Content-related
evidence typically involves subject matter experts (SMEs) evaluating
test items against the test specifications.
A test has content validity built into it by careful selection of which items to
include (Anastasi & Urbina, 1997). Items are chosen so that they comply with
the test specification which is drawn up through a thorough examination of
the subject domain.
Foxcraft et al. (2004, p. 49) note that by using a panel of experts to review
the test specifications and the selection of items the content validity of a test
can be improved. The experts will be able to review the items and comment
on whether the items cover a representative sample of the behaviour
domain.
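
One common way to turn such expert-panel judgments into a number is Lawshe's content validity ratio (CVR). It is not named in the text above, but it fits the panel procedure the text describes; the panel ratings below are hypothetical.

```python
# Lawshe's content validity ratio: CVR = (n_e - N/2) / (N/2), where n_e
# is the number of panelists rating an item "essential" and N is the
# panel size. CVR ranges from -1 (none say essential) to +1 (all do).

def cvr(n_essential: int, n_panelists: int) -> float:
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical 10-expert panel; counts of "essential" ratings per item.
essential_counts = {"item_1": 9, "item_2": 5, "item_3": 10, "item_4": 3}
for item, n_e in essential_counts.items():
    print(f"{item}: CVR = {cvr(n_e, 10):+.1f}")
# Items with low or negative CVR are candidates for revision or removal.
```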

Representation validity
Representation validity, also known as translation validity, is about the extent to which an abstract theoretical construct can be turned into a specific practical test.

Face validity
Face validity is an estimate of whether a test appears to measure a certain criterion; it does not guarantee that the test actually measures phenomena in that domain. Indeed, when a test is subject to faking (malingering), low face validity might make the test more valid.
Face validity is very closely related to content validity. While content validity depends on a theoretical basis for assuming whether a test is assessing all domains of a certain criterion (e.g., does assessing addition skills yield a good measure of mathematical skills? To answer this you have to know what different kinds of arithmetic skills mathematical skills include), face validity relates to whether a test appears to be a good measure or not. This judgment is made on the "face" of the test, and thus it can also be made by an amateur.
Face validity is a starting point, but should never be assumed to be provably valid for any given purpose; experts have been wrong before. The Malleus Maleficarum (Hammer of Witches) had no support for its conclusions other than the self-imagined competence of two "experts" in "witchcraft detection," yet it was used as a "test" to condemn and burn at the stake perhaps 100,000 women as "witches."

Criterion validity
Criterion validity evidence involves the correlation between the
test and a criterion variable (or variables) taken as representative
of the construct. In other words, it compares the test with other
measures or outcomes (the criteria) already held to be valid. For
example, employee selection tests are often validated against
measures of job performance (the criterion), and IQ tests are
often validated against measures of academic performance (the
criterion).
If the test data and criterion data are collected at the same time,
this is referred to as concurrent validity evidence. If the test data
are collected first in order to predict criterion data collected at a
later point in time, then this is referred to as predictive validity
evidence.

Concurrent validity
Concurrent validity refers to the degree to which the operationalization
correlates with other measures of the same construct that are
measured at the same time. Returning to the selection test example,
this would mean that the tests are administered to current employees
and then correlated with their scores on performance reviews.
Predictive validity
Predictive validity refers to the degree to which the operationalization
can predict (or correlate with) other measures of the same construct
that are measured at some time in the future. Again, with the selection
test example, this would mean that the tests are administered to
applicants, all applicants are hired, their performance is reviewed at a
later time, and then their scores on the two measures are correlated.
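
The selection-test example can be expressed directly as code. A minimal sketch of concurrent validation, with invented scores; predictive validation differs only in when the criterion data are collected:

```python
# Concurrent validation: correlate current employees' selection-test
# scores with their performance reviews. All numbers are invented.
import numpy as np

test_scores  = np.array([55, 72, 63, 80, 58, 90, 67, 75])          # selection test
perf_ratings = np.array([3.1, 4.0, 3.4, 4.5, 2.9, 4.8, 3.6, 4.1])  # reviews

validity_coeff = np.corrcoef(test_scores, perf_ratings)[0, 1]
print(f"concurrent validity coefficient = {validity_coeff:.2f}")
# For predictive validity the same correlation is computed, but the test
# is given to applicants first and the criterion is measured later.
```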

Diagnostic validity
In clinical fields such as medicine, the validity of a diagnosis, and associated diagnostic tests or screening tests, may be assessed.
In regard to tests, the validity issues may be examined in the same way as for psychometric tests as outlined above, but there are often particular applications and priorities. In laboratory work, the medical validity of a scientific finding has been defined as the "degree of achieving the objective," namely of answering the question which the physician asks.[2]
An important requirement in clinical diagnosis and testing is sensitivity and specificity: a test needs to be sensitive enough to detect the relevant problem if it is present (and therefore avoid too many false negative results), but specific enough not to respond to other things (and therefore avoid too many false positive results).[3]
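
Sensitivity and specificity follow directly from a 2x2 table of test results against true status. A minimal sketch with invented counts:

```python
# Sensitivity and specificity from a 2x2 diagnostic table.
true_pos, false_neg = 45, 5    # diseased: detected vs. missed
true_neg, false_pos = 90, 10   # healthy: correctly cleared vs. flagged

sensitivity = true_pos / (true_pos + false_neg)  # limits false negatives
specificity = true_neg / (true_neg + false_pos)  # limits false positives

print(f"sensitivity = {sensitivity:.2f}")  # 0.90 here
print(f"specificity = {specificity:.2f}")  # 0.90 here
```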

In psychiatry there is a particular issue with assessing the validity of the diagnostic categories themselves. In this context:[4]
•content validity may refer to symptoms and diagnostic criteria;
•concurrent validity may be defined by various correlates or markers, and perhaps also treatment response;
•predictive validity may refer mainly to diagnostic stability over time;
•discriminant validity may involve delimitation from other disorders.

These were incorporated into the Feighner Criteria and Research Diagnostic Criteria that have since formed the basis of the DSM and ICD classification systems.

Kendler in 1980 distinguished between:[4]
•antecedent validators (familial aggregation, premorbid
personality, and precipitating factors)
•concurrent validators (including psychological tests)
•predictive validators (diagnostic consistency over time, rates
of relapse and recovery, and response to treatment)

Nancy Andreasen (1995) listed several additional validators (molecular genetics and molecular biology, neurochemistry, neuroanatomy, neurophysiology, and cognitive neuroscience) that are all potentially capable of linking symptoms and diagnoses to their neural substrates.[4]
Kendell and Jablensky (2003) emphasized the importance of distinguishing between validity and utility, and argued that diagnostic categories defined by their syndromes should be regarded as valid only if they have been shown to be discrete entities with natural boundaries that separate them from other disorders.[4]

Robins and Guze proposed in 1970 what were to become influential formal criteria for establishing the validity of psychiatric diagnoses. They listed five criteria:[4]
•1) distinct clinical description (including symptom profiles, demographic
characteristics, and typical precipitants)
•2) laboratory studies (including psychological tests, radiology and
postmortem findings)
•3) delimitation from other disorders (by means of exclusion criteria)
•4) follow-up studies showing a characteristic course (including evidence of
diagnostic stability)
•5) family studies showing familial clustering

Kendler (2006) emphasized that to be useful, a validating criterion must be sensitive enough to validate most syndromes that are true disorders, while also being specific enough to invalidate most syndromes that are not true disorders. On this basis, he argued that a Robins and Guze criterion of "runs in the family" is inadequately specific, because most human psychological and physical traits would qualify: for example, an arbitrary syndrome comprising a mixture of "height over 6 ft, red hair, and a large nose" will be found to "run in families" and be "hereditary," but this should not be considered evidence that it is a disorder. Kendler has further suggested that "essentialist" gene models of psychiatric disorders, and the hope that we will be able to validate categorical psychiatric diagnoses by "carving nature at its joints" solely as a result of gene discovery, are implausible.[5]

Questions To Ask When Evaluating Tests

TEST COVERAGE AND USE
There must be a clear statement of recommended uses and a
description of the population for which the test is intended.
The principal question to ask when evaluating a test is whether it
is appropriate for your intended purposes as well as your students.
The use intended by the test developer must be justified by the
publisher on technical grounds. You then need to evaluate your
intended use against the publisher's intended use. Questions to
ask:
1. What are the intended uses of the test? What interpretations
does the publisher feel are appropriate? Are inappropriate
applications identified?
2. Who is the test designed for? What is the basis for considering
whether the test applies to your students?

APPROPRIATE SAMPLES FOR TEST VALIDATION AND
NORMING
The samples used for test validation and norming must be of adequate
size and must be sufficiently representative to substantiate validity
statements, to establish appropriate norms, and to support conclusions
regarding the use of the instrument for the intended purpose.
The individuals in the norming and validation samples should
represent the group for which the test is intended in terms of age,
experience and background. Questions to ask:
1. How were the samples used in pilot testing, validation and norming
chosen? How is this sample related to your student population? Were
participation rates appropriate?
2. Was the sample size large enough to develop stable estimates with
minimal fluctuation due to sampling errors? Where statements are
made concerning subgroups, are there enough test-takers in each
subgroup?
3. Do the difficulty levels of the test and criterion measures (if any)
provide an adequate basis for validating and norming the instrument?
Are there sufficient variations in test scores?

RELIABILITY
The test is sufficiently reliable to permit stable estimates of the ability levels
of individuals in the target group.
Fundamental to the evaluation of any instrument is the degree to which
test scores are free from measurement error and are consistent from one
occasion to another when the test is used with the target group. Sources of
measurement error, which include fatigue, nervousness, content sampling,
answering mistakes, misinterpreting instructions and guessing, contribute
to an individual's score and lower a test's reliability.
Different types of reliability estimates should be used to estimate the
contributions of different sources of measurement error. Inter-rater
reliability coefficients provide estimates of errors due to inconsistencies in
judgment between raters. Alternate-form reliability coefficients provide
estimates of the extent to which individuals can be expected to rank the
same on alternate forms of a test. Of primary interest are estimates of
internal consistency which account for error due to content sampling,
usually the largest single component of measurement error.
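
The most widely used internal-consistency estimate is Cronbach's alpha, which the passage above alludes to without naming. A minimal sketch with an invented score matrix:

```python
# Cronbach's alpha from a persons-by-items score matrix.
# Rows = test-takers, columns = items; the data are invented.
import numpy as np

scores = np.array([
    [3, 4, 3, 5],
    [2, 2, 3, 2],
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 4, 5, 5],
])

k = scores.shape[1]                         # number of items
item_vars = scores.var(axis=0, ddof=1)      # variance of each item
total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")  # ~0.90 for this matrix
```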

Questions to ask:
1. How have reliability estimates been computed?
Have appropriate statistical methods been used?
(e.g., split-half reliability coefficients should not be
used with speeded tests, as they will produce
artificially high estimates.)
2. What are the reliabilities of the test for different
groups of test-takers? How were they computed?
3. Is the reliability sufficiently high to warrant
using the test as a basis for decisions concerning
individual students?
4. To what extent are the groups used to provide
reliability estimates similar to the groups the test
will be used with?

CRITERION VALIDITY
The test adequately predicts academic performance.
In terms of an achievement test, criterion validity refers to the extent
to which a test can be used to draw inferences regarding
achievement. Empirical evidence in support of criterion validity must
include a comparison of performance on the validated test against
performance on outside criteria. A variety of criterion measures are
available, such as grades, class rank, other tests and teacher ratings.
There are also several ways to demonstrate the relationship between
the test being validated and subsequent performance. In addition to
correlation coefficients, scatterplots, regression equations and
expectancy tables should be provided (a minimal sketch follows the
questions below). Questions to ask:
1. What criterion measure has been used to evaluate validity? What
is the rationale for choosing this measure?
2. Is the distribution of scores on the criterion measure adequate?
3. What is the overall predictive accuracy of the test? How accurate
are predictions for individuals whose scores are close to cut-points of
interest?
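
As promised above, a minimal sketch of relating test scores to a criterion. The regression line underlies expectancy tables: it converts a test score into a predicted criterion value. All data are invented:

```python
# Criterion validity: correlation plus a least-squares regression line
# for predicting the criterion (here, GPA) from test scores.
import numpy as np

test = np.array([40, 55, 60, 48, 70, 65, 52, 75])          # achievement test
gpa  = np.array([2.1, 2.8, 3.0, 2.5, 3.6, 3.2, 2.6, 3.8])  # criterion measure

slope, intercept = np.polyfit(test, gpa, 1)
r = np.corrcoef(test, gpa)[0, 1]

print(f"r = {r:.2f}; predicted GPA = {intercept:.2f} + {slope:.3f} * score")
print(f"predicted GPA at a score of 62: {intercept + slope * 62:.2f}")
# Prediction accuracy matters most near any cut-points used for decisions.
```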

CONTENT VALIDITY
Content validity refers to the extent to which the test
questions represent the skills in the specified subject area.
Content validity is often evaluated by examining the plan
and procedures used in test construction. Did the test
development procedure follow a rational approach that
ensures appropriate content? Did the process ensure that
the collection of items would represent appropriate skills?
Other questions to ask:
1. Is there a clear statement of the universe of skills
represented by the test? What research was conducted to
determine desired test content and/or evaluate content?
2. What was the composition of expert panels used in
content validation? How were judgments elicited?
3. How similar is this content to the content you are
interested in testing?

CONSTRUCT VALIDITY
The test measures the "right" psychological constructs.
Intelligence, self-esteem and creativity are examples of such
psychological traits. Evidence in support of construct validity
can take many forms. One approach is to demonstrate that the
items within a measure are inter-related and therefore measure
a single construct. Inter-item correlation and factor analysis are
often used to demonstrate relationships among the items (a minimal
sketch follows the questions below).
Another approach is to demonstrate that the test behaves as
one would expect a measure of the construct to behave. For
example, one might expect a measure of creativity to show a
greater correlation with a measure of artistic ability than with a
measure of scholastic achievement. Questions to ask:
1. Is the conceptual framework for each tested construct clear
and well founded? What is the basis for concluding that the
construct is related to the purposes of the test?
2. Does the framework provide a basis for testable hypotheses
concerning the construct? Are these hypotheses supported by
empirical data?
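
As noted above, a first quantitative look at construct validity is the inter-item correlation matrix; full factor analysis is the standard follow-up. A minimal sketch with an invented response matrix:

```python
# Inter-item correlations: items measuring one construct should
# inter-correlate. Rows = respondents, columns = items (invented data);
# items 1-3 are meant to tap the construct, item 4 is not.
import numpy as np

responses = np.array([
    [4, 5, 4, 1],
    [2, 2, 3, 5],
    [5, 4, 5, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 5],
])

corr = np.corrcoef(responses, rowvar=False)  # item-by-item correlations
print(np.round(corr, 2))
# High correlations among items 1-3, and low or negative correlations
# with item 4, suggest items 1-3 measure a single construct.
```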

TEST ADMINISTRATION
Detailed and clear instructions outline appropriate test
administration procedures.
Statements concerning test validity and the accuracy of the norms
can only generalize to testing situations which replicate the
conditions used to establish validity and obtain normative data.
Test administrators need detailed and clear instructions to
replicate these conditions.
All test administration specifications, including instructions to test
takers, time limits, use of reference materials and calculators,
lighting, equipment, seating, monitoring, room requirements,
testing sequence, and time of day, should be fully described.
Questions to ask:
1. Will test administrators understand precisely what is expected
of them?
2. Do the test administration procedures replicate the conditions
under which the test was validated and normed? Are these
procedures standardized?

TEST REPORTING
The methods used to report test results, including scaled scores,
subtests results and combined test results, are described fully
along with the rationale for each method.
Test results should be presented in a manner that will help
schools, teachers and students to make decisions that are
consistent with appropriate uses of the test. Help should be
available for interpreting and using the test results. Questions
to ask:
1. How are test results reported? Are the scales used in
reporting results conducive to proper test use?
2. What materials and resources are available to aid in
interpreting test results?

TEST AND ITEM BIAS
The test is not biased or offensive with regard to race, sex, native language,
ethnic origin, geographic region or other factors.
Test developers are expected to exhibit a sensitivity to the demographic
characteristics of test-takers. Steps can be taken during test development,
validation, standardization and documentation to minimize the influence of
cultural factors on individual test scores. These steps may include evaluating
items for offensiveness and cultural dependency, using statistics to identify
differential item difficulty, and examining the predictive validity for different
groups (a sketch of a differential-difficulty check follows the questions below).
Tests are not expected to yield equivalent mean scores across population
groups. Rather, tests should yield the same scores and predict the same
likelihood of success for individual test-takers of the same ability, regardless
of group membership. Questions to ask:
1. Were the items analyzed statistically for possible bias? What method(s) was
used? How were items selected for inclusion in the final version of the test?
2. Was the test analyzed for differential validity across groups? How was this
analysis conducted?
3. Was the test analyzed to determine the English language proficiency
required of test-takers? Should the test be used with non-native speakers of
English?
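
As flagged above, a minimal sketch of a differential-item-difficulty screen. Operational DIF analyses use Mantel-Haenszel or IRT-based methods and match test-takers on total score first; this illustration, with invented data, skips the matching step for brevity:

```python
# Crude differential item difficulty check: compare each item's
# proportion-correct between two groups. 1 = correct, 0 = incorrect.
import numpy as np

group_a = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1]])
group_b = np.array([[1, 0, 0], [1, 0, 1], [1, 0, 1], [0, 1, 1]])

p_a = group_a.mean(axis=0)  # proportion correct per item, group A
p_b = group_b.mean(axis=0)  # proportion correct per item, group B

for i, (pa, pb) in enumerate(zip(p_a, p_b), start=1):
    flag = "  <- review for possible bias" if abs(pa - pb) > 0.25 else ""
    print(f"item {i}: group A {pa:.2f}, group B {pb:.2f}{flag}")
```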