About This Presentation
The primary concern in test development and use is demonstrating not only that test scores are reliable, but that the interpretations and uses we make of test scores are valid. In the previous chapter, we saw that the demonstration of reliability consists of estimating the amount of variation in language test scores that is due to measurement error. This estimation focuses on the effects of test method facets and random factors as sources of unreliability in test scores. If we demonstrate that test scores are reliable, we know that performance on the test is affected primarily by factors other than measurement error. In examining validity, we look beyond the reliability of the test scores themselves, and consider the relationships between test performance and other types of performance in other contexts. The types of performance and contexts we select for investigation will be determined by the uses or interpretations we wish to make of the test results. Furthermore, since the uses we make of test scores inevitably involve value judgments, and have both educational and societal consequences, we must also carefully examine the value systems that justify a given use of test scores.

It has been traditional to classify validity into different types, such as content, criterion, and construct validity. However, measurement specialists have come to view these as aspects of a unitary concept of validity that subsumes all of them. Messick (1989), for example, describes validity as ‘an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores’ (p. 13). This unitary view of validity has also been clearly endorsed by the measurement profession as a whole in the most recent revision of the Standards for Educational and Psychological Testing:

Validity … is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the scores. The inferences regarding specific uses of a test are validated, not the test itself. (American Psychological Association 1985: 9)

We will still find it necessary to gather information about content relevance, predictive utility, and concurrent criterion relatedness in the process of developing a given test. However, it is important to recognize that none of these by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores. And while the relative emphasis of the different kinds of evidence may vary from one test use to another, it is only through the collection and interpretation of all relevant types of information that validity can be demonstrated.

The examination of validity has also traditionally focused on the types of evidence that need to be gathered to support a particular meaning or use. Given the significant role that testing now plays in infl…
Slide Content
1
Reliability and Validity
What is reliability?
A quality of test scores which refers to the
consistency of measures across different times,
test forms, raters, and other characteristics of
the measurement context. Synonyms for
reliability are: dependability, stability,
consistency, predictability and accuracy.
2
For instance, a reliable man is a man whose behaviour is consistent, dependable and predictable, i.e. what he will do tomorrow and next week will be consistent with what he does today and what he did last week.
- In language testing, for example, if a student receives a low score on a test one day and a high score on the same test two days later (the test does not yield consistent results), the scores cannot be considered reliable indicators of the individual’s ability.
- If two raters give widely different ratings to the same sample, we say that the ratings are not reliable.
3
- The notion of reliability has to do with accuracy of measurement. This kind of accuracy is reflected in obtaining similar results when measurement is repeated on different occasions, with different instruments, or by different persons.
- According to Henning (1987), reliability is a measure of the accuracy, consistency, dependability, or fairness of scores resulting from the administration of a particular examination, e.g. 75% on a test today and 83% tomorrow indicates a problem with reliability.
4
Sources of Variance
Potential sources of score variance:
1. Those creating variance related to the purpose of the test: meaningful variance
Meaningful variance on a test is defined as the variance that is directly attributable to the testing purpose.
2. Those generating variance due to other extraneous sources: measurement error
Measurement error is a term that describes the variance in scores on a test that is not directly related to the purpose of the test.
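In classical test theory terms (a standard formalization, not given on the slide), these two sources are additive: an observed score X is the sum of a true score T (meaningful variance) and an error component E (measurement error), so that
X = T + E and reliability = Var(T) / Var(X) = Var(T) / (Var(T) + Var(E)).
The smaller the error variance relative to the meaningful variance, the closer reliability comes to 1.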
5
Potential sources of measurement error
1. Environment (location, space, noise, etc.)
2. Administration procedures (directions, timing, equipment, etc.)
3. Scoring procedures (errors in scoring, subjectivity, etc.)
4. Test and test items (item type, item quality, item security, etc.)
5. Examinees (health, motivation, memory, forgetfulness, guessing, etc.)
6
Test
quality of items
number of items
difficulty level of items
level of item discrimination
type of test methods
number of test methods
time allowed
clarity of instructions
use of the test
selection of content
sampling of content
invalid constructs
7
Test taker
familiarity with test method
attitude towards the test i.e. interest,
motivation, emotional/mental state
degree of guessing employed
level of ability
8
Test administration
consistency of administration procedure
degree of interaction between invigilators and
test takers
time of day the test is administered
clarity of instructions
test environment – light / heat / noise /
space / layout of room
quality of equipment used e.g. for listening
tests
9
Scoring
accuracy of the key e.g. does it include all
possible alternatives?
inter-rater reliability e.g. in writing,
speaking
intra-rater reliability e.g. in writing,
speaking
machine vs. human
10
In order to maximize reliability we should try to minimize measurement error. For example, we can think of factors such as health, lack of interest or motivation, and test-wiseness that can affect individuals’ test performance, but which are not generally associated with language ability and are thus not characteristics we want to measure with a language test. Test method facets are another source of error. When we minimize the effect of these various factors, we minimize measurement error and maximize reliability.
12
The investigation of reliability is concerned with
the question:
“How much of an individual’s test performance
is due to measurement error or factors other
than the language ability we want to
measure?” and with minimizing the effects of these
factors on test scores.
13
Unreliability
Inconsistencies in the results of measurement
stemming from factors other than the abilities
we want to measure. These are called
random factors. Such factors would result in
measurement errors which in turn can
introduce fluctuations in the observed scores
and thus reduce reliability.
14
Systematic factors: factors, such as test method facets, which are uniform from one test administration to the next. Attributes of individuals that are not related to language ability, including individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex and ethnic background, are also categorized as systematic factors. Two different effects are associated with systematic error:
1. General effect: the effect of systematic error which is consistent for all observations; it affects the scores of all individuals who take the test.
2. Specific effect: the effect which varies across individuals; it affects different individuals differently.
Test wiseness
A test taker’s capacity to utilize the characteristics and formats of the test and/or the test-taking situation to guess the correct answer and hence receive a high score. It includes a variety of general strategies such as consciously pacing one’s time, reading questions before the passages upon which they are based, and ruling out as many alternatives as possible in multiple-choice items and then guessing among the ones remaining.
15
Test wiseness can be divided into two basic parts:
1. Independent elements: these elements are independent of the test constructor. They are the kinds of elements often included in books that help students prepare for standardized tests, and relate to reading directions, use of time, guessing, etc.
2. Dependent elements: these elements refer to cues in the stem or in the distracters that need to be reasoned out by the test taker.
16
Test method facet
The specific characteristics of test methods which constitute the “how” of language testing. Test performance is thus affected by these “facets”, or characteristics, of the test method. For example, some test takers do better on an oral interview test; while one person might perform well on a multiple-choice test of reading, another may find such tasks very difficult. Test method facets are systematic to the extent that they are uniform from one test administration to the next.
17
Test method facets include:
1. Testing environment
2. Test rubric (characteristics that specify how test takers are expected to proceed in taking the test)
3. Testing organization (the collection of parts in a test, which may be individual items or subtests)
4. Time allocation
5. Instructions
18
Factors which may contribute to unreliability
1. Fluctuations in the learner
a. Changes in the learner
b. Temporary psychological or physical changes
2. Fluctuations in scoring
a. Intra-rater variance
b. Inter-rater variance
3. Fluctuations in test administration
a. Regulatory fluctuations
b. Fluctuations in the administrative environment
19
4. The behavior of test characteristics
a. Test length
b. Difficulty level and boundary effect
c. Discriminability
d. Speededness
e. Homogeneity
5. Error associated with response characteristics
a. Response arbitrariness
b. Wiseness and familiarity responses
20
21
How can we measure reliability?
Test-retest
same test administered to the same test
takers following an interval of no more than 2
weeks
Inter-rater reliability
two or more independent estimates of performance on the same test, e.g. written scripts marked by two raters independently and the results compared
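Both estimates come down to correlating two sets of scores from the same test takers. A minimal sketch in Python (the score values are invented purely for illustration; they are not from the slides):

```python
import numpy as np

# Hypothetical scores for the same ten test takers on two administrations
admin_1 = np.array([52, 61, 70, 45, 88, 63, 77, 59, 91, 66])  # first sitting
admin_2 = np.array([55, 60, 72, 48, 85, 65, 75, 62, 89, 64])  # same test, two weeks later

# Hypothetical marks given by two raters to the same ten written scripts
rater_a = np.array([4, 3, 5, 2, 4, 3, 5, 4, 2, 3])
rater_b = np.array([4, 4, 5, 2, 3, 3, 5, 4, 3, 3])

# Test-retest reliability: correlation between the two administrations
test_retest = np.corrcoef(admin_1, admin_2)[0, 1]

# Inter-rater reliability: correlation between the two raters' marks
inter_rater = np.corrcoef(rater_a, rater_b)[0, 1]

print(f"test-retest r = {test_retest:.2f}, inter-rater r = {inter_rater:.2f}")
```

In both cases a coefficient close to 1 indicates consistent measurement; low or negative values signal a reliability problem.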
23
Split-half reliability
the test administered to a group of test takers is divided into halves, and scores on each half are correlated with scores on the other half
the resulting coefficient is then adjusted by the Spearman-Brown prophecy formula to allow for the fact that the total score is based on an instrument that is twice as long as its halves
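The adjustment itself is a one-line formula. A minimal sketch (the half-test correlation of 0.70 is an invented example value):

```python
def spearman_brown(r_half: float) -> float:
    """Spearman-Brown prophecy formula for a test twice as long as each half."""
    return (2 * r_half) / (1 + r_half)

# e.g. a correlation of 0.70 between the two halves
print(round(spearman_brown(0.70), 2))  # 0.82, the estimated full-test reliability
```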
24
Cronbach's alpha [KR-20]
this approach looks at how test takers perform on each individual item and then compares that performance against their performance on the test as a whole
the resulting coefficient is normally interpreted on a 0 to +1 scale, with values closer to +1 indicating greater internal consistency
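A sketch of the computation using Cronbach's alpha formula, applied to a small invented matrix of right/wrong (1/0) item scores, where it coincides with KR-20:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a score matrix (rows = test takers, columns = items)."""
    k = scores.shape[1]                                # number of items
    item_variances = scores.var(axis=0, ddof=1)        # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Invented 1/0 responses for six test takers on five items
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0],
])
print(round(cronbach_alpha(responses), 2))  # about 0.68 for this invented sample
```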
25
Reliability is influenced by …
the longer the test, the more reliable it is likely to be [though there is a point of no extra return; see the worked example after this list]
items which discriminate will add to reliability; therefore, if the items are too easy or too difficult, reliability is likely to be lower
if there is a wide range of abilities amongst the test takers, the test is likely to have higher reliability
the more homogeneous the items are, the higher the reliability is likely to be
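The “point of no extra return” follows from the generalized Spearman-Brown formula (a standard result, not spelled out on the slide): if a test with reliability r is lengthened by a factor k with comparable items, the predicted reliability is kr / (1 + (k − 1)r). Starting from r = 0.60, doubling the test gives 2(0.60)/(1 + 0.60) = 0.75, quadrupling it gives 4(0.60)/(1 + 3 × 0.60) ≈ 0.86, and an eightfold increase still only reaches about 0.92, so each additional doubling buys less and less reliability.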
Validity
The extent to which the inferences or decisions we make on the basis of test scores are meaningful, appropriate, and useful. In other words, a test is said to be valid to the extent that it measures what it is supposed to measure or can be used for the purposes for which it is intended.
The matter of concern in testing is to ensure that any test employed is valid for the purpose for which it is administered. Validity tells us what can be inferred from test scores. Validity is a quality of test interpretation and use. If test scores are affected by abilities other than the one we want to measure, they will not be meaningful indicators of that particular ability.
26
An example of validity:
If we ask students to listen to a lecture and to write a short
essay based on that lecture, the essays they write will be
affected by both their writing ability and their ability to
comprehend the lecture. Ratings of their essays,
therefore, might not be valid measures of their writing
ability.
It is important for test developers and test users to realize
that test validation is an ongoing process and that the
interpretations we make of test scores can never be
considered absolutely valid.
27
For most kinds of validity, reliability is a
necessary but not sufficient condition.
Some types of validity:
1. Content validity
2. Construct validity
3. Consequential validity
4. Criterion-related validity
28
Content validity?
A form of validity which is based on the degree to which a test adequately and sufficiently measures the particular skills or behavior it sets out to measure. For example, a test of pronunciation skills in a language would have low content validity if it tests only some of the skills required for accurate pronunciation, such as a test which tested the ability to pronounce isolated sounds but not stress, intonation, or the pronunciation of sounds within words.
29
The test would have content validity only if it included a proper sample of the relevant structures. Just what the relevant structures are will depend, of course, upon the purpose of the test. In order to judge whether or not a test has content validity, we need a specification of the skills, structures, etc. that it is meant to cover.
30
Construct validity?
A form of validity which is based on the degree to which
the items in a test reflect the essential aspects of the
theory on which the test is based.
A test is said to have construct validity if it can be
demonstrated that it measures just the ability it is
supposed to measure. For example, if the assumption
is held that systematic language habits are best
acquired at the elementary level by means of the
structural approach, then a test which emphasizes the
communication aspects of the language will have low
construct validity.
31